I'm a bit of a pedant when it comes to URL structure. I like my URLs clean and coherent, and I like to retain redirects
for deep-links even when it doesn't really matter (like for this blog). So when I shifted to Pelican
I spent quite a bit of time configuring Pelican and mod_rewrite
in order to generate the URL structure I wanted.
There were several things I wanted to achieve for the URL structure of this blog:
- Pages are grouped by year and month. Call me old-fashioned, but it just doesn't seem like a blog without this URL structure.
- Pages do not have a
.html
or other file extension. I believe the URL should be an identifier; if there are different content types, that
should be handled with content negotiation.
- Groupings of pages end with a
/
. If the version without a /
is accessed the client is redirected to the version with the trailing /
. This redirect is
because I think there should only ever be one canonical URL for a piece of content.
- index.html
and other such files can exist on disk, but never in a URL. Any attempt to access such a file should be redirected to
the directory URL. Again these rules mean that there is only one canonical URL.
- URLs used by Blogger and other previous generations of the site have a redirect to the current canonical URL.
I wrote a lot about the redirects for backwards compatibility with Blogger previously
so I'm not going to cover them again here.
Get Page Structure Right
The first thing I wanted to get right was the year and month grouping of my posts. To do this I set the well-documented Pelican configuration option
ARTICLE_SAVE_AS
in my pelicanconf.py
:
ARTICLE_SAVE_AS = '{date:%Y}/{date:%m}/{slug}.html'
That setting means that each post will be at something like 2020/01/why-is-2020-like-this.html
.
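As a rough illustration of how that pattern expands, the sketch below assumes Pelican fills in these settings with Python's str.format, passing the article's date and slug. The helper name expand and the date value are made up for the example; they are not part of Pelican.

```python
from datetime import datetime

# Same pattern as the ARTICLE_SAVE_AS setting above.
ARTICLE_SAVE_AS = '{date:%Y}/{date:%m}/{slug}.html'


def expand(pattern, date, slug):
    """Expand a Pelican-style pattern (hypothetical helper, not Pelican's API)."""
    # datetime objects accept strftime codes as their format spec,
    # so '{date:%Y}' becomes the four-digit year and '{date:%m}' the month.
    return pattern.format(date=date, slug=slug)


# The date is an assumed example value.
path = expand(ARTICLE_SAVE_AS, datetime(2020, 1, 15), 'why-is-2020-like-this')
print(path)  # 2020/01/why-is-2020-like-this.html
```

The month is zero-padded by %m, which is what keeps the directory names two digits wide.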
I also wanted the YEAR/
and YEAR/MONTH/
URLs to return something. Nothing links directly to these, but I don't like URL segments that return 404
errors. So I configured Pelican to generate an archive for each year and month as the index.html
in these directories.
MONTH_ARCHIVE_SAVE_AS = '{date:%Y}/{date:%m}/index.html'
YEAR_ARCHIVE_SAVE_AS = '{date:%Y}/index.html'
Remove HTML Extension from Blog Posts
The above settings gave me the structure I wanted. Next I wanted to remove the .html
extension so that the URL identified the content not the format.
I could have achieved this by removing the .html
extension from the ARTICLE_SAVE_AS
setting. But I'd need to use another way of telling the web server to
serve these with a text/html
content type. Keeping that up to date felt fragile.
I decided it was best to keep the file on disk with the .html
extension and just instruct the web server to serve it for the URL without the .html extension.
This was easily done with mod_rewrite
. In my .htaccess
file I added
RewriteRule "^([0-9]{4}/[0-9]{2}/.+)" $1.html [NE,END]
This means that the web server adds .html
to anything looking like a blog URL before trying to find the file on disk. The only problem was that the generated
links from Pelican all included the .html
extension. I fixed that by setting ARTICLE_URL
in the pelicanconf.py
. That configures how the link to a post
is generated (as opposed to the _SAVE_AS
setting, which indicates where to create the actual blog page).
ARTICLE_URL = '{date:%Y}/{date:%m}/{slug}'
With that a URL like 2020/01/everything-is-burning
is translated by the web server to the file 2020/01/everything-is-burning.html
. All the links from other
pages are to 2020/01/everything-is-burning
.
NB: Technically the ARTICLE_URL
must be set any time the ARTICLE_SAVE_AS
is set. If I had been happy with the .html
extension, I would have just set it to
the same value as ARTICLE_SAVE_AS.
Redirect .html Extension to the Canonical URL
This did create a problem: when posts were accessed with the .html
extension, another .html
extension was added. This resulted in the
web server being unable to find the file, so it returned a 404 Not Found
error. That didn't feel clean, particularly as the old Blogger URLs included
the .html
extension.
I used mod_rewrite
to define a redirect so the .html
extension was permanently redirected to the canonical URL for the post.
RewriteRule "^([0-9]{4}/[0-9]{2}/[^/]+)\.html" $1 [NE,END,R=301,E=permacache:1]
This rule needed to be before the rule to add the .html
extension so that anything ending with .html
is redirected first. If this rule came later, the rule to
add the .html
extension would take precedence and this rule would never run.
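The effect of rule order can be sketched with a small Python model. This is only a simplified stand-in for mod_rewrite: it assumes the first matching rule wins (as the END flag implies) and ignores everything else Apache does. The function apply_rules and the tuple layout are invented for the illustration.

```python
import re

# The two post rules from this section as (pattern, substitution, is_redirect).
# In .htaccess context the pattern is matched against the path without a leading slash.
REDIRECT_HTML = (re.compile(r'^([0-9]{4}/[0-9]{2}/[^/]+)\.html'), r'\1', True)
ADD_HTML = (re.compile(r'^([0-9]{4}/[0-9]{2}/.+)'), r'\1.html', False)


def apply_rules(path, rules):
    """Return (action, target) for the first matching rule (hypothetical model)."""
    for pattern, substitution, is_redirect in rules:
        match = pattern.search(path)
        if match:
            # Like RewriteRule, the substitution replaces the whole path.
            target = match.expand(substitution)
            return ('redirect 301' if is_redirect else 'serve file', target)
    return ('no rule matched', path)


good_order = [REDIRECT_HTML, ADD_HTML]
bad_order = [ADD_HTML, REDIRECT_HTML]

print(apply_rules('2020/01/everything-is-burning.html', good_order))
# ('redirect 301', '2020/01/everything-is-burning')
print(apply_rules('2020/01/everything-is-burning', good_order))
# ('serve file', '2020/01/everything-is-burning.html')
print(apply_rules('2020/01/everything-is-burning.html', bad_order))
# ('serve file', '2020/01/everything-is-burning.html.html')
```

With the rules swapped, the request with the extension never reaches the redirect and the model looks for a file ending in .html.html, which is the 404 described above.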
Same trick for Categories and Tags
Pelican organizes posts by category and tag. There is a page for each category, and each tag, that has a list of matching posts. By default these have a .html
extension which I wanted to remove.
I used the same approach as for posts. I set the _SAVE_AS
setting to produce a .html
page and the _URL
setting to ensure that any links to that page
don't have the .html
extension.
CATEGORY_SAVE_AS = 'categories/{slug}.html'
CATEGORY_URL = 'categories/{slug}'
TAG_SAVE_AS = 'tags/{slug}.html'
TAG_URL = 'tags/{slug}'
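Since the web server finds the file by appending .html to the URL, each _SAVE_AS value has to be exactly the matching _URL value plus .html. A small sanity check along these lines could catch a mismatch before publishing. The helper name mismatched_pairs is hypothetical and not part of Pelican.

```python
# The URL/SAVE_AS pairs used in this post's pelicanconf.py.
SETTINGS = {
    'ARTICLE_SAVE_AS': '{date:%Y}/{date:%m}/{slug}.html',
    'ARTICLE_URL': '{date:%Y}/{date:%m}/{slug}',
    'CATEGORY_SAVE_AS': 'categories/{slug}.html',
    'CATEGORY_URL': 'categories/{slug}',
    'TAG_SAVE_AS': 'tags/{slug}.html',
    'TAG_URL': 'tags/{slug}',
}


def mismatched_pairs(settings, kinds=('ARTICLE', 'CATEGORY', 'TAG')):
    """Return the kinds whose file on disk is not exactly the link plus '.html'."""
    return [
        kind for kind in kinds
        if settings[kind + '_SAVE_AS'] != settings[kind + '_URL'] + '.html'
    ]


print(mismatched_pairs(SETTINGS))  # [] means every pair lines up
```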
I added two more mod_rewrite
rules. One to redirect any URLs with the .html
extension to the canonical URL. Another to add the .html
extension when looking
for these pages on disk.
RewriteRule "^((tags|categories)/[^/]+)\.html$" $1 [NE,END,R=301,E=permacache:1]
RewriteRule "^((tags|categories)/.+)" $1.html [NE,END]
This gave me the structure I wanted:
- /categories/Hacking had all the posts about hacking
- /tags/linux had all the posts about Linux
I also wanted the categories/
and tags/
URLs to be meaningful. Pelican does generate lists of categories and tags so I just changed some more settings
so these were generated as the index.html
in the folders:
CATEGORIES_SAVE_AS = 'categories/index.html'
TAGS_SAVE_AS = 'tags/index.html'
Replace Direct Use of index.html with the Canonical Directory Path
At this point I had all the URLs set up the way I wanted:
- Pages were grouped by year and month due to setting
ARTICLE_SAVE_AS
in my pelicanconf.py
- Pages did not have a
.html
extension due to setting ARTICLE_URL
, CATEGORY_URL
, TAG_URL
and judicious use of mod_rewrite
- Groupings of pages ended with a
/
because they were directories and the web server handled the redirects. Setting MONTH_ARCHIVE_SAVE_AS
,
YEAR_ARCHIVE_SAVE_AS
, CATEGORIES_SAVE_AS
and TAGS_SAVE_AS
to produce index.html
files allowed the web server to load the right content
for these directories.
There was still one problem. I wanted direct access of index.html
to redirect to the directory URL. So I added one final mod_rewrite
rule in my .htaccess
:
RewriteRule "^(|.+/)index\.html$" "$1" [NE,END,R=301,E=permacache:1]
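The empty alternative in the capture group is what lets the rule cover the site root as well as nested directories. The sketch below checks that with Python's re module, using the same pattern with the dot before html escaped so it only matches a literal dot. The function name canonical_directory is made up for the example.

```python
import re

# Pattern from the index.html rule; group 1 is either empty (site root)
# or a directory path ending in a slash.
INDEX_RULE = re.compile(r'^(|.+/)index\.html$')


def canonical_directory(path):
    """Return the directory URL to redirect to, or None if the rule does not match."""
    match = INDEX_RULE.match(path)
    return match.group(1) if match else None


print(repr(canonical_directory('index.html')))          # ''
print(repr(canonical_directory('2020/01/index.html')))  # '2020/01/'
print(repr(canonical_directory('tags/index.html')))     # 'tags/'
print(repr(canonical_directory('reindex.html')))        # None
```

An empty result for the root is fine here: redirecting to the empty path relative to the site is the same as redirecting to the site root.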
Caching the Redirects
I used a lot of redirects because I wanted to send clients to the canonical URL, not just give them the content from the wrong URL. The downside of
this approach is that it results in additional HTTP round trips. This can be at least partially mitigated by permanently caching the redirects.
The E=permacache:1
in all the above redirects sets the environment variable permacache
. I use this later in the .htaccess
file to set an Expires header
on all of these redirects.
Header always set Expires "Wed, 1 Jan 2020 12:00:00 GMT" env=permacache
Did you spot the problem? All my caches stopped caching on Jan 1 2020. Whoops. A much better choice would have been to use
the Cache-Control
header instead and specify that resources can be cached "for a year after being accessed" e.g.
Header always set Cache-Control "public, max-age=31536000" env=permacache
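To make the difference concrete, here is a short sketch comparing the two approaches. The max-age value is 365 days expressed in seconds, and it is counted from when the response is received rather than from a fixed calendar date. The fetch date and the helper fresh_until are assumptions for the example.

```python
from datetime import datetime, timedelta, timezone

# 365 days in seconds: the max-age value used in the header above.
ONE_YEAR_SECONDS = 365 * 24 * 60 * 60  # 31536000


def fresh_until(response_time, max_age=ONE_YEAR_SECONDS):
    """Time until which a cache may reuse a response received at response_time."""
    return response_time + timedelta(seconds=max_age)


# The fixed date from the original Expires header.
fixed_expires = datetime(2020, 1, 1, 12, 0, tzinfo=timezone.utc)
# An assumed fetch some months after that date.
fetched = datetime(2020, 6, 1, tzinfo=timezone.utc)

print(fetched < fixed_expires)  # False: the fixed date has already passed
print(fresh_until(fetched))     # 2021-06-01 00:00:00+00:00
```

With the fixed Expires date the redirect is already stale on arrival, while max-age keeps it cacheable for a year from each fetch.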
Wrap Up
So with all of that combined, here is what the relevant parts of the files look like:
pelicanconf.py
# Create each blog post in directories grouped by year and month
ARTICLE_SAVE_AS = '{date:%Y}/{date:%m}/{slug}.html'
# Generate links to each blog post grouped by year and month but without the .html extension
ARTICLE_URL = '{date:%Y}/{date:%m}/{slug}'
# Create an index.html for each year and month directory listing all the posts
YEAR_ARCHIVE_SAVE_AS = '{date:%Y}/index.html'
MONTH_ARCHIVE_SAVE_AS = '{date:%Y}/{date:%m}/index.html'
# Create a listing of posts with each category or tag in the categories and tags folders (respectively)
CATEGORY_SAVE_AS = 'categories/{slug}.html'
TAG_SAVE_AS = 'tags/{slug}.html'
# Generate links to categories and tags following the same structure but without the .html extension
CATEGORY_URL = 'categories/{slug}'
TAG_URL = 'tags/{slug}'
# Generate an index.html each for the categories and tags folders listing all the categories and all the tags
CATEGORIES_SAVE_AS = 'categories/index.html'
TAGS_SAVE_AS = 'tags/index.html'
.htaccess
# Redirect any access of posts with the .html extension to the canonical URI (without the extension)
RewriteRule "^([0-9]{4}/[0-9]{2}/[^/]+)\.html" $1 [NE,END,R=301,E=permacache:1]
# Redirect any access of tags or categories with the .html extension to the canonical URI (without the extension)
RewriteRule "^((tags|categories)/[^/]+)\.html$" $1 [NE,END,R=301,E=permacache:1]
# Redirect any direct access of index.html to the canonical directory URL
RewriteRule "^(|.+/)index\.html$" "$1" [NE,END,R=301,E=permacache:1]
# When trying to find the file to serve for a post, add the .html extension
RewriteRule "^([0-9]{4}/[0-9]{2}/.+)" $1.html [NE,END]
# When trying to find the file to serve for a category or tag, add the .html extension
RewriteRule "^((tags|categories)/.+)" $1.html [NE,END]
# Cache all redirects for 1 year
Header always set Cache-Control "public, max-age=31536000" env=permacache