Generating Clean Blog URLs with Pelican

I'm a bit of a pedant when it comes to URL structure. I like my URLs clean and coherent, and I like to retain redirects for deep-links even when it doesn't really matter (like for this blog). So when I shifted to Pelican I spent quite a bit of time configuring Pelican and mod_rewrite in order to generate the URL structure I wanted.

There were several things I wanted to achieve for the URL structure of this blog:

  • Pages are grouped by year and month. Call me old fashioned but it just doesn't seem like a blog if this isn't the URL structure.
  • Pages do not have a .html or other file extension. I believe the URL should be an identifier; if there are different content types, they should be handled with content negotiation.
  • Groupings of pages end with a /. If the version without a / is accessed, the client is redirected to the version with the trailing /. This redirect is because I think there should only ever be one canonical URL for a piece of content.
  • index.html and other such files can exist on disk, but never in a URL. Any attempt to access such a file should be redirected to the directory URL. Again these rules mean that there is only one canonical URL.
  • URLs used by Blogger and other previous generations of the site have a redirect to the current canonical URL.

I previously wrote at length about the redirects for backwards compatibility with Blogger, so I'm not going to cover them again here.

Get Page Structure Right

The first thing I wanted to get right was the year and month grouping of my posts. To do this I set the well-documented Pelican configuration option ARTICLE_SAVE_AS in my pelicanconf.py:

ARTICLE_SAVE_AS = '{date:%Y}/{date:%m}/{slug}.html'

That setting means that each post will be at something like 2020/01/why-is-2020-like-this.html.
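The placeholders in that setting follow Python's format-string syntax, so the expansion can be tried out in isolation. The following is a minimal sketch; the expand helper and the example date are assumptions for illustration and are not part of Pelican's API.

```python
from datetime import datetime

# Same pattern as the ARTICLE_SAVE_AS setting above.
ARTICLE_SAVE_AS = '{date:%Y}/{date:%m}/{slug}.html'


def expand(pattern: str, date: datetime, slug: str) -> str:
    """Fill in a Pelican-style path pattern (hypothetical helper)."""
    # datetime objects accept strftime codes such as %Y and %m as a format spec.
    return pattern.format(date=date, slug=slug)


path = expand(ARTICLE_SAVE_AS, datetime(2020, 1, 15), 'why-is-2020-like-this')
print(path)  # 2020/01/why-is-2020-like-this.html
```

The day of the month is not part of the pattern, so every post from the same month lands in the same directory.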

I also wanted the YEAR/ and YEAR/MONTH/ URLs to return something. Nothing links directly to these, but I don't like URL segments that return 404 errors. So I configured Pelican to generate an archive for each year and month as the index.html in these directories:

MONTH_ARCHIVE_SAVE_AS = '{date:%Y}/{date:%m}/index.html'
YEAR_ARCHIVE_SAVE_AS = '{date:%Y}/index.html'

Remove HTML Extension from Blog Posts

The above settings gave me the structure I wanted. Next I wanted to remove the .html extension so that the URL identified the content not the format.

I could have achieved this by removing the .html extension from the ARTICLE_SAVE_AS setting. But I'd need to use another way of telling the web server to serve these with a text/html content type. Keeping that up to date felt fragile.

I decided it's best to keep the file on disk with the .html extension and just instruct the web server to serve it for the URL without the .html extension. This was easily done with mod_rewrite. In my .htaccess file I added:

RewriteRule "^([0-9]{4}/[0-9]{2}/.+)" $1.html [NE,END]

This means that the web server adds .html to anything looking like a blog URL before trying to find the file on disk. The only problem was that the generated links from Pelican all included the .html extension. I fixed that by setting ARTICLE_URL in the pelicanconf.py. That configures how the link to a post is generated (as opposed to the _SAVE_AS setting, which indicates where to create the actual blog page).

ARTICLE_URL = '{date:%Y}/{date:%m}/{slug}'

With that a URL like 2020/01/everything-is-burning is translated by the web server to the file 2020/01/everything-is-burning.html. All the links from other pages are to 2020/01/everything-is-burning.
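To make the mapping concrete, here is a rough Python sketch of what that rule does to a request path. It assumes the path has no leading slash, as is the case for rules in a .htaccess file, and the file_for_url helper is hypothetical. It only imitates the pattern match and substitution, not the rest of mod_rewrite.

```python
import re

# Python translation of the RewriteRule pattern above.
POST_PATTERN = re.compile(r"^([0-9]{4}/[0-9]{2}/.+)")


def file_for_url(url_path: str) -> str:
    """Return the on-disk path the rule would map url_path to (hypothetical helper)."""
    match = POST_PATTERN.match(url_path)
    if match is None:
        # The rule does not apply, so the path is left unchanged.
        return url_path
    # Equivalent of the "$1.html" substitution.
    return match.group(1) + ".html"


print(file_for_url("2020/01/everything-is-burning"))
```

Paths that do not start with a four-digit year and two-digit month, such as a hypothetical about page, fall through untouched.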

NB: Technically the ARTICLE_URL must be set any time the ARTICLE_SAVE_AS is set. If I had been happy with the .html extensions, I would have just set it to the same value as ARTICLE_SAVE_AS.

Redirect .html Extension to the Canonical URL

This did create a problem: when posts were accessed with the .html extension, another .html extension was added. This resulted in the web server being unable to find the file, so it returned a 404 Not Found error. That didn't feel clean, particularly as the old Blogger URLs included the .html extension.

I used mod_rewrite to define a redirect so the .html extension was permanently redirected to the canonical URL for the post.

RewriteRule "^([0-9]{4}/[0-9]{2}/[^/]+)\.html" $1 [NE,END,R=301,E=permacache:1]

This rule needed to be before the rule to add the .html extension so that anything ending with .html is redirected first. If this rule came later, the rule to add the .html extension would take precedence and this rule would never be run.
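The effect of rule order can be illustrated with a small Python model. This is only a sketch: the apply_rules helper is hypothetical, it stops at the first matching rule as a simplified stand-in for the END flag, and it ignores everything else mod_rewrite does.

```python
import re

# The two post rules as (pattern, replacement, is_redirect).
REDIRECT_HTML = (re.compile(r"^([0-9]{4}/[0-9]{2}/[^/]+)\.html"), r"\1", True)
ADD_HTML = (re.compile(r"^([0-9]{4}/[0-9]{2}/.+)"), r"\1.html", False)


def apply_rules(path, rules):
    """Apply the first matching rule and stop (hypothetical helper)."""
    for pattern, replacement, is_redirect in rules:
        match = pattern.match(path)
        if match:
            # The substitution replaces the whole path, as RewriteRule does.
            target = match.expand(replacement)
            return ("redirect" if is_redirect else "serve", target)
    return ("serve", path)


correct_order = [REDIRECT_HTML, ADD_HTML]
wrong_order = [ADD_HTML, REDIRECT_HTML]

print(apply_rules("2020/01/post.html", correct_order))  # ('redirect', '2020/01/post')
print(apply_rules("2020/01/post.html", wrong_order))    # ('serve', '2020/01/post.html.html')
```

With the rules swapped, the request for the .html URL never reaches the redirect and the server goes looking for a file with a doubled extension.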

Same Trick for Categories and Tags

Pelican organizes posts by category and tag. There is a page for each category, and each tag, that has a list of matching posts. By default these have a .html extension which I wanted to remove.

I used the same approach as for posts. I set the _SAVE_AS setting to produce a .html page and the _URL setting to ensure that any links to that page don't have the .html extension.

CATEGORY_SAVE_AS = 'categories/{slug}.html'
CATEGORY_URL = 'categories/{slug}'
TAG_SAVE_AS = 'tags/{slug}.html'
TAG_URL = 'tags/{slug}'

I added two more mod_rewrite rules. One to redirect any URLs with the .html extension to the canonical URL. Another to add the .html extension when looking for these pages on disk.

RewriteRule "^((tags|categories)/[^/]+)\.html$" $1 [NE,END,R=301,E=permacache:1]
RewriteRule "^((tags|categories)/.+)" $1.html [NE,END]

This gave me the structure I wanted:

  • /categories/Hacking had all the posts about hacking
  • /tags/linux had all the posts about Linux

I also wanted the categories/ and tags/ URLs to be meaningful. Pelican does generate lists of categories and tags so I just changed some more settings so these were generated as the index.html in the folders:

CATEGORIES_SAVE_AS = 'categories/index.html'
TAGS_SAVE_AS = 'tags/index.html'

Replace Direct Use of index.html with the Canonical Directory Path

At this point I had all the URLs set up the way I wanted:

  • Pages were grouped by year and month due to setting ARTICLE_SAVE_AS in my pelicanconf.py
  • Pages did not have a .html extension due to setting ARTICLE_URL, CATEGORY_URL, TAG_URL and judicious use of mod_rewrite
  • Groupings of pages ended with a / because they were directories and the web server handled the redirects. Setting MONTH_ARCHIVE_SAVE_AS, YEAR_ARCHIVE_SAVE_AS, CATEGORIES_SAVE_AS and TAGS_SAVE_AS to produce index.html files allowed the web server to load the right content for these directories.

There was still one problem. I wanted direct access of index.html to redirect to the directory URL. So I added one final mod_rewrite rule in my .htaccess:

RewriteRule "^(|.+/)index.html$" "$1" [NE,END,R=301,E=permacache:1]
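The pattern in that rule captures everything up to and including the final slash, or nothing at all for an index.html at the top level. A quick Python sketch of the pattern on its own, using a hypothetical canonical_directory helper and made-up paths, and ignoring how it interacts with the other rules:

```python
import re

# Python translation of the index.html pattern above.
INDEX_PATTERN = re.compile(r"^(|.+/)index.html$")


def canonical_directory(path):
    """Return the redirect target for an index.html path, or None (hypothetical helper)."""
    match = INDEX_PATTERN.match(path)
    # Group 1 is either empty (site root) or a directory path ending in "/".
    return match.group(1) if match else None


print(repr(canonical_directory("index.html")))           # ''
print(repr(canonical_directory("some/dir/index.html")))  # 'some/dir/'
print(repr(canonical_directory("myindex.html")))         # None
```

Because the captured group must be empty or end in a slash, a file that merely ends in index.html, like the last example, is left alone.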

Caching the Redirects

I used a lot of redirects because I wanted to send clients to the canonical URL, not just give them the content from the wrong URL. The downside of this approach is that it results in additional HTTP round trips. This can be at least partially mitigated by permanently caching the redirects.

The E=permacache:1 in all the above redirects sets the environment variable permacache. I use this later in the .htaccess file to set an Expires header on all of these redirects.

Header always set Expires "Wed, 1 Jan 2020 12:00:00 GMT" env=permacache

Did you spot the problem? All my caches stopped caching on Jan 1 2020. Whoops. A much better choice would have been to use the Cache-Control header instead and specify that resources can be cached "for a year after being accessed" e.g.

Header always set Cache-Control "public, max-age=31536000" env=permacache
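The max-age value is simply the number of seconds in a 365-day year. The sketch below checks that arithmetic and contrasts a fixed expiry date with a lifetime measured from when the response was fetched. The two freshness functions are hypothetical and greatly simplify how real HTTP caches decide freshness.

```python
from datetime import datetime, timedelta, timezone

# One year expressed in seconds, matching the max-age value above.
max_age = int(timedelta(days=365).total_seconds())
print(max_age)  # 31536000

# The fixed date from the original Expires header.
FIXED_EXPIRES = datetime(2020, 1, 1, 12, 0, 0, tzinfo=timezone.utc)


def is_fresh_with_expires(now):
    """Fresh only until the fixed date, no matter when it was fetched."""
    return now < FIXED_EXPIRES


def is_fresh_with_max_age(fetched_at, now):
    """Fresh for max_age seconds counted from the time of the fetch."""
    return now - fetched_at < timedelta(seconds=max_age)


fetched = datetime(2020, 6, 1, tzinfo=timezone.utc)
later = datetime(2020, 12, 1, tzinfo=timezone.utc)
print(is_fresh_with_expires(later))           # False
print(is_fresh_with_max_age(fetched, later))  # True
```

A redirect fetched in June 2020 is already stale under the fixed date, while the relative lifetime keeps it cacheable until the following June.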

Wrap Up

So with all of that combined, here is what the relevant parts of the files look like:

pelicanconf.py

# Create each blog post in directories grouped by year and month
ARTICLE_SAVE_AS = '{date:%Y}/{date:%m}/{slug}.html'
# Generate links to each blog post grouped by year and month but without the .html extension
ARTICLE_URL = '{date:%Y}/{date:%m}/{slug}'

# Create an index.html for each year and month directory listing all the posts
YEAR_ARCHIVE_SAVE_AS = '{date:%Y}/index.html'
MONTH_ARCHIVE_SAVE_AS = '{date:%Y}/{date:%m}/index.html'

# Create a listing of posts with each category or tag in the categories and tags folders (respectively)
CATEGORY_SAVE_AS = 'categories/{slug}.html'
TAG_SAVE_AS = 'tags/{slug}.html'

# Generate links to categories and tags following the same structure but without the .html extension
CATEGORY_URL = 'categories/{slug}'
TAG_URL = 'tags/{slug}'

# Generate an index.html each for the categories and tags folders listing all the categories and all the tags
CATEGORIES_SAVE_AS = 'categories/index.html'
TAGS_SAVE_AS = 'tags/index.html'

.htaccess

# Redirect any access of posts with the .html extension to the canonical URI (without the extension)
RewriteRule "^([0-9]{4}/[0-9]{2}/[^/]+)\.html" $1 [NE,END,R=301,E=permacache:1]
# Redirect any access of tags or categories with the .html extension to the canonical URI (without the extension)
RewriteRule "^((tags|categories)/[^/]+)\.html$" $1 [NE,END,R=301,E=permacache:1]
# Redirect any direct access of index.html to the canonical directory URL
RewriteRule "^(|.+/)index.html$" "$1" [NE,END,R=301,E=permacache:1]

# When trying to find the file to serve for a post, add the .html extension
RewriteRule "^([0-9]{4}/[0-9]{2}/.+)" $1.html [NE,END]
# When trying to find the file to serve for a category or tag, add the .html extension
RewriteRule "^((tags|categories)/.+)" $1.html [NE,END]

# Cache all redirects for 1 year
Header always set Cache-Control "public, max-age=31536000" env=permacache
