README for my website's build process

https://github.com/idupree/idupree-websitepy is the library I made for building my website. The source documents for my site (Markdown, HTML, etc.) are not on GitHub.¹ This library is targeted chiefly at me and people I know. If you want to use it yourself for some reason, the "Usage" section of this page might be enough to get started. You're more than welcome to chat with me for advice.

Raisons d'être

I wish to choose the exact HTTP headers, routes and caching behaviour. It's my personal, lifelong website, and those have important effects on privacy, performance and robustness.
Pandoc's Markdown is an amazing and actively-maintained authoring format.
Slightly too much NIH (not-invented-here) syndrome about whether I could have adapted other static site generators to my needs.

Features

HTTP headers

content-hash-based ETags.
X-Robots-Tag: noarchive, noindex for pages not reachable from /, with specifiable exceptions to include and exclude pages (doindexfrom and butdontindexfrom Config options).
Link: <...>; rel="canonical" for all files, since non-HTML files can't use a <link> element. Not super important, but reassuring.
far-future Cache-Control for resources.
other straightforward headers like X-Frame-Options: SAMEORIGIN.
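
As a rough illustration of the headers listed above, here is a minimal sketch. It is not the library's code: the function name response_headers, the choice of SHA-256 truncated to 32 hex digits for the ETag, and the one-year max-age are all assumptions made for the example.

```python
import hashlib


def response_headers(body: bytes, canonical_url: str, indexable: bool,
                     is_resource: bool) -> dict:
    """Assemble headers of the kinds listed above for one response."""
    # Strong ETag derived from the content, so it changes exactly when
    # the bytes do. (Hash algorithm and length are assumptions.)
    etag = '"' + hashlib.sha256(body).hexdigest()[:32] + '"'
    headers = {
        'ETag': etag,
        'Link': '<' + canonical_url + '>; rel="canonical"',
        'X-Frame-Options': 'SAMEORIGIN',
    }
    if not indexable:
        headers['X-Robots-Tag'] = 'noarchive, noindex'
    if is_resource:
        # Hypothetical far-future value (one year). It is only safe
        # because resource URLs change whenever their content changes.
        headers['Cache-Control'] = 'public, max-age=31536000'
    return headers
```

Whether a page counts as indexable would come from the reachability rules described above (doindexfrom and butdontindexfrom); the sketch just takes that decision as a boolean.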

Redirects

filename.3xx is a 3xx HTTP redirect to the route-relative path named in the file's contents. For example, document_root/dir/filename.301 containing bar is a 301 permanent redirect from /dir/filename to /dir/bar. The path may also start with a / or with a scheme such as http://, etc., to get a different kind of relative or absolute link.
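
The redirect rule can be sketched in a few lines. This is an illustration under assumptions, not the library's implementation: parse_redirect and REDIRECT_NAME are hypothetical names, and the standard library's urljoin stands in for whatever resolution the build actually performs.

```python
import re
from urllib.parse import urljoin

# A route ending in ".3xx", e.g. "/dir/filename.301".
REDIRECT_NAME = re.compile(r'^(?P<stem>.+)\.(?P<status>3\d\d)$')


def parse_redirect(route_of_file: str, contents: str):
    """Return (status, from_route, target) for a redirect file, else None.

    route_of_file is the file's path relative to the document root,
    with a leading slash, e.g. '/dir/filename.301'.
    """
    match = REDIRECT_NAME.match(route_of_file)
    if match is None:
        return None
    from_route = match.group('stem')
    # urljoin resolves 'bar', '/bar' and 'http://...' the way a browser
    # resolves a link found on the page at from_route.
    target = urljoin(from_route, contents.strip())
    return int(match.group('status')), from_route, target
```

With the example from the text, parse_redirect('/dir/filename.301', 'bar') gives a 301 from /dir/filename to /dir/bar.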

Pandoc Markdown

Markdown .md files are converted to HTML using pandoc markdown. If you use this, specify a pandoc template file with Config pandoc_template_relative_to_source_dir.

At the time I chose it, Pandoc's Markdown was one of the few markdown implementations that supported footnotes, a feature I use a lot in my writing. I am ready to change if a better implementation presents itself. Changing would require checking that a bunch of pages still look right. Of note, markdown syntax inside explicit <div>s is nonstandardly supported by pandoc and I use it. (Sadly, in one case this produces invalid HTML due to pandoc implying a <p> tag that shouldn't be there around a </nav>).

How to use resources (CSS, images, etc)

Appending ?rr to a relative link to a resource (JavaScript, CSS, images, etc) makes the build-system realize it's a link to a resource and move the resource to a unique path based on a hash of its contents, give it lengthy caching headers, possibly serve it through a CDN, etc. Do not use this for regular href links that link to a user-visible page that will go in the user's URL bar. ?rr works not just in HTML but also in files such as CSS because of url("...") and JS for dynamic uses of resources.

This syntax was chosen to be distinctive, short, and let the page still work if served directly from file:// or a simple HTTP server. "rr" stands for "resource reference".
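
To make the idea concrete, here is a simplified sketch of ?rr rewriting. It is not the code in resource_rewriting.py: the regexp is far cruder than the real URL regexps, the names RR_REFERENCE, hashed_name and rewrite_rr are hypothetical, and the output naming scheme (a 16-digit SHA-256 prefix before the extension) is an assumption. It also ignores directories and resources that refer to other resources.

```python
import hashlib
import posixpath
import re

# Simplified: a run of path characters followed by "?rr", where "?rr" is
# not itself followed by another URL-ish character.
RR_REFERENCE = re.compile(
    r'(?P<path>[A-Za-z0-9_./-]+)\?rr(?![A-Za-z0-9_=&-])')


def hashed_name(path: str, contents: bytes) -> str:
    """Insert a short content hash before the extension: a.css -> a.<hash>.css"""
    stem, ext = posixpath.splitext(path)
    digest = hashlib.sha256(contents).hexdigest()[:16]
    return stem + '.' + digest + ext


def rewrite_rr(text: str, resources: dict) -> str:
    """Replace every 'path?rr' in text with the hashed name of that resource.

    resources maps a relative path to that file's bytes. A reference to
    a path that is not in resources raises KeyError.
    """
    def replace(match):
        path = match.group('path')
        return hashed_name(path, resources[path])
    return RR_REFERENCE.sub(replace, text)
```

Because the hashed name changes whenever the file's bytes change, a long cache lifetime on the rewritten URL can never serve stale content.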

Use python3 -m idupree_websitepy.rrify to conveniently add ?rr to existing HTML code (it will interactively ask you which places to change the code using a web browser; it works on any textual file type but assumes that link-targets are relative to the directory of the source file).

?rr can be put on a link to a directory, in which case the directory is given a unique name instead of the files in it. This is useful for JavaScript that e.g. dynamically selects ("imgs/?rr" + "piece-" + piece_number + ".png").²

Extra HTML features

Putting  in an HTML file will be replaced by <link rel="canonical" href="..." /> (href pointing to an absolute URL of that file's route at the domain+scheme configured by Config canonical_scheme_and_domain).

Caveats: you'll have to use some escaping (or make an exception in idupree_websitepy/build.py) if you want to:

write the comment  in HTML.
write the literal sequence (url-char)?rr(non-url-char) in any text-based format.

Use AUTO_OBFUSCATE_* to write email addresses that don't require JavaScript to behave properly but will confuse simple spambot email harvesters. The source syntaxes are written without an @ and without mentioning the word "mail" in the hopes that if the source document is visible to the Web, or the replacement fails to work for some reason, it will still be difficult to automatically harvest the email address from the source.

AUTO_OBFUSCATE_URL(addr example.com) → mailto:addr@example.com obfuscated for use in an HTML href="". Note: one obfuscation, newlines in the address, is not HTML-valid but still works in at least Firefox (26) and Chromium (31) and IE (8) and means bots have to work harder than just preprocessing pages with &-decoding and case-folding.
AUTO_OBFUSCATE_HTML(addr example.com) → addr@example.com obfuscated for use in HTML body text.
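
For a feel of what this kind of obfuscation can look like, here is a minimal sketch that encodes an address as HTML character references. It is an assumption-laden stand-in, not what AUTO_OBFUSCATE_HTML actually emits: the function name, the seed parameter and the decimal/hex mixing are all invented for the example, and the newline trick mentioned above is not reproduced.

```python
import html
import random


def obfuscate_for_html_text(local_part: str, domain: str, seed: int = 0) -> str:
    """Encode local_part@domain as a mix of decimal and hex character references."""
    rng = random.Random(seed)  # seeded so that repeated builds give the same output
    pieces = []
    for char in local_part + '@' + domain:
        if rng.random() < 0.5:
            pieces.append('&#%d;' % ord(char))   # decimal reference
        else:
            pieces.append('&#x%x;' % ord(char))  # hexadecimal reference
    return ''.join(pieces)


# A browser (or html.unescape) decodes the result back to the address.
encoded = obfuscate_for_html_text('addr', 'example.com')
decoded = html.unescape(encoded)
```

A harvester that decodes character references defeats this on its own, which is why the real obfuscation layers further tricks on top.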

Usage

Install idupree_websitepy using the instructions on GitHub (it's a typical python3 package with no dependencies). Then in your own code, write a Python file that looks something like this:

import subprocess

import idupree_websitepy.build
import idupree_websitepy.tests

config = idupree_websitepy.build.Config(
      site_source_dir = '...',
      build_output_dir = '...',
      doindexfrom = ['/'],
      butdontindexfrom = [],
      error_on_missing_resource = True,
      error_on_broken_internal_link = True,
      canonical_scheme_and_domain = 'http://www.example.com',
      list_of_compilation_source_files = ['build.py'],
      test_host = 'localhost',
      test_port = 9000,
      test_host_header = 'test.example.com',
      test_canonical_origin = True,
      test_status_codes = {
        '/': 200,
        '/blog/post': 200,
        '/erherhkgf': 404,
      }
      )
idupree_websitepy.build.build(config)

# the relevant nginx has to be reloaded in order
# to notice the new lua config
subprocess.check_call(['/usr/bin/sudo', '/bin/systemctl',
      'reload-or-try-restart', 'nginx.service'])

idupree_websitepy.tests.test(config)

In this, site_source_dir should be a directory that represents the root of your site. For example, you'll likely want an index.html in it. This index.html will be exposed as http://www.example.com/ but not as http://www.example.com/index.html. (index and .html extensions are omitted in order to maintain URLs with clean and unique paths.)
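
The mapping from page files to routes can be sketched as follows. This only illustrates the rule stated above and is not the library's routing code: route_for is a hypothetical name, and serving a subdirectory's index.html with a trailing slash is my assumption, since the text only specifies the root case. Resources (.css, .js, images) are routed differently and are not covered.

```python
import posixpath


def route_for(source_path: str) -> str:
    """Map a page's path relative to site_source_dir to its route."""
    stem, ext = posixpath.splitext(source_path)
    if ext == '.html':
        source_path = stem  # drop the .html extension
    directory, name = posixpath.split(source_path)
    if name == 'index':
        # Assumption: a directory index is served with a trailing slash.
        return '/' + directory + '/' if directory else '/'
    return '/' + source_path
```

So index.html maps to / and blog/post.html maps to /blog/post, matching the routes used in test_status_codes in the example config.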

The rest of the Config options are explained in its docstring in idupree_websitepy/build.py. You can also print the docstring by running python3 -c 'from idupree_websitepy.build import Config; print(Config.__init__.__doc__)'

.css and .js and image files are normally not mapped as simple URLs; instead a hash is added to their URLs to improve caching reliability. See "How to use resources" above for how to link to them.

Tests-related parts are 100% optional and require setting up a local validator.nu and nginx+lua. They test, for example, whether the built website is valid HTML. These tests are slightly customizable, but rather opinionated.

nginx config should contain something like

    # Our Lua always specifies the Content-Type
    lua_use_default_type off;

    # We handle our own ETags.
    etag off;

    # Disable If-Modified-Since since we have hash-based ETags instead
    if_modified_since off;
    add_header Last-Modified '';

    lua_package_path "/srv/openresty/conf/?.lua;/...build_output_dir/build/nginx/?.lua";

    server {
        listen 9000 default_server;
        server_name test.example.com;
        merge_slashes off;
        location / {
            content_by_lua '
              local do_page = require "deploy/do_page"
              do_page(ngx.var.uri)
              ';
        }
        location /pagecontent/ {
            internal;
            alias /...build_output_dir/build/nginx/deploy/pagecontent/;
        }
    }

Misc

Content-Encoding: gzip (not x-gzip) because even Apache 1.3's docs say that only old clients require the x- form.

Robots.txt does not forbid content pages. If it forbids a page then Google can't crawl it to read its X-Robots-Tag: noindex so Google might index it if third party sites link to that page!³

Library code overview

Using Python because it is a stable toolchain that is super easy to install everywhere with sufficient "batteries included". This decreases friction when I want to add something to my website and find out that, say, Haskell binary compat is momentarily broken on my system or NodeJS has increased by two major versions. Currently requires Python >= 3.3, but almost works in 2.7.

Dependencies

Python 3.3+ as python3
(if python < 3.4: python-asyncio library (pip3 install --user asyncio): for tests)
pandoc binary: for converting pandoc-markdown to HTML
inkscape, convert (ImageMagick), optipng: for SVG/PNG/ICO image transformations
rsync; nginx-openresty server: for some deployment scenarios. (debian nginx-extras also works. The semi-unusual required nginx module is lua.)
local copy of validator.nu (requires python2, git, and JDK; instructions: [1] or [2]): for tests
Also see HTTPS setup info, which uses openssl for certificate generation.

Scripts

rrify.py: Running
python3 -m idupree_websitepy.rrify path/to/dir/to/swizzle/
gives a convenient interface to add and remove ?rr from links between files within that directory tree.
For more details, python3 -m idupree_websitepy.rrify --help

Library modules

build.py

The main entry point for user code. Documentation for Config options is here.

tests.py

User entry point for the tests (which are somewhat customizable but rather opinionated).

errdocs.py

Contents of self-contained HTML documents for each of many HTTP error codes. Has Python and shell interfaces (shell: run python errdocs.py --help to get help).

resource_rewriting.py

A system depending on buildsystem.py. A "resource" is a web page's CSS/JavaScript/image/etc. files; any file that doesn't have a URL visible to a casual user is a good choice to treat as a resource. This system renames the resource files based on their content so that Cache-Control: ∞ is reasonable and caches can't mistakenly cache older versions of content. Then it rewrites specially marked references (in pages and resources) so that they point to the current filename of the target resource file.

urlregexps.py

Regexps conforming to the URI RFCs, plus regexps for my custom rewritable-URL templating system.

buildsystem.py

A bad Make-like library which provides the following features:

Consistent mtime: Files generated through the buildsystem lib will always have mtime equal to the max mtime of the source files they depend on.
Work reuse: Files generated through the buildsystem lib whose dependencies (including build-scripts) haven't changed, won't be rebuilt unnecessarily.
Separate build directory: By default, all build products appear in ../<dirname>-builds, meaning you don't have to .gitignore the build dir. To make it easier to avoid leaving byproducts in the source dir, by default it copies the source dir to <build dir>/src, sans some editor temp files, RCS directories, etc, and cd's to <build dir>.
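
The "consistent mtime" rule is simple enough to sketch directly. This is an illustration, not the code in buildsystem.py; set_consistent_mtime is a hypothetical helper name.

```python
import os


def set_consistent_mtime(output_path: str, source_paths: list) -> float:
    """Give output_path the newest mtime found among source_paths."""
    newest = max(os.stat(p).st_mtime for p in source_paths)
    # Leave the access time alone; only the modification time is pinned.
    os.utime(output_path, (os.stat(output_path).st_atime, newest))
    return newest
```

Pinning the mtime this way means rebuilding from unchanged sources yields files with unchanged timestamps, which keeps tools like rsync from treating them as modified.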

Its worst feature is that if, when specifying dependencies for work reuse, you miss a dependency, it can't warn you and unexpected non-recompilations may happen. Also, it can't prevent you from writing to Python variables within work-reuse blocks (even though when the work is actually reused, those blocks won't be run so the Python variables won't be set).

localwebfn.py

A library to simplistically use a browser to ask questions with a graphical UI.

utils.py

Python lib for trivial generic things.

Build products (in ../+xxxxxx-builds/)

src/: a copy of '.'
nonresource-routes: a list of routes to deliberate web pages (excludes rewritten resources)
nocdn-resource-routes: a list of routes to resource files (assuming no separate-domain CDN usage)
nginx/: final product for Nginx OpenResty to be deployed to /srv/openresty/conf/.

intermediate build products:

site/: the site, compiled but without resource-rewriting applied yet
nginx-pagecontent-hash/: hex SHA hashes of actual final page contents that will be sent over the wire
rewritten-towards/: site/ rewritten in various ways. -gz/-nogz are for if the HTML page server can do content-encoding negotiation but the resource pages server can't.
rr/: info from resource-rewriting, accessible in Python from the ResourceRewriter object's recall_* methods, or in the below dirs.
rr/direct-deps/: list of resource files each file directly depends on
rr/transitive-deps/: list of resource files each file directly or indirectly depends on
rr/hash/: pre-resource-rewriting hashes of file contents
rr/hash-incl-deps/: pre-resource-rewriting hashes of the file-and-all-its-dependencies' hashes
rr/rewritten-resource-name/: mapping from original filename to name incorporating hash
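
The relationship between direct deps, transitive deps and the two kinds of hash can be sketched like this. It is a hedged illustration rather than the ResourceRewriter's code: the function names, the in-memory dicts and the use of SHA-256 are assumptions for the example.

```python
import hashlib


def transitive_deps(path: str, direct_deps: dict) -> set:
    """All resources reachable from path through direct_deps, excluding path."""
    seen = set()
    stack = list(direct_deps.get(path, ()))
    while stack:
        dep = stack.pop()
        if dep not in seen and dep != path:
            seen.add(dep)
            stack.extend(direct_deps.get(dep, ()))
    return seen


def hash_incl_deps(path: str, contents: dict, direct_deps: dict) -> str:
    """Hash a file's own content hash together with those of all its dependencies."""
    combined = hashlib.sha256()
    # Sorting makes the result independent of traversal order.
    for name in [path] + sorted(transitive_deps(path, direct_deps)):
        combined.update(hashlib.sha256(contents[name]).digest())
    return combined.hexdigest()
```

Hashing dependencies too matters because a stylesheet's bytes change after rewriting whenever an image it references is renamed, so its own unique name has to change as well.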

obsolete and not generated anymore:

nocdn/pages/: final product for Apache httpd (not fully correct HTTP headers currently; no longer tested) that can be deployed to anywhere where HTTP path '/' refers to this directory (modulo deployment wibbles about extensions and content-type headers) (filenames and contents resource-rewritten)
withcdn/pages/: final product files for Apache httpd (ditto) to be directly accessed under the website-user-visible domain, pointing to public CDN (not even specified now) for resources (contents resource-rewritten)
withcdn/resources/: final product resource files for Apache httpd (ditto) pointing to public CDN (ditto) for sub-resources (filenames and contents resource-rewritten)

¹ One reason: GitHub does not let me tell robots "noarchive", "noindex", and/or "rel=canonical". Nginx does.
² In the directory-rr case: The directory's unique name is based on the contents of everything in it and the things they depend on. The unique path of each file in it is only unique by dint of the directory's renaming. Unfortunately, JavaScript doesn't come with correct URL joining, so ("imgs/?rr" + ...) won't trivially work without resource-rewriting the way "some-image.png?rr" does. Another small downside: every file under such directories will be served as a resource, even if it's not used as a resource, because it's not trivially clear which files under the directory are accessed by JavaScript in this way. So try to put the files you're accessing this way in their own subdirectory.
³ See this video by a Googler, with the pertinent information near the end of the video.

idupree