---
title: Generating incremental static site generators in Haskell using cartesian categories
date: 2022-12-06
draft: true
toc: true
---

A few days ago, I released the new version of achille, a Haskell library providing an EDSL for writing static site generators. This embedded language produces efficient, incremental and parallel static site generators, for free.

In this post, I will explain how achille is able to transform this intuitive, "readable" syntax into an incremental static site generator:

```haskell
import Achille as A

main :: IO ()
main = achille $ task A.do
  -- copy every static asset as is
  match_ "assets/*" copyFile

  -- load site template
  template <- matchFile "template.html" loadTemplate

  -- render every article in `posts/`
  -- and gather all metadata
  posts <-
    match "posts/*.md" \src -> A.do
      (meta, content) <- processPandocMeta src
      writeFile (src -<.> ".html") (renderPost template meta content)
      meta

  -- render index page with the 10 most recent articles
  renderIndex template (take 10 (sort posts))
```

Importantly, I want to emphasize that you --- the library user --- neither have to care about nor understand the internals of achille in order to use it. Most of the machinery below is purposefully kept out of sight. You are free to ignore this post and go straight to the user manual to get started!

This article is here to document how the right theoretical framework was instrumental in providing a good user interface while preserving all the desired properties. It also gives pointers on how to reliably overload Haskell's lambda abstraction syntax: I'm sure many applications could make good use of this, but few people seem aware that there are now ways to do it properly, without any kind of metaprogramming.


## Foreword

My postulate is that static sites are good. Of course not for every use case, but for single-user, small-scale websites, it is a very convenient way to manage content. Very easy to edit offline, very easy to deploy. All in all very nice.

There are lots of static site generators readily available. However, each and every one of them has a very specific idea of how you should structure your content. For simple websites --- i.e. weblogs --- they are wonderful, but as soon as you want to heavily customize the generation process of your site or need fancier transformations, and thus step outside the supported feature set of your generator of choice, you're out of luck.

For this reason, many people end up not using existing static site generators, and instead prefer to write their own. Depending on the language you use, it is fairly straightforward to write a little static site generator that does precisely what you want. Sadly, making it incremental or parallel is another issue, and way trickier.

That's precisely the niche that Hakyll and achille try to fill: provide an embedded DSL in Haskell to specify your custom build rules, and compile them all into a full-fledged incremental static site generator executable. Some kind of static site generator generator.

## Reasoning about static site generators

Let's look at what a typical site generator does. A good way to visualize it is with a flow diagram, where boxes are "build rules". Boxes have distinguished inputs and outputs, and dependencies between the build rules are represented by wires going from outputs of boxes to inputs of other boxes.

The static site generator written in the Haskell code above corresponds to the following diagram:

...

Build rules are clearly identified, and we see that in order to render the index.html page, we need to wait for the renderPosts rule to finish rendering each article to HTML and return the metadata of every one of them.

Notice how some wires are continuous black lines, while other wires are faded, dotted lines. The dotted lines represent the side effects of the generator:

- files that are read from the file system, like all the markdown files in posts/;
- files that are written to the file system, like the HTML output of every article, or the index.html file.

The first important insight is to realize that the build system shouldn't care about side effects. Its only role is to know whether build rules should be executed, how intermediate values get passed around, and how they change between consecutive runs.

## The `Recipe m` abstraction

I had my gripes with Hakyll, and was looking for a simpler, more general way to express build rules. I came up with the Recipe abstraction:

```haskell
newtype Recipe m a b = Recipe
  { runRecipe :: Context -> Cache -> a -> m (b, Cache) }
```

It's just a glorified Kleisli arrow: a Recipe m a b will produce an output of type b by running a computation in m, given some input of type a.

The purpose is to abstract over side effects of build rules (such as producing HTML files on disk) and shift the attention to intermediate values that flow between build rules.
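To make this concrete, here is what a primitive recipe could look like (a sketch against the definition above, not achille's actual API):

```haskell
import Data.Text (Text)
import qualified Data.Text.IO as Text

-- Read a text file, ignoring the context and leaving the cache
-- untouched: the input is a path, the output its contents.
readText :: Recipe IO FilePath Text
readText = Recipe \_ctx cache path -> do
  contents <- Text.readFile path
  pure (contents, cache)
```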

### Caching

In the definition of Recipe, a recipe takes some Cache as input, and returns another one after the computation is done. This cache is simply a lazy bytestring, and enables recipes to have some persistent storage between runs, that they can use in any way they desire.

The key insight is how composition of recipes is handled:

```haskell
(*>) :: Monad m => Recipe m a b -> Recipe m a c -> Recipe m a c
Recipe f *> Recipe g = Recipe \ctx cache x -> do
  let (cf, cg) = splitCache cache
  (_, cf') <- f ctx cf x
  (y, cg') <- g ctx cg x
  pure (y, joinCache cf' cg')
```

The cache is split in two, and both pieces are forwarded to their respective recipe. Once the computation is done, the resulting caches are put together into one again.
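achille's actual encoding is a bit more careful, but a minimal sketch of these two helpers could look like this, assuming the binary package for serialization:

```haskell
import Data.Binary (decodeOrFail, encode)
import qualified Data.ByteString.Lazy as LBS

type Cache = LBS.ByteString

-- Pack two caches into one by serializing the pair.
joinCache :: Cache -> Cache -> Cache
joinCache cf cg = encode (cf, cg)

-- Recover both halves. An empty or corrupted cache yields two empty
-- caches, which simply forces both recipes to recompute from scratch.
splitCache :: Cache -> (Cache, Cache)
splitCache cache = case decodeOrFail cache of
  Right (_, _, caches) -> caches
  Left  _              -> (mempty, mempty)
```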

This ensures that every recipe will be handed the same local cache on every run --- assuming the description of the generator does not change between runs. Of course, this is only true when Recipe m is merely used as a selective applicative functor, though I doubt you need more than that for writing a static site generator. It's not perfect, but this very simple model for caching has proven to be surprisingly powerful.

I have improved upon it since then, in order to make sure that composition is associative and to enable some computationally intensive recipes to become insensitive to code refactorings, but the core idea is left unchanged.

### Incremental evaluation and dependency tracking

## But there is a but

We've now defined all the operations we could wish for in order to build, compose and combine recipes. We've even found the theoretical framework our concrete application fits into. How cool!

But there is a catch, and I hope you've already been thinking about it: what an awful, awful way to write recipes.

Sure, it's nice to know that we have all the primitive operations required to express every flow diagram we could ever be interested in. We can definitely define the site generator that has been serving as an example throughout:

```haskell
rules :: Task IO ()
rules = renderIndex ∘ (...)
```

But I hope we can all agree that this code is complete gibberish. Some Haskellers would likely be perfectly happy with this interface, but alas my library isn't targeted only at this crowd. No, what I really want is a way to assign intermediate results --- outputs of rules --- to variables that then get used as inputs. Plain old Haskell variables. That is, I want to write my recipes as plain old functions.

And here is where my --- intermittent --- search for a readable syntax started, roughly two years ago.

## The quest for a friendly syntax

### Monads

If you've done a bit of Haskell, you may know that as soon as you're working with things that compose and sequence, chances are high that you're working with monads. Perhaps the most well-known example is the IO monad. A value of type IO a represents a computation that, after performing some side effects (reading a file, writing a file, ...), will produce a value of type a.

Crucially, being a monad means you have a way to sequence computations. In the case of the IO monad, the bind operation has the following type:

```haskell
(>>=) :: IO a -> (a -> IO b) -> IO b
```

And because monads are so prevalent in Haskell, there is a custom syntax, the do notation, that allows you to bind results of computations to variables that can be used for the following computations. This syntax gets desugared into the primitive operations (>>=) and pure.

```haskell
main :: IO ()
main = do
  content <- readFile "input.txt"
  writeFile "output.txt" content
```

The above gets transformed into:

```haskell
main :: IO ()
main = readFile "input.txt" >>= writeFile "output.txt"
```

Looks promising, right? I can define a Monad instance for Recipe m a fairly easily:

```haskell
instance Monad (Recipe m a) where
  (>>=) :: Recipe m a b -> (b -> Recipe m a c) -> Recipe m a c
```

And now problem solved?

```haskell
rules :: Task IO ()
rules = do
  posts <- match "posts/*.md" renderPosts
  renderIndex posts
```

The answer is a resolute no. The problem becomes apparent when we try to actually define this (>>=) operation.
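Indeed, nothing stops us from writing a plausible definition, reusing the cache-splitting trick from earlier (a sketch, with the Functor and Applicative instances omitted):

```haskell
instance Monad m => Monad (Recipe m a) where
  Recipe f >>= k = Recipe \ctx cache x -> do
    let (cf, cg) = splitCache cache
    (y, cf') <- f ctx cf x
    -- note: *which* recipe receives the second half of the cache
    -- depends on the value of y we just computed...
    (z, cg') <- runRecipe (k y) ctx cg x
    pure (z, joinCache cf' cg')
```

The definition goes through, but two problems lurk in it: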

  1. The second argument is a Haskell function of type b -> Recipe m a c. And precisely because it is a Haskell function, it can do anything it wants depending on the value of its argument. In particular, it could very well return different recipes for different inputs (see the snippet after this list). That is, the structure of the graph is no longer static, and could change between runs if the output of type b from the first rule happens to change. This is very bad, because we rely on the static structure of recipes to guarantee that the cache stays consistent between runs.

Ok, sure, but what if we assume that users don't do bad things (we never should)? Even then, there is an even bigger problem:

  2. Because the second argument is just a Haskell function, it is completely opaque: the only way to learn which recipe it returns is to apply it to a value of type b, and we only get such a value by running the first recipe. We cannot recover the structure of the generator without executing it, so there is no hope of statically analyzing dependencies to, say, schedule independent recipes in parallel.
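To illustrate the first point, here is the kind of thing the Monad interface would let users write (a hypothetical example; renderAllOnOnePage is a made-up rule):

```haskell
-- The shape of the build graph now depends on a value that is only
-- known at run time, so local caches cannot reliably line up between
-- consecutive runs.
rules :: Task IO ()
rules = do
  posts <- match "posts/*.md" renderPosts
  if length posts > 10
    then renderIndex posts
    else renderAllOnOnePage posts
```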

### Arrows

That's when I discovered Haskell's arrows. They are a generalization of monads, and are often presented as a way to compose things that behave like functions. And indeed, we can define our very own instance Arrow (Recipe m). There is a special syntax, the arrow notation, that kinda looks like the do notation, so is this the way out?
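For reference, this is roughly what our running example could look like in arrow notation (a sketch, assuming hypothetical arrows matchPosts and renderIndexA of the appropriate types):

```haskell
{-# LANGUAGE Arrows #-}

rules :: Task IO ()
rules = proc () -> do
  posts <- matchPosts  -< ()
  renderIndexA -< posts
```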

There is something fishy in the definition of Arrow:

```haskell
class Category k => Arrow k where
  -- ...
  arr :: (a -> b) -> a `k` b
```

We must be able to lift any function into k a b in order to make it an Arrow. In our case we can do it, so that's not the issue. No, the real issue is how Haskell desugars the arrow notation.

...

So, Haskell's Arrow isn't it either. In principle it should be the solution, but the desugarer is broken, the syntax is still unreadable to my taste, and nobody has the will to fix it.

This syntax investigation must carry on.

## Compiling to cartesian closed categories

About a year after this project started, and well after I had given up on this whole endeavour, I happened upon Conal Elliott's fascinating paper "Compiling to Categories". In it, Conal recalls:

> It is well-known that the simply typed lambda-calculus is modeled by any cartesian closed category (CCC)

I had heard of it, that is true. What this means is that, given any cartesian closed category, any term of type a -> b (a function) in the simply-typed lambda calculus corresponds to (can be interpreted as) an arrow (morphism) a -> b in that category. But a cartesian closed category crucially has no notion of variables, just some arrows and operations to compose and rearrange them (among other things). Yet in the lambda calculus you have to construct functions using lambda abstraction. In other words, there is a consistent way to convert things defined with variable bindings into a representation (CCC morphisms) where variables are gone.
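To get a feel for what "variables are gone" means, consider the lambda term \(x, y) -> (y, x). It binds two variables, yet the same function can be assembled from projection and pairing combinators alone. Here is a sketch, instantiated in the simplest CCC of all, ordinary Haskell functions:

```haskell
-- The cartesian building blocks, specialized to plain functions.
-- Libraries like concat abstract these into type classes.
exl :: (a, b) -> a
exl = fst

exr :: (a, b) -> b
exr = snd

fork :: (c -> a) -> (c -> b) -> (c -> (a, b))
fork f g x = (f x, g x)

-- The lambda term  \(x, y) -> (y, x)  with its bindings compiled away:
swap :: (a, b) -> (b, a)
swap = fork exr exl
```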

How interesting. Then, Conal goes on to explain that because Haskell is "just" lambda calculus on steroids, any monomorphic function of type a -> b really ought to be convertible into an arrow in the CCC of your choice. And so he did just that: he is behind the concat GHC plugin and library. This library exports a bunch of typeclasses that allow anyone to define instances for their very own target CCC. Additionally, the plugin gives access to the following, truly magical function:

```haskell
ccc :: CartesianClosed k => (a -> b) -> a `k` b
```

When the plugin runs during compilation, every time it encounters this specific function, it converts the Haskell term (in GHC Core form) for the first argument (a function) into the corresponding Haskell term for the morphism in the target CCC.
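For instance, given a hypothetical user-defined CCC Syn of first-order syntax trees (in the spirit of the paper's examples), reifying a plain Haskell function into an inspectable term is a one-liner:

```haskell
-- Hypothetical: the plugin rewrites the lambda below into Syn's
-- categorical combinators at compile time, with no metaprogramming
-- on our part.
square :: Syn Double Double
square = ccc (\x -> x * x)
```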

How neat: a reliable way to overload the lambda notation in Haskell. The paper is really, really worth a read, and contains many practical applications, such as compiling functions into circuits or automatic differentiation.

## Compiling to monoidal cartesian categories

Two days ago, I stumbled upon this paper by chance.

What they explain is that many interesting categories to compile to are in fact not closed. And crucially, their approach requires no GHC plugin: just a tiny library with a few classes.
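For reference, the classes involved look roughly like this (a paraphrase of the general idea, not the paper's exact interface):

```haskell
import Control.Category (Category)

-- Monoidal categories let you place arrows side by side; cartesian
-- ones additionally provide projections and explicit duplication.
class Category k => Monoidal k where
  (×) :: (a `k` b) -> (c `k` d) -> ((a, c) `k` (b, d))

class Monoidal k => Cartesian k where
  dup :: a `k` (a, a)
  exl :: (a, b) `k` a
  exr :: (a, b) `k` b
```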

There is one drawback: Recipe m is cartesian. That is, you can freely duplicate values. In their framework, they have you explicitly insert dup to duplicate a value. This is a bit annoying, but they have a good reason to do so: