---
title: Generating incremental static site generators in Haskell using cartesian categories
date: 2022-12-06
draft: true
toc: true
---
A few days ago, I released the new version of achille, a Haskell library providing an EDSL for writing static site generators. This embedded language produces efficient, incremental and parallel static site generators, for free.

In this post, I will explain how achille is able to transform this intuitive, "readable" syntax into an incremental static site generator:

```haskell
import Achille as A

main :: IO ()
main = achille $ task A.do
  -- render every article in `posts/`
  -- and gather all metadata
  posts <-
    match "posts/*.md" \src -> A.do
      (meta, content) <- processPandocMeta src
      writeFile (src -<.> ".html") (renderPost meta content)
      meta

  -- render index page with the 10 most recent articles
  renderIndex (take 10 (sort posts))
```

Importantly, I want to emphasize that you --- the library user --- neither have to care about nor understand the internals of achille in order to use it. You are free to ignore this post and go directly through the user manual to get started!

This post is just there to document how the right theoretical framework was key in providing a good user interface that preserves all the desired properties.


Foreword

The original postulate is that static sites are good. Of course not for every use case, but for single-user, small-scale websites, it is a very practical way of managing content. Very easy to edit offline, very easy to deploy. All in all very nice.

There are lots of static site generators readily available. However, each and every one of them has a very specific idea of how you should manage your content. For simple websites --- e.g. weblogs --- they are great, but as soon as you want to heavily customize the building process of your site and require fancier transformations, thus stepping outside of the supported feature set of your site generator of choice, you're in for a lot of trouble.

For this reason, many people end up not using existing static site generators, and instead prefer to write their own. Depending on the language you use, it is fairly straightforward to write a little static site generator doing everything you want. Sadly, making it incremental or parallel is another issue, and way trickier.

That's precisely the niche that Hakyll and achille try to fill: use an embedded DSL in Haskell to specify your custom build rules, and compile them all into a full-fledged incremental static site generator executable. Some kind of static site generator generator.

Reasoning about static site generators

Let's look at what a typical site generator does. A good way to visualize it is with a flow diagram, where boxes are "build rules". Boxes have distinguished inputs and outputs, and dependencies between the build rules are represented by wires going from outputs of boxes to inputs of other boxes.

The static site generator corresponding to the Haskell code above could be represented as the following diagram:

...

Build rules are clearly identified, and we see that in order to render the index.html page, we need to wait for the renderPosts rule to finish rendering each article to HTML and return the metadata of every one of them.

Notice how some wires are continuous black lines, and some other wires are faded dotted lines. The dotted lines represent the side effects of the generator:

  • files that are read from the file system, like all the markdown files in posts/.
  • files that are written to the filesystem, like the HTML output of every article, or the index.html file.

The first insight is to realize that the build system shouldn't care about side effects. Its only role is to know whether build rules should be executed, and how intermediate values get passed around.

The Recipe m abstraction

I had my gripes with Hakyll, and was looking for a simpler, more general way to express build rules. I came up with the Recipe abstraction:

```haskell
newtype Recipe m a b = Recipe
  { runRecipe :: Context -> Cache -> a -> m (b, Cache) }
```

It's just a glorified Kleisli arrow: a Recipe m a b will produce an output of type b by running a computation in m, given some input of type a.

The purpose is to abstract over side effects of build rules (such as producing HTML files on disk) and shift the attention to intermediate values that flow between build rules.

As one could expect, if m is a monad, so is Recipe m a. This means composing recipes is very easy, and dependencies between them are stated explicitly in the code.
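To make that claim concrete, here is a simplified sketch of how the Monad instance can be defined. The Context and Cache types are mere placeholders here, and the cache is threaded straight through rather than split; it only illustrates the shape of the instance, not achille's actual implementation.

```haskell
{-# LANGUAGE BlockArguments #-}

import qualified Data.ByteString.Lazy as LBS

type Context = ()               -- placeholder for achille's real Context
type Cache   = LBS.ByteString   -- the cache really is a lazy bytestring

newtype Recipe m a b = Recipe
  { runRecipe :: Context -> Cache -> a -> m (b, Cache) }

instance Functor m => Functor (Recipe m a) where
  fmap f (Recipe r) = Recipe \ctx cache x ->
    fmap (\(y, c) -> (f y, c)) (r ctx cache x)

instance Monad m => Applicative (Recipe m a) where
  pure y = Recipe \_ cache _ -> pure (y, cache)
  rf <*> rx = rf >>= \f -> fmap f rx

instance Monad m => Monad (Recipe m a) where
  -- run the first recipe, then feed its output to the continuation,
  -- threading the cache along
  Recipe r >>= k = Recipe \ctx cache x -> do
    (y, cache') <- r ctx cache x
    runRecipe (k y) ctx cache' x
```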

```haskell
main :: IO ()
main = achille do
  posts <- match "posts/*.md" compilePost
  compileIndex posts
```
<details>
  <summary>Type signatures</summary>

Simplifying a bit, these would be the type signatures of the building blocks in the code above.

```haskell
compilePost  :: Recipe IO FilePath PostMeta
match        :: GlobPattern -> Recipe IO FilePath b -> Recipe IO () [b]
compileIndex :: [PostMeta] -> Recipe IO () ()
achille      :: Recipe IO () () -> IO ()
```
</details>

There are no ambiguities about the ordering of build rules, and the evaluation model is in turn very simple --- in contrast to Hakyll, with its global store and implicit ordering.

Caching

In the definition of Recipe, a recipe takes some Cache as input, and returns another one after the computation is done. This cache is simply a lazy bytestring, and enables recipes to have some persistent storage between runs, that they can use in any way they desire.
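As a concrete illustration, here is a sketch of a hypothetical `cached` combinator (not achille's actual API) that serializes a recipe's result into its local cache with `Data.Binary`, so that subsequent runs can return the stored value instead of recomputing it:

```haskell
{-# LANGUAGE BlockArguments #-}

import Data.Binary (Binary, encode, decodeOrFail)
import qualified Data.ByteString.Lazy as LBS

type Context = ()               -- placeholder for achille's real Context
type Cache   = LBS.ByteString

newtype Recipe m a b = Recipe
  { runRecipe :: Context -> Cache -> a -> m (b, Cache) }

-- Hypothetical helper: return the cached value if one can be decoded
-- from the local cache; otherwise run the computation and store its
-- encoded result in the new cache.
cached :: (Monad m, Binary b) => (a -> m b) -> Recipe m a b
cached f = Recipe \_ cache x ->
  case decodeOrFail cache of
    Right (_, _, y) -> pure (y, cache)   -- cache hit: skip the work
    Left _          -> do                -- cache miss (e.g. first run)
      y <- f x
      pure (y, encode y)
```

Note the caveat this makes visible: a naive `cached` never notices that its input changed, which is exactly why dependency tracking matters later in this post.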

The key insight is how composition of recipes is handled:

```haskell
(*>) :: Recipe m a b -> Recipe m a c -> Recipe m a c
Recipe f *> Recipe g = Recipe \ctx cache x -> do
  let (cf, cg) = splitCache cache
  (_, cf') <- f ctx cf x
  (y, cg') <- g ctx cg x
  pure (y, joinCache cf' cg')
```

The cache is split in two, and both pieces are forwarded to their respective recipe. Once the computation is done, the resulting caches are put together into one again.
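One plausible way to implement `splitCache` and `joinCache` on lazy bytestrings is a length-prefix encoding: store the length of the first piece in front of both pieces. This is only an illustration of the idea, not necessarily achille's actual encoding.

```haskell
import Data.Binary (encode, decodeOrFail)
import Data.Int (Int64)
import qualified Data.ByteString.Lazy as LBS

type Cache = LBS.ByteString

-- Prefix the first piece with its length, then concatenate.
joinCache :: Cache -> Cache -> Cache
joinCache cf cg = encode (LBS.length cf) <> cf <> cg

-- Read the length prefix back and split accordingly. An empty or
-- malformed cache (e.g. on a first run) yields two empty caches.
splitCache :: Cache -> (Cache, Cache)
splitCache cache = case decodeOrFail cache of
  Right (rest, _, n) -> LBS.splitAt (n :: Int64) rest
  Left _             -> (LBS.empty, LBS.empty)
```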

This ensures that every recipe will receive the same local cache --- assuming the description of the generator does not change between runs. Of course this is only true when Recipe m is merely used as a selective applicative functor, though I doubt you need more than that for writing a static site generator. It's not perfect, but this very simple model for caching has proven to be surprisingly powerful.

I have improved upon it since then, in order to make sure that composition is associative and to enable some computationally intensive recipes to become insensitive to code refactorings, but the core idea is left unchanged.

Incremental evaluation and dependency tracking

But there is a but

Arrows

I really like the do notation, but sadly a monadic bind hides how variables are used, so no luck. If only there were a way to overload Haskell's lambda abstraction syntax to transform it into a representation free of variable bindings...

That's when I discovered Haskell's arrows. They are a generalization of monads, and are often presented as a way to compose things that behave like functions. And indeed, we can define our very own instance Arrow (Recipe m). There is a special syntax, the arrow notation, that looks a bit like do notation --- so is this the way out?

There is something fishy in the definition of Arrow:

```haskell
class Category k => Arrow k where
  -- ...
  arr :: (a -> b) -> a `k` b
```

We must be able to lift any function into k a b in order to make it an Arrow. In our case we can, so that's not the issue. No, the real issue is how Haskell desugars the arrow notation.
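Indeed, defining the instances is straightforward, as this sketch shows. Context and Cache are placeholders again, and the cache handling is simplified (a single cache threaded sequentially, rather than the cache splitting shown earlier):

```haskell
import Prelude hiding (id, (.))
import Control.Category (Category (..))
import Control.Arrow (Arrow (..))
import qualified Data.ByteString.Lazy as LBS

type Context = ()               -- placeholder for achille's real Context
type Cache   = LBS.ByteString

newtype Recipe m a b = Recipe
  { runRecipe :: Context -> Cache -> a -> m (b, Cache) }

instance Monad m => Category (Recipe m) where
  id = Recipe (\_ cache x -> pure (x, cache))
  -- run f first, then g, threading the cache along
  Recipe g . Recipe f = Recipe (\ctx cache x -> do
    (y, cache') <- f ctx cache x
    g ctx cache' y)

instance Monad m => Arrow (Recipe m) where
  -- lifting a pure function: no effects, cache untouched
  arr f = Recipe (\_ cache x -> pure (f x, cache))
  first (Recipe f) = Recipe (\ctx cache (x, z) -> do
    (y, cache') <- f ctx cache x
    pure ((y, z), cache'))
```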

...

There is a macro that is a bit smarter than GHC's current desugarer, but not by much. I've seen some discussions about actually fixing this upstream, but I don't think anyone has the time to do it: too few people use arrows to justify the cost.

Conal Elliott's concat

Conal Elliott wrote a fascinating paper called Compiling to Categories. The gist of it is that any cartesian-closed category is a model of simply-typed lambda-calculus. Building on this, he made a GHC plugin giving access to a magical function:

```haskell
ccc :: Closed k => (a -> b) -> a `k` b
```

You can see that the signature is very similar to the one of arr.

A first issue is that Recipe m very much isn't closed. Another, more substantial issue is that the GHC plugin is very experimental: I had a hard time running it even on simple examples, and it is barely documented.

Does this mean all hope is lost? NO.

Compiling to monoidal cartesian categories

Two days ago, I stumbled upon this paper by chance.

What they explain is that many interesting categories to compile to are in fact not closed.

No GHC plugin required, just a tiny library with a few classes.

There is one drawback: Recipe m is cartesian. That is, you can freely duplicate values. In their framework, which only assumes a monoidal category, you have to explicitly insert dup to duplicate a value. This is a bit annoying, but they have a good reason to do so: