---
title: Generating incremental static site generators in Haskell using cartesian categories
date: 2022-12-06
draft: true
toc: true
---

A few days ago, I released the new version of [achille], a Haskell library
providing an EDSL for writing static site generators. This embedded language produces
efficient, *incremental* and *parallel* static site generators, *for free*.

[achille]: /projects/achille

In this post, I will explain how [achille] is able to transform this intuitive, "readable"
syntax into an incremental static site generator:

```haskell
import Achille as A

main :: IO ()
main = achille $ task A.do
  -- copy every static asset as is
  match_ "assets/*" copyFile

  -- load site template
  template <- matchFile "template.html" loadTemplate

  -- render every article in `posts/`
  -- and gather all metadata
  posts <-
    match "posts/*.md" \src -> A.do
      (meta, content) <- processPandocMeta src
      writeFile (src -<.> ".html") (renderPost template meta content)
      meta

  -- render index page with the 10 most recent articles
  renderIndex template (take 10 (sort posts))
```

Importantly, I want to emphasize that *you* --- the library user --- do not
have to care about or understand the internals of [achille] in order to use it.
*Most* of the machinery below is purposefully kept hidden from plain sight. You
are free to ignore this post and directly go through the [user manual][manual]
to get started!

[manual]: /projects/achille/

This article is just there to document how the right theoretical framework was
instrumental in providing a good user interface *and yet* preserving all the
desired properties. It also gives pointers on how to reliably overload Haskell's
*lambda abstraction* syntax, because I'm sure many applications could make good
use of that but are unaware that there are now ways to do it properly, *without
any kind of metaprogramming*.

---

## Foreword

My postulate is that *static sites are good*. Of course not for every
use case, but for single-user, small-scale websites, it is a very convenient way
to manage content. Very easy to edit offline, very easy to deploy. All in all
very nice.

There are lots of static site generators readily available. However, each and
every one of them has a very specific idea of how you should *structure* your
content. For simple websites --- i.e. weblogs --- they are wonderful, but as soon
as you want to heavily customize the generation process of your site or require
more fancy transformations, and thus step outside of the supported feature set
of your generator of choice, you're out of luck.

For this reason, many people end up not using existing static site generators,
and instead prefer to write their own. Depending on the language you use, it is
fairly straightforward to write a little static site generator that does
precisely what you want. Sadly, making it *incremental* or *parallel* is another
issue, and way trickier.

That's precisely the niche that [Hakyll] and [achille] try to fill: provide an
embedded DSL in Haskell to specify your *custom* build rules, and compile them
all into a full-fledged **incremental** static site generator executable. Some
kind of static site generator *generator*.

[Hakyll]: https://jaspervdj.be/hakyll/

## Reasoning about static site generators

Let's look at what a typical site generator does. A good way to visualize it
is with a flow diagram, where *boxes* are "build rules". Boxes have
distinguished inputs and outputs, and dependencies between the build rules are
represented by wires going from outputs of boxes to inputs of other boxes.

The static site generator described by the Haskell code above corresponds
to the following diagram:

...

Build rules are clearly identified, and we see that in order to render the `index.html`
page, *we need to wait* for the `renderPosts` rule to finish rendering each
article to HTML and return the metadata of every one of them.

Notice how some wires are **continuous** **black** lines, and some other wires are
faded **dotted** lines. The **dotted lines** represent **side effects** of the
generator:

- files that are read from the file system, like all the markdown files in
  `posts/`.
- files that are written to the filesystem, like the HTML output of every
  article, or the `index.html` file.

The first important insight is to realize that the build system *shouldn't care
about side effects*. Its *only* role is to know whether build rules *should be
executed*, how intermediate values get passed around, and how they change
between consecutive runs.

### The `Recipe m` abstraction

I had my gripes with Hakyll, and was looking for a simpler, more general way to
express build rules. I came up with the `Recipe` abstraction:

```haskell
newtype Recipe m a b = Recipe
  { runRecipe :: Context -> Cache -> a -> m (b, Cache) }
```

It's just a glorified Kleisli arrow: a `Recipe m a b` will produce an output of
type `b` by running a computation in `m`, given some input of type `a`.

The purpose is to *abstract over side effects* of build rules (such as producing
HTML files on disk) and shift the attention to *intermediate values* that flow
between build rules.

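To make the "glorified Kleisli arrow" remark concrete, here is a minimal sketch that drops `Context` and `Cache` entirely and keeps only the effectful function. The `Rule` type and the `countWords`, `describe` and `pipeline` rules are hypothetical names chosen for illustration; they are not part of [achille].

```haskell
import Prelude hiding (id, (.))
import Control.Category (Category (..))

-- Hypothetical simplification of `Recipe m a b`: no Context, no Cache,
-- just an effectful function from `a` to `b` (a Kleisli arrow).
newtype Rule m a b = Rule { runRule :: a -> m b }

-- Rules compose like functions, running the effects of `m` in order.
instance Monad m => Category (Rule m) where
  id = Rule pure
  Rule g . Rule f = Rule (\x -> f x >>= g)

-- Two toy rules standing in for real build steps.
countWords :: Monad m => Rule m String Int
countWords = Rule (\s -> pure (length (words s)))

describe :: Monad m => Rule m Int String
describe = Rule (\n -> pure (show n ++ " words"))

-- The output wire of `countWords` is plugged into the input of `describe`.
pipeline :: Monad m => Rule m String String
pipeline = describe . countWords

demo :: IO ()
demo = runRule pipeline "static sites are good" >>= putStrLn
```

Calling `demo` runs both rules in sequence in `IO`. The real `Recipe` adds the `Context` and `Cache` arguments on top of this shape.
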
### Caching

In the definition of `Recipe`, a recipe takes some `Cache` as input, and
returns another one after the computation is done. This cache is simply a *lazy
bytestring*, and enables recipes to have some *persistent storage* between
runs, that they can use in any way they desire.

The key insight is how composition of recipes is handled:

```haskell
(*>) :: Recipe m a b -> Recipe m a c -> Recipe m a c
Recipe f *> Recipe g = Recipe \ctx cache x -> do
  -- give each recipe its own half of the cache
  let (cf, cg) = splitCache cache
  (_, cf') <- f ctx cf x
  (y, cg') <- g ctx cg x
  -- store the *updated* halves for the next run
  pure (y, joinCache cf' cg')
```

The cache is split in two, and both pieces are forwarded to their respective
recipe. Once the computation is done, the resulting caches are put together
into one again.

This ensures that every recipe will be attributed the same local cache
--- assuming the description of the generator does not change between runs. Of
course this is only true when `Recipe m` is merely used as a *selective*
applicative functor, though I doubt you need more than that for writing a
static site generator. It's not perfect, but I can say that this very simple model
for caching has proven to be surprisingly powerful.

I have improved upon it since then, in order to make sure that
composition is associative and to enable some computationally intensive recipes to
become insensitive to code refactorings, but the core idea is left unchanged.

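The post does not show how `splitCache` and `joinCache` are implemented. One possible sketch, assuming the combined cache is simply the binary encoding of a pair of lazy bytestrings (via the `binary` package), could look like this. The encoding choice and the fallback to empty caches are assumptions, not a description of what [achille] actually does.

```haskell
import Data.Binary (decodeOrFail, encode)
import qualified Data.ByteString.Lazy as L
import qualified Data.ByteString.Lazy.Char8 as LC

type Cache = L.ByteString

-- Assumed encoding: the combined cache is a serialized pair of caches.
joinCache :: Cache -> Cache -> Cache
joinCache l r = encode (l, r)

-- On the very first run (empty cache), or if the stored bytes cannot be
-- decoded as a pair, both halves start out empty.
splitCache :: Cache -> (Cache, Cache)
splitCache cache = case decodeOrFail cache of
  Right (_, _, halves) -> halves
  Left _               -> (L.empty, L.empty)

demo :: IO ()
demo = do
  let combined = joinCache (LC.pack "left") (LC.pack "right")
  print (splitCache combined)  -- the two halves come back unchanged
  print (splitCache L.empty)   -- first run: two empty caches
```

Because each half is itself a `Cache`, a composite recipe can split its own half again, so nested compositions get a tree of local caches whose shape follows the shape of the generator.
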
### Incremental evaluation and dependency tracking

### But there is a but

We've now defined all the operations we could wish for in order to build,
compose and combine recipes. We've even found the theoretical framework our
concrete application inserts itself into. How cool!

**But there is a catch**, and I hope you've already been thinking about it:
**what an awful, awful way to write recipes**.

Sure, it's nice to know that we have all the primitive operations required to
express all the flow diagrams we could ever be interested in. We *can*
definitely define the site generator that has been serving as example
throughout:

```
rules :: Task ()
rules = renderIndex ∘ (...)
```

But I hope we can all agree on the fact that this code is **complete
gibberish**. It's likely *some* Haskellers would be perfectly happy with this
interface, but alas my library isn't *only* targeted at this crowd. No, what I
really want is a way to assign intermediate results --- outputs of rules --- to
*variables*, that then get used as inputs. Plain old Haskell variables. That is,
I want to write my recipes as plain old *functions*.

And here is where my --- intermittent --- search for a readable syntax started,
roughly two years ago.

## The quest for a friendly syntax

### Monads

If you've done a bit of Haskell, you *may* know that as soon as you're working
with things that compose and sequence, there are high chances that what you're
working with are *monads*. Perhaps the most well-known example is the `IO`
monad. A value of type `IO a` represents a computation that, after doing
side-effects (reading a file, writing a file, ...) will produce a value of type
`a`.

Crucially, being a monad means you have a way to *sequence* computations. In
the case of the `IO` monad, the bind operation has the following type:

```haskell
(>>=) :: IO a -> (a -> IO b) -> IO b
```

And because monads are so prevalent in Haskell, there is a *custom syntax*, the
`do` notation, that allows you to bind results of computations to *variables*
that can be used for the following computations. This syntax gets desugared into
the primitive operations `(>>=)` and `pure`.

```haskell
main :: IO ()
main = do
  content <- readFile "input.txt"
  writeFile "output.txt" content
```

The above gets transformed into:

```haskell
main :: IO ()
main = readFile "input.txt" >>= writeFile "output.txt"
```

Looks promising, right? I can define a `Monad` instance for `Recipe m a`,
fairly easily.

```haskell
instance Monad (Recipe m a) where
  (>>=) :: Recipe m a b -> (b -> Recipe m a c) -> Recipe m a c
```

And now problem solved?

```haskell
rules :: Task IO ()
rules = do
  posts <- match "posts/*.md" renderPosts
  renderIndex posts
```

The answer is a resolute **no**. The problem becomes apparent when we try to
actually define this `(>>=)` operation.

1. The second argument is a Haskell function of type `b -> Recipe m a c`. And
   precisely because it is a Haskell function, it can do anything it wants
   depending on the value of its argument. In particular, it could very well
   return *different recipes* for *different inputs*. That is, the *structure*
   of the graph is no longer *static*, and could change between runs, if the
   output of type `b` from the first rule happens to change. This is **very
   bad**, because we rely on the static structure of recipes to make the claim
   that the cache stays consistent between runs.

Ok, sure, but what if we assume that users don't do bad things (we never should).
No, even then, there is an even bigger problem:

2. Because the second argument is *just a Haskell function*.

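The first point can be illustrated with a toy model. The `Plan` type below is hypothetical and only models the *shape* of a generator, not [achille] itself: static sequencing can be inspected without running anything, whereas a monadic bind hides everything behind an opaque function.

```haskell
{-# LANGUAGE GADTs #-}

-- A toy description of build rules (hypothetical, for illustration only).
data Plan a where
  Step :: String -> a -> Plan a              -- a named rule with a fixed result
  Then :: Plan a -> Plan b -> Plan b         -- static sequencing, like (*>)
  Bind :: Plan a -> (a -> Plan b) -> Plan b  -- monadic sequencing, like (>>=)

-- List the rules that can be discovered *without running anything*.
-- Under `Bind` the continuation is an opaque function: we cannot know
-- which rules it contains until we have a value to feed it.
staticRules :: Plan a -> [String]
staticRules (Step name _) = [name]
staticRules (Then p q)    = staticRules p ++ staticRules q
staticRules (Bind p _)    = staticRules p ++ ["<unknown>"]

-- Run a plan, returning the names of the rules that actually executed.
run :: Plan a -> ([String], a)
run (Step name x) = ([name], x)
run (Then p q) = case (run p, run q) of
  ((ps, _), (qs, y)) -> (ps ++ qs, y)
run (Bind p k) = case run p of
  (ps, x) -> case run (k x) of
    (qs, y) -> (ps ++ qs, y)

-- Which rule runs second depends on the *value* produced by the first.
dynamic :: Int -> Plan ()
dynamic n = Step "countPosts" n `Bind` \count ->
  if count > 10 then Step "renderArchive" () else Step "renderIndex" ()

demo :: IO ()
demo = do
  print (staticRules (dynamic 3))
  print (fst (run (dynamic 3)))
  print (fst (run (dynamic 42)))
```

Here `dynamic 3` and `dynamic 42` execute different rules, yet `staticRules` cannot tell them apart. A cache laid out according to the static structure has no stable slot to assign to whatever hides behind the `Bind`.
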
## Arrows

That's when I discovered Haskell's arrows. They are a generalization of monads,
and are often presented as a way to compose things that behave like functions.
And indeed, we can define our very own `instance Arrow (Recipe m)`. There is a special
syntax, the *arrow notation*, that kinda looks like the `do` notation, so is this
the way out?

There is something fishy in the definition of `Arrow`:

```haskell
class Category k => Arrow k where
  -- ...
  arr :: (a -> b) -> a `k` b
```

We must be able to lift any function into `k a b` in order to make it an
`Arrow`. In our case we can do it, that's not the issue. No, the real issue is
how Haskell desugars the arrow notation.

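As background for the remark about `arr`, the sketch below shows a small `proc` block next to a hand-written pipeline that computes the same thing. The second definition is only a rough approximation of the style of code the notation expands to, not GHC's exact output, and it is not meant to reproduce the argument made in this post. The `addA` example mirrors the one from the `Control.Arrow` documentation; `addA'` and `demo` are names made up here.

```haskell
{-# LANGUAGE Arrows #-}

import Control.Arrow (Arrow, arr, first, returnA, (>>>))

-- With arrow notation: run `f` and `g` on the same input, add the results.
addA :: Arrow k => k a Int -> k a Int -> k a Int
addA f g = proc x -> do
  y <- f -< x
  z <- g -< x
  returnA -< y + z

-- A hand-written equivalent. Every bit of plumbing between `f` and `g`
-- is an ordinary Haskell function lifted with `arr`, and such functions
-- are opaque to whoever interprets the arrow.
addA' :: Arrow k => k a Int -> k a Int -> k a Int
addA' f g =
      arr (\x -> (x, x))        -- duplicate the input: (x, x)
  >>> first f                   -- (y, x)
  >>> arr (\(y, x) -> (x, y))   -- reorder: (x, y)
  >>> first g                   -- (z, y)
  >>> arr (\(z, y) -> y + z)    -- combine the two results

demo :: IO ()
demo = print (addA (+ 1) (* 2) 5, addA' (+ 1) (* 2) 5)
```

With plain functions as the arrow, both definitions agree on every input. The difference only matters for arrows such as `Recipe m`, where one would like to inspect the wiring rather than merely run it.
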
...

So. Haskell's `Arrow` isn't it either. Well, in principle it *should* be the
solution. But the desugarer is broken, the syntax still unreadable to my taste,
and nobody has the will to fix it.

This syntax investigation must carry on.

## Compiling to cartesian closed categories

About a year after this project started, and well after I had given up on this
whole endeavour, I happened to pass by Conal Elliott's fascinating paper
["Compiling to Categories"][ccc]. In this paper, Conal recalls:

[ccc]: http://conal.net/papers/compiling-to-categories/

> It is well-known that the simply typed lambda-calculus is modeled by any
> cartesian closed category (CCC)

I had heard of it, that is true. What this means is that, given any cartesian
closed category, any *term* of type `a -> b` (a function) in the simply-typed
lambda calculus corresponds to (can be interpreted as) an *arrow* (morphism)
`a -> b` in the category. But a cartesian-closed category crucially has no notion
of *variables*, just some *arrows* and operations to compose and rearrange them
(among other things). Yet in the lambda calculus you *have* to construct functions
using *lambda abstraction*. In other words, there is a consistent way to convert
things defined with variable bindings into a representation (CCC morphisms)
where variables are *gone*.

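As a small illustration of what "variables are gone" means, here is a pared-down cartesian interface with two instances. The class below is a simplification written for this example: the names `exl` and `exr` loosely follow the paper, while `fork`, `Syntax` and `swapC` are hypothetical, and none of this is the actual interface of any library.

```haskell
import Prelude hiding (id, (.))
import Control.Category (Category (..))

-- A minimal cartesian interface: projections and pairing of arrows.
class Category k => Cartesian k where
  exl  :: k (a, b) a                    -- first projection
  exr  :: k (a, b) b                    -- second projection
  fork :: k a b -> k a c -> k a (b, c)  -- feed one input to two arrows

-- Target 1: ordinary Haskell functions.
instance Cartesian (->) where
  exl = fst
  exr = snd
  fork f g x = (f x, g x)

-- Target 2: arrows that merely record a description of themselves.
newtype Syntax a b = Syntax { render :: String }

instance Category Syntax where
  id = Syntax "id"
  Syntax g . Syntax f = Syntax ("(" ++ g ++ " . " ++ f ++ ")")

instance Cartesian Syntax where
  exl = Syntax "exl"
  exr = Syntax "exr"
  fork (Syntax f) (Syntax g) = Syntax ("(" ++ f ++ " &&& " ++ g ++ ")")

-- With variables:     \(a, b) -> (b, a)
-- Without variables:  pair up the two projections, in swapped order.
swapC :: Cartesian k => k (a, b) (b, a)
swapC = fork exr exl

demo :: IO ()
demo = do
  print (swapC (1 :: Int, 'x'))  -- interpreted as a function
  putStrLn (render swapC)        -- interpreted as a description
```

The same variable-free term `swapC` can be run as a function or printed as a description, depending only on which category it is instantiated at.
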
How interesting. Then, Conal goes on to explain that because Haskell is
"just" lambda calculus on steroids, any monomorphic function of type `a -> b`
really ought to be convertible into an arrow in the CCC of your choice.
And so he *did* just that. He is behind the [concat] GHC plugin and library.
This library exports a bunch of typeclasses that allow anyone to define instances
for their very own target CCC. Additionally, the plugin gives access to the
following, truly magical function:

[concat]: https://github.com/compiling-to-categories/concat

```haskell
ccc :: CartesianClosed k => (a -> b) -> a `k` b
```

When the plugin is run during compilation, every time it encounters this specific
function it will convert the Haskell term (in GHC Core form) for the first
argument (a function) into the corresponding Haskell term for the morphism in
the target CCC.

How neat. A reliable way to overload the lambda notation in Haskell.
The paper is really, really worth a read, and contains many practical
applications such as compiling functions into circuits or automatic
differentiation.

## Compiling to monoidal cartesian categories

Two days ago, I stumbled upon this paper by chance:.

What they explain is that many interesting categories to compile to are in fact
not closed.

No GHC plugin required, just a tiny library with a few `class`es.

There is one drawback: `Recipe m` *is* cartesian. That is, you can freely
duplicate values. In their framework, they have you explicitly insert `dup` to
duplicate a value. This is a bit annoying, but they have a good reason to do so: