---
title: Generating incremental static site generators in Haskell using cartesian categories
date: 2022-12-06
draft: true
toc: true
---
A few days ago, I released the new version of [achille], a Haskell library
providing an EDSL for writing static site generators. This embedded language produces
efficient, *incremental* and *parallel* static site generators, *for free*.
[achille]: /projects/achille
In this post, I will explain how [achille] is able to transform this intuitive, "readable"
syntax into an incremental static site generator:
```haskell
import Achille as A

main :: IO ()
main = achille $ task A.do
  -- render every article in `posts/`
  -- and gather all metadata
  posts <-
    match "posts/*.md" \src -> A.do
      (meta, content) <- processPandocMeta src
      writeFile (src -<.> ".html") (renderPost meta content)
      meta

  -- render index page with the 10 most recent articles
  renderIndex (take 10 (sort posts))
```
Importantly, I want to emphasize that *you* --- the library user --- do not have
to care about or understand the internals of [achille] in order to use it.
You are free to ignore this post and directly go through the [user
manual][manual] to get started!
[manual]: /projects/achille/
This post is just there to document how the right theoretical framework was key
in providing a good user interface that preserves all the desired properties.
---
## Foreword
The original postulate is that *static sites are good*. Not for every use case,
of course, but for single-user, small-scale websites, they are a very practical
way of managing content: very easy to edit offline, very easy to deploy. All in
all very nice.
There are lots of static site generators readily available. However, each and
every one of them has a very specific idea of how you *should* manage your
content. For simple websites --- e.g. weblogs --- they are great, but as soon as
you want to heavily customize the building process of your site and require
fancier transformations, thus stepping outside the supported feature set of
your generator of choice, you're in for a lot of trouble.
For this reason, many people end up not using existing static site generators,
and instead prefer to write their own. Depending on the language you use, it is
fairly straightforward to write a little static site generator doing everything
you want. Sadly, making it *incremental* or *parallel* is another issue, and way
trickier.
That's precisely the niche that [Hakyll] and [achille] try to fill: use an
embedded DSL in Haskell to specify your *custom* build rules, and compile them
all into a full-fledged **incremental** static site generator executable. Some
kind of static site generator *generator*.
[Hakyll]: https://jaspervdj.be/hakyll/
## Reasoning about static site generators
Let's look at what a typical site generator does. A good way to visualize it
is with a flow diagram, where *boxes* are "build rules". Boxes have
distinguished inputs and outputs, and dependencies between the build rules are
represented by wires going from outputs of boxes to inputs of other boxes.
The static site generator corresponding to the Haskell code above could be
represented as the following diagram:
...
Build rules are clearly identified, and we see that in order to render the `index.html`
page, we need to wait for the `renderPosts` rule to finish rendering each
article to HTML and return the metadata of every one of them.
Notice how some wires are **continuous black** lines, while others are faded
**dotted** lines. The dotted lines represent **side effects** of the generator:
- files that are read from the file system, like all the markdown files in
`posts/`.
- files that are written to the filesystem, like the HTML output of every
article, or the `index.html` file.
The first insight is to realize that the build system *shouldn't care about side
effects*. Its *only* role is to know whether build rules *should be executed*,
and how intermediate values get passed around.
### The `Recipe m` abstraction
I had my gripes with Hakyll, and was looking for a simpler, more general way to
express build rules. I came up with the `Recipe` abstraction:
```haskell
newtype Recipe m a b = Recipe
  { runRecipe :: Context -> Cache -> a -> m (b, Cache) }
```
It's just a glorified Kleisli arrow: a `Recipe m a b` will produce an output of
type `b` by running a computation in `m`, given some input of type `a`.
The purpose is to *abstract over side effects* of build rules (such as producing
HTML files on disk) and shift the attention to *intermediate values* that flow
between build rules.
As one could expect, if `m` is a monad, so is `Recipe m a`. This means composing
recipes is very easy, and dependencies *between* recipes are stated **explicitly**
in the code:
```haskell
main :: IO ()
main = achille do
  posts <- match "posts/*.md" compilePost
  compileIndex posts
```
``` {=html}
<details>
<summary>Type signatures</summary>
```
Simplifying a bit, these would be the type signatures of the building blocks in
the code above.
```haskell
compilePost  :: Recipe IO FilePath PostMeta
match        :: GlobPattern -> Recipe IO FilePath b -> Recipe IO () [b]
compileIndex :: [PostMeta] -> Recipe IO () ()
achille      :: Recipe IO () () -> IO ()
```
``` {=html}
</details>
```
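As a rough illustration, these instances could look something like the
following --- a sketch that simply threads the cache through, *not* achille's
actual code:

```haskell
import Control.Monad (ap)

-- Sketch only: the cache is passed along unchanged in structure;
-- achille's real instances are more careful about how the cache
-- is split between sub-recipes (see the next section).
instance Functor m => Functor (Recipe m a) where
  fmap f (Recipe r) = Recipe \ctx cache x ->
    (\(y, cache') -> (f y, cache')) <$> r ctx cache x

instance Monad m => Applicative (Recipe m a) where
  pure y = Recipe \_ cache _ -> pure (y, cache)
  (<*>)  = ap

instance Monad m => Monad (Recipe m a) where
  Recipe r >>= f = Recipe \ctx cache x -> do
    (y, cache') <- r ctx cache x
    runRecipe (f y) ctx cache' x
```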
There are no ambiguities about the ordering of build rules, and the evaluation
model is in turn *very* simple --- in contrast to Hakyll, with its global store
and implicit ordering.
### Caching
In the definition of `Recipe`, a recipe takes some `Cache` as input, and
returns another one after the computation is done. This cache is simply a *lazy
bytestring*, and enables recipes to have some *persistent storage* between
runs, that they can use in any way they desire.
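For instance, one could imagine a combinator --- hypothetical, not part of
achille's API --- that stores the last input and output of an expensive recipe
in its cache, and only recomputes when the input changes:

```haskell
import Data.Binary (Binary, encode, decodeOrFail)

-- Hypothetical helper: skip an expensive recipe when its input
-- hasn't changed since the last run. For simplicity, the inner
-- recipe is handed an empty cache when it does run.
cached :: (Monad m, Eq a, Binary a, Binary b)
       => Recipe m a b -> Recipe m a b
cached (Recipe r) = Recipe \ctx cache x ->
  case decodeOrFail cache of
    -- same input as last time: reuse the stored output
    Right (_, _, (x', y)) | x' == x -> pure (y, cache)
    -- first run, or the input changed: recompute and store
    _ -> do
      (y, _) <- r ctx mempty x
      pure (y, encode (x, y))
```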
The key insight is how composition of recipes is handled:
```haskell
(*>) :: Recipe m a b -> Recipe m a c -> Recipe m a c
Recipe f *> Recipe g = Recipe \ctx cache x -> do
  let (cf, cg) = splitCache cache
  (_, cf') <- f ctx cf x
  (y, cg') <- g ctx cg x
  pure (y, joinCache cf' cg')
```
The cache is split in two, and both pieces are forwarded to their respective
recipe. Once the computation is done, the resulting caches are put together
into one again.
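The `splitCache` and `joinCache` helpers can themselves be implemented on top
of the bytestring cache. A plausible implementation --- an assumption on my
part, not necessarily achille's --- is to encode the two halves as a pair:

```haskell
import Data.Binary (encode, decodeOrFail)
import qualified Data.ByteString.Lazy as LBS

splitCache :: Cache -> (Cache, Cache)
splitCache cache = case decodeOrFail cache of
  Right (_, _, (l, r)) -> (l, r)
  -- an empty or invalid cache (e.g. on the very first run)
  -- simply yields two empty sub-caches
  Left _ -> (LBS.empty, LBS.empty)

joinCache :: Cache -> Cache -> Cache
joinCache l r = encode (l, r)
```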
This ensures that every recipe is handed the same local cache from one run to
the next --- assuming the description of the generator does not change between
runs. Of course, this is only true when `Recipe m` is merely used as a
*selective* applicative functor, though I doubt you need more than that for
writing a static site generator. It's not perfect, but this very simple model
for caching has proven to be surprisingly powerful.
I have improved upon it since then, in order to make sure that
composition is associative and to enable some computationally intensive recipes to
become insensitive to code refactorings, but the core idea is left unchanged.
### Incremental evaluation and dependency tracking
### But there is a but

The `Monad` interface is in fact *too* expressive for our purposes. The
right-hand side of `>>=` is an opaque Haskell function, so once a value is
bound to a variable, the library has no way to track where this variable gets
used --- and therefore no way to know which build rules depend on which
intermediate values, which is precisely the information incremental evaluation
needs.
## Arrows
I really like the `do` notation, but losing this information about variable
use is a deal-breaker, so no luck. If only there was a way to *overload* the
lambda abstraction syntax of Haskell, to transform it into a representation
free of variable bindings...
That's when I discovered Haskell's arrows. They are a generalization of monads,
often presented as a way to compose things that behave like functions. And
indeed, we can define our very own `instance Arrow (Recipe m)`. There is even a
special syntax, the *arrow notation*, that looks a lot like `do` notation ---
so is this the way out?
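For `Recipe m`, the instance itself is easy enough to write. Here is a sketch,
reusing the cache-splitting helpers from the previous section:

```haskell
import Prelude hiding (id, (.))
import Control.Category
import Control.Arrow

instance Monad m => Category (Recipe m) where
  id = Recipe \_ cache x -> pure (x, cache)
  -- sequential composition splits the cache between the two recipes,
  -- just like (*>) above
  Recipe g . Recipe f = Recipe \ctx cache x -> do
    let (cf, cg) = splitCache cache
    (y, cf') <- f ctx cf x
    (z, cg') <- g ctx cg y
    pure (z, joinCache cf' cg')

instance Monad m => Arrow (Recipe m) where
  arr f = Recipe \_ cache x -> pure (f x, cache)
  first (Recipe f) = Recipe \ctx cache (x, y) -> do
    (z, cache') <- f ctx cache x
    pure ((z, y), cache')
```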
There is something fishy in the definition of `Arrow`:
```haskell
class Category k => Arrow k where
-- ...
arr :: (a -> b) -> a `k` b
```
We must be able to lift *any* function into `k a b` in order to make it an
`Arrow`. In our case we can do it, so that's not the issue. No, the real issue
is how Haskell desugars the arrow notation.
...
There is a macro that is a bit smarter than GHC's current desugarer, but not by
much. I've seen some discussions about actually fixing this upstream, but I
don't think anyone has the time to do it: too few people use arrows to justify
the cost.
## Conal Elliott's `concat`
Conal Elliott wrote a fascinating paper called *Compiling to Categories*.
The gist of it is that any cartesian-closed category is a model of simply-typed
lambda-calculus. Building on this insight, he made a GHC plugin giving access
to a magical function:
```haskell
ccc :: Closed k => (a -> b) -> a `k` b
```
You can see that this signature is *very* similar to that of `arr`.
A first issue is that `Recipe m` very much isn't *closed*. Another, more
substantial issue is that the GHC plugin is *very* experimental: I had a hard
time running it on simple examples, and it is barely documented.
Does this mean all hope is lost? **NO**.
## Compiling to monoidal cartesian categories
Two days ago, I stumbled upon a paper by chance: *Evaluating Linear Functions
to Symmetric Monoidal Categories*, by Jean-Philippe Bernardy and Arnaud
Spiwack.
What they explain is that many of the categories we would like to compile to
are in fact *not* closed. And their solution requires no GHC plugin: just a
tiny library with a few `class`es.
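I won't reproduce the paper's actual interface here, but the overall shape is a
small hierarchy of classes for symmetric monoidal --- and optionally cartesian
--- categories, roughly like this (my approximation, *not* the library's real
API):

```haskell
import Prelude hiding (id, (.))
import Control.Category

-- Symmetric monoidal categories: morphisms can be composed
-- in parallel, and the two halves of a pair can be swapped.
class Category k => Monoidal k where
  (***) :: k a b -> k c d -> k (a, c) (b, d)
  swap  :: k (a, b) (b, a)

-- Cartesian categories additionally allow duplicating
-- and discarding values.
class Monoidal k => Cartesian k where
  dup :: k a (a, a)
  exl :: k (a, b) a
  exr :: k (a, b) b
```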
There is one drawback: `Recipe m` *is* cartesian. That is, you can freely
duplicate values. In their framework, you have to *explicitly* insert `dup` to
duplicate a value. This is a bit annoying, but they have a good reason to do so: