| title | date | draft | toc |
|---|---|---|---|
| Generating incremental static site generators in Haskell using cartesian categories | 2022-12-06 | true | true |
A few days ago, I released the new version of achille, a Haskell library providing an EDSL for writing static site generators. This embedded language produces efficient, incremental and parallel static site generators, for free.
In this post, I will explain how achille is able to transform this intuitive, "readable" syntax into an incremental static site generator:
import Achille as A

main :: IO ()
main = achille $ task A.do
  -- copy every static asset as is
  match_ "assets/*" copyFile

  -- load site template
  template <- matchFile "template.html" loadTemplate

  -- render every article in `posts/`
  -- and gather all metadata
  posts <-
    match "posts/*.md" \src -> A.do
      (meta, content) <- processPandocMeta src
      writeFile (src -<.> ".html") (renderPost template meta content)
      meta

  -- render index page with the 10 most recent articles
  renderIndex template (take 10 (sort posts))
Importantly, I want to emphasize that you --- the library user --- neither have to care about nor understand the internals of achille in order to use it. Most of the machinery below is purposefully kept hidden from plain sight. You are free to ignore this post and head straight to the user manual to get started!
This article is just there to document how the right theoretical framework was instrumental in providing a good user interface and yet preserve all the desired properties. It also gives pointers on how to reliably overload Haskell's lambda abstraction syntax, because I'm sure many applications could make good use of that but are unaware that there are now ways to do it properly, without any kind of metaprogramming.
Foreword
My postulate is that static sites are good. Of course not for every use case, but for single-user, small-scale websites, it is a very convenient way to manage content. Very easy to edit offline, very easy to deploy. All in all very nice.
There are lots of static site generators readily available. However, each and every one of them has a very specific idea of how you should structure your content. For simple websites --- i.e. weblogs --- they are wonderful, but as soon as you want to heavily customize the generation process of your site or require fancier transformations, and thus step outside of the supported feature set of your generator of choice, you're out of luck.
For this reason, many people end up not using existing static site generators, and instead prefer to write their own. Depending on the language you use, it is fairly straightforward to write a little static site generator that does precisely what you want. Sadly, making it incremental or parallel is another issue, and way trickier.
That's precisely the niche that Hakyll and achille try to fill: provide an embedded DSL in Haskell to specify your custom build rules, and compile them all into a full-fledged incremental static site generator executable. Some kind of static site generator generator.
Reasoning about static site generators
Let's look at what a typical site generator does. A good way to visualize it is with a flow diagram, where boxes are "build rules". Boxes have distinguished inputs and outputs, and dependencies between the build rules are represented by wires going from outputs of boxes to inputs of other boxes.
The static site generator described by the Haskell code above corresponds to the following diagram:
...
Build rules are clearly identified, and we see that in order to render the index.html page, we need to wait for the renderPosts rule to finish rendering each article to HTML and return the metadata of every one of them.
Notice how some wires are continuous black lines, and some other wires are faded dotted lines. The dotted lines represent side effects of the generator:

- files that are read from the file system, like all the markdown files in posts/.
- files that are written to the filesystem, like the HTML output of every article, or the index.html file.
The first important insight is to realize that the build system shouldn't care about side effects. Its only role is to know whether build rules should be executed, how intermediate values get passed around, and how they change between consecutive runs.
The Recipe m abstraction
newtype Recipe m a b = Recipe
  { runRecipe :: Context -> Cache -> a -> m (b, Cache) }
It's just a glorified Kleisli arrow: a Recipe m a b will produce an output of type b by running a computation in m, given some input of type a.
The purpose is to abstract over side effects of build rules (such as producing HTML files on disk) and shift the attention to intermediate values that flow between build rules.
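To make the shape concrete, here is a minimal sketch of what a primitive recipe could look like when written directly against this newtype. This is not achille's actual implementation: the outputDir field of Context and the name copyFileR are assumptions made only for this example.

```haskell
import System.FilePath ((</>))
import qualified System.Directory as Dir

-- A minimal sketch, not achille's actual code: copy the input file into an
-- assumed output directory, return the destination path, and pass the cache
-- through untouched.
copyFileR :: Recipe IO FilePath FilePath
copyFileR = Recipe $ \ctx cache src -> do
  let dst = outputDir ctx </> src
  Dir.copyFile src dst
  pure (dst, cache)
```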
...
Visual noise
...
Caching
In the definition of Recipe m a b, a recipe takes some Cache as input, and returns another one after the computation is done.
This cache --- for which I'm not gonna give a definition here --- enables recipes to have some persistent storage between runs, that they can use in any way they desire.
The key insight is how composition of recipes is handled:
(*>) :: Recipe m a b -> Recipe m a c -> Recipe m a c
Recipe f *> Recipe g = Recipe \ctx cache x -> do
  let (cf, cg) = splitCache cache
  (_, cf') <- f ctx cf x
  (y, cg') <- g ctx cg x
  pure (y, joinCache cf' cg')
The cache is split in two, and both pieces are forwarded to their respective recipe. Once the computation is done, the resulting caches are put together into one again.
This ensures that every recipe will be handed the same local cache on every run --- assuming the description of the generator does not change between runs. It's not perfect, but I can say that this very simple model for caching has proven to be surprisingly powerful.
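Since the cache is deliberately kept abstract in this post, here is only a hypothetical illustration of what splitCache and joinCache could look like if the cache were a simple binary tree. The point is that every composed recipe finds its own sub-cache at the same position on every run, as long as the structure of the generator is unchanged.

```haskell
-- Hypothetical cache representation, purely for illustration; achille's real
-- Cache is intentionally not shown in this post.
data Cache = Empty | Node Cache Cache

-- Each of the two composed recipes gets its own half of the cache...
splitCache :: Cache -> (Cache, Cache)
splitCache Empty      = (Empty, Empty)
splitCache (Node l r) = (l, r)

-- ...and the updated halves are stitched back together afterwards, so the
-- next run finds them at the same position.
joinCache :: Cache -> Cache -> Cache
joinCache = Node
```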
...
Incremental evaluation and dependency tracking
...
But there is a but
We've now defined all the operations we could wish for in order to build, compose and combine recipes. We've even found the theoretical framework our concrete application inserts itself into. How cool!
But there is a catch, and I hope you've already been thinking about it:
what an awful, awful way to write recipes.
Sure, it's nice to know that we have all the primitive operations required to express all the flow diagrams we could ever be interested in. We can definitely define the site generator that has been serving as example throughout:
rules :: Task ()
rules = renderIndex ∘ (...)
But I hope we can all agree on the fact that this code is complete gibberish. It's likely some Haskellers would be perfectly happy with this interface, but alas my library isn't only targeted to this crowd. No, what I really want is a way to assign intermediate results --- outputs of rules --- to variables, that then get used as inputs. Plain old Haskell variables. That is, I want to write my recipes as plain old functions.
And here is where my --- intermittent --- search for a readable syntax started, roughly two years ago.
The quest for a friendly syntax
Monads
If you've done a bit of Haskell, you may know that as soon as you're working with things that compose and sequence, there are high chances that what you're working with are monads. Perhaps the most well-known example is the IO monad. A value of type IO a represents a computation that, after doing side-effects (reading a file, writing a file, ...), will produce a value of type a.
Crucially, being a monad means you have a way to sequence computations. In
the case of the IO
monad, the bind operation has the following type:
(>>=) :: IO a -> (a -> IO b) -> IO b
And because monads are so prevalent in Haskell, there is a custom syntax, the do notation, that allows you to bind results of computations to variables that can be used for the following computations. This syntax gets desugared into the primitive operations (>>=) and pure.
main :: IO ()
main = do
  content <- readFile "input.txt"
  writeFile "output.txt" content
The above gets transformed into:
main :: IO ()
main = readFile "input.txt" >>= writeFile "output.txt"
Looks promising, right? I can define a Monad instance for Recipe m a fairly easily.
instance Monad (Recipe m a) where
  (>>=) :: Recipe m a b -> (b -> Recipe m a c) -> Recipe m a c
And now problem solved?
rules :: Task IO ()
rules = do
  posts <- match "posts/*.md" renderPosts
  renderIndex posts
The answer is a resolute no. The problem becomes apparent when we try to
actually define this (>>=)
operation.
- The second argument is a Haskell function of type b -> Recipe m a c. And precisely because it is a Haskell function, it can do anything it wants depending on the value of its argument. In particular, it could very well return different recipes for different inputs (see the sketch below this list). That is, the structure of the graph is no longer static, and could change between runs if the output of type b from the first rule happens to change. This is very bad, because we rely on the static structure of recipes to make the claim that the cache stays consistent between runs.

Ok, sure, but what if we assume that users don't do bad things (we never should)? No, even then, there is an even bigger problem:

- Because the second argument is just a Haskell function.
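To make the first point concrete, here is the kind of definition the Monad interface would happily accept (renderArchive is a hypothetical rule): the shape of the build graph now depends on a value only known at run time.

```haskell
-- Hypothetical example of what (>>=) would allow: the structure of the
-- generator depends on a runtime value (the number of posts), so the cache
-- layout may differ from one run to the next.
rules :: Task IO ()
rules = do
  posts <- match "posts/*.md" renderPosts
  if length posts > 10
    then renderIndex posts    -- one shape of the diagram...
    else renderArchive posts  -- ...or a completely different one
```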
...
Arrows
That's when I discovered Haskell's arrows. They are a generalization of monads, often presented as a way to compose things that behave like functions. And indeed, we can define our very own instance Arrow (Recipe m). There is a special syntax, the arrow notation, that kinda looks like the do notation, so is this the way out?
There is something fishy in the definition of Arrow:
class Category k => Arrow k where
  -- ...
  arr :: (a -> b) -> a `k` b
We must be able to lift any function into k a b in order to make it an Arrow. In our case we can do it, that's not the issue. No, the real issue is how Haskell desugars the arrow notation.
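For reference, here is roughly what that pipeline looks like in the arrow notation. This is only a sketch: it assumes renderIndex is itself a recipe rather than a plain function, and reuses the hypothetical rule names from before.

```haskell
{-# LANGUAGE Arrows #-}

-- A sketch of the arrow notation: variables bound with <- are fed back into
-- other arrows with -<, and the desugarer inserts arr behind the scenes to
-- route them around.
rules :: Task IO ()
rules = proc () -> do
  posts <- match "posts/*.md" renderPosts -< ()
  renderIndex -< posts
```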
...
So. Haskell's Arrow isn't it either. Well, in principle it should be the solution. But the desugarer is broken, the syntax is still unreadable to my taste, and nobody has the will to fix it.
This syntax investigation must carry on.
Compiling to cartesian closed categories
About a year after this project started, and well after I had given up on this whole endeavour, I happened to pass by Conal Elliott's fascinating paper "Compiling to Categories". In this paper, Conal recalls:
It is well-known that the simply typed lambda-calculus is modeled by any cartesian closed category (CCC)
I had heard of it, that is true. What this means is that, given any cartesian
closed category, any term of type a -> b
(a function) in the simply-typed
lambda calculus corresponds to (can be interpreted as) an arrow (morphism)
a -> b
in the category. But a cartesian-closed category crucially has no notion
of variables, just some arrows and operations to compose and rearrange them
(among other things). Yet in the lambda calculus you have to construct functions
using lambda abstraction. In other words, there is a consistent way to convert things defined with variable bindings into a representation (CCC morphisms)
where variables are gone.
How interesting. Then, Conal goes on to explain that because Haskell is
"just" lambda calculus on steroids, any monomorphic function of type a -> b
really ought to be convertible into an arrow in the CCC of your choice.
And so he did just that. He is behind the concat GHC plugin and library.
This library exports a bunch of typeclasses that allow anyone to define instances
for their very own target CCC. Additionally, the plugin gives access to the
following, truly magical function:
ccc :: CartesianClosed k => (a -> b) -> a `k` b
When the plugin is run during compilation, every time it encounters this specific function it will convert the Haskell term (in GHC Core form) for the first argument (a function) into the corresponding Haskell term for the morphism in the target CCC.
How neat. A reliable way to overload the lambda notation in Haskell. The paper is really, really worth a read, and contains many practical applications such as compiling functions into circuits or automatic differentiation.
...
Another year goes by, without any solution in sight. And yet.
"Compiling" to (symmetric) monoidal categories
A month ago, while browsing a Reddit thread on the sad state of Arrow, I stumbled upon an innocent link buried in the depths of the replies.
To a paper from Jean-Philippe Bernardy and Arnaud Spiwack:
"Evaluating Linear Functions to Symmetric Monoidal Categories".
And boy oh boy, what a paper. I haven't been able to stop thinking about it since then.
It starts with the following:
A number of domain specific languages, such as circuits or data-science workflows, are best expressed as diagrams of boxes connected by wires.
Well yes indeed, what I want to express in my syntax are just plain old diagrams, made out of boxes and wires.
A faithful abstraction is Symmetric Monoidal Categories (smcs), but, so far, it hasn’t been convenient to use.
Again yes, cannot agree more. This is the right abstraction, but a terrible way to design these diagrams. But then, the kicker, a bit later in the paper:
Indeed, every linear function can be interpreted in terms of an smc.
What. This, I had never heard. Indeed it makes sense: since in (non-cartesian) monoidal categories you cannot duplicate objects (that is, there is no morphism from a to (a, a)), it is natural to only reason about functions that use their argument exactly once, and that have to use it (or pass it along by returning it). Note that here we talk about linear functions in the sense of Linear Haskell, the type theory kind of
"linear", not linear in the linear algebra kind of "linear".
So far so good. But then, they explain how to evaluate any such linear Haskell function into the right SMC, without metaprogramming. And the techniques they employ to do so are some of the smartest, most beautiful things I've seen. I cannot recommend enough that you go read that paper to learn the full detail. It's amazing, and perhaps more approachable than Conal's paper. It is accompanied by the linear-smc library, which exposes a very simple interface:
- The module Control.Category.Constrained exports typeclasses to declare your type family of choice k :: * -> * -> * (in the type-theory sense of type family, not the Haskell sense) as the right kind of category, using Category, Monoidal and Cartesian.

    class Category k where
      id  :: a `k` a
      (∘) :: (b `k` c) -> (a `k` b) -> (a `k` c)

    class Category k => Monoidal k where
      (×)     :: (a `k` b) -> (c `k` d) -> (a ⊗ c) `k` (b ⊗ d)
      swap    :: (a ⊗ b) `k` (b ⊗ a)
      assoc   :: ((a ⊗ b) ⊗ c) `k` (a ⊗ (b ⊗ c))
      assoc'  :: (a ⊗ (b ⊗ c)) `k` ((a ⊗ b) ⊗ c)
      unitor  :: a `k` (a ⊗ ())
      unitor' :: Obj k a => (a ⊗ ()) `k` a

    class Monoidal k => Cartesian k where
      exl :: (a ⊗ b) `k` a
      exr :: (a ⊗ b) `k` b
      dup :: a `k` (a ⊗ a)

  So far so good, nothing surprising: we can confirm that indeed we've already defined (or can define) these operations for Recipe m, thus forming a cartesian category.
- But the truly incredible bit comes from Control.Category.Linear, which provides the primitives to construct morphisms in a monoidal category using linear functions.

  - It exports an abstract type P k r a that is supposed to represent the "output of an arrow/box" in the SMC k, of type a.

  - A function to convert an SMC arrow into a linear function on ports:

      encode :: (a `k` b) -> P k r a %1 -> P k r b

  - A function to convert a linear function on ports into an arrow in your SMC:

      decode :: Monoidal k => (forall r. P k r a %1 -> P k r b) -> a `k` b

  There are other primitives that we're gonna ignore here.
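As a small taste of how these two functions fit together, here is a sketch that uses only the signatures above: composing two arrows of the SMC by turning them into linear functions on ports, chaining those, and decoding the result back.

```haskell
{-# LANGUAGE LinearTypes, BlockArguments #-}

-- Sketch using only encode and decode as given above: sequential composition
-- of two arrows, expressed by going through ports and back. The lambda is
-- closed and uses its port exactly once, so decode accepts it.
compose :: Monoidal k => (b `k` c) -> (a `k` b) -> (a `k` c)
compose g f = decode \x -> encode g (encode f x)
```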
Now there are at least two things that are remarkable about this interface:

- By keeping the type of ports P k r a abstract, and making sure that the exported functions that produce ports also take ports as arguments, they are able to enforce that any linear function on ports written by the user has to use the operations of the library.

  There is virtually no other way to produce a port out of thin air than to use the exported unit :: P k r (), and because the definition of P k r a is not exported, users have no way to retrieve a value of type a from it. Therefore, ports can only be carried around, and ultimately given as input to arrows in the SMC that have been converted into linear functions with encode.

  I have since been told this is a fairly typical method used by DSL writers to ensure that end users only ever use the allowed operations and nothing more. But it was a first for me, and truly some galaxy-brain technique.
- The second thing is this r parameter in P k r a. This type variable isn't relevant to the information carried by the port. No, its true purpose is to ensure that the linear functions given to decode are closed.

  Indeed, the previous point demonstrated that linear functions of type P k r a %1 -> P k r b can only ever be defined in terms of variables carrying ports, or of linear functions on ports.

  By quantifying over r in the first argument of decode, they prevent the function from ever mentioning variables coming from outside its own definition. Indeed, all operations of the library use the same r for inputs and outputs. So if a port of type P k r a defined outside of a linear function were used inside of it, that function would have to use the same r for every port it manipulates. Crucially, it could then no longer be quantified over r, precisely because this r was bound outside of its definition.

  I have seen this technique once before, in Control.Monad.ST.Safe, and it's so neat.
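This is the same quantification trick as the s parameter of the ST monad: because runST requires its argument to be polymorphic in s, references created inside one ST computation can never escape it and be reused elsewhere.

```haskell
import Control.Monad.ST (runST)
import Data.STRef (newSTRef, readSTRef, writeSTRef)

-- runST :: (forall s. ST s a) -> a
-- The reference `ref` cannot leak out of this block, because its type
-- mentions the locally quantified `s`, exactly like ports and their `r`.
counter :: Int
counter = runST $ do
  ref <- newSTRef (0 :: Int)
  writeSTRef ref 42
  readSTRef ref
```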
Because of the last two points, linear-smc ensures that the functions written by the
user given to decode
can always be translated back into arrows, simply because they
must be closed and only use the allowed operations. Incorrect functions are
simply rejected by the type-checker with "readable" error messages.
Even though the library does the translation at runtime, it cannot fail.
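For a concrete feel of what gets rejected, this is just ordinary Linear Haskell at work (a generic example, not linear-smc's own API): a function that consumes its argument twice cannot be given a linear type, so it could never be handed to decode in the first place.

```haskell
{-# LANGUAGE LinearTypes #-}

-- Accepted: the argument is consumed exactly once.
idL :: a %1 -> a
idL x = x

-- Rejected by the type checker if uncommented: x would be consumed twice,
-- so the function is not linear.
-- dupL :: a %1 -> (a, a)
-- dupL x = (x, x)
```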
linear-smc is readily available as a tiny, self-contained library on Hackage. Because it doesn't do any metaprogramming, neither through Template Haskell nor GHC plugins, it is very robust, easy to maintain and safe to depend on.
The only experimental feature being used is Linear Haskell, plus some constraint wizardry.
All in all, this seems like a wonderful foundation to stand on.
The library sadly doesn't have an associated Github page, and it seems like
nobody has heard about this paper and approach. At the time of writing,
this library has only been downloaded 125
times, and I'm responsible for a
large part of it. Please give it some love and look through the paper, you're
missing out.
But now, let's look into how to apply this set of tools and go beyond.
Reaching the destination: compiling to cartesian categories
...
And here is the end destination. We've finally been able to fully overload the Haskell lambda abstraction syntax, yet are still able to track the use of variables in order to keep generators incremental.
Conclusion
If anyone has made it this far, I would like to thank you for reading through this post in its entirety. I quite frankly have no clue whether this will be of use to anyone, but I've been thinking about this for so long, and was so happy to reach a "simple" solution, that I couldn't just keep it to myself.
Now again, I am very thankful for Bernardy and Spiwack's paper and library. It is to my knowledge the cleanest way to do this kind of painless overloading. It truly opened my eyes and allowed me to go a bit further. I hope the techniques presented here can at least make a few people aware that these solutions exist and can be used painlessly.
Now as for achille, my pet project that was the motivation for this entire thing, it has now reached the level of usefulness and friction that I was --- only a few years ago --- merely dreaming of. Being the only user to date, I am certainly biased, and would probably do a bad job convincing anyone that they should use it, considering the amount of available tools.
However, if you've been using Hakyll and are a tad frustrated by some of its limitations --- as I was --- I would be very happy if you could consider taking achille for a spin. It's new, it's small, I don't know if it's as efficient as it could be, but it is definitely made with love (and sweat).
Future work
I now consider the syntax problem to be entirely solved. But there are always more features that I wish for.
- I didn't implement parallelism yet, because it wasn't in the first version of achille and thus not a priority. But as shown in this article, it should also come for free. I first have to learn how Haskell does concurrency, then just go and implement it.

- Make Recipe m a b into a GADT. Right now, the result of the translation from functions on ports to recipes is non-inspectable, because I just get a Haskell function. I think it would be very useful to make Recipe m a b a GADT, where in addition to the current constructor, we have one for each primitive operation (the operations of cartesian categories).

  This should make it possible to produce an SVG image of the diagram behind every generator made with achille, which I find pretty fucking cool.
- In some rare cases, if the source code of the generator has been modified between two runs, it can happen that a build rule receives as input cache the old cache of a different recipe, which nonetheless contains exactly the right kind of information.

  I haven't witnessed this often, but for now the only way to restore proper incrementality is to clean the cache fully and rebuild. A bit drastic if your site is big or you have computationally expensive recipes. Now that the diagram is completely static (compared to the previous version using monads), I think it should be possible to let users give names to specific recipes, so that:

  - If we want to force execution of a specific recipe, ignoring its cache, we can do so by simply giving its name on the CLI.

  - The cache of named recipes is stored separately from the tree-like nesting of caches, so that these recipes become insensitive to refactorings of the generator source code.

  I would even go as far as saying that this would be easy to implement, but those are famous last words.

- Actually, we can go even further. Because the diagram is static, we can compute a hash at every node of the diagram. Yes, a merkle tree. Every core recipe must be given a different hash (hand-picked, by me or implementors of other recipes). Then by convention every recipe appends its own hash to its local cache. This should entirely solve the problem of running recipes that have changed from scratch, and only those. If any sub-recipe of an outer recipe has changed, then the hash won't match, and therefore it has to run again.
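Just to illustrate the idea (purely hypothetical, nothing of the sort exists in achille yet), the hash of a node would combine the recipe's own identifier with the hashes of its sub-recipes, in the usual Merkle fashion:

```haskell
import Data.Hashable (hash)

-- Purely hypothetical sketch: every node of the (static) diagram carries an
-- identifier, and its hash combines that identifier with the hashes of its
-- children. A cached result whose stored hash no longer matches the current
-- one is discarded and recomputed.
data Node = Node { nodeId :: String, children :: [Node] }

merkleHash :: Node -> Int
merkleHash (Node ident cs) = hash (ident, map merkleHash cs)
```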
At what point do we consider things over-engineered? I think I've been past that point for a few years already.
Til next time!