2020-06-13 16:22:47 +00:00
---
title: achille
subtitle: A Haskell library for building static site generators
year: "2020"
labels:
  repo: flupe/achille
  license: MIT
---
**achille** [aʃil] is a tiny Haskell library for building your very own **static site
generator**. It is in spirit a direct successor to [Hakyll][Hakyll].
## Motivation
Static site generators (SSG) have proven to be very useful tools for easily
generating static websites from neatly organised content files. Most of them
support using **markup languages** like markdown for writing content, and offer
**incremental compilation** so that updating a website stays **fast**,
regardless of its size. However, most SSGs are very opinionated about how you
should manage your content. As soon as your specific needs deviate slightly
from what your SSG supports, working with it becomes a lot more tedious.

This leads many people to write their own personal static site generators
from scratch. This results in a completely personalised workflow, but without
good libraries it is a time-consuming endeavor, and incremental compilation is often
out of the equation as it is hard to get right.

This is where **achille** and [Hakyll][Hakyll] come in: they provide a *domain
specific language* embedded in Haskell to easily yet very precisely describe
how to build your site. Compile this description and **you get a full-fledged
static site generator with incremental compilation**, tailored specifically to
your needs.

[Hakyll]: https://jaspervdj.be/hakyll

### Why Hakyll is not enough
To provide incremental compilation, Hakyll relies on a global store, in which
all your *intermediate values* are stored. It is *your* responsibility to
populate it with *snapshots*. There are some severe limitations to this
approach:
- The store is **fundamentally untyped**, so **retrieving snapshots may fail at
runtime** if you're not careful when writing your build rules. You may
argue that's not very critical --- I think it shouldn't be possible in the
first place. We are using a strongly typed language, so we shouldn't have
to rely on flaky coercions at runtime to manipulate intermediate values.
- **Loading snapshots with glob patterns is awkward**. With Hakyll, *the*
way to retrieve intermediate values is by querying the store,
using glob patterns. This indirect way of managing values is very
clumsy. In Haskell, the very purpose of variables is to store intermediate
values, so we should only have to deal with plain old variables.
- **Dependencies are not explicit**. Because it relies on a global store for
handling intermediate values, Hakyll has to make sure that the snapshots you
want to load have been generated already. And because rules have no imposed
order despite implicit inter-dependencies, Hakyll has to evaluate very
carefully each rule, eventually pausing them to compute missing dependencies.
This is very complex and quite frankly impressive, yet I believe we can strive
for a simpler model of evaluation. If we used plain old variables to hold
intermediate values, we simply would not be allowed to refer to an undefined
variable.
There are other somewhat debatable design decisions:
- In Hakyll, every rule will produce an output file, and only one, if you're
restricting yourself to the API they provide. I argue
such a library should not care whether a rule produces any output on the
filesystem. Its role is merely to know *if the rule must be executed*. Because of
this requirement, producing multiple outputs from the same file is a tad
cumbersome.
- Because Hakyll stores many content files directly in the store, the resulting
cache is *huge*. This is unnecessary: the files are already there in the content
directory.
- Hakyll uses a *lot* of abstractions --- `Compiler`, `Item`, `Rule`, `RuleSet`
--- whose purpose is not obvious to a newcomer.
- It defines monads to allow the convenient `do` notation to be used, but
completely disregards the main benefit of monads --- they compose!
### Other tools
As always when thinking I am onto something, I jumped straight into code
and forgot to check whether there were alternatives. By fixating on Hakyll, I did not
realize many people have had the same comments about the shortcomings of Hakyll
and improved upon it. Therefore, it's only after building most of **achille**
in a week that I realized there were many
other similar tools available, namely: [rib][rib], [slick][slick], [Pencil][pencil] &
[Lykah][lykah].

[rib]: https://rib.srid.ca/
[slick]: https://hackage.haskell.org/package/slick
[pencil]: http://elbenshira.com/pencil/
[lykah]: https://hackage.haskell.org/package/Lykah

Fortunately, I still believe **achille** is a significant improvement over these libraries.
- As far as I can tell, **pencil** does not provide incremental generation.
It also relies on a global store, no longer untyped but very
restrictive about what you can store. It implements its own templating language.
- Likewise, no incremental generation in **Lykah**.
Reimplements its own HTML DSL rather than use *lucid*.
Very opinionated, undocumented and unmaintained.
- **rib** and **slick** are the most feature-complete of the lot.
They both provide a minimalist web-focused interface over the very powerful build system
[Shake][Shake].

[Shake]: https://shakebuild.com/

## How achille works
In **achille** there is a single abstraction for reasoning about build rules:
`Recipe m a b`. A **recipe** of type `Recipe m a b` will produce a value of type
`m b` given some input of type `a`.
Conveniently, if `m` is a monad then **`Recipe m a` is a monad** too, so
you can retrieve the output of a recipe to reuse it in another recipe.
*(Because of caching, a recipe is **not** just a Kleisli arrow)*
```haskell
-- the (>>=) operator, restricted to recipes
(>>=) :: Monad m => Recipe m a b -> (b -> Recipe m a c) -> Recipe m a c
```
With only this, **achille** tackles every single one of the limitations highlighted above.
- Intermediate values are plain old Haskell variables.
```haskell
renderPost     :: Recipe IO FilePath Post
buildPostIndex :: [Post] -> Task IO ()

renderPosts :: Task IO ()
renderPosts = do
    posts <- match "posts/*" renderPost
    buildPostIndex posts
```
See how a correct ordering of build rules is enforced by design: you can only
use an intermediate value once the recipe it originates from has been
executed.
Note: a **task** is a recipe that takes no input.
```haskell
type Task m = Recipe m ()
```
- **achille** does not care what happens during the execution of a recipe.
It only cares about the input and return type of the recipe --- that is, the
type of intermediate values.
In particular, **achille** does not expect every recipe to produce a file,
and lets you decide when to actually write to the filesystem.
For example, it is very easy to produce multiple versions of the same source file:
```haskell
renderPage :: Recipe IO FilePath FilePath
renderPage = do
    -- Copy the input file as is to the output directory
    copyFile
    -- Render the input file with pandoc,
    -- then save it to the output dir with extension ".html"
    compilePandoc >>= saveTo (-<.> "html")
```
Once you have defined the recipe for building your site, you forward
this description to **achille** in order to get a command-line interface for
your generator, just as you would using Hakyll:
```haskell
buildSite :: Task IO ()
main :: IO ()
main = achille buildSite
```
Assuming we compiled the file above into an executable called `site`, running
it gives the following output:
```bash
$ site
A static site generator for fun and profit

Usage: site COMMAND

Available options:
  -h,--help                Show this help text

Available commands:
  build                    Build the site once
  deploy                   Server go brrr
  clean                    Delete all artefacts
```
That's it, you now have your very own static site generator!
### Caching
So far we haven't talked about caching and incremental builds.
Rest assured: **achille produces generators with robust incremental
builds** for free. To understand how this is done, we can simply look at the
definition of `Recipe m a b`:
```haskell
-- the cache is simply a lazy bytestring
type Cache = ByteString
newtype Recipe m a b = Recipe (Context a -> m (b, Cache))
```
In other words, when a recipe is run, it is provided a **context** containing
the input value, **a current cache** *local* to the recipe, and some more
information. The IO action is executed, and we update the local cache with the
new cache returned by the recipe. We say *local* because of how composition of
recipes is handled internally. When the *composition* of two recipes (made with
`>>=` or `>>`) is being run, we retrieve two bytestrings from the local cache
and feed them as local caches to the two recipes respectively. Then we gather the two updated
caches, join them and make the result the new cache of the composition.
This way, a recipe is guaranteed to receive the same local cache it returned
during the last run, *untouched by other recipes*. And every recipe is free to
dispose of this local cache however it wants.
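
The splitting and joining of local caches described above can be sketched with a toy model. This is not **achille**'s actual implementation: here the cache is a plain `String` rather than a lazy bytestring, the recipe takes its input and cache directly instead of a full context, and `splitCache`, `joinCache`, `bindRecipe`, `counter` and `both` are names made up for the illustration.

```haskell
import Data.Maybe (fromMaybe)
import Text.Read (readMaybe)

-- Toy cache: a String stands in for the lazy ByteString (assumption).
type Cache = String

-- Simplified recipe: takes the input and its local cache directly,
-- instead of a full Context.
newtype Recipe m a b = Recipe { runRecipe :: a -> Cache -> m (b, Cache) }

-- Split a composite cache into two local caches.
-- An empty or unreadable cache yields two empty local caches.
splitCache :: Cache -> (Cache, Cache)
splitCache = fromMaybe ("", "") . readMaybe

-- Join two local caches back into one composite cache.
joinCache :: Cache -> Cache -> Cache
joinCache l r = show (l, r)

-- Composition: each side only ever sees the cache it returned last time.
bindRecipe :: Monad m => Recipe m a b -> (b -> Recipe m a c) -> Recipe m a c
bindRecipe (Recipe r) f = Recipe $ \x cache -> do
  let (cl, cr) = splitCache cache
  (y, cl') <- r x cl
  (z, cr') <- runRecipe (f y) x cr
  pure (z, joinCache cl' cr')

-- A recipe that counts its own runs, using nothing but its local cache.
counter :: Monad m => Recipe m a Int
counter = Recipe $ \_ cache ->
  let n = fromMaybe 0 (readMaybe cache) + 1
  in pure (n, show n)

-- Two counters composed: neither can disturb the other's count.
both :: Monad m => Recipe m a (Int, Int)
both =
  counter `bindRecipe` \n ->
  counter `bindRecipe` \k ->
  Recipe $ \_ cache -> pure ((n, k), cache)
```

Running `both` once from an empty cache and then again from the cache it returned makes each counter advance independently, which is the "untouched by other recipes" guarantee in miniature.
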
As a friend noted, **achille** is "just a library for composing memoized
computations".

----
#### High-level interface
Because we do not want the user to carry the burden of updating the cache
manually, **achille** comes with many utilities for common operations, managing
the cache for us under the hood. Here is an example highlighting how we keep
fine-grained control over the cache at all times, while never having to
manipulate it directly.

Say you want to run a recipe for every file matching a glob pattern, *but do
not care about the output of the recipe*. A typical example would be to copy
every static asset of your site to the output directory. **achille** provides
the `match_` function for this very purpose:
```haskell
match_ :: Glob.Pattern -> Recipe m FilePath b -> Recipe m a ()
```
We would use it in this way:
```haskell
copyAssets :: Task IO ()
copyAssets = match_ "assets/*" copyFile

main :: IO ()
main = achille copyAssets
```
Under the hood, `match_ p r` will cache every filepath for which the recipe was
run. During the next run, for every filepath matching the pattern, `match_ p r` will
look up the path in its cache. If it is found and hasn't been modified since,
then we do nothing for this path. Otherwise, the recipe is run and the filepath
added to the cache.
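
The decision `match_` makes for each path can be sketched as a pure function. This is only an illustration under assumptions: the `Stamp` type is a hypothetical stand-in for a file modification time, and `SeenCache` and `planRun` are made-up names, not part of **achille**.

```haskell
-- Hypothetical modification stamp; a real generator would use file mtimes.
type Stamp = Int

-- What the previous run left in the cache: each path with the stamp it had.
type SeenCache = [(FilePath, Stamp)]

-- Given the previous cache and the files currently matching the pattern,
-- return the paths for which the recipe must run, and the cache to store.
planRun :: SeenCache -> [(FilePath, Stamp)] -> ([FilePath], SeenCache)
planRun old current = (stale, current)
  where
    stale = [ path | (path, stamp) <- current, isStale path stamp ]
    isStale path stamp = case lookup path old of
      Just seen -> stamp > seen  -- known path: rerun only if modified since
      Nothing   -> True          -- new path: always run
```

With a cache holding `a.css` and `b.css`, a run in which `b.css` was modified and `c.css` was added only schedules those two files.
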
Now assume we do care about the output of the recipe we want to run on every filepath.
For example if we compile every blogpost, we want to retrieve each blogpost's title and
the filepath of the compiled `.html` file. In that case, we can use the
built-in `match` function:
```haskell
match :: Binary b
      => Glob.Pattern -> Recipe m FilePath b -> Recipe m a [b]
```
Notice the difference here: we expect the type of the recipe output `b` to have
an instance of `Binary`, **so that we can encode it in the cache**. Fortunately,
many of the usual Haskell types have an instance available. Then we can do:
```haskell
-- Note: the tuple section (title,) requires the TupleSections extension.
data PostMeta = PostMeta { title :: Text }

renderPost  :: Text -> Text -> Text
renderIndex :: [(Text, FilePath)] -> Text

buildPost :: Recipe IO FilePath (Text, FilePath)
buildPost = do
    (PostMeta title, pandoc) <- compilePandocMeta
    renderPost title pandoc & saveAs (-<.> "html")
        <&> (title,)

buildPosts :: Recipe IO a [(Text, FilePath)]
buildPosts = match "posts/*.md" buildPost
buildIndex :: [(Text, FilePath)] -> Recipe
```
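
For a custom output type, the `Binary` instance required by `match` can usually be derived generically. A minimal sketch using only the `binary` package and `GHC.Generics`; the `PostInfo` type and `roundTrip` helper are hypothetical and exist only for this example.

```haskell
{-# LANGUAGE DeriveGeneric #-}

import Data.Binary (Binary, decode, encode)
import GHC.Generics (Generic)

-- Hypothetical recipe output: a post title and the path of its HTML file.
data PostInfo = PostInfo
  { postTitle :: String
  , postPath  :: FilePath
  } deriving (Eq, Show, Generic)

-- The Generic-based default methods are enough; no hand-written code needed.
instance Binary PostInfo

-- What a cache conceptually does with the value: encode now, decode later.
roundTrip :: PostInfo -> PostInfo
roundTrip = decode . encode
```

Any value that survives `roundTrip` unchanged can be stored in a bytestring cache and read back on the next run.
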
#### Shortcomings
The assertion *"A recipe will always receive the same cache between two runs"*
can only be violated in the two following situations:
- There is **conditional branching in your recipes**, and more specifically,
**branching for which the branch taken can differ between runs**.
For example, it is **not** problematic to do branching on the extension of a file,
as the same path will be taken each execution.
But assuming you want to parametrize by some boolean value for whatever reason,
whose value you may change between runs, then because the two branches will
share the same cache, every time the boolean changes, the recipe will start
from an inconsistent cache so it will recompute from scratch, and overwrite
the existing cache.
```haskell
buildSection :: Bool -> Task IO ()
buildSection isProductionBuild =
    if isProductionBuild then
        someRecipe
    else
        someOtherRecipe
```
Although I expect few people ever do this kind of conditional branching for
generating a static site, **achille** still comes with combinators for branching.
You can use `if` in order to keep two separate caches for the two branches:
```haskell
if :: Bool -> Recipe m a b -> Recipe m a b -> Recipe m a b
```
The previous example becomes:
```haskell
buildSection :: Bool -> Task IO ()
buildSection isProductionBuild =
    Achille.if isProductionBuild
        someRecipe
        someOtherRecipe
```
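
To see why a dedicated combinator helps, here is a toy sketch of a conditional that keeps one cache per branch. It is an assumption about how such a combinator could work, not **achille**'s code: the cache is a `String`, the recipe type is simplified, and `branch` and `counter` are hypothetical names.

```haskell
import Data.Maybe (fromMaybe)
import Text.Read (readMaybe)

-- Toy cache and simplified recipe type (assumptions for this sketch).
type Cache = String
newtype Recipe m a b = Recipe { runRecipe :: a -> Cache -> m (b, Cache) }

-- The cache is a pair; only the half belonging to the branch taken is
-- read or updated, so switching branches never clobbers the other half.
branch :: Monad m => Bool -> Recipe m a b -> Recipe m a b -> Recipe m a b
branch cond onTrue onFalse = Recipe $ \x cache -> do
  let (ct, cf) = fromMaybe ("", "") (readMaybe cache) :: (Cache, Cache)
  if cond
    then do
      (y, ct') <- runRecipe onTrue x ct
      pure (y, show (ct', cf))
    else do
      (y, cf') <- runRecipe onFalse x cf
      pure (y, show (ct, cf'))

-- A recipe that counts its own runs through its local cache.
counter :: Monad m => Recipe m a Int
counter = Recipe $ \_ cache ->
  let n = fromMaybe 0 (readMaybe cache) + 1
  in pure (n, show n)
```

Running `branch True counter counter`, then the `False` variant, then `True` again yields 1, 1 and 2: the first branch finds its cache intact after the detour through the second.
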
### No runtime failures
All the built-in cached recipes **achille** provides are implemented carefully
so that **they never fail in case of cache corruption**. That is, in the
eventuality of failing to retrieve the desired values from the cache, our
recipes will automatically recompute the result from the input, ignoring the
cache entirely. To make sure this is indeed what happens, every cached recipe
in **achille** has been tested carefully (not yet really, but it is on the todo
list).
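
The fallback behaviour can be illustrated with a small pure sketch. It assumes a `String` cache decoded with `readMaybe` instead of the `Binary` decoding **achille** relies on, and `cached` is a hypothetical helper, not a library function.

```haskell
import Text.Read (readMaybe)

-- Toy cache: a String standing in for the encoded bytestring (assumption).
type Cache = String

-- Reuse the cached result when the cache decodes and was produced from the
-- same input; in every other case (corrupt cache, changed input) recompute.
cached :: (Eq a, Read a, Show a, Read b, Show b)
       => (a -> b) -> a -> Cache -> (b, Cache)
cached compute input cache =
  case readMaybe cache of
    Just (seen, value) | seen == input -> (value, cache)
    _ -> let value = compute input
         in (value, show (input, value))
```

A garbage cache is simply ignored: `cached length "hello" "not a valid cache"` recomputes the length and returns a fresh, well-formed cache alongside it.
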
This means the only possible failures are those related to poorly formatted
content on the user's part: missing frontmatter fields, watching files
that do not exist, etc. All of those errors are gracefully reported to the
user.
### Parallelism
**achille** could very easily support parallelism for free; I just haven't taken
the time to make it a reality.
## Making a blog from scratch
Let's see how to use **achille** for making a static site generator for a blog.
First we decide what will be the structure of our source directory.
We choose the following:
```bash
content
└── posts
   ├── 2020-04-13-hello-world.md
   ├── 2020-04-14-another-article.md
   └── 2020-05-21-some-more.md
```
We define the kind of metadata we want to allow in the frontmatter header
of our markdown files:
```haskell
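{-# LANGUAGE DeriveGeneric #-}

import GHC.Generics (Generic)
import Data.Aeson (FromJSON)
import Data.Text (Text)

-- Metadata expected in the frontmatter of every post
data Meta = Meta
    { title :: Text
    } deriving (Generic)

-- Generic-derived instance: decoding fails if `title` is missing
instance FromJSON Meta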
```
This way we enforce correct metadata when retrieving the content of our files.
Every markdown file will have to begin with the following header for our
generator to proceed:
```markdown
---
title: Something about efficiency
---
```
Then we create a generic template for displaying a page, thanks to lucid:
```haskell
{-# LANGUAGE OverloadedStrings #-}
{-# LANGUAGE BlockArguments    #-}

import Data.Text (Text, pack)
import Lucid

renderPost :: Text -> Text -> Html ()
renderPost title content = wrapContent do
    h1_ $ toHtml title
    toHtmlRaw content

renderIndex :: [(Text, FilePath)] -> Html ()
renderIndex = wrapContent .
    ul_ . mconcat . map \(title, path) ->
        li_ $ a_ [href_ (pack path)] $ toHtml title  -- href_ expects Text

wrapContent :: Html () -> Html ()
wrapContent content = doctypehtml_ do
    head_ do
        meta_ [charset_ "utf-8"]
        title_ "my very first blog"
    body_ do
        header_ $ h1_ "BLOG"
        content
```
We define a recipe for rendering every post:
```haskell
buildPosts :: Task IO [(Text, FilePath)]
buildPosts =
    match "posts/*.md" do
        (Meta title, text) <- compilePandocMetadata
        saveFileAs (-<.> "html") (renderPost title text)
            <&> (title,)
```
We can define a simple recipe for rendering the index, given a list of posts:
```haskell
buildIndex :: [(Text, FilePath)] -> Task IO FilePath
buildIndex posts =
    save (renderIndex posts) "index.html"
```
Then, it's only a matter of composing the recipes and giving them to **achille**:
```haskell
main :: IO ()
main = achille do
    posts <- buildPosts
    buildIndex posts
```
And that's it, you now have a very minimalist incremental blog generator!
## Recursive recipes
It is very easy to define recursive recipes in **achille**. This allows us to
traverse and build tree-like structures, such as wikis.
For example, given the following structure:
```bash
content
├── index.md
├── folder1
│   └── index.md
└── folder2
    ├── index.md
    ├── folder21
    │   └── index.md
    ├── folder22
    │   └── index.md
    └── folder23
        ├── index.md
        ├── folder231
        │   └── index.md
        ├── folder222
        │   └── index.md
        └── folder233
            └── index.md
```
We can generate a site with the same structure and in which each index page has
links to its children:
```haskell
renderIndex :: PageMeta -> [(PageMeta, FilePath)] -> Text -> Html ()

buildIndex :: Recipe IO a (PageMeta, FilePath)
buildIndex = do
    children <- walkDir
    matchFile "index.*" do
        (meta, text) <- compilePandoc
        -- renderIndex is pure, so its result is piped into save with (&)
        renderIndex meta children text & save (-<.> "html")
        (meta,) <$> getInput

walkDir :: Recipe IO a [(PageMeta, FilePath)]
walkDir = matchDir "*/" buildIndex

main :: IO ()
main = achille buildIndex
```
## Forcing the regeneration of output
Currently, **achille** doesn't track what files a recipe produces in the output
dir. This means you cannot ask for things like *"Please rebuild
output/index.html"*.
That's because we make the assumption that the output dir is untouched between
builds. The only reason I can think of for wanting to rebuild a specific page
is if the template used to generate it has changed.
But in that case, the template is *just another input*.
So you can treat it as such by putting it in your content directory and doing
the following:
```haskell
import Templates.Index (renderIndex)
buildIndex :: Task IO ()
buildIndex =
    watchFile "Templates/Index.hs" $ match_ "index.*" do
        compilePandoc <&> renderIndex >>= write "index.html"
```
This way, **achille** will automatically rebuild your index if the template has
changed!
While writing these lines, I realized it would be very easy for **achille**
to know which recipe produced which output file,
so I might just add that. Still, it would require you to ask for an output
file to be rebuilt if a template has changed. With the above pattern, it is
handled automatically!