Yup, I accidentally rewrote the entire thing. It’s time to talk about why I’d do such a thing, what changed, and whether it was worth it.
◇ But… why?
The initial version of this site was written in Rust. Hacking on said version of the blog was tedious — it wasn’t difficult per se, but working with the AST-less pulldown parsers (see the original post), taking care of lifetimes — the tiny things added up. As a result, I kept putting off tasks like adding an RSS feed, as the experience of working on the project wasn’t particularly fun². Nowadays, I mostly write Odin in my free time (gamedev stuff!), hence I was really feeling like scratching the functional itch. Yup, you know where this is going, IT’S HASKELL TIME, BABYY!!!
I really cooked with this one
So, surely, the next step I took was porting the existing code to Haskell, right? Like, there would be no reason for me to instead create a custom programming language and then write the generator in that languag- OH GOD NOO—
◇ The custom language
The idea was simple — build a language that facilitates both writing markup and describing how to render said markup to various formats. Think LaTeX or Typst, except the logic for rendering to, say, HTML would be written in the same language the markup was written in.
Of course, markup and computation are two very different things, requiring different kinds of syntactic constructs. Supporting both requires the language to offer two different “modes” that can easily call into each other. Call those modes DSLs (domain-specific languages), and what a “language” even is in the first place becomes a bit blurry.
The above is a bit abstract, so it’s time to look at some examples. We’ll begin by taking a look at existing solutions, namely Typst:
= Hello, this is a heading!
We make the ground shake, *boom*.
The markup above should look pretty readable to anyone, even those unfamiliar with Typst in particular. Interestingly enough, the *text* syntax is nothing but syntactic sugar for calling the strong function with the delimited markup as an argument. Accessing the computation layer from within the markup layer is done with the # symbol, with [...] blocks switching in the other direction. Long story short, the syntax above desugars to
#heading(level: 1)[Hello, this is a heading!]
We make the ground shake, #strong[boom].
Typst’s design is super cool. For one, they allow tweaking the way things get displayed using set and show rules (learn more here). However, the very base constructs (think heading and strong) are baked into the compiler rather than defined in Typst itself, so one cannot just implement a new rendering backend in Typst.
To recap, our requirements are the following:
1. The language encompasses DSLs for writing markup and describing computation.
2. The two DSLs naturally intertwine with each other.
3. Syntax constructs used for writing markup desugar to mere computations.
4. The aforementioned desugaring should not pick out a specific rendering backend.
In particular, point (4) disqualifies us from simply desugaring everything to fixed function calls (well… depending on what one means by fixed — more on that later).
Letting the computation layer work directly on the underlying markup-language AST satisfies (3) and (4) pretty nicely. Indeed, the literate lisp legends in the metaphorical audience might throw around fancy words³ like homoiconicity, and all the benefits that brings. Unfortunately, manipulating the AST of a language with syntax a smidge more complicated than lisp’s is not a particularly pleasant experience (look at Rust’s proc-macros as a more extreme example).
A hardcore Haskell hacker might heroically bring henceforth⁴ the idea of recursion schemes. I’ve once worked on a (personal) project that made use of such sorcery, yet I don’t particularly recommend it for anything but satiating one’s thirst for knowledge. Still, I think cementing this as a core language feature could’ve led to a pretty elegant solution, although I sadly hadn’t thought too much about it at the time.
Beware of the catamorphism.
◇ Tagless Final
Code doesn’t have to be data though! Indeed, the aforementioned holy haskell hackers in the metaphorical audience might be familiar with a technique known as tagless final. From this perspective, markup can desugar to function calls that generate some abstract type offering the required properties⁵:
class (IsString a, Monoid a) => IsContent a where
  heading :: Int -> a -> a
  strong :: a -> a
  -- ...

example :: ∀a. IsContent a => a
example =
  heading 1 "Hello, this is a heading!"
    <> "We make the ground shake, "
    <> strong "boom"
Earlier, we discarded the idea of desugaring syntax to fixed function calls as problematic — we want to be able to render the same markup to multiple formats, after all. Tagless final sidesteps the issue: the calls are overloaded, so every rendering backend provides its own implementations. We can even box up this abstract type into a generic Content type, if we so desire:
data Content = Content (∀a. IsContent a => a)
-- Implementations left aside for conciseness
instance IsString Content
instance Semigroup Content
instance Monoid Content
instance IsContent Content
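To make the multi-backend story concrete, here’s a minimal sketch of a single backend (the Html type and its naive, escaping-free rendering are hypothetical stand-ins, not the generator’s actual code):

newtype Html = Html { renderHtml :: String }

instance IsString Html where
  fromString = Html
instance Semigroup Html where
  Html a <> Html b = Html (a <> b)
instance Monoid Html where
  mempty = Html ""

-- Wrap rendered content in an HTML tag.
wrap :: String -> Html -> Html
wrap tag (Html body) = Html ("<" <> tag <> ">" <> body <> "</" <> tag <> ">")

instance IsContent Html where
  heading level = wrap ("h" <> show level)
  strong = wrap "strong"

-- The same polymorphic example value renders to whichever backend we pick:
page :: String
page = renderHtml example

A Djot, RSS, or plain-text backend is then just another instance — exactly what requirement (4) asked for.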
◇ Language-design takeaways
My approach to designing the language came down to re-implementing the same building blocks over and over again in slightly different languages, until I came up with something I really liked. Here are a few of the more interesting (and perhaps non-standard) things I landed on and would love to explore more in future languages.
◇ Records as co-inductive types
Inspired by Agda’s records, I decided to implement structs/records as co-inductive types. In particular, record construction is dual to pattern matching, and accessing properties can be done through syntax that appears as function application⁶. As a simple example, consider the following:
struct Img where
  alt: Text
  url: Text

img = intro Img where
  .alt -> "A scan of the Yu-Gi-Oh! card “Simultaneous Equation Cannons”"
  .url -> "https://ms.yugipedia.com//3/3d/SimultaneousEquationCannons-MP25-EN-PScR-1E.png"
link = img .url
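For the Haskell-inclined, here’s roughly how I think about that duality (a sketch of mine with made-up names, using String for brevity; not how the language is actually implemented):

{-# LANGUAGE GADTs, RankNTypes #-}

-- Each field is an "observation" we can make of an Img,
-- indexed by the type that observation returns:
data ImgField a where
  Alt :: ImgField String
  Url :: ImgField String

-- A record is anything that answers every observation...
type Img = forall a. ImgField a -> a

-- ...so constructing one is literally a pattern match, just on the
-- observation rather than on a constructor:
img :: Img
img Alt = "A scan of a Yu-Gi-Oh! card"
img Url = "https://example.com/card.png"

-- And property access is plain function application, as in img .url:
link :: String
link = img Url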
◇ Block flattening
Constructs that introduce syntactic blocks are an element of most non-trivial languages — think the block containing the fields of a struct, the body of a function, branches of an if-statement, or even the contents of an entire module. Note that this concept is more general than lexical scopes! Formatting-wise, blocks often add additional layers of indentation.
A pretty common occurrence is wanting a block to contain “everything left until the end of the outer block we’re already in”. Flattening blocks is not an uncommon idea — think async/await in JavaScript, the more general do-notation in Haskell, etc. In my language, this is the role taken by the ! operator:
module Example (main)

main : Document
main = make !
  Document.created_at ⟨2025-03-03T22:34:06+01:00⟩
  Document.hidden true
  Document.content text!
    Hello there! This is a generic document :3
A few things to note:
- The angled brackets are used to delimit text (I’ve been called unhinged for liking Unicode syntax, but I’ve done this before (in DSLs I still use to this day!), and it’s perfectly fine for languages no one else but me is going to use) (yes, the previous quote-using code snippet was a lie).
- The ! operator works in any context, including odd ones like

  struct Img where !
  alt : Text
  url : Text

- “But how can the ! work in any context? Does that not include markup contexts? I mean, hell, the very example above uses ! on the last line!”, I hear you ask. Indeed, exclamation marks are pretty common in written text, hence why I ended up renaming ! to ⤵, but kept ! in the example so as not to make things even more confusing (more Unicode goodness, yay).
- Are the calls to Document.created_at mutating stuff? Not really! Expression blocks in this language work by implicitly composing each “statement”, where a statement is just an endomorphism in some category of types (yeah umm, I went a bit overboard with this, but you can essentially emulate monads by composing in a Kleisli category — see the sketch after this list — definitely overcomplicating the language here…).
- make takes a function and computes its fixpoint (i.e. calls it with its own result as an argument). I still am not sure if this is a good idea as a central language construct, but it fits nicely with the implicit function composition pattern.
- The text keyword creates a block containing text (equivalent to angled brackets, but without a closer).
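If the endomorphism talk sounds abstract, here’s a tiny Haskell rendition of the idea (my own approximation, with invented names; not the language’s actual semantics):

import Control.Monad ((>=>))
import Data.Function (fix)

-- A block composes its statements, each an endomorphism a -> a:
block :: [a -> a] -> (a -> a)
block = foldr (.) id

-- Swap plain composition for Kleisli composition and the same pattern
-- sequences monadic statements, i.e. the monad emulation mentioned above:
kleisliBlock :: Monad m => [a -> m a] -> (a -> m a)
kleisliBlock = foldr (>=>) pure

-- make computes the fixpoint of the composed block, feeding it its
-- own result (Haskell's fix f = let x = f x in x):
make :: (a -> a) -> a
make = fix

-- The Document example then reads roughly as
-- main = make (block [created_at t, hidden True, content c])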
◇ Implementing the language
After a few weeks of playing around with language constructs in my free time, I decided it was time to actually implement the language. I’ve been implementing languages for fun for half a decade, and while I’m no expert, the plan at this point didn’t seem that outrageous.
I picked up my Haskell setup, fought Nix compiling GHC from source for a few minutes, and went to work.
◇ The parser
Parsers are usually the easiest part of implementing a toy language, and this time was supposed to be no different, except… I had decided this language would have a language server. I had never implemented such a server before, but I knew having a non-terrible one required the parser to be error tolerant. Many of the shortcuts I found online were incompatible with indentation-aware parsing — I can’t just match brackets and hope that does a “good enough” job, aughrshdfhs. With a bit of tinkering, I figured out an approach that worked well for parsing indentation-aware syntax in a very error-tolerant manner, while keeping all the trivia⁷ around and forming a CST (concrete syntax tree). This being my first time dealing with error tolerance, the process was slow and took many days’ worth of free time. I never got around to implementing the block flattening described above, although I plan to revisit that in a future language.
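To give “trivia” a concrete shape, a trivia-carrying token looks something like this (an illustrative sketch; my actual node types differ):

import Data.Text (Text)

-- Trivia: source content that wouldn't survive into an AST.
data Trivia
  = Comment Text     -- comments and other metadata
  | Junk Text        -- unexpected tokens we recovered from
  | Whitespace Text

-- Each CST token remembers its surrounding trivia, so the original
-- source can be reproduced byte-for-byte from the tree.
data Token = Token
  { leading  :: [Trivia]
  , lexeme   :: Text
  , trailing :: [Trivia]
  }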
I won’t explain the way my error-tolerant parsing works, not now at least — I’ve been refining the approach for another language I’ve worked on since, and plan to eventually write an entire post focused on the topic. At first the topic sounded insurmountable — this is one of those things I’d always thought of as “very hard”, even after many years of implementing languages. Even having done it once, I felt the entire process was quite sluggish (i.e. much slower than writing a “normal” parser), but having gone through the motions again since, I feel like I’ve been getting better at it. I’m sure with enough practice this can become just as comfortable as writing error-intolerant parsers — exciting!
Before moving to the next section, I’ll leave you with this beautiful qualified-applicative-do notation block (please don’t do this at home, I ended up mostly scrapping this custom syntax, as I preferred manually specifying the error messages):
Core.tighten Core.do
  name ← Core.step $ Core.label "module name" Core.name
  exports' ←
    Core.optStep
      $ Core.label "export list"
      $ Core.delimitedList
          (Core.string "(")
          (Core.string ")")
          (Core.string ",")
      $ Core.label "export name" Core.name
  where' ← Core.pre $ Core.step $ Core.string "where"
  Core.pure (name, exports', where')
The block requires applicative-do notation (wish Haskell would use a separate keyword for it like PureScript does) since earlier steps need information about the “future steps” in order to know when to give up consuming junk— … Ok, I promised I wouldn’t go into details, so I’ll end it here. Overall, the approach required me to sidestep Megaparsec’s entire error/failure mechanism, but having to tack additional complexity on top of Megaparsec is not that unusual (anyone who’s implemented non-trivial indentation-aware syntax should know).
◇ The language server
I did manage to implement a few basic language server features! In particular, I got semantic highlighting, diagnostics, goto-definition, and on-hover information working. The experience was fairly educational, although not something I’m looking forward to repeating. For instance, the types involved in the protocol are quite anti-haskelly: untagged unions, nulls, etc. The lsp package does take care of all the busy work, but the resulting code is not particularly pretty to work with.
The aforementioned package uses a fair amount of type families (type-level functions, for those not into Haskell), which makes navigating the documentation an awful experience. For one, the online doc pages have no way of knowing which type family argument I’m attempting to follow around. To make things worse, the language server running in my editor doesn’t let me jump to the definition of library functions either (unlike, say, Rust Analyzer), meaning my only choice was to manually search through the online documentation while mentally evaluating the type families involved — not fun!
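For those not into Haskell, here’s a toy type family showing the kind of thing one ends up evaluating mentally while reading the docs (a toy of mine; lsp’s actual families are considerably more involved):

{-# LANGUAGE DataKinds, TypeFamilies #-}

import GHC.TypeLits (Symbol)

-- Stand-in parameter types (not lsp's real ones):
data HoverParams      = HoverParams
data DefinitionParams = DefinitionParams

-- A type family is a function on types: given a method name, compute
-- the corresponding parameter type. Documentation generic over
-- method forces you to run this in your head.
type family ParamsOf (method :: Symbol) where
  ParamsOf "textDocument/hover"      = HoverParams
  ParamsOf "textDocument/definition" = DefinitionParams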
The lsp package is not all bad though! A feature I appreciated is its virtual file system, taking care of tracking document changes and whatnot for me — very handy!
I usually write basic Vim syntax highlighting definitions for my languages, although this is not always easy for indentation-aware languages, considering the system is built over nested regexes. Since this language was conceived to be quite contextual, I decided to try relying on semantic highlighting only, hoping to simplify my work. The result was pretty good looking, but the delay was noticeable (it wasn’t even the delay per se, more so neovim’s input debouncing behaviour). I’ve since gone back to writing Vim syntax definitions for my languages, as much as I might hate writing regexes.
◇ The type-checker
I started off by thinking this particular language could get away with some basic Hindley-Milner type inference and a somewhat simple ML-style type system. Of course, the tagless-final ordeal meant I would have to implement a typeclass/trait system (I was thinking of doing it via auto-implicit arguments à la Idris), the implicit composition in arbitrary categories requires higher-kinded types, the computation-boxing trick described earlier requires rank-n polymorphism and impredicative types, and while we’re at it, why not throw full-blown dependent types into the mix?
Yeah… you can probably see where this is going. I’ve implemented basic dependently-typed systems before (worked my way through the excellent elaboration zoo repository and read a handful of not super-complicated papers on the topic), but those have all been toy examples compared to what this language was supposed to be.
For one, I would have to handle out-of-order definitions and mutual recursion. Remember — dependent elaboration is “leaky”, in the sense that you can’t type-check a definition in isolation by assuming everything else in scope agrees with its declared signature; elaboration might require evaluating arbitrary computations, so mutual recursion becomes way tougher to type-check! Even after ignoring the elephant in the room (well-founded recursion and whatnot) and deciding not to care about termination checking, care must still be taken not to end up with a type-checker that keeps recursing forever. Oh, and if that wasn’t enough, some sick genius decided it would be a good idea to base a central language construct around fixpoints (after all, aren’t fixpoints super useful for things like tables of contents and backlinks?) — spoiler: that’s a terrible idea lskdjslkdjf (read: skill issue).
Yeah, I’m not saying those problems are unsolvable, but after working on this for some weeks I stopped and introspected for a few minutes — what the fuck am I doing? Why am I reading a paper on *checks notes* higher order pattern unification for a *checks notes again* markup language? AAAAAAAAAAAAAA-, this blog rewrite will not be finished by the end of the century.
After a bit more thought, I decided I’d keep using Djot for content and rewrite the static site generator in Haskell. Tooling aside (big aside, I know), this would alleviate a lot of my annoyances, and still allow me to implement my SSG-specific DSLs just fine. A few days’ worth of free time later, and the basics were done!
Not only have I since re-implemented everything my old site had to offer, but I’ve also added some new features. (*screen fades to black, cue the screen transition*…)
◇ The improvements
First and foremost, I can finally separate the content and the static site generator into different repositories! (I’ve wanted to do this for a while, as including easter eggs/secret pages was a bit pointless otherwise.) Why couldn’t I do this before? Well, you see, the top of every post contains “Source” and “Changelog” links, which used to point to my git repository. Instead, I’ve since made the static site generator (which I named nihil, by the way, after a certain rhythm game song pack :p) generate additional changelog/source pages for each post (check out the links above!).
The changelog pages are not perfect — they only contain the git commit messages that touched each page, and no diffs (oops!). The system (again) has to deal with the fact that .git is not accessible inside the Nix sandbox, so the changelog history is tacked onto the TOML state file tracked in the content repository — not that big of a deal.
One more caveat: the current implementation likely doesn’t work as expected for moved/deleted files, but I haven’t had to deal with either so far, so I’ll implement that once the time comes.
Finally, the blog has a dark theme (the page should automatically match your system’s theme!) and an RSS feed now! I still haven’t taken the time to implement Gemini support, nor a place to collect pretty pixel-art buttons. I’ve also added a webring section to the homepage — I’m not part of too many at the moment, but that’s something I can always improve on in the future.
◇ A note on code highlighting & math
There’s a handful of Haskell libraries for syntax highlighting and LaTeX-to-MathML conversion, yet I wasn’t happy with the way any of them looked. In the end, I decided to continue using treesitter for highlighting, yet the Haskell treesitter ecosystem looked way poorer than Rust’s! I could always manually write bindings for the grammars I needed, but that looked quite painful. To make things worse, I wanted my math rendering to look like it did in the previous version, so I had no choice — I had to call the respective Rust libraries from Haskell.
The correct way to achieve this is through the C FFI, yet I was feeling lazy, so I wrote two scripts (one for each task) and called them as CLIs from Haskell, passing data over stdin/stdout. Since those are meant to look like pure functions from the exterior, I then committed the deadly sin of calling unsafePerformIO, telling Haskell not to track the respective effects (awful, I know). This has worked surprisingly well (although it’s not something you should do).
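In case you want to replicate the sin, the shape of the trick is roughly this (the helper’s name and CLI are hypothetical):

import System.IO.Unsafe (unsafePerformIO)
import System.Process (readProcess)

-- Pretend-pure syntax highlighting, backed by an external Rust CLI
-- that reads source code from stdin and writes HTML to stdout:
highlight :: String -> String -> String
highlight language source =
  unsafePerformIO (readProcess "treesitter-highlight" [language] source)
{-# NOINLINE highlight #-}  -- the customary seatbelt for unsafePerformIO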
The highlights used are a combination of the highlights shipped with the various parsers, together with the ones shipped by neovim (see this repo). The code for the aforementioned scripts is a big mess, although a self-contained one, hence not something I worry too much about.
◇ A note on fonts
The previous version of the blog used to fetch its fonts from external URLs (think Google fonts and the like). This time, I automatically bundle the fonts with the website itself. Another thing worth mentioning is that the aforementioned previous version wouldn’t even set the main content / monospace fonts to anything specific — I expected the reader to take care of choosing pretty default fonts for their computer. To no one’s surprise, most people have never touched their default fonts. Oh well, I eventually gave in and set the fonts to the ones I use for myself (might change them based on feedback, not sure yet).
I have noticed a slight performance hit with the website (I can see the layout loading a fraction of a second before the text itself), although perhaps this could be mitigated by specifying asset hashes in the HTML, letting the browser cache more or something? (I don’t know what I’m talking about, I just remember having read about this before)
◇ Conclusion
What can I say, the experience has taught me a lesson in over-engineering things. I mean, it’s not like I didn’t know this before the adventure, but the fun of the topic can often lead me to places I perhaps shouldn’t be going. In the end, it’s all fun and games, but I have other projects I’d rather be having fun with than over-engineering a blog with a single public article¹, that is slkdjflskdjf (hey, that’s two now!).
On the bright side, this experience has taught me a fair bit about language servers and error-tolerant parsing, so who am I to complain?
Many thanks to Frog in a Box for proof-reading this article.
1. The site does contain a few hidden pages I sometimes link to people, hence the total number of posts does add up to like… five — the rewrite-to-post ratio on here is still very funny though. ↩︎
2. I know some will scream “skill issue”, but again, it wasn’t difficult — just tedious. I have used Rust for larger projects, yet my heart was yearning to touch Haskell once again. ↩︎
3. Nothing wrong with that, of course — I love fancy words :3 ↩︎
4. I tried coming up with an alliteration, but my 5AM brain is not working particularly well, I’m sorry. ↩︎
5. …as long as the OverloadedStrings extension is active. ↩︎
6. One might rightfully point out that this doesn’t really keep the duality alive. In particular, why does constructing inductive types look like <constructor> <value> (for example, Just 5 in Haskell or Some(5) in Rust), but accessing values on records looks like <value> <.field> (i.e. foo .bar) instead? Isn’t the order of the caller / callee being swapped around? To be honest, I’m not super sure. While on one hand the “caller” / “callee” distinction is fake (one doesn’t even need to look at category theory for this — introductory linear algebra classes covering dual vector spaces should make this apparent), this gets hairy the moment quantifiers get introduced into the mix, so yeah, I’m not sure. Perhaps a language where property access also works backwards (i.e. .bar foo) would make more sense and keep things more consistent by staying true to the right-to-left flow (in the same way that Rust’s postfix .await keeps things in line with the other left-to-right constructs. Not even Rust keeps this consistent with & and *, which are right-to-left, but I digress). ↩︎
7. By “trivia” I’m referring to comments (and other metadata), junk input (unexpected tokens and whatnot), and anything else contained in the text source that one would usually not keep around for the AST (abstract syntax tree). ↩︎