Functional Parsing With Clojure

Recently I built a small webapp to allow me to view my org files online. I know that I can export org files to HTML and scp them to my server, but this was for a TODO list among other things and I wanted to both password protect it and possibly also make so that I could perform simple edits to the file from the web. As a result, I decided to write a webapp which parsed and converted the org file to HTML on the fly with each request. What resulted was a very simple web application, written in Clojure, which can display a directory of org files online. If you want to use it, you can get the code here.

Anyway, while writing this, I struggled a little with Clojure’s mostly functional nature to write an efficient and clean parser/translator for the org files. The issue is that, for me, the most natural way to parse something more complicated than a simple sequence of tokens, is to use a state engine to keep track of the context while parsing in each new character, line, or token. For example, when parsing something like the this:

(token1 (token2 token3) token4)

One could use a variable “depth” to keep track of how deeply nested the parentheses are at any given moment.

While Clojure does allow for state, I felt that using a ref to represent the state of the parser at any given moment, was A) a little heavyweight, and B) violated the whole point of using a ref. Clojure’s STM semantics are meant to enable shared state in a concurrent program and this sort of parser is inherently single threaded. After a few attempts to achieve a purely functional approach using line-seq and too much cleverness I came up with a messy hybrid solution that I wasn’t happy with.

Coming back to the project a week later, I had a face-palm moment where I realized “Duh, this can be done with mutual-recursion, an accumulator variable, and parameters to represent the parser state.” This is common sense to anyone used to functional programming, but it somehow slipped my mind. This is a pretty standard technique for translating a stateful algorithm to a stateless algorithm and I think my org-online webapp is a pretty clear example so I thought I’d share.

In a nutshell, code which looks like this:

(defvar *state* nil)

(loop 
  (setf *state* (do-something *state*)))

can be translated to code which looks like this:

(defn function [state]
  (let [new-state (do-something state)]
    (function new-state)))

In plain English, we’re taking the relevant part of the program state and instead of modifying it in place over the course of the program’s execution, we’re passing it in the parameters of a function call and recursing with different values so we never actually have to modify any state. It’s just like an accumulator variable but we don’t intend to ultimately return it at any point in time.

So, org-online uses just this technique to represent the parser state. Before I continue, it will be helpful to describe in more detail just what this application does. Org-mode is a flexible plain text outliner, note-taker, list-making application/mode for Emacs and org files are the files it produces. They’re simple plain text files with a simple markup. Lines beginning with asterisks represent section headers and multiple asterisks represent nested headers.¹

The entrance to the code is a function called “convert-org-file-to-html”:

(defn convert-org-file-to-html
  "Translates an org file to an HTML file returned as a string"
  [file]
  (with-open [output (StringWriter.)]
    (with-open [input (reader file)]
      (binding [*out* output]
        (let [lines (line-seq input)
              cfg (process-cfg-lines lines)]
          (trampoline next-line lines {:cfg cfg :depth 0})))
      (.toString output))))

The part to look at is the call to trampoline, which is how Clojure implements mutual recursion.² This enters the parser with a call to next-line, which takes two arguments, “lines” which is a lazy sequence of lines from the file, and a map called “state” which represents the parser state. Next-line is one of several functions which are mutually recursive and take the same arguments. Each one examines the current line and the parser state, prints out some markup,³ and returns a function to handle the rest of the file:

(defn next-line
  "Handles a line in an org file, dispatches if it is a header, list, other specially case."
  [[line & rest] {:keys [properties? depth] :as state}]
  (handle-end-lines [line state]
    (condp re-seq line 
      #":PROPERTIES:" (partial properties-line rest state)
      #"^\*+ " (partial parse-header (cons line rest) state)
      #"^(\s+[+*-] |[+-] )" (partial ulist-line (cons line rest) state)
      (do
        (println (html [:p (create-links line)]))
        (partial next-line rest state)))))

“Handle-end-lines” ends the parsing if “line” is null, (ie. the EOF) while “condp re-seq line” performs a regular expression check on the current line being parsed. Three cases are handled, one for a “PROPERTIES” tag, one for a header, one for the start of an unordered list. There is also a final catch-all for plane lines of text. The first three simple dispatch to another handler, similar to “next-line”, whereas the last one prints out the HTML of the current line before moving on to the “next-line.”

(defn header-string
"Format a header line into an HTML string"
[stars line id cfg]
  (html [(keyword (str "h" stars))
         {:id (str "head-" id) :class "folder"}
         (markup-todos (.substring line stars) cfg)]))

(defn parse-header
  "Handles a header line in an org file."
  [[line & rest] {:keys [depth ids cfg] :as state}]
  (let [stars (count-stars line)
        id (gensym)
        closings (inc (- depth stars))]
    (doseq [index (range closings)]
      (println "</div></div>"))
    (println "<div>")
    (println (header-string stars line id cfg))
    (println (str "<div class=\"foldable\" id=\"body-" id "\">"))
    (partial next-line rest
             (assoc state
               :depth stars
               :ids (cons id (drop closings ids))))))

“Parse-header” works very similarly to “next-line” in that it takes the same two arguments and also returns a function, but it examines the state, to see how deeply nested the current header is and whether it marks the end of another section. I enclose sections in divs and headers mark the beginning of each div. Sections an be nested so I need to keep track of how many divs deep I am and headers need to be parsed to see just how deep the new section is. Each div has a unique id so that I can use some javascript later to fold them like in org files opened in Emacs.

The chief bit to watch here is that the map “state” has two entries of note: depth and ids. In a declarative and iteratively written parser, these would be stateful variables, and would be modified directly after each token was read from the input stream. In a functional and recursive parser, they are parameters which change with each recursion.

It’s also worth pointing out that in this particular example, part of the state, as it would be tracked in an iterative parser, is expressed as part of which function the program is in at any given moment. That is, instead of a “state” field on the “state” variable saying things like, “parsing-header,” “parsing-list,” or “parsing-normal,” there are different functions for each of these and they call each other as they detect the appropriate elements of the file. This is the part that corresponds most closely to a classical finite-state machine.

This org-online parser is more or less a direct translation of a stateful algorithm to a stateless one. It’s a single threaded algorithm. I suppose that it could be improved with a more functional and concurrent algorithm (and in fact, I thought of a possible way while writing this post,) but this is good enough for me for now.

This isn’t all there is too org-mode but it is all I really have space to explain here. My application application actually parses a larger subset but it’s a bit much to explain everything in this blog post. If you want to read the full documentation, you can find it here: org-docs ↩
The first argument to trampoline is a function which trampoline invokes on the rest of the argument list. If that function returns another function, trampoline calls that in turn, repeating until a function returns something aside from a function. ↩
You’ll notice that I “cheat” a little bit. The output is being stored in the stream “output” which is being modified statefully, so this algorithm isn’t purely functional. I could fix this by replacing calls to println with string concatenation and let forms but I don’t really see the need. My goal isn’t actually pure functionalism here. ↩