Functional Parsing With Clojure

Recently I built a small webapp to allow me to view my org files online. I know that I can export org files to HTML and scp them to my server, but this was for a TODO list among other things and I wanted to both pass­word pro­tect it and pos­si­bly also make so that I could per­form sim­ple edits to the file from the web. As a result, I decided to write a webapp which parsed and con­verted the org file to HTML on the fly with each request. What resulted was a very sim­ple web appli­ca­tion, writ­ten in Clo­jure, which can dis­play a direc­tory of org files online. If you want to use it, you can get the code here.

Any­way, while writ­ing this, I strug­gled a lit­tle with Clo­jure’s mostly func­tional nature to write an effi­cient and clean parser/­trans­la­tor for the org files. The issue is that, for me, the most nat­ural way to parse some­thing more com­pli­cated than a sim­ple sequence of tokens, is to use a state engine to keep track of the con­text while pars­ing in each new char­ac­ter, line, or token. For exam­ple, when pars­ing some­thing like the this:

(token1 (token2 token3) token4)

One could use a vari­able “depth” to keep track of how deeply nested the paren­the­ses are at any given moment.

While Clo­jure does allow for state, I felt that using a ref to rep­re­sent the state of the parser at any given moment, was A) a lit­tle heavy­weight, and B) vio­lated the whole point of using a ref. Clo­jure’s STM seman­tics are meant to enable shared state in a con­cur­rent pro­gram and this sort of parser is inher­ently sin­gle thread­ed. After a few attempts to achieve a purely func­tional approach using line-seq and too much clev­er­ness I came up with a messy hybrid solu­tion that I was­n’t happy with.

Com­ing back to the project a week lat­er, I had a face-­palm moment where I real­ized “Duh, this can be done with mutu­al-re­cur­sion, an accu­mu­la­tor vari­able, and para­me­ters to rep­re­sent the parser state.” This is com­mon sense to any­one used to func­tional pro­gram­ming, but it some­how slipped my mind. This is a pretty stan­dard tech­nique for trans­lat­ing a state­ful algo­rithm to a state­less algo­rithm and I think my org-online webapp is a pretty clear exam­ple so I thought I’d share.

In a nut­shell, code which looks like this:

(defvar *state* nil)

  (setf *state* (do-something *state*)))

can be trans­lated to code which looks like this:

(defn function [state]
  (let [new-state (do-something state)]
    (function new-state)))

In plain Eng­lish, we’re tak­ing the rel­e­vant part of the pro­gram state and instead of mod­i­fy­ing it in place over the course of the pro­gram’s exe­cu­tion, we’re pass­ing it in the para­me­ters of a func­tion call and recurs­ing with dif­fer­ent val­ues so we never actu­ally have to mod­ify any state. It’s just like an accu­mu­la­tor vari­able but we don’t intend to ulti­mately return it at any point in time.

So, org-online uses just this tech­nique to rep­re­sent the parser state. Before I con­tin­ue, it will be help­ful to describe in more detail just what this appli­ca­tion does. Org-­mode is a flex­i­ble plain text out­lin­er, note-­tak­er, list-­mak­ing appli­ca­tion/­mode for Emacs and org files are the files it pro­duces. They’re sim­ple plain text files with a sim­ple markup. Lines begin­ning with aster­isks rep­re­sent sec­tion head­ers and mul­ti­ple aster­isks rep­re­sent nested headers.1

The entrance to the code is a func­tion called “convert-org-file-to-html”:

(defn convert-org-file-to-html
  "Translates an org file to an HTML file returned as a string"
  (with-open [output (StringWriter.)]
    (with-open [input (reader file)]
      (binding [*out* output]
        (let [lines (line-seq input)
              cfg (process-cfg-lines lines)]
          (trampoline next-line lines {:cfg cfg :depth 0})))
      (.toString output))))

The part to look at is the call to tram­po­line, which is how Clo­jure imple­ments mutual recursion.2 This enters the parser with a call to nex­t-­line, which takes two argu­ments, “li­nes” which is a lazy sequence of lines from the file, and a map called “state” which rep­re­sents the parser state. Nex­t-­line is one of sev­eral func­tions which are mutu­ally recur­sive and take the same argu­ments. Each one exam­ines the cur­rent line and the parser state, prints out some markup,3 and returns a func­tion to han­dle the rest of the file:

(defn next-line
  "Handles a line in an org file, dispatches if it is a header, list, other specially case."
  [[line & rest] {:keys [properties? depth] :as state}]
  (handle-end-lines [line state]
    (condp re-seq line 
      #":PROPERTIES:" (partial properties-line rest state)
      #"^\*+ " (partial parse-header (cons line rest) state)
      #"^(\s+[+*-] |[+-] )" (partial ulist-line (cons line rest) state)
        (println (html [:p (create-links line)]))
        (partial next-line rest state)))))

“Han­dle-end-­li­nes” ends the pars­ing if “line” is null, (ie. the EOF) while “condp re-seq line” per­forms a reg­u­lar expres­sion check on the cur­rent line being parsed. Three cases are han­dled, one for a “PROP­ER­TIES” tag, one for a head­er, one for the start of an unordered list. There is also a final catch-all for plane lines of text. The first three sim­ple dis­patch to another han­dler, sim­i­lar to “nex­t-­line”, whereas the last one prints out the HTML of the cur­rent line before mov­ing on to the “nex­t-­line.”

(defn header-string
"Format a header line into an HTML string"
[stars line id cfg]
  (html [(keyword (str "h" stars))
         {:id (str "head-" id) :class "folder"}
         (markup-todos (.substring line stars) cfg)]))

(defn parse-header
  "Handles a header line in an org file."
  [[line & rest] {:keys [depth ids cfg] :as state}]
  (let [stars (count-stars line)
        id (gensym)
        closings (inc (- depth stars))]
    (doseq [index (range closings)]
      (println "</div></div>"))
    (println "<div>")
    (println (header-string stars line id cfg))
    (println (str "<div class=\"foldable\" id=\"body-" id "\">"))
    (partial next-line rest
             (assoc state
               :depth stars
               :ids (cons id (drop closings ids))))))

“Parse-­head­er” works very sim­i­larly to “nex­t-­line” in that it takes the same two argu­ments and also returns a func­tion, but it exam­ines the state, to see how deeply nested the cur­rent header is and whether it marks the end of another sec­tion. I enclose sec­tions in divs and head­ers mark the begin­ning of each div. Sec­tions an be nested so I need to keep track of how many divs deep I am and head­ers need to be parsed to see just how deep the new sec­tion is. Each div has a unique id so that I can use some javascript later to fold them like in org files opened in Emacs.

The chief bit to watch here is that the map “state” has two entries of note: depth and ids. In a declar­a­tive and iter­a­tively writ­ten parser, these would be state­ful vari­ables, and would be mod­i­fied directly after each token was read from the input stream. In a func­tional and recur­sive parser, they are para­me­ters which change with each recur­sion.

It’s also worth point­ing out that in this par­tic­u­lar exam­ple, part of the state, as it would be tracked in an iter­a­tive parser, is expressed as part of which func­tion the pro­gram is in at any given moment. That is, instead of a “state” field on the “state” vari­able say­ing things like, “pars­ing-­head­er,” “pars­ing-list,” or “pars­ing-nor­mal,” there are dif­fer­ent func­tions for each of these and they call each other as they detect the appro­pri­ate ele­ments of the file. This is the part that cor­re­sponds most closely to a clas­si­cal finite-s­tate machine.

This org-online parser is more or less a direct trans­la­tion of a state­ful algo­rithm to a state­less one. It’s a sin­gle threaded algo­rithm. I sup­pose that it could be improved with a more func­tional and con­cur­rent algo­rithm (and in fact, I thought of a pos­si­ble way while writ­ing this post,) but this is good enough for me for now.

  1. This isn’t all there is too org-mode but it is all I really have space to explain here. My application application actually parses a larger subset but it’s a bit much to explain everything in this blog post. If you want to read the full documentation, you can find it here: org-docs  
  2. The first argument to trampoline is a function which trampoline invokes on the rest of the argument list. If that function returns another function, trampoline calls that in turn, repeating until a function returns something aside from a function. 
  3. You’ll notice that I “cheat” a little bit. The output is being stored in the stream “output” which is being modified statefully, so this algorithm isn’t purely functional. I could fix this by replacing calls to println with string concatenation and let forms but I don’t really see the need. My goal isn’t actually pure functionalism here. 

Last update: 21/05/2012

blog comments powered by Disqus