Website Filters

I’ve been writing this blog and maintaining this site for almost two months now. For most of that time, I’ve been maintaining it with a program called nanoc, which generates a static website from templates and source files. I talk more about that here.

One of the things I like about this setup is that it gives me a lot of flexibility in how this website works. For example, I can apply a series of filters to my source files to generate the final web pages I want. I use Markdown to convert most of my blog posts¹ from legible plain text to HTML. I can also write my own filters: I’ve written two of them, and I thought that they were worth sharing even if they didn’t merit their own projects.

The first is the hyphenation filter. When typesetting, it’s nice to be able to break a long word at the end of a line with a hyphen and continue it on the next line, so that long words don’t upset the spacing. Modern web browsers don’t currently provide a way to do this automatically, but they do generally honour the soft hyphen character, “&shy;”, as a hyphenation hint. It tells the browser that it’s okay to split a word at that point, and it is otherwise invisible.

Now, it would be very impractical to add these manually. Seeing as I can’t predict where a browser is going to attempt a line break, I’d have to add them to every long word, which would get very tedious very fast. Never mind that having dozens of instances of “&shy;” scattered throughout my source files would be very ugly and unreadable, or that they would interfere with the Markdown filter.²

Fortunately, programmatically finding hyphenation points is a solved problem, and there is already a solid hyphenation algorithm in TeX. In fact, there’s a nice Ruby gem which implements that algorithm, so all I need to do is actually insert the soft hyphens.
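To make the insertion step concrete, here is a minimal sketch. The break points are hard-coded purely for illustration (a real hyphenator would supply them), and `insert_soft_hyphens` is a hypothetical helper name of my own, not something from a library.

```ruby
SOFT_HYPHEN = "&shy;"

# Insert a soft hyphen at each break point. Working from the last point
# back to the first means an insertion never shifts the indices that are
# still waiting to be processed.
def insert_soft_hyphens(word, break_points)
  result = word.dup # leave the caller's string untouched
  break_points.sort.reverse_each do |point|
    result.insert(point, SOFT_HYPHEN)
  end
  result
end

# Break points assumed for the example: hy|phen|ation
puts insert_soft_hyphens("hyphenation", [2, 6])
```

Going front to back would also work, but then every index after the first would need to be offset by the length of the text already inserted.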

Here is the filter in full:

require "text/hyphen"
require "nokogiri"

class HyphenFilter < Nanoc3::Filter
  identifier :hyphenate
  type :text
  
  @@hyphenator = Text::Hyphen.new

  def self.hyphenate_word(word)
    @@hyphenator.hyphenate(word).reverse.each do |hyphen_point|
      word.insert(hyphen_point, "&shy;")
    end
    word
  end

  def self.hyphenate_text(text)
    text.lines.map do |line|
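      # Single quotes so that \1 is a backreference; in a double-quoted
      # string "\1" would be the control character U+0001.
      line.gsub(/\b<([^<>]+)>\b/i, ' <\1> ').split(/\s/).map do |word|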
        if word =~ /[:&;<=>\"]/
          word
        else
          HyphenFilter.hyphenate_word(word)
        end
      end.join(" ")
    end.join("\n")
  end

  def run(content, params={})
    document = Nokogiri::HTML::DocumentFragment.parse(content) #use nokogiri to only affect 'p' blocks
    document.css("p").each do |p|
      p.inner_html = HyphenFilter.hyphenate_text(p.inner_html)
    end
    document.to_html
  end
end

Because the input is a raw HTML fragment, I use nokogiri to make certain that I only run the filter on paragraphs (‘p’ elements) and not code fragments. I insert whitespace around HTML tags and split the text into ‘words’, skipping those that contain certain characters so that the HTML isn’t broken.³
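The word-skipping idea can be tried on its own, without either gem. This is only a sketch: `fake_hyphenate` is a hypothetical stand-in that splits long words in the middle, and the input is assumed to already have whitespace around its tags.

```ruby
# Characters that usually mean a token is HTML markup or an entity.
MARKUP_CHARS = /[:&;<=>"]/

# Hypothetical stand-in for a real hyphenator: put one soft hyphen in the
# middle of any word of eight or more characters.
def fake_hyphenate(word)
  return word if word.length < 8
  mid = word.length / 2
  word[0...mid] + "&shy;" + word[mid..-1]
end

# Split on whitespace, leave markup-looking tokens alone, hyphenate the rest.
def hyphenate_line(line)
  line.split(/\s/).map do |token|
    token =~ MARKUP_CHARS ? token : fake_hyphenate(token)
  end.join(" ")
end

puts hyphenate_line("a <em> remarkable </em> sentence")
```

The tags and anything carrying an attribute or entity pass through untouched, while plain words are handed to the hyphenator.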

The second filter is the footnote filter. Footnotes are a real nice-to-have for a blog, as they provide a place for those asides which would otherwise clutter up the main text,⁴ but it’s a little tricky to write them by hand in a way which is semantic and readable in the original text. I ended up creating a little syntax where I use double parentheses, “((” and “))”, to indicate an aside which I want to separate from the main text. A filter then scans the text for these and pulls the enclosed text out into footnotes. The benefit is that not only is it readable in the original, but if I decide I’d like another presentation later (say, sidenotes), or if I want different presentations for different representations of the text, then I can code that into the filter instead of making the change by hand.

Anyway, here’s the code:

require "nokogiri"

class CreateFootnotes < Nanoc3::Filter
  identifier :footnotes
  type :text

  def run(content, params={})
    document = Nokogiri::HTML::DocumentFragment.parse(content) #use nokogiri to only affect 'p' blocks
    index = 0      #numbering each footnote
    footnotes = [] #list containing footnote texts
    
    document.css("p").map do |p|
      p.inner_html = p.inner_html.gsub(/\(\(([^()]*)\)\)/) do |match|
        index += 1
        footnotes << "<div class='note' id='#{index}'>" +
          "<sup><a name='ftn.footnote#{index}' href='#footnote#{index}'>#{index}</a></sup>" +
          "#{match[2...-2]}</div>"
        "<sup><a name='footnote#{index}' href='#ftn.footnote#{index}'>#{index}</a></sup>"
      end
    end
    document.to_html + footnotes.join("\n")
  end
end

As with the hyphenation filter, I use nokogiri to make sure the filter is only applied to prose paragraphs. The double-parentheses syntax would really interfere with code blocks. Other than that, I just match the double parentheses with a regex and replace the content with a superscript hyperlink to the footnote at the bottom. Rather simple.
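The matching step is easy to try in isolation. Here is a minimal sketch that uses the same regex on a plain string and leaves a bracketed number behind instead of a hyperlink; `extract_footnotes` is a hypothetical name used only for this example.

```ruby
# Pull ((asides)) out of a string, leaving a numbered marker behind.
# Returns the rewritten text and the list of extracted notes.
def extract_footnotes(text)
  notes = []
  body = text.gsub(/\(\(([^()]*)\)\)/) do
    notes << Regexp.last_match(1) # the text between the double parentheses
    "[#{notes.length}]"
  end
  [body, notes]
end

body, notes = extract_footnotes("Filters are handy((mostly)) and small((so far)).")
puts body
notes.each_with_index { |note, i| puts "[#{i + 1}] #{note}" }
```

One consequence of the `[^()]*` part of the pattern is that an aside which itself contains parentheses will not match, so it stays in the text as written.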

I’ll publish a redacted version⁵ of my entire website to a GitHub repo at some point in the future, when I’m sufficiently satisfied with how it works. For now, I thought that these were rather neat.

¹ Like this one!
² Or rather, that the Markdown filter would interfere with them as it escapes raw HTML.
³ Characters like ‘<’, ‘>’, ‘&’, ‘=’, and ‘;’ usually signify HTML markup so skipping them guarantees that I’m not inserting soft hyphens in the middle of a tag and breaking the web page. They can appear in other situations, but they’re safe to skip I believe.
⁴ Like this one!
⁵ I’d rather not publish my drafts or todo list.