Website Filters


I’ve been writing this blog and maintaining this site for almost two months now. For most of that time, I’ve been maintaining it using a program called nanoc, which generates a static website from some templates and source files. I talk more about that here.

One of the things I like about this setup is that it gives me a lot of flexibility in how this website works. For example, I can apply a series of filters to my source files to generate the final web pages I want. I use Markdown, for example, to convert most of my blog posts 1 from legible plain text to HTML. I can also write my own filters: I’ve written two of them, and I thought they were worth sharing even if they didn’t merit their own projects.

The first is the hyphenation filter. When typesetting, it’s nice to be able to break a long word at the end of a line with a hyphen and continue it on the next line, so that long words don’t disrupt the spacing. Modern web browsers don’t currently provide a way to do this automatically, but they do generally allow one to use the soft hyphen character, “&shy;”, as a hyphenation hint: it tells the browser that it’s okay to split a word at that point, and it is invisible otherwise.

Now, it would be very impractical to add these manually. Seeing as I can’t predict where a browser is going to attempt a line break, I’d have to add them to every long word, which would get very tedious very fast. Never mind that having dozens of instances of “&shy;” scattered throughout my source files would be ugly and unreadable, or that they would interfere with the Markdown filter.2

Fortunately, programmatically finding hyphenation points is a solved problem: TeX already contains a solid hyphenation algorithm. In fact, there’s a nice Ruby gem that implements that algorithm, so all I need to do is actually insert the soft hyphens.
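As a minimal sketch of that insertion step, assuming the break points are already known: the helper name `insert_soft_hyphens` is hypothetical, and the offsets 2 and 6 are hand-picked for illustration rather than produced by the gem. The one detail worth showing is the order of insertion, since each inserted entity shifts every offset to its right.

```ruby
# Insert a soft-hyphen entity at each break point of a word.
# `points` are character offsets into `word`. In the real filter they come
# from the hyphenation gem; here they are supplied by the caller.
def insert_soft_hyphens(word, points, mark = "&shy;")
  result = word.dup # leave the caller's string untouched
  # Work from the rightmost point backwards so that the offsets still to be
  # processed are not shifted by the text already inserted.
  points.sort.reverse_each { |point| result.insert(point, mark) }
  result
end

puts insert_soft_hyphens("hyphenation", [2, 6])
# prints "hy&shy;phen&shy;ation"
```

Inserting left to right would also work, but only if every later offset were bumped by the length of the entity each time, which is easy to get wrong.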

The whole filter is here:

require "text/hyphen"
require "nokogiri"

class HyphenFilter < Nanoc3::Filter
  identifier :hyphenate
  type :text
  
  @@hyphenator = Text::Hyphen.new

  def self.hyphenate_word(word)
    @@hyphenator.hyphenate(word).reverse.each do |hyphen_point|
      word.insert(hyphen_point, "&shy;")
    end
    word
  end

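  def self.hyphenate_text(text)
    text.lines.map do |line|
      # Pad tags that sit between word characters with spaces so that they
      # become separate tokens. The replacement is single-quoted so that \1 is
      # a backreference; in a double-quoted string "\1" is the byte 0x01.
      line.gsub(/\b<([^<>]+)>\b/, ' <\1> ').split(/\s/).map do |word|
        # Leave anything that looks like markup or an entity untouched.
        if word =~ /[:&;<=>\"]/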
          word
        else
          HyphenFilter.hyphenate_word(word)
        end
      end.join(" ")
    end.join("\n")
  end

  def run(content, params={})
    # Parse with Nokogiri so that only 'p' elements are touched.
    document = Nokogiri::HTML::DocumentFragment.parse(content)
    document.css("p").each do |p|
      p.inner_html = HyphenFilter.hyphenate_text(p.inner_html)
    end
    document.to_html
  end
end

Because the input is a raw HTML fragment, I use Nokogiri to make certain that I only run the filter on paragraphs (‘p’ elements) and not on code fragments. I insert whitespace around HTML tags and split the text into ‘words’, skipping those that contain certain characters so that the HTML isn’t broken.3
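To see which tokens that character test lets through, here is a small sketch that uses only the standard library. The names `MARKUP_CHARS` and `each_plain_word` are hypothetical, and `upcase` stands in for the hyphenator purely so that the touched words are easy to spot.

```ruby
# Characters that usually mean a token is markup or an entity.
# This is the same character class the filter above tests for.
MARKUP_CHARS = /[:&;<=>"]/

# Split a line on whitespace and yield only the tokens that contain none of
# the markup characters; every other token is passed through unchanged.
def each_plain_word(line)
  line.split(/\s/).map do |token|
    token =~ MARKUP_CHARS ? token : yield(token)
  end.join(" ")
end

line = 'See <a href="/about">the documentation</a> &amp; examples'
puts each_plain_word(line) { |word| word.upcase }
# prints: SEE <a href="/about">the documentation</a> &amp; EXAMPLES
```

Note that “the” and “documentation” are skipped too, because they are glued to a tag and so end up in a token that contains a markup character. That errs on the safe side: a word may miss its hyphenation hints, but a tag is never split.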


The second filter is the footnote filter. Footnotes are a real nice-to-have for a blog, as they provide a place for those asides which would otherwise clutter up the main text,4 but it’s a little tricky to write them by hand in a way which is semantic and still readable in the original text. I ended up creating a little syntax where I use double parentheses, “((” and “))”, to indicate an aside which I want to separate from the main text. A filter then parses the text for these and moves the enclosed text into footnotes. This has the benefit that not only is it readable in the original, but if I decide I’d like another presentation later (say, sidenotes), or if I want different presentations for different representations of the text, I can code that into the filter instead of changing every post by hand.

Anyway, here’s the code:

require "nokogiri"

class CreateFootnotes < Nanoc3::Filter
  identifier :footnotes
  type :text

  def run(content, params={})
    document = Nokogiri::HTML::DocumentFragment.parse(content) #use nokogiri to only affect 'p' blocks
    index = 0      #numbering each footnote
    footnotes = [] #list containing footnote texts
    
    document.css("p").map do |p|
      # Replace each ((aside)) with a numbered link and collect its text.
      p.inner_html = p.inner_html.gsub(/\(\(([^()]*)\)\)/) do |match|
        index += 1
        footnotes << "<div class='note' id='#{index}'>" +
          "<sup><a name='ftn.footnote#{index}' href='#footnote#{index}'>#{index}</a></sup>" +
          "#{match[2...-2]}</div>"
        # The block's value is what replaces the ((aside)) in the paragraph.
        "<sup><a name='footnote#{index}' href='#ftn.footnote#{index}'>#{index}</a></sup>"
      end
    end
    document.to_html + footnotes.join("\n")
  end
end

As with the hyphenation filter, I use Nokogiri to make sure the filter is only applied to prose paragraphs, because the double-parenthesis syntax really interferes with code blocks. Other than that, I just match the double parentheses with a regex and replace the content with a superscript hyperlink to the footnote at the bottom. Rather simple.
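The matching step can be tried on its own, without nanoc or Nokogiri. The following is only a sketch: `extract_footnotes` is a hypothetical helper, and it emits plain “[n]” markers instead of the HTML links the filter produces, to keep the output easy to read.

```ruby
# Pull every ((aside)) out of a string.
# Returns the text with numbered markers in place of the asides, plus the
# list of aside texts in order of appearance.
def extract_footnotes(text)
  notes = []
  body = text.gsub(/\(\(([^()]*)\)\)/) do
    notes << Regexp.last_match(1).strip # text between the double parentheses
    "[#{notes.length}]"                 # the block's value replaces the match
  end
  [body, notes]
end

body, notes = extract_footnotes("Footnotes are handy ((like this one)) for asides ((and this)).")
puts body
# prints "Footnotes are handy [1] for asides [2]."
notes.each_with_index { |note, i| puts "#{i + 1}. #{note}" }
# prints "1. like this one" and "2. and this"
```

Because the pattern uses `[^()]*`, an aside that itself contains a parenthesis is not matched and stays in the text as written.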


I’ll publish a redacted version5 of my entire website to a GitHub repo at some point in the future, when I’m sufficiently satisfied with how it works. For now, I thought that these were rather neat.

  1. Like this one! 
  2. Or rather, that the Markdown filter would interfere with them as it escapes raw HTML. 
  3. Characters like ‘<’, ‘>’, ‘&’, ‘=’, and ‘;’ usually signify HTML markup so skipping them guarantees that I’m not inserting soft hyphens in the middle of a tag and breaking the web page. They can appear in other situations, but they’re safe to skip I believe. 
  4. Like this one! 
  5. I’d rather not publish my drafts or todo list 

Last update: 02/10/2011
