Why can't we do pipes smarter?

Sometimes I think that Unix is pretty awesome. You can strip it down to nothing but a kernel and a shell and maybe a few drivers or modules and end up with a perfectly useful, if minimal, system. At the same time, you can build Unix out into anything from a desktop system to a high-traffic webserver to even a phone OS, depending on your definition of Unix. Unix is pretty flexible, is what I’m saying.

A lot has been written about the flexibility and power of Unix. Suffice it to say, Unix’s power is due in large part to its modularity and the composability of its component programs. One key ingredient of this composability is the venerable pipeline idiom: Unix’s ability to feed the output stream of one program to the input stream of another. Pipes actually are quite amazing; at the shell, they turn a set of small utilities into a complete system administration toolkit. Witness:

ps -ef | grep java | sort -k2   # Using three separate programs in concert, find all running java processes and sort them by process id;

Combined with shell scripting, pipes provide the basis for a programming-centric computing environment; that is, one which is flexible and easily extensible using existing tools.

That said, pipes do have their limits. Take, for example:

ps -ef | sort -k8 # List all processes sorted by uptime, or tty, whichever. 

The intent of the command above is to sort running processes by their uptime. The “sort -k8” command looks for the eighth column in the “ps -ef” output and sorts the lines returned by “ps -ef” by that column. The problem is that, depending on how recent the process’s start time is, the date can be either one column or two:

astine 28805 15921   0 16:13:52 pts/5       0:00 /bin/bash              # One column
  root   463     7   0   Jun 01 console     0:00 /usr/dt/bin/dtsession  # Two columns

“Sort” relies on whitespace to distinguish columns, which means it can’t always reliably be used to sort the output of “ps -ef” by column. This seems like a small issue here, but it’s part of a much larger problem with using free-form text streams for all input and output. It’s said that text streams are a universal interface.[1] That’s true as far as it goes, assuming that programs use compatible encoding systems. Text is simple to read because all computing systems can handle character encoding. However, the same thing can be said about binary data: it’s made of ones and zeros, and every computer system can handle ones and zeros. The difficulty with using free-form text as a universal interface is the same as the difficulty with specialty binary formats: while a program can easily tell one character from another, how is it supposed to know what the characters mean?[2]
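The ambiguity is easy to demonstrate: split the two “ps -ef” lines above on whitespace and count the fields. A quick Python sketch:

```python
# Why whitespace-delimited columns are ambiguous: the same "ps -ef"
# output yields a different number of fields per line, depending on
# how the start time happens to be formatted.
lines = [
    "astine 28805 15921   0 16:13:52 pts/5       0:00 /bin/bash",
    "  root   463     7   0   Jun 01 console     0:00 /usr/dt/bin/dtsession",
]

for line in lines:
    fields = line.split()
    # "column 8" means something different on each line
    print(len(fields), fields[7])
```

The first line splits into 8 fields, the second into 9, so “column 8” is the command on one line and the CPU time on the other.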

Tools like awk and sed allow for elaborate operations to helpfully munge the output of one program into something acceptable as input to another. These are necessary because the output of each Unix program is essentially an entirely new data format, specific to that one program. Every time you pipe two programs together, you must essentially create a brand-new makeshift parser so that the programs can be compatible. This is really sub-optimal and, I think, one of the reasons why the “Unix style” of creating new programs out of smaller programs and shell scripting has slowly been abandoned in favor of all-in-one scripting languages such as Perl and the like, which provide an entirely separate way to compose components, one that allows detailed knowledge of the data being passed between parts.[3]

Imagine, if you will, a system where shell utilities were typically used instead of system and library calls. Imagine that instead of calling stat, or some binding to it, one could call “ls -l,” which already returns the same information, and expect output which is immediately useful to a program, output which doesn’t require any special after-the-fact parsing. Take this hypothetical Perl program:

# Hypothetical: assumes `ls -l` returns structured records rather than text
my @files = `ls -l`;
for my $file (@files) {
    if ($file->date > $some_date) {
        print $file->name;
    }
}

Notice how I’ve completely left out any parsing or regular expressions in the above example. Of course, we could use “stat” here, but then every language or platform we tried to code in would require an equivalent to “stat.” Now, with a genuine system utility such as “stat” this isn’t a problem; every language actually intended to be used has a binding. However, what about non-core Unix utilities? Take something like ImageMagick. ImageMagick has both library bindings and a shell interface. The shell interface is the more commonly used one, because writing to a C library is a painstaking and error-prone process, and since ImageMagick doesn’t generally return a lot of data, it’s easier for most programs to just call the shell.

A lot of what we call libraries could be written as programs in any language or platform and used from any other, if only there were an interchange format that didn’t require error-prone munging or parsing to make it understood.

So, what am I actually suggesting? Well, what if the “ls” command had another option, say a “-j” option, which, if set, would cause the “ls” command to return its regular output embedded in a JSON format. For example:

ls -j ./
[{"permissions" : "drwxr-xr-x", "links" : 3, "user" : "astine", "group" : "staff", "size" : 25, "date" : "Jul 11 2011", "filename" : "file1"},
 {"permissions" : "-rw-r--r--", "links" : 1, "user" : "astine", "group" : "staff", "size" : 3397945, "date" : "Jan 5 2011", "filename" : "file1.tar.gz"},
 {"permissions" : "-rw-r--r--", "links" : 1, "user" : "astine", "group" : "staff", "size" : 18718, "date" : "May 31 2011", "filename" : "file2.tar.gz"},
 {"permissions" : "-rw-r--r--", "links" : 1, "user" : "astine", "group" : "staff", "size" : 2463329, "date" : "Feb 28 2011", "filename" : "file3.tar.gz"},
 {"permissions" : "-rw-r--r--", "links" : 1, "user" : "astine", "group" : "staff", "size" : 2918886, "date" : "Feb 28 2011", "filename" : "file4.tar.gz"},
 {"permissions" : "drwxrwxrwx", "links" : 11, "user" : "astine", "group" : "staff", "size" : 35, "date" : "Feb 16 13:59", "filename" : "file2"},
 {"permissions" : "-rw-r--r--", "links" : 1, "user" : "astine", "group" : "staff", "size" : 6611, "date" : "Aug  9  2010", "filename" : "file1.clj"}]

With every field clearly delineated, writing utilities to correctly sort or filter by column would be trivial. We could further break down the permissions and dates into their component parts:

"-rw-r--r--" => {"user" : ["read", "write"], "group" : ["read"], "other" : ["read"]} 
"Jul 11 2011" => {"day" : 11, "month" : 6, "year" : 2011, "unixtime" : 1310416160}
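The permission-string breakdown above is mechanical enough to sketch. Here is a minimal Python version (the function name parse_permissions is mine; the field names mirror the example):

```python
# Break a Unix permission string like "-rw-r--r--" into the structured
# form shown above: {"user": [...], "group": [...], "other": [...]}.
def parse_permissions(perms):
    names = {"r": "read", "w": "write", "x": "execute"}
    out = {}
    for i, who in enumerate(["user", "group", "other"]):
        # skip the leading file-type character, then take 3 chars per class
        triplet = perms[1 + 3 * i : 4 + 3 * i]
        out[who] = [names[c] for c in triplet if c != "-"]
    return out

print(parse_permissions("-rw-r--r--"))
# {'user': ['read', 'write'], 'group': ['read'], 'other': ['read']}
```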

And reliably sort by date, reliably filter by date, more simply filter by permissions, etc. Furthermore, if this were a standard output mode of the “ls” command, then, aside from performance concerns, there would be no need for a direct binding to “stat” in every programming language available. Even more useful, shell utilities could be written which input and output JSON and can operate on lists, trees, and dictionaries using CSS-style selectors. This would bring higher-level programming idioms to the Unix shell. While amazing things are possible with adroit usage of “sed” and “awk,” they are blunt instruments compared to what could be done with structured data and an appropriate set of tools. All that’s needed is a pretty printer for the structured data, so that humans can read the output when it’s done being manipulated by the machines.
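To illustrate how trivial sorting becomes once every field is delineated, here is a Python sketch over a few records shaped like the hypothetical “ls -j” output (trimmed to three fields for brevity):

```python
import json

# With structured output, "sort by column" is a one-liner: no field
# counting, no regular expressions. Records mirror the "ls -j" example.
records = json.loads("""[
 {"permissions": "drwxr-xr-x", "size": 25,      "filename": "file1"},
 {"permissions": "-rw-r--r--", "size": 3397945, "filename": "file1.tar.gz"},
 {"permissions": "-rw-r--r--", "size": 18718,   "filename": "file2.tar.gz"}
]""")

by_size = sorted(records, key=lambda r: r["size"])
print([r["filename"] for r in by_size])  # smallest first
```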

So, what should the data interchange format be? Well, it should have most of these properties:

  - support for the basic data structures (lists and dictionaries) and datatypes
  - human readability, at least with the help of a pretty printer
  - a clear way to mark the beginning and end of a data stream

To achieve these goals, there are a number of options, from XML to creating a new format from scratch, but I think that JSON in fact has this about 90% covered. It has all the basic data structures and datatypes covered. It’s human readable by default (though that can be improved with a pretty printer). A character sequence to mark the beginning and end of JSON datastreams is all that is really needed.
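The post asks only for some character sequence to delimit records in a stream; one concrete convention (my assumption here, not something prescribed above) is newline-delimited JSON, one record per line, which lets a consumer process a stream incrementally:

```python
import json

# One possible framing convention (an assumption, not prescribed by the
# post): newline-delimited JSON. Each line is a complete record, so a
# consumer can process the stream without buffering all of it first.
stream = (
    '{"filename": "file1", "size": 25}\n'
    '{"filename": "file1.tar.gz", "size": 3397945}\n'
)

records = [json.loads(line) for line in stream.splitlines()]
total = sum(r["size"] for r in records)
print(total)  # 3397970
```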

JSON is not the end of the discussion, of course. If we really wanted to use shell calls instead of system/library calls in a majority of cases, a binary format might be necessary to get the speed we want; a pretty printer might then solve the readability issue. Also, we might want to provide formatting and text-structure markup for output intended for human reading. Ultimately, it might be preferable to have multiple forms of output and data-interchange format to suit different needs, but I think for now an improvement over plain text is enough.

So for now, what I am suggesting is this: when we write new shell utilities, we should consider adding a JSON output option. If enough utilities, including variants of the common shell utilities such as “ls” and “ps,” were to adopt this approach, I think the Unix platform would garner a lot more flexibility.

I’m not the only person to have thought of this; there are existing projects that implement some of the ideas I am suggesting.

  1. This was said by Doug McIlroy and the full quote is, “This is the Unix philosophy: Write programs that do one thing and do it well. Write programs to work together. Write programs to handle text streams, because that is a universal interface.” 
  2. Now, the output of most Unix programs tends to follow at least some logic. Different items are separated by line breaks, and separate parts of those items are separated with spaces, so that they typically end up in well-formatted columns. This line-based output format is common and, indeed, quite a few Unix programs, like grep, sed, awk, and sort, are built around it. However, despite its strong support in the Unix environment, a line-based structure is fairly limiting. Trees and arbitrarily nested structures are impossible to process intuitively and, as demonstrated earlier, even the delimiters between fields are sometimes ambiguous. Output formatted to be more easily readable, such as the output of “ls -l *”, breaks the line-based format altogether, making simple use of tools like grep and sort impossible.
  3. These components are, of course, CPAN modules and Ruby Gems, for example, and they communicate through the Perl or Ruby language, which provides a rich array of datatypes and organizational tools from which to build their interfaces. This style of composition, while powerful, has a number of disadvantages, the most notable here being that components written for one language are not trivially usable from another.

Last update: 27/03/2012
