Why can't we do pipes smarter?

Sometimes I think that Unix is pretty awesome. You can strip it down to nothing but a kernel and a shell and maybe a few drivers/modules and end up with a perfectly useful, if minimal, system. At the same time, you can build Unix out into anything from a desktop system to a high-traffic web server to even a phone OS, depending on your definition of Unix. Unix is pretty flexible, is what I’m saying.

A lot has been written about the flexibility and power of Unix. Suffice it to say, Unix’s power is due in large part to its modularity and the composability of its component programs. One key ingredient of this composability is the venerable pipeline idiom: Unix’s ability to feed the output stream of one program to the input stream of another. Pipes are actually quite amazing. At the shell, they turn a set of small utilities into a complete system administration toolkit. Witness:

ps -ef | grep java | sort -k2,2n   # Using three separate programs in concert, find all running java processes and sort them numerically by process id

Combined with shell scripting, pipes provide the basis for a programming-centric computing environment, that is, one which is flexible and easily extensible using existing tools.

That said, pipes do have their limits. Take for example:

ps -ef | sort -k5 # List all processes sorted by start time, or by month, whichever.

The intent of the command above is to sort running processes by their start time, and so by their uptime. The “sort -k5” command looks for the fifth column (STIME) in the “ps -ef” output and sorts the lines returned by “ps -ef” starting from that column. The problem is that, depending on how recent it is, the start time can be either one column or two, which also shifts every column after it:

astine 28805 15921   0 16:13:52 pts/5       0:00 /bin/bash              # One column
  root   463     7   0   Jun 01 console     0:00 /usr/dt/bin/dtsession  # Two columns
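The shift can be reproduced in a few lines of Python. This is only a sketch: it splits the two sample rows on runs of whitespace, which is roughly how “sort” numbers its fields by default, and the helper name `field` is made up for the illustration.

```python
# The two sample rows from the "ps -ef" output above.
rows = [
    "astine 28805 15921   0 16:13:52 pts/5       0:00 /bin/bash",
    "  root   463     7   0   Jun 01 console     0:00 /usr/dt/bin/dtsession",
]

def field(line, k):
    """Return the k-th whitespace-separated field of a line (1-based)."""
    return line.split()[k - 1]

for row in rows:
    # Print the field count, then fields 5 and 8 of each row.
    print(len(row.split()), field(row, 5), field(row, 8))
```

The first row has eight fields and the second has nine, so the same field number points at different kinds of data in each row: field 8 is the command in one and the CPU time in the other.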

“Sort” relies on whitespace to distinguish columns, which means that it can’t reliably be used to sort the output of “ps -ef” by column. This seems to be a small issue here, but it’s part of a much larger problem with using free-form text streams for all input and output. It’s said that text streams are a universal interface.¹ That’s true as far as it goes, assuming that programs use compatible encoding systems. Text is simple to read because all computing systems can handle character encoding. However, the same thing can be said about binary data: it’s made of ones and zeros, and every computer system can handle ones and zeros. The difficulty with using free-form text as a universal interface is the same as the difficulty with using specialty binary formats: while a program can easily tell one character from another, how is it supposed to know what they mean?²

Tools like awk and sed allow for elaborate operations to munge the output of one program into something acceptable as input to another. These are necessary because the output of each Unix program is essentially an entirely new data format specific to that one program. Every time you pipe two programs together, you must essentially create a brand-new makeshift parser so that the programs can be compatible. This is really sub-optimal and, I think, one of the reasons why the “Unix style” of creating new programs out of smaller programs and shell scripting has slowly been abandoned in favor of all-in-one scripting languages such as Perl. These languages provide an entirely separate way to compose components, one which allows detailed knowledge of the data being passed between parts.³

Imagine, if you will, a system where shell utilities were typically used instead of system and library calls. Imagine that, instead of calling stat or some binding to it, one could call “ls -l,” which already returns the same information, and expect output which is immediately useful to a program, output which doesn’t require any special, after-the-fact parsing. Take this hypothetical Perl program:

my @files = `ls -l`;
for my $file (@files)
{
  if ($file->date > $some_date)
  {
    print $file->name;
  }
}

Notice how I’ve completely left out any parsing or regular expressions in the above example. Of course, we could use “stat” here, but then every language or platform we tried to code in would require an equivalent to “stat.” Now, with a genuine system utility such as “stat” this isn’t a problem; every language actually intended to be used has a binding. However, what about non-core Unix utilities? Take something like ImageMagick. ImageMagick has both library bindings and a shell interface. The more commonly used one is the shell version, because writing to a C library is a painstaking and error-prone process. ImageMagick doesn’t generally return a lot of data, so for most programs it’s easier to just call the shell.

A lot of what we call libraries could be written as programs in any language or platform and used from any other, if only there were an interchange format that didn’t require error-prone munging or parsing to make it understood.
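To make that concrete, here is a minimal sketch of a program consuming another program’s output with no hand-written parsing. No existing utility is assumed: the child process is a stand-in (a one-line Python script defined in `CHILD_SOURCE`), and `run_json` is a hypothetical helper name. The point is only that the caller needs a JSON decoder and nothing specific to the tool being called.

```python
import json
import subprocess
import sys

# Stand-in for a hypothetical JSON-emitting utility: a child process that
# prints a JSON array. A real tool could be written in any language.
CHILD_SOURCE = 'import json; print(json.dumps([{"filename": "file1", "size": 25}]))'

def run_json(argv):
    """Run a command and decode its standard output as JSON."""
    result = subprocess.run(argv, capture_output=True, text=True, check=True)
    return json.loads(result.stdout)

records = run_json([sys.executable, "-c", CHILD_SOURCE])
for record in records:
    # Fields arrive already typed: "size" is an integer, not a string.
    print(record["filename"], record["size"])
```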

So, what am I actually suggesting? Well, what if the “ls” command had another option? Let’s say a “-j” option which, if set, would cause the “ls” command to return its regular output as JSON. For example:

ls -j ./
[{"permissions" : "drwxr-xr-x", "links" : 3, "user" : "astine", "group" : "staff", "size" : 25, "date" : "Jul 11 2011", "filename" : "file1"},
 {"permissions" : "-rw-r--r--", "links" : 1, "user" : "astine", "group" : "staff", "size" : 3397945, "date" : "Jan 5 2011", "filename" : "file1.tar.gz"},
 {"permissions" : "-rw-r--r--", "links" : 1, "user" : "astine", "group" : "staff", "size" : 18718, "date" : "May 31 2011", "filename" : "file2.tar.gz"},
 {"permissions" : "-rw-r--r--", "links" : 1, "user" : "astine", "group" : "staff", "size" : 2463329, "date" : "Feb 28 2011", "filename" : "file3.tar.gz"},
 {"permissions" : "-rw-r--r--", "links" : 1, "user" : "astine", "group" : "staff", "size" : 2918886, "date" : "Feb 28 2011", "filename" : "file4.tar.gz"},
 {"permissions" : "drwxrwxrwx", "links" : 11, "user" : "astine", "group" : "staff", "size" : 35, "date" : "Feb 16 13:59", "filename" : "file2"},
 {"permissions" : "-rw-r--r--", "links" : 1, "user" : "astine", "group" : "staff", "size" : 6611, "date" : "Aug  9  2010", "filename" : "file1.clj"}]

With every field clearly delineated, writing utilities to correctly sort or filter by column would be trivial. We could further break down the permissions and dates into their component parts:

"-rw-r--r--" => {"user" : ["read", "write"], "group" : ["read"], "other" : ["read"]} 
"Jul 11 2011" => {"day" : 11, "month" : 6, "year" : 2011, "unixtime" : 1310416160}
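A breakdown like this is mechanical to produce. The sketch below is one possible way to do it in Python, not a proposed standard: the function names are made up, special mode bits such as setuid or sticky are ignored, the month is kept zero-based to match the example above, and the Unix timestamp is left out because the date string alone carries no time of day.

```python
from datetime import datetime

def parse_permissions(mode):
    """Split a mode string such as '-rw-r--r--' into per-class permission lists."""
    names = {"r": "read", "w": "write", "x": "execute"}
    classes = ("user", "group", "other")
    bits = mode[1:]  # drop the leading file-type character
    return {
        cls: [names[ch] for ch in bits[i * 3:(i + 1) * 3] if ch in names]
        for i, cls in enumerate(classes)
    }

def parse_date(text):
    """Parse a date such as 'Jul 11 2011' into numeric parts (zero-based month)."""
    parsed = datetime.strptime(text, "%b %d %Y")
    return {"day": parsed.day, "month": parsed.month - 1, "year": parsed.year}

print(parse_permissions("-rw-r--r--"))
print(parse_date("Jul 11 2011"))
```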

With dates and permissions broken down this way, we could reliably sort and filter by date, filter by permissions more simply, and so on. Furthermore, if this were a standard output of the “ls” command, then aside from performance concerns, there would be no need for a direct binding to “stat” in every programming language available. Even more usefully, one could write shell utilities which input and output JSON and operate on lists, trees, and dictionaries using CSS-style selectors. This would bring higher-level programming idioms to the Unix shell. While amazing things are possible with adroit usage of “sed” and “awk,” they are blunt instruments compared to what could be done with structured data and an appropriate set of tools. All that’s needed is a pretty printer for the structured data so that humans can read the output when it’s done being manipulated by the machines.
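As a rough illustration of how small such tools could be, here is a sketch of a generic sort, a generic filter, and a pretty printer over JSON records. The records are a trimmed copy of the hypothetical “ls -j” listing above, and the helper names `sort_by` and `where` are invented for the example.

```python
import json

# A trimmed copy of the hypothetical "ls -j" listing shown earlier.
listing = json.loads("""
[{"permissions": "drwxr-xr-x", "size": 25, "filename": "file1"},
 {"permissions": "-rw-r--r--", "size": 3397945, "filename": "file1.tar.gz"},
 {"permissions": "-rw-r--r--", "size": 18718, "filename": "file2.tar.gz"}]
""")

def sort_by(records, key):
    """Sort records by a named field; numbers compare as numbers, not text."""
    return sorted(records, key=lambda record: record[key])

def where(records, predicate):
    """Keep only the records for which the predicate holds."""
    return [record for record in records if predicate(record)]

by_size = sort_by(listing, "size")
regular_files = where(listing, lambda record: record["permissions"].startswith("-"))

# The "pretty printer" step: indent the structured data for human eyes.
print(json.dumps(by_size, indent=2))
```

Neither helper knows anything about “ls”; the same two functions would work unchanged on JSON output from “ps” or “df”.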

So what should the data interchange format be? Well, it should have most of these properties:

Data structures : Having tools like ps, ls, df, etc. output machine-comprehensible lists is really kind of the point of what I am advocating here. The ability for programs to unambiguously distinguish distinct fields in their input streams is essential to making reliable data manipulation tools. Arrays/lists and dictionaries are probably sufficient for this task.
Datatypes : The ability for programs to instantly recognize integers, floating point numbers, and strings would simplify parsing, sorting, and filtering of datastreams. In addition, standard representations for other common kinds of values, such as dates, permissions, file sizes, pathnames, etc., would also allow a tighter fit between applications.
A human readable form : While it is not essential for data being passed between programs to be immediately human readable, it is essential that the data be trivially transformable to a human readable form.
Ability to segregate formatted and plain text output : It’s not a given that this output format will be the right tool for every job, so it seems entirely reasonable to assume that there will be a need to mix formatted and unformatted output in the same stream.

To achieve these goals, there are a number of options, from XML to creating a new format from scratch, but I think that JSON in fact has this about 90% covered. It has all the basic data structures and datatypes covered. It’s human readable by default (though that can be improved with a pretty printer). A character sequence to mark the beginning and end of JSON datastreams is all that is really needed.
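One way such markers might work is sketched below. The choice of markers is purely an assumption for illustration: ASCII STX (0x02) opens a JSON segment and ETX (0x03) closes it. Since JSON requires control characters inside strings to be escaped, neither byte can appear raw inside a valid JSON segment. The function name `split_stream` is hypothetical.

```python
import json
import re

# Assumed framing, for illustration only: STX opens a JSON segment, ETX closes it.
START, END = "\x02", "\x03"
SEGMENT = re.compile(re.escape(START) + "(.*?)" + re.escape(END), re.DOTALL)

def split_stream(stream):
    """Separate a mixed stream into its plain text and its decoded JSON values."""
    values = [json.loads(match.group(1)) for match in SEGMENT.finditer(stream)]
    plain = SEGMENT.sub("", stream)
    return plain, values

mixed = "total 2\n" + START + '[{"filename": "file1", "size": 25}]' + END + "done\n"
plain, values = split_stream(mixed)
print(repr(plain))
print(values)
```

A tool that does not understand the framing would still see the plain text lines, while a JSON-aware tool could pick out only the structured segments.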

JSON is not the end of the discussion, of course. If we really wanted to use shell calls instead of system/library calls in a majority of cases, a binary format might be necessary to get the speed we want. A pretty printer might then solve the readability issue. Also, we might want to provide formatting and text structure markup for output intended for human reading. Ultimately, it might be preferable to have multiple forms of output and data-interchange formats to suit different needs, but I think for now an improvement over plain text is enough.

So for now, what I am suggesting is this: when we write new shell utilities, we consider adding a JSON output option. If enough utilities, including variants of the common shell utilities such as “ls” and “ps,” were to adopt this approach, I think the Unix platform would gain a lot more flexibility.
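As a sketch of how little this asks of a utility author, here is a toy directory lister with a “-j” switch, written in Python. It is not a reimplementation of “ls”, and the flag name simply follows the hypothetical option used earlier in this post. Only a few fields are emitted, and the modification time is given as a Unix timestamp rather than a formatted date.

```python
import argparse
import json
import os
import stat
import sys

def list_directory(path):
    """Collect one record per directory entry, without following symlinks."""
    records = []
    for name in sorted(os.listdir(path)):
        info = os.lstat(os.path.join(path, name))
        records.append({
            "permissions": stat.filemode(info.st_mode),
            "links": info.st_nlink,
            "size": info.st_size,
            "unixtime": int(info.st_mtime),
            "filename": name,
        })
    return records

def main(argv):
    parser = argparse.ArgumentParser(description="Toy lister with a JSON option.")
    parser.add_argument("path", nargs="?", default=".")
    parser.add_argument("-j", "--json", action="store_true",
                        help="emit JSON instead of whitespace-separated columns")
    args = parser.parse_args(argv)
    records = list_directory(args.path)
    if args.json:
        json.dump(records, sys.stdout, indent=1)
        sys.stdout.write("\n")
    else:
        for r in records:
            print(f'{r["permissions"]} {r["links"]:>3} {r["size"]:>10} {r["filename"]}')

# Demonstration call; a real script would pass sys.argv[1:] instead.
main(["-j", "."])
```

The same records feed both output modes, so the JSON option costs one extra branch rather than a second code path.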

I’m not the only person to have ever thought of this. Here are some projects that implement some of the ideas I am suggesting:

PowerShell : PowerShell is a very neat project with a bold vision and one major flaw: it’s tied to the .NET platform. This not only ties it to Windows, but also completely prevents the Unixy, truly polyglot approach I’m looking for.
jsawk : Jsawk is a great idea which almost single-handedly implements my entire suggestion in this post.
Jshon : Jshon is a similar tool to Jsawk.
There are other projects I’ve seen, but I can’t seem to remember all of them. If readers can, please suggest them in the comments.

¹ This was said by Doug McIlroy, and the full quote is, “This is the Unix philosophy: Write programs that do one thing and do it well. Write programs to work together. Write programs to handle text streams, because that is a universal interface.”
² Now, the output of most Unix programs tends to follow at least some logic. Different items are separated by line breaks, and separate parts of those items are separated with spaces, so that they typically end up in well-formatted columns. The line-based output format is common and, indeed, quite a few Unix programs, like grep, sed, awk, and sort, are built around the idea of line-based output. However, despite its strong support in the Unix environment, a line-based structure is fairly limiting. Trees and arbitrarily nested structures are impossible to process intuitively and, as demonstrated earlier, even the delimiters between fields are sometimes ambiguous. Output that is formatted to be more easily readable, such as the output of “ls -l *”, commonly breaks the line-based format altogether, making simple use of tools like grep and sort impossible.
³ These components, of course, are CPAN modules and Ruby Gems, for example, and they communicate through the Perl or Ruby language, which provides a rich array of datatypes and organizational tools from which to build their interfaces. This style of composition, while powerful, has a number of disadvantages, the most notable here being that components written for one language are not trivially usable with another.