This chapter originally appeared in Linux Journal, volume 1, number 2, in the What's GNU? column. It was written by Arnold Robbins.
This month's column is only peripherally related to the GNU Project, in that it describes a number of the GNU tools on your Linux system and how they might be used. What it's really about is the ``Software Tools'' philosophy of program development and usage.
The software tools philosophy was an important and integral concept in the initial design and development of Unix (of which Linux and GNU are essentially clones). Unfortunately, in the modern day press of Internetworking and flashy GUIs, it seems to have fallen by the wayside. This is a shame, since it provides a powerful mental model for solving many kinds of problems.
Many people carry a Swiss Army knife around in their pants pockets (or purse). A Swiss Army knife is a handy tool to have: it has several knife blades, a screwdriver, tweezers, toothpick, nail file, corkscrew, and perhaps a number of other things on it. For the everyday, small miscellaneous jobs where you need a simple, general purpose tool, it's just the thing.
On the other hand, an experienced carpenter doesn't build a house using a Swiss Army knife. Instead, he has a toolbox chock full of specialized tools---a saw, a hammer, a screwdriver, a plane, and so on. And he knows exactly when and where to use each tool; you won't catch him hammering nails with the handle of his screwdriver.
The Unix developers at Bell Labs were all professional programmers and trained computer scientists. They had found that while a one-size-fits-all program might appeal to a user because there's only one program to use, in practice such programs are difficult to write, difficult to maintain and debug, and difficult to extend to meet new situations.
Instead, they felt that programs should be specialized tools. In short, each program ``should do one thing well.'' No more and no less. Such programs are simpler to design, write, and get right---they only do one thing.
Furthermore, they found that with the right machinery for hooking programs together, that the whole was greater than the sum of the parts. By combining several special purpose programs, you could accomplish a specific task that none of the programs was designed for, and accomplish it much more quickly and easily than if you had to write a special purpose program. We will see some (classic) examples of this further on in the column. (An important additional point was that, if necessary, take a detour and build any software tools you may need first, if you don't already have something appropriate in the toolbox.)
Hopefully, you are familiar with the basics of I/O redirection in the shell, in particular the concepts of ``standard input,'' ``standard output,'' and ``standard error''. Briefly, ``standard input'' is a data source, where data comes from. A program should not need to either know or care if the data source is a disk file, a keyboard, a magnetic tape, or even a punched card reader. Similarly, ``standard output'' is a data sink, where data goes to. The program should neither know nor care where this might be. Programs that only read their standard input, do something to the data, and then send it on, are called ``filters'', by analogy to filters in a water pipeline.
With the Unix shell, it's very easy to set up data pipelines:
program_to_create_data | filter1 | .... | filterN > final.pretty.data
We start out by creating the raw data; each filter applies some successive transformation to the data, until by the time it comes out of the pipeline, it is in the desired form.
This is fine and good for standard input and standard output. Where does standard error come into play? Well, think about filter1 in the pipeline above. What happens if it encounters an error in the data it sees? If it writes an error message to standard output, the message will just disappear down the pipeline into filter2's input, and the user will probably never see it. So programs need a place to send error messages where the user will notice them. That place is standard error, which is normally connected to your console or window, even when standard output has been redirected away from the screen.
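As a quick illustration (the file name below is made up and presumably does not exist on your system), consider what happens when a command in a pipeline fails:

$ ls no.such.file | wc -l
ls: no.such.file: No such file or directory
0

The error message goes straight to the terminal via standard error, while wc sees an empty standard input and reports zero lines; the exact wording of the message depends on your version of ls.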
For filter programs to work together, the format of the data has to be agreed upon. The most straightforward and easiest format to use is simply lines of text. Unix data files are generally just streams of bytes, with lines delimited by the ASCII LF (Line Feed) character, conventionally called a ``newline'' in the Unix literature. (This is '\n' if you're a C programmer.) This is the format used by all the traditional filtering programs.
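If you want to see those newline bytes for yourself, a tool such as od (not otherwise discussed in this column) will show them; this is just an illustrative aside:

$ printf 'one\ntwo\n' | od -c
0000000   o   n   e  \n   t   w   o  \n
0000010

Each \n marks the end of a line, and that is all the structure a traditional Unix text file has.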
OK, enough introduction. Let's take a look at some of the tools, and then we'll see how to hook them together in interesting ways. In the following discussion, we will only present those command line options that interest us. As you should always do, double check your system documentation for the full story.
The first program is the who command. By itself, it generates a list of the users who are currently logged in. Although I'm writing this on a single-user system, we'll pretend that several people are logged in:
$ who
arnold   console Jan 22 19:57
miriam   ttyp0   Jan 23 14:19(:0.0)
bill     ttyp1   Jan 21 09:32(:0.0)
arnold   ttyp2   Jan 23 20:48(:0.0)
Here, the $ is the usual shell prompt, at which I typed who. There are three people logged in, and I am logged in twice. On traditional Unix systems, user names are never more than eight characters long; this little bit of trivia will be useful later. The output of who is nice, but the data is not all that exciting.
The next program we'll look at is the cut command. This program cuts out columns or fields of input data. For example, we can tell it to print just the login name and full name from the /etc/passwd file. The /etc/passwd file has seven fields, separated by colons:
arnold:xyzzy:2076:10:Arnold D. Robbins:/home/arnold:/bin/ksh
To get the first and fifth fields, we would use cut like this:
$ cut -d: -f1,5 /etc/passwd
root:Operator
...
arnold:Arnold D. Robbins
miriam:Miriam A. Robbins
...
With the -c option, cut will cut out specific characters (i.e., columns) in the input lines. This is useful for input data that has fixed-width fields and no field separator.
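For instance, given a hypothetical file of fixed-width inventory records (the file and its contents are invented purely for illustration), we could pull out just the part-name column:

$ cat inventory
WIDGET  00042
GADGET  00317
$ cut -c1-6 inventory
WIDGET
GADGET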
Next we'll look at the sort command. This program reads its standard input (or the files named on the command line), sorts the data, and writes the sorted data to its standard output. Given no file names, it reads standard input, so it works as a filter in the middle of a pipeline.
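A toy example (the names are simply fed in with printf for illustration) shows sort acting as a filter:

$ printf 'miriam\nbill\narnold\n' | sort
arnold
bill
miriam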
Finally (at least for now), we'll look at the uniq program. When sorting data, you will often end up with duplicate lines, lines that are identical. Usually, all you need is one instance of each line. This is where uniq comes in. The uniq program reads its standard input, which it expects to be sorted, and prints only one copy of each repeated line. It does have several options; later on, we'll use the -c option, which prints each unique line preceded by a count of the number of times that line occurred in the input.
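Again, a small made-up illustration of uniq and its -c option on already-sorted input:

$ printf 'arnold\narnold\nbill\n' | uniq
arnold
bill
$ printf 'arnold\narnold\nbill\n' | uniq -c
      2 arnold
      1 bill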
Now, let's suppose this is a large BBS system with dozens of users logged in. The management wants the SysOp to write a program that will generate a sorted list of logged in users. Furthermore, even if a user is logged in multiple times, his or her name should only show up in the output once.
The SysOp could sit down with the system documentation and write a C program that did this. It would take perhaps a couple of hundred lines of code and about two hours to write it, test it, and debug it. However, knowing the software toolbox, the SysOp can instead start out by generating just a list of logged on users:
$ who | cut -c1-8
arnold
miriam
bill
arnold
Next, sort the list:
$ who | cut -c1-8 | sort
arnold
arnold
bill
miriam
Finally, run the sorted list through uniq, to get rid of the duplicates:
$ who | cut -c1-8 | sort | uniq
arnold
bill
miriam
The SysOp puts this pipeline into a shell script, and makes it available for all the users on the system:
# cat > /usr/local/bin/listusers
who | cut -c1-8 | sort | uniq
^D
# chmod +x /usr/local/bin/listusers
There are four major points to note here. First, with just four programs, on one command line, the SysOp was able to save about two hours worth of work. Furthermore, the shell pipeline is just about as efficient as the C program would be, and it is much more efficient in terms of programmer time. People time is much more expensive than computer time, and in our modern ``there's never enough time to do everything'' society, saving two hours of programmer time is no mean feat.
Second, it is also important to emphasize that with the combination of the tools, it is possible to do a special purpose job never imagined by the authors of the individual programs.
Third, it is also valuable to build up your pipeline in stages, as we did here. This allows you to view the data at each stage in the pipeline, which helps you acquire the confidence that you are indeed using these tools correctly.
Finally, by bundling the pipeline in a shell script, other users can use your command, without having to remember the fancy plumbing you set up for them. In terms of how you run them, shell scripts and compiled programs are indistinguishable.
After the previous warm-up exercise, we'll look at two additional, more complicated pipelines. For them, we need to introduce two more tools.
The first is the tr command, which stands for ``transliterate.'' The tr command works on a character-by-character basis, changing characters. Normally it is used for things like mapping upper case to lower case:
$ echo ThIs ExAmPlE HaS MIXED case! | tr '[A-Z]' '[a-z]'
this example has mixed case!
There are several options of interest:

-c    work on the complement of the listed characters, i.e., operations apply to characters not in the given set
-d    delete characters in the first set from the output
-s    squeeze repeated characters in the output into just one character
We will be using all three options in a moment.
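As a quick, made-up illustration of -d and -s (the -c option shows up in the pipelines below):

$ echo 'Hello,   world!!' | tr -d ',!'
Hello   world
$ echo 'Hello,   world!!' | tr -s ' '
Hello, world!!

The first command deletes the punctuation characters; the second squeezes the run of blanks down to a single blank.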
The other command we'll look at is comm. The comm command takes two sorted input files as input data, and prints out the files' lines in three columns. The output columns are the data lines unique to the first file, the data lines unique to the second file, and the data lines that are common to both files. The -1, -2, and -3 command line options omit the respective columns. (This is non-intuitive and takes a little getting used to.) For example:
$ cat f1
11111
22222
33333
44444
$ cat f2
00000
22222
33333
55555
$ comm f1 f2
        00000
11111
                22222
                33333
44444
        55555
The single dash as a file name tells comm to read standard input instead of a regular file.
Now we're ready to build a fancy pipeline. The first application is a word frequency counter. This helps an author determine if he or she is over-using certain words.
The first step is to change the case of all the letters in our input file to one case. ``The'' and ``the'' are the same word when doing counting.
$ tr '[A-Z]' '[a-z]' < whats.gnu | ...
The next step is to get rid of punctuation. Quoted words and unquoted words should be treated identically; it's easiest to just get the punctuation out of the way.
$ tr '[A-Z]' '[a-z]' < whats.gnu | tr -cd '[A-Za-z0-9_ \012]' | ...
At this point, we have data consisting of words separated by blank space. The words only contain alphanumeric characters (and the underscore). The next step is to break the data apart so that we have one word per line. This makes the counting operation much easier, as we will see shortly.
$ tr '[A-Z]' '[a-z]' < whats.gnu | tr -cd '[A-Za-z0-9_ \012]' |
> tr -s '[ ]' '\012' | ...
This command turns blanks into newlines. The -s option squeezes multiple newline characters in the output into just one. This helps us avoid blank lines. (The > is the shell's ``secondary prompt.'' This is what the shell prints when it notices you haven't finished typing in all of a command.)
We now have data consisting of one word per line, no punctuation, all one case. We're ready to count each word:
$ tr '[A-Z]' '[a-z]' < whats.gnu | tr -cd '[A-Za-z0-9_ \012]' |
> tr -s '[ ]' '\012' | sort | uniq -c | ...
At this point, the data might look something like this:
     60 a
      2 able
      6 about
      1 above
      2 accomplish
      1 acquire
      1 actually
      2 additional
The output is sorted by word, not by count! What we want is the most frequently used words first. Fortunately, this is easy to accomplish, with the help of two more sort options:

-n    do a numeric sort, not a textual one
-r    reverse the order of the sort
The final pipeline looks like this:
$ tr '[A-Z]' '[a-z]' < whats.gnu | tr -cd '[A-Za-z0-9_ \012]' |
> tr -s '[ ]' '\012' | sort | uniq -c | sort -nr
    156 the
     60 a
     58 to
     51 of
     51 and
...
Whew! That's a lot to digest. Yet, the same principles apply. With six commands, on two lines (really one long one split for convenience), we've created a program that does something interesting and useful, in much less time than we could have written a C program to do the same thing.
A minor modification to the above pipeline can give us a simple spelling checker! To determine if you've spelled a word correctly, all you have to do is look it up in a dictionary. If it is not there, then chances are that your spelling is incorrect. So, we need a dictionary. If you have the Slackware Linux distribution, you have the file /usr/lib/ispell/ispell.words, which is a sorted, 38,400 word dictionary.
Now, how to compare our file with the dictionary? As before, we generate a sorted list of words, one per line:
$ tr '[A-Z]' '[a-z]' < whats.gnu | tr -cd '[A-Za-z0-9_ \012]' |
> tr -s '[ ]' '\012' | sort -u | ...
Now, all we need is a list of words that are not in the dictionary. Here is where the comm command comes in:
$ tr '[A-Z]' '[a-z]' < whats.gnu | tr -cd '[A-Za-z0-9_ \012]' |
> tr -s '[ ]' '\012' | sort -u |
> comm -23 - /usr/lib/ispell/ispell.words
The -2 and -3 options eliminate lines that are only in the dictionary (the second file), and lines that are in both files. Lines only in the first file (standard input, our stream of words), are words that are not in the dictionary. These are likely candidates for spelling errors. This pipeline was the first cut at a production spelling checker on Unix.
There are some other tools that deserve brief mention:

grep    search files for text that matches a regular expression
wc      count lines, words, characters
tee     a T-fitting for data pipes; copies data to files and to standard output
sed     the stream editor, an advanced tool
awk     a data manipulation language, another advanced tool
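For instance, tee lets you both save and pass along the data at some point in a pipeline; here is a small sketch (the file name wholist is invented for the example):

$ who | tee wholist | cut -c1-8
arnold
miriam
bill
arnold

The cut output appears on the screen as before, while wholist now holds a copy of the full who output for later inspection.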
The software tools philosophy also espoused the following bit of advice: ``Let someone else do the hard part.'' This means, take something that gives you most of what you need, and then massage it the rest of the way until it's in the form that you want.
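For example, rather than writing your own code to scan the disk and order the results, you can let du and sort do the hard part and merely arrange the plumbing (the directory sizes shown here are, of course, made up):

$ du -s * | sort -nr
  1480 src
   312 doc
    96 bin

du already knows how to measure disk usage, and sort already knows how to order numbers; all that's left is hooking them together.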
As of this writing, all the programs we've discussed are available via anonymous ftp from the GNU Project's archive site, prep.ai.mit.edu.
None of what I have presented in this column is new. The Software Tools philosophy was first introduced in the book Software Tools, by Brian Kernighan and P.J. Plauger (Addison-Wesley, ISBN 0-201-03669-X). This book showed how to write and use software tools. It was written in 1976, using a preprocessor for FORTRAN named ratfor (RATional FORtran). At the time, C was not as ubiquitous as it is now; FORTRAN was. The last chapter presented a ratfor to FORTRAN processor, written in ratfor. ratfor looks an awful lot like C; if you know C, you won't have any problem following the code.
In 1981, the book was updated and made available as Software Tools in Pascal (Addison-Wesley, ISBN 0-201-10342-7). Both books remain in print, and are well worth reading if you're a programmer. They certainly made a major change in how I view programming.
Initially, the programs in both books were available (on 9-track tape) from Addison-Wesley. Unfortunately, this is no longer the case, although you might be able to find copies floating around the Internet. For a number of years, there was an active Software Tools Users Group, whose members had ported the original ratfor programs to essentially every computer system with a FORTRAN compiler. The popularity of the group waned in the middle 1980s as Unix began to spread beyond universities.
With the current proliferation of GNU code and other clones of Unix programs, these programs now receive little attention; modern C versions are much more efficient and do more than these programs do. Nevertheless, as exposition of good programming style, and evangelism for a still-valuable philosophy, these books are unparalleled, and I recommend them highly.
Acknowledgement: I would like to express my gratitude to Brian Kernighan of Bell Labs, the original Software Toolsmith, for reviewing this column.