RTR's Win95Pak: The GAWK Manual - Sample Program
The following example is a complete awk program, which prints
the number of occurrences of each word in its input. It illustrates the
associative nature of awk arrays by using strings as subscripts. It
also demonstrates the for x in array construction.
Finally, it shows how awk can be used in conjunction with other
utility programs to do a useful task of some complexity with a minimum of
effort. Some explanations follow the program listing.
awk '
# Print list of word frequencies
{
    for (i = 1; i <= NF; i++)
        freq[$i]++
}

END {
    for (word in freq)
        printf "%s\t%d\n", word, freq[word]
}'
The first thing to notice about this program is that it has two rules. The
first rule, because it has an empty pattern, is executed on every line of
the input. It uses awk's field-accessing mechanism
(see section Examining Fields) to pick out the individual words from
the line, and the built-in variable NF (see section Built-in Variables)
to know how many fields are available.
For each input word, an element of the array freq is incremented to
reflect that the word has been seen an additional time.
The second rule, because it has the pattern END, is not executed
until the input has been exhausted. It prints out the contents of the
freq table that has been built up inside the first action.
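As a quick check of the two rules described above, the following sketch feeds the program a made-up two-line input. The sample text and the shell variable name result are assumptions for illustration. The trailing sort is there only to make the listing repeatable, because the order in which for (word in freq) visits the array is not specified.

```shell
# Hypothetical sample input; any whitespace-separated text would do.
result=$(printf 'the cat saw the dog\nthe dog ran\n' | awk '
# First rule: runs on every input line and counts each field.
{
    for (i = 1; i <= NF; i++)
        freq[$i]++
}

# Second rule: runs once, after the last line has been read.
END {
    for (word in freq)
        printf "%s\t%d\n", word, freq[word]
}' | LC_ALL=C sort)

printf '%s\n' "$result"
```

With this input, "the" is counted three times, "dog" twice, and the remaining words once each.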
Note that this program has several problems that would prevent it from being
useful by itself on real text files:
- Words are detected using the awk convention that fields are
  separated by whitespace and that other characters in the input (except
  newlines) don't have any special meaning to awk. This means that
  punctuation characters count as part of words.
- The awk language considers upper and lower case characters to be
  distinct. Therefore, foo and Foo are not treated by this
  program as the same word. This is undesirable since in normal text, words
  are capitalized if they begin sentences, and a frequency analyzer should not
  be sensitive to that.
- The output does not come out in any useful order. You're more likely to be
  interested in which words occur most frequently, or having an alphabetized
  table of how frequently each word occurs.
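The first two problems are easy to reproduce. In the sketch below, the one-line input is an invented example, and sort is applied in the C locale purely so that the listing comes out the same on every run.

```shell
# Hypothetical input: the same word three times, with differing
# capitalization and a trailing period.
result=$(printf 'Foo foo foo.\n' | awk '
{
    for (i = 1; i <= NF; i++)
        freq[$i]++
}

END {
    for (word in freq)
        printf "%s\t%d\n", word, freq[word]
}' | LC_ALL=C sort)

printf '%s\n' "$result"
```

The program reports three different "words" (Foo, foo, and foo.), each with a count of 1, rather than one word seen three times.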
The way to solve these problems is to use some of the more advanced
features of the awk language. First, we use tolower to remove
case distinctions. Next, we use gsub to remove punctuation
characters. Finally, we use the system sort utility to process the
output of the awk script. Here is the new version of
the program:
awk '
# Print list of word frequencies
{
    $0 = tolower($0)                 # remove case distinctions
    gsub(/[^a-z0-9_ \t]/, "", $0)    # remove punctuation
    for (i = 1; i <= NF; i++)
        freq[$i]++
}

END {
    for (word in freq)
        printf "%s\t%d\n", word, freq[word]
}'
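To see what tolower and gsub do to a line before it is counted, here is a small sketch that runs the program on a single invented line mixing capitalization and punctuation. The input text and the shell variable name result are assumptions for illustration.

```shell
# Hypothetical input: one word written three ways.
result=$(printf 'Foo foo, FOO.\n' | awk '
{
    $0 = tolower($0)                 # "Foo foo, FOO." becomes "foo foo, foo."
    gsub(/[^a-z0-9_ \t]/, "", $0)    # drop "," and "." leaving "foo foo foo"
    for (i = 1; i <= NF; i++)        # $0 was changed, so fields are re-split
        freq[$i]++
}

END {
    for (word in freq)
        printf "%s\t%d\n", word, freq[word]
}')

printf '%s\n' "$result"
```

All three spellings now fall into the single entry foo, with a count of 3.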
Assuming we have saved this program (the text between the single quotes,
without the surrounding awk ' and ') in a file named frequency.awk,
and that the data is in file1, the following pipeline
awk -f frequency.awk file1 | sort -k 2nr
produces a table of the words appearing in file1 in order of
decreasing frequency.
The awk program suitably massages the data and produces a word
frequency table, which is not ordered.
The awk script's output is then sorted by the sort command and
printed on the terminal. The options given to sort in this example
specify to sort using the second field of each input line (older versions
of sort spelled this +1, meaning "skip one field", a form that current
implementations may reject),
that the sort keys should be treated as numeric quantities (otherwise
15 would come before 5), and that the sorting should be done
in descending (reverse) order.
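The effect of the numeric option can be checked on a tiny hand-made table. This sketch uses the POSIX key syntax -k 2, which selects the second field just as +1 does in older versions of sort. The three table rows and the shell variable names are assumptions for illustration.

```shell
# Hypothetical frequency table in the word<TAB>count format.
table=$(printf 'the\t100\ncat\t5\ndog\t15\n')

# Descending order with the counts compared as text:
# "5" sorts above "15", which sorts above "100".
as_text=$(printf '%s\n' "$table" | LC_ALL=C sort -k 2r)

# Descending order with the counts compared as numbers.
as_numbers=$(printf '%s\n' "$table" | LC_ALL=C sort -k 2nr)

printf 'text comparison:\n%s\n\nnumeric comparison:\n%s\n' "$as_text" "$as_numbers"
```

Only the numeric comparison places the most frequent word (the, 100) first. The text comparison lists cat (5) ahead of dog (15) and the (100).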
We could have even done the sort from within the program, by
changing the END action to:
END {
    sort = "sort -k 2nr"    # POSIX spelling of the older "sort +1 -nr"
    for (word in freq)
        printf "%s\t%d\n", word, freq[word] | sort
    close(sort)
}'
See the general operating system documentation for more information on how
to use the sort command.
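For reference, this is roughly how the pieces fit together when the sort is started from inside awk. It is a minimal sketch: the input line is invented, the tolower and gsub steps are left out to keep it short, and the sort command uses the POSIX key syntax -k 2nr.

```shell
# Hypothetical input with distinct counts: b three times, a twice, c once.
result=$(printf 'b a b c b a\n' | awk '
{
    for (i = 1; i <= NF; i++)
        freq[$i]++
}

END {
    sort = "sort -k 2nr"     # command that awk starts as a child process
    for (word in freq)
        printf "%s\t%d\n", word, freq[word] | sort
    close(sort)              # close the pipe so sort sees end of input and writes its output
}')

printf '%s\n' "$result"
```

Because sort inherits awk's standard output, the sorted table appears wherever awk's own output would have gone. Here that is b (3), then a (2), then c (1).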