The basic function of awk
is to search files for lines (or other units of text) that
contain certain patterns. When a line matches one of the
patterns, awk performs specified actions on that
line. awk keeps processing input lines in this way
until the end of the input file is reached.
When you run awk, you specify an awk program
which tells awk what to do. The program consists of
a series of rules. (It may also contain function
definitions, but that is an advanced feature, so we will
ignore it for now. See section User-defined
Functions.) Each rule
specifies one pattern to search
for, and one action to perform
when that pattern is found.
Syntactically, a rule
consists of a pattern followed
by an action. The action is enclosed in curly braces to separate it from the pattern. Rules are usually
separated by newlines. Therefore, an awk program
looks like this:
pattern { action }
pattern { action }
...
- Very Simple: A very simple
example.
- Two Rules: A less simple
one-line example with two rules.
- More Complex: A more
complex example.
- Running gawk: How to run
gawk
programs; includes command line syntax.
- Comments: Adding
documentation to
gawk programs.
- Statements/Lines:
Subdividing or combining statements into lines.
- When: When to use
gawk
and when to use other things.
The following command runs a simple awk program
that searches the input file BBS-list for the string of characters: foo.
(A string of characters is
usually called a string.
The term string is
perhaps based on similar usage in English, such as ``a string of pearls,'' or ``a string of cars in a train.'')
awk '/foo/ { print $0 }' BBS-list
When lines containing foo are found, they are
printed, because print $0 means print the current
line. (Just print by itself means the same thing, so
we could have written that instead.)
You will notice that slashes, /, surround the string foo in the
actual awk program. The slashes indicate that foo
is a pattern to search for.
This type of pattern is called
a regular expression, and is covered in more detail
later (see section Regular Expressions
as Patterns). There are single-quotes around the awk
program so that the shell won't interpret any of it as special
shell characters.
Here is what this program prints:
fooey 555-1234 2400/1200/300 B
foot 555-6699 1200/300 B
macfoo 555-6480 1200/300 A
sabafoo 555-2127 1200/300 C
In an awk rule,
either the pattern or the action can be omitted, but not
both. If the pattern is
omitted, then the action is
performed for every input line. If the action is omitted, the default action is to print all lines that
match the pattern.
Thus, we could leave out the action
(the print statement and the curly braces) in the above example, and
the result would be the same: all lines matching the pattern foo would be
printed. By comparison, omitting the print statement
but retaining the curly braces
makes an empty action that does
nothing; then no lines would be printed.
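The omission rules above can be tried side by side. The following is a minimal sketch that assumes a POSIX-compliant shell; the three input lines and the variable names are made up for illustration (the names merely echo the BBS-list output shown earlier).

```shell
# Hypothetical three-line sample, not the real BBS-list file.
input='fooey 555-1234
barfly 555-7685
macfoo 555-6480'

# Explicit pattern and explicit action.
both=$(printf '%s\n' "$input" | awk '/foo/ { print $0 }')

# Action omitted: the default action prints each matching line.
pattern_only=$(printf '%s\n' "$input" | awk '/foo/')

# Empty action: the braces are present but contain nothing, so nothing is printed.
empty_action=$(printf '%s\n' "$input" | awk '/foo/ { }')

# Pattern omitted: the action runs for every input line.
action_only=$(printf '%s\n' "$input" | awk '{ print $1 }')

printf 'pattern and action:\n%s\n' "$both"
printf 'pattern only:\n%s\n' "$pattern_only"
printf 'empty action: [%s]\n' "$empty_action"
printf 'action only:\n%s\n' "$action_only"
```

The first two commands should print the same two lines, the third should print nothing, and the fourth should print the first field of all three lines.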
The awk utility reads the input files one line at
a time. For each line, awk tries the patterns of
each of the rules. If several patterns match then several actions
are run, in the order in which they appear in the awk
program. If no patterns match, then no actions are run.
After processing all the rules (perhaps none) that match the
line, awk reads the next line (however, see section The next
Statement). This continues until the end of the file is reached. For example, the awk
program:
/12/ { print $0 }
/21/ { print $0 }
contains two rules. The first rule
has the string 12
as the pattern and print
$0 as the action. The
second rule has the string 21 as the pattern and also has print
$0 as the action. Each rule's action is enclosed in its own pair
of braces.
This awk program prints every line that contains
the string 12 or
the string 21. If
a line contains both strings, it is printed twice, once by each rule.
If we run this program on our two sample data files, BBS-list
and inventory-shipped, as shown here:
awk '/12/ { print $0 }
/21/ { print $0 }' BBS-list inventory-shipped
we get the following output:
aardvark 555-5553 1200/300 B
alpo-net 555-3412 2400/1200/300 A
barfly 555-7685 1200/300 A
bites 555-1675 2400/1200/300 A
core 555-2912 1200/300 C
fooey 555-1234 2400/1200/300 B
foot 555-6699 1200/300 B
macfoo 555-6480 1200/300 A
sdace 555-3430 2400/1200/300 A
sabafoo 555-2127 1200/300 C
sabafoo 555-2127 1200/300 C
Jan 21 36 64 620
Apr 21 70 74 514
Note how the line in BBS-list beginning with sabafoo
was printed twice, once for each rule.
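To see which rule printed which line, each action can label its output. This is only a sketch: the three input lines are invented, and the labels "rule 1" and "rule 2" are not part of the original program.

```shell
# Hypothetical input: only the second line contains both "12" and "21".
twice=$(printf '%s\n' 'alpha 1200' 'beta 2127 1200' 'gamma 300' |
  awk '/12/ { print "rule 1: " $0 }
       /21/ { print "rule 2: " $0 }')

echo "$twice"
```

The line beginning with beta appears once under each label, in the order in which the rules appear in the program. The line beginning with gamma matches neither pattern and is not printed.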
Here is an example to give you an idea of what typical awk
programs do. This example shows how awk can be used
to summarize, select, and rearrange the output of another
utility. It uses features that haven't been covered yet, so don't
worry if you don't understand all the details.
ls -l | awk '$5 == "Nov" { sum += $4 }
END { print sum }'
This command prints the total number
of bytes in all the files in the current directory that were last
modified in November (of any year). (In the C shell you would need to type a
semicolon and then a backslash at the end of the first line; in a
POSIX-compliant shell, such as the Bourne shell or the
Bourne-Again shell, you can type the example as shown.)
The ls -l part of this example is a command that
gives you a listing of the files in a directory, including file
size and date. Its output looks like this:
-rw-r--r-- 1 close 1933 Nov 7 13:05 Makefile
-rw-r--r-- 1 close 10809 Nov 7 13:03 gawk.h
-rw-r--r-- 1 close 983 Apr 13 12:14 gawk.tab.h
-rw-r--r-- 1 close 31869 Jun 15 12:20 gawk.y
-rw-r--r-- 1 close 22414 Nov 7 13:03 gawk1.c
-rw-r--r-- 1 close 37455 Nov 7 13:03 gawk2.c
-rw-r--r-- 1 close 27511 Dec 9 13:07 gawk3.c
-rw-r--r-- 1 close 7989 Nov 7 13:03 gawk4.c
The first field contains
read-write permissions, the second field
contains the number of links to
the file, and the third field
identifies the owner of the file. The fourth field contains the size of the
file in bytes. The fifth, sixth, and seventh fields contain the
month, day, and time, respectively, that the file was last
modified. Finally, the eighth field
contains the name of the file.
The $5 == "Nov" in our awk
program is an expression that tests whether the fifth field of the output from ls
-l matches the string Nov.
Each time a line has the string Nov
in its fifth field, the action { sum += $4 }
is performed. This adds the fourth field
(the file size) to the variable sum. As a result,
when awk has finished reading all the input lines, sum
is the sum of the sizes of files whose lines matched the pattern. (This works because awk
variables are automatically initialized to zero.)
After the last line of output from ls has been
processed, the END rule
is executed, and the value of sum is printed. In
this example, the value of sum would be 80600.
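The total can be checked without depending on a live directory by feeding awk the sample listing shown above. This sketch assumes a POSIX-compliant shell; the variable names listing and total are illustrative. Note that many current ls implementations also print a group column, which would shift the month and size to different fields, so the field numbers here apply to the listing format shown in this section.

```shell
# The sample "ls -l" output from this section, stored in a shell variable.
listing='-rw-r--r-- 1 close 1933 Nov 7 13:05 Makefile
-rw-r--r-- 1 close 10809 Nov 7 13:03 gawk.h
-rw-r--r-- 1 close 983 Apr 13 12:14 gawk.tab.h
-rw-r--r-- 1 close 31869 Jun 15 12:20 gawk.y
-rw-r--r-- 1 close 22414 Nov 7 13:03 gawk1.c
-rw-r--r-- 1 close 37455 Nov 7 13:03 gawk2.c
-rw-r--r-- 1 close 27511 Dec 9 13:07 gawk3.c
-rw-r--r-- 1 close 7989 Nov 7 13:03 gawk4.c'

# Add the fourth field (size) of every line whose fifth field is "Nov";
# the END rule prints the total after the last line has been read.
total=$(printf '%s\n' "$listing" |
  awk '$5 == "Nov" { sum += $4 }
       END { print sum }')

echo "$total"
```

Five of the eight lines have Nov in the fifth field, and their sizes add up to the 80600 quoted above.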
These more advanced awk techniques are covered in
later sections (see section Overview
of Actions). Before you can move on to more advanced awk
programming, you have to know how awk interprets
your input and displays your output. By manipulating fields and
using print statements, you can produce some very
useful and spectacular looking reports.
There are several ways to run an awk program. If
the program is short, it is easiest to include it in the command
that runs awk, like this:
awk 'program' input-file1 input-file2 ...
where program consists of a series of patterns and
actions, as described earlier.
When the program is long, it is usually more convenient to put
it in a file and run it with a command like this:
awk -f program-file input-file1 input-file2 ...
- One-shot: Running a short
throw-away
awk program.
- Read Terminal: Using no
input files (input from terminal instead).
- Long: Putting permanent
awk
programs in files.
- Executable Scripts: Making
self-contained
awk programs.
Once you are familiar with awk, you will often
type simple programs at the moment you want to use them. Then you
can write the program as the first argument of the awk
command, like this:
awk 'program' input-file1 input-file2 ...
where program consists of a series of patterns
and actions, as described earlier.
This command format
instructs the shell to start awk and use the program
to process records in the input file(s). There are single quotes
around program so that the shell doesn't interpret any awk
characters as special shell characters. They also cause the shell
to treat all of program as a single argument for awk
and allow program to be more than one line long.
This format is also useful
for running short or medium-sized awk programs from
shell scripts, because it avoids the need for a separate file for
the awk program. A self-contained shell script is
more reliable since there are no other files to misplace.
You can also run awk without any input files. If
you type the command line:
awk 'program'
then awk applies the program to the standard
input, which usually means whatever you type on the
terminal. This continues until you indicate end-of-file by typing Control-d.
For example, if you execute this command:
awk '/th/'
whatever you type next is taken as data for that awk
program. If you go on to type the following data:
Kathy
Ben
Tom
Beth
Seth
Karen
Thomas
Control-d
then awk prints this output:
Kathy
Beth
Seth
Each of these lines contains the pattern th.
Notice that it did not recognize Thomas as matching
the pattern. The awk
language is case sensitive, and matches patterns
exactly. (However, you can override this with the variable IGNORECASE.
See section Case-sensitivity in
Matching.)
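The same session can be reproduced non-interactively by piping the names into awk instead of typing them. The second command is an alternative that is not described in this section: it lowercases each line with tolower before matching, assuming an awk that provides that function, rather than using IGNORECASE.

```shell
# The names typed in the example above, supplied on standard input.
names='Kathy
Ben
Tom
Beth
Seth
Karen
Thomas'

# Case-sensitive match: "Thomas" is skipped because "Th" is not "th".
matched=$(printf '%s\n' "$names" | awk '/th/')

# Assumed alternative: compare a lowercased copy of the line to the pattern.
matched_nocase=$(printf '%s\n' "$names" | awk 'tolower($0) ~ /th/')

printf 'case-sensitive:\n%s\n' "$matched"
printf 'lowercased first:\n%s\n' "$matched_nocase"
```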
Sometimes your awk programs can be very long. In
this case it is more convenient to put the program into a
separate file. To tell awk to use that file for its
program, you type:
awk -f source-file input-file1 input-file2 ...
The -f instructs the awk utility to
get the awk program from the file source-file.
Any file name can be used for source-file. For
example, you could put the program:
/th/
into the file th-prog. Then this command:
awk -f th-prog
does the same thing as this one:
awk '/th/'
which was explained earlier (see section Running awk without Input
Files). Note that you don't usually need single quotes around the file name that you specify with -f, because most file names don't contain any of the
shell's special characters. Notice that in th-prog, the awk
program did not have single quotes around it. The quotes are only
needed for programs that are provided on the awk
command line.
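The equivalence of the two forms can be checked directly. This sketch writes th-prog into a temporary directory (the directory and the variable names are illustrative) and compares the output of both commands on the same input.

```shell
# Create the program file in a scratch directory; note that the file
# contains the bare program, with no single quotes around it.
workdir=$(mktemp -d)
printf '%s\n' '/th/' > "$workdir/th-prog"

# Run the program from the file, then inline on the command line.
from_file=$(printf '%s\n' Kathy Ben Beth | awk -f "$workdir/th-prog")
inline=$(printf '%s\n' Kathy Ben Beth | awk '/th/')

printf 'from file:\n%s\n' "$from_file"
printf 'inline:\n%s\n' "$inline"

rm -rf "$workdir"
```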
If you want to identify your awk program files
clearly as such, you can add the extension .awk to the
file name. This doesn't affect the execution of the awk
program, but it does make ``housekeeping'' easier.
Once you have learned awk, you may want to write
self-contained awk scripts, using the #!
script mechanism. You can do this on systems that support it.
For example, you could create a text file named hello,
containing the following (where BEGIN is a feature
we have not yet discussed):
#! /bin/awk -f
# a sample awk program
BEGIN { print "hello, world" }
After making this file executable (with the chmod
command), you can simply type:
hello
at the shell, and the system arranges to run awk as if you had typed:
awk -f hello
Self-contained awk scripts are useful when you
want to write a program which users can invoke without knowing
that the program is written in awk.
If your system does not support the #! mechanism,
you can get a similar effect using a regular shell script. It
would look something like this:
: The colon makes sure this script is executed by the Bourne shell.
awk 'program' "$@"
Using this technique, it is vital to enclose the program
in single quotes to protect it from interpretation by the shell.
If you omit the quotes, only a shell wizard can predict the
results.
The "$@" causes the shell to forward
all the command line arguments to the awk program,
without interpretation. The first line, which starts with a
colon, is used so that this shell script will work even if
invoked by a user who uses the C
shell.
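A wrapper of this kind might be built and tried as follows. This is a sketch under stated assumptions: the script name findth, the data file name, and the embedded /th/ program are all made up for illustration. The data file name deliberately contains a space to show that "$@" passes each argument through intact.

```shell
scriptdir=$(mktemp -d)

# Write the wrapper; the quoted here-document keeps the shell from
# expanding "$@" or touching the single-quoted awk program.
cat > "$scriptdir/findth" <<'EOF'
: The colon makes sure this script is executed by the Bourne shell.
awk '/th/' "$@"
EOF
chmod +x "$scriptdir/findth"

# A hypothetical data file whose name contains a space.
printf '%s\n' Kathy Ben > "$scriptdir/names one"

# Invoked through sh here so that the sketch does not depend on how the
# calling shell treats scripts without a #! line; once the file is
# executable it can also be run by name.
wrapper_out=$(sh "$scriptdir/findth" "$scriptdir/names one")
echo "$wrapper_out"

rm -rf "$scriptdir"
```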
A comment is some text that is included in a
program for the sake of human readers, and that is not really
part of the program. Comments can explain what the program does,
and how it works. Nearly all programming languages have
provisions for comments, because programs are typically hard to
understand without their extra help.
In the awk language, a comment starts with the
sharp sign character, #, and continues to the end of
the line. The awk language ignores the rest of a
line following a sharp sign. For example, we could have put the
following into th-prog:
# This program finds records containing the pattern th. This is how
# you continue comments on additional lines.
/th/
You can put comment lines into keyboard-composed throw-away awk
programs also, but this usually isn't very useful; the purpose of
a comment is to help you or another person understand the program
at a later time.
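Comments may also follow a rule on the same line. The sketch below stores a commented version of th-prog in a temporary directory (the directory and variable names are illustrative) and runs it; the comments have no effect on the output.

```shell
progdir=$(mktemp -d)

# A commented program file; everything from # to the end of a line is ignored.
cat > "$progdir/th-prog" <<'EOF'
# This program finds records containing the pattern th.
/th/ { print $0 }   # print the whole matching record
EOF

commented=$(printf '%s\n' Kathy Ben Beth | awk -f "$progdir/th-prog")
echo "$commented"

rm -rf "$progdir"
```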
Most often, each line in an awk program is a
separate statement or separate rule,
like this:
awk '/12/ { print $0 }
/21/ { print $0 }' BBS-list inventory-shipped
But sometimes statements can be more than one line, and lines
can contain several statements. You can split a statement into
multiple lines by inserting a newline after any of the following:
, { ? : || && do else
A newline at any other point is considered the end of the
statement. (Splitting lines after ? and :
is a minor gawk extension. The ? and :
referred to here belong to the three-operand conditional expression
described in section Conditional
Expressions.)
If you would like to split a single statement into two lines
at a point where a newline would terminate it, you can continue
it by ending the first line with a backslash character, \.
This is allowed absolutely anywhere in the statement, even in the
middle of a string or regular
expression. For example:
awk '/This program is too long, so continue it\
on the next line/ { print $1 }'
We have generally not used backslash continuation in the
sample programs in this manual. Since in gawk there
is no limit on the length of a line, it is never strictly
necessary; it just makes programs prettier. We have preferred to
make them even more pretty by keeping the statements short.
Backslash continuation is most useful when your awk
program is in a separate source file, instead of typed in on the
command line. You should also note that many awk
implementations are more picky about where you may use backslash
continuation. For maximal portability of your awk
programs, it is best not to split your lines in the middle of a
regular expression or a string.
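Both ways of splitting a statement can be seen in one small program file. This sketch follows the portability advice above by placing the backslash inside an ordinary expression rather than inside a string or regular expression; the file name sum-prog and the input numbers are invented.

```shell
contdir=$(mktemp -d)

cat > "$contdir/sum-prog" <<'EOF'
# A newline is allowed after &&, so the pattern spans two lines.
$1 > 1 &&
$2 > 1 {
    # The backslash continues the print statement onto the next line.
    print $1 + \
          $2
}
EOF

# Only lines whose two fields are both greater than 1 are summed.
cont_out=$(printf '%s\n' '2 3' '1 5' '4 4' | awk -f "$contdir/sum-prog")
echo "$cont_out"

rm -rf "$contdir"
```

The program is kept in a file, which is where backslash continuation is most useful and where the C shell caveat described below does not apply.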
Warning: backslash continuation does not work as
described above with the C
shell. Continuation with backslash works for awk
programs in files, and also for one-shot programs provided
you are using a POSIX-compliant shell, such as the Bourne shell
or the Bourne-again shell. But the C
shell used on Berkeley Unix behaves differently! There, you must
use two backslashes in a row, followed by a newline.
When awk statements within one rule are short, you might want to
put more than one of them on a line. You do this by separating
the statements with a semicolon, ;. This also
applies to the rules themselves. Thus, the previous program could
have been written:
/12/ { print $0 } ; /21/ { print $0 }
Note: the requirement that rules on the same
line must be separated with a semicolon is a recent change in the awk
language; it was done for consistency with the treatment of
statements within an action.
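Both uses of the semicolon can be tried quickly. In this sketch the input lines and variable names are made up; the first command separates two rules on one line, and the second separates two statements inside a single action.

```shell
# Two rules on one line, separated by a semicolon.
semi_out=$(printf '%s\n' 'a 12' 'b 21' 'c 30' |
  awk '/12/ { print $0 } ; /21/ { print $0 }')

# Two statements in one action, separated by a semicolon:
# count the lines and accumulate the first field.
stats=$(printf '%s\n' 3 4 5 |
  awk '{ n++; sum += $1 }
       END { print n, sum }')

echo "$semi_out"
echo "$stats"
```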
You might wonder how awk might be useful for you.
Using additional utility programs, more advanced patterns, field separators, arithmetic
statements, and other selection criteria, you can produce much
more complex output. The awk language is very useful
for producing reports from large amounts of raw data, such as
summarizing information from the output of other utility programs
like ls. (See section A
More Complex Example.)
Programs written with awk are usually much
smaller than they would be in other languages. This makes awk
programs easy to compose and use. Often awk programs
can be quickly composed at your terminal, used once, and thrown
away. Since awk programs are interpreted, you can
avoid the usually lengthy edit-compile-test-debug cycle of
software development.
Complex programs have been written in awk,
including a complete retargetable assembler for 8-bit
microprocessors (see section Glossary,
for more information) and a microcode assembler for a special
purpose Prolog computer. However, awk's capabilities
are strained by tasks of such complexity.
If you find yourself writing awk scripts of more
than, say, a few hundred lines, you might consider using a
different programming language. Emacs Lisp is a good choice if
you need sophisticated string
or pattern matching
capabilities. The shell is also good at string and pattern matching; in addition, it
allows powerful use of the system utilities. More conventional
languages, such as C, C++, and Lisp, offer better
facilities for system programming and for managing the complexity
of large programs. Programs in these languages may require more
lines of source code than the equivalent awk
programs, but they are easier to maintain and usually run more
efficiently.