Patterns in awk control the execution of rules: a rule is executed when its pattern matches the current input
record. This chapter tells all about how to write patterns.
Here is a summary of the types of patterns supported in awk.
- /regular expression/
  A regular expression as a pattern. It matches when the text of the
  input record fits the regular expression. (See section Regular
  Expressions as Patterns.)
- expression
  A single expression. It matches when its value is nonzero (if a
  number) or nonnull (if a string). (See section Expressions as Patterns.)
- pat1, pat2
  A pair of patterns separated by a comma, specifying a range of
  records. (See section Specifying Record Ranges with Patterns.)
- BEGIN, END
  Special patterns to supply start-up or clean-up information to awk.
  (See section BEGIN and END Special Patterns.)
- (empty pattern)
  The empty pattern matches every input record. (See section The
  Empty Pattern.)
A regular expression, or regexp, is a way of
describing a class of strings. A regular expression enclosed in
slashes (/) is an awk pattern that matches every input
record whose text belongs to that class.
The simplest regular expression is a sequence of letters,
numbers, or both. Such a regexp
matches any string that
contains that sequence. Thus, the regexp foo
matches any string containing foo.
Therefore, the pattern /foo/
matches any input record containing foo. Other kinds
of regexps let you specify more complicated classes of strings.
A regular expression can be used as a pattern by enclosing it in
slashes. Then the regular expression is matched against the
entire text of each record. (Normally, it only needs to match
some part of the text in order to succeed.) For example, this
prints the second field of each
record that contains foo anywhere:
awk '/foo/ { print $2 }' BBS-list
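To try this without the BBS-list file, the following sketch pipes three made-up records into the same rule. The sample records and the shell variable out are assumptions for illustration only.

```shell
# Three invented records stand in for BBS-list.
out=$(printf 'foo 555-1234\nbar 555-6789\nseafood 555-0000\n' |
      awk '/foo/ { print $2 }')
# Records 1 and 3 contain "foo" somewhere, so their second fields are printed.
echo "$out"
```

Note that the third record is selected even though foo appears only in the middle of a word.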
Regular expressions can also be used in comparison
expressions. Then you can specify the string to match against; it need
not be the entire current input record. These comparison
expressions can be used as patterns or in if, while, for,
and do statements.
- exp ~ /regexp/
  This is true if the expression exp (taken as a character string) is
  matched by regexp. The following example matches, or selects, all
  input records with the upper-case letter J somewhere in the first
  field:
    awk '$1 ~ /J/' inventory-shipped
  So does this:
    awk '{ if ($1 ~ /J/) print }' inventory-shipped
- exp !~ /regexp/
  This is true if the expression exp (taken as a character string) is
  not matched by regexp. The following example matches, or selects,
  all input records whose first field does not contain the upper-case
  letter J:
    awk '$1 !~ /J/' inventory-shipped
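As a small sketch of the two operators side by side, the commands below run both patterns over the same input. The four records are invented and only imitate the layout of inventory-shipped; the shell variable names are likewise assumptions.

```shell
# Invented records: a month name, then a count.
data='Jan 13
Feb 15
Jun 31
Mar 24'
# Select records whose first field contains an upper-case J.
with_j=$(printf '%s\n' "$data" | awk '$1 ~ /J/')
# Select records whose first field does not contain an upper-case J.
without_j=$(printf '%s\n' "$data" | awk '$1 !~ /J/')
echo "$with_j"
echo "$without_j"
```

Every record is selected by exactly one of the two patterns.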
The right-hand side of a ~ or !~
operator need not be a constant regexp
(i.e., a string of characters
between slashes). It may be any expression. The expression is
evaluated, and converted if necessary to a string; the contents of the
string are used as the regexp. A regexp that is computed in this
way is called a dynamic regexp.
For example:
identifier_regexp = "[A-Za-z_][A-Za-z_0-9]*"
$0 ~ identifier_regexp
sets identifier_regexp to a regexp that describes awk
variable names, and tests whether the input record contains a match for this regexp.
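Here is a minimal runnable sketch of a dynamic regexp. It assumes the regexp is stored in a BEGIN rule and uses three invented records; printing the record number with NR is added only to make the result easy to read.

```shell
# The regexp is built as a string, then used on the right-hand side of ~.
out=$(printf 'total = 42\n1234 5678\n_tmp1\n' |
      awk 'BEGIN { identifier_regexp = "[A-Za-z_][A-Za-z_0-9]*" }
           $0 ~ identifier_regexp { print NR ": " $0 }')
# Record 2 holds only digits and a space, so it contains no identifier.
echo "$out"
```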
You can combine regular expressions with the following
characters, called regular expression operators, or metacharacters,
to increase the power and versatility of regular expressions.
Here is a table of metacharacters. All characters not listed
in the table stand for themselves.
- ^
  This matches the beginning of the string or the beginning of a line
  within the string. For example:
    ^@chapter
  matches the @chapter at the beginning of a string, and can be used
  to identify chapter beginnings in Texinfo source files.
- $
  This is similar to ^, but it matches only at the end of a string or
  the end of a line within the string. For example:
    p$
  matches a record that ends with a p.
- .
  This matches any single character except a newline. For example:
    .P
  matches any single character followed by a P in a string. Using
  concatenation we can make regular expressions like U.A, which
  matches any three-character sequence that begins with U and ends
  with A.
- [...]
  This is called a character set. It matches any one of the
  characters that are enclosed in the square brackets. For example:
    [MVX]
  matches any one of the characters M, V, or X in a string.
  Ranges of characters are indicated by using a hyphen between the
  beginning and ending characters, and enclosing the whole thing in
  brackets. For example:
    [0-9]
  matches any digit.
  To include the character \, ], - or ^ in a character set, put a \
  in front of it. For example:
    [d\]]
  matches either d or ].
  This treatment of \ is compatible with other awk implementations,
  and is also mandated by the POSIX Command Language and Utilities
  standard. The regular expressions in awk are a superset of the
  POSIX specification for Extended Regular Expressions (EREs). POSIX
  EREs are based on the regular expressions accepted by the
  traditional egrep utility.
  In egrep syntax, backslash is not syntactically special within
  square brackets. This means that special tricks have to be used to
  represent the characters ], - and ^ as members of a character set.
  In egrep syntax, to match -, write it as ---, which is a range
  containing only -. You may also give - as the first or last
  character in the set. To match ^, put it anywhere except as the
  first character of a set. To match a ], make it the first character
  in the set. For example:
    []d^]
  matches either ], d or ^.
- [^ ...]
  This is a complemented character set. The first character after
  the [ must be a ^. It matches any character except those in the
  square brackets (or newline). For example:
    [^0-9]
  matches any character that is not a digit.
- |
  This is the alternation operator and it is used to specify
  alternatives. For example:
    ^P|[0-9]
  matches any string that matches either ^P or [0-9]. This means it
  matches any string that contains a digit or starts with P.
  The alternation applies to the largest possible regexps on either
  side.
- (...)
  Parentheses are used for grouping in regular expressions as in
  arithmetic. They can be used to concatenate regular expressions
  containing the alternation operator, |.
- *
  This symbol means that the preceding regular expression is to be
  repeated as many times as possible to find a match. For example:
    ph*
  applies the * symbol to the preceding h and looks for matches to
  one p followed by any number of hs. This will also match just p if
  no hs are present.
  The * repeats the smallest possible preceding expression. (Use
  parentheses if you wish to repeat a larger expression.) It finds as
  many repetitions as possible. For example:
    awk '/\(c[ad][ad]*r x\)/ { print }' sample
  prints every record in the input containing a string of the form
  (car x), (cdr x), (cadr x), and so on.
- +
  This symbol is similar to *, but the preceding expression must be
  matched at least once. This means that:
    wh+y
  would match why and whhy but not wy, whereas wh*y would match all
  three of these strings. This is a simpler way of writing the last *
  example:
    awk '/\(c[ad]+r x\)/ { print }' sample
- ?
  This symbol is similar to *, but the preceding expression can be
  matched once or not at all. For example:
    fe?d
  will match fed and fd, but nothing else.
- \
  This is used to suppress the special meaning of a character when
  matching. For example:
    \$
  matches the character $.
  The escape sequences used for string constants (see section
  Constant Expressions) are valid in regular expressions as well;
  they are also introduced by a \.
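The repetition and escape operators from the table can be tried out with a short sketch like the following. The word list and shell variable names are invented for illustration; the regexps are anchored with ^ and $ so that each record must match as a whole.

```shell
# Invented one-word records, plus one record containing a dollar sign.
words='why
whhy
wy
fed
fd
feed
cost $5'
# wh+y needs at least one h; wh*y also accepts none.
plus=$(printf '%s\n' "$words" | awk '/^wh+y$/')
star=$(printf '%s\n' "$words" | awk '/^wh*y$/')
# fe?d allows the e to appear once or not at all.
opt=$(printf '%s\n' "$words" | awk '/^fe?d$/')
# \$ matches a literal dollar sign rather than the end of the string.
dollar=$(printf '%s\n' "$words" | awk '/\$/')
echo "$plus"; echo "$star"; echo "$opt"; echo "$dollar"
```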
In regular expressions, the *, +,
and ? operators have the highest precedence,
followed by concatenation, and
finally by |. As in arithmetic, parentheses can
change how operators are grouped.
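To see how precedence affects a match, the sketch below compares regexps with and without parentheses. The input records and variable names are made up for the example.

```shell
data='abx
xcd
acd
abd
xyz'
# Without parentheses, | splits the whole regexp: "^ab" or "cd$".
loose=$(printf '%s\n' "$data" | awk '/^ab|cd$/')
# With parentheses, only the middle alternates: a, then b or c, then d.
tight=$(printf '%s\n' "$data" | awk '/^a(b|c)d$/')
# + binds tighter than concatenation: ab+ repeats only the b.
rep_b=$(printf 'abb\nabab\n' | awk '/^ab+$/')
# Parentheses make + repeat the whole group ab.
rep_ab=$(printf 'abb\nabab\n' | awk '/^(ab)+$/')
echo "$loose"; echo "$tight"; echo "$rep_b"; echo "$rep_ab"
```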
Case is normally significant in regular expressions, both when
matching ordinary characters (i.e., not metacharacters), and
inside character sets. Thus a w in a regular
expression matches only a lower case w and not an
upper case W.
The simplest way to do a case-independent match is to use a
character set: [Ww]. However, this can be cumbersome
if you need to use it often; and it can make the regular
expressions harder for humans to read. There are two other
alternatives that you might prefer.
One way to do a case-insensitive match at a particular point
in the program is to convert the data to a single case, using the tolower
or toupper built-in string
functions (which we haven't discussed yet; see section Built-in Functions for String
Manipulation). For example:
tolower($1) ~ /foo/ { ... }
converts the first field to
lower case before matching against it.
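A complete, runnable version of this idea might look like the sketch below. The three input records are invented; the action prints the second field of each selected record.

```shell
# The first field is folded to lower case only for the comparison;
# the record itself is left unchanged.
out=$(printf 'Foo 1\nFOOD 2\nbar 3\n' |
      awk 'tolower($1) ~ /foo/ { print $2 }')
echo "$out"
```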
Another method is to set the variable IGNORECASE
to a nonzero value (see section Built-in
Variables). When IGNORECASE is not zero, all regexp operations ignore case.
Changing the value of IGNORECASE dynamically
controls the case sensitivity of your program as it runs. Case is
significant by default because IGNORECASE (like most
variables) is initialized to zero.
x = "aB"
if (x ~ /ab/) ...   # this test will fail
IGNORECASE = 1
if (x ~ /ab/) ...   # now it will succeed
In general, you cannot use IGNORECASE to make
certain rules case-insensitive and other rules case-sensitive,
because there is no way to set IGNORECASE just for
the pattern of a particular rule. To do this, you must use
character sets or tolower. However, one thing you
can do only with IGNORECASE is turn case-sensitivity
on or off dynamically for all the rules at once.
IGNORECASE can be set on the command line, or in
a BEGIN rule.
Setting IGNORECASE from the command line is a way to
make a program case-insensitive without having to edit it.
The value of IGNORECASE has no effect if gawk
is in compatibility mode (see the section on invoking awk).
Comparison patterns test relationships such as
equality between two strings or numbers. They are a special case
of expression patterns (see section Expressions
as Patterns). They are written with relational operators,
which are a superset of those in C.
Here is a table of them:
- x < y
  True if x is less than y.
- x <= y
  True if x is less than or equal to y.
- x > y
  True if x is greater than y.
- x >= y
  True if x is greater than or equal to y.
- x == y
  True if x is equal to y.
- x != y
  True if x is not equal to y.
- x ~ y
  True if x matches the regular expression described by y.
- x !~ y
  True if x does not match the regular expression described by y.
The operands of a relational operator are compared as numbers
if they are both numbers. Otherwise they are converted to, and
compared as, strings (see section Conversion
of Strings and Numbers, for the detailed rules). Strings are
compared by comparing the first character of each, then the
second character of each, and so on, until there is a difference.
If the two strings are equal until the shorter one runs out, the
shorter one is considered to be less than the longer one. Thus, "10"
is less than "9", and "abc"
is less than "abcd".
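The difference between numeric and string comparison can be checked with a short sketch. The variable names num, str, and pre are assumptions; each holds the result of one comparison, which is 1 for true and 0 for false.

```shell
out=$(awk 'BEGIN {
    num = (10 < 9)          # two numbers: compared numerically
    str = ("10" < "9")      # two string constants: compared character by character
    pre = ("abc" < "abcd")  # the shorter string is a prefix of the longer one
    print num, str, pre
}')
echo "$out"
```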
The left operand of the ~ and !~
operators is a string. The
right operand is either a constant regular expression enclosed in
slashes (/regexp/),
or any expression, whose string
value is used as a dynamic regular expression (see section How to Use Regular Expressions).
The following example prints the second field of each input record whose
first field is precisely foo.
awk '$1 == "foo" { print $2 }' BBS-list
Contrast this with the following regular expression match,
which would accept any record with a first field that contains foo:
awk '$1 ~ "foo" { print $2 }' BBS-list
or, equivalently, this one:
awk '$1 ~ /foo/ { print $2 }' BBS-list
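The contrast is easiest to see on the same input. In this sketch the three records are invented stand-ins for BBS-list, and the shell variable names are assumptions.

```shell
data='foo 1
food 2
barfoo 3'
# Exact string equality: only the first record qualifies.
exact=$(printf '%s\n' "$data" | awk '$1 == "foo" { print $2 }')
# Regexp match: any first field containing foo qualifies.
contains=$(printf '%s\n' "$data" | awk '$1 ~ /foo/ { print $2 }')
echo "$exact"
echo "$contains"
```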
A boolean pattern
is an expression which combines other patterns using the boolean
operators ``or'' (||), ``and'' (&&),
and ``not'' (!). Whether the boolean pattern matches an input record
depends on whether its subpatterns match.
For example, the following command prints all records in the
input file BBS-list that contain both 2400
and foo.
awk '/2400/ && /foo/' BBS-list
The following command prints all records in the input file BBS-list
that contain either 2400 or foo,
or both.
awk '/2400/ || /foo/' BBS-list
The following command prints all records in the input file BBS-list
that do not contain the string foo.
awk '! /foo/' BBS-list
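The three boolean operators can be compared on one small input, as in the sketch below. The four records are invented and only resemble BBS-list entries.

```shell
data='alpo-net 2400
foot 1200
fooey 2400
sdace 300'
both=$(printf '%s\n' "$data" | awk '/2400/ && /foo/')    # both subpatterns match
either=$(printf '%s\n' "$data" | awk '/2400/ || /foo/')  # at least one matches
neither=$(printf '%s\n' "$data" | awk '! /foo/')         # foo does not appear
echo "$both"; echo "$either"; echo "$neither"
```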
Note that boolean patterns are a special case of expression
patterns (see section Expressions as
Patterns); they are expressions that use the boolean
operators. See section Boolean
Expressions, for complete information on the boolean
operators.
The subpatterns of a boolean pattern
can be constant regular expressions, comparisons, or any other awk
expressions. Range patterns are not expressions, so they cannot
appear inside boolean patterns. Likewise, the special patterns BEGIN
and END, which never match any input record, are not
expressions and cannot appear inside boolean patterns.
Any awk expression is also valid as an awk pattern. Then the pattern ``matches'' if the
expression's value is nonzero (if a number)
or nonnull (if a string).
The expression is reevaluated each time the rule is tested against a new input
record. If the expression uses fields such as $1,
the value depends directly on the new input record's text;
otherwise, it depends only on what has happened so far in the
execution of the awk program, but that may still be
useful.
Comparison patterns are actually a special case of this. For
example, the expression $5 == "foo" has
the value 1 when the value of $5 equals "foo",
and 0 otherwise; therefore, this expression as a pattern matches when the two
values are equal.
Boolean patterns are also special cases of expression
patterns.
A constant regexp as a pattern is also a special case of
an expression pattern. /foo/
as an expression has the value 1 if foo appears in
the current input record; thus, as a pattern, /foo/
matches any record containing foo.
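As a sketch of an expression pattern that is neither a comparison nor a regexp, the arithmetic expression NR % 2 below is nonzero for odd-numbered records, so only those records are printed. The input lines are invented.

```shell
# NR % 2 is 1 for records 1, 3, 5 and 0 for records 2, 4.
out=$(printf 'one\ntwo\nthree\nfour\nfive\n' | awk 'NR % 2')
echo "$out"
```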
Other implementations of awk that are not yet
POSIX compliant are less general than gawk: they
allow comparison expressions, and boolean combinations thereof
(optionally with parentheses), but not necessarily other kinds of
expressions.
A range pattern
is made of two patterns separated by a comma, of the form begpat,
endpat. It matches ranges of consecutive input records. The
first pattern begpat
controls where the range begins, and the second one endpat
controls where it ends. For example,
awk '$1 == "on", $1 == "off"'
prints every record between on/off
pairs, inclusive.
A range pattern starts out
by matching begpat against every input record; when a
record matches begpat, the range pattern is turned on
and matches this
record. As long as it stays turned on, it automatically matches
every input record read. It also matches endpat
against every input record; when that succeeds, the range pattern is turned off again for
the following record. Then it goes back to checking begpat
against each record.
The record that turns on the range pattern and the one that turns it
off both match the range pattern.
If you don't want to operate on these records, you can write if
statements in the rule's action to distinguish them.
It is possible for a range pattern
to be turned both on and off by the same record, if both
conditions are satisfied by that record. Then the action is executed for just that
record.
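Both behaviors can be observed with a short sketch. The input records are invented; the second command uses regexp patterns (an assumption made so that a single record can satisfy both conditions).

```shell
# The range includes the records that turn it on and off.
first=$(printf 'a\non\nb\noff\nc\n' | awk '$1 == "on", $1 == "off"')
# One record matches both /on/ and /off/, so the range covers only it.
second=$(printf 'a\nswitch on then off\nb\n' | awk '/on/, /off/')
echo "$first"
echo "$second"
```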
BEGIN and END are special patterns.
They are not used to match input records. Rather, they are used
for supplying start-up or clean-up information to your awk
script. A BEGIN rule
is executed, once, before the first input record has been read.
An END rule is
executed, once, after all the input has been read. For example:
awk 'BEGIN { print "Analysis of \"foo\"" }
     /foo/ { ++foobar }
     END   { print "\"foo\" appears " foobar " times." }' BBS-list
This program finds the number
of records in the input file BBS-list that contain the string foo. The BEGIN rule prints a title for the
report. There is no need to use the BEGIN rule to initialize the counter foobar
to zero, as awk does this for us automatically (see
section Variables).
The second rule increments
the variable foobar every time a record containing
the pattern foo is
read. The END rule
prints the value of foobar at the end of the run.
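One detail is worth checking: if no record contains foo, the variable foobar is never assigned, and an unassigned variable used in a string concatenation is the null string. The sketch below, run on invented input with no matching records, shows the plain rule next to a hedged variant that adds 0 to force a numeric value.

```shell
# No record contains foo, so foobar is never incremented.
plain=$(printf 'bar 1\nbaz 2\n' |
        awk '/foo/ { ++foobar }
             END   { print "\"foo\" appears " foobar " times." }')
# Adding 0 converts the unassigned variable to the number zero.
forced=$(printf 'bar 1\nbaz 2\n' |
         awk '/foo/ { ++foobar }
              END   { print "\"foo\" appears " (foobar + 0) " times." }')
echo "$plain"
echo "$forced"
```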
The special patterns BEGIN and END
cannot be used in ranges or with boolean operators (indeed, they
cannot be used with any operators).
An awk program may have multiple BEGIN
and/or END rules. They are executed in the order
they appear, all the BEGIN rules at start-up and all
the END rules at termination.
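The ordering can be confirmed with a small sketch in which the BEGIN and END rules are deliberately interleaved. The messages are invented, and the input is taken from /dev/null so that the END rules run immediately.

```shell
out=$(awk 'BEGIN { print "first BEGIN" }
           END   { print "first END" }
           BEGIN { print "second BEGIN" }
           END   { print "second END" }' </dev/null)
# Both BEGIN rules run before any END rule, each group in source order.
echo "$out"
```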
Multiple BEGIN and END sections are
useful for writing library functions, since each library can have
its own BEGIN or END rule to do its own initialization
and/or cleanup. Note that the order in which library functions
are named on the command line controls the order in which their BEGIN
and END rules are executed. Therefore you have to be
careful to write such rules in library files so that the order in
which they are executed doesn't matter. See the section on invoking awk for more
information.
If an awk program only has a BEGIN rule, and no other rules, then the
program exits after the BEGIN rule has been run. (Older versions
of awk used to keep reading and ignoring input until
end of file was seen.) However, if an END rule exists as well, then the
input will be read, even if there are no other rules in the
program. This is necessary in case the END rule checks the NR
variable.
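A brief sketch of both cases follows; the input records and the message text are invented. In the first command the input must be read so that NR holds the record count when the END rule runs; in the second, the BEGIN rule finishes the program without using the input.

```shell
# An END rule alone still causes all input to be read.
count=$(printf 'a\nb\nc\n' | awk 'END { print NR }')
# A program with only a BEGIN rule exits once that rule has run.
only=$(printf 'ignored\n' | awk 'BEGIN { print "no input needed" }')
echo "$count"
echo "$only"
```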
BEGIN and END rules must have
actions; there is no default action
for these rules since there is no current record when they run.
An empty pattern is
considered to match every input record. For example, the
program:
awk '{ print $1 }' BBS-list
prints the first field of
every record.