RTR's Win95Pak: The GAWK Manual - Built-in Functions
Built-in functions are functions that are always available for
your awk program to call. This chapter defines all the built-in
functions in awk; some of them are mentioned in other sections,
but they are summarized here for your convenience. (You can also define
new functions yourself. See section User-defined Functions.)
To call a built-in function, write the name of the function followed
by arguments in parentheses. For example, atan2(y + z, 1)
is a call to the function atan2, with two arguments.
Whitespace is ignored between the built-in function name and the
open-parenthesis, but we recommend that you avoid using whitespace
there. User-defined functions do not permit whitespace in this way, and
you will find it easier to avoid mistakes by following a simple
convention which always works: no whitespace after a function name.
Each built-in function accepts a certain number of arguments. In most
cases, any extra arguments given to built-in functions are ignored. The
defaults for omitted arguments vary from function to function and are
described under the individual functions.
When a function is called, expressions that create the function's actual
parameters are evaluated completely before the function call is performed.
For example, in the code fragment:
i = 4
j = sqrt(i++)
the variable i is set to 5 before sqrt is called
with a value of 4 for its actual parameter.
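The fragment above can be tried as a complete command. This is a minimal sketch that assumes a POSIX shell; the shell variable result is only an illustrative name and is not part of awk.

```shell
result=$(awk 'BEGIN {
    i = 4
    j = sqrt(i++)   # the argument is evaluated first: sqrt receives 4, i becomes 5
    print i, j
}')
printf '%s\n' "$result"
```

The command prints the final value of i followed by the value returned by sqrt.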
Here is a full list of built-in functions that work with numbers:
int(x)
This gives you the integer part of x, truncated toward 0. This
produces the nearest integer to x, located between x and 0.
For example, int(3) is 3, int(3.9) is 3, int(-3.9)
is -3, and int(-3) is -3 as well.
sqrt(x)
This gives you the positive square root of x. It reports an error
if x is negative. Thus, sqrt(4) is 2.
exp(x)
This gives you the exponential of x, or reports an error if
x is out of range. The range of values x can have depends
on your machine's floating point representation.
log(x)
This gives you the natural logarithm of x, if x is positive;
otherwise, it reports an error.
sin(x)
This gives you the sine of x, with x in radians.
cos(x)
This gives you the cosine of x, with x in radians.
atan2(y, x)
This gives you the arctangent of y / x in radians.
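As a rough illustration of the functions described so far, the following sketch derives pi from atan2 and feeds it to sin and cos. It assumes a POSIX shell; the names pi and result are illustrative choices, not built-in variables.

```shell
result=$(awk 'BEGIN {
    pi = atan2(0, -1)                 # arctangent of 0 / -1, in radians
    printf("%.5f\n", pi)
    printf("%.2f %.2f\n", sin(pi / 2), cos(0))
    printf("%.2f\n", log(exp(2)))     # log reverses exp
}')
printf '%s\n' "$result"
```

The output is rounded with printf because the exact digits of floating point results can differ between machines.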
rand()
This gives you a random number. The values of rand are
uniformly distributed between 0 and 1. The value can be 0 but is never
1.
Often you want random integers instead. Here is a user-defined function
you can use to obtain a random nonnegative integer less than n:
function randint(n) {
    return int(n * rand())
}
The multiplication produces a random real number that is at least 0 and
less than n. Taking the integer part (using int) then gives an integer
between 0 and n - 1.
Here is an example where a similar function is used to produce
random integers between 1 and n. Note that this program will
print a new random number for each input record.
awk '
# Function to roll a simulated die.
function roll(n) { return 1 + int(rand() * n) }
# Roll 3 six-sided dice and print total number of points.
{
    printf("%d points\n", roll(6)+roll(6)+roll(6))
}'
Note: rand starts generating numbers from the same
point, or seed, each time you run awk. This means that
a program will produce the same results each time you run it.
The numbers are random within one awk run, but predictable
from run to run. This is convenient for debugging, but if you want
a program to do different things each time it is used, you must change
the seed to a value that will be different in each run. To do this,
use srand.
srand(x)
The function srand sets the starting point, or seed,
for generating random numbers to the value x.
Each seed value leads to a particular sequence of ``random'' numbers.
Thus, if you set the seed to the same value a second time, you will get
the same sequence of ``random'' numbers again.
If you omit the argument x, as in srand(), then the current
date and time of day are used for a seed. This is the way to get random
numbers that differ from one run to the next.
The return value of srand is the previous seed. This makes it
easy to keep track of the seeds for use in consistently reproducing
sequences of random numbers.
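The effect of reusing a seed can be checked with a short sketch like the one below. It assumes a POSIX shell; the seed value 42 and the variable names are arbitrary choices made for illustration. The actual numbers produced differ between awk implementations, so the sketch only compares the two sequences with each other.

```shell
result=$(awk 'BEGIN {
    srand(42); a = rand(); b = rand()
    srand(42); c = rand(); d = rand()   # same seed again, so the sequence repeats
    if (a == c && b == d)
        print "same sequence"
    else
        print "different sequence"
}')
printf '%s\n' "$result"
```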
The functions in this section look at or change the text of one or more
strings.
index(in, find)
This searches the string in for the first occurrence of the string
find, and returns the position in characters where that occurrence
begins in the string in. For example:
awk 'BEGIN { print index("peanut", "an") }'
prints 3. If find is not found, index returns 0.
(Remember that string indices in awk start at 1.)
length(string)
This gives you the number of characters in string. If
string is a number, the length of the digit string representing
that number is returned. For example, length("abcde") is 5. By
contrast, length(15 * 35) works out to 3. How? Well, 15 * 35 =
525, and 525 is then converted to the string "525", which has
three characters.
If no argument is supplied, length returns the length of $0.
In older versions of awk, you could call the length function
without any parentheses. Doing so is marked as ``deprecated'' in the
POSIX standard. This means that while you can do this in your
programs, it is a feature that can eventually be removed from a future
version of the standard. Therefore, for maximal portability of your
awk programs you should always supply the parentheses.
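The three cases described above can be seen together in this sketch, which assumes a POSIX shell and uses a made-up input line. The shell variable result is an illustrative name.

```shell
result=$(echo "hello world" | awk '{
    print length("abcde")    # a string argument
    print length(15 * 35)    # 525 is converted to the string "525"
    print length()           # no argument: the length of $0
}')
printf '%s\n' "$result"
```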
match(string, regexp)
The match function searches the string, string, for the
longest, leftmost substring matched by the regular expression,
regexp. It returns the character position, or index, of
where that substring begins (1, if it starts at the beginning of
string). If no match is found, it returns 0.
The match function sets the built-in variable RSTART to
the index. It also sets the built-in variable RLENGTH to the
length in characters of the matched substring. If no match is found,
RSTART is set to 0, and RLENGTH to -1.
For example:
awk '{
    if ($1 == "FIND")
        regex = $2
    else {
        where = match($0, regex)
        if (where)
            print "Match of", regex, "found at", where, "in", $0
    }
}'
This program looks for lines that match the regular expression stored in
the variable regex. This regular expression can be changed. If the
first word on a line is FIND, regex is changed to be the
second word on that line. Therefore, given:
FIND fo*bar
My program was a foobar
But none of it would doobar
FIND Melvin
JF+KM
This line is property of The Reality Engineering Co.
This file created by Melvin.
awk prints:
Match of fo*bar found at 18 in My program was a foobar
Match of Melvin found at 22 in This file created by Melvin.
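RSTART and RLENGTH are often passed straight to substr to extract the matched text. The following is a minimal sketch of that idiom, assuming a POSIX shell; the variable names s and result are illustrative.

```shell
result=$(awk 'BEGIN {
    s = "My program was a foobar"
    if (match(s, /fo*bar/))
        print RSTART, RLENGTH, substr(s, RSTART, RLENGTH)
    if (!match(s, /Melvin/))
        print RSTART, RLENGTH     # values set when nothing matches
}')
printf '%s\n' "$result"
```

The first line of output shows the position, length, and text of the match; the second shows the values left in RSTART and RLENGTH after a failed match.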
split(string, array, fieldsep)
This divides string into pieces separated by fieldsep,
and stores the pieces in array. The first piece is stored in
array[1], the second piece in array[2], and so
forth. The string value of the third argument, fieldsep, is
a regexp describing where to split string (much as FS can
be a regexp describing where to split input records). If
the fieldsep is omitted, the value of FS is used.
split returns the number of elements created.
The split function, then, splits strings into pieces in a
manner similar to the way input lines are split into fields. For example:
split("auto-da-fe", a, "-")
splits the string auto-da-fe into three fields using - as the
separator. It sets the contents of the array a as follows:
a[1] = "auto"
a[2] = "da"
a[3] = "fe"
The value returned by this call to split is 3.
As with input field-splitting, when the value of fieldsep is
" ", leading and trailing whitespace is ignored, and the elements
are separated by runs of whitespace.
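Both separator styles can be compared in a small sketch such as this one. It assumes a POSIX shell, and the sample strings and array names (words, parts) are invented for illustration.

```shell
result=$(awk 'BEGIN {
    # A single space as separator: surrounding whitespace is ignored.
    n = split("  alpha   beta gamma  ", words, " ")
    for (i = 1; i <= n; i++)
        printf("%d:%s\n", i, words[i])
    # A regexp as separator: split on either a hyphen or a colon.
    m = split("a1-b2:c3", parts, /[-:]/)
    print m, parts[1], parts[2], parts[3]
}')
printf '%s\n' "$result"
```

Using the return value as the loop bound, as done here, avoids depending on the order in which an array would otherwise be scanned.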
sprintf(format, expression1,...)
This returns (without printing) the string that printf would
have printed out with the same arguments
(see section Using printf Statements for Fancier Printing). For example:
sprintf("pi = %.2f (approx.)", 22/7)
returns the string "pi = 3.14 (approx.)".
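Because sprintf returns its result instead of printing it, the string can be stored and reused. Here is a brief sketch, assuming a POSIX shell; the format string and the names line and result are illustrative only.

```shell
result=$(awk 'BEGIN {
    # left-justified string, fixed-width number, zero-padded integer
    line = sprintf("%-6s|%5.1f|%03d", "item", 2.5, 7)
    print line
    print length(line)     # the formatted text is an ordinary string
}')
printf '%s\n' "$result"
```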
sub(regexp, replacement, target)
The sub function alters the value of target.
It searches this value, which should be a string, for the
leftmost substring matched by the regular expression, regexp,
extending this match as far as possible. Then the entire string is
changed by replacing the matched text with replacement.
The modified string becomes the new value of target.
This function is peculiar because target is not simply
used to compute a value, and not just any expression will do: it
must be a variable, field or array reference, so that sub can
store a modified value there. If this argument is omitted, then the
default is to use and alter $0.
For example:
str = "water, water, everywhere"
sub(/at/, "ith", str)
sets str to "wither, water, everywhere", by replacing the
leftmost, longest occurrence of at with ith.
The sub function returns the number of substitutions made (either
one or zero).
If the special character & appears in replacement, it
stands for the precise substring that was matched by regexp. (If
the regexp can match more than one string, then this precise substring
may vary.) For example:
awk '{ sub(/candidate/, "& and his wife"); print }'
changes the first occurrence of candidate to candidate
and his wife on each input line.
Here is another example:
awk 'BEGIN {
    str = "daabaaa"
    sub(/a+/, "c&c", str)
    print str
}'
prints dcaacbaaa. This shows how & can represent a non-constant
string, and also illustrates the ``leftmost, longest'' rule.
The effect of this special character (&) can be turned off by putting a
backslash before it in the string. As usual, to insert one backslash in
the string, you must write two backslashes. Therefore, write \\&
in a string constant to include a literal & in the replacement.
For example, here is how to replace the first | on each line with
an &:
awk '{ sub(/\|/, "\\&"); print }'
Note: as mentioned above, the third argument to sub must
be an lvalue. Some versions of awk allow the third argument to
be an expression which is not an lvalue. In such a case, sub
would still search for the pattern and return 0 or 1, but the result of
the substitution (if any) would be thrown away because there is no place
to put it. Such versions of awk accept expressions like
this:
sub(/USA/, "United States", "the USA and Canada")
But that is considered erroneous in gawk.
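By contrast, a field or an array element is a valid third argument. The sketch below, which assumes a POSIX shell and uses an invented input line and array name (names), shows both kinds of target along with the return value.

```shell
result=$(echo "the USA and Canada" | awk '{
    n = sub(/USA/, "United States", $2)   # target is a field
    print n, $0
    names["x"] = "USA today"
    sub(/USA/, "[&]", names["x"])         # target is an array element
    print names["x"]
    print sub(/France/, "FR")             # no match in $0, so 0 is returned
}')
printf '%s\n' "$result"
```

Note that changing the field $2 also changes $0, which is why the first line of output shows the whole modified record.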
gsub(regexp, replacement, target)
This is similar to the sub function, except gsub replaces
all of the longest, leftmost, nonoverlapping matching
substrings it can find. The g in gsub stands for
``global,'' which means replace everywhere. For example:
awk '{ gsub(/Britain/, "United Kingdom"); print }'
replaces all occurrences of the string Britain with United
Kingdom for all input records.
The gsub function returns the number of substitutions made. If
the variable to be searched and altered, target, is
omitted, then the entire input record, $0, is used.
As in sub, the characters & and \ are special, and
the third argument must be an lvalue.
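For comparison with the sub example on the same string, here is a sketch of gsub that also prints the substitution count. It assumes a POSIX shell; the variable names are illustrative.

```shell
result=$(awk 'BEGIN {
    str = "water, water, everywhere"
    n = gsub(/at/, "ith", str)     # replaces every match, not just the first
    print n, str
    t = "a b c"
    gsub(/[a-c]/, "<&>", t)        # & stands for each matched letter in turn
    print t
}')
printf '%s\n' "$result"
```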
substr(string, start, length)
This returns a length-character-long substring of string,
starting at character number start. The first character of a
string is character number one. For example,
substr("washington", 5, 3) returns "ing".
If length is not present, this function returns the whole suffix of
string that begins at character number start. For example,
substr("washington", 5) returns "ington". This is also
the case if length is greater than the number of characters remaining
in the string, counting from character number start.
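The following sketch collects these cases in one place, including a length that runs past the end of the string. It assumes a POSIX shell; s and result are illustrative names.

```shell
result=$(awk 'BEGIN {
    s = "washington"
    print substr(s, 5, 3)              # three characters starting at 5
    print substr(s, 5)                 # no length: the rest of the string
    print substr(s, 8, 100)            # length too large: stops at the end
    print substr(s, length(s) - 2)     # the last three characters
}')
printf '%s\n' "$result"
```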
tolower(string)
This returns a copy of string, with each upper-case character
in the string replaced with its corresponding lower-case character.
Nonalphabetic characters are left unchanged. For example,
tolower("MiXeD cAsE 123") returns "mixed case 123".
toupper(string)
This returns a copy of string, with each lower-case character
in the string replaced with its corresponding upper-case character.
Nonalphabetic characters are left unchanged. For example,
toupper("MiXeD cAsE 123") returns "MIXED CASE 123".
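One common use of these two functions is matching without regard to case. This is a small sketch under the assumption of a POSIX shell; the three input lines are invented sample data.

```shell
result=$(printf 'Error: disk full\nok\nERROR: timeout\n' | awk '
tolower($0) ~ /^error/ {
    # Normalize the first word to an initial capital and report the line number.
    print toupper(substr($1, 1, 1)) tolower(substr($1, 2)), NR
}')
printf '%s\n' "$result"
```

The record itself is not modified, since tolower and toupper return copies.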
close(filename)
Close the file filename, for input or output. The argument may
alternatively be a shell command that was used for redirecting to or
from a pipe; then the pipe is closed.
See section Closing Input Files and Pipes, regarding closing
input files and pipes. See section Closing Output Files and Pipes,
regarding closing output files and pipes.
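As an illustration of closing a pipe, the sketch below sends two lines to sort and closes the pipe before printing anything else. It assumes a POSIX shell with the sort utility available; cmd and result are illustrative names. The string passed to close must be exactly the command string used in the redirection.

```shell
result=$(awk 'BEGIN {
    cmd = "sort"
    print "pear"  | cmd
    print "apple" | cmd
    close(cmd)        # sort receives end of input and writes its output here
    print "done"
}')
printf '%s\n' "$result"
```

Without the call to close, the sorted lines would not appear until awk exits, after the line printed by the last statement.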
system(command)
The system function allows the user to execute operating system commands
and then return to the awk program. The system function
executes the command given by the string command. It returns, as
its value, the status returned by the command that was executed.
For example, if the following fragment of code is put in your awk
program:
END {
    system("mail -s 'awk run done' operator < /dev/null")
}
the system operator will be sent mail when the awk program
finishes processing input and begins its end-of-input processing.
Note that much the same result can be obtained by redirecting
print or printf into a pipe. However, if your awk
program is interactive, system is useful for cranking up large
self-contained programs, such as a shell or an editor.
Some operating systems cannot implement the system function.
system causes a fatal error if it is not supported.
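The return value can be used to test whether a command succeeded. Here is a minimal sketch, assuming a POSIX shell in which the true utility exists and in which the command string is run by a shell that understands exit; the variable names are illustrative.

```shell
result=$(awk 'BEGIN {
    ok  = system("true")      # a command that succeeds
    bad = system("exit 3")    # a command that fails with status 3
    print ok, bad
}')
printf '%s\n' "$result"
```

How the status is reported can vary on systems that are not POSIX-like, so treat the values shown by this sketch as typical rather than guaranteed.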
Many utility programs will buffer their output; they save information
to be written to a disk file or terminal in memory, until there is enough
to be written in one operation. This is often more efficient than writing
every little bit of information as soon as it is ready. However, sometimes
it is necessary to force a program to flush its buffers; that is,
write the information to its destination, even if a buffer is not full.
You can do this from your awk program by calling system
with a null string as its argument:
system("") # flush output
gawk treats this use of the system function as a special
case, and is smart enough not to run a shell (or other command
interpreter) with the empty command. Therefore, with gawk, this
idiom is not only useful, it is efficient. While this idiom should work
with other awk implementations, it will not necessarily avoid
starting an unnecessary shell.
A common use for awk programs is the processing of log files.
Log files often contain time stamp information, indicating when a
particular log record was written. Many programs log their time stamp
in the form returned by the time system call, which is the
number of seconds since a particular epoch. On POSIX systems,
it is the number of seconds since Midnight, January 1, 1970, UTC.
In order to make it easier to process such log files, and to easily produce
useful reports, gawk provides two functions for working with time
stamps. Both of these are gawk extensions; they are not specified
in the POSIX standard, nor are they in any other known version
of awk.
systime()
This function returns the current time as the number of seconds since
the system epoch. On POSIX systems, this is the number of seconds
since Midnight, January 1, 1970, UTC. It may be a different number on
other systems.
strftime(format, timestamp)
This function returns a string. It is similar to the function of the
same name in the ANSI C standard library. The time specified by
timestamp is used to produce a string, based on the contents
of the format string.
The systime function allows you to compare a time stamp from a
log file with the current time of day. In particular, it is easy to
determine how long ago a particular record was logged. It also allows
you to produce log records using the ``seconds since the epoch'' format.
The strftime function allows you to easily turn a time stamp
into human-readable information. It is similar in nature to the sprintf
function, copying non-format specification characters verbatim to the
returned string, and substituting date and time values for format
specifications in the format string. If no timestamp argument
is supplied, gawk will use the current time of day as the
time stamp.
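The two functions can be combined as in the sketch below. It assumes that the awk on your path is gawk (or another version that provides these extensions), a POSIX shell, and a system whose epoch is the POSIX one. The time stamp 86400 is an arbitrary sample value, and TZ=UTC is set only so that the formatted result does not depend on the local time zone.

```shell
result=$(TZ=UTC awk 'BEGIN {
    stamp = 86400                            # one day after the POSIX epoch
    print strftime("%Y-%m-%d %H:%M:%S", stamp)
    age = systime() - stamp                  # seconds elapsed since the stamp
    if (age > 0)
        print "logged in the past"
}')
printf '%s\n' "$result"
```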
strftime is guaranteed by the ANSI C standard to support
the following date format specifications:
%a
The locale's abbreviated weekday name.
%A
The locale's full weekday name.
%b
The locale's abbreviated month name.
%B
The locale's full month name.
%c
The locale's ``appropriate'' date and time representation.
%d
The day of the month as a decimal number (01--31).
%H
The hour (24-hour clock) as a decimal number (00--23).
%I
The hour (12-hour clock) as a decimal number (01--12).
%j
The day of the year as a decimal number (001--366).
%m
The month as a decimal number (01--12).
%M
The minute as a decimal number (00--59).
%p
The locale's equivalent of the AM/PM designations associated
with a 12-hour clock.
%S
The second as a decimal number (00--61). (Occasionally there are
minutes in a year with one or two leap seconds, which is why the
seconds can go from 0 all the way to 61.)
%U
The week number of the year (the first Sunday as the first day of week 1)
as a decimal number (00--53).
%w
The weekday as a decimal number (0--6). Sunday is day 0.
%W
The week number of the year (the first Monday as the first day of week 1)
as a decimal number (00--53).
%x
The locale's ``appropriate'' date representation.
%X
The locale's ``appropriate'' time representation.
%y
The year without century as a decimal number (00--99).
%Y
The year with century as a decimal number.
%Z
The time zone name or abbreviation, or no characters if
no time zone is determinable.
%%
A literal %.
If a conversion specifier is not one of the above, the behavior is
undefined. (This is because the ANSI standard for C leaves the
behavior of the C version of strftime undefined, and gawk
will use the system's version of strftime if it's there.
Typically, the conversion specifier will either not appear in the
returned string, or it will appear literally.)
Informally, a locale is the geographic place in which a program
is meant to run. For example, a common way to abbreviate the date
September 4, 1991 in the United States would be ``9/4/91''.
In many countries in Europe, however, it would be abbreviated ``4.9.91''.
Thus, the %x specification in a "US" locale might produce
9/4/91, while in a "EUROPE" locale, it might produce
4.9.91. The ANSI C standard defines a default "C"
locale, which is an environment that is typical of what most C programmers
are used to.
A public-domain C version of strftime is shipped with gawk
for systems that are not yet fully ANSI-compliant. If that version is
used to compile gawk, then the following additional format
specifications are available:
%D
Equivalent to specifying %m/%d/%y.
%e
The day of the month, padded with a blank if it is only one digit.
%h
Equivalent to %b, above.
%n
A newline character (ASCII LF).
%r
Equivalent to specifying %I:%M:%S %p.
%R
Equivalent to specifying %H:%M.
%T
Equivalent to specifying %H:%M:%S.
%t
A TAB character.
%k
The hour (24-hour clock) as a decimal number (0--23).
Single digit numbers are padded with a blank.
%l
The hour (12-hour clock) as a decimal number (1--12).
Single digit numbers are padded with a blank.
%C
The century, as a number between 00 and 99.
%u
The weekday as a decimal number (1--7). Monday is day 1.
%V
The week number of the year (the first Monday as the first
day of week 1) as a decimal number (01--53).
The method for determining the week number is as specified by ISO 8601
(to wit: if the week containing January 1 has four or more days in the
new year, then it is week 1, otherwise it is week 53 of the previous year
and the next week is week 1).
%Ec %EC %Ex %Ey %EY %Od %Oe %OH %OI
%Om %OM %OS %Ou %OU %OV %Ow %OW %Oy
These are ``alternate representations'' for the specifications
that use only the second letter (%c, %C, and so on).
They are recognized, but their normal representations are used.
(These facilitate compliance with the POSIX date
utility.)
%v
The date in VMS format (e.g. 20-JUN-1991).
Here are two examples that use strftime. The first is an
awk version of the C ctime function. (This is a
user-defined function, which we have not discussed yet.
See section User-defined Functions, for more information.)
# ctime.awk
#
# awk version of C ctime(3) function
function ctime(ts, format)
{
    format = "%a %b %e %H:%M:%S %Z %Y"
    if (ts == 0)
        ts = systime() # use current time as default
    return strftime(format, ts)
}
This next example is an awk implementation of the POSIX
date utility. Normally, the date utility prints the
current date and time of day in a well-known format. However, if you
provide an argument to it that begins with a +, date
will copy non-format specifier characters to the standard output, and
will interpret the current time according to the format specifiers in
the string. For example:
date '+Today is %A, %B %d, %Y.'
might print
Today is Thursday, July 11, 1991.
Here is the awk version of the date utility.
#! /usr/bin/gawk -f
#
# date --- implement the P1003.2 Draft 11 'date' command
#
# Bug: does not recognize the -u argument.
BEGIN \
{
    format = "%a %b %e %H:%M:%S %Z %Y"
    exitval = 0
    if (ARGC > 2)
        exitval = 1
    else if (ARGC == 2) {
        format = ARGV[1]
        if (format ~ /^\+/)
            format = substr(format, 2) # remove leading +
    }
    print strftime(format)
    exit exitval
}