home *** CD-ROM | disk | FTP | other *** search
-
-
-
-
-
-
-
-
-
- Awk -- A Pattern Scanning and Processing Language
- (Second Edition)
-
- Alfred V. Aho
- Brian W. Kernighan
- Peter J. Weinberger
-
-
- ABSTRACT
-
-
- Awk is a programming language whose basic
- operation is to search a set of files for pat-
- terns, and to perform specified actions upon lines
- or fields of lines which contain instances of
- those patterns. Awk makes certain data selection
- and transformation operations easy to express; for
- example, the awk program
-
- length > 72
-
- prints all input lines whose length exceeds 72
- characters; the program
-
- NF % 2 == 0
-
- prints all lines with an even number of fields;
- and the program
-
- { $1 = log($1); print }
-
- replaces the first field of each line by its loga-
- rithm.
-
- Awk patterns may include arbitrary boolean
- combinations of regular expressions and of rela-
- tional operators on strings, numbers, fields,
- variables, and array elements. Actions may
- include the same pattern-matching constructions as
- in patterns, as well as arithmetic and string
- expressions and assignments, if-else, while, for
- statements, and multiple output streams.
-
- This report contains a user's guide, a dis-
- cussion of the design and implementation of awk,
- and some timing statistics.
-
-
-
-
-
-
-
-
-
-
-
-
-
-
- USD:19-2 Awk -- A Pattern Scanning and Processing Language
-
-
- 1. Introduction
-
- Awk is a programming language designed to make many
- common information retrieval and text manipulation tasks
- easy to state and to perform.
-
- The basic operation of awk is to scan a set of input
- lines in order, searching for lines which match any of a set
- of patterns which the user has specified. For each pattern,
- an action can be specified; this action will be performed on
- each line that matches the pattern.
-
- Readers familiar with the UNIX(R) program grep unix
- program manual will recognize the approach, although in awk
- the patterns may be more general than in grep, and the
- actions allowed are more involved than merely printing the
- matching line. For example, the awk program
-
- {print $3, $2}
-
- prints the third and second columns of a table in that
- order. The program
-
- $2 ~ /A|B|C/
-
- prints all input lines with an A, B, or C in the second
- field. The program
-
- $1 != prev { print; prev = $1 }
-
- prints all lines in which the first field is different from
- the previous first field.
-
- 1.1. Usage
-
- The command
-
- awk program [files]
-
- executes the awk commands in the string program on the set
- of named files, or on the standard input if there are no
- files. The statements can also be placed in a file pfile,
- and executed by the command
-
- awk -f pfile [files]
-
-
- 1.2. Program Structure
-
- An awk program is a sequence of statements of the form:
-
-
-
-
-
-
-
-
-
-
-
-
-
- Awk -- A Pattern Scanning and Processing Language USD:19-3
-
-
- pattern { action }
- pattern { action }
- ...
-
- Each line of input is matched against each of the patterns
- in turn. For each pattern that matches, the associated
- action is executed. When all the patterns have been tested,
- the next line is fetched and the matching starts over.
-
- Either the pattern or the action may be left out, but
- not both. If there is no action for a pattern, the matching
- line is simply copied to the output. (Thus a line which
- matches several patterns can be printed several times.) If
- there is no pattern for an action, then the action is per-
- formed for every input line. A line which matches no pat-
- tern is ignored.
-
- Since patterns and actions are both optional, actions
- must be enclosed in braces to distinguish them from pat-
- terns.
-
- 1.3. Records and Fields
-
- Awk input is divided into ``records'' terminated by a
- record separator. The default record separator is a new-
- line, so by default awk processes its input a line at a
- time. The number of the current record is available in a
- variable named NR.
-
- Each input record is considered to be divided into
- ``fields.'' Fields are normally separated by white space --
- blanks or tabs -- but the input field separator may be
- changed, as described below. Fields are referred to as $1,
- $2, and so forth, where $1 is the first field, and $0 is the
- whole input record itself. Fields may be assigned to. The
- number of fields in the current record is available in a
- variable named NF.
-
- The variables FS and RS refer to the input field and
- record separators; they may be changed at any time to any
- single character. The optional command-line argument -Fc
- may also be used to set FS to the character c.
-
- If the record separator is empty, an empty input line
- is taken as the record separator, and blanks, tabs and new-
- lines are treated as field separators.
-
- The variable FILENAME contains the name of the current
- input file.
-
- 1.4. Printing
-
- An action may have no pattern, in which case the action
- is executed for all lines. The simplest action is to print
-
-
-
-
-
-
-
-
-
- USD:19-4 Awk -- A Pattern Scanning and Processing Language
-
-
- some or all of a record; this is accomplished by the awk
- command print. The awk program
-
- { print }
-
- prints each record, thus copying the input to the output
- intact. More useful is to print a field or fields from each
- record. For instance,
-
- print $2, $1
-
- prints the first two fields in reverse order. Items sepa-
- rated by a comma in the print statement will be separated by
- the current output field separator when output. Items not
- separated by commas will be concatenated, so
-
- print $1 $2
-
- runs the first and second fields together.
-
- The predefined variables NF and NR can be used; for
- example
-
- { print NR, NF, $0 }
-
- prints each record preceded by the record number and the
- number of fields.
-
- Output may be diverted to multiple files; the program
-
- { print $1 >"foo1"; print $2 >"foo2" }
-
- writes the first field, $1, on the file foo1, and the second
- field on file foo2. The >> notation can also be used:
-
- print $1 >>"foo"
-
- appends the output to the file foo. (In each case, the out-
- put files are created if necessary.) The file name can be a
- variable or a field as well as a constant; for example,
-
- print $1 >$2
-
- uses the contents of field 2 as a file name.
-
- Naturally there is a limit on the number of output
- files; currently it is 10.
-
- Similarly, output can be piped into another process (on
- UNIX only); for instance,
-
- print | "mail bwk"
-
- mails the output to bwk.
-
-
-
-
-
-
-
-
-
- Awk -- A Pattern Scanning and Processing Language USD:19-5
-
-
- The variables OFS and ORS may be used to change the
- current output field separator and output record separator.
- The output record separator is appended to the output of the
- print statement.
-
- Awk also provides the printf statement for output for-
- matting:
-
- printf format expr, expr, ...
-
- formats the expressions in the list according to the speci-
- fication in format and prints them. For example,
-
- printf "%8.2f %10ld\n", $1, $2
-
- prints $1 as a floating point number 8 digits wide, with two
- after the decimal point, and $2 as a 10-digit long decimal
- number, followed by a newline. No output separators are
- produced automatically; you must add them yourself, as in
- this example. The version of printf is identical to that
- used with C. C programm language prentice hall 1978
-
- 2. Patterns
-
- A pattern in front of an action acts as a selector that
- determines whether the action is to be executed. A variety
- of expressions may be used as patterns: regular expressions,
- arithmetic relational expressions, string-valued expres-
- sions, and arbitrary boolean combinations of these.
-
- 2.1. BEGIN and END
-
- The special pattern BEGIN matches the beginning of the
- input, before the first record is read. The pattern END
- matches the end of the input, after the last record has been
- processed. BEGIN and END thus provide a way to gain control
- before and after processing, for initialization and wrapup.
-
- As an example, the field separator can be set to a
- colon by
-
- BEGIN { FS = ":" }
- ... rest of program ...
-
- Or the input lines may be counted by
-
- END { print NR }
-
- If BEGIN is present, it must be the first pattern; END must
- be the last if used.
-
-
-
-
-
-
-
-
-
-
-
-
-
- USD:19-6 Awk -- A Pattern Scanning and Processing Language
-
-
- 2.2. Regular Expressions
-
- The simplest regular expression is a literal string of
- characters enclosed in slashes, like
-
- /smith/
-
- This is actually a complete awk program which will print all
- lines which contain any occurrence of the name ``smith''.
- If a line contains ``smith'' as part of a larger word, it
- will also be printed, as in
-
- blacksmithing
-
-
- Awk regular expressions include the regular expression
- forms found in the UNIX text editor ed unix program manual
- and grep (without back-referencing). In addition, awk
- allows parentheses for grouping, | for alternatives, + for
- ``one or more'', and ? for ``zero or one'', all as in lex.
- Character classes may be abbreviated: [a-zA-Z0-9] is the set
- of all letters and digits. As an example, the awk program
-
- /[Aa]ho|[Ww]einberger|[Kk]ernighan/
-
- will print all lines which contain any of the names ``Aho,''
- ``Weinberger'' or ``Kernighan,'' whether capitalized or not.
-
- Regular expressions (with the extensions listed above)
- must be enclosed in slashes, just as in ed and sed. Within
- a regular expression, blanks and the regular expression
- metacharacters are significant. To turn of the magic mean-
- ing of one of the regular expression characters, precede it
- with a backslash. An example is the pattern
-
- /\/.*\//
-
- which matches any string of characters enclosed in slashes.
-
- One can also specify that any field or variable matches
- a regular expression (or does not match it) with the opera-
- tors ~ and !~. The program
-
- $1 ~ /[jJ]ohn/
-
- prints all lines where the first field matches ``john'' or
- ``John.'' Notice that this will also match ``Johnson'',
- ``St. Johnsbury'', and so on. To restrict it to exactly
- [jJ]ohn, use
-
- $1 ~ /^[jJ]ohn$/
-
- The caret ^ refers to the beginning of a line or field; the
- dollar sign $ refers to the end.
-
-
-
-
-
-
-
-
-
- Awk -- A Pattern Scanning and Processing Language USD:19-7
-
-
- 2.3. Relational Expressions
-
- An awk pattern can be a relational expression involving
- the usual relational operators <, <=, ==, !=, >=, and >. An
- example is
-
- $2 > $1 + 100
-
- which selects lines where the second field is at least 100
- greater than the first field. Similarly,
-
- NF % 2 == 0
-
- prints lines with an even number of fields.
-
- In relational tests, if neither operand is numeric, a
- string comparison is made; otherwise it is numeric. Thus,
-
- $1 >= "s"
-
- selects lines that begin with an s, t, u, etc. In the
- absence of any other information, fields are treated as
- strings, so the program
-
- $1 > $2
-
- will perform a string comparison.
-
- 2.4. Combinations of Patterns
-
- A pattern can be any boolean combination of patterns,
- using the operators || (or), && (and), and ! (not). For
- example,
-
- $1 >= "s" && $1 < "t" && $1 != "smith"
-
- selects lines where the first field begins with ``s'', but
- is not ``smith''. && and || guarantee that their operands
- will be evaluated from left to right; evaluation stops as
- soon as the truth or falsehood is determined.
-
- 2.5. Pattern Ranges
-
- The ``pattern'' that selects an action may also consist
- of two patterns separated by a comma, as in
-
- pat1, pat2 { ... }
-
- In this case, the action is performed for each line between
- an occurrence of pat1 and the next occurrence of pat2
- (inclusive). For example,
-
- /start/, /stop/
-
-
-
-
-
-
-
-
-
-
- USD:19-8 Awk -- A Pattern Scanning and Processing Language
-
-
- prints all lines between start and stop, while
-
- NR == 100, NR == 200 { ... }
-
- does the action for lines 100 through 200 of the input.
-
- 3. Actions
-
- An awk action is a sequence of action statements termi-
- nated by newlines or semicolons. These action statements
- can be used to do a variety of bookkeeping and string manip-
- ulating tasks.
-
- 3.1. Built-in Functions
-
- Awk provides a ``length'' function to compute the
- length of a string of characters. This program prints each
- record, preceded by its length:
-
- {print length, $0}
-
- length by itself is a ``pseudo-variable'' which yields the
- length of the current record; length(argument) is a function
- which yields the length of its argument, as in the equiva-
- lent
-
- {print length($0), $0}
-
- The argument may be any expression.
-
- Awk also provides the arithmetic functions sqrt, log,
- exp, and int, for square root, base e logarithm, exponen-
- tial, and integer part of their respective arguments.
-
- The name of one of these built-in functions, without
- argument or parentheses, stands for the value of the func-
- tion on the whole record. The program
-
- length < 10 || length > 20
-
- prints lines whose length is less than 10 or greater than
- 20.
-
- The function substr(s, m, n) produces the substring of
- s that begins at position m (origin 1) and is at most n
- characters long. If n is omitted, the substring goes to the
- end of s. The function index(s1, s2) returns the position
- where the string s2 occurs in s1, or zero if it does not.
-
- The function sprintf(f, e1, e2, ...) produces the value
- of the expressions e1, e2, etc., in the printf format speci-
- fied by f. Thus, for example,
-
-
-
-
-
-
-
-
-
-
-
- Awk -- A Pattern Scanning and Processing Language USD:19-9
-
-
- x = sprintf("%8.2f %10ld", $1, $2)
-
- sets x to the string produced by formatting the values of $1
- and $2.
-
- 3.2. Variables, Expressions, and Assignments
-
- Awk variables take on numeric (floating point) or
- string values according to context. For example, in
-
- x = 1
-
- x is clearly a number, while in
-
- x = "smith"
-
- it is clearly a string. Strings are converted to numbers
- and vice versa whenever context demands it. For instance,
-
- x = "3" + "4"
-
- assigns 7 to x. Strings which cannot be interpreted as num-
- bers in a numerical context will generally have numeric
- value zero, but it is unwise to count on this behavior.
-
- By default, variables (other than built-ins) are ini-
- tialized to the null string, which has numerical value zero;
- this eliminates the need for most BEGIN sections. For exam-
- ple, the sums of the first two fields can be computed by
-
- { s1 += $1; s2 += $2 }
- END { print s1, s2 }
-
-
- Arithmetic is done internally in floating point. The
- arithmetic operators are +, -, *, /, and % (mod). The C
- increment ++ and decrement -- operators are also available,
- and so are the assignment operators +=, -=, *=, /=, and %=.
- These operators may all be used in expressions.
-
- 3.3. Field Variables
-
- Fields in awk share essentially all of the properties
- of variables -- they may be used in arithmetic or string
- operations, and may be assigned to. Thus one can replace
- the first field with a sequence number like this:
-
- { $1 = NR; print }
-
- or accumulate two fields into a third, like this:
-
- { $1 = $2 + $3; print $0 }
-
- or assign a string to a field:
-
-
-
-
-
-
-
-
-
- USD:19-10 Awk -- A Pattern Scanning and Processing Language
-
-
- { if ($3 > 1000)
- $3 = "too big"
- print
- }
-
- which replaces the third field by ``too big'' when it is,
- and in any case prints the record.
-
- Field references may be numerical expressions, as in
-
- { print $i, $(i+1), $(i+n) }
-
- Whether a field is deemed numeric or string depends on con-
- text; in ambiguous cases like
-
- if ($1 == $2) ...
-
- fields are treated as strings.
-
- Each input line is split into fields automatically as
- necessary. It is also possible to split any variable or
- string into fields:
-
- n = split(s, array, sep)
-
- splits the the string s into array[1], ..., array[n]. The
- number of elements found is returned. If the sep argument
- is provided, it is used as the field separator; otherwise FS
- is used as the separator.
-
- 3.4. String Concatenation
-
- Strings may be concatenated. For example
-
- length($1 $2 $3)
-
- returns the length of the first three fields. Or in a print
- statement,
-
- print $1 " is " $2
-
- prints the two fields separated by `` is ''. Variables and
- numeric expressions may also appear in concatenations.
-
- 3.5. Arrays
-
- Array elements are not declared; they spring into exis-
- tence by being mentioned. Subscripts may have any non-null
- value, including non-numeric strings. As an example of a
- conventional numeric subscript, the statement
-
- x[NR] = $0
-
- assigns the current input record to the NR-th element of the
-
-
-
-
-
-
-
-
-
- Awk -- A Pattern Scanning and Processing Language USD:19-11
-
-
- array x. In fact, it is possible in principle (though per-
- haps slow) to process the entire input in a random order
- with the awk program
-
- { x[NR] = $0 }
- END { ... program ... }
-
- The first action merely records each input line in the array
- x.
-
- Array elements may be named by non-numeric values,
- which gives awk a capability rather like the associative
- memory of Snobol tables. Suppose the input contains fields
- with values like apple, orange, etc. Then the program
-
- /apple/ { x["apple"]++ }
- /orange/ { x["orange"]++ }
- END { print x["apple"], x["orange"] }
-
- increments counts for the named array elements, and prints
- them at the end of the input.
-
- 3.6. Flow-of-Control Statements
-
- Awk provides the basic flow-of-control statements if-
- else, while, for, and statement grouping with braces, as in
- C. We showed the if statement in section 3.3 without
- describing it. The condition in parentheses is evaluated;
- if it is true, the statement following the if is done. The
- else part is optional.
-
- The while statement is exactly like that of C. For
- example, to print all input fields one per line,
-
- i = 1
- while (i <= NF) {
- print $i
- ++i
- }
-
-
- The for statement is also exactly that of C:
-
- for (i = 1; i <= NF; i++)
- print $i
-
- does the same job as the while statement above.
-
- There is an alternate form of the for statement which
- is suited for accessing the elements of an associative
- array:
-
-
-
-
-
-
-
-
-
-
-
-
- USD:19-12 Awk -- A Pattern Scanning and Processing Language
-
-
- for (i in array)
- statement
-
- does statement with i set in turn to each element of array.
- The elements are accessed in an apparently random order.
- Chaos will ensue if i is altered, or if any new elements are
- accessed during the loop.
-
- The expression in the condition part of an if, while or
- for can include relational operators like <, <=, >, >=, ==
- (``is equal to''), and != (``not equal to''); regular
- expression matches with the match operators ~ and !~; the
- logical operators ||, &&, and !; and of course parentheses
- for grouping.
-
- The break statement causes an immediate exit from an
- enclosing while or for; the continue statement causes the
- next iteration to begin.
-
- The statement next causes awk to skip immediately to
- the next record and begin scanning the patterns from the
- top. The statement exit causes the program to behave as if
- the end of the input had occurred.
-
- Comments may be placed in awk programs: they begin with
- the character # and end with the end of the line, as in
-
- print x, y # this is a comment
-
-
- 4. Design
-
- The UNIX system already provides several programs that
- operate by passing input through a selection mechanism.
- Grep, the first and simplest, merely prints all lines which
- match a single specified pattern. Egrep provides more gen-
- eral patterns, i.e., regular expressions in full generality;
- fgrep searches for a set of keywords with a particularly
- fast algorithm. Sed unix programm manual provides most of
- the editing facilities of the editor ed, applied to a stream
- of input. None of these programs provides numeric capabili-
- ties, logical relations, or variables.
-
- Lex lesk lexical analyzer cstr provides general regular
- expression recognition capabilities, and, by serving as a C
- program generator, is essentially open-ended in its capabil-
- ities. The use of lex, however, requires a knowledge of C
- programming, and a lex program must be compiled and loaded
- before use, which discourages its use for one-shot applica-
- tions.
-
- Awk is an attempt to fill in another part of the matrix
- of possibilities. It provides general regular expression
- capabilities and an implicit input/output loop. But it also
-
-
-
-
-
-
-
-
-
- Awk -- A Pattern Scanning and Processing Language USD:19-13
-
-
- provides convenient numeric processing, variables, more gen-
- eral selection, and control flow in the actions. It does
- not require compilation or a knowledge of C. Finally, awk
- provides a convenient way to access fields within lines; it
- is unique in this respect.
-
- Awk also tries to integrate strings and numbers com-
- pletely, by treating all quantities as both string and
- numeric, deciding which representation is appropriate as
- late as possible. In most cases the user can simply ignore
- the differences.
-
- Most of the effort in developing awk went into deciding
- what awk should or should not do (for instance, it doesn't
- do string substitution) and what the syntax should be (no
- explicit operator for concatenation) rather than on writing
- or debugging the code. We have tried to make the syntax
- powerful but easy to use and well adapted to scanning files.
- For example, the absence of declarations and implicit ini-
- tializations, while probably a bad idea for a general-
- purpose programming language, is desirable in a language
- that is meant to be used for tiny programs that may even be
- composed on the command line.
-
- In practice, awk usage seems to fall into two broad
- categories. One is what might be called ``report genera-
- tion'' -- processing an input to extract counts, sums, sub-
- totals, etc. This also includes the writing of trivial data
- validation programs, such as verifying that a field contains
- only numeric information or that certain delimiters are
- properly balanced. The combination of textual and numeric
- processing is invaluable here.
-
- A second area of use is as a data transformer, convert-
- ing data from the form produced by one program into that
- expected by another. The simplest examples merely select
- fields, perhaps with rearrangements.
-
- 5. Implementation
-
- The actual implementation of awk uses the language
- development tools available on the UNIX operating system.
- The grammar is specified with yacc; yacc johnson cstr the
- lexical analysis is done by lex; the regular expression rec-
- ognizers are deterministic finite automata constructed
- directly from the expressions. An awk program is translated
- into a parse tree which is then directly executed by a sim-
- ple interpreter.
-
- Awk was designed for ease of use rather than processing
- speed; the delayed evaluation of variable types and the
- necessity to break input into fields makes high speed diffi-
- cult to achieve in any case. Nonetheless, the program has
- not proven to be unworkably slow.
-
-
-
-
-
-
-
-
-
- USD:19-14 Awk -- A Pattern Scanning and Processing Language
-
-
- Table I below shows the execution (user + system) time
- on a PDP-11/70 of the UNIX programs wc, grep, egrep, fgrep,
- sed, lex, and awk on the following simple tasks:
-
- 1. count the number of lines.
-
- 2. print all lines containing ``doug''.
-
- 3. print all lines containing ``doug'', ``ken'' or
- ``dmr''.
-
- 4. print the third field of each line.
-
- 5. print the third and second fields of each line, in that
- order.
-
- 6. append all lines containing ``doug'', ``ken'', and
- ``dmr'' to files ``jdoug'', ``jken'', and ``jdmr'',
- respectively.
-
- 7. print each line prefixed by ``line-number : ''.
-
- 8. sum the fourth column of a table.
-
- The program wc merely counts words, lines and characters in
- its input; we have already mentioned the others. In all
- cases the input was a file containing 10,000 lines as cre-
- ated by the command ls -l; each line has the form
-
- -rw-rw-rw- 1 ava 123 Oct 15 17:05 xxx
-
- The total length of this input is 452,960 characters. Times
- for lex do not include compile or load.
-
- As might be expected, awk is not as fast as the spe-
- cialized tools wc, sed, or the programs in the grep family,
- but is faster than the more general tool lex. In all cases,
- the tasks were about as easy to express as awk programs as
- programs in these other languages; tasks involving fields
- were considerably easier to express as awk programs. Some
- of the test programs are shown in awk, sed and lex. $LIST$
- center; c c c c c c c c c c c c c c c c c c
- c|n|n|n|n|n|n|n|n|. Task
- Program 1 2 3 4 5 6 7 8 -- wc 8.6
- grep 11.7 13.1 egrep 6.2 11.5 11.6
- fgrep 7.7 13.8 16.1 sed 10.2 11.6 15.8 29.0 30.5 16.1
- lex 65.1 150.1 144.2 67.7 70.3 104.0 81.7 92.8
- awk 15.0 25.6 29.9 33.3 38.9 46.4 71.4 31.1 --
-
-
- Table I. Execution Times of Programs. (Times are in sec.)
-
-
- The programs for some
-
-
-
-
-
-
-
-
-
- Awk -- A Pattern Scanning and Processing Language USD:19-15
-
-
- of these jobs are shown 5. /[^ ]* [ ]*\([^ ]*\) [ ]*\([^ ]*\) .*/s//\2 \1/p
- below. The lex programs are
- generally too long to show.
- 6. /ken/w jken
- AWK: /doug/w jdoug
- /dmr/w jdmr
-
- 1. END {print NR}
- LEX:
-
- 2. /doug/
- 1. %{
- int i;
- 3. /ken|doug|dmr/ %}
- %%
- \n i++;
- 4. {print $3} . ;
- %%
- yywrap() {
- 5. {print $3, $2} printf("%d\n", i);
- }
-
- 6. /ken/ {print >"jken"}
- /doug/ {print >"jdoug"} 2. %%
- /dmr/ {print >"jdmr"} ^.*doug.*$ printf("%s\n", yytext);
- . ;
- \n ;
- 7. {print NR ": " $0}
-
-
- 8. {sum = sum + $4}
- END {print sum}
-
-
- SED:
-
-
- 1. $=
-
-
- 2. /doug/p
-
-
- 3. /doug/p
- /doug/d
- /ken/p
- /ken/d
- /dmr/p
- /dmr/d
-
-
- 4. /[^ ]* [ ]*[^ ]* [ ]*\([^ ]*\) .*/s//\1/p
-
-
-
-
-
-
-
-
-