home *** CD-ROM | disk | FTP | other *** search
-
-
-
- REGEXP(3)
-
-
-
- NAME
- regcomp, regexec, regsub, regerror - regular expression
- handler
-
- SYNOPSIS
- #include <regexp.h>
-
- regexp *regcomp(exp)
- char *exp;
-
- int regexec(prog, string)
- regexp *prog;
- char *string;
-
- regsub(prog, source, dest)
- regexp *prog;
- char *source;
- char *dest;
-
- regerror(msg)
- char *msg;
-
- DESCRIPTION
- These functions implement egrep(1)-style regular expressions
- and supporting facilities.
-
- Regcomp compiles a regular expression into a structure of
- type regexp, and returns a pointer to it. The space has
- been allocated using malloc(3) and may be released by free.
-
- Regexec matches a NUL-terminated string against the compiled
- regular expression in prog. It returns 1 for success and 0
- for failure, and adjusts the contents of prog's startp and
- endp (see below) accordingly.
-
- The members of a regexp structure include at least the fol-
- lowing (not necessarily in order):
-
- char *startp[NSUBEXP];
- char *endp[NSUBEXP];
-
- where NSUBEXP is defined (as 10) in the header file. Once a
- successful regexec has been done using the regexp, each
- startp-endp pair describes one substring within the string,
- with the startp pointing to the first character of the sub-
- string and the endp pointing to the first character follow-
- ing the substring. The 0th substring is the substring of
- string that matched the whole regular expression. The oth-
- ers are those substrings that matched parenthesized expres-
- sions within the regular expression, with parenthesized
- expressions numbered in left-to-right order of their opening
- parentheses.
-
-
-
- 1
-
-
-
-
-
-
- REGEXP(3)
-
-
-
- Regsub copies source to dest, making substitutions according
- to the most recent regexec performed using prog. Each
- instance of `&' in source is replaced by the substring indi-
- cated by startp[0] and endp[0]. Each instance of `\n',
- where n is a digit, is replaced by the substring indicated
- by startp[n] and endp[n]. To get a literal `&' or `\n' into
- dest, prefix it with `\'; to get a literal `\' preceding `&'
- or `\n', prefix it with another `\'.
-
- Regerror is called whenever an error is detected in regcomp,
- regexec, or regsub. The default regerror writes the string
- msg, with a suitable indicator of origin, on the standard
- error output and invokes exit(2). Regerror can be replaced
- by the user if other actions are desirable.
-
- REGULAR EXPRESSION SYNTAX
- A regular expression is zero or more branches, separated by
- `|'. It matches anything that matches one of the branches.
-
- A branch is zero or more pieces, concatenated. It matches a
- match for the first, followed by a match for the second,
- etc.
-
- A piece is an atom possibly followed by `*', `+', or `?'.
- An atom followed by `*' matches a sequence of 0 or more
- matches of the atom. An atom followed by `+' matches a
- sequence of 1 or more matches of the atom. An atom followed
- by `?' matches a match of the atom, or the null string.
-
- An atom is a regular expression in parentheses (matching a
- match for the regular expression), a range (see below), `.'
- (matching any single character), `^' (matching the null
- string at the beginning of the input string), `$' (matching
- the null string at the end of the input string), a `\' fol-
- lowed by a single character (matching that character), or a
- single character with no other significance (matching that
- character).
-
- A range is a sequence of characters enclosed in `[]'. It
- normally matches any single character from the sequence. If
- the sequence begins with `^', it matches any single charac-
- ter not from the rest of the sequence. If two characters in
- the sequence are separated by `-', this is shorthand for the
- full list of ASCII characters between them (e.g. `[0-9]'
- matches any decimal digit). To include a literal `]' in the
- sequence, make it the first character (following a possible
- `^'). To include a literal `-', make it the first or last
- character.
-
- AMBIGUITY
- If a regular expression could match two different parts of
- the input string, it will match the one which begins
-
-
-
- 2
-
-
-
-
-
-
- REGEXP(3)
-
-
-
- earliest. If both begin in the same place but match dif-
- ferent lengths, or match the same length in different ways,
- life gets messier, as follows.
-
- In general, the possibilities in a list of branches are con-
- sidered in left-to-right order, the possibilities for `*',
- `+', and `?' are considered longest-first, nested constructs
- are considered from the outermost in, and concatenated con-
- structs are considered leftmost-first. The match that will
- be chosen is the one that uses the earliest possibility in
- the first choice that has to be made. If there is more than
- one choice, the next will be made in the same manner (earli-
- est possibility) subject to the decision on the first
- choice. And so forth.
-
- For example, `(ab|a)b*c' could match `abc' in one of two
- ways. The first choice is between `ab' and `a'; since `ab'
- is earlier, and does lead to a successful overall match, it
- is chosen. Since the `b' is already spoken for, the `b*'
- must match its last possibility-the empty string-since it
- must respect the earlier choice.
-
- In the particular case where no `|'s are present and there
- is only one `*', `+', or `?', the net effect is that the
- longest possible match will be chosen. So `ab*', presented
- with `xabbbby', will match `abbbb'. Note that if `ab*' is
- tried against `xabyabbbz', it will match `ab' just after
- `x', due to the begins-earliest rule. (In effect, the deci-
- sion on where to start the match is the first choice to be
- made, hence subsequent choices must respect it even if this
- leads them to less-preferred alternatives.)
-
- SEE ALSO
- egrep(1), expr(1)
-
- DIAGNOSTICS
- Regcomp returns NULL for a failure (regerror permitting),
- where failures are syntax errors, exceeding implementation
- limits, or applying `+' or `*' to a possibly-null operand.
-
- HISTORY
- Both code and manual page were written at U of T. They are
- intended to be compatible with the Bell V8 regexp(3), but
- are not derived from Bell code.
-
- BUGS
- Empty branches and empty regular expressions are not port-
- able to V8.
-
- The restriction against applying `*' or `+' to a possibly-
- null operand is an artifact of the simplistic implementa-
- tion.
-
-
-
- 3
-
-
-
-
-
-
- REGEXP(3)
-
-
-
- Does not support egrep's newline-separated branches; neither
- does the V8 regexp(3), though.
-
- Due to emphasis on compactness and simplicity, it's not
- strikingly fast. It does give special attention to handling
- simple cases quickly.
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
- 4
-
-
-
-