home *** CD-ROM | disk | FTP | other *** search
Text File | 1991-04-10 | 6.5 KB | 180 lines | [TEXT/KAHL] |
- .TH REGEXP 3 local
- .DA 2 April 1986
- .SH NAME
- regcomp, regexec, regsub, regerror \- regular expression handler
- .SH SYNOPSIS
- .ft B
- .nf
- #include <regexp.h>
-
- regexp *regcomp(exp)
- char *exp;
-
- int regexec(prog, string)
- regexp *prog;
- char *string;
-
- regsub(prog, source, dest)
- regexp *prog;
- char *source;
- char *dest;
-
- regerror(msg)
- char *msg;
- .SH DESCRIPTION
- These functions implement
- .IR egrep (1)-style
- regular expressions and supporting facilities.
- .PP
- .I Regcomp
- compiles a regular expression into a structure of type
- .IR regexp ,
- and returns a pointer to it.
- The space has been allocated using
- .IR malloc (3)
- and may be released by
- .IR free .
- .PP
- .I Regexec
- matches a NUL-terminated \fIstring\fR against the compiled regular expression
- in \fIprog\fR.
- It returns 1 for success and 0 for failure, and adjusts the contents of
- \fIprog\fR's \fIstartp\fR and \fIendp\fR (see below) accordingly.
- .PP
- The members of a
- .I regexp
- structure include at least the following (not necessarily in order):
- .PP
- .RS
- char *startp[NSUBEXP];
- .br
- char *endp[NSUBEXP];
- .RE
- .PP
- where
- .I NSUBEXP
- is defined (as 10) in the header file.
- Once a successful \fIregexec\fR has been done using the \fIregexp\fR,
- each \fIstartp\fR-\fIendp\fR pair describes one substring
- within the \fIstring\fR,
- with the \fIstartp\fR pointing to the first character of the substring and
- the \fIendp\fR pointing to the first character following the substring.
- The 0th substring is the substring of \fIstring\fR that matched the whole
- regular expression.
- The others are those substrings that matched parenthesized expressions
- within the regular expression, with parenthesized expressions numbered
- in left-to-right order of their opening parentheses.
- .PP
- .I Regsub
- copies \fIsource\fR to \fIdest\fR, making substitutions according to the
- most recent \fIregexec\fR performed using \fIprog\fR.
- Each instance of `&' in \fIsource\fR is replaced by the substring
- indicated by \fIstartp\fR[\fI0\fR] and
- \fIendp\fR[\fI0\fR].
- Each instance of `\e\fIn\fR', where \fIn\fR is a digit, is replaced by
- the substring indicated by
- \fIstartp\fR[\fIn\fR] and
- \fIendp\fR[\fIn\fR].
- To get a literal `&' or `\e\fIn\fR' into \fIdest\fR, prefix it with `\e';
- to get a literal `\e' preceding `&' or `\e\fIn\fR', prefix it with
- another `\e'.
- .PP
- .I Regerror
- is called whenever an error is detected in \fIregcomp\fR, \fIregexec\fR,
- or \fIregsub\fR.
- The default \fIregerror\fR writes the string \fImsg\fR,
- with a suitable indicator of origin,
- on the standard
- error output
- and invokes \fIexit\fR(2).
- .I Regerror
- can be replaced by the user if other actions are desirable.
- .SH "REGULAR EXPRESSION SYNTAX"
- A regular expression is zero or more \fIbranches\fR, separated by `|'.
- It matches anything that matches one of the branches.
- .PP
- A branch is zero or more \fIpieces\fR, concatenated.
- It matches a match for the first, followed by a match for the second, etc.
- .PP
- A piece is an \fIatom\fR possibly followed by `*', `+', or `?'.
- An atom followed by `*' matches a sequence of 0 or more matches of the atom.
- An atom followed by `+' matches a sequence of 1 or more matches of the atom.
- An atom followed by `?' matches a match of the atom, or the null string.
- .PP
- An atom is a regular expression in parentheses (matching a match for the
- regular expression), a \fIrange\fR (see below), `.'
- (matching any single character), `^' (matching the null string at the
- beginning of the input string), `$' (matching the null string at the
- end of the input string), a `\e' followed by a single character (matching
- that character), or a single character with no other significance
- (matching that character).
- .PP
- A \fIrange\fR is a sequence of characters enclosed in `[]'.
- It normally matches any single character from the sequence.
- If the sequence begins with `^',
- it matches any single character \fInot\fR from the rest of the sequence.
- If two characters in the sequence are separated by `\-', this is shorthand
- for the full list of ASCII characters between them
- (e.g. `[0-9]' matches any decimal digit).
- To include a literal `]' in the sequence, make it the first character
- (following a possible `^').
- To include a literal `\-', make it the first or last character.
- .SH AMBIGUITY
- If a regular expression could match two different parts of the input string,
- it will match the one which begins earliest.
- If both begin in the same place but match different lengths, or match
- the same length in different ways, life gets messier, as follows.
- .PP
- In general, the possibilities in a list of branches are considered in
- left-to-right order, the possibilities for `*', `+', and `?' are
- considered longest-first, nested constructs are considered from the
- outermost in, and concatenated constructs are considered leftmost-first.
- The match that will be chosen is the one that uses the earliest
- possibility in the first choice that has to be made.
- If there is more than one choice, the next will be made in the same manner
- (earliest possibility) subject to the decision on the first choice.
- And so forth.
- .PP
- For example, `(ab|a)b*c' could match `abc' in one of two ways.
- The first choice is between `ab' and `a'; since `ab' is earlier, and does
- lead to a successful overall match, it is chosen.
- Since the `b' is already spoken for,
- the `b*' must match its last possibility\(emthe empty string\(emsince
- it must respect the earlier choice.
- .PP
- In the particular case where no `|'s are present and there is only one
- `*', `+', or `?', the net effect is that the longest possible
- match will be chosen.
- So `ab*', presented with `xabbbby', will match `abbbb'.
- Note that if `ab*' is tried against `xabyabbbz', it
- will match `ab' just after `x', due to the begins-earliest rule.
- (In effect, the decision on where to start the match is the first choice
- to be made, hence subsequent choices must respect it even if this leads them
- to less-preferred alternatives.)
- .SH SEE ALSO
- egrep(1), expr(1)
- .SH DIAGNOSTICS
- \fIRegcomp\fR returns NULL for a failure
- (\fIregerror\fR permitting),
- where failures are syntax errors, exceeding implementation limits,
- or applying `+' or `*' to a possibly-null operand.
- .SH HISTORY
- Both code and manual page were
- written at U of T.
- They are intended to be compatible with the Bell V8 \fIregexp\fR(3),
- but are not derived from Bell code.
- .SH BUGS
- Empty branches and empty regular expressions are not portable to V8.
- .PP
- The restriction against
- applying `*' or `+' to a possibly-null operand is an artifact of the
- simplistic implementation.
- .PP
- Does not support \fIegrep\fR's newline-separated branches;
- neither does the V8 \fIregexp\fR(3), though.
- .PP
- Due to emphasis on
- compactness and simplicity,
- it's not strikingly fast.
- It does give special attention to handling simple cases quickly.
-