home *** CD-ROM | disk | FTP | other *** search
- REGULAR EXPRESSIONS
- Tcl provides two commands that support string matching using
- egrep-style regular expressions: regexp and regsub. Regular
- expressions are implemented using Henry Spencer's package,
- and the description of regular expressions below is copied
- verbatim from his manual entry.
-
- A regular expression is zero or more branches, separated by
- ``|''. It matches anything that matches one of the
- branches.
-
- A branch is zero or more pieces, concatenated. It matches a
- match for the first, followed by a match for the second,
- etc.
-
- A piece is an atom possibly followed by ``*'', ``+'', or
- ``?''. An atom followed by ``*'' matches a sequence of 0 or
- more matches of the atom. An atom followed by ``+'' matches
- a sequence of 1 or more matches of the atom. An atom
- followed by ``?'' matches a match of the atom, or the null
- string.
-
- An atom is a regular expression in parentheses (matching a
- match for the regular expression), a range (see below),
- ``.'' (matching any single character), ``^'' (matching the
- null string at the beginning of the input string), ``$''
- (matching the null string at the end of the input string), a
- ``\'' followed by a single character (matching that
- character), or a single character with no other significance
- (matching that character).
-
- A range is a sequence of characters enclosed in ``[]''. It
- normally matches any single character from the sequence. If
- the sequence begins with ``^'', it matches any single
- character not from the rest of the sequence. If two
- characters in the sequence are separated by ``-'', this is
- shorthand for the full list of ASCII characters between them
- (e.g. ``[0-9]'' matches any decimal digit). To include a
- literal ``]'' in the sequence, make it the first character
- (following a possible ``^''). To include a literal ``-'',
- make it the first or last character.
-
- If a regular expression could match two different parts of a
- string, it will match the one which begins earliest. If
- both begin in the same place but match different lengths, or
- match the same length in different ways, life gets messier,
- as follows.
-
- In general, the possibilities in a list of branches are
- considered in left-to-right order, the possibilities for
- ``*'', ``+'', and ``?'' are considered longest-first, nested
- constructs are considered from the outermost in, and
- concatenated constructs are considered leftmost-first. The
- match that will be chosen is the one that uses the earliest
- possibility in the first choice that has to be made. If
- there is more than one choice, the next will be made in the
- same manner (earliest possibility) subject to the decision
- on the first choice. And so forth.
-
- For example, ``(ab|a)b*c'' could match ``abc'' in one of two
- ways. The first choice is between ``ab'' and ``a''; since
- ``ab'' is earlier, and does lead to a successful overall
- match, it is chosen. Since the ``b'' is already spoken for,
- the ``b*'' must match its last possibility-the empty
- string-since it must respect the earlier choice.
-
- In the particular case where no ``|''s are present and there
- is only one ``*'', ``+'', or ``?'', the net effect is that
- the longest possible match will be chosen. So ``ab*'',
- presented with ``xabbbby'', will match ``abbbb''. Note that
- if ``ab*'' is tried against ``xabyabbbz'', it will match
- ``ab'' just after ``x'', due to the begins-earliest rule.
- (In effect, the decision on where to start the match is the
- first choice to be made, hence subsequent choices must
- respect it even if this leads them to less-preferred
- alternatives.)
-