-
-
-
- REGEX(7) UNIX System V (7 Feb 1994) REGEX(7)
-
-
-
- NAME
- regex - POSIX 1003.2 regular expressions
-
- DESCRIPTION
- Regular expressions (``RE''s), as defined in POSIX 1003.2,
- come in two forms: modern REs (roughly those of egrep;
- 1003.2 calls these ``extended'' REs) and obsolete REs
- (roughly those of ed; 1003.2 ``basic'' REs). Obsolete REs
- mostly exist for backward compatibility in some old
- programs; they will be discussed at the end. 1003.2 leaves
- some aspects of RE syntax and semantics open; `|-' marks
- decisions on these aspects that may not be fully portable to
- other 1003.2 implementations.
-
- A (modern) RE is one|- or more non-empty|- branches, separated
- by `|'. It matches anything that matches one of the
- branches.
-
- A branch is one|- or more pieces, concatenated. It matches a
- match for the first, followed by a match for the second,
- etc.
-
- A piece is an atom possibly followed by a single|- `*', `+',
- `?', or bound. An atom followed by `*' matches a sequence
- of 0 or more matches of the atom. An atom followed by `+'
- matches a sequence of 1 or more matches of the atom. An
- atom followed by `?' matches a sequence of 0 or 1 matches of
- the atom.
-
- A bound is `{' followed by an unsigned decimal integer,
- possibly followed by `,' possibly followed by another
- unsigned decimal integer, always followed by `}'. The
- integers must lie between 0 and RE_DUP_MAX (255|-) inclusive,
- and if there are two of them, the first may not exceed the
- second. An atom followed by a bound containing one integer
- i and no comma matches a sequence of exactly i matches of
- the atom. An atom followed by a bound containing one
- integer i and a comma matches a sequence of i or more
- matches of the atom. An atom followed by a bound containing
- two integers i and j matches a sequence of i through j
- (inclusive) matches of the atom.
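The three bound forms can be tried out with regcomp(3) and regexec(3). The following is a minimal sketch, not part of the original page: the helper name bound_matches is hypothetical, and the patterns use `^' and `$' (described with the other atoms) so that the whole string must match.

```c
#include <assert.h>
#include <regex.h>
#include <stddef.h>

/* Hypothetical helper: returns 1 if the extended RE `pattern` matches
 * somewhere in `text`, 0 if it does not, and -1 if the pattern is
 * rejected by regcomp(). */
static int bound_matches(const char *pattern, const char *text)
{
    regex_t re;
    int rc;

    assert(pattern != NULL && text != NULL);
    /* REG_EXTENDED selects modern REs; REG_NOSUB skips offset reporting. */
    if (regcomp(&re, pattern, REG_EXTENDED | REG_NOSUB) != 0)
        return -1;
    rc = regexec(&re, text, 0, NULL, 0);
    regfree(&re);
    return rc == 0 ? 1 : 0;
}
```

With this helper, bound_matches("^a{2}$", "aa") should report a match while "aaa" should not; "^a{2,}$" accepts "aaaa"; "^a{2,3}$" accepts "aaa" but rejects "aaaa". A bound whose first integer exceeds the second, such as "a{3,2}", should fail to compile.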
-
- An atom is a regular expression enclosed in `()' (matching a
- match for the regular expression), an empty set of `()'
- (matching the null string)|-, a bracket expression (see
- below), `.' (matching any single character), `^' (matching
- the null string at the beginning of a line), `$' (matching
- the null string at the end of a line), a `\' followed by one
- of the characters `^.[$()|*+?{\' (matching that character
- taken as an ordinary character), a `\' followed by any other
- character|- (matching that character taken as an ordinary
- character, as if the `\' had not been present|-), or a single
- character with no other significance (matching that
- character). A `{' followed by a character other than a
- digit is an ordinary character, not the beginning of a
- bound|-. It is illegal to end an RE with `\'.
-
- A bracket expression is a list of characters enclosed in
- `[]'. It normally matches any single character from the
- list (but see below). If the list begins with `^', it
- matches any single character (but see below) not from the
- rest of the list. If two characters in the list are
- separated by `-', this is shorthand for the full range of
- characters between those two (inclusive) in the collating
- sequence, e.g. `[0-9]' in ASCII matches any decimal digit.
- It is illegal|- for two ranges to share an endpoint, e.g.
- `a-c-e'. Ranges are very collating-sequence-dependent, and
- portable programs should avoid relying on them.
-
- To include a literal `]' in the list, make it the first
- character (following a possible `^'). To include a literal
- `-', make it the first or last character, or the second
- endpoint of a range. To use a literal `-' as the first
- endpoint of a range, enclose it in `[.' and `.]' to make it
- a collating element (see below). With the exception of
- these and some combinations using `[' (see next paragraphs),
- all other special characters, including `\', lose their
- special significance within a bracket expression.
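The placement rules for literal `]', `-', and `\' inside a bracket expression can be checked with a short program. This is a sketch under the assumption of a POSIX-conforming regcomp(3); the helper name bracket_matches is hypothetical.

```c
#include <assert.h>
#include <regex.h>
#include <stddef.h>

/* Hypothetical helper: returns 1 if the extended RE `pattern` matches
 * somewhere in `text`, 0 if not, and -1 if it does not compile. */
static int bracket_matches(const char *pattern, const char *text)
{
    regex_t re;
    int rc;

    assert(pattern != NULL && text != NULL);
    if (regcomp(&re, pattern, REG_EXTENDED | REG_NOSUB) != 0)
        return -1;
    rc = regexec(&re, text, 0, NULL, 0);
    regfree(&re);
    return rc == 0 ? 1 : 0;
}
```

Here "^[]a]$" places `]' first so that it is a list member, "^[a-]$" places `-' last, and the C string "^[\\n]$" (the RE `^[\n]$') lists a backslash and the letter `n' rather than a newline escape, since `\' is ordinary inside the brackets.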
-
- Within a bracket expression, a collating element (a
- character, a multi-character sequence that collates as if it
- were a single character, or a collating-sequence name for
- either) enclosed in `[.' and `.]' stands for the sequence of
- characters of that collating element. The sequence is a
- single element of the bracket expression's list. A bracket
- expression containing a multi-character collating element
- can thus match more than one character, e.g. if the
- collating sequence includes a `ch' collating element, then
- the RE `[[.ch.]]*c' matches the first five characters of
- `chchcc'.
-
- Within a bracket expression, a collating element enclosed in
- `[=' and `=]' is an equivalence class, standing for the
- sequences of characters of all collating elements equivalent
- to that one, including itself. (If there are no other
- equivalent collating elements, the treatment is as if the
- enclosing delimiters were `[.' and `.]'.) For example, if o
- and ô are the members of an equivalence class, then
- `[[=o=]]', `[[=ô=]]', and `[oô]' are all synonymous. An
- equivalence class may not|- be an endpoint of a range.
-
- Within a bracket expression, the name of a character class
- enclosed in `[:' and `:]' stands for the list of all
- characters belonging to that class. Standard character
- class names are:
-
- alnum digit punct
- alpha graph space
- blank lower upper
- cntrl print xdigit
-
- These stand for the character classes defined in ctype(3).
- A locale may provide others. A character class may not be
- used as an endpoint of a range.
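Character classes are written inside an enclosing bracket expression, so the class `digit' appears as `[[:digit:]]'. The sketch below assumes the default "C" locale (no call to setlocale(3)); the helper name class_matches is hypothetical.

```c
#include <assert.h>
#include <regex.h>
#include <stddef.h>

/* Hypothetical helper: returns 1 if the whole of `text` matches the
 * extended RE `pattern` (which is expected to carry its own `^' and
 * `$'), 0 if not, and -1 if the pattern does not compile. */
static int class_matches(const char *pattern, const char *text)
{
    regex_t re;
    int rc;

    assert(pattern != NULL && text != NULL);
    if (regcomp(&re, pattern, REG_EXTENDED | REG_NOSUB) != 0)
        return -1;
    rc = regexec(&re, text, 0, NULL, 0);
    regfree(&re);
    return rc == 0 ? 1 : 0;
}
```

For example, "^[[:digit:]]+$" should accept "2024" and reject "20x4", and "^[[:upper:]][[:lower:]]*$" should accept "Regex" but not "regex".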
-
- There are two special cases|- of bracket expressions: the
- bracket expressions `[[:<:]]' and `[[:>:]]' match the null
- string at the beginning and end of a word respectively. A
- word is defined as a sequence of word characters which is
- neither preceded nor followed by word characters. A word
- character is an alnum character (as defined by ctype(3)) or
- an underscore. This is an extension, compatible with but
- not specified by POSIX 1003.2, and should be used with
- caution in software intended to be portable to other
- systems.
-
- In the event that an RE could match more than one substring
- of a given string, the RE matches the one starting earliest
- in the string. If the RE could match more than one
- substring starting at that point, it matches the longest.
- Subexpressions also match the longest possible substrings,
- subject to the constraint that the whole match be as long as
- possible, with subexpressions starting earlier in the RE
- taking priority over ones starting later. Note that
- higher-level subexpressions thus take priority over their
- lower-level component subexpressions.
-
- Match lengths are measured in characters, not collating
- elements. A null string is considered longer than no match
- at all. For example, `bb*' matches the three middle
- characters of `abbbc', `(wee|week)(knights|nights)' matches
- all ten characters of `weeknights', when `(.*).*' is matched
- against `abc' the parenthesized subexpression matches all
- three characters, and when `(a*)*' is matched against `bc'
- both the whole RE and the parenthesized subexpression match
- the null string.
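The overall-match examples above can be observed through the offsets that regexec(3) reports in its first regmatch_t slot. This is a sketch only: the helper name ere_span is hypothetical, and it looks at the whole match, not at the parenthesized subexpressions.

```c
#include <assert.h>
#include <regex.h>
#include <stddef.h>

/* Hypothetical helper: stores the start and end offsets of the overall
 * match of extended RE `pattern` in `text`. Returns 0 on a match, 1 when
 * nothing matches, and -1 if the pattern does not compile. */
static int ere_span(const char *pattern, const char *text, int *start, int *end)
{
    regex_t re;
    regmatch_t m[1];
    int rc;

    assert(pattern != NULL && text != NULL && start != NULL && end != NULL);
    if (regcomp(&re, pattern, REG_EXTENDED) != 0)
        return -1;
    rc = regexec(&re, text, 1, m, 0);   /* m[0] describes the whole match */
    regfree(&re);
    if (rc != 0)
        return 1;
    *start = (int)m[0].rm_so;           /* offset of first matched byte */
    *end = (int)m[0].rm_eo;             /* offset just past the match */
    return 0;
}
```

Matching "bb*" against "abbbc" should give offsets 1 and 4 (the three middle characters), "(wee|week)(knights|nights)" against "weeknights" should give 0 and 10, and "(a*)*" against "bc" should succeed with both offsets 0, the null string at the start.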
-
- If case-independent matching is specified, the effect is
- much as if all case distinctions had vanished from the
- alphabet. When an alphabetic that exists in multiple cases
- appears as an ordinary character outside a bracket
- expression, it is effectively transformed into a bracket
- expression containing both cases, e.g. `x' becomes `[xX]'.
- When it appears inside a bracket expression, all case
- counterparts of it are added to the bracket expression, so
- that (e.g.) `[x]' becomes `[xX]' and `[^x]' becomes `[^xX]'.
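In regcomp(3), case-independent matching is requested with the REG_ICASE flag. The sketch below illustrates the `[x]' and `[^x]' cases from the paragraph above; the helper name icase_matches and its extra_cflags parameter are hypothetical.

```c
#include <assert.h>
#include <regex.h>
#include <stddef.h>

/* Hypothetical helper: compiles `pattern` as an extended RE with
 * `extra_cflags` OR-ed in (e.g. REG_ICASE or 0). Returns 1 on a match,
 * 0 on no match, and -1 if the pattern does not compile. */
static int icase_matches(const char *pattern, const char *text, int extra_cflags)
{
    regex_t re;
    int rc;

    assert(pattern != NULL && text != NULL);
    if (regcomp(&re, pattern, REG_EXTENDED | REG_NOSUB | extra_cflags) != 0)
        return -1;
    rc = regexec(&re, text, 0, NULL, 0);
    regfree(&re);
    return rc == 0 ? 1 : 0;
}
```

With REG_ICASE, "^x$" should match "X", since `x' behaves like `[xX]'. Conversely "^[^x]$" matches "X" without the flag but should stop matching once REG_ICASE is given, since it behaves like `[^xX]'.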
-
- No particular limit is imposed on the length of REs|-.
- Programs intended to be portable should not employ REs
- longer than 256 bytes, as an implementation can refuse to
- accept such REs and remain POSIX-compliant.
-
- Obsolete (``basic'') regular expressions differ in several
- respects. `|', `+', and `?' are ordinary characters and
- there is no equivalent for their functionality. The
- delimiters for bounds are `\{' and `\}', with `{' and `}' by
- themselves ordinary characters. The parentheses for nested
- subexpressions are `\(' and `\)', with `(' and `)' by
- themselves ordinary characters. `^' is an ordinary
- character except at the beginning of the RE or|- the
- beginning of a parenthesized subexpression, `$' is an
- ordinary character except at the end of the RE or|- the end
- of a parenthesized subexpression, and `*' is an ordinary
- character if it appears at the beginning of the RE or the
- beginning of a parenthesized subexpression (after a possible
- leading `^'). Finally, there is one new type of atom, a
- back reference: `\' followed by a non-zero decimal digit d
- matches the same sequence of characters matched by the dth
- parenthesized subexpression (numbering subexpressions by the
- positions of their opening parentheses, left to right), so
- that (e.g.) `\([bc]\)\1' matches `bb' or `cc' but not `bc'.
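Obsolete REs are what regcomp(3) compiles when REG_EXTENDED is omitted. The following sketch exercises the differences listed above; the helper name bre_matches is hypothetical, and each backslash of the RE is doubled in the C string literals.

```c
#include <assert.h>
#include <regex.h>
#include <stddef.h>

/* Hypothetical helper: compiles `pattern` as an obsolete ("basic") RE,
 * i.e. without REG_EXTENDED. Returns 1 on a match, 0 on no match, and
 * -1 if the pattern does not compile. */
static int bre_matches(const char *pattern, const char *text)
{
    regex_t re;
    int rc;

    assert(pattern != NULL && text != NULL);
    if (regcomp(&re, pattern, REG_NOSUB) != 0)
        return -1;
    rc = regexec(&re, text, 0, NULL, 0);
    regfree(&re);
    return rc == 0 ? 1 : 0;
}
```

The back reference "^\\([bc]\\)\\1$" should accept "bb" and "cc" but not "bc". The bound "^a\\{2\\}$" should accept "aa", and "^a+$" should match only the two-character string "a+", because `+' is an ordinary character here.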
-
- SEE ALSO
- regex(3)
-
- POSIX 1003.2, section 2.8 (Regular Expression Notation).
-
- BUGS
- Having two kinds of REs is a botch.
-
- The current 1003.2 spec says that `)' is an ordinary
- character in the absence of an unmatched `('; this was an
- unintentional result of a wording error, and change is
- likely. Avoid relying on it.
-
- Back references are a dreadful botch, posing major problems
- for efficient implementations. They are also somewhat
- vaguely defined (does `a\(\(b\)*\2\)*d' match `abbbd'?).
- Avoid using them.
-
- 1003.2's specification of case-independent matching is
- vague. The ``one case implies all cases'' definition given
- above is current consensus among implementors as to the
- right interpretation.
-
- The syntax for word boundaries is incredibly ugly.
-
-
-
-
-
-
- Page 4 (printed 5/3/99)
-
-
-
-