GNU Info File | 1998-05-20 | 13.4 KB | 502 lines |
- This is Info file rx.info, produced by Makeinfo-1.63 from the input
- file rx.texi.
-
- File: rx.info, Node: Top, Next: An Introduction to Regexps, Prev: (dir), Up: (dir)
-
- Regexps
- *******
-
- This document describes the Posix "Basic Regular Expression" ("BRE")
- language.
-
- The Posix Basic Regular Expression language is a notation for
- describing text patterns. Regexps are typically used by comparing them
- to a string to see if that string matches the pattern, or by searching
- within a string for a substring that matches.
-
- This is not a formal definition of Posix regexps - it is an intuitive
- and hopefully expository description of them.
-
- * Menu:
-
- * An Introduction to Regexps::
- * Literal Regexps::
- * Character Sets::
- * Subexpressions::
- * Repeated Subexpressions::
- * Optional Subexpressions::
- * Counted Subexpressions::
- * Alternative Subexpressions::
- * Backreferences::
- * A Summary of Regexp Syntax::
- * Ambiguous Patterns::
- * Acknowledgements::
-
- File: rx.info, Node: An Introduction to Regexps, Next: Literal Regexps, Prev: Top, Up: Top
-
- An Introduction to Regexps
- ==========================
-
- In the simplest cases, a regexp is just a literal string that must
- match exactly. For example, the pattern:
-
- regexp
-
- matches the string "regexp" and no others.
-
- Some characters have a special meaning when they occur in a regexp.
- They aren't matched literally as in the previous example, but instead
- denote a more general pattern. For example, the character `*' is used
- to indicate that the preceding element of a regexp may be repeated 0,
- 1, or more times. In the pattern:
-
- smooo*th
-
- the `*' indicates that the preceding `o' can be repeated 0 or more
- times. So the pattern matches:
-
- smooth
- smoooth
- smooooth
- smoooooth
- ...
-
- Suppose you want to write a pattern that literally matches a special
- character like `*' - in other words, you don't want `*' to indicate
- a permissible repetition, but to match `*' literally. This is
- accomplished by quoting the special character with a backslash. The
- pattern:
-
- smoo\*th
-
- matches the string:
-
- smoo*th
-
- and no other strings.
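-
- As a quick check of the two patterns above, here is a minimal sketch
- in Python. It assumes Python's `re' module as a stand-in matcher; `re'
- is not a Posix BRE engine, but `*' and `\*' mean the same thing in both
- notations. The helper name `matches' is made up for this example.
-
```python
import re

# `*` repeats the preceding `o` zero or more times.
repeat = re.compile(r"smooo*th")
# `\*` matches a literal asterisk.
literal = re.compile(r"smoo\*th")

def matches(pattern, text):
    """Return True when the whole of `text` matches `pattern`."""
    return pattern.fullmatch(text) is not None

print(matches(repeat, "smoooooth"))   # True
print(matches(repeat, "smoth"))       # False: at least two o's are required
print(matches(literal, "smoo*th"))    # True
print(matches(literal, "smooth"))     # False: the `*` must appear literally
```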
-
- In seven cases, the pattern is reversed - a backslash makes the
- character special instead of making a special character normal. The
- characters `+', `?', `|', `(', `)', `{', and `}' are normal but the
- sequences `\+', `\?', `\|', `\(', `\)', `\{', and `\}' are special
- (their meaning is described later).
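-
- Many other regexp notations reverse this convention for the same seven
- characters. The sketch below, which is not part of any library,
- converts the escaping used here into the convention of Python's `re'
- module. The function name `bre_to_python' is hypothetical, and the
- sketch makes a simplifying assumption: it does not parse character
- sets, so a backslash inside `[...]' is not handled faithfully.
-
```python
import re

# The seven characters that become special only when backslashed.
SWAPPED = set("+?|(){}")

def bre_to_python(bre):
    """Swap the escaping of + ? | ( ) { } between the two conventions.

    Simplified sketch: character sets are not parsed.
    """
    out = []
    i = 0
    while i < len(bre):
        ch = bre[i]
        if ch == "\\" and i + 1 < len(bre):
            nxt = bre[i + 1]
            # `\+` is special here, so it becomes a bare `+`;
            # any other escape (such as `\*` or `\1`) is kept as is.
            out.append(nxt if nxt in SWAPPED else ch + nxt)
            i += 2
        else:
            # A bare `+` is literal here, so it becomes `\+`.
            out.append("\\" + ch if ch in SWAPPED else ch)
            i += 1
    return "".join(out)

print(bre_to_python(r"bana\(na\)\+"))   # bana(na)+
print(bre_to_python("1+1"))             # 1\+1
```
-
- The converted pattern can then be handed to `re.fullmatch' or
- `re.search' in the usual way.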
-
- The remaining sections of this document introduce and explain the
- various special characters that can occur in regexps.
-
- File: rx.info, Node: Literal Regexps, Next: Character Sets, Prev: An Introduction to Regexps, Up: Top
-
- Literal Regexps
- ===============
-
- A literal regexp is a string which contains no special characters.
- A literal regexp matches an identical string, but no other strings.
- For example:
-
- literally
-
- matches
-
- literally
-
- and nothing else.
-
- Generally, whitespace characters, digits, and letters are not
- special. Some punctuation characters are special and some are not (the
- syntax summary near the end of this document makes a convenient
- reference for which characters are special and which aren't).
-
- File: rx.info, Node: Character Sets, Next: Subexpressions, Prev: Literal Regexps, Up: Top
-
- Character Sets
- ==============
-
- This section introduces the special characters `.' and `['.
-
- `.' matches any single character except the NUL character. For example:
-
- p.ck
-
- matches
-
- pick
- pack
- puck
- pbck
- pcck
- p.ck
-
- ...
-
- `[' begins a "character set". A character set is similar to `.' in
- that it matches not a single, literal character, but any of a set of
- characters. `[' is different from `.' in that with `[', you define
- the set of characters explicitly.
-
- There are three basic forms a character set can take.
-
- In the first form, the character set is spelled out:
-
- [<cset-spec>] -- every character in <cset-spec> is in the set.
-
- In the second form, the set is the negation of an explicitly spelled
- out character set:
-
- [^<cset-spec>] -- every character *not* in <cset-spec> is in the set.
-
- A `<cset-spec>' is more or less an explicit enumeration of a set of
- characters. It can be written as a string of individual characters:
-
- [aeiou]
-
- or as a range of characters:
-
- [0-9]
-
- These two forms can be mixed:
-
- [A-Za-z0-9_$]
-
- Note that special regexp characters (such as `*') are *not* special
- within a character set. `-', as illustrated above, *is* special,
- except, as illustrated below, when it is the first character mentioned.
-
- This is a four-character set:
-
- [-+*/]
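-
- The explicit forms above can be tried out with a short sketch. It
- assumes Python's `re' module, whose syntax for these simple character
- sets is the same as the one described here. The mixed set is written
- with the ranges `A-Z' and `a-z'; the variable names are only for
- illustration.
-
```python
import re

vowel = re.compile(r"[aeiou]")         # enumerated set
digit = re.compile(r"[0-9]")           # range
ident = re.compile(r"[A-Za-z0-9_$]")   # ranges and single characters mixed
operator = re.compile(r"[-+*/]")       # leading `-` is literal; `+` and `*` are not special
non_vowel = re.compile(r"[^aeiou]")    # negated set

print(bool(operator.fullmatch("-")))   # True
print(bool(operator.fullmatch("*")))   # True
print(bool(ident.fullmatch("$")))      # True
print(bool(non_vowel.fullmatch("e")))  # False
```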
-
- The third form of a character set makes use of a pre-defined
- "character class":
-
- [[:class-name:]] -- every character described by class-name is in the set.
-
- The supported character classes are:
-
- alnum - the set of alpha-numeric characters
- alpha - the set of alphabetic characters
- blank - tab and space
- cntrl - the control characters
- digit - decimal digits
- graph - all printable characters except space
- lower - lower case letters
- print - the "printable" characters
- punct - punctuation
- space - whitespace characters
- upper - upper case letters
- xdigit - hexadecimal digits
-
- Finally, character class sets can also be inverted:
-
- [^[:space:]] - all non-whitespace characters
-
- Character sets can be used in a regular expression anywhere a literal
- character can.
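-
- Not every regexp implementation understands the `[[:class-name:]]'
- notation; Python's `re' module, for one, does not. The sketch below
- spells each class out as an explicit set instead. It assumes the
- ASCII membership of the C/Posix locale (other locales may add
- characters), and the names `POSIX_CLASSES' and `class_set' are
- hypothetical.
-
```python
import re
import string

# ASCII members of each class, as in the C/POSIX locale (an assumption).
POSIX_CLASSES = {
    "alnum": string.ascii_letters + string.digits,
    "alpha": string.ascii_letters,
    "blank": " \t",
    "cntrl": "".join(map(chr, range(32))) + "\x7f",
    "digit": string.digits,
    "graph": "".join(map(chr, range(33, 127))),
    "lower": string.ascii_lowercase,
    "print": "".join(map(chr, range(32, 127))),
    "punct": string.punctuation,
    "space": string.whitespace,
    "upper": string.ascii_uppercase,
    "xdigit": string.hexdigits,
}

def class_set(name, negate=False):
    """Build an explicit set equivalent to [[:name:]] or [^[:name:]]."""
    members = "".join(re.escape(ch) for ch in POSIX_CLASSES[name])
    return "[" + ("^" if negate else "") + members + "]"

# One or more non-whitespace characters, like `[^[:space:]]` repeated.
word = re.compile(class_set("space", negate=True) + "+")
print(word.findall("Crosby, Stills, Nash"))   # ['Crosby,', 'Stills,', 'Nash']
```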
-
- File: rx.info, Node: Subexpressions, Next: Repeated Subexpressions, Prev: Character Sets, Up: Top
-
- Subexpressions
- ==============
-
- A subexpression is a regular expression enclosed in `\(' and `\)'.
- A subexpression can be used anywhere a single character or character
- set can be used.
-
- Subexpressions are useful for grouping regexp constructs. For
- example, the repeat operator, `*', usually applies to just the
- preceding character. Recall that:
-
- smooo*th
-
- matches
-
- smooth
- smoooth
- ...
-
- Using a subexpression, we can apply `*' to a longer string:
-
- banan\(an\)*a
-
- matches
-
- banana
- bananana
- banananana
- ...
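-
- A small sketch of the grouping example, assuming Python's `re' module
- as the matcher. In that notation the subexpression is written with
- bare parentheses, so `banan\(an\)*a' becomes `banan(an)*a'.
-
```python
import re

# The group `(an)` is repeated as a unit by the following `*`.
pattern = re.compile(r"banan(an)*a")

for text in ["banana", "bananana", "banananana", "bananaa"]:
    # Only the last string, with its stray trailing `a`, fails to match.
    print(text, pattern.fullmatch(text) is not None)
```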
-
- Subexpressions also have a special meaning with regard to
- backreferences and substitutions (see *Note Backreferences::).
-
- File: rx.info, Node: Repeated Subexpressions, Next: Optional Subexpressions, Prev: Subexpressions, Up: Top
-
- Repeated Subexpressions
- =======================
-
- `*' is the repeat operator. It applies to the preceding character,
- character set, subexpression or backreference. It indicates that the
- preceding element can be matched 0 or more times:
-
- bana\(na\)*
-
- matches
-
- bana
- banana
- bananana
- banananana
- ...
-
- `\+' is similar to `*' except that `\+' requires the preceding element
- to be matched at least once. So while:
-
- bana\(na\)*
-
- matches
-
- bana
-
- bana\(na\)\+
-
- does not. Both match
-
- banana
- bananana
- banananana
- ...
-
- Thus, `bana\(na\)\+' is short-hand for `banana\(na\)*'.
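-
- The equivalence can be checked mechanically. The sketch below assumes
- Python's `re' module, where the one-or-more operator is written `+'
- rather than `\+' and groups use bare parentheses.
-
```python
import re

star = re.compile(r"bana(na)*")         # zero or more repetitions of `na`
plus = re.compile(r"bana(na)+")         # one or more repetitions of `na`
expanded = re.compile(r"banana(na)*")   # the long-hand form of `plus`

samples = ["bana", "banana", "bananana", "banananana"]
for text in samples:
    # `plus` and `expanded` agree on every sample; `star` also accepts "bana".
    print(text,
          bool(star.fullmatch(text)),
          bool(plus.fullmatch(text)),
          bool(expanded.fullmatch(text)))
```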
-
- File: rx.info, Node: Optional Subexpressions, Next: Counted Subexpressions, Prev: Repeated Subexpressions, Up: Top
-
- Optional Subexpressions
- =======================
-
- `\?' indicates that the preceding character, character set, or
- subexpression is optional. It is permitted to match, or to be skipped:
-
- CSNY\?
-
- matches both
-
- CSN
-
- and
-
- CSNY
-
- File: rx.info, Node: Counted Subexpressions, Next: Alternative Subexpressions, Prev: Optional Subexpressions, Up: Top
-
- Counted Subexpressions
- ======================
-
- An interval expression, `\{m,n\}' where `m' and `n' are non-negative
- integers with `n >= m', applies to the preceding character, character
- set, subexpression or backreference. It indicates that the preceding
- element must match at least `m' times and may match as many as `n'
- times.
-
- For example:
-
- c\([ad]\)\{1,4\}r
-
- matches
-
- car
- cdr
- caar
- cdar
- ...
- caaar
- cdaar
- ...
- cadddr
- cddddr
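-
- To see exactly which strings an interval admits, the sketch below
- enumerates candidates and filters them. It assumes Python's `re'
- module, where the interval is written `{1,4}' without backslashes, and
- it uses the pattern with a trailing `r', as the listed matches require.
-
```python
import re
from itertools import product

# One to four characters from the set [ad], between `c` and `r`.
pattern = re.compile(r"c([ad]){1,4}r")

# Every string of the form c + n letters from "ad" + r, for n = 0..5.
candidates = ["c" + "".join(mid) + "r"
              for n in range(6)
              for mid in product("ad", repeat=n)]
matched = [s for s in candidates if pattern.fullmatch(s)]

print(len(matched))    # 30 = 2 + 4 + 8 + 16
print(matched[:4])     # ['car', 'cdr', 'caar', 'cadr']
```
-
- Neither `cr' (no repetitions) nor any string with five letters between
- `c' and `r' survives the filter.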
-
- File: rx.info, Node: Alternative Subexpressions, Next: Backreferences, Prev: Counted Subexpressions, Up: Top
-
- Alternative Subexpressions
- ==========================
-
- An alternative is written:
-
- regexp-1\|regexp-2\|regexp-3\|...
-
- It matches anything matched by some `regexp-n'. For example:
-
- Crosby, Stills, \(and Nash\|Nash, and Young\)
-
- matches
-
- Crosby, Stills, and Nash
-
- and
-
- Crosby, Stills, Nash, and Young
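-
- The same alternative can be tried in Python's `re' module, under the
- assumption that `\|', `\(' and `\)' are rewritten as `|', `(' and `)'
- for that notation.
-
```python
import re

pattern = re.compile(r"Crosby, Stills, (and Nash|Nash, and Young)")

print(bool(pattern.fullmatch("Crosby, Stills, and Nash")))         # True
print(bool(pattern.fullmatch("Crosby, Stills, Nash, and Young")))  # True
print(bool(pattern.fullmatch("Crosby, Stills, and Young")))        # False
```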
-
- File: rx.info, Node: Backreferences, Next: A Summary of Regexp Syntax, Prev: Alternative Subexpressions, Up: Top
-
- Backreferences, Extractions and Substitutions
- =============================================
-
- A backreference is written `\n' where `n' is some single digit other
- than 0. To be a valid backreference, there must be at least `n'
- parenthesized subexpressions in the pattern prior to the backreference.
-
- A backreference matches a literal copy of whatever was matched by the
- corresponding subexpression. For example,
-
- \(.*\)-\1
-
- matches:
-
- go-go
- ha-ha
- wakka-wakka
- ...
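-
- A minimal sketch of this backreference, assuming Python's `re' module
- (the subexpression is written with bare parentheses there; `\1' is
- spelled the same way):
-
```python
import re

# `\1` must repeat exactly the text that the first group matched.
doubled = re.compile(r"(.*)-\1")

print(bool(doubled.fullmatch("wakka-wakka")))   # True
print(bool(doubled.fullmatch("go-ha")))         # False: "ha" is not a copy of "go"
```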
-
- In some applications, subexpressions are used to extract substrings.
- For example, Emacs has the functions `match-beginning' and `match-end'
- which report the positions of strings matched by subexpressions. These
- functions use the same numbering scheme for subexpressions as
- backreferences, with the additional rule that subexpression 0 is
- defined to be the whole regexp.
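-
- Python's match objects follow the same numbering, which makes for a
- compact illustration. The sketch assumes Python's `re' module; the
- pattern and the sample text are made up for this example.
-
```python
import re

m = re.search(r"c([ad]+)r", "the cadr function")

# Subexpression 0 is the whole match.
print(m.group(0), m.start(0), m.end(0))   # cadr 4 8
# Subexpression 1 is the first parenthesized group.
print(m.group(1), m.start(1), m.end(1))   # ad 5 7
```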
-
- In some applications, subexpressions are used in string substitution.
- This again uses the backreference numbering scheme. For example, this
- sed command:
-
- s/From:.*<\(.*\)>/To: \1/
-
- first matches the line:
-
- From: Joe Schmoe <schmoe@uspringfield.edu>
-
- When it does, subexpression 1 matches "schmoe@uspringfield.edu". The
- command then replaces the matched text with "To: \1", after
- substituting the text of subexpression 1 for `\1', to get:
-
- To: schmoe@uspringfield.edu
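-
- For comparison, here is a sketch of the same substitution using
- Python's `re.sub'. It assumes the pattern is rewritten with bare
- parentheses for that notation; `count=1' mirrors sed's default of
- replacing only the first match on a line.
-
```python
import re

line = "From: Joe Schmoe <schmoe@uspringfield.edu>"

# `\1` in the replacement stands for the text matched by the first group.
result = re.sub(r"From:.*<(.*)>", r"To: \1", line, count=1)

print(result)   # To: schmoe@uspringfield.edu
```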
-
- File: rx.info, Node: A Summary of Regexp Syntax, Next: Ambiguous Patterns, Prev: Backreferences, Up: Top
-
- A Summary of Regexp Syntax
- ==========================
-
- In summary, regexps can be:
-
- `abcd' - matching a string literally
-
- `.' - matching any character except NUL
-
- `[a-z_?]', `[^a-z_?]', `[[:alpha:]]' and `[^[:alpha:]]' - matching
- character sets
-
- `\(subexp\)' - grouping an expression into a subexpression.
-
- `\n' - match a copy of whatever was matched by the nth subexpression.
-
- The following special characters and sequences can be applied to a
- character, character set, subexpression, or backreference:
-
- `*' - repeat the preceding element 0 or more times.
-
- `\+' - repeat the preceding element 1 or more times.
-
- `\?' - match the preceding element 0 or 1 time.
-
- `\{m,n\}' - match the preceding element at least `m', and as many as
- `n' times.
-
- `regexp-1\|regexp-2\|..' - match any regexp-n.
-
- A special character, like `.' or `*' can be made into a literal
- character by prefixing it with `\'.
-
- A special sequence, like `\+' or `\?' can be made into a literal
- character by dropping the `\'.
-
- File: rx.info, Node: Ambiguous Patterns, Next: Acknowledgements, Prev: A Summary of Regexp Syntax, Up: Top
-
- Ambiguous Patterns
- ==================
-
- Sometimes a regular expression appears to be ambiguous. For
- example, suppose we compare the pattern:
-
- begin\|beginning
-
- to the string
-
- beginning
-
- Either just the first 5 characters will match, or the whole string will
- match.
-
- In every case like this, the longer match is preferred. The whole
- string will match.
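-
- Not every regexp engine applies this rule. Python's `re' module, for
- instance, tries alternatives from left to right and keeps the first
- one that succeeds. The sketch below shows that behaviour and then
- emulates the longest-match rule for the simple case of a top-level
- list of alternatives. The helper `longest_prefix_match' is
- hypothetical, and it is only adequate when each alternative on its own
- has a single possible match at the start of the string.
-
```python
import re

text = "beginning"

# Leftmost-first: `begin` succeeds, so `beginning` is never tried.
first = re.match(r"begin|beginning", text).group()
print(first)   # begin

def longest_prefix_match(alternatives, text):
    """Try each alternative at the start of `text`; keep the longest match."""
    found = [m.group() for m in (re.match(alt, text) for alt in alternatives) if m]
    return max(found, key=len) if found else None

print(longest_prefix_match(["begin", "beginning"], text))   # beginning
```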
-
- Sometimes there is ambiguity not about how many characters to match,
- but where the subexpressions occur within the match. This can affect
- extraction functions like Emacs' `match-beginning' or rewrite functions
- like sed's `s' command. For example, consider matching the pattern:
-
- b\([^q]*\)\(ing\)\?
-
- against the string
-
- beginning
-
- One possibility is that the first subexpression matches "eginning"
- and the second is skipped. Another possibility is that the first
- subexpression matches "eginn" and the second matches "ing".
-
- The rule is that, consistent with matching as many characters as
- possible, the length of lower numbered subexpressions is maximized in
- preference to maximizing the length of later subexpressions.
-
- In the case of the above example, the two possible matches are equal
- in overall length. Therefore, it comes down to maximizing the
- lower-numbered subexpression, \1. The correct answer is that \1 matches
- "eginning" and \2 is skipped.
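-
- The outcome can be observed in a short sketch. It assumes Python's
- `re' module, with the pattern rewritten for that notation. Python
- reaches the same answer here because its `*' is greedy, not because it
- implements the Posix rule, so agreement on other patterns should not
- be taken for granted.
-
```python
import re

m = re.fullmatch(r"b([^q]*)(ing)?", "beginning")

print(m.group(1))   # eginning
print(m.group(2))   # None: the second subexpression was skipped
```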
-
- File: rx.info, Node: Acknowledgements, Prev: Ambiguous Patterns, Up: Top
-
- Acknowledgements
- ================
-
- This work was created with support from "Cygnus Solutions Inc." and
- "The Free Software Foundation, Inc.". Support the GNU project. Support
- free software!
-
-
- Tag Table:
- Node: Top83
- Node: An Introduction to Regexps955
- Node: Literal Regexps2556
- Node: Character Sets3175
- Node: Subexpressions5485
- Node: Repeated Subexpressions6274
- Node: Optional Subexpressions7021
- Node: Counted Subexpressions7385
- Node: Alternative Subexpressions7998
- Node: Backreferences8438
- Node: A Summary of Regexp Syntax9898
- Node: Ambiguous Patterns11040
- Node: Acknowledgements12504
- End Tag Table
-