- Newsgroups: comp.lang.python
- From: borbor@divsun.unige.ch (Boris Borcic)
- Subject: Re: Regular expressions, pattern matching
- Date: Mon, 27 Nov 1995 08:50:58 GMT
-
- In article <49aml2$n82@mrnews.mro.dec.com>, Peter.Mayne@cao.mts.dec.com
- (Peter Mayne) wrote:
-
- >
- > Please don't point me to any Emacs doc. I don't use Emacs, and I don't
- > want to use Emacs, I just want to know how regular expressions and
- > pattern matching work in Python.
-
- Then you have a problem. Note that you don't need to use emacs in order
- to use a part of its documentation. If you insist, you might try to use
- the MOO doc on regexps that uses the same package (it's really the
- same doc with the backslash escape changed to percent).
-
- Just in case you change your mind, I append the essential part
- of the emacs doc :-)
-
- Regards,
-
- Boris Borcic
- ==
-
- Syntax of Regular Expressions
- =============================
-
- Regular expressions have a syntax in which a few characters are
- special constructs and the rest are "ordinary". An ordinary character
- is a simple regular expression which matches that character and nothing
- else. The special characters are `$', `^', `.', `*', `+', `?', `[',
- `]' and `\'; no new special characters will be defined. Any other
- character appearing in a regular expression is ordinary, unless a `\'
- precedes it.
-
- For example, `f' is not a special character, so it is ordinary, and
- therefore `f' is a regular expression that matches the string `f' and
- no other string. (It does not match the string `ff'.) Likewise, `o'
- is a regular expression that matches only `o'.
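-
- A minimal sketch of this in Python, assuming the modern `re' module
- (an assumption: it is not the Emacs-style package this post describes,
- though ordinary characters behave the same way). `re.fullmatch' is used
- because it succeeds only when the whole string matches:
-
```python
import re

# An ordinary character matches itself and nothing else.
matches_f = re.fullmatch("f", "f") is not None    # True
matches_ff = re.fullmatch("f", "ff") is not None  # False: "ff" is two characters
matches_o = re.fullmatch("o", "o") is not None    # True

print(matches_f, matches_ff, matches_o)
```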
-
- Any two regular expressions A and B can be concatenated. The result
- is a regular expression which matches a string if A matches some amount
- of the beginning of that string and B matches the rest of the string.
-
- As a simple example, you can concatenate the regular expressions `f'
- and `o' to get the regular expression `fo', which matches only the
- string `fo'. To do something nontrivial, you need to use one of the
- following special characters:
-
- `. (Period)'
- is a special character that matches any single character except a
- newline. Using concatenation, you can make regular expressions
- like `a.b', which matches any three-character string which begins
- with `a' and ends with `b'.
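-
- As a rough illustration, assuming Python's modern `re' module (not the
- package discussed in this thread), the sample strings below are made
- up to show which ones `a.b' accepts:
-
```python
import re

samples = ["aab", "a-b", "ab", "a\nb", "axxb"]

# Keep only the samples that `a.b` matches in full.
three_chars = [s for s in samples if re.fullmatch("a.b", s)]

print(three_chars)  # ['aab', 'a-b']
```
-
- `ab' and `axxb' fail on length, and `a\nb' fails because the period
- does not match a newline by default.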
-
- `*'
- is not a construct by itself; it is a suffix, which means the
- preceding regular expression is to be repeated as many times as
- possible. In `fo*', the `*' applies to the `o', so `fo*' matches
- one `f' followed by any number of `o's. The case of zero `o's is
- allowed: `fo*' does match `f'.
-
- `*' always applies to the smallest possible preceding expression.
- Thus, `fo*' has a repeating `o', not a repeating `fo'.
-
- The matcher processes a `*' construct by immediately matching as
- many repetitions as it can find. Then it continues with the rest
- of the pattern. If that fails, backtracking occurs, discarding
- some of the matches of the `*'-modified construct in case that
- makes it possible to match the rest of the pattern. For example,
- matching `ca*ar' against the string `caaar', the `a*' first tries
- to match all three `a's; but the rest of the pattern is `ar' and
- there is only `r' left to match, so this try fails. The next
- alternative is for `a*' to match only two `a's. With this choice,
- the rest of the regexp matches successfully.
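-
- The backtracking described above can be observed with a small sketch.
- It assumes Python's modern `re' module, where grouping parentheses are
- written without a backslash; the group is added here only to show how
- many `a's the starred part keeps in the end:
-
```python
import re

# `fo*` is one `f` followed by zero or more `o`s; the `*` binds to `o` only.
star_cases = {s: re.fullmatch("fo*", s) is not None
              for s in ["f", "fo", "fooo", "fofo"]}

# `a*` first grabs all three `a`s, then gives one back so `ar` can match.
m = re.fullmatch("c(a*)ar", "caaar")
star_part = m.group(1)

print(star_cases)  # {'f': True, 'fo': True, 'fooo': True, 'fofo': False}
print(star_part)   # aa
```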
-
- `+'
- is a suffix character similar to `*' except that it requires that
- the preceding expression be matched at least once. For example,
- `ca+r' will match the strings `car' and `caaaar' but not the
- string `cr', whereas `ca*r' would match all three strings.
-
- `?'
- is a suffix character similar to `*' except that it can match the
- preceding expression either once or not at all. For example,
- `ca?r' will match `car' or `cr'; nothing else.
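-
- The three suffixes can be compared side by side. This is a sketch
- assuming Python's modern `re' module; the helper `which_match' is a
- hypothetical name introduced for the illustration:
-
```python
import re

def which_match(pattern, candidates):
    """Return the candidates that the pattern matches in full."""
    return [s for s in candidates if re.fullmatch(pattern, s)]

candidates = ["cr", "car", "caaaar"]

star = which_match("ca*r", candidates)   # zero or more `a`s
plus = which_match("ca+r", candidates)   # at least one `a`
opt = which_match("ca?r", candidates)    # zero or one `a`

print(star)  # ['cr', 'car', 'caaaar']
print(plus)  # ['car', 'caaaar']
print(opt)   # ['cr', 'car']
```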
-
- `[ ... ]'
- `[' begins a "character set", which is terminated by a `]'. In
- the simplest case, the characters between the two form the set.
- Thus, `[ad]' matches either one `a' or one `d', and `[ad]*'
- matches any string composed of just `a's and `d's (including the
- empty string), from which it follows that `c[ad]*r' matches `cr',
- `car', `cdr', `caddaar', etc.
-
- You can include character ranges in a character set by writing two
- characters with a `-' between them. Thus, `[a-z]' matches any
- lower-case letter. Ranges may be intermixed freely with
- individual characters, as in `[a-z$%.]', which matches any lower-
- case letter or `$', `%', or period.
-
- Note that inside a character set the usual special characters are
- not special any more. A completely different set of special
- characters exists inside character sets: `]', `-', and `^'.
-
- To include a `]' in a character set, you must make it the first
- character. For example, `[]a]' matches `]' or `a'. To include a
- `-', write `---', which is a range containing only `-'. To
- include `^', make it other than the first character in the set.
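-
- A short sketch of these rules, assuming Python's modern `re' module
- rather than the package described here. The helper `accepted' is a
- hypothetical name. For a literal `-' the sketch places it last in the
- set, which is the form the `re' documentation describes, instead of
- the `---' range:
-
```python
import re

def accepted(pattern, candidates):
    """Return the candidates that the pattern matches in full."""
    return [s for s in candidates if re.fullmatch(pattern, s)]

cadr_words = accepted("c[ad]*r", ["cr", "car", "cdr", "caddaar", "cbr"])
range_chars = accepted("[a-z$%.]", ["q", "$", "%", ".", "Q", "7"])
bracket_first = accepted("[]a]", ["]", "a", "["])   # `]` first is literal
caret_later = accepted("[a^]", ["^", "a", "b"])     # `^` not first is literal
hyphen_last = accepted("[a-]", ["-", "a", "b"])     # `-` last is literal

print(cadr_words)     # ['cr', 'car', 'cdr', 'caddaar']
print(range_chars)    # ['q', '$', '%', '.']
print(bracket_first)  # [']', 'a']
print(caret_later)    # ['^', 'a']
print(hyphen_last)    # ['-', 'a']
```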
-
- `[^ ... ]'
- `[^' begins a "complement character set", which matches any
- character except the ones specified. Thus, `[^a-z0-9A-Z]' matches
- all characters except letters and digits.
-
- `^' is not special in a character set unless it is the first
- character. The character following the `^' is treated as if it
- were first (`-' and `]' are not special there).
-
- Note that a complement character set can match a newline, unless
- newline is mentioned as one of the characters not to match.
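-
- To illustrate the point about newlines, here is a sketch assuming
- Python's modern `re' module; the sample text is invented:
-
```python
import re

text = "a-b c\n1"

# Everything that is not a letter or digit, newline included.
non_alnum = re.findall("[^a-z0-9A-Z]", text)

# Listing the newline in the set keeps it from matching.
non_alnum_no_nl = re.findall("[^a-z0-9A-Z\n]", text)

print(non_alnum)        # ['-', ' ', '\n']
print(non_alnum_no_nl)  # ['-', ' ']
```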
-
- `^'
- is a special character that matches the empty string, but only if
- at the beginning of a line in the text being matched. Otherwise,
- it fails to match anything. Thus, `^foo' matches a `foo' that
- occurs at the beginning of a line.
-
- `$'
- is similar to `^' but matches only at the end of a line. Thus,
- `xx*$' matches a string of one `x' or more at the end of a line.
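-
- A sketch of both anchors, assuming Python's modern `re' module. One
- difference worth flagging: in `re', `^' and `$' anchor at line
- boundaries only when the `re.MULTILINE' flag is given; without it they
- anchor at the ends of the whole string. The sample text is invented:
-
```python
import re

text = "foo one\nbar foo\nfoo two"

# With MULTILINE, `^foo` finds `foo` at the start of each line.
line_starts = re.findall("^foo", text, re.MULTILINE)

# Without the flag, only the very start of the string qualifies.
string_start = re.findall("^foo", text)

# `xx*$` finds a run of one or more `x` at the end of a line.
x_runs = re.findall("xx*$", "axx\nbx y\nxxx", re.MULTILINE)

print(line_starts)   # ['foo', 'foo']
print(string_start)  # ['foo']
print(x_runs)        # ['xx', 'xxx']
```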
-
- `\'
- does two things: it quotes the special characters (including `\'),
- and it introduces additional special constructs.
-
- Because `\' quotes special characters, `\$' is a regular
- expression that matches only `$', and `\[' is a regular expression
- that matches only `[', and so on.
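-
- Quoting can be sketched as follows, assuming Python's modern `re'
- module. Raw strings (r"...") keep Python's own string escapes out of
- the way, and `re.escape' quotes the special characters of a literal
- string for you:
-
```python
import re

dollar = re.fullmatch(r"\$", "$") is not None      # quoted `$` is literal
bracket = re.fullmatch(r"\[", "[") is not None     # quoted `[` is literal
backslash = re.fullmatch(r"\\", "\\") is not None  # quoted `\` is literal

# Unquoted, the period matches any character; quoted, only a period.
loose = re.fullmatch("a.b", "axb") is not None
strict = re.fullmatch(re.escape("a.b"), "axb") is not None

print(dollar, bracket, backslash)  # True True True
print(loose, strict)               # True False
```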
-
- Note: for historical compatibility, special characters are treated as
- ordinary ones if they are in contexts where their special meanings make
- no sense. For example, `*foo' treats `*' as ordinary since there is no
- preceding expression on which the `*' can act. It is poor practice to
- depend on this behavior; better to quote the special character anyway,
- regardless of where it appears.
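-
- One reason not to rely on this: other regexp implementations may not
- offer the same leniency. As a sketch, Python's modern `re' module
- (which is not the package described here) rejects a leading `*'
- outright, while the quoted form works:
-
```python
import re

# An unquoted leading `*` has nothing to repeat, so `re` refuses it.
try:
    re.compile("*foo")
    leading_star_ok = True
except re.error:
    leading_star_ok = False

# Quoting the `*` is accepted and matches a literal asterisk.
quoted_ok = re.fullmatch(r"\*foo", "*foo") is not None

print(leading_star_ok, quoted_ok)  # False True
```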
-
- Usually, `\' followed by any character matches only that character.
- However, there are several exceptions: characters which, when preceded
- by `\', are special constructs. Such characters are always ordinary
- when encountered on their own. Here is a table of `\' constructs.
-
- `\|'
- specifies an alternative. Two regular expressions A and B with
- `\|' in between form an expression that matches anything A or B
- matches.
-
- Thus, `foo\|bar' matches either `foo' or `bar' but no other string.
-
- `\|' applies to the largest possible surrounding expressions.
- Only a surrounding `\( ... \)' grouping can limit the grouping
- power of `\|'.
-
- Full backtracking capability exists to handle multiple uses of
- `\|'.
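-
- A sketch of alternation, assuming Python's modern `re' module. Note
- the reversed convention there: a bare `|' is the alternation operator
- and `\|' matches a literal vertical bar, the opposite of the syntax
- described in this post:
-
```python
import re

# `foo|bar` in `re` corresponds to `foo\|bar` in the Emacs-style syntax.
alt_cases = {s: re.fullmatch("foo|bar", s) is not None
             for s in ["foo", "bar", "fobar", "foobar"]}

# The operator spans the whole of `foo` and `bar`, so `fobar` is rejected;
# it does not mean "fo, then o or b, then ar".
literal_bar = re.fullmatch(r"foo\|bar", "foo|bar") is not None

print(alt_cases)    # {'foo': True, 'bar': True, 'fobar': False, 'foobar': False}
print(literal_bar)  # True
```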
-
- `\( ... \)'
- is a grouping construct that serves three purposes:
-
- 1. To enclose a set of `\|' alternatives for other operations.
- Thus, `\(foo\|bar\)x' matches either `foox' or `barx'.
-
- 2. To enclose a complicated expression for the postfix `*' to
- operate on. Thus, `ba\(na\)*' matches `bananana', etc., with
- any (zero or more) number of `na' strings.
-
- 3. To mark a matched substring for future reference.
-
-
- This last application is not a consequence of the idea of a
- parenthetical grouping; it is a separate feature which happens to
- be assigned as a second meaning to the same `\( ... \)' construct
- because in practice there is no conflict between the two meanings.
- Here is an explanation:
-
- `\DIGIT'
- After the end of a `\( ... \)' construct, the matcher remembers the
- beginning and end of the text matched by that construct. Then,
- later on in the regular expression, you can use `\' followed by
- DIGIT to mean "match the same text that was matched by the DIGIT'th
- `\( ... \)' construct."
-
- The strings matching the first nine `\( ... \)' constructs
- appearing in a regular expression are assigned numbers 1 through 9
- in order that the open-parentheses appear in the regular
- expression. `\1' through `\9' may be used to refer to the text
- matched by the corresponding `\( ... \)' construct.
-
- For example, `\(.*\)\1' matches any newline-free string that is
- composed of two identical halves. The `\(.*\)' matches the first
- half, which may be anything, but the `\1' that follows must match
- the same exact text.
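-
- The three uses of grouping can be sketched together, assuming Python's
- modern `re' module. There the parentheses are written without a
- backslash, while the back-reference keeps the `\1' form:
-
```python
import re

# 1. Limit the scope of alternation: `\(foo\|bar\)x` in the syntax above.
alt_group = [s for s in ["foox", "barx", "foo", "bar"]
             if re.fullmatch("(foo|bar)x", s)]

# 2. Give `*` a multi-character expression to repeat.
repeated = [s for s in ["ba", "bana", "bananana", "banan"]
            if re.fullmatch("ba(na)*", s)]

# 3. Remember the matched text and require it again with `\1`.
m = re.fullmatch(r"(.*)\1", "abcabc")
first_half = m.group(1)   # the text remembered for group 1
half_span = m.span(1)     # its beginning and end in the string
mismatch = re.fullmatch(r"(.*)\1", "abcabd")

print(alt_group)              # ['foox', 'barx']
print(repeated)               # ['ba', 'bana', 'bananana']
print(first_half, half_span)  # abc (0, 3)
print(mismatch)               # None
```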
-
-