- <TITLE>Regular Expressions -- Python library reference</TITLE>
- <H2>4.2.1. Regular Expressions</H2>
- A regular expression (or RE) specifies a set of strings that matches
- it; the functions in this module let you check if a particular string
- matches a given regular expression (or if a given regular expression
- matches a particular string, which comes down to the same thing).
- <P>
- Regular expressions can be concatenated to form new regular
- expressions; if <I>A</I> and <I>B</I> are both regular expressions,
- then <I>AB</I> is also a regular expression. If a string <I>p</I>
- matches <I>A</I> and another string <I>q</I> matches <I>B</I>, the string <I>pq</I>
- will match <I>AB</I>. Thus, complex expressions can easily be constructed
- from simpler ones like the primitives described here. For details of
- the theory and implementation of regular expressions, consult almost
- any textbook about compiler construction.
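<P>
The concatenation rule can be checked with a short sketch. It assumes a current Python, where the <CODE>regex</CODE> module described here is not available, so it uses the standard <CODE>re</CODE> module instead; the patterns and the names <CODE>A</CODE>, <CODE>B</CODE>, <CODE>p</CODE> and <CODE>q</CODE> are illustrative only.

```python
import re

# Two simple expressions, A and B, and their concatenation AB.
A = "ab*"     # 'a' followed by zero or more 'b's
B = "c+"      # one or more 'c's
AB = A + B    # joining the pattern strings concatenates the expressions

p = "abb"     # a string matched by A
q = "cc"      # a string matched by B

assert re.fullmatch(A, p) is not None
assert re.fullmatch(B, q) is not None

# The concatenated string pq is matched by the concatenated expression AB.
assert re.fullmatch(AB, p + q) is not None
```

Joining pattern strings like this is only safe when neither piece contains an operator that binds more loosely than concatenation, such as alternation; in that case each piece would need to be wrapped in a group first.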
- <P>
- A brief explanation of the format of regular expressions follows.
- <P>
- Regular expressions can contain both special and ordinary characters.
- Ordinary characters, like '<CODE>A</CODE>', '<CODE>a</CODE>', or '<CODE>0</CODE>', are
- the simplest regular expressions; they simply match themselves. You
- can concatenate ordinary characters, so '<CODE>last</CODE>' matches the
- string 'last'. (In the rest of this section, we'll write REs in
- <CODE>this special font</CODE>, usually without quotes, and strings to be
- matched 'in single quotes'.)
- <P>
- Special characters either stand for classes of ordinary characters, or
- affect how the regular expressions around them are interpreted.
- <P>
- The special characters are:
- <UL>
- <LI><CODE>.</CODE> (Dot.) Matches any character except a newline.
- <LI><CODE>^</CODE> (Caret.) Matches the start of the string.
- <LI><CODE>$</CODE> Matches the end of the string.
- <CODE>foo</CODE> matches both 'foo' and 'foobar', while the regular
- expression '<CODE>foo$</CODE>' matches only 'foo'.
- <LI><CODE>*</CODE> Causes the resulting RE to
- match 0 or more repetitions of the preceding RE. <CODE>ab*</CODE> will
- match 'a', 'ab', or 'a' followed by any number of 'b's.
- <LI><CODE>+</CODE> Causes the
- resulting RE to match 1 or more repetitions of the preceding RE.
- <CODE>ab+</CODE> will match 'a' followed by any non-zero number of 'b's; it
- will not match just 'a'.
- <LI><CODE>?</CODE> Causes the resulting RE to
- match 0 or 1 repetitions of the preceding RE. <CODE>ab?</CODE> will
- match either 'a' or 'ab'.
- <P>
- <LI><CODE>\</CODE> Either escapes special characters (permitting you to match
- characters like '*', '?', '+', or '$'), or signals a special sequence; special
- sequences are discussed below. Remember that Python also uses the
- backslash as an escape character in string literals; if the escape
- sequence isn't recognized by Python's parser, the backslash and
- subsequent character are included in the resulting string. However,
- if Python would recognize the resulting sequence, the backslash should
- be doubled.
- <P>
- <LI><CODE>[]</CODE> Used to indicate a set of characters. Characters can
- be listed individually, or a range is indicated by giving two
- characters and separating them by a '-'. Special characters are
- not active inside sets. For example, <CODE>[akm$]</CODE>
- will match any of the characters 'a', 'k', 'm', or '$'; <CODE>[a-z]</CODE> will
- match any lowercase letter.
- <P>
- If you want to include a <CODE>]</CODE> inside a
- set, it must be the first character of the set; to include a <CODE>-</CODE>,
- place it as the first or last character.
- <P>
- Characters that are <I>not</I> in a set can be matched by complementing
- the set: include a <CODE>^</CODE> as the first character of the set. A
- <CODE>^</CODE> elsewhere in the set will simply match the '<CODE>^</CODE>' character.
- </UL>
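<P>
The examples in the list above can be exercised with the following sketch. It is written against the standard <CODE>re</CODE> module rather than the <CODE>regex</CODE> module documented here, on the assumption that the two agree for these particular special characters; the helpers <CODE>whole</CODE> and <CODE>prefix</CODE> are hypothetical names introduced for the illustration.

```python
import re

def whole(pattern, string):
    """True if `pattern` matches all of `string`."""
    return re.fullmatch(pattern, string) is not None

def prefix(pattern, string):
    """True if `pattern` matches at the start of `string`."""
    return re.match(pattern, string) is not None

# '.' matches any character except a newline.
assert whole(".", "x") and not whole(".", "\n")

# '^' and '$' anchor a match to the start and end of the string.
assert re.search("^bar", "foobar") is None
assert prefix("foo", "foo") and prefix("foo", "foobar")
assert prefix("foo$", "foo") and not prefix("foo$", "foobar")

# '*', '+' and '?' repeat the preceding RE.
assert all(whole("ab*", s) for s in ["a", "ab", "abbb"])
assert whole("ab+", "abb") and not whole("ab+", "a")
assert whole("ab?", "a") and whole("ab?", "ab") and not whole("ab?", "abb")

# Sets: individual characters, ranges, and complemented sets.
assert all(whole("[akm$]", c) for c in "akm$")
assert whole("[a-z]", "q") and not whole("[a-z]", "Q")
assert whole("[^a-z]", "Q") and not whole("[^a-z]", "q")

# ']' first, '-' first or last, and '^' anywhere but first are literal.
assert whole("[]a]", "]")
assert whole("[a-]", "-") and whole("[-a]", "-")
assert whole("[a^]", "^")
```

The sketch uses assertions rather than printed output, so running it silently is the sign that every example behaves as described.
<P>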
- The special sequences consist of '<CODE>\</CODE>' and a character
- from the list below. If the character following the backslash is not on
- the list, then the resulting RE will match that character. For example,
- <CODE>\$</CODE> matches the character '$'. Sequences whose backslash
- must be doubled in a Python string literal are shown with a doubled backslash.
- <P>
- <UL>
- <LI><CODE>\|</CODE> <CODE>A\|B</CODE>, where A and B can be arbitrary REs,
- creates a regular expression that will match either A or B. This can
- be used inside groups (see below) as well.
- <P>
- <LI><CODE>\( \)</CODE> Indicates the start and end of a group; the
- contents of a group can be matched later in the string with the
- <CODE>\[1-9]</CODE> special sequence, described next.
- <LI><CODE>\\1, ... \\7, \8, \9</CODE> Matches the contents of the group of the same
- number. For example, <CODE>\(.+\) \\1</CODE> matches 'the the' or
- '55 55', but not 'the end' (note the space after the group). This
- special sequence can only be used to match one of the first 9 groups;
- groups with higher numbers can be matched using the <CODE>\v</CODE>
- sequence. (<CODE>\8</CODE> and <CODE>\9</CODE> don't need a double backslash
- because they are not octal digits.)
- <P>
- <LI><CODE>\\b</CODE> Matches the empty string, but only at the
- beginning or end of a word. A word is defined as a sequence of
- alphanumeric characters, so the end of a word is indicated by
- whitespace or a non-alphanumeric character.
- <P>
- <LI><CODE>\B</CODE> Matches the empty string, but only when it is <I>not</I> at the
- beginning or end of a word.
- <P>
- <LI><CODE>\v</CODE> Must be followed by a two-digit decimal number, and
- matches the contents of the group of the same number. The group number
- must be between 1 and 99, inclusive.
- <P>
- <LI><CODE>\w</CODE> Matches any alphanumeric character; this is
- equivalent to the set <CODE>[a-zA-Z0-9]</CODE>.
- <P>
- <LI><CODE>\W</CODE> Matches any non-alphanumeric character; this is
- equivalent to the set <CODE>[^a-zA-Z0-9]</CODE>.
- <LI><CODE>\&lt;</CODE> Matches the empty string, but only at the beginning of a
- word. A word is defined as a sequence of alphanumeric characters, so
- the end of a word is indicated by whitespace or a non-alphanumeric
- character.
- <LI><CODE>\&gt;</CODE> Matches the empty string, but only at the end of a
- word.
- <P>
- <LI><CODE>\\\\</CODE> Matches a literal backslash.
- <P>
- <LI><CODE>\`</CODE> Like <CODE>^</CODE>, this only matches at the start of the
- string.
- <LI><CODE>\\'</CODE> Like <CODE>$</CODE>, this only matches at the end of the
- string.
- </UL>
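<P>
The special sequences above cannot be run as written on a current Python, where the <CODE>regex</CODE> module is not available. The sketch below therefore uses the standard <CODE>re</CODE> module, whose syntax differs: alternation and groups are written <CODE>|</CODE>, <CODE>(</CODE> and <CODE>)</CODE> without a backslash, and <CODE>\A</CODE> and <CODE>\Z</CODE> are used here as the nearest counterparts of <CODE>\`</CODE> and <CODE>\\'</CODE>. As far as this sketch assumes, <CODE>re</CODE> has no direct equivalent of <CODE>\&lt;</CODE>, <CODE>\&gt;</CODE> or <CODE>\v</CODE>. Treat it as a translation of the examples, not as the behavior of <CODE>regex</CODE> itself.

```python
import re

# Alternation: A\|B in the text is written A|B in `re`.
assert re.fullmatch(r"cat|dog", "dog") is not None
assert re.fullmatch(r"cat|dog", "cow") is None

# Groups and back-references: \(.+\) \\1 in the text becomes (.+) \1 here.
repeated = re.compile(r"(.+) \1")
assert repeated.fullmatch("the the") is not None
assert repeated.fullmatch("55 55") is not None
assert repeated.fullmatch("the end") is None

# \b matches at a word boundary, \B everywhere else.
assert re.search(r"\bend\b", "the end.") is not None
assert re.search(r"\bend\b", "blender") is None
assert re.search(r"\Bend\B", "blender") is not None

# \w and \W; note that `re` also counts '_' as a word character,
# unlike the [a-zA-Z0-9] definition given in the text.
assert re.fullmatch(r"\w", "a", re.ASCII) is not None
assert re.fullmatch(r"\W", " ", re.ASCII) is not None
assert re.fullmatch(r"\w", "_", re.ASCII) is not None

# A literal backslash: four backslashes in a plain literal, two in a raw one.
assert "\\\\" == r"\\"
assert re.fullmatch(r"\\", "\\") is not None

# \A and \Z match only at the very start and very end of the string.
assert re.search(r"\Afoo", "foo\nbar") is not None
assert re.search(r"bar\Z", "foo\nbar") is not None
assert re.search(r"foo\Z", "foo\nbar") is None
```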
-