- PERLRE(1) UNIX System V (Release 0.0 Patchlevel 00) PERLRE(1)
-
-
-
- NAME
- perlre - Perl regular expressions
-
- DESCRIPTION
- For a description of how to use regular expressions in
- matching operations, see m// and s/// in the perlop manpage.
- The matching operations can have various modifiers, some of
- which relate to the interpretation of the regular expression
- inside. These are:
-
- i Do case-insensitive pattern matching.
- m Treat string as multiple lines.
- s Treat string as single line.
- x Use extended regular expressions.
-
- These are usually written as "the /x modifier", even though
- the delimiter in question might not actually be a slash. In
- fact, any of these modifiers may also be embedded within the
- regular expression itself using the new (?...) construct.
- See below.
-
- The /x modifier itself needs a little more explanation. It
- tells the regular expression parser to ignore whitespace
- that is not backslashed or within a character class. You
- can use this to break up your regular expression into
- (slightly) more readable parts. Together with the
- capability of embedding comments described later, this goes
- a long way towards making Perl 5 a readable language. See
- the C comment deletion code in the perlop manpage.
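-
- As a minimal sketch of the /x modifier combined with
- embedded (?#text) comments: the sample string and variable
- names below are assumptions made up for illustration. Note
- that the one significant space in the pattern has to be
- backslashed, since /x ignores bare whitespace.

```perl
use strict;
use warnings;

# Hypothetical sample input, invented for this illustration.
my $line = "Time: 12:34:56";

# Under /x, unescaped whitespace in the pattern is ignored, so the
# literal space after "Time:" is written as a backslashed space.
# Each (?#...) group is an embedded comment and matches nothing.
my ($h, $m, $s) = $line =~ /
    Time: \      (?# literal label followed by one space)
    (\d\d) :     (?# hours)
    (\d\d) :     (?# minutes)
    (\d\d)       (?# seconds)
/x;

print "$h $m $s\n";
```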
-
- Regular Expressions
-
- The patterns used in pattern matching are regular
- expressions such as those supplied in the Version 8 regexp
- routines. (In fact, the routines are derived (distantly)
- from Henry Spencer's freely redistributable reimplementation
- of the V8 routines.) See the section on Version 8 Regular
- Expressions for details.
-
- In particular the following metacharacters have their
- standard egrep-ish meanings:
-
- \ Quote the next metacharacter
- ^ Match the beginning of the line
- . Match any character (except newline)
- $ Match the end of the line
- | Alternation
- () Grouping
- [] Character class
-
- By default, the "^" character is guaranteed to match only at
- the beginning of the string, the "$" character only at the
- end (or before the newline at the end) and Perl does certain
- optimizations with the assumption that the string contains
- only one line. Embedded newlines will not be matched by "^"
- or "$". You may, however, wish to treat a string as a
- multi-line buffer, such that the "^" will match after any
- newline within the string, and "$" will match before any
- newline. At the cost of a little more overhead, you can do
- this by using the /m modifier on the pattern match operator.
- (Older programs did this by setting $*, but this practice is
- deprecated in Perl 5.)
-
- To facilitate multi-line substitutions, the "." character
- never matches a newline unless you use the /s modifier,
- which tells Perl to pretend the string is a single line--
- even if it isn't. The /s modifier also overrides the
- setting of $*, in case you have some (badly behaved) older
- code that sets it in another module.
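-
- The effect of /m on "^" and of /s on "." can be sketched as
- follows. The two-line sample string is an assumption chosen
- only to make the difference visible.

```perl
use strict;
use warnings;

# Hypothetical two-line buffer.
my $text = "first line\nsecond line\n";

# Without /m, "^" matches only at the start of the string.
my @plain = $text =~ /^(\w+)/g;

# With /m, "^" also matches after each embedded newline.
my @multi = $text =~ /^(\w+)/mg;

# Without /s, "." stops at a newline; with /s it crosses it.
my ($no_s)   = $text =~ /(first.*)/;
my ($with_s) = $text =~ /(first.*)/s;

print scalar(@plain), " match without /m, ", scalar(@multi), " with /m\n";
```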
-
- The following standard quantifiers are recognized:
-
- * Match 0 or more times
- + Match 1 or more times
- ? Match 1 or 0 times
- {n} Match exactly n times
- {n,} Match at least n times
- {n,m} Match at least n but not more than m times
-
- (If a curly bracket occurs in any other context, it is
- treated as a regular character.) The "*" quantifier is
- equivalent to {0,}, the "+" quantifier to {1,}, and the "?"
- quantifier to {0,1}. There is no limit to the size of n or m,
- but large numbers will chew up more memory.
-
- By default, a quantified subpattern is "greedy": it will
- match as many times as possible (given a particular starting
- location) without causing the rest of the pattern to fail.
- If you want a quantifier to match the minimum number of
- times possible instead, follow it with a "?". Note that the
- meanings don't change, just the "gravity":
-
- *? Match 0 or more times
- +? Match 1 or more times
- ?? Match 0 or 1 time
- {n}? Match exactly n times
- {n,}? Match at least n times
- {n,m}? Match at least n but not more than m times
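-
- A small sketch of the difference in "gravity", assuming a
- made-up string with two tag-like pieces. The greedy form
- runs to the last ">", the minimal form stops at the first.

```perl
use strict;
use warnings;

# Hypothetical input with two angle-bracketed runs.
my $str = "<b>bold</b> and <i>italic</i>";

my ($greedy)  = $str =~ /(<.+>)/;    # as much as possible
my ($minimal) = $str =~ /(<.+?>)/;   # as little as possible

# Bounded quantifiers follow the same rule.
my ($most)  = "aaaaaa" =~ /(a{2,4})/;
my ($least) = "aaaaaa" =~ /(a{2,4}?)/;

print "$greedy\n$minimal\n$most\n$least\n";
```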
-
- Since patterns are processed as double quoted strings, the
- following also work:
-
- \t tab
- \n newline
- \r return
- \f form feed
- \v vertical tab, whatever that is
- \a alarm (bell)
- \e escape
- \033 octal char
- \x1b hex char
- \c[ control char
- \l lowercase next char
- \u uppercase next char
- \L lowercase till \E
- \U uppercase till \E
- \E end case modification
- \Q quote regexp metacharacters till \E
-
- In addition, Perl defines the following:
-
- \w Match a "word" character (alphanumeric plus "_")
- \W Match a non-word character
- \s Match a whitespace character
- \S Match a non-whitespace character
- \d Match a digit character
- \D Match a non-digit character
-
- Note that \w matches a single alphanumeric character, not a
- whole word. To match a word you'd need to say \w+. You may
- use \w, \W, \s, \S, \d and \D within character classes
- (though not as either end of a range).
-
- Perl defines the following zero-width assertions:
-
- \b Match a word boundary
- \B Match a non-(word boundary)
- \A Match only at beginning of string
- \Z Match only at end of string
- \G Match only where previous m//g left off
-
- A word boundary (\b) is defined as a spot between two
- characters that has a \w on one side of it and a \W on
- the other side of it (in either order), counting the
- imaginary characters off the beginning and end of the string
- as matching a \W. (Within character classes \b represents
- backspace rather than a word boundary.) The \A and \Z are
- just like "^" and "$" except that they won't match multiple
- times when the /m modifier is used, while "^" and "$" will
- match at every internal line boundary.
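-
- The assertions above might be exercised like this. The
- sample strings are assumptions for illustration; the counts
- use the "= () =" idiom to count matches in list context.

```perl
use strict;
use warnings;

# Hypothetical sample text.
my $s = "cat concat catalog";

# \b keeps "cat" from matching inside "concat" or "catalog".
my @whole = $s =~ /\bcat\b/g;

# \B requires that there is no word boundary before "cat".
my @inner = $s =~ /\Bcat/g;

# With /m, "^" matches at every line start but \A only once.
my $buf = "one\ntwo\nthree";
my $caret   = () = $buf =~ /^\w+/mg;
my $slash_a = () = $buf =~ /\A\w+/mg;

print scalar(@whole), " ", scalar(@inner), " $caret $slash_a\n";
```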
-
- When the bracketing construct ( ... ) is used, \<digit>
- matches the digit'th substring. (Outside of the pattern,
- always use "$" instead of "\" in front of the digit. The
- scope of $<digit> (and $`, $&, and $') extends to the end of
- the enclosing BLOCK or eval string, or to the next pattern
- match with subexpressions. If you want to use parentheses to
- delimit a subpattern (e.g. a set of alternatives) without
- saving it as a subpattern, write (?: in place of the (. The
- \<digit> notation sometimes works outside the current
- pattern, but should not be relied upon.) You may have as
- many parentheses as you wish. If you have more than 9
- substrings, the variables $10, $11, ... refer to the
- corresponding substring. Within the pattern, \10, \11, etc.
- refer back to substrings if there have been at least that
- many left parens before the backreference. Otherwise (for
- backward compatibility) \10 is the same as \010, a backspace,
- and \11 the same as \011, a tab. And so on. (\1 through \9
- are always backreferences.)
-
- $+ returns whatever the last bracket match matched. $&
- returns the entire matched string. ($0 used to return the
- same thing, but not any more.) $` returns everything before
- the matched string. $' returns everything after the matched
- string. Examples:
-
- s/^([^ ]*) *([^ ]*)/$2 $1/; # swap first two words
-
- if (/Time: (..):(..):(..)/) {
-     $hours = $1;
-     $minutes = $2;
-     $seconds = $3;
- }
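-
- The other match variables can be inspected the same way.
- This is only a sketch; the input string and the names $pre,
- $match, $post and $last are assumptions for illustration.

```perl
use strict;
use warnings;

# Hypothetical input.
my $text = "hello world of perl";
my ($pre, $match, $post, $last);

if ($text =~ /(w\w+) (of)/) {
    $pre   = $`;   # everything before the matched string
    $match = $&;   # the entire matched string
    $post  = $';   # everything after the matched string
    $last  = $+;   # whatever the last bracket matched
}

print "[$pre] [$match] [$post] [$last]\n";
```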
-
- You will note that all backslashed metacharacters in Perl
- are alphanumeric, such as \b, \w, \n. Unlike some other
- regular expression languages, there are no backslashed
- symbols that aren't alphanumeric. So anything that looks
- like \\, \(, \), \<, \>, \{, or \} is always interpreted as
- a literal character, not a metacharacter. This makes it
- simple to quote a string that you want to use for a pattern
- but that you are afraid might contain metacharacters.
- Simply quote all the non-alphanumeric characters:
-
- $pattern =~ s/(\W)/\\$1/g;
-
- You can also use the built-in quotemeta() function to do
- this. An even easier way to quote metacharacters right in
- the match operator is to say
-
- /$unquoted\Q$quoted\E$unquoted/
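-
- A sketch of why the quoting matters, assuming a
- hypothetical string $literal that happens to contain "."
- and "*". Unquoted, it matches text it was never meant to
- match, and fails to match itself.

```perl
use strict;
use warnings;

# Hypothetical text that should be matched literally.
my $literal = "a.b*c";
my $quoted  = quotemeta($literal);   # backslashes added before "." and "*"

# Unquoted, "." and "*" act as metacharacters.
my $loose = ("axbbbc" =~ /^$literal$/) ? 1 : 0;
my $self  = ("a.b*c"  =~ /^$literal$/) ? 1 : 0;

# Quoted with \Q...\E, only the literal text matches.
my $strict = ("axbbbc" =~ /^\Q$literal\E$/) ? 1 : 0;
my $exact  = ("a.b*c"  =~ /^\Q$literal\E$/) ? 1 : 0;

print "$quoted: loose=$loose self=$self strict=$strict exact=$exact\n";
```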
-
- Perl 5 defines a consistent extension syntax for regular
- expressions. The syntax is a pair of parens with a question
- mark as the first thing within the parens (this was a syntax
- error in Perl 4). The character after the question mark
- gives the function of the extension. Several extensions are
- already supported:
-
- (?#text) A comment. The text is ignored.
-
- (?:regexp)
- This groups things like "()" but doesn't make
- backreferences like "()" does. So
-
- split(/\b(?:a|b|c)\b/)
-
- is like
-
- split(/\b(a|b|c)\b/)
-
- but doesn't spit out extra fields.
-
- (?=regexp)
- A zero-width positive lookahead assertion. For
- example, /\w+(?=\t)/ matches a word followed by a
- tab, without including the tab in $&.
-
- (?!regexp)
- A zero-width negative lookahead assertion. For
- example /foo(?!bar)/ matches any occurrence of
- "foo" that isn't followed by "bar". Note however
- that lookahead and lookbehind are NOT the same
- thing. You cannot use this for lookbehind:
- /(?!foo)bar/ will not find an occurrence of "bar"
- that is preceded by something which is not "foo".
- That's because the (?!foo) is just saying that the
- next thing cannot be "foo"--and it's not, it's a
- "bar", so "foobar" will match. You would have to
- do something like /(?!foo)...bar/ for that. We
- say "like" because there's the case of your "bar"
- not having three characters before it. You could
- cover that this way: /(?:(?!foo)...|^.{0,2})bar/.
- Sometimes it's still easier just to say:
-
- if (/bar/ && $` !~ /foo$/)
-
-
- (?imsx) One or more embedded pattern-match modifiers.
- This is particularly useful for patterns that are
- specified in a table somewhere, some of which want
- to be case sensitive, and some of which don't.
- The case insensitive ones merely need to include
- (?i) at the front of the pattern. For example:
-
- $pattern = "foobar";
- if ( /$pattern/i )
-
- # more flexible:
-
- $pattern = "(?i)foobar";
- if ( /$pattern/ )
-
-
- The specific choice of question mark for this and the new
- minimal matching construct was because 1) question mark is
- pretty rare in older regular expressions, and 2) whenever
- you see one, you should stop and "question" exactly what is
- going on. That's psychology...
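-
- The extensions described in this section can be tried out
- with a sketch along these lines. All sample strings and the
- @table list are assumptions for illustration. The "bar not
- preceded by foo" check tests the text before the match
- rather than relying on lookbehind.

```perl
use strict;
use warnings;

# (?:...) groups without capturing, so split returns no separator fields.
my @plain = split(/\b(?:a|b|c)\b/, "x a y b z");
my @extra = split(/\b(a|b|c)\b/,   "x a y b z");

# (?=...) looks ahead without consuming: the tab stays out of $&.
my $word = ("name\tvalue" =~ /\w+(?=\t)/) ? $& : undef;

# (?!...) rejects a "foo" that is followed by "bar".
my @foo = grep { /foo(?!bar)/ } qw(foobar foobaz food);

# "bar" not preceded by "foo", checked through the pre-match text.
my @bar = grep { /bar/ && $` !~ /foo$/ } qw(foobar rebar bar);

# (?i) embedded in a pattern taken from a (hypothetical) table.
my @table = ("(?i)foobar", "FooBar");
my @hits  = grep { "FOOBAR" =~ /$_/ } @table;

print scalar(@plain), " ", scalar(@extra), " $word @foo @bar @hits\n";
```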
-
- Version 8 Regular Expressions
-
- In case you're not familiar with the "regular" Version 8
- regexp routines, here are the pattern-matching rules not
- described above.
-
- Any single character matches itself, unless it is a
- metacharacter with a special meaning described here or
- above. You can cause characters which normally function as
- metacharacters to be interpreted literally by prefixing them
- with a "\" (e.g. "\." matches a ".", not any character; "\\"
- matches a "\"). A series of characters matches that series
- of characters in the target string, so the pattern blurfl
- would match "blurfl" in the target string.
-
- You can specify a character class, by enclosing a list of
- characters in [], which will match any one of the characters
- in the list. If the first character after the "[" is "^",
- the class matches any character not in the list. Within a
- list, the "-" character is used to specify a range, so that
- a-z represents all the characters between "a" and "z",
- inclusive.
-
- Characters may be specified using a metacharacter syntax
- much like that used in C: "\n" matches a newline, "\t" a
- tab, "\r" a carriage return, "\f" a form feed, etc. More
- generally, \nnn, where nnn is a string of octal digits,
- matches the character whose ASCII value is nnn. Similarly,
- \xnn, where nn are hexadecimal digits, matches the character
- whose ASCII value is nn. The expression \cx matches the
- ASCII character control-x. Finally, the "." metacharacter
- matches any character except "\n" (unless you use /s).
-
- You can specify a series of alternatives for a pattern using
- "|" to separate them, so that fee|fie|foe will match any of
- "fee", "fie", or "foe" in the target string (as would
- f(e|i|o)e). Note that the first alternative includes
- everything from the last pattern delimiter ("(", "[", or the
- beginning of the pattern) up to the first "|", and the last
- alternative contains everything from the last "|" to the
- next pattern delimiter. For this reason, it's common
- practice to include alternatives in parentheses, to minimize
- confusion about where they start and end. Note also that
- the pattern (fee|fie|foe) differs from the pattern
- [fee|fie|foe] in that the former matches "fee", "fie", or
- "foe" in the target string, while the latter is a character
- class that matches any single one of the characters "f",
- "e", "i", "o", or "|" (i.e. the class [feio|]).
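-
- A quick way to see the difference between the grouped
- alternation and the look-alike character class, using
- made-up word lists. Inside a class the "|" is just another
- literal character, so it shows up among the matches.

```perl
use strict;
use warnings;

# Hypothetical word list.
my @words = qw(fee fie foe fum);

# Parenthesized alternation matches the whole words.
my @alt = grep { /^(fee|fie|foe)$/ } @words;

# The bracketed form is a class and matches exactly one character.
my @class = grep { /^[fee|fie|foe]$/ } @words;
my @chars = grep { /^[fee|fie|foe]$/ } qw(f e i o | u);

print "alt: @alt\nclass: @class\nchars: @chars\n";
```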
-
- Within a pattern, you may designate subpatterns for later
- reference by enclosing them in parentheses, and you may
- refer back to the nth subpattern later in the pattern using
- the metacharacter \<n>. Subpatterns are numbered based on the
- left to right order of their opening parenthesis. Note that
- a backreference matches whatever actually matched the
- subpattern in the string being examined, not the rules for
- that subpattern. Therefore, (0|0x)\d*\s\1\d* will match
- "0x1234 0x4321", but not "0x1234 01234", since subpattern 1
- actually matched "0x", even though the rule 0|0x could
- potentially match the leading 0 in the second number.
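-
- The point about backreferences matching the actual text can
- be checked with a sketch like the one below. It assumes the
- first subpattern is written as the alternation (0|0x); the
- variable names are made up for illustration.

```perl
use strict;
use warnings;

# \1 must repeat whatever text group 1 actually matched.
my $re = '(0|0x)\d*\s\1\d*';

# Both numbers start with "0x", so \1 can repeat "0x".
my $same = ("0x1234 0x4321" =~ /$re/) ? 1 : 0;

# Group 1 has to match "0x" to get past "x1234", and the second
# number does not start with "0x", so the match fails.
my $mixed = ("0x1234 01234" =~ /$re/) ? 1 : 0;

print "same=$same mixed=$mixed\n";
```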
-
- Page 7 (printed 6/30/95)