Xnews Regex

       pcre - Perl-compatible regular expressions.


Luu Tran's notes to Xnews users

       First, be sure to read the section on regular expressions
       in the readme.txt. Beginners should look at the regular 
       expression tutorial here.
       
       You can interactively test regex  matching by selecting 
       View | Test Regex.

       Xnews uses Philip Hazel's PCRE package. This document is 
       a condensed version of the pcre man page.  I have removed
       the sections on programming interface and differences
       between Perl regex and PCRE.

       At this point, the only option switch I allow is case
       insensitivity.  Therefore, when you see mention of 
       PCRE_xxxxxxxx other than PCRE_CASELESS, you should ignore
       the discussion.  In addition, since you will be matching
       a regex pattern against a single line string (e.g., From 
       header), you should also ignore the differentiation 
       between single line and multi-line matching.


DESCRIPTION

       The PCRE library is a set of functions that implement reg-
       ular expression pattern matching using the same syntax and
       semantics as Perl 5, with  just  a  few  differences  (see
       below).  The  current  implementation  corresponds to Perl
       5.005.

       
       The  syntax  and semantics of the regular expressions sup-
       ported by PCRE are described  below.  Regular  expressions
       are also described in the Perl documentation and in a num-
       ber of other books, some of which have  copious  examples.
       Jeffrey  Friedl's  "Mastering  Regular  Expressions", pub-
       lished by O'Reilly (ISBN 1-56592-257-3),  covers  them  in
       great  detail.  The description here is intended as refer-
       ence documentation.

       A regular expression is a pattern that is matched  against
       a subject string from left to right. Most characters stand
       for themselves in a pattern, and match  the  corresponding
       characters  in the subject. As a trivial example, the pat-
       tern

         The quick brown fox

       matches a portion of a subject string that is identical to
       itself.  The  power  of regular expressions comes from the
       ability to include alternatives  and  repetitions  in  the
       pattern.  These  are  encoded in the pattern by the use of
       meta-characters, which do not  stand  for  themselves  but
       instead are interpreted in some special way.

       There  are  two  different  sets of meta-characters: those
       that are recognized anywhere in the pattern except  within
       square  brackets,  and those that are recognized in square
       brackets. Outside square brackets, the meta-characters are
       as follows:

         \      general escape character with several uses
         ^      assert start of subject (or line, in multiline mode)
         $      assert end of subject (or line, in multiline mode)
         .      match any character except newline (by default)
         [      start character class definition
         |      start of alternative branch
         (      start subpattern
         )      end subpattern
         ?      extends the meaning of (
                also 0 or 1 quantifier
                also quantifier minimizer
         *      0 or more quantifier
         +      1 or more quantifier
         {      start min/max quantifier

       Part  of  a pattern that is in square brackets is called a
       "character class". In a character  class  the  only  meta-
       characters are:

         \      general escape character
         ^      negate the class, but only if the first character
         -      indicates character range
         ]      terminates the character class

       The following sections describe the use  of  each  of  the
       meta-characters.




BACKSLASH

       The  backslash  character has several uses. Firstly, if it
       is followed by a non-alphameric character, it  takes  away
       any  special  meaning that character may have. This use of
       backslash as an escape character applies both  inside  and
       outside character classes.

       For  example,  if  you  want to match a "*" character, you
       write "\*" in the pattern. This applies whether or not the
       following  character  would  otherwise be interpreted as a
       meta-character, so it is always safe  to  precede  a  non-
       alphameric  with "\" to specify that it stands for itself.
       In particular, if you want to match a backslash, you write
       "\\".

       If  a  pattern  is compiled with the PCRE_EXTENDED option,
       whitespace in the  pattern  (other  than  in  a  character
       class)  and  characters  between a "#" outside a character
       class and the  next  newline  character  are  ignored.  An
       escaping  backslash can be used to include a whitespace or
       "#" character as part of the pattern.

       A second use of backslash provides a way of encoding  non-
       printing characters in patterns in a visible manner. There
       is no restriction on the appearance of non-printing  char-
       acters,  apart from the binary zero that terminates a pat-
       tern, but when a pattern is being prepared by  text  edit-
       ing,  it  is  usually  easier  to use one of the following
       escape sequences than the binary character it represents:

         \a     alarm, that is, the BEL character (hex 07)
         \cx    "control-x", where x is any character
         \e     escape (hex 1B)
         \f     formfeed (hex 0C)
         \n     newline (hex 0A)
         \r     carriage return (hex 0D)
         \t     tab (hex 09)
         \xhh   character with hex code hh
         \ddd   character with octal code ddd, or backreference

       The precise effect of "\cx" is as follows:  if  "x"  is  a
       lower case letter, it is converted to upper case. Then bit
       6 of the character  (hex  40)  is  inverted.   Thus  "\cz"
       becomes  hex  1A,  but  "\c{"  becomes hex 3B, while "\c;"
       becomes hex 7B.

       After "\x", up to two hexadecimal digits are read (letters
       can be in upper or lower case).

       After  "\0"  up  to  two further octal digits are read. In
       both cases, if there are fewer than two digits, just those
       that  are  present  are  used. Thus the sequence "\0\x\07"
       specifies two binary zeros followed by  a  BEL  character.
       Make  sure you supply two digits after the initial zero if
       the character that follows is itself an octal digit.

       The handling of a backslash followed by a digit other than
       0  is  complicated.  Outside a character class, PCRE reads
       it and any following digits as a decimal  number.  If  the
       number  is  less  than  10, or if there have been at least
       that many  previous  capturing  left  parentheses  in  the
       expression,  the entire sequence is taken as a back refer-
       ence. A description of how this works is given later, fol-
       lowing the discussion of parenthesized subpatterns.

       Inside  a  character  class,  or  if the decimal number is
       greater than 9 and there have not been that many capturing
       subpatterns,  PCRE  re-reads up to three octal digits fol-
       lowing the backslash, and generates a single byte from the
       least significant 8 bits of the value. Any subsequent dig-
       its stand for themselves.  For example:

         \040   is another way of writing a space
         \40    is the same, provided there are fewer than 40
                   previous capturing subpatterns
         \7     is always a back reference
         \11    might be a back reference, or another way of
                   writing a tab
         \011   is always a tab
         \0113  is a tab followed by the character "3"
         \113   is the character with octal code 113 (since there
                   can be no more than 99 back references)
         \377   is a byte consisting entirely of 1 bits
         \81    is either a back reference, or a binary zero
                   followed by the two characters "8" and "1"

       Note  that  octal  values  of  100  or greater must not be
       introduced by a leading zero, because no more  than  three
       octal digits are ever read.

       All  the  sequences that define a single byte value can be
       used both inside and outside character classes.  In  addi-
       tion,  inside  a  character  class,  the  sequence "\b" is
       interpreted as the backspace character (hex 08). Outside a
       character class it has a different meaning (see below).
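
       A few of these escapes can be checked interactively.  The
       sketch below uses Python's re module as a stand-in; it is
       not PCRE, and is assumed here only because it reads these
       particular escapes the same way.

```python
import re

# Octal and hex escapes each stand for a single character.
space_octal = re.fullmatch(r"\040", " ")     # octal 040 is a space
tab_octal = re.fullmatch(r"\011", "\t")      # octal 011 is a tab
letter_hex = re.fullmatch(r"\x41", "A")      # hex 41 is "A"

# Only three octal digits are read, so the "3" stands for itself.
tab_then_3 = re.fullmatch(r"\0113", "\t3")

# Inside a character class, \b means backspace (hex 08).
backspace = re.fullmatch(r"[\b]", "\x08")

print(all([space_octal, tab_octal, letter_hex, tab_then_3, backspace]))
```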

       The third use of backslash is for specifying generic char-
       acter types:

         \d     any decimal digit
         \D     any character that is not a decimal digit
         \s     any whitespace character
         \S     any character that is not a whitespace character
         \w     any "word" character
         \W     any "non-word" character

       Each pair of escape sequences partitions the complete  set
       of  characters into two disjoint sets. Any given character
       matches one, and only one, of each pair.

       A "word" character is any letter or digit  or  the  under-
       score  character, that is, any character which can be part
       of a Perl "word". The definition of letters and digits  is
       controlled  by  PCRE's  character  tables, and may vary if
       locale-specific matching is  taking  place  (see  "Locale
       support" above). For example, in the "fr" (French) locale,
       some  character  codes  greater  than  128  are  used  for
       accented letters, and these are matched by \w.

       These  character type sequences can appear both inside and
       outside character classes. They each match  one  character
       of  the appropriate type. If the current matching point is
       at the end of the subject string, all of them fail,  since
       there is no character to match.
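
       The character types can be tried out on a short string.
       This is a minimal sketch in Python's re module, assumed
       here as a convenient substitute for PCRE; the sample text
       is made up for illustration.

```python
import re

text = "post_42 from  user"

words = re.findall(r"\w+", text)      # runs of letters, digits, underscore
gaps = re.findall(r"\s+", text)       # runs of whitespace
non_words = re.findall(r"\W", text)   # single non-word characters

# At the end of the subject there is no character left for \w to match.
at_end = re.search(r"x\w", "ax")

print(words, gaps, non_words, at_end)
```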

       The  fourth  use of backslash is for certain simple asser-
       tions. An assertion specifies a condition that has  to  be
       met  at  a  particular point in a match, without consuming
       any characters from the subject string. The use of subpat-
       terns  for more complicated assertions is described below.
       The backslashed assertions are

         \b     word boundary
         \B     not a word boundary
         \A     start of subject (independent of multiline mode)
         \Z     end of subject or newline at end (independent of
                multiline mode)
         \z     end of subject (independent of multiline mode)

       These  assertions may not appear in character classes (but
       note  that  "\b"  has  a  different  meaning,  namely  the
       backspace character, inside a character class).

       A  word boundary is a position in the subject string where
       the current character and the previous  character  do  not
       both  match  \w  or  \W (i.e. one matches \w and the other
       matches \W), or the start or end  of  the  string  if  the
       first or last character matches \w, respectively.
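
       The word boundary rule is easiest to see by listing where
       a pattern matches.  The sketch below again assumes Python's
       re module in place of PCRE, with an invented sample string.

```python
import re

text = "cat concat cats cat."

# \bcat\b matches "cat" only when it stands alone as a word.
whole_words = [m.start() for m in re.finditer(r"\bcat\b", text)]

# \Bcat\b matches "cat" only at the end of a longer word.
inside_word = [m.start() for m in re.finditer(r"\Bcat\b", text)]

print(whole_words, inside_word)
```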

       The  \A, \Z, and \z assertions differ from the traditional
       circumflex and dollar (described below) in that they  only
       ever  match  at  the  very  start  and  end of the subject
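       string, whatever options are set. They are not affected by
       the PCRE_NOTBOL or PCRE_NOTEOL options.  The  difference
       between \Z and \z is that \Z matches before a newline that
       is  the last character of the string as well as at the end
       of the string, whereas \z matches only at the end.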




CIRCUMFLEX AND DOLLAR

       Outside a character class, in the default  matching  mode,
       the  circumflex  character  is  an assertion which is true
       only if the current matching point is at the start of  the
       subject  string.  Inside a character class, circumflex has
       an entirely different meaning (see below).

       Circumflex need not be the first character of the  pattern
       if a number of alternatives are involved, but it should be
       the first thing in each alternative in which it appears if
       the  pattern is ever to match that branch. If all possible
       alternatives start with a circumflex, that is, if the pat-
       tern is constrained to match only at the start of the sub-
       ject, it is said to be an "anchored" pattern.  (There  are
       also  other  constructs  that  can  cause  a pattern to be
       anchored.)

       A dollar character is an assertion which is true  only  if
       the  current  matching  point is at the end of the subject
       string, or immediately before a newline character that  is
       the last character in the string (by default). Dollar need
       not be the last character of the pattern if  a  number  of
       alternatives  are involved, but it should be the last item
       in any branch in which it appears.  Dollar has no  special
       meaning in a character class.

       The  meaning  of  dollar can be changed so that it matches
       only at the  very  end  of  the  string,  by  setting  the
       PCRE_DOLLAR_ENDONLY  option  at  compile or matching time.
       This does not affect the \Z assertion.

       The meanings of the circumflex and dollar  characters  are
       changed  if the PCRE_MULTILINE option is set. When this is
       the case, they match  immediately  after  and  immediately
       before  an internal "\n" character, respectively, in addi-
       tion to matching at the  start  and  end  of  the  subject
       string.  For example, the pattern /^abc$/ matches the sub-
       ject string "def\nabc" in multiline mode, but  not  other-
       wise.  Consequently,  patterns that are anchored in single
       line mode because all branches  start  with  "^"  are  not
       anchored in multiline mode. The PCRE_DOLLAR_ENDONLY option
       is ignored if PCRE_MULTILINE is set.
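
       The difference between the two modes can be sketched with
       Python's re module, where re.MULTILINE is assumed to play
       the role of PCRE_MULTILINE.  Xnews itself matches single
       line strings, so only the default behaviour applies there.

```python
import re

# By default "$" also matches just before a trailing newline.
dollar_default = re.search(r"abc$", "abc\n")

# Without multiline mode, "^" matches only at the very start.
single_line = re.search(r"^abc$", "def\nabc")

# In multiline mode, "^" also matches right after an internal newline.
multi_line = re.search(r"^abc$", "def\nabc", re.MULTILINE)

print(dollar_default is not None, single_line, multi_line.start())
```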

       Note that the sequences \A, \Z, and  \z  can  be  used  to
       match  the start and end of the subject in both modes, and
       if all branches of a pattern start with \A  it  is  always
       anchored, whether PCRE_MULTILINE is set or not.




FULL STOP (PERIOD, DOT)

       Outside  a  character  class, a dot in the pattern matches
       any one character in the subject, including a non-printing
       character,   but   not   (by  default)  newline.   If  the
       PCRE_DOTALL option is set, then  dots  match  newlines  as
       well.  The  handling of dot is entirely independent of the
       handling of circumflex and dollar, the  only  relationship
       being  that they both involve newline characters.  Dot has
       no special meaning in a character class.
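
       A short check of the dot rules, sketched in Python's re
       module (assumed here as a stand-in, with re.DOTALL taking
       the place of PCRE_DOTALL):

```python
import re

# By default the dot does not match a newline.
no_newline = re.search(r"a.c", "a\nc")

# With the dot-all option it does.
with_dotall = re.search(r"a.c", "a\nc", re.DOTALL)

# Inside a character class the dot is just a literal dot.
dot_in_class = re.fullmatch(r"[.]", "x")

print(no_newline, with_dotall is not None, dot_in_class)
```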




SQUARE BRACKETS

       An opening square bracket introduces  a  character  class,
       terminated  by  a closing square bracket. A closing square
       bracket on its own is not special.  If  a  closing  square
       bracket is required as a member of the class, it should be
       the first data character in the class  (after  an  initial
       circumflex, if present) or escaped with a backslash.

       A  character  class matches a single character in the sub-
       ject; the character must  be  in  the  set  of  characters
       defined  by  the  class, unless the first character in the
       class is a circumflex, in which case the subject character
       must  not be in the set defined by the class. If a circum-
       flex is actually required as a member of the class, ensure
       it  is  not the first character, or escape it with a back-
       slash.

       For example, the character class [aeiou] matches any lower
       case  vowel,  while [^aeiou] matches any character that is
       not a lower case vowel. Note that a circumflex is  just  a
       convenient  notation  for  specifying the characters which
       are in the class by enumerating those that are not. It  is
       not  an  assertion: it still consumes a character from the
       subject string, and fails if the current pointer is at the
       end of the string.

       When caseless matching is set, any letters in a class rep-
       resent both their upper case and lower case  versions,  so
       for  example,  a  caseless  [aeiou] matches "A" as well as
       "a", and a caseless [^aeiou] does not match "A", whereas a
       caseful version would.

       The  newline character is never treated in any special way
       in  character  classes,  whatever  the  setting   of   the
       PCRE_DOTALL  or PCRE_MULTILINE options is. A class such as
       [^a] will always match a newline.

       The minus (hyphen) character can  be  used  to  specify  a
       range of characters in a character class. For example,
       [d-m] matches any letter between d and m, inclusive. If  a
       minus character is required in a class, it must be escaped
       with a backslash or appear in a position where it cannot be
       interpreted as indicating a range, typically as the first
       or last character in the class. It is  not  possible
       to have the character "]" as the end character of a range,
       since a sequence such as [w-] is interpreted as a class of
       two characters. The octal or hexadecimal representation of
       "]" can, however, be used to end a range.

       Ranges operate in ASCII collating sequence. They can  also
       be  used for characters specified numerically, for example
       [\000-\037]. If a range that includes letters is used when
       caseless matching is set, it matches the letters in either
       case. For example, [W-c] is equivalent to [][\^_`wxyzabc],
       matched  caselessly,  and if character tables for the "fr"
       locale are in use, [\xc8-\xcb] matches accented E  charac-
       ters in both cases.

       The  character  types  \d, \D, \s, \S, \w, and \W may also
       appear in a character class, and add the  characters  that
       they  match  to the class. For example, [\dABCDEF] matches
       any hexadecimal digit. A circumflex  can  conveniently  be
       used with the upper case character types to specify a more
       restricted set of characters than the matching lower  case
       type.  For example, the class [^\W_] matches any letter or
       digit, but not underscore.

       All non-alphameric characters other than \, -, ^  (at  the
       start)  and the terminating ] are non-special in character
       classes, but it does no harm if they are escaped.
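
       Several of the class forms above can be exercised in a few
       lines.  This is a sketch in Python's re module, assumed as
       a substitute for PCRE; the subject strings are invented.

```python
import re

# A plain class and a range.
vowels = re.findall(r"[aeiou]", "Xnews Reader")
in_range = re.fullmatch(r"[d-m]", "h")

# A caseless negated class does not match "A".
caseless_negated = re.search(r"[^aeiou]", "A", re.IGNORECASE)

# Character types inside a class.
hex_number = re.fullmatch(r"[\dABCDEF]+", "7F3A")
letters_digits = re.findall(r"[^\W_]+", "foo_bar42")

# A negated class always matches a newline.
newline = re.fullmatch(r"[^a]", "\n")

print(vowels, letters_digits)
```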




VERTICAL BAR

       Vertical bar characters are used to  separate  alternative
       patterns. For example, the pattern

         gilbert|sullivan

       matches  either  "gilbert"  or  "sullivan".  Any number of
       alternatives may appear, and an empty alternative is  per-
       mitted  (matching the empty string).  The matching process
       tries each alternative in turn, from left  to  right,  and
       the  first  one that succeeds is used. If the alternatives
       are within a subpattern (defined below), "succeeds"  means
       matching  the  rest  of  the  main  pattern as well as the
       alternative in the subpattern.
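
       The left-to-right rule matters when one alternative is a
       prefix of another.  A minimal sketch, assuming Python's re
       module behaves like PCRE for these simple patterns:

```python
import re

both = [m.group() for m in re.finditer(r"gilbert|sullivan",
                                       "gilbert and sullivan")]

# The first alternative that succeeds is used, even if it is shorter.
first_wins = re.search(r"a|ab", "ab").group()

# Inside a subpattern, the rest of the pattern must match as well,
# so the second alternative is chosen here.
rest_matters = re.search(r"(a|ab)c", "abc")

print(both, first_wins, rest_matters.group(1))
```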




INTERNAL OPTION SETTING

       The    settings    of    PCRE_CASELESS,    PCRE_MULTILINE,
       PCRE_DOTALL,  and PCRE_EXTENDED can be changed from within
       the pattern by a sequence of Perl option letters  enclosed
       between "(?" and ")". The option letters are

         i  for PCRE_CASELESS
         m  for PCRE_MULTILINE
         s  for PCRE_DOTALL
         x  for PCRE_EXTENDED

       For  example,  (?im) sets caseless, multiline matching. It
       is also possible to unset these options by  preceding  the
       letter with a hyphen, and a combined setting and unsetting
       such as (?im-sx), which sets PCRE_CASELESS and PCRE_MULTI-
       LINE  while  unsetting  PCRE_DOTALL  and PCRE_EXTENDED, is
       also permitted. If a letter appears both before and  after
       the hyphen, the option is unset.

       The  scope of these option changes depends on where in the
       pattern the setting occurs. For settings that are  outside
       any  subpattern (defined below), the effect is the same as
       if the options were set or unset at the start of matching.
       The following patterns all behave in exactly the same way:

         (?i)abc
         a(?i)bc
         ab(?i)c
         abc(?i)

       which in turn is the same as  compiling  the  pattern  abc
       with  PCRE_CASELESS set.  In other words, such "top level"
       settings apply to the  whole  pattern  (unless  there  are
       other  changes  inside subpatterns). If there is more than
       one setting of the same option at top level, the rightmost
       setting is used.
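
       The equivalence between a leading (?i) and the caseless
       option can be sketched in Python's re module, assumed here
       in place of PCRE.  Note one difference: Python 3.11 and
       later reject a top-level (?i) that is not at the start of
       the pattern, so only the first form is shown.

```python
import re

inline = re.fullmatch(r"(?i)abc", "ABC")              # option set in the pattern
flagged = re.fullmatch(r"abc", "ABC", re.IGNORECASE)  # option set by the caller
caseful = re.fullmatch(r"abc", "ABC")                 # no option: no match

print(inline is not None, flagged is not None, caseful)
```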

       If an option change occurs inside a subpattern, the effect
       is different. This is a change of behaviour in Perl 5.005.
       An  option  change  inside  a subpattern affects only that
       part of the subpattern that follows it, so

         (a(?i)b)c

       matches  abc  and  aBc  and  no  other  strings  (assuming
       PCRE_CASELESS is not used).  By this means, options can be
       made to have different settings in different parts of  the
       pattern.  Any  changes made in one alternative do carry on
       into subsequent branches within the same  subpattern.  For
       example,

         (a(?i)b|c)

       matches  "ab", "aB", "c", and "C", even though when match-
       ing "C" the first branch is abandoned  before  the  option
       setting.  This  is  because the effects of option settings
       happen at compile time. There would  be  some  very  weird
       behaviour otherwise.

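       The PCRE-specific options PCRE_UNGREEDY and PCRE_EXTRA can
       be changed in the same way as the Perl-compatible  options
       by  using  the  characters  U and X respectively. The (?X)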
       flag setting is special in that it must always occur  ear-
       lier in the pattern than any of the additional features it
       turns on, even when it is at top level. It is best put  at
       the start.




SUBPATTERNS

       Subpatterns are delimited by parentheses (round brackets),
       which can be nested.  Marking part of a pattern as a  sub-
       pattern does two things:

       1.  It  localizes  a set of alternatives. For example, the
       pattern

         cat(aract|erpillar|)

       matches one of the words "cat", "cataract", or  "caterpil-
       lar".  Without the parentheses, it would match "cataract",
       "erpillar" or the empty string.

       2. It sets up the subpattern as a capturing subpattern (as
       defined above).  When the whole pattern matches, that por-
       tion of the subject string that matched the subpattern  is
       passed  back  to  the  caller  via the ovector argument of
       pcre_exec(). Opening parentheses are counted from left  to
       right  (starting from 1) to obtain the numbers of the cap-
       turing subpatterns.

       For example, if the  string  "the  red  king"  is  matched
       against the pattern

         the ((red|white) (king|queen))

       the captured substrings are "red king", "red", and "king",
       and are numbered 1, 2, and 3.
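
       The numbering can be confirmed with a short sketch.  It
       uses Python's re module, assumed here as a stand-in for
       PCRE; groups() returns the captured substrings in order.

```python
import re

m = re.search(r"the ((red|white) (king|queen))", "the red king")

# Groups are numbered by the position of their opening parenthesis.
captured = m.groups()

print(captured)
```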

       The fact that plain parentheses fulfil  two  functions  is
       not always helpful.  There are often times when a grouping
       subpattern is required without a capturing requirement. If
       an opening parenthesis is followed by "?:", the subpattern
       does not do any capturing, and is not counted when comput-
       ing  the  number  of any subsequent capturing subpatterns.
       For example, if the string "the white  queen"  is  matched
       against the pattern

         the ((?:red|white) (king|queen))

       the captured substrings are "white queen" and "queen", and
       are numbered 1 and 2. The maximum number of captured  sub-
       strings  is 99, and the maximum number of all subpatterns,
       both capturing and non-capturing, is 200.

       As a convenient shorthand, if  any  option  settings  are
       required  at  the start of a non-capturing subpattern, the
       option letters may appear between the  "?"  and  the  ":".
       Thus the two patterns

         (?i:saturday|sunday)
         (?:(?i)saturday|sunday)

       match exactly the same set of strings. Because alternative
       branches are tried from left to right, and options are not
       reset  until  the  end  of  the  subpattern is reached, an
       option  setting  in  one  branch  does  affect  subsequent
       branches,  so the above patterns match "SUNDAY" as well as
       "Saturday".




REPETITION

       Repetition is specified by quantifiers, which  can  follow
       any of the following items:

         a single character, possibly escaped
         the . metacharacter
         a character class
         a back reference (see next section)
         a parenthesized subpattern (unless it is an assertion -
           see below)

       The general repetition quantifier specifies a minimum  and
       maximum  number  of  permitted  matches, by giving the two
       numbers in curly brackets (braces), separated by a  comma.
       The numbers must be less than 65536, and the first must be
       less than or equal to the second. For example:

         z{2,4}

       matches "zz", "zzz", or "zzzz". A closing brace on its own
       is  not a special character. If the second number is omit-
       ted, but the comma is present, there is no upper limit; if
       the  second  number  and  the  comma are both omitted, the
       quantifier specifies an exact number of required  matches.
       Thus

         [aeiou]{3,}

       matches  at  least 3 successive vowels, but may match many
       more, while

         \d{8}

       matches exactly 8 digits. An opening  curly  bracket  that
       appears  in  a position where a quantifier is not allowed,
       or one that does not match the syntax of a quantifier,  is
       taken as a literal character.  For example, {,6} is not a
       quantifier, but a literal string of four characters.

       The quantifier {0} is permitted, causing the expression to
       behave as if the previous item and the quantifier were not
       present.

       For convenience (and historical compatibility)  the  three
       most  common  quantifiers  have single-character abbrevia-
       tions:

         *    is equivalent to {0,}
         +    is equivalent to {1,}
         ?    is equivalent to {0,1}
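
       The quantifier forms above can be compared side by side.
       This is a sketch in Python's re module, assumed in place
       of PCRE; the subject strings are invented.

```python
import re

# z{2,4} accepts two to four z's and nothing else.
z_ok = [s for s in ("z", "zz", "zzzz", "zzzzz")
        if re.fullmatch(r"z{2,4}", s)]

# {3,} has no upper limit, so the whole run of vowels is taken.
vowel_run = re.search(r"[aeiou]{3,}", "queueing").group()

# {8} demands exactly eight digits.
eight = re.fullmatch(r"\d{8}", "20240131")
seven = re.fullmatch(r"\d{8}", "2024013")

# "*", "+" and "?" behave like {0,}, {1,} and {0,1}.
pairs = ((r"ab*", r"ab{0,}"), (r"ab+", r"ab{1,}"), (r"ab?", r"ab{0,1}"))
same = all(
    bool(re.fullmatch(short, s)) == bool(re.fullmatch(full, s))
    for short, full in pairs
    for s in ("a", "ab", "abb")
)

print(z_ok, vowel_run, same)
```

       One caveat with this stand-in: Python reads {,6} as a
       quantifier meaning {0,6}, unlike the PCRE behaviour
       described above, so that case is left out.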

       It is possible to construct infinite loops by following  a
       subpattern  that can match no characters with a quantifier
       that has no upper limit, for example:

         (a?)*

       Earlier versions of Perl and PCRE used to give an error at
       compile time for such patterns. However, because there are
       cases where this can be  useful,  such  patterns  are  now
       accepted,  but if any repetition of the subpattern does in
       fact match no characters, the loop is forcibly broken.

       By default, the quantifiers are "greedy",  that  is,  they
       match  as  much  as  possible (up to the maximum number of
       permitted times), without causing the rest of the  pattern
       to  fail. The classic example of where this gives problems
       is in trying to match comments in C programs. These appear
       between  the  sequences /* and */ and within the sequence,
       individual * and / characters may appear.  An  attempt  to
       match C comments by applying the pattern

         /\*.*\*/

       to the string

         /* first comment */  not comment  /* second comment */

       fails,  because  it  matches  the entire string due to the
       greediness of the .*  item.

       However, if a quantifier is followed by a  question  mark,
       then it ceases to be greedy, and instead matches the mini-
       mum number of times possible, so the pattern

         /\*.*?\*/

       does the right thing with the C comments. The  meaning  of
       the various quantifiers is not otherwise changed, just the
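       preferred number of matches.  Do not confuse this  use  of
       question mark with its use as a quantifier in its own
       right. Because it has two uses, it  can  sometimes  appear
       doubled, as in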

         \d??\d

       which  matches  one digit by preference, but can match two
       if that is the only way the rest of the pattern matches.
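
       Greedy and minimizing quantifiers can be compared on a
       sample line.  The sketch assumes Python's re module in
       place of PCRE, and the subject string is made up.

```python
import re

source = "/* first */ code /* second */"

# Greedy: .* runs to the last "*/", swallowing everything between.
greedy = re.findall(r"/\*.*\*/", source)

# Minimizing: .*? stops at the first "*/" it can.
lazy = re.findall(r"/\*.*?\*/", source)

# \d??\d prefers one digit, but takes two if the match requires it.
one_digit = re.match(r"\d??\d", "12").group()
two_digits = re.fullmatch(r"\d??\d", "12").group()

print(greedy, lazy, one_digit, two_digits)
```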

       If the PCRE_UNGREEDY option is set (an option which is not
       available  in Perl) then the quantifiers are not greedy by
       default, but individual ones can be made greedy by follow-
       ing  them with a question mark. In other words, it inverts
       the default behaviour.

       When a parenthesized subpattern is quantified with a mini-
       mum  repeat count that is greater than 1 or with a limited
       maximum, more store is required for the compiled  pattern,
       in proportion to the size of the minimum or maximum.

       If  a  pattern  starts  with  .*  then  it  is  implicitly
       anchored, since whatever follows  will  be  tried  against
       every  character  position  in  the  subject string.  PCRE
       treats this as though it were preceded by \A.

       When a capturing subpattern is repeated,  the  value  cap-
       tured  is  the substring that matched the final iteration.
       For example, after

         (tweedle[dume]{3}\s*)+

       has matched "tweedledum tweedledee" the value of the  cap-
       tured  substring  is  "tweedledee".  However, if there are
       nested capturing subpatterns, the  corresponding  captured
       values may have been set in previous iterations. For exam-
       ple, after

         /(a|(b))+/

       matches "aba" the value of the second  captured  substring
       is "b".
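
       Both cases can be reproduced in a short sketch.  It uses
       Python's re module, which is assumed here to agree with
       PCRE on which iteration a captured value comes from.

```python
import re

# The group keeps the text from its final iteration.
repeated = re.match(r"(tweedle[dume]{3}\s*)+", "tweedledum tweedledee")
last_iteration = repeated.group(1)

# The inner group was last set in the second iteration ("b");
# the third iteration matched "a" and left it untouched.
nested = re.match(r"(a|(b))+", "aba")

print(last_iteration, nested.groups())
```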




BACK REFERENCES

       Outside a character class, a backslash followed by a digit
       greater than 0 (and possibly further  digits)  is  a  back
       reference  to  a capturing subpattern earlier (i.e. to its
       left) in the pattern, provided there have been  that  many
       previous capturing left parentheses.

       However,  if the decimal number following the backslash is
       less than 10, it is always taken as a back reference,  and
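       causes  an error only if there are not that many capturing
       left parentheses in the entire pattern.  In  other  words,
       the  parentheses that are referenced need not be to the
       left of the reference for numbers less than  10.  See  the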
       section  entitled "Backslash" above for further details of
       the handling of digits following a backslash.

       A back reference matches  whatever  actually  matched  the
       capturing subpattern in the current subject string, rather
       than anything matching the subpattern itself. So the  pat-
       tern

         (sens|respons)e and \1ibility

       matches  "sense and sensibility" and "response and respon-
       sibility", but not "sense and responsibility". If  caseful
       matching  is  in  force at the time of the back reference,
       then the case of letters is relevant. For example,

         ((?i)rah)\s+\1

       matches "rah rah" and "RAH RAH", but not "RAH  rah",  even
       though  the original capturing subpattern is matched case-
       lessly.
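
       The first example above can be checked directly.  This is
       a minimal sketch in Python's re module, assumed here as a
       substitute for PCRE.

```python
import re

pattern = re.compile(r"(sens|respons)e and \1ibility")

subjects = (
    "sense and sensibility",
    "response and responsibility",
    "sense and responsibility",
)

# \1 must repeat the text the group actually matched,
# not merely something the group could match.
results = [bool(pattern.fullmatch(s)) for s in subjects]

print(results)
```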

       There may be more than one back reference to the same sub-
       pattern.  If  a subpattern has not actually been used in a
       particular match, then any back references  to  it  always
       fail. For example, the pattern

         (a|(bc))\2

       always  fails  if it starts to match "a" rather than "bc".
       Because there may be up to 99 back references, all  digits
       following  the  backslash are taken as part of a potential
       back reference number. If the  pattern  continues  with  a
       digit  character, then some delimiter must be used to ter-
       minate the back reference. If the PCRE_EXTENDED option  is
       set,  this  can be whitespace.  Otherwise an empty comment
       can be used.

       A back reference that occurs  inside  the  parentheses  to
       which  it  refers fails when the subpattern is first used,
       so, for example, (a\1) never matches.  However, such  ref-
       erences  can  be  useful  inside repeated subpatterns. For
       example, the pattern

         (a|b\1)+

       matches any number of "a"s and also "aba", "ababbaa" etc.
       At  each  iteration  of the subpattern, the back reference
       matches the character string corresponding to the previous
       iteration.  In order for this to work, the pattern must be
       such that the first iteration does not need to  match  the
       back  reference. This can be done using alternation, as in
       the example above, or by a quantifier with a minimum of 
       zero.



ASSERTIONS

       An assertion is a test on the characters following or pre-
       ceding the current matching point that does  not  actually
       consume any characters. The simple assertions coded as \b,
       \B, \A, \Z, \z, ^ and $ are described above. More  compli-
       cated  assertions  are coded as subpatterns. There are two
       kinds: those that look ahead of the  current  position  in
       the subject string, and those that look behind it.

       An  assertion  subpattern  is  matched  in the normal way,
       except that it does not cause the current  matching  posi-
       tion  to  be  changed. Lookahead assertions start with (?=
       for positive assertions and (?! for  negative  assertions.
       For example,

         \w+(?=;)

       matches  a  word  followed  by  a  semicolon, but does not
       include the semicolon in the match, and

         foo(?!bar)

       matches any occurrence of "foo" that is  not  followed  by
       "bar". Note that the apparently similar pattern

         (?!foo)bar

       does  not  find an occurrence of "bar" that is preceded by
       something other than "foo"; it  finds  any  occurrence  of
       "bar"  whatsoever, because the assertion (?!foo) is always
       true when the next three characters are "bar".  A  lookbe-
       hind assertion is needed to achieve this effect.
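
       The three lookahead patterns above can be tried with the
       following sketch. It uses Python's re module as a stand-in
       for PCRE; the subject strings are made up for illustration.

```python
import re

# Positive lookahead: the semicolon is required but not consumed.
word = re.search(r"\w+(?=;)", "int count;").group()
print(word)  # count

# Negative lookahead: only the second "foo" is not followed by "bar".
second_foo = re.search(r"foo(?!bar)", "foobar foobaz").start()
print(second_foo)  # 7

# (?!foo)bar still finds "bar" directly after "foo", because the
# assertion looks at the text ahead ("bar"), not the text behind.
bar_pos = re.search(r"(?!foo)bar", "foobar").start()
print(bar_pos)  # 3
```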

       Lookbehind  assertions start with (?<= for positive asser-
       tions and (?<! for negative assertions. For example,

         (?<!foo)bar

       does find an occurrence of "bar" that is not  preceded  by
       "foo".   The   contents  of  a  lookbehind  assertion  are
       restricted such that all the strings it matches must  have
       a  fixed  length.  However,  if there are several alterna-
       tives, they do not all have to have the same fixed length.
       Thus

         (?<=bullock|donkey)

       is permitted, but

         (?<!dogs?|cats?)

       causes an error at compile time.  Branches that match dif-      
       ferent length strings are permitted only at the top  level
       of  a  lookbehind assertion. This is an extension compared
       with Perl 5.005, which requires all branches to match  the
       same length of string. An assertion such as

         (?<=ab(c|de))

       is  not permitted, because its single branch can match two
       different lengths, but it is acceptable  if  rewritten  to
       use two branches:

         (?<=abc|abde)

       The  implementation  of lookbehind assertions is, for each
       alternative, to temporarily move the current position back
       by  the  fixed  width  and then try to match. If there are
       insufficient characters before the current  position,  the
       match is deemed to fail.
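
       A minimal sketch of lookbehind behaviour, using Python's re
       module as a stand-in. Note the assumption: Python is stricter
       than PCRE and requires every alternative in a lookbehind to
       have the same length, so (?<=bullock|donkey) is not exercised
       here. Only cases where the two engines should agree are shown.

```python
import re

# Negative lookbehind: "bar" must not be preceded by "foo".
not_after_foo = re.compile(r"(?<!foo)bar")
after_foo = not_after_foo.search("foobar")
after_baz = not_after_foo.search("bazbar").start()
print(after_foo)  # None
print(after_baz)  # 3

# At the start of the subject there are too few characters to hold
# "foo", so the inner match fails and the negative assertion succeeds.
at_start = not_after_foo.search("bar").start()
print(at_start)  # 0

# A branch that can match two lengths (dogs?) is a compile-time error.
try:
    re.compile(r"(?<!dogs?|cats?)")
    compiled = True
except re.error:
    compiled = False
print(compiled)  # False
```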

       Assertions can be nested in any combination. For example,

         (?<=(?<!foo)bar)baz

       matches  an  occurrence of "baz" that is preceded by "bar"
       which in turn is not preceded by "foo".

       Assertion subpatterns are not capturing  subpatterns,  and
       may  not  be repeated, because it makes no sense to assert
       the same thing several times.  If  an  assertion  contains
       capturing  subpatterns within it, these are always counted
       for the purposes of numbering the capturing subpatterns in
       the whole pattern.  Substring capturing is carried out for
       positive assertions, but it does not make sense for  nega-
       tive assertions.
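
       The nesting and capturing behaviour described above can be
       sketched as follows, with Python's re module standing in for
       PCRE (an assumption) and invented subject strings.

```python
import re

# "baz" must be preceded by "bar", which must not be preceded by "foo".
nested = re.compile(r"(?<=(?<!foo)bar)baz")
blocked = nested.search("foobarbaz")
allowed = nested.search("xxxbarbaz").start()
print(blocked)  # None
print(allowed)  # 6

# A capturing group inside a positive assertion still sets group 1,
# even though the assertion itself consumes nothing.
m = re.match(r"(?=(\w+);)\w", "abc;")
print(m.group(0), m.group(1))  # a abc
```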

       Assertions  count towards the maximum of 200 parenthesized
       subpatterns.




ONCE-ONLY SUBPATTERNS

       With both maximizing and minimizing repetition, failure of
       what  follows  normally causes the repeated item to be re-
       evaluated to see if a different number of  repeats  allows
       the  rest  of the pattern to match. Sometimes it is useful
       to prevent this, either to change the nature of the match,
       or to cause it to fail earlier than it otherwise might, when
       the author of the pattern knows there is no point in  car-
       rying on.

       Consider,  for example, the pattern \d+foo when applied to
       the subject line

         123456bar

       After matching all 6 digits  and  then  failing  to  match
       "foo",  the  normal  action of the matcher is to try again
       with only 5 digits matching the \d+ item, and then with 4,
       and  so  on,  before ultimately failing. Once-only subpat-
       terns provide the means for specifying that once a portion
       of  the  pattern has matched, it is not to be re-evaluated
       in this way, so the matcher would give up  immediately  on
       failing  to  match  "foo"  the first time. The notation is
       another kind of special parenthesis, starting with (?>  as
       in this example:

         (?>\d+)foo

       This  kind of parenthesis "locks up" the  part of the pat-
       tern it contains once it has matched, and a  failure  fur-
       ther  into the pattern is prevented from backtracking into
       it. Backtracking past it to previous items, however, works
       as normal.

       An  alternative  description  is that a subpattern of this
       type matches the string of characters  that  an  identical
       standalone pattern would match, if anchored at the current
       point in the subject string.

       Once-only subpatterns are not capturing subpatterns. Simple
       cases such as the above example can be thought of as a
       maximizing repeat that must swallow everything it can. So,
       while  both \d+ and \d+? are prepared to adjust the number
       of digits they match in order to make the rest of the pat-
       tern  match,  (?>\d+) can only match an entire sequence of
       digits.
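
       The effect can be sketched in Python's re module, with one
       loud assumption: the (?> notation is only available there
       from Python 3.11, so this sketch substitutes a different
       technique, a lookahead followed by a back reference,
       (?=(\d+))\1. It behaves like a once-only group because an
       assertion is not re-entered after it has succeeded. It is an
       emulation, not the PCRE syntax.

```python
import re

# Ordinary repeat: \d+ gives back one digit so the final \d can match.
ordinary = re.compile(r"\d+\d")
# Emulated once-only repeat: the digits captured inside the lookahead
# are fixed, so nothing is left over for the final \d.
once_only = re.compile(r"(?=(\d+))\1\d")

gives_back = ordinary.search("123456") is not None
locked = once_only.search("123456") is not None
print(gives_back)  # True
print(locked)      # False

# The example subject from the text fails, as it would with \d+foo.
no_foo = re.search(r"(?=(\d+))\1foo", "123456bar")
with_foo = re.search(r"(?=(\d+))\1foo", "123456foo").group()
print(no_foo)    # None
print(with_foo)  # 123456foo
```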

       This construction can of course contain  arbitrarily  com-
       plicated subpatterns, and it can be nested.




CONDITIONAL SUBPATTERNS

       It  is  possible  to  cause the matching process to obey a
       subpattern conditionally or to choose between two alterna-
       tive subpatterns, depending on the result of an assertion,
       or whether a previous capturing subpattern matched or not.
       The two possible forms of conditional subpattern are

         (?(condition)yes-pattern)
         (?(condition)yes-pattern|no-pattern)

       If  the  condition  is satisfied, the yes-pattern is used;
       otherwise the no-pattern (if present) is  used.  If  there
       are  more  than two alternatives in the subpattern, a com-
       pile-time error occurs.

       There are two kinds of condition. If the text between  the
       parentheses  consists  of  a  sequence of digits, then the
       condition is satisfied if the capturing subpattern of that
       number has previously matched. Consider the following pat-
       tern, which contains non-significant white space  to  make
       it  more readable (assume the PCRE_EXTENDED option) and to
       divide it into three parts for ease of discussion:

         ( \( )?    [^()]+    (?(1) \) )

       The first part matches an  optional  opening  parenthesis,
       and  if  that  character  is present, sets it as the first
       captured substring. The second part matches  one  or  more
       characters  that  are not parentheses. The third part is a
       conditional subpattern that tests whether the first set of
       parentheses matched or not. If they did, that is, if the
       subject started with an opening parenthesis, the condition is
       true,  and  so  the  yes-pattern is executed and a closing
       parenthesis is required. Otherwise,  since  no-pattern  is
       not  present,  the  subpattern  matches  nothing. In other
       words, this pattern matches a sequence of non-parentheses,
       optionally enclosed in parentheses.
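
       A minimal sketch of this pattern, using Python's re module as
       a stand-in for PCRE. Its re.VERBOSE flag is assumed to play
       the role of PCRE_EXTENDED here, and the subject strings are
       invented.

```python
import re

# Same pattern as above; re.VERBOSE ignores the layout white space.
OPTIONAL_PARENS = re.compile(r"( \( )?    [^()]+    (?(1) \) )", re.VERBOSE)

results = {
    s: OPTIONAL_PARENS.fullmatch(s) is not None
    for s in ["(abc)", "abc", "(abc", "abc)"]
}
for subject, matched in results.items():
    print(subject, matched)
# (abc) True
# abc True
# (abc False
# abc) False
```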

       If  the  condition is not a sequence of digits, it must be
       an assertion. This may be a positive or negative lookahead
       or lookbehind assertion. Consider this pattern, again con-
       taining non-significant white  space,  and  with  the  two
       alternatives on the second line:

         (?(?=[^a-z]*[a-z])
         \d{2}-[a-z]{3}-\d{2}  |  \d{2}-\d{2}-\d{2} )

       The  condition  is  a  positive  lookahead  assertion that
       matches an optional sequence of non-letters followed by  a
       letter.  In  other  words, it tests for the presence of at
       least one letter in the subject. If a letter is found, the
       subject  is  matched against the first alternative; other-
       wise it  is  matched  against  the  second.  This  pattern
       matches  strings  in one of the two forms dd-aaa-dd or dd-
       dd-dd, where aaa are letters and dd are digits.
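
       Python's re module, used here as a stand-in, accepts only
       group numbers or names as conditions, not assertions. The
       sketch below therefore substitutes an equivalent alternation:
       each branch is guarded by the lookahead or by its negation.
       The names HAS_LETTER, NO_LETTER and DATE are hypothetical.

```python
import re

HAS_LETTER = r"(?=[^a-z]*[a-z])"  # the condition from the text
NO_LETTER = r"(?![^a-z]*[a-z])"   # its negation, guarding the other branch

DATE = re.compile(
    "(?:"
    + HAS_LETTER + r"\d{2}-[a-z]{3}-\d{2}"
    + "|"
    + NO_LETTER + r"\d{2}-\d{2}-\d{2}"
    + ")"
)

def is_date(text: str) -> bool:
    """Return True if the whole of text has one of the two forms."""
    return DATE.fullmatch(text) is not None

print(is_date("12-abc-34"))  # True
print(is_date("12-34-56"))   # True
print(is_date("12-ab3-45"))  # False
```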




COMMENTS

       The sequence (?# marks the start of a comment  which  con-
       tinues  up  to the next closing parenthesis. Nested paren-
       theses are not permitted. The characters that  make  up  a
       comment play no part in the pattern matching at all.

       If the PCRE_EXTENDED option is set, an unescaped # charac-
       ter outside a character class introduces  a  comment  that
       continues up to the next newline character in the pattern.
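
       Both comment forms can be sketched with Python's re module,
       a stand-in whose re.VERBOSE flag is assumed to correspond to
       PCRE_EXTENDED for this purpose.

```python
import re

# (?# ... ) is skipped entirely; the pattern is equivalent to "abc".
inline = re.compile(r"ab(?# this text is ignored)c")
inline_ok = inline.fullmatch("abc") is not None
print(inline_ok)  # True

# In extended mode, an unescaped # outside a character class starts
# a comment that runs to the end of the line.
extended = re.compile(
    r"""
    \d{2}   # two digits
    [#]     # a literal hash, protected by the character class
    [a-z]+  # one or more lower-case letters
    """,
    re.VERBOSE,
)
extended_ok = extended.fullmatch("42#xyz") is not None
print(extended_ok)  # True
```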




PERFORMANCE

       Certain items that may appear in patterns are  more  effi-
       cient than others. It is more efficient to use a character
       class like [aeiou] than a  set  of  alternatives  such  as
       (a|e|i|o|u).  In  general,  the simplest construction that
       provides the required behaviour is usually the most  effi-
       cient.  Jeffrey Friedl's book contains a lot of discussion
       about optimizing regular expressions for efficient perfor-
       mance.




AUTHOR

       Philip Hazel <ph10@cam.ac.uk>
       University Computing Service,
       New Museums Site,
       Cambridge CB2 3QG, England.
       Phone: +44 1223 334714

       Copyright (c) 1998 University of Cambridge.