Chip 2002 September

GNU Info File | 1998-05-20 | 13.4 KB | 502 lines

This is Info file rx.info, produced by Makeinfo-1.63 from the input
file rx.texi.

File: rx.info,  Node: Top,  Next: An Introduction to Regexps,  Prev: (dir),  Up: (dir)

Regexps
*******

   This document describes the Posix "Basic Regular Expression" ("BRE")
language.

   The Posix Basic Regular Expression language is a notation for
describing text patterns.  Regexps are typically used by comparing them
to a string to see if that string matches the pattern, or by searching
within a string for a substring that matches.

   This is not a formal definition of Posix regexps - it is an
intuitive and hopefully expository description of them.

* Menu:

* An Introduction to Regexps::
* Literal Regexps::
* Character Sets::
* Subexpressions::
* Repeated Subexpressions::
* Optional Subexpressions::
* Counted Subexpressions::
* Alternative Subexpressions::
* Backreferences::
* A Summary of Regexp Syntax::
* Ambiguous Patterns::
* Acknowledgements::

File: rx.info,  Node: An Introduction to Regexps,  Next: Literal Regexps,  Prev: Top,  Up: Top

An Introduction to Regexps
==========================

   In the simplest cases, a regexp is just a literal string that must
match exactly.  For example, the pattern:

     regexp

matches the string "regexp" and no others.

   Some characters have a special meaning when they occur in a regexp.
They aren't matched literally as in the previous example, but instead
denote a more general pattern.  For example, the character `*' is used
to indicate that the preceding element of a regexp may be repeated 0,
1, or more times.  In the pattern:

     smooo*th

the `*' indicates that the preceding `o' can be repeated 0 or more
times.  So the pattern matches:

     smooth
     smoooth
     smooooth
     smoooooth
     ...

   Suppose you want to write a pattern that literally matches a special
character like `*' - in other words, you don't want `*' to indicate a
permissible repetition, but to match `*' literally.  This is
accomplished by quoting the special character with a backslash.
The pattern:

     smoo\*th

matches the string:

     smoo*th

and no other strings.

   In seven cases, the pattern is reversed - a backslash makes the
character special instead of making a special character normal.  The
characters `+', `?', `|', `(', `)', `{', and `}' are normal but the
sequences `\+', `\?', `\|', `\(', `\)', `\{', and `\}' are special
(their meaning is described later).

   The remaining sections of this document introduce and explain the
various special characters that can occur in regexps.

File: rx.info,  Node: Literal Regexps,  Next: Character Sets,  Prev: An Introduction to Regexps,  Up: Top

Literal Regexps
===============

   A literal regexp is a string which contains no special characters.
A literal regexp matches an identical string, but no other strings.
For example:

     literally

matches

     literally

and nothing else.

   Generally, whitespace characters, numbers, and letters are not
special.  Some punctuation characters are special and some are not (the
syntax summary near the end of this document makes a convenient
reference for which characters are special and which aren't).

File: rx.info,  Node: Character Sets,  Next: Subexpressions,  Prev: Literal Regexps,  Up: Top

Character Sets
==============

   This section introduces the special characters `.' and `['.

   `.' matches any character except the NUL character.  For example:

     p.ck

matches

     pick
     pack
     puck
     pbck
     pcck
     p.ck
     ...

   `[' begins a "character set".  A character set is similar to `.' in
that it matches not a single, literal character, but any of a set of
characters.  `[' is different from `.' in that with `[', you define the
set of characters explicitly.

   There are three basic forms a character set can take.

   In the first form, the character set is spelled out:

     [<cset-spec>]   -- every character in <cset-spec> is in the set.

   In the second form, the character set indicated is the negation of
a character set that is explicitly spelled out:

     [^<cset-spec>]  -- every character *not* in <cset-spec> is in the set.
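
   One rough way to experiment with `.' and the two explicit set forms
is Python's `re' module.  This is only a sketch under stated
assumptions: `re' is not a Posix BRE engine, but `.', `[...]' and
`[^...]' are spelled the same way in both notations.  The word list and
the vowel set are illustrative choices, not taken from rx.

```python
import re

# `.' stands for any single character.  (Python's re also excludes
# newline by default, which differs from the description above.)
dot = re.compile(r"p.ck")
words = ["pick", "pack", "p.ck", "pck", "peck!"]
dot_matches = [w for w in words if dot.fullmatch(w)]

# First form: every listed character is in the set.
vowel = re.compile(r"[aeiou]")
# Second form: every character *not* listed is in the set.
non_vowel = re.compile(r"[^aeiou]")

vowels_found = vowel.findall("regexp")
others_found = non_vowel.findall("regexp")

print(dot_matches)
print(vowels_found)
print(others_found)
```

   `fullmatch' is used so that a word only counts when the whole word
fits the pattern, which mirrors the "matches the string and no others"
wording used in this document.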
   A `<cset-spec>' is more or less an explicit enumeration of a set of
characters.  It can be written as a string of individual characters:

     [aeiou]

or as a range of characters:

     [0-9]

   These two forms can be mixed:

     [A-Za-z0-9_$]

   Note that special regexp characters (such as `*') are *not* special
within a character set.  `-', as illustrated above, *is* special,
except, as illustrated below, when it is the first character mentioned.

   This is a four-character set:

     [-+*/]

   The third form of a character set makes use of a pre-defined
"character class":

     [[:class-name:]]  -- every character described by class-name is in
                          the set.

   The supported character classes are:

     alnum   - the set of alpha-numeric characters
     alpha   - the set of alphabetic characters
     blank   - tab and space
     cntrl   - the control characters
     digit   - decimal digits
     graph   - all printable characters except space
     lower   - lower case letters
     print   - the "printable" characters
     punct   - punctuation
     space   - whitespace characters
     upper   - upper case letters
     xdigit  - hexadecimal digits

   Finally, character class sets can also be inverted:

     [^[:space:]]  - all non-whitespace characters

   Character sets can be used in a regular expression anywhere a
literal character can.

File: rx.info,  Node: Subexpressions,  Next: Repeated Subexpressions,  Prev: Character Sets,  Up: Top

Subexpressions
==============

   A subexpression is a regular expression enclosed in `\(' and `\)'.
A subexpression can be used anywhere a single character or character
set can be used.

   Subexpressions are useful for grouping regexp constructs.  For
example, the repeat operator, `*', usually applies to just the
preceding character.  Recall that:

     smooo*th

matches

     smooth
     smoooth
     ...

   Using a subexpression, we can apply `*' to a longer string:

     banan\(an\)*a

matches

     banana
     bananana
     banananana
     ...

   Subexpressions also have a special meaning with regard to
backreferences and substitutions (see *Note Backreferences::).
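
   The grouping effect described above can be tried out with a short
Python sketch.  It rests on one assumption that is not part of rx:
Python's `re' module writes the grouping operators as plain `(' and `)'
where BRE writes `\(' and `\)'.  The candidate strings and the
ungrouped comparison pattern are illustrative.

```python
import re

# BRE:    banan\(an\)*a
# Python: banan(an)*a   (grouping parentheses are unescaped in re)
grouped = re.compile(r"banan(an)*a")

# Without grouping, `*' applies only to the single preceding character,
# so this pattern repeats just the final `n'.
ungrouped = re.compile(r"bananan*a")

candidates = ["banana", "bananana", "banananana", "banananna"]
grouped_hits = [s for s in candidates if grouped.fullmatch(s)]
ungrouped_hits = [s for s in candidates if ungrouped.fullmatch(s)]

print(grouped_hits)
print(ungrouped_hits)
```

   Comparing the two result lists shows which strings depend on `*'
repeating the whole group rather than one character.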
File: rx.info,  Node: Repeated Subexpressions,  Next: Optional Subexpressions,  Prev: Subexpressions,  Up: Top

Repeated Subexpressions
=======================

   `*' is the repeat operator.  It applies to the preceding character,
character set, subexpression or backreference.  It indicates that the
preceding element can be matched 0 or more times:

     bana\(na\)*

matches

     bana
     banana
     bananana
     banananana
     ...

   `\+' is similar to `*' except that `\+' requires the preceding
element to be matched at least once.  So while:

     bana\(na\)*

matches

     bana

the pattern

     bana\(na\)\+

does not.  Both match

     banana
     bananana
     banananana
     ...

   Thus, `bana\(na\)\+' is short-hand for `banana\(na\)*'.

File: rx.info,  Node: Optional Subexpressions,  Next: Counted Subexpressions,  Prev: Repeated Subexpressions,  Up: Top

Optional Subexpressions
=======================

   `\?' indicates that the preceding character, character set, or
subexpression is optional.  It is permitted to match, or to be skipped:

     CSNY\?

matches both

     CSN

and

     CSNY

File: rx.info,  Node: Counted Subexpressions,  Next: Alternative Subexpressions,  Prev: Optional Subexpressions,  Up: Top

Counted Subexpressions
======================

   An interval expression, `\{m,n\}' where `m' and `n' are non-negative
integers with `n >= m', applies to the preceding character, character
set, subexpression or backreference.  It indicates that the preceding
element must match at least `m' times and may match as many as `n'
times.

   For example:

     c\([ad]\)\{1,4\}r

matches

     car
     cdr
     caar
     cdar
     ...
     caaar
     cdaar
     ...
     cadddr
     cddddr

File: rx.info,  Node: Alternative Subexpressions,  Next: Backreferences,  Prev: Counted Subexpressions,  Up: Top

Alternative Subexpressions
==========================

   An alternative is written:

     regexp-1\|regexp-2\|regexp-3\|...

   It matches anything matched by some `regexp-n'.
For example:

     Crosby, Stills, \(and Nash\|Nash, and Young\)

matches

     Crosby, Stills, and Nash

and

     Crosby, Stills, Nash, and Young

File: rx.info,  Node: Backreferences,  Next: A Summary of Regexp Syntax,  Prev: Alternative Subexpressions,  Up: Top

Backreferences, Extractions and Substitutions
=============================================

   A backreference is written `\n' where `n' is some single digit
other than 0.  To be a valid backreference, there must be at least `n'
parenthesized subexpressions in the pattern prior to the backreference.

   A backreference matches a literal copy of whatever was matched by
the corresponding subexpression.  For example,

     \(.*\)-\1

matches:

     go-go
     ha-ha
     wakka-wakka
     ...

   In some applications, subexpressions are used to extract substrings.
For example, Emacs has the functions `match-beginning' and `match-end'
which report the positions of strings matched by subexpressions.  These
functions use the same numbering scheme for subexpressions as
backreferences, with the additional rule that subexpression 0 is
defined to be the whole regexp.

   In some applications, subexpressions are used in string
substitution.  This again uses the backreference numbering scheme.  For
example, this sed command:

     s/From:.*<\(.*\)>/To: \1/

first matches the line:

     From: Joe Schmoe <schmoe@uspringfield.edu>

   When it does, subexpression 1 matches "schmoe@uspringfield.edu".
The command replaces the matched line with "To: \1" after doing
subexpression substitution on it to get:

     To: schmoe@uspringfield.edu

File: rx.info,  Node: A Summary of Regexp Syntax,  Next: Ambiguous Patterns,  Prev: Backreferences,  Up: Top

A Summary of Regexp Syntax
==========================

   In summary, regexps can be:

   `abcd' - matching a string literally

   `.' - matching any character except NUL

   `[a-z_?]', `[^a-z_?]', `[[:alpha:]]' and `[^[:alpha:]]' - matching
character sets

   `\(subexp\)' - grouping an expression into a subexpression.

   `\n' - match a copy of whatever was matched by the nth subexpression.
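
   The basic forms listed above can be exercised with a small Python
sketch.  It assumes Python's `re' module as a stand-in for a BRE
engine: the only spelling difference needed here is that BRE `\(' and
`\)' become plain `(' and `)'.  The patterns `a.c' and `\(ab\)c' and
all sample strings except "wakka-wakka" and the From: line are
illustrative choices.

```python
import re

# name -> (BRE spelling, Python spelling, sample string)
forms = {
    "literal":       (r"abcd",      r"abcd",    "abcd"),
    "any character": (r"a.c",       r"a.c",     "abc"),
    "character set": (r"[a-z_?]",   r"[a-z_?]", "_"),
    "subexpression": (r"\(ab\)c",   r"(ab)c",   "abc"),
    "backreference": (r"\(.*\)-\1", r"(.*)-\1", "wakka-wakka"),
}
results = {
    name: bool(re.fullmatch(py, text))
    for name, (_bre, py, text) in forms.items()
}

# A backreference must repeat the earlier text exactly.
mismatch = re.fullmatch(r"(.*)-\1", "go-stop")

# Python counterpart of the sed command  s/From:.*<\(.*\)>/To: \1/
line = "From: Joe Schmoe <schmoe@uspringfield.edu>"
rewritten = re.sub(r"From:.*<(.*)>", r"To: \1", line)

print(results)
print(mismatch)
print(rewritten)
```

   Keeping the BRE spelling next to the Python spelling in the table
makes it easy to see that only the subexpression delimiters change.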
   The following special characters and sequences can be applied to a
character, character set, subexpression, or backreference:

   `*' - repeat the preceding element 0 or more times.

   `\+' - repeat the preceding element 1 or more times.

   `\?' - match the preceding element 0 or 1 time.

   `\{m,n\}' - match the preceding element at least `m', and as many
as `n' times.

   `regexp-1\|regexp-2\|..' - match any regexp-n.

   A special character, like `.' or `*' can be made into a literal
character by prefixing it with `\'.

   A special sequence, like `\+' or `\?' can be made into a literal
character by dropping the `\'.

File: rx.info,  Node: Ambiguous Patterns,  Next: Acknowledgements,  Prev: A Summary of Regexp Syntax,  Up: Top

Ambiguous Patterns
==================

   Sometimes a regular expression appears to be ambiguous.  For
example, suppose we compare the pattern:

     begin\|beginning

to the string

     beginning

   Either just the first 5 characters will match, or the whole string
will match.

   In every case like this, the longer match is preferred.  The whole
string will match.

   Sometimes there is ambiguity not about how many characters to match,
but where the subexpressions occur within the match.  This can affect
extraction functions like Emacs' `match-beginning' or rewrite functions
like sed's `s' command.  For example, consider matching the pattern:

     b\([^q]*\)\(ing\)\?

against the string

     beginning

   One possibility is that the first subexpression matches "eginning"
and the second is skipped.  Another possibility is that the first
subexpression matches "eginn" and the second matches "ing".

   The rule is that, consistent with matching as many characters as
possible, the length of lower-numbered subexpressions is maximized in
preference to maximizing the length of later subexpressions.

   In the case of the above example, the two possible matches are equal
in overall length.  Therefore, it comes down to maximizing the
lower-numbered subexpression, \1.  The correct answer is that \1
matches "eginning" and \2 is skipped.
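
   The two rules above can be illustrated with a Python sketch, with an
important caveat: Python's `re' module does not follow the Posix
longest-match rule for alternation; it takes the first alternative that
succeeds.  The helper `longest_alternative' below is hypothetical and
only emulates the rule for a top-level alternation by trying each
branch separately.  For the second pattern, Python's greedy matching
happens to agree with the answer given above.

```python
import re

def longest_alternative(alternatives, text):
    """Return the longest prefix of `text` matched by any alternative.

    Hypothetical helper: emulates "the longer match is preferred" for
    a top-level alternation by trying every branch on its own.
    """
    best = None
    for alt in alternatives:
        m = re.match(alt, text)
        if m and (best is None or len(m.group(0)) > len(best)):
            best = m.group(0)
    return best

# Python's own alternation stops at the first branch that succeeds...
first_wins = re.match(r"begin|beginning", "beginning").group(0)
# ...whereas the rule described above prefers the longer match.
longest = longest_alternative([r"begin", r"beginning"], "beginning")

# BRE:    b\([^q]*\)\(ing\)\?
# Python: b([^q]*)(ing)?
m = re.fullmatch(r"b([^q]*)(ing)?", "beginning")
groups = (m.group(1), m.group(2))  # group 2 is None when skipped

print(first_wins, longest, groups)
```

   The difference between `first_wins' and `longest' is a reminder that
an engine's treatment of ambiguity should be checked before relying on
the rules in this section.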
File: rx.info,  Node: Acknowledgements,  Prev: Ambiguous Patterns,  Up: Top

Acknowledgements
================

   This work was created with support from "Cygnus Solutions Inc." and
"The Free Software Foundation, Inc.".

   Support the GNU project.  Support free software!