The Web Roller regular expression syntax is almost identical to standard Perl syntax. The general form is:
/regexp/options
where regexp stands for a sequence of characters and metacharacters. Letters and digits act as literal characters. Any other character is potentially a metacharacter, so everything that is not a letter or a digit must be explicitly marked as a literal character. To do so, precede the non-alphanumeric character with a \ (backslash).
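The escaping rule can be illustrated with a minimal sketch. It uses Python's re module as a stand-in, which is an assumption made purely for demonstration: Web Roller's own engine follows Perl conventions and may differ in details. The sample text is hypothetical.

```python
import re

# Assumption: Python's `re` stands in for the Web Roller engine here.
text = "price: 3.50 or 3x50"

# An unescaped "." is a metacharacter and matches any character.
unescaped = re.findall(r"3.50", text)

# A backslash turns "." into a literal dot.
escaped = re.findall(r"3\.50", text)

print(unescaped)  # ['3.50', '3x50']
print(escaped)    # ['3.50']
```

The unescaped form also accepts 3x50, which is usually not what is intended when a literal dot is meant.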
\ - treats the next character as a literal character (not as a metacharacter).
. - any character except line feed. If the s option is specified, matches any character.
^ - beginning of line. If the m option is specified, matches the beginning of any line within the text; otherwise matches only the beginning of the whole text.
$ - end of line. If the m option is specified, matches the end of any line within the text; otherwise matches only the end of the whole text.
| - alternative (OR). Matches either the expression before or the expression after the |.
() - grouping.
[] - class of characters.
Grouping is used either to create backreferences or to mark the fragment within the parentheses that is to be used for replacement.
A class of characters defines a list or ranges of characters. It matches any one of the characters listed within the [] (or within one of the listed ranges). Continuous ranges are written as [a-z]. Generally, other metacharacters cannot be used inside classes. Placing a ^ in the first position inverts the class (it becomes negative). To use ^ as a literal character, either place it anywhere but the first position or precede it with a backslash.
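The behaviour of ^, the | alternative and character classes can be sketched as follows. Python's re module is used as an assumed stand-in for the Web Roller engine (re.MULTILINE plays the role of the m option), and the sample text is hypothetical.

```python
import re

# Assumption: Python's `re` approximates the Perl-like syntax described above.
text = "cat\ncot\ncut"

# Without the m option, ^ matches only at the start of the whole text.
start_only = re.findall(r"^c.t", text)                  # ['cat']

# With the m option, ^ matches at the start of every line.
every_line = re.findall(r"^c.t", text, re.MULTILINE)    # ['cat', 'cot', 'cut']

# A class matches any one of the listed characters.
in_class = re.findall(r"c[ao]t", text)                  # ['cat', 'cot']

# A ^ in the first position negates the class.
negated = re.findall(r"c[^ao]t", text)                  # ['cut']

# Alternation matches the expression on either side of the |.
either = re.findall(r"cat|cut", text)                   # ['cat', 'cut']
```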
Each character, group or class can be followed by one of the following quantifiers:
? - greedy match (0 or 1 times).
?? - non-greedy match (0 or 1 times).
* - greedy match (0 or more times).
*? - non-greedy match (0 or more times).
+ - greedy match (1 or more times).
+? - non-greedy match (1 or more times).
{n} - match exactly n times.
{n,} - greedy match (n or more times).
{n,}? - non-greedy match (n or more times).
{n,m} - greedy match (at least n and at most m times).
{n,m}? - non-greedy match (at least n and at most m times).
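A short sketch of the counted quantifiers, again assuming Python's re module as an approximation of the Web Roller engine. re.fullmatch is a Python convenience that requires the whole string to match; the inputs are hypothetical.

```python
import re

# {n}: exactly n repetitions.
exactly_three = re.fullmatch(r"\d{3}", "123") is not None    # True
too_many = re.fullmatch(r"\d{3}", "1234") is not None        # False

# {n,m}: at least n and at most m repetitions.
candidates = ["1", "12", "1234", "12345"]
two_to_four = [s for s in candidates if re.fullmatch(r"\d{2,4}", s)]
# ['12', '1234']

# ?: the preceding character is optional (0 or 1 times).
optional_u = re.findall(r"colou?r", "color colour")          # ['color', 'colour']
```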
A greedy quantifier attempts to match as much text as possible. Only if the rest of the expression then fails to match does it "release" the captured characters (this is called "rollback", a rather time- and resource-consuming procedure). For example, if we match the expression /A.*Z/ against the string:
AZXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXX
then the .* first captures the whole rest of the string and then sequentially rolls back until it finds the Z character.
If we use the non-greedy quantifier, /A.*?Z/, the Z character is found immediately.
A non-greedy quantifier is also called a minimizing quantifier, since it captures the minimal possible number of characters and tries to add more only if the rest of the expression did not match.
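The difference between the two kinds of quantifier can be sketched in Python's re module (an assumed stand-in for the Web Roller engine). On a string shaped like the one in the example, both forms return the same match and differ only in the work done. On the second, hypothetical string, which contains two Z characters, they return different matches.

```python
import re

# Same shape as the example string: "AZ" followed by a run of X characters.
text = "AZ" + "X" * 30

# Greedy: .* grabs everything, then rolls back until a Z can match.
greedy = re.search(r"A.*Z", text).group()       # 'AZ'

# Non-greedy: .*? starts with nothing and finds the Z at once.
lazy = re.search(r"A.*?Z", text).group()        # 'AZ'

# With two Z characters the results differ.
two_z = "A1Z2Z"
greedy_two = re.search(r"A.*Z", two_z).group()  # 'A1Z2Z' (up to the last Z)
lazy_two = re.search(r"A.*?Z", two_z).group()   # 'A1Z'   (up to the first Z)
```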
A backslash should be used before non-alphanumeric characters to make them literal. Alphanumeric characters preceded by a backslash designate special characters:
\t - tab (0x09)
\n - line feed (0x0a)
\r - carriage return (0x0d)
\f - form feed (0x0c)
\a - bell (0x07)
\e - escape (0x1b)
\xNN - a character in hexadecimal notation, where each N belongs to the set [0-9A-Fa-f].
\Q - beginning of metacharacter quotation
\E - end of metacharacter quotation
\w - word character [0-9a-z_A-Z]
\W - not \w
\s - whitespace character [ \t\n\r\f]
\S - not \s
\d - digit [0-9]
\D - not \d
\i - letter [a-zA-Z]
\I - not \i
\l - lowercase character [a-z]
\L - not a lowercase character
\u - uppercase character [A-Z]
\U - not an uppercase character
\b - zero-width match at a word boundary
\B - not \b
\A - beginning of the text
\Z - end of the text
\NN - backreference to a previously matched group in round brackets, where NN is an integer.
\b means that a character which is part of a word (\w) is located immediately to the right or to the left of the current position, with a separator character (\W) on the opposite side.
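A minimal sketch of \b, \B and \xNN using Python's re module, which is assumed here as an approximation of the Web Roller engine. Python does not support the \i, \l and \u shorthands, so the sketch spells out the equivalent classes given in the list above. The sample strings are hypothetical.

```python
import re

text = "cat concatenate Cat"

# \b requires a word boundary on each side, so only the standalone word matches.
whole_word = re.findall(r"\bcat\b", text)      # ['cat']

# \B requires the absence of a boundary: only the "cat" inside "concatenate".
inside_word = re.findall(r"\Bcat\B", text)     # ['cat']

# Without boundaries, both occurrences are found.
anywhere = re.findall(r"cat", text)            # ['cat', 'cat']

# \xNN matches the character with the given hexadecimal code (0x41 is "A").
hex_a = re.findall(r"\x41", "ABA")             # ['A', 'A']

# Explicit classes standing in for \u (uppercase) and \i (letter).
uppercase = re.findall(r"[A-Z]", text)         # ['C']
letters = re.findall(r"[a-zA-Z]+", "ab12cd")   # ['ab', 'cd']
```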
Apart from grouping, round brackets are used for the following operations:
(?:pattern) - same as regular grouping, except that the match is not saved and therefore gets no reference number (N).
(?=pattern) - zero-width positive lookahead assertion. For example, \w+(?=\s) matches a word followed by whitespace, without including the whitespace in the match result.
(?!pattern) - zero-width negative lookahead assertion. For example, foo(?!bar) matches any occurrence of "foo" that isn't followed by "bar". Remember that this is a zero-width assertion, which means that a(?!b)d will match ad: a is followed by a character that is not b (the d), and a d follows the zero-width assertion.
(?<=pattern) - zero-width positive lookbehind (backward) assertion. The pattern should be of fixed length, i.e. quantifiers cannot be used.
(?<!pattern) - zero-width negative lookbehind (backward) assertion.
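The assertions above can be tried out with the following sketch. It assumes Python's re module, whose lookahead and lookbehind syntax matches the forms listed here, though the Web Roller engine may differ in details. The inputs containing "$42" and "ababc" are hypothetical.

```python
import re

# (?:...) groups without saving: only the second pair of brackets is numbered.
non_capturing = re.match(r"(?:ab)+(c)", "ababc").groups()         # ('c',)

# (?=...) checks for the whitespace but does not include it in the match.
before_space = re.findall(r"\w+(?=\s)", "one two three")          # ['one', 'two']

# (?!...) rejects "foo" when it is followed by "bar".
not_bar = re.findall(r"foo(?!bar)", "foobar foobaz foo")          # ['foo', 'foo']

# The assertion is zero-width, so a(?!b)d matches "ad".
zero_width = re.search(r"a(?!b)d", "ad").group()                  # 'ad'

# (?<=...) requires a fixed-length pattern before the current position.
after_dollar = re.findall(r"(?<=\$)\d+", "cost $42, qty 7")       # ['42']

# (?<!...) requires that the pattern does not precede the position.
no_dollar = re.findall(r"(?<!\$)\b\d+", "cost $42, qty 7")        # ['7']
```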
Each pair of round brackets (except for the cases listed above) has a unique sequential number, assigned in the order in which the opening brackets appear.
\N, where N stands for the number of the bracket pair, matches the text fragment previously matched by that pair.
Example: /(['"])hello\1/ matches both "hello" and 'hello'.
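The backreference example can be checked with a short sketch in Python's re module (assumed here as a stand-in for the Web Roller engine). The third test string, with mismatched quotes, is a hypothetical addition showing that \1 must repeat exactly what group 1 matched.

```python
import re

# Group 1 captures the opening quote; \1 requires the same quote to close.
pattern = re.compile(r"""(['"])hello\1""")

candidates = ['"hello"', "'hello'", '"hello\'']
matched = [s for s in candidates if pattern.fullmatch(s)]

print(matched)  # ['"hello"', "'hello'"]
```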
The following options may follow the closing slash:
i - ignore case.
s - process all input as single-line text. '.' matches any character.
m - process all input as multi-line text. ^ and $ match the beginning and end, respectively, of any "internal" line.
x - whitespace in the expression is ignored unless preceded by a backslash. Useful for structuring complex expressions.
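The effect of each option can be sketched with the closest flags in Python's re module. The mapping of i, s, m and x to re.IGNORECASE, re.DOTALL, re.MULTILINE and re.VERBOSE is an assumption made for illustration, since Web Roller writes the options after the closing slash instead. The sample text is hypothetical.

```python
import re

text = "First line\nsecond LINE"

# i: ignore case.
ignore_case = re.findall(r"line", text, re.IGNORECASE)        # ['line', 'LINE']

# s: "." also matches the line feed, so the match can span both lines.
without_s = re.search(r"First.*LINE", text)                   # None
with_s = re.search(r"First.*LINE", text, re.DOTALL)           # a match object

# m: ^ matches at the beginning of every internal line.
line_starts = re.findall(r"^\w+", text, re.MULTILINE)         # ['First', 'second']

# x: plain whitespace in the expression is ignored; "\ " is a literal space.
spaced = re.search(r"second \  LINE", text, re.VERBOSE)
spaced_text = spaced.group() if spaced else None              # 'second LINE'
```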