Normally, when you search for a sub-string in a string, the match should be exact. So if you search for a sub-string "abc" then the string being searched should contain these exact letters in the same sequence for a match to be found.
We can extend this kind of search to a case-insensitive search, where the sub-string "abc" will find strings like "Abc", "ABC" and so on. That is, case is ignored but the sequence of the letters must still be exactly the same. Sometimes, a case-insensitive search is still not enough. For example, if we want to find any numeric digit, a plain sub-string search forces us to search for each of the ten digits separately. This is where regular expressions come in to help.
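The three kinds of search described above can be compared in a short sketch. Python and its standard `re` module are assumptions here, since the text does not name a language, and the sample string is made up for illustration.

```python
import re

text = "Order ABC-123 shipped"  # hypothetical sample string

# Exact sub-string search: the case has to match.
exact = "abc" in text

# Case-insensitive search: compare against a lower-cased copy.
no_case = "abc" in text.lower()

# Without a pattern, finding a digit means testing each digit in turn.
by_hand = any(digit in text for digit in "0123456789")

# With a regular expression, one pattern covers all ten digits.
first_digit = re.search(r"[0-9]", text)

print(exact, no_case, by_hand, first_digit.group())
```

The set syntax `[0-9]` used in the last search is explained in the Sets section.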
Regular expressions are text patterns used for string matching. Each pattern is a string that mixes plain text with special characters indicating what kind of matching to do. Here is a very brief tutorial on using regular expressions before we move on to the code for handling regular expressions.
Literals
A literal is a character that matches itself. All characters are literals except: ".", "|", "*", "?", "+", "(", ")", "{", "}", "[", "]", "^", "$" and "\". These special characters become literals when preceded by a "\".
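As an illustration of escaping, the sketch below uses Python's `re` module (an assumption, the text does not name a language). It shows a special character behaving as an operator and then as a literal once it is preceded by a backslash. `re.escape` is a real helper in that module that adds the backslashes for you.

```python
import re

# Unescaped, the dot is special, so this pattern also matches "3x14".
loose = re.search(r"3.14", "3x14")

# Preceded by a backslash, the dot is a literal and must appear as itself.
strict = re.search(r"3\.14", "3x14")      # no match
literal = re.search(r"3\.14", "pi=3.14")  # matches "3.14"

# re.escape builds a pattern that treats every character as a literal.
escaped = re.escape("a+b")
plus_literal = re.search(escaped, "a+b")

print(loose is not None, strict is None, literal.group(), plus_literal.group())
```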
Wildcard
The dot character "." matches any single character.
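A small sketch of the wildcard, again assuming Python's `re` module. One detail is specific to that module rather than to the text above: by default its dot does not match a newline unless the `re.DOTALL` flag is given.

```python
import re

def matches(pattern, text, flags=0):
    """Return True when pattern matches the whole of text."""
    return re.fullmatch(pattern, text, flags) is not None

print(matches(r"b.t", "bat"))    # the dot stands for "a"
print(matches(r"b.t", "b t"))    # the dot stands for a space
print(matches(r"b.t", "bt"))     # False: the dot needs exactly one character
print(matches(r"b.t", "boot"))   # False: one dot cannot cover two characters

# Python-specific behaviour: newline is excluded unless DOTALL is set.
print(matches(r"b.t", "b\nt"))
print(matches(r"b.t", "b\nt", re.DOTALL))
```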
Repeats
A repeat is an expression that is repeated an arbitrary number of times.
An expression followed by * can be repeated any number of times including zero.
An expression followed by + can be repeated any number of times, but at least once.
An expression followed by ? may be repeated zero or one times only.
When it is necessary to specify the minimum and maximum number of repeats explicitly, the bounds operator {} may be used,
thus "a{2}" is the letter "a" repeated exactly twice,
"a{2,4}" represents the letter "a" repeated between 2 and 4 times,
and "a{2,}" represents the letter "a" repeated at least twice with no upper limit. Note that there must be no white-space inside the {}, and there is no upper limit on the values of the lower and upper bounds.
Examples: "ba*" will match all of "b", "ba", "baaa" etc. "ba+" will match "ba" or "baaaa" for example but not "b". "ba?" will match "b" or "ba". "ba{2,4}" will match "baa", "baaa" and "baaaa".
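The repeat examples above can be checked with a short script. This is a sketch assuming Python's `re` module; the helper name `matches` is made up for illustration, and `re.fullmatch` is used so that a pattern has to account for the entire string.

```python
import re

def matches(pattern, text):
    """Return True when pattern matches the whole of text."""
    return re.fullmatch(pattern, text) is not None

cases = [
    ("ba*", ["b", "ba", "baaa"]),
    ("ba+", ["b", "ba", "baaaa"]),
    ("ba?", ["b", "ba", "baa"]),
    ("ba{2,4}", ["ba", "baa", "baaaa", "baaaaa"]),
    ("a{2}", ["a", "aa", "aaa"]),
    ("a{2,}", ["a", "aa", "aaaaaa"]),
]

# Print one line per pattern/candidate pair with the match result.
for pattern, candidates in cases:
    for text in candidates:
        print(f"{pattern!r:12} {text!r:10} {matches(pattern, text)}")
```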
Parentheses
Parentheses () are used to group items together into a sub-expression. For example, the expression "(ab)*" would match all of the string "ababab".
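To see what grouping changes, the sketch below (assuming Python's `re` module) compares "(ab)*" with "ab*". Without the parentheses, the repeat applies only to the "b".

```python
import re

def matches(pattern, text):
    """Return True when pattern matches the whole of text."""
    return re.fullmatch(pattern, text) is not None

# With grouping, the repeat applies to the whole sub-expression "ab".
print(matches(r"(ab)*", "ababab"))

# Without grouping, only "b" is repeated: "a" followed by any number of "b".
print(matches(r"ab*", "ababab"))
print(matches(r"ab*", "abbb"))
```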
Alternatives
Alternatives occur when the expression can match either one sub-expression or another; each alternative is separated by a "|". Each alternative is the largest possible previous sub-expression, which is the opposite of how repetition operators behave: a repeat applies only to the smallest preceding item.
Examples: "a(b|c)" could match "ab" or "ac". "abc|def" could match "abc" or "def".
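The difference in scope between alternation and repetition is easy to get wrong, so here is a brief sketch assuming Python's `re` module. The pattern "ab|c" is an extra illustration that is not taken from the text.

```python
import re

def matches(pattern, text):
    """Return True when pattern matches the whole of text."""
    return re.fullmatch(pattern, text) is not None

# Parentheses limit the alternation to "b" or "c".
print(matches(r"a(b|c)", "ab"), matches(r"a(b|c)", "ac"))

# Without parentheses, each alternative extends as far as it can.
print(matches(r"abc|def", "abc"), matches(r"abc|def", "def"))

# "ab|c" therefore means "ab" or "c", not "a" followed by "b" or "c".
print(matches(r"ab|c", "ac"), matches(r"ab|c", "c"))
```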
Sets
A set is a collection of characters that matches any single character belonging to it. Sets are delimited by "[" and "]" and can contain literals, character ranges, and character classes. A set declaration that starts with "^" matches the complement of the elements that follow.
Examples: Character literals: "[abc]" will match either of "a", "b", or "c". "[^abc]" will match any character other than "a", "b", or "c". Character ranges: "[a-z]" will match any character in the range "a" to "z". "[^A-Z]" will match any character other than those in the range "A" to "Z".
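The set examples can be tried out as follows. This is a sketch assuming Python's `re` module, with sample strings invented for illustration. `re.findall` returns every non-overlapping match, which makes it easy to see which characters a set accepts.

```python
import re

# Character literals: any one of "a", "b" or "c".
in_set = re.findall(r"[abc]", "cabbage")

# Complement: any character other than "a", "b" or "c".
not_in_set = re.findall(r"[^abc]", "cabbage")

# Character range: any lower case letter from "a" to "z".
lower = re.findall(r"[a-z]", "Regex 101")

# Complemented range: anything that is not an upper case letter.
not_upper = re.findall(r"[^A-Z]", "Regex 101")

print(in_set, not_in_set, lower, not_upper)
```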
Character classes
A character class is a special sequence that provides a shorthand for commonly used character types. Available classes are:
Class | Description | Equivalent |
---|---|---|
\w | Any word character - all alphanumeric characters plus the underscore. | [a-zA-Z0-9_] |
\s | Any whitespace character (spaces and tabs). | [ \t] |
\d | Any digit. | [0-9] |
\l | Any lower case character. | [a-z] |
\u | Any upper case character. | [A-Z] |
The uppercase version of each class negates it; for example, \S matches any non-whitespace character.
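Support for these classes varies between regular expression libraries. The sketch below assumes Python's `re` module, which provides \w, \s and \d (and their uppercase negations) but not \l or \u, so bracket sets stand in for those two. The `re.ASCII` flag restricts the classes to ASCII characters; even so, Python's \s also accepts newlines and a few other whitespace characters besides spaces and tabs.

```python
import re

sample = "Room_7 is open\t9am"  # hypothetical sample; \t is a tab

words = re.findall(r"\w+", sample, re.ASCII)      # runs of word characters
digits = re.findall(r"\d", sample, re.ASCII)      # single digits
spaces = re.findall(r"\s", sample, re.ASCII)      # whitespace characters
non_space = re.findall(r"\S+", sample, re.ASCII)  # negated class

# Python has no \l or \u classes, so sets are used instead.
lower = re.findall(r"[a-z]+", sample)
upper = re.findall(r"[A-Z]", sample)

print(words, digits, spaces, non_space, lower, upper)
```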
The following table summarizes the syntax elements used in regular expressions.
Character | Description |
---|---|
^ | Beginning of the string. The expression "^A" will match an "A" only at the beginning of the string. |
^ | The caret (^) immediately following the left bracket ([) has a different meaning. It is used to exclude the remaining characters within brackets from matching the target string. The expression "[^0-9]" indicates that the target character should not be a digit. |
$ | The dollar sign ($) will match the end of the string. The expression "abc$" will match the sub-string "abc" only if it is at the end of the string. |
\| | The alternation character (\|) allows the expression on either side of it to match the target string. The expression "a\|b" will match "a" as well as "b". |
. | The dot (.) will match any single character. |
* | The asterisk (*) indicates that the character or sub-expression to its left should match 0 or more times. |
+ | The plus (+) is similar to the asterisk, but the character or sub-expression to its left must match at least once. |
? | The question mark (?) matches the character or sub-expression to its left 0 or 1 times. |
() | Parentheses affect the order of pattern evaluation and also serve as a tagged expression that can be used when replacing the matched sub-string with another expression. |
[] | Brackets ([ and ]) enclosing a set of characters indicate that any of the enclosed characters may match the target character. |
{N} | Repeats expression exactly N times. |
{N,M} | Repeats expression between N and M times. |
{N,} | Repeats expression N or more times. |
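The anchors and tagged expressions listed in the table can be sketched as follows, assuming Python's `re` module. In that module a tagged expression is referred to in the replacement text as \1, \2 and so on; other libraries may use a different notation. The sample strings are invented for illustration.

```python
import re

# ^ matches only at the beginning of the string.
starts_with_a = re.search(r"^A", "Apple pie")
not_at_start = re.search(r"^A", "bad Apple")   # no match

# $ matches only at the end of the string.
ends_with_abc = re.search(r"abc$", "xyzabc")
not_at_end = re.search(r"abc$", "abcxyz")      # no match

# Inside brackets, ^ excludes: this finds the first non-digit.
first_non_digit = re.search(r"[^0-9]", "42x")

# Tagged expressions: reuse the parenthesized parts in the replacement.
swapped = re.sub(r"(\w+)@(\w+)", r"\2 at \1", "user@example")

print(starts_with_a is not None, not_at_start is None)
print(ends_with_abc is not None, not_at_end is None)
print(first_non_digit.group(), swapped)
```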