Regular expressions

Regular expressions are first class objects in STK. A regular expression is created by the string->regexp procedure. A regular expression is matched against a string simply by applying the previously created regular expression to that string. Regular expressions are implemented using code from Henry Spencer's package, and much of the description of regular expressions below is copied from his manual.





(string->regexp string)
procedure
String->regexp compiles the string and returns the corresponding regular expression.

Matching a regular expression against a string is done by applying the result of string->regexp to this string. This application yields a list of integer pairs if there is a match; it returns #f otherwise. Each pair gives the index at which a matching portion of the string starts and the index just past its end.

A regular expression is zero or more branches, separated by ``|''. It matches anything that matches one of the branches.

A branch is zero or more pieces, concatenated. It matches a match for the first, followed by a match for the second, etc.

A piece is an atom possibly followed by ``*'', ``+'', or ``?''. An atom followed by ``*'' matches a sequence of 0 or more matches of the atom. An atom followed by ``+'' matches a sequence of 1 or more matches of the atom. An atom followed by ``?'' matches a match of the atom, or the null string.

An atom is a regular expression in parentheses (matching a match for the regular expression), a range (see below), ``.'' (matching any single character), ``^'' (matching the null string at the beginning of the input string), ``$'' (matching the null string at the end of the input string), a ``\'' followed by a single character (matching that character), or a single character with no other significance (matching that character).
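As a brief illustration of pieces and atoms, the sketch below applies a few small patterns. It uses only string->regexp and regexp application as described in this section; the names r-opt, r-bol and r-dot are hypothetical, and the results in the comments are what the rules above imply rather than output copied from the manual.

```scheme
;; "?" makes the preceding atom optional.
(define r-opt (string->regexp "ab?c"))
(r-opt "xacx")      ; => ((1 3))
(r-opt "xabcx")     ; => ((1 4))

;; "^" matches the null string at the beginning of the input string.
(define r-bol (string->regexp "^abc"))
(r-bol "abcabc")    ; => ((0 3))
(r-bol "xabc")      ; => #f

;; "\" followed by a character matches that character literally.
;; The backslash is doubled because it is written inside a Scheme string.
(define r-dot (string->regexp "a\\.b"))
(r-dot "xa.by")     ; => ((1 4))
(r-dot "xaZby")     ; => #f
```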

A range is a sequence of characters enclosed in ``[]''. It normally matches any single character from the sequence. If the sequence begins with ``^'', it matches any single character not from the rest of the sequence. If two characters in the sequence are separated by ``-'', this is shorthand for the full list of ASCII characters between them (e.g. ``[0-9]'' matches any decimal digit). To include a literal ``]'' in the sequence, make it the first character (following a possible ``^''). To include a literal ``-'', make it the first or last character.
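The range rules can be tried out with a short sketch such as the following. The variable names are hypothetical, and the results in the comments follow from the description above; they are not taken from the manual.

```scheme
;; A range with "-" shorthand: one or more decimal digits.
(define digits (string->regexp "[0-9]+"))
(digits "abc123def")          ; => ((3 6))

;; A leading "^" negates the range: one or more non-digits.
(define non-digits (string->regexp "[^0-9]+"))
(non-digits "123abc456")      ; => ((3 6))

;; A literal "-" is placed last so that it is not read as shorthand.
(define digits-or-dash (string->regexp "[0-9-]+"))
(digits-or-dash "x10-20y")    ; => ((1 6))
```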

In general there may be more than one way to match a regular expression to an input string. Considering only the rules given so far could lead to ambiguities. To resolve those ambiguities, the generated regular expression chooses among alternatives using the rule ``first then longest''. In other words, it considers the possible matches in order working from left to right across the input string and the pattern, and it attempts to match longer pieces of the input string before shorter ones. More specifically, the following rules apply in decreasing order of priority:

  1. If a regular expression could match two different parts of an input string then it will match the one that begins earliest.

  2. If a regular expression contains ``|'' operators then the leftmost matching sub-expression is chosen.

  3. In ``*'', ``+'', and ``?'' constructs, longer matches are chosen in preference to shorter ones.

  4. In sequences of expression components the components are considered from left to right.
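The four rules above can be illustrated with one small pattern each. This is only a sketch: the patterns are chosen for illustration, and the results in the comments are derived from the rules as stated, not copied from the manual.

```scheme
;; Rule 1: the match that begins earliest wins, even though the
;; first alternative could match a longer substring further right.
((string->regexp "bbb|a") "abbb")       ; => ((0 1))

;; Rule 2: among alternatives matching at the same position,
;; the leftmost one is chosen.
((string->regexp "a|ab") "ab")          ; => ((0 1))
((string->regexp "ab|a") "ab")          ; => ((0 2))

;; Rule 3: "*" prefers the longest sequence of matches.
((string->regexp "a*b") "aaab")         ; => ((0 4))

;; Rule 4: components are considered from left to right, so the first
;; group takes as much as it can while still letting the second match.
((string->regexp "(a+)(a+)") "aaa")     ; => ((0 3) (0 2) (2 3))
```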

          (define r1 (string->regexp "abc"))
          (r1 "xyz")                             => #f
          (r1 "12abc345")                        => ((2 5))
          (define r2 (string->regexp "[a-z]+"))
          (r2 "12abc345")                        => ((2 5))

If the regular expression contains parentheses, and if there is a match, the result returned by the application will contain several pairs of integers. The first pair gives the indexes of the first longest substring that matches the whole regular expression. Subsequent pairs give the indexes of the substrings matched by the parenthesized sub-parts of this regular expression, in sequence.

          (define r3 (string->regexp "(a*)(b*)c"))
          (r3 "abc")                             => ((0 3) (0 1) (1 2))
          (r3 "c")                               => ((0 1) (0 0) (0 0))
          ((string->regexp "([a-z]+),([a-z]+)") "XXabcd,eXX")
                                                 => ((2 8) (2 6) (7 8))
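Since a match is reported as index pairs, it is often convenient to turn them back into substrings. The helper below is a minimal sketch of one way to do that; match-substrings is a hypothetical name, not an STK procedure, and it assumes each pair is a two-element list, as printed in the examples above.

```scheme
;; Hypothetical helper: apply the regular expression rx to str and
;; return the matched substrings (whole match first), or #f if no match.
(define (match-substrings rx str)
  (let ((pairs (rx str)))
    (if pairs
        (map (lambda (p) (substring str (car p) (cadr p))) pairs)
        #f)))

(match-substrings (string->regexp "([a-z]+),([a-z]+)") "XXabcd,eXX")
;; => ("abcd,e" "abcd" "e")
(match-substrings (string->regexp "abc") "xyz")
;; => #f
```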






(regexp? obj)
procedure
Returns #t if obj is a regular expression created by string->regexp; otherwise returns #f.

          (regexp? (string->regexp "[a-zA-Z][a-zA-Z0-9]*"))   => #t
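One possible use of regexp? is to write procedures that accept either a pattern string or an already compiled regular expression. The following is a sketch only; ensure-regexp is a hypothetical helper, not part of STK.

```scheme
;; Hypothetical helper: return pattern unchanged if it is already a
;; regular expression, otherwise compile it with string->regexp.
(define (ensure-regexp pattern)
  (if (regexp? pattern)
      pattern
      (string->regexp pattern)))

((ensure-regexp "abc") "12abc345")                      ; => ((2 5))
((ensure-regexp (string->regexp "abc")) "12abc345")     ; => ((2 5))
```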





(regexp-replace pattern string substitution)
procedure
(regexp-replace-all pattern string substitution)
procedure
Regexp-replace matches the regular expression pattern against string. If there is a match, the portion of string that matches pattern is replaced by the substitution string. If there is no match, regexp-replace returns string unmodified. Note that pattern may be either a string or a regular expression.

If substitution contains sequences of the form ``\n'', where n is a digit between 1 and 9, each such sequence is replaced with the portion of string that matched the n-th parenthesized subexpression of pattern. If n is equal to 0, the sequence is replaced with the portion of string that matched the whole pattern. Since the backslash must itself be escaped inside a Scheme string, these sequences are written "\\1", "\\0", etc. in the examples.

          (regexp-replace "a*b" "aaabbcccc" "X")
                            => "Xbcccc"
          (regexp-replace (string->regexp "a*b") "aaabbcccc" "X")
                            => "Xbcccc"
          (regexp-replace "(a*)b" "aaabbcccc" "X\\1Y")
                            => "XaaaYbcccc"
          (regexp-replace "(a*)b" "aaabbcccc" "X\\0Y")
                            => "XaaabYbcccc"
          (regexp-replace "([a-z]*) ([a-z]*)" "john brown" "\\2 \\1")
                            => "brown john"

Regexp-replace replaces only the first occurrence of pattern in string. To replace all the occurrences of the pattern, use regexp-replace-all.

          (regexp-replace "a*b" "aaabbcccc" "X")
                            => "Xbcccc"
          (regexp-replace-all "a*b" "aaabbcccc" "X")
                            => "XXcccc"
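A few further uses of the two procedures are sketched below. The input strings are made up for illustration, and the results in the comments follow from the description above rather than from the manual.

```scheme
;; Swap the two numbers around the dash, using \1 and \2.
(regexp-replace "([0-9]+)-([0-9]+)" "range 10-20" "\\2-\\1")
;; => "range 20-10"

;; Replace every run of digits by a single "#".
(regexp-replace-all "[0-9]+" "a1b22c333" "#")
;; => "a#b#c#"

;; Collapse runs of spaces into one space.
(regexp-replace-all " +" "a   b  c" " ")
;; => "a b c"
```

The patterns in this sketch always match at least one character, so no empty matches arise during replacement.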