home *** CD-ROM | disk | FTP | other *** search
- From: johnk@wrq.com
- Newsgroups: alt.sources
- Subject: REGEX Globber (Wild Card Matching)
- Message-ID: <16885@milton.u.washington.edu>
- Date: 21 Feb 91 18:36:53 GMT
-
- 02-20-91 Seattle, WA
-
-
- Here is a *IX wildcard globber I butchered, hacked and cajoled together
- after seeing and hearing about and becoming disgusted with several similar
- routines which had one or more of the following attributes: slow, buggy,
- required large levels of recursion on matches, required grotesque levels
- of recursion on failing matches using '*', full of caveats about usability
- or copyrights.
-
- I submit this without copyright and with the clear understanding that
- this code may be used by anyone, for any reason, with any modifications
- and without any guarantees, warrantee or statements of usability of any
- sort.
-
- Having gotten those cow chips out of the way, these routines are fairly
- well tested and reasonably fast. I have made an effort to fail on all
- bad patterns and to quickly determine failing '*' patterns. This parser
- will also do quite a bit of the '*' matching via quick linear loops versus
- the standard blind recursive descent.
-
- This parser has been submitted to profilers at various stages of development
- and has come through quite well. If the last millisecond is important to
- you then some time can be shaved by using stack allocated variables in
- place of many of the pointer follows (which may be done fairly often) found
- in regex_match and regex_match_after_star (ie *p, *t).
-
- No attempt is made to provide general [pat,pat] comparisons. The specific
- subcases supplied by these routines is [pat,text] which is sufficient
- for the large majority of cases (should you care).
-
- Since regex_match may return one of three different values depending upon
- the pattern and text I have made a simple shell for convenience (match()).
- Also included is an is_pattern routine to quickly check a potential pattern
- for regex special characters. I even placed this all in a header file for
- you lazy folks!
-
- Having said all that, here is my own reinvention of the wheel. Please
- enjoy it's use and I hope it is of some help to those with need ....
-
-
- jbk
-
-
- ==================== BEGIN of Listing ====================
-
-
- Checksum: 3309008187 (verify or update this with "brik")
-
-
- ==================== match.h ====================
-
- /*
- EPSHeader
-
- File: match.h
- Author: J. Kercheval
- Created: Sat, 01/05/1991 22:27:18
- */
- /*
- EPSRevision History
-
- J. Kercheval Wed, 02/20/1991 22:28:37 Released to Public Domain
- */
-
- /*
- Wildcard Pattern Matching
- */
-
- #ifndef BOOLEAN
- # define BOOLEAN int
- # define TRUE 1
- # define FALSE 0
- #endif
-
- /*----------------------------------------------------------------------------
- *
- * Match the pattern PATTERN against the string TEXT;
- * return TRUE if it matches, FALSE otherwise.
- *
- * A match means the entire string TEXT is used up in matching.
- *
- * In the pattern string:
- * `*' matches any sequence of characters
- * `?' matches any character
- * [SET] matches any character in the specified set,
- * [!SET] or [^SET] matches any character not in the specified set.
- *
- * Note: the standard regex character '+' (one or more) should by
- * simulated by using "?*" which is equivelant here.
- *
- * A set is composed of characters or ranges; a range looks like
- * character hyphen character (as in 0-9 or A-Z).
- * [0-9a-zA-Z_] is the set of characters allowed in C identifiers.
- * Any other character in the pattern must be matched exactly.
- *
- * To suppress the special syntactic significance of any of `[]*?!^-\',
- * and match the character exactly, precede it with a `\'.
- *
- ----------------------------------------------------------------------------*/
-
- BOOLEAN match (char *pattern, char *text);
-
- /*----------------------------------------------------------------------------
- *
- * Return TRUE if PATTERN has any special wildcard characters
- *
- ----------------------------------------------------------------------------*/
-
- BOOLEAN is_pattern (char *pattern);
-
-
- ==================== match.c ====================
-
- /*
- EPSHeader
-
- File: match.c
- Author: J. Kercheval
- Created: Sat, 01/05/1991 22:21:49
- */
- /*
- EPSRevision History
-
- J. Kercheval Wed, 02/20/1991 22:29:01 Released to Public Domain
- */
-
- /*
- Wildcard Pattern Matching
- */
-
-
- #include "match.h"
-
- #define ABORT 2 /* end of search indicator */
-
- BOOLEAN regex_match_after_star (char *pattern, char *text);
-
- /*----------------------------------------------------------------------------
- *
- * Return TRUE if PATTERN has any special wildcard characters
- *
- ----------------------------------------------------------------------------*/
-
- BOOLEAN is_pattern (char *p)
- {
- while ( *p ) {
- switch ( *p++ ) {
- case '?':
- case '*':
- case '[':
- return TRUE;
- case '\\':
- if ( !*p++ ) return FALSE;
- }
- }
- return FALSE;
- }
-
-
- /*----------------------------------------------------------------------------
- *
- * Match the pattern PATTERN against the string TEXT;
- * return TRUE if it matches, FALSE otherwise.
- *
- * A match means the entire string TEXT is used up in matching.
- *
- * In the pattern string:
- * `*' matches any sequence of characters
- * `?' matches any character
- * [SET] matches any character in the specified set,
- * [!SET] or [^SET] matches any character not in the specified set.
- *
- * Note: the standard regex character '+' (one or more) should by
- * simulated by using "?*" which is equivelant here.
- *
- * A set is composed of characters or ranges; a range looks like
- * character hyphen character (as in 0-9 or A-Z).
- * [0-9a-zA-Z_] is the set of characters allowed in C identifiers.
- * Any other character in the pattern must be matched exactly.
- *
- * To suppress the special syntactic significance of any of `[]*?!^-\',
- * and match the character exactly, precede it with a `\'.
- *
- ----------------------------------------------------------------------------*/
-
- BOOLEAN regex_match ( register char *p, register char *t )
- {
- register char range_start, range_end; /* start and end in range */
-
- BOOLEAN invert; /* is this [..] or [!..] */
- BOOLEAN member_match; /* have I matched the [..] construct? */
- BOOLEAN loop; /* should I terminate? */
-
- for ( ; *p; p++, t++ ) {
-
- /* if this is the end of the text then this is the end of the match */
- if (!*t) {
- return ( *p == '*' && *++p == '\0' ) ? TRUE : ABORT;
- }
-
- /* determine and react to pattern type */
- switch ( *p ) {
-
- /* single any character match */
- case '?':
- break;
-
- /* multiple any character match */
- case '*':
- return regex_match_after_star (p, t);
-
- /* [..] construct, single member/exclusion character match */
- case '[': {
-
- /* move to beginning of range */
- p++;
-
- /* check if this is a member match or exclusion match */
- invert = FALSE;
- if ( *p == '!' || *p == '^') {
- invert = TRUE;
- p++;
- }
-
- /* if closing bracket here or at range start then we have a
- malformed pattern */
- if ( *p == ']' ) {
- return ABORT;
- }
-
- member_match = FALSE;
- loop = TRUE;
-
- while ( loop ) {
-
- /* if end of construct then loop is done */
- if (*p == ']') {
- loop = FALSE;
- continue;
- }
-
- /* matching a '!', '^', '-', '\' or a ']' */
- if ( *p == '\\' ) {
- range_start = range_end = *++p;
- }
- else {
- range_start = range_end = *p;
- }
-
- /* if end of pattern then bad pattern (Missing ']') */
- if (!range_start)
- return ABORT;
-
- /* move to next pattern char */
- p++;
-
- /* check for range bar */
- if (*p == '-') {
-
- /* get the range end */
- range_end = *++p;
-
- /* special character range end */
- if (range_end == '\\')
- range_end = *++p;
-
- /* if end of pattern or construct then bad pattern */
- if (range_end == '\0' || range_end == ']')
- return ABORT;
- }
-
- /* if the text character is in range then match found.
- make sure the range letters have the proper
- relationship to one another before comparison */
- if ( range_start < range_end ) {
- if (*t >= range_start && *t <= range_end) {
- member_match = TRUE;
- loop = FALSE;
- }
- }
- else {
- if (*t >= range_end && *t <= range_start) {
- member_match = TRUE;
- loop = FALSE;
- }
- }
- }
-
- /* if there was a match in an exclusion set then no match */
- /* if there was no match in a member set then no match */
- if ((invert && member_match) ||
- !(invert || member_match))
- return FALSE;
-
- /* if this is not an exclusion then skip the rest of the [...]
- construct that already matched. */
- if (member_match) {
- while (*p != ']') {
-
- /* bad pattern (Missing ']') */
- if (!*p)
- return ABORT;
-
- /* skip exact match */
- if (*p == '\\') {
- p++;
- }
-
- /* move to next pattern char */
- p++;
- }
- }
-
- break;
- }
-
- /* next character is quoted and must match exactly */
- case '\\':
-
- /* move pattern pointer to quoted char and fall through */
- p++;
-
- /* must match this character exactly */
- default:
- if (*p != *t)
- return FALSE;
- }
- }
-
- /* if end of text not reached then the pattern fails */
- return !*t;
- }
-
-
- /*----------------------------------------------------------------------------
- *
- * recursively call regex_match with final segment of PATTERN and of TEXT.
- *
- ----------------------------------------------------------------------------*/
-
- BOOLEAN regex_match_after_star (register char *p, register char *t)
- {
- register BOOLEAN match;
- register nextp;
-
- /* pass over existing ? and * in pattern */
- while ( *p == '?' || *p == '*' ) {
-
- /* take one char for each ? */
- if ( *p == '?' ) {
-
- /* if end of text then no match */
- if ( !*t++ ) {
- return ABORT;
- }
- }
-
- /* move to next char in pattern */
- p++;
- }
-
- /* if end of pattern we have matched regardless of text left */
- if ( !*p ) {
- return TRUE;
- }
-
- /* get the next character to match which must be a literal or '[' */
- nextp = *p;
- if ( nextp == '\\' )
- nextp = p[1];
-
- /* Continue until we run out of text or definite result seen */
- match = FALSE;
- while ( match == FALSE ) {
-
- /* a precondition for matching is that the next character
- in the pattern match the next character in the text or that
- the next pattern is the beginning of a range. Increment text
- pointer as we go here */
- if ( *p == *t || nextp == '[' ) {
- match = regex_match(p, t);
- }
-
- /* if the end of text is reached then no match */
- if ( !*t++ ) match = ABORT;
- }
-
- /* return result */
- return match;
- }
-
- /*----------------------------------------------------------------------------
- *
- * This is a shell to regex_match to return only a true BOOLEAN value
- *
- ----------------------------------------------------------------------------*/
-
- BOOLEAN match( char *p, char *t)
- {
- return ( regex_match(p,t) == TRUE ) ? TRUE : FALSE;
- }
-
-
- #ifdef TEST
-
- /*
- * This test main expects as first arg the pattern and as second arg
- * the match string. Output is yaeh or nay on match.
- */
-
- #include <stdio.h>
-
- int main(int argc, char *argv[])
- {
- if (argc == 3) {
- if (is_pattern(argv[1])) {
- if (match(argv[1],argv[2])) {
- printf(" Match Successful\n");
- }
- else {
- printf(" Match Fails\n");
- }
- }
- else {
- printf(" Bad Pattern\n");
- }
- }
- else printf(" Bad Breath\n");
- return(0);
- }
-
- #endif
-
-
- ==================== END of Listing ====================
- --
- John Kercheval -- 127 NW Bowdion Pl #105 -- Seattle, WA 98107-4960
- Home(Voice): (206) 547-4676 -------- Work (Voice): (206) 324-0350
-