home *** CD-ROM | disk | FTP | other *** search
- Newsgroups: alt.sources
- From: chris@mimsy.umd.edu (Chris Torek)
- Subject: [comp.lang.c] Re: strstr sources: a summary of responses
- Message-ID: <1990Aug27.191251.21720@math.lsa.umich.edu>
- Date: Mon, 27 Aug 90 19:12:51 GMT
-
- Archive-name: torek-boyer-moore/27-Aug-90
- Original-posting-by: chris@mimsy.umd.edu (Chris Torek)
- Original-subject: Re: strstr sources: a summary of responses
- Reposted-by: emv@math.lsa.umich.edu (Edward Vielmetti)
-
- [Reposted from comp.lang.c.
- Comments on this service to emv@math.lsa.umich.edu (Edward Vielmetti).]
-
- In article <1158@umvlsi.ecs.umass.edu> srivasta@umvlsi.ecs.umass.edu
- (Manoj Srivastava) posts a Boyer-Moore based strstr() by John Lacey.
-
- >And finally, an implementation of the Boyer-Moore algorithm. I
- >am not sure offhand, but while the worst case complexity remains
- >O(N+M), the avg case behaviour is O(N/M) ???
-
- (where N is the length of the string being searched and M is the length
- of the substring, aka pattern). Yes, it is N/M.
-
- Next, the code:
-
- >/* strstr, Copyright (c) 1990 by John Lacey. -*- C -*-
- [some stuff deleted]
- >int ch, d[SCHAR_MAX + 1];
- [more stuff deleted]
- > i += d[text[i]];
- [still more deleted]
-
- If `char' is signed, and a string containing a negative character value
- is passed to strstr(), thid code can attempt to fetch, e.g., d[-40]. I
- found the code overy difficult to read, so I rewrote it. I also wrote
- a long comment describing the algorithm. This has been only lightly
- tested, but is simple enough that it should work. Note that this
- version handles strings with embedded NUL characters as well.
-
- (A real B-M search routine should take the deltas table as an argument,
- and have a separate routine to set it up, so that the same pattern can
- be applied to multiple texts. E.g., a `Boyer-Moore grep' would read a
- line at a time and apply this search to each line. See comment.)
-
- #include <limits.h>
- #include <stddef.h>
- typedef void *PTR; /* `char *' for old compilers */
-
- /*
- * Boyer-Moore search: input is `text' (a string) and its length,
- * and a `pattern' (another string) and its length.
- *
- * The linear setup cost of this function is approximately 256 + patlen.
- * Afterwards, however, the average cost is O(textlen/patlen), and the
- * worst case is O(textlen+patlen).
- *
- * The Boyer-Moore algorithm works by observing that, for each position
- * in the text, if the character there does *not* occur somewhere in the
- * search pattern, no comparisons including that character will match.
- * That is, given the text "hello world..." and the pattern "goodbye", the
- * `w' in `world' means that none of `hello w', `ello wo', `llo wor',
- * `lo worl', `o world', ` world.', and `world..' can match. In fact,
- * exactly patlen strings are certain not to match. We can discover this
- * simply by looking at the patlen'th character. Furthermore, even if
- * the text character does occur, it may be that it rules out some number
- * of other matches. Again, we can discover this by doing the match
- * `backwards'.
- *
- * We set up a table of deltas for each possible character, with
- * delta[character] being patlen for characters not in the pattern,
- * less for characters in the pattern, growing progressively smaller
- * as we near the end of the pattern. Matching then works as follows:
- *
- * 0 1 2 3
- * 01234567890123456789012345678901234567
- * "Here is the string being searched into" (text)
- * ------ (pos = [0..5])
- * "string" (pat)
- * 654321- (deltas)
- *
- * (the delta for `-' will be derived below).
- *
- * Positions 0..5 end with `i', which is not the `g' we want. `i' does
- * appear in `string', but two characters before the end. We skip
- * forward so as to make the `i's match up:
- *
- * "Here is the string being searched into" (text)
- * "string" (pos = [2..7])
- *
- * Next we find that ` ' and `g' do not match. Since ` ' does not appear
- * in the pattern at all, we can skip forward 6:
- *
- * "Here is the string being searched into" (text)
- * "string" (pos = [8..13])
- *
- * Comparing `t' vs `g', we again find no match, and so we obtain the
- * delta for `t', which is 4. We skip to position 17:
- *
- * "Here is the string being searched into" (text)
- * "string" (pos = [12..17])
- *
- * It thus takes only four steps to move the search point forward to the
- * match, in this case.
- *
- * If the pattern has a recurring character, we must set the delta for
- * that character to the distance of the one closest to the end:
- *
- * "befuddle the cat" (text)
- * "fuddle" (pos = [0..5])
- * 654321- (delta)
- *
- * We want the next search to line the `d's up like this:
- *
- * "befuddle the cat" (text)
- * "fuddle" (pos = [2..7])
- *
- * and not like this:
- *
- * "befuddle the cat" (text)
- * "fuddle" (pos = [3..8])
- *
- * so we take the smaller delta for d, i.e., 2.
- *
- * The last task is computing the delta we have noted above as `-':
- *
- * "candlesticks" (text)
- * "hand" (pos = [0..3])
- * 4321- (delta)
- *
- * Here the `d' in `hand' matches the `d' in `candlesticks', but the
- * strings differ. Since there are no other `d's in `hand', we know
- * that none of (cand,andl,ndle,dles) can match, and thus we want this
- * delta to be 4 (the length of the pattern). But if we had, e.g.:
- *
- * "candlesticks" (text)
- * "deed" (pos = [0..3])
- * 4321- (delta)
- *
- * then we should advance to line up the other `d':
- *
- * "candlesticks" (text)
- * "deed" (pos = [3..6])
- *
- * As this suggestes, the delta should be that for the `d' nearest the
- * end, but not including the end. This is easily managed by setting up
- * a delta table as follows:
- *
- * for int:c in [0..255] { delta[c] = patlen; };
- * for int:x in [0..patlen-1) { delta[pat[x]] = patlen - (x + 1); };
- *
- * delta[pat[patlen-1]] is never written, so the last letter inherits the
- * delta from an earlier iteration or from the previous loop.
- *
- * NB: the nonsense with `deltaspace' below exists merely because gcc
- * does a horrible job of common subexpression elimination (it does not
- * notice that the array is at a constant stack address).
- */
- char *
- boyer_moore_search(text, textlen, pat, patlen)
- char *text;
- size_t textlen;
- char *pat;
- size_t patlen;
- {
- register unsigned char *p, *t;
- register size_t i, p1, j, *delta;
- size_t deltaspace[UCHAR_MAX + 1];
-
- /* algorithm fails if pattern is empty */
- if ((p1 = patlen) == 0)
- return (text);
-
- /* code below fails (whenever i is unsigned) if pattern too long */
- if (p1 > textlen)
- return (NULL);
-
- /* set up deltas */
- delta = deltaspace;
- for (i = 0; i <= UCHAR_MAX; i++)
- delta[i] = p1;
- for (p = (unsigned char *)pat, i = p1; --i > 0;)
- delta[*p++] = i;
-
- /*
- * From now on, we want patlen - 1.
- * In the loop below, p points to the end of the pattern,
- * t points to the end of the text to be tested against the
- * pattern, and i counts the amount of text remaining, not
- * including the part to be tested.
- */
- p1--;
- p = (unsigned char *)pat + p1;
- t = (unsigned char *)text + p1;
- i = textlen - patlen;
- for (;;) {
- if (*p == *t && memcmp((PTR)(p - p1), (PTR)(t - p1), p1) == 0)
- return ((char *)t - p1);
- j = delta[*t];
- if (i < j)
- break;
- i -= j;
- t += j;
- }
- return (NULL);
- }
- --
- In-Real-Life: Chris Torek, Univ of MD Comp Sci Dept (+1 301 405 2750)
- Domain: chris@cs.umd.edu Path: uunet!mimsy!chris
-