home *** CD-ROM | disk | FTP | other *** search
- Subject: bmgsubs - Boyer-Moore-Gosper fast string search subroutines
- Newsgroups: mod.sources
- Approved: jpn@panda.UUCP
-
- Mod.sources: Volume 5, Issue 14
- Submitted by: Jeff Mogul <talcott!su-navajo.arpa:mogul>
-
- : Run this shell script with "sh" not "csh"
- PATH=:/bin:/usr/bin:/usr/ucb
- export PATH
- all=FALSE
- if [ $1x = -ax ]; then
- all=TRUE
- fi
- /bin/echo 'Extracting README'
- sed 's/^X//' <<'//go.sysin dd *' >README
- Here are routines to perform fast string searches using the
- Boyer-Moore-Gosper algorithm; they can be used in any Unix program (and
- should be portable to non-Unix systems). You can search either a file
- or a buffer in memory.
-
- The code is mostly due to James A. Woods (jaw@ames-aurora.arpa)
- although I have modified it heavily, so all bugs are my fault. The
- original code is from his sped-up version of egrep, recently posted on
- mod.sources and available via anonymous FTP from ames-aurora.arpa as
- pub/egrep.one and pub/egrep.two. That code handles regular
- expressions; mine does not.
-
- The files included here are
- README This file
- Makefile Can be modified for 4.xBSD or Sys V compilation
- bmgsubs.3l Manual page
- bmgsubs.c The search code
- bmgtest.c Test program (example of file searching)
- bmgtest2.c Another test program, showing buffer searching
-
- These have only been tested on 4.2BSD Vax systems.
-
- -Jeff Mogul
- mogul@navajo.stanford.edu
- decwrl!glacier!navajo!mogul
- //go.sysin dd *
- made=TRUE
- if [ $made = TRUE ]; then
- /bin/chmod 664 README
- /bin/echo -n ' '; /bin/ls -ld README
- fi
- /bin/echo 'Extracting Makefile'
- sed 's/^X//' <<'//go.sysin dd *' >Makefile
- # Makefile for bmgsubs (Boyer-Moore-Gosper subroutines)
- #
-
- #
- # Define BSD if you're compiling this for a 4.xBSD system
- # (BSD has bcopy, has index() instead of strchr().)
- #
- ENV = -DBSD
-
- CFLAGS = -O
-
- bmgsubs.o: bmgsubs.c
-
- bmgtest: bmgsubs.o bmgtest.o
- cc -o bmgtest bmgsubs.o bmgtest.o
-
- bmgtest2: bmgsubs.o bmgtest2.o
- cc -o bmgtest2 bmgsubs.o bmgtest2.o
-
- clean:
- rm -f bmgtest bmgtest2 *.o *.BAK *.CKP
- //go.sysin dd *
- made=TRUE
- if [ $made = TRUE ]; then
- /bin/chmod 664 Makefile
- /bin/echo -n ' '; /bin/ls -ld Makefile
- fi
- /bin/echo 'Extracting bmgsubs.3l'
- sed 's/^X//' <<'//go.sysin dd *' >bmgsubs.3l
- X.TH BMGSUBS 3L "16 May 1986"
- X.SH NAME
- (bmgsubs) bmg_setup, bmg_search, bmg_fsearch \- Boyer-Moore-Gosper string search routines
- X.SH SYNOPSIS
- bmg_setup(search_string, case_fold_flag)
- X.br
- char *search_string;
- X.br
- int case_fold_flag;
- X.sp
- bmg_fsearch(file_des, action_func)
- X.br
- int file_des;
- X.br
- int (*action_func)();
- X.sp
- bmg_search(buffer_base, buffer_length, action_func)
- X.br
- char *buffer_base;
- X.br
- int buffer_length;
- X.br
- int (*action_func)();
- X.SH DESCRIPTION
- These routines perform fast searches for strings, using the
- Boyer-Moore-Gosper algorithm. No meta-characters
- (such as `*' or `.')
- are interpreted, and the search string cannot contain newlines.
- X.PP
- X.I Bmg_setup
- must be called as the first step in performing a search. The
- X.I search_string
- parameter is the string to be searched for.
- X.I Case_fold_flag
- should be false (zero) if characters should match exactly, and
- true (non-zero) if case should be ignored when checking for
- matches.
- X.PP
- Once a search string has been specified using
- X.IR bmg_setup ,
- one or more searches for that string may be performed.
- X.PP
- X.I Bmg_fsearch
- searches a file, open for reading on file descriptor
- X.I file_des
- (this is
- X.I not
- a
- X.I stdio
- file.)
- For each line that contains the search string,
- X.I bmg_fsearch
- will call the
- X.I action_func
- function specified by the caller as
- action_func(matching_line, byte_offset).
- The
- X.I matching_line
- parameter is a (char *) pointer to a temporary copy of the line;
- X.I byte_offset
- is the offset from the beginning of the file to the first occurence
- of the search string in that line.
- X.I Action_func
- should return true (non-zero) if the search should continue, or
- false (zero) if the search should terminate at this point.
- X.PP
- X.I Bmg_search
- is like
- X.IR bmg_fsearch ,
- except that instead of searching a file, it searches the buffer
- pointed to by
- X.IR buffer_base ;
- X.I buffer_length
- specifies the number of bytes in the buffer.
- The
- X.I byte_offset
- parameter to
- X.I action_func
- gives the offset from the beginning of the buffer.
- X.PP
- If the user merely wants the matching lines printed on the
- standard output, the
- X.I action_func
- parameter
- to
- X.I bmg_fsearch
- or
- X.I bmg_search
- can be NULL.
- X.SH AUTHOR
- Jeffrey Mogul (Stanford University), based on code written
- by James A. Woods (NASA Ames)
- X.SH BUGS
- Might be nice to have a version of this that handles regular expressions.
- X.PP
- There are large, but finite, limits on the length of both pattern
- strings and text lines. When these limits are exceeded, all bets are off.
- X.PP
- The string pointer passed to
- X.I action_func
- points to a temporary copy of the matching line, and must be
- copied elsewhere before
- X.I action_func
- returns.
- X.PP
- X.I Bmg_search
- does not
- X.I permanently
- modify the buffer in any way, but during its execution (and therefore
- when
- X.I action_func
- is called), the last byte of the buffer may be temporarily changed.
- X.PP
- The Boyer-Moore algorithm cannot find lines that do not contain
- a given pattern (like "grep -v") or count lines ("grep -n").
- Although it is fast even for short search strings, it gets faster as
- the search string length increases.
- //go.sysin dd *
- made=TRUE
- if [ $made = TRUE ]; then
- /bin/chmod 664 bmgsubs.3l
- /bin/echo -n ' '; /bin/ls -ld bmgsubs.3l
- fi
- /bin/echo 'Extracting bmgsubs.c'
- sed 's/^X//' <<'//go.sysin dd *' >bmgsubs.c
- X/*
- * bmgsubs.c
- *
- * Subroutines for fast string searching; no regular expressions
- *
- * Entries:
- *
- * bmg_setup(pattern_string, case_fold_flag)
- * char *pattern_string; -- the string you are searching for
- * int case_fold_flag; -- true if you want search to ignore case
- *
- * bmg_fsearch(file_des, action_func)
- * int file_des; -- file descriptor to read on
- * int (*action_func)(); -- called on each hit
- *
- * bmg_search(buffer, buflen, action_func)
- * char *buffer; -- buffer to search
- * int buflen; -- bytes in buffer
- * int (*action_func)(); -- called on each hit
- *
- * Usage:
- * Start by calling bmg_setup() to specify a pattern. Then open
- * the file you want to search and pass the file descriptor (NOT a
- * stdio file pointer) to bmg_fsearch(). For each line that contains
- * the pattern, bmg_fsearch() will call action_func(string, position)
- * with string being a static copy of the matching line and position
- * being the disntance in bytes from the beginning of the search to the
- * matching substring.
- * action_func() should return a non-zero integer unless the
- * search should be terminated, in which case it should return 0.
- * bmg_search() works the same as bmg_fsearch(), but searches
- * a buffer rather than a file.
- *
- * Adapted from:
- *
- Boyer/Moore/Gosper-assisted 'egrep' search, with delta0 table as in
- original paper (CACM, October, 1977). No delta1 or delta2. According to
- experiment (Horspool, Soft. Prac. Exp., 1982), delta2 is of minimal
- practical value. However, to improve for worst case input, integrating
- the improved Galil strategies (Apostolico/Giancarlo, Siam. J. Comput.,
- February 1986) deserves consideration.
-
- James A. Woods Copyright (c) 1986
- NASA Ames Research Center
- *
- * 29 April 1986 Jeff Mogul Stanford
- * Adapted as a set of subroutines for use by a program. No
- * regular-expression support.
- */
-
- #include <stdio.h>
- #include <ctype.h>
- #include <sys/types.h>
-
- #ifdef BSD
- #define strchr index
- #endif
- #ifndef BUFSIZE
- #define BUFSIZE 8192
- #endif BUFSIZE
- #ifndef PATSIZE
- #define PATSIZE 1000
- #endif PATSIZE
- #define LARGE BUFSIZE + PATSIZE
- #define NL '\n'
- #define EOS '\0'
-
- struct bmg_data {
- int bmg_delta0[256]; /* ascii only */
- char bmg_cmap[256]; /* (un)folded characters */
- char bmg_str[BUFSIZE + 2];
- char bmg_linetemp[BUFSIZE];
- char bmg_pattern[PATSIZE];
- int bmg_patlen;
- int bmg_nsuccess;
- } bmdata;
- #define delta0 bmdata.bmg_delta0
- #define cmap bmdata.bmg_cmap
- #define str bmdata.bmg_str
- #define linetemp bmdata.bmg_linetemp
- #define pattern bmdata.bmg_pattern
- #define patlen bmdata.bmg_patlen
- #define nsuccess bmdata.bmg_nsuccess
-
- char *strcpy(), *strncpy(), *malloc();
-
- bmg_search(buffer, buflen, action)
- char *buffer;
- int buflen;
- int (*action)();
- {
- register char *k,
- *strend,
- *s;
- register int j;
- char *t;
- char *strbegin;
- char *gotamatch();
- int byte_offset = 0;
- char last_byte = buffer[buflen - 1];
-
- nsuccess = 0;
-
- /* make sure that buffer ends with NL */
- if (last_byte != NL)
- buffer[buflen - 1] = NL;
-
- /*
- * We must process the buffer in chunks of BUFSIZE to make
- * "LARGE" hack work right.
- */
- for (strbegin = buffer;
- strbegin < &buffer[buflen]; strbegin += BUFSIZE) {
- strend = strbegin + BUFSIZE;
-
- if (strend > &buffer[buflen]) /* stay inside buffer */
- strend = &buffer[buflen];
-
- /* find end of last full line in this chunk of the buffer */
- s = strbegin;
- while ((strend[-1]) != NL && (strend > s))
- strend--;
- if (strend <= strbegin) /* line longer than BUFSIZE, punt */
- continue;
-
- k = strbegin + patlen - 1;
-
- for (;;) {
- /*
- for a large class of patterns, upwards of 80% of
- match time is spent on the next line.
- we beat existing microcode (vax 'matchc') this way.
- */
- while ((k += delta0[*(unsigned char *) k]) < strend)
- ;
- if (k < (strbegin + LARGE))
- break;
- k -= LARGE;
-
- j = patlen - 1;
- s = k - 1;
- while (cmap[*s--] == pattern[--j]);
- /*
- * delta-less shortcut for literati, but
- * short shrift for genetic engineers.
- */
- if (j >= 0)
- k++;
- else { /* submatch */
- t = k;
- s = k - patlen + 1;
- do
- ;
- while (*s != NL && --s >= strbegin);
- k = s + 1; /* at line start */
-
- if ((k = gotamatch( k,
- (t - buffer),
- action)) == NULL)
- return;
- if (k >= strend)
- break;
- }
- }
- }
- if (buffer[buflen - 1] != last_byte)
- buffer[buflen - 1] = last_byte;
- return(nsuccess);
- }
-
- bmg_fsearch(fd, action)
- int fd;
- int (*action)();
- {
- register char *k,
- *strend,
- *s;
- register int j;
- char *t;
- char *gotamatch();
- int nleftover,
- count,
- file_offset = 0;
-
- nleftover = 0;
- nsuccess = 0;
-
- while ((count = read(fd, str + nleftover, BUFSIZE - nleftover)) > 0) {
- count += nleftover;
- if (count != BUFSIZE && fd != 0) {
- str[count++] = NL; /* insurance for broken last line */
- }
- str[count] = 0;
- for (j = count - 1; str[j] != NL && j >= 0;)
- j--;
- if (j < 0) { /* break up long line */
- str[count] = NL;
- str[++count] = EOS;
- strend = str + count;
- linetemp[0] = EOS;
- nleftover = 0;
- }
- else { /* save partial line */
- strend = str + j;
- nleftover = count - j - 1;
- strncpy(linetemp, str + j + 1, nleftover);
- }
- k = str + patlen - 1;
-
- for (;;) {
- /*
- for a large class of patterns, upwards of 80% of
- match time is spent on the next line.
- we beat existing microcode (vax 'matchc') this way.
- */
- while ((k += delta0[*(unsigned char *) k]) < strend)
- ;
- if (k < (str + LARGE))
- break;
- k -= LARGE;
-
- j = patlen - 1;
- s = k - 1;
- while (cmap[*s--] == pattern[--j]);
- /*
- * delta-less shortcut for literati, but
- * short shrift for genetic engineers.
- */
- if (j >= 0)
- k++;
- else { /* submatch */
- t = k;
- s = k - patlen + 1;
- do
- ;
- while (*s != NL && --s >= str);
- k = s + 1; /* at line start */
-
- if ((k = gotamatch( k,
- (file_offset + (t - str)),
- action)) == NULL)
- return;
- if (k >= strend)
- break;
- }
- }
- file_offset += ( 1 + (strend - str));
- strncpy(str, linetemp, nleftover);
- }
- return(nsuccess);
- }
-
- bmg_setup(pat, folded) /* compute "boyer-moore" delta table */
- char *pat; /* ... HAKMEM lives ... */
- int folded;
- {
- register int j;
- /*
- for multibyte characters (e.g. kanji), there are ways.
- extrapolating 256 to 64K may be unwise, in favor of more
- dynamics within boyermoore() itself.
- */
- patlen = strlen(pat);
- if (folded) { /* fold case while saving pattern */
- for (j = 0; j < patlen; j++) {
- pattern[j] = (isupper((int) pat[j])
- ? (char) tolower((int) pat[j]) : pat[j]);
- }
- }
- else
- bcopy(pat, pattern, patlen);
-
- for (j = 0; j < 256; j++) {
- delta0[j] = patlen;
- cmap[j] = (char) j; /* could be done at compile time */
- }
- for (j = 0; j < patlen - 1; j++)
- delta0[pattern[j]] = patlen - j - 1;
- delta0[pattern[patlen - 1]] = LARGE;
-
- if (folded) {
- for (j = 0; j < patlen - 1; j++)
- if (islower((int) pattern[j]))
- delta0[toupper((int) pattern[j])] = patlen - j - 1;
- if (islower((int) pattern[patlen - 1]))
- delta0[toupper((int) pattern[patlen - 1])] = LARGE;
- for (j = 'A'; j <= 'Z'; j++)
- cmap[j] = (char) tolower((int) j);
- }
- }
-
- static char *gotamatch(s, pos, action)
- register char *s;
- register int pos;
- int (*action)();
- {
- static char result[BUFSIZE];
- register char *r = result;
-
- nsuccess++;
- do
- *r++ = *s;
- while (*s++ != NL);
- *r = 0;
- if (action == NULL) {
- printf("%s", result);
- return(s);
- }
- else
- if (action(result, pos))
- return(s);
- else
- return(NULL);
- }
-
- #ifndef BSD
- bcopy(from, to, len)
- register char *from;
- register char *to;
- register int len;
- {
- while (len-- > 0) {
- *to++ = *from++;
- }
- }
- #endif BSD
- //go.sysin dd *
- made=TRUE
- if [ $made = TRUE ]; then
- /bin/chmod 664 bmgsubs.c
- /bin/echo -n ' '; /bin/ls -ld bmgsubs.c
- fi
- /bin/echo 'Extracting bmgtest.c'
- sed 's/^X//' <<'//go.sysin dd *' >bmgtest.c
- X/*
- * bmgtest.c
- *
- * Test driver/example program for Boyer-Moore-Gosper search routines
- *
- * Works like a stupid grep; takes only a few flags (-i, -n)
- */
-
- #include <stdio.h>
- #include <ctype.h>
- #include <sys/types.h>
-
- #ifdef BSD
- #define strchr index
- #endif
-
- int iflag = 0,
- nflag = 0;
-
- X/*
- * Called by bmg_fsearch for each line that matches the pattern
- */
- printit(string, pos)
- char *string; /* pointer to temporary copy of line */
- int pos; /* byte offset from beginning of file */
- {
- if (nflag)
- printf("%d: %s", pos, string);
- else
- printf("%s", string);
- return(1); /* tells bmg_fsearch to continue the search */
- }
-
- main(argc, argv)
- int argc;
- char **argv;
- {
- int nsuccess,
- fd;
- char *pattern;
-
- while ((argc > 1) && (argv[1][0] == '-')) {
- switch (argv[1][1]) {
- case 'i':
- iflag++;
- break;
- case 'n':
- nflag++;
- break;
- default:
- oops("usage: bmgtest [-i] [-n] pattern [file ...]");
- }
- argc--;
- argv++;
- }
- argv++;
- if (argc <= 1)
- oops("bmgtest: pattern argument missing");
- pattern = *argv;
- argc--;
- argv++;
-
- bmg_setup(pattern, iflag);
-
- if (argc <= 1) { /* process stdin */
- fd = 0;
- bmg_fsearch(fd, printit);
- }
- else {
- while (--argc > 0) {
- if ((fd = open(*argv, 0)) <= 0) {
- fprintf(stderr, "bmgtest: can't open %s\n", *argv);
- nsuccess = 2;
- argv++;
- continue;
- }
- bmg_fsearch(fd, printit);
- close(fd);
- argv++;
- }
- }
- exit((nsuccess == 2) ? 2 : (nsuccess == 0));
- }
-
- oops(message)
- char *message;
- {
- fprintf(stderr, "%s\n", message);
- exit(2);
- }
- //go.sysin dd *
- made=TRUE
- if [ $made = TRUE ]; then
- /bin/chmod 664 bmgtest.c
- /bin/echo -n ' '; /bin/ls -ld bmgtest.c
- fi
- /bin/echo 'Extracting bmgtest2.c'
- sed 's/^X//' <<'//go.sysin dd *' >bmgtest2.c
- X/*
- * bmgtest2.c
- *
- * Test driver/example program for Boyer-Moore-Gosper search routines
- * does buffer search instead of file search.
- *
- * Works like a very stupid grep; takes only a few flags (-i, -n)
- */
-
- #include <stdio.h>
- #include <ctype.h>
- #include <sys/types.h>
- #include <sys/stat.h>
-
- char *malloc();
-
- #ifdef BSD
- #define strchr index
- #endif
-
- int iflag = 0,
- nflag = 0;
-
- X/*
- * Called by bmg_search for each line that matches the pattern
- */
- printit(string, pos)
- char *string; /* pointer to temporary copy of line */
- int pos; /* byte offset from beginning of file */
- {
- if (nflag)
- printf("%d: %s", pos, string);
- else
- printf("%s", string);
- return(1); /* tells bmg_fsearch to continue the search */
- }
-
- main(argc, argv)
- int argc;
- char **argv;
- {
- int nsuccess,
- fd;
- char *pattern;
- struct stat statbuf;
- char *buffer;
-
- while ((argc > 1) && (argv[1][0] == '-')) {
- switch (argv[1][1]) {
- case 'i':
- iflag++;
- break;
- case 'n':
- nflag++;
- break;
- default:
- oops("usage: bmgtest [-i] [-n] pattern [file ...]");
- }
- argc--;
- argv++;
- }
- argv++;
- if (argc <= 1)
- oops("bmgtest: pattern argument missing");
- pattern = *argv;
- argc--;
- argv++;
-
- bmg_setup(pattern, iflag);
-
- if (argc <= 1) { /* process stdin */
- fd = 0;
- bmg_fsearch(fd, printit);
- }
- else {
- while (--argc > 0) {
- if ((fd = open(*argv, 0)) <= 0) {
- fprintf(stderr, "bmgtest: can't open %s\n", *argv);
- nsuccess = 2;
- argv++;
- continue;
- }
- fstat(fd, &statbuf);
- buffer = malloc(statbuf.st_size);
- if (buffer == 0)
- oops("malloc failure");
- read(fd, buffer, statbuf.st_size);
- bmg_search(buffer, statbuf.st_size, printit);
- close(fd);
- free(buffer);
- argv++;
- }
- }
- exit((nsuccess == 2) ? 2 : (nsuccess == 0));
- }
-
- oops(message)
- char *message;
- {
- fprintf(stderr, "%s\n", message);
- exit(2);
- }
- //go.sysin dd *
- made=TRUE
- if [ $made = TRUE ]; then
- /bin/chmod 664 bmgtest2.c
- /bin/echo -n ' '; /bin/ls -ld bmgtest2.c
- fi
-
-