home *** CD-ROM | disk | FTP | other *** search
Text File | 1988-09-11 | 56.3 KB | 2,242 lines |
- Subject: regexp(3)
- Newsgroups: mod.sources
- Approved: jpn@panda.UUCP
-
- Mod.sources: Volume 3, Issue 89
- Submitted by: decvax!utzoo!henry
-
- Below is an almost-public-domain re-implementation of the V8 regexp(3)
- routines.
-
- Henry Spencer @ U of Toronto Zoology
- {allegra,ihnp4,linus,decvax}!utzoo!henry
- --------------
- echo README:
- sed 's/^X//' >README <<'!'
- XThis is a nearly-public-domain reimplementation of the V8 regexp(3) package.
- XIt gives C programs the ability to use egrep-style regular expressions, and
- Xdoes it in a much cleaner fashion than the analogous routines in SysV.
- X
- X Copyright (c) 1986 by University of Toronto.
- X Written by Henry Spencer. Not derived from licensed software.
- X
- X Permission is granted to anyone to use this software for any
- X purpose on any computer system, and to redistribute it freely,
- X subject to the following restrictions:
- X
- X 1. The author is not responsible for the consequences of use of
- X this software, no matter how awful, even if they arise
- X from defects in it.
- X
- X 2. The origin of this software must not be misrepresented, either
- X by explicit claim or by omission.
- X
- X 3. Altered versions must be plainly marked as such, and must not
- X be misrepresented as being the original software.
- X
- XBarring a couple of small items in the BUGS list, this implementation is
- Xbelieved 100% compatible with V8. It should even be binary-compatible,
- Xsort of, since the only fields in a "struct regexp" that other people have
- Xany business touching are declared in exactly the same way at the same
- Xlocation in the struct (the beginning).
- X
- XThis implementation is *NOT* AT&T/Bell code, and is not derived from licensed
- Xsoftware. Even though U of T is a V8 licensee. This software is based on
- Xa V8 manual page sent to me by Dennis Ritchie (the manual page enclosed
- Xhere is a complete rewrite and hence is not covered by AT&T copyright).
- XThe software was nearly complete at the time of arrival of our V8 tape.
- XI haven't even looked at V8 yet, although a friend elsewhere at U of T has
- Xbeen kind enough to run a few test programs using the V8 regexp(3) to resolve
- Xa few fine points. I admit to some familiarity with regular-expression
- Ximplementations of the past, but the only one that this code traces any
- Xancestry to is the one published in Kernighan & Plauger (from which this
- Xone draws ideas but not code).
- X
- XSimplistically: put this stuff into a source directory, copy regexp.h into
- X/usr/include, inspect Makefile for compilation options that need changing
- Xto suit your local environment, and then do "make r". This compiles the
- Xregexp(3) functions, compiles a test program, and runs a large set of
- Xregression tests. If there are no complaints, then put regexp.o, regsub.o,
- Xand regerror.o into your C library, and regexp.3 into your manual-pages
- Xdirectory.
- X
- XNote that if you don't put regexp.h into /usr/include *before* compiling,
- Xyou'll have to add "-I." to CFLAGS before compiling.
- X
- XThe files are:
- X
- XMakefile instructions to make everything
- Xregexp.3 manual page
- Xregexp.h header file, for /usr/include
- Xregexp.c source for regcomp() and regexec()
- Xregsub.c source for regsub()
- Xregerror.c source for default regerror()
- Xregmagic.h internal header file
- Xtry.c source for test program
- Xtimer.c source for timing program
- Xtests test list for try and timer
- X
- XThis implementation uses nondeterministic automata rather than the
- Xdeterministic ones found in some other implementations, which makes it
- Xsimpler, smaller, and faster at compiling regular expressions, but slower
- Xat executing them. In theory, anyway. This implementation does employ
- Xsome special-case optimizations to make the simpler cases (which do make
- Xup the bulk of regular expressions actually used) run quickly. In general,
- Xif you want blazing speed you're in the wrong place. Replacing the insides
- Xof egrep with this stuff is probably a mistake; if you want your own egrep
- Xyou're going to have to do a lot more work. But if you want to use regular
- Xexpressions a little bit in something else, you're in luck. Note that many
- Xexisting text editors use nondeterministic regular-expression implementations,
- Xso you're in good company.
- X
- XThis stuff should be pretty portable, given appropriate option settings.
- XIf your chars have less than 8 bits, you're going to have to change the
- Xinternal representation of the automaton, although knowledge of the details
- Xof this is fairly localized. There are no "reserved" char values except for
- XNUL, and no special significance is attached to the top bit of chars.
- XThe string(3) functions are used a fair bit, on the grounds that they are
- Xprobably faster than coding the operations in line. Some attempts at code
- Xtuning have been made, but this is invariably a bit machine-specific.
- !
- echo Makefile:
- sed 's/^X//' >Makefile <<'!'
- X# Things you might want to put in ENV and LENV:
- X# -Dvoid=int compilers that don't do void
- X# -DCHARBITS=0377 compilers that don't do unsigned char
- X# -DSTATIC=extern compilers that don't like "static foo();" as forward decl
- X# -DSTRCSPN library does not have strcspn()
- X# -Dstrchr=index library does not have strchr()
- X# -DERRAVAIL have utzoo-compatible error() function and friends
- XENV=-Dvoid=int -DCHARBITS=0377 -DSTATIC=extern
- XLENV=-Dvoid=int -DCHARBITS=0377
- X
- X# Things you might want to put in TEST:
- X# -DDEBUG debugging hooks
- X# -I. regexp.h from current directory, not /usr/include
- XTEST=
- X
- X# Things you might want to put in PROF:
- X# -Dstatic='/* */' make everything global so profiler can see it.
- X# -p profiler
- XPROF=
- X
- XCFLAGS=-O $(ENV) $(TEST) $(PROF)
- XLINTFLAGS=$(LENV) $(TEST) -ha
- XLDFLAGS=-i
- X
- XOBJ=regexp.o regsub.o
- XLSRC=regexp.c regsub.c regerror.c
- XDTR=README dMakefile regexp.3 regexp.h regexp.c regsub.c regerror.c \
- X regmagic.h try.c timer.c tests
- X
- Xtry: try.o $(OBJ)
- X cc $(LDFLAGS) try.o $(OBJ) -o try
- X
- X# Making timer will probably require putting stuff in $(PROF) and then
- X# recompiling everything; the following is just the final stage.
- Xtimer: timer.o $(OBJ)
- X cc $(LDFLAGS) $(PROF) timer.o $(OBJ) -o timer
- X
- Xtimer.o: timer.c timer.t.h
- X
- Xtimer.t.h: tests
- X sed 's/ /","/g;s/\\/&&/g;s/.*/{"&"},/' tests >timer.t.h
- X
- X# Regression test.
- Xr: try tests
- X @echo 'No news is good news...'
- X try <tests
- X
- Xlint: timer.t.h
- X @echo 'Complaints about multiply-declared regerror() are legit.'
- X lint $(LINTFLAGS) $(LSRC) try.c
- X lint $(LINTFLAGS) $(LSRC) timer.c
- X
- Xregexp.o: regexp.c regexp.h regmagic.h
- Xregsub.o: regsub.c regexp.h regmagic.h
- X
- Xclean:
- X rm -f *.o core mon.out timer.t.h dMakefile dtr try timer
- X
- Xdtr: r makedtr $(DTR)
- X makedtr $(DTR) >dtr
- X
- XdMakefile: Makefile
- X sed '/^L*ENV=/s/ *-DERRAVAIL//' Makefile >dMakefile
- !
- echo regexp.3:
- sed 's/^X//' >regexp.3 <<'!'
- X.TH REGEXP 3 local
- X.DA 30 Nov 1985
- X.SH NAME
- Xregcomp, regexec, regsub, regerror \- regular expression handler
- X.SH SYNOPSIS
- X.ft B
- X.nf
- X#include <regexp.h>
- X
- Xregexp *regcomp(exp)
- Xchar *exp;
- X
- Xint regexec(prog, string)
- Xregexp *prog;
- Xchar *string;
- X
- Xregsub(prog, source, dest)
- Xregexp *prog;
- Xchar *source;
- Xchar *dest;
- X
- Xregerror(msg)
- Xchar *msg;
- X.SH DESCRIPTION
- XThese functions implement
- X.IR egrep (1)-style
- Xregular expressions and supporting facilities.
- X.PP
- X.I Regcomp
- Xcompiles a regular expression into a structure of type
- X.IR regexp ,
- Xand returns a pointer to it.
- XThe space has been allocated using
- X.IR malloc (3)
- Xand may be released by
- X.IR free .
- X.PP
- X.I Regexec
- Xmatches a NUL-terminated \fIstring\fR against the compiled regular expression
- Xin \fIprog\fR.
- XIt returns 1 for success and 0 for failure, and adjusts the contents of
- X\fIprog\fR's \fIstartp\fR and \fIendp\fR (see below) accordingly.
- X.PP
- XThe members of a
- X.I regexp
- Xstructure include at least the following (not necessarily in order):
- X.PP
- X.RS
- Xchar *startp[NSUBEXP];
- X.br
- Xchar *endp[NSUBEXP];
- X.RE
- X.PP
- Xwhere
- X.I NSUBEXP
- Xis defined (as 10) in the header file.
- XOnce a successful \fIregexec\fR has been done using the \fIregexp\fR,
- Xeach \fIstartp\fR-\fIendp\fR pair describes one substring
- Xwithin the \fIstring\fR,
- Xwith the \fIstartp\fR pointing to the first character of the substring and
- Xthe \fIendp\fR pointing to the first character following the substring.
- XThe 0th substring is the substring of \fIstring\fR that matched the whole
- Xregular expression.
- XThe others are those substrings that matched parenthesized expressions
- Xwithin the regular expression, with parenthesized expressions numbered
- Xin left-to-right order of their opening parentheses.
- X.PP
- X.I Regsub
- Xcopies \fIsource\fR to \fIdest\fR, making substitutions according to the
- Xmost recent \fIregexec\fR performed using \fIprog\fR.
- XEach instance of `&' in \fIsource\fR is replaced by the substring
- Xindicated by \fIstartp\fR[\fI0\fR] and
- X\fIendp\fR[\fI0\fR].
- XEach instance of `\e\fIn\fR', where \fIn\fR is a digit, is replaced by
- Xthe substring indicated by
- X\fIstartp\fR[\fIn\fR] and
- X\fIendp\fR[\fIn\fR].
- X.PP
- X.I Regerror
- Xis called whenever an error is detected in \fIregcomp\fR, \fIregexec\fR,
- Xor \fIregsub\fR.
- XThe default \fIregerror\fR writes the string \fImsg\fR,
- Xwith a suitable indicator of origin,
- Xon the standard
- Xerror output
- Xand invokes \fIexit\fR(2).
- X.I Regerror
- Xcan be replaced by the user if other actions are desirable.
- X.SH "REGULAR EXPRESSION SYNTAX"
- XA regular expression is zero or more \fIbranches\fR, separated by `|'.
- XIt matches anything that matches one of the branches.
- X.PP
- XA branch is zero or more \fIpieces\fR, concatenated.
- XIt matches a match for the first, followed by a match for the second, etc.
- X.PP
- XA piece is an \fIatom\fR possibly followed by `*', `+', or `?'.
- XAn atom followed by `*' matches a sequence of 0 or more matches of the atom.
- XAn atom followed by `+' matches a sequence of 1 or more matches of the atom.
- XAn atom followed by `?' matches a match of the atom, or the null string.
- X.PP
- XAn atom is a regular expression in parentheses (matching a match for the
- Xregular expression), a \fIrange\fR (see below), `.'
- X(matching any single character), `^' (matching the null string at the
- Xbeginning of the input string), `$' (matching the null string at the
- Xend of the input string), a `\e' followed by a single character (matching
- Xthat character), or a single character with no other significance
- X(matching that character).
- X.PP
- XA \fIrange\fR is a sequence of characters enclosed in `[]'.
- XIt normally matches any single character from the sequence.
- XIf the sequence begins with `^',
- Xit matches any single character \fInot\fR from the rest of the sequence.
- XIf two characters in the sequence are separated by `\-', this is shorthand
- Xfor the full list of ASCII characters between them
- X(e.g. `[0-9]' matches any decimal digit).
- XTo include a literal `]' in the sequence, make it the first character
- X(following a possible `^').
- XTo include a literal `\-', make it the first or last character.
- X.SH AMBIGUITY
- XIf a regular expression could match two different parts of the input string,
- Xit will match the one which begins earliest.
- XIf both begin in the same place but match different lengths, or match
- Xthe same length in different ways, life gets messier, as follows.
- X.PP
- XIn general, the possibilities in a list of branches are considered in
- Xleft-to-right order, the possibilities for `*', `+', and `?' are
- Xconsidered longest-first, nested constructs are considered from the
- Xoutermost in, and concatenated constructs are considered leftmost-first.
- XThe match that will be chosen is the one that uses the earliest
- Xpossibility in the first choice that has to be made.
- XIf there is more than one choice, the next will be made in the same manner
- X(earliest possibility) subject to the decision on the first choice.
- XAnd so forth.
- X.PP
- XFor example, `(ab|a)b*c' could match `abc' in one of two ways.
- XThe first choice is between `ab' and `a'; since `ab' is earlier, and does
- Xlead to a successful overall match, it is chosen.
- XSince the `b' is already spoken for,
- Xthe `b*' must match its last possibility\(emthe empty string\(emsince
- Xit must respect the earlier choice.
- X.PP
- XIn the particular case where no `|'s are present and there is only one
- X`*', `+', or `?', the net effect is that the longest possible
- Xmatch will be chosen.
- XSo `ab*', presented with `xabbbby', will match `abbbb'.
- XNote that if `ab*' is tried against `xabyabbbz', it
- Xwill match `ab' just after `x', due to the begins-earliest rule.
- X(In effect, the decision on where to start the match is the first choice
- Xto be made, hence subsequent choices must respect it even if this leads them
- Xto less-preferred alternatives.)
- X.SH SEE ALSO
- Xegrep(1), expr(1)
- X.SH DIAGNOSTICS
- X\fIRegcomp\fR returns NULL for a failure
- X(\fIregerror\fR permitting),
- Xwhere failures are syntax errors, exceeding implementation limits,
- Xor applying `+' or `*' to a possibly-null operand.
- X.SH HISTORY
- XBoth code and manual page were
- Xwritten at U of T.
- XThey are intended to be compatible with the Bell V8 \fIregexp\fR(3),
- Xbut are not derived from Bell code.
- X.SH BUGS
- XEmpty branches and empty regular expressions are not portable to V8.
- X.PP
- XThe restriction against
- Xapplying `*' or `+' to a possibly-null operand is an artifact of the
- Xsimplistic implementation.
- X.PP
- XDoes not support \fIegrep\fR's newline-separated branches;
- Xneither does the V8 \fIregexp\fR(3), though.
- X.PP
- XDue to emphasis on
- Xcompactness and simplicity,
- Xit's not strikingly fast.
- XIt does give special attention to handling simple cases quickly.
- !
- echo regexp.h:
- sed 's/^X//' >regexp.h <<'!'
- X/*
- X * Definitions etc. for regexp(3) routines.
- X *
- X * Caveat: this is V8 regexp(3) [actually, a reimplementation thereof],
- X * not the System V one.
- X */
- X#define NSUBEXP 10
- Xtypedef struct regexp {
- X char *startp[NSUBEXP];
- X char *endp[NSUBEXP];
- X char regstart; /* Internal use only. */
- X char reganch; /* Internal use only. */
- X char *regmust; /* Internal use only. */
- X int regmlen; /* Internal use only. */
- X char program[1]; /* Unwarranted chumminess with compiler. */
- X} regexp;
- X
- Xextern regexp *regcomp();
- Xextern int regexec();
- Xextern void regsub();
- Xextern void regerror();
- !
- echo regexp.c:
- sed 's/^X//' >regexp.c <<'!'
- X/*
- X * regcomp and regexec -- regsub and regerror are elsewhere
- X *
- X * Copyright (c) 1986 by University of Toronto.
- X * Written by Henry Spencer. Not derived from licensed software.
- X *
- X * Permission is granted to anyone to use this software for any
- X * purpose on any computer system, and to redistribute it freely,
- X * subject to the following restrictions:
- X *
- X * 1. The author is not responsible for the consequences of use of
- X * this software, no matter how awful, even if they arise
- X * from defects in it.
- X *
- X * 2. The origin of this software must not be misrepresented, either
- X * by explicit claim or by omission.
- X *
- X * 3. Altered versions must be plainly marked as such, and must not
- X * be misrepresented as being the original software.
- X *
- X * Beware that some of this code is subtly aware of the way operator
- X * precedence is structured in regular expressions. Serious changes in
- X * regular-expression syntax might require a total rethink.
- X */
- X#include <stdio.h>
- X#include <regexp.h>
- X#include "regmagic.h"
- X
- X/*
- X * The "internal use only" fields in regexp.h are present to pass info from
- X * compile to execute that permits the execute phase to run lots faster on
- X * simple cases. They are:
- X *
- X * regstart char that must begin a match; '\0' if none obvious
- X * reganch is the match anchored (at beginning-of-line only)?
- X * regmust string (pointer into program) that match must include, or NULL
- X * regmlen length of regmust string
- X *
- X * Regstart and reganch permit very fast decisions on suitable starting points
- X * for a match, cutting down the work a lot. Regmust permits fast rejection
- X * of lines that cannot possibly match. The regmust tests are costly enough
- X * that regcomp() supplies a regmust only if the r.e. contains something
- X * potentially expensive (at present, the only such thing detected is * or +
- X * at the start of the r.e., which can involve a lot of backup). Regmlen is
- X * supplied because the test in regexec() needs it and regcomp() is computing
- X * it anyway.
- X */
- X
- X/*
- X * Structure for regexp "program". This is essentially a linear encoding
- X * of a nondeterministic finite-state machine (aka syntax charts or
- X * "railroad normal form" in parsing technology). Each node is an opcode
- X * plus a "next" pointer, possibly plus an operand. "Next" pointers of
- X * all nodes except BRANCH implement concatenation; a "next" pointer with
- X * a BRANCH on both ends of it is connecting two alternatives. (Here we
- X * have one of the subtle syntax dependencies: an individual BRANCH (as
- X * opposed to a collection of them) is never concatenated with anything
- X * because of operator precedence.) The operand of some types of node is
- X * a literal string; for others, it is a node leading into a sub-FSM. In
- X * particular, the operand of a BRANCH node is the first node of the branch.
- X * (NB this is *not* a tree structure: the tail of the branch connects
- X * to the thing following the set of BRANCHes.) The opcodes are:
- X */
- X
- X/* definition number opnd? meaning */
- X#define END 0 /* no End of program. */
- X#define BOL 1 /* no Match "" at beginning of line. */
- X#define EOL 2 /* no Match "" at end of line. */
- X#define ANY 3 /* no Match any one character. */
- X#define ANYOF 4 /* str Match any character in this string. */
- X#define ANYBUT 5 /* str Match any character not in this string. */
- X#define BRANCH 6 /* node Match this alternative, or the next... */
- X#define BACK 7 /* no Match "", "next" ptr points backward. */
- X#define EXACTLY 8 /* str Match this string. */
- X#define NOTHING 9 /* no Match empty string. */
- X#define STAR 10 /* node Match this (simple) thing 0 or more times. */
- X#define PLUS 11 /* node Match this (simple) thing 1 or more times. */
- X#define OPEN 20 /* no Mark this point in input as start of #n. */
- X /* OPEN+1 is number 1, etc. */
- X#define CLOSE 30 /* no Analogous to OPEN. */
- X
- X/*
- X * Opcode notes:
- X *
- X * BRANCH The set of branches constituting a single choice are hooked
- X * together with their "next" pointers, since precedence prevents
- X * anything being concatenated to any individual branch. The
- X * "next" pointer of the last BRANCH in a choice points to the
- X * thing following the whole choice. This is also where the
- X * final "next" pointer of each individual branch points; each
- X * branch starts with the operand node of a BRANCH node.
- X *
- X * BACK Normal "next" pointers all implicitly point forward; BACK
- X * exists to make loop structures possible.
- X *
- X * STAR,PLUS '?', and complex '*' and '+', are implemented as circular
- X * BRANCH structures using BACK. Simple cases (one character
- X * per match) are implemented with STAR and PLUS for speed
- X * and to minimize recursive plunges.
- X *
- X * OPEN,CLOSE ...are numbered at compile time.
- X */
- X
- X/*
- X * A node is one char of opcode followed by two chars of "next" pointer.
- X * "Next" pointers are stored as two 8-bit pieces, high order first. The
- X * value is a positive offset from the opcode of the node containing it.
- X * An operand, if any, simply follows the node. (Note that much of the
- X * code generation knows about this implicit relationship.)
- X *
- X * Using two bytes for the "next" pointer is vast overkill for most things,
- X * but allows patterns to get big without disasters.
- X */
- X#define OP(p) (*(p))
- X#define NEXT(p) (((*((p)+1)&0377)<<8) + *((p)+2)&0377)
- X#define OPERAND(p) ((p) + 3)
- X
- X/*
- X * See regmagic.h for one further detail of program structure.
- X */
- X
- X
- X/*
- X * Utility definitions.
- X */
- X#ifndef CHARBITS
- X#define UCHARAT(p) ((int)*(unsigned char *)(p))
- X#else
- X#define UCHARAT(p) ((int)*(p)&CHARBITS)
- X#endif
- X
- X#define FAIL(m) { regerror(m); return(NULL); }
- X#define ISMULT(c) ((c) == '*' || (c) == '+' || (c) == '?')
- X#define META "^$.[()|?+*\\"
- X
- X/*
- X * Flags to be passed up and down.
- X */
- X#define HASWIDTH 01 /* Known never to match null string. */
- X#define SIMPLE 02 /* Simple enough to be STAR/PLUS operand. */
- X#define SPSTART 04 /* Starts with * or +. */
- X#define WORST 0 /* Worst case. */
- X
- X/*
- X * Global work variables for regcomp().
- X */
- Xstatic char *regparse; /* Input-scan pointer. */
- Xstatic int regnpar; /* () count. */
- Xstatic char regdummy;
- Xstatic char *regcode; /* Code-emit pointer; ®dummy = don't. */
- Xstatic long regsize; /* Code size. */
- X
- X/*
- X * Forward declarations for regcomp()'s friends.
- X */
- X#ifndef STATIC
- X#define STATIC static
- X#endif
- XSTATIC char *reg();
- XSTATIC char *regbranch();
- XSTATIC char *regpiece();
- XSTATIC char *regatom();
- XSTATIC char *regnode();
- XSTATIC char *regnext();
- XSTATIC void regc();
- XSTATIC void reginsert();
- XSTATIC void regtail();
- XSTATIC void regoptail();
- X#ifdef STRCSPN
- XSTATIC int strcspn();
- X#endif
- X
- X/*
- X - regcomp - compile a regular expression into internal code
- X *
- X * We can't allocate space until we know how big the compiled form will be,
- X * but we can't compile it (and thus know how big it is) until we've got a
- X * place to put the code. So we cheat: we compile it twice, once with code
- X * generation turned off and size counting turned on, and once "for real".
- X * This also means that we don't allocate space until we are sure that the
- X * thing really will compile successfully, and we never have to move the
- X * code and thus invalidate pointers into it. (Note that it has to be in
- X * one piece because free() must be able to free it all.)
- X *
- X * Beware that the optimization-preparation code in here knows about some
- X * of the structure of the compiled regexp.
- X */
- Xregexp *
- Xregcomp(exp)
- Xchar *exp;
- X{
- X register regexp *r;
- X register char *scan;
- X register char *longest;
- X register int len;
- X int flags;
- X extern char *malloc();
- X
- X if (exp == NULL)
- X FAIL("NULL argument");
- X
- X /* First pass: determine size, legality. */
- X regparse = exp;
- X regnpar = 1;
- X regsize = 0L;
- X regcode = ®dummy;
- X regc(MAGIC);
- X if (reg(0, &flags) == NULL)
- X return(NULL);
- X
- X /* Small enough for pointer-storage convention? */
- X if (regsize >= 32767L) /* Probably could be 65535L. */
- X FAIL("regexp too big");
- X
- X /* Allocate space. */
- X r = (regexp *)malloc(sizeof(regexp) + (unsigned)regsize);
- X if (r == NULL)
- X FAIL("out of space");
- X
- X /* Second pass: emit code. */
- X regparse = exp;
- X regnpar = 1;
- X regcode = r->program;
- X regc(MAGIC);
- X if (reg(0, &flags) == NULL)
- X return(NULL);
- X
- X /* Dig out information for optimizations. */
- X r->regstart = '\0'; /* Worst-case defaults. */
- X r->reganch = 0;
- X r->regmust = NULL;
- X r->regmlen = 0;
- X scan = r->program+1; /* First BRANCH. */
- X if (OP(regnext(scan)) == END) { /* Only one top-level choice. */
- X scan = OPERAND(scan);
- X
- X /* Starting-point info. */
- X if (OP(scan) == EXACTLY)
- X r->regstart = *OPERAND(scan);
- X else if (OP(scan) == BOL)
- X r->reganch++;
- X
- X /*
- X * If there's something expensive in the r.e., find the
- X * longest literal string that must appear and make it the
- X * regmust. Resolve ties in favor of later strings, since
- X * the regstart check works with the beginning of the r.e.
- X * and avoiding duplication strengthens checking. Not a
- X * strong reason, but sufficient in the absence of others.
- X */
- X if (flags&SPSTART) {
- X longest = NULL;
- X len = 0;
- X for (; scan != NULL; scan = regnext(scan))
- X if (OP(scan) == EXACTLY && strlen(OPERAND(scan)) >= len) {
- X longest = OPERAND(scan);
- X len = strlen(OPERAND(scan));
- X }
- X r->regmust = longest;
- X r->regmlen = len;
- X }
- X }
- X
- X return(r);
- X}
- X
- X/*
- X - reg - regular expression, i.e. main body or parenthesized thing
- X *
- X * Caller must absorb opening parenthesis.
- X *
- X * Combining parenthesis handling with the base level of regular expression
- X * is a trifle forced, but the need to tie the tails of the branches to what
- X * follows makes it hard to avoid.
- X */
- Xstatic char *
- Xreg(paren, flagp)
- Xint paren; /* Parenthesized? */
- Xint *flagp;
- X{
- X register char *ret;
- X register char *br;
- X register char *ender;
- X register int parno;
- X int flags;
- X
- X *flagp = HASWIDTH; /* Tentatively. */
- X
- X /* Make an OPEN node, if parenthesized. */
- X if (paren) {
- X if (regnpar >= NSUBEXP)
- X FAIL("too many ()");
- X parno = regnpar;
- X regnpar++;
- X ret = regnode(OPEN+parno);
- X } else
- X ret = NULL;
- X
- X /* Pick up the branches, linking them together. */
- X br = regbranch(&flags);
- X if (br == NULL)
- X return(NULL);
- X if (ret != NULL)
- X regtail(ret, br); /* OPEN -> first. */
- X else
- X ret = br;
- X if (!(flags&HASWIDTH))
- X *flagp &= ~HASWIDTH;
- X *flagp |= flags&SPSTART;
- X while (*regparse == '|') {
- X regparse++;
- X br = regbranch(&flags);
- X if (br == NULL)
- X return(NULL);
- X regtail(ret, br); /* BRANCH -> BRANCH. */
- X if (!(flags&HASWIDTH))
- X *flagp &= ~HASWIDTH;
- X *flagp |= flags&SPSTART;
- X }
- X
- X /* Make a closing node, and hook it on the end. */
- X ender = regnode((paren) ? CLOSE+parno : END);
- X regtail(ret, ender);
- X
- X /* Hook the tails of the branches to the closing node. */
- X for (br = ret; br != NULL; br = regnext(br))
- X regoptail(br, ender);
- X
- X /* Check for proper termination. */
- X if (paren && *regparse++ != ')') {
- X FAIL("unmatched ()");
- X } else if (!paren && *regparse != '\0') {
- X if (*regparse == ')') {
- X FAIL("unmatched ()");
- X } else
- X FAIL("junk on end"); /* "Can't happen". */
- X /* NOTREACHED */
- X }
- X
- X return(ret);
- X}
- X
- X/*
- X - regbranch - one alternative of an | operator
- X *
- X * Implements the concatenation operator.
- X */
- Xstatic char *
- Xregbranch(flagp)
- Xint *flagp;
- X{
- X register char *ret;
- X register char *chain;
- X register char *latest;
- X int flags;
- X
- X *flagp = WORST; /* Tentatively. */
- X
- X ret = regnode(BRANCH);
- X chain = NULL;
- X while (*regparse != '\0' && *regparse != '|' && *regparse != ')') {
- X latest = regpiece(&flags);
- X if (latest == NULL)
- X return(NULL);
- X *flagp |= flags&HASWIDTH;
- X if (chain == NULL) /* First piece. */
- X *flagp |= flags&SPSTART;
- X else
- X regtail(chain, latest);
- X chain = latest;
- X }
- X if (chain == NULL) /* Loop ran zero times. */
- X (void) regnode(NOTHING);
- X
- X return(ret);
- X}
- X
- X/*
- X - regpiece - something followed by possible [*+?]
- X *
- X * Note that the branching code sequences used for ? and the general cases
- X * of * and + are somewhat optimized: they use the same NOTHING node as
- X * both the endmarker for their branch list and the body of the last branch.
- X * It might seem that this node could be dispensed with entirely, but the
- X * endmarker role is not redundant.
- X */
- Xstatic char *
- Xregpiece(flagp)
- Xint *flagp;
- X{
- X register char *ret;
- X register char op;
- X register char *next;
- X int flags;
- X
- X ret = regatom(&flags);
- X if (ret == NULL)
- X return(NULL);
- X
- X op = *regparse;
- X if (!ISMULT(op)) {
- X *flagp = flags;
- X return(ret);
- X }
- X
- X if (!(flags&HASWIDTH) && op != '?')
- X FAIL("*+ operand could be empty");
- X *flagp = (op != '+') ? (WORST|SPSTART) : (WORST|HASWIDTH);
- X
- X if (op == '*' && (flags&SIMPLE))
- X reginsert(STAR, ret);
- X else if (op == '*') {
- X /* Emit x* as (x&|), where & means "self". */
- X reginsert(BRANCH, ret); /* Either x */
- X regoptail(ret, regnode(BACK)); /* and loop */
- X regoptail(ret, ret); /* back */
- X regtail(ret, regnode(BRANCH)); /* or */
- X regtail(ret, regnode(NOTHING)); /* null. */
- X } else if (op == '+' && (flags&SIMPLE))
- X reginsert(PLUS, ret);
- X else if (op == '+') {
- X /* Emit x+ as x(&|), where & means "self". */
- X next = regnode(BRANCH); /* Either */
- X regtail(ret, next);
- X regtail(regnode(BACK), ret); /* loop back */
- X regtail(next, regnode(BRANCH)); /* or */
- X regtail(ret, regnode(NOTHING)); /* null. */
- X } else if (op == '?') {
- X /* Emit x? as (x|) */
- X reginsert(BRANCH, ret); /* Either x */
- X regtail(ret, regnode(BRANCH)); /* or */
- X next = regnode(NOTHING); /* null. */
- X regtail(ret, next);
- X regoptail(ret, next);
- X }
- X regparse++;
- X if (ISMULT(*regparse))
- X FAIL("nested *?+");
- X
- X return(ret);
- X}
- X
- X/*
- X - regatom - the lowest level
- X *
- X * Optimization: gobbles an entire sequence of ordinary characters so that
- X * it can turn them into a single node, which is smaller to store and
- X * faster to run. Backslashed characters are exceptions, each becoming a
- X * separate node; the code is simpler that way and it's not worth fixing.
- X */
- Xstatic char *
- Xregatom(flagp)
- Xint *flagp;
- X{
- X register char *ret;
- X int flags;
- X
- X *flagp = WORST; /* Tentatively. */
- X
- X switch (*regparse++) {
- X case '^':
- X ret = regnode(BOL);
- X break;
- X case '$':
- X ret = regnode(EOL);
- X break;
- X case '.':
- X ret = regnode(ANY);
- X *flagp |= HASWIDTH|SIMPLE;
- X break;
- X case '[': {
- X register int class;
- X register int classend;
- X
- X if (*regparse == '^') { /* Complement of range. */
- X ret = regnode(ANYBUT);
- X regparse++;
- X } else
- X ret = regnode(ANYOF);
- X if (*regparse == ']' || *regparse == '-')
- X regc(*regparse++);
- X while (*regparse != '\0' && *regparse != ']') {
- X if (*regparse == '-') {
- X regparse++;
- X if (*regparse == ']' || *regparse == '\0')
- X regc('-');
- X else {
- X class = UCHARAT(regparse-2)+1;
- X classend = UCHARAT(regparse);
- X if (class > classend+1)
- X FAIL("invalid [] range");
- X for (; class <= classend; class++)
- X regc(class);
- X regparse++;
- X }
- X } else
- X regc(*regparse++);
- X }
- X regc('\0');
- X if (*regparse != ']')
- X FAIL("unmatched []");
- X regparse++;
- X *flagp |= HASWIDTH|SIMPLE;
- X }
- X break;
- X case '(':
- X ret = reg(1, &flags);
- X if (ret == NULL)
- X return(NULL);
- X *flagp |= flags&(HASWIDTH|SPSTART);
- X break;
- X case '\0':
- X case '|':
- X case ')':
- X FAIL("internal urp"); /* Supposed to be caught earlier. */
- X break;
- X case '?':
- X case '+':
- X case '*':
- X FAIL("?+* follows nothing");
- X break;
- X case '\\':
- X if (*regparse == '\0')
- X FAIL("trailing \\");
- X ret = regnode(EXACTLY);
- X regc(*regparse++);
- X regc('\0');
- X *flagp |= HASWIDTH|SIMPLE;
- X break;
- X default: {
- X register int len;
- X register char ender;
- X
- X regparse--;
- X len = strcspn(regparse, META);
- X if (len <= 0)
- X FAIL("internal disaster");
- X ender = *(regparse+len);
- X if (len > 1 && ISMULT(ender))
- X len--; /* Back off clear of ?+* operand. */
- X *flagp |= HASWIDTH;
- X if (len == 1)
- X *flagp |= SIMPLE;
- X ret = regnode(EXACTLY);
- X while (len > 0) {
- X regc(*regparse++);
- X len--;
- X }
- X regc('\0');
- X }
- X break;
- X }
- X
- X return(ret);
- X}
- X
- X/*
- X - regnode - emit a node
- X */
- Xstatic char * /* Location. */
- Xregnode(op)
- Xchar op;
- X{
- X register char *ret;
- X register char *ptr;
- X
- X ret = regcode;
- X if (ret == ®dummy) {
- X regsize += 3;
- X return(ret);
- X }
- X
- X ptr = ret;
- X *ptr++ = op;
- X *ptr++ = '\0'; /* Null "next" pointer. */
- X *ptr++ = '\0';
- X regcode = ptr;
- X
- X return(ret);
- X}
- X
- X/*
- X - regc - emit (if appropriate) a byte of code
- X */
- Xstatic void
- Xregc(b)
- Xchar b;
- X{
- X if (regcode != ®dummy)
- X *regcode++ = b;
- X else
- X regsize++;
- X}
- X
- X/*
- X - reginsert - insert an operator in front of already-emitted operand
- X *
- X * Means relocating the operand.
- X */
- Xstatic void
- Xreginsert(op, opnd)
- Xchar op;
- Xchar *opnd;
- X{
- X register char *src;
- X register char *dst;
- X register char *place;
- X
- X if (regcode == ®dummy) {
- X regsize += 3;
- X return;
- X }
- X
- X src = regcode;
- X regcode += 3;
- X dst = regcode;
- X while (src > opnd)
- X *--dst = *--src;
- X
- X place = opnd; /* Op node, where operand used to be. */
- X *place++ = op;
- X *place++ = '\0';
- X *place++ = '\0';
- X}
- X
- X/*
- X - regtail - set the next-pointer at the end of a node chain
- X */
- Xstatic void
- Xregtail(p, val)
- Xchar *p;
- Xchar *val;
- X{
- X register char *scan;
- X register char *temp;
- X register int offset;
- X
- X if (p == ®dummy)
- X return;
- X
- X /* Find last node. */
- X scan = p;
- X for (;;) {
- X temp = regnext(scan);
- X if (temp == NULL)
- X break;
- X scan = temp;
- X }
- X
- X if (OP(scan) == BACK)
- X offset = scan - val;
- X else
- X offset = val - scan;
- X *(scan+1) = (offset>>8)&0377;
- X *(scan+2) = offset&0377;
- X}
- X
- X/*
- X - regoptail - regtail on operand of first argument; nop if operandless
- X */
- Xstatic void
- Xregoptail(p, val)
- Xchar *p;
- Xchar *val;
- X{
- X /* "Operandless" and "op != BRANCH" are synonymous in practice. */
- X if (p == NULL || p == ®dummy || OP(p) != BRANCH)
- X return;
- X regtail(OPERAND(p), val);
- X}
- X
- X/*
- X * regexec and friends
- X */
- X
- X/*
- X * Global work variables for regexec().
- X */
- Xstatic char *reginput; /* String-input pointer. */
- Xstatic char *regbol; /* Beginning of input, for ^ check. */
- Xstatic char **regstartp; /* Pointer to startp array. */
- Xstatic char **regendp; /* Ditto for endp. */
- X
- X/*
- X * Forwards.
- X */
- XSTATIC int regtry();
- XSTATIC int regmatch();
- XSTATIC int regrepeat();
- X
- X#ifdef DEBUG
- Xint regnarrate = 0;
- Xvoid regdump();
- XSTATIC char *regprop();
- X#endif
- X
- X/*
- X - regexec - match a regexp against a string
- X */
- Xint
- Xregexec(prog, string)
- Xregister regexp *prog;
- Xregister char *string;
- X{
- X register char *s;
- X extern char *strchr();
- X
- X /* Be paranoid... */
- X if (prog == NULL || string == NULL) {
- X regerror("NULL parameter");
- X return(0);
- X }
- X
- X /* Check validity of program. */
- X if (UCHARAT(prog->program) != MAGIC) {
- X regerror("corrupted program");
- X return(0);
- X }
- X
- X /* If there is a "must appear" string, look for it. */
- X if (prog->regmust != NULL) {
- X s = string;
- X while ((s = strchr(s, prog->regmust[0])) != NULL) {
- X if (strncmp(s, prog->regmust, prog->regmlen) == 0)
- X break; /* Found it. */
- X s++;
- X }
- X if (s == NULL) /* Not present. */
- X return(0);
- X }
- X
- X /* Mark beginning of line for ^ . */
- X regbol = string;
- X
- X /* Simplest case: anchored match need be tried only once. */
- X if (prog->reganch)
- X return(regtry(prog, string));
- X
- X /* Messy cases: unanchored match. */
- X s = string;
- X if (prog->regstart != '\0')
- X /* We know what char it must start with. */
- X while ((s = strchr(s, prog->regstart)) != NULL) {
- X if (regtry(prog, s))
- X return(1);
- X s++;
- X }
- X else
- X /* We don't -- general case. */
- X do {
- X if (regtry(prog, s))
- X return(1);
- X } while (*s++ != '\0');
- X
- X /* Failure. */
- X return(0);
- X}
- X
- X/*
- X - regtry - try match at specific point
- X */
- Xstatic int /* 0 failure, 1 success */
- Xregtry(prog, string)
- Xregexp *prog;
- Xchar *string;
- X{
- X register int i;
- X register char **sp;
- X register char **ep;
- X
- X reginput = string;
- X regstartp = prog->startp;
- X regendp = prog->endp;
- X
- X sp = prog->startp;
- X ep = prog->endp;
- X for (i = NSUBEXP; i > 0; i--) {
- X *sp++ = NULL;
- X *ep++ = NULL;
- X }
- X if (regmatch(prog->program + 1)) {
- X prog->startp[0] = string;
- X prog->endp[0] = reginput;
- X return(1);
- X } else
- X return(0);
- X}
- X
- X/*
- X - regmatch - main matching routine
- X *
- X * Conceptually the strategy is simple: check to see whether the current
- X * node matches, call self recursively to see whether the rest matches,
- X * and then act accordingly. In practice we make some effort to avoid
- X * recursion, in particular by going through "ordinary" nodes (that don't
- X * need to know whether the rest of the match failed) by a loop instead of
- X * by recursion.
- X */
- Xstatic int /* 0 failure, 1 success */
- Xregmatch(prog)
- Xchar *prog;
- X{
- X register char *scan; /* Current node. */
- X char *next; /* Next node. */
- X extern char *strchr();
- X
- X scan = prog;
- X#ifdef DEBUG
- X if (scan != NULL && regnarrate)
- X fprintf(stderr, "%s(\n", regprop(scan));
- X#endif
- X while (scan != NULL) {
- X#ifdef DEBUG
- X if (regnarrate)
- X fprintf(stderr, "%s...\n", regprop(scan));
- X#endif
- X next = regnext(scan);
- X
- X switch (OP(scan)) {
- X case BOL:
- X if (reginput != regbol)
- X return(0);
- X break;
- X case EOL:
- X if (*reginput != '\0')
- X return(0);
- X break;
- X case ANY:
- X if (*reginput == '\0')
- X return(0);
- X reginput++;
- X break;
- X case EXACTLY: {
- X register int len;
- X register char *opnd;
- X
- X opnd = OPERAND(scan);
- X /* Inline the first character, for speed. */
- X if (*opnd != *reginput)
- X return(0);
- X len = strlen(opnd);
- X if (len > 1 && strncmp(opnd, reginput, len) != 0)
- X return(0);
- X reginput += len;
- X }
- X break;
- X case ANYOF:
- X if (strchr(OPERAND(scan), *reginput) == NULL)
- X return(0);
- X reginput++;
- X break;
- X case ANYBUT:
- X if (strchr(OPERAND(scan), *reginput) != NULL)
- X return(0);
- X reginput++;
- X break;
- X case NOTHING:
- X break;
- X case BACK:
- X break;
- X case OPEN+1:
- X case OPEN+2:
- X case OPEN+3:
- X case OPEN+4:
- X case OPEN+5:
- X case OPEN+6:
- X case OPEN+7:
- X case OPEN+8:
- X case OPEN+9: {
- X register int no;
- X register char *save;
- X
- X no = OP(scan) - OPEN;
- X save = reginput;
- X
- X if (regmatch(next)) {
- X /*
- X * Don't set startp if some later
- X * invocation of the same parentheses
- X * already has.
- X */
- X if (regstartp[no] == NULL)
- X regstartp[no] = save;
- X return(1);
- X } else
- X return(0);
- X }
- X break;
- X case CLOSE+1:
- X case CLOSE+2:
- X case CLOSE+3:
- X case CLOSE+4:
- X case CLOSE+5:
- X case CLOSE+6:
- X case CLOSE+7:
- X case CLOSE+8:
- X case CLOSE+9: {
- X register int no;
- X register char *save;
- X
- X no = OP(scan) - CLOSE;
- X save = reginput;
- X
- X if (regmatch(next)) {
- X /*
- X * Don't set endp if some later
- X * invocation of the same parentheses
- X * already has.
- X */
- X if (regendp[no] == NULL)
- X regendp[no] = save;
- X return(1);
- X } else
- X return(0);
- X }
- X break;
- X case BRANCH: {
- X register char *save;
- X
- X if (OP(next) != BRANCH) /* No choice. */
- X next = OPERAND(scan); /* Avoid recursion. */
- X else {
- X do {
- X save = reginput;
- X if (regmatch(OPERAND(scan)))
- X return(1);
- X reginput = save;
- X scan = regnext(scan);
- X } while (scan != NULL && OP(scan) == BRANCH);
- X return(0);
- X /* NOTREACHED */
- X }
- X }
- X break;
- X case STAR:
- X case PLUS: {
- X register char nextch;
- X register int no;
- X register char *save;
- X register int min;
- X
- X /*
- X * Lookahead to avoid useless match attempts
- X * when we know what character comes next.
- X */
- X nextch = '\0';
- X if (OP(next) == EXACTLY)
- X nextch = *OPERAND(next);
- X min = (OP(scan) == STAR) ? 0 : 1;
- X save = reginput;
- X no = regrepeat(OPERAND(scan));
- X while (no >= min) {
- X /* If it could work, try it. */
- X if (nextch == '\0' || *reginput == nextch)
- X if (regmatch(next))
- X return(1);
- X /* Couldn't or didn't -- back up. */
- X no--;
- X reginput = save + no;
- X }
- X return(0);
- X }
- X break;
- X case END:
- X return(1); /* Success! */
- X break;
- X default:
- X regerror("memory corruption");
- X return(0);
- X break;
- X }
- X
- X scan = next;
- X }
- X
- X /*
- X * We get here only if there's trouble -- normally "case END" is
- X * the terminating point.
- X */
- X regerror("corrupted pointers");
- X return(0);
- X}
- X
- X/*
- X - regrepeat - repeatedly match something simple, report how many
- X */
- Xstatic int
- Xregrepeat(p)
- Xchar *p;
- X{
- X register int count = 0;
- X register char *scan;
- X register char *opnd;
- X
- X scan = reginput;
- X opnd = OPERAND(p);
- X switch (OP(p)) {
- X case ANY:
- X count = strlen(scan);
- X scan += count;
- X break;
- X case EXACTLY:
- X while (*opnd == *scan) {
- X count++;
- X scan++;
- X }
- X break;
- X case ANYOF:
- X while (*scan != '\0' && strchr(opnd, *scan) != NULL) {
- X count++;
- X scan++;
- X }
- X break;
- X case ANYBUT:
- X while (*scan != '\0' && strchr(opnd, *scan) == NULL) {
- X count++;
- X scan++;
- X }
- X break;
- X default: /* Oh dear. Called inappropriately. */
- X regerror("internal foulup");
- X count = 0; /* Best compromise. */
- X break;
- X }
- X reginput = scan;
- X
- X return(count);
- X}
- X
- X/*
- X - regnext - dig the "next" pointer out of a node
- X */
- Xstatic char *
- Xregnext(p)
- Xregister char *p;
- X{
- X register int offset;
- X
- X if (p == ®dummy)
- X return(NULL);
- X
- X offset = NEXT(p);
- X if (offset == 0)
- X return(NULL);
- X
- X if (OP(p) == BACK)
- X return(p-offset);
- X else
- X return(p+offset);
- X}
- X
- X#ifdef DEBUG
- X
- XSTATIC char *regprop();
- X
- X/*
- X - regdump - dump a regexp onto stdout in vaguely comprehensible form
- X */
- Xvoid
- Xregdump(r)
- Xregexp *r;
- X{
- X register char *s;
- X register char op = EXACTLY; /* Arbitrary non-END op. */
- X register char *next;
- X extern char *strchr();
- X
- X
- X s = r->program + 1;
- X while (op != END) { /* While that wasn't END last time... */
- X op = OP(s);
- X printf("%2d%s", s-r->program, regprop(s)); /* Where, what. */
- X next = regnext(s);
- X if (next == NULL) /* Next ptr. */
- X printf("(0)");
- X else
- X printf("(%d)", (s-r->program)+(next-s));
- X s += 3;
- X if (op == ANYOF || op == ANYBUT || op == EXACTLY) {
- X /* Literal string, where present. */
- X while (*s != '\0') {
- X putchar(*s);
- X s++;
- X }
- X s++;
- X }
- X putchar('\n');
- X }
- X
- X /* Header fields of interest. */
- X if (r->regstart != '\0')
- X printf("start `%c' ", r->regstart);
- X if (r->reganch)
- X printf("anchored ");
- X if (r->regmust != NULL)
- X printf("must have \"%s\"", r->regmust);
- X printf("\n");
- X}
- X
- X/*
- X - regprop - printable representation of opcode
- X */
- Xstatic char *
- Xregprop(op)
- Xchar *op;
- X{
- X register char *p;
- X static char buf[50];
- X
- X (void) strcpy(buf, ":");
- X
- X switch (OP(op)) {
- X case BOL:
- X p = "BOL";
- X break;
- X case EOL:
- X p = "EOL";
- X break;
- X case ANY:
- X p = "ANY";
- X break;
- X case ANYOF:
- X p = "ANYOF";
- X break;
- X case ANYBUT:
- X p = "ANYBUT";
- X break;
- X case BRANCH:
- X p = "BRANCH";
- X break;
- X case EXACTLY:
- X p = "EXACTLY";
- X break;
- X case NOTHING:
- X p = "NOTHING";
- X break;
- X case BACK:
- X p = "BACK";
- X break;
- X case END:
- X p = "END";
- X break;
- X case OPEN+1:
- X case OPEN+2:
- X case OPEN+3:
- X case OPEN+4:
- X case OPEN+5:
- X case OPEN+6:
- X case OPEN+7:
- X case OPEN+8:
- X case OPEN+9:
- X sprintf(buf+strlen(buf), "OPEN%d", OP(op)-OPEN);
- X p = NULL;
- X break;
- X case CLOSE+1:
- X case CLOSE+2:
- X case CLOSE+3:
- X case CLOSE+4:
- X case CLOSE+5:
- X case CLOSE+6:
- X case CLOSE+7:
- X case CLOSE+8:
- X case CLOSE+9:
- X sprintf(buf+strlen(buf), "CLOSE%d", OP(op)-CLOSE);
- X p = NULL;
- X break;
- X case STAR:
- X p = "STAR";
- X break;
- X case PLUS:
- X p = "PLUS";
- X break;
- X default:
- X regerror("corrupted opcode");
- X break;
- X }
- X if (p != NULL)
- X (void) strcat(buf, p);
- X return(buf);
- X}
- X#endif
- X
- X/*
- X * The following is provided for those people who do not have strcspn() in
- X * their C libraries. They should get off their butts and do something
- X * about it; at least one public-domain implementation of those (highly
- X * useful) string routines has been published on Usenet.
- X */
- X#ifdef STRCSPN
- X/*
- X * strcspn - find length of initial segment of s1 consisting entirely
- X * of characters not from s2
- X */
- X
- Xstatic int
- Xstrcspn(s1, s2)
- Xchar *s1;
- Xchar *s2;
- X{
- X register char *scan1;
- X register char *scan2;
- X register int count;
- X
- X count = 0;
- X for (scan1 = s1; *scan1 != '\0'; scan1++) {
- X for (scan2 = s2; *scan2 != '\0';) /* ++ moved down. */
- X if (*scan1 == *scan2++)
- X return(count);
- X count++;
- X }
- X return(count);
- X}
- X#endif
- !
- echo regsub.c:
- sed 's/^X//' >regsub.c <<'!'
- X/*
- X * regsub
- X *
- X * Copyright (c) 1986 by University of Toronto.
- X * Written by Henry Spencer. Not derived from licensed software.
- X *
- X * Permission is granted to anyone to use this software for any
- X * purpose on any computer system, and to redistribute it freely,
- X * subject to the following restrictions:
- X *
- X * 1. The author is not responsible for the consequences of use of
- X * this software, no matter how awful, even if they arise
- X * from defects in it.
- X *
- X * 2. The origin of this software must not be misrepresented, either
- X * by explicit claim or by omission.
- X *
- X * 3. Altered versions must be plainly marked as such, and must not
- X * be misrepresented as being the original software.
- X */
- X#include <stdio.h>
- X#include <regexp.h>
- X#include "regmagic.h"
- X
- X#ifndef CHARBITS
- X#define UCHARAT(p) ((int)*(unsigned char *)(p))
- X#else
- X#define UCHARAT(p) ((int)*(p)&CHARBITS)
- X#endif
- X
- X/*
- X - regsub - perform substitutions after a regexp match
- X */
- Xvoid
- Xregsub(prog, source, dest)
- Xregexp *prog;
- Xchar *source;
- Xchar *dest;
- X{
- X register char *src;
- X register char *dst;
- X register char c;
- X register int no;
- X register int len;
- X extern char *strncpy();
- X
- X if (prog == NULL || source == NULL || dest == NULL) {
- X regerror("NULL parm to regsub");
- X return;
- X }
- X if (UCHARAT(prog->program) != MAGIC) {
- X regerror("damaged regexp fed to regsub");
- X return;
- X }
- X
- X src = source;
- X dst = dest;
- X while ((c = *src++) != '\0') {
- X if (c == '&')
- X no = 0;
- X else if (c == '\\' && '0' <= *src && *src <= '9')
- X no = *src++ - '0';
- X else
- X no = -1;
- X
- X if (no < 0) /* Ordinary character. */
- X *dst++ = c;
- X else if (prog->startp[no] != NULL && prog->endp[no] != NULL) {
- X len = prog->endp[no] - prog->startp[no];
- X (void) strncpy(dst, prog->startp[no], len);
- X dst += len;
- X if (*(dst-1) == '\0') { /* strncpy hit NUL. */
- X regerror("damaged match string");
- X return;
- X }
- X }
- X }
- X *dst++ = '\0';
- X}
- !
- echo regerror.c:
- sed 's/^X//' >regerror.c <<'!'
- X#include <stdio.h>
- X
- Xvoid
- Xregerror(s)
- Xchar *s;
- X{
- X#ifdef ERRAVAIL
- X error("regexp: %s", s);
- X#else
- X fprintf(stderr, "regexp(3): %s", s);
- X exit(1);
- X#endif
- X /* NOTREACHED */
- X}
- !
- echo regmagic.h:
- sed 's/^X//' >regmagic.h <<'!'
- X/*
- X * The first byte of the regexp internal "program" is actually this magic
- X * number; the start node begins in the second byte.
- X */
- X#define MAGIC 0234
- !
- echo try.c:
- sed 's/^X//' >try.c <<'!'
- X/*
- X * Simple test program for regexp(3) stuff. Knows about debugging hooks.
- X *
- X * Copyright (c) 1986 by University of Toronto.
- X * Written by Henry Spencer. Not derived from licensed software.
- X *
- X * Permission is granted to anyone to use this software for any
- X * purpose on any computer system, and to redistribute it freely,
- X * subject to the following restrictions:
- X *
- X * 1. The author is not responsible for the consequences of use of
- X * this software, no matter how awful, even if they arise
- X * from defects in it.
- X *
- X * 2. The origin of this software must not be misrepresented, either
- X * by explicit claim or by omission.
- X *
- X * 3. Altered versions must be plainly marked as such, and must not
- X * be misrepresented as being the original software.
- X *
- X * Usage: try re [string [output [-]]]
- X * The re is compiled and dumped, regexeced against the string, the result
- X * is applied to output using regsub(). The - triggers a running narrative
- X * from regexec(). Dumping and narrative don't happen unless DEBUG.
- X *
- X * If there are no arguments, stdin is assumed to be a stream of lines with
- X * five fields: a r.e., a string to match it against, a result code, a
- X * source string for regsub, and the proper result. Result codes are 'c'
- X * for compile failure, 'y' for match success, 'n' for match failure.
- X * Field separator is tab.
- X */
- X#include <stdio.h>
- X#include <regexp.h>
- X
- X#ifdef ERRAVAIL
- Xchar *progname;
- Xextern char *mkprogname();
- X#endif
- X
- X#ifdef DEBUG
- Xextern int regnarrate;
- X#endif
- X
- Xchar buf[BUFSIZ];
- X
- Xint errreport = 0; /* Report errors via errseen? */
- Xchar *errseen = NULL; /* Error message. */
- Xint status = 0; /* Exit status. */
- X
- X/* ARGSUSED */
- Xmain(argc, argv)
- Xint argc;
- Xchar *argv[];
- X{
- X regexp *r;
- X int i;
- X
- X#ifdef ERRAVAIL
- X progname = mkprogname(argv[0]);
- X#endif
- X
- X if (argc == 1) {
- X multiple();
- X exit(status);
- X }
- X
- X r = regcomp(argv[1]);
- X if (r == NULL)
- X error("regcomp failure", "");
- X#ifdef DEBUG
- X regdump(r);
- X if (argc > 4)
- X regnarrate++;
- X#endif
- X if (argc > 2) {
- X i = regexec(r, argv[2]);
- X printf("%d", i);
- X for (i = 1; i < NSUBEXP; i++)
- X if (r->startp[i] != NULL && r->endp[i] != NULL)
- X printf(" \\%d", i);
- X printf("\n");
- X }
- X if (argc > 3) {
- X regsub(r, argv[3], buf);
- X printf("%s\n", buf);
- X }
- X exit(status);
- X}
- X
- Xvoid
- Xregerror(s)
- Xchar *s;
- X{
- X if (errreport)
- X errseen = s;
- X else
- X error(s, "");
- X}
- X
- X#ifndef ERRAVAIL
- Xerror(s1, s2)
- Xchar *s1;
- Xchar *s2;
- X{
- X fprintf(stderr, "regexp: ");
- X fprintf(stderr, s1, s2);
- X fprintf(stderr, "\n");
- X exit(1);
- X}
- X#endif
- X
- Xint lineno;
- X
- Xregexp badregexp; /* Implicit init to 0. */
- X
- Xmultiple()
- X{
- X char rbuf[BUFSIZ];
- X char *field[5];
- X char *scan;
- X int i;
- X regexp *r;
- X extern char *strchr();
- X
- X errreport = 1;
- X lineno = 0;
- X while (fgets(rbuf, sizeof(rbuf), stdin) != NULL) {
- X rbuf[strlen(rbuf)-1] = '\0'; /* Dispense with \n. */
- X lineno++;
- X scan = rbuf;
- X for (i = 0; i < 5; i++) {
- X field[i] = scan;
- X if (field[i] == NULL) {
- X complain("bad testfile format", "");
- X exit(1);
- X }
- X scan = strchr(scan, '\t');
- X if (scan != NULL)
- X *scan++ = '\0';
- X }
- X try(field);
- X }
- X
- X /* And finish up with some internal testing... */
- X lineno = 9990;
- X errseen = NULL;
- X if (regcomp((char *)NULL) != NULL || errseen == NULL)
- X complain("regcomp(NULL) doesn't complain", "");
- X lineno = 9991;
- X errseen = NULL;
- X if (regexec((regexp *)NULL, "foo") || errseen == NULL)
- X complain("regexec(NULL, ...) doesn't complain", "");
- X lineno = 9992;
- X r = regcomp("foo");
- X if (r == NULL) {
- X complain("regcomp(\"foo\") fails", "");
- X return;
- X }
- X lineno = 9993;
- X errseen = NULL;
- X if (regexec(r, (char *)NULL) || errseen == NULL)
- X complain("regexec(..., NULL) doesn't complain", "");
- X lineno = 9994;
- X errseen = NULL;
- X regsub((regexp *)NULL, "foo", rbuf);
- X if (errseen == NULL)
- X complain("regsub(NULL, ..., ...) doesn't complain", "");
- X lineno = 9995;
- X errseen = NULL;
- X regsub(r, (char *)NULL, rbuf);
- X if (errseen == NULL)
- X complain("regsub(..., NULL, ...) doesn't complain", "");
- X lineno = 9996;
- X errseen = NULL;
- X regsub(r, "foo", (char *)NULL);
- X if (errseen == NULL)
- X complain("regsub(..., ..., NULL) doesn't complain", "");
- X lineno = 9997;
- X errseen = NULL;
- X if (regexec(&badregexp, "foo") || errseen == NULL)
- X complain("regexec(nonsense, ...) doesn't complain", "");
- X lineno = 9998;
- X errseen = NULL;
- X regsub(&badregexp, "foo", rbuf);
- X if (errseen == NULL)
- X complain("regsub(nonsense, ..., ...) doesn't complain", "");
- X}
- X
- Xtry(fields)
- Xchar **fields;
- X{
- X regexp *r;
- X char dbuf[BUFSIZ];
- X
- X errseen = NULL;
- X r = regcomp(fields[0]);
- X if (r == NULL) {
- X if (*fields[2] != 'c')
- X complain("regcomp failure in `%s'", fields[0]);
- X return;
- X }
- X if (*fields[2] == 'c') {
- X complain("unexpected regcomp success in `%s'", fields[0]);
- X free((char *)r);
- X return;
- X }
- X if (!regexec(r, fields[1])) {
- X if (*fields[2] != 'n')
- X complain("regexec failure in `%s'", "");
- X free((char *)r);
- X return;
- X }
- X if (*fields[2] == 'n') {
- X complain("unexpected regexec success", "");
- X free((char *)r);
- X return;
- X }
- X errseen = NULL;
- X regsub(r, fields[3], dbuf);
- X if (errseen != NULL) {
- X complain("regsub complaint", "");
- X free((char *)r);
- X return;
- X }
- X if (strcmp(dbuf, fields[4]) != 0)
- X complain("regsub result `%s' wrong", dbuf);
- X free((char *)r);
- X}
- X
- Xcomplain(s1, s2)
- Xchar *s1;
- Xchar *s2;
- X{
- X fprintf(stderr, "try: %d: ", lineno);
- X fprintf(stderr, s1, s2);
- X fprintf(stderr, " (%s)\n", (errseen != NULL) ? errseen : "");
- X status = 1;
- X}
- !
- echo timer.c:
- sed 's/^X//' >timer.c <<'!'
- X/*
- X * Simple timing program for regcomp().
- X *
- X * Copyright (c) 1986 by University of Toronto.
- X * Written by Henry Spencer. Not derived from licensed software.
- X *
- X * Permission is granted to anyone to use this software for any
- X * purpose on any computer system, and to redistribute it freely,
- X * subject to the following restrictions:
- X *
- X * 1. The author is not responsible for the consequences of use of
- X * this software, no matter how awful, even if they arise
- X * from defects in it.
- X *
- X * 2. The origin of this software must not be misrepresented, either
- X * by explicit claim or by omission.
- X *
- X * 3. Altered versions must be plainly marked as such, and must not
- X * be misrepresented as being the original software.
- X *
- X * Usage: timer ncomp nexec nsub
- X * or
- X * timer ncomp nexec nsub regexp string [ answer [ sub ] ]
- X *
- X * The second form is for timing repetitions of a single test case.
- X * The first form's test data is a compiled-in copy of the "tests" file.
- X * Ncomp, nexec, nsub are how many times to do each regcomp, regexec,
- X * and regsub. The way to time an operation individually is to do something
- X * like "timer 1 50 1".
- X */
- X#include <stdio.h>
- X
- Xstruct try {
- X char *re, *str, *ans, *src, *dst;
- X} tests[] = {
- X#include "timer.t.h"
- X{ NULL, NULL, NULL, NULL, NULL }
- X};
- X
- X#include <regexp.h>
- X
- Xint errreport = 0; /* Report errors via errseen? */
- Xchar *errseen = NULL; /* Error message. */
- X
- Xchar *progname;
- X
- X/* ARGSUSED */
- Xmain(argc, argv)
- Xint argc;
- Xchar *argv[];
- X{
- X int ncomp, nexec, nsub;
- X struct try one;
- X char dummy[512];
- X
- X if (argc < 4) {
- X ncomp = 1;
- X nexec = 1;
- X nsub = 1;
- X } else {
- X ncomp = atoi(argv[1]);
- X nexec = atoi(argv[2]);
- X nsub = atoi(argv[3]);
- X }
- X
- X progname = argv[0];
- X if (argc > 5) {
- X one.re = argv[4];
- X one.str = argv[5];
- X if (argc > 6)
- X one.ans = argv[6];
- X else
- X one.ans = "y";
- X if (argc > 7) {
- X one.src = argv[7];
- X one.dst = "xxx";
- X } else {
- X one.src = "x";
- X one.dst = "x";
- X }
- X errreport = 1;
- X try(one, ncomp, nexec, nsub);
- X } else
- X multiple(ncomp, nexec, nsub);
- X exit(0);
- X}
- X
- Xvoid
- Xregerror(s)
- Xchar *s;
- X{
- X if (errreport)
- X errseen = s;
- X else
- X error(s, "");
- X}
- X
- X#ifndef ERRAVAIL
- Xerror(s1, s2)
- Xchar *s1;
- Xchar *s2;
- X{
- X fprintf(stderr, "regexp: ");
- X fprintf(stderr, s1, s2);
- X fprintf(stderr, "\n");
- X exit(1);
- X}
- X#endif
- X
- Xint lineno = 0;
- X
- Xmultiple(ncomp, nexec, nsub)
- Xint ncomp, nexec, nsub;
- X{
- X register int i;
- X extern char *strchr();
- X
- X errreport = 1;
- X for (i = 0; tests[i].re != NULL; i++) {
- X lineno++;
- X try(tests[i], ncomp, nexec, nsub);
- X }
- X}
- X
- Xtry(fields, ncomp, nexec, nsub)
- Xstruct try fields;
- Xint ncomp, nexec, nsub;
- X{
- X regexp *r;
- X char dbuf[BUFSIZ];
- X register int i;
- X
- X errseen = NULL;
- X r = regcomp(fields.re);
- X if (r == NULL) {
- X if (*fields.ans != 'c')
- X complain("regcomp failure in `%s'", fields.re);
- X return;
- X }
- X if (*fields.ans == 'c') {
- X complain("unexpected regcomp success in `%s'", fields.re);
- X free((char *)r);
- X return;
- X }
- X for (i = ncomp-1; i > 0; i--) {
- X free((char *)r);
- X r = regcomp(fields.re);
- X }
- X if (!regexec(r, fields.str)) {
- X if (*fields.ans != 'n')
- X complain("regexec failure in `%s'", "");
- X free((char *)r);
- X return;
- X }
- X if (*fields.ans == 'n') {
- X complain("unexpected regexec success", "");
- X free((char *)r);
- X return;
- X }
- X for (i = nexec-1; i > 0; i--)
- X (void) regexec(r, fields.str);
- X errseen = NULL;
- X for (i = nsub; i > 0; i--)
- X regsub(r, fields.src, dbuf);
- X if (errseen != NULL) {
- X complain("regsub complaint", "");
- X free((char *)r);
- X return;
- X }
- X if (strcmp(dbuf, fields.dst) != 0)
- X complain("regsub result `%s' wrong", dbuf);
- X free((char *)r);
- X}
- X
- Xcomplain(s1, s2)
- Xchar *s1;
- Xchar *s2;
- X{
- X fprintf(stderr, "try: %d: ", lineno);
- X fprintf(stderr, s1, s2);
- X fprintf(stderr, " (%s)\n", (errseen != NULL) ? errseen : "");
- X}
- !
- echo tests:
- sed 's/^X//' >tests <<'!'
- Xabc abc y & abc
- Xabc xbc n - -
- Xabc axc n - -
- Xabc abx n - -
- Xabc xabcy y & abc
- Xabc ababc y & abc
- Xab*c abc y & abc
- Xab*bc abc y & abc
- Xab*bc abbc y & abbc
- Xab*bc abbbbc y & abbbbc
- Xab+bc abbc y & abbc
- Xab+bc abc n - -
- Xab+bc abq n - -
- Xab+bc abbbbc y & abbbbc
- Xab?bc abbc y & abbc
- Xab?bc abc y & abc
- Xab?bc abbbbc n - -
- Xab?c abc y & abc
- X^abc$ abc y & abc
- X^abc$ abcc n - -
- X^abc abcc y & abc
- X^abc$ aabc n - -
- Xabc$ aabc y & abc
- X^ abc y &
- X$ abc y &
- Xa.c abc y & abc
- Xa.c axc y & axc
- Xa.*c axyzc y & axyzc
- Xa.*c axyzd n - -
- Xa[bc]d abc n - -
- Xa[bc]d abd y & abd
- Xa[b-d]e abd n - -
- Xa[b-d]e ace y & ace
- Xa[b-d] aac y & ac
- Xa[-b] a- y & a-
- Xa[b-] a- y & a-
- Xa[b-a] - c - -
- Xa[]b - c - -
- Xa[ - c - -
- Xa] a] y & a]
- Xa[]]b a]b y & a]b
- Xa[^bc]d aed y & aed
- Xa[^bc]d abd n - -
- Xa[^-b]c adc y & adc
- Xa[^-b]c a-c n - -
- Xa[^]b]c a]c n - -
- Xa[^]b]c adc y & adc
- Xab|cd abc y & ab
- Xab|cd abcd y & ab
- X()ef def y &-\1 ef-
- X()* - c - -
- X*a - c - -
- X^* - c - -
- X$* - c - -
- X(*)b - c - -
- X$b b n - -
- Xa\ - c - -
- Xa\(b a(b y &-\1 a(b-
- Xa\(*b ab y & ab
- Xa\(*b a((b y & a((b
- Xa\\b a\b y & a\b
- Xabc) - c - -
- X(abc - c - -
- X((a)) abc y &-\1-\2 a-a-a
- X(a)b(c) abc y &-\1-\2 abc-a-c
- Xa+b+c aabbabc y & abc
- Xa** - c - -
- Xa*? - c - -
- X(a*)* - c - -
- X(a*)+ - c - -
- X(a|)* - c - -
- X(a*|b)* - c - -
- X(a+|b)* ab y &-\1 ab-b
- X(a+|b)+ ab y &-\1 ab-b
- X(a+|b)? ab y &-\1 a-a
- X[^ab]* cde y & cde
- X(^)* - c - -
- X(ab|)* - c - -
- X)( - c - -
- X abc y &
- Xabc n - -
- Xa* y &
- X([abc])*d abbbcd y &-\1 abbbcd-c
- X([abc])*bcd abcd y &-\1 abcd-a
- Xa|b|c|d|e e y & e
- X(a|b|c|d|e)f ef y &-\1 ef-e
- X((a*|b))* - c - -
- Xabcd*efg abcdefg y & abcdefg
- Xab* xabyabbbz y & ab
- Xab* xayabbbz y & a
- X(ab|cd)e abcde y &-\1 cde-cd
- X[abhgefdc]ij hij y & hij
- X^(ab|cd)e abcde n x\1y xy
- X(abc|)ef abcdef y &-\1 ef-
- X(a|b)c*d abcd y &-\1 bcd-b
- X(ab|ab*)bc abc y &-\1 abc-a
- Xa([bc]*)c* abc y &-\1 abc-bc
- Xa([bc]*)(c*d) abcd y &-\1-\2 abcd-bc-d
- Xa([bc]+)(c*d) abcd y &-\1-\2 abcd-bc-d
- Xa([bc]*)(c+d) abcd y &-\1-\2 abcd-b-cd
- Xa[bcd]*dcdcde adcdcde y & adcdcde
- Xa[bcd]+dcdcde adcdcde n - -
- X(ab|a)b*c abc y &-\1 abc-ab
- X((a)(b)c)(d) abcd y \1-\2-\3-\4 abc-a-b-d
- X[a-zA-Z_][a-zA-Z0-9_]* alpha y & alpha
- X^a(bc+|b[eh])g|.h$ abh y &-\1 bh-
- X(bc+d$|ef*g.|h?i(j|k)) effgz y &-\1-\2 effgz-effgz-
- X(bc+d$|ef*g.|h?i(j|k)) ij y &-\1-\2 ij-ij-j
- X(bc+d$|ef*g.|h?i(j|k)) effg n - -
- X(bc+d$|ef*g.|h?i(j|k)) bcdd n - -
- X(bc+d$|ef*g.|h?i(j|k)) reffgz y &-\1-\2 effgz-effgz-
- X((((((((((a)))))))))) - c - -
- X(((((((((a))))))))) a y & a
- Xmultiple words of text uh-uh n - -
- Xmultiple words multiple words, yeah y & multiple words
- X(.*)c(.*) abcde y &-\1-\2 abcde-ab-de
- X\((.*), (.*)\) (a, b) y (\2, \1) (b, a)
- !
- echo done
- exit 0
-
-