- regex.library - v1.0
- An Amiga Shared Library of the
- GNU Regular Expression Package
-
- Ported by Edwin Hoogerbeets 24/07/89
-
- This collection of files may be copied and distributed under the GNU
- General Public License. See the comment at the top of regex.c for details.
-
- Adapted from Elib by Jim Mackraz, mklib by Edwin Hoogerbeets, and the
- GNU regular expression package by the Free Software Foundation.
-
-
- A General View of How it is Used:
-
- A regular expression is a concise method of describing a pattern of
- characters in a string. By use of special wildcards, almost any pattern
- can be described. A regular expression pattern can be used for searching
- strings in such programs as editors or other string handling programs.
-
- A regular expression pattern must first be compiled into a form more
- easily understood by the matching routines. The compiled form is stored
- in a buffer structure called `struct re_pattern_buffer.' The buffer must
- first be initialized to allocate memory or resources. The pattern is
- compiled into this buffer. Strings can then be matched against the
- compiled regular expression as many times as desired. When the matching
- is done, the buffer is terminated, and the program can exit.
-
- There are two parts to the source: the linkable libraries and the Amiga
- shared library routines. The linkable libraries contain the
- non-re-entrant routines and the glue code that allows access to the
- shared library routines. The shared library contains routines that
- compile and match regular expressions.
-
- To use the library, copy regex.library to your libs: directory and simply
- execute a program that uses the library, such as tinygrep.
-
-
- GNU Regular Expressions:
-
- The following table details the various special characters understood in
- each of the grep and egrep style regular expressions:
-
-   (grep)    (egrep)   (explanation)
-   .         .         matches any single character except newline
-   \?        ?         postfix operator; preceding item is optional
-   *         *         postfix operator; preceding item 0 or more times
-   \+        +         postfix operator; preceding item 1 or more times
-   \|        |         infix operator; matches either argument
-   ^         ^         matches the empty string at the beginning of a line
-   $         $         matches the empty string at the end of a line
-   \<        \<        matches the empty string at the beginning of a word
-   \>        \>        matches the empty string at the end of a word
-   [chars]   [chars]   match any character in the given class; if the
-                       first character after [ is ^, match any character
-                       not in the given class; a range of characters may
-                       be specified by <first>-<last>; for example, \W
-                       (below) is equivalent to the class [^A-Za-z0-9]
-   \( \)     ( )       parentheses are used for grouping and to override
-                       operator precedence
-   \<1-9>    \<1-9>    \<n> matches a repeat of the text matched earlier
-                       in the regexp by the subexpression inside the
-                       nth opening parenthesis
-   \         \         any special character may be preceded by a backslash
-                       to match it literally
-
- Operator precedence is (highest to lowest) ?, *, and +, concatenation,
- and finally |. All other constructs are syntactically identical to
- normal characters.
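-
- The difference between the grep and egrep columns can be tried out on
- any system that has the POSIX regcomp() interface. The sketch below is
- only an illustration: it uses the POSIX <regex.h>, not the regex.h of
- this package, and the helper name matches() is made up for the example.
- POSIX basic syntax stands in for the grep column and REG_EXTENDED for
- the egrep column.
-
```c
#include <regex.h>   /* POSIX regex API, NOT this package's regex.h */
#include <stddef.h>

/*
 * Hypothetical helper: return 1 if `text' matches `pattern', 0 if it
 * does not, and -1 if the pattern fails to compile.  A non-zero
 * `extended' selects egrep style syntax, zero selects grep style.
 */
int matches(const char *pattern, const char *text, int extended)
{
    regex_t re;
    int flags = REG_NOSUB | (extended ? REG_EXTENDED : 0);
    int result;

    if (regcomp(&re, pattern, flags) != 0)
        return -1;                      /* e.g. an unmatched \( */

    result = (regexec(&re, text, 0, NULL, 0) == 0) ? 1 : 0;
    regfree(&re);
    return result;
}
```
-
- With this helper, the grep style pattern ^\(ab\)*c$ and the egrep style
- pattern ^(ab)*c$ both match the string ababc. The egrep style pattern
- compiled as grep style does not, because ( and ) are then literal
- characters.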
-
-
- Writing a C Program That Uses Regular Expressions:
-
- To write a program that uses the library, include the header file regex.h
- at the top of your source. This declares the data structures and function
- return types for you.
-
- You must do an OpenLibrary() call on regex.library and assign the pointer
- obtained to the external variable RegexBase. The pointer RegexBase is
- then used to find functions within regex.library, and thus RegexBase must
- be valid before using any of these library routines. A RegexBase variable
- is already provided in regex.lib. When linking, give the -lregex flag to
- include regex.lib (the linkable library code).
-
- To use the routines, first declare a struct re_pattern_buffer variable
- and call re_initialize_buffer() with a pointer to this buffer. (Specific
- details of the regex functions are listed below.)
-
- Then, determine a regular expression you wish to compile, perhaps from
- user input. Call the function re_compile_pattern() with a pointer to your
- buffer and the string you wish to compile. Now the buffer will contain
- the compiled regular expression ready for matching.
-
- Next, you can search for your pattern in any given text by calling
- re_search() with the compiled buffer and the string you wish to search
- on. This will locate the regular expression anywhere in the string you
- passed to it, within the bounds specified.
-
- If you are looking for an exact match, however, re_match() is the
- function you want. It returns true when the regular expression matches
- the string starting at the character specified.
-
- When you are done with the buffer, you must call re_terminate_buffer()
- to reclaim all memory and resources used by the library.
-
- Two programs, tester.c and tinygrep.c, are included in the distribution
- as simple examples of programming with the library.
-
- Tester allows you to enter grep style regular expressions and match them
- against a string.
-
- Tinygrep is a small implementation of the popular grep program that uses
- the regex library to search for patterns in text files. However, it is
- not overall as fast as GNU grep, or even Manx grep. This is because these
- other programs handle their slowest part (input) much better. To make
- tinygrep faster, the regular expression searching could be performed
- directly on the input buffer.
-
-
- Assembler Support:
-
- If you are writing in assembler instead of C, the registers expected for
- function parameters are listed along with the function descriptions
- below.
-
- The sequence of calls to the functions in regex.library described for C
- still applies. However, instead of using the glue code to call the library,
- you should call the regex.library functions directly following this
- example:
-
- ; assembler example of calling re_terminate_buffer()
- ;
-
- ; define the library offsets
- include 'regex.i'
-
- ; setup arguments in appropriate registers here
- ; d0 is where the buffer pointer parameter should go
-
- move.l bufp,d0
-
- ; get the address of the library and jump to the appropriate point
-
- move.l _RegexBase,a6
- jsr _LVOre_terminate_buffer(a6)
-
- ; d0 should now contain the result
-
- To use different functions, replace the re_terminate_buffer part of the
- jsr line with the function name you wish to call. The _LVO with the
- function name is expanded to a number which is the offset from register
- a6 where the address of the function you are calling can be found.
-
-
- Functions in regex.library:
-
- This is a more detailed description of each of the functions and
- variables offered by the regex package. These functions are available
- from C by linking with the regex.lib.
-
- Regex offers the following entry points:
-
- D0 D0 D1
- char *re_initialize_buffer(bufp,table)
- struct re_pattern_buffer *bufp;
- char *table;
-
- This function is used to initialize a pattern buffer `bufp' that is
- used to compile regular expressions. Declare a variable of type
- `struct re_pattern_buffer' on the stack or dynamically
- allocate room for it, and pass a pointer to the new memory to
- re_initialize_buffer(). The fields of the buffer are filled in for
- you.
-
- The `table' parameter is a pointer to a translation table used to
- equate characters during matching. When a character is matched, it is
- used as an index into this table to find the resulting character. One
- use for this might be to translate all vowels to the character @, so
- that @ can be used in a regular expression to match any vowel. If the
- table parameter is NULL, no translation is performed on the
- characters, and each character is matched literally. (See the
- __Upcase table below for another example.)
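-
- The vowel example can be sketched in plain C. This is only an
- illustration of the table lookup idea and does not call the library:
- the helper names are made up, and the assumption that both the pattern
- and the text characters go through the table is mine.
-
```c
#include <stddef.h>
#include <string.h>

/* Fill a 256-entry table: vowels map to '@', everything else to itself. */
void build_vowel_table(char table[256])
{
    const char *vowels = "aeiouAEIOU";
    int c;

    for (c = 0; c < 256; c++)
        table[c] = (char)c;
    for (; *vowels != '\0'; vowels++)
        table[(unsigned char)*vowels] = '@';
}

/*
 * Hypothetical helper: compare a literal pattern with a text of the
 * same length.  Each character is used as an index into `table' first
 * (assumption: both sides are translated).  A NULL table means no
 * translation, so characters are compared literally.
 */
int equal_translated(const char *table, const char *pattern, const char *text)
{
    size_t i, n = strlen(pattern);

    if (strlen(text) != n)
        return 0;
    for (i = 0; i < n; i++) {
        unsigned char p = (unsigned char)pattern[i];
        unsigned char t = (unsigned char)text[i];

        if (table != NULL) {
            p = (unsigned char)table[p];
            t = (unsigned char)table[t];
        }
        if (p != t)
            return 0;
    }
    return 1;
}
```
-
- With the vowel table, the pattern c@t equals cat and cot but not cxt.
- With a NULL table, c@t only equals the literal text c@t.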
-
- If re_initialize_buffer succeeds, a NULL pointer is returned. If an
- error occurs, a pointer to one of the following fixed strings is
- returned:
-
- "No buffer" - you passed a NULL pointer, not a pointer
- to a regular expression buffer
- "Memory exhausted" - Not enough memory in the system to
- initialize the buffer
-
-
- D0 D0
- LONG re_terminate_buffer(bufp)
- struct re_pattern_buffer *bufp;
-
- This function must be called to free the memory and resources
- allocated during the initialize routine. It is not fatal if this
- routine is not called before your program exits, but the memory will
- not be returned to the system. (For which you will get royally flamed
- on the nets, believe me! 8-)
-
- A value of 1 is returned for a successful termination, and 0 for the
- error condition. An error (zero) means you passed a NULL pointer to
- the function.
-
-
- D0 D0 D1 A0 A1
- char *re_compile_pattern(pattern, size, bufp, ob)
- char *pattern;
- long size, ob;
- struct re_pattern_buffer *bufp;
-
- This function compiles a regular expression `pattern' with length
- `size' into the properly initialized buffer `bufp.'
-
- Different syntaxes for regular expressions exist. The syntax you
- would like is specified in the `ob' parameter. The ob parameter can
- be one of the following defined flags:
-
- (In general, the presence of one of the flags below indicates that
- the character referenced should be treated as a wildcard. If the flag
- is absent, then the character is not treated as a wildcard.)
-
- RE_NO_BK_PARENS
-
- Treat parentheses as the grouping wildcard. To specify a literal
- parenthesis the pattern \( or \) is needed. If this flag is left
- out, \( and \) are the grouping wildcards and ( and ) match the
- literal parentheses.
-
- RE_NO_BK_VBAR
-
- Treat the vertical bar as the "or"-operator, and \| as a literal
- vertical bar. If this flag is left out, the syntax is reversed.
-
- RE_BK_PLUS_QM
-
- Treat \+ and \? as the wildcards, and the plain plus and question
- mark characters as literals. If this flag is left out, the syntax
- is reversed.
-
- RE_TIGHT_VBAR
-
- Bind the vertical bar tighter than the ^ and $ operators. This
- means that the vertical bar takes precedence over the ^ and $ in a
- single expression.
-
- RE_NEWLINE_OR
-
- Treat the newline character `\n' as an "or"-operator. This might
- be useful in a program such as fgrep.
-
- RE_CONTEXT_INDEP_OPS
-
- Treat certain wildcard characters as wildcards only in certain
- contexts. Specifically, this applies to:
-
- ^ - only special at the beginning of a line, or after ( or |
- $ - only special at the end of a line, or before ) or |
- *, +, ? - only special when not after the beginning of a line,
- (, or |
-
- Some programs have a combination of the above flags as their default.
- The following flags give the syntax of some well-known Unix
- utilities in terms of the above flags:
-
- RE_SYNTAX_AWK - emulate awk regular expressions
- RE_SYNTAX_EGREP - emulate egrep regular expressions
- RE_SYNTAX_GREP - emulate grep regular expressions
- RE_SYNTAX_EMACS - emulate emacs-like regular expressions
-
- If re_compile_pattern() is successful in compiling the given regular
- expression, a NULL pointer is returned. If an error condition occurs,
- a pointer to one of the following fixed strings is returned.
-
- "Invalid regular expression" - eg: "\(ab\)*123\" has an
- invalid trailing '\'
-
- "Unmatched \(" - eg: "\(ab*123" has no
- closing "\)"
-
- "Unmatched \)" - eg: "ab\)*123" has no
- opening "\("
-
- "Premature end of regular expression" - eg: "foo[1-9" has no ']'
-
- "Nesting too deep" - you have too many levels
- of groupings: "\( \)"
-
- "Regular expression too big" - the regular expression
- needed more than 64K to
- store -- Try using a
- shorter one!
-
- "Memory exhausted" - Close some windows!
-
-
- D0 D0
- LONG re_compile_fastmap(bufp)
- struct re_pattern_buffer *bufp;
-
- If the initial part of a pattern does not match the string starting
- at a certain position, the whole expression will not match the string
- starting at that position.
-
- On this basis, it is possible to compute which characters can
- possibly be found at the start of a match for the pattern. If a string
- does not start with one of these characters, it cannot match the
- pattern. This collection of possible starting characters is called a
- fastmap.
-
- Fastmaps make pattern searching much faster by reducing the number of
- failed full matches.
-
- This function takes a compiled pattern in buffer `bufp' and computes
- a fastmap for it, which is stored in the `fastmap' field of the
- buffer. The fastmap is then used in the re_search() function while
- searching a string for a regular expression. If this function is
- not called before a re_search(), then re_search() will call it
- for you.
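-
- The idea behind a fastmap can be shown with a much simpler matcher
- than the library's. The sketch below assumes the "pattern" is just a
- list of literal alternatives (like cat\|dog); the function names are
- invented for the example and are not part of regex.library. A position
- is skipped without a full match attempt when its character is not
- marked in the fastmap.
-
```c
#include <stddef.h>
#include <string.h>

/* Mark every character that can begin one of the alternatives. */
void build_fastmap(const char *const *alts, size_t n, char fastmap[256])
{
    size_t i;

    memset(fastmap, 0, 256);
    for (i = 0; i < n; i++)
        if (alts[i][0] != '\0')
            fastmap[(unsigned char)alts[i][0]] = 1;
}

/*
 * Return the first position in `text' where any alternative matches,
 * or -1 if there is none.  *attempts counts the positions at which a
 * full match was tried.  Pass a NULL fastmap to try every position.
 */
long search_with_fastmap(const char *text, const char *const *alts,
                         size_t n, const char *fastmap, long *attempts)
{
    size_t len = strlen(text);
    size_t pos, i;

    *attempts = 0;
    for (pos = 0; pos < len; pos++) {
        /* Cheap test first: can a match start with this character? */
        if (fastmap != NULL && !fastmap[(unsigned char)text[pos]])
            continue;

        (*attempts)++;
        for (i = 0; i < n; i++) {
            size_t alen = strlen(alts[i]);

            if (alen > 0 && strncmp(text + pos, alts[i], alen) == 0)
                return (long)pos;
        }
    }
    return -1;
}
```
-
- Searching "a big dog" for cat or dog finds the match at position 6
- either way, but needs 7 full attempts without the fastmap and only 1
- with it.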
-
-
- D0 D0 D1 A0 A1 D2 D3
- LONG re_search(pbufp, string, size, startpos, range, regs)
- struct re_pattern_buffer *pbufp;
- char *string;
- long size, startpos, range;
- struct re_registers *regs;
-
- This function searches the string `string' of size `size' for the
- regular expression previously compiled to the buffer `pbufp.' The
- `startpos' parameter is the index into the string to start searching.
- If the search is unsuccessful at startpos, it is tried at startpos+1
- and so forth. The `range' parameter tells how far from the start
- position to go before failing. It is up to the caller to make sure
- that range is not so large as to take the starting position outside
- of the input strings. If the range parameter is negative, then the
- search will proceed from startpos to startpos-1 and so forth until
- -range positions have been checked.
-
- The `regs' parameter is a place to store information about exactly
- what was matched if the search is successful, including
- subexpressions. A subexpression is any part of a regular expression
- bounded by parentheses. The `start' field of a re_registers structure
- is an array of character pointers to the beginning of each
- subexpression matched. The `end' field is an array of character
- pointers to the character just past the end of each subexpression.
-
- For example,
-
- regs->start[0] to regs->end[0] is the entire expression matched
-
- regs->start[1] to regs->end[1] is the subexpression contained
- in the first \( \) grouping if there is one
-
- regs->start[2] to regs->end[2] is the subexpression contained
- in the second \( \) grouping if there is one
-
- and so on. If a NULL pointer is passed as the `regs' parameter,
- no information on matching is stored.
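-
- Since the start and end entries are described as character pointers
- into the searched string, the text between them is not NUL-terminated.
- A small helper along the following lines could copy one subexpression
- out for printing. The name copy_span() is hypothetical, and what the
- library stores for a grouping that did not take part in the match is
- not stated here, so the helper simply rejects NULL or reversed
- pointers.
-
```c
#include <stddef.h>
#include <string.h>

/*
 * Copy the characters from `start' up to, but not including, `end'
 * into `out' and NUL-terminate the result.  Returns the number of
 * characters copied, or -1 if the pointers are unusable or `out' is
 * too small.
 */
long copy_span(const char *start, const char *end, char *out, size_t outsize)
{
    size_t len;

    if (start == NULL || end == NULL || end < start || outsize == 0)
        return -1;

    len = (size_t)(end - start);
    if (len >= outsize)
        return -1;                  /* no room for the terminating NUL */

    memcpy(out, start, len);
    out[len] = '\0';
    return (long)len;
}
```
-
- Under the description above, a call such as
- copy_span(regs->start[1], regs->end[1], buf, sizeof(buf)) would give
- the text of the first \( \) grouping.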
-
- There is a maximum of NREGS groupings available. If you really need
- more, you can change the definition of NREGS in regex.h and recompile
- the library.
-
- The return value is the position of the start of the string
- that matches the regular expression. If there is no match, a -1 is
- returned. If there was some internal error, a -2 is returned.
-
- The function re_search() depends on re_search_2() below to do
- its grunt work.
-
-
- D0 D0 D1 A0 A1 D2 D3
- LONG re_search_2(pbufp, string1, size1, string2, size2, startpos,
-
- D4 D5 D6
- range, regs, mstop)
- struct re_pattern_buffer *pbufp;
- char *string1, *string2;
- long size1, size2;
- long startpos;
- register long range;
- struct re_registers *regs;
- long mstop;
-
- This function works the same as re_search, with the exception that it
- takes different arguments. The regular expression in the buffer
- `pbufp' is searched for in the concatenation of `string1' and
- `string2.' The parameters `size1' and `size2' are the lengths of
- string1 and string2 respectively. The `startpos' is the starting
- position of the search and the `range' is how many characters
- further to try the search, just as in re_search. The `regs' parameter
- is a pointer to a re_registers structure which is space for storing
- information about what exactly was matched.
-
- The return value is the position of the start of the string
- that matches the regular expression. If there is no match, a -1 is
- returned. If there was some internal error, a -2 is returned.
-
- See the description of the re_search() function for more details.
-
-
- D0 D0 D1 A0 A1 D2
- LONG re_match(pbufp, string, size, pos, regs)
- struct re_pattern_buffer *pbufp;
- char *string;
- long size, pos;
- struct re_registers *regs;
-
- This function matches the compiled regular expression in `pbufp'
- against `string,' which is of length `size.' The `pos' parameter is
- the position in the string to start the matching. The `regs'
- parameter points to space to store information about the part of the
- string that matched the regular expression. See the description of
- the re_search() function for more details of the `regs' parameter.
-
- The return value is the length of the string that matches the regular
- expression. If there is no match, a -1 is returned. If there was some
- internal error, a -2 is returned.
-
- The difference between re_search() and re_match() is that re_search()
- finds the regular expression anywhere in a certain range of a string
- by looking at different starting positions, while re_match() only
- looks at the starting position specified.
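-
- The relationship between the two functions can be sketched with a
- plain literal matcher standing in for the regular expression code.
- The function names below are invented for the illustration. The
- handling of `range' is an assumption based on the description of
- re_search(): startpos itself is tried, followed by up to |range|
- further positions, going backwards when range is negative.
-
```c
#include <stddef.h>
#include <string.h>

/*
 * re_match()-like: try `lit' at exactly `pos'.  Returns the length
 * matched, or -1 if there is no match at that position.
 */
long match_literal_at(const char *text, long size, long pos, const char *lit)
{
    long len = (long)strlen(lit);

    if (pos < 0 || pos + len > size)
        return -1;
    return memcmp(text + pos, lit, (size_t)len) == 0 ? len : -1;
}

/*
 * re_search()-like: try startpos, then neighbouring positions, until
 * a match is found or the range is used up.  Returns the position of
 * the match, or -1 if there is none.
 */
long search_literal(const char *text, long size, const char *lit,
                    long startpos, long range)
{
    long step = (range < 0) ? -1 : 1;
    long tries = (range < 0) ? -range : range;
    long i, pos;

    for (i = 0; i <= tries; i++) {
        pos = startpos + i * step;
        if (pos < 0 || pos > size)
            break;                  /* stay inside the input string */
        if (match_literal_at(text, size, pos, lit) >= 0)
            return pos;
    }
    return -1;
}
```
-
- For the 11 character string "one two one", matching one at position 0
- returns the length 3, while searching for two from position 0 with a
- range of 11 returns the position 4. A search for one from position 10
- with a range of -10 walks backwards and returns 8.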
-
- D0 D0 D1 A0 A1 D2 D3 D4 D5
- LONG re_match_2(pbufp, string1, size1, string2, size2, pos, regs, mstop)
- struct re_pattern_buffer *pbufp;
- unsigned char *string1, *string2;
- long size1, size2;
- long pos;
- struct re_registers *regs;
- long mstop;
-
- This function is much like re_match(), except that two strings are
- specified as parameters. The function matches the compiled regular
- expression in `pbufp' against the concatenation of `string1' and
- `string2,' which are of length `size1' and `size2' respectively. The
- `pos' parameter is the position in the string to start the matching.
- The `regs' parameter points to space to store information about the
- part of the string that matched the regular expression. See the
- description of the re_search() function for more details of the
- `regs' parameter.
-
- The return value is the length of the string that matches the regular
- expression. If there is no match, a -1 is returned. If there was some
- internal error, a -2 is returned.
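-
- Positions passed to the two-string functions refer to the
- concatenation of string1 and string2, which never has to be built in
- memory. The mapping from such a position to a character can be
- pictured as follows. The helper concat_char_at() is hypothetical and
- only illustrates the indexing; it is not a library entry point.
-
```c
/*
 * Return the character at index `pos' of the virtual concatenation of
 * string1 (size1 characters) and string2 (size2 characters), or -1 if
 * `pos' is out of range.
 */
int concat_char_at(const char *string1, long size1,
                   const char *string2, long size2, long pos)
{
    if (pos < 0 || pos >= size1 + size2)
        return -1;
    if (pos < size1)
        return (unsigned char)string1[pos];      /* first piece */
    return (unsigned char)string2[pos - size1];  /* second piece */
}
```
-
- This is convenient when the text to be searched is already held in
- two pieces, since the caller does not need to copy it into one buffer
- first.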
-
-
- Functions in regex.lib:
-
- The following entry points are for compatibility with the BSD Unix
- regular expression package. The BSD regular expression package does not
- fiddle with such piddly re-entrant ideas as user buffers, and thus a
- static buffer is used for you when compiling regular expressions.
-
- If you are writing your program in assembler, you will have to link
- with the aregex.lib as well as regex.lib to access these functions.
- This is because these routines are written in C, and parameters must be
- put on the stack. The glue code in aregex.lib does this for you. For
- assembler programs, the entry points for these functions are the
- function names without a leading underscore character (i.e., re_comp
- and re_exec, instead of _re_comp and _re_exec).
-
- D0
- char *re_BSD_initialize()
-
- This function initializes the internal buffer. This function should
- be placed at the beginning of any program using the BSD entry points.
-
-
- void re_BSD_terminate()
-
- This function frees the system resources used by the initialize
- routine. This function should be placed at the end of any program
- using the BSD entry points.
-
-
- D0 D0
- char *re_comp( s )
- char *s;
-
- Compile the pattern in the string `s' for use in subsequent matchings.
- If the internal buffer has not been properly initialized, this
- function will detect the condition and call re_BSD_initialize()
- for you. This means it is not critical to call the initialize
- routine, but it is a good idea anyway.
-
- If the string s is a NULL pointer, the previous regular expression
- will be used.
-
- If the compilation is successful, a NULL pointer is returned.
- Otherwise, a pointer to one of the fixed strings returned by
- re_compile_pattern() is returned. (See the description of
- re_compile_pattern() above for details.) As well, re_comp() may
- return a pointer to the following string:
-
- "No previous regular expression" - re_BSD_initialize was never
- called
-
-
- D0 D0
- LONG re_exec( s )
- char *s;
-
- Use the last compiled pattern to match against the string `s.'
- Like re_search(), this function returns a -1 for no match, a -2 for
- internal error, and the position of the beginning of the matched
- string for a successful matching.
-
-
- Variables in regex.lib:
-
- The following variables are also provided in the linkable library for
- programming convenience:
-
- struct RegexBase *RegexBase
-
- Assign the results of an OpenLibrary() on regex.library to this
- variable. It is used to find the jump table in memory so that
- the shared library routines can be executed.
-
-
- char __Upcase[]
-
- This is a pre-defined translation table for use in a call to
- re_initialize_buffer(). It is a translation table that turns
- all lower case letters into upper case letters, effectively
- making the regular expression case insensitive while matching.
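-
- A table with the same effect could be built at run time roughly as
- shown below. This is a sketch using the standard toupper() function,
- not the source of the __Upcase table in lib2.c, and the function name
- is made up.
-
```c
#include <ctype.h>

/* Fill a 256-entry table that maps lower case letters to upper case. */
void build_upcase_table(char table[256])
{
    int c;

    for (c = 0; c < 256; c++)
        table[c] = (char)toupper(c);
}
```
-
- Characters that are not lower case letters map to themselves, so
- digits and punctuation are still matched literally.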
-
-
- Still To Do:
-
- - providing a Modula II, Lattice, PDC and/or Draco linkable support
- library
-
- Not having Modula II or Lattice, these are difficult for me to do right
- now... However, if you do do any of these, I would be eager to hear
- from you!
-
- I suspect the Lattice support would simply consist of a header file
- of #pragmas, but I have little idea how that would work.
-
-
- Files:
- alink.asm - assembler glue code source for aregex.lib
- aregex.lib - interface between assembler and regex.library
- interface.asm - interface between assembler and C within
- regex.library
- lib1.c - BSD style entry points to regex.library
- lib2.c - default uppercase translation table
- library.c - main shared library routines of regex.library
- library.h - header for library.c
- link.asm - C glue code source for regex.lib
- makefile - makefile for Manx
- malloc.c - support routines for regex.library
- ReadMe - this file
- regex.c - regular expression code in regex.library
- regex.h - C header file for anything to do with regex
- regex.i - assembler header file for anything to do with regex
- regex.lib - interface code between C and regex.library
- regex.library - Amiga shared library
- rtag.asm - ROM tag code for regex.library
- startup.asm - modified small model startup code for regex.library
- tester - test program
- tester.c - source to the above
- tinygrep - small, almost-useful test program
- tinygrep.c - source to the above
-
-
- Please redirect any comments, criticisms or vivacious vixens:
-
- Edwin Hoogerbeets
- Usenet: ehoogerbeets@rose.waterloo.edu (school account until Aug '89)
- or edwin@watcsc.waterloo.edu (permanent account)
- or w-edwinh@microsoft.uucp (Sept '89 to Dec '89)
- CIS: 72647,3675 (funds-dependent permanent 8-)
-
- Remember, pillows don't hit people. People do.
-
-
-
-