-
- Program: Scan
-
- Version: 0.82 gamma 4/02/92
-
- Utility: Searches twice as fast on hard drives and five times faster
- in ram than the best search programs currently available.
-
- Option to scan selected (wildcard-matched) files inside LZH and LHA archives.
-
- Supports searching for multiple patterns simultaneously
- with little speed degradation.
-
- Option to output whole article when a match is found.
-
- Extensive wildcard support (*, ?, [], [^], [-], +, |, &, ..).
-
- Optional inverted pattern matching.
-
- Recursive directory scanning.
-
- Support for \x?? and \? in patterns and article separator.
-
- Line search highlights matching words with selectable color.
-
- Freeware.
-
- Tribute: Eternal praise to Jesus for saving my soul and for the wonders
- of God's creation.
-
- Legal: Copyright © 1991, 1992 by Walter Rothe
-
- This program is freely distributable, but copyrighted by me. This
- means that you can copy it freely as long as you don't ask for any
- more money than a nominal fee for copying. This program may be
- placed on Public Domain disks, like Fred Fish's library. To
- distribute this program you must include the program,
- documentation, and test files in their original unmodified form.
- This does not preclude compression by archiving programs like
- lharc. This program cannot be used for commercial purposes
- without written permission from the author. The author can not
- be made responsible for any damage which is caused by using
- this program.
-
- Command: Scan -[nprt] -[hColor] -[lNumLines] -[wWinSize] -[oOutFile]
- format -[zLHAWild] SrchFile(s) Pattern
-
- OR
-
- Scan -f[CnfgFile] -[oOutFile] -[pr] -[zLHAWild] SrchFile(s)
-
- OR
-
- Scan -a[ipr] [-cColumn] -[sArtSep] -[oOutFile] -[wWinSize]
- -[zLHAWild] SrchFile(s) Pattern
-
- OR
-
- Scan -[vx]
-
- Names: Color - A two digit number, where the 1st number indicates
- the color the matching word is highlighted with,
- and the second is the color the filenames of the
- files being searched are highlighted with. Currently
- limited to 0-9.
-
- NumLines - Number of lines of context information printed
- around match.
-
- WinSize - Number of bytes in window. Default 16K bytes.
- Modulus( WinSize, 4 ) must be 0. There are
- three buffers, each WinSize long that are swapped.
- Larger size windows usually increase the speed,
- except when handling large numbers of small files.
- Also, the larger the window, the more context
- information can be printed. Context info is limited
- to what's in the present and previous buffers. A large
- article may need a large window to be fully printed
- out. Currently WinSize is forced to 16K bytes
- whenever the -z option is in effect.
-
- OutFile - Pathname of file output will be put into.
-
- LHAWild - Wildcard pattern that is used to determine which
- internal LHA files are scanned. Only "*" and "?"
- wildcards are permitted here. Note that the full
- internal filename must be matched. Any directories
- must be included. Some shells expand wildcards on
- the command line so you may need to enclose the
- option in quotes, e.g. "-z*.c"
-
- SrchFile(s) - Pathname of file(s) to be searched. Only "*" and "?"
- wildcards are permitted here. For recursive dir
- scans, you need only specify the directory pathname.
- You can optionally add a "/" or "/*.h" to the end of
- the directory pathname. The command line is limited
- to 255 chars so if you specify a wildcard pattern and
- the pathnames of the matching files exceed 255 chars,
- the last files won't be scanned. To get around this,
- enclose the wildcard in quotes, e.g. "*.l??"
-
- Pattern - Pattern to search for. This can include the
- wildcards "*", "?", "&", "|", "+", and "..". Refer to
- the following section on pattern matching. Note that
- you may need to enclose some of the wildcards in
- quotes to keep the shell from expanding them. Also,
- there must be at least 2 consecutive unique
- non-wildcard characters in each pattern between the
- "|" wildcards. Pattern is case insensitive. For
- example: "sale..d*paint[3i]|paint&prog"
-
- CnfgFile - Pathname of file containing article separator,
- column for article separator, inversion flag,
- window size, and search pattern with each on a
- separate line, starting in column 1. Note that each
- pattern between "|"'s appears on a separate line
- without the "|". There is a maximum of 125 of these.
- If the "-f" option is not followed with a name,
- then file S:scan.config is used.
-
- ArtSep - Article separator. Defaults to "\nArticle". Note
- that a "\" in the article separator has the same
- meaning as that in the pattern matching algorithm
- explained below. An article separator must be 2 or
- more characters long. For example, "\n\n" is a
- valid separator and causes a new article to be
- started with each blank line. Note that some shells
- require you to specify this as "\\n\\n".
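-
- The article separator and the invert option can be pictured with a
- small Python sketch. It is not Scan's code: the function names are
- hypothetical, matching is a plain case insensitive substring test,
- and treating text before the first separator as its own article is
- an assumption.
-

```python
def split_articles(text, separator="\nArticle"):
    """Split text into articles, each one starting at a separator."""
    if len(separator) < 2:
        raise ValueError("separator must be 2 or more characters long")
    starts = []
    pos = text.find(separator)
    while pos != -1:
        starts.append(pos)
        pos = text.find(separator, pos + len(separator))
    # Assumption: anything before the first separator counts as an article.
    bounds = [0] + [s for s in starts if s != 0] + [len(text)]
    return [text[a:b] for a, b in zip(bounds, bounds[1:]) if a != b]


def scan_articles(text, pattern, separator="\nArticle", invert=False):
    """Return the articles containing pattern, or the others if invert is set."""
    needle = pattern.lower()
    return [article for article in split_articles(text, separator)
            if (needle in article.lower()) != invert]
```

- Passing separator="\n\n" to scan_articles starts a new article at
- each blank line, as described for the "-s" option.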
-
- Options:
-
- -a Article scan. Prints out all articles with matches.
-
- -cColumn Column the article separator must be in (1..?).
-
- -fCnfgFile Get parms from config file.
-
- -f Get parms from s:scan.config.
-
- -hxy Highlight matches with color x and pathnames with color y.
-
- -i Invert matching so nonmatching articles are printed.
-
- -lxx Line search with xx lines around target printed.
-
- -n Print line numbers with matched text(slower).
-
- -oOutFile Send output to file.
-
- -p Always print file pathnames scanned.
-
- -r Recursively scan down directories.
-
- -sArtSep Article separator(def Article).
-
- -t Truncate output to window width. Only works with -n.
-
- -v Print version number. Other options nulled.
-
- -wWinSize Window size(def 16384 bytes). Mod(size,4) must be 0.
-
- -x Print out more help information. Nulls other options.
-
- -zWildPat Enable decompression of .lzh and .lha files for
- internal files matching WildPat.
-
- -z Enable decompression of all .lzh and .lha files.
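-
- As a rough illustration of the "-l" and "-n" options, the Python
- sketch below collects each matching line together with a number of
- context lines on both sides. The function name and the "N: " prefix
- used for line numbers are assumptions, not Scan's actual output
- format.
-

```python
def line_search(text, pattern, context=0, numbered=False):
    """Return matching lines plus `context` lines on each side, in file order."""
    lines = text.splitlines()
    needle = pattern.lower()
    keep = set()
    for i, line in enumerate(lines):
        if needle in line.lower():  # case insensitive, literal match only
            first = max(0, i - context)
            last = min(len(lines), i + context + 1)
            keep.update(range(first, last))
    # Overlapping context windows are merged because indexes live in a set.
    return [f"{i + 1}: {lines[i]}" if numbered else lines[i]
            for i in sorted(keep)]
```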
-
- Pattern:
- matching
-
- ? Matches any single character except
-
- [chars] Matches any character within the brackets, e.g. [abcxyz]
-
- [c1-c2] Matches any character from c1 to c2, e.g. [a-x]
-
- [^chars] Matches any character not within the brackets.
-
- \xYY Matches hex number YY as a character. Note that back
- slashes are preprocessed by some shells. You may need
- to put two slashes back to back to prevent this.
-
- \Y Matches the standard C escape sequence Y
-
- \YYY Matches the decimal number YYY as a character
-
- | Finding the pattern on the left or right causes a match
-
- + Same as |
-
- * Pattern on left and right must both match and be in the
- same word. Match on left must come before one on right.
-
- & Pattern on left and right must both match and be in the
- same sentence. Match on left must come before match on
- the right. A sentence is delimited by:
-
- a) period followed by a space or line feed
- b) a maximum of OVERLAP(512) characters
- c) two newline chars with no chars but ">" between them
- d) start or end of article
- e) newline before a colon
-
- .. Pattern on left and right must both match and be in the
- same article. Order of left and right matches is not
- important. This is faster than "&". This wildcard is
- only used during article scans(-a or -f), not line
- scans.
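-
- The difference between "&" and ".." can be sketched in Python as
- follows. This is a simplification under stated assumptions: the
- function names are hypothetical, the terms are literal strings, and
- only sentence rule (a) is modelled (a period followed by a space or
- line feed).
-

```python
import re


def split_sentences(article):
    """Split on a period followed by a space or line feed (rule a only)."""
    return re.split(r"\.[ \n]", article)


def match_and(article, left, right):
    """'&': both terms in one sentence, with left occurring before right."""
    left, right = left.lower(), right.lower()
    for sentence in split_sentences(article.lower()):
        i = sentence.find(left)
        if i != -1 and sentence.find(right, i + len(left)) != -1:
            return True
    return False


def match_dotdot(article, left, right):
    """'..': both terms anywhere in the article, in either order."""
    lowered = article.lower()
    return left.lower() in lowered and right.lower() in lowered
```

- Because "&" is order dependent, matching both orders takes two
- terms joined with "|", while ".." covers both orders by itself.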
-
-
- Config:
- format
-
- line 1 Article separator
-
- line 2 Column article separator must be in. 0 -> ignore
-
- line 3 Invert match flag. 1 -> invert match. 0 -> normal
-
- line 4 Window size in bytes. Mod( size, 4 ) must be 0.
-
- line 5..129 Search patterns. There is an implicit "|" between
- each line. There may be fewer than 125 pattern lines
- if there are any ".." wildcards on a line. The number
- of lines is reduced directly by the number of ".."'s.
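-
- The layout above can be read with a few lines of Python. This is
- only a sketch: the function and key names are hypothetical, skipping
- empty pattern lines is an assumption, backslash sequences in the
- separator are left unprocessed, and the reduction of the pattern
- limit by ".." wildcards is not modelled.
-

```python
def parse_scan_config(text):
    """Parse config text: separator, column, invert flag, window size, patterns."""
    lines = text.splitlines()
    if len(lines) < 5:
        raise ValueError("expected 4 setting lines and at least 1 pattern line")
    window = int(lines[3])
    if window % 4 != 0:
        raise ValueError("window size must be a multiple of 4")
    patterns = [p for p in lines[4:] if p]  # assumption: skip empty lines
    if len(patterns) > 125:
        raise ValueError("at most 125 pattern lines are allowed")
    return {
        "separator": lines[0],       # raw text, "\n" escapes not expanded here
        "column": int(lines[1]),     # 0 means the column is ignored
        "invert": lines[2] == "1",   # 1 inverts the match, 0 is normal
        "window_size": window,
        "patterns": patterns,        # an implicit "|" joins these
    }
```

- A config with "\nArticle", "0", "0", "16384" on the first four lines
- and one pattern on each following line would parse into a normal
- (non-inverted) article scan with the default window size.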
-
-
- Background:
-
- I wrote "scan" to help minimize the time I spend scanning the
- very large (megabyte) USENET proceedings I download weekly. This
- program scans a file or set of files looking for strings matching
- one or more user-specified patterns. It supports the traditional *, ?,
- and [] wildcards, but adds three new ones: "&" and "|", which are
- similar to "*" but work on sentences and articles instead of words,
- and "..", which is similar to "&" but is order independent. "Scan"
- can print out the entire text of an article if a match is found
- anywhere in the article. It can search recursively down directories
- and do inverse pattern matching where only articles that don't match
- the pattern are printed out. Due to the size of Usenet proceedings,
- it's desirable to keep them archived in compressed form. Scan
- supports .LZH and .LHA formats with .LHA being significantly faster
- and more dense. You can even selectively scan user specified
- internal .LZH and .LHA files using a wildcard pattern. Finally, up
- to 125 patterns can be scanned for simultaneously, with minimal
- speed degradation.
-
- The fastest search programs I've seen to date are "zgrep" and the
- "csh" search command. "Zgrep" has the edge when run from RAM,
- but "csh" does better on hard disk searches. "Scan" searches twice
- as fast as "csh" on hard drives and 3 times faster than "zgrep".
- It searches 5 times faster than "zgrep" in RAM and 15 times faster
- than "csh" in RAM. Search time is about the same for all 3 when
- run from a floppy.
-
-
- Algorithm:
-
- A preprocessor selects the least repetitive two character
- sub-pattern from each major term of the pattern. An even and an
- odd two character subpattern is selected. This allows 16 bits
- to be processed at a time in the inner loop. These two character
- subpatterns are used to do a parallel Boyer-Moore type search.
- If a match is found with the two char subpattern, the rest of
- the pattern is checked. If the full major term matches, a flag is
- set and other flags examined to see if the full minterm matches.
- If so, another flag is set to cause the article to be printed out.
- A triple buffer approach is used with Matt Dillon's asynchronous
- I/O to help speed file reads. Thanks Matt. A special two character
- end of buffer subpattern is put at the end of the buffer so EOF
- won't have to be checked for after each pattern check.
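-
- The general idea of a two character prefilter with an end of buffer
- sentinel can be sketched in Python. This is not the program's code
- and makes several assumptions: terms are literal strings, the key is
- simply the first pair of two different characters (standing in for
- "least repetitive"), every offset is tested instead of reproducing
- the even/odd 16 bit trick or Boyer-Moore skipping, and the names
- SENTINEL, pick_key, and find_terms are hypothetical.
-

```python
SENTINEL = "\x00\x01"  # hypothetical end-of-buffer marker


def pick_key(term):
    """Return (offset, pair) for the first two adjacent, different characters."""
    for i in range(len(term) - 1):
        if term[i] != term[i + 1]:
            return i, term[i:i + 2]
    raise ValueError(f"no usable two character subpattern in {term!r}")


def find_terms(text, terms):
    """Return sorted (position, term) hits for literal terms, case insensitive."""
    table = {}  # two character key -> list of (key offset in term, term)
    for term in (t.lower() for t in terms):
        offset, pair = pick_key(term)
        table.setdefault(pair, []).append((offset, term))
    table.setdefault(SENTINEL, [])

    body = text.lower()
    buf = body + SENTINEL
    end = len(body)
    hits = []
    i = 0
    while True:
        pair = buf[i:i + 2]
        candidates = table.get(pair)  # cheap filter on a two character window
        if candidates is not None:
            # The end of the buffer is only tested after a key hit, so the
            # common path through the loop carries no bounds check.
            if pair == SENTINEL and i == end:
                break
            for offset, term in candidates:
                start = i - offset
                if start >= 0 and buf.startswith(term, start):
                    hits.append((start, term))  # full term verified
        i += 1
    return sorted(hits)
```

- Several terms share one pass over the buffer here, which is one way
- to picture why extra patterns add little to the search time.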
-
-
- Examples:
-
- Searching for sentences with the words "Amiga" and "commercial"
- in them is specified with:
-
- amiga&commercial | commercial&amiga
-
- If "Amiga" and "commercial" don't have to be in the same sentence,
- then it can be done with:
-
- amiga..commercial
-
- To find the paragraphs with the words "truth" and "life" in them
- within the Gospel of John, using the archived New Testament on BIX,
- with the "csh" shell, and printing out the internal book names as
- they are scanned:
-
- scan -aps\\n\\n "-z*john*" biblenew.lha truth..life
-
- This takes about 3 seconds off a hard drive on an A3000.
-
- Note also that to set up an alias in "csh" requires additional
- back slashes:
-
- alias scankjv "scan -as\\\\n\\\\n"
-
- To find all occurrences of "faith" and "healed" in the same sentence
- in the entire New Testament, only printing out the names of the
- books with the matches, highlighting the word that matched, and
- printing 3 lines of context on both sides of the match:
-
- scan -l3 -z biblenew.* faith&healed+healed&faith
-
-
- Hints: Search patterns must be at least 2 characters long.
- There must be at least two consecutive unique characters
- within each major term of the pattern. A major term is delimited
- by a "|", "+", or "..". The program will tell you if it cannot
- find unique subpatterns.
-
- There are four distinct command formats of Scan. The 1st is a
- default format which scans for matching lines, not articles.
- The 2nd format has a "-a" in it and scans for articles. The
- 3rd has a "-f" in it and also scans for articles, but it gets
- most of its options from a configuration file. The 4th is used
- to print the version or more help information.
-
- An article separator containing line feeds can be specified
- by adding a "\n". Note that some command line interpreters
- require an extra backslash: "\\n". For example, to specify a blank
- line as an article separator with csh, use "-s\\n\\n".
-
- Command line options can be grouped together with the exception
- of those having arguments. An option with an argument like "-l"
- must appear with a separate dash or be the last option in a
- dash group.
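-
- The grouping rule can be illustrated with a short Python sketch
- that splits one dash group into its flags and, at most, one trailing
- option with an argument. The function name is hypothetical and the
- two option sets are read off the option list in this document, so
- this shows the rule rather than Scan's own parser.
-

```python
FLAGS = set("ainprtvx")      # options that take no argument
WITH_ARG = set("cfhloswz")   # options followed by an argument (may be empty)


def parse_group(group):
    """Split one dash group into (flags, (option, argument) or None)."""
    if len(group) < 2 or not group.startswith("-"):
        raise ValueError(f"not an option group: {group!r}")
    flags = []
    body = group[1:]
    for i, ch in enumerate(body):
        if ch in FLAGS:
            flags.append(ch)
        elif ch in WITH_ARG:
            # Everything after this letter is its argument, which is why an
            # option with an argument must come last in its dash group.
            return flags, (ch, body[i + 1:])
        else:
            raise ValueError(f"unknown option: {ch!r}")
    return flags, None
```

- For the group used in the examples, "-aps" followed by the
- separator, this yields the flags "a" and "p" plus the "s" option
- carrying the separator text.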
-
- Some CLIs, along with Scan, limit the command line to 255
- characters. When doing wildcard file matching on the command
- line, large numbers of matching files can cause a command line
- overrun. This is not harmful, but some of the files you expected
- to be scanned won't be. The solution is to put quotes around
- the wildcards.
-
- The filename pattern matching algorithm is a lot faster than the
- one in "csh" and probably other shells. Because of this, it is
- a lot faster to do:
-
- scan "dh0:work/*" xyz
-
- than to do:
-
- scan dh0:work/* xyz
-
- especially if there are a lot of files in the directory.
-
-
- Source: This is a gamma test version of the program "scan". Once I have
- received enough bug notifications and suggestions, I will release
- the source.
-
- Author: Walter Rothe
-
- Contact: BIX - aimania OR 2008 Mary St, Carrollton, Tx 75006
-
-
-
-