home *** CD-ROM | disk | FTP | other *** search
Text File | 1993-10-18 | 68.9 KB | 1,961 lines |
- Newsgroups: comp.sources.unix
- From: john@johncon.com (John Conover)
- Subject: v27i075: qt - full-text retrieval program, Part01/01
- Message-id: <1.750984870.12686@gw.home.vix.com>
- Sender: unix-sources-moderator@gw.home.vix.com
- Approved: vixie@gw.home.vix.com
-
- Submitted-By: john@johncon.com (John Conover)
- Posting-Number: Volume 27, Issue 75
- Archive-Name: qt/part01
-
- Qt stands for Query Text, a text information retrieval system. Qt
- creates, maintains, and queries a full text database. The database
- file system is organized as an inverted index. The program is written
- as a single script, in Bourne Shell, and permits simple natural
- language queries.
-
- Environment: Unix, SysV. rel. 4.x, R6000, DEC ALPHA 3000, others.
-
- john@johncon.com
-
- #!/bin/sh
- # This is a shell archive (produced by shar 3.49)
- # To extract the files from this archive, save it to a file, remove
- # everything above the "!/bin/sh" line above, and type "sh file_name".
- #
- # made 09/16/1993 02:58 UTC by john@johncon
- # Source directory /home/john
- #
- # existing files will NOT be overwritten unless -c is specified
- #
- # This shar contains:
- # length mode name
- # ------ ---------- ------------------------------------------
- # 65404 -rwxrw-rw- qt/qt
- # 1844 -rw-r--r-- qt/README
- #
- # ============= qt/qt ==============
- if test ! -d 'qt'; then
- echo 'x - creating directory qt'
- mkdir 'qt'
- fi
- if test -f 'qt/qt' -a X"$1" != X"-c"; then
- echo 'x - skipping qt/qt (File already exists)'
- else
- echo 'x - extracting qt/qt (Text)'
- sed 's/^X//' << 'SHAR_EOF' > 'qt/qt' &&
- #!/bin/sh
- #
- VERSION="qt - Version 0.1. (qt -h, for description and help.)"
- #
- # Qt stands for Query Text, a text information retrieval system. Qt
- # creates, maintains, and queries a full text database. The database
- # file system is organized as an inverted index. The program is written
- # as a single script, in Bourne Shell, and permits simple natural
- # language queries.
- #
- # As a simple application example, this program can be used to search
- # the "catman" pages for a command that performs a specific function,
- # even though the command's name is not known-e.g., if you knew what
- # you wanted to do, you could find the command that would do it.
- #
- # The program, qt, is free software, and can be redistributed and/or
- # modified, without any restrictions. It is distributed with no
- # warranty of any kind, implied or otherwise. Specifically, there is
- # no warranty of fitness for any particular purpose and/or
- # merchantability.
- #
- # Comments and/or bug reports should be addressed to:
- #
- # john@johncon.com (John Conover)
- #
- # Known caveats: There is no concurrency control-it would be
- # ill-advised to use this program as a concurrent application.
- # Additionally, the natural language query does not support grouping
- # operators.
- #
- # For a quick start, execute qt -h for help, which may be re-directed to
- # stdio. At the "tail -23" of this help file are some simple commands to
- # evaluate this script.
- #
- # Installation:
- #
- # The comments in this script are verbose, and should be stripped prior
- # to any installation with something like:
- #
- # sed '/^ *#/d;/^$/d' qt > qt.new
- #
- # and installing qt.new as qt in the executable path. Likewise,
- # possibly, the function, help(), should be eliminated. The function,
- # find_program(), is not efficient and should be eliminated, by hard
- # coding the paths to the various programs in your system. There are
- # tab characters used in this script, (which are referenced as the
- # variable, "${TAB}") requiring that the script be saved with tabs.
- #
- # Applicability:
- #
- # Applicability of qt varies on complexity of search, size of database,
- # speed of host environment, etc., however, as some general guidelines:
- #
- # 1) For text files with a total size of less than 5 MB,
- # standard egrep(1) queries of the text files will probably
- # prove adequate.
- #
- # 2) For text files with a total size of 5 MB to 50 MB, qt seems
- # adequate for most queries. The significant issue is that,
- # although the retrieval execution times are probably adequate
- # with qt, the database write times are not impressive.
- #
- # 3) For text files with a total size that is larger than 50 MB,
- # or where concurrency is an issue, it would be appropriate to
- # consider one of the alternatives listed in "Related
- # information retrieval software:," below.
- #
- # References:
- #
- # 1) "Information Retrieval, Data Structures & Algorithms,"
- # William B. Frakes, Ricardo Baeza-Yates, Editors, Prentice
- # Hall, Englewood Cliffs, New Jersey 07632, 1992, ISBN
- # 0-13-463837-9.
- #
- # The sources for the many of the algorithms presented in 1) are
- # available by ftp, ftp.vt.edu:/pub/reuse/ircode.tar.Z
- #
- # 2) "Text Information Retrieval Systems," Charles T. Meadow,
- # Academic Press, Inc, San Diego, 1992, ISBN 0-12-487410-X.
- #
- # 3) "Full Text Databases," Carol Tenopir, Jung Soon Ro,
- # Greenwood Press, New York, 1990, ISBN 0-313-26303-5.
- #
- # 4) "Text and Context, Document Processing and Storage," Susan
- # Jones, Springer-Verlag, New York, 1991, ISBN 0-387-19604-8.
- #
- # 5) ftp think.com:/wais/wais-corporate-paper.text
- #
- # 6) ftp cs.toronto.edu:/pub/lq-text.README.1.10
- #
- # 7) "Unix Shell Programming," Lowell Jay Arthur, John Wiley &
- # Sons, Inc., New York, 1990, ISBN 0-471-51820-4.
- #
- # Related information retrieval software:
- #
- # 1) Wais, available by ftp, think.com:/wais/wais-8-b5.1.tar.Z
- #
- # 2) lq-text, available by ftp, cs.toronto.edu:
- # /pub/lq-text1.10.tar.Z
- #
- # This script uses the Unix concept of a simple flat text file as a
- # database, operated on by the various utilities native to the Unix
- # system. The flat text file's organization is exactly one record for
- # each file that contains at least one instance of a specific word.
- # Each record in the flat text file has exactly two fields. These two
- # fields are the word, followed by a single "${TAB}" field delimiter,
- # and the file name containing the word. The record sequence in the
- # flat text file is the ASCII collated sequence of the word field.
- #
- # This organization of flat text file is an inverted index database,
- # ie., the file names of files that contain a specific word can be
- # found, using a binary search program, (like the native Unix program,
- # "look.") The inverted index file records can be created by parsing
- # the words in textural documents (perhaps using the native Unix
- # programs, "tr" and "sed,") and concatenating these words with a
- # single "${TAB}" and the file name of the file containing the word.
- # These records can then be sorted into ASCII collation sequence,
- # (using, for example the Unix program, "sort" -u.) Obviously, two
- # sorted inverted index databases could be combined with the "sort"
- # -um, command. File names could be removed from the inverted index
- # database with the "egrep" -v "*${TAB}filename" command, and words
- # removed with the "egrep" -v "^word${TAB}" command, and so on. This
- # script uses only a few of the native Unix programs to construct an
- # inverted index database system.
- #
- # The functions contained in this script:
- #
- # 1) find_program(), find if a program exists.
- # 2) help(), help.
- # 3) read_index (), query the inverted index file for word(s).
- # 4) write_index(), index the words in the files.
- # 5) update_database(), update the database.
- # 6) parse_word(), lexical analysis.
- # 7) remove_words(), remove words from the inverted index.
- # 8) remove_files(), remove files from the inverted index.
- # 9) relevance_count(), relevance count.
- # 10) relevance_proximity(), proximity retrieval.
- #
- # The functions, remove_words(), remove_files(), relevance_count() and
- # relevance_proximity() are included to serve as templates for further
- # applications. Probably, in the interest of generality, they should
- # not be included in the program since they can be completely
- # implemented as external scripts, aliases, or pipes from the output of
- # qt.
- #
- # This program will create and query inverted index files that index
- # the words in text files. These indices are useful in information
- # retrieval systems. The inverted index files are, typically, about the
- # same size of the text files, and do not require the text files to be
- # present for query operations. The query functions, typically, consist
- # of boolean operations on word searches. The output of the query is,
- # typically, a list of the file names that contain the queried word(s).
- #
- # The read synopsis is:
- #
- # qt [-e] [-r | -rc | -rp] [-f index_name] word1 [op1] word2 [op2] ...
- #
- # where word1-word2 ... are the words to be queried in the inverted
- # index, and op1-op2 ... are the operations to be performed on the set
- # of file names that contain these words. The word/operation arguments
- # consist of pairs of search words, and boolean operators, with a left
- # to right operational precedence.
- #
- # Thus if A, B, and C are words, then the query:
- #
- # A and B or C not D
- #
- # would specify that all file names containing word A should be found,
- # then all the file names containing word B should be found, and only
- # those file names that contain words A and B should be added to those
- # file names containing word C, and then if these file names do not
- # contain word D, they are output.
- #
- # Logical "or'ing" is implicit, thus:
- #
- # A B C
- #
- # is identical to:
- #
- # A or B or C
- #
- # Obviously, the keywords, "and," "not," and "or," may not be queried
- # for, when using the implicit "or" query constructs.
- #
- # If the "-e" option is specified, then "exact match" queries will be
- # performed, otherwise, a "partial key" type of search will be
- # performed, which is the default. It is recommended that the "-e"
- # option be used if the query involves any boolean operations.
- #
- # If the "-r" option is specified, then the words being queried for
- # will be output before the list file names that contain the queried
- # words. This output format is compatible with egrep(1), and is useful
- # in doing "relevance feedback" searches.
- #
- # If the "-rc" option is specified, then the count of records in a file
- # that contain match(s) will be output with the file name containing
- # the matches. This provides the system with a remedial "relevance
- # feedback" capability. The original text files that were used to
- # construct the inverted index file must be available in the system to
- # use this option.
- #
- # If the "-rp" option is specified, then the records in a file that
- # contain match(s) will be output after the file name containing the
- # matches. This output format provides the system with a remedial
- # "permuted index" type of "proximity retrieval." The original text
- # files that were used to construct the inverted index must be
- # available in the system to use this option.
- #
- # The write synopsis is:
- #
- # qt -w [-f index_name] [-1 | ... | -8] file1 file2 ...
- #
- # where file1 file2 ... are the file names that contain words that are
- # to be added to the inverted index, or:
- #
- # qt -w [-f index_name] [-1 | ... | -8] < file_list
- #
- # where file_list is the name of a file that contains a list of file
- # names, one file name per record, that contain words that are to be
- # added to the inverted index.
- #
- # It is recommended that file names contain the absolute path to the
- # system's root directory.
- #
- # If the inverted index file does not exist, then it will be created,
- # and contain an index to all of the words in the input files. If the
- # inverted index file exists, then the indices of all words in the
- # input files will be added, incrementally. Instances of words and
- # filename pairs will be unique in the inverted index.
- #
- # The "-w" option, specifies that write operations will be performed,
- # and is a mandatory option, to be used if and only if write operations
- # are desired.
- #
- # The "-f index_name," optionally, specifies the inverted index file's
- # name. If the "-f" option is not specified, the inverted index file
- # name will default to "qt.index."
- #
- # The lexical analyzer level is specified by, "-1", through "-8". If
- # none are specified, the default, "-4", will be used. The lexical
- # analyzers with larger numbers are, generally, more sophisticated
- # about the words that are placed in the inverted index. The lexical
- # analyzers available are:
- #
- # 1) Parses words and numbers. All other characters are omitted.
- # Capitalization is preserved. Probably the best choice if
- # non-word searches are important.
- #
- # 2) Like 1) above, but the '_' character is recognized. This
- # parser seems to work well with "C" program source files.
- #
- # 3) Like 1) above, but, only words of more than two characters
- # are placed in the inverted index file. If capitalization is
- # considered important in the search criteria, then this seems
- # to be the best choice.
- #
- # 4) Like 1) above, but, capitalization is ignored. For general
- # text where all words and numbers are considered significant,
- # this seems to be the best choice. Also seems a good choice for
- # "catman" pages. Queries should be in lowercase.
- #
- # 5) Like 3) above, but capitalization is ignored. For general
- # text, this seems to be the best choice. Queries should be in
- # lowercase.
- #
- # 6) Like 4) above, but words containing only numbers are
- # omitted from the inverted index file. For text containing only
- # words, this seems to be the best choice. Queries should be in
- # lowercase.
- #
- # 7) Like 4) above, but does not include Unix mail headers in
- # the inverted index file. Each email should be in a separate
- # file, as opposed to concatenated into folders. This seems to
- # be the best choice for Unix mail files, if the header
- # information is not desirable.
- #
- # 8) Like 4) above, but deletes TeX and/or LaTeX commands from
- # the inverted index file. This seems to be the best choice for
- # TeX and LaTeX documents.
- #
- # The more sophisticated the parser, the smaller the size of the
- # inverted index file. Multiple runs can be made, using the different
- # parsers, to store words in the inverted index. For example, using
- # parsers 3) and 4) would place both the capitalized and
- # non-capitalized words in the index. This would not duplicate any
- # words already in the index-only add the words that were different.
- #
- # The remove words synopsis is:
- #
- # qt -w -dw [-f index_name] word1 word2 ...
- #
- # where word1 word2 ... are the words that are to be deleted from the
- # inverted index file, and may be a regular expressions-no '^' or '$'
- # characters should be used, unless they are escaped. The "-w" option
- # is mandatory.
- #
- # The remove files synopsis is:
- #
- # qt -w -df [-f index_name] file1 file2 ...
- #
- # where file1 file2 ... are the file names that are to be deleted from
- # inverted index. The file names to be deleted from the inverted index
- # file may be regular expressions-no '^' or '$' characters should be
- # used, unless they are escaped. The "-w" option is mandatory.
- #
- # The version synopsis is:
- #
- # qt -v
- #
- # which will print the version number of qt.
- #
- # The help synopsis is:
- #
- # qt -h
- #
- # which will list a synopsis of the command semantics.
- #
- # A common example of writing an inverted index file would be:
- #
- # find /dir1/dir2 -type f -print | qt -w
- #
- # which would recursively descend through the directory hierarchy, and
- # create an inverted index of all of the words in all of the files in
- # all of the directories, starting with /dir1/dir2.
- #
- # A common example of retrieving information from an inverted index
- # file would be:
- #
- # more +/word `qt word`
- #
- # where the "more" program would page through the documents that
- # contain "word," advancing to the next instance every time the 'n' key
- # is depressed.
- #
- # A common example of relevance determination in retrieving information
- # from an inverted index file would be:
- #
- # egrep -ic `qt -r word` | sort -n -r -t: +1
- #
- # which would print the file(s) that contain "word," with the count of
- # the instances of records that contain "word" in each of the file(s).
- #
- # Since the inverted index file constitutes a database system, care of
- # how this file is manipulated is important. The general procedure used
- # in this script is as follows:
- #
- # 1) When this script commences execution, a test is made for
- # the existence of a backup of an original inverted index file,
- # which was created in step 3), below. (Presumably, this file
- # was left by failed attempt(s) of step(s) 3), 4), or 5), by a
- # prior, unsuccessful, execution of this script.) If the backup
- # file exists, it is unconditionally moved, via the "mv"
- # command, to be the current, original database. If this
- # operation is successful, then step 2) is executed, if not, the
- # script aborts.
- #
- # 2) After any original inverted index file backup is restored,
- # all write operations to the database are written to a
- # temporary file-including duplication of any required data from
- # the current inverted index file. This temporary file will
- # become the new inverted index file. If these operations are
- # successful, then step 3) is executed, if not, the script
- # aborts.
- #
- # 3) After all write operations are completed, the original
- # inverted index file is backed up, using the "mv" command. If
- # this operation successful, then step 4) is executed, if not,
- # the script aborts.
- #
- # 4) After the original inverted index file has been backed up,
- # the temporary inverted index file is moved, using the "mv"
- # command, as the new inverted index file. If this operation is
- # successful, then step 5) is executed, if not, the script
- # aborts.
- #
- # 5) After the temporary inverted index file has been moved, the
- # original inverted index file backup is removed. If this
- # operation is successful, the script exits normally, if not,
- # the script aborts.
- #
- # Note that the vulnerability is in steps 3), 4) and 5). If step 3)
- # fails, then the the original inverted index is still intact, and
- # there is no backup (or need for one.) If step 4) fails, then there is
- # a backup and it will restored in step 1). If step 5) fails, then
- # there is a backup and it will, also, be restored in step 1),
- # (inadvertently destroying the new inverted index file.) Note,
- # additionally, that steps 3), 4), and 5) are "low risk" operations,
- # (two "mv" and one "rm" operation,) and executed sequentially, with no
- # intervening program steps.
- X
- # Function to find the programs used in this script. The arguments are
- # the choice of paths to a program, in precedence of your first choice,
- # second choice, and so on.
- #
- # Note that this function is not efficient. During installation, the
- # paths should be hard coded, and this function removed.
- X
- find_program()
- {
- X # If no arguments, exit.
- X
- X if [ "$#" -eq "0" ]; then
- X ${ECHO} "No program name specified, aborting." 1>&2
- X exit 1
- X fi
- X
- X # Save the first argument's basename for error reporting.
- X
- X program_base=`basename $1`
- X
- X # For each argument, test if the file name exists, and is
- X # executable.
- X
- X while [ "$#" -ne "0" ]
- X do
- X if [ -x "$1" ]; then
- X ${ECHO} "$1"
- X return
- X fi
- X shift
- X done
- X
- X # None of the file name arguments were found, exit with the error.
- X
- X ${ECHO} "Program not found, $program_base, aborting." 1>&2
- X exit 1
- }
- X
- # Assume an echo is in the path for find_program().
- X
- ECHO=echo
- X
- CAT=`find_program /usr/bin/cat`
- CP=`find_program /usr/bin/cp`
- X
- # For SunOS 4.1.x, use the SysV version of echo, in /usr/5bin/echo, all
- # others use /usr/bin/echo.
- X
- ECHO=`find_program /usr/5bin/echo /usr/bin/echo`
- EGREP=`find_program /usr/bin/egrep`
- JOIN=`find_program /usr/bin/join`
- X
- # For SysV Rel. 4.x, the look program resides in /usr/ucb/look, all
- # others use /usr/bin/look.
- X
- LOOK=`find_program /usr/bin/look /usr/ucb/look`
- MV=`find_program /usr/bin/mv`
- RM=`find_program /usr/bin/rm`
- SED=`find_program /usr/bin/sed`
- SORT=`find_program /usr/bin/sort`
- X
- # For the DEC ALPHA 3000, the sync program resides in /usr/sbin/sync,
- # all others use /usr/bin/sync.
- X
- SYNC=`find_program /usr/bin/sync /usr/sbin/sync`
- X
- # For SunOS 4.1.x, use the SysV version of tr, in /usr/5bin/tr, all
- # others use /usr/bin/tr.
- X
- TR=`find_program /usr/5bin/tr /usr/bin/tr`
- UNIQ=`find_program /usr/bin/uniq`
- X
- # Default inverted index file name.
- X
- DB_NAME="qt.index"
- X
- # Default temporary inverted index file base name.
- X
- TMP_NAME="qt.index"
- X
- # If the environmental variable, TMPDIR, exists, then that directory
- # will be used for all temporary files, if not, then /tmp will be used.
- # The temporary file names in TMPDIR are always the basename of the
- # database, concatenated with a '-', a unique character that identifies
- # the temporary file to this script, a '.', and this script's pid.
- X
- # Temporary file directory name.
- X
- TEMP_DIR="${TMPDIR:-/tmp}"
- X
- # Temporary inverted index file name. There are two alternatives here.
- # The database is updated by backing up the current database (with the
- # ${MV} command) to a different name in its current directory. Then the
- # new database is moved (again with the ${MV} command) from its
- # temporary name to the database name. In some systems, if the /tmp
- # directory is on a different disk partition, ${MV} will have to copy
- # the data across the file systems to perform the move. During this
- # time, the database is vulnerable to power outages, etc. If the
- # temporary database is in the same directory as the current database,
- # the update will not involve any data transfer on the disk (ie., only
- # a name change.) On one hand, constructing the temporary database in
- # its home directory lowers the risk of corruption if the machine goes
- # down, but on the other hand, it could leave the temporary file-which
- # can be large-when the machine comes up. (Note that the /tmp directory
- # is purged during the boot process.) The two options are:
- X
- # TMP_DB="${TEMP_DIR}/${TMP_NAME}-1.$$"
- TMP_DB="${DB_NAME}.NEW"
- X
- # List of all temporary files:
- #
- # "${TEMP_DIR}/${TMP_NAME}-1.$$", is the temporary file name of
- # the new inverted index file. (Either in read or write modes.)
- #
- # "${TEMP_DIR}/${TMP_NAME}-2.$$", is the temporary file name
- # where anything that is to be added to the inverted index file
- # is temporarily held. (Either in read or write modes.)
- #
- # "${TEMP_DIR}/${TMP_NAME}-3.$$", is the temporary file name
- # where anything that is to be added to the inverted index file
- # is temporarily held. (In read mode only.)
- #
- # "${TEMP_DIR}/${TMP_NAME}-E.$$", is a temporary file name that
- # contains information about error conditions. If it does not
- # exist, or exists and is zero length, then no error occured.
- X
- TEMP_FILES="${TEMP_DIR}/${TMP_NAME}-1.$$ ${TEMP_DIR}/${TMP_NAME}-2.$$ ${TEMP_DIR}/${TMP_NAME}-3.$$ ${TEMP_DIR}/${TMP_NAME}-E.$$ ${TMP_DB}"
- X
- # Default lexical analyzer.
- X
- LEXICO=4
- X
- # RELEVANCE_ARGUMENTS, a list of the words being queried for in
- # read_index().
- X
- RELEVANCE_ARGUMENTS=""
- X
- # RELEVANCE_ATTRIBUTE, 1 = include "${RELEVANCE_ARGUMENTS}" as the
- # first record ouput from read_index(), 0 = do not output
- # "${RELEVANCE_ARGUMENTS}"
- X
- RELEVANCE_ATTRIBUTE=0
- X
- # Mode of operations:
- #
- # READ_MODE, read only mode = 1.
- #
- # WRITE_MODE, write mode = 2.
- #
- # DELETE_WORDS, delete words mode = 3.
- #
- # DELETE_FILES, delete files mode = 4.
- #
- # RELEVANCE_COUNT, relevance count mode = 5.
- #
- # RELEVANCE_PROXIMITY, relevance proximity mode = 6.
- X
- READ_MODE=1
- WRITE_MODE=2
- DELETE_WORDS=3
- DELETE_FILES=4
- RELEVANCE_COUNT=5
- RELEVANCE_PROXIMITY=6
- X
- # Default mode of operation.
- X
- OP_MODE="${READ_MODE}"
- X
- # Default termination character used by ${LOOK} program. Setting this
- # to "${TAB}" will allow "exact key" searches for the words in the
- # inverted index file. The default, which is null, is to allow "partial
- # key" searches.
- X
- END=
- X
- # Tab character, used to specify the field delimiter for the ${JOIN}
- # program. This is chosen as a character that will never be in the
- # inverted index file, since all white space is to be parsed away. Note
- # that this file should never be detab'ed.
- X
- TAB=' '
- X
- # The help function. Prints to stdout, so that it can be redirected to
- # a file. The function requires no arguments.
- X
- help()
- {
- X ${ECHO} "This program will create and query inverted index files that index"
- X ${ECHO} "the words in text files. These indices are useful in information"
- X ${ECHO} "retrieval systems. The inverted index files are, typically, about the"
- X ${ECHO} "same size of the text files, and do not require the text files to be"
- X ${ECHO} "present for query operations. The query functions, typically, consist"
- X ${ECHO} "of boolean operations on word searches. The output of the query is,"
- X ${ECHO} "typically, a list of the file names that contain the queried word(s)."
- X ${ECHO} ""
- X ${ECHO} "The read synopsis is:"
- X ${ECHO} ""
- X ${ECHO} " qt [-e] [-r | -rc | -rp] [-f index_name] word1 [op1] word2 [op2] ..."
- X ${ECHO} ""
- X ${ECHO} "where word1-word2 ... are the words to be queried in the inverted"
- X ${ECHO} "index, and op1-op2 ... are the operations to be performed on the set"
- X ${ECHO} "of file names that contain these words. The word/operation arguments"
- X ${ECHO} "consist of pairs of search words, and boolean operators, with a left"
- X ${ECHO} "to right operational precedence."
- X ${ECHO} ""
- X ${ECHO} "Thus if A, B, and C are words, then the query:"
- X ${ECHO} ""
- X ${ECHO} " A and B or C not D"
- X ${ECHO} ""
- X ${ECHO} "would specify that all file names containing word A should be found,"
- X ${ECHO} "then all the file names containing word B should be found, and only"
- X ${ECHO} "those file names that contain words A and B should be added to those"
- X ${ECHO} "file names containing word C, and then if these file names do not"
- X ${ECHO} "contain word D, they are output."
- X ${ECHO} ""
- X ${ECHO} "Logical \"or'ing\" is implicit, thus:"
- X ${ECHO} ""
- X ${ECHO} " A B C"
- X ${ECHO} ""
- X ${ECHO} "is identical to:"
- X ${ECHO} ""
- X ${ECHO} " A or B or C"
- X ${ECHO} ""
- X ${ECHO} "Obviously, the keywords, \"and,\" \"not,\" and \"or,\" may not be queried"
- X ${ECHO} "for, when using the implicit \"or\" query constructs."
- X ${ECHO} ""
- X ${ECHO} "If the \"-e\" option is specified, then \"exact match\" queries will be"
- X ${ECHO} "performed, otherwise, a \"partial key\" type of search will be"
- X ${ECHO} "performed, which is the default. It is recommended that the \"-e\""
- X ${ECHO} "option be used if the query involves any boolean operations."
- X ${ECHO} ""
- X ${ECHO} "If the \"-r\" option is specified, then the words being queried for"
- X ${ECHO} "will be output before the list file names that contain the queried"
- X ${ECHO} "words. This output format is compatible with egrep(1), and is useful"
- X ${ECHO} "in doing \"relevance feedback\" searches."
- X ${ECHO} ""
- X ${ECHO} "If the \"-rc\" option is specified, then the count of records in a file"
- X ${ECHO} "that contain match(s) will be output with the file name containing"
- X ${ECHO} "the matches. This provides the system with a remedial \"relevance"
- X ${ECHO} "feedback\" capability. The original text files that were used to"
- X ${ECHO} "construct the inverted index file must be available in the system to"
- X ${ECHO} "use this option."
- X ${ECHO} ""
- X ${ECHO} "If the \"-rp\" option is specified, then the records in a file that"
- X ${ECHO} "contain match(s) will be output after the file name containing the"
- X ${ECHO} "matches. This output format provides the system with a remedial"
- X ${ECHO} "\"permuted index\" type of \"proximity retrieval.\" The original text"
- X ${ECHO} "files that were used to construct the inverted index must be"
- X ${ECHO} "available in the system to use this option."
- X ${ECHO} ""
- X ${ECHO} "The write synopsis is:"
- X ${ECHO} ""
- X ${ECHO} " qt -w [-f index_name] [-1 | ... | -8] file1 file2 ..."
- X ${ECHO} ""
- X ${ECHO} "where file1 file2 ... are the file names that contain words that are"
- X ${ECHO} "to be added to the inverted index, or:"
- X ${ECHO} ""
- X ${ECHO} " qt -w [-f index_name] [-1 | ... | -8] < file_list"
- X ${ECHO} ""
- X ${ECHO} "where file_list is the name of a file that contains a list of file"
- X ${ECHO} "names, one file name per record, that contain words that are to be"
- X ${ECHO} "added to the inverted index."
- X ${ECHO} ""
- X ${ECHO} "It is recommended that file names contain the absolute path to the"
- X ${ECHO} "system's root directory."
- X ${ECHO} ""
- X ${ECHO} "If the inverted index file does not exist, then it will be created,"
- X ${ECHO} "and contain an index to all of the words in the input files. If the"
- X ${ECHO} "inverted index file exists, then the indices of all words in the"
- X ${ECHO} "input files will be added, incrementally. Instances of words and"
- X ${ECHO} "filename pairs will be unique in the inverted index."
- X ${ECHO} ""
- X ${ECHO} "The \"-w\" option, specifies that write operations will be performed,"
- X ${ECHO} "and is a mandatory option, to be used if and only if write operations"
- X ${ECHO} "are desired."
- X ${ECHO} ""
- X ${ECHO} "The \"-f index_name,\" optionally, specifies the inverted index file's"
- X ${ECHO} "name. If the \"-f\" option is not specified, the inverted index file"
- X ${ECHO} "name will default to \"qt.index.\""
- X ${ECHO} ""
- X ${ECHO} "The lexical analyzer level is specified by, \"-1\", through \"-8\". If"
- X ${ECHO} "none are specified, the default, \"-4\", will be used. The lexical"
- X ${ECHO} "analyzers with larger numbers are, generally, more sophisticated"
- X ${ECHO} "about the words that are placed in the inverted index. The lexical"
- X ${ECHO} "analyzers available are:"
- X ${ECHO} ""
- X ${ECHO} " 1) Parses words and numbers. All other characters are omitted."
- X ${ECHO} " Capitalization is preserved. Probably the best choice if"
- X ${ECHO} " non-word searches are important."
- X ${ECHO} ""
- X ${ECHO} " 2) Like 1) above, but the '_' character is recognized. This"
- X ${ECHO} " parser seems to work well with \"C\" program source files."
- X ${ECHO} ""
- X ${ECHO} " 3) Like 1) above, but, only words of more than two characters"
- X ${ECHO} " are placed in the inverted index file. If capitalization is"
- X ${ECHO} " considered important in the search criteria, then this seems"
- X ${ECHO} " to be the best choice."
- X ${ECHO} ""
- X ${ECHO} " 4) Like 1) above, but, capitalization is ignored. For general"
- X ${ECHO} " text where all words and numbers are considered significant,"
- X ${ECHO} " this seems to be the best choice. Also seems a good choice for"
- X ${ECHO} " \"catman\" pages. Queries should be in lowercase."
- X ${ECHO} ""
- X ${ECHO} " 5) Like 3) above, but capitalization is ignored. For general"
- X ${ECHO} " text, this seems to be the best choice. Queries should be in"
- X ${ECHO} " lowercase."
- X ${ECHO} ""
- X ${ECHO} " 6) Like 4) above, but words containing only numbers are"
- X ${ECHO} " omitted from the inverted index file. For text containing only"
- X ${ECHO} " words, this seems to be the best choice. Queries should be in"
- X ${ECHO} " lowercase."
- X ${ECHO} ""
- X ${ECHO} " 7) Like 4) above, but does not include Unix mail headers in"
- X ${ECHO} " the inverted index file. Each email should be in a separate"
- X ${ECHO} " file, as opposed to concatenated into folders. This seems to"
- X ${ECHO} " be the best choice for Unix mail files, if the header"
- X ${ECHO} " information is not desirable."
- X ${ECHO} ""
- X ${ECHO} " 8) Like 4) above, but deletes TeX and/or LaTeX commands from"
- X ${ECHO} " the inverted index file. This seems to be the best choice for"
- X ${ECHO} " TeX and LaTeX documents."
- X ${ECHO} ""
- X ${ECHO} "The more sophisticated the parser, the smaller the size of the"
- X ${ECHO} "inverted index file. Multiple runs can be made, using the different"
- X ${ECHO} "parsers, to store words in the inverted index. For example, using"
- X ${ECHO} "parsers 3) and 4) would place both the capitalized and"
- X ${ECHO} "non-capitalized words in the index. This would not duplicate any"
- X ${ECHO} "words already in the index-only add the words that were different."
- X ${ECHO} ""
- X ${ECHO} "The remove words synopsis is:"
- X ${ECHO} ""
- X ${ECHO} " qt -w -dw [-f index_name] word1 word2 ..."
- X ${ECHO} ""
- X ${ECHO} "where word1 word2 ... are the words that are to be deleted from the"
- X ${ECHO} "inverted index file, and may be a regular expressions-no '^' or '$'"
- X ${ECHO} "characters should be used, unless they are escaped. The \"-w\" option"
- X ${ECHO} "is mandatory."
- X ${ECHO} ""
- X ${ECHO} "The remove files synopsis is:"
- X ${ECHO} ""
- X ${ECHO} " qt -w -df [-f index_name] file1 file2 ..."
- X ${ECHO} ""
- X ${ECHO} "where file1 file2 ... are the file names that are to be deleted from"
- X ${ECHO} "inverted index. The file names to be deleted from the inverted index"
- X ${ECHO} "file may be regular expressions-no '^' or '$' characters should be"
- X ${ECHO} "used, unless they are escaped. The \"-w\" option is mandatory."
- X ${ECHO} ""
- X ${ECHO} "The version synopsis is:"
- X ${ECHO} ""
- X ${ECHO} " qt -v"
- X ${ECHO} ""
- X ${ECHO} "which will print the version number of qt."
- X ${ECHO} ""
- X ${ECHO} "The help synopsis is:"
- X ${ECHO} ""
- X ${ECHO} " qt -h"
- X ${ECHO} ""
- X ${ECHO} "which will list a synopsis of the command semantics."
- X ${ECHO} ""
- X ${ECHO} "A common example of writing an inverted index file would be:"
- X ${ECHO} ""
- X ${ECHO} " find /dir1/dir2 -type f -print | qt -w"
- X ${ECHO} ""
- X ${ECHO} "which would recursively descend through the directory hierarchy, and"
- X ${ECHO} "create an inverted index of all of the words in all of the files in"
- X ${ECHO} "all of the directories, starting with /dir1/dir2."
- X ${ECHO} ""
- X ${ECHO} "A common example of retrieving information from an inverted index"
- X ${ECHO} "file would be:"
- X ${ECHO} ""
- X ${ECHO} " more +/word \`qt word\`"
- X ${ECHO} ""
- X ${ECHO} "where the \"more\" program would page through the documents that"
- X ${ECHO} "contain \"word,\" advancing to the next instance every time the 'n' key"
- X ${ECHO} "is depressed."
- X ${ECHO} ""
- X ${ECHO} "A common example of relevance determination in retrieving information"
- X ${ECHO} "from an inverted index file would be:"
- X ${ECHO} ""
- X ${ECHO} " egrep -ic \`qt -r word\` | sort -n -r -t: +1"
- X ${ECHO} ""
- X ${ECHO} "which would print the file(s) that contain \"word,\" with the count of"
- X ${ECHO} "the instances of records that contain \"word\" in each of the file(s)."
- }
- X
- # Query the inverted index file for word(s). If "${END}" is a "${TAB}",
- # then only exact matches will be found. If "${END}" is null, then
- # "partial key" types of operations will be performed. The command line
- # arguments consist of pairs of search words, and boolean operators,
- # with a left to right operational precedence.
- #
- # Thus if A, B, and C are words, then the query:
- #
- # A and B or C not D
- #
- # would specify that all file names containing word A should be found,
- # then all the file names containing word B should be found, and only
- # those file names that contain words A and B should be added to those
- # file names containing word C, and then if these file names do not
- # contain word D, they are output.
- #
- # Logical "or'ing" is implicit, thus:
- #
- # A B C
- #
- # is identical to:
- #
- # A or B or C
- #
- # The boolean operators supported are:
- #
- # and, which is implemented with the ${JOIN} program to perform a
- # "natural join" of two files, each file containing a list of the
- # file names which contain specific word(s).
- #
- # not, which is implemented with the ${JOIN} program to perform an
- # "natural join" operation, the output of which is combined with
- # the first file, using the ${SORT} -m program, and piped to the
- # ${UNIQ} -u program so that only those file names that are unique
- # to the first file are output.
- #
- # or, which is implemented by ${SORT} -mu to concatenate the two
- # files together, each record being unique.
- #
- # The ${LOOK} program is used to perform a binary search on the inverted
- # index file, (which is made up of ASCII records, each record containing
- # a word in a file, and the file name, separated by a single "${TAB}".)
- # The inverted index file is sorted in the ASCII collation sequence of
- # the words. The output of the ${LOOK} program is also sorted in ASCII
- # collation sequence. The words are striped from the records output from
- # the ${LOOK} program with the ${SED} 's/.* //' program.
- #
- # The ${JOIN} program uses the -tc option, where c is a character that
- # can never be in the inverted index.
- #
- # Note, the use of "-e" for exact word match operations is recommended
- # when doing boolean searches.
- #
- # The variable, ${RELEVANCE_ARGUMENTS}, contains a running list of the
- # search words, separated by a pipe symbol, '|'. This argument is is
- # useful for piping to ${EGREP} for further refined searches.
- #
- # The rules of concatenation are:
- #
- # 1) initial operator argument, concatenate word only.
- #
- # 2) "or" operator argument, concatenate '|' and word to
- # beginning.
- #
- # 3) "not" operator argument, do not concatenate.
- #
- # 4) "and" operator argument, replace concatenated string with
- # word.
- #
- # This variable, at the conclusion of read_index(), will contain a
- # regular expression that is compatible with the first argument of
- # ${EGREP}. If the output of qt is piped to ${EGREP} with the file names
- # as additional arguments, then the words found in the records of the
- # files will approximately equal what the ${LOOK} program found in the
- # inverted index file.
- #
- # The arguments are the list of word/operators.
- X
- read_index ()
- {
- X # If no arguments, return.
- X
- X if [ "$#" -eq "0" ]; then
- X return
- X fi
- X
- X # If the file does not exist, or is unreadable, exit.
- X
- X if [ ! -f "${DB_NAME}" -o ! -r "${DB_NAME}" ]; then
- X ${ECHO} "The inverted index file does not exist, or is unreadable, aborting." 1>&2
- X exit 1
- X fi
- X
- X # For each query argument:
- X
- X while [ "$#" -ne "0" ]
- X do
- X case "$1" in
- X and)
- X
- X # The query operator was an "and", shift over it.
- X
- X shift
- X if [ -f "${TEMP_DIR}/${TMP_NAME}-1.$$" ]; then
- X
- X # A partial file list exists, clear any
- X # ${RELEVANCE_ARGUMENTS} and start a new one.
- X
- X RELEVANCE_ARGUMENTS="$1"
- X
- X # Find the list of word/file name pairs, striping
- X # the words, and make this list unique.
- X
- X ${LOOK} "$1${END}" "${DB_NAME}" | ${SED} "s/.*${TAB}//" | ${UNIQ} > "${TEMP_DIR}/${TMP_NAME}-2.$$"
- X
- X # ${JOIN} this list with the existing partial file
- X # list.
- X
- X ${JOIN} "-t${TAB}" "${TEMP_DIR}/${TMP_NAME}-1.$$" "${TEMP_DIR}/${TMP_NAME}-2.$$" > "${TEMP_DIR}/${TMP_NAME}-3.$$"
- X
- X # This becomes the new partial file list.
- X
- X ${MV} -f "${TEMP_DIR}/${TMP_NAME}-3.$$" "${TEMP_DIR}/${TMP_NAME}-1.$$"
- X else
- X
- X # A partial file list does not exist, start one,
- X # clear any ${RELEVANCE_ARGUMENTS} , and start a new
- X # one.
- X
- X RELEVANCE_ARGUMENTS="$1"
- X
- X # Find the list of word/file name pairs, striping
- X # the words, and make this list unique.
- X
- X ${LOOK} "$1${END}" "${DB_NAME}" | ${SED} "s/.*${TAB}//" | ${UNIQ} > "${TEMP_DIR}/${TMP_NAME}-1.$$"
- X fi; shift;;
- X not)
- X
- X # The query operator was an "not", shift over it.
- X
- X shift
- X if [ -f "${TEMP_DIR}/${TMP_NAME}-1.$$" ]; then
- X
- X # A partial file list exists, don't add this to
- X # ${RELEVANCE_ARGUMENTS}, find the list of word/file
- X # name pairs, striping the words, and make this list
- X # unique.
- X
- X ${LOOK} "$1${END}" "${DB_NAME}" | ${SED} "s/.*${TAB}//" | ${UNIQ} > "${TEMP_DIR}/${TMP_NAME}-2.$$"
- X
- X # ${JOIN} this list with the existing partial file
- X # list.
- X
- X ${JOIN} "-t${TAB}" "${TEMP_DIR}/${TMP_NAME}-1.$$" "${TEMP_DIR}/${TMP_NAME}-2.$$" > "${TEMP_DIR}/${TMP_NAME}-3.$$"
- X
- X # ${SORT} -m this list with the existing partial
- X # file list.
- X
- X ${SORT} -m "${TEMP_DIR}/${TMP_NAME}-1.$$" "${TEMP_DIR}/${TMP_NAME}-3.$$" | ${UNIQ} -u > "${TEMP_DIR}/${TMP_NAME}-2.$$"
- X
- X # This becomes the new partial file list.
- X
- X ${MV} -f "${TEMP_DIR}/${TMP_NAME}-2.$$" "${TEMP_DIR}/${TMP_NAME}-1.$$"
- X else
- X
- X # A partial file list does not exist-start one,
- X # don't add this to # ${RELEVANCE_ARGUMENTS}, find
- X # the list of word/file name pairs, striping the
- X # words, and make this list unique.
- X
- X ${LOOK} "$1${END}" "${DB_NAME}" | ${SED} "s/.*${TAB}//" | ${UNIQ} > "${TEMP_DIR}/${TMP_NAME}-1.$$"
- X fi; shift;;
- X or)
- X
- X # The query operator was an "or", shift over it.
- X
- X shift
- X if [ -f "${TEMP_DIR}/${TMP_NAME}-1.$$" ]; then
- X
- X # A partial file list exists, add this with a
- X # leading '|, to the ${RELEVANCE_ARGUMENTS}.
- X
- X RELEVANCE_ARGUMENTS="$1|${RELEVANCE_ARGUMENTS}"
- X
- X # Find the list of word/file name pairs, striping
- X # the words, and make this list unique.
- X
- X ${LOOK} "$1${END}" "${DB_NAME}" | ${SED} "s/.*${TAB}//" | ${UNIQ} > "${TEMP_DIR}/${TMP_NAME}-2.$$"
- X
- X # ${SORT} -mu this list with the existing partial
- X # file list.
- X
- X ${SORT} -mu "${TEMP_DIR}/${TMP_NAME}-1.$$" "${TEMP_DIR}/${TMP_NAME}-2.$$" > "${TEMP_DIR}/${TMP_NAME}-3.$$"
- X
- X # This becomes the new partial file list.
- X
- X ${MV} -f "${TEMP_DIR}/${TMP_NAME}-3.$$" "${TEMP_DIR}/${TMP_NAME}-1.$$"
- X else
- X
- X # A partial file list does not exist, add this with a
- X # leading '|, to the ${RELEVANCE_ARGUMENTS}.
- X
- X RELEVANCE_ARGUMENTS="$1"
- X
- X # the list of word/file name pairs, striping the
- X # words, and make this list unique.
- X
- X ${LOOK} "$1${END}" "${DB_NAME}" | ${SED} "s/.*${TAB}//" | ${UNIQ} > "${TEMP_DIR}/${TMP_NAME}-1.$$"
- X fi; shift;;
- X *)
- X
- X # The query operator was not an "and", "not", or "or".
- X
- X if [ -f "${TEMP_DIR}/${TMP_NAME}-1.$$" ]; then
- X
- X # A partial file list exists, add this with a
- X # leading '|, to the ${RELEVANCE_ARGUMENTS}.
- X
- X RELEVANCE_ARGUMENTS="$1|${RELEVANCE_ARGUMENTS}"
- X
- X # Find the list of word/file name pairs, striping
- X # the words, and make this list unique.
- X
- X ${LOOK} "$1${END}" "${DB_NAME}" | ${SED} "s/.*${TAB}//" | ${UNIQ} > "${TEMP_DIR}/${TMP_NAME}-2.$$"
- X
- X # ${SORT} -mu this list with the existing partial
- X # file list.
- X
- X ${SORT} -mu "${TEMP_DIR}/${TMP_NAME}-1.$$" "${TEMP_DIR}/${TMP_NAME}-2.$$" > "${TEMP_DIR}/${TMP_NAME}-3.$$"
- X
- X # This becomes the new partial file list.
- X
- X ${MV} -f "${TEMP_DIR}/${TMP_NAME}-3.$$" "${TEMP_DIR}/${TMP_NAME}-1.$$"
- X else
- X
- X # A partial file list does not exist, add this with a
- X # leading '|, to the ${RELEVANCE_ARGUMENTS}.
- X
- X RELEVANCE_ARGUMENTS="$1"
- X
- X # Find the list of word/file name pairs, striping
- X # the words, and make this list unique.
- X
- X ${LOOK} "$1${END}" "${DB_NAME}" | ${SED} "s/.*${TAB}//" | ${UNIQ} > "${TEMP_DIR}/${TMP_NAME}-1.$$"
- X fi; shift;;
- X esac
- X done
- X
- X # If a request for egrep(1) compatable regular expression search
- X # word to be output prior to file names option, the output it.
- X
- X if [ ! "${RELEVANCE_ATTRIBUTE}" -eq "0" ]; then
- X ${ECHO} "${RELEVANCE_ARGUMENTS}"
- X fi
- X
- X # Make shure that each file name in the list is unique.
- X
- X ${SORT} -u "${TEMP_DIR}/${TMP_NAME}-1.$$"
- X
- X # Remove any temporary files.
- X
- X ${RM} -f "${TEMP_DIR}/${TMP_NAME}-1.$$" "${TEMP_DIR}/${TMP_NAME}-2.$$"
- }
- X
- # Index the words in the files. Sort the index, lexographically, on the
- # words, making sure that word file name pairs are unique. Write the
- # output to a temporary file. If any errors, write the errors to
- # "${TEMP_DIR}/${TMP_NAME}-E.$$". Test this file for zero length before
- # proceeding.
- #
- # The required arguments are the list of file names, or none, in which
- # case the file names will be read from the stdin.
- X
- write_index()
- {
- X if [ "$#" -eq "0" ]; then
- X
- X # There are no arguments, read each file name from stdin.
- X
- X while read file_name
- X do
- X
- X # Parse the words from the file.
- X
- X parse_word "${file_name}"
- X done
- X else
- X
- X # There are arguments, for each one of them, read the file.
- X
- X for file_name
- X do
- X
- X # Parse the words from the file.
- X
- X parse_word "${file_name}"
- X done
- X fi | { ${SORT} -T "${TEMP_DIR}" -u > "${TEMP_DIR}/${TMP_NAME}-2.$$"; } 2>> "${TEMP_DIR}/${TMP_NAME}-E.$$"
- X
- X # If any errors indexing files, abort.
- X
- X if [ -s "${TEMP_DIR}/${TMP_NAME}-E.$$" ]; then
- X
- X # There was an error, abort.
- X
- X exit 1
- X fi
- X
- X # If an inverted index file already exists, merge it, uniquely, with the
- X # new temporary inverted index file, else, move the the temporary
- X # inverted index file to that name.
- X
- X if [ -f "${DB_NAME}" ]; then
- X
- X # An inverted index file exists, merge the new inverted index
- X # with it, using ${SORT} -mu.
- X
- X ${SORT} -T "${TEMP_DIR}" -mu "${TEMP_DIR}/${TMP_NAME}-2.$$" "${DB_NAME}" > "${TMP_DB}"
- X if [ ! "$?" -eq "0" ]; then
- X
- X # Couldn't merge the two files, abort.
- X
- X exit 1
- X fi
- X else
- X
- X # An inverted index file does not exist, ${MV} the new inverted
- X # index as this file.
- X
- X ${MV} -f "${TEMP_DIR}/${TMP_NAME}-2.$$" "${TMP_DB}"
- X if [ ! "$?" -eq "0" ]; then
- X
- X # Couldn't move the file, abort.
- X
- X exit 1
- X fi
- X fi
- X
- X # Updataing the temporary inverted index file has been completed.
- X # Update the inverted index file.
- X
- X update_database "${DB_NAME}" "${TMP_DB}"
- }
- X
- # Update the database function-update the database as described above.
- #
- # 1) Backup the original inverted index file is backed up, using
- # the "mv" command. If this operation successful, then step 2)
- # is executed, if not, the script aborts.
- #
- # 2) After the original inverted index file has been backed up,
- # the temporary inverted index file is moved, using the "mv"
- # command, as the new inverted index file. If this operation is
- # successful, then step 3) is executed, if not, the script
- # aborts.
- #
- # 3) After the temporary inverted index file has been moved, the
- # original inverted index file backup is removed. If this
- # operation is successful, the script exits normally, if not,
- # the script aborts.
- #
- # On any error, this function aborts, after trying to restore the
- # original database. The arguments to this function are:
- #
- # "$1" is the original database's name.
- #
- # "$2" is the new database's name.
- X
- update_database()
- {
- X # Ignore interrupts.
- X
- X trap '' 0 1 2 3 15
- X
- X # If the Inverted index database does not exist, or is unreadable or
- X # unwritable, exit. Note that the original index may not exist, yet.
- X
- X if [ -b "$1" -o -c "$1" -o -d "$1" -o -p "$1" ]; then
- X ${ECHO} "The inverted index file does not exist, or is not writable or readable, aborting." 1>&2
- X ${RM} -f "${TEMP_FILES}"
- X exit 1
- X fi
- X
- X # If the original inverted index exists, back it up, if that fails,
- X # attempt to restore it.
- X
- X if [ -f "$1" ]; then
- X
- X # Inverted index file exists.
- X
- X if [ -w "$1" ]; then
- X
- X # Inverted index file is writable, back it up.
- X
- X ${MV} -f "$1" "$1.BAK"
- X if [ ! "$?" -eq "0" ]; then
- X
- X # Backup was not successful, attempt to restore.
- X
- X if [ -f "$1.BAK" ]; then
- X
- X # Backup file exists, print the message that an
- X # attemp to restore is underway.
- X
- X ${ECHO} "Error backing up original index file, attempting to restore." 1>&2
- X
- X # Remove any temporary files to attempt to open up
- X # disk space.
- X
- X ${RM} -f "${TEMP_FILES}"
- X
- X # Restore the backup inverted index file as the
- X # original inverted index file.
- X
- X ${MV} -f "$1.BAK" "$1"
- X if [ "$?" -eq "0" ]; then
- X
- X # Backup succeeded, notify the user.
- X
- X ${ECHO} "Restoration of original index file succeeded, aborting." 1>&2
- X else
- X
- X # Backup failed, notify the user.
- X
- X ${ECHO} "Restoration of original index file failed, aborting." 1>&2
- X fi
- X ${SYNC}
- X ${SYNC}
- X else
- X
- X # The original inverted index file was never
- X # backed up, test for the original.
- X
- X if [ -f "$1" ]; then
- X
- X # Original inverted index file exists, notify
- X # the user.
- X
- X ${ECHO} "Restoration of original index file succeeded, aborting." 1>&2
- X else
- X
- X # Something is terribly wrong-the original
- X # inverted index and its backp are both gone.
- X
- X ${ECHO} "Restoration of original index file failed, aborting." 1>&2
- X fi
- X
- X # Remove all temporary files.
- X
- X ${RM} -f "${TEMP_FILES}"
- X fi
- X exit 1
- X fi
- X else
- X
- X # Inverted index file is not writable, abort.
- X
- X ${ECHO} "The index file is a not writeable, aborting." 1>&2
- X ${RM} -f "${TEMP_FILES}"
- X exit 1
- X fi
- X fi
- X
- X # The original inverted index file, if it existed, is now backed up,
- X # move the temporary inverted index file as the new inverted index
- X # file.
- X
- X ${MV} -f "$2" "$1"
- X
- X # Erase the original inverted index backup file, if it exists, if that
- X # fails, attempt to restore it.
- X
- X if [ "$?" -eq "0" ]; then
- X
- X # The new inverted index has been moved into place, erase the
- X # backup inverted index file, if it exists.
- X
- X if [ -f "$1.BAK" ]; then
- X
- X # The backup inverted index file exists, erase it.
- X
- X ${RM} -f "$1.BAK"
- X if [ ! "$?" -eq "0" ]; then
- X
- X # The removal of the backup inverted index file
- X # failed, notify the user.
- X
- X ${ECHO} "Error removing original index backup, attempting to restore." 1>&2
- X
- X # Remove any temporary files to attempt to open up
- X # disk space.
- X
- X ${RM} -f "${TEMP_FILES}"
- X
- X # Restore the backup inverted index file as the
- X # original inverted index file.
- X
- X ${MV} "$1.BAK" "$1"
- X if [ "$?" -eq "0" ]; then
- X
- X # Backup succeeded, notify the user.
- X
- X ${ECHO} "Restoration of original index file succeeded, aborting." 1>&2
- X else
- X
- X # Backup failed, notify the user.
- X
- X ${ECHO} "Restoration of original index file failed, aborting." 1>&2
- X fi
- X ${SYNC}
- X ${SYNC}
- X exit 1
- X fi
- X ${SYNC}
- X ${SYNC}
- X fi
- X else
- X
- X # The ${MV} of the new inverted index file failed, attempt to
- X # restore the backup file.
- X
- X if [ -f "$1.BAK" ]; then
- X
- X # Backup file exists, print the message that an attemp to
- X # restore is underway.
- X
- X ${ECHO} "Error writing new index file, attempting to restore." 1>&2
- X
- X # A backup file exists, Remove any temporary files to
- X # attempt to open up disk space.
- X
- X ${RM} -f "${TEMP_FILES}"
- X
- X # Restore the backup inverted index file as the original
- X # inverted index file.
- X
- X ${MV} -f "$1.BAK" "$1"
- X if [ "$?" -eq "0" ]; then
- X
- X # Backup succeeded, notify the user.
- X
- X ${ECHO} "Restoration of original index file succeeded, aborting." 1>&2
- X
- X else
- X
- X # Backup failed, notify the user.
- X
- X ${ECHO} "Restoration of original index file failed, aborting." 1>&2
- X fi
- X ${SYNC}
- X ${SYNC}
- X else
- X
- X # The original inverted index file was never backed up,
- X # test for the original.
- X
- X if [ -f "$1" ]; then
- X
- X # Original inverted index file exists, notify the user.
- X
- X ${ECHO} "Restoration of original index file succeeded, aborting." 1>&2
- X else
- X
- X # Something is terribly wrong-the original inverted
- X # index and its backp are both gone.
- X
- X ${ECHO} "Restoration of original index file failed, aborting." 1>&2
- X fi
- X
- X # Remove all temporary files.
- X
- X ${RM} -f "${TEMP_FILES}"
- X fi
- X exit 1
- X fi
- }
- X
- # Lexical analysis function. The words in the file are parsed, one word
- # per record, and concatenated with a "${TAB}", and the file's name.
- # The records are output to the standard output. If an error occurs in
- # any of the programs in the pipe, it is included in the file, named
- # "${TEMP_DIR}/${TMP_NAME}-E.$$". If any errors occurred, exit.
- #
- # Note: This parser is not particularly elegant or fast. As a better
- # solution, the program "stopper," from reference 1) above (Frakes, et
- # al,) can be adapted to work quite well. This program is implemented as
- # an FSM, and supports stop words.
- #
- # The required argument is the file name to be parsed.
- X
- parse_word()
- {
- X # If no arguments, exit.
- X
- X if [ "$#" -eq "0" ]; then
- X ${ECHO} "No file name specified, aborting." 1>&2
- X exit 1
- X fi
- X
- X # If the file does not exist, or is unreadable, exit.
- X
- X if [ ! -f "$1" -o ! -r "$1" ]; then
- X ${ECHO} "File name $1 does not exist or is not readable, aborting." 1>&2
- X exit 1
- X fi
- X
- X # Select the lexical analyzer, and parse the words in the file.
- X
- X case "$LEXICO" in
- X 1) { ${TR} -cs '[a-z][A-Z][0-9]' '[\012*]' < "$1" | ${SED} "/^ *$/d;/^ /d;s,$,${TAB}$1,"; } 2>> "${TEMP_DIR}/${TMP_NAME}-E.$$";;
- X 2) { ${TR} -cs '[a-z][A-Z]_[0-9]' '[\012*]' < "$1" | ${SED} "/^ *$/d;/^ /d;s,$,${TAB}$1,"; } 2>> "${TEMP_DIR}/${TMP_NAME}-E.$$";;
- X 3) { ${TR} -cs '[a-z][A-Z][0-9]' '[\012*]' < "$1" | ${SED} "/^ *$/d;/^.$/d;/^..$/d;/^ /d;s,$,${TAB}$1,"; } 2>> "${TEMP_DIR}/${TMP_NAME}-E.$$";;
- X 4) { ${TR} '[A-Z]' '[a-z]' < "$1" | ${TR} -cs '[a-z][0-9]' '[\012*]' | ${SED} "/^ *$/d;/^ /d;s,$,${TAB}$1,"; } 2>> "${TEMP_DIR}/${TMP_NAME}-E.$$";;
- X 5) { ${TR} '[A-Z]' '[a-z]' < "$1" | ${TR} -cs '[a-z][0-9]' '[\012*]' | ${SED} "/^ *$/d;/^.$/d;/^..$/d;/^ /d;s,$,${TAB}$1,"; } 2>> "${TEMP_DIR}/${TMP_NAME}-E.$$";;
- X 6) { ${TR} '[A-Z]' '[a-z]' < "$1" | ${TR} -cs '[a-z][0-9]' '[\012*]' | ${EGREP} -v '^[0-9]*$' | ${SED} "/^ *$/d;/^.$/d;/^..$/d;/^ /d;s,$,${TAB}$1,"; } 2>> "${TEMP_DIR}/${TMP_NAME}-E.$$";;
- X 7) { ${SED} -e '1,/^ *$/d' < "$1" | ${TR} '[A-Z]' '[a-z]' | ${TR} -cs '[a-z][0-9]' '[\012*]' | ${SED} "/^ *$/d;/^ /d;s,$,${TAB}$1,"; } 2>> "${TEMP_DIR}/${TMP_NAME}-E.$$";;
- X 8) { ${TR} '[A-Z]' '[a-z]' < "$1" | ${TR} -cs '\\[a-z][0-9]' '[\012*]' | ${SED} "/^ *$/d;/^ /d;/\\\.*/d;s,$,${TAB}$1,"; } 2>> "${TEMP_DIR}/${TMP_NAME}-E.$$";;
- X *) ${ECHO} "Unknown lexical analyzer, aborting." 1>&2; exit 1;;
- X esac
- X
- X # If any errors are contained in "${TEMP_DIR}/${TMP_NAME}-E.$$", exit.
- X
- X if [ -s "${TEMP_DIR}/${TMP_NAME}-E.$$" ]; then
- X exit 1
- X fi
- }
- X
- # Remove words from the inverted index function. Since the inverted
- # index file structure is records constructed of word/filename pairs,
- # separated by a "${TAB}", search for the word, adding a claret to
- # signify beginning of record, and terminated with a "${TAB}". With
- # multiple arguments, this script will create multiple indices. A copy of
- # the original is made in the temporary directory, and all operations
- # occur there. The final copy is ${MV}'ed to the index's directory and
- # finally update_database() is called to install the new version.
- #
- # The required arguments are regular expressions. All words in the
- # inverted index that match these expressions will be deleted.
- X
- remove_words()
- {
- X # If no arguments, exit.
- X
- X if [ "$#" -eq "0" ]; then
- X ${ECHO} "No words to remove specified, aborting." 1>&2
- X exit 1
- X fi
- X
- X # If the Inverted index database does not exist, or is unreadable or
- X # unwritable, exit.
- X
- X if [ ! -f "${DB_NAME}" -o ! -r "${DB_NAME}" -o ! -w "${DB_NAME}" ]; then
- X ${ECHO} "The inverted index file does not exist, or is not writable or readable, aborting." 1>&2
- X exit 1
- X fi
- X
- X # Make a copy of the database in the ${TEMP_DIR}-operate on the
- X # copy, exit if this fails.
- X
- X ${CP} "${DB_NAME}" "${TEMP_DIR}/${TMP_NAME}-1.$$"
- X
- X if [ ! "$?" -eq "0" ]; then
- X ${ECHO} "Error removing words from index file, aborting." 1>&2
- X exit 1
- X fi
- X
- X # For each argument, scan the copy of the database, using ${EGREP}
- X # -v, to remove any records that contain the specified word, making
- X # a new database in the "${TEMP_DIR}". ${MV} this file as the the
- X # new copy of the database. Exit on any failure.
- X
- X while [ "$#" -gt "0" ]
- X do
- X ${EGREP} -v "^$1${TAB}" "${TEMP_DIR}/${TMP_NAME}-1.$$" > "${TEMP_DIR}/${TMP_NAME}-2.$$"
- X
- X # Only check for syntax errors. No matches is OK.
- X
- X if [ "$?" -eq "2" ]; then
- X ${ECHO} "Error removing words from index file, aborting." 1>&2
- X exit 1
- X fi
- X
- X ${MV} -f "${TEMP_DIR}/${TMP_NAME}-2.$$" "${TEMP_DIR}/${TMP_NAME}-1.$$"
- X
- X if [ ! "$?" -eq "0" ]; then
- X ${ECHO} "Error removing words from index file, aborting." 1>&2
- X exit 1
- X fi
- X
- X shift
- X done
- X
- X # The specified words have been removed from the copy of the
- X # database. ${MV} the copy of the database to "${TMP_DB}". Exit on
- X # failure.
- X
- X ${MV} "${TEMP_DIR}/${TMP_NAME}-1.$$" "${TMP_DB}"
- X
- X if [ ! "$?" -eq "0" ]; then
- X ${ECHO} "Error removing words from index file, aborting." 1>&2
- X exit 1
- X fi
- X
- X # Flush the buffers and call update_database() to update the
- X # inverted index file.
- X
- X ${SYNC}
- X ${SYNC}
- X
- X update_database "${DB_NAME}" "${TMP_DB}"
- }
- X
- # Remove files from the inverted index function. Since the inverted
- # index file structure is records constructed of word/filename pairs,
- # separated by a "${TAB}", search for the file name, adding a "${TAB}"
- # to the beginning of the file name, and a dollar sign to the end of the
- # file name to signify the end of record. With multiple arguments, this
- # script will create multiple indices. A copy of the original is made in
- # the temporary directory, and all operations occur there. The final
- # copy is ${MV}'e to the index's directory and finally update_database()
- # is called to install the new version.
- #
- # The required arguments are regular expressions. All file names in the
- # inverted index that match these expressions will be deleted.
- X
- remove_files()
- {
- X # If no arguments, exit.
- X
- X if [ "$#" -eq "0" ]; then
- X ${ECHO} "No file names to remove specified, aborting." 1>&2
- X exit 1
- X fi
- X
- X # If the Inverted index database does not exist, or is unreadable or
- X # unwritable, exit.
- X
- X if [ ! -f "${DB_NAME}" -o ! -r "${DB_NAME}" -o ! -w "${DB_NAME}" ]; then
- X ${ECHO} "The inverted index file does not exist, or is not writable or readable, aborting." 1>&2
- X exit 1
- X fi
- X
- X # Make a copy of the database in the ${TEMP_DIR}-operate on the
- X # copy, exit if this fails.
- X
- X ${CP} "${DB_NAME}" "${TEMP_DIR}/${TMP_NAME}-1.$$"
- X
- X if [ ! "$?" -eq "0" ]; then
- X ${ECHO} "Error removing files from index file, aborting." 1>&2
- X exit 1
- X fi
- X
- X # For each argument, scan the copy of the database, using ${EGREP}
- X # -v, to remove any records that contain the specified file name,
- X # making a new database in the "${TEMP_DIR}". ${MV} this file as the
- X # the new copy of the database. Exit on any failure.
- X
- X while [ "$#" -gt "0" ]
- X do
- X ${EGREP} -v "${TAB}$1\$" "${TEMP_DIR}/${TMP_NAME}-1.$$" > "${TEMP_DIR}/${TMP_NAME}-2.$$"
- X
- X # Only check for syntax errors. No matches is OK.
- X
- X if [ "$?" -eq "2" ]; then
- X ${ECHO} "Error removing files from index file, aborting." 1>&2
- X exit 1
- X fi
- X
- X ${MV} -f "${TEMP_DIR}/${TMP_NAME}-2.$$" "${TEMP_DIR}/${TMP_NAME}-1.$$"
- X
- X if [ ! "$?" -eq "0" ]; then
- X ${ECHO} "Error removing files from index file, aborting." 1>&2
- X exit 1
- X fi
- X
- X shift
- X done
- X
- X # The specified file names have been removed from the copy of the
- X # database. ${MV} the copy of the database to "${TMP_DB}". Exit on
- X # failure.
- X
- X ${MV} "${TEMP_DIR}/${TMP_NAME}-1.$$" "${TMP_DB}"
- X
- X if [ ! "$?" -eq "0" ]; then
- X ${ECHO} "Error removing files from index file, aborting." 1>&2
- X exit 1
- X fi
- X
- X # Flush the buffers and call update_database() to update the
- X # inverted index file.
- X
- X ${SYNC}
- X ${SYNC}
- X
- X update_database "${DB_NAME}" "${TMP_DB}"
- }
- X
- # Relevance count function-the count of records in a file that contain
- # match(s) will be output with the file name containing the matches.
- # This provides the system with a remedial "relevance feedback"
- # capability. The original text files that were used to construct the
- # inverted index file must be available in the system to use this
- # option.
- #
- # Note that ${LOOK}, essentially, uses the regular expression
- # "^word${TAB}" (if exact matches are enabled) to search for words, and
- # ${EGREP} uses, simply, "word", ie., this function will output more
- # instances of records that contain "word" than were found by ${LOOK}.
- #
- # The required arguments are the regular expression to be searched for,
- # followed by the the list of file names to be searched.
- X
- relevance_count()
- {
- X # If no arguments, or only one argument, return.
- X
- X if [ "$#" -eq "0" -o "$#" -eq "0" ]; then
- X return
- X fi
- X
- X # Read the search field.
- X
- X SEARCH_FIELD="$1"
- X shift
- X
- X # ${EGREP} each file, count the records that contain queried words,
- X # ignoring case. ${SORT} the output, in reverse numerical order on
- X # the second field, using a colon as the field delimiter. This
- X # arranges the file names in descending order of the number of
- X # queried words that the files contain. One of the problems with
- X # ${EGREP} is that if only one file name is present, it does not
- X # output the file name-so handle each file separately, using ${ECHO}
- X # to print the file name.
- X
- X while [ "$#" -ne "0" ]
- X do
- X ${ECHO} "$1: \c"
- X ${EGREP} -ic "${SEARCH_FIELD}" "$1"
- X shift
- X done | ${SORT} -n -r -t: +1
- }
- X
- # Proximity retrieval function-the records in a file that contain
- # match(s) will be output after the file name containing the matches.
- # This output format provides the system with a remedial "permuted
- # index" type of "proximity retrieval." The original text files that
- # were used to construct the inverted index must be available in the
- # system to use this option.
- #
- # Note that ${LOOK}, essentially, uses the regular expression
- # "^word${TAB}" (if exact matches are enabled) to search for words, and
- # ${EGREP} uses, simply, "word", ie., this function will output more
- # instances of records that contain "word" than were found by ${LOOK}.
- #
- # The required arguments are the regular expression to be searched for,
- # followed by the the list of file names to be searched.
- X
- relevance_proximity()
- {
- X # If no arguments, or only one argument, return.
- X
- X if [ "$#" -eq "0" -o "$#" -eq "0" ]; then
- X return
- X fi
- X
- X # Read the search field.
- X
- X SEARCH_FIELD="$1"
- X shift
- X
- X # ${EGREP} each file, count the records that contain queried words,
- X # ignoring case. ${SORT} the output, in reverse numerical order on
- X # the second field, using a colon as the field delimiter. This
- X # arranges the file names in descending order of the number of
- X # queried words that the files contain. One of the problems with
- X # ${EGREP} is that if only one file name is present, it does not
- X # output the file name-so handle each file separately, using ${ECHO}
- X # to print the file name. Strip the colon and anything following it
- X # using ${SED}.
- X
- X while [ "$#" -ne "0" ]
- X do
- X ${ECHO} "$1: \c"
- X ${EGREP} -ic "${SEARCH_FIELD}" "$1"
- X shift
- X done | ${SORT} -n -r -t: +1 | ${SED} "s/: .*$//" | while read file_name
- X do
- X # For each file name, print the file name, and ...
- X
- X ${ECHO} ""
- X ${ECHO} "${file_name}:"
- X ${ECHO} ""
- X
- X # ${EGREP} the file for instances of the queried words, ignoring
- X # case and without printing the file name, pipe this output to
- X # ${SED} to despace, and detab the record, and print the record
- X # with leading and trailing dots
- X
- X ${EGREP} -i -h "${SEARCH_FIELD}" "${file_name}" | ${SED} "/[[${TAB} ]*[${TAB} ]/s/[[${TAB} ]*[${TAB} ]/ /g;/^ /s/^ //;/^/s/^/\.\.\. /;/$/s/$/ \.\.\./"
- X done
- }
- X
- # Trap all interrupts, removing the the temporary and error files on any
- # signal.
- X
- trap "${RM} -f ${TEMP_FILES}; exit" 0 1 2 3 15
- X
- # The inverted index backup file should have been removed on the
- # completion of any previous execution of this script. If it wasn't, the
- # previous execution failed, and the backup file should be restored-if
- # this fails, exit with an error.
- X
- if [ -f "${DB_NAME}.BAK" ]; then
- X ${ECHO} "Index backup file exists, restoring." 1>&2
- X ${MV} -f "${DB_NAME}.BAK" "${DB_NAME}"
- X if [ "$?" -eq "0" ]; then
- X ${ECHO} "Restoration of index backup file succeeded, continuing." 1>&2
- X else
- X ${ECHO} "Restoration of index backup file failed, aborting." 1>&2
- X exit 1
- X fi
- X ${SYNC}
- X ${SYNC}
- fi
- X
- # If no operations specified, print the version, which also gives
- # information on help(), and exit.
- X
- if [ "$#" -eq "0" ]; then
- X ${ECHO} "${VERSION}"
- X exit 1
- fi
- X
- # Set variables from command line options. The attempt here is to allow
- # only one mode switch in the command line-two or more are illegal. A
- # "-w" switch is required for any command that updates the database.
- # READ_MODE is the default mode. GETOPT could be used, but some of the
- # switches require two character names.
- X
- while [ "$#" -gt "0" ]
- do
- X case "$1" in
- X
- X # Lexical analysis option.
- X
- X -1) LEXICO=1; shift;;
- X -2) LEXICO=2; shift;;
- X -3) LEXICO=3; shift;;
- X -4) LEXICO=4; shift;;
- X -5) LEXICO=5; shift;;
- X -6) LEXICO=6; shift;;
- X -7) LEXICO=7; shift;;
- X -8) LEXICO=8; shift;;
- X
- X # Delete file(s) option-requires write mode enable.
- X
- X -df) if [ "${OP_MODE}" -eq "${WRITE_MODE}" ]; then
- X OP_MODE="${DELETE_FILES}"; shift
- X else
- X ${ECHO} "Delete files mode requires preceding \"-w\" option, aborting." 1>&2; exit 1
- X fi;;
- X
- X # Delete words(s) option-requires write mode enable.
- X
- X -dw) if [ "${OP_MODE}" -eq "${WRITE_MODE}" ]; then
- X OP_MODE="${DELETE_WORDS}"; shift
- X else
- X ${ECHO} "Delete words mode requires preceding \"-w\" option, aborting." 1>&2; exit 1
- X fi;;
- X
- X # Exact query match option.
- X
- X -e) END="${TAB}"; shift;;
- X
- X # Change file name option, "${TMP_NAME} is the base name of
- X # this file name.
- X
- X -f) shift
- X if [ "$#" -gt "0" ]; then
- X DB_NAME="$1"
- X TMP_NAME=`basename "$1"`
- X shift
- X else
- X ${ECHO} "No inverted index file name specified, aborting." 1>&2; exit 1
- X fi;;
- X
- X # Request for help option.
- X
- X -h) help; shift; exit 0;;
- X
- X # Request for egrep(1) compatable regular expression search word
- X # to be output prior to file names option.
- X
- X -r) RELEVANCE_ATTRIBUTE=1; shift;;
- X
- X # Count of records in a file that contain match(s) option-
- X # requires no previous mode switch..
- X
- X -rc) if [ "${OP_MODE}" -eq "${READ_MODE}" ]; then
- X RELEVANCE_ATTRIBUTE=1; OP_MODE="${RELEVANCE_COUNT}" ;shift
- X else
- X ${ECHO} "Illegal relevance mode specified, aborting." 1>&2; exit 1
- X fi;;
- X
- X # Records in a file that contain match(s) will be output after
- X # the file name option-requires no previous mode switch..
- X
- X -rp) if [ "${OP_MODE}" -eq "${READ_MODE}" ]; then
- X RELEVANCE_ATTRIBUTE=2; OP_MODE="${RELEVANCE_PROXIMITY}" ;shift
- X else
- X ${ECHO} "Illegal relevance mode specified, aborting." 1>&2; exit 1
- X fi;;
- X
- X # Version option.
- X
- X -v) ${ECHO} "${VERSION}"; shift; exit 0;;
- X
- X # Write mode enable-requires no previous mode switch.
- X
- X -w) if [ "${OP_MODE}" -eq "${READ_MODE}" ]; then
- X OP_MODE="${WRITE_MODE}"; shift
- X else
- X ${ECHO} "Illegal write mode specified, aborting." 1>&2; exit 1
- X fi;;
- X
- X # Anything else is a file name or query word/operator.
- X
- X *) break;;
- X esac
- done
- X
- # Dispatch to the specified operation.
- X
- case "${OP_MODE}" in
- X
- X # Write mode, call the function.
- X
- X "${WRITE_MODE}") write_index "$@";;
- X
- X # Read mode, call the function.
- X
- X "${READ_MODE}") read_index "$@";;
- X
- X # Delete words mode, call the function.
- X
- X "${DELETE_WORDS}") remove_words "$@";;
- X
- X # Delete files mode, call the function.
- X
- X "${DELETE_FILES}") remove_files "$@";;
- X
- X # Relevance count-call read_index() to get a list of the files, with
- X # RELEVANCE_ATTRIBUTES set.
- X
- X "${RELEVANCE_COUNT}") relevance_count `read_index "$@"`;;
- X
- X # Proximity relevance-call read_index() to get a list of the files,
- X # with RELEVANCE_ATTRIBUTES set. Pipe these files to
- X # relevance_proximity().
- X
- X "${RELEVANCE_PROXIMITY}") relevance_proximity `read_index "$@"`;;
- X
- X # Nothing to do, fall through to the exit.
- X
- X *) break;;
- esac
- X
- # Restore trapping all interrupts, removing the the temporary and error
- # files on any signal, (update_database() disables interrupts.)
- X
- trap "${RM} -f ${TEMP_FILES}; exit" 0 1 2 3 15
- X
- Xexit 0
- SHAR_EOF
- chmod 0766 qt/qt ||
- echo 'restore of qt/qt failed'
- Wc_c="`wc -c < 'qt/qt'`"
- test 65404 -eq "$Wc_c" ||
- echo 'qt/qt: original size 65404, current size' "$Wc_c"
- fi
- # ============= qt/README ==============
- if test -f 'qt/README' -a X"$1" != X"-c"; then
- echo 'x - skipping qt/README (File already exists)'
- else
- echo 'x - extracting qt/README (Text)'
- sed 's/^X//' << 'SHAR_EOF' > 'qt/README' &&
- Qt stands for Query Text, a text information retrieval system. Qt
- creates, maintains, and queries a full text database. The database
- file system is organized as an inverted index. The program is written
- as a single script, in Bourne Shell, and permits simple natural
- language queries.
- X
- As a simple application example, this program can be used to search
- the "catman" pages for a command that performs a specific function,
- even though the command's name is not known-e.g., if you knew what
- you wanted to do, you could find the command that would do it.
- X
- The program, qt, is free software, and can be redistributed and/or
- modified, without any restrictions. It is distributed with no
- warranty of any kind, implied or otherwise. Specifically, there is
- no warranty of fitness for any particular purpose and/or
- merchantability.
- X
- Comments and/or bug reports should be addressed to:
- X
- X john@johncon.com (John Conover)
- X
- Known caveats: There is no concurrency control-it would be
- ill-advised to use this program as a concurrent application.
- Additionally, the natural language query does not support grouping
- operators.
- X
- For a quick start, execute qt -h for help, which may be re-directed to
- stdio. At the "tail -23" of this help file are some simple commands to
- evaluate this script.
- X
- Installation:
- X
- The comments in this script are verbose, and should be stripped prior
- to any installation with something like:
- X
- X sed '/^ *#/d;/^$/d' qt > qt.new
- X
- and installing qt.new as qt in the executable path. Likewise,
- possibly, the function, help(), should be eliminated. The function,
- find_program(), is not efficient and should be eliminated, by hard
- coding the paths to the various programs in your system. There are
- tab characters used in this script, (which are referenced as the
- variable, "${TAB}") requiring that the script be saved with tabs.
- SHAR_EOF
- chmod 0644 qt/README ||
- echo 'restore of qt/README failed'
- Wc_c="`wc -c < 'qt/README'`"
- test 1844 -eq "$Wc_c" ||
- echo 'qt/README: original size 1844, current size' "$Wc_c"
- fi
- exit 0
-