Source Code 1994 March

Text File | 1993-10-18 | 68.9 KB | 1,961 lines
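
The posting below describes qt's database as a flat file of "word, TAB, file name" records, sorted on the word and queried with standard Unix tools. As a rough, standalone illustration of that idea (this is not code from qt; the sample files, the `lookup` helper, and the use of `awk` in place of `look` are assumptions made for the demo), a tiny index can be built and queried like this:

```shell
#!/bin/sh
# Hypothetical demo, not part of qt: build a tiny inverted index of
# "word<TAB>filename" records and run an "and" query against it.
set -eu
work=$(mktemp -d)
trap 'rm -rf "$work"' EXIT
TAB=$(printf '\t')

# Two small sample documents (invented for this sketch).
printf 'copy a file\n' > "$work/cp.txt"
printf 'move or rename a file\n' > "$work/mv.txt"

# Emit one record per (word, file) pair: split on non-alphanumerics,
# fold to lowercase, append TAB and the file name, then sort uniquely.
for f in "$work"/cp.txt "$work"/mv.txt; do
    tr -cs 'A-Za-z0-9' '\n' < "$f" | tr 'A-Z' 'a-z' |
        sed -e '/^$/d' -e "s|\$|${TAB}$(basename "$f")|"
done | LC_ALL=C sort -u > "$work/index"

# Exact-match lookup: print the sorted file names recorded for one word.
lookup() {
    awk -F"$TAB" -v w="$1" '$1 == w { print $2 }' "$work/index" |
        LC_ALL=C sort -u
}

# "file and rename": intersect the two sorted file-name lists with join.
lookup file > "$work/a"
lookup rename > "$work/b"
result=$(LC_ALL=C join "$work/a" "$work/b")
echo "$result"
```

Run as written, the sketch prints `mv.txt`, the only sample file containing both words. It uses a linear `awk` scan for clarity; the posting itself describes a binary search with `look` over the sorted index.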

Newsgroups: comp.sources.unix
From: john@johncon.com (John Conover)
Subject: v27i075: qt - full-text retrieval program, Part01/01
Message-id: <1.750984870.12686@gw.home.vix.com>
Sender: unix-sources-moderator@gw.home.vix.com
Approved: vixie@gw.home.vix.com

Submitted-By: john@johncon.com (John Conover)
Posting-Number: Volume 27, Issue 75
Archive-Name: qt/part01

Qt stands for Query Text, a text information retrieval system. Qt
creates, maintains, and queries a full text database. The database
file system is organized as an inverted index. The program is written
as a single script, in Bourne Shell, and permits simple natural
language queries.

Environment: Unix, SysV. rel. 4.x, R6000, DEC ALPHA 3000, others.

    john@johncon.com

#!/bin/sh
# This is a shell archive (produced by shar 3.49)
# To extract the files from this archive, save it to a file, remove
# everything above the "!/bin/sh" line above, and type "sh file_name".
#
# made 09/16/1993 02:58 UTC by john@johncon
# Source directory /home/john
#
# existing files will NOT be overwritten unless -c is specified
#
# This shar contains:
# length  mode       name
# ------ ---------- ------------------------------------------
#  65404 -rwxrw-rw- qt/qt
#   1844 -rw-r--r-- qt/README
#
# ============= qt/qt ==============
if test ! -d 'qt'; then
    echo 'x - creating directory qt'
    mkdir 'qt'
fi
if test -f 'qt/qt' -a X"$1" != X"-c"; then
    echo 'x - skipping qt/qt (File already exists)'
else
echo 'x - extracting qt/qt (Text)'
sed 's/^X//' << 'SHAR_EOF' > 'qt/qt' &&
#!/bin/sh
#
VERSION="qt - Version 0.1. (qt -h, for description and help.)"
#
# Qt stands for Query Text, a text information retrieval system. Qt
# creates, maintains, and queries a full text database. The database
# file system is organized as an inverted index. The program is written
# as a single script, in Bourne Shell, and permits simple natural
# language queries.
#
# As a simple application example, this program can be used to search
# the "catman" pages for a command that performs a specific function,
# even though the command's name is not known, e.g., if you knew what
# you wanted to do, you could find the command that would do it.
#
# The program, qt, is free software, and can be redistributed and/or
# modified, without any restrictions. It is distributed with no
# warranty of any kind, implied or otherwise. Specifically, there is
# no warranty of fitness for any particular purpose and/or
# merchantability.
#
# Comments and/or bug reports should be addressed to:
#
#     john@johncon.com (John Conover)
#
# Known caveats: There is no concurrency control; it would be
# ill-advised to use this program as a concurrent application.
# Additionally, the natural language query does not support grouping
# operators.
#
# For a quick start, execute qt -h for help, which is printed to stdout
# and may be redirected to a file. At the "tail -23" of this help file
# are some simple commands to evaluate this script.
#
# Installation:
#
# The comments in this script are verbose, and should be stripped prior
# to any installation with something like:
#
#     sed '/^ *#/d;/^$/d' qt > qt.new
#
# and installing qt.new as qt in the executable path. Likewise,
# possibly, the function, help(), should be eliminated. The function,
# find_program(), is not efficient and should be eliminated, by hard
# coding the paths to the various programs in your system. There are
# tab characters used in this script, (which are referenced as the
# variable, "${TAB}") requiring that the script be saved with tabs.
#
# Applicability:
#
# Applicability of qt varies with the complexity of the search, the
# size of the database, the speed of the host environment, etc.;
# however, as some general guidelines:
#
#     1) For text files with a total size of less than 5 MB,
#     standard egrep(1) queries of the text files will probably
#     prove adequate.
#
#     2) For text files with a total size of 5 MB to 50 MB, qt seems
#     adequate for most queries. The significant issue is that,
#     although the retrieval execution times are probably adequate
#     with qt, the database write times are not impressive.
#
#     3) For text files with a total size that is larger than 50 MB,
#     or where concurrency is an issue, it would be appropriate to
#     consider one of the alternatives listed in "Related
#     information retrieval software:," below.
#
# References:
#
#     1) "Information Retrieval, Data Structures & Algorithms,"
#     William B. Frakes, Ricardo Baeza-Yates, Editors, Prentice
#     Hall, Englewood Cliffs, New Jersey 07632, 1992, ISBN
#     0-13-463837-9.
#
#     The sources for many of the algorithms presented in 1) are
#     available by ftp, ftp.vt.edu:/pub/reuse/ircode.tar.Z
#
#     2) "Text Information Retrieval Systems," Charles T. Meadow,
#     Academic Press, Inc, San Diego, 1992, ISBN 0-12-487410-X.
#
#     3) "Full Text Databases," Carol Tenopir, Jung Soon Ro,
#     Greenwood Press, New York, 1990, ISBN 0-313-26303-5.
#
#     4) "Text and Context, Document Processing and Storage," Susan
#     Jones, Springer-Verlag, New York, 1991, ISBN 0-387-19604-8.
#
#     5) ftp think.com:/wais/wais-corporate-paper.text
#
#     6) ftp cs.toronto.edu:/pub/lq-text.README.1.10
#
#     7) "Unix Shell Programming," Lowell Jay Arthur, John Wiley &
#     Sons, Inc., New York, 1990, ISBN 0-471-51820-4.
#
# Related information retrieval software:
#
#     1) Wais, available by ftp, think.com:/wais/wais-8-b5.1.tar.Z
#
#     2) lq-text, available by ftp, cs.toronto.edu:
#     /pub/lq-text1.10.tar.Z
#
# This script uses the Unix concept of a simple flat text file as a
# database, operated on by the various utilities native to the Unix
# system. The flat text file's organization is exactly one record for
# each file that contains at least one instance of a specific word.
# Each record in the flat text file has exactly two fields. These two
# fields are the word, followed by a single "${TAB}" field delimiter,
# and the file name containing the word. The record sequence in the
# flat text file is the ASCII collated sequence of the word field.
#
# This organization of flat text file is an inverted index database,
# i.e., the file names of files that contain a specific word can be
# found, using a binary search program, (like the native Unix program,
# "look.") The inverted index file records can be created by parsing
# the words in textual documents (perhaps using the native Unix
# programs, "tr" and "sed,") and concatenating these words with a
# single "${TAB}" and the file name of the file containing the word.
# These records can then be sorted into ASCII collation sequence,
# (using, for example the Unix program, "sort" -u.) Obviously, two
# sorted inverted index databases could be combined with the "sort"
# -um, command. File names could be removed from the inverted index
# database with the "egrep" -v "*${TAB}filename" command, and words
# removed with the "egrep" -v "^word${TAB}" command, and so on. This
# script uses only a few of the native Unix programs to construct an
# inverted index database system.
#
# The functions contained in this script:
#
#     1) find_program(), find if a program exists.
#     2) help(), help.
#     3) read_index(), query the inverted index file for word(s).
#     4) write_index(), index the words in the files.
#     5) update_database(), update the database.
#     6) parse_word(), lexical analysis.
#     7) remove_words(), remove words from the inverted index.
#     8) remove_files(), remove files from the inverted index.
#     9) relevance_count(), relevance count.
#     10) relevance_proximity(), proximity retrieval.
#
# The functions, remove_words(), remove_files(), relevance_count() and
# relevance_proximity() are included to serve as templates for further
# applications. Probably, in the interest of generality, they should
# not be included in the program since they can be completely
# implemented as external scripts, aliases, or pipes from the output of
# qt.
#
# This program will create and query inverted index files that index
# the words in text files. These indices are useful in information
# retrieval systems. The inverted index files are, typically, about the
# same size as the text files, and do not require the text files to be
# present for query operations. The query functions, typically, consist
# of boolean operations on word searches. The output of the query is,
# typically, a list of the file names that contain the queried word(s).
#
# The read synopsis is:
#
#     qt [-e] [-r | -rc | -rp] [-f index_name] word1 [op1] word2 [op2] ...
#
# where word1-word2 ... are the words to be queried in the inverted
# index, and op1-op2 ... are the operations to be performed on the set
# of file names that contain these words. The word/operation arguments
# consist of pairs of search words, and boolean operators, with a left
# to right operational precedence.
#
# Thus if A, B, C, and D are words, then the query:
#
#     A and B or C not D
#
# would specify that all file names containing word A should be found,
# then all the file names containing word B should be found, and only
# those file names that contain words A and B should be added to those
# file names containing word C, and then if these file names do not
# contain word D, they are output.
#
# Logical "or'ing" is implicit, thus:
#
#     A B C
#
# is identical to:
#
#     A or B or C
#
# Obviously, the keywords, "and," "not," and "or," may not be queried
# for, when using the implicit "or" query constructs.
#
# If the "-e" option is specified, then "exact match" queries will be
# performed, otherwise, a "partial key" type of search will be
# performed, which is the default. It is recommended that the "-e"
# option be used if the query involves any boolean operations.
#
# If the "-r" option is specified, then the words being queried for
# will be output before the list of file names that contain the queried
# words. This output format is compatible with egrep(1), and is useful
# in doing "relevance feedback" searches.
#
# If the "-rc" option is specified, then the count of records in a file
# that contain match(es) will be output with the file name containing
# the matches. This provides the system with a rudimentary "relevance
# feedback" capability. The original text files that were used to
# construct the inverted index file must be available in the system to
# use this option.
#
# If the "-rp" option is specified, then the records in a file that
# contain match(es) will be output after the file name containing the
# matches. This output format provides the system with a rudimentary
# "permuted index" type of "proximity retrieval." The original text
# files that were used to construct the inverted index must be
# available in the system to use this option.
#
# The write synopsis is:
#
#     qt -w [-f index_name] [-1 | ... | -8] file1 file2 ...
#
# where file1 file2 ... are the file names that contain words that are
# to be added to the inverted index, or:
#
#     qt -w [-f index_name] [-1 | ... | -8] < file_list
#
# where file_list is the name of a file that contains a list of file
# names, one file name per record, that contain words that are to be
# added to the inverted index.
#
# It is recommended that file names contain the absolute path from the
# system's root directory.
#
# If the inverted index file does not exist, then it will be created,
# and contain an index to all of the words in the input files. If the
# inverted index file exists, then the indices of all words in the
# input files will be added, incrementally. Instances of words and
# filename pairs will be unique in the inverted index.
#
# The "-w" option specifies that write operations will be performed,
# and is a mandatory option, to be used if and only if write operations
# are desired.
#
# The "-f index_name," optionally, specifies the inverted index file's
# name. If the "-f" option is not specified, the inverted index file
# name will default to "qt.index."
#
# The lexical analyzer level is specified by, "-1", through "-8". If
# none are specified, the default, "-4", will be used. The lexical
# analyzers with larger numbers are, generally, more sophisticated
# about the words that are placed in the inverted index. The lexical
# analyzers available are:
#
#     1) Parses words and numbers. All other characters are omitted.
#     Capitalization is preserved. Probably the best choice if
#     non-word searches are important.
#
#     2) Like 1) above, but the '_' character is recognized. This
#     parser seems to work well with "C" program source files.
#
#     3) Like 1) above, but, only words of more than two characters
#     are placed in the inverted index file. If capitalization is
#     considered important in the search criteria, then this seems
#     to be the best choice.
#
#     4) Like 1) above, but, capitalization is ignored. For general
#     text where all words and numbers are considered significant,
#     this seems to be the best choice. Also seems a good choice for
#     "catman" pages. Queries should be in lowercase.
#
#     5) Like 3) above, but capitalization is ignored. For general
#     text, this seems to be the best choice. Queries should be in
#     lowercase.
#
#     6) Like 4) above, but words containing only numbers are
#     omitted from the inverted index file. For text containing only
#     words, this seems to be the best choice. Queries should be in
#     lowercase.
#
#     7) Like 4) above, but does not include Unix mail headers in
#     the inverted index file. Each email should be in a separate
#     file, as opposed to concatenated into folders. This seems to
#     be the best choice for Unix mail files, if the header
#     information is not desirable.
#
#     8) Like 4) above, but deletes TeX and/or LaTeX commands from
#     the inverted index file. This seems to be the best choice for
#     TeX and LaTeX documents.
#
# The more sophisticated the parser, the smaller the size of the
# inverted index file. Multiple runs can be made, using the different
# parsers, to store words in the inverted index. For example, using
# parsers 3) and 4) would place both the capitalized and
# non-capitalized words in the index. This would not duplicate any
# words already in the index; it would only add the words that were
# different.
#
# The remove words synopsis is:
#
#     qt -w -dw [-f index_name] word1 word2 ...
#
# where word1 word2 ... are the words that are to be deleted from the
# inverted index file, and may be regular expressions; no '^' or '$'
# characters should be used, unless they are escaped. The "-w" option
# is mandatory.
#
# The remove files synopsis is:
#
#     qt -w -df [-f index_name] file1 file2 ...
#
# where file1 file2 ... are the file names that are to be deleted from
# the inverted index. The file names to be deleted from the inverted
# index file may be regular expressions; no '^' or '$' characters
# should be used, unless they are escaped. The "-w" option is mandatory.
#
# The version synopsis is:
#
#     qt -v
#
# which will print the version number of qt.
#
# The help synopsis is:
#
#     qt -h
#
# which will list a synopsis of the command semantics.
#
# A common example of writing an inverted index file would be:
#
#     find /dir1/dir2 -type f -print | qt -w
#
# which would recursively descend through the directory hierarchy, and
# create an inverted index of all of the words in all of the files in
# all of the directories, starting with /dir1/dir2.
#
# A common example of retrieving information from an inverted index
# file would be:
#
#     more +/word `qt word`
#
# where the "more" program would page through the documents that
# contain "word," advancing to the next instance every time the 'n' key
# is depressed.
#
# A common example of relevance determination in retrieving information
# from an inverted index file would be:
#
#     egrep -ic `qt -r word` | sort -n -r -t: +1
#
# which would print the file(s) that contain "word," with the count of
# the instances of records that contain "word" in each of the file(s).
#
# Since the inverted index file constitutes a database system, care in
# how this file is manipulated is important. The general procedure used
# in this script is as follows:
#
#     1) When this script commences execution, a test is made for
#     the existence of a backup of an original inverted index file,
#     which was created in step 3), below. (Presumably, this file
#     was left by failed attempt(s) of step(s) 3), 4), or 5), by a
#     prior, unsuccessful, execution of this script.) If the backup
#     file exists, it is unconditionally moved, via the "mv"
#     command, to be the current, original database. If this
#     operation is successful, then step 2) is executed, if not, the
#     script aborts.
#
#     2) After any original inverted index file backup is restored,
#     all write operations to the database are written to a
#     temporary file, including duplication of any required data
#     from the current inverted index file. This temporary file will
#     become the new inverted index file. If these operations are
#     successful, then step 3) is executed, if not, the script
#     aborts.
#
#     3) After all write operations are completed, the original
#     inverted index file is backed up, using the "mv" command. If
#     this operation is successful, then step 4) is executed, if not,
#     the script aborts.
#
#     4) After the original inverted index file has been backed up,
#     the temporary inverted index file is moved, using the "mv"
#     command, as the new inverted index file. If this operation is
#     successful, then step 5) is executed, if not, the script
#     aborts.
#
#     5) After the temporary inverted index file has been moved, the
#     original inverted index file backup is removed. If this
#     operation is successful, the script exits normally, if not,
#     the script aborts.
#
# Note that the vulnerability is in steps 3), 4) and 5). If step 3)
# fails, then the original inverted index is still intact, and
# there is no backup (or need for one.) If step 4) fails, then there is
# a backup and it will be restored in step 1). If step 5) fails, then
# there is a backup and it will, also, be restored in step 1),
# (inadvertently destroying the new inverted index file.) Note,
# additionally, that steps 3), 4), and 5) are "low risk" operations,
# (two "mv" and one "rm" operation,) and executed sequentially, with no
# intervening program steps.
X
# Function to find the programs used in this script. The arguments are
# the choice of paths to a program, in precedence of your first choice,
# second choice, and so on.
#
# Note that this function is not efficient. During installation, the
# paths should be hard coded, and this function removed.
X
find_program() {
X    # If no arguments, exit.
X
X    if [ "$#" -eq "0" ]; then
X        ${ECHO} "No program name specified, aborting." 1>&2
X        exit 1
X    fi
X
X    # Save the first argument's basename for error reporting.
X
X    program_base=`basename "$1"`
X
X    # For each argument, test if the file name exists, and is
X    # executable.
X
X    while [ "$#" -ne "0" ]
X    do
X        if [ -x "$1" ]; then
X            ${ECHO} "$1"
X            return
X        fi
X        shift
X    done
X
X    # None of the file name arguments were found, exit with the error.
X
X    ${ECHO} "Program not found, $program_base, aborting." 1>&2
X    exit 1
}
X
# Assume an echo is in the path for find_program().
X
ECHO=echo
X
CAT=`find_program /usr/bin/cat`
CP=`find_program /usr/bin/cp`
X
# For SunOS 4.1.x, use the SysV version of echo, in /usr/5bin/echo, all
# others use /usr/bin/echo.
X
ECHO=`find_program /usr/5bin/echo /usr/bin/echo`
EGREP=`find_program /usr/bin/egrep`
JOIN=`find_program /usr/bin/join`
X
# For SysV Rel. 4.x, the look program resides in /usr/ucb/look, all
# others use /usr/bin/look.
X
LOOK=`find_program /usr/bin/look /usr/ucb/look`
MV=`find_program /usr/bin/mv`
RM=`find_program /usr/bin/rm`
SED=`find_program /usr/bin/sed`
SORT=`find_program /usr/bin/sort`
X
# For the DEC ALPHA 3000, the sync program resides in /usr/sbin/sync,
# all others use /usr/bin/sync.
X
SYNC=`find_program /usr/bin/sync /usr/sbin/sync`
X
# For SunOS 4.1.x, use the SysV version of tr, in /usr/5bin/tr, all
# others use /usr/bin/tr.
X
TR=`find_program /usr/5bin/tr /usr/bin/tr`
UNIQ=`find_program /usr/bin/uniq`
X
# Default inverted index file name.
X
DB_NAME="qt.index"
X
# Default temporary inverted index file base name.
X
TMP_NAME="qt.index"
X
# If the environmental variable, TMPDIR, exists, then that directory
# will be used for all temporary files, if not, then /tmp will be used.
# The temporary file names in TMPDIR are always the basename of the
# database, concatenated with a '-', a unique character that identifies
# the temporary file to this script, a '.', and this script's pid.
X
# Temporary file directory name.
X
TEMP_DIR="${TMPDIR:-/tmp}"
X
# Temporary inverted index file name. There are two alternatives here.
# The database is updated by backing up the current database (with the
# ${MV} command) to a different name in its current directory. Then the
# new database is moved (again with the ${MV} command) from its
# temporary name to the database name. In some systems, if the /tmp
# directory is on a different disk partition, ${MV} will have to copy
# the data across the file systems to perform the move. During this
# time, the database is vulnerable to power outages, etc. If the
# temporary database is in the same directory as the current database,
# the update will not involve any data transfer on the disk (i.e., only
# a name change.) On one hand, constructing the temporary database in
# its home directory lowers the risk of corruption if the machine goes
# down, but on the other hand, it could leave the temporary file (which
# can be large) when the machine comes up. (Note that the /tmp directory
# is purged during the boot process.) The two options are:
X
# TMP_DB="${TEMP_DIR}/${TMP_NAME}-1.$$"
TMP_DB="${DB_NAME}.NEW"
X
# List of all temporary files:
#
#     "${TEMP_DIR}/${TMP_NAME}-1.$$", is the temporary file name of
#     the new inverted index file. (Either in read or write modes.)
#
#     "${TEMP_DIR}/${TMP_NAME}-2.$$", is the temporary file name
#     where anything that is to be added to the inverted index file
#     is temporarily held. (Either in read or write modes.)
#
#     "${TEMP_DIR}/${TMP_NAME}-3.$$", is the temporary file name
#     where anything that is to be added to the inverted index file
#     is temporarily held. (In read mode only.)
#
#     "${TEMP_DIR}/${TMP_NAME}-E.$$", is a temporary file name that
#     contains information about error conditions. If it does not
#     exist, or exists and is zero length, then no error occurred.
X
TEMP_FILES="${TEMP_DIR}/${TMP_NAME}-1.$$ ${TEMP_DIR}/${TMP_NAME}-2.$$ ${TEMP_DIR}/${TMP_NAME}-3.$$ ${TEMP_DIR}/${TMP_NAME}-E.$$ ${TMP_DB}"
X
# Default lexical analyzer.
X
LEXICO=4
X
# RELEVANCE_ARGUMENTS, a list of the words being queried for in
# read_index().
X
RELEVANCE_ARGUMENTS=""
X
# RELEVANCE_ATTRIBUTE, 1 = include "${RELEVANCE_ARGUMENTS}" as the
# first record output from read_index(), 0 = do not output
# "${RELEVANCE_ARGUMENTS}"
X
RELEVANCE_ATTRIBUTE=0
X
# Mode of operations:
#
#     READ_MODE, read only mode = 1.
#
#     WRITE_MODE, write mode = 2.
#
#     DELETE_WORDS, delete words mode = 3.
#
#     DELETE_FILES, delete files mode = 4.
#
#     RELEVANCE_COUNT, relevance count mode = 5.
#
#     RELEVANCE_PROXIMITY, relevance proximity mode = 6.
X
READ_MODE=1
WRITE_MODE=2
DELETE_WORDS=3
DELETE_FILES=4
RELEVANCE_COUNT=5
RELEVANCE_PROXIMITY=6
X
# Default mode of operation.
X
OP_MODE="${READ_MODE}"
X
# Default termination character used by ${LOOK} program. Setting this
# to "${TAB}" will allow "exact key" searches for the words in the
# inverted index file. The default, which is null, is to allow "partial
# key" searches.
X
END=
X
# Tab character, used to specify the field delimiter for the ${JOIN}
# program. This is chosen as a character that will never be in the
# inverted index file, since all white space is to be parsed away. Note
# that this file should never be detab'ed.
X
TAB='	'
X
# The help function. Prints to stdout, so that it can be redirected to
# a file. The function requires no arguments.
X
help() {
X    ${ECHO} "This program will create and query inverted index files that index"
X    ${ECHO} "the words in text files. These indices are useful in information"
X    ${ECHO} "retrieval systems. The inverted index files are, typically, about the"
X    ${ECHO} "same size as the text files, and do not require the text files to be"
X    ${ECHO} "present for query operations. The query functions, typically, consist"
X    ${ECHO} "of boolean operations on word searches. The output of the query is,"
X    ${ECHO} "typically, a list of the file names that contain the queried word(s)."
X    ${ECHO} ""
X    ${ECHO} "The read synopsis is:"
X    ${ECHO} ""
X    ${ECHO} "    qt [-e] [-r | -rc | -rp] [-f index_name] word1 [op1] word2 [op2] ..."
X    ${ECHO} ""
X    ${ECHO} "where word1-word2 ... are the words to be queried in the inverted"
X    ${ECHO} "index, and op1-op2 ... are the operations to be performed on the set"
X    ${ECHO} "of file names that contain these words. The word/operation arguments"
X    ${ECHO} "consist of pairs of search words, and boolean operators, with a left"
X    ${ECHO} "to right operational precedence."
X    ${ECHO} ""
X    ${ECHO} "Thus if A, B, C, and D are words, then the query:"
X    ${ECHO} ""
X    ${ECHO} "    A and B or C not D"
X    ${ECHO} ""
X    ${ECHO} "would specify that all file names containing word A should be found,"
X    ${ECHO} "then all the file names containing word B should be found, and only"
X    ${ECHO} "those file names that contain words A and B should be added to those"
X    ${ECHO} "file names containing word C, and then if these file names do not"
X    ${ECHO} "contain word D, they are output."
X    ${ECHO} ""
X    ${ECHO} "Logical \"or'ing\" is implicit, thus:"
X    ${ECHO} ""
X    ${ECHO} "    A B C"
X    ${ECHO} ""
X    ${ECHO} "is identical to:"
X    ${ECHO} ""
X    ${ECHO} "    A or B or C"
X    ${ECHO} ""
X    ${ECHO} "Obviously, the keywords, \"and,\" \"not,\" and \"or,\" may not be queried"
X    ${ECHO} "for, when using the implicit \"or\" query constructs."
X    ${ECHO} ""
X    ${ECHO} "If the \"-e\" option is specified, then \"exact match\" queries will be"
X    ${ECHO} "performed, otherwise, a \"partial key\" type of search will be"
X    ${ECHO} "performed, which is the default. It is recommended that the \"-e\""
X    ${ECHO} "option be used if the query involves any boolean operations."
X    ${ECHO} ""
X    ${ECHO} "If the \"-r\" option is specified, then the words being queried for"
X    ${ECHO} "will be output before the list of file names that contain the queried"
X    ${ECHO} "words. This output format is compatible with egrep(1), and is useful"
X    ${ECHO} "in doing \"relevance feedback\" searches."
X    ${ECHO} ""
X    ${ECHO} "If the \"-rc\" option is specified, then the count of records in a file"
X    ${ECHO} "that contain match(es) will be output with the file name containing"
X    ${ECHO} "the matches. This provides the system with a rudimentary \"relevance"
X    ${ECHO} "feedback\" capability. The original text files that were used to"
X    ${ECHO} "construct the inverted index file must be available in the system to"
X    ${ECHO} "use this option."
X    ${ECHO} ""
X    ${ECHO} "If the \"-rp\" option is specified, then the records in a file that"
X    ${ECHO} "contain match(es) will be output after the file name containing the"
X    ${ECHO} "matches. This output format provides the system with a rudimentary"
X    ${ECHO} "\"permuted index\" type of \"proximity retrieval.\" The original text"
X    ${ECHO} "files that were used to construct the inverted index must be"
X    ${ECHO} "available in the system to use this option."
X    ${ECHO} ""
X    ${ECHO} "The write synopsis is:"
X    ${ECHO} ""
X    ${ECHO} "    qt -w [-f index_name] [-1 | ... | -8] file1 file2 ..."
X    ${ECHO} ""
X    ${ECHO} "where file1 file2 ... are the file names that contain words that are"
X    ${ECHO} "to be added to the inverted index, or:"
X    ${ECHO} ""
X    ${ECHO} "    qt -w [-f index_name] [-1 | ... | -8] < file_list"
X    ${ECHO} ""
X    ${ECHO} "where file_list is the name of a file that contains a list of file"
X    ${ECHO} "names, one file name per record, that contain words that are to be"
X    ${ECHO} "added to the inverted index."
X    ${ECHO} ""
X    ${ECHO} "It is recommended that file names contain the absolute path from the"
X    ${ECHO} "system's root directory."
X    ${ECHO} ""
X    ${ECHO} "If the inverted index file does not exist, then it will be created,"
X    ${ECHO} "and contain an index to all of the words in the input files. If the"
X    ${ECHO} "inverted index file exists, then the indices of all words in the"
X    ${ECHO} "input files will be added, incrementally. Instances of words and"
X    ${ECHO} "filename pairs will be unique in the inverted index."
X    ${ECHO} ""
X    ${ECHO} "The \"-w\" option specifies that write operations will be performed,"
X    ${ECHO} "and is a mandatory option, to be used if and only if write operations"
X    ${ECHO} "are desired."
X    ${ECHO} ""
X    ${ECHO} "The \"-f index_name,\" optionally, specifies the inverted index file's"
X    ${ECHO} "name. If the \"-f\" option is not specified, the inverted index file"
X    ${ECHO} "name will default to \"qt.index.\""
X    ${ECHO} ""
X    ${ECHO} "The lexical analyzer level is specified by, \"-1\", through \"-8\". If"
X    ${ECHO} "none are specified, the default, \"-4\", will be used. The lexical"
X    ${ECHO} "analyzers with larger numbers are, generally, more sophisticated"
X    ${ECHO} "about the words that are placed in the inverted index. The lexical"
X    ${ECHO} "analyzers available are:"
X    ${ECHO} ""
X    ${ECHO} "    1) Parses words and numbers. All other characters are omitted."
X    ${ECHO} "    Capitalization is preserved. Probably the best choice if"
X    ${ECHO} "    non-word searches are important."
X    ${ECHO} ""
X    ${ECHO} "    2) Like 1) above, but the '_' character is recognized. This"
X    ${ECHO} "    parser seems to work well with \"C\" program source files."
X    ${ECHO} ""
X    ${ECHO} "    3) Like 1) above, but, only words of more than two characters"
X    ${ECHO} "    are placed in the inverted index file. If capitalization is"
X    ${ECHO} "    considered important in the search criteria, then this seems"
X    ${ECHO} "    to be the best choice."
X    ${ECHO} ""
X    ${ECHO} "    4) Like 1) above, but, capitalization is ignored. For general"
X    ${ECHO} "    text where all words and numbers are considered significant,"
X    ${ECHO} "    this seems to be the best choice. Also seems a good choice for"
X    ${ECHO} "    \"catman\" pages. Queries should be in lowercase."
X    ${ECHO} ""
X    ${ECHO} "    5) Like 3) above, but capitalization is ignored. For general"
X    ${ECHO} "    text, this seems to be the best choice. Queries should be in"
X    ${ECHO} "    lowercase."
X    ${ECHO} ""
X    ${ECHO} "    6) Like 4) above, but words containing only numbers are"
X    ${ECHO} "    omitted from the inverted index file. For text containing only"
X    ${ECHO} "    words, this seems to be the best choice. Queries should be in"
X    ${ECHO} "    lowercase."
X    ${ECHO} ""
X    ${ECHO} "    7) Like 4) above, but does not include Unix mail headers in"
X    ${ECHO} "    the inverted index file. Each email should be in a separate"
X    ${ECHO} "    file, as opposed to concatenated into folders. This seems to"
X    ${ECHO} "    be the best choice for Unix mail files, if the header"
X    ${ECHO} "    information is not desirable."
X    ${ECHO} ""
X    ${ECHO} "    8) Like 4) above, but deletes TeX and/or LaTeX commands from"
X    ${ECHO} "    the inverted index file. This seems to be the best choice for"
X    ${ECHO} "    TeX and LaTeX documents."
X    ${ECHO} ""
X    ${ECHO} "The more sophisticated the parser, the smaller the size of the"
X    ${ECHO} "inverted index file. Multiple runs can be made, using the different"
X    ${ECHO} "parsers, to store words in the inverted index. For example, using"
X    ${ECHO} "parsers 3) and 4) would place both the capitalized and"
X    ${ECHO} "non-capitalized words in the index. This would not duplicate any"
X    ${ECHO} "words already in the index; it would only add the different words."
X    ${ECHO} ""
X    ${ECHO} "The remove words synopsis is:"
X    ${ECHO} ""
X    ${ECHO} "    qt -w -dw [-f index_name] word1 word2 ..."
X    ${ECHO} ""
X    ${ECHO} "where word1 word2 ... are the words that are to be deleted from the"
X    ${ECHO} "inverted index file, and may be regular expressions; no '^' or '$'"
X    ${ECHO} "characters should be used, unless they are escaped. The \"-w\" option"
X    ${ECHO} "is mandatory."
X    ${ECHO} ""
X    ${ECHO} "The remove files synopsis is:"
X    ${ECHO} ""
X    ${ECHO} "    qt -w -df [-f index_name] file1 file2 ..."
X    ${ECHO} ""
X    ${ECHO} "where file1 file2 ... are the file names that are to be deleted from"
X    ${ECHO} "the inverted index. The file names to be deleted from the inverted"
X    ${ECHO} "index file may be regular expressions; no '^' or '$' characters should"
X    ${ECHO} "be used, unless they are escaped. The \"-w\" option is mandatory."
X    ${ECHO} ""
X    ${ECHO} "The version synopsis is:"
X    ${ECHO} ""
X    ${ECHO} "    qt -v"
X    ${ECHO} ""
X    ${ECHO} "which will print the version number of qt."
X ${ECHO} "" X ${ECHO} "The help synopsis is:" X ${ECHO} "" X ${ECHO} " qt -h" X ${ECHO} "" X ${ECHO} "which will list a synopsis of the command semantics." X ${ECHO} "" X ${ECHO} "A common example of writing an inverted index file would be:" X ${ECHO} "" X ${ECHO} " find /dir1/dir2 -type f -print | qt -w" X ${ECHO} "" X ${ECHO} "which would recursively descend through the directory hierarchy, and" X ${ECHO} "create an inverted index of all of the words in all of the files in" X ${ECHO} "all of the directories, starting with /dir1/dir2." X ${ECHO} "" X ${ECHO} "A common example of retrieving information from an inverted index" X ${ECHO} "file would be:" X ${ECHO} "" X ${ECHO} " more +/word \`qt word\`" X ${ECHO} "" X ${ECHO} "where the \"more\" program would page through the documents that" X ${ECHO} "contain \"word,\" advancing to the next instance every time the 'n' key" X ${ECHO} "is depressed." X ${ECHO} "" X ${ECHO} "A common example of relevance determination in retrieving information" X ${ECHO} "from an inverted index file would be:" X ${ECHO} "" X ${ECHO} " egrep -ic \`qt -r word\` | sort -n -r -t: +1" X ${ECHO} "" X ${ECHO} "which would print the file(s) that contain \"word,\" with the count of" X ${ECHO} "the instances of records that contain \"word\" in each of the file(s)." } X # Query the inverted index file for word(s). If "${END}" is a "${TAB}", # then only exact matches will be found. If "${END}" is null, then # "partial key" types of operations will be performed. The command line # arguments consist of pairs of search words, and boolean operators, # with a left to right operational precedence. 
# # Thus if A, B, C, and D are words, then the query: # # A and B or C not D # # would specify that all file names containing word A should be found, # then all the file names containing word B should be found, and only # those file names that contain words A and B should be added to those # file names containing word C, and then if these file names do not # contain word D, they are output. # # Logical "or'ing" is implicit, thus: # # A B C # # is identical to: # # A or B or C # # The boolean operators supported are: # # and, which is implemented with the ${JOIN} program to perform a # "natural join" of two files, each file containing a list of the # file names which contain specific word(s). # # not, which is implemented with the ${JOIN} program to perform a # "natural join" operation, the output of which is combined with # the first file, using the ${SORT} -m program, and piped to the # ${UNIQ} -u program so that only those file names that are unique # to the first file are output. # # or, which is implemented by ${SORT} -mu to merge the two # files together, each record being unique. # # The ${LOOK} program is used to perform a binary search on the inverted # index file, (which is made up of ASCII records, each record containing # a word in a file, and the file name, separated by a single "${TAB}".) # The inverted index file is sorted in the ASCII collation sequence of # the words. The output of the ${LOOK} program is also sorted in ASCII # collation sequence. The words are stripped from the records output from # the ${LOOK} program with the ${SED} "s/.*${TAB}//" program. # # The ${JOIN} program uses the -tc option, where c is a character that # can never be in the inverted index. # # Note, the use of "-e" for exact word match operations is recommended # when doing boolean searches. # # The variable, ${RELEVANCE_ARGUMENTS}, contains a running list of the # search words, separated by a pipe symbol, '|'. 
This argument is # useful for piping to ${EGREP} for further refined searches. # # The rules of concatenation are: # # 1) initial operator argument, concatenate word only. # # 2) "or" operator argument, concatenate '|' and word to # beginning. # # 3) "not" operator argument, do not concatenate. # # 4) "and" operator argument, replace concatenated string with # word. # # This variable, at the conclusion of read_index(), will contain a # regular expression that is compatible with the first argument of # ${EGREP}. If the output of qt is piped to ${EGREP} with the file names # as additional arguments, then the words found in the records of the # files will approximately equal what the ${LOOK} program found in the # inverted index file. # # The arguments are the list of word/operators. X read_index () { X # If no arguments, return. X X if [ "$#" -eq "0" ]; then X return X fi X X # If the file does not exist, or is unreadable, exit. X X if [ ! -f "${DB_NAME}" -o ! -r "${DB_NAME}" ]; then X ${ECHO} "The inverted index file does not exist, or is unreadable, aborting." 1>&2 X exit 1 X fi X X # For each query argument: X X while [ "$#" -ne "0" ] X do X case "$1" in X and) X X # The query operator was an "and", shift over it. X X shift X if [ -f "${TEMP_DIR}/${TMP_NAME}-1.$$" ]; then X X # A partial file list exists, clear any X # ${RELEVANCE_ARGUMENTS} and start a new one. X X RELEVANCE_ARGUMENTS="$1" X X # Find the list of word/file name pairs, stripping X # the words, and make this list unique. X X ${LOOK} "$1${END}" "${DB_NAME}" | ${SED} "s/.*${TAB}//" | ${UNIQ} > "${TEMP_DIR}/${TMP_NAME}-2.$$" X X # ${JOIN} this list with the existing partial file X # list. X X ${JOIN} "-t${TAB}" "${TEMP_DIR}/${TMP_NAME}-1.$$" "${TEMP_DIR}/${TMP_NAME}-2.$$" > "${TEMP_DIR}/${TMP_NAME}-3.$$" X X # This becomes the new partial file list. 
X X ${MV} -f "${TEMP_DIR}/${TMP_NAME}-3.$$" "${TEMP_DIR}/${TMP_NAME}-1.$$" X else X X # A partial file list does not exist, start one, X # clear any ${RELEVANCE_ARGUMENTS} , and start a new X # one. X X RELEVANCE_ARGUMENTS="$1" X X # Find the list of word/file name pairs, striping X # the words, and make this list unique. X X ${LOOK} "$1${END}" "${DB_NAME}" | ${SED} "s/.*${TAB}//" | ${UNIQ} > "${TEMP_DIR}/${TMP_NAME}-1.$$" X fi; shift;; X not) X X # The query operator was an "not", shift over it. X X shift X if [ -f "${TEMP_DIR}/${TMP_NAME}-1.$$" ]; then X X # A partial file list exists, don't add this to X # ${RELEVANCE_ARGUMENTS}, find the list of word/file X # name pairs, striping the words, and make this list X # unique. X X ${LOOK} "$1${END}" "${DB_NAME}" | ${SED} "s/.*${TAB}//" | ${UNIQ} > "${TEMP_DIR}/${TMP_NAME}-2.$$" X X # ${JOIN} this list with the existing partial file X # list. X X ${JOIN} "-t${TAB}" "${TEMP_DIR}/${TMP_NAME}-1.$$" "${TEMP_DIR}/${TMP_NAME}-2.$$" > "${TEMP_DIR}/${TMP_NAME}-3.$$" X X # ${SORT} -m this list with the existing partial X # file list. X X ${SORT} -m "${TEMP_DIR}/${TMP_NAME}-1.$$" "${TEMP_DIR}/${TMP_NAME}-3.$$" | ${UNIQ} -u > "${TEMP_DIR}/${TMP_NAME}-2.$$" X X # This becomes the new partial file list. X X ${MV} -f "${TEMP_DIR}/${TMP_NAME}-2.$$" "${TEMP_DIR}/${TMP_NAME}-1.$$" X else X X # A partial file list does not exist-start one, X # don't add this to # ${RELEVANCE_ARGUMENTS}, find X # the list of word/file name pairs, striping the X # words, and make this list unique. X X ${LOOK} "$1${END}" "${DB_NAME}" | ${SED} "s/.*${TAB}//" | ${UNIQ} > "${TEMP_DIR}/${TMP_NAME}-1.$$" X fi; shift;; X or) X X # The query operator was an "or", shift over it. X X shift X if [ -f "${TEMP_DIR}/${TMP_NAME}-1.$$" ]; then X X # A partial file list exists, add this with a X # leading '|, to the ${RELEVANCE_ARGUMENTS}. 
X X RELEVANCE_ARGUMENTS="$1|${RELEVANCE_ARGUMENTS}" X X # Find the list of word/file name pairs, striping X # the words, and make this list unique. X X ${LOOK} "$1${END}" "${DB_NAME}" | ${SED} "s/.*${TAB}//" | ${UNIQ} > "${TEMP_DIR}/${TMP_NAME}-2.$$" X X # ${SORT} -mu this list with the existing partial X # file list. X X ${SORT} -mu "${TEMP_DIR}/${TMP_NAME}-1.$$" "${TEMP_DIR}/${TMP_NAME}-2.$$" > "${TEMP_DIR}/${TMP_NAME}-3.$$" X X # This becomes the new partial file list. X X ${MV} -f "${TEMP_DIR}/${TMP_NAME}-3.$$" "${TEMP_DIR}/${TMP_NAME}-1.$$" X else X X # A partial file list does not exist, add this with a X # leading '|, to the ${RELEVANCE_ARGUMENTS}. X X RELEVANCE_ARGUMENTS="$1" X X # the list of word/file name pairs, striping the X # words, and make this list unique. X X ${LOOK} "$1${END}" "${DB_NAME}" | ${SED} "s/.*${TAB}//" | ${UNIQ} > "${TEMP_DIR}/${TMP_NAME}-1.$$" X fi; shift;; X *) X X # The query operator was not an "and", "not", or "or". X X if [ -f "${TEMP_DIR}/${TMP_NAME}-1.$$" ]; then X X # A partial file list exists, add this with a X # leading '|, to the ${RELEVANCE_ARGUMENTS}. X X RELEVANCE_ARGUMENTS="$1|${RELEVANCE_ARGUMENTS}" X X # Find the list of word/file name pairs, striping X # the words, and make this list unique. X X ${LOOK} "$1${END}" "${DB_NAME}" | ${SED} "s/.*${TAB}//" | ${UNIQ} > "${TEMP_DIR}/${TMP_NAME}-2.$$" X X # ${SORT} -mu this list with the existing partial X # file list. X X ${SORT} -mu "${TEMP_DIR}/${TMP_NAME}-1.$$" "${TEMP_DIR}/${TMP_NAME}-2.$$" > "${TEMP_DIR}/${TMP_NAME}-3.$$" X X # This becomes the new partial file list. X X ${MV} -f "${TEMP_DIR}/${TMP_NAME}-3.$$" "${TEMP_DIR}/${TMP_NAME}-1.$$" X else X X # A partial file list does not exist, add this with a X # leading '|, to the ${RELEVANCE_ARGUMENTS}. X X RELEVANCE_ARGUMENTS="$1" X X # Find the list of word/file name pairs, striping X # the words, and make this list unique. 
X X ${LOOK} "$1${END}" "${DB_NAME}" | ${SED} "s/.*${TAB}//" | ${UNIQ} > "${TEMP_DIR}/${TMP_NAME}-1.$$" X fi; shift;; X esac X done X X # If the egrep(1) compatible regular expression search word was X # requested to be output prior to the file names, then output it. X X if [ ! "${RELEVANCE_ATTRIBUTE}" -eq "0" ]; then X ${ECHO} "${RELEVANCE_ARGUMENTS}" X fi X X # Make sure that each file name in the list is unique. X X ${SORT} -u "${TEMP_DIR}/${TMP_NAME}-1.$$" X X # Remove any temporary files. X X ${RM} -f "${TEMP_DIR}/${TMP_NAME}-1.$$" "${TEMP_DIR}/${TMP_NAME}-2.$$" } X # Index the words in the files. Sort the index, lexicographically, on the # words, making sure that word file name pairs are unique. Write the # output to a temporary file. If any errors, write the errors to # "${TEMP_DIR}/${TMP_NAME}-E.$$". Test this file for zero length before # proceeding. # # The required arguments are the list of file names, or none, in which # case the file names will be read from the stdin. X write_index() { X if [ "$#" -eq "0" ]; then X X # There are no arguments, read each file name from stdin. X X while read file_name X do X X # Parse the words from the file. X X parse_word "${file_name}" X done X else X X # There are arguments, for each one of them, read the file. X X for file_name X do X X # Parse the words from the file. X X parse_word "${file_name}" X done X fi | { ${SORT} -T "${TEMP_DIR}" -u > "${TEMP_DIR}/${TMP_NAME}-2.$$"; } 2>> "${TEMP_DIR}/${TMP_NAME}-E.$$" X X # If any errors indexing files, abort. X X if [ -s "${TEMP_DIR}/${TMP_NAME}-E.$$" ]; then X X # There was an error, abort. X X exit 1 X fi X X # If an inverted index file already exists, merge it, uniquely, with the X # new temporary inverted index file, else, move the temporary X # inverted index file to that name. X X if [ -f "${DB_NAME}" ]; then X X # An inverted index file exists, merge the new inverted index X # with it, using ${SORT} -mu. 
X X ${SORT} -T "${TEMP_DIR}" -mu "${TEMP_DIR}/${TMP_NAME}-2.$$" "${DB_NAME}" > "${TMP_DB}" X if [ ! "$?" -eq "0" ]; then X X # Couldn't merge the two files, abort. X X exit 1 X fi X else X X # An inverted index file does not exist, ${MV} the new inverted X # index as this file. X X ${MV} -f "${TEMP_DIR}/${TMP_NAME}-2.$$" "${TMP_DB}" X if [ ! "$?" -eq "0" ]; then X X # Couldn't move the file, abort. X X exit 1 X fi X fi X X # Updating the temporary inverted index file has been completed. X # Update the inverted index file. X X update_database "${DB_NAME}" "${TMP_DB}" } X # Update the database function: update the database in three steps. # # 1) The original inverted index file is backed up, using # the "mv" command. If this operation is successful, then step 2) # is executed, if not, the script aborts. # # 2) After the original inverted index file has been backed up, # the temporary inverted index file is moved, using the "mv" # command, as the new inverted index file. If this operation is # successful, then step 3) is executed, if not, the script # aborts. # # 3) After the temporary inverted index file has been moved, the # original inverted index file backup is removed. If this # operation is successful, the script exits normally, if not, # the script aborts. # # On any error, this function aborts, after trying to restore the # original database. The arguments to this function are: # # "$1" is the original database's name. # # "$2" is the new database's name. X update_database() { X # Ignore interrupts. X X trap '' 0 1 2 3 15 X X # If the inverted index database name refers to a block or character X # special file, a directory, or a named pipe, exit. Note that the X # original index may not exist, yet. X X if [ -b "$1" -o -c "$1" -o -d "$1" -o -p "$1" ]; then X ${ECHO} "The inverted index file does not exist, or is not writable or readable, aborting." 
1>&2 X ${RM} -f "${TEMP_FILES}" X exit 1 X fi X X # If the original inverted index exists, back it up, if that fails, X # attempt to restore it. X X if [ -f "$1" ]; then X X # Inverted index file exists. X X if [ -w "$1" ]; then X X # Inverted index file is writable, back it up. X X ${MV} -f "$1" "$1.BAK" X if [ ! "$?" -eq "0" ]; then X X # Backup was not successful, attempt to restore. X X if [ -f "$1.BAK" ]; then X X # Backup file exists, print the message that an X # attemp to restore is underway. X X ${ECHO} "Error backing up original index file, attempting to restore." 1>&2 X X # Remove any temporary files to attempt to open up X # disk space. X X ${RM} -f "${TEMP_FILES}" X X # Restore the backup inverted index file as the X # original inverted index file. X X ${MV} -f "$1.BAK" "$1" X if [ "$?" -eq "0" ]; then X X # Backup succeeded, notify the user. X X ${ECHO} "Restoration of original index file succeeded, aborting." 1>&2 X else X X # Backup failed, notify the user. X X ${ECHO} "Restoration of original index file failed, aborting." 1>&2 X fi X ${SYNC} X ${SYNC} X else X X # The original inverted index file was never X # backed up, test for the original. X X if [ -f "$1" ]; then X X # Original inverted index file exists, notify X # the user. X X ${ECHO} "Restoration of original index file succeeded, aborting." 1>&2 X else X X # Something is terribly wrong-the original X # inverted index and its backp are both gone. X X ${ECHO} "Restoration of original index file failed, aborting." 1>&2 X fi X X # Remove all temporary files. X X ${RM} -f "${TEMP_FILES}" X fi X exit 1 X fi X else X X # Inverted index file is not writable, abort. X X ${ECHO} "The index file is a not writeable, aborting." 1>&2 X ${RM} -f "${TEMP_FILES}" X exit 1 X fi X fi X X # The original inverted index file, if it existed, is now backed up, X # move the temporary inverted index file as the new inverted index X # file. 
X X ${MV} -f "$2" "$1" X X # Erase the original inverted index backup file, if it exists, if that X # fails, attempt to restore it. X X if [ "$?" -eq "0" ]; then X X # The new inverted index has been moved into place, erase the X # backup inverted index file, if it exists. X X if [ -f "$1.BAK" ]; then X X # The backup inverted index file exists, erase it. X X ${RM} -f "$1.BAK" X if [ ! "$?" -eq "0" ]; then X X # The removal of the backup inverted index file X # failed, notify the user. X X ${ECHO} "Error removing original index backup, attempting to restore." 1>&2 X X # Remove any temporary files to attempt to open up X # disk space. X X ${RM} -f "${TEMP_FILES}" X X # Restore the backup inverted index file as the X # original inverted index file. X X ${MV} "$1.BAK" "$1" X if [ "$?" -eq "0" ]; then X X # Backup succeeded, notify the user. X X ${ECHO} "Restoration of original index file succeeded, aborting." 1>&2 X else X X # Backup failed, notify the user. X X ${ECHO} "Restoration of original index file failed, aborting." 1>&2 X fi X ${SYNC} X ${SYNC} X exit 1 X fi X ${SYNC} X ${SYNC} X fi X else X X # The ${MV} of the new inverted index file failed, attempt to X # restore the backup file. X X if [ -f "$1.BAK" ]; then X X # Backup file exists, print the message that an attemp to X # restore is underway. X X ${ECHO} "Error writing new index file, attempting to restore." 1>&2 X X # A backup file exists, Remove any temporary files to X # attempt to open up disk space. X X ${RM} -f "${TEMP_FILES}" X X # Restore the backup inverted index file as the original X # inverted index file. X X ${MV} -f "$1.BAK" "$1" X if [ "$?" -eq "0" ]; then X X # Backup succeeded, notify the user. X X ${ECHO} "Restoration of original index file succeeded, aborting." 1>&2 X X else X X # Backup failed, notify the user. X X ${ECHO} "Restoration of original index file failed, aborting." 
1>&2 X fi X ${SYNC} X ${SYNC} X else X X # The original inverted index file was never backed up, X # test for the original. X X if [ -f "$1" ]; then X X # Original inverted index file exists, notify the user. X X ${ECHO} "Restoration of original index file succeeded, aborting." 1>&2 X else X X # Something is terribly wrong-the original inverted X # index and its backp are both gone. X X ${ECHO} "Restoration of original index file failed, aborting." 1>&2 X fi X X # Remove all temporary files. X X ${RM} -f "${TEMP_FILES}" X fi X exit 1 X fi } X # Lexical analysis function. The words in the file are parsed, one word # per record, and concatenated with a "${TAB}", and the file's name. # The records are output to the standard output. If an error occurs in # any of the programs in the pipe, it is included in the file, named # "${TEMP_DIR}/${TMP_NAME}-E.$$". If any errors occurred, exit. # # Note: This parser is not particularly elegant or fast. As a better # solution, the program "stopper," from reference 1) above (Frakes, et # al,) can be adapted to work quite well. This program is implemented as # an FSM, and supports stop words. # # The required argument is the file name to be parsed. X parse_word() { X # If no arguments, exit. X X if [ "$#" -eq "0" ]; then X ${ECHO} "No file name specified, aborting." 1>&2 X exit 1 X fi X X # If the file does not exist, or is unreadable, exit. X X if [ ! -f "$1" -o ! -r "$1" ]; then X ${ECHO} "File name $1 does not exist or is not readable, aborting." 1>&2 X exit 1 X fi X X # Select the lexical analyzer, and parse the words in the file. 
X X case "$LEXICO" in X 1) { ${TR} -cs '[a-z][A-Z][0-9]' '[\012*]' < "$1" | ${SED} "/^ *$/d;/^ /d;s,$,${TAB}$1,"; } 2>> "${TEMP_DIR}/${TMP_NAME}-E.$$";; X 2) { ${TR} -cs '[a-z][A-Z]_[0-9]' '[\012*]' < "$1" | ${SED} "/^ *$/d;/^ /d;s,$,${TAB}$1,"; } 2>> "${TEMP_DIR}/${TMP_NAME}-E.$$";; X 3) { ${TR} -cs '[a-z][A-Z][0-9]' '[\012*]' < "$1" | ${SED} "/^ *$/d;/^.$/d;/^..$/d;/^ /d;s,$,${TAB}$1,"; } 2>> "${TEMP_DIR}/${TMP_NAME}-E.$$";; X 4) { ${TR} '[A-Z]' '[a-z]' < "$1" | ${TR} -cs '[a-z][0-9]' '[\012*]' | ${SED} "/^ *$/d;/^ /d;s,$,${TAB}$1,"; } 2>> "${TEMP_DIR}/${TMP_NAME}-E.$$";; X 5) { ${TR} '[A-Z]' '[a-z]' < "$1" | ${TR} -cs '[a-z][0-9]' '[\012*]' | ${SED} "/^ *$/d;/^.$/d;/^..$/d;/^ /d;s,$,${TAB}$1,"; } 2>> "${TEMP_DIR}/${TMP_NAME}-E.$$";; X 6) { ${TR} '[A-Z]' '[a-z]' < "$1" | ${TR} -cs '[a-z][0-9]' '[\012*]' | ${EGREP} -v '^[0-9]*$' | ${SED} "/^ *$/d;/^.$/d;/^..$/d;/^ /d;s,$,${TAB}$1,"; } 2>> "${TEMP_DIR}/${TMP_NAME}-E.$$";; X 7) { ${SED} -e '1,/^ *$/d' < "$1" | ${TR} '[A-Z]' '[a-z]' | ${TR} -cs '[a-z][0-9]' '[\012*]' | ${SED} "/^ *$/d;/^ /d;s,$,${TAB}$1,"; } 2>> "${TEMP_DIR}/${TMP_NAME}-E.$$";; X 8) { ${TR} '[A-Z]' '[a-z]' < "$1" | ${TR} -cs '\\[a-z][0-9]' '[\012*]' | ${SED} "/^ *$/d;/^ /d;/\\\.*/d;s,$,${TAB}$1,"; } 2>> "${TEMP_DIR}/${TMP_NAME}-E.$$";; X *) ${ECHO} "Unknown lexical analyzer, aborting." 1>&2; exit 1;; X esac X X # If any errors are contained in "${TEMP_DIR}/${TMP_NAME}-E.$$", exit. X X if [ -s "${TEMP_DIR}/${TMP_NAME}-E.$$" ]; then X exit 1 X fi } X # Remove words from the inverted index function. Since the inverted # index file structure is records constructed of word/filename pairs, # separated by a "${TAB}", search for the word, adding a caret to # signify beginning of record, and terminated with a "${TAB}". With # multiple arguments, this script will create multiple indices. A copy of # the original is made in the temporary directory, and all operations # occur there. 
The final copy is ${MV}'ed to the index's directory and # finally update_database() is called to install the new version. # # The required arguments are regular expressions. All words in the # inverted index that match these expressions will be deleted. X remove_words() { X # If no arguments, exit. X X if [ "$#" -eq "0" ]; then X ${ECHO} "No words to remove specified, aborting." 1>&2 X exit 1 X fi X X # If the Inverted index database does not exist, or is unreadable or X # unwritable, exit. X X if [ ! -f "${DB_NAME}" -o ! -r "${DB_NAME}" -o ! -w "${DB_NAME}" ]; then X ${ECHO} "The inverted index file does not exist, or is not writable or readable, aborting." 1>&2 X exit 1 X fi X X # Make a copy of the database in the ${TEMP_DIR}-operate on the X # copy, exit if this fails. X X ${CP} "${DB_NAME}" "${TEMP_DIR}/${TMP_NAME}-1.$$" X X if [ ! "$?" -eq "0" ]; then X ${ECHO} "Error removing words from index file, aborting." 1>&2 X exit 1 X fi X X # For each argument, scan the copy of the database, using ${EGREP} X # -v, to remove any records that contain the specified word, making X # a new database in the "${TEMP_DIR}". ${MV} this file as the the X # new copy of the database. Exit on any failure. X X while [ "$#" -gt "0" ] X do X ${EGREP} -v "^$1${TAB}" "${TEMP_DIR}/${TMP_NAME}-1.$$" > "${TEMP_DIR}/${TMP_NAME}-2.$$" X X # Only check for syntax errors. No matches is OK. X X if [ "$?" -eq "2" ]; then X ${ECHO} "Error removing words from index file, aborting." 1>&2 X exit 1 X fi X X ${MV} -f "${TEMP_DIR}/${TMP_NAME}-2.$$" "${TEMP_DIR}/${TMP_NAME}-1.$$" X X if [ ! "$?" -eq "0" ]; then X ${ECHO} "Error removing words from index file, aborting." 1>&2 X exit 1 X fi X X shift X done X X # The specified words have been removed from the copy of the X # database. ${MV} the copy of the database to "${TMP_DB}". Exit on X # failure. X X ${MV} "${TEMP_DIR}/${TMP_NAME}-1.$$" "${TMP_DB}" X X if [ ! "$?" -eq "0" ]; then X ${ECHO} "Error removing words from index file, aborting." 
1>&2 X exit 1 X fi X X # Flush the buffers and call update_database() to update the X # inverted index file. X X ${SYNC} X ${SYNC} X X update_database "${DB_NAME}" "${TMP_DB}" } X # Remove files from the inverted index function. Since the inverted # index file structure is records constructed of word/filename pairs, # separated by a "${TAB}", search for the file name, adding a "${TAB}" # to the beginning of the file name, and a dollar sign to the end of the # file name to signify the end of record. With multiple arguments, this # script will create multiple indices. A copy of the original is made in # the temporary directory, and all operations occur there. The final # copy is ${MV}'e to the index's directory and finally update_database() # is called to install the new version. # # The required arguments are regular expressions. All file names in the # inverted index that match these expressions will be deleted. X remove_files() { X # If no arguments, exit. X X if [ "$#" -eq "0" ]; then X ${ECHO} "No file names to remove specified, aborting." 1>&2 X exit 1 X fi X X # If the Inverted index database does not exist, or is unreadable or X # unwritable, exit. X X if [ ! -f "${DB_NAME}" -o ! -r "${DB_NAME}" -o ! -w "${DB_NAME}" ]; then X ${ECHO} "The inverted index file does not exist, or is not writable or readable, aborting." 1>&2 X exit 1 X fi X X # Make a copy of the database in the ${TEMP_DIR}-operate on the X # copy, exit if this fails. X X ${CP} "${DB_NAME}" "${TEMP_DIR}/${TMP_NAME}-1.$$" X X if [ ! "$?" -eq "0" ]; then X ${ECHO} "Error removing files from index file, aborting." 1>&2 X exit 1 X fi X X # For each argument, scan the copy of the database, using ${EGREP} X # -v, to remove any records that contain the specified file name, X # making a new database in the "${TEMP_DIR}". ${MV} this file as the X # the new copy of the database. Exit on any failure. 
X X while [ "$#" -gt "0" ] X do X ${EGREP} -v "${TAB}$1\$" "${TEMP_DIR}/${TMP_NAME}-1.$$" > "${TEMP_DIR}/${TMP_NAME}-2.$$" X X # Only check for syntax errors. No matches is OK. X X if [ "$?" -eq "2" ]; then X ${ECHO} "Error removing files from index file, aborting." 1>&2 X exit 1 X fi X X ${MV} -f "${TEMP_DIR}/${TMP_NAME}-2.$$" "${TEMP_DIR}/${TMP_NAME}-1.$$" X X if [ ! "$?" -eq "0" ]; then X ${ECHO} "Error removing files from index file, aborting." 1>&2 X exit 1 X fi X X shift X done X X # The specified file names have been removed from the copy of the X # database. ${MV} the copy of the database to "${TMP_DB}". Exit on X # failure. X X ${MV} "${TEMP_DIR}/${TMP_NAME}-1.$$" "${TMP_DB}" X X if [ ! "$?" -eq "0" ]; then X ${ECHO} "Error removing files from index file, aborting." 1>&2 X exit 1 X fi X X # Flush the buffers and call update_database() to update the X # inverted index file. X X ${SYNC} X ${SYNC} X X update_database "${DB_NAME}" "${TMP_DB}" } X # Relevance count function: the count of records in a file that contain # match(es) will be output with the file name containing the matches. # This provides the system with a rudimentary "relevance feedback" # capability. The original text files that were used to construct the # inverted index file must be available in the system to use this # option. # # Note that ${LOOK}, essentially, uses the regular expression # "^word${TAB}" (if exact matches are enabled) to search for words, and # ${EGREP} uses, simply, "word", i.e., this function will output more # instances of records that contain "word" than were found by ${LOOK}. # # The required arguments are the regular expression to be searched for, # followed by the list of file names to be searched. X relevance_count() { X # If no arguments, or only one argument, return. X X if [ "$#" -eq "0" -o "$#" -eq "1" ]; then X return X fi X X # Read the search field. 
X X SEARCH_FIELD="$1" X shift X X # ${EGREP} each file, count the records that contain queried words, X # ignoring case. ${SORT} the output, in reverse numerical order on X # the second field, using a colon as the field delimiter. This X # arranges the file names in descending order of the number of X # queried words that the files contain. One of the problems with X # ${EGREP} is that if only one file name is present, it does not X # output the file name, so handle each file separately, using ${ECHO} X # to print the file name. X X while [ "$#" -ne "0" ] X do X ${ECHO} "$1: \c" X ${EGREP} -ic "${SEARCH_FIELD}" "$1" X shift X done | ${SORT} -n -r -t: +1 } X # Proximity retrieval function: the records in a file that contain # match(es) will be output after the file name containing the matches. # This output format provides the system with a rudimentary "permuted # index" type of "proximity retrieval." The original text files that # were used to construct the inverted index must be available in the # system to use this option. # # Note that ${LOOK}, essentially, uses the regular expression # "^word${TAB}" (if exact matches are enabled) to search for words, and # ${EGREP} uses, simply, "word", i.e., this function will output more # instances of records that contain "word" than were found by ${LOOK}. # # The required arguments are the regular expression to be searched for, # followed by the list of file names to be searched. X relevance_proximity() { X # If no arguments, or only one argument, return. X X if [ "$#" -eq "0" -o "$#" -eq "1" ]; then X return X fi X X # Read the search field. X X SEARCH_FIELD="$1" X shift X X # ${EGREP} each file, count the records that contain queried words, X # ignoring case. ${SORT} the output, in reverse numerical order on X # the second field, using a colon as the field delimiter. This X # arranges the file names in descending order of the number of X # queried words that the files contain. 
One of the problems with X # ${EGREP} is that if only one file name is present, it does not X # output the file name-so handle each file separately, using ${ECHO} X # to print the file name. Strip the colon and anything following it X # using ${SED}. X X while [ "$#" -ne "0" ] X do X ${ECHO} "$1: \c" X ${EGREP} -ic "${SEARCH_FIELD}" "$1" X shift X done | ${SORT} -n -r -t: +1 | ${SED} "s/: .*$//" | while read file_name X do X # For each file name, print the file name, and ... X X ${ECHO} "" X ${ECHO} "${file_name}:" X ${ECHO} "" X X # ${EGREP} the file for instances of the queried words, ignoring X # case and without printing the file name, pipe this output to X # ${SED} to despace, and detab the record, and print the record X # with leading and trailing dots X X ${EGREP} -i -h "${SEARCH_FIELD}" "${file_name}" | ${SED} "/[[${TAB} ]*[${TAB} ]/s/[[${TAB} ]*[${TAB} ]/ /g;/^ /s/^ //;/^/s/^/\.\.\. /;/$/s/$/ \.\.\./" X done } X # Trap all interrupts, removing the the temporary and error files on any # signal. X trap "${RM} -f ${TEMP_FILES}; exit" 0 1 2 3 15 X # The inverted index backup file should have been removed on the # completion of any previous execution of this script. If it wasn't, the # previous execution failed, and the backup file should be restored-if # this fails, exit with an error. X if [ -f "${DB_NAME}.BAK" ]; then X ${ECHO} "Index backup file exists, restoring." 1>&2 X ${MV} -f "${DB_NAME}.BAK" "${DB_NAME}" X if [ "$?" -eq "0" ]; then X ${ECHO} "Restoration of index backup file succeeded, continuing." 1>&2 X else X ${ECHO} "Restoration of index backup file failed, aborting." 1>&2 X exit 1 X fi X ${SYNC} X ${SYNC} fi X # If no operations specified, print the version, which also gives # information on help(), and exit. X if [ "$#" -eq "0" ]; then X ${ECHO} "${VERSION}" X exit 1 fi X # Set variables from command line options. The attempt here is to allow # only one mode switch in the command line-two or more are illegal. 
A # "-w" switch is required for any command that updates the database. # READ_MODE is the default mode. GETOPT could be used, but some of the # switches require two character names. X while [ "$#" -gt "0" ] do X case "$1" in X X # Lexical analysis option. X X -1) LEXICO=1; shift;; X -2) LEXICO=2; shift;; X -3) LEXICO=3; shift;; X -4) LEXICO=4; shift;; X -5) LEXICO=5; shift;; X -6) LEXICO=6; shift;; X -7) LEXICO=7; shift;; X -8) LEXICO=8; shift;; X X # Delete file(s) option-requires write mode enable. X X -df) if [ "${OP_MODE}" -eq "${WRITE_MODE}" ]; then X OP_MODE="${DELETE_FILES}"; shift X else X ${ECHO} "Delete files mode requires preceding \"-w\" option, aborting." 1>&2; exit 1 X fi;; X X # Delete words(s) option-requires write mode enable. X X -dw) if [ "${OP_MODE}" -eq "${WRITE_MODE}" ]; then X OP_MODE="${DELETE_WORDS}"; shift X else X ${ECHO} "Delete words mode requires preceding \"-w\" option, aborting." 1>&2; exit 1 X fi;; X X # Exact query match option. X X -e) END="${TAB}"; shift;; X X # Change file name option, "${TMP_NAME} is the base name of X # this file name. X X -f) shift X if [ "$#" -gt "0" ]; then X DB_NAME="$1" X TMP_NAME=`basename "$1"` X shift X else X ${ECHO} "No inverted index file name specified, aborting." 1>&2; exit 1 X fi;; X X # Request for help option. X X -h) help; shift; exit 0;; X X # Request for egrep(1) compatable regular expression search word X # to be output prior to file names option. X X -r) RELEVANCE_ATTRIBUTE=1; shift;; X X # Count of records in a file that contain match(s) option- X # requires no previous mode switch.. X X -rc) if [ "${OP_MODE}" -eq "${READ_MODE}" ]; then X RELEVANCE_ATTRIBUTE=1; OP_MODE="${RELEVANCE_COUNT}" ;shift X else X ${ECHO} "Illegal relevance mode specified, aborting." 1>&2; exit 1 X fi;; X X # Records in a file that contain match(s) will be output after X # the file name option-requires no previous mode switch.. 
X X -rp) if [ "${OP_MODE}" -eq "${READ_MODE}" ]; then X RELEVANCE_ATTRIBUTE=2; OP_MODE="${RELEVANCE_PROXIMITY}" ;shift X else X ${ECHO} "Illegal relevance mode specified, aborting." 1>&2; exit 1 X fi;; X X # Version option. X X -v) ${ECHO} "${VERSION}"; shift; exit 0;; X X # Write mode enable, requires no previous mode switch. X X -w) if [ "${OP_MODE}" -eq "${READ_MODE}" ]; then X OP_MODE="${WRITE_MODE}"; shift X else X ${ECHO} "Illegal write mode specified, aborting." 1>&2; exit 1 X fi;; X X # Anything else is a file name or query word/operator. X X *) break;; X esac done X # Dispatch to the specified operation. X case "${OP_MODE}" in X X # Write mode, call the function. X X "${WRITE_MODE}") write_index "$@";; X X # Read mode, call the function. X X "${READ_MODE}") read_index "$@";; X X # Delete words mode, call the function. X X "${DELETE_WORDS}") remove_words "$@";; X X # Delete files mode, call the function. X X "${DELETE_FILES}") remove_files "$@";; X X # Relevance count, call read_index() to get a list of the files, with X # RELEVANCE_ATTRIBUTE set. X X "${RELEVANCE_COUNT}") relevance_count `read_index "$@"`;; X X # Proximity relevance, call read_index() to get a list of the files, X # with RELEVANCE_ATTRIBUTE set. Pass these files to X # relevance_proximity(). X X "${RELEVANCE_PROXIMITY}") relevance_proximity `read_index "$@"`;; X X # Nothing to do, fall through to the exit. X X *) ;; esac X # Restore trapping all interrupts, removing the temporary and error # files on any signal, (update_database() disables interrupts.) 
X
trap "${RM} -f ${TEMP_FILES}; exit" 0 1 2 3 15
X
Xexit 0
SHAR_EOF
chmod 0766 qt/qt || echo 'restore of qt/qt failed'
Wc_c="`wc -c < 'qt/qt'`"
test 65404 -eq "$Wc_c" || echo 'qt/qt: original size 65404, current size' "$Wc_c"
fi
# ============= qt/README ==============
if test -f 'qt/README' -a X"$1" != X"-c"; then
	echo 'x - skipping qt/README (File already exists)'
else
echo 'x - extracting qt/README (Text)'
sed 's/^X//' << 'SHAR_EOF' > 'qt/README' &&
Qt stands for Query Text, a text information retrieval system. Qt
creates, maintains, and queries a full text database. The database
file system is organized as an inverted index. The program is written
as a single script, in Bourne Shell, and permits simple natural
language queries.
X
As a simple application example, this program can be used to search
the "catman" pages for a command that performs a specific function,
even though the command's name is not known; e.g., if you knew what
you wanted to do, you could find the command that would do it.
X
The program, qt, is free software, and can be redistributed and/or
modified, without any restrictions. It is distributed with no warranty
of any kind, implied or otherwise. Specifically, there is no warranty
of fitness for any particular purpose and/or merchantability.
X
Comments and/or bug reports should be addressed to:
X
X    john@johncon.com (John Conover)
X
Known caveats: There is no concurrency control; it would be ill-advised
to use this program as a concurrent application. Additionally, the
natural language query does not support grouping operators.
X
For a quick start, execute qt -h for help, which may be redirected to
stdio. At the "tail -23" of this help file are some simple commands to
evaluate this script.
X
Installation:
X
The comments in this script are verbose, and should be stripped prior
to any installation with something like:
X
X    sed '/^ *#/d;/^$/d' qt > qt.new
X
and installing qt.new as qt in the executable path.
Likewise, possibly, the function, help(), should be eliminated. The
function, find_program(), is not efficient and should be eliminated
by hard coding the paths to the various programs on your system.
There are tab characters used in this script (which are referenced
as the variable, "${TAB}"), requiring that the script be saved with
tabs.
SHAR_EOF
chmod 0644 qt/README || echo 'restore of qt/README failed'
Wc_c="`wc -c < 'qt/README'`"
test 1844 -eq "$Wc_c" || echo 'qt/README: original size 1844, current size' "$Wc_c"
fi
exit 0
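
A note on the comment-stripping command given in the README's
installation section: the pattern /^ *#/ deletes every line that begins
with optional spaces and a "#", which includes the "#!/bin/sh" line at
the top of the script, and it would not match comment lines indented
with tabs, if there are any. The following is only a minimal sketch,
assuming the interpreter line should be kept; strip_comments is a
hypothetical helper and is not part of qt. Like the original command,
it cannot tell a comment from a line starting with "#" inside a
here-document or a multi-line quoted string.

```shell
#!/bin/sh
# strip_comments FILE: hypothetical helper, not part of qt.
# Prints FILE with comment-only lines and empty lines removed,
# always keeping line 1 (normally the "#!/bin/sh" line).
strip_comments() {
    # 1!{...}      apply the comment rule to every line except the first
    # [[:space:]]* allow leading spaces or tabs before the '#'
    # /^$/d        delete empty lines, as in the README's command
    sed -e '1!{/^[[:space:]]*#/d;}' -e '/^$/d' "$1"
}

# Possible use, mirroring the README:
#   strip_comments qt > qt.new
```

Lines with trailing comments after a command are left untouched, since
only lines that consist of a comment are matched.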