- From: lee@sq.sq.com (Liam R. E. Quin)
- Newsgroups: alt.sources
- Subject: lq-text Full Text Retrieval Database Part 01/13
- Message-ID: <1991Mar4.020026.16059@sq.sq.com>
- Date: 4 Mar 91 02:00:26 GMT
-
- : cut here --- cut here --
- : To unbundle, sh this file
- #! /bin/sh
- # make the directory structure:
- test -d lq-text || mkdir lq-text
- test -d lq-text/doc || mkdir lq-text/doc
- test -d lq-text/Sample || mkdir lq-text/Sample
- test -d lq-text/src || mkdir lq-text/src
- test -d lq-text/src/filters || mkdir lq-text/src/filters
- test -d lq-text/src/h || mkdir lq-text/src/h
- test -d lq-text/src/liblqtext || mkdir lq-text/src/liblqtext
- test -d lq-text/src/lqtext || mkdir lq-text/src/lqtext
- test -d lq-text/src/menu || mkdir lq-text/src/menu
- test -d lq-text/src/test || mkdir lq-text/src/test
- test -d lq-text/src/ozmahash || mkdir lq-text/src/ozmahash
-
- echo x - lq-text/README 1>&2
- sed 's/^X//' >lq-text/README <<'@@@End of lq-text/README'
- XLiam Quin's text retrieval package (lq-text) Sun Mar 3 17:18:26 EST 1991
- Xsrc/h/Revision.h defines this as Revision 1.10.
- X
- Xlq-text is copyright 1990, 1991 Liam R. E. Quin; see src/COPYRIGHT for details.
- X
- X
- XWhat It Does:
- X Lets you search for phrases in text that you previously indexed.
- X The necessary indexing program (lqaddfile) is enclosed. Indexes are
- X usually less than the size of the data, and sometimes half that.
- X There is a browser (lqtext) for System V, and a shell script (lq) for
- X any Unix system. There is also a program (lqkwik) that turns the
- X output of lqphrase or "lqword -l" into a keyword-in-context (KWIC) list.
- X
- XHow to Install It
- X unpack this tar
- X cd lq-text/src
- X edit h/globals.h (following the instructions in there. Use ozmahash)
- X edit Makefile
- X make depend # If you have mkdep. If you don't, and you can't get it
- X # -- e.g. from the tahoe BSD distribution -- you'll have
- X # to edit all of the makefiles to delete everything
- X # below the DO NOT DELETE pair of lines (leave the ones
- X # that say "DO NOT DELETE", though).
- X make all # this will put things in src/bin and src/lib
- X make install # This will put things in $BINDIR and $LIBDIR.
- X
- X You might want to try
- X make local # This will put stripped executables in src/bin and src/lib;
- X # I find this convenient for testing.
- X before doing a make install.
- X
- X See below for possible problems.
- X
- X
- XHow to Use It
- X (see doc/*)
- X Make a directory $HOME/LQTEXTDIR (or set $LQTEXTDIR to point to the
- X currently empty directory you want to use instead).
- X Add lq-text/src/bin and lq-text/src/lib to your PATH.
- X Put a README file in $LQTEXTDIR:
- X docpath /my/login/directory:/or/somewhere/else
- X common Common
- X and make an empty file called Common (or include words like "uucp"
- X that you don't want indexed) in the same directory.
- X Find some files (e.g. your mailbox) and say
- X lqaddfile -t2 file [...]
- X You should see some diagnostic output... (this is what -t2 does).
- X lqaddfile may take several minutes to write out its data, depending
- X on the system. Try a small file first -- you can add more later!
- X Another fun thing to try is setting DOCPATH to /usr/man and running
- X cd /usr/man
- X find man* -type f -print | lqaddfile -t2 -f -
- X to make an index of the manual pages (use cat* instead of man* if you
- X prefer). If you have less than 10 meg or so of RAM, give lqaddfile the
- X -w100000 option -- this is the number of words to keep in memory before
- X writing to the database. The idea is that the number should be small
- X enough to prevent frantic paging activity!
- X
- X
- X Now try
- X lqword ---> an unsorted list of all known words
- X lq ---> type phrases and browse through them
- X lqtext ---> curses-based browser, if it compiled.
- X
- X lqshow `lqphrase "floppy disk"` ---> lq does this for you
- X lqkwik `lqphrase "floppy disk"` ---> this is the most fun.
- X
- X
- X If the files you are indexing have pathnames with leading bits in
- X common (e.g. indexing a directory such as /usr/spool/news, or
- X /home/lee/text/humour), make use of DOCPATH. This is searched
- X linearly, so a dozen or so entries is the practical limit at the
- X moment.
- X
- X Every indexed pathname must fit into a dbm page, which is 4KBytes
- X with sdbm but probably much less (e.g. 512) with dbm. With ozmahash
- X this problem has gone away.
- X
- X
- XKnown Problems
- X lqaddfile runs extraordinarily slowly if the database directory is
- X mounted over a network with NFS. Run lqaddfile on the NFS server --
- X there's no problem with having the data files on a remote system.
- X
- X With this distribution I am including both Ozan Yigit's sdbm package
- X and the BSD hash package written by Ozan Yigit and Margo Seltzer. The
- X latter is called "ozmahash" here, to avoid confusion with System V hash.
- X Try using ozmahash first, and if that doesn't work use sdbm. The hash
- X package seems to work on all the systems here, but it might not do so
- X well on System V. Sdbm has been ported extensively, but is slower.
- X
- X If you end up with one or more empty .dir or .pag files in the
- X LQTEXTDIR directory, you probably have a broken sdbm/ndbm/dbm. Try
- X recompiling with a different dbm package if possible. In particular,
- X early versions of sdbm had this problem.
- X
- X There are some tests, but it is not always
- X clear how to run them. I intend to make a little test suite...
- X If you get strange error messages, try
- X testbin/dbmtry 5000
- X (this will make and leave behind either one or two files in /tmp).
- X Then try testbin/dbmtry 10000. If that gives errors, the most likely
- X problem is that you have a faulty bcopy. I have included a version
- X of bcopy() that is linked in by default -- perhaps you aren't using
- X it? Do _not_ use memcpy(), as it doesn't handle overlapping regions
- X correctly.
- X
- X If -lmalloc fails, simply remove it in Makefile.
- X If you don't have <malloc.h>, you can make an empty file called
- X h/malloc.h (ugh). I ship a Makefile with -lmalloc because it's such a
- X big win when it is available, and I wouldn't want anyone to forget it!
- X
- X On a Sun, gcc might have some strange problems with libraries. If so,
- X use cc. Sorry.
- X You can use -O on all systems I've tried, and -O4 seems OK on the Sun --
- X at any rate I have done this on my Sun 4/110 under SunOS 4.0.3 here.
- X
- X In ancient history, I used gcc -Wall under 386/ix. I still port
- X lq-text to 386/ix (2.0.2 most recently, October 1990), but can no
- X longer use gcc there because of disk space, so I don't know if gcc
- X will produce messages. Versions of Unix predating the Norman Conquest
- X may cause problems too.
- X
- X For serious debugging, I have included "saber.project", so Saber-C
- X users can get started quickly. If you are debugging without Saber-C,
- X the first thing to do is to buy it. It's worth it...
- X
- X
- X Otherwise, compile with -DASCIITRACE. You could also use
- X -DMALLOCTRACE, which makes the malloc() routines print messages to
- X stderr, which can be processed with awk -- see test/malloctrace.
- X
- X
- X Oh, and the common word list is searched linearly, so it is worth
- X keeping it fairly short. Usually about a dozen words is plenty.
- X
- X
- XLee
- X
- Xlee@sq.com
- Xlee%sq.com@cs.toronto.edu
- X{uunet,utzoo,cs.toronto.edu}!sq!lee
- @@@End of lq-text/README
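
The "How to Use It" steps in the README above can be sketched as a small script. This is only an illustration under stated assumptions: the docpath value is a placeholder, the "end" line follows the Sample/README shipped later in this archive, and lqaddfile is run only if it is already on the PATH and file names were given as arguments.

```shell
#!/bin/sh
# Sketch only: create an empty lq-text database directory as the README describes.
LQTEXTDIR=${LQTEXTDIR:-$HOME/LQTEXTDIR}
export LQTEXTDIR
test -d "$LQTEXTDIR" || mkdir -p "$LQTEXTDIR"

# The database README names the document search path and the common-word file.
# "docpath $HOME" is a placeholder; the shell expands $HOME here because the
# Sample/README says $HOME is not understood inside the file itself.
if test ! -f "$LQTEXTDIR/README"; then
    cat > "$LQTEXTDIR/README" <<EOF
docpath $HOME
common Common
end
EOF
fi

# An empty Common file means no words are excluded from the index.
test -f "$LQTEXTDIR/Common" || : > "$LQTEXTDIR/Common"

# Index the files named on the command line, with -t2 diagnostics,
# but only when lqaddfile is actually installed.
if command -v lqaddfile >/dev/null 2>&1 && test $# -gt 0; then
    lqaddfile -t2 "$@"
fi
```

Starting with one small file, as the README suggests, keeps the first run short; more files can be added later.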
- echo x - lq-text/doc/lqtext.1 1>&2
- sed 's/^X//' >lq-text/doc/lqtext.1 <<'@@@End of lq-text/doc/lqtext.1'
- X.\" use sqtbl % | troff -man
- X.de r2
- X.RS
- X.RS
- X..
- X.de re
- X.RE
- X.RE
- X..
- X.TH LQ-TEXT 1 "\(co copyright Liam Quin 1989, 1990"
- X.SH NAME
- Xlqtext, lqword, lqphrase, lqaddfile, lqfile, lqkwik, lqshow, lq \- text retrieval package
- X.SH SYNOPSIS
- X.B lqtext
- X[
- X.B \-vVx
- X] [
- X.BI \-c cfile
- X] [
- X.BI \-d dir
- X] [
- X.BI \-m c
- X]
- X.br
- X.B lqword
- X[
- X.B \-aAlsvVx
- X] [
- X.BI \-c cfile
- X] [
- X.BI \-d dir
- X] [
- X.BI \-m c
- X] [
- X.BI \-t n
- X]
- X.I word
- X\&.\|.\|.
- X.br
- X.B lqphrase
- X[
- X.B \-lsvVx
- X] [
- X.BI \-c cfile
- X] [
- X.BI \-d dir
- X] [
- X.BI \-t n
- X] [
- X.BI \-m c
- X]
- X.I phrase
- X\&.\|.\|.
- X.br
- X.B lqaddfile
- X[
- X.BI \-xvV
- X] [
- X.BI \-d dir
- X] [
- X.BI \-c cfile
- X] [
- X.BI \-t n
- X]
- X.I file
- X\&.\|.\|.
- X.br
- X.B lqfile
- X[
- X.BI \-aAxvV
- X] [
- X.I file
- X]
- X\&.\|.\|.
- X.br
- X.B lqshow
- X.I match
- X\&.\|.\|.
- X.br
- X.B lq
- X.SH DESCRIPTION
- X.I lq-text
- Xis a text retrieval database.
- XYou can retrieve files based on words (with
- X.IR lqword )
- Xor phrases (with
- X.IR lqphrase ).
- X.I Lq-text
- Xkeeps a database containing all of the known words and listing the
- Xfiles in which they are found.
- XThis database is typically between half and three-quarters of the total
- Xsize of the actual data, and enables searching to be rapid.
- XFiles can be added to the database at any time (with
- X.IR lqaddfile ).
- X.PP
- XThe retrieval programs will give you the names of files containing the
- Xwords about which you enquired, but will not show the actual text.
- XThis means that you can archive or remove files, and
- X.I lq-text
- Xcan still find them.
- X.I lqshow
- Xwill display the matches directly.
- XIf it is installed on your system, the
- X.I lqtext
- Xprogram provides an interactive front end. (This program is generally
- Xonly available under System V Release 3.2 or later at the time of writing).
- XIf not, there is a shell-script called
- X.I lq
- Xwhich is rather slower but which provides much of the same functionality.
- X.I Lqkwik
- Xtakes a list of matches as produced by
- X.I lqword
- Xor
- X.I lqphrase
- Xand prints a few words either side of each match, formatted so that the
- Xmatched phrases are all in the same column.
- XThere are options to alter the sizes of the various columns.
- XSince
- X.I lqkwik
- Xis new and experimental, it is not yet otherwise documented here.
- X.SH "OPTIONS (all programs)"
- X.TP
- X.BI \-m c
- XSet the matching level. If
- X.I c
- Xis
- X.BR p ,
- Xprecise matching is used;
- X.B \-mh
- Xinvokes heuristic matching, and
- X.B \-ma
- Xallows approximate matching.
- XSee below for more explanation of word and phrase matching.
- X.TP
- X.BI \-t n
- XSet the trace-level to
- X.IR n .
- XThis is mainly used for debugging.
- XThe default trace level is zero, giving no debugging trace at all.
- X.TP
- X.B \-v
- Xverbose mode \- this is exactly the same as using
- X.BR \-t1 .
- X.TP
- X.B \-V
- Xprint version information.
- X.TP
- X.B \-x
- XPrint an explanation of options. This is the single most important
- Xoption to remember (and arguably the
- X.I only
- Xone worth remembering!), as the programs may get updated more
- Xoften than the documentation.
- X.br
- XThe
- X.B \-x
- Xand
- X.B \-v
- Xoptions can be combined, so that
- X.B \-xv
- Xgives a slightly longer explanation.
- X.TP
- X.BI \-d dir
- XLook in the named directory
- Xfor the database files.
- XIf this is not given, the environment variable
- X.SM LQTEXTDIR
- Xis inspected, and either this or a built-in default is used.
- X.TP
- X.BI \-c file
- XThe named file should contain a list of words that will not
- Xbe included in the index. A good starting point might
- Xbe
- X.I /usr/lib/eign
- Xif your system has it. If not, see the
- X.I FindCommon
- Xscript for a way to generate one.
- XIf this option is not given, the programs search for the file named
- Xin the
- X.SM LQCOMMON
- Xenvironment variable, and then look in the file
- X.SM README
- Xin the
- X.I lq-text
- Xdatabase directory for a line of the form
- X.br \" is this .br needed?
- X.r2
- X.B common
- X.I " filename"
- X.re
- X.br\" and this one too?
- Xbefore the first
- X.I end
- Xkeyword.
- X.SH "LQADDFILE OPTIONS"
- X.TP
- X.BI \-w n
- XNormally
- X.I lqaddfile
- Xkeeps a cache of the words it has seen, and writes them out only
- Xoccasionally. The less often the cache is written, the faster the
- Xprogram will run. On the other hand, as soon as
- X.I lqaddfile
- Xgrows large enough to fill all of available physical memory, it starts
- Xto run very, very slowly and to impose a noticeable system overhead.
- X.PP
- XThe total number of words in the cache determines (approximately) the
- Xtotal size of
- X.I lqaddfile
- Xwhen it runs. Allow about twenty bytes per word.
- XValues of from 30,000 to 100,000 appear suitable for machines with from
- Xfour to twelve megabytes of memory, as a rough guide.
- X.TP
- X.BI \-f \^file
- XThe list of files to index is read from the named
- X.IR file .
- XIf this file is `\-', standard input is read.
- X.SH "LQFILE OPTIONS"
- X.TP
- X.B \-a
- Xproduces a list of all files in the database
- X.TP
- X.B \-A
- Xtreats each of the remaining arguments as files to add to the file list.
- XNo indexing is done, so the main effect of this is that the named files
- Xwill not be added to the database until they have changed.
- X.SH "LQSHOW OPTIONS"
- X.TP
- X.BI \-a above
- XDisplay
- X.I above
- Xlines of text above each match.
- XThe default is to display up to six lines preceding each match from each file.
- X.TP
- X.BI \-b below
- XDisplay
- X.I below
- Xlines of text following each match.
- XThe default is to display six lines of text after the line containing
- Xthe first matched keyword in a phrase.
- XIf there are too many lines to fit on the screen, they will wrap around
- Xto the top of the screen.
- X.TP
- X.BI \-f \^file
- XThe named
- X.I file
- Xis assumed to contain a list of matches in the form produced
- Xby
- X.I lqword
- X.IR \-l ,
- Xand allows browsing of a much longer list of matches than the form given below.
- X.PP
- XRemaining arguments are taken to be matches.
- XThese are groups of three strings:
- Xa number representing the block in the file, another representing the
- Xword in the block, and finally a file name (or path).
- XFiles will be found if they are absolute (starting with a /), or if they
- Xare in a directory which is specified in
- Xthe
- X.SM DOCPATH
- Xenvironment variable, as described under `environment' below.
- X.PP
- XThere are also some (deliberately) undocumented options used by the
- X.I lq
- Xshell script.
- X.SH "LQWORD OPTIONS"
- XWith no options at all,
- X.I lqword
- Xwill list all of the words in the database, one per line.
- XIf it is given the
- X.B \-a
- Xflag, it will print statistics about each word as well as the word itself.
- XIf given the
- X.B \-A
- Xoption, it will print out the pathname, block and word-in-block of every
- Xoccurrence of every word in the database.
- XThis can take some time (from one to two minutes per megabyte of database
- Xon a typical 386/ix system, for example).
- X.PP
- XOther options are
- X.TP
- X.B \-l
- Xlist format \- list matches without attempting to format them for human
- Xreadability. This allows one to use
- X.r2
- Xlqshow \`lqword \-l word1 word2 ...\`
- X.re
- Xto view files immediately.
- X.PP
- XOther options to
- X.I lqword
- Xare:
- X.TP
- X.BI \-d word
- Xdelete mode \- delete the given
- X.I word
- Xfrom the database.
- XThis should be used with caution, and will be moved to the
- Xnew (and unreleased)
- X.I lqadmin
- Xcommand in the next release.
- X.TP
- X.B \-s
- Xsilent mode.
- XIn this mode,
- X.I lqword
- Xdoes not produce any output, but the exit status is zero if at least
- Xone of the given words was found, and one otherwise.
- XIf no words are given in this mode,
- X.I lqword
- Xwill exit with non-zero status.
- X.SH MATCHING
- XThe
- X.I "matching level"
- Xwas mentioned briefly under
- X.SM OPTIONS
- Xabove.
- XThe following table summarises the differences between the three available
- Xlevels.
- X.br
- X.\" the .ne lines are for broken versions of tbl...
- X.TS
- Xallbox doublebox;
- XlB lB lB
- Xl l l.
- X\-m Option Meaning Description
- X=
- X.ne 4
- X\-m\^p Precise T{
- X.ll 3i
- XPhrases in the data must have the same
- XCapitalisation as you type, and words must be the same distance apart.
- X.br
- XUse this only if you get too many matches otherwise.
- XT}
- X.ne 4
- X\-m\^h Heuristic T{
- X.ll 3i
- XWords that you give starting with Capital Letters will only match
- Xsimilar words in the database; lower case words will match either.
- X.br
- XPlurals that you give will only match plurals in the data, but
- Xa singular word will match either. For example, if you type `sock',
- Xyou will find both `sock' and `socks'.
- XT}
- X.ne 4
- X\-m\^\&a Any T{
- X.ll 3i
- XWith this option, \&\fIlq-text\fP
- Xmatching programs will try as hard as possible to match words.
- X.br
- XPlurals, possessives, case and word separation are all treated loosely.
- XT}
- X.TE
- X.SH EXAMPLES
- X.r2
- Xlqword martin
- X.re
- Xfinds all matches of the word `martin' in the default database
- X.r2
- Xlqphrase -d sources/unix "text retrieval" "word searching"
- X.re
- Xlooks for the two named phrases.
- X.SH ENVIRONMENT
- XAll of the programs recognise the environment variables
- X.BR LQTEXTDIR ,
- X.B DOCPATH
- Xand
- X.BR LQCOMMON .
- XThe first of these contains the directory in which to look for the
- Xdatabase files. If this is not given, you may find that there is
- Xa further default built in to the programs when they were compiled.
- XThe
- X.B -d
- Xoption overrides the
- X.SM LQTEXTDIR
- Xvariable.
- X.br
- XThe second,
- X.SM DOCPATH ,
- Xis a colon-separated list of places to look for files when adding or
- Xdisplaying.
- XThis is normally set in the file
- X.SM README
- Xin the database directory.
- X.br
- XFinally,
- X.SM LQCOMMON
- Xcan name a file containing a list of words to ignore when adding or retrieving files.
- XAgain, this is usually set in the
- X.SM README
- Xfile, but you might want to treat certain files differently.
- XThe default Common Word List is called `Common', and lives in the
- Xdatabase directory.
- XIf
- X.SM LQCOMMON
- Xstarts with a `/', it will be taken to be an absolute pathname;
- Xotherwise, it is assumed to name a file in the database directory.
- X.SH BUGS
- XThis is a beta (test) release.
- XPlease don't hesitate to fix bugs and let me know what you did\^.\^.\^.
- X.sp
- XThis documentation is at best preliminary.
- X.SH AUTHOR
- XLiam R. Quin, 1990
- X.
- X.\" $Log: lqtext.1,v $
- X.\" Revision 1.3 91/03/03 00:23:19 lee
- X.\" Mentioned lqkwik
- X.\"
- X.\" Revision 1.2 90/10/06 02:32:22 lee
- X.\" Prepared for first beta release.
- X.\"
- X.\" Revision 1.1 90/10/04 17:38:34 lee
- X.\" Initial revision
- X.\"
- X.\"
- @@@End of lq-text/doc/lqtext.1
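
The manual page above says that lqshow takes matches as groups of three strings (block number, word within the block, file name) and that relative names are looked up along DOCPATH. A minimal sketch of that lookup follows; the function names resolve_doc and list_matches are made up for illustration, and this is not lqshow's actual code.

```shell
# resolve_doc NAME: print the path that NAME plausibly refers to.
# Absolute names are used as-is; others are tried in each DOCPATH directory.
resolve_doc() {
    case $1 in
    /*) echo "$1"; return 0 ;;
    esac
    saved_ifs=$IFS
    IFS=:
    for dir in $DOCPATH; do
        if test -f "$dir/$1"; then
            IFS=$saved_ifs
            echo "$dir/$1"
            return 0
        fi
    done
    IFS=$saved_ifs
    return 1
}

# list_matches BLOCK WORD FILE ...: walk the arguments three at a time.
list_matches() {
    while test $# -ge 3; do
        path=`resolve_doc "$3"` || path="(not found) $3"
        echo "block $1, word $2: $path"
        shift 3
    done
}
```

Something like list_matches `lqword -l word` would then show where each match lives without displaying any text. Because DOCPATH is walked in order, this sketch is linear in the number of entries, which matches the README's advice to keep the list short.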
- echo x - lq-text/Sample/CommonWords 1>&2
- sed 's/^X//' >lq-text/Sample/CommonWords <<'@@@End of lq-text/Sample/CommonWords'
- X# lq-text common word stop-list
- X
- X# Keep this list short -- at most 50 words -- or you will pay a penalty
- X# in performance when you add documents to the index -- the list is
- X# searched linearly (but is kept sorted internally, so it's OK to have
- X# duplicates in here).
- X
- X# First index some text with everything commented out, and then use
- X# FindCommon to determine which are very common words. You don't gain
- X# all that much space by deleting them, so I don't usually bother.
- X
- X# the # 27880 <-- number of times this word appeared in a sample run
- X# and # 23857 <-- on part (or all? I forget) of the King James Bible..
- X# that # 4705
- X# for # 3011
- @@@End of lq-text/Sample/CommonWords
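
The CommonWords comments above suggest finding very common words before deciding what to put in the stop-list. The FindCommon script is not shown in this part of the archive, so the following is only a stand-in sketch using standard tools; count_words is a hypothetical name and the word definition (runs of ASCII letters, case folded) is an assumption.

```shell
# count_words FILE [N]: print the N most frequent words in FILE (default 20),
# each preceded by its count, most frequent first.
count_words() {
    tr -cs 'A-Za-z' '\n' < "$1" |
        tr 'A-Z' 'a-z' |
        sed '/^$/d' |
        sort | uniq -c | sort -rn |
        head -n "${2:-20}"
}
```

Running count_words on a sample of the text you plan to index gives counts in the same spirit as the commented-out entries above.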
- echo x - lq-text/Sample/README 1>&2
- sed 's/^X//' >lq-text/Sample/README <<'@@@End of lq-text/Sample/README'
- X# This file is read (up to "end") by all lq-text programs.
- X
- Xcommon CommonWords
- X
- X# where to find documents:
- Xdocpath /usr/spool/news:/home/lee/text:
- X
- Xend
- X# end of machine-readable configuration (the computer reads no further! --
- X# this is an optimisation for start-up speed...!)
- X
- X# Common common-file
- X# --- giving the name of a file of common words
- X# Docpath "path" (the quotes are optional)
- X# --- giving a list of places to look for files, separated by ":"
- X# Useful tip: avoid putting :: or "." in DOCPATH, as you'll
- X# then get files that might or might not be found, depending
- X# on where you happen to be. $HOME is NOT understood in here.
- X#
- X# Docpath can be replaced by the environment variable $DOCPATH.
- X#
- X# I'll be adding other keywords soon... and I am open to suggestions!
- X# Lee
- X# Liam R. E. Quin lee@sq.com
- X
- X/*
- X * LQ-TEXT Copyright 1990 Liam Russell Eric Quin. All rights reserved.
- X * Written by Liam Quin.
- X *
- X * This software is not subject to any license of the American Telephone
- X * and Telegraph Company or of the Regents of the University of California,
- X * or of the X Consortium, or of the Free Software Foundation.
- X *
- X * Permission is granted to anyone to use this software for any purpose on
- X * any computer system, and to alter it and redistribute it freely, subject
- X * to the following restrictions:
- X *
- X * 1. The author is not responsible for the consequences of use of this
- X * software, no matter how awful, even if they arise from flaws in it.
- X *
- X * 2. The origin of this software must not be misrepresented, either by
- X * explicit claim or by omission. Since few users ever read sources,
- X * credits must appear in the documentation.
- X *
- X * 3. Altered versions must be plainly marked as such, and must not be
- X * misrepresented as being the original software. Since few users
- X * ever read sources, credits must appear in the documentation.
- X *
- X * 4. Permission must be obtained for any commercial use of this software
- X * which involves resale of all or part of the software, whether
- X * modified or not.
- X *
- X * 5. This notice may not be removed or altered.
- X *
- X */
- X
- X/*
- X * Acknowledgements to Henry Spencer for permission to modify and use his
- X * and Geoff Collyer's C News copyright notice.
- X *
- X */
- X
- X
- @@@End of lq-text/Sample/README
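
The Sample/README above is read only up to the "end" line, with "#" comments ignored. A rough sketch of a reader for that format is below. It is not the parser lq-text uses: read_lq_config is a hypothetical name, accepting both "docpath" and "Docpath" spellings is an assumption based on the comments in the sample, and optional quotes around the path are not stripped.

```shell
# read_lq_config FILE: print the docpath and common settings found in a
# README-style configuration file, stopping at the first "end" line.
read_lq_config() {
    while read -r keyword value; do
        case $keyword in
        end) break ;;
        '#'*|'') continue ;;
        docpath|Docpath) echo "docpath=$value" ;;
        common|Common) echo "common=$value" ;;
        esac
    done < "$1"
}
```

Stopping at "end" is what lets the rest of the file hold free-form notes, such as the copyright text above, without affecting start-up.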
- echo x - lq-text/src/COPYRIGHT 1>&2
- sed 's/^X//' >lq-text/src/COPYRIGHT <<'@@@End of lq-text/src/COPYRIGHT'
- X/*
- X * Copyright 1989 Liam Russell Eric Quin. All rights reserved.
- X * Written by Liam Quin.
- X *
- X * This software is not subject to any license of the American Telephone
- X * and Telegraph Company or of the Regents of the University of California,
- X * or of the X Consortium, or of the Free Software Foundation.
- X *
- X * Permission is granted to anyone to use this software for any purpose on
- X * any computer system, and to alter it and redistribute it freely, subject
- X * to the following restrictions:
- X *
- X * 1. The author is not responsible for the consequences of use of this
- X * software, no matter how awful, even if they arise from flaws in it.
- X *
- X * 2. The origin of this software must not be misrepresented, either by
- X * explicit claim or by omission. Since few users ever read sources,
- X * credits must appear in the documentation.
- X *
- X * 3. Altered versions must be plainly marked as such, and must not be
- X * misrepresented as being the original software. Since few users
- X * ever read sources, credits must appear in the documentation.
- X *
- X * 4. Permission must be obtained for any commercial use of this software
- X * which involves resale of all or part of the software, whether
- X * modified or not.
- X *
- X * 5. This notice may not be removed or altered.
- X *
- X */
- X
- X/*
- X * Acknowledgements to Henry Spencer for permission to modify and use his
- X * and Geoff Collyer's C News copyright notice.
- X *
- X */
- X
- X
- @@@End of lq-text/src/COPYRIGHT
- echo x - lq-text/src/Makefile 1>&2
- sed 's/^X//' >lq-text/src/Makefile <<'@@@End of lq-text/src/Makefile'
- X# Makefile for LQ-Text, a full text retrieval package by Liam R. Quin
- X#
- X# $Id: Makefile,v 1.13 91/03/02 20:23:02 lee Exp $
- X#
- X
- X# Do this first for sanity...:
- XSHELL=/bin/sh
- X
- X### Some global configuration options.
- X
- X# You should also look at h/globals.h for more things to change.
- X#
- X# DEFS are included in CFLAGS, passed to the C compiler:
- X# If ASCIITRACE is defined, you can get extra debugging output using -t99
- X# (or some other number), but there is a slight performance penalty for
- X# including this, and you'd need to understand the code.
- X
- X# DEFS:
- X# Use either -UBSD -DSYSV or -USYSV -DBSD as appropriate...
- X# This affects
- X# * the choice of default pager ($PAGER) in globals.h
- X# * whether some extra declarations are used to make lint and gcc -Wall
- X# happy about SysV stdio.h
- X# It isn't very important... if you have System V stdio and curses, you
- X# might as well use -DSYSV -UBSD even on BSD. Ultrix diffs are included
- X# inside #ifdef ultrix; use BSD on Ultrix.
- X# Lqtext doesn't do explicit locking or signal handling at the moment.
- X# If you are using sdbm (this is what I use) and get messages about L_SET
- X# or L_SEEK being undefined, add -DSVID (there are other changes, but
- X# this is a useful symptom...)
- X#
- X# -DMALLOCTRACE makes malloc.c produce masses of output...
- X#
- X# -DCURSESX, if present, says that we have the System V.3.1 or later
- X# curses that has A_STANDOUT and in which box(win, 0, 0) draws a neat box
- X# with vt100 characters. If you're not sure, if the string ACS appears
- X# in /usr/include/curses.h you should probably use -DCURSESX.
- X#
- X# DEFS= -DASCIITRACE -UBSD -DSYSV -DMALLOCTRACE -DCURSESX ### for BIG testing
- XDEFS= -UASCIITRACE -DBSD -USYSV ### Try this on BSD-like Unix ...
- X# DEFS= -UASCIITRACE -UBSD -DSYSV -DCURSESX -DSVID ### ...and this on Sys V
- X
- X# Who owns the installed binaries?
- XOWNER=lee
- X# and what group are they in?
- XGROUP=other
- X# and where do they go?
- XBINDIR=/usr/local/bin
- XLIBDIR=/usr/local/lib/lqtext
- X# and the file mode for executables?
- XMODE=751
- X
- X# NewsFilter and MailFilter are programs which read news/mail files and
- X# turn unwanted words (e.g. Received-By lines inside mail headers) into
- X# "qxxxxx", with the right number of x's so that the total byte count is
- X# unchanged....
- X# Lqshow is the document browser.
- XMAILFILTER=$(LIBDIR)/MailFilter
- XNEWSFILTER=$(LIBDIR)/NewsFilter
- XLQSHOW=$(BINDIR)/lqshow
- XLQFILE=$(BINDIR)/lqfile
- X
- X# If you have -lmalloc, use it...
- XMALLOC=-lmalloc # faster version of malloc
- X# MALLOC= # BSD Unix doesn't have malloc(3X), only malloc(3)
- X
- X# Choose between ozmahash, ndbm, sdbm, gdbm or dbm -- if you only have dbm,
- X# you'll have some work to do -- see PORTING for gdbm or dbm.
- X# The necessary changes are in h/smalldb.h and h/Liamdbm.h if you need them.
- X# If you use ozmahash or sdbm you must use the fixed versions -- sdbm was
- X# posted to netnews in 1991, and ozmahash is included with this distribution.
- X# If you use ozmahash, copy ozmahash/*.h into h.
- X
- X# WHICHDBM=ndbm
- X# DBMLIBS=-lndbm
- X# MKDBM= # this is the target if we have to build ndbm...
- XWHICHDBM=ozmahash
- XDBMLIBS=../lib/libhash.a
- XMKDBM=mkozmahash # this is the target that builds ozmahash
- X
- X# On BSD systems you need -ltermcap as well as libcurses for "lqshow".
- X# TERMCAP=-lcursesX -ltermcap # ultrix -- cursesx
- XTERMCAP=-lcurses -ltermcap
- X# TERMCAP=-lcurses
- X
- X# on SYSV ranlib is usually "echo"
- X# RANLIB=echo
- XRANLIB=ranlib
- X
- X# Choose a C compiler -- GNU's gcc if you have it, or the standard cc.
- X# GNU cc won't compile lqaddfile.c on some machines, but I don't know why.
- X## for gcc:
- X#CC=gcc
- X#GCCF= -fwritable-strings -Wall -I/usr/include
- X## for anything else:
- XCC=cc
- XGCCF=
- X##
- X# Use -O or -O -g for the optimiser. Or -ql or -p for profiling (sysV)
- X# -O3 (or -O4 if you are feeling brave) is for SunOS
- X# With gcc or recent System V compilers, you can use OPT=-O -g
- X# NOTE to profilers: do not mix -g with -p -- this is often broken!
- XOPT=-O
- X
- XCFLAGS= $(OPT) $(DEFS) $(GCCF) -D$(WHICHDBM) $$(EXTRA)
- X
- X# Lint flags vary wildly between systems.
- X# LINTFLAGS=-xv
- XLINTFLAGS=-a -b -c -h -x
- X
- X
- X### End of configuration section. See also PORTING in this directory.
- X
- XTARGETS=mklib mkbin libs mkfilters mktest mkmenu
- XDIRS=mkfilters mkliblqtext mklqtext mktest mkmenu
- XMKTARGETS=$(MKDBM) $(DIRS)
- X
- X# Make all does a local install (in src/bin src/lib src/testbin) too...
- Xall: local
- X
- X.SUFFIXES: .c .o .src .obj
- X
- X.c.src:
- X #load $(CFLAGS) $<
- X
- X.o.obj:
- X #load $(CFLAGS) $<
- X
- Xsaber_src:
- X $(MAKE) -$(MAKEFLAGS) MAKEWHAT=saber_src $(MKTARGETS)
- X
- Xsaber_obj:
- X $(MAKE) -$(MAKEFLAGS) MAKEWHAT=saber_obj $(MKTARGETS)
- X
- Xlqaddfile.src:
- X #cd lqtext
- X $(MAKE) -$(MAKEFLAGS) MALLOC='$(MALLOC)' CFLAGS='$(CFLAGS)' CC='$(CC)' WHICHDBM='$(WHICHDBM)' TERMCAP='$(TERMCAP)' RANLIB='$(RANLIB)' DBMLIBS="${DBMLIBS}" LINTFLAGS='$(LINTFLAGS)' lqaddfile.src
- X #cd ..
- X
- Xtidy:
- X $(MAKE) -i$(MAKEFLAGS) MAKEWHAT=tidy $(MKTARGETS)
- X
- Xclean:
- X $(MAKE) -$(MAKEFLAGS) MAKEWHAT=clean $(MKTARGETS)
- X rm -f lib/* bin/* testbin/* core *.o m.log
- X
- Xdepend:
- X $(MAKE) -$(MAKEFLAGS) MAKEWHAT=depend $(MKTARGETS)
- X
- Xlocal: mklib mkbin libs
- X $(MAKE) -$(MAKEFLAGS) MAKEWHAT=install $(MKTARGETS)
- X
- Xinstall: libs
- X $(MAKE) -$(MAKEFLAGS) MAKEWHAT=install $(MKTARGETS)
- X ( cd bin ; for i in *; do \
- X test -f ${BINDIR}/$$i && /bin/mv ${BINDIR}/$$i ${BINDIR}/$$i.old; \
- X /bin/cp $$i ${BINDIR}; \
- X chmod 711 ${BINDIR}/$$i; chgrp ${GROUP} ${BINDIR}/$$i; \
- X chown ${OWNER} ${BINDIR}/$$i;\
- X done; \
- X )
- X ( cd lib ; for i in `ls | grep -v '\.a$'`; do \
- X test -f ${LIBDIR}/$$i && /bin/mv ${LIBDIR}/$$i ${LIBDIR}/$$i.old; \
- X /bin/cp $$i ${LIBDIR}; \
- X chmod 711 ${LIBDIR}/$$i; chgrp ${GROUP} ${LIBDIR}/$$i; \
- X chown ${OWNER} ${LIBDIR}/$$i; \
- X done; \
- X )
- X @-echo Binary Installation complete
- X @-echo Now install manual pages from ../doc if appropriate.
- X
- Xlint:
- X $(MAKE) -$(MAKEFLAGS) MAKEWHAT=lint $(MKTARGETS)
- X
- Xlibs:
- X -/bin/test -d ${LIBDIR} || mkdir ${LIBDIR}
- X $(MAKE) -$(MAKEFLAGS) MAKEWHAT=install $(MKDBM) mkliblqtext
- X -@echo libraries up to date
- X
- X# Note to mklib and mkbin:
- X# If the mkdir -p bombs out, there is a shell-script mkdir you can use
- X# in the utils directory. The -p means to create parent directories as needed.
- X
- Xmklib: # see note above about mkdir
- X -@test -d lib || mkdir lib
- X -@test -d $(LIBDIR) || mkdir -p $(LIBDIR)
- X
- Xmkbin: # see note above about mkdir
- X -@test -d bin || mkdir bin
- X -@test -d $(BINDIR) || mkdir -p $(BINDIR)
- X
- Xmkfilters:
- X cd filters; \
- X $(MAKE) -$(MAKEFLAGS) MALLOC='$(MALLOC)' \
- X CFLAGS='$(CFLAGS) -DMAILFILTER=\"$(MAILFILTER)\" -DNEWSFILTER=\"$(NEWSFILTER)\" ' \
- X CC='$(CC)' WHICHDBM='$(WHICHDBM)' TERMCAP='$(TERMCAP)' \
- X RANLIB='$(RANLIB)' \
- X DBMLIBS="${DBMLIBS}" LINTFLAGS='$(LINTFLAGS)' $(MAKEWHAT)
- X
- Xmkliblqtext:
- X cd liblqtext; \
- X $(MAKE) -$(MAKEFLAGS) MALLOC='$(MALLOC)' \
- X CFLAGS='$(CFLAGS) -DMAILFILTER=\"$(MAILFILTER)\" -DNEWSFILTER=\"$(NEWSFILTER)\" ' \
- X CC='$(CC)' WHICHDBM='$(WHICHDBM)' TERMCAP='$(TERMCAP)' \
- X RANLIB='$(RANLIB)' \
- X DBMLIBS="${DBMLIBS}" LINTFLAGS='$(LINTFLAGS)' $(MAKEWHAT)
- X
- Xmklqtext:
- X cd lqtext; \
- X $(MAKE) -$(MAKEFLAGS) MALLOC='$(MALLOC)' \
- X CFLAGS='$(CFLAGS) -DMAILFILTER=\"$(MAILFILTER)\" -DNEWSFILTER=\"$(NEWSFILTER)\" ' \
- X CC='$(CC)' WHICHDBM='$(WHICHDBM)' TERMCAP='$(TERMCAP)' \
- X RANLIB='$(RANLIB)' \
- X DBMLIBS="${DBMLIBS}" LINTFLAGS='$(LINTFLAGS)' $(MAKEWHAT)
- X
- Xmksdbm:
- X cd sdbm; \
- X $(MAKE) -$(MAKEFLAGS) MALLOC='$(MALLOC)' CFLAGS='$(CFLAGS)' \
- X CC='$(CC)' WHICHDBM='$(WHICHDBM)' TERMCAP='$(TERMCAP)' \
- X RANLIB='$(RANLIB)' \
- X DBMLIBS="${DBMLIBS}" LINTFLAGS='$(LINTFLAGS)' $(MAKEWHAT)
- X
- Xmkozmahash:
- X cd ozmahash; \
- X $(MAKE) -$(MAKEFLAGS) MALLOC='$(MALLOC)' CFLAGS='-I. $(CFLAGS)' \
- X CC='$(CC)' WHICHDBM='$(WHICHDBM)' TERMCAP='$(TERMCAP)' \
- X RANLIB='$(RANLIB)' \
- X DBMLIBS="${DBMLIBS}" LINTFLAGS='$(LINTFLAGS)' $(MAKEWHAT)
- X
- Xmktest:
- X cd test; \
- X $(MAKE) -$(MAKEFLAGS) MALLOC='$(MALLOC)' CFLAGS='$(CFLAGS)' \
- X CC='$(CC)' WHICHDBM='$(WHICHDBM)' TERMCAP='$(TERMCAP)' \
- X OWNER='$(OWNER)' RANLIB='$(RANLIB)' \
- X DBMLIBS="${DBMLIBS}" LINTFLAGS='$(LINTFLAGS)' $(MAKEWHAT)
- X
- Xmkmenu:
- X cd menu; \
- X $(MAKE) -$(MAKEFLAGS) MALLOC='$(MALLOC)' RANLIB='$(RANLIB)' \
- X CFLAGS='$(CFLAGS) -DLQSHOW=\"$(LQSHOW)\" -DLQFILE=\"$(LQFILE)\" ' \
- X CC='$(CC)' WHICHDBM='$(WHICHDBM)' TERMCAP='$(TERMCAP)' \
- X DBMLIBS="${DBMLIBS}" LINTFLAGS='$(LINTFLAGS)' $(MAKEWHAT)
- X
- X
- X#
- X# $Log: Makefile,v $
- X# Revision 1.13 91/03/02 20:23:02 lee
- X# Improved install entry.
- X#
- X# Revision 1.12 91/03/02 19:34:13 lee
- X# More comments and changed some defaults.
- X#
- X# Revision 1.11 91/03/02 19:16:27 lee
- X# Now makes ozmahash if necessary, and uses SHELL=/bin/sh.
- X#
- X# Revision 1.10 91/02/20 19:33:37 lee
- X# Removed duplicate definitions of LIB/LIBDIR, BIN/BINDIR and OWNER;
- X# OWNER now passed down in mktest correctly.
- X#
- X# Revision 1.9 90/10/05 23:40:41 lee
- X# More comments for easier configuration -- and added more examples.
- X#
- X# Revision 1.8 90/10/03 21:39:41 lee
- X# added CURSESX and more comments.
- X# ;.
- X#
- X# Revision 1.7 90/10/03 21:14:08 lee
- X# Now passes MAILFILTER and NEWSFILTER.
- X#
- X# Revision 1.6 90/10/01 18:28:43 lee
- X# Added MAILFILTER and NEWSFILTER to mkfilter
- X#
- X# Revision 1.5 90/09/28 21:52:14 lee
- X# Now does installation itself...
- X#
- X# Revision 1.4 90/09/10 13:25:31 lee
- X# Added some saber-C hooks.
- X#
- X# Revision 1.3 90/07/27 17:50:51 lee
- X# alpha test version shipped
- X#
- X# Revision 1.2 90/03/23 18:57:13 lee
- X# Added entries for lint and depend.
- X#
- X# Revision 1.1 90/03/23 15:11:30 lee
- X# Initial revision
- X#
- X#
- @@@End of lq-text/src/Makefile
- echo x - lq-text/src/PORTING 1>&2
- sed 's/^X//' >lq-text/src/PORTING <<'@@@End of lq-text/src/PORTING'
- XNotes for porting lq-text.
- X
- XThis is the free version. It is not public domain, but you can use it
- Xfreely for non-commercial purposes.
- XIf you want to sell it, or something derived from it, you should get in
- Xtouch with the author (me!), who will almost always give permission.
- X
- XYou can contact me as follows:
- X lee@sq.com,
- X Liam Quin, SoftQuad Inc. 720 Spadina Ave., Toronto, ONT., Canada
- X (+1) 416 963-8337
- X
- X==============================
- X
- XPORTING NOTES
- X
- XWell, I haven't done much porting.
- XCurrently the stuff works on
- X* System V Release 3.2 (Interactive's 386/ix 2.0.2), 80386
- X* System V Release 2 (Honeywell Bull XPX 100 X20), 68020
- X* SunOS 3 and 4 (Sun 3/60 and 3/75), 68020
- X
- XSince the 68020 and 80386 differ radically, most of the work has probably
- Xbeen done. Don't even *think* about non-Unix systems, though.
- X
- XLikely problems:
- X* the calls to lockf() in FileList.c and WordInfo.c
- X You could comment them out on a single-user system.
- X On BSD you could use flock() instead. It should be mandatory locking.
- X Actually, since the individual word entries are not locked, you could
- X simply delete the locking code.
- X
- X* if you have multiple machines sharing the same database, and they do not
- X all use the same byte-ordering, you will need to do some hacking.
- X The headers in pblock.{c,h}, FileList.c and WordInfo.c will all need
- X changing. They have to read and write fixed-length unsigned longs,
- X and very quickly too!
- X Alternatively, use sReadNumber() and sWriteNumber(), and always allow
- X four bytes.
- X
- X* you need ndbm. If you don't have it, you can use sdbm.
- X Look at Liamdb.h and smalldb.h, and compile without -DNDBM.
- X If you are on Xenix, you can buy 386/ix from your nearest Interactive
- X dealer... Or use dbm().
- X If you are on 386/ix, well, as 386/ix doesn't include dbm,
- X use sdbm or gdbm.
- X
- X* On mixed model architectures, make sure that you can have arrays larger
- X than 64KBytes (if possible), that a pointer (char *) fits into an unsigned
- X long, and that you have a good supply of coffee.
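
The byte-ordering item above asks for fixed-length values that read the same on every machine sharing the database. A minimal sketch of one way to do that is below. The names put_u32_be() and get_u32_be() are hypothetical and are not part of lq-text, and the choice of most-significant-byte-first order is an assumption; any order works as long as every machine uses the same one.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical helpers, not part of lq-text.
 * Store the low 32 bits of value in exactly four bytes,
 * most significant byte first, whatever the host byte order is.
 */
void put_u32_be(unsigned char *buf, unsigned long value)
{
    assert(buf != NULL);
    buf[0] = (unsigned char) ((value >> 24) & 0xFF);
    buf[1] = (unsigned char) ((value >> 16) & 0xFF);
    buf[2] = (unsigned char) ((value >> 8) & 0xFF);
    buf[3] = (unsigned char) (value & 0xFF);
}

/* Rebuild the value from the four bytes written by put_u32_be(). */
unsigned long get_u32_be(const unsigned char *buf)
{
    assert(buf != NULL);
    return ((unsigned long) buf[0] << 24)
         | ((unsigned long) buf[1] << 16)
         | ((unsigned long) buf[2] << 8)
         |  (unsigned long) buf[3];
}
```

Because the bytes are assembled with shifts rather than by copying the in-memory representation, the same file can be read on a 68020 and an 80386. The cost is a few shifts per value, which matters given the note above that these reads must be very quick.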
- X
- XBy all means mail me with questions, providing that you have at least tried
- Xto get somewhere yourself.
- X
- XLee
- Xutzoo!sq!lee
- Xlee@sq.com
- @@@End of lq-text/src/PORTING
- echo x - lq-text/src/TODO 1>&2
- sed 's/^X//' >lq-text/src/TODO <<'@@@End of lq-text/src/TODO'
- XKEY:
- X* -- easy change
- X** - harder, needs more understanding
- X@ -- needs understanding of internals
- X@@ - mail me if you need this!
- X
- X* give lqshow the ability to page a file
- X
- X** make a list of matches showing words nearby in a KWIC style
- X so that lqtext can show (say) a dozen at a time...
- X (see lqkwik... this will appear fairly soon, I expect)
- X
- X@@ special treatment of dates
- X
- X** table of pagers for browsing by file/type
- X
- X**@@ Better ranking of queries
- X
- X**@@ write a manual :-(
- X
- X**@ "this" can't be accessed by lqword, but can be by lqshow[???]. The
- X entire plural code (Root.c) needs a rethink.
- X I have started Plurals.c, but it's not ready yet. Yell if you have any
- X ideas, I need them!
- X
- X* The various Filter routines should be incorporated into showfile.
- X* Automatic uncompression should be added. Should look at the magic number
- X as well as the file extension
- X ** -- lqshow would need big changes in one (obvious) routine... as it
- X currently uses lseek...
- X
- X** Showfile should be made a routine (BrowseList() I suppose) that takes
- X** a list of Phrases with their matches...
- X
- X**@ should abandon dbm for the list of filenames. A better approach would
- X be to store path components as words in the database! This would make
- X / a common-word, though. Needs some thought.
- X A btree might be a good compromise. For now, at least ozmahash doesn't
- X have overflow problems.
- X
- X*@ should use six-bit encoding for strings. This would save a lot of space
- X with relatively little overhead.
- X Actually it doesn't save space at all, based on my experiments. Sigh.
- X
- X**@@ the ability to delete a file. Two ways:
- X 1) read the file -- could be an option to addfile, in fact. This
- X would have to check the time-stamps, of course.
- X 2) schedule (perhaps overnight) a process to go through the entire
- X database and delete old files.
- X Also, addfile (and SortWordPlaces()) could remove deleted FIDs
- X automatically, which would help.
- X
- X*@ the algorithm to add a new entry to a WID is too slow, because of the
- X requirement that the list be kept sorted. I should instead keep a
- X SortedToHere counter in the header, and simply append new words.
- X The next time someone does a getpblock() and a sort, it could be written
- X back sorted. Or there could be a daemon sorter!!
- X
- X* need better documentation!
- X
- X* README should be used more, allowing more configuration.
- X (README can now be called something else by recompiling)
- X
- X** allow dynamic definition of word start/mid/end, in README.
- X Must be at least as fast as isupper() etc.
- X Perhaps per-file-type rules, though? Makes Phrase Matching hard.
- X
- X** Better file locking
- X (no file locking or signal handling at all at the moment -- I ripped it
- X all out when I discovered that it was broken on many systems, and
- X this gave a false sense of security.)
- X
- X* Finish the ReadAhead daemon. See if it makes an improvement.
- X The idea is that whenever ReadBlock reads a block, it should ask the
- X daemon to read the next block, thus ensuring that it is in the buffer
- X cache.
- X It might be better simply to give it the WID and have it read the entire
- X chain itself.
- X
- X* Phrase Matching would be orders of magnitude faster if it did not involve
- X reading the tables of matches until they are needed, as many of them
- X won't be! It should extend the lists of matches for each word in the
- X phrase only as necessary.
- X
- X** save large FIDs for large files.
- X
- X** Use a better number scheme (numbers.c)
- X make some more of the number routines inline -- especially use
- X a #define for the common case of sReadNumber and sWriteNumber, eliminating
- X millions (!) of function calls when the numbers are only 1 byte long.
- X Note that most numbers turn out to fit in one byte (more than 90%) at
- X the moment, so if I had another bit I could improve the flags stuff..
- X The current scheme sets the top bit in each byte if there is more to
- X follow. Another alternative would be to use magic values (e.g.
- X 256 - (number of bytes)) for the first byte if there's more than one
- X byte. Hence 255 255 would be used to store 255. This would roughly
- X double the number of numbers (!) fitting into one byte... hmm...
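
The "top bit set if there is more to follow" scheme described in the last item can be sketched as below. This is an illustration only, not the code in numbers.c: the names write_number() and read_number() are hypothetical, and storing the least significant seven bits first is an assumption, since the notes above do not say which order lq-text uses.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical sketch, not the lq-text implementation.
 * Write value into buf seven bits at a time, low bits first.
 * The top bit of each byte is set when another byte follows.
 * Returns the number of bytes written; buf must have room for
 * ten bytes if unsigned long is 64 bits wide.
 */
size_t write_number(unsigned char *buf, unsigned long value)
{
    size_t n = 0;

    assert(buf != NULL);
    do {
        unsigned char byte = (unsigned char) (value & 0x7F);

        value >>= 7;
        if (value != 0) {
            byte |= 0x80; /* more to follow */
        }
        buf[n++] = byte;
    } while (value != 0);
    return n;
}

/* Read a number written by write_number(); returns the bytes consumed.
 * The input is assumed to be well formed (no overlong sequences).
 */
size_t read_number(const unsigned char *buf, unsigned long *value)
{
    size_t n = 0;
    unsigned int shift = 0;
    unsigned long result = 0;

    assert(buf != NULL && value != NULL);
    for (;;) {
        unsigned char byte = buf[n++];

        result |= (unsigned long) (byte & 0x7F) << shift;
        if ((byte & 0x80) == 0) {
            break; /* top bit clear: last byte */
        }
        shift += 7;
    }
    *value = result;
    return n;
}
```

With this layout every value from 0 to 127 takes one byte, which is the common case the item wants to handle inline with a #define.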
- X
- XKnown Bugs
- X==========
- X* lqshow only marks the first word of the phrase.
- X* lqshow does not know about file types!!!
- X* there is no troff (or sqtroff) file type
- X* the C filter got lost in history (sigh)
- X* I cannot distribute the CDMS and Uniplex filters
- @@@End of lq-text/src/TODO
- echo x - lq-text/src/filters/FilterMain.c 1>&2
- sed 's/^X//' >lq-text/src/filters/FilterMain.c <<'@@@End of lq-text/src/filters/FilterMain.c'
- X/* FilterMain.c -- Copyright 1989, 1990 Liam R. Quin. All Rights Reserved.
- X * This code is NOT in the public domain.
- X * See the file COPYRIGHT for full details.
- X */
- X
- X/* $Id: FilterMain.c,v 1.3 90/10/06 00:57:16 lee Rel1-10 $
- X */
- X
- X/* FilterMain is intended to make writing filters easier; one
- X * simply writes the Filter() routine and links with FilterMain.o to
- X * produce a new filter.
- X *
- X * The filter should use "wordrules.h", and should transform its input
- X * into words, spaces and newlines, with all other characters turned into
- X * spaces.
- X *
- X * A simple filter might be something like (on System V):
- X *
- X * system("tr -c '[a-z][A-Z][0-9]_' '[ *]'");
- X *
- X * except that a word shouldn't start with a digit or _.
- X *
- X * Addfile itself maps upper case to lower, and may also check on the length
- X * of words (min is currently 3, max 20, for example).
- X *
- X * A News or Mail filter might delete things from the header (turning them
- X * into spaces to preserve file offsets), so that the index doesn't fill
- X * up with ihnp4!decwrl!seismo!utzoo!henry everywhere. Of course, it
- X * would retain the utzoo!henry at the end of the From line.
- X *
- X * A filter for the Crystal Word Processor might turn accented characters
- X * into their ASCII non-accented equivalents, (although NX-Text is 8-bit
- X * transparent, so one could also decide to use an 8-bit character set),
- X * and remove style information, non-local object banners, etc.
- X *
- X * This file must fork "compress -d" if appropriate, to read compressed files.
- X * Note -- compress -d is the same as uncompress, but more likely to work.
- X * Some sites also have zcat, but this is even rarer.
- X *
- X */
- X
- X/** Unix system calls used in this file: **/
- Xextern void exit();
- X/** C Library functions used in this file: **/
- Xextern void perror();
- X
- X#include <stdio.h>
- X
- Xchar *progname;
- Xvoid Filter();
- X
- Xextern int AsciiTrace;
- X
- Xint
- Xmain(ac, av)
- X int ac;
- X char *av[];
- X{
- X progname = av[0];
- X
- X if (ac ==1) {
- X Filter(stdin, "(standard input)");
- X } else {
- X while (--ac) {
- X FILE *f = fopen(*++av, "r");
- X
- X if (f == (FILE *) 0) {
- X fprintf(stderr, "%s: can't open ", progname);
- X perror(*av);
- X exit(1);
- X }
- X
- X Filter(f, *av);
- X
- X (void) fclose(f);
- X }
- X }
- X return 0;
- X}
- X
- X/*
- X * $Log: FilterMain.c,v $
- X * Revision 1.3 90/10/06 00:57:16 lee
- X * Prepared for first beta release.
- X *
- X * Revision 1.2 90/09/20 18:32:39 lee
- X * Removed extra variable declarations...
- X *
- X * Revision 1.1 90/08/09 19:17:54 lee
- X * Initial revision
- X *
- X * Revision 1.2 89/09/16 21:15:58 lee
- X * First demonstratable version.
- X *
- X * Revision 1.1 89/09/07 21:01:54 lee
- X * Initial revision
- X *
- X */
- @@@End of lq-text/src/filters/FilterMain.c
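
The header comment in FilterMain.c says a new filter only needs a Filter() routine linked against FilterMain.o. A minimal sketch of such a routine follows. It does not use wordrules.h, which is not shown in this part of the archive; instead it hard-codes the rule stated in the comment (a word starts with a letter and continues with letters, digits or underscores), and that substitution is an assumption. The helper name filter_step() is hypothetical.

```c
#include <assert.h>
#include <ctype.h>
#include <stdio.h>

/* Hypothetical helper: decide what to emit for one input character.
 * *in_word records whether the previous character was part of a word.
 * Word characters are passed through, newlines are kept, and everything
 * else becomes a space so that byte offsets in the file are preserved.
 */
int filter_step(int ch, int *in_word)
{
    assert(in_word != NULL);
    if (*in_word) {
        *in_word = isalnum(ch) || ch == '_';
    } else {
        *in_word = isalpha(ch); /* a word may not start with a digit or _ */
    }
    if (*in_word) {
        return ch;
    }
    return (ch == '\n') ? '\n' : ' ';
}

/* The routine FilterMain.o calls once per input file. */
void Filter(FILE *input, char *name)
{
    int ch;
    int in_word = 0;

    (void) name; /* only needed for error messages */
    while ((ch = getc(input)) != EOF) {
        putchar(filter_step(ch, &in_word));
    }
}
```

Each input byte produces exactly one output byte, so offsets in the filtered stream still line up with the original file, as the comment requires for the mail and news filters.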
- echo x - lq-text/src/filters/MailFilter.c 1>&2
- sed 's/^X//' >lq-text/src/filters/MailFilter.c <<'@@@End of lq-text/src/filters/MailFilter.c'
- X/* MailFilter.c -- Copyright 1989 Liam R. Quin. All Rights Reserved.
- X * This code is NOT in the public domain.
- X * See the file COPYRIGHT for full details.
- X */
- X
- X/* $Id: MailFilter.c,v 1.5 90/10/06 00:57:24 lee Rel1-10 $
- X */
- X
- X/* Filter for mail articles.
- X * Throw away all of the header except
- X * Subject
- X * From
- X * Date
- X * Cc:
- X * Organi[sz]ation
- X * To:
- X *
- X * See FilterMain and wordrules.h for more info.
- X *
- X */
- X
- X#ifdef SYSV
- X extern int _filbuf(), _flsbuf();
- X#endif
- X#include <stdio.h>
- X#include <malloc.h>
- X#include <ctype.h>
- X#include "wordrules.h"
- X
- X#include "emalloc.h"
- X
- X#define STREQ(boy, girl) ((*(boy) == *(girl)) && !strcmp(boy, girl))
- X
- Xextern char *progname;
- X
- X/** Unix system calls used in this file **/
- X /* (none) */
- X/** Unix Library Functions used in this file: **/
- X#ifndef tolower
- X extern int tolower();
- X#endif
- Xextern int strcmp();
- X
- X/** Functions within this file used before they're defined: **/
- Xvoid Header(), Body();
- Xint GetChar();
- X
- X/** **/
- X
- Xvoid Filter();
- X
- Xchar *KeepThese[] = { /* keep this list in lower case, sorted! */
- X "cc",
- X "date",
- X "from",
- X "organisation",
- X "organization",
- X "subject",
- X "to",
- X 0
- X};
- X
- Xint icstreq(s1, s2) /* case-insensitive equality: 1 if equal, 0 if not */
- X char *s1, *s2;
- X{
- X register char ch1, ch2;
- X
- X while (*s1 && *s2) {
- X if (*s1 != *s2) {
- X if (isupper(*s1)) {
- X ch1 = tolower(*s1);
- X ch2 = (*s2);
- X } else if (isupper(*s2)) {
- X /* Note that we only have to test one character for case! */
- X ch1 = (*s1);
- X ch2 = tolower(*s2);
- X } else {
- X return 0; /* not the same */
- X }
- X if (ch1 != ch2) return 0; /* the strings differ */
- X }
- X s1++; s2++;
- X }
- X if (!*s1 && !*s2) {
- X return 1;
- X }
- X return 0; /* they are different */
- X}
- X
- Xint
- XIsWanted(String)
- X char *String;
- X{
- X char **pp;
- X int ch = String[0];
- X
- X if (isupper(ch)) ch = tolower(ch);
- X
- X for (pp = KeepThese; *pp && **pp; pp++) {
- X if (**pp > ch) break; /* gone too far */
- X else if (icstreq(String, *pp)) return 1;
- X }
- X return 0;
- X}
- X
- Xvoid
- XFilter(InputFile, Name)
- X FILE *InputFile;
- X char *Name;
- X{
- X Header(InputFile, Name);
- X Body(InputFile, Name);
- X}
- X
- Xtypedef enum {
- X F_NotSeenAnythingYet,
- X F_InTheFirstWord,
- X F_AfterTheFirstWord
- X} t_FirstWord;
- X
- Xint InWord = 0;
- X
- X/* For a mail article, the Header ends at the first line which is not
- X * a valid mail header -- i.e., is not indented and doesn't start with
- X * a capitalised word followed by a single space (uucp) or colon (RFC822).
- X * A blank line also ends the header.
- X */
- Xvoid
- XHeader(InputFile, Name)
- X FILE *InputFile;
- X char *Name;
- X{
- X int AtStartOfLine = 1;
- X int IgnoreLine = 0; /* initialised for lint... */
- X t_FirstWord FirstWord = F_NotSeenAnythingYet;
- X int ch;
- X static int BufLen;
- X static char *Buffer = 0;
- X int AtStartOfWord = 1; /* read on the first line before any newline sets it */
- X register char *q;
- X
- X if (Buffer == 0) {
- X BufLen = 24;
- X Buffer = emalloc(BufLen);
- X }
- X
- X q = Buffer;
- X InWord = 0;
- X
- X while ((ch = GetChar(InputFile)) != EOF) {
- X if (ch == '\n') {
- X if (AtStartOfLine) { /* a blank line */
- X putchar('\n');
- X return;
- X }
- X }
- X
- X InWord = InWord ? WithinWord(ch) : StartsWord(ch);
- X
- X switch (FirstWord) {
- X case F_NotSeenAnythingYet:
- X if (InWord) {
- X FirstWord = F_InTheFirstWord;
- X if (q - Buffer >= BufLen - 1) {
- X int where = q - Buffer;
- X
- X BufLen += 24;
- X Buffer = erealloc(Buffer, BufLen);
- X q = &Buffer[where];
- X }
- X *q++ = ch;
- X } else {
- X if (AtStartOfLine && ch != ' ' && ch != '\t') {
- X putchar(ch);
- X return;
- X }
- X putchar(' ');
- X }
- X break;
- X case F_InTheFirstWord:
- X if (InWord) {
- X if (q - Buffer >= BufLen - 1) {
- X int where = q - Buffer;
- X
- X BufLen += 24;
- X Buffer = erealloc(Buffer, BufLen);
- X q = &Buffer[where];
- X }
- X *q++ = ch;
- X break;
- X } else { /* reached the end of the first word on the line */
- X *q = '\0';
- X /* See if it's a keyword */
- X if ((IgnoreLine = !IsWanted(Buffer)) != 0) {
- X /* Turn the word into one that won't get indexed,
- X * so that word counts are unaffected:
- X * We use qxxxxxxx (any number of x's) for this.
- X */
- X for (q = Buffer; *q; q++) {
- X putchar((q == Buffer) ? 'q' : 'x');
- X }
- X putchar (ch == '\n' ? '\n' : ' ');
- X } else {
- X printf("%s%c", Buffer, ch == '\n' ? ch : ' ');
- X }
- X FirstWord = F_AfterTheFirstWord;
- X }
- X break;
- X default:
- X if ((AtStartOfLine = (ch == '\n'))) {
- X IgnoreLine = 0;
- X q = Buffer;
- X FirstWord = F_NotSeenAnythingYet;
- X AtStartOfWord = 1;
- X }
- X if (InWord && !IgnoreLine) {
- X putchar(ch);
- X } else {
- X if (AtStartOfWord && InWord) {
- X putchar('q');
- X AtStartOfWord = 0;
- X } else if (InWord) {
- X putchar('x');
- X } else if (isspace(ch)) {
- X putchar(ch);
- X } else {
- X putchar(' ');
- X }
- X }
- X if (!InWord) AtStartOfWord = 1;
- X }
- X if ((AtStartOfLine = (ch == '\n'))) {
- X IgnoreLine = 0;
- X q = Buffer;
- X FirstWord = F_NotSeenAnythingYet;
- X AtStartOfWord = 1;
- X }
- X }
- X if (ch == EOF) {
- X fprintf(stderr, "%s: warning: Mail folder %s has no message body\n",
- X progname, Name);
- X }
- X}
- X
- Xvoid
- XBody(InputFile, Name)
- X FILE *InputFile;
- X char *Name;
- X{
- X int ch;
- X
- X while ((ch = GetChar(InputFile)) != EOF) {
- X if ((InWord = InWord ? WithinWord(ch) : StartsWord(ch)) != 0) {
- X putchar(ch);
- X } else {
- X putchar((ch == '\n') ? '\n' : ' ');
- X }
- X }
- X}
- X
- X#ifdef __GNU__
- Xinline
- X#endif
- Xint
- XGetChar(fd)
- X FILE *fd;
- X{
- X static int LastChar = 0;
- X
- X if (LastChar) {
- X int ch = LastChar;
- X LastChar = 0;
- X return ch;
- X }
- X
- X /* Only return a single quote if it is surrounded by letters */
- X if ((LastChar = getc(fd)) == '\'') {
- X LastChar = getc(fd);
- X if (InWord && isalpha(LastChar)) return '\'';
- X else return ' ';
- X } else {
- X int ch = LastChar;
- X LastChar = 0;
- X return ch;
- X }
- X}
- X
- X/*
- X * $Log: MailFilter.c,v $
- X * Revision 1.5 90/10/06 00:57:24 lee
- X * Prepared for first beta release.
- X *
- X * Revision 1.4 90/09/20 16:35:40 lee
- X * Fixed icstrcmp() and IsWanted() so that the unwanted parts of headers
- X * get deleted again.... (oops!)
- X *
- X * Revision 1.3 90/09/19 21:11:54 lee
- X * Improved end-of-header detection.
- X * Now supports turning unindexed stuff into qxxxxx-words.
- X *
- X * Revision 1.2 90/08/29 21:55:57 lee
- X * Now handles mh mail better.
- X *
- X * Revision 1.1 90/08/09 19:17:56 lee
- X * Initial revision
- X *
- X * Revision 1.2 89/09/16 21:16:01 lee
- X * First demonstratable version.
- X *
- X * Revision 1.1 89/09/07 21:05:48 lee
- X * Initial revision
- X *
- X */
- @@@End of lq-text/src/filters/MailFilter.c
- echo end of part 01
- --
- Liam R. E. Quin, lee@sq.com, SoftQuad Inc., Toronto, +1 (416) 963-8337
-