- The JavaSearch toolkit @(#)README 1.7 95/03/20 -*- Text -*-
-
- David Brown, Sun Microsystems Inc., November 1994
-
-
- JavaSearch is a collection of classes used to CREATE and SEARCH
- inverted-index text databases. JavaSearch is used in a two-step
- process:
-
- (1) Use the "javaindex" program to build a JavaSearch database for a
- collection of documents.
-
- (2) Use the Database, Searcher, DocList and Doc classes in YOUR
- application to search a JavaSearch database.
-
- Here are the details on the two steps:
-
- ----- (1) Creating a JavaSearch database -----
-
- Use the javaindex program. Usage is:
-
- java javaindex -db database_name [-trimprefix path_to_trim]
- [-fileprefix path_prefix] [-urlprefix url_prefix]
- [-description "Description of this database"]
- filename filename ...
-
- Where:
-
- database_name is used to construct the 5 filenames which will
- be created by javaindex, and is the name you need to know when
- you later want to *search* the database. See Database.java
- for info on the 5 filenames.
-
- database_name can either be relative to the current directory,
- or an absolute path. For example, using the database name
-
- /foo/bar/databases/JAVASPEC
-
- will cause the files JAVASPEC.dbinfo, JAVASPEC.index,
- JAVASPEC.qindex, JAVASPEC.docs, and JAVASPEC.docindex to all be
- created in the directory "/foo/bar/databases".
-
- path_to_trim is a string that should be trimmed off the BEGINNING of
- every filename we index, before saving a Doc object for that
- filename in the Database we're creating. This is important
- because URLs are constructed by concatenating the DATABASE's URL
- prefix with the DOCUMENT's filename!
-
- So if you're indexing the files /foo/bar/baz/*.html, and these
- files happen to be accessible by URLs like
- "http://tachyon.eng/baz/*.html", you would TRIM /foo/bar/baz/
- and use a urlprefix of http://tachyon.eng/baz/.
-
- Here's a real-world example, for the Java spec:
-
- java javaindex -db JAVA_SPEC \
- -trimprefix /net/tachyon/export/disk1/Mosaic/docs/spec/ \
- -urlprefix http://tachyon.eng/spec/ \
- /net/tachyon/export/disk1/Mosaic/docs/spec/*.html
-
-
- path_prefix is the string that should be prepended to the "filename"
- of each individual Doc in this database, to construct a valid
- fully-qualified pathname. For example, you might index all the
- files in the current directory like this:
-
- java javaindex -db FOO -fileprefix /full/path/to/this/dir/ *
-
-
- url_prefix is the string that should be prepended to the "filename"
- of each individual Doc in this database, to construct a valid
- URL pointing to that document. See above for an example.
-
- url_prefix and path_prefix may be used together, although in
- general a Database will either be full of HTML files (in which
- case they are always going to be accessed as URLs) or full of
- plain text files (in which case path_prefix should be used,
- since the documents will be read as regular files).
-
- description is a human-readable description of this database.
-
- filename ... This is the list of documents to index. These
- filenames must be either absolute pathnames, or relative to the
- current directory, although remember the "path_to_trim" string
- is stripped off all filenames before they're stored in the
- Database.
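-
- As a concrete illustration of how the three prefixes fit together, here
- is a minimal Java sketch. The DocNames class and its trim() and locate()
- methods are hypothetical names made up for this example; they are not
- part of the JavaSearch sources, which may handle this differently.
-
```java
// Illustrative sketch only: DocNames is a hypothetical helper,
// not a class from the JavaSearch sources.
final class DocNames {
    private DocNames() {}

    /** Strip pathToTrim from the BEGINNING of filename, if it is there. */
    static String trim(String filename, String pathToTrim) {
        if (pathToTrim != null && filename.startsWith(pathToTrim)) {
            return filename.substring(pathToTrim.length());
        }
        return filename;
    }

    /** Prepend a database-wide prefix (URL or path) to a stored filename. */
    static String locate(String prefix, String storedFilename) {
        return (prefix == null ? "" : prefix) + storedFilename;
    }
}
```
-
- For a file such as /foo/bar/baz/index.html, trimming /foo/bar/baz/
- leaves "index.html" as the stored filename, and locating it with the
- urlprefix http://tachyon.eng/baz/ gives
- http://tachyon.eng/baz/index.html. The same locate() call with a
- path_prefix yields a full pathname instead.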
-
- Sorry if the path_to_trim/path_prefix/url_prefix stuff is confusing;
- note that WAIS does it the same way, though. Look for my notes on
- "WAIS's URL type" in /net/barchetta/opt/wais/README.
-
- Javaindex prints out a bunch of useful statistics when it finishes
- creating a database.
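-
- The five files that make up a database are named by appending fixed
- extensions to database_name, as described above. Here is a small sketch
- of that naming rule, which can be handy for checking that a database is
- complete before trying to open it. DatabaseFiles is a hypothetical
- helper written for this README, not a JavaSearch class; only the five
- extensions come from the text above.
-
```java
import java.io.File;
import java.util.ArrayList;
import java.util.List;

// Hypothetical helper, not part of JavaSearch.
final class DatabaseFiles {
    private DatabaseFiles() {}

    // The five extensions listed in this README.
    static final String[] EXTENSIONS = {
        ".dbinfo", ".index", ".qindex", ".docs", ".docindex"
    };

    /** Names of the five files for a relative or absolute database name. */
    static List<String> filesFor(String databaseName) {
        List<String> names = new ArrayList<>();
        for (String ext : EXTENSIONS) {
            names.add(databaseName + ext);
        }
        return names;
    }

    /** True only if all five files exist as regular files. */
    static boolean allExist(String databaseName) {
        for (String name : filesFor(databaseName)) {
            if (!new File(name).isFile()) {
                return false;
            }
        }
        return true;
    }
}
```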
-
- ----- (2) Searching a JavaSearch database from your program -----
-
- First of all, look at the program "javasearch.java": this is a very
- simple command-line interface to perform searches on a JavaSearch
- database. It demonstrates how to open a database, do a search, and
- look at the results.
-
- In a nutshell, you do the following to perform a search:
-
- Database db = Database.OpenDatabase(database_name);
- Searcher searcher = new Searcher(db);
- DocList resultList = searcher.doSearch(query_string);
-
- Now, use the DocList.getDocAt() method to look at the individual Doc
- objects in the result list. For any Doc, you can get the headline
- (doc.headline), or a URL for the Doc (db.docURLPrefix + doc.filename),
- or a full pathname for the Doc (db.docPathPrefix + doc.filename).
-
- A Query string is a simple boolean expression, like (for example)
- "method and matching". Boolean operators "and", "or" and "not" are
- allowed. Precedence is a trivial left-to-right evaluation;
- parentheses are not supported. See the big doc comment for the
- doSearch() method in Searcher.java for all the details.
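-
- To make "trivial left-to-right evaluation" concrete, here is a toy
- evaluator over an in-memory inverted index. It is only an illustration,
- not the code in Searcher.java. The LeftToRightQuery class is
- hypothetical, and it makes two assumptions that the real doSearch() may
- not share: "not" removes the documents matching the word that follows
- it, and two adjacent words with no operator between them are AND-ed.
-
```java
import java.util.Collections;
import java.util.HashMap;
import java.util.Map;
import java.util.Set;
import java.util.TreeSet;

// Hypothetical illustration; not the JavaSearch Searcher class.
final class LeftToRightQuery {
    /** word -> numbers of the documents containing it (toy inverted index). */
    private final Map<String, Set<Integer>> index = new HashMap<>();

    /** Index every word of text under the given document number. */
    void add(int docId, String text) {
        for (String word : text.toLowerCase().split("\\W+")) {
            if (!word.isEmpty()) {
                index.computeIfAbsent(word, k -> new TreeSet<>()).add(docId);
            }
        }
    }

    /** Evaluate the query strictly left to right; no precedence, no parentheses. */
    Set<Integer> search(String query) {
        Set<Integer> result = new TreeSet<>();
        String op = "or"; // the first word is OR-ed into the empty result
        for (String token : query.toLowerCase().trim().split("\\s+")) {
            if (token.equals("and") || token.equals("or") || token.equals("not")) {
                op = token;
                continue;
            }
            Set<Integer> docs = index.getOrDefault(token, Collections.emptySet());
            switch (op) {
                case "and": result.retainAll(docs); break; // intersection
                case "not": result.removeAll(docs); break; // assumed: set difference
                default:    result.addAll(docs);    break; // union
            }
            op = "and"; // assumed default when no operator is given
        }
        return result;
    }
}
```
-
- With this evaluator, "touch and interface not speech" is computed as
- ((touch and interface) not speech): each operator is applied to the
- running result as soon as the next word is read.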
-
- That's all there is. A typical "Searching app" or applet might give
- the user text entry fields to select a database name and a query
- string, then show the result headlines in a scrolling list, and then
- have HotJava open the URL of any Doc which the user clicks on.
-
- -----
-
- See the detailed comment at the end of Database.java for a description
- of the Files used by JavaSearch, and an overview of most of the
- classes.
-
- -----
-
- EXAMPLES:
-
- (1) Creating and searching a database of the Java language spec:
-
- cd livejava/src/share/contrib/JavaSearch (eventually the JavaSearch
- stuff should be in a package!)
-
- java -cs javaindex -db /tmp/JAVA_SPEC -trimprefix /net/tachyon/export/disk1/Mosaic/docs/spec/ -urlprefix http://tachyon.eng/spec/ /net/tachyon/export/disk1/Mosaic/docs/spec/*.html
-
- [This builds the database, in /tmp.]
-
- [Now search for "method and matching":]
-
- java -cs javasearch /tmp/JAVA_SPEC method and matching
-
- [Look at the results. Selecting a document to view isn't too
- useful here, since these documents have no valid filenames!
- They're designed to be accessed by their URLs.]
-
- (2) Creating a database of arbitrary text files, for example patent documents:
-
- java -cs javaindex -db /tmp/Patent-stuff -trimprefix /usr/green/doc/Patents-Original/ -fileprefix /usr/green/doc/Patents-Original/ /usr/green/doc/Patents-Original/*.txt
-
- [This takes a few minutes. (The indexer really needs the 'btree
- optimization' (see below)!)]
-
- [Now search:]
-
- java -cs javasearch /tmp/Patent-stuff geographic and navigation
-
- java -cs javasearch /tmp/Patent-stuff touch and interface not speech
-
- [Once you see the results, type a document number to
- javasearch's prompt, and javasearch will display that file.]
-
- -----
-
- RESTRICTIONS:
-
- Here's a list of things that many other text search/retrieval systems
- can do that JavaSearch can't. Some of these might be important to add
- at some point.
-
- - When indexing, each input file is treated as a document. Some other
- systems let you have multiple documents per file.
-
- - The searcher is missing numerous features found in other info
- retrieval systems, such as: relevance ranked (or "weighted")
- results; stopwords; synonyms, word stemming and searches like "foo*"
- for all words beginning with "foo"; literal searches (like "method
- overloading"); full parsing of a hierarchical boolean query (with
- parentheses for precedence grouping); word proximity searching
- ("method near matching"); and many others.
-
- Some of these features would require significant changes to
- JavaSearch's architecture, some would require slight changes in the
- index format, and others could be implemented by changing only the
- Searcher class.
-
- - The indexer currently knows only about plain text and HTML files.
- The code does have an *outline* for recognizing News articles --
- just grep for "NEWS" to find all the places you need to change to
- add a new doc type.
-
- - The indexer is wildly suboptimal in how it keeps the Word objects in
- memory while building an index -- it *should* use a btree, but
- instead keeps Words in an unsorted Vector! Sorry, I didn't get
- around to writing a btree utility class. But this only makes
- indexing slow; it has no effect on Searching performance. And
- still, the indexer only takes a couple of minutes for small
- databases like the Java documentation...
-
- - Javaindex should have a "recursively index directories" feature.
- This would work by having a command-line arg ("-R"?) that told
- javaindex that each "filename" argument was really a directory,
- and that it should index ALL files in that directory's hierarchy.
- This is the only reasonable way to index very large databases, such
- as a news spool filesystem (like the fp.* groups).
-
- - There are a whole bunch of other features which would be nice to have,
- but which I haven't had time to implement. Look for "REMIND" comments in
- the JavaSearch code to find lots of notes like this.
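-
- For the Word-storage restriction above, one low-effort alternative to
- writing a btree class is a sorted map from the standard library. The
- sketch below uses java.util.TreeMap, which is a red-black tree rather
- than a btree but gives the same O(log n) lookup per word. WordTable is a
- hypothetical class, and storing only document numbers per word is a
- simplification; the real Word objects may hold more.
-
```java
import java.util.ArrayList;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

// Hypothetical stand-in for the indexer's in-memory word storage.
final class WordTable {
    // Sorted by word, so lookups are O(log n) instead of a linear scan
    // through an unsorted Vector.
    private final TreeMap<String, List<Integer>> postings = new TreeMap<>();

    /** Record that word occurs in document docId. */
    void record(String word, int docId) {
        List<Integer> docs = postings.computeIfAbsent(word, k -> new ArrayList<>());
        // Documents are indexed one at a time, so a repeated occurrence
        // can only match the last entry in the list.
        if (docs.isEmpty() || docs.get(docs.size() - 1) != docId) {
            docs.add(docId);
        }
    }

    /** The words in sorted order, each with its document numbers. */
    Map<String, List<Integer>> sorted() {
        return postings;
    }
}
```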
-
-
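-
- The "recursively index directories" idea from the restrictions list can
- be prototyped outside javaindex by expanding a directory into a file
- list and passing that list as the "filename" arguments. A sketch using
- java.nio.file.Files.walk (Java 8 or later); RecursiveFileList is a
- hypothetical helper, and javaindex itself has no "-R" flag.
-
```java
import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.List;
import java.util.stream.Collectors;
import java.util.stream.Stream;

// Hypothetical helper; not part of JavaSearch.
final class RecursiveFileList {
    private RecursiveFileList() {}

    /** All regular files in root's hierarchy, as sorted path strings. */
    static List<String> regularFilesUnder(Path root) throws IOException {
        // Files.walk holds directory handles open, so close the stream.
        try (Stream<Path> paths = Files.walk(root)) {
            return paths.filter(Files::isRegularFile)
                        .map(Path::toString)
                        .sorted()
                        .collect(Collectors.toList());
        }
    }
}
```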