PC World Komputer 1995 November


Text File | 1995-08-11 | 8.8 KB | 215 lines

The JavaSearch toolkit

@(#)README 1.7 95/03/20    -*- Text -*-

David Brown, Sun Microsystems Inc., November 1994

JavaSearch is a collection of classes used to CREATE and SEARCH
inverted-index text databases.

JavaSearch is used in a two-step process:

  (1) Use the "javaindex" program to build a JavaSearch database for a
      collection of documents.

  (2) Use the Database, Searcher, DocList and Doc classes in YOUR
      application to search a JavaSearch database.

Here are the details on the two steps:

----- (1) Creating a JavaSearch database -----

Use the javaindex program. Usage is:

    java javaindex -db database_name
                   [-trimprefix path_to_trim]
                   [-fileprefix path_prefix]
                   [-urlprefix url_prefix]
                   [-description "Description of this database"]
                   filename filename ...

Where:

  database_name is used to construct the 5 filenames which will be
  created by javaindex, and is the name you need to know when you later
  want to *search* the database. See Database.java for info on the 5
  filenames. database_name can be either relative to the current
  directory or an absolute path. For example, using the database name
  /foo/bar/databases/JAVASPEC will cause the files JAVASPEC.dbinfo,
  JAVASPEC.index, JAVASPEC.qindex, JAVASPEC.docs, and JAVASPEC.docindex
  to all be created in the directory "/foo/bar/databases".

  path_to_trim is a string that is trimmed off the BEGINNING of every
  filename we index, before a Doc object for that filename is saved in
  the Database we're creating. This is important because URLs are
  constructed by concatenating the DATABASE's URL prefix with the
  DOCUMENT's filename! So if you're indexing the files
  /foo/bar/baz/*.html, and these files are accessible by URLs like
  "http://tachyon.eng/baz/*.html", you would TRIM /foo/bar/baz/ and use
  a urlprefix of http://tachyon.eng/baz/.
  Here's a real-world example, for the Java spec:

    java javaindex -db JAVA_SPEC \
        -trimprefix /net/tachyon/export/disk1/Mosaic/docs/spec/ \
        -urlprefix http://tachyon.eng/spec/ \
        /net/tachyon/export/disk1/Mosaic/docs/spec/*.html

  path_prefix is the string that is prepended to the "filename" of each
  individual Doc in this database, to construct a valid fully-qualified
  pathname. For example, you might index all the files in the current
  directory like this:

    java javaindex -db FOO -fileprefix /full/path/to/this/dir/ *

  url_prefix is the string that is prepended to the "filename" of each
  individual Doc in this database, to construct a valid URL pointing to
  that document. See above for an example.

  url_prefix and path_prefix may be used together, although in general a
  Database will be either full of HTML files (in which case they are
  always going to be accessed as URLs) or full of plain text files (in
  which case path_prefix should be used, since the documents will be
  read as regular files).

  description is a human-readable description of this database.

  filename ... is the list of documents to index. These filenames must
  be either absolute pathnames or relative to the current directory.
  Remember, though, that the "path_to_trim" string is stripped off all
  filenames before they're stored in the Database.

Sorry if the path_to_trim/path_prefix/url_prefix stuff is confusing;
note that WAIS does it the same way, though. Look for my notes on
"WAIS's URL type" in /net/barchetta/opt/wais/README.

Javaindex prints out a bunch of useful statistics when it finishes
creating a database.

----- (2) Searching a JavaSearch database from your program -----

First of all, look at the program "javasearch.java": this is a very
simple command-line interface for performing searches on a JavaSearch
database. It demonstrates how to open a database, do a search, and look
at the results.
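The prefix options from step (1) matter again when searching, because a
result's URL or pathname is rebuilt from the filename stored at index
time. The following is only a sketch of that string handling in modern
Java, not the toolkit's own code. The class PrefixSketch, its two
helpers, and the file name index.html are hypothetical; the prefixes are
the ones from the /foo/bar/baz/ example in step (1).

```java
public class PrefixSketch {
    /**
     * Hypothetical helper: strips trimPrefix from the BEGINNING of path,
     * which is what -trimprefix is described as doing at index time.
     * Paths that do not start with the prefix are returned unchanged
     * (an assumption; the README does not say what javaindex does then).
     */
    public static String storedName(String path, String trimPrefix) {
        if (trimPrefix != null && path.startsWith(trimPrefix)) {
            return path.substring(trimPrefix.length());
        }
        return path;
    }

    /**
     * Hypothetical helper: prepends a database-wide prefix (the value
     * given to -urlprefix or -fileprefix) to a stored filename.
     */
    public static String withPrefix(String prefix, String storedName) {
        return prefix + storedName;
    }

    public static void main(String[] args) {
        // index.html is an illustrative file name, not one from the README.
        String stored = storedName("/foo/bar/baz/index.html", "/foo/bar/baz/");
        System.out.println(stored);
        System.out.println(withPrefix("http://tachyon.eng/baz/", stored));
    }
}
```

Note that both prefixes end in a slash. Plain concatenation adds no
separator, so a prefix without the trailing slash would produce a broken
URL or pathname.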
In a nutshell, you do the following to perform a search:

    Database db = Database.OpenDatabase(database_name);
    Searcher searcher = new Searcher(db);
    DocList resultList = searcher.doSearch(query_string);

Now, use the DocList.getDocAt() method to look at the individual Doc
objects in the result list. For any Doc, you can get the headline
(doc.headline), a URL for the Doc (db.docURLPrefix + doc.filename), or
a full pathname for the Doc (db.docPathPrefix + doc.filename).

A query string is a simple boolean expression, for example "method and
matching". The boolean operators "and", "or" and "not" are allowed.
Precedence is a trivial left-to-right evaluation; parentheses are not
supported. See the big doc comment for the doSearch() method in
Searcher.java for all the details.

That's all there is. A typical "searching app" or applet might give the
user text entry fields to select a database name and a query string,
then show the result headlines in a scrolling list, and then have
HotJava open the URL of any Doc which the user clicks on.

-----

See the detailed comment at the end of Database.java for a description
of the files used by JavaSearch, and an overview of most of the classes.

----- EXAMPLES:

(1) Creating and searching a database of the Java language spec:

    cd livejava/src/share/contrib/JavaSearch
        (eventually the JavaSearch stuff should be in a package!)

    java -cs javaindex -db /tmp/JAVA_SPEC \
        -trimprefix /net/tachyon/export/disk1/Mosaic/docs/spec/ \
        -urlprefix http://tachyon.eng/spec/ \
        /net/tachyon/export/disk1/Mosaic/docs/spec/*.html

    [This builds the database, in /tmp.]

    [Now search for "method and matching":]

    java -cs javasearch /tmp/JAVA_SPEC method and matching

    [Look at the results. Selecting a document to view isn't too useful
    here, since these documents have no valid filenames! They're
    designed to be accessed by their URLs.]
(2) Creating a database of random text files, for example RFCs:

    java -cs javaindex -db /tmp/Patent-stuff \
        -trimprefix /usr/green/doc/Patents-Original/ \
        -fileprefix /usr/green/doc/Patents-Original/ \
        /usr/green/doc/Patents-Original/*.txt

    [This takes a few minutes. (The indexer really needs the 'btree
    optimization' (see below)!)]

    [Now search:]

    java -cs javasearch /tmp/Patent-stuff geographic and navigation
    java -cs javasearch /tmp/Patent-stuff touch and interface not speech

    [Once you see the results, type a document number at javasearch's
    prompt, and javasearch will display that file.]

----- RESTRICTIONS:

Here's a list of things that many other text search/retrieval systems
can do that JavaSearch can't. Some of these might be important to add at
some point.

- When indexing, each input file is treated as one document. Some other
  systems let you have multiple documents per file.

- The searcher is missing numerous features found in other info
  retrieval systems, such as: relevance ranked (or "weighted") results;
  stopwords; synonyms, word stemming and searches like "foo*" for all
  words beginning with "foo"; literal searches (like "method
  overloading"); full parsing of a hierarchical boolean query (with
  parentheses for precedence grouping); word proximity searching
  ("method near matching"); and many others. Some of these features
  would require significant changes to JavaSearch's architecture, some
  would require slight changes in the index format, and others could be
  implemented by changing only the Searcher class.

- The indexer currently knows only about plain text and HTML files. The
  code does have an *outline* for recognizing News articles -- just grep
  for "NEWS" to find all the places you need to change to add a new doc
  type.

- The indexer is wildly suboptimal in how it keeps the Word objects in
  memory while building an index -- it *should* use a btree, but instead
  keeps Words in an unsorted Vector! Sorry, I didn't get around to
  writing a btree utility class.
  But this only makes indexing slow; it has no effect on searching
  performance. And even so, the indexer takes only a couple of minutes
  for small databases like the Java documentation...

- Javaindex should have a "recursively index directories" feature. This
  would work by having a command-line arg ("-R"?) that tells javaindex
  that each "filename" argument is really a directory, and that it
  should index ALL files in that directory's hierarchy. This is the only
  reasonable way to index very large databases, for example a news spool
  filesystem (like the fp.* groups).

- There's a whole bunch of other features which would be nice to have,
  but which I haven't had time to implement. Look for "REMIND" comments
  in the JavaSearch code to find lots of notes like this.
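As an illustration of the "recursively index directories" idea above,
here is a minimal sketch of how the file list for such an option might
be gathered. It is not part of JavaSearch and uses the modern
java.nio.file API, which did not exist when this README was written. The
class name RecursiveFileList and the method collect() are hypothetical,
and the sketch only produces the list of filenames; handing them to the
indexer is left out.

```java
import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.Paths;
import java.util.ArrayList;
import java.util.List;
import java.util.stream.Stream;

public class RecursiveFileList {
    /**
     * Hypothetical helper: treats each argument as a directory and
     * returns every regular file in its hierarchy. Files are sorted by
     * pathname within each directory argument, and the directory
     * arguments are processed in the order given.
     */
    public static List<String> collect(List<Path> dirs) throws IOException {
        List<String> files = new ArrayList<>();
        for (Path dir : dirs) {
            // Files.walk visits the whole tree below dir; the stream
            // must be closed, hence the try-with-resources.
            try (Stream<Path> walk = Files.walk(dir)) {
                walk.filter(Files::isRegularFile)
                    .map(Path::toString)
                    .sorted()
                    .forEach(files::add);
            }
        }
        return files;
    }

    public static void main(String[] args) throws IOException {
        List<Path> dirs = new ArrayList<>();
        for (String arg : args) {
            dirs.add(Paths.get(arg));
        }
        // Print one filename per line, in the form javaindex would
        // otherwise receive on its command line.
        for (String file : collect(dirs)) {
            System.out.println(file);
        }
    }
}
```

Building the list inside the program, rather than relying on shell
wildcards, also sidesteps the command-line length limits that a very
large spool directory would run into.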