home *** CD-ROM | disk | FTP | other *** search
- From mipos3!omepd!uoregon!hp-pcd!hplabs!decwrl!labrea!aurora!amelia!ames!necntc!ncoast!allbery Tue Feb 9 21:18:08 PST 1988
- Article 291 of comp.sources.misc:
- Path: td2cad!mipos3!omepd!uoregon!hp-pcd!hplabs!decwrl!labrea!aurora!amelia!ames!necntc!ncoast!allbery
- From: science@nems.ARPA (Mark Zimmermann)
- Newsgroups: comp.sources.misc
- Subject: v02i041: qndxr - Inverted index builder for text files
- Message-ID: <7176@ncoast.UUCP>
- Date: 3 Feb 88 02:08:40 GMT
- Sender: allbery@ncoast.UUCP
- Lines: 1665
- Approved: allbery@ncoast.UUCP
- X-Archive: comp.sources.misc/8802/4
- Comp.sources.misc: Volume 2, Issue 41
- Submitted-By: "Mark Zimmerman" <science@nems.ARPA>
- Archive-Name: qndxr
-
- Comp.sources.misc: Volume 2, Issue 41
- Submitted-By: "Mark Zimmerman" <science@nems.ARPA>
- Archive-Name: qndxr
-
- [Yes, indeed, another un-shared program! Sigh. ++bsa]
-
- Following C code builds inverted index files for big text files ... seems
- to work on Sun, VAX, and Macintosh in my tests. Heavily commented in a
- helter-skelter style.... ^z
-
- /* revised main indexer program 'qndxr.c' by ^z -- 870820-...
- *
- * revised 870930-1008 to give the user more options and control of how
- * delimiters are defined and handled. Now can choose:
- * - the default: any non-alphanumeric characters are delimiters (my
- * original approach, and perhaps the simplest to understand and use);
- * - option "-e": keep punctuation characters ONLY when they are embedded
- * in between non-delimiter characters, otherwise turn them into
- * delimiters (approach invented by Bill Hole of MicroDynamics Ltd.;
- * very nice for displaying 'words' such as don't, non-ASCII, 3.14159,
- * 87/10/02, etc. without splitting them up and requiring proximity
- * searching);
- * - option "-a": keep ALL punctuation characters, whether embedded in
- * words or not (best for indexing FORTH programs and such, but does
- * make spurious 'distinct' words at ends of sentences, etc.);
- * - option "-s": keep all special (non-ASCII) characters (with values
- * in the range 128-255; in the Apple character set, these are the
- * symbols that carry umlauts, tilde, accents, etc., making this
- * option the best for foreign-language and symbol-heavy files);
- * default is to remove the special characters.
- *
- * Note that option "-s" can be combined with any of the other options;
- * options "-e" and "-a" are mutually exclusive. (Actually, "-a" in my
- * implementation overrides "-e".) The "-e" option does require
- * peeking ahead one character before deciding on the fate of
- * a punctuation symbol, but that isn't too hard to arrange....
- *
- * ---------------------------------------------------------------
- *
- * Synopsis of the qndxr.c program's approach to sorting an index:
- * - load a chunk of the input file into memory; during the loading,
- * filter the characters appropriately, turning lower-case
- * into capitals, changing delimiters into \0's, and noting down
- * the locations of the beginnings of keys (words) in a ptr array;
- * - do a quicksort on the ptr array, using the keys in the buffer
- * to sort on;
- * - write the resulting sorted list of pointers along with their keys
- * to disk as files, named x0k0 and x0p0, in the *.k and *.p formats
- * used in ndxr.15;
- * - repeat the above steps with another chunk of the input file, and
- * name the resulting files x0k1 and x0p1; repeat, etc., until the
- * input file is all sorted into chunks;
- * - begin to merge the sorted files in an N-way merge ... for the
- * specific case of N=2, for example, merge x0k0/x0p0 and
- * x0k1/x0p1 into x1k0/x1p0; merge x0k2/x0p2 and x0k3/x0p3 into
- * x1k1/x1p1; etc., deleting the 0th-generation files as they are
- * merged;
- * - merge x1k0/x1p0 and x1k1/x1p1 into x2k0/x2p0, etc., and continue
- * merging subsequent generations until everything has become a
- * single pair of index files, xnk0/xnp0, which is then renamed
- * to be the original document name with '.k' and '.p' endings.
- *
- * ---------------------------------------------------------------
- *
- * Comparison with the older radix sort approach:
- * The new quicksort/mergesort method gains a bit in speed (since so
- * much of the work is done in memory rather than streaming from disk
- * back and forth) and also saves on disk space requirements (since
- * the k and p files are significantly compressed relative to the raw
- * index files formerly used during index sorting). The old radix sort
- * tended to only achieve about 2 MB/hour indexing on my Mac Plus setup,
- * and it required 5-6 times the size of the original file in free
- * disk space during indexing. This new approach achieves >4 MB/hour
- * in tests on the same Mac Plus, and only requires about 1.9 times
- * the original file size in free space!
- *
- * The new approach also allows for greater key length -- try recompiling
- * with KEY_LENGTH of 28, for instance -- without such severe disk space
- * penalties as the old radix sort would have incurred, and without severe
- * time penalties. (Duplicate words in the chunks are only stored once
- * in the key file, though each must have an entry in the pointer file.)
- *
- * The only obvious disadvantage of the quicksort/mergesort approach is
- * that it is an O(NlogN) procedure rather than O(N), and thus gets slower
- * when files get very large. (Radix sorting is strictly linear in N,
- * in theory anyway.)
- *
- * ---------------------------------------------------------------
- *
- * For further details, contact the author:
- * Mark Zimmermann science (at) NEMS.ARPA
- * 9511 Gwyndale Dr. [75066,2044] on CompuServe
- * Silver Spring, MD 20910
- * 301-565-2166
- * ---------------------------------------------------------------
- */
-
- #include <stdio.h>
- #include <strings.h>
- #include <ctype.h>
-
- /* --------------header file--------------- */
-
- /* header file for project qndxr -- by ^z -- 870820-0913-...
- */
-
- /* tell what compiler we're using ... Lightspeed C by Think, if the
- * following line is #defined ... essentially, when LIGHTSPEED is here,
- * assume that we have only 16-bit int variables and that Macintosh
- * toolbox traps are available ... when LIGHTSPEED is not #defined,
- * assume that ints can hold more like 32 bits without error, so more
- * can be done using standard I/O routines from <stdio>
- */
- /* #define LIGHTSPEED */
-
- /* preprocessor 'function' to turn on/off debug printing of detailed info
- * during program runs ... when debugging, use a statement:
- * #define DEBUG(fmt, arg) printf (fmt, arg)
- * ... and when not debugging, just use: #define DEBUG(fmt, arg)
- * to effectively remove those print commands....
- */
- /* #define DEBUG(fmt, arg) printf (fmt, arg) */
- #define DEBUG(fmt, arg)
-
- /* sort on KEY_LENGTH characters ... make it 28 rather arbitrarily as an
- * experiment ... want it long enough to avoid truncation of legitimate
- * words, but short enough to save some space in the *.k files ... an
- * alternative value is 12, for compatibility with the older ndxr.c program
- * and brwsr.c, if they haven't been revised to allow for longer keys....
- */
- #define KEY_LENGTH 28
-
- /* define a structure for the index_keys file
- */
- typedef struct
- {
- char kkey[KEY_LENGTH];
- long ccount;
- } KEY_REC;
-
- /* define a structure for my simpleminded I/O buffers ...
- */
- typedef struct
- {
- char *zbufbase;
- char *zbufp;
- long zbufcounter;
- long zbufsize;
- FILE *zbuffilep;
- int zbufitemsize;
- } ZBUF;
-
- /* choose this size to be the default memory size used for total buffer
- * space ... for convenience on the Mac Plus, I have picked 512 kB, which
- * leaves me a bit of space to spare ....
- */
- #define DEFAULT_MEMSIZ 524288
-
- /* merge this many files at each stage of the merging operation for index
- * building ... 2 means a binary merge, etc. ... one needs to have at least
- * 5 + 2 * NMERGE I/O buffers around: for each of NMERGE files, there is
- * a *.k keys file and a *.p pointers file; plus there must be a single
- * output *.k and a single output *.p file; plus there is the need for stdin,
- * stdout, and stderr to be open as well. Thus, I have found that a 4-way
- * merge (NMERGE = 4) works pretty nicely....
- */
- #define NMERGE 4
-
- #ifndef TRUE
- #define TRUE 1
- #endif
-
- #ifndef FALSE
- #define FALSE 0
- #endif
-
- /* CMASK makes sure that a returned character isn't sign-extended
- */
- #ifndef CMASK
- #define CMASK 0xFF
- #endif
-
- /* put the prototypes for my functions here... assume that if LIGHTSPEED
- * is #define'd, then we want to use prototypes....
- */
- #ifdef LIGHTSPEED
-
- /* in qndxr unix main.c */
- void punt(void);
- void openfile(FILE *file, char *mode);
- void main(void);
-
- /* in qndxr.c */
- void _main(int argc, char *argv[]);
-
- /* in bufio.c */
- void create_zbuffer (int n, long bufsize, FILE *buffile, int bufitemsize);
- void free_zbuffer (int n);
- char *next_input_item (int n);
- void load_zbuffer (int n);
- char *next_output_item (int n);
- void flush_zbuffer (int n);
-
- /* in build_indices.c */
- int build_indices (void);
-
- /* in doc_buf.c */
- char *make_buf (long bufsiz);
- long load_doc_buffer (char *doc, long doc_bufsiz, char **ptr);
- int filtered_getc (void);
-
- /* in merge_files.c */
- void nway_merge_kpfiles (FILE *ink[], FILE *inp[], FILE *outk, FILE *outp,
- int n);
- void copy_ptr_recs (int inzbuf, long count, int outzbuf);
- void copy_key_rec (char *kkey, long ccount, int outzbuf);
- int merge_not_finished (KEY_REC *kr[], int n);
- int smallest_key (KEY_REC *kr[], int n);
-
- /* in merge_indices.c */
- int merge_indices (char *doc_filename);
-
- /* in file_utils.c */
- void write_sorted_files (char *doc, char **ptr, long nwords,
- int pass_number, long offset);
- int is_new_word (char *w0, char *w1);
- void write_new_key (char *p, char *kp);
-
- /* in merge_utils.c */
- FILE *open_inkfile (int generation_number, int file_number);
- FILE *open_inpfile (int generation_number, int file_number);
- void fix_oddball_file_name (int generation_number, int file_number);
- void fix_final_file_names (int generation_number, char *doc_filename);
- FILE *open_outkfile (int generation_number, int file_number);
- FILE *open_outpfile (int generation_number, int file_number);
- void remove_used_infiles (int generation_number, int file_number, int n);
- void close_infiles (FILE *ink[], FILE *inp[], int n);
-
- /* in misc_utils.c */
- long set_params (int argc, char *argv[]);
- long set_zbufsiz (long zb);
- void set_parser_options (char *str);
- void check_interrupt (void);
- long file_size (FILE *fp);
-
- /* in open_files.c */
- FILE *open_docfile (int argc, char *argv[]);
- FILE *open_kfile (int pass_number);
- FILE *open_pfile (int pass_number);
-
- /* in zqsort.c */
- void zqsort (char **first, char **last);
- int compare_ptrs (char **p1, char **p2);
- int zstrcmp (char *s1, char *s2);
-
- #endif
-
-
- /* ----------------------main program file---------------- */
-
-
- /* define a global variable to hold the chosen size of a fundamental
- * quantum of buffer space ... experiments with dynamically choosing
- * it at runtime seem to cause occasional problems on the Macintosh,
- * and have other risks with virtual memory systems, so default to
- * DEFAULT_MEMSIZ total buffer space but let the user override with
- * a command-line choice to suit ... see function set_zbufsiz() for
- * further discussion....
- */
- long zbufsiz;
-
- /* define a global variable to tell whether or not all punctuation chars
- * are kept...
- */
- int keep_all_punct = FALSE;
-
- /* define a global variable to tell whether or not only embedded punctuation
- * characters are preserved...
- */
- int keep_embedded_punct = FALSE;
-
- /* define a global variable to tell whether or not to keep special,
- * non-ASCII characters...
- */
- int keep_special_chars = FALSE;
-
- /* define a global variable to hold the (FILE *) for the document file
- */
- FILE *doc_file;
-
- /* main program to do the work ...
- */
-
- void main(argc, argv)
- int argc;
- char *argv[];
- {
- unsigned long start_time, end_time, time();
- long set_params(), file_size();
- char *ctime();
- FILE *open_docfile();
-
- time (&start_time);
- printf ("Starting work: %s\n", ctime (&start_time));
-
- printf ("\nOpening document file \"%s\"...\n", argv[1]);
- doc_file = open_docfile (argc, argv);
-
- zbufsiz = set_params (argc, argv);
- printf ("Using buffers of size %ld each...\n", zbufsiz);
-
- printf ("Beginning to build index...\n");
- while (build_indices ());
-
- printf ("Beginning to merge index subfiles...\n");
- while (merge_indices (argv[1]));
-
- time (&end_time);
- printf ("\nEnding work: %s\n", ctime (&end_time));
- printf ("Elapsed time = %ld seconds.\n", end_time - start_time);
- printf ("Indexing rate = %.2f megabytes/hour.\n", file_size (doc_file) *
- 0.003600 / ( end_time - start_time ));
- printf ("Input database file \"%s\" was %ld bytes long.\n", argv[1],
- file_size (doc_file));
-
- fclose (doc_file);
- }
-
-
- /* -------------------bufio.c file------------------- */
-
- /* This is the file 'bufio.c' ... by ^z, 870830-0913- ... it includes
- * definitions of some buffer words to accumulate information on its
- * way to/from the disks ... use these to speed up operations and reduce
- * disk head movements, in place of calls to fputc(), fread(), fwrite(),
- * etc. ... try to make them very general so that they will simplify
- * other tasks....
- *
- */
-
-
- /* set up some buffers here to save on disk head movement and speed up
- * operations ... use my simple ZBUF structure for both input and
- * output operations ... keep everything static, local here to this file
- * for safety's sake ... allocate enough items here for all the buffers
- * that we may need for an NMERGE-way merge operation, taking into account
- * the need for input buffers for all the NMERGE *.k and *.p files plus
- * the output files *.k and *.p as well....
- */
-
- static ZBUF zbuffer[2 + 2 * NMERGE];
-
-
- /* function to create a zbuffer and set its parameters appropriately
- */
-
- void create_zbuffer (n, bufsize, buffile, bufitemsize)
- int n, bufitemsize;
- long bufsize;
- FILE *buffile;
- {
- char *make_buf();
-
- zbuffer[n].zbufbase = make_buf (bufsize);
- zbuffer[n].zbufp = zbuffer[n].zbufbase;
- zbuffer[n].zbufcounter = 0;
- zbuffer[n].zbufsize = bufsize;
- zbuffer[n].zbuffilep = buffile;
- zbuffer[n].zbufitemsize = bufitemsize;
- }
-
-
- /* function to free a zbuffer ...
- */
-
- void free_zbuffer (n)
- int n;
- {
- free (zbuffer[n].zbufbase);
- }
-
-
- /* function to return a pointer to the next item in a chosen input
- * buffer 'n'; it reloads the buffer if necessary ... returns NULL
- * pointer when there's nothing left in the file ...
- */
-
- char *next_input_item (n)
- int n;
- {
- char *result;
- void load_zbuffer();
-
- if (zbuffer[n].zbufcounter == 0)
- load_zbuffer (n);
-
- zbuffer[n].zbufcounter -= zbuffer[n].zbufitemsize;
- if (zbuffer[n].zbufcounter >= 0)
- {
- result = zbuffer[n].zbufp;
- zbuffer[n].zbufp += zbuffer[n].zbufitemsize;
- return (result);
- }
- else
- return (NULL);
- }
-
-
- /* function to load the nth zbuffer appropriately ... it resets the buffer
- * pointers, etc. ... might as well give the user a chance to interrupt (in
- * the Macintosh version) here, as long as we have to go to the disk anyway....
- */
-
- void load_zbuffer (n)
- int n;
- {
- long nread;
- void exit(), check_interrupt();
-
- #ifdef LIGHTSPEED
- nread = zbuffer[n].zbufsize;
- FSRead (zbuffer[n].zbuffilep->refnum, &nread, zbuffer[n].zbufbase);
- #else
- nread = fread (zbuffer[n].zbufbase, sizeof(char),
- (int)zbuffer[n].zbufsize, zbuffer[n].zbuffilep);
- #endif
-
- zbuffer[n].zbufp = zbuffer[n].zbufbase;
- zbuffer[n].zbufcounter = nread;
-
- if (ferror (zbuffer[n].zbuffilep))
- {
- printf ("\nFatal error in load_zbuffer while reading in data!\n");
- printf ("(n=%d, nread=%ld)\n", n, nread);
- exit (1);
- }
-
- #ifdef LIGHTSPEED
- check_interrupt ();
- #endif
- }
-
-
- /* function to return a pointer to the right place to store the next
- * output item for the nth zbuffer ... when the buffer becomes full,
- * it flushes it to disk, resets pointers, etc.
- */
-
- char *next_output_item (n)
- int n;
- {
- char *result;
- void flush_zbuffer();
-
- if (zbuffer[n].zbufcounter == zbuffer[n].zbufsize)
- flush_zbuffer (n);
-
- result = zbuffer[n].zbufp;
- zbuffer[n].zbufp += zbuffer[n].zbufitemsize;
- zbuffer[n].zbufcounter += zbuffer[n].zbufitemsize;
- return (result);
- }
-
-
- /* function to flush out a zbuffer to the disk ... might as well give user
- * a chance to interrupt here (in the Macintosh version), as long as we
- * have to go to the disk anyway....
- */
-
- void flush_zbuffer (n)
- int n;
- {
- long chars_written;
- void exit();
-
- if (zbuffer[n].zbufcounter == 0)
- return;
-
- #ifdef LIGHTSPEED
- chars_written = zbuffer[n].zbufcounter;
- FSWrite (zbuffer[n].zbuffilep->refnum, &chars_written,
- zbuffer[n].zbufbase);
- #else
- chars_written = fwrite (zbuffer[n].zbufbase, sizeof(char),
- (int)zbuffer[n].zbufcounter, zbuffer[n].zbuffilep);
- #endif
- if (ferror(zbuffer[n].zbuffilep) || chars_written == 0)
- {
- printf ("\nFatal error in flush_zbuffer while writing out data!\n");
- printf ("(n=%d, zbufcounter=%ld, chars_written=%ld)\n", n,
- zbuffer[n].zbufcounter, chars_written);
- exit(1);
- }
-
- zbuffer[n].zbufp = zbuffer[n].zbufbase;
- zbuffer[n].zbufcounter = 0;
-
- #ifdef LIGHTSPEED
- check_interrupt ();
- #endif
- }
-
-
- /* ---------------build_indices.c file---------------- */
-
- /* file build_indices.c ... by ^z, 870820-0913-...
- *
- * revised 870930-871007 to allow more user options on keeping/discarding
- * punctuation, etc. -- ideas based on Bill Hole's suggestions
- *
- * contains subroutine to build indices for each chunk of input document
- * (database) text file for program qndxr ...
- *
- * Strategy is as outlined in qndxr: take in a big chunk of the doc_file,
- * generate the pointers to each word in the buffer as the buffer contents
- * are converted into appropriate (all caps, delimiters filtered into
- * spaces) form for sorting; sort the pointers in memory; and then write
- * out to disk the pointers and keys in the ndxr.c format for *.k and *.p
- * files.
- *
- * Allocate space for the doc and ptr buffers here so as to maximize use
- * of available memory ... note that we need to have room for the doc
- * buffer, for a ptr buffer that might (in the worst case of a file full
- * of 1-letter words) be twice as long as the doc buffer, and also space
- * for two standard zbuffers to accumulate the output *.k and *.p file
- * info in before sending it to disk.
- *
- * Note that for speed, while they are being sorted the pointers just point
- * directly to the key strings in the input buffer; they must be converted
- * into true offset pointers relative to the 0th byte of the document file
- * as they are written to disk in the *.p file! Make sure that all of
- * the delimiters in the document/database buffer are converted into
- * '\0' characters so that string comparison functions will work right.
- *
- * Also note that to avoid edge effects at the end of the buffer, an extra
- * amount of space is required at the end of the buffer, of length
- * KEY_LENGTH, to accomodate the end of the last word in the buffer.
- *
- * Use static local variables in the function here to keep track of where
- * we are in the document file from one chunk to the next, what chunk
- * number we are on now, etc.
- *
- * Give the user a chance to interrupt operations (in the Macintosh
- * version of this program) at intervals here, as long as
- * there are time-consuming I/O or sorting or scanning operations
- * to be done ...
- */
-
-
- int build_indices ()
- {
- static int pass_number = 0;
- long doc_bufsiz, offset, load_doc_buffer(), nwords,
- ftell();
- extern long zbufsiz;
- extern FILE *doc_file;
- char *doc, **ptr, *malloc(), *mlalloc(), *calloc(), *clalloc();
- void zqsort(), write_sorted_files();
-
- doc_bufsiz = 2 * NMERGE * zbufsiz / 3;
- DEBUG ("--allocating doc buffer of size %ld\n", doc_bufsiz + KEY_LENGTH);
- doc = make_buf (doc_bufsiz + KEY_LENGTH);
-
- DEBUG ("--allocating ptr buffer of size %ld\n", doc_bufsiz * 2);
- ptr = (char **)make_buf (doc_bufsiz * 2);
-
- #ifdef LIGHTSPEED
- check_interrupt ();
- #endif
-
- offset = ftell (doc_file);
- DEBUG ("--loading doc buffer beginning at offset %ld\n", offset);
- nwords = load_doc_buffer (doc, doc_bufsiz, ptr);
-
- if (nwords == 0)
- {
- DEBUG ("--Building done ... now freeing doc & ptr buffers\n", NULL);
- free (doc);
- free ((char *)ptr);
- return (FALSE);
- }
-
- printf ("Index subfile #%d contains %ld words...\n", pass_number,
- nwords);
-
- #ifdef LIGHTSPEED
- check_interrupt ();
- #endif
-
- DEBUG ("--sorting ptr array\n", NULL);
- zqsort (ptr, ptr + nwords);
-
- #ifdef LIGHTSPEED
- check_interrupt ();
- #endif
-
- DEBUG ("--writing sorted keys and ptrs to disk\n", NULL);
- write_sorted_files (doc, ptr, nwords, pass_number, offset);
-
- #ifdef LIGHTSPEED
- check_interrupt ();
- #endif
-
- DEBUG ("--freeing doc & ptr buffers\n", NULL);
- free (doc);
- free ((char *)ptr);
-
- ++pass_number;
- return (TRUE);
- }
-
-
- /* ---------------doc_buf.c file------------------ */
-
- /* file doc_buf.c ... by ^z 870830-0919-...
- * functions to load in text from the document file and then
- * save out keys and pointers to the *.k and *.p files ...
- * modified 871007-... to unify the doc-buffer loading with the
- * character-filtering and the pointer array building....
- */
-
- /* function to create a buffer for me to use...
- */
-
- char *make_buf (bufsiz)
- long bufsiz;
- {
- char *buf, *malloc(), *mlalloc();
- void exit();
-
- DEBUG ("--allocating a buffer, size = %ld\n", bufsiz);
-
- #ifdef LIGHTSPEED
- buf = mlalloc (bufsiz);
- #else
- buf = malloc ((unsigned int)bufsiz);
- #endif
-
- if (buf == NULL)
- {
- printf ("\nFatal error in attempt to allocate a buffer!\n");
- printf ("(bufsiz=%ld)\n", bufsiz);
- exit(1);
- }
-
- return (buf);
- }
-
-
- /* function to load the document buffer ... bring in doc_bufsiz
- * characters, and then enough more to finish out the final word,
- * followed by a terminal delimiter .... as the characters are read
- * in, filter them appropriately (depending on user choices) and
- * build the pointer array in memory to the first character of each
- * word ... return the total number of words that were
- * read in to the buffer (zero if we're finished with the file)
- *
- * ... note that one must be sure to pull in and throw away
- * any excess characters beyond KEY_LENGTH in the final word, so that
- * the remaining fragment doesn't show up as the first "word" in the
- * next chunk of the file....
- *
- * Routine modified 871007-... in order to unify the buffer-loading and
- * character-filtering and pointer-array-building operations, and to go
- * back to using getc() from <stdio> rather than Macintosh-specific
- * operations for loading the buffer....
- */
-
- long load_doc_buffer (doc, doc_bufsiz, ptr)
- char *doc, **ptr;
- long doc_bufsiz;
- {
- int c, i, in_a_word = FALSE;
- char **ptr0, *end_doc_buf;
- extern FILE *doc_file;
-
- DEBUG ("--Loading document buffer...\n", NULL);
-
- ptr0 = ptr;
- end_doc_buf = doc + doc_bufsiz;
-
- while (doc < end_doc_buf)
- {
- c = filtered_getc ();
- DEBUG ("--filtered character = \"%c\"\n", c);
- if (c == EOF)
- {
- *doc++ = '\0';
- in_a_word = FALSE;
- break;
- }
- if (! c)
- in_a_word = FALSE;
- else if (! in_a_word)
- {
- *ptr++ = doc;
- in_a_word = TRUE;
- DEBUG ("--adding new ptr = %ld\n", doc);
- }
- *doc++ = c;
- }
-
- if (doc == end_doc_buf && in_a_word)
- {
- DEBUG ("--finishing off a final buffer word...\n", NULL);
- for (i = 0; i < KEY_LENGTH; ++i)
- {
- c = filtered_getc ();
- if (c == EOF)
- {
- *doc++ = '\0';
- break;
- }
- if (! c)
- {
- *doc++ = '\0';
- break;
- }
- *doc++ = c;
- }
- if (i == KEY_LENGTH)
- while (filtered_getc ())
- ;
- }
-
- return (ptr - ptr0);
- }
-
-
- /* function to get the next character from the document file and filter it
- * as the user desires ... return:
- * EOF if end of file encountered;
- * '/0' if the character is a delimiter;
- * otherwise, the character itself (filtered into upper-case,
- * if it was lower-case)
- */
-
- int filtered_getc ()
- {
- static int prevc, c = '\0';
- int nextc;
- extern int keep_all_punct, keep_embedded_punct, keep_special_chars;
- extern FILE *doc_file;
-
- prevc = c;
- c = getc (doc_file);
-
- if (c == EOF)
- return (EOF);
-
- if (islower (c))
- return (c = toupper (c));
-
- if (isupper (c) || isdigit (c))
- return (c);
-
- if (isspace (c))
- return (c = '\0');
-
- if (keep_special_chars && ! isascii (c))
- return (c);
-
- if (keep_all_punct && ispunct (c))
- return (c);
-
- if (keep_embedded_punct && ispunct (c))
- {
- if (prevc == '\0')
- return (c = '\0');
- nextc = getc (doc_file);
- ungetc (nextc, doc_file);
- if (nextc == EOF)
- return (c = '\0');
- if (isalnum (nextc) || (keep_special_chars && ! isascii (nextc)))
- return (c);
- else
- return (c = '\0');
- }
-
- return (c = '\0');
- }
-
- /* ------------------------file_utils.c------------- */
-
- /* file file_utils.c ... by ^z -- 870820-0913-...
- * some utility routines for qndxr project, associated with files...
- */
-
-
- /* function to write out sorted k & p files based on the doc and ptr
- * arrays in memory....
- *
- * The kfile format is as described in detail elsewhere:
- * the key word, turned into all capital letters and with spaces
- * afterward, of fixed length KEY_LENGTH; and
- * the cumulative count of how many words have passed before, including
- * the current word, a long integer.
- *
- * Function revised 870907-... by ^z to use zbuffer method....
- */
-
- void write_sorted_files (doc, ptr, nwords, pass_number, offset)
- char *doc, **ptr;
- long nwords, offset;
- int pass_number;
- {
- extern long zbufsiz;
- FILE *kfile, *pfile, *open_kfile(), *open_pfile();
- char *prev_word, *next_output_item();
- KEY_REC *outk;
- long *outp, i, file_size ();
- void create_zbuffer(), write_new_key();
-
- DEBUG ("--Entering write_sorted_files with nwords %ld\n", nwords);
- if (nwords == 0)
- return;
-
- DEBUG ("--Opening kfile & pfile for pass_number = %d\n", pass_number);
- kfile = open_kfile (pass_number);
- pfile = open_pfile (pass_number);
-
- DEBUG ("--Creating buffers for keys & ptrs, size = %ld\n", zbufsiz);
- create_zbuffer (0, zbufsiz, kfile, sizeof(KEY_REC));
- create_zbuffer (1, zbufsiz, pfile, sizeof(long));
-
- DEBUG ("--Beginning to write keys and ptrs; first key=%.28s\n", ptr[0]);
- prev_word = ptr[0];
- outk = (KEY_REC *)next_output_item (0);
- write_new_key (ptr[0], outk->kkey);
-
- for (i = 0; i < nwords; ++i)
- {
- if (is_new_word (prev_word, ptr[i]))
- {
- outk->ccount = i;
- outk = (KEY_REC *)next_output_item (0);
- write_new_key (ptr[i], outk->kkey);
- prev_word = ptr[i];
- }
- outp = (long *)next_output_item (1);
- *outp = (ptr[i] - doc) + offset;
- }
- outk->ccount = i;
-
- flush_zbuffer (0);
- flush_zbuffer (1);
-
- DEBUG ("--Getting rid of key and ptr buffers...\n", NULL);
- free_zbuffer (0);
- free_zbuffer (1);
-
- printf (" ...%ld distinct words\n",
- file_size (kfile) / sizeof(KEY_REC));
- fclose (kfile);
- fclose (pfile);
- }
-
-
- /* function to determine if the current word is the same as or different
- * from the previous word -- if it is different, we'll need to write an
- * entry out to the key file kfile -- compare the words up to the first
- * '\0', or for a maximum distance of KEY_LENGTH, and return TRUE
- * if they differ, FALSE if they are identical that far. Thus, a simple
- * call to zstrcmp() does the job.... but keep ours as a function instead
- * of a macro call for the moment, for safety and readability....
- */
-
- int is_new_word (w0, w1)
- char *w0, *w1;
- {
- return (zstrcmp (w0, w1));
- }
-
-
- /* function to write out a new key entry in the key_file:
- * KEY_LENGTH letters consisting of the key word (which will be found
- * delimited by a '\0'), followed by enough blanks to fill out the
- * record to total length KEY_LENGTH ...
- */
-
- void write_new_key (p, kp)
- register char *p, *kp;
- {
- register int i, c;
-
- for (i = 0; i < KEY_LENGTH; ++i)
- {
- c = *p++;
- if (c == '\0')
- break;
- *kp++ = c;
- }
-
- for ( ; i < KEY_LENGTH; ++i)
- *kp++ = ' ';
- }
-
-
-
- /* ---------------------merge_files.c--------------- */
-
- /* file merge_files.c ... 870902-... ^z
- * more functions to do the work of merging subindices together ...
- * with buffering of inputs and outputs ...
- */
-
-
- /* function to do the real work of merging sorted k & p
- * files into a single sorted file...
- *
- * Procedure is as follows:
- *
- * Compare the current key records from each of the N files to be merged.
- * Take the smallest of the keys (or, if there is a tie, take the one
- * from the earlier file number) and write its pointer records out to
- * the *.p file for the next generation. If the key is a new one, that
- * is, if it differs from the previous key, write out the previous key
- * word and the value for cumulative_counts to the *.k file, and reset
- * the previous key's value to that of the current key. Then repeat
- * this whole procedure, until all the input files are exhausted.
- *
- * The above works provided that we are careful in choosing the smallest
- * key and never let a file that has been exhausted (reached EOF) win a
- * comparison. The function smallest_key does that properly below, since
- * a file that is at EOF has returned a NULL pointer for its key_rec...
- *
- * For each file, the variable prev_cc[i] holds the previous value of
- * cumulative_counts for that file, so that we can determine how
- * many ptr_recs to write out by doing a simple subtraction.
- *
- * Buffer numbering scheme: the Kth input file has zbuffer #K
- * for its keys and zbuffer #(K+n) for its pointers; the output
- * buffers are zbuffers #(2*n) for keys and #(2*n+1) for pointers.
- */
-
- void nway_merge_kpfiles (ink, inp, outk, outp, n)
- FILE *ink[], *inp[], *outk, *outp;
- register int n;
- {
- register int i;
- KEY_REC *kr[NMERGE], prev_key;
- long prev_cc[NMERGE], out_cc;
- extern long zbufsiz;
- char *strncpy();
- void copy_ptr_recs(), copy_key_rec();
-
- DEBUG ("--entering nway_merge_kpfiles with n=%d\n", n);
- for (i = 0; i < n; ++i)
- prev_cc[i] = 0;
- out_cc = 0;
-
- for (i = 0; i < n; ++i)
- {
- create_zbuffer (i, zbufsiz, ink[i], sizeof(KEY_REC));
- create_zbuffer (i + n, zbufsiz, inp[i], sizeof(long));
- }
- create_zbuffer (2 * n, zbufsiz, outk, sizeof(KEY_REC));
- create_zbuffer (2 * n + 1, zbufsiz, outp, sizeof(long));
-
- for (i = 0; i < n; ++i)
- kr[i] = (KEY_REC *)next_input_item (i);
-
- i = smallest_key (kr, n);
- strncpy (prev_key.kkey, kr[i]->kkey, KEY_LENGTH);
- DEBUG ("--first key is %.28s\n", prev_key.kkey);
-
- while (merge_not_finished (kr, n))
- {
- i = smallest_key (kr, n);
- if (zstrcmp (prev_key.kkey, kr[i]->kkey))
- {
- copy_key_rec (prev_key.kkey, out_cc, 2 * n);
- strncpy (prev_key.kkey, kr[i]->kkey, KEY_LENGTH);
- }
- copy_ptr_recs (i + n, kr[i]->ccount - prev_cc[i], 2 * n + 1);
- out_cc += kr[i]->ccount - prev_cc[i];
- prev_cc[i] = kr[i]->ccount;
- kr[i] = (KEY_REC *)next_input_item (i);
- }
-
- copy_key_rec (prev_key.kkey, out_cc, 2 * n);
- DEBUG ("--finished nway merge ... final key=%.28s\n", prev_key.kkey);
- flush_zbuffer (2 * n);
- flush_zbuffer (2 * n + 1);
- for (i = 0; i < 2 * n + 2; ++i)
- free_zbuffer (i);
- }
-
-
- /* function to copy a chosen number of ptr_recs (longs) from one file
- * to another ... used in merging two files by merge_kpfiles() above....
- * revised and simplified to use zbuffers....870902 ... ^z
- */
-
- void copy_ptr_recs (inzbuf, count, outzbuf)
- register int inzbuf, outzbuf;
- register long count;
- {
- register long *inp, *outp;
- char *next_input_item(), *next_output_item();
-
- for ( ; count > 0; --count)
- {
- inp = (long *)next_input_item (inzbuf);
- outp = (long *)next_output_item (outzbuf);
- *outp = *inp;
- }
- }
-
-
- /* function to write a key record including kkey (key word) and ccount
- * (cumulative_count) to an output file...
- */
-
- void copy_key_rec (kkey, cc, outzbuf)
- char *kkey;
- long cc;
- int outzbuf;
- {
- KEY_REC *outp;
- char *strncpy(), *next_output_item();
-
- outp = (KEY_REC *)next_output_item (outzbuf);
- strncpy (outp->kkey, kkey, KEY_LENGTH);
- outp->ccount = cc;
- }
-
-
- /* simple function to see if we are not yet finished with all of the input
- * files coming in to the merge operation ... signified by at least one of
- * the key pointers being non-NULL....
- */
-
- int merge_not_finished (kr, n)
- KEY_REC *kr[];
- register int n;
- {
- register int i;
-
- for (i = 0; i < n; ++i)
- if (kr[i] != NULL)
- return (TRUE);
-
- return (FALSE);
- }
-
-
- /* function to determine the smallest of the n keys pointed to by the
- * kr[] vector of pointers to KEY_RECs ... note that a NULL ptr ranks
- * after any other...also note that in case of a tie, the smaller
- * value of i is the one to return (for a stable merging sort)
- */
-
- smallest_key (kr, n)
- KEY_REC *kr[];
- register int n;
- {
- register int i, smallest;
-
- for (i = 0; i < n; ++i)
- if (kr[i] != NULL)
- {
- smallest = i;
- break;
- }
-
- for (i = smallest + 1; i < n; ++i)
- if (kr[i] != NULL)
- if (zstrcmp (kr[smallest]->kkey, kr[i]->kkey) > 0)
- smallest = i;
-
- return (smallest);
- }
-
-
- /* -------------------------merge_indices.c-------------- */
-
- /* file "merge_indices.c" ... by ^z -- 870823-...
- * function to merge sorted indices together repeatedly until finished
- * with them all in a single set of *.k/*.p files ...
- *
- * The merging strategy is straightforward enough:
- * Let "g" denote the generation_number and "f" denote the file_number.
- * Temporary file names begin with the letter x, which is then followed
- * by a generation number (decimal), the letter k or p (standing for
- * key or ptr, respectively), and then the file number (decimal). Thus,
- * the file "x0k0" is the keys file #0 for generation #0 (the starting,
- * pre-merging, generation), file "x2p3" is the ptr file #3 for generation
- * #2, etc.
- *
- * (The following discussion is specifically for a 2-way merge ... but
- * the generalization for N-way merging is straightforward.)
- *
- * On a given call to merge_indices, the following may happen:
- * - files xgkf/xgpf and xgk(f+1)/xgp(f+1) are merged into files
- * x(g+1)k(f/2)/x(g+1)p(f/2), and then the parent files are
- * deleted;
- * - file xgkf isn't found, which means we are done with this
- * generation and must go on to the next;
- * - file xgk(f+1) isn't found, which means that either we are
- * completely done with the merging work (if f=0) and just
- * have to rename the files xgkf/xgpf into the correct final
- * names (that is, doc_filename.k/doc_filename.p), or else
- * (if f>0) we have an odd number of files for this level
- * of merging, and therefore just have to rename xgkf/xgpf
- * to x(g+1)k(f/2)/x(g+1)p(f/2).
- */
-
-
- int merge_indices (doc_filename)
- char *doc_filename;
- {
- static int generation_number = 0, file_number = 0;
- FILE *ink[NMERGE], *inp[NMERGE], *outk, *outp, *open_inkfile(),
- *open_inpfile(), *open_outkfile(), *open_outpfile();
- void fix_oddball_file_name(), fix_final_file_names(), merge_kpfiles(),
- remove_used_infiles(), close_infiles();
- long inwords, indistinctwords, outdistinctwords, file_size();
- int i, n;
-
- for (n = 0; n < NMERGE; ++n)
- {
- ink[n] = open_inkfile (generation_number, file_number + n);
- if (ink[n] == NULL)
- break;
- inp[n] = open_inpfile (generation_number, file_number + n);
- }
-
- if (file_number + n == 1)
- {
- DEBUG ("--All finished with merging files!\n", NULL);
- printf ("\nFinished with index sorting! Final file sizes:\n");
- printf ("%9ld bytes of distinct key words\n", file_size (ink[0]));
- printf ("%9ld bytes of pointers to words\n", file_size (inp[0]));
- close_infiles (ink, inp, n);
- fix_final_file_names (generation_number, doc_filename);
- return (FALSE);
- }
-
- if (n < 2)
- {
- DEBUG ("--finished generation_number=%d ", generation_number);
- DEBUG ("-- file_number=%d\n", file_number);
- if (n == 1)
- {
- close_infiles (ink, inp, n);
- fix_oddball_file_name (generation_number, file_number);
- }
- ++generation_number;
- file_number = 0;
- return (TRUE);
- }
-
- outk = open_outkfile (generation_number, file_number);
- outp = open_outpfile (generation_number, file_number);
-
- inwords = 0;
- indistinctwords = 0;
- for (i = 0; i < n; ++i)
- {
- inwords += file_size (inp[i]) / sizeof(long);
- indistinctwords += file_size (ink[i]) / sizeof(KEY_REC);
- }
- printf ("Merge pass #%d.%d\n", generation_number, file_number / NMERGE);
- printf ("Input files contain %ld words -- %ld distinct words...\n",
- inwords, indistinctwords);
-
- nway_merge_kpfiles (ink, inp, outk, outp, n);
-
- outdistinctwords = file_size (outk) / sizeof(KEY_REC);
- printf (" ... merged result has %ld distinct words.\n",
- outdistinctwords);
-
- close_infiles (ink, inp, n);
- fclose (outk);
- fclose (outp);
- remove_used_infiles (generation_number, file_number, n);
-
- file_number += NMERGE;
-
- return (TRUE);
- }
-
- /* --------------------merge_utils.c---------------- */
-
- /* file "merge_utils.c" ... 870902-... ^z
- * misc. utilities for the merge_indices functions...
- */
-
-
- /* function to open an input key file for this generation and file number
- */
-
- FILE *open_inkfile (generation_number, file_number)
- int generation_number, file_number;
- {
- FILE *fopen();
- char fname[32];
-
- sprintf (fname, "x%dk%d", generation_number, file_number);
- return (fopen (fname, "rb"));
- }
-
-
- /* function to open an input ptr file for this generation and file number
- */
-
- FILE *open_inpfile (generation_number, file_number)
- int generation_number, file_number;
- {
- FILE *fopen();
- char fname[32];
-
- sprintf (fname, "x%dp%d", generation_number, file_number);
- return (fopen (fname, "rb"));
- }
-
-
- /* function to open an output key file for this generation and file number
- */
-
- FILE *open_outkfile (generation_number, file_number)
- int generation_number, file_number;
- {
- FILE *fopen();
- char fname[32];
-
- sprintf (fname, "x%dk%d", generation_number + 1, file_number / NMERGE);
- return (fopen (fname, "wb"));
- }
-
-
- /* function to open an output ptr file for this generation and file number
- */
-
- FILE *open_outpfile (generation_number, file_number)
- int generation_number, file_number;
- {
- FILE *fopen();
- char fname[32];
-
- sprintf (fname, "x%dp%d", generation_number + 1, file_number / NMERGE);
- return (fopen (fname, "wb"));
- }
-
-
- /* function to rename the remaining last unpaired key file & ptr file
- * from one generation to the next...
- */
-
- void fix_oddball_file_name (generation_number, file_number)
- int generation_number, file_number;
- {
- char oldname[32], newname[32];
-
- sprintf (oldname, "x%dk%d", generation_number, file_number);
- sprintf (newname, "x%dk%d", generation_number + 1, file_number / NMERGE);
- if (rename (oldname, newname))
- printf ("Sorry -- error in attempt to rename %s to %s!\n", oldname,
- newname);
-
- sprintf (oldname, "x%dp%d", generation_number, file_number);
- sprintf (newname, "x%dp%d", generation_number + 1, file_number / NMERGE);
- if (rename (oldname, newname))
- printf ("Sorry -- error in attempt to rename %s to %s!\n", oldname,
- newname);
- }
-
-
- /* function to give the final key and ptr files their proper ultimate
- * names ...
- */
-
- void fix_final_file_names (generation_number, doc_filename)
- int generation_number;
- char *doc_filename;
- {
- char oldname[32], finalname[65];
-
- sprintf (oldname, "x%dk0", generation_number);
- sprintf (finalname, "%s.k", doc_filename);
- if (rename (oldname, finalname))
- printf ("Sorry -- error in renaming file %s to %s!\n", oldname,
- finalname);
-
- sprintf (oldname, "x%dp0", generation_number);
- sprintf (finalname, "%s.p", doc_filename);
- if (rename (oldname, finalname))
- printf ("Sorry -- error in renaming file %s to %s!\n", oldname,
- finalname);
- }
-
-
- /* function to get rid of the superfluous k & p files now that they
- * have been merged into the next generation....
- */
-
- void remove_used_infiles (generation_number, file_number, n)
- int generation_number, file_number, n;
- {
- char fname[32];
- register int i;
-
- for (i = 0; i < n; ++i)
- {
- sprintf (fname, "x%dk%d", generation_number, file_number + i);
- DEBUG ("--removing %s\n", fname);
- if (unlink (fname))
- printf ("Sorry -- unable to delete file %s!\n", fname);
- sprintf (fname, "x%dp%d", generation_number, file_number + i);
- DEBUG ("--removing %s\n", fname);
- if (unlink (fname))
- printf ("Sorry -- unable to delete file %s!\n", fname);
- }
- }
-
-
- /* function to close out the ink and inp files that have been opened...
- */
-
- void close_infiles (ink, inp, n)
- FILE *ink[], *inp[];
- register int n;
- {
- register int i;
-
- for (i = 0; i < n; ++i)
- {
- fclose (ink[i]);
- fclose (inp[i]);
- }
- }
-
-
- /* ---------------------misc_utils.c----------------- */
-
- /* file "misc_utils.c" -- miscellaneous utilities for the qndxr project...
- * by ^z -- 870821-...
- */
-
-
- /* this function returns a value for the size of the internal buffers,
- * zbufsiz, and it also takes care of setting the other global parameters,
- * keep_all_punct, keep_embedded_punct, and keep_special_chars,
- * which govern how punctuation and special characters are handled.
- * They are set based on flags such as -e, -a, and -s in the input line.
- * The input parameters are taken one after another, and any that do
- * not convert to a nonzero number are scanned for the letters "e", "a", and
- * "s". If a parameter is not reset, the default is to leave keep_all_punct,
- * keep_embedded_punct, and keep_special_chars as FALSE.
- * The default memory size is DEFAULT_MEMSIZ, set in the header file.
- */
-
- long set_params (argc, argv)
- int argc;
- char *argv[];
- {
- int i;
- void set_parser_options();
- long zb = 0, n, atol(), set_zbufsiz();
-
- for (i = 2; i < argc; ++i)
- {
- n = atol (argv[i]);
- if (n != 0)
- zb = set_zbufsiz (n);
- else
- set_parser_options (argv[i]);
- }
-
- if (zb == 0)
- zb = set_zbufsiz (DEFAULT_MEMSIZ);
-
- return (zb);
- }
-
-
-
- /* determine how big we should make the big buffers -- want to be sure
- * to have room for at least 2*NMERGE+2 of them at one time in memory,
- * so that the N-way merge function can hold all the input files
- * (2N) plus the output files (2).
- *
- * NOTE that the chosen buffer size must be a multiple of both sizeof(long)
- * and sizeof(KEY_REC); I ensure that very simply in what follows by
- * a little modular arithmetic. I also take care of the sign and of the
- * requirement that the buffer size be non-zero -- thus, UNIX aficionados
- * can precede the parameter with a "-" with no ill effects. I have put in
- * various safeguards against picking illegal buffer sizes, along with an
- * ultimate safety net small minimum value for zbufsiz.
- */
-
- long set_zbufsiz (zb)
- long zb;
- {
- char *testb;
-
- if (zb < 0)
- zb = -zb;
-
- zb /= (2 * NMERGE + 2);
- zb = zb - zb % (sizeof(KEY_REC) * sizeof(long));
-
- if (zb < sizeof(KEY_REC) * sizeof(long))
- zb = sizeof(KEY_REC) * sizeof(long);
-
- DEBUG ("--Checking for memory to support buffer size zb=%ld\n", zb);
- testb = make_buf ((2 * NMERGE + 2) * zb);
- free (testb);
-
- return (zb);
- }
-
-
- /* function to set the parser options based on a character string ...
- * 'a' turns on keep_all_punct, 'e' turns on keep_embedded_punct,
- * and 's' turns on keep_special_chars
- */
-
- void set_parser_options (str)
- char *str;
- {
- extern int keep_all_punct, keep_embedded_punct, keep_special_chars;
-
- while (*str)
- {
- if (*str == 'a')
- {
- keep_all_punct = TRUE;
- printf ("All punctuation characters will be kept.\n");
- }
- else if (*str == 'e')
- {
- keep_embedded_punct = TRUE;
- printf ("Embedded punctuation characters will be kept.\n");
- }
- else if (*str == 's')
- {
- keep_special_chars = TRUE;
- printf ("Special characters will be kept.\n");
- }
-
- ++str;
- }
- }
-
-
-
- /* function to check for user interruption of operations (for use in the
- * Macintosh version of this program only) ... call SystemTask() to give
- * desk accessories a bit of time, and then check for non-null events
- * with GetNextEvent() ... if something comes along (a mouse click, or key
- * press, or whatever) let the user choose to exit the program or to
- * carry on ....
- */
-
- #ifdef LIGHTSPEED
-
- void check_interrupt ()
- {
- EventRecord myEvent;
- char cmd[256], *gets();
- void exit();
-
- SystemTask ();
- if (GetNextEvent (everyEvent, &myEvent))
- {
- fprintf (stderr, "\Quit indexing now?\n");
- gets (cmd);
- if (cmd[0] == 'y' || cmd[0] == 'Y')
- exit (0);
- }
- }
-
- #endif
-
-
- /* a very simple function to return the size of a file ... leave the file
- * pointer positioned at wherever it was when the routine was entered....
- * should work fine with stdio methods, but not guaranteed compatible when
- * mixed in with Mac-specific FSRead() function!
- */
-
- long file_size (fp)
- FILE *fp;
- {
- long fpos, result, ftell();
-
- DEBUG ("--finding the size of file fp=%ld\n", fp);
- fpos = ftell (fp);
- fseek (fp, 0L, 2);
- result = ftell (fp);
- fseek (fp, fpos, 0);
- return (result);
- }
-
-
- /* ----------------------open_files.c----------------- */
-
- /* functions to open files as needed in qndxr work...
- */
-
- FILE *open_docfile (argc, argv)
- int argc;
- char *argv[];
- {
- FILE *fp, *fopen();
- void exit();
-
- if (argc < 2)
- {
- printf ("\nToo few command line arguments!\n");
- printf ("\nEnter a text file name (no embedded spaces allowed)\n");
- printf ("and the program will build and sort a complete inverted\n");
- printf ("index to that file. Use \">\" in front of a file name to\n");
- printf ("redirect console output (UNIX-fashion) if desired.\n");
- printf ("Give the program a value for available memory (in bytes)\n");
- printf ("if the default value of 512kB is unsatisfactory ... larger\n");
- printf ("memory allows for faster index building and sorting.\n");
- printf ("Other command-line arguments:\n");
- printf (" \"-e\" preserves embedded punctuation\n");
- printf (" \"-a\" preserves all punctuation characters\n");
- printf (" \"-s\" preserves special (non-ASCII) characters\n");
- printf ("\nContact Mark Zimmermann, 9511 Gwyndale Drive, Silver Spring\n");
- printf ("Maryland 20910 USA; 301-565-2166; science (at) nems.arpa;\n");
- printf ("[75066,2044] on CompuServe for further information. - ^z\n");
- exit (1);
- }
-
- if ((fp = fopen (argv[1], "r")) == NULL)
- {
- printf ("\nFatal error opening document file\"%s\"!\n", argv[1]);
- exit (1);
- }
-
- return (fp);
- }
-
-
- /* open the key file with proper name for this pass ... file will be
- * named "x0kN" where N represents the pass number ....
- */
-
- FILE *open_kfile (pass_number)
- int pass_number;
- {
- FILE *fopen();
- char fname[32];
-
- sprintf (fname, "x0k%d", pass_number);
- return (fopen (fname, "wb"));
- }
-
-
- /* open the ptr file with proper name for this pass ... file will be
- * named "x0pN" where N represents the pass number ....
- */
-
- FILE *open_pfile (pass_number)
- int pass_number;
- {
- FILE *fopen();
- char fname[32];
-
- sprintf (fname, "x0p%d", pass_number);
- return (fopen (fname, "wb"));
- }
-
-
- /* ----------------------zqsort.c-------------------- */
-
- /* file zqsort.c -- by ^z, 870823-...
- * my quicksort to sort out the ptr array ... based, at least initially,
- * on the Lightspeed C library qsort routine, specialized to the task
- * at hand here ...
- */
-
-
- /* sort elements "first" through "last" */
-
- void zqsort (first, last)
- register char **first, **last;
- {
- register char **i, **j, *tmp;
-
- while (last - first > 1)
- {
- i = first;
- j = last;
- for (;;)
- {
- while (++i < last && compare_ptrs (i, first) < 0)
- ;
- while (--j > first && compare_ptrs (j, first) > 0)
- ;
- if (i >= j)
- break;
- tmp = *i;
- *i = *j;
- *j = tmp;
- }
- tmp = *first;
- *first = *j;
- *j = tmp;
- if (j - first < last - (j + 1))
- {
- zqsort (first, j);
- first = j + 1;
- }
- else
- {
- zqsort (j + 1, last);
- last = j;
- }
- }
- }
-
-
- /* function to compare two index ptrs and give a result appropriate
- * for quicksort to use in alphabetizing them....
- *
- * Since the words pointed to have already been turned into all capital
- * letters and delimiters have been filtered out, simply doing zstrcmp()
- * for KEY_LENGTH letters works fine!
- *
- * Slight modification to make the quicksort stable: if two words tie,
- * then we want to compare their pointers to make the lesser one come
- * out first in the sort ...
- */
-
- int compare_ptrs (p1, p2)
- register char **p1, **p2;
- {
- register int diff;
-
- diff = zstrcmp (*p1, *p2);
- if (diff == 0)
- diff = ((*p1 - *p2) > 0) ? 1 : -1;
- return (diff);
- }
-
-
-
- /* my function to compare two strings and give a result as to who is
- * alphabetically earlier. Note that this is almost the same as strncmp()
- * with the fixed value of KEY_LENGTH as the maximum comparison distance,
- * except that I must be sure to mask the characters to make them positive
- * (since we want to be able to handle the non-ASCII funny letters in
- * the Apple character set properly/consistently). If the masking isn't
- * done, then inconsistent results can occur with those non-ASCII chars!
- */
-
- int zstrcmp (s1, s2)
- register char *s1, *s2;
- {
- register int n = KEY_LENGTH;
-
- for (; --n && ((*s1 & CMASK) == (*s2 & CMASK)); s1++, s2++)
- if (!*s1) break;
-
- return ((*s1 & CMASK) - (*s2 & CMASK));
- }
-
-
- -------
-
-
-