- sspell - similar to Unix spell
- version 1.1
-
- Author: Maurice Castro
- Release Date: 26 Jan 1992
- Bug Reports: maurice@bruce.cs.monash.edu.au
-
- This code has been placed by the Author into the Public Domain.
- The code is NOT covered by any warranty, the user of the code is
- solely responsible for determining the fitness of the program
- for their purpose. No liability is accepted by the author for
- the direct or indirect losses incurred through the use of this
- program.
-
- Segments of this code may be used for any purpose that the user
- deems appropriate. It would be polite to acknowledge the source
- of the code. If you modify the code and redistribute it please
- include a message indicating your changes and how users may
- contact you for support.
-
- The author reserves the right to issue the official version of
- this program. If you have useful suggestions or changes for the
- code, please forward them to the author so that they might be
- incorporated into the official version.
-
- Please forward bug reports to the author via Internet.
-
- * Introduction
-
- The program SSPELL was written by the author to provide a Unix-like
- spell checker on a PC. Several utilities of this type are already
- available; however, most lack at least one of the following:
-
- 1. Public Domain
- 2. Source Code
- 3. Simple, editable word list structure
- 4. Configurable prefix and suffix list.
- 5. Minimal memory use
- 6. Unlimited word list length
- 7. Reasonable speed
- 8. Portable
-
- The SSPELL program provides all these features. The program currently
- compiles under Turbo C++ (Borland) for MS-DOS and cc for Unix (OSx for
- Pyramid, SunOS for Sun 3/50, Ultrix for DECstation 2100). Minor
- modifications will be required to compile under other Unix variants.
-
- * Features
-
- The SSPELL program uses a sorted plain ASCII word list for its dictionary.
- This makes adding new words to the list easy. Simply add the words and
- re-sort the list.
-
- To gain speed without loading the complete list into memory, a cache
- of words recently retrieved from the word list is maintained. The disk
- is searched only if the word is not found in the cache.
-
- A suffix/prefix list allows a smaller dictionary to be used.
-
- * Operation
-
- Edit the config.h file to set up the required default locations and
- compile the code. Place the dictionary in the file specified in the
- config.h and make sure that the index file is writable. SSPELL should
- now be ready for use.
-
- Performance gains may be had by altering the parameters found in the
- config.h file. Increasing CACHESIZE increases the memory usage of the
- program, but decreases disk search time. IDXSIZ and HASHWID control
- the size of the index to the disk file. HASHWID determines the maximum
- number of characters compared to determine if an item occurs in a given
- slot. IDXSIZ determines the number of slots.
-
- A typical IBM-PC config.h could be written as:
-
- #define DICTIONARY "c:\\utility\\dict\\main.dct"
- #define INDEX "c:\\utility\\dict\\main.idx"
- #define RULE "c:\\utility\\dict\\rule.lst"
- #define CACHESIZE 1000
- #define ROOTNAME "c:\\tmp\\sspell"
- #define SORT "c:\\dos\\sort"
-
- #define MAXSTR 128
- #define SEPSTR " \n\r\t!@#$%^&*(),.<>~`\":;|/\\{}[]"
-
- /* HASHWID must always be 2 or greater */
- #define HASHWID 8
- #define IDXSIZ 1000
-
- * Command Line
-
- SSPELL has the following command line options:
-
- sspell [-v] [-x] [-D dict] [-I index] [-R rule]
- [-C cachesize] [file] ...
-
- -v all words not actually in the word list are printed and plausible
- derivations from the word list are indicated
-
- -x all plausible stems are output
-
- -D `dict' is the pathname of an alternate dictionary
-
- -I `index' is the pathname of an alternate index. This should be
- used with a personalised dictionary or if the default index file is
- unwritable.
-
- -R `rule' is the pathname of an alternate rule list
-
- -C `cachesize' is the size of the cache of words found in the
- dictionary.
-
- SSPELL will take input from a list of files on the command line or from
- stdin if no files are supplied.
-
- The dictionary must be in sorted order with the capital letters folded onto
- the small letters (using Unix sort: sort -fu). The case of words in the
- dictionary is significant. Any letter appearing as a capital in the
- dictionary must appear as a capital in the text to be regarded as spelled
- correctly.
-
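The capitalisation rule can be sketched as a small comparison function. The name case_match is hypothetical and the code is not taken from SSPELL; it also assumes that a small letter in the dictionary may be matched by either case in the text, which is one reading of the rule above.

```c
#include <ctype.h>

/* Hypothetical helper, not taken from the SSPELL source.
 * Returns 1 if text_word is an acceptable spelling of dict_word:
 * the letters must match when case is ignored, and any capital in
 * the dictionary word must also be a capital in the text. */
int case_match(const char *dict_word, const char *text_word)
{
    while (*dict_word && *text_word) {
        unsigned char d = (unsigned char)*dict_word;
        unsigned char t = (unsigned char)*text_word;
        if (tolower(d) != tolower(t))
            return 0;               /* different letters */
        if (isupper(d) && !isupper(t))
            return 0;               /* dictionary capital is mandatory */
        dict_word++;
        text_word++;
    }
    /* Both words must end at the same point. */
    return *dict_word == '\0' && *text_word == '\0';
}
```

Under this reading, a dictionary entry "Unix" accepts "Unix" but not "unix", while an entry "spell" accepts "spell", "Spell" and "SPELL".
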
- The format of the rule list is fixed. `#' in the first column indicates a
- comment. All other lines are of the form:
-
- pre|post <prefix/suffix> <required> <forbidden> <delete>
-
- Any field not used must be filled with a `-'. The following examples
- illustrate the features of the rules.
-
- pre un - - -
- post ive - e -
- post ive e - e
- post ied y ay,ey,iy,oy,uy y
-
- The prefix rules are simple: there are no required or forbidden sequences
- and nothing to delete. Prefix rules must not be more complex than this.
-
- The suffix rules are more complex. These rules specify the ending to be
- added to the root after the deletion of the delete field, provided that
- the root has a required ending and that the combination is not
- forbidden. For example: carried.
-
- root: carry
- required `y': carry the last letter is a `y'
- forbidden: the word does not end in a
- forbidden sequence
- delete `y': carr
- suffix `ied': carried
-
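The worked example above can be sketched in C. This is only an illustration of how the four rule fields interact, not code from SSPELL itself: the function names are hypothetical, and it is an assumption that the required field may hold a comma-separated list in the same way the forbidden field does. The sketch builds a derived word from a root, as the example does; the real program may well work in the other direction, stripping endings from the word being checked.

```c
#include <string.h>

/* Hypothetical helpers, not taken from the SSPELL source. */

/* Returns 1 if word ends with the string end. */
static int ends_with(const char *word, const char *end)
{
    size_t wl = strlen(word), el = strlen(end);
    return el <= wl && strcmp(word + wl - el, end) == 0;
}

/* Returns 1 if word ends with any entry of a comma-separated list. */
static int ends_with_any(const char *word, const char *list)
{
    char item[32];
    while (*list) {
        size_t n = strcspn(list, ",");
        if (n > 0 && n < sizeof item) {
            memcpy(item, list, n);
            item[n] = '\0';
            if (ends_with(word, item))
                return 1;
        }
        list += n;
        if (*list == ',')
            list++;
    }
    return 0;
}

/* Applies one "post" rule to root, writing the derived word to out.
 * A field of "-" means "unused". Returns 1 on success, 0 if the rule
 * does not apply or the result does not fit in out. */
int apply_suffix(const char *root, const char *suffix,
                 const char *required, const char *forbidden,
                 const char *del, char *out, size_t outsz)
{
    size_t keep = strlen(root);

    if (strcmp(required, "-") != 0 && !ends_with_any(root, required))
        return 0;                   /* required ending is missing */
    if (strcmp(forbidden, "-") != 0 && ends_with_any(root, forbidden))
        return 0;                   /* root ends in a forbidden sequence */
    if (strcmp(del, "-") != 0) {
        if (!ends_with(root, del))
            return 0;               /* nothing to delete */
        keep -= strlen(del);
    }
    if (keep + strlen(suffix) + 1 > outsz)
        return 0;                   /* result would not fit */
    memcpy(out, root, keep);
    strcpy(out + keep, suffix);
    return 1;
}
```

With the rule `post ied y ay,ey,iy,oy,uy y`, the root "carry" yields "carried", while "play" is rejected because it ends in the forbidden sequence "ay".
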
- * Overview of Internal Operation
-
- SSPELL creates an index file which speeds access to the main dictionary.
- The index is a simple list of the first part of words evenly spaced through
- the dictionary. The number of significant letters and the number of slots
- are set using #define values (HASHWID and IDXSIZ) in the config.h file.
-
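A minimal sketch of such an index follows, using an in-memory sorted array as a stand-in for the dictionary file. The struct and function names are hypothetical and the on-disk format of the real index is not described here, so this only illustrates the idea of evenly spaced, truncated keys. HASHWID and IDXSIZ are given small demonstration values, and the words are assumed to be lower case so that plain strncmp matches the sort order.

```c
#include <string.h>

#define HASHWID 4   /* significant characters per slot (small demo value) */
#define IDXSIZ  3   /* number of slots (small demo value) */

/* Hypothetical layout, not taken from the SSPELL source. */
struct slot {
    char key[HASHWID + 1];  /* first HASHWID characters of a word */
    size_t pos;             /* position of that word in the list */
};

/* Fill idx with the leading characters of IDXSIZ evenly spaced words.
 * words must be sorted and n must be at least 1. */
void build_index(const char *const *words, size_t n, struct slot *idx)
{
    size_t i;
    for (i = 0; i < IDXSIZ; i++) {
        size_t pos = i * n / IDXSIZ;
        strncpy(idx[i].key, words[pos], HASHWID);
        idx[i].key[HASHWID] = '\0';
        idx[i].pos = pos;
    }
}

/* Return the position from which a linear search for word should start:
 * the last slot whose key sorts strictly before the word's first
 * HASHWID characters, or the first slot if there is none. The strict
 * comparison matters because several words can share one key. */
size_t search_start(const struct slot *idx, const char *word)
{
    size_t i, start = idx[0].pos;
    for (i = 0; i < IDXSIZ; i++) {
        if (strncmp(idx[i].key, word, HASHWID) < 0)
            start = idx[i].pos;
        else
            break;
    }
    return start;
}
```

A larger HASHWID makes keys more selective, and a larger IDXSIZ shortens the stretch of the dictionary that must be scanned after the starting position is found.
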
- The index file is created only if no index file exists or the dictionary
- has been modified since the index was created. The dictionary is checked
- for correct ordering during the creation of the index file.
-
- Words are checked for correct spelling by first checking the cache. The
- cache is a move-to-front list, so more recently used words are at the
- front of the cache. The cache size is bounded by a limit set in the config.h
- file. If the word is not found in the cache then an exact match is checked
- for in the file. If no exact match is found then a derivation is checked
- for in the cache and subsequently in the file. If a word in the dictionary
- matches either a derivation or the original then the dictionary word is
- inserted at the head of the cache list.
-
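A move-to-front list of this kind might look roughly like the following. The structure and function names are hypothetical, and dropping the word at the tail when the limit is reached is an assumption; the text above says only that the cache size is bounded.

```c
#include <stdlib.h>
#include <string.h>

/* Hypothetical structures, not taken from the SSPELL source. */
struct node {
    char *word;
    struct node *next;
};

struct cache {
    struct node *head;
    size_t count;
    size_t limit;   /* plays the role of CACHESIZE */
};

/* Look up word; on a hit, move its node to the front.
 * Returns 1 on a hit, 0 on a miss. */
int cache_find(struct cache *c, const char *word)
{
    struct node **link = &c->head;
    while (*link) {
        struct node *n = *link;
        if (strcmp(n->word, word) == 0) {
            *link = n->next;        /* unlink from current position */
            n->next = c->head;      /* relink at the front */
            c->head = n;
            return 1;
        }
        link = &n->next;
    }
    return 0;
}

/* Insert word at the front, dropping the last node if the cache is
 * full (an assumed policy). Returns 1 on success, 0 if memory ran
 * out or the limit is zero. */
int cache_insert(struct cache *c, const char *word)
{
    struct node *n;
    size_t len = strlen(word) + 1;

    if (c->limit == 0)
        return 0;
    if (c->count >= c->limit) {
        struct node **link = &c->head;
        while ((*link)->next)
            link = &(*link)->next;
        free((*link)->word);
        free(*link);
        *link = NULL;
        c->count--;
    }
    n = malloc(sizeof *n);
    if (!n)
        return 0;
    n->word = malloc(len);
    if (!n->word) {
        free(n);
        return 0;
    }
    memcpy(n->word, word, len);
    n->next = c->head;
    c->head = n;
    c->count++;
    return 1;
}
```

Because common words are found repeatedly in ordinary text, they tend to stay near the head of the list, so most lookups finish after a few comparisons and never reach the disk.
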
- Hyphenation and number identification have been left out of the above
- description. The output of the search process is put in a file, which
- is then sorted using the local operating system's sorting utility.
- The result is then listed on standard output such that duplicated lines
- appear only once.
-
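The effect of this final step can be illustrated in memory. Note that this sketch swaps in the C library's qsort for the temporary file and external sort utility that SSPELL uses, and the function name is hypothetical.

```c
#include <stdio.h>
#include <stdlib.h>
#include <string.h>

/* Comparison callback for qsort over an array of C strings. */
static int cmp_str(const void *a, const void *b)
{
    return strcmp(*(const char *const *)a, *(const char *const *)b);
}

/* Hypothetical helper: sort words in place, then write each distinct
 * word once to out. Returns the number of lines written. */
size_t report_unique(const char **words, size_t n, FILE *out)
{
    size_t i, written = 0;

    qsort(words, n, sizeof *words, cmp_str);
    for (i = 0; i < n; i++) {
        /* After sorting, duplicates are adjacent. */
        if (i == 0 || strcmp(words[i], words[i - 1]) != 0) {
            fprintf(out, "%s\n", words[i]);
            written++;
        }
    }
    return written;
}
```

Sorting first is what makes the duplicate check cheap: each word only needs to be compared with its predecessor.
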
- * Acknowledgments
-
- My thanks to people who have contributed to this program:
-
- Michael Oldfield (mao@physics.su.OZ.AU) for a number of bug fixes
-
- * Conclusion
-
- I hope that this program proves useful. Comments and suggestions are welcome;
- I can be contacted via E-Mail at maurice@bruce.cs.monash.edu.au
-
- Maurice Castro
-
-