- SPELCHEK Version 1.2 - A *FAST* spelling checker by Edwin Floyd. 3-27-91
-
- Version 1.2 implements a new, faster dictionary algorithm which
- is incompatible with previous versions. Please rebuild all user
- dictionaries with MAKEDICT. SPELCHEK is distributed in three files:
-
- SP12EXE.ZIP - Executable programs
- SP12DCT.ZIP - Dictionaries (large file)
- SP12SRC.ZIP - TP6.0 source code to all programs
-
- Purpose of SPELCHEK
- -------------------
- SPELCHEK extracts words from an input file, or several input files,
- and checks them for membership in a superimposed code dictionary.
- Any words not found in the dictionary, it writes to an output file,
- one per line. The program recognizes a number of options for:
-
- o High-order bit stripping
-
- o Appending additional information to the output word list
-
- o Defining the characters comprising a "word"
-
- How to run SPELCHEK
- -------------------
- From the DOS command line enter:
-
- SPELCHEK filenames [-H] [-M] [-W[+/-]abc..] [@name] [-Uname]
- [-Oname] [-Ppath]
-
- Spaces delimit command line parameters. You may intermingle
- input text filenames and options (mark each option with a leading
- hyphen). Some options (-W,-O,-U,-P) allow a character string,
- filename or path to follow the option letter. This must follow with no
- intervening spaces or the program will mistake it for an input
- file name. Some options (-H,-M) allow a "+" or "-" to
- indicate "on" or "off". This also must follow with no
- intervening space, and "+" is assumed if it is omitted. You may
- place options and filenames in an ASCII "include" file and
- specify its name with a leading "@" on the command line. An
- include file may contain references to other include files. You
- also may specify default options, filenames and include files in
- the DOS environment using "SET SPELCHEK=...". For example:
-
- SET SPELCHEK=-H+ -Owords.out -W-ABCDEFGHIJKLMNOPQRSTUVWXYZ
- SET SPELCHEK=@defaults.spc -O
-
- SPELCHEK processes options left-to-right, first from the DOS
- environment, then from the command line. Where options conflict,
- the last option processed prevails. Thus, you may override "SET"
- environment options on the command line.
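
The left-to-right, last-one-wins rule can be sketched in a few lines of Python. This is only an illustration of the precedence described above, not the author's Turbo Pascal code: the function name parse_options and the settings dictionary are hypothetical, and the sketch leaves out -W and @include handling.

```python
def parse_options(env_value, argv):
    """Return (settings, input_files) after processing env_value, then argv.

    Hypothetical helper: it only models the precedence rule, not the
    real program's full option set.
    """
    settings = {"H": False, "M": False, "O": "", "U": None, "P": None}
    files = []
    # Environment tokens are processed first, so command-line tokens,
    # which come later, override them.
    for token in env_value.split() + list(argv):
        if not token.startswith("-"):
            files.append(token)          # anything without a hyphen is an input file
            continue
        letter, rest = token[1:2].upper(), token[2:]
        if letter in ("H", "M"):
            settings[letter] = rest != "-"   # "+" is assumed when omitted
        elif letter in ("O", "U", "P"):
            settings[letter] = rest          # value follows with no space
    return settings, files
```

With the environment set to "-H+ -Owords.out" and the command line "mystory.doc -H- -Omystory.bwd", the sketch ends up with -H off and mystory.bwd as the output name, because the command-line tokens are seen last.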
-
- What the options mean
- ---------------------
- -H[+/-] Clear the high-order bit on each input character
- (default off). Use this option to process files
- created by word processing programs, like WordStar,
- that mark some letters by setting the high-order
- bit, often at the beginning or end of a word.
-
- -M[+/-] Append markup information to output word list. This
- causes the program to insert a number in front of each
- word written to the output file. The number indicates
- the byte position in the input where the offending word
- begins. The first byte in the input file is position 1.
- Also, the program writes the file name at the beginning
- of the word list for each input file. The file name is
- preceded by a zero and a space. This output file is
- intended as input to a program such as MARKDOC which
- marks misspelled words in the input document.
-
- -P[path] Indicate the drive and directory containing the
- master dictionary files. There are seven master
- dictionary files: AB.DCT, CD.DCT, EH.DCT, IN.DCT,
- OR.DCT, ST.DCT and UZ.DCT. They all must reside
- in the same directory. If no -P path is specified,
- the master dictionary files must reside in the current
- directory or the program directory. The master dictionary
- files were created with MAKEDICT (see below) from a list
- of over a hundred thousand words obtained from Public
- Brand Software, 1-800-IBM-DISK.
-
- -U[name] Name a user dictionary file. This option specifies the
- name of an existential dictionary file produced by the
- MAKEDICT program. You may specify the drive and full path.
- If a simple file name is specified, the file is assumed to
- be in the current directory. If SPELCHEK can't open the
- user dictionary, it issues a warning message and processes
- the input files against the master dictionaries only.
-
- -W-abc.. Replace the "word character set" with the indicated
- characters. The program checks each character in
- each input file for membership in the word character
- set and defines a "word" as an uninterrupted
- sequence of at least one but no more than 255
- characters which are members of that set. The
- default is the set of upper and lower case
- alphabetic characters.
-
- -W+abc.. Add additional characters to the word character set.
-
- -O[name] Name the output file. If the name is omitted ("-O "),
- output goes to "StdOut" and is available for a DOS
- pipe (|) or redirection (>). StdOut is the
- default.
-
- -O- Suppress output. -Onul also suppresses output. The
- program will still display word counts on the
- screen.
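
To make the -H, -W and -M descriptions concrete, here is a rough Python sketch of word extraction with byte positions. It is not taken from the SPELCHEK source. The names extract_words and markup_lines are made up for illustration, the single space between the position number and the word is an assumption, and so is the choice to start a new word when a run exceeds 255 characters.

```python
import string

def extract_words(data, word_chars=string.ascii_letters, strip_high_bit=False):
    """Yield (position, word) pairs from a bytes object.

    position is the 1-based byte offset of the word's first character,
    matching the numbering described for -M.
    """
    allowed = set(word_chars.encode("latin-1"))
    start = None
    current = bytearray()
    for index, byte in enumerate(data, start=1):
        if strip_high_bit:
            byte &= 0x7F                     # -H: clear the high-order bit
        if byte in allowed and len(current) < 255:
            if not current:
                start = index
            current.append(byte)
        else:
            if current:
                yield start, current.decode("latin-1")
                current = bytearray()
            if byte in allowed:              # assumption: an over-long run starts a new word
                start = index
                current.append(byte)
    if current:
        yield start, current.decode("latin-1")

def markup_lines(filename, words):
    """Format (position, word) pairs the way -M is described (separator assumed)."""
    yield "0 " + filename                    # file name preceded by a zero and a space
    for position, word in words:
        yield "%d %s" % (position, word)
```

For the bytes b"Teh ca\xf4 sat." (a WordStar-style high bit on the last letter of "cat"), the sketch finds "Teh" at 1, "cat" at 5 and "sat" at 9 when strip_high_bit is on. With it off, the marked byte is not a word character and the second word comes out as "ca".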
-
- Two examples
- ------------
- 1. Generate a list of all misspelled words in the document named
- MYSTORY.DOC and write the list to file MYSTORY.BWD. The following
- are equivalent:
-
- SPELCHEK mystory.doc -Omystory.bwd
-
- SPELCHEK mystory.doc >mystory.bwd (default StdOut)
-
- SET SPELCHEK=-Omystory.bwd (set defaults)
- SPELCHEK mystory.doc
-
- If at this point we want an alphabetic, un-duplicated list of misspelled
- words, we can use the WORDS program (see WORDS.DOC for other uses):
-
- WORDS mystory.bwd -omystory.unq -a
-
- 2. Generate a list of misspelled words in the documents named
- HISPHYS.WS and OPREPORT.WS and use the list as input for MARKDOC to
- mark misspelled words in both documents. The files are WordStar
- documents and we wish to check a user dictionary called MEDTERM.DCT
- in the current directory. The main dictionary files reside in
- directory: D:\SPELL.
-
- SET SPELCHEK=-Pd:\spell -H -O -M -Umedterm.dct
- SPELCHEK hisphys.ws opreport.ws | MARKDOC
-
- We could have specified all the options on the command line.
- Ordinarily you should set the -P and -U options in the environment.
-
- Networks
- --------
- FYI, network users, SPELCHEK opens its input files in "Read, Deny
- None" mode, @include files "Read, Compatibility", and the output
- file in "Write, Compatibility". Only one input file at a time is
- open, except during processing of nested @include files.
-
- MAKEDICT
- --------
- MAKEDICT creates an optimal existential dictionary (Bloom filter) which
- can be used by SPELCHEK with the "-U" option (see above). From the DOS
- command line, enter:
-
- MAKEDICT infile [bits] [extra]
-
- The input file should be a list of words, one per line. All
- characters should be upper case if the dictionary is intended
- for use with SPELCHEK. The second parameter, "bits", specifies
- the number of bits to superimpose for each input word. The
- number of bits partly determines the accuracy of the dictionary.
- For use with SPELCHEK, specify the default, 14 bits. The third
- parameter, "extra", specifies an allowance of extra space so words
- may be added to the dictionary while it still remains within the
- accuracy specified by the "bits" parameter. The default is zero.
- The output file is given the same name as the input file, except
- the extension is ".DCT". If the input file extension is ".DCT",
- the output file is given the extension ".DIC".
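
For readers unfamiliar with superimposed coding, the following Python sketch shows the general shape of a Bloom filter with a "bits" parameter. It is a generic illustration, not MAKEDICT's algorithm or file format: the class name BloomFilter is hypothetical, SHA-256 stands in for whatever hash the real program uses, the table is sized by the textbook rule m = k*n/ln 2, and the "extra" allowance is not modeled because its units are not stated here.

```python
import hashlib
import math

class BloomFilter:
    """Generic Bloom filter sketch; 'bits_per_word' plays the role of 'bits'."""

    def __init__(self, expected_words, bits_per_word=14):
        self.k = bits_per_word
        # Textbook sizing: with m = k*n/ln 2 about half the table bits end
        # up set, so a non-member passes all k tests with odds near 2**-k.
        self.m = max(8, math.ceil(self.k * expected_words / math.log(2)))
        self.table = bytearray((self.m + 7) // 8)

    def _positions(self, word):
        # Words are upper-cased, as the input list is expected to be.
        key = word.upper().encode("utf-8")
        for i in range(self.k):
            # Stand-in hash: one SHA-256 digest per superimposed bit.
            digest = hashlib.sha256(bytes([i]) + key).digest()
            yield int.from_bytes(digest[:8], "big") % self.m

    def add(self, word):
        for pos in self._positions(word):
            self.table[pos >> 3] |= 1 << (pos & 7)

    def __contains__(self, word):
        return all(self.table[pos >> 3] & (1 << (pos & 7))
                   for pos in self._positions(word))
```

Every word that was added is always reported as present. A word that was never added is usually rejected, but can occasionally be accepted when all of its bit positions happen to be set by other words.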
-
- To create a user dictionary for SPELCHEK, only the input file need
- be specified. The defaults for "bits" and "extra" are exactly what
- is required for a user dictionary. Example:
-
- MAKEDICT medterm.lst
-
- This creates a user dictionary called: MEDTERM.DCT suitable for use
- by SPELCHEK.
-
- MAKEDICT prints dictionary statistics, including the odds against
- incorrectly recognizing a word which is not in the dictionary. Please
- remember, a Bloom filter is a probabilistic technique; collisions are
- possible, but you control the collision probability by the bits setting.
- All main dictionaries were created with 14 bits, corresponding to about
- a 1/16384 chance of collision. When you specify a user dictionary, the
- odds increase to 1/16384 plus the user dictionary odds. Thus, a 14-bit
- user dictionary would increase the odds of a collision to about 1/8192.
- This means, on the average, SPELCHEK will miss about one out of every
- 8192 different misspelled words. For instance, if a really bad speller
- misspells (differently) about every tenth word in an 80,000-word
- document, SPELCHEK may miss one of the misspellings.
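
The arithmetic behind these figures can be checked with a short Python calculation. The function name miss_odds is made up for this note. Adding the odds is the same approximation used in the text above (the exact combined value is very slightly smaller).

```python
def miss_odds(*dictionary_bits):
    """Approximate chance that a misspelled word is accepted by mistake.

    Each dictionary built with b bits contributes about 2**-b, and the
    contributions are simply added, as in the text.
    """
    return sum(2.0 ** -bits for bits in dictionary_bits)

main_only = miss_odds(14)            # main dictionaries alone: 1/16384
with_user = miss_odds(14, 14)        # plus a 14-bit user dictionary: 1/8192

misspellings = 80_000 // 10          # every tenth word, each misspelled differently
expected_missed = misspellings * with_user   # 8000/8192, a little under one word
```

So in the 80,000-word example, the expected number of misspellings that slip through is just under one.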
-
- MARKDOC
- -------
- MARKDOC reads the output file produced by SPELCHEK with the -M+
- option and marks misspelled words in the input files. From the
- DOS command line, enter:
-
- MARKDOC [markchars] [<infile]
-
- MARKDOC reads its standard input file (STDIN). Each input line
- begins with a number. The number zero is always followed by a
- document file name. Each non-zero number indicates the position
- of the first character of a misspelled word in the current
- document file. MARKDOC reads each document file and writes an
- output file which is the same as the input file, except each
- misspelled word is preceded by "mark" characters. The
- default mark character is a single "#", but you may specify
- mark characters as a parameter on the command line. Examples:
-
- SPELCHEK document.fil -M+ | MARKDOC %@
-
- SPELCHEK -M+ document.fil -Omark.$$$
- MARKDOC <mark.$$$
-
- MARKDOC saves a copy of the document file under the same name
- as the original document except with the extension ".BAK".
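
The marking step itself is simple enough to sketch in Python. This is an illustration of the idea, not MARKDOC's source: the names parse_markup and mark_document are hypothetical, and the sketch assumes a single space separates the number from the rest of each input line.

```python
def parse_markup(lines):
    """Group positions by document name from a -M style word list."""
    files = {}
    current = None
    for line in lines:
        if not line.strip():
            continue
        number, _, rest = line.strip().partition(" ")
        if int(number) == 0:
            current = rest               # a zero introduces a new document name
            files[current] = []
        else:
            files[current].append(int(number))
    return files

def mark_document(text, positions, mark="#"):
    """Return text with 'mark' inserted before each 1-based position."""
    out = []
    previous = 0
    for pos in sorted(positions):
        out.append(text[previous:pos - 1])   # copy up to the misspelled word
        out.append(mark)
        previous = pos - 1
    out.append(text[previous:])
    return "".join(out)
```

Given the text "Teh cat szat." and positions 1 and 9, mark_document returns "#Teh cat #szat." with the default mark, or "%@Teh cat %@szat." when the mark characters are "%@".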
-
- Note: MARKDOC expects to read a file produced by SPELCHEK with the
- -M+ option. If this option is not set, MARKDOC will abort with a
- Pascal error 106. MARKDOC is intended as a demonstration of one
- use of the -M+ output file. Its crash resistance should be
- improved before it's let out into the real world.
-
- WORDS
- -----
- WORDS is a word extractor program useful for creating word lists for
- MAKEDICT, among other things. See WORDS.DOC for documentation.
-
- Legal Stuff
- -----------
- SPELCHEK.EXE, MAKEDICT.EXE, MARKDOC.EXE, WORDS.EXE, SPELCHEK.DOC,
- WORDS.DOC, and all source code files, dictionaries, and word
- lists are:
-
- Copyright (c) 1990,91 by Edwin T. Floyd,
- All rights reserved.
-
- SPELCHEK is copyrighted "free" software. The author hereby
- expressly permits and encourages individuals to use SPELCHEK at
- home and at work and to distribute it without charge. The author
- prohibits distribution of SPELCHEK for profit, or as a part of a
- product sold for profit, except where explicit written permission
- has been obtained from the author for such distribution. Also,
- users groups and shareware libraries charging a disk duplication
- fee not exceeding $10.00 may distribute SPELCHEK.
-
- The author makes no warranties of any kind, either expressed or
- implied, as to merchantability or fitness for any particular
- purpose. SPELCHEK, et al., are available as is and in no event
- will the author be held liable for damages, including any lost
- profits or incidental or consequential damages, even if the author
- has been advised of the possibility of such damages.
-
- Authorship
- ----------
- SPELCHEK was written in Turbo Pascal v6.0 by:
-
- Edwin T. Floyd [76067,747] (CompuServe)
- #9 Adams Park Court 404/576-3305 (work)
- Columbus, GA 31909 404/322-0076 (home)
-
- The latest version of SPELCHEK is available on CompuServe in
- the IBMAPP forum, and on a number of bulletin boards around the
- country.
- - Edwin - 3-27-91
-
- Revision History
- ----------------
- 05-13-90 V1.0 ETF Original release & DDJ submission.
- 01-10-91 V1.1 ETF Test version, Bloom filter CRC algorithm (not released)
- 03-27-91 V1.2 ETF Update for TP6.0 and second public release
-