- 'Ortho' is my private collection of NEXTSTEP spelling checkers.
- ===============================================================
-
- This is the August 1996 version. For a summary of versions, refer to
- the section at the end of this text.
-
- The 'Ortho' distribution consists of six archives:
- --------------------------------------------------
-
- * Ortho.Dutch.4.NI.b.tar.gz
- * Ortho.French.3.NI.b.tar.gz
- * Ortho.German.3.NI.b.tar.gz
- Tar archives of NEXTSTEP installer packages. To install:
- Unpack the compressed tar archive in /tmp through
- 'gunzip < Ortho.French.3.NI.b.tar.gz | tar xvf -'
- Then double click on the .pkg file to install
- * Ortho.Dutch.4.s.tar.gz
- * Ortho.French.3.s.tar.gz
- * Ortho.German.3.s.tar.gz
- Source packages: A more detailed description can be found below.
-
- A more detailed description of the archives.
- --------------------------------------------
-
- Ortho.Dutch.4.NI.b.tar.gz
- A tar archive of 'Ortho.Dutch.pkg', an installer package for the Dutch
- version of 'Ortho'. Just use
- 'gunzip < Ortho.Dutch.4.NI.b.tar.gz | tar xvf -' in /tmp and
- double click on the package. It is the only archive that contains an
- installation manual. It is stored with the .pkg bundle, in
- the same tar file. (handleiding.rtfd: In Dutch)
-
- Ortho.Dutch.4.s.tar.gz
- A 'gzipped' tar archive of both the source that is shared among
- the different languages and the source that is specific to the
- Dutch part of the spelling checker.
- - The source consists of:
- = The file 'groen': A list of Dutch words that can be
- found on various Dutch ftp archives such as 'ftp.nl.net'.
- I corrected a number of typos, tried to capitalize all
- names, and made some other minor changes.
- For more detail, refer to the Leesmij and LeesmijOok files
- in the Dutch source directory. To make foreign readers happy,
- the files are in Dutch.
- The original material is signed with three names:
- Jan van Bakel, Dick Grune and Patrick Groeneveld.
- The original material is based on the old orthography rules.
- The current version has been adapted to the new Dutch
- orthography rules.
- = The file 'groen.meer': A word list that supplements 'groen'.
- Although it has the same format as 'groen', the fields
- that I do not need are absolutely unreliable. Their content
- is more or less random.
- = The source of a C program that reads the word lists and
- stores the different forms of the words in a finite
- automaton. On my 33 MHz NeXTstation Turbo with 32 MB of memory,
- it takes 15 hours. It is probably better to use the Dutch.ind
- file from the installer package. [ Or to use a more powerful
- machine: On a Pentium 150 with 64 MB of memory it only takes
- 11 minutes. ]
- = The Dutch part of the spelling checker including a makefile
- and some configuration files to produce an installer package
- and two floppy disks for a floppy distribution.
- = Some directories that constitute the general part of the
- spelling package:
- * include: Include files
- * lib: Empty, will receive libraries.
- * ind: The finite automata library.
- * nextspel: The NEXTSTEP-specific part of the checkers.
- * morpho: An attempt to give names to the kinds of
- forms produced by the conjugation/derivation
- program. This attempt was a failure. As I
- never replaced it with something sensible
- it is still there.
- * mspel: A command line spelling checker. Mainly
- for testing purposes and dictionary
- management.
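-
- The word-list step above can be pictured with a small sketch. It is
- not the Ortho source: the names State, Automaton, automaton_add and
- automaton_accepts are made up for illustration, the layout (one
- 256-entry table per state) is wasteful, and the minimization that
- takes most of the time is left out entirely. States refer to each
- other by array index rather than by pointer, which is one plausible
- way to keep an automaton in a form that can be written to a file as is.

```c
#include <assert.h>
#include <stdlib.h>
#include <string.h>

/* Hypothetical layout, not taken from Ortho: states live in one array
   and transitions are array indices, so the array holds no pointers. */
#define ALPHABET 256

typedef struct {
    int  next[ALPHABET];  /* target state index, 0 means "no transition" */
    char final;           /* 1 if a word ends in this state */
} State;

typedef struct {
    State *states;
    int    used;
    int    capacity;
} Automaton;

/* Append a zeroed state; returns its index or -1 when out of memory. */
static int new_state(Automaton *a)
{
    if (a->used == a->capacity) {
        int cap = a->capacity ? 2 * a->capacity : 64;
        State *s = realloc(a->states, (size_t)cap * sizeof *s);
        if (s == NULL)
            return -1;
        a->states = s;
        a->capacity = cap;
    }
    memset(&a->states[a->used], 0, sizeof(State));
    return a->used++;
}

/* State 0 is the start state. Nothing ever points back to it, which is
   why 0 can double as the "no transition" marker. */
int automaton_init(Automaton *a)
{
    a->states = NULL;
    a->used = 0;
    a->capacity = 0;
    return new_state(a) == 0 ? 0 : -1;
}

/* Add one word form, creating states along the way (an unminimized trie). */
int automaton_add(Automaton *a, const char *word)
{
    int cur = 0;
    for (const unsigned char *p = (const unsigned char *)word; *p; p++) {
        int nxt = a->states[cur].next[*p];
        if (nxt == 0) {
            nxt = new_state(a);          /* may move a->states */
            if (nxt < 0)
                return -1;
            a->states[cur].next[*p] = nxt;
        }
        cur = nxt;
    }
    a->states[cur].final = 1;
    return 0;
}

/* Returns 1 if the word was added before, 0 otherwise. */
int automaton_accepts(const Automaton *a, const char *word)
{
    int cur = 0;
    for (const unsigned char *p = (const unsigned char *)word; *p; p++) {
        cur = a->states[cur].next[*p];
        if (cur == 0)
            return 0;
    }
    return a->states[cur].final;
}
```

- After adding 'lopen' and 'loopt', automaton_accepts() reports 'lopen'
- as known and rejects 'loop', which is only a prefix of a stored form.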
-
- Ortho.German.3.NI.b.tar.gz
- A tar archive of 'Ortho.German.pkg', an installer package for the
- German version of 'Ortho'. Just use
- 'gunzip < Ortho.German.3.NI.b.tar.gz | tar xvf -' in /tmp and double
- click on the package. The original copyright notice by Geoff
- Kuenning that accompanies the material from Martin Schulz is
- reproduced at the bottom of this file.
-
- Ortho.German.3.s.tar.gz
- A 'gzipped' tar archive of both the source that is shared among
- the different languages and the source that is specific to the
- German part of the spelling checker.
- - The source consists of:
- = The directory HeinzKnutzen. It is source material that can
- be used with the international version of 'ispell'.
- According to rumours it is better than the original
- material by Martin Schulz. I cannot judge. More recent
- versions should be preferable to this material. It was
- downloaded from ftp://ftp.informatik.uni-kiel.de as
- /pub/kiel/dicts/hk-deutsch.tar.gz. The author is Heinz Knutzen
- (hk@informatik.uni-kiel.d400.de). I corrected one or two
- typos in the original material. Most of the international
- part of ispell is the work of Geoff Kuenning
- (geoff@ITcorp.com). His material is available from
- ftp.cs.ucla.edu (131.179.128.34).
- = A directory MartinSchulz with an alternative dictionary.
- It was part of the original international version of
- ispell. It is unused, but the makefile can be edited to
- use this material. I downloaded this from
- 'ftp.vlsivie.tuwien.ac.at'. An alternative source is:
- /ftp.th-darmstadt.de:
- /pub/dicts/ispell/dictionaries/deutsch.tar.gz
- Most of the international part of ispell is the
- work of Geoff Kuenning (geoff@ITcorp.com). His material is
- available from ftp.cs.ucla.edu (131.179.128.34).
- = A program that uses the ispell dictionaries and an affix
- file to make a finite automaton.
-
- Ortho.French.3.NI.b.tar.gz
- A tar archive of 'Ortho.French.pkg', an installer package for the French
- version of 'Ortho'. Just use
- 'gunzip < Ortho.French.3.NI.b.tar.gz | tar xvf -' in /tmp
- and double click on the package.
-
- Ortho.French.3.s.tar.gz
- A 'gzipped' tar archive of both the source that is shared among
- the different languages and the source that is specific to the
- French part of the spelling checker.
- - The source consists of:
- = The file 'mots': A list of French words. It is part
- of the 'epelle' software. Epelle can be downloaded from
- 'ftp.inria.fr' as 'algo/epelle' or something similar.
- I corrected one or two typos. More recent versions should be
- preferable to my file.
- Epelle was made by Paul Zimmermann, Inria Lorraine,
- <Paul.Zimmermann@inria.fr>.
- = A program that uses the word list to make a finite automaton.
-
- Compiling the source.
- ---------------------
- The makefile in the root of the source tree should be sufficient to
- compile the whole package. It is so simple that it can be its own
- documentation.
- NOTE: Do not forget to change the -arch flags in */makefile.
-
- The changes file.
- -----------------
- The spelling checkers in the Ortho package use and update the
- standard NeXT dictionaries with the name of the language in
- ~/.NeXT/dictionaries. Besides this file, they keep a human readable
- file <Language>.changes in the same directory. If the contents of
- your file do not reveal too much about your private life, I would be
- grateful to receive a copy by mail occasionally for the maintenance
- of the dictionaries.
-
- It would be nice if users who correct dictionaries sent their
- corrections to the original authors of the dictionaries, and also to
- me.
-
- Other languages
- ---------------
- I would be delighted to improve my dictionaries and to extend my
- effort to other languages, if possible in collaboration with others.
- If you have any material, or if you want to collaborate, you are
- welcome.
-
- Other platforms.
- ----------------
- Does anyone know about protocols for spelling checking on other
- platforms? CDE/ToolTalk or OpenDoc for example. If you do,
- please mail me a reference to the documentation.
-
- I originally started the project for several reasons:
- -----------------------------------------------------
- 1) I am a clumsy typist and need a spelling checker to supplement my
- physical (dis)abilities.
- 2) Typing accents is a curse. I prefer a spelling checker that suggests
- them.
- 3) I wondered whether finite automata could be used as database indexes
- for approximate search. To get some feeling I started a project with
- finite automata. The issue is still open. The project resulted in
- three spelling checkers. [ About automata as indexes: performance on
- static databases is excellent. Updates are far too slow. The indexes
- are BIG. ]
-
- The approach was the following:
- -------------------------------
- 1) Make a finite automaton for ALL forms you can derive from a lexicon.
- This automaton can be huge: More than 40 megabytes for the Dutch
- lexicon.
- 2) Minimize the automaton and use it as a search index. The cost
- of the minimization was far more than I expected: 15 hours on
- my computer for the Dutch lexicon. Most of the time is wasted in
- paging. For reasonably sized lexica such as the French and German ones,
- the time that minimization takes is short enough to
- make a new index occasionally.
- 3) The memory image of the resulting minimal automaton is dumped to a file.
- The spelling checker memory maps this file in its address space,
- and uses the automaton to see whether forms exist in the dictionary.
- It also uses the automaton to find forms that are sufficiently similar to
- be proposed as a guess for a word that is not recognized by the automaton.
- My definition of similarity is based on insertions, deletions, and
- permutations. Depending on its length, the word gets a score.
- Every operation has a price. [Hard coded in indguess.c] All words
- recognized by the automaton that can be derived from the misspelling
- through operations with a total cost less than the score of the
- misspelling are potential guesses.
- 4) CONCLUSION: Although the amount of memory used is a little too high,
- (2Mb for Dutch and German, 1Mb for French) and the production of
- guesses for long words is a little too slow, my general impression
- of the usability of finite automata for spelling checking is good.
- 5) About the same applies to their use as indices for text retrieval.
- Performance is good but the files are too big. Besides, the limit of
- 255 tags associated with a word is too restrictive. But that can be
- solved by changing the file structure.
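-
- The guessing scheme of step 3 can be sketched as a budgeted walk over
- the automaton. This is an illustration under assumptions, not the code
- in indguess.c: the prices in COST_*, the score_for() rule, the reading
- of 'permutation' as a swap of two adjacent characters, and all the
- names (Node, Guesses, trie_add, guess) are invented here. A plain trie
- stands in for the minimized automaton.

```c
#include <assert.h>
#include <stdlib.h>
#include <string.h>

enum { COST_INSERT = 2, COST_DELETE = 2, COST_SWAP = 1 };  /* assumed prices */
enum { MAX_WORD = 64, MAX_GUESSES = 32 };

typedef struct Node {
    struct Node *child[256];
    int final;                        /* 1 if a word ends here */
} Node;

typedef struct {
    char words[MAX_GUESSES][MAX_WORD];
    int  count;
} Guesses;

/* Add a word to the trie; returns -1 when out of memory. */
int trie_add(Node *root, const char *w)
{
    for (; *w; w++) {
        unsigned char c = (unsigned char)*w;
        if (!root->child[c] && !(root->child[c] = calloc(1, sizeof(Node))))
            return -1;
        root = root->child[c];
    }
    root->final = 1;
    return 0;
}

/* Assumed rule: longer words get a higher score, so more repairs fit. */
static int score_for(size_t len) { return len < 4 ? 2 : 3; }

static void report(Guesses *g, const char *word)
{
    for (int i = 0; i < g->count; i++)
        if (strcmp(g->words[i], word) == 0)
            return;                   /* found earlier via another path */
    if (g->count < MAX_GUESSES)
        strcpy(g->words[g->count++], word);
}

/* n: current state; rest: unread part of the misspelling;
   budget: cost still available; buf[0..depth): candidate built so far. */
static void search(const Node *n, const char *rest, int budget,
                   char *buf, int depth, Guesses *g)
{
    if (budget < 0 || depth >= MAX_WORD - 1)
        return;
    buf[depth] = '\0';
    if (*rest == '\0' && n->final)
        report(g, buf);

    unsigned char c = (unsigned char)rest[0];
    if (c) {
        unsigned char d = (unsigned char)rest[1];
        if (n->child[c]) {            /* characters agree: free */
            buf[depth] = (char)c;
            search(n->child[c], rest + 1, budget, buf, depth + 1, g);
        }
        /* deletion: the misspelling has a character too many */
        search(n, rest + 1, budget - COST_DELETE, buf, depth, g);
        /* swap: two adjacent characters were exchanged */
        if (d && n->child[d] && n->child[d]->child[c]) {
            buf[depth] = (char)d;
            buf[depth + 1] = (char)c;
            search(n->child[d]->child[c], rest + 2, budget - COST_SWAP,
                   buf, depth + 2, g);
        }
    }
    /* insertion: the misspelling lacks a character of the real word */
    if (budget >= COST_INSERT)
        for (int k = 1; k < 256; k++)
            if (n->child[k]) {
                buf[depth] = (char)k;
                search(n->child[k], rest, budget - COST_INSERT,
                       buf, depth + 1, g);
            }
}

/* Collect every word reachable at a total cost below the score. */
void guess(const Node *root, const char *misspelling, Guesses *g)
{
    char buf[MAX_WORD];
    g->count = 0;
    search(root, misspelling, score_for(strlen(misspelling)) - 1, buf, 0, g);
}
```

- With the words 'spell', 'spelling', 'spelen' and 'automaton' stored,
- 'speling' yields the single guess 'spelling' (one insertion) and
- 'psell' yields 'spell' (one swap). A substitution is not a separate
- operation here; it costs a deletion plus an insertion, which exceeds
- the budgets chosen in this sketch.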
-
- Versions:
- ---------
- Mid 1995:
- First version. Sent to the Dutch NEXTSTEP community.
- January 1996:
- Dutch version 3, French/German version 2. The first
- public versions.
- August 1996:
- Dutch version 4,
- - The Dutch dictionary is adapted to the new Dutch orthography.
- - Memory leaks in the generation of guesses have been closed.
- This and some other changes improve performance.
- - Provisions for phrases like 'ad hoc', without accepting 'ad'
- or 'hoc'.
- - Provisions for language-specific misspellings.
- - Guesses are faster but still too slow.
- - The new version is file compatible with the previous one.
- German version 3,
- - I use the dictionaries from Heinz Knutzen. The original
- Martin Schulz material is still included.
- - The affix file reader has been improved.
- German, French version 3,
- - Memory leaks in the generation of guesses have been closed.
- This and some other changes improve performance.
- - Provisions for phrases like 'ad hoc', without accepting 'ad'
- or 'hoc'.
- - Provisions for language-specific misspellings.
- - Guesses are faster but still too slow.
- - The new version is file compatible with the previous one.
-
- Mostly for our American fellow humans:
- --------------------------------------
- - The 'Ortho' source and spelling checkers are herewith made public.
- Whoever wants to use them for whatever he wants to is free to
- do so. I would appreciate an acknowledgment. I do not claim
- anything.
- - I do not accept any responsibility for the functionality of what
- I publish. It is up to you to judge the usefulness of my work.
- - I would like to hear about applications, corrections or extensions.
- It would be nice to get news from you. Again: I do not claim
- anything.
-
- Mark de Does
-
- =======================================================
- Mark de Does, Donkerstraat 24, 3511 KA Utrecht, Holland
- Tel: ++ 31 30 2314150, Email: M.de.Does@inter.nl.net
- =======================================================
-
- The copyright notice from the German ispell material reads:
- -----------------------------------------------------------
- # Copyright 1988, 1989, 1992, 1993, Geoff Kuenning, Granada Hills, CA
- # All rights reserved.
- #
- # Redistribution and use in source and binary forms, with or without
- # modification, are permitted provided that the following conditions
- # are met:
- #
- # 1. Redistributions of source code must retain the above copyright
- # notice, this list of conditions and the following disclaimer.
- # 2. Redistributions in binary form must reproduce the above copyright
- # notice, this list of conditions and the following disclaimer in the
- # documentation and/or other materials provided with the distribution.
- # 3. All modifications to the source code must be clearly marked as
- # such. Binary redistributions based on modified source code
- # must be clearly marked as modified versions in the documentation
- # and/or other materials provided with the distribution.
- # 4. All advertising materials mentioning features or use of this software
- # must display the following acknowledgment:
- # This product includes software developed by Geoff Kuenning and
- # other unpaid contributors.
- # 5. The name of Geoff Kuenning may not be used to endorse or promote
- # products derived from this software without specific prior
- # written permission.
- #
- # THIS SOFTWARE IS PROVIDED BY GEOFF KUENNING AND CONTRIBUTORS ``AS IS'' AND
- # ANY EXPRESS OR IMPLIED WARRANTIES, INCLUDING, BUT NOT LIMITED TO, THE
- # IMPLIED WARRANTIES OF MERCHANTABILITY AND FITNESS FOR A PARTICULAR PURPOSE
- # ARE DISCLAIMED. IN NO EVENT SHALL GEOFF KUENNING OR CONTRIBUTORS BE
- # LIABLE FOR ANY DIRECT, INDIRECT, INCIDENTAL, SPECIAL, EXEMPLARY, OR
- # CONSEQUENTIAL DAMAGES (INCLUDING, BUT NOT LIMITED TO, PROCUREMENT
- # OF SUBSTITUTE GOODS OR SERVICES; LOSS OF USE, DATA, OR PROFITS; OR
- # BUSINESS INTERRUPTION) HOWEVER CAUSED AND ON ANY THEORY OF LIABILITY,
- # WHETHER IN CONTRACT, STRICT LIABILITY, OR TORT (INCLUDING NEGLIGENCE
- # OR OTHERWISE) ARISING IN ANY WAY OUT OF THE USE OF THIS SOFTWARE,
- # EVEN IF ADVISED OF THE POSSIBILITY OF SUCH DAMAGE.
-