Peanuts NeXT Software Archives

Text File | 1996-08-07 | 14.1 KB | 301 lines

'Ortho' is my private collection of NEXTSTEP spelling checkers.
===============================================================

This is the August 1996 version. For a summary of versions, refer to
the section at the end of this text.

The 'Ortho' distribution consists of six archives:
--------------------------------------------------

*   Ortho.Dutch.4.NI.b.tar.gz
*   Ortho.French.3.NI.b.tar.gz
*   Ortho.German.3.NI.b.tar.gz
        Tar archives of NEXTSTEP installer packages.
        To install: Unpack the compressed tar archive in /tmp with
            'gunzip < Ortho.French.3.NI.b.tar.gz | tar xvf -'
        Then double click on the .pkg file to install.

*   Ortho.Dutch.4.s.tar.gz
*   Ortho.French.3.s.tar.gz
*   Ortho.German.3.s.tar.gz
        Source packages: A more detailed description can be found
        below.

A more detailed description of the archives.
--------------------------------------------

Ortho.Dutch.4.NI.b.tar.gz
    A tar archive of 'Ortho.Dutch.pkg', an installer package for the
    Dutch version of 'Ortho'. Just use
        'gunzip < Ortho.Dutch.4.NI.b.tar.gz | tar xvf -'
    in /tmp and double click on the package.
    It is the only archive that contains an installation manual. The
    manual is stored with the .pkg bundle, in the same tar file.
    (handleiding.rtfd: In Dutch)

Ortho.Dutch.4.s.tar.gz
    A 'gzipped' tar archive of both the source that is shared among
    the different languages and the source that is specific to the
    Dutch part of the spelling checker.
    -   The source consists of:
        =   The file 'groen': A list of Dutch words that can be found
            on various Dutch ftp archives such as 'ftp.nl.net'. I
            corrected a number of typos and tried to capitalize all
            names. Besides, I made some minor changes. For more
            detail, refer to the Leesmij and LeesmijOok files in the
            Dutch source directory. To make foreign readers happy,
            the files are in Dutch. The original material is signed
            with three names: Jan van Bakel, Dick Grune and Patrick
            Groeneveld.
            The material here is based on the old orthography rules.
            The current version has been adapted to the new Dutch
            orthography rules.
        =   The file 'groen.meer': A word list that supplements
            'groen'. Although it has the same format as 'groen', the
            fields that I do not need are absolutely unreliable.
            Their content is more or less random.
        =   The source of a C program that reads the word lists and
            stores the different forms of the words in a finite
            automaton. On my 33 MHz NeXTstation Turbo with 32 MB of
            memory, it takes 15 hours. It is probably better to use
            the Dutch.ind file from the installer package.
            [ Or to use a more powerful machine: On a Pentium 150
            with 64 MB of memory it only takes 11 minutes. ]
        =   The Dutch part of the spelling checker, including a
            makefile and some configuration files to produce an
            installer package and two floppy disks for a floppy
            distribution.
        =   Some directories that constitute the general part of the
            spelling package:
            *   include:    Include files.
            *   lib:        Empty, will receive libraries.
            *   ind:        The finite automata library.
            *   nextspel:   The NEXTSTEP-specific part of the
                            checkers.
            *   morpho:     An attempt to give names to the kinds of
                            forms produced by the
                            conjugation/derivation program. This
                            attempt was a failure. As I never
                            replaced it with something sensible, it
                            is still there.
            *   mspel:      A command line spelling checker. Mainly
                            for testing purposes and dictionary
                            management.

Ortho.German.3.NI.b.tar.gz
    A tar archive of 'Ortho.German.pkg', an installer package for the
    German version of 'Ortho'. Just use
        'gunzip < Ortho.German.3.NI.b.tar.gz | tar xvf -'
    in /tmp and double click on the package.
    The original copyright notice by Geoff Kuenning that accompanies
    the material from Martin Schulz is reproduced at the bottom of
    this file.

Ortho.German.3.s.tar.gz
    A 'gzipped' tar archive of both the source that is shared among
    the different languages and the source that is specific to the
    German part of the spelling checker.
    -   The source consists of:
        =   The directory HeinzKnutzen.
            It is source material that can be used with the
            international version of 'ispell'. According to rumours
            it is better than the original material by Martin Schulz.
            I cannot judge. More recent versions should be preferable
            to this material. It was downloaded from
            ftp://ftp.informatik.uni-kiel.de as
            /pub/kiel/dicts/hk-deutsch.tar.gz. The author is Heinz
            Knutzen (hk@informatik.uni-kiel.d400.de). I corrected one
            or two typos in the original material.
            Most of the international part of ispell is the work of
            Geoff Kuenning (geoff@ITcorp.com). His material is
            available from ftp.cs.ucla.edu (131.179.128.34).
        =   A directory MartinSchulz with an alternative dictionary.
            It was part of the original international version of
            ispell. It is unused, but the makefile can be edited to
            use this material. I downloaded this from
            'ftp.vlsivie.tuwien.ac.at'. An alternative source is:
                /ftp.th-darmstadt.de:
                /pub/dicts/ispell/dictionaries/deutsch.tar.gz
            Most of the international part of ispell is the work of
            Geoff Kuenning (geoff@ITcorp.com). His material is
            available from ftp.cs.ucla.edu (131.179.128.34).
        =   A program that uses the ispell dictionaries and an affix
            file to make a finite automaton.

Ortho.French.3.NI.b.tar.gz
    A tar archive of 'Ortho.French.pkg', an installer package for the
    French version of 'Ortho'. Just use
        'gunzip < Ortho.French.3.NI.b.tar.gz | tar xvf -'
    in /tmp and double click on the package.

Ortho.French.3.s.tar.gz
    A 'gzipped' tar archive of both the source that is shared among
    the different languages and the source that is specific to the
    French part of the spelling checker.
    -   The source consists of:
        =   The file 'mots': A list of French words. It is part of
            the 'epelle' software. Epelle can be downloaded from
            'ftp.inria.fr' as 'algo/epelle' or something similar. I
            corrected one or two typos. More recent versions should
            be preferable to my file.
            Epelle was made by Paul Zimmermann, Inria Lorraine,
            <Paul.Zimmermann@inria.fr>.
        =   A program that uses the word list to make a finite
            automaton.

Compiling the source.
---------------------

The makefile in the root of the source tree should be sufficient to
compile the whole package. It is so simple that it can be its own
documentation.
NOTE: Do not forget to change the -arch flags in */makefile.

The changes file.
-----------------

The spelling checkers in the Ortho package use and update the standard
NeXT dictionaries with the name of the language in
~/.NeXT/dictionaries. In addition to this file, they keep a human
readable file <Language>.changes in the same directory. If the
contents of your file do not reveal too much about your private life,
I would be grateful to receive a copy by mail occasionally for the
maintenance of the dictionaries.
It would be nice if users who correct dictionaries sent their
corrections to the original authors of the dictionaries, and also to
me.

Other languages
---------------

I would be delighted to improve my dictionaries and to extend my
effort to other languages, if possible in collaboration with others.
If you have any material, or if you want to collaborate, you are
welcome.

Other platforms.
----------------

Does anyone know about protocols for spelling checking on other
platforms? CDE/ToolTalk or OpenDoc for example. If you do, please mail
me a reference to the documentation.

I originally started the project for several reasons:
-----------------------------------------------------

1)  I am a clumsy typist and need a spelling checker to supplement my
    physical (dis)abilities.
2)  Typing accents is a curse. I prefer a spelling checker that
    suggests them.
3)  I wondered whether finite automata could be used as database
    indexes for approximate search. To get some feeling I started a
    project with finite automata. The issue is still open. The
    project resulted in three spelling checkers.
    [ About automata as indexes: performance on static databases is
    excellent. Updates are far too slow. The indexes are BIG.
    ]

The approach was the following:
-------------------------------

1)  Make a finite automaton for ALL forms you can derive from a
    lexicon. This automaton can be huge: More than 40 megabytes for
    the Dutch lexicon.
2)  Minimize the automaton and use it as a search index. The cost of
    the minimization was far higher than I expected: 15 hours on my
    computer for the Dutch lexicon. Most of the time is wasted in
    paging. For reasonably sized lexica such as the French and German
    ones, the minimization is short enough to make a new index
    occasionally.
3)  The memory image of the resulting minimal automaton is dumped to
    a file. The spelling checker memory maps this file into its
    address space, and uses the automaton to see whether forms exist
    in the dictionary. It also uses the automaton to find forms that
    are sufficiently similar to be proposed as a guess for a word
    that is not recognized by the automaton.
    My definition of similarity is based on insertions, deletions and
    permutations. Depending on the length of the word, it gets a
    score. Every operation has a price. [Hard coded in indguess.c]
    All words recognised by the automaton that can be derived from
    the misspelling through operations with a total cost less than
    the score of the misspelling are potential guesses.
4)  CONCLUSION: Although the amount of memory used is a little too
    high (2 MB for Dutch and German, 1 MB for French) and the
    production of guesses for long words is a little too slow, my
    general impression of the usability of finite automata for
    spelling checking is good.
5)  About the same applies to their use as indices for text
    retrieval. Performance is good but the files are too big.
    Besides, the limit of 255 tags associated with a word is too
    restrictive. But that can be solved by changing the file
    structure.

Versions:
---------

Mid 1995:       First version. Sent to the Dutch NEXTSTEP community.
January 1996:   Dutch version 3, French/German version 2.
                The first public versions.
August 1996:    Dutch version 4,
                -   The Dutch dictionary is adapted to the new Dutch
                    orthography.
                -   Memory leaks in the generation of guesses have
                    been closed. This and some other changes improve
                    performance.
                -   Provisions for phrases like 'ad hoc', without
                    accepting 'ad' or 'hoc'.
                -   Provisions for language specific misspellings.
                -   Guesses are faster but still too slow.
                -   The new version is file compatible with the
                    previous one.
                German version 3,
                -   I use the dictionaries from Heinz Knutzen. The
                    original Martin Schulz material is still
                    included.
                -   The affix file reader has been improved.
                German, French version 3,
                -   Memory leaks in the generation of guesses have
                    been closed. This and some other changes improve
                    performance.
                -   Provisions for phrases like 'ad hoc', without
                    accepting 'ad' or 'hoc'.
                -   Provisions for language specific misspellings.
                -   Guesses are faster but still too slow.
                -   The new version is file compatible with the
                    previous one.

Mostly for our American fellow humans:
--------------------------------------

-   The 'Ortho' source and spelling checkers are herewith made
    public. Whoever wants to use them, for whatever purpose, is free
    to do so. I would appreciate an acknowledgment. I do not claim
    anything.
-   I do not accept any responsibility for the functionality of what
    I publish. It is up to you to judge the usefulness of my work.
-   I would like to hear about applications, corrections or
    extensions. It would be nice to get news from you. Again: I do
    not claim anything.

Mark de Does

=======================================================
Mark de Does, Donkerstraat 24, 3511 KA Utrecht, Holland
Tel: ++ 31 30 2314150, Email: M.de.Does@inter.nl.net
=======================================================

The copyright notice from the German ispell material reads:
-----------------------------------------------------------

# Copyright 1988, 1989, 1992, 1993, Geoff Kuenning, Granada Hills, CA
# All rights reserved.
#
# Redistribution and use in source and binary forms, with or without
# modification, are permitted provided that the following conditions
# are met:
#
# 1. Redistributions of source code must retain the above copyright
#    notice, this list of conditions and the following disclaimer.
# 2. Redistributions in binary form must reproduce the above copyright
#    notice, this list of conditions and the following disclaimer in the
#    documentation and/or other materials provided with the distribution.
# 3. All modifications to the source code must be clearly marked as
#    such. Binary redistributions based on modified source code
#    must be clearly marked as modified versions in the documentation
#    and/or other materials provided with the distribution.
# 4. All advertising materials mentioning features or use of this software
#    must display the following acknowledgment:
#      This product includes software developed by Geoff Kuenning and
#      other unpaid contributors.
# 5. The name of Geoff Kuenning may not be used to endorse or promote
#    products derived from this software without specific prior
#    written permission.
#
# THIS SOFTWARE IS PROVIDED BY GEOFF KUENNING AND CONTRIBUTORS ``AS IS'' AND
# ANY EXPRESS OR IMPLIED WARRANTIES, INCLUDING, BUT NOT LIMITED TO, THE
# IMPLIED WARRANTIES OF MERCHANTABILITY AND FITNESS FOR A PARTICULAR PURPOSE
# ARE DISCLAIMED. IN NO EVENT SHALL GEOFF KUENNING OR CONTRIBUTORS BE
# LIABLE FOR ANY DIRECT, INDIRECT, INCIDENTAL, SPECIAL, EXEMPLARY, OR
# CONSEQUENTIAL DAMAGES (INCLUDING, BUT NOT LIMITED TO, PROCUREMENT
# OF SUBSTITUTE GOODS OR SERVICES; LOSS OF USE, DATA, OR PROFITS; OR
# BUSINESS INTERRUPTION) HOWEVER CAUSED AND ON ANY THEORY OF LIABILITY,
# WHETHER IN CONTRACT, STRICT LIABILITY, OR TORT (INCLUDING NEGLIGENCE
# OR OTHERWISE) ARISING IN ANY WAY OUT OF THE USE OF THIS SOFTWARE,
# EVEN IF ADVISED OF THE POSSIBILITY OF SUCH DAMAGE.
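
A sketch of the guess generation.
---------------------------------

The description of the approach says that guesses are the words
recognised by the automaton that can be derived from the misspelling
through insertions, deletions and permutations, at a total cost below
a score that depends on the length of the word. The Python sketch
below only illustrates that idea. It is not the code from indguess.c,
and it rests on several assumptions: a plain trie stands in for the
minimal automaton, a "permutation" is read as a swap of two adjacent
letters, and the prices COST_INSERT, COST_DELETE and COST_SWAP as well
as the formula in score() are invented for the example.

```python
from typing import Dict, Iterable, Set, Tuple

# Hypothetical prices. The README only says the real ones are hard coded
# in indguess.c; these values are made up for the illustration.
COST_INSERT = 2   # supply a letter that the misspelling lacks
COST_DELETE = 2   # drop a letter from the misspelling
COST_SWAP = 1     # exchange two adjacent letters


class Node:
    """One state of the automaton (a plain trie here, not minimized)."""

    def __init__(self) -> None:
        self.edges: Dict[str, "Node"] = {}
        self.final = False


def build(words: Iterable[str]) -> Node:
    """Build a trie that accepts exactly the given word forms."""
    root = Node()
    for word in words:
        node = root
        for ch in word:
            node = node.edges.setdefault(ch, Node())
        node.final = True
    return root


def accepts(root: Node, word: str) -> bool:
    """Return True if the automaton recognises the word."""
    node = root
    for ch in word:
        if ch not in node.edges:
            return False
        node = node.edges[ch]
    return node.final


def score(word: str) -> int:
    """Hypothetical budget: longer words may contain more errors."""
    return 2 + len(word) // 3


def guesses(root: Node, word: str) -> Set[str]:
    """Collect accepted words reachable at a total cost below score(word)."""
    budget = score(word)
    found: Set[str] = set()
    best: Dict[Tuple[int, int], int] = {}  # cheapest visit per (state, position)

    def walk(node: Node, i: int, cost: int, prefix: str) -> None:
        key = (id(node), i)
        if cost >= best.get(key, budget):
            return  # over budget, or already reached here at least as cheaply
        best[key] = cost
        if i == len(word) and node.final:
            found.add(prefix)
        if i < len(word):
            nxt = node.edges.get(word[i])
            if nxt is not None:  # letters agree: free step
                walk(nxt, i + 1, cost, prefix + word[i])
            walk(node, i + 1, cost + COST_DELETE, prefix)
        for ch, nxt in node.edges.items():
            walk(nxt, i, cost + COST_INSERT, prefix + ch)
        if i + 1 < len(word):
            first = node.edges.get(word[i + 1])
            second = first.edges.get(word[i]) if first is not None else None
            if second is not None:
                walk(second, i + 2, cost + COST_SWAP,
                     prefix + word[i + 1] + word[i])

    walk(root, 0, 0, "")
    return found
```

With the word list 'spelling', 'spilling', 'selling' and 'checker',
guesses(root, "speling") returns only 'spelling': one inserted letter
costs 2, which is below the budget of 4, while 'selling' needs a
deletion plus an insertion and lands exactly on the budget. Because
the search walks the automaton instead of generating candidate
strings, it only ever follows letters that can still lead to a word.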