home *** CD-ROM | disk | FTP | other *** search
-
-
- COGNATE
-
- an apparently
- wonderfully useless program
- implementing an algorithm
-
- by
-
- Jacques B.M. Guy
-
- Artificial Intelligence Systems
- Telecom (Australia) Research Laboratories
-
-
-
-
-
-
- WHAT IS "COGNATE"?
-
- COGNATE is the implementation of a prototype algorithm for identifying
- related words across languages.
-
- My ultimate purpose in developing COGNATE was to take a first step towards
- solving a far more interesting, and difficult, problem of automatic
- machine translation: given a bilingual text, find the rules for
- translating from either language into the other.
-
- Given the same list of words in two different languages, COGNATE will
- determine which words are likely to be regularly derivable from each
- other, and which are not. The longer the list, or the more closely related
- the two languages are, the better the performance of COGNATE. For instance,
- suppose that you have typed into a file 200 words in English (one per
- line), and in another file the same 200 words, in the same order, in German
- (again one per line). English and German are fairly close languages. Given
- these two files, and no other information whatsoever, COGNATE will be able
- to tell for instance that English "TWENTY" and German "ZWANZIG" are almost
- certainly derivable from each other, and so are English "HONEY" and German
- "HONIG"; but it will also tell you that English "HORSE" and German "PFERD"
- are not so related. COGNATE will also tell you, when comparing "TWENTY"
- with "ZWANZIG", that English "T" corresponds to German "Z".
-
- Because of the very nature of the algorithm, you may encypher each file
- using a simple-substitution code, without causing COGNATE to be confused.
- For instance, if you have encoded the English data by shifting one letter
- forward (so that "TWENTY" becomes "UXFOUZ") and the German data by shifting
- one letter backward (so that "ZWANZIG" becomes "YVZMYHF"), COGNATE will
- still able to tell that "UXFOUZ" and "YVZMYHF" are related, and that
- "IPSTF" ("HORSE") and "OEDQC" ("PFERD") are not.
-
- I thought up the algorithm behind COGNATE around 1981, and implemented it
- first in Simula 67 on a DEC KL10. Then, as a self-inflicted challenge
- which I did not expect to win, I tried to translate it into Turbo Pascal,
- to run on my Kaypro II. It worked. On a Kaypro II, it would take COGNATE 40
- seconds to analyze two files each containing 200 words, and find which were
- related and which not. On a 386DX running at 33MHz, the same operation
- looks as if it were instantaneous.
-
-
- HOW DO I RUN "COGNATE"?
-
- All you need as input for COGNATE are two or more plain ASCII files, each
- containing the same words, in the same order, but in different languages.
- If the words are not in the same order, COGNATE will become hopelessly
- confused and incapable of finding related words (with a huge bit of very
- bad luck, it may even attempt dividing by zero somewhere). I have typed in,
- for demonstration purposes, 200 words in English, in German and in Dutch
- into three files named "ENGLISH", "GERMAN" and "DUTCH".
-
- Just type "COGNATE". You will be asked for the name of the first file.
- Answer, say, ENGLISH. Then for the name of the second file. Answer, say,
- DUTCH. If you have given the names of non-existent files COGNATE will tell
- you which ones it could not find, and ask you again for file names. Just
- pressing RETURN, without typing in any name, will return you to DOS.
-
- Given those two file names, COGNATE will read the data they contain and
- start working on it. How long will it take? It all depends on the size of
- the files. With the ready-to-go files provided, and if you are using a 386
- machine, you won't have the time to count "zero, one". From then on, you
- are on your own, just follow the prompts on the screen. They are as
- self-explanatory as I could make them in one afternoon of slaving at
- translating the original Kaypro II, Turbo Pascal 2 implementation to Turbo
- Pascal 5.5, while trying to tweak some more accuracy and user-friendliness
- out of it. The screen will show the first word pair that COGNATE had read
- out of the two files you gave it to work on. First, in the form of the
- matrix it uses to evaluate whether there were related or not. This matrix
- contains probabilities (in whole percentage, to save space) whether any two
- letters of those words show evidence of cross-correlation. Under the
- matrix, the words are repeated, with COGNATE's opinion whether the
- correlations are strong enough to consider them related, and precisely how
- they correlate. At the bottom of the screen, you see a primitive menu:
-
- N(ext), P(revious), J(ump), M(inimum match), O(ther pass), Q(uit) ?
-
- Just type the first letter: N, P, J, M, O, or Q.
-
- N(ext) takes you to the next word pair.
- P(revious) takes you to the previous pair.
- J(ump) has COGNATE prompt you for the number of the word pair you want to
- jump to. A hint: the words are arranged alphabetically in the English
- list.
- M(inimum) has COGNATE prompt you for the minimum probability of correlation
- between two letters for that correlation to be considered in its
- computations.
- O(ther pass) has COGNATE carry the necessary computations to move to the
- next pass of the algorithm. If you are already in the second pass,
- it returns you to its first-pass state.
- Q(uit) will ask you for confirmation before returning you to DOS.
-
- You can also use the following keys:
-
- RETURN and Down-Arrow have the same effect as N(ext).
- Up-Arrow has the same effect as P(revious).
- PgUp moves 20 word-pairs back.
- PgDn moves 20 word-pairs forward.
- Home takes you to the top of the list.
- End takes you to the bottom of the list.
-
- COGNATE uses a two-pass algorithm. Results obtained during the second pass
- are always better than those obtained in the first (but surprisingly
- only marginally so sometimes). If you compare German with English, using
- the files provided, note how, in the first pass, COGNATE is pretty certain
- that "BEACH" and "STRAND" are related, and how, in the second pass, it has
- reversed its opinion.
-
-
- HOW ABOUT TESTING "COGNATE" ON OTHER LANGUAGES WITH OTHER WORDS?
-
- When typing in those three sample files I did not even bother to use
- phonetic spelling, which would probably have made things easier for
- COGNATE. Just for fun, you may try to create another English file,
- containing the same words (in the same order, remember!) but in a phonetic
- spelling of your invention, and see if COGNATE can correlate your phonetic
- spelling with traditional English spelling.
-
- There is no need to use the particular 200 words I have used. You may use
- any other words, in any number, as long as the same words appear in the
- same order in each file, one per line (never mind the language they're
- in), and as long, of course, as they'll fit in your computer's memory. Try
- for instance comparing Spanish, Italian, French, and Latin. Try comparing
- Polish, Ukrainian, Slovenian, and Russian. Comparing Finnish, Hungarian,
- and Turkish, however, will get you nowhere: for even though all three are
- related, their relationship is so distant that COGNATE will not find enough
- regular correspondences to work on. An amusing experiment would be to
- compare Esperanto with some European languages.
-
-
- NOW FOR THE USUAL LEGAL MATTERS
-
- The usual copyright notice applies. But in particular:
-
- The executable file must be distributed with the present documentation file
- as is, including the closing acknowledgement.
-
- COGNATE.EXE, this documentation file, and the three sample wordlists that
- ought to come with it are distributed free.
-
- For those tempted to dissassemble the code to find out what makes COGNATE
- tick, a piece of advice: a library search will get you what you are after
- infinitely more easily. Look for something around, say, 1983-85. The
- present algorithm is fundamentally the same, except for the choice of the
- objective functions. Alternately, observing its behaviour will tell you a
- great deal, perhaps even more than reading the description of the original
- algorithm.
-
-
- ACKNOWLEDGEMENT
-
- The permission of the Executive General Manager of the Research
- Laboratories of Telecom Australia to distribute this material
- is gratefully acknowledged.
-