ProfitPress Mega CDROM2 Shareware Freeware (MSDOS)(1992)(Eng)

home *** CD-ROM | disk | FTP | other *** search

/ ProfitPress Mega CDROM2 …eeware (MSDOS)(1992)(Eng) / ProfitPress-MegaCDROM2.B6I / TEXT / UTILITY / COGNATE.ZIP / COGNATE.DOC next >

Wrap

Text File | 1991-12-02 | 8.2 KB | 174 lines

COGNATE an apparently wonderfully useless program implementing an algorithm by Jacques B.M. Guy Artificial Intelligence Systems Telecom (Australia) Research Laboratories WHAT IS "COGNATE"? COGNATE is the implementation of a prototype algorithm for identifying related words across languages. My ultimate purpose in developing COGNATE was to take a first step towards solving a far more interesting, and difficult, problem of automatic machine translation: given a bilingual text, find the rules for translating from either language into the other. Given the same list of words in two different languages, COGNATE will determine which words are likely to be regularly derivable from each other, and which are not. The longer the list, or the more closely related the two languages are, the better the performance of COGNATE. For instance, suppose that you have typed into a file 200 words in English (one per line), and in another file the same 200 words, in the same order, in German (again one per line). English and German are fairly close languages. Given these two files, and no other information whatsoever, COGNATE will be able to tell for instance that English "TWENTY" and German "ZWANZIG" are almost certainly derivable from each other, and so are English "HONEY" and German "HONIG"; but it will also tell you that English "HORSE" and German "PFERD" are not so related. COGNATE will also tell you, when comparing "TWENTY" with "ZWANZIG", that English "T" corresponds to German "Z". Because of the very nature of the algorithm, you may encypher each file using a simple-substitution code, without causing COGNATE to be confused. For instance, if you have encoded the English data by shifting one letter forward (so that "TWENTY" becomes "UXFOUZ") and the German data by shifting one letter backward (so that "ZWANZIG" becomes "YVZMYHF"), COGNATE will still able to tell that "UXFOUZ" and "YVZMYHF" are related, and that "IPSTF" ("HORSE") and "OEDQC" ("PFERD") are not. I thought up the algorithm behind COGNATE around 1981, and implemented it first in Simula 67 on a DEC KL10. Then, as a self-inflicted challenge which I did not expect to win, I tried to translate it into Turbo Pascal, to run on my Kaypro II. It worked. On a Kaypro II, it would take COGNATE 40 seconds to analyze two files each containing 200 words, and find which were related and which not. On a 386DX running at 33MHz, the same operation looks as if it were instantaneous. HOW DO I RUN "COGNATE"? All you need as input for COGNATE are two or more plain ASCII files, each containing the same words, in the same order, but in different languages. If the words are not in the same order, COGNATE will become hopelessly confused and incapable of finding related words (with a huge bit of very bad luck, it may even attempt dividing by zero somewhere). I have typed in, for demonstration purposes, 200 words in English, in German and in Dutch into three files named "ENGLISH", "GERMAN" and "DUTCH". Just type "COGNATE". You will be asked for the name of the first file. Answer, say, ENGLISH. Then for the name of the second file. Answer, say, DUTCH. If you have given the names of non-existent files COGNATE will tell you which ones it could not find, and ask you again for file names. Just pressing RETURN, without typing in any name, will return you to DOS. Given those two file names, COGNATE will read the data they contain and start working on it. How long will it take? It all depends on the size of the files. With the ready-to-go files provided, and if you are using a 386 machine, you won't have the time to count "zero, one". From then on, you are on your own, just follow the prompts on the screen. They are as self-explanatory as I could make them in one afternoon of slaving at translating the original Kaypro II, Turbo Pascal 2 implementation to Turbo Pascal 5.5, while trying to tweak some more accuracy and user-friendliness out of it. The screen will show the first word pair that COGNATE had read out of the two files you gave it to work on. First, in the form of the matrix it uses to evaluate whether there were related or not. This matrix contains probabilities (in whole percentage, to save space) whether any two letters of those words show evidence of cross-correlation. Under the matrix, the words are repeated, with COGNATE's opinion whether the correlations are strong enough to consider them related, and precisely how they correlate. At the bottom of the screen, you see a primitive menu: N(ext), P(revious), J(ump), M(inimum match), O(ther pass), Q(uit) ? Just type the first letter: N, P, J, M, O, or Q. N(ext) takes you to the next word pair. P(revious) takes you to the previous pair. J(ump) has COGNATE prompt you for the number of the word pair you want to jump to. A hint: the words are arranged alphabetically in the English list. M(inimum) has COGNATE prompt you for the minimum probability of correlation between two letters for that correlation to be considered in its computations. O(ther pass) has COGNATE carry the necessary computations to move to the next pass of the algorithm. If you are already in the second pass, it returns you to its first-pass state. Q(uit) will ask you for confirmation before returning you to DOS. You can also use the following keys: RETURN and Down-Arrow have the same effect as N(ext). Up-Arrow has the same effect as P(revious). PgUp moves 20 word-pairs back. PgDn moves 20 word-pairs forward. Home takes you to the top of the list. End takes you to the bottom of the list. COGNATE uses a two-pass algorithm. Results obtained during the second pass are always better than those obtained in the first (but surprisingly only marginally so sometimes). If you compare German with English, using the files provided, note how, in the first pass, COGNATE is pretty certain that "BEACH" and "STRAND" are related, and how, in the second pass, it has reversed its opinion. HOW ABOUT TESTING "COGNATE" ON OTHER LANGUAGES WITH OTHER WORDS? When typing in those three sample files I did not even bother to use phonetic spelling, which would probably have made things easier for COGNATE. Just for fun, you may try to create another English file, containing the same words (in the same order, remember!) but in a phonetic spelling of your invention, and see if COGNATE can correlate your phonetic spelling with traditional English spelling. There is no need to use the particular 200 words I have used. You may use any other words, in any number, as long as the same words appear in the same order in each file, one per line (never mind the language they're in), and as long, of course, as they'll fit in your computer's memory. Try for instance comparing Spanish, Italian, French, and Latin. Try comparing Polish, Ukrainian, Slovenian, and Russian. Comparing Finnish, Hungarian, and Turkish, however, will get you nowhere: for even though all three are related, their relationship is so distant that COGNATE will not find enough regular correspondences to work on. An amusing experiment would be to compare Esperanto with some European languages. NOW FOR THE USUAL LEGAL MATTERS The usual copyright notice applies. But in particular: The executable file must be distributed with the present documentation file as is, including the closing acknowledgement. COGNATE.EXE, this documentation file, and the three sample wordlists that ought to come with it are distributed free. For those tempted to dissassemble the code to find out what makes COGNATE tick, a piece of advice: a library search will get you what you are after infinitely more easily. Look for something around, say, 1983-85. The present algorithm is fundamentally the same, except for the choice of the objective functions. Alternately, observing its behaviour will tell you a great deal, perhaps even more than reading the description of the original algorithm. ACKNOWLEDGEMENT The permission of the Executive General Manager of the Research Laboratories of Telecom Australia to distribute this material is gratefully acknowledged.