ENGLEX: an English lexicon for PC-KIMMO

version 1.0
November 26, 1991
Documentation updated: 20-Jan-92

Evan L. Antworth
Summer Institute of Linguistics, Dallas, TX
evan@sil.org

Copyright (C) 1991, Summer Institute of Linguistics, Inc.

Contents

 1 What is Englex?
 2 Copyright and fair use policy
 3 Required software
 4 About PC-KIMMO and KTEXT
 5 System requirements and performance
 6 Coverage and disclaimers
 7 Test corpora
 8 Comparison with appendix A of the PC-KIMMO book
 9 Design philosophy
10 Extending, modifying, and fine-tuning the lexicon
11 File structure
12 Running Englex
13 Alphabet
14 British spelling
15 Archaic forms
16 Stress marks
17 Accented characters
18 Inflection and derivation
19 Multiple senses and homonyms
20 Word class conversion
21 Output data structures
22 Gloss tags
23 Compounds
24 Clitics
25 Participles
26 Special cases
27 Names and abbreviations
28 Digits and Roman numerals
29 Preprocessing text
30 Reporting defects and submitting enhancements
31 References

1 What is Englex?

Englex is a morphological parsing lexicon of English. It uses the standard orthography for English. It is intended for use with PC-KIMMO (or programs that use the PC-KIMMO parser, such as KTEXT). With such software and Englex, you can morphologically parse English words and text. Practical applications include morphologically preprocessing text for a syntactic parser and producing morphologically tagged text. Englex can also be used to explore English morphological structure.

2 Copyright and fair use policy

All of the files in this release of Englex are copyrighted by the Summer Institute of Linguistics (Academic Computing Department, 7500 W. Camp Wisdom Road, Dallas, TX 75236, U.S.A.).
Permission is hereby granted to the user to copy, use, modify, and distribute the Englex files under the following conditions: (1) if you distribute this original release of Englex, you must include all files in unmodified form; (2) if you distribute Englex files that you have modified, you must clearly state who modified them and how they differ from the originals; (3) you may not charge money for distributing Englex, in original or modified form, beyond minimal media cost without permission of the Summer Institute of Linguistics; and (4) Englex may not be used in any commercial product without permission of the Summer Institute of Linguistics.

3 Required software

Englex is of little use by itself (though you could use a word processor to search and retrieve words in the lexicon files). Englex is intended to be used with PC-KIMMO or KTEXT.

If you use Englex with PC-KIMMO, you can interactively enter words to analyze or process lists of words using the file functions. However, any word you process this way must use only the alphabetic characters declared in the rules file. For example, if you enter a capitalized word, you will get an error. Also, the basic alphabet does not include any eight-bit accented characters. Using Englex interactively with PC-KIMMO is helpful when you are editing the lexicon files. There is one oddity to be aware of: due to the way PC-KIMMO handles NULLs, some words will return several identical parses (for example, "bigger"). You should also note that Englex is optimized for recognition; you can use PC-KIMMO's generator function with Englex, but it will produce many spurious output forms.

To process text with Englex, you can use KTEXT. KTEXT handles all the problems noted above: capitals, accented characters, and identical parses. If you want to make adjustments to the way KTEXT works, simply modify the files ENGLISH.CTL and TEXTIN.CTL. See the KTEXT user's guide for details.
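
For the interactive PC-KIMMO case described above, it can be useful to screen a word list before loading it. The following Python sketch is only an illustration and is not part of Englex, PC-KIMMO, or KTEXT. It hard-codes the characters listed in section 13 (Alphabet), and the name invalid_characters is hypothetical.

```python
# Characters copied from section 13 (Alphabet) of this document.
ENGLEX_ALPHABET = set("bcdfghjklmnpqrstvwxyzaeiou'-`+0123456789")


def invalid_characters(word):
    """Return the characters of `word` that are not in the Englex alphabet.

    An empty list means the word uses only declared characters.
    """
    return [ch for ch in word if ch not in ENGLEX_ALPHABET]


if __name__ == "__main__":
    for word in ["bigger", "Alice", "1,200"]:
        bad = invalid_characters(word)
        status = "ok" if not bad else "not in alphabet: " + " ".join(bad)
        print(word, "->", status)
```

A capitalized word fails the check only because of its capital letter, so lowercasing the input first (for example with word.lower()) is one possible workaround when KTEXT is not being used.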
A third way to use Englex is to create your own application program using the PC-KIMMO parser. See appendix C of the PC-KIMMO book (Antworth 1990).

4 About PC-KIMMO and KTEXT

PC-KIMMO is a program for doing computational phonology and morphology. It is typically used to build morphological parsers for natural language processing systems. PC-KIMMO is described in the book "PC-KIMMO: a two-level processor for morphological analysis" by Evan L. Antworth, published by the Summer Institute of Linguistics (1990). The PC-KIMMO software is available for MS-DOS (IBM PCs and compatibles), Macintosh, and UNIX. The book (including software) is available for $23.00 (plus postage) from:

  International Academic Bookstore
  7500 W. Camp Wisdom Road
  Dallas, TX 75236
  U.S.A.
  phone 214/709-2404
  fax 214/709-2433

The remainder of this document assumes that you are familiar with PC-KIMMO.

PC-KIMMO was deliberately designed to be reusable. The core of PC-KIMMO is a library of functions such as load rules, load lexicon, generate, and recognize. The PC-KIMMO program supplied on the release diskette is just a user shell built around these basic functions. This shell provides an environment for developing and testing sets of rules and lexicons. Since the shell is a development environment, it has very little built-in data processing capability. But because PC-KIMMO is modular and portable, you can write your own data processing program that uses PC-KIMMO's function library.

KTEXT is an example of how to use PC-KIMMO to create a new natural language processing program. KTEXT is a text processing program that uses PC-KIMMO to do morphological parsing. See the KTEXT user's guide for more information on how to use KTEXT to process text. Note: as of December 6, 1991 the latest version of KTEXT is version 1.0.1.
The Macintosh version of KTEXT is available from:

  archive.umich.edu (141.211.164.153)
  /pub/mac/etc/linguistics/ktext094.sit.hqx

The MS-DOS version of KTEXT is available from (but see section 5 below):

  wsmr-simtel20.army.mil (192.88.110.20)
  pd1:<msdos.linguistics>ktext093.zip

or

  archive.umich.edu (141.211.164.153)
  /pub/msdos/linguistics/ktext093.zip

The UNIX version of KTEXT is available from:

  Consortium for Lexical Research, New Mexico State University
  Direct queries to lexical@nmsu.edu or lexical@nmsu.bitnet.

5 System requirements and performance

PC-KIMMO and KTEXT run on three systems:

  MS-DOS (IBM PC and compatibles)
  UNIX System V (SCO UNIX V/386 and A/UX) and 4.2 BSD UNIX
  Apple Macintosh (System 7 compatible)

Englex takes up only about 500KB of disk space (not including executables), but requires a considerable amount of internal memory. On my Macintosh SE/30 (using Multifinder under System 6) I must set the application size of PC-KIMMO or KTEXT to a minimum of 2700KB. Thus you will need at least a 4MB Macintosh to run Englex (unless you prune out a substantial number of lexical entries).

The original MS-DOS versions of PC-KIMMO and KTEXT were limited to 640KB. Obviously Englex will not run in 640KB. We have recently compiled new versions of PC-KIMMO and KTEXT for PC compatibles using the 386 processor. These versions will use all available extended/expanded memory plus virtual memory. In order to run Englex under MS-DOS, you will need a 386 machine and these new versions of the software. If they are not available from the file archives mentioned above, contact me directly.

On my Macintosh SE/30, Englex takes about 1 minute 35 seconds to load, and KTEXT processes text at an average of about two words per second. On a 33MHz 486 PC compatible, Englex takes 10 seconds to load and KTEXT averages about 10 words per second.

6 Coverage and disclaimers

Englex contains approximately 20,000 lexical entries. These entries are affixes, roots, indivisible stems and solid compounds.
Of these, there are approximately 11,000 nouns, 4,000 verbs, and 3,400 adjectives. Since Englex analyzes productive morphology, it will recognize several times this number of English words.

No claim is made for exhaustive coverage of English vocabulary. The intent was to establish a critical mass of entries that would handle a large percentage of non-technical, non-specialized English text. Rather than simply adding lists of new words, I suggest that future lexical expansion of Englex should be done by users on the basis of the textual materials they are attempting to process.

Englex attempts to account for all productive morphological structure (affixes, morphotactic constraints, word class conversion, etc.). No claim is made that it exhaustively covers everything that might be considered part of English morphology.

Although my intention was to be as complete and accurate as possible, no claim is made that Englex is inerrant. I view Englex as an on-going research project to which I now invite the general academic community to contribute. The morphological analysis of English embodied in Englex should be viewed as a set of hypotheses that are subject to falsification, correction, and refinement.

7 Test corpora

Englex was tested with several word lists (such as the UNIX spelling list). This does not mean that Englex contains all words found in those lists. Many words were judged too technical or infrequent to include in Englex.

Englex was also tested with samples of running text, including Lewis Carroll's "Alice's Adventures in Wonderland" and "Through the Looking Glass", Herman Melville's "Moby Dick", the New Testament (Authorized version), and excerpts from the UPI newswire. Again, this does not mean that all words found in those corpora are included in Englex.

8 Comparison with appendix A of the PC-KIMMO book

The PC-KIMMO release disk includes an English example which is described in appendix A of the PC-KIMMO book (Antworth 1990).
The rules file that Englex uses is very similar to the rules file described there, but a few changes have been made, such as relaxing the environment for Gemination.

One important difference is the NULL symbol. Because Englex handles digits, including 0, the NULL symbol has been changed to * (asterisk). Notice that null entries in the lexicon must also use * as the NULL symbol.

Another difference is that the s-deletion and i:y-spelling rules described in appendix A are not used in Englex. This was done to achieve better processing performance. Because deletions are computationally expensive for the recognizer function, removing the s-deletion rule resulted in nearly a 20% speed increase. Removing the i:y-spelling rule resulted in a 10% speed increase. The trade-off is that there is some loss in linguistic felicity.

The s-deletion rule deletes a possessive suffix "s" when it follows an "s" (e.g. lexical "boy+s+'s" to surface "boys'"). In order to do away with this rule, it is necessary to add the allomorph +' to the GENITIVE sublexicon (in the file english.lex). The result is that a word such as "boys'" returns the lexical form "boy+s+'" rather than "boy+s+'s"; however, the gloss string is unaffected and remains N+PL+GEN. If you prefer to use the s-deletion rule, it is located in the file english.rul after the END keyword. Simply move it into the main body of rules and comment out the +' lexical entry.

The i:y-spelling rule accounts for alternations such as "tie" and "tying". So few words exhibit this alternation, however, that it is more economical to list them in the lexicon. If you nevertheless want to use the i:y-spelling rule, it is located in the file english.rul after the END keyword.

The lexicon described in appendix A is only a small sample lexicon and is totally superseded by Englex. Note that the morphotactic structure described in appendix A bears little resemblance to Englex.
In short, be careful not to confuse the English files from the PC-KIMMO release disk with the files supplied with Englex.

9 Design philosophy

Englex represents a convergence of two disciplines: natural language processing (NLP) and linguistics. Since the presuppositions, interests, and goals of linguists and NLP researchers do not necessarily coincide, Englex is by necessity a bundle of compromises. Englex is a natural language processing (NLP) tool based on generally-accepted linguistic principles and analyses of English morphology.

The basic strategy in building an NLP system like Englex is two-pronged: first, ensure that all well-formed input is analyzed correctly, and second, incrementally refine the system so that it rejects ill-formed input. Both the linguist and NLP researcher would insist that the first goal be met (though even here the NLP researcher might be more forgiving). But with regard to the second goal, only the linguist would require that it be fully met in order for the description to be adequate. For the NLP researcher, as long as well-formed input is assured, it does not necessarily matter if the system "overrecognizes" (but see below).

For example, Englex will correctly recognize the comparative and superlative forms of adjectives such as "big, bigger, biggest". But it will also recognize the dubious form "aliver" as the comparative form of "alive". In other words, Englex underspecifies the morphotactic constraints related to adjective inflection; it assumes that all adjectives can have a comparative form, which of course is not true. In practice, we assume that forms such as "aliver" do not occur in well-formed text; thus overrecognition does little harm.

However, overrecognition is by no means innocuous; it can result in spurious parses that seriously degrade the performance of an NLP system.
For instance, consider what would happen if we relax the constraint that the comparative -er suffix only attaches to adjectives and permit it after any word. A word such as "bigger" would still be correctly parsed as a comparative adjective; but a word such as "writer" would get two parses: one where -er is correctly recognized as an agentive suffix that attaches to a verb, and another where -er is incorrectly posited as the comparative suffix. By simply encoding the constraint that the comparative suffix can only attach to adjectives, we capture the obvious and important linguistic fact that only adjectives have comparative forms and at the same time reduce the number of spurious parses the system produces.

The degree to which we refine a system like Englex depends on our purpose in using the system: to characterize precisely English morphological structure (the linguist's goal) or to process natural language texts to some acceptable degree of accuracy (the NLP researcher's goal). In Englex I have tried to steer a middle course between these purposes, but ultimately it is up to the user to determine the behavior of the system.

10 Extending, modifying, and fine-tuning the lexicon

Since Englex is a completely open system, the user can easily add more lexical entries as they are needed. The lexicon files are standard ASCII text files that can be edited with any conventional text editor (see section 11 on file formats).

The user can also change the gloss tags if this is necessary to be compatible with other software. If you do this, be sure to search all the lexicon files for instances of a particular tag.

The user can even modify the morphological analysis used by Englex. Care should be observed when doing this, however, since a small change can have unforeseen results in some other part of the lexicon system.

If you look at the file ENGLISH.LEX which contains entries for affixes, you will see that many affix entries are commented out.
This is an example of the compromise between linguistics and natural language processing. Some affix entries are commented out because they are very infrequent or unproductive; in such cases it is preferable to simply list all words with these affixes in the lexicon. Other affix entries are commented out because they result in numerous spurious parses that cannot easily be filtered out using PC-KIMMO's rather simple system of encoding morphotactic constraints. In these cases it is preferable, from the viewpoint of natural language processing, to list words using such affixes in the lexicon rather than deal with multiple parses. However, from a linguistic point of view, it might be desirable to uncomment these affixes and see what happens. The choice is up to the user.

There are other instances where the user who is primarily interested in natural language processing may want to fine-tune the lexicon by disabling certain lexical entries. For instance, the word "saw" will result in several parses: the past tense form of "see", the noun "saw", and the verb "saw" converted from the noun. Unless your text is about carpentry, it will be distracting to have three parses of such a common word as "saw" (as the past tense of "see"). Just comment out the lexical entry for the noun "saw".

11 File structure

This release of Englex contains the following files:

  english.ctl     KTEXT main control file
  engtxtin.ctl    KTEXT textin control file
  english.rul     rules file
  english.lex     main lexicon file (contains affixes and loads other files)
  noun.lex        nouns
  verb.lex        verbs
  adjectiv.lex    adjectives
  adverb.lex      adverbs
  minor.lex       prepositions, determiners, conjunctions, quantifiers,
                  demonstratives, interjections, foreign, ordinals,
                  cardinals, digits, roman numerals
  proper.lex      proper nouns
  abbrev.lex      acronyms and abbreviations

At the beginning of each file is a table of contents. In the noun, verb, and adjective files, irregular forms are listed in the first part of the file followed by regular forms.
Each lexical entry in a file is composed of three parts: lexical form, alternation, and gloss. Each entry is limited to a single line with a single tab separating the parts. For example:

  `cat <TAB> N <TAB> "N"

12 Running Englex

To run Englex interactively with PC-KIMMO, launch PC-KIMMO and issue the commands "load rules english" and "load lexicon english". You can also create a TAKE file to execute these commands automatically (see section 7.5.4 of the PC-KIMMO book). Note that if the Englex files are not in the same subdirectory as the PC-KIMMO program, you must either do a CD command to move into that directory or use pathnames before the filenames.

To run Englex with KTEXT, you must first be sure that the control files ENGLISH.CTL and TEXTIN.CTL are present and properly configured. Then launch KTEXT with the appropriate command line arguments. For instance:

  ktext -w -x english -i alice.txt -o alice.ana -l alice.log

See the KTEXT user's guide for details.

13 Alphabet

The alphabet of word-forming characters is declared in the file ENGLISH.RUL. It consists of these characters:

  b c d f g h j k l m n p q r s t v w x y z
  a e i o u
  ' - ` +
  0 1 2 3 4 5 6 7 8 9

Only these characters can be used in the lexical form part of a lexical entry. The gloss part of a lexical entry is not restricted to these characters. Capitalization, accented characters, and punctuation in running text can be handled by KTEXT.

14 British spelling

Some British spelling variants have been included, such as colour, recognise, centre, etc., but this has not been done consistently or exhaustively. I apologi[z/s]e for this American bias.

15 Archaic forms

Archaic verb endings are found in the sublexicon V_INFL in the file ENGLISH.LEX. To enable them, remove the comment character before each line. The file VERB.LEX also contains various archaic verb forms and the file MINOR.LEX contains archaic pronouns.

16 Stress marks

Word stress in full words is indicated with the back quote (grave accent) `.
Be careful not to confuse it with the apostrophe; for instance, the lexical form of the word "woman's" is written `woman+'s. The stress marks were placed according to my own intuition and the authority of Webster's Ninth New Collegiate Dictionary. Notice that even monosyllabic words require a stress mark because the Gemination rule crucially refers to it (see the file ENGLISH.RUL and appendix A of the PC-KIMMO book [Antworth 1990]).

17 Accented characters (diacritics)

Englex's alphabet does not include accented characters (characters with diacritics). For instance, the word "naivete" is usually spelled with a diaeresis over the "i" and an acute accent over the final "e"; but the lexical entry for "naivete" is spelled with no diacritics. If your input text contains accented characters, they must be converted to corresponding unaccented characters.

The control file TEXTIN.CTL for KTEXT can be configured by the user to convert single eight-bit accented characters (either Macintosh or IBM extended character set) to seven-bit characters. Edit this file to make changes or additions. If your input text contains digraphs to represent accented characters (for instance, na:ivet'e), you can convert these to single characters using consistent change commands in the control file TEXTIN.CTL. See the KTEXT user's guide for details.

Another way to handle eight-bit accented characters is to add them to the alphabet in ENGLISH.RUL and use them in the lexical entries. This is a less portable solution.

18 Inflection and derivation

Morphological processes are traditionally divided into two types: inflection and derivation. Englex handles both types, though it does not formally distinguish them.
Here are some examples of how Englex glosses inflectional morphology:

  cats       `cat+s       N+PL
  singing    `sing+ing    V+PRG
  sang       `sang        V.PST

Here is how Englex glosses a derivationally complex word:

  computerization    com`pute+er+ize+ation    V+NR19+VR6+NR23

Englex contains an entry for the verb root com`pute and entries for the suffixes +er, +ize, and +ation; all words based on that root (such as computer, computerize, etc.) are recognized by decomposing them into their constituent parts.

In addition to listing roots, Englex must also list derived forms that cannot be decomposed due to phonological or morphological irregularity. For example, the word "reception" is listed in the lexicon with the gloss string V(re`ceive)+NR23.

Many regularly derived words in English have acquired specialized meanings. For example, the word "business" is a regular nominal derivation of the adjective "busy", but no longer retains its transparent meaning. Many such words have been given their own lexical entries to reflect this fact. Thus "business" will return two parses:

  `business     N
  `busy+ness    AJ+NR27

Englex may reveal relations among words that you were not aware of. For example, I was surprised to find that Englex analyzed the word "amplify" as the adjective "ample" plus the verbalizing suffix -ify. Even though this formation is perfectly regular and analogous to "simple, simplify", I had never consciously made the connection.

It is not easy to draw a sharp line between productive, synchronic formations and static, diachronic formations. For example, the adjective "resilient" is actually derived from the verb "resile". Even though the semantic relation is perfectly transparent, the fact that "resile" is no longer in currency puts this analysis more in the arena of etymology. I have probably not been entirely consistent in handling such cases.

19 Multiple senses and homonyms

Englex is intended as a parsing lexicon, not a full dictionary. In general, multiple senses are not distinguished.
For example, there is only one entry for the adjective "fair", ignoring the fact that it has several senses (including 'not stormy' and 'impartial'). However, the noun "fair" meaning 'a festival' is considered a homonym, and because it is a different word class it is given its own entry in the noun sublexicon.

There are a few instances of homonyms of the same word class; for instance, "bat" in the sense 'instrument for hitting' and "bat" in the sense 'flying mammal'. Because these two words have different derivational possibilities (the first can be converted to a verb while the second cannot), they are given separate lexical entries. Their glosses are distinguished as "bat1" and "bat2". I have no doubt missed other such cases.

20 Word class conversion

Many words in English belong to more than one word class; for instance, "hit" is either verb or noun and "calm" is either adjective or verb. Since in such cases the word appears to have the same sense but just differs in word class, we can say that the word has changed from one class to another. The direction of conversion is distinctive. Examples of verb to noun conversion include "love", "laugh", "answer", "cover", and "walk", while examples of noun to verb conversion include "bottle", "grease", "peel", and "father".

Englex handles conversion by permitting special continuations such as V-to-N and N-to-V (see the alternations and sublexicons by these names in the file ENGLISH.LEX). Given a word such as "talk" that has the continuation V-to-N, Englex will return two parses:

  V(`talk)
  V(`talk).NR0

where the tag NR0 stands for nominalizer zero.

When adding new lexical entries, you should take the possibility of conversion into consideration. For example, say you find that Englex fails to recognize the inflected verb "partied". Before adding "party" to the verb lexicon, first check to see if "party" already exists in the noun lexicon. If it does, then you need only to change its continuation from N to N-to-V.
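
The "party" edit described above can be made by hand in any text editor. For readers who maintain many such changes, here is a minimal Python sketch of the same edit. It is not part of Englex; it assumes only the entry layout described in section 11 (lexical form, alternation, and gloss separated by single tabs). The function name set_continuation and the sample entries are hypothetical.

```python
def set_continuation(entry_lines, lexical_form, old="N", new="N-to-V"):
    """Return a copy of entry_lines in which the entry for lexical_form
    has its continuation (second tab-separated field) changed from old to new.

    Lines that are not three-field entries (comments, blank lines, headers)
    are passed through untouched.
    """
    updated = []
    for line in entry_lines:
        newline = "\n" if line.endswith("\n") else ""
        fields = line.rstrip("\n").split("\t")
        if len(fields) == 3 and fields[0] == lexical_form and fields[1] == old:
            fields[1] = new
            line = "\t".join(fields) + newline
        updated.append(line)
    return updated


if __name__ == "__main__":
    # Hypothetical sample entries; real entries live in NOUN.LEX.
    sample = [
        "`party\tN\t\"N(`party)\"\n",
        "`cat\tN\t\"N(`cat)\"\n",
    ]
    for line in set_continuation(sample, "`party"):
        print(line, end="")
```

Because the function returns a new list rather than writing a file, the result can be inspected before the lexicon file is overwritten.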
For a discussion of conversion in English, see Quirk et al. 1972:1009ff.

21 Output data structures

Given the word "cried" as input, Englex will return as output two pieces of data: the lexical (underlying) form `cry+ed and the gloss string V+PTC (a plus sign indicates a morpheme boundary; see below for a list of gloss tags). If you use KTEXT, the output file will contain a record for each word; for example (see the KTEXT user's guide for details):

  \a V+PTC
  \d `cry+ed
  \w cried

There will not necessarily be an equal number of morpheme break symbols between the lexical form and the gloss string; for example:

  \a V(re`ceive)+NR23
  \d re`ception
  \w reception

This shows that even though the form "reception" can only be partially segmented (-ion is a regular suffix, but there is no stem "recept"), it nevertheless corresponds to a morphologically regular formation of stem plus suffix (compare "digress" and "digression").

Besides the plus sign, a period (dot) is also used to indicate a special type of morpheme, namely an irregular or zero alternant of an affix. For example, while a regular plural noun such as "cats" is glossed as N(`cat)+PL, the irregular plural "mice" is glossed as N(`mouse).PL. Word class conversion is also handled this way; thus the nominalized verb "arrival" is glossed as V(ar`rive)+NR22 while "return" is glossed V(re`turn).NR0.

It is important to understand that Englex glosses morphemes, not whole words. Of course when a word is composed of only a single morpheme, this distinction is moot; thus the word "large" is glossed as AJ, which can be interpreted as either a morpheme-level or a word-level gloss. Now consider the multimorphemic word "enlargement", which is glossed as VR1+AJ(`large)+NR25. This is a string of morpheme glosses which maps directly to the parts of the lexical form en+large+ment. But there is nothing in the gloss string that tells us overtly whether the class of the whole word is adjective, verb, or noun.
This level of analysis is beyond the feasible scope of PC-KIMMO and Englex. However, it should not be difficult to write an algorithm to infer word class from a gloss string such as VR1+AJ(`large)+NR25. It is a well-known fact of English morphology that the rightmost suffix determines the word class of the entire word. Such an algorithm could be applied to the output structures provided by KTEXT.

Another point to notice here is that a gloss string has a strictly linear structure; that is, it does not have any internal constituent structure. Even though it can be argued that a word such as "enlargement" has a bracketed structure such as [[en+[large]]+ment], such tree-like structures are flattened out in the gloss string Englex produces.

A corollary to the fact that Englex glosses morphemes rather than words is that it glosses only what is phonologically present in the input word. For example, while the word "dogs" will return the gloss N(`dog)+PL, the word "dog" will return only the gloss N(`dog); that is, it does *not* return something like N(`dog).SG to indicate that it is a singular noun. Since singular number is unmarked in English, Englex does not gloss it; plural number is marked, so Englex returns a gloss when it finds it. This shows that Englex is perhaps better understood as a recognizer than a parser, since it does not return an overt set of inflectional categories for each word. As was suggested above, such information can be obtained by postprocessing Englex's output. (NOTE: I have not been entirely consistent with this policy. See the first part of the file NOUN.LEX where I have listed zero plural nouns, nouns with equivocal number, unmarked plural nouns, and so on. Collecting lists of such words appealed to me as a linguist.)

22 Gloss tags

Here is a list of all the gloss tags used in Englex.
  N      noun
  PN     proper noun
  V      verb
  AUX    auxiliary
  AJ     adjective
  AV     adverb
  PP     preposition
  DT     determiner
  CJ     conjunction
  QN     quantifier
  DEM    demonstrative
  PR     pronoun
  IJ     interjection
  FN     foreign
  CD     cardinal
  OD     ordinal
  1      first person
  2      second person
  3      third person
  SG     singular
  PL     plural
  GEN    genitive
  CMP    comparative
  SPR    superlative
  PST    past
  PTC    participle
  PRG    progressive
  NR     nominalizer
  VR     verbalizer
  AJR    adjectivizer
  AVR    adverbizer
  NEG    negative
  PEJ    pejorative
  DEG    degree
  ORI    orientation
  LOC    location
  NUM    number
  REV    reversive
  ORD    time and order
  NEO    neo-classical

(The last nine tags listed above were suggested by Quirk et al. 1972:981ff.)

Affixes with the same tag are differentiated by numbering; thus the nominalizing suffixes are tagged NR1, NR2, etc. Variants of an affix are further distinguished with letters; for instance, NR23a, NR23b, etc.

Some words are given multiple tags. For instance the word "fast" is tagged as AJ/AV because it can function as either an adjective or an adverb. Alternatively, the word "fast" could be given two lexical entries, one in the adjective sublexicon and another in the adverb sublexicon. The choice depends on how you want to handle multiple parses.

23 Compounds

There are three types of orthographic compounds in English (see Quirk et al. 1972:1019):

  solid, e.g. bedroom
  hyphenated, e.g. tax-free
  open, e.g. rose bush

Open compounds are not handled by Englex at all. If you want to treat open compounds as single lexical items, you must preprocess the text to join them as either hyphenated or solid compounds (for instance, replace "rose bush" with "rose-bush" or "rosebush" and put these forms in the lexicon).

Englex can handle hyphenated compounds. If it recognizes a whole word and then encounters a hyphen, it will recurse and attempt to recognize the part after the hyphen as another word. It will even handle phrasal compounds this way, such as "his come-what-may attitude".
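
The preprocessing step mentioned above for open compounds (replacing "rose bush" with "rose-bush" or "rosebush") can be done with almost any scripting tool. The following Python sketch is one possible approach and is not supplied with Englex or KTEXT. The function name join_open_compounds is hypothetical, and the list of compounds to join is something you would have to compile yourself.

```python
import re


def join_open_compounds(text, compounds, joiner="-"):
    """Join each listed open compound in `text` with `joiner`.

    Use joiner="-" for hyphenated output or joiner="" for solid output.
    Matching ignores case and keeps the capitalization found in the text.
    """
    # Longest compounds first, so a three-word compound is joined as a unit
    # before any two-word compound contained in it.
    for compound in sorted(compounds, key=len, reverse=True):
        words = compound.split()
        pattern = r"\b" + r"\s+".join(re.escape(w) for w in words) + r"\b"
        text = re.sub(
            pattern,
            lambda m: joiner.join(m.group(0).split()),
            text,
            flags=re.IGNORECASE,
        )
    return text


if __name__ == "__main__":
    print(join_open_compounds("A rose bush grew by the gate.", ["rose bush"]))
    print(join_open_compounds("A rose bush grew by the gate.", ["rose bush"], joiner=""))
```

Note that this sketch matches only the exact words listed; an inflected form such as "rose bushes" is left alone unless it is added to the list. Whichever joined form you choose must also be present in the lexicon, as noted above.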
If you do not want to decompose hyphenated compounds, find the End sublexicon near the bottom of the file ENGLISH.LEX and comment out the hyphen entry.

Englex treats solid compounds as if they were indivisible stems; they are simply listed in the lexicon. It should be possible to cause Englex to decompose solid compounds by using a null lexical entry in the End sublexicon. However, I suspect that a large number of spurious parses would result.

There are three types of compounds that have received special treatment. First is the "object-verb-er" type such as "lawnmower" and "sightseer". Those which are usually written as solid compounds have been included in the lexicon with entries like this:

  `sightseer N "N(`sight)+V(`see)+NR19"

Second is the "adjective-noun-ed" type such as "red-haired" and "long-legged". Some compounds of this type are found in Englex with entries like this:

  clear`headed AJ "AJ(`clear)+N(`head)+AJR8"

Third are the "man/men" and "woman/women" compounds such as "businessman/men/woman/women". Because there are so many of these, I have created suffix entries for +man, +men, +woman, and +women. Some must still be listed in the lexicon, such as "madman" (built on an adjective rather than a noun) and "klansman" (rather than "*klanman"). See the section on man/woman compounds in the file NOUN.LEX.

24 Clitics

Clitics are distinguished from affixes. Affixes are constrained in what word classes they can attach to; for instance, the plural suffix +s can only attach to a noun. Clitics, however, are syntactically bound to phrases but phonologically bound to the last word of the phrase; thus they are not constrained by the words they attach to. For instance, the possessive clitic +'s normally attaches to nouns as in "the man's hat", but can attach to other word classes such as adjectives in a phrase such as "the president elect's hat".

In Englex clitics are placed in the sublexicon CLITICS found near the end of the file ENGLISH.LEX.
These include +'s for "is", +'s for "has", +'ll for "will" and so on. There is one exception: because of the frequency of the possessive clitic, it is placed in the sublexicon GENITIVE, which limits its occurrence to nouns. To change this behavior, simply move it to the CLITICS sublexicon.

25 Participles

The -ed form of a verb is called a past participle (as in "the surprised children") and the -ing form is a present participle or gerund (as in "the surprising children"). Englex does not give any overt indication that forms such as these could be either finite verbs or participles, since to do so would result in multiple parses for every -ed and -ing verb form in English. Inferring the possibility that a verb could be a participle is left to postprocessing. However, if an -ed or -ing form is followed by a derivational suffix such as -ly or -ness, then Englex will convert the verb to an adjective. For instance, "surprising" will be glossed simply as V(sur`prise)+PRG, but "surprisingly" will be glossed as V(sur`prise)+PRG.AJR0+AVR1. See the sublexicon PTC_SUFFIX in the file ENGLISH.LEX.

26 Special cases

There are a couple of classes of words that receive some special treatment in Englex. First are words that end in -ology and other y-final foreign suffixes. The problem comes in handling the derived forms of a word such as "biology", for instance "biological" and "biologist", where the final y is absent. It is not feasible to handle this with a general phonological rule, since it is morphologically conditioned. Instead, I have treated the final y as a suffix. This means that "biology" is represented in the lexicon as "biolog", which must take a suffix in order to be well-formed. These special "Final_y" words are found in the first part of the file NOUN.LEX.

Second are adjectives that end in -ic. Some of these words also have an adjective form ending in -al, for instance "acoustic" and "acoustical".
Others do not have an -al adjective form ("atomic" but not "*atomical") but require -al before adding the adverbial suffix -ly ("atomically"). These -ic adjectives are given the special continuation AJR_ic (see the file ADJECTIV.LEX).

27 Names and abbreviations

The file PROPER.LEX contains proper names and related words. There is a fairly long list of geographical place names, but virtually no first and last names of people (with the exception of some historical figures). The intent was to provide a place where you can add names that occur in the text you are processing.

The file ABBREV.LEX contains acronyms and abbreviations. The entries mainly come from text that I processed. Add your own entries as needed.

28 Digits and Roman numerals

Englex will handle numbers such as 2, 125, 1984, etc. See the sublexicon DIGITS in the file MINOR.LEX. Unfortunately, neither PC-KIMMO nor KTEXT can correctly handle numbers that contain commas or decimal points (such as 1,200 or 5.25). This is because the comma and decimal point are elsewhere used as punctuation and thus cannot also serve as alphabetic characters. Note that KTEXT will not drop commas or decimal points; it will simply save them in a punctuation field. Thus it will treat 1,200 as two "words" separated by a comma.

Englex will also handle Roman numerals. See the sublexicon ROMAN in the file MINOR.LEX. Notice that the entry for the numeral "i" has been commented out to prevent ambiguity with the first person singular personal pronoun.

29 Preprocessing text

English orthography is notoriously underspecified. For instance, capital letters are used both for proper names and to begin sentences; periods are used both after abbreviations and to end sentences; a hyphen can be used in a compound word or to form a dash; and the character ' is often used both as a single quote mark and as an apostrophe. Such ambiguities may require you to preprocess your text.
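As a rough illustration of the kind of preprocessing meant here, the following Python sketch resolves two of the ambiguities listed above: a ' that is not an apostrophe, and a pair of hyphens used as a dash. It is not part of Englex or KTEXT. The function name preprocess and the two substitute strings are assumptions chosen for this example; any nonalphabetic characters would do. The heuristic is also deliberately simple: it keeps a ' only when it has a letter on both sides, so a plural possessive such as "girls'" would be mistaken for a closing quote mark.

```python
import re

# Assumed substitutes; they only need to be nonalphabetic in your setup.
QUOTE_SUBSTITUTE = '"'
DASH_SUBSTITUTE = "=="


def preprocess(text):
    """Separate dashes from hyphens and quote marks from apostrophes."""
    # Two hyphens form a dash, not a compound, so replace the pair.
    text = text.replace("--", DASH_SUBSTITUTE)
    # A ' with a letter on both sides is kept as an apostrophe (girl's);
    # any other ' is treated as a single quote mark and replaced.
    text = re.sub(r"(?<![A-Za-z])'|'(?![A-Za-z])", QUOTE_SUBSTITUTE, text)
    return text


print(preprocess("'The girl's,' she spoke--fancy that"))
```

The same substitutions could equally be made with a stream editor; Python is used here only for readability.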
For example, say your text uses the character ' both as a single quote and as an apostrophe (as does the Project Gutenberg version of "Alice's Adventures in Wonderland"). Since you want to treat forms such as "girl's" as a single word, apostrophe must be declared as an alphabetic (word forming) character. However, KTEXT will now fail on any word that is preceded or followed by a single quote mark. The only solution is to consistently change all single quote marks to some other nonalphabetic character (such as " or < or eight-bit curly quotes).

Similarly, "Alice" uses two hyphens to indicate a dash as in "...as she spoke--fancy curtseying as...". If hyphen is used as a word forming character in compounds, then "spoke--fancy" will be treated as a single word, resulting in failure. The solution is to consistently change two hyphens to some other nonalphabetic characters (such as two equals signs or an eight-bit dash character).

Section 23 mentions preprocessing text in order to join open compounds. A similar problem occurs with foreign expressions and names, for instance ad hoc, faux pas, El Salvador, Los Angeles. Englex already contains these forms joined with hyphens: ad-hoc, faux-pas, el-salvador, los-angeles (see the sublexicon FOREIGN in the file MINOR.LEX and the section of place names in PROPER.LEX). Use a text-processing tool such as SED or AWK to join these forms before parsing the text with Englex.

30 Reporting defects and submitting enhancements

If you find errors in Englex, please report them to me at the address below. If you make enhancements to Englex that you think others would benefit from, I encourage you to send these to me also. If enough interest develops, I am willing to redistribute such enhancements to other users. If you want to be on a standing list to receive information on future development of Englex, please send me your e-mail address.

You can contact me at the following mailing address, e-mail address, or phone number.

  Evan Antworth                    | Internet: evan@sil.org
  Academic Computing Department    | UUCP: ...!uunet!convex!txsil!evan
  Summer Institute of Linguistics  | phone: 214/709-2418
  7500 W. Camp Wisdom Road         | fax: 214/709-3387
  Dallas, TX 75236                 |

31 References

Antworth, Evan L. 1990. PC-KIMMO: a two-level processor for morphological analysis. Occasional Publications in Academic Computing No. 16. Summer Institute of Linguistics.

Antworth, Evan L. and Stephen R. McConnel. 1991. KTEXT User's Guide. On-line documentation.

Bauer, Laurie. 1983. English word-formation. Cambridge University Press.

Quirk, Randolph, Sidney Greenbaum, Geoffrey Leech and Jan Svartvik. 1972. A grammar of contemporary English. Longman.

Webster's Ninth New Collegiate Dictionary. 1984. Merriam-Webster Inc.