-
- ENGLEX: an English lexicon for PC-KIMMO
-
- version 1.0
- November 26, 1991
- Documentation updated: 20-Jan-92
-
- Evan L. Antworth
- Summer Institute of Linguistics, Dallas, TX
- evan@sil.org
-
- Copyright (C) 1991, Summer Institute of Linguistics, Inc.
-
- Contents
- 1 What is Englex?
- 2 Copyright and fair use policy
- 3 Required software
- 4 About PC-KIMMO and KTEXT
- 5 System requirements and performance
- 6 Coverage and disclaimers
- 7 Test corpora
- 8 Comparison with appendix A of the PC-KIMMO book
- 9 Design philosophy
- 10 Extending, modifying, and fine-tuning the lexicon
- 11 File structure
- 12 Running Englex
- 13 Alphabet
- 14 British spelling
- 15 Archaic forms
- 16 Stress marks
- 17 Accented characters
- 18 Inflection and derivation
- 19 Multiple senses and homonyms
- 20 Word class conversion
- 21 Output data structures
- 22 Gloss tags
- 23 Compounds
- 24 Clitics
- 25 Participles
- 26 Special cases
- 27 Names and abbreviations
- 28 Digits and Roman numerals
- 29 Preprocessing text
- 30 Reporting defects and submitting enhancements
- 31 References
-
- 1 What is Englex?
-
- Englex is a morphological parsing lexicon of English. It uses the standard
- orthography for English. It is intended for use with PC-KIMMO (or programs
- that use the PC-KIMMO parser, such as KTEXT). With such software and Englex,
- you can morphologically parse English words and text. Practical applications
- include morphologically preprocessing text for a syntactic parser and
- producing morphologically tagged text. Englex can also be used to explore
- English morphological structure.
-
- 2 Copyright and fair use policy
-
- All of the files in this release of Englex are copyrighted by the Summer
- Institute of Linguistics (Academic Computing Department, 7500 W. Camp Wisdom
- Road, Dallas, TX 75236, U.S.A.). Permission is hereby granted to the user to
- copy, use, modify, and distribute the Englex files under the following
- conditions:
- (1) if you distribute this original release of Englex, you must
- include all files in unmodified form;
- (2) if you distribute Englex files that you have modified, you must
- clearly state who modified them and how they differ from the originals;
- (3) you may not charge money for distributing Englex, in original or
- modified form, beyond minimal media cost without permission of the Summer
- Institute of Linguistics; and
- (4) Englex may not be used in any commercial product without
- permission of the Summer Institute of Linguistics.
-
- 3 Required software
-
- Englex is of little use by itself (though you could use a word processor to
- search and retrieve words in the lexicon files). Englex is intended to be
- used with PC-KIMMO or KTEXT. If you use Englex with PC-KIMMO, you can
- interactively enter words to analyze or process lists of words using the
- file functions. However, any word you process this way must use only the
- alphabetic characters declared in the rules file. For example, if you enter
- a capitalized word, you will get an error. Also, the basic alphabet does not
- include any eight-bit accented characters. Using Englex interactively with
- PC-KIMMO is helpful when you are editing the lexicon files. There is one
- oddity to be aware of: due to the way PC-KIMMO handles NULLs, some words
- will return several identical parses (for example, "bigger"). You should
- also note that Englex is optimized for recognition; you can use PC-KIMMO's
- generator function with Englex, but it will produce many spurious output
- forms.
-
- To process text with Englex, you can use KTEXT. KTEXT handles all the
- problems noted above: capitals, accented characters, and identical parses.
- If you want to make adjustments to the way KTEXT works, simply modify the
- files ENGLISH.CTL and TEXTIN.CTL. See the KTEXT user's guide for details.
-
- A third way to use Englex is to create your own application program using
- the PC-KIMMO parser. See appendix C of the PC-KIMMO book (Antworth 1990).
-
- 4 About PC-KIMMO and KTEXT
-
- PC-KIMMO is a program for doing computational phonology and morphology. It
- is typically used to build morphological parsers for natural language
- processing systems. PC-KIMMO is described in the book "PC-KIMMO: a two-level
- processor for morphological analysis" by Evan L. Antworth, published by the
- Summer Institute of Linguistics (1990). The PC-KIMMO software is available
- for MS-DOS (IBM PCs and compatibles), Macintosh, and UNIX. The book
- (including software) is available for $23.00 (plus postage) from:
-
- International Academic Bookstore
- 7500 W. Camp Wisdom Road
- Dallas, TX 75236
- U.S.A.
- phone 214/709-2404
- fax 214/709-2433
-
- The remainder of this document assumes that you are familiar with PC-KIMMO.
-
- PC-KIMMO was deliberately designed to be reusable. The core of PC-KIMMO is a
- library of functions such as load rules, load lexicon, generate, and
- recognize. The PC-KIMMO program supplied on the release diskette is just a
- user shell built around these basic functions. This shell provides an
- environment for developing and testing sets of rules and lexicons. Since the
- shell is a development environment, it has very little built-in data
- processing capability. But because PC-KIMMO is modular and portable, you can
- write your own data processing program that uses PC-KIMMO's function
- library.
-
- KTEXT is an example of how to use PC-KIMMO to create a new natural language
- processing program. KTEXT is a text processing program that uses PC-KIMMO to
- do morphological parsing. See the KTEXT user's guide for more information on
- how to use KTEXT to process text.
-
- Note: as of December 6, 1991 the latest version of KTEXT is version 1.0.1.
-
- The Macintosh version of KTEXT is available from:
-
- archive.umich.edu (141.211.164.153)
- /pub/mac/etc/linguistics/ktext094.sit.hqx
-
- The MS-DOS version of KTEXT is available from (but see section 5 below):
-
- wsmr-simtel20.army.mil (192.88.110.20)
- pd1:<msdos.linguistics>ktext093.zip
-
- or
-
- archive.umich.edu (141.211.164.153)
- /pub/msdos/linguistics/ktext093.zip
-
- The UNIX version of KTEXT is available from:
-
- Consortium for Lexical Research, New Mexico State University
- Direct queries to lexical@nmsu.edu or lexical@nmsu.bitnet.
-
- 5 System requirements and performance
-
- PC-KIMMO and KTEXT run on three systems:
-
- MS-DOS (IBM PC and compatibles)
- UNIX System V (SCO UNIX V/386 and A/UX) and 4.2 BSD UNIX
- Apple Macintosh (System 7 compatible)
-
- Englex takes up only about 500KB of disk space (not including executables),
- but requires a considerable amount of internal memory. On my Macintosh SE/30
- (using MultiFinder under System 6) I must set the application size of
- PC-KIMMO or KTEXT to a minimum of 2700KB. Thus you will need at least a 4MB
- Macintosh to run Englex (unless you prune out a substantial number of
- lexical entries).
-
- The original MS-DOS versions of PC-KIMMO and KTEXT were limited to 640KB.
- Obviously Englex will not run in 640KB. We have recently compiled new
- versions of PC-KIMMO and KTEXT for PC compatibles using the 386 processor.
- These versions will use all available extended/expanded memory plus virtual
- memory. In order to run Englex under MS-DOS, you will need a 386 machine and
- these new versions of the software. If they are not available from the file
- archives mentioned above, contact me directly.
-
- On my Macintosh SE/30, Englex takes about 1 minute 35 seconds to load. KTEXT
- processes text at about two words per second. On a 33MHz 486 PC
- compatible, Englex takes 10 seconds to load and KTEXT averages about 10
- words per second.
-
- 6 Coverage and disclaimers
-
- Englex contains approximately 20,000 lexical entries. These entries are
- affixes, roots, indivisible stems and solid compounds. Of these, there are
- approximately 11,000 nouns, 4,000 verbs, and 3,400 adjectives. Since Englex
- analyzes productive morphology, it will recognize several times this number
- of English words. No claim is made for exhaustive coverage of English
- vocabulary. The intent was to establish a critical mass of entries that
- would handle a large percentage of non-technical, non-specialized English
- text. Rather than simply adding lists of new words, I suggest that future
- lexical expansion of Englex should be done by users on the basis of the
- textual materials they are attempting to process.
-
- Englex attempts to account for all productive morphological structure
- (affixes, morphotactic constraints, word class conversion, etc.). No claim
- is made that it exhaustively covers everything that might be considered part
- of English morphology.
-
- Although my intention was to be as complete and accurate as possible, no
- claim is made that Englex is inerrant. I view Englex as an on-going research
- project to which I now invite the general academic community to contribute.
- The morphological analysis of English embodied in Englex should be viewed as
- a set of hypotheses that are subject to falsification, correction, and
- refinement.
-
- 7 Test corpora
-
- Englex was tested with several word lists (such as the UNIX spelling list).
- This does not mean that Englex contains all words found in those lists. Many
- words were judged too technical or infrequent to include in Englex.
-
- Englex was also tested with samples of running text, including Lewis
- Carroll's "Alice's Adventures in Wonderland" and "Through the Looking
- Glass", Herman Melville's "Moby Dick", the New Testament (Authorized
- version), and excerpts from the UPI newswire. Again, this does not mean that
- all words found in those corpora are included in Englex.
-
- 8 Comparison with appendix A of the PC-KIMMO book
-
- The PC-KIMMO release disk includes an English example which is described in
- appendix A of the PC-KIMMO book (Antworth 1990). The rules file that Englex
- uses is very similar to the rules file described there, but a few changes
- have been made, such as relaxing the environment for Gemination.
-
- One important difference is the NULL symbol. Because Englex handles digits,
- including 0, the NULL symbol has been changed to * (asterisk). Notice that
- null entries in the lexicon must also use * as the NULL symbol.
-
- Another difference is that the s-deletion and i:y-spelling rules described
- in appendix A are not used in Englex. This was done to achieve better
- processing performance. Because deletions are computationally expensive for
- the recognizer function, removing the s-deletion rule resulted in nearly a
- 20% speed increase. Removing the i:y-spelling rule resulted in a 10% speed
- increase. The trade-off is that there is some loss in linguistic felicity.
- The s-deletion rule deletes a possessive suffix "s" when it follows an "s"
- (e.g. lexical "boy+s+'s" to surface "boys'"). In order to do away with this
- rule, it is necessary to add the allomorph +' to the GENITIVE sublexicon (in
- the file english.lex). The result is that a word such as "boys'" returns the
- lexical form "boy+s+'" rather than "boy+s+'s"; however, the gloss string is
- unaffected and remains N+PL+GEN. If you prefer to use the s-deletion rule,
- it is located in the file english.rul after the END keyword. Simply move it
- into the main body of rules and comment out the +' lexical entry.
-
- The i:y-spelling rule accounts for alternations such as "tie" and "tying".
- However, there is such a small number of words that exhibit this alternation
- that it is more economical to list them in the lexicon. If you do want
- to use the i:y-spelling rule, it is located in the file english.rul after
- the END keyword.
-
- The lexicon described in appendix A is only a small sample lexicon and is
- totally superseded by Englex. Note that the morphotactic structure described
- in appendix A bears little resemblance to Englex.
-
- In short, be careful not to confuse the English files from the PC-KIMMO
- release disk with the files supplied with Englex.
-
- 9 Design philosophy
-
- Englex represents a convergence of two disciplines: natural language
- processing (NLP) and linguistics. Since the presuppositions, interests, and
- goals of linguists and NLP researchers do not necessarily coincide, Englex
- is by necessity a bundle of compromises.
-
- Englex is a natural language processing (NLP) tool based on generally-accepted
- linguistic principles and analyses of English morphology. The basic strategy
- in building an NLP system like Englex is two-pronged: first, ensure that all
- well-formed input is analyzed correctly, and second, incrementally refine
- the system so that it rejects ill-formed input. Both the linguist and NLP
- researcher would insist that the first goal be met (though even here the NLP
- researcher might be more forgiving). But with regard to the second goal,
- only the linguist would require that it be fully met in order for the
- description to be adequate. For the NLP researcher, as long as well-formed
- input is assured, it does not necessarily matter if the system
- "overrecognizes" (but see below).
-
- For example, Englex will correctly recognize the comparative and superlative
- forms of adjectives such as "big, bigger, biggest". But it will also
- recognize the dubious form "aliver" as the comparative form of "alive". In
- other words, Englex underspecifies the morphotactic constraints related to
- adjective inflection; it assumes that all adjectives can have a comparative
- form, which of course is not true. In practice, we assume that forms such as
- "aliver" do not occur in well-formed text; thus overrecognition does little
- harm.
-
- However, overrecognition is by no means innocuous; it can result in spurious
- parses that seriously degrade the performance of an NLP system. For
- instance, consider what would happen if we relax the constraint that the
- comparative -er suffix only attaches to adjectives and permit it after any
- word. A word such as "bigger" would still be correctly parsed as a
- comparative adjective; but a word such as "writer" would get two parses: one
- where -er is correctly recognized as an agentive suffix that attaches to a
- verb, and another where -er is incorrectly posited as the comparative
- suffix. By simply encoding the constraint that the comparative suffix can
- only attach to adjectives, we capture the obvious and important linguistic
- fact that only adjectives have comparative forms and at the same time reduce
- the number of spurious parses the system produces.
-
- The degree to which we refine a system like Englex depends on our purpose in
- using the system: to characterize precisely English morphological structure
- (the linguist's goal) or to process natural language texts to some
- acceptable degree of accuracy (the NLP researcher's goal). In Englex I have
- tried to steer a middle course between these purposes, but ultimately it is
- up to the user to determine the behavior of the system.
-
- 10 Extending, modifying, and fine-tuning the lexicon
-
- Since Englex is a completely open system, the user can easily add more
- lexical entries as they are needed. The lexicon files are standard ASCII
- text files that can be edited with any conventional text editor (see section
- 11 on file structure).
-
- The user can also change the gloss tags if this is necessary to be
- compatible with other software. If you do this, be sure to search all the
- lexicon files for instances of a particular tag.
-
- The user can even modify the morphological analysis used by Englex. Care
- should be observed when doing this, however, since a small change can have
- unforeseen results in some other part of the lexicon system.
-
- If you look at the file ENGLISH.LEX which contains entries for affixes, you
- will see that many affix entries are commented out. This is an example of
- the compromise between linguistics and natural language processing. Some
- affix entries are commented out because they are very infrequent or
- unproductive; in such cases it is preferable to simply list all words with
- these affixes in the lexicon. Other affix entries are commented out because
- they result in numerous spurious parses that cannot easily be filtered out
- using PC-KIMMO's rather simple system of encoding morphotactic constraints.
- In these cases it is preferable, from the viewpoint of natural language
- processing, to list words using such affixes in the lexicon rather than deal
- with multiple parses. However, from a linguistic point of view, it might be
- desirable to uncomment these affixes and see what happens. The choice is up
- to the user.
-
- There are other instances where the user who is primarily interested in
- natural language processing may want to fine-tune the lexicon by disabling
- certain lexical entries. For instance, the word "saw" will result in several
- parses: the past tense form of "see", the noun "saw", and the verb "saw"
- converted from the noun. Unless your text is about carpentry, it will be
- distracting to have three parses of such a common word as "saw" (as the past
- tense of "see"). Just comment out the lexical entry for the noun "saw".
-
- 11 File structure
-
- This release of Englex includes the following files:
-
- english.ctl     KTEXT main control file
- engtxtin.ctl    KTEXT textin control file
- english.rul     rules file
- english.lex     main lexicon file (contains affixes and loads other files)
- noun.lex        nouns
- verb.lex        verbs
- adjectiv.lex    adjectives
- adverb.lex      adverbs
- minor.lex       prepositions, determiners, conjunctions, quantifiers,
-                 demonstratives, interjections, foreign, ordinals,
-                 cardinals, digits, Roman numerals
- proper.lex      proper nouns
- abbrev.lex      acronyms and abbreviations
-
- At the beginning of each file is a table of contents. In the noun, verb, and
- adjective files, irregular forms are listed in the first part of the file
- followed by regular forms.
-
- Each lexical entry in a file is composed of three parts: lexical form,
- alternation, and gloss. Each entry is limited to a single line with a single
- tab separating the parts. For example:
-
- `cat <TAB> N <TAB> "N"
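-
- Because each entry is a single tab-delimited line, the lexicon files are
- easy to read with a short script. The Python sketch below is not part of
- Englex; the name parse_entry is hypothetical, and it assumes that the gloss
- is enclosed in double quotes as in the example above.
-

```python
def parse_entry(line):
    """Split one lexicon line into (lexical_form, alternation, gloss)."""
    parts = line.rstrip("\n").split("\t")
    if len(parts) != 3:
        raise ValueError("expected three tab-separated parts: %r" % line)
    form, alternation, gloss = parts
    # Assumption: the gloss is wrapped in double quotes; remove them if present.
    if len(gloss) >= 2 and gloss.startswith('"') and gloss.endswith('"'):
        gloss = gloss[1:-1]
    return form, alternation, gloss


print(parse_entry('`cat\tN\t"N"'))
```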
-
- 12 Running Englex
-
- To run Englex interactively with PC-KIMMO, launch PC-KIMMO and issue the
- commands "load rules english" and "load lexicon english". You can also
- create a TAKE file to execute these commands automatically (see section
- 7.5.4 of the PC-KIMMO book). Note that if the Englex files are not in the
- same subdirectory as the PC-KIMMO program, you must either do a CD command
- to move into that directory or use pathnames before the filenames.
-
- To run Englex with KTEXT, you must first be sure that the control files
- ENGLISH.CTL and TEXTIN.CTL are present and properly configured. Then launch
- KTEXT with the appropriate command line arguments. For instance:
-
- ktext -w -x english -i alice.txt -o alice.ana -l alice.log
-
- See the KTEXT user's guide for details.
-
- 13 Alphabet
-
- The alphabet of word-forming characters is declared in the file ENGLISH.RUL.
- It consists of these characters:
-
- b c d f g h j k l m n p q r s t v w x y z a e i o u ' - ` +
- 0 1 2 3 4 5 6 7 8 9
-
- Only these characters can be used in the lexical form part of a lexical
- entry. The gloss part of a lexical entry is not restricted to these
- characters. Capitalization, accented characters, and punctuation in running
- text can be handled by KTEXT.
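-
- When adding entries, it can help to check a lexical form against this
- alphabet before loading the lexicon. The following Python sketch simply
- copies the character list given above; the names ALPHABET and
- outside_alphabet are hypothetical, and ENGLISH.RUL remains the authority on
- what is actually declared.
-

```python
# Characters copied from the alphabet listed in this section.
ALPHABET = set("bcdfghjklmnpqrstvwxyzaeiou'-`+0123456789")


def outside_alphabet(form):
    """Return the characters of `form` that are not in the declared alphabet."""
    return sorted(set(form) - ALPHABET)


print(outside_alphabet("`cat+s"))  # no offending characters
print(outside_alphabet("Alice"))   # the capital letter is reported
```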
-
- 14 British spelling
-
- Some British spelling variants have been included, such as colour,
- recognise, centre, etc, but this has not been done consistently or
- exhaustively. I apologi[z/s]e for this American bias.
-
- 15 Archaic forms
-
- Archaic verb endings are found in the sublexicon V_INFL in the file
- ENGLISH.LEX. To enable them, remove the comment character before each line.
- The file VERB.LEX also contains various archaic verb forms and the file
- MINOR.LEX contains archaic pronouns.
-
- 16 Stress marks
-
- Word stress in full words is indicated with the back quote (grave accent) `.
- Be careful not to confuse it with apostrophe; for instance, the lexical form
- of the word "woman's" is written `woman+'s. The stress marks were placed
- according to my own intuition and the authority of Webster's Ninth New
- Collegiate Dictionary. Notice that even monosyllabic words require a stress
- mark because the Gemination rule crucially refers to it (see the file
- ENGLISH.RUL and appendix A of the PC-KIMMO book [Antworth 1990]).
-
- 17 Accented characters (diacritics)
-
- Englex's alphabet does not include accented characters (characters with
- diacritics). For instance, the word "naivete" is usually spelled with a
- diaeresis over the "i" and an acute accent over the final "e"; but the
- lexical entry for "naivete" is spelled with no diacritics. If your input
- text contains accented characters, they must be converted to corresponding
- unaccented characters. The control file TEXTIN.CTL for KTEXT can be
- configured by the user to convert single eight-bit accented characters
- (either Macintosh or IBM extended character set) to seven-bit characters.
- Edit this file to make changes or additions. If your input text contains
- digraphs to represent accented characters (for instance, na:ivet'e), you can
- convert these to single characters using consistent change commands in the
- control file TEXTIN.CTL. See the KTEXT user's guide for details.
-
- Another way to handle eight-bit accented characters is to add them to the
- alphabet in ENGLISH.RUL and use them in the lexical entries. This is a less
- portable solution.
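-
- If the input text can be preprocessed outside KTEXT, the conversion to
- unaccented characters can also be done with a small script. The Python
- sketch below is an alternative to the TEXTIN.CTL approach, not a description
- of it. It assumes the text has already been decoded to Unicode (rather than
- the eight-bit Macintosh or IBM character sets), and the name
- strip_diacritics is hypothetical.
-

```python
import unicodedata


def strip_diacritics(text):
    """Replace accented letters with their unaccented base letters."""
    # NFD splits each accented letter into a base letter plus combining marks.
    decomposed = unicodedata.normalize("NFD", text)
    # Drop the combining marks and keep everything else.
    return "".join(ch for ch in decomposed if not unicodedata.combining(ch))


# "naivete" written with a diaeresis on the i and an acute accent on the e
print(strip_diacritics("na\u00efvet\u00e9"))
```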
-
- 18 Inflection and derivation
-
- Morphological processes are traditionally divided into two types: inflection
- and derivation. Englex handles both types, though it does not formally
- distinguish them. Here are some examples of how Englex glosses inflectional
- morphology:
-
- cats `cat+s
- N+PL
-
- singing `sing+ing
- V+PRG
-
- sang `sang
- V.PST
-
- Here is how Englex glosses a derivationally complex word:
-
- computerization com`pute+er+ize+ation
- V+NR19+VR6+NR23
-
- Englex contains an entry for the verb root com`pute and entries for the
- suffixes +er, +ize, and +ation; all words based on that root (such as computer,
- computerize, etc.) are recognized by decomposing them into their constituent
- parts.
-
- In addition to listing roots, Englex must also list derived forms that
- cannot be decomposed due to phonological or morphological irregularity. For
- example, the word "reception" is listed in the lexicon with the gloss string
- V(re`ceive)+NR23.
-
- Many regularly derived words in English have acquired specialized meanings.
- For example, the word "business" is a regular nominal derivation of the verb
- busy, but no longer retains its transparent meaning. Many such words have
- been given their own lexical entries to reflect this fact. Thus "business"
- will return two parses: `business N and `busy+ness AJ+NR27
-
- Englex may reveal relations among words that you were not aware of. For
- example, I was surprised to find that Englex analyzed the word "amplify" as
- the adjective "ample" plus the verbalizing suffix -ify. Even though this
- formation is perfectly regular and analogous to "simple, simplify", I had
- never consciously made the connection.
-
- It is not easy to draw a sharp line between productive, synchronic
- formations and static, diachronic formations. For example, the adjective
- "resilient" is actually derived from the verb "resile". Even though the
- semantic relation is perfectly transparent, the fact that "resile" is no
- longer in currency puts this analysis more in the arena of etymology. I have
- probably not been entirely consistent in handling such cases.
-
- 19 Multiple senses and homonyms
-
- Englex is intended as a parsing lexicon, not a full dictionary. In general,
- multiple senses are not distinguished. For example, there is only one entry
- for the adjective "fair", ignoring the fact that it has several senses
- (including 'not stormy' and 'impartial'). However, the noun "fair" meaning 'a
- festival' is considered a homonym and, because it is a different word class,
- is given its own entry in the noun sublexicon. There are a few instances
- of homonyms of the same word class; for instance, "bat" in the sense
- 'instrument for hitting' and "bat" in the sense 'flying mammal'. Because
- these two words have different derivational possibilities (the first can be
- converted to a verb while the second cannot), they are given separate
- lexical entries. Their glosses are distinguished as "bat1" and "bat2". I
- have no doubt missed other such cases.
-
- 20 Word class conversion
-
- Many words in English belong to more than one word class; for instance,
- "hit" is either verb or noun and "calm" is either adjective or verb. Since
- in such cases the word appears to have the same sense but just differs in
- word class, we can say that the word has changed from one class to another.
- The direction of conversion is distinctive. Examples of verb to noun
- conversion include "love", "laugh", "answer", "cover", and "walk", while
- examples of noun to verb conversion include "bottle", "grease", "peel", and
- "father". Englex handles conversion by permitting special continuations such
- as V-to-N and N-to-V (see the alternations and sublexicons by these names in
- the file ENGLISH.LEX). Given a word such as "talk" that has the continuation
- V-to-N, Englex will return two parses: V(`talk) and V(`talk).NR0 where the
- tag NR0 stands for nominalizer zero.
-
- When adding new lexical entries, you should take the possibility of
- conversion into consideration. For example, say you find that Englex fails
- to recognize the inflected verb "partied". Before adding "party" to the verb
- lexicon, first check to see if "party" already exists in the noun lexicon.
- If it does, then you need only to change its continuation from N to N-to-V.
-
- For a discussion of conversion in English, see Quirk et al. 1972:1009ff.
-
- 21 Output data structures
-
- Given the word "cried" as input, Englex will return as output two pieces of
- data: the lexical (underlying) form `cry+ed and the gloss string V+PTC (a
- plus sign indicates a morpheme boundary; see below for a list of gloss
- tags). If you use KTEXT, the output file will contain a record for each
- word; for example (see the KTEXT user's guide for details):
-
- \a V+PTC
- \d `cry+ed
- \w cried
-
- There will not necessarily be an equal number of morpheme break symbols
- between the lexical form and the gloss string; for example:
-
- \a V(re`ceive)+NR23
- \d re`ception
- \w reception
-
- This shows that even though the form "reception" can only be partially
- segmented (-ion is a regular suffix, but there is no stem "recept"), it
- nevertheless corresponds to a morphologically regular formation of stem plus
- suffix (compare "digress" and "digression").
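-
- Records like the two shown above are straightforward to read back into a
- program. The Python sketch below is illustrative only: it assumes that each
- record starts with an \a field and that every field fits on one line, which
- may not hold for all KTEXT output (ambiguous words, for example). The name
- parse_records is hypothetical; consult the KTEXT user's guide for the full
- record format.
-

```python
def parse_records(text):
    """Collect backslash-marked fields into one dict per record."""
    records = []
    current = None
    for line in text.splitlines():
        line = line.strip()
        if not line.startswith("\\"):
            continue  # skip blank lines and anything that is not a field
        marker, _, value = line.partition(" ")
        if marker == "\\a":
            # Assumption: the \a field opens a new record.
            current = {}
            records.append(current)
        if current is not None:
            current[marker[1:]] = value.strip()
    return records


sample = "\\a V+PTC\n\\d `cry+ed\n\\w cried\n\n\\a V(re`ceive)+NR23\n\\d re`ception\n\\w reception\n"
for record in parse_records(sample):
    print(record["w"], record["d"], record["a"])
```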
-
- Besides the plus sign, a period (dot) is also used to indicate a special
- type of morpheme, namely an irregular or zero alternant of an affix. For
- example, while a regular plural noun such as "cats" is glossed as
- N(`cat)+PL, the irregular plural "mice" is glossed as N(`mouse).PL. Word
- class conversion is also handled this way; thus the nominalized verb
- "arrival" is glossed as V(ar`rive)+NR22 while "return" is glossed
- V(re`turn).NR0.
-
- It is important to understand that Englex glosses morphemes, not whole
- words. Of course when a word is composed of only a single morpheme, this
- distinction is moot; thus the word "large" is glossed as AJ, which can be
- interpreted as either a morpheme-level or a word-level gloss. Now consider
- the multimorphemic word "enlargement", which is glossed as
- VR1+AJ(`large)+NR25. This is a string of morpheme glosses which maps
- directly to the parts of the lexical form en+large+ment. But there is
- nothing in the gloss string that tells us overtly whether the class of the
- whole word is adjective, verb, or noun. This level of analysis is beyond the
- feasible scope of PC-KIMMO and Englex. However, it should not be difficult
- to write an algorithm to infer word class from a gloss string such as
- VR1+AJ(`large)+NR25. It is a well-known fact of English morphology that the
- rightmost suffix determines the word class of the entire word. Such an
- algorithm could be applied to the output structures provided by KTEXT.
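-
- One possible shape for such an algorithm is sketched below in Python. It is
- not part of Englex or KTEXT, and the names WORD_CLASSES, CHANGERS,
- changed_class, and infer_word_class are hypothetical. It relies on two
- assumptions beyond the rightmost-suffix rule: that a class-changing prefix
- such as VR1 decides the class when no class-changing suffix follows the
- root, and that inflectional tags such as PL or PST leave the class
- unchanged. Multiple tags such as AJ/AV are not handled.
-

```python
import re

# Word-class tags taken from the gloss tag list.
WORD_CLASSES = {"N", "PN", "V", "AUX", "AJ", "AV", "PP", "DT", "CJ", "QN",
                "DEM", "PR", "IJ", "FN", "CD", "OD"}
# Class-changing affix tags and the word class each one yields.
CHANGERS = {"NR": "N", "VR": "V", "AJR": "AJ", "AVR": "AV"}


def changed_class(tag):
    """Return the class produced by a tag such as NR25 or AJR8, else None."""
    m = re.fullmatch(r"(NR|VR|AJR|AVR)\d+[a-z]?", tag)
    return CHANGERS[m.group(1)] if m else None


def infer_word_class(gloss):
    """Guess the word class of a whole word from its gloss string."""
    # Split at "+" or "."; a root gloss keeps its parenthesized form.
    parts = re.findall(r"[^+.()]+(?:\([^)]*\))?", gloss)
    tags = [p.split("(")[0] for p in parts]
    roots = [i for i, t in enumerate(tags) if t in WORD_CLASSES]
    if not roots:
        return None
    root = roots[-1]  # rightmost root, e.g. the head of a compound
    # 1. The rightmost class-changing suffix determines the class.
    for tag in reversed(tags[root + 1:]):
        if changed_class(tag):
            return changed_class(tag)
    # 2. Assumption: otherwise a class-changing prefix decides.
    for tag in tags[:root]:
        if changed_class(tag):
            return changed_class(tag)
    # 3. Otherwise the word keeps the class of its root.
    return tags[root]


for g in ["VR1+AJ(`large)+NR25", "V(re`turn).NR0", "N(`dog)+PL"]:
    print(g, "->", infer_word_class(g))
```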
-
- Another point to notice here is that a gloss string has a strictly linear
- structure; that is, it does not have any internal constituent structure.
- Even though it can be argued that a word such as "enlargement" has a
- bracketed structure such as [[en+[large]]+ment], such tree-like structures
- are flattened out in the gloss string Englex produces.
-
- A corollary to the fact that Englex glosses morphemes rather than words is
- that it glosses only what is phonologically present in the input word. For
- example, while the word "dogs" will return the gloss N(`dog)+PL, the word
- "dog" will return only the gloss N(`dog); that is, it does *not* return
- something like N(`dog).SG to indicate that it is a singular noun. Since
- singular number is unmarked in English, Englex does not gloss it; plural
- number is marked, so Englex returns a gloss when it finds it. This shows
- that Englex is perhaps better understood as a recognizer than a parser,
- since it does not return an overt set of inflectional categories for each
- word. As was suggested above, such information can be obtained by
- postprocessing Englex's output. (NOTE: I have not been entirely consistent
- with this policy. See the first part of the file NOUN.LEX where I have
- listed zero plural nouns, nouns with equivocal number, unmarked plural
- nouns, and so on. Collecting lists of such words appealed to me as a
- linguist.)
-
- 22 Gloss tags
-
- Here is a list of all the gloss tags used in Englex.
-
- N noun
- PN proper noun
- V verb
- AUX auxiliary
- AJ adjective
- AV adverb
- PP preposition
- DT determiner
- CJ conjunction
- QN quantifier
- DEM demonstrative
- PR pronoun
- IJ interjection
- FN foreign
- CD cardinal
- OD ordinal
-
- 1 first person
- 2 second person
- 3 third person
- SG singular
- PL plural
- GEN genitive
- CMP comparative
- SPR superlative
- PST past
- PTC participle
- PRG progressive
-
- NR nominalizer
- VR verbalizer
- AJR adjectivizer
- AVR adverbizer
-
- NEG negative
- PEJ pejorative
- DEG degree
- ORI orientation
- LOC location
- NUM number
- REV reversive
- ORD time and order
- NEO neo-classical
-
- (The last nine tags listed above were suggested by Quirk et al. 1972:981ff.)
-
- Affixes with the same tag are differentiated by numbering; thus the
- nominalizing suffixes are tagged NR1, NR2, etc. Variants of an affix are
- further distinguished with letters; for instance, NR23a, NR23b, etc.
-
- Some words are given multiple tags. For instance the word "fast" is tagged
- as AJ/AV because it can function as either an adjective or an adverb.
- Alternatively, the word "fast" could be given two lexical entries, one in
- the adjective sublexicon and another in the adverb sublexicon. The choice
- depends on how you want to handle multiple parses.
-
- 23 Compounds
-
- There are three types of orthographic compounds in English (see Quirk et al.
- 1972:1019):
-
- solid, e.g. bedroom
- hyphenated, e.g. tax-free
- open, e.g. rose bush
-
- Open compounds are not handled by Englex at all. If you want to treat open
- compounds as single lexical items, you must preprocess the text to join them
- as either hyphenated or solid compounds (for instance, replace "rose bush"
- with "rose-bush" or "rosebush" and put these forms in the lexicon).
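-
- The joining step can be scripted before the text is passed to KTEXT. The
- Python sketch below is one way to do it, assuming you supply your own list
- of open compounds; the name join_open_compounds and its joiner parameter are
- hypothetical. Matching is case-sensitive, and the joined forms must still be
- added to the lexicon.
-

```python
import re


def join_open_compounds(text, compounds, joiner="-"):
    """Rewrite each listed open compound as a hyphenated or solid compound."""
    for compound in compounds:
        words = compound.split()
        joined = joiner.join(words)
        # Match whole words only, with any run of whitespace between them.
        pattern = r"\b" + r"\s+".join(re.escape(w) for w in words) + r"\b"
        text = re.sub(pattern, lambda m, j=joined: j, text)
    return text


print(join_open_compounds("a rose bush in bloom", ["rose bush"]))
print(join_open_compounds("a rose bush in bloom", ["rose bush"], joiner=""))
```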
-
- Englex can handle hyphenated compounds. If it recognizes a whole word and
- then encounters a hyphen, it will recurse and attempt to recognize the part
- after the hyphen as another word. It will even handle phrasal compounds this
- way, such as "his come-what-may attitude". If you do not want to decompose
- hyphenated compounds, find the End sublexicon near the bottom of the file
- ENGLISH.LEX and comment out the hyphen entry.
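-
- The recursion on hyphens can be pictured with a toy Python sketch. This is
- only a model of the behavior described above: the function
- parse_hyphenated and the use of a plain set of words as the "lexicon" are
- assumptions for illustration. PC-KIMMO itself works from the lexicon files,
- not from code like this.
-
```python
def parse_hyphenated(form, lexicon):
    """Return the list of parts if every hyphen-separated part is a known
    word, otherwise None. `lexicon` is a plain set of whole words."""
    head, sep, rest = form.partition("-")
    if head not in lexicon:
        return None          # the first part is not a recognizable word
    if not sep:
        return [head]        # no hyphen left: the whole form is recognized
    tail = parse_hyphenated(rest, lexicon)   # recurse on the remainder
    return None if tail is None else [head] + tail
```
-
- With a toy lexicon containing "come", "what", and "may", the form
- "come-what-may" is recognized part by part, while a form with an unknown
- part fails as a whole.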
-
- Englex treats solid compounds as if they were indivisible stems; they are
- simply listed in the lexicon. It should be possible to cause Englex to
- decompose solid compounds by using a null lexical entry in the End
- sublexicon. However, I suspect that a large number of spurious parses would
- result.
-
- There are three types of compounds that have received special treatment.
- First is the "object-verb-er" type such as "lawnmower" and "sightseer".
- Those which are usually written as solid compounds have been included in the
- lexicon with entries like this:
-
- `sightseer N "N(`sight)+V(`see)+NR19"
-
- Second is the "adjective-noun-ed" type such as "red-haired" and
- "long-legged". Some compounds of this type are found in Englex with entries
- like this:
-
- clear`headed AJ "AJ(`clear)+N(`head)+AJR8"
-
- Third are the "man/men" and "woman/women" compounds such as
- "businessman/men/woman/women". Because there are so many of these, I have
- created suffix entries for +man, +men, +woman, and +women. Some must still
- be listed in the lexicon, such as "madman" (built on an adjective rather
- than a noun) and "klansman" (rather than "*klanman"). See the section on
- man/woman compounds in the file NOUN.LEX.
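-
- Gloss strings like the ones in the entries above are easy to take apart
- when postprocessing. The following Python sketch is a hedged example: the
- function parse_gloss, the labels "stem" and "affix", and the assumption that
- components are always separated by + are mine, based only on the two sample
- entries shown in this section.
-
```python
import re

# Assumed stem shape: a word-class tag followed by the stem in parentheses.
STEM_RE = re.compile(r"^([A-Z/]+)\((.+)\)$")

def parse_gloss(gloss):
    """Split a gloss such as "N(`sight)+V(`see)+NR19" into components."""
    parts = []
    for piece in gloss.split("+"):
        m = STEM_RE.match(piece)
        if m:
            # (kind, word class, stem with its stress mark)
            parts.append(("stem", m.group(1), m.group(2)))
        else:
            # Anything else is kept whole as an affix tag.
            parts.append(("affix", piece, None))
    return parts
```
-
- Applied to the gloss of "sightseer", this yields two stem components (N and
- V) followed by the affix tag NR19.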
-
- 24 Clitics
-
- Clitics are distinguished from affixes. Affixes are constrained in what word
- classes they can attach to; for instance, the plural suffix +s can only
- attach to a noun. Clitics, however, are syntactically bound to phrases but
- phonologically bound to the last word of the phrase; thus they are not
- constrained by the words they attach to. For instance, the possessive clitic
- +'s normally attaches to nouns as in "the man's hat", but can attach to
- other word classes such as adjectives in a phrase such as "the president
- elect's hat". In Englex clitics are placed in the sublexicon CLITICS found
- near the end of the file ENGLISH.LEX. These include +'s for "is", +'s for
- "has", +'ll for "will" and so on. There is one exception: because of the
- frequency of the possessive clitic, it is placed in the sublexicon GENITIVE
- which limits its occurrence to nouns. To change this behavior, simply move
- it to the CLITICS sublexicon.
-
- 25 Participles
-
- The -ed form of a verb is called a past participle (as in "the surprised
- children") and the -ing form is a present participle or gerund (as in "the
- surprising children"). Englex does not give any overt indication that forms
- such as these could be either finite verbs or participles, since to do so
- would result in multiple parses for every -ed and -ing verb form in English.
- Inferring the possibility that a verb could be a participle is left to
- postprocessing. However, if an -ed or -ing form is followed by a
- derivational suffix such as -ly or -ness, then Englex will convert the verb
- to an adjective. For instance, "surprising" will be glossed simply as
- V(sur`prise)+PRG, but "surprisingly" will be glossed as
- V(sur`prise)+PRG.AJR0+AVR1. See the sublexicon PTC_SUFFIX in the file
- ENGLISH.LEX.
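-
- A postprocessing step of the kind mentioned above might look like the
- following Python sketch. It is an assumption-laden example: the function
- might_be_participle is hypothetical, and only the PRG tag is checked by
- default because that is the only gloss this section shows. Which tag your
- lexicon assigns to -ed forms should be checked before adding it.
-
```python
def might_be_participle(gloss, ambiguous_tags=("PRG",)):
    """Return True if the gloss ends in a tag that could also be read as a
    participle (by default only the progressive tag PRG)."""
    last_component = gloss.rsplit("+", 1)[-1]
    return last_component in ambiguous_tags
```
-
- The gloss V(sur`prise)+PRG is flagged, while V(sur`prise)+PRG.AJR0+AVR1 is
- not, since Englex has already converted that form to an adjective before
- adding the adverbial suffix.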
-
- 26 Special cases
-
- There are a couple of classes of words that receive some special treatment in
- Englex. First are words that end in -ology and other y-final foreign
- suffixes. The problem comes in handling the derived forms of a word such as
- "biology", for instance "biological" and "biologist", where the final y is
- absent. It is not feasible to handle this with a general phonological rule,
- since it is morphologically conditioned. Instead, I have treated the final y
- as a suffix. This means that "biology" is represented in the lexicon as
- "biolog" which must take a suffix in order to be well-formed. These special
- "Final_y" words are found in the first part of the file NOUN.LEX.
-
- Second are adjectives that end in -ic. Some of these words also have an
- adjective form ending in -al, for instance "acoustic" and "acoustical".
- Others do not have an -al adjective form ("atomic" but not "*atomical") but
- require -al before adding the adverbial suffix -ly ("atomically"). These -ic
- adjectives are given the special continuation AJR_ic (see the file
- ADJECTIV.LEX).
-
- 27 Names and abbreviations
-
- The file PROPER.LEX contains proper names and related words. There is a
- fairly long list of geographical place names, but virtually no first and
- last names of people (with the exception of some historical figures). The
- intent was to provide a place where you can add names that occur in the text
- you are processing.
-
- The file ABBREV.LEX contains acronyms and abbreviations. The entries mainly
- come from text that I processed. Add your own entries as needed.
-
- 28 Digits and Roman numerals
-
- Englex will handle numbers such as 2, 125, 1984, etc. See the sublexicon
- DIGITS in the file MINOR.LEX. Unfortunately, neither PC-KIMMO nor KTEXT can
- correctly handle numbers that contain commas or decimal points (such as
- 1,200 or 5.25). This is because comma and decimal point are elsewhere used
- as punctuation and thus cannot also serve as alphabetic characters. Note
- that KTEXT does not drop commas or decimal points; it simply saves them in a
- punctuation field. Thus it will treat 1,200 as two "words" separated by a
- comma.
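-
- One possible workaround, not part of Englex, is to remove thousands
- separators before parsing so that 1,200 reaches the parser as 1200. The
- Python sketch below assumes that a comma between a digit and a group of
- exactly three digits is a thousands separator; the function name is
- hypothetical, and decimal points are deliberately left alone.
-
```python
import re

# A comma preceded by a digit and followed by exactly three digits.
THOUSANDS_COMMA = re.compile(r"(?<=\d),(?=\d{3}(?!\d))")

def strip_thousands_commas(text):
    """Remove commas that look like thousands separators inside numbers."""
    return THOUSANDS_COMMA.sub("", text)
```
-
- Be aware that this heuristic will also join a list such as "pages 1,200"
- if no space follows the comma, so check the result against your text.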
-
- Englex will also handle Roman numerals. See the sublexicon ROMAN in the file
- MINOR.LEX. Notice that the entry for the numeral "i" has been commented out
- to prevent ambiguity with the first person singular personal pronoun.
-
- 29 Preprocessing text
-
- English orthography is notoriously underspecified. For instance, capital
- letters are used both for proper names and to begin sentences; periods are
- used both after abbreviations and to end sentences; a hyphen can be used in
- a compound word or to form a dash; and the character ' is often used both as
- a single quote mark and as an apostrophe. Such ambiguities may require you
- to preprocess your text. For example, say your text uses the character '
- both as a single quote and as an apostrophe (as does the Project Gutenberg
- version of "Alice's Adventures in Wonderland"). Since you want to treat
- forms such as "girl's" as a single word, apostrophe must be declared as an
- alphabetic (word forming) character. However, KTEXT will now fail on any
- word that is preceded or followed by a single quote mark. The only solution
- is to consistently change all single quote marks to some other nonalphabetic
- character (such as " or < or eight-bit curly quotes).
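-
- The quote-mark replacement can be sketched in Python as follows. This is a
- heuristic, not a complete solution: it assumes that a ' with a letter on
- both sides is an apostrophe and that every other ' is a quote mark. The
- function name requote is made up for this example.
-
```python
import re

# A ' that is missing a letter on at least one side is taken to be a quote.
QUOTE_MARK = re.compile(r"(?<![A-Za-z])'|'(?![A-Za-z])")

def requote(text, replacement='"'):
    """Replace single quote marks, leaving word-internal apostrophes."""
    return QUOTE_MARK.sub(replacement, text)
```
-
- Forms such as "girl's" and "don't" are left intact. Plural possessives
- ("the girls' hats") and initial elisions ("'tis") will be misread as quote
- marks under this assumption and need to be handled separately.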
-
- Similarly, "Alice" uses two hyphens to indicate a dash as in "...as she
- spoke--fancy curtseying as...". If hyphen is used as a word forming
- character in compounds, then "spoke--fancy" will be treated as a single
- word, resulting in failure. The solution is to consistently change two
- hyphens to some other nonalphabetic characters (such as two equals signs or
- an eight-bit dash character).
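-
- A minimal Python sketch of this replacement follows. The function name
- and the choice of two equals signs as the default replacement are
- assumptions taken from the suggestion above.
-
```python
import re

def replace_dashes(text, replacement="=="):
    """Replace runs of two or more hyphens, leaving single hyphens alone."""
    return re.sub(r"-{2,}", replacement, text)
```
-
- Single hyphens inside compounds such as "tax-free" are not affected.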
-
- Section 23 mentions preprocessing text in order to join open compounds. A
- similar problem occurs with foreign expressions and names, for instance ad
- hoc, faux pas, El Salvador, Los Angeles. Englex already contains these forms
- joined with hyphens: ad-hoc, faux-pas, el-salvador, los-angeles (see the
- sublexicon FOREIGN in the file MINOR.LEX and the section of place names in
- PROPER.LEX). Use a text-processing tool such as SED or AWK to join these
- forms before parsing the text with Englex.
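-
- The same joining step can also be written in Python instead of SED or AWK.
- The sketch below is one way to do it under stated assumptions: the function
- join_expressions and its parameters are hypothetical, and the list of
- expressions to join must be supplied by you.
-
```python
import re

def join_expressions(text, expressions, joiner="-"):
    """Join each listed multiword expression into a single form."""
    # Longest expressions first, so a longer one is not pre-empted by a
    # shorter expression that it contains.
    for expr in sorted(expressions, key=len, reverse=True):
        words = expr.split()
        pattern = r"\b" + r"\s+".join(re.escape(w) for w in words) + r"\b"
        # Rebuild from the matched text so the original casing is kept.
        text = re.sub(
            pattern,
            lambda m: joiner.join(m.group(0).split()),
            text,
            flags=re.IGNORECASE,
        )
    return text
```
-
- Passing joiner="" produces solid forms such as "rosebush". Matching is
- case-insensitive but the original capitalization is preserved, and only the
- exact forms listed are joined, so inflected forms must be listed as well.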
-
- 30 Reporting defects and submitting enhancements
-
- If you find errors in Englex, please report them to me at the address below.
- If you make enhancements to Englex that you think others would benefit from,
- I encourage you to send these to me also. If enough interest develops, I am
- willing to redistribute such enhancements to other users. If you want to be
- on a standing list to receive information on future development of Englex,
- please send me your e-mail address.
-
- You can contact me at the following mailing address, e-mail address, or
- phone number.
-
- Evan Antworth | Internet: evan@sil.org
- Academic Computing Department | UUCP: ...!uunet!convex!txsil!evan
- Summer Institute of Linguistics | phone: 214/709-2418
- 7500 W. Camp Wisdom Road | fax: 214/709-3387
- Dallas, TX 75236 |
-
- 31 References
-
- Antworth, Evan L. 1990. PC-KIMMO: a two-level processor for morphological
- analysis. Occasional Publications in Academic Computing No. 16. Summer
- Institute of Linguistics.
-
- Antworth, Evan L. and Stephen R. McConnel. 1991. KTEXT User's Guide. On-line
- documentation.
-
- Bauer, Laurie. 1983. English word-formation. Cambridge University Press.
-
- Quirk, Randolph, Sidney Greenbaum, Geoffrey Leech and Jan Svartvik. 1972. A
- grammar of contemporary English. Longman.
-
- Webster's Ninth New Collegiate Dictionary. 1984. Merriam-Webster Inc.
-