- A DESCRIPTION OF A COMPUTER-USABLE DICTIONARY FILE BASED ON
- THE OXFORD ADVANCED LEARNER'S DICTIONARY OF CURRENT ENGLISH
-
- Roger Mitton,
- Department of Computer Science,
- Birkbeck College,
- University of London,
- Malet Street,
- London WC1E 7HX
-
- June 1992 (supersedes the versions of March and Nov 1986)
-
-
- In 1985-86 I produced a dictionary file called CUVOALD (Computer
- Usable Version of the Oxford Advanced Learner's Dictionary). This was
- a partial dictionary of English in computer-usable form - "partial"
- because each entry contained only some of the information from the
- original dictionary, and "computer-usable" (rather than merely
- "computer-readable") because it was in a form that made it easy for
- programs to access it. A second file, called CUV2, was produced at
- the same time. This was derived from CUVOALD and was the same except
- that it also contained all inflected forms explicitly, eg it contained
- "added", "adding" and "adds" as well as "add". I have now added some
- information to each entry and some more entries to CUV2, to produce a
- new version of CUV2. This document describes this new file.
-
- These files were derived originally from the Oxford Advanced
- Learner's Dictionary of Current English [1], third edition, published
- by the Oxford University Press, 1974, the machine-readable version of
- which is available to researchers from the Oxford Text Archive. The
- task of deriving them from the machine-readable OALDCE was carried out
- as part of a research project into spelling correction, funded by the
- Leverhulme Trust. The more recent additions have been carried out
- as part of my research as a lecturer in Computer Science at Birkbeck
- College.
-
- THE FILE FORMAT
-
- CUV2 contains 70646 entries. Each entry occupies one line.
- Samples are given at the end of this document. The longest spelling
- is 23 characters; the longest pronunciation is also 23; the longest
- syntactic-tag field is also (coincidentally) 23; the syllable count
- is a single character ('1' to '9'); and the longest verb-pattern
- field is 58. The fields are padded with spaces to the lengths of
- the longest, ie 23, 23, 23, 1 and 58, making the record length 128.
- The spelling begins at position 1, the pronunciation at position 24,
- the syntactic-tag field at position 47, the syllable count at
- position 70, and the verb-pattern field at position 71. The file is
- sorted in ASCII sequence, in which all upper-case letters precede all
- lower-case ones; this means, of course, that the entries are not in
- the same order as in the OALDCE.
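-
- As an illustration of the layout just described, the following is a
- minimal Python sketch of how a program might split one record into
- its five fields. The names Entry, WIDTHS and parse_record are
- assumptions made for this example and are not part of CUV2, and the
- sample record holds placeholder field contents, not a real entry.
-

```python
from typing import NamedTuple

# Field widths as stated in the text: spelling, pronunciation,
# syntactic tags, syllable count, verb patterns.
WIDTHS = (23, 23, 23, 1, 58)
RECORD_LENGTH = sum(WIDTHS)  # 23 + 23 + 23 + 1 + 58 = 128


class Entry(NamedTuple):
    """One CUV2 record (hypothetical container, named for this sketch)."""
    spelling: str
    pronunciation: str
    tags: str
    syllables: str       # kept as the raw character, '1' to '9'
    verb_patterns: str


def parse_record(line: str) -> Entry:
    """Split one fixed-width line into its five space-padded fields."""
    # Remove only the line terminator, then restore any trailing padding
    # that an editor or file transfer may have trimmed.
    record = line.rstrip("\r\n").ljust(RECORD_LENGTH)
    fields = []
    start = 0
    for width in WIDTHS:
        fields.append(record[start:start + width].rstrip())
        start += width
    return Entry(*fields)


# Placeholder record: only the column layout is taken from the text.
sample = ("add".ljust(23) + "PRON".ljust(23) + "TAGS".ljust(23)
          + "1" + "VP".ljust(58))
entry = parse_record(sample)
print(entry)
```

- The positions in the text are counted from 1, whereas Python slices
- are counted from 0, so the pronunciation at position 24 is the slice
- starting at offset 23, and so on for the other fields.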
-
-
- WHAT THE DICTIONARY CONTAINS
-
- Each entry consists of a spelling, a pronunciation, one or more
- syntactic tags (parts-of-speech) with rarity flags, a syllable count,
- and a set of verb patterns for verbs.
-
- The first file derived from the OALDCE (CUVOALD) contained all
- the headwords and subentries from the original dictionary - subentries
- are words like "abandonment" which comes under the headword "abandon"
- - except for a handful that contained special characters (such as "Lsd"
- where the "L" was a pound sign). Subentries were not included if they
- consisted of two or three separate words that occurred individually
- elsewhere in the dictionary, such as "division bell" which comes under
- the headword "division", except when the combination formed a
- syntactic unit not immediately predictable from its constituents, eg
- "above board", which is listed as an adverb. To this list of about
- 35,000 entries, I added about 2,500 proper names - common forenames,
- British towns with a population of over 5,000, countries,
- nationalities, states, counties and major cities of the world. I
- would like to have added many more proper names, but I didn't have the
- time.
-
- The second version of the file (CUV2) contained all these entries
- plus inflected forms making a total of about 68,000 entries. Since
- 1986 I have made a number of corrections, added the rarity flags and
- the syllable counts and inserted about 2,000 new entries. The new
- entries, nearly all of which were derived forms of words already in
- the dictionary, were selected from a list of several thousand words
- that occurred in the LOB Corpus [3] but were not in CUV2. I also made
- changes to existing entries where these were implied by the new
- entries; for example, when adding a plural form of a word whose
- existing tag was "uncountable", it was necessary to change the tag of
- the singular form. I also added about 300 reasonably common
- abbreviations (see note below).
-
- A number of words (ie spellings) have more than one entry in the
- OALDCE, eg "water 1" (noun) and "water 2" (verb). In CUV2, each word
- has only one entry unless it has two different pronunciations, eg
- "abuse" (noun and verb). I have departed from this rule in the case
- of compound adjectives, such as "hard-working", which have a slightly
- different stress pattern depending on whether they are used
- attributively ("she's a hard-working girl") or predicatively ("she's
- very hard-working"). These are entered only once; they generally have
- the attributive stress pattern except when the predicative one seemed
- the more natural. (See also the note below on abbreviations.) I have
- also given only one entry to those words that have strong and weak
- forms of pronunciation, such as "am" (which can be pronounced &m, @m
- or m). Generally it is the strong form that is entered.
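-
- The rule that a spelling has a second entry only when it has a
- second pronunciation can be checked mechanically. The following is
- a minimal Python sketch, assuming the spelling and pronunciation
- fields have already been extracted as pairs. The function names are
- assumptions made for this example, and the pronunciations shown are
- placeholders, not the codes actually used in CUV2.
-

```python
from collections import defaultdict


def multiple_entries(pairs):
    """Map each spelling that has more than one entry to its pronunciations."""
    by_spelling = defaultdict(list)
    for spelling, pronunciation in pairs:
        by_spelling[spelling].append(pronunciation)
    return {s: p for s, p in by_spelling.items() if len(p) > 1}


def violations(pairs):
    """Spellings repeated with an identical pronunciation, which the rule excludes."""
    repeated = multiple_entries(pairs)
    return sorted(s for s, p in repeated.items() if len(set(p)) < len(p))


# Placeholder data: "abuse" stands for a word entered twice (noun and verb).
pairs = [
    ("abuse", "PRON-A"),
    ("abuse", "PRON-B"),
    ("add", "PRON-C"),
]
print(multiple_entries(pairs))
print(violations(pairs))
```

- On a file that follows the rule, violations would be expected to
- return an empty list.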
-
- As regards the coverage of the dictionary, readers might be
- interested in a paper by Geoffrey Sampson [4] in which he analyses a
- set of words from a sample of the LOB Corpus [3] that were not in CUV2.
- The recent additions should have gone some way to plugging the gaps
- that his study identified.
-
-