I. Introduction
The three lists contained in this archive are the product of the
"n-dicts" project (n being a variable whose value is currently
12). The purpose of this project is to create a list of words
which approximates the common core of the vocabulary of American
English.
The methodology of the project is to record and correlate the words
listed in a number of small dictionaries. The number of dictionaries
so recorded is now 12, comprising 8 ESL (English as a Second Language)
dictionaries and 4 "desk dictionaries". The dictionaries chosen
vary widely by publisher, by style, by completeness and by depth.
One of them is a British dictionary with an international bent; the
remainder are dictionaries of American English (three from British
publishers). The smallest of them contains about 20,000 entries, and
the largest 44,000. (All totalled, there are about 76,000 entries,
many of which appear in only a single dictionary.) All but two of
them were published in the last six years.
I hereby dub this edition of 12dicts, finalized October 7, 2000,
as version 2.0. It differs from previous versions primarily by
inclusion of an additional word list. Additionally, there have been
many error corrections, as well as changes resulting from new editions
of some of the source dictionaries.
II. The 6of12 and 2of12 lists
I tried two different ways of winnowing this data to produce lists of
common words. Both have produced interesting results, included
herein. One list, the 6of12 list, contains all words and phrases
listed in at least 6 of the 12 dictionaries. One way of describing this list
is that it contains those words and phrases which a (seeming) majority
of lexicographers believe are relevant to people learning English,
and/or to everyday usage. This list contains about 32,000 words and
phrases. The other list, the 2of12 list, is more inclusive in that it
includes words listed in as few as two of the source dictionaries, but
less inclusive in that it excludes items of various sorts, including
multiword phrases, proper names and abbreviations. This list contains
about 41,000 words. It is perhaps more suitable for use in areas
like spell checking or word games than the 6of12 list. (Honesty
compels me to admit that neither of these lists is, by itself, a good
choice for spell checking, due to the absence of inflections, proper
names, Roman numerals, etc.)
A third list, 2of12inf.txt, is of a rather different character, and is
discussed later.
A more precise description of the criteria by which the above lists
were composed is as follows:
1. The 6of12 list contains all non-excluded words and phrases which
appear in 6 or more of the source dictionaries.
2. Prefixes and suffixes are excluded. Abbreviations are included;
however, if they are entirely lower-case and alphabetic, they are
terminated with a colon (":") so they can be easily distinguished
from regular words.
3. Inflections of included words are not themselves included unless
they are separately defined or irregular.
4. It sometimes occurs that different spellings of the same word
are listed in 6 or more dictionaries, even though no single form
is so listed. In this case, if one spelling is clearly more
accepted, this spelling and this spelling only is listed. If all
spellings seem equally accepted, one spelling has been selected
arbitrarily for inclusion.
5. The 6of12 list contains a significant number of words which do not
meet either criterion 1 or 4. These words, sometimes called
"signature words", are discussed below. All of these words are
listed in at least one of the source dictionaries.
6. In addition to the ":" suffix discussed above, other special
suffix characters are used to mark words with certain
characteristics, as discussed below.
1. The 2of12 list contains all non-excluded words which appear in at
least 2 of the source dictionaries.
2. This list excludes capitalized words, multiword phrases, and
abbreviations, as well as prefixes and suffixes. It does not
exclude hyphenated words or contractions. If a word occurs in
both a hyphenated and an unhyphenated form, the unhyphenated
form is listed, even if the hyphenated form is generally
preferred.
3. The list excludes spellings which are considered (by a majority
of the dictionaries listing them) to be non-American usage. It
also excludes secondary spellings which are mentioned by fewer
than four of the source dictionaries.
4. Inflections of included words are not themselves included unless
they are separately defined, or irregular.
5. Several of the source dictionaries include listings for obscure
currencies, such as markka, khoum and ngwee. I was unable to
regard such words as part of the English "core vocabulary", and so
I required citation in over a third of the dictionaries for
inclusion of monetary units. A side-effect was the elimination
of the word "lepton", which, in addition to its use in particle
physics, is also .01 Greek drachmas.
6. This list also includes a small number of signature words, as
discussed below.
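The counting rule in criterion 1 of each list can be sketched in a few lines of Python. This is an illustration only, not the procedure actually used to build the lists: the dictionary names and word sets are invented, and the sketch ignores the exclusions, the handling of variant spellings and the signature words described here.

```python
from collections import Counter

def words_in_at_least(dictionaries, threshold):
    """Return the sorted words listed in at least `threshold` dictionaries.

    `dictionaries` maps a (hypothetical) dictionary name to the set of
    headwords it lists.
    """
    counts = Counter()
    for headwords in dictionaries.values():
        # Each dictionary counts at most once per word.
        counts.update(set(headwords))
    return sorted(word for word, n in counts.items() if n >= threshold)

# Toy data: three invented dictionaries instead of twelve.
sample = {
    "dict_a": {"beer", "log", "lepton"},
    "dict_b": {"beer", "log"},
    "dict_c": {"beer", "markka"},
}

print(words_in_at_least(sample, 2))  # ['beer', 'log']
print(words_in_at_least(sample, 3))  # ['beer']
```

With twelve real headword sets, a threshold of 6 would give the raw material for the 6of12 list and a threshold of 2 the raw material for the 2of12 list, before any exclusions are applied.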
As indicated, both lists have been augmented with words (and, in the
case of the 6of12 list, phrases) which fail to meet the formal
requirements for inclusion. In the case of the 6of12 list, 1024
words were added (about 3 % of the total). These are all words which,
in the judgment of the compiler, are as familiar as many of the words
which met the criteria for inclusion. Examples of some of the sorts
of words which were added are:
1. Words of the same category as other included words. An example is
the astrological sign "Cancer", which alone of all the
astrological signs fails to appear in 6 or more of the dictionaries.
Similarly added were the omitted holidays "Thanksgiving" and
"Valentine's Day".
2. Vulgarities, sexual terms and insults. Some such words were
already included, but most of the source dictionaries were quite
squeamish about them. These words are very widely known indeed;
I hold that any list of "common" words which does not include the
infamous f-word is simply discredited thereby. Some may feel that
it would have been better to leave some or all of these terms
unmentioned. Nevertheless, the expression of blasphemy,
unwarranted contempt, and perverse lust, whether in words or in
deeds, is a very human trait. Suppressing the evidence of these
aspects of the human condition in our language makes no more sense
than excluding "leprosy", "gangrene" and "dementia", no matter how
unpleasant they may be to contemplate.
3. Conventional conversational phrases so common as to be practically
invisible to native speakers. Examples are "thank you", "good
night", "uh-huh", "of course" and "gesundheit".
4. Sports terminology, especially for football and baseball. (If I,
who am practically sports-blind, noticed this deficiency, it must
be of major proportions indeed.)
Note that the signature words in the 6of12 list can be identified via
the suffix character "+", and eliminated if desired.
A much smaller set of words (64) was added to the 2of12 list. These
were of two sorts:
1. Signature words from the 6of12 list which were not already present
in the 2of12 list, and which are not excluded due to being
abbreviations, phrases, etc.
2. Inflections of irregular verbs not explicitly mentioned in 2
source dictionaries, such as "outfought" and "reheard".
Some of the 6of12 list entries are annotated with a suffix character,
giving additional information about the associated word. The
annotations can be easily removed with an editor or script if
they are unwanted.
These annotations are:
: - The word is an otherwise unmarked abbreviation. This suffix
may appear in combination with another suffix.
& - The word is primarily a non-American usage.
# - The word is generally held to be a variant or less preferred
form of another word.
< - This form of a word is held to be the primary form by fewer
dictionaries than some other form of the word.
^ - This form of the word was selected arbitrarily from a set of
variants, none of which was clearly preferred.
= - Roughly, this indicates a "second class" word. More precisely,
the word falls into one of the following classes:
a. The word is an inflection which was defined in the same
entry as the base word.
b. The word is a derived word (-ly, -ness or -er/or) which
was not defined in a separate entry.
c. The word appeared in a list of undefined words with a
common prefix, such as un- or re-.
+ - The word is a signature word.
The words in the 2of12 list are not annotated.
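As an example of the kind of script that can remove the annotations, here is a minimal Python sketch. It assumes one entry per line, with the annotation characters attached directly to the end of the entry; the function names and the sample entries are invented for illustration and are not taken from the list itself.

```python
# Suffix characters described above for 6of12 entries.
ANNOTATION_CHARS = ":&#<^=+"

def split_entry(line):
    """Split one list line into (word, annotations).

    Trailing annotation characters are peeled off one at a time, since
    ":" may appear in combination with another suffix.
    """
    entry = line.strip()
    end = len(entry)
    while end > 0 and entry[end - 1] in ANNOTATION_CHARS:
        end -= 1
    return entry[:end].rstrip(), set(entry[end:])

def plain_words(lines, drop=""):
    """Yield words without annotations, skipping entries marked with any
    character in `drop` (for example "+" to omit signature words)."""
    for line in lines:
        word, marks = split_entry(line)
        if word and not (marks & set(drop)):
            yield word

# Invented sample entries in the assumed format.
sample = ["Cancer+", "etc:", "colour&", "thank you+", "zebra"]

print(list(plain_words(sample)))
# ['Cancer', 'etc', 'colour', 'thank you', 'zebra']
print(list(plain_words(sample, drop="+")))
# ['etc', 'colour', 'zebra']
```

Passing drop="+" eliminates the signature words, and drop="&#" would discard non-American usages and less preferred variants in the same way.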
III. The 2of12inf list
The 2of12inf list is of a rather different character. Conceptually,
it is simple. It consists of all the words in the 2of12 list, plus
their inflections, amounting to about 81,000 words. This list may
be more useful than the other lists for applications like word games.
It was created to help Kevin Atkinson in his Aspell and SCOWL projects
(for which, see http://aspell.sourceforge.net). Unlike the 6of12 and
2of12 lists, this list is not based exclusively on the contents of my
12 source dictionaries, and for this reason it has, I feel, less
authority than the other 12dicts lists. It also probably has a
significantly higher error rate than the other lists, for reasons
explained below.
The criteria defining the 2of12inf list are as follows:
1. The 2of12inf list contains all non-excluded words which appear in
at least 2 of the source dictionaries.
2. This list excludes capitalized words, multiword phrases,
abbreviations, contractions, hyphenated words and single-letter
words, as well as prefixes and suffixes.
3. The list does not exclude secondary spellings, non-American usages
or monetary units.
4. The list includes inflections of all included words. Any
inflection mentioned or clearly implied by any of the source
dictionaries is included (i.e., two citations are not required).
Additionally, some inflections have been added from other sources.
5. Plurals of "uncountable" nouns were included, annotated with the
"%" suffix character. See below for an extended discussion of
the inclusion of these words.
6. Signature words from the other lists, plus their inflections, were
added. No other signature words were added.
Though the 2of12inf list still consists mostly of very common words,
criteria 3 through 5 cause the 2of12inf list to contain a greater
proportion of unfamiliar and unusual words than the other 12dicts
lists.
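Most of the exclusions in criterion 2 can be recognized from the spelling of an entry alone. The following Python sketch shows one way to approximate that test. It is an assumption-laden illustration, not the method actually used: the function name and sample entries are invented, abbreviations are recognized only by the ":" marker used in the 6of12 list, and unmarked abbreviations or affixes written without a hyphen would need a separate list.

```python
def excluded_by_form(entry):
    """Return True if `entry` falls into a class that criterion 2 excludes
    and that can be recognized from spelling alone."""
    return (
        len(entry) == 1            # single-letter words
        or " " in entry            # multiword phrases
        or "-" in entry            # hyphenated words, prefixes, suffixes
        or "'" in entry            # contractions
        or entry != entry.lower()  # capitalized words
        or entry.endswith(":")     # abbreviations marked as in 6of12
    )

# Invented sample entries.
sample = ["a", "thank you", "uh-huh", "don't", "Cancer", "etc:", "zebra"]

print([e for e in sample if not excluded_by_form(e)])  # ['zebra']
```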
The 2of12inf list was not derived directly from the 12 source
dictionaries. The starting point was a subset of Kevin Atkinson's
AGID list, a list of words, parts of speech and inflections derived
from public-domain sources, notably Moby Words and WordNet. (See the
file agid.txt in this archive, which is a copy of the AGID "readme",
for more information on the antecedents of AGID.) 2of12inf was created
by a process of editing the AGID subset to remove spurious entries and
those which reflected a more esoteric English vocabulary than the other
12dicts lists, and to add inflections which AGID failed to identify.
This process required significantly less effort than would have been
needed to derive the list directly from the source dictionaries.
Unfortunately, a side effect of the process is that the result is
likely to be somewhat less reliable than the other 12dicts lists.
In particular, Moby Words is notoriously unreliable, and I find it
unlikely that I have successfully identified all the spurious
inflections its use has introduced. It is my hope in the future to
release another edition of 2of12inf which is not derived from AGID,
and therefore not "infected" by Moby Words.
Ideally, the 2of12inf list would contain only inflections listed in
one of the 12dicts source dictionaries. This proved not to be
practical. The reason for this has to do with the nature of these
sources, which are mostly ESL dictionaries. An ESL dictionary might
well list the word "esophagus", but, because an English learner is
unlikely to need to talk about this organ in the plural, it will
probably not bother to list the plural form "esophagi". For words of
this sort, I therefore needed to obtain their inflections from other
sources. Obviously, the decisions on when to include additional
inflections were judgment calls, as were the choices of which
inflections to add.
Adjectival inflections (comparatives and superlatives) proved to be
an especially annoying problem. Only 2 of my 12 source dictionaries
provided remotely reliable information of this sort. In fact, such
information is sparse and inconsistent in most dictionaries of any
size. I relied on a small set of additional dictionaries for this
information, which was mostly disjoint from the sources for plurals
and verb forms. Several of these sources were Scrabble(r)-related,
and therefore inclined to include forms of little plausibility such
as "iller/illest" or "fertiler/fertilest". Accordingly, I ended up
rejecting some of the documented inflections on grounds of
implausibility. I have no doubt that, in the process, I made a number
of errors of both inclusion and exclusion and, in any case, many of
the forms listed have no connection with any of the 12dicts source
dictionaries.
One additional problem in the creation of the 2of12inf list was that
of "uncountable" nouns and their plurals. Some English dictionaries,
especially ESL dictionaries, as well as other linguistic sources,
attest to the existence of nouns which cannot be counted, or used in
the plural. Examples of such nouns include "mud", "rayon", "oregano",
"chess", "fairness", "wisdom", "aluminum", "training", "materialism"
and "chickenpox". This is an entirely commonsense notion, but a
difficulty is the fact that the boundary between the countable and the
uncountable is extremely vague and ill-defined. For example, the word
"coffee" is ordinarily uncountable, but not when ordering in a
restaurant; likewise, "symmetry" is uncountable, except in physics or
math.
In general, it is possible to contrive a context where use of the
plural of any noun whatsoever is reasonable.
An alternate position, therefore, is that in fact no nouns are
uncountable, and that any noun which is not already plural possesses
a plural. This position is especially useful in the context of word
games, where words such as "zeals" and "anthraxes" may produce large
scores. For this reason, the official Scrabble dictionaries list
words such as "thens", "onces" and "mankinds", which most people find
rather unreasonable. The fact that the 2of12inf list might well be
useful in gaming contexts, together with the fact that the boundary
between countable and uncountable nouns is so ill-defined, served as
a powerful argument for inclusion of all plural forms, whether
commonly used or not, while its derivation from ESL sources argued
for including only the plurals of countable nouns, however
distinguished.
In the end, I was unable to resolve this dilemma, and adopted a
compromise. The 2of12inf list includes all plurals, but with the
plurals of uncountable nouns marked, making it easy to remove them
if they are not wanted. That left the issue of how to establish
countability. Five of my source dictionaries included information
on countability, which was adequate to decide the status of most of
the included nouns. As for the rest, as usual, I used my best
judgment. I will confess to occasionally overriding the source
dictionaries when I believed they were clearly incorrect. (For
instance, I chose not to mark the word "hatreds" as an uncountable
plural, in defiance of the opinion of all my sources, on the grounds
that it has been used in too many news stories from Bosnia to be
considered unusual.) It is interesting to note that most of the
plurals I added from auxiliary sources were of words considered
uncountable.
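Removing the marked plurals takes only a few lines. The Python sketch below assumes one entry per line with the "%" marker attached directly to the word; the function name is invented, and the sample entries are illustrative guesses at the format rather than quotations from 2of12inf.txt.

```python
def strip_uncountable_plurals(lines, keep_marked=False):
    """Yield words from 2of12inf-style lines.

    Entries ending in "%" are plurals of uncountable nouns. They are
    dropped by default, or kept with the marker removed when
    keep_marked is True.
    """
    for line in lines:
        entry = line.strip()
        if not entry:
            continue
        if entry.endswith("%"):
            if keep_marked:
                yield entry[:-1]
        else:
            yield entry

# Invented sample; "hatreds" is unmarked, as described above.
sample = ["mud", "muds%", "hatred", "hatreds", "zeal", "zeals%"]

print(list(strip_uncountable_plurals(sample)))
# ['mud', 'hatred', 'hatreds', 'zeal']
print(list(strip_uncountable_plurals(sample, keep_marked=True)))
# ['mud', 'muds', 'hatred', 'hatreds', 'zeal', 'zeals']
```

The default suits uses that want only conventional plurals, while keep_marked=True gives the all-inclusive list that word games tend to prefer.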
The difficulties listed above, and the fact that I was forced to
exercise personal judgment frequently in creating it, emphasize a
fundamental difference between this list and the other 12dicts lists.
I have tried to make the 6of12 and 2of12 lists reflect only the source
dictionaries, and to keep my own judgments and opinions out of the
picture (except for my addition of signature words). This has proved
impossible to achieve for the 2of12inf list, which accordingly
represents a less authoritative and more arbitrary collection.
Additionally, the 2of12inf list has undergone less proofreading and
validation than the other lists, and I suspect the error rate is
considerably higher than the idealistic goal of 0.02 % I advocate
elsewhere in this document. Nevertheless, I hope it may prove to be
of some use and interest.
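To put the 0.02 % figure in perspective, it is the upper end of the "1 or 2 errors per 10,000 words" standard mentioned in section IV. The small Python calculation below shows how many erroneous entries that rate would allow in each list, using the approximate list sizes quoted in this document. The function name is invented, and the results are rough because the sizes are rounded.

```python
# Approximate list sizes quoted in this document.
LIST_SIZES = {"6of12": 32_000, "2of12": 41_000, "2of12inf": 81_000}

def errors_at_rate(word_count, errors_per_10000):
    """Number of erroneous entries implied by a given error rate."""
    return word_count * errors_per_10000 / 10_000

for name, size in LIST_SIZES.items():
    # 2 errors per 10,000 words is the same as 0.02 %.
    print(name, errors_at_rate(size, 2))
```

At that rate the 2of12inf list could contain only about 16 bad entries, which gives an idea of how demanding the goal is.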
I wish to offer my special thanks to Kevin Atkinson, for supplying me
with the AGID list, and for encouraging me to add the inflections. Of
course, any errors that remain in the 2of12inf list are my own
responsibility, and should not be blamed on Kevin, AGID, or even on
Moby.
IV. Some history
It may have occurred to some to wonder about how something like the
n-dicts project came to be (though I assume that anyone who bothers
to download this archive must already have some idea that such a
project could be of interest).
Some years ago, there was a post to the sci.crypt newsgroup, on the
subject of creating PGP passphrases using randomly selected entries
from a supplied list of very short words. (If this sounds interesting,
see http://world.std.com/~reinhold/diceware.html for an expanded
version of the post.) The word list, which was extracted from
/usr/dict/words on some UNIX system, seemed to me ill-suited to
its intended purpose. It included arcane acronyms (bstj, ncr),
misspellings (diety) and words of amazing obscurity (bhoy, kombu).
I decided I could do better (and eventually did).
This caused me to start downloading English word lists, of which there
are many, from the Internet. I was not impressed by the overall
quality of these lists, and the few which were high-quality were all-
inclusive, burying the everyday words under a mountain of archaisms
and esoterica.
The flaws of the vast majority of these lists are worth recounting:
1. Failure to proofread. Many of these lists are littered with
misspellings and typos, sometimes approaching gibberish. (I
presume, for instance, that the bizarre string "nondploe",
which was found in a purported Scrabble word list, is a typo
for something more or less legitimate, but I have no idea what.)
Working on my own lists has helped me understand that 100 %
accuracy is a very demanding goal, seldom actually achieved, but
I still feel it reasonable to expect no more than 1 or 2 errors
per 10,000 words.
2. Acceptance of completely undocumented lazy spellings, such as
"bullseye" and "courtmartial".
3. Failure to respect capitalization.
4. Failure to distinguish abbreviations from other entries.
5. Treating esoteric computer jargon, and especially UNIX jargon,
as everyday English. (Beware any list which includes "emacs",
"inode" and "lvalue".)
6. Apparently random word selection. The various /usr/dict/words
files are compendia of all the above sins. Noteworthy is the
inclusion of a large set of apparently randomly chosen personal
names (uncapitalized, of course, and missing "wanda", "marge",
"polly" and "sid").
7. Inconsistent inflection. Some lists include all inflections of
their vocabulary, while others include only singulars and
infinitives. Either policy is fine, and has its advantages. I
am personally very annoyed when inflected forms appear at random.
I find this generally happens when a compiler merges several lists
with different characteristics, with no attempt to reconcile their
divergent styles.
8. Omission of everyday words. I've seen a list that includes
"bremsstrahlung", yet omits "log" and "beer". Or that includes
"saxophone" but not "sax", and "rhinoceros" but not "rhino". Of
course, due to my original purpose in seeking out common short
words, I found this especially annoying.
One result of my frustration with this situation was my working with
Mendel Cooper on ENABLE (for further information, check out
http://personal.riverusers.com/~thegrendel/software.html), which was
close to unique in having an active caretaker, one clearly concerned
with quality, and in being oriented towards American rather than
British English. (A high-quality list oriented towards British
rather than American English can be downloaded from the URL
http://www.bryson.demon.co.uk/wordlist.html.) But ENABLE is an
all-encompassing list and, even if it had been complete at the time
I started my search for a list of common words, it would not have been
what I wanted for that reason.
I finally decided that only starting from scratch with a systematic
approach was likely to get me what I was looking for, and that
dictionaries intended for non-native speakers of English were the
best possible source for words that are in some cases so familiar
that we never think of them. This has led to the 12dicts lists,
which I hope have managed to avoid the flaws recited above.
(I should acknowledge one form of inconsistency exhibited by the
12dicts lists, which is that sometimes related words are spelled
inconsistently. For instance, the 2of12 list contains both
"broadminded" and "broad-mindedness". This generally occurs as a
result of the methodology used to build the lists. In the case of
"broadminded", only one dictionary listed "broadmindedness", which was
therefore excluded. I felt unequal to trying to correct these
inconsistencies, some of which are real and not mere artifacts of
12dicts, such as the contrast between "self-conscious" and
"unselfconscious".)
It is possible that in the future the "n" of n-dicts will increase
again, but, in fact, consideration of an additional dictionary now
seems to result in the discovery that its vocabulary matches 12dicts
pretty closely. At the very least, this phenomenon gives me hope that
the n-dicts lists have at last met their goal, and will now be useful,
or at least interesting, to others.
The 12dicts lists were compiled by Alan Beale. I explicitly release
them to the public domain, but request acknowledgment of their use.
(Actually, the dependency of 2of12inf on AGID prevents its release
into the public domain. However, I do not impose any additional
requirements on its use beyond those imposed by AGID and its sources,
as described in agid.txt.) Feel free to send comments, suggestions,
inquiries and/or large sums of money to me at biljir@pobox.com. If
you find 12dicts useful, I'd love to hear about it.