I. Introduction

The three lists contained in this archive are the product of the "n-dicts" project (n being a variable whose value is currently 12). The purpose of this project is to create a list of words which approximates the common core of the vocabulary of American English. The methodology of the project is to record and correlate the words listed in a number of small dictionaries.

The number of dictionaries so recorded is now 12, comprising 8 ESL (English as a Second Language) dictionaries and 4 "desk dictionaries". The dictionaries chosen vary widely by publisher, by style, by completeness and by depth. One of them is a British dictionary with an international bent; the remainder are dictionaries of American English (three from British publishers). The smallest of them contains about 20,000 entries, and the largest 44,000. (All totalled, there are about 76,000 entries, many of which appear in only a single dictionary.) All but two of them were published in the last six years.

I hereby dub this edition of 12dicts, finalized October 7, 2000, as version 2.0. It differs from previous versions primarily by inclusion of an additional word list. Additionally, there have been many error corrections, as well as changes resulting from new editions of some of the source dictionaries.

II. The 6of12 and 2of12 lists

I tried two different ways of winnowing this data to produce lists of common words. Both have produced interesting results, included herein.

One list, the 6of12 list, contains all words and phrases listed in at least 6 of the 12 dictionaries. One way of describing this list is that it contains those words and phrases which a (seeming) majority of lexicographers believe are relevant to people learning English, and/or to everyday usage. This list contains about 32,000 words and phrases.
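The winnowing described above comes down to counting, for each entry, how many source dictionaries list it, and keeping the entries whose count reaches a threshold (6, in the case of the 6of12 list). The Python sketch below illustrates only that counting step. It is not the procedure actually used to build the lists: the function name, the toy data and the representation of each dictionary as a set of headwords are all assumptions made for illustration.

```python
from collections import Counter


def entries_in_at_least(dictionaries, threshold):
    """Return, sorted, every entry listed in at least `threshold` dictionaries.

    `dictionaries` is an iterable of collections of headwords, one per
    source dictionary (a hypothetical representation for this sketch).
    """
    counts = Counter()
    for headwords in dictionaries:
        # set() guards against counting an entry twice for one source.
        counts.update(set(headwords))
    return sorted(word for word, n in counts.items() if n >= threshold)


# Toy data: three made-up "dictionaries", not the real sources.
toy_sources = [
    {"beer", "log", "rhino"},
    {"beer", "log", "bremsstrahlung"},
    {"beer", "rhino"},
]

print(entries_in_at_least(toy_sources, 2))
```

With the real data the threshold would be 6 and there would be 12 sets; the exclusions and hand-made additions discussed in this section would still have to be applied separately.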
The other list, the 2of12 list, is more inclusive in that it includes words listed in as few as two of the source dictionaries, but less inclusive in that it excludes items of various sorts, including multiword phrases, proper names and abbreviations. This list contains about 41,000 words. It is perhaps more suitable for use in areas like spell checking or word games than the 6of12 list. (Honesty compels me to admit that neither of these lists is, by itself, a good choice for spell checking, due to the absence of inflections, proper names, Roman numerals, etc.)

A third list, 2of12inf.txt, is of a rather different character, and is discussed later.

A more precise description of the criteria by which the above lists were composed is as follows:

1. The 6of12 list contains all non-excluded words and phrases which appear in 6 or more of the source dictionaries.
2. Prefixes and suffixes are excluded. Abbreviations are included; however, if they are entirely lower-case and alphabetic, they are terminated with a colon (":") so they can be easily distinguished from regular words.
3. Inflections of included words are not themselves included unless they are separately defined or irregular.
4. It sometimes occurs that different spellings of the same word are listed in 6 or more dictionaries, even though no single form is so listed. In this case, if one spelling is clearly more accepted, this spelling and this spelling only is listed. If all spellings seem equally accepted, one spelling has been selected arbitrarily for inclusion.
5. The 6of12 list contains a significant number of words which do not meet either criterion 1 or 4. These words, sometimes called "signature words", are discussed below. All of these words are listed in at least one of the source dictionaries.
6. In addition to the ":" suffix discussed above, other special suffix characters are used to mark words with certain characteristics, as discussed below.

1. The 2of12 list contains all non-excluded words which appear in at least 2 of the source dictionaries.
2. This list excludes capitalized words, multiword phrases, and abbreviations, as well as prefixes and suffixes. It does not exclude hyphenated words or contractions. If a word occurs in both a hyphenated and an unhyphenated form, the unhyphenated form is listed, even if the hyphenated form is generally preferred.
3. The list excludes spellings which are considered (by a majority of the dictionaries listing them) to be non-American usage. It also excludes secondary spellings which are mentioned by fewer than four of the source dictionaries.
4. Inflections of included words are not themselves included unless they are separately defined, or irregular.
5. Several of the source dictionaries include listings for obscure currencies, such as markka, khoum and ngwee. I was unable to regard such words as part of the English "core vocabulary", and so I required citation in over a third of the dictionaries for inclusion of monetary units. A side-effect was the elimination of the word "lepton", which, in addition to its use in particle physics, is also .01 Greek drachmas.
6. This list also includes a small number of signature words, as discussed below.

As indicated, both lists have been augmented with words (and, in the case of the 6of12 list, phrases) which fail to meet the formal requirements for inclusion. In the case of the 6of12 list, 1024 words were added (about 3 % of the total). These are all words which, in the judgment of the compiler, are as familiar as many of the words which met the criteria for inclusion. Examples of some of the sorts of words which were added are:

1. Words of the same category as other included words. An example is the astrological sign "Cancer", which alone of all the astrological signs fails to appear in 6 or more of the dictionaries. Similarly added were the omitted holidays "Thanksgiving" and "Valentine's Day".
2. Vulgarities, sexual terms and insults. Some such words were already included, but most of the source dictionaries were quite squeamish about them. These words are very widely known indeed; I hold that any list of "common" words which does not include the infamous f-word is simply discredited thereby. Some may feel that it would have been better to leave some or all of these terms unmentioned. Nevertheless, the expression of blasphemy, unwarranted contempt, and perverse lust, whether in words or in deeds, is a very human trait. Suppressing the evidence of these aspects of the human condition in our language makes no more sense than excluding "leprosy", "gangrene" and "dementia", no matter how unpleasant they may be to contemplate.
3. Conventional conversational phrases so common as to be practically invisible to native speakers. Examples are "thank you", "good night", "uh-huh", "of course" and "gesundheit".
4. Sports terminology, especially for football and baseball. (If I, who am practically sports-blind, noticed this deficiency, it must be of major proportions indeed.)

Note that the signature words in the 6of12 list can be identified via the suffix character "+", and eliminated if desired.

A much smaller set of words (64) was added to the 2of12 list. These were of two sorts:

1. Signature words from the 6of12 list which were not already present in the 2of12 list, and which are not excluded due to being abbreviations, phrases, etc.
2. Inflections of irregular verbs not explicitly mentioned in 2 source dictionaries, such as "outfought" and "reheard".

Some of the 6of12 list entries are annotated with a suffix character, giving additional information about the associated word. The annotations can be easily removed with an editor or script if they are unwanted. These annotations are:

: - The word is an otherwise unmarked abbreviation. This suffix may appear in combination with another suffix.
& - The word is primarily a non-American usage.
# - The word is generally held to be a variant or less preferred form of another word.
< - This form of a word is held to be the primary form by fewer dictionaries than some other form of the word.
^ - This form of the word was selected arbitrarily from a set of variants, none of which was clearly preferred.
= - Roughly, this indicates a "second class" word. More precisely, the word falls into one of the following classes:
    a. The word is an inflection which was defined in the same entry as the base word.
    b. The word is a derived word (-ly, -ness or -er/or) which was not defined in a separate entry.
    c. The word appeared in a list of undefined words with a common prefix, such as un- or re-.
+ - The word is a signature word.

The words in the 2of12 list are not annotated.

III. The 2of12inf list

The 2of12inf list is of a rather different character. Conceptually, it is simple. It consists of all the words in the 2of12 list, plus their inflections, amounting to about 81,000 words. This list may be more useful than the other lists for applications like word games. It was created to help Kevin Atkinson in his Aspell and SCOWL projects (for which, see http://aspell.sourceforge.net).

Unlike the 6of12 and 2of12 lists, this list is not based exclusively on the contents of my 12 source dictionaries, and for this reason it has, I feel, less authority than the other 12dicts lists. It also probably has a significantly higher error rate than the other lists, for reasons explained below.

The criteria defining the 2of12inf list are as follows:

1. The 2of12inf list contains all non-excluded words which appear in at least 2 of the source dictionaries.
2. This list excludes capitalized words, multiword phrases, abbreviations, contractions, hyphenated words and single-letter words, as well as prefixes and suffixes.
3. The list does not exclude secondary spellings, non-American usages or monetary units.
4. The list includes inflections of all included words. Any inflection mentioned or clearly implied by any of the source dictionaries is included (i.e., two citations are not required). Additionally, some inflections have been added from other sources.
5. Plurals of "uncountable" nouns were included, annotated with the "%" suffix character. See below for an extended discussion of the inclusion of these words.
6. Signature words from the other lists, plus their inflections, were added. No other signature words were added.

Though the 2of12inf list still consists mostly of very common words, criteria 3 through 5 cause the 2of12inf list to contain a greater proportion of unfamiliar and unusual words than the other 12dicts lists.

The 2of12inf list was not derived directly from the 12 source dictionaries. The starting point was a subset of Kevin Atkinson's AGID list, a list of words, parts of speech and inflections derived from public-domain sources, notably Moby Words and WordNet. (See the file agid.txt in this archive, which is a copy of the AGID "readme", for more information on the antecedents of AGID.) 2of12inf was created by a process of editing the AGID subset to remove spurious entries and those which reflected a more esoteric English vocabulary than the other 12dicts lists, and to add inflections which AGID failed to identify.

This process required significantly less effort than would have been needed to derive the list directly from the source dictionaries. Unfortunately, a side effect of the process is that the result is likely to be somewhat less reliable than the other 12dicts lists. In particular, Moby Words is notoriously unreliable, and I find it unlikely that I have successfully identified all the spurious inflections its use has introduced. It is my hope in the future to release another edition of 2of12inf which is not derived from AGID, and therefore not "infected" by Moby Words.

Ideally, the 2of12inf list would contain only inflections listed in one of the 12dicts source dictionaries.
This proved not to be practical. The reason for this has to do with the nature of these sources, which are mostly ESL dictionaries. An ESL dictionary might well list the word "esophagus", but, because an English learner is unlikely to need to talk about this organ in the plural, it will probably not bother to list the plural form "esophagi". For words of this sort, I therefore needed to obtain their inflections from other sources. Obviously, the decisions on when to include additional inflections were judgment calls, as were the choices of which inflections to add.

Adjectival inflections (comparatives and superlatives) proved to be an especially annoying problem. Only 2 of my 12 source dictionaries provided remotely reliable information of this sort. In fact, such information is sparse and inconsistent in most dictionaries of any size. I relied on a small set of additional dictionaries for this information, which was mostly disjoint from the sources for plurals and verb forms. Several of these sources were Scrabble(r)-related, and therefore inclined to include forms of little plausibility such as "iller/illest" or "fertiler/fertilest". Accordingly, I ended up rejecting some of the documented inflections on grounds of implausibility. I have no doubt that, in the process, I made a number of errors of both inclusion and exclusion and, in any case, many of the forms listed have no connection with any of the 12dicts source dictionaries.

One additional problem in the creation of the 2of12inf list was that of "uncountable" nouns and their plurals. Some English dictionaries, especially ESL dictionaries, as well as other linguistic sources, attest to the existence of nouns which cannot be counted, or used in the plural. Examples of such nouns include "mud", "rayon", "oregano", "chess", "fairness", "wisdom", "aluminum", "training", "materialism" and "chickenpox".
This is an entirely commonsense notion, but a difficulty is the fact that the boundary between the countable and the uncountable is extremely vague and ill-defined. For example, the word "coffee" is ordinarily uncountable, but not when ordering in a restaurant, as is the word "symmetry", except in physics or math. In general, it is possible to contrive a context where use of the plural of any noun whatsoever is reasonable.

An alternate position, therefore, is that in fact no nouns are uncountable, and that any noun which is not already plural possesses a plural. This position is especially useful in the context of word games, where words such as "zeals" and "anthraxes" may produce large scores. For this reason, the official Scrabble dictionaries list words such as "thens", "onces" and "mankinds", which most people find rather unreasonable.

The fact that the 2of12inf list might well be useful in gaming contexts, together with the fact that the boundary between countable and uncountable nouns is so ill-defined, served as a powerful argument for inclusion of all plural forms, whether commonly used or not, while its derivation from ESL sources argued for including only the plurals of countable nouns, however distinguished. In the end, I was unable to resolve this dilemma, and adopted a compromise. The 2of12inf list includes all plurals, but with the plurals of uncountable nouns marked, making it easy to remove them if they are not wanted.

That left the issue of how to establish countability. Five of my source dictionaries included information on countability, which was adequate to decide the status of most of the included nouns. As for the rest, as usual, I used my best judgment. I will confess to occasionally overriding the source dictionaries when I believed they were clearly incorrect.
(For instance, I chose not to mark the word "hatreds" as an uncountable plural, in defiance of the opinion of all my sources, on the grounds that it has been used in too many news stories from Bosnia to be considered unusual.) It is interesting to note that most of the plurals I added from auxiliary sources were of words considered uncountable.

The difficulties listed above, and the fact that I was forced to exercise personal judgment frequently in creating it, emphasize a fundamental difference between this list and the other 12dicts lists. I have tried to make the 6of12 and 2of12 lists reflect only the source dictionaries, and to keep my own judgments and opinions out of the picture (except for my addition of signature words). This has proved impossible to achieve for the 2of12inf list, which accordingly represents a less authoritative and more arbitrary collection. Additionally, the 2of12inf list has undergone less proofreading and validation than the other lists, and I suspect the error rate is considerably higher than the idealistic goal of 0.02 % I advocate elsewhere in this document. Nevertheless, I hope it may prove to be of some use and interest.

I wish to offer my special thanks to Kevin Atkinson, for supplying me with the AGID list, and for encouraging me to add the inflections. Of course, any errors that remain in the 2of12inf list are my own responsibility, and should not be blamed on Kevin, AGID, or even on Moby.

IV. Some history

It may have occurred to some to wonder about how something like the n-dicts project came to be (though I assume that anyone who bothers to download this archive must already have some idea that such a project could be of interest).

Some years ago, there was a post to the sci.crypt newsgroup, on the subject of creating PGP passphrases using randomly selected entries from a supplied list of very short words. (If this sounds interesting, see http://world.std.com/~reinhold/diceware.html for an expanded version of the post.)
The word list, which was extracted from /usr/dict/words on some UNIX system, seemed to me ill-suited to its intended purpose. It included arcane acronyms (bstj, ncr), misspellings (diety) and words of amazing obscurity (bhoy, kombu). I decided I could do better (and eventually did).

This caused me to start downloading English word lists, of which there are many, from the Internet. I was not impressed by the overall quality of these lists, and the few which were high-quality were all-inclusive, burying the everyday words under a mountain of archaisms and esoterica. The flaws of the vast majority of these lists are worth recounting:

1. Failure to proofread. Many of these lists are littered with misspellings and typos, sometimes approaching gibberish. (I presume, for instance, that the bizarre string "nondploe", which was found in a purported Scrabble word list, is a typo for something more or less legitimate, but I have no idea what.) Working on my own lists has helped me understand that 100 % accuracy is a very demanding goal, seldom actually achieved, but I still feel it reasonable to expect no more than 1 or 2 errors per 10,000 words.
2. Acceptance of completely undocumented lazy spellings, such as "bullseye" and "courtmartial".
3. Failure to respect capitalization.
4. Failure to distinguish abbreviations from other entries.
5. Treating esoteric computer jargon, and especially UNIX jargon, as everyday English. (Beware any list which includes "emacs", "inode" and "lvalue".)
6. Apparently random word selection. The various /usr/dict/words files are compendia of all the above sins. Noteworthy is the inclusion of a large set of apparently randomly chosen personal names (uncapitalized, of course, and missing "wanda", "marge", "polly" and "sid").
7. Inconsistent inflection. Some lists include all inflections of their vocabulary, while others include only singulars and infinitives. Either policy is fine, and has its advantages.
I am personally very annoyed when inflected forms appear at random. I find this generally happens when a compiler merges several lists with different characteristics, with no attempt to reconcile their divergent styles.
8. Omission of everyday words. I've seen a list that includes "bremsstrahlung", yet omits "log" and "beer". Or that includes "saxophone" but not "sax", and "rhinoceros" but not "rhino". Of course, due to my original purpose in seeking out common short words, I found this especially annoying.

One result of my frustration with this situation was my working with Mendel Cooper on ENABLE (for further information, check out http://personal.riverusers.com/~thegrendel/software.html), which was close to unique in having an active caretaker, one clearly concerned with quality, and in being oriented towards American rather than British English. (A high-quality list oriented towards British rather than American English can be downloaded from the URL http://www.bryson.demon.co.uk/wordlist.html.) But ENABLE is an all-encompassing list and, even if it had been complete at the time I started my search for a list of common words, it would not have been what I wanted for that reason.

I finally decided that only starting from scratch with a systematic approach was likely to get me what I was looking for, and that dictionaries intended for non-native speakers of English were the best possible source for words that are in some cases so familiar that we never think of them. This has led to the 12dicts lists, which I hope have managed to avoid the flaws recited above.

(I should acknowledge one form of inconsistency exhibited by the 12dicts lists, which is that sometimes related words are spelled inconsistently. For instance, the 2of12 list contains both "broadminded" and "broad-mindedness". This generally occurs as a result of the methodology used to build the lists. In the case of "broadminded", only one dictionary listed "broadmindedness", which was therefore excluded.
I felt unequal to trying to correct these inconsistencies, some of which are real and not mere artifacts of 12dicts, such as the contrast between "self-conscious" and "unselfconscious".)

It is possible that in the future the "n" of n-dicts will increase again, but, in fact, consideration of an additional dictionary now seems to result in the discovery that its vocabulary matches 12dicts pretty closely. At the very least, this phenomenon gives me hope that the n-dicts lists have at last met their goal, and will now be useful, or at least interesting, to others.

The 12dicts lists were compiled by Alan Beale. I explicitly release them to the public domain, but request acknowledgment of their use. (Actually, the dependency of 2of12inf on AGID prevents its release into the public domain. However, I do not impose any additional requirements on its use beyond those imposed by AGID and its sources, as described in agid.txt.)

Feel free to send comments, suggestions, inquiries and/or large sums of money to me at biljir@pobox.com. If you find 12dicts useful, I'd love to hear about it.
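
A postscript to sections II and III: the suffix annotations described there can be stripped, or used to filter out unwanted entries, with a few lines of script. The Python sketch below is one way to do it, not a tool shipped with the archive. It assumes one entry per line and that the annotation characters occur only at the end of an entry; the function names are hypothetical, and the sample entries are illustrative rather than quoted from the lists.

```python
# Suffix characters named in sections II and III of this document.
ANNOTATION_CHARS = ":&#<^=+%"


def split_entry(line):
    """Split one list line into (word, annotations)."""
    entry = line.strip()
    word = entry.rstrip(ANNOTATION_CHARS)
    return word, entry[len(word):]


def plain_words(lines, drop=""):
    """Yield entries with annotations removed, skipping any entry that
    carries one of the annotation characters given in `drop`."""
    for line in lines:
        word, marks = split_entry(line)
        if word and not any(ch in drop for ch in marks):
            yield word


# Illustrative entries only (assumed formats, not copied from the lists).
sample = ["abbr:", "colour&", "Cancer+", "anthraxes%", "thank you", "beer"]

print(list(plain_words(sample)))             # every entry, annotations removed
print(list(plain_words(sample, drop="+%")))  # no signature words or uncountable plurals
```

Passing drop="+" alone removes the signature words from the 6of12 list, and drop="%" removes the plurals of uncountable nouns from the 2of12inf list.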