#!/bin/sh -xe
# README.linux.words - file used to create linux.words
# Created: Wed Mar 10 09:12:49 1993 by faith@cs.unc.edu (Rik Faith)
# Revised: Sat Mar 13 17:02:08 1993 by faith@cs.unc.edu
#
# Care was taken to be sure that the linux.words list was free of
# copyright. This makes linux.words a suitable /usr/dict/words
# replacement for the Linux community.
#
# Since the majority of the words are from Tanenbaum's minix.dict file,
# the notice from Barry Brachman, included below, should accompany any
# redistribution of this list.
#
# Here is a detailed explanation of how I created the linux.words file.
#
# This README.linux.words file is actually a shell script that you can use to
# recreate the linux.words file from original sources.
#
# First, I started with minix.dict
# from cs.ubc.ca:/pub/local/src/sp-1.5/wordlists-1.0.tar.Z
#
# The following is from the NOTES file in wordlists-1.0.tar.Z:
# NOTES> These word lists were collected by Barry Brachman
# NOTES> <brachman@cs.ubc.ca> at the University of British Columbia. They
# NOTES> may be freely distributed as long as this notice accompanies them.
# NOTES>
# NOTES> ==================================================================
# NOTES> Info for minix.dict:
# NOTES>
# NOTES> Article 1997 of comp.os.minix:
# NOTES> From: ast@botter.UUCP
# NOTES> Subject: A spelling checker for MINIX
# NOTES> Date: 6 Jan 88 22:28:22 GMT
# NOTES> Reply-To: ast@cs.vu.nl (Andy Tanenbaum)
# NOTES> Organization: VU Informatica, Amsterdam
# NOTES>
# NOTES> This dictionary is NOT based on the UNIX dictionary so it is free
# NOTES> of AT&T copyright. I built the dictionary from three sources.
# NOTES> First, I started by sorting and uniq'ing some public domain
# NOTES> dictionaries. Second, as some of you probably know, I have
# NOTES> written somewhere between 3 and 6 books (depending on precisely
# NOTES> what you count) and an additional 50 published papers on operating
# NOTES> systems, networks, compilers, languages, etc. This data base,
# NOTES> which is online, is nonnegligible :-) Finally, I added a number of
# NOTES> words that I thought ought to be in the dictionary including all
# NOTES> the U.S. states, all the European and some other major countries,
# NOTES> principal U.S. and world cities, and a bunch of technical terms.
# NOTES> I don't want my spelling checker to barf on arpanet, diskless,
# NOTES> modem, login, internetwork, subdirectory, superuser, vlsi, or
# NOTES> winchester just because Webster wouldn't approve of them. All in
# NOTES> all, the dictionary is over 40,000 words. If you have any
# NOTES> suggestions for additions or deletions, please post them. But
# NOTES> please be sure you are not infringing on anyone's copyright in
# NOTES> doing so.
# NOTES>
# NOTES> Andy Tanenbaum (ast@cs.vu.nl)
#
# The main problem with minix.dict is that many proper names are not
# capitalized. So, I got english.tar.Z from ftp.uu.net:/doc/dictionaries,
# which is a mirror of nic.funet.fi:/pub/unix/security/dictionaries.
#
# Here is part of the README file for english.tar.Z:
# README>
# README> FILE: english.words
# README> VERSION: DEC-SRC-92-04-05
# README>
# README> EDITOR
# README>
# README> Jorge Stolfi <stolfi@src.dec.com>
# README> DEC Systems Research Center
# README>
# README> AUTHORS OF ORIGINAL WORDLISTS
# README>
# README> Andy Tanenbaum <ast@cs.vu.nl>
# README> Barry Brachman <brachman@cs.ubc.ca>
# README> Geoff Kuenning <geoff@itcorp.com>
# README> Henk Smit <henk@cs.vu.nl>
# README> Walt Buehring <buehring%ti-csl@csnet-relay>
#
# [stuff deleted]
#
# README> AUXILIARY LISTS
# README>
# README> In the same directory as english.words there are a few
# README> complementary word lists, all derived from the same sources
# README> [1--8] as the main list:
# README>
# README> english.names
# README>
# README> A list of common English proper names and their derivatives.
# README> The list includes: person names ("John", "Abigail",
# README> "Barrymore"); countries, nations, and cities ("Germany",
# README> "Gypsies", "Moscow"); historical, biblical and mythological
# README> figures ("Columbus", "Isaiah", "Ulysses"); important
# README> trademarked products ("Xerox", "Teflon"); biological genera
# README> ("Aerobacter"); and some of their derivatives ("Germans",
# README> "Xeroxed", "Newtonian").
# README>
# README> misc.names
# README>
# README> A list of foreign-sounding names of persons and places
# README> ("Antonio", "Albuquerque", "Balzac", "Stravinski"), extracted
# README> from the lists [1--8]. (The distinction between
# README> "English-sounding" and "foreign-sounding" is of course rather
# README> arbitrary).
# README>
# README> org.names
# README>
# README> A short list of names of corporations and other institutions
# README> ("Pepsico", "Amtrak", "Medicare"), and a few derivatives.
# README>
# README> The file also includes some initialisms --- acronyms and
# README> abbreviations that are generally pronounced as words rather
# README> than spelled out ("NASA", "UNESCO").
# README>
# README> english.abbrs
# README>
# README> A list of common abbreviations ("etc.", "Dr.", "Wed."),
# README> acronyms ("A&M", "CPU", "IEEE"), and measurement symbols
# README> ("ft", "cm", "ns", "kHz").
# README>
# README> english.trash
# README>
# README> A list of words from the original wordlists
# README> that I decided were either wrong or unsuitable for inclusion
# README> in the file english.words or any of the other auxiliary
# README> lists. It includes
# README>
# README> typos ("accupy", "aquariia", "automatontons")
# README> spelling errors ("abcissa", "alleviater", "analagous")
# README> bogus derived forms ("homeown", "unfavorablies", "catched")
# README> uncapitalized proper names ("afghanistan",
# README> "algol", "decnet")
# README> uncapitalized acronyms ("apl", "ccw", "ibm")
# README> unpunctuated abbreviations ("amp", "approx", "etc")
# README> British spellings ("advertize", "archaeology")
# README> archaic words ("bedight")
# README> rare variants ("babirousa")
# README> unassimilated foreign words ("bambino", "oui", "caballero")
# README> mis-hyphenated compounds ("babylike", "backarrows")
# README> computer keywords and slang ("lconvert", "noecho", "prog")
# README>
# README> (I apologize for excluding British spellings. I should have
# README> split the list in three sublists--- common English, British,
# README> American---as ispell does. But there are only so many hours
# README> in a day...)
# README>
# README> english.maybe
# README>
# README> A list of about 5,000 lowercase words from the "mts.dict"
# README> wordlist [6] that weren't included in english.words.
# README>
# README> This list seems to include lots of "trash", like
# README> uncapitalized proper names and weird words. It would
# README> take me several days to sort this mess, so I decided to
# README> leave it as a separate file. Use at your own risk...
#
# [stuff deleted]
#
# README> (NON-)COPYRIGHT STATUS
# README>
# README> To the best of my knowledge, all the files I used to build these
# README> wordlists were available for public distribution and use, at least
# README> for non-commercial purposes. I have confirmed this assumption with
# README> the authors of the lists, whenever they were known.
# README>
# README> Therefore, it is safe to assume that the wordlists in this
# README> package can also be freely copied, distributed, modified, and
# README> used for personal, educational, and research purposes. (Use of
# README> these files in commercial products may require written
# README> permission from DEC and/or the authors of the original lists.)
# README>
# README> Whenever you distribute any of these wordlists, please distribute
# README> also the accompanying README file. If you distribute a modified
# README> copy of one of these wordlists, please include the original README
# README> file with a note explaining your modifications. Your users will
# README> surely appreciate that.
# README>
# README> (NO-)WARRANTY DISCLAIMER
# README>
# README> These files, like the original wordlists on which they are
# README> based, are still very incomplete, uneven, and inconsistent, and
# README> probably contain many errors. They are offered "as is" without
# README> any warranty of correctness or fitness for any particular
# README> purpose. Neither I nor my employer can be held responsible for
# README> any losses or damages that may result from their use.
#
# subtract english.trash from minix.dict
cat minix.dict english.trash english.trash | sort | uniq -u > dict.1
# subtract english.maybe
cat dict.1 english.maybe english.maybe | sort | uniq -u > dict.2
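#
# The "cat A B B | sort | uniq -u" idiom used in the two steps above acts
# as set subtraction: every word of B occurs at least twice in the
# combined stream, so uniq -u (which prints only lines that occur exactly
# once) drops it, leaving the words of A that are not in B. A word that
# is duplicated inside A itself is dropped as well. Below is a minimal
# sketch of the idiom as a function. The name subtract_words is
# hypothetical and nothing in this script calls it.
subtract_words () {
    # $1 = word list to keep, $2 = word list to remove; result on stdout
    cat "$1" "$2" "$2" | sort | uniq -u
}
# Usage sketch: subtract_words minix.dict english.trash > dict.1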
# build subtraction list of proper names and abbreviations
cat english.names misc.names org.names computer.names english.abbrs > sub.1
tr 'A-Z' 'a-z' < sub.1 | sort | uniq -u > sub.2
# subtract proper names with incorrect capitalization
cat dict.2 sub.2 sub.2 | sort | uniq -u > dict.3
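#
# The step above removes entries such as an all-lowercase city name:
# lowercasing the name lists yields exactly the spelling a miscapitalized
# proper name would have in the dictionary, and that spelling is then
# subtracted. A minimal sketch of the same idea as one pipeline follows.
# The function name strip_lowercased_names is hypothetical and is not
# called by this script. As an assumption that differs from the step
# above, it doubles each lowercased name with "sed p" instead of writing
# an intermediate file and listing it twice.
strip_lowercased_names () {
    # $1 = dictionary, $2 = list of correctly capitalized names
    {
        cat "$1"
        # "sed p" prints every input line twice
        tr 'A-Z' 'a-z' < "$2" | sed p
    } | sort | uniq -u
}
# Usage sketch: strip_lowercased_names dict.2 english.names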
# build proper name list without possessives
cat english.names misc.names org.names computer.names | grep -F -v "'s" > names.1
# add in proper names (use sort twice to get uppercase before lowercase)
cat dict.3 names.1 | sort | sort -df | uniq > linux.words
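#
# The double sort above can be read as follows: the first sort orders
# the lines bytewise, which puts capitalized forms first, and the second
# sort (-d dictionary order, -f fold case) interleaves them with the
# lowercase words while keeping "Apple" ahead of "apple". Below is a
# minimal sketch of that merge. The function name merge_words is
# hypothetical and is not called by this script. Forcing LC_ALL=C is an
# assumption added here to make the ordering predictable; the pipeline
# above uses whatever locale is in effect.
merge_words () {
    # print the union of the given word lists, one copy of each line
    cat "$@" | LC_ALL=C sort | LC_ALL=C sort -df | uniq
}
# Usage sketch: merge_words dict.3 names.1 > linux.words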
# clean up
rm dict.[123] sub.[12] names.1