home *** CD-ROM | disk | FTP | other *** search
-
- This program allows you to search a highly compressed list of words.
- It can be used for looking up spellings and cheating at various word
- games. Currently searches are specified by a substring (with '?' as a
- wildcard character) and/or a length.
-
- I have encoded three dictionaries for this program. The raw material
- came from ftp sites listed in the rec.puzzles.crosswords FAQ.
-
- dictionary words raw size .pdb size ratio
- ---------- ------- --------- --------- -----
- pocket 21,111 176,662 45,130 3.9
- ospd 79,339 614,670 100,561 6.1
- unabrd 235,544 2,492,342 516,428 4.8
-
- Compression is performed in two steps. First the beginning of each
- word is stripped off and replaced by a count of how many characters to
- copy from the beginning of the previous word. Then the file is
- compressed using context-sensitive huffman codes (different codes are
- used depending on the previous character.)
-
- To evaluate the performance of the context-sensitive huffman codes, I
- compressed the prefix-stripped ospd file using several other methods.
-
- program size comments
- -------- ------ ---------
- ha 78253 PPM (arithmetic coding, variable length context)
- bzip 79607 Burroughs-Wheeler
- gzip 97409 LZ77 program, same algorithm as pkzip
- lzo 132179 LZ77 library, used flags -m972 -b10000
-
-
- PPM and Burroughs-Wheeler aren't suitable for the Pilot because they
- need a lot of working memory. The lzo library by Markus Franz Xaver
- Johannes Oberhumer is attractive because it uses no memory and is very
- fast, but the compression isn't as good as the context-sensitive
- huffman codes that I used. It seems to me that if you are going to
- start trading compression away for speed, then you should go all the
- way to the data structures described in the article below. Searches
- could be hundreds of times faster while the dictionary would be less
- than twice as big.
-
- Andrew Appel and Guy Jacobsen, "The World's Fastest Scrabble Program",
- Communications of the ACM, May 1988.
-