home *** CD-ROM | disk | FTP | other *** search
- HUFFMAN2.BAS and DECODER.BAS
- By Rich Geldreich 1992
-
-
- The two QuickBASIC 4.5 programs you have just received can
- compress and decompress files with the Huffman compression algorithm.
- The first program HUFFMAN2.BAS, takes an input file and compresses it
- to OUTPUT.HUF. The second program, DECODER.BAS, takes OUTPUT.HUF and
- decompresses it a file specified via the command line. The two
- programs should be compiled for maximum speed.
-
- To compress EATME.DOC, use:
-
- huffman2 eatme.doc
-
- To decompress the compressed file to EATME2.DOC, use:
-
- decoder eatme2.doc
-
- Easy, right?
-
- I have released these two programs to show that Huffman
- compression isn't really that hard to implement. What is Huffman
- compression? It's a simple compression algorithm which takes
- advantage of a file's character distribution: if "A" is used more than
- "B", for instance, then the code representing "A" is significantly
- smaller than the code that represents "B". It's not the best, but it
- isn't that bad.
-
- Lets say we want to compress the string "ABCDDBBAAAADD". First,
- we must scan the string and tally up the count for each character:
-
- A is used 5 times
- B is used 3 times
- C is used 1 time
- D is used 4 times
-
- Now that we have this distribution, we must start making a binary
- tree that represents all of these characters; like this:
-
- (* = not used)
-
-
- 5 3 1 4
- / \ / \ / \ / \
- A * B * C * D *
-
-
- Now we must start combining "nodes" until there is only one node
- at the top. To do this we keep on combining 2 nodes which have the
- lowest frequency:
-
-
- 1 and 3 have the lowest frequency:
-
- 5 4 4
- / \ / \ / \
- A * B C D *
-
- 4 and 4 have the lowest frequency:
-
- 5 8
- / \ / \
- A * D 4
- / \
- B C
-
- And finally, 5 and 8 have the lowest frequency:
-
- 13
- / \
- A 8
- / \
- D 4
- / \
- B C
-
- To represent this tree, both programs use a linked list. To
- represent each node, the Father() array is used. To represent the
- left and right arrows LeftSon() and RightSon() arrays are used.
-
- No that we have the combined tree, we must travel down the tree
- and find out which bit sequence represents each character we wish to
- compress. (A recursive procedure is used to do this.) We start at the
- top, and every time we go left we send a "1" and every time we go
- right we send a "0". So "A" will be represented with "1", "B" with
- "001", "C" with "000", & "D" with "01". Notice that the character
- used most in the original string, "A", has the smallest code size, and
- the character used least, "C", has the biggest code size.
-
- A = 1 [used in 38% of the string]
- B = 001 [used in 23%]
- C = 000 [used in 7%]
- D = 01 [used in 31%]
-
- "ABCDDBBAAAADD" will be compressed as:
-
- 1 001 000 01 01 001 001 1 1 1 1 01 01
-
- or
-
- 10010000 10100100 11111010 1
-
- ...so the original 13 character sequence has been compressed to
- just 4 bytes.
-
- To decompress this 4 byte sequence, the decoder must have the
- encoding tree handy(which must be send separately).
-
- The encoding tree was:
-
- 13
- / \
- A 8
- / \
- D 4
- / \
- B C
-
- The output was:
-
- 10010000 10100100 11111010 1
-
- Start at the top. The first bit is "1"- so go left. The first
- character is then "A". So far so good! Now we go back to the top and
- start all over...
-
- The second bit is "0"- so go right. We're now at "8".
- The third bit is "0"- so go right again. We're now at "4".
- The fourth bit is "1"- so go left. We're now at "B". "B" is the
- second character.
-
- Keep on repeating this until all of the characters have been
- decoded.
-
- That's all! Have fun. The two QB programs are in the public
- domain so do what you want with 'em. If you have any questions or
- cool ideas, I can be contacted at the following address:
-
- Rich Geldreich
- 410 Market St.
- Gloucester City, NJ 08030
- (609)-742-8752
-
-
- May 29th, 1992