NetNews Usenet Archive 1992 #27

home *** CD-ROM | disk | FTP | other *** search

/ NetNews Usenet Archive 1992 #27 / NN_1992_27.iso / spool / comp / compress / 3870 < prev next >

Wrap

Internet Message Format | 1992-11-17 | 2.6 KB

Path: sparky!uunet!know!mips2!news.bbn.com!noc.near.net!news.Brown.EDU!qt.cs.utexas.edu!cs.utexas.edu!zaphod.mps.ohio-state.edu!uwm.edu!ogicse!das-news.harvard.edu!cantaloupe.srv.cs.cmu.edu!tgl From: tgl+@cs.cmu.edu (Tom Lane) Newsgroups: comp.compression Subject: Re: Need a compressor for sparse bit datastream Keywords: compression technique Message-ID: <Bxu5t8.D1B.2@cs.cmu.edu> Date: 17 Nov 92 01:07:53 GMT References: <1992Nov13.120505.29654@spectrum.xerox.com> Sender: news@cs.cmu.edu (Usenet News System) Organization: School of Computer Science, Carnegie Mellon Lines: 36 Nntp-Posting-Host: g.gp.cs.cmu.edu richard@garfield.noname (richard_landells.sbd-e@rx.xerox.com) writes: > I have an application that generates binary output. The output is > relatively random, but there are approximately twice as many off bits as on > bits. My objective is to compress this as much as possible. > I have tried several 'standard' compressors, arj 2.2, lharc, pkzip 1.1, and > have only managed to achieve very minimal compression in the order of 4% at > best (on a 40K file). Now I know that a truly random binary datastream > cannot be compressed, but I was kind of hoping for better than 4%. Am I > missing something fundamental, or is this really the best that can be > achieved? Most 'standard' compressors work by looking for repeated byte strings. If your source is really random then this won't be very effective since you won't have much repetition at the byte level. However, you should be able to exploit the 2:1 ratio of 0s to 1s to get better than 4% compression. To do this will require arithmetic compression. Huffman would work if the symbols were bigger than a bit, but since Huffman codes can't be less than one bit, you can't get any compression with single-bit input symbols. Arithmetic coding wins because it can encode source symbols into "fractional bits". Arithmetic compression code is easy to come by, see the comp.compression FAQ. (But if you want to use this for any commercial purpose, beware of patents.) But wait... if all we know about your source is that 0's have probability .67 and 1's have probability .33, then the entropy is -(.67*log2(.67) + .33*log2(.33)) or 0.92, which means you could theoretically achieve at most 8% compression this way. I'd say implementing arithmetic coding on individual bits is probably not worth the trouble. Are you sure the 0's and 1's are uncorrelated? If not, higher-order arithmetic coding might work. The fact that standard compressors can do as well as 4% suggests to me that there is more order in the input bitstream than you've let on. regards, tom lane