NetNews Usenet Archive 1993 #3

home *** CD-ROM | disk | FTP | other *** search

/ NetNews Usenet Archive 1993 #3 / NN_1993_3.iso / spool / comp / programm / 3604 < prev next >

Wrap

Internet Message Format | 1993-01-26 | 5.2 KB

Path: sparky!uunet!cs.utexas.edu!qt.cs.utexas.edu!yale.edu!ira.uka.de!sbusol.rz.uni-sb.de!coli.uni-sb.de!coli-gate.coli.uni-sb.de!spackman From: spackman@disco-sol.dfki.uni-sb.de (Stephen Spackman) Newsgroups: comp.programming Subject: Re: C/C++ Speed Followup-To: comp.programming Date: 26 Jan 1993 14:03:01 GMT Organization: DFKI Saarbruecken GmbH, D-W 6600 Saarbruecken Lines: 75 Distribution: na Message-ID: <SPACKMAN.93Jan26150814@disco-sol.dfki.uni-sb.de> References: <1993Jan19.140118.14183@bcrka451.bnr.ca> <1993Jan20.180908.8784@tc.fluke.COM> <1993Jan21.133432.26293@bcrka451.bnr.ca> <25415@galaxy.ucr.edu> Reply-To: stephen@acm.org NNTP-Posting-Host: disco-sol.dfki.uni-sb.de In-reply-to: rhyde@ucrengr.ucr.edu's message of 25 Jan 93 17:53:16 GMT In article <25415@galaxy.ucr.edu> rhyde@ucrengr.ucr.edu (randy hyde) writes: |Trying to get fast software on a UNIX box is a joke. UNIX is not |real-time and by the time you have all the background stuff running |(e.g., X-stuff) it makes for a really slow system. People who talk |about how optimizing their code doesn't pay off, when the code they're |writing is running under UNIX, are fooling themselves. I know, I've |done a lot of this (I'm not talking about going from 10 minutes to five, |I'm talking about going from 1 sec to 0.1 sec which is very important for |highly interactive programs). Given the typical 1/60 sec time slice in |a multitasking OS (not just UNIX), it is very difficult to optimize such |machines at this level. Well, I do once recall installing a small optimisation that reduced response time to about a quarter of a second, from on the order of a hundred billion years [estimated]. I thought that was twenty minutes of programming time well spent.... Even under Unix this resulted in a *significant* increase in throughput. Oh, since this is comp.programming, I'll tell you what I was doing. The problem was to store a matrix of how often each different word is used in each of a thousand books, how often by each author, how often in each year, decade and century - in all, a 2000 * 200,000 matrix of maximally 24-bit numbers. The naive method doesn't work well, because those thousand books were already being stored on the CD along with their various indices (itself no mean feat, the uncompressed text alone was bigger than the whole CD). Not room for another gig of matrix. So first we used a couple of obvious optimisations, based on the idea that when (e.g.) an author only has two books in the database, we can store just the author summary and the smaller of the book summaries. Since most of the numbers are small, the obvious encoding for the numbers themselves is something like a variable-width mantissa preceded by a huffman-encoded exponent, but in fact you can do a lot better. Since the categories of info stored are mostly heirarchic (each decade is in some century, many authors fit inside one decade), you can choose a spanning tree for the various objects and then (a) guarantee that the count for a given word in an object is no greater than the count for the same word in its parent object (thus the number of bits USED in the parent upper-bounds the number of bits you need STORE for the child - an optimisation at its most useful in the most common case, when the frequency stored is zero, and the child needs no entry at all); and (b) estimate that the frequencies are distributed evenly by corpus size (so usually you can store the number of word-occurrences for a year in the number of bits actually used to represent the product of the number of occurrences for the decade and the proportion of the *total* volume of the decade, for all words, that falls in the year). It's worth a bit to choose between these two representations (except when reprentation a needs one bit or less :-). This method is nice and compact (not that I haven't omitted a lot of obvious statistically-based refinements) - about 10,000:1 nonlossy compression over the original matrix (well, not such a huge surprise since an incredible number of words are only used in one work, ever - because they're typos or names or really weird) but when you come to decompress it, it's all yucchy bitstreams :-). And being a good functionalist, even though this was all in C, I made all my interfaces nice clean stateless ones: to decompress the entry for a word, you (1) compute the position at which its entry starts (to do this you just read the preceding entry!), (2) read the entry for the same word in the parent object, and do the little calculation that tells you how many bits to read. The only problem is that both steps recur on the whole operation, and the damned thing is (as coded!) somewhat worse than O(n^m) (where n is the number of words, 200,000 in this case, and m is the depth of the tree - call it five). Of course, memoising the functions (with LRU caches to keep the space requirements down) - makes the whole thing linear again. The nice thing is that it doesn't mess up the code structure at all. ---------------------------------------------------------------------- stephen p spackman +49 681 302 5288(o) 5282(sec) stephen@acm.org dfki / stuhlsatzenhausweg 3 / d-w-6600 saarbruecken 11 / germany ----------------------------------------------------------------------