- Path: sparky!uunet!cs.utexas.edu!qt.cs.utexas.edu!yale.edu!ira.uka.de!sbusol.rz.uni-sb.de!coli.uni-sb.de!coli-gate.coli.uni-sb.de!spackman
- From: spackman@disco-sol.dfki.uni-sb.de (Stephen Spackman)
- Newsgroups: comp.programming
- Subject: Re: C/C++ Speed
- Followup-To: comp.programming
- Date: 26 Jan 1993 14:03:01 GMT
- Organization: DFKI Saarbruecken GmbH, D-W 6600 Saarbruecken
- Lines: 75
- Distribution: na
- Message-ID: <SPACKMAN.93Jan26150814@disco-sol.dfki.uni-sb.de>
- References: <1993Jan19.140118.14183@bcrka451.bnr.ca> <1993Jan20.180908.8784@tc.fluke.COM>
- <1993Jan21.133432.26293@bcrka451.bnr.ca> <25415@galaxy.ucr.edu>
- Reply-To: stephen@acm.org
- NNTP-Posting-Host: disco-sol.dfki.uni-sb.de
- In-reply-to: rhyde@ucrengr.ucr.edu's message of 25 Jan 93 17:53:16 GMT
-
- In article <25415@galaxy.ucr.edu> rhyde@ucrengr.ucr.edu (randy hyde) writes:
- |Trying to get fast software on a UNIX box is a joke. UNIX is not
- |real-time and by the time you have all the background stuff running
- |(e.g., X-stuff) it makes for a really slow system. People who talk
- |about how optimizing their code doesn't pay off, when the code they're
- |writing is running under UNIX, are fooling themselves. I know, I've
- |done a lot of this (I'm not talking about going from 10 minutes to five,
- |I'm talking about going from 1 sec to 0.1 sec which is very important for
- |highly interactive programs). Given the typical 1/60 sec time slice in
- |a multitasking OS (not just UNIX), it is very difficult to optimize such
- |machines at this level.
-
- Well, I do once recall installing a small optimisation that reduced
- response time to about a quarter of a second, from on the order of a
- hundred billion years [estimated]. I thought that was twenty minutes of
- programming time well spent.... Even under Unix this resulted in a
- *significant* increase in throughput.
-
- Oh, since this is comp.programming, I'll tell you what I was doing. The
- problem was to store a matrix of how often each different word is used
- in each of a thousand books, how often by each author, how often in each
- year, decade and century - in all, a 2000 * 200,000 matrix of maximally
- 24-bit numbers. The naive method doesn't work well, because those
- thousand books were already being stored on the CD along with their
- various indices (itself no mean feat, the uncompressed text alone was
- bigger than the whole CD). No room for another gig of matrix.
-
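- To make that "gig of matrix" concrete: 2000 rows times 200,000 words
- times 3 bytes per 24-bit count is 1.2e9 bytes. A trivial check in C
- (just an illustration, not code from the project):
-
-   #include <stdio.h>
-
-   int main(void)
-   {
-       /* naive dense matrix: one 24-bit (3-byte) count per (row, word) */
-       unsigned long rows = 2000UL, words = 200000UL, bytes_each = 3UL;
-       printf("%lu bytes\n", rows * words * bytes_each);  /* 1200000000 */
-       return 0;
-   }
-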
- So first we used a couple of obvious optimisations, based on the idea
- that when (e.g.) an author only has two books in the database, we can
- store just the author summary and the smaller of the book summaries.
-
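- That trick, sketched in C with made-up names (the real code was of
- course more general): since the author's per-word counts are the sums
- of the two books' counts, keeping the author row and the smaller book
- row lets the other book be recovered by subtraction.
-
-   /* Illustrative sketch only: author[] = book1[] + book2[] per word,
-      so store author[] plus the smaller book and subtract for the other. */
-   void recover_other_book(const unsigned long author[],
-                           const unsigned long smaller_book[],
-                           unsigned long other_book[],
-                           int nwords)
-   {
-       int w;
-       for (w = 0; w < nwords; w++)
-           other_book[w] = author[w] - smaller_book[w];
-   }
-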
- Since most of the numbers are small, the obvious encoding for the
- numbers themselves is something like a variable-width mantissa preceded
- by a huffman-encoded exponent, but in fact you can do a lot better.
- Since the categories of info stored are mostly hierarchic (each decade
- is in some century, many authors fit inside one decade), you can choose
- a spanning tree for the various objects and then (a) guarantee that the
- count for a given word in an object is no greater than the count for the
- same word in its parent object (thus the number of bits USED in the
- parent upper-bounds the number of bits you need to STORE for the child - an
- optimisation at its most useful in the most common case, when the
- frequency stored is zero, and the child needs no entry at all); and (b)
- estimate that the frequencies are distributed evenly by corpus size (so
- usually you can store the number of word-occurrences for a year in the
- number of bits actually used to represent the product of the number of
- occurrences for the decade and the proportion of the *total* volume of
- the decade, for all words, that falls in the year). It's worth a bit to
- choose between these two representations (except when representation (a)
- needs one bit or less :-).
-
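- In rough C, the two width rules come out something like this (a sketch
- with invented names - bit_width, width_rule_a, width_rule_b - not the
- actual code from the project):
-
-   /* Number of bits actually used by a count. */
-   static unsigned bit_width(unsigned long x)
-   {
-       unsigned w = 0;
-       while (x) { w++; x >>= 1; }
-       return w;
-   }
-
-   /* Rule (a): a child's count never exceeds its parent's, so the child
-      entry needs at most bit_width(parent_count) bits - and no entry at
-      all when the parent's count is zero. */
-   static unsigned width_rule_a(unsigned long parent_count)
-   {
-       return bit_width(parent_count);
-   }
-
-   /* Rule (b): estimate the child's count as the parent's count scaled
-      by the child's share of the parent's total volume, and use the width
-      of that estimate.  (It can undershoot, hence the selector bit.) */
-   static unsigned width_rule_b(unsigned long parent_count,
-                                unsigned long child_volume,
-                                unsigned long parent_volume)
-   {
-       unsigned long estimate = parent_count * child_volume / parent_volume;
-       return bit_width(estimate);
-   }
-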
- This method is nice and compact (though I've omitted a lot of obvious
- statistically-based refinements) - about 10,000:1 nonlossy
- compression over the original matrix (well, not such a huge surprise
- since an incredible number of words are only used in one work, ever -
- because they're typos or names or really weird) but when you come to
- decompress it, it's all yucchy bitstreams :-).
-
- And being a good functionalist, even though this was all in C, I made
- all my interfaces nice clean stateless ones: to decompress the entry for
- a word, you (1) compute the position at which its entry starts (to do
- this you just read the preceding entry!), (2) read the entry for the
- same word in the parent object, and do the little calculation that tells
- you how many bits to read. The only problem is that both steps recur on
- the whole operation, and the damned thing is (as coded!) somewhat worse
- than O(n^m) (where n is the number of words, 200,000 in this case, and m
- is the depth of the tree - call it five).
-
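- The shape of that recursion, again as a rough sketch with invented
- names (read_bits() and width_from_parent() are dummies standing in for
- the real bitstream machinery, and the tree is just a chain here):
-
-   #define ROOT 0                 /* object 0 is the root of the tree  */
-   #define ROOT_WIDTH 24          /* root counts are stored full width */
-
-   static int parent_of(int obj) { return obj - 1; }   /* toy tree */
-
-   /* dummy stand-ins so the sketch is self-contained */
-   static unsigned long read_bits(unsigned long start, unsigned width)
-       { return (start + width) % 13; }
-   static unsigned width_from_parent(unsigned long parent_count)
-       { unsigned w = 0;
-         while (parent_count) { w++; parent_count >>= 1; }
-         return w; }
-
-   static unsigned long entry_value(int obj, int word);
-
-   /* width of an entry comes from the same word's entry in the parent */
-   static unsigned entry_width(int obj, int word)
-   {
-       if (obj == ROOT)
-           return ROOT_WIDTH;
-       return width_from_parent(entry_value(parent_of(obj), word));
-   }
-
-   /* start of a word's entry = end of the preceding word's entry */
-   static unsigned long entry_start(int obj, int word)
-   {
-       if (word == 0)
-           return 0;
-       return entry_start(obj, word - 1) + entry_width(obj, word - 1);
-   }
-
-   /* both helpers recur on whole lookups, so without caching the cost
-      blows up with the depth of the tree */
-   static unsigned long entry_value(int obj, int word)
-   {
-       return read_bits(entry_start(obj, word), entry_width(obj, word));
-   }
-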
- Of course, memoising the functions (with LRU caches to keep the space
- requirements down) makes the whole thing linear again.
-
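- Continuing the sketch above (same caveats: invented names, not the
- original code), a memoised entry_value() with a small LRU cache looks
- roughly like this; in practice entry_start() wants the same treatment:
-
-   #include <string.h>
-
-   #define CACHE_SIZE 64
-
-   struct memo { int obj, word, valid; unsigned long value; };
-   static struct memo cache[CACHE_SIZE];
-
-   static unsigned long entry_value_memo(int obj, int word)
-   {
-       struct memo fresh;
-       int i;
-
-       for (i = 0; i < CACHE_SIZE; i++)
-           if (cache[i].valid && cache[i].obj == obj && cache[i].word == word) {
-               struct memo hit = cache[i];
-               /* move-to-front keeps recently used entries near the top */
-               memmove(&cache[1], &cache[0], i * sizeof(cache[0]));
-               cache[0] = hit;
-               return hit.value;
-           }
-
-       fresh.obj = obj;
-       fresh.word = word;
-       fresh.valid = 1;
-       fresh.value = entry_value(obj, word);   /* the uncached function above */
-       /* evict the least recently used entry (the last slot) */
-       memmove(&cache[1], &cache[0], (CACHE_SIZE - 1) * sizeof(cache[0]));
-       cache[0] = fresh;
-       return fresh.value;
-   }
-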
- The nice thing is that it doesn't mess up the code structure at all.
- ----------------------------------------------------------------------
- stephen p spackman +49 681 302 5288(o) 5282(sec) stephen@acm.org
- dfki / stuhlsatzenhausweg 3 / d-w-6600 saarbruecken 11 / germany
- ----------------------------------------------------------------------
-