NetNews Usenet Archive 1992 #27

home *** CD-ROM | disk | FTP | other *** search

/ NetNews Usenet Archive 1992 #27 / NN_1992_27.iso / spool / comp / programm / 3185 < prev next >

Wrap

Internet Message Format | 1992-11-18 | 4.9 KB

Path: sparky!uunet!cs.utexas.edu!sun-barr!ames!agate!nima.berkeley.edu!shein From: shein@nima.berkeley.edu (Soren Hein) Newsgroups: comp.programming Subject: SUMMARY: Partial sort/Nearest Neighbor Date: 19 Nov 1992 08:00:54 GMT Organization: U.C. Berkeley -- ERL Lines: 92 Distribution: world Message-ID: <1efhjmINNg6q@agate.berkeley.edu> NNTP-Posting-Host: nima.berkeley.edu This is a summary of responses to my posting "Partial sort/Nearest Neighbor". The problem was the following: I have a small list (N = 9 or 25 elements) of integers between 0 and 255. My goal is to find the average of the K elements which are closest in absolute value to a given element in the list. K is typically about half of the list length. I need the algorithm to be as fast as possible; space consumption is no problem. Sincere thanks to the many people who responded: Doug Moore (dougm@cs.rice.edu) Dave Dodson (dodson@bach.convex.com) Bob Stearns (is@groucho.dev.uga.edu) Chow Gloria (chow@roquefort.cerisi.fr) David desJardins (desj@ccr-p.ida.org) Jan Stout (wsbusup4@urc.tue.nl) Alex Martelli (martelli@cadlab.sublink.org) lloyd allison (lloyd@bruce.cs.monash.edu.au) Greg Doherty (greg@cs.uow.edu.au) To keep this posting short, I will not quote the individual responses verbatim, and I will not explicitly credit each idea to the persons (usually more than one) who suggested each idea. Also, I will use words more appropriate for the equivalent problem of finding the K smallest elements of a list. --- Selection sort: The standard method first finds the smallest element, then the smallest of the remaining elements etc. This can be terminated once the K smallest elements have been found. The running time should be O(K*N). This algorithm sorts the first K elements as a by-product, and this extra effort is not required. --- Insertion sort: The standard method assumes that the j-1 first elements are sorted, and inserts the j'th element in the correct place, for j between 2 and N. This can be modified to only keep a list of K sorted elements; if the j'th element is too big, it is thrown away rather than inserted. The running time should be O(K*N). This algorithm also sorts the first K elements as a by-product, and this extra effort is not required. --- Heap sort: Keep only a heap of size K rather than N. Elements that are too big are thrown away rather than inserted. The running time is O(N lg K). This algorithm does not do more sorting than it needs to, although it could be argued that there is some (unnecessary?) effort involved in returning the K smallest elements in the form of a heap, when a flat list would do. --- Quick sort: One reader referred me to Knuth (where else?), section 5.2.2. exercise 31. This in turn pointed to two papers by Hoare. The one I found most useful was C. A. R. Hoare, "Proof of a Program: FIND", CACM, Volume 14, Number 1, pp. 39-45, Jan. 1971 which has a short routine that partially sorts a list so that the Kth element is in the right place, the elements to the left are <= the Kth element, and the elements to the right are >= the Kth element. [This routine is used, conceptually, in quick sort.] This appears to me to do only what's needed, and I have lifted the routine from the paper. The running time could be O(N), but I'm not sure. --- Hashing / radix sort: Keep a count array of 256 elements, initialized to zero. For each list element, increment the count array at the position corresponding to the value of the list element. Then we can work our way outwards from any desired value until we have the K closest elements. [This method is based on the list elements being small integers.] My main reservation is that setting the count array to zero may take too long. If this can be done efficiently in hardware, it seems like a good choice. Running time is 256+O(N). --- Fancy method: Find the Kth largest element using a linear time algorithm. [This is fancy because the linear-time algorithm for doing so is apparently quite complicated. I don't yet have a reference for it.] Then add up all the elements <= the Kth element in linear time, and take the average. --- "Home made" method: Find all elements between two bounds. If there are too many, reduce the bounds; if there are too few, widen the bounds. This might work well, but its quality would depend on the choice of initial bounds. Also, I don't know of any performance analysis. --- Miscellaneous: I also received a couple of literature references which I haven't yet tracked down, and the advice that for short list, simple methods may beat more complicated ones. === My choice for now is the partial quicksort algorithm, but I admit I haven't yet had time to compare the time consumption with all the competitors. The code for the partial quicksort is simple and non-recursive; I'll send it to anybody interested. /Soren