home *** CD-ROM | disk | FTP | other *** search
- Path: sparky!uunet!cs.utexas.edu!sun-barr!ames!agate!nima.berkeley.edu!shein
- From: shein@nima.berkeley.edu (Soren Hein)
- Newsgroups: comp.programming
- Subject: SUMMARY: Partial sort/Nearest Neighbor
- Date: 19 Nov 1992 08:00:54 GMT
- Organization: U.C. Berkeley -- ERL
- Lines: 92
- Distribution: world
- Message-ID: <1efhjmINNg6q@agate.berkeley.edu>
- NNTP-Posting-Host: nima.berkeley.edu
-
- This is a summary of responses to my posting "Partial sort/Nearest Neighbor".
- The problem was the following:
-
- I have a small list (N = 9 or 25 elements) of integers between 0 and 255.
- My goal is to find the average of the K elements which are closest in
- absolute value to a given element in the list. K is typically about
- half of the list length. I need the algorithm to be as fast as possible;
- space consumption is no problem.
-
- Sincere thanks to the many people who responded:
-
- Doug Moore (dougm@cs.rice.edu)
- Dave Dodson (dodson@bach.convex.com)
- Bob Stearns (is@groucho.dev.uga.edu)
- Chow Gloria (chow@roquefort.cerisi.fr)
- David desJardins (desj@ccr-p.ida.org)
- Jan Stout (wsbusup4@urc.tue.nl)
- Alex Martelli (martelli@cadlab.sublink.org)
- lloyd allison (lloyd@bruce.cs.monash.edu.au)
- Greg Doherty (greg@cs.uow.edu.au)
-
- To keep this posting short, I will not quote the individual responses
- verbatim, and I will not explicitly credit each idea to the persons
- (usually more than one) who suggested each idea. Also, I will use words
- more appropriate for the equivalent problem of finding the K smallest
- elements of a list.
-
- --- Selection sort: The standard method first finds the smallest element,
- then the smallest of the remaining elements etc. This can be terminated
- once the K smallest elements have been found. The running time should
- be O(K*N). This algorithm sorts the first K elements as a by-product,
- and this extra effort is not required.
-
- --- Insertion sort: The standard method assumes that the j-1 first
- elements are sorted, and inserts the j'th element in the correct place,
- for j between 2 and N. This can be modified to only keep a list of K
- sorted elements; if the j'th element is too big, it is thrown away
- rather than inserted. The running time should be O(K*N). This
- algorithm also sorts the first K elements as a by-product, and this
- extra effort is not required.
-
- --- Heap sort: Keep only a heap of size K rather than N. Elements that
- are too big are thrown away rather than inserted. The running time is
- O(N lg K). This algorithm does not do more sorting than it needs to,
- although it could be argued that there is some (unnecessary?) effort
- involved in returning the K smallest elements in the form of a heap,
- when a flat list would do.
-
- --- Quick sort: One reader referred me to Knuth (where else?), section
- 5.2.2. exercise 31. This in turn pointed to two papers by Hoare.
- The one I found most useful was
- C. A. R. Hoare, "Proof of a Program: FIND", CACM,
- Volume 14, Number 1, pp. 39-45, Jan. 1971
- which has a short routine that partially sorts a list so that the Kth
- element is in the right place, the elements to the left are <= the Kth
- element, and the elements to the right are >= the Kth element. [This
- routine is used, conceptually, in quick sort.] This appears to me to do
- only what's needed, and I have lifted the routine from the paper. The
- running time could be O(N), but I'm not sure.
-
- --- Hashing / radix sort: Keep a count array of 256 elements, initialized
- to zero. For each list element, increment the count array at the position
- corresponding to the value of the list element. Then we can work our
- way outwards from any desired value until we have the K closest elements.
- [This method is based on the list elements being small integers.]
- My main reservation is that setting the count array to zero may take too
- long. If this can be done efficiently in hardware, it seems like a good
- choice. Running time is 256+O(N).
-
- --- Fancy method: Find the Kth largest element using a linear time
- algorithm. [This is fancy because the linear-time algorithm for doing
- so is apparently quite complicated. I don't yet have a reference for
- it.] Then add up all the elements <= the Kth element in linear time,
- and take the average.
-
- --- "Home made" method: Find all elements between two bounds. If there
- are too many, reduce the bounds; if there are too few, widen the bounds.
- This might work well, but its quality would depend on the choice of
- initial bounds. Also, I don't know of any performance analysis.
-
- --- Miscellaneous: I also received a couple of literature references which
- I haven't yet tracked down, and the advice that for short list, simple
- methods may beat more complicated ones.
-
- ===
-
- My choice for now is the partial quicksort algorithm, but I admit I
- haven't yet had time to compare the time consumption with all the
- competitors. The code for the partial quicksort is simple and
- non-recursive; I'll send it to anybody interested.
-
- /Soren
-