NetNews Usenet Archive 1992 #31

home *** CD-ROM | disk | FTP | other *** search

/ NetNews Usenet Archive 1992 #31 / NN_1992_31.iso / spool / comp / programm / 3393 < prev next >

Wrap

Text File | 1993-01-02 | 3.0 KB | 79 lines

Newsgroups: comp.programming Path: sparky!uunet!gatech!destroyer!cs.ubc.ca!uw-beaver!newsfeed.rice.edu!rice!news.rice.edu!dougm From: dougm@titan.cs.rice.edu (Doug Moore) Subject: Re: Locating duplicates In-Reply-To: dave@cs.arizona.edu's message of 2 Jan 93 04:16:58 GMT Message-ID: <DOUGM.93Jan2014723@titan.cs.rice.edu> Sender: news@rice.edu (News) Organization: Rice University Computer Science, Houston, Texas References: <C07FJo.4vq@ux1.cso.uiuc.edu> <1993Jan2.041658.21596@organpipe.uug.arizona.edu> Date: Sat, 2 Jan 1993 07:47:23 GMT Lines: 65 >>>>> On 2 Jan 93 04:16:58 GMT, dave@cs.arizona.edu (Dave Schaumann) said: Dave> In article <C07FJo.4vq@ux1.cso.uiuc.edu>, ceblair@ux1 (Charles Blair) writes: > Given a large array of integers. Is there a way to identify duplicates >which is substantially less work than sorting the array? Dave> Hmm... imagine we have an array with just two elements that are equal. Dave> Now assume we have some duplicate finder which does not sort the array. Dave> Assuming that the values and positions of the data are not related in Dave> some way, it seems reasonable to assume that we can arrange the values Dave> so that that the algorithm must try every possible combination before Dave> it finds the duplicate pair. Dave> Not an air-tight proof Right. So try this: Assume that our duplicate checker must compare elements to do its work. And assume that we wish only to check whether or not any duplicates exist. Let the array have, besides data entries containing the numbers to be checked, pointer entries. When we compare two array data entries, we make the pointer paired with the lesser element point to the pair containing the greater element (or quit if the elements are equal). To confidently report 'no duplicates', the pointers must, in the end, form a sorted linked list. Consider two items that are consecutive in a sorted sequence; if we haven't compared those two in particular, we can't tell whether they're duplicates or not. So we must have compared all such pairs to report 'no duplicates'. So anything that let's us report 'no dups' is essentially a sorting algorithm, and comparison based sorting requires O(n lg n) time. But if these are fixed size integers (like C ints), you can radix sort in linear time anyway, which is as fast as you can reasonably expect. For example: rsort(unsigned arr[], unsigned n) { int j; unsigned* stacks = (unsigned*) malloc(n*sizeof(unsigned)); unsigned mask = 1; unsigned max = 0; for (j = 0; j < n; ++j) if (max < arr[j]) max = arr[j]; while (mask <= max && mask) /* <= 32 times on my machine */ { int top = n; int bottom = 0; for (j = 0; j < n; ++j) if (arr[j] & mask) stacks[--top] = arr[j]; else stacks[bottom++] = arr[j]; for (j = 0; j < bottom; ++j) arr[j] = stacks[j]; for (j = top; j < n; ++j) arr[j] = stacks[n-1-j+top]; mask <<= 1; } free((char*) stacks); } Doug Moore (dougm@cs.rice.edu)