home *** CD-ROM | disk | FTP | other *** search
- Newsgroups: comp.programming
- Path: sparky!uunet!gatech!destroyer!cs.ubc.ca!uw-beaver!newsfeed.rice.edu!rice!news.rice.edu!dougm
- From: dougm@titan.cs.rice.edu (Doug Moore)
- Subject: Re: Locating duplicates
- In-Reply-To: dave@cs.arizona.edu's message of 2 Jan 93 04:16:58 GMT
- Message-ID: <DOUGM.93Jan2014723@titan.cs.rice.edu>
- Sender: news@rice.edu (News)
- Organization: Rice University Computer Science, Houston, Texas
- References: <C07FJo.4vq@ux1.cso.uiuc.edu>
- <1993Jan2.041658.21596@organpipe.uug.arizona.edu>
- Date: Sat, 2 Jan 1993 07:47:23 GMT
- Lines: 65
-
- >>>>> On 2 Jan 93 04:16:58 GMT, dave@cs.arizona.edu (Dave Schaumann) said:
-
- Dave> In article <C07FJo.4vq@ux1.cso.uiuc.edu>, ceblair@ux1 (Charles Blair) writes:
- > Given a large array of integers. Is there a way to identify duplicates
- >which is substantially less work than sorting the array?
-
- Dave> Hmm... imagine we have an array with just two elements that are equal.
- Dave> Now assume we have some duplicate finder which does not sort the array.
- Dave> Assuming that the values and positions of the data are not related in
- Dave> some way, it seems reasonable to assume that we can arrange the values
- Dave> so that that the algorithm must try every possible combination before
- Dave> it finds the duplicate pair.
-
- Dave> Not an air-tight proof
-
- Right. So try this:
-
- Assume that our duplicate checker must compare elements to do its
- work. And assume that we wish only to check whether or not any
- duplicates exist. Let the array have, besides data entries containing
- the numbers to be checked, pointer entries. When we compare two array
- data entries, we make the pointer paired with the lesser element point
- to the pair containing the greater element (or quit if the elements
- are equal). To confidently report 'no duplicates', the pointers must,
- in the end, form a sorted linked list. Consider two items that are
- consecutive in a sorted sequence; if we haven't compared those two in
- particular, we can't tell whether they're duplicates or not. So we
- must have compared all such pairs to report 'no duplicates'. So
- anything that let's us report 'no dups' is essentially a sorting
- algorithm, and comparison based sorting requires O(n lg n) time.
-
- But if these are fixed size integers (like C ints), you can radix sort
- in linear time anyway, which is as fast as you can reasonably expect.
-
- For example:
-
- rsort(unsigned arr[], unsigned n)
- {
- int j;
- unsigned* stacks = (unsigned*) malloc(n*sizeof(unsigned));
- unsigned mask = 1;
- unsigned max = 0;
- for (j = 0; j < n; ++j)
- if (max < arr[j])
- max = arr[j];
- while (mask <= max && mask) /* <= 32 times on my machine */
- {
- int top = n;
- int bottom = 0;
- for (j = 0; j < n; ++j)
- if (arr[j] & mask)
- stacks[--top] = arr[j];
- else
- stacks[bottom++] = arr[j];
- for (j = 0; j < bottom; ++j)
- arr[j] = stacks[j];
- for (j = top; j < n; ++j)
- arr[j] = stacks[n-1-j+top];
- mask <<= 1;
- }
- free((char*) stacks);
- }
-
- Doug Moore
- (dougm@cs.rice.edu)
-