NetNews Usenet Archive 1992 #31

home *** CD-ROM | disk | FTP | other *** search

/ NetNews Usenet Archive 1992 #31 / NN_1992_31.iso / spool / comp / programm / 3391 < prev next >

Wrap

Internet Message Format | 1993-01-01 | 1.9 KB

Path: sparky!uunet!gatech!ncar!noao!amethyst!organpipe.uug.arizona.edu!news From: dave@cs.arizona.edu (Dave Schaumann) Newsgroups: comp.programming Subject: Re: Locating duplicates Message-ID: <1993Jan2.041658.21596@organpipe.uug.arizona.edu> Date: 2 Jan 93 04:16:58 GMT References: <C07FJo.4vq@ux1.cso.uiuc.edu> Sender: news@organpipe.uug.arizona.edu Reply-To: dave@cs.arizona.edu (Dave Schaumann) Organization: University of Arizona Lines: 27 In-Reply-To: ceblair@ux1.cso.uiuc.edu (Charles Blair) In article <C07FJo.4vq@ux1.cso.uiuc.edu>, ceblair@ux1 (Charles Blair) writes: > Given a large array of integers. Is there a way to identify duplicates >which is substantially less work than sorting the array? Hmm... imagine we have an array with just two elements that are equal. Now assume we have some duplicate finder which does not sort the array. Assuming that the values and positions of the data are not related in some way, it seems reasonable to assume that we can arrange the values so that that the algorithm must try every possible combination before it finds the duplicate pair. Not an air-tight proof, but it seems reasonable to assert that no algorithm that asserts anything about the order of the values in the array can find a duplicate in less than O(n*(n-1)/2) time in the worst case, and clearly using an O(nlogn) sort and a linear scan is better than that. To beat sort-and-scan, you would need some kind of linear-or-better ordering algorithm that would allow you to scan for duplicates with a linear-or-better scanning algorithm. However, I think that generally you can do some things to speed up the average case of sort-and-scan. For instance, you could modify a heap sort algorithm so that it stops as soon as it finds a duplicate. Also, you could employ a hashing scheme based on the lowest bytes of the values, which would break up your data into 256 independant groups. -- Dave Schaumann dave@cs.arizona.edu