NetNews Usenet Archive 1992 #31

Text File | 1992-12-22 | 3.0 KB | 65 lines

Comments: Gated by NETNEWS@AUVM.AMERICAN.EDU
Path: sparky!uunet!paladin.american.edu!auvm!BARUCH.BITNET!TEJERA
Message-ID: <SAS-L%92122114531958@UGA.CC.UGA.EDU>
Newsgroups: bit.listserv.sas-l
Date: Mon, 21 Dec 1992 14:44:46 EST
Reply-To: Philip Tejera <TEJERA@BARUCH.BITNET>
Sender: "SAS(r) Discussion" <SAS-L@UGA.BITNET>
From: Philip Tejera <TEJERA@BARUCH.BITNET>
Subject: Re: PROC SORT NODUPLICATES is no good
In-Reply-To: Message of Wed, 16 Dec 1992 19:48:41 EST from
 <76350.1604@COMPUSERVE.COM>
Lines: 51

On Wed, 16 Dec 1992 19:48:41 EST Andy Norton said:
>..
>SUMMARY: Phillip Tejera has not addressed the "adjacency" issue
>..

Gee, Andy, I didn't realize I had committed such a crime! :-)  My
purpose was more general. I thought the so-called "adjacency" issue
was more than adequately covered in the SAS v5 Basics manual, the SAS
v6 Procedures manual, and previous discussions on the list. I had no
quarrel with your contention that the Language and Procedures manual
was inaccurate. My question is, having shown the inaccuracy, why do
you continue to want to believe the L&P manual?

Incidentally, despite your protests, your example did have only one
key! It was clearly not unique.

I was heartened, however, that in your responses to Derek and me it
was clear you took back most of what you said in your first posting.
My main objection was to your Proc Sort by _all_ ; . In your most
recent posting you agree:

> ... I sort by _ALL_ to make
>them adjacent, not for any other reason. Yes this is expensive, but I
>need to make them adjacent in order to delete exact duplicates.
>
>Note: in real life I don't really ever do this. I keep keys, and
>remove duplicates on those keys. But ...

If you don't do it, why do you recommend it to SAS-L's world-wide
readership? Get real!

My point in discussing the concept of a unique key or set of keys was
to provide a basis for clarifying the context of the issue.
If you have observations that are supposed to be unique as to key, but
in fact have duplicates, erroneous data have crept into your study.
The Noduplicates option of Proc Sort is one conservative way to clean
the data in the course of sorting it. It is no news that it is NOT
GUARANTEED to remove all exact duplicates.

But I strongly disagree with your procedure for eliminating duplicate
keys using Nodupkey and the sort by _all_; . If I had occasion to sort
the data, I would use the Noduplicates option; this would be efficient
since I was already doing the sorting. But more importantly, I would
do a Univariate or Freq on the unique key or set of keys, SO AS TO
VERIFY THAT THEY ARE INDEED UNIQUE. Lines or cells with a count
greater than one would immediately identify the problem observations.

Having identified them, I could then print them out, or better,
examine them interactively. Having done sufficient checking, it is
then a simple matter to use SAS's Delete or Where statements to
eliminate the erroneous observations. And, of course, this does not
require the notorious "adjacency".
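
For readers who want to see the two ideas side by side, here is a
minimal sketch in Python (not SAS, and not anyone's actual program).
It assumes the "adjacency" behavior discussed in this thread: a record
is dropped only when it is identical to the record kept immediately
before it. The data, the function name drop_adjacent_duplicates, and
the variable names are all hypothetical, made up for illustration.

```python
from collections import Counter
from operator import itemgetter

# Hypothetical records: (key, value). Key 1 is supposed to be unique but is not.
records = [(1, "a"), (1, "b"), (1, "a"), (2, "c")]


def drop_adjacent_duplicates(rows):
    """Keep a row unless it equals the row kept immediately before it.

    This models the assumed adjacency-only comparison; it is not SAS itself.
    """
    kept = []
    for row in rows:
        if not kept or row != kept[-1]:
            kept.append(row)
    return kept


# Sorting by key only (a stable sort) leaves the two (1, "a") rows apart,
# so the adjacency-only pass removes nothing.
by_key = drop_adjacent_duplicates(sorted(records, key=itemgetter(0)))

# Sorting by every field makes exact duplicates adjacent, so one is dropped.
by_all = drop_adjacent_duplicates(sorted(records))

# The frequency check: count each key and flag any key seen more than once.
key_counts = Counter(key for key, _ in records)
suspect_keys = sorted(k for k, n in key_counts.items() if n > 1)

# Pull the flagged observations for inspection; no sorting is needed here.
suspect_rows = [row for row in records if row[0] in suspect_keys]
```

The count on the key flags every observation sharing key 1, including
the one that differs in its other field, which is exactly the sort of
row one would want to examine before deleting anything.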