- Path: sparky!uunet!uunet.ca!ecicrl!clewis
- From: clewis@ferret.ocunix.on.ca (Chris Lewis)
- Newsgroups: comp.lang.perl
- Subject: Re: SOUNDEX pattern matching
- Keywords: soundex, perl
- Message-ID: <4167@ecicrl.ocunix.on.ca>
- Date: 24 Jan 93 02:04:07 GMT
- References: <1jfjejINNm1q@fernwood.mpk.ca.us> <1452@ares.edsr.eds.com> <1993Jan23.025720.1005@Happy-Man.com>
- Organization: Elegant Communications Inc., Ottawa, Canada
- Lines: 49
-
- In article <1993Jan23.025720.1005@Happy-Man.com> Irving_Wolfe@Happy-Man.com writes:
- >Before we get too far re-posting soundex code, let me ask another
- >question: I tried using this thing and found it pretty useless.
- >What's it good for? I tried to use it to find names, and it was
- >just awful at it! Is anyone at all using this for anything real
- >today, or is it a clever idea that didn't quite work?
-
- Oh, it works all right.
-
- It's very useful for finding names where you're not quite sure
- of the spelling. Which makes it very good in telephone book style
- applications. Surprisingly, it works reasonably well even with
- names that aren't of English origin, as long as the "correct" spelling
- conforms reasonably well with English phonetics. It's even good at
- finding unpronounceable Slavic names. For example, something as
- minimal as "mutsi" is sufficient for finding something like Miedzinski
- (pronounced "mudZINZka") if you do the searching correctly.
-
- If you found that it was just awful at it, I think you're not
- using it right.
-
- Users of soundex soon learn little tricks for finding things even
- more easily, like deliberately misspelling the parts of a name that
- they're sure of so as to search in slightly different ways, or
- intentionally shortening the name to make the "hit set" bigger.
-
- It's used by very large corporations for name lookup, and works well.
- I should know, I implemented it when I wrote the telephone lookup program
- used internally by one of the largest corporations in this country.
- It's one of its best features if I do say so myself ;-)
-
- What is important is which additional heuristics you use. My program,
- by default, shows you exact matches if there are any. If not,
- it displays soundex matches. It's important to recognize when the
- search's soundex code has been zero-padded - if it has, you may also
- want to match any entry whose code begins with the same non-zero
- characters. I.e., a search code of A100 matches all A1xx.
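
A minimal sketch of that lookup order, in Python rather than Perl.
It assumes the standard American Soundex digit groups, which the
post does not actually specify; the names soundex, lookup and
directory are made up for illustration and are not taken from the
program described above.

```python
# Standard American Soundex digit groups (an assumption; the post does not
# say which variant the Perl code or the author's program used).
CODES = {ch: str(digit)
         for digit, group in enumerate(
             ["bfpv", "cgjkqsxz", "dt", "l", "mn", "r"], start=1)
         for ch in group}


def soundex(name, length=4):
    """Return a zero-padded soundex code of the given length."""
    letters = [c for c in name.lower() if "a" <= c <= "z"]
    if not letters:
        return ""
    digits = []
    prev = CODES.get(letters[0], "")
    for ch in letters[1:]:
        if ch in "hw":            # h and w do not separate repeated codes
            continue
        code = CODES.get(ch, "")  # vowels and y map to "" and reset prev
        if code and code != prev:
            digits.append(code)
        prev = code
    return (letters[0].upper() + "".join(digits) + "0" * length)[:length]


def lookup(query, names, length=4):
    """Exact matches first; otherwise soundex, by prefix if the code is padded."""
    exact = [n for n in names if n.lower() == query.lower()]
    if exact:
        return exact
    code = soundex(query, length)
    if not code:
        return []
    stem = code.rstrip("0")       # A100 -> A1, so it matches every A1xx
    return [n for n in names if soundex(n, length).startswith(stem)]


directory = ["Miedzinski", "Ashcroft", "Robert"]   # made-up sample directory
print(soundex("mutsi"), soundex("Miedzinski"))     # M320 M325
print(lookup("mutsi", directory))                  # ['Miedzinski']
```

Under these rules "mutsi" codes to M320 and Miedzinski to M325, so
a strict comparison misses; it is the padded-prefix rule (M32
matches M3xx) that finds the name.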
-
- I also note that the Perl implementation calculates an 8-digit soundex code.
- The precision you use will often restrict the size of the set of
- soundex matches. In a corporation with something like 5,000 names,
- a four-digit soundex works well. With 20,000 names, you may want
- to go to 5. 8 seems a trifle big for most applications. Using long
- soundex codes just tends to cut down your possible hits with long names,
- and long names are where you need the widest tolerance.
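
To illustrate the effect of precision, here is a small Python
sketch. The directory is made up, and its 8-digit codes were worked
out by hand under the standard American Soundex rules with zero
padding (an assumption about how the Perl code pads). The sketch
stores the long code once and compares only its first few
characters, which is one way to experiment with precision without
recomputing anything.

```python
# Hypothetical directory with hand-computed, zero-padded 8-digit codes.
DIRECTORY = {
    "Miedzinski": "M3252000",
    "Mudzinger":  "M3252600",
    "Martin":     "M6350000",
}


def hits(query_code, precision):
    """Names whose stored code agrees with the query on the first
    `precision` characters."""
    key = query_code[:precision]
    return sorted(n for n, c in DIRECTORY.items() if c[:precision] == key)


query = "M3252200"   # code for the misspelling "Miedzinskas"
for precision in (4, 5, 8):
    print(precision, hits(query, precision))
# 4 ['Miedzinski', 'Mudzinger']
# 5 ['Miedzinski', 'Mudzinger']
# 8 []
```

At 4 or 5 digits the misspelled query still finds Miedzinski; at 8
digits the trailing difference in a long name is enough to lose
the hit.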
- --
- Chris Lewis; clewis@ferret.ocunix.on.ca; Phone: Canada 613 832-0541
- Psroff 3.0 info: psroff-request@ferret.ocunix.on.ca
- Ferret list: ferret-request@ferret.ocunix.on.ca
-