home *** CD-ROM | disk | FTP | other *** search
- TITLE SOUNDEXA
- PAGE ,132
-
- ;Author : bp-programs, Kalamazoo, Michigan
- ;Date: Jan 88, for Clipper Summer 87
- ;Source Code protected by United States Copyright Law
- ;Permission given for code to be incorporated in other programs by author
-
- ;Syntax: SOUNDEXA(string[,filler])
-
- ;The soundex code is useful to look up names where you aren't sure of the
- ;spelling. Codes for similar sounding names are generally (but NOT always)
- ;close together. The code has the format LETTER-DIGIT-DIGIT-DIGIT. LETTER is
- ;simply the upper case first letter of the name. DIGITs are derived from the
- ;translation table below. Empty positions are NOT translated. If there are
- ;two or more letters with the same code following each other in the name, only
- ;ONE code number is used. 'Schmidt' is 'S530', not 'S253' or 'S533'. If there
- ;are more than three code numbers, the extra ones aren't used. If there are
- ;fewer, the code is padded with zeros. (But see about FILLER below).
-
- ;Soundex ABCDEFGHIJKLMNOPQRSTUVWXYZ
- ;Translation Table: 123 12 22455 12623 1 2 2
-
- ;SOUNDEXA is an assembly language implementation of the soundex code. It
- ;follows my interpretation of the algorithm found on pages 392/393 of Knuth's
- ;book 'Sorting and Searching', volume 3 of "The Art of Computer Programming".
-
- ;It does NOT return the same code as the soundex routine in examplec.c
- ;(SOUNDEXC) distributed with Clipper Summer 87 or the Rettig soundex routine
- ;distributed with Clipper Autumn 86 in extenddb.prg (SOUNDEXD).
- ;The main differences among the three implementations are listed below.
-
- ; SOUNDEXA SOUNDEXC SOUNDEXD
- ; ---------------------- --------------------- ----------------
- ;Format A999 A999 A9999
-
- ;Dupes Skips ltrs generating Skips identical ltrs Skips duplicate
- ; the same code which are adjacent in original code numbers even
- ; immediately adjacent in text if not adjacent in
- ; original text original text
-
- ;Null 1. Null string 1. Null string 1. Null string
- ;Returns 2. Completely non-alpha 2. Non-alpha/non
- ; string space characters
- ; except first char
-
- ;Fault 1. Ltrims leading non- 1. Does not trim, uses 1. Does not trim,
- ;Tolerance alpha characters non-alpha as lead uses any char
- ; 2. Skips intermediate 2. Aborts with non- 2. Skips inter-
- ; non-alpha characters alpha/non-space mediate non-
- ; except first char alpha chars
-
- ;Speed 3 secs/5000 repeats 9 secs/5000 repeats 90 secs/5000 repts
- ;I believe, of course, that SOUNDEXA is the 'best' implementation because
- ;it's closest to Knuth's algorithm, most fault tolerant, fastest (and also
- ;smallest, by the way) and the most FLEXIBLE. More about this below.
-
- ;Knuth's algorithm uses 0s (character zero) to fill trailing empty slots.
- ;This makes sense when you're constructing an index, such as
-
- ; INDEX ON SOUNDEXA(LASTNAME) TO NAMX
-
- ;However, when you're SEEK/LOCATEing with SOUNDEX you generally want to find
- ;all likely candidates and want to make sure that you don't miss any. You'd
- ;rather find a few wrong ones than miss a single right one. In that case
- ;you want to include even partial matches, such as
-
- ; LOCATE ALL FOR TRIM(SOUNDEXA(PART_NAME))
-
- ;SOUNDEXA allows you to select between two fillers, spaces or '0'. Even
- ;though zeros are 'standard', I find spaces more flexible and have made them
- ;the default. By specifying a second argument SOUNDEXA(LASTNAME,FILLER) once,
- ;you change the state of the routine. If FILLER is a '0' (as a character, not
- ;a number), all future calls to SOUNDEXA will use zeros for filling. If
- ;FILLER is any other character (or even a null string), SOUNDEXA will use
- ;spaces in the future. If there isn't a second argument, SOUNDEXA will use
- ;what you specified before or the default. If you prefer zeros as the default,
- ;change the FILLER DB to '0' in the DATASG.
-
-
- ;===================================================
- EXTRN __PARINFO:FAR ;Clipper EXTEND routine, tells how many arguments
- EXTRN __PARC:FAR ;Clipper EXTEND routine, gets a character argument
- EXTRN __RETC:FAR ;Clipper EXTEND routine, returns a character value
-
- SX_LENGTH EQU 4 ;Length of soundex code
-
- DGROUP GROUP DATASG ;Ties this segment to the other data segments
- ;of Clipper. DS points to this DGROUP when
- ;we arrive in the assembly routine
-
- DATASG SEGMENT WORD PUBLIC 'DATA' ;All PUBLIC segments with the name DATASG
- ;will be combined by the linker. All segments
- ;with the class 'DATA' will be adjacent to
- ;each other. WORD means that the segment
- ;starts on an even byte, which can sometimes
- ;be minutely faster in an 8086/80286 machine.
-
- SOUNDEX DB SX_LENGTH DUP (?) ; Space for SOUNDEX result
- DB 00 ; Terminator byte
- ;Strings in C and Clipper are terminated by a NULL (or NUL or
- ;NIL, it all means the same thing). There is no length byte
- ;or word as in BASIC or Turbo Pascal.
- FILLER DB ' ' ; Filler byte for padding of SOUNDEX, can be
- ; space (default) or '0'
- ; Translate table from UC letters to SOUNDEX codes
- ; Omitted letters return NULL
- ; 'ABCDEFGHIJKLMNOPQRSTUVWXYZ'
- ; ' 123 12 22455 12623 1 2 2'
- TRANSLATE db 0,'123',0,'12',0,0,'22455',0,'12623',0,'1',0,'2',0,'2'
-
- DATASG ENDS
- ;==================================================
-
-
- ;==================================================
- _PROG SEGMENT BYTE PUBLIC 'CODE' ;All PUBLIC segments with
- ;the name _PROG will be com-
- ;bined, all segments with the
- ;class 'CODE' will be
- ;adjacent. BYTE means that
- ;the segments will be aligned
- ;(stuck together) without any
- ;padding.
-
- ASSUME CS:_PROG, DS:DGROUP, ES:NOTHING ;This is the way the segment
- ;registers are set up when we
- ;arrive here from Clipper.
- PUBLIC SOUNDEXA ;Used in linking to Clipper, lets Clipper know
- ;where this routine is.
-
- SOUNDEXA PROC FAR ;The name of our routine (procedure)
-
- PUSH BP ;The Clipper extend documentation on disk
- PUSH DI ;says that we have to save registers
- PUSH SI ;BP, DI, SI, ES and DS. We are not
- PUSH ES ;disturbing BP, so we may not have to save it.
- PUSH DS ;But the Clipper routines __PARINFO, __PARC
- ;and __RETC may do so, we don't know.
-
- ;Ensure null string in case of missing argument or no letters
- ;We do this by moving a NULL byte in the first place of the
- ;SOUNDEX code. It will be overwritten if there's no error.
- MOV BYTE PTR DGROUP:[SOUNDEX], 0
-
- SUB AX, AX ;Faster and smaller than MOV AX, 0
- PUSH AX
- CALL __PARINFO ;Find out how many arguments passed
- ADD SP, 2 ;Clean up stack. C routines, unlike BASIC
- ;or Pascal do NOT clean up the stack.
- CMP AX, 1 ;Is there 1 argument?
- JE MAIN_ROUT ;Yes, use stored filler for conversion
-
- CMP AX, 2 ;Are there two arguments?
- JNE LEAVE ;No, an invalid number of arguments. Leave.
- ;Two arguments, get the second one - a new FILLER
- PUSH AX ;AX is always 2 here
- CALL __PARC ;Get the address of FILLER string
- ADD SP, 2 ;DX:AX hold pointer to string
- MOV ES, DX
- MOV BX, AX ;Use ES:BX to point to FILLER string
- MOV AL, ' ' ;Load default space character
- CMP BYTE PTR ES:[BX], '0' ;Is the new FILLER a '0'?
- JNE SX010 ;No, all set with space
- MOV AL, '0' ;Yes, make it a '0'
- SX010: MOV DGROUP:FILLER, AL ;Set Filler character
- MOV AX, 1
-
- ;AX is always 1 here, either set above or CMP AX, 1
- ;Pointer to string in DX:AX (SEG:OFS) for Clipper S87
- MAIN_ROUT: PUSH AX
- CALL __PARC ;Get pointer to string to convert
- ADD SP, 2 ;Pointer to string returned in DX:AX (SEG:OFS)
- ;Set up pointer registers, seg and ofs
- ;DS:SI - String to convert, pointer incrementing
- ;ES:DI - SOUNDEX code in DGROUP, pointer incrementing
- ;ES:BX - TRANSLATE in DGROUP, points always to base
- PUSH DS
- POP ES ;ES now points to DGROUP
- MOV DS, DX ;And DS to where ever Clipper stores
- ;its string arguments.
- MOV SI, AX ;DS:SI point to start of string to convert
-
- MOV DI, OFFSET DGROUP:SOUNDEX ;ES:DI point to start of
- ;SOUNDEX
- MOV BX, OFFSET DGROUP:TRANSLATE ;ES:BX point to TRANSLATE base
- CLD ;Work upward in string instructions
-
- ASSUME DS:NOTHING, ES:DGROUP
- ;Let MASM know that we've switched seg regs around
- MOV CX, SX_LENGTH ;Maximum SOUNDEX length
-
- FIRST_LTR: LODSB ;Get start byte from string to convert
- OR AL, AL ;At end of string to convert?
- JZ LEAVE ;NULL string or no letters anywhere in
- ;string, return a NULL string
- ; Real chararacter here, but is it a letter?
- AND AL, 0DFH ;This converts letters to upper case,
- ;destroys other characters. But since
- ;we don't care about those, it's ok.
- MOV AH, AL ;Save the possible starting letter
- SUB AL, 'A' ;Subtract the ASCII value of A which
- ;is 65. This makes A 0, B 1, C 2 etc.
- JS FIRST_LTR ;Negative, so not a letter, try again
- CMP AL, 'Z' - 'A' ;ASCII Z minus ASCII A is the largest
- ;real letter value.
- JA FIRST_LTR ;Not a letter either, try again
- ;We found a valid UC starting letter. It's both in AH and AL
- XLAT DGROUP:TRANSLATE ;Convert to SOUNDEX code 1-6 or NULL
- ;XLAT adds the value in AL to BX and
- ;fetches the character pointed to by
- ;(normally) DS:BX+AL. Since in this
- ;case DS points the NOTHING and ES to
- ;DGROUP, MASM is smart enough to make
- ;a segment override so that XLAT gets
- ;the byte at ES:BX+AL and puts it in
- ;AL. (Replacing the original pointer)
-
- XCHG AH, AL ;After switch, AH holds code,
- ;AL the UC starting letter
- STOSB ;Put first letter into SOUNDEX
- LOOP DIGITS ;Decrement CX and jump to actual
- ;digit conversion. Skip over one
- ;piece of code.
-
- ERR_DIGITS: SUB AH, AH ;Jump to here only when looping back
- ;and we want to clear out false
- ;'previous' letter matches if
- ;there are non-letters in between.
-
- DIGITS: LODSB ;Get the next character
- OR AL, AL ;ORing a value is the fastest way to
- ;find out if it's NULL (end of string)
- JZ ALL_DONE ;Trailing NULL detected
- ;Not at end of string to convert
- AND AL, 0DFH ;Convert to UC
- SUB AL, 'A' ;Subtract ASCII 'A'
- JS ERR_DIGITS ;Negative, not a letter
- ;Clear out previous code in AH
- CMP AL, 'Z' - 'A'
- JA ERR_DIGITS ;Not a letter either, clear previous
- ;code in AH
- ;Valid UC letter in AL, 'previous' code in AH
- XLAT DGROUP:TRANSLATE ;Convert to SOUNDEX code or NULL
- CMP AH, AL ;Same code as previous letter?
- JE DIGITS ;Yes, duplicate, don't add to SOUNDEX
- ;New code, not a duplicate
- MOV AH, AL ;Save it as the new 'previous' code
- OR AL, AL ;Is it a real or a null code?
- JZ DIGITS ;Null code, don't add to SOUNDEX
-
- ;Valid code in AL, not the same as previous, add to SOUNDEX
- STOSB
- LOOP DIGITS ;Continue until SX_LENGTH in SOUNDEX
-
-
- ALL_DONE: JCXZ LEAVE ;Complete soundex, CX counted down
- MOV AL, DGROUP:[FILLER] ; ' ' or '0'
- REP STOSB ;Fill remainder of SOUNDEX with FILLER
-
- LEAVE: POP DS ;Restore DGROUP segment into DS
- ;Clipper routines, such as __RETC
- ;expect DS to be pointing to DGROUP
-
- PUSH DS ;Push segment of SOUNDEX string
- MOV AX, OFFSET DGROUP:SOUNDEX
- PUSH AX ;And push the offset of SOUNDEX
- CALL __RETC ;Return pointer to SOUNDEX to Clipper
- ADD SP, 4 ;Clean up stack
-
- ;DS already popped above
- POP ES ;Get remainder of saved registers back
- POP SI
- POP DI
- POP BP
- RET ;Go back to Clipper
-
- SOUNDEXA ENDP
- _PROG ENDS
- END