home *** CD-ROM | disk | FTP | other *** search
- { BOYERMO2.PAS (23 January 1988) (Rufus S. Hendon) }
-
- { This unit provides facilities for searching a text for a target using
- the Boyer-Moore search method. The routine is based on Don Strenczewilk's
- implementation of a variant form of the Boyer-Moore method (his case-
- insensitive version B1, available on CompuServe in file BLINE.ARC in
- Borland BPROGA Data Library 4, uploaded 21 August 1987). In addition to
- repackaging his routine as a Turbo Pascal 4.0 unit, I have modified it
- (1) to provide protection against endless loops that in the original
- version can arise due to wrap-around of the index used to scan the text
- when the the length of the text approaches the maximum (65521 characters)
- allowed by Turbo Pascal 4.0 for arrays of type CHAR and (2) to improve
- efficiency slightly by removing three instructions (a PUSH, a MOV, and a
- POP) from the comparison loop.
- The text to be searched must be stored in an array of type CHAR or an
- equivalent user-defined type. The lower bound of the array must be 1.
- The target for which the text is to be searched must be of type STRING.
- The program must also provide a variable for the storage of the shift
- table used by the Boyer-Moore method when it searches the text. This
- variable must provide 256 bytes of storage; it can, for example, be a
- variable of type ARRAY[CHAR] OF BYTE. The target variable and the shift-
- table variable must be in the same segment: they must both be global
- variables (located in the data segment) or both local variables (stored
- in the stack segment).
- Whenever the text is to be searched for a new target, the program must
- call MAKE_BOYER_MOORE_TABLE to create the shift table for the target.
- Thereafter the text can be searched for the target by invoking
- BOYER_MOORE_SEARCH, specifying as arguments the target and its shift
- table as well as the position in the text where the search is to begin.
- If the program maintains multiple target variables and a separate shift
- table and starting-position variable for each target, searches for
- occurrences of the various targets can be underway simultaneously.
- In a call to BOYER_MOORE_SEARCH, the argument associated with the
- parameter START determines the position in the text with which the search
- begins. To search the entire text, the function would be invoked with
- START = 1. The function scans the text beginning from the START position
- for the first substring that matches the target specified by the variable
- associated with the parameter TARGET, using the shift table stored in the
- variable associated with the parameter TABLE. If such a substring is
- found, the function returns the position (array subscript) of the initial
- character of the matching substring; since the array is required to have
- 1 as its lower bound, the position returned after a successful search
- will always be greater than 0. If the function fails to find a matching
- substring, it returns 0. (If the requirement that the TARGET and TABLE
- variables be in the same segment is violated, the function also returns
- 0.)
- When it is required that all occurrences in the text of a given target
- be found, BOYER_MOORE_SEARCH would be invoked in a loop, in which the
- START argument would initially have the value of 1; thereafter, after
- every successful search, the START argument would be reset to the
- position returned by the function plus 1. The loop would terminate when
- the function reported failure. The loop would have a general structure
- similar to this:
-
- item := [the target string];
- make_Boyer_Moore_table(item,shift_table);
- scan_beginning := 1;
- search_text_length := length(search_text);
- repeat
- i := Boyer_Moore_search(search_text,scan_beginning,search_text_length,
- item,shift_table);
- if i > 0 then begin
- [do whatever processing is required when the search is successful];
- scan_beginning := i+1
- end
- until i = 0
-
- Note that if the text array can only be referred to by means of a
- pointer, as will be the case if the array is allocated in the heap by
- means of the NEW procedure, the pointer, when used as the first argument
- of BOYER_MOORE_SEARCH, must be dereferenced by writing '^' after it. If,
- for example, TEXTPTR is a pointer to the text array, the call to the
- search function in the loop just given would take this form:
-
- i := Boyer_Moore_search(textptr^,scan_beginning,search_text_length,
- item,shift_table);
- }
- {============================================================================}
- unit BOYERMO2;
- {============================================================================}
- interface
-
- procedure MAKE_BOYER_MOORE_TABLE(var target: string; var table);
- { TARGET is the target string for which a text is to be searched. The
- shift table for the target string is constructed in TABLE, which must be
- a variable providing 256 bytes of storage, e.g. a variable declared as
- ARRAY[CHAR] OF BYTE. }
-
- function BOYER_MOORE_SEARCH(var text_array; start, text_length: word;
- var target: string; var table): word;
- { TEXT_ARRAY is an array of characters in which a text is stored; the
- text begins in TEXT_ARRAY[1] and is TEXT_LENGTH characters long. TARGET
- must either be the same variable used as parameter TARGET in an earlier
- call to MAKE_BOYER_MOORE_TABLE or another variable with the same value.
- TABLE must be the variable that was used as parameter TABLE in the same
- call to MAKE_BOYER_MOORE_TABLE. TARGET and TABLE must be in the same
- segment, i.e. they must both be global variables or both local variables.
- A Boyer-Moore search is performed on the text in TEXT_ARRAY, beginning
- with the character in position START and using shift table TABLE, for
- the first substring that matches TARGET. If a match is found, the
- position of the first character of the matching substring is returned.
- Otherwise 0 is returned. A function value of 0 is also returned if TABLE
- and TARGET are not in the same segment. }
- {============================================================================}
- implementation
-
- const
- copy: string = '';
- var
- table: array[char] of byte;
- {****************************************************************************}
- procedure MAKE_BOYER_MOORE_TABLE(var target: string; var table);
- { TARGET is the target string for which a text is to be searched. The
- shift table for the target string is constructed in TABLE, which must be
- a variable providing 256 bytes of storage, e.g. a variable declared as
- ARRAY[CHAR] OF BYTE. }
- begin { MAKE_BOYER_MOORE_TABLE }
- inline
- ($1E/ { push ds }
- $C5/$76/<target/ { lds si,[bp+target] }
- $89/$F3/ { mov bx,si }
- $8A/$04/ { mov al, [si] }
- $88/$C4/ { mov ah,al }
- $B9/$80/$00/ { mov cx,$0080 }
- $C4/$7E/<table/ { les di,[bp+table] }
- $89/$FA/ { mov dx,di }
- $FC/ { cld }
- $F2/$AB/ { rep stosw }
- $89/$DE/ { mov si,bx }
- $89/$D7/ { mov di,dx }
- $46/ { inc si }
- $98/ { cbw }
- $3C/$01/ { cmp al,1 }
- $7E/$13/ { jle done }
- $48/ { dec ax }
- $88/$E1/ { mov cl,ah }
- $88/$E7/ { mov bh,ah }
- $8A/$1C/ { next: mov bl,[si] }
- $89/$C2/ { mov dx,ax }
- $29/$CA/ { sub dx,cx }
- $88/$11/ { mov [bx+di],dl }
- $46/ { inc si }
- $41/ { inc cx }
- $39/$C1/ { cmp cx,ax }
- $75/$F2/ { jne next }
- $1F) { done: pop ds }
- end; { MAKE_BOYER_MOORE_TABLE }
- {****************************************************************************}
- function BOYER_MOORE_SEARCH(var text_array; start, text_length: word;
- var target: string; var table): word;
- { TEXT_ARRAY is an array of characters in which a text is stored; the
- text begins in TEXT_ARRAY[1] and is TEXT_LENGTH characters long. TARGET
- must either be the same variable used as parameter TARGET in an earlier
- call to MAKE_BOYER_MOORE_TABLE or another variable with the same value.
- TABLE must be the variable that was used as parameter TABLE in the same
- call to MAKE_BOYER_MOORE_TABLE. TARGET and TABLE must be in the same
- segment, i.e. they must both be global variables or both local variables.
- A Boyer-Moore search is performed on the text in TEXT_ARRAY, beginning
- with the character in position START and using shift table TABLE, for
- the first substring that matches TARGET. If a match is found, the
- position of the first character of the matching substring is returned.
- Otherwise 0 is returned. A function value of 0 is also returned if TABLE
- and TARGET are not in the same segment. }
- begin { BOYER_MOORE_SEARCH }
- inline
- ($1E/ { push ds }
- $33/$C0/ { xor ax,ax }
- $C5/$5E/<table/ { lds bx,[bp+table] } { If TABLE and }
- $8C/$D9/ { mov cx,ds } { TARGET are in }
- $C5/$76/<target/ { lds si,[bp+target] } { different }
- $8C/$DA/ { mov dx,ds } { segments, re- }
- $3B/$D1/ { cmp dx,cx } { port failure }
- $75/$76/ { jne notfound2 } { at once }
- $8A/$F4/ { mov dh,ah }
- $8A/$14/ { mov dl,[si] }
- $80/$FA/$01/ { cmp dl,1 }
- $7F/$1F/ { jg boyer }
- $7C/$6B/ { jl notfound2 }
- $8A/$44/$01/ { mov al,[si+1] }
- $8B/$56/<start/ { mov dx,[bp+start] }
- $4A/ { dec dx }
- $8B/$4E/<text_length/ { mov cx,[bp+text_length] }
- $2B/$CA/ { sub cx,dx }
- $C4/$7E/<text_array/ { les di,[bp+text_array] }
- $8B/$DF/ { mov bx,di }
- $03/$FA/ { add di,dx }
- $FC/ { cld }
- $F2/$AE/ { repne scasb }
- $75/$53/ { jne notfound2 }
- $97/ { xchg ax,di }
- $2B/$C3/ { sub ax,bx }
- $EB/$50/ { jmp short exit }
- $FE/$CA/ { boyer: dec dl }
- $03/$F2/ { add si,dx }
- $C4/$7E/<text_array/ { les di,[bp+text_array] }
- $8B/$CF/ { mov cx,di }
- $03/$4E/<text_length/ { add cx,[bp+text_length] }
- $49/ { dec cx }
- $4F/ { dec di }
- $03/$7E/<start/ { add di,[bp+start] }
- $03/$FA/ { add di,dx }
- $8A/$74/$01/ { mov dh,[si+1] }
- $55/ { push bp }
- $8B/$E9/ { mov bp,cx }
- $8A/$EC/ { mov ch,ah }
- $FD/ { std }
- $EB/$05/ { jmp short comp }
- $D7/ { nexttable: xlat }
- $03/$F8/ { add di,ax }
- $72/$2A/ { jc notfound }
- $3B/$EF/ { comp: cmp bp,di }
- $72/$26/ { jb notfound }
- $26/$8A/$05/ { mov al,es:[di] }
- $3A/$F0/ { cmp dh,al }
- $75/$F0/ { jne nexttable }
- $4F/ { dec di }
- $8A/$CA/ { mov cl,dl }
- $F3/$A6/ { repe cmpsb }
- $74/$0D/ { je found }
- $8A/$C2/ { mov al,dl }
- $2B/$C1/ { sub ax,cx }
- $03/$F8/ { add di,ax }
- $47/ { inc di }
- $03/$F0/ { add si,ax }
- $8A/$C6/ { mov al,dh }
- $EB/$DC/ { jmp short nexttable }
- $5D/ { found: pop bp }
- $C4/$46/<text_array/ { les ax,[bp+text_array] }
- $97/ { xchg ax,di }
- $2B/$C7/ { sub ax,di }
- $40/ { inc ax }
- $40/ { inc ax }
- $EB/$03/ { jmp short exit }
- $5D/ { notfound: pop bp }
- $32/$C0/ { notfound2: xor al,al }
- $89/$46/$FE/ { exit: mov [bp-2],ax }
- $FC/ { cld }
- $1F) { pop ds }
- end; { BOYER_MOORE_SEARCH }
- {****************************************************************************}
- end.