Programmer's ROM - The Computer Language Library

Text File | 1988-05-03 | 7.1 KB | 166 lines

The intention of this posting is not to provide a facility, but rather
to demonstrate a technique for comparing strings in a more
sophisticated way than simply using ASCII values.

Comments, questions etc. are very welcome to:

   Erland Sommarskog
   ENEA Data, Stockholm
   sommar@enea.UUCP

The posting contains seven files that can be divided into three groups:

I:   strcompS.a and strcompB.a
     The core of the posting. They contain a package for string
     comparisons. It has a character-transcription table to be loaded
     by the user and comparison operators for transcribed strings. The
     exported routines are described below. StrcompS is the
     specification, whereas strcompB contains the package body.

II:  latin1.a and natascii.a
     They declare names for characters, to be used, for example, when
     defining a collating sequence for the package above. Latin1
     declares names for the ISO standard 8859/1. Natascii declares
     names for national replacements of the ordinary ASCII set.

III: define.a, comline.a and main.a
     A demonstration application that uses the string-comparison
     package. Define.a loads the character collating sequence.
     Comline.a reads the command line. Note that this file is bound to
     Verdix Ada for Unix and must be rewritten for another system.
     Main.a is the main program. It reads lines from standard input or
     a named file and writes the sorted lines to standard output when
     end-of-file is detected. The options are described at the end of
     this file.

You should compile the files in the order: latin1, natascii, strcompS,
strcompB, define, comline, main.

Four-dimensional sorting
------------------------
The string-comparison package compares strings at four levels:
1) Alphabetic
2) Accents
3) Non-letters
4) Difference in case

What counts as an alphabetic character, an accent and so on is up to
the user. The user may define "$" as a letter with "(" as its lowercase
variant, if so desired. A level is only considered if all the levels
above it show no difference.
As an example I take
   T^ete-`a-t^ete
(I assume a "normal" loading of the character table here.)

For the first level we use
   TETEATETE
thus we remove the accents and the hyphens. On the next level we
re-insert the accents, so we get
   T^ETE`AT^ETE
On level three we consider only the hyphens. When comparing
non-letters the package uses the simple ASCII values. The earlier a
non-letter occurs in the string, the lower the sort value. Thus,
"trans-scription" will precede "transscrip-tion". (Actually, as the
implementation is done, the position is more important than the ASCII
value.) On the last level we use
   T^ete`at^ete
thus, the original writing with the hyphens removed. Note that the user
can specify case to be insignificant.

(This isn't a description of how the package is implemented, just a way
of illustrating the result. In practice it is done somewhat more
efficiently.)

When defining accented variants it is possible to let a character be a
variant of a string; in this way the AE ligature can be sorted as "AE".
The opposite is not possible and, what is worse, a string can't have an
alphabetic value. Thus the package is not able to sort languages such
as Spanish (CH and LL) correctly.

The digit characters are handled in a special way if you define them as
alphabetics. A sequence of digits is read as one number and sorts after
all other alphabetics, even if the digits were defined as the first
characters. So you will get
   File1 File2 File10 File11
instead of the usual
   File1 File10 File11 File2
If you would rather sort the digits as they are read aloud, this is
also possible, e.g. load "0" as a variant of "zero".

The package contains the following routines:

Load Operations
---------------
   PROCEDURE Load_alphabetic(ch : IN character);

Loads ch as the next alphabetic character. The order of loading
determines the sorting values.
   PROCEDURE Load_variant(ch       : IN character;
                          Equ_ch   : IN character;
                          Equ_kind : IN Equivalence_kind);
   TYPE Equivalence_kind IS (Exact, Case_diff, Accented);
   PROCEDURE Load_variant(ch      : IN character;
                          Equ_str : IN string);

Load_variant loads ch as a variant of Equ_ch or Equ_str. The
interpretation of Equ_kind is:
   Exact:     Exactly the same. There is no difference. This is what
              you use when you don't want case to be significant.
   Case_diff: Load ch as a lowercase variant of Equ_ch. There will be
              a difference at level 4.
   Accented:  Load ch as a variant of Equ_ch at level 2.
The latter version of Load_variant always loads ch at level 2.

To simplify loading, the package also provides routines for loading a
character and its ASCII lowercase equivalent simultaneously:

   PROCEDURE Set_case_significance(Flag : boolean);
   PROCEDURE Alpha_both_cases(ch : IN character);
   PROCEDURE Variant_both_cases(ch     : IN character;
                                Equ_ch : IN character);
   PROCEDURE Variant_both_cases(ch      : IN character;
                                Equ_str : IN string);

With Set_case_significance you determine whether case should be
significant when loading the pairs. Variant_both_cases loads ch at
level 2.

The loading operations raise Already_defined if an attempt is made to
load a character twice. If Equ_ch or part of Equ_str is undefined, the
exception Undefined_equivalent is raised.

Transcription operations
------------------------
These routines translate a string to the internal coding.

   TYPE Transscripted_string(Max_length : natural) IS PRIVATE;
   PROCEDURE Transscribe(ch        : IN character;
                         Trans_str : OUT Transscripted_string);
   PROCEDURE Transscribe(Str       : IN string;
                         Trans_str : OUT Transscripted_string);

If the transcription is too long, the routines will raise
Transscription_error.
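The effect of the load operations can be illustrated with a small
stand-alone model. This is only a sketch under assumptions, not the
code in strcompB.a: the record Entry_Type, the array Table and the
counters Next_Alpha and Next_Accent are hypothetical names invented
here, and it is assumed that each accented variant simply receives the
next free accent number. Only the character form of Load_variant is
modelled. The sketch uses Ada 95 library names (Ada.Text_IO) rather
than the Ada 83 ones.

```ada
with Ada.Text_IO;

procedure Table_Demo is

   type Equivalence_Kind is (Exact, Case_Diff, Accented);

   Already_Defined      : exception;
   Undefined_Equivalent : exception;

   --  Hypothetical table entry (an assumption of this sketch).
   type Entry_Type is record
      Defined : Boolean := False;
      Alpha   : Natural := 0;      --  level 1: position in loading order
      Accent  : Natural := 0;      --  level 2: 0 means unaccented
      Lower   : Boolean := False;  --  level 4: True for lowercase variant
   end record;

   Table       : array (Character) of Entry_Type;
   Next_Alpha  : Natural := 0;
   Next_Accent : Natural := 0;

   procedure Load_Alphabetic (Ch : in Character) is
   begin
      if Table (Ch).Defined then
         raise Already_Defined;
      end if;
      --  The order of loading determines the level-1 sort value.
      Next_Alpha := Next_Alpha + 1;
      Table (Ch) := (Defined => True, Alpha => Next_Alpha,
                     Accent  => 0,    Lower => False);
   end Load_Alphabetic;

   procedure Load_Variant (Ch       : in Character;
                           Equ_Ch   : in Character;
                           Equ_Kind : in Equivalence_Kind) is
   begin
      if Table (Ch).Defined then
         raise Already_Defined;
      elsif not Table (Equ_Ch).Defined then
         raise Undefined_Equivalent;
      end if;
      --  Start as an exact copy, then mark the difference, if any.
      Table (Ch) := Table (Equ_Ch);
      case Equ_Kind is
         when Exact =>
            null;
         when Case_Diff =>
            Table (Ch).Lower := True;
         when Accented =>
            Next_Accent := Next_Accent + 1;
            Table (Ch).Accent := Next_Accent;
      end case;
   end Load_Variant;

begin
   Load_Alphabetic ('A');
   Load_Variant ('a', 'A', Case_Diff);
   Load_Alphabetic ('E');
   Load_Variant ('e', 'E', Case_Diff);
   --  Character'Val (201) is capital E with acute accent in ISO 8859/1.
   Load_Variant (Character'Val (201), 'E', Accented);

   Ada.Text_IO.Put_Line
     ("Level-1 value of 'e':" & Natural'Image (Table ('e').Alpha));

   begin
      Load_Alphabetic ('A');        --  second load of the same character
   exception
      when Already_Defined =>
         Ada.Text_IO.Put_Line ("Already_Defined raised for 'A'");
   end;

   begin
      Load_Variant ('o', 'O', Case_Diff);   --  'O' was never loaded
   exception
      when Undefined_Equivalent =>
         Ada.Text_IO.Put_Line ("Undefined_Equivalent raised for 'o'");
   end;
end Table_Demo;
```

In this model 'e' shares the level-1 value of 'E' and differs from it
only in the Lower flag, which is why a difference between the two would
not show up before level 4.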
Comparison operators
--------------------
   FUNCTION "<=" (Left, Right : Transscripted_string) RETURN boolean;
   FUNCTION "<"  (Left, Right : Transscripted_string) RETURN boolean;
   FUNCTION ">=" (Left, Right : Transscripted_string) RETURN boolean;
   FUNCTION ">"  (Left, Right : Transscripted_string) RETURN boolean;

I have only included operations for comparing transcribed strings. Of
course there could be a set for uncoded strings too.

Other function
--------------
   FUNCTION Is_letter(ch : character) RETURN boolean;

The demonstration program
-------------------------
The program takes the options:
   -8    Use ISO/Latin-1. If not present, use 7-bit ASCII with national
         replacements.
   -e    Case is significant. When omitted, case is not significant.
   -LX   Selects language. X should be one of the following:
            s or S: Swedish. (Default)
            d or D: Danish
            g:      German1: "A, "O and "U sort as A, O and U.
            G:      German2: "A, "O and "U sort as AE, OE and UE.
            f or F: French

In the definition routine I load space as the first alphabetic letter.
As a result, "Smith, Tony" will sort before "Smithson, Alan".
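Why loading space as an alphabetic matters can be seen from a small
stand-alone sketch of the level-1 key alone. It does not use the
package: Level_1_Key and Show are hypothetical helpers written for this
illustration, and the ordinary ASCII order (where space already comes
before the letters) stands in for "space loaded first". The comma is
treated as a non-letter and dropped in both runs. Ada 95 library names
are used.

```ada
with Ada.Text_IO;
with Ada.Characters.Handling;

procedure Space_Demo is

   --  Hypothetical level-1 key: keep letters (upper-cased) and,
   --  optionally, spaces; every other character is dropped.
   function Level_1_Key (S                   : String;
                         Space_Is_Alphabetic : Boolean) return String
   is
      Key  : String (1 .. S'Length);
      Last : Natural := 0;
   begin
      for I in S'Range loop
         if Ada.Characters.Handling.Is_Letter (S (I)) then
            Last := Last + 1;
            Key (Last) := Ada.Characters.Handling.To_Upper (S (I));
         elsif S (I) = ' ' and then Space_Is_Alphabetic then
            Last := Last + 1;
            Key (Last) := ' ';
         end if;
      end loop;
      return Key (1 .. Last);
   end Level_1_Key;

   --  Compare the two names from the text at level 1 only.
   procedure Show (Space_Is_Alphabetic : Boolean) is
      A : constant String :=
        Level_1_Key ("Smith, Tony", Space_Is_Alphabetic);
      B : constant String :=
        Level_1_Key ("Smithson, Alan", Space_Is_Alphabetic);
   begin
      if A < B then
         Ada.Text_IO.Put_Line ("[" & A & "] sorts before [" & B & "]");
      else
         Ada.Text_IO.Put_Line ("[" & B & "] sorts before [" & A & "]");
      end if;
   end Show;

begin
   Show (Space_Is_Alphabetic => True);
   Show (Space_Is_Alphabetic => False);
end Space_Demo;
```

With space kept, the keys are "SMITH TONY" and "SMITHSON ALAN", and the
space in position six decides in favour of Smith. With space dropped,
the keys become "SMITHTONY" and "SMITHSONALAN", so "T" is compared with
"S" in position six and Smithson comes first.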