- The intention of this posting is not to provide a facility, but
- rather to demonstrate a technique to do string comparisons
- in a more sophisticated way than simply using ASCII values.
-
- Comments, questions, etc. are very welcome. Send them to:
- Erland Sommarskog
- ENEA Data, Stockholm
- sommar@enea.UUCP
-
- The posting contains seven files that can be divided into three
- groups:
- I: strcompS.a and strcompB.a
- The core of the posting. They contain a package for string
- comparisons. It has a character-transcription table to be
- loaded by the user and comparison operators for transcribed
- strings. The exported routines are described below.
- StrcompS is the specification, whereas strcompB contains
- the package body.
- II: latin1.a and natascii.a
- They declare names for characters, to be used, for example,
- when defining a collating sequence for the package above.
- Latin1 declares names for the ISO standard 8859/1. Natascii
- declares names for national replacements of the ordinary
- ASCII set.
- III: define.a, comline.a and main.a
- A demonstration application that uses the string-comparison
- package. Define.a loads the character collating sequence.
- Comline.a reads the command line. Note that this file is
- specific to Verdix Ada for Unix and must be rewritten for any
- other system.
- Main.a is the main program. It reads lines from standard
- input or a named file and writes the sorted lines to standard
- output when end-of-file is detected.
- The options are described at the end of this file.
-
- You should compile the files in the order: latin1, natascii,
- strcompS, strcompB, define, comline, main.
-
- Four-dimensional sorting
- ------------------------
-
- The string-comparison package compares strings at four levels:
- 1) Alphabetic
- 2) Accents
- 3) Non-letters
- 4) Difference in case
- What counts as an alphabetic character, an accent and so on is up
- to the user, who may for instance define "$" as a letter with "("
- as its lowercase variant.
-
- A level is only considered if the levels above it show no difference.
- As an example, take
- T^ete-`a-t^ete
- (I assume a "normal" loading of the character table here.)
- At the first level we use TETEATETE, that is, we remove the accents
- and the hyphens. At the next level we re-insert the accents and get
- T^ETE`AT^ETE
- At level three we only consider the hyphens. When comparing
- non-letters the package uses the plain ASCII values. The earlier
- a non-letter occurs in the string, the lower the sort value. Thus,
- "trans-scription" will precede "transscrip-tion". (In fact, the way
- the comparison is implemented, the position matters more than the
- ASCII value.)
- At the last level we use
- T^ete`at^ete
- that is, the original spelling with the hyphens removed. Note that
- the user can specify that case is insignificant.
- (This is not a description of how the package is implemented, just
- a way of illustrating the result. In practice it is done somewhat
- more efficiently.)
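-
- The keys in the illustration above can be reproduced with a small
- stand-alone program. This is only a sketch of the illustration, not
- of the package: it does not use the transcription table, it
- hard-codes "^" and "`" as accent marks and A-Z as the letters, and
- the names Levels_demo, Key, Is_accent, Is_plain_letter and Upper are
- made up for the example. At level 3 it merely collects the
- non-letters and ignores their positions.
-
```ada
WITH Text_IO;
PROCEDURE Levels_demo IS
   -- Illustration only: "^" and "`" stand for accents, as in the text.
   Word : CONSTANT string := "T^ete-`a-t^ete";

   FUNCTION Is_accent(ch : character) RETURN boolean IS
   BEGIN
      RETURN ch = '^' OR ch = '`';
   END Is_accent;

   FUNCTION Is_plain_letter(ch : character) RETURN boolean IS
   BEGIN
      RETURN ch IN 'A'..'Z' OR ch IN 'a'..'z';
   END Is_plain_letter;

   FUNCTION Upper(ch : character) RETURN character IS
   BEGIN
      IF ch IN 'a'..'z' THEN
         RETURN character'val(character'pos(ch) - 32);
      ELSE
         RETURN ch;
      END IF;
   END Upper;

   -- The part of Str that is significant at the given level (1..4).
   FUNCTION Key(Str : string; Level : positive) RETURN string IS
      Result : string(1..Str'length);
      Last   : natural := 0;
      Keep   : boolean;
   BEGIN
      FOR i IN Str'range LOOP
         CASE Level IS
            WHEN 1 =>        -- letters only
               Keep := Is_plain_letter(Str(i));
            WHEN 3 =>        -- non-letters only
               Keep := NOT (Is_plain_letter(Str(i)) OR Is_accent(Str(i)));
            WHEN OTHERS =>   -- levels 2 and 4: letters and accents
               Keep := Is_plain_letter(Str(i)) OR Is_accent(Str(i));
         END CASE;
         IF Keep THEN
            Last := Last + 1;
            IF Level = 4 THEN
               Result(Last) := Str(i);          -- case is kept
            ELSE
               Result(Last) := Upper(Str(i));   -- case is ignored
            END IF;
         END IF;
      END LOOP;
      RETURN Result(1..Last);
   END Key;
BEGIN
   FOR Level IN 1..4 LOOP
      Text_IO.Put_line(integer'image(Level) & ": " & Key(Word, Level));
   END LOOP;
END Levels_demo;
```
-
- For the word above this should print TETEATETE, T^ETE`AT^ETE, --
- and T^ete`at^ete for levels 1 to 4.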
-
- When defining accented variants it is possible to let a character
- be a variant of a string. In this way the AE ligature can be sorted
- as "AE". The opposite is not possible and, what is worse, a string
- can't have an alphabetic value of its own. Thus the package is not
- able to sort languages such as Spanish (CH and LL) correctly.
-
- Digits are handled in a special way if you define them as
- alphabetics. A sequence of digits is read as one number and sorts
- after all other alphabetics (even if the digits were loaded as the
- first characters). So you will get
- File1 File2 File10 File11
- instead of the usual
- File1 File10 File11 File2
- If you would rather sort digits the way they are read out, this is
- also possible: e.g. load "0" as a variant of "zero".
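-
- The rule for digit sequences can be sketched as a stand-alone
- comparison function. This is an illustration under simplifying
- assumptions, not the package's code: letters are compared by their
- plain ASCII values instead of a loaded table, numbers are assumed
- to fit in a natural, and the names Number_demo, Is_digit,
- Read_number, Less and Show are made up for the example.
-
```ada
WITH Text_IO;
PROCEDURE Number_demo IS

   FUNCTION Is_digit(ch : character) RETURN boolean IS
   BEGIN
      RETURN ch IN '0'..'9';
   END Is_digit;

   -- Reads the run of digits starting at Str(Pos) and moves Pos past it.
   PROCEDURE Read_number(Str   : IN string;
                         Pos   : IN OUT positive;
                         Value : OUT natural) IS
      Sum : natural := 0;
   BEGIN
      WHILE Pos <= Str'last AND THEN Is_digit(Str(Pos)) LOOP
         Sum := Sum * 10 + character'pos(Str(Pos)) - character'pos('0');
         Pos := Pos + 1;
      END LOOP;
      Value := Sum;
   END Read_number;

   -- True if Left sorts before Right. A digit sequence counts as one
   -- number and sorts after any other character.
   FUNCTION Less(Left, Right : string) RETURN boolean IS
      Lpos       : positive := Left'first;
      Rpos       : positive := Right'first;
      Lnum, Rnum : natural;
   BEGIN
      WHILE Lpos <= Left'last AND Rpos <= Right'last LOOP
         IF Is_digit(Left(Lpos)) AND Is_digit(Right(Rpos)) THEN
            Read_number(Left, Lpos, Lnum);
            Read_number(Right, Rpos, Rnum);
            IF Lnum /= Rnum THEN
               RETURN Lnum < Rnum;            -- compare as numbers
            END IF;
         ELSIF Is_digit(Left(Lpos)) /= Is_digit(Right(Rpos)) THEN
            RETURN Is_digit(Right(Rpos));     -- the number sorts last
         ELSIF Left(Lpos) /= Right(Rpos) THEN
            RETURN Left(Lpos) < Right(Rpos);  -- plain ASCII comparison
         ELSE
            Lpos := Lpos + 1;
            Rpos := Rpos + 1;
         END IF;
      END LOOP;
      -- Equal so far: a proper prefix sorts first.
      RETURN Lpos > Left'last AND Rpos <= Right'last;
   END Less;

   PROCEDURE Show(Left, Right : string) IS
   BEGIN
      Text_IO.Put_line(Left & " before " & Right & ": " &
                       boolean'image(Less(Left, Right)));
   END Show;
BEGIN
   Show("File2", "File10");    -- 2 < 10, so TRUE
   Show("File10", "File11");   -- 10 < 11, so TRUE
   Show("File10", "File2");    -- 10 > 2, so FALSE
END Number_demo;
```
-
- A plain ASCII comparison would place "File10" before "File2",
- because "1" precedes "2".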
-
- The package contains the following routines:
-
- Load Operations
- ---------------
- PROCEDURE Load_alphabetic(ch : IN character);
- Loads ch as the next alphabetic character. The order of loading
- determines the sorting values.
-
- PROCEDURE Load_variant(ch : IN character;
- Equ_ch : IN character;
- Equ_kind : IN Equivalence_kind);
- TYPE Equivalence_kind IS (Exact, Case_diff, Accented);
- PROCEDURE Load_variant(ch : IN character;
- Equ_str : IN string);
- Load_variant loads ch as a variant of Equ_ch or Equ_str. The interpretation
- of Equ_kind is:
- Exact: Exactly the same. There is no difference. Use this when you
- don't want case to be significant.
- Case_diff: Load ch as a lowercase variant of Equ_ch. There will be a
- difference at level 4.
- Accented: Load ch as a variant of Equ_ch at level 2.
- The second version of Load_variant (with Equ_str) always loads ch at level 2.
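-
- A possible loading sequence, as a sketch. The package name Strcomp
- is an assumption, since this file does not state it; change the
- WITH clause to the name declared in strcompS.a. The block only
- compiles together with that package. The characters are arbitrary
- examples: "[" and "\" are the German national replacements for
- A-umlaut and O-umlaut in 7-bit ASCII.
-
```ada
WITH Strcomp; USE Strcomp;      -- package name is an assumption
PROCEDURE Load_demo IS
BEGIN
   -- Level 1: the order of the calls is the alphabetic order.
   Load_alphabetic('A');
   Load_alphabetic('E');
   Load_alphabetic('O');

   -- Level 4: the lowercase letters differ from these in case only.
   -- (Use Exact instead of Case_diff to make case insignificant.)
   Load_variant('a', 'A', Case_diff);
   Load_variant('e', 'E', Case_diff);
   Load_variant('o', 'O', Case_diff);

   -- Level 2: "[" (A-umlaut) as an accented variant of A.
   Load_variant('[', 'A', Accented);

   -- Level 2, string version: "\" (O-umlaut) sorts as OE.
   Load_variant('\', "OE");
END Load_demo;
```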
-
- To simplify loading, the package also provides routines for loading
- a character and its ASCII lowercase equivalent simultaneously:
- PROCEDURE Set_case_significance(Flag : boolean);
- PROCEDURE Alpha_both_cases(ch : IN character);
- PROCEDURE Variant_both_cases(ch : IN character;
- Equ_ch : IN character);
- PROCEDURE Variant_both_cases(ch : IN character;
- Equ_str : IN string);
- With Set_case_significance you determine whether case should be
- significant when loading the pairs. Variant_both_cases loads ch
- at level 2.
-
- The loading operations raise Already_defined if an attempt is
- made to load a character twice. If Equ_ch or any part of Equ_str
- is undefined, Undefined_equivalent is raised.
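-
- The pair-loading routines and the two exceptions might be used as
- in the sketch below. The package name Strcomp is an assumption
- (this file does not state it), and the block only compiles together
- with the package. It is also an assumption that the "ASCII lowercase
- equivalent" of a national replacement such as "[" is "{".
-
```ada
WITH Text_IO;
WITH Strcomp; USE Strcomp;      -- package name is an assumption
PROCEDURE Pair_demo IS
BEGIN
   Set_case_significance(false);          -- case is not significant

   FOR ch IN character RANGE 'A'..'Z' LOOP
      Alpha_both_cases(ch);               -- loads ch and its lowercase
   END LOOP;

   Variant_both_cases('[', 'A');          -- "[" (and presumably "{") as accented A
   Variant_both_cases('\', "OE");         -- "\" (and presumably "|") sort as OE

   BEGIN
      Load_alphabetic('A');               -- "A" has been loaded already
   EXCEPTION
      WHEN Already_defined =>
         Text_IO.Put_line("A was already defined");
   END;

   BEGIN
      Load_variant('@', '#', Accented);   -- "#" has never been loaded
   EXCEPTION
      WHEN Undefined_equivalent =>
         Text_IO.Put_line("# is not defined");
   END;
END Pair_demo;
```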
-
- Transcription operations
- ------------------------
- These routines translate a string to the internal coding.
- TYPE Transscripted_string(Max_length : natural) IS PRIVATE;
- PROCEDURE Transscribe(ch : IN character;
- Trans_str : OUT Transscripted_string);
- PROCEDURE Transscribe(Str : IN string;
- Trans_str : OUT Transscripted_string);
- If the transcription is too long, the routines raise
- Transscription_error.
-
- Comparison operators
- --------------------
- FUNCTION "<=" (Left, Right : Transscripted_string) RETURN boolean;
- FUNCTION "<" (Left, Right : Transscripted_string) RETURN boolean;
- FUNCTION ">=" (Left, Right : Transscripted_string) RETURN boolean;
- FUNCTION ">" (Left, Right : Transscripted_string) RETURN boolean;
-
- I have only included operations for comparing transcribed
- strings. Of course there could be a set for uncoded strings too.
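-
- Transcribing two strings and comparing them could look like the
- sketch below. The package name Strcomp and the Max_length of 80 are
- assumptions, and the block only compiles together with the package.
- It also assumes that a character that was never loaded, such as the
- comma, is accepted and treated as a non-letter. Space is loaded as
- the first alphabetic, as the definition routine of the
- demonstration program does.
-
```ada
WITH Text_IO;
WITH Strcomp; USE Strcomp;      -- package name is an assumption
PROCEDURE Compare_demo IS
   -- 80 is an arbitrary Max_length chosen for the example.
   Tony, Alan : Transscripted_string(80);
BEGIN
   -- A minimal collating sequence: space first, then A-Z with a-z.
   Set_case_significance(false);
   Load_alphabetic(' ');
   FOR ch IN character RANGE 'A'..'Z' LOOP
      Alpha_both_cases(ch);
   END LOOP;

   Transscribe("Smith, Tony", Tony);
   Transscribe("Smithson, Alan", Alan);

   -- USE Strcomp makes the package's "<" directly visible.
   IF Tony < Alan THEN
      Text_IO.Put_line("Smith, Tony sorts first");
   ELSE
      Text_IO.Put_line("Smithson, Alan sorts first");
   END IF;
EXCEPTION
   WHEN Transscription_error =>
      Text_IO.Put_line("The transcription was too long");
END Compare_demo;
```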
-
- Other function
- --------------
- FUNCTION Is_letter(ch : character) RETURN boolean;
-
- The demonstration program
- -------------------------
- The program takes the options:
- -8 Use ISO/Latin-1. If not present, use 7-bit ASCII with national
- replacements.
- -e Case is significant. When omitted, case is not significant.
- -LX Selects language. X should be one of the following:
- s or S: Swedish. (Default)
- d or D: Danish
- g: German1: "A, "O and "U sort as A, O and U.
- G: German2: "A, "O and "U sort as AE, OE and UE.
- f or F: French
-
- In the definition routine I load space as the first alphabetic
- character. As a result, "Smith, Tony" will sort before
- "Smithson, Alan".
-