NetNews Usenet Archive 1992 #31

Internet Message Format | 1993-01-02 | 3.9 KB

Path: sparky!uunet!spool.mu.edu!uwm.edu!ogicse!mintaka.lcs.mit.edu!ai-lab!muesli!glenn
From: glenn@muesli.ai.mit.edu (Glenn A. Adams)
Newsgroups: comp.std.internat
Subject: Re: Dumb Americans (was INTERNATIONALIZATION: JAPAN, FAR EAST)
Keywords: Han Kanji Katakana Hirugana ISO10646 Unicode Codepages
Message-ID: <1i3pf7INNcri@life.ai.mit.edu>
Date: 2 Jan 93 10:06:31 GMT
Article-I.D.: life.1i3pf7INNcri
References: <1992Dec30.010216.2550@nobeltech.se> <1992Dec30.061759.8690@fcom.cc.utah.edu> <1hu9v5INNbp1@rodan.UU.NET>
Organization: MIT Artificial Intelligence Laboratory
Lines: 67
NNTP-Posting-Host: muesli.ai.mit.edu

In article <1hu9v5INNbp1@rodan.UU.NET> avg@rodan.UU.NET (Vadim Antonov) writes:
>In article <1992Dec30.061759.8690@fcom.cc.utah.edu> terry@cs.weber.edu (A Wizard of Earth C) writes:
>>The "ugly thing Unicode does with asiatic languages" is exactly what it
>>does with all other languages: There is a single lexical assignment for
>>for each possible glyph.
>
>It means that:
>
>1) "mechanistic" conversion between upper and lower case
>   is impossible (as well as case-insensitive comparisons)
>
>   Example: Latin T -> t
>            Cyrillic T -> m
>            Greek T -> ?
>
>   This property alone renders Unicode useless for any business
>   applications.
>

After reading this yet again, I now believe that this entire
conversation may be based on a misunderstanding. Unicode does not
unify Latin T, Cyrillic T, and Greek T! They are separate characters,
as are Latin A, Cyrillic A, and Greek A. Nor does Unicode unify LATIN
A WITH RING and ANGSTROM SYMBOL. Unicode only unifies according to
abstract form within the context of a particular script, i.e., Unicode
encodes the elements of scripts. Furthermore, where there is a clear
difference in functional use, e.g., MINUS vs. HYPHEN vs. FIGURE DASH,
Unicode maintains separate encodings, even though the shapes may be
depicted by a single form (glyph). More examples include EXCLAMATION
POINT vs. LATIN LETTER EXCLAMATION POINT (used as a letter in African
alphabets based on Latin script) and LATIN LETTER EPSILON (used with
a variety of Latin script based alphabets).

I apologize for not recognizing earlier where this argument went
astray. I assumed that you had at least seen a copy of Unicode, thus
I didn't expect this particular misunderstanding could arise.

As for Asian writing systems based on the Han script, the historical
relation among these uses is much stronger than that between Greek,
Latin, and Cyrillic. The differences that have developed are more
along aesthetic dimensions, although differences in functional value
have developed; but then again, the Latin script is nowhere near exact
in its form to function mapping, at least in some important writing
systems, e.g., English & French. It would be as ridiculous to encode
two <c>s for /k/ and /s/ in English as it would be to encode two Han
characters with the same form which have developed specialized or
slightly different meanings in the writing system in which they were
used.

Unlike a glyphic encoding, in which forms may be willy-nilly unified
regardless of function, Unicode takes both form and function into
account in the determination of what constitutes a separate character
code element. In some instances, form is given priority; in others,
function is given priority; in most cases, both have an input.

[N.B. In addition to form and function, Unicode maintains distinctions
which existed in character sets whose characters were incorporated
into Unicode. This ensures that one can have round-trip conversion of
existing data. This "compatibility rule" resulted in the inclusion of
many characters which would not have been included otherwise, e.g.,
FULLWIDTH LATIN LETTER A-Z, a-z, etc. (needed for compatibility with
most Asian character sets). Many Han characters which are stroke
variants were encoded for this reason, and would have been otherwise
unified.]

Regards,
Glenn Adams
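
The separate encoding of look-alike letters, and the compatibility rule
described in the N.B., can be checked against a present-day Unicode
character database. What follows is only a minimal sketch, assuming
Python 3 and its standard unicodedata module; neither appears in the
post, and the helper name describe() is hypothetical. The character
names reported are those of whatever Unicode version the interpreter
ships, which is far later than the version under discussion here, so
names may differ from the ones used above.

```python
import unicodedata

# Three capital letters that can share one glyph shape but are encoded
# as separate characters, one per script: Latin, Greek, Cyrillic.
LOOK_ALIKES = ["\u0054", "\u03A4", "\u0422"]


def describe(ch):
    """Return (code point, character name, name of the lowercase form)."""
    lower = ch.lower()  # case mapping is defined per character
    return (f"U+{ord(ch):04X}", unicodedata.name(ch), unicodedata.name(lower))


for ch in LOOK_ALIKES:
    print(describe(ch))

# Each letter lowercases within its own script, so the mapping is
# "mechanistic" once the characters are kept distinct.

# A WITH RING and the angstrom sign are likewise separate code points.
a_ring = "\u00C5"
angstrom = "\u212B"
print(unicodedata.name(a_ring), "|", unicodedata.name(angstrom))

# A compatibility character: fullwidth A exists for round-trip conversion
# with East Asian character sets. NFKC normalization (a later addition to
# the standard) folds it to the ordinary Latin A.
fullwidth_a = "\uFF21"
print(unicodedata.name(fullwidth_a))
print(unicodedata.normalize("NFKC", fullwidth_a) == "A")
```

Because the three letters carry distinct code points, a case-insensitive
comparison never has to guess which script a "T" belongs to; the script
is already part of the character's identity.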