- Newsgroups: comp.unix.bsd
- Path: sparky!uunet!cs.utexas.edu!sun-barr!ames!agate!dog.ee.lbl.gov!hellgate.utah.edu!fcom.cc.utah.edu!cs.weber.edu!terry
- From: terry@cs.weber.edu (A Wizard of Earth C)
- Subject: Re: Dumb Americans (was INTERNATIONALIZATION: JAPAN, FAR EAST)
- Message-ID: <1992Dec30.061759.8690@fcom.cc.utah.edu>
- Keywords: Han Kanji Katakana Hirugana ISO10646 Unicode Codepages
- Sender: news@fcom.cc.utah.edu
- Organization: Weber State University (Ogden, UT)
- References: <1992Dec19.083137.4400@fcom.cc.utah.edu> <2564@titccy.cc.titech.ac.jp> <1992Dec30.010216.2550@nobeltech.se>
- Date: Wed, 30 Dec 92 06:17:59 GMT
- Lines: 393
-
- In article <1992Dec30.010216.2550@nobeltech.se> ppan@nobeltech.se (Per Andersson) writes:
- >In article <2564@titccy.cc.titech.ac.jp> mohta@necom830.cc.titech.ac.jp (Masataka Ohta) writes:
- >>
- >>Do you know that Japan vote AGAINST ISO10646/Unicode, because it's not
- >>good for Japanese?
- >>
- >>>So even if the Unicode standard ignores backward compatability
- >>>with Japanese standards (and specific American and European standards),
- >>>it better supports true internationalization.
- >>
- >>The reason of disapproval is not backward compatibility.
- >>
- >>The reason is that, with Unicode, we can't achieve internationalization.
- >
- >But, what has Unicode got to do with ISO-10646 ? Has the promised (very much
- >needed IMHO) revision of Unicode arrived ? (1.1). Unicode is a 16bit character-
- >set which I know did ugly things with asiatic languages. I thought 10646,
- >which is a 32bit standard (by ISO !) did not, except for doing something
- >the turks didn't like, don't remember what it was. Enlighten me !
-
-
- The following is my treatise on the Unicode standard, and why I think it
- is applicable or can be made applicable to internationalization of 386BSD.
-
- Let me restate here (as I do below) that I do not believe the attribution
- of language by the ordinal value of a character is a goal of the storage
- representation of the character glyph.
-
- The goal of attributing the language of a particular character is
- adequately satisfied by the choice of I/O mechanisms and mappings, and can
- also be satisfied by localized attribution of a file within the storage
- mechanism for that file. The only particular hurdle to overcome is the
- provision of multiple concurrent language specific I/O mechanisms for the
- benefit of translators. This has been discussed elsewhere.
-
-
- ======================= ======================= =======================
- ======================= ======================= =======================
- ======================= ======================= =======================
-
- ISO10646 is based on the Unicode standard.
-
- The "ugly thing Unicode does with asiatic languages" is exactly what it
- does with all other languages: There is a single lexical assignment for
- each possible glyph.
-
- This doesn't "screw up" anything, unless you expect your character set to
- be attributed by language... ie:
-
- English '#' character
- |
- v
- -+-+-+-+-+-+-+-+-+-+-+-+- -+-+-+-+-+-+-+-+-+-+-+-+-
- ... | |!|"|#|$|%|&|'|(|)|*| ... ... | |!|"| |$|%|&|'|(|)|*| ...
- -+-+-+-+-+-+-+-+-+-+-+-+- -+-+-+-+-+-+-+-+-+-+-+-+-
-
- US ASCII UK ASCII
-
- In the example above, the lexical sets for US ASCII and UK ASCII are not
- intersecting, even though they contain exactly the same glyphs for all
- but one character.
-
- Thus by the lexical order of a character, you can tell if it is an American,
- English, Japanese, or Chinese character. The argument against Unicode, as
- I understand it so far, is that the ordinal value of a character is not an
- indicator of which language it came from... ie:
-
- Character set excerpt Which set (count)
-
-
- /----------------------\
- | |
- -+-+-+-+-+-+-+-+-+-+-+-+- |
- ... | |!|"| |$|%|&|'|(|)|*| ... | UK ASCII (96)
- -+-+-+-+-+-+-+-+-+-+-+-+- |
- | | | | | | | | | | |
- v v v v v v v v v v |
- -+-+-+-+-+-+-+-+-+-+-+-+- |
- ... | |!|"|#|$|%|&|'|(|)|*| ... | US ASCII (96)
- -+-+-+-+-+-+-+-+-+-+-+-+- \------\
- | | | | | | | | | | | |
- v v v v v v v v v v v v
- -+-+-+-+-+-+-+-+-+-+-+-+- -+-+-+-+-
- ... | |!|"|#|$|%|&|'|(|)|*| ... ... | | | | Unicode (34348)
- -+-+-+-+-+-+-+-+-+-+-+-+- -+-+-+-+-
-
-
- This demonstrates the "problem", wherein the lexical order of the Unicode
- character set does not map to lexically adjacent characters in the ASCII
- sets. This behaviour is greatly exaggerated for Japanese/Chinese character
- sets, which have relatively large numbers of non-intersecting characters
- (as opposed to the 7 non-intersecting characters for most Western European
- languages and US ASCII), thus leaving a relatively large number of "gaps"
- in the lexical tables for a particular Asian language... ie:
-
-
- * = A valid character in a language
-
- -+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-
- ... |*|*| |*|*|*|*|*| |*|*| | | | |*|*|*|*|*|*|*| | ... Japanese
- -+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-
-
- -+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-
- ... | |*|*| |*|*| |*|*|*|*|*|*|*|*|*| |*|*| | |*|*| ... Chinese
- -+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-
-
-
- [ I do not view the attribution of language by the ordinal value of a
- character as a goal of the storage representation of the character
- glyph. This is, perhaps, where I differ with the (stated by various
- individuals) Japanese assessment of the Unicode standard. ]
-
- The fact that ISO-Latin-1 is used as the base character set for Unicode
- (and is thus in a pre-existing lexical order), and that languages with glyphs
- that are non-existent in other languages (ie: Tamil and Arabic) are also
- in a pre-existing adjacent lexical order, is seen as Eurocentric (or US
- centric, when one considers that the ISO-Latin-1 set is a superset of US
- ASCII).
-
- The rationale is that the Americans and Western Europeans get to keep the
- majority of their existing equipment without modification, while the
- Japanese are required to give up their existing investment in the JIS
- standard and equipment which supports it.
-
-
- THIS IS PATENTLY FALSE ON SEVERAL GROUNDS:
-
- 1) The storage mechanism (in this case, Unicode) does not have to
- bear a direct relationship to the input/output mechanism. For
- instance, a "Unicode font" for use in Japan could contain only
- 		those glyphs used in Japanese, just as a Unicode font for use
- in the US can be limited to US ASCII. As long as the additional
- characters are not used, there is no reason for them to actually
- exist in the font.
-
- 	2)	Localization is, and for the foreseeable future will continue to
- be, necessary for the input mechanism. This will be true of the
- large glyph count languages forever because of the awkward input
- mechanisms required. The most likely immediate change will be in
- the small glyph count languages, such as those of Western Europe.
- The additional penalty for small glyph count languages will be a
- 16k table for Unicode to Local ISO standard translation, and a
- 512b table for Local ISO standard to Unicode translation. There
- is also the lookup penalty that will be paid referencing these
- tables.
-
- 3) The base change for the Japanese language will be a 128k table for
- 16 bits worth of short-to-short JIS to Unicode lexical translation,
- and a 128k table for the reverse translation for output. The
- most likely immediate change here will be a direct change to the
- JIS translation table output from JIS to Unicode, thus eliminating
- any additional translation penalties for Japanese over and above
- that required by the large glyph set. In the case of most existing
- Japanese hardware, this is a simple ROM replacement. Thus Japanese
- I/O will realize a performance increase over and above that realized
- 		by native input of Western languages. This advantage is specific
- 		to large glyph count languages, and is shared by Japanese and
- 		Chinese. (A rough sketch of this table translation follows
- 		this list.)
-
- 4) The storage requirements for text files in small glyph count
- languages double from 8 bits to 16 bits per glyph when using any
- internationalization mechanism which allows for large glyph count
- languages. This is a significant concession from users of small
- glyph count languages in the interests of allowing for future
- internationalization from which they will not profit significantly.
- Methods of avoiding this expansion (such as Runic encoding [the
- commonly accepted mechanism] and file system language attribution
- 		[my concept]) carry performance or loss-of-information penalties
- 		for small glyph count language users; this is discussed later.
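-
- To make the short-to-short table translation in (3) above concrete, here is
- a rough C sketch.  The table names are mine, and a real pair of tables would
- be generated from the published JIS and Unicode code assignments rather than
- left empty as here:
-
- 	typedef unsigned short unichar_t;	/* 16 bit character value */
-
- 	/*
- 	 * Hypothetical short-to-short translation tables: 64k entries of
- 	 * two bytes each, ie: the 128k figure quoted above, per direction.
- 	 */
- 	static unichar_t jis_to_uni[65536];
- 	static unichar_t uni_to_jis[65536];
-
- 	/* One table lookup per character on input... */
- 	unichar_t
- 	jis2uni(unichar_t c)
- 	{
- 		return jis_to_uni[c];
- 	}
-
- 	/* ...and one table lookup per character on output. */
- 	unichar_t
- 	uni2jis(unichar_t c)
- 	{
- 		return uni_to_jis[c];
- 	}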
-
-
- Unicode is NOT to the advantage of Western users as has been claimed... to
- the contrary, internationalization carries significant penalties for the
- Western user, discounting entirely the programming issues involved. The
- relatively low cost of enforced localization by simply making code 8 bit
- clean is highly attractive... but localization has significant drawbacks
- for the non-Western users, and in particular large glyph count languages
- like Japanese and Chinese.
-
-
- PERFORMANCE AND STORAGE OPTIMIZATIONS
-
- It is possible to overcome, or at least alleviate somewhat, several
- disadvantages to Western users by selective optimization. In particular,
- the selection of optimistic assumptions in the initial lexical order is
- of benefit to Western users, and explicitly for US ASCII or ISO Latin-1
- users.
-
- These benefits are for the most part dependent upon the small glyph count
- of Western languages, and do not translate to a specific advantage of
- lexical order that large glyph count languages could benefit from. In
- other words, using a JIS set to enable all Japanese characters to be
- lexically adjacent WOULD NOT RESULT IN THE JAPANESE USER GAINING THE SAME
- BENEFITS. THE BENEFITS *DEPEND* ON BOTH LEXICAL ORDER AND ON THE FACT OF
- A SMALL GLYPH COUNT LANGUAGE.
-
- 1) OPTIMIZATION: Type 1 Runic Encoding.
- BENEFITS: US ASCII and ISO-Latin-1.
- File size is an indicator of character count for
- beneficiary languages.
- File size is mathematically related to the
- 				character count for languages not intersecting the
- beneficiary ISO-Latin-1 glyph set.
- COSTS: Additional storage for Runic encoded languages.
- Additional cost for character 0 in text files
- 				in beneficiary languages.
- File size is not an indicator of character count
- for languages which partially intersect the
- beneficiary ISO-Latin-1 glyph set.
- Difficulty in Glyph substitution.
- IMPLEMENTATION:
-
- Type 1 Runic Encoding uses the NULL character (lexical position 0)
- to indicate a rune sequence. For all users of the ISO-Latin-1
- character set or lexically adjacent subsets (ie: US ASCII), a NULL
- character is used to introduce a sequence of 8-bit characters
- representing a 16 bit character whose ordinal value is in excess
- of the ordinal value 255. Type 1 Runic encoding allows almost all
- existing Western files to remain unchanged, as nearly no text
- files contain NULL characters.
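-
- 	As a minimal sketch of the above in C (the posting fixes only the NULL
- 	introducer, so the assumption here that the rune is simply the high
- 	byte followed by the low byte of the 16 bit value is mine):
-
- 		#include <stdio.h>
-
- 		typedef unsigned short unichar_t;
-
- 		/* Write one character as a Type 1 rune sequence. */
- 		void
- 		rune1_put(unichar_t c, FILE *fp)
- 		{
- 			if (c > 255) {
- 				putc(0, fp);		/* NUL introducer */
- 				putc((c >> 8) & 0xff, fp);
- 				putc(c & 0xff, fp);
- 			} else
- 				putc(c, fp);		/* plain 8-bit byte */
- 		}
-
- 		/* Read one character back; returns -1 at end of file. */
- 		long
- 		rune1_get(FILE *fp)
- 		{
- 			int ch, hi, lo;
-
- 			if ((ch = getc(fp)) == EOF)
- 				return -1;
- 			if (ch != 0)			/* ordinary character */
- 				return ch;
- 			hi = getc(fp);			/* rune follows the NUL */
- 			lo = getc(fp);
- 			return (hi == EOF || lo == EOF) ? -1 : ((hi << 8) | lo);
- 		}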
-
- 2) OPTIMIZATION: Type 2 Runic Encoding.
- BENEFITS: US ASCII.
- Lesser benefits for ISO-Latin-1 and other sets
- 				intersecting with US ASCII.
- Frequency encoding allows shorter sequences of
- characters to represent the majority of Runic
- encoded characters in large glyph set languages
- 				than is possible in Type 1 Runic Encoding.
- File size is an indicator of character count for
- US ASCII text files.
- File size is mathematically related to the
- 				character count for languages not intersecting the
- beneficiary US ASCII glyph set.
- COSTS: Additional storage for Runic encoded languages.
- Additional cost for characters with an ordinal
- 				value in excess of 127 in text files in beneficiary
- languages.
- File size is not an indicator of character count
- for languages which partially intersect the
- beneficiary US ASCII glyph set.
- Difficulty in Glyph substitution.
- IMPLEMENTATION:
-
- Type 2 Runic Encoding uses characters with an ordinal value in
- the range 128-255 to introduce a sequence of 8-bit characters
- representing a 16 bit character whose ordinal value is in excess
- of the ordinal value 127. Type 2 Runic encoding allows frequency
- encoding (by virtue of multiple introducers) of 128*256 (32k)
- glyphs. Since the Unicode set consists of 34348 glyphs, this is
- an average of two 8-bit characters per Runic Encoded glyph for
- the vast majority (32768) of encoded glyphs. This is significant
- when compared to the average of three 8-bit characters per encoded
- glyph for Type 1 Runic encoding.
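-
- 	Again a rough sketch; the frequency ranking below is only a stand-in
- 	(a real implementation would rank the 32k most common non-ASCII glyphs
- 	by frequency), and the encoding of the residue beyond 32k is not
- 	specified here, so it is omitted:
-
- 		#include <stdio.h>
-
- 		typedef unsigned short unichar_t;
-
- 		/* Placeholder frequency rank: simply (c - 128). */
- 		static long
- 		uni_to_rank(unichar_t c)
- 		{
- 			return (c >= 128) ? (long)(c - 128) : -1;
- 		}
-
- 		/* Write one character as a Type 2 rune sequence. */
- 		int
- 		rune2_put(unichar_t c, FILE *fp)
- 		{
- 			long rank;
-
- 			if (c < 128) {
- 				putc(c, fp);		/* US ASCII untouched */
- 				return 0;
- 			}
- 			rank = uni_to_rank(c);
- 			if (rank < 0 || rank >= 128 * 256)
- 				return -1;		/* beyond the 32k set */
- 			putc(0x80 | (rank >> 8), fp);	/* introducer 128..255 */
- 			putc(rank & 0xff, fp);		/* second byte of pair */
- 			return 0;
- 		}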
-
-
- [ For the obvious reason of file size no longer directly representing what
- has been, in the past, meaningful information, I personally dislike the
-   concept of Runic encoding, even though it will tend to affect only those
-   languages which are permissible in an "8-bit clean" internationalization
-   environment, thus not affecting me personally. An additional penalty is
-   in glyph substitution within documents. A single glyph substitution could
- potentially change the size of a file -- requiring a shift of the contents
- of the file on the storage medium to account for the additional space
- requirements of the substitute glyph. A final penalty is the input buffer
-   mechanisms not being able to specify a language-independent field length
- for data input. This is particularly nasty for GUI and screen based input
- such as that found in most commercial spreadsheets and databases. For
- these reasons, I can not advocate Runic encoding or the XPG3/XPG4 standards
- which appear to require it. ]
-
-
- 3) OPTIMIZATION: Glyph Count Attribute (in file system)
- BENEFITS: Non-direct beneficiary languages of the Runic
- Encoding, assuming use of Type 1 or Type 2
- Runic Encoding.
- COSTS: File system modification.
- File I/O modifications.
- IMPLEMENTATION:
-
- A Glyph Count Attribute kept as part of the information on a file
- would restore the ability to relate directly available information
- about a file with the character count of the text in the file.
- This is something that is normally lost with Runic encoding. There
- are not insignificant costs associated with this in terms of the
- required modifications to the File I/O system to perform "glyph
- counting". This is especially significant when dealing with the
- concept of glyph substitution in the file.
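-
- 	For illustration, a sketch of the sort of counting the file I/O system
- 	would have to do on write, assuming Type 1 Runic encoded text (the
- 	function is invented here, not taken from any existing file system):
-
- 		#include <stddef.h>
-
- 		/*
- 		 * Count glyphs in a buffer of Type 1 Runic encoded text: a
- 		 * NUL introducer and the two bytes following it make up one
- 		 * glyph; every other byte is one glyph by itself.
- 		 */
- 		unsigned long
- 		glyph_count(const unsigned char *buf, size_t len)
- 		{
- 			unsigned long n = 0;
- 			size_t i = 0;
-
- 			while (i < len) {
- 				i += (buf[i] == 0 && i + 2 < len) ? 3 : 1;
- 				n++;
- 			}
- 			return n;
- 		}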
-
-
- 4) OPTIMIZATION: Language Attribution (in file system)
- BENEFITS: All languages capable of existing in "8-bit clean"
- environments (all small glyph count languages).
- COSTS: File system modification.
- File I/O based translation (buffer modification
- processing time).
- Requirement of conversion to change to/from a
- 				multilingual storage format with non-intersecting
- "8-bit clean" sets (ie: Arabic and US ASCII).
- Conversion utilities.
- Changes to UNIX utilities to allow access to
- and manipulation of attributions.
- IMPLEMENTATION:
-
- The Language Attribution kept as part of the information on a file
- allows 8-bit storage of any language for which an "8-bit clean"
- character set exists/can be produced. Unicode buffers of 16-bit
- glyphs are converted on write to the "8-bit clean" character set
- glyph. This requires a 64k table to allow for direct index
- conversion. In practice, this can be a 16k table due to the
- lexical location of the small glyph count languages within the
- Unicode character set. The conversion on read requires a 512b
- 	table to allow direct index conversion of 256 8-bit values into
- the 256 corresponding Unicode 16-bit characters.
-
- [ This is clever, if I do say so myself ]
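-
- 	A minimal sketch of the write and read conversions, assuming the two
- 	tables have already been loaded according to the language attribution
- 	on the file (the names and sizes follow the description above; the
- 	fallback '?' is my own placeholder):
-
- 		typedef unsigned short unichar_t;
-
- 		/*
- 		 * Per-language tables: 16k bytes for the write direction
- 		 * (the small glyph count languages all sit in the low part
- 		 * of the Unicode lexical order), and 256 16-bit entries
- 		 * (512 bytes) for the read direction.
- 		 */
- 		static unsigned char	uni_to_local[16384];
- 		static unichar_t	local_to_uni[256];
-
- 		/* Convert one buffered Unicode character on write. */
- 		unsigned char
- 		uni2local(unichar_t c)
- 		{
- 			return (c < 16384) ? uni_to_local[c] : '?';
- 		}
-
- 		/* Convert one stored 8-bit character back on read. */
- 		unichar_t
- 		local2uni(unsigned char c)
- 		{
- 			return local_to_uni[c];
- 		}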
-
-
- 5) OPTIMIZATION: Sparse Character Sets For Language Localization
- BENEFITS: Reduced character set/graphic requirements.
- Continued use of non-graphic devices (depends
- on being used in concert with Language Attribution).
- Reduced memory requirements for fonts in graphical
- environments (like X).
- COSTS: Non-localized text files can not benefit.
- Device channel mapping for devices supporting less
- than the full Unicode character set.
- Translation tables and lookup time for devices
- supported using this mechanism.
-
- IMPLEMENTATION:
-
- 	[Pre-existing] Language-specific fonts for "8-bit clean" languages
- can be used, as can existing fonts for Unicode character sets
- for systems like X, which allow sparse font sets. Basically,
- since there is no need to display multilingual messages in a
- localized environment, there is no need to use fonts/devices
- which support an internationalized character set. For instance,
- using a DEC VT220, the full ISO-Latin-1 font is available for
- use. Thus for languages using only characters contained in the
- ISO-Latin-1 set, it is not necessary to supply other glyphs
- within the set as long as output mapping of Unicode to the device
- 	set is done (preferably in the tty driver). Similarly, JIS
- 	devices for Japanese I/O are not required to support, for instance,
- Finnish, Arabic, or French characters.
-
- [ This is also clever, in that it does not waste the existing investments
- in hardware. ]
-
-
-
- ADMITTED DRAWBACKS IN UNICODE:
-
- The fact that lexical order is not maintained for all existing character
- sets (NOTE: NO CURRENT OR PROPOSED STANDARD SUPPORTS THIS IDEA!) means that
- a direct arithmetic translation is not possible for, for instance, JIS to
- Unicode mappings; instead a table lookup is required on input and output.
- This is not a significant penalty anywhere but in languages which do not
- require multiple keystroke input on their respective input devices and
- which are not lexically adjacent in the Unicode set (ie: Turkish). The
- penalty is a table lookup on I/O rather than a direct arithmetic translation
- (an add or subtract depending on direction). NOTE THAT THIS IS NOT A PENALTY
- FOR JIS INPUT, SINCE MULTICHARACTER INPUT SEQUENCES REQUIRE A TABLE LOOKUP TO
- IMPLEMENT REGARDLESS OF THE STORAGE.
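-
- The difference amounts to something like the following sketch (the fixed
- offset and the table are invented purely for illustration):
-
- 	typedef unsigned short unichar_t;
-
- 	static unichar_t local_to_uni[256];	/* per-language lookup table */
-
- 	/* Lexically adjacent set: translation is a simple add or subtract. */
- 	unichar_t
- 	adjacent_to_uni(unsigned char c)
- 	{
- 		return (unichar_t)(c + 0x0100);	/* hypothetical fixed offset */
- 	}
-
- 	/* Non-adjacent set: translation requires a table lookup instead. */
- 	unichar_t
- 	lookup_to_uni(unsigned char c)
- 	{
- 		return local_to_uni[c];
- 	}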
-
- The fact that all character sets do not occur in their local lexical order
- means that a particular character can not be identified as to language by
- its ordinal value. This is a small penalty to pay for the vast reduction
- in storage requirements between a 32-bit and a 16-bit character set that
- contains all required glyphs. The fact that Japanese and Chinese characters
- can not be distinguished as to language by ordinal value is no worse than
- the fact that one can not distinguish an English 's' in the ISO-Latin-1 set
- from a French 's'. The significance of language attribution must be handled
- by the input (and potentially output) mechanisms in any case, and thus they
- must be locale specific. This is sufficient to provide information as to
- the language being output, since input and output devices are generally
- closely associated.
-
- ======================= ======================= =======================
- ======================= ======================= =======================
- ======================= ======================= =======================
-
-
- Terry Lambert
- terry@icarus.weber.edu
- terry_lambert@novell.com
- ---
- Any opinions in this posting are my own and not those of my present
- or previous employers.
- --
- -------------------------------------------------------------------------------
- "I have an 8 user poetic license" - me
- Get the 386bsd FAQ from agate.berkeley.edu:/pub/386BSD/386bsd-0.1/unofficial
- -------------------------------------------------------------------------------
-