- Newsgroups: comp.unix.bsd
- Path: sparky!uunet!cs.utexas.edu!sun-barr!ames!agate!dog.ee.lbl.gov!hellgate.utah.edu!fcom.cc.utah.edu!cs.weber.edu!terry
- From: terry@cs.weber.edu (A Wizard of Earth C)
- Subject: Re: Dumb Americans (was INTERNATIONALIZATION: JAPAN, FAR EAST)
- Message-ID: <1992Dec30.061759.8690@fcom.cc.utah.edu>
- Keywords: Han Kanji Katakana Hirugana ISO10646 Unicode Codepages
- Sender: news@fcom.cc.utah.edu
- Organization: Weber State University (Ogden, UT)
- References: <1992Dec19.083137.4400@fcom.cc.utah.edu> <2564@titccy.cc.titech.ac.jp> <1992Dec30.010216.2550@nobeltech.se>
- Date: Wed, 30 Dec 92 06:17:59 GMT
- Lines: 393
-
- In article <1992Dec30.010216.2550@nobeltech.se> ppan@nobeltech.se (Per Andersson) writes:
- >In article <2564@titccy.cc.titech.ac.jp> mohta@necom830.cc.titech.ac.jp (Masataka Ohta) writes:
- >>
- >>Do you know that Japan vote AGAINST ISO10646/Unicode, because it's not
- >>good for Japanese?
- >>
- >>>So even if the Unicode standard ignores backward compatability
- >>>with Japanese standards (and specific American and European standards),
- >>>it better supports true internationalization.
- >>
- >>The reason of disapproval is not backward compatibility.
- >>
- >>The reason is that, with Unicode, we can't achieve internationalization.
- >
- >But, what has Unicode got to do with ISO-10646 ? Has the promised (very much
- >needed IMHO) revision of Unicode arrived ? (1.1). Unicode is a 16bit character-
- >set which I know did ugly things with asiatic languages. I thought 10646,
- >which is a 32bit standard (by ISO !) did not, except for doing something
- >the turks didn't like, don't remember what it was. Enlighten me !
-
-
- The following is my treatise on the Unicode standard, and why I think it
- is applicable or can be made applicable to internationalization of 386BSD.
-
- Let me restate here (as I do below) that I do not believe the attribution
- of language by the ordinal value of a character is a goal of the storage
- representation of the character glyph.
-
- The goal of attributing the language of a particular character is
- adequately satisfied by the choice of I/O mechanisms and mappings, and can
- also be satisfied by localized attribution of a file within the storage
- mechanism for that file. The only particular hurdle to overcome is the
- provision of multiple concurrent language specific I/O mechanisms for the
- benefit of translators. This has been discussed elsewhere.
-
-
- ======================= ======================= =======================
- ======================= ======================= =======================
- ======================= ======================= =======================
-
- ISO10646 is based on the Unicode standard.
-
- The "ugly thing Unicode does with asiatic languages" is exactly what it
- does with all other languages: There is a single lexical assignment for
- each possible glyph.
-
- This doesn't "screw up" anything, unless you expect your character set to
- be attributed by language... ie:
-
- English '#' character
- |
- v
- -+-+-+-+-+-+-+-+-+-+-+-+- -+-+-+-+-+-+-+-+-+-+-+-+-
- ... | |!|"|#|$|%|&|'|(|)|*| ... ... | |!|"| |$|%|&|'|(|)|*| ...
- -+-+-+-+-+-+-+-+-+-+-+-+- -+-+-+-+-+-+-+-+-+-+-+-+-
-
- US ASCII UK ASCII
-
- In the example above, the lexical sets for US ASCII and UK ASCII are not
- intersecting, even though they contain exactly the same glyphs for all
- but one character.
-
- Thus by the lexical order of a character, you can tell if it is an American,
- English, Japanese, or Chinese character. The argument against Unicode, as
- I understand it so far, is that the ordinal value of a character is not an
- indicator of which language it came from... ie:
-
- Character set excerpt Which set (count)
-
-
- /----------------------\
- | |
- -+-+-+-+-+-+-+-+-+-+-+-+- |
- ... | |!|"| |$|%|&|'|(|)|*| ... | UK ASCII (96)
- -+-+-+-+-+-+-+-+-+-+-+-+- |
- | | | | | | | | | | |
- v v v v v v v v v v |
- -+-+-+-+-+-+-+-+-+-+-+-+- |
- ... | |!|"|#|$|%|&|'|(|)|*| ... | US ASCII (96)
- -+-+-+-+-+-+-+-+-+-+-+-+- \------\
- | | | | | | | | | | | |
- v v v v v v v v v v v v
- -+-+-+-+-+-+-+-+-+-+-+-+- -+-+-+-+-
- ... | |!|"|#|$|%|&|'|(|)|*| ... ... | | | | Unicode (34348)
- -+-+-+-+-+-+-+-+-+-+-+-+- -+-+-+-+-
-
-
- This demonstrates the "problem", wherein the lexical order of the Unicode
- character set does not map to lexically adjacent characters in the ASCII
- sets. This behaviour is greatly exaggerated for Japanese/Chinese character
- sets, which have relatively large numbers of non-intersecting characters
- (as opposed to the 7 non-intersecting characters for most Western European
- languages and US ASCII), thus leaving a relatively large number of "gaps"
- in the lexical tables for a particular Asian language... ie:
-
-
- * = A valid character in a language
-
- -+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-
- ... |*|*| |*|*|*|*|*| |*|*| | | | |*|*|*|*|*|*|*| | ... Japanese
- -+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-
-
- -+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-
- ... | |*|*| |*|*| |*|*|*|*|*|*|*|*|*| |*|*| | |*|*| ... Chinese
- -+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-
-
-
- [ I do not view the attribution of language by the ordinal value of a
- character as a goal of the storage representation of the character
- glyph. This is, perhaps, where I differ with the (stated by various
- individuals) Japanese assessment of the Unicode standard. ]
-
- The fact that ISO-Latin-1 is used as the base character set for Unicode
- (and is thus in a pre-existing lexical order), and that languages with glyphs
- that are non-existent in other languages (ie: Tamil and Arabic) are also
- in a pre-existing adjacent lexical order, is seen as Eurocentric (or US
- centric, when one considers that the ISO-Latin-1 set is a superset of US
- ASCII).
-
- The rationale is that the Americans and Western Europeans get to keep the
- majority of their existing equipment without modification, while the
- Japanese are required to give up their existing investment in the JIS
- standard and equipment which supports it.
-
-
- THIS IS PATENTLY FALSE ON SEVERAL GROUNDS:
-
- 1) The storage mechanism (in this case, Unicode) does not have to
- bear a direct relationship to the input/output mechanism. For
- instance, a "Unicode font" for use in Japan could contain only
- 		those glyphs used in Japanese, just as a Unicode font for use
- in the US can be limited to US ASCII. As long as the additional
- characters are not used, there is no reason for them to actually
- exist in the font.
-
- 	2)	Localization is, and for the foreseeable future will continue to
- be, necessary for the input mechanism. This will be true of the
- large glyph count languages forever because of the awkward input
- mechanisms required. The most likely immediate change will be in
- the small glyph count languages, such as those of Western Europe.
- The additional penalty for small glyph count languages will be a
- 16k table for Unicode to Local ISO standard translation, and a
- 512b table for Local ISO standard to Unicode translation. There
- is also the lookup penalty that will be paid referencing these
- tables.
-
- 3) The base change for the Japanese language will be a 128k table for
- 16 bits worth of short-to-short JIS to Unicode lexical translation,
- and a 128k table for the reverse translation for output. The
- most likely immediate change here will be a direct change to the
- JIS translation table output from JIS to Unicode, thus eliminating
- any additional translation penalties for Japanese over and above
- that required by the large glyph set. In the case of most existing
- Japanese hardware, this is a simple ROM replacement. Thus Japanese
- I/O will realize a performance increase over and above that realized
- 		by native input of Western languages. This advantage is specific
- 		to large glyph count languages, and is shared by Japanese and
- 		Chinese. (A rough sketch of this table translation follows
- 		this list.)
-
- 4) The storage requirements for text files in small glyph count
- languages double from 8 bits to 16 bits per glyph when using any
- internationalization mechanism which allows for large glyph count
- languages. This is a significant concession from users of small
- glyph count languages in the interests of allowing for future
- internationalization from which they will not profit significantly.
- Methods of avoiding this expansion (such as Runic encoding [the
- commonly accepted mechanism] and file system language attribution
- 		[my concept]) carry performance or loss-of-information penalties
- 		for small glyph count language users; this is discussed later.
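-
- To make the short-to-short table translation in (3) above concrete, here is
- a rough C sketch.  The table names are mine, and a real pair of tables would
- be generated from the published JIS and Unicode code assignments rather than
- left empty as here:
-
- 	typedef unsigned short unichar_t;	/* 16 bit character value */
-
- 	/*
- 	 * Hypothetical short-to-short translation tables: 64k entries of
- 	 * two bytes each, ie: the 128k figure quoted above, per direction.
- 	 */
- 	static unichar_t jis_to_uni[65536];
- 	static unichar_t uni_to_jis[65536];
-
- 	/* One table lookup per character on input... */
- 	unichar_t
- 	jis2uni(unichar_t c)
- 	{
- 		return jis_to_uni[c];
- 	}
-
- 	/* ...and one table lookup per character on output. */
- 	unichar_t
- 	uni2jis(unichar_t c)
- 	{
- 		return uni_to_jis[c];
- 	}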
-
-
- Unicode is NOT to the advantage of Western users as has been claimed... to
- the contrary, internationalization carries significant penalties for the
- Western user, discounting entirely the programming issues involved. The
- relatively low cost of enforced localization by simply making code 8 bit
- clean is highly attractive... but localization has significant drawbacks
- for the non-Western users, and in particular large glyph count languages
- like Japanese and Chinese.
-
-
- PERFORMANCE AND STORAGE OPTIMIZATIONS
-
- It is possible to overcome, or at least alleviate somewhat, several
- disadvantages to Western users by selective optimization. In particular,
- the selection of optimistic assumptions in the initial lexical order is
- of benefit to Western users, and explicitly for US ASCII or ISO Latin-1
- users.
-
- These benefits are for the most part dependent upon the small glyph count
- of Western languages, and do not translate to a specific advantage of
- lexical order that large glyph count languages could benefit from. In
- other words, using a JIS set to enable all Japanese characters to be
- lexically adjacent WOULD NOT RESULT IN THE JAPANESE USER GAINING THE SAME
- BENEFITS. THE BENEFITS *DEPEND* ON BOTH LEXICAL ORDER AND ON THE FACT OF
- A SMALL GLYPH COUNT LANGUAGE.
-
- 1) OPTIMIZATION: Type 1 Runic Encoding.
- BENEFITS: US ASCII and ISO-Latin-1.
- File size is an indicator of character count for
- beneficiary languages.
- File size is mathematically related to the
- 				character count for languages not intersecting the
- beneficiary ISO-Latin-1 glyph set.
- COSTS: Additional storage for Runic encoded languages.
- Additional cost for character 0 in text files
- 				in beneficiary languages.
- File size is not an indicator of character count
- for languages which partially intersect the
- beneficiary ISO-Latin-1 glyph set.
- Difficulty in Glyph substitution.
- IMPLEMENTATION:
-
- Type 1 Runic Encoding uses the NULL character (lexical position 0)
- to indicate a rune sequence. For all users of the ISO-Latin-1
- character set or lexically adjacent subsets (ie: US ASCII), a NULL
- character is used to introduce a sequence of 8-bit characters
- representing a 16 bit character whose ordinal value is in excess
- of the ordinal value 255. Type 1 Runic encoding allows almost all
- existing Western files to remain unchanged, as nearly no text
- files contain NULL characters.
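-
- 	As a minimal sketch of the above in C (the posting fixes only the NULL
- 	introducer, so the assumption here that the rune is simply the high
- 	byte followed by the low byte of the 16 bit value is mine):
-
- 		#include <stdio.h>
-
- 		typedef unsigned short unichar_t;
-
- 		/* Write one character as a Type 1 rune sequence. */
- 		void
- 		rune1_put(unichar_t c, FILE *fp)
- 		{
- 			if (c > 255) {
- 				putc(0, fp);		/* NUL introducer */
- 				putc((c >> 8) & 0xff, fp);
- 				putc(c & 0xff, fp);
- 			} else
- 				putc(c, fp);		/* plain 8-bit byte */
- 		}
-
- 		/* Read one character back; returns -1 at end of file. */
- 		long
- 		rune1_get(FILE *fp)
- 		{
- 			int ch, hi, lo;
-
- 			if ((ch = getc(fp)) == EOF)
- 				return -1;
- 			if (ch != 0)			/* ordinary character */
- 				return ch;
- 			hi = getc(fp);			/* rune follows the NUL */
- 			lo = getc(fp);
- 			return (hi == EOF || lo == EOF) ? -1 : ((hi << 8) | lo);
- 		}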
-
- 2) OPTIMIZATION: Type 2 Runic Encoding.
- BENEFITS: US ASCII.
- Lesser benefits for ISO-Latin-1 and other sets
- 				intersecting with US ASCII.
- Frequency encoding allows shorter sequences of
- characters to represent the majority of Runic
- encoded characters in large glyph set languages
- 				than is possible in Type 1 Runic Encoding.
- File size is an indicator of character count for
- US ASCII text files.
- File size is mathematically related to the
- 				character count for languages not intersecting the
- beneficiary US ASCII glyph set.
- COSTS: Additional storage for Runic encoded languages.
- Additional cost for characters with an ordinal
- 				value in excess of 127 in text files in beneficiary
- languages.
- File size is not an indicator of character count
- for languages which partially intersect the
- beneficiary US ASCII glyph set.
- Difficulty in Glyph substitution.
- IMPLEMENTATION:
-
- Type 2 Runic Encoding uses characters with an ordinal value in
- the range 128-255 to introduce a sequence of 8-bit characters
- representing a 16 bit character whose ordinal value is in excess
- of the ordinal value 127. Type 2 Runic encoding allows frequency
- encoding (by virtue of multiple introducers) of 128*256 (32k)
- glyphs. Since the Unicode set consists of 34348 glyphs, this is
- an average of two 8-bit characters per Runic Encoded glyph for
- the vast majority (32768) of encoded glyphs. This is significant
- when compared to the average of three 8-bit characters per encoded
- glyph for Type 1 Runic encoding.
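-
- 	Again a rough sketch; the frequency ranking below is only a stand-in
- 	(a real implementation would rank the 32k most common non-ASCII glyphs
- 	by frequency), and the encoding of the residue beyond 32k is not
- 	specified here, so it is omitted:
-
- 		#include <stdio.h>
-
- 		typedef unsigned short unichar_t;
-
- 		/* Placeholder frequency rank: simply (c - 128). */
- 		static long
- 		uni_to_rank(unichar_t c)
- 		{
- 			return (c >= 128) ? (long)(c - 128) : -1;
- 		}
-
- 		/* Write one character as a Type 2 rune sequence. */
- 		int
- 		rune2_put(unichar_t c, FILE *fp)
- 		{
- 			long rank;
-
- 			if (c < 128) {
- 				putc(c, fp);		/* US ASCII untouched */
- 				return 0;
- 			}
- 			rank = uni_to_rank(c);
- 			if (rank < 0 || rank >= 128 * 256)
- 				return -1;		/* beyond the 32k set */
- 			putc(0x80 | (rank >> 8), fp);	/* introducer 128..255 */
- 			putc(rank & 0xff, fp);		/* second byte of pair */
- 			return 0;
- 		}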
-
-
- [ For the obvious reason of file size no longer directly representing what
- has been, in the past, meaningful information, I personally dislike the
-   concept of Runic encoding, even though it will tend to affect only those
-   languages which are permissible in an "8-bit clean" internationalization
-   environment, thus not affecting me personally. An additional penalty is
-   in glyph substitution within documents. A single glyph substitution could
- potentially change the size of a file -- requiring a shift of the contents
- of the file on the storage medium to account for the additional space
- requirements of the substitute glyph. A final penalty is the input buffer
-   mechanisms not being able to specify a language-independent field length
- for data input. This is particularly nasty for GUI and screen based input
- such as that found in most commercial spreadsheets and databases. For
- these reasons, I can not advocate Runic encoding or the XPG3/XPG4 standards
- which appear to require it. ]
-
-
- 3) OPTIMIZATION: Glyph Count Attribute (in file system)
- BENEFITS: Non-direct beneficiary languages of the Runic
- Encoding, assuming use of Type 1 or Type 2
- Runic Encoding.
- COSTS: File system modification.
- File I/O modifications.
- IMPLEMENTATION:
-
- A Glyph Count Attribute kept as part of the information on a file
- would restore the ability to relate directly available information
- about a file with the character count of the text in the file.
- This is something that is normally lost with Runic encoding. There
- are not insignificant costs associated with this in terms of the
- required modifications to the File I/O system to perform "glyph
- counting". This is especially significant when dealing with the
- concept of glyph substitution in the file.
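-
- 	For illustration, a sketch of the sort of counting the file I/O system
- 	would have to do on write, assuming Type 1 Runic encoded text (the
- 	function is invented here, not taken from any existing file system):
-
- 		#include <stddef.h>
-
- 		/*
- 		 * Count glyphs in a buffer of Type 1 Runic encoded text: a
- 		 * NUL introducer and the two bytes following it make up one
- 		 * glyph; every other byte is one glyph by itself.
- 		 */
- 		unsigned long
- 		glyph_count(const unsigned char *buf, size_t len)
- 		{
- 			unsigned long n = 0;
- 			size_t i = 0;
-
- 			while (i < len) {
- 				i += (buf[i] == 0 && i + 2 < len) ? 3 : 1;
- 				n++;
- 			}
- 			return n;
- 		}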
-
-
- 4) OPTIMIZATION: Language Attribution (in file system)
- BENEFITS: All languages capable of existing in "8-bit clean"
- environments (all small glyph count languages).
- COSTS: File system modification.
- File I/O based translation (buffer modification
- processing time).
- Requirement of conversion to change to/from a
- 				multilingual storage format with non-intersecting
- "8-bit clean" sets (ie: Arabic and US ASCII).
- Conversion utilities.
- Changes to UNIX utilities to allow access to
- and manipulation of attributions.
- IMPLEMENTATION:
-
- The Language Attribution kept as part of the information on a file
- allows 8-bit storage of any language for which an "8-bit clean"
- character set exists/can be produced. Unicode buffers of 16-bit
- glyphs are converted on write to the "8-bit clean" character set
- glyph. This requires a 64k table to allow for direct index
- conversion. In practice, this can be a 16k table due to the
- lexical location of the small glyph count languages within the
- Unicode character set. The conversion on read requires a 512b
- 	table to allow direct index conversion of 256 8-bit values into
- the 256 corresponding Unicode 16-bit characters.
-
- [ This is clever, if I do say so myself ]
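-
- 	A minimal sketch of the write and read conversions, assuming the two
- 	tables have already been loaded according to the language attribution
- 	on the file (the names and sizes follow the description above; the
- 	fallback '?' is my own placeholder):
-
- 		typedef unsigned short unichar_t;
-
- 		/*
- 		 * Per-language tables: 16k bytes for the write direction
- 		 * (the small glyph count languages all sit in the low part
- 		 * of the Unicode lexical order), and 256 16-bit entries
- 		 * (512 bytes) for the read direction.
- 		 */
- 		static unsigned char	uni_to_local[16384];
- 		static unichar_t	local_to_uni[256];
-
- 		/* Convert one buffered Unicode character on write. */
- 		unsigned char
- 		uni2local(unichar_t c)
- 		{
- 			return (c < 16384) ? uni_to_local[c] : '?';
- 		}
-
- 		/* Convert one stored 8-bit character back on read. */
- 		unichar_t
- 		local2uni(unsigned char c)
- 		{
- 			return local_to_uni[c];
- 		}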
-
-
- 5) OPTIMIZATION: Sparse Character Sets For Language Localization
- BENEFITS: Reduced character set/graphic requirements.
- Continued use of non-graphic devices (depends
- on being used in concert with Language Attribution).
- Reduced memory requirements for fonts in graphical
- environments (like X).
- COSTS: Non-localized text files can not benefit.
- Device channel mapping for devices supporting less
- than the full Unicode character set.
- Translation tables and lookup time for devices
- supported using this mechanism.
-
- IMPLEMENTATION:
-
- 	[Pre-existing] Language-specific fonts for "8-bit clean" languages
- can be used, as can existing fonts for Unicode character sets
- for systems like X, which allow sparse font sets. Basically,
- since there is no need to display multilingual messages in a
- localized environment, there is no need to use fonts/devices
- which support an internationalized character set. For instance,
- using a DEC VT220, the full ISO-Latin-1 font is available for
- use. Thus for languages using only characters contained in the
- ISO-Latin-1 set, it is not necessary to supply other glyphs
- within the set as long as output mapping of Unicode to the device
- 	set is done (preferably in the tty driver). Similarly, JIS
- 	devices for Japanese I/O are not required to support, for instance,
- Finnish, Arabic, or French characters.
-
- [ This is also clever, in that it does not waste the existing investments
- in hardware. ]
-
-
-
- ADMITTED DRAWBACKS IN UNICODE:
-
- The fact that lexical order is not maintained for all existing character
- sets (NOTE: NO CURRENT OR PROPOSED STANDARD SUPPORTS THIS IDEA!) means that
- a direct arithmetic translation is not possible for, for instance, JIS to
- Unicode mappings; instead a table lookup is required on input and output.
- This is not a significant penalty anywhere but in languages which do not
- require multiple keystroke input on their respective input devices and
- which are not lexically adjacent in the Unicode set (ie: Turkish). The
- penalty is a table lookup on I/O rather than a direct arithmetic translation
- (an add or subtract depending on direction). NOTE THAT THIS IS NOT A PENALTY
- FOR JIS INPUT, SINCE MULTICHARACTER INPUT SEQUENCES REQUIRE A TABLE LOOKUP TO
- IMPLEMENT REGARDLESS OF THE STORAGE.
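-
- The difference amounts to something like the following sketch (the fixed
- offset and the table are invented purely for illustration):
-
- 	typedef unsigned short unichar_t;
-
- 	static unichar_t local_to_uni[256];	/* per-language lookup table */
-
- 	/* Lexically adjacent set: translation is a simple add or subtract. */
- 	unichar_t
- 	adjacent_to_uni(unsigned char c)
- 	{
- 		return (unichar_t)(c + 0x0100);	/* hypothetical fixed offset */
- 	}
-
- 	/* Non-adjacent set: translation requires a table lookup instead. */
- 	unichar_t
- 	lookup_to_uni(unsigned char c)
- 	{
- 		return local_to_uni[c];
- 	}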
-
- The fact that all character sets do not occur in their local lexical order
- means that a particular character can not be identified as to language by
- its ordinal value. This is a small penalty to pay for the vast reduction
- in storage requirements between a 32-bit and a 16-bit character set that
- contains all required glyphs. The fact that Japanese and Chinese characters
- can not be distinguished as to language by ordinal value is no worse than
- the fact that one can not distinguish an English 's' in the ISO-Latin-1 set
- from a French 's'. The significance of language attribution must be handled
- by the input (and potentially output) mechanisms in any case, and thus they
- must be locale specific. This is sufficient to provide information as to
- the language being output, since input and output devices are generally
- closely associated.
-
- ======================= ======================= =======================
- ======================= ======================= =======================
- ======================= ======================= =======================
-
-
- Terry Lambert
- terry@icarus.weber.edu
- terry_lambert@novell.com
- ---
- Any opinions in this posting are my own and not those of my present
- or previous employers.
- --
- -------------------------------------------------------------------------------
- "I have an 8 user poetic license" - me
- Get the 386bsd FAQ from agate.berkeley.edu:/pub/386BSD/386bsd-0.1/unofficial
- -------------------------------------------------------------------------------
-