- Newsgroups: comp.unix.bsd
- Path: sparky!uunet!gatech!usenet.ins.cwru.edu!agate!dog.ee.lbl.gov!hellgate.utah.edu!fcom.cc.utah.edu!cs.weber.edu!terry
- From: terry@cs.weber.edu (A Wizard of Earth C)
- Subject: Re: Dumb Americans (was INTERNATIONALIZATION: JAPAN, FAR EAST)
- Message-ID: <1993Jan1.094759.8021@fcom.cc.utah.edu>
- Keywords: Han Kanji Katakana Hirugana ISO10646 Unicode Codepages
- Sender: news@fcom.cc.utah.edu
- Organization: Weber State University (Ogden, UT)
- References: <1992Dec30.010216.2550@nobeltech.se> <1992Dec30.061759.8690@fcom.cc.utah.edu> <1ht8v4INNj7i@rodan.UU.NET>
- Date: Fri, 1 Jan 93 09:47:59 GMT
- Lines: 281
-
- In article <1ht8v4INNj7i@rodan.UU.NET> avg@rodan.UU.NET (Vadim Antonov) writes:
- >In article <1992Dec30.061759.8690@fcom.cc.utah.edu> terry@cs.weber.edu (A Wizard of Earth C) writes:
- >>The "ugly thing Unicode does with asiatic languages" is exactly what it
- >>does with all other languages: There is a single lexical assignment
- >>for each possible glyph.
- >>....
- >>ADMITTED DRAWBACKS IN UNICODE:
- >>
- >>The fact that lexical order is not maintained for all existing character
- >>sets (NOTE: NO CURRENT OR PROPOSED STANDARD SUPPORTS THIS IDEA!) means that
- >>a direct arithmetic translation is not possible for...
- >
- >It means that:
- >
- >1) "mechanistic" conversion between upper and lower case
- > is impossible (as well as case-insensitive comparisons)
- >
- > Example: Latin T -> t
- > Cyrillic T -> m
- > Greek T -> ?
- >
- > This property alone renders Unicode useless for any business
- > applications.
-
- This is simply untrue. Because a subtractive/additive conversion is
- impossible in *some* cases does not mean a *mechanistic* conversion is
- also impossible. In particular, a tabular conversion is an obvious
- approach which has already been used with success, with a minimal
- (multiply plus dereference) overhead.
-
- The lexical ordering of the Latin-1 character set is not in question;
- case conversion is done by an arithmetic offset of decimal 32.
-
- The Cyrillic characters within the Unicode standard (U+0400 -> U+04FF)
- are based on the ECMA registry under ISO-2375 for use with ISO-2022. It
- contains several Cyrillic subsets. The most recent and most widely
- accepted of these is ISO-8859-5. Unicode uses the same relative
- positions as in ISO-8859-5. Are you also averse to ISO-8859-5?
-
- There are a number of Cyrillic letters not defined in ISO-8859-5 (both
- historical and extended) which exist in the Unicode standard; it is true
- that the case conversion is not based on an offset of decimal 32 for the
- extended characters not covered by the 8859-5 standard; however, the
- historic letters (such as those used in Ukrainian and Belorussian) are
- dialectal in nature, and thus are regarded as a font change. Bearing this
- in mind, case conversion can be done in the context of the dialectal
- table used for local representation of the characters for device I/O
- using the decimal 32 offset *through the lookup table*. I fail to follow
- your T -> m conversion argument; could you please identify the letters
- in question with regard to their ordinal values in ISO-8859-5?
-
-
- The argument for case conversion within the Greek is equally flawed,
- unless you are also taking issue with the ISO-8859-7 character set,
- per ECMA registry ISO-2375 for use with ISO-2022. Taking issue with
- this particular standard would be difficult to support on your part,
- as the ISO-8859-7 standard is based on the Greek national standard
- ELOT-928 and also ECMA-118, the origin of which is Greece.
-
- Again, historical forms are not in a lexically correct order for decimal
- 32 conversion of case; however, these are also dialectical variants and
- the difficulties inherent in these variants are resolvable under the same
- mechanisms as those discussed for Cyrillic.
-
- As to business suitability, it is unlikely that one would use something
- like Polytonic (re: classical and Byzantine ancient Greek) for a business
- application.
-
-
- The main "disordering" of character sets is with regard to the Japanese
- JIS standard. The minutes of the 20 Apr 90 UNICODE meeting (as reported
- by Ken Whistler, Metaphor Computer Systems) justify this as follows:
-
- ] Han Unification Issues:
- ]
- ] The compromise WG2 position advocated Han unification, but it seemed
- ] to imply that the unified set would start off with codes in JIS order.
- ] There was some discussion of whether the compromise proposal really
- ] did or did not state (or imply) that. Then the group reviewed the
- ] Japanese objections to a Han unification that does not incorporate
- ] JIS ordering.
- ]
- ] The consensus was that a JIS-first ordering in a unified Han encoding
- ] is unacceptable for at least 3 reasons:
- ] 1. It is morally unacceptable to favor the Japanese standard
- ] this way in an international encoding, at the expense
- ] of the Chinese and Korean standards.
- ] 2. The proposal attempts to solve a technical problem (namely
- ] the actual work of unifying the characters) with a
- ] political solution.
- ] 3. Preservation of the JIS order, so as to attempt to
- ] encapsulate that as a default sort order, makes no
- ] sense outside of a JIS-oriented application. The
- ] Han unification should present a more generally
- ] recognizable default sort order (i.e. one which
- ] can also be used by the Chinese and the Koreans,
- ] and which applies to the characters beyond JIS 1 & 2).
- ]
- ] Examination of the cost/benefits of unified Han character encoding
- ] should lead to the following conclusions: If an application is
- ] Japanese only, then simply use JIS. If an application is truly
- ] multilingual, then a JIS-first encoding doesn't make particular
- ] sense. Hence, the Unicode consensus is that an alternative and
- ] universal ordering principle should be applied to the unified
- ] Han set. (The consensus is still that radical/stroke order, with
- ] or without level distinctions, is the right way to go.)
-
- Of these, some argument can be made against only the final paragraph,
- since it views internationalization as a tool for multinationalization
- rather than localization. I feel that a strong argument can be held
- out for internationalization as a means of providing fully data-driven
- localizations of software. As such, the argument of monolingual vs.
- multilingual is not supported. However, lexical sort order can be
- enforced in the access rather than the storage mechanism, making this
- a moot point.
-
-
- >2) there is no trivial way to sort anything.
- > An elementary sort program will require access to enormous
- > tables for all possible languages.
- >
- > English: A B C D E ... T ...
- > Russian: A .. B ... E ... C T ...
-
- I believe this is addressed adequately in the ISO standards; however,
- the lexical order argument is one of the sticking points against the
- Japanese acceptance of Unicode, and is a valid argument in that arena.
- The fact of the matter is that Unicode is not an information manipulation
- standard, but (for the purposes of its use in internationalization) a
- storage and an I/O standard. Viewed this way, the lexical ordering
- argument is inapplicable.
-
-
- >3) there is no reasonable way to do hyphenation.
- > Since there is no way to tell language from the text there
- > is no way to do any reasonable attempts to hyphenate.
- > [OX - which language this word is from]?
- >
- > Good-bye wordprocessors and formatters?
-
- By this, you are obviously not referring to ideographic languages, such as
- Han, since hyphenation is meaningless for such languages. Setting aside
- the argument that if you don't know how to hyphenate in a language, you
- have no business generating situations requiring hyphenation by virtue
- of the fact that you are basically illiterate in that language... ;-).
-
- Hyphenation as a process is language dependent, and, in particular,
- dependent on the rendering mechanism (rendering mechanisms are *not*
- the subject under discussion; storage mechanisms *are*). Bluntly
- speaking, why does one need word processing software at all if this
- type of thing is codified? Hyphenation, like sorting, is manipulation
- of the information in a native language specific way.
-
- Find another standard to tell you how to write a word processor.
-
-
- >4) "the similar glyphs" in Unicode are often SLIGHTLY different
- >    typographical glyphs -- everybody who ever dealt with international
- > publishing knows that fonts are designed as a WHOLE and every
- > letter is designed with all others in mind -- i.e. X in Cyrillic
- > is NOT the same X as Latin even if the fonts are variations of
- > the same style. I'd wish you to see how ugly the Russian
- >    texts printed on American desktop publishing systems with
- > "few characters added" are.
- >
- > In reality it means that Unicode is not a solution for
- > typesetting.
-
- No, you're right; neither is it a standard for the production of
- pipefittings or the design of urban transportation systems. Your
- complaint is one of the representation of multilingual text using the
- same characters (as a result of unification) in the same document.
-
- >Having unique glyphs works ONLY WITHIN a group of languages
- >which are based on variations of a single alphabet with
- >non-conflicting alphabetical ordering and sets of
- >vowels. You can do that for European languages.
- >An attempt to do it for different groups (like Cyrillic and Latin)
- >is disastrous at best -- we already tried it and finally came to
- >the encodings with two absolutely separate alphabets.
- >
- >I think that there are not many such groups, though, and it is possible
- >to identify several "meta-alphabets". The meta-alphabets have no
- >defined rules for cross-sorting (unlike letters WITHIN one
- >meta-alphabet; you CAN sort English and German words together
- >and it still will make sense; sorting Russian and English together
- >is at best useless). It increases the number of codes but not
- >as drastically as codifying languages; there are hundreds of
- >languages based on a dozen of meta-alphabets.
-
- Forgetting for the moment that you are worrying about the output mechanism
- for such a document before worrying about the input mechanism whereby such
- a document can be created, the Unicode 1.0 standard (in section 2.1)
- clearly makes a distinction between "Plain" and "Fancy" text:
-
- ] Plain and Fancy Text
- ]
- ] Plain text is a pure sequence of character codes; plain Unicode text
- ] is a sequence of Unicode character codes. Fancy text is any text
- ] representation consisting of plain text plus added information such
- ] as font size, color, and so on. For example, a multifont text as
- ] formatted by a desktop publishing system is fancy text.
-
- Clearly, then, the applications you are describing are *not* Unicode
- applications, but "Fancy text" applications which could potentially
- make use of Unicode for character storage.
- 
- This is, incidentally, the resolution of the Chinese/Japanese/Korean
- unification arguments.
-
- >>The fact that all character sets do not occur in their local lexical order
- >>means that a particular character can not be identified as to language by
- >>its ordinal value. This is a small penalty to pay for the vast reduction
- >>in storage requirements between a 32-bit and a 16-bit character set that
- >>contains all required glyphs.
- >
- >Not true. First of all, nothing forces one to use a 32-bit representation
- >where only 10 bits are necessary.
-
- This would be Runic encoding, right? I can post the Plan-9 and Metis
- mechanisms for doing this, if you want. Both are, in my opinion,
- vastly inferior to other available mechanisms. In particular, the
- requirement of using up to 6 characters to represent a single 31 bit
- value is particularly repulsive, especially for glyphs in excess of
- hex 04000000 (6 character encoding mandatory). Far eastern users
- already have the penalty of effectively half the disk space per glyph
- for storage of texts using raw (16 bit) Unicode. Admittedly, this
- has more to do with their use of pictographic rather than phonetic
- writing, but asking them to sacrifice yet more disk space for Western
- convenience is ludicrous.
- 
- Since the 386BSD file system works on byte boundaries, I can't believe
- we're suggesting direct 10-bit encoding of characters, right?
-
-
- >So, as you see the Unicode is more a problem than a solution.
- >The fundamental idea is simply wrong -- it is inadequate for
- >anything except for Latin-based languages. No wonder we're
- >hearing that Unicode is US-centric.
- >
- >Unfortunately Unicode looks like a cool solution for people who
- >never did any real localization work and i fear that this
- >particular mistake will be promoted as standard presenting
- >us a new round of headache. It does not remove necessity to
- >carry off-text information (like "X-Language: english") and
- >it makes it not better than existing ISO 8-bit encodings
- >(if i know the language i already know its alphabet --
- >all extra bits are simply wasted; and programs handling Unicode
- >text have to know the language for reasons stated before).
-
- I don't see many multinational applications or standards coming out
- of Zambia or elsewhere (to point out the fact that they have to come
- from somewhere, and the US is as good as any place else). The fact
- that much of Unicode is based on ISO standards, and ISO-10646 encompasses
- all of Unicode, means that there is more than US support and input on
- the standard.
-
- >UNICODE IS A *BIG* MISTAKE.
- >
- >(Don't get me wrong -- i'm for the universal encoding; it's
- >just that particular idea of unique glyphs that i strongly
- >oppose).
-
- I am willing to listen to arguments for any accepted or draft standards
- you care to put forward.
-
- Arguments *against* proposals are well and good, as long as the constructive
- criticism is accompanied by constructive suggestions.
-
-
- Terry Lambert
- terry@icarus.weber.edu
- terry_lambert@novell.com
- ---
- Any opinions in this posting are my own and not those of my present
- or previous employers.
- --
- -------------------------------------------------------------------------------
- "I have an 8 user poetic license" - me
- Get the 386bsd FAQ from agate.berkeley.edu:/pub/386BSD/386bsd-0.1/unofficial
- -------------------------------------------------------------------------------
-