NetNews Usenet Archive 1992 #31

Text File | 1993-01-01 | 8.6 KB | 211 lines

Path: sparky!uunet!zaphod.mps.ohio-state.edu!cs.utexas.edu!qt.cs.utexas.edu!yale.edu!ira.uka.de!smurf.sub.org!incom!kostis!blues!kosta
From: kosta@blues.kk.sub.org (Kosta Kostis)
Newsgroups: comp.std.internat
Subject: Re: Dumb Americans (was INTERNATIONALIZATION: JAPAN, FAR EAST)
Keywords: Han Kanji Katakana Hirugana ISO10646 Unicode Codepages
Message-ID: <DwqPwB3w165w@blues.kk.sub.org>
Date: Fri, 01 Jan 93 20:34:36 MET
References: <1hu9v5INNbp1@rodan.UU.NET>
Organization: The Blues Family
Lines: 199

avg@rodan.UU.NET (Vadim Antonov) writes:

> 1) "mechanistic" conversion between upper and lower case
>    is impossible (as well as case-insensitive comparisons)
>
>    Example: Latin    T -> t
>             Cyrillic T -> m
>             Greek    T -> ?

What do you consider a "mechanistic" case conversion? Unicode includes
all the characters defined in ISO 8859-x. Is case conversion a real
problem with these? We need more "clever" conversion routines.
Whenever you want to convert a character's case or want to sort
things, the language used is one of the parameters, not only the
character code.

> This property alone renders Unicode useless for any business
> applications.

You really have had a look at Unicode, haven't you? :-)
Unicode is limited because everything is squeezed into 16 bits, but
for Latin languages (and the like) it's just fine.

> 2) there is no trivial way to sort anything.
>    An elementary sort program will require access to enormous
>    tables for all possible languages.
>
>    English: A B C D E ... T ...
>    Russian: A .. B ... E ... C T ...

This is a problem regardless of the character set being used. Why do
you blame Unicode for that?

> 3) there is no reasonable way to do hyphenation.
>    Since there is no way to tell language from the text there
>    is no way to do any reasonable attempts to hyphenate.
>    [OX - which language this word is from]?
>
>    Good-bye wordprocessors and formatters?

No.
It's a big "hello" for *good* wordprocessors and formatters that
"know" about the language the words are written in, and they *need*
to know that to perform correct actions. This again holds regardless
of the character set being used. No reason to blame Unicode *here*.

> 4) "the similar glyphs" in Unicode are often SLIGHTLY different
>    typographical glyphs -- everybody who ever dealt with international
>    publishing knows that fonts are designed as a WHOLE and every
>    letter is designed with all others in mind -- i.e. X in Cyrillic
>    is NOT the same X as Latin even if the fonts are variations of
>    the same style. I'd wish you to see how ugly the Russian
>    texts printed on American desktop publishing systems with
>    "few characters added" are.

The Cyrillic "X" is *not* the same as the Latin 1 "X" in Unicode, as
you might know. Have you ever tried this with an "American DTP system"
that uses Unicode? :-) I don't think so...

> In reality it means that Unicode is not a solution for
> typesetting.

Come on. What did you expect? A universal character set is not the
"solution", it's just intended to help you develop it. Anyway, I doubt
there will ever be *a* solution for this field. Even with Unicode I
guess there will be many solutions covering just facets of the whole
"problem".

> Having unique glyphs works ONLY WITHIN a group of languages
> which are based on variations of a single alphabet with
> non-conflicting alphabetical ordering and sets of
> vowels. You can do that for European languages.

You can't. Maybe you should learn more about European languages. ;-)

> An attempt to do it for different groups (like Cyrillic and Latin)
> is disastrous at best -- we already tried it and finally came to
> the encodings with two absolutely separate alphabets.

That's what's done in ISO 8859-5 and what Unicode does. RTFM. :-)

> I think that there is no many such groups, though, and it is possible
> to identify several "meta-alphabets".
> The meta-alphabets have no
> defined rules for cross-sorting (unlike letters WITHIN one
> meta-alphabet; you CAN sort English and German words together
> and it still will make sense;

Why do you think it would make sense? It won't. Lexical sorting only
makes sense, if at all, within *one* single language. You will have to
sort data with the attribute "language" if you want to do it correctly
in any case, whether the language is implied or added explicitly.
A sorting program made for Russian text in e.g. ISO 8859-5 may do the
job well for exactly that, but it will create trash when run over
German text coded in ISO 8859-1. A "general" sorting program would be
so complex that it's *almost* impossible to do. At the very least it
will consume a great deal of memory and CPU time, and it will take
years (many years) to develop. I'd love to be wrong here.

> sorting Russian and English together
> is at best useless). It increases the number of codes but not
> as drastically as codifying languages; there are hundreds of
> languages based on a dozen of meta-alphabets.

Why should you want to sort data with mixed Russian and e.g. English
words anyway? If you do so using plain character sets instead of
clever algorithms, you are bound to fail hopelessly.

> >The fact that all character sets do not occur in their local lexical order
> >means that a particular character can not be identified as to language by
> >its ordinal value. This is a small penalty to pay for the vast reduction
> >in storage requirements between a 32-bit and a 16-bit character set that
> >contains all required glyphs.

(sorry, Vadim, I know you didn't write the above)

Who defines "required"? This is a classical "don't care about the
users' culture/needs" approach that was so harmful in the past.

> Not true. First of all nothing forces to use 32-bit representation
> where only 10 bits are necessary.

10 bits? You want to keep Asian languages out of the game, too?
:-) Transferring 10 bits will cost the same as 16 bits unless you do
some tricky encoding, which I don't think I want to have. Do you?
This assumes you organize/store character data in multiples of 8-bit
octets, which should be true for the vast majority of computers today
and in the future.

> So, as you see the Unicode is more a problem than a solution.

No, I think it's more like one step aside and one towards a solution.

> The fundamental idea is simply wrong -- it is inadequate for
> anything except for Latin-based languages. No wonder we're
> hearing that Unicode is US-centric.

I agree that Unicode is not very good for Asian languages, but for
European languages (and some more) it's really OK. Should we ever
decide to use the full 32 bits that ISO 10646 intended to allocate for
character codes, we should be able to cover almost all languages (to
some extent). Digital systems and numbering systems in general are
limited. You can get close to the "original", but you will never be
able to do more than an approximation. But that's another story. ;-)

> Unfortunately Unicode looks like a cool solution for people who
> never did any real localization work and i fear that this
> particular mistake will be promoted as standard presenting
> us a new round of headache. It does not remove necessity to
> carry off-text information (like "X-Language: english") and
> it makes it not better than existing ISO 8-bit encodings
> (if i know the language i already know its alphabet --

No, you don't. Look at the existing ISO 8-bit encodings, namely
ISO 8859-x, and you can see that many languages can be encoded in
several character sets. You will always have to include a language
tag if the language is not implied.

> all extra bits are simply wasted; and programs handling Unicode
> text have to know the language for reasons stated before).

Extra bits aren't wasted. 10 bits or 16 bits make no real difference.

> UNICODE IS A *BIG* MISTAKE.

This may be your opinion, but I don't agree.
It's still a better "mistake" than plain US ASCII, and it's better
than 8-bit encodings and/or character set switching. ;-)

> (Don't get me wrong -- i'm for the universal encoding; it's
> just that particular idea of unique glyphs that i strongly
> oppose).

I agree. Unique glyphs are important, and this is not done for
Japanese, Chinese and Korean(?). Well, at least not for those... :-)
The other languages you stated, like Russian, Greek and English, seem
to be served well by Unicode, I think.

As far as I can see, no universal character set will ever *solve* the
problems arising from sorting, case conversion, hyphenation and many
more, so we shouldn't expect that.

> --vadim

Kosta

--
  Kosta Kostis, Talstrasse 25, D-6074 Roedermark 3, Germany
  kosta@blues.kk.sub.org (home)
  sw authors: please support ISO 8859-x! äöüÄÖÜß = aeoeueAEOEUEss
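
The point argued in the post, that the language is a parameter of case
conversion and sorting and not only the character code, can be
sketched in a few lines of Python. This is only an illustration under
stated assumptions, not anything from the original discussion: the
names to_lower, sort_key, LOWER_OVERRIDES, FOLD and ALPHABET are
hypothetical, the tables are tiny hand-made samples, and the Turkish
and Swedish rules are brought in merely as well-known cases where the
language changes the result.

```python
# Hypothetical sketch: the language tag is an explicit parameter.
# The tables below are small illustrative samples, not complete data.

# Per-language exceptions to the default code-point case mapping.
LOWER_OVERRIDES = {
    "tr": {"I": "ı", "İ": "i"},  # Turkish: I -> dotless ı, İ -> i
}

# Letters folded to others before ranking (German: umlauts sort like
# their base letters, ß like ss).
FOLD = {
    "de": {"ä": "a", "ö": "o", "ü": "u", "ß": "ss"},
}

# Alphabet order per language (Swedish: å, ä, ö follow z).
LATIN = "abcdefghijklmnopqrstuvwxyz"
ALPHABET = {
    "de": LATIN,
    "sv": LATIN + "åäö",
}


def to_lower(text, lang):
    """Lower-case text, applying the language's exceptions first."""
    overrides = LOWER_OVERRIDES.get(lang, {})
    return "".join(overrides.get(ch, ch.lower()) for ch in text)


def sort_key(word, lang):
    """Return a list of letter ranks for word in the given language."""
    fold = FOLD.get(lang, {})
    alphabet = ALPHABET.get(lang, LATIN)
    folded = "".join(fold.get(ch, ch) for ch in word.lower())
    # Letters outside the alphabet are ranked after all known letters.
    return [
        alphabet.index(ch) if ch in alphabet else len(alphabet) + ord(ch)
        for ch in folded
    ]


# Separately encoded scripts convert by code point without trouble.
print(to_lower("T", "en"), to_lower("Т", "ru"), to_lower("Τ", "el"))

# The same code point gives different results in different languages.
print(to_lower("I", "en"), to_lower("I", "tr"))

# The same words sort differently depending on the language.
words = ["Zebra", "Ärger", "Apfel"]
print(sorted(words, key=lambda w: sort_key(w, "de")))
print(sorted(words, key=lambda w: sort_key(w, "sv")))
```

Nothing in the sketch depends on how the characters are encoded; the
per-language tables would be needed with ISO 8859-x just as much as
with Unicode, which is the argument made in the post.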