- Path: sparky!uunet!zaphod.mps.ohio-state.edu!cs.utexas.edu!qt.cs.utexas.edu!yale.edu!ira.uka.de!smurf.sub.org!incom!kostis!blues!kosta
- From: kosta@blues.kk.sub.org (Kosta Kostis)
- Newsgroups: comp.std.internat
- Subject: Re: Dumb Americans (was INTERNATIONALIZATION: JAPAN, FAR EAST)
- Keywords: Han Kanji Katakana Hirugana ISO10646 Unicode Codepages
- Message-ID: <DwqPwB3w165w@blues.kk.sub.org>
- Date: Fri, 01 Jan 93 20:34:36 MET
- References: <1hu9v5INNbp1@rodan.UU.NET>
- Organization: The Blues Family
- Lines: 199
-
- avg@rodan.UU.NET (Vadim Antonov) writes:
-
- > 1) "mechanistic" conversion between upper and lower case
- > is impossible (as well as case-insensitive comparisons)
- >
- > Example: Latin T -> t
- > Cyrillic T -> m
- > Greek T -> ?
-
- What do you consider a "mechanistic" case conversion?
-
- UniCode includes all the characters defined in ISO 8859-x.
- Is case conversion a real problem with these?
- We need more "clever" conversion routines.
-
- Whenever you want to convert a character's case or sort things, the
- language used is one of the parameters, not just the character code.
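A minimal sketch of such a "clever" routine, in Python (the table and the function name are made up for illustration; the Turkish dotted/dotless "i" is the classic case where a plain code-for-code mapping goes wrong):

```python
# Sketch: case conversion that takes the language as a parameter,
# not just the character code. In Turkish, dotted "i" uppercases to
# dotted "İ" (U+0130), not to plain "I". The table is illustrative,
# not from any real library.

SPECIAL_UPPER = {
    "tr": {"i": "\u0130",    # LATIN CAPITAL LETTER I WITH DOT ABOVE
           "\u0131": "I"},   # dotless i uppercases to plain I
}

def to_upper(text, lang):
    special = SPECIAL_UPPER.get(lang, {})
    return "".join(special.get(ch, ch.upper()) for ch in text)

print(to_upper("istanbul", "en"))   # ISTANBUL
print(to_upper("istanbul", "tr"))   # İSTANBUL
```

The same string uppercases differently depending on the language tag, which is exactly why the character code alone is not enough.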
-
- > This property alone renders Unicode useless for any business
- > applications.
-
- You did take a real look at UniCode, didn't you? :-)
-
- UniCode is limited because everything is squeezed into 16 bits, but
- for Latin-based languages (and the like) it's just fine.
-
- > 2) there is no trivial way to sort anything.
- > An elementary sort program will require access to enormous
- > tables for all possible languages.
- >
- > English: A B C D E ... T ...
- > Russian: A .. B ... E ... C T ...
-
- This is a problem regardless of the character set being used.
- Why do you blame UniCode for that?
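To make the point concrete, here is a toy Python sketch (the collation tables are simplified stand-ins, not real locale data):

```python
# Sketch: the same strings sort differently depending on the language.
# In German (DIN 5007 style) "ä" collates like "a"; in Swedish it is a
# separate letter that sorts after "z". The tables below are simplified,
# illustrative stand-ins for real collation data.

SV_AFTER_Z = {"å": "{", "ä": "|", "ö": "}"}   # ASCII codes just past "z"

def sort_key(word, lang):
    if lang == "de":
        # umlauts collate like their base letters
        return word.replace("ä", "a").replace("ö", "o").replace("ü", "u")
    if lang == "sv":
        # å, ä, ö are separate letters at the end of the alphabet
        return "".join(SV_AFTER_Z.get(ch, ch) for ch in word)
    return word

words = ["zebra", "äpple"]
print(sorted(words, key=lambda w: sort_key(w, "de")))   # ['äpple', 'zebra']
print(sorted(words, key=lambda w: sort_key(w, "sv")))   # ['zebra', 'äpple']
```

Identical character codes, two different orders -- the language, not the character set, decides.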
-
- > 3) there is no reasonable way to do hyphenation.
- > Since there is no way to tell language from the text there
- > is no way to do any reasonable attempts to hyphenate.
- > [OX - which language this word is from]?
- >
- > Good-bye wordprocessors and formatters?
-
- No. It's a big "hello" for *good* wordprocessors and formatters that
- "know" which language the words are written in; they *need* to know
- that to perform correct actions. This again holds regardless of the
- character set being used. No reason to blame UniCode *here*.
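A toy sketch of language-aware hyphenation in Python (real formatters use per-language pattern sets, TeX-style; the tiny exception dictionary here is purely illustrative):

```python
# Sketch: hyphenation dispatched on language. The German example shows
# why the language matters: "Urinstinkt" (primal instinct) must break
# as "ur-instinkt", never "urin-stinkt". Break points and the lookup
# table are illustrative only.

HYPHENATION = {
    ("en", "formatter"):  "for-mat-ter",
    ("de", "urinstinkt"): "ur-instinkt",
}

def hyphenate(word, lang):
    # fall back to no break points when the language/word is unknown
    return HYPHENATION.get((lang, word.lower()), word)

print(hyphenate("Urinstinkt", "de"))
print(hyphenate("formatter", "en"))
```

Without the language attribute, the formatter cannot even pick which table to consult.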
-
- > 4) "the similar glyphs" in Unicode are often SLIGHTLY different
- > typographical glyphs -- everybody who ever dealt with international
- > publishing knows that fonts are designed as a WHOLE and every
- > letter is designed with all others in mind -- i.e. X in Cyrillic
- > is NOT the same X as Latin even if the fonts are variations of
- > the same style. I'd wish you to see how ugly the Russian
- > texts printed on American desktop publishing systems with
- > "few characters added" are.
-
- The Cyrillic "X" is *not* the same as the Latin 1 "X" in UniCode,
- as you might know. Have you ever tried this with an "American DTP
- system" that uses UniCode? :-) I don't think so...
-
- > In reality it means that Unicode is not a solution for
- > typesetting.
-
- Come on. What did you expect? A universal character set is not the
- "solution", it's just intended to help you develop it.
-
- Anyway, I doubt there will ever be *a* solution for this field.
- Even with UniCode, I guess there will be many solutions covering
- just facets of the whole "problem".
-
- > Having unique glyphs works ONLY WITHIN a group of languages
- > which are based on variations of a single alphabet with
- > non-conflicting alphabetical ordering and sets of
- > vowels. You can do that for European languages.
-
- You can't. Maybe you should learn more about European languages. ;-)
-
- > An attempt to do it for different groups (like Cyrillic and Latin)
- > is disastrous at best -- we already tried it and finally came to
- > the encodings with two absolutely separate alphabets.
-
- That's what's done in ISO 8859-5 and what UniCode does. RTFM. :-)
-
- > I think that there are not many such groups, though, and it is possible
- > to identify several "meta-alphabets". The meta-alphabets have no
- > defined rules for cross-sorting (unlike letters WITHIN one
- > meta-alphabet; you CAN sort English and German words together
- > and it still will make sense;
-
- Why do you think it would make sense?
-
- It won't make sense. Lexical sorting only makes sense, if at all,
- within *one* single language. You will have to sort data with the
- attribute "language", either implied or added explicitly, if you
- want to do it correctly in every case. A sorting program made for
- Russian text in e.g. ISO 8859-5 may do that job well, but it will
- create trash when run over German text coded in ISO 8859-1. A
- "general" sorting program would be so complex that it's *almost*
- impossible to do. It will at least consume a great deal of memory
- and CPU, and it will take years (many years) to develop. I'd love
- to be wrong here.
-
- > sorting Russian and English together
- > is at best useless). It increases the number of codes but not
- > as drastically as codifying languages; there are hundreds of
- > languages based on a dozen of meta-alphabets.
-
- Why would you want to sort data with mixed Russian and e.g. English
- words anyway? If you do so using plain character sets instead of
- clever algorithms, you must fail hopelessly.
-
- > >The fact that all character sets do not occur in their local lexical order
- > >means that a particular character can not be identified as to language by
- > >its ordinal value. This is a small penalty to pay for the vast reduction
- > >in storage requirements between a 32-bit and a 16-bit character set that
- > >contains all required glyphs.
-
- (sorry, Vadim, I know you didn't write the above)
-
- Who defines "required"? This is a classical "don't care about the
- user's culture/needs" approach that was so harmful in the past.
-
- > Not true. First of all nothing forces to use 32-bit representation
- > where only 10 bits are necessary.
-
- 10 bits? You want to keep Asian languages out of the game, too? :-)
-
- Transferring 10 bits will cost the same as 16 bits unless you do
- some tricky encoding, which I don't think I want to have -- do you?
-
- This assumes you organize/store character data in multiple 8-bit
- octets which should be true for the vast majority of computers today
- and in the future.
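A small Python sketch of the arithmetic (the packing routine is illustrative; nothing here is a real transfer encoding):

```python
# Sketch: why "10-bit characters" buy little in an octet world.
# Packing 10-bit codes tightly across byte boundaries is exactly the
# kind of "tricky encoding" mentioned above; storing them the plain
# way costs two octets each, i.e. the same as 16-bit codes.

def pack10(codes):
    """Pack 10-bit values tightly into bytes (illustrative only)."""
    bits = "".join(format(c, "010b") for c in codes)
    bits += "0" * (-len(bits) % 8)            # pad to a byte boundary
    return bytes(int(bits[i:i+8], 2) for i in range(0, len(bits), 8))

codes = [0x3FF, 0x001, 0x155, 0x2AA]
print(len(pack10(codes)))    # 5 octets tightly packed...
print(len(codes) * 2)        # ...vs 8 octets stored as 16-bit units
```

You save three octets out of eight, at the price of codes straddling byte boundaries -- every read and write now needs bit-shifting.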
-
- > So, as you see the Unicode is more a problem than a solution.
-
- No, I think it's more like one step aside and one towards a solution.
-
- > The fundamental idea is simply wrong -- it is inadequate for
- > anything except for Latin-based languages. No wonder we're
- > hearing that Unicode is US-centric.
-
- I agree that UniCode is not very good for Asian languages, but
- for European languages (and some more) it's really OK. Should
- we ever decide to use the full 32 bits ISO 10646 intended to
- allocate for character codes, we should be able to cover almost
- all languages (to some extent).
-
- Digital systems and numbering systems in general are limited.
- You can get close to the "original", but you will never be able
- to do more than an approximation, but that's another story. ;-)
-
- > Unfortunately Unicode looks like a cool solution for people who
- > never did any real localization work and i fear that this
- > particular mistake will be promoted as standard presenting
- > us a new round of headache. It does not remove necessity to
- > carry off-text information (like "X-Language: english") and
- > it makes it not better than existing ISO 8-bit encodings
- > (if i know the language i already know its alphabet --
-
- No, you don't. Look at the existing ISO 8-bit encodings, namely
- ISO 8859-x, and you will see that many languages can be encoded in
- several character sets. You will always have to include a language
- tag if the language is not implied.
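A small Python illustration of that many-to-many relationship (the table is a made-up excerpt, not complete coverage data):

```python
# Sketch: the character set alone does not identify the language, so
# a language tag has to travel with the text. Several ISO 8859 parts
# can carry the same language, and one part carries many languages.

CHARSET_LANGUAGES = {
    "ISO-8859-1": {"English", "German", "French", "Italian"},
    "ISO-8859-2": {"Polish", "Czech", "Hungarian", "German"},
    "ISO-8859-5": {"Russian", "Bulgarian", "Serbian"},
}

# German text may arrive in more than one character set ...
carriers = [cs for cs, langs in CHARSET_LANGUAGES.items()
            if "German" in langs]
print(carriers)

# ... and knowing the character set still leaves several candidates.
print(sorted(CHARSET_LANGUAGES["ISO-8859-1"]))
```

Neither direction of the mapping is unique, so neither the charset nor the code points can stand in for an explicit "X-Language:" tag.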
-
- > all extra bits are simply wasted; and programs handling Unicode
- > text have to know the language for reasons stated before).
-
- Extra bits aren't wasted. Whether 10-bit or 16-bit makes no real difference.
-
- > UNICODE IS A *BIG* MISTAKE.
-
- This may be your opinion, but I don't agree. It's still a better
- "mistake" than plain US ASCII and it's better than 8-bit encodings
- and/or character set switching. ;-)
-
- > (Don't get me wrong -- i'm for the universal encoding; it's
- > just that particular idea of unique glyphs that i strongly
- > oppose).
-
- I agree. Unique glyphs are important, and this is not done for
- Japanese, Chinese and Korean(?). Well, at least not for those... :-)
-
- The other languages you stated, like Russian, Greek and English
- seem to be served well by UniCode, I think.
-
- As far as I can see, no universal character set will ever *solve*
- the problems arising from sorting, case conversion, hyphenation
- and many more, so we shouldn't expect that of it.
-
- > --vadim
-
- Kosta
-
-
- --
- Kosta Kostis, Talstrasse 25, D-6074 Roedermark 3, Germany
- kosta@blues.kk.sub.org (home)
- sw authors: please support ISO 8859-x! äöüÄÖÜß = aeoeueAEOEUEss
-