- Path: sparky!uunet!not-for-mail
- From: avg@rodan.UU.NET (Vadim Antonov)
- Newsgroups: comp.std.internat
- Subject: Re: Dumb Americans (was INTERNATIONALIZATION: JAPAN, FAR EAST)
- Date: 1 Jan 1993 19:47:18 -0500
- Organization: UUNET Technologies Inc, Falls Church, VA
- Lines: 203
- Message-ID: <1i2ommINN5uh@rodan.UU.NET>
- References: <1hu9v5INNbp1@rodan.UU.NET> <DwqPwB3w165w@blues.kk.sub.org>
- NNTP-Posting-Host: rodan.uu.net
- Keywords: Han Kanji Katakana Hirugana ISO10646 Unicode Codepages
-
- In article <DwqPwB3w165w@blues.kk.sub.org> kosta@blues.kk.sub.org (Kosta Kostis) writes:
- >What do you consider a "mechanistic" case conversion?
-
- The one which does not ask the user which language he means every time
- he runs more -i.
-
- >UniCode includes all the characters defined in ISO 8859-x.
- >Is case conversion a real problem with these?
-
- It is not a problem within each separate code (well, there are some
- irregularities, but it's still algorithmic).
-
- >We need more "clever" conversion routines.
-
- Unfortunately there is a logical gap. I don't care WHICH algorithm
- is used as long as it is an ALGORITHM. There is no way to convert Unicode
- strings to uppercase without "external" information.
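-
- To make the "external information" point concrete, here is a minimal
- Python sketch using a case that is not taken from this thread: Turkish
- capitalizes "i" as the dotted "İ" (U+0130), while most other Latin-script
- languages capitalize it as "I". The function name `to_upper` and its `lang`
- parameter are hypothetical, chosen only for illustration; the sketch covers
- this single rule and is not a complete case-mapping implementation.

```python
def to_upper(text, lang=None):
    """Uppercase `text`; `lang` is the external information in question."""
    if lang == "tr":
        # Turkish rule: small dotted i maps to capital dotted I (U+0130).
        # This has to happen before the generic mapping, which would
        # otherwise turn "i" into a plain "I".
        text = text.replace("i", "\u0130")
    # Generic, language-independent mapping for everything else.
    return text.upper()


default_result = to_upper("istanbul")
turkish_result = to_upper("istanbul", lang="tr")
```

- The same code points give two different results, and nothing in the string
- itself says which one is wanted.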
-
- >Whenever you want to convert a characters case or want to sort things,
- >the language used is one of the parameters, not only the character code.
-
- Exactly. Then, if you already know the language, why on Earth
- do you need to waste bits on Unicode?
-
- >> This property alone renders Unicode useless for any business
- >> applications.
- >
- >You really had a look at UniCode, hadn't you? :-)
-
- Sure.
-
- >UniCode is limited because everything is squeezed into 16-bit, but
- >for latin languages (and the like) it's just fine.
-
- Nope. It was already discussed here.
-
- >> 2) there is no trivial way to sort anything.
- >> An elementary sort program will require access to enormous
- >> tables for all possible languages.
- >>
- >> English: A B C D E ... T ...
- >> Russian: A .. B ... E ... C T ...
- >
- >This is a problem regardless of the character set being used.
- >Why do you blame UniCode for that?
-
- Regardless???? ASCII lets you sort English; KOI-8 lets you
- sort both Russian and English. With Unicode i can't do it
- if i don't know the language of the text. What about AI in sort?
-
- >The Cyrillic "X" is *not* the same as the Latin 1 "X" in UniCode,
- >as you might know. Have you ever tried this with an "american DTP system"
- >that uses UniCode? :-) I don't think so...
-
- No, i haven't tried, and that is one more reason to be careful with Unicode --
- it hasn't been tested in real life.
-
- >> Having unique glyphs works ONLY WITHIN a group of languages
- >> which are based on variations of a single alphabet with
- >> non-conflicting alphabetical ordering and sets of
- >> vowels. You can do that for European languages.
- >
- >You can't. Maybe you should learn more about European languages. ;-)
-
- Already discussed. I sure don't know everything but i know that
- you can make a minimal strictly ordered set from a union of
- strictly ordered sets by merging similar elements.
-
- >> An attempt to do it for different groups (like Cyrillic and Latin)
- >> is disastrous at best -- we already tried it and finally came to
- >> the encodings with two absolutely separate alphabets.
- >
- >That's what's done in ISO 8859-5 and what UniCode does. RTFM. :-)
-
- ISO 8859-5 is dead. Nobody uses it in Russia, FYI. And we've got
- the same problem with cyrillic-based languages as with latin-based
- languages in Europe; in my native Northern Caucasus there are about
- 300 different languages, most of them have writing based on cyrillic.
-
- >> I think that there are not many such groups, though, and it is possible
- >> to identify several "meta-alphabets". The meta-alphabets have no
- >> defined rules for cross-sorting (unlike letters WITHIN one
- >> meta-alphabet; you CAN sort English and German words together
- >> and it still will make sense;
- >
- >Why do you think it would make sense?
-
- It's easy, Watson. Names, for example. Or (a case from my practice)
- -- there are a lot of commercial enterprises in Moscow with English
- names (moda, i guess). So i had to sort the list.
- The KOI-8 sorting produced a list containing all Russian names
- in alphabetical order and then all English names in alphabetical order,
- which was exactly what was desired.
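-
- A minimal Python sketch of that kind of sort, under stated assumptions:
- the text is KOI8-R, and ordering comes from one fixed 256-entry weight
- table rather than from raw byte values (raw KOI8-R byte order does not
- follow the Russian alphabet). The names `WEIGHT` and `koi8_key` and the
- sample company names are made up for illustration. Note that no language
- parameter appears anywhere.

```python
RUSSIAN = "абвгдеёжзийклмнопрстуфхцчшщъыьэюя"
ENGLISH = "abcdefghijklmnopqrstuvwxyz"

# One weight per KOI8-R byte. Non-letters get weight 0; Russian letters
# rank before English ones, and both cases of a letter share a weight.
WEIGHT = [0] * 256
for rank, ch in enumerate(RUSSIAN + ENGLISH, start=1):
    for form in (ch, ch.upper()):
        WEIGHT[form.encode("koi8_r")[0]] = rank


def koi8_key(raw: bytes) -> bytes:
    """Map a KOI8-R byte string to a byte string of collation weights."""
    return bytes(WEIGHT[b] for b in raw)


names = [s.encode("koi8_r") for s in ["Orbita", "Весна", "Alpha", "Арбат"]]
ordered = [b.decode("koi8_r") for b in sorted(names, key=koi8_key)]
# ordered == ["Арбат", "Весна", "Alpha", "Orbita"]
```

- The whole table fits in 256 bytes because the encoding carries only two
- alphabets, and each byte value belongs to exactly one of them.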
-
- >It won't make sense. Lexical sorting makes only sense, if at all, in
- >*one* single language.
-
- See before.
-
- >You will have to sort data with the attribute
- >"language" if you want to do it correctly in any case, either implied
- >or added explicitly. A sorting program made for Russian text in e. g.
- >ISO 8859-5 may do the job well for exactly that, it will create trash
- >when run over German text coded in ISO 8859-1.
-
- "Sorting" is not only as in sort, it is also in [a-z] in grep or
- in the screen editor search; it is in awk, perl and shell globbing.
- Want to modify all those languages to tell the language every time?
- Don't be ridiculous.
-
- >A "general" sorting
- >program would be so complex that it's *almost* impossible to do.
- >It will at least be very much memory and CPU consuming and it will
- >take years (many years) to develop it. I'd love to be wrong here.
-
- Your wish is granted. See my previous postings with discussion of
- techniques (composite letters and equivalence classes).
-
- >> sorting Russian and English together
- >> is at best useless). It increases the number of codes but not
- >> as drastically as codifying languages; there are hundreds of
- >> languages based on a dozen of meta-alphabets.
- >
- >Why should you want to sort data with mixed Russian and e. g. English
- >words anyway? If you do so using plain character sets instead of clever
- >algorithms, you must fail hopelessly.
-
- See before. I have files with both Russian and English names in my
- directory, btw, and ls produces exactly what i want.
-
- >> Not true. First of all nothing forces to use 32-bit representation
- >> where only 10 bits are necessary.
- >
- >10 bits? You want to keep Asian languages out of the game, too? :-)
-
- It was nothing more than a figure of speech (i.e. 32 bits aren't always
- necessary). We're in agreement here :-)
-
- >> The fundamental idea is simply wrong -- it is inadequate for
- >> anything except for Latin-based languages. No wonder we're
- >> hearing that Unicode is US-centric.
- >
- >I agree that UniCode is not very good for Asian languages, but
- >for European languages (and some more) it's really OK. Should
- >we ever decide to use the full 32-bits ISO 10646 intended to
- >allocate for character codes, we should be able to cover almost
- >all languages (to some extent).
-
- It is also inadequate for Cyrillic-based languages (Slavic and
- others; not all Slavic languages use Cyrillic and most
- Cyrillic-based languages aren't Slavic!)
-
- >> Unfortunately Unicode looks like a cool solution for people who
- >> never did any real localization work and i fear that this
- >> particular mistake will be promoted as standard presenting
- >> us a new round of headache. It does not remove necessity to
- >> carry off-text information (like "X-Language: english") and
- >> it makes it not better than existing ISO 8-bit encodings
- >> (if i know the language i already know its alphabet --
- >
- >No, you don't. Look at the existing ISO 8-bit encodings, namely
- >ISO 8859-x and you can see that many languages can be encoded in
- >several character sets. You will always have to include a language
- >tag if the language is not implied.
-
- I never said that i like ISO 8859-x encodings; quite the opposite.
-
- However, there are multilingual encodings which do not require
- explicit language specifications for sorting and case conversion.
- KOI-8 (English and Russian) is an example.
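-
- For case conversion, a minimal Python sketch, assuming the KOI8-R
- variant of KOI-8. In KOI8-R the lowercase Cyrillic letters occupy bytes
- 0xC0-0xDF and the uppercase ones 0xE0-0xFF, so the two cases differ by
- 0x20, just as in ASCII (though in the opposite direction); the one
- irregular pair is ё/Ё at 0xA3/0xB3. The function name `koi8_upper` is
- hypothetical.

```python
def koi8_upper(raw: bytes) -> bytes:
    """Uppercase KOI8-R text byte by byte, with no language information."""
    out = bytearray()
    for b in raw:
        if 0x61 <= b <= 0x7A:      # ASCII a-z: uppercase is 0x20 lower
            b -= 0x20
        elif 0xC0 <= b <= 0xDF:    # Cyrillic lowercase: uppercase is 0x20 higher
            b += 0x20
        elif b == 0xA3:            # the irregular pair: ё -> Ё
            b = 0xB3
        out.append(b)
    return bytes(out)


sample = "moda мода ёж".encode("koi8_r")
upper_sample = koi8_upper(sample).decode("koi8_r")
# upper_sample == "MODA МОДА ЁЖ"
```

- Each byte value belongs to exactly one alphabet, so the rule for it is
- fixed once and never depends on who wrote the text.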
-
- >> all extra bits are simply wasted; and programs handling Unicode
- >> text have to know the language for reasons stated before).
- >
- >Extra bits aren't wasted. 10-bit or 16-bit make no real difference.
-
- You missed the argument; apparently i failed to explain it. See my
- other postings.
-
-
- >> UNICODE IS A *BIG* MISTAKE.
- >
- >This may be your opinion, but I don't agree. It's still a better
- >"mistake" than plain US ASCII and it's better than 8-bit encodings
- >and/or character set switching. ;-)
-
- I simply know the hole it'll sink into. We already saw it with
- several Russian-English encodings.
-
- >The other languages you stated, like Russian, Greek and English
- >seem to be served well by UniCode, I think.
-
- I can say for sure about Russian (since it's my native language and
- i'm quite experienced in localization issues) that it is out of the
- question that Unicode will ever be used inside Russia.
-
- >As much as I can see no universal character set will ever *solve*
- >the problems arising from sorting, case conversion, hyphenation
- >and many more, thus we shouldn't expect that.
-
- There are some sensible solutions; Unicode isn't one of them.
-
- --vadim
-