NetNews Usenet Archive 1992 #31

Internet Message Format | 1993-01-01 | 8.5 KB

Path: sparky!uunet!not-for-mail
From: avg@rodan.UU.NET (Vadim Antonov)
Newsgroups: comp.std.internat
Subject: Re: Dumb Americans (was INTERNATIONALIZATION: JAPAN, FAR EAST)
Date: 1 Jan 1993 19:47:18 -0500
Organization: UUNET Technologies Inc, Falls Church, VA
Lines: 203
Message-ID: <1i2ommINN5uh@rodan.UU.NET>
References: <1hu9v5INNbp1@rodan.UU.NET> <DwqPwB3w165w@blues.kk.sub.org>
NNTP-Posting-Host: rodan.uu.net
Keywords: Han Kanji Katakana Hirugana ISO10646 Unicode Codepages

In article <DwqPwB3w165w@blues.kk.sub.org> kosta@blues.kk.sub.org (Kosta Kostis) writes:
>What do you consider a "mechanistic" case conversion?

The one which does not ask the user which language he means every time
he runs more -i.

>UniCode includes all the characters defined in ISO 8859-x.
>Is case conversion a real problem with these?

It is not, within each separate code (well, there are some irregularities
but it's still algorithmic).

>We need more "clever" conversion routines.

Unfortunately there is a logical gap. I don't care WHICH algorithm is
used as long as it is an ALGORITHM. There is no way to convert Unicode
strings to uppercase without "external" information.

>Whenever you want to convert a character's case or want to sort things,
>the language used is one of the parameters, not only the character code.

Exactly. Then, if you already know the language, why on Earth do you
need to waste bits on Unicode?

>> This property alone renders Unicode useless for any business
>> applications.
>
>You really had a look at UniCode, hadn't you? :-)

Sure.

>UniCode is limited because everything is squeezed into 16-bit, but
>for latin languages (and the like) it's just fine.

Nope. It was already discussed here.

>> 2) there is no trivial way to sort anything.
>> An elementary sort program will require access to enormous
>> tables for all possible languages.
>>
>> English: A B C D E ... T ...
>> Russian: A .. B ... E ... C T ...
>
>This is a problem regardless of the character set being used.
>Why do you blame UniCode for that?

Regardless???? ASCII allows sorting English; KOI-8 allows sorting both
Russian and English. With Unicode I can't do it if I don't know the
language of the text. What about AI in sort?

>The Cyrillic "X" is *not* the same as the Latin 1 "X" in UniCode,
>as you might know. Have you ever tried this with an "american DTP system"
>that uses UniCode? :-) I don't think so...

No, I haven't tried, and that is one more reason to be careful with
Unicode -- it hasn't been tested in real life.

>> Having unique glyphs works ONLY WITHIN a group of languages
>> which are based on variations of a single alphabet with
>> non-conflicting alphabetical ordering and sets of
>> vowels. You can do that for European languages.
>
>You can't. Maybe you should learn more about European languages. ;-)

Already discussed. I sure don't know everything, but I know that you
can make a minimal strictly ordered set from a union of strictly
ordered sets by merging similar elements.

>> An attempt to do it for different groups (like Cyrillic and Latin)
>> is disastrous at best -- we already tried it and finally came to
>> the encodings with two absolutely separate alphabets.
>
>That's what's done in ISO 8859-5 and what UniCode does. RTFM. :-)

ISO 8859-5 is dead. Nobody uses it in Russia, FYI. And we've got the
same problem with Cyrillic-based languages as with Latin-based languages
in Europe; in my native Northern Caucasus there are about 300 different
languages, most of which have writing based on Cyrillic.

>> I think that there are not many such groups, though, and it is possible
>> to identify several "meta-alphabets". The meta-alphabets have no
>> defined rules for cross-sorting (unlike letters WITHIN one
>> meta-alphabet; you CAN sort English and German words together
>> and it still will make sense;
>
>Why do you think it would make sense?

It's easy, Watson. Names, for example.
Or (a case from my practice) -- there are a lot of commercial enterprises
in Moscow with English names (the fashion, I guess). Then I had to sort
the list. The KOI-8 sort produced a list containing all the Russian names
in alphabetical order and then all the English ones in alphabetical
order, which was exactly what was desired.

>It won't make sense. Lexical sorting makes only sense, if at all, in
>*one* single language.

See before.

>You will have to sort data with the attribute
>"language" if you want to do it correctly in any case, either implied
>or added explicitly. A sorting program made for Russian text in e. g.
>ISO 8859-5 may do the job well for exactly that, it will create trash
>when run over German text coded in ISO 8859-1.

"Sorting" is not only what sort does; it is also in [a-z] in grep or in
the screen editor's search; it is in awk, perl and shell globbing. Want
to modify all those languages so that you tell them the language every
time? Don't be ridiculous.

>A "general" sorting
>program would be so complex that it's *almost* impossible to do.
>It will at least be very much memory and CPU consuming and it will
>take years (many years) to develop it. I'd love to be wrong here.

Your wish is granted. See my previous postings with discussion of
techniques (composite letters and equivalence classes).

>> sorting Russian and English together
>> is at best useless). It increases the number of codes but not
>> as drastically as codifying languages; there are hundreds of
>> languages based on a dozen of meta-alphabets.
>
>Why should you want to sort data with mixed Russian and e. g. English
>words anyway? If you do so using plain character sets instead of clever
>algorithms, you must fail hopelessly.

See before. I have files with both Russian and English names in my
directory, btw, and ls produces exactly what I want.

>> Not true. First of all nothing forces to use 32-bit representation
>> where only 10 bits are necessary.
>
>10 bits? You want to keep Asian languages out of the game, too?
>:-)

It was nothing more than a figure of speech (i.e. 32 bits aren't always
necessary). We're in agreement here :-)

>> The fundamental idea is simply wrong -- it is inadequate for
>> anything except for Latin-based languages. No wonder we're
>> hearing that Unicode is US-centric.
>
>I agree that UniCode is not very good for Asian languages, but
>for European languages (and some more) it's really OK. Should
>we ever decide to use the full 32-bits ISO 10646 intended to
>allocate for character codes, we should be able to cover almost
>all languages (to some extent).

It is also inadequate for Cyrillic-based languages (Slavic and others;
not all Slavic languages use Cyrillic, and most Cyrillic-based languages
aren't Slavic!)

>> Unfortunately Unicode looks like a cool solution for people who
>> never did any real localization work and i fear that this
>> particular mistake will be promoted as standard presenting
>> us a new round of headache. It does not remove necessity to
>> carry off-text information (like "X-Language: english") and
>> it makes it not better than existing ISO 8-bit encodings
>> (if i know the language i already know its alphabet --
>
>No, you don't. Look at the existing ISO 8-bit encodings, namely
>ISO 8859-x and you can see that many languages can be encoded in
>several character sets. You will always have to include a language
>tag if the language is not implied.

I never said that I like the ISO 8859-x encodings; quite the opposite.
However, there are multilingual encodings which do not require an
explicit language specification for sorting and case conversion.
KOI-8 (English and Russian) is an example.

>> all extra bits are simply wasted; and programs handling Unicode
>> text have to know the language for reasons stated before).
>
>Extra bits aren't wasted. 10-bit or 16-bit make no real difference.

You missed the argument; apparently I failed to explain it. See my
other postings.

>> UNICODE IS A *BIG* MISTAKE.
>
>This may be your opinion, but I don't agree.
>It's still a better "mistake" than plain US ASCII and it's better
>than 8-bit encodings and/or character set switching. ;-)

I simply know the hole it'll sink into. We already saw it with several
Russian-English encodings.

>The other languages you stated, like Russian, Greek and English
>seem to be served well by UniCode, I think.

I can speak with confidence about Russian (it's my native language and
I'm quite experienced in localization issues): there is no question
that Unicode will never be used inside Russia.

>As much as I can see no universal character set will ever *solve*
>the problems arising from sorting, case conversion, hyphenation
>and many more, thus we shouldn't expect that.

There are some sensible solutions; Unicode isn't one of them.

--vadim
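
The disagreement above turns on whether one set of character codes can
be sorted without knowing the language. A minimal Python sketch of that
point follows, assuming simplified alphabet tables: the names GERMAN,
SWEDISH and make_sort_key are hypothetical and invented for this
illustration, and the tables are not complete collation rules for
either language. The "equivalence class" idea is only loosely modelled
on the term used in the post; it is not taken from the earlier postings
the post refers to.

```python
# Illustrative only: these tables are simplified assumptions, not full
# collation rules. Each space-separated group is one equivalence class.
GERMAN = "aä b c d e f g h i j k l m n oö p q r s t uü v w x y z".split()
SWEDISH = "a b c d e f g h i j k l m n o p q r s t u v w x y z å ä ö".split()


def make_sort_key(classes):
    """Build a sort key from an ordered list of equivalence classes."""
    rank = {}
    for position, letters in enumerate(classes):
        for letter in letters:
            rank[letter] = position
    unknown = len(classes)  # characters missing from the table sort last

    def sort_key(word):
        primary = tuple(rank.get(ch, unknown) for ch in word.lower())
        # The raw word breaks ties between words with equal primary ranks.
        return (primary, word)

    return sort_key


words = ["zon", "ära", "arm"]
german_order = sorted(words, key=make_sort_key(GERMAN))
swedish_order = sorted(words, key=make_sort_key(SWEDISH))
print(german_order)
print(swedish_order)
```

With these tables the German key places "ära" before "arm" because "ä"
shares a class with "a", while the Swedish key places it after "zon"
because "ä" is a separate letter near the end of the alphabet. The
character codes are identical in both runs; only the table passed in
changes the result.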