NetNews Usenet Archive 1992 #31


Text File | 1993-01-02 | 10.6 KB | 253 lines

Path: sparky!uunet!cs.utexas.edu!qt.cs.utexas.edu!yale.edu!ira.uka.de!Germany.EU.net!incom!kostis!blues!kosta
From: kosta@blues.kk.sub.org (Kosta Kostis)
Newsgroups: comp.std.internat
Subject: Re: Dumb Americans (was INTERNATIONALIZATION: JAPAN, FAR EAST)
Keywords: Han Kanji Katakana Hirugana ISO10646 Unicode Codepages
Message-ID: <VyqRwB2w165w@blues.kk.sub.org>
Date: Sat, 02 Jan 93 22:31:18 MET
References: <1i2ommINN5uh@rodan.UU.NET>
Organization: The Blues Family
Lines: 241

avg@rodan.UU.NET (Vadim Antonov) writes:

> In article <DwqPwB3w165w@blues.kk.sub.org> kosta@blues.kk.sub.org (Kosta Kost
> >What do you consider a "mechanistic" case conversion?
>
> The one which does not ask user which language he means every time
> he runs more -i.

Nice. Your local language will be implied somehow and you can use it
as a default. What's your problem?

> >We need more "clever" conversion routines.
>
> Unfortunately there is a logical gap. I don't care WHICH algorithm
> is used as long as it is ALGORITHM. There is no way to convert Unicode
> strings uppercase without "external" information.

There is no way to convert non-US ASCII strings without "external"
information. Simple "solutions" may work for "Russian and English" or
for "Greek and English", where you imply the language, but there's no
*general* solution. You don't seem to understand that.

> >Whenever you want to convert a character's case or want to sort things,
> >the language used is one of the parameters, not only the character code.
>
> Exactly. Then, if you already know the language why on the Earth
> do you need to waste bits on Unicode?

Because there are more languages than English and Russian. ;-)

> >> 2) there is no trivial way to sort anything.
> >>    An elementary sort program will require access to enormous
> >>    tables for all possible languages.
> >>
> >>    English: A B C D E ... T ...
> >>    Russian: A .. B ... E ... C T ...
> >
> >This is a problem regardless of the character set being used.
> >Why do you blame UniCode for that?
>
> Regardless???? ASCII allows to sort English; KOI-8 allows to
> sort both Russian and English. With Unicode i can't do it
> if i don't know the language of the text.

What about AI in sort? Nice for you that you're bilingual, but there
are companies and the like that need support for much more than two
languages, and their "common alphabet" won't fit in 8-bit, 9-bit or
10-bit.

> >The Cyrillic "X" is *not* the same as the Latin 1 "X" in UniCode,
> >as you might know. Have you ever tried this with an "american DTP system"
> >that uses UniCode? :-) I don't think so...
>
> No i haven't tried and it is one more reason to be careful with Unicode --
> it wasn't tested in the real life.

This is funny. I think you were joking here, right? ;-)

> >> Having unique glyphs works ONLY WITHIN a group of languages
> >> which are based on variations of a single alphabet with
> >> non-conflicting alphabetical ordering and sets of
> >> vowels. You can do that for European languages.
> >
> >You can't. Maybe you should learn more about European languages. ;-)
>
> Already discussed. I sure don't know everything but i know that
> you can make a minimal strictly ordered set from unification of
> strictly ordered sets by merging similar elements.

You think you can do so. Implement it and try to sell it. Good luck.

> >> An attempt to do it for different groups (like Cyrillic and Latin)
> >> is disastrous at best -- we already tried it and finally came to
> >> the encodings with two absolutely separate alphabets.
> >
> >That's what's done in ISO 8859-5 and what UniCode does. RTFM. :-)
>
> ISO 8859-5 is dead. Nobody uses it in Russia, FYI. And we've got
> the same problem with cyrillic-based languages as with latin-based
> languages in Europe; in my native Northern Caucasus there are about
> 300 different languages, most of them have writing based on cyrillic.

I can see 202 cyrillic characters (including diacritic marks) in
UniCode Version 1.0 - that's better than ISO 8859-5 (96 characters).
Does KOI-8 cover more than 202 cyrillic characters?

> >> I think that there are not many such groups, though, and it is possible
> >> to identify several "meta-alphabets". The meta-alphabets have no
> >> defined rules for cross-sorting (unlike letters WITHIN one
> >> meta-alphabet; you CAN sort English and German words together
> >> and it still will make sense;
> >
> >Why do you think it would make sense?
>
> It's easy, Watson. Names, for example. Or (a case from my practice)
> -- there is a lot of commercial enterprises in Moscow with English
> names (moda, i guess). Then, i've got to sort the list.
> The KOI-8 sorting produced a list containing all Russian names
> in alphabetical order and then all English in alphabetical order
> which was exactly what was desired.
>
> >It won't make sense. Lexical sorting makes only sense, if at all, in
> >*one* single language.
>
> See before.

You have your way of sorting names, others have other ways of sorting.
Foreign names are written in German with German letters in Germany.
My name is Greek, but I write it with Latin characters, as are all
names in Germany. German sorting rules apply and we would never think
to distinguish between an English and a Russian or German name in that
case.

> >You will have to sort data with the attribute
> >"language" if you want to do it correctly in any case, either implied
> >or added explicitly. A sorting program made for Russian text in e. g.
> >ISO 8859-5 may do the job well for exactly that, it will create trash
> >when run over German text coded in ISO 8859-1.
>
> "Sorting" is not only as in sort, it is also in [a-z] in grep or
> in the screen editor search; it is in awk, perl and shell globbing.
> Want to modify all those languages to tell the language every time?
> Don't be ridiculous.

*This* is ridiculous. You constantly miss the point.
Dig it: there are more than two languages. You can add "implied"
bilinguality in your grep or whatever but that's about it.

> >A "general" sorting
> >program would be so complex that it's *almost* impossible to do.
> >It will at least be very much memory and CPU consuming and it will
> >take years (many years) to develop it. I'd love to be wrong here.
>
> Your wish is granted. See my previous postings with discussion of
> techniques (composite letters and equivalence classes).

Will this help with Arabic, Hebrew, Kanji and all that?
You constantly try to squeeze everything into the scheme you used when
you had to bring together cyrillic and english, but what's good in one
case doesn't need to be good in another case. ASCII-like solutions
just don't work for the whole world.

> >> sorting Russian and English together
> >> is at best useless). It increases the number of codes but not
> >> as drastically as codifying languages; there are hundreds of
> >> languages based on a dozen of meta-alphabets.
> >
> >Why should you want to sort data with mixed Russian and e. g. English
> >words anyway? If you do so using plain character sets instead of clever
> >algorithms, you must fail hopelessly.
>
> See before. I have files with both Russian and English names in my
> directory, btw, and ls produces exactly what i want.

Yeah, OK. Your "ls" would not produce what I want on an ISO 8859-1
terminal. I don't want to see cyrillic letters when I expect umlauts
and so on. Your cyrillic order scheme doesn't work for other character
sets. If you want to support many languages with one program you will
have to tell the program the languages *and* have a richer character
set.

> It is also inadequate for cyrillic-based languages (slavic and
> others; not all slavic languages use cyrillic and most of
> Cyrillic languages aren't slavic!)

I believe you, but what's the problem with UniCode here?
Maybe you should tell me more about KOI-8 (by email, please).

> >No, you don't.
> >Look at the existing ISO 8-bit encodings, namely
> >ISO 8859-x and you can see that many languages can be encoded in
> >several character sets. You will always have to include a language
> >tag if the language is not implied.
>
> I never said that i like ISO 8859-x encodings; quite opposite.

Why? For me it's better than US ASCII and I have no advantage using
KOI-8 when I want to read/write Greek, right?

> However, there are multilingual encodings which do not require
> explicit language specifications for sorting and case conversion.
> KOI-8 (English and Russian) is an example.

The language is implied. It's yet another incarnation of a national
"island solution". Do you think one character set for every country
(or better region) is really what we should be looking for? Don't we
have that already and wasn't that one reason to create UniCode?

> >> all extra bits are simply wasted; and programs handling Unicode
> >> text have to know the language for reasons stated before).
> >
> >Extra bits aren't wasted. 10-bit or 16-bit make no real difference.
>
> You missed the argument, apparently i failed to explain. See in my
> other postings.

I can't. Expire has done the dirty work already. :-)
(BTW: in which group? I don't read all. :-) )

> >> UNICODE IS A *BIG* MISTAKE.
> >
> >This may be your opinion, but I don't agree. It's still a better
> >"mistake" than plain US ASCII and it's better than 8-bit encodings
> >and/or character set switching. ;-)
>
> I simply know the hole it'll sink in. We already saw it with
> several Russian-English encodings.

UniCode is not a Russian-English encoding. It's a multilingual
encoding with advantages and drawbacks.

> >The other languages you stated, like Russian, Greek and English
> >seem to be served well by UniCode, I think.
>
> I can say for sure about Russian (since it's my native language and
> i'm quite experienced in localization issues) that it is out of
> question that Unicode will never be used inside Russia.

Did I hear your shoe on the table? :-)

Changes are not liked in general, especially if they mean "work".
You don't see the advantages right now, but as soon as Russia does
more business with non-English speaking companies, you will understand
the problems. Just open the door, it's real nice outside. :-)

> >As much as I can see no universal character set will ever *solve*
> >the problems arising from sorting, case conversion, hyphenation
> >and many more, thus we shouldn't expect that.
>
> There are some sensible solutions, Unicode isn't one of them.

There are partial solutions for local problems, fine, but that's it
and that's what will be. No universal character set will ever solve
that [period]. (Now hear my shoe on the table ... ;-) )

Kosta

--
Kosta Kostis, Talstrasse 25, D-6074 Roedermark 3, Germany
kosta@blues.kk.sub.org (home)
sw authors: please support ISO 8859-x! äöüÄÖÜß = aeoeueAEOEUEss
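
A minimal sketch of the point made above that the language is a
parameter of both case conversion and sorting, and not something the
character code alone can supply. Python is an assumption here, as are
the function names to_upper and sort_words, the language tags, and the
two tiny rule tables; the Turkish dotted I is an extra illustration
that does not appear in the thread. Real case and collation rules are
much larger than this.

```python
# Hypothetical, heavily simplified per-language rules (illustration only).
GERMAN_FOLD = str.maketrans({"ä": "a", "ö": "o", "ü": "u", "ß": "ss"})
SWEDISH_ORDER = "abcdefghijklmnopqrstuvwxyzåäö"
SWEDISH_RANK = {ch: i for i, ch in enumerate(SWEDISH_ORDER)}


def to_upper(text, language):
    """Uppercase text using the rules of the given language tag."""
    if language == "tr":
        # Turkish pairs dotted "i" with dotted capital I (U+0130).
        text = text.replace("i", "\u0130")
    return text.upper()


def sort_key(word, language):
    """Build a comparison key for one word in the given language."""
    lowered = word.lower()
    if language == "de":
        # Dictionary-style German: umlauts sort like the base vowel,
        # the original spelling only breaks ties.
        return (lowered.translate(GERMAN_FOLD), lowered)
    if language == "sv":
        # Swedish: å, ä, ö are separate letters placed after z.
        return [SWEDISH_RANK.get(ch, len(SWEDISH_ORDER)) for ch in lowered]
    raise ValueError(f"no rules for language {language!r}")


def sort_words(words, language):
    """Sort words by the rules of one language."""
    return sorted(words, key=lambda w: sort_key(w, language))


names = ["Zorn", "Ärger", "Adam", "Azur"]
print(sort_words(names, "de"))
print(sort_words(names, "sv"))
print(to_upper("istanbul", "en"), to_upper("istanbul", "tr"))
```

The same four names come out in a different order for "de" and "sv",
and the same string uppercases differently for "en" and "tr", although
the character codes going in are identical in both cases.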