- Path: sparky!uunet!cs.utexas.edu!qt.cs.utexas.edu!yale.edu!ira.uka.de!Germany.EU.net!incom!kostis!blues!kosta
- From: kosta@blues.kk.sub.org (Kosta Kostis)
- Newsgroups: comp.std.internat
- Subject: Re: Dumb Americans (was INTERNATIONALIZATION: JAPAN, FAR EAST)
- Keywords: Han Kanji Katakana Hiragana ISO10646 Unicode Codepages
- Message-ID: <VyqRwB2w165w@blues.kk.sub.org>
- Date: Sat, 02 Jan 93 22:31:18 MET
- References: <1i2ommINN5uh@rodan.UU.NET>
- Organization: The Blues Family
- Lines: 241
-
- avg@rodan.UU.NET (Vadim Antonov) writes:
-
- > In article <DwqPwB3w165w@blues.kk.sub.org> kosta@blues.kk.sub.org (Kosta Kost
- > >What do you consider a "mechanistic" case conversion?
- >
- > The one which does not ask user which language he means every time
- > he runs more -i.
-
- Nice. Your local language will be implied somehow and you can use it as
- a default. What's your problem?
-
- > >We need more "clever" conversion routines.
- >
- > Unfortunately there is a logical gap. I don't care WHICH algorithm
- > is used as long as it is an ALGORITHM. There is no way to convert Unicode
- > strings to uppercase without "external" information.
-
- There is no way to convert non-US ASCII strings without "external"
- information. Simple "solutions" may work for "Russian and English"
- or for "Greek and English", where you imply the language, but there's
- no *general* solution. You don't seem to understand that.
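
To make the point about "external" information concrete, here is a minimal Python sketch of a case conversion that takes the language as a parameter. The function name `upper_for` and the language tags are hypothetical, picked only for illustration. Only two special cases are shown (Turkish dotted/dotless i, German ß); a real routine would need many more.

```python
def upper_for(text: str, lang: str) -> str:
    """Uppercase `text` under the rules of `lang` (hypothetical tags)."""
    if lang == "tr":
        # Turkish: dotted i uppercases to dotted İ (U+0130),
        # dotless ı (U+0131) uppercases to plain I.
        text = text.replace("i", "\u0130").replace("\u0131", "I")
    # Python's default mapping covers the rest, e.g. German ß -> SS.
    return text.upper()


print(upper_for("istanbul", "en"))
print(upper_for("istanbul", "tr"))
print(upper_for("straße", "de"))
```

The same input string gives different results for "en" and "tr", which is exactly why the character code alone is not enough.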
-
- > >Whenever you want to convert a characters case or want to sort things,
- > >the language used is one of the parameters, not only the character code.
- >
- > Exactly. Then, if you already know the language why on the Earth
- > do you need to waste bits on Unicode?
-
- Because there are more languages than English and Russian. ;-)
-
- > >> 2) there is no trivial way to sort anything.
- > >> An elementary sort program will require access to enormous
- > >> tables for all possible languages.
- > >>
- > >> English: A B C D E ... T ...
- > >> Russian: A .. B ... E ... C T ...
- > >
- > >This is a problem regardless of the character set being used.
- > >Why do you blame UniCode for that?
- >
- > Regardless???? ASCII allows to sort English; KOI-8 allows to
- > sort both Russian and English. With Unicode i can't do it
- > if i don't know the language of the text. What about AI in sort?
-
- Nice for you that you're bilingual, but there are companies and the like
- that need support for many more than two languages, and their
- "common alphabet" won't fit in 8, 9 or 10 bits.
-
- > >The Cyrillic "X" is *not* the same as the Latin 1 "X" in UniCode,
- > >as you might know. Have you ever tried this with an "american DTP system"
- > >that uses UniCode? :-) I don't think so...
- >
- > No i haven't tried and it is one more reason to be careful with Unicode --
- > it wasn't tested in real life.
-
- This is funny. I think you were joking here, right? ;-)
-
- > >> Having unique glyphs works ONLY WITHIN a group of languages
- > >> which are based on variations of a single alphabet with
- > >> non-conflicting alphabetical ordering and sets of
- > >> vowels. You can do that for European languages.
- > >
- > >You can't. Maybe you should learn more about European languages. ;-)
- >
- > Already discussed. I sure don't know everything but i know that
- > you can make a minimal strictly ordered set from unification of
- > strictly ordered sets by merging similar elements.
-
- You think you can do so. Implement it and try to sell it. Good luck.
-
- > >> An attempt to do it for different groups (like Cyrillic and Latin)
- > >> is disastrous at best -- we already tried it and finally came to
- > >> the encodings with two absolutely separate alphabets.
- > >
- > >That's what's done in ISO 8859-5 and what UniCode does. RTFM. :-)
- >
- > ISO 8859-5 is dead. Nobody uses it in Russia, FYI. And we've got
- > the same problem with cyrillic-based languages as with latin-based
- > languages in Europe; in my native Northern Caucasus there are about
- > 300 different languages, most of them have writing based on cyrillic.
-
- I can see 202 cyrillic characters (including diacritic marks) in
- UniCode Version 1.0 - that's better than ISO 8859-5 (96 characters).
- Does KOI-8 cover more than 202 cyrillic characters?
-
- > >> I think that there are not many such groups, though, and it is possible
- > >> to identify several "meta-alphabets". The meta-alphabets have no
- > >> defined rules for cross-sorting (unlike letters WITHIN one
- > >> meta-alphabet; you CAN sort English and German words together
- > >> and it still will make sense;
- > >
- > >Why do you think it would make sense?
- >
- > It's easy, Watson. Names, for example. Or (a case from my practice)
- > -- there is a lot of commercial enterprises in Moscow with English
- > names (moda, i guess). Then, i've got to sort the list.
- > The KOI-8 sorting produced a list containing all Russian names
- > in alphabetical order and then all English in alphabetical order
- > which was exactly what was desired.
- >
- > >It won't make sense. Lexical sorting makes only sense, if at all, in
- > >*one* single language.
- >
- > See before.
-
- You have your way of sorting names, others have other ways of sorting.
- Foreign names are written in German with German letters in Germany.
- My name is Greek, but I write it with Latin characters, as are all names
- in Germany. German sorting rules apply and we would never think to
- distinguish between an English and a Russian or German name in that case.
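
As a rough illustration of sorting with the language as a parameter, here is a Python sketch that orders the same words under two simplified rule sets. The names `german_key`, `swedish_key`, `COLLATION_KEYS` and `sort_for` are hypothetical, and the rules are deliberately reduced: German dictionary-style order treats umlauts like their base vowel, while Swedish places å, ä, ö after z. Real collation tables are far larger.

```python
SWEDISH_ALPHABET = "abcdefghijklmnopqrstuvwxyzåäö"


def german_key(word: str) -> str:
    # Simplified dictionary-style German order: umlauts sort like the
    # base vowel, ß like "ss". Ties (e.g. "a" vs "ä") are not broken here.
    table = str.maketrans({"ä": "a", "ö": "o", "ü": "u", "ß": "ss"})
    return word.lower().translate(table)


def swedish_key(word: str) -> list[int]:
    # Simplified Swedish order: å, ä, ö are separate letters after z.
    # Characters outside the alphabet sort last (a simplification).
    last = len(SWEDISH_ALPHABET)
    return [SWEDISH_ALPHABET.find(ch) if ch in SWEDISH_ALPHABET else last
            for ch in word.lower()]


COLLATION_KEYS = {"de": german_key, "sv": swedish_key}


def sort_for(words, lang):
    """Sort `words` under the (hypothetical) rule set for `lang`."""
    return sorted(words, key=COLLATION_KEYS[lang])


words = ["Zahn", "Ärger", "Apfel", "Ofen", "Öl"]
print(sort_for(words, "de"))
print(sort_for(words, "sv"))
```

Identical input, identical character codes, two different correct orders: the language has to come from somewhere, either implied or stated.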
-
- > >You will have to sort data with the attribute
- > >"language" if you want to do it correctly in any case, either implied
- > >or added explicitly. A sorting program made for Russian text in e. g.
- > >ISO 8859-5 may do the job well for exactly that, it will create trash
- > >when run over German text coded in ISO 8859-1.
- >
- > "Sorting" is not only as in sort, it is also in [a-z] in grep or
- > in the screen editor search; it is in awk, perl and shell globbing.
- > Want to modify all those languages to tell the language every time?
- > Don't be ridiculous.
-
- *This* is ridiculous. You constantly miss the point.
-
- Dig it: there are more than two languages.
-
- You can add "implied" bilinguality in your grep or whatever but that's
- about it.
-
- > >A "general" sorting
- > >program would be so complex that it's *almost* impossible to do.
- > >It will at least be very memory- and CPU-consuming and it will
- > >take years (many years) to develop it. I'd love to be wrong here.
- >
- > Your wish is granted. See my previous postings with discussion of
- > techniques (composite letters and equivalence classes).
-
- Will this help with Arabic, Hebrew, Kanji and all that?
- You constantly try to squeeze everything into the scheme you used
- when you had to bring together Cyrillic and English, but what's
- good in one case isn't necessarily good in another case.
- ASCII-like solutions just don't work for the whole world.
-
- > >> sorting Russian and English together
- > >> is at best useless). It increases the number of codes but not
- > >> as drastically as codifying languages; there are hundreds of
- > >> languages based on a dozen of meta-alphabets.
- > >
- > >Why should you want to sort data with mixed Russian and e. g. English
- > >words anyway? If you do so using plain character sets instead of clever
- > >algorithms, you must fail hopelessly.
- >
- > See before. I have files with both Russian and English names in my
- > directory, btw, and ls produces exactly what i want.
-
- Yeah, OK. Your "ls" would not produce what I want on an ISO 8859-1 terminal.
- I don't want to see cyrillic letters when I expect umlauts and so on.
- Your cyrillic order scheme doesn't work for other character sets.
- If you want to support many languages with one program you will have to
- tell the program the languages *and* have a richer character set.
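
A small Python sketch of the terminal problem: one and the same byte shows up as a different letter depending on which 8-bit character set the reader assumes. The helper name `decode_all` is hypothetical, and Python's `koi8_r` codec is used here as a stand-in for "KOI-8", which is an assumption about the variant meant in this thread.

```python
def decode_all(raw: bytes, codecs=("latin_1", "iso8859_5", "koi8_r")):
    """Decode the same bytes under several 8-bit character sets."""
    return {name: raw.decode(name) for name in codecs}


# A single byte, e.g. taken from a file name.
raw = bytes([0xE4])

for name, text in decode_all(raw).items():
    print(f"{name:10} 0x{raw[0]:02X} -> {text}")
```

With an 8-bit code the byte carries no hint about which table applies, so the choice has to be implied by the environment; a larger character set gives each of these letters its own code.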
-
- > It is also inadequate for cyrillic-based languages (slavic and
- > others; not all slavic languages use cyrillic and most of
- > Cyrillic languages aren't slavic!)
-
- I believe you, but what's the problem with UniCode here?
- Maybe you should tell me more about KOI-8 (by email, please).
-
- > >No, you don't. Look at the existing ISO 8-bit encodings, namely
- > >ISO 8859-x and you can see that many languages can be encoded in
- > >several character sets. You will always have to include a language
- > >tag if the language is not implied.
- >
- > I never said that i like ISO 8859-x encodings; quite the opposite.
-
- Why? For me it's better than US ASCII and I have no advantage using
- KOI-8 when I want to read/write Greek, right?
-
- > However, there are multilingual encodings which do not require
- > explicit language specifications for sorting and case conversion.
- > KOI-8 (English and Russian) is an example.
-
- The language is implied. It's yet another incarnation of a national
- "island solution". Do you think one character set for every country
- (or better region) is really what we should be looking for?
- Don't we have that already and wasn't that one reason to create UniCode?
-
- > >> all extra bits are simply wasted; and programs handling Unicode
- > >> text have to know the language for reasons stated before).
- > >
- > >Extra bits aren't wasted. 10-bit or 16-bit make no real difference.
- >
- > You missed the argument, apparently i failed to explain. See in my
- > other postings.
-
- I can't. Expire has done the dirty work already. :-)
- (BTW: in which group? I don't read all. :-) )
-
- > >> UNICODE IS A *BIG* MISTAKE.
- > >
- > >This may be your opinion, but I don't agree. It's still a better
- > >"mistake" than plain US ASCII and it's better than 8-bit encodings
- > >and/or character set switching. ;-)
- >
- > I simply know the hole it'll sink in. We already saw it with
- > several Russian-English encodings.
-
- UniCode is not a Russian-English encoding. It's a multilingual
- encoding with advantages and drawbacks.
-
- > >The other languages you stated, like Russian, Greek and English
- > >seem to be served well by UniCode, I think.
- >
- > I can say for sure about Russian (since it's my native language and
- > i'm quite experienced in localization issues) that it is out of the
- > question that Unicode will ever be used inside Russia.
-
- Did I hear your shoe on the table? :-)
-
- Changes are generally not liked, especially if they mean "work".
- You don't see the advantages right now, but as soon as Russia does
- more business with non-English-speaking companies, you will understand
- the problems. Just open the door, it's real nice outside. :-)
-
- > >As far as I can see, no universal character set will ever *solve*
- > >the problems arising from sorting, case conversion, hyphenation
- > >and many more, thus we shouldn't expect that.
- >
- > There are some sensible solutions, Unicode isn't one of them.
-
- There are partial solutions for local problems, fine, but that's
- it and that's what will be. No universal character set will ever
- solve that [period]. (Now hear my shoe on the table ... ;-) )
-
- Kosta
-
-
- --
- Kosta Kostis, Talstrasse 25, D-6074 Roedermark 3, Germany
- kosta@blues.kk.sub.org (home)
- sw authors: please support ISO 8859-x! äöüÄÖÜß = aeoeueAEOEUEss
-