NetNews Usenet Archive 1992 #31

Internet Message Format | 1993-01-01 | 5.8 KB

Path: sparky!uunet!not-for-mail
From: avg@rodan.UU.NET (Vadim Antonov)
Newsgroups: comp.std.internat
Subject: Re: Dumb Americans (was INTERNATIONALIZATION: JAPAN, FAR EAST)
Date: 1 Jan 1993 03:35:02 -0500
Organization: UUNET Technologies Inc, Falls Church, VA
Lines: 116
Message-ID: <1i0vnmINN352@rodan.UU.NET>
References: <8490@charon.cwi.nl> <1hvu79INN4qf@rodan.UU.NET> <8492@charon.cwi.nl>
NNTP-Posting-Host: rodan.uu.net
Keywords: Han Kanji Katakana Hirugana ISO10646 Unicode Codepages

In article <8492@charon.cwi.nl> dik@cwi.nl (Dik T. Winter) writes:
> > My point was that there obviously are identifiable meta-alphabets
> > covering several languages.
>I do think that the number in several is very small.

With a trivial trick of including several codes for identical glyphs for letters from different languages you can put all of them in ONE meta-alphabet. It's a matter of compromise, as you understand.

As for "several", my estimate is that all we need is about 15-20 meta-alphabets. You simply underestimate the number of languages based on the same grapho-phonetic sets. In my native Northern Caucasus people speak (and write) in more than 300 languages -- but most of them have a Cyrillic-based written form designed after the region was annexed to the Russian Empire.

>Still wrong. Take the dutch ij...
>So how would the keyboard driver deal
>with the 'ij' combination? When I enter the combination it can either be
>the single letter ij (some dutch people say there is no such single letter),
>or two letters, depending on context. So must the keyboard driver look
>at the context (e.g. it is a french loanword like bijoux so that ij is
>really two letters), or what?

There are different solutions -- the radical one is to tell the driver which language you're writing in (I am used to typing on four different keyboards and find that the "native" layout for each language is simply the best). Another is to use the _compose_ key.
Specifying the language is good form -- and you do it anyway if your editor supports interactive spell-checking. Most people will never bother to use more than two languages anyway, so it can simply be a couple of assignable registers toggled by, say, the right Alt key. The problem should be localized in one place (input) -- instead of having *every* application keep (or even worse, ask for!) the language.

>Sorting is extremely context sensitive, even in a single language.

Yes, sure. Then why aggravate the problem even more?

>As
>another person already mentioned in english you sort McNeill as if it
>is MacNeill. Similar the abbreviation St. which can be either Street or
>Saint. (Moreover, when sorting names I would prefer to sort C. van der Bilt
>under V if it is an American and under B if it is a Dutchman ;-).)

Sorting names and addresses is a big problem everywhere -- and (as always) the simplest solution is the best: sort everything LITERALLY. Real-world applications work this way; after breaking several times on names like MCAAN XIOY, banks revert to placing MARX between MACNEILL and MCNEILL. Dunno about the UK, but in the US with its diverse population the problem is not new. There are a lot of arguments for abolishing the separation of names into first, middle and last, because there are cultures where names cannot be separated, or should be modified, or simply have a different order. Since airplanes crossed the ocean you can no longer rely on your local knowledge. British Telecom already dropped titles in phone books, didn't they?

>To me it appears very silly to put more than superficial sorting
>information in the encoding.

It is NOT silly if it covers 99.9% of all practical applications.

>The remainder must be handled by the
>applications (through library programs). And indeed, that may require
>table look-up.

It is not just table lookup, if you haven't understood it yet.
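
(A minimal Python sketch of the literal-sorting point above, using the names from this post. The helper names literal_key and mc_as_mac_key are invented for illustration, and the "Mc files as Mac" rule is reduced to a plain prefix rewrite, which is an assumption rather than any bank's actual practice.)

```python
# Names taken from the post; everything else here is hypothetical.
names = ["MCNEILL", "MARX", "MACNEILL"]


def literal_key(name: str) -> str:
    """Literal ordering: compare the characters exactly as stored."""
    return name


def mc_as_mac_key(name: str) -> str:
    """English-only convention: file a leading 'MC' as if it were 'MAC'."""
    return "MAC" + name[2:] if name.startswith("MC") else name


literal = sorted(names, key=literal_key)
by_rule = sorted(names, key=mc_as_mac_key)

print(literal)  # ['MACNEILL', 'MARX', 'MCNEILL']
print(by_rule)  # ['MCNEILL', 'MACNEILL', 'MARX']

# The rule fires on any name that merely starts with "MC",
# whatever language the name actually comes from.
print(mc_as_mac_key("MCAAN XIOY"))  # MACAAN XIOY
```

Literal ordering puts MARX between the two spellings, but it needs no knowledge about where a name comes from; the prefix rule keeps the two spellings together and silently rewrites every other name that begins with the same two letters.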
The problem with Unicode is that in order to sort (or even capitalise) strings your program should KNOW the language the strings are in -- therefore you can use already existing 8-bit encodings just as well. Extended Unicode sequences carry NO useful information IF the particular alphabet is already known. Now, if we're going this way, why do we need Unicode at all? We're back to the original problem, but this time with more complexity.

It is THAT simple. I'm often amazed how people fail to see obvious things.

> > The idea of visual encoding (and one letter-one glyph is nothing more
> > than a compressed image of the text) is simply wrong because it
> > drops valuable information readily available at the point of the CREATION
> > of the text but not later.
>But as I said, such information is not readily available at the point of
>creation, only if the system asks everytime.

It should ask at some time anyway -- and it is much better when "this name is Dutch" is entered by a teller in Europe who typed the name in than by an operator in Hong Kong who may well be ignorant of the difference between Dutch and Spanish. Telling the language is NOT annoying -- most of the time it's one or two languages and nobody switches between them often anyway. Finally, in most places the language was told once by the person who set up the system, and the system surely won't forget it.

>That would be silly as most text is not sorted anyway.

Really? I guess you're seriously wrong. Even USENET postings get clipped into words and sorted by the thesaurus-collecting algorithms of things like grapeVINE. And ALL words people enter into *real-life* applications (banking, mailing, legal, publishing) eventually get sorted. It's easy to miss -- but the statistic is that 30% of computing power gets spent on *sorting* (I can't recall the source -- apparently some well-known book by Myers (sp?) or Brooks).
It is safe to assume that practically any published (or shared) text (or parts of it, like the title) will be sorted somewhere.

--vadim
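
(A footnote to the point above that capitalising a string requires knowing its language: a minimal Python sketch in which the language is recorded once, when the text is typed, and travels with it. The TaggedText type, the two-letter language tags and the simplified Dutch rule for a word-initial "ij" are all assumptions made for illustration, not part of any standard or library.)

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class TaggedText:
    """Text plus the language recorded at input time (hypothetical type)."""
    text: str
    lang: str  # assumed tags: "nl" for Dutch, "en" for English


def capitalise(item: TaggedText) -> str:
    """Capitalise the first letter, using the language tag to pick the rule."""
    text = item.text
    if not text:
        return text
    # Simplified Dutch rule: a word-initial "ij" counts as one letter.
    if item.lang == "nl" and text[:2].lower() == "ij":
        return "IJ" + text[2:]
    return text[0].upper() + text[1:]


dutch = capitalise(TaggedText("ijsselmeer", "nl"))
english = capitalise(TaggedText("ijsselmeer", "en"))
loanword = capitalise(TaggedText("bijoux", "nl"))

print(dutch)     # IJsselmeer
print(english)   # Ijsselmeer
print(loanword)  # Bijoux
```

The same characters give two different results, and the only thing that distinguishes the cases is the tag supplied where the text was entered.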