- Path: sparky!uunet!not-for-mail
- From: avg@rodan.UU.NET (Vadim Antonov)
- Newsgroups: comp.std.internat
- Subject: Re: Dumb Americans (was INTERNATIONALIZATION: JAPAN, FAR EAST)
- Date: 1 Jan 1993 03:35:02 -0500
- Organization: UUNET Technologies Inc, Falls Church, VA
- Lines: 116
- Message-ID: <1i0vnmINN352@rodan.UU.NET>
- References: <8490@charon.cwi.nl> <1hvu79INN4qf@rodan.UU.NET> <8492@charon.cwi.nl>
- NNTP-Posting-Host: rodan.uu.net
- Keywords: Han Kanji Katakana Hirugana ISO10646 Unicode Codepages
-
- In article <8492@charon.cwi.nl> dik@cwi.nl (Dik T. Winter) writes:
- > > My point was that there obviously are identifiable meta-alphabets
- > > covering several languages.
- >I do think that the number in several is very small.
-
- With a trivial trick of including several codes for identical glyphs
- for letters from different languages you can put all of them in ONE
- meta-alphabet.
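
As an illustration of "several codes for identical glyphs", here is a minimal sketch that uses Unicode only as a convenient, existing example (it is not the meta-alphabet scheme argued for in this post). The capital A of the Latin, Cyrillic and Greek alphabets usually looks the same on screen, yet each has its own code. The helper name `describe` is made up for the example.

```python
import unicodedata

# Three capital letters that usually render with the same glyph
# but are assigned distinct codes, one per alphabet.
look_alikes = ["\u0041", "\u0410", "\u0391"]

def describe(ch):
    """Return (code point in U+XXXX form, official character name)."""
    return f"U+{ord(ch):04X}", unicodedata.name(ch)

for ch in look_alikes:
    print(describe(ch))
```

Because the codes differ, the alphabet a letter belongs to survives in the stored text even though the rendered shapes coincide.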
-
- It's a matter of compromise, as you understand.
-
- As for "several", my estimate is that all we need is about 15-20
- meta-alphabets. You simply underestimate the number of languages
- based on the same grapho-phonetic sets.
- In my native Northern Caucasus people speak (and write) more than
- 300 languages -- but most of them have a Cyrillic-based written form
- designed after the region was annexed by the Russian Empire.
-
- >Still wrong. Take the dutch ij...
- >So how would the keyboard driver deal
- >with the 'ij' combination? When I enter the combination it can either be
- >the single letter ij (some dutch people say there is no such single letter),
- >or two letters, depending on context. So must the keyboard driver look
- >at the context (e.g. it is a french loanword like bijoux so that ij is
- >really two letters), or what?
-
- There are different solutions -- the radical one is to tell the driver
- which language you're writing in (I happen to type on four different
- keyboards and find that the "native" layout for each language
- is simply the best). Another is to use the _compose_ key.
-
- Specifying the language is good practice -- and you do it anyway if your
- editor supports interactive spell-checking. Most people will never
- bother to use more than two languages anyway, so it can simply be
- a couple of assignable registers toggled by, say, right Alt.
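
The compose-key option can be sketched roughly as follows. Everything here is an assumption made for illustration: the `<compose>` token, the `COMPOSE_TABLE` contents and the `apply_compose` function do not describe any real keyboard driver. The point is only that the typist, not the driver, decides when "ij" is one letter.

```python
COMPOSE = "<compose>"  # hypothetical token standing for the Compose key

# Hypothetical compose table: two keystrokes after Compose give one character.
COMPOSE_TABLE = {
    ("i", "j"): "\u0133",  # LATIN SMALL LIGATURE IJ
    ("I", "J"): "\u0132",  # LATIN CAPITAL LIGATURE IJ
}

def apply_compose(keys):
    """Turn a list of keystrokes into text, honouring Compose sequences."""
    out = []
    i = 0
    while i < len(keys):
        if keys[i] == COMPOSE:
            pair = tuple(keys[i + 1:i + 3])
            if pair in COMPOSE_TABLE:
                out.append(COMPOSE_TABLE[pair])
                i += 3
            else:
                i += 1  # unknown sequence: drop the Compose key, keep the rest
            continue
        out.append(keys[i])
        i += 1
    return "".join(out)

print(apply_compose(list("bijoux")))                    # plain i + j, two letters
print(apply_compose(["d", COMPOSE, "i", "j", "k"]))     # single-letter ij
```

Typed without Compose, "bijoux" keeps its two separate letters; with Compose, "dijk" is stored as three characters.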
-
- The problem should be localized in one place (input) -- instead of
- having *every* application keep track of (or even worse, ask for!) the
- language.
-
- >Sorting is extremely context sensitive, even in a single language.
-
- Yes, sure. Then why exaggerate the problem even more?
-
- >As
- >another person already mentioned in english you sort McNeill as if it
- >is MacNeill. Similar the abbreviation St. which can be either Street or
- >Saint. (Moreover, when sorting names I would prefer to sort C. van der Bilt
- >under V if it is an American and under B if it is a Dutchman ;-).)
-
- Sorting names and addresses is a big problem everywhere -- and (as always)
- the simplest solution is the best: sort everything LITERALLY.
- Real-world applications work this way; after breaking several times
- on names like MCAAN XIOY, banks revert to placing MARX between MCNEILL
- and MACNEILL. Dunno about the UK, but in the US with its diverse population
- the problem is not new. There are a lot of arguments for abolishing
- the separation of names into first, middle and last, because there are
- cultures where names cannot be separated, or must be modified, or simply
- come in a different order.
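
A minimal sketch of the difference, using the names from the paragraph above. The "Mc files as Mac" rule below is a simplified assumption written for this example (the helper `mc_as_mac_key` is hypothetical), not a description of any real bank's software.

```python
names = ["MCNEILL", "MARX", "MACNEILL", "MCAAN XIOY"]

# Literal sort: plain character-by-character comparison, no cultural rules.
literal = sorted(names)

def mc_as_mac_key(name):
    """Hypothetical 'clever' key: file a leading MC as if it were MAC."""
    return "MAC" + name[2:] if name.startswith("MC") else name

clever = sorted(names, key=mc_as_mac_key)

print(literal)  # ['MACNEILL', 'MARX', 'MCAAN XIOY', 'MCNEILL']
print(clever)   # ['MCAAN XIOY', 'MCNEILL', 'MACNEILL', 'MARX']
```

The literal order puts MARX between MACNEILL and MCNEILL, which is ugly but predictable. The "clever" rule treats MCAAN XIOY as a Scottish surname and files it ahead of MCNEILL, which is exactly the kind of breakage described above.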
-
- Now that airplanes cross the ocean you can no longer rely on your local
- knowledge. British Telecom already dropped titles in
- phone books, didn't they?
-
- >To me it appears very silly to put more than superficial sorting
- >information in the encoding.
-
- It is NOT silly if it covers 99.9% of all practical applications.
-
- >The remainder must be handled by the
- >applications (through library programs). And indeed, that may require
- >table look-up.
-
- It is not just table lookup, in case that is not clear yet. The problem
- with Unicode is that in order to sort (or even capitalise) strings your
- program must KNOW the language the strings are in -- and in that case
- you can use the existing 8-bit encodings just as well. Extended
- Unicode sequences carry NO useful information IF the particular
- alphabet is already known. Now, if we're going this way, why do we need
- Unicode at all? We're back to the original problem, but this time with more
- complexity. It is THAT simple.
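
To make the dependence on language concrete, here is a rough sketch. The `ALPHABETS` tables are deliberately simplified assumptions (real German and Swedish collation rules are richer), and `sort_words` and `capitalise` are hypothetical helpers, not a library API. The same character codes sort and capitalise differently once the language is given.

```python
# Hypothetical, deliberately simplified alphabets; real collation is richer.
ALPHABETS = {
    "de": "aäbcdefghijklmnoöpqrstuüvwxyz",  # German: ä filed next to a
    "sv": "abcdefghijklmnopqrstuvwxyzåäö",  # Swedish: å, ä, ö filed after z
}

def sort_key(word, lang):
    """Map each letter to its position in the alphabet of the given language."""
    alphabet = ALPHABETS[lang]
    return [alphabet.index(ch) for ch in word.lower()]

def sort_words(words, lang):
    return sorted(words, key=lambda w: sort_key(w, lang))

def capitalise(text, lang):
    """Upper-case text; Turkish ("tr") maps i to dotted capital I (U+0130)."""
    if lang == "tr":
        text = text.replace("i", "\u0130")
    return text.upper()

words = ["zebra", "ära", "arm"]
print(sort_words(words, "de"))  # ['arm', 'ära', 'zebra']
print(sort_words(words, "sv"))  # ['arm', 'zebra', 'ära']
print(capitalise("istanbul", "en"), capitalise("istanbul", "tr"))
```

Nothing in the code for "ä" or "i" says which result is wanted; the `lang` argument has to come from somewhere outside the string.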
-
- I'm often amazed how people fail to see obvious things.
-
- > > The idea of visual encoding (and one letter-one glyph is nothing more
- > > than a compressed image of the text) is simply wrong because it
- > > drops valuable information readily available at the point of the CREATION
- > > of the text but not later.
- >But as I said, such information is not readily available at the point of
- >creation, only if the system asks everytime.
-
- It should ask at some point anyway -- and it is much better when
- "this name is Dutch" is entered by a teller in Europe who typed the name
- in than by an operator in Hong Kong who may well be ignorant of
- the difference between Dutch and Spanish.
-
- Telling the language is NOT annoying -- most of the time it's one
- or two languages and nobody switches between them often anyway.
-
- Finally, in most places the language is specified once by the
- person who sets up the system, and the system surely won't forget it.
-
- >That would be silly as most text is not sorted anyway.
-
- Really? I think you're seriously wrong. Even USENET postings get
- split into words and sorted by the thesaurus-collecting algorithms of
- things like grapeVINE. And ALL words people enter into *real-life*
- applications (banking, mailing, legal, publishing) eventually get sorted.
- It's easy to miss -- but the statistic is that 30% of computing power
- gets spent on *sorting* (I can't recall the source -- apparently some
- well-known book by Myers (sp?) or Brooks).
-
- It is safe to assume that practically any published (or shared) text
- (or parts of it, like the title) will be sorted somewhere.
-
- --vadim
-