- Path: sparky!uunet!not-for-mail
- From: avg@rodan.UU.NET (Vadim Antonov)
- Newsgroups: comp.std.internat
- Subject: Re: Dumb Americans (was INTERNATIONALIZATION: JAPAN, FAR EAST)
- Date: 1 Jan 1993 17:32:31 -0500
- Organization: UUNET Technologies Inc, Falls Church, VA
- Lines: 113
- Message-ID: <1i2gpvINN3lm@rodan.UU.NET>
- References: <8490@charon.cwi.nl> <1hvu79INN4qf@rodan.UU.NET> <1993Jan1.115424.27258@enea.se>
- NNTP-Posting-Host: rodan.uu.net
-
- In article <1993Jan1.115424.27258@enea.se> sommar@enea.se (Erland Sommarskog) writes:
- >So if I type a C and then, a million key presses later, put in
- >an H after the C, how can the keyboard driver handle that? It might
- >not even be the same driver that sees the two!
-
- Aw, don't be silly. It's trivial.
-
- >>FYI, English has some compound letters too (though they're used only
- >>in typesetting) -- ff, fff, fi, ffi, fl, ffl..
- >
- >Which is not the same as Spanish CH or LL. Saying that ff is one
- >letter is like saying Russian "bI" is two...
-
- Of course not -- they're not "letters" for sorting and case-conversion purposes.
-
- >>why on the Earth do i need to spare bits for encoding glyphs if
- >>i already know the language and 8 (or 16 for oriental languages) bits
- >>is quite enough to map the alphabet. Don't you see this gap in
- >>the logic nullifying all benefits of 10646?
- >
- >What the hell has the number of bits to do with anything? Do computers
- >exist for the programmers or the users?
-
- Look, you've missed the logic completely. Please read it again. I have also
- explained it several times in other postings.
-
- >>With a trivial trick of including several codes for identical glyphs
- >>for letters from different languages you can put all of them in ONE
- >>meta-alphabet.
- >
- >Well, that is already done in 10646 for letters which are the same in
- >Latin, Cyrillic and Greek scripts. Hopefully, that will not cause too
- >much of a mess.
-
- It was the only solution -- the problem is the same but it's a worse
- case.
-
-
- >But what Vadim Antonov was discussing was including identical glyphs
- >for languages like Swedish, German etc. I guess people are in for real
- >surprises because things don't end up where they expect them because
- >they happen to use the wrong type of dotted A. Not talking about the
- >confusion they get when they are searching the text. Possibly this
- >arrangement is friendly for the lazy programmer Vadim Antonov,
- >but not for the poor user.
-
- What do you tell the poor user when he has a database with English
- and Russian company names (a real case from my practice) --
- in both upper and lower case -- and the smart guys (apparently Erland's
- pupils) made a terminal which converts the Cyrillic codes for letters
- of the same shape as Latin ones to the Latin codes? Go get a rope?
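
To make the failure concrete, here is a minimal sketch of such a terminal's folding step. The LOOKALIKES table and the sample name are made up for illustration and are not taken from the actual system; the point is only that once Cyrillic capitals are replaced by Latin codes, lower-casing produces Latin letters that no longer match the Russian word.

```python
# Hypothetical folding table: Cyrillic capitals that share a glyph shape
# with a Latin capital, mapped to the Latin code (illustration only).
LOOKALIKES = {
    "\u0410": "A",  # Cyrillic A
    "\u0412": "B",  # Cyrillic Ve, looks like Latin B
    "\u0415": "E",  # Cyrillic Ie
    "\u041a": "K",  # Cyrillic Ka
    "\u041c": "M",  # Cyrillic Em
    "\u041d": "H",  # Cyrillic En, looks like Latin H
    "\u041e": "O",  # Cyrillic O
    "\u0420": "P",  # Cyrillic Er, looks like Latin P
    "\u0421": "C",  # Cyrillic Es, looks like Latin C
    "\u0422": "T",  # Cyrillic Te
    "\u0425": "X",  # Cyrillic Ha, looks like Latin X
}
FOLD = str.maketrans(LOOKALIKES)

# A made-up company name, "VOSTOK", written in Cyrillic capitals.
name = "\u0412\u041e\u0421\u0422\u041e\u041a"

# What the "smart" terminal stores: same shapes, Latin codes.
folded = name.translate(FOLD)

print(folded)          # looks identical on screen, but is pure ASCII
print(name.lower())    # correct Russian lower case
print(folded.lower())  # Latin "boctok", which is not a Russian word
```

The two upper-case strings are indistinguishable on the screen, yet their lower-case forms differ in every letter, so any case-insensitive lookup in the database misses the record.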
-
- As for lazy, I bet I have written ten times more than you. Shall I send you my resume?
-
- >>ASCII is for English, period.
-
- >In what way is ASCII, which is - as you state yourself - for English,
- >useful for data processing in German or French?
-
- Because:
- 1) the programs working with German ASCII and French ASCII
- aren't the same programs as those working with English ASCII
- -- they have language-specific translation tables in their comparison
- routines which effectively reorder ASCII, making it a somewhat
- different code.
- 2) since relatively few programs were designed this way there
- are a lot of programs with erroneous behaviour, for example:
-
- tr ? SS
- ^-eszett here
- 3) it is not ASCII anyway (where's my {?)
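
The tr case in point 2 can be sketched as follows. The sketch assumes an ISO 8859-1 byte string, where eszett is the single byte 0xDF; that encoding and the sample word are assumptions made for illustration. A tool built around a one-to-one translation table can only turn eszett into a single S, while the correct upper-case form needs two characters.

```python
# Assumption: text is ISO 8859-1, where eszett is the single byte 0xDF.
word = b"Stra\xdfe"  # the German word for "street", used as sample data

# A tr-style table maps one byte to exactly one byte.
table = bytes.maketrans(b"\xdf", b"S")
one_to_one = word.translate(table)
print(one_to_one)  # b'StraSe' -- one S is lost

# Asking the table for a two-character result is rejected outright.
try:
    bytes.maketrans(b"\xdf", b"SS")
except ValueError as err:
    print("one-to-one table cannot expand:", err)

# The correct result needs a string replacement, not a character table.
expanded = word.replace(b"\xdf", b"SS")
print(expanded)  # b'StraSSe'
```

Every program that was written around the one-character-in, one-character-out model has to be restructured before it can handle this correctly.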
-
- >Or even its
- >semantics useful for these languages?
-
- The basic ASCII principles (after reordering and replacing several
- characters) remained the same -- there is a way to convert upper<->lower
- case and there is a way to sort without asking which language every word
- came from (it's known a priori).
-
- That does not work with Unicode.
-
- >In the poor variety of
- >English you can render with ASCII, sorting can be based simply
- >on the letter ordering, because accents, digraphs and diaeresis
- >which only occur occasionally were left out. But German and
- >French cannot be simplified in this way because umlauts and
- >accents appear much more often. For these languages the sorting
- >algorithm must be more complex than simple sorting on collation
- >order, so what's the use of a hard-coded semantics a la ASCII?
-
- There is always a reasonable approximation people use daily --
- technically speaking ANY sorting can be made arithmetic by
- trivial character conversion rules. Since the invention of lexicographic
- sorting those rules have come to be pretty simple.
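
A minimal sketch of what "made arithmetic" means when the language is known in advance. It assumes German dictionary-style rules (umlauts sort as their base vowels, eszett as "ss"); the table, the function name and the word list are illustrative assumptions, not taken from any particular program. After one pass through the conversion table, ordinary code-order comparison gives the expected result.

```python
# Assumed German dictionary-style conversion rules (illustration only).
GERMAN_RULES = str.maketrans({"ä": "a", "ö": "o", "ü": "u", "ß": "ss"})

def german_sort_key(word):
    """Convert a word so that plain code-order comparison sorts it."""
    return word.lower().translate(GERMAN_RULES)

words = ["Zebra", "Ärger", "Apfel", "Straße", "Strand"]

# Raw code order puts the umlaut after Z.
print(sorted(words))
# -> ['Apfel', 'Strand', 'Straße', 'Zebra', 'Ärger']

# After conversion, comparison is purely arithmetic again.
print(sorted(words, key=german_sort_key))
# -> ['Apfel', 'Ärger', 'Strand', 'Straße', 'Zebra']
```

The table is fixed once per language, so the comparison routine itself never has to ask where a word came from.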
-
- >You are seeing the solution, simple bit-order comparisons. But
- >unfortunately there are not many problems which have this solution.
-
- I do not claim to provide a panacea. I simply warn about a known
- problem which can easily outweigh all the benefits of the unified code.
- The solution may seem weird -- until you bump into those holes yourself.
- As I already said, there is no easy way around it -- you have to deal
- with those issues somewhere and it's better to have them solved at
- the elementary level -- otherwise EVERY program will be forced to
- keep track of the language, which is not easy and sometimes ruins
- the whole logic of the program (see the shell globbing example in my previous
- posting or the tr example above).
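
As a sketch of what "keeping track of the language" costs, consider a unified code in which Swedish and German share one code for dotted A. The ordering rules below are simplified assumptions (Swedish places å, ä, ö after z; German dictionary order treats ä as a), and the function names and language tags are hypothetical. The same three code sequences sort differently, so the language has to travel with the data into every comparison.

```python
# Simplified Swedish alphabet order: a-z followed by å, ä, ö.
# Only these letters are handled; anything else raises ValueError.
SWEDISH_ORDER = "abcdefghijklmnopqrstuvwxyzåäö"

# Simplified German dictionary rules: umlauts sort as base vowels.
GERMAN_RULES = str.maketrans({"ä": "a", "ö": "o", "ü": "u", "ß": "ss"})

def swedish_key(word):
    return [SWEDISH_ORDER.index(ch) for ch in word.lower()]

def german_key(word):
    return word.lower().translate(GERMAN_RULES)

# Hypothetical language tags; the caller must always supply one.
KEYS = {"sv": swedish_key, "de": german_key}

def sort_words(words, lang):
    """Sort words, which is impossible without knowing the language."""
    return sorted(words, key=KEYS[lang])

words = ["ära", "zon", "arm"]
print(sort_words(words, "sv"))  # ['arm', 'zon', 'ära']
print(sort_words(words, "de"))  # ['ära', 'arm', 'zon']
```

With a per-language code the extra lang argument disappears, because the choice of code already says which table applies.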
-
- I'm pretty sure Unicode is stillborn exactly because it requires non-trivial
- changes in existing programs for no reason.
-
- --vadim
-