- Path: sparky!uunet!not-for-mail
- From: avg@rodan.UU.NET (Vadim Antonov)
- Newsgroups: comp.unix.bsd
- Subject: Re: Dumb Americans (was INTERNATIONALIZATION: JAPAN, FAR EAST)
- Date: 1 Jan 1993 18:27:05 -0500
- Organization: UUNET Technologies Inc, Falls Church, VA
- Lines: 217
- Message-ID: <1i2k09INN4hl@rodan.UU.NET>
- References: <1992Dec30.061759.8690@fcom.cc.utah.edu> <1ht8v4INNj7i@rodan.UU.NET> <1993Jan1.094759.8021@fcom.cc.utah.edu>
- NNTP-Posting-Host: rodan.uu.net
- Keywords: Han Kanji Katakana Hirugana ISO10646 Unicode Codepages
-
- In article <1993Jan1.094759.8021@fcom.cc.utah.edu> terry@cs.weber.edu (A Wizard of Earth C) writes:
- >In article <1ht8v4INNj7i@rodan.UU.NET> avg@rodan.UU.NET (Vadim Antonov) writes:
- >>In article <1992Dec30.061759.8690@fcom.cc.utah.edu> terry@cs.weber.edu (A Wizard of Earth C) writes:
- >>1) "mechanistic" conversion between upper and lower case
- >> is impossible (as well as case-insensitive comparisons)
- >>
- >> Example: Latin T -> t
- >> Cyrillic T -> m
- >> Greek T -> ?
- >>
- >> This property alone renders Unicode useless for any business
- >> applications.
- >
- >This is simply untrue. Because a subtractive/additive conversion is
- >impossible in *some* cases does not mean a *mechanistic* conversion is
- >also impossible. In particular, a tabular conversion is an obvious
- >approach which has already been used with success, with a minimal
- >(multiply plus dereference) overhead.
-
- You omitted one small "detail" -- you need to know the language of the word
- the letter belongs to in order to make the conversion. Since Unicode does not
- provide for specifying the language, it is obvious that it should be
- obtained from the user or kept somewhere outside the text. In both cases,
- since our program ALREADY knows the language from the environment, it knows
- the particular (small) alphabet -- no need to use multibyte encodings!
- See how Unicode renders itself useless?
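
A minimal sketch of the tabular conversion under discussion, assuming (as the example quoted above does) that the look-alike capital "T" is stored as one shared code. The names `LOWER_TABLES` and `to_lower` are hypothetical and the tables hold only that one letter; the point is only that the lookup needs the language as an extra key.

```python
# Hypothetical per-language lowercase tables. Only the single letter from
# the example in this thread is filled in; real tables would be larger.
LOWER_TABLES = {
    "english": {"T": "t"},
    "russian": {"T": "\u0442"},  # Cyrillic small te, often drawn like "m" in italic faces
    "greek": {"T": "\u03c4"},    # Greek small tau
}


def to_lower(text, language):
    """Lowercase text by table lookup; letters not in the table pass through."""
    table = LOWER_TABLES[language]
    return "".join(table.get(ch, ch) for ch in text)


if __name__ == "__main__":
    # The same stored code yields a different result for each language.
    for lang in LOWER_TABLES:
        print(lang, to_lower("T", lang))
```

The table lookup itself is cheap; what the sketch cannot do is pick the right table without being told the language from outside the text.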
-
- I wonder why programmers aren't taught mathematical logic. I'm something
- of an exception because i'm a mathematician by education and i am used to
- looking for holes in "logical" statements.
-
- >The Cyrillic characters within the Unicode standard (U+0400 -> U+04FF)
- >are based on the ECMA registry under ISO-2375 for use with ISO-2022. It
- >contains several Cyrillic subsets. The most recent and most widely
- >accepted of these is ISO-8859-5. Unicode uses the same relative
- >positions as in ISO-8859-5. Are you also averse to ISO-8859-5?
-
- ISO-8859-5 is ok, though it is a dead code. Nobody uses it in Russia,
- mind you. The most widespread codes are KOI-8 (the de facto Unix and
- networking standard) and the so-called "alternative" code, which is
- popular among MS-DOS users.
-
- [lots of information about the dead code is omitted]
-
- >The main "disordering" of character sets is with regard to the Japanese
- >JIS standard. The minutes of the 20 Apr 90 UNICODE meeting (as reported
- >by Ken Whistler, Metaphor Computer Systems) justify this as follows:
-
- Unfortunately i'm not competent to discuss Japanese and Chinese.
-
- >Of these, some argument can be made against only the final paragraph,
- >since it views internationalization as a tool for multinationalization
- >rather than localization. I feel that a strong argument can be held
- >out for internationalization as a means of providing fully data driven
- >localizations of software. As such, the argument of monolingual vs.
- >multilingual is not supported. However, lexical sort order can be
- >enforced in the access rather than the storage mechanism, making this
- >a null point.
-
- Nay, you missed the same point again. You need information about the
- language's case-conversion and sorting rules, and you can obtain it from
- the encoding (making user programs simple) or from user programs
- (forcing them to ask the user at every step or to keep track of the language).
- What would you choose for your program?
- Besides, as i already argued, asking or keeping off-text information
- makes the whole enterprise useless.
-
- >>2) there is no trivial way to sort anything.
- >> An elementary sort program will require access to enormous
- >> tables for all possible languages.
- >>
- >> English: A B C D E ... T ...
- >> Russian: A .. B ... E ... C T ...
- >
- >I believe this is addressed adequately in the ISO standards; however,
-
- Your belief is wrong for it is not considered adequate by real users.
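
A rough sketch of what "tables for all possible languages" means for even an elementary sort, assuming simple letter-by-letter ordering. `ALPHABETS`, `sort_key` and `sort_words` are hypothetical names, and real collation rules are richer than one alphabet string per language.

```python
# Hypothetical per-language alphabet tables giving the dictionary order.
ALPHABETS = {
    "english": "abcdefghijklmnopqrstuvwxyz",
    "russian": "абвгдеёжзийклмнопрстуфхцчшщъыьэюя",
}


def sort_key(word, order):
    """Map each letter to its rank; letters missing from the table sort last."""
    return [order.get(ch, len(order)) for ch in word.lower()]


def sort_words(words, language):
    """Sort words by the alphabet of the given language."""
    order = {ch: rank for rank, ch in enumerate(ALPHABETS[language])}
    return sorted(words, key=lambda w: sort_key(w, order))


if __name__ == "__main__":
    words = ["ёж", "жук", "еда"]
    print(sorted(words))                 # plain code order
    print(sort_words(words, "russian"))  # order taken from the table
```

Sorting by raw code values and sorting by the table disagree for these three words, and the table can only be chosen once the language is known.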
-
- >the lexical order argument is one of the sticking points against the
- >Japanese acceptance of Unicode, and is a valid argument in that arena.
- >The fact of the matter is that Unicode is not an information manipulation
- >standard, but (for the purposes of its use in internationalization) a
- >storage and an I/O standard. Viewed this way, the lexical ordering argument
- >is nonapplicable.
-
- It'd be a sticking point for Slavic languages as well, you may be sure.
- Knowing the ex-Soviet standard-making routine, i think the official fishy-eyed
- representatives will silently vote in favor to get some more time for raving
- in Western stores, and nobody will use it after that. The "working" standards
- in Russia aren't made by committees.
-
- >>3) there is no reasonable way to do hyphenation.
- >> Since there is no way to tell language from the text there
- >> is no way to do any reasonable attempts to hyphenate.
- >> [OX - which language this word is from]?
- >>
- >> Good-bye wordprocessors and formatters?
- >
- >By this, you are obviously not referring to ideographic languages, such as
- >Han, since hyphenation is meaningless for such languages. Setting aside
- >the argument that if you don't know how to hyphenate in a language, you
- >have no business generating situations requiring hyphenation by virtue
- >of the fact that you are basically illiterate in that language... ;-).
-
- The reason may be as simple as reformatting a spreadsheet containing
- (in particular) addresses of companies in a language i don't comprehend
- (though i can write it on the envelope).
-
- >Hyphenation as a process is language dependent, and, in particular,
- >dependent on the rendering mechanism (rendering mechanisms are *not*
- >the subject under discussion; storage mechanisms *are*). Bluntly
- >speaking, why does one need word processing software at all if this
- >type of thing is codified? Hyphenation, like sorting, is manipulation
- >of the information in a native language specific way.
-
- Exactly. But there are a lot of "legal" ways to do hyphenation -- and
- there are algorithms which do reasonably well knowing nothing about
- the language except which letters are vowels. It's quite enough
- for printing address labels. If i'm formatting a book i can specify the
- language myself.
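
One possible heuristic of that kind, sketched under the assumption that a break is acceptable just before a consonant that is followed by a vowel, provided a vowel already appears to the left. This is an illustration only, not any particular published algorithm; `hyphenation_points` and `hyphenate` are hypothetical names, and the vowel set is the only language knowledge passed in.

```python
def hyphenation_points(word, vowels):
    """Return indices i where word may be split as word[:i] + "-" + word[i:]."""
    is_vowel = [ch.lower() in vowels for ch in word]
    points = []
    for i in range(1, len(word) - 1):
        # Candidate break: just before a consonant that is followed by a vowel.
        if is_vowel[i] or not is_vowel[i + 1]:
            continue
        # Require a vowel on the left so no fragment is a bare consonant run.
        if any(is_vowel[:i]):
            points.append(i)
    return points


def hyphenate(word, vowels):
    """Insert "-" at every allowed break, to show the candidate positions."""
    pieces, start = [], 0
    for i in hyphenation_points(word, vowels):
        pieces.append(word[start:i])
        start = i
    pieces.append(word[start:])
    return "-".join(pieces)


if __name__ == "__main__":
    for w in ("computer", "label", "rhythm"):
        print(hyphenate(w, "aeiou"))
```

A formatter would pick at most one of the candidate positions per line break; the sketch marks all of them.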
-
- >Find another standard to tell you how to write a word processor.
-
- Is there any? :-)
-
-
- >>4) "the similar glyphs" in Unicode are often SLIGHTLY different
- >> typographical glyphs -- everybody who ever dealt with international
- >> publishing knows that fonts are designed as a WHOLE and every
- >> letter is designed with all others in mind -- i.e. X in Cyrillic
- >> is NOT the same X as Latin even if the fonts are variations of
- >> the same style. I'd wish you to see how ugly the Russian
- >> texts printed on American desktop publishing systems with
- >> "few characters added" are.
- >>
- >> In reality it means that Unicode is not a solution for
- >> typesetting.
- >
- >No, you're right; neither is it a standard for the production of
- >pipefittings or the design of urban transportation systems.
-
- You somehow forgot that an increasing number of texts get printed at
- typographical quality, with everything that entails.
- Ever seen a laser printer?
-
- >Forgetting for the moment that worrying about the output mechanism for
- >such a document before worrying about the input mechanism whereby such
- >a document can be created, The Unicode 1.0 standard (in section 2.1)
- >clearly makes a distinction between "Plain" and "Fancy" text:
-
- I see no reason why we should treat regular expression matching
- as a "fancy" feature.
-
- >Clearly, then, the applications you are describing are *not* Unicode
- >applications, but "Fancy text" applications which could potentially
- >make use of Unicode for character storage.
-
- Don't you think that ANY text is going to be fancy, because Unicode
- as it stands does not provide adequate means for trivial operations?
-
- >This is, incidentally, the resolution of the Chinese/Japanese/Korean
- >unification arguments.
-
- I might as well supply every text with a font file. It is not a solution
- at all.
-
- >This would be Runic encoding, right?
-
- Exactly.
-
- >I can post the Plan-9 and Metis
- >mechanisms for doing this, if you want.
-
- Thank you, i already expressed my opinion on Plan 9 UTF in comp.os.research.
- I also do not think it's exciting. There are much more efficient runic
- encodings (my toy OS uses 7 bits per byte and the 8th bit as a continuation
- indicator).
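
The scheme just mentioned (7 payload bits per byte, 8th bit as a continuation flag) can be sketched as follows. The post does not say in which order the 7-bit groups are stored, so most-significant-group-first is an assumption here, and `encode_rune` / `decode_runes` are hypothetical names rather than anything from the OS in question.

```python
def encode_rune(code):
    """Encode a non-negative integer as 7-bit groups, most significant first.

    The 8th (high) bit is set on every byte except the last one,
    meaning "more bytes follow".
    """
    if code < 0:
        raise ValueError("code must be non-negative")
    groups = [code & 0x7F]
    code >>= 7
    while code:
        groups.append(code & 0x7F)
        code >>= 7
    groups.reverse()
    return bytes([g | 0x80 for g in groups[:-1]] + [groups[-1]])


def decode_runes(data):
    """Decode concatenated encode_rune outputs back into a list of integers."""
    out = []
    value = 0
    pending = False  # True while we are in the middle of a multi-byte code
    for byte in data:
        value = (value << 7) | (byte & 0x7F)
        if byte & 0x80:
            pending = True
        else:
            out.append(value)
            value = 0
            pending = False
    if pending:
        raise ValueError("truncated sequence")
    return out


if __name__ == "__main__":
    codes = [0x41, 0x442, 0x4E2D]
    packed = b"".join(encode_rune(c) for c in codes)
    print(packed.hex(), decode_runes(packed) == codes)
```

Codes below 128 take one byte, codes below 16384 take two, and so on, so 7-bit ASCII text is stored unchanged under this assumption.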
-
- >Since the 386BSD file system works on byte boundaries, I can't believe
- >you were suggesting direct 10-bit encoding of characters, right?
-
- 10 bits are nothing more than an example.
-
- >I don't see many multinational applications or standards coming out
- >of Zambia or elsewhere (to point out the fact that they have to come
- >from somewhere, and the US is as good as any place else). The fact
- >that much of Unicode is based on ISO standards, and ISO-10646 encompasses
- >all of Unicode, means that there is more than US support and input on
- >the standard.
-
- Pretty soon it will be a dead standard because of the logical problems
- in the design. Voting is an inadequate replacement for logic, you know.
- I'd rather stick to a good standard from Zambia than to the brain-dead
- creature of ISO, even if every petty bureaucrat voted for it.
-
- >I am willing to listen to arguments for any accepted or draft standards
- >you care to put forward.
- >
- >Arguments *against* proposals are well and good, as long as the constructive
- >criticism is accompanied by constructive suggestions.
-
- I expressed my point of view (and proposed some kind of solution) in
- comp.std.internat, where the discussion belongs. I'd like you to
- see the problem not as an exercise in wresting consensus from an
- international body but as a mathematical problem. From the logical
- point of view the solution is simply incorrect, and no standards committee
- can vote out that small fact. The fundamental assumption Unicode is
- based upon (i.e. one glyph - one code) makes the whole construction
- illogical and it, unfortunately, cannot be mended without a serious
- redesign of the whole thing.
-
- Try to understand the argument i used earlier in this letter about the
- redundancy of an encoding when external restrictions are provided. The
- Unicode committee really got caught in a logical trap, and it's a pity
- few people realize that.
-
- --vadim
-