- Path: sparky!uunet!not-for-mail
- From: avg@rodan.UU.NET (Vadim Antonov)
- Newsgroups: comp.std.internat
- Subject: Re: Dumb Americans (was INTERNATIONALIZATION: JAPAN, FAR EAST)
- Date: 1 Jan 1993 04:45:31 -0500
- Organization: UUNET Technologies Inc, Falls Church, VA
- Lines: 212
- Message-ID: <1i13rrINNars@rodan.UU.NET>
- References: <8490@charon.cwi.nl> <1hvu79INN4qf@rodan.UU.NET> <1i0oj2INNp4v@life.ai.mit.edu>
- NNTP-Posting-Host: rodan.uu.net
- Keywords: ISO10646 Unicode
-
- In article <1i0oj2INNp4v@life.ai.mit.edu> glenn@wheat-chex.ai.mit.edu (Glenn A. Adams) writes:
- >In article <1hvu79INN4qf@rodan.UU.NET> avg@rodan.UU.NET (Vadim Antonov) writes:
- >>If a combination of letters is treated as a letter IT IS A LETTER.
- >So, are <qu> and <ch> letters in English?
-
- We were talking about lexicographical sorting, not about phonetics.
- Your argument is irrelevant.
-
- >>The idea of visual encoding (and one letter-one glyph is nothing more
- >>than a compressed image of the text) is simply wrong...
- >
- >Where did you get the idea that 10646 is a "visual encoding"? Sure
- >10646 contains some glyphic like encodings. But they are there not
- >only to satisfy compatibility goals. Nobody is recommending their
- >use.
-
- Isn't it easy to comprehend that ASCII/Unicode/whatever representation
- of text is a form of compression of a graphical image from a particular
- class of images? I thought everybody who ever bothered to read Shannon
- knows it.
-
- >>10646 was meant as an encoding eliminating the necessity to carry off-text
- >>information (which is not a piece of cake, especially in multi-lingual
- >>texts).
-
- >This is just complete nonsense. I don't know who you've been talking
- >with, but whoever it was, they certainly don't know much about 10646
- >or Unicode. Unicode (and the 10646 BMP by extension) is oriented around
- >encoding the minimum content that allows for minimally legible display.
-
- Then you KNOW that it is a compressed graphical format -- which is
- essentially useless for anything except storing and then reproducing
- the text.
-
- What makes encoded text useful is that its encoding extracts
- some SEMANTICS allowing for mechanical processing (particularly sorting).
- The semantics in ASCII are hard-coded -- they are the order of letters
- and the trivial upper-case to lower-case conversion.
- Unfortunately the move to abolish the last traces of semantics and
- make it a PURELY graphical format destroyed the usefulness of such
- an encoding for data processing.
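
The two ASCII properties named above (letter order and trivial case conversion) can be shown in a few lines. This is only a minimal sketch in Python: the helper names `ascii_toggle_case` and `ascii_sort` are invented for illustration and come from nowhere in this thread.

```python
def ascii_toggle_case(ch: str) -> str:
    """Flip the case of one ASCII letter by toggling bit 0x20.

    Non-letters and non-ASCII characters are returned unchanged.
    """
    if ch.isascii() and ch.isalpha():
        return chr(ord(ch) ^ 0x20)
    return ch


def ascii_sort(words):
    """Sort words by their raw character codes.

    For words written in a single case of ASCII letters, this numeric
    order coincides with English alphabetical order.
    """
    return sorted(words, key=lambda w: [ord(c) for c in w])


if __name__ == "__main__":
    print(ascii_toggle_case("a"), ascii_toggle_case("Z"))
    print(ascii_sort(["pear", "apple", "fig"]))
```

Both operations need nothing beyond the code values themselves, which is the sense in which the semantics are "hard-coded" in the encoding.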
-
- >This is already a well known and understood model for text: ASCII, EBCDIC,
- >etc. I don't know a single ASCII-only encoding that tells me what
- >language it is, what correct sort order to use, what font to display it
- >with, etc.
-
- ASCII is for English, period.
- And as i see it, the ASCII model is not really "well known and understood",
- at least not by my respectable opponent.
-
- >Why should Unicode change this?
-
- I wonder. It CHANGED some fundamental properties of ASCII -- it
- does not have upper-case/lower-case symmetry, and arithmetic order
- is no longer equivalent to lexicographical order (it does not even
- allow for a trivial comparison algorithm).
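
A rough illustration of the two properties said to be lost, again as a Python sketch under stated assumptions: it uses Python's built-in Unicode strings and current Unicode character data rather than the 1993 drafts, and the helper names `code_point_sort` and `case_offset` are hypothetical.

```python
def code_point_sort(words):
    """Sort by numeric code point order, which is how Python compares str."""
    return sorted(words)


def case_offset(ch: str) -> int:
    """Numeric distance between the lower-case and upper-case forms of ch.

    Only meaningful for characters whose case mappings are one character long.
    """
    return ord(ch.lower()) - ord(ch.upper())


if __name__ == "__main__":
    # "\u00e9clair" starts with e-acute (U+00E9), which is numerically
    # greater than "z" (U+007A), so it lands after "zebra".
    print(code_point_sort(["zebra", "\u00e9clair", "apple"]))

    # ASCII letters all sit 32 apart from their other case; U+00FF
    # (y with diaeresis) maps to U+0178, a different distance.
    print(case_offset("a"), case_offset("\u00ff"))

    # German sharp s (U+00DF) upper-cases to two characters.
    print("\u00df".upper())
```

So plain numeric comparison no longer gives dictionary order, and case conversion needs a lookup table instead of a fixed offset.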
-
- "Why should Unicode change this?" You should know better.
-
- >Nobody (who understands Unicode) ever claimed that it could solve all
- >text processing problems without extra information.
-
- Look - if a program has that "extra information" it already knows
- the small alphabet of a particular language and the deliberate
- Unicode stuff is nothing more than wasted bits.
-
- >Indeed, Unicode
- >explicitly does not specify this additional information for good
- >engineering reasons.
-
- Now, the complete lack of logic is called "good engineering reasons".
-
- >Consider for a moment why this was a good idea: (1) Unicode fits the
- >ASCII model extremely well;
-
- Wrong. See before.
-
- >(2) Unicode explicitly supports higher
- >level protocols just like ASCII, e.g., escape sequences;
-
- Ok, though it's nothing more than "there are 32 non-graphic characters".
-
- >(3) the
- >designers of Unicode recognized that an essentially unbounded amount
- >of additional information may be useful for various text processes,
- >various system platforms, etc.;
-
- And threw out the child together with water.
-
- >(4) obtaining a consensus in ISO on
- >a universal set of characters is an enormous problem -- expanding the
- >goals to solve all the problems of multi-lingual text processing
- >would have doomed the effort from the beginning.
-
- It has nothing to do with the final product. I have seen enough ISO
- standards that we'd all be much better off without.
- Note that ASCII is _American_ standard.
-
- >Good engineers know that building solutions to complex problems requires
- >dividing them into simpler sub-problems. Good engineers also recognize
- >that many complex problems do not have a single optimal solution, but many
- >sub-optimal solutions.
-
- Good engineers usually think before writing standards.
-
- >The designers of Unicode and 10646 recognized
- >from considerable experience in the field that the biggest problem was
- >the proliferation of incomplete, inadequate character sets. Creating
- >a single character set that could correct this rapidly problematic
- >state of affairs was the single and most important goal of Unicode's
- >design. Doing this task efficiently and with some measure of compatibility,
- >both for existing data and existing software, were also important
- >goals for its design.
-
- If that was THE goal, the set of national standard encodings (which have
- already been in place for decades) is quite adequate. The only action needed was
- to assign them "standard numbers" (as in Internet protocols) and to
- define minimal subsets.
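
One way to picture the "standard numbers" idea is text stored as runs, each tagged with the number of its national encoding. The following Python sketch is an assumption-laden illustration only: the registry numbers are invented, `CHARSET_REGISTRY` and `decode_tagged` are hypothetical names, and joining the result into a Python (Unicode) string is merely a convenience of the sketch, not part of the proposal.

```python
# Hypothetical registry: these numbers are made up for illustration and
# do not correspond to any real assignment.
CHARSET_REGISTRY = {
    1: "ascii",
    2: "latin_1",   # ISO 8859-1
    3: "koi8_r",    # a Cyrillic encoding known to Python's codecs
}


def decode_tagged(runs):
    """Decode a sequence of (charset_number, raw_bytes) runs into one string."""
    return "".join(raw.decode(CHARSET_REGISTRY[number]) for number, raw in runs)


if __name__ == "__main__":
    # "\u043c\u0438\u0440" is the Russian word "mir", encoded here as KOI8-R bytes.
    runs = [(1, b"Hello, "), (3, "\u043c\u0438\u0440".encode("koi8_r"))]
    print(decode_tagged(runs))
```

Each run carries the identity of its alphabet with it, so a program that handles a run knows which national ordering and case rules apply.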
-
- However, the result is nothing more than one more inefficient encoding
- leaving all the old problems exactly where they were.
-
- >What Unicode was not designed to do is extremely important for you and
- >others to know. It was not designed to solve the "multi-lingual text
- >processing" problem.
-
- Hear, hear! Then if it was not designed to do
- 1) multi-lingual
- 2) text
- 3) processing
- what the hell was it designed for? Monolingual has been around since Morse,
- non-text - obviously not the case. Aha! It was not meant to allow
- *processing* of text. I'd like everybody to understand that and drop
- any hopes for a sensible standard in the future.
-
- >Indeed, I would challenge you to create a single
- >character set, which, in and of itself, solves this problem,
-
- See my previous postings. The idea is trivial as it is and surely
- not new.
-
- >and which
- >could pass through existing standard bodies to become an international
- >standard.
-
- Now, we got to the root of the problem.
-
- >Your radical optimism about the possibility of doing so
- >leads me to believe that you really have little experience in this field.
-
- Somehow i introduced my "own" cyrillic encoding and it lived (and
- still lives) and was used by a lot of people (it became obsolete nearly
- at the time it was approved as a national standard). The company i worked
- for before managed to create a totally bilingual family of Unix-like systems --
- something unheard of in the West. I was in a leading position in a
- company which built the now-largest European e-mail network -- and
- it IS 8-bit transparent and bilingual too. I dare to say that i have
- real-life experience with internationalization issues. Somehow,
- people in my company have a different attitude -- they do real things, leaving
- bureaucracy to bureaucrats.
-
- >[Not to say that others, including myself, didn't start out with a similar
- >ungrounded optimism.]
-
- I always thought that Communist USSR was a stronghold of Kafkaesque
- bureaucracy. It turns out that the so-called "free world" is much
- worse. I'm starting to think about joining the Communist Party :-)
-
- >If you want to truly bring forward the state of the art in multi-lingual
- >text processing, you would be much better off to consider how to begin
- >using Unicode (10646) with all of its intentional, designed-in limitations,
- >rather than incorrectly attributing to 10646 a goal of panacea, then using
- >the reality of its limitations to shoot down your misattribution.
-
- I'm not going to start using Unicode because i know the underwater
- boulders you haven't yet discovered -- exactly because we did
- similar (though more limited) things before.
-
- >If you
- >take the time to look at the facts, you will find that (1) Unicode was
- >designed by a truly global community, and not a USCentric one as has been
- >wrongly claimed;
-
- I don't care who grew an apple if it is good enough.
- [Need i remind you that the road to hell is paved with good intentions?]
-
- >(2) Unicode and 10646 continues to solicit ideas
- >and aid from persons who have useful contributions to make;
-
- You got my free advice -- drop the idea of single code per single
- glyph and restore the alphabetical ordering. Otherwise the whole
- enterprise makes no sense. I'm tired of reiterating the simple
- arguments.
-
- >and, (3)
- >Unicode (10646) provides an adequate foundation on which complete
- >solutions to the problems of multi-lingual text processing can be
- >constructed.
-
- Wrong. See detailed explanation in my previous postings.
-
- >If you have a genuine interest in learning about the facts surrounding
- >Unicode and 10646, I would recommend a good reading of the Unicode Standard
- >and the Proceedings of the Unicode Implementor's Workshops.
-
- Thank you, i've got no time to read things i don't need.
- I already got my share of meetings in the U.S.S.R. and have sworn to avoid
- them like the plague. That a million lemmings drowned themselves does
- not make me feel like suicide.
-
- --vadim
-