NetNews Usenet Archive 1992 #31

Internet Message Format | 1993-01-01 | 9.3 KB

Path: sparky!uunet!not-for-mail
From: avg@rodan.UU.NET (Vadim Antonov)
Newsgroups: comp.std.internat
Subject: Re: Dumb Americans (was INTERNATIONALIZATION: JAPAN, FAR EAST)
Date: 1 Jan 1993 04:45:31 -0500
Organization: UUNET Technologies Inc, Falls Church, VA
Lines: 212
Message-ID: <1i13rrINNars@rodan.UU.NET>
References: <8490@charon.cwi.nl> <1hvu79INN4qf@rodan.UU.NET> <1i0oj2INNp4v@life.ai.mit.edu>
NNTP-Posting-Host: rodan.uu.net
Keywords: ISO10646 Unicode

In article <1i0oj2INNp4v@life.ai.mit.edu> glenn@wheat-chex.ai.mit.edu (Glenn A. Adams) writes:
>In article <1hvu79INN4qf@rodan.UU.NET> avg@rodan.UU.NET (Vadim Antonov) writes:
>>If a combination of letters is treated as a letter IT IS A LETTER.
>So, are <qu> and <ch> letters in English?

We were talking about lexicographical sorting, not about phonetics.
Your argument is irrelevant.

>>The idea of visual encoding (and one letter-one glyph is nothing more
>>than a compressed image of the text) is simply wrong...
>
>Where did you get the idea that 10646 is a "visual encoding"? Sure
>10646 contains some glyphic like encodings. But they are there not
>only to satisfy compatibility goals. Nobody is recommending their
>use.

Isn't it easy to comprehend that an ASCII/Unicode/whatever representation
of text is a form of compression of a graphical image from a particular
class of images? I thought everybody who ever bothered to read Shannon
knows it.

>>10646 was meant as an encoding eliminating the necessity to carry off-text
>>information (which is not a piece of cake, especially in multi-lingual
>>texts).
>This is just complete nonsense. I don't know who you've been talking
>with, but whoever it was, they certainly don't know much about 10646
>or Unicode. Unicode (and the 10646 BMP by extension) is oriented around
>encoding the minimum content that allows for minimally legible display.

Then you KNOW that it is a compressed graphical format -- which is
essentially useless for anything except storing and then reproducing
the text. What makes encoded text useful is that its encoding extracts
some SEMANTICS, allowing for mechanical processing (particularly sorting).

The semantics in ASCII are hard-coded -- they are the order of letters and
the trivial upper-case to lower-case conversion. Unfortunately, the move
to abolish the last traces of semantics and make it a PURELY graphical
format destroyed the usefulness of such an encoding for data processing.

>This is already a well known and understood model for text: ASCII, EBCDIC,
>etc. I don't know a single ASCII-only encoding that tells me what
>language it is, what correct sort order to use, what font to display it
>with, etc.

ASCII is for English, period. And as I see it, the ASCII model is not
really "well known and understood", at least not by my respectable opponent.

>Why should Unicode change this?

I wonder. It CHANGED some fundamental properties of ASCII -- it does not
have upper-case/lower-case symmetry, and arithmetic order is no longer
equivalent to lexicographical order (and does not even allow for a trivial
comparison algorithm). "Why should Unicode change this?" You should know
better.

>Nobody (who understands Unicode) ever claimed that it could solve all
>text processing problems without extra information.

Look -- if a program has that "extra information", it already knows the
small alphabet of a particular language, and the deliberate Unicode stuff
is nothing more than wasted bits.

>Indeed, Unicode
>explicitly does not specify this additional information for good
>engineering reasons.

Now the complete lack of logic is called "good engineering reasons".

>Consider for a moment why this was a good idea: (1) Unicode fits the
>ASCII model extremely well;

Wrong. See above.
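
The two ASCII properties named above (case symmetry, and numeric order
matching alphabetical order) can be sketched in a few lines of Python.
This is only an illustration under stated assumptions: ascii_swapcase,
ALPHABET and russian_key are names made up for the sketch, and the explicit
alphabet table stands in for whatever language-specific information a real
sorting routine would carry.

```python
def ascii_swapcase(ch):
    """Flip the case of one ASCII letter by toggling bit 0x20."""
    if ch.isascii() and ch.isalpha():
        return chr(ord(ch) ^ 0x20)
    return ch  # non-ASCII letters are left alone: the bit trick does not apply


# In ASCII, within one case, numeric order is alphabetical order.
english_sorted = sorted(["pear", "apple", "fig"])

# In Unicode the offset between upper and lower case is not uniform:
# 0x20 for most Cyrillic letters, but 0x50 for the pair U+0401 / U+0451.
offset_a = ord("а") - ord("А")    # Cyrillic a
offset_yo = ord("ё") - ord("Ё")   # Cyrillic yo

# U+0401 (Ё) is numerically below U+0410 (А), so plain code point sorting
# puts "Ёлка" first, although Ё follows Е in the Russian alphabet.
russian = ["Жук", "Ёлка", "Дом"]
by_code_point = sorted(russian)

# Hypothetical collation key: rank of each letter in an explicit alphabet.
ALPHABET = "АБВГДЕЁЖЗИЙКЛМНОПРСТУФХЦЧШЩЪЫЬЭЮЯ"
RANK = {letter: i for i, letter in enumerate(ALPHABET)}


def russian_key(word):
    """Map a word to a list of alphabet ranks; unknown characters sort last."""
    return [RANK.get(c.upper(), len(ALPHABET)) for c in word]


by_alphabet = sorted(russian, key=russian_key)
```

Comparing by_code_point with by_alphabet shows the gap: the first needs no
table at all, the second needs the "extra information" the quoted text
refers to.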

>(2) Unicode explicitly supports higher
>level protocols just like ASCII, e.g., escape sequences;

OK, though it's nothing more than "there are 32 non-graphic characters".

>(3) the
>designers of Unicode recognized that an essentially unbounded amount
>of additional information may be useful for various text processes,
>various system platforms, etc.;

And threw out the baby with the bathwater.

>(4) obtaining a consensus in ISO on
>a universal set of characters is an enormous problem -- expanding the
>goals to solve all the problems of multi-lingual text processing
>would have doomed the effort from the beginning.

That has nothing to do with the final product. I have seen enough ISO
standards that we would all be much better off without.
Note that ASCII is an _American_ standard.

>Good engineers know that building solutions to complex problems require
>dividing it into simpler sub-problem. Good engineers also recognize
>that many complex problems do not have a single optimal solution, but many
>sub-optimal solutions.

Good engineers usually think before writing standards.

>The designers of Unicode and 10646 recognized
>from considerable experience in the field that the biggest problem was
>the proliferation of incomplete, inadequate character sets. Creating
>a single character set that could correct this rapidly problematic
>state of affairs was the single and most important goal of Unicode's
>design. Doing this task efficiently and with some measure of compatibility,
>both for existing data and existing software, were also important
>goals for its design.

If that was THE goal, the set of national standard encodings (which have
already been in place for decades) is quite adequate. The only action
needed was to assign them "standard numbers" (as in Internet protocols)
and to define minimal subsets. However, the result is nothing more than
one more inefficient encoding, leaving all the old problems exactly where
they were.
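
One way to picture the "standard numbers" idea is text stored as runs, each
tagged with the number of the national encoding it uses. The Python sketch
below is an assumption about how that might look, not a description of any
existing scheme: the registry numbers and the decode_runs helper are
invented for illustration, and only the codec names are real Python codecs.

```python
# Hypothetical registry. The numbers are invented for this sketch and are
# not taken from any standard; the codec names are Python's own.
CHARSET_BY_NUMBER = {
    1: "ascii",
    2: "koi8_r",
    3: "iso8859_5",
    4: "latin_1",
}


def decode_runs(runs):
    """Decode a list of (charset number, bytes) runs into one string."""
    return "".join(data.decode(CHARSET_BY_NUMBER[num]) for num, data in runs)


# A bilingual line: each run stays in its compact 8-bit national encoding.
runs = [
    (1, b"peace: "),
    (2, "мир".encode("koi8_r")),
    (1, b", "),
    (4, "señor".encode("latin_1")),
]
text = decode_runs(runs)
cyrillic_bytes = len(runs[1][1])  # one byte per Cyrillic letter
```

Each run keeps one byte per letter, and the charset number is exactly the
kind of off-text information that the discussion above is arguing about.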

>What Unicode was not designed to do is extremely important for you and
>others to know. It was not designed to solve the "multi-lingual text
>processing" problem.

Hear, hear! Then if it was not designed to do 1) multi-lingual 2) text
3) processing, what the hell was it designed for? Monolingual has been
around since Morse; non-text is obviously not the case. Aha! It was not
meant to allow *processing* of text. I'd like everybody to understand that
and drop their hopes for a sensible standard in the future.

>Indeed, I would challenge you to create a single
>character set, which, in and of itself, solves this problem,

See my previous postings. The idea is trivial as it is, and surely not new.

>and which
>could pass through existing standard bodies to become an international
>standard.

Now we have got to the root of the problem.

>Your radical optimism about the possibility of doing so
>leads me to believe that you really have little experience in this field.

Somehow I introduced my "own" Cyrillic encoding, and it lived (and still
lives) and was used by a lot of people (it became obsolete at nearly the
time it was approved as a national standard). The company I worked for
before managed to create a totally bilingual family of Unix-like systems --
something unheard of in the West. I was in a leading position in a company
which built what is now the largest European e-mail network -- and it IS
8-bit transparent and bilingual too. I dare say that I have real-life
experience with internationalization issues.

Somehow, people in my company have a different attitude -- they do real
things, leaving bureaucracy to bureaucrats.

>[Not to say that others, including myself, didn't start out with a similar
>ungrounded optimism.]

I always thought that the Communist USSR was a stronghold of Kafkaesque
bureaucracy. It turns out that the so-called "free world" is much worse.
I'm starting to think about joining the Communist Party :-)

>If you want to truly bring forward the state of the art in multi-lingual
>text processing, you would be much better off to consider how to begin
>using Unicode (10646) with all of its intentional, designed-in limitations,
>rather than incorrectly attributing to 10646 a goal of panacea, then using
>the reality of its limitations to shoot down your misattribution.

I'm not going to start using Unicode because I know the underwater boulders
you haven't yet discovered -- exactly because we did similar (though
limited) things before.

>If you
>take the time to look at the facts, you will find that (1) Unicode was
>designed by a truly global community, and not a USCentric one as has been
>wrongly claimed;

I don't care who grew an apple if it is good enough. [Need I remind you
that the road to hell is paved with good intentions?]

>(2) Unicode and 10646 continues to solicit ideas
>and aid from persons who have useful contributions to make;

You got my free advice -- drop the idea of a single code per single glyph
and restore the alphabetical ordering. Otherwise the whole enterprise makes
no sense. I'm tired of reiterating the simple arguments.

>and, (3)
>Unicode (10646) provides an adequate foundation on which complete
>solutions to the problems of multi-lingual text processing can be
>constructed.

Wrong. See the detailed explanation in my previous postings.

>If you have a genuine interest in learning about the facts surrounding
>Unicode and 10646, I would recommend a good reading of the Unicode Standard
>and the Proceedings of the Unicode Implementor's Workshops.

Thank you, but I've got no time to read things I don't need. I already had
my share of meetings in the U.S.S.R. and have sworn to avoid them like the
plague. That a million lemmings drowned themselves does not make me feel
suicidal.

--vadim