NetNews Usenet Archive 1992 #31

Internet Message Format | 1993-01-01 | 5.2 KB

Path: sparky!uunet!cs.utexas.edu!qt.cs.utexas.edu!yale.edu!yale!mintaka.lcs.mit.edu!ai-lab!wheat-chex!glenn
From: glenn@wheat-chex.ai.mit.edu (Glenn A. Adams)
Newsgroups: comp.std.internat
Subject: Re: Dumb Americans (was INTERNATIONALIZATION: JAPAN, FAR EAST)
Date: 1 Jan 1993 06:33:06 GMT
Organization: MIT Artificial Intelligence Laboratory
Lines: 91
Message-ID: <1i0oj2INNp4v@life.ai.mit.edu>
References: <1hu9v5INNbp1@rodan.UU.NET> <8490@charon.cwi.nl> <1hvu79INN4qf@rodan.UU.NET>
NNTP-Posting-Host: wheat-chex.ai.mit.edu
Keywords: ISO10646 Unicode

In article <1hvu79INN4qf@rodan.UU.NET> avg@rodan.UU.NET (Vadim Antonov) writes:
>If a combination of letters is treated as a letter IT IS A LETTER.

So, are <qu> and <ch> letters in English? Is <a-e> a discontiguous letter in English 'take'? You are grossly oversimplifying the process of determining what the graphemes of a writing system are, how its users perceive it, and how best to encode it as information.

>The idea of visual encoding (and one letter-one glyph is nothing more
>than a compressed image of the text) is simply wrong...

Where did you get the idea that 10646 is a "visual encoding"? Sure, 10646 contains some glyph-like encodings. But they are there only to satisfy compatibility goals. Nobody is recommending their use.

>10646 was meant as an encoding eliminating the necessity to carry off-text
>information (which is not a piece of cake, especially in multi-lingual
>texts).

This is just complete nonsense. I don't know who you've been talking with, but whoever it was, they certainly don't know much about 10646 or Unicode.

Unicode (and the 10646 BMP by extension) is oriented around encoding the minimum content that allows for minimally legible display. This is already a well-known and well-understood model for text: ASCII, EBCDIC, etc. I don't know of a single ASCII-only encoding that tells me what language the text is in, what sort order to use, what font to display it with, etc.
Why should Unicode change this? Nobody (who understands Unicode) ever claimed that it could solve all text processing problems without extra information. Indeed, Unicode explicitly does not specify this additional information, for good engineering reasons. Consider for a moment why this was a good idea: (1) Unicode fits the ASCII model extremely well; (2) Unicode explicitly supports higher-level protocols just as ASCII does, e.g., escape sequences; (3) the designers of Unicode recognized that an essentially unbounded amount of additional information may be useful for various text processes, various system platforms, etc.; (4) obtaining a consensus in ISO on a universal set of characters is an enormous problem -- expanding the goals to solve all the problems of multi-lingual text processing would have doomed the effort from the beginning.

>Take a life, guys. We in Russia did that mistake (DKOI and "GOST" encodings)
>many years ago and came to realize that this solution is too simple to
>be correct.

Good engineers know that building solutions to complex problems requires dividing them into simpler sub-problems. Good engineers also recognize that many complex problems do not have a single optimal solution, but many sub-optimal solutions. The designers of Unicode and 10646 recognized from considerable experience in the field that the biggest problem was the proliferation of incomplete, inadequate character sets. Creating a single character set that could correct this increasingly problematic state of affairs was the single most important goal of Unicode's design. Doing this task efficiently and with some measure of compatibility, both for existing data and existing software, were also important goals for its design.

What Unicode was not designed to do is extremely important for you and others to know. It was not designed to solve the "multi-lingual text processing" problem.
Indeed, I would challenge you to create a single character set which, in and of itself, solves this problem, and which could pass through the existing standards bodies to become an international standard. Your radical optimism about the possibility of doing so leads me to believe that you really have little experience in this field. [Not to say that others, including myself, didn't start out with a similar ungrounded optimism.]

If you truly want to advance the state of the art in multi-lingual text processing, you would be much better off considering how to begin using Unicode (10646) with all of its intentional, designed-in limitations, rather than incorrectly attributing to 10646 the goal of being a panacea and then using the reality of its limitations to shoot down your own misattribution.

If you take the time to look at the facts, you will find that (1) Unicode was designed by a truly global community, and not a US-centric one as has been wrongly claimed; (2) Unicode and 10646 continue to solicit ideas and aid from persons who have useful contributions to make; and (3) Unicode (10646) provides an adequate foundation on which complete solutions to the problems of multi-lingual text processing can be constructed.

If you have a genuine interest in learning about the facts surrounding Unicode and 10646, I would recommend a good reading of the Unicode Standard and the Proceedings of the Unicode Implementor's Workshops.

Glenn Adams
Cambridge, Massachusetts