NetNews Usenet Archive 1992 #31

Internet Message Format | 1992-12-31 | 4.0 KB

Path: sparky!uunet!usc!cs.utexas.edu!qt.cs.utexas.edu!yale.edu!yale!mintaka.lcs.mit.edu!ai-lab!wheat-chex!glenn
From: glenn@wheat-chex.ai.mit.edu (Glenn A. Adams)
Newsgroups: comp.std.internat
Subject: Re: An alternative I18N paradigm
Date: 31 Dec 1992 08:11:21 GMT
Organization: MIT Artificial Intelligence Laboratory
Lines: 63
Message-ID: <1hu9v9INN923@life.ai.mit.edu>
References: <1hkff3EINN5uv@uni-erlangen.de> <1hncs1INN1qq@corax.udac.uu.se> <DAN.92Dec29102634@dan.watson.ibm.com>
NNTP-Posting-Host: wheat-chex.ai.mit.edu

The real problem, at least as I see it, is that the locale model doesn't distinguish between the consumer of information and the producer of information. It naively assumes that an individual end user must choose the manner in which (linguistically and culturally sensitive) information is to be presented, and that this choice can be determined by one fixed-value parameter (i.e., the locale setting).

The first assumption quite poorly models the real world of text; for here the author or editor is usually responsible for the presentation of the information, including the written form of that information. [We are not yet at the stage where a system can automatically translate arbitrary text to the written form preferred by the consumer of text.]

The second assumption fails miserably in the real world of text, where many languages are intermixed, sometimes in a single writing system, sometimes in a single document employing multiple writing systems.

What should first be done is an analysis of the producer and potential consumers of information. If the producer is the OS or a local utility, e.g., /bin/date, and the consumer is a single individual who prefers reading dates and times in a particular way, and this way can be characterized by a single (possibly complex) parameter, then the locale model will work.
However, if the producer is another individual, then the locale probably should be ignored, and information contained in the text itself should be consulted about matters such as character set encoding, font(s), language tag(s), or other presentation information. In this case, the locale may be of help only when such explicit information (encoding, font, etc.) is absent, and here, it may produce complete garbage if it makes the wrong assumptions.

One may ask where 10646/Unicode fits into all of this. It provides only a small part of the solution; namely, a single universal character set encoding rather than many non-universal encodings. Of course, one might propose to solve a small part of the problem by using 10646: use 10646 for all encoded text. However, as has been pointed out by Ohta-san and others, this doesn't solve many other problems, e.g., whether to use a Chinese or a Japanese font to display a given Han character. So more is needed still.

I, for one, do not believe that 10646 will become universally used in a fortnight. Local and other standard encodings will continue to exist, probably forever. So we need to start doing one thing very quickly: tagging character data as to its encoding.

Another thing we need to do is add language (or writing system) tags to texts which mix multiple languages. Alternatively, this could be done by tagging font runs and then associating languages with those runs [I do not advocate this method - I prefer explicit language or writing system tags]. Other kinds of tags might be necessary for certain types of processing, e.g., yomi (phonetic reading) tags for allowing the display of furigana, sort keys for allowing producer-specified sorting behavior, and so forth.

10646 will not even address any of these matters. However, Unicode may do so in the form of implementation guidelines or further work on I18N.
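To make the tagging idea concrete, here is a minimal sketch in Python of text carried as runs with producer-supplied encoding and language tags, where the consumer's locale is consulted only when a tag is absent. The names TaggedRun, decode_run, and decode_document are hypothetical and are not taken from any standard or existing system; the encoding names are simply those of Python's built-in codecs.

```python
from dataclasses import dataclass
from typing import List, Optional, Tuple


@dataclass
class TaggedRun:
    """A run of encoded text plus optional producer tags (hypothetical structure)."""
    data: bytes
    encoding: Optional[str] = None  # e.g. "euc_jp"; None means the producer gave no tag
    language: Optional[str] = None  # e.g. "ja"; None means the producer gave no tag


def decode_run(run: TaggedRun, locale_encoding: str,
               locale_language: str) -> Tuple[str, str]:
    """Decode one run, preferring the producer's tags over the consumer's locale."""
    encoding = run.encoding or locale_encoding  # locale is only a fallback
    language = run.language or locale_language
    return run.data.decode(encoding), language


def decode_document(runs: List[TaggedRun], locale_encoding: str,
                    locale_language: str) -> List[Tuple[str, str]]:
    """Decode every run of a mixed-language document independently."""
    return [decode_run(run, locale_encoding, locale_language) for run in runs]


# A document mixing German (Latin-1), Japanese (EUC-JP), and one untagged run.
german = "Gr\u00fc\u00dfe"          # "Gruesse" written with u-umlaut and sharp s
japanese = "\u65e5\u672c\u8a9e"     # the word for "Japanese language"
document = [
    TaggedRun(german.encode("latin-1"), encoding="latin-1", language="de"),
    TaggedRun(japanese.encode("euc_jp"), encoding="euc_jp", language="ja"),
    TaggedRun(b"plain text"),       # untagged: falls back to the locale
]

# The consumer's locale here is ASCII/English, yet the tagged runs decode correctly.
decoded = decode_document(document, "ascii", "en")
```

If the German run were left untagged and the consumer's locale assumed a different encoding (say, cp437), decode_run would return garbage without raising an error, which is the failure mode described above.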
Nonetheless, I think that it is quite important for many parties to begin implementing such systems so that development of standard tags and tagging systems can proceed. We need prior art and experience in these areas before effective standards can be developed with a reasonable hope of success.

Glenn Adams
Cambridge, Massachusetts