- Path: sparky!uunet!spool.mu.edu!yale.edu!yale!mintaka.lcs.mit.edu!ai-lab!muesli!glenn
- From: glenn@muesli.ai.mit.edu (Glenn A. Adams)
- Newsgroups: comp.std.internat
- Subject: Script Unification [was: Re: Cleanicode]
- Date: 21 Jan 1993 08:58:25 GMT
- Organization: MIT Artificial Intelligence Laboratory
- Lines: 223
- Message-ID: <1jlojhINNqv3@life.ai.mit.edu>
- References: <C138zr.r3@poel.juice.or.jp> <1jiotjINNj5q@life.ai.mit.edu> <2179@blue.cis.pitt.edu>
- NNTP-Posting-Host: muesli.ai.mit.edu
- Keywords: CJK Han Unification L/C/G
-
- In article <2179@blue.cis.pitt.edu> djbpitt+@pitt.edu (David J Birnbaum) writes:
- >In summation, it is clear to me that LGC should not be unified, but I
- >would feel more comfortable if I could articulate what it means to be
- >an autonomous script. Suggestions welcome, as always.
-
- I will try to take a stab at this (and at some of the other related messages
- on this subject). First I would like to define some terms to be used in the
- context of this message:
-
- script
-
- A collection of symbols used by one or more writing systems
- to represent linguistic information; e.g., sound, meaning,
- structure, and so on.
-
- unification
-
- The process of taking the union of two or more collections of symbols
- and unifying similar symbols according to a set of unification criteria
- and/or principles.
-
- unified script
-
- The result produced by performing unification on two or more
- collections of symbols.
-
- unification utility
-
- The degree to which unification increases the efficiency of (1) the
- representation of the written language(s) which employ a unified
- script; and (2) the processing of such representations; e.g., display,
- searching, sorting, indexing, parsing, and all other types of tasks.
-
- Now the problem statement, in two parts:
-
- (1) In the creation of a universal encoding for the representation
- of all written languages, when is it more useful than not to
- unify two or more collections of symbols?
-
- (2) When it is useful to unify two or more collections of symbols,
- which unification criteria or principles result in the highest
- degree of unification utility?
-
- An important point to note here is that I have couched the problem
- not in linguistic or cultural terms, but entirely in engineering (and
- information-theoretic) terms. This point seems to get passed over
- quite often: the business at hand is to create a computer
- encoding of textual information; therefore, the task of the encoding
- designer is (1) to do this in a way that maximizes the economics of the
- situation; namely, the economics of memory, speed, and complexity; and
- (2) to do this in a way which maintains certain degrees of compatibility
- with past practices. Since these goals are not independent, but highly
- interdependent, it is necessary to prioritize their importance.
-
- In the case of Unicode, compatibility is given the highest priority:
- this occasionally degrades the various economies mentioned above, e.g.,
- complexity increases. Just to be clear about what I mean by compatibility,
- two types of compatibility are possible: (1) data compatibility; and (2)
- software compatibility. Past encoding techniques such as ISO2022 have
- given equal priority to these two types of compatibility. In particular,
- ISO2022 supports interoperability with software components designed to
- employ 7-bit and/or 8-bit character coding methodology. At the same time,
- ISO2022 supports interoperability with existing data by directly incorporating
- that data into ISO2022 character strings. In contrast, Unicode was
- designed in such a way as to sacrifice software interoperability in order
- to decrease system complexity. On the other hand, data interoperability
- is given greater priority than the economy of complexity.
-
- [I should mention that ISO2022 8-bit compatibility can be achieved with
- Unicode or arbitrary UCS-2/UCS-4 10646 data by using one of the transformation
- formats, e.g., UTF-1 or UTF-FSS (UTF-2).]
-
- Data interoperability is understood in the design of Unicode as the
- support for one-to-one round trip correspondence between data encoded
- with coded character elements of (important) existing character set
- standards and coded character elements of Unicode. This ensures that
- vast amounts of existing textual data can be translated into Unicode
- and then back to its original representation form without loss of
- information.
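-
- The round-trip requirement can be illustrated with a small sketch (modern
- Python, used here purely for illustration; the principle is independent of
- the programming language):

```python
# A sketch of the round-trip requirement: bytes in an existing character
# set, decoded to Unicode and re-encoded, must reproduce the original
# bytes exactly. ISO 8859-1 is used as the legacy set here.
legacy_bytes = bytes(range(0xA0, 0x100))      # upper half of ISO 8859-1

text = legacy_bytes.decode("iso8859-1")       # legacy encoding -> Unicode
back = text.encode("iso8859-1")               # Unicode -> legacy encoding

assert back == legacy_bytes                   # lossless round trip
```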
-
- This by itself precludes unification of certain collections of symbols
- (e.g., Latin, Cyrillic, and Greek); and, in other cases, introduces
- inefficiencies in processing (e.g., by having to specially process
- distinct encodings which would have otherwise been unified -- examples
- here include fullwidth variants, arabic presentation forms, vertical
- variants, small variants, latin ligatures, han z-variants (stroke
- variants), and so on).
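-
- The processing cost of carrying such compatibility variants is easy to
- make concrete: before searching or comparing, the variants must be folded
- back to their ordinary counterparts. A minimal sketch using the
- compatibility normalization (NFKC) that later versions of Unicode
- standardized; illustrative only:

```python
import unicodedata

# The fullwidth variant U+FF21 and the Latin ligature U+FB01 are encoded
# separately (for round-trip data compatibility), so a search must fold
# them back to their ordinary counterparts before comparing.
fullwidth_a = "\uFF21"   # FULLWIDTH LATIN CAPITAL LETTER A
fi_ligature = "\uFB01"   # LATIN SMALL LIGATURE FI

assert unicodedata.normalize("NFKC", fullwidth_a) == "A"
assert unicodedata.normalize("NFKC", fi_ligature) == "fi"
```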
-
- So, in the case of Latin/Cyrillic/Greek, unification is already not
- possible because of the data compatibility requirement in the design
- of Unicode. However, for the sake of argument, it may be interesting
- to ignore this requirement and ask whether, in its absence, such a
- unification should occur. This
- is where we get to the notion of utility:
-
- (1) what does unification buy?
- (2) is the cost of unification greater than its potential benefits?
-
- In the case of L/C/G, unification doesn't buy much at all. One might
- save perhaps a dozen or so code positions out of 2^16 code positions;
- on the other hand, it makes certain things more complicated: one couldn't
- perform upper/lower case conversion without knowing which written
- language was being represented (an issue discussed at length in this
- newsgroup -- and, by the way, an issue raised on the incorrect assumption
- that Unicode *did* unify L/C/G); one couldn't determine a known, default
- ordering for such a unified script; and so on. Clearly, unifying L/C/G
- doesn't give much, but costs dearly. Furthermore, there is very little
- overlap among the collections of symbols in L/C/G; any unification
- criteria used here would have to radically abstract either formal
- information, functional information, or both form and function. Finally,
- there have been an extremely large number of communities in modern
- times which have developed their written languages around extended
- forms of the Latin and Cyrillic scripts.
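-
- The case-conversion point can be made concrete: because Unicode encodes
- Latin, Cyrillic, and Greek separately, visually identical letters occupy
- distinct code points, and case conversion needs no language tag. A small
- Python sketch, illustrative only:

```python
# Because Latin, Cyrillic, and Greek are not unified, each lowercase
# letter carries its script identity, and uppercasing is unambiguous
# without knowing which written language is being represented.
latin_a     = "a"        # U+0061 LATIN SMALL LETTER A
cyrillic_a  = "\u0430"   # U+0430 CYRILLIC SMALL LETTER A
greek_alpha = "\u03b1"   # U+03B1 GREEK SMALL LETTER ALPHA

assert latin_a.upper()     == "A"        # U+0041
assert cyrillic_a.upper()  == "\u0410"   # CYRILLIC CAPITAL LETTER A
assert greek_alpha.upper() == "\u0391"   # GREEK CAPITAL LETTER ALPHA
```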
-
- Now, contrast the situation with L/C/G with that of Chinese, Japanese,
- and Korean (Vietnamese Chữ Nôm, Xixia, Tangut, and other writing
- systems must be included here also). In the case of Chinese, Japanese,
- and Korean writing systems alone, there is a vast overlap in the
- symbol collections used. Indeed, in considering the 5,000 most common
- symbols used in these writing systems, a significant majority (perhaps
- > 90%) of the elements overlap. In performing the CJK unification
- represented by the Unicode Han elements, over 100,000 characters of
- different character sets were unified into 20,902 elements; this was
- accomplished using a simple set of verifiable unification criteria
- and principles which were jointly proposed and adopted by delegations
- from China, Japan, Korea, Taiwan, and the USA. Obviously, the savings
- in terms of code point allocations was enormous. Other very important
- benefits can also be derived from this unification; e.g., symbols which
- have the same or highly similar forms (shape) and which, at the same time,
- have the same or quite similar functions (meaning) are consistently
- unified so as to further the needs of text processing tasks such as
- searching, simple display, simple sorting, and so forth. While the
- benefits of CJK unification are clearly high, the associated cost is
- very low: no important (simple) text processing task is affected.
- Of course it is true that CJK unification does have certain costs,
- e.g., different implicit sort orders cannot be maintained without
- language tags, minor distinctions in the glyphic representation of
- CJK character data cannot be made without language tags, and so
- forth. However, and this is important to consider, such distinctions
- are not maintained by character set standards practices for other
- scripts either: the English, German, French, and Spanish alphabets,
- all distinct in their ordering rules, all potentially requiring slightly
- different glyphic displays, are not encoded as distinct elements
- of a standard like ISO8859-1, but instead are unified into a single
- collection of symbols irrespective of those symbols' usage in these
- different alphabets. Even though such a unification is universally
- employed by all character sets which incorporate the symbols of these
- alphabets, the important text processes can still be quite adequately
- performed: searching, sorting, case-folding, script-based default
- sorting, simple display, etc.
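-
- A sketch of this point: the same code point for 'ä' is shared by German
- and Swedish, yet each language orders it differently. The collation rules
- below are simplified stand-ins for the real national orderings, invented
- here purely for illustration:

```python
# The single code point U+00E4 ('ä') is shared across languages; only the
# language-specific sort key differs, not the encoding of the character.
words = ["zebra", "\u00e4rger", "apfel"]

# German dictionary order treats 'ä' roughly like 'a' (simplified rule)...
german = sorted(words, key=lambda w: w.replace("\u00e4", "a"))

# ...while Swedish places 'ä' after 'z' (simplified alphabet order).
sv_alphabet = "abcdefghijklmnopqrstuvwxyz\u00e5\u00e4\u00f6"
swedish = sorted(words, key=lambda w: [sv_alphabet.index(c) for c in w])

assert german  == ["apfel", "\u00e4rger", "zebra"]
assert swedish == ["apfel", "zebra", "\u00e4rger"]
```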
-
- It is clear to me at least, and to most people familiar with the
- details of many writing systems, that certain unifications tend to
- make more sense than others. L/C/G does not make sense in any
- useful way; on the other hand, CJK unification makes considerable
- amount of sense. It not only greatly reduces the demands on encoding
- space, but it also greatly simplifies certain types of multilingual
- processing in texts which intermix CJK or some combination thereof.
- In the case of CJK unification, the principles and criteria for
- unification were developed primarily by the CJK countries themselves;
- an extraordinary amount of consensus was exhibited in determining
- these criteria. This was possible because there is a great deal
- of appeal in the idea of a CJK unification; were such a unification
- perceived to cover only a small minority of elements, or had it
- made simple text processes substantially harder to construct,
- surely no such agreement would have been reached.
-
- Objections to the CJK unification employed in Unicode tend to be
- based on some combination of (1) a misunderstanding of what principles
- were employed; (2) a misunderstanding of who performed the unification
- and/or who developed the principles for unification; (3) national or
- cultural sentiments which desire to maintain national and/or cultural
- distinctions. The CJK Unification was undertaken and led by delegations
- of the countries most intimately concerned with this task, i.e., China,
- Japan, and Korea; the principles developed for the unification are
- objectively verifiable and repeatable; and, last but not least, Unicode
- and ISO10646 are in the business of defining an adequate and efficient
- representation of the symbols employed by the writing systems of the
- world in a linguistically, culturally, and nationally neutral fashion.
- In producing the Unified Repertoire and Ordering of Han characters,
- the CJK-JRG, Unicode (UTC & Han SC), and ISO10646 (JTC1/SC2/WG2) have
- performed exceptionally well and produced an excellent foundation on
- which to construct multilingual CJK systems.
-
- To summarize this discussion of script unification, the first determining
- factor is a measure of utility: if utility is low and cost is high, why do it?
- If it buys a lot, or if the cost is low, or if it makes sense for reasons
- not explicitly stated above (e.g., historical relatedness), then unification
- probably should be considered. If unification is to be performed, then
- criteria for unification need to be specified for the process to proceed.
- The precedent of CJK unification and the principles and criteria developed
- there together inform any future unification efforts. The primary principles
- employed for CJK character unification were:
-
- (1) if separate in source character set, don't unify
- (2) if no historical relationship (not cognates), then don't unify
- (3) if distinct abstract forms, then don't unify
-
- Another unification exclusion rule that may be considered, which wasn't
- explicitly part of the CJK rules (because it didn't apply):
-
- (4) if simple text processes (e.g., case conversion, default per-script
- sorting, simple display, etc.) become impossible without language
- tags (or writing system tags), then don't unify
-
- If none of these exclusion rules apply, then unify. If, after performing
- this procedure on a set of candidate symbol collections, a minority of
- elements were unified, then consider encoding the symbol collections
- separately (i.e., don't unify).
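-
- The procedure above can be sketched as a small program. The data model
- (the dictionary fields) and the sample symbols are invented for
- illustration, and rule (4) is crudely approximated by a script check;
- this is a sketch of the decision logic, not an implementation of the
- actual CJK-JRG process:

```python
def should_unify(a, b):
    """True if none of exclusion rules (1)-(4) blocks unifying a and b."""
    if a["source_set"] == b["source_set"]:
        return False   # (1) encoded separately in the same source set
    if a["etymon"] != b["etymon"]:
        return False   # (2) no historical relationship (not cognates)
    if a["abstract_form"] != b["abstract_form"]:
        return False   # (3) distinct abstract forms
    if a["script"] != b["script"]:
        return False   # (4) crude stand-in: cross-script merges break
                       #     simple processes like case conversion
    return True

# Two source encodings of one Han character, same abstract form:
jis = {"source_set": "JIS X 0208", "etymon": "HAN-X",
       "abstract_form": "X", "script": "Han"}
gb  = {"source_set": "GB 2312", "etymon": "HAN-X",
       "abstract_form": "X", "script": "Han"}

# Latin 'A' and Greek Alpha: historically related, similar shape,
# but unifying them would break simple text processes (rule 4):
lat = {"source_set": "ISO 8859-1", "etymon": "ALPHA",
       "abstract_form": "A", "script": "Latin"}
grk = {"source_set": "ISO 8859-7", "etymon": "ALPHA",
       "abstract_form": "A", "script": "Greek"}

assert should_unify(jis, gb)        # unify: no exclusion rule applies
assert not should_unify(lat, grk)   # don't unify: rule (4) applies
```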
-
- I believe the above description of unification fairly captures the way
- that script unification is viewed by the designers of Unicode, and by
- the authors of the unification methodology employed in the CJK unification
- in Unicode and ISO10646. I would anticipate using something very close
- to this procedure in the future for candidate symbol collections. A
- couple of examples of where unifications might be considered: the
- different collections of Runes, Burmese and Shan, etc.
-
- The emphasis of unification in the context of character encoding should
- be on utility and economy -- engineering considerations -- and not on
- theoretical purity or cultural demands.
-
- Glenn Adams
-