- Path: sparky!uunet!spool.mu.edu!yale.edu!yale!mintaka.lcs.mit.edu!ai-lab!muesli!glenn
- From: glenn@muesli.ai.mit.edu (Glenn A. Adams)
- Newsgroups: comp.std.internat
- Subject: Script Unification [was: Re: Cleanicode]
- Date: 21 Jan 1993 08:58:25 GMT
- Organization: MIT Artificial Intelligence Laboratory
- Lines: 223
- Message-ID: <1jlojhINNqv3@life.ai.mit.edu>
- References: <C138zr.r3@poel.juice.or.jp> <1jiotjINNj5q@life.ai.mit.edu> <2179@blue.cis.pitt.edu>
- NNTP-Posting-Host: muesli.ai.mit.edu
- Keywords: CJK Han Unification L/C/G
-
- In article <2179@blue.cis.pitt.edu> djbpitt+@pitt.edu (David J Birnbaum) writes:
- >In summation, it is clear to me that LGC should not be unified, but I
- >would feel more comfortable if I could articulate what it means to be
- >an autonomous script. Suggestions welcome, as always.
-
- I will try to take a stab at this (and at some of the other related messages
- on this subject). First I would like to define some terms to be used in the
- context of this message:
-
- script
-
- A collection of symbols used by one or more writing systems
- to represent linguistic information; e.g., sound, meaning,
- structure, and so on.
-
- unification
-
- The process of taking the union of two or more collections of symbols
- and unifying similar symbols according to a set of unification criteria
- and/or principles.
-
- unified script
-
- The result produced by performing unification on two or more
- collections of symbols.
-
- unification utility
-
- The degree to which unification increases the efficiency of (1) the
- representation of the written language(s) which employ a unified
- script; and (2) the processing of such representations; e.g., display,
- searching, sorting, indexing, parsing, and all other types of tasks.
-
- Now the problem statement, in two parts:
-
- (1) In the creation of a universal encoding for the representation
- of all written languages, when is it more useful than not to
- unify two or more collections of symbols?
-
- (2) When it is useful to unify two or more collections of symbols,
- which unification criteria or principles result in the highest
- degree of unification utility?
-
- An important point to note here is that I have couched the problem
- not in linguistic or cultural terms, but entirely in engineering (and
- information-theoretic) terms. This point seems to get passed over
- quite often: the business at hand is to create a computer
- encoding of textual information; therefore, the task of the encoding
- designer is (1) to do this in a way that maximizes the economics of the
- situation; namely, the economics of memory, speed, and complexity; and
- (2) to do this in a way which maintains certain degrees of compatibility
- with past practices. Since these goals are not independent, but highly
- interdependent, it is necessary to prioritize their importance.
-
- In the case of Unicode, compatibility is given the highest priority:
- this occasionally degrades the various economies mentioned above, e.g.,
- complexity increases. Just to be clear about what I mean by compatibility,
- two types of compatibility are possible: (1) data compatibility; and (2)
- software compatibility. Past encoding techniques such as ISO2022 have
- given equal priority to these two types of compatibility. In particular,
- ISO2022 supports interoperability with software components designed to
- employ 7-bit and/or 8-bit character coding methodology. At the same time,
- ISO2022 supports interoperability with existing data by directly incorporating
- that data into ISO2022 character strings. In contrast, Unicode was
- designed in such a way as to sacrifice software interoperability in order
- to decrease system complexity. On the other hand, data interoperability
- is given greater priority than the economy of complexity.
-
- [I should mention that ISO2022 8-bit compatibility can be achieved with
- Unicode or arbitrary UCS-2/UCS-4 10646 data by using one of the transformation
- formats, e.g., UTF-1 or UTF-FSS (UTF-2).]
-
- Data interoperability is understood in the design of Unicode as the
- support for one-to-one round trip correspondence between data encoded
- with coded character elements of (important) existing character set
- standards and coded character elements of Unicode. This ensures that
- vast amounts of existing textual data can be translated into Unicode
- and then back to its original representation form without loss of
- information.
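-
- The round-trip requirement can be illustrated with a small sketch (modern
- Python, used here purely for illustration; the principle is independent of
- the programming language):

```python
# A sketch of the round-trip requirement: bytes in an existing character
# set, decoded to Unicode and re-encoded, must reproduce the original
# bytes exactly. ISO 8859-1 is used as the legacy set here.
legacy_bytes = bytes(range(0xA0, 0x100))      # upper half of ISO 8859-1

text = legacy_bytes.decode("iso8859-1")       # legacy encoding -> Unicode
back = text.encode("iso8859-1")               # Unicode -> legacy encoding

assert back == legacy_bytes                   # lossless round trip
```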
-
- This by itself precludes unification of certain collections of symbols
- (e.g., Latin, Cyrillic, and Greek); and, in other cases, introduces
- inefficiencies in processing (e.g., by having to specially process
- distinct encodings which would have otherwise been unified -- examples
- here include fullwidth variants, arabic presentation forms, vertical
- variants, small variants, latin ligatures, han z-variants (stroke
- variants), and so on).
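-
- The processing cost of carrying such compatibility variants is easy to
- make concrete: before searching or comparing, the variants must be folded
- back to their ordinary counterparts. A minimal sketch using the
- compatibility normalization (NFKC) that later versions of Unicode
- standardized; illustrative only:

```python
import unicodedata

# The fullwidth variant U+FF21 and the Latin ligature U+FB01 are encoded
# separately (for round-trip data compatibility), so a search must fold
# them back to their ordinary counterparts before comparing.
fullwidth_a = "\uFF21"   # FULLWIDTH LATIN CAPITAL LETTER A
fi_ligature = "\uFB01"   # LATIN SMALL LIGATURE FI

assert unicodedata.normalize("NFKC", fullwidth_a) == "A"
assert unicodedata.normalize("NFKC", fi_ligature) == "fi"
```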
-
- So, in the case of Latin/Cyrillic/Greek, unification is already not
- possible because of the data compatibility requirement in the design
- of Unicode. However, for the sake of argument, it may be interesting
- to ignore this requirement and ask whether, in its absence, such a
- unification should occur. This
- is where we get to the notion of utility:
-
- (1) what does unification buy?
- (2) is the cost of unification greater than its potential benefits?
-
- In the case of L/C/G, unification doesn't buy much at all. One might
- save perhaps a dozen or so code positions out of 2^16 code positions;
- on the other hand, it makes certain things more complicated: one couldn't
- perform upper/lower case conversion without knowing which written
- language was being represented (an issue discussed at length in this
- newsgroup -- and, by the way, an issue raised on the incorrect assumption
- that Unicode *did* unify L/C/G); one couldn't determine a known, default
- ordering for such a unified script; and so on. Clearly, unifying L/C/G
- doesn't give much, but costs dearly. Furthermore, there is very little
- overlap among the collections of symbols in L/C/G; any unification
- criteria used here would have to radically abstract either formal
- information, functional information, or both form and function. Finally,
- there have been an extremely large number of communities in modern
- times which have developed their written languages around extended
- forms of the Latin and Cyrillic scripts.
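-
- The case-conversion point can be made concrete: because Unicode encodes
- Latin, Cyrillic, and Greek separately, visually identical letters occupy
- distinct code points, and case conversion needs no language tag. A small
- Python sketch, illustrative only:

```python
# Because Latin, Cyrillic, and Greek are not unified, each lowercase
# letter carries its script identity, and uppercasing is unambiguous
# without knowing which written language is being represented.
latin_a     = "a"        # U+0061 LATIN SMALL LETTER A
cyrillic_a  = "\u0430"   # U+0430 CYRILLIC SMALL LETTER A
greek_alpha = "\u03b1"   # U+03B1 GREEK SMALL LETTER ALPHA

assert latin_a.upper()     == "A"        # U+0041
assert cyrillic_a.upper()  == "\u0410"   # CYRILLIC CAPITAL LETTER A
assert greek_alpha.upper() == "\u0391"   # GREEK CAPITAL LETTER ALPHA
```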
-
- Now, contrast the situation with L/C/G with that of Chinese, Japanese,
- and Korean (Vietnamese Chữ Nôm, Xixia, Tangut, and other writing
- systems must be included here also). In the case of Chinese, Japanese,
- and Korean writing systems alone, there is a vast overlap in the
- symbol collections used. Indeed, in considering the 5,000 most common
- symbols used in these writing systems, a significant majority (perhaps
- > 90%) of the elements overlap. In performing the CJK unification
- represented by the Unicode Han elements, over 100,000 characters of
- different character sets were unified into 20,902 elements; this was
- accomplished using a simple set of verifiable unification criteria
- and principles which were jointly proposed and adopted by delegations
- from China, Japan, Korea, Taiwan, and the USA. Obviously, the savings
- in terms of code point allocations was enormous. Other very important
- benefits can also be derived from this unification; e.g., symbols which
- have the same or highly similar forms (shape) and which, at the same time,
- have the same or quite similar functions (meaning) are consistently
- unified so as to further the needs of text processing tasks such as
- searching, simple display, simple sorting, and so forth. While the
- benefits of CJK unification are clearly high, the associated cost is
- very low: no important (simple) text processing task is affected.
- Of course it is true that CJK unification does have certain costs,
- e.g., different implicit sort orders cannot be maintained without
- language tags, minor distinctions in the glyphic representation of
- CJK character data cannot be made without language tags, and so
- forth. However, and this is important to consider, such distinctions
- are not maintained by character set standards practices for other
- scripts either: the English, German, French, and Spanish alphabets,
- all distinct in their ordering rules, all potentially requiring slightly
- different glyphic displays, are not encoded as distinct elements
- of a standard like ISO8859-1, but instead are unified into a single
- collection of symbols irrespective of those symbols' usage in these
- different alphabets. Even though such a unification is universally
- employed by all character sets which incorporate the symbols of these
- alphabets, the important text processes can still be quite adequately
- performed: searching, sorting, case-folding, script-based default
- sorting, simple display, etc.
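-
- A sketch of this point: the same code point for 'ä' is shared by German
- and Swedish, yet each language orders it differently. The collation rules
- below are simplified stand-ins for the real national orderings, invented
- here purely for illustration:

```python
# The single code point U+00E4 ('ä') is shared across languages; only the
# language-specific sort key differs, not the encoding of the character.
words = ["zebra", "\u00e4rger", "apfel"]

# German dictionary order treats 'ä' roughly like 'a' (simplified rule)...
german = sorted(words, key=lambda w: w.replace("\u00e4", "a"))

# ...while Swedish places 'ä' after 'z' (simplified alphabet order).
sv_alphabet = "abcdefghijklmnopqrstuvwxyz\u00e5\u00e4\u00f6"
swedish = sorted(words, key=lambda w: [sv_alphabet.index(c) for c in w])

assert german  == ["apfel", "\u00e4rger", "zebra"]
assert swedish == ["apfel", "zebra", "\u00e4rger"]
```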
-
- It is clear to me at least, and to most people familiar with the
- details of many writing systems, that certain unifications tend to
- make more sense than others. L/C/G does not make sense in any
- useful way; on the other hand, CJK unification makes considerable
- amount of sense. It not only greatly reduces the demands on encoding
- space, but it also greatly simplifies certain types of multilingual
- processing in texts which intermix CJK or some combination thereof.
- In the case of CJK unification, the principles and criteria for
- unification were developed primarily by the CJK countries themselves;
- an extraordinary amount of consensus was exhibited in determining
- these criteria. This was possible because there is a great deal
- of appeal in the idea of a CJK unification; were such a unification
- perceived to cover only a small minority of elements, or had it
- made simple text processes substantially harder to construct,
- surely no such agreement would have been reached.
-
- Objections to the CJK unification employed in Unicode tend to be
- based on some combination of (1) a misunderstanding of what principles
- were employed; (2) a misunderstanding of who performed the unification
- and/or who developed the principles for unification; (3) national or
- cultural sentiments which desire to maintain national and/or cultural
- distinctions. The CJK Unification was undertaken and led by delegations
- of the countries most intimately concerned with this task, i.e., China,
- Japan, and Korea; the principles developed for the unification are
- objectively verifiable and repeatable; and, last but not least, Unicode
- and ISO10646 are in the business of defining an adequate and efficient
- representation of the symbols employed by the writing systems of the
- world in a linguistically, culturally, and nationally neutral fashion.
- In producing the Unified Repertoire and Ordering of Han characters,
- the CJK-JRG, Unicode (UTC & Han SC), and ISO10646 (JTC1/SC2/WG2) have
- performed exceptionally well and produced an excellent foundation on
- which to construct multilingual CJK systems.
-
- To summarize this discussion of script unification, the first determining
- factor is a measure of utility: if utility is low and cost is high, why do it?
- If it buys a lot, or if the cost is low, or if it makes sense for reasons
- not explicitly stated above (e.g., historical relatedness), then unification
- probably should be considered. If unification is to be performed, then
- criteria for unification need to be specified for the process to proceed.
- The precedent of CJK unification and the principles and criteria developed
- there together inform any future unification efforts. The primary principles
- employed for CJK character unification were:
-
- (1) if separate in source character set, don't unify
- (2) if no historical relationship (not cognates), then don't unify
- (3) if distinct abstract forms, then don't unify
-
- Another unification exclusion rule that may be considered, which wasn't
- explicitly part of the CJK rules (because it didn't apply):
-
- (4) if simple text processes (e.g., case conversion, default per-script
- sorting, simple display, etc.) become impossible without language
- tags (or writing system tags), then don't unify
-
- If none of these exclusion rules apply, then unify. If, after performing
- this procedure on a set of candidate symbol collections, a minority of
- elements were unified, then consider encoding the symbol collections
- separately (i.e., don't unify).
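-
- The procedure above can be sketched as a small program. The data model
- (the dictionary fields) and the sample symbols are invented for
- illustration, and rule (4) is crudely approximated by a script check;
- this is a sketch of the decision logic, not an implementation of the
- actual CJK-JRG process:

```python
def should_unify(a, b):
    """True if none of exclusion rules (1)-(4) blocks unifying a and b."""
    if a["source_set"] == b["source_set"]:
        return False   # (1) encoded separately in the same source set
    if a["etymon"] != b["etymon"]:
        return False   # (2) no historical relationship (not cognates)
    if a["abstract_form"] != b["abstract_form"]:
        return False   # (3) distinct abstract forms
    if a["script"] != b["script"]:
        return False   # (4) crude stand-in: cross-script merges break
                       #     simple processes like case conversion
    return True

# Two source encodings of one Han character, same abstract form:
jis = {"source_set": "JIS X 0208", "etymon": "HAN-X",
       "abstract_form": "X", "script": "Han"}
gb  = {"source_set": "GB 2312", "etymon": "HAN-X",
       "abstract_form": "X", "script": "Han"}

# Latin 'A' and Greek Alpha: historically related, similar shape,
# but unifying them would break simple text processes (rule 4):
lat = {"source_set": "ISO 8859-1", "etymon": "ALPHA",
       "abstract_form": "A", "script": "Latin"}
grk = {"source_set": "ISO 8859-7", "etymon": "ALPHA",
       "abstract_form": "A", "script": "Greek"}

assert should_unify(jis, gb)        # unify: no exclusion rule applies
assert not should_unify(lat, grk)   # don't unify: rule (4) applies
```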
-
- I believe the above description of unification fairly captures the way
- that script unification is viewed by the designers of Unicode, and by
- the authors of the unification methodology employed in the CJK unification
- in Unicode and ISO10646. I would anticipate using something very close
- to this procedure in the future for candidate symbol collections. A
- couple of examples of where unifications might be considered: the
- different collections of Runes, Burmese and Shan, etc.
-
- The emphasis of unification in the context of character encoding should
- be on utility and economy -- engineering considerations -- and not on
- theoretical purity or cultural demands.
-
- Glenn Adams
-