- Path: sparky!uunet!olivea!mintaka.lcs.mit.edu!ai-lab!wheat-chex!glenn
- From: glenn@wheat-chex.ai.mit.edu (Glenn A. Adams)
- Newsgroups: comp.std.internat
- Subject: Re: Radicals Instead of Characters
- Message-ID: <1k1amuINN7rq@life.ai.mit.edu>
- Date: 25 Jan 93 18:14:54 GMT
- References: <1jpj9sINNlie@flop.ENGR.ORST.EDU> <1jtbfvINNqvr@life.ai.mit.edu> <1jucp0INN5pe@corax.udac.uu.se>
- Organization: MIT Artificial Intelligence Laboratory
- Lines: 93
- NNTP-Posting-Host: wheat-chex.ai.mit.edu
-
- In article <1jucp0INN5pe@corax.udac.uu.se> andersa@Riga.DoCS.UU.SE (Anders Andersson) writes:
- >Didn't you skip one step in your otherwise excellent analysis here?
- >The 20,000+ Han characters of Unicode today effectively employ 15 bits,
- >leaving 50-75% of the code space for characters from other scripts.
- >If the Han characters were only to use 11 bits, we surely wouldn't
- >start processing text in 11- or 12-bit chunks where we now use 16.
- >Instead, there would be more room within a 16-bit character set for
- >yet other scripts, as the Han characters would need less than 4% of it.
- >Is 16 bits enough for everything we'd like to encode, then?
-
- There are probably about 45,000 - 50,000 legitimate Han characters that
- somehow need to be encoded. More than half of these are no longer used,
- but were used in the past.
-
- How best to use a 16-bit encoding space is a good question.
- The goal should be to cover as many writing systems as possible; however,
- what is needed to adequately cover a given writing system is an issue that
- must be resolved. 20K Han characters will certainly cover modern uses of the
- CJK writing systems, but not all past uses.
-
- Furthermore, in proposing decomposed encodings for the 20,902 Han characters
- already in Unicode, you are neglecting a major goal of Unicode; namely, that
- there must be a one-to-one round-trip mapping between Unicode and common
- character sets. This precludes a decomposed encoding for these 20K characters.
-
- >Once we accept decomposition of characters into smaller elements,
- >we are de facto introducing variable-length encoding of characters
- >(as a character may consist of a variable number of elements),
- >with the consequences that implies in terms of storage and processing.
-
- Here is where the term "character" becomes problematic. There are
- two common but different uses of the term:
-
- (1) an element of a coded character set, which has a unique
- encoding (bit representation) and name.
-
- (2) an element of an alphabet, i.e., an element which is naturally
- perceived as an atomic unit of a writing system by the users
- of that writing system.
-
- In the vocabulary used by the Unicode standard, these two elements are
- termed CODE ELEMENT and TEXT ELEMENT, respectively. Furthermore, the
- second use above is only one kind of TEXT ELEMENT.
-
- In the case of Unicode, each CODE ELEMENT, i.e., each coded character element,
- has a fixed width of 16 bits. It is not variable length. In contrast, a
- TEXT ELEMENT may not only have a variable-length encoding (in terms of
- code elements), but may also have multiple encodings. For example:
-
- Text Element                 Unicode Encoding by Code Element(s)
- 
- E WITH CIRCUMFLEX          = 0x00CA, or 0x0045 0x0302
- E WITH CIRCUMFLEX & ACUTE  = 0x1EBE, or 0x00CA 0x0301, or 0x0045 0x0302 0x0301
-
- In Unicode, every "coded character element" consists of one and only one
- fixed-length 16-bit value; however, a "text element" may or may not have
- a fixed-length encoding by means of character code elements.
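- 
- To make the distinction concrete, here is a minimal sketch in Python, using
- only the standard unicodedata module; it is purely illustrative, and the
- normalization forms it calls on (NFC/NFD) are a later formalization of
- exactly this equivalence. The numeric values match the table above:
- 
-     import unicodedata
- 
-     precomposed = "\u00CA"          # one code element:  0x00CA
-     decomposed  = "\u0045\u0302"    # two code elements: 0x0045 0x0302
- 
-     # Both sequences encode the same text element, E WITH CIRCUMFLEX ...
-     assert unicodedata.normalize("NFC", decomposed) == precomposed
-     assert unicodedata.normalize("NFD", precomposed) == decomposed
- 
-     # ... but their lengths, counted in code elements, differ.
-     print(len(precomposed), len(decomposed))    # -> 1 2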
-
- In our current discussion about Han characters, you can think of this
- use of "character" as denoting a graphical symbol which is considered atomic
- at some level of processing, but not necessarily at the level of encoding;
- i.e., it is a text element.
-
- The design goal of Unicode is to define the set of code elements which
- can encode the largest number of text elements in a way most convenient
- for processing, and to ensure that a certain collection of text elements,
- namely, those encoded by existing character set standards, has a direct
- (fixed-length) encoding (in order to ensure a 1-1 round-trip mapping).
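- 
- As a concrete sketch of what the round-trip requirement means, take
- ISO 8859-1 (Latin-1), in which E WITH CIRCUMFLEX is the single octet 0xCA.
- The Python below only illustrates the mapping; it is not a normative
- procedure:
- 
-     import unicodedata
- 
-     octet = b"\xCA"                          # E WITH CIRCUMFLEX in ISO 8859-1
-     u = octet.decode("latin-1")              # -> "\u00CA", one code element
-     assert u.encode("latin-1") == octet      # 1-1 round trip preserved
- 
-     # Had Unicode encoded this character only in decomposed form, one
-     # Latin-1 octet would have to map to two code elements and back.
-     assert len(unicodedata.normalize("NFD", u)) == 2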
-
- >Maybe there are reasonable compromises between a full precomposed set,
- >and a fully decomposed system based on radicals or whatever character
- >component is found useful.
-
- Unicode was designed to be just this kind of compromise.
-
- >Han characters are not the only candidates for decomposition;
- >I'd decompose Latin ligatures and letters with diacritical marks
- >right away (assuming we are still talking about a potential
- >character set, and not Unicode itself).
-
- Many text elements used by less common alphabets cannot currently be
- represented in Unicode except by means of combining diacritical marks.
- This may always be the case, since Unicode may never contain all the
- precomposed combinations needed by every future use, or even by uses
- which are not well known today.
-
- A number of Unicode systems are designed to decompose *everything*
- into its maximally decomposed form. It turns out that certain types of
- processing are much simpler in this case.
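- 
- One such case, again sketched in Python for illustration only: once
- everything is maximally decomposed, accent-insensitive comparison becomes
- a trivial filter over code elements:
- 
-     import unicodedata
- 
-     def base_letters(s):
-         # Maximal (canonical) decomposition, then drop the combining marks.
-         decomposed = unicodedata.normalize("NFD", s)
-         return "".join(c for c in decomposed if not unicodedata.combining(c))
- 
-     # Precomposed and decomposed input compare equal after this step,
-     # and both reduce to the bare base letter.
-     assert base_letters("\u00CA") == base_letters("\u0045\u0302") == "E"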
-
- Glenn Adams
-