- Path: sparky!uunet!olivea!mintaka.lcs.mit.edu!ai-lab!wheat-chex!glenn
- From: glenn@wheat-chex.ai.mit.edu (Glenn A. Adams)
- Newsgroups: comp.std.internat
- Subject: Re: Radicals Instead of Characters
- Message-ID: <1k1amuINN7rq@life.ai.mit.edu>
- Date: 25 Jan 93 18:14:54 GMT
- References: <1jpj9sINNlie@flop.ENGR.ORST.EDU> <1jtbfvINNqvr@life.ai.mit.edu> <1jucp0INN5pe@corax.udac.uu.se>
- Organization: MIT Artificial Intelligence Laboratory
- Lines: 93
- NNTP-Posting-Host: wheat-chex.ai.mit.edu
-
- In article <1jucp0INN5pe@corax.udac.uu.se> andersa@Riga.DoCS.UU.SE (Anders Andersson) writes:
- >Didn't you skip one step in your otherwise excellent analysis here?
- >The 20,000+ Han characters of Unicode today effectively employ 15 bits,
- >leaving 50-75% of the code space for characters from other scripts.
- >If the Han characters were only to use 11 bits, we surely wouldn't
- >start processing text in 11- or 12-bit chunks where we now use 16.
- >Instead, there would be more room within a 16-bit character set for
- >yet other scripts, as the Han characters would need less than 4% of it.
- >Is 16 bits enough for everything we'd like to encode, then?
-
- There are probably about 45,000 - 50,000 legitimate Han characters that
- somehow need to be encoded. More than half of these are no longer used,
- but were used in the past.
-
- How best to use a 16-bit encoding space is a good question.
- The goal should be to cover as many writing systems as possible; however,
- what is needed to adequately cover a given writing system is an issue that
- must be resolved. 20K Han characters will certainly cover modern uses of the
- CJK writing systems, but not all past uses.
-
- Furthermore, in proposing decomposed encodings for the 20,902 Han characters
- already in Unicode, you are neglecting a major goal of Unicode; namely, that
- there must be a one-to-one round-trip mapping between Unicode and common
- character sets. This precludes a decomposed encoding for these 20K characters.
-
- >Once we accept decomposition of characters into smaller elements,
- >we are de facto introducing variable-length encoding of characters
- >(as a character may consist of a variable number of elements),
- >with the consequences that implies in terms of storage and processing.
-
- Here is where the term "character" becomes problematic. There are
- two common but different uses of the term:
-
- (1) an element of a coded character set, which has a unique
- encoding (bit representation) and name.
-
- (2) an element of an alphabet, i.e., an element which is naturally
- perceived as an atomic unit of a writing system by the users
- of that writing system.
-
- In the vocabulary used by the Unicode standard, these two elements are
- termed CODE ELEMENT and TEXT ELEMENT, respectively. Furthermore, the
- second use above is only one kind of TEXT ELEMENT.
-
- In the case of Unicode, each CODE ELEMENT, i.e., each coded character element,
- has a fixed width of 16 bits. It is not variable length. In contrast, a
- TEXT ELEMENT may not only have a variable-length encoding (in terms of
- code elements), but may also have multiple encodings. For example:
-
- Text Element                 Unicode Encoding by Code Element(s)
- 
- E WITH CIRCUMFLEX          = 0x00CA, or 0x0045 0x0302
- E WITH CIRCUMFLEX & ACUTE  = 0x1EBE, or 0x00CA 0x0301, or 0x0045 0x0302 0x0301
-
- In Unicode, every "coded character element" consists of one and only one
- fixed-length 16-bit value; however, a "text element" may or may not have
- a fixed-length encoding by means of character code elements.
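- 
- To make the distinction concrete, here is a minimal sketch in Python, using
- only the standard unicodedata module; it is purely illustrative, and the
- normalization forms it calls on (NFC/NFD) are a later formalization of
- exactly this equivalence. The numeric values match the table above:
- 
-     import unicodedata
- 
-     precomposed = "\u00CA"          # one code element:  0x00CA
-     decomposed  = "\u0045\u0302"    # two code elements: 0x0045 0x0302
- 
-     # Both sequences encode the same text element, E WITH CIRCUMFLEX ...
-     assert unicodedata.normalize("NFC", decomposed) == precomposed
-     assert unicodedata.normalize("NFD", precomposed) == decomposed
- 
-     # ... but their lengths, counted in code elements, differ.
-     print(len(precomposed), len(decomposed))    # -> 1 2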
-
- In our current discussion about Han characters, you can think of this
- use of "character" as denoting a graphical symbol which is considered atomic
- at some level of processing, but not necessarily at the level of encoding;
- i.e., it is a text element.
-
- The design goal of Unicode is to define the set of code elements which
- can encode the largest number of text elements in a way most convenient
- for processing, and to ensure that a certain collection of text elements,
- namely, those encoded by existing character set standards, has a direct
- (fixed-length) encoding (in order to ensure a 1-1 round-trip mapping).
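- 
- As a concrete sketch of what the round-trip requirement means, take
- ISO 8859-1 (Latin-1), in which E WITH CIRCUMFLEX is the single octet 0xCA.
- The Python below only illustrates the mapping; it is not a normative
- procedure:
- 
-     import unicodedata
- 
-     octet = b"\xCA"                          # E WITH CIRCUMFLEX in ISO 8859-1
-     u = octet.decode("latin-1")              # -> "\u00CA", one code element
-     assert u.encode("latin-1") == octet      # 1-1 round trip preserved
- 
-     # Had Unicode encoded this character only in decomposed form, one
-     # Latin-1 octet would have to map to two code elements and back.
-     assert len(unicodedata.normalize("NFD", u)) == 2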
-
- >Maybe there are reasonable compromises between a full precomposed set,
- >and a fully decomposed system based on radicals or whatever character
- >component is found useful.
-
- Unicode was designed to be just this kind of compromise.
-
- >Han characters are not the only candidates for decomposition;
- >I'd decompose Latin ligatures and letters with diacritical marks
- >right away (assuming we are still talking about a potential
- >character set, and not Unicode itself).
-
- Many text elements used by less common alphabets cannot currently be
- represented in Unicode except by means of combining diacritical marks.
- This may always be the case, since Unicode may never contain all the
- precomposed combinations needed by every future use, or even by uses
- which are not well known today.
-
- A number of Unicode systems are designed to decompose *everything*
- into its maximally decomposed form. It turns out that certain types of
- processing are much simpler in this case.
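- 
- One such case, again sketched in Python for illustration only: once
- everything is maximally decomposed, accent-insensitive comparison becomes
- a trivial filter over code elements:
- 
-     import unicodedata
- 
-     def base_letters(s):
-         # Maximal (canonical) decomposition, then drop the combining marks.
-         decomposed = unicodedata.normalize("NFD", s)
-         return "".join(c for c in decomposed if not unicodedata.combining(c))
- 
-     # Precomposed and decomposed input compare equal after this step,
-     # and both reduce to the bare base letter.
-     assert base_letters("\u00CA") == base_letters("\u0045\u0302") == "E"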
-
- Glenn Adams
-