Path: sparky!uunet!cs.utexas.edu!qt.cs.utexas.edu!yale.edu!yale!mintaka.lcs.mit.edu!ai-lab!wheat-chex!glenn
From: glenn@wheat-chex.ai.mit.edu (Glenn A. Adams)
Newsgroups: comp.std.internat
Subject: Re: Radicals Instead of Characters
Date: 24 Jan 1993 06:03:43 GMT
Organization: MIT Artificial Intelligence Laboratory
Lines: 137
Message-ID: <1jtbfvINNqvr@life.ai.mit.edu>
References: <1jfgq1INNqmn@flop.ENGR.ORST.EDU> <2791@titccy.cc.titech.ac.jp> <1jpj9sINNlie@flop.ENGR.ORST.EDU>
NNTP-Posting-Host: wheat-chex.ai.mit.edu

In article <1jpj9sINNlie@flop.ENGR.ORST.EDU> crowl@jade.CS.ORST.EDU (Lawrence Crowl) writes:
>An encoding of every variant of every character ever written is not? At
>214 radicals, we can represent a radical in eight bits, and a character
>in 8*(average number of radicals per character). Your non-unified
>approach would require roughly eighteen bits per character.

Forget about encoding Han characters as a string of radicals (at least
the Kangxi radicals). It won't work. Take the first 10 Unihan characters
as a test. Each of these is composed of radical 1 (meaning 'one') plus
additional strokes, with the additional strokes further decomposable as
follows:

4E00  rad 1 + 0 strokes  =  rad 1
4E01  rad 1 + 1 stroke   =  rad 1 + rad 6
4E02  rad 1 + 1 stroke   =  rad 1 + ?
4E03  rad 1 + 1 stroke   =  rad 1 + rad 5
4E04  rad 1 + 1 stroke   =  rad 1 + rad 2
4E05  rad 1 + 1 stroke   =  rad 1 + rad 2
4E06  rad 1 + 1 stroke   =  rad 1 + rad 4
4E07  rad 1 + 2 strokes  =  rad 1 + ?
4E08  rad 1 + 2 strokes  =  rad 1 + ?
4E09  rad 1 + 2 strokes  =  rad 1 + rad 7

4E02, 4E07, and 4E08 do not have a complete decomposition into these
radicals; on the other hand, 4E04 and 4E05 have the same decomposition.
The main problems with your proposal (which frequently surfaces and soon
disappears) are that (1) more component radicals are needed than the
Kangxi radical set provides, and (2) a mechanism is needed to
disambiguate identical decompositions.

Another component set, called the Chiao-tung set, contains 446 components
and has been shown to generate more than 48,700 characters; however, even
it fails to describe certain characters. Perhaps as many as 1,200
components are needed to fully describe all known characters.

To disambiguate identical decompositions, it is necessary to tag each
decomposed component with a position and a size metric. The fully
decomposed character can then be unambiguously described as the sum of
N components, each having a specified position and size.
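
A minimal sketch of such a tagged decomposition (the coarse 0-9 position
grid, the size scale, and the particular values chosen are assumptions
for illustration only, not a proposal):

    from typing import NamedTuple, Tuple

    class Part(NamedTuple):
        component: int  # index into an assumed ~1,200-component set
        position: int   # quantized position within the character cell, 0-9
        size: int       # quantized size, 0-9

    # A character is the sum of N positioned, sized components.
    Character = Tuple[Part, ...]

    # U+4E04 and U+4E05 both decompose to rad 1 + rad 2; hypothetical
    # position/size tags are what tell them apart.
    u4E04 = (Part(1, position=8, size=9), Part(2, position=4, size=5))
    u4E05 = (Part(1, position=1, size=9), Part(2, position=5, size=5))
    assert u4E04 != u4E05  # identical components, distinct characters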

This more thorough method would only suffice to describe characters as
abstract shapes. It would not be sufficient to describe the glyphs which
correspond to these characters. A glyph decomposition would have to use
a collection of strokes, perhaps as few as 40, along with the position
and size of each stroke, to describe each glyph a Han character might
take. Furthermore, since more than one glyph can represent the same
character, some mechanism by which such a family of glyphs could be
interrelated would probably be needed.
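
Pushed down one level, the same sketch might look like this (the ~40
stroke inventory comes from the estimate above; the one-to-many mapping
from characters to glyphs stands in for the interrelation mechanism just
mentioned, and everything else is again an assumption):

    from typing import Dict, List, NamedTuple, Tuple

    class Stroke(NamedTuple):
        stroke: int    # index into an assumed inventory of ~40 strokes
        position: int  # quantized position, 0-9
        size: int      # quantized size, 0-9

    # A glyph is a collection of positioned, sized strokes.
    Glyph = Tuple[Stroke, ...]

    # One abstract character maps to a family of alternative glyphs.
    glyph_families: Dict[int, List[Glyph]] = {
        0x4E00: [(Stroke(1, position=4, size=9),)],  # hypothetical entry
    }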

There are a number of text processing domains where character decomposition
might be useful:

1. character encoding
2. character input
3. text compression

Similarly, glyph decomposition might be useful in the following domains:

1. font encoding
2. glyph representation
3. glyph recognition (OCR)

Since the present discussion revolves around character encoding issues,
it is useful to consider how decomposition would affect character
representation and processing:

1. Reduces character code element requirements: the number of code elements
required could be as few as N components + P relative positions + S
relative sizes. Assuming that 1,200 components could fully generate all
characters, and that P and S could each be quantized to a small set of
discrete values (say 10 or fewer), perhaps as few as 1200 + 10 + 10 =
1220 character code positions would be required.

2. Increases string lengths: my guess is that, on average, approximately
4-5 components would be necessary, and, since each component would also
require a position and a size attribute, total string length would
increase by 12-15 times over a fully precomposed character encoding.
This could be reduced by encoding each component multiple times, with
one code for each position and size that the component can take. In this
case, if we assume that a fourth of the possible 10*10 position/size
feature space is actually used by each component (on average), then
25*1200 = 30,000 code points would be needed. A string would now increase
only 4-5 times in size (in the average case), since position and size
would be implicit. However, the attraction of a small number of code
points would be lost.

3. Increases compression efficiency, at least in the 1220-code-position
case: most compression schemes achieve ratios that improve as the symbol
alphabet shrinks, so fewer distinct symbols mean better compression.

4. Allows for simple creation of new Han character symbols using
combinatorial methods.

5. Text operations which must work on each decomposed Han character must
now parse the string to find the boundaries of each character. Indexing
becomes an O(n) problem rather than an O(1) problem (see the sketch
after this list).
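
To make point 5 concrete, here is a sketch assuming a hypothetical
in-band code element marking the start of each character (any real
decomposed encoding would need some equivalent boundary convention):

    BEGIN = 0  # hypothetical "start of character" code element

    def char_at(stream, i):
        """Return the code elements of the i-th character.

        With a fixed-width precomposed encoding this lookup would be
        stream[i], an O(1) operation; here we must scan the whole
        stream for boundary marks, an O(n) operation.
        """
        starts = [k for k, code in enumerate(stream) if code == BEGIN]
        end = starts[i + 1] if i + 1 < len(starts) else len(stream)
        return stream[starts[i]:end]

    # Two decomposed characters, of three and two components:
    text = [BEGIN, 17, 42, 9, BEGIN, 17, 3]
    assert char_at(text, 1) == [BEGIN, 17, 3]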

To summarize, a decomposed Han character approach may reduce the number of
code points needed from approximately 2^16 to approximately 2^11; however,
text storage sizes will increase commensurately. To roughly gauge this,
we might have:

precomposed encoding

    2^20 precomposed characters *
    16 bits/precomposed character =
    2^24 bits

decomposed encoding

    2^20 precomposed characters *
    2^4 decomposed characters/precomposed character *
    11 bits/decomposed character =
    11 * 2^24 bits, roughly 2^27.5 bits

Thus a decomposed encoding in fact costs roughly ten times more storage
overall; less if the average decomposition is much smaller than 16
elements, though even the 30,000-code-point variant above runs 4-5 times
larger. On top of that, the cost of processing increases dramatically,
since indexing is no longer possible without parsing the string.
Furthermore, nobody is going to use 11-bit character codes. Once you go
over 8 bits, the only logical choice is 16 bits, or perhaps 32 bits.
Since 32 bits is clearly overkill, there remains a 16-bit encoding
model: Unicode.

Perhaps further investigation of decomposition is justified in the context
of finding good compression algorithms; however, it is not justified in
the context of finding a reasonably simple yet adequate character encoding.

Some of the information regarding Han character and glyph decomposition
used above was taken from "On the Formalization of Glyph in Chinese
Language," by C. C. Hsieh, C. T. Chang, and Jack K. T. Huang, Feb. 6,
1990, a contribution to the Kyoto meeting of AFII (Association for Font
Information and Interchange).

Glenn Adams