Path: sparky!uunet!cs.utexas.edu!qt.cs.utexas.edu!yale.edu!yale!mintaka.lcs.mit.edu!ai-lab!wheat-chex!glenn
From: glenn@wheat-chex.ai.mit.edu (Glenn A. Adams)
Newsgroups: comp.std.internat
Subject: Re: Radicals Instead of Characters
Date: 24 Jan 1993 06:03:43 GMT
Organization: MIT Artificial Intelligence Laboratory
Lines: 137
Message-ID: <1jtbfvINNqvr@life.ai.mit.edu>
References: <1jfgq1INNqmn@flop.ENGR.ORST.EDU> <2791@titccy.cc.titech.ac.jp> <1jpj9sINNlie@flop.ENGR.ORST.EDU>
NNTP-Posting-Host: wheat-chex.ai.mit.edu

In article <1jpj9sINNlie@flop.ENGR.ORST.EDU> crowl@jade.CS.ORST.EDU (Lawrence Crowl) writes:
>An encoding of every variant of every character ever written is not? At
>214 radicals, we can represent a radical in eight bits, and a character
>in 8*(average number of radicals per character). Your non-unified
>approach would require roughly eighteen bits per character.

Forget about encoding Han characters as a string of radicals (at least
the Kangxi radicals). It won't work. Take the first 10 Unihan characters
as a test. Each of these is composed of radical 1 (meaning 'one') plus
additional strokes, with the additional strokes further decomposable as
follows:

4E00  rad 1 + 0 strokes  =  rad 1
4E01  rad 1 + 1 stroke   =  rad 1 + rad 6
4E02  rad 1 + 1 stroke   =  rad 1 + ?
4E03  rad 1 + 1 stroke   =  rad 1 + rad 5
4E04  rad 1 + 1 stroke   =  rad 1 + rad 2
4E05  rad 1 + 1 stroke   =  rad 1 + rad 2
4E06  rad 1 + 1 stroke   =  rad 1 + rad 4
4E07  rad 1 + 2 strokes  =  rad 1 + ?
4E08  rad 1 + 2 strokes  =  rad 1 + ?
4E09  rad 1 + 2 strokes  =  rad 1 + rad 7

4E02, 4E07, and 4E08 do not have a complete decomposition into these
radicals; on the other hand, 4E04 and 4E05 have the same decomposition.
The main problems with your proposal (which frequently surfaces and soon
disappears) are that (1) more component radicals are needed than the
Kangxi radical set provides, and (2) a mechanism is needed to
disambiguate identical decompositions.

Another component set, called the Chiao-tung set, contains 446 components
and has been shown to generate more than 48,700 characters; however, even
it fails to describe certain characters. Perhaps as many as 1,200
components are needed to fully describe all known characters.

To disambiguate identical decompositions, it is necessary to tag each
decomposed component with a position and a size metric. The fully
decomposed character can then be unambiguously described as the sum of
N components, each having a specified position and size.
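
A minimal sketch of such a tagged decomposition (the coarse 0-9 position
grid, the size scale, and the particular values chosen are assumptions
for illustration only, not a proposal):

    from typing import NamedTuple, Tuple

    class Part(NamedTuple):
        component: int  # index into an assumed ~1,200-component set
        position: int   # quantized position within the character cell, 0-9
        size: int       # quantized size, 0-9

    # A character is the sum of N positioned, sized components.
    Character = Tuple[Part, ...]

    # U+4E04 and U+4E05 both decompose to rad 1 + rad 2; hypothetical
    # position/size tags are what tell them apart.
    u4E04 = (Part(1, position=8, size=9), Part(2, position=4, size=5))
    u4E05 = (Part(1, position=1, size=9), Part(2, position=5, size=5))
    assert u4E04 != u4E05  # identical components, distinct characters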

This more thorough method would only suffice to describe characters as
abstract shapes. It would not be sufficient to describe the glyphs which
correspond to these characters. A glyph decomposition would have to use
a collection of strokes, perhaps as few as 40, along with the position
and size of each stroke, to describe each glyph a Han character might
take. Furthermore, since more than one glyph can represent the same
character, some mechanism by which such a family of glyphs could be
interrelated would probably be needed.
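
Pushed down one level, the same sketch might look like this (the ~40
stroke inventory comes from the estimate above; the one-to-many mapping
from characters to glyphs stands in for the interrelation mechanism just
mentioned, and everything else is again an assumption):

    from typing import Dict, List, NamedTuple, Tuple

    class Stroke(NamedTuple):
        stroke: int    # index into an assumed inventory of ~40 strokes
        position: int  # quantized position, 0-9
        size: int      # quantized size, 0-9

    # A glyph is a collection of positioned, sized strokes.
    Glyph = Tuple[Stroke, ...]

    # One abstract character maps to a family of alternative glyphs.
    glyph_families: Dict[int, List[Glyph]] = {
        0x4E00: [(Stroke(1, position=4, size=9),)],  # hypothetical entry
    }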

There are a number of text processing domains where character decomposition
might be useful:

1. character encoding
2. character input
3. text compression

Similarly, glyph decomposition might be useful in the following domains:

1. font encoding
2. glyph representation
3. glyph recognition (OCR)

Since the present discussion revolves around character encoding issues,
it is useful to consider how decomposition would affect character
representation and processing:

1. Reduces character code element requirements: the number of code elements
required could be as few as N components + P relative positions + S
relative sizes. Assuming that 1,200 components could fully generate all
characters, and that P and S could each be quantized to a small set of
discrete values (say 10 or fewer), perhaps as few as 1200 + 10 + 10 =
1220 character code positions would be required.

2. Increases string lengths: my guess is that, on average, approximately
4-5 components would be necessary, and, since each component would also
require a position and a size attribute, total string length would
increase by 12-15 times over a fully precomposed character encoding.
This could be reduced by encoding each component multiple times, with
one code for each position and size that the component can take. In this
case, if we assume that a fourth of the possible 10*10 position/size
feature space is actually used by each component (on average), then
25*1200 = 30,000 code points would be needed. A string would now increase
only 4-5 times in size (in the average case), since position and size
would be implicit. However, the attraction of a small number of code
points would be lost.

3. Increases compression efficiency, at least in the 1220-code-position
case: most compression schemes achieve ratios that improve as the symbol
alphabet shrinks, so fewer distinct symbols mean better compression.

4. Allows for simple creation of new Han character symbols using
combinatorial methods.

5. Text operations which must work on each decomposed Han character must
now parse the string to find the boundaries of each character. Indexing
becomes an O(n) problem rather than an O(1) problem (see the sketch
after this list).
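
To make point 5 concrete, here is a sketch assuming a hypothetical
in-band code element marking the start of each character (any real
decomposed encoding would need some equivalent boundary convention):

    BEGIN = 0  # hypothetical "start of character" code element

    def char_at(stream, i):
        """Return the code elements of the i-th character.

        With a fixed-width precomposed encoding this lookup would be
        stream[i], an O(1) operation; here we must scan the whole
        stream for boundary marks, an O(n) operation.
        """
        starts = [k for k, code in enumerate(stream) if code == BEGIN]
        end = starts[i + 1] if i + 1 < len(starts) else len(stream)
        return stream[starts[i]:end]

    # Two decomposed characters, of three and two components:
    text = [BEGIN, 17, 42, 9, BEGIN, 17, 3]
    assert char_at(text, 1) == [BEGIN, 17, 3]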

To summarize, a decomposed Han character approach may reduce the number of
code points needed from approximately 2^16 to approximately 2^11; however,
text storage sizes will increase commensurately. To roughly gauge this,
we might have:

precomposed encoding

    2^20 precomposed characters *
    16 bits/precomposed character =
    2^24 bits

decomposed encoding

    2^20 precomposed characters *
    2^4 decomposed characters/precomposed character *
    11 bits/decomposed character =
    11 * 2^24 bits, roughly 2^27.5 bits

Thus a decomposed encoding in fact costs roughly ten times more storage
overall; less if the average decomposition is much smaller than 16
elements, though even the 30,000-code-point variant above runs 4-5 times
larger. On top of that, the cost of processing increases dramatically,
since indexing is no longer possible without parsing the string.
Furthermore, nobody is going to use 11-bit character codes. Once you go
over 8 bits, the only logical choice is 16 bits, or perhaps 32 bits.
Since 32 bits is clearly overkill, there remains a 16-bit encoding
model: Unicode.

Perhaps further investigation of decomposition is justified in the context
of finding good compression algorithms; however, it is not justified in
the context of finding a reasonably simple yet adequate character encoding.

Some of the information regarding Han character and glyph decomposition
used above was taken from "On the Formalization of Glyph in Chinese
Language," by C. C. Hsieh, C. T. Chang, and Jack K. T. Huang, Feb. 6,
1990, a contribution to the Kyoto meeting of AFII (Association for Font
Information and Interchange).

Glenn Adams