- Newsgroups: comp.std.internat
- Path: sparky!uunet!cs.utexas.edu!uwm.edu!spool.mu.edu!yale.edu!ira.uka.de!scsing.switch.ch!josef!mduerst
- From: mduerst@ifi.unizh.ch (Martin J. Duerst)
- Subject: Re: Radicals Instead of Characters
- Message-ID: <1993Jan25.194330.680@ifi.unizh.ch>
- Sender: mduerst@ifi.unizh.ch (Martin J. Duerst)
- Organization: University of Zurich, Department of Computer Science
- References: <1jfgq1INNqmn@flop.ENGR.ORST.EDU> <2791@titccy.cc.titech.ac.jp> <1jpj9sINNlie@flop.ENGR.ORST.EDU> <1jtbfvINNqvr@life.ai.mit.edu>
- Date: Mon, 25 Jan 93 19:43:30 GMT
- Lines: 161
-
-
- In article <1jtbfvINNqvr@life.ai.mit.edu>, glenn@wheat-chex.ai.mit.edu (Glenn A. Adams) writes:
-
- (A good post indeed, as always from Glenn A. Adams. I will just add my two cents.)
- So that my comments are not misunderstood: I am in no way advocating splitting
- up Han characters by means of a radical coding.
-
- > Another component set, called the Chiao-tung set, contains 446 components
- > and has been shown to generate more than 48,700 characters; however, even
- > it fails to describe certain characters. Perhaps as many as 1,200 components
- > are needed to fully describe all known characters.
- >
- Do you have any more information on this Chiao-tung set? I am greatly interested
- in it. I know of several character dictionaries that use more than the
- 218 radical set to explain certain connections and give hints for memorization.
-
- > To disambiguate identical decompositions, it is necessary to tag each
- > decomposed component with a position and size metric. The fully decomposed
- > character can then be unambiguously described as the sum of N components,
- > each having a specified position and size.
- >
- I agree that you would have to disambiguate. But the bulk of the compositions
- would be left/right or top/bottom, or something similar, and for those a single
- prefix code would be enough. The positional balance could then be refined by
- the rendering machine.
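
To make the prefix idea concrete, here is a minimal sketch in Python. The operator names LR and TB, the Box record, and the naive half-and-half split are all assumptions made up for illustration; they are not taken from any standard or from the quoted post. The point is only that one prefix code per composition is enough to place the components roughly, leaving the fine balance to the renderer.

```python
from dataclasses import dataclass

# Hypothetical prefix operators (invented for this sketch); each takes two operands.
LR = "LR"  # first operand on the left, second on the right
TB = "TB"  # first operand on top, second on the bottom


@dataclass
class Box:
    part: str   # the component placed in this box
    x: float    # left edge, in units of the character cell
    y: float    # top edge
    w: float    # width
    h: float    # height


def layout(codes, x=0.0, y=0.0, w=1.0, h=1.0):
    """Consume one composed character from the iterator `codes` and return
    its components with naive half-and-half boxes inside (x, y, w, h)."""
    code = next(codes)
    if code == LR:
        left = layout(codes, x, y, w / 2, h)
        right = layout(codes, x + w / 2, y, w / 2, h)
        return left + right
    if code == TB:
        top = layout(codes, x, y, w, h / 2)
        bottom = layout(codes, x, y + h / 2, w, h / 2)
        return top + bottom
    # Anything that is not an operator is treated as a leaf component.
    return [Box(code, x, y, w, h)]


# 明 = 日 on the left, 月 on the right
boxes = layout(iter([LR, "日", "月"]))

# 想 = (木 beside 目) above 心
nested = layout(iter([TB, LR, "木", "目", "心"]))
```

A real renderer would replace the equal split with per-component adjustments; the sketch only shows that the prefix code carries enough structure to start from.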
-
-
- > There are a number of text processing domains where character decomposition
- > might be useful:
- >
- > 1. character encoding
- > 2. character input
- > 3. text compression
- >
- > Similarly, glyph decomposition might be useful in the following domains:
- >
- > 1. font encoding
- > 2. glyph representation
- > 3. glyph recognition (OCR)
- >
- Add font design in both cases. For a font designer, it might be very convenient
- to ask for all the characters with a certain part in a certain position, not
- to mention specifying all these same parts on a global level before making
- the (clearly necessary) adjustments for each character.
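
As a rough illustration of such a font-design query, the following Python sketch looks up all characters that have a given part in a given position. The table DECOMP, the structure tags "LR" and "TB", and the function name chars_with_part are hypothetical; a real tool would need a far larger table and more structure types.

```python
# Small illustrative decomposition table: character -> (structure, part 1, part 2).
# The tags "LR" (left/right) and "TB" (top/bottom) are invented for this sketch.
DECOMP = {
    "明": ("LR", "日", "月"),
    "時": ("LR", "日", "寺"),
    "好": ("LR", "女", "子"),
    "林": ("LR", "木", "木"),
    "字": ("TB", "宀", "子"),
    "早": ("TB", "日", "十"),
}


def chars_with_part(part, structure, slot):
    """Return the characters whose top-level decomposition has `part` in
    `slot` (0 = left or top, 1 = right or bottom) of the given structure."""
    return sorted(
        ch
        for ch, (s, *parts) in DECOMP.items()
        if s == structure and parts[slot] == part
    )


sun_on_left = chars_with_part("日", "LR", 0)   # characters with 日 as left part
child_on_right = chars_with_part("子", "LR", 1)
```

A designer could draw the left-hand 日 once, then walk through the result list to make the per-character adjustments.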
-
- > Since the present discussion revolves around character encoding issues,
- > it is useful to say how decomposition may affect character representation
- > and processing:
- >
- > 1. Reduces character code element requirements: (explanation deleted)
- >
- > 2. Increases string lengths: my guess is that, on the average, approximately
- > 4-5 components would be necessary, and, since each would require a position
- > and size attribute, total string length would be increased by 9-12 times
- > over a fully precomposed character encoding. This could be reduced by
- > multiply encoding each component, with one encoding for each position and
- > size which that component could take. In this case, if we assume that
- > a fourth of the possible 10*10 position/size feature space was actually
- > used by each component (on average), then 25*1200 code points would be
- > needed, bringing us up to 25,000 code points. A string would now increase
- > only 3-4 times in size (in the average case), since position and size would
- > be implicit. However, the attraction of a small number of code points would
- > be lost.
- >
- As above, I slightly disagree. In your scheme, and in most of the others
- proposed recently (in more or less detail), you just add one component
- after the other. You then need a code that tells you that the character
- is complete, which takes some space, too. With prefix composing codes such
- as left-right, on the other hand, you can cover the great majority of
- characters and know where a character ends without any additional code.
- Of course, a small percentage of 'strange' characters is left, which still
- adds up to a fair number. But it would be better to code those as
- precomposed characters, or to use some really basic description (what about
- PostScript?) for really rare characters.
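
The claim that prefix codes make a terminator unnecessary can be checked with a short Python sketch. The operator table OPERATORS and the function split_characters are assumptions for illustration only: each hypothetical operator announces how many operands follow, so the scanner knows a character is complete as soon as no operand is outstanding.

```python
# Hypothetical prefix codes and the number of operands each one takes.
OPERATORS = {"LR": 2, "TB": 2}


def split_characters(codes):
    """Split a flat code sequence into characters without any terminator code.

    `pending` counts the operands still owed to the current character;
    when it drops to zero the character is complete.
    """
    chars, current, pending = [], [], 0
    for code in codes:
        if not current:
            pending = 1  # a new character owes exactly one expression
        current.append(code)
        # An operator uses up one slot and opens as many as it has operands;
        # a plain component (or precomposed character) just uses up one slot.
        pending += OPERATORS.get(code, 0) - 1
        if pending == 0:
            chars.append(current)
            current = []
    if current:
        raise ValueError("truncated character at end of input")
    return chars


# 明, then a precomposed 人, then 想 = (木 beside 目) above 心
stream = ["LR", "日", "月", "人", "TB", "LR", "木", "目", "心"]
characters = split_characters(stream)
```

Note that a precomposed code such as 人 passes through as a one-element character, so the two kinds of coding can be mixed freely. Finding the k-th character still requires a scan from the start of the string.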
-
- > 3. Increase compression efficiency in the case of only using 1220 code
- > positions; most compression schemes compress at a ratio which is
- > directly proportional to the symbol count: fewer symbols mean better
- > compression.
- >
- Interesting point, but of no avail:
-   Fewer symbols -> longer messages  -> more compression
-   More symbols  -> shorter messages -> less compression
- Either way, the average compressed string length stays about the same.
- Do you think coding ASCII with one bit per byte (seven bytes per char)
- would increase compression efficiency? Surely not!
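
The ASCII thought experiment can be tried out with a small Python sketch. It measures the cost of an ideal order-0 (memoryless) coder only, which is an assumption of this sketch and not a model of any particular compressor; the sample text and the name order0_bits are made up for illustration.

```python
import math
from collections import Counter


def order0_bits(symbols):
    """Total bits an ideal order-0 (memoryless) coder needs for `symbols`."""
    counts = Counter(symbols)
    n = len(symbols)
    return sum(-c * math.log2(c / n) for c in counts.values())


text = "fewer symbols mean longer messages, not better compression " * 20

# Alphabet of up to 128 symbols: one symbol per character.
as_chars = list(text)
# Alphabet of 2 symbols: seven symbols ('0' or '1') per character.
as_bits = [b for ch in text for b in format(ord(ch), "07b")]

char_cost = order0_bits(as_chars)
bit_cost = order0_bits(as_bits)

print(f"{len(as_chars)} symbols from a large alphabet: {char_cost:.0f} bits")
print(f"{len(as_bits)} symbols from a 2-symbol alphabet: {bit_cost:.0f} bits")
```

Under this simple model the two-symbol version never comes out cheaper, because splitting a character into bits throws away the dependencies between those bits. A compressor with good context modelling can win most of that back, but it then only returns to roughly where the larger alphabet started.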
-
- > 4. Allows for simple creation of new Han character symbols using combinatorial
- > methods.
- >
- Interesting for historians (not new, but newly discovered characters) and
- maybe for writers and designers. But this should not be encouraged too much:
- the existing characters alone already give anybody a lot to learn. And keep
- in mind that quite a few characters in use (or in a standard or a dictionary)
- have their origin in pure misspellings.
-
- > 5. Text operations which must operate on each such decomposed Han character
- > must now parse the string to find the boundaries of each decomposed Han
- > character. Indexing now becomes a O(n) problem, rather than a O(1)
- > problem.
- >
- >
- > To summarize, a decomposed Han character approach may reduce the number of
- > code points needed from approximately 2^16 to approximately 2^11; however
- > text storage sizes will have a commensurate increase. So, to roughly gauge
- > this, we might have:
- >
- > precomposed encoding
- >
- > 2^20 precomposed characters *
- > 2^16 bits/precomposed character =
- > 2^36 bits
- >
- > decomposed encoding
- >
- > 2^20 precomposed characters *
- > 2^4 decomposed characters/precomposed character *
- > 2^11 bits/decomposed character =
- > 2^35 bits
- >
- > Thus a decomposed encoding may produce a 2 times space savings overall;
- > perhaps more still if the average decomposition is much smaller than 16
- > elements. However, the cost of processing now increases dramatically
- > since indexing is no longer possible without parsing a string. Furthermore,
- > nobody is going to use 11 bit character codes. Once you go over 8 bits,
- > the only logical choice is 16 bits, or perhaps 32 bits. Since 32 bits is
- > clearly overkill, there remains a 16-bit encoding model: Unicode.
-
- I don't understand this. Are you trying to store a 1 Mchar (2^20) text?
- That would be 2^20*16 = 2^24 bits = 16 Mbits = 2 Mbytes for precomposed and
- 2^20*4*11 = 44 Mbits ~= 5.5 Mbytes for decomposed, although the latter is
- far below your own estimate above (point 2).
- Or are you calculating something else, and I got it completely wrong?
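
For anyone who wants to redo the arithmetic, here is a short Python sketch. It reads the exponents 16 and 11 in the quoted table as bit widths per code, which is an assumption about what was intended; the variable names and the helper mbytes are made up for illustration.

```python
CHARS = 2 ** 20  # a text of 1 Mchar


def mbytes(bits):
    """Convert a bit count to megabytes (1 Mbyte = 2^20 bytes)."""
    return bits / 8 / 2 ** 20


# Precomposed: one 16-bit code per character.
precomposed_bits = CHARS * 16

# Decomposed, with 4 components of 11 bits each per character.
decomposed_bits = CHARS * 4 * 11

# Decomposed, with 2^4 = 16 components per character as in the quoted table.
quoted_decomposed_bits = CHARS * 2 ** 4 * 11

for label, bits in [
    ("precomposed, 16 bits/char", precomposed_bits),
    ("decomposed, 4 x 11 bits/char", decomposed_bits),
    ("decomposed, 16 x 11 bits/char", quoted_decomposed_bits),
]:
    print(f"{label}: {mbytes(bits):.1f} Mbytes")
```

With either component count the decomposed text comes out larger than the precomposed one, not smaller.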
-
- >
- > Perhaps further investigation of decomposition is justified in the context
- > of finding good compression algorithms; however, it is not justified in
- > the context of finding a reasonably simple yet adequate character encoding.
- >
- > Some information regarding Han character and glyph decomposition which was
- > used above was taken from "On the Formalization of Glyph in Chinese
- > Language," by C. C. Hsieh, C. T. Chang, and Jack K. T. Huang, Feb 6, 1990,
- > a contribution to the Kyoto meeting of AFII (Association for Font Information
- > and Interchange).
- >
- Do you have any information on this Association? Is it possible to join as
- an individual?
-
- > Glenn Adams
- >
-
-
- ----
- Dr.sc. Martin J. Du"rst ' , . p y f g c R l / =
- Institut fu"r Informatik a o e U i D h T n S -
- der Universita"t Zu"rich ; q j k x b m w v z
- Winterthurerstrasse 190 (the Dvorak keyboard)
- CH-8057 Zu"rich-Irchel Tel: +41 1 257 43 16
- S w i t z e r l a n d Fax: +41 1 363 00 35 Email: mduerst@ifi.unizh.ch
テュールスト・マーティン・ヤコブ(チューリッヒ大学情報科学科)
- ----
-