NetNews Usenet Archive 1993 #3

Text File | 1993-01-25 | 7.9 KB | 173 lines

Newsgroups: comp.std.internat
Path: sparky!uunet!cs.utexas.edu!uwm.edu!spool.mu.edu!yale.edu!ira.uka.de!scsing.switch.ch!josef!mduerst
From: mduerst@ifi.unizh.ch (Martin J. Duerst)
Subject: Re: Radicals Instead of Characters
Message-ID: <1993Jan25.194330.680@ifi.unizh.ch>
Sender: mduerst@ifi.unizh.ch (Martin J. Duerst)
Organization: University of Zurich, Department of Computer Science
References: <1jfgq1INNqmn@flop.ENGR.ORST.EDU> <2791@titccy.cc.titech.ac.jp> <1jpj9sINNlie@flop.ENGR.ORST.EDU> <1jtbfvINNqvr@life.ai.mit.edu>
Date: Mon, 25 Jan 93 19:43:30 GMT
Lines: 161

In article <1jtbfvINNqvr@life.ai.mit.edu>, glenn@wheat-chex.ai.mit.edu (Glenn A. Adams) writes:

(A good post indeed, as always from Glenn A. Adams. I'll just add my
two cents.)

Just so that you understand my comments: I am in no way advocating
splitting up Han characters by a radical coding.

> Another component set, called the Chiao-tung set, contains 446 components
> and has been shown to generate more than 48,700 characters; however, even
> it fails to describe certain characters. Perhaps as many as 1,200 components
> are needed to fully describe all known characters.
>

Do you have any more information on this Chiao-tung set? I am greatly
interested in it. I know of several character dictionaries that use
more than the 218 radical set to explain certain connections and give
hints for memorization.

> To disambiguate identical decompositions, it is necessary to tag each
> decomposed component with a position and size metric. The fully decomposed
> character can then be unambiguously described as the sum of N components,
> each having a specified position and size.
>

I agree that you would have to disambiguate. But the bulk of the
compositions would be left/right or top/bottom, or something similar.
And for that, a single prefixed code would be o.k. The positional
balance could be refined by the rendering machine.
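To make the prefixed-code idea concrete, here is a minimal sketch in
Python. Everything in it is an assumption for illustration: the operator
codes LR and TB, the component names, and both function names are made
up and belong to no proposed standard. Each operator takes exactly two
operands, so the decoder can tell where a character ends from the
structure alone.

```python
# Hypothetical prefix composition codes; each takes exactly two operands.
OPERATORS = {"LR": "left/right", "TB": "top/bottom"}


def parse_character(codes, pos=0):
    """Parse one composed character starting at codes[pos].

    Returns (tree, next_pos). A tree is either a component name (str)
    or a tuple (operator, first_part, second_part).
    """
    if pos >= len(codes):
        raise ValueError("code stream ended in the middle of a character")
    code = codes[pos]
    if code in OPERATORS:
        first, pos = parse_character(codes, pos + 1)
        second, pos = parse_character(codes, pos)
        return (code, first, second), pos
    # Anything that is not an operator is treated as a plain component.
    return code, pos + 1


def split_characters(codes):
    """Split a code stream into characters without any end-of-character code."""
    characters, pos = [], 0
    while pos < len(codes):
        tree, pos = parse_character(codes, pos)
        characters.append(tree)
    return characters


# Two characters back to back (component names are illustrative only):
#   LR sun moon             roughly the character for "bright"
#   TB grass (TB sun ten)   roughly the character for "grass"
stream = ["LR", "sun", "moon", "TB", "grass", "TB", "sun", "ten"]
print(split_characters(stream))
```

Positions and sizes are not stored at all here; as suggested above,
balancing the parts would be left to the rendering machine.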
> There are a number of text processing domains where character decomposition
> might be useful:
>
>   1. character encoding
>   2. character input
>   3. text compression
>
> Similarly, glyph decomposition might be useful in the following domains:
>
>   1. font encoding
>   2. glyph representation
>   3. glyph recognition (OCR)
>

Add font design in both cases. For a font designer, it might be very
convenient to ask for all the characters with a certain part in a
certain position, not to mention specifying all these same parts on a
global level before making the (clearly necessary) adjustments for
each character.

> Since the present discussion revolves around character encoding issues,
> it is useful to say how decomposition may affect character representation
> and processing:
>
> 1. Reduces character code element requirements: (explanation deleted)
>
> 2. Increases string lengths: my guess is that, on the average, approximately
>    4-5 components would be necessary, and, since each would require a position
>    and size attribute, total string length would be increased by 9-12 times
>    over a fully precomposed character encoding. This could be reduced by
>    multiply encoding each component, with one encoding for each position and
>    size which that component could take. In this case, if we assume that
>    a fourth of the possible 10*10 position/size feature space was actually
>    used by each component (on average), then 25*1200 code points would be
>    needed, bringing us up to 25,000 code points. A string would now increase
>    only 3-4 times in size (in the average case), since position and size would
>    be implicit. However, the attraction of a small number of code points would
>    be lost.
>

As above, I slightly disagree. In your scheme, and in most of the
others proposed recently (in more or less detail), you just add one
component after the other. You then need a code that tells you that
the character is complete, which takes some space, too.
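A sequential scheme of this kind might be sketched as follows. The END
marker, the component names, and the function name are assumptions for
illustration only, and the position and size attributes from the quoted
proposal are left out to keep the sketch short.

```python
END = "END"  # hypothetical end-of-character code


def split_sequential(codes):
    """Split a flat stream of components into characters at each END code."""
    characters, current = [], []
    for code in codes:
        if code == END:
            characters.append(current)
            current = []
        else:
            current.append(code)
    if current:
        raise ValueError("stream ended without an END code")
    return characters


# Two characters, each a plain run of components closed by END.
stream = ["sun", "moon", END, "grass", "sun", "ten", END]
chars = split_sequential(stream)
print(chars)

# Codes spent only on marking boundaries: one per character.
overhead = len(stream) - sum(len(c) for c in chars)
print(overhead)
```

The one marker per character counted in `overhead` is the extra space
referred to above.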
On the other hand, with prefix composing codes like left-right and so
on, you can catch the great bulk of characters, and know where a
character ends without any additional code. Of course, a small
percentage of 'strange' characters is left, which still adds up to a
fair number. But it would be better to code them as fixed, precomposed
characters, or to use some really basic description (what about
PostScript?) for really rare characters.

> 3. Increase compression efficiency in the case of only using 1220 code
>    positions; most compression schemes compress at a ratio which is
>    directly proportional to the symbol count: fewer symbols mean better
>    compression.
>

Interesting point, but of no avail:

Fewer symbols -> longer messages  -> average compressed string length
              -> more compression ->

More symbols  -> shorter messages -> average compressed string length
              -> less compression ->

Do you think coding ASCII with one bit per byte (seven bytes per char)
would increase compression efficiency? Surely not!

> 4. Allows for simple creation of new Han character symbols using combinatorial
>    methods.
>

Interesting for historians (not new, but newly discovered characters)
and maybe writers/designers. But this should not be encouraged too
much; the existing characters alone give anybody a lot to learn. And
keep in mind that quite a few characters in use (or in a standard or a
dictionary) have their origin in pure misspellings.

> 5. Text operations which must operate on each such decomposed Han character
>    must now parse the string to find the boundaries of each decomposed Han
>    character. Indexing now becomes a O(n) problem, rather than a O(1)
>    problem.
>
>
> To summarize, a decomposed Han character approach may reduce the number of
> code points needed from approximately 2^16 to approximately 2^11; however
> text storage sizes will have a commensurate increase.
> So, to roughly gauge this, we might have:
>
>   precomposed encoding
>
>     2^20 precomposed characters *
>     2^16 bits/precomposed character =
>     2^36 bits
>
>   decomposed encoding
>
>     2^20 precomposed characters *
>     2^4 decomposed characters/precomposed character *
>     2^11 bits/decomposed character =
>     2^35 bits
>
> Thus a decomposed encoding may produce a 2 times space savings overall;
> perhaps more still if the average decomposition is much smaller than 16
> elements. However, the cost of processing now increases dramatically
> since indexing is no longer possible without parsing a string. Furthermore,
> nobody is going to use 11 bit character codes. Once you go over 8 bits,
> the only logical choice is 16 bits, or perhaps 32 bits. Since 32 bits is
> clearly overkill, there remains a 16-bit encoding model: Unicode.

I don't understand this. Are you trying to store a 1Mchar (2^20) text?
This would be

    2^20*16   = 2^24 bits = 16Mbits = 2Mbytes

for precomposed and

    2^20*4*11 = 44Mbits ~= 5.5Mbytes

for decomposed, although the second is far below your estimate above
(point 2). Or are you calculating something else, and I got it
completely wrong?

>
> Perhaps further investigation of decomposition is justified in the context
> of finding good compression algorithms; however, it is not justified in
> the context of finding a reasonably simple yet adequate character encoding.
>
> Some information regarding Han character and glyph decomposition which was
> used above was taken from "On the Formalization of Glyph in Chinese
> Language," by C. C. Hsieh, C. T. Chang, and Jack K. T. Huang, Feb 6, 1990,
> a contribution to the Kyoto meeting of AFII (Association for Font Information
> and Interchange).
>

Do you have any information on this Association? Is it possible to join
as an individual?

> Glenn Adams
>

----
Dr.sc.  Martin J. Du"rst                 ' , . p y f g c R l / =
Institut fu"r Informatik                  a o e U i D h T n S -
der Universita"t Zu"rich                   ; q j k x b m w v z
Winterthurerstrasse  190                 (the Dvorak keyboard)
CH-8057   Zu"rich-Irchel   Tel: +41 1 257 43 16
 S w i t z e r l a n d     Fax: +41 1 363 00 35   Email: mduerst@ifi.unizh.ch
テュールスト・マーティン・ヤコブ（チューリッヒ大学情報科学科）
----