NetNews Usenet Archive 1993 #3

Internet Message Format | 1993-01-25 | 3.1 KB

Path: sparky!uunet!ogicse!mintaka.lcs.mit.edu!ai-lab!wheat-chex!glenn
From: glenn@wheat-chex.ai.mit.edu (Glenn A. Adams)
Newsgroups: comp.std.internat
Subject: Re: Radicals Instead of Characters
Message-ID: <1k1p4kINNjse@life.ai.mit.edu>
Date: 25 Jan 93 22:21:07 GMT
Article-I.D.: life.1k1p4kINNjse
References: <1jpj9sINNlie@flop.ENGR.ORST.EDU> <1jtbfvINNqvr@life.ai.mit.edu> <1993Jan25.194330.680@ifi.unizh.ch>
Organization: MIT Artificial Intelligence Laboratory
Lines: 62
NNTP-Posting-Host: wheat-chex.ai.mit.edu

In article <1993Jan25.194330.680@ifi.unizh.ch> mduerst@ifi.unizh.ch (Martin J. Duerst) writes:
>
>In article <1jtbfvINNqvr@life.ai.mit.edu>, glenn@wheat-chex.ai.mit.edu (Glenn A. Adams) writes:
>> precomposed encoding
>>
>>   2^20 precomposed characters *
>>   2^16 bits/precomposed character =
>>   2^36 bits
>>
>> decomposed encoding
>>
>>   2^20 precomposed characters *
>>   2^4 decomposed characters/precomposed character *
>>   2^11 bits/decomposed character =
>>   2^35 bits

Sorry, my arithmetic was broken, I should have had:

precomposed encoding

  2^20 precomposed characters *
  16 bits/precomposed character =
  2^24 bits = 2 MB

decomposed encoding

  2^20 precomposed characters *
  9 decomposed characters/precomposed character *
  11 bits/decomposed character =
  1.55 * 2^26 bits = 12.375 MB

This works out to a 6.2 times increase in size for the decomposed form.
If the decomposition required fewer elements on average, say 4 instead
of 9, then that would reduce the expansion to 2.75 times the precomposed
form. I don't know if 4 element decompositions are possible on the
average, particularly given size and position ambiguities. Even if it
were possible, not only would you require nearly 3 times the size as the
precomposed form, but you also would have the complexities introduced by
variable length text element (in this case, Han graphic symbol = Han
character) encodings.

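The corrected figures above are easy to check mechanically. The Python
sketch below only redoes the multiplication; the function names are made
up for illustration, and the inputs (2^20 characters of text, 16 bits per
precomposed character, 11 bits per decomposed element, 9 or 4 elements
per character) are the assumptions stated in the post, not measured values.

```python
# Size comparison for precomposed vs. decomposed Han text.
# All widths and averages are the post's assumptions, not measurements.

BITS_PER_BYTE = 8
MB = 2 ** 20  # bytes per megabyte, as used in the post


def precomposed_bits(n_chars, bits_per_char=16):
    """Total bits when each character is one fixed-width code element."""
    return n_chars * bits_per_char


def decomposed_bits(n_chars, elements_per_char, bits_per_element=11):
    """Total bits when each character is a sequence of smaller elements."""
    return n_chars * elements_per_char * bits_per_element


def to_mb(bits):
    """Convert a bit count to megabytes."""
    return bits / BITS_PER_BYTE / MB


N = 2 ** 20                    # characters of text
pre = precomposed_bits(N)      # 2^24 bits
dec9 = decomposed_bits(N, 9)   # 9 elements per character
dec4 = decomposed_bits(N, 4)   # 4 elements per character

print(to_mb(pre))    # 2.0
print(to_mb(dec9))   # 12.375
print(dec9 / pre)    # 6.1875, the "6.2 times" figure
print(dec4 / pre)    # 2.75
```

Changing the element count or element width in the calls shows how
sensitive the expansion factor is to the decomposition scheme chosen.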
The only reason I could see in pursuing this is to support encoding
rarely used characters rather than wasting the encoding space on
little-used elements. If all characters through the 90th percentile of
usage are encoded as single code elements and the remainder decomposed
using a decomposition scheme, then I expect that would give the best
efficiency in both size and encoding space usage.

Text processing algorithms will be somewhat complicated by having to deal
with variable-length text element encodings; however, the same is true
even now for dealing with Unicode non-spacing diacritics on Latin and
Cyrillic base forms.

One way out of this last problem is to use the Unicode Private Use Zone
(6144 code positions) as a dynamically reconfigurable character set area.
Then, upon input, translate rare decomposed Han elements (or other
decomposed text elements) to a code point which is dynamically assigned
from the private use zone. Of course, when interchanging the data, it
would have to be translated back to the decomposed form -- I assume
you don't want to send someone Unicode data with private use zone data,
that is, unless you are in complete control of both sending and receiving
parties.

Glenn Adams
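
The private use zone idea in the post can be sketched roughly as follows.
This is only an illustration under stated assumptions: the class and
method names are hypothetical, text is modelled as a list of integer code
values, the start of the zone (0xE000) is an assumption, and the 6144
size is the figure quoted in the post. It does not address how a receiver
would find the boundaries of a decomposed sequence in interchanged data.

```python
class PrivateUseMapper:
    """Hypothetical sketch: map decomposed sequences to private-use codes."""

    def __init__(self, first=0xE000, size=6144):
        # first is an assumed start of the zone; size is the post's figure.
        self.limit = first + size
        self.next_free = first
        self.to_private = {}    # tuple of element codes -> private code
        self.to_sequence = {}   # private code -> tuple of element codes

    def encode(self, sequence):
        """On input: return the private code for a decomposed sequence,
        assigning a fresh one the first time the sequence is seen."""
        key = tuple(sequence)
        code = self.to_private.get(key)
        if code is None:
            if self.next_free >= self.limit:
                raise OverflowError("private use zone exhausted")
            code = self.next_free
            self.next_free += 1
            self.to_private[key] = code
            self.to_sequence[code] = key
        return code

    def export(self, text):
        """For interchange: expand private codes back to decomposed form,
        passing every other code through unchanged."""
        out = []
        for code in text:
            out.extend(self.to_sequence.get(code, (code,)))
        return out


mapper = PrivateUseMapper()
rare = [0x4E00, 0x4E28, 0x4E36]   # placeholder element codes, not a real decomposition
a = mapper.encode(rare)
b = mapper.encode(rare)           # same sequence, same private code
internal = [0x4E2D, a, 0x6587]    # fixed-width text for internal processing
external = mapper.export(internal)
print(hex(a), a == b)
print([hex(c) for c in external])
```

Internally every text element is then a single code, which is the point
of the scheme; the table is local state and must never leak into
interchanged data.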