Path: sparky!uunet!ogicse!mintaka.lcs.mit.edu!ai-lab!wheat-chex!glenn
From: glenn@wheat-chex.ai.mit.edu (Glenn A. Adams)
Newsgroups: comp.std.internat
Subject: Re: Radicals Instead of Characters
Message-ID: <1k1p4kINNjse@life.ai.mit.edu>
Date: 25 Jan 93 22:21:07 GMT
Article-I.D.: life.1k1p4kINNjse
References: <1jpj9sINNlie@flop.ENGR.ORST.EDU> <1jtbfvINNqvr@life.ai.mit.edu> <1993Jan25.194330.680@ifi.unizh.ch>
Organization: MIT Artificial Intelligence Laboratory
Lines: 62
NNTP-Posting-Host: wheat-chex.ai.mit.edu

In article <1993Jan25.194330.680@ifi.unizh.ch> mduerst@ifi.unizh.ch (Martin J. Duerst) writes:
>
>In article <1jtbfvINNqvr@life.ai.mit.edu>, glenn@wheat-chex.ai.mit.edu (Glenn A. Adams) writes:
>> precomposed encoding
>>
>> 2^20 precomposed characters *
>> 2^16 bits/precomposed character =
>> 2^36 bits
>>
>> decomposed encoding
>>
>> 2^20 precomposed characters *
>> 2^4 decomposed characters/precomposed character *
>> 2^11 bits/decomposed character =
>> 2^35 bits

Sorry, my arithmetic was broken; I should have had:

precomposed encoding

    2^20 precomposed characters *
    16 bits/precomposed character =
    2^24 bits = 2 MB

decomposed encoding

    2^20 precomposed characters *
    9 decomposed characters/precomposed character *
    11 bits/decomposed character =
    1.55 * 2^26 bits = 12.375 MB

This works out to the decomposed form being 6.2 times the size of the
precomposed form. If the decomposition required fewer elements on
average, say 4 instead of 9, that would reduce the expansion to 2.75
times the precomposed form.
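
For concreteness, here is the same arithmetic as a short Python sketch;
the repertoire size, bit widths, and element counts are just the figures
assumed above, not properties of any actual encoding:

    # Size comparison of precomposed vs. decomposed Han encodings,
    # using the assumed figures from the arithmetic above.
    MB = 8 * 2**20                    # bits per megabyte

    chars = 2**20                     # assumed size of the Han repertoire

    precomposed_bits = chars * 16     # 16 bits/character -> 2^24 bits
    print(precomposed_bits / MB)      # 2.0 (MB)

    for elems in (9, 4):              # average elements per character
        decomposed_bits = chars * elems * 11      # 11 bits/element
        print(decomposed_bits / MB,               # 12.375 MB, then 5.5 MB
              decomposed_bits / precomposed_bits) # 6.1875x, then 2.75x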

I don't know whether 4-element decompositions are achievable on average,
particularly given size and position ambiguities. Even if they were, not
only would you require nearly 3 times the size of the precomposed form,
but you would also face the complexities introduced by variable-length
text element (in this case, Han graphic symbol = Han character)
encodings.

The only reason I could see for pursuing this is to support encoding
rarely used characters rather than wasting encoding space on little-used
elements. If all characters up through the 90th percentile of frequency
are encoded as single code elements and the remainder are decomposed
using a decomposition scheme, then I expect the best efficiency in both
size and encoding-space usage. Text processing algorithms will be
somewhat complicated by having to deal with variable-length text element
encodings; however, the same is already true for handling Unicode
non-spacing diacritics on Latin and Cyrillic base forms.
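
A minimal sketch of such a hybrid scheme in Python; the tables, the
escape element, and the length prefix are all hypothetical, chosen only
to make the variable-length structure concrete:

    # Hybrid scheme sketch: frequent characters get a single code element;
    # rare characters are introduced by an escape element and spelled out
    # as decomposition elements. All values here are hypothetical.
    ESCAPE = 0xFFFF                                # marks a decomposed run

    FREQUENT  = {"A": 0x0001, "B": 0x0002}         # single-element characters
    DECOMPOSE = {"R": [0x0101, 0x0102, 0x0103]}    # rare char -> elements

    def encode(text):
        out = []
        for ch in text:
            if ch in FREQUENT:
                out.append(FREQUENT[ch])
            else:                                  # variable-length case
                elems = DECOMPOSE[ch]
                out.append(ESCAPE)
                out.append(len(elems))             # element count, for scanning
                out.extend(elems)
        return out

    print(encode("ABR"))    # [1, 2, 65535, 3, 257, 258, 259]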

One way out of this last problem is to use the Unicode Private Use Zone
(6144 code positions) as a dynamically reconfigurable character set
area. Then, upon input, translate rare decomposed Han elements (or other
decomposed text elements) to a code point dynamically assigned from the
private use zone. Of course, when interchanging the data, it would have
to be retranslated back to the decomposed form -- I assume you don't
want to send someone Unicode data containing private use zone
assignments unless you are in complete control of both the sending and
receiving parties.
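
A sketch of that dynamic assignment, again in Python; the zone base and
the element values are placeholders rather than actual Unicode
assignments:

    # Map rare decomposed sequences into the private use zone on input,
    # and expand them again before interchange. Zone base and size are
    # placeholders for the actual private use zone range.
    PUZ_BASE, PUZ_SIZE = 0xE000, 6144

    assigned  = {}    # decomposed sequence (tuple) -> private code point
    expansion = {}    # private code point -> decomposed sequence

    def internalize(seq):
        # Replace a decomposed sequence with one private-use code point.
        key = tuple(seq)
        if key not in assigned:
            if len(assigned) >= PUZ_SIZE:
                raise RuntimeError("private use zone exhausted")
            cp = PUZ_BASE + len(assigned)
            assigned[key] = cp
            expansion[cp] = key
        return assigned[key]

    def externalize(cp):
        # Retranslate to decomposed form before sending to another party.
        return list(expansion.get(cp, (cp,)))

    cp = internalize([0x0101, 0x0102, 0x0103])     # hypothetical elements
    print(hex(cp), externalize(cp))                # 0xe000 [257, 258, 259]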

Glenn Adams