NetNews Usenet Archive 1993 #3

Internet Message Format | 1993-01-25 | 6.0 KB

Path: sparky!uunet!mcsun!sunic!corax.udac.uu.se!Riga.DoCS.UU.SE!andersa
From: andersa@Riga.DoCS.UU.SE (Anders Andersson)
Newsgroups: comp.std.internat
Subject: Re: Radicals Instead of Characters
Date: 24 Jan 1993 15:31:44 GMT
Organization: Uppsala University, Sweden
Lines: 111
Distribution: world
Message-ID: <1jucp0INN5pe@corax.udac.uu.se>
References: <1jfgq1INNqmn@flop.ENGR.ORST.EDU> <2791@titccy.cc.titech.ac.jp> <1jpj9sINNlie@flop.ENGR.ORST.EDU> <1jtbfvINNqvr@life.ai.mit.edu>
NNTP-Posting-Host: riga.docs.uu.se

In article <1jtbfvINNqvr@life.ai.mit.edu>, glenn@wheat-chex.ai.mit.edu (Glenn A. Adams) writes:
> To summarize, a decomposed Han character approach may reduce the number of
> code points needed from approximately 2^16 to approximately 2^11; however
> text storage sizes will have a commensurate increase. So, to roughly gauge
> this, we might have:
>
> precomposed encoding
>
>   2^20 precomposed characters *
>   2^16 bits/precomposed character =
>   2^36 bits
>
> decomposed encoding
>
>   2^20 precomposed characters *
>   2^4 decomposed characters/precomposed character *
>   2^11 bits/decomposed character =
>   2^35 bits
>
> Thus a decomposed encoding may produce a 2 times space savings overall;
> perhaps more still if the average decomposition is much smaller than 16
> elements. However, the cost of processing now increases dramatically
> since indexing is no longer possible without parsing a string. Furthermore,
> nobody is going to use 11 bit character codes. Once you go over 8 bits,
> the only logical choice is 16 bits, or perhaps 32 bits. Since 32 bits is
> clearly overkill, there remains a 16-bit encoding model: Unicode.

Didn't you skip one step in your otherwise excellent analysis here?
The 20,000+ Han characters of Unicode today effectively employ 15 bits,
leaving 50-75% of the code space for characters from other scripts.
If the Han characters were only to use 11 bits, we surely wouldn't start
processing text in 11- or 12-bit chunks where we now use 16.
Instead, there would be more room within a 16-bit character set for yet
other scripts, as the Han characters would need less than 4% of it.
Is 16 bits enough for everything we'd like to encode, then?

Once we accept decomposition of characters into smaller elements, we are
de facto introducing variable-length encoding of characters (as a character
may consist of a variable number of elements). However, we need not allow
for arbitrary combinations of elements from different scripts (I suppose
a Japanese `grass' radical makes no sense with an Arabic `alef' letter),
so a suitable number of bits could be picked for each set of related
elements, and which set is meant could be determined from an initial
(short) bit sequence.

For storage and transmission, we need a simple way to determine where each
character starts and stops. My suggestion for the environment biased
towards 8 bits: Divide the entire character bit sequence into 7-bit bytes
(padding the last byte with 0's) and set the 8th bit of each byte except
the last to "1". Now we can easily determine where each character stops,
without knowing the intrinsics of every script encoded. Example:

    11001101 ;1=more,10=Han,011=element 3 of 8,01+110=e. 14 of 32,
    11100110 ;1=more,110 consumed above,0110+10001=e. 209 of 512,
    01000100 ;0=end,10001 consumed above,00=padding.

When reading this into a `character' data type for processing, you may
store it as 11001101 11100110 01000100 00000000, or as 10011011 10011010
00100000, or even as 10 011 01110 011010001, depending on your particular
processing and optimization needs.

The length of the actual bit sequences used to represent the different
character elements should probably be optimized with respect to some
average character frequency, in order not to unnecessarily alienate any
particular portion of the market. For instance, we may be able to cram
the most frequent Latin and Cyrillic letters (such as the lowercase ones)
into the first octet, but not much else.
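
The byte-splitting rule suggested above can be sketched in a few lines of
Python. This is only an illustration under stated assumptions: a character
is given as a string of '0'/'1' digits, and the names pack_character,
split_characters and payload_bits are hypothetical, not taken from the post
or from any standard. The sample value is the Han example given above
(10, 011, 01110, 011010001).

```python
def pack_character(bits):
    """Split one character's bit string into 7-bit groups.

    Every byte except the last gets its high (8th) bit set to 1;
    the last group is padded on the right with zeros.
    """
    if not bits or set(bits) - {"0", "1"}:
        raise ValueError("bits must be a non-empty string of 0s and 1s")
    padded = bits + "0" * (-len(bits) % 7)          # pad to a multiple of 7
    groups = [padded[i:i + 7] for i in range(0, len(padded), 7)]
    out = bytearray()
    for index, group in enumerate(groups):
        more = 0x80 if index < len(groups) - 1 else 0x00
        out.append(more | int(group, 2))
    return bytes(out)


def split_characters(stream):
    """Cut a byte stream into characters by looking only at the high bit."""
    chars, current = [], bytearray()
    for byte in stream:
        current.append(byte)
        if byte & 0x80 == 0:        # high bit clear: last byte of a character
            chars.append(bytes(current))
            current = bytearray()
    if current:
        raise ValueError("stream ends in the middle of a character")
    return chars


def payload_bits(char):
    """Return the 7-bit payloads of one character as a single bit string."""
    return "".join(format(b & 0x7F, "07b") for b in char)


# Assumed sample: script tag 10, then elements 3 of 8, 14 of 32, 209 of 512.
han = "10" + "011" + "01110" + "011010001"
packed = pack_character(han)
print(packed.hex())                  # cd e6 44 without the spaces: "cde644"
print(payload_bits(packed))          # the 19 original bits plus 2 padding zeros
```

Note that split_characters needs no knowledge of any script: it inspects
only the high bit of each byte, which is the property the scheme is after.
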
Compare with Morse code.

A difference between this method and UTF encoding (as I have understood
it from examples posted to this forum) is that the length of the bit
sequence used to represent a `character' is directly dependent on the
length of the element codes, rather than on the position of the character
in a fixed-width code table. I haven't analyzed this in detail; maybe
it can be shown that there exists a fixed-width encoding such that it
equals or surpasses my variable-width element encoding in terms of storage
compactness.

In any case, my aim is not to provide the most space-efficient encoding
possible, but to minimize the need for table lookups for doing basic
character identification and processing (such as case unification and
font selection). Of course, table lookups are fast, but using a table
for converting "a" to "A" looks like overkill to me. Part of this aim
is to allow for (some) future extensions to the code without distributing
new tables each time. Am I misjudging the efforts required to implement
the different solutions?

Please note that I'm not saying that this point of mine makes decomposed
Han character encoding worthwhile; I'm simply not versed enough with Han
characters to tell what it would mean in terms of storage and processing.
Maybe there are reasonable compromises between a full precomposed set,
and a fully decomposed system based on radicals or whatever character
component is found useful.

Han characters are not the only candidates for decomposition; I'd decompose
Latin ligatures and letters with diacritical marks right away (assuming
we are still talking about a potential character set, and not Unicode
itself). I don't see much reason for giving the superscript "TM" compound
symbol a code of its own (other than the aforementioned backward
compatibility goal), as we'll need a generic superscript mechanism anyway.
Same for the KSC (and GB?) measurement units and encircled numerals.
Why stop at "(20)", really?
Maybe someone wants "(21)"?

--
Anders Andersson, Dept. of Computer Systems, Uppsala University
Paper Mail: Box 325, S-751 05 UPPSALA, Sweden
Phone: +46 18 183170
EMail: andersa@DoCS.UU.SE