Path: sparky!uunet!mcsun!sunic!corax.udac.uu.se!Riga.DoCS.UU.SE!andersa
From: andersa@Riga.DoCS.UU.SE (Anders Andersson)
Newsgroups: comp.std.internat
Subject: Re: Radicals Instead of Characters
Date: 24 Jan 1993 15:31:44 GMT
Organization: Uppsala University, Sweden
Lines: 111
Distribution: world
Message-ID: <1jucp0INN5pe@corax.udac.uu.se>
References: <1jfgq1INNqmn@flop.ENGR.ORST.EDU> <2791@titccy.cc.titech.ac.jp> <1jpj9sINNlie@flop.ENGR.ORST.EDU> <1jtbfvINNqvr@life.ai.mit.edu>
NNTP-Posting-Host: riga.docs.uu.se

In article <1jtbfvINNqvr@life.ai.mit.edu>, glenn@wheat-chex.ai.mit.edu (Glenn A. Adams) writes:
> To summarize, a decomposed Han character approach may reduce the number of
> code points needed from approximately 2^16 to approximately 2^11; however
> text storage sizes will have a commensurate increase. So, to roughly gauge
> this, we might have:
>
> precomposed encoding
>
>   2^20 precomposed characters *
>   2^16 bits/precomposed character =
>   2^36 bits
>
> decomposed encoding
>
>   2^20 precomposed characters *
>   2^4 decomposed characters/precomposed character *
>   2^11 bits/decomposed character =
>   2^35 bits
>
> Thus a decomposed encoding may produce a 2 times space savings overall;
> perhaps more still if the average decomposition is much smaller than 16
> elements. However, the cost of processing now increases dramatically
> since indexing is no longer possible without parsing a string. Furthermore,
> nobody is going to use 11 bit character codes. Once you go over 8 bits,
> the only logical choice is 16 bits, or perhaps 32 bits. Since 32 bits is
> clearly overkill, there remains a 16-bit encoding model: Unicode.

Didn't you skip one step in your otherwise excellent analysis here?
The 20,000+ Han characters of Unicode today effectively employ 15 bits,
leaving 50-75% of the code space for characters from other scripts.
If the Han characters were only to use 11 bits, we surely wouldn't
start processing text in 11- or 12-bit chunks where we now use 16.
Instead, there would be more room within a 16-bit character set for
yet other scripts, as the Han characters would need less than 4% of it.
Is 16 bits enough for everything we'd like to encode, then?

Once we accept decomposition of characters into smaller elements,
we are de facto introducing variable-length encoding of characters
(as a character may consist of a variable number of elements).
However, we need not allow for arbitrary combinations of elements
from different scripts (I suppose a Japanese `grass' radical makes
no sense with an Arabic `alef' letter), so a suitable number of bits
could be picked for each set of related elements, and which set is
meant could be determined from an initial (short) bit sequence.

For storage and transmission, we need a simple way to determine
where each character starts and stops. My suggestion for the
environment biased towards 8 bits: divide the entire character
bit sequence into 7-bit groups (padding the last group with 0's),
place each group in an 8-bit byte, and set the 8th bit of every
byte except the last to "1". Now we can easily determine where
each character stops, without knowing the internals of every
script encoded. Example:

11001101 ;1=more,10=Han,011=element 3 of 8,01+110=e. 14 of 32,
11100110 ;1=more,110 consumed above,0110+10001=e. 209 of 512,
01000100 ;0=end,10001 consumed above,00=padding.
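
To make the boundary scan concrete, here is a minimal C sketch
(the function name and types are my own; I only assume the byte
layout just described, with bit 8 as the "more" flag):

    #include <stddef.h>

    /* Return the number of bytes occupied by the character
       starting at s, without interpreting the script-specific
       element codes inside it. */
    size_t char_length(const unsigned char *s)
    {
        size_t n = 1;
        while (*s++ & 0x80)     /* bit 8 set: more bytes follow */
            n++;
        return n;
    }

    /* For the example bytes above, {0xCD, 0xE6, 0x44},
       char_length returns 3. */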

When reading this into a `character' data type for processing,
you may store it as 11001101 11100110 01000100 00000000, or as
10011011 10011010 00100000, or even as 10 011 01110 011010001,
depending on your particular processing and optimization needs.
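
The second and third forms are essentially the concatenated 7-bit
payloads; here is a sketch of extracting them into a machine word
(again my own names, good for characters of up to four bytes,
i.e. 28 payload bits):

    /* Concatenate the 7-bit payloads of one character (len
       bytes, as given by char_length above) into the low bits
       of an unsigned long.  For the example this yields the
       21-bit value 100110111001101000100, i.e. the third form
       with its two padding zeroes still attached. */
    unsigned long char_bits(const unsigned char *s, size_t len)
    {
        unsigned long v = 0;
        size_t i;

        for (i = 0; i < len; i++)
            v = (v << 7) | (s[i] & 0x7F);
        return v;
    }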

The length of the actual bit sequences used to represent the
different character elements should probably be optimized with
respect to some average character frequency, in order not to
unnecessarily alienate any particular portion of the market.
For instance, we may be able to cram the most frequent Latin
and Cyrillic letters (such as the lowercase ones) into the
first octet, but not much else. Compare with Morse code,
which gives the most frequent letters the shortest codes.

A difference between this method and UTF encoding (as I have
understood it from examples posted to this forum) is that the
length of the bit sequence used to represent a `character' is
directly dependent on the length of the element codes, rather
than on the position of the character in a fixed-width code
table. I haven't analyzed this in detail; maybe it can be
shown that there exists a fixed-width encoding that equals
or surpasses my variable-width element encoding in terms of
storage compactness.

In any case, my aim is not to provide the most space-efficient
encoding possible, but to minimize the need for table lookups
in basic character identification and processing (such as case
unification and font selection). Of course, table lookups are
fast, but using a table for converting "a" to "A" looks like
overkill to me. Part of this aim is to allow for (some) future
extensions to the code without distributing new tables each
time. Am I misjudging the efforts required to implement the
different solutions?
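
As a toy illustration of the kind of arithmetic I have in mind
(plain ASCII here, not the proposed encoding): upper- and
lowercase basic Latin letters differ in a single bit, so case
unification needs no table at all.

    /* In ASCII, 'a' (0x61) and 'A' (0x41) differ only in bit
       0x20, and likewise for the rest of the alphabet, so the
       mapping is pure bit manipulation. */
    int ascii_upcase(int c)
    {
        if (c >= 'a' && c <= 'z')
            return c & ~0x20;
        return c;
    }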

Please note that I'm not saying that this point of mine makes
decomposed Han character encoding worthwhile; I'm simply not
versed enough in Han characters to tell what it would mean
in terms of storage and processing. Maybe there are reasonable
compromises between a full precomposed set and a fully decomposed
system based on radicals or whatever character component is found
useful.

Han characters are not the only candidates for decomposition;
I'd decompose Latin ligatures and letters with diacritical marks
right away (assuming we are still talking about a potential
character set, and not Unicode itself). I don't see much reason
for giving the superscript "TM" compound symbol a code of its
own (other than the aforementioned backward compatibility goal),
as we'll need a generic superscript mechanism anyway. The same
goes for the KSC (and GB?) measurement units and encircled
numerals. Why stop at "(20)", really? Maybe someone wants "(21)"?
--
Anders Andersson, Dept. of Computer Systems, Uppsala University
Paper Mail: Box 325, S-751 05 UPPSALA, Sweden
Phone: +46 18 183170  EMail: andersa@DoCS.UU.SE