NetNews Usenet Archive 1993 #3

Internet Message Format | 1993-01-21 | 5.2 KB

Path: sparky!uunet!sequent!gaia.ucs.orst.edu!flop.ENGR.ORST.EDU!jade.CS.ORST.EDU!crowl
From: crowl@jade.CS.ORST.EDU (Lawrence Crowl)
Newsgroups: comp.std.internat
Subject: Re: Radicals Instead of Characters
Date: 21 Jan 1993 21:07:10 GMT
Organization: Computer Science Department, Oregon State University
Lines: 93
Message-ID: <1jn39uINNbk0@flop.ENGR.ORST.EDU>
References: <1j8kroINNf59@flop.ENGR.ORST.EDU> <1993Jan18.212846.3030@fcom.cc.utah.edu> <mvdvalk.727454246@rhone> <MELBY.93Jan21144739@dove.yk.fujitsu.co.jp> <1jlngtINNqnk@life.ai.mit.edu>
NNTP-Posting-Host: jade.cs.orst.edu

In article <mvdvalk.727454246@rhone> mvdvalk@cs.utwente.nl (Martijn van der Valk) writes:
>Maybe it's because I'm just too dumb to see the point of using ``radicals
>instead of characters'', but to me it seems that the majority of Chinese
>characters contain a lot of ``familiar sets of strokes'' which are NOT radicals
>according to KangXi dictionary.  How to encode these?  Enlarge the set of
>radicals to encompass these ``pseudo-radicals''?  Anyway, I don't get the
>point.  Could the original poster please re-explain what he means with the
>original statement?

The proposal was that each of the 214 radicals be given a code point, much as
letters are.  Each CJK character would be represented as a sequence of radical
codes, just as an English word is represented as a sequence of letter codes.
The important criterion is that each CJK character be uniquely determined by a
sequence of radicals, not that it _appear_ as a simple composition of radicals.

The advantage of this approach is that it permits coding (nearly) all of the
more than 50,000 CJK characters with (roughly) 214 code points, in contrast to
the current Unicode scheme, which (presently) encodes more than 20,000
characters in more than 20,000 code points.  Fewer code points translate into
smaller tables, possibly fewer bits in the code, and potentially lower costs.
With this approach, an international code might fit into 12 bits instead of 16.
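
As a rough check on the code-width figures above, here is a minimal Python
sketch.  The helper names bits_needed and width_saving are illustrative only
and are not part of any proposal; the 12-bit and 16-bit widths are simply the
figures assumed above, and the character counts are the lower bounds quoted
above.

```python
def bits_needed(code_points: int) -> int:
    """Smallest fixed code width, in bits, that can number `code_points` items."""
    if code_points < 1:
        raise ValueError("need at least one code point")
    # (n - 1).bit_length() is ceil(log2(n)) for n >= 2, without floating point.
    return max(1, (code_points - 1).bit_length())


def width_saving(old_bits: int, new_bits: int) -> float:
    """Fraction of text length saved by moving from old_bits to new_bits per code."""
    return 1 - new_bits / old_bits


if __name__ == "__main__":
    # Counts taken from the discussion; the two CJK counts are lower bounds.
    for label, count in [("radicals", 214),
                         ("Han characters in Unicode", 20_000),
                         ("all CJK characters", 50_000)]:
        print(f"{label}: {count} code points need {bits_needed(count)} bits")
    # Assumed widths from the post: a 12-bit code instead of a 16-bit one.
    print(f"text length saved, 16 -> 12 bits: {width_saving(16, 12):.0%}")
```

The 214 radicals alone fit in 8 bits, so a 12-bit code leaves room for the
other scripts; whether they all fit is the assumption, not something the
sketch establishes.
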
Countries not requiring CJK characters would then save 25% in the length of
their text and 95% in the size of display tables and so forth.

The disadvantage of the radical coding approach is that individual characters
now require multiple code points.  Suppose

  - the hypothetical international radical-coded character set requires
    12 bits,
  - the number of distinct characters to be represented requires 18 bits, and
  - there are an average of 2.5 radicals per character.

A paragraph of CJK characters would then require about 67% more bits with the
radical-coded approach (2.5 x 12 = 30 bits per character instead of 18).
Table sizes for CJK fonts would not be affected.

In article <1jlngtINNqnk@life.ai.mit.edu> glenn@muesli.ai.mit.edu (Glenn A. Adams) writes:
>In article <MELBY.93Jan21144739@dove.yk.fujitsu.co.jp>
>melby@dove.yk.fujitsu.co.jp (John B. Melby) writes:
>>Looking at Han characters in a probabilistic sense probably is not going
>>to help much, since the positioning of radicals varies widely between
>>characters.
>
>The idea being discussed for Han decomposition would have different
>combining radicals for each of the possible positions the radical
>could take; e.g. MAN-LEFT, MAN-TOP, MAN-BOTTOM, etc.

I was thinking more that radicals would have a defined order, so that, for
instance, the first radical coded would be the one on the upper left.

>>(1) some rare characters cannot be expressed in this manner,
>
>Characters which could not be decomposed in this manner would be
>represented in their entirety (i.e., as non-decomposed symbols).
>
>>(2) allowing the display of arbitrary characters using this sort of
>>composition does not mean that their components will be aesthetically
>>spaced.
>
>A system that displayed such decomposed symbols would most likely
>employ a font which either (1) contained glyphs that represented the
>entire symbol; or (2) contained internal instructions that would allow
>it to position the radical properly.  In both cases, the correct
>display geometry would be used.
>The display engine would have to
>map multiple coded character elements to single glyph references
>or multiple glyph references as appropriate.

In addition, display systems using the radical-coded approach can provide
cheap, low-quality display of CJK characters by composing the radicals.  This
would permit display of CJK characters in those markets where the need for
such display is rare.  I can't imagine anyone selling such displays to someone
who uses CJK characters more than rarely.

>>A 16 bit font is insufficient for encoding rare characters, whichever way
>>you look at it, although having 16-bit CJK unification and a user-defined
>>character facility may be sufficient for an average user.
>
>Keep in mind that there is no necessary relation between a 16-bit character
>encoding and a 16-bit font.  One can have a 16-bit character encoding like
>Unicode (with 20,902 precomposed Han characters, and possibly a collection
>of combining radical characters) and display with a 16-bit font that contains
>2^16 Han glyphs, or even with a 24-bit font, a 32-bit font, etc.  The
>relation of Unicode character code to font code is not defined by the
>Unicode display model.

-- 
Lawrence Crowl          503-737-2554    Computer Science Department
crowl@cs.orst.edu                       Oregon State University
...!hplabs!hp-pcd!orstcs!crowl          Corvallis, Oregon, 97331-3202
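
The mapping from coded radical sequences to glyph references discussed above
might be sketched in Python roughly as follows.  Everything here is an
assumption made for illustration: the table PRECOMPOSED, the function
glyph_refs, and the glyph reference strings are hypothetical, and the input is
taken to be already split into one radical sequence per character, which the
discussion does not specify.  The radical numbers are the usual KangXi
numbers (9 man, 38 woman, 39 child, 72 sun, 74 moon, 75 tree).

```python
from typing import Dict, List, Tuple

RadicalSeq = Tuple[int, ...]

# Hypothetical font table: whole-character glyphs keyed by radical sequence.
PRECOMPOSED: Dict[RadicalSeq, str] = {
    (38, 39): "glyph:good",    # woman + child
    (72, 74): "glyph:bright",  # sun + moon
}


def glyph_refs(seq: RadicalSeq) -> List[str]:
    """Map one coded character (a radical sequence) to glyph references.

    If the font holds a glyph for the whole character, return that single
    reference.  Otherwise fall back to one reference per radical, which a
    cheap display could compose into a low-quality rendering.
    """
    if seq in PRECOMPOSED:
        return [PRECOMPOSED[seq]]
    return [f"radical:{code}" for code in seq]


if __name__ == "__main__":
    print(glyph_refs((38, 39)))  # the font has the whole character
    print(glyph_refs((9, 75)))   # man + tree: not in the table, so composed
```

A font with full CJK coverage would fill the table and never take the
fallback path; a display for markets that rarely need CJK could ship with an
empty table and only the 214 radical glyphs.
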