- Newsgroups: comp.unix.bsd
- Path: sparky!uunet!gatech!news.byu.edu!ux1!fcom.cc.utah.edu!cs.weber.edu!terry
- From: terry@cs.weber.edu (A Wizard of Earth C)
- Subject: Re: multibyte character representations and Unicode
- Message-ID: <1992Nov23.193620.9513@fcom.cc.utah.edu>
- Sender: news@fcom.cc.utah.edu
- Organization: Weber State University (Ogden, UT)
- References: <721993836.11625@minster.york.ac.uk>
- Date: Mon, 23 Nov 92 19:36:20 GMT
- Lines: 73
-
- In article <721993836.11625@minster.york.ac.uk> forsyth@minster.york.ac.uk writes:
- >Terry Weber suggests that half one's disc space will vanish
- >on adopting Unicode. Not so: I draw your attention to Plan 9,
- >which uses Unicode very successfully. See the Plan 9 documentation
- >on research.att.com (dist/plan9doc, I think).
-
- If you are talking about truly using the Unicode standard, then you are
- talking about using 16 bits for English characters instead of 8 bits.
- While your disk space wouldn't "disappear", it would be halved, unless you
- mixed storage of Unicode and ASCII on the same disk.
-
- Unicode contains a total of 34,348 characters. This is 52% of the largest
- number of characters representable in 16 bits (65536), and is also larger
- than what can be represented with conditional multibyting (8th bit set on
- first character indicating multibyte, otherwise 7 bit ASCII), which is
- 32,896 characters (128 + 128 * 256).
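
The arithmetic in the paragraph above can be checked with a short sketch. The character count (34,348) is the figure quoted in this post; the variable names are only illustrative.

```python
# Figures taken from the paragraph above.
UNICODE_CHARS = 34_348        # character count quoted in the post
SIXTEEN_BIT_CODES = 2 ** 16   # 65536 distinct 16-bit values

# Conditional multibyte: 128 one-byte codes (7-bit ASCII) plus
# 128 lead bytes (8th bit set), each followed by one free byte.
conditional_multibyte = 128 + 128 * 256

fraction_used = UNICODE_CHARS / SIXTEEN_BIT_CODES
shortfall = UNICODE_CHARS - conditional_multibyte

print(f"{fraction_used:.0%} of the 16-bit code space is used")
print(f"conditional multibyte capacity: {conditional_multibyte}")
print(f"characters that would not fit: {shortfall}")
```

By these numbers the two-byte conditional scheme falls 1,452 characters short.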
-
- It seems to me that Unicode representation on disk requires 2 bytes per
- character (symbol). Thus a document file in English that used to take
- 2K stored in ASCII would take 4K (2K symbols * 2 bytes per symbol) to
- store in Unicode.
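
A minimal sketch of the doubling, assuming a fixed two bytes per character. Python's "utf-16-be" codec is used here only as a stand-in for a straight 16-bit representation (it emits exactly two bytes for each ASCII-range character and no byte-order mark); the helper name `storage_sizes` is hypothetical.

```python
def storage_sizes(text):
    """Return (bytes as 8-bit ASCII, bytes as 16-bit code units)."""
    ascii_bytes = text.encode("ascii")
    # Two bytes per character for this text, no byte-order mark.
    wide_bytes = text.encode("utf-16-be")
    return len(ascii_bytes), len(wide_bytes)


document = "x" * 2048  # stand-in for a 2K English document
narrow, wide = storage_sizes(document)
print(narrow, wide)
```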
-
- >Eventually Plan 9 switched to a new encoding -- which apparently has now been
- >proposed for use in ISO 10646 -- that lacks all the unfortunate features.
- >The second and third bytes of the encoding do not look like ASCII characters.
- >(All bytes of an encoded character have the 0x80 bit set.)
- >The consequence is that even fewer programs are affected:
- >most pass Unicode encodings straight through.
-
- This isn't really Unicode unless it follows Unicode encoding, and it lacks
- the ability to provide a fixed size per symbol storage mechanism, but I
- agree that ISO 10646 is a real possibility, although it seems rather
- English-centric.
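
To illustrate both the quoted property (every byte of a multibyte character has the 0x80 bit set) and the loss of a fixed size per symbol, here is a small sketch. It assumes the Plan 9 encoding under discussion behaves like what Python calls "utf-8"; that identification is an assumption, not something stated in this thread. The helper name `describe` is hypothetical.

```python
def describe(ch):
    """Return (encoded length in bytes, True if every byte has 0x80 set)."""
    encoded = ch.encode("utf-8")
    all_high = all(b & 0x80 for b in encoded)
    return len(encoded), all_high


# ASCII letter, Latin e-acute, Cyrillic Zhe, and a Han character.
samples = ["A", "\u00e9", "\u0416", "\u6f22"]
for ch in samples:
    length, all_high = describe(ch)
    print(f"U+{ord(ch):04X}: {length} byte(s), all high bits set: {all_high}")
```

The ASCII character stays at one byte, while the others take two or three, so the size per symbol varies with the character.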
-
- In X, to provide an 8x8 Unicode font, it takes 274784 bytes of storage for
- the actual font glyphs, plus overhead; a 10x20 takes 1030440 (just under a
- Meg, assuming the overhead is less than 18K). Both could easily be done in
- ROM.
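
The font sizes above follow from the glyph count times the bytes per glyph. In this sketch the 8 bytes per 8x8 glyph is one byte per row; the 30 bytes per 10x20 glyph is inferred from the quoted total (1030440 / 34348) and is an assumption about how the rows are stored.

```python
GLYPHS = 34_348  # character count quoted earlier in the post


def bitmap_bytes(glyphs, bytes_per_glyph):
    """Raw glyph storage, ignoring any per-font overhead."""
    return glyphs * bytes_per_glyph


small = bitmap_bytes(GLYPHS, 8)    # 8x8: 8 rows of 1 byte each
large = bitmap_bytes(GLYPHS, 30)   # 10x20: 30 bytes/glyph, inferred from the total
headroom = 2 ** 20 - large         # bytes left before reaching one megabyte

print(small, large, headroom)
```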
-
- Without multibyte encoding (ie: straight 16 bit characters), the output is
- straightforward using X. The same is true for an "English-only" or other
- ("Cyrillic-only", etc.) font, since X fonts are allowed to be sparse;
- thus the full Unicode font is only necessary for multinational use of the
- same device... even then, the number of glyphs in a font need only be
- enough to intersect both sets.
-
- Thus, in many cases, font-fill centric encoding (ie: this is the font I used,
- and these are the 8 bit representations of the Unicode characters lexically
- within the font) is sufficient to provide 8 bit storage for all but Kanji.
- If the Japanese could limit themselves to Kana (Katakana/Hiragana), then
- they could benefit from this storage technique as well (this would
- also go a long way towards making them compute-competitive and reduce the
- hoops one jumps through when using a Kanji keyboard).
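
One way to read the font-fill idea is as a per-font lookup table: sort the code points the font contains, and store each character as its 8-bit position in that list. The sketch below is only an illustration under that reading; the function names and the "Cyrillic-only" font contents are hypothetical.

```python
def build_font_fill_table(font_code_points):
    """Sorted code points of the font; the list index is the 8-bit code."""
    ordered = sorted(set(font_code_points))
    if len(ordered) > 256:
        raise ValueError("font has more than 256 glyphs; 8 bits are not enough")
    return ordered


def encode(text, table):
    """Store each character as its index in the font table."""
    index_of = {cp: i for i, cp in enumerate(table)}
    # Raises KeyError for a character the font does not contain.
    return bytes(index_of[ord(ch)] for ch in text)


def decode(data, table):
    """Map the 8-bit indexes back to Unicode characters."""
    return "".join(chr(table[b]) for b in data)


# Hypothetical Cyrillic-only font: space plus U+0410..U+044F.
cyrillic_font = [0x20] + list(range(0x0410, 0x0450))
table = build_font_fill_table(cyrillic_font)

# Two Russian words separated by a space, written with escapes.
message = "\u041f\u0440\u0438\u0432\u0435\u0442 \u043c\u0438\u0440"
packed = encode(message, table)
print(len(message), len(packed), decode(packed, table) == message)
```

Each character occupies one byte on disk, but the file is only meaningful together with the font (or table) it was encoded against, which is why the encoding would have to be recorded as a file storage attribute.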
-
- >In particular, the `normal' file system names can hold Unicode
- >characters without fuss. There is certainly no need to switch to 16-bit
- >representations for them, with all that that entails.
-
- No argument here; however, I would say that picking a font-fill encoding
- as a file storage attribute would be sufficient for this as well.
-
-
- Terry Lambert
- terry@icarus.weber.edu
- terry_lambert@novell.com
- ---
- Any opinions in this posting are my own and not those of my present
- or previous employers.
- --
- -------------------------------------------------------------------------------
- "I have an 8 user poetic license" - me
- Get the 386bsd FAQ from agate.berkeley.edu:/pub/386BSD/386bsd-0.1/unofficial
- -------------------------------------------------------------------------------
-