- Path: sparky!uunet!mcsun!uknet!yorkohm!minster!forsyth
- From: forsyth@minster.york.ac.uk
- Newsgroups: comp.unix.bsd
- Subject: multibyte character representations and Unicode
- Message-ID: <721993836.11625@minster.york.ac.uk>
- Date: 17 Nov 92 09:50:36 GMT
- Organization: Department of Computer Science, University of York, England
- Lines: 42
-
- Terry Weber suggests that half one's disc space will vanish
- on adopting Unicode. Not so: I draw your attention to Plan 9,
- which uses Unicode very successfully. See the Plan 9 documentation
- on research.att.com (dist/plan9doc, I think).
-
- Basically, there is a multibyte encoding for Unicode that works well.
- Inside relatively FEW programs the multibyte encoding is converted
- to an integer representation (the type `Rune') to simplify manipulation.
- For instance, the text displayed in a text frame by sam or the window
- manager is kept as Runes, but ONLY the text displayed. Any hidden
- text -- and text in disc files -- is kept in the multibyte encoding.
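-
- A minimal sketch of that split, not Plan 9 code: it assumes UTF-8 as a
- stand-in for the multibyte encoding, and uses plain Python integers
- where Plan 9 uses the C type `Rune'. The helper names to_runes and
- from_runes are made up for the illustration.

```python
# Sketch only: UTF-8 stands in for the multibyte encoding, and a "Rune"
# here is just a Python int, not Plan 9's C type.

def to_runes(data):
    """Decode multibyte text into a list of integer code points."""
    return [ord(ch) for ch in data.decode("utf-8")]

def from_runes(runes):
    """Encode a list of integer code points back into multibyte form."""
    return "".join(chr(r) for r in runes).encode("utf-8")

# What would sit in a disc file: compact, mostly one byte per character.
stored = "8\u00bd windows".encode("utf-8")

# Expanded to integers only while the text is being manipulated.
runes = to_runes(stored)
runes.reverse()            # per-character work is trivial on integers

# Back to the multibyte form before it is stored again.
reversed_stored = from_runes(runes)
```

- Only the piece of text being worked on pays the cost of the wider
- representation; everything else stays in the compact form.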
-
- Some care is required in specifying the multibyte encoding.
- It seems that Plan 9 originally followed the encoding specified in
- the Unicode standard, but that encoding has some messy consequences in practice:
- not least that the 2nd and 3rd bytes can appear to be valid
- ASCII. (Why anyone would design an encoding that does this is beyond
- me, since the problems are fairly obvious, but that's what Unicode did.)
- Eventually Plan 9 switched to a new encoding -- which apparently has now been
- proposed for use in ISO 10646 -- that lacks all the unfortunate features.
- The second and third bytes of the encoding do not look like ASCII characters.
- (Every byte of an encoded non-ASCII character has the 0x80 bit set.)
- The consequence is that even fewer programs are affected:
- most pass Unicode encodings straight through.
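-
- To make the high-bit property concrete, here is a hedged sketch of an
- encoder for 16-bit values. I am assuming the bit layout of what is now
- called UTF-8; the function name encode_rune is hypothetical and this
- is not the Plan 9 library routine.

```python
def encode_rune(r):
    """Encode one 16-bit code point using the (assumed) UTF-8 bit layout."""
    if r < 0 or r > 0xFFFF:
        raise ValueError("sketch handles 16-bit values only")
    if r < 0x80:
        # ASCII is encoded as itself, in one byte.
        return bytes([r])
    if r < 0x800:
        # Two bytes: 110xxxxx 10xxxxxx
        return bytes([0xC0 | (r >> 6), 0x80 | (r & 0x3F)])
    # Three bytes: 1110xxxx 10xxxxxx 10xxxxxx
    return bytes([0xE0 | (r >> 12),
                  0x80 | ((r >> 6) & 0x3F),
                  0x80 | (r & 0x3F)])

def looks_like_ascii(encoded):
    """True if any byte of the encoding could be mistaken for ASCII."""
    return any(b < 0x80 for b in encoded)

half = encode_rune(0xBD)      # the 1/2 symbol, two bytes
cjk = encode_rune(0x4E2D)     # a CJK ideograph, three bytes
```

- Neither multibyte result contains a byte below 0x80, so a program that
- only looks for ASCII delimiters cannot be fooled by them.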
-
- In particular, the `normal' file system names can hold Unicode
- characters without fuss. There is certainly no need to switch to 16-bit
- representations for them, with all that that entails.
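-
- A small illustration of why byte-oriented path handling survives, again
- assuming UTF-8 as the multibyte encoding and using big-endian UTF-16 as
- a stand-in for a raw 16-bit representation. The function split_path and
- the file name are invented for the example.

```python
def split_path(path):
    """Split a path on the ASCII slash byte, knowing nothing about Unicode."""
    return path.split(b"/")

# A two-component name containing non-ASCII characters.
name = "caf\u00e9/\u012fra\u0161as.txt"

# In the multibyte encoding the only 0x2F byte is the real separator.
utf8_path = name.encode("utf-8")
parts = split_path(utf8_path)

# In a 16-bit representation, U+012F is the byte pair 0x01 0x2F, so its
# second byte is indistinguishable from a slash to byte-oriented code.
wide_char = "\u012f".encode("utf-16-be")
false_slash = b"/" in wide_char
```

- The byte-level split yields exactly two components, each of which is
- still valid multibyte text, whereas the 16-bit form would need every
- program that touches file names to be rewritten.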
-
- Actually, on Plan 9 you cannot even run the window manager without using
- Unicode: its name is `eight and a half' (i.e., 8 followed by a 1/2 symbol!),
- entered as `8 ALT 1 2' (on my keyboard, anyhow).
-
- You can find much of the Plan 9 Rune support in
- the source for Pike's editor `sam', also on research.att.com
- (dist/sam, I think).
- (You also get a very decent editor, a library that gives you a sane
- interface to X11, and a library for managing text on a bitmap display.)
-
- Obviously programs can store Runes in disc files if that's really what
- they need, or if their authors work for disc manufacturers, but it
- isn't necessary.
-