- Newsgroups: comp.unix.bsd
- Path: sparky!uunet!ferkel.ucsb.edu!taco!rock!stanford.edu!agate!spool.mu.edu!wupost!usc!sol.ctr.columbia.edu!eff!news.byu.edu!ux1!fcom.cc.utah.edu!cs.weber.edu!terry
- From: terry@cs.weber.edu (A Wizard of Earth C)
- Subject: [386bsd] INTERNATIONALIZATION (was can't deal with 8-bit input)
- Message-ID: <1992Nov16.232035.6307@fcom.cc.utah.edu>
- Summary: Going global
- Sender: news@fcom.cc.utah.edu
- Organization: Weber State University (Ogden, UT)
- References: <1992Nov16.081801.15019@kum.kaist.ac.kr>
- Date: Mon, 16 Nov 92 23:20:35 GMT
- Lines: 118
-
- In article <1992Nov16.081801.15019@kum.kaist.ac.kr> jbkang@csking.kaist.ac.kr (Joongbin Kang) writes:
- > ...But another problem occured during using the 'hanterm', Korean version
- > of xterm. It can display Korean texts with MSB set (the same to most
- > oriental languages, such as kanji etc), but I couldn't input Korean text.
- > Hanterm itself provides Korean input automata, and it should work well
- > with X11R5. Another test shows that kernel seems to have trouble with
- > multibyte characters.
- > % cat
- > test
- > test (echoed to tty)
- > ^D
- > % cat
- > xxxx(entered korean characters -- it can be seen when typing)
- > (but no echo to tty!)
- > ^D (this DIDN'T work)
- > ^C
- > %
- > So, what's the problem? If I cannot use hangul in 386bsd, it loses
- > practicality...Help!
-
- Most likely, the echo to your X term was broken when it sent you back
- your characters... seriously!
-
- The default cflags for a tty in 386BSD strip parity (the 8th bit) by
- setting cs7 and setting even parity (-parodd, parenb), and setting the
- iflag istrip. The fact that you got the characters you typed back at
- all is an indicator that istrip wasn't working on echoed characters.
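-
- For anyone who wants to see the mode change spelled out, here is a minimal
- sketch using Python's termios module. The helper names make_8bit_clean and
- apply_to_tty are made up for illustration; the flag names are the standard
- termios ones. It assumes the goal is simply cs8, no parity, and no istrip.
-
```python
import termios

# Indices into the list returned by termios.tcgetattr()
IFLAG, OFLAG, CFLAG, LFLAG = 0, 1, 2, 3


def make_8bit_clean(attrs):
    """Return a copy of a tcgetattr()-style list with 8-bit-clean settings."""
    new = list(attrs)
    new[IFLAG] &= ~termios.ISTRIP                    # stop stripping bit 8 on input
    new[CFLAG] &= ~(termios.PARENB | termios.CSIZE)  # no parity, clear size bits
    new[CFLAG] |= termios.CS8                        # 8 data bits per character
    return new


def apply_to_tty(fd):
    """Apply the settings to an open terminal (fd must refer to a tty)."""
    attrs = termios.tcgetattr(fd)
    termios.tcsetattr(fd, termios.TCSANOW, make_8bit_clean(attrs))
```
-
- From a shell, "stty cs8 -parenb -istrip" should have much the same effect.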
-
- An additional problem with ANSI terminal emulation, if not internationalized,
- is the C1 control characters (0x80-0x9f), which are seen as <ESC> followed by
- <char - 0x40> (ie 0x9b = 0x1b 0x5b ...or... <CSI> = <ESC>[). Basically, you
- have to disable this functionality to get around the problem with 0x80-0x9f
- range characters. SCO "gets around" the problem by allowing the output of the
- characters in this range with an escape sequence to pick the character
- set in that range instead of the normal output (<ESC>[12m? It's been
- about 3 years since I wrote the SCO color console emulator for TERM from
- Century Software). Doing it this way gets you a PC character set on your
- console, but it is hardly 8-bit clean.
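-
- The 8-bit control mapping is easy to see in a few lines. This is a rough
- sketch of the standard ECMA-48 equivalence (an 8-bit C1 control is the same
- as ESC followed by the byte minus 0x40); the function names are invented for
- the example, and a real emulator would do this inside its parser.
-
```python
ESC = 0x1B


def c1_to_7bit(byte):
    """Expand an 8-bit C1 control (0x80-0x9f) to its 7-bit ESC form."""
    if 0x80 <= byte <= 0x9F:
        return bytes([ESC, byte - 0x40])
    return bytes([byte])  # anything else passes through untouched


def expand_c1(data):
    """Apply c1_to_7bit to every byte of a byte string."""
    return b"".join(c1_to_7bit(b) for b in data)
```
-
- An emulator that treats 0x80-0x9f this way will swallow any printable
- characters a national character set places in that range.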
-
- Hangul, like the Japanese Katakana and Hiragana, is representable in an
- 8 bit set, the lower 128 characters being ASCII. Unless you are trying
- for Unicode, you should not need multibyte -- even then, you only need 16 bits,
- not the 32 bits Sun is currently using for their Internationalization.
-
- Get the echoing working with cat by setting your terminal modes correctly,
- and you should be able to type in 8 bit characters using the input
- automata (or even more clever, use a Korean keyboard with one of the magic
- unused shifts (like alt or compose) to get the English characters for
- programming -- with an X terminal, the means to do this are provided in
- the xmodmap utility).
-
- I don't understand "kernel seems to have trouble with multibyte characters".
- If by this, you mean the file system (trying to use 16 bit characters in
- a file name), you are correct: the file system doesn't understand this type
- of representation, especially for characters in the 0-255 range, since they
- will have initial leading NULLs and there is no provision in the kernel or
- shells for byte-count prefixed multibyte strings with NULLs in them.
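-
- To illustrate the leading NULL problem: the sketch below uses Python's
- "utf-16-be" codec as a stand-in for a fixed 16-bit representation (that
- choice is an assumption made for the example), and c_string is a made-up
- helper that mimics how NULL-terminated string code reads a buffer.
-
```python
def c_string(buf):
    """Return what a NUL-terminated C string reader would see in buf."""
    end = buf.find(b"\x00")
    return buf if end < 0 else buf[:end]


# "test" as 16-bit units, high byte first: every character gains a leading 0x00
name16 = "test".encode("utf-16-be")

# The same name in an 8-bit set has no embedded zero bytes
name8 = "test".encode("latin-1")

seen16 = c_string(name16)  # stops at the very first byte
seen8 = c_string(name8)    # sees the whole name
```
-
- With the low byte first instead, the reader would get one character and
- then stop, which is not much better.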
-
- If you mean that you could not use 8-bit characters in a file name, then
- the fault lies in the input mechanism (your shell, if it isn't 8 bit clean,
- or your tty modes if it is).
-
- Other than file naming (directory entry manipulation services), a stream
- of data is a stream of data, and the file system storing it doesn't care
- if it's a stream of bytes to be treated as a double-byte character set,
- or a stream of bytes to be treated as ASCII.
-
- An intrinsic limiting factor in the use of Unicode or other multibyte
- technology within all shell tools and libraries is the fact that by
- doing so, you effectively halve your usable disk space (by doubling the
- size of the data to be stored, even if it is vanilla ASCII). I think
- it will be a long time before we see large numbers of products coming
- out of the US which have this limitation.
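-
- The arithmetic is simple enough to check. A small sketch, again assuming
- "utf-16-be" as the stand-in for a fixed 16-bit encoding and using a
- hypothetical helper name:
-
```python
def storage_bytes(text):
    """Return (bytes as 8-bit Latin-1, bytes as fixed 16-bit units) for text."""
    return len(text.encode("latin-1")), len(text.encode("utf-16-be"))


# Plain ASCII text: 20 characters per line, 1000 lines
size8, size16 = storage_bytes("The quick brown fox\n" * 1000)
```
-
- For text that fits in 8 bits, size16 comes out at exactly twice size8.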
-
- One way to internationalize without falling into this trap is to adopt the
- ISO Latin-1 character set (usable by the majority of countries) as the
- standard console character set. The "codrv" program gives us a means of
- providing an initial load of this onto existing video hardware.
-
- An additional help would be the adoption of the BSD4.4 file system, in
- particular, the Ficus layering (ala John Heidemann), which would allow
- for the provision of a Unicode naming layer so that some file systems
- *could* be multibyte in nature. An additional layer for Unicode disk
- access would probably be desirable, since this would allow canonical
- representation of a text in its native language, allowing multilingual
- use of the same file system without interference like one might get
- with one user running Latin-1 and another running Cyrillic-1. This would
- allow people requiring more than 8 bits for their name space/text space
- to halve the size of their disk (which they would have to do anyway)
- without negatively impacting those of us who can live in 8 bits.
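-
- A quick sketch of the kind of interference meant here. It assumes ISO 8859-5
- as the Cyrillic 8-bit set and "utf-16-be" as the 16-bit form; both are
- stand-ins picked for the example, not anything specific to 386BSD.
-
```python
raw = bytes([0xE4])  # one stored byte, e.g. part of a file name

# The same byte means different things to different 8-bit users
as_latin1 = raw.decode("latin-1")      # a with diaeresis
as_cyrillic = raw.decode("iso8859_5")  # Cyrillic small letter ef

# In a 16-bit representation the two characters no longer collide
latin_16 = as_latin1.encode("utf-16-be")
cyrillic_16 = as_cyrillic.encode("utf-16-be")
```
-
- Which character the first user meant is not recoverable from the 8-bit
- byte alone; the 16-bit forms carry that information with them.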
-
- As to the other internationalization issue, which is internationalization
- of text strings in error messages and utilities, I think that we need to
- adopt the XPG3 standards for string identification, with Unicode
- storage of the strings (PC code page representation is for the birds).
- This will buy us usable error messages with little or no penalty, since
- the locale database can be loaded on a per-site basis (ie: I don't need
- to load English or Spanish if I'm German).
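-
- The XPG3 model identifies each string by a (set, message) number pair and
- falls back to a built-in default when the catalog has no entry, which is the
- contract of catgets(catd, set_id, msg_id, default). The sketch below only
- mimics that lookup with a dictionary; CATALOGS, its contents, and get_message
- are hypothetical and are not the real catalog format.
-
```python
# Hypothetical per-site catalogs: only the locales a site needs get loaded
CATALOGS = {
    "en": {(1, 1): "No such file or directory"},
    "de": {(1, 1): "Datei oder Verzeichnis nicht gefunden"},
}


def get_message(locale, set_id, msg_id, default):
    """Return the catalog text for (set_id, msg_id), or default if missing."""
    return CATALOGS.get(locale, {}).get((set_id, msg_id), default)


# A German site gets the German text; an unloaded locale gets the default
german = get_message("de", 1, 1, "No such file or directory")
missing = get_message("es", 1, 1, "No such file or directory")
```
-
- Because the default travels with the call, a site with no catalog loaded
- at all still gets a usable message.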
-
- The limiting factor on this (in the PC market anyway) is running the
- display hardware in "text" mode. Given a reliable mechanism for the
- identification of video hardware (maybe a non-protected mode install
- or portion of the boot?), the built in limitation of the IBM PC
- character set will become less of a consideration, at least for the
- 8-bit character sets which can be fully downloaded to VGA cards.
-
-
- Terry Lambert
- terry@icarus.weber.edu
- terry_lambert@novell.com
- ---
- Any opinions in this posting are my own and not those of my present
- or previous employers.
- --
- -------------------------------------------------------------------------------
- "I have an 8 user poetic license" - me
- Get the 386bsd FAQ from agate.berkeley.edu:/pub/386BSD/386bsd-0.1/unofficial
- -------------------------------------------------------------------------------
-