NetNews Usenet Archive 1992 #27

Text File | 1992-11-17 | 6.7 KB | 131 lines

Newsgroups: comp.unix.bsd
Path: sparky!uunet!ferkel.ucsb.edu!taco!rock!stanford.edu!agate!spool.mu.edu!wupost!usc!sol.ctr.columbia.edu!eff!news.byu.edu!ux1!fcom.cc.utah.edu!cs.weber.edu!terry
From: terry@cs.weber.edu (A Wizard of Earth C)
Subject: [386bsd] INTERNATIONALIZATION (was can't deal with 8-bit input)
Message-ID: <1992Nov16.232035.6307@fcom.cc.utah.edu>
Summary: Going global
Sender: news@fcom.cc.utah.edu
Organization: Weber State University (Ogden, UT)
References: <1992Nov16.081801.15019@kum.kaist.ac.kr>
Date: Mon, 16 Nov 92 23:20:35 GMT
Lines: 118

In article <1992Nov16.081801.15019@kum.kaist.ac.kr> jbkang@csking.kaist.ac.kr (Joongbin Kang) writes:
> ...But another problem occured during using the 'hanterm', Korean version
> of xterm. It can display Korean texts with MSB set (the same to most
> oriental languages, such as kanji etc), but I couldn't input Korean text.
> Hanterm itself provides Korean input automata, and it should work well
> with X11R5. Another test shows that kernel seems to have trouble with
> multibyte characters.
> % cat
> test
> test (echoed to tty)
> ^D
> % cat
> xxxx(entered korean characters -- it can be seen when typing)
> (but no echo to tty!)
> ^D (this DIDN'T work)
> ^C
> %
> So, what's the problem? If I cannot use hangul in 386bsd, it loses
> practicality...Help!

Most likely, the echo to your X term was broken when it sent you back
your characters... seriously!

The default cflags for a tty in 386BSD strip parity (the 8th bit) by
setting cs7 and even parity (-parodd, parenb), together with the iflag
istrip. The fact that you got the characters you typed back at all is an
indicator that istrip wasn't working on echoed characters.

An additional problem with ANSI terminal emulation, if not
internationalized, is the C1 control characters (0x80-0x9f, the range
that contains CSI), which are seen as <ESC> + <char - 0x40> (ie 0x9b =
0x1b 0x5b ...or... <CSI> = <ESC>[).
Basically, you have to disable this functionality to get around the
problem with 0x80-0x9f range characters.

SCO "gets around" the problem by allowing the output of the characters
in this range, with an escape sequence to pick the character set in that
range instead of the normal output (<ESC>[12m? It's been about 3 years
since I wrote the SCO color console emulator for TERM from Century
Software). Doing it this way gets you a PC character set on your
console, but it is hardly 8-bit clean.

Hangul, like the Japanese Katakana and Hiragana, is representable in an
8 bit set, the lower 128 characters being ASCII. Unless you are trying
for Unicode, you should not need multibyte -- even then, you only need
16 bits, not the 32 bits Sun is currently using for their
Internationalization.

Get the echoing working with cat by setting your terminal modes
correctly, and you should be able to type in 8 bit characters using the
input automata (or even more clever, use a Korean keyboard with one of
the magic unused shifts (like alt or compose) to get the English
characters for programming -- with an X terminal, the means to do this
are provided in the xmodmap utility).

I don't understand "kernel seems to have trouble with multibyte
characters". If by this you mean the file system (trying to use 16 bit
characters in a file name), you are correct: the file system doesn't
understand this type of representation, especially for characters in the
0-255 range, since they will have leading NUL bytes and there is no
provision in the kernel or shells for byte-count prefixed multibyte
strings with NULs in them. If you mean that you could not use 8-bit
characters in a file name, then the fault lies in the input mechanism
(your shell, if it isn't 8 bit clean, or your tty modes if it is).

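The terminal-mode change described above (8 data bits, no parity, no
stripping of the high bit) can be sketched with the POSIX termios
interface. This is only an illustration, not anything taken from 386BSD
itself: the helper names eight_bit_clean and apply_to_tty are made up
for this example, and Python's standard termios module is used simply
because it exposes the same flag names (ISTRIP, PARENB, CS8, ...).

```python
import termios

def eight_bit_clean(attrs):
    """Return a copy of a tcgetattr()-style list set up for 8-bit data.

    attrs is [iflag, oflag, cflag, lflag, ispeed, ospeed, cc].
    (Hypothetical helper, named for this example only.)
    """
    new = list(attrs)
    iflag, cflag = new[0], new[2]
    # Input side: stop stripping bit 8 and stop checking input parity.
    iflag &= ~(termios.ISTRIP | termios.INPCK)
    # Control side: clear the character-size field and parity generation,
    # then select 8 data bits.
    cflag &= ~(termios.CSIZE | termios.PARENB | termios.PARODD)
    cflag |= termios.CS8
    new[0], new[2] = iflag, cflag
    return new

def apply_to_tty(fd=0):
    """Apply the 8-bit-clean settings to an open tty (stdin by default)."""
    termios.tcsetattr(fd, termios.TCSANOW,
                      eight_bit_clean(termios.tcgetattr(fd)))
```

From a shell, the rough equivalent is "stty cs8 -parenb -istrip" on the
tty in question. The shell reading that tty still has to be 8 bit clean
itself for the characters to survive.
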
Other than file naming (directory entry manipulation services), a stream
of data is a stream of data, and the file system storing it doesn't care
if it's a stream of bytes to be treated as a double-byte character set,
or a stream of bytes to be treated as ASCII.

An intrinsic limiting factor in the use of Unicode or other multibyte
technology within all shell tools and libraries is the fact that by
doing so, you effectively halve your usable disk space (by doubling the
size of the data to be stored, even if it is vanilla ASCII). I think it
will be a long time before we see large numbers of products coming out
of the US which have this limitation.

One way to internationalize without falling into this trap is to adopt
the ISO Latin-1 character set (usable by the majority of countries) as
the standard console character set. The "codrv" program gives us a means
of providing an initial load of this onto existing video hardware.

An additional help would be the adoption of the BSD4.4 file system, in
particular the Ficus layering (ala John Heidemann), which would allow
for the provision of a Unicode naming layer so that some file systems
*could* be multibyte in nature. An additional layer for Unicode disk
access would probably be desirable, since this would allow canonical
representation of a text in its native language, allowing multilingual
use of the same file system without interference like one might get with
one user running Latin-1 and another running Cyrillic-1. This would
allow people requiring more than 8 bits for their name space/text space
to halve the size of their disk (which they would have to do anyway)
without negatively impacting those of us who can live in 8 bits.

As to the other internationalization issue, which is
internationalization of text strings in error messages and utilities, I
think that we need to adopt the XPG3 standards for string
identification, with Unicode storage of the strings (PC code page
representation is for the birds).
This will buy us usable error messages with little or no penalty, since
the locale database can be loaded on a per-site basis (ie: I don't need
to load English or Spanish if I'm German).

The limiting factor on this (in the PC market anyway) is running the
display hardware in "text" mode. Given a reliable mechanism for the
identification of video hardware (maybe a non-protected mode install or
portion of the boot?), the built in limitation of the IBM PC character
set will become less of a consideration, at least for the 8-bit
character sets which can be fully downloaded to VGA cards.


                                        Terry Lambert
                                        terry@icarus.weber.edu
                                        terry_lambert@novell.com
---
Any opinions in this posting are my own and not those of
my present or previous employers.
--
-------------------------------------------------------------------------------
                                        "I have an 8 user poetic license" - me
 Get the 386bsd FAQ from agate.berkeley.edu:/pub/386BSD/386bsd-0.1/unofficial
-------------------------------------------------------------------------------