NetNews Usenet Archive 1992 #31

Newsgroups: comp.unix.bsd
Path: sparky!uunet!spool.mu.edu!agate!dog.ee.lbl.gov!hellgate.utah.edu!fcom.cc.utah.edu!cs.weber.edu!terry
From: terry@cs.weber.edu (A Wizard of Earth C)
Subject: INTERNATIONALIZATION: IN GENERAL
Message-ID: <1993Jan2.083734.22776@fcom.cc.utah.edu>
Keywords: Han Kanji Katakana Hirugana ISO10646 Unicode Codepages
Sender: news@fcom.cc.utah.edu
Organization: Weber State University (Ogden, UT)
References: <1ht8v4INNj7i@rodan.UU.NET> <1993Jan1.094759.8021@fcom.cc.utah.edu> <1i2k09INN4hl@rodan.UU.NET>
Date: Sat, 2 Jan 93 08:37:34 GMT
Lines: 439

A discussion between Vadim Antonov (V) and myself (T):

V: 1) "mechanistic" conversion between upper and lower case
V:    is impossible (as well as case-insensitive comparisons)
V:
V:    Example:  Latin T -> t
V:              Cyrillic T -> m
V:              Greek T -> ?
V:
V:    This property alone renders Unicode useless for any business
V:    applications.

T: This is simply untrue. Because a subtractive/additive conversion is
T: impossible in *some* cases does not mean a *mechanistic* conversion is
T: also impossible. In particular, a tabular conversion is an obvious
T: approach which has already been used with success, with a minimal
T: (multiply plus dereference) overhead.

V: You omitted one small "detail" -- you need to know the language of the word
V: the letter belongs to to make a conversion. Since Unicode does not
V: provide for specifying the language it is obvious that it should be
V: obtained from user or kept somewhere off the text. In both cases
V: as our program ALREADY knows the language from the environment it knows
V: the particular (small) alphabet -- no need to use multibyte encodings!
V: See how Unicode renders itself useless?

Correct. You need to know the language, because the information you are
storing is which glyph to display rather than the language and the glyph.

There are several problems with a unique ordinal value per glyph, where
a particular glyph is not unique within the set of glyphs.
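The "tabular conversion" with "multiply plus dereference" overhead mentioned
in the exchange above can be illustrated with a small Python sketch. It
assumes the premise of the discussion, that one glyph code may be shared
between scripts and that the language arrives out of band. The language tags,
the table size, and the treatment of capital T as a single shared code are
hypothetical choices made for the illustration; they are not taken from
Unicode or from any other standard.

```python
# Assumption: a small code space (0x500 entries) is enough for the demo.
TABLE_SIZE = 0x500

# Hypothetical out-of-band language tags, each mapped to a table index.
LANGUAGES = {"latin": 0, "cyrillic": 1, "greek": 2}

# One flat table holding a lowercase mapping per language.
# Default is the identity: every code lowercases to itself.
LOWER = list(range(TABLE_SIZE)) * len(LANGUAGES)


def set_lower(language, upper, lower):
    """Record that `upper` lowercases to `lower` in the given language."""
    LOWER[LANGUAGES[language] * TABLE_SIZE + upper] = lower


# The capital "T" shape from the example, treated here as ONE shared code.
# This mirrors the premise of the argument, not real code assignments.
SHARED_CAPITAL_T = 0x54
set_lower("latin", SHARED_CAPITAL_T, 0x0074)     # LATIN SMALL LETTER T
set_lower("cyrillic", SHARED_CAPITAL_T, 0x0442)  # CYRILLIC SMALL LETTER TE
set_lower("greek", SHARED_CAPITAL_T, 0x03C4)     # GREEK SMALL LETTER TAU


def to_lower(language, code):
    """Table lookup: one multiply, one add, one dereference."""
    return LOWER[LANGUAGES[language] * TABLE_SIZE + code]
```

Calling to_lower("cyrillic", SHARED_CAPITAL_T) and to_lower("latin",
SHARED_CAPITAL_T) gives different results for the same input code, which is
the point both sides agree on: the conversion is mechanical, but only once
the language is known.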
In particular, programs which process text as data (like C compilers)
require the ability to distinguish characters. If one looks at the JIS
standard, one sees that it includes an English alphabet. Without
unification between this and the ISO-Latin-1 font, for instance, a great
deal of additional code is required to allow the compiler to recognize
characters (basically, to do its own unification). You can't tell if the
characters in the string "printf" were input in a JIS or the Latin-1 font
just by looking at them, but the compiler can certainly tell that their
codes are distinct.

In order to provide "natural" operations on words (such as hyphenation,
case conversion, and, in particular, abbreviation, all of which are
potentially desirable in our hypothetical program which also
alphabetizes), you also require information about the language.
Hyphenation and abbreviation, in particular, require a detailed knowledge
of the idea of sequences of glyphs (ie: words). This information will not
be available regardless of your glyph encoding standard. Other word
processing operations (such as dictionary and thesaurus use within the
program) require knowledge of which language to use.

The idea of sort order should be (and is in Unicode) divorced from the
idea of information storage.

The fact that one will have text files, data files, and text files which
act as data files on the same machine *requires* some type of promiscuous
[out of data band] method of determining the format of the data within a
file. This method, whether it be language tagging of the files in the
inode, or language tagging of the user during the login process, is
imperative. To do otherwise means that your localization data coexists
with system data rather than system data being localized as well.

The operations you wish to perform are the province of applications
running on the system, not the system itself.
Regardless of whether this is done by an application programmer (as a per
application localization) or by the creator of a library used by
applications (as part of development system localization), THE CODE
BELONGS IN USER SPACE.

V: I wonder why programmers aren't taught mathematical logic. I'm somehow
V: an exception because i'm a mathematician by education and i use to look
V: for holes in "logical" statements.

Most American programmers are, if they attempt to get a degree at an
institute of higher learning in the US. Most are also forcibly taught how
to bowl or shoot a bow and arrow as part of their graduation requirements.

A point of contention: a logician is, by discipline, a philosopher, not a
mathematician. Being the latter does not qualify one as the former.

The point of having to know what language a particular document is written
in in order to manipulate it was *not* omitted; it was taken as an *axiom*.

T: The Cyrillic characters within the Unicode standard (U+0400 -> U+04FF)
T: are based on the ECMA registry under ISO-2375 for use with ISO-2022. It
T: contains several Cyrillic subsets. The most recent and most widely
T: accepted of these is ISO-8859-5. Unicode uses the same relative
T: positions as in ISO-8859-5. Are you also averse to ISO-8859-5?

V: ISO-8859-5 is ok, though it is a dead code. Nobody uses it in Russia,
V: mind you. The most wide-spread codes are KOI-8 (de-facto Unix and
V: networking standard) and the so-called "alternative" code which is
V: popular between MS-DOS users.

With all due respect, ISO-8859-5 is an international standard to which
engineering outside of Russia is done for use in Russia. Barring another
published standard for external use, this is probably what Russian users
are going to be stuck with for code originating outside of Russia. I
suggest that if this concerns you, you should have the "de facto standard"
codified for use by external agencies.
One wonders at the ECMA registration of a supposed "non-standard" by
Russian nationals if the standard is not used in Russia.

T: Of these, some argument can be made against only the final paragraph,
T: since it views internationalization as a tool for multinationalization
T: rather than localization. I feel that a strong argument can be held
T: out for internationalization as a means of providing fully data driven
T: localizations of software. As such, the argument of monolingual vs.
T: multilingual is not supported. However, lexical sort order can be
T: enforced in the access rather than the storage mechanism, making this
T: a null point.

V: Nay, you missed the same point again. You need information about
V: language case-conversion and sorting rules and you can obtain it from
V: the encoding (making user programs simple) or from user programs
V: (forcing them to ask user at every step or to keep track of the language).
V: What would you choose for your program?

The process of "asking the user" is near 0 cost regardless of whether the
implementation is some means of file attribution per language or some
method of user attribution (ala proc struct, password file, or
environment). It becomes more complicated if you are attempting a
multinational document; the point here is to enable localization with
user supplied data sets rather than providing a tool for linguistic
scholars or multilingual word processors. It is possible to do both of
these things within the confines of Unicode, penalizing only the authors
of the applications.

V: Besides, as i already argued asking or keeping off-text information
V: makes the whole enterprise useless.

This is perhaps true if the goal is multinationalization rather than
internationalization or localization. Consider a document in Japanese,
Tamil, and Devanagari (Sanskrit). How does one resolve the issues of input
mechanism for these languages? JIS encoding does not cut it.
Basically, for a multinational document, there must be multiple instances
of input mechanisms, or a switch between input mechanisms during the input
process. A switch between mechanisms is sufficient indicator of a switch
between languages, since each input mechanism will be more or less
language specific in any case because of the N->M keyboard mapping issues
if nothing else.

I believe that multinational documents will be the exception, not the
rule. I further believe that in the specific case of multinational
documents, the use of a particular in-band storage mechanism (such as
"Fancy Text" from the Unicode 1.0 standard) is not an unacceptable penalty
for exceptional use.

I believe the goal is *NOT* multinationalization, but
internationalization. In this context, internationalization refers not to
the ability to provide perfect access to all supported languages (by way
of glyph preference), but refers instead to an enabling technology to
allow better operating system support for localization.

Multinational use is out of the question until modifications are made to
the file system in terms of supporting multiple nation name spaces for the
files. Localization in terms of multinationalization requires other
considerations not directly possible; in particular, the concept of "well
known file system objects" must be adjusted. Consider, if you will, the
fact that such a localization of the existing UNIX infrastructure is
currently impossible in this framework. I am thinking in particular of
renaming the /etc directory or the /etc/passwd file to localized
non-English equivalents.

The idea of multinationalization falls under its own weight. Consider a
system used by users of several languages (ie: a multinational
environment). Providing each user with their own language's view of the
files requires, for directory information alone, a minimum number of
entries equal to the number of well known file names times the number of
languages (bearing in mind that translation may affect name length).
Now consider that each of these users will want their names and passwords
to be in their own language in a single password file.

Multinationalization is possible, but of questionable utility and merit in
current computing systems. We need only worry about providing the
mechanisms for concurrency of use for the translators.

Consider now the goal of data-driven localization (a single translation
for all system application programs and switching of language environments
without application recompilation). Does this goal require
internationalization of applications? The answer is no. The only thing it
requires is internationalization of the underlying system to allow
data-driving of localization. Applications themselves need only be
localized through their use of the underlying system. Rather than
rewriting all applications which use text as data (cf: the C compiler
example), unification of the glyph sets makes more sense.

The only goal I am espousing here is enabling for localization. For this
task, Unicode is far from useless.

T: I believe this is addressed adequately in the ISO standards; however,

V: Your belief is wrong for it is not considered adequate by real users.

Then "real users" can supply a codified alternative in short order or
lump it.

T: the lexical order argument is one of the sticking points against the
T: Japanese acceptance of Unicode, and is a valid argument in that arena.
T: The fact of the matter is that Unicode is not an information manipulation
T: standard, but (for the purposes of its use in internationalization) a
T: storage and an I/O standard. Viewed this way, the lexical ordering argument
T: is not applicable.

V: It'd be sticking point about Slavic languages as well, you may be sure.
V: Knowing ex-Soviet standard-making routine i think the official fishy-eyed
V: representatives will silently vote pro to get some more time for raving
V: in Western stores and nobody will use it since then.
V: The "working" standards in Russia aren't made by committees.

Then this will have to change, or the Russian users will pay the price.
Those of us external to Russia are in no position to involve ourselves in
this process. Any changes will have to originate in Russia.

I haven't seen you come right out and say the Cyrillic lexical order in
the Unicode standard (characters U+0400->U+04FF) and in the ISO-8859-5
sets are "wrong". Neither have I seen an alternative lexical order (with
an accompanying rationale) put forth.

V: 3) there is no reasonable way to do hyphenation.
V:    Since there is no way to tell language from the text there
V:    is no way to do any reasonable attempts to hyphenate.
V:    [OX - which language this word is from]?
V:
V:    Good-bye wordprocessors and formatters?

T: By this, you are obviously not referring to ideographic languages, such as
T: Han, since hyphenation is meaningless for such languages. Setting aside
T: the argument that if you don't know how to hyphenate in a language, you
T: have no business generating situations requiring hyphenation by virtue
T: of the fact that you are basically illiterate in that language... ;-).

V: The reason may be as simple as reformatting spreadsheet containing
V: (particularly) addresses of companies in the language i don't comprehend
V: (though i can write it on the envelope).

T: Hyphenation as a process is language dependent, and, in particular,
T: dependent on the rendering mechanism (rendering mechanisms are *not*
T: the subject under discussion; storage mechanisms *are*). Bluntly
T: speaking, why does one need word processing software at all if this
T: type of thing is codified? Hyphenation, like sorting, is manipulation
T: of the information in a native language specific way.

V: Exactly. But there is a lot of "legal" ways to do hyphenation -- and
V: there are algorithms which do reasonably well knowing nothing about
V: the language except which letters are vowels.
V: It's quite enough for printing address labels. If i'm formatting a book
V: i can specify the language myself.

Address information can not be hyphenated, at least in US and other
Western mail of which I am personally aware. This is a non-issue.

This is also something that is not the responsibility of the operating
system or the storage mechanism therein... unless you are arguing that UFS
knows to store documents without hyphenation, and that the "cat" and
"more" programs will hyphenate for you. If you are talking about ANY OTHER
APPLICATION, THE HYPHENATION IS THE APPLICATION'S RESPONSIBILITY. PERIOD.

The fact that you will have to maintain vowel/consonant tables on a per
language basis is an obvious outcome of the processing of multinational
information. It makes little difference to the application user how these
tables are keyed.

T: Find another standard to tell you how to write a word processor.

V: Is there any? :-)

No, there isn't; that was the point. It is not the intent of the Unicode
standard to provide a means of performing the operations normally
associated with word processing. That is the job of the word processor,
and is the reason people who write word processors are paid money by an
employer rather than starving to death.

V: 4) "the similar glyphs" in Unicode are often SLIGHTLY different
V:    typographical glyphs -- everybody who ever dealt with international
V:    publishing knows that fonts are designed as a WHOLE and every
V:    letter is designed with all others in mind -- i.e. X in Cyrillic
V:    is NOT the same X as Latin even if the fonts are variations of
V:    the same style. I'd wish you to see how ugly the Russian
V:    texts printed on American desktop publishing systems with
V:    "few characters added" are.
V:
V:    In reality it means that Unicode is not a solution for
V:    typesetting.

T: No, you're right; neither is it a standard for the production of
T: pipefittings or the design of urban transportation systems.
V: You somehow forgot that increasing number of texts get printed with
V: typographical quality with all the stuff which follows.
V: Ever saw a laser printer?

Printing is simply another user-mode application program which can take
advantage of the language indicators (whether on the file or in a
document) for printing the prettiest, most lovely font of your choice. You
think there are not font-selection mechanisms within Postscript for doing
this?

Again, font *changes* only become a problem if one attempts to print a
*multinational* document. Since we aren't interested in
multinationalization, it's unlikely that a Unicode font containing all
Unicode glyphs will be used for that purpose. In all likelihood, use will
be in a localized environment, *NOT* a multinational one. Since this is
the case, it follows that the sum total of the Unicode font implemented in
the US will be the ISO Latin-1 set. Similarly, if you are printing a
Cyrillic document, you will be using a Cyrillic font; the "X" character
you are concerned about will be *localized* to the Cyrillic "X", *NOT* the
Latin "X".

V: I see no reasons why we should treat the regular expression matching
V: as "fancy" feature.

Because globbing characters are language dependent. The easiest example of
this is the distinction made between "localized" UNIX SVR4 for English
vs. Spanish. The fact is, the character set used for Spanish replaces
several characters in the English set with other characters particular to
Spanish (DOS is the foremost example of this, with its reference to code
pages and the fact that DOS file names fall within a very narrow set of
characters). The globbing ("regular expression pattern match") characters
DO change for any patterns more complicated than "*".

T: Clearly, then, the applications you are describing are *not* Unicode
T: applications, but "Fancy text" applications which could potentially
T: make use of Unicode for character storage.
V: Don't you think that ANY text is going to be fancy because Unicode
V: as it is does not provide adequate means for the trivial operations?

Perhaps any multinational text, yes; for normal text, processing will be
done using the localized form, not the Unicode form; therefore the issue
will never come up, unless the application requires embedded attributes
(like a desktop publishing package). Since multinational processing is the
exception rather than the rule, let the multinational users pay the price
in terms of "Fancy text".

V: As well i can provide every text with a font file. It is not a solution
V: at all.

Quite right; but doing so would be redundant unless you were using an
output device, such as a CRT or a printer. It is the responsibility of the
output device to present the data in a suitable format. For the most part,
except for printing, which is difficult enough currently, this will be
done by using localized fonts containing only a part of the full Unicode
set (that part necessary for the localization language in use for that
session/user/file) and thus will be coherently defined within the context
of its localization.

Again, multinational software is not being addressed; however, were we to
address the issue, I suspect that it would, in all cases, be
implementation dependent upon the multinational application.

V: Thank you, i already expressed my opinion on Plan 9 UTF in comp.os.research.
V: I also do not think it's exciting. There are much more efficient runic
V: encodings (my toy OS uses 7bit per byte and 8th bit as a continuation
V: indicator).

I don't know how stridently I can express this: runic encoding destroys
information (such as file size = character count) and makes file system
processing re character substitution totally unacceptable... consider the
case of a substitution of a character requiring 3 bytes to encode for one
that takes 1 byte (or 4 bytes) currently. Say further that it is the first
character in a 2M file.
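The substitution case just described can be made concrete with a short
Python sketch of a continuation-bit encoding of the kind V mentions: 7 data
bits per byte, with the 8th bit set when another byte follows. The byte
order (low-order group first) and every function name here are assumptions
made for the illustration; the post does not say how the toy OS actually
lays out its bytes.

```python
def encode(code):
    """Encode one non-negative code as 7-bit groups, low-order group first.
    The high bit is set on every byte except the last one."""
    out = bytearray()
    while True:
        group = code & 0x7F
        code >>= 7
        if code:
            out.append(group | 0x80)   # more groups follow
        else:
            out.append(group)          # final group, high bit clear
            return bytes(out)


def encode_text(codes):
    """Encode a sequence of codes into one byte string."""
    return b"".join(encode(c) for c in codes)


def decode_text(data):
    """Decode a byte string back into the list of codes it holds."""
    codes, value, shift = [], 0, 0
    for byte in data:
        value |= (byte & 0x7F) << shift
        if byte & 0x80:
            shift += 7                 # continuation: keep accumulating
        else:
            codes.append(value)
            value, shift = 0, 0
    return codes


def replace_code(data, index, new_code):
    """Replace the code at `index`. The whole tail is re-emitted, because
    the replacement may occupy a different number of bytes."""
    codes = decode_text(data)
    codes[index] = new_code
    return encode_text(codes)
```

With ten one-byte codes, replacing the first with a code that needs three
bytes (0x4E2D, for example) grows the data from 10 to 12 bytes, and every
byte after the first character sits two positions later than before. For a
2M file that means rewriting essentially the whole file, and the byte count
no longer equals the character count.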
You are talking about either shifting the contents of the entire file, or,
MUCH WORSE, going to record oriented files for text. If there is de facto
attribution of text vs. other files (shifting the data is unacceptable.
Period.), there is no reason to avoid making that attribution as
meaningful as possible.

V: Pretty soon it will be a dead standard because of the logical problems
V: in the design. Voting is inadequate replacement for logic, you know.
V: I'd better stick to a good standard from Zambia than to the brain-dead
V: creature of ISO even if every petty bureaucrat voted for it.

I agree; however, the people involved were slightly more knowledgeable
about the subject than your average "petty bureaucrat". And there has not
been a suggested alternative, only rantings of "not Unicode".

V: I expressed my point of view (and proposed some kind of solution) in
V: comp.std.internat, where the discussion should belong. I'd like you to
V: see the problem not as an exercise in wrestling consensus from an
V: international body but as a mathematical problem. From the logistical
V: point of view the solution is simply incorrect and no standard committee
V: can vote out that small fact. The fundamental assumption Unicode is
V: based upon (i.e. one glyph - one code) makes the whole construction
V: illogical and it, unfortunately, cannot be mended without serious
V: redesign of the whole thing.

Wrong, wrong, wrong.

1) We are not discussing the embodiment of a standard, but the
   applicability of existing standards to a particular problem.
   Basically, we couldn't care less about anything other than the
   existing or draft standards and their suitability to the task at hand,
   the international enabling of 386BSD.

2) We are not interested in "arriving" at a new standard or defending
   existing or draft standards, except as regards their suitability to
   our goal of enabling.
3) The proposal of new solutions (new standards) is neither useful nor
   interesting, in light of our need being "now" and the adoption of a
   new solution or standard being "at some future date".

4) Barring a suggestion of a more suitable standard, I and others will
   begin coding to the Unicode standard.

5) Since we are discussing adoption of a standard for enabling of
   localization of 386BSD, and are neither intent on a general defense of
   any existing standard, nor the proposal of changes to an existing
   standard or the embodiment of a new standard, this discussion does
   *NOT* belong in comp.std.internat, since the subscribers of
   comp.unix.bsd are infinitely more qualified to determine which
   existing or draft standard they wish to use without a discussion of
   multinationalization (something only potentially useful to a limited
   audience, and then only at some future date when multinational
   processing on 386BSD becomes a generally desirable feature).

V: Try to understand the argument about the redundancy of encoding with
V: external restrictions provided i used earlier in this letter. The
V: Unicode committee really get caught in a logical trap and it's a pity
V: few people realize that.

I *understand* the argument; I simply *disagree* with its applicability
to anything other than enabling multinationalization as opposed to
enabling localization, which is the goal.

Terry Lambert
terry@icarus.weber.edu
terry_lambert@novell.com
---
Any opinions in this posting are my own and not those of
my present or previous employers.
--
-------------------------------------------------------------------------------
"I have an 8 user poetic license" - me
Get the 386bsd FAQ from agate.berkeley.edu:/pub/386BSD/386bsd-0.1/unofficial
-------------------------------------------------------------------------------