- Newsgroups: comp.unix.bsd
- Path: sparky!uunet!spool.mu.edu!agate!dog.ee.lbl.gov!hellgate.utah.edu!fcom.cc.utah.edu!cs.weber.edu!terry
- From: terry@cs.weber.edu (A Wizard of Earth C)
- Subject: INTERNATIONALIZATION: IN GENERAL
- Message-ID: <1993Jan2.083734.22776@fcom.cc.utah.edu>
- Keywords: Han Kanji Katakana Hiragana ISO10646 Unicode Codepages
- Sender: news@fcom.cc.utah.edu
- Organization: Weber State University (Ogden, UT)
- References: <1ht8v4INNj7i@rodan.UU.NET> <1993Jan1.094759.8021@fcom.cc.utah.edu> <1i2k09INN4hl@rodan.UU.NET>
- Date: Sat, 2 Jan 93 08:37:34 GMT
- Lines: 439
-
- A discussion between Vadim Antonov (V) and myself (T):
-
- V: 1) "mechanistic" conversion between upper and lower case
- V: is impossible (as well as case-insensitive comparisons)
- V:
- V: Example: Latin T -> t
- V: Cyrillic T -> m
- V: Greek T -> ?
- V:
- V: This property alone renders Unicode useless for any business
- V: applications.
-
- T: This is simply untrue. Because a subtractive/additive conversion is
- T: impossible in *some* cases does not mean a *mechanistic* conversion is
- T: also impossible. In particular, a tabular conversion is an obvious
- T: approach which has already been used with success, with a minimal
- T: (multiply plus dereference) overhead.
-
- V: You omitted one small "detail" -- you need to know the language of the word
- V: the letter belongs to to make a conversion. Since Unicode does not
- V: provide for specifying the language it is obvious that it should be
- V: obtained from user or kept somewhere off the text. In both cases
- V: as our program ALREADY knows the language from the environment it knows
- V: the particular (small) alphabet -- no need to use multibyte encodings!
- V: See how Unicode renders itself useless?
-
- Correct. You need to know the language, because the information you are
- storing is which glyph to display rather than the language and the glyph.
-
- There are several problems with a unique ordinal value per glyph, where
- a particular glyph is not unique within the set of glyphs. In particular,
- programs which process text as data (like C compilers) require the ability
- to distinguish characters. If one looks at the JIS standard, one sees that
- it includes an English alphabet. Without unification between this and the
- ISO-Latin-1 font, for instance, a great deal of additional code is required
- to allow the compiler to recognize characters (basically, do its own
- unification). You can't tell if the characters in the string "printf" were
- input in a JIS or the Latin-1 font just by looking at them, but the
- compiler can certainly tell that they are distinct.
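-
- To make that concrete, here is a minimal sketch; the second code point
- value is invented for the example (it stands in for "the same letter
- encoded a second time"), and fold() is exactly the private unification
- a scanner is otherwise forced to carry around:
-
-     #include <stdio.h>
-
-     /* Two hypothetical code points for the letter 'p' -- one from the
-      * Latin-1 range, one from a JIS-style "English alphabet" block.
-      * The values are invented for illustration only.                  */
-     #define LATIN_P 0x0070
-     #define JIS_P   0xFF50
-
-     /* The scanner's private unification step: collapse duplicates. */
-     static unsigned short fold(unsigned short c)
-     {
-         return (c == JIS_P) ? LATIN_P : c;
-     }
-
-     int main(void)
-     {
-         unsigned short a = LATIN_P, b = JIS_P;
-
-         printf("raw compare:    %s\n", (a == b) ? "same" : "different");
-         printf("folded compare: %s\n",
-                (fold(a) == fold(b)) ? "same" : "different");
-         return 0;
-     }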
-
- In order to provide "natural" operations on words (such as hyphenation,
- case conversion, and, in particular, abbreviation, all of which are
- potentially desirable in our hypothetical program which also alphabetizes),
- you also require information about the language. Hyphenation and
- abbreviation, in particular, require a detailed knowledge of the idea of
- sequences of glyphs (ie: words). This information will not be available
- regardless of your glyph encoding standard.
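-
- For the case conversion piece, here is a minimal sketch of the tabular
- approach I mentioned above; the two-entry tables are stand-ins (a real
- localization would supply complete per-language tables as data), and the
- point is that the table is chosen by the language attribution, not
- deduced from the character codes themselves:
-
-     #include <stdio.h>
-
-     struct casemap { unsigned short lower, upper; };
-
-     /* Stand-in tables; real ones would be loaded as localization data. */
-     static const struct casemap latin_tab[] = {
-         { 0x0074, 0x0054 },                     /* t  -> T            */
-         { 0, 0 }
-     };
-     static const struct casemap cyril_tab[] = {
-         { 0x0442, 0x0422 },                     /* te -> TE (U+04xx)  */
-         { 0, 0 }
-     };
-
-     static unsigned short to_upper(const struct casemap *tab, unsigned short c)
-     {
-         for ( ; tab->lower != 0; tab++)
-             if (tab->lower == c)
-                 return tab->upper;
-         return c;               /* no mapping known: leave it alone */
-     }
-
-     int main(void)
-     {
-         /* The caller already knows the language (from the file or the
-          * user), so it knows which table to hand in.                   */
-         printf("Latin    0074 -> %04X\n", to_upper(latin_tab, 0x0074));
-         printf("Cyrillic 0442 -> %04X\n", to_upper(cyril_tab, 0x0442));
-         return 0;
-     }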
-
- Other word processing operations (such as dictionary and thesaurus use
- within the program) require knowledge of which language to use.
-
- The idea of sort order should be (and is in Unicode) divorced from the
- idea of information storage. The fact that one will have text files,
- data files, and text files which act as data files on the same machine
- *requires* some type of promiscuous [out-of-band] method of
- determining the format of the data within a file. This method, whether
- it be language tagging of the files in the inode, or language tagging
- of the user during the login process, is imperative. To do otherwise
- means that your localization data coexists with system data rather
- than system data being localized as well.
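-
- The cheapest form of user tagging is already sitting in the login
- environment; a sketch, with the LANG variable and the two-letter prefix
- test being assumptions of the example rather than anything mandated by
- the standards under discussion:
-
-     #include <stdio.h>
-     #include <stdlib.h>
-     #include <string.h>
-
-     /* Out-of-band language attribution: nothing is stored in the text
-      * itself; the user's login environment says which localization
-      * tables (case, collation, messages) get loaded.                   */
-     int main(void)
-     {
-         const char *lang = getenv("LANG");
-
-         if (lang == NULL || *lang == '\0')
-             lang = "C";                         /* default locale */
-
-         if (strncmp(lang, "ru", 2) == 0)
-             printf("loading Cyrillic case/collation tables\n");
-         else
-             printf("loading Latin-1 case/collation tables\n");
-         return 0;
-     }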
-
- The operations you wish to perform are the province of applications running
- on the system, not the system itself. Regardless of whether this is done
- by an application programmer (as a per application localization) or by the
- creator of a library used by applications (as part of development system
- localization), THE CODE BELONGS IN USER SPACE.
-
-
- V: I wonder why programmers aren't taught mathematical logic. I'm somehow
- V: an exception because i'm a mathematician by education and i use to look
- V: for holes in "logical" statements.
-
- Most American programmers are, if they attempt to get a degree at an
- institute of higher learning in the US. Most are also forcibly taught
- how to bowl or shoot a bow and arrow as part of their graduation
- requirements.
-
- A point of contention: a logician is, by discipline, a philosopher, not a
- mathematician. Being the latter does not qualify one as the former.
-
- The point of having to know what language a particular document is written
- in, in order to manipulate it, was *not* omitted; it was taken as an *axiom*.
-
- T: The Cyrillic characters within the Unicode standard (U+0400 -> U+04FF)
- T: are based on the ECMA registry under ISO-2375 for use with ISO-2022. It
- T: contains several Cyrillic subsets. The most recent and most widely
- T: accepted of these is ISO-8859-5. Unicode uses the same relative
- T: positions as in ISO-8859-5. Are you also averse to ISO-8859-5?
-
- V: ISO-8859-5 is ok, though it is a dead code. Nobody uses it in Russia,
- V: mind you. The most wide-spread codes are KOI-8 (de-facto Unix and
- V: networking standard) and the so-called "alternative" code which is
- V: popular between MS-DOS users.
-
- With all due respect, the ISO-8859-5 is an international standard to
- which engineering outside of Russia is done for use in Russia. Barring
- another published standard for external use, this is probably what
- Russian users are going to be stuck with for code originating outside
- of Russia. I suggest that if this concerns you, you should have the
- "defacto standard" codified for use by external agencies.
-
- One wonders at the ECMA registration of a supposed "non-standard" by Russian
- nationals if the standard is not used in Russia.
-
- T: Of these, some argument can be made against only the final paragraph,
- T: since it views internationalization as a tool for multinationalization
- T: rather than localization. I feel that a strong argument can be held
- T: out for internationalization as a means of providing fully data driven
- T: localizations of software. As such, the argument of monolingual vs.
- T: multilingual is not supported. However, lexical sort order can be
- T: enforced in the access rather than the storage mechanism, making this
- T: a null point.
-
- V: Nay, you missed the same point again. You need information about
- V: language case-conversion and sorting rules and you can obtain it from
- V: the encoding (making user programs simple) or from user programs
- V: (forcing them to ask user at every step or to keep track of the language).
- V: What would you choose for your program?
-
- The process of "asking the user" is near 0 cost regardless of whether the
- implementation is some means of file attribution per language or some
- method of user attribution (ala proc struct, password file, or environment).
-
- It becomes more complicated if you are attempting a multinational document;
- the point here is to enable localization with user-supplied data sets
- rather than providing a tool for linguistic scholars or multilingual
- word processors. It is possible to do both of these things within the
- confines of Unicode, penalizing only the authors of the applications.
-
- V: Besides, as i already argued asking or keeping off-text information
- V: makes the whole enterprise useless.
-
- This is perhaps true if the goal is multinationalization rather than
- internationalization or localization. Consider a document in Japanese,
- Tamil, and Devanagari (Sanskrit). How does one resolve the issues of
- input mechanism for these languages? JIS encoding does not cut it.
- Basically, for a multinational document, there must be multiple instances
- of input mechanisms, or a switch between input mechanisms during the
- input process. A switch between mechanisms is sufficient indicator of a
- switch between languages, since each input mechanism will be more or
- less language specific in any case because of the N->M keyboard mapping
- issues if nothing else.
-
- I believe that multinational documents will be the exception, not the
- rule. I further believe that in the specific case of multinational
- documents, the use of a particular in-band storage mechanism (such as
- "Fancy Text" from the Unicode 1.0 standard) is not an unacceptable
- penalty for exceptional use.
-
- I believe the goal is *NOT* multinationalization, but internationalization.
- In this context, internationalization refers not to the ability to provide
- perfect access to all supported languages (by way of glyph preference), but
- refers instead to an enabling technology to allow better operating system
- support for localization.
-
- Multinational use is out of the question until modifications are made to
- the file system in terms of supporting multiple nation name spaces for the
- files.
-
- Localization in terms of multinationalization requires other considerations
- not directly possible; in particular, the concept of "well known file system
- objects" must be adjusted. Consider, if you will, the fact that such a
- localization of the existing UNIX infrastructure is currently impossible
- in this framework. I am thinking in particular of renaming the /etc directory
- or the /etc/passwd file to localized non-English equivalents. The idea
- of multinationalization falls under its own weight. Consider a system used
- by users of several languages (ie: a multinational environment). Providing
- each user with their own language's view of files requires a minimum of the
- number of well known file names times the number of languages (bearing in
- mind that translation may affect name length) for directory information
- alone. Now consider that each of these users will want their names and
- passwords to be in their own language in a single password file.
-
- Multinationalization is possible, but of questionable utility and merit
- in current computing systems. We need only worry about providing the
- mechanisms for concurrency of use for the translators.
-
-
- Consider now the goal of data-driven localization (a single translation
- for all system application programs and switching of language environments
- without application recompilation).
-
- Does this goal require internationalization of applications? The answer is
- no. The only thing it requires is internationalization of the underlying
- system to allow data-driving of localization. Applications themselves need
- only be localized through their use of the underlying system.
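-
- As a sketch of what "localized through the underlying system" looks like
- from the application side, consider the X/Open message catalog calls;
- the catalog name and message numbers are invented for the example, and
- the point is that swapping the catalog swaps the language without
- recompiling anything:
-
-     #include <stdio.h>
-     #include <nl_types.h>       /* catopen(), catgets(), catclose() */
-
-     #define ERR_SET    1        /* message set and number are       */
-     #define ERR_NOFILE 1        /* invented for this example        */
-
-     int main(void)
-     {
-         /* Open the catalog matching the user's language environment;
-          * the application itself carries only the fallback English.   */
-         nl_catd cat = catopen("example", NL_CAT_LOCALE);
-
-         /* catgets() hands back the default string if no catalog exists. */
-         printf("%s\n", catgets(cat, ERR_SET, ERR_NOFILE, "file not found"));
-
-         if (cat != (nl_catd)-1)
-             catclose(cat);
-         return 0;
-     }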
-
- Rather than rewriting all applications which use text as data (cf: the C
- compiler example), unification of the glyph sets makes more sense.
-
- The only goal I am espousing here is enabling for localization. For this
- task, Unicode is far from useless.
-
- T: I believe this is addressed adequately in the ISO standards; however,
-
- V: Your belief is wrong for it is not considered adequate by real users.
-
- Then "real users" can supply a codified alternative in short order or lump it.
-
- T: the lexical order argument is one of the sticking points against the
- T: Japanese acceptance of Unicode, and is a valid argument in that arena.
- T: The fact of the matter is that Unicode is not an information manipulation
- T: standard, but (for the purposes of its use in internationalization) a
- T: storage and an I/O standard. Viewed this way, the lexical ordering argument
- T: is nonapplicable.
-
- V: It'd be a sticking point about Slavic languages as well, you may be sure.
- V: Knowing ex-Soviet standard-making routine i think the official fishy-eyed
- V: representatives will silently vote pro to get some more time for raving
- V: in Western stores and nobody will use it since then. The "working" standards
- V: in Russia aren't made by committees.
-
- Then this will have to change, or the Russian users will pay the price.
- Those of us external to Russia are in no position to involve ourselves in
- this process. Any changes will have to originate in Russia.
-
- I haven't seen you come right out and say the Cyrillic lexical order in
- the Unicode standard (characters U+0400->U+04FF) and in the ISO-8859-5 sets
- are "wrong". Neither have I seen an alternative lexical order (with an
- accompanying rationale) put forth.
-
- V: 3) there is no reasonable way to do hyphenation.
- V: Since there is no way to tell language from the text there
- V: is no way to do any reasonable attempts to hyphenate.
- V: [OX - which language this word is from]?
- V:
- V: Good-bye wordprocessors and formatters?
-
- T: By this, you are obviously not referring to ideographic languages, such as
- T: Han, since hyphenation is meaningless for such languages. Setting aside
- T: the argument that if you don't know how to hyphenate in a language, you
- T: have no business generating situations requiring hyphenation by virtue
- T: of the fact that you are basically illiterate in that language... ;-).
-
- V: The reason may be as simple as reformatting spreadsheet containing
- V: (particularly) addresses of companies in the language i don't comprehend
- V: (though i can write it on the envelope).
-
- T: Hyphenation as a process is language dependent, and, in particular,
- T: dependent on the rendering mechanism (rendering mechanisms are *not*
- T: the subject under discussion; storage mechanisms *are*). Bluntly
- T: speaking, why does one need word processing software at all if this
- T: type of thing is codified? Hyphenation, like sorting, is manipulation
- T: of the information in a native language specific way.
-
- V: Exactly. But there is a lot of "legal" ways to do hyphenation -- and
- V: there are algorithms which do reasonably well knowing nothing about
- V: the language except which letters are vowels. It's quite enough
- V: for printing address labels. If i'm formatting a book i can specify the
- V: language myself.
-
- Address information cannot be hyphenated, at least in US and other Western
- mail of which I am personally aware. This is a non-issue. This is also
- something that is not the responsibility of the operating system or the
- storage mechanism therein... unless you are arguing that UFS knows to store
- documents without hyphenation, and that the "cat" and "more" programs will
- hyphenate for you. If you are talking about ANY OTHER APPLICATION, THE
- HYPHENATION IS THE APPLICATION'S RESPONSIBILITY. PERIOD. The fact that you
- will have to maintain vowel/consonant tables on a per-language basis is
- an obvious outcome of the processing of multinational information. It makes
- little difference to the application user how these tables are keyed.
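-
- For what it's worth, here is a toy sketch of the vowel-table scheme being
- described, with the vowel list as the only per-language data; this is not
- a real hyphenation algorithm, just an illustration of where the
- language-keyed table plugs in:
-
-     #include <stdio.h>
-     #include <string.h>
-
-     /* Toy rule: a candidate break may follow a vowel that is itself
-      * followed by a consonant.  The vowel list is the per-language
-      * table; everything else is language-blind.                       */
-     static int break_after(const char *vowels, const char *word, int i)
-     {
-         return strchr(vowels, word[i]) != NULL &&
-                word[i + 1] != '\0' &&
-                strchr(vowels, word[i + 1]) == NULL;
-     }
-
-     int main(void)
-     {
-         const char *vowels = "aeiouy";          /* per-language data */
-         const char *word   = "formatter";
-         int i;
-
-         for (i = 0; word[i] != '\0'; i++) {
-             putchar(word[i]);
-             if (break_after(vowels, word, i))
-                 putchar('-');                   /* candidate break   */
-         }
-         putchar('\n');
-         return 0;
-     }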
-
- T: Find another standard to tell you how to write a word processor.
-
- V: Is there any? :-)
-
- No, there isn't; that was the point. It is not the intent of the Unicode
- standard to provide a means of performing the operations normally
- associated with word processing. That is the job of the word processor, and
- is the reason people who write word processors are paid money by an employer
- rather than starving to death.
-
-
- V: 4) "the similar glyphs" in Unicode are often SLIGHTLY different
- V: typographical glyphs -- everybody who ever dealt with international
- V: publishing knows that fonts are designed as a WHOLE and every
- V: letter is designed with all others in mind -- i.e. X in Cyrillic
- V: is NOT the same X as Latin even if the fonts are variations of
- V: the same style. I'd wish you to see how ugly the Russian
- V: texts printed on American desktop publishing systems with
- V: "few characters added" are.
- V:
- V: In reality it means that Unicode is not a solution for
- V: typesetting.
-
- T: No, you're right; neither is it a standard for the production of
- T: pipefittings or the design of urban transportation systems.
-
- V: You somehow forgot that increasing number of texts get printed with
- V: typographical quality with all the stuff which follows.
- V: Ever saw a laser printer?
-
- Printing is simply another user-mode application program which can take
- advantage of the language indicators (whether on the file or in a document)
- for printing the prettiest, most lovely font of your choice. You think
- there are not font-selection mechanisms within Postscript for doing this?
-
- Again, font *changes* only become a problem if one attempts to print a
- *multinational* document. Since we aren't interested in multinationalization,
- it's unlikely that a Unicode font containing all Unicode glyphs will be
- used for that purpose.
-
- In all likelihood, use will be in a localized environment, *NOT* a
- multinational one. Since this is the case, it follows that the sum total of
- the Unicode font implemented in the US will be the ISO Latin-1 set.
- Similarly, if you are printing a Cyrillic document, you will be using a
- Cyrillic font; the "X" character you are concerned about will be *localized*
- to the Cyrillic "X", *NOT* the Latin "X".
-
-
- V: I see no reasons why we should treat the regular expression matching
- V: as "fancy" feature.
-
- Because globbing characters are language dependent. The easiest example
- of this is the distinction made between "localized" UNIX SVR4 for English
- vs. Spanish. The fact is, the character set used for Spanish replaces
- several characters in the English set with other characters particular to
- Spanish (DOS is the foremost example of this, with its reference to code
- pages and the fact that DOS file names fall within a very narrow set of
- characters). The globbing ("regular expression pattern match") characters
- DO change for any patterns more complicated than "*".
-
- T: Clearly, then, the applications you are describing are *not* Unicode
- T: applications, but "Fancy text" applications which could potentially
- T: make use of Unicode for character storage.
-
- V: Don't you think the ANY text is going to be fancy because Unicode
- V: as it is does not provide adequate means for the trivial operations?
-
- Perhaps any multinational text, yes; for normal text, processing will be
- done using the localized form, not the Unicode form; therefore the issue
- will never come up, unless the application requires embedded attributes
- (like a desktop publishing package). Since multinational processing is
- the exception rather than the rule, let the multinational users pay the
- price in terms of "Fancy text".
-
- V: As well i can provide every text with a font file. It is not a solution
- V: at all.
-
- Quite right; but doing so would be redundant unless you were using an output
- device, such as a CRT or a printer. It is the responsibility of the output
- device to present the data in a suitable format. For the most part, except
- for printing, which is difficult enough currently, this will be done by
- using localized fonts containing only a part of the full Unicode set (that
- part necessary for the localization language in use for that session/user/file)
- and thus will be coherently defined within the context of its localization.
-
- Again, multinational software is not being addressed; however, were we to
- address the issue, I suspect that it would, in all cases, be implementation
- dependent upon the multinational application.
-
- V: Thank you, i already expressed my opinion on Plan 9 UTF in comp.os.research.
- V: I also do not think it's exciting. There are much more efficient runic
- V: encodings (my toy OS uses 7bit per byte and 8th bit as a continuation
- V: indicator).
-
- I don't know how stridently I can express this: runic encoding destroys
- information (such as file size = character count) and makes file system
- processing re character substitution totally unacceptable... consider the
- case of a substitution of a character requiring 3 bytes to encode for one
- that takes 1 byte (or 4 bytes) currently. Say further that it is the
- first character in a 2M file. You are talking about either shifting the
- contents of the entire file, or, MUCH WORSE, going to record oriented files
- for text. If there is de facto attribution of text vs. other files (shifting
- the data is unacceptable. period.), there is no reason to avoid making
- that attribution as meaningful as possible.
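-
- To pin the objection down, here is a sketch using the 7-bit-data,
- high-bit-continuation scheme described above (the particular byte values
- are made up); the point is that the byte count the file system knows is
- no longer the character count, and widening one character means
- rewriting everything after it:
-
-     #include <stdio.h>
-
-     /* Count characters in a buffer where each byte carries 7 data bits
-      * and the high bit means "more bytes follow for this character".   */
-     static long char_count(const unsigned char *buf, long nbytes)
-     {
-         long i, chars = 0;
-
-         for (i = 0; i < nbytes; i++)
-             if ((buf[i] & 0x80) == 0)   /* final byte of a character */
-                 chars++;
-         return chars;
-     }
-
-     int main(void)
-     {
-         /* "A", "B", then one character that needs two bytes:
-          * 4 bytes on disk, but only 3 characters of text.        */
-         unsigned char buf[] = { 0x41, 0x42, 0x81, 0x05 };
-
-         printf("bytes = %d, characters = %ld\n",
-                (int)sizeof(buf), char_count(buf, (long)sizeof(buf)));
-         return 0;
-     }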
-
- V: Pretty soon it will be a dead standard because of the logical problems
- V: in the design. Voting is inadequate replacement for logic, you know.
- V: I'd better stick to a good standard from Zambia than to the brain-dead
- V: creature of ISO even if every petty bureaucrat voted for it.
-
- I agree; however, the people involved were slightly more knowledgeable about
- the subject than your average "petty bureaucrat". And there has not been
- a suggested alternative, only rantings of "not Unicode".
-
- V: I expressed my point of view (and proposed some kind of solution) in
- V: comp.std.internat, where the discussion should belong. I'd like you to
- V: see the problem not as an exercise in wrestling consensus from an
- V: international body but as a mathematical problem. From the logistical
- V: point of view the solution is simply incorrect and no standard committee
- V: can vote out that small fact. The fundamental assumption Unicode is
- V: based upon (i.e. one glyph - one code) makes the whole construction
- V: illogical and it, unfortunately, cannot be mended without serious
- V: redesign of the whole thing.
-
- Wrong, wrong, wrong.
-
- 	1) We are not discussing the embodiment of a standard, but the
- applicability of existing standards to a particular problem.
- Basically, we could care less about anything other than the
- existing or draft standards and their suitability to the task
- at hand, the international enabling of 386BSD.
-
- 2) We are not interested in "arriving" at a new standard or defending
- existing or draft standards, except as regards their suitability
- to our goal of enabling.
-
- 	3) The proposal of new solutions (new standards) is neither useful
- 		nor interesting, in light of our need being "now" and the adoption
- 		of a new solution or standard being "at some future date".
-
- 4) Barring a suggestion of a more suitable standard, I and others
- will begin coding to the Unicode standard.
-
- 5) Since we are discussing adoption of a standard for enabling of
- localization of 386BSD, and are neither intent on a general defense
- of any existing standard, nor the proposal of changes to an
- 		existing standard or the embodiment of a new standard, this
- 		discussion does *NOT* belong in comp.std.internat, since the
- subscribers of comp.unix.bsd are infinitely more qualified to
- determine which existing or draft standard they wish to use
- without a discussion of multinationalization (something only
- potentially useful to a limited audience, and then only at some
- future date when multinational processing on 386BSD becomes a
- 		generally desirable feature).
-
-
- V: Try to understand the argument about the redundance of encoding with
- V: external restrictions provided i used earlier in this letter. The
- V: Unicode committee really got caught in a logical trap and it's a pity
- V: few people realize that.
-
- I *understand* the argument; I simply *disagree* with its applicability
- to anything other than enabling multinationalization as opposed to
- enabling localization, which is the goal.
-
-
- Terry Lambert
- terry@icarus.weber.edu
- terry_lambert@novell.com
- ---
- Any opinions in this posting are my own and not those of my present
- or previous employers.
- --
- -------------------------------------------------------------------------------
- "I have an 8 user poetic license" - me
- Get the 386bsd FAQ from agate.berkeley.edu:/pub/386BSD/386bsd-0.1/unofficial
- -------------------------------------------------------------------------------
-