- Path: sparky!uunet!paladin.american.edu!gatech!pitt.edu!djbpitt
- From: djbpitt+@pitt.edu (David J Birnbaum)
- Newsgroups: comp.std.internat
- Subject: Re: Language tagging
- Message-ID: <1336@blue.cis.pitt.edu>
- Date: 3 Jan 93 15:53:29 GMT
- References: <1993Jan2.020512.3287@klaava.Helsinki.FI> <1321@blue.cis.pitt.edu> <1993Jan2.231703.21201@enea.se>
- Sender: news+@pitt.edu
- Organization: University of Pittsburgh
- Lines: 78
-
- >>In both cases, if you want language-specific data in your text stream,
- >>you have to say so during input. If I need to insert Bulgarian words
- >>into a Russian text stream I can do so without indicating a change,
- >>as long as I understand that the consequence will be that the Bulgarian
- >>data will be treated like Russian.
- >
- >Then I have to ask you explain something about Unicode I don't
- >know. It is true that if you are using language-dependent features
- >such as spell-checking and hyphenation, while inputting the text,
- >then you have to know what you are doing with Unicode. But once
- >you're done with it, it doesn't matter any longer, until the next
- >time you want to process the text in some way. With Unicode you
- >can decide whether to treat the text in Bulgarian or Russian,
- >with Vadim's system you're stuck unless you convert it.
-
- It is true that Vadim's character set always carries language
- information along with it, while bare Unicode does not. A Unicode-based
- system, though, will require more than the bare character set (which is
- why I refer to it as a "Unicode-based system," rather than simply as
- "Unicode").
-
- There are two sets of problems: information that can (and must, if it is
- to be present at all) be entered during input and information that
- cannot be foreseen until subsequent processing is to be performed (and
- that, therefore, cannot be entered during input).
-
- Concerning the former, I would normally require that my texts include
- language identification, so that the same "this is Bulgarian" or "this
- is Russian" information would be present in both a Unicode-based system
- and Vadim's system, although it would not be an inalienable part of the
- character set in the former. Thus, I would be "stuck with" language
- information under both systems. While Unicode is capable of
- representing text without language information (by eschewing the use of
- tags), I can't think of a situation where I would want to do so. I can
- always ignore the tags if they are an impediment to some later
- processing, but I can't restore them automatically if they haven't been
- encoded at all. Thus, I don't think the ability of Unicode to
- represent text without language information is either a virtue or a
- liability; it isn't a virtue because text normally _is_ in a specific
- language (although I can construct perverse exceptions) and it isn't a
- liability because that information can be encoded on a different level.
- If, under your proposal, the "Russian" or "Bulgarian" tag can be
- discarded after input, how can a subsequent user process a multilingual
- text that has to distinguish Bulgarian from Russian information? I agree
- with Vadim that this information should travel with the text, but I
- disagree with his insistence that it must be part of the character
- encoding architecture itself.
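
To make the asymmetry concrete, here is a minimal sketch of language tags that travel with the text but sit outside the character encoding. The `Run` type, the tag values "ru" and "bg", and the helper names are illustrative assumptions, not part of Unicode or of Vadim's proposal.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Run:
    """A stretch of text plus a language tag stored beside it, not inside the characters."""
    lang: str  # hypothetical tag values such as "ru" or "bg"
    text: str


def strip_tags(runs):
    """Ignoring the tags is trivial: concatenate the bare character data."""
    return "".join(run.text for run in runs)


def select(runs, lang):
    """A later user can still pull out only the runs in one language."""
    return [run.text for run in runs if run.lang == lang]


document = [
    Run("ru", "Русский текст, "),
    Run("bg", "български текст"),
    Run("ru", ", снова русский."),
]

bare = strip_tags(document)           # tags dropped, characters untouched
bulgarian = select(document, "bg")    # only possible while the tags exist
```

Going from `document` to `bare` needs no knowledge of either language. Going back from `bare` to `document` would require guessing where the Bulgarian begins and ends, which is why the tags have to be entered at input time.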
-
- Concerning the latter, as has been indicated by others on this list,
- processing depends on factors other than the language of input, such as
- locale (defined more narrowly than language) or, more specifically, the
- processor's locale (which is not necessarily the same as the locale in
- which the text was entered). Thus, Vadim's proposal to store some
- information needed for processing as part of the character set will not
- obviate the need to provide additional information, information that
- must reside not only outside the character set, but outside the text
- stream. Under either Vadim's proposal or the type of Unicode-based
- system with language tags that I suggest, a mechanism will be needed to
- distinguish information that is an inherent and inalienable part of the
- text independent of locale from information that is at least partially
- determined by locale. We can't wish away this complication.
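
The separation can be sketched as follows, assuming (purely for illustration) that language tags are stored with the text while the locale is supplied by whoever processes it. The function name `plan_processing` and the locale strings "bg_BG" and "en_US" are hypothetical.

```python
def plan_processing(tagged_runs, processor_locale):
    """Pair each run's own language tag with the locale supplied by the processor."""
    return [
        {"text": text, "lang": lang, "locale": processor_locale}
        for lang, text in tagged_runs
    ]


# The tags are part of the text and are the same for every reader.
tagged_runs = [("ru", "Русский текст"), ("bg", "български текст")]

# The locale is not in the text stream; each processor brings its own.
in_sofia = plan_processing(tagged_runs, "bg_BG")
in_pittsburgh = plan_processing(tagged_runs, "en_US")
```

The `lang` values are identical in both results because they belong to the text; only `locale` differs, and nothing stored in the character set or the text stream could have supplied it in advance.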
-
- >But at least
- >shifting scripts is a more obvious, than changing languages. If I
- >switch from Swedish to Russian I would probably change the keyboard
- >set-up, but not if I switch from Swedish to German - there is no
- >reason to.
-
- Script, language, and keyboard may be independent, although certain
- combinations are more common than others.
-
- --David
-
- --
- Professor David J. Birnbaum djbpitt+@pitt.edu [Internet]
- The Royal York Apartments, #802 djbpitt@pittvms [Bitnet]
- 3955 Bigelow Boulevard voice: 1-412-687-4653
- Pittsburgh, PA 15213 USA fax: 1-412-624-9714
-