NetNews Usenet Archive 1992 #31

Internet Message Format | 1993-01-03 | 4.6 KB

Path: sparky!uunet!paladin.american.edu!gatech!pitt.edu!djbpitt
From: djbpitt+@pitt.edu (David J Birnbaum)
Newsgroups: comp.std.internat
Subject: Re: Language tagging
Message-ID: <1336@blue.cis.pitt.edu>
Date: 3 Jan 93 15:53:29 GMT
References: <1993Jan2.020512.3287@klaava.Helsinki.FI> <1321@blue.cis.pitt.edu> <1993Jan2.231703.21201@enea.se>
Sender: news+@pitt.edu
Organization: University of Pittsburgh
Lines: 78

>>In both cases, if you want language-specific data in your text stream,
>>you have to say so during input. If I need to insert Bulgarian words
>>into a Russian text stream I can do so without indicating a change,
>>as long as I understand that the consequence will be that the Bulgarian
>>data will be treated like Russian.
>
>Then I have to ask you explain something about Unicode I don't
>know. It is true that if you are using language-dependent features
>such as spell-checking and hyphenation, while inputting the text,
>then you have to know what you are doing with Unicode. But once
>you're done with it, it doesn't matter any longer, until the next
>time you want to process the text in some way. With Unicode you
>can decide whether to treat the text in Bulgarian or Russian,
>with Vadim's system you're stuck unless you convert it.

It is true that Vadim's character set always carries language
information along with it, while bare Unicode does not. A Unicode-based
system, though, will require more than the bare character set (which is
why I refer to it as a "Unicode-based system," rather than simply as
"Unicode"). There are two sets of problems: information that can (and
must, if it is to be present at all) be entered during input and
information that cannot be foreseen until subsequent processing is to
be performed (and that, therefore, cannot be entered during input).

Concerning the former, I would normally require that my texts include
language identification, so that the same "this is Bulgarian" or "this
is Russian" information would be present in both a Unicode-based system
and Vadim's system, although it would not be an inalienable part of the
character set in the former. Thus, I would be "stuck with" language
information under both systems.

While Unicode is capable of representing text without language
information (by eschewing the use of tags), I can't think of a
situation where I would want to do so. I can always ignore the tags if
they are an impediment to some later processing, but I can't restore
them automatically if they haven't been encoded at all. Thus, I don't
think the ability of Unicode to represent text without language
information is either a virtue or a liability; it isn't a virtue
because text normally _is_ in a specific language (although I can
construct perverse exceptions) and it isn't a liability because that
information can be encoded on a different level.

If, under your proposal, the "Russian" or "Bulgarian" tag can be
discarded after input, how can a subsequent user process a multilingual
text that has to distinguish Bulgarian from Russian information? I
agree with Vadim that this information should travel with the text, but
I disagree with his insistence that it must be part of the character
encoding architecture itself.

Concerning the latter, as has been indicated by others on this list,
processing depends on factors other than the language of input, such as
locale (defined more narrowly than language) or, more specifically, the
processor's locale (which is not necessarily the same as the locale in
which the text was entered). Thus, Vadim's proposal to store some
information needed for processing as part of the character set will not
obviate the need to provide additional information, information that
must reside not only outside the character set, but outside the text
stream.
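
The two kinds of information distinguished above can be illustrated
with a minimal sketch in Python. Everything named here is a
hypothetical illustration and not part of Unicode or of Vadim's
proposal: the Run type, the "ru"/"bg"/"sv"/"de" codes, and the two
deliberately simplified collation keys are assumptions made only for
the example. Language tags travel with the text but sit outside the
characters, while the sort order is supplied by whoever processes the
text.

```python
from dataclasses import dataclass
from typing import Callable, Dict, List


@dataclass(frozen=True)
class Run:
    """A stretch of text with a language tag kept outside the characters."""
    lang: str  # hypothetical tag values such as "ru" or "bg"
    text: str


def strip_tags(runs: List[Run]) -> str:
    """Ignoring the tags is always possible: keep only the characters."""
    return "".join(run.text for run in runs)


def select(runs: List[Run], lang: str) -> List[str]:
    """Pick out one language; this works only because the tags were kept."""
    return [run.text for run in runs if run.lang == lang]


# Collation is chosen by the processor's locale, not stored in the text.
# Both keys are simplified illustrations, not full collation rules.
SWEDISH_ALPHABET = "abcdefghijklmnopqrstuvwxyzåäö"
GERMAN_FOLD = str.maketrans({"ä": "a", "ö": "o", "ü": "u", "ß": "ss"})

COLLATION_KEYS: Dict[str, Callable[[str], object]] = {
    "sv": lambda w: [SWEDISH_ALPHABET.index(ch) for ch in w.lower()],
    "de": lambda w: w.lower().translate(GERMAN_FOLD),
}


def sort_for_locale(words: List[str], locale_name: str) -> List[str]:
    """Sort the same words under the ordering the processor asks for."""
    return sorted(words, key=COLLATION_KEYS[locale_name])


document = [
    Run("ru", "Это русский текст, "),
    Run("bg", "това е български текст"),
    Run("ru", ", и снова русский."),
]
plain = strip_tags(document)        # the tags cannot be rebuilt from this
bulgarian = select(document, "bg")  # needs the tags to be present

words = ["zebra", "äpple", "apa"]
swedish_order = sort_for_locale(words, "sv")
german_order = sort_for_locale(words, "de")
```

Under these simplified keys the Swedish ordering places "äpple" after
"zebra", while the German ordering places it between "apa" and "zebra".
The word list is identical in both cases, so the difference comes
entirely from information held outside the text stream. By contrast,
nothing in the plain string allows the Bulgarian run to be recovered
once the tags have been dropped.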

Under either Vadim's proposal or the type of Unicode-based system with
language tags that I suggest, a mechanism will be needed to distinguish
information that is an inherent and inalienable part of the text
independent of locale from information that is at least partially
determined by locale. We can't wish away this complication.

>But at least
>shifting scripts is a more obvious, than changing languages. If I
>switch from Swedish to Russian I would probably change the keyboard
>set-up, but not if I switch from Swedish to German - there is no
>reason to.

Script, language, and keyboard may be independent, although certain
combinations are more common than others.

--David
--
Professor David J. Birnbaum         djbpitt+@pitt.edu [Internet]
The Royal York Apartments, #802     djbpitt@pittvms [Bitnet]
3955 Bigelow Boulevard              voice: 1-412-687-4653
Pittsburgh, PA 15213  USA           fax: 1-412-624-9714