NetNews Usenet Archive 1992 #31

Internet Message Format | 1993-01-01 | 3.6 KB

Path: sparky!uunet!mcsun!sun4nl!cwi.nl!dik
From: dik@cwi.nl (Dik T. Winter)
Newsgroups: comp.std.internat
Subject: Re: Dumb Americans (was INTERNATIONALIZATION: JAPAN, FAR EAST)
Keywords: Han Kanji Katakana Hirugana ISO10646 Unicode Codepages
Message-ID: <8498@charon.cwi.nl>
Date: 2 Jan 93 01:27:24 GMT
References: <1i2durINN2pj@rodan.UU.NET> <8496@charon.cwi.nl> <1i2lojINN4se@rodan.UU.NET>
Sender: news@cwi.nl
Organization: CWI, Amsterdam
Lines: 57

I move the discussion a bit: would we like sorting according to the
text's language or the user's language?

In article <1i2lojINN4se@rodan.UU.NET> avg@rodan.UU.NET (Vadim Antonov) writes:
> BTW, how do they deal with things like ["o-z] in regular expressions?

I do not know, but I do know that in many cases a Finn would like to
handle it differently from a German, regardless of the language of the
text involved! For sorting, the language of the underlying text matters
much less than the native language of the intended public. Moreover, if
the intended public is multilingual, it makes sense to make multiple
entries for single items, sorted according to different criteria.

I have an atlas published by a number of West European publishers. It
has an index where place names are sorted both by local customs and by
non-local customs. E.g. the Spanish Llano occurs twice, once under the
letter L and once under a separate heading. So look-up is easy both for
the native Spanish speaker and for the non-speaker. Putting it in only
one place makes look-up difficult either for the Spanish speaker (who
would look under Ll) or for the person who does not know Spanish (and
does not know that Spanish Ll is a separate letter).

>
> Another solution is to create a generic rule for EQUIVALENT letters
> which have identical position in the sorting order and to add a
> "letter" oe.
Will not work, because this letter has to sort between the combinations
od and of, which again mixes multiple letters and single letters.

>
> So, even if sorting is not regular there always is a way around --
> with Unicode you can't do even that.

Eh? I would think the same way around!

>
> (I repeat it: to do trivial operations like case-insensitive comparisons,
> sorting, regular expression matching Unicode requires explicit
> specification of the language -- it can be obtained from user or
> recorded somewhere outside the text itself. The "paradox" is that
> if we have this information we DO NOT NEED extended Unicode codes
> because we already know the alphabet and it is small!)

As I said above, in general you do *not* need the language of the text
involved, but the language of the user, which cannot be recorded in the
text. I, as a native speaker of neither German, Swedish nor Finnish,
would be extremely surprised if my searches sometimes matched the German
a-umlaut and sometimes the non-German ones, especially in a multilingual
text. How would you deal with mixed languages? (And I do not mean mixed
scripts, which is what Latin, Cyrillic and Greek in fact are, and which
is why they get different code points.)

>
> Users are interested if they're able to do the work without grep
> asking them which language they mean every time they run it.

Right. That is why, when I sort texts, I want the Dutch sorting order,
regardless of the original language of the text, which means ignoring
diacritics. And I would like to be able to set an environment variable
in my profile saying that my preferred language is Dutch.

>
> Beat it!

Beat that!
--
dik t. winter, cwi, kruislaan 413, 1098 sj amsterdam, nederland
home: bovenover 215, 1025 jn amsterdam, nederland; e-mail: dik@cwi.nl
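
A rough illustration of the two ideas in the post above, offered only as a sketch and not as anything the poster wrote: a sort key that ignores diacritics (the order described as the Dutch preference), and an index that lists a name under every heading a reader might look for, so that Llano appears under both L and Ll. The function names (strip_diacritics, plain_key, spanish_heading, build_index) and the sample place list are hypothetical. Treating "Ch" and "Ll" as separate headings is an assumption about the traditional Spanish convention, and a Finnish or Swedish reader would want a different key altogether.

```python
import unicodedata


def strip_diacritics(text: str) -> str:
    """Return text with combining marks removed, e.g. 'Åbo' -> 'Abo'."""
    decomposed = unicodedata.normalize("NFD", text)
    return "".join(ch for ch in decomposed if not unicodedata.combining(ch))


def plain_key(name: str) -> str:
    """Sort key for a reader who ignores diacritics and letter case."""
    return strip_diacritics(name).casefold()


def plain_heading(name: str) -> str:
    """Index heading for a reader who sees 'Ll' as L followed by l."""
    return strip_diacritics(name)[:1].upper()


def spanish_heading(name: str) -> str:
    """Index heading if 'Ch' and 'Ll' count as letters of their own (assumed)."""
    folded = plain_key(name)
    for digraph in ("ch", "ll"):
        if folded.startswith(digraph):
            return digraph.capitalize()
    return folded[:1].upper()


def build_index(names):
    """Map heading -> sorted names; a name appears under every heading that applies."""
    index = {}
    for name in names:
        # A set, so a name whose two headings coincide is listed only once.
        for heading in {plain_heading(name), spanish_heading(name)}:
            index.setdefault(heading, []).append(name)
    for entries in index.values():
        entries.sort(key=plain_key)
    return index


# Hypothetical sample data.
places = ["Lugo", "Llano", "Åbo", "Amsterdam", "León"]
index = build_index(places)
for heading in sorted(index):
    print(heading, index[heading])
```

Here the sort key depends on the reader and not on the language of each place name, which is the point argued above. On POSIX systems a comparable effect can usually be had without hand-written keys by calling locale.setlocale(locale.LC_COLLATE, "") and sorting with key=locale.strxfrm, which takes the collation order from the user's environment; whether a given locale is installed depends on the system.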