- Path: sparky!uunet!mcsun!sun4nl!cwi.nl!dik
- From: dik@cwi.nl (Dik T. Winter)
- Newsgroups: comp.std.internat
- Subject: Re: Dumb Americans (was INTERNATIONALIZATION: JAPAN, FAR EAST)
- Keywords: Han Kanji Katakana Hirugana ISO10646 Unicode Codepages
- Message-ID: <8498@charon.cwi.nl>
- Date: 2 Jan 93 01:27:24 GMT
- References: <1i2durINN2pj@rodan.UU.NET> <8496@charon.cwi.nl> <1i2lojINN4se@rodan.UU.NET>
- Sender: news@cwi.nl
- Organization: CWI, Amsterdam
- Lines: 57
-
- Let me shift the discussion a bit: do we want sorting according to the text's
- language or the user's language?
-
- In article <1i2lojINN4se@rodan.UU.NET> avg@rodan.UU.NET (Vadim Antonov) writes:
- > BTW, how do they deal with things like ["o-z] in regular expressions?
- I do not know, but I do know that in many cases a Finn would like to handle
- it differently from a German, regardless of the language of the text involved!
- For sorting, the language of the underlying text matters much less than
- the native language of the intended audience. Moreover, if the intended
- audience is multilingual, it makes sense to create multiple entries for a
- single item, sorted according to different criteria. I have an atlas
- published by a number of West European publishers. Its index sorts place
- names both by local customs and by non-local customs. For example, the
- Spanish Llano occurs twice, once among the entries under the letter L and
- once under a separate heading. So look-up is easy both for the native
- Spanish speaker and for the non-speaker. Putting it in only one place makes
- look-up difficult either for the Spanish speaker (who would look under Ll)
- or for the person who does not know Spanish (and does not know that
- Spanish Ll is a separate letter).
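
The double index described above can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the function names and the place list are made up for the example, and the "local" rule models nothing more than the one convention mentioned here, namely that Ll counts as a separate letter following L.

```python
def non_local_key(name):
    """Plain letter-by-letter ordering: 'll' is just two l's."""
    return name.lower()


def local_key(name):
    """Ordering that treats 'll' as one letter sorting after plain 'l'.

    Assumption for this sketch: only the Ll rule is modelled.
    """
    s = name.lower()
    key = []
    i = 0
    while i < len(s):
        if s.startswith("ll", i):
            key.append(("l", 1))   # the digraph ranks just after 'l'
            i += 2
        else:
            key.append((s[i], 0))
            i += 1
    return key


def build_index(names):
    """Return both orderings so each reader finds a name where expected."""
    return {
        "non_local": sorted(names, key=non_local_key),
        "local": sorted(names, key=local_key),
    }


# Hypothetical sample data for the illustration.
places = ["Lugo", "Llano", "Lorca", "Llanes", "Madrid"]
index = build_index(places)
print(index["non_local"])
print(index["local"])
```

In the non-local ordering Llano sits among the L entries, before Lorca; in the local ordering it moves to its own group after Lugo.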
- >
- > Another solution is to create a generic rule for EQUIVALENT letters
- > which have identical position in the sorting order and to add a
- > "letter" oe.
- That will not work, because this letter has to sort between the combinations
- od and of, which again mixes multi-letter sequences with single letters.
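
A small Python sketch of why a single rank per character is not enough. It assumes the "oe" in question is a single character such as ö that is to be ordered as the two-letter sequence oe; the table and the word list are hypothetical and exist only for this example.

```python
import unicodedata

# Assumption for this sketch: one character is ordered as if it were
# spelled with two letters.
EXPANSIONS = {"ö": "oe"}


def expanding_key(word):
    """Build a sort key by expanding single characters into letter pairs."""
    text = unicodedata.normalize("NFC", word).lower()
    return "".join(EXPANSIONS.get(ch, ch) for ch in text)


# Hypothetical sample words.
words = ["Ofen", "Öl", "Oder", "Ozon"]

by_code_point = sorted(words, key=str.lower)
by_expansion = sorted(words, key=expanding_key)
print(by_code_point)
print(by_expansion)
```

With the expansion the ö word lands between the od and of words. Any scheme that gives the character one fixed position relative to o cannot do that, because the position depends on the letters that follow.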
- >
- > So, even if sorting is not regular there always is a way around --
- > with Unicode you can't do even that.
- Eh? I would think the same way around!
- >
- > (I repeat it: to do trivial operations like case-insensitive comparisons,
- > sorting, regular expression matching Unicode requires explicit
- > specification of the language -- it can be obtained from user or
- > recorded somewhere outside the text itself. The "paradox" is that
- > if we have this information we DO NOT NEED extended Unicode codes
- > because we already know the alphabet and it is small!)
- As I said above, in general you do *not* need the language of the text
- involved, but the language of the user, which cannot be recorded in the
- text. As someone whose native language is neither German, Swedish nor
- Finnish, I would be extremely surprised if my searches sometimes matched
- the German a-umlaut and sometimes the non-German ones, especially in a
- multi-lingual text. How would you deal with mixed languages? (And I do
- not mean a mix of different scripts, which Latin, Cyrillic and Greek in
- fact are; that is why they get different code points.)
- >
- > Users are interested if they're able to do the work without grep
- > asking them which language they mean every time they run it.
- Right. That is why, when I sort texts, I want the Dutch sorting order
- (which means ignoring diacritics), regardless of the original language
- of the text. And I would like to be able to set an environment variable
- in my profile stating that my preferred language is Dutch.
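
One way such a preference could be wired up, sketched in Python. The variable name PREFERRED_SORT_LANG is invented for this example and is not an existing convention; the "nl" rule implements only the "ignore diacritics" behaviour described above and says nothing about how ties between accented and unaccented forms should be broken.

```python
import os
import unicodedata


def strip_diacritics(text):
    """Drop combining marks after canonical decomposition."""
    decomposed = unicodedata.normalize("NFD", text)
    return "".join(ch for ch in decomposed if not unicodedata.combining(ch))


def dutch_key(text):
    """Sort key that ignores diacritics and case."""
    return strip_diacritics(text).casefold()


def code_point_key(text):
    """Fallback: raw code point order."""
    return text


# Hypothetical table mapping a user's language to a sort key.
KEYS = {"nl": dutch_key}


def preferred_key(environ=os.environ):
    """Pick the key from the user's environment, not from the text."""
    lang = environ.get("PREFERRED_SORT_LANG", "")
    return KEYS.get(lang, code_point_key)


# Hypothetical sample words from more than one language.
words = ["zebra", "Äpfel", "appel", "één", "eend"]
dutch = sorted(words, key=preferred_key({"PREFERRED_SORT_LANG": "nl"}))
default = sorted(words, key=preferred_key({}))
print(dutch)
print(default)
```

The same word list comes out in a different order depending only on the user's setting, which is the point: the text carries no language tag at all.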
- >
- > Beat it!
- Beat that!
- --
- dik t. winter, cwi, kruislaan 413, 1098 sj amsterdam, nederland
- home: bovenover 215, 1025 jn amsterdam, nederland; e-mail: dik@cwi.nl
-