- Sender: Postmaster@iecc.cambridge.ma.us
- Newsgroups: comp.std.internat
- Path: sparky!uunet!wupost!usc!elroy.jpl.nasa.gov!decwrl!world!iecc!mailgateway
- Subject: Re: Dumb Americans (was INTERNATIONALIZATION: JAPAN, FAR EAST)
- References: <KIRAVUO.93Jan1164705@lesti.hut.fi>
- Organization: I.E.C.C.
- Date: 1 Jan 93 17:57:35 EST (Fri)
- From: johnl@iecc.cambridge.ma.us (John R. Levine)
- Message-ID: <9301011757.AA04714@iecc.cambridge.ma.us>
- Lines: 35
-
- >it is my opinion that there is no way to make a simple character code that
- >will perform sorting and character conversion automatically.
-
- Having done my share of i18n, I thoroughly agree. When I was writing the
- international scaffolding for Javelin, a PC time-series modelling package,
- we came up with a locale-like thing that let you load a country
- configuration file. The config file set the collating sequence, including
- which characters sort together, and whether there are pairs of characters
- that sort as one, like Spanish ch and ll, or single characters that sort as
- two, like the German umlauted vowels. It also loaded the strings that
- were inserted automatically into graphs and printouts, e.g. month names
- and words like "Millions."
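
A rough sketch of that kind of collating table, in Python rather than anything Javelin actually used. The class name `CollationTable`, the helper `latin_order`, and the sample tables are all hypothetical; the point is only to show characters that sort together, pairs that sort as one, and single characters that sort as two.

```python
import string


class CollationTable:
    """Hypothetical loadable collating sequence (not Javelin's real format)."""

    def __init__(self, order, contractions=(), expansions=None):
        order = list(order)
        # Each entry of `order` is a group of elements that sort together,
        # e.g. ("a", "A"). The group's index is its collating position.
        self.weights = {}
        for position, group in enumerate(order):
            for element in group:
                self.weights[element] = position
        # Anything not in the table sorts after everything else (assumption).
        self.unknown = len(order)
        # Pairs of characters that sort as one element, like Spanish "ch".
        self.contractions = set(contractions)
        # Single characters that sort as two, like German "ü" as "ue".
        self.expansions = dict(expansions or {})

    def key(self, text):
        positions = []
        i = 0
        while i < len(text):
            pair = text[i:i + 2].lower()
            if pair in self.contractions:
                positions.append(self.weights.get(pair, self.unknown))
                i += 2
                continue
            for piece in self.expansions.get(text[i], text[i]):
                positions.append(self.weights.get(piece, self.unknown))
            i += 1
        return positions


def latin_order(digraphs=()):
    """Build a-z with upper and lower case together, each digraph placed
    right after its first letter (traditional Spanish: c, ch, d ...)."""
    order = []
    for letter in string.ascii_lowercase:
        order.append((letter, letter.upper()))
        for digraph in digraphs:
            if digraph[0] == letter:
                order.append((digraph,))
    return order


spanish = CollationTable(latin_order(["ch", "ll"]), contractions=["ch", "ll"])
german = CollationTable(latin_order(),
                        expansions={"ä": "ae", "ö": "oe", "ü": "ue"})

words = ["dama", "chico", "cuna", "llama", "luz"]
spanish_sorted = sorted(words, key=spanish.key)
```

With the Spanish table, "chico" lands after "cuna" and "llama" after "luz", which a plain character-code sort would get wrong. With the German table, "Müller" and "Mueller" produce the same key.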
-
- What did not change were the message strings in the program or the table of
- function and macro names, all of which were version-specific. That is, if
- you bought a French version of Javelin, it always spoke French, but you
- could load in a country driver to produce reports in German or Spanish or
- Dutch. The separation between the "locale" for the program and the
- "locale" for the reports was quite useful, and the users liked it. The
- program message strings were linked in when a particular version of
- Javelin was built, so that the distributor for a particular country could
- build the version for that country.
-
- As for sorting specifically, it became quite clear that we could not
- depend on there being a canonical printable version of a sortable string.
- In some languages, there are lower case characters without upper case
- equivalents or vice versa. The canonical form was a list of collating
- sequence positions so that in English all versions of the letter "A" might
- turn into 12, all versions of "B" into 13, and so on. The canonical form
- was easy to sort and useful for determining whether two strings were
- equivalent, important in the symbol table, but you couldn't turn it back
- into something printable.
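
The canonical form can be sketched roughly like this, again in Python and again not the original code. The starting position 12 for "A" follows the example above; the table `POSITIONS`, the accented variants, and the `define`/`lookup` helpers are made up for illustration.

```python
# Hypothetical English table: every version of a letter maps to one position.
POSITIONS = {}
for offset, letter in enumerate("abcdefghijklmnopqrstuvwxyz"):
    for variant in (letter, letter.upper()):
        POSITIONS[variant] = 12 + offset  # "A"/"a" -> 12, "B"/"b" -> 13, ...

# Accented versions of "a" sort with plain "a" in this made-up table.
for variant in "àáâäÀÁÂÄ":
    POSITIONS[variant] = POSITIONS["a"]


def canonical(name):
    """Return the canonical form: a tuple of collating positions.

    Characters missing from the table raise KeyError in this sketch.
    """
    return tuple(POSITIONS[ch] for ch in name)


# A symbol table keyed on the canonical form, so equivalent spellings
# of a name find the same entry.
symbols = {}


def define(name, value):
    symbols[canonical(name)] = value


def lookup(name):
    return symbols[canonical(name)]


define("Total", 100)
same_entry = lookup("TOTAL")

# Sorting on the canonical form compares lists of positions.
names = ["beta", "Alpha", "alpha", "Beta"]
ordered = sorted(names, key=canonical)
```

Note that `canonical("a")`, `canonical("A")`, and `canonical("À")` are all `(12,)`, so there is no way to tell from the canonical form which spelling to print. The printable string has to be kept alongside it.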
-
- Regards,
- John Levine, johnl@iecc.cambridge.ma.us, {spdcc|ima|world}!iecc!johnl
-