Java Cookbook:
Creating Global Applications


Limitations of JDK 1.1

Introduction, Converting, Limitations, Future, Closing

The JDK 1.1 release provides a great deal of support for European languages and the countries that use them (we will refer to these locales as Western), but due to time constraints there is only minimal support for the Far East, and there is no adequate support for the Middle East and Southeast Asia (in which we also include the Indian subcontinent). In addition, the font support is very weak, even for English!

On the plus side, JDK 1.1 fonts have the capability to draw any Unicode characters, assuming that the host platform/browser supports drawing those characters. This requires that the appropriate fonts be installed on the host system. JDK 1.1 does have a mechanism for letting you combine many different native fonts into a single logical font in order to cover a larger range of Unicode characters. This is currently done by editing one or more font.properties text files in a special format. JavaSoft has very good documentation about this process and its current limitations on the JavaSoft Web site at http://www.javasoft.com:80/products/jdk/1.1/docs/guide/intl/index.html .
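As a rough illustration (the exact syntax varies by platform and is specified in the documentation at the URL above; the entries below follow the Windows flavor of the format and should not be copied verbatim), a font.properties file composes a logical font from numbered native-font components:

```properties
# Compose the logical "serif" font from native Windows fonts.
# Component 0 is searched first; NEED_CONVERTED marks components
# that need the character converter named by the fontcharset entry.
serif.0=Times New Roman,ANSI_CHARSET
serif.1=Symbol,SYMBOL_CHARSET,NEED_CONVERTED
fontcharset.serif.1=sun.awt.windows.CharToByteSymbol
# Unicode ranges for which component 0 should NOT be used
exclusion.serif.0=0100-ffff
# Glyph to show for characters no component can draw
default.char=2751
```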

On the minus side, Java support for fonts is still very weak. For one thing, there is no way to access the full set of fonts on a system; you are limited to a small set of logical fonts: Serif, SansSerif, Monospaced, etc. (By the way, to get the list of fonts, it is futile to search for that method in Font; you must use Toolkit.getFontList). For most applets this is not so bad; the few logical fonts supported by each implementation are usually sufficient. However, for Java applications this is a real problem; you can't build a Java application that can list and use the available fonts on a system, something that the simplest of native applications can do.
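For example, the only portable way to enumerate fonts in JDK 1.1 is the short list of logical names returned by the Toolkit (the method still exists in later JDKs, though it was eventually deprecated):

```java
import java.awt.Toolkit;

public class ListFonts {
    public static void main(String[] args) {
        // Returns only the logical font names (Serif, SansSerif,
        // Monospaced, ...), not the native fonts installed on the system.
        String[] fonts = Toolkit.getDefaultToolkit().getFontList();
        for (int i = 0; i < fonts.length; i++) {
            System.out.println(fonts[i]);
        }
    }
}
```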

 Note

The logical font names will map to different fonts on different platforms; never make assumptions about metrics or coverage of these fonts.

The following are general deficiencies in the current international support. For background information on all the following topics, see the Unicode Standard.

  • Default Locales in Applets. There is unfortunately no per-thread data in JDK 1.1. That deficiency, in addition to security concerns, prevents applets from being able to call Locale.setDefault. If you want to do this, the only current work-around is to store your own default in a well-known place, and pass it around explicitly. This will not work for inaccessible code that only uses the standard default locale, such as exception formatting.
  • Full Locale Coverage. Though a large number of locales are in JDK 1.1, more need to be added. This is especially true of South America, where the locale data tends to differ only in the name of the country and the currency and number formats. These countries are otherwise in pretty good shape.
  • Editing. The current TextArea and TextField use host peers to do the editing. That means that if the host does not support Unicode natively, there is a conversion to some character set that the host does handle, typically a character set that only handles the host's default locale. In such circumstances, the rich set of symbols and punctuation in Unicode (let alone letters in other languages) is simply discarded. In addition, some current implementations do signed conversion back to Unicode, so when you put 0x00E5 (å) into the TextArea, you get back 0xFFE5 (fullwidth ¥)!
  • Character Code Conversions. As we mentioned above, you really need to be able to iterate through all of the installed character code converters and to have a richer API--for, among other things, better performance.
  • Keyboards. Most non-Western locales use mixtures of different scripts; for example, you will find English product names mixed in with Japanese or with Arabic. To handle this, the user needs to be able to change between different keyboard mappings. Usually, the operating system will provide support for this, but for word processing, you need to be able to find out what the current keyboard is, iterate through the installed keyboards, and reset the current keyboard. You can thereby provide a convenience for the user: when the user clicks in Japanese text, you automatically switch the keyboard to Japanese.
  • Calendars. Most non-Western locales have alternative calendars and need to allow a choice between the standard (Gregorian) calendar and at least one alternative. Japanese, for example, needs an additional calendar, which is based on the year of the various Emperors' reigns. If you need to do this in JDK 1.1, you will need to subclass Calendar, which is fairly straightforward. Although you can do this, it may be of limited use until some of the other features are supported in Java.
  • Styled Text. Using the current Java API, you can perform your own text layout to support drawing, hit-testing, highlighting, and line-break. For example, you would make a series of draw calls with font changes between each one. Even with Western scripts, this does not support higher-level features such as justification efficiently. With non-Western scripts, such as Hebrew, that have a mixture of right-to-left and left-to-right characters, this method breaks down very quickly.
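The signed-conversion bug described under Editing above is easy to reproduce in pure Java: narrowing a signed byte straight to a char sign-extends, turning 0x00E5 (å) into 0xFFE5 (fullwidth ¥). A correct round trip must mask the byte first:

```java
public class SignExtension {
    public static void main(String[] args) {
        byte b = (byte) 0x00E5;          // å stored as a (signed) byte
        char wrong = (char) b;           // sign-extends: yields '\uFFE5'
        char right = (char) (b & 0xFF);  // mask first: yields '\u00E5'
        System.out.println(Integer.toHexString(wrong)); // ffe5
        System.out.println(Integer.toHexString(right)); // e5
    }
}
```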

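The work-around described under Default Locales in Applets amounts to something like the following sketch, in which the applet keeps its own default locale in a well-known place (the holder class here is our own invention, not part of any JDK) and passes it explicitly to every locale-sensitive call:

```java
import java.text.NumberFormat;
import java.util.Locale;

// Hypothetical holder for a per-applet default locale; JDK 1.1 provides
// no such class, so each applet must roll its own.
class AppletLocale {
    private static Locale current = Locale.getDefault();
    static Locale get() { return current; }
    static void set(Locale l) { current = l; }
}

public class LocaleDemo {
    public static void main(String[] args) {
        AppletLocale.set(Locale.FRANCE);
        // Pass the stored locale explicitly; code that silently uses
        // Locale.getDefault() (e.g. exception formatting) is not covered.
        NumberFormat nf = NumberFormat.getNumberInstance(AppletLocale.get());
        System.out.println(nf.format(1234.5)); // e.g. "1 234,5"
    }
}
```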
Far East

Chinese, Japanese, and Korean (CJK) have very large character sets, which require fonts that can handle them, and require special support for inputting characters. For high-end systems, vertical text and ruby (textual annotations) are also required.

Input Methods

The main issue for correct localization for CJK is input. Due to the complexity of the character sets, there is a conversion facility that transforms input from a small set of phonetic or component characters that the user types into the actual CJK characters stored in the document. This facility is often called an input method engine (IME) or sometimes a front-end processor (FEP). An IME is generally quite complex. It often does sophisticated grammatical analysis of the text, and it commonly

  • uses the input context to disambiguate characters
  • marks special states of text with distinctive highlighting
  • allows the user to choose and control alternative transformations
  • allows the user to add new expressions to user dictionaries.

There are three main types of input support that each offer different levels of capability and require different degrees of application changes:

  1. Name: Off-the-spot (a.k.a. bottom-line)
    User Value: minimal
    Application Changes: none

    When the user types a character, a window appears (usually at the bottom of the screen). Within that window, the user interacts with the IME. When the user is finished, a series of keyboard events are fed to the unsuspecting application one at a time. (This would speed up if there were a Java keyboard event that contained an entire string.)

  2. Name: Over-the-spot
    User Value: partial
    Application Changes: minimal

    When the user types a character, a window appears right over the place the user was typing. The text is often in the same font and size and feels more like the user is typing directly into the document. Otherwise, this is the same as off-the-spot.

  3. Name: On-the-spot (a.k.a. inline)
    User Value: full
    Application Changes: major (for word processors)

    When the user types a character, it goes directly into the document. The special highlighting happens within the text, and changes are immediately reflected, including word-wrap. These require fairly complex interactions for word processors; programs that use the built-in Java editing (TextField, TextArea) are not affected.

Currently, you are completely dependent on the quality of the Java implementation on the host platform or browser.

  • You will get on-the-spot in TextField and TextArea--but only if the implementation supports it.
  • Otherwise you will get off-the-spot support--but only if the implementation supports it.

It is fairly easy for a host platform or browser to support both these features, at least on the major platforms that have CJK support, but, unfortunately, there is no guarantee that this is done.

Moreover, there is no way for a Java program to support on-the-spot outside of TextArea/TextField, such as for word processors doing real rich-text editing with mixed styles and fonts. You can do over-the-spot support yourself by opening up your own small window that contains a TextArea, putting the window in the right position and setting the font yourself. However, you can't get a list of the available IMEs or choose which gets invoked.

Fonts

Large character fonts are handled in JDK 1.1. As discussed above, there is a limited selection.

Neither ruby nor vertical text is supported in JDK 1.1. Both require special handling in text layout, but are fairly high-end features and so are not a problem for most programs.

  • Ruby. Because people often don't know the pronunciation of a particular ideograph (Kanji), small phonetic symbols are often placed over one or more ideographs. Figure 1 shows an example of how this works, using English characters to show the pronunciation of Greek letters.

    Figure 1
    Ruby

  • Vertical Text. CJK characters can also be written vertically, with lines that go from right to left (usually). There are two complications: Some characters will rotate or change shape in a vertical context, and intermixed Latin text may rotate 90 degrees clockwise (and Arabic characters 90 degrees counter-clockwise!).

Middle East

Arabic and Hebrew are written from right-to-left but also allow mixing in left-to-right text, such as numbers or English text. An example is shown in Figure 2. This feature is called BIDI (short for bidirectional). Moreover, Arabic characters may change shape radically, depending on their context. Both these features require very special handling in text layout to support drawing, hit-testing, highlighting, and line-break and are not optional for these locales.

Figure 2
Bidirectional Reordering
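JDK 1.1 offers no help here, but later JDKs added java.text.Bidi, which illustrates the kind of analysis BIDI layout requires: it splits mixed text into directional runs that a renderer must then reorder visually. A small sketch of the run analysis:

```java
import java.text.Bidi;

public class BidiDemo {
    public static void main(String[] args) {
        // English followed by Hebrew (aleph, bet, gimel)
        String mixed = "abc \u05D0\u05D1\u05D2";
        Bidi bidi = new Bidi(mixed, Bidi.DIRECTION_DEFAULT_LEFT_TO_RIGHT);
        System.out.println("mixed directions: " + bidi.isMixed());
        for (int i = 0; i < bidi.getRunCount(); i++) {
            // Odd embedding levels mark right-to-left runs.
            System.out.println("run " + i + ": [" + bidi.getRunStart(i)
                + "," + bidi.getRunLimit(i) + ") level " + bidi.getRunLevel(i));
        }
    }
}
```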

Moreover, the general flow of objects will also be from right to left. This includes the flow of components in a FlowLayout, tab stops in text, and the side on which the box appears in a Checkbox. Text is also generally right-flush instead of left-flush. The localizer and developer need control over this flow direction on a component-by-component basis.

In addition, legacy data in other character sets may be stored in either visual or logical order, while Unicode uses logical order. So special character converters must be written that can convert back and forth.
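For a line containing only right-to-left characters, the visual-to-logical conversion such a converter performs reduces to reversing the line; any mixed-direction text needs full BIDI analysis instead. A minimal sketch of the pure-RTL case:

```java
public class VisualToLogical {
    // Convert a visual-order line to logical order, assuming every
    // character in it is right-to-left (no digits, no Latin text).
    // Mixed-direction lines require a full BIDI algorithm instead.
    static String pureRtlVisualToLogical(String visual) {
        return new StringBuffer(visual).reverse().toString();
    }

    public static void main(String[] args) {
        // Hebrew "shalom" stored in visual order (last letter first)
        String visual = "\u05DD\u05D5\u05DC\u05E9";
        System.out.println(pureRtlVisualToLogical(visual)); // logical order
    }
}
```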

Southeast Asia

Indic languages require special handling, since they use special ligatures (called conjuncts) and also rearrange certain vowels. Thai does not have these issues, but does require precise placement of multiple accents, which stack upon one another. This requires very special handling in text layout to support drawing, hit-testing, highlighting, and line-break, and is not optional.

Figure 3

Moreover, Thai requires special word-break handling, since spaces are not used to separate words--think of this as hyphenating within English words. It also needs special collation to sort some vowel-consonant combinations as if they were reversed. These languages may also employ simple input methods to alert the user to illegitimate combinations of letters.
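JDK 1.1 does already expose the right API surface for word breaking in java.text.BreakIterator; whether a given implementation actually supplies the dictionary-based data that Thai needs is another matter. The usage pattern looks like this (English text is used here only so the output is readable; for Thai you would pass a Thai locale):

```java
import java.text.BreakIterator;
import java.util.Locale;

public class WordBreaks {
    public static void main(String[] args) {
        // For Thai, use a Thai locale such as new Locale("th", "TH").
        BreakIterator words = BreakIterator.getWordInstance(Locale.US);
        String text = "Spaces separate English words.";
        words.setText(text);
        // Walk successive boundary pairs; each pair delimits one token.
        int start = words.first();
        for (int end = words.next(); end != BreakIterator.DONE;
                start = end, end = words.next()) {
            String token = text.substring(start, end);
            if (token.trim().length() > 0) {
                System.out.println(token);
            }
        }
    }
}
```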






Java™ is a trademark of Sun Microsystems, Inc.

Microsoft, Windows and Windows NT are registered trademarks of Microsoft Corporation.

Other companies, products, and service names may be trademarks or service marks of others.
