The JDK 1.1 release provides a great deal of support for
European languages and the countries that use them (we
will refer to these locales as Western), but due
to time constraints there is only minimal support for the
Far East, and there is no adequate support for the Middle
East and Southeast Asia (in which we also include the
Indian subcontinent). In addition, the font support is
very weak, even for English!
On the plus side, JDK 1.1 fonts have the capability to
draw any Unicode characters, assuming that the host
platform/browser supports drawing those characters. This
requires that the appropriate fonts be installed on the
host system. JDK 1.1 does have a mechanism for letting
you combine many different native fonts together into a
single logical font in order to cover a larger range of Unicode
characters. This is currently done by editing one or more font.properties text files in a special format. JavaSoft has very good documentation about this process and its current limitations on the JavaSoft Web site at http://www.javasoft.com:80/products/jdk/1.1/docs/guide/intl/index.html.
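A sketch of what such font.properties entries can look like on Windows; the specific font and charset names here are illustrative assumptions, so consult the JavaSoft documentation above for the exact syntax on each platform:

```
# Map the logical "serif" font to two native fonts: a Western font
# for Latin text, then a Japanese font for characters the first
# font cannot cover.
serif.0=Times New Roman,ANSI_CHARSET
serif.1=MS Mincho,SHIFTJIS_CHARSET
```

Entries are tried in index order, so the Western font handles Latin text and the fallback handles the rest of the Unicode range it covers.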
On the minus side, Java support for fonts is still
very weak. For one thing, there is no way to access the
full set of fonts on a system; you are limited to a small
set of logical fonts: Serif, SansSerif, Monospaced, etc.
(By the way, to get the list of fonts, it is futile to
search for that method in Font; you must use Toolkit.getFontList).
For most applets this is not so bad; the few available
logical fonts supported on each implementation are
usually sufficient. However, for Java applications this
is a real problem; you can't build a Java application
that can list and use the available fonts on a system,
something that the simplest of native applications can
do.
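As a minimal sketch of the point above, the only font enumeration available is Toolkit.getFontList, and it returns just the logical font names:

```java
import java.awt.Toolkit;

public class ListFonts {
    public static void main(String[] args) {
        // Returns only the logical font names (Serif, SansSerif,
        // Monospaced, ...), not the fonts installed on the host system.
        String[] names = Toolkit.getDefaultToolkit().getFontList();
        for (int i = 0; i < names.length; i++) {
            System.out.println(names[i]);
        }
    }
}
```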
Note: The logical font names will map to different fonts on different platforms; never make assumptions about the metrics or coverage of these fonts.
The following are general deficiencies in the current
international support. For background information on
all the following topics, see the Unicode Standard.
- Default Locales in Applets. There is
unfortunately no per-thread data in JDK 1.1.
That deficiency, in addition to security concerns, prevents applets
from being able to call Locale.setDefault.
If you want to do this, the only current
work-around is to store your own default in a
well-known place, and pass it around explicitly.
This will not work for inaccessible code that
only uses the standard default locale, such as
exception formatting.
- Full Locale Coverage. Though a large number
of locales are in JDK 1.1, more need to be added.
This is especially true for South
America, where the locale data tends to differ
only in the name of the country and in the
currency and number formats. These countries are
otherwise in pretty good shape.
- Editing. The current TextArea and
TextField use host peers to do the
editing. That means that if the host does not
support Unicode natively, there is a conversion
to some character set that the host does handle,
typically a character set that only handles the
host's default locale. In such circumstances, the
rich set of symbols and punctuation in Unicode
(let alone letters in other languages) is simply
discarded. In addition, some current
implementations do signed conversion back to
Unicode, so when you put 0x00E5 (å)
into the TextArea, you get back 0xFFE5
(fullwidth ¥)!
- Character Code Conversions. As we
mentioned above, you really need to be able to
iterate through all of the installed character
code converters and to have a richer API--for, among
other things, better performance.
- Keyboards. Most non-Western locales use
mixtures of different scripts; for example, you
will find English product names mixed in with
Japanese or with Arabic. To handle this, the user
needs to be able to change between different
keyboard mappings. Usually, the operating system
will provide support for this, but for word
processing, you need to be able to find out
what the current keyboard is, iterate through the
installed keyboards, and reset the current
keyboard. You can thereby provide a
convenience for the user, in which, if he clicks
down into Japanese text, you automatically
switch the keyboard to Japanese.
- Calendars. Most non-Western locales have
alternative calendars and need to allow a choice
between the standard (Gregorian) calendar and at
least one alternative. Japanese, for example,
needs an additional calendar, which is based on
the year of the various Emperors' reigns. If you
need to do this in JDK 1.1, you will need to
subclass Calendar, which is fairly
straightforward. Although you can do this, it may
be of limited use until some of the other
features are supported in Java.
- Styled Text. Using the current Java API,
you can perform your own text layout to support
drawing, hit-testing, highlighting, and
line-break. For example, you would make a series
of draw calls with font changes between each
one. Even with Western scripts, this does not
support higher-level features such as
justification efficiently. With non-Western
scripts, such as Hebrew, that have a mixture of
right-to-left and left-to-right characters, this
method breaks down very quickly.
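The default-locale work-around described in the first item above can be sketched as follows; the AppLocale holder class is hypothetical, not part of the JDK:

```java
import java.text.DateFormat;
import java.util.Date;
import java.util.Locale;
import java.util.TimeZone;

// Hypothetical holder class: since an applet cannot call
// Locale.setDefault, keep your own "default" in a well-known place
// and pass it explicitly to every locale-sensitive API.
public class AppLocale {
    private static Locale current = Locale.US;

    public static Locale get() { return current; }
    public static void set(Locale locale) { current = locale; }

    public static void main(String[] args) {
        set(Locale.FRANCE);
        // Pass the stored locale explicitly rather than relying on
        // the system default.
        DateFormat df = DateFormat.getDateInstance(DateFormat.LONG, get());
        df.setTimeZone(TimeZone.getTimeZone("GMT"));
        System.out.println(df.format(new Date(0L)));
    }
}
```

As the text notes, this does nothing for code you cannot reach, such as exception formatting, which still uses the real system default.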
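A full Calendar subclass for the Japanese Imperial calendar mentioned above is beyond a short sketch, but the core of it is simple era arithmetic, shown here outside any Calendar subclass for clarity:

```java
import java.util.GregorianCalendar;

// Minimal sketch of the era arithmetic a Japanese Calendar subclass
// would perform; Heisei 1 corresponds to Gregorian 1989.
public class EraYear {
    static int toHeiseiYear(int gregorianYear) {
        return gregorianYear - 1988;  // 1989 -> Heisei 1
    }

    public static void main(String[] args) {
        GregorianCalendar cal = new GregorianCalendar(1997, 0, 1);
        int year = cal.get(GregorianCalendar.YEAR);
        System.out.println("Heisei " + toHeiseiYear(year));
    }
}
```

A real subclass would also handle era boundaries that fall mid-year and expose the era through the Calendar field API.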
Far East
Chinese, Japanese, and Korean (CJK) have very large
alphabets, which require fonts that can handle large
character sets, and require special support for inputting
characters. For high-end systems, vertical text and ruby
(textual annotations) are also required.
Input Methods
The main issue for correct localization for CJK is
input. Due to the complexity of the character sets, there
is a conversion facility that transforms input from a
small set of phonetic or component characters that the
user types into the actual CJK characters stored in the
document. This facility is often called an input
method engine (IME) or sometimes a front-end
processor (FEP). An IME is generally quite complex.
It often does sophisticated grammatical analysis of the
text, and it commonly
- uses the input context to disambiguate
characters
- marks special states of text with distinctive
highlighting
- allows the user to choose and control alternative
transformations
- allows the user to add new expressions to user
dictionaries.
There are three main types of input support that each
offer different levels of capability and require
different degrees of application changes:
Name: Off-the-spot (a.k.a. bottom-line)
User Value: minimal
Application Changes: none
When the user types a character, a window
appears (usually at the bottom of the screen).
Within that window, the user interacts with the
IME. When the user is finished, a series of keyboard
events are fed to the unsuspecting application
one at a time. (This would speed up if there were
a Java keyboard event that contained an entire
string.)
Name: Over-the-spot
User Value: partial
Application Changes: minimal
When the user types a character, a window
appears right over the place where the user was
typing. The text is often in the same font and
size, so it feels more as if the user were typing
directly into the document. Otherwise, this is
the same as off-the-spot.
Name: On-the-spot (a.k.a. inline)
User Value: full
Application Changes: major (for word processors)
When the user types a character, it goes
directly into the document. The special
highlighting happens within the text, and changes
are immediately reflected, including word-wrap.
These require fairly complex interactions for
word processors; programs that use the built-in
Java editing (TextField, TextArea) are not
affected.
Currently, you are completely dependent on the quality
of the Java implementation on the host platform or browser.
- You will get on-the-spot in TextField and
TextArea--but only if the implementation
supports it.
- Otherwise you will get off-the-spot support--but
only if the implementation supports it.
It is fairly easy for a host platform or browser to
support both these features, at least on the major
platforms that have CJK support, but, unfortunately,
there are no guarantees that this is done.
Moreover, there is no way for a Java program to
support on-the-spot outside of TextArea/TextField, such
as for word processors doing real rich-text editing with
mixed styles and fonts. You can do over-the-spot support
yourself by opening up your own small window that
contains a TextArea, putting the window in the right
position and setting the font yourself. However, you
can't get a list of the available IMEs or choose which
gets invoked.
Fonts
Large character fonts are handled in JDK 1.1. As
discussed above, there is a limited selection.
Neither Ruby nor Vertical text is in JDK 1.1. Both
require special handling in text layout, but are fairly
high-end features and so are not a problem for most
programs.
- Ruby. Because people often don't know the
pronunciation of a particular ideograph (Kanji),
small phonetic symbols are often placed over one
or more ideographs. Figure 1 shows an example of
how this works, using English characters to show
the pronunciation of Greek letters.
Figure 1
Ruby
- Vertical Text. CJK characters can also be
written vertically, with lines that go from right
to left (usually). There are two
complications: Some characters will rotate or
change shape in a vertical context, and
intermixed Latin text may rotate 90 degrees clockwise
(and Arabic characters 90 degrees
counter-clockwise!).
Middle East
Arabic and Hebrew are written from right-to-left but
also allow mixing in left-to-right text, such as numbers
or English text. An example is shown in Figure 2. This
feature is called BIDI (short for bidirectional).
Moreover, Arabic characters may change shape radically,
depending on their context. Both these features require
very special handling in text layout to support drawing,
hit-testing, highlighting, and line-break and are not
optional for these locales.
Figure 2
Bidirectional Reordering
Moreover, the general flow of objects will also
be from right to left. This includes the flow of
components in a FlowLayout, tab stops in text,
and the side on which the box appears in a Checkbox.
Text is also generally right-flush instead of left-flush.
The localizer and developer need to have control of this
flow direction on a component-by-component basis.
In addition, legacy data in other character sets may
be stored in either visual or logical order, while
Unicode uses logical order. So special character
converters must be written that can convert back and
forth.
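The visual-to-logical conversion above can be illustrated with a toy sketch; this is not the real Unicode bidirectional algorithm, and uppercase letters here merely stand in for right-to-left characters:

```java
// Toy illustration: legacy data in "visual" order stores each
// right-to-left run already reversed, so a converter to Unicode's
// "logical" order must re-reverse every RTL run it finds.
public class VisualToLogical {
    // Stand-in test: uppercase letters play the role of RTL characters.
    static boolean isRTL(char c) { return Character.isUpperCase(c); }

    static String toLogical(String visual) {
        StringBuffer out = new StringBuffer();
        int i = 0;
        while (i < visual.length()) {
            int start = i;
            boolean rtl = isRTL(visual.charAt(i));
            // Collect a maximal run of same-direction characters.
            while (i < visual.length() && isRTL(visual.charAt(i)) == rtl) {
                i++;
            }
            String run = visual.substring(start, i);
            if (rtl) {
                run = new StringBuffer(run).reverse().toString();
            }
            out.append(run);
        }
        return out.toString();
    }

    public static void main(String[] args) {
        System.out.println(toLogical("CBA123"));  // prints "ABC123"
    }
}
```

A real converter must also handle neutral characters, nesting levels, and shaping, which is why this belongs in the character converter layer rather than in application code.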
Southeast Asia
Indic languages need special handling, since they
require special ligatures (called conjuncts) and
also rearrange certain vowels. Thai does not have these
issues, but does require precise placement of multiple
accents, which stack upon one another. This requires very
special handling in text layout to support drawing,
hit-testing, highlighting, and line-break, and is not
optional.
Figure 3
Moreover, Thai requires special word-break handling,
since spaces are not used to separate words--think of
this as hyphenating within English words. It also needs
special collation to sort some vowel-consonant
combinations as if they were reversed. These languages
may also employ simple input methods to alert the user to
illegitimate combinations of letters.
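JDK 1.1's java.text.BreakIterator already exposes locale-sensitive boundary analysis of the kind Thai needs; a minimal sketch with English text (the Thai case goes through the same API, given a locale with the necessary break data):

```java
import java.text.BreakIterator;
import java.util.Locale;

// Walk the locale-sensitive word boundaries in a string. For Thai,
// the same API must find word breaks with no spaces to guide it.
public class WordBreaks {
    public static void main(String[] args) {
        String text = "Spaces separate English words";
        BreakIterator wb = BreakIterator.getWordInstance(Locale.US);
        wb.setText(text);
        int start = wb.first();
        for (int end = wb.next(); end != BreakIterator.DONE;
                start = end, end = wb.next()) {
            String piece = text.substring(start, end);
            // Skip the whitespace-only segments between words.
            if (!piece.trim().equals("")) {
                System.out.println(piece);
            }
        }
    }
}
```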