r2 - 06 Jul 2005 - 20:15:42 - KatieCappsParlante

Python Internationalization

The Problems

  • We want to make sure all strings containing UI text are Unicode strings.
  • Even with Unicode strings, basic Python operators (comparison, slicing, case mapping) don't work correctly.
  • We'd like to use existing Python support for I18N, but it's often based on C libraries with platform-dependent behavior.

All Strings In Unicode

Unfortunately it's easy to create a Python string that is a plain byte string in the default encoding instead of a Unicode string. All you have to do is say "some text" instead of u"some text".
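The distinction is easy to see. (In modern Python 3 the same split survives as bytes vs str; the byte values below are just an illustration.)

```python
# A byte string holds encoded bytes, not text; decoding it yields
# a proper Unicode string.
raw = b"caf\xc3\xa9"           # UTF-8 encoded bytes
text = raw.decode("utf-8")     # Unicode string

assert isinstance(text, str)
assert len(text) == 4          # four characters: c, a, f, é
assert len(raw) == 5           # five bytes: 'é' is two bytes in UTF-8
```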

Hopefully we won't actually have any raw text in Python code. And we could use scripts to scan for raw text, to help enforce this.
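Such a scanning script could be built on the standard tokenize module. This is only a hypothetical sketch (find_raw_strings is a made-up name, and it will also flag docstrings and raw-string literals), but it shows the idea:

```python
import io
import tokenize

def find_raw_strings(source):
    """Report (line, literal) for string literals lacking a u'' or b'' prefix."""
    hits = []
    for tok in tokenize.generate_tokens(io.StringIO(source).readline):
        # STRING tokens include any prefix; ones starting with a quote
        # character have no prefix at all.
        if tok.type == tokenize.STRING and not tok.string.lower().startswith(("u", "b")):
            hits.append((tok.start[0], tok.string))
    return hits

code = 'a = u"ok"\nb = "raw text"\n'
print(find_raw_strings(code))   # flags only line 2
```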

I don't know whether you could set the default encoding (via sys.setdefaultencoding()) to be utf-16-be or utf-16-le, so that even strings created without explicitly specifying Unicode would be created as Unicode strings.

Python string operators

Even with a Unicode string, typical operations on strings won't work properly. I call this the curse of strcmp: things seem to work OK as long as you're using ASCII (read "English"). But then you start running into problems with European languages that make common use of accented characters, and it breaks down completely with Asian languages like Japanese.

For example, using if (uniStrA < uniStrB): doesn't work as expected, since it does binary comparison rather than true collation. Slicing a Unicode string can also split a code point in half, if it's composed of two 16-bit code units (a surrogate pair). And a grapheme cluster (e.g. "u" followed by a combining diaeresis) can also get split.
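Both failure modes are easy to demonstrate. (Modern Python builds no longer expose surrogate pairs in ordinary strings, so the example below shows the combining-character case, which still splits today.)

```python
# Binary (code-point) comparison is not collation: most collations sort
# 'é' before 'f', but its code point (U+00E9) compares greater.
assert not (u"\u00e9" < u"f")

# A grapheme cluster split by naive slicing: 'u' plus a combining
# diaeresis renders as one character but is two code points.
s = u"u\u0308"            # 'ü' built from two code points
assert len(s) == 2
assert s[:1] == u"u"      # the accent has been sliced off
```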

A more obscure example involves case mapping. Python tries to leverage Unicode data to handle upper- and lower-casing of text. But this is actually locale-dependent; for example, upper-casing a lower-case 'i' in Turkey should give you U+0130 (LATIN CAPITAL LETTER I WITH DOT ABOVE), not "I".
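You can see the locale-independent behavior directly; getting the Turkish result would require a locale-aware library such as ICU:

```python
# str.upper() applies the locale-independent Unicode case mapping,
# so 'i' always maps to 'I' ...
assert u"i".upper() == u"I"

# ... but Turkish expects LATIN CAPITAL LETTER I WITH DOT ABOVE.
turkish_upper_i = u"\u0130"
assert u"i".upper() != turkish_upper_i
```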

We could create an ICU string class that is used everywhere instead of the built-in Python operators. But then you have to make sure no code accidentally falls back to the built-in support.
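A minimal sketch of such a class, assuming we delegate comparison to the C library's strcoll for illustration (a real version would delegate to an ICU collator; CollatedStr is a made-up name):

```python
import functools
import locale

@functools.total_ordering
class CollatedStr(str):
    """A str subclass that compares via locale collation, not byte order."""

    def __lt__(self, other):
        return locale.strcoll(self, other) < 0

    def __eq__(self, other):
        return locale.strcoll(self, other) == 0

    # Defining __eq__ disables inherited hashing, so restore it.
    def __hash__(self):
        return str.__hash__(self)

a, b = CollatedStr(u"apple"), CollatedStr(u"banana")
assert a < b
```

The weakness the paragraph above points out remains: any code path that builds a plain str (slicing, concatenation, literals) silently loses the collating behavior.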

We could modify Python to use ICU for Unicode string operations, perhaps under the control of a switch (or triggered by the selected locale). But I've heard that getting Python to accept direct use of ICU isn't likely, and it also might cause unexpected results (e.g. a slice comes back empty because the requested index falls in the middle of a grapheme cluster).

Python support for I18N

Python tries to leverage the standard "C" libraries as much as possible. This leads to problems in areas of I18N support, for things like locales and date/time handling. The two common problems are that the library support is lacking, or that there are platform dependencies that make it hard to work in a nice cross-platform manner.
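The platform dependency shows up immediately with locale names: the spellings below are illustrative assumptions (glibc and Windows each have their own naming scheme), so portable code ends up probing, as in this sketch:

```python
import locale

# Locale names are not portable: glibc spells it "en_US.UTF-8",
# Windows "English_United States.1252".  Probe until one works.
def set_first_available(category, names):
    for name in names:
        try:
            return locale.setlocale(category, name)
        except locale.Error:
            continue
    return locale.setlocale(category, "C")   # last-resort fallback

chosen = set_first_available(
    locale.LC_TIME, ("en_US.UTF-8", "English_United States.1252"))
assert chosen   # some locale was selected
```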

One solution is to use external (non-core) Python modules such as mxDate.

Another approach is to wrap ICU so that it can be used from Python to provide the required I18N support.

Open Source Applications Foundation
Except where otherwise noted, this site and its content are licensed by OSAF under a Creative Commons License, Attribution Only 3.0.
See list of page contributors for attributions.