Chandler Internationalization Proposal, v0.4.1

Authors: Brian Kirsch

Last edited: August 8, 2005 9:24 AM
Creation date: June 16, 2005
Reviewers:

Overview

Goals and Objectives

 

Design an architecture for .6 and beyond that will ensure Chandler's success in non-US markets. This will require significant and fundamental changes to all aspects of the Chandler development process.

 

Brian Kirsch
Spec owner and contributor

Definition of Terms and Unicode encodings

 

unicode: The Python unicode object, which can be either UCS-2 or UCS-4 depending on the Python binary build.

Unicode: Defined in the Unicode Standard. Code points are written abstractly, for example U+0041. A code point is stored in one of the following encodings: UTF-8, UTF-16, or UTF-32.

Repository Unicode Storage: The UString type is used for storage of Unicode data in the Repository. It maps to a Python unicode object and is stored as UTF-8 encoded bytes.

PyLucene Unicode: Wraps the Java String object which is UTF-16.

UnicodeString: ICU UnicodeString object which is UTF-16.

$CHANDLER_HOME: The file system path of the Chandler root directory. The path is determined by the user on install and can contain non-ASCII characters.

$PARCEL_NAME: The name of the parcel, for example mail.

$PARCEL_PATH: The relative file system path to the parcel under the $CHANDLER_HOME directory, for example chandler/parcels/osaf/mail.

 

Background

 

Developing globalized software is a continuous balancing act as software developers and project managers inadvertently underestimate the level of effort and detail required to create foreign-language software releases.


In general, the standard process for creating globalized software includes "internationalization," which covers generic coding and design issues, and "localization," which involves translating and customizing a product for a specific market.


Software developers must understand the intricacies of internationalization since they write the actual underlying code. How well they use established services to achieve mission objectives determines the overall success of the project. At a fundamental level, code and feature design affect how a product is translated and customized. Therefore, software developers need to understand localization concepts. Localization is the process of customizing a product for a particular locale.

To internationalize / localize an application requires:

Additional code modules that implement locale-specific functionality, such as an input method editor for Japanese or a module that calculates Hebrew calendar dates may also be required.

 

Internationalization is everywhere

Every time a program needs to do something with data (HTML, RSS, etc), internationalization plays a role. Display can involve charset conversion, font selection, font size, processing through a special rendering engine, positioning on the screen, string formatting, word wrap, hyphenation, numeric and date formatting, and other processes. Read can require charset conversion, communication with input methods, string formatting, numeric and date formatting, protocol formatting, and more. Locale awareness is required for sort, search, word wrap, hyphenate, and sometimes parse. Search, compress, string format, character index and count, word wrap, and hyphenate are especially sensitive to the charset of the data, since they are looking at individual characters. Protocol format must often include parameters describing the charset, language, and/or locale of the included data.

 

Internationalization Roles

• Designers & architects - help evaluate i18n areas in product designs

• Engineers - understand i18n considerations in implementation

• Engineering managers - roadmap the product & resource planning
• Technical product reviewers – help evaluate product i18n status

• Software QA - understand which areas to test for i18n functionality
 

What is Unicode?


The Unicode Standard assigns a unique scalar number to every character in its character set. The resulting numbered set is referred to as a coded character set.

Units of a coded character set are known as code points. In Unicode there are a number of ways of encoding the same character. These include UTF-8, UTF-16, and UTF-32.


UTF-8 uses 1 byte to represent characters in the old ASCII set, 2 bytes for characters in several more alphabetic blocks, and 3 bytes for the rest of the BMP. Supplementary characters use 4 bytes.


UTF-16 uses 2 bytes for any character in the BMP, and 4 bytes for supplementary characters.


UTF-32 uses 4 bytes for every character.
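
The byte counts described above can be checked directly. The following is a minimal sketch, assuming Python 3 (the interpreter transcripts elsewhere in this proposal use Python 2), where a str is always a sequence of code points. The helper name encoded_lengths and the sample characters are illustrative choices, not part of the proposal.

```python
def encoded_lengths(ch):
    """Return the number of bytes UTF-8, UTF-16, and UTF-32 use for one character."""
    return (
        len(ch.encode("utf-8")),
        len(ch.encode("utf-16-be")),  # explicit byte order, so no BOM is prepended
        len(ch.encode("utf-32-be")),
    )

samples = [
    "\u0041",      # LATIN CAPITAL LETTER A (ASCII)
    "\u00e9",      # LATIN SMALL LETTER E WITH ACUTE
    "\u65e5",      # a CJK ideograph inside the BMP
    "\U00020000",  # a supplementary character outside the BMP
]

for ch in samples:
    print("U+%04X" % ord(ch), encoded_lengths(ch))
```

Only the supplementary character needs 4 bytes in UTF-16, which is the case that matters for the surrogate pair discussion later in this document.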

 

In Unicode, a letter maps to something called a code point which is still just a theoretical concept. How that code point is represented in memory or on disk is a different story. In Unicode, the letter A is a platonic ideal. It's just floating in heaven:


This platonic A is different than B, and different from a, but the same as A and A and A. The idea that A in a Times New Roman font is the same character as the A in a Helvetica font, but different from "a" in lower case, does not seem very controversial, but in some languages just figuring out what a letter is can cause controversy.


Every platonic letter in every alphabet is assigned a magic number by the Unicode Consortium, which is written like this: U+0645. This magic number is called a code point. The U+ means "Unicode" and the numbers are hexadecimal. U+0639 is the Arabic letter Ain. The English letter A would be U+0041.
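
The U+ notation is simply the code point number written in hexadecimal. A small sketch, assuming Python 3 and the standard unicodedata module; the helper name code_point_label is hypothetical.

```python
import unicodedata

def code_point_label(ch):
    """Format a single character in the U+XXXX notation of the Unicode Standard."""
    return "U+%04X" % ord(ch)

# Print the code point and the official character name for two letters.
for ch in ("A", "\u0645"):
    print(code_point_label(ch), unicodedata.name(ch))
```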

 

 

 

How to deal with Unicode

Developers from many big Python projects have come up with simple rules of thumb to prevent runtime UnicodeDecodeErrors, and the rules may be summarized into one sentence: always do the conversion at IO-barriers.


To express this same concept a bit more extensively: whenever your program receives text data from the outside (from the network, from a file, from user input, ...), construct unicode objects immediately. Find out the appropriate encoding from, e.g., an HTTP header, or look for some appropriate convention to determine the encoding to use.


Whenever your program sends text data to the outside (to the network, to some file, to the user, ...), determine the correct encoding, and convert your text to a byte string with that encoding. Otherwise, Python would attempt to convert unicode to an ASCII byte string, likely producing UnicodeEncodeErrors, which are just the converse of the UnicodeDecodeErrors previously mentioned. With these two rules, you will find that most unicode problems just go away. For Chandler, our default outbound encoding will be UTF-8.
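
The two rules can be sketched as a pair of helpers that sit at the I/O barrier. This is a minimal illustration, assuming Python 3 (where bytes and str play the roles of the 8-bit string and unicode types); read_text and write_text are hypothetical names, and io.BytesIO stands in for a real file or socket.

```python
import io

def read_text(stream, encoding):
    """Decode bytes to text as soon as they enter the program."""
    return stream.read().decode(encoding)

def write_text(stream, text, encoding="utf-8"):
    """Encode text to bytes only at the moment it leaves the program."""
    stream.write(text.encode(encoding))

# Simulated input: ISO-8859-1 bytes, as if read from a file or socket.
incoming = io.BytesIO("caf\u00e9".encode("iso-8859-1"))
text = read_text(incoming, "iso-8859-1")   # text is now a unicode string

outgoing = io.BytesIO()
write_text(outgoing, text)                 # UTF-8 by default, as proposed for Chandler
print(outgoing.getvalue())
```

Everything between the two calls works on text only, so no implicit ASCII conversion can occur in the middle of the program.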

 

The ICU

ICU is a mature, widely used set of C/C++ and Java libraries for unicode support, software internationalization and globalization (i18n/g11n). It grew out of the JDK 1.1 internationalization APIs, which the ICU team contributed, and the project continues to be developed for the most advanced unicode/i18n support. ICU is widely portable and gives applications the same results on all platforms and between C/C++ and Java software.

The International Components for Unicode (ICU) libraries provide robust and full-featured unicode services on a wide variety of platforms, without sacrificing performance. They support the most current version of the unicode standard, and provide support for supplementary unicode characters (needed for support of the repertoires of GB 18030, HKSCS, and JIS X 0213). The ICU provides:

 

Why we need the ICU

Python tries to leverage the standard C libraries as much as possible. This leads to problems in areas of i18n support, for things like locales, date/time, etc. The two common problems are that the library support is lacking or that there are platform dependencies that make it hard to work in a nice cross-platform manner. Python supports a number of specific character encoding "codecs" in the standard distribution, and has built-in support for most western codecs. Japanese, Korean, and Chinese codecs are available as third party distributions, but many non-western codecs are simply not available.

The ICU, however, has a large file system footprint, close to 10 MB. The 500+ locale data files and codepages contribute the majority of the footprint. ICU can be compiled with fewer supported locales, which would reduce the package size considerably.

 

PyICU

PyICU (http://svn.osafoundation.org/pyicu/trunk) is a SWIG wrapper of the C++ ICU API developed and maintained by OSAF under the guidance of Andi Vajda. It is shipped as part of the Chandler distribution. PyICU is currently under active development; the C++ ICU string, locale, format, calendar, timezone and various iterator classes have already been wrapped. To use PyICU from Chandler, go to $CHANDLER_HOME/chandler and type ./release/RunPython:

>>> from PyICU import *
>>> df = DateFormat.createInstance()
>>> df
<SimpleDateFormat: M/d/yy h:mm a>
>>> f = Formattable(940284258.0, Formattable.kIsDate)
>>> df.format(f)
<UnicodeString: 10/18/99 3:04 PM>

UnicodeString

The ICU UnicodeString points at a mutable array of UChar unicode 16-bit wide characters. The Python unicode type is an immutable string of 16-bit or 32-bit wide unicode characters. Because of these differences, UnicodeString and Python's unicode type are not merged into the same type when crossing the SWIG boundary.


ICU APIs taking UnicodeString arguments have been overloaded to also accept Python str or unicode type arguments. In the case of str objects, utf-8 encoding is assumed when converting them to UnicodeString objects.

To convert a Python str encoded in an encoding other than utf-8 to an ICU UnicodeString, use the UnicodeString(str, encodingName) constructor.

ICU's C++ APIs accept and return UnicodeString arguments in several ways: by value, by pointer or by reference. When an ICU C++ API is documented to accept a UnicodeString & parameter, it is safe to assume that there are several corresponding PyICU Python APIs making it accessible in simpler ways. The UnicodeString type implements the Python __unicode__ method and thus can be coerced to a Python unicode string as follows:

>>> unicodeString = UnicodeString("Test")

>>> pythonUnicode = unicode(unicodeString)

>>> pythonUnicodeTwo = unicodeString.__unicode__()

 

Again, the ICU UnicodeString is mutable and can change in place. This can cause confusion for Python developers used to dealing with immutable objects. For example, a call to a UnicodeString's toLower() method does not return a copy of the UnicodeString but actually lowercases each character in place in the original object.

Chandler now has two types to represent unicode: the Python unicode type and the ICU UnicodeString. The Python unicode type will be the default for the Chandler application space. Data stored in the Repository will map to a Python unicode type and developers will work with Python unicode between layers. The UnicodeString type will be leveraged internally within PyICU and at key locations in Chandler where text requires manipulation in a locale aware manner. PyICU operations that return unicode return an ICU UnicodeString. The developer then converts this UnicodeString to Python unicode in Chandler. Based on feedback, a wrapper API may be added to ease this conversion process.

The Python unicode object does have its limitations, however. For example, it is not aware of surrogate pairs. A surrogate pair is two 16-bit code units that together represent a single supplementary character (a code point above U+FFFF) in UTF-16; such characters include many less common Chinese ideographs. On a UCS-2 Python build, an operation on Python unicode that manipulates text may end up splitting a surrogate pair, resulting in garbage text. The ICU UnicodeString handles surrogate pairs correctly, and it provides toUpper() and a host of other methods that return different results based on locale.
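
To make the surrogate pair problem concrete, the sketch below looks at the UTF-16 code units of one supplementary character and shows what happens when the pair is cut in half. It assumes Python 3, where str itself is indexed by code point, so the 16-bit view is obtained by encoding to UTF-16 explicitly. The helper name utf16_code_units is hypothetical.

```python
import struct

def utf16_code_units(text):
    """Return the list of 16-bit code units that UTF-16 uses for text."""
    data = text.encode("utf-16-be")
    return list(struct.unpack(">%dH" % (len(data) // 2), data))

ch = "\U00020000"               # a supplementary CJK ideograph
units = utf16_code_units(ch)
print([hex(u) for u in units])  # two code units: a high and a low surrogate

# Cutting between the two units leaves a lone surrogate, which is not valid text.
first_half = struct.pack(">H", units[0])
try:
    first_half.decode("utf-16-be")
except UnicodeDecodeError as err:
    print("broken pair:", err.reason)
```

Any slicing, truncation, or reversal that counts 16-bit units rather than characters can produce this kind of lone surrogate.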

The PyICU UnicodeString should also be used at textual I/O boundaries where content may be in any number of different character set encodings. This textual data needs to be converted to Python unicode for use in the Chandler application space. Python, however, ships with only a limited number of codecs. Incoming content to Chandler via the filesystem, mail, http, etc. can arrive in hundreds of different encodings. The ICU, with its 500+ character set codepages, is robust enough to handle this. The standard Python unicode object can be leveraged directly, without the need for a UnicodeString, if the specific I/O boundary knows ahead of time that its incoming and outgoing charsets are supported in Python.

 

>>> # 1. Convert text from the ISO-8859-1 encoding to a UnicodeString

>>> unicodeString = UnicodeString("test", "ISO-8859-1")

>>> # 2. Convert the UnicodeString to Python unicode for use in Chandler

>>> pythonUnicode = unicodeString.__unicode__()
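
Whether an I/O boundary can skip the UnicodeString step depends on whether Python itself has a codec for the charset. The sketch below shows one way to make that check with the standard codecs module. It assumes Python 3; python_supports and decode_incoming are hypothetical helper names, and the hand-off to ICU is only indicated by a comment because the exact PyICU call would depend on the final wrapper API.

```python
import codecs

def python_supports(charset):
    """Return True when the standard library has a codec for charset."""
    try:
        codecs.lookup(charset)
        return True
    except LookupError:
        return False

def decode_incoming(data, charset):
    """Decode bytes with a stdlib codec, or signal that another converter is needed."""
    if python_supports(charset):
        return data.decode(charset)
    # Placeholder: a real implementation might hand off to an ICU converter here.
    raise LookupError("no stdlib codec for %r" % charset)

print(decode_incoming(b"test", "ISO-8859-1"))
print(python_supports("x-not-a-real-charset"))
```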

 

Message and Choice Formatting

Messages are a concatenation of strings, numbers, and dates that present a complex formatting challenge. The PyICU message and choice format classes will be leveraged in Chandler to display informative and error messages in a localized manner.

The PyICU MessageFormat class facilitates localization by preventing the concatenation of message strings. This class enables localizers to create more natural messages and avoid phrases like "3 file(s)". While the MessageFormat class formats message strings, the ChoiceFormat class enables users to attach a format to a range of numbers and handles plurals and name series in user messages. The two classes enable localizers to change the content, format, and order of any text, as appropriate, for any language.

The MessageFormat class assembles messages from various fragments (such as text fragments, numbers, and dates) supplied by the program using ICU. Because of the MessageFormat class, the program does not need to know the order of the fragments. The class uses the formatting specifications for the fragments to assemble them into a message that is contained in a single string within a resource bundle. For example, MessageFormat enables you to print the phrase "Finished printing x out of y files..." in a manner that still allows for flexibility in translation. The message and choice format syntax will be covered in detail in a separate document.

MessageFormat Example

 

>>> from PyICU import *

>>> args = [Formattable(7), Formattable(Calendar.getNow(), Formattable.kIsDate), Formattable( "a disturbance in the Force")]

>>> # Returns an ICU UnicodeString

>>> unicodeString = MessageFormat.formatMessage("At {1,time} on {1,date}, there was {2} on planet {0,number,integer}.", args)

>>> # Convert the UnicodeString to Python unicode

>>> pythonUnicode = unicodeString.__unicode__()

>>> # Convert the Python unicode to bytes for printing

>>> print pythonUnicode.encode("utf8")

At 4:34:20 PM on 23-Mar-98, there was a disturbance in the Force on planet 7.
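
The reason numbered placeholders matter for translation is that the pattern, not the program, decides the order of the fragments. The sketch below illustrates this with Python's built-in str.format as a stand-in; it is not the ICU MessageFormat syntax, and it assumes Python 3. The patterns dictionary and the format_message helper are hypothetical, and the "reordered" pattern is an English stand-in for a translation whose word order differs.

```python
# Patterns would normally live in a translation catalog; these two are illustrative.
patterns = {
    "default": "Finished printing {0} out of {1} files.",
    "reordered": "Of {1} files, {0} have finished printing.",
}

def format_message(pattern_key, *args):
    """Fill the numbered placeholders of the chosen pattern with args."""
    return patterns[pattern_key].format(*args)

# The calling code passes the same arguments in the same order for both patterns.
print(format_message("default", 3, 10))
print(format_message("reordered", 3, 10))
```

With string concatenation the second ordering would require a code change; with a pattern it only requires a different catalog entry.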

 

ChoiceFormat Example

 

>>> from PyICU import *

>>> limits = [1,2,3,4,5,6,7]

>>> dayNames = [u"Sun", u"Mon", u"Tue", u"Wed", u"Thu", u"Fri", u"Sat"]

>>> fmt = ChoiceFormat(limits, dayNames, 7)

>>> # Maps each day number to its day name, i.e. 1 = Sun.

>>> # This is a very basic example. ChoiceFormat is capable of much more complicated replacement tasks.

>>> for i in limits:
...     unicodeString = fmt.format(i)

 

[Brian] Need to test PyICU ChoiceFormat support
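
Until the PyICU support is verified, the selection rule behind a choice format can be sketched in plain Python: a number selects the format whose limit is the largest one not exceeding it. This is an illustration of the concept only, assuming Python 3 and the standard bisect module; the function choose and the sample limits and formats are hypothetical and are not the PyICU API.

```python
from bisect import bisect_right

def choose(limits, formats, number):
    """Pick formats[j] such that limits[j] <= number < limits[j + 1].

    Numbers below the first limit get the first format, and numbers at or
    above the last limit get the last format.
    """
    index = bisect_right(limits, number) - 1
    return formats[max(index, 0)]

limits = [0, 1, 2]
formats = ["no files", "one file", "{0} files"]

for count in (0, 1, 5):
    print(choose(limits, formats, count).format(count))
```

This is how a choice format avoids output such as "1 file(s)": each numeric range carries its own wording.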

Other areas of the PyICU API leveraged in Chandler

Repository Changes

Overview

The Chandler Internationalization strategy requires changes to Repository types to support localization and better enforcement.

Currently, the Repository has three text types: String, Symbol, and Lob. String is a broad type which can support 8-bit and unicode values. Symbol is a developer level string type which supports alphanumeric characters and underscores. Lob is a generic type which can store binary, 8-bit, or unicode data. String will be deprecated in Chandler going forward and the Lob type will only store binary data and unicode. The Symbol type will continue to be leveraged in Chandler core but will not be used for active development. Four new Repository types are now required:

• UString

• BString

• Text

• LocalizableString

 

UString

The UString type will replace the current String type. It will only support the Python unicode type and will raise an error upon committing if an 8-bit Python string or any other type is assigned to it. The UString type is for displayable textual user data. Having a type that can represent both an 8-bit Python string and a Python unicode type leads to developer confusion and internationalization inconsistencies. Going forward, Chandler will work exclusively with unicode for user data and localizable content. 8-bit Python strings will be reserved for developer level tasks not suitable for display. This type is not localizable but is displayable.

 

BString

An 8-bit Python binary string type used for font, size, and color definitions, as well as enums and constants. The BString should be used in place of the Symbol type. The BString is neither localizable nor user displayable.

LocalizableString

A Repository Struct leveraged for unicode content that requires translation. LocalizableString contains several values related to localization, including a defaultText UString that stores the English text. The translatable content is looked up at runtime from the current locale settings. If no translatable content is found, the English unicode text is returned. This type will be covered later in further detail. It is localizable and displayable.

Text

A Repository Alias that will accept either a UString or a LocalizableString type. It will throw an error if passed a Python unicode or str object. All displayable textual attributes will be of type Text. If an attribute of type Text is passed a LocalizableString it indicates the content needs translation. If it is passed a UString then the content does not require translation. This allows developers to determine which attributes on which instances are localizable. A Text attribute defined in parcel.xml will be stored as a LocalizableString. For example:

There are two types of collections on the Sidebar:

1. Chandler collections All, In, and Out which will require translation
2. User created collections which can be any Unicode value and will not require translation

Both these types use the displayName attribute to hold the UI displayable name which will be of type Text. The All, In, and Out collections would assign to displayName a LocalizableString since the text "All", "In", and "Out" will change based on the locale Chandler is running in. User created collections would assign to displayName a UString since a user created collection would not require localization.
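
The rule for the Text alias can be sketched as a simple type check. The classes below are stand-ins written for this illustration only, assuming Python 3; they are not the Repository types, and the names check_text and needs_translation as well as the parcel path "example.parcel" are hypothetical.

```python
class UString(str):
    """Stand-in for the proposed UString: user text that is never translated."""

class LocalizableString(object):
    """Stand-in for the proposed LocalizableString: text that needs translation."""
    def __init__(self, path, default_text):
        self.path = path
        self.default_text = default_text

def check_text(value):
    """Accept only the two types the Text alias allows; reject plain strings."""
    if isinstance(value, (UString, LocalizableString)):
        return value
    raise TypeError("Text requires a UString or LocalizableString, got %s"
                    % type(value).__name__)

def needs_translation(value):
    """A Text value is localizable exactly when it is a LocalizableString."""
    return isinstance(check_text(value), LocalizableString)

print(needs_translation(LocalizableString("example.parcel", "All")))  # built-in collection
print(needs_translation(UString("My Vacation Photos")))               # user collection
```

Because the decision is carried by the value's type, code that displays displayName does not need a separate flag to know whether a lookup is required.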

The Lob Type

In previous iterations of the Repository two types existed: the TextLob and the BinaryLob. After closer inspection it was determined that these two types could be condensed into one generic Lob type. This generic type can support 8-bit string, unicode, or binary data. The latter can also be compressed. Going forward, the Lob type will only deal with unicode and binary data; 8-bit string content will have to be stored as binary data.

It is rare that 8-bit string content should need to be stored since Chandler converts any incoming 8-bit content to unicode at I/O boundaries. However, if an HTML document for example, needed to be stored in its original character set encoding it should be saved as binary data using the Lob's Outputstream writer.

>>> binary = contentItem.getAttributeAspect(attribute, 'type').makeValue(None, mimetype="text/html")
>>> binaryStream = binary.getOutputStream()
>>> binaryStream.write(HTMLDocumentData)
>>> binaryStream.close()

[Brian] Andi is going to make changes to lob but wants to keep the encoding feature. This section will be updated after the changes are reviewed.

User data vs. Localizable data

There is a key distinction between localizable data and user data. Both types are displayable and as such are stored and represented as unicode. Both types of data also live in the Repository. Localizable data, however, is metadata related to the application environment. The menu item description "Save" or the text label "Subject:" are examples of localizable data. Localizable data is immutable. It cannot be changed or altered. When a user switches locale, the localizable data key (the default text) remains the same. The value that the key points to changes. Localizable data is stored in the Repository as an instance of the LocalizableString type.

User data, on the other hand, is mutable. It is data the user can add, change, or delete. A mail message subject is an example of user data. The "Subject:" text label is localizable data, but the value the user types into the subject text input box, i.e. "Regarding your request", is user data. The content is unicode but is not intended to be localized. It is stored in the Repository as the UString type.

 

Creating Localizable Content

 

A key goal for the Chandler Internationalization process is to continue to leverage the Repository as the core storage mechanism for data. As such, there needs to be a means to import and export localizable content to and from the Repository. This is achieved by modifying the open source, industry standard translation tool gettext (http://docs.python.org/lib/module-gettext.html). Chandler has two varieties of localizable textual content: Items that contain unicode strings requiring translation, and general error and information messages that are not associated with Items but also require translation. Both of these will be modeled and stored as LocalizableStrings. The LocalizableString type has a boolean isMessage attribute which will be used to distinguish internally between the two varieties.

Creating Localizable Strings

LocalizableString semantics are modeled on the gettext API architecture. The advantage of this is threefold. First, the locale in Chandler can be changed at runtime. Second, individual parcels can run in different locales from each other and from Chandler core. Third, LocalizableStrings will not be translated at startup, reducing performance overhead.

In accordance with the gettext architecture, English text is used as the key to discover the translated content value. The LocalizableString, under the hood, calls the Chandler I18nManager, which performs the gettext lookup. The manager interprets the caller's parcel path and loads the appropriate .mo file via gettext for the given locale set*. If no translation exists, the default unicode English text is returned.

*Gettext allows a language fallback set to be defined. For example, a set such as JP, fr_CA, fr can be specified. In this case gettext would first try to load the Japanese translation file (JP). If that failed it would try French Canadian (fr_CA) and then French (fr). If no translation is available, the default text (English) would be returned.

An implementation of LocalizableString would look like the following:

class LocalizableString(Item):
    __slots__ = ['_path', '_defaultText', '_isMessage', '_info']

    def __init__(self, path, defaultText, isMessage=False, info=None):
        self._path = path
        # 8-bit strings are decoded as UTF-8 so that _defaultText is always unicode
        if isinstance(defaultText, str):
            defaultText = unicode(defaultText, "utf8", "replace")
        self._defaultText = defaultText
        self._isMessage = isMessage
        self._info = info

    def __repr__(self):
        return "LocalizableString(%r, %r)" % (self._path, self._defaultText)

    def __str__(self):
        # Encode the translated text as UTF-8 bytes
        return self.__unicode__().encode("utf-8")

    def __unicode__(self):
        # The translation is looked up on every call, not at creation time
        return I18nManager.translate(self._path, self._defaultText)

There are currently two mechanisms for defining Repository data: parcel.xml and Python schema. A LocalizableString is a Repository Struct. The defaultText attribute will be unicode English text. If the value passed to a LocalizableString __init__ method is an 8-bit Python string it will raise an error. This case occurs mostly in Python schema definitions when a developer forgets to add the u"" prefix to the LocalizableString definition.

To create a LocalizableString instance in parcel.xml leverage the Text alias. The Parcel Loader will assign a LocalizableString to a Text attribute loaded via parcel.xml.

<Kind itsName="DynamicChild">
<Attribute itsName="helpString">
<type itemref="Text"/>
<initialValue type="Text">This is some help text</initialValue>
</Attribute>
</Kind>

To create a LocalizableString definition in Python schema code:

class DynamicChild(DynamicBlock):
     helpString = schema.One(schema.Text, initialValue = LocalizableString(u"this is some help text"))

An example of using the LocalizableString type in a Python interpreter:

>>> loc = LocalizableString(u"osaf.mail", u"this is a test")

>>> # Python calls the __unicode__ method of LocalizableString which returns a

>>> # localized Python unicode object.

>>> uniLoc = unicode(loc)

>>> # Convert the unicode text to bytes for printing

>>> print "the value is: ", uniLoc.encode('utf-8')

the value is:  this is a test
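
The lazy lookup and the locale fallback set can be tried out without a Repository or .mo files. The following is a self-contained sketch, assuming Python 3 (so __str__ plays the role of __unicode__); the in-memory I18nManager, its catalogs dictionary, the locale_set attribute, and the sample French translation are all hypothetical stand-ins for illustration.

```python
class I18nManager(object):
    """Hypothetical stand-in: in-memory catalogs instead of .mo files."""
    # catalogs[locale][(parcel_path, default_text)] -> translated text
    catalogs = {"fr": {("osaf.mail", "this is a test"): "ceci est un test"}}
    locale_set = ["fr_CA", "fr"]

    @classmethod
    def translate(cls, path, default_text):
        # Try each locale in fallback order; return the default text if none match.
        for locale in cls.locale_set:
            found = cls.catalogs.get(locale, {}).get((path, default_text))
            if found is not None:
                return found
        return default_text

class LocalizableString(object):
    def __init__(self, path, default_text):
        self._path = path
        self._default_text = default_text

    def __str__(self):
        # Translation happens on every conversion, not at construction time.
        return I18nManager.translate(self._path, self._default_text)

loc = LocalizableString("osaf.mail", "this is a test")
print(str(loc))                      # found under "fr" after "fr_CA" misses
I18nManager.locale_set = ["jp"]      # switch the locale set at runtime
print(str(loc))                      # no catalog, so the default text is returned
```

Because the object stores only the key, the same instance yields different text after the locale set changes, which is what allows the locale to be switched at runtime.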

 

Messaging

Messages are a unique case. They are reusable textual information that is localizable. They need to be easily created without the overhead of defining Chandler Items of type LocalizableString in parcel.xml. That process is too tedious and error prone, and discourages developers from internationalizing. A better approach is to use what they already know for message creation: gettext*.

* The use of the gettext API will limit Mac OS X localization customization by end users, since the file/structure won't match the standard for Mac apps. However, the system wide OS X locale set fallback order will be supported.

 

The traditional approach

Gettext is an easy tool to leverage for internationalization. Developers wrap string content in the _() syntax. A tool is run on the source to find the strings contained in _(). The output of this discovery is written to a messages.pot template file for translation. This template file contains the strings within _() as msgids and adds comments pointing to the lines of source code where the strings were found. A translator copies messages.pot to the appropriate locale directory as a .po file and adds the translated text as msgstrs. The .po file format is a set of key-value pairs: the msgid (English text) is the key and the msgstr is the value. An example of the traditional use of gettext:

1. Create a Python file test.py containing localizable content

test.py

 

import gettext

# localedir="." makes gettext look for ./jp/LC_MESSAGES/test.mo
japaneseTranslation = gettext.translation("test", localedir=".", languages=["jp"])

japaneseTranslation.install(unicode=True)

#_() returns the translation as a unicode object
print "The translation is: ", _("this is a test").encode('utf-8')

2. Run the pygettext.py utility on test.py. It will find all the strings wrapped in _() and create a messages.pot template file for translation

$ python pygettext.py test.py

messages.pot

 

# SOME DESCRIPTIVE TITLE.
# Copyright (C) YEAR ORGANIZATION
# FIRST AUTHOR <EMAIL@ADDRESS>, YEAR.
#
msgid ""
msgstr ""
"Project-Id-Version: PACKAGE VERSION\n"
"POT-Creation-Date: 2005-06-20 19:13+PDT\n"
"PO-Revision-Date: YEAR-MO-DA HO:MI+ZONE\n"
"Last-Translator: FULL NAME <EMAIL@ADDRESS>\n"
"Language-Team: LANGUAGE <LL@li.org>\n"
"MIME-Version: 1.0\n"
"Content-Type: text/plain; charset=CHARSET\n"
"Content-Transfer-Encoding: ENCODING\n"
"Generated-By: pygettext.py 1.5\n"

#: test.py:5
msgid "this is a test"
msgstr ""

 

3. Copy the messages.pot translation file to the appropriate locale directory. For Japanese that would be jp/LC_MESSAGES/test.po

$ mkdir -p jp/LC_MESSAGES

$ cp messages.pot jp/LC_MESSAGES/test.po

4. Add the translated text to the msgstr's in test.po

test.po

 

# Japanese Translation
# Copyright (C) 2005 Open Source Application Foundation
# Brian Kirsch <bkirsch@osafoundation.org>, 2005.
#
msgid ""
msgstr ""
"Project-Id-Version: .1\n"
"POT-Creation-Date: 2005-06-20 19:13+PDT\n"
"MIME-Version: 1.0\n"
"Content-Type: text/plain; charset=utf-8\n"
"Generated-By: pygettext.py 1.5\n"

#: test.py:5
msgid "this is a test"
msgstr "石田リチャード "

5. Convert the textual .po file to a machine readable .mo file

$ cd jp/LC_MESSAGES

$ msgfmt.py test.po

6. Run your program in the Python interpreter

$ python test.py

The translation is:  石田リチャード

The Chandler gettext approach

The gettext approach will still be employed in the Chandler universe, but the under-the-hood semantics of gettext will be modified to leverage the Repository. Developers will continue to use the _() syntax to define localizable message content. The Chandler pygettext.py utility, similar to the parcel loader, walks the Chandler filesystem source tree looking for Python files containing _(). The utility will deduce the parcel path based on the filesystem path, just like the parcel loader. It will add a ref-collection to the parcel. For each unique key contained inside a _(), the utility will create a LocalizableString and add it to the ref-collection. Just like gettext, the utility records the Python file name and line number for each _(). When two or more occurrences of the same key are found, the file name and line number of each occurrence are recorded but only one LocalizableString is created. The LocalizableString type has two attributes to support messages. The first is the boolean isMessage, which differentiates dynamically created content from LocalizableStrings defined in schema or parcel.xml. The second is the info array. Each position in the LocalizableString._info array stores the file name and line number of a _() translatable text occurrence. The _() syntax will accept both the 8-bit string and unicode types.

In the example below, the utility found three occurrences of the text _("No New Messages Found") in the Chandler mail parcel. Because these three instances contain the same key, only one LocalizableString is created and added to the parcel's messages ref-collection. The info array records the filename and line number of each occurrence.

 

>>> message = LocalizableString(u"osaf.mail", u"No New Messages Found", isMessage=True, info=[u"parcels/osaf/mail/constants.py:4", u"parcels/osaf/mail/imap.py:45", u"parcels/osaf/mail/pop.py:134"])

>>> parcelRefCollection.add(message)
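
The discovery step (find every _() call, keep one entry per unique key, record each occurrence) can be sketched with the standard ast module. This is an illustration under stated assumptions, not the Chandler pygettext.py utility: it assumes Python 3.8 or later, handles only calls whose first argument is a plain string literal, and uses a hypothetical function name find_messages and a made-up sample source.

```python
import ast

def find_messages(source, filename):
    """Map each unique _() key to the list of 'filename:line' places it occurs."""
    found = {}
    for node in ast.walk(ast.parse(source, filename)):
        is_call = isinstance(node, ast.Call)
        if is_call and isinstance(node.func, ast.Name) and node.func.id == "_":
            arg = node.args[0] if node.args else None
            # Only literal string arguments can be extracted statically.
            if isinstance(arg, ast.Constant) and isinstance(arg.value, str):
                found.setdefault(arg.value, []).append(
                    "%s:%d" % (filename, node.lineno))
    return found

sample = (
    'status = _("No New Messages Found")\n'
    'title = _("Inbox")\n'
    'retry = _("No New Messages Found")\n'
)
for key, places in sorted(find_messages(sample, "imap.py").items()):
    print(key, places)
```

Each dictionary key corresponds to one LocalizableString, and its list of places corresponds to the info array described above.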

 

During Chandler runtime the _() method is overloaded to call I18nManager.getTranslation($PARCEL_PATH, defaultText). The I18nManager will look in the messages ref-collection for the parcel and retrieve the LocalizableString with a key matching the defaultText. Because the I18nManager dynamically calculates the parcel path at translation time, _() message tags must be created in the same filesystem directory as the parcel's root. The parcel's root is the directory containing parcel.xml.
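
The effect of scoping the lookup by parcel path can be sketched with a dictionary keyed by both the parcel path and the default text. This is a simplified model, assuming Python 3; the MessageCatalog class, the path "parcels/osaf/sharing", and the placeholder translation strings are hypothetical and stand in for the per-parcel ref-collections.

```python
class MessageCatalog(object):
    """Hypothetical store keyed by (parcel path, default text)."""
    def __init__(self):
        self._translations = {}

    def add(self, parcel_path, default_text, translated):
        self._translations[(parcel_path, default_text)] = translated

    def lookup(self, parcel_path, default_text):
        # Fall back to the default text when the parcel has no translation.
        return self._translations.get((parcel_path, default_text), default_text)

catalog = MessageCatalog()
key = "Unable to contact server"
catalog.add("parcels/osaf/mail", key, "<mail parcel translation>")
catalog.add("parcels/osaf/sharing", key, "<sharing parcel translation>")

# The same key resolves independently in each parcel.
print(catalog.lookup("parcels/osaf/mail", key))
print(catalog.lookup("parcels/osaf/sharing", key))
print(catalog.lookup("parcels/other", key))
```

Two parcels can use an identical English key without overwriting each other, because the parcel path is part of the lookup.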

[Brian] The overloading of _() must be done within the Chandler namespace and not in the global namespace, as wxPython also leverages the traditional version of gettext for its message catalogs, and other third party libraries leveraged by Chandler may also do so in the future.

 

By definition, the _() message syntax in Chandler is parcel specific. The mail parcel cannot access the sharing parcel's messages because they are in a different parcel path. This approach has its pros and cons. The benefit is that there is no message space collision between parcels. This is crucial to prevent third party developer messages from colliding in the global namespace. If one developer created a parcel with the message _("Unable to contact server") and another created a different parcel with _("Unable to contact server"), the latter would overwrite the former in a global ref-collection, causing unanticipated and very difficult to debug results when localized.

The con is that _("Unable to contact server") is a generic enough message that any Chandler parcel performing network I/O operations would want to leverage it. We need a way to specify global messages as well as parcel level messages. This is accomplished by a central file, messages.py, in the $CHANDLER_HOME/i18n directory. Messages placed in this file are added to a global message ref-collection. This file contains both error and information messages.

[Brian] Should separate files be used for info and error messages?

[Brian] An alternate solution for .po layout and retrieval was proposed for .7 with the introduction of Python eggs. Need to investigate whether there is a better way to lay out the global message catalog as well.

 

messages.py sample

 

#Networking Messages
DOWNLOAD_COMPLETE = _("Download Complete")
CONNECTING = _("Connecting to server ")

#CPIA Messages
CHANGES_COMMITTED = _("committing changes to the repository...")
CREATING_COLLECTION = _("Creating Collection ")

#Networking Errors
SERVER_UNKNOWN = _("Unable to connect to server it is unknown")

#CPIA Errors
COMMIT_FAILED = _("committing to the repository failed")

Why a central messages.py file?
  1. Easily delineate programmatically between parcel level and global messages without forcing developers to define gettext domains and manage translation loading.
  2. Reduce developer errors caused by misspelling the gettext lookup key.
  3. Ease debugging as content not defined in messages.py will not be in the global message collection.
 
Using the Global Message catalog in Python

>>> from chandler.i18n import messages, I18nManager

>>> from PyICU import Locale

>>> print unicode(messages.DOWNLOAD_COMPLETE).encode("utf-8")

Download Complete

>>> #Change the Locale

>>> #This will set the Python, wxPython, and PyICU Locale

>>> I18nManager.setLocaleSet([Locale("jp")])

>>> print unicode(messages.DOWNLOAD_COMPLETE).encode("utf-8")

石田リチャード

Importing with gettext

There are currently two mechanisms for defining Repository data: parcel.xml and Python schema. A third means will be added: the pygettext API. The API will perform the same functional steps as the parcel loader and could tie in to the parcel loader itself. This would eliminate the need for it and the parcel loader to both scan the $CHANDLER_HOME filesystem hierarchy independently.

Startup with no Repository

When Chandler starts up without a Repository, its version of the pygettext API scans all directories under $CHANDLER_HOME for parcel.xml files. For each directory with a parcel.xml file, the API scans all files ending in .py for _() messages and creates a ref-collection in the Repository under the parcel path, adding the LocalizableString content. It also scans $CHANDLER_HOME/i18n/messages.py for _() messages and creates a global ref-collection, adding the LocalizableString content.

Startup with a Repository

When Chandler starts up with a Repository, its version of the pygettext API scans all directories under $CHANDLER_HOME for parcel.xml files. For each directory with a parcel.xml file, the API scans all files ending in .py for changes, additions, and removals of _() messages. If one or more changes are found it updates the parcel's ref-collection. It also scans $CHANDLER_HOME/i18n/messages.py for changes, additions, and removals of _() messages. If one or more changes are found it updates the global ref-collection. The scanning of the $CHANDLER_HOME tree when a Repository already exists will only occur in developer builds. This traversal strategy will be refined at a later date to maximize performance.
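
The scan itself can be sketched as follows. This is only an illustration of the idea, not the Chandler pygettext implementation: find_messages and diff_messages are hypothetical names, the code is written for current Python 3 (its ast module cannot parse Python 2 sources), and it only recognizes _() calls with a single literal string argument.

```python
import ast


def find_messages(source, filename):
    """Return {message: ["filename:lineno", ...]} for every _("literal") call."""
    lines_by_message = {}
    for node in ast.walk(ast.parse(source, filename)):
        is_tag = (isinstance(node, ast.Call)
                  and isinstance(node.func, ast.Name)
                  and node.func.id == "_"
                  and len(node.args) == 1
                  and isinstance(node.args[0], ast.Constant)
                  and isinstance(node.args[0].value, str))
        if is_tag:
            lines_by_message.setdefault(node.args[0].value, []).append(node.lineno)
    # Sort line numbers numerically so the occurrence list is stable.
    return {text: ["%s:%d" % (filename, n) for n in sorted(lines)]
            for text, lines in lines_by_message.items()}


def diff_messages(stored, scanned):
    """Return (added, removed) message texts between a stored and a fresh scan."""
    return sorted(set(scanned) - set(stored)), sorted(set(stored) - set(scanned))


sample = 'A = _("Download Complete")\nB = _("Connecting")\nC = _("Download Complete")\n'
scanned = find_messages(sample, "messages.py")
added, removed = diff_messages({"Connecting": [], "Old Message": []}, scanned)
```

The first function corresponds to the startup case with no Repository; the second to the comparison made when a Repository already exists.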

 

It should be noted that the _() syntax is merely a tool to help ease message creation, management, and lookup. If a parcel desires more control it can explicitly create a LocalizableString and work with it directly. The Chandler pygettext API is just a shortcut.

 

Exporting with gettext

The Chandler version of pygettext.py scans a pre-populated Repository for LocalizableString content and creates .pot template files for translation from the LocalizableString defaultText. The utility distinguishes between dynamic message LocalizableStrings and explicit LocalizableStrings defined in Python schema or parcel.xml by looking at each instance's isMessage flag.

Running the export utility does the following:
  1. For each parcel in the Repository, look for items of type LocalizableString.
  2. If the parcel has one or more LocalizableStrings and the $PARCEL_PATH/messages folder does not exist, create it.
  3. Add a $PARCEL_NAME.pot file containing the translatable content under $PARCEL_PATH/messages.
  4. Scan the global messages ref-collection for LocalizableStrings and add a messages.pot file to $CHANDLER_HOME/i18n/messages.

The .po(t) format

Chandler will employ the traditional gettext .po format but will add some additional Chandler specific comment information to better aid the translator. In addition, Chandler will only load .po files encoded in the utf-8 character set and export .pot template files in the utf-8 character set. The following example shows the mail.pot file created by the export utility in $CHANDLER_HOME/parcels/osaf/mail/messages. The file contains all LocalizableStrings under the parcel's path in the Repository, including dynamic messages created using _(). If the LocalizableString is a message, the comment line for that msgid contains the file name and line number of each occurrence of the message in Python. Otherwise the LocalizableString instance path is documented.

mail.pot sample

# osaf.mail Translation template
# Copyright (C) 2005 Open Source Applications Foundation
#
msgid ""
msgstr ""
"Project-Id-Version: PACKAGE VERSION\n"
"POT-Creation-Date: 2005-06-20 19:13+PDT\n"
"PO-Revision-Date: YEAR-MO-DA HO:MI+ZONE\n"
"Last-Translator: FULL NAME <EMAIL@ADDRESS>\n"
"MIME-Version: 1.0\n"
"Content-Type: text/plain; charset=utf-8\n"
"Generated-By: Chandler Translation Export Utility 1.0\n"

# LocalizableString: osaf.contentmodel.mail.IMAPAccount.displayName
msgid "IMAPAccount One"
msgstr ""

# Message: /osaf/mail/messages.py:88, /osaf/mail/messages.py:34
msgid "this is a message"
msgstr ""
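
As a rough illustration of how the export utility might produce such entries, the sketch below renders one .pot entry from a LocalizableString's default text, its isMessage flag, and its info list. The function names are hypothetical and the comment layout simply mirrors the sample above; it is not the actual export utility.

```python
def escape_po(text):
    """Escape backslashes, double quotes, and newlines for a .po string."""
    return text.replace("\\", "\\\\").replace('"', '\\"').replace("\n", "\\n")


def render_pot_entry(default_text, is_message, info):
    """Render one template entry.

    info holds "file:line" strings for a message, or a single instance path
    for an explicit LocalizableString.
    """
    label = "Message" if is_message else "LocalizableString"
    return '# %s: %s\nmsgid "%s"\nmsgstr ""\n' % (
        label, ", ".join(info), escape_po(default_text))


entry = render_pot_entry(
    "this is a message", True,
    ["/osaf/mail/messages.py:88", "/osaf/mail/messages.py:34"])
```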

Steps to translating the Mail Parcel

  1. Run the Chandler gettext export utility
  2. Create the directory $CHANDLER_HOME/parcels/osaf/mail/messages/$LOCALE/LC_MESSAGES
  3. Copy $CHANDLER_HOME/parcels/osaf/mail/messages/mail.pot to $CHANDLER_HOME/parcels/osaf/mail/messages/$LOCALE/LC_MESSAGES/mail.po
  4. Add the translation to each msgstr in the .po file
  5. Run the msgfmt utility to convert the .po file to a .mo file
  6. Either start up Chandler and change the locale, or change the locale in a running Chandler, to test

Steps to translate global messages

  1. Run the Chandler gettext export utility
  2. Copy $CHANDLER_HOME/i18n/messages/messages.pot to $CHANDLER_HOME/i18n/messages/$LOCALE/LC_MESSAGES/messages.po
  3. Add the translation to each msgstr
  4. Run the msgfmt utility to convert the .po file to a .mo file
  5. Either start up Chandler and change the locale, or change the locale in a running Chandler, to test

The Chandler version of pygettext will also provide helper methods to further aid the translation process, including allowing the translator to specify a locale set on the command line to automate the directory structure creation and $PARCEL_NAME.po file setup.
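
Such a helper might look like the following sketch. The function name setup_locale_dirs and its arguments are assumptions made for illustration (the proposal does not define the helper's interface), and the code targets current Python 3. It creates messages/$LOCALE/LC_MESSAGES for each requested locale and seeds the .po file from the exported .pot template.

```python
import shutil
from pathlib import Path


def setup_locale_dirs(parcel_path, parcel_name, locales):
    """Create the per-locale directories and copy the template to <parcel_name>.po."""
    messages_dir = Path(parcel_path) / "messages"
    template = messages_dir / (parcel_name + ".pot")
    po_files = []
    for locale_name in locales:
        target_dir = messages_dir / locale_name / "LC_MESSAGES"
        target_dir.mkdir(parents=True, exist_ok=True)
        target = target_dir / (parcel_name + ".po")
        if not target.exists():
            # Never overwrite a translation that is already in progress.
            shutil.copyfile(template, target)
        po_files.append(target)
    return po_files
```

Calling setup_locale_dirs("parcels/osaf/mail", "mail", ["jp"]) would cover steps 2 and 3 of the mail parcel procedure above, provided mail.pot has already been exported.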

 

The $CHANDLER_HOME/i18n directory

Background

Currently Chandler has a locale directory under $CHANDLER_HOME. This directory was used in previous iterations of Chandler to localize the product using the traditional gettext approach. Going forward this directory will continue to be leveraged as the root localization point in Chandler but will be renamed i18n. It will house all localizable information that is not contained in the Repository as well as Chandler global error and information messages. The directory structure is as follows, with a Japanese (jp) localization shown for clarity:

  $CHANDLER_HOME/i18n/
      /messages.py
      /resources/
          /images/
              /Sidebar.png
              /week.png
              /jp/
                  /Sidebar.png
                  /week.png
          /audio/
              /busy.wav
              /youvegotmail.wav
              /jp/
                  /busy.wav
          /video/
              /intro.avi
              /tutorial.avi
              /jp/
                  /intro.avi
                  /tutorial.avi
      /messages/
          /jp/
              /LC_MESSAGES/
                  /messages.po
      /external/
          /help/
              /welcome.html
              /jp/
                  /welcome.html
          /dialogs.xrc
          /jp/
              /dialogs.xrc
              /LC_MESSAGES/
                  /wxPythonMessages.po
                  /ChandlerStartupMessages.po
		

 

Resources

The $CHANDLER_HOME/i18n/resources directory is the root location for all non-textual displayable or audible data that lives outside the Repository. This includes images, audio, and video. Each resource node will have its own directory located under $CHANDLER_HOME/i18n/resources. The root directory for images is $CHANDLER_HOME/i18n/resources/images and the root directory for audio is $CHANDLER_HOME/i18n/resources/audio. Each resource node will contain its default media in its root directory. For each locale a subdirectory is defined in which the locale-specific media is placed. Resource loading via the I18nLoader will employ the same lookup algorithm as gettext.

 

External

The directory $CHANDLER_HOME/i18n/external is the central location for localizable content that lives outside of Chandler. This includes any custom wxPython message strings, wx resource .xrc files, help / tutorial files, and Chandler startup code. wxPython / wxWidgets uses its own gettext .po files and gettext API to localize dialog buttons, titles, etc. These files will continue to reside within the wx library. Since Chandler's control of these resources (textual verbiage, layout definition, etc.) is limited, a strategy should be investigated to migrate these elements to CPIA in the future.

At Chandler startup a number of actions take place independent of the Repository that require localization. The string and resource content for startup will be stored under external. These actions include:

 

 

The Parcel directory structure

 

The parcel directory structure closely mirrors the $CHANDLER_HOME/i18n directory. All parcel-specific message translation files are stored under $PARCEL_PATH/messages and all parcel-specific resource files are stored under $PARCEL_PATH/resources. The directory structure is as follows, with a Japanese (jp) localization shown for clarity:

  $PARCEL_PATH/
      /somePythonFile.py
      /parcel.xml
      /resources/
          /images/
              /parcelImage.png
              /jp/
                  /parcelImage.png
          /audio/
              /parcelNoise.wav
              /jp/
                  /parcelNoise.wav
      /messages/
          /jp/
              /LC_MESSAGES/
                  /$PARCEL_NAME.po

The I18nManager

 

The chandler.i18n.I18nManager manages the locale set space and handles translation lookup of LocalizableString content. At Chandler startup the I18nManager determines the user's current locale set from the Operating System. The locale set is the one or more locales the user has specified to the Operating System. The manager initializes the Python locale space, the wxPython locale space, and the ICU locale space, passing in the locale set retrieved from the Operating System. The I18nManager stores each locale in the set internally as an ICU Locale instance. The I18nManager can also store a unique locale set for each parcel.

All gettext translation file lookup is handled internally by the I18nManager.translate method. This lookup is called by the LocalizableString.__unicode__ method to find the appropriate translation for its defaultText. The translate method dynamically loads the gettext .mo files for a parcel the first time a LocalizableString in the parcel has its __unicode__ method called. The lookup is performed only when a parcel's translatable content is used, which allows Chandler to have hundreds of parcels without degrading startup performance. Content is cached the first time it is referenced.

The translate method algorithm
  1. Check the internal dictionary to see if a gettext Translation object has been loaded for the parcel.
  2. If it has not then look in the internal dictionary to see if the parcel has specified a custom locale set.
  3. If it has create a gettext Translation object using the locale set otherwise create a gettext Translation object using the Chandler locale set.
  4. Confirm all .mo translation files for the Translation object were created from .po files in utf-8 format.
  5. Store the Translation object for the parcel in the internal dictionary.
  6. Call the Translation object's ugettext method, which returns the translation as Python unicode if found, or the defaultText otherwise.
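
The caching behaviour of these steps can be sketched with the standard library gettext module. TranslationCache is a hypothetical name, the domain and localedir arguments are assumptions about how a parcel's .mo files would be located, and the utf-8 confirmation in step 4 is omitted. The sketch is written for current Python 3, where gettext() returns text and replaces the Python 2 ugettext().

```python
import gettext


class TranslationCache:
    """Loads a gettext translation per parcel on first use and caches it."""

    def __init__(self, default_locales):
        self._default_locales = list(default_locales)
        self._parcel_locales = {}  # parcel path -> custom locale list
        self._translations = {}    # parcel path -> gettext translation object

    def set_locale_set(self, locales, parcel_path=None):
        """Changing a locale set flushes the affected cached translations."""
        if parcel_path is None:
            self._default_locales = list(locales)
            self._translations.clear()
        else:
            self._parcel_locales[parcel_path] = list(locales)
            self._translations.pop(parcel_path, None)

    def translate(self, parcel_path, domain, localedir, default_text):
        translation = self._translations.get(parcel_path)
        if translation is None:
            # Prefer the parcel's own locale set, else the application-wide one.
            locales = self._parcel_locales.get(parcel_path, self._default_locales)
            # fallback=True yields a NullTranslations object when no .mo exists,
            # so untranslated parcels simply return the default text.
            translation = gettext.translation(
                domain, localedir, languages=locales, fallback=True)
            self._translations[parcel_path] = translation
        return translation.gettext(default_text)
```

Because nothing is loaded until translate is first called for a parcel, the cost of opening .mo files is paid only by parcels whose messages are actually displayed.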

Stub implementation of the I18nManager

class I18nManager(object):
    def discoverLocaleSet(self):
        """Queries the Operating System for the LocaleSet and sets the Python, wxPython, and ICU Locales"""

    def setLocaleSet(self, localeSetArray, parcelPath=None):
        """
        @param localeSetArray: an array of ICU Locale objects
        @type localeSetArray: array
        @param parcelPath: if the parcelPath is not None the localeSetArray will only be used
                           for the parcel
        @type parcelPath: 8-bit Python string or None
        @NOTE: Changing the locale set results in the translation dictionary being flushed
               for the parcel, or for Chandler if the parcelPath is None.
               *** Are there security concerns with exposing the Chandler global Locale
               setting mechanism to parcel developers?
        """

    def getLocaleSet(self, parcelPath=None):
        """
        @param parcelPath: if the parcel path is not None returns the LocaleSet for the parcel,
                           otherwise returns the Chandler Locale Set
        @type parcelPath: 8-bit Python string or None
        @return: an array of ICU Locale objects or None
        """

    def translate(self, parcelPath, defaultText):
        """
        @param parcelPath: The parcel's Repository path
        @type parcelPath: 8-bit Python string
        @param defaultText: The defaultText to return if no translation is found
        @type defaultText: Python unicode
        @return: Python unicode object
        """
  

 

The I18nLoader

 

The I18nLoader handles the lookup of localizable resources including images, audio, and video as well as help / tutorial files. Resources can be global in which case they are under $CHANDLER_HOME/i18n/resources or parcel specific in which case they are under $PARCEL_PATH/resources. The I18nLoader has utility methods to easily find image, video, and audio resources at the Chandler or parcel level. It is also flexible enough to allow parcel developers to define new types of resources. It employs the same locale set fallback mechanism as the gettext API. In addition, the loader can be leveraged to retrieve localized help / tutorial files.

 

Example Locating a Chandler Image Resource "Sidebar.png"
  1. Python code I18nLoader.getImage("Sidebar.png") is called
  2. The I18nLoader gets the Chandler locale set by calling I18nManager.getLocaleSet()
  3. The I18nLoader gets the first $LOCALE in the set and scans the $CHANDLER_HOME/i18n/resources/images/$LOCALE directory for a resource named Sidebar.png
  4. If not found, the I18nLoader gets the next locale in the locale set and scans its $CHANDLER_HOME/i18n/resources/images/$LOCALE directory for a resource named Sidebar.png
  5. After scanning the entire locale set if the resource is not found the default resource $CHANDLER_HOME/i18n/resources/images/Sidebar.png is returned as a file reader object.
  6. The resulting lookup path is cached so the next call to I18nLoader.getImage("Sidebar.png") retrieves the path from the cache.
 
Example Locating a parcel Image Resource "parcelImage.png"
  1. Python code I18nLoader.getImage("parcelImage.png", $PARCEL_PATH) is called
  2. The I18nLoader gets the parcel's locale set by calling I18nManager.getLocaleSet($PARCEL_PATH)
  3. If no locale set is specified for the parcel, the I18nLoader gets the Chandler locale set by calling I18nManager.getLocaleSet()
  4. The I18nLoader gets the first $LOCALE in the set and scans the $PARCEL_PATH/resources/images/$LOCALE directory for a resource named parcelImage.png
  5. If not found, the I18nLoader gets the next locale in the locale set and scans its $PARCEL_PATH/resources/images/$LOCALE directory for a resource named parcelImage.png
  6. After scanning the entire locale set if the resource is not found the default resource $PARCEL_PATH/resources/images/parcelImage.png is returned as a file reader object.
  7. The resulting lookup path is cached so the next call to I18nLoader.getImage("parcelImage.png", $PARCEL_PATH) retrieves the path from the cache.
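
Both lookups follow the same pattern, which the sketch below tries to capture. ResourceLocator is a hypothetical name, the root argument stands for either $CHANDLER_HOME/i18n/resources or $PARCEL_PATH/resources, and the method returns a path rather than the file reader object mentioned in the steps. It is written for current Python 3.

```python
from pathlib import Path


class ResourceLocator:
    """Finds a resource by walking the locale set, then falling back to the default."""

    def __init__(self, root):
        self._root = Path(root)
        self._cache = {}  # (resource type, name, locales) -> resolved path

    def find(self, resource_type, name, locales):
        key = (resource_type, name, tuple(locales))
        if key not in self._cache:
            base = self._root / resource_type
            candidates = [base / locale_name / name for locale_name in locales]
            # The first localized copy wins; otherwise use the default resource
            # stored directly in the resource type's root directory.
            self._cache[key] = next(
                (path for path in candidates if path.is_file()), base / name)
        return self._cache[key]
```

With a locale set of ["fr", "jp"] and only a jp copy of Sidebar.png on disk, find("images", "Sidebar.png", ["fr", "jp"]) resolves to images/jp/Sidebar.png, and later calls are answered from the cache.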

I18nLoader API examples

 

>>> from chandler.i18n import I18nLoader

>>> #Loads the image Sidebar.png from $CHANDLER_HOME/i18n/resources/images in a locale aware manner

>>> I18nLoader.getImage("Sidebar.png")

>>> #Loads the audio youveGotMail.wav from $CHANDLER_HOME/i18n/resources/audio in a locale aware manner

>>> I18nLoader.getAudio("youveGotMail.wav")

>>> #Loads the video tutorial.mpg from $CHANDLER_HOME/i18n/resources/video in a locale aware manner

>>> I18nLoader.getVideo("tutorial.mpg")

>>> #Loads a custom resource "example.html" from $CHANDLER_HOME/i18n/resources/documents in a locale aware manner

>>> I18nLoader.getResource("documents", "example.html")

>>> #Loads the audio parcelAudio.wav from $PARCEL_PATH/resources/audio in a locale aware manner

>>> I18nLoader.getAudio("parcelAudio.wav", $PARCEL_PATH )

>>> #Loads a custom resource "default.rss" from $PARCEL_PATH/resources/rssFeeds in a locale aware manner

>>> I18nLoader.getResource("rssFeeds", "default.rss", $PARCEL_PATH)

>>> #Loads the help file welcome.html from $CHANDLER_HOME/i18n/external/help in a locale aware manner

>>> I18nLoader.getHelpFile("welcome.html")

 

 

[Brian] Help files may be stored and retrieved in a different manner in future releases, in which case it really is just a ResourceLoader.

 

The Translation process

A translation of Chandler to Japanese would require the following steps:

  1. Run the gettext export utility passing the locale "jp" as an argument. This will create a $PARCEL_PATH/messages/jp/LC_MESSAGES/$PARCEL_NAME.po file for each parcel that is localizable. It will also create a global $CHANDLER_HOME/i18n/messages/jp/LC_MESSAGES/messages.po.
  2. Edit each .po file adding the Japanese translation. A script will be provided to ease the .po discovery process.
  3. Run msgfmt.py -all. This will walk the $CHANDLER_HOME file path converting .po files to .mo files.
  4. Determine which resources (images, audio, etc.) need to be localized and place the localized media under $CHANDLER_HOME/i18n/resources/$RESOURCE_TYPE/jp/.
  5. Create a jp directory under $CHANDLER_HOME/i18n/external/ and add any localizable layouts and text required.

Once the translation to Japanese is complete there are two options. If the developer has commit access to the Chandler SVN tree, the Japanese translation can be checked in. Check-ins of translations, however, should only happen at release points. Alternatively, to package the translation for distribution the developer would run the Chandler translation packaging tool. The tool adds all the Japanese localized content, which includes .po files, .mo files, .xrc files, images, parcel.xml, etc., to a zip file. This file would maintain the directory structure of the translation and could be unpacked in any $CHANDLER_HOME directory.
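
A packaging tool along these lines could be quite small. The sketch below is an assumption about how it might work, not a description of an existing tool: package_locale is a hypothetical name, and it treats every file stored beneath a directory named after the locale (for example jp) as localized content, which matches the directory layouts shown earlier. It uses current Python 3.

```python
import zipfile
from pathlib import Path


def package_locale(chandler_home, locale_name, archive_path):
    """Zip every file found beneath a directory named locale_name.

    Paths inside the archive are relative to chandler_home, so the archive
    can be unpacked on top of another Chandler installation.
    """
    home = Path(chandler_home)
    added = []
    with zipfile.ZipFile(archive_path, "w", zipfile.ZIP_DEFLATED) as archive:
        for path in sorted(home.rglob("*")):
            relative = path.relative_to(home)
            # A file is localized if one of its parent directories is the locale.
            if path.is_file() and locale_name in relative.parts[:-1]:
                archive.write(path, relative.as_posix())
                added.append(relative.as_posix())
    return added
```

The archive should be written outside the tree being scanned so that it cannot pick itself up.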

[Brian] Need to refine how translators contribute translations back to Chandler and how these translations are updated and verified from release to release.

This translation process will be further refined via tools which will reduce the number of steps required to produce a translation. For example, having a .po file for each parcel, while important as an architectural decision, can make creating a translation difficult. The use of tools will reduce this burden. The specifics of the tools are still to be refined.

 

The Chandler Application Universe

wxWidgets / wxPython

wxWidgets wraps native Operating System widgets. As such, on our three target platforms (Windows, Linux, and Mac) wxWidgets should not require any custom internationalization logic. Font sizes, colors, etc. should come from the host Operating System and thus are already localized for the user. Three key issues, however, must be addressed to ensure that input and display of unicode work in wxWidgets. First, we must leverage the unicode build of wx, which Chandler has done since the .5 release. Second, widgets that receive text input or display text must wrap native Operating System widgets and cannot be custom wx widgets that are platform independent. For each native widget, confirm that text editing, selection, and wrapping work as expected in the target locale. Third, and most importantly, text in the Operating System character set encoding (code page) specified by the user must be converted to Python unicode at the wxWidgets input boundary to Chandler.

 

CPIA

Chandler has many custom OSAF created wxWidgets which are leveraged in CPIA. Each of these custom widgets needs to support localization and thus must tie in to the Operating System for font, sizes, colors, and layout. A specific analysis of each custom widget is required to determine specific steps required for it to support localization.

Dialogs

Chandler dialogs should leverage Attribute Editors for input and should tie in to the Chandler gettext API for localization. This will most likely mean that the current .xrc layout files will not be utilized.

Parcel, Attribute, and file naming limitations

The current Chandler iso-8859-1 charset limitation imposed on filesystem paths will no longer be in effect. Any path going forward will need to be converted from the Operating System native character set encoding to unicode. Parcel names, Parcel directory names, Item names, and attribute names will continue to be limited to alphanumeric characters and _.

All parcel.xml files will be in the utf-8 character set as will any textual files created by Chandler. The names of these textual files will also be in the utf-8 character set.

Exceptions

Exceptions are intended for developer level debugging. The exception message string is not localized and as such is not suitable for display to the user. Developers must capture Exceptions and use LocalizableString messages to convey the error situation.

   try:
       self.connectToServer()
   except TimeoutError, e:
       # Pass a translatable LocalizableString message to the CPIA Layer
       notifyUI(_("Communication with the Server timed out"))
       # Log the exception for developer debugging
       logger.error(e)

I/O Boundaries

“Localizability is only one (and the easiest) aspect of i18n.  The more difficult area is data processing, as it requires architectural and design considerations fundamental to the product.”

- Andrea Vine
Sun Internationalization Architect

 

Proper boundary conversions will be the key focal point of the Chandler internationalization process. Chandler is not only an application but an application framework and as such contains a great number of boundaries. Boundaries include:

Each boundary point should be closely examined and documented, with unit and regression tests ensuring that input textual data is converted to unicode and output unicode is correctly converted to bytes. For each boundary there will be additional considerations. The SMTP output boundary, for example, requires conversion of its displayable headers from utf-8 bytes to quoted printable. HTTP protocols such as WebDAV require percent-encoding (URL escaping) of the utf-8 bytes for any data appearing in a URL. The utf-8 character set should be used when encoding unicode to bytes except in specific cases where aspects of the protocol require an alternate character set. Each boundary point in Chandler needs an owner who can document a conversion strategy and enforce that strategy. In addition, moving boundary point conversion to a small set of utilities should be explored to reduce points of failure.
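
A small set of boundary utilities might look like the sketch below. The function names are hypothetical, the code uses current Python 3 (where text is str and the standard library provides the encoders), and two choices are assumptions of the sketch: mail headers are emitted as RFC 2047 encoded-words, for which Python's email package chooses the Q or B encoding itself, and URL data is percent-encoded as the generic URI syntax requires.

```python
from email.header import Header, decode_header, make_header
from urllib.parse import quote


def to_wire(text):
    """Encode text as utf-8 bytes for an 8-bit network API."""
    return text.encode("utf-8")


def from_wire(data, charset="utf-8"):
    """Decode incoming bytes to text at the input boundary."""
    return data.decode(charset)


def to_mail_header(text):
    """Return an ASCII-only header value (RFC 2047 encoded-word when needed)."""
    return Header(text, "utf-8").encode()


def from_mail_header(value):
    """Decode an RFC 2047 header value back to text."""
    return str(make_header(decode_header(value)))


def to_url_segment(text):
    """Percent-encode the utf-8 bytes of one URL path segment."""
    return quote(text.encode("utf-8"), safe="")
```

Keeping these conversions in one module means each protocol boundary calls a named helper instead of repeating encode and decode calls inline.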

Boundary Example

The Twisted API is both a network and third party boundary point in Chandler. Twisted as a networking framework has no internationalization features. All textual data sent via Twisted protocols must be 8-bit. Passing a unicode Python object to a Twisted network API will result in a UnicodeEncodeError if the content is not ascii, because Python by default uses the ascii codec to convert data from unicode to bytes. Thus it is up to the developer to handle all unicode to 8-bit conversion before entering the Twisted layer. As we have seen before, this is done by encoding the unicode object in the utf-8 character set.

>>> username = unicode("brian")

>>> password = unicode("test")

>>> twistedProtocol.registerAuthenticator(imap4.LOGINAuthenticator(username.encode("utf-8")))

>>> twistedProtocol.authenticate(password.encode("utf-8"))

 

If an error occurs in Twisted, a Failure object wrapping an exception is passed to an errback. The failure is not localized and displaying the error message text of the exception is not an option. The user needs constructive error information in his or her language. Thus, like all exceptions, Twisted failures will have to be caught and a message of the LocalizableString type used.

 errorType = str(err.__class__)
 if isinstance(err, twistedSMTP.SMTPClientError):
     # Map the developer-level failure to a translatable user message
     errMsg = None
     if errorType == errors.AUTH_DECLINED_ERROR:
         errMsg = _("Your username or password is invalid")
     elif errorType == errors.TLS_ERROR:
         errMsg = _("The server does not support secure communication")
     if errMsg is not None:
         notifyUI(errMsg)
     # Log the failure for developer debugging
     logger.error(err)
    

 

Architectural Design

Centralized i18n core actions

A key goal of the move from the 8-bit English-centric space to the world of unicode and localization is centralization of core tasks to minimize potential points of failure. The majority of localization will take place in the Repository and CPIA layers. The Repository will house the LocalizableString type, which along with the I18nManager will handle the translation process. It will also manage searching via PyLucene and leverage PyICU for sorting of ref and item collections in a localized manner.

The CPIA layer detail and summary views along with the attribute editor will be the central points requiring localization. The attribute editor in particular will need to use PyICU for display and parsing of numbers, currencies, and dates in a localized manner. It will also need to convert any Operating System character set input to unicode if not already handled by the wxWidgets library.

The Chandler Calendar view will need to tie in to PyICU for localization, as PyICU provides support for over 500 locales. Timezone, first day of week, day names, month names, etc. are all localizable. The built-in Python datetime module along with the dateutil module will be used to manipulate date specific user data. More detailed information will be provided at a later time.

A chandler.i18n.i18nUtils class will also be provided to help ease localization tasks and provide a central point of control. The specifics of the methods provided by i18nUtils still need to be explored.

Locale Set Determination

Chandler will retrieve its locale set from the host Operating System. The set will contain one or more locales and the order in which they are to be tried. Operating Systems such as Apple's OS X have an ordered set of locales for fallback purposes. For example: try fr_CA; if not available, try fr; if not available, try en. This ordering will be preserved in Chandler. The retrieval of the locale set will require C / C++ OS specific code for each target platform and a SWIG wrapper for access via Python. In addition, Chandler will allow the user to specify a Chandler only locale which takes precedence over the Operating System locale set.
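
The ordering rule can be illustrated with a short sketch. Both function names are hypothetical, and two behaviours are assumptions rather than decisions made by this proposal: parent locales (fr for fr_CA) are inserted automatically after each entry, similar to what the gettext module does when searching for catalogs, and the user's Chandler only locale is placed in front of the Operating System set rather than replacing it.

```python
def expand_locale_set(locales):
    """Add parent locales after each entry, keeping order and dropping duplicates."""
    expanded = []
    for locale_name in locales:
        parts = locale_name.split("_")
        # "fr_CA" contributes "fr_CA" first and then its parent "fr".
        for end in range(len(parts), 0, -1):
            candidate = "_".join(parts[:end])
            if candidate not in expanded:
                expanded.append(candidate)
    return expanded


def resolve_locale_set(os_locales, user_locale=None):
    """Put the user's Chandler only locale (if any) ahead of the OS locale set."""
    ordered = ([user_locale] if user_locale else []) + list(os_locales)
    return expand_locale_set(ordered)
```

For example, resolve_locale_set(["fr_CA", "en"]) yields fr_CA, fr, en, which is the fallback order described above.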

[Brian] investigate if any OS specific locale retrieval code has been written or wrapped for Python already

 

Schema Changes

The following changes need to be made to the schema:

  1. Create a schema.Text type which is an alias for UString and LocalizableString.
  2. Convert all attributes of type schema.String to schema.Text.
  3. Condense the description, examples, and issues attributes to description, which will be of type BString. The description attribute is intended for developers and is not displayable in the UI.
  4. Review the content model and determine which schema.Text attributes are of type UString and which are of type LocalizableString.

Searching

PyLucene will continue to be the mechanism for searching and indexing of unicode strings. Currently the tool indexes only the Lob type. It will need to be augmented to index attributes of type UString as well. The LocalizableString and BString will not be indexed.

[Brian] PyLucene does not allow a locale and case insensitive search. This type of search is useful for email and web pages.

Sorting

Ref-collections and item collections today perform binary sorts. These sorts are not localizable. An additional feature that ties in to PyICU's locale aware sort is required. Currency, number, and Date (UTC timestamp) will continue to use the binary sort feature.

Dates and Timezones

The Python datetime class in conjunction with PyICU Timezone will be leveraged for Calendaring and Date processing. For more information refer to the Timezone specification (http://svn.osafoundation.org/docs/trunk/docs/specs/rel0_6/Timezone-0.6.html). The PyICU SimpleDateFormat class will be used for localized date and time display in the UI and the DateFormatSymbols class will be used for accessing localized date-time formatting strings, such as names of the months, days of the week, and timezone display name.

Numbers and Currencies

The PyICU NumberFormat, DecimalFormat, RuleBasedNumberFormat, and DecimalFormatSymbols classes will be employed for localized number and currency formatting.

 

Other Localization Areas

Installer

The build environment will need to be augmented to support not only builds on target platforms but builds for target languages as well. Chandler will support both the ability to be shipped for a specific locale market and to dynamically add new locale translations for an installed version. The installer for each designated Chandler locale will need to be localized for that language. Because this is a large task we should designate a subset of the most spoken languages (Spanish, Portuguese) to support via the installer. Alternate languages will be available post install as downloadable bundles. The bundles unpack and install the appropriate translation files / resources in the Chandler application hierarchy.

 

Help Files

The "About Chandler" Dialog, which uses the file welcome.html for its content, will require localization. The dialog is launched from a menu item under "Help". All help and tutorial related resources will be placed under $CHANDLER_HOME/i18n/external/help.

Tools

poEdit

poEdit (http://poedit.org) is a cross-platform editor for gettext catalogs (.po files). It is built with the wxWidgets toolkit and can run on any platform supported by it (although it was only tested on Unix with GTK+ and Windows). It aims to provide a more convenient approach to editing catalogs than launching vi and editing the file by hand. Unlike other catalog editors, poEdit shows data in a very compact way. Entries are arranged in a list, so that you can easily navigate large catalogs and immediately see how much of the catalog is already translated, what needs translating, and which parts are only translated in a "fuzzy" way.

Mozilla character set detector

The Mozilla character set detector (http://www.mozilla.org/projects/intl/ChardetInterface.htm) will be interfaced and wrapped using SWIG. The detector is leveraged throughout the Mozilla family suite, including Firefox. When an incoming textual source is received by Chandler it must be converted to unicode. If no character set is specified for the textual source, or an incorrect character set is specified, the Mozilla character set detector should be leveraged. The detector will apply an algorithm on the textual source to determine the true character set of the source.


Developer Awareness

Every developer working on Chandler plays a role in the internationalization process. The burden of internationalization does not fall to one or two individuals but to the entire team. As such, each developer will be expected to understand and follow a set of basic i18n principles. Detailed documentation on general internationalization and localization principles will be provided via the wiki. In addition, an internationalization and localization best practices developer guide will be created. This, along with a detailed code design guideline, will go a long way toward ramping up the team. Detailed code reviews should be put in place, with at least one attendee having a strong internationalization background.

 

 

Testing

Python String Assignments

Create a grep tool that searches the Chandler code base for str() casts and direct Python string literals enclosed in opening and closing quotes, e.g. "this is a test". Evaluating the grep output is a manual task. The review looks for incorrect use of Python str for either network I/O conversion or UI displayable content.
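
One possible shape for such a tool is sketched below. It uses Python's standard tokenize module instead of a plain text grep, so that quotes inside comments are not reported. The function name find_string_usage and the tuple layout are hypothetical, and the sketch targets a current Python. Docstrings are reported as literals too, so the output still needs the manual review described above.

```python
import io
import tokenize


def find_string_usage(source, filename="<string>"):
    """Return (filename, line, kind, text) tuples for manual review."""
    hits = []
    tokens = list(tokenize.generate_tokens(io.StringIO(source).readline))
    for i, tok in enumerate(tokens):
        if tok.type == tokenize.STRING:
            # Any quoted literal, including docstrings.
            hits.append((filename, tok.start[0], "literal", tok.string))
        elif tok.type == tokenize.NAME and tok.string == "str":
            # Report the name `str` only when it is immediately called.
            nxt = tokens[i + 1] if i + 1 < len(tokens) else None
            if nxt is not None and nxt.type == tokenize.OP and nxt.string == "(":
                hits.append((filename, tok.start[0], "str() call", "str("))
    return hits


if __name__ == "__main__":
    sample = 'name = str(value)\nlabel = "this is a test"\n'
    for hit in find_string_usage(sample, "sample.py"):
        print("%s:%d: %s: %s" % hit)
```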

Unicode enforcement checks

Add checks to the Python unicode object for slicing and other operations which it does not handle correctly. The debug build will warn when these operations occur and will also warn if the unicode.__str__ method is called. These checks will highlight not only issues in Chandler but also potential issues in third-party libraries used by Chandler.

MessageFormat

A consistency checker needs to be written to confirm that the abstract structure of each message format (the number and type of its parameters) has not changed between the original and any translation.
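
A minimal sketch of such a checker follows, assuming ICU-style patterns with simple arguments such as {0}, {1,number} or {2,date,short}. The function names are hypothetical, and the regular expression deliberately ignores nested choice formats and quoted braces, which a real checker would need to handle.

```python
import re

# Matches simple arguments: {0}, {1,number}, {2,date,short}.
_ARG = re.compile(r"\{\s*(\w+)\s*(?:,\s*(\w+))?[^{}]*\}")


def message_signature(pattern):
    """Return the set of (argument, type) pairs used in a message pattern."""
    return {(m.group(1), m.group(2) or "untyped") for m in _ARG.finditer(pattern)}


def check_translation(original, translated):
    """Return a list of problems; an empty list means the structure matches."""
    want = message_signature(original)
    got = message_signature(translated)
    problems = []
    for arg in sorted(want - got):
        problems.append("missing or changed in translation: %s (%s)" % arg)
    for arg in sorted(got - want):
        problems.append("unexpected in translation: %s (%s)" % arg)
    return problems
```

Comparing sets instead of ordered lists is intentional: a translation may legitimately reorder or repeat arguments, but it must not drop one or change its type.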

Locale specific Tinderbox Builds

A tinderbox is required that can perform the following tasks:

  1. Build Chandler releases for the languages / locales which will be supported in the 1.0 release. For each build, a volunteer who speaks that language will review it to confirm a correct translation.
  2. Launch a series of Chandler builds, each time changing the OS locale to one from a locale set (Chinese, Russian, etc.) chosen for its potential to break the build and thus highlight any localization errors in Chandler and third-party APIs.
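
The second task could be driven by a small script along these lines. This is only a sketch under stated assumptions: the locale names are illustrative, the command to launch is supplied by the caller, and the locale is changed through the POSIX LANG and LC_ALL environment variables of the child process, not system-wide. Windows would need a different mechanism.

```python
import os
import subprocess

# Hypothetical locale set; the text above names only Chinese and Russian.
LOCALES = ["zh_CN.UTF-8", "ru_RU.UTF-8"]


def run_under_locales(command, locales=LOCALES):
    """Run `command` once per locale and return {locale: exit code}."""
    results = {}
    for loc in locales:
        env = dict(os.environ)
        # Override the locale for the child process only.
        env["LANG"] = loc
        env["LC_ALL"] = loc
        results[loc] = subprocess.run(command, env=env).returncode
    return results
```

A non-zero exit code for any locale would mark that tinderbox run as failed.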

[Brian] Need to design an i18n test plan with QA and add the plan to this section

 

Future

CPIA localizable layout architecture including Accelerator Key assignments

Auto Complete

Spell Checking

Break Iterators

Localized help documentation

 

 

Links

 

GetText API
Unicode Must Knows
Unicode Website
ICU

Python

Python Unicode Must Know
http://www.python.org/sigs/i18n-sig/
http://www.python.org/doc/lib/module-locale.html
http://www.python.org/doc/lib/module-time.html


wxWindows


http://www.wxwindows.org/i18n.htm
http://www.wxwindows.org/technote/internat.htm

 

QA

http://www.i18nfaq.com/qa.html

 

 

Author Edit date Description
Brian Kirsch June 16, 2005 Second Draft