r7 - 03 Nov 2008 - 18:43:57 - LinkUpdater
-- KenKrugler - 05 Nov 2004

And now, for my first Wiki edit on Chandler... below are issues that came up during my meeting on Oct 28th with John, Katie and Chao.

Localizable data in code vs. resources

The typical Unix command-line tool approach is to use gettext to extract properly tagged (e.g. _("my string")) strings from source.

The typical approach for most GUI operating systems (Mac, Windows, Palm are the ones I know about) is to have all localizable data (text or otherwise) in separate resource files, which get compiled into binary data that code explicitly loads at runtime.

Some of the advantages of the gettext approach are:

  • It's a well-known approach for developers coming from the Unix world.
  • It's easy for programmers to add localizable strings.
  • Automatic substitution of English text for untranslated strings.
  • No need to coordinate resource IDs for all source in a module.
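As a rough illustration of the tagging and English fallback described above, here is a minimal sketch using Python's standard gettext module. The domain name "chandler_demo" and the directory "no_such_locale_dir" are placeholders, not anything from Chandler; because no catalog exists there, fallback=True returns a pass-through translator.

```python
import gettext

# Assumption: "chandler_demo" and "no_such_locale_dir" are placeholder names.
# No compiled catalog exists there, so fallback=True gives a NullTranslations
# object whose gettext() returns the English text unchanged.
translations = gettext.translation(
    "chandler_demo", localedir="no_such_locale_dir", languages=["de"], fallback=True
)
_ = translations.gettext


def save_prompt():
    # The _() wrapper tags the string for extraction and looks it up at runtime.
    return _("Save changes before closing?")


print(save_prompt())
```

With a real German catalog in place the same call would return the translation; without one, the English text comes back.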

Some of the advantages of the separate resource approach are:

  • Better separation of code & data.
  • Explicit context (no issue with two strings having the same English text but different translations).
  • No separate extraction phase to generate localizable data.
  • Support for non-string data such as soft constants, menus and form layouts.
  • Easier to enforce the "no localizable strings in code" rule.
  • It's harder for programmers to add localizable strings.

The last "advantage" is kind of an odd one, but the issue is that you do want to require a bit of extra thought whenever localizable data is added. Programmers need to be thinking about issues like message length (e.g. German will grow a lot), message construction (don't use sprintf), plural forms, whether a localizer will need extra information in order to correctly translate the string, and so on.

Storing localizable data

Currently localizable text data is stored in a .po gettext catalog. John has been pushing for storing UI data in the repository - when the repository is created, or parcels are added, then the UI data is imported into the repository. Hopefully when a parcel is deleted it gets automatically removed.

This does create an extra step during installation, and it will duplicate UI data. One advantage is that all data access is via a Python object that maps to a repository item (not sure if I've got all of the terms right). This one class can act as a cover for data access if additional work needs to be done under the hood. It would need to use the current locale setting to determine which data to return.
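To make the "one class as a cover for data access" idea a bit more concrete, here is a minimal sketch. It is not the repository API; the class name LocalizedResource, the item name and the fallback order are all assumptions made for illustration.

```python
class LocalizedResource:
    """Hypothetical cover class: one object per UI item, many locale variants."""

    def __init__(self, name, variants, default_locale="en_US"):
        self.name = name
        self.variants = dict(variants)  # locale code -> value
        self.default_locale = default_locale

    def get(self, locale):
        # Try the exact locale, then the bare language, then the default.
        language = locale.split("_")[0]
        for candidate in (locale, language, self.default_locale):
            if candidate in self.variants:
                return self.variants[candidate]
        raise KeyError(f"no value for {self.name!r} in {locale!r}")


title = LocalizedResource("calendar.title", {"en_US": "Calendar", "de": "Kalender"})
print(title.get("de_DE"))  # falls back from de_DE to de
print(title.get("fr_FR"))  # falls back to the en_US default
```

Callers only ever see get(), so the storage behind it (repository item, .po catalog, or something else) could change without touching them.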

I don't know how John is planning on handling segmentation of resource data between parcels, if the repository winds up being where the data gets dumped.

Localizable data formats

wxWidgets has an XML-based format (XRC) that's used for menus and dialogs. Apparently it can be extended (the docs that I saw weren't complete in this area) to support additional resource types. When you make a call to load a dialog or menu, wxWidgets automatically does a gettext operation on all string data. Since the dialog layout description doesn't include positioning data (that's all dynamic) this approach can work, though I don't know how often you run into problems during translation.

If all resource data were going to be kept in this format (whether or not it's inside the repository) you'd want to add support for strings. Another useful type is the soft constant, which is a resource that contains a single numeric value. We've used these in the past in Mac & Palm to give the localizers additional control over program behavior... for example, using a soft constant to turn off auto-type completion when the locale is Japanese (jaJP). String lists can also be handy.
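A soft constant and a string list might look something like the sketch below. The table layout, the resource names and the override mechanism are invented for illustration; this is not the XRC format.

```python
# Hypothetical resource tables; names and values are illustrative only.
BASE_RESOURCES = {
    "autocomplete_enabled": 1,               # soft constant
    "weekday_names": ["Mon", "Tue", "Wed"],  # string list
}
LOCALE_OVERRIDES = {
    "jaJP": {"autocomplete_enabled": 0},     # localizer turns the feature off
}


def load_resource(name, locale):
    """Return the locale's override if there is one, else the base value."""
    return LOCALE_OVERRIDES.get(locale, {}).get(name, BASE_RESOURCES[name])


def autocomplete_is_on(locale):
    return load_resource("autocomplete_enabled", locale) != 0


print(autocomplete_is_on("enUS"), autocomplete_is_on("jaJP"))
```

The code never tests for Japanese directly; the localizer changes behavior by editing data only.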

One issue with the wxWidgets approach is that it's not clear how you'd handle loading locale-specific resources. For example, if you needed to have a completely different dialog layout for Japanese, then it appears that you'd need to create a new dialog, versus being able to specify that there was a different version in the resource data.

I also don't know if wxWidgets applies additional context when it does localized string substitution. For example, it could look for a string with English text "Maybe" that was explicitly a button title, and then if it didn't find a match fall back to any string with the English key "Maybe".
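The context-then-fallback lookup described above could be sketched like this. It says nothing about what wxWidgets actually does; the catalog shape and the German strings are assumptions for illustration.

```python
# Hypothetical catalog keyed by (context, English text); None means "any context".
CATALOG_DE = {
    ("button", "Maybe"): "Vielleicht",
    (None, "Maybe"): "Eventuell",
    (None, "Cancel"): "Abbrechen",
}


def translate(text, context=None, catalog=CATALOG_DE):
    # Prefer the context-specific entry, then the context-free one.
    for key in ((context, text), (None, text)):
        if key in catalog:
            return catalog[key]
    return text  # untranslated: fall back to the English key


print(translate("Maybe", "button"))
print(translate("Maybe", "menu"))
```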

Incremental localization

I don't know enough about the gettext tool world to know how they handle incremental localization. In the Mac world there's an AppleGlot tool that lets you take an English version of your resource data, the localized counterpart of that same version, and a newer English version, and then generate a first cut of the new localized version (along with a report of what remains to be translated). In the Palm world most of this is handled by a translation memory (e.g. TRADOS) with macros to handle moving data between the Palm XRD (XML-based) resource format and the TM tool.
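The three-input merge described above can be sketched roughly as follows. This is not how AppleGlot is implemented; it is a simplified stand-in that assumes resources are plain dictionaries keyed by resource ID, and the function name carry_forward is made up.

```python
def carry_forward(old_english, old_localized, new_english):
    """Build a first-cut localization for new_english.

    All three arguments map resource IDs to strings. A translation is reused
    only when its English source text is unchanged between versions.
    """
    new_localized = {}
    needs_translation = []
    for res_id, text in new_english.items():
        unchanged = old_english.get(res_id) == text
        if unchanged and res_id in old_localized:
            new_localized[res_id] = old_localized[res_id]
        else:
            new_localized[res_id] = text  # placeholder: English source text
            needs_translation.append(res_id)
    return new_localized, needs_translation


old_en = {"ok": "OK", "quit": "Quit"}
old_de = {"ok": "OK", "quit": "Beenden"}
new_en = {"ok": "OK", "quit": "Quit Chandler", "sync": "Sync"}

merged, todo = carry_forward(old_en, old_de, new_en)
print(merged)
print("Still to translate:", todo)
```

Changed and new strings are left in English and listed in the report, so a translator only has to look at what actually moved.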

This is a more significant issue if you're planning on having volunteers from around the world providing you with translations. The typical problem is that Padme Wongdi from Bhutan sends you translations for version 2.0, but you're on version 3.5.

Enforcing good I18N hygiene

One approach is to have tools that, as part of the build process, scan the source looking for any quoted text that's not inside of a trace call.

Another method is to have a tool that "auto-localizes" the English resource data into Pig Latin. Then a tester can run the app, looking for "unlocalized" text.
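A Pig Latin "localizer" can be quite small. The sketch below is one possible version, with simplified Pig Latin rules; it only protects simple placeholders such as %s and {name}, and a real tool would need to know the resource format it is rewriting.

```python
import re

VOWELS = "aeiouAEIOU"


def pig_latin_word(word):
    """Move leading consonants to the end and add 'ay' ('way' for vowel starts)."""
    if word[0] in VOWELS:
        return word + "way"
    for i, ch in enumerate(word):
        if ch in VOWELS:
            return word[i:] + word[:i] + "ay"
    return word + "ay"  # no vowels at all


def pseudo_localize(text):
    # Rewrite whole runs of ASCII letters, skipping any run that directly
    # follows '%' or '{' so that simple placeholders like %s and {name} survive.
    return re.sub(
        r"(?<![%{])\b[A-Za-z]+\b", lambda m: pig_latin_word(m.group()), text
    )


print(pseudo_localize("Save all"))
print(pseudo_localize("Hello, {name}!"))
```

Any string that still shows up in plain English while running the pseudo-localized build was not loaded from the resource data.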

You could scan for usage of printf, which is another common cause of localization woes.
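For Python source, the build-time scan mentioned at the start of this section could be approximated with the standard ast module. This is a sketch under assumptions: the set of "allowed" call names (_, debug, trace) is invented, and docstrings would also be reported unless filtered out separately.

```python
import ast

# Assumption: strings passed to these calls are fine (already tagged, or trace output).
ALLOWED_CALLS = {"_", "debug", "trace"}


def call_name(node):
    """Return the bare function or method name of a Call node, if it has one."""
    func = node.func
    if isinstance(func, ast.Name):
        return func.id
    if isinstance(func, ast.Attribute):
        return func.attr
    return None


def find_untagged_strings(source):
    """Return (line, text) for string literals outside the allowed calls."""
    found = []

    def visit(node, allowed):
        if isinstance(node, ast.Call) and call_name(node) in ALLOWED_CALLS:
            allowed = True
        if isinstance(node, ast.Constant) and isinstance(node.value, str) and not allowed:
            found.append((node.lineno, node.value))
        for child in ast.iter_child_nodes(node):
            visit(child, allowed)

    visit(ast.parse(source), False)
    return found


SAMPLE = '''
title = _("Inbox")
log.debug("loaded inbox")
label = "Unread"
'''
print(find_untagged_strings(SAMPLE))
```

Only the bare "Unread" literal is reported here; the tagged string and the debug message are skipped.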

I18N support

Much of what you need at the GUI layer should already be implemented inside of wxWidgets - e.g. support for non-Latin text drawing, entry, wrapping, etc.

If you don't want to be tied to wxWidgets for non-GUI support, then my suggestion is to use ICU. It's a good, cross-platform library that provides much of the same support found in Java. Unfortunately nobody has created a Python extension module for ICU. Nick Bastin is working on the Locale portion, and plans to do his first release (under LGPL or BSD) around Thanksgiving, but there would be a lot more to do to provide access to the entire library. Nick also said that automated tools like SWIG, SIP and Boost create ugly Python APIs, thus he's rolling his own by hand.

Message construction

One area where ICU could help is when messages need to be constructed from multiple strings/data. They've got a pretty good template engine, which can handle things like plural form variants, auto-date formatting, parameter re-ordering and such.

Using sprintf() is right out. Even the enhancements done to support parameter re-ordering don't solve all of the problems. And if the format string has to be loaded from a resource anyway, it doesn't buy you much over using another call (like something from ICU) that's more powerful while being safer for localizers.
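To show the kind of thing a template engine buys you over sprintf, here is a small stand-in written with plain Python string formatting. It is not ICU and does not use any ICU API; the catalog layout, message text and plural rule are assumptions for illustration.

```python
# Hypothetical catalog entries: each locale supplies complete sentence templates,
# so the translator controls word order and plural wording.
MESSAGES = {
    "en": {
        "one": "{user} shared {count} event on {date}.",
        "other": "{user} shared {count} events on {date}.",
    },
    "de": {
        "one": "Am {date} hat {user} {count} Termin freigegeben.",
        "other": "Am {date} hat {user} {count} Termine freigegeben.",
    },
}


def plural_category(count):
    # Simplification: English and German both use one/other;
    # many languages need more categories than this.
    return "one" if count == 1 else "other"


def build_message(locale, user, count, date):
    template = MESSAGES[locale][plural_category(count)]
    # Named parameters let each template place the values in any order.
    return template.format(user=user, count=count, date=date)


print(build_message("en", "Katie", 1, "Nov 5"))
print(build_message("de", "Katie", 3, "5. Nov."))
```

Note that the date is passed in already formatted; locale-aware date formatting is one of the pieces a library like ICU would handle.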

Other Internationalization issues

Other issues in internationalization include input methods in wxWidgets, searching, sorting, dates, numbers, time zones, documentation, and spell-checking.

External Internationalization Resources

Web pages:

  • Open source i18n projects:
    • Common Locale Data Repository
    • International Components for Unicode
    • Globalization step-by-step
    • You are not world-ready if...

Mailing list messages:

  • JeremyMcGee discusses a few i18n issues
  • Multilingual from the base up, by MattiPicus, points out that cities (and countries) can have different names in different languages and different writing systems.


One issue that being extensible brings up is centralized resources vs. distributed resources.
  • If the resources are all centralized, then how does a parcel author drop in a localizable package? How do you avoid namespace collisions? Would OSAF need to moderate which strings were allowed to be in the centralized resource?
  • If the resources are all distributed, how will they all be agglomerated at run-time? How do you avoid namespace collisions?

  • DuckySherwood - 15 Nov 2004