Internationalizing the Web for Africa

Front matter

This talk was presented at the Pan African Workshop on Localization, in Casablanca, June 2005.

Objectives

After this presentation you should have a better understanding of:

How to use this material

This material is organized around a set of presentation slides which can be viewed in several ways. Each view is identified by an icon as described below.

All in one: A single page containing all explanatory text followed by small accompanying slides.

Slide by slide: One page per slide. This is particularly useful if you need to see the detail on a slide.

Slide text: This page-by-page version of the slides is provided mainly for those who want to cut and paste the text on the slides. (You will need appropriate fonts and rendering software to see the text correctly.)

Overview: The overview provides a list of headings to help you navigate around the presentation quickly.

Please send any comments to ishida@w3.org.

W3C Overview

What?

Who?

Real, nice, people, not a faceless institution!

Note that, in order to help organizations in developing countries in Africa and elsewhere to join the W3C, the Consortium has recently announced a new fee structure. See the W3C web site for details.

How?

Internationalization (i18n) Activity

New I18n Activity structure

Outreach material

A long list exists, but we are interested in knowing what additional topics people would like us to address. See the list.

Also a number of tutorials (material built around a slide format).

We are in the process of making several improvements to the Internationalization subsite.

We need your feedback.

Content vs. presentation

Separate content and presentation

The X/HTML should contain no presentational information - this should be in a CSS stylesheet.

The default browser styling should be able to display the content in a readable fashion, even without the CSS.

The CSS can be used to apply all kinds of different styling to the same content.

Use semantic rather than presentational markup

Using <i> tags is putting presentational information in the markup, and should be avoided.

It's better to use <em> tags - expressing that this is emphasis.

Do not use <em class="italic">.

If you need to qualify the type of emphasis, use something like <em class="important">, etc. In other words, keep the markup to expressing the meaning, not the presentation.

Do the same when applying styles that relate to document conventions. For example, mark up references to document titles as <span class="doctitle"> rather than in terms of the styling.

Semantic markup is good for localization

Appropriate styling approaches can vary for different scripts. All the following are reasons that it is good to avoid presentational markup when text is to be localized.

As another example, quotation marks vary on a language by language basis.

Declaring the shape of the quotes in CSS, rather than hard-coding in the text, can make localization faster and less error-prone. A single change in the CSS can be applied to all text in your document.

Note that for this to work, the appropriate selectors need to be supported on all user agents.

Internationalizing characters

Universal accessibility has always been a key objective of the World Wide Web Consortium. The slide shows the phrase 'Making the World Wide Web world wide' in 15 different scripts. Only ten years ago having so many scripts on the same page would have been very difficult.

What, and how many?

In the early days there was ASCII, which allowed for a maximum of 128 character assignments and was designed around the needs of English.

Adding an eighth bit to a byte yielded a total of 256 character assignments per code page.

With multiple code pages you could support other regions, such as Western European languages, Greek, Russian, etc. But handling multiple code pages causes problems for multilingual text and for expansion into new regions.
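
The code page problem can be sketched in a few lines of Python. This is only an illustration: the three code pages are chosen as examples, and the function name decode_in_code_pages is hypothetical. It shows the same byte value turning into three different characters depending on which code page the reader assumes.

```python
import unicodedata

# One byte value, read under three legacy 8-bit code pages.
RAW = b"\xe9"
CODE_PAGES = ["latin-1", "iso8859_7", "iso8859_5"]  # Western European, Greek, Cyrillic


def decode_in_code_pages(raw, code_pages):
    """Return {code page name: decoded text} for the same raw bytes."""
    return {name: raw.decode(name) for name in code_pages}


decoded = decode_in_code_pages(RAW, CODE_PAGES)
for name, text in decoded.items():
    # Print the code page, the character, and its Unicode name.
    print(name, repr(text), unicodedata.name(text))
```

Because the byte alone does not say which code page applies, text that mixes these languages cannot be stored safely in a single 8-bit stream.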

Meanwhile, in the Far East computers had to deal with 'alphabets' of thousands of characters. The solution was to use 'double-byte' character sets - ie. two bytes per character. These character sets still restricted multilingual text and expansion into other areas, just as code pages did.

Originating in the late 1980s, Unicode provides an architecture that allows for the support of pretty much all of the world's languages and scripts currently in use.

It provides for around a million possible code points. This removes the need for code page switching and makes it easy to extend your product to support characters required for new areas. Its use is now very widespread.
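
As a quick check on the 'around a million' figure, the following Python sketch computes the size of the Unicode code space from its 17 planes of 65,536 code points each. The variable names are illustrative only.

```python
import sys

PLANES = 17
CODE_POINTS_PER_PLANE = 0x10000  # 65,536 code points in each plane

total = PLANES * CODE_POINTS_PER_PLANE
print(total)                 # size of the code space
print(hex(sys.maxunicode))   # highest code point Python will accept
```

The highest code point is therefore U+10FFFF, which is also the value Python reports as sys.maxunicode.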

Unicode also separates content and presentation

The visual representation of a character is called a 'glyph'. The letter 'a' may have different glyph shapes in different circumstances (eg. standard vs. italic text), but it is still the same character. The different glyph shapes are described by the font that is used to display the character.

In a script such as Arabic, where letters are shaped in different ways according to the joining context, Unicode uses the same character for each of the different shapes displayed, and relies on the font and rendering software to produce the appropriate glyph shapes. This significantly simplifies input and many computer-based operations on characters.
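
One way to see this from code is a small Python sketch, offered as an illustration rather than as how rendering software works. Unicode does contain a block of legacy 'presentation form' code points for the isolated, final, initial and medial shapes of Arabic letters, kept for compatibility with older encodings. Compatibility normalization folds all four shapes of the letter beh back to the single character that ordinary text should use. The function name to_abstract_character is hypothetical.

```python
import unicodedata

BEH = "\u0628"  # ARABIC LETTER BEH, the character normally stored in text

# Legacy presentation forms of beh: isolated, final, initial, medial.
PRESENTATION_FORMS = ["\ufe8f", "\ufe90", "\ufe91", "\ufe92"]


def to_abstract_character(form):
    """Fold a legacy shaped form back to the underlying character (NFKC)."""
    return unicodedata.normalize("NFKC", form)


for form in PRESENTATION_FORMS:
    print(unicodedata.name(form), "->", unicodedata.name(to_abstract_character(form)))
```

In normal use only the single character is stored, and the font and rendering software choose the shape.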

On the wall outside the place we are meeting is an inscription that uses a Maghrebi form of the Arabic script. For example, the letter qaf is represented with one dot below, rather than two dots above. In Unicode this is purely a presentational difference. The same character is used but the font provides a different glyph.

Working with characters

Character issues

In the following slides we will show just a couple of examples of considerations that have to be taken into account to build international Web technologies.

When the characters in Unicode are mapped to numbers for use in the computer (ie. a character encoding), one character may be represented by one to four bytes.

This means that as you step through text, character by character, or point to a specific location in text, you must know where one character starts and another ends. This is one of the things we have had to ensure is taken into account in specifications of W3C technologies to ensure international support.
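
A minimal Python sketch of this point, assuming the UTF-8 encoding (the sample characters and the function names utf8_length and char_boundaries are chosen purely for illustration):

```python
# Four characters that need 1, 2, 3 and 4 bytes respectively in UTF-8:
# Latin a, e with acute, Ethiopic wordspace, musical symbol G clef.
TEXT = "a\u00e9\u1361\U0001D11E"


def utf8_length(ch):
    """Number of bytes one character occupies in UTF-8."""
    return len(ch.encode("utf-8"))


def char_boundaries(data):
    """Byte offsets at which a character starts in UTF-8 data."""
    # Continuation bytes always look like 10xxxxxx; any other byte starts a character.
    return [i for i, b in enumerate(data) if (b & 0xC0) != 0x80]


data = TEXT.encode("utf-8")
lengths = [utf8_length(ch) for ch in TEXT]
boundaries = char_boundaries(data)
print(lengths)
print(boundaries)
```

Code that indexes by byte, rather than by character boundary, risks pointing into the middle of a character.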

Another aspect that has to be considered at a low, architectural level by specifications and implementations of Web technologies is the need to normalise equivalent text where graphemes can be expressed using more than one combination of characters.
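
As a small illustration in Python, the letter e with an acute accent can be stored either as one precomposed character or as a base letter followed by a combining accent. The two look identical but compare as different until they are normalized. The helper name same_text is hypothetical.

```python
import unicodedata

precomposed = "\u00e9"   # e with acute as a single character
decomposed = "e\u0301"   # plain e followed by a combining acute accent


def same_text(a, b):
    """Compare two strings after normalizing both to the composed form (NFC)."""
    return unicodedata.normalize("NFC", a) == unicodedata.normalize("NFC", b)


print(precomposed == decomposed)           # raw comparison
print(same_text(precomposed, decomposed))  # comparison after normalization
```

Without an agreed normalization step, operations such as searching, sorting and matching identifiers can give different answers for text that users see as the same.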

Declaring character encodings

Since Unicode is not the only character set that can be used on the Web, it is also important to declare the character encoding of any document or text on the Web. For advice on how to do this for XHTML, see the GEO tutorial.
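
To show why the declaration matters, here is a short Python sketch (the sample word is arbitrary). The same bytes are decoded once with the correct encoding and once with a wrong guess, which is what a browser may do when no encoding is declared.

```python
text = "caf\u00e9"                 # the word 'café'
utf8_bytes = text.encode("utf-8")  # what the server actually sends

right = utf8_bytes.decode("utf-8")    # receiver knows the encoding
wrong = utf8_bytes.decode("latin-1")  # receiver guesses a Western European code page

print(right)
print(wrong)
```

The wrong guess does not raise an error. It silently produces garbled text, which is why the encoding has to be stated rather than inferred.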

Internationalized Resource Identifiers (IRIs)

Until recently Web addresses had to use English letters. Standards recently published by the IETF, with contributions from the W3C, have changed that. Using native script for Web addresses makes them easier to create, memorize, transcribe, interpret, guess, find and relate to (branding).

The domain name and the path are handled separately for native script Web addresses. The client converts each to a form that can pass through the protocols used to retrieve resources.

There is currently some concern about 'phishing' that has slowed the implementation of this technology. The fear is that a domain name such as www.pаypаl.com could, say, use a Russian 'а' and therefore point to a different place than the user suspects. This is not a new problem - previous phishing attempts have replaced the 'l' with a digit '1', for example. Discussion is taking place in a number of organizations about how to make this type of deception much more difficult.
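
The two conversions, and a very rough check for mixed-script labels, can be sketched in Python as follows. This is an illustration under stated assumptions, not a description of what any browser does: the function names are hypothetical, the domain conversion uses Python's built-in idna codec, the path conversion percent-encodes UTF-8 bytes, and the script check simply looks at the first word of each character's Unicode name.

```python
import unicodedata
from urllib.parse import quote


def domain_to_ascii(domain):
    """Convert a native-script domain name to its ASCII (Punycode) form."""
    return domain.encode("idna").decode("ascii")


def path_to_ascii(path):
    """Percent-encode the UTF-8 bytes of a native-script path."""
    return quote(path, safe="/")


def scripts_used(label):
    """Rough guess at the scripts in a label, taken from Unicode character names."""
    return {unicodedata.name(ch).split()[0] for ch in label if ch.isalpha()}


print(domain_to_ascii("b\u00fccher.example"))
print(path_to_ascii("/caf\u00e9"))
print(scripts_used("paypal"))
print(scripts_used("p\u0430yp\u0430l"))  # the two 'a' letters here are Cyrillic
```

A label that mixes Latin and Cyrillic letters is not necessarily malicious, so a check like this could only ever be one signal among several.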

Internationalizing markup

Language declarations

In this section we will mention just a couple of topics that the W3C is or has been working on with regards to internationalization of markup.

It is important to declare the language of text in documents on the Web, and it will become more important as technology develops. There are two types of language declaration:

The primary language metadata of a document describes the intended audience of a document in terms of the language or languages they speak.

There are a number of places where you can declare language for an XHTML document. For more information about this see the GEO tutorial.

Markup supporting international text requirements

Sometimes specific markup is required to support behaviour in a particular script. For example, bidirectional text in Arabic cannot be achieved solely by reliance on the bidirectional algorithm specified by the Unicode standard. Additional markup is needed.

The International Tag Set (ITS) Working Group at the W3C is currently looking at the requirements for markup needed to support international document formats, with a view to producing a set of tags that people can include in new schemas.

Markup to support localization effectiveness

The ITS Working Group is also looking at requirements for tags that would improve the efficiency of localization. For example, some way of indicating whether or not specific ranges of text should be translated can improve the efficiency of translation.

Guidelines for schema developers

Some requirements for better internationalized schemas cannot be addressed by the provision of elements and attributes. In some cases it is merely a question of best practice. For example, translatable text in attribute values cannot be annotated for language, directionality, abbreviation, styling, etc. It would be much better for schema developers to simply avoid creating such attributes, and use embedded elements instead.

Internationalizing presentation

Line breaking and wrapping

Next we will look at some examples of the type of work the W3C Internationalization Activity does in addressing the needs of stylesheets.

One of the issues we are currently discussing is how to reflow text across line breaks. This is particularly problematic for East Asian and South-east Asian scripts, but we also need to be made aware of any special behaviours required by African scripts.

The CSS3 specification will not only attempt to deal with reflowed text appropriately, but will allow for script specific preferences in how text wrapping occurs. For example, you will be able to specify whether embedded Latin text in Chinese wraps on a word by word basis, or character by character.

You will also be able to specify whether you prefer small kana characters in Japanese to begin a new line or not.

Where sentence final punctuation would otherwise appear at the beginning of a new line, you will be able to specify whether it pulls down the last character on the previous line or sticks out of the margin.

Text alignment and justification

CSS3 will also allow for different approaches to text justification, supporting the needs of various scripts.

Amongst other preferences, you will be able to specify that Arabic justification be done using kashidas to lengthen the word, and you will have some control over how that is applied.

Text direction and layout

Another area where work is still ongoing is that of text direction - particularly where horizontal and vertical text are mixed. For example, how should it be possible to embed Latin or Arabic text in vertical Chinese, Japanese, Korean or Mongolian?

Text spacing

Other CSS3 properties allow you to trim large spaces associated with punctuation during justification or from the beginning of lines.

You will also be able to introduce small amounts of space between ideographic text and embedded Latin or numeric text through styling rather than by adding space characters (since this is presentational in nature).

You will also be able to implement typical Japanese typographic conventions such as warichu (reduced size double-layered text) and kumimoji (up to 5 characters in a single glyph space).

You will be able to implement Japanese conventions for emphasis such as dots or accents associated with characters.

And, in conjunction with the work on markup already in place for XHTML 1.1, you will be able to apply styling preferences to text annotations ('ruby').

Making it happen

Much of the specified behaviour in CSS3 derives from input from the Far East. The W3C always wants to hear about requirements from other parts of the world, such as Africa. For example, is there specific behaviour associated with the wordspace character (፡) in Ethiopic that affects justification or line-breaking? If so, we would like to hear from experts in this area.

List style types provide an object lesson in this area. In the CSS 2.0 specification eight non-Latin numbering methods were specified. This was reduced to two in CSS 2.1, since there had not been two or more implementations of the others.

The CSS3 specification includes a very large number of possibilities for numbering lists using non-Latin scripts, including many that are specifically African scripts. If these are to make it to the final version of the specification, however, it would be helpful for potential users to express their interest in seeing this happen. There are regular questions posted on mailing lists about whether these are really worth the effort.

The W3C needs the assistance of users and experts in typography from African countries if we are to ensure that your needs are met in this very important and very internationalization-sensitive area of the Web. Not only do we need help to specify the requirements and the specifications, but user agent implementors need to be encouraged to implement the international aspects of the specification. To make this happen, consider doing the following:

Internationalizing content

Mechanical problems

Over and above specifying international features of W3C specifications, there are things that content authors must bear in mind to ensure that content is localizable. In this section we will provide just two examples of this. The first discusses issues that arise due to the differing mechanics of languages.

The syntax and content of translations of a single phrase can be widely different in different languages.

One result of this can be that a phrase containing two variables (ie. data supplied at run-time) may need to swap their positions. If you have produced the original text using the wrong type of scripting you may prevent this happening and thereby make it impossible to achieve a good translation.
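
A minimal Python sketch of one approach that lets translators reorder variables: each language gets a whole-sentence template with named placeholders. The message, the template table and the function name render are all hypothetical examples, not taken from the slides.

```python
# Whole-sentence templates with named placeholders, one per language.
# The Japanese template puts the total before the page number.
TEMPLATES = {
    "en": "Page {page} of {total}",
    "ja": "{total}ページ中{page}ページ",
}


def render(lang, page, total):
    """Fill in the template for the given language."""
    return TEMPLATES[lang].format(page=page, total=total)


print(render("en", page=3, total=10))
print(render("ja", page=3, total=10))
```

By contrast, code that builds the message by concatenation, such as "Page " + str(page) + " of " + str(total), fixes the order of the variables in the program itself, where a translator cannot change it.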

Sometimes developers provide a single string for a basic sentence pattern and try to swap in alternative words to make a number of similar messages. They do this to save memory. Unfortunately, this tends to work in translation only if the syntax and agreement in the original language and the target are identical - not a likely circumstance. This can produce insurmountable problems for localization.

A similar problem arises when a designer uses a single string for a concept such as 'on', intending to copy it to the various parts of the user interface where it is needed. In Spanish, however, 'on' could be translated 'activado', 'encendido' or 'conectado', depending on the subject. Given a single string, it is not possible to provide a sensible translation in all contexts.
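
One way around this, sketched in Python, is to give each use of the word its own message key so that the translator can choose a different word per context. The keys and the pairing of each Spanish word with a particular context are assumptions made for illustration. Only the three Spanish words themselves come from the example above.

```python
# One message key per context, even though the English text is identical.
MESSAGES = {
    "en": {"filter.on": "on", "engine.on": "on", "cable.on": "on"},
    "es": {
        "filter.on": "activado",   # assumed pairing, for illustration only
        "engine.on": "encendido",  # assumed pairing, for illustration only
        "cable.on": "conectado",   # assumed pairing, for illustration only
    },
}


def label(lang, key):
    """Look up the user interface label for one specific context."""
    return MESSAGES[lang][key]


for key in MESSAGES["en"]:
    print(key, label("en", key), label("es", key))
```

The English table looks redundant, but the redundancy is what makes a correct translation possible.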

Cultural issues

For our next set of examples we look at a very different set of problems. Kenneth Keniston said:

"... one Latin American teacher recently complained to me that the US-manufactured and well-translated educational software currently being used in his country's primary schools presupposed 'solitary problem solvers', whereas his culture stressed collective problem-solving."

In this case we are far from mechanical issues.

The next few slides show an example of a company that has taken this to heart. If you compare the subtopics on Yahoo's directory in various different localized sites you find that they have adapted the content to suit the audience, not just translated.

The page for the UK and Ireland lists under Arts & Humanities the following: Literature, History, Photography.

For the French site we find: Literature, Cinema, Music, Bandes Dessinées.

The Japanese site starts with photography, and includes museums and architecture.

And the Italian site lists: Literature, Erotic tales, and Fashion.

TBD (to be developed) ...

Topics that will need attention

Here we list a few examples of the topics that still need work.

Some parting thoughts

And finally, this is your Web - not the W3C's - so if something is not right, get involved to fix it.

Author: Richard Ishida, W3C.

Content created 14 June, 2005. Last update 2005-06-16 21:41 GMT