Weaving the multilingual Web - Standards and their implementation

Fifteenth International Unicode Conference, San Jose, CA, U.S.A., August/September 1999

Martin J. Dürst: Associate Professor (W3C Project), Keio University, Japan, <duerst@w3.org>
François Yergeau: Product Manager, Alis Technologies Inc., Canada, <yergeau@alis.com>

Overview

Introduction: the Web
Reference i18n Model
- Character Encoding
- Reference Processing Model
- Character Encoding Identification
Identifiers
Forms
Language Identification
Miscellaneous i18n features
Bidirectional Text (Bidi)
Current Status
Future Developments
Multilingual Typography
Advanced multilingual applications
Designing a multilingual site
Resources
Questions and Answers

Introduction: The Web

WWW: World Wide Web

Base: The Internet as a heterogeneous network of computer nodes

Nodes can be (at the same time!):

Servers: Storing and delivering documents, delivering services
Clients: Visualizing documents and allowing user interaction
Proxies: Caches, firewalls, transcoders,...

Hypertext

The documents of the Web are structured as a global hypertext

Hypertext = Text (structured) + Links (+ Style Sheets)

Inclusion of other documents in a document (e.g. inline images)
Active areas to jump to other documents

Linked Web documents can be distributed worldwide

HTML

HTML: HyperText Markup Language

HTML is the format of choice (lingua franca, glue) for (hyper)texts in the Web

HTML is defined as an application of SGML

HTML is simple and easy to learn and understand

HTML is evolving

Example of HTML

<!DOCTYPE HTML PUBLIC "-//W3C//DTD HTML 4.0//EN"
    "http://www.w3.org/TR/REC-html40/strict.dtd">
<html>
 <head>
  <title>My first HTML document</title>
 </head>
 <body>
  <p>Hello world!</p>
 </body>
</html>

XML

Is a rigorously defined subset of SGML (single SGML declaration)

Provides many of the advantages, but few of the headaches of SGML

Is good for: marking up parallel texts, database usage, stylesheet, hyperlinking

Current and upcoming applications of XML at W3C: MathML, SMIL, RDF, XHTML, XSL, SVG,...

Lots of applications elsewhere

Example of XML

<?xml version="1.0"?>
<page>
 <title>My first XML document</title>
 <body>
  <para>Hello world!</para>
 </body>
</page>

CSS & XSL

What are Style Sheets?

Define style (e.g. font, size, color, margins, line height,...) for document components
Separate style from content
Match to markup (e.g. <h1>)
Allow user overrides (larger characters,...)
Reduce bandwidth requirements

Example of CSS

body {
       color: #ffffff;
       background: url("texture.jpg") #000060;
       font-size: 30px;
       font-family: arial, helvetica, sans-serif; 
}
div.slidebody { height: 345px; }
code {
       font-weight: bold;
       font-family: "Courier New",courier,monospace;
}

HTTP

HTTP: HyperText Transfer Protocol

Web data transfer between server and client mostly uses the HTTP protocol

General pattern:

Client sends request to server
Server sends document back to client

Redirection via proxies for

Security, efficiency, load distribution,...
Data conversion, add-on services

The Idea of a World Wide Web

For WWW clients to receive understandable responses from a web server, no matter where they are in the world, and no matter what the language and encoding of the data being retrieved.

Global Interoperability and Communication!

Key Points

The WWW is a single application

All parts should work together

Locale independent representation

Data should be viewable by anyone, anywhere

Must rely on standards

What Standards Specify

Standards are important for

Minimal interoperability
Common ground between interacting parties
(formal) correctness

Web standards generally don't specify

Server-only or client-only issues
Implementation details
Error behavior

A Brief History Of WWW Internationalisation

1991: Inception of WWW

1992: Mosaic-L10N

1993: IETF begins standardization

1994: ML-WWW, Netscape

1995: Alis Tango, Internet with an Accent

1995/6: IETF, WInter

1997: Netscape 4.0, MS IE 4.0, RFC 2070, HTML 4.0

1998: XML, CSS2

1999: XHTML

Reference I18N Model

Character encoding

What can go wrong?

Reference processing model

Escaping and Numeric Character References

Character encoding identification

Character Encoding

Very basic and central

Has to answer various requirements

Often leads to confusion

Internationalization:

May need more complex model
May clarify existing model

What can go wrong?

$B$'$j$,$H$&$4$6$$$^$7$?!#O@J8$
r<h$j$K9T$-$^$7$?!#<+EY$NO@J8(J

Encoding not correctly identified
No clear model of character encoding

Character Encoding: Requirements

Internationalization:: Large sets of characters
SGML (HTML, XML):: Single document character set
HTTP:: Efficient transmission
Users:: Use of common local encodings

From HTTP to Text

Raw octets on the wire (e.g. TCP/IP): Transfer-Encoding (e.g. gzip, compress; general purpose)
Multibyte character encoding (e.g. raw text files): Character encoding/switching schemes (e.g. ISO 2022)
Characters as code positions in coded character sets: Code tables, standards
Characters as logical text components: UCS (Unicode/ISO 10646)

Decoding...

Picture of decoding process

Reference processing model

Logically, characters are UCS characters
- For HTML, UCS is declared as the SGML Document Character Set
- For XML, the grammar is based on characters (not bytes): "A character is an atomic unit of text as specified by ISO/IEC 10646"
- For CSS, essentially the same: "A CSS style sheet is a sequence of characters from the UCS..."
On-the-wire encoding can be anything compatible with UCS (i.e. any encoding of a subset of UCS)
Identify encoding, perform transcoding on input, then deal only with Unicode
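
The last point can be sketched in Python as follows. This is only an illustration of the model, assuming the charset label has already been identified; the function name `to_unicode` is hypothetical.

```python
def to_unicode(octets: bytes, charset: str) -> str:
    """Transcode on input: turn on-the-wire octets into Unicode text."""
    # bytes.decode maps the octets to Unicode characters using the named
    # encoding. It raises LookupError for an unknown charset label and
    # UnicodeDecodeError for octets that are invalid in that encoding.
    return octets.decode(charset)


# The same character arriving in two different on-the-wire encodings
# ends up as the same UCS character after transcoding.
from_latin1 = to_unicode(b"\xe9", "iso-8859-1")
from_utf8 = to_unicode(b"\xc3\xa9", "utf-8")
```

After this step the rest of the program deals only with Unicode strings, whatever the original encoding was.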

Ref. proc. model (cont.)

Ref. proc. model conformance

Implementation does not have to use Unicode, only to behave as if it did

HTML and CSS do not require that all characters be displayed; XML does not speak about display

Model is backwards compatible for old HTML browsers

XML requires that a processor accept UTF-8 and UTF-16 input, so it's difficult not to use Unicode internally

Other Recommendations, e.g. DOM (API), require Unicode

Escaping

Escaping used to represent:

Syntax-significant characters (e.g. &amp; for '&')
Characters that cannot be represented in some encoding

HTML/XML:: Numeric character references (NCRs): "&#233;" = "é"
Predefined entities (HTML only): "&eacute;" = "é"

CSS:: "\E9 " or "\0000E9" = "é"

Escaping (cont.)

With the reference processing model, escapes become unambiguous:

Independent of encoding, always refer to UCS
No need to parse HTML/XML/CSS when transcoding (e.g. at a proxy)
Escapes allow use of any Unicode character in any document (even when encoded in US-ASCII only)

NCRs are decimal (&#65; is A) or, in HTML 4.0 and XML, hexadecimal (&#x41; is A); now in SGML corrigendum

One character = one escape, not two for surrogate pairs
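
A small Python sketch of both directions, using only the standard library. The helper names `to_ascii_with_ncrs` and `expand_ncrs` are hypothetical, and the sketch does not escape syntax-significant characters such as & or <.

```python
import html


def to_ascii_with_ncrs(text: str) -> bytes:
    """Encode text as US-ASCII, writing other characters as decimal NCRs."""
    # The "xmlcharrefreplace" error handler emits &#NNN; for every
    # character that the target encoding cannot represent.
    return text.encode("ascii", errors="xmlcharrefreplace")


def expand_ncrs(markup: str) -> str:
    """Resolve decimal and hexadecimal NCRs and named HTML entities."""
    # html.unescape always interprets the number as a UCS code point,
    # independently of the encoding the document arrived in.
    return html.unescape(markup)
```

A character outside the Basic Multilingual Plane comes out as a single reference (U+10000 becomes `&#65536;`), not as two references for a surrogate pair.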

Bad NCRs

In HTML, an NCR in the range &#128; to &#159; is sometimes used (mostly based on CP 1252), but is illegal, because those code points are control characters, not printable characters, in Unicode

Replace with correct NCR (named entities also provided in HTML 4.0):

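wrong number	correct NCR	character	named entity
132	&#8222;	„	&bdquo;
133	&#8230;	…	&hellip;
134	&#8224;	†	&dagger;
135	&#8225;	‡	&Dagger;
139	&#8249;	‹	&lsaquo;
140	&#338;	Œ	&OElig;
145	&#8216;	‘	&lsquo;
146	&#8217;	’	&rsquo;
147	&#8220;	“	&ldquo;
148	&#8221;	”	&rdquo;
149	&#8226;	•	&bull;
151	&#8212;	—	&mdash;
153	&#8482;	™	&trade;
155	&#8250;	›	&rsaquo;
156	&#339;	œ	&oelig;
	&#8364;	€	&euro;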
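
The replacement can be automated. The following Python sketch assumes, as the slide suggests, that the wrong numbers were meant as CP 1252 byte values; the function name `fix_bad_ncrs` is hypothetical.

```python
import re

# Decimal NCRs in the range 128-159, which are not legal in HTML.
_BAD_NCR = re.compile(r"&#(12[89]|1[34][0-9]|15[0-9]);")


def fix_bad_ncrs(markup: str) -> str:
    """Rewrite NCRs 128-159 to the NCR of the intended CP 1252 character."""

    def replace(match: "re.Match[str]") -> str:
        number = int(match.group(1))
        try:
            # Interpret the number as a CP 1252 byte value.
            char = bytes([number]).decode("cp1252")
        except UnicodeDecodeError:
            # A few positions (e.g. 129) are unassigned in CP 1252;
            # leave those references untouched.
            return match.group(0)
        return "&#%d;" % ord(char)

    return _BAD_NCR.sub(replace, markup)
```

For example, `fix_bad_ncrs("&#151;")` returns `"&#8212;"`, and legal references such as `&#233;` are left alone.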

Practical Limits of the Reference Processing Model

HTML:: Cannot use NCRs in <script> and <style> contents

XML:: Cannot use NCRs in element and attribute names
Cannot use NCRs in CDATA marked sections, comments, PIs

CSS:: Some tricky interactions between character escapes and syntax

Transcoding always works from smaller repertoire to larger repertoire, but not the other way round

Caveats

Character identity depends on UCS codepoint, which depends only on character encoding and encoded value, not font

Do not use <FONT FACE="..."> to cheat on characters (e.g. Symbol font)

Text might still show sometimes
Font fallbacks won't work
Search engines won't work
Style sheets won't work
Databases won't work

Caveats

Do not use character entities (e.g. &aacute;) or NCRs if you can just type the character as is (e.g. á):

One wouldn't use &latinA; for A (layering issue)
Typing is easier
Works better with some search engines
Independent of DTD and other built-in knowledge (XHTML/XML)

Character encoding identification

MIME 'charset' parameter

IANA charset registry

'charset' negotiation

Fallbacks

Priorities

Failure cases

The MIME 'charset' Parameter

MIME: Multipurpose Internet Mail Extensions

Designed for Email, used in HTTP

MIME headers indicate resource types (text/image/audio/...),...

'charset' parameter

Indicates character encoding (and not merely character set!)
Defines mapping from octets to characters (includes code tables and switching schemes)
Used by HTTP (and HTML, XML, CSS) for character encoding identification

IANA 'charset' Registry

MIME charsets are registered with IANA (Internet Assigned Numbers Authority [RFC1790])

IANA registry is polluted by useless charsets and aliases

Unregistered charsets are prefixed with x- (e.g. x-iscii-devan)

Generate most important charsets:

ISO-8859-1, ISO-8859-2,...
ISO-2022-JP, EUC-JP, EUC-KR, Shift_JIS,...
UTF-8, UTF-16,...

These identifiers are case-insensitive

Use only 'MIME preferred' values, maybe accept others

Character encoding negotiation

Client sends Accept-Charset HTTP header (most HTTP/1.1 browsers do)

Accept-Charset: UTF-8,ISO-8859-1;q=0.9,*;q=0.1

Server knows encoding for each document and sends 'charset' parameter in HTTP header

Content-Type: text/html; charset="UTF-8"

Problem: this is difficult to configure on most servers
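
On the server side, the negotiation could look roughly like the Python sketch below. It is a simplification under stated assumptions: the function names are hypothetical, malformed headers are not handled, and the HTTP/1.1 rule that gives ISO-8859-1 a default quality of 1 when it is not mentioned is ignored.

```python
def parse_accept_charset(header: str) -> dict:
    """Parse an Accept-Charset value into {charset: quality}."""
    prefs = {}
    for item in header.split(","):
        name, _, params = item.strip().partition(";")
        name = name.strip().lower()  # charset names are case-insensitive
        if not name:
            continue
        params = params.strip().lower()
        # A missing q parameter means quality 1.0.
        prefs[name] = float(params[2:]) if params.startswith("q=") else 1.0
    return prefs


def pick_charset(header: str, available: list):
    """Return the available charset the client prefers most, or None."""
    prefs = parse_accept_charset(header)

    def quality(charset: str) -> float:
        # "*" stands for every charset not listed explicitly.
        return prefs.get(charset.lower(), prefs.get("*", 0.0))

    if not available:
        return None
    best = max(available, key=quality)
    return best if quality(best) > 0 else None
```

With the Accept-Charset header shown above, a server that holds a document in ISO-8859-1 and EUC-JP would send the ISO-8859-1 version.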

Indicating character encoding (fallbacks)

Difficult to tell the server about the character encoding of a document

Configuration not easy, perhaps only by webmaster (ISP)
Local file system, no server

'Self-identifying' document:

HTML:: <meta http-equiv="Content-Type" content="text/html; charset=UTF-8">
Goes inside HTML <head>
XML:: <?xml version="1.0" encoding="UTF-16BE"?>
At the very start; UTF-8 by default
CSS:: @charset "UTF-8";

Additional mechanism in HTML:

'charset' attribute on links (<a> and <link>):: <a href=... charset="UTF-8">...</a>
Useful in bookmark list

Client priorities for charset

(highest priority first)

1. [Per-document user override]

2. HTTP header (or other protocol information)

3. Self-identification (<meta> for HTML, encoding for XML, @charset for CSS)

4. 'charset' parameter on links

5. [User preferences/heuristics]
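
This priority list translates almost directly into code. A minimal Python sketch, with hypothetical parameter names and an assumed ISO-8859-1 last-resort default standing in for user preferences or heuristics:

```python
def choose_charset(user_override=None, http_charset=None,
                   self_identification=None, link_charset=None,
                   user_default="iso-8859-1"):
    """Return the charset a client should use, highest priority first."""
    candidates = (
        user_override,        # 1. per-document user override
        http_charset,         # 2. HTTP header or other protocol information
        self_identification,  # 3. <meta>, XML encoding declaration, @charset
        link_charset,         # 4. 'charset' attribute on the link followed
    )
    for candidate in candidates:
        if candidate:
            return candidate
    # 5. user preferences or heuristics; here just a fixed default
    return user_default
```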

Failure cases

'charset' on links:: Document encoding on server is changed

Self-identifying documents:: Need to know charset to parse <meta>, but charset is in <meta>! (Works only if <meta> is ASCII and nothing before it confuses the parser.)
Put <meta> as early as possible

Transcoders don't change self-identification
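
The <meta> chicken-and-egg problem is usually handled by scanning the first octets as if they were ASCII. A rough Python sketch; the function name, the regular expression, and the 1024-octet limit are assumptions for illustration, not taken from any specification:

```python
import re

# Look for charset=... inside a <meta ...> tag, treating the octets as ASCII.
_META_CHARSET = re.compile(
    rb"<meta[^>]*charset\s*=\s*[\"']?([A-Za-z0-9._:-]+)", re.IGNORECASE
)


def sniff_meta_charset(octets: bytes, limit: int = 1024):
    """Return the charset named in an early <meta>, or None."""
    # Only the start of the document is scanned, which is why the
    # <meta> should be placed as early as possible.
    match = _META_CHARSET.search(octets[:limit])
    return match.group(1).decode("ascii").lower() if match else None
```

The sketch also shows the failure case: a document encoded in UTF-16 contains no contiguous ASCII `<meta`, so nothing is found.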

Identifiers

Identifiers in HTML, CSS, XML

URI Internationalization

Identifiers in HTML, CSS, XML

Identifiers are element and attribute names, CSS selectors and properties, etc.

In HTML, all identifiers are restricted to a subset of ASCII (-_:.A-Za-z0-9). Case insensitivity is the rule, but there are exceptions

In CSS, identifiers are almost unrestricted (-A-Za-z0-9 + any char > 160); in fact, all identifiers are ASCII. Almost everything is case-insensitive.

In XML, identifiers may be formed from a very large subset of Unicode; non-ASCII identifiers are widely used. Everything is case-sensitive.

In XHTML, the identifiers are those of HTML in lower-case and everything is case-sensitive (from XML).

URI Internationalization

Currently: URIs encode bytes, not characters

Most ASCII bytes expressed as ASCII chars, other byte values as %HH

No standard way to use non-ASCII characters (no defined character encoding)

Converging to use UTF-8 (implemented in IE 5.0)
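
The difference between escaping bytes and escaping characters can be seen with Python's `urllib.parse`. The wrapper name `escape_uri_component` is hypothetical; `quote` and `unquote` are the standard library functions.

```python
from urllib.parse import quote, unquote


def escape_uri_component(text: str, encoding: str = "utf-8") -> str:
    """Encode text to bytes, then write non-ASCII bytes as %HH."""
    # safe="" means that reserved characters such as "/" are escaped too.
    return quote(text, safe="", encoding=encoding)


utf8_form = escape_uri_component("é")                  # two bytes in UTF-8
latin1_form = escape_uri_component("é", "iso-8859-1")  # one byte in Latin-1
```

The two results differ ("%C3%A9" versus "%E9"), and a receiver cannot tell from the URI alone which encoding was used. Agreeing on UTF-8 removes that ambiguity.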

Up to date references:

Internationalization of URIs and other identifiers http://www.w3.org/International/O-URL-and-ident.html

Forms

The i18n problem

The "Send back as received" convention

The Hidden Field solution

The Accept-Charset attribute

Internationalized URIs

Forms and Query Parts

Data entered in forms is sent back as URIs or as a URL-encoded body

Problem: Reliable character encoding identification

Various provisions, none of them fully established:

Use encoding identical to document received
Use hidden field to track transcodings
Use body part or file upload (multipart/form-data, RFC 1867)
Use accept-charset attribute on <form>
Internationalize URIs with UTF-8

Forms: Send back as received

Basic idea: If document is received in iso-2022-jp, send back iso-2022-jp

'charset' of document received has to be correctly identified (assure reader can check)

Fails with transcoding proxies

Fails with multiple forms handled by single CGI

Character repertoire may be limited

Not working on all older browsers

Forms: Hidden field

Form fields can be hidden (like cookies)

Easy way to identify the encoding sent (because text is known and chosen carefully)

Allows tracking of transcodings

Needs analysis in CGI script
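
The analysis in the CGI script might look like the following Python sketch. The probe text, the candidate list, and the function name are all assumptions chosen for illustration; a real form would use whatever known text its author put into the hidden field.

```python
from urllib.parse import unquote_to_bytes

# Known text placed in the hidden field by the form author (assumed).
PROBE_TEXT = "日本語"
# Encodings the script is prepared to recognize (assumed).
CANDIDATES = ("utf-8", "shift_jis", "euc-jp", "iso-2022-jp")


def detect_form_encoding(raw_probe: str, candidates=CANDIDATES):
    """Guess the submission encoding from the raw hidden-field value."""
    # Undo the %HH escaping but keep the result as raw bytes.
    received = unquote_to_bytes(raw_probe)
    for encoding in candidates:
        try:
            expected = PROBE_TEXT.encode(encoding)
        except UnicodeEncodeError:
            continue  # this encoding cannot represent the probe text
        if expected == received:
            return encoding
    return None
```

Once the encoding is known, the other fields of the same submission can be decoded with it.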

Forms: accept-charset

accept-charset attribute on <form>

Allows wider character repertoire

Comma-separated list of charsets

Example:

  <form action="..."
   accept-charset="iso-8859-1,utf-8">

Note: was on <input> and <textarea> in RFC 2070, is on <form> in HTML 4.0

Forms: Internationalized URIs

Use only UTF-8 for query parts: no charset identification problems anymore

Backwards compatibility problems:

Query name space is dense, heuristics can be dangerous

Author-triggered with accept-charset on <form>

Needs work in CGI script

Language Identification

Why language information?

Language tag syntax

Nesting

Stylesheets

Limitations

Why language information?

Language tagging helps to:

control classification, searching and sorting
control hyphenation, quotation marks, spacing, ligatures
allow reasonable voice synthesis
resolve glyph ambiguities ("undo" Han unification,...)

Language is largely orthogonal to character encoding

Language Tags

HTML: All elements (exceptions: param, base, script) can carry the lang attribute:

<span lang="it">Grazie</span>

<span> is a generic container, phrase-level, with meaning only from its attributes (lang, dir, id and class)

XML: xml:lang attribute, can go on any element but must be declared in DTD for strict validation.

XHTML (HTML expressed in XML syntax): use both lang and xml:lang

Language tags syntax

Value is language tag per RFC 1766:

'fr'          ISO 639
'en-US'       ISO 639 + 3166
'no-nynorsk'  additional qualifier
'i-navajo'    IANA registry
'x-anything'  "experimental"

Case is irrelevant

Do not use period or underscore as separator

RFC 1766 will be updated with ISO 639-2
(see draft-alvestrand-lang-tags-v2-00.txt)

Language tag nesting

Inner tag overrides outer, which overrides HTTP header

<p lang="en">He said <q lang="it">Grazie</q> and left.</p>

Tags are hierarchical:: 'en-US' matches 'en-US' (preferred); 'en-US' also matches 'en'

[Language info sometimes may be derived heuristically from charset]
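
Hierarchical matching is a case-insensitive prefix test on whole subtags. A minimal Python sketch (the function name `lang_matches` is hypothetical):

```python
def lang_matches(tag: str, wanted: str) -> bool:
    """True if language tag `tag` falls under `wanted` (e.g. en-US under en)."""
    tag, wanted = tag.lower(), wanted.lower()  # case is irrelevant
    # Either the tags are equal, or `wanted` is followed by a hyphen,
    # so that "en" matches "en-US" but not "eng".
    return tag == wanted or tag.startswith(wanted + "-")
```

Matching is not symmetric: text tagged 'en-US' matches a request for 'en', but text tagged only 'en' does not match a request for 'en-US'.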

Language-dependent styling

Language-dependent styling allows fine-tuning of presentation

The language pseudo-class:: *:lang(xx) { <something> }
Matches elements in (human) language xx (exactly)

Attribute selector for hyphen-separated tags:: *[lang|=yy] { <something else> }
Matches any element whose lang attribute has a hyphen-separated list of tokens beginning with "yy" (hierarchical matching).

Limits of language information

Dialects, idioms, transliteration,...

Creoles, pidgins,...

Names (what language is "Martin"?)

Text not associated to any language (e.g. programming text, math,...)

Mixed uses (e.g. hyphenation according to one language, voice synthesis according to another)

Miscellaneous i18n features

Language-dependent quotes

List styles

Text transformations

Font specification

Font description

Language-dependent quotes

Quoting style depends on language:

       »Dansk ’da’ Danish«
     „Deutsch ‘de’ German”
     “English ‘en’ English”
  « Français « fr » French »
    «Italiano «it» Italian»
       «Norsk ’no’ Norwegian»
     «Русский „ru” Russian»

Language-dependent quotes (cont.)

HTML has the <q> element, which adds quotation marks before and after

The quoting style is controlled by CSS properties and values

[lang|=no] > *  { quotes: "«" "»" "\2019" "\2019" }
q:before { content: open-quote }
q:after  { content: close-quote }

CSS niceties

CSS has a list-style-type property with the following values

disc|circle|square|decimal|decimal-leading-zero|
lower-roman|upper-roman|lower-greek|lower-alpha|
lower-latin|upper-alpha|upper-latin|hebrew|
armenian|georgian|cjk-ideographic|hiragana|
katakana|hiragana-iroha|katakana-iroha

text-transform: uppercase, lowercase

language-dependent

CSS Font specification

body { font-family: Baskerville, "Heisei Mincho W3",Symbol,serif }

Fallback mechanism:

If not Baskerville, then Mincho, then...
serif is a <generic-family>, last resort

Font specification (cont.)

Language-dependent styling allows fine-tuning of font combinations

*:lang(ja) { font: 900 14pt/16pt "Heisei Mincho W9", serif }

*:lang(zh-tw) { font: 800 14pt/16pt "Li Sung", serif }

Font descriptors

'WebFonts' capabilities enable client-side font matching, font synthesis and progressive rendering, font download

font-family, font-style, font-weight,...

unicode-range descriptor:

unicode-range: U+??, U+AC00-D7FF

Meaning: this font covers (some of) Latin-1 and Hangul
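
How a client might read such a descriptor can be sketched in Python. The function name is hypothetical and the sketch handles only the two forms used here, wildcards with "?" and explicit start-end ranges.

```python
def parse_unicode_range(value: str) -> list:
    """Turn a unicode-range value into a list of (first, last) code points."""
    ranges = []
    for item in value.split(","):
        item = item.strip()
        if not item.upper().startswith("U+"):
            raise ValueError("not a unicode-range item: %r" % item)
        body = item[2:]
        if "-" in body:
            # Explicit range: U+AC00-D7FF
            first, last = body.split("-", 1)
            ranges.append((int(first, 16), int(last, 16)))
        else:
            # Each "?" stands for any hexadecimal digit: U+?? is 00-FF.
            ranges.append((int(body.replace("?", "0"), 16),
                           int(body.replace("?", "F"), 16)))
    return ranges


def covers(ranges: list, char: str) -> bool:
    """True if the character's code point lies in one of the ranges."""
    return any(first <= ord(char) <= last for first, last in ranges)
```

A client can then skip a font without downloading it when the text contains no character in the declared ranges.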

Bidirectional Text (Bidi)

What is it?

Unicode algorithm vs Markup

DIR attribute

<BDO> element

Useful entities

Bidi in CSS

Bidirectional Text (Bidi)

Bidirectional text: Mixture of right-to-left and left-to-right text for Arabic, Hebrew,...

Bidirectional text is stored in logical (reading) order

Needs reordering for display

Unicode algorithm components:

Implicit character properties (L, R, EN, AN, N...)
Explicit directional characters (embeddings, overrides)
Reordering algorithm
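
The implicit character properties can be inspected with Python's standard `unicodedata` module. The helper name `bidi_classes` is hypothetical, and current Unicode data uses a finer set of classes than the list above (for example AL for Arabic letters, WS and ON for neutrals).

```python
import unicodedata


def bidi_classes(text: str) -> list:
    """Return the bidirectional class of each character in text."""
    return [unicodedata.bidirectional(char) for char in text]


latin = bidi_classes("a")              # strong left-to-right
hebrew = bidi_classes("\u05D0")        # HEBREW LETTER ALEF, strong right-to-left
digit = bidi_classes("1")              # European number
arabic_digit = bidi_classes("\u0661")  # ARABIC-INDIC DIGIT ONE
```

The reordering algorithm works from these classes, plus any explicit embedding or override characters in the text.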

Bidi (cont.)

HTML is a "higher level protocol"

BIDI embeddings are usually in sync with document structure (paragraphs, citations, emphasis,...)

BIDI markup maps directly to the Unicode algorithm or to corresponding Unicode characters

DIR attribute

dir='ltr' or dir='rtl'

Default is ltr

On block elements (<div>, <p>, <li>, <td>, etc.), gives base direction

Is inherited from enclosing elements

Affects default value of align

Put one on <html> to establish base for whole document, including title

DIR attribute (cont.)

On in-line elements (<span>, <em>,<strong>, etc.), the dir attribute creates a new embedding level

Example data (UPPERCASE IS RTL, lowercase is ltr):: he said: <span dir=rtl>«HE SAID: «hello» AND SHUT UP»</span> and shut up.

Correct rendering:: he said: «PU TUHS DNA «hello» DIAS EH» and shut up.
Wrong rendering:: he said: «DIAS EH «hello» PU TUHS DNA» and shut up.

`<bdo>` element

Bidi override: Overrides implicit directional properties of contents

Requires dir attribute

Useful for part numbers (and for including visually formatted text)

<bdo dir=ltr>ABab12DE</bdo>

Correct: ABab12DE

Wrong: BAab12ED

Some useful entities

&lrm; and &rlm; provide directional context for neutrals, symmetric swapping, etc.

Example: &rlm;(&rlm; ==> )

NO other effect (invisible, no word break, etc.)

&zwj; and &zwnj; force or prevent joining in cursive scripts

This is syntactic sugar for actual Unicode characters, but helps with editing source

BIDI in CSS

Not needed for HTML, but for XML

Many things in CSS2 have defaults that are directionality-dependent: tables, alignment, lists, etc.

Exception: background-position

BIDI in CSS (cont.)

direction:ltr|rtl|inherit: specifies the base direction of flow, table layout, etc.

unicode-bidi: normal|embed|bidi-override: specifies embeddings and overrides

BIDI: CSS2 and HTML 4.0

Preserve HTML bidi semantics

bdo[dir="ltr"] { direction: ltr; unicode-bidi: bidi-override }

*[dir="ltr"] { direction: ltr; unicode-bidi: embed }

Block-level elements { unicode-bidi: embed }

(the last line is only relevant if a block-level element is reformatted as an inline element)

Current status

RFC 2070 is IETF Proposed Standard

HTML 4.0 is W3C Recommendation

CSS2 is W3C Recommendation

XML 1.0 is W3C Recommendation

XHTML 1.0 is W3C Proposed Recommendation

I18n features of HTML 4.0 incorporated in ISO HTML

Support increasing in browsers

Support for UTF-8 in Tango, Netscape 4.0 (needs Unicode font) and MSIE 4.0 (needs language packs)

Future developments

Markup enabling locale-sensitive rendering and/or form input of date, time, monetary, etc. values

XML Schema Datatypes

Ruby (HTML, CSS, XSL)

Hyphenation, vertical writing (CSS, XSL)

Generic I18N markup for XML

Future: Ruby

Ruby characters are small annotations set on top of ideographic characters to indicate pronunciation

Hideki Ruby <ruby>
<rb></rb>
<rt></rt>
</ruby>

Working on details: Ruby on both sides, association details, line breaking (see http://www.w3.org/TR/WD-ruby/)

Future: Hyphenation

Various levels of complexity

Marking hyphenation points (&shy;, often not correctly implemented)
Complex behaviour (e.g. Zucker => Zuk - ker)
Algorithm-based hyphenation
- How to associate algorithm with text
- How to define/exchange algorithms
Dictionary-based hyphenation
- How to associate dictionaries with text
- How to define dictionary format

XSL

XSL (Extensible Stylesheet Language)

XML syntax
Transformations (e.g. table of contents): XSLT
High-end styling language
Coordination with CSS for properties

Multilingual Typography

Missing Glyphs

Missing Characters

What is good ML typography?

Missing Glyphs and Characters

Glyph: Shape in font used for display

Character: Basic logical text component

Missing glyph: Appropriate font resource not available

Missing character: Not clear how to transmit character

Missing Glyphs: Fallbacks

Don't show at all (e.g. Hebrew points)
Display alert (not for each character!)
Show uniform replacement glyph (?)
Show replacement glyph per script
Show numeric representation (hexadecimal!)
Insert link to character information
Display character information as popup

Missing Glyphs: Solutions

Provide full Unicode support with at least one font

Use font downloading/encapsulation mechanisms (CSS2 WebFonts)

Use conversion server (converts characters to inline images)

Missing Characters

Use inline images (PNG, GIF, SVG!): Guaranteed to work

(Don't!) use private zone: Missing character -> missing glyph
Only for well-defined communities

Use markup (not defined yet)

Multilingual Typography

Caveat: HTML mostly shows structure, not presentation (=> style sheets and client-side issue)

Readability first, typography next

Typography differs for each language

Typography needs time to develop

Some bilingual examples, but almost no multilingual examples

Multilingual Typography (cont.)

Font availability

Fonts for single scripts (no Unicode fonts)
Differing selections depending on script
Differing selections depending on platform

Font matching

Not transitive (when A matches B and B matches C, A matches C is not guaranteed)
Close match can look ugly

Multilingual Typography (cont.)

Relation of multilingual text pieces

Seamless integration
Contrast to show different roles

Interaction between styles and lang

Use lang as a selector for style (CSS2)
Allow attaching a language to a class

Advanced Multilingual Applications

Conversion servers

Translation servers

Translation Helper Applications

Parallel Documents

Conversion Servers

Servers that can perform encoding conversion

Other servers can redirect a client to them (shared objects)

Can also perform transliteration

Translation servers

Two primary categories:

Servers with human translators (use the WWW as a communication medium for translation services)
Machine translation systems
- Useful in Intranets or localization houses where grammars are more constrained.
- Useful for highly dynamic content

Examples of all these already exist!

Translation Helper Applications

Applications that "plug in" to clients, and provide machine translation services.

Already widely available in Japan

Limited by client machine capabilities

So far, do not produce good results

Parallel Documents

Same text in different languages

Important in legal contexts (e.g. EU)

Alignment levels: document, sentence

Tools should provide:

Alignment conservation with translation
Parallel link management (server)
Parallel display (client)

XLink is going to help

Designing an international Web site

Translation/localization

Language selectors

Site structure

File naming

Page encoding

Text as graphics

Definition

World-wide accessibility does not make a site international

An international site needs:

TRANSLATION of the content, in the languages of the target audiences
LOCALIZATION of the content, adaptation to the audience's culture(s)
There is always a part of l10n in translation

Advantages

Synergy between broadcasting and narrowcasting

Better understanding of information in multicultural or foreign environments

Enhancement of corporate image

Better response time

Localization

Text, images, etc.

Cultural differences in the interpretation of images, colors, symbols

[Images: hands; short cartoon of a man pushing a cart]

Language selection

Use content negotiation

Provide controls

On home page (frameless sites)
On every page (framed sites)
Mixed

Control types

Flags (distributed sites only)
[Image: flags]

List box

Mistake!

Text (as images)
[Image: language names]

Structure

Organization by contents:

Facilitates site administration
Facilitates file editing and update

Organization by languages:

Looks logical, but less recommended
Forces segregation or replication of images, scripts that belong in directory

Structure (cont.)

File names

2-letter language codes

Insert ISO language code between file name and extension:

index.am.html

Allows language negotiation

Beware of intuition

Do not translate file name!

http://babel.alis.com/langues/iso639.htm

Page encoding

Use appropriate character set

Do provide charset identification

Do not use entities, esp. falsely

search engine problems
complicate editing

Multilingual tools make for easier editing

Respect standards

Make your pages processable:

indexable by search engines
translatable by MT engines
editable by standard tools

Make your pages lasting, reusable

Respect standards (cont.)

Make your pages universal

not screen-specific
not browser/platform specific
not version-specific!

Ask whether it works on multiple platforms: not everyone has the same system as yours

http://validator.w3.org/

Text as graphics

More attractive for navigation elements

May help respect corporate image

Art costs add to translation costs

Fast update difficult

Avoid absolutely for headings and paragraphs

Heavy and un-indexable
Accessibility problems

Graphics

Many images must be localized and/or translated

Beware of text expansion!

SVG will make things easier

Scripts, applets, multimedia

Tread carefully with scripts: charset issues, hard-coded messages

Java, ActiveX may require full localization by experts

Sound, movies may require dubbing, localization. Multimedia localization firms have expertise that can be leveraged on the Web.

Resources

W3C Internationalization Activity: http://www.w3.org/International/
HTML 4.0, XHTML, RFC 2070: http://www.w3.org/MarkUp/
http://www.ietf.org/rfc/rfc2070.txt
XML: http://www.w3.org/XML/
CSS2/XSL: http://www.w3.org/Style/
HTTP 1.1, RFC 2616, now IETF Draft Standard: http://www.ietf.org/rfc/rfc2616.txt
URIs, RFC 2396: http://www.ietf.org/rfc/rfc2396.txt
Unicode: http://www.unicode.org/
Babel: Internet internationalization: http://babel.alis.com:8080/

Related Talks

Web i18n track, September 1st
A1: 10:00 - 10:40 Multilingual Application Server (Yamasaki, Netscape)
A2: 10:45 - 11:25 Int'l Features of MSIE (Suignard, Microsoft)
A3: 11:30 - 12:10 Globalization of Amaya (Guetari, W3C)
A4: 13:20 - 14:00 Character Model (Dürst, W3C)
A5: 14:05 - 15:30 Panel: i18n of the Internet

Q&A
