Model for Character Encoding on the Web

WWW2005 Tutorial: Internationalizing Web Content and Web Technology

10 May 2005, Makuhari, Chiba, Japan

Martin J. Dürst (duerst@it.aoyama.ac.jp)

Department of Integrated Information Technology
College of Science and Engineering
Aoyama Gakuin University
Tokyo/Sagamihara, Japan


© 2005 Martin J. Dürst Aoyama Gakuin University

Web technologies starting with HTML, continuing with XML and RDF, and including CSS, use a common model for encoding characters. Each document can be encoded in a different character encoding, but this encoding has to be carefully declared. For processing, all documents are first converted to Unicode in order to have a common reference. The model, and its consequences in various technologies such as XSLT and XQuery, will be explained in detail. This section will also discuss some advanced topics closely related to character encoding on the Web, such as Unicode Normalization Forms and how to use them appropriately.

Historic Background

Current State

Character Encoding: Fix it or Label it

Character Encoding Identification

IANA maintains a registry of character encodings (misleadingly called 'Character Sets'). These are also called MIME charsets.

Examples of Labels

(case insensitive)

Use of Labels

HTTP Header (request)
Accept-Charset: utf-8, iso-8859-1, *
HTTP Header (response)
Content-Type: text/html; charset=utf-8
XML Text Declaration (in document)
<?xml version='1.0' encoding='shift_jis'?>
HTML <meta> (in document)
<meta http-equiv='Content-Type' content='text/html; charset=utf-8' />
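A minimal sketch of how a receiver might read such a label, using only the Python standard library. The helper names `charset_from_content_type` and `same_encoding` are hypothetical, and Python's codec registry is used as a stand-in for the IANA registry (the two overlap but are not identical).

```python
import codecs
from email.message import Message


def charset_from_content_type(header_value):
    """Return the charset parameter of a Content-Type value, or None."""
    msg = Message()
    msg["Content-Type"] = header_value
    # get_content_charset() lower-cases the label it finds.
    return msg.get_content_charset()


def same_encoding(label_a, label_b):
    """Compare two labels case-insensitively via Python's codec registry.

    Assumption: Python's registry approximates the IANA one here.
    """
    return codecs.lookup(label_a).name == codecs.lookup(label_b).name


label = charset_from_content_type("text/html; charset=UTF-8")
```

Because labels are case insensitive, `same_encoding("Shift_JIS", "shift_jis")` holds in this sketch.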

In-document Label Bootstrapping

Vicious cycle: We need to know the encoding of the document to read the document so that we can find the encoding of the document!?

In theory, impossible, in practice, solvable:
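One common way to break the cycle for XML can be sketched as follows: look for a byte order mark first, and otherwise read the declaration as ASCII, which works for any encoding that represents ASCII characters as single ASCII bytes. This is a simplified illustration, not the full detection procedure of the XML specification: it ignores UTF-32, EBCDIC, and UTF-16 without a byte order mark. The function name `sniff_xml_encoding`, the 200-byte window, and the `utf-8` default are assumptions of this sketch.

```python
import codecs
import re

# Byte order marks checked by this sketch (UTF-32 is deliberately omitted).
_BOMS = [
    (codecs.BOM_UTF8, "utf-8"),
    (codecs.BOM_UTF16_BE, "utf-16-be"),
    (codecs.BOM_UTF16_LE, "utf-16-le"),
]

# The declaration consists of ASCII characters only, so it can be matched
# on raw bytes as long as the encoding is ASCII-compatible.
_DECL = re.compile(
    rb"""<\?xml[^>]*?encoding\s*=\s*["']([A-Za-z][A-Za-z0-9._-]*)["']"""
)


def sniff_xml_encoding(data, default="utf-8"):
    """Guess the encoding of an XML byte string from its first bytes."""
    for bom, name in _BOMS:
        if data.startswith(bom):
            return name
    match = _DECL.match(data[:200])
    if match:
        return match.group(1).decode("ascii").lower()
    return default
```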

Encoding Label Priorities

External vs. Internal Encoding

Model Core: Unicode as a Hub

Figure: inputs in utf-8 and shift_jis, processing in Unicode, outputs in iso-8859-1 and UTF-16

Hub or broker; mention Document Character Set
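The hub idea can be sketched in a few lines of Python, where `str` holds Unicode text and plays the role of the hub. The function name `transcode` is hypothetical, and the encodings are just the ones named in the figure.

```python
def transcode(data, source, target):
    """Convert bytes from one encoding to another via Unicode."""
    text = data.decode(source)   # any input encoding -> Unicode (the hub)
    return text.encode(target)   # Unicode -> any output encoding


sjis_bytes = "日本語".encode("shift_jis")
utf16_bytes = transcode(sjis_bytes, "shift_jis", "utf-16")
latin1_bytes = transcode("Dürst".encode("utf-8"), "utf-8", "iso-8859-1")
```

With a hub, each encoding needs only one decoder and one encoder, instead of a dedicated converter for every pair of encodings.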

Model Consequences

In SGML terms
Unicode is the Document Character Set
For HTML and XML
Browsers/processors may use something other than Unicode internally, as long as they behave as if they used Unicode

Numeric Character References (NCR, character escapes) are in Unicode independent of document encoding

Example: € is &#x20AC; (its Unicode code point), not e.g. &#128; (its byte value in windows-1252)
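A small check of this behaviour, sketched with Python's bundled XML parser: the document below is declared as iso-8859-1, an encoding that has no euro sign, yet the numeric character reference still resolves to U+20AC. The variable names are illustrative only.

```python
import xml.etree.ElementTree as ET

# The document encoding is iso-8859-1, which cannot represent the euro sign
# directly; the numeric character reference refers to Unicode regardless.
doc = "<?xml version='1.0' encoding='iso-8859-1'?><p>&#x20AC;</p>".encode("iso-8859-1")
euro = ET.fromstring(doc).text

# &#128; is the Unicode code point U+0080 (a control character),
# not the euro sign that byte 0x80 stands for in windows-1252.
not_euro = ET.fromstring(b"<p>&#128;</p>").text
```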

For XSLT and XQuery:
Much more difficult to use something other than Unicode

Model Limits

Model assumes that transcoding from non-Unicode encodings to Unicode-based encodings is uniformly defined.

This is mostly true, but there are some slight variations.

Example: Transcoding from Shift_JIS to Unicode (see XML Japanese Profile)
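One such variation can be observed with Python's codecs, which ship two tables for Shift_JIS data: `shift_jis` and `cp932` (the Microsoft variant). The byte pair chosen below (0x81 0x60, the wave dash position) is an illustration picked for this sketch, not an example taken from the XML Japanese Profile itself.

```python
def decode_variants(data, codecs_to_try=("shift_jis", "cp932")):
    """Decode the same bytes with several Shift_JIS mapping tables."""
    return {name: data.decode(name) for name in codecs_to_try}


variants = decode_variants(b"\x81\x60")
for name, text in variants.items():
    # Print the resulting code points, e.g. "shift_jis U+301C".
    print(name, " ".join(f"U+{ord(ch):04X}" for ch in text))
```

The same two bytes come out as U+301C (WAVE DASH) with one table and U+FF5E (FULLWIDTH TILDE) with the other, so a round trip through Unicode is only safe when both directions use the same table.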

Unicode Normalization and Character Equivalents

For historical reasons, Unicode defines some characters and character sequences as canonical equivalents or compatibility equivalents of each other.

Canonical equivalence examples:

Å (U+00C5), Å (U+212B), Å (U+0041 U+030A)

Compatibility equivalence examples:

A (U+0041), Ａ (U+FF21, fullwidth)
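These equivalences can be tested with Python's `unicodedata` module by comparing fully decomposed forms. The helper names `canonically_equivalent` and `compatibility_equivalent` are hypothetical.

```python
import unicodedata


def canonically_equivalent(a, b):
    """True if a and b have the same canonical decomposition (NFD)."""
    return unicodedata.normalize("NFD", a) == unicodedata.normalize("NFD", b)


def compatibility_equivalent(a, b):
    """True if a and b have the same compatibility decomposition (NFKD)."""
    return unicodedata.normalize("NFKD", a) == unicodedata.normalize("NFKD", b)


A_RING = "\u00c5"          # LATIN CAPITAL LETTER A WITH RING ABOVE
ANGSTROM = "\u212b"        # ANGSTROM SIGN
A_PLUS_RING = "A\u030a"    # A followed by COMBINING RING ABOVE
FULLWIDTH_A = "\uff21"     # FULLWIDTH LATIN CAPITAL LETTER A
```

In this sketch the three forms of Å are canonically equivalent, while A and fullwidth A are equivalent only under compatibility.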

Unicode Normalization Forms

Form   Composition                  Equivalences
NFC    composing (mostly)           canonical only
NFKC   composing (only canonical)   canonical and compatibility
NFD    decomposing                  canonical only
NFKD   decomposing                  canonical and compatibility
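The four forms can be compared side by side with Python's `unicodedata.normalize`. The sample string (ANGSTROM SIGN followed by FULLWIDTH A) and the helper name `all_forms` are choices made for this sketch.

```python
import unicodedata

FORMS = ("NFC", "NFKC", "NFD", "NFKD")


def all_forms(text):
    """Return the text in each of the four normalization forms."""
    return {form: unicodedata.normalize(form, text) for form in FORMS}


sample = "\u212b\uff21"  # ANGSTROM SIGN + FULLWIDTH LATIN CAPITAL LETTER A
forms = all_forms(sample)
for form in FORMS:
    # Show the code points so that invisible differences become visible.
    print(form, " ".join(f"U+{ord(ch):04X}" for ch in forms[form]))
```

The canonical-only forms leave the fullwidth A untouched, whereas the compatibility forms replace it with a plain A.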

Why Normalization is Difficult but Important

Normalization: Advice

See Character Model for the World Wide Web 1.0: Normalization (W3C Working Draft) for details.

See Unicode in XML and other Markup Languages for more info on how to use characters and markup.

Conclusion and Questions