Model for Character Encoding on the Web
WWW2005 Tutorial: Internationalizing Web
Content and Web Technology
10 May 2005, Makuhari, Chiba, Japan
Martin J. Dürst (duerst@it.aoyama.ac.jp)
Department of Integrated Information
Technology
College of Science and Engineering
Aoyama Gakuin University
Tokyo/Sagamihara, Japan

© 2005 Martin J. Dürst, Aoyama Gakuin University
Historic Background
- The Web was invented at CERN in Geneva (Switzerland)
- Originally used iso-8859-1 (Latin-1) to cover languages of Western
Europe
- Tried to use a single encoding as far as possible
- Not general enough; led to some bad legacy (e.g. HTTP
warnings,...)
- Ad-hoc usage in different countries with different encodings for Web
pages (HTML)
- Unified model for HTML in RFC 2070, HTML
Internationalization (now historic, integrated into HTML 4)
Current State
- Model from HTML adopted in XML, CSS, RDF,...
- Documented in Character Model
for the World Wide Web 1.0 (Fundamentals), a W3C Recommendation
- Basics are widely deployed and used
- Some specifics are still being worked on (e.g. normalization)
Character Encoding: Fix it or Label it
- Fixed encoding is preferred:
- For new protocols and new formats where possible
- For very small protocol elements
- UTF-8 or UTF-16
- If encoding can vary:
- Same encoding for big chunks
- files
- MIME entities
- XML external entities
- Label the encoding
Character Encoding Identification
IANA maintains a registry of character
encodings (misleadingly called 'Character Sets'). These are also called
MIME charsets.
- Specifications should use these tags to identify character
encodings
- If an encoding you want to use is not registered, apply for
registration
- Use the MIME preferred form, not an alias
- For private agreement, private use (x-...) tags may be used
Examples of Labels
(case insensitive)
- utf-8, utf-16, utf-16be, utf-16le, utf-32,...
- iso-8859-1, iso-8859-2,...
- iso-2022-jp, euc-jp, shift_jis, gb2312, big5,...
- windows-1252,...
Use of Labels
- HTTP Header (request)
Accept-Charset: utf-8, iso-8859-1, *
- HTTP Header (response)
Content-Type: text/html; charset=utf-8
- XML Text Declaration (in document)
<?xml version='1.0' encoding='shift_jis'?>
- HTML <meta> (in document)
<meta http-equiv='Content-Type' content='text/html;
charset=utf-8' />
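The header form can be handled mechanically. A minimal Python sketch (the function name is my own) that extracts the charset parameter from a Content-Type value, falling back to HTTP/1.1's historical default of iso-8859-1:

```python
from email.message import Message

def charset_from_content_type(value, default="iso-8859-1"):
    """Extract the charset parameter from a Content-Type header value."""
    msg = Message()
    msg["Content-Type"] = value
    return msg.get_param("charset", default)

print(charset_from_content_type("text/html; charset=utf-8"))  # utf-8
print(charset_from_content_type("text/html"))                 # iso-8859-1
```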
In-document Label Bootstrapping
Vicious cycle: We need to know the encoding of the document to read the
document so that we can find the encoding of the document!?
In theory, impossible, in practice, solvable:
- HTML:
- Put <meta ...> as early in the document as possible
- Do not use anything other than ASCII before that
- XML:
- Uses <?xml as magic number
- Detect encoding family (ASCII, EBCDIC, UTF-16BE, UTF-16LE,...)
based on first few bytes
- Detect encoding= based on encoding family
- See RDF Validator for a code example (Java)
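The XML detection steps above can be sketched as follows; this is a simplified reading of the autodetection table in Appendix F of the XML 1.0 specification, not the full set of cases:

```python
def detect_encoding_family(data: bytes) -> str:
    """Guess the encoding family of an XML document from its first bytes,
    just enough to be able to read the encoding= pseudo-attribute."""
    if data.startswith(b"\xef\xbb\xbf"):
        return "UTF-8"         # UTF-8 BOM
    if data.startswith(b"\xfe\xff"):
        return "UTF-16BE"      # BOM
    if data.startswith(b"\xff\xfe"):
        return "UTF-16LE"      # BOM
    if data.startswith(b"\x00<\x00?"):
        return "UTF-16BE"      # '<?' without BOM
    if data.startswith(b"<\x00?\x00"):
        return "UTF-16LE"      # '<?' without BOM
    if data.startswith(b"<?xm"):
        return "ASCII-family"  # now read encoding= for the exact encoding
    if data.startswith(b"\x4c\x6f\xa7\x94"):
        return "EBCDIC"        # '<?xm' in EBCDIC
    return "UTF-8"             # XML's default when there is no label at all
```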
Encoding Label Priorities
- In some cases, there are encoding labels in different places
- Priority has to be defined clearly
- For HTML and XML:
- (HTML: individual user setting)
- HTTP header (outside document)
- in-document information
- (HTML: information from incoming link)
- (HTML: general user/browser setting)
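The HTML priority chain can be sketched as a simple first-match resolver (names and defaults are my own; the individual user-setting override at the top of the list is left out):

```python
def effective_encoding(http_charset=None, in_document=None,
                       link_hint=None, browser_default="iso-8859-1"):
    """Return the first encoding label found, in HTML priority order:
    HTTP header > in-document <meta> > incoming-link hint > browser default."""
    for label in (http_charset, in_document, link_hint):
        if label:
            return label
    return browser_default

# The HTTP header wins even when the document labels itself differently
print(effective_encoding(http_charset="utf-8", in_document="shift_jis"))  # utf-8
```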
External vs. Internal Encoding
- Advantages of external information
- Easier to decode document
- Easy to change for external programs (e.g. transcoding)
- Fits well with protocol/stream-based architecture (Web services,
database,...)
- Advantages of internal information
- Travels with the document, does not get lost
- Easy to add for document author/editor
- Fits well with file-based architecture
Model Core: Unicode as a Hub
- Formats (e.g. XML) are defined as sequences of Unicode characters
- Actual character encoding(s) may be different
- Processing converts (transcodes) from non-Unicode to Unicode-based
encoding
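In code, the hub model is simply "decode into Unicode strings, process, re-encode". A Python sketch:

```python
# Some legacy-encoded input (produced here for the example)
sjis_bytes = "漢字".encode("shift_jis")

# Transcode into Unicode, the hub representation
text = sjis_bytes.decode("shift_jis")

# All processing happens on Unicode characters, not bytes
assert len(text) == 2 and len(sjis_bytes) == 4

# Serialize in whatever encoding the output side needs
utf8_bytes = text.encode("utf-8")
```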

Model Consequences
- In SGML terms
- Unicode is the Document Character Set
- For HTML and XML
- Browsers/processors may use something other than Unicode if they can
make you believe that they use Unicode
- Numeric character references (NCRs, character escapes) refer to
Unicode code points, independent of the document encoding
- Example: &#x20AC; always means € (U+20AC EURO SIGN), whatever the
document's encoding
- For XSLT and XQuery:
- Much more difficult to use something other than Unicode
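The NCR rule is easy to check with Python's stdlib: `html.unescape` always yields the Unicode character named by the number, whatever bytes the surrounding document used.

```python
import html

# &#x20AC; names Unicode code point U+20AC (EURO SIGN) -- always,
# even if the document itself is stored in a legacy encoding
assert html.unescape("&#x20AC;") == "\u20ac"
assert html.unescape("&#8364;") == "\u20ac"  # same character, decimal form
```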
Model Limits
Model assumes that transcoding from non-Unicode encodings to Unicode-based
encodings is uniformly defined.
This is mostly true, but there are some slight variations.
Example: Transcoding from Shift_JIS to Unicode (see XML Japanese
Profile)
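One well-known variation is the byte pair 0x81 0x60: the JIS-based mapping gives U+301C WAVE DASH, while Microsoft's cp932 variant gives U+FF5E FULLWIDTH TILDE. Python ships both codecs, so the difference is easy to observe:

```python
wave = b"\x81\x60"  # one character in Shift_JIS

# Two decoders, two different Unicode results for the same bytes
jis = wave.decode("shift_jis")  # U+301C WAVE DASH
ms = wave.decode("cp932")       # U+FF5E FULLWIDTH TILDE

assert jis != ms
print(hex(ord(jis)), hex(ord(ms)))  # 0x301c 0xff5e
```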
Unicode Normalization and Character Equivalents
For historic reasons, Unicode defines some combinations of characters as
canonical equivalents or compatibility equivalents.
Canonical equivalence examples:
Å (U+00C5), Å (U+212B), Å (U+0041 U+030A)
Compatibility equivalence examples:
A (U+0041), A (U+FF21, full width)
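Python's `unicodedata` module can demonstrate both equivalence classes: canonical equivalents are unified by every normalization form, while compatibility equivalents are unified only by the K forms.

```python
import unicodedata

# Canonical: U+212B ANGSTROM SIGN and A + U+030A both normalize to U+00C5
assert unicodedata.normalize("NFC", "\u212b") == "\u00c5"
assert unicodedata.normalize("NFC", "A\u030a") == "\u00c5"

# Compatibility: U+FF21 FULLWIDTH A survives NFC, but NFKC folds it to plain A
assert unicodedata.normalize("NFC", "\uff21") == "\uff21"
assert unicodedata.normalize("NFKC", "\uff21") == "A"
```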
Unicode Normalization Forms
| form | composition                | equivalences                |
| NFC  | composing (mostly)         | canonical only              |
| NFKC | composing (only canonical) | canonical and compatibility |
| NFD  | decomposing                | canonical only              |
| NFKD | decomposing                | canonical and compatibility |
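The four forms applied to one string containing both kinds of equivalents (U+212B ANGSTROM SIGN followed by U+FF21 FULLWIDTH A):

```python
import unicodedata

s = "\u212b\uff21"
for form in ("NFC", "NFD", "NFKC", "NFKD"):
    out = unicodedata.normalize(form, s)
    print(form, [hex(ord(c)) for c in out])
# NFC  -> U+00C5 U+FF21         (composed; fullwidth A untouched)
# NFD  -> U+0041 U+030A U+FF21
# NFKC -> U+00C5 U+0041         (fullwidth A folded to plain A)
# NFKD -> U+0041 U+030A U+0041
```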
Why Normalization is Difficult but Important
- Occurrence of problems relatively rare
- With use of NFC, even rarer
- Even simple operations on strings (e.g. concatenation) do not maintain
a normalization form in all cases
- Difficult to get low-level machinery (protocols, parsers) to 'do the
right thing'
- Strongly affects certain scripts, languages, and encodings: e.g.
Vietnamese (windows-1258)
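The concatenation problem in concrete terms (this uses `unicodedata.is_normalized`, available since Python 3.8): two strings that are each in NFC can concatenate into something that is not.

```python
import unicodedata

a = "e"       # in NFC by itself
b = "\u0301"  # COMBINING ACUTE ACCENT, also in NFC by itself
assert unicodedata.is_normalized("NFC", a)
assert unicodedata.is_normalized("NFC", b)

# The concatenation is NOT in NFC: NFC composes e + U+0301 into é (U+00E9)
assert not unicodedata.is_normalized("NFC", a + b)
assert unicodedata.normalize("NFC", a + b) == "\u00e9"
```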
Normalization: Advice
- Use Unicode (e.g. UTF-8) from the start where possible
- Use NFC, and avoid compatibility-related characters unless
really necessary (this means that your data is also in NFKC)
See Character Model for the World Wide Web 1.0: Normalization (W3C Working
Draft) for details.
See Unicode in XML and
other Markup Languages for more info on how to use characters and
markup.
Conclusion and Questions