Model for Character Encoding on the Web
WWW2005 Tutorial: Internationalizing Web
Content and Web Technology
10 May 2005, Makuhari, Chiba, Japan
Martin J. Dürst (duerst@it.aoyama.ac.jp)
Department of Integrated Information
Technology
College of Science and Engineering
Aoyama Gakuin University
Tokyo/Sagamihara, Japan

© 2005 Martin J. Dürst, Aoyama Gakuin University
Historic Background
- The Web was invented at CERN in Geneva (Switzerland)
- Originally used iso-8859-1 (Latin-1) to cover languages of Western
Europe
- Tried to use a single encoding as far as possible
- Not general enough; led to some bad legacy (e.g. HTTP
warnings,...)
- Ad-hoc usage in different countries with different encodings for Web
pages (HTML)
- Unified model for HTML in RFC 2070, HTML
Internationalization (now historic, integrated into HTML 4)
Current State
- Model from HTML adopted in XML, CSS, RDF,...
- Documented in Character Model
for the World Wide Web 1.0 (Fundamentals), a W3C Recommendation
- Basics are widely deployed and used
- Some specifics are still being worked on (e.g. normalization)
Character Encoding: Fix it or Label it
- Fixed encoding is preferred:
- For new protocols and new formats where possible
- For very small protocol elements
- UTF-8 or UTF-16
- If encoding can vary:
- Same encoding for big chunks
- files
- MIME entities
- XML external entities
- Label the encoding
Character Encoding Identification
IANA maintains a registry of character
encodings (misleadingly called 'Character Sets'). These are also called
MIME charsets.
- Specifications should use these tags to identify character
encodings
- If an encoding you want to use is not registered, apply for
registration
- Use the MIME preferred form, not an alias
- For private agreement, private use (x-...) tags may be used
Examples of Labels
(case insensitive)
- utf-8, utf-16, utf-16be, utf-16le, utf-32,...
- iso-8859-1, iso-8859-2,...
- iso-2022-jp, euc-jp, shift_jis, gb2312, big5,...
- windows-1252,...
Use of Labels
- HTTP Header (request)
Accept-Charset: utf-8, iso-8859-1, *
- HTTP Header (response)
Content-Type: text/html; charset=utf-8
- XML Text Declaration (in document)
<?xml version='1.0' encoding='shift_jis'?>
- HTML <meta> (in document)
<meta http-equiv='Content-Type' content='text/html;
charset=utf-8' />
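The header form can be handled mechanically. A minimal Python sketch (the function name is my own) that extracts the charset parameter from a Content-Type value, falling back to HTTP/1.1's historical default of iso-8859-1:

```python
from email.message import Message

def charset_from_content_type(value, default="iso-8859-1"):
    """Extract the charset parameter from a Content-Type header value."""
    msg = Message()
    msg["Content-Type"] = value
    return msg.get_param("charset", default)

print(charset_from_content_type("text/html; charset=utf-8"))  # utf-8
print(charset_from_content_type("text/html"))                 # iso-8859-1
```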
In-document Label Bootstrapping
Vicious cycle: We need to know the encoding of the document to read the
document so that we can find the encoding of the document!?
In theory, impossible, in practice, solvable:
- HTML:
- Put <meta ...> as early in the document as possible
- Do not use anything other than ASCII before that
- XML:
- Uses <?xml as magic number
- Detect encoding family (ASCII, EBCDIC, UTF-16BE, UTF-16LE,...)
based on first few bytes
- Detect encoding= based on encoding family
- See RDF Validator for a code example (Java)
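The XML detection steps above can be sketched as follows; this is a simplified reading of the autodetection table in Appendix F of the XML 1.0 specification, not the full set of cases:

```python
def detect_encoding_family(data: bytes) -> str:
    """Guess the encoding family of an XML document from its first bytes,
    just enough to be able to read the encoding= pseudo-attribute."""
    if data.startswith(b"\xef\xbb\xbf"):
        return "UTF-8"         # UTF-8 BOM
    if data.startswith(b"\xfe\xff"):
        return "UTF-16BE"      # BOM
    if data.startswith(b"\xff\xfe"):
        return "UTF-16LE"      # BOM
    if data.startswith(b"\x00<\x00?"):
        return "UTF-16BE"      # '<?' without BOM
    if data.startswith(b"<\x00?\x00"):
        return "UTF-16LE"      # '<?' without BOM
    if data.startswith(b"<?xm"):
        return "ASCII-family"  # now read encoding= for the exact encoding
    if data.startswith(b"\x4c\x6f\xa7\x94"):
        return "EBCDIC"        # '<?xm' in EBCDIC
    return "UTF-8"             # XML's default when there is no label at all
```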
Encoding Label Priorities
- In some cases, there are encoding labels in different places
- Priority has to be defined clearly
- For HTML and XML:
- (HTML: individual user setting)
- HTTP header (outside document)
- in-document information
- (HTML: information from incoming link)
- (HTML: general user/browser setting)
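The HTML priority chain can be sketched as a simple first-match resolver (names and defaults are my own; the individual user-setting override at the top of the list is left out):

```python
def effective_encoding(http_charset=None, in_document=None,
                       link_hint=None, browser_default="iso-8859-1"):
    """Return the first encoding label found, in HTML priority order:
    HTTP header > in-document <meta> > incoming-link hint > browser default."""
    for label in (http_charset, in_document, link_hint):
        if label:
            return label
    return browser_default

# The HTTP header wins even when the document labels itself differently
print(effective_encoding(http_charset="utf-8", in_document="shift_jis"))  # utf-8
```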
External vs. Internal Encoding
- Advantages of external information
- Easier to decode document
- Easy to change for external programs (e.g. transcoding)
- Fits well with protocol/stream-based architecture (Web services,
database,...)
- Advantages of internal information
- Travels with the document, does not get lost
- Easy to add for document author/editor
- Fits well with file-based architecture
Model Core: Unicode as a Hub
- Formats (e.g. XML) are defined as sequences of Unicode characters
- Actual character encoding(s) may be different
- Processing converts (transcodes) from non-Unicode to Unicode-based
encoding
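In code, the hub model is simply "decode into Unicode strings, process, re-encode". A Python sketch:

```python
# Some legacy-encoded input (produced here for the example)
sjis_bytes = "漢字".encode("shift_jis")

# Transcode into Unicode, the hub representation
text = sjis_bytes.decode("shift_jis")

# All processing happens on Unicode characters, not bytes
assert len(text) == 2 and len(sjis_bytes) == 4

# Serialize in whatever encoding the output side needs
utf8_bytes = text.encode("utf-8")
```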

Model Consequences
- In SGML terms
- Unicode is the Document Character Set
- For HTML and XML
- Browsers/processors may use something other than Unicode if they can
make you believe that they use Unicode
- Numeric character references (NCRs, character escapes) refer to
Unicode code points, independent of the document encoding
- Example: &#x20AC; always means € (U+20AC EURO SIGN), whatever the
document's encoding
- For XSLT and XQuery:
- Much more difficult to use something other than Unicode
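The NCR rule is easy to check with Python's stdlib: `html.unescape` always yields the Unicode character named by the number, whatever bytes the surrounding document used.

```python
import html

# &#x20AC; names Unicode code point U+20AC (EURO SIGN) -- always,
# even if the document itself is stored in a legacy encoding
assert html.unescape("&#x20AC;") == "\u20ac"
assert html.unescape("&#8364;") == "\u20ac"  # same character, decimal form
```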
Model Limits
Model assumes that transcoding from non-Unicode encodings to Unicode-based
encodings is uniformly defined.
This is mostly true, but there are some slight variations.
Example: Transcoding from Shift_JIS to Unicode (see XML Japanese
Profile)
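One well-known variation is the byte pair 0x81 0x60: the JIS-based mapping gives U+301C WAVE DASH, while Microsoft's cp932 variant gives U+FF5E FULLWIDTH TILDE. Python ships both codecs, so the difference is easy to observe:

```python
wave = b"\x81\x60"  # one character in Shift_JIS

# Two decoders, two different Unicode results for the same bytes
jis = wave.decode("shift_jis")  # U+301C WAVE DASH
ms = wave.decode("cp932")       # U+FF5E FULLWIDTH TILDE

assert jis != ms
print(hex(ord(jis)), hex(ord(ms)))  # 0x301c 0xff5e
```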
Unicode Normalization and Character Equivalents
For historic reasons, Unicode defines some combinations of characters as
canonical equivalents or compatibility equivalents.
Canonical equivalence examples:
Å (U+00C5), Å (U+212B), Å (U+0041 U+030A)
Compatibility equivalence examples:
A (U+0041), A (U+FF21, full width)
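Python's `unicodedata` module can demonstrate both equivalence classes: canonical equivalents are unified by every normalization form, while compatibility equivalents are unified only by the K forms.

```python
import unicodedata

# Canonical: U+212B ANGSTROM SIGN and A + U+030A both normalize to U+00C5
assert unicodedata.normalize("NFC", "\u212b") == "\u00c5"
assert unicodedata.normalize("NFC", "A\u030a") == "\u00c5"

# Compatibility: U+FF21 FULLWIDTH A survives NFC, but NFKC folds it to plain A
assert unicodedata.normalize("NFC", "\uff21") == "\uff21"
assert unicodedata.normalize("NFKC", "\uff21") == "A"
```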
Unicode Normalization Forms
| form | composition                | equivalences                |
| NFC  | composing (mostly)         | canonical only              |
| NFKC | composing (only canonical) | canonical and compatibility |
| NFD  | decomposing                | canonical only              |
| NFKD | decomposing                | canonical and compatibility |
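The four forms applied to one string containing both kinds of equivalents (U+212B ANGSTROM SIGN followed by U+FF21 FULLWIDTH A):

```python
import unicodedata

s = "\u212b\uff21"
for form in ("NFC", "NFD", "NFKC", "NFKD"):
    out = unicodedata.normalize(form, s)
    print(form, [hex(ord(c)) for c in out])
# NFC  -> U+00C5 U+FF21         (composed; fullwidth A untouched)
# NFD  -> U+0041 U+030A U+FF21
# NFKC -> U+00C5 U+0041         (fullwidth A folded to plain A)
# NFKD -> U+0041 U+030A U+0041
```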
Why Normalization is Difficult but Important
- Occurrence of problems relatively rare
- With use of NFC, even rarer
- Even simple operations on strings (e.g. concatenation) do not maintain
a normalization form in all cases
- Difficult to get low-level machinery (protocols, parsers) to 'do the
right thing'
- Strongly affects certain scripts, languages, and encodings: e.g.
Vietnamese (windows-1258)
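The concatenation problem in concrete terms (this uses `unicodedata.is_normalized`, available since Python 3.8): two strings that are each in NFC can concatenate into something that is not.

```python
import unicodedata

a = "e"       # in NFC by itself
b = "\u0301"  # COMBINING ACUTE ACCENT, also in NFC by itself
assert unicodedata.is_normalized("NFC", a)
assert unicodedata.is_normalized("NFC", b)

# The concatenation is NOT in NFC: NFC composes e + U+0301 into é (U+00E9)
assert not unicodedata.is_normalized("NFC", a + b)
assert unicodedata.normalize("NFC", a + b) == "\u00e9"
```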
Normalization: Advice
- Use Unicode (e.g. UTF-8) from the start where possible
- Use NFC, and avoid compatibility-related characters unless
really necessary (this means that your data is also in NFKC)
See Character Model for the World Wide Web 1.0: Normalization (W3C Working
Draft) for details.
See Unicode in XML and
other Markup Languages for more info on how to use characters and
markup.
Conclusion and Questions