A Character Model for the WWW:
Purpose and Status

Martin J. Dürst
W3C/Keio University

Topics

Overview of character model
Use of UCS as a common reference
String Identity Matching
Early Uniform Normalization
String Indexing
Character encoding in URIs
Current state and future directions

W3C Character Model

Under development, first public Working Draft on February 25
Developed by W3C I18N Working Group (with the help of the I18N Interest Group) in the context of the W3C I18N Activity
For use by:
- Other W3C working groups
- W3C Recommendations
- Implementors of W3C technology
- End users
- Technology interfacing to the WWW
Affects other specifications and software

Character Model Basics

Distinguish the following:
- Characters
- Glyphs
- Codepoints
- Bytes
- Language,... information
Widely accepted in internationalization community
Other documents available:
- ISO Character/Glyph model document
- Unicode Proposed Draft TR#17
Give overview for common understanding
Address web-specific issues

Use of UCS as a common reference

UCS: Universal Character Set (Unicode/ISO 10646)
Allow documents to be encoded in various legacy encodings
Have documents/protocol declare document encodings (MIME "charset" parameter)
Define character escapes based on Unicode/ISO 10646
- &#xABCD; in HTML 4.0/XML 1.0
- easy transcoding
- full UCS repertoire usable
Applications behave as if Unicode/ISO 10646 were used internally (RFC 2070 reference processing model, UCS as the document character set)
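
A minimal Python sketch of the idea above, not part of the original slides: text that arrives in different declared encodings, or as a numeric character reference, maps to the same sequence of UCS characters once decoded. The sample string and the two encodings are assumptions chosen for illustration.

```python
import html

# The same text as it might arrive under two different "charset" declarations.
latin1_bytes = "Dürst".encode("iso-8859-1")
utf8_bytes = "Dürst".encode("utf-8")

# Decoding according to the declared encoding yields the same characters.
from_latin1 = latin1_bytes.decode("iso-8859-1")
from_utf8 = utf8_bytes.decode("utf-8")

# A numeric character reference names a UCS code point directly,
# independent of the encoding the document is stored in.
from_escape = html.unescape("D&#xFC;rst")

print(latin1_bytes != utf8_bytes)                 # byte sequences differ
print(from_latin1 == from_utf8 == from_escape)    # characters are identical
```

Because the escape refers to the code point rather than to any byte value, transcoding a document between encodings does not require rewriting its escapes.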

Why Unicode/ISO 10646

Only universal character repertoire available
Covers widest possible range
Provides a way of referencing characters independently of encoding
Is being updated carefully
Is widely accepted and used by industry

Increased Integration on the WWW

Initially: Unidirectional data transfer (server sends page to browser)

Increasingly: WWW as a single application with many components:

Increase in data transfers among servers, proxies, and clients
Increase of cases where non-ASCII characters are allowed (due to Unicode/ISO 10646!)
Increase in data transfers between different protocol/format elements (such as element/attribute names, URI components, and textual content)
Definition of specifications for APIs (e.g. DOM)

XPointers as an Example

Defines links for XML
HTML: <A HREF="http://foo.com/abc.html#def"> refers to <A NAME="def"> in document abc.html
XPointers: Refer to arbitrary parts of documents
- Based on document structure (=> String Identity Matching for element names)
- Based on actual text (=> String Identity Matching)
- Based on offsets (=> String Indexing)

String Identity Matching

String => characters, text
Identity => no general matching (e.g. only case-sensitive)
Matching => no comparison (i.e. not for sorting)

Main application: Identifiers and related phenomena:

URIs (web addresses)
Element/Attribute names (XML)
Identifiers in programming languages (Java,...)

String Identity Matching Problems

Duplicate/multiple encodings in Unicode:

Default ordering of multiple non-spacing marks
Precomposed/decomposed diacritic character representation
Hangul jamo vs. johab and jamo representation alternatives
CJK compatibility ideographs
Other backwards compatibility duplicated characters
Separately coded Indic length/AI/AU marks
...

Invisible control codes,...
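
Some of the duplicate encodings listed above can be shown with Python's standard unicodedata module. This is an illustrative sketch only; the helper name canonically_equal and the sample strings are assumptions, not taken from the slides.

```python
import unicodedata

# Precomposed vs. decomposed diacritic representation.
precomposed = "\u00e9"       # LATIN SMALL LETTER E WITH ACUTE
decomposed = "e\u0301"       # e + COMBINING ACUTE ACCENT

# Two orderings of multiple non-spacing marks on the same base letter.
marks_a = "a\u0323\u0301"    # dot below, then acute
marks_b = "a\u0301\u0323"    # acute, then dot below

# Precomposed Hangul syllable vs. its conjoining jamo sequence.
syllable = "\uac00"
jamo = "\u1100\u1161"

def canonically_equal(s, t):
    """Compare two strings after full canonical decomposition (NFD)."""
    return unicodedata.normalize("NFD", s) == unicodedata.normalize("NFD", t)

for s, t in [(precomposed, decomposed), (marks_a, marks_b), (syllable, jamo)]:
    # Plain code point comparison fails, canonical comparison succeeds.
    print(s == t, canonically_equal(s, t))
```

Each pair looks identical to a user but differs when compared code point by code point.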

String Identity Matching Requirements

Exact definition
Hide invisible encoding differences from the user (base on Unicode canonical equivalences)
Distinguish characters that can usually be distinguished by the user
Forward-compatibility
Broad applicability
Usable with opaque identifiers and data (URIs, encryption)
Allow senders to be "conservative in what you send"

See Requirements for String Identity Matching and String Indexing, Section 2

String Identity Matching Choices

Which representations to treat as equivalent (and which not)
Which components in the WWW architecture to make responsible for equivalences:
1. Each individual component that performs a string identity check has to take equivalences into account (Late Normalization)
2. Duplicates and ambiguities are removed as close to their source as possible (Early Normalization)
Which way to normalize (if early normalization (option 2) is needed, even if only in some cases)
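
The two placement options can be contrasted in a short Python sketch. The function names are hypothetical, and the use of NFC as the normalization form is an assumption made for illustration; the slides leave the choice of form open at this point.

```python
import unicodedata

def normalize(s):
    # Assumed normalization form for this sketch.
    return unicodedata.normalize("NFC", s)

# Option 1, late normalization: every component that compares strings
# must apply the equivalences itself, on every comparison.
def late_match(a, b):
    return normalize(a) == normalize(b)

# Option 2, early normalization: the producer normalizes once, close to
# the source, and every later component compares code points directly.
def produce(text):
    return normalize(text)

def early_match(a, b):
    return a == b

raw_a, raw_b = "\u00e9", "e\u0301"
print(late_match(raw_a, raw_b))                       # works on raw input
print(early_match(produce(raw_a), produce(raw_b)))    # works after producers normalize
print(early_match(raw_a, raw_b))                      # fails if a producer skips the step
```

The last line shows why early normalization only works if it is applied uniformly: a single producer that skips the step breaks plain comparison downstream.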

Why Uniform Early Normalization

Needed for "be conservative in what you send"
Only solution to deal with opaque data
Not all parts of the WWW may reasonably be expected to do normalization (small devices,...)
Less need for software updates to address forward-compatibility issues
More efficient implementations for string indexing (see below)
Increasingly difficult to hide implementation details anyway

Problem: Early normalization has to be uniform across the WWW. This requires additional specifications.
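
The point about opaque data can be illustrated with a digest: once text has been hashed or encrypted, a receiver can no longer apply equivalences, so only normalization before that step helps. The sketch below uses SHA-256 and UTF-8 purely as stand-ins for any opaque transformation; both are assumptions, not something the slides specify.

```python
import hashlib
import unicodedata

def digest(text):
    """Hash the UTF-8 bytes of a string; the result is opaque to the receiver."""
    return hashlib.sha256(text.encode("utf-8")).hexdigest()

precomposed = "\u00e9"
decomposed = "e\u0301"

# Without early normalization, equivalent strings yield unrelated digests.
raw_equal = digest(precomposed) == digest(decomposed)

# With normalization applied before hashing, the digests agree.
norm_equal = (
    digest(unicodedata.normalize("NFC", precomposed))
    == digest(unicodedata.normalize("NFC", decomposed))
)

print(raw_equal, norm_equal)
```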

Uniform Early Normalization Requirements

Say exactly who is responsible
Base on widespread practice (=> prefer precomposed over decomposed)
Specify in collaboration with the expert communities on character encoding
Make feasible to implement
Provide reference software
Provide test cases

See Requirements for String Identity Matching and String Indexing, Section 3

How to do Early Uniform Normalization

Working together with Unicode Technical Committee
Current status (Unicode Draft TR#15):
- Base on canonical equivalence
- Use precomposed where available
- Exceptions for scripts such as Hebrew
- Cutoff at version 3.0 of Unicode (and the next edition of ISO/IEC 10646-1)
- Decomposition for precomposed forms introduced after Unicode 3.0
- Discussion about details when only part of a combining sequence can be absorbed into a precomposed character
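
The behaviour outlined above resembles what later became Unicode Normalization Form C, which Python exposes through unicodedata.normalize. The sketch below assumes NFC as a stand-in; the draft described on this slide may differ in detail, and the wrapper name early_normalize is hypothetical.

```python
import unicodedata

def early_normalize(s):
    # Assumption: NFC approximates the normalization sketched on the slide.
    return unicodedata.normalize("NFC", s)

# Precomposed form is used where one is available.
composed = early_normalize("e\u0301")

# Exception: some Hebrew presentation forms are excluded from composition,
# so the precomposed character is replaced by its decomposition.
hebrew = early_normalize("\ufb1d")

# Partial absorption: only the dot below combines with the base letter;
# the acute accent remains a separate combining mark.
partial = early_normalize("a\u0301\u0323")

for label, value in [("composed", composed), ("hebrew", hebrew), ("partial", partial)]:
    print(label, [f"U+{ord(c):04X}" for c in value])
```

The third case is the kind of detail the last bullet refers to: the combining sequence is reordered, and only part of it is absorbed into a precomposed character.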

String Indexing

What does substring from "character" X to "character" Y mean?

Different user expectations
Different levels of resolution
Differences due to encoding duplicates
On lower levels, differences due to different encodings (UCS-4, UTF-16, UTF-8)
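
A small Python sketch of why "character X to character Y" is ambiguous: the same text has a different length at each level. The sample string is an assumption chosen to exercise both a combining sequence and a character outside the Basic Multilingual Plane.

```python
import unicodedata

# e + COMBINING ACUTE ACCENT, followed by one supplementary character (U+10000).
text = "e\u0301\U00010000"

code_points = len(text)                             # UCS-4 view: code points
utf16_units = len(text.encode("utf-16-le")) // 2    # UTF-16 code units
utf8_bytes = len(text.encode("utf-8"))              # UTF-8 bytes

# Encoding duplicates change the count too: after composition the
# accented letter is a single code point.
normalized_points = len(unicodedata.normalize("NFC", text))

print(code_points, utf16_units, utf8_bytes, normalized_points)  # 3 4 7 2
```

A user would most likely count two characters here, which matches none of the three encoding-level counts and matches the code point count only after normalization.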

String Indexing Requirements

Consistent behaviour across implementations
Take user expectations into account
Try to address "characters" at various levels
Ensure forward compatibility
Ensure feasible and efficient implementation

See Requirements for String Identity Matching and String Indexing, Section 4

Current State

(as of 24 March, 1999)

Basics widely accepted, but not uniformly codified
String Identity Matching and String Indexing being worked on
- Requirements document published as a W3C Working Draft
- First version of Character Model published as W3C Working Draft
Character model also includes provisions for dealing with internationalized URIs

Conclusions

Unicode/ISO 10646 is used in more and more places
Tighter integration on the WWW increases need for more consistent behaviour
Solve problems (e.g. equivalences) at their source, not everywhere
Give simple implementations, straightforward implementers, and small devices a chance
Try to address problems early to keep them small
High need for compromises: If everybody gives in a bit, overall everybody will get some benefits
