A Character Model for the WWW:
Purpose and Status
© 1999 Unicode/W3C/Keio University
Topics
-
Overview of character model
-
Use of UCS as a common reference
-
String Identity Matching
-
Early Uniform Normalization
-
String Indexing
-
Character encoding in URIs
-
Current state and future directions
W3C Character Model
-
Under development, first public
Working Draft on February 25
-
Developed by W3C I18N Working Group (with the help of the I18N Interest Group)
in the context of the
W3C I18N Activity
-
For use by:
-
Other W3C working groups
-
W3C Recommendations
-
Implementors of W3C technology
-
End users
-
Technology interfacing to the WWW
-
Affects other specifications and software
Character Model Basics
-
Distinguish the following:
-
Characters
-
Glyphs
-
Codepoints
-
Bytes
-
Language,... information
-
Widely accepted in internationalization community
-
Other documents available:
-
Give overview for common understanding
-
Address web-specific issues
Use of UCS as a common reference
-
UCS: Universal Character Set (Unicode/ISO 10646)
-
Allow documents to be encoded in various legacy encodings
-
Have documents/protocol declare document encodings (MIME "charset" parameter)
-
Define character escapes based on Unicode/ISO 10646
-
&#xABCD; in HTML 4.0/XML 1.0
-
easy transcoding
-
full UCS repertoire usable
-
Applications behave as if Unicode/ISO 10646 were used internally (RFC 2070
reference processing model, UCS as the document character set)
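As a rough illustration of the points above, the following Python sketch shows a numeric character reference resolving to a UCS code point, and the same text in two legacy encodings transcoding to one UCS string. The helper name `to_ucs` and the sample bytes are assumptions made for this example; the charset labels stand in for what a MIME "charset" parameter would declare.

```python
import html

# A numeric character reference names a UCS code point, so it resolves
# the same way whatever encoding the document itself is stored in.
reference = "&#xABCD;"
character = html.unescape(reference)
print(f"{reference} -> U+{ord(character):04X}")


def to_ucs(data: bytes, charset: str) -> str:
    """Hypothetical helper: decode bytes using the declared charset."""
    return data.decode(charset)


# The same word stored in two different legacy encodings.
latin1_bytes = b"caf\xe9"   # ISO-8859-1: 0xE9 is e with acute
cp437_bytes = b"caf\x82"    # CP437: 0x82 is e with acute

from_latin1 = to_ucs(latin1_bytes, "iso-8859-1")
from_cp437 = to_ucs(cp437_bytes, "cp437")

# Once both are expressed in UCS, they are the same string.
print(from_latin1 == from_cp437)
```

The bytes differ, but after decoding with the declared encoding an application can treat both documents as if UCS had been used from the start.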
Why Unicode/ISO 10646
-
Only universal character repertoire available
-
Covers widest possible range
-
Provides a way of referencing characters independently of encoding
-
Is being updated carefully
-
Is widely accepted and used by industry
Increased Integration on the WWW
Initially: Unidirectional data transfer (server sends page to browser)
Increasingly: WWW as a single application with many components:
-
Increase in data transfers among servers, proxies, and clients
-
Increase of cases where non-ASCII characters are allowed (due to Unicode/ISO
10646!)
-
Increase in data transfers between different protocol/format elements (such
as element/attribute names, URI components, and textual content)
-
Definition of specifications for APIs (e.g.
DOM)
XPointers as an Example
-
Defines links for XML
-
HTML: <A HREF="http://foo.com/abc.html#def"> refers to <A
NAME="def"> in document abc.html
-
XPointers: Refer to arbitrary parts of documents
-
Based on document structure (=> String Identity Matching for
element names)
-
Based on actual text (=> String Identity Matching)
-
Based on offsets (=> String Indexing)
String Identity Matching
-
String => characters, text
-
Identity => no general matching (e.g. only case-sensitive)
-
Matching => no comparison (i.e. not for sorting)
Main application: Identifiers and related phenomena:
-
URIs (web addresses)
-
Element/Attribute names (XML)
-
Identifiers in programming languages (Java,...)
String Identity Matching Problems
Duplicate/multiple encodings in Unicode:
-
Default ordering of multiple non-spacing marks
-
Precomposed/decomposed diacritic character representation
-
Hangul jamo vs. johab and jamo representation alternatives
-
CJK compatibility ideographs
-
Other backwards compatibility duplicated characters
-
Separately coded Indic length/AI/AU marks
-
...
Invisible control codes,...
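Several of the duplicate encodings listed above can be observed with Python's standard `unicodedata` module. This is only a sketch: the sample strings and the helper `canonically_equivalent` are chosen for illustration, and the check uses today's published normalization forms rather than anything defined on these slides.

```python
import unicodedata

# Each pair looks the same to a user but uses different code points.
pairs = {
    "precomposed vs. decomposed": ("\u00e9", "e\u0301"),
    "order of non-spacing marks": ("a\u0323\u0301", "a\u0301\u0323"),
    "Hangul syllable vs. jamo": ("\ud55c", "\u1112\u1161\u11ab"),
    "CJK compatibility ideograph": ("\uf900", "\u8c48"),
}


def canonically_equivalent(a: str, b: str) -> bool:
    """Compare two strings after full canonical decomposition (NFD)."""
    return unicodedata.normalize("NFD", a) == unicodedata.normalize("NFD", b)


for label, (a, b) in pairs.items():
    # A plain code point comparison fails; the canonical comparison succeeds.
    print(f"{label}: identical={a == b}, "
          f"equivalent={canonically_equivalent(a, b)}")
```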
String Identity Matching Requirements
-
Exact definition
-
Hide invisible encoding differences from the user (base on Unicode canonical
equivalences)
-
Distinguish characters that can usually be distinguished by the user
-
Forward-compatibility
-
Broad applicability
-
Usable with opaque identifiers and data (URIs, encryption)
-
Allow senders to be "conservative in what you send"
See Requirements for String Identity
Matching and String Indexing, Section 2
String Identity Matching Choices
-
Which representations to treat as equivalent (and which not)
-
Which components in the WWW architecture to make responsible for equivalences:
-
Each individual component that performs a string identity check has to take
equivalences into account (Late Normalization)
-
Duplicates and ambiguities are removed as close to their source as possible
(Early Normalization)
-
Which way to normalize (if early normalization, the second option above, is
needed, even if only in some cases)
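The two placements of responsibility can be contrasted in a short Python sketch. The function names `late_match`, `produce`, and `early_match` are hypothetical, and NFC is used only as one possible normalization target.

```python
import unicodedata


def late_match(a: str, b: str) -> bool:
    """Late normalization: every comparison site normalizes both operands."""
    return unicodedata.normalize("NFC", a) == unicodedata.normalize("NFC", b)


def produce(text: str) -> str:
    """Early normalization: the producer normalizes once, at the source."""
    return unicodedata.normalize("NFC", text)


def early_match(a: str, b: str) -> bool:
    """With early normalization, consumers compare code points directly."""
    return a == b


typed = "re\u0301sume\u0301"    # decomposed, e.g. from one input method
stored = "r\u00e9sum\u00e9"     # precomposed, e.g. from another

print(late_match(typed, stored))                     # compares after normalizing
print(early_match(produce(typed), produce(stored)))  # inputs already normalized
print(early_match(typed, stored))                    # unnormalized inputs differ
```

With late normalization every component needs the normalization tables; with early normalization only producers do, and the comparison itself stays a simple code point check.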
Why Uniform Early Normalization
-
Needed for "be conservative in what you send"
-
Only solution to deal with opaque data
-
Not all parts of the WWW may reasonably be expected to do normalization (small
devices,...)
-
Less need for software updates to address forward-compatibility issues
-
More efficient implementations for string indexing (see below)
-
Increasingly difficult to hide implementation details anyway
Problem: Early normalization has to be uniform across the WWW. This requires
additional specifications.
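The point about opaque data can be illustrated with a digest, used here as a stand-in for any signed, hashed, or encrypted payload. This is a minimal sketch; the helper names `digest` and `nfc` and the choice of SHA-256 are assumptions for the example.

```python
import hashlib
import unicodedata


def digest(text: str) -> str:
    """Hash the UTF-8 bytes of a string; the result is opaque to receivers."""
    return hashlib.sha256(text.encode("utf-8")).hexdigest()


def nfc(text: str) -> str:
    """Normalize at the source, before the data becomes opaque."""
    return unicodedata.normalize("NFC", text)


precomposed = "caf\u00e9"
decomposed = "cafe\u0301"

# A receiver cannot normalize inside a digest, so late normalization fails.
raw_digests_match = digest(precomposed) == digest(decomposed)

# If every sender normalizes the same way first, the digests agree.
early_digests_match = digest(nfc(precomposed)) == digest(nfc(decomposed))

print(raw_digests_match, early_digests_match)
```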
Uniform Early Normalization Requirements
-
Say exactly who is responsible
-
Base on widespread practice (=> prefer precomposed over decomposed)
-
Specify in collaboration with the expert communities on character encoding
-
Make feasible to implement
-
Provide reference software
-
Provide test cases
See Requirements for String Identity
Matching and String Indexing, Section 3
How to do Early Uniform Normalization
-
Working together with Unicode Technical Committee
-
Current status (Unicode
Draft TR#15):
-
Base on canonical equivalence
-
Use precomposed where available
-
Exceptions for scripts such as Hebrew
-
Cutoff at version 3.0 of Unicode (and the next edition of ISO/IEC 10646-1)
-
Decomposition for precomposed forms introduced after Unicode 3.0
-
Discussion about details when only part of a combining sequence can be absorbed
into a precomposed character
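The behaviour outlined above can be tried out with the NFC form in Python's `unicodedata` module. That module implements the normalization forms as later published by Unicode, so treating it as a stand-in for the draft described here is an assumption, and the sample strings are chosen only for illustration.

```python
import unicodedata


def nfc(text: str) -> str:
    return unicodedata.normalize("NFC", text)


def code_points(text: str) -> str:
    """Render a string as a list of U+XXXX code points."""
    return " ".join(f"U+{ord(c):04X}" for c in text)


examples = [
    # Precomposed form is used where one is available.
    ("precomposed where available", "e\u0301"),
    # A Hebrew presentation form is excluded from composition
    # and stays decomposed.
    ("exception for Hebrew", "\ufb1d"),
    # Only the dot below can be absorbed; the acute remains separate.
    ("partial absorption", "a\u0301\u0323"),
]

for label, source in examples:
    print(f"{label}: {code_points(source)} -> {code_points(nfc(source))}")
```

In the third example the marks are first put into canonical order, the base letter absorbs the dot below, and the acute accent is left as a combining character because no precomposed form covers the whole sequence.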
String Indexing
What does substring from "character" X to "character" Y mean?
-
Different user expectations
-
Different levels of resolution
-
Differences due to encoding duplicates
-
On lower levels, differences due to different encodings (UCS-4, UTF-16, UTF-8)
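The following Python sketch shows how the position of one and the same "character" changes with the level of resolution and with encoding duplicates. The helper `offsets_of` and the sample text are assumptions made for this example.

```python
import unicodedata


def offsets_of(text: str, target: str) -> dict:
    """Return the offset of `target` in `text` at three levels of resolution."""
    index = text.index(target)      # offset in code points (UCS-4 units)
    prefix = text[:index]
    return {
        "code points": index,
        "UTF-16 code units": len(prefix.encode("utf-16-le")) // 2,
        "UTF-8 bytes": len(prefix.encode("utf-8")),
    }


# "e" + combining acute, one character outside the BMP, then "x".
decomposed = "e\u0301\U00010400x"
precomposed = unicodedata.normalize("NFC", decomposed)

# The user sees "x" as the third character in both strings,
# yet every offset below differs.
print(offsets_of(decomposed, "x"))
print(offsets_of(precomposed, "x"))
```

An index exchanged between implementations is therefore only meaningful if both sides agree on the unit being counted and on the normalization state of the text.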
String Indexing Requirements
-
Consistent behaviour across implementations
-
Take user expectations into account
-
Try to address "characters" at various levels
-
Ensure forward compatibility
-
Ensure feasible and efficient implementation
See Requirements for String Identity
Matching and String Indexing, Section 4
Current State
(as of 24 March, 1999)
-
Basics widely accepted, but not uniformly codified
-
String Identity Matching and String Indexing being worked on
-
Character model also includes provisions for dealing with internationalized
URIs
Conclusions
-
Unicode/ISO 10646 is used in more and more places
-
Tighter integration on the WWW increases need for more consistent behaviour
-
Solve problems (e.g. equivalences) at their source, not everywhere
-
Give simple implementations, straightforward implementers, and small devices
a chance
-
Try to address problems early to keep them small
-
High need for compromises: If everybody gives in a bit, overall everybody
will get some benefits