A Character Model for the WWW:
Purpose and Status

Martin J. Dürst
W3C/Keio University

© 1999 Unicode/W3C/Keio University

Topics

W3C Character Model

Character Model Basics

Use of UCS as a common reference

Why Unicode/ISO 10646

Increased Integration on the WWW

Initially: Unidirectional data transfer (server sends page to browser)

Increasingly: WWW as a single application with many components:

XPointers as an Example

String Identity Matching

Main application: Identifiers and related phenomena:

String Identity Matching Problems

Duplicate/multiple encodings in Unicode:

Invisible control codes,...
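
A minimal sketch of the two problems named above, using Python only for illustration. The sample strings (`precomposed`, `decomposed`, `plain_id`, `hidden_id`) are assumptions chosen for the example and do not come from the slides.

```python
import unicodedata

# Two encodings of what a reader perceives as the same letter.
precomposed = "\u00FC"   # LATIN SMALL LETTER U WITH DIAERESIS
decomposed = "u\u0308"   # LATIN SMALL LETTER U followed by COMBINING DIAERESIS

# A code-point-by-code-point comparison treats them as different strings.
same_raw = precomposed == decomposed

# An invisible format character also defeats a naive comparison.
plain_id = "section1"
hidden_id = "section\u200D1"   # contains ZERO WIDTH JOINER
same_ids = plain_id == hidden_id

# The general category shows that the hidden character is a format control.
hidden_category = unicodedata.category("\u200D")

print(same_raw, same_ids, hidden_category)
```

Both pairs look identical when rendered, yet neither compares equal, which is the kind of mismatch an identifier check would run into.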

String Identity Matching Requirements

See Requirements for String Identity Matching and String Indexing, Section 2

String Identity Matching Choices

  1. Which representations to treat as equivalent (and which not)
  2. Which components in the WWW architecture to make responsible for equivalences:
    1. Each individual component that performs a string identity check has to take equivalences into account (Late Normalization)
    2. Duplicates and ambiguities are removed as close to their source as possible (Early Normalization)
  3. How to normalize (if early normalization (2.2) is needed, even if only in some cases)
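
The difference between choices 2.1 and 2.2 can be sketched as follows. This is an illustration under stated assumptions: Unicode Normalization Form C stands in for whatever form is eventually chosen (the slides do not name one), and the function names `late_match`, `early_ingest`, and `early_match` are hypothetical.

```python
import unicodedata

def to_normal_form(s: str) -> str:
    # Assumption: NFC is used purely as a stand-in normalization form.
    return unicodedata.normalize("NFC", s)

def late_match(a: str, b: str) -> bool:
    # Late normalization: every component that compares strings
    # must account for equivalences at comparison time.
    return to_normal_form(a) == to_normal_form(b)

def early_ingest(s: str) -> str:
    # Early normalization: duplicates are removed once, near the source.
    return to_normal_form(s)

def early_match(a: str, b: str) -> bool:
    # Downstream components can then compare code point by code point.
    return a == b

typed = "re\u0301sume\u0301"      # decomposed accents
stored = "r\u00E9sum\u00E9"       # precomposed accents

print(late_match(typed, stored))
print(early_match(early_ingest(typed), early_ingest(stored)))
```

With late normalization the cost and the risk of inconsistency are repeated in every comparing component. With early normalization the comparison itself stays trivial, but every producer has to agree on the same form.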

Why Uniform Early Normalization

Problem: Early normalization has to be uniform across the WWW. This requires additional specifications.

Uniform Early Normalization Requirements

See Requirements for String Identity Matching and String Indexing, Section 3

How to do Early Uniform Normalization

String Indexing

What does substring from "character" X to "character" Y mean?
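
To illustrate why the question is not trivial, the sketch below counts the same short string in three different units. The sample string is an assumption made for this example. It contains a combining mark and a character outside the Basic Multilingual Plane.

```python
# "a" + COMBINING DIAERESIS + DESERET CAPITAL LETTER LONG I + "b"
text = "a\u0308\U00010400b"

# Length in Unicode code points (what Python's str indexes by).
code_points = len(text)

# Length in UTF-8 bytes.
utf8_bytes = len(text.encode("utf-8"))

# Length in UTF-16 code units (two bytes per unit, no byte order mark).
utf16_units = len(text.encode("utf-16-le")) // 2

# Offset of the final "b" under each unit of counting.
b_offset_code_points = text.index("b")
b_offset_utf8 = len(text[:b_offset_code_points].encode("utf-8"))
b_offset_utf16 = len(text[:b_offset_code_points].encode("utf-16-le")) // 2

print(code_points, utf8_bytes, utf16_units)
print(b_offset_code_points, b_offset_utf8, b_offset_utf16)
```

A reader would likely see three "characters" here, while the three counting schemes give three other answers, so an index only has meaning once the unit is specified.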

String Indexing Requirements

See Requirements for String Identity Matching and String Indexing, Section 4

Current State

(as of 24 March 1999)

Conclusions