Equivalence versus Normalization: The Case of Identifiers

Martin J. Dürst
Keio University/W3C

Goals

Understand the difference between equivalence and normalization
Understand the different needs of different applications
Understand the problems with canonical equivalence
Understand the potential solutions, and difficulties

Equivalence vs. Normalization

Equivalence:

1 inch is equivalent to 2.54 centimeters

Normalization:

Let's all use centimeters, so that we understand each other better

Equivalence in Unicode

é is (canonically) equivalent to e + ´

There is no way the reader can (or should) see any difference
Unicode requires this equivalence for conformance
Applications can use one representation or the other
Applications that accept data have to change it to their preferred representation
Equivalence is defined, but not normalization
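As a rough illustration of the points above, the following sketch compares the two representations of é, first as raw strings and then after normalization. It assumes the normalization forms NFC and NFD of current Unicode, as exposed by Python's unicodedata module; these slides do not presuppose any such form, and the variable names are only illustrative.

```python
import unicodedata

# Two canonically equivalent representations of the same text.
precomposed = "\u00e9"    # LATIN SMALL LETTER E WITH ACUTE
decomposed = "e\u0301"    # LATIN SMALL LETTER E + COMBINING ACUTE ACCENT

# A plain (binary) comparison sees two different strings.
binary_equal = precomposed == decomposed

# After bringing both to one agreed form, they compare equal.
nfc_equal = (unicodedata.normalize("NFC", precomposed)
             == unicodedata.normalize("NFC", decomposed))
nfd_equal = (unicodedata.normalize("NFD", precomposed)
             == unicodedata.normalize("NFD", decomposed))

print(binary_equal, nfc_equal, nfd_equal)
```

Equivalence says the two strings mean the same thing; normalization is the separate agreement on which of them to use.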

Ambiguities in Unicode

Default ordering of multiple non-spacing marks
Precomposed/decomposed diacritic character representation
Hangul jamo vs. johab and jamo representation alternatives
CJK compatibility ideographs
Other backwards compatibility duplicated characters
Separately coded Indic length/AI/AU marks
Glyphs for vertical variants
Croatian digraphs, other ligatures (Latin, Arabic,...)
Various variant punctuation (apostrophes, middle dots, spaces,...)
Half-width/full-width characters (Latin, Katakana and Hangul)
Vertical variants (U+FE30...)
Presence or absence of joiner/non-joiner
Superscript/subscript variants (numbers and IPA)
Small form variants (U+FE50...)
Upper case/lower case
Similar letters from different scripts (varying degrees) (e.g. "A" in Latin, Greek, and Cyrillic)
Letterlike symbols, Roman numerals (varying degrees)
Enclosed alphanumerics, katakana, hangul,...
Squared katakana (units,...), squared Latin abbreviations,...
CJK ideograph variants (varying degrees, in particular general simplifications, backwards-compatibility non-unifications, JIS 78/83 problems)
Ignorable whitespace, hyphens,... (sorting)
Ignorable accents,... (sorting)
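Two entries from this list can be sketched in code, again assuming Python's unicodedata module and the later NFD/NFKC normalization forms (not something the slides rely on). The ordering of multiple non-spacing marks is covered by canonical equivalence, while look-alike letters from different scripts are not unified by any normalization form.

```python
import unicodedata

# Same base letter, same two marks, different order of the marks.
acute_then_dot = "a\u0301\u0323"   # a + COMBINING ACUTE + COMBINING DOT BELOW
dot_then_acute = "a\u0323\u0301"   # a + COMBINING DOT BELOW + COMBINING ACUTE

# Canonical decomposition puts the marks into one defined order.
same_after_nfd = (unicodedata.normalize("NFD", acute_then_dot)
                  == unicodedata.normalize("NFD", dot_then_acute))

# "A" in Latin, Greek, and Cyrillic: similar glyphs, distinct characters.
latin_a, greek_a, cyrillic_a = "A", "\u0391", "\u0410"
distinct_after_nfkc = {unicodedata.normalize("NFKC", c)
                       for c in (latin_a, greek_a, cyrillic_a)}

print(same_after_nfd, len(distinct_after_nfkc))
```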

Equivalence Categories

Canonical Equivalence: Reader has no chance to make or see a difference
=> Needs to work, or reader will be highly confused

Compatibility Equivalence: Almost the same, but difference can be identified
=> Important for specific applications (searching, sorting)
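The two categories can be contrasted with a small sketch. It assumes the NFD and NFKD forms of current Unicode as a way to test each kind of equivalence; the helper names canonically_equal and compatibility_equal are hypothetical and not taken from any standard API.

```python
import unicodedata

def canonically_equal(a: str, b: str) -> bool:
    """True if a and b are the same after canonical decomposition."""
    return unicodedata.normalize("NFD", a) == unicodedata.normalize("NFD", b)

def compatibility_equal(a: str, b: str) -> bool:
    """True if a and b are the same after compatibility decomposition."""
    return unicodedata.normalize("NFKD", a) == unicodedata.normalize("NFKD", b)

angstrom_sign = "\u212b"   # ANGSTROM SIGN
a_with_ring = "\u00c5"     # LATIN CAPITAL LETTER A WITH RING ABOVE
fullwidth_a = "\uff21"     # FULLWIDTH LATIN CAPITAL LETTER A
latin_a = "A"

# Canonical: the reader cannot see a difference.
print(canonically_equal(angstrom_sign, a_with_ring))
# Compatibility only: the difference is visible, but searching may want to ignore it.
print(canonically_equal(fullwidth_a, latin_a), compatibility_equal(fullwidth_a, latin_a))
```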

Unicode for Running Text

Input and rendering (display/printing) are the main applications
Exchange of text is rare
Exchange of text occurs in large chunks
Text editors are large applications
Equivalence is a relatively small problem
Equivalence was accepted to create a single standard

Identifiers

URLs, URNs, URIs (Uniform Resource Locators/Names/Identifiers)
Element and attribute names in document languages (XML,...)
Identifiers in programming languages (Java,...)
Using non-ASCII characters in identifiers was not really possible before Unicode

Equivalence for Identifiers

Comparison for equality is the most frequent and most important operation for identifiers
Identifiers are sent across the network very often
Comparison occurs in many different places
Comparison occurs in small software components
Comparison occurs on very small strings
Equivalence may be the only I18N operation
Equivalence may not be possible because the encoding is not known
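A minimal sketch of what a small component sees when it compares identifiers byte by byte. The identifier "résumé" and the choice of UTF-8 and Latin-1 are assumptions made only for illustration.

```python
# The same identifier as a user would read it, in two representations.
precomposed = "r\u00e9sum\u00e9"     # é as single code points
decomposed = "re\u0301sume\u0301"    # e + COMBINING ACUTE ACCENT

# A component that only sees bytes compares bytes.
utf8_precomposed = precomposed.encode("utf-8")
utf8_decomposed = decomposed.encode("utf-8")
latin1_precomposed = precomposed.encode("latin-1")

# Same encoding, different representation: the byte strings differ.
print(utf8_precomposed == utf8_decomposed)
# Same representation, different encoding: the byte strings differ as well,
# and without knowing the encoding the component cannot tell why.
print(utf8_precomposed == latin1_precomposed)
```

In both cases a binary comparison reports a mismatch, although the reader sees one and the same identifier.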

Do We Need Normalization?

YES:

Java and XML already use binary comparison
URIs are opaque strings
Architecturally the right thing

NO?:

It's too difficult politically
There are too many details
Natural selection will solve the problem

Architecture

Do the right things at the right place

Origin knows more about what it has to normalize
Origin knows more about how to normalize
Origin:
- Keyboards
- Text editors
- Databases
Doing it once is more efficient

An example: Proxies

Between servers and clients
Caching, security, transformation
Have no idea about URI encoding
Less efficient if more URIs for the same resource (double caching)
Helps even if not everybody does normalization
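The double-caching effect can be sketched with a toy cache keyed on the raw URI string. Everything here is hypothetical: the ToyProxyCache class, the origin_uri helper, the host example.org, and the use of NFC as the agreed form are assumptions for illustration, not a description of any real proxy.

```python
import unicodedata
from urllib.parse import quote

class ToyProxyCache:
    """A proxy that treats URIs as opaque strings (hypothetical)."""
    def __init__(self):
        self.store = {}
        self.origin_fetches = 0

    def get(self, uri: str) -> str:
        if uri not in self.store:
            self.origin_fetches += 1          # cache miss: go to the server
            self.store[uri] = "body of " + uri
        return self.store[uri]

def origin_uri(path: str, normalize: bool) -> str:
    """Build a URI at the origin, optionally normalizing before escaping."""
    if normalize:
        path = unicodedata.normalize("NFC", path)
    return "http://example.org" + quote(path)

paths = ["/caf\u00e9", "/cafe\u0301"]   # the same name, two representations

plain = ToyProxyCache()
for p in paths:
    plain.get(origin_uri(p, normalize=False))

normalized = ToyProxyCache()
for p in paths:
    normalized.get(origin_uri(p, normalize=True))

print(plain.origin_fetches, normalized.origin_fetches)
```

The proxy itself never normalizes anything; it benefits only because the origin sent one form.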

Internet Engineering Principle

Be conservative in what you send, be liberal in what you accept

Problem: No way to be conservative

Problem: Which Way to Normalize

Implementation efficiency: Simple rules
Run-time efficiency: Fast if already normalized
Acceptability: Normalized to more frequent form
Forward-compatibility: Normalize to currently defined representation
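The run-time efficiency criterion can be sketched as a cheap check before any real work. This assumes NFC as the chosen form and Python 3.8 or later for unicodedata.is_normalized; the function name to_preferred_form is hypothetical.

```python
import unicodedata

def to_preferred_form(s: str) -> str:
    """Return s in NFC, doing as little work as possible."""
    # Pure ASCII text is unchanged by NFC, so it needs no processing.
    if s.isascii():
        return s
    # Quick check: most text that reaches us is expected to be normalized already.
    if unicodedata.is_normalized("NFC", s):
        return s
    # Slow path: only strings that actually need it are rewritten.
    return unicodedata.normalize("NFC", s)

print(to_preferred_form("identifier"))
print(to_preferred_form("e\u0301") == "\u00e9")
```

Normalizing toward the more frequent form keeps the slow path rare, which is where the acceptability and run-time criteria meet.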

Conclusion

Increased awareness
Need to advance pretty soon
Wide architectural knowledge needed
Wide internationalization expertise needed
High readiness for compromises needed