Equivalence versus Normalization: The Case of Identifiers
© 1998 Unicode/W3C/Keio University
Goals
-
Understand the difference between equivalence and normalization
-
Understand the different needs of different applications
-
Understand the problems with canonical equivalence
-
Understand the potential solutions and their difficulties
Equivalence vs. Normalization
Equivalence:
1 inch is equivalent to 2.54 centimeters
Normalization:
Let's all use centimeters, so that we understand each other better
Equivalence in Unicode
é is (canonically) equivalent to e + ´
-
There is no way the reader can (or should) see any difference
-
Unicode requires this equivalence for conformance
-
Applications can use one representation or the other
-
Applications that accept data have to change it to their preferred representation
-
Equivalence is defined, but not normalization
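As a minimal sketch of the point above, the following Python compares the two representations of é. It assumes the accent is U+0301 COMBINING ACUTE ACCENT, and it uses the normalization forms exposed by Python's standard unicodedata module, which postdate these slides. The helper name canonically_equivalent is hypothetical.

```python
import unicodedata

precomposed = "\u00e9"   # é as a single code point
decomposed = "e\u0301"   # e followed by COMBINING ACUTE ACCENT (assumed here)

def canonically_equivalent(a: str, b: str) -> bool:
    """Hypothetical helper: compare two strings after canonical decomposition."""
    return unicodedata.normalize("NFD", a) == unicodedata.normalize("NFD", b)

# A plain (binary) comparison sees two different strings.
print(precomposed == decomposed)
# A comparison that honors canonical equivalence treats them as the same.
print(canonically_equivalent(precomposed, decomposed))
```

The reader sees the same character in both cases; only the code point sequences differ.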
Ambiguities in Unicode
-
Default ordering of multiple non-spacing marks
-
Precomposed/decomposed diacritic character representation
-
Hangul jamo vs. johab and jamo representation alternatives
-
CJK compatibility ideographs
-
Other backwards compatibility duplicated characters
-
Separately coded Indic length/AI/AU marks
-
Glyphs for vertical variants
-
Croatian digraphs, other ligatures (Latin, Arabic,...)
-
Various variant punctuation (apostrophes, middle dots, spaces,...)
-
Half-width/full-width characters (Latin, Katakana and Hangul)
-
Vertical variants (U+FE30...)
-
Presence or absence of joiner/non-joiner
-
Superscript/subscript variants (numbers and IPA)
-
Small form variants (U+FE50...)
-
Upper case/lower case
-
Similar letters from different scripts (varying degrees) (e.g. "A" in Latin,
Greek, and Cyrillic)
-
Letterlike symbols, Roman numerals (varying degrees)
-
Enclosed alphanumerics, katakana, hangul,...
-
Squared katakana (units,...), squared Latin abbreviations,...
-
CJK ideograph variants (varying degrees, in particular general simplifications,
backwards-compatibility non-unifications, JIS 78/83 problems)
-
Ignorable whitespace, hyphens,... (sorting)
-
Ignorable accents,... (sorting)
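A few entries from the list above can be made concrete. The sketch below picks sample code points (an assumption, chosen only for illustration) for mark ordering, Hangul, and a CJK compatibility ideograph, and checks them with Python's unicodedata module. It also shows that look-alike letters from different scripts stay distinct under the normalization forms that module provides.

```python
import unicodedata

def nfd(s: str) -> str:
    """Canonical decomposition, including canonical reordering of marks."""
    return unicodedata.normalize("NFD", s)

# Ordering of multiple non-spacing marks: dot below and acute, in either order.
marks_a = "a\u0323\u0301"
marks_b = "a\u0301\u0323"

# Hangul: precomposed syllable versus conjoining jamo.
hangul_syllable = "\uac00"
hangul_jamo = "\u1100\u1161"

# CJK compatibility ideograph versus the unified ideograph it duplicates.
cjk_compat = "\uf900"
cjk_unified = "\u8c48"

canonical_pairs = [
    (marks_a, marks_b),
    (hangul_syllable, hangul_jamo),
    (cjk_compat, cjk_unified),
]
for left, right in canonical_pairs:
    # Different as code point sequences, identical after canonical decomposition.
    print(left != right, nfd(left) == nfd(right))

# Capital A in Latin, Greek, and Cyrillic: not unified by normalization.
lookalikes = ["A", "\u0391", "\u0410"]
distinct_after_nfkc = {unicodedata.normalize("NFKC", c) for c in lookalikes}
print(len(distinct_after_nfkc))
```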
Equivalence Categories
Canonical Equivalence: Reader has no chance to make or see a difference
=> Needs to work, or reader will be highly confused
Compatibility Equivalence: Almost the same, but difference can be
identified
=> Important for specific applications (searching, sorting)
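The two categories can be told apart in code. This is a sketch that assumes Python's unicodedata module, where the NFC form folds only canonical equivalents and the NFKC form also folds compatibility equivalents. The sample characters and the helper name forms are chosen here for illustration.

```python
import unicodedata

def forms(ch: str) -> tuple[str, str]:
    """Hypothetical helper: return the (NFC, NFKC) forms of a string."""
    return unicodedata.normalize("NFC", ch), unicodedata.normalize("NFKC", ch)

samples = {
    "ligature fi": "\ufb01",
    "full-width A": "\uff21",
    "superscript two": "\u00b2",
    "Roman numeral four": "\u2163",
}

for name, ch in samples.items():
    nfc, nfkc = forms(ch)
    # Canonical normalization leaves these alone; compatibility normalization
    # replaces them with the plain characters they resemble.
    print(f"{name}: unchanged by NFC = {nfc == ch}, NFKC gives {nfkc!r}")
```

A search or sort routine might reasonably fold these variants, while a text store that must preserve the visible difference would not.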
Unicode for Running Text
-
Input and rendering (display/printing) is main application
-
Exchange of text is rare
-
Exchange of text occurs in large chunks
-
Text editors are large applications
-
Equivalence is a relatively small problem
-
Equivalence was accepted to create a single standard
Identifiers
-
URLs, URNs, URIs (Uniform Resource Locators/Names/Identifiers)
-
Element and attribute names in document languages (XML,...)
-
Identifiers in programming languages (Java,...)
-
Using non-ASCII characters in identifiers was not really possible before
Unicode
Equivalence for Identifiers
-
Comparison for equality is the most frequent and most important
operation for identifiers
-
Identifiers are sent across the network very often
-
Comparison occurs in many different places
-
Comparison occurs in small software components
-
Comparison occurs on very small strings
-
Equivalence may be the only I18N operation
-
Equivalence may not be possible because the encoding is not known
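The last two points can be sketched in Python. The identifier "café" and the function names are hypothetical. The first part shows a small component that compares identifiers by code point and therefore misses canonical equivalents; the second shows why bytes alone are not enough when the encoding is unknown.

```python
import unicodedata

def same_identifier_binary(a: str, b: str) -> bool:
    """What a small component typically does: compare code points as-is."""
    return a == b

def same_identifier_equivalent(a: str, b: str) -> bool:
    """Equivalence-aware comparison; needs Unicode data the component may lack."""
    return unicodedata.normalize("NFC", a) == unicodedata.normalize("NFC", b)

name_precomposed = "caf\u00e9"
name_decomposed = "cafe\u0301"

print(same_identifier_binary(name_precomposed, name_decomposed))
print(same_identifier_equivalent(name_precomposed, name_decomposed))

# Same identifier, two encodings: the raw bytes differ, and a component that
# does not know the encoding cannot decode them in order to compare them.
latin1_bytes = name_precomposed.encode("latin-1")
utf8_bytes = name_precomposed.encode("utf-8")
print(latin1_bytes == utf8_bytes)
print(latin1_bytes.decode("latin-1") == utf8_bytes.decode("utf-8"))
```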
Do We Need Normalization?
YES:
-
Java and XML already use binary comparison
-
URIs are opaque strings
-
Architecturally the right thing
NO?:
-
It's too difficult politically
-
There are too many details
-
Natural selection will solve the problem
Architecture
Do the right things at the right place
-
Origin knows more about what it has to normalize
-
Origin knows more about how to normalize
-
Origin:
-
Keyboards
-
Text editors
-
Databases
-
Doing it once is more efficient
An example: Proxies
-
Between servers and clients
-
Caching, security, transformation
-
Have no idea about URI encoding
-
Less efficient if several URIs identify the same resource (double caching)
-
Helps even if not everybody does normalization
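The double-caching problem can be illustrated with a toy cache. Everything here is an assumption made for the sketch: the host example.org, the helper names, UTF-8 percent-encoding of the path, and NFC as the normalized form. The proxy treats URIs as opaque strings, so normalization has to happen at the origin.

```python
import unicodedata
from urllib.parse import quote

def make_uri(path: str) -> str:
    """Hypothetical origin: builds a URI without normalizing the path."""
    return "http://example.org/" + quote(path)

def make_uri_normalized(path: str) -> str:
    """Hypothetical origin that normalizes once, before percent-encoding."""
    return "http://example.org/" + quote(unicodedata.normalize("NFC", path))

def cache_entries(uris: list[str]) -> int:
    """Toy proxy cache: keys are opaque URI strings, compared byte for byte."""
    cache = {}
    for uri in uris:
        cache.setdefault(uri, "response body")
    return len(cache)

paths = ["caf\u00e9", "cafe\u0301"]  # same name, two representations

unnormalized = [make_uri(p) for p in paths]
normalized = [make_uri_normalized(p) for p in paths]

print(unnormalized)
print(cache_entries(unnormalized))  # the same resource is cached twice
print(cache_entries(normalized))    # one entry once origins normalize
```

Each origin that normalizes removes its own duplicates from the cache, which is why partial adoption already helps.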
Internet Engineering Principle
Be conservative in what you send, be liberal in what you accept
Problem: No way to be conservative
Problem: Which Way to Normalize
-
Implementation efficiency: Simple rules
-
Run-time efficiency: Fast if already normalized
-
Acceptability: Normalized to more frequent form
-
Forward-compatibility: Normalize to currently defined representation
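The run-time efficiency criterion can be sketched as a quick check that skips the work when a string is already normalized. This assumes NFC as the target form, which is one possible choice and not one the slides make, and it relies on unicodedata.is_normalized (Python 3.8 or later). The function name to_normalized is hypothetical.

```python
import unicodedata

def to_normalized(s: str) -> str:
    """Return s in the assumed target form (NFC), doing as little work as possible."""
    # Pure ASCII text is unaffected by normalization: cheapest possible exit.
    if s.isascii():
        return s
    # Already in the target form: a check without building a new string.
    if unicodedata.is_normalized("NFC", s):
        return s
    # Only the remaining cases pay for a full normalization pass.
    return unicodedata.normalize("NFC", s)

print(to_normalized("identifier"))
print(to_normalized("caf\u00e9") == "caf\u00e9")
print(to_normalized("cafe\u0301") == "caf\u00e9")
```

If the normalized form is also the more frequent one in practice, most strings take one of the two early exits.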
Conclusion
-
Increased awareness
-
Need to advance pretty soon
-
Wide architectural knowledge needed
-
Wide internationalization expertise needed
-
High readiness for compromise needed