Reference processing model
-
Logically, all characters are UCS (ISO/IEC 10646) characters
-
For HTML, UCS is declared as the SGML Document Character Set
-
For XML, the grammar is based on characters (not bytes): "A character is
an atomic unit of text as specified by ISO/IEC 10646"
-
For CSS, essentially the same: "A CSS style sheet is a sequence of characters
from the UCS..."
-
The on-the-wire encoding can be anything compatible with UCS (i.e. any
encoding of UCS or of a subset of UCS)
-
Identify the encoding, transcode on input, then deal only with Unicode
internally
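
The last step can be sketched in Python as follows. This is only an illustration of the model, not a normative algorithm: the function names, the UTF-8 default, and the rule that a byte order mark overrides the declared encoding are all assumptions made for the example.

```python
import codecs

# Byte order marks, checked longest-first so that the UTF-32 LE mark
# (FF FE 00 00) is not mistaken for the UTF-16 LE mark (FF FE).
_BOMS = [
    (codecs.BOM_UTF32_LE, "utf-32-le"),
    (codecs.BOM_UTF32_BE, "utf-32-be"),
    (codecs.BOM_UTF8, "utf-8"),
    (codecs.BOM_UTF16_LE, "utf-16-le"),
    (codecs.BOM_UTF16_BE, "utf-16-be"),
]


def identify_encoding(raw, declared=None, default="utf-8"):
    """Return (encoding name, number of leading bytes to skip).

    Assumed precedence for this sketch: BOM, then the declared
    encoding (e.g. from a charset parameter), then the default.
    """
    for bom, name in _BOMS:
        if raw.startswith(bom):
            return name, len(bom)
    return (declared or default), 0


def to_characters(raw, declared=None):
    """Transcode on input: bytes in, Unicode characters (str) out."""
    encoding, skip = identify_encoding(raw, declared)
    return raw[skip:].decode(encoding)


# Two different on-the-wire encodings of the same text...
latin1_bytes = "café".encode("iso-8859-1")
utf8_bytes = codecs.BOM_UTF8 + "café".encode("utf-8")

# ...yield the same sequence of characters after transcoding.
text_a = to_characters(latin1_bytes, declared="iso-8859-1")
text_b = to_characters(utf8_bytes)
```

After this point the rest of the processing sees only `str` values, so it never needs to know which encoding was used on the wire.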