Warning:
This wiki has been archived and is now read-only.

HTMLCharacterUsage

From HTML WG Wiki


Character use in HTML

To help authors and implementors understand the intricacies of Unicode characters, this page proposes conducting some browser tests and adding information about unusual characters to the HTML5 draft. The proposal currently sorts character usage into four tiers: 1) follow Unicode norms and guidance; 2) follow HTML5-provided norms and guidance; 3) discouraged: use other facilities in place of these characters; 4) avoid: these characters would effectively be deprecated for use in HTML5.

This proposal also examines what presentational facilities HTML, CSS and other specifications require to dispense with all use of all characters in Unicode's compatibility range (and some other compatibility blocks).

While this section has the potential of growing too lengthy, it is worth considering that this provides a fairly comprehensive look at the usage, semantics and presentational implications of the entire Unicode standard. Since most characters are basic graphical characters — in other words, the characters simply map from a character to a glyph representing a letter, numeral, or punctuation or other mark within a particular language — we need say little about those characters. That leaves only a handful of character categories that need to be addressed:

Byte-order mark (1)
Separators (including spaces, tab, new line, and new paragraph) (26)
Word boundary characters (4)
Grapheme boundary characters (4)
Bidirectional control characters (7)
Compatibility area characters (≅3,000)
Variation selectors (256)
Control characters (65)
Tags (97)
Specials (5)
Surrogates (2,048)
Musical Notation formatting (8)
Deprecated (10)
Private use (137,468)
Symbols (≅4,000)

Of those categories, only the following require discussion of individual characters in any depth, for a total of 14 characters:

Byte-order mark (1)
Separators (including spaces, tab, new line, and new paragraph) (3; space, tab, line feed)
Word boundary characters (4; no-break space, zero-width no-break space, word joiner, no-break hyphen)
Grapheme boundary characters (4; zero-width joiner, zero-width non-joiner, combining grapheme joiner, soft hyphen)
Bidirectional control characters (2)

Quicklinks

See Unicode guidance | Graphical characters (≅100,000) ° Excluding compatibility characters ° Including symbols (≅3,000) | Variation selectors | Byte-order mark (U+FEFF) | Controlling bidirectional neutral characters
Guided usage | Space (U+0020) | Source formatting whitespace | Other (non-)whitespace | Separator and joiner | Line-break controlling | Private use
Discouraged | Styling space | Compatibility | Control characters | Other whitespace characters
Avoid | Specials | Tags | Deprecated | Bidirectional text controlling | Surrogate
Presentational requirements for avoiding compatibility characters
UA Conformance | Character foldings | Editing and conversion UAs | Test characters
See also

See Unicode guidance

Graphical characters

The main aim of Unicode is to assign plain-text graphical characters to its more than 1 million code points. As of Unicode 5.0, nearly 100 thousand such graphical characters have been assigned from over 50 scripts (writing systems). In some sense the meaning of plain text is shaped by the Unicode standard. Plain text now includes over 3,000 symbols from various disciplines such as mathematics, geometry, and computer science. It also includes music notation, though implementation support for such musical notation may still be limited.

While these characters are typically called graphical characters, they more broadly correspond to: the letters (including alphabetic letters, syllabic letters, ideographs, etc.); numerals (Indic-Arabic base ten, hexadecimal digits, Ancient Greek, Roman, East Asian, etc.); punctuation; diacritical and other marks; and so on. Graphically, one or more of these characters are mapped to one or more glyphs through a font or glyphlette (a small ad hoc font). However, when a language has been specified, these characters can also be mapped through speech synthesizers to audible words, or through embossed or refreshable Braille, for audible and tactile presentation as well.

Excluding compatibility characters

While Unicode focusses on defining plain text characters, it also seeks to provide a mechanism to translate all encodings into Unicode without any loss of information. It therefore includes several thousand characters that are not plain text per se. Instead these characters are deemed compatibility characters, as they often represent the same visual appearance and semantics as equivalent characters with rich text styling applied. When used with a language such as HTML, Unicode expects the higher level language to handle this styling in a more uniform way than can be provided through these compatibility characters. Therefore the use of such characters is discouraged in HTML.

Including symbols

In addition to supporting the letters, numerals, and marks from over 50 scripts, Unicode also includes over 3,000 symbols for various specialized technical, religious, political and other uses. Most of the blocks devoted to symbols occur in the range U+20A0 – U+2BFF, though symbol blocks exist outside this range too.

Variation selectors

Variation selectors allow a single code point to stand for up to 256 character or glyph variants. They were designed mainly to deal with variants of East Asian ideographs: Unicode provides a registry where any vendor or author may register an official character variant for any of the more than 70,000 ideograph code points assigned in Unicode 5.0.

16 in BMP (U+FE00 – U+FE0F)
240 in the Supplementary Special-purpose Plane (U+E0100 – U+E01EF)
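
As a rough check of the two ranges above, the following sketch tests membership and confirms the total of 256. The function names are illustrative and not part of any specification.

```python
def is_variation_selector(cp: int) -> bool:
    """Return True if cp lies in one of the two ranges listed above."""
    return 0xFE00 <= cp <= 0xFE0F or 0xE0100 <= cp <= 0xE01EF


def count_variation_selectors() -> int:
    """Count every code point accepted by is_variation_selector."""
    return sum(1 for cp in range(0x110000) if is_variation_selector(cp))
```

Counting over the whole code space gives 16 + 240 = 256, matching the figure in the category list.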

Byte-order mark (U+FEFF)

The byte-order mark is an initial code point in a text file that indicates the endianness of the text. It also helps UAs easily identify Unicode transformation format (UTF) text encodings. The code point with the opposite endianness, U+FFFE, has been permanently set aside as a Unicode noncharacter. This noncharacter code point (U+FFFE) therefore MUST NOT appear in a document, particularly not as the first code point.

In addition, the byte-order mark character originally did double duty as a zero-width no-break space. This usage has now been deprecated, so a byte-order mark MUST appear only as the first character in an HTML document (the very first character, before all elements and before any prolog or any other characters).

To get the functionality formerly provided through ZERO WIDTH NO-BREAK SPACE, authors should use WORD JOINER (U+2060) instead (see ‘Separator and joiner’).
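
A conformance checker or editing tool could apply these rules roughly as follows. This is a minimal sketch operating on already-decoded text; the function names and message strings are hypothetical.

```python
from typing import List

BOM = "\ufeff"          # BYTE ORDER MARK / ZERO WIDTH NO-BREAK SPACE
SWAPPED_BOM = "\ufffe"  # noncharacter, must not appear at all
WORD_JOINER = "\u2060"


def check_bom_usage(text: str) -> List[str]:
    """Report violations of the byte-order mark rules described above."""
    problems = []
    if SWAPPED_BOM in text:
        problems.append("U+FFFE noncharacter present")
    if BOM in text[1:]:
        problems.append("U+FEFF used after the first character")
    return problems


def replace_inner_bom(text: str) -> str:
    """Keep a leading U+FEFF; turn every later one into WORD JOINER."""
    return text[:1] + text[1:].replace(BOM, WORD_JOINER)
```

For example, `replace_inner_bom("\ufeffa\ufeffb")` keeps the leading mark and yields `"\ufeffa\u2060b"`.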

Controlling bidirectional neutral characters

Authors may use the following characters to control the directionality of bidirectionally neutral characters.

LEFT-TO-RIGHT MARK (U+200E)
RIGHT-TO-LEFT MARK (U+200F)

For finer control of bidirectional text, authors should not use the other Unicode bidirectional control characters. Instead, authors SHOULD use the 'dir' attribute on nested phrase elements to increase the embedding level, or nested 'bdo' (bidirectional override) elements to override the normal text direction.

Guided usage

Whitespace

HORIZONTAL TAB (U+0009)
LINE FEED (U+000A)
SPACE (U+0020)

Meaning of space separated

Space U+0020

Source formatting whitespace

Space U+0020
Horizontal Tab U+0009
Line Feed U+000A

Semantically significant whitespace

Discuss the 'pre' element or any elements where authors apply the CSS 'white-space' property.

Separator and joiner

Separators and joiners may have many different uses depending on what they join and separate and what aspect of joining they control. For example, the WORD JOINER (U+2060) joins words to prevent line breaks at characters that might otherwise signal an ideal place for a line break. In contrast, a standard SPACE (U+0020), as input from a keyboard spacebar, is a word separator in many scripts. It indicates the boundary between two words and signals a suitable location for a line break. Other separators and joiners join or separate graphemes that might be joined as ligatures (ligation) or joined through cursive connection of glyphs. Independent of the controls available for glyphs, the COMBINING GRAPHEME JOINER (U+034F) instead joins characters for differentiation in searching and collation.

	Separators (Joiners)
	Character semantics	Glyph presentation
Grapheme	N/A (Combining Grapheme Joiner)	Zero-width non-joiner (Zero-width joiner)
Syllable	Soft Hyphen
Word	Space Zero-width space (Word Joiner) (No-break Space) (Non-breaking hyphen)
Line	<br/> <separator> <l></l> U+000A as a character reference or literal wherever whitespace is relevant
Paragraph	<p></p>

These are special characters that add semantic content to the text by either joining what would otherwise be separated or separating what would otherwise be joined. For example, the WORD JOINER (U+2060) can be used to indicate that a character normally treated as a word boundary should not be treated as one, which avoids a line break at that character. For word boundaries authors use the following characters:

SPACE (U+0020)
NO-BREAK SPACE (U+00A0)
ZERO WIDTH SPACE (U+200B)
Provides a character to separate words without visible whitespace.
WORD JOINER (U+2060)
Other script specific separators (2)
Other script specific formatting characters (16, including 8 music notation control characters)

		Space
		Collapsed	Visible
Line-breaking	allows	ZERO WIDTH SPACE	SPACE
Line-breaking	prevents	WORD JOINER	NO-BREAK SPACE

COMBINING GRAPHEME JOINER (U+034F)
To control searching and collation of combined graphemes such as digraphs (e.g., the “Mc” in “McDonalds”)
This has no effect on the glyph presentation of the characters joined into a grapheme cluster. To control the ligation and the cursive connection of glyphs, use the grapheme glyph control characters (see ‘Grapheme glyph control’): ZERO WIDTH NON-JOINER (U+200C) and ZERO WIDTH JOINER (U+200D).

For mathematical expressions authors can use the following characters to encode machine-readable semantics.

FUNCTION APPLICATION (U+2061)
INVISIBLE TIMES (U+2062)
INVISIBLE SEPARATOR (U+2063)

For LINE SEPARATOR (U+2028) and PARAGRAPH SEPARATOR (U+2029) see ‘Other whitespace characters’: both are discouraged.

Line-break controlling

Lines break only at word boundaries or, within a word, where a hyphenation algorithm allows; they never break within a grapheme cluster. Unicode provides several characters to control these word boundaries, the most important being WORD JOINER (U+2060) and SPACE (U+0020). The WORD JOINER can be used to avoid a line break by combining it with other characters that normally serve as a word boundary. So while Unicode provides NO-BREAK SPACE (U+00A0) to insert a space between words without allowing a line break, the same thing can be accomplished with the sequence WORD JOINER (U+2060) + SPACE (U+0020) + WORD JOINER (U+2060). Similarly, the sequence WORD JOINER (U+2060) + HYPHEN-MINUS (U+002D) + WORD JOINER (U+2060) behaves like NON-BREAKING HYPHEN (U+2011). Although Unicode provides several no-break characters, the WORD JOINER can be combined with any word boundary character to create a no-break equivalent in this way.
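
The WORD JOINER sequences described in this paragraph can be built mechanically. The sketch below only constructs the strings; whether a given UA honors them is exactly what the proposed browser tests would need to establish. The helper names are hypothetical.

```python
WORD_JOINER = "\u2060"


def no_break(boundary: str) -> str:
    """Wrap a word-boundary character in WORD JOINERs to forbid a break."""
    return WORD_JOINER + boundary + WORD_JOINER


def join_without_break(words, boundary=" "):
    """Join words with a visible boundary that should not allow a line break."""
    return no_break(boundary).join(words)
```

Here `no_break(" ")` is the stand-in for NO-BREAK SPACE and `no_break("-")` the stand-in for NON-BREAKING HYPHEN.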

No-Break characters

NO-BREAK SPACE (U+00A0)
NON-BREAKING HYPHEN (U+2011)
FIGURE SPACE (U+2007)
NARROW NO-BREAK SPACE (U+202F)
TIBETAN MARK DELIMITER TSHEG BSTAR (U+0F0C)

In addition to these characters the SOFT HYPHEN (U+00AD) can be used to provide line-break hints within words that UAs might not have in their own hyphenation dictionaries.

Private use

When using private use characters, authors should ensure that fonts or glyphlettes are provided in one way or another so that users can properly view the text. (There are also speech synthesizer and Braille issues.)

BMP private use area (U+E000 – U+F8FF)
Supplementary private use area-A (U+F0000 – U+FFFFD)
Supplementary private use area-B (U+100000 – U+10FFFD)
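
An authoring tool might flag private use characters so that the author can confirm a font is supplied. The following is a sketch using the three ranges above; the function names are illustrative.

```python
PRIVATE_USE_RANGES = [
    (0xE000, 0xF8FF),      # BMP private use area
    (0xF0000, 0xFFFFD),    # Supplementary private use area-A
    (0x100000, 0x10FFFD),  # Supplementary private use area-B
]


def is_private_use(ch: str) -> bool:
    """Return True if the single character ch is a private use code point."""
    cp = ord(ch)
    return any(lo <= cp <= hi for lo, hi in PRIVATE_USE_RANGES)


def private_use_total() -> int:
    """Total number of private use code points across the three ranges."""
    return sum(hi - lo + 1 for lo, hi in PRIVATE_USE_RANGES)


def find_private_use(text: str):
    """List (index, 'U+XXXX') for each private use character in text."""
    return [(i, f"U+{ord(ch):04X}") for i, ch in enumerate(text) if is_private_use(ch)]
```

The total comes to 6,400 + 65,534 + 65,534 = 137,468, the figure given in the category list.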

Discouraged

Because of their interoperability problems, presentational role, or non-plain-text nature, the following characters are discouraged for general use in HTML. There may be some applications of HTML where they are necessary, but they may undermine interoperability or the flexibility afforded by a strict separation of semantics from styling.

Grapheme glyph control

ZERO WIDTH NON-JOINER (U+200C)
Provides a character to separate characters that would otherwise be joined as ligatures or by cursive connection.
ZERO WIDTH JOINER (U+200D)
Provides a character to join characters.

Together these characters allow authors to control ligation and cursive connections between glyphs on a case-by-case basis. The non-joiner prevents ligation and cursive connections between characters that would otherwise ligate or join cursively. The joiner encourages ligation and cursive connections.

[Issue: what should HTML5 say about these characters? What are the use cases? Do we want to require UA support? This is presentation only (the meaning should not be affected by non-ligation/ligation or cursive connections), however there may not be another mechanism to encode these ad hoc presentational aspects. A similar issue relates to the styling spaces and kerning and leading in general where an author/designer may want a one-time override of the default visual presentation and metrics of a font.]

Styling space

EN QUAD (U+2000)
EM QUAD (U+2001)
EN SPACE (U+2002)
EM SPACE (U+2003)
THREE-PER-EM SPACE (U+2004)
FOUR-PER-EM SPACE (U+2005)
SIX-PER-EM SPACE (U+2006)
FIGURE SPACE (U+2007)
PUNCTUATION SPACE (U+2008)
THIN SPACE (U+2009)
HAIR SPACE (U+200A)
MEDIUM MATHEMATICAL SPACE (U+205F)
NARROW NO-BREAK SPACE (U+202F)
IDEOGRAPHIC SPACE (U+3000)

To avoid these styling spaces, authors may use SPACE (U+0020) and NO-BREAK SPACE (U+00A0) and style the size of the space through other mechanisms. For example, where document semantics imply a larger or smaller space, authors may use the CSS 'word-spacing' property as well as the margin- and padding-related properties to adjust the placement of element-generated boxes relative to one another.
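
A conversion tool could replace the styling spaces listed above along these lines. This is only a sketch: it assumes FIGURE SPACE and NARROW NO-BREAK SPACE map to NO-BREAK SPACE because this page lists them among the no-break characters, and it leaves the actual width to CSS. The names are hypothetical.

```python
# Styling spaces that allow a line break are mapped to SPACE.
BREAKING = "\u2000\u2001\u2002\u2003\u2004\u2005\u2006\u2008\u2009\u200a\u205f\u3000"
# FIGURE SPACE and NARROW NO-BREAK SPACE are mapped to NO-BREAK SPACE.
NON_BREAKING = "\u2007\u202f"

STYLING_SPACE_TABLE = {ord(c): " " for c in BREAKING}
STYLING_SPACE_TABLE.update({ord(c): "\u00a0" for c in NON_BREAKING})


def replace_styling_spaces(text: str) -> str:
    """Replace each of the 14 styling spaces with SPACE or NO-BREAK SPACE."""
    return text.translate(STYLING_SPACE_TABLE)
```

The width information is lost by this replacement, so a real tool would also need to emit the corresponding styling.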

Compatibility

The compatibility characters in these regions and blocks of Unicode are presentational in nature and this presentation should be better handled through HTML semantics or CSS presentation. By avoiding these characters authors create a document that is more interoperable in terms of searching and collation while at the same time creating more consistent presentational idioms across many more styling options than could be provided by this finite number of characters.

Superscripts and Subscripts (U+2070 – U+209F) (48)
Use CSS and 'sub' and 'sup' elements
NOTE: There are other superscript and subscript characters outside this block that carry semantic values specific enough to justify relying on a font’s own glyphs to render the subscript or superscript and including such semantic encoding within a document.
Enclosed Alphanumerics (U+2460 – U+24FF) (160)
Use CSS border-radius
KangXi Radicals (U+2F00 – U+2FDF) (224)
Use CSS font-family
Hangul Compatibility Jamo (U+3130 – U+318F) (96)
Use CSS font-family
Enclosed CJK Letters and Months (U+3200 – U+32FF) (256)
Use CSS border-radius
CJK Compatibility (U+3300 – U+33FF) (256)
Use CSS font-family
BMP contiguous compatibility blocks (U+F900 – U+FFEF) (1,744)
Requires
- Opentype glyph substitution for Arabic forms
- Ligatures and composing character support for Latin, Armenian, Arabic and Hebrew ligature forms
- CSS vertical text layout for CJK half-width, full-width and vertical forms
- CSS border-radius and the ‘::before’ and ‘::after’ selectors for enclosed CJK letters and months
CJK Compatibility Ideographs Supplement (U+2F800 – U+2FA1F) (544)
Use CSS font-family

In addition to the compatibility characters in these specific blocks, there are other compatibility characters (as well as canonical equivalent characters) scattered throughout Unicode. Those characters are handled through general character foldings. Many of them are canonical equivalents or compatibility equivalents related to ligature composition and the like, and are not as presentational in nature. Different authors may choose to author documents leaning heavily toward decomposed characters or toward precomposed characters and ligatures. UA character foldings should handle the differences for those characters.

Control characters

Control characters consist of:

C0 U+0000 – U+001F & U+007F (except for TAB U+0009 and LINE FEED U+000A; see ‘Source formatting whitespace’)
C1 U+0080 – U+009F

While UAs should preserve these characters, they have no agreed-upon interoperable meaning within an HTML document. Authors MAY use them within an HTML document simply as a storage container. These characters will not be rendered and HTML5 does not define any behavior for them. There are other issues such as:

Though it can be added through the HTML5 DOM, there is no straightforward way to serialize and then de-serialize the character U+000D using the legacy text/html serialization.
XML 1.0 does not support all of the C0 control characters, but only supports: U+0009, U+000A, and U+000D. This leaves 29 of the 32 control characters in U+0000 – U+001F unsupported for serialization to XML 1.0.
XML 1.1 supports all of the C0 and C1 control characters except U+0000, but only as character references (for example, &#x0D; for 'carriage return' or &#x7F; for 'delete').
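
The XML 1.0 restriction can be enumerated directly. The sketch below lists the code points in U+0000 – U+001F that XML 1.0 cannot carry and formats a hexadecimal character reference of the kind XML 1.1 requires; the function names are hypothetical.

```python
# The only C0 control characters that XML 1.0 permits.
XML10_ALLOWED_C0 = {0x09, 0x0A, 0x0D}


def c0_unsupported_in_xml10():
    """Code points in U+0000..U+001F that XML 1.0 cannot serialize."""
    return [cp for cp in range(0x20) if cp not in XML10_ALLOWED_C0]


def char_ref(cp: int) -> str:
    """Format a hexadecimal numeric character reference for cp."""
    return f"&#x{cp:X};"
```

Counting U+0000 itself, this gives 29 code points in the U+0000 – U+001F range that XML 1.0 cannot represent.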

Other whitespace characters

Extended HTML whitespace
- FORM FEED (U+000C)
- VERTICAL TAB (U+000B)
- CARRIAGE RETURN (U+000D)
Other whitespace (not treated as whitespace)
- LINE SEPARATOR (U+2028)
- PARAGRAPH SEPARATOR (U+2029)
- NEXT LINE (U+0085)
  treated as horizontal ellipsis (U+2026) in some browsers and with some fonts
  - U+2026 |…|
  - U+0085 |…|
  - U+0085 literal |

Like U+000D, there is no straightforward way to serialize and then de-serialize these other new line characters using the legacy text/html serialization. Preserving these characters for a roundtrip likely requires using a character reference and might require XML 1.0 or XML 1.1 serialization or using some other DOM manipulation for the legacy text/html serialization.

Avoid

Specials

For interlinear annotation use Ruby markup. For object replacement use the 'object', 'img' or other replaced elements. The replacement character should not be included in documents deliberately.

INTERLINEAR ANNOTATION ANCHOR (U+FFF9)
INTERLINEAR ANNOTATION SEPARATOR (U+FFFA)
INTERLINEAR ANNOTATION TERMINATOR (U+FFFB)
OBJECT REPLACEMENT CHARACTER (U+FFFC)
REPLACEMENT CHARACTER (U+FFFD)

Deprecated

See Unicode for suitable substitutes for these characters (10). These may appear in legacy documents, but newly authored documents MUST NOT use them.

Arabic shaping and swapping control characters U+206A – U+206F
COMBINING GRAVE TONE MARK (U+0340) —
COMBINING ACUTE TONE MARK (U+0341) —
KHMER INDEPENDENT VOWEL QAQ (U+17A3) — use KHMER LETTER QA U+17A2
KHMER SIGN BATHAMASAT (U+17D3): use KHMER SYMBOL PATHAMASAT (U+19E0)

Bidirectional text controlling

Authors should not use these characters. Instead authors should use the 'dir' attribute combined either with nested phrase elements for bidirectional embedding levels or with nested 'bdo' elements for bidirectional override.

LEFT-TO-RIGHT EMBEDDING (U+202A)
RIGHT-TO-LEFT EMBEDDING (U+202B)
POP DIRECTIONAL FORMATTING (U+202C)
LEFT-TO-RIGHT OVERRIDE (U+202D)
RIGHT-TO-LEFT OVERRIDE (U+202E)

In contrast, authors may use the following marks to control the directionality of directionally neutral characters (see ‘Controlling bidirectional neutral characters’).

LEFT-TO-RIGHT MARK (U+200E)
RIGHT-TO-LEFT MARK (U+200F)
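
A checker could separate the two groups as follows: it reports the five embedding and override controls and leaves the two directional marks alone. This is a sketch with hypothetical names.

```python
# Embedding and override controls that authors should replace with markup.
AVOIDED_BIDI_CONTROLS = {
    0x202A: "LEFT-TO-RIGHT EMBEDDING",
    0x202B: "RIGHT-TO-LEFT EMBEDDING",
    0x202C: "POP DIRECTIONAL FORMATTING",
    0x202D: "LEFT-TO-RIGHT OVERRIDE",
    0x202E: "RIGHT-TO-LEFT OVERRIDE",
}


def find_avoided_bidi_controls(text: str):
    """Return (index, name) for each avoided control; U+200E/U+200F pass."""
    return [
        (i, AVOIDED_BIDI_CONTROLS[ord(ch)])
        for i, ch in enumerate(text)
        if ord(ch) in AVOIDED_BIDI_CONTROLS
    ]
```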

Surrogate

These characters (2,048) provide support for UTF-16 and other Unicode Transformation Formats (UTFs). They MUST NOT appear directly within a document. Some browsers support surrogate pair character references, but authors SHOULD NOT rely on these but instead SHOULD insert the literal character or a character reference for the actual character designated by the surrogate pair.

High surrogates U+D800 – U+DB7F
High private use surrogates U+DB80 – U+DBFF
Low surrogates U+DC00 – U+DFFF
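
To replace a surrogate pair reference with a reference to the actual character, a tool needs the standard UTF-16 pair arithmetic. The sketch below shows it; the function names are illustrative.

```python
def combine_surrogates(high: int, low: int) -> int:
    """Return the code point designated by a high/low surrogate pair."""
    if not 0xD800 <= high <= 0xDBFF:
        raise ValueError("not a high surrogate")
    if not 0xDC00 <= low <= 0xDFFF:
        raise ValueError("not a low surrogate")
    return 0x10000 + ((high - 0xD800) << 10) + (low - 0xDC00)


def char_reference(cp: int) -> str:
    """Format a hexadecimal character reference for the real code point."""
    return f"&#x{cp:X};"
```

For instance, the pair U+D834 U+DD1E designates U+1D11E (MUSICAL SYMBOL G CLEF), so an author would write `&#x1D11E;` rather than two surrogate references.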

Noncharacters

Unicode currently designates 66 code points as noncharacters. These are code points to which Unicode guarantees it will never assign characters; they are sometimes called illegal characters. They include the last two code points in each of the 17 Unicode planes: xFFFE and xFFFF. In addition, 32 code points in the middle of the Arabic Presentation Forms-A block, from U+FDD0 to U+FDEF, are noncharacters. As Unicode assigns characters in the other 11 planes the possibility of other noncharacter code points could arise. Authors should avoid using these code points for interoperability reasons. There is not, and should not be, any standardized approach to handling these code points.
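
The two rules in this paragraph translate into a short test. This is a sketch; the function name is hypothetical.

```python
def is_noncharacter(cp: int) -> bool:
    """True for U+FDD0..U+FDEF and for the last two code points of a plane."""
    return 0xFDD0 <= cp <= 0xFDEF or (cp & 0xFFFE) == 0xFFFE


def count_noncharacters() -> int:
    """Count noncharacters across all 17 planes."""
    return sum(1 for cp in range(0x110000) if is_noncharacter(cp))
```

The count is 17 × 2 + 32 = 66, in agreement with the figure above.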

Presentational requirements for avoiding compatibility characters

Superscripts and Subscripts (U+2070 – U+209F) (48)
Use CSS and 'sub' and 'sup' elements
Enclosed Alphanumerics (U+2460 – U+24FF) (160)
Use CSS border-radius
KangXi Radicals (U+2F00 – U+2FDF) (224)
Use CSS font-family
Hangul Compatibility Jamo (U+3130 – U+318F) (96)
Use CSS font-family
Enclosed CJK Letters and Months (U+3200 – U+32FF) (256)
Use CSS border-radius
CJK Compatibility (U+3300 – U+33FF) (256)
Use CSS font-family
BMP contiguous compatibility blocks (U+F900 – U+FFEF) (1,744)
Requires
- Opentype glyph substitution for Arabic forms
- Ligatures and composing character support for Latin, Armenian, Arabic and Hebrew ligature forms
- CSS vertical text layout for CJK half-width, full-width and vertical forms
- CSS border-radius and the ‘::before’ and ‘::after’ selectors for enclosed CJK letters and months
CJK Compatibility Ideographs Supplement (U+2F800 – U+2FA1F) (544)
Use CSS font-family
For 16 precomposed fraction characters in Number forms (U+2153 – U+215F) and Latin-1 Supplement (U+00BC, U+00BD, and U+00BE)
Document use and proper UA implementation of Fraction Slash (U+2044)

Requirements recap

Existing CSS2 capabilities
General text layout capabilities using Opentype or Opentype-like positional glyph substitution
Styling property for ligatures (suggest CSS WG liaison on this issue) and UA support for ligatures
CSS Vertical text properties and UA support for vertical text layout
CSS Border-radius for encircled characters and CSS2 properties and pseudo-selectors for other enclosings
Document use and proper UA implementation of Fraction Slash (U+2044)

UA conformance

Proposed UA conformance requirements for non-graphical characters and compatibility characters.

whitespace handling
serialization limitations for certain characters
character foldings
encourage UA normalization (NFC, NFD, NFKC, NFKD) as general character folding only; not as a lossy transformation of the text.
encourage authoring use of normalized characters through normalization and language aware input managers
- authors only need to be presented with one character equivalent or the other except in the case of singletons.
- for arbitrary character composing the decomposed character components should be exposed to authors (e.g., composing fractions through any characters combined with a fraction slash).
- canonical compatibility singletons should be presented as such to ensure authors understand the possibility of fidelity loss from subsequent normalization.
UAs should never perform lossy normalization of singletons except at author option and with proper warning to authors

Character foldings

UAs that perform searches and queries on HTML documents should pay particular attention to the special characters singled out on this page. Unicode’s technical report on Character Foldings, while still a draft, may eventually provide valuable guidance to UAs on handling these issues. In particular UAs performing searches should fold canonical and compatibility equivalent characters. UAs may fold differences between typewriter characters and more typographically appropriate characters. For example a user searching for the phrase: "Chuckie's revenge", will likely want results returned for either the typewriter character single-quote "'" or the curly quote "’". While these types of foldings do not yet correspond to any Unicode property, UAs still may want to perform such foldings.

Character foldings should account for

Canonical equivalents
Compatibility equivalents
Ignoring separators, control, and formatting characters
Ignoring Arabic Tail Fragment (U+FE73)
Characters with visually similar glyphs; for example:
- Latin letter A and Greek Alpha
- Left (“) and right (”) double quote and straight double quote (")
- Diacritic insensitive searches
- Double hyphen for various dashes
characters composed with combining half marks (U+FE20 – U+FE2F)
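
Several of the foldings in this list can be approximated with Python's standard `unicodedata` module. The sketch below is one possible folding for search, not a Unicode-defined algorithm: the quote and dash mappings are assumptions made for illustration, and only format characters (general category Cf) are ignored.

```python
import unicodedata

# Illustrative typewriter equivalents; not derived from any Unicode property.
TYPOGRAPHIC_FOLDS = str.maketrans({
    "\u2018": "'", "\u2019": "'",   # curly single quotes
    "\u201c": '"', "\u201d": '"',   # curly double quotes
    "\u2013": "-", "\u2014": "--",  # en dash, em dash
})


def fold(text: str) -> str:
    """Fold text for searching: equivalents, diacritics, format characters."""
    # Canonical and compatibility equivalents collapse under NFKD.
    text = unicodedata.normalize("NFKD", text)
    # Diacritic-insensitive: drop combining marks left by decomposition.
    text = "".join(ch for ch in text if not unicodedata.combining(ch))
    # Ignore formatting characters such as SOFT HYPHEN and WORD JOINER.
    text = "".join(ch for ch in text if unicodedata.category(ch) != "Cf")
    return text.translate(TYPOGRAPHIC_FOLDS)
```

With this folding, `fold("Chuckie\u2019s revenge")` and `fold("Chuckie's revenge")` compare equal, which is the behavior the example above asks for. The folded string is meant for comparison only and should not replace the document text.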

Editing and conversion UAs

Editing and conversion UAs SHOULD NOT insert or append characters under the "discouraged" heading and MUST NOT insert or append characters under the "avoid" heading.

UAs and Unicode normalization

For browsers and many other UAs, the bulk of their task is read-only when it comes to HTML. Any Unicode normalization must therefore be handled in memory rather than by altering the serialized document. In other words, UAs should be able to accomplish Unicode normalization without changing the original document.

Since Unicode normalization of a serialized document is lossy in nature, it is much better to accomplish it without altering the original document.

Canonical normalization (NFC, NFD)

Canonical normalization is a normalization between canonical equivalents. In general, the intention is that some characters included in Unicode have no semantic or presentational difference from other characters in Unicode. They may have been included in Unicode simply to express a distinction made in another character set or encoding. In the case of precomposed characters included from legacy encodings, Unicode also includes the decomposed characters for greater authoring flexibility. For example, Unicode includes a Latin letter "A", a combining ring above and also a precomposed Å from legacy character encodings. There is no semantic difference between the precomposed Å and the sequence A + combining ring above. With a properly designed font, there should also be no difference in the presentation of glyphs between the precomposed character and the sequence of base character plus combining character.

However, there are situations where UAs cannot be sure of such semantic and presentational equivalence. For canonical equivalent characters, the concern arises with singleton decomposable characters. For these characters, one character "decomposes" to one other character. However, the characters have different names, different code positions and sometimes even different glyphs. Therefore authors understand these characters as distinct and, in choosing one character over the other, expect that difference to be preserved. Normalizing singletons is therefore lossy: it discards a distinction the author chose to make.
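
Both cases can be observed with Python's standard `unicodedata` module. In the sketch below, `is_singleton` is a hypothetical helper that detects characters whose canonical decomposition is a single other character; ANGSTROM SIGN (U+212B) is used as the singleton example.

```python
import unicodedata


def is_singleton(ch: str) -> bool:
    """True if ch canonically decomposes to exactly one other character."""
    decomp = unicodedata.decomposition(ch)
    # Compatibility mappings start with a <tag>; canonical ones do not.
    return bool(decomp) and not decomp.startswith("<") and len(decomp.split()) == 1


def singletons_in(text: str):
    """List the characters in text that canonical normalization would replace."""
    return [ch for ch in text if is_singleton(ch)]
```

`unicodedata.normalize("NFC", "A\u030a")` yields the precomposed `"\u00c5"` and NFD reverses it, so nothing is lost in that round trip. By contrast, NFC turns ANGSTROM SIGN (U+212B) into U+00C5 and no normalization form restores it, which is the fidelity loss described above.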

Compatibility normalization (NFKC, NFKD)

The other area where Unicode normalization might affect semantics and presentation is with compatibility (non-canonical) normalization. Often, Unicode included these compatibility characters only because other encodings included them. From the perspective of the Unicode Consortium, they are properly understood as rich text constructs. These include ligatures, subscripts, superscripts, fractions, and many other characters that have plain-text equivalents, but lose both semantic content and presentational distinctions when the plain-text compatibility equivalent characters are substituted without at the same time applying semantic and presentational properties. Therefore the NFKC and NFKD normalization forms imply UAs must either perform only an in-memory normalization or must treat the normalization as a conversion of rich text to HTML (and optionally CSS).

Rich text examples: superscript, subscript, enclosed alphanumerics, enclosed CJK letters and months, CJK vertical text, CJK radicals.
Other examples: fractions, ligatures, CJK vertical text, Arabic forms.
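
The difference between a lossy compatibility normalization and a conversion to markup can be illustrated for superscripts and subscripts. The sketch below uses the `<super>` and `<sub>` decomposition tags exposed by Python's `unicodedata` module and is deliberately limited to the Superscripts and Subscripts block plus U+00B2, U+00B3 and U+00B9; the function name and that scope are assumptions.

```python
import html
import unicodedata


def _in_scope(cp: int) -> bool:
    # Superscripts and Subscripts block plus the Latin-1 superscript digits.
    return 0x2070 <= cp <= 0x209F or cp in (0xB2, 0xB3, 0xB9)


def to_html(text: str) -> str:
    """Convert in-scope superscript/subscript characters to sup/sub markup."""
    out = []
    for ch in text:
        decomp = unicodedata.decomposition(ch)
        if _in_scope(ord(ch)) and decomp.startswith(("<super>", "<sub>")):
            tag = "sup" if decomp.startswith("<super>") else "sub"
            plain = "".join(chr(int(cp, 16)) for cp in decomp.split()[1:])
            out.append(f"<{tag}>{html.escape(plain)}</{tag}>")
        else:
            out.append(html.escape(ch))
    return "".join(out)
```

Plain NFKC turns `"x\u00b2"` into `"x2"`, which changes the meaning, whereas `to_html("x\u00b2")` produces `x<sup>2</sup>` and keeps the distinction in markup.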

Normalization for 'find' on page

Ideally, a find string would be compared against the document by normalizing both the search string and the document.innerText string to Normalization Form Compatibility Composed (NFKC). This would produce the most positive matches (including some false positives). Other generalized character foldings may help users find text even more.
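
A minimal sketch of such a comparison, with plain strings standing in for the search string and the page text, and a hypothetical function name:

```python
import unicodedata


def find_normalized(needle: str, haystack: str) -> bool:
    """Report whether needle occurs in haystack after NFKC on both sides."""
    return unicodedata.normalize("NFKC", needle) in unicodedata.normalize("NFKC", haystack)
```

This only answers whether a match exists. Offsets in the normalized copy do not map directly back to the original text, so a UA that highlights matches would need to track that correspondence separately.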

Normalization for string length

Perhaps a graphemeCount() method for string objects. Normalization form compatibility composed (NFKC) should create the closest thing to a grapheme count of a string. However, authors may expect the actual character count at other times. This needs to also take into account surrogate characters for implementations that treat Unicode text internally as 16-bit text.
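
The competing notions of length mentioned here can be compared side by side. The sketch below reports code points, UTF-16 code units, and the NFKC length that this page suggests as an approximation of a grapheme count; it does not perform real grapheme cluster segmentation, and the function name is hypothetical.

```python
import unicodedata


def lengths(text: str) -> dict:
    """Return several candidate 'lengths' for the same string."""
    return {
        "code_points": len(text),
        # Each UTF-16 code unit is two bytes; surrogate pairs count twice.
        "utf16_units": len(text.encode("utf-16-le")) // 2,
        "nfkc": len(unicodedata.normalize("NFKC", text)),
    }
```

For the string A + COMBINING RING ABOVE + MUSICAL SYMBOL G CLEF, this yields 3 code points, 4 UTF-16 code units, and an NFKC length of 2.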

Normalization for indexing

Indices should make use of NFKD normalized strings for index terms, but preserve the original lossless form of the document (or a pointer to the original document depending on the UA).

Normalization and glyph selection

While UAs should preserve the text without normalization except for performing find or counting graphemes and such, another area where UAs may consider compatibility decompositions is in fallback glyph selection. When a UA cannot find a glyph for a particular character, it is advisable to look for a glyph from a canonical equivalent and then for a compatibility equivalent. For example, it would be better to display a '1' glyph for a missing '¹' than to display a generic last resort glyph. UAs may also use the decomposition keyword (<super> in this example) to adjust the layout and appearance of the glyph to synthesize the appropriate glyph from the decomposition mapping.
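
The decomposition mapping and its keyword are available from the Unicode character database, for example through Python's `unicodedata` module. The sketch below returns the keyword and the fallback characters for a single character; the function name is illustrative, and how a UA then styles the fallback glyph is left open.

```python
import unicodedata


def fallback_candidates(ch: str):
    """Return (keyword, fallback text) from the decomposition mapping of ch.

    keyword is None for canonical mappings; the text is empty when ch
    has no decomposition at all.
    """
    parts = unicodedata.decomposition(ch).split()
    keyword = None
    if parts and parts[0].startswith("<"):
        keyword = parts[0].strip("<>")
        parts = parts[1:]
    return keyword, "".join(chr(int(p, 16)) for p in parts)
```

For '¹' (U+00B9) this gives the keyword `super` and the fallback character '1', so a UA could draw a raised, smaller '1' instead of a last resort glyph.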

Test characters

MoinMoin inhibits this, so a separate document will provide a test page for the handling of these separators, joiners, special characters and other code points.

Code point	literal	reference	comment
U+0085	--	--	NEXT LINE or horizontal ellipses
U+2028	--	- -	line separator
U+2029	--	- -	paragraph separator
U+0000
U+0001
...