From HTML WG Wiki

Character use in HTML

To help authors and implementors understand the intricacies of Unicode characters, this page proposes to conduct some browser tests and add information to the HTML5 draft about unusual characters. Currently the proposal includes four tiers of character usage: 1) follow Unicode norms and guidance; 2) HTML5 provided norms and guidance; 3) Discourage use of characters in favor of other facilities; 4) Avoid any use of characters: these characters would be basically deprecated for use in HTML5.

This proposal also examines what presentational facilities HTML, CSS and other specifications require to dispense with all use of all characters in Unicode's compatibility range (and some other compatibility blocks).

While this section has the potential of growing too lengthy, it is worth considering that this provides a fairly comprehensive look at the usage, semantics and presentational implications of the entire Unicode standard. Since most characters are basic graphical characters — in other words, the characters simply map from a character to a glyph representing a letter, numeral, or punctuation or other mark within a particular language — we need say little about those characters. That leaves only a handful of character categories that need to be addressed:

  1. Byte-order mark (1)
  2. Separators (including spaces, tab, new line, and new paragraph) (26)
  3. Word boundary characters (4)
  4. Grapheme boundary characters (4)
  5. Bidirectional control characters (7)
  6. Compatibility area characters (≅3,000)
  7. Variation selectors (256)
  8. Control characters (65)
  9. Tags (97)
  10. Specials (5)
  11. Surrogates (2,048)
  12. Musical Notation formatting (8)
  13. Deprecated (10)
  14. Private use (137,468)
  15. Symbols (≅4,000)

Of those categories, only the following require discussion of individual characters in any depth, for a total of 14 characters:

  1. Byte-order mark (1)
  2. Separators (including spaces, tab, new line, and new paragraph) (3; space, tab, line feed)
  3. Word boundary characters (4; no-break space, zero-width no-break space, word joiner, no-break hyphen)
  4. Grapheme boundary characters (4; zero-width joiner, zero-width non-joiner, combining grapheme joiner, soft hyphen)
  5. Bidirectional control characters (2)


See Unicode guidance

Graphical characters

The main aim of Unicode is assigning plain-text graphical characters to its over 1 million code points. As of Unicode 5.0 nearly 100 thousand such graphical characters have been assigned from over 50 scripts (writing systems). In some sense the meaning of plain text is shaped by the Unicode standard. Plain text now includes over 3,000 symbols from various disciplines such as mathematics, geometry, and computer science. It also includes music notation, though implementation support for such musical notation may still be limited.

While these characters are typically called graphical characters, they more broadly correspond to: the letters (including alphabetic letters, syllabic letters, ideographs, etc.); numerals (Indic-Arabic base ten, Hexadecimal digits, Ancient Greek, Roman, East Asian, etc.); punctuation; diacritical and other marks; and so on. Graphically, one or more of these characters get mapped to one or more glyphs through a font or glyphlette (a small ad hoc font). However, when a language has been specified, these characters can also be mapped through speech synthesizers to audible words, or through embossed or refreshable Braille, for audible and tactile presentation as well.

Excluding compatibility characters

While Unicode focusses on defining plain text characters, it also seeks to provide a mechanism to translate all encodings into Unicode without any loss of information. It therefore includes several thousand characters that are not plain text per se. Instead these characters are deemed compatibility characters, as they often carry the same appearance and semantics as equivalent characters with rich text styling applied. When used with a language such as HTML, Unicode expects the higher level language to handle this styling in a more uniform way than these compatibility characters can provide. Therefore the use of such characters is discouraged in HTML.

Including symbols

In addition to supporting the letters, numerals, and marks from over 50 scripts, Unicode also includes over 3,000 symbols for various specialized technical, religious, political and other uses. Most of the blocks devoted to symbols occur in the range U+20A0 – U+2BFF, though symbol blocks exist outside this range too.

Variation selectors

Variation selectors allow a single code point to stand for up to 256 character or glyph variants. Mainly designed to deal with variants in East Asian ideographs, Unicode provides a registry where any vendor or author may register an official character variant for any of the over 70,000 ideograph code points assigned in Unicode 5.0.

  • 16 in BMP (U+FE00 – U+FE0F)
  • 240 in Supplementary Special-purpose Plane (U+E0100 – U+E01EF)
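
The two ranges above can be checked with a short sketch. The names `VS_RANGES` and `is_variation_selector` are illustrative only and do not come from any specification.

```python
# Variation selector ranges as listed above (BMP and Supplementary Special-purpose Plane).
VS_RANGES = ((0xFE00, 0xFE0F), (0xE0100, 0xE01EF))


def is_variation_selector(ch: str) -> bool:
    """Return True if the single character ch falls in either range."""
    cp = ord(ch)
    return any(lo <= cp <= hi for lo, hi in VS_RANGES)


# Number of code points covered by both ranges together.
total_selectors = sum(hi - lo + 1 for lo, hi in VS_RANGES)
```

Summing the two ranges gives the 16 + 240 = 256 selectors counted in the category list.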

Byte-order mark (U+FEFF)

The byte-order mark is an initial character in a text file that indicates the endianness of the text. It also helps UAs easily identify Unicode transformation format (UTF) text encodings. The code point with the opposite byte order, U+FFFE, has been permanently set aside as a Unicode noncharacter. This noncharacter code point (U+FFFE) therefore MUST NOT appear in a document, particularly as the first code point.

In addition, the byte-order mark character originally served a double-duty as a zero-width non-breaking space. This usage has now been deprecated so a byte-order mark MUST only appear as the first character in an HTML document (the very first character before all elements and before any prolog or any other characters).

To get the functionality formerly provided by the ZERO-WIDTH NON-BREAKING SPACE, authors should use a WORD JOINER (U+2060) instead (see Separators and joiners).
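
As a minimal sketch of how an editing tool might apply this rule, the hypothetical helper below keeps a leading U+FEFF and replaces any later occurrence with WORD JOINER. The function name is an assumption made for illustration.

```python
BOM = "\ufeff"          # BYTE ORDER MARK (formerly also ZERO WIDTH NO-BREAK SPACE)
WORD_JOINER = "\u2060"  # recommended replacement inside the text


def fix_interior_bom(text: str) -> str:
    """Keep a leading BOM; replace every later U+FEFF with WORD JOINER."""
    if text.startswith(BOM):
        head, rest = BOM, text[1:]
    else:
        head, rest = "", text
    return head + rest.replace(BOM, WORD_JOINER)
```

The sketch operates on decoded text, so it says nothing about how the BOM is written out as bytes in a particular encoding.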

Controlling bidirectional neutral characters

Authors may use the following characters to control the directionality of bidirectionally neutral characters.


For finer control of bidirectional text, authors should not use the other Unicode bidirectional control characters. Instead they SHOULD use the 'dir' attribute along with nested phrase elements to increase the embedding level, or nested 'bdo' elements (bidirectional override) to override the normal text direction.

Guided usage


  • LINE FEED (U+000A)
  • SPACE (U+0020)

Meaning of space separated

Space U+0020

Source formatting whitespace

  • Space U+0020
  • Horizontal Tab U+0009
  • Line Feed U+000A

Semantically significant whitespace

Discuss the 'pre' element or any elements where authors apply the CSS 'white-space' property.

Separator and joiner

Separators and joiners may have many different uses depending on what they join and separate and what aspect of joining they control. For example, the WORD JOINER (U+2060) joins words to prevent line breaks on characters that might otherwise signal an ideal place for a line break. In contrast, a standard SPACE (U+0020), as input from a keyboard spacebar, is a word separator in many scripts. It indicates the boundary between two words and signals a suitable location for a line break. Other separators and joiners join or separate graphemes that might be joined as ligatures (ligation) or joined through cursive connection of glyphs. Independent of the sorts of control available for controlling glyphs, the COMBINING GRAPHEME JOINER (U+034F) instead joins characters for differentiation in searching and collation.

Separators (Joiners), with joiners shown in parentheses:

  • Grapheme
    • Character semantics: N/A (Combining Grapheme Joiner)
    • Glyph presentation: Zero-width non-joiner (Zero-width joiner)
  • Syllable: Soft Hyphen
  • Word: Space; Zero-width space; (Word Joiner); (No-break Space); (Non-breaking hyphen)
  • Line: <br/>; U+000A as a character reference or literal wherever whitespace is relevant
  • Paragraph: <p></p>

These characters are special characters that add semantic content to the text by either joining what would otherwise be separated or separating what would otherwise be joined. For example, the WORD JOINER (U+2060) can be used to indicate that a character normally treated as a word boundary should not be treated as one, avoiding a line-break at that character. For word boundaries authors use the following characters:

  • SPACE (U+0020)
  • ZERO WIDTH SPACE (U+200B)
    Provides a character to separate words without visible whitespace.
  • WORD JOINER (U+2060)
  • Other script specific separators (2)
  • Other script specific formatting characters (16, including 8 music notation control characters)

Line-breaking is allowed at both ZERO WIDTH SPACE (collapsed) and SPACE (visible).

  • COMBINING GRAPHEME JOINER (U+034F)
    To control searching and collation of combined graphemes such as digraphs (e.g., the “Mc” in “McDonalds”)
    This has no effect on the glyph presentation of the characters joined into a grapheme cluster. To control the ligation and the cursive connection of glyphs use the Grapheme glyph control characters: ZERO WIDTH NON-JOINER (U+200C) and ZERO WIDTH JOINER (U+200D).

For mathematical expressions authors can use the following characters to encode machine-readable semantics.


For LINE SEPARATOR (U+2028) and PARAGRAPH SEPARATOR (U+2029) see ‘Other whitespace characters’: both are discouraged.

Line-break controlling

Lines never break within a grapheme cluster, and break within a word only where a hyphenation algorithm permits; otherwise they break at word boundaries. Unicode provides several characters to control these word boundaries, the most important being the WORD JOINER (U+2060) and the SPACE (U+0020). The WORD JOINER can be used to avoid a line-break by combining it with other characters that normally serve as a word boundary. So while Unicode provides NO-BREAK SPACE (U+00A0) to insert a space between words while not allowing a line-break, the same thing could be accomplished with the sequence WORD JOINER (U+2060) + SPACE (U+0020) + WORD JOINER (U+2060). Similarly, the sequence WORD JOINER (U+2060) + HYPHEN-MINUS (U+002D) + WORD JOINER (U+2060) is the same as NON-BREAKING HYPHEN (U+2011). While several no-break characters are provided in Unicode, the WORD JOINER can be combined with any word boundary character to create a no-break character in this way.
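
The sequences described above can be built mechanically. The sketch below only constructs the strings; the helper name `no_break` is illustrative, and whether a given UA actually suppresses the line break is exactly what the proposed browser tests would need to confirm.

```python
WJ = "\u2060"  # WORD JOINER


def no_break(boundary: str) -> str:
    """Wrap a word-boundary character in WORD JOINERs on both sides."""
    return WJ + boundary + WJ


# Intended to behave like NO-BREAK SPACE (U+00A0).
no_break_space = no_break("\u0020")
# Intended to behave like NON-BREAKING HYPHEN (U+2011).
no_break_hyphen = no_break("\u002d")
```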

No-Break characters

  • FIGURE SPACE (U+2007)

In addition to these characters the SOFT HYPHEN (U+00AD) can be used to provide line-break hints within words that UAs might not have in their own hyphenation dictionaries.

Private use

When using private use characters, authors should ensure that fonts or glyphlettes are provided in one way or another so that users can properly view the text. (Speech synthesizer and Braille issues remain to be addressed.)

  • BMP private use area (U+E000 – U+F8FF)
  • Supplementary private use area-A (U+F0000 – U+FFFFD)
  • Supplementary private use area-B (U+100000 – U+10FFFD)
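
A small sketch, using hypothetical names, shows how a checker might recognize private use code points from the three ranges above and confirms the total given in the category list.

```python
# Private use ranges as listed above.
PUA_RANGES = (
    (0xE000, 0xF8FF),      # BMP private use area
    (0xF0000, 0xFFFFD),    # Supplementary private use area-A
    (0x100000, 0x10FFFD),  # Supplementary private use area-B
)


def is_private_use(ch: str) -> bool:
    """Return True if the single character ch lies in a private use range."""
    cp = ord(ch)
    return any(lo <= cp <= hi for lo, hi in PUA_RANGES)


# Total private use code points across the three ranges.
private_use_total = sum(hi - lo + 1 for lo, hi in PUA_RANGES)
```

The total works out to 6,400 + 65,534 + 65,534 = 137,468.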


Due to the interoperability problems, presentational role, or non-plain-text nature of the characters, the following characters are discouraged for general use in HTML. There may be some applications of HTML where they are necessary, but they may undermine interoperability or the flexibility afforded by a strict separation of semantics from styling.

Grapheme glyph control

  • ZERO WIDTH NON-JOINER (U+200C)
    Provides a character to separate characters that would otherwise be joined.
  • ZERO WIDTH JOINER (U+200D)
    Provides a character to join characters.

Together these characters allow authors to control ligation and cursive connections between glyphs on a case by case basis. The non-joiner prevents ligation and cursive connections between characters that would otherwise ligate or join cursively. The joiner encourages ligation and cursive connections.

[Issue: what should HTML5 say about these characters? What are the use cases? Do we want to require UA support? This is presentation only (the meaning should not be affected by non-ligation/ligation or cursive connections), however there may not be another mechanism to encode these ad hoc presentational aspects. A similar issue relates to the styling spaces and kerning and leading in general where an author/designer may want a one-time override of the default visual presentation and metrics of a font.]

Styling space

  • EN QUAD (U+2000)
  • EM QUAD (U+2001)
  • EN SPACE (U+2002)
  • EM SPACE (U+2003)
  • FOUR-PER-EM SPACE (U+2005)
  • SIX-PER-EM SPACE (U+2006)
  • FIGURE SPACE (U+2007)
  • THIN SPACE (U+2009)
  • HAIR SPACE (U+200A)

To avoid using these styling spaces authors may use SPACE (U+0020) and NO-BREAK SPACE (U+00A0) and rely on other mechanisms to style the size of the space. For example, where document semantics imply a larger or smaller space, authors may use the CSS 'word-spacing' property as well as the margin and padding related properties to adjust the placement of element generated boxes relative to one another.
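
A conversion tool could apply this substitution with a translation table, as in the sketch below. The mapping of FIGURE SPACE to NO-BREAK SPACE is an assumption based on its listing among the no-break characters above; all names are illustrative. The width information is discarded, so the original spacing would have to be restored through CSS.

```python
# Styling spaces from the list above, excluding FIGURE SPACE.
STYLING_SPACES = "\u2000\u2001\u2002\u2003\u2005\u2006\u2009\u200a"

SPACE_TABLE = {ord(c): "\u0020" for c in STYLING_SPACES}
# Assumption: FIGURE SPACE (U+2007) does not allow a break, so map it to NO-BREAK SPACE.
SPACE_TABLE[0x2007] = "\u00a0"


def plain_spaces(text: str) -> str:
    """Replace styling spaces with SPACE or NO-BREAK SPACE."""
    return text.translate(SPACE_TABLE)
```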


The compatibility characters in these regions and blocks of Unicode are presentational in nature and this presentation should be better handled through HTML semantics or CSS presentation. By avoiding these characters authors create a document that is more interoperable in terms of searching and collation while at the same time creating more consistent presentational idioms across many more styling options than could be provided by this finite number of characters.

  • Superscripts and Subscripts (U+2070 – U+209F) (48)
    Use CSS and the <sup> and <sub> elements
    NOTE: There are other superscript and subscript characters outside this block that carry semantic values specific enough to rely on a font’s own glyphs to render the subscript and superscript and to include such semantic encoding within a document.
  • Enclosed Alphanumerics (U+2460 – U+24FF) (160)
    Use CSS border-radius
  • KangXi Radicals (U+2F00 – U+2FDF) (224)
    Use CSS font-family
  • Hangul Compatibility Jamo (U+3130 – U+318F) (96)
    Use CSS font-family
  • Enclosed CJK Letters and Months (U+3200 – U+32FF) (256)
    Use CSS border-radius
  • CJK Compatibility (U+3300 – U+33FF) (256)
    Use CSS font-family
  • BMP contiguous compatibility blocks (U+F900 – U+FFEF) (1,744)
    • Opentype glyph substitution for Arabic forms
    • Ligatures and composing character support for Latin, Armenian, Arabic and Hebrew ligature forms
    • CSS vertical text layout for CJK half-width, full-width and vertical forms
    • CSS border-radius and the ‘::before’ and ‘::after’ selectors for enclosed CJK letters and months
  • CJK Compatibility Ideographs Supplement (U+2F800 – U+2FA1F) (544)
    Use CSS font-family

In addition to the compatibility characters in these specific blocks there are other compatibility characters (as well as canonical equivalent characters) scattered throughout Unicode. Those characters are handled through general character foldings. Many of them are either canonical equivalents or compatibility equivalents related to ligature composition and the like, and are not as presentational in nature. Different authors may choose to author documents leaning heavily toward decomposed characters or toward precomposed characters and ligatures. UA character foldings should handle the differences for those characters.

Control characters

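Control characters (65) consist of:

  • C0 control characters (U+0000 – U+001F)
  • DELETE (U+007F)
  • C1 control characters (U+0080 – U+009F)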

While UAs should preserve these characters, they have no agreed upon interoperable meaning within an HTML document. Authors MAY use them within an HTML document simply as a storage container. These characters will not be rendered and HTML5 does not define any behavior for them. There are other issues such as:

  • Though it can be added through the HTML5 DOM, there is no straightforward way to serialize and then de-serialize the character U+000D using the legacy text/html serialization.
  • XML 1.0 does not support all of the C0 control characters, but only supports: U+0009, U+000A, and U+000D. This leaves 28 C0 control characters unsupported for serialization to XML 1.0
  • XML 1.1 supports all of the C0 and C1 control characters other than U+0000, but only as character references (for example, &#xD; for 'carriage return' or &#x7F; for 'delete').
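
The count of unsupported C0 characters can be reproduced with a short sketch. It assumes the count runs over U+0001 through U+001F, leaving out U+0000, which neither XML version permits; the function names are illustrative.

```python
# C0 control characters that XML 1.0 accepts: tab, line feed, carriage return.
XML10_ALLOWED_C0 = {0x09, 0x0A, 0x0D}


def c0_unsupported_in_xml10() -> list:
    """C0 code points (U+0001 - U+001F) that XML 1.0 cannot carry."""
    # Assumption: U+0000 is excluded from the count altogether.
    return [cp for cp in range(0x01, 0x20) if cp not in XML10_ALLOWED_C0]


def char_reference(cp: int) -> str:
    """Hexadecimal character reference of the kind XML 1.1 requires."""
    return f"&#x{cp:X};"
```

Under that assumption the list has 28 entries, matching the figure above.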

Other whitespace characters

  • Extended HTML whitespace
    • FORM FEED (U+000C)
    • VERTICAL TAB (U+000B)
  • Other whitespace (not treated as whitespace)
    • LINE SEPARATOR (U+2028)
    • NEXT LINE (U+0085)
      treated as horizontal ellipsis (U+2026) in some browsers and with some fonts
      • U+2026 |…|
      • U+0085 |…|
      • U+0085 literal |


Like U+000D, there is no straightforward way to serialize and then de-serialize these other new line characters using the legacy text/html serialization. Preserving these characters for a roundtrip likely requires using a character reference and might require XML 1.0 or XML 1.1 serialization or using some other DOM manipulation for the legacy text/html serialization.



For interlinear annotation use Ruby markup. For object replacement use the 'object', 'img' or other replaced elements. The replacement character should not be included in documents deliberately.



Tags are used to indicate language in plain text formats. Authors MUST NOT use tag characters, but instead SHOULD use the xml:lang or the lang attributes.

  • U+E0000 – U+E007F


See Unicode for suitable substitutes for these deprecated characters (10). These may appear in legacy documents, but newly authored documents MUST NOT use them.

  • Arabic shaping and swapping control characters U+206A – U+206F

Bidirectional text controlling

Authors should not use these characters. Instead authors should use the 'dir' attribute combined either with nested phrase elements for bidirectional embedding levels or with nested 'bdo' elements for bidirectional override.


Authors may use these characters to control the directionality of directionally neutral characters.



Surrogate code points (2,048) provide support for UTF-16. They MUST NOT appear directly within a document. Some browsers support surrogate pair character references, but authors SHOULD NOT rely on these and instead SHOULD insert the literal character or a character reference for the actual character designated by the surrogate pair.

  • High surrogates U+D800 – U+DB7F
  • High private use surrogates U+DB80 – U+DBFF
  • Low surrogates U+DC00 – U+DFFF
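
The character designated by a surrogate pair follows from the standard UTF-16 arithmetic. The sketch below, with illustrative function names, shows how a conversion tool might turn a pair into a single character reference.

```python
def combine_surrogates(high: int, low: int) -> int:
    """Return the code point designated by a high/low surrogate pair."""
    if not (0xD800 <= high <= 0xDBFF and 0xDC00 <= low <= 0xDFFF):
        raise ValueError("not a high/low surrogate pair")
    # Each surrogate contributes 10 bits; the result starts at U+10000.
    return 0x10000 + ((high - 0xD800) << 10) + (low - 0xDC00)


def reference_for_pair(high: int, low: int) -> str:
    """Character reference for the actual character, not for the surrogates."""
    return f"&#x{combine_surrogates(high, low):X};"
```

For example, the pair U+D834, U+DD1E designates U+1D11E, so the reference to use is `&#x1D11E;` and not two separate surrogate references.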


Unicode currently designates 66 code points as noncharacters. These are code points Unicode guarantees it will never assign characters to and are sometimes called illegal characters. These code points include the last two code points in each of the 17 Unicode planes: U+nFFFE and U+nFFFF. Also 32 code points in the middle of the Arabic Presentation Forms-A block from U+FDD0 to U+FDEF are noncharacters. As Unicode assigns characters in the other 11 planes the possibility of other noncharacter code points could arise. Authors should avoid using these code points for interoperability reasons. There is not, and should not be, any standardized approach to handling these code points.
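
A validator could test for noncharacters as in the following sketch; the function name is illustrative.

```python
def is_noncharacter(cp: int) -> bool:
    """True for U+FDD0 - U+FDEF and for the last two code points of each plane."""
    # (cp & 0xFFFE) == 0xFFFE holds exactly when the low 16 bits are FFFE or FFFF.
    return 0xFDD0 <= cp <= 0xFDEF or (cp & 0xFFFE) == 0xFFFE


# Count over the whole code space, U+0000 - U+10FFFF.
noncharacter_total = sum(is_noncharacter(cp) for cp in range(0x110000))
```

The count is 32 + (2 × 17) = 66, in agreement with the figure above.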

Presentational requirements for avoiding compatibility characters

  • Superscripts and Subscripts (U+2070 – U+209F) (48)
    Use CSS and the <sup> and <sub> elements
  • Enclosed Alphanumerics (U+2460 – U+24FF) (160)
    Use CSS border-radius
  • KangXi Radicals (U+2F00 – U+2FDF) (224)
    Use CSS font-family
  • Hangul Compatibility Jamo (U+3130 – U+318F) (96)
    Use CSS font-family
  • Enclosed CJK Letters and Months (U+3200 – U+32FF) (256)
    Use CSS border-radius
  • CJK Compatibility (U+3300 – U+33FF) (256)
    Use CSS font-family
  • BMP contiguous compatibility blocks (U+F900 – U+FFEF) (1,744)
    • Opentype glyph substitution for Arabic forms
    • Ligatures and composing character support for Latin, Armenian, Arabic and Hebrew ligature forms
    • CSS vertical text layout for CJK half-width, full-width and vertical forms
    • CSS border-radius and the ‘::before’ and ‘::after’ selectors for enclosed CJK letters and months
  • CJK Compatibility Ideographs Supplement (U+2F800 – U+2FA1F) (544)
    Use CSS font-family
  • For 16 precomposed fraction characters in Number forms (U+2153 – U+215F) and Latin-1 Supplement (U+00BC, U+00BD, and U+00BE)
    Document use and proper UA implementation of Fraction Slash (U+2044)
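
The relationship between the precomposed fractions and FRACTION SLASH is visible in their compatibility decompositions. The sketch below uses Python's standard `unicodedata` module; the helper names are illustrative.

```python
import unicodedata

FRACTION_SLASH = "\u2044"


def decompose_fraction(ch: str) -> str:
    """Compatibility decomposition (NFKD) of a precomposed fraction character."""
    return unicodedata.normalize("NFKD", ch)


def compose_fraction(numerator: int, denominator: int) -> str:
    """Arbitrary fraction written as digits around FRACTION SLASH."""
    return f"{numerator}{FRACTION_SLASH}{denominator}"
```

VULGAR FRACTION ONE HALF (U+00BD) decomposes to the same sequence that `compose_fraction(1, 2)` produces, which is why proper UA rendering of U+2044 would make the precomposed characters unnecessary.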

Requirements recap

  • Existing CSS2 capabilities
  • General text layout capabilities using Opentype or Opentype-like positional glyph substitution
  • Styling property for ligatures (suggest CSS WG liaison on this issue) and UA support for ligatures
  • CSS Vertical text properties and UA support for vertical text layout
  • CSS Border-radius for encircled characters and CSS2 properties and pseudo-selectors for other enclosings
  • Document use and proper UA implementation of Fraction Slash (U+2044)

UA conformance

Proposed UA conformance requirements for non-graphical characters and compatibility characters.

  • whitespace handling
  • serialization limitations for certain characters
  • character foldings
  • encourage UA normalization (NFC, NFD, NFKC, NFKD) as general character folding only; not as a lossy transformation of the text.
  • encourage authoring use of normalized characters through normalization and language aware input managers
    • authors only need to be presented with one character equivalent or the other except in the case of singletons.
    • for arbitrary character composing the decomposed character components should be exposed to authors (e.g., composing fractions through any characters combined with a fraction slash).
    • canonical compatibility singletons should be presented as such to ensure authors understand the possibility of fidelity loss from subsequent normalization.
  • UAs should never perform lossy normalization of singletons except at author option and with proper warning to authors

Character foldings

UAs that perform searches and queries on HTML documents should pay particular attention to the special characters singled out on this page. Unicode’s technical report on Character Foldings, while still a draft, may eventually provide valuable guidance to UAs on handling these issues. In particular UAs performing searches should fold canonical and compatibility equivalent characters. UAs may fold differences between typewriter characters and more typographically appropriate characters. For example a user searching for the phrase: "Chuckie's revenge", will likely want results returned for either the typewriter character single-quote "'" or the curly quote "’". While these types of foldings do not yet correspond to any Unicode property, UAs still may want to perform such foldings.

Character foldings should account for

  • Canonical equivalents
  • Compatibility equivalents
  • Ignoring separators, control, and formatting characters
  • Ignoring Arabic Tail Fragment (U+FE73)
  • Characters with visually similar glyphs; for example:
    • Latin letter A and Greek Alpha
    • Left (“) and right (”) double quote and straight double quote (")
    • Diacritic insensitive searches
    • Double hyphen for various dashes
  • characters composed with combining half marks (U+FE20 – U+FE2F)
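
A rough sketch of a search folding along these lines follows. It covers only part of the list: equivalents (through NFKD), diacritics, format characters, and curly quotes. The quote table, the use of case folding, and the function name are assumptions made for illustration; visually similar letters such as Latin A and Greek Alpha are not handled.

```python
import unicodedata

# Assumed folding of curly quotes to their typewriter counterparts.
QUOTE_TABLE = {0x2018: "'", 0x2019: "'", 0x201C: '"', 0x201D: '"'}


def fold(text: str) -> str:
    """Fold text for comparison; never use the result to replace the document."""
    # Canonical and compatibility equivalents.
    text = unicodedata.normalize("NFKD", text)
    # Diacritic-insensitive: drop combining marks.
    text = "".join(c for c in text if not unicodedata.combining(c))
    # Ignore format characters (general category Cf), e.g. WORD JOINER.
    text = "".join(c for c in text if unicodedata.category(c) != "Cf")
    return text.translate(QUOTE_TABLE).casefold()
```

With this folding, `fold("Chuckie’s revenge")` and `fold("Chuckie's revenge")` compare equal, which is the behavior described for the search example above.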

Editing and conversion UAs

Editing and conversion UAs SHOULD NOT insert or append characters under the "discouraged" heading and MUST NOT insert or append characters under the "avoid" heading.

UAs and Unicode normalization

For browsers and many other UAs, the bulk of their task is read-only when it comes to HTML. Any Unicode normalization must therefore be handled in memory, and UAs should be able to accomplish it without altering the original serialized document.

Since Unicode normalization of a serialized document is lossy in nature, it is much better to accomplish it without altering the original document.

Canonical normalization (NFC, NFD)

Canonical normalization is a normalization between canonical equivalents. In general the intention is that some characters included in Unicode have no semantic or presentational difference from other characters in Unicode. They may have been included in Unicode simply to express a distinction made in another character set or encoding. In the case of precomposed characters included from legacy encodings, Unicode also includes the decomposed characters for greater authoring flexibility. For example, Unicode includes a Latin letter "A", a combining ring above and also a precomposed Å from legacy character encodings. There is no semantic difference between the precomposed Å and the sequence A + combining ring above. With a properly designed font, there should also be no difference in the presentation of glyphs between the precomposed character and the sequence of base character plus combining character.

However, there are situations where UAs cannot be sure of such semantic and presentation equivalents. For canonical equivalent characters, the concern arises with singleton decomposable characters. For these characters, one character "decomposes" to one other character. However, the characters have different names, different code positions and sometimes even different glyphs. Therefore authors understand these characters as distinct and, in choosing one character over the other, expect that difference to be preserved. Normalizing such singletons therefore discards a distinction the author intended to keep.
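
Both cases can be observed with Python's standard `unicodedata` module. The `singletons` helper is a hypothetical name for a check an authoring tool might run before normalizing.

```python
import unicodedata

# Lossless in practice: A + COMBINING RING ABOVE composes to U+00C5.
composed = unicodedata.normalize("NFC", "A\u030a")

# Singleton: ANGSTROM SIGN (U+212B) is replaced by U+00C5 and cannot be recovered.
angstrom = unicodedata.normalize("NFC", "\u212b")


def singletons(text: str) -> list:
    """Characters whose canonical decomposition is a single other code point."""
    found = []
    for ch in text:
        mapping = unicodedata.decomposition(ch)
        # Compatibility mappings start with a <keyword>; canonical ones do not.
        if mapping and not mapping.startswith("<") and len(mapping.split()) == 1:
            found.append(ch)
    return found
```

After normalization both strings are the single character U+00C5, so a document that used ANGSTROM SIGN deliberately has lost that choice. A tool could call `singletons` first and warn the author.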

Compatibility normalization (NFKC, NFKD)

The other area where Unicode normalization might affect semantics and presentation is with compatibility (non-canonical) normalization. Often with these compatibility characters, Unicode included these characters only because other encodings included them. From the perspective of the Unicode consortium, they are properly understood as rich text constructs. These include ligatures, subscripts, superscripts, fractions, and many other characters that have plain-text equivalents, but lose both semantic content and presentational distinctions when substituting the plain-text compatibility equivalent characters without at the same time applying semantic and presentational properties. Therefore NFKC and the NFKD normalization forms imply UAs must either perform only an in-memory normalization or must treat the normalization as a conversion of rich text to HTML (and optionally CSS).

  • Rich text examples: superscript, subscript, enclosed alphanumerics, enclosed CJK letters and months, CJK vertical text, CJK radicals.
  • Other examples: fractions, ligatures, CJK vertical text, Arabic forms.

Normalization for 'find' on page

Ideally, a find string and the document would be compared by normalizing both the search string and the document.innerText string to Normalization Form KC (NFKC, compatibility composed). This would produce the greatest number of positive matches (including some false positives). Other generalized character foldings may help users find text even more.
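
A minimal sketch of such a comparison follows, using Python's standard `unicodedata` module. The function name is illustrative, and the sketch only reports whether a match exists; mapping a match back to positions in the original, unnormalized text is left out.

```python
import unicodedata


def find_normalized(needle: str, haystack: str) -> bool:
    """Compare after normalizing both strings to NFKC, in memory only."""
    normalized_needle = unicodedata.normalize("NFKC", needle)
    normalized_haystack = unicodedata.normalize("NFKC", haystack)
    return normalized_needle in normalized_haystack
```

For instance, a search for "office" would match text containing the LATIN SMALL LIGATURE FI (U+FB01), and a search for precomposed Å would match the sequence A + combining ring above.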

Normalization for string length

Perhaps a graphemeCount() method for string objects. Normalization form compatibility composed (NFKC) should create the closest thing to a grapheme count of a string. However, authors may expect the actual character count at other times. This needs to also take into account surrogate characters for implementations that treat Unicode text internally as 16-bit text.
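
The different notions of length can be compared side by side. This sketch is not the proposed graphemeCount() method; it uses hypothetical names, and the NFKC count is only an approximation of a grapheme count, since sequences with no precomposed form remain several code points long.

```python
import unicodedata


def length_report(s: str) -> dict:
    """Three candidate lengths for the same string."""
    return {
        # Actual character (code point) count.
        "code_points": len(s),
        # Code units seen by implementations that store text as 16-bit units.
        "utf16_units": len(s.encode("utf-16-le")) // 2,
        # Code points after NFKC, as an approximate grapheme count.
        "nfkc_code_points": len(unicodedata.normalize("NFKC", s)),
    }


# "e" + COMBINING ACUTE ACCENT, followed by a supplementary-plane character.
report = length_report("e\u0301\U0001D11E")
```

For this sample the three values are 3, 4 and 2 respectively, which illustrates why authors may expect different answers at different times.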

Normalization for indexing

Indices should make use of NFKD normalized strings for index terms, but preserve the original lossless form of the document (or a pointer to the original document depending on the UA).

Normalization and glyph selection

While UAs should preserve the text without normalization except for performing find or counting graphemes and such, another area where UAs may consider compatibility decompositions is in fallback glyph selection. When a UA cannot find a glyph for a particular character, it is advisable to look for a glyph from a canonical equivalent and then for a compatibility equivalent. For example, it would be better to display a '1' glyph for a missing '¹' than to display a generic last resort glyph. UAs may also use the decomposition keyword (<super> in this example) to adjust the layout and appearance of the glyph to synthesize the appropriate glyph from the decomposition mapping.
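
The decomposition mapping and its keyword are available from the Unicode Character Database. The sketch below reads them through Python's standard `unicodedata` module; the function name and return shape are illustrative, not a proposed API.

```python
import unicodedata


def glyph_fallback(ch: str):
    """Return (keyword, fallback text) from the decomposition mapping, or None."""
    mapping = unicodedata.decomposition(ch)
    if not mapping:
        return None  # no equivalent to fall back to
    parts = mapping.split()
    # Compatibility mappings carry a keyword such as <super>; canonical ones do not.
    keyword = parts[0] if parts[0].startswith("<") else None
    codes = parts[1:] if keyword else parts
    return keyword, "".join(chr(int(code, 16)) for code in codes)
```

For SUPERSCRIPT ONE (U+00B9) this yields the keyword `<super>` and the fallback text "1", so a UA lacking a glyph for '¹' could render a raised '1' instead of a last resort glyph.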

Test characters

MoinMoin inhibits this, so a separate document will provide a test page for the handling of these separators, joiners, special characters and other code points.

Code point literal reference comment
U+0085 -- -&#x85;- NEXT LINE or horizontal ellipsis
U+2028 -- -&#x2028;- line separator
U+2029 -- -&#x2029;- paragraph separator
