Characters or markup?


There is a range of control-like Unicode characters, some of which fulfill the same role as markup. Which should I use, and which should I avoid?


The answer depends on which characters are being considered. For more detail, read the W3C Note and Unicode Technical Report "Unicode in XML & Other Markup Languages". This article summarizes some of that information.

Some Unicode characters are not suitable for use with markup

The following table lists Unicode characters that should not be used in a markup context, according to Unicode in XML & Other Markup Languages. You should use markup instead.

Names/Description | Short Comment
Line and paragraph separator | use <br>, <p>, or equivalent
BIDI embedding controls (LRE, RLE, LRO, RLO, PDF) | Strongly discouraged where markup exists.
Activate/Inhibit Symmetric swapping | Deprecated in Unicode
Activate/Inhibit Arabic form shaping | Deprecated in Unicode
Activate/Inhibit National digit shapes | Deprecated in Unicode
Interlinear annotation characters | Use ruby markup
Byte order mark / ZWNBSP | Use only as byte order mark. Use U+2060 Word Joiner instead of using U+FEFF as ZWNBSP
Object replacement character | Use markup, e.g. HTML <object> or HTML <img>
Scoping for Musical Notation | Use an appropriate markup language
Language Tag code points | Use lang and/or xml:lang
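
The table above can be turned into a rough check for content that is about to be placed in markup. The following is a minimal sketch, not a definitive implementation: the name find_discouraged is hypothetical, and the code point ranges are my own reading of the Unicode charts for the categories listed, so they should be verified before being relied on.

```python
# Code point ranges for the categories in the table above.
# Assumption: these ranges are taken from the Unicode code charts, not from this article.
DISCOURAGED = [
    (0x2028, 0x2029, "line/paragraph separator: use <br>, <p>, or equivalent"),
    (0x202A, 0x202E, "bidi embedding control: use markup where it exists"),
    (0x206A, 0x206B, "symmetric swapping control: deprecated in Unicode"),
    (0x206C, 0x206D, "Arabic form shaping control: deprecated in Unicode"),
    (0x206E, 0x206F, "national digit shapes control: deprecated in Unicode"),
    (0xFFF9, 0xFFFB, "interlinear annotation character: use ruby markup"),
    (0xFEFF, 0xFEFF, "U+FEFF used as ZWNBSP: use U+2060 WORD JOINER"),
    (0xFFFC, 0xFFFC, "object replacement character: use markup"),
    (0x1D173, 0x1D17A, "musical notation scoping: use a markup language"),
    (0xE0001, 0xE0001, "language tag: use lang and/or xml:lang"),
]


def find_discouraged(text):
    """Return (index, code point label, advice) for each discouraged character."""
    hits = []
    for index, ch in enumerate(text):
        cp = ord(ch)
        if cp == 0xFEFF and index == 0:
            continue  # a leading U+FEFF is a legitimate byte order mark
        for low, high, advice in DISCOURAGED:
            if low <= cp <= high:
                hits.append((index, f"U+{cp:04X}", advice))
                break
    return hits


if __name__ == "__main__":
    sample = "\ufeffFirst line\u2028second line"
    for index, label, advice in find_discouraged(sample):
        print(index, label, advice)
```

The check only reports positions. Deciding what markup should replace each character is left to the author, because that depends on the markup language in use.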

The bidirectional text embedding controls, in particular, often cause confusion. In some places they have to be used to produce correctly ordered bidirectional text in languages that use right-to-left scripts, such as Arabic, Hebrew, and Thaana. These are places where an element doesn't allow embedded markup, such as the title element. Where markup is available, however, you should use it. For more information, see "Unicode controls vs. markup for bidi support". For guidance on how to use the embedding controls in situations where markup cannot be used, see "Using Unicode controls for bidi text".
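
The distinction can be sketched in a few lines of code. This is an illustration under stated assumptions rather than a recommended library: the helper names rtl_in_plain_text and rtl_in_markup are hypothetical, and the span element with a dir attribute is just one way of expressing direction in HTML.

```python
import html

RLE = "\u202B"  # RIGHT-TO-LEFT EMBEDDING
PDF = "\u202C"  # POP DIRECTIONAL FORMATTING


def rtl_in_plain_text(phrase):
    """Wrap a right-to-left phrase in embedding controls.

    Intended only for contexts that cannot contain markup, such as <title>.
    """
    return f"{RLE}{phrase}{PDF}"


def rtl_in_markup(phrase):
    """Wrap a right-to-left phrase in markup, the preferred form elsewhere."""
    return f'<span dir="rtl">{html.escape(phrase)}</span>'


if __name__ == "__main__":
    hebrew = "\u05E9\u05DC\u05D5\u05DD"  # a Hebrew word, written with escapes
    print(f"<title>Review of {rtl_in_plain_text(hebrew)}</title>")
    print(f"<p>Review of {rtl_in_markup(hebrew)}</p>")
```

The first helper pairs every RLE with a PDF, because an embedding that is never closed affects the ordering of all the text that follows it.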

Other Unicode characters are OK

This is not an exhaustive list. It is merely intended to provide some examples of Unicode characters that are valid for use in addition to markup to provide information about the text.

Names/Description | Short Comment
Various | No-break space, Soft Hyphen, Combining Grapheme Joiner, Non-breaking Hyphen, Word Joiner, etc.
Zero-width Joiners (ZWJ and ZWNJ) | e.g. required for Persian
Implicit directional marks (LRM and RLM) |
Subtending marks | common feature in the Arabic and Syriac scripts
Variation Selectors | e.g. required for Mongolian
Ideographic Description Characters | indicate the composition of ideographs
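
Several of these characters are invisible or easy to overlook in source text, which makes them hard to review. The sketch below shows one possible way to make them visible while editing. The function name reveal and the choice of characters are assumptions made for illustration, and the list covers only a few of the characters named in the table.

```python
import unicodedata

# A few of the acceptable characters from the table above that are
# hard to spot in source text (an illustrative selection, not a full list).
HARD_TO_SPOT = (
    "\u00A0"  # NO-BREAK SPACE
    "\u00AD"  # SOFT HYPHEN
    "\u034F"  # COMBINING GRAPHEME JOINER
    "\u2011"  # NON-BREAKING HYPHEN
    "\u2060"  # WORD JOINER
    "\u200C"  # ZERO WIDTH NON-JOINER
    "\u200D"  # ZERO WIDTH JOINER
    "\u200E"  # LEFT-TO-RIGHT MARK
    "\u200F"  # RIGHT-TO-LEFT MARK
)


def reveal(text):
    """Replace hard-to-spot characters with their bracketed Unicode names."""
    return "".join(
        f"[{unicodedata.name(ch)}]" if ch in HARD_TO_SPOT else ch
        for ch in text
    )


if __name__ == "__main__":
    print(reveal("a\u200Cb c\u00A0d"))
```

The output is meant for inspection only. The original text, with the characters intact, is what should be stored and published.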

'Compatibility characters' vary in appropriateness

This is taken from Unicode in XML & Other Markup Languages:

The Unicode Standard provides compatibility mappings for a number of characters. Compatibility mappings indicate a relationship to another character, but the exact nature of the relationship varies. In some cases the relationship means "is based on", in some other cases it denotes a property. When plain text is marked up, it may make sense to map some of these characters to their compatibility equivalents and suitable markup. It is important to understand the nature of the distinctions between characters and their compatibility equivalents and the context in which these distinctions matter. It is never advisable to apply compatibility mappings indiscriminately.

The following table gives a non-exhaustive list of examples.

Names/Description | Examples | Verdict
Circled letters and digits used for list item markers | ① ② ③ Ⓐ Ⓑ Ⓒ ㊂ ㊃ ㊄ ㊓ ㊔ ㊕ ㋝ ㋞ ㋟ | OK
Parenthesized or dotted numbers used as list item markers | ⑴ ⑵ ⑶ | use list item marker style
Arabic Presentation forms | ﻉ ﻊ ﻋ ﻌ | normalize
Half-width and full-width characters | ヤ ユ ヨ ラ a b c d | OK
Superscripted and subscripted characters | ¹ ² ³ ₁ ₂ ₃ | use <sup> or <sub> markup
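
The difference between indiscriminate and selective mapping can be illustrated with Python's standard unicodedata module. This is a sketch under stated assumptions: the function name digits_to_markup is hypothetical, and the conversion is deliberately limited to superscript and subscript digits, leaving every other compatibility character untouched.

```python
import unicodedata


def digits_to_markup(text):
    """Map superscript/subscript digits to <sup>/<sub> markup, nothing else."""
    tags = {"<super>": "sup", "<sub>": "sub"}
    out = []
    for ch in text:
        # For compatibility characters the decomposition looks like
        # "<super> 0032"; for most other characters it is empty.
        kind, _, rest = unicodedata.decomposition(ch).partition(" ")
        base = "".join(chr(int(cp, 16)) for cp in rest.split())
        if kind in tags and base.isascii() and base.isdigit():
            out.append(f"<{tags[kind]}>{base}</{tags[kind]}>")
        else:
            out.append(ch)  # circled digits, presentation forms, etc. stay as-is
    return "".join(out)


if __name__ == "__main__":
    sample = "x\u00B2 + H\u2082O \u2460"
    # Blanket NFKC normalization flattens every compatibility character,
    # losing the superscript, the subscript, and the circled digit alike.
    print(unicodedata.normalize("NFKC", sample))
    # The selective mapping keeps the distinctions as markup or as characters.
    print(digits_to_markup(sample))
```

Restricting the mapping to digits is a design choice made for this sketch. Other characters with a superscript or subscript mapping would need a case-by-case decision, in line with the advice quoted above.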