Characters, Entities and Fonts

7.1 Introduction

Notation and symbols have proved very important for mathematics. Mathematics has grown in part because its notation continually changes toward being succinct and suggestive. Many new signs have been developed for use in mathematical notation, and many have been adopted that were originally introduced elsewhere.The result is that mathematics makes use of a very large collection of symbols. It is difficult to write mathematics fluently if these characters are not available for use. It is difficult to read mathematics if corresponding glyphs are not available for presentation on specific display devices.

The W3C Math Working Group therefore took on the job of specifying part of the mechanism needed to proceed from notation to final presentation, and has collaborated with the Unicode Technical Committee (UTC) and the STIX Fonts Project in undertaking specification of the rest.

This chapter contains discussion of characters for use within MathML, recommendations for their use, and warnings concerning the correct form of the corresponding code points given in the Universal Multiple-Octet Coded Character Set (UCS) [ISO10646] as codified in Unicode [Unicode]. For simplicity we refer to this character set by the short name Unicode. Unless otherwise stated, the mappings discussed in this chapter and elsewhere in the MathML 3.0 recommendation are based on Unicode 5.2. Conformant MathML processors (see Section 2.3 Conformance) are free to use characters defined in Unicode 5.2 or later.

While a long process of review and adoption by UTC and ISO/IEC of the characters of special interest to mathematics and MathML is now complete, more characters may be added in the future. For the latest character tables and font information, see the [Entities] and the Unicode Home Page, notably Unicode Work in Progress and Unicode Technical Report #25 “Unicode Support for Mathematics”.

A MathML token element (see Section 3.2 Token Elements, Section 4.2.1 Numbers <cn>, Section 4.2.2 Content Identifiers <ci>, Section 4.2.3 Content Symbols <csymbol>) takes as content a sequence of MathML characters or mglyph elements. The latter are used to represent characters that do not have a Unicode encoding, as described in Section 3.2.1.2 Using images to represent symbols <mglyph/>. The need for mglyph should be rare because Unicode 3.1 provided approximately one thousand alphabetic characters for mathematics, and Unicode 3.2 added over 900 more special mathematical symbols.

7.2 Unicode Character Data

Any character allowed by XML may be used in MathML. More precisely, the legal Unicode characters have the hexadecimal code numbers 09 (tab = U+0009), 0A (line feed = U+000A), 0D (carriage return = U+000D), 20-D7FF (U+0020..U+D7FF), E000-FFFD (U+E000..U+FFFD), and 10000-10FFFF (U+10000..U+10FFFF). The exclusions above code number D7FF are of the blocks used in surrogate pairs, and the two characters guaranteed not to be Unicode characters at all. U+FFFE is excluded to allow determination of byte order in certain encodings.

There are essentially three different ways of encoding character data in an XML document.

Using characters directly: For example, the 'é' (character U+00E9 [LATIN SMALL LETTER E WITH ACUTE]) may have been inserted. This option is only useful if the character encoding specified for the XML document includes the character intended. Note that if the document is, for example, encoded in Latin-1 (ISO-8859-1) then only the characters in that encoding are available directly; for instance character U+00E9 (eacute) is, but character U+03B1 (alpha) is not.
Using numeric XML character references: For example, 'é' may be represented as é (decimal) or é (hex), or é (decimal) or é. Note that the numbers in the character references always refer to the Unicode encoding (and not to the character encoding used in the XML file). By using character references it is always possible to access the entire Unicode range.
Using entity references: The MathML DTD defines internal entities that expand to character data. Thus for example the entity reference é may be used rather than the character reference é. An XML fragment that uses an entity reference which is not defined in a DTD is not well-formed; therefore it will be rejected by an XML parser. For this reason every fragment using entity references must use a DOCTYPE declaration which specifies the MathML DTD, or a DTD that at least declares any entity reference used in the MathML instance. The need to use a DOCTYPE complicates inclusion of MathML in some documents. However, entity references can be useful for small illustrative examples.

7.3 Entity Declarations

Earlier versions of this MathML specification included detailed listings of the entity definitions to be used with the MathML DTD. These entity definitions are of more general use, and have now been separated into an ancillary document, XML Entity Definitions for Characters [Entities]. The tables there list the entity names and the corresponding Unicode character references. That document describes several entity sets; not all of them are used in the MathML DTD. The MathML DTD references the combined HTML MathML entity set defined in [Entities].

7.4 Special Characters Not in Unicode

For special purposes, one may need a symbol which does not have a Unicode representation. In these cases one may use the mglyph element for direct access to a glyph as an image, or (in some systems) from a font that uses a non-Unicode encoding. All MathML token elements accept characters in their content and also accept an mglyph there. Beware, however, that use of mglyph to access a font is deprecated and the mechanism may not work in all systems. The mglyph element should always supply a useful alternative representation in its alt attribute.

7.5 Mathematical Alphanumeric Symbols

In mathematical and scientific writing, single letters often denote variables and constants in a given context. The increasing complexity of science has led to the use of certain common alphabet and font variations to provide enough special symbols of this letter-like type. These denotations are generally not letters that may be used to make up words with recognized meanings, but individual carriers of semantics themselves. Writing a string of such symbols is usually interpreted in terms of some composition law, for instance, multiplication. Many letter-like symbols may be quickly interpreted as of a certain mathematical type by specialists in a given area: for instance, bold symbols, whether based on Latin or Greek letters, as vectors in physics or engineering, or Fraktur symbols as Lie algebras in part of pure mathematics.

The additional Mathematical Alphanumeric Symbols provided in Unicode 3.1 have code points in the range U+1D400 to U+1D7FF in Plane 1, that is, in the first plane with Unicode values higher than 2¹⁶. This plane of characters is also known as the Secondary Multilingual Plane (SMP), in contrast to the Basic Multilingual Plane (BMP) which was originally the entire extent of Unicode. Support for Plane 1 characters in currently deployed software is not always reliable, but it should be possible in multilingual operating systems, since Plane 2 has many Chinese characters that must be displayable in East Asian locales.

As discussed in Section 3.2.2 Mathematics style attributes common to token elements, MathML offers an alternative mechanism to specify mathematical alphanumeric characters. This alternative mechanism spans the gap between the specification of the mathematical alphanumeric symbols as Unicode code points, and the deployment of software and fonts that support them. Namely, one uses the mathvariant attribute on a token element such as mi to indicate that the character data in the token element selects a mathematical alphanumeric symbol.

In principle, any mathvariant value may be used with any character data to define a specific symbolic token. In practice, only certain combinations of character data and mathvariant values will be visually distinguished by a given renderer. In this section we explain the correspondence between certain characters in Plane 0 that, when modified by the mathvariant attribute, are considered equivalent to mathematical alphanumeric symbol characters.

The mathematical alphanumeric symbol characters in Plane 1 include alphabets for Latin upper-case and lower-case letters, including dotless i and j, Greek upper-case and lower-case letters, Greek symbols (also known as variants), including upper-case and lower-case digamma, and Latin digits. These alphabets provide Plane 1 Unicode code points that differ from corresponding Plane 0 characters only by a variation in font that carries mathematical semantics when used in a formula.

The mathvariant attribute uses exactly this correspondence to provide an alternate markup encoding that selects these Plane 1 characters. For example, the Mathematical Italic alphabet runs from U+1D434 ("A") to U+1D467 ("z"). Thus, a typical example of an identifier for a variable, marked up as

<mi>a</mi>

and rendered in a mathematical italic font (as described in Section 3.2.3 Identifier <mi>) could equivalently be marked up as

<mi>&#x1D44E;<!--MATHEMATICAL ITALIC SMALL A--><!--MATHEMATICAL ITALIC SMALL A--></mi>

which invokes the Mathematical Italic lower-case a explicitly.

An important use of the mathematical alphanumeric symbols in Plane 1 is for identifiers normally printed in special mathematical fonts, such as Fraktur, Greek, Boldface, or Script. As another example, the Mathematical Fraktur alphabet runs from U+1D504 ("A") to U+1D537 ("z"). Thus, an identifier for a variable that uses Fraktur characters could be marked up as

<mi>&#x1D504;<!--MATHEMATICAL FRAKTUR CAPITAL A--><!--BLACK-LETTER CAPITAL A--></mi>

An alternative, equivalent markup for this example is to use the common upper-case A, modified by using the mathvariant attribute:

<mi mathvariant="fraktur">A</mi>

A MathML processor must treat a mathematical alphanumeric character (when it appears) as identical to the corresponding combination of the unstyled character and mathvariant attribute value. It is important to note that the mathvariant attribute specifies a semantic class of characters, each of which has a specific appearance that should be protected from document-wide style changes, so the intended meaning of the character may be preserved. The use of a mathematical alphanumeric character is also intended to preserve this specific appearance, and so these characters are also not to be affected by surrounding style changes.

Not all combinations of character data and mathvariant values have assigned Unicode code points. For example, sans-serif Greek alphabets are omitted, while bold sans-serif Greek alphabets are included, and bold digits are included, while bold-italic digits are excluded. A renderer should visually distinguish those combinations of character data and mathvariant attribute values that it can subject to the availability of font characters. It is intended that renderers distinghish at least those combinations that have equivalent Unicode code points, and renderers are free to ignore those combinations that have no assigned Unicode code point or for which adequate font support is unavailable.

The exact correspondence between a mathematical alphabetic character and an unstyled character is complicated by the fact that certain characters that were already present in Unicode in Plane 0 are not in the 'expected' sequence in Plane 1. The table below shows the Plane 0 mathematical alphanumeric symbols, listing for each character its Unicode code point, its Unicode character name, its corresponding unstyled alphabetic character, and the code point in Plane 1 where one might naturally have sought this character.

Unicode code point	Unicode name	BMP code	Plane-1 code
U+210E	PLANCK CONSTANT	U+0068	U+1D455
U+212C	SCRIPT CAPITAL B	U+0042	U+1D49D
U+2130	SCRIPT CAPITAL E	U+0045	U+1D4A0
U+2131	SCRIPT CAPITAL F	U+0046	U+1D4A1
U+210B	SCRIPT CAPITAL H	U+0048	U+1D4A3
U+2110	SCRIPT CAPITAL I	U+0049	U+1D4A4
U+2112	SCRIPT CAPITAL L	U+004C	U+1D4A7
U+2133	SCRIPT CAPITAL M	U+004D	U+1D4A8
U+211B	SCRIPT CAPITAL R	U+0052	U+1D4AD
U+212F	SCRIPT SMALL E	U+0065	U+1D4BA
U+210A	SCRIPT SMALL G	U+0067	U+1D4BC
U+2134	SCRIPT SMALL O	U+006F	U+1D4C4
U+212D	BLACK-LETTER CAPITAL C	U+0043	U+1D506
U+210C	BLACK-LETTER CAPITAL H	U+0048	U+1D50B
U+2111	BLACK-LETTER CAPITAL I	U+0049	U+1D50C
U+211C	BLACK-LETTER CAPITAL R	U+0052	U+1D515
U+2128	BLACK-LETTER CAPITAL Z	U+005A	U+1D51D
U+2102	DOUBLE-STRUCK CAPITAL C	U+0043	U+1D53A
U+210D	DOUBLE-STRUCK CAPITAL H	U+0048	U+1D53F
U+2115	DOUBLE-STRUCK CAPITAL N	U+004E	U+1D545
U+2119	DOUBLE-STRUCK CAPITAL P	U+0050	U+1D547
U+211A	DOUBLE-STRUCK CAPITAL Q	U+0051	U+1D548
U+211D	DOUBLE-STRUCK CAPITAL R	U+0052	U+1D549
U+2124	DOUBLE-STRUCK CAPITAL Z	U+005A	U+1D551

Mathematical Alphanumeric Symbol characters should not be used for styled prose. For example, Mathematical Fraktur A must not be used to just select a blackletter font for an uppercase A as it would create problems for searching, restyling (e.g. for accessibility), and many other kinds of processing.

7.6 Non-Marking Characters

Some characters, although important for the quality of print or alternative rendering, do not have glyph marks that correspond directly to them. They are called here non-marking characters. Their roles are discussed in Chapter 3 Presentation Markup and Chapter 4 Content Markup.

In MathML, control of page composition, such as line-breaking, is effected by the use of the proper attributes on the mo and mspace elements.

The characters below are not simple spacers. They are especially important new additions to the UCS because they provide textual clues which can increase the quality of print rendering, permit correct audio rendering, and allow the unique recovery of mathematical semantics from text which is visually ambiguous.

Unicode code point	Unicode name	Description
U+2061	FUNCTION APPLICATION	character showing function application in presentation tagging (Section 3.2.5 Operator, Fence, Separator or Accent `<mo>`)
U+2062	INVISIBLE TIMES	marks multiplication when it is understood without a mark (Section 3.2.5 Operator, Fence, Separator or Accent `<mo>`)
U+2063	INVISIBLE SEPARATOR	used as a separator, e.g., in indices (Section 3.2.5 Operator, Fence, Separator or Accent `<mo>`)
U+2064	INVISIBLE PLUS	marks addition, especially in constructs such as 1½ (Section 3.2.5 Operator, Fence, Separator or Accent `<mo>`)

7.7 Anomalous Mathematical Characters

Some characters which occur fairly often in mathematical texts, and have special significance there, are frequently confused with other similar characters in the UCS. In some cases, common keyboard characters have become entrenched as alternatives to the more appropriate mathematical characters. In others, characters have legitimate uses in both formulas and text, but conflicting rendering and font conventions. All these characters are called here anomalous characters.

7.7.1 Keyboard Characters

Typical Latin-1-based keyboards contain several characters that are visually similar to important mathematical characters. Consequently, these characters are frequently substituted, intentionally or unintentionally, for their more correct mathematical counterparts.

7.7.1.1 Minus

The most common ordinary text character which enjoys a special mathematical use is U+002D [HYPHEN-MINUS]. As its Unicode name suggests, it is used as a hyphen in prose contexts, and as a minus or negative sign in formulas. For text use, there is a specific code point U+2010 [HYPHEN] which is intended for prose contexts, and which should render as a hyphen or short dash. For mathematical use, there is another code point U+2212 [MINUS SIGN] which is intended for mathematical formulas, and which should render as a longer minus or negative sign. MathML renderers should treat U+002D [HYPHEN-MINUS] as equivalent to U+2212 [MINUS SIGN] in formula contexts such as mo, and as equivalent to U+2010 [HYPHEN] in text contexts such as mtext.

7.7.1.2 Apostrophes, Quotes and Primes

On a typical European keyboard there is a key available which is viewed as an apostrophe or a single quotation mark (an upright or right quotation mark). Thus one key is doing double duty for prose input to enter U+0027 [APOSTROPHE] and U+2019 [RIGHT SINGLE QUOTATION MARK]. In mathematical contexts it is also commonly used for the prime, which should be U+2032 [PRIME]. Unicode recognizes the overloading of this symbol and remarks that it can also signify the units of minutes or feet. In the unstructured printed text of normal prose the characters are placed next to one another. The U+0027 [APOSTROPHE] and U+2019 [RIGHT SINGLE QUOTATION MARK] are marked with glyphs that are small and raised with respect to the center line of the text. The fonts used provide small raised glyphs in the appropriate places indexed by the Unicode codes. The U+2032 [PRIME] of mathematics is similarly treated in fuller Unicode fonts.

MathML renderers are encouraged to treat U+0027 [APOSTROPHE] as U+2032 [PRIME] when appropriate in formula contexts.

A final remark is that a ‘prime’ is often used in transliteration of the Cyrillic character U+044C [CYRILLIC SMALL LETTER SOFT SIGN]. This different use of primes is not part of considerations for mathematical formulas.

7.7.1.3 Other Keyboard Substitutions

While the minus and prime characters are the most common and important keyboard characters with more precise mathematical counterparts, there are a number of other keyboard character substitutions that are sometime used. For example some may expect

<mo>''</mo>

to be treated as U+2033 [DOUBLE PRIME], and analogous substitutions could perhaps be made for U+2034 [TRIPLE PRIME] and U+2057 [QUADRUPLE PRIME]. Similarly, sometimes U+007C [VERTICAL LINE] is used for U+2223 [DIVIDES]. MathML regards these as application-specific authoring conventions, and recommends that authoring tools generate markup using the more precise mathematical characters for better interoperability.

7.7.2 Pseudo-scripts

There are a number of characters in the UCS that traditionally have been taken to have a natural ‘script’ aspect. The visual presentation of these characters is similar to a script, that is, raised from the baseline, and smaller than the base font size. The degree symbol and prime characters are examples. For use in text, such characters occur in sequence with the identifier they follow, and are typically rendered using the same font. These characters are called pseudo-scripts here.

In almost all mathematical contexts, pseudo-script characters should be associated with a base expression using explicit script markup in MathML. For example, the preferred encoding of "x prime" is

<msup><mi>x</mi><mo>&#x2032;<!--PRIME--><!--PRIME--></mo></msup>

and not

<mi>x'</mi>

or any other variants not using an explicit script construct. Note, however, that within text contexts such as mtext, pseudo-scripts may be used in sequence with other character data.

There are two reasons why explicit markup is preferable in mathematical contexts. First, a problem arises with typesetting, when pseudo-scripts are used with subscripted identifiers. Traditionally, subscripting of x' would be rendered stacked under the prime. This is easily accomplished with script markup, for example:

<mrow><msubsup><mi>x</mi><mn>0</mn><mo>&#x2032;<!--PRIME--><!--PRIME--></mo></msubsup></mrow>

By contrast,

<mrow><msub><mi>x'</mi><mn>0</mn></msub></mrow>

will render with staggered scripts.

Note this means that a renderer of MathML will have to treat pseudo-scripts differently from most other character codes it finds in a superscript position; in most fonts, the glyphs for pseudo-scripts are already shrunk and raised from the baseline.

The second reason that explicit script markup is preferrable to juxtaposition of characters is that it generally better reflects the intended mathematical structure. For example,

<msup>
  <mrow><mo>(</mo><mrow><mi>f</mi><mo>+</mo><mi>g</mi></mrow><mo>)</mo></mrow>
  <mo>&#x2032;<!--PRIME--><!--PRIME--></mo>
</msup>

accurately reflects that the prime here is operating on an entire expression, and does not suggest that the prime is acting on the final right parenthesis.

However, the data model for all MathML token elements is Unicode text, so one cannot rule out the possibility of valid MathML markup containing constructions such as

<mrow><mi>x'</mi></mrow>

and

<mrow><mi>x</mi><mo>'</mo></mrow>

While the first form may, in some rare situations, legitmately be used to distinguish a multi-character identifer named x' from the derivative of a function x, such forms should generally be avoided. Authoring and validation tools are encouraged to generate the recommended script markup:

<mrow><msup><mi>x</mi><mo>&#x2032;<!--PRIME--><!--PRIME--></mo></msup></mrow>

The U+2032 [PRIME] character is perhaps the most common pseudo-script, but there are many others, as listed below:

Pseudo-script Characters
U+0022	QUOTATION MARK
U+0027	APOSTROPHE
U+002A	ASTERISK
U+0060	GRAVE ACCENT
U+00AA	FEMININE ORDINAL INDICATOR
U+00B0	DEGREE SIGN
U+00B2	SUPERSCRIPT TWO
U+00B3	SUPERSCRIPT THREE
U+00B4	ACUTE ACCENT
U+00B9	SUPERSCRIPT ONE
U+00BA	MASCULINE ORDINAL INDICATOR
U+2018	LEFT SINGLE QUOTATION MARK
U+2019	RIGHT SINGLE QUOTATION MARK
U+201A	SINGLE LOW-9 QUOTATION MARK
U+201B	SINGLE HIGH-REVERSED-9 QUOTATION MARK
U+201C	LEFT DOUBLE QUOTATION MARK
U+201D	RIGHT DOUBLE QUOTATION MARK
U+201E	DOUBLE LOW-9 QUOTATION MARK
U+201F	DOUBLE HIGH-REVERSED-9 QUOTATION MARK
U+2032	PRIME
U+2033	DOUBLE PRIME
U+2034	TRIPLE PRIME
U+2035	REVERSED PRIME
U+2036	REVERSED DOUBLE PRIME
U+2037	REVERSED TRIPLE PRIME
U+2057	QUADRUPLE PRIME

In addition, the characters in the Unicode Superscript and Subscript block (beginning at U+2070) should be treated as pseudo-scripts when they appear in mathematical formulas.

Note that several of these characters are common on keyboards, including U+002A [ASTERISK], U+00B0 [DEGREE SIGN], U+2033 [DOUBLE PRIME], and U+2035 [REVERSED PRIME] also known as a back prime.

7.7.3 Combining Characters

In the UCS there are many combining characters that are intended to be used for the many accents of numerous different natural languages. Some of them may seem to provide markup needed for mathematical accents. They should not be used in mathematical markup. Superscript, subscript, underscript, and overscript constructions as just discussed above should be used for this purpose. Of course, combining characters may be used in multi-character identifiers as they are needed, or in text contexts.

There is one more case where combining characters turn up naturally in mathematical markup. Some relations have associated negations, such as U+226F [NOT GREATER-THAN] for the negation of U+003E [GREATER-THAN SIGN]. The glyph for U+226F [NOT GREATER-THAN] is usually just that for U+003E [GREATER-THAN SIGN] with a slash through it. Thus it could also be expressed by U+003E-0338 making use of the combining slash U+0338 [COMBINING LONG SOLIDUS OVERLAY]. That is true of 25 other characters in common enough mathematical use to merit their own Unicode code points. In the other direction there are 31 character entity names listed in [Entities] which are to be expressed using U+0338 [COMBINING LONG SOLIDUS OVERLAY].

In a similar way there are mathematical characters which have negations given by a vertical bar overlay U+20D2 [COMBINING LONG VERTICAL LINE OVERLAY]. Some are available in pre-composed forms, and some named character entities are given explicitly as combinations. In addition there are examples using U+0333 [COMBINING DOUBLE LOW LINE] and U+20E5 [COMBINING REVERSE SOLIDUS OVERLAY], and variants specified by use of the U+FE00 [VARIATION SELECTOR-1]. For fuller listing of these cases see the listings in [Entities].

The general rule is that a base character followed by a string of combining characters should be treated just as though it were the pre-composed character that results from the combination, if such a character exists.

7 Characters, Entities and Fonts