7 Characters, Entities and Fonts

Overview: Mathematical Markup Language (MathML) Version 3.0
Previous: 6 Interactions with the Host Environment
Next: A Parsing MathML

7 Characters, Entities and Fonts
    7.1 Introduction
    7.2 Unicode Character Data
    7.3 Entity Declarations
    7.4 Special Characters Not in Unicode
    7.5 Mathematical Alphanumeric Symbols
    7.6 Non-Marking Characters
    7.7 Anomalous Mathematical Characters
        7.7.1 Keyboard Characters
            7.7.1.1 Minus
            7.7.1.2 Apostrophes, Quotes and Primes
            7.7.1.3 Other Keyboard Substitutions
        7.7.2 Pseudo-scripts
        7.7.3 Combining Characters

7.1 Introduction

Notation and symbols have proved very important for mathematics. Mathematics has grown in part because its notation continually changes toward being succinct and suggestive. Many new signs have been developed for use in mathematical notation, and many have been adopted that were originally introduced elsewhere. The result is that mathematics makes use of a very large collection of symbols. It is difficult to write mathematics fluently if these characters are not available for use. It is difficult to read mathematics if corresponding glyphs are not available for presentation on specific display devices.

The W3C Math Working Group therefore took on the job of specifying part of the mechanism needed to proceed from notation to final presentation, and has collaborated with the Unicode Technical Committee (UTC) and the STIX Fonts Project in undertaking specification of the rest.

This chapter contains discussion of characters for use within MathML, recommendations for their use, and warnings concerning the correct form of the corresponding code points given in the Universal Character Set (UCS) as codified in Unicode; see ISO 10646 [Unicode] and the Unicode Home Page. For simplicity we refer to this character set by the short name Unicode. Unless otherwise stated, MathML 2.0 (Second Edition) is based on Unicode 4.0, and MathML 3.0 on Unicode 5.1.

While a long process of review and adoption by UTC and ISO/IEC of the characters of special interest to mathematics and MathML is now complete, more characters may be added in the future. For the latest character tables and font information, see [Entities] and the Unicode Home Page, notably Unicode Work in Progress and Unicode Technical Report #25 “Unicode Support for Mathematics”.

A MathML token element (see Section 3.2 Token Elements, Section 4.2.1 Numbers <cn>, Section 4.2.2 Content Identifiers <ci>, Section 4.2.3 Content Symbols <csymbol>) takes as content a sequence of MathML characters or mglyph elements. The latter are used to represent characters that do not have a Unicode encoding, as described in Section 3.2.1.2 <mglyph/>. The need for mglyph should be rare because Unicode 3.1 provided approximately one thousand alphabetic characters for mathematics, and Unicode 3.2 added over 900 more special mathematical symbols.

7.2 Unicode Character Data

Any character allowed by XML may be used in MathML. More precisely, the legal Unicode characters have the hexadecimal code numbers 09 (tab = U+0009), 0A (line feed = U+000A), 0D (carriage return = U+000D), 20-D7FF (U+0020..U+D7FF), E000-FFFD (U+E000..U+FFFD), and 10000-10FFFF (U+10000..U+10FFFF). The exclusions above code number D7FF are of the blocks used in surrogate pairs, and the two characters guaranteed not to be Unicode characters at all. U+FFFE is excluded to allow determination of byte order in certain encodings.
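
The ranges listed above can be checked mechanically. The following is a minimal sketch in Python, not part of MathML itself; the function name is_xml_char is an assumption made for illustration.

```python
def is_xml_char(cp: int) -> bool:
    """Return True if code point cp falls in one of the ranges listed above.

    is_xml_char is a hypothetical helper name, not an API defined by MathML or XML.
    """
    return (
        cp in (0x09, 0x0A, 0x0D)        # tab, line feed, carriage return
        or 0x20 <= cp <= 0xD7FF         # excludes the remaining C0 controls
        or 0xE000 <= cp <= 0xFFFD       # skips the surrogate blocks, U+FFFE and U+FFFF
        or 0x10000 <= cp <= 0x10FFFF    # Planes 1 to 16
    )
```

For example, is_xml_char(0x1D504) holds for Mathematical Fraktur A, while is_xml_char(0xD800) and is_xml_char(0xFFFE) do not.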

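There are essentially three different ways of encoding character data in an XML document: using the characters directly, using numeric XML character references (for example &#x2212; for the minus sign), and using entity references (for example &minus;).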
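
As an illustration, the three ways are usually taken to be the character itself, a numeric character reference, and an entity reference. The sketch below uses Python's standard xml.etree.ElementTree to parse one mo element written each way and shows that all three decode to the same character. The minus entity is declared locally in an internal DTD subset purely for this example; the names DOCS, decoded_content and results are assumptions.

```python
import xml.etree.ElementTree as ET

# Three spellings of the same content, U+2212 MINUS SIGN.
# The entity 'minus' is declared inline here; a real document would rely on
# the MathML DTD or the sets described in [Entities] instead.
DOCS = {
    "direct character": "<mo>\u2212</mo>",
    "numeric character reference": "<mo>&#x2212;</mo>",
    "entity reference": '<!DOCTYPE mo [<!ENTITY minus "&#x2212;">]><mo>&minus;</mo>',
}


def decoded_content(xml_text: str) -> str:
    """Parse a one-element document and return its character content."""
    return ET.fromstring(xml_text).text


results = {label: decoded_content(doc) for label, doc in DOCS.items()}
```

After parsing, an application sees no difference between the three forms, which is why the choice among them is mostly a matter of authoring convenience and encoding support.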

7.3 Entity Declarations

Earlier versions of this MathML specification included detailed listings of the entity definitions to be used with the MathML DTD. These entity definitions are of more general use, and have now been separated into an ancillary document, [Entities]. The tables there list the entity names and the corresponding Unicode character references. That document describes several entity sets; not all of them are used in the MathML DTD. The standard MathML DTD references the following entity sets:

7.4 Special Characters Not in Unicode

For special purposes, one may need a symbol which does not have a Unicode representation. In these cases one may use the mglyph element for direct access to a glyph as an image, or (in some systems) from a font that uses a non-Unicode encoding. All MathML token elements accept characters in their content and also accept an mglyph there. Beware, however, that use of mglyph to access a font is deprecated and the mechanism may not work in all systems. The mglyph element should always supply a useful alternative representation in its alt attribute.

7.5 Mathematical Alphanumeric Symbols

In mathematical and scientific writing, single letters often denote variables and constants in a given context. The increasing complexity of science has led to the use of certain common alphabet and font variations to provide enough special symbols of this letter-like type. These denotations are generally not letters that may be used to make up words with recognized meanings, but individual carriers of semantics themselves. Writing a string of such symbols is usually interpreted in terms of some composition law, for instance, multiplication. Many letter-like symbols may be quickly interpreted as of a certain mathematical type by specialists in a given area: for instance, bold symbols, whether based on Latin or Greek letters, as vectors in physics or engineering, or Fraktur symbols as Lie algebras in part of pure mathematics.

The additional Mathematical Alphanumeric Symbols provided in Unicode 3.1 have code points in the range U+1D400 to U+1D7FF in Plane 1, that is, in the first plane with Unicode values higher than 2^16. This plane of characters is also known as the Supplementary Multilingual Plane (SMP), in contrast to the Basic Multilingual Plane (BMP) which was originally the entire extent of Unicode. Support for Plane 1 characters in currently deployed software is not always reliable, but it should be possible in multilingual operating systems, since Plane 2 has many Chinese characters that must be displayable in East Asian locales.
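
One practical consequence of living outside the BMP is that a Plane 1 character needs two 16-bit code units in UTF-16. The following Python sketch computes the plane number and the surrogate pair for a code point; the helper names plane and utf16_code_units are assumptions chosen for this example.

```python
def plane(cp: int) -> int:
    """Return the Unicode plane number: 0 for the BMP, 1 for the SMP, and so on."""
    return cp >> 16


def utf16_code_units(cp: int) -> list[int]:
    """Return the UTF-16 code units for cp (two surrogates outside the BMP).

    Assumes cp is a valid, non-surrogate code point.
    """
    if cp < 0x10000:
        return [cp]
    offset = cp - 0x10000
    high = 0xD800 + (offset >> 10)    # leading (high) surrogate
    low = 0xDC00 + (offset & 0x3FF)   # trailing (low) surrogate
    return [high, low]
```

For Mathematical Fraktur A, plane(0x1D504) is 1 and utf16_code_units(0x1D504) is the pair 0xD835, 0xDD04. Software that treats UTF-16 code units as characters will mishandle such symbols.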

As discussed in Section 3.2.2 Mathematics style attributes common to token elements, MathML offers an alternative mechanism to specify mathematical alphabetic characters. This alternative spans the gap between the specification of Unicode 3.1 and its associated deployment in software and fonts. Namely, one uses the mathvariant attribute on the surrounding token element, which will most commonly be mi. In this section we explain the correspondence that a MathML processor should apply between certain characters in Plane 0 (BMP) of Unicode, modified by the mathvariant attribute, and the Plane 1 Mathematical Alphanumeric Symbol characters; see also Section 3.2.2 Mathematics style attributes common to token elements.

The basic idea of the correspondence is simple. For example, there is a Mathematical Italic alphabet in Plane 1, and the code point for Mathematical Italic a is U+1D44E. Thus a typical example identifier of a variable might be marked up as

<mi>a</mi>

and then by the rules set out for rendering in Section 3.2.3 Identifier <mi>, this identifier would be printed in mathematical italic font. An alternative, and in some sense more rigorously specific markup for this identifier would be

<mi>&#x1D44E;<!--MATHEMATICAL ITALIC SMALL A--></mi>

which invokes the Mathematical Italic a explicitly.

Thanks to the consciously specified special arrangements for simple Mathematical Italic just illustrated, and adopted for backwards compatibility to earlier times, the main uses of Plane 1 markup are for identifiers normally printed in special mathematical fonts, such as Fraktur, Greek, Boldface or Script. In another example then, there is a Mathematical Fraktur alphabet in Plane 1, and the code point for Mathematical Fraktur A is U+1D504. Thus using Fraktur characters, a typical example might contain

<mi>&#x1D504;<!--MATHEMATICAL FRAKTUR CAPITAL A--></mi>

An alternative, equivalent markup for this example is to use the standard A and modify the identifier using the mathvariant attribute, as follows:

<mi mathvariant="fraktur">A</mi>

The exact correspondence between a mathematical alphabetic character and an unstyled character is complicated by the fact that certain characters that were already present in Unicode in the BMP are not in the 'expected' sequence in Plane 1. The table below shows the common mathematical ones, listing in the last two columns the corresponding alphabetic value in the BMP (Plane 0) and the place in Plane 1 where one might naturally have sought this character.

Unicode code point  Unicode name             BMP code  Plane-1 code
U+210E              PLANCK CONSTANT          U+0068    U+1D455
U+2102              DOUBLE-STRUCK CAPITAL C  U+0043    U+1D53A
U+210D              DOUBLE-STRUCK CAPITAL H  U+0048    U+1D53F
U+2115              DOUBLE-STRUCK CAPITAL N  U+004E    U+1D545
U+2119              DOUBLE-STRUCK CAPITAL P  U+0050    U+1D547
U+211A              DOUBLE-STRUCK CAPITAL Q  U+0051    U+1D548
U+211D              DOUBLE-STRUCK CAPITAL R  U+0052    U+1D549
U+2124              DOUBLE-STRUCK CAPITAL Z  U+005A    U+1D551
U+212C              SCRIPT CAPITAL B         U+0042    U+1D49D
U+2130              SCRIPT CAPITAL E         U+0045    U+1D4A0
U+2131              SCRIPT CAPITAL F         U+0046    U+1D4A1
U+210B              SCRIPT CAPITAL H         U+0048    U+1D4A3
U+2110              SCRIPT CAPITAL I         U+0049    U+1D4A4
U+2112              SCRIPT CAPITAL L         U+004C    U+1D4A7
U+2133              SCRIPT CAPITAL M         U+004D    U+1D4A8
U+211B              SCRIPT CAPITAL R         U+0052    U+1D4AD
U+212F              SCRIPT SMALL E           U+0065    U+1D4BA
U+210A              SCRIPT SMALL G           U+0067    U+1D4BC
U+2134              SCRIPT SMALL O           U+006F    U+1D4C4
U+212D              BLACK-LETTER CAPITAL C   U+0043    U+1D506
U+210C              BLACK-LETTER CAPITAL H   U+0048    U+1D50B
U+2111              BLACK-LETTER CAPITAL I   U+0049    U+1D50C
U+211C              BLACK-LETTER CAPITAL R   U+0052    U+1D515
U+2128              BLACK-LETTER CAPITAL Z   U+005A    U+1D51D
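
The correspondence between a BMP letter with a mathvariant value and its styled code point can be sketched as an offset computation followed by a lookup for the exceptional cases. The Python below is an illustrative sketch only, covering four of the mathvariant values and the basic Latin letters; the names ALPHABET_BASES, HOLES and styled_letter are assumptions, and the code point values follow the Unicode character assignments.

```python
# Plane 1 code points of capital A and small a for four mathvariant values.
ALPHABET_BASES = {
    "italic": (0x1D434, 0x1D44E),
    "script": (0x1D49C, 0x1D4B6),
    "fraktur": (0x1D504, 0x1D51E),
    "double-struck": (0x1D538, 0x1D552),
}

# 'Expected' Plane 1 positions that are unassigned because the character
# already existed in the BMP, mapped to that BMP code point.
HOLES = {
    0x1D455: 0x210E,  # italic small h: PLANCK CONSTANT
    0x1D53A: 0x2102, 0x1D53F: 0x210D, 0x1D545: 0x2115, 0x1D547: 0x2119,
    0x1D548: 0x211A, 0x1D549: 0x211D, 0x1D551: 0x2124,   # double-struck
    0x1D49D: 0x212C, 0x1D4A0: 0x2130, 0x1D4A1: 0x2131, 0x1D4A3: 0x210B,
    0x1D4A4: 0x2110, 0x1D4A7: 0x2112, 0x1D4A8: 0x2133, 0x1D4AD: 0x211B,
    0x1D4BA: 0x212F, 0x1D4BC: 0x210A, 0x1D4C4: 0x2134,   # script
    0x1D506: 0x212D, 0x1D50B: 0x210C, 0x1D50C: 0x2111, 0x1D515: 0x211C,
    0x1D51D: 0x2128,                                     # fraktur
}


def styled_letter(letter: str, mathvariant: str) -> str:
    """Map a basic Latin letter plus a mathvariant value to the styled character."""
    upper_base, lower_base = ALPHABET_BASES[mathvariant]
    if "A" <= letter <= "Z":
        cp = upper_base + ord(letter) - ord("A")
    elif "a" <= letter <= "z":
        cp = lower_base + ord(letter) - ord("a")
    else:
        raise ValueError("only basic Latin letters are handled in this sketch")
    # Redirect the unassigned Plane 1 positions to their BMP characters.
    return chr(HOLES.get(cp, cp))
```

With this sketch, styled_letter("A", "fraktur") yields U+1D504, whereas styled_letter("C", "fraktur") yields the BMP character U+212D rather than the unassigned U+1D506.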

Mathematical Alphanumeric Symbol characters should not be used for styled prose. For example, Mathematical Fraktur A must not be used to just select a blackletter font for an uppercase A as it would create problems for searching, restyling (e.g. for accessibility), and many other kinds of processing.

7.6 Non-Marking Characters

Some characters, although important for the quality of print or alternative rendering, do not have glyph marks that correspond directly to them. They are called here non-marking characters. Their roles are discussed in Chapter 3 Presentation Markup and Chapter 4 Content Markup.

In MathML, control of page composition, such as line-breaking, is effected by the use of the proper attributes on the mo and mspace elements.

The characters below are not simple spacers. They are especially important new additions to the UCS because they provide textual clues which can increase the quality of print rendering, permit correct audio rendering, and allow the unique recovery of mathematical semantics from text which is visually ambiguous.

Unicode code point  Unicode name          Description
U+2061              FUNCTION APPLICATION  character showing function application in presentation tagging (Section 3.2.5 Operator, Fence, Separator or Accent <mo>)
U+2062              INVISIBLE TIMES       marks multiplication when it is understood without a mark (Section 3.2.5 Operator, Fence, Separator or Accent <mo>)
U+2063              INVISIBLE SEPARATOR   used as a separator, e.g., in indices (Section 3.2.5 Operator, Fence, Separator or Accent <mo>)
U+2064              INVISIBLE PLUS        marks addition, especially in constructs such as 1½ (Section 3.2.5 Operator, Fence, Separator or Accent <mo>)
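
To suggest how these characters can support audio rendering, the following Python sketch replaces each invisible operator in a token sequence with a spoken word. The word choices and the names SPOKEN_FORMS and spoken are assumptions made for illustration; a real speech renderer would apply far richer rules.

```python
# Hypothetical spoken forms for the four invisible operators.
SPOKEN_FORMS = {
    "\u2061": "of",     # FUNCTION APPLICATION
    "\u2062": "times",  # INVISIBLE TIMES
    "\u2063": "comma",  # INVISIBLE SEPARATOR
    "\u2064": "plus",   # INVISIBLE PLUS
}


def spoken(tokens: list[str]) -> str:
    """Join token contents, voicing invisible operators instead of dropping them."""
    return " ".join(SPOKEN_FORMS.get(token, token) for token in tokens)
```

For instance, spoken(["f", "\u2061", "x"]) gives "f of x", while spoken(["f", "\u2062", "x"]) gives "f times x"; without the invisible operator the two would be visually and textually indistinguishable.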

7.7 Anomalous Mathematical Characters

Some characters which occur fairly often in mathematical texts, and have special significance there, are frequently confused with other similar characters in the UCS. In some cases, common keyboard characters have become entrenched as alternatives to the more appropriate mathematical characters. In others, characters have legitimate uses in both formulas and text, but conflicting rendering and font conventions. All these characters are called here anomalous characters.

7.7.1 Keyboard Characters

Typical Latin-1-based keyboards contain several characters that are visually similar to important mathematical characters. Consequently, these characters are frequently substituted, intentionally or unintentionally, for their more correct mathematical counterparts.

7.7.1.1 Minus

The most common ordinary text character which enjoys a special mathematical use is U+002D [HYPHEN-MINUS]. As its Unicode name suggests, it is used as a hyphen in prose contexts and in formulas for a minus or negative sign. For the mathematical use there is also a special code point U+2212 [MINUS SIGN] which is intended for mathematical formulas. MathML renderers should treat U+002D [HYPHEN-MINUS] the same as U+2212 [MINUS SIGN] when appropriate in formula contexts, e.g. in mo. In text contexts, e.g. mtext, U+002D [HYPHEN-MINUS] should render as a hyphen or short dash.
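
The distinction between formula and text contexts can be sketched as a small tree walk. The Python below, using the standard xml.etree.ElementTree module, rewrites a lone hyphen-minus in mo elements to U+2212 and leaves mtext untouched. It is an authoring-side illustration under stated assumptions, not a description of how any particular renderer works; the names local_name and normalize_minus are hypothetical.

```python
import xml.etree.ElementTree as ET

HYPHEN_MINUS = "\u002D"
MINUS_SIGN = "\u2212"


def local_name(tag: str) -> str:
    """Strip an optional '{namespace}' prefix from an ElementTree tag."""
    return tag.rsplit("}", 1)[-1]


def normalize_minus(root: ET.Element) -> ET.Element:
    """Replace a lone hyphen-minus in mo elements with MINUS SIGN, in place."""
    for element in root.iter():
        is_operator = local_name(element.tag) == "mo"
        if is_operator and element.text and element.text.strip() == HYPHEN_MINUS:
            element.text = MINUS_SIGN
    return root
```

Applied to a row containing both an mo with "-" and an mtext with "well-known", only the operator is changed; the hyphen in the text remains a hyphen.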

7.7.1.2 Apostrophes, Quotes and Primes

On a typical European keyboard there is a key available which is viewed as an apostrophe or a single quotation mark (an upright or right quotation mark). Thus one key is doing double duty for prose input to enter U+0027 [APOSTROPHE] and U+2019 [RIGHT SINGLE QUOTATION MARK]. In mathematical contexts it is also commonly used for the prime, which should be U+2032 [PRIME]. Unicode recognizes the overloading of this symbol and remarks that it can also signify the units of minutes or feet. In the unstructured printed text of normal prose the characters are placed next to one another. The U+0027 [APOSTROPHE] and U+2019 [RIGHT SINGLE QUOTATION MARK] are marked with glyphs that are small and raised with respect to the center line of the text. The fonts used provide small raised glyphs in the appropriate places indexed by the Unicode codes. The U+2032 [PRIME] of mathematics is similarly treated in fuller Unicode fonts.

MathML renderers are encouraged to treat U+0027 [APOSTROPHE] as U+2032 [PRIME] when appropriate in formula contexts, and as U+2019 [RIGHT SINGLE QUOTATION MARK] when appropriate in text contexts.

A final remark is that a ‘prime’ is often used in transliteration of the Cyrillic character U+044C [CYRILLIC SMALL LETTER SOFT SIGN]. This different use of primes is not part of considerations for mathematical formulas.

7.7.1.3 Other Keyboard Substitutions

While the minus and prime characters are the most common and important keyboard characters with more precise mathematical counterparts, there are a number of other keyboard character substitutions that are sometimes used. For example, some may expect

<mo>''</mo>

to be treated as U+2033 [DOUBLE PRIME], and analogous substitutions could perhaps be made for U+2034 [TRIPLE PRIME] and U+2057 [QUADRUPLE PRIME]. Similarly, sometimes U+007C [VERTICAL LINE] is used for U+2223 [DIVIDES]. MathML regards these as application-specific authoring conventions, and recommends that authoring tools generate markup using the more precise mathematical characters for better interoperability.
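
An authoring tool following this recommendation might map runs of keyboard apostrophes to the corresponding prime characters before emitting markup. The Python sketch below shows one such convention; the names PRIMES_BY_COUNT and apostrophes_to_prime are assumptions, and the rule applies only to operator content consisting entirely of one to four apostrophes.

```python
# Number of U+0027 APOSTROPHE characters mapped to the matching prime character.
PRIMES_BY_COUNT = {
    1: "\u2032",  # PRIME
    2: "\u2033",  # DOUBLE PRIME
    3: "\u2034",  # TRIPLE PRIME
    4: "\u2057",  # QUADRUPLE PRIME
}


def apostrophes_to_prime(content: str) -> str:
    """Return the prime character for a run of apostrophes, else content unchanged."""
    if content and set(content) == {"'"} and len(content) in PRIMES_BY_COUNT:
        return PRIMES_BY_COUNT[len(content)]
    return content
```

Content that is not purely apostrophes, such as "x'", is returned unchanged, since deciding what it means is exactly the application-specific question discussed above.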

7.7.2 Pseudo-scripts

There are a number of characters in the UCS that traditionally have been taken to have a natural ‘script’ aspect. The visual presentation of these characters is similar to a script, that is, raised from the baseline, and smaller than the base font size. The degree symbol and prime characters are examples. For use in text, such characters occur in sequence with the identifier they follow, and are typically rendered using the same font. These characters are called pseudo-scripts here.

In almost all mathematical contexts, pseudo-script characters should be associated with a base expression using explicit script markup in MathML. For example, the preferred encoding of "x prime" is

<msup><mi>x</mi><mo>&#x2032;<!--PRIME--></mo></msup>

and not

<mi>x'</mi>

or any other variants not using an explicit script construct. Note, however, that within text contexts such as mtext, pseudo-scripts may be used in sequence with other character data.

There are two reasons why explicit markup is preferable in mathematical contexts. First, a problem arises with typesetting when pseudo-scripts are used with subscripted identifiers. Traditionally, a subscript on x' would be rendered stacked under the prime. This is easily accomplished with script markup, for example:

<mrow><msubsup><mi>x</mi><mn>0</mn><mo>&#x2032;<!--PRIME--></mo></msubsup></mrow>

By contrast,

<mrow><msub><mi>x'</mi><mn>0</mn></msub></mrow>

will render with staggered scripts.

Note this means that a renderer of MathML will have to treat pseudo-scripts differently from most other character codes it finds in a superscript position; in most fonts, the glyphs for pseudo-scripts are already shrunk and raised from the baseline.

The second reason that explicit script markup is preferable to juxtaposition of characters is that it generally better reflects the intended mathematical structure. For example,

<msup>
  <mrow><mo>(</mo><mrow><mi>f</mi><mo>+</mo><mi>g</mi></mrow><mo>)</mo></mrow>
  <mo>&#x2032;<!--PRIME--></mo>
</msup>

accurately reflects that the prime here is operating on an entire expression, and does not suggest that the prime is acting on the final right parenthesis.

However, the data model for all MathML token elements is Unicode text, so one cannot rule out the possibility of valid MathML markup containing constructions such as

<mrow><mi>x'</mi></mrow>

and

<mrow><mi>x</mi><mo>'</mo></mrow>

While the first form may, in some rare situations, legitimately be used to distinguish a multi-character identifier named x' from the derivative of a function x, such forms should generally be avoided. Authoring and validation tools are encouraged to generate the recommended script markup:

<mrow><msup><mi>x</mi><mo>&#x2032;<!--PRIME--></mo></msup></mrow>
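
A tool that generates the recommended form could build the script construct directly rather than concatenating characters. The following Python sketch does so with the standard xml.etree.ElementTree module; the function name primed is an assumption, and namespace handling is omitted for brevity.

```python
import xml.etree.ElementTree as ET

PRIME = "\u2032"


def primed(identifier: str) -> ET.Element:
    """Build <msup><mi>identifier</mi><mo>PRIME</mo></msup> as an element tree."""
    msup = ET.Element("msup")
    ET.SubElement(msup, "mi").text = identifier
    ET.SubElement(msup, "mo").text = PRIME
    return msup


markup = ET.tostring(primed("x"), encoding="unicode")
```

Here markup holds the msup construct with the prime as an explicit superscript operator, so the base identifier stays a plain x for searching and further processing.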

The U+2032 [PRIME] character is perhaps the most common pseudo-script, but there are others as well:

Pseudo-script Characters
U+002A [ASTERISK]
U+00B0 [DEGREE SIGN]
U+2033 [DOUBLE PRIME]
U+2034 [TRIPLE PRIME]
U+2035 [REVERSED PRIME]
U+2057 [QUADRUPLE PRIME]
U+201C [LEFT DOUBLE QUOTATION MARK]
U+201D [RIGHT DOUBLE QUOTATION MARK]
U+201A [SINGLE LOW-9 QUOTATION MARK]
U+201E [DOUBLE LOW-9 QUOTATION MARK]

Note that several of these characters are common on keyboards, namely U+002A [ASTERISK], U+00B0 [DEGREE SIGN], U+2033 [DOUBLE PRIME], and U+2035 [REVERSED PRIME] also known as a back prime.

7.7.3 Combining Characters

In the UCS there are many combining characters that are intended to be used for the many accents of numerous different natural languages. Some of them may seem to provide markup needed for mathematical accents. They should not be used in mathematical markup. Superscript, subscript, underscript and overscript constructions as just discussed above should be used for this purpose. Of course, combining characters may be used in multi-character identifiers as they are needed, or in text contexts.

There is one more case where combining characters turn up naturally in mathematical markup. Some relations have associated negations, such as U+226F [NOT GREATER-THAN] for the negation of U+003E [GREATER-THAN SIGN]. The glyph for U+226F [NOT GREATER-THAN] is usually just that for U+003E [GREATER-THAN SIGN] with a slash through it. Thus it could also be expressed by U+003E-0338 making use of the combining slash U+0338 [COMBINING LONG SOLIDUS OVERLAY]. That is true of 25 other characters in common enough mathematical use to merit their own Unicode code points. In the other direction there are 31 character entity names listed in [Entities] which are to be expressed using U+0338 [COMBINING LONG SOLIDUS OVERLAY].

In a similar way there are mathematical characters which have negations given by a vertical bar overlay U+20D2 [COMBINING LONG VERTICAL LINE OVERLAY]. Some are available in pre-composed forms, and some named character entities are given explicitly as combinations. In addition there are examples using U+0333 [COMBINING DOUBLE LOW LINE] and U+20E5 [COMBINING REVERSE SOLIDUS OVERLAY], and variants specified by use of the U+FE00 [VARIATION SELECTOR-1]. For a fuller listing of these cases see [Entities].

The general rule is that a string of combining characters should be treated just as though it were the pre-composed character resulting from the combination, if such a character exists.
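
Unicode canonical composition (Normalization Form C) gives one way to compare a combining sequence with its pre-composed counterpart. The sketch below uses Python's standard unicodedata module; treating NFC as the comparison mechanism is an assumption of this example rather than a requirement stated here, and the helper name same_character is hypothetical.

```python
import unicodedata


def same_character(first: str, second: str) -> bool:
    """Compare two strings after canonical composition (NFC)."""
    return unicodedata.normalize("NFC", first) == unicodedata.normalize("NFC", second)


# GREATER-THAN SIGN followed by COMBINING LONG SOLIDUS OVERLAY.
combining_sequence = "\u003E\u0338"
# NOT GREATER-THAN, the pre-composed form.
precomposed = "\u226F"

composed = unicodedata.normalize("NFC", combining_sequence)

# SUBSET OF followed by COMBINING LONG VERTICAL LINE OVERLAY has no
# pre-composed form, so NFC leaves the two-character sequence as it is.
uncomposed = unicodedata.normalize("NFC", "\u2282\u20D2")
```

Here composed equals the single character U+226F, so same_character(combining_sequence, precomposed) holds, while uncomposed remains a sequence of two code points.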
