Characters, Entities and Fonts

6.1 Introduction

Character and entity tables need to be updated for MathML3 and ISO 9573-13
Issue regenerate_character_and_entity_tables	`wiki (member only)`
Many of the tables in chapter 6 need to be updated and regenerated. In this draft references to tables in chapter 6 link to the published MathML2 Recommendation, and are marked [MathML2]
Resolution	None recorded

Notation and symbols have proved very important for mathematics. Mathematics has grown in part because its notation continually changes toward being succinct and suggestive. There have been many new signs developed for use in mathematical notation, and mathematicians have not held back from making use of many symbols originally introduced elsewhere. The result is that mathematics makes use of a very large collection of symbols. It is difficult to write mathematics fluently if these characters are not available for use. It is difficult to read mathematics if corresponding glyphs are not available for presentation on specific display devices.

The W3C Math Working Group therefore took on directly the task of specifying part of the full mechanism needed to proceed from notation to final presentation, and has collaborated with the STIX Fonts Project and Unicode Technical Committee (UTC) in undertaking specification of the rest.

This chapter of the MathML specification contains a listing of character names for use with MathML, recommendations for their use, and warnings to pay attention to the correct form of the corresponding code points given in the UCS (Universal Character Set) as codified in Unicode and ISO 10646 [Unicode] and the Unicode Web site. For simplicity we refer to this character set by the short name Unicode. Though Unicode changes from time to time so that it is specified exactly by using version numbers, unless this brings clarity on some point we do not use them. MathML 2.0 (Second Edition) is based on Unicode 4.0, and MathML 3.0 on Unicode 5.0.)

While a long process of review and adoption by UTC and ISO/IEC of the characters of special interest to mathematics and MathML is now complete, more characters may be added in the future. To ensure any possible corrections to relevant standards are taken into account, and for the latest character tables and font information, see the W3C Math Working Group home page and the Unicode site, notably Unicode Work in Progress and Unicode Technical Report #25 “Unicode Support for Mathematics”.

6.2 MathML Characters

A MathML token element (see Section 3.2 Token Elements, , ) takes as content a sequence of MathML Characters. MathML Characters are defined to be either Unicode characters legal in XML documents or mglyph elements. The latter are used to represent characters that do not have a Unicode encoding, as described in Section 3.2.9 Accessing glyphs for characters from MathML (mglyph). Because the Unicode UCS provided approximately one thousand special alphabetic characters for the use of mathematics with Unicode 3.1, and over 900 further special symbols in Unicode 3.2, the need for mglyph should be rare.

6.2.1 Unicode Character Data

As always in XML, any character allowed by XML may be used in MathML in an XML document. The legal characters have the hexadecimal code numbers 09 (tab = U+0009), 0A (line feed = U+000A), 0D (carriage return = U+000D), 20-D7FF (U+0020..U+D7FF), E000-FFFD (U+E000..U+FFFD), and 10000-10FFFF (U+010000..U+10FFFF). The notation, just introduced in parentheses, beginning with U+ is that recommended by Unicode for referring to Unicode characters [see [Unicode], page xxviii]. The exclusions above code number D7FF are of the blocks used in surrogate pairs, and the two characters guaranteed not to be Unicode characters at all. U+FFFE is excluded to allow determination of byte order in certain encodings.

There are essentially three different ways of encoding character data.

Using characters directly: For example, an A may be entered as 'A' from a keyboard (character U+0041). This option is only available if the character encoding specified for the XML document includes the character. Most commonly used encodings will have 'A' in the ASCII position. In many encodings, characters may need more than one byte. Note that if the document is, for example, encoded in Latin-1 (ISO-8859-1) then only the characters in that encoding are available directly. Using UTF-8 or UTF-16, the only two encodings that all XML processors are required to accept, mathematical symbols can be encoded as character data.
Using numeric XML character references: Using this notation, 'A' may be represented as A (decimal) or A (hex). Note that the numbers always refer to the Unicode encoding (and not to the character encoding used in the XML file). By using character references it is always possible to access the entire Unicode range. For a general XML vocabulary, there is a disadvantage to this approach: character references may not be used in XML element or attribute names. However, this is not an issue for MathML, as all element names in MathML are restricted to ASCII characters.
Using entity references: The MathML DTD defines internal entities that expand to character data. Thus for example the entity reference é may be used rather than the character reference "é or, if, for example, the document is encoded in ISO-8859-1, the character é. An XML fragment that uses an entity reference which is not defined in a DTD is not well-formed; therefore it will be rejected by an XML parser. For this reason every fragment using entity references must use a DOCTYPE declaration which specifies the MathML DTD, or a DTD that at least declares any entity reference used in the MathML instance. The need to use a DOCTYPE complicates inclusion of MathML in some documents. However, entity references are very useful for small illustrative examples, and are used in most examples in this document.

6.2.2 Special Characters Not in Unicode

For special purposes, one may need to use a character which is not in Unicode. In these cases one may use the mglyph element for direct access to a glyph from some font and creation of a MathML substitute for the corresponding character. All MathML token elements that accept character data also accept an mglyph in their content.

Beware, however, that the font chosen may not be available to all MathML processors.

6.2.3 Mathematical Alphanumeric Symbols Characters

A noticeable feature of mathematical and scientific writing is the use of single letters to denote variables and constants in a given context. The increasing complexity of science has led to the use of certain common alphabet and font variations to provide enough special symbols of this letter-like type. These denotations are in fact not letters that may be used to make up words with recognized meanings, but individual carriers of semantics themselves. Writing a string of such symbols is usually interpreted in terms of some composition law, for instance, multiplication. Many letter-like symbols may be quickly interpreted by specialists in a given area as of a certain mathematical type: for instance, bold symbols, whether based on Latin or Greek letters, as vectors in physics or engineering, or fraktur symbols as Lie algebras in part of pure mathematics. To this end the STIX Fonts Project defined a set of mathematical characters all of which are included in Unicode 5.0.

The additional Mathematical Alphanumeric Symbols provided in Unicode 3.1 have code points U+1D400..U+1D7FF in Plane 1, that is, in the first plane with Unicode values higher than 2¹⁶. This plane of characters is also known as the Secondary Multilingual Plane (SMP), in contrast to the Basic Multilingual Plane (BMP) which was originally the entire extent of Unicode. Support for Plane 1 characters in currently deployed software is not always reliable, but it should be possible in multilingual operating systems, since Plane 2 has many Chinese characters that must be displayable in East Asian locales.

As discussed in Section 3.2.2 Mathematics style attributes common to token elements, MathML offers an alternative mechanism to specify mathematical alphabetic characters. This alternative spans the gap between the specification of Unicode 3.1 and its associated deployment in software and fonts. Namely, one uses the mathvariant attribute on the surrounding token element, which will most commonly be mi. In this section we detail the correspondence that a MathML processor should apply between certain characters in Plane 0 (BMP) of Unicode, modified by the mathvariant attribute, and the Plane 1 Mathematical Alphanumeric Symbol characters.

The basic idea of the correspondence is fairly simple. For example, a Mathematical Fraktur alphabet is in Plane 1, and the code point for Mathematical Fraktur A is U+1D504. Thus using these characters, a typical example might be

<mi>&#x1D504;</mi>

However, an alternative, equivalent markup would be to use the standard A and modify the identifier using the mathvariant attribute, as follows:

<mi mathvariant="fraktur">A</mi>

The exact correspondence between a mathematical alphabetic character and an unstyled character is complicated by the fact that certain characters that were already present in Unicode are not in the 'expected' sequence.

Mathematical Alphanumeric Symbol characters should not be used for styled text. For example, Mathematical Fraktur A must not be used to just select a blackletter font for an uppercase A. Doing this sort of thing would create problems for searching, restyling (e.g. for accessibility), and many other kinds of processing.

6.2.4 Non-Marking Characters

Some characters, although important for the quality of print or alternative rendering, do not have glyph marks that correspond directly to them. They are called here non-marking characters. Their roles are discussed in Chapter 3 Presentation Markup and Chapter 4 Content Markup.

In MathML 2 control of page composition, such as line-breaking, is effected by the use of the proper attributes on the mspace element.

The characters below are not simple spacers. They are especially important new additions to the UCS because they provide textual clues which can increase the quality of print rendering, permit correct audio rendering, and allow the unique recovery of mathematical semantics from text which is visually ambiguous.