6 Characters, Entities and Fonts

Overview: Mathematical Markup Language (MathML) Version 3.0
Previous: 5 Combining Presentation and Content Markup
Next: 7 MathML interactions with the Wide World

6 Characters, Entities and Fonts
    6.1 Introduction
    6.2 Unicode Character Data
    6.3 Entity Declarations
    6.4 Special Characters Not in Unicode
    6.5 Mathematical Alphanumeric Symbols
    6.6 Non-Marking Characters

6.1 Introduction

Notation and symbols have proved very important for mathematics. Mathematics has grown in part because its notation continually changes toward being succinct and suggestive. There have been many new signs developed for use in mathematical notation, and mathematicians have not held back from making use of many symbols originally introduced elsewhere. The result is that mathematics makes use of a very large collection of symbols. It is difficult to write mathematics fluently if these characters are not available for use. It is difficult to read mathematics if corresponding glyphs are not available for presentation on specific display devices.

The W3C Math Working Group therefore took on directly the task of specifying part of the full mechanism needed to proceed from notation to final presentation, and has collaborated with the STIX Fonts Project and Unicode Technical Committee (UTC) in undertaking specification of the rest.

This chapter of the MathML specification contains a listing of character names for use with MathML, recommendations for their use, and warnings to pay attention to the correct form of the corresponding code points given in the UCS (Universal Character Set) as codified in Unicode and ISO 10646 [Unicode] and the Unicode Web site. For simplicity we refer to this character set by the short name Unicode. Unicode changes from time to time, so it is specified exactly only by giving a version number; we omit version numbers unless they bring clarity on some point. (MathML 2.0 (Second Edition) is based on Unicode 4.0, and MathML 3.0 on Unicode 5.1.)

While a long process of review and adoption by UTC and ISO/IEC of the characters of special interest to mathematics and MathML is now complete, more characters may be added in the future. To ensure any possible corrections to relevant standards are taken into account, and for the latest character tables and font information, see the W3C Entities page and the Unicode site, notably Unicode Work in Progress and Unicode Technical Report #25 “Unicode Support for Mathematics”.

A MathML token element (see Section 3.2 Token Elements, Section 4.2.3 Numbers, Section 4.2.4 Symbols and Identifiers) takes as content a sequence of MathML Characters. MathML Characters are defined to be either Unicode characters legal in XML documents or mglyph elements. The latter are used to represent characters that do not have a Unicode encoding, as described in Section 3.2.9 Using images to represent symbols (mglyph). Because the Unicode UCS provided approximately one thousand special alphabetic characters for the use of mathematics with Unicode 3.1, and over 900 further special symbols in Unicode 3.2, the need for mglyph should be rare.

6.2 Unicode Character Data

Any character allowed by XML may be used in MathML in an XML document. The legal characters have the hexadecimal code numbers 09 (tab = U+0009), 0A (line feed = U+000A), 0D (carriage return = U+000D), 20-D7FF (U+0020..U+D7FF), E000-FFFD (U+E000..U+FFFD), and 10000-10FFFF (U+010000..U+10FFFF). The notation, just introduced in parentheses, beginning with U+ is that recommended by Unicode for referring to Unicode characters [see [Unicode], page xxviii]. The exclusions above code number D7FF are of the blocks used in surrogate pairs, and the two characters guaranteed not to be Unicode characters at all. U+FFFE is excluded to allow determination of byte order in certain encodings.
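
The ranges above translate directly into a membership test. The following is a minimal sketch in Python, not part of this specification; the names XML_CHAR_RANGES, is_xml_char and first_illegal are hypothetical and chosen only for illustration.

```python
# Code point ranges listed above as legal in XML documents.
XML_CHAR_RANGES = (
    (0x09, 0x09),        # tab
    (0x0A, 0x0A),        # line feed
    (0x0D, 0x0D),        # carriage return
    (0x20, 0xD7FF),      # everything below the surrogate blocks
    (0xE000, 0xFFFD),    # after the surrogates, excluding U+FFFE and U+FFFF
    (0x10000, 0x10FFFF), # Planes 1 through 16
)


def is_xml_char(code_point: int) -> bool:
    """Return True if code_point lies in one of the ranges above."""
    return any(low <= code_point <= high for low, high in XML_CHAR_RANGES)


def first_illegal(text: str):
    """Return the first character of text that XML does not allow, or None."""
    for ch in text:
        if not is_xml_char(ord(ch)):
            return ch
    return None
```

A tool that generates MathML might call first_illegal on token content before serializing it, so that a stray control character is reported rather than written into the document.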

There are essentially three different ways of encoding character data.
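
As one illustration of equivalent encodings, the sketch below compares a literal character with a hexadecimal numeric character reference, two forms that XML itself provides. It assumes Python's standard xml.etree.ElementTree parser; the helper token_text is a hypothetical name, and the example does not attempt to cover entity references, which depend on the declarations available to the parser.

```python
import xml.etree.ElementTree as ET

# The same identifier written two ways: as a literal character and as a
# hexadecimal numeric character reference.
literal = "<mi>\U0001D504</mi>"
numeric = "<mi>&#x1D504;</mi>"


def token_text(markup: str) -> str:
    """Parse a one-element fragment and return its character content."""
    return ET.fromstring(markup).text


# After parsing, both forms yield the same single character.
same = token_text(literal) == token_text(numeric)
```

Once parsed, an application cannot tell which form the author used, which is why the choice between them is a matter of authoring convenience and of the document's character encoding.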

6.3 Entity Declarations

Earlier versions of this MathML specification included detailed listings of the entity definitions to be used with the MathML DTD. These entity definitions are of more general use, and have now been moved into a separate document, [Entities]. That document describes several entity sets; not all of them are used in the MathML DTD, although an XML document that includes MathML may reference any entity definitions. The standard MathML DTD references the following entity sets:

6.4 Special Characters Not in Unicode

For special purposes, one may need to use a character which is not in Unicode. In these cases one may use the mglyph element for direct access to a glyph as an image, or (in some systems) from a font that uses a non-Unicode encoding. All MathML token elements that accept character data also accept an mglyph in their content. Beware, however, that use of mglyph to access a font is deprecated and the mechanism may not work in all systems. The mglyph element should always supply an alternative representation in its alt attribute.
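
The recommendation about alt can be checked mechanically. The following is a minimal sketch, assuming Python's standard xml.etree.ElementTree and markup in the MathML namespace; the function name mglyphs_without_alt and the image file names are hypothetical.

```python
import xml.etree.ElementTree as ET

MATHML_NS = "http://www.w3.org/1998/Math/MathML"


def mglyphs_without_alt(markup: str) -> list:
    """Return the mglyph elements in markup that lack a non-empty alt."""
    root = ET.fromstring(markup)
    return [
        element
        for element in root.iter(f"{{{MATHML_NS}}}mglyph")
        if not element.get("alt", "").strip()
    ]


# Hypothetical image file names, used only for illustration.
sample = (
    f'<math xmlns="{MATHML_NS}">'
    '<mi><mglyph src="my-glyph.png" alt="my glyph"/></mi>'
    '<mo><mglyph src="my-operator.png"/></mo>'
    "</math>"
)

# Only the second mglyph, which has no alt attribute, is reported.
missing = mglyphs_without_alt(sample)
```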

6.5 Mathematical Alphanumeric Symbols

A noticeable feature of mathematical and scientific writing is the use of single letters to denote variables and constants in a given context. The increasing complexity of science has led to the use of certain common alphabet and font variations to provide enough special symbols of this letter-like type. These denotations are in fact not letters that may be used to make up words with recognized meanings, but individual carriers of semantics themselves. Writing a string of such symbols is usually interpreted in terms of some composition law, for instance, multiplication. Many letter-like symbols may be quickly interpreted by specialists in a given area as of a certain mathematical type: for instance, bold symbols, whether based on Latin or Greek letters, as vectors in physics or engineering, or fraktur symbols as Lie algebras in part of pure mathematics. To this end the STIX Fonts Project defined a set of mathematical characters all of which are included in Unicode 5.0.

The additional Mathematical Alphanumeric Symbols provided in Unicode 3.1 have code points U+1D400..U+1D7FF in Plane 1, that is, in the first plane with Unicode values higher than 2^16. This plane of characters is also known as the Supplementary Multilingual Plane (SMP), in contrast to the Basic Multilingual Plane (BMP) which was originally the entire extent of Unicode. Support for Plane 1 characters in currently deployed software is not always reliable, but it should be possible in multilingual operating systems, since Plane 2 has many Chinese characters that must be displayable in East Asian locales.
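
The plane of a code point is simply its value divided by 2^16. The sketch below, in Python, computes the plane and also shows how a Plane 1 character is split into a surrogate pair in UTF-16; the UTF-16 step is an addition for illustration, and the function names plane and utf16_units are hypothetical.

```python
def plane(code_point: int) -> int:
    """Return the Unicode plane number (0 = BMP, 1 = SMP, and so on)."""
    return code_point >> 16


def utf16_units(code_point: int) -> list:
    """Return the UTF-16 code units that encode code_point."""
    if code_point < 0x10000:
        # BMP characters fit in a single 16-bit unit.
        return [code_point]
    # Higher planes are split into a high and a low surrogate.
    offset = code_point - 0x10000
    return [0xD800 + (offset >> 10), 0xDC00 + (offset & 0x3FF)]
```

For Mathematical Fraktur A, plane(0x1D504) is 1 and utf16_units(0x1D504) is the pair 0xD835, 0xDD04. Software that assumes one 16-bit unit per character is the usual source of the unreliable Plane 1 support mentioned above.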

As discussed in Section 3.2.2 Mathematics style attributes common to token elements, MathML offers an alternative mechanism to specify mathematical alphabetic characters. This alternative spans the gap between the specification of Unicode 3.1 and its associated deployment in software and fonts. Namely, one uses the mathvariant attribute on the surrounding token element, which will most commonly be mi. In this section we detail the correspondence that a MathML processor should apply between certain characters in Plane 0 (BMP) of Unicode, modified by the mathvariant attribute, and the Plane 1 Mathematical Alphanumeric Symbol characters.

The basic idea of the correspondence is fairly simple. For example, a Mathematical Fraktur alphabet is in Plane 1, and the code point for Mathematical Fraktur A is U+1D504. Thus using these characters, a typical example might be

<mi>&#x1D504;</mi>

However, an alternative, equivalent markup would be to use the standard A and modify the identifier using the mathvariant attribute, as follows:

<mi mathvariant="fraktur">A</mi>

The exact correspondence between a mathematical alphabetic character and an unstyled character is complicated by the fact that certain characters that were already present in Unicode are not in the 'expected' sequence.
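
To make the complication concrete, the following sketch maps an uppercase Latin letter with mathvariant="fraktur" to its styled character. It is a minimal illustration in Python, limited to Fraktur capitals, and not the normative correspondence; the names FRAKTUR_EXCEPTIONS and fraktur_capital are hypothetical. The five exceptions are letters that Unicode had already encoded in the Letterlike Symbols block of the BMP.

```python
FRAKTUR_CAPITAL_A = 0x1D504  # MATHEMATICAL FRAKTUR CAPITAL A

# Fraktur capitals already present in the BMP; their "expected" positions
# in the Plane 1 sequence are left unassigned.
FRAKTUR_EXCEPTIONS = {
    "C": 0x212D,  # BLACK-LETTER CAPITAL C
    "H": 0x210C,  # BLACK-LETTER CAPITAL H
    "I": 0x2111,  # BLACK-LETTER CAPITAL I
    "R": 0x211C,  # BLACK-LETTER CAPITAL R
    "Z": 0x2128,  # BLACK-LETTER CAPITAL Z
}


def fraktur_capital(letter: str) -> str:
    """Return the Fraktur form of an uppercase Latin letter A..Z."""
    if len(letter) != 1 or not "A" <= letter <= "Z":
        raise ValueError("expected a single letter in the range A..Z")
    if letter in FRAKTUR_EXCEPTIONS:
        return chr(FRAKTUR_EXCEPTIONS[letter])
    # Regular case: a fixed offset from the start of the Plane 1 alphabet.
    return chr(FRAKTUR_CAPITAL_A + ord(letter) - ord("A"))
```

A processor that applied only the fixed offset would map C to U+1D506, which is not an assigned character; the exception table is what keeps the result well defined.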

Mathematical Alphanumeric Symbol characters should not be used for styled text. For example, Mathematical Fraktur A must not be used to just select a blackletter font for an uppercase A. Doing this sort of thing would create problems for searching, restyling (e.g. for accessibility), and many other kinds of processing.

6.6 Non-Marking Characters

Some characters, although important for the quality of print or alternative rendering, do not have glyph marks that correspond directly to them. They are called here non-marking characters. Their roles are discussed in Chapter 3 Presentation Markup and Chapter 4 Content Markup.

In MathML 2, control of page composition, such as line-breaking, is effected by the use of the proper attributes on the mspace element.

The characters below are not simple spacers. They are especially important new additions to the UCS because they provide textual clues which can increase the quality of print rendering, permit correct audio rendering, and allow the unique recovery of mathematical semantics from text which is visually ambiguous.

Unicode code point   Unicode name           Description
U+2061               FUNCTION APPLICATION   character showing function application in presentation tagging (Section 3.2.5 Operator, Fence, Separator or Accent (mo))
U+2062               INVISIBLE TIMES        marks multiplication when it is understood without a mark (Section 3.2.5 Operator, Fence, Separator or Accent (mo))
U+2063               INVISIBLE SEPARATOR    used as a separator, e.g., in indices (Section 3.2.5 Operator, Fence, Separator or Accent (mo))
U+2064*              INVISIBLE PLUS         marks addition, especially in constructs such as 1½ (Section 3.2.5 Operator, Fence, Separator or Accent (mo))

*Character U+2064 has been accepted by the UTC and ISO for inclusion into the next revision of Unicode, 5.1.
