6 Characters, Entities and Fonts

Overview: Mathematical Markup Language (MathML) Version 2.0 (2nd Edition)
Previous: 5 Combining Presentation and Content Markup
Next: 7 The MathML Interface

6 Characters, Entities and Fonts
    6.1 Introduction
    6.2 MathML Characters
        6.2.1 Unicode Character Data
        6.2.2 Special Characters Not in Unicode
        6.2.3 Mathematical Alphanumeric Symbols Characters.
        6.2.4 Non-Marking Characters
    6.3 Character Symbol Listings
        6.3.1 Special Constants
        6.3.2 Character Tables (ASCII format)
        6.3.3 Tables arranged by Unicode block
        6.3.4 Negated Mathematical Characters
        6.3.5 Variant Mathematical Characters
        6.3.6 Mathematical Alphanumeric Symbols
        6.3.7 MathML Character Names
    6.4 Differences from Characters in MathML 1
        6.4.1 Coverage
        6.4.2 Fewer Non-marking Characters
        6.4.3 ISO Tables
        6.4.4 Status of Character Encodings

6.1 Introduction

Notation and symbols have proved very important for mathematics. Mathematics has grown in part because its notation continually changes toward being succinct and suggestive. There have been many new signs developed for use in mathematical notation, and mathematicians have not held back from making use of many symbols originally introduced elsewhere. The result is that mathematics makes use of a very large collection of symbols. It is difficult to write mathematics fluently if these characters are not available for use. It is difficult to read mathematics if corresponding glyphs are not available for presentation on specific display devices.

This situation posed a problem for the first W3C Math Working Group when it was brought into existence. Developing a specification that enabled mathematics to be used with HTML, and producing a DTD, did not naturally require the Working Group to worry about more than the entities allowed in the DTD. However, as experience has shown, a long list of entities with no means to display them is of little use, and a cause of frequent frustration in trying to use a standard. On the other hand, a large collection of glyphs and fonts representing characters without a standard way to refer to them is not of much use either.

The W3C Math Working Group therefore took on directly the task of specifying part of the full mechanism needed to proceed from notation to final presentation, and started collaboration with organizations undertaking specification of the rest.

This chapter of the MathML specification contains a listing of character names for use with MathML, recommendations for their use, and warnings to pay attention to the correct form of the corresponding code points given in the UCS (Universal Character Set) as codified in Unicode and ISO 10646 [see [Unicode] and the Unicode Web site]. For simplicity we shall refer to this character set by the short name Unicode. Unicode changes from time to time, so an exact reference requires a version number; we give one only where it adds clarity. MathML 2.0 originally made use of some characters that were not part of Unicode 3.0 but which had been proposed to the Unicode Technical Committee (UTC), and thus for inclusion in ISO 10646. They have been included in the revisions Unicode 3.1 and 3.2. (For more detail about this see Section 6.4.4 Status of Character Encodings.)

While a long process of review and adoption by UTC and ISO/IEC of the characters of special interest to mathematics and MathML is now complete (Unicode Work in Progress) there remains the possibility of some further modification of the lists of characters accepted, of the code assignments for those adopted, or of the names given them by Unicode. To make sure any possible corrections to relevant standards are taken into account, and for the latest character tables and font information, see the W3C Math Working Group home page and the Unicode site.

6.2 MathML Characters

A MathML token element (see Section 3.2 Token Elements and Section 4.4.1 Token Elements) takes as content a sequence of MathML Characters. MathML Characters are defined to be either Unicode characters legal in XML documents or mglyph elements. The latter are used to represent characters that do not have a Unicode encoding, as described in Section 3.2.9 Adding new character glyphs to MathML (mglyph). Because the Unicode UCS provided approximately one thousand special alphabetic characters for the use of mathematics with Unicode 3.1, and over 900 further special symbols in Unicode 3.2, the need for mglyph should be rare.

6.2.1 Unicode Character Data

As always in XML, any character allowed by XML may be used in MathML in an XML document. The legal characters have the hexadecimal code numbers 09 (tab = U+0009), 0A (line feed = U+000A), 0D (carriage return = U+000D), 20-D7FF (U+0020..U+D7FF), E000-FFFD (U+E000..U+FFFD), and 10000-10FFFF (U+010000..U+10FFFF). The parenthetical notation beginning with U+ is one recommended by Unicode for referring to Unicode characters [see [Unicode], page xxviii]. The exclusions above code number D7FF are of the blocks used in surrogate pairs, and the two characters guaranteed not to be Unicode characters at all. U+FFFE is excluded to allow determination of byte order in certain encodings.
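The ranges listed above can be checked mechanically. The following is a minimal sketch in Python, not part of MathML itself; the names XML_CHAR_RANGES and is_xml_char are chosen here purely for illustration.

```python
# Code point ranges that XML accepts as characters, as listed above.
XML_CHAR_RANGES = [
    (0x09, 0x09),          # tab
    (0x0A, 0x0A),          # line feed
    (0x0D, 0x0D),          # carriage return
    (0x20, 0xD7FF),        # stops just below the surrogate blocks
    (0xE000, 0xFFFD),      # excludes U+FFFE and U+FFFF
    (0x10000, 0x10FFFF),   # Planes 1 to 16
]


def is_xml_char(code_point: int) -> bool:
    """Return True if code_point falls in one of the legal ranges."""
    return any(low <= code_point <= high for low, high in XML_CHAR_RANGES)


if __name__ == "__main__":
    # Tab, vertical tab, 'A', a surrogate, U+FFFE, Mathematical Fraktur A.
    for cp in (0x09, 0x0B, 0x41, 0xD800, 0xFFFE, 0x1D504):
        print(f"U+{cp:04X}: {is_xml_char(cp)}")
```

A generator of MathML could run such a check before emitting a numeric character reference, since a reference to an excluded code point is no more legal than the character itself.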

There are essentially three different ways of encoding character data.

  • Using characters directly: For example, an A may be entered as 'A' from a keyboard (character U+0041). This option is only available if the character encoding specified for the XML document includes the character. Most commonly used encodings will have 'A' in the ASCII position. In many encodings, characters may need more than one byte. Note that if the document is, for example, encoded in Latin-1 (ISO-8859-1) then only the characters in that encoding are available directly. Unfortunately, most mathematical symbols may not be encoded as character data in this way.

  • Using numeric XML character references: Using this notation, 'A' may be represented as &#65; (decimal) or &#x41; (hex). Note that the numbers always refer to the Unicode encoding (and not to the character encoding used in the XML file). By using character references it is always possible to access the entire Unicode range. For a general XML vocabulary, there is a disadvantage to this approach: character references may not be used in XML element or attribute names. However, this is not an issue for MathML, as all element names in MathML are restricted to ASCII characters.

  • Using entity references: The MathML DTD defines internal entities that expand to character data. Thus, for example, the entity reference &eacute; may be used rather than the character reference &#xE9; or, if the document is encoded in ISO-8859-1, the character é. An XML fragment that uses an entity reference which is not defined in a DTD is not well-formed; therefore it will be rejected by an XML parser. For this reason every fragment using entity references must use a DOCTYPE declaration which specifies the MathML DTD, or a DTD that at least declares any entity reference used in the MathML instance. The need to use a DOCTYPE complicates inclusion of MathML in some documents.

    For this reason entity references are perhaps not optimal for use in generated MathML. However, they are very useful for small illustrative examples, and are used in most examples in this document.
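The behavior of the three approaches can be observed with any XML parser. The sketch below uses the ElementTree parser from the Python standard library as a stand-in for a MathML processor; the one-line internal DTD subset is an assumption made to keep the example self-contained, whereas a real document would normally reference the MathML DTD.

```python
import xml.etree.ElementTree as ET

# Direct character data and numeric references all yield the same character.
direct = ET.fromstring("<mi>A</mi>").text
decimal = ET.fromstring("<mi>&#65;</mi>").text
hexadecimal = ET.fromstring("<mi>&#x41;</mi>").text

# An entity reference with no declaration in scope is not well-formed,
# so the parser rejects the fragment.
try:
    ET.fromstring("<mtext>&eacute;</mtext>")
    undeclared_ok = True
except ET.ParseError:
    undeclared_ok = False

# Declaring the entity in an internal DTD subset makes the reference legal.
declared = ET.fromstring(
    '<!DOCTYPE mtext [<!ENTITY eacute "&#xE9;">]><mtext>&eacute;</mtext>'
).text

if __name__ == "__main__":
    print(direct, decimal, hexadecimal, undeclared_ok, declared)
```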

6.2.2 Special Characters Not in Unicode

For special purposes, one may need to use a character which is not in Unicode, even with the expected additions. In these cases one may use the mglyph element for direct access to a glyph from some font, creating a corresponding MathML character. All MathML token elements that accept character data also accept an mglyph in their content.

Beware, however, that the font chosen may not be available to all MathML processors.

6.2.3 Mathematical Alphanumeric Symbols Characters.

A noticeable feature of mathematical and scientific writing is the use of single letters to denote variables and constants in a given context. The increasing complexity of science has led to the use of certain common alphabet and font variations to provide enough special symbols of this letter-like type. These denotations are in fact not letters that may be used to make up words with recognized meanings, but individual carriers of semantics themselves. Writing a string of such symbols is usually interpreted in terms of some composition law, for instance, multiplication. Many letter-like symbols may be quickly interpreted by specialists in a given area as of a certain mathematical type: for instance, bold symbols, whether based on Latin or Greek letters, as vectors in physics or engineering, or fraktur symbols as Lie algebras in part of pure mathematics. Again, in given areas of science, some constants are recognizable letter forms. When you look carefully at the range of letter-like mathematical symbols in common use today, as the STIX project supported by major scientific and technical publishers did, you come up with perhaps surprisingly many. A proposal to facilitate mathematical publishing by inclusion of mathematical alphabetic symbols in the UCS was made, and has been favorably handled.

The new Mathematical Alphanumeric Symbols provided in Unicode 3.1 have code points in Plane 1, that is, in the first plane with Unicode values higher than 2^16. This plane of characters is also known as the Supplementary Multilingual Plane (SMP), in contrast to the Basic Multilingual Plane (BMP) which has been used by Unicode so far. Support for Plane 1 characters in currently deployed software is not always reliable, and in particular support for these Mathematical Alphanumeric Symbol characters is not likely to be widespread until after public fonts covering the characters adopted for mathematics are available.
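One reason Plane 1 support can be fragile is that a Plane 1 character does not fit in a single 16-bit unit. The following sketch, with function names invented for illustration, computes the plane of a code point and the UTF-16 code units it requires.

```python
def plane_of(code_point: int) -> int:
    """Each Unicode plane holds 2**16 code points."""
    return code_point >> 16


def utf16_units(code_point: int) -> list[int]:
    """Return the UTF-16 code units needed for one code point."""
    if code_point < 0x10000:
        return [code_point]
    # Plane 1 and above: split the 20-bit offset into a surrogate pair.
    offset = code_point - 0x10000
    return [0xD800 + (offset >> 10), 0xDC00 + (offset & 0x3FF)]


if __name__ == "__main__":
    fraktur_a = 0x1D504  # Mathematical Fraktur A
    print(plane_of(fraktur_a))
    print([f"{unit:04X}" for unit in utf16_units(fraktur_a)])
```

Software that treats each 16-bit unit as one character will see two items here rather than one, which is the kind of failure the paragraph above warns about.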

As discussed in Section 3.2.2 Mathematics style attributes common to token elements, MathML offers an alternative mechanism to specify mathematical alphabetic characters, which will help bridge the transition to the new Unicode revisions and the deployment of the software and fonts they require. Namely, one uses the mathvariant attribute on the surrounding token element, which will most commonly be mi. In this section we detail the correspondence that a MathML processor should apply between certain characters in Plane 0 (BMP) of Unicode, modified by the mathvariant attribute, and the Plane 1 Mathematical Alphanumeric Symbol characters.

The basic idea of the correspondence is fairly simple. For example, a Mathematical Fraktur alphabet is being added, and the code point for Mathematical Fraktur A is U+1D504. Thus using these characters, a typical example might be

<mi>&#x1D504;</mi>

However, an alternative, equivalent markup would be to use the standard A and modify the identifier using the mathvariant attribute, as follows:

<mi mathvariant="fraktur">A</mi>

The exact correspondence between a mathematical alphabetic character and an unstyled character is complicated by the fact that certain characters that were already present in Unicode are not in the 'expected' sequence.

The detailed correspondence is shown in the tables given in Section 6.3.6 Mathematical Alphanumeric Symbols.
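As an illustration of why the correspondence is not a plain offset, the sketch below maps ASCII capitals to the characters implied by mathvariant="fraktur". The function name is hypothetical, and the exception list reflects the five black-letter capitals that Unicode encodes in the Letterlike Symbols block; the tables in Section 6.3.6 remain the authoritative source.

```python
import unicodedata

FRAKTUR_CAPITAL_A = 0x1D504  # Mathematical Fraktur A, as given above

# Capitals already encoded in the BMP, so their Plane 1 slots are unused.
FRAKTUR_BMP_EXCEPTIONS = {
    "C": 0x212D,
    "H": 0x210C,
    "I": 0x2111,
    "R": 0x211C,
    "Z": 0x2128,
}


def fraktur_capital(letter: str) -> str:
    """Map an ASCII capital to the character meant by mathvariant="fraktur"."""
    if len(letter) != 1 or not "A" <= letter <= "Z":
        raise ValueError("expected a single ASCII capital letter")
    if letter in FRAKTUR_BMP_EXCEPTIONS:
        return chr(FRAKTUR_BMP_EXCEPTIONS[letter])
    return chr(FRAKTUR_CAPITAL_A + ord(letter) - ord("A"))


if __name__ == "__main__":
    for letter in "ABCH":
        char = fraktur_capital(letter)
        print(f"{letter} -> U+{ord(char):04X} {unicodedata.name(char)}")
```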

Mathematical Alphanumeric Symbol characters should not be used for styled text. For example, Mathematical Fraktur A must not be used to just select a blackletter font for an uppercase A. Doing this sort of thing would create problems for searching, restyling (e.g. for accessibility), and many other kinds of processing.

6.2.4 Non-Marking Characters

Some characters, although important for the quality of print or alternative rendering, do not have glyph marks that correspond directly. They are called here non-marking characters. Their roles are discussed in Chapter 3 Presentation Markup and Chapter 4 Content Markup.

In MathML 2 control of page composition, such as line-breaking, is effected by the use of the proper attributes on the mspace element.

The characters below are not simple spacers. They are especially important new additions to the UCS because they provide textual clues which can increase the quality of print rendering, permit correct audio rendering, and allow the unique recovery of mathematical semantics from text which is visually ambiguous.

Character name Unicode Description
&InvisibleTimes; 02062 marks multiplication when it is understood without a mark (Section 3.2.5 Operator, Fence, Separator or Accent (mo))
&InvisibleComma; 02063 used as a separator, e.g., in indices (Section 3.2.5 Operator, Fence, Separator or Accent (mo))
&ApplyFunction; 02061 character showing function application in presentation tagging (Section 3.2.5 Operator, Fence, Separator or Accent (mo))
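To illustrate how these characters appear in practice, the sketch below parses a small presentation fragment for "f(x) + 2x" in which the implied operators are written out as numeric character references. The fragment and the variable names are illustrative assumptions; the code points are those in the table above.

```python
import unicodedata
import xml.etree.ElementTree as ET

INVISIBLE_OPERATORS = {
    "InvisibleTimes": 0x2062,
    "InvisibleComma": 0x2063,
    "ApplyFunction": 0x2061,
}

# "f(x) + 2x" with function application and multiplication made explicit.
markup = (
    "<mrow>"
    "<mi>f</mi><mo>&#x2061;</mo><mrow><mo>(</mo><mi>x</mi><mo>)</mo></mrow>"
    "<mo>+</mo>"
    "<mn>2</mn><mo>&#x2062;</mo><mi>x</mi>"
    "</mrow>"
)
row = ET.fromstring(markup)
# Collect the content of every mo element in document order.
operators = [mo.text for mo in row.iter("mo")]

if __name__ == "__main__":
    for name, code_point in INVISIBLE_OPERATORS.items():
        print(f"{name}: U+{code_point:04X} {unicodedata.name(chr(code_point))}")
    print([f"U+{ord(op):04X}" for op in operators])
```

Although nothing is drawn for U+2061 and U+2062, they are present in the parsed data, so an audio renderer or a computer algebra system can distinguish "f applied to x" from "f times x".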

6.3 Character Symbol Listings

The Universal Character Set (UCS) of Unicode and ISO 10646 continues to evolve; see Section 6.4.4 Status of Character Encodings. A small number of the changes recently introduced, relative to those resulting from the needs of Asian languages, are those designed exactly to facilitate the use of Unicode by the 'equation-writing' community. This specification is written on the assumption that the code assignments suggested to ISO/IEC JTC1/SC2/WG2 by the UTC will be confirmed as they are in public draft forms of Unicode 3.1 and 3.2. As before, we can only reiterate that for the latest developments on details of character standards, as far as they influence mathematical formalism, the home page of the W3C Math Working Group should be consulted.

The characters are given with entity names as well as Unicode numbers. To facilitate comprehension of a fairly large list of names, which totals over 2000 in this case, we offer more than one way to find a given character. A corresponding full set of entity declarations is in the DTD in Appendix A Parsing MathML. For discussion of entity declarations see that appendix.

The characters are listed by name, and sample glyphs provided for all of them. Each character name is accompanied by a code for a character grouping chosen from a list given below, a short verbal description, and a Unicode hex code drawn from ISO 10646, now extended in accordance with the proposal forwarded by the UTC to ISO/IEC WG2 in March 2000.

The character listings by alphabetical and Unicode order in Section 6.3.7 MathML Character Names are in harmony with the ISO character sets given, in that if some part of a set is included then the entire set is included.

6.3.1 Special Constants

To begin we list separately a few of the special characters which MathML has introduced. These have been accorded new Unicode values. Rather like the non-marking characters above, they provide very useful capabilities in the context of machinable mathematics.

Entity name Unicode Description
&CapitalDifferentialD; 02145 D for use in differentials, e.g. within integrals
&DifferentialD; 02146 d for use in differentials, e.g. within integrals
&ExponentialE; 02147 e for use for the exponential base of the natural logarithms
&ImaginaryI; 02148 i for use as a square root of -1
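Because entity references require a DOCTYPE, software that generates MathML may prefer to emit numeric character references for these constants. The following is a minimal sketch under that assumption; the function name and the regular expression are illustrative, and only the four names from the table above are handled.

```python
import re

SPECIAL_CONSTANTS = {
    "CapitalDifferentialD": 0x2145,
    "DifferentialD": 0x2146,
    "ExponentialE": 0x2147,
    "ImaginaryI": 0x2148,
}

# Matches named references such as &DifferentialD; but not numeric ones.
ENTITY_REFERENCE = re.compile(r"&([A-Za-z][A-Za-z0-9]*);")


def to_numeric_references(markup: str) -> str:
    """Replace known entity references by hexadecimal character references."""

    def replace(match: re.Match) -> str:
        code_point = SPECIAL_CONSTANTS.get(match.group(1))
        if code_point is None:
            return match.group(0)  # leave unknown names untouched
        return f"&#x{code_point:04X};"

    return ENTITY_REFERENCE.sub(replace, markup)


if __name__ == "__main__":
    print(to_numeric_references("<mo>&DifferentialD;</mo><mi>x</mi>"))
```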

6.3.2 Character Tables (ASCII format)

The first table offered is a very large ASCII listing of characters considered particularly relevant to mathematics. This is given in Unicode order. Most, but not all, of these characters have MathML names defined via entity declarations in the DTD. Those that do not are usually symbols which seem mathematically peripheral, such as dingbats, machine graphics or technical symbols.

A second table lists those characters that do have MathML entity names, ordered alphabetically, with a lower-case letter preceding its upper-case counterpart.

6.3.3 Tables arranged by Unicode block

The tables in this section detail Unicode code points (displayed with 256 code points per table) that have mathematically significant characters. The sample glyph images link to the table of characters ordered by Unicode given in the previous section. The names of the blocks are those of the Unicode blocks included in the numerical range given; bracketing indicates glyphs for characters of that type are not shown in these tables.

Block Range Description
00000 - 000FF Controls and Basic Latin, and Latin-1 Supplement
00100 - 001FF Latin Extended-A, Latin Extended-B
00200 - 002FF IPA Extensions, Spacing Modifier Letters
00300 - 003FF Combining Diacritical Marks, Greek [and Coptic]
00400 - 004FF Cyrillic
02000 - 020FF General Punctuation, Superscripts and Subscripts, Currency Symbols, Combining Diacritical Marks for Symbols
02100 - 021FF Letter-like Symbols, Number Forms, Arrows
02200 - 022FF Mathematical Operators
02300 - 023FF Miscellaneous Technical
02400 - 024FF Control Pictures, Optical Character Recognition, Enclosed Alphanumerics
02500 - 025FF Box Drawing, Block Elements, Geometric Shapes
02600 - 026FF Miscellaneous Symbols
02700 - 027FF Dingbats
02900 - 029FF Supplemental Arrows, Miscellaneous Mathematical Symbols
02A00 - 02AFF Supplemental Mathematical Operators
03000 - 030FF CJK Symbols and Punctuation, [Hiragana, Katakana]
0FB00 - 0FBFF Alphabetic Presentation Forms
0FE00 - 0FEFF [Combining Half Marks, CJK Compatibility Forms, Small Form Variants, Arabic Presentation Forms-B]
1D400 - 1D4FF Mathematical Styled Latin (Bold, Italic, Bold Italic, Script, Bold Script begins)
1D500 - 1D5FF Mathematical Styled Latin (Bold Script ends, Fraktur, Double-struck, Bold Fraktur, Sans-serif, Sans-serif Bold begins)
1D600 - 1D6FF Mathematical Styled Latin (Sans-serif Bold ends, Sans-serif Italic, Sans-serif Bold Italic, Monospace, Bold), Mathematical Styled Greek (Bold, Italic begins)
1D700 - 1D7FF Mathematical Styled Greek (Italic continued, Bold Italic, Sans-serif Bold), Mathematical Styled Digits
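Since each table covers 256 consecutive code points, the table containing a given character can be found by clearing the low eight bits of its code point. A small sketch, with a function name chosen for illustration:

```python
def table_range(code_point: int) -> str:
    """Return the 256-code-point table range that contains code_point."""
    start = code_point & ~0xFF  # clear the low eight bits
    return f"{start:05X} - {start + 0xFF:05X}"


if __name__ == "__main__":
    # NOT EQUAL TO and Mathematical Fraktur A.
    for cp in (0x2260, 0x1D504):
        print(f"U+{cp:04X} is in table {table_range(cp)}")
```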

6.3.4 Negated Mathematical Characters

In addition to the Unicode Characters so far listed, one may use the combining characters U+0338 (/), U+20D2 (|) and U+20E5 (\) to produce negated or canceled forms of characters. A combining character should be placed immediately after its 'base' character, with no intervening markup or space, just as is the case for combining accents.

In principle, the negation characters may be applied to any Unicode character, although fonts designed for mathematics typically have some negated glyphs ready composed. A MathML renderer should be able to use these pre-composed glyphs in these cases. A compound character code either represents a UCS character that is already available, as in the case of U+003D+0338 which amounts to U+2260, or it does not, as is the case for U+2202+0338. The common cases of negations, of both types, that have been identified are listed in the table.

Note that it is the policy of the W3C and of Unicode that if a single character is already defined for what can be achieved with a combining character, that character must be used instead of the decomposed form. It is also intended that no new single characters representing what can be done with existing compositions will be introduced.
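One way to approximate this policy in software is Unicode canonical composition (Normalization Form C), which replaces a base character plus combining character by the single character where one exists and leaves the sequence alone otherwise. The sketch below uses the unicodedata module from the Python standard library; the function name negate is an assumption made for the example.

```python
import unicodedata

COMBINING_LONG_SOLIDUS_OVERLAY = "\u0338"


def negate(base: str) -> str:
    """Append U+0338 and prefer a single precomposed character if one exists."""
    return unicodedata.normalize("NFC", base + COMBINING_LONG_SOLIDUS_OVERLAY)


if __name__ == "__main__":
    # Equals sign, then partial differential.
    for base in ("\u003D", "\u2202"):
        result = negate(base)
        print(" ".join(f"U+{ord(ch):04X}" for ch in result))
```

For the equals sign the result is the single character U+2260, while for U+2202 the two-character sequence is kept, matching the two cases described above.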

6.3.5 Variant Mathematical Characters

Unicode attempts to avoid having several character codes for simple font variants. For a code point to be assigned there should be more than a nuance in glyphs to be recorded. To record variants worth noting there is a special character in Unicode 3.2, U+FE00 (VARIATION SELECTOR-1), which acts as a postfix modifier. However the legally allowed combinations with this variation selector are restricted to a list recorded as part of Unicode. The VARIATION SELECTOR-1 character may only be applied to the characters listed here. The resulting combination is not regarded by Unicode as a separate character, but a variation on the base character. Unicode aware systems may render the combination as the base if the available fonts do not support the variant glyph shape.
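The fallback described in the last sentence can be sketched as follows. The function name is hypothetical, and the pairing of U+2268 with the variation selector is used only as an example of a listed combination; whether to apply the fallback would depend on what the available fonts support.

```python
import unicodedata

VARIATION_SELECTOR_1 = "\uFE00"


def fallback_to_base(text: str) -> str:
    """Drop VARIATION SELECTOR-1 so each variant renders as its base character."""
    return text.replace(VARIATION_SELECTOR_1, "")


if __name__ == "__main__":
    variant = "\u2268" + VARIATION_SELECTOR_1  # base character plus postfix modifier
    print(unicodedata.name(VARIATION_SELECTOR_1))
    print(len(variant), len(fallback_to_base(variant)))
```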

6.3.6 Mathematical Alphanumeric Symbols

Here we list the special mathematical alphabets. Note that the names for these alphabetic runs should be regarded as conventions resulting from recent tradition in the typesetting of mathematical formulas, rather than as fixing exactly and forever the styles which are to be used. Of course, they do correspond to the styles presently most common. But, for instance, there may be font variations in the glyphs from double-struck, open-face or blackboard bold fonts, all of which would naturally be used for the characters in the range here labelled Double-struck. Similar considerations would apply to appellations such as fraktur and gothic, or script and calligraphic.

As discussed above, the use of these characters is formally equivalent to the use of characters in Plane 0, together with a suitable value for the mathvariant attribute. The correspondence is given in the character tables. Most of these characters come from the additions to Plane 1, however a few characters (such as the double-struck letters N, P, Z, Q, R, C, H representing common number sets) were already present in Unicode 3.0 and retain their original positions. These characters are highlighted in the tables.

6.3.7 MathML Character Names

This section corresponds closely with the entity definitions in the DTD described in Appendix A Parsing MathML. All of the entity sets except the last correspond to entity sets defined by ISO 8879 or ISO 9573-13.

ISO Handle Description
ISOAMSA Added Mathematical Symbols: Arrows
ISOAMSB Added Mathematical Symbols: Binary Operators
ISOAMSC Added Mathematical Symbols: Delimiters
ISOAMSN Added Mathematical Symbols: Negated Relations
ISOAMSO Added Mathematical Symbols: Ordinary
ISOAMSR Added Mathematical Symbols: Relations
ISOBOX Box and Line Drawing
ISOCYR1 Cyrillic-1
ISOCYR2 Cyrillic-2
ISODIA Diacritical Marks
ISOGRK3 Greek-3
ISOLAT1 Latin-1
ISOLAT2 Latin-2
ISOMFRK Mathematical Fraktur
ISOMOPF Mathematical Openface (Double-struck)
ISOMSCR Mathematical Script
ISONUM Numeric and Special Graphic
ISOPUB Publishing
ISOTECH General Technical
MMLEXTRA Extra Names added by MathML

6.4 Differences from Characters in MathML 1

6.4.1 Coverage

We have excluded a very few other characters that may have appeared in the corresponding lists in MathML 1. Those characters thus lost will be found to be used very infrequently in the experience of mathematical publishers, or simply to be completely unacceptable for inclusion in Unicode. However MathML 2 does provide the mglyph element to accommodate new characters that authors may wish to introduce.

6.4.2 Fewer Non-marking Characters

MathML 1.0 listed a number of additional non-marking character entities. These were concerned with composition control, such as line-breaking. In MathML 2 such control is effected by the use of the proper attributes on the mspace element.

6.4.3 ISO Tables

The character listings by alphabetical and Unicode order in Section 6.3.7 MathML Character Names have now been brought more into line with the corresponding ISO character sets than was the case in MathML 1.0, in that if some part of a set is included then the entire set is included. In addition, the group ISOCHEM has been dropped as more properly the concern of chemists. All the ISO mathematical alphabets are listed, since there are now Unicode characters to point to, in particular the bold Greek of ISOGRK3. These changes have also been reflected in the entity declarations in the DTD in Appendix A Parsing MathML.

6.4.4 Status of Character Encodings

A significant change after MathML 1.0 occurred in the movement toward adoption of more characters for mathematics in the UCS and availability of public fonts for mathematics. The encoding of characters in the UCS is done jointly by the Unicode Technical Committee and by ISO/IEC JTC1/SC2/WG2. The process of encoding takes quite some time from the deliberation of first proposals to the final approval. The characters mentioned in this chapter and listed in the associated tables have been through the various stages of this approval process.

At the time of the preparation of the MathML 2.0 Specification the characters relevant to mathematics fell into three categories: Fully accepted characters, characters in final (JTC1) ISO/IEC ballot, and characters before the final ISO/IEC ballot.

  • Fully accepted characters included a large number of Latin, Greek, and Cyrillic letters, a large number of Mathematical Operators and symbols, including arrows, and so on. Fully accepted characters were exactly those that are in both Unicode 3.0 [Unicode] and ISO/IEC 10646-1:2000 [ISOIEC10646-1], which are identical code point by code point. Those of obvious special interest to mathematics numbered over 1,500, depending on how you count.

  • The Mathematical Alphanumeric Symbols were, in April 2001, coming up for a final ballot together with a large number of ideographs and other characters not directly relevant for mathematics. There were just about 1,000 of these. The additions were published as ISO/IEC 10646-2, and became part of Unicode 3.1.

  • Characters relevant to MathML that were before final ballot made up a long list of operators and symbols, including some special constants and non-marking characters (see Section 6.2.4 Non-Marking Characters and Section 6.3.1 Special Constants). They numbered about 590 in all. With some small technical improvements and compromises the proposed additions accepted were published as an amendment to [ISO/IEC 10646-1], and became part of Unicode 3.2.

    Even with the good will shown to the mathematical community by the Unicode process, a small number of characters of special interest to some may not yet have been included. The obvious solution of avoiding their use may not satisfy all. For these characters the Unicode mechanism involving Private Use Area codes could be deployed, in spite of all the dangers of confusion and collisions of conventions this brings with it. However, this is the situation for which mglyph was introduced.
