Overview: Mathematical Markup Language (MathML) Version 2.0
Previous: 5 Combining Presentation and Content Markup
Next: 7 The MathML Interface
6 Characters, Entities and Fonts
6.1 Introduction
6.2 MathML Characters
6.2.1 Unicode Character Data
6.2.2 Special Characters Not in Unicode
6.2.3 Mathematical Alphabetic Symbol Characters
6.2.4 Non-Marking Characters
6.3 Character Symbol Listings
6.3.1 Special Constants
6.3.2 Character Tables (ASCII format)
6.3.3 Tables arranged by Unicode block
6.3.4 Negated Mathematical Characters
6.3.5 Variant Mathematical Characters
6.3.6 Mathematical Alphabetic Characters
6.3.7 MathML Character Names
6.4 Differences from Characters in MathML 1
6.4.1 Coverage
6.4.2 Fewer Non-marking Characters
6.4.3 ISO Tables
6.4.4 Status of Character Encodings
Notation and symbols have proved very important for mathematics. Mathematics has grown in part because of the succinctness and suggestiveness of its evolving notation. There have been many new signs evolved for use in mathematical notation, and mathematicians have not held back from making use of many symbols originally developed elsewhere. The result is that mathematics makes use of a very large collection of symbols. It is difficult to write mathematics fluently if these characters are not available for use in coding. It is difficult to read mathematics if corresponding glyphs are not available for presentation on specific display devices.
This situation posed a problem for the first W3C Math Working Group when it was brought into existence. Developing a specification that enables mathematics to be used with HTML, and producing a DTD for it, did not naturally require the group to worry about more than the entities allowed in the DTD. However, as experience has shown, a long list of entities with no means to display them is of little use, and a cause of frequent frustration in trying to use a standard. On the other hand, a large collection of glyphs and fonts representing characters without a standard way to refer to them is not of much use either.
The W3C Math Working Group therefore took on directly the task of specifying part of the full mechanism needed to proceed from notation to final presentation, and started collaboration with organizations undertaking specification of the rest.
This chapter of the MathML Specification contains a listing of character names for use in MathML, recommendations for their use, and warnings to pay attention to the correct form of the corresponding code points given in the UCS (Universal Character Set) as codified in Unicode and ISO 10646 [see [Unicode] and the Unicode Web site]. For simplicity we shall refer to this character set by the short name Unicode. Though Unicode changes from time to time so that it is specified exactly by using version numbers, unless this brings clarity on some point we shall not use them. This specification of MathML makes use of some characters that are not part of Unicode 3.0 but which have been proposed to the Unicode Technical Committee (UTC), and thus for inclusion in ISO 10646. They are presently expected to be in the revisions Unicode 3.1 and 3.2. (For more detail about this see Section 6.4.4 [Status of Character Encodings].)
While the process of review and adoption by UTC and ISO/IEC of the characters of special interest to mathematics and MathML is largely complete (Unicode Work in Progress) there remains the possibility of some further modification of the lists of characters accepted, of the code assignments for those adopted, or of the names given them by Unicode. To make sure any possible corrections to relevant standards are taken into account, and for the latest character tables and font information, see the W3C Math Working Group home page and the Unicode site.
A MathML token element (see Section 3.2 [Token Elements] and Section 4.4.1 [Token Elements]) takes as content a sequence of MathML Characters. MathML Characters are defined to be either Unicode characters legal in XML documents or mglyph elements. The latter are used to represent characters that do not have a Unicode encoding, as described in Section 3.2.9 [Adding new character glyphs to MathML (mglyph)]. Because the UCS provides approximately one thousand special alphabetic characters for the use of mathematics (Unicode 3.1), and will provide over 900 special symbols in Unicode 3.2, the need for mglyph should be rare.
As always in XML, any character allowed by XML may be used in MathML in an XML document. The legal characters have the hexadecimal code numbers 09 (tab = U+0009), 0A (line feed = U+000A), 0D (carriage return = U+000D), 20-D7FF (U+0020..U+D7FF), E000-FFFD (U+E000..U+FFFD), and 10000-10FFFF (U+010000..U+10FFFF). The parenthetical notation beginning with U+ is one recommended by Unicode for referring to Unicode characters [see [Unicode], page xxviii]. The exclusions above code number D7FF are the blocks used in surrogate pairs and the two code points U+FFFE and U+FFFF, which are guaranteed not to be Unicode characters at all. U+FFFE is excluded to allow determination of byte order in certain encodings.
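The legal ranges listed above can be written as a small membership test. This is only a sketch in Python; the function name is_xml_char is hypothetical and is not part of MathML or of any XML library.

```python
def is_xml_char(code_point):
    """Return True if code_point is a legal character in an XML document,
    following the ranges listed in the paragraph above."""
    return (
        code_point in (0x09, 0x0A, 0x0D)          # tab, line feed, carriage return
        or 0x20 <= code_point <= 0xD7FF           # up to just below the surrogate blocks
        or 0xE000 <= code_point <= 0xFFFD         # excludes U+FFFE and U+FFFF
        or 0x10000 <= code_point <= 0x10FFFF      # the supplementary planes
    )


# Probe a few code points: tab, vertical tab, a surrogate, U+FFFE, and a Plane 1 letter.
for cp in (0x09, 0x0B, 0xD800, 0xFFFE, 0x1D504):
    print(f"U+{cp:04X}: {is_xml_char(cp)}")
```

A check of this kind is useful before serializing generated MathML, since an XML parser will reject a document containing any excluded code point.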
There are essentially three different ways of encoding character data: using the characters directly, using numeric XML character references, or using entity references.
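As an illustration, the sketch below parses the same identifier written as a literal character, as a numeric character reference, and as an entity reference, and checks that all three yield the same character data. It uses Python's standard xml.etree.ElementTree. The one-entity internal DTD subset is an assumption made so the example is self-contained; it stands in for the full MathML DTD, which declares the entity name Afr among many others.

```python
import xml.etree.ElementTree as ET

# The same token element written three ways.
DIRECT = "<mi>\U0001D504</mi>"          # the character itself
NUMERIC = "<mi>&#x1D504;</mi>"          # numeric XML character reference
# Assumption: a minimal internal subset stands in for the MathML DTD here.
ENTITY = '<!DOCTYPE mi [<!ENTITY Afr "&#x1D504;">]><mi>&Afr;</mi>'

texts = [ET.fromstring(doc).text for doc in (DIRECT, NUMERIC, ENTITY)]

# After parsing, the three forms are indistinguishable.
print([f"U+{ord(t):X}" for t in texts])
```

The direct form depends on the document's character encoding being able to represent the character, while the entity form depends on the DTD being available to the parser.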
For special purposes, one may need to use a character which is not in Unicode, even with the expected additions. In these cases one may use the mglyph element for direct access to a glyph from some font and creation of a corresponding MathML character. All MathML token elements that accept character data also accept an mglyph in their content.
Beware, however, that the font chosen may not be available to all MathML processors.
A noticeable feature of mathematical and scientific writing is the use of single letters to denote variables and constants in a given context. The increasing complexity of science has led to the use of certain common alphabet and font variations to provide enough special symbols of this letterlike type. These denotations are in fact not letters that may be used to make up words with recognized meanings, but individual carriers of semantics themselves. Writing a string of such symbols is usually interpreted in terms of some composition law, for instance, multiplication. Many letterlike symbols may be quickly interpreted by specialists in a given area as of a certain mathematical type: for instance, bold symbols, whether based on Latin or Greek letters, as vectors in physics or engineering, or fraktur symbols as Lie algebras in part of pure mathematics. Again, in given areas of science, some constants are recognizable letter forms. When you look carefully at the range of letterlike mathematical symbols in common use today, as the STIX project supported by major scientific and technical publishers did, you come up with perhaps surprisingly many. A proposal to facilitate mathematical publishing by inclusion of mathematical alphabetic symbols in the UCS was made, and has been favorably handled.
The new Mathematical Alphabetic characters expected in Unicode 3.1 have provisional code points in Plane 1, that is, in the first plane with Unicode values higher than 2^{16}. This plane of characters is also known as the Supplementary Multilingual Plane (SMP), in contrast to the Basic Multilingual Plane (BMP) which has been used by Unicode so far. Support for Plane 1 characters in currently deployed software is not always reliable, and in particular support for these Mathematical Alphabetic characters is not likely to be widespread until after final positions in Unicode 3.1 have been confirmed in the standard ISO 10646.
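One place where Plane 1 handling can go wrong is software that stores text as 16-bit units, because a Plane 1 character then occupies two units (a surrogate pair) rather than one. The sketch below computes the pair for Mathematical Fraktur A and cross-checks it against Python's own UTF-16 encoder. The function name utf16_surrogates is hypothetical.

```python
def utf16_surrogates(code_point):
    """Split a supplementary-plane code point into its UTF-16 surrogate pair."""
    if not 0x10000 <= code_point <= 0x10FFFF:
        raise ValueError("not a supplementary-plane code point")
    offset = code_point - 0x10000
    high = 0xD800 + (offset >> 10)      # top 10 bits of the offset
    low = 0xDC00 + (offset & 0x3FF)     # bottom 10 bits of the offset
    return high, low


high, low = utf16_surrogates(0x1D504)   # Mathematical Fraktur A
print(f"{high:04X} {low:04X}")

# Cross-check against the built-in encoder.
encoded = "\U0001D504".encode("utf-16-be")
print(encoded.hex().upper())
```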
As discussed in Section 3.2.2 [Mathematics style attributes common to token elements], MathML offers an alternative mechanism to specify mathematical alphabetic characters, which will help bridge the transition to the Unicode revisions and to the deployment of the software and fonts that implement them. Namely, one uses the mathvariant attribute on the surrounding token element, which will most commonly be mi. In this section we detail the correspondence that a MathML processor should apply between certain characters in Plane 0 (BMP) of Unicode, modified by the mathvariant attribute, and the Plane 1 Mathematical Alphabetic Symbol characters.
The basic idea of the correspondence is fairly simple. For example, a Mathematical Fraktur alphabet is being added, and the code point for Mathematical Fraktur A is U+1D504. Thus using these proposed characters, a typical example might be
<mi>𝔄</mi>
However, an alternative, equivalent markup would be to use the standard A and modify the identifier using the mathvariant attribute, as follows:
<mi mathvariant="fraktur">A</mi>
The exact correspondence between a mathematical alphabetic character and an unstyled character is complicated by the fact that certain characters that were already present in Unicode are not in the 'expected' sequence.
The detailed correspondence is shown in the tables given in Section 6.3.6 [Mathematical Alphabetic Characters].
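The correspondence for the Fraktur alphabet can be sketched as follows. This is an illustration only, not the normative tables of Section 6.3.6: the function name to_fraktur is hypothetical, and the code handles only the unaccented Latin letters. The five capitals in the exception table are those that already existed in the Letterlike Symbols block, so they are not in the 'expected' Plane 1 sequence.

```python
import unicodedata

# Fraktur capitals that were already encoded in the BMP (Letterlike Symbols).
FRAKTUR_EXCEPTIONS = {
    "C": "\u212D",
    "H": "\u210C",
    "I": "\u2111",
    "R": "\u211C",
    "Z": "\u2128",
}


def to_fraktur(letter):
    """Map a plain Latin letter to the character that
    <mi mathvariant="fraktur">letter</mi> is equivalent to."""
    if letter in FRAKTUR_EXCEPTIONS:
        return FRAKTUR_EXCEPTIONS[letter]
    if len(letter) == 1 and "A" <= letter <= "Z":
        return chr(0x1D504 + ord(letter) - ord("A"))
    if len(letter) == 1 and "a" <= letter <= "z":
        return chr(0x1D51E + ord(letter) - ord("a"))
    raise ValueError(f"no Fraktur mapping sketched for {letter!r}")


for plain in "ACa":
    styled = to_fraktur(plain)
    print(plain, f"U+{ord(styled):04X}", unicodedata.name(styled))
```

A processor would need one such table per mathvariant value; the exception sets differ from alphabet to alphabet.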
Mathematical Alphabetic Symbol characters should not be used for styled text. For example, Mathematical Fraktur A must not be used to just select a blackletter font for an uppercase A. Doing this sort of thing would create problems for searching, restyling (e.g. for accessibility), and many other kinds of processing.
Some characters, although important for the quality of print or alternative rendering, do not have glyph marks that correspond directly. They are called here non-marking characters. Below we have a table of those adopted for the purposes of MathML. Their roles are discussed in Chapter 3 [Presentation Markup] and Chapter 4 [Content Markup], respectively. The values of the spaces given are recommendations. Some of these characters are among those with new Unicode values, and some are given as combinations of Unicode characters employing the new special mathematics modifier character (U+0FE00). The correspondence between the spacing amounts mentioned below and those in the Unicode descriptions is not exact, but the matches are good.
In MathML 2 control of page composition, such as linebreaking, is effected by the use of the proper attributes on the mspace element.
The last two characters below, with mnemonic entity names &InvisibleTimes; and &ApplyFunction;, are not simple spacers. They are especially important new additions to the UCS because they provide textual clues which can increase the quality of print rendering, permit correct audio rendering, and allow the unique recovery of mathematical semantics from text which is visually ambiguous.
Character name  Unicode  Description
&Tab;  00009  tabulator stop; horizontal tabulation
&NewLine;  0000A  force a line break; line feed
&Space;  00020  one em of space in the current font
&NonBreakingSpace;  000A0  space that is not a legal breakpoint
&ZeroWidthSpace;  0200B  space of no width at all
&VeryThinSpace;  0200A  space of width 1/18 em
&ThinSpace;  02009  space of width 3/18 em
&MediumSpace;  02005  space of width 4/18 em
&ThickSpace;  02005-0200A  space of width 5/18 em
&NegativeVeryThinSpace;  0200A-0FE00  space of width -1/18 em
&NegativeThinSpace;  02009-0FE00  space of width -3/18 em
&NegativeMediumSpace;  0205F-0FE00  space of width -4/18 em
&NegativeThickSpace;  02005-0FE00  space of width -5/18 em
&InvisibleTimes;  02062  marks multiplication when it is understood without a mark (Section 3.2.5 [Operator, Fence, Separator or Accent (mo)])
&ApplyFunction;  02061  character showing function application in presentation tagging (Section 3.2.5 [Operator, Fence, Separator or Accent (mo)])
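To show how the last two characters in the table disambiguate markup that looks identical in print, the sketch below builds two mrow elements, one meaning "x times y" and one meaning "f applied to x". It uses Python's standard xml.etree.ElementTree; the helper name juxtapose is hypothetical. Serializing to ASCII makes the otherwise invisible operators appear as numeric character references.

```python
import unicodedata
import xml.etree.ElementTree as ET

INVISIBLE_TIMES = "\u2062"   # entity name InvisibleTimes
APPLY_FUNCTION = "\u2061"    # entity name ApplyFunction


def juxtapose(left, operator, right):
    """Build <mrow><mi>left</mi><mo>operator</mo><mi>right</mi></mrow>."""
    mrow = ET.Element("mrow")
    ET.SubElement(mrow, "mi").text = left
    ET.SubElement(mrow, "mo").text = operator
    ET.SubElement(mrow, "mi").text = right
    return mrow


product = juxtapose("x", INVISIBLE_TIMES, "y")       # x times y
application = juxtapose("f", APPLY_FUNCTION, "x")    # f applied to x

for row in (product, application):
    # The default serialization is ASCII, so non-ASCII characters
    # are written as decimal character references.
    print(ET.tostring(row).decode("ascii"))

print(unicodedata.name(INVISIBLE_TIMES), "/", unicodedata.name(APPLY_FUNCTION))
```

A speech renderer can read the first row as "x times y" and the second as "f of x", even though a visual renderer draws nothing for either operator.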
The Universal Character Set (UCS) of Unicode and ISO 10646 continues to evolve (see Section 6.4.4 [Status of Character Encodings]). A small number of the changes recently introduced, small relative to those resulting from the needs of Asian languages, are designed specifically to facilitate the use of Unicode by the 'equation-writing' community. This specification is written on the assumption that the code assignments suggested to ISO/IEC JTC1/SC2/WG2 by the UTC will be confirmed as they appear in the public draft forms of Unicode 3.1 and 3.2. As before, for the latest developments in character standards as far as they influence mathematical formalism, the home page of the W3C Math WG should be consulted.
The characters are given with entity names as well as Unicode numbers. To facilitate comprehension of a fairly large list of names, which totals over 2000 in this case, we offer more than one way to find a given character. A corresponding full set of entity declarations is in the DTD in Appendix A [Parsing MathML]. For discussion of entity declarations see that appendix.
The characters are listed by name, and sample glyphs provided for all of them. Each character name is accompanied by a code for a character grouping chosen from a list given below, a short verbal description, and a Unicode hex code drawn from ISO 10646, now extended in accordance with the proposal forwarded by the UTC to ISO/IEC WG2 in March 2000.
The character listings by alphabetical and Unicode order in Section 6.3.7 [MathML Character Names] are in harmony with the ISO character sets given, in that if some part of a set is included then the entire set is included.
To begin we list separately a few of the special characters which MathML has introduced. These have been accorded new Unicode values. Rather like the non-marking &InvisibleTimes; and &ApplyFunction; above, they provide very useful capabilities in the context of machinable mathematics. It might be imagined there could also be entries below for &true;, &false; and &NotANumber;, but these do not yet have Unicode points assigned. They can be introduced by the character extension mechanisms provided by the mglyph and csymbol elements.
Entity name  Unicode  Description
&CapitalDifferentialD;  02145  D for use in differentials, e.g. within integrals
&DifferentialD;  02146  d for use in differentials, e.g. within integrals
&ExponentialE;  02147  e for use for the exponential base of the natural logarithms
&ImaginaryI;  02148  i for use as a square root of -1
The first table offered is a very large ASCII listing of characters considered particularly relevant to Mathematics. This is given in Unicode (or proposed Unicode) order. Most, but not all, of these characters have MathML names defined via entity declarations in the DTD. Those that do not are usually symbols which seem mathematically peripheral, such as dingbats, machine graphics or technical symbols.
A second table lists those characters that do have MathML entity names, ordered alphabetically, with a lowercase letter preceding its uppercase counterpart.
The tables in this section detail Unicode code points (displayed with 256 code points per table) that have mathematically significant characters. The sample glyph images link to the table of characters ordered by Unicode given in the previous section. As shown in the key for each table, the status of each character (for example in Unicode 3.0 or in the proposed additions) is indicated by a CSS class on the table cell (which by default is indicated by varying the background color). The names of the blocks are those of the Unicode blocks included in the numerical range given; bracketing indicates characters of that type are not shown in these tables.
Block Range  Description
00000 - 000FF  Controls and Basic Latin, and Latin-1 Supplement
00100 - 001FF  Latin Extended-A, Latin Extended-B
00200 - 002FF  IPA Extensions, Spacing Modifier Letters
00300 - 003FF  Combining Diacritical Marks, Greek [and Coptic]
00400 - 004FF  Cyrillic
00500 - 005FF  Cyrillic Supplement, [Armenian, Hebrew]
02000 - 020FF  General Punctuation, Superscripts and Subscripts, Currency Symbols, Combining Diacritical Marks for Symbols
02100 - 021FF  Letterlike Symbols, Number Forms, Arrows
02200 - 022FF  Mathematical Operators
02300 - 023FF  Miscellaneous Technical
02400 - 024FF  Control Pictures, Optical Character Recognition, Enclosed Alphanumerics
02500 - 025FF  Box Drawing, Block Elements, Geometric Shapes
02600 - 026FF  Miscellaneous Symbols
02700 - 027FF  Dingbats
02900 - 029FF  Supplemental Arrows, Miscellaneous Mathematical Symbols
02A00 - 02AFF  Supplemental Mathematical Operators
03000 - 030FF  CJK Symbols and Punctuation, [Hiragana, Katakana]
0FB00 - 0FBFF  Alphabetic Presentation Forms
0FE00 - 0FEFF  [Combining Half Marks, CJK Compatibility Forms, Small Form Variants, Arabic Presentation Forms-B]
1D400 - 1D4FF  Mathematical Styled Latin (Bold, Italic, Bold Italic, Script, Bold Script begins)
1D500 - 1D5FF  Mathematical Styled Latin (Bold Script ends, Fraktur, Double-struck, Bold Fraktur, Sans-serif, Sans-serif Bold begins)
1D600 - 1D6FF  Mathematical Styled Latin (Sans-serif Bold ends, Sans-serif Italic, Sans-serif Bold Italic, Monospace, Bold), Mathematical Styled Greek (Bold, Italic begins)
1D700 - 1D7FF  Mathematical Styled Greek (Italic continued, Bold Italic, Sans-serif Bold), Mathematical Styled Digits
In addition to the Unicode Characters so far listed, one may use the combining characters U+0338 (/), U+20D2 (|) and U+20E5 (\) to produce negated or canceled forms of characters. A combining character should be placed immediately after its 'base' character, with no intervening markup or space, just as is the case for combining accents.
In principle, the negation characters may be applied to any Unicode character, although fonts designed for mathematics typically have some negated glyphs ready composed. A MathML renderer should be able to use these pre-composed glyphs in these cases. A compound character code either represents a UCS character that is already available, as in the case of U+0003D followed by U+00338, which amounts to U+02260, or it does not, as is the case for U+02202 followed by U+00338. The common cases of negations, of both types, that have been identified are listed in the table.
Note that it is the policy of the W3C and of Unicode that if a single character is already defined for what can be achieved with a combining character, that character must be used instead of the decomposed form. It is also intended that no new single characters representing what can be done with existing compositions will be introduced.
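The two cases can be illustrated with Unicode normalization. The sketch below uses Normalization Form C from Python's standard unicodedata module as a stand-in for the policy just described; that choice is an assumption of this example, and the function name negate is hypothetical. Where a single negated character exists, the base plus combining slash collapses to it; where none exists, the two-character sequence is kept.

```python
import unicodedata

COMBINING_LONG_SOLIDUS_OVERLAY = "\u0338"


def negate(base):
    """Append the combining slash to base and compose the result
    into a single character when one is defined."""
    return unicodedata.normalize("NFC", base + COMBINING_LONG_SOLIDUS_OVERLAY)


not_equal = negate("=")           # a single character exists for this
not_partial = negate("\u2202")    # no single character exists for this

for result in (not_equal, not_partial):
    print([f"U+{ord(ch):04X}" for ch in result])
```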
Unicode attempts to avoid having several character codes for simple font variants. For a code point to be assigned there should be more than a nuance in glyphs to be recorded. To record some nuances as variants there is a special character U+FE00 (Variation Selector-1) which acts as a postfix modifier. However, the legally allowed combinations with this variation selector are restricted to a list recorded as part of Unicode. The Variation Selector-1 character may only be applied to the characters listed here.
Here we list the special mathematical alphabets. Note that the names for these alphabetic runs should be regarded as conventions resulting from recent tradition in the typesetting of mathematical formulas, rather than as fixing exactly and forever the styles which are to be used. Of course, they do correspond to the styles presently most common. But, for instance, there may be font variations in the glyphs from double-struck, open-face or blackboard bold fonts, all of which would naturally be used for the characters in the range here labelled Double-struck. Similar considerations would apply to appellations such as fraktur and gothic, or script and calligraphic.
As discussed above, the use of these characters is formally equivalent to the use of characters in Plane 0, together with a suitable value for the mathvariant attribute. The correspondence is given in the character tables. Most of these characters come from the proposed additions to Plane 1; however, a few characters (such as the double-struck letters N, P, Z, Q, R, C, H representing common number sets) were already present in Unicode 3.0 and retain their original positions. These characters are highlighted in the tables.
This section corresponds closely with the entity definitions in the DTD described in Appendix A [Parsing MathML]. All of the entity sets except the last correspond to entity sets defined by ISO 8879 or ISO 9573-13.
ISO Handle  Description
ISOAMSA  Added Mathematical Symbols: Arrows
ISOAMSB  Added Mathematical Symbols: Binary Operators
ISOAMSC  Added Mathematical Symbols: Delimiters
ISOAMSN  Added Mathematical Symbols: Negated Relations
ISOAMSO  Added Mathematical Symbols: Ordinary
ISOAMSR  Added Mathematical Symbols: Relations
ISOBOX  Box and Line Drawing
ISOCYR1  Cyrillic-1
ISOCYR2  Cyrillic-2
ISODIA  Diacritical Marks
ISOGRK3  Greek-3
ISOLAT1  Latin-1
ISOLAT2  Latin-2
ISOMFRK  Mathematical Fraktur
ISOMOPF  Mathematical Open-face (Double-struck)
ISOMSCR  Mathematical Script
ISONUM  Numeric and Special Graphic
ISOPUB  Publishing
ISOTECH  General Technical
MMLEXTRA  Extra Names added by MathML
We have excluded a very few other characters that may have appeared in the corresponding lists in MathML 1. The characters thus lost are either used very infrequently in the experience of mathematical publishers or were simply unacceptable for inclusion in Unicode. However, MathML 2 does provide the mglyph element to accommodate new characters that authors may wish to introduce.
MathML 1.0 listed a number of additional non-marking character entities. These were concerned with composition control, such as linebreaking. In MathML 2 such control is effected by the use of the proper attributes on the mspace element.
The character listings by alphabetical and Unicode order in Section 6.3.7 [MathML Character Names] have now been brought more into line with the corresponding ISO character sets than was the case in MathML 1.0, in that if some part of a set is included then the entire set is included. In addition, the group ISOCHEM has been dropped as more properly the concern of chemists. All the ISO mathematical alphabets are listed, since there are now Unicode characters to point to, in particular the bold Greek of ISOGRK3. These changes have also been reflected in the entity declarations in the DTD in Appendix A [Parsing MathML].
A significant change since MathML 1.0 is the movement toward adoption of more characters for mathematics in the UCS (Universal Character Set) and availability of public fonts for mathematics. The encoding of characters in the UCS (Universal Character Set) is done jointly by the Unicode Technical Committee and by ISO/IEC JTC1/SC2/WG2. The process of encoding takes quite some time from the deliberation of first proposals to the final approval. The characters mentioned in this chapter and listed in the associated tables are at various stages of this approval process. This section gives detailed information about the stages relevant to this specification and gives an overview of the characters affected. The lists, as well as other places that discuss characters, mention when characters are not fully approved or show this graphically. Updates on the status of the characters will be provided by updates to this specification, by errata to this specification, and by notices on the W3C Math home page. The final word on all Unicode matters is naturally to be found at the Unicode Consortium.
The characters relevant for MathML fall at present into three categories: Fully accepted characters, characters in final (JTC1) ISO/IEC ballot, and characters before the final ISO/IEC ballot.
mathvariant attribute (see Section 3.2.2 [Mathematics style attributes common to token elements]) can be used to avoid that risk.
mathvariant attribute are used to avoid that risk.