Overview: Mathematical Markup Language (MathML) Version 2.0
  Previous:     5 Combining Presentation and Content Markup
  Next:     7 The MathML Interface
   
6 Characters, Entities and Fonts
6.1 Introduction
6.2 MathML Characters
   6.2.1 Unicode Character Data
   6.2.2 Special Characters Not in Unicode
   6.2.3 Mathematical Alphanumeric Symbols Characters
   6.2.4 Non-Marking Characters
6.3 Character Symbol Listings
   6.3.1 Special Constants
   6.3.2 Character Tables (ASCII format)
   6.3.3 Tables arranged by Unicode block
   6.3.4 Negated Mathematical Characters
   6.3.5 Variant Mathematical Characters
   6.3.6 Mathematical Alphanumeric Symbols
   6.3.7 MathML Character Names
6.4 Differences from Characters in MathML 1
   6.4.1 Coverage
   6.4.2 Fewer Non-marking Characters 
   6.4.3 ISO Tables
   6.4.4 Status of Character Encodings
Notation and symbols have proved very important for mathematics. Mathematics has grown in part because of the succinctness and suggestiveness of its evolving notation. Many new signs have been developed for use in mathematical notation, and mathematicians have not held back from making use of many symbols originally developed elsewhere. The result is that mathematics makes use of a very large collection of symbols. It is difficult to write mathematics fluently if these characters are not available for use in coding. It is difficult to read mathematics if corresponding glyphs are not available for presentation on specific display devices.
This situation posed a problem for the first W3C Math Working Group when it was brought into existence. The task of developing a specification enabling mathematics to be used with HTML, and of producing a DTD for it, did not naturally extend beyond the entities allowed in the DTD. However, as experience has shown, a long list of entities with no means to display them is of little use, and a cause of frequent frustration in trying to use a standard. On the other hand, a large collection of glyphs and fonts representing characters without a standard way to refer to them is not of much use either.
The W3C Math Working Group therefore took on directly the task of specifying part of the full mechanism needed to proceed from notation to final presentation, and started collaboration with organizations undertaking specification of the rest.
This chapter of the MathML specification contains a listing of character names for use in MathML, recommendations for their use, and warnings to pay attention to the correct form of the corresponding code points given in the UCS (Universal Character Set) as codified in Unicode and ISO 10646 [see [Unicode] and the Unicode Web site]. For simplicity we shall refer to this character set by the short name Unicode. Though Unicode changes from time to time so that it is specified exactly by using version numbers, unless this brings clarity on some point we shall not use them. This specification of MathML makes use of some characters that are not part of Unicode 3.0 but which have been proposed to the Unicode Technical Committee (UTC), and thus for inclusion in ISO 10646. They are presently expected to be in the revisions Unicode 3.1 and 3.2. (For more detail about this see Section 6.4.4 [Status of Character Encodings].)
While the process of review and adoption by UTC and ISO/IEC of the characters of special interest to mathematics and MathML is largely complete (Unicode Work in Progress) there remains the possibility of some further modification of the lists of characters accepted, of the code assignments for those adopted, or of the names given them by Unicode. To make sure any possible corrections to relevant standards are taken into account, and for the latest character tables and font information, see the W3C Math Working Group home page and the Unicode site.
A MathML token element (see Section 3.2 [Token Elements] and Section 4.4.1 [Token Elements]) takes as content a sequence of MathML
Characters.  MathML Characters are defined to be either
Unicode characters legal in XML documents or mglyph elements. The latter are used to represent
characters that do not have a Unicode encoding, as described in
Section 3.2.9 [Adding new character glyphs to MathML
  (mglyph)].  Because the Unicode UCS provides
approximately one thousand special alphabetic characters for the use
of mathematics (Unicode 3.1), and will provide over 900
special symbols in Unicode 3.2, the need for 
mglyph should be rare. 
As always in XML, any character allowed by XML may be used in MathML in an XML document. The legal characters have the hexadecimal code numbers 09 (tab = U+0009), 0A (line feed = U+000A), 0D (carriage return = U+000D), 20-D7FF (U+0020..U+D7FF), E000-FFFD (U+E000..U+FFFD), and 10000-10FFFF (U+010000..U+10FFFF). The parenthetical notation beginning with U+ is one recommended by Unicode for referring to Unicode characters [see [Unicode], page xxviii]. The exclusions above code number D7FF are of the blocks used in surrogate pairs, and the two characters guaranteed not to be Unicode characters at all. U+FFFE is excluded to allow determination of byte order in certain encodings.
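The ranges listed above can be expressed as a simple membership test. The following Python sketch is illustrative only; the names `XML_CHAR_RANGES` and `is_xml_char` are hypothetical and are not defined by XML or MathML.

```python
# Code point ranges legal in XML documents, as listed in the text above.
XML_CHAR_RANGES = [
    (0x09, 0x09),         # tab
    (0x0A, 0x0A),         # line feed
    (0x0D, 0x0D),         # carriage return
    (0x20, 0xD7FF),       # BMP below the surrogate blocks
    (0xE000, 0xFFFD),     # BMP above the surrogate blocks, minus U+FFFE and U+FFFF
    (0x10000, 0x10FFFF),  # Planes 1 through 16
]


def is_xml_char(code_point: int) -> bool:
    """Return True if the code point may appear in an XML document."""
    return any(low <= code_point <= high for low, high in XML_CHAR_RANGES)
```

For instance, `is_xml_char(0x1D504)` is true for Mathematical Fraktur A, while `is_xml_char(0xD800)` (a surrogate) and `is_xml_char(0xFFFE)` are false.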
There are essentially three different ways of encoding character data.
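The enumeration of the three ways is not reproduced here. In XML generally, character data can be written as a literal character, as a numeric character reference, or as an entity reference, and the following Python sketch assumes these are the three ways meant. It uses the standard `xml.etree.ElementTree` parser and a hypothetical miniature document whose internal DTD subset declares one entity, so that the entity reference can be resolved without loading the MathML DTD.

```python
import xml.etree.ElementTree as ET

# Hypothetical test document. The id attributes are for illustration only.
DOC = """<!DOCTYPE math [
  <!ENTITY InvisibleTimes "&#x2062;">
]>
<math>
  <mo id="direct">\u2062</mo>
  <mo id="numeric">&#x2062;</mo>
  <mo id="entity">&InvisibleTimes;</mo>
</math>"""


def token_contents(document: str) -> dict:
    """Map each mo element's id to its parsed character content."""
    root = ET.fromstring(document)
    return {mo.get("id"): mo.text for mo in root.iter("mo")}
```

After parsing, all three `mo` elements carry the same single character U+2062, which suggests that the choice among the three forms is a matter of authoring convenience rather than of meaning.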
For special purposes, one may need to use a character which is not in
Unicode, even with the expected additions. In these cases
one may use the mglyph
element for direct access to a glyph from some font and creation of
a corresponding MathML character.
All MathML token elements that accept character data also accept an
mglyph in their content.
Beware, however, that the font chosen may not be available to all MathML processors.
A noticeable feature of mathematical and scientific writing is the use of single letters to denote variables and constants in a given context. The increasing complexity of science has led to the use of certain common alphabet and font variations to provide enough special symbols of this letter-like type. These denotations are in fact not letters that may be used to make up words with recognized meanings, but individual carriers of semantics themselves. Writing a string of such symbols is usually interpreted in terms of some composition law, for instance, multiplication. Many letter-like symbols may be quickly interpreted by specialists in a given area as of a certain mathematical type: for instance, bold symbols, whether based on Latin or Greek letters, as vectors in physics or engineering, or fraktur symbols as Lie algebras in part of pure mathematics. Again, in given areas of science, some constants are recognizable letter forms. When you look carefully at the range of letter-like mathematical symbols in common use today, as the STIX project supported by major scientific and technical publishers did, you come up with perhaps surprisingly many. A proposal to facilitate mathematical publishing by inclusion of mathematical alphabetic symbols in the UCS was made, and has been favorably handled.
The new Mathematical Alphanumeric Symbols expected in Unicode 3.1 have provisional code points in Plane 1, that is, in the first plane with Unicode values of 2^16 or higher. This plane of characters is also known as the Supplementary Multilingual Plane (SMP), in contrast to the Basic Multilingual Plane (BMP) which has been used by Unicode so far. Support for Plane 1 characters in currently deployed software is not always reliable, and in particular support for these Mathematical Alphanumeric Symbol characters is not likely to be widespread until after final positions in Unicode 3.1 have been confirmed in the standard ISO 10646.
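One likely reason that Plane 1 support lags is that software built around 16-bit code units must represent each such character as a surrogate pair. The Python sketch below, with hypothetical helper names, computes the plane of a code point and the UTF-16 code units of a character, using Mathematical Fraktur A (U+1D504) as the example.

```python
def plane_of(code_point: int) -> int:
    """Return the Unicode plane number (0 is the BMP, 1 is the SMP)."""
    return code_point >> 16


def utf16_code_units(ch: str) -> list:
    """Return the 16-bit code units that encode a single character in UTF-16."""
    data = ch.encode("utf-16-be")
    return [int.from_bytes(data[i:i + 2], "big") for i in range(0, len(data), 2)]


FRAKTUR_A = "\U0001D504"  # Mathematical Fraktur A
```

An ordinary `A` occupies one code unit, whereas `FRAKTUR_A` lies in Plane 1 and occupies two, both drawn from the surrogate blocks.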
As discussed in Section 3.2.2 [Mathematics style attributes common to token
elements], MathML offers an
alternative mechanism to specify mathematical alphabetic characters,
which will help bridge the time of transition to Unicode revisions and
the associated deployment of implementing software and fonts therefore
required.  Namely, one uses the mathvariant
attribute on the surrounding token element, which will most commonly
be mi.  In this section we detail the
correspondence that a MathML processor should apply between certain
characters in Plane 0 (BMP) of Unicode, modified by the
mathvariant attribute, and the Plane 1
Mathematical Alphanumeric Symbol characters.
The basic idea of the correspondence is fairly simple. For example, a Mathematical Fraktur alphabet is being added, and the code point for Mathematical Fraktur A is U+1D504. Thus, using these proposed characters, a typical example might be
<mi>𝔄</mi>
However, an alternative, equivalent markup would be to use
the standard A and modify the identifier using the
mathvariant attribute, as follows:
<mi mathvariant="fraktur">A</mi>
The exact correspondence between a mathematical alphabetic character and an unstyled character is complicated by the fact that certain characters that were already present in Unicode are not in the `expected' sequence.
The detailed correspondence is shown in the tables given in Section 6.3.6 [Mathematical Alphanumeric Symbols].
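The complication can be illustrated for the Fraktur capitals. The sketch below is not the normative correspondence, which is given by the tables; it relies on the character names in Python's `unicodedata` module (which reflects the Unicode version shipped with the interpreter), and the function name `fraktur_capital` is hypothetical. It looks first for a Plane 1 character and falls back to the letter-like symbol that already existed in the BMP, which Unicode names BLACK-LETTER.

```python
import unicodedata


def fraktur_capital(letter: str) -> str:
    """Map an unstyled Latin capital to its Fraktur mathematical character."""
    if len(letter) != 1 or not "A" <= letter <= "Z":
        raise ValueError(f"expected a single Latin capital, got {letter!r}")
    # Try the Plane 1 name first, then the older BMP letter-like symbol.
    for pattern in ("MATHEMATICAL FRAKTUR CAPITAL {}", "BLACK-LETTER CAPITAL {}"):
        try:
            return unicodedata.lookup(pattern.format(letter))
        except KeyError:
            continue
    raise ValueError(f"no Fraktur form found for {letter!r}")
```

Here `fraktur_capital("A")` gives U+1D504 and `fraktur_capital("B")` gives U+1D505, but `fraktur_capital("C")` gives U+212D in the BMP, because the position U+1D506 that a fixed offset would predict is left unassigned.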
Mathematical Alphanumeric Symbol characters should not be used for styled text. For example, Mathematical Fraktur A must not be used to just select a blackletter font for an uppercase A. Doing this sort of thing would create problems for searching, restyling (e.g. for accessibility), and many other kinds of processing.
Some characters, although important for the quality of print or alternative rendering, do not have glyph marks that correspond directly. They are called here non-marking characters. Below we have a table of those adopted for the purposes of MathML. Their roles are discussed in Chapter 3 [Presentation Markup] and Chapter 4 [Content Markup], respectively. The values of the spaces given are recommendations. Some of these characters are among those with new Unicode values, and some are given as combinations of Unicode characters employing the new special mathematics modifier character (U+FE00). The correspondence between the spacing amounts mentioned below and those in the Unicode descriptions is not exact, but the matches are good.
In MathML 2 control of page composition, such as line-breaking, is
effected by the use of the proper attributes on the mspace element. 
The last two characters below, with mnemonic entity names &InvisibleTimes; and &ApplyFunction;, are not simple spacers.  They are
especially important new additions to the UCS because they provide
textual clues which can increase the quality of print rendering,
permit correct audio rendering, and allow the unique recovery of
mathematical semantics from text which is visually ambiguous.
| Character name | Unicode | Description |
|---|---|---|
| &Tab; | 00009 | tabulator stop; horizontal tabulation |
| &NewLine; | 0000A | force a line break; line feed |
| &Space; | 00020 | one em of space in the current font |
| &NonBreakingSpace; | 000A0 | space that is not a legal breakpoint |
| &ZeroWidthSpace; | 0200B | space of no width at all |
| &VeryThinSpace; | 0200A | space of width 1/18 em |
| &ThinSpace; | 02009 | space of width 3/18 em |
| &MediumSpace; | 02005 | space of width 4/18 em |
| &ThickSpace; | 02009-0200A-0200A | space of width 5/18 em |
| &NegativeVeryThinSpace; | 0200A-0FE00 | space of width -1/18 em |
| &NegativeThinSpace; | 02009-0FE00 | space of width -3/18 em |
| &NegativeMediumSpace; | 0205F-0FE00 | space of width -4/18 em |
| &NegativeThickSpace; | 02005-0FE00 | space of width -5/18 em |
| &InvisibleTimes; | 02062 | marks multiplication when it is understood without a mark (Section 3.2.5 [Operator, Fence, Separator or Accent (mo)]) |
| &ApplyFunction; | 02061 | character showing function application in presentation tagging (Section 3.2.5 [Operator, Fence, Separator or Accent (mo)]) |
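To illustrate how the characters U+2062 and U+2061 allow semantics to be recovered from visually identical markup, the following Python sketch produces a naive spoken rendering of a flat run of token elements. The function name `speak` and the spoken forms "times" and "of" are assumptions made for this example; an actual audio renderer would be considerably more elaborate.

```python
import xml.etree.ElementTree as ET

INVISIBLE_TIMES = "\u2062"  # multiplication understood without a mark
APPLY_FUNCTION = "\u2061"   # function application

# Hypothetical spoken forms for the two non-marking operators.
SPOKEN = {INVISIBLE_TIMES: "times", APPLY_FUNCTION: "of"}


def speak(mathml: str) -> str:
    """Produce a naive spoken rendering of the token elements in a fragment."""
    root = ET.fromstring(mathml)
    words = []
    for element in root.iter():
        text = (element.text or "").strip(" \t\r\n")
        if element.tag in ("mi", "mn", "mo") and text:
            words.append(SPOKEN.get(text, text))
    return " ".join(words)


PRODUCT = "<mrow><mi>a</mi><mo>&#x2062;</mo><mi>x</mi></mrow>"
APPLICATION = "<mrow><mi>f</mi><mo>&#x2061;</mo><mi>x</mi></mrow>"
```

Both fragments display as two adjacent letters, yet `speak(PRODUCT)` yields "a times x" while `speak(APPLICATION)` yields "f of x".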
The Universal Character Set (UCS) of Unicode and ISO 10646 continues to evolve; see Section 6.4.4 [Status of Character Encodings]. A small number of the changes recently introduced, relative to those resulting from the needs of Asian languages, are those designed exactly to facilitate the use of Unicode by the `equation-writing' community. This specification is written on the assumption that the code assignments suggested to ISO/IEC JTC1/SC2/WG2 by the UTC will be confirmed as they are in public draft forms of Unicode 3.1 and 3.2. As before, we can only reiterate that, for the latest developments on details of character standards as far as they influence mathematical formalism, the home page of the W3C Math Working Group should be consulted.
The characters are given with entity names as well as Unicode numbers. To facilitate comprehension of a fairly large list of names, which totals over 2000 in this case, we offer more than one way to find a given character. A corresponding full set of entity declarations is in the DTD in Appendix A [Parsing MathML]. For discussion of entity declarations see that appendix.
The characters are listed by name, and sample glyphs provided for all of them. Each character name is accompanied by a code for a character grouping chosen from a list given below, a short verbal description, and a Unicode hex code drawn from ISO 10646, now extended in accordance with the proposal forwarded by the UTC to ISO/IEC WG2 in March 2000.
The character listings by alphabetical and Unicode order in Section 6.3.7 [MathML Character Names] are in harmony with the ISO character sets given, in that if some part of a set is included then the entire set is included.
 To begin we list separately a few of the special characters which
MathML has introduced.  These have
been accorded new Unicode values.  Rather like the non-marking &InvisibleTimes; and &ApplyFunction; above, they provide very useful
capabilities in the context of machinable mathematics.  It might be
imagined there could also be entries below for &true;, &false; and &NotANumber;, but these do not yet have Unicode
points assigned.  They can be introduced by the character extension
mechanisms provided by the mglyph and csymbol elements.
| Entity name | Unicode | Description |
|---|---|---|
| &CapitalDifferentialD; | 02145 | D for use in differentials, e.g. within integrals |
| &DifferentialD; | 02146 | d for use in differentials, e.g. within integrals |
| &ExponentialE; | 02147 | e for use for the exponential base of the natural logarithms |
| &ImaginaryI; | 02148 | i for use as a square root of -1 |
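The code points in this table can be cross-checked against an independent entity table. The Python sketch below uses the HTML5 named character references shipped in the standard `html` module, which are a later list and not the MathML DTD itself; the entity names in `SPECIAL_CONSTANTS` are assumed to be the names intended for the four code points above, and the helper `resolve` is hypothetical.

```python
import html

# Assumed entity names paired with the code points from the table above.
SPECIAL_CONSTANTS = {
    "CapitalDifferentialD": 0x2145,
    "DifferentialD": 0x2146,
    "ExponentialE": 0x2147,
    "ImaginaryI": 0x2148,
}


def resolve(entity_name: str) -> str:
    """Expand a named entity using Python's HTML5 entity table."""
    return html.unescape(f"&{entity_name};")
```

Each name in `SPECIAL_CONSTANTS` resolves to the single character at the code point listed beside it.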
The first table offered is a very large ASCII listing of characters considered particularly relevant to mathematics. This is given in Unicode (or proposed Unicode) order. Most, but not all, of these characters have MathML names defined via entity declarations in the DTD. Those that do not are usually symbols which seem mathematically peripheral, such as dingbats, machine graphics or technical symbols.
A second table lists those characters that do have MathML entity names, ordered alphabetically, with a lower-case letter preceding its upper-case counterpart.
The tables in this section detail Unicode code points (displayed with 256 code points per table) that have mathematically significant characters. The sample glyph images link to the table of characters ordered by Unicode given in the previous section. As shown in the key for each table, the status of each character (for example in Unicode 3.0 or in the proposed additions) is indicated by a CSS class on the table cell (which by default is indicated by varying the background color). The names of the blocks are those of the Unicode blocks included in the numerical range given; bracketing indicates characters of that type are not shown in these tables.
| Block Range | Description | 
|---|---|
| 00000 - 000FF | Controls and Basic Latin, and Latin-1 Supplement | 
| 00100 - 001FF | Latin Extended-A, Latin Extended-B | 
| 00200 - 002FF | IPA Extensions, Spacing Modifier Letters | 
| 00300 - 003FF | Combining Diacritical Marks, Greek [and Coptic] | 
| 00400 - 004FF | Cyrillic | 
| 02000 - 020FF | General Punctuation, Superscripts and Subscripts, Currency Symbols, Combining Diacritical Marks for Symbols | 
| 02100 - 021FF | Letter-like Symbols, Number Forms, Arrows | 
| 02200 - 022FF | Mathematical Operators | 
| 02300 - 023FF | Miscellaneous Technical | 
| 02400 - 024FF | Control Pictures, Optical Character Recognition, Enclosed Alphanumerics | 
| 02500 - 025FF | Box Drawing, Block Elements, Geometric Shapes | 
| 02600 - 026FF | Miscellaneous Symbols | 
| 02700 - 027FF | Dingbats | 
| 02900 - 029FF | Supplemental Arrows, Miscellaneous Mathematical Symbols | 
| 02A00 - 02AFF | Supplemental Mathematical Operators | 
| 03000 - 030FF | CJK Symbols and Punctuation, [Hiragana, Katakana] | 
| 0FB00 - 0FBFF | Alphabetic Presentation Forms | 
| 0FE00 - 0FEFF | [Combining Half Marks, CJK Compatibility Forms, Small Form Variants, Arabic Presentation Forms-B] | 
| 1D400 - 1D4FF | Mathematical Styled Latin (Bold, Italic, Bold Italic, Script, Bold Script begins) | 
| 1D500 - 1D5FF | Mathematical Styled Latin (Bold Script ends, Fraktur, Double-struck, Bold Fraktur, Sans-serif, Sans-serif Bold begins) | 
| 1D600 - 1D6FF | Mathematical Styled Latin (Sans-serif Bold ends, Sans-serif Italic, Sans-serif Bold Italic, Monospace, Bold), Mathematical Styled Greek (Bold, Italic begins) | 
| 1D700 - 1D7FF | Mathematical Styled Greek (Italic continued, Bold Italic, Sans-serif Bold), Mathematical Styled Digits | 
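Since each table covers 256 consecutive code points, the table holding a given character follows directly from its code point. A minimal Python sketch, with the hypothetical helper name `table_range`, formats the range in the same style as the first column above.

```python
def table_range(code_point: int) -> str:
    """Return the label of the 256-code-point table containing code_point."""
    start = code_point & ~0xFF  # clear the low eight bits
    return f"{start:05X} - {start + 0xFF:05X}"
```

For example, `table_range(0x2260)` returns `"02200 - 022FF"`, the Mathematical Operators table.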
In addition to the Unicode characters so far listed, one may use the combining characters U+0338 (/), U+20D2 (|) and U+20E5 (\) to produce negated or canceled forms of characters. A combining character should be placed immediately after its `base' character, with no intervening markup or space, just as is the case for combining accents.
In principle, the negation characters may be applied to any Unicode character, although fonts designed for mathematics typically have some negated glyphs ready composed. A MathML renderer should be able to use these pre-composed glyphs in these cases. A compound character code either represents a UCS character that is already available, as in the case of U+003D followed by U+0338, which amounts to U+2260, or it does not, as is the case for U+2202 followed by U+0338. The common cases of negations, of both types, that have been identified are listed in the accompanying table.
Note that it is the policy of the W3C and of Unicode that if a single character is already defined for what can be achieved with a combining character, that character must be used instead of the decomposed form. It is also intended that no new single characters representing what can be done with existing compositions will be introduced.
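The two cases can be told apart mechanically. The sketch below uses Unicode normalization form NFC from Python's `unicodedata` module, a technique not described in this chapter, to replace a base character plus combining slash with the single pre-composed character where one exists. The function name `negate` is hypothetical.

```python
import unicodedata

COMBINING_LONG_SOLIDUS_OVERLAY = "\u0338"  # the combining slash


def negate(base: str) -> str:
    """Append the combining slash, composing to one character if possible."""
    return unicodedata.normalize("NFC", base + COMBINING_LONG_SOLIDUS_OVERLAY)
```

With this helper, `negate("=")` returns the single character U+2260, whereas `negate("\u2202")` remains the two-character sequence U+2202 U+0338 because no pre-composed form exists.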
Unicode attempts to avoid having several character codes for simple font variants. For a code point to be assigned there should be more than a nuance in glyphs to be recorded. To record variants worth noting there is a special character proposed for Unicode 3.2, U+FE00 (VARIATION SELECTOR-1), which acts as a postfix modifier. However, the legally allowed combinations with this variation selector are restricted to a list recorded as part of Unicode. The VARIATION SELECTOR-1 character may only be applied to the characters listed here. The resulting combination is not regarded by Unicode as a separate character, but as a variation on the base character. Unicode-aware systems may render the combination as the base if the available fonts do not support the variant glyph shape.
Here we list the special mathematical alphabets. Note that the names for these alphabetic runs should be regarded as conventions resulting from recent tradition in the typesetting of mathematical formulas, rather than as fixing exactly and forever the styles which are to be used. Of course, they do correspond to the styles presently most common. But, for instance, there may be font variations in the glyphs from double-struck, open-face or blackboard bold fonts, all of which would naturally be used for the characters in the range here labelled Double-struck. Similar considerations would apply to appellations such as fraktur and gothic, or script and calligraphic.
As discussed above, the use of these characters is formally equivalent
to the use of characters in Plane 0, together with a suitable value
for the mathvariant attribute.  The
correspondence is given in the character tables. Most of these
characters come from the proposed additions to Plane 1; however, a few
characters (such as the double-struck letters N, P, Z, Q, R, C, H
representing common number sets) were already present in Unicode 3.0
and retain their original positions. These characters are highlighted
in the tables.
This section corresponds closely with the entity definitions in the DTD described in Appendix A [Parsing MathML]. All of the entity sets except the last correspond to entity sets defined by ISO 8879 or ISO 9573-13.
| ISO Handle | Description | 
|---|---|
| ISOAMSA | Added Mathematical Symbols: Arrows | 
| ISOAMSB | Added Mathematical Symbols: Binary Operators | 
| ISOAMSC | Added Mathematical Symbols: Delimiters | 
| ISOAMSN | Added Mathematical Symbols: Negated Relations | 
| ISOAMSO | Added Mathematical Symbols: Ordinary | 
| ISOAMSR | Added Mathematical Symbols: Relations | 
| ISOBOX | Box and Line Drawing | 
| ISOCYR1 | Cyrillic-1 | 
| ISOCYR2 | Cyrillic-2 | 
| ISODIA | Diacritical Marks | 
| ISOGRK3 | Greek-3 | 
| ISOLAT1 | Latin-1 | 
| ISOLAT2 | Latin-2 | 
| ISOMFRK | Mathematical Fraktur | 
| ISOMOPF | Mathematical Openface (Double-struck) | 
| ISOMSCR | Mathematical Script | 
| ISONUM | Numeric and Special Graphic | 
| ISOPUB | Publishing | 
| ISOTECH | General Technical | 
| MMLEXTRA | Extra Names added by MathML | 
We have excluded a very few other characters that may have appeared in
the corresponding lists in MathML 1.  The characters thus
dropped are, in the experience of mathematical publishers, used very
infrequently, or are simply
unacceptable for inclusion in Unicode.  However, MathML 2 does provide
the mglyph element to accommodate new
characters that authors may wish to introduce.
MathML 1.0 listed a number of additional
non-marking character entities.  These were concerned with
composition control, such as line-breaking. In MathML 2 such control
is effected by the use of the proper attributes on the mspace element.
The character listings by alphabetical and Unicode order in Section 6.3.7 [MathML Character Names] have now been brought more into line with the corresponding ISO character sets than was the case in MathML 1.0, in that if some part of a set is included then the entire set is included. In addition, the group ISOCHEM has been dropped as more properly the concern of chemists. All the ISO mathematical alphabets are listed, since there are now Unicode characters to point to, in particular the bold Greek of ISOGRK3. These changes have also been reflected in the entity declarations in the DTD in Appendix A [Parsing MathML].
A significant change since MathML 1.0 is the movement toward adoption of more characters for mathematics in the UCS and availability of public fonts for mathematics. The encoding of characters in the UCS is done jointly by the Unicode Technical Committee and by ISO/IEC JTC1/SC2/WG2. The process of encoding takes quite some time from the deliberation of first proposals to the final approval. The characters mentioned in this chapter and listed in the associated tables are at various stages of this approval process. This section gives detailed information about the stages relevant to this specification and gives an overview of the characters affected. The lists, as well as other places that discuss characters, mention when characters are not fully approved or show this graphically. Updates on the status of the characters will be provided by updates to this specification, by errata to this specification, and by notices on the W3C Math home page. The final word on all Unicode matters is naturally to be found at the Unicode Consortium.
The characters relevant for MathML fall at present into three categories: Fully accepted characters, characters in final (JTC1) ISO/IEC ballot, and characters before the final ISO/IEC ballot.
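For characters that are not yet fully approved, such as the Mathematical Alphanumeric Symbols, there remains a risk that code assignments may still change. Markup using the mathvariant attribute (see Section 3.2.2 [Mathematics style attributes common to token elements]) can be used to avoid that risk.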