6 Characters, Entities and Fonts

Overview: Mathematical Markup Language (MathML) Version 2.0
Previous: 5 Combining Presentation and Content Markup
Next: 7 The MathML Interface

6 Characters, Entities and Fonts
6.1 Introduction
6.2 MathML Characters
   6.2.1 Unicode Character Data
   6.2.2 Special Characters Not in Unicode
   6.2.3 Mathematical Alphabetic Symbol Characters
   6.2.4 Non-Marking Characters
6.3 Character Symbol Listings
   6.3.1 Special Constants
   6.3.2 Character Tables (ASCII format)
   6.3.3 Tables arranged by Unicode block
   6.3.4 Negated Mathematical Characters
   6.3.5 Variant Mathematical Characters
   6.3.6 Mathematical Alphabetic Characters
   6.3.7 MathML Character Names
6.4 Differences from Characters in MathML 1
   6.4.1 Coverage
   6.4.2 Fewer Non-marking Characters
   6.4.3 ISO Tables
   6.4.4 Status of Character Encodings

6.1 Introduction

Notation and symbols have proved very important for mathematics. Mathematics has grown in part because of the succinctness and suggestiveness of its evolving notation. Many new signs have evolved for use in mathematical notation, and mathematicians have not held back from making use of many symbols originally developed elsewhere. The result is that mathematics makes use of a very large collection of symbols. It is difficult to write mathematics fluently if these characters are not available for use in coding. It is difficult to read mathematics if corresponding glyphs are not available for presentation on specific display devices.

This situation posed a problem for the first W3C Math Working Group when it was brought into existence. The group's task was to develop a specification enabling mathematics to be used with HTML and to produce a DTD for it, and within that task it was not natural to worry about more than the entities allowed in the DTD. However, as experience has shown, a long list of entities with no means to display them is of little use, and a cause of frequent frustrations in trying to use a standard. On the other hand, a large collection of glyphs and fonts representing characters without a standard way to refer to them is not of much use either.

The W3C Math Working Group therefore took on directly the task of specifying part of the full mechanism needed to proceed from notation to final presentation, and started collaboration with organizations undertaking specification of the rest.

This chapter of the MathML Specification contains a listing of character names for use in MathML, recommendations for their use, and warnings to pay attention to the correct form of the corresponding code points given in the UCS (Universal Character Set) as codified in Unicode and ISO 10646 [see [Unicode] and the Unicode Web site]. For simplicity we shall refer to this character set by the short name Unicode. Though Unicode changes from time to time so that it is specified exactly by using version numbers, unless this brings clarity on some point we shall not use them. This specification of MathML makes use of some characters that are not part of Unicode 3.0 but which have been proposed to the Unicode Technical Committee (UTC), and thus for inclusion in ISO 10646. They are presently expected to be in the revisions Unicode 3.1 and 3.2. (For more detail about this see Section 6.4.4 [Status of Character Encodings].)

While the process of review and adoption by UTC and ISO/IEC of the characters of special interest to mathematics and MathML is largely complete (Unicode Work in Progress) there remains the possibility of some further modification of the lists of characters accepted, of the code assignments for those adopted, or of the names given them by Unicode. To make sure any possible corrections to relevant standards are taken into account, and for the latest character tables and font information, see the W3C Math Working Group home page and the Unicode site.

6.2 MathML Characters

A MathML token element (see Section 3.2 [Token Elements] and Section 4.4.1 [Token Elements]) takes as content a sequence of MathML Characters. MathML Characters are defined to be either Unicode characters legal in XML documents or mglyph elements. The latter are used to represent characters that do not have a Unicode encoding, as described in Section 3.2.9 [Adding new character glyphs to MathML (mglyph)]. Because the Unicode UCS provides approximately one thousand special alphabetic characters for the use of mathematics (Unicode 3.1), and will provide over 900 special symbols in Unicode 3.2, the need for mglyph should be rare.

6.2.1 Unicode Character Data

As always in XML, any character allowed by XML may be used in MathML in an XML document. The legal characters have the hexadecimal code numbers 09 (tab = U+0009), 0A (line feed = U+000A), 0D (carriage return = U+000D), 20-D7FF (U+0020..U+D7FF), E000-FFFD (U+E000..U+FFFD), and 10000-10FFFF (U+010000..U+10FFFF). The parenthetical notation beginning with U+ is one recommended by Unicode for referring to Unicode characters [see [Unicode], page xxviii]. The exclusions above code number D7FF are of the blocks used in surrogate pairs, and the two characters guaranteed not to be Unicode characters at all. U+FFFE is excluded to allow determination of byte order in certain encodings.
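The ranges just listed can be checked mechanically. The following is a minimal sketch rather than part of any MathML interface; the names XML_LEGAL_RANGES, is_xml_legal and illegal_code_points are introduced here purely for illustration.

```python
# Code point ranges that XML allows in a document, as listed above.
XML_LEGAL_RANGES = (
    (0x09, 0x0A),         # tab, line feed
    (0x0D, 0x0D),         # carriage return
    (0x20, 0xD7FF),       # everything below the surrogate blocks
    (0xE000, 0xFFFD),     # skips the surrogates; stops before U+FFFE and U+FFFF
    (0x10000, 0x10FFFF),  # Planes 1 to 16
)


def is_xml_legal(ch):
    """Return True if the single character ch may appear in an XML document."""
    cp = ord(ch)
    return any(lo <= cp <= hi for lo, hi in XML_LEGAL_RANGES)


def illegal_code_points(text):
    """List the characters of text that XML forbids, in U+ notation."""
    return ["U+%04X" % ord(ch) for ch in text if not is_xml_legal(ch)]
```

A generator of MathML might run illegal_code_points over token content before serializing it, and report anything returned instead of emitting a document that a parser would reject.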

There are essentially three different ways of encoding character data.

Using characters directly: For example, an A may be entered as `A' from a keyboard (character U+0041). This option is only available if the character encoding specified for the XML document includes the character. Most commonly used encodings will have `A' in the ASCII position. In many encodings, characters may need more than one byte. Note that if the document is, for example, encoded in Latin-1 (ISO-8859-1) then only the characters in that encoding are available directly. Unfortunately, most mathematical symbols may not be encoded as character data in this way.
Using Numeric XML character references: Using this notation, `A' may be represented as &#65; (decimal) or &#x41; (hex). Note that the numbers always refer to the Unicode encoding (and not to the character encoding used in the XML file). By using character references it is always possible to access the entire Unicode range. For a general XML vocabulary, there is a disadvantage to this approach: character references may not be used in XML element or attribute names. However, this is not an issue for MathML, as all element names in MathML are restricted to ASCII characters.
Using entity references: The MathML DTD defines internal entities that expand to character data. Thus for example the entity reference &eacute; may be used rather than the character reference &#xE9; or, if, for example, the document is encoded in ISO-8859-1, the character é. An XML fragment that uses an entity reference which is not defined in a DTD is not well formed; therefore it will be rejected by an XML parser. For this reason every fragment using entity references must use a DOCTYPE declaration which specifies the MathML DTD, or a DTD that at least declares any entity reference used in the MathML instance. The need to use a DOCTYPE complicates inclusion of MathML in some documents. For this reason entity references are perhaps not optimal for use in generated MathML; however, they are very useful for small illustrative examples, and are used in most examples in this document.
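The three methods can be compared with any XML parser. The sketch below uses the ElementTree module of the Python standard library, which is our choice for illustration and is not implied by this specification. The entity is declared in a small internal DTD subset written for the example, standing in for the full MathML DTD.

```python
import xml.etree.ElementTree as ET

# 1. Character used directly. The Python string is already Unicode, so the
#    question of document encoding only arises when writing bytes.
direct = ET.fromstring("<mi>\u00e9</mi>").text

# 2. Numeric character references, decimal and hexadecimal.
decimal_ref = ET.fromstring("<mi>&#233;</mi>").text
hex_ref = ET.fromstring("<mi>&#xE9;</mi>").text

# 3. Entity reference, declared here in an internal DTD subset.
declared = ET.fromstring(
    '<!DOCTYPE mi [<!ENTITY eacute "&#xE9;">]><mi>&eacute;</mi>'
).text

# Without a declaration the same reference is not well formed,
# and the parser rejects the fragment.
try:
    ET.fromstring("<mi>&eacute;</mi>")
    undeclared_error = None
except ET.ParseError as exc:
    undeclared_error = exc
```

All four successful parses yield the same single character, U+00E9, while the undeclared reference raises a parse error. This is the practical reason generated MathML tends to favor numeric references.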

6.2.2 Special Characters Not in Unicode

For special purposes, one may need to use a character which is not in Unicode, even with the expected additions. In these cases one may use the mglyph element for direct access to a glyph from some font, creating a corresponding MathML character. All MathML token elements that accept character data also accept an mglyph in their content.

Beware, however, that the font chosen may not be available to all MathML processors.

6.2.3 Mathematical Alphabetic Symbol Characters

A noticeable feature of mathematical and scientific writing is the use of single letters to denote variables and constants in a given context. The increasing complexity of science has led to the use of certain common alphabet and font variations to provide enough special symbols of this letter-like type. These denotations are in fact not letters that may be used to make up words with recognized meanings, but individual carriers of semantics themselves. Writing a string of such symbols is usually interpreted in terms of some composition law, for instance, multiplication. Many letter-like symbols may be quickly interpreted by specialists in a given area as of a certain mathematical type: for instance, bold symbols, whether based on Latin or Greek letters, as vectors in physics or engineering, or fraktur symbols as Lie algebras in part of pure mathematics. Again, in given areas of science, some constants are recognizable letter forms. When you look carefully at the range of letter-like mathematical symbols in common use today, as the STIX project supported by major scientific and technical publishers did, you come up with perhaps surprisingly many. A proposal to facilitate mathematical publishing by inclusion of mathematical alphabetic symbols in the UCS was made, and has been favorably handled.

The new Mathematical Alphabetic characters expected in Unicode 3.1 have provisional code points in Plane 1, that is, in the first plane with Unicode values of 2¹⁶ and above. This plane of characters is also known as the Supplementary Multilingual Plane (SMP), in contrast to the Basic Multilingual Plane (BMP) which has been used by Unicode so far. Support for Plane 1 characters in currently deployed software is not always reliable, and in particular support for these Mathematical Alphabetic characters is not likely to be widespread until after final positions in Unicode 3.1 have been confirmed in the standard ISO 10646.
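One reason Plane 1 support is less reliable is that such characters need more storage units than BMP characters in the common encodings. The following sketch illustrates this for Mathematical Fraktur A; the helper name plane is ours and is only illustrative.

```python
def plane(ch):
    """Return the Unicode plane number (0 = BMP, 1 = SMP, ...) of ch."""
    return ord(ch) >> 16


fraktur_a = "\U0001D504"  # Mathematical Fraktur A, a Plane 1 character

# In UTF-16 a Plane 1 character occupies two 16-bit code units
# (a surrogate pair); in UTF-8 it occupies four bytes.
utf16_units = fraktur_a.encode("utf-16-be")
utf8_bytes = fraktur_a.encode("utf-8")
```

Software that assumes one 16-bit unit per character will see the two surrogate code units D835 and DD04 here rather than a single character.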

As discussed in Section 3.2.2 [Mathematics style attributes common to token elements], MathML offers an alternative mechanism to specify mathematical alphabetic characters, which will help bridge the time of transition to Unicode revisions and the associated deployment of implementing software and fonts therefore required. Namely, one uses the mathvariant attribute on the surrounding token element, which will most commonly be mi. In this section we detail the correspondence that a MathML processor should apply between certain characters in Plane 0 (BMP) of Unicode, modified by the mathvariant attribute, and the Plane 1 Mathematical Alphabetic Symbol characters.

The basic idea of the correspondence is fairly simple. For example, a Mathematical Fraktur alphabet is being added, and the code point for Mathematical Fraktur A is U1D504. Thus using these proposed characters, a typical example might be

<mi>&#x1D504;</mi>

However, an alternative, equivalent markup would be to use the standard A and modify the identifier using the mathvariant attribute, as follows:

<mi mathvariant="fraktur">A</mi>

The exact correspondence between a mathematical alphabetic character and an unstyled character is complicated by the fact that certain characters that were already present in Unicode are not in the `expected' sequence.

The detailed correspondence is shown in the tables given in Section 6.3.6 [Mathematical Alphabetic Characters].
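As an illustration of the correspondence and of its irregularities, the following sketch maps an unstyled capital letter to the character that mathvariant="fraktur" denotes. The function name and the exception table are ours; the five exceptional code points are those of the black-letter capitals in the Letterlike Symbols block as we understand them, and the tables referred to above remain the authoritative source.

```python
FRAKTUR_CAPITAL_A = 0x1D504  # code point of Mathematical Fraktur A

# Fraktur capitals that were already present in Unicode before the Plane 1
# alphabets were proposed. Their slots in the Plane 1 run are left unused.
FRAKTUR_EXCEPTIONS = {
    "C": "\u212D",
    "H": "\u210C",
    "I": "\u2111",
    "R": "\u211C",
    "Z": "\u2128",
}


def fraktur_capital(letter):
    """Map an ASCII capital to the character mathvariant="fraktur" denotes."""
    if len(letter) != 1 or not "A" <= letter <= "Z":
        raise ValueError("expected a single ASCII capital letter")
    if letter in FRAKTUR_EXCEPTIONS:
        return FRAKTUR_EXCEPTIONS[letter]
    # Regular case: the same offset from A in both alphabets.
    return chr(FRAKTUR_CAPITAL_A + ord(letter) - ord("A"))
```

For example, fraktur_capital("A") gives U+1D504, matching the markup above, while fraktur_capital("C") gives the pre-existing U+212D instead of a Plane 1 code point.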

Mathematical Alphabetic Symbol characters should not be used for styled text. For example, Mathematical Fraktur A must not be used to just select a blackletter font for an uppercase A. Doing this sort of thing would create problems for searching, restyling (e.g. for accessibility), and many other kinds of processing.

6.2.4 Non-Marking Characters

Some characters, although important for the quality of print or alternative rendering, do not have glyph marks that correspond directly. They are called here non-marking characters. Below we have a table of those adopted for the purposes of MathML. Their roles are discussed in Chapter 3 [Presentation Markup] and Chapter 4 [Content Markup], respectively. The values of the spaces given are recommendations. Some of these characters are among those with new Unicode values, and some are given as combinations of Unicode characters employing the new special mathematics modifier character (U0FE00). The correspondence between the spacing amounts mentioned below and those in the Unicode descriptions is not exact, but the matches are good.

In MathML 2 control of page composition, such as line-breaking, is effected by the use of the proper attributes on the mspace element.

The last two characters below, with mnemonic entity names &InvisibleTimes; and &ApplyFunction;, are not simple spacers. They are especially important new additions to the UCS because they provide textual clues which can increase the quality of print rendering, permit correct audio rendering, and allow the unique recovery of mathematical semantics from text which is visually ambiguous.

Character name	Unicode	Description
`&Tab;`	00009	tabulator stop; horizontal tabulation
`&NewLine;`	0000A	force a line break; line feed
`&Space;`	00020	one em of space in the current font
`&NonBreakingSpace;`	000A0	space that is not a legal breakpoint
`&ZeroWidthSpace;`	0200B	space of no width at all
`&VeryThinSpace;`	0200A	space of width 1/18 em
`&ThinSpace;`	02009	space of width 3/18 em
`&MediumSpace;`	0205F	space of width 4/18 em
`&ThickSpace;`	02005-0200A	space of width 5/18 em
`&NegativeVeryThinSpace;`	0200A-0FE00	space of width -1/18 em
`&NegativeThinSpace;`	02009-0FE00	space of width -3/18 em
`&NegativeMediumSpace;`	0205F-0FE00	space of width -4/18 em
`&NegativeThickSpace;`	02005-0FE00	space of width -5/18 em
`&InvisibleTimes;`	02062	marks multiplication when it is understood without a mark (Section 3.2.5 [Operator, Fence, Separator or Accent (`mo`)])
`&ApplyFunction;`	02061	character showing function application in presentation tagging (Section 3.2.5 [Operator, Fence, Separator or Accent (`mo`)])
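The two invisible operators can be placed in generated presentation markup as ordinary character data. The sketch below builds the expressions 2x and f(x) with the ElementTree module of the Python standard library and serializes them as ASCII, so that U+2062 and U+2061 appear as numeric character references and no DTD is required. The helper names token and mrow are ours, and namespace declarations are omitted for brevity.

```python
import xml.etree.ElementTree as ET

INVISIBLE_TIMES = "\u2062"  # the character named &InvisibleTimes;
APPLY_FUNCTION = "\u2061"   # the character named &ApplyFunction;


def token(tag, text):
    """Build a token element such as mi, mn or mo with character content."""
    element = ET.Element(tag)
    element.text = text
    return element


def mrow(*children):
    """Wrap the given elements in an mrow."""
    row = ET.Element("mrow")
    row.extend(children)
    return row


# 2x, with the understood multiplication marked for the processor.
two_x = mrow(token("mn", "2"), token("mo", INVISIBLE_TIMES), token("mi", "x"))

# f(x), with function application marked between f and its argument.
f_of_x = mrow(
    token("mi", "f"),
    token("mo", APPLY_FUNCTION),
    mrow(token("mo", "("), token("mi", "x"), token("mo", ")")),
)

# ASCII output replaces the invisible characters by numeric references.
two_x_markup = ET.tostring(two_x, encoding="us-ascii").decode("ascii")
f_of_x_markup = ET.tostring(f_of_x, encoding="us-ascii").decode("ascii")
```

A renderer reading two_x_markup can distinguish the product 2x from a two-character identifier, and an audio renderer can speak f_of_x_markup as "f of x" rather than "f times x".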

6.3 Character Symbol Listings

The Universal Character Set (UCS) of Unicode and ISO 10646 continues to evolve (see Section 6.4.4 [Status of Character Encodings]). A small number of the changes recently introduced, relative to those resulting from the needs of Asian languages, are those designed exactly to facilitate the use of Unicode by the `equation-writing' community. This specification is written on the assumption that the code assignments suggested to ISO/IEC JTC1/SC2/WG2 by the UTC will be confirmed as they are in public draft forms of Unicode 3.1 and 3.2. As before, we can only reiterate that for latest developments on details of character standards as far as they influence mathematical formalism the Home Page of the W3C Math WG should be consulted.

The characters are given with entity names as well as Unicode numbers. To facilitate comprehension of a fairly large list of names, which totals over 2000 in this case, we offer more than one way to find a given character. A corresponding full set of entity declarations is in the DTD in Appendix A [Parsing MathML]. For discussion of entity declarations see that appendix.

The characters are listed by name, and sample glyphs provided for all of them. Each character name is accompanied by a code for a character grouping chosen from a list given below, a short verbal description, and a Unicode hex code drawn from ISO 10646, now extended in accordance with the proposal forwarded by the UTC to ISO/IEC WG2 in March 2000.

The character listings by alphabetical and Unicode order in Section 6.3.7 [MathML Character Names] are in harmony with the ISO character sets given, in that if some part of a set is included then the entire set is included.

6.3.1 Special Constants

To begin we list separately a few of the special characters which MathML has introduced. These have been accorded new Unicode values. Rather like the non-marking &InvisibleTimes; and &ApplyFunction; above, they provide very useful capabilities in the context of machinable mathematics. It might be imagined there could also be entries below for &true;, &false; and &NotANumber;, but these do not yet have Unicode points assigned. They can be introduced by the character extension mechanisms provided by the mglyph and csymbol elements.

Entity name	Unicode	Description
`&CapitalDifferentialD;`	02145	D for use in differentials, e.g. within integrals
`&DifferentialD;`	02146	d for use in differentials, e.g. within integrals
`&ExponentialE;`	02147	e for use for the exponential base of the natural logarithms
`&ImaginaryI;`	02148	i for use as a square root of -1

6.3.2 Character Tables (ASCII format)

The first table offered is a very large ASCII listing of characters considered particularly relevant to Mathematics. This is given in Unicode (or proposed Unicode) order. Most, but not all, of these characters have MathML names defined via entity declarations in the DTD. Those that do not are usually symbols which seem mathematically peripheral, such as dingbats, machine graphics or technical symbols.

A second table lists those characters that do have MathML entity names, ordered alphabetically, with a lower-case letter preceding its upper-case counterpart.

6.3.3 Tables arranged by Unicode block

The tables in this section detail Unicode code points (displayed with 256 code points per table) that have mathematically significant characters. The sample glyph images link to the table of characters ordered by Unicode given in the previous section. As shown in the key for each table, the status of each character (for example in Unicode 3.0 or in the proposed additions) is indicated by a CSS class on the table cell (which by default is indicated by varying the background color). The names of the blocks are those of the Unicode blocks included in the numerical range given; bracketing indicates characters of that type are not shown in these tables.

Block Range	Description
00000 - 000FF	Controls and Basic Latin, and Latin-1 Supplement
00100 - 001FF	Latin Extended-A, Latin Extended-B
00200 - 002FF	IPA Extensions, Spacing Modifier Letters
00300 - 003FF	Combining Diacritical Marks, Greek [and Coptic]
00400 - 004FF	Cyrillic
02000 - 020FF	General Punctuation, Superscripts and Subscripts, Currency Symbols, Combining Diacritical Marks for Symbols
02100 - 021FF	Letter-like Symbols, Number Forms, Arrows
02200 - 022FF	Mathematical Operators
02300 - 023FF	Miscellaneous Technical
02400 - 024FF	Control Pictures, Optical Character Recognition, Enclosed Alphanumerics
02500 - 025FF	Box Drawing, Block Elements, Geometric Shapes
02600 - 026FF	Miscellaneous Symbols
02700 - 027FF	Dingbats
02900 - 029FF	Supplemental Arrows, Miscellaneous Mathematical Symbols
02A00 - 02AFF	Supplemental Mathematical Operators
03000 - 030FF	CJK Symbols and Punctuation, [Hiragana, Katakana]
0FB00 - 0FBFF	Alphabetic Presentation Forms
0FE00 - 0FEFF	[Combining Half Marks, CJK Compatibility Forms, Small Form Variants, Arabic Presentation Forms-B]
1D400 - 1D4FF	Mathematical Styled Latin (Bold, Italic, Bold Italic, Script, Bold Script begins)
1D500 - 1D5FF	Mathematical Styled Latin (Bold Script ends, Fraktur, Double-struck, Bold Fraktur, Sans-serif, Sans-serif Bold begins)
1D600 - 1D6FF	Mathematical Styled Latin (Sans-serif Bold ends, Sans-serif Italic, Sans-serif Bold Italic, Monospace, Bold), Mathematical Styled Greek (Bold, Italic begins)
1D700 - 1D7FF	Mathematical Styled Greek (Italic continued, Bold Italic, Sans-serif Bold), Mathematical Styled Digits

6.3.4 Negated Mathematical Characters

In addition to the Unicode Characters so far listed, one may use the combining characters U0338 (/), U20D2 (|) and U20E5 (\) to produce negated or canceled forms of characters. A combining character should be placed immediately after its `base' character, with no intervening markup or space, just as is the case for combining accents.

In principle, the negation characters may be applied to any Unicode character, although fonts designed for mathematics typically have some negated glyphs ready composed. A MathML renderer should be able to use these pre-composed glyphs in these cases. A compound character code either represents a UCS character that is already available, as in the case of U0003D+00338 which amounts to U02260, or it does not, as is the case for U02202+00338. The common cases of negations, of both types, that have been identified are listed in the table of cancellations.

Note that it is the policy of the W3C and of Unicode that if a single character is already defined for what can be achieved with a combining character, that character must be used instead of the decomposed form. It is also intended that no new single characters representing what can be done with existing compositions will be introduced.
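One way to apply this policy in software is Unicode Normalization Form C, which replaces a base character followed by a combining character with the single precomposed character whenever one exists. The use of normalization here is our assumption for the purposes of illustration and is not prescribed by this specification; the function name negate is likewise ours.

```python
import unicodedata

COMBINING_LONG_SOLIDUS_OVERLAY = "\u0338"


def negate(base):
    """Append U+0338 to base, then prefer a single precomposed character."""
    return unicodedata.normalize("NFC", base + COMBINING_LONG_SOLIDUS_OVERLAY)


not_equal = negate("=")         # a precomposed character exists: U+2260
not_partial = negate("\u2202")  # none exists: remains U+2202 then U+0338
```

The two results correspond to the two types of compound character code described above: the first collapses to one character, and the second stays as a base character followed by the combining character.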

cancellations

6.3.5 Variant Mathematical Characters

Unicode attempts to avoid having several character codes for simple font variants. For a code point to be assigned there should be more than a nuance in glyphs to be recorded. To record variants worth noting there is a special character proposed for Unicode 3.2, U+FE00 (VARIATION SELECTOR-1), which acts as a postfix modifier. However the legally allowed combinations with this variation selector are restricted to a list recorded as part of Unicode. The VARIATION SELECTOR-1 character may only be applied to the characters listed here. The resulting combination is not regarded by Unicode as a separate character, but a variation on the base character. Unicode aware systems may render the combination as the base if the available fonts do not support the variant glyph shape.
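The fallback behavior described in the last sentence can be sketched as follows. The set SUPPORTED_VARIANTS is a hypothetical stand-in for whatever combinations the available fonts can actually draw, and the single entry in it is chosen purely as an illustration; the function name is also ours.

```python
VARIATION_SELECTOR_1 = "\uFE00"

# Hypothetical: base-plus-selector pairs that the available fonts support.
SUPPORTED_VARIANTS = {"\u2268" + VARIATION_SELECTOR_1}


def resolve_variants(text):
    """Drop U+FE00 wherever the base-plus-selector pair is not supported."""
    out = []
    for ch in text:
        if ch == VARIATION_SELECTOR_1 and out:
            if out[-1] + ch not in SUPPORTED_VARIANTS:
                continue  # fall back to the plain base character
        out.append(ch)
    return "".join(out)
```

Because the selector is a postfix modifier, dropping it leaves the base character intact, so the text remains readable and searchable even when the variant glyph is unavailable.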

variants

6.3.6 Mathematical Alphabetic Characters

Here we list the special mathematical alphabets. Note that the names for these alphabetic runs should be regarded as conventions resulting from recent tradition in the typesetting of mathematical formulas, rather than as fixing exactly and forever the styles which are to be used. Of course, they do correspond to the styles presently most common. But, for instance, there may be font variations in the glyphs from double-struck, open-face or blackboard bold fonts, all of which would naturally be used for the characters in the range here labelled Double-struck. Similar considerations would apply to appellations such as fraktur and gothic, or script and calligraphic.

As discussed above, the use of these characters is formally equivalent to the use of characters in Plane 0, together with a suitable value for the mathvariant attribute. The correspondence is given in the character tables. Most of these characters come from the proposed additions to Plane 1, however a few characters (such as the double-struck letters N, P, Z, Q, R, C, H representing common number sets) were already present in Unicode 3.0 and retain their original positions. These characters are highlighted in the tables.

6.3.7 MathML Character Names

This section corresponds closely with the entity definitions in the DTD described in Appendix A [Parsing MathML]. All of the entity sets except the last correspond to entity sets defined by ISO 8879 or ISO 9573-13.

ISO Handle	Description
ISOAMSA	Added Mathematical Symbols: Arrows
ISOAMSB	Added Mathematical Symbols: Binary Operators
ISOAMSC	Added Mathematical Symbols: Delimiters
ISOAMSN	Added Mathematical Symbols: Negated Relations
ISOAMSO	Added Mathematical Symbols: Ordinary
ISOAMSR	Added Mathematical Symbols: Relations
ISOBOX	Box and Line Drawing
ISOCYR1	Cyrillic-1
ISOCYR2	Cyrillic-2
ISODIA	Diacritical Marks
ISOGRK3	Greek-3
ISOLAT1	Latin-1
ISOLAT2	Latin-2
ISOMFRK	Mathematical Fraktur
ISOMOPF	Mathematical Openface (Double-struck)
ISOMSCR	Mathematical Script
ISONUM	Numeric and Special Graphic
ISOPUB	Publishing
ISOTECH	General Technical
MMLEXTRA	Extra Names added by MathML

6.4 Differences from Characters in MathML 1

6.4.1 Coverage

We have excluded a very few other characters that may have appeared in the corresponding lists in MathML 1. Those characters thus lost will be found to be used very infrequently in the experience of mathematical publishers, or simply to be completely unacceptable for inclusion in Unicode. However MathML 2 does provide the mglyph element to accommodate new characters that authors may wish to introduce.

6.4.2 Fewer Non-marking Characters

MathML 1.0 listed a number of additional non-marking character entities. These were concerned with composition control, such as line-breaking. In MathML 2 such control is effected by the use of the proper attributes on the mspace element.

6.4.3 ISO Tables

The character listings by alphabetical and Unicode order in Section 6.3.7 [MathML Character Names] have now been brought more into line with the corresponding ISO character sets than was the case in MathML 1.0, in that if some part of a set is included then the entire set is included. In addition, the group ISOCHEM has been dropped as more properly the concern of chemists. All the ISO mathematical alphabets are listed, since there are now Unicode characters to point to, in particular the bold Greek of ISOGRK3. These changes have also been reflected in the entity declarations in the DTD in Appendix A [Parsing MathML].

6.4.4 Status of Character Encodings

A significant change since MathML 1.0 is the movement toward adoption of more characters for mathematics in the UCS (Universal Character Set) and availability of public fonts for mathematics. The encoding of characters in the UCS (Universal Character Set) is done jointly by the Unicode Technical Committee and by ISO/IEC JTC1/SC2/WG2. The process of encoding takes quite some time from the deliberation of first proposals to the final approval. The characters mentioned in this chapter and listed in the associated tables are at various stages of this approval process. This section gives detailed information about the stages relevant to this specification and gives an overview of the characters affected. The lists, as well as other places that discuss characters, mention when characters are not fully approved or show this graphically. Updates on the status of the characters will be provided by updates to this specification, by errata to this specification, and by notices on the W3C Math home page. The final word on all Unicode matters is naturally to be found at the Unicode Consortium.

The characters relevant for MathML fall at present into three categories: Fully accepted characters, characters in final (JTC1) ISO/IEC ballot, and characters before the final ISO/IEC ballot.

Fully accepted characters include a large number of Latin, Greek, and Cyrillic letters, a large number of Mathematical Operators and symbols, including arrows, and so on. Fully accepted characters are currently exactly those that are part of both [Unicode 3.0] and [ISO/IEC 10646-1:2000], which are identical code point by code point. Fully accepted characters are not specially marked or mentioned in this specification; they do not pose any unusual implementation problems other than possibly finding fonts to display them. Those of obvious special interest to mathematics number over 1,500, depending on how you count.
The characters presently in final ballot are the Mathematical Alphanumeric Symbols with a large number of ideographs and other characters not directly relevant for mathematics. There are just about 1,000 of these. The due date of the ballot is early in 2001. If accepted, the additions will still take some time to be formally published. At this stage, there can be only acceptance or rejection of the full proposal without technical changes. The additions are expected to be published as ISO/IEC 10646-2, and to become part of Unicode 3.1, which is tentatively scheduled for March 2001. While acceptance of this ballot seems more likely than rejection, implementers and users of MathML have to be aware that until the final acceptance, they are using the code points of characters in final ballot at their own risk. Entities (see Section 6.3.7 [MathML Character Names]) and the mathvariant attribute (see Section 3.2.2 [Mathematics style attributes common to token elements]) can be used to avoid that risk.
Characters before final ballot relevant to MathML make up a long list of operators and symbols, including some special constants and non-marking characters (see Section 6.2.4 [Non-Marking Characters] and Section 6.3.1 [Special Constants]). There are about 590 of these. The proposal going to ballot is the result of repeated refinements by the UTC; several, possibly final, changes (5) were made at a WG2 meeting in Athens in September. This document reflects these changes. The majority of these characters have proved completely uncontroversial. ISO balloting processes, which involve a PDAM and an FPDAM during which technical changes are possible, and an FDAM with no changes allowed, may be expected to end in November 2001. The additions accepted are expected to be published as an amendment to [ISO/IEC 10646-1], and to become part of Unicode 3.2. It can therefore be expected that almost all of the characters in this category will finally be accepted, and encoded at the current code points. It is possible that a small number of characters may be renamed, moved, or less likely, ultimately rejected. Until final acceptance, implementers and users of MathML are using these characters and code points at their own risk. Entities and the mathvariant attribute can be used to avoid that risk.
