XML Entity definitions for Characters

1 Introduction

Notation and symbols have proved very important for human communication, particularly in scientific documents and above all in mathematics. Mathematics has grown in part because its notation continually changes toward being succinct and suggestive. Many new signs have been developed for use in mathematical notation, and mathematicians have not held back from making use of many symbols originally introduced elsewhere. The result is that science in general, and particularly mathematics, makes use of a very large collection of symbols. It is difficult to write science fluently if these characters are not available for use. It is difficult to read science if corresponding glyphs are not available for presentation on specific display devices. In the majority of cases it is preferable to store characters directly as Unicode character data or as XML numeric character references. However, in some environments it is more convenient to use the ASCII input mechanism provided by XML entity references. Many entity names are in common use, and this specification aims to provide standard mappings to Unicode for each of these names. It introduces no names that have not already been used in earlier specifications.
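As a minimal sketch of the three storage options just mentioned, the following Python fragment parses the same character written directly, as a numeric character reference, and as an entity reference. The internal DTD subset declares the entity ne by hand purely for illustration (an assumption of this sketch); in practice the declaration would come from one of the entity files that accompany this specification. The element names m, a, b and c are hypothetical.

```python
import xml.etree.ElementTree as ET

# Hand-written declaration for illustration only; real documents would
# reference the DTD entity files instead of declaring names inline.
DOC = """<!DOCTYPE m [
  <!ENTITY ne "&#x2260;">
]>
<m><a>\u2260</a><b>&#x2260;</b><c>&ne;</c></m>"""

root = ET.fromstring(DOC)

# After parsing, all three spellings are the same Unicode character data.
texts = [child.text for child in root]
print([f"U+{ord(t):04X}" for t in texts])
```

Once parsed, an application cannot tell which of the three forms was used in the source, which is why the choice can be made purely on grounds of input convenience.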

Specifically, the entity names in the sets starting with the letters "iso" were first standardized in SGML ([SGML]) and updated in [ISO9573-13-1991]. The W3C Math Working Group has been invited to take over the maintenance and development of these sets by the original standards committee (ISO/IEC JTC1 SC34). The sets with names starting "mml" were first standardized in MathML [MathML2] and those starting with "xhtml" were first standardized in HTML [HTML4].

2 Sets of names

This specification defines mappings to Unicode of many sets of names that have been defined by earlier specifications.

We first present two tables listing all the sets combined, first in Unicode order and then in alphabetic order:

All in Unicode order
All in alphabetic order.

Tables documenting each of the entity sets follow. Each set has a link to the DTD entity declaration for the corresponding entity set, and also a link to an XSLT2 stylesheet that implements a reverse mapping from characters to entity names (this is, of course, only possible for entity names that map to a single Unicode code point).

In addition to the stylesheets and entity files corresponding to each individual entity set, a combined stylesheet is provided, as well as two combined sets of DTD entity declarations. The first is a small file which includes all the other entity files via parameter entity references; the second is a larger file that directly contains a definition of each entity, with all duplicates removed.

isobox Box and Line Drawing
isocyr1 Russian Cyrillic
isocyr2 Non-Russian Cyrillic
isodia Diacritical Marks
isolat1 Added Latin 1
isolat2 Added Latin 2
isonum Numeric and Special Graphic
isopub Publishing
isoamsa Added Math Symbols: Arrow Relations
isoamsb Added Math Symbols: Binary Operators
isoamsc Added Math Symbols: Delimiters
isoamsn Added Math Symbols: Negated Relations
isoamso Added Math Symbols: Ordinary
isoamsr Added Math Symbols: Relations
isogrk1 Greek Letters
isogrk2 Monotoniko Greek
isogrk3 Greek Symbols
isogrk4 Alternative Greek Symbols
isomfrk Math Alphabets: Fraktur
isomopf Math Alphabets: Open Face
isomscr Math Alphabets: Script
isotech General Technical
mmlextra Additional MathML Symbols
mmlalias MathML Aliases
xhtml1-lat1 Latin for HTML
xhtml1-special Special for HTML
xhtml1-symbol Symbol for HTML
html5-uppercase uppercase aliases for HTML
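The reverse mapping from characters to entity names described above is provided by this specification as XSLT2 stylesheets. The following Python fragment is only an analogous sketch of the idea, not those stylesheets: it inverts a small hand-picked sample of entity definitions taken from tables in this document and skips any entity whose replacement text is longer than one code point. The names ENTITIES, reverse_map and to_entities are hypothetical.

```python
# Small sample only; the values are copied from tables in this document.
ENTITIES = {
    "epsiv": "\u03F5",
    "varepsilon": "\u03F5",
    "phi": "\u03C6",
    "fjlig": "fj",              # two characters: cannot be reversed
    "race": "\u223D\u0331",     # base plus combining mark: cannot be reversed
}

def reverse_map(entities):
    """Map each single code point to the sorted entity names that produce it."""
    result = {}
    for name, text in entities.items():
        if len(text) != 1:
            continue  # only single-code-point entities have a reverse mapping
        result.setdefault(text, []).append(name)
    return {ch: sorted(names) for ch, names in result.items()}

REVERSE = reverse_map(ENTITIES)

def to_entities(s):
    """Replace every mapped character by a reference to its first entity name."""
    return "".join(f"&{REVERSE[c][0]};" if c in REVERSE else c for c in s)

print(to_entities("a\u03C6b"))
```

Where several names map to the same character, a reverse mapping has to pick one; this sketch arbitrarily takes the alphabetically first name.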

3 Unicode Character Blocks for Scientific Documents

Certain characters are of particular relevance to scientific document production. The following tables display Unicode ranges containing the characters that are most used in mathematics.

000	C0 Controls and Basic Latin, C1 Controls and Latin-1 Supplement
001	Latin Extended-A, Latin Extended-B
002	IPA Extensions, Spacing Modifier Letters
003	Combining Diacritical Marks, Greek and Coptic
004	Cyrillic
020	General Punctuation, Superscripts and Subscripts, Currency Symbols, Combining Diacritical Marks for Symbols
021	Letterlike Symbols, Number Forms, Arrows
022	Mathematical Operators
023	Miscellaneous Technical
024	Control Pictures, Optical Character Recognition, Enclosed Alphanumerics
025	Box Drawing, Block Elements, Geometric Shapes
026	Miscellaneous Symbols
027	Dingbats, Miscellaneous Mathematical Symbols-A, Supplemental Arrows-A
029	Supplemental Arrows-B, Miscellaneous Mathematical Symbols-B
02A	Supplemental Mathematical Operators
02B	Miscellaneous Symbols and Arrows
0FB	Alphabetic Presentation Forms, Arabic Presentation Forms-A
0FE	Variation Selectors, Vertical Forms, Combining Half Marks, CJK Compatibility Forms, Small Form Variants, Arabic Presentation Forms-B
1D4	Mathematical Alphanumeric Symbols
1D5	Mathematical Alphanumeric Symbols (continued)
1D6	Mathematical Alphanumeric Symbols (continued)
1D7	Mathematical Alphanumeric Symbols (continued)
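The three-digit labels in the list above appear to be the code point, written as five hexadecimal digits, with its last two digits dropped, so that each row covers 256 code points. Under that assumption, the row for a given character can be computed as in the following sketch; the function name table_row is hypothetical.

```python
def table_row(ch):
    """Return the row label for a character: its code point with the
    last two hexadecimal digits dropped, padded to three digits."""
    return format(ord(ch) >> 8, "03X")

print(table_row("\u2260"))      # NOT EQUAL TO
print(table_row("\U0001D6DC"))  # MATHEMATICAL BOLD EPSILON SYMBOL
```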

4 Mathematical Alphanumeric Characters

Many of the entities defined by this specification relate to the mathematical alphanumeric characters contained in the letter-like symbols block of Unicode Plane 0, or in the Mathematical Alphanumeric Symbols block in Unicode Plane 1. The following tables list all these symbols, highlighting those that are not in Plane 1, and giving entity names where appropriate.

Bold (Serif)

Italic or Slanted

Bold Italic or Slanted

Double Struck (Open Face, Blackboard Bold)

Script (or Calligraphic)

Slanted Bold Sans Serif

Monospace
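The split between the two planes can be illustrated with the italic alphabet. The following sketch assumes the usual layout of the Mathematical Alphanumeric Symbols block, in which the italic small letters run consecutively from U+1D44E, with a gap at the position of h because that letter already exists in Plane 0 as U+210E. The names math_italic_small and PLANE0_EXCEPTIONS are hypothetical, and only the italic small letters are covered.

```python
import unicodedata

ITALIC_SMALL_A = 0x1D44E  # MATHEMATICAL ITALIC SMALL A

# Letters that live in the Letterlike Symbols block of Plane 0 instead of
# the Plane 1 block (only the italic small alphabet is handled here).
PLANE0_EXCEPTIONS = {"h": 0x210E}  # PLANCK CONSTANT

def math_italic_small(letter):
    """Return the mathematical italic form of an ASCII lowercase letter."""
    if len(letter) != 1 or not "a" <= letter <= "z":
        raise ValueError("expected a single ASCII lowercase letter")
    if letter in PLANE0_EXCEPTIONS:
        return chr(PLANE0_EXCEPTIONS[letter])
    return chr(ITALIC_SMALL_A + ord(letter) - ord("a"))

for letter in "xh":
    ch = math_italic_small(letter)
    print(f"U+{ord(ch):04X}", unicodedata.name(ch))
```

The position that h would occupy in Plane 1, U+1D455, is left unassigned, so a naive offset calculation without the exception table would produce an invalid character.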

5 Entities for Negated and Variant Characters

Most of the entity definitions in this specification expand to a single Unicode character. Some, however, expand to combinations of several characters, as outlined in this section.

5.1 Negated Mathematical Characters

In addition to the Unicode Characters so far listed, one may use the combining characters U+0338 (/), U+20D2 (|) and U+20E5 (\) to produce negated or canceled forms of characters. A combining character should be placed immediately after its "base" character, with no intervening markup or space, just as is the case for combining accents.

In principle, the negation characters may be applied to any Unicode character, although fonts designed for mathematics typically have some negated glyphs precomposed. A MathML renderer should be able to use these precomposed glyphs in these cases. A compound character code either represents a UCS character that is already available, as in the case of U+003D U+0338 which amounts to U+2260, or it does not, as is the case for U+2202 U+0338. The common negations of the latter type that have been identified are listed in the tables.

Note that it is the policy of the W3C and of Unicode that if a single character is already defined for what can be achieved with a combining character, that character must be used instead of the decomposed form. It is also intended that no new single characters representing what can be done with existing compositions will be introduced. For further information on these matters see the Unicode Standard Annex 15, Unicode Normalization Forms [Unicode15], especially the discussion of Normalization Form C.
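The two cases distinguished above can be observed with Normalization Form C. The following sketch uses Python's unicodedata module; the function name negate is hypothetical. It appends U+0338 to a base character and normalizes the result, so that a precomposed character is used wherever Unicode defines one.

```python
import unicodedata

COMBINING_LONG_SOLIDUS_OVERLAY = "\u0338"

def negate(base):
    """Append U+0338 to the base character and normalize to NFC."""
    return unicodedata.normalize("NFC", base + COMBINING_LONG_SOLIDUS_OVERLAY)

for base in ("=", "\u2202"):
    result = negate(base)
    print(" ".join(f"U+{ord(c):04X}" for c in result))
```

For U+003D the result is the single character U+2260; for U+2202 no precomposed form exists, so the two-character sequence is left as it is.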

5.2 Variant Mathematical Characters

Unicode attempts to avoid having several character codes for simple font variants. For a code point to be assigned there should be more than a nuance in glyphs to be recorded. To record variants worth noting there is a special character in Unicode 3.2, U+FE00 (VARIATION SELECTOR-1), which acts as a postfix modifier. However, the permitted combinations with this variation selector are restricted to a list recorded as part of Unicode. The VARIATION SELECTOR-1 character may only be applied to the characters listed here. The resulting combination is not regarded by Unicode as a separate character, but as a variation on the base character. Unicode-aware systems may render the combination as the base character if the available fonts do not support the variant glyph shape.

variation selector-1
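As a small sketch of how such a sequence behaves in software, the following Python fragment builds a variation sequence and shows the fallback of dropping the selector. It assumes that U+2268 is one of the base characters for which a sequence with U+FE00 is permitted; the function names with_vs1 and strip_vs1 are hypothetical.

```python
import unicodedata

VS1 = "\uFE00"  # VARIATION SELECTOR-1

def with_vs1(base):
    """Return the variation sequence: the base followed by the selector."""
    return base + VS1

def strip_vs1(text):
    """Fallback: drop the selector so that only the base character remains."""
    return text.replace(VS1, "")

seq = with_vs1("\u2268")  # assumed to be a permitted base character
print(len(seq), unicodedata.name(seq[1]))
print(unicodedata.normalize("NFC", seq) == seq)
```

The sequence remains two code points after normalization, consistent with the combination being a variation on the base character rather than a character in its own right.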

A Special Considerations

A.1 Epsilon

Historically there has been much confusion and lack of agreement over variant forms for lower case epsilon.

This specification uses the definitions below. The name epsilon is used for the character used in textual Greek (U+03B5), and varepsilon for the epsilon symbol character more commonly used in mathematics (U+03F5). This usage is compatible with the naming of similar pairs of characters (for example theta, vartheta) but incompatible with the naming convention used in TeX, MathML2 and some earlier mappings of the ISO entity sets to Unicode.

Entity	Set	Description	Unicode Character
eacgr	isogrk2	=small epsilon, accent, Greek	U+03AD	GREEK SMALL LETTER EPSILON WITH TONOS
egr	isogrk1	=small epsilon, Greek	U+03B5	GREEK SMALL LETTER EPSILON
epsi	isogrk3	/epsilon	U+03B5	GREEK SMALL LETTER EPSILON
epsilon	xhtml1-symbol		U+03B5	GREEK SMALL LETTER EPSILON
epsiv	isogrk3	/straightepsilon, small epsilon, Greek	U+03F5	GREEK LUNATE EPSILON SYMBOL
straightepsilon	mmlalias	alias ISOGRK3 epsiv	U+03F5	GREEK LUNATE EPSILON SYMBOL
varepsilon	mmlalias	alias ISOGRK3 epsiv	U+03F5	GREEK LUNATE EPSILON SYMBOL
bepsi	isoamsr	/backepsilon R: such that	U+03F6	GREEK REVERSED LUNATE EPSILON SYMBOL
backepsilon	mmlalias	alias ISOAMSR bepsi	U+03F6	GREEK REVERSED LUNATE EPSILON SYMBOL
b.epsi	isogrk4	small epsilon, Greek	U+1D6C6	MATHEMATICAL BOLD SMALL EPSILON
b.epsiv	isogrk4	variant epsilon	U+1D6DC	MATHEMATICAL BOLD EPSILON SYMBOL

A.2 Phi

The situation for phi is very similar to that of epsilon, with the further complication that early versions of Unicode had the sample glyphs for U+03C6 and U+03D5 swapped relative to current usage, and some older fonts still in use follow that older convention. The definitions used in this specification are as listed below.

Entity	Set	Description	Unicode Character
phi	isogrk3	/phi - small phi, Greek	U+03C6	GREEK SMALL LETTER PHI
phi	xhtml1-symbol	greek small letter phi	U+03C6	GREEK SMALL LETTER PHI
phgr	isogrk1	=small phi, Greek	U+03C6	GREEK SMALL LETTER PHI
straightphi	mmlalias	alias ISOGRK3 phiv	U+03D5	GREEK PHI SYMBOL
phiv	isogrk3	/varphi - straight phi	U+03D5	GREEK PHI SYMBOL
varphi	mmlalias	alias ISOGRK3 phiv	U+03D5	GREEK PHI SYMBOL
b.phi	isogrk4	small phi, Greek	U+1D6D7	MATHEMATICAL BOLD SMALL PHI
b.phiv	isogrk4	variant phi	U+1D6DF	MATHEMATICAL BOLD PHI SYMBOL
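The epsilon and phi assignments in A.1 and A.2 can be cross-checked against an independent table. The following sketch uses the html5 dictionary in Python's html.entities module, which holds the HTML5 named character references; since HTML5 derives its entity definitions from this specification, the two are expected to agree for the names they share, but that agreement is an assumption of this sketch rather than something this specification guarantees. Names from sets that HTML5 does not include, such as isogrk4, are left out. The names NAMES, code_point and TABLE are hypothetical.

```python
from html.entities import html5

# Names from the two tables above that also exist in the HTML5 table.
NAMES = [
    "epsi", "epsilon", "epsiv", "straightepsilon", "varepsilon",
    "phi", "phiv", "straightphi", "varphi",
]

def code_point(name):
    """Look up a single-character entity and format its code point."""
    text = html5[name + ";"]  # keys in html5 include the trailing semicolon
    return f"U+{ord(text):04X}"

TABLE = {name: code_point(name) for name in NAMES}

for name, cp in TABLE.items():
    print(f"{name:16} {cp}")
```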

A.3 Multiple Character Entities

In addition to the combining and variant character combinations listed in the previous sections, the following table lists the remaining entity replacement texts that consist of more than one character.

Entity	Set	Description	Unicode Character
fjlig	isopub	small fj ligature	U+0066 U+006A	fj ligature
ThickSpace	mmlextra	space of width 5/18 em	U+205F U+200A	space of width 5/18 em
race	isoamsb	reverse most positive, line below	U+223D U+0331	REVERSED TILDE with underline
acE	isoamsb	most positive, two lines below	U+223E U+0333	INVERTED LAZY S with double underline
DownBreve	mmlextra	breve, inverted (non-spacing)	U+0020 U+0311	COMBINING INVERTED BREVE
tdot	isotech	three dots above	U+0020 U+20DB	COMBINING THREE DOTS ABOVE
TripleDot	mmlalias	alias ISOTECH tdot	U+0020 U+20DB	COMBINING THREE DOTS ABOVE
DotDot	isotech	four dots above	U+0020 U+20DC	COMBINING FOUR DOTS ABOVE

Unicode does not have an fj character, although the other common f ligatures such as fi (U+FB01) are contained in the Alphabetic Presentation Forms block. The fjlig entity is therefore mapped to the pair of characters "fj"; modern typesetting engines should automatically use an fj ligature for this combination if the font supplies one.

Unicode has a range of space characters (including all multiples of 1/18 em up to 6/18 em, except for 5/18 em), so the ThickSpace entity is made from a pair of space characters. An alternative would have been to use U+2005 (1/4 em), but 1/4 em is not equal to 5/18 em, so the above definition was chosen, even though the difference is unlikely to be visibly noticeable at most typeset font sizes.
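The arithmetic behind this choice can be checked exactly with rational numbers. The sketch below takes the nominal widths from this document (4/18 em for U+205F and 1/18 em for U+200A, as given in the change list in B.1); actual widths depend on the font, so the figures are illustrative only, and the variable names are hypothetical.

```python
from fractions import Fraction

# Nominal widths in em, as stated in this document; fonts may differ.
MEDIUM_MATHEMATICAL_SPACE = Fraction(4, 18)  # U+205F
HAIR_SPACE = Fraction(1, 18)                 # U+200A
FOUR_PER_EM_SPACE = Fraction(1, 4)           # U+2005, the rejected alternative

thick_space = MEDIUM_MATHEMATICAL_SPACE + HAIR_SPACE
difference = thick_space - FOUR_PER_EM_SPACE

print(thick_space, difference)
```

The pair adds up to exactly 5/18 em, whereas the single character U+2005 would fall short by 1/36 em.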

The entities race and acE denote underlined characters for which Unicode does not have codepoints, thus combining underline characters have been used, in a way analogous to the use of combining strokes for negated operators.

For reasons explained further in [Charmod-norm], it is not advisable to start the replacement text of an entity with a combining character, because different results may then be produced depending on the order in which entity expansion and Unicode normalisation are performed. As far as possible this specification uses non-combining characters. However, in the three cases shown above (DownBreve, tdot with its alias TripleDot, and DotDot) Unicode has only combining forms of the accents, so the entity replacement text starts with a space, to avoid the possibility that the expansion of the entity combines with preceding text.
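The order dependence described above can be reproduced in a few lines. In the following sketch, BAD is the replacement text of a purely hypothetical entity that starts with a combining character, and TDOT is the replacement text of tdot from the table above. The two functions model the two possible processing orders in a simplified way; they are not taken from [Charmod-norm].

```python
import unicodedata

def nfc(s):
    return unicodedata.normalize("NFC", s)

def expand_then_normalize(before, replacement):
    """Expand the entity first, then normalize the whole text."""
    return nfc(before + replacement)

def normalize_then_expand(before, replacement):
    """Normalize the pieces separately, then join them by expansion."""
    return nfc(before) + nfc(replacement)

BAD = "\u0338"    # hypothetical entity: starts with a combining character
TDOT = " \u20DB"  # tdot: the leading space keeps the accent off the text

print(expand_then_normalize("=", BAD) == normalize_then_expand("=", BAD))
print(expand_then_normalize("x", TDOT) == normalize_then_expand("x", TDOT))
```

With the hypothetical entity the two orders disagree, because the combining character composes with the preceding "=" only when expansion happens first. With the space-prefixed replacement text both orders give the same result.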

B Changes

B.1 Changes since 2008-07-21

The html5-uppercase set is now documented.

The entities ohm and angst have changed to U+03A9 and U+00C5 to match NFC. See the W3C Bugzilla entry.

The entity race, which had been erroneously assigned to U+29DA, is now assigned to the combination U+223D U+0331. (U+223D isn't quite the shape shown in the original ISO document which is a rotated S rather than a rotated tilde, but this appears to be the closest character in Unicode 5.2.)

The entities bsolhsub and suphsol which were previously mapped to two-character combinations U+005C U+2282 and U+2283 U+002F are now mapped to the Unicode 5 characters that were added specifically to support these entities, U+27C8 and U+27C9.

The source files have all been updated to match Unicode 5.2.

The entity ThickSpace now maps to the pair U+205F U+200A rather than the triple U+2009 U+200A U+200A, that is, (4/18 + 1/18) em rather than (3/18 + 1/18 + 1/18) em.

The entity UnderBar maps to the spacing character _ rather than the combining character U+0332.

The entity OverBar maps to the spacing character U+203E (like the XHTML entity oline) rather than the macron character U+00AF.

The entities epsiv and varepsilon are now mapped to the epsilon symbol U+03F5 rather than being aliases for the entity epsilon, U+03B5.

The entities phiv and varphi are now mapped to the phi symbol U+03D5 rather than being aliases for the entity phi, U+03C6.
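The ohm and angst change listed above can be illustrated with Normalization Form C. The sketch below assumes that the earlier definitions were the letterlike symbols U+2126 (OHM SIGN) and U+212B (ANGSTROM SIGN); this document does not state the earlier values, so that is an assumption of the example. The names OLD and NEW are hypothetical.

```python
import unicodedata

# Assumed earlier definitions (not stated in this document).
OLD = {"ohm": "\u2126", "angst": "\u212B"}

# Normalizing to NFC replaces each sign by its canonical equivalent.
NEW = {name: unicodedata.normalize("NFC", ch) for name, ch in OLD.items()}

for name, ch in NEW.items():
    print(name, f"U+{ord(ch):04X}", unicodedata.name(ch))
```

Text containing the old characters would be rewritten to U+03A9 and U+00C5 by any process that normalizes to NFC, so defining the entities with those characters directly avoids a silent change.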

B.2 Changes between 2008-07-21 and 2007-12-14

The following entity definitions have changed in this draft:

phi, lang, rang, OverParenthesis, UnderParenthesis, OverBrace, UnderBrace, lbbrk, rbbrk.

C Differences between these entities and earlier W3C DTDs

C.1 Differences from XHTML 1.0

The XHTML entity definitions described here differ from the entity set described in the XHTML 1.0 DTD as follows.

lang and rang: U+27E8 and U+27E9; XHTML 1.0 used U+2329 and U+232A (which have canonical decomposition to U+3008 and U+3009).

Note:

The current drafts of [HTML5] use entity definitions derived from this specification.

C.2 Differences from MathML 2.0 (second edition)

The differences between MathML 2 and the current entity definitions are listed below.

fjlig: ISOPUB (and MathML 1) defined an fj ligature; Unicode does not have a specific character and the entity was dropped from MathML2. It is re-instated here for maximum compatibility with [SGML].
phi: U+03C6 GREEK SMALL LETTER PHI (the definition used in HTML4); MathML2 used U+03D5 GREEK PHI SYMBOL.
epsiv, varepsilon, phiv, varphi: these have been changed to map to the symbol character (to match other uses of the var prefix such as vartheta).
jmath: U+0237; MathML 2 used U+006A (j) as there was no dotless j before Unicode 4.1.
trpezium, elinters: U+23E2 and U+23E7; MathML 2 used U+FFFD (REPLACEMENT CHARACTER) as these characters were added at Unicode 5.0 specifically to support these entities.
ohm, angst: As noted above, the definitions of these entities have been changed to use characters that are in Unicode Normalization Form C (NFC).
bsolhsub and suphsol: U+27C8 and U+27C9; MathML2 used U+005C U+2282 and U+2283 U+002F.

The following bracket symbols have been added to the Mathematical symbols block in Unicode versions between 3.1 and 5.1. MathML2 used similar characters intended for CJK punctuation.

lang, langle, LeftAngleBracket and rang, rangle, RightAngleBracket: U+27E8 and U+27E9; MathML2 used U+2329 and U+232A (which have canonical decomposition to U+3008 and U+3009).
Lang and Rang: U+27EA and U+27EB; MathML2 used U+300A and U+300B.
lbbrk and rbbrk: U+2772 and U+2773; MathML2 used U+3014 and U+3015.
loang and roang: U+27EC and U+27ED; MathML2 used U+3018 and U+3019.
lobrk and robrk: U+27E6 and U+27E7; MathML2 used U+301A and U+301B.
OverBrace and UnderBrace: U+23DE and U+23DF; MathML2 used U+FE37 and U+FE38.
OverParenthesis and UnderParenthesis: U+23DC and U+23DD; MathML2 used U+FE35 and U+FE36.
LeftDoubleBracket and RightDoubleBracket: U+27E6 and U+27E7; MathML2 used U+301A and U+301B.
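The remark about canonical decomposition in the list above can be checked mechanically. The following sketch takes the old and new code points from that list and reports which of them are altered by Normalization Form C; the names REPLACED and changed_by_nfc are hypothetical.

```python
import unicodedata

# MathML2 code point -> code point used by this specification,
# copied from the list above.
REPLACED = {
    0x2329: 0x27E8, 0x232A: 0x27E9,
    0x300A: 0x27EA, 0x300B: 0x27EB,
    0x3014: 0x2772, 0x3015: 0x2773,
    0x3018: 0x27EC, 0x3019: 0x27ED,
    0x301A: 0x27E6, 0x301B: 0x27E7,
    0xFE37: 0x23DE, 0xFE38: 0x23DF,
    0xFE35: 0x23DC, 0xFE36: 0x23DD,
}

def changed_by_nfc(cp):
    """True if normalizing the character to NFC yields different text."""
    ch = chr(cp)
    return unicodedata.normalize("NFC", ch) != ch

unstable_old = sorted(f"U+{cp:04X}" for cp in REPLACED if changed_by_nfc(cp))
unstable_new = sorted(f"U+{cp:04X}" for cp in REPLACED.values() if changed_by_nfc(cp))

print(unstable_old, unstable_new)
```

Only the two old angle brackets are altered by NFC, and none of the replacement characters are, so the new definitions survive normalization unchanged.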