XML Entity definitions for Characters

Status of this Document

This section describes the status of this document at the time of its publication. Other documents may supersede this document. A list of current W3C publications and the latest revision of this technical report can be found in the W3C technical reports index at http://www.w3.org/TR/.

This document is a W3C First Public Working Draft produced by the W3C Math Working Group as part of the W3C Math Activity.

Publication as a Working Draft does not imply endorsement by the W3C Membership. This is a draft document and may be updated, replaced or obsoleted by other documents at any time. It is inappropriate to cite this document as other than work in progress.

Public discussion of this document is encouraged on www-math@w3.org, the public mailing list of the Math Working Group (list archives). To subscribe, send an email to www-math-request@w3.org with the word subscribe in the subject line.

Please report errors in this document to www-math@w3.org.

This document was produced by a group operating under the 5 February 2004 W3C Patent Policy. W3C maintains a public list of any patent disclosures made in connection with the deliverables of the group; that page also includes instructions for disclosing a patent. An individual who has actual knowledge of a patent which the individual believes contains Essential Claim(s) must disclose the information in accordance with section 6 of the W3C Patent Policy.

It is hoped that the entity sets defined by this specification may form the basis of an update to [ISO9573-13-1991]. However, pressure of other commitments has so far prevented this document from being processed by the relevant ISO committee, so the entity sets are presented with Formal Public Identifiers of the form -//W3C//... rather than ISO.... It is hoped that an update to TR 9573-13 may be made later. (The present version of TR 9573-13 defines the sets of names, but does not give mappings to Unicode.)

1 Introduction

Notation and symbols have proved very important for scientific documents, especially in mathematics. Mathematics has grown in part because its notation continually changes toward being succinct and suggestive. Many new signs have been developed for use in mathematical notation, and mathematicians have not held back from using symbols originally introduced elsewhere. The result is that mathematics makes use of a very large collection of symbols. It is difficult to write mathematics fluently if these characters are not available for use. It is difficult to read mathematics if the corresponding glyphs are not available for presentation on specific display devices.

In the majority of cases it is preferable to store characters directly as Unicode character data or as XML numeric character references. However, in some environments it is more convenient to use the ASCII input mechanism provided by XML entity references. Many entity names are in common use, and this specification aims to provide standard mappings to Unicode for each of these names. It introduces no names that have not already been used in earlier specifications. Specifically, the entity names in the sets starting with the letters "iso" were first standardized in SGML ([SGML]) and updated in [ISO9573-13-1991], the entity names in the sets with names starting "mml" were first standardized in MathML ([MathML2]), and those starting with "xhtml" were first standardized in HTML ([HTML4]).
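The three ways of writing a character mentioned above (direct character data, a numeric character reference, and an entity reference) can be compared with a minimal Python sketch. It assumes a one-line internal DTD subset that declares only alpha as U+03B1; the element names p, a, b and c are hypothetical, and a real document would normally reference the published entity files instead of declaring entities inline.

```python
import xml.etree.ElementTree as ET

# Hypothetical test document: an internal DTD subset declares a single
# entity, alpha, as U+03B1 GREEK SMALL LETTER ALPHA.
DOC = """<!DOCTYPE p [
  <!ENTITY alpha "&#x3B1;">
]>
<p><a>\u03b1</a><b>&#x3B1;</b><c>&alpha;</c></p>"""

root = ET.fromstring(DOC)

# After parsing, the direct character, the numeric character reference
# and the entity reference are indistinguishable.
forms = [child.text for child in root]
print(forms)
```

All three child elements end up holding the same single character, which is why the choice between the forms is a matter of authoring convenience rather than document content.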

2 Sets of names

This specification defines Unicode mappings of many sets of names that have been defined by earlier specifications.

We first present two tables listing the combined sets, first in Unicode order and then in alphabetic order, and then present tables documenting each of the entity sets. Each set has a link to the DTD entity declaration for the corresponding entity set, and also a link to an XSLT2 stylesheet that implements a reverse mapping from characters to entity names (this is only possible for entity names that map to a single Unicode code point).

In addition to the stylesheets and entity files corresponding to each individual entity set, a combined stylesheet is provided, as well as two combined sets of DTD entity declarations. The first is a small file that includes all the other entity files via parameter entity references; the second is a larger file that directly contains a definition of each entity, with all duplicates removed.

isobox Box and Line Drawing
isocyr1 Russian Cyrillic
isocyr2 Non-Russian Cyrillic
isodia Diacritical Marks
isolat1 Added Latin 1
isolat2 Added Latin 2
isonum Numeric and Special Graphic
isopub Publishing
isoamsa Added Math Symbols: Arrow Relations
isoamsb Added Math Symbols: Binary Operators
isoamsc Added Math Symbols: Delimiters
isoamsn Added Math Symbols: Negated Relations
isoamso Added Math Symbols: Ordinary
isoamsr Added Math Symbols: Relations
isogrk1 Greek Letters
isogrk2 Monotoniko Greek
isogrk3 Greek Symbols
isogrk4 Alternative Greek Symbols
isomfrk Math Alphabets: Fraktur
isomopf Math Alphabets: Open Face
isomscr Math Alphabets: Script
isotech General Technical
mmlextra Additional MathML Symbols
mmlalias MathML Aliases
xhtml1-lat1 Latin for HTML
xhtml1-special Special for HTML
xhtml1-symbol Symbol for HTML
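The restriction on the reverse mapping (characters to entity names) can be illustrated with a short Python sketch. The SAMPLE table is a tiny hand-written excerpt based on names discussed in Appendix A, not the real data files, and reverse_map is a hypothetical helper; it is not the logic of the published XSLT stylesheets.

```python
from collections import defaultdict

# Illustrative excerpt only: entity name -> replacement text.
SAMPLE = {
    "phi": "\u03d5",
    "straightphi": "\u03d5",
    "phis": "\u03d5",
    "varphi": "\u03c6",
    "fjlig": "fj",  # two code points, so it cannot be reversed
}

def reverse_map(entities):
    """Map each single code point to the sorted list of names that produce it."""
    by_char = defaultdict(list)
    for name, value in entities.items():
        if len(value) == 1:  # skip multi-code-point replacement text
            by_char[value].append(name)
    return {ch: sorted(names) for ch, names in by_char.items()}

reversed_sample = reverse_map(SAMPLE)
print(reversed_sample)
```

Several names can share one character, so an actual character map has to choose a single name per character; the sketch keeps all candidates and leaves that choice open.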

3 Unicode Character Blocks for Scientific Documents

Certain characters are of particular relevance to scientific document production. The following tables display Unicode ranges containing the characters that are most used in mathematics.

000	C0 Controls and Basic Latin, C1 Controls and Latin-1 Supplement
001	Latin Extended-A, Latin Extended-B
002	IPA Extensions, Spacing Modifier Letters
003	Combining Diacritical Marks, Greek and Coptic
004	Cyrillic
020	General Punctuation, Superscripts and Subscripts, Currency Symbols, Combining Diacritical Marks for Symbols
021	Letterlike Symbols, Number Forms, Arrows
022	Mathematical Operators
023	Miscellaneous Technical
024	Control Pictures, Optical Character Recognition, Enclosed Alphanumerics
025	Box Drawing, Block Elements, Geometric Shapes
026	Miscellaneous Symbols
027	Dingbats, Miscellaneous Mathematical Symbols-A, Supplemental Arrows-A
029	Supplemental Arrows-B, Miscellaneous Mathematical Symbols-B
02A	Supplemental Mathematical Operators
0FB	Alphabetic Presentation Forms, Arabic Presentation Forms-A
0FE	Variation Selectors, Vertical Forms, Combining Half Marks, CJK Compatibility Forms, Small Form Variants, Arabic Presentation Forms-B
1D4	Mathematical Alphanumeric Symbols
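The row labels above read as the leading three hexadecimal digits of a five-digit code point, so each row covers 256 code points. Under that reading (an interpretation of the labels, not something stated explicitly), the row for a given character can be computed as in this Python sketch; table_row is a hypothetical helper name.

```python
def table_row(ch):
    """Return the three-hex-digit row label covering the character ch."""
    # Dropping the low 8 bits leaves the 256-code-point row number.
    return format(ord(ch) >> 8, "03X")

for ch in ("A", "\u03b1", "\u2211", "\U0001d49c"):
    print(f"U+{ord(ch):05X} is in row {table_row(ch)}")
```

For example, U+2211 N-ARY SUMMATION falls in row 022 (Mathematical Operators) and U+1D49C falls in row 1D4 (Mathematical Alphanumeric Symbols).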

A Differences between these entities and earlier W3C DTDs

A.1 Differences from XHTML 1.0

Currently there is just one difference between the XHTML entity definitions described here and the entity set described in the XHTML 1.0 DTD.

phi: XHTML uses U+03C6 (decimal 966) GREEK SMALL LETTER PHI; in these files phi is defined as U+03D5 (decimal 981) GREEK PHI SYMBOL.

Note:

It is very difficult for (X)HTML definitions to change since HTML is so widely deployed. Many of the assignments in the current definitions would be different if it were not for HTML compatibility. However, in this case, perhaps this change could be made in an XHTML2/HTML5 time frame. If not, these definitions should change, as the entity sets should be compatible. Currently U+03D5 has the entity names phi, straightphi, phis. U+03C6 has the entity names phgr, phiv, varphi.

It is also worth noting that Unicode has changed (swapped) the default glyphs for U+03C6 and U+03D5 since the publication of HTML4.
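The two code points involved in the phi difference can be checked with Python's standard library, as a small sketch. It relies on html.entities.name2codepoint, which records the HTML 4 entity definitions, and on unicodedata for the character names; the constant names XHTML_PHI and DRAFT_PHI are made up for this example.

```python
import unicodedata
from html.entities import name2codepoint

XHTML_PHI = "\u03c6"  # what XHTML 1.0 assigns to phi
DRAFT_PHI = "\u03d5"  # what the entity files described here assign to phi

for label, ch in (("XHTML 1.0", XHTML_PHI), ("these files", DRAFT_PHI)):
    print(f"{label}: U+{ord(ch):04X} (decimal {ord(ch)}) {unicodedata.name(ch)}")

# The HTML 4 table shipped with Python agrees with the XHTML value.
html4_phi = name2codepoint["phi"]
print(html4_phi == ord(XHTML_PHI))
```

A document that is processed by both an HTML-based and a MathML-based toolchain may therefore resolve phi to different characters, which is the incompatibility the note above describes.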

A.2 Differences from MathML 2.0 (second edition)

The differences between MathML 2 and the current entity definitions are listed below.

fjlig: fj. ISOPUB (and MathML 1) defined an fj ligature. Unicode does not have a specific character for it, and the entity was dropped from MathML2. It is reinstated here for maximum compatibility with [SGML].
jmath: U+0237. MathML 2 used U+006A (j), as there was no dotless j before Unicode 4.1.
trpezium, elinters: U+23E2 and U+23E7. MathML 2 used U+FFFD (REPLACEMENT CHARACTER); these characters were added in Unicode 5.0 specifically to support these entities.

The following bracket symbols have been added to the Mathematical symbols block in Unicode versions between 3.1 and 5.1. MathML2 used similar characters intended for CJK punctuation.

Lang: U+27EA, MathML2 used U+300A
lbbrk: U+2997, MathML2 used U+3014
loang: U+27EC, MathML2 used U+3018
lobrk: U+27E6, MathML2 used U+301A
Rang: U+27EB, MathML2 used U+300B
rbbrk: U+2998, MathML2 used U+3015
roang: U+27ED, MathML2 used U+3019
robrk: U+27E7, MathML2 used U+301B
LeftDoubleBracket: U+27E6, MathML2 used U+301A
RightDoubleBracket: U+27E7, MathML2 used U+301B
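The bracket changes listed above amount to a fixed character-for-character substitution, which can be sketched in Python with str.translate. The helper name upgrade_brackets is hypothetical, and the sketch assumes the input is mathematical text produced from MathML 2 entity expansions; applying it blindly to text that contains genuine CJK punctuation would be wrong.

```python
# MathML 2 code point -> code point used by the entity sets described here.
CJK_TO_MATH = str.maketrans({
    "\u300a": "\u27ea",  # Lang
    "\u3014": "\u2997",  # lbbrk
    "\u3018": "\u27ec",  # loang
    "\u301a": "\u27e6",  # lobrk, LeftDoubleBracket
    "\u300b": "\u27eb",  # Rang
    "\u3015": "\u2998",  # rbbrk
    "\u3019": "\u27ed",  # roang
    "\u301b": "\u27e7",  # robrk, RightDoubleBracket
})

def upgrade_brackets(text):
    """Replace the CJK brackets used by MathML 2 with the mathematical ones."""
    return text.translate(CJK_TO_MATH)

sample = "\u301ax\u301b"
print(upgrade_brackets(sample))
```

Only eight distinct substitutions are needed for the ten entity names, because LeftDoubleBracket and RightDoubleBracket share code points with lobrk and robrk.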

Note:

MathML3 uses the entity sets defined by this specification, so there will be no differences between MathML and the entities defined here once MathML3 is finalized.

B Source Files

All data files used to construct the entity declarations, XSLT character maps, and HTML tables referenced from this document are available from http://www.w3.org/2003/entities/2007xml/.

unicode.xml master file detailing all Unicode characters, with their names in various entity sets and applications, TeX equivalents and other data. This file has been maintained for many years, originally by Sebastian Rahtz as part of the passivetex distribution and since around 1999 as part of the MathML specification sources by David Carlisle. The current version encodes data for all characters in Unicode 5.1 (beta). Note: unicode.xml is over 5MB in size and may not be suitable for direct viewing in a browser; you may prefer to save the file rather than open it in a browser.
charlist.rnc RELAX NG schema for unicode.xml.
unicode.xsl XSLT stylesheet that renders unicode.xml as an HTML table.
character-set.xml The source file for this document.
xmlspec.xsl a copy of the standard xmlspec stylesheet
run small script file that builds this collection
xhtml1.xml record of XHTML 1.0 entity definitions
mml2.xml record of MathML 2.0 (second edition) entity definitions
unicodedata.xsl stylesheet that generates a new copy of unicode.xml, incorporating data from the Unicode data file; used to update unicode.xml as new versions of Unicode are released.
entities.xsl stylesheet to generate the DTD declarations for the entities.
charmap.xsl stylesheet to generate the XSLT character maps.
characters.xsl stylesheet to generate this document, including the referenced HTML tables.
schemas.xml file associating XML documents with the appropriate RELAX NG schema

C References

SGML: ISO/IEC 8879:1986, Information processing — Text and office systems — Standard Generalized Markup Language (SGML)
ISO9573-13-1991: ISO/IEC TR 9573-13:1991, Information technology — SGML support facilities — Techniques for using SGML — Part 13: Public entity sets for mathematics and science
Unicode: The Unicode Consortium; The Unicode Standard, Version 5.0, Addison-Wesley Professional; 5th edition (November 3, 2006). ISBN 0321480910. (http://www.unicode.org/versions/Unicode5.0.0/)
Unicode25: Barbara Beeton, Asmus Freytag, Murray Sargent III, Unicode Support for Mathematics, Unicode Technical Report #25 2007-05-07. (http://www.unicode.org/unicode/reports/tr25/)
MathML2: David Carlisle, Patrick Ion, Robert Miner, Nico Poppelier, Mathematical Markup Language (MathML) Version 2.0 (Second Edition) W3C Recommendation 21 October 2003 (http://www.w3.org/TR/2003/REC-MathML2-20031021/)
HTML4: Dave Raggett, Arnaud Le Hors, Ian Jacobs, HTML 4.01 Specification W3C Recommendation 24 December 1999 (http://www.w3.org/TR/1999/REC-html401-19991224)