Information technology - SGML support facilities

Information technology - SGML support facilities - Techniques for using SGML - Part 13: Public entity sets for mathematics and sciences

2003-12-08

Martin Bryan, David Carlisle

Scope

Tens of thousands of graphic characters, a large proportion of which have been defined in ISO/IEC 10646, are used in publishing text. Even where standard coded representations exist, however, there may be situations in which they cannot be keyboarded conveniently or accurately, or in which it is not possible to display the desired visual depiction of the characters.

To help overcome these barriers to the successful interchange of SGML and related documents, this part of ISO/IEC TR 9573 defines character entity sets for special graphic characters that are regularly used in the production of scientific and mathematical documents.

Entity repertoires are necessarily larger and more repetitious than character sets, as they deal in general with higher-level constructs. For example, unique entities have been defined for each accented Latin alphabetic character, while a character set might represent such characters as combinations of letters and diacritical mark characters.
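The contrast between the two approaches can be illustrated with a minimal Python sketch. It assumes the standard `unicodedata` module and uses e with acute accent purely as an example; the variable names are illustrative and are not part of this report.

```python
import unicodedata

# One precomposed character, as a single entity for "e with acute" would denote.
precomposed = "\u00E9"  # LATIN SMALL LETTER E WITH ACUTE

# The same text as a character set might encode it:
# a base letter followed by a combining diacritical mark.
decomposed = unicodedata.normalize("NFD", precomposed)

# List the code points of the decomposed form.
code_points = [f"U+{ord(c):04X}" for c in decomposed]
print(code_points)
```

The entity approach names the accented letter as a unit, whereas the decomposed form needs two coded characters to express it.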

In many instances upper and lower case are used to differentiate the names of entities. It is assumed that any SGML concrete syntax used in conjunction with these entity names will be case sensitive.

The reference concrete syntax defined in ISO/IEC 8879 (SGML) is case sensitive for entity names.

Normative references

The following standards contain provisions which, through reference in this text, constitute provisions of this part of ISO/IEC TR 9573. At the time of publication, the editions indicated were valid. All standards are subject to revision, and parties to agreements based on this part of ISO/IEC TR 9573 are encouraged to investigate the possibility of applying changes made in more recent editions of referenced standards.

ISO/IEC 8879:1986

ISO/IEC 9541-1:1991

ISO/IEC 10646-1:2000/Amd 1:2002

ISO/IEC 10646-2:2001

Definitions

For the purposes of this part of ISO/IEC TR 9573 the definitions given in ISO/IEC 8879 apply.

General considerations

This edition of this part of ISO/IEC TR 9573 has been aligned with the Unicode 3.2 updates to ISO/IEC 10646:2000, as covered by Amendment 1 to that standard. For backwards compatibility, the names assigned to the characters in the original edition of this part are shown before those assigned to the characters in ISO/IEC 10646. References to characters in this part should, however, use the ISO/IEC 10646 name rather than the name originally assigned by ISO/IEC TR 9573.

Format of Descriptions

To follow

Corresponding Display Entity Sets

Each character has a characteristic visual depiction known as a "glyph". Systems using these entity sets need to be able to convert each entity reference to an appropriate glyph. Where character sets based on ISO/IEC 10646 are available, this is typically done by defining the entity as a numeric character reference of the form &#xnnnnn; where nnnnn is the five-digit hexadecimal code listed in the column headed Unicode/10646. The first digit of the code indicates the plane of ISO/IEC 10646 to which the character has been assigned. The entity name and a descriptive comment are added to the definition, giving it the form:

<!ENTITY frac78  "&#x0215E;"><!-- VULGAR FRACTION SEVEN EIGHTHS-->
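As a minimal sketch of how such declarations might be generated, the following Python fragment builds a declaration from an entity name, a five-digit hexadecimal code and a description. The function names `plane_of` and `entity_declaration` are hypothetical, and the exact spacing of the output is a choice made here rather than something this report prescribes.

```python
import string

def _check_code(code: str) -> None:
    # The listings give code points as exactly five hexadecimal digits.
    if len(code) != 5 or any(c not in string.hexdigits for c in code):
        raise ValueError(f"expected five hexadecimal digits, got {code!r}")

def plane_of(code: str) -> int:
    """Return the ISO/IEC 10646 plane, taken from the first digit of the code."""
    _check_code(code)
    return int(code[0], 16)

def entity_declaration(name: str, code: str, description: str) -> str:
    """Build an entity declaration whose value is a numeric character reference."""
    _check_code(code)
    return f'<!ENTITY {name} "&#x{code};"><!-- {description} -->'

print(entity_declaration("frac78", "0215E", "VULGAR FRACTION SEVEN EIGHTHS"))
```

A code such as 0215E lies in plane 0 (the BMP), while a code in the 1Dxxx range lies in plane 1.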

Comparison with other sets of entity definitions

Differences between MathML and Stix Data

The Stix consortium maintains a table of information about mathematical characters, including mappings to ISO/IEC entity sets. The following is an annotated list of cases where entity definitions appear to differ between these two collections.

During the review of this draft document these alignment issues will be reviewed and resolved.

Differences between MathML and DocBook Data

OASIS distribute a set of entity declarations for use with the DocBook markup language (see http://www.oasis-open.org/docbook/specs/wd-docbook-xmlcharent-0.3.html).

The following table lists the current differences between this set and the definitions described in this document.

During the review of this draft document these alignment issues will be reviewed and resolved.

Differences between MathML and XHTML 1.1 Data

W3C distribute a set of entity declarations for use with the (X)HTML markup language (see http://www.w3.org/TR/2001/REC-xhtml-modularization-20010410/dtd_module_defs.html#a_xhtml_character_entities).

The following table lists the current differences between this set and the definitions described in this document.

During the review of this draft document these alignment issues will be reviewed and resolved.

Character definitions requiring special review

Duplicate entities mapped to same code point

This section details cases where two entities with different names have been mapped to the same code point. Such entities become indistinguishable to most XML applications, even if SGML applications could differentiate the original SGML SDATA entities.

In many cases these unifications are acceptable or intentional. For example, for reasons of convenience and compatibility, the mathematical Greek set ISOGRK3 is mapped to the standard Greek characters in the BMP (so clashing with the ISOGRK1 definitions) rather than to the Mathematical Italic alphabet in the 1Dxxx range. Similarly, the same character can have different logical uses in different scientific disciplines. In other cases, however, the duplication has occurred because it has not been possible to retain, within the ISO/IEC 10646 character set, distinctions foreseen in ISO/IEC TR 9573. During the review of this draft the need for duplication of these entities will be reviewed and resolved.
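Duplicate mappings of this kind can be found mechanically. The following Python sketch groups entities by code point and reports those that share one. The sample table is illustrative only: its entity names, set names and code values are assumptions made for the example and are not taken from the listings in this report, and the function name `find_duplicates` is hypothetical.

```python
from collections import defaultdict

# Illustrative sample only (assumed values, not copied from the listings):
# (entity set, entity name) -> five-digit hexadecimal code point.
SAMPLE = {
    ("ISOGRK1", "agr"): "003B1",
    ("ISOGRK3", "alpha"): "003B1",
    ("ISOTECH", "perp"): "022A5",
    ("ISOTECH", "bottom"): "022A5",
    ("ISONUM", "frac78"): "0215E",
}

def find_duplicates(mappings):
    """Return {code point: [entities]} for code points used by more than one entity."""
    by_code = defaultdict(list)
    for (set_name, entity), code in mappings.items():
        by_code[code].append(f"{entity} ({set_name})")
    return {code: sorted(names) for code, names in by_code.items() if len(names) > 1}

for code, names in sorted(find_duplicates(SAMPLE).items()):
    print(code, ", ".join(names))
```

Whether a reported clash is intentional (as with the Greek sets) or an unwanted loss of distinction still has to be judged case by case.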

Entity definitions starting with a combining character

It is generally a bad idea to start an entity definition with a combining character, as it makes normalisation dependent on the order of entity expansion. In the worst case the combining character can combine with the markup, resulting in a normalised form that is no longer well-formed XML.

In the current version of these entities, each such entity is defined to consist of a single space character (U+0020) followed by the combining character.
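The hazard, and the effect of the leading space, can be demonstrated with a short Python sketch. It assumes the standard `unicodedata` module; the element name `x` and the helper names `starts_with_combining` and `guard` are hypothetical, and COMBINING LONG SOLIDUS OVERLAY is chosen only because it composes with the > character under NFC.

```python
import unicodedata

def starts_with_combining(value: str) -> bool:
    """True if the first character has a non-zero canonical combining class."""
    return bool(value) and unicodedata.combining(value[0]) != 0

def guard(value: str) -> str:
    """Prefix a space (U+0020) when the value starts with a combining character."""
    return " " + value if starts_with_combining(value) else value

overlay = "\u0338"  # COMBINING LONG SOLIDUS OVERLAY

# Entity expanded directly after a start tag, then normalised to NFC:
# the overlay composes with the ">" that closes the tag.
merged = unicodedata.normalize("NFC", "<x>" + overlay + "</x>")

# The same expansion with the guarding space in front of the overlay.
safe = unicodedata.normalize("NFC", "<x>" + guard(overlay) + "</x>")

print(merged.startswith("<x>"), safe.startswith("<x>"))
```

In the unguarded case the > of the start tag is replaced by U+226F NOT GREATER-THAN, so the tag is no longer closed; with the space in place the markup survives normalisation.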

Possible new characters

There are several new characters planned for Unicode and ISO/IEC 10646 that may affect these definitions. See http://www.unicode.org/alloc/Pipeline.html.

In particular, jmath (ISOAMSO), which is currently defined to be j, could more usefully be defined to be the proposed character LATIN SMALL LETTER DOTLESS J at code point U+0237, and perp (ISOTECH) could be defined to use PERPENDICULAR at code point U+27C2, which would allow it to be distinguished from bottom (also in ISOTECH).

Character listings

Each character set is shown as a four-column table.

The first column gives the entity name. These names are as used in previous versions of this report and use the following abbreviation scheme:
- Prefixes
  l = left; r = right; u = up; d = down; h = horizontal; v = vertical;
  b = back, reversed;
  cu = curly;
  cw = clock-wise; aw = anti clock-wise;
  g = greater than; l = less than;
  n = negated;
  o = in circle;
  s = small, short;
  sq = square shaped;
  thk = thick;
  x = extended, long, big;
- Bodies
  ap = approx;
  arr = arrow; har = harpoon;
  pr = precedes; sc = succeeds;
  sub = subset; sup = superset;
- Suffixes
  b = boxed;
  f = filled, black, solid;
  e = single equals; E = double equals;
  hk = hook;
  s = slant;
  t = tail;
  v = variant;
  w = wavy, squiggly;
  2 = two of;
- Upper-case letter means doubled (or sometimes two of).
The second column gives the code points of the corresponding character as five-digit hexadecimal numbers; where there is more than one code point, the numbers are separated by a hyphen (-).
The third column gives a sample glyph representation of the character.
The fourth column gives the name of the character in two forms: first, the entity description as used in previous editions of this report; second (in uppercase), the name of the character as given in Unicode and ISO/IEC 10646. In the case of combinations with combining characters or variation selectors, the name of the base character is given in uppercase, followed by an indication (in lower case) of the variant form.
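As a minimal sketch of how the second column might be read by software, the following Python function converts a cell of hyphen-separated five-digit hexadecimal numbers into the corresponding character string. The function name `parse_code_points` and the sample cells are assumptions made for illustration.

```python
import string

def parse_code_points(cell: str) -> str:
    """Convert a cell such as '00020-00301' into the characters it denotes."""
    parts = cell.strip().split("-")
    for part in parts:
        # Each code point is expected to be exactly five hexadecimal digits.
        if len(part) != 5 or any(c not in string.hexdigits for c in part):
            raise ValueError(f"not a five-digit hexadecimal number: {part!r}")
    return "".join(chr(int(part, 16)) for part in parts)

single = parse_code_points("0215E")          # one code point
sequence = parse_code_points("00020-00301")  # space followed by a combining mark
print(len(single), len(sequence))
```

Rejecting cells that are not exactly five digits wide keeps a mistyped entry from silently mapping to the wrong character.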