Character Set

Subsection: Character sets at various stages of processing

Four different sets of characters are (in principle) relevant to HTML-Math, corresponding to four stages in the processing of an HTML-Math document. Each set is generated from the previous one by a different process (mentioned in parentheses below). I'll refer to the characters used in these stages as follows [let me know if there are standard names]:

(A computer file system or communications link provides:)

File characters -- the characters which are directly stored in a document's computer file;

(These are lexically analyzed according to SGML to produce:)

Document characters -- the characters generated by the SGML parser reading the file, after it interprets the SGML declaration's CHARSET declarations but before it interprets markup or entity references (these characters are always [I think] in one of the BASESETs mentioned in the CHARSET declarations);

(These are further parsed by an SGML parser to produce some markup, and some:)

HTML-Math input characters -- the document characters not interpreted as markup (either normally or due to SHORTREFs) by the SGML parser, and in addition, the characters represented by SDATA entity references;

(When these are part of an HTML-Math expression, they and the associated markup are parsed by HTML-Math to produce rendered output containing:)

HTML-Math rendered characters -- the character symbols that can be generated as output by HTML-Math rendering.

This document discusses everything up to the "HTML-Math input characters". The nature and operation of the HTML-Math parser (as distinct from the SGML parser), including everything about the HTML-Math input character attributes which affect it, are discussed in the "Syntax" document.

Subsection: Overview

HTML-Math transforms its input into HTML-Math input characters in exactly the same way as HTML itself, except that

HTML-Math defines some SGML markup tags;
HTML-Math defines some SGML SHORTREF characters; and
HTML-Math defines some sets of extended characters represented by SGML SDATA entities.

This means that the HTML-Math input characters are generated either by the normal mechanisms of SGML or by whatever subset of them is accepted by the HTML standard. The details of how this is done are part of the larger HTML standard and are not described by this HTML-Math standard. However, I will assume in these documents that the characters '<' and '>' are always required to delimit SGML markup tags, that '&' and ';' are always required to delimit SGML entity references, and that all other "document characters" are turned directly into HTML-Math input characters, except for the HTML-Math SHORTREF forms specifically listed below.

[Since '>' and ';' are only used as ending delimiters, these document characters could in principle also represent themselves as HTML-Math input characters (according to my SGML book, by Maler et al.), but I don't know whether this is acceptable to HTML browsers or in the HTML standard; in all examples I'll assume it's not.]
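The delimiter assumption above can be illustrated with a minimal Python sketch. It is not part of any HTML or SGML implementation: the function name `split_document_characters`, the token labels, and the entity-name pattern are all hypothetical, and SHORTREF forms are deliberately ignored. It follows the stricter assumption that a stray '>' or ';' is not allowed to represent itself.

```python
import re
from typing import Iterator, Tuple

# Hypothetical token pattern (an assumption for illustration only):
#   tag    : '<' ... '>' delimits a markup tag
#   entity : '&' name ';' delimits an entity reference
#   char   : any other document character passes through unchanged
TOKEN_RE = re.compile(
    r"(?P<tag><[^<>]*>)"
    r"|(?P<entity>&[A-Za-z][A-Za-z0-9.]*;)"
    r"|(?P<char>[^<>&;])"
)

def split_document_characters(text: str) -> Iterator[Tuple[str, str]]:
    """Yield (kind, text) pairs; kind is 'tag', 'entity' or 'char'."""
    pos = 0
    while pos < len(text):
        match = TOKEN_RE.match(text, pos)
        if match is None:
            # A delimiter character appeared outside a tag or entity reference.
            raise ValueError(f"stray delimiter {text[pos]!r} at offset {pos}")
        yield match.lastgroup, match.group()
        pos = match.end()
```

Under these assumptions, `list(split_document_characters("x<b>&amp;"))` gives one 'char' token, one 'tag' token and one 'entity' token, while an input such as `"a>b"` is rejected because '>' appears outside a tag.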

Subsection: Markup tags

These are discussed in the "Markup tags" document.

Subsection: Shortref forms

HTML-Math defines certain characters as SHORTREF forms, that is, as abbreviations for SGML markup; the complete list of these forms is:

[To be discussed, but very small; candidates include '{' and '}' for grouping, and '_' and '^' for subscripts and superscripts, and no others that I'm aware of. Furthermore, it's possible that these meanings could be achieved through HTML-Math macros or operator syntax rather than requiring SHORTREF forms (see the "Extensibility" and "Syntax" documents). If so, that would be preferable to using SHORTREF forms, since macros and operator syntax can be customized by authors.]

Subsection: Extended characters

SGML uses "SDATA entities" to allow documents to refer to characters in extended character sets beyond the "document characters" (as defined above). The set of allowed entities is a property of the DTD. An entity is referenced by "&name;" where "name" is the entity name.
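As a rough illustration of the "&name;" reference form, the following Python sketch replaces each reference with a character looked up in a table. Everything in it is an assumption made for the example: the table `ENTITIES`, its three entries, the use of Unicode strings as stand-ins for SDATA replacement text, and the function name `resolve_entities`. In a real system the allowed names would come from the DTD.

```python
import re

# Hypothetical entity table (an assumption, not the HTML-Math entity set).
# Unicode characters stand in for the SDATA replacement text.
ENTITIES = {"alpha": "\u03b1", "beta": "\u03b2", "infin": "\u221e"}

# '&' followed by a name and a closing ';'
ENTITY_REF = re.compile(r"&([A-Za-z][A-Za-z0-9.]*);")

def resolve_entities(text, table=ENTITIES):
    """Replace every &name; in text with the character declared for name."""
    def lookup(match):
        name = match.group(1)
        if name not in table:
            # Only entities declared in the table (standing in for the DTD) are allowed.
            raise KeyError(f"entity {name!r} is not declared")
        return table[name]
    return ENTITY_REF.sub(lookup, text)
```

With this table, `resolve_entities("&alpha;+&beta;")` yields the two Greek letters joined by '+', and a reference to an undeclared name raises `KeyError`.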

HTML-Math defines a large set of character entities, intended to cover almost all the symbols presently found in mathematical notation, and sometimes including several standard abbreviations for one character.

These characters are divided into classes, so that several "levels of compliance" are available to a browser. There is a minimum set of characters that any HTML-Math-compliant browser must support, and several additional levels which can be optionally supported.

To be discussed:

Should HTML-Math provide mechanisms, beyond whichever ones are added to HTML as a whole, for:
- author extension of the character set?
- downloadable characters or fonts?
- declaration of characters used in a document (to save downloading time)?

I propose that it should not, since these issues are difficult, and are really issues for HTML in general rather than specifically math-related. I hope they will be addressed someday as part of HTML "internationalization", and I assume HTML-Math will automatically incorporate HTML character-set extensions into its own treatment of characters. The internal workings of HTML-Math code should not depend on the character set, apart from its character data type and its methods for SGML-parsing and rendering of extended characters. For HTML-Math parsing attributes (see the "Syntax" document), any unknown character can be assumed to be an "identifier" character unless declared otherwise.

How many options should be available to a browser in terms of which characters it supports?

WRI suggests that there be 2 to 4 specific "levels", each designating a specific set of characters to be supported, with each level a strict subset of the next.

It's also possible that the characters would be divided into sets which could be individually supported or not by a given browser, or even that such sets could be customizable for each browser. To us at WRI, this much flexibility does not seem worth the trouble, in view of the present realities about how character sets are actually supported in software, and the hoped-for future extension of HTML as a whole to address this kind of problem more generally.
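The nested-levels suggestion can be modeled with ordinary sets. The Python sketch below is only an illustration under stated assumptions: the three levels in `LEVELS` and the entity names in them are placeholders, not a proposed division of the character set, and both function names are hypothetical.

```python
# Hypothetical levels (placeholders, not a proposal): each level is the set
# of entity names a browser at that level must support.
LEVELS = [
    frozenset({"alpha", "beta"}),
    frozenset({"alpha", "beta", "infin"}),
    frozenset({"alpha", "beta", "infin", "aleph"}),
]

def levels_are_nested(levels):
    """True if every level is a strict subset of the next one."""
    return all(lower < higher for lower, higher in zip(levels, levels[1:]))

def minimum_level(used, levels=LEVELS):
    """Return the lowest 1-based level covering all names in used, or None."""
    used = set(used)
    for number, supported in enumerate(levels, start=1):
        if used <= supported:
            return number
    return None
```

Because the levels are nested, a document's requirements reduce to a single number: with the placeholder table, a document using only "infin" needs level 2, and a document using a name found in no level gets `None`.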