HTML-Math Syntax.

Subsection: Introduction.

The "Character set" document defines "HTML-Math input characters" and describes how they are generated by the SGML parser. This document describes how these are further interpreted by the HTML-Math parser (which is distinct from the SGML parser), and in particular how they are grouped into subexpressions. How these subexpressions are then rendered is described in other sections, especially "Layout primitives" and "Extensibility".

In this document, "character" will mean "HTML-Math input character" unless otherwise specified.

Subsection: General nature of HTML-Math syntax: precedences, infix operators, specialized tags.

HTML-Math provides a precedence-based syntax with infix, prefix, or postfix operators for common kinds of expressions. This syntax imitates standard conventions for written or typeset math in many simple cases. Some operators are specific HTML-Math input characters or character sequences, and some are EMPTY SGML elements (consisting only of SGML begin tags).

There are also specialized SGML begin/end markup tags for certain expression types, such as [perhaps] sum or integral or certain layout primitives, and also for identifying certain expression types and for grouping. The details of these are discussed in "Markup tags". Regardless of the current Math Syntax Model (see below), all begin/end tag pairs are treated as grouping operators and are properly matched by the HTML-Math parser.

Rationale:

It is desirable for HTML-Math to be human-writable and readable using standard text editors, even though WYSIWYG editors are expected to be available. Therefore the syntax should be concise and easily viewed whenever practical. Furthermore, there is no reason not to aid human interpretation of HTML-Math by using conventional syntax where it exists and can reasonably be used.

Subsection: Tokenization, and character parsing types [proposal, not yet discussed; for now, only an overview is given here]:

Tokenization (by the lexical analysis phase of the HTML-Math parser) interprets any HTML-Math expression as a sequence of HTML-Math tokens. This interpretation is guided by the "parsing type" attribute of each character (described below). The kinds of HTML-Math tokens are:

SGML markup tags (begin, end, or empty) recognized by HTML-Math;
HTML elements which are allowed within HTML-Math expressions but whose tags are not specific to HTML-Math (e.g. links and anchors);
Operators, i.e. sequences of characters with the "operator" parsing type which have been specifically declared as operators in the syntax (by the standard or by an author extension), or which are one character long [or, I propose, which would otherwise be identifiers but which have been declared as operators];
Number literals;
Identifiers.

[I conjecture that grouping forms, such as a left or right parenthesis, can be considered operators, and therefore don't need a separate token type (or a separate parsing type for their characters). How these can be properly matched during parsing is discussed in the separate detailed proposal for parsing, also mentioned below in the subsection about precedence rules, which I will write up if we agree on the general goals expressed here.]

Any character (or SGML SDATA entity) that is legal in HTML, and not treated specially (as a token or entity delimiter) by SGML, can be part of an HTML-Math expression, and will be an "HTML-Math input character".

Every HTML-Math input character has a "parsing type" attribute which determines how the HTML-Math parser (as distinguished from the SGML parser) treats it during tokenization (lexical analysis).

HTML-Math defines a default parsing type for every character, but a document author can change this as described in "Extensibility".

[The set of parsing types remains to be discussed. The parsing types need to be sufficient to permit whitespace, identifiers, numbers, operators, and grouping forms (such as parentheses) to be properly tokenized.]

[The initial parsing type for every character, as well as the character set itself, also remains to be discussed; when it is, I'll refer to a table of all proposed characters and parsing types from both here and the "Character set" document.]

[I propose that whitespace is never significant except that characters separated by whitespace are never part of the same token. Rationale: this is an almost universal property of other syntaxes, and exceptions to it are often regarded as mistakes in hindsight; also it is hard to see how it could be violated in a way that could easily be used by general parsing rules (with the possible exception of allowing the choice of an inserted "missing term" or "missing operator" (see below) to depend on the nature of the whitespace, e.g. whether it contained newlines).]
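
As a rough illustration of the proposal above (not part of it), the following Python sketch tokenizes a string using a per-character parsing type. The parsing type names ("space", "letter", "digit", "operator"), the function names, and the `overrides` parameter standing in for an author's changes are all assumptions made for this example, since the actual set of parsing types remains to be discussed. Whitespace is never emitted as a token; it only prevents the characters on either side from joining the same token.

```python
# Hypothetical mapping from parsing types to token kinds (illustration only).
KINDS = {"digit": "number", "letter": "identifier", "operator": "operator"}


def parsing_type(ch, overrides=None):
    """Return an assumed default parsing type for one input character."""
    if overrides and ch in overrides:
        return overrides[ch]  # author-supplied change to the default
    if ch.isspace():
        return "space"
    if ch.isdigit():
        return "digit"
    if ch.isalpha():
        return "letter"
    return "operator"


def tokenize(text, overrides=None):
    """Split text into (kind, value) tokens; whitespace only separates tokens."""
    tokens = []
    i = 0
    while i < len(text):
        kind = parsing_type(text[i], overrides)
        if kind == "space":
            i += 1
            continue
        j = i + 1
        # Operators are kept one character long in this sketch; letter and
        # digit runs extend while the parsing type stays the same.
        if kind != "operator":
            while j < len(text) and parsing_type(text[j], overrides) == kind:
                j += 1
        tokens.append((KINDS[kind], text[i:j]))
        i = j
    return tokens
```

For example, `tokenize("a b")` yields two identifiers while `tokenize("ab")` yields one, and passing `overrides={"'": "letter"}` lets a prime character become part of an identifier.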

[Also to be discussed: the precise definitions of identifiers and number literals, and whether these definitions are author-modifiable.]

[I propose that there is a tokenization mode (which is an attribute value that can be set for any part of a document, by the Math Syntax Model -- see "Extensibility") which determines whether a sequence of identifier characters is treated as a sequence of one-character identifiers, or as one multi-character identifier. Another possibility is that individual characters can be declared as always being one-character identifiers via their parsing type attribute. These possibilities could both be the case.]
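
The two possibilities just described can coexist, as the following Python sketch suggests. It is only an illustration: the mode names "single" and "multi", the function name, and the `always_single` parameter are assumptions, not names taken from this proposal.

```python
def identifier_tokens(run, mode="multi", always_single=frozenset()):
    """Split a run of identifier characters into identifier tokens.

    mode "single": every character is its own identifier.
    mode "multi":  the run is one identifier, except that characters listed
                   in always_single (a hypothetical per-character
                   declaration) always stand alone.
    """
    if mode not in ("single", "multi"):
        raise ValueError("unknown tokenization mode: %r" % (mode,))
    tokens = []
    current = ""
    for ch in run:
        if mode == "single" or ch in always_single:
            if current:
                tokens.append(current)  # close the pending identifier
                current = ""
            tokens.append(ch)
        else:
            current += ch
    if current:
        tokens.append(current)
    return tokens
```

Under these assumptions, `identifier_tokens("sin")` gives one identifier, while `identifier_tokens("xyz", mode="single")` gives three.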

[The only part of tokenization that I don't yet have a detailed proposal for is the precise definition of a number literal, in case a number such as, e.g., "2.3" needs to be distinguished from, e.g., "x.y" where x and y are variables and the "." is an operator, so that these expressions could be rendered differently by some browsers.]
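
To make the "2.3" versus "x.y" distinction concrete, here is a small Python sketch under one possible definition of a number literal: digits, optionally followed by "." and more digits. That definition, the ASCII-only identifier rule, and the names `TOKEN` and `lex` are assumptions for illustration; the precise definition is exactly what remains open.

```python
import re

# Assumed definitions for this sketch only:
#   number literal: digits, optionally followed by "." and more digits
#   identifier:     a run of ASCII letters
#   operator:       any other single non-whitespace character
TOKEN = re.compile(
    r"\s*(?:(?P<number>\d+(?:\.\d+)?)"
    r"|(?P<identifier>[A-Za-z]+)"
    r"|(?P<operator>\S))"
)


def lex(text):
    """Return (kind, value) pairs for the given text."""
    text = text.rstrip()  # guarantees a token follows any remaining whitespace
    tokens = []
    pos = 0
    while pos < len(text):
        m = TOKEN.match(text, pos)
        tokens.append((m.lastgroup, m.group(m.lastgroup)))
        pos = m.end()
    return tokens
```

With this definition, `lex("2.3")` produces a single number token, whereas `lex("x.y")` produces an identifier, the operator ".", and another identifier, so a renderer could treat the two cases differently.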

Subsection: Formation of subexpressions, guided by explicit markup or by operator precedences and associativities

The HTML-Math parser (after lexical analysis) turns a sequence of HTML-Math tokens (as defined above) into a subexpression tree, for subsequent macro expansion and rendering.

The syntax used is specified by the applicable Math Syntax Model (see "Extensibility"). This is initially a default syntax specified by the HTML-Math standard, but can be incrementally extended or replaced by authors [and perhaps by viewers, though this has no known purpose and is not recommended], for either all or part of a document.

[The details of the syntax specification are part of the proposal I'll provide later, but it will at least be possible to specify operators having infix, prefix, and/or postfix forms, and for each form, a precedence and an associativity ("left", "right", or "flat", meaning adjacent copies of that operator group like {{a+b}+c}, {a+{b+c}}, or {a+b+c} (i.e. only as one large expression), respectively).]
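
The three associativities can be illustrated with a small precedence-climbing parser in Python. This is a sketch under simplifying assumptions, not the parsing algorithm promised above: it handles only infix operators between atomic terms, it assumes the token sequence is well formed, and the `ops` table format (operator mapped to a precedence and an associativity) is invented for the example. Subexpressions are shown as nested lists.

```python
def parse(tokens, ops):
    """Parse atoms separated by infix operators into nested lists.

    ops maps each operator to (precedence, associativity), where the
    associativity is "left", "right", or "flat".
    """
    pos = 0

    def parse_expr(min_prec):
        nonlocal pos
        left = tokens[pos]  # this sketch assumes a term is present here
        pos += 1
        while (pos < len(tokens) and tokens[pos] in ops
               and ops[tokens[pos]][0] >= min_prec):
            op = tokens[pos]
            prec, assoc = ops[op]
            pos += 1
            if assoc == "right":
                # The right operand may absorb further copies of op.
                left = [left, op, parse_expr(prec)]
            elif assoc == "left":
                # The right operand stops before the next copy of op.
                left = [left, op, parse_expr(prec + 1)]
            else:
                # "flat": adjacent copies of op share one subexpression.
                node = [left, op, parse_expr(prec + 1)]
                while pos < len(tokens) and tokens[pos] == op:
                    pos += 1
                    node += [op, parse_expr(prec + 1)]
                left = node
        return left

    return parse_expr(0)
```

Parsing the tokens of "a+b+c" with "+" declared as left, right, or flat gives [[a, +, b], +, c], [a, +, [b, +, c]], or [a, +, b, +, c] respectively, matching the three groupings listed above.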

Regardless of the syntax rules being used, the subexpression-forming parser has two important properties:

(1) Any sequence of HTML-Math tokens is legal (provided the begin/end SGML markup tags are properly matched, as required by the DTD) -- parsing never fails, but instead, "missing terms" or "missing infix operators" (which are specific SGML empty elements) are inserted into the subexpression tree. The precedence of the "missing infix operator" (as well as the rendering rules for it, or any other operator) can be author-modified. The missing term or operator insertions have the property that, if the expression were re-parsed with any subset of these newly added terms and operators retained, the same subexpression structure would be derived (barring differences caused by the same operator having more than one legitimate form, e.g. for "-" in the standard math syntax or "++" in the C programming language). Conversely (with the same caveat), if an expression is modified by removing some of its terms, and some of its infix operators with the same precedence as the "missing infix operator", it will parse with the same subexpression structure. (This means that the structure generated by parsing will "degrade gracefully" in the presence of missing terms or operators.)
[The missing term could be , or , or .... The missing infix operator could be , or , or , or , or ....]
Rationales for (1):
1. It is necessary to specify what the browser does for any HTML document conforming to the DTD, but the DTD is incapable of expressing most kinds of syntax rules (even if they were not author-extensible). Insertion of missing terms and missing infix operators is a reasonable solution to the problem of what to do with expressions that would otherwise not be parsable (especially since the viewer can customize the rendering rules for these). Even if such expressions are considered errors, much more information about them can be derived this way than if, say, the entire expression were not rendered at all and were replaced with some kind of error message.
2. During interactive editing of expressions, it is common for an intermediate expression to be missing some characters. It is desirable that these intermediate expressions still have a specified rendering. Furthermore, it is desirable that an HTML-Math document remain legal, savable, and transmittable when it contains expressions still being written.
3. Multiplication is often expressed as a "missing infix operator". Even when it is rendered as simple concatenation of the multiplicands, it needs a precedence in order for all subexpressions to be identified by parsing. Furthermore, it is desirable for rendering rules to be able to treat concatenation of terms as an operation like any other, and to change the way this operation is rendered.
[I will provide later a detailed proposal for the kind of syntax rules that can be allowed, and a parsing algorithm that uses them, while preserving property (1). It turns out that this goal is compatible with very general parsing rules. For now, I want us to agree on the reasonableness of this goal, assuming that it's achievable (or, if we don't agree, to change it).]
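
To give a feel for property (1), the following Python sketch repairs a token sequence by inserting placeholders. It is a deliberately simplified illustration and not the promised algorithm: it treats every operator as infix only (no prefix or postfix forms), and the placeholder names `<mterm>` and `<mop>` are hypothetical stand-ins, since this document does not fix the element names.

```python
MISSING_TERM = "<mterm>"  # hypothetical name for the "missing term" element
MISSING_OP = "<mop>"      # hypothetical name for the "missing infix operator"


def repair(tokens, infix_ops):
    """Insert placeholders so that terms and infix operators alternate."""
    out = []
    expect_term = True
    for tok in tokens:
        is_op = tok in infix_ops
        if expect_term and is_op:
            out.append(MISSING_TERM)  # an operator arrived where a term was due
        elif not expect_term and not is_op:
            out.append(MISSING_OP)    # two adjacent terms: implied operator
        out.append(tok)
        expect_term = is_op
    if expect_term:
        out.append(MISSING_TERM)      # trailing operator, or empty input
    return out
```

Repairing an already repaired sequence (with `<mop>` counted among the infix operators) changes nothing, and dropping an inserted placeholder and repairing again restores it, which mirrors the "degrade gracefully" behaviour described in (1) for this restricted case.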
(2) SGML markup can be added to any HTML-Math expression (in a process called "fully expanding" the expression) in a way which doesn't change the results of parsing, but such that every subexpression is a separate SGML element.
Rationale: giving authors this option facilitates the use of existing code for processing SGML documents, which can apply rewrite rules to SGML elements. Also, filters which fully expand HTML-Math expressions (by implementing the HTML-Math parsing algorithm) can be written, which will facilitate use of SGML processing tools on HTML-Math from any source.
[The details of the fully expanded markup remain to be discussed. One possibility is: surround each
```
                number literal          with            <mn>...</mn>,
each            operator                with            <mo>...</mo>,

each            identifier which is not an operator
                (and also, perhaps, each HTML element allowed in HTML-Math)
                
                                        with            <mi>...</mi>,

and each        nonatomic subexpression (and also, provided it's nonatomic,
                the whole expression)
                                        with            <me>...</me>.
```
This will allow SGML-based processing tools to recognize particular expression types, provided that they can recognize patterns such as, e.g., <me> SE1 <mo>+</mo> SE2 </me>, where SE1 and SE2 are unknown elements which are parameters of the pattern. This pattern matches expressions of the (unexpanded) form "<me>...+...</me>", and a tool can operate on them with reference to the specific subelements which occur in place of SE1 and SE2 in each instance.]
In the remaining discussion I'll assume for definiteness that the proposal sketched in []s above is the system for fully expanding an expression.
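
Under that assumption, a filter that fully expands an already parsed expression might look like the following Python sketch. The nested-list tree representation, the function name `expand`, and the rule that an atom starting with a digit is a number literal are assumptions made for the example; only the <mn>, <mo>, <mi>, and <me> element names come from the proposal above.

```python
from html import escape


def expand(node, operators):
    """Return fully expanded markup for a parsed subexpression tree.

    A list is a nonatomic subexpression; a string is a single token.
    """
    if isinstance(node, list):
        inner = "".join(expand(child, operators) for child in node)
        return "<me>" + inner + "</me>"
    text = escape(node)  # keep characters such as "<" legal in the markup
    if node in operators:
        return "<mo>" + text + "</mo>"
    if node[0].isdigit():  # assumed test for a number literal
        return "<mn>" + text + "</mn>"
    return "<mi>" + text + "</mi>"
```

For instance, the tree for "a + 2*b", written as `["a", "+", ["2", "*", "b"]]`, expands so that each nonatomic subexpression is wrapped in its own <me> element, which is what makes <me> usable as invisible parentheses.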
Any subset of subexpressions in any HTML-Math expression can be "fully expanded" by the author, by adding the markup mentioned above, without changing the fully expanded form of the expression, and therefore, without changing the rendering in any way.
Note that the above implies that <me>...</me> can be used as "invisible parentheses" which affect grouping by the parser, but have no other direct effect. [It is possible that { and } will be SHORTREF forms for <me> and </me> respectively; see "Markup tags".]
Note that the HTML-Math DTD does not restrict the contents of <mn>...</mn> or <mi>...</mi> or <mo>...</mo> any more than of <me>...</me> or any other HTML-Math element with begin/end tags. This allows these markup tags to be used more flexibly in macro rules -- specifically, it makes it possible for their contents to be supplied to a macro formal parameter, or generated from one (see "Macro processing").
The actual meaning of an element like <mn>...</mn> is that the resulting expression will be treated as a number literal by all stages after parsing (i.e. by the macro processor and the renderer) regardless of its contents. Thus, authors can add "nonstandard" full-expansion markup as a way of overriding the HTML-Math tokenizer's or parser's normal interpretations of HTML-Math text; e.g. <mi>55</mi> would fully expand to itself, and be treated as an identifier in rendering, even though 55 alone would fully expand to <mn>55</mn> and be treated as a number, due to the ordinary rules for interpreting certain character sequences as number literal tokens.
(end of Syntax document)