The "Character set" document defines "HTML-Math input characters" and describes how they are generated by the SGML parser. This document describes how these are further interpreted by the HTML-Math parser (which is distinct from the SGML parser), and in particular how they are grouped into subexpressions. How these subexpressions are then rendered is described in other sections, especially "Layout primitives" and "Extensibility".
In this document, "character" will mean "HTML-Math input character" unless otherwise specified.
Subsection: General nature of HTML-Math syntax: precedences, infix operators, specialized tags.
HTML-Math provides a precedence-based syntax with infix, prefix, or postfix operators for common kinds of expressions. This syntax imitates standard conventions for written or typeset math in many simple cases. Some operators are specific HTML-Math input characters or character sequences, and some are EMPTY SGML elements (consisting only of SGML begin tags).
There are also specialized SGML begin/end markup tags for certain expression types, such as [perhaps] sum or integral or certain layout primitives, and also for identifying certain expression types and for grouping. The details of these are discussed in "Markup tags". Regardless of the current Math Syntax Model (see below), all begin/end tag pairs are treated as grouping operators and are properly matched by the HTML-Math parser.
It is desirable for HTML-Math to be human-writable and readable using standard text editors, even though WYSIWYG editors are expected to be available. Therefore the syntax should be concise and easily viewed whenever practical. Furthermore, there is no reason not to aid human interpretation of HTML-Math by using conventional syntax where it exists and can reasonably be used.
Subsection: Tokenization, and character parsing types [proposal, not yet discussed; for now, only an overview is given here]:
Tokenization (by the lexical analysis phase of the HTML-Math parser) interprets any HTML-Math expression as a sequence of HTML-Math tokens. This interpretation is guided by the "parsing type" attribute of each character (described below). The kinds of HTML-Math tokens are:
[I conjecture that grouping forms, such as a left or right parenthesis, can be considered operators, and therefore don't need a separate token type (or a separate parsing type for their characters). How these can be properly matched during parsing is discussed in the separate detailed proposal for parsing, also mentioned below in the subsection about precedence rules, which I will write up if we agree on the general goals expressed here.]
Any character (or SGML SDATA entity) that is legal in HTML, and not treated specially (as a token or entity delimiter) by SGML, can be part of an HTML-Math expression, and will be an "HTML-Math input character".
Every HTML-Math input character has a "parsing type" attribute which determines how the HTML-Math parser (as distinguished from the SGML parser) treats it during tokenization (lexical analysis).
HTML-Math defines a default parsing type for every character, but a document author can change this as described in "Extensibility".
[The set of parsing types remains to be discussed. The parsing types need to be sufficient to permit whitespace, identifiers, numbers, operators, and grouping forms (such as parentheses) to be properly tokenized.]
[The initial parsing type for every character, as well as the character set itself, also remains to be discussed; when it is, I'll refer to a table of all proposed characters and parsing types from both here and the "Character set" document.]
[I propose that whitespace is never significant, except that characters separated by whitespace are never part of the same token. Rationale: this is an almost universal property of other syntaxes, and exceptions to it are often regarded as mistakes in hindsight. It is also hard to see how general parsing rules could easily make use of a violation of it, with one possible exception: the choice of an inserted "missing term" or "missing operator" (see below) could be allowed to depend on the nature of the whitespace, e.g. whether it contained newlines.]
[Also to be discussed: the precise definitions of identifiers and number literals, and whether these definitions are author-modifiable.]
[I propose that there is a tokenization mode (which is an attribute value that can be set for any part of a document, by the Math Syntax Model -- see "Extensibility") which determines whether a sequence of identifier characters is treated as a sequence of one-character identifiers, or as one multi-character identifier. Another possibility is that individual characters can be declared as always being one-character identifiers via their parsing type attribute. These possibilities could both be the case.]
[The only part of tokenization for which I don't yet have a detailed proposal is the precise definition of a number literal. This matters if a number such as "2.3" needs to be distinguished from an expression such as "x.y", where x and y are variables and the "." is an operator, so that the two could be rendered differently by some browsers.]
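[The tokenization rules proposed above can be sketched in Python as follows. This is only an illustration under stated assumptions, not the proposed algorithm. The number-literal rule (digits with an optional fractional part), the restriction of identifier characters to ASCII letters, the treatment of every other non-space character as an operator, and the names TOKEN_RE, tokenize, and multi_char_identifiers are all hypothetical. The sketch shows whitespace acting only as a token separator, the two identifier tokenization modes, and the "2.3" versus "x.y" distinction.]

```python
import re

# Hypothetical token pattern. The number-literal rule below is an assumption;
# the document leaves the precise definition open.
TOKEN_RE = re.compile(r"""
    (?P<number>\d+(?:\.\d+)?)   # assumed number literal, such as 2.3
  | (?P<ident>[A-Za-z]+)        # run of (assumed) identifier characters
  | (?P<op>[^\sA-Za-z0-9])      # any other non-space character: an operator
""", re.VERBOSE)


def tokenize(text, multi_char_identifiers=True):
    """Split text into (kind, value) tokens.

    Whitespace is never part of a token; it only separates tokens.
    """
    tokens = []
    for match in TOKEN_RE.finditer(text):
        kind, value = match.lastgroup, match.group()
        if kind == "ident" and not multi_char_identifiers:
            # One-character-identifier mode: "xy" becomes x followed by y.
            tokens.extend(("ident", ch) for ch in value)
        else:
            tokens.append((kind, value))
    return tokens
```

[Under these assumptions, tokenize("2.3") yields a single number token, while tokenize("x.y") yields an identifier, the operator ".", and another identifier. With multi_char_identifiers=False, "ab" is read as the two identifiers a and b.]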
The HTML-Math parser (after lexical analysis) turns a sequence of HTML-Math tokens (as defined above) into a subexpression tree, for subsequent macro expansion and rendering.
The syntax used is specified by the applicable Math Syntax Model (see "Extensibility"). This is initially a default syntax specified by the HTML-Math standard, but can be incrementally extended or replaced by authors [and perhaps by viewers, though this has no known purpose and is not recommended], for either all or part of a document.
[The details of the syntax specification are part of the proposal I'll provide later, but it will at least be possible to specify operators having infix, prefix, and/or postfix forms, and for each form, a precedence and an associativity ("left", "right", or "flat", meaning adjacent copies of that operator group like {{a+b}+c}, {a+{b+c}}, or {a+b+c} (i.e. only as one large expression), respectively).]
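[To make the three associativities concrete, here is a minimal precedence-climbing sketch in Python. It is not the parsing algorithm I will propose; it handles only infix operators and single-character tokens, and it does not handle missing terms or grouping forms. The operator table INFIX, with its particular operators, precedences, and associativities, is an illustrative assumption, as are the names parse, show, and group.]

```python
# Hypothetical operator table: operator -> (precedence, associativity).
# These entries are illustrative assumptions, not part of the proposal.
INFIX = {
    ",": (5, "flat"),
    "+": (10, "left"),
    "^": (30, "right"),
}


def parse(tokens, min_prec=0):
    """Parse a non-empty token list; return (tree, remaining_tokens)."""
    node, rest = tokens[0], tokens[1:]          # the first token is an atom
    while rest and rest[0] in INFIX and INFIX[rest[0]][0] >= min_prec:
        op = rest[0]
        prec, assoc = INFIX[op]
        # A right-associative operator lets its right operand absorb further
        # copies of itself; left and flat operators do not.
        next_min = prec if assoc == "right" else prec + 1
        right, rest = parse(rest[1:], next_min)
        if assoc == "flat" and isinstance(node, list) and node[1] == op:
            node = node + [op, right]           # extend the one large group
        else:
            node = [node, op, right]
    return node, rest


def show(node):
    """Render a tree, marking each nonatomic subexpression with { }."""
    if isinstance(node, str):
        return node
    return "{" + "".join(show(part) for part in node) + "}"


def group(text):
    """Parse a string of single-character tokens and show its grouping."""
    tree, rest = parse(list(text.replace(" ", "")))
    if rest:
        raise ValueError("unparsed input: " + "".join(rest))
    return show(tree)
```

[With this table, group("a+b+c") gives "{{a+b}+c}" (left), group("a^b^c") gives "{a^{b^c}}" (right), and group("a,b,c") gives "{a,b,c}" (flat).]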
Regardless of the syntax rules being used, the subexpression-forming parser has two important properties:
[The missing term could be Rationales for (1):
[I will provide later a detailed proposal for the kind of syntax
rules that can be allowed, and a parsing algorithm that uses them,
while preserving property (1). It turns out that this goal is
compatible with very general parsing rules. For now, I want us to
agree on the reasonableness of this goal, assuming that it's
achievable (or, if we don't agree, to change it).]
Rationale: giving authors this option facilitates the use of
existing code for processing SGML documents, which can apply rewrite
rules to SGML elements. Also, filters which fully expand HTML-Math
expressions (by implementing the HTML-Math parsing algorithm) can
be written, which will facilitate use of SGML processing tools on
HTML-Math from any source.
[The details of the fully expanded markup remain to be discussed.
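One possibility is: surround each
   number literal with <mn>...</mn>,
   each operator with <mo>...</mo>,
   each identifier which is not an operator
     (and also, perhaps, each HTML element allowed in HTML-Math)
     with <mi>...</mi>,
   and each nonatomic subexpression (and also, provided it's nonatomic,
     the whole expression)
     with <me>...</me>.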
This will allow SGML-based processing tools to recognize particular
expression types, provided that they can recognize patterns such
as <me> SE1 <mo>+</mo> SE2 </me>, where SE1 and SE2 are unknown
elements which are parameters of the pattern. This pattern matches
expressions of the (unexpanded) form "<me>...+...</me>". The tools
can then operate on such expressions with reference to the specific
subelements which occur in place of SE1 and SE2 in each instance.]
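[As an illustration of this kind of pattern recognition, here is a
minimal Python sketch using the standard xml.etree.ElementTree module.
It assumes, for simplicity, that the fully expanded markup happens to
be well-formed XML, which the proposal does not require. The function
name match_binary and the sample expression are hypothetical.]

```python
import xml.etree.ElementTree as ET


def match_binary(element, operator):
    """Match the pattern <me> SE1 <mo>operator</mo> SE2 </me>.

    Returns the pair (SE1, SE2) of subelements, or None if no match.
    """
    children = list(element)
    if element.tag != "me" or len(children) != 3:
        return None
    left, op, right = children
    if op.tag == "mo" and (op.text or "").strip() == operator:
        return left, right
    return None


# Assumed fully expanded form of an expression like "a*b + 2".
expr = ET.fromstring(
    "<me><me><mi>a</mi><mo>*</mo><mi>b</mi></me><mo>+</mo><mn>2</mn></me>"
)
se1, se2 = match_binary(expr, "+")
```

[Here se1 is the inner <me> element and se2 is the <mn> element; a
tool could then process each of them further, for instance by applying
match_binary(se1, "*") to the left operand.]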
In the remaining discussion I'll assume for definiteness that the
proposal sketched in []s above is the system for fully expanding
an expression.
Any subset of subexpressions in any HTML-Math expression can be
"fully expanded" by the author, by adding the markup mentioned
above, without changing the fully expanded form of the expression,
and therefore, without changing the rendering in any way.
Note that the above implies that <me>...</me> can be used as
"invisible parentheses" which affect grouping by the parser,
but have no other direct effect. [It is possible that { and }
will be SHORTREF forms for <me> and </me> respectively; see
"Markup tags".]
Note that the HTML-Math DTD does not restrict the contents of
<mn>...</mn> or <mi>...</mi> or <mo>...</mo> any more than of
<me>...</me> or any other HTML-Math element with begin/end tags.
This allows these markup tags to be used more flexibly in macro
rules -- specifically, it makes it possible for their contents to
be supplied to a macro formal parameter, or generated from one (see
"Macro processing").
The actual meaning of an element like <mn>...</mn> is that the
resulting expression will be treated as a number literal by all
stages after parsing (i.e. by the macro processor and the renderer)
regardless of its contents. Thus, authors can add "nonstandard"
full-expansion markup as a way of overriding the HTML-Math tokenizer's
or parser's normal interpretations of HTML-Math text; e.g. <mi>55</mi>
would fully expand to itself, and be treated as an identifier in
rendering, even though 55 alone would fully expand to <mn>55</mn>
and be treated as a number, due to the ordinary rules for interpreting
certain character sequences as number literal tokens.
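[The override behavior just described can be sketched for single
atomic tokens as follows. This is a hedged illustration only: the
number-literal rule, the default operator set, and the names
NUMBER_RE, ALREADY_MARKED_RE, and expand_atom are assumptions, and
the sketch ignores nonatomic subexpressions and <me>...</me>.]

```python
import re

# Assumed number-literal rule (digits with an optional fractional part);
# the actual definition is left open by the document.
NUMBER_RE = re.compile(r"\d+(?:\.\d+)?")
# Text already wrapped by the author in <mn>, <mi>, or <mo>.
ALREADY_MARKED_RE = re.compile(r"<(mn|mi|mo)>.*</\1>", re.DOTALL)


def expand_atom(text, operators=("+", "-", "*", "/", "=")):
    """Wrap one atomic token in <mn>, <mo>, or <mi>, unless already wrapped."""
    text = text.strip()
    if ALREADY_MARKED_RE.fullmatch(text):
        return text                    # author-supplied markup is kept as is
    if NUMBER_RE.fullmatch(text):
        return f"<mn>{text}</mn>"
    if text in operators:              # hypothetical operator set
        return f"<mo>{text}</mo>"
    return f"<mi>{text}</mi>"
```

[Under these assumptions, expand_atom("55") returns "<mn>55</mn>",
while expand_atom("<mi>55</mi>") returns its argument unchanged, so
applying the function twice gives the same result as applying it once.]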
(end of Syntax document)
number literal with <mn>...</mn>,
each operator with <mo>...</mo>,
each identifier which is not an operator
(and also, perhaps, each HTML element allowed in HTML-Math)
with <mi>...</mi>,
and each nonatomic subexpression (and also, provided it's nonatomic,
the whole expression)
with <me>...</me>.