XML in C

Status

Personal thoughts on what the XML syntax should be. Compare with my earlier notes.

Abstract

XML is a base-language for expressing arbitrary structured data in text form. It consists of several modules: core syntax, meta-syntax, linking, style-bindings, and maybe more. Of these only the core syntax is common to all XML applications. Applications can choose to omit the other modules if they don't need them.

This text describes one possible core syntax, using flex/bison specifications. The most important additions relative to the XML-lang draft of 30 June 1997 include: automatically ignored newlines, attribute defaults, and boolean attributes.

Why this specification?

Split the core syntax and the meta-syntax

The core syntax of XML specifies a very general language within which all XML applications have to stay. Most applications will want to restrict this language, and XML provides an optional module with a meta-syntax for writing those restrictions. The restrictions are often referred to as a DTD, Document Type Definition, after SGML, where this term was introduced.

In the XML-lang draft of 30 June 1997 the core syntax and the meta-syntax are combined into a single draft. The linking module and the planned style-sheet binding modules are kept in separate drafts. There are good reasons for keeping the core syntax and the meta-syntax separate as well:

For consistency.
Because the current meta-syntax is not very good and there are other proposals in separate documents (e.g., XML-data).
Because it makes the draft easier to read.
To allow one to be changed without the other.

"RE delenda est"

This Latin phrase means "RE is to be deleted." It refers to a rule from SGML that specifies in which contexts an "RE" (Record-End, SGML-speak for a newline) is to be ignored. The precise rules in SGML are very complicated, but in general a newline is ignored after a start tag and before an end tag. This allows SGML documents to be somewhat pretty-printed, by starting tags on a new line.

XML also has start and end tags, but none of the exceptions of SGML, so the "RE delenda rule" can be applied without any problems.

In fact, looking at how people write XML and HTML, the rule could be generalized a bit, to say that a newline before any "<" and after any ">" is to be ignored, whether that "<" is part of a start tag or not.

There is a lot of confusion over this issue. The first applications that are based on XML seem to assume that not just one newline is ignored, but that all whitespace, even multiple lines, is to be ignored. While this allows even more "pretty printing", it also means that a lot of meaningful spaces have to be escaped (as &#32;).

The 30 June draft of XML on the other hand says that no whitespace is to be ignored, not even a single newline.

Most people meanwhile seem to agree that ignoring one newline is a good compromise. It allows tags to be put on separate lines, while not requiring meaningful whitespace to be escaped. That is therefore what the syntax below describes.
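
As an illustration of the compromise described above, here is a minimal sketch in C of the generalized rule: drop one newline directly before each "<" and one directly after each ">". It is not taken from the scanner itself. The function name strip_markup_newlines is hypothetical, and the sketch deliberately ignores the case of a ">" inside a quoted attribute value.

```c
#include <assert.h>
#include <stddef.h>

/* Length of the newline starting at s: 2 for CR LF, 1 for a lone CR or LF,
 * 0 if s does not start with a newline. */
static size_t nl_len(const char *s)
{
    if (s[0] == '\r')
        return s[1] == '\n' ? 2 : 1;
    if (s[0] == '\n')
        return 1;
    return 0;
}

/* Copy in to out, dropping one newline directly before each '<' and one
 * directly after each '>'. out must have room for strlen(in) + 1 bytes.
 * Returns the number of characters written, not counting the final '\0'.
 * Hypothetical helper, for illustration only. */
size_t strip_markup_newlines(const char *in, char *out)
{
    char *start = out;

    assert(in != NULL && out != NULL);
    while (*in) {
        size_t n = nl_len(in);
        if (n > 0 && in[n] == '<') {
            in += n;                /* newline just before '<': ignore it */
            continue;
        }
        *out++ = *in;
        if (*in++ == '>')
            in += nl_len(in);       /* newline just after '>': ignore it */
    }
    *out = '\0';
    return (size_t)(out - start);
}
```

With this rule, only the newline adjacent to the mark-up disappears. A blank line between "c" and "</b>" still leaves one newline in the data, so meaningful whitespace does not have to be escaped.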

Default attributes

Especially if an application uses the linking module, it will benefit a lot from being able to specify defaults for attributes. The 30 June draft relies on the meta-syntax to provide default attribute values. This is not a good idea, for several reasons:

The meta-syntax is not very good and may change.
Many applications that could benefit from attribute defaults have no use for the rest of the meta-syntax (or cannot afford the cost and complexity of parsing the meta-syntax).
Restricting the syntax and setting defaults are logically two very different things and should not be mixed so easily.

The syntax below therefore includes an attribute defaulting mechanism that is part of the core syntax.

Boolean attributes

All attributes in XML are by default string-valued, although the meta-syntax should be able to restrict that. There are different proposals for doing that. One interesting one is Tim Bray's proposal.

But there is one very simple type that is useful in almost all applications and that can be added to the core syntax without complicating it, and that is booleans. The syntax below therefore includes boolean attributes as well as string-valued ones.
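
To make the two additions concrete, here is a minimal sketch in C of how an application might store attributes and apply defaults. It is not part of the parser. The struct, the function names and the attribute names used in the usage note are all hypothetical, and the sketch assumes that a boolean attribute is simply a name without a value, represented here by a NULL value pointer.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* One attribute. value is NULL for a boolean attribute, i.e., a name that
 * was given without "= value" (an assumption of this sketch). */
struct attribute {
    const char *name;
    const char *value;
};

/* Find name among the n attributes in a, or return NULL. */
static const struct attribute *find_attr(const struct attribute *a, size_t n,
                                         const char *name)
{
    size_t i;

    for (i = 0; i < n; i++)
        if (strcmp(a[i].name, name) == 0)
            return &a[i];
    return NULL;
}

/* Look for name on the element itself first, then fall back to the
 * defaults declared for that element type. Returns NULL if neither has it. */
const struct attribute *effective_attr(const struct attribute *own, size_t n_own,
                                       const struct attribute *defaults,
                                       size_t n_defaults, const char *name)
{
    const struct attribute *a;

    assert(name != NULL);
    a = find_attr(own, n_own, name);
    return a ? a : find_attr(defaults, n_defaults, name);
}
```

For example, with defaults { show = "embed", compact } and an element that only sets show = "replace", looking up "show" yields "replace", looking up "compact" yields the boolean default, and looking up "href" yields nothing.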

The code

The code is in two parts: a flex tokenizer and a bison grammar. Also included are a test program and a makefile. Below is some documentation for each of them. To download all of them together, download this tarfile.

Flex scanner

(See the source.)

The actual scanner code is very short. After all, there are only 12 tokens to be recognized. The code relies on a few macros that keep the code clear:

nl: A newline can be either a carriage return, a line feed, or both.
ws: Whitespace is any sequence of one or more spaces, tabs, carriage returns or line feeds.
open: The rule that a newline is to be ignored just before a "<" is expressed by this macro, which combines an optional newline and a "<".
close: Same for the delimiter that signals the end of mark-up: a ">" optionally followed by a newline.
namestart: This represents all the characters that can start a name (element name, attribute name). This code doesn't try to deal with character encodings (most 8-bit encodings, as well as UTF-8, should work fine, though), and so it simply accepts all non-ASCII characters as name start characters. This is probably too lenient, but since all the delimiters in XML are from the ASCII set, it doesn't really matter.
namechar: All the characters that are allowed in a name, after the first character. The same leniency as for namestart above.
data: The data in an XML file, i.e., the characters between a start and end tag, are matched by this regular expression, that accepts all characters except a "<", and only accepts a newline if it is not immediately followed by a "<". There may be escaped characters in this data, of the form "&#[0-9]+;" or "&#x[0-9a-f]+;". This program doesn't expand them. To do that would require implementing the character encodings and the program currently doesn't do that.
string: A string is something between double or single quotes. Like data, it may include escaped characters.
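
The lenient treatment of non-ASCII characters in namestart and namechar can be sketched in C as follows. The exact ASCII sets are not spelled out in this text, so the ones below (letters, "_" and ":" to start a name, plus digits, "." and "-" afterwards) are an assumption borrowed from the usual XML name rules, and the function names are hypothetical.

```c
#include <assert.h>
#include <ctype.h>
#include <stddef.h>

/* Can c start a name? Every byte >= 0x80 is accepted, so bytes of UTF-8 or
 * 8-bit encodings pass through unchecked. The ASCII set is an assumption. */
int is_namestart(unsigned char c)
{
    return c >= 0x80 || isalpha(c) || c == '_' || c == ':';
}

/* Can c occur in a name after the first character? Same leniency. */
int is_namechar(unsigned char c)
{
    return is_namestart(c) || isdigit(c) || c == '.' || c == '-';
}

/* Is the whole string s a name: one namestart followed by namechars? */
int is_name(const char *s)
{
    assert(s != NULL);
    if (!is_namestart((unsigned char)*s))
        return 0;
    while (*++s)
        if (!is_namechar((unsigned char)*s))
            return 0;
    return 1;
}
```

Because "<", ">", "=", "/" and the quotes are all ASCII and none of them is in these sets, accepting every non-ASCII byte cannot make a delimiter look like part of a name.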

The scanner works in one of two modes (start conditions). The INITIAL mode ignores white space and recognizes names, strings, and most of the other tokens. It is active as the program starts and every time the tokenizer is in between "<" and ">".

The CONTENT mode is entered after the ">" of any start or end tag. In this mode only data, "<", comments, and the start of an attribute defaults declaration are recognized.
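
The switching between the two modes can be pictured with a small state machine in C. This is only a rough sketch and not the flex code: it treats every "<" in content as a return to INITIAL and every unquoted ">" as the end of mark-up, and it does not treat comments or attribute defaults declarations specially. The function name mode_after is hypothetical.

```c
#include <assert.h>
#include <stddef.h>

enum mode { INITIAL, CONTENT };

/* Return the mode a scanner of this kind would be in after reading all of
 * text, starting in INITIAL. Simplified: only '<', '>' and quotes matter. */
enum mode mode_after(const char *text)
{
    enum mode m = INITIAL;
    char quote = 0;     /* the open quote character inside a string, else 0 */

    assert(text != NULL);
    for (; *text; text++) {
        if (m == CONTENT) {
            if (*text == '<')
                m = INITIAL;        /* mark-up starts: back to INITIAL */
        } else if (quote) {
            if (*text == quote)
                quote = 0;          /* end of string */
        } else if (*text == '"' || *text == '\'') {
            quote = *text;          /* start of string; '>' inside is data */
        } else if (*text == '>') {
            m = CONTENT;            /* end of mark-up: data may follow */
        }
    }
    return m;
}
```

For example, after reading "<a>text" the mode is CONTENT, while after "<a>text<b x='>'" it is INITIAL, because the ">" inside the string does not end the tag.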

Bison grammar

(See the source.)

The grammar contains just 13 productions, and it could have been shorter and clearer if Bison had accepted some common notations for grammars. The grammar that is actually intended is as follows:

document: prolog element misc*;
prolog: VERSION? ENCODING? misc*;
misc: COMMENT | attribute_decl;
attribute_decl: ATTDEF NAME attribute+ ENDDEF;
element: START attribute* empty_or_content;
empty_or_content: SLASH CLOSE | CLOSE content END NAME? CLOSE;
content: (DATA | misc | element)*;
attribute: NAME (EQ VALUE)?;

... or just 8 productions.

Test application

(See the source.)

The test application just calls the parser to parse standard input.

Bert Bos
$Date: 1997/07/09 20:44:19 $