A meta-grammar for describing XML-based formats

Bert Bos; 1, 8 Feb 1999

Status of this document

This is a discussion document, presented to the XML Syntax WG by the author. It is a contribution both to the specification of an XML profile, and to the accompanying documentation that is to help the users of the profile.

Abstract

This is a proposal for a different, and hopefully easier approach to defining XML-based formats. The approach is based on defining a format with a context-free grammar, but without actually specifying any of the terminals, since they are implictly specified by XML already. (It is, in a sense, a meta-grammar.) An effect of this approach is that any formats defined with this approach all use a small and well-defined subset of XML, which should make implementation of parsers for the formats easier.

Another effect is that the resulting XML subset (or "profile") is not arrived at by striking features from XML 1.0 or by guessing what features are useful, but is inferred from a system for defining arbitrary data-formats, and is thus guaranteed to be both minimal and complete for at least one concrete method of developing data-formats.

Introduction

When choosing a syntax to linearize a data-structure, the requirements are usually at a fairly high level: the syntax must be unambiguous, easy to parse, easy to generate, and sometimes also: possible to debug/tweak with a text editor. The precise details (how to escape difficult characters, what delimiters to use, etc.) are not important, as long as they are addressed somehow. An XML-based syntax can deliver this, and has the benefit that it might allow some re-use of code.

Sometimes a user-friendly syntax is needed when hand-editing is expected, but then a generic syntax like XML is usually not adequate. In the following we assume data-formats for which an XML-based syntax is considered user-friendly enough. There are several such formats already, so the set is not empty.

The developer of such a linearization syntax would be helped by a simple notation that allows him to express just the structure, and from which the syntax details are inferred by an automated, but simple to understand process. Software engineers are familiar with context free grammars (CFG), such as BNF, EBNF and yacc, and CFGs seem the ideal basis for such a notation. Just like yacc allows a developer to concentrate on the syntax, and let the system take care of the code, so this system could go one step further, and let the developer concentrate on the structure, and let the system take care of the syntax.

The following is a start towards such a system. For the moment we define it by example.

Describing structure

Let's take as example a data structure defined in C as follows:

struct node {
    char *val;
    int type;
    struct node *left;
    struct node *right;
}

This describes a simple binary tree structure. It can be linearized using one of the usual methods for trees, let's take pre-order. The resulting linearized data can then be described by a grammar like this:

S     -> node;
node  -> val type left? right?;
val   -> TEXT;
type  -> INT;
left  -> node;
right -> node;

except, of course, that this is ambiguous without some delimiters.

The basic way XML makes a structure unambiguous is by inserting the name of every non-terminal in the syntax. surrounded by "<" and ">". That is something that can be done automatically. The above would essentially turn into this:

S     -> "<S>" node "</S>";
node  -> "<node>" val type left? right? "</node>";
val   -> "<val>" TEXT "</val>";
type  -> "<type>" INT "</type>";
left  -> "<left>" node "</left>";
right -> "<right>" node "</right>";

Here is an example of a document in this language (assuming some reasonable tokenization rules):

<S>
  <node>
    <val>aap</val>
    <type>7</val>
    <left>
      <node>
        <val>noot</val>
        <type>8</type>
      </node>
    </left>
  </node>
</S>

We could in pronciple leave it at that, but a developer might consider that this is a bit too verbose. A little inspection will show that the "<node>" and "</node>" delimiters are not really needed. Also the name "<S>" is not very descriptive, and it might be called "<tree>" instead. So the next section proposes some hints that the developer can put in the grammar to influence the automatic expansion.

Hinting the choice of delimiters

The default delimiters are the names of the non-terminals, enclosed in "<...>" or "</...>". A different name can be selected, or the delimiter can be suppressed altogether, if so desired, with an argument to the left-hand side. Example:

S(tree) -> node;
node()  -> val type left? right?;
val     -> TEXT;
type    -> INT;
left    -> node;
right   -> node;

The expansion is now as follows:

S(tree) -> "<tree>" node "</tree>";
node()  -> val type left? right?;
val     -> "<val>" TEXT "</val>";
type    -> "<type>" INT "</type>";
left    -> "<left>" node "</left>";
right   -> "<right>" node "</right>";

This changes the name for "S" to "tree" and suppresses the delimiters for "node". An example of a document:

<tree>
  <val>aap</val>
  <type>7</val>
  <left>
    <val>noot</val>
    <type>8</type>
  </left>
</tree>

It may be that the developer is still not happy with the syntax, and wants to put certain things in attributes. He might want that either to make the syntax less verbose, or to express some kind of intuitive distinction between two kinds of information, by putting one kind in the content and the other kind in attributes. (Note that this is a distinction for the benefit of human readers only, for the computer it doesn't make a difference.) This is the subject of the next section.

Comparison to values and types distinction

In a message to the XML schema WG, David Beech introduced an idea for distinguishing field names and field types in a DTD-like formalism. It would allow the definition of data type names in a way that is accessible to the DOM (assuming a suitable DOM). The current practice of using a parameter entity in DTDs doesn't allow that, or at least is an overloading of the concept of parameter entities.

A definition like it is found in today's DTDs:

<!ENTITY % Complex "(real, imaginary)">
<!ELEMENT x-coord "%Complex;">

which is meant to express the fact that the "x-coord" element takes a complex number as value, can be written in his syntax as:

<!DATATYPE Complex (real, imaginary)>
<!ELEMENT x-coord #DATATYPE Complex>

The ability of the meta-grammar to define non-terminals with tags and without tags can be used to do the same. The absence of a tag creates a type, rather than a field:

x-coord   -> Complex;
Complex() -> real imaginary;

Of course, like with parameter entities, the fact that a non-terminal without tags represents a type is only a convention. To make "Complex" into a type in some concrete application, something needs to be added, such as semantic actions (see below.)

Note that one of the reasons that David Beech mentions for introducing the datatype primitive doesn't apply to the meta-grammar. That reason is to overcome the problem of name scoping in XML: once a content model for "Complex" has been defined, the name "Content" cannot be used anymore in a different context to mean something else. In particular, once "Complex" has been chosen as a data type name, it cannot be a field name anymore. In the meta-grammar on the other hand, the content model is only tied to a nonterminal, not to an element name, and different non-terminals can have the same tag. For example, the three occurrences of <a> in a document in the language below all have different content models:

a1(a) -> a2 a3;
a2(a) -> TEXT;
a3(a) -> URI;

Example document:

<a><!-- this is a1 -->
  <a>Some text</a><!-- this is a2 -->
  <a>http://www.w3.org/</a><!-- this is a3 -->
</a>

Converting elements to attributes

Attributes are very restricted in XML: on any element, there can be only one attribute of a given name, inside attributes no further structure is possible (they must be plain text), and the order of the attributes cannot be made to mean anything. But still, we can try to put certain things in attributes.

This part of the proposal is must more shaky, because of the above restrictions, but here is an attempt. We go back to the original grammar, and this time assume the developer wants to put the "val" and "type" fields in attributes, because he feels that "looks better." Here's how:

S       -> node;
node    -> @val @type left? right?;
val     -> TEXT;
type    -> INT;
left    -> node;
right   -> node;

Prefixing the first few tokens on the right-hand side of "node" with an "@" indicates that their productions will go inside the "<" and ">", and be prefixed with "name=". The expansion is as follows:

S       -> "<S>" node "</S>";
node    -> "<node" @val @type ">" left? right? "</node>";
@val    -> "val" "=" "\"" TEXT "\"";
@type   -> "type" "=" "\"" INT "\"";
left    -> "<left>" node "</left>";
right   -> "<right>" node "</right>";

Here is an example document:

<S>
  <node val="aap" type="7">
    <left>
      <node val="noot" type="8">
      </node>
    </left>
  </node>
</S>

There are restrictions on where the "@" may be put, but these have to be investigated.

Additional possibilities

By means of "semantic actions" the link back from the syntax to the data-structure could be made explicit, so that a parser could create the actual data-structure instead of a generic parse tree. The following is neither complete, nor very elegant, but it shows the idea:

S       -> node;
node    -> val type left? right? {$$ = new node($1, $2, $3, $4)};
val     -> TEXT;
type    -> INT {$$ = atoi($1)};
left    -> node;
right   -> node;

Embedding other formats

The examples above used terminal symbols TEXT and INT to stand for text data (with whitespace, line breaks, and other reserved characters suitably escaped) and whole numbers (in some representation, by default decimal). By using more predefined terminal symbols, the set of value types can be extended to booleans (BOOL), real numbers (REAL), identifiers (IDENT), URLs (URI), and more if needed.

Another predefined terminal symbol could be ANY, which stands for any XML element (or any XML attribute, depending on context). For example, this grammar

S    -> head body;
head -> TEXT;
body -> (TEXT | ANY)*;

creates a language with three elements, and the content of "body" can be anything, as long as it properly balances the start and end tag, and escapes reserved characters.

The same can also be done with attributes, although the restrictions on attributes in XML mean that "any" doesn't really mean "any" in this case. It means attributes whose names are different from any explictly specified attributes:

S   -> foo;
foo -> @bar @ANY*;
bar -> INT;

This defines a language with two elements, of which the "foo" element has one required attribute "bar" and any number of other attributes.

Example 1

This is a larger example, using a data-structure inspired by RDF. The model consists of a set of descriptors. Each descriptor has an object (the thing being described) and a set of property/value pairs for that object. The descriptor, as well as each pair can also have an identifier, so it can be the object of another descriptor. Every property is actually a pair: a schema (a URL) and a name from that schema. A value is either a string, or another set of properties:

S           -> descriptor*;
descriptor  -> id? object property+;
object      -> URI;
property    -> id? schema name value;
schema      -> URI;
name        -> TEXT;
value       -> TEXT | propset;
propset     -> property+;

Next we select the names of the tags we want to use:

S(snf)              -> descriptor*;
descriptor          -> id? object property+;
object(about)       -> URI;
property            -> id? schema name value;
schema              -> URI;
name                -> TEXT;
value()             -> TEXT | propset;
propset(descriptor) -> property+;

Note that we now have two places where an element with the name "descriptor" is used, but they have different contents.

Next we put everything except the values in attributes:

S(snf)              -> descriptor*;
descriptor          -> @id? @object property+;
object(about)       -> URI;
property            -> @id? @schema @name value;
schema              -> URI;
name                -> TEXT;
value()             -> TEXT | propset;
propset(descriptor) -> property+;

Here is an example of a document in this language:

<snf>
  <descriptor id="x7" about="http://w3.org/">
    <property schema="http://w3.org/prop" name="foo">
      bar
    </property>
    <property schema="http://w3.org/prop" name="xyz">
      abc
    </property>
  </descriptor>
  <descriptor about="http://some.where/">
    <property schema="http://w3.org/prop" name="foo">
      also-bar
    </property>
  </descriptor>
</snf>

Example 2

Xarc (by Gabe Beged-Dov <begeddov@jfinity.com>) is a language for describing directed graphs in XML, and as such it is similar to RDF (and XLink, though XLink is not a language). Its grammar is given in the Xarc proposal as:

arcElt         ::= '<' arcName fromAttr toAttr attr* '>' content '</' arcName '>'
arcName        ::= Qname
fromAttr       ::= 'xarc:from="' locator '"'
toAttr         ::= 'xarc:to="' locator '"'
Qname          ::= [ NSprefix ':' ] name
name           ::= (any legal XML name symbol)
NSprefix       ::= (any legal XML namespace prefix)
content        ::= (any well formed XML content)
locator        ::= URI | "#" XPTR-reference
XPTR-reference ::= string, interpreted per [XPointer]
attr           ::= (any legal XML attribute)

Here is an example:

<s:Creator
    xarc:from="#this();" xarc:to="http://www.w3.org/Home/Lassila">
  Ora Lassila </s:Creator>

It uses the name of the element as the label for an edge, which is not expressible in the meta-grammar, but apart from that, the translation to the meta-grammar is straightforward. If we choose "a" as the name for the elements representing arcs and "xarc:name" for the attribute containing the label of the arc, the grammar becomes:

arcElt(a)           -> @arcName @fromAttr @toAttr @ANY* content;
arcName(xarc:name)  -> Qname;
fromAttr(xarc:from) -> locator;
toAttr(xarc:to)     -> locator;
Qname               -> NSprefix ":" name;
name                -> IDENT; /* any legal XML name */
NSprefix            -> IDENT; /* any legal XML namespace prefix */
content             -> (TEXT | ANY)*;
locator             -> URI | "#" XPTR-reference;
XPTR-reference      -> TEXT; /* interpreted per [XPointer] */

Here is the same example as above, rewritten to follow the new grammar:

<a xarc:name="s:Creator"
    xarc:from="#this()" xarc:to="http://www.w3.org/Home/Lassila">
  Ora Lassila </a>

Example 3

The DTD of SMIL can be rewritten with the meta-grammar as follows:

smil             -> @id? head body;
head             -> @id? head-element;
head-element()   -> meta* [ layout-section meta* ]?;
id               -> IDENT;
layout-section() -> layout | switch;
layout           -> @id? @type?
                    ((root-layout | region)+ | (TEXT | ANY)+))?;
type             -> "text/smil-basic-layout";
region           -> @id? @title? @height? @width? @background-color?
                    @left? @top? @z-index? @fit? @skip-content?;
title            -> TEXT;
height           -> TEXT;
width            -> TEXT;
background-color -> TEXT;
left             -> TEXT;
top              -> TEXT;
z-index          -> TEXT;
fit              -> "hidden" | "fill" | "meet" | "scroll" | "slice";
skip-content     -> "true" | "false";
root-layout      -> @id? @title? @height? @width? @background-color?
                    @skip-content?;
meta             -> @name? @content? @skip-content?;
name             -> IDENT;
content          -> TEXT;
body             -> @id? body-content*;
body-content()   -> container-content;
container-content() -> schedule | switch | link;
schedule()       -> par | seq | media-object;
media-object()   -> audio | video | text | img | animation | textstream | ref;
par              -> @id? @title? @abstract? @author? @copyright?
                    @endsync? @dur? @repeat? @region? @begin? @end?
                    par-content*;
par-content()    -> container-content;
abstract         -> TEXT;
author           -> TEXT;
copyright        -> TEXT;
endsync          -> TEXT;
(etc.)

Actually, the meta-grammar can be much more precise than the DTD from which the above is copied. The SMIL specification has a grammar for the value of "endsync", which can be combined with the grammar above.The "endsync" production could be rewritten as:

endsync   -> "indefinite" | Clock-val;
Clock-val -> Full-clock-val | Partial-clock-val;
(etc.)

Bert Bos