Context free schemas for data-XML

A schema is a formal description of a data format, in particular of a format expressed in XML. "Data-XML" is my term for a simplified version of XML (a.k.a. the trivial subset), while waiting for a better term. "Context free" refers to context free grammars, on which this proposal is based.

Rationale

A context free grammar (CFG) is often the most compact and easiest way to describe some structure. Not all languages (or structures) are context free, but it is usually worth the trouble making them so (or nearly). A CFG provides a tree structure for sentences in the language it generates.

A DTD does something similar, but is more limited and harder to read. DTD's are limited, because they don't have real non-terminals: you cannot group some elements together without also introducing some terminals into the language. On the other hand, the fact that mark-up is implicit makes it unncessary to mention all the delimiters in the production rules, which leads to shorter rules.

For a long time I've had the idea that the best of both worlds would be an EBNF-like grammar notation but with implicit addition of the XML mark-up where that mark-up is straightforward. But unlike in DTD's, there must be control over the mark-up, because element and attribute names do not always correspond to non-terminals, and some non-terminals do not generate any mark-up at all.

Another idea is that the choice between elements and attributes is often arbitrary, a matter of esthetics. Attributes are much more limited in what they can contain, but often they lead to more compact formats. So the schema language I propose creates elements by default, since that always works, but a 1-character flag can be added to the rules, as a hint that attributes are to be generated instead, if possible.

The result is that a schema looks like a context-free grammar: it has production rules with a non-terminal on the left and a regular expression on the right. The terminals are either string constants or predefined symbols representing basic data types, such as INTEGER, REAL, or DATE. Some extra symbols may be inserted in the rule to fine-tune the XML representation, but these do not influence the structure, or the "meaning" of the format.

Although the recommended notation for a schema is the grammar notation, it is, of course, possible to "apply the schema to itself" as it were, and derive an XML-based syntax for schemas. The grammar notation is more compact and readable, and will be used below.

I'll explain the schemas with the help of an example.

Example of a schema

The production rules below describe a data format. The data format is called "S" and the various rules describe what data (integers, strings, etc.) you can find in it, and how much of each kind of data:

S           : rule*;
rule        : head @name? rhs;
head()      : IDENTIFIER;
name        : IDENTIFIER;
rhs()       : term?;
seq         : @repeatcount? @isattrib? term+;
choice      : @repeatcount? @isattrib? term+;
term()      : type | constant | nonterm | seq | choice;
nonterm     : @repeatcount? @isattrib? IDENTIFIER;
constant    : TEXT;
type        : "DATE" | "INTEGER" | "REAL" | "TEXT";
repeatcount : lower upper? | upper;
lower("min"): INTEGER;
upper("max"): INTEGER;
isattrib    : "y" | "n";

If we forget the "@" and "(...)" for the moment, this is just a normal grammar. E.g., the rules say that a "repeatcount" consists of either a "lower" and an optional "upper", or a single "upper".

The difference between a typical context-free grammar and this schema is that there are a number of symbols omitted from the poduction rules, that will be inserted automatically. E.g., the rule "S: rule*" in fact says that an "S" consists of the text "<S>", followed by zero or more "rules", followed by "</S>". In other words, all the typical XML mark-up is omitted. E.g., the first production says that the following conforms to this schema:

<S>
  <rule>...</rule>
  <rule>...</rule>
</S>

But sometimes the names of the non-terminals are not suitable for the XML-based format; they may be considered too long, or two names that are different in the grammar should be the same in the XML format. In that case, the name can be added after the non-terminal in parentheses. So, e.g., the rule for "lower" says that the element shouldn't be called "lower", but "min".

A special case of this is when the name between the parentheses is omitted, as in "rhs()". That means that there will be no mark-up at all. There is a structure identified in the schema, but that structure has no direct counterpart in the XML format. The structure is implicit.

Omitting the name can be dangerous, as it may lead to ambiguous sentences. However, whether the grammar is ambiguous can be checked mechanically.

In the example above, the "rhs()" rule could have been omitted completely, by rewriting the second production rule to

rule: head @name? term?;

but it is kept as an example.

Another choice that often exists in designing XML-based formats is whether to use an element or an attribute. The limits on attributes are that their content can only be a basic data type (text, integer, etc.), and that they can only occur once per element. But their use often leads to slightly more compact and more readable formats. In the schema, the addition of a "@" in front of a non-terminal on the right hand side indicates that the non-terminal should not be turned into a child element, but into an attribute, or a set of attributes. Given the limitations on attributes, that is obviously not always possible.

In the schema above, e.g., the rule for "seq" says that "repeatcount" (among others) should be turned into one or more attributes. "Repeatcount" consists of either one or two other elements, but both of them expand to primitive types, and neither is repeated, so "repeatcount" in fact satifies the condition on attributes. The following two are therefore valid expansions of "term":

<seq min="0">...</seq>
<seq min="1" max="10">...</seq>

(Note that "upper" and "lower" had been renamed to "max" and "min", respectively.)

Here is an example of a complete sentence that conforms to the schema above:

<S>
  <rule>S
    <nonterm min="0">rule</>;
  </rule>
  <rule>rule
    <seq>
      <nonterm>head</>
      <nonterm isattrib="y">name</>
      <nonterm>rhs</>;
    </seq>
  </rule>
  <rule name="">head
    <type>IDENTIFIER</>
  </rule>
  <rule>name
    <type>IDENTIFIER</>
  </rule>
  <rule name="">rhs
    <nonterm max="1">term</>
  </rule>
  <rule>seq
    <seq>
      <nonterm max="1" isattrib="y">repeatcount</>
      <nonterm max="1" isattrib="y">isattrib</>
      <nonterm min="1">term</>
    </seq>
  </rule>
  <rule>choice
    <seq>
      <nonterm max="1" isattrib="y">repeatcount</>
      <nonterm max="1" isattrib="y">isattrib</>
      <nonterm min="1">term</>
    </seq>
  </rule>
  <rule name="">term
    <choice>
      <nonterm>type</>
      <nonterm>constant</>
      <nonterm>nonterm</>
      <nonterm>seq</>
      <nonterm>choice</>
    </choice>
  </rule>
  <rule>nonterm
    <seq>
      <nonterm max="1" isattrib="y">repeatcount</>
      <nonterm max="1" isattrib="y">isattrib</>
      <type>IDENTIFIER</>
    </seq>
  </rule>
  <rule>constant
    <type>TEXT</>
  </rule>
  <rule>type
    <choice>
      <constant>DATE</>
      <constant>INTEGER</>
      <constant>REAL</>
      <constant>IDENTIFIER</>
      <constant>TEXT</>
    </choice>
  </rule>
  <rule>repeatcount
    <choice>
      <seq>
        <nonterm>lower</>
        <nonterm max="1">upper</>
      </seq>
      <nonterm>upper</>
    </choice>
  </rule>
  <rule name="min">lower
    <type>INTEGER</>
  </rule>
  <rule name="max">upper
    <type>INTEGER</>
  </rule>
  <rule>isattrib
    <choice>
      <constant>y</>
      <constant>n</>
    </choice>
  </rule>
</S>

You may recognize it. It is, indeed, the schema applied to itself. This shows two things: (1) that the schema works, and (2) that the non-XML syntax is a lot easier.:-)

Another example

That last step may have been two quick. Let's try another example, one that doesn't bite its own tail. The traditional example of a memo.

A memo has only a few parts. For this example, we keep it very simple. There is an addressee, a sender, a date, a title, and a body:

memo           : addressee sender date title body;
addressee      : TEXT;
sender         : TEXT;
date           : DATE;
title          : TEXT;
body           : TEXT;

For example, the following would be a valid memo:

<memo>
  <addressee>John</addressee>
  <sender>Carla</addressee>
  <date>19980901</date>
  <title>New coffee maker</title>
  <body>
    The new coffee maker has been installed!
    Operation is simple: put a cup in the opening
    and press the red button.
  </body>
</memo>

Now let' see if we can' vary things a bit. What if we put the first three elements in attributes instead, and rename "addressee" to "to"? The memo would like like this:

<memo to="John" sender="Carla" date="19980901">
  <title>New coffee maker</title>
  <body>
    The new coffee maker has been installed!
    Operation is simple: put a cup in the opening
    and press the red button.
  </body>
</memo>

The schema that makes the memo look like this is basically the same as the earlier one, but it adds some @-signs and an alias for "addressee":

memo           : @addressee @sender @date title body;
addressee("to"): TEXT;
sender         : TEXT;
date           : DATE;
title          : TEXT;
body           : TEXT;

The fact that the schema hasn't changed, except cosmetically, indicates that the two structures are in some sense equivalent. However, we won't try to define what this "equivalence" means exactly.

Whitespace

As the examples show, spaces and newlines have been inserted in several places to make the documents readable. The schema language defines formats in which most of the whitespace is insignificant. Whitespace is optional at several places between "<" and ">", and also after every ">". Whitespace at the start of a line is also insignificant.

For most data formats this is very convenient, but there may be a few that want to have elements that contain spaces at the start. Maybe a format for storing poems. Such a format will have to make the spaces explicit, for example with a rule

spaces: INTEGER

which would expand, e.g., to

<spaces>4</spaces>

or

spaces="4"

depending on its context.

Basic types

The following basic types are supported:

INTEGER
Whole numbers. There is no limit on their size. [Maybe introduce a RANGE type which has limits?]. They are represented in decimal, with an optional sign. [Allow octal and hex?]
REAL
Real numbers. There is no limit on their size. They are represented in decimal, the usual ways: 3.14, -1.2E+356, 0.11E-12, etc.
DATE
A date and optional time, represented as per "Date and time formats".
IDENTIFIER
An identifier, conforming to the rules for XML names.
TEXT
Arbitrary text. Note that the text with initial whitespace, or with whitespace following a newline cannot be represented. [Provide predefined escapes?]

Bert Bos
Created: April 2000
$Date: 2001/04/05 19:00:54 $