ADT and marshalling for XML

Abstract

This is an attempt at defining XML (i.e., XML proper, not the meta-grammar for defining XML applications, nor the hyperlinking conventions) in a formal, mathematical way, that hopefully takes away all ambiguity as to what an XML document contains, or what the relation is between the document and its linearized, textual representation. Please report errors or suggest improvements.

Intro

An ADT (Abstract Data Type) is a mathematical formalism for defining the characteristics of a certain data type, without revealing the underlying model, let alone any concrete implementation in a computer program. An ADT defines exactly what is required of an implementation, and, by omission, also what is not required.

Marshalling is the process of linearizing a data structure into a sequence of bytes in a certain, previously agreed-upon way, so that one process can pass its internal data structure to another process, which will then be able to build an equivalent data structure. Since the marshalling algorithm presented here is based on the ADT, and not on an actual model, or even implementation, the two processes are free to use different internal memory structures. As a consequence, the marshalled data contains exactly enough information that a round-trip is possible, but no more.

XML is a very rich data type that contains Strings and Identifiers, partly ordered in a sequence, partly hierarchical, partly in an unordered database-like keyword/value system.

Fundamental functions

The functions defined here together define the ADT XML. The ADT makes use of previously defined types Identifier, URL, char, int, boolean, and nil, as well as of powersets and sequences of these. For example, the set consisting of all possible sequences of chars is denoted as char[] and the collection consisting of all possible sets of Identifiers is denoted as 2Identifier. The empty set is written as {}.

ADT XMLNode

import Identifier, URL, char, int, boolean, nil

The functions and the invariants that define them are as follows (`x' is the Cartesian product; `A -> B' means the function is from A to B; `=>' stands for `implies'; `+' is the set union operator for sets, and the normal addition operator for ints; `<>' means `not equal'; `\' is set difference):

create : Identifier -> XML
setSchema : XML x URL -> XML
getSchema : XML -> URL + nil

getSchema(create(S)) = nil

getSchema(setSchema(X, U)) = U

setAttribute : XML x Identifier x char[] -> XML
getAttribute : XML x Identifier -> char[] + nil
getAttributes : XML -> 2Identifier

getAttribute(create(S), I) = nil

getAttribute(setAttribute(X, I, S), I) = S

I1 <> I2 => getAttribute(setAttribute(X, I1, S), I2) = getAttribute(X, I2)

getAttributes(create(S)) = {}

getAttributes(setAttribute(X, I, S)) = getAttributes(X) + {I}

getAttribute(X, I) = getAttribute(setSchema(X, U), I)

getAttributes(X) = getAttributes(setSchema(X, U))

getSchema(X) = getSchema(setAttribute(X, I, S))

addNode : XML x XML + char -> XML + error
setNode : XML x int x XML + char -> XML + error
getNode : XML x int -> XML + char + nil
nrNodes : XML -> int
isNode : XML x XML -> boolean
isDescendant : XML x XML -> boolean
isFamily : XML x XML -> boolean

isNode(X1, X2) <=> E N : getNode(X1, N) = X2

isDescendant(X1, X2) <=> isNode(X1, X2) | (E X3 : isNode(X3, X2) & isDescendant(X1, X3))

isFamily(X1, X2) <=> X1 = X2 | isDescendant(X1, X2) | isDescendant(X2, X1)

isFamily(X1, X2) => addNode(X1, X2) = error

isFamily(X1, X2) => setNode(X1, N, X2) = error

N < 1 => setNode(X1, N, X2) = error

N > nrNodes(X1) => setNode(X1, N, X2) = error

N < 1 => getNode(X, N) = nil

N > nrNodes(X) => getNode(X, N) = nil

nrNodes(create(I)) = 0

¬isFamily(X1, X2) => nrNodes(addNode(X1, X2)) = nrNodes(X1) + 1

¬isFamily(X1, X2) & 0 < N <= nrNodes(X1) => getNode(setNode(X1, N, X2), N) = X2

¬isFamily(X1, X2) & N = nrNodes(X1) + 1 => getNode(addNode(X1, X2)), N) = X2

¬isFamily(X1, X2) => getNode(X1, N) = getNode(addNode(X1, X2), N)

¬isFamily(X1, X2) & 0 < M <= nrNodes(X1) & N <> M => getNode(setNode(X1, M, X2), N) = getNode(X1, N)

getNode(setSchema(X1, U), N) = getNode(X1, N)

getNode(setAttribute(X1, I, S), N) = getNode(X1, N)

¬isFamily(X1, X2) => getAttribute(addNode(X1, X2), I) = getAttribute(X1, I)

¬isFamily(X1, X2) & 0 < N <= nrNodes(X1) => getAttribute(setNode(X1, N, X2), I) = getAttribute(X1, I)

¬isFamily(X1, X2) => getAttributes(addNode(X1, X1)) = getAttributes(X1)

¬isFamily(X1, X2) & 0 < N <= nrNodes(X1) => getAttributes(setNode(X1, N, X2)) = getAttributes(X1)

¬isFamily(X1, X2) => getSchema(addNode(X1, X2)) = getSchema(X1)

¬isFamily(X1, X2) & 0 < N <= nrNodes(X1) => getSchema(setNode(X2, N, X2)) = getSchema(X1)

getName : XML -> Identifier

getName(create(I)) = I

getName(setSchema(X, U)) = getName(X)

getName(setAttribute(X, I, S)) = getName(X)

¬isFamily(X1, X2) => getName(addNode(X1, X2)) = getName(X1)

¬isFamily(X1, X2) & 0 < N <= nrNodes(X1) => getName(setNode(X1, N, X2)) = getName(X1)

deleteAttribute : XML x Identifier -> XML

deleteAttribute(create(I1), I2) = create(I1)

getSchema(deleteAttribute(X, I)) = getSchema(X)

getAttribute(deleteAttribute(X, I), I) = nil

getAttributes(deleteAttribute(X, I)) = getAttributes(X) \ {I}

getNode(deleteAttribute(X, I), N) = getNode(X, N)

getName(deleteAttribute(X, I)) = getName(X)

popNode : XML -> XML

popNode(create(I)) = create(I)

¬isFamily(X1, X2) => popNode(addNode(X1, X2)) = X1

¬isFamily(X1, X2) & 0 < N < nrNodes(X1) => popNode(setNode(X1, N, X2)) = setNode(popNode(X1), N, X2)

¬isFamily(X1, X2) & N = nrNodes(X1) => popNode(setNode(X1, N, X2)) = popNode(X1)

nrNodes(X) > 0 => nrNodes(popNode(X)) = nrNodes(X) - 1

N < nrNodes(X) => getNode(popNode(X), N) = getNode(X, N)

getSchema(popNode(X)) = getSchema(X)

getAttribute(popNode(X), I) = getAttribute(X)

getAttributes(popNode(X)) = getAttributes(X)

getName(popNode(X)) = getName(X)

Marshalling function

The notation for the functions uses nondeterministic if-statements, borrowed from the programming language Simple, which in turn is based on Dijkstra's guarded commands:

if guard1 : stmt1
[] guard2 : stmt2
...
[] guardn : stmtn
fi

When evaluating this statement, from among all those guards that evaluate to true, one is picked nondeterministically and its statement is executed.

if guard1 : stmt1
[] guard2 : stmt2
...
[] guardn : stmtn
[> stmt0
fi

is an abbreviation for

if guard1 : stmt1
[] guard2 : stmt2
...
[] guardn : stmtn
[] ~guard1 & ~guard2 &... & ~guardn : stmt0
fi

(I found writing pseudo-code easier for this part than trying to find a mathematical formula. (Who says that programs are not mathematical formula, btw?) Note that the functions below are still side-effect-free.

The linearization function makes use of a linearization function for Identifiers. We assume that function doesn't produce any characters from the set {'<', '>', '"', "'", '=', '!', '?', <space>, <tab>, <CR>, <LF>}.

import writeIdentifier;

function whitespace(min: integer): char[];
  begin
    if min <= 0 : whitespace:= ""
    [] true     : whitespace:= " " + whitespace(min - 1)
    [] true     : whitespace:= "\t" + whitespace(min - 1)
    [] true     : whitespace:= "\r" + whitespace(min - 1)
    [] true     : whitespace:= "\n" + whitespace(min - 1)
    fi
  end;

function space(min: integer): char[];
  begin
    if min <= 0 : space:= ""
    [] true     : space:= " " + space(min - 1)
    [] true     : space:= "\t" + space(min - 1)
    fi
  end;

function opendelim(sofar: char[]): char[];
  begin
    if endsWith(sofar, "\r")    : opendelim:= "\r" + spaces(0) + "<"
    [] endsWith(sofar, "\r")    : opendelim:= "\r\n" + spaces(0) + "<"
    [] endsWith(sofar, "\n")    : opendelim:= "\r" + spaces(0) + "<"
    [] endsWith(sofar, "\n")    : opendelim:= "\n" + spaces(0) + "<"
    [] endsWith(sofar, "\n")    : opendelim:= "\r\n" + spaces(0) + "<"
    [] endsWith(sofar, ">\r")   : opendelim:= "<"
    [] endsWith(sofar, ">\n")   : opendelim:= "<"
    [] endsWith(sofar, ">\r\n") : opendelim:= "<"
    [] ~endsWith(sofar, "\n")
       & ~endsWith(sofar, "\r") : opendelim:= "\r" + spaces(0) + "<"
    [] ~endsWith(sofar, "\n")
       & ~endsWith(sofar, "\r") : opendelim:= "\n" + spaces(0) + "<"
    [] ~endsWith(sofar, "\n")
       & ~endsWith(sofar, "\r") : opendelim:= "\r\n" + spaces(0) + "<"
    fi
  end;

function closedelim(): char[];
  begin
    if true : closedelim:= whitespace(0) + ">"
    [] true : closedelim:= whitespace(0) + ">" + "\r" + space(0)
    [] true : closedelim:= whitespace(0) + ">" + "\n" + space(0)
    [] true : closedelim:= whitespace(0) + ">" + "\r\n" + space(0)
    fi
  end;

function writeSchema(sofar: char[]; url: char[]): char[];
  begin
    if url <> nil : writeSchema:= opendelim(sofar) + "!doctype"
                      + whitespace(1) + writeString(url) + closedelim()
    [>              writeSchema:= ""
    fi
  end;

function writeString(s: char[]): char[];
  begin
  if true : writeString:= "\"" + writeChars(s, 1, "\"") + "\""
  [] true : writeString:= "'" + writeChars(s, 1, "'") + "'"
  fi
  end;

function writeChars(s: char[]; start: integer; avoid: char): char[];
  begin
    if start <= length(s) : writeChars:= writeChar(s[start], avoid)
                              + writeChars(s, start + 1, avoid)
    [>                      writeChars:= ""
  end;

function writeChar(c, avoid: char): char[];
  begin
    if c = "\"" : writeChar:= "&quot;"
    [] c = "'"  : writeChar:= "&squot;"
    [] c = "<"  : writeChar:= "&lt;"
    [] c = ">"  : writeChar:= "&gt;"
    [] c = "&"  : writeChar:= "&amp;"
    [] true     : writeChar:= "&#" + decimal(ord(c)) + ";"
    [] true     : writeChar:= "&#x" + hexadecimal(ord(c)) + ";"
    [] c <> "<" & c <> ">" & c <> "&" & c <> avoid :
      writeChar:= c
  end;

function writeAttribute(X: XML; names: set of Identifier; name: Identifier): char[];
  begin
    writeAttribute:= whitespace(1) + writeIdentifier(name)
      + whitespace(0) + "=" + writeString(getAttribute(X, name))
      + writeAttributes(X, names - {name})
  end;

function writeAttributes(X: XML; names: set of Identifier): char[];
  begin
    if names = {} : writeAttributes:= ""
    [>              writeAttributes:= writeAttribute(X, names, getMember(names))
    fi
  end;

function writeXML(sofar: char[]; X: XML): char[];
  var
    s1, s2, s3, s4, s5: char;
  begin
    s1:= writeSchema(sofar, getSchema(X));
    s2:= s1 + opendelim(sofar + s1) + writeIdentifier(getName(X))
      + writeAttributes(X, getAttributes(X));
    if true :
      s3:= s2 + closedelim();
      s4:= s3 + writeNodes(s3, X, 1);
      s5:= s4 + opendelim(s3 + s4)
    [] nrNodes = 0 :
      s3:= s2;
      s4:= s3;
      s5:= s4 + whitespace(0);
    fi;
    writeXML:= s5 + "/" + writeIdentifier(getName(X)) + closedelim();
  end;

function writeNodes(sofar: char[]; X: XML; n: integer): char[];
  var
    s: char[];
  begin
    if getNode(X, n) isa char :
      s:= writeChar(getNode(X, n));
      writeNodes:= s + writeNodes(s, X, n + 1)
    [] getNode(X, n) isa XML :
      s:= writeXML(sofar, getNode(X, n));
      writeNodes:= s + writeNodes(s, X, n + 1)
    [>
      writeNodes:= ""
    fi
  end;

Character encoding

The marshalling function above is incomplete: it stops at the level of characters, when it should have specified bytes. The final step is to choose an encoding for characters. The chosen encoding has to be passed to the receiver of the byte sequence in some way, but this sepcification doesn't say how (`out of band').


Bert Bos <bert@w3.org>
9 Aug 1997