This is an attempt at defining XML (i.e., XML proper, not the meta-grammar for defining XML applications, nor the hyperlinking conventions) in a formal, mathematical way, that hopefully takes away all ambiguity as to what an XML document contains, or what the relation is between the document and its linearized, textual representation. Please report errors or suggest improvements.
An ADT (Abstract Data Type) is a mathematical formalism for defining the characteristics of a certain data type, without revealing the underlying model, let alone any concrete implementation in a computer program. An ADT defines exactly what is required of an implementation, and, by omission, also what is not required.
Marshalling is the process of linearizing a data structure into a sequence of bytes in a certain, previously agreed-upon way, so that one process can pass its internal data structure to another process, which will then be able to build an equivalent data structure. Since the marshalling algorithm presented here is based on the ADT, and not on an actual model, or even implementation, the two processes are free to use different internal memory structures. As a consequence, the marshalled data contains exactly enough information that a round-trip is possible, but no more.
XML is a very rich data type that contains Strings and Identifiers, partly ordered in a sequence, partly hierarchical, partly in an unordered database-like keyword/value system.
The functions defined here together define the ADT XML. The ADT makes use of previously defined types Identifier, URL, char, int, boolean, and nil, as well as of powersets and sequences of these. For example, the set consisting of all possible sequences of chars is denoted as char[] and the collection consisting of all possible sets of Identifiers is denoted as 2Identifier. The empty set is written as {}.
ADT XMLNode
import Identifier, URL, char, int, boolean, nil
The functions and the invariants that define them are as follows (`x' is the Cartesian product; `A -> B' means the function is from A to B; `=>' stands for `implies'; `+' is the set union operator for sets, and the normal addition operator for ints; `<>' means `not equal'; `\' is set difference):
getSchema(create(S)) = nil
getSchema(setSchema(X, U)) = U
getAttribute(create(S), I) = nil
getAttribute(setAttribute(X, I, S), I) = S
I1 <> I2 => getAttribute(setAttribute(X, I1, S), I2) = getAttribute(X, I2)
getAttributes(create(S)) = {}
getAttributes(setAttribute(X, I, S)) = getAttributes(X) + {I}
getAttribute(X, I) = getAttribute(setSchema(X, U), I)
getAttributes(X) = getAttributes(setSchema(X, U))
getSchema(X) = getSchema(setAttribute(X, I, S))
isNode(X1, X2) <=> E N : getNode(X1, N) = X2
isDescendant(X1, X2) <=> isNode(X1, X2) | (E X3 : isNode(X3, X2) & isDescendant(X1, X3))
isFamily(X1, X2) <=> X1 = X2 | isDescendant(X1, X2) | isDescendant(X2, X1)
isFamily(X1, X2) => addNode(X1, X2) = error
isFamily(X1, X2) => setNode(X1, N, X2) = error
N < 1 => setNode(X1, N, X2) = error
N > nrNodes(X1) => setNode(X1, N, X2) = error
N < 1 => getNode(X, N) = nil
N > nrNodes(X) => getNode(X, N) = nil
nrNodes(create(I)) = 0
¬isFamily(X1, X2) => nrNodes(addNode(X1, X2)) = nrNodes(X1) + 1
¬isFamily(X1, X2) & 0 < N <= nrNodes(X1) => getNode(setNode(X1, N, X2), N) = X2
¬isFamily(X1, X2) & N = nrNodes(X1) + 1 => getNode(addNode(X1, X2)), N) = X2
¬isFamily(X1, X2) => getNode(X1, N) = getNode(addNode(X1, X2), N)
¬isFamily(X1, X2) & 0 < M <= nrNodes(X1) & N <> M => getNode(setNode(X1, M, X2), N) = getNode(X1, N)
getNode(setSchema(X1, U), N) = getNode(X1, N)
getNode(setAttribute(X1, I, S), N) = getNode(X1, N)
¬isFamily(X1, X2) => getAttribute(addNode(X1, X2), I) = getAttribute(X1, I)
¬isFamily(X1, X2) & 0 < N <= nrNodes(X1) => getAttribute(setNode(X1, N, X2), I) = getAttribute(X1, I)
¬isFamily(X1, X2) => getAttributes(addNode(X1, X1)) = getAttributes(X1)
¬isFamily(X1, X2) & 0 < N <= nrNodes(X1) => getAttributes(setNode(X1, N, X2)) = getAttributes(X1)
¬isFamily(X1, X2) => getSchema(addNode(X1, X2)) = getSchema(X1)
¬isFamily(X1, X2) & 0 < N <= nrNodes(X1) => getSchema(setNode(X2, N, X2)) = getSchema(X1)
getName(create(I)) = I
getName(setSchema(X, U)) = getName(X)
getName(setAttribute(X, I, S)) = getName(X)
¬isFamily(X1, X2) => getName(addNode(X1, X2)) = getName(X1)
¬isFamily(X1, X2) & 0 < N <= nrNodes(X1) => getName(setNode(X1, N, X2)) = getName(X1)
deleteAttribute(create(I1), I2) = create(I1)
getSchema(deleteAttribute(X, I)) = getSchema(X)
getAttribute(deleteAttribute(X, I), I) = nil
getAttributes(deleteAttribute(X, I)) = getAttributes(X) \ {I}
getNode(deleteAttribute(X, I), N) = getNode(X, N)
getName(deleteAttribute(X, I)) = getName(X)
popNode(create(I)) = create(I)
¬isFamily(X1, X2) => popNode(addNode(X1, X2)) = X1
¬isFamily(X1, X2) & 0 < N < nrNodes(X1) => popNode(setNode(X1, N, X2)) = setNode(popNode(X1), N, X2)
¬isFamily(X1, X2) & N = nrNodes(X1) => popNode(setNode(X1, N, X2)) = popNode(X1)
nrNodes(X) > 0 => nrNodes(popNode(X)) = nrNodes(X) - 1
N < nrNodes(X) => getNode(popNode(X), N) = getNode(X, N)
getSchema(popNode(X)) = getSchema(X)
getAttribute(popNode(X), I) = getAttribute(X)
getAttributes(popNode(X)) = getAttributes(X)
getName(popNode(X)) = getName(X)
The notation for the functions uses nondeterministic if-statements, borrowed from the programming language Simple, which in turn is based on Dijkstra's guarded commands:
if guard1 : stmt1 [] guard2 : stmt2 ... [] guardn : stmtn fi
When evaluating this statement, from among all those guards that evaluate to true, one is picked nondeterministically and its statement is executed.
if guard1 : stmt1 [] guard2 : stmt2 ... [] guardn : stmtn [> stmt0 fi
is an abbreviation for
if guard1 : stmt1 [] guard2 : stmt2 ... [] guardn : stmtn [] ~guard1 & ~guard2 &... & ~guardn : stmt0 fi
(I found writing pseudo-code easier for this part than trying to find a mathematical formula. (Who says that programs are not mathematical formula, btw?) Note that the functions below are still side-effect-free.
The linearization function makes use of a linearization function for Identifiers. We assume that function doesn't produce any characters from the set {'<', '>', '"', "'", '=', '!', '?', <space>, <tab>, <CR>, <LF>}.
import writeIdentifier; function whitespace(min: integer): char[]; begin if min <= 0 : whitespace:= "" [] true : whitespace:= " " + whitespace(min - 1) [] true : whitespace:= "\t" + whitespace(min - 1) [] true : whitespace:= "\r" + whitespace(min - 1) [] true : whitespace:= "\n" + whitespace(min - 1) fi end; function space(min: integer): char[]; begin if min <= 0 : space:= "" [] true : space:= " " + space(min - 1) [] true : space:= "\t" + space(min - 1) fi end; function opendelim(sofar: char[]): char[]; begin if endsWith(sofar, "\r") : opendelim:= "\r" + spaces(0) + "<" [] endsWith(sofar, "\r") : opendelim:= "\r\n" + spaces(0) + "<" [] endsWith(sofar, "\n") : opendelim:= "\r" + spaces(0) + "<" [] endsWith(sofar, "\n") : opendelim:= "\n" + spaces(0) + "<" [] endsWith(sofar, "\n") : opendelim:= "\r\n" + spaces(0) + "<" [] endsWith(sofar, ">\r") : opendelim:= "<" [] endsWith(sofar, ">\n") : opendelim:= "<" [] endsWith(sofar, ">\r\n") : opendelim:= "<" [] ~endsWith(sofar, "\n") & ~endsWith(sofar, "\r") : opendelim:= "\r" + spaces(0) + "<" [] ~endsWith(sofar, "\n") & ~endsWith(sofar, "\r") : opendelim:= "\n" + spaces(0) + "<" [] ~endsWith(sofar, "\n") & ~endsWith(sofar, "\r") : opendelim:= "\r\n" + spaces(0) + "<" fi end; function closedelim(): char[]; begin if true : closedelim:= whitespace(0) + ">" [] true : closedelim:= whitespace(0) + ">" + "\r" + space(0) [] true : closedelim:= whitespace(0) + ">" + "\n" + space(0) [] true : closedelim:= whitespace(0) + ">" + "\r\n" + space(0) fi end; function writeSchema(sofar: char[]; url: char[]): char[]; begin if url <> nil : writeSchema:= opendelim(sofar) + "!doctype" + whitespace(1) + writeString(url) + closedelim() [> writeSchema:= "" fi end; function writeString(s: char[]): char[]; begin if true : writeString:= "\"" + writeChars(s, 1, "\"") + "\"" [] true : writeString:= "'" + writeChars(s, 1, "'") + "'" fi end; function writeChars(s: char[]; start: integer; avoid: char): char[]; begin if start <= length(s) : writeChars:= writeChar(s[start], avoid) + writeChars(s, start + 1, avoid) [> writeChars:= "" end; function writeChar(c, avoid: char): char[]; begin if c = "\"" : writeChar:= """ [] c = "'" : writeChar:= "&squot;" [] c = "<" : writeChar:= "<" [] c = ">" : writeChar:= ">" [] c = "&" : writeChar:= "&" [] true : writeChar:= "&#" + decimal(ord(c)) + ";" [] true : writeChar:= "&#x" + hexadecimal(ord(c)) + ";" [] c <> "<" & c <> ">" & c <> "&" & c <> avoid : writeChar:= c end; function writeAttribute(X: XML; names: set of Identifier; name: Identifier): char[]; begin writeAttribute:= whitespace(1) + writeIdentifier(name) + whitespace(0) + "=" + writeString(getAttribute(X, name)) + writeAttributes(X, names - {name}) end; function writeAttributes(X: XML; names: set of Identifier): char[]; begin if names = {} : writeAttributes:= "" [> writeAttributes:= writeAttribute(X, names, getMember(names)) fi end; function writeXML(sofar: char[]; X: XML): char[]; var s1, s2, s3, s4, s5: char; begin s1:= writeSchema(sofar, getSchema(X)); s2:= s1 + opendelim(sofar + s1) + writeIdentifier(getName(X)) + writeAttributes(X, getAttributes(X)); if true : s3:= s2 + closedelim(); s4:= s3 + writeNodes(s3, X, 1); s5:= s4 + opendelim(s3 + s4) [] nrNodes = 0 : s3:= s2; s4:= s3; s5:= s4 + whitespace(0); fi; writeXML:= s5 + "/" + writeIdentifier(getName(X)) + closedelim(); end; function writeNodes(sofar: char[]; X: XML; n: integer): char[]; var s: char[]; begin if getNode(X, n) isa char : s:= writeChar(getNode(X, n)); writeNodes:= s + writeNodes(s, X, n + 1) [] getNode(X, n) isa XML : s:= writeXML(sofar, getNode(X, n)); writeNodes:= s + writeNodes(s, X, n + 1) [> writeNodes:= "" fi end;
The marshalling function above is incomplete: it stops at the level of characters, when it should have specified bytes. The final step is to choose an encoding for characters. The chosen encoding has to be passed to the receiver of the byte sequence in some way, but this sepcification doesn't say how (`out of band').