A schema for serialized infosets
Richard Tobin and Henry Thompson, LTG, University of Edinburgh
This is a schema that describes an XML serialization of XML infosets.
There are two main versions: one for the basic infoset, and one for
the post-schema-validation (PSV) infoset.
Our main goal in defining this serialization was to allow comparison of
the infosets generated by different processors (including parsers and
schema validators). It has also proved useful for finding flaws in
the infoset and schema specifications themselves, and the serializations
can also be converted to HTML (by stylesheets) for display.
The top-level schemas are
- XMLInfoset.xsd
- The basic infoset
- XMLInfoset-strict.xsd
- "Strict" version of the basic infoset (see below)
- PSVInfoset.xsd
- PSV infoset (uses strict version of basic infoset)
Notes
All properties and infoitems are represented as elements. There are
two reasons for this:
- It avoids the need to decide each case individually.
- It allows properties to be nulled with xsi:nil, to
represent no value (or absent, as it is called in
the PSV infoset).
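As a rough illustration of the second point, the following Python sketch builds two property elements, one of which is nulled with xsi:nil. The element names prefix and localName are assumed from the naming convention and have not been checked against the schemas; the helper functions are hypothetical.

```python
import xml.etree.ElementTree as ET

# Namespace of the basic infoset serialization, as given in this document.
INFOSET_NS = "http://www.w3.org/2001/05/XMLInfoset"
# Standard XML Schema instance namespace, which defines xsi:nil.
XSI_NS = "http://www.w3.org/2001/XMLSchema-instance"


def make_property(name, value=None):
    """Build a property element; None is serialized as xsi:nil="true"."""
    prop = ET.Element(f"{{{INFOSET_NS}}}{name}")
    if value is None:
        prop.set(f"{{{XSI_NS}}}nil", "true")
    else:
        prop.text = value
    return prop


def is_nilled(prop):
    """Return True if the property element carries a true xsi:nil."""
    return prop.get(f"{{{XSI_NS}}}nil") in ("true", "1")


# Assumed element names, for illustration only.
prefix_prop = make_property("prefix")                  # no value
local_name_prop = make_property("localName", "title")  # ordinary value
```

Had properties been represented as attributes, there would be no equivalent way to distinguish a property with no value from an empty string.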
A type is declared for each info item and property. Type names are
camel-case with an initial capital. Element names are camel-case
with an initial lower-case letter.
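The naming convention can be sketched as a pair of Python functions that derive a type name and an element name from a property name written in the bracketed style of the Infoset spec. This is only an approximation: keeping all-capital words such as URI unchanged is an assumption, and the results have not been checked against the names actually used in the schemas.

```python
import re


def _words(property_name):
    """Split a name such as "[in-scope namespaces]" into its words."""
    return re.findall(r"[A-Za-z0-9]+", property_name)


def type_name(property_name):
    """Camel-case with an initial capital, e.g. "LocalName"."""
    # Assumption: all-capital words (acronyms) are left as they are.
    return "".join(
        w if w.isupper() else w.capitalize() for w in _words(property_name)
    )


def element_name(property_name):
    """Camel-case with an initial lower-case letter, e.g. "localName"."""
    # A name that starts with an acronym is not handled specially here.
    name = type_name(property_name)
    return name[:1].lower() + name[1:]
```

For example, element_name("[local name]") gives localName and type_name("[local name]") gives LocalName.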
All properties are represented as elements whose name is the property
name. These elements are globally declared (except where there are
infoitems with the same name, in which case "Property" is appended to
the name). A consequence is that properties with the same name must
have the same type; this is true for both the basic and PSV infosets.
Their types fall into several categories:
-
Atomic (strings, enumerations and booleans):
The property is an element with a simple type.
-
Lists of atoms:
If the atom cannot include spaces, the property is an element
with a simple type which is
a list of the appropriate simple type. If it can include spaces (e.g.
XPaths), we create a dummy infoitem for it (XXX we haven't done this for URIs,
which can theoretically include spaces).
-
Lists of info items:
The property is an element containing a sequence of elements which represent
the info items.
-
(Unordered) sets:
As for lists. The values are sorted into a canonical order. For
attributes and namespaces this is the same as the order in Canonical
XML. For other cases we will specify an order.
-
Single info items:
Surprisingly, there are none of these in the basic infoset, except
in cases where a pointer is used (see below). They do occur in the
PSV infoset.
The property is an element containing an element which represents
the info item. In several cases the property has the same name
as the infoitem that is its value, resulting in a strange-looking
repetition of the element name.
-
References to info items:
Where the very same info item appears in two or more places, we specify
that one contains the real value and the others contain pointers.
Every info item that is pointed to has an attribute named id.
A property pointing to an info item contains an element named pointer
which has an attribute named ref whose value matches the id of the
pointed-to item. Identity constraints in the schema enforce the correspondence
(just in ID/IDREF style at present; this could be tightened up to
ensure that pointers point to the right type).
When there is a natural home for the real definitions it is used. In
particular, unparsed entities and notations reside in the
[unparsed entities] and [notations] properties of the
document info item. Global schema components reside in the
[schema components] property of the schema information
info item; other components reside in the component in which they are
defined (for example, a local element declaration will reside in a particle).
-
Odd cases:
The PSV infoset has some odd cases. Where the property is either an atom
or a structure (e.g. [scope], which is either global or
a complex type definition), we just use mixed content. Where the
property has substructure (e.g. [value constraint], which is a pair
consisting of a string and either default or fixed), we create a
dummy infoitem.
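The pointer mechanism described under "References to info items" can be sketched in Python as follows. The sample fragment is a simplified, hypothetical serialization: only the id and ref attributes, the pointer element and the namespace are taken from this document, and the remaining element names are guesses based on the naming convention.

```python
import xml.etree.ElementTree as ET

INFOSET_NS = "http://www.w3.org/2001/05/XMLInfoset"

# Hypothetical fragment: the notation is defined once, under the
# document's notations property, and referred to by a pointer elsewhere.
SAMPLE = f"""
<document xmlns="{INFOSET_NS}">
  <notations>
    <notation id="n1"><name>gif</name></notation>
  </notations>
  <unparsedEntities>
    <unparsedEntity>
      <name>logo</name>
      <notation><pointer ref="n1"/></notation>
    </unparsedEntity>
  </unparsedEntities>
</document>
"""


def index_by_id(root):
    """Map each id attribute value to the info item element carrying it."""
    return {el.get("id"): el for el in root.iter() if el.get("id") is not None}


def resolve_pointers(root):
    """Return (property element, pointed-to info item) pairs."""
    ids = index_by_id(root)
    resolved = []
    for prop in root.iter():
        for child in prop:
            if child.tag == f"{{{INFOSET_NS}}}pointer":
                ref = child.get("ref")
                if ref not in ids:
                    raise KeyError(f"dangling pointer: {ref!r}")
                resolved.append((prop, ids[ref]))
    return resolved


pairs = resolve_pointers(ET.fromstring(SAMPLE))
```

Like the ID/IDREF-style identity constraints, this checks only that each ref matches some id, not that the pointed-to item is of the expected type.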
Since there is no requirement for a processor to produce all infoitems
or properties, in the basic infoset schema all properties are optional.
In addition, to
allow extensions of the infoset to be validated against the basic
schema, all infoitems end with
<s:any namespace="##other" processContents="lax" minOccurs="0" maxOccurs="unbounded"/>
There is a "strict" version of the schema which requires all the properties
(but still allows extra properties from other namespaces).
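The effect of the trailing wildcard can be approximated in Python as shown below. This is a sketch, not a substitute for schema validation: the extension namespace http://example.org/ext, the x:sourceLine element and the localName element name are hypothetical, and the function checks only the position and namespace of each child.

```python
import xml.etree.ElementTree as ET

INFOSET_NS = "http://www.w3.org/2001/05/XMLInfoset"

# Hypothetical info item with one extension property at the end.
SAMPLE = f"""
<element xmlns="{INFOSET_NS}" xmlns:x="http://example.org/ext">
  <localName>title</localName>
  <x:sourceLine>12</x:sourceLine>
</element>
"""


def namespace_of(element):
    """Return the namespace URI of an element, or None if unqualified."""
    tag = element.tag
    return tag[1:].split("}", 1)[0] if tag.startswith("{") else None


def split_info_item(info_item):
    """Separate infoset properties from trailing extension elements."""
    properties, extensions = [], []
    for child in info_item:
        ns = namespace_of(child)
        if ns == INFOSET_NS:
            if extensions:
                # The wildcard comes last, so properties cannot follow it.
                raise ValueError("infoset property after extension content")
            properties.append(child)
        elif ns is None:
            # In XML Schema 1.0, ##other does not match unqualified elements.
            raise ValueError("unqualified element is not allowed here")
        else:
            extensions.append(child)
    return properties, extensions


properties, extensions = split_info_item(ET.fromstring(SAMPLE))
```

Because the wildcard uses processContents="lax", extension elements are validated only if a declaration for them happens to be available.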
The serialization of the basic infoset uses the namespace
http://www.w3.org/2001/05/XMLInfoset
and corresponds to what is expected to be the CR draft of the Infoset spec.
The serialization of the PSV infoset uses the namespace
http://www.w3.org/2001/05/PSVInfosetExtension
for added properties and infoitems,
and corresponds to the XML Schema Recommendation.
Future work
There are some gaps that will be filled. In
particular, no serialization has yet been defined for ID/IDREF or
identity constraint tables. The schema could be tightened up in
several places (facets, for example).
We intend to make the schemas compatible with the RDF schema
for the basic infoset, so that a serialization can be valid
according to both.
There are no doubt many bugs in these schemas, which we will
attempt to correct. Please mail Richard Tobin
(richard@cogsci.ed.ac.uk)
with corrections.