DataPower XML Schema Experience Report

David Maze
DataPower Technology, Inc.
dmaze@datapower.com
May 20, 2005

Overview

DataPower produces three network devices: the XA35 XML Accelerator, the XS40 XML Security Gateway, and the XI50 XML Integration Appliance. All three products implement and support W3C XML Schema 1.0 as a core feature. A typical XS40 deployment configures the appliance to perform schema validation on all incoming messages, in addition to digital signature verification and other user-specified policy checks, before sending them on to a back-end server.

How do you use XML Schema? What usage scenario is illustrated by your usage?

Our schema validation is fully streaming: unless identity constraints or the xs:ID or xs:IDREF simple types are used, documents of unlimited size can be validated. A typical deployment will validate against a fixed schema, but we fully support the xsi:schemaLocation attribute in instance documents if requested by the administrator.

What features of XML Schema 1.0 meet your needs?

We use XML Schema 1.0 principally for validity testing. As such, the most interesting parts of the schema specification are the document structural description language and the basic definitions of the simple types. Often our customers use widely available schemas, such as the ACORD, STAR, or FpML schemas. In general, the structural language is designed such that it is very easy to generate a streaming validator from a schema description; the only notable exception is the xs:all construct.
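
One plausible reading of why xs:all stands out (an interpretation, not something stated in this report): a sequence or choice can be tracked with a single position in a content model, whereas an all group accepts its members in any order, so the validator has to remember which members it has already seen. The Python sketch below tracks that set directly. The function name and the element names in the usage note are hypothetical, and this is not a description of DataPower's implementation.

```python
def validate_all_group(child_names, required, optional=()):
    """Check a stream of child element names against an xs:all group.

    In XML Schema 1.0 each member of an all group may appear at most once,
    in any order; members with minOccurs="0" are passed in `optional`.
    """
    required, optional = set(required), set(optional)
    seen = set()
    for name in child_names:  # single pass, so an event stream is enough
        if name not in required and name not in optional:
            return False  # element is not a member of the group
        if name in seen:
            return False  # a member may not repeat
        seen.add(name)
    return required <= seen  # every required member must have appeared
```

For example, validate_all_group(["b", "a"], {"a", "b"}, {"c"}) accepts the children in either order, while a repeated "a" or a missing "b" is rejected. The per-element state is a set that grows with the size of the group, rather than one state number.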

What features of XML Schema 1.0 don't meet your needs?

Readability. The schema specification is very dense. In its own way, it is very precise, but it is much closer to pseudocode than human-readable English. The multiple layers of abstraction also make the flow of schema processing difficult to understand: given an XML 1.0 schema document and an instance document I want to validate, I need to parse both documents into infosets, create an abstract schema object tree from the schema infoset, and then apply all the rules; when I'm done, what I have is the infoset of the instance document with some extra information items scattered throughout. What I really want for a yes-or-no answer is that the [validity] property of the element information item corresponding to the root element of the infoset produced from the instance document is valid, but that sentence is a lot harder to understand than "the instance document is valid".

PSVI. In many cases, we make very little use of the post-schema-validation infoset: since we only validate, and only want a yes-or-no answer, all we need to manage is the aforementioned [validity] property, and this can be handled implicitly. XSLT 2.0, by contrast, requires at least knowing the type of each element as it is produced. The requirements imposed by full PSVI support feel very invasive for what seems to be an "add-on" to the core XML specification.

Runtime dispatching. Substitution groups and the xsi:type attribute seem to perform similar roles, in letting an element have a different type from what was declared in the schema. Substitution groups can be implemented with static checking: check that the type of the group member is validly derived from the type of the group leader, and then treat any appearance of the group leader as a choice of all of the group members. xsi:type demands runtime checking of all of these things, and ultimately doing namespace resolution and dispatching based on a string value in an attribute. For our hardware-based solutions, this is an unusual requirement in XML processing and requires adding support for this specific feature to the hardware.
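
The contrast can be sketched in Python. This is a minimal illustration under stated assumptions: the namespace URI, the type table, and all function names are hypothetical, derivation is reduced to a simple parent chain, and features such as block and final are ignored. The substitution-group check runs once when the schema is loaded; the xsi:type check has to run for each element, against the namespace bindings in scope at that point in the instance.

```python
NS = "urn:example:po"  # hypothetical target namespace

# Hypothetical schema model: expanded type name -> base type (None at the root).
TYPE_BASES = {
    (NS, "Address"): None,
    (NS, "USAddress"): (NS, "Address"),
    (NS, "Note"): None,
}


def derives_from(type_name, base_name):
    """Walk the derivation chain upward looking for base_name."""
    while type_name is not None:
        if type_name == base_name:
            return True
        type_name = TYPE_BASES.get(type_name)
    return False


def expand_substitution_group(head, head_type, member_types):
    """Static step, once per schema: verify members, return a choice of names."""
    for name, member_type in member_types.items():
        if not derives_from(member_type, head_type):
            raise ValueError(f"{name}: type not derived from the head's type")
    return [head, *member_types]


def resolve_xsi_type(value, in_scope_ns, declared_type):
    """Runtime step, per element: resolve the QName string, then check it."""
    prefix, _, local = value.rpartition(":")
    if prefix and prefix not in in_scope_ns:
        raise ValueError(f"unbound prefix {prefix!r}")
    type_name = (in_scope_ns.get(prefix), local)  # "" is the default namespace
    if type_name not in TYPE_BASES:
        raise ValueError(f"unknown type {value!r}")
    if not derives_from(type_name, declared_type):
        raise ValueError(f"{value!r} is not derived from the declared type")
    return type_name
```

After expand_substitution_group has run, the validator only ever matches element names. resolve_xsi_type, by contrast, needs the string value of an attribute plus the current prefix bindings before it can decide which type to validate against.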

Identity constraints. These seem to be rarely used in our customers' schemas. They seem to be completely separate from the rest of the XML Schema structural language, and the rules for propagation of values between trees with a shared parent xs:keyref are unusual and complicated.

DTD-derived types. The xs:NOTATION type, in particular, has unusual rules, and in practice is nothing more than a variant on xs:QName. This and xs:ENTITY seem to be completely unused in practice.

What interoperability problems have you experienced in your use of XML Schema 1.0?

Our principal issues have come from trying to describe all of the variants of SOAP 1.1 in a single schema. The specification is imprecise about details such as the namespaces of the children of SOAP:Fault or their order. In this case what we really want is an xs:all group that has choices of either qualified or unqualified elements, or perhaps a choice of two all groups which allow either all qualified or all unqualified elements, but both of these are disallowed by XML Schema. This is much less of a problem for SOAP 1.2 and its various drafts, which have official published schemas.

Our schema validator attempts to strictly implement the XML Schema 1.0 specification. We run into occasional problems with tools that claim to support schema validation but do so less strictly than our validator. In most cases it is straightforward to explain what the problem is and either give chapter-and-verse from the schema specification or explain how the schema in question violates the Unique Particle Attribution Constraint.

The single largest problem we encounter in schema validation is a mismatch in namespaces: the instance document's primary XML namespace and the targetNamespace of the schema disagree. There is little the XML Schema specification could do to resolve this problem.
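
A mismatch of this kind is easy to detect before validation starts. The following Python sketch, using only the standard library, compares the namespace of the instance document's root element with the schema's targetNamespace. The function names and example URIs are hypothetical, and the check is a diagnostic aid rather than part of schema validation itself.

```python
import xml.etree.ElementTree as ET


def root_namespace(instance_text):
    """Namespace URI of the root element, or None if it is unqualified."""
    tag = ET.fromstring(instance_text).tag  # ElementTree uses "{uri}local"
    return tag[1:].split("}", 1)[0] if tag.startswith("{") else None


def target_namespace(schema_text):
    """The targetNamespace attribute of the schema document, or None."""
    return ET.fromstring(schema_text).get("targetNamespace")


def namespaces_agree(instance_text, schema_text):
    return root_namespace(instance_text) == target_namespace(schema_text)


SCHEMA = ('<xs:schema xmlns:xs="http://www.w3.org/2001/XMLSchema" '
          'targetNamespace="urn:example:orders"/>')
GOOD = '<order xmlns="urn:example:orders"/>'
BAD = '<order xmlns="urn:example:order"/>'  # one character off
```

Here namespaces_agree(GOOD, SCHEMA) is true and namespaces_agree(BAD, SCHEMA) is false, which gives a more direct error message than a report that the root element is not declared.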

What features do you most miss in XML Schema 1.0? Would you wish them to be added to XML Schema 1.1?

The output of a validator in an XML context is unspecified. Can the result of XML Schema 1.0 validation be used as the input to an XSLT 1.0 transformation? If so, does the transformation see the [schema normalized value] or the original value for information items where the two string representations differ? If the input of validation includes an empty element for an element with a default value, does the output contain the [schema default] or nothing? A concrete answer to this would be useful for XML Schema 1.1.
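
To make the normalized-value question concrete, here is a small Python sketch of the whiteSpace facet as XML Schema 1.0 Part 2 defines it (preserve, replace, collapse). The function name is hypothetical. The point is only that the original string and the [schema normalized value] can differ, so a downstream consumer has to know which of the two it is given.

```python
import re


def schema_normalized_value(value, white_space):
    """Apply the XML Schema whiteSpace facet to a lexical value."""
    if white_space == "preserve":
        return value
    # "replace": each tab, line feed, and carriage return becomes a space.
    replaced = re.sub(r"[\t\n\r]", " ", value)
    if white_space == "replace":
        return replaced
    if white_space == "collapse":
        # "collapse": after "replace", runs of spaces become one space and
        # leading and trailing spaces are removed.
        return re.sub(r" +", " ", replaced).strip(" ")
    raise ValueError(f"unknown whiteSpace value {white_space!r}")
```

An element of type xs:integer (whiteSpace collapse) whose content is "  42" followed by a newline has the normalized value "42"; whether a later transformation sees three characters of padding or none is exactly what the specification leaves open.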

It would be useful if the specification included reverse cross-referencing. What uses, for example, "Schema Component Constraint: Effective Total Range (all and sequence)"? The list of result codes in Appendix C should also reference a section of the specification using more than HTML hyperlinks.

It seems odd that the various date-time simple types aren't more directly related in the schema type system: they all reference durations in time on either the zoned or unzoned timeline, so they should be comparable. It also seems odd that the only useful way to restrict a date to, for example, "any date in March" is via an xs:pattern regular expression. A similar statement could be made about xs:hexBinary and xs:base64Binary. For our use, this is not particularly vital.

What features of XML Schema 1.0 have caused you the most puzzlement and/or frustration?

It took several weeks for me to figure out how to read the schema specification at all. At this point the structure of the document makes sense to me, but only because I've spent enough time immersed in it to be able to find the subsubsection with the information I need.

The first bump in the learning curve comes from the separation of the schema objects from the schema concrete syntax or infoset. The schema specification reads as though it's expected that the principal use will be from machine-generated schemas, with the XML representation being only a secondary form that needs to be transformed into the relevant schema components. For interoperability, though, most schemas are in fact exchanged as machine-neutral XML documents.

Many of the simple types have side effects or dependencies beyond the string itself. Validating a string of type xs:ID modifies global state, namely the set of strings used in the document as ID values. Validating a string of type xs:QName depends on the namespace bindings in scope; producing a default value of type xs:QName when the namespace URI of the QName has no in-scope bindings to it is problematic.
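
A minimal Python sketch of these two dependencies, with hypothetical class and function names, and with lexical checks such as NCName syntax left out as a simplifying assumption: xs:ID checking reads and writes per-document state, and xs:QName checking needs the prefix bindings in scope at the element being validated.

```python
class DocumentState:
    """Per-document state that xs:ID and xs:IDREF validation reads and writes."""

    def __init__(self):
        self.ids = set()
        self.idrefs = set()

    def check_id(self, value):
        if value in self.ids:
            return False  # duplicate ID within the document
        self.ids.add(value)  # side effect: later checks depend on this
        return True

    def note_idref(self, value):
        self.idrefs.add(value)  # may refer to an ID that appears later

    def dangling_idrefs(self):
        """Only meaningful once the whole document has been seen."""
        return self.idrefs - self.ids


def resolve_qname(value, in_scope_ns):
    """Return (namespace, local name), or None if the value cannot be resolved."""
    prefix, sep, local = value.rpartition(":")
    if not local or (sep and not prefix):
        return None  # empty local part or empty prefix
    if prefix and prefix not in in_scope_ns:
        return None  # valid only where the prefix is bound
    return (in_scope_ns.get(prefix), local)  # "" is the default namespace
```

The same string gives different answers in different places: resolve_qname("po:item", {"po": "urn:example:po"}) succeeds, while resolve_qname("po:item", {}) fails. Likewise the second check_id("a1") in a document fails only because of the first. The set of IDs also has to be kept until the end of the document, which is the reason xs:ID and xs:IDREF limit how much of a document can be validated in a purely streaming way.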

The interaction of comments and simple element values, particularly in a streaming context, is problematic. Such instance documents are fortunately unusual. However, determining whether this is a valid instance of a type "integer between 10 and 99":

<a>1<!-- comment -->7</a>

involves examining multiple nodes in a tree- or event-based representation. Similar problems would appear with XML processing instructions.
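
A short Python sketch using the standard xml.sax module shows the buffering this implies for an event-based validator. The handler class and helper function are hypothetical and handle only one element with simple content. Character data is collected across events, the comment is simply never reported to a ContentHandler, and the range check can only run when the end tag arrives.

```python
import re
import xml.sax


class RangeCheckHandler(xml.sax.ContentHandler):
    """Check that one element's text is an integer within [low, high]."""

    def __init__(self, element, low, high):
        super().__init__()
        self.element, self.low, self.high = element, low, high
        self.chunks = None  # None while outside the element of interest
        self.result = None

    def startElement(self, name, attrs):
        if name == self.element:
            self.chunks = []

    def characters(self, content):
        # Text may arrive in several pieces, for example around a comment.
        if self.chunks is not None:
            self.chunks.append(content)

    def endElement(self, name):
        if name == self.element and self.chunks is not None:
            text = "".join(self.chunks).strip(" \t\r\n")
            is_integer = re.fullmatch(r"[+-]?[0-9]+", text) is not None
            self.result = is_integer and self.low <= int(text) <= self.high
            self.chunks = None


def is_valid(xml_bytes, element, low, high):
    handler = RangeCheckHandler(element, low, high)
    xml.sax.parseString(xml_bytes, handler)
    return handler.result
```

With this sketch, is_valid(b"<a>1<!-- comment -->7</a>", "a", 10, 99) is true, because the pieces "1" and "7" are joined into "17" before the check. Neither piece is a valid value when examined alone.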

Have you used the XML Schema test suite? Other test suites or tests? Developed your own tests, testing framework, test harness?

We make extensive use of the XML Schema test suite: in particular, it is a component of our automated testing of the schema validator. Our experience has been that the test suite is dated and inaccurate. To pick a specific test, the msxsdtest attO025 test in the January 16, 2002 release of the test suite uses one fixed value in an attribute declaration and a different fixed value in its use; this is incorrect as per au-props-correct 2, but the "expected" result is that the schema is valid. The suntest xsd003b.xsd schema references an undefined type, xs:number, which only existed in the March 16, 2001 draft of XML Schema. A large number of the schema tests seem to check the validity of the id attribute in a schema. There are many positive tests for correctness of simple-type checking but few negative tests.

We also have a collection of industry schemas and customer-provided schemas that we do testing against. These include the ACORD, STAR, and FpML schemas mentioned earlier, among others. Internally-developed test cases include unit tests for xsi:type, the xs:ID and xs:QName types, checking of unique particle attribution, and dynamic schema loading using xsi:schemaLocation.