Heidi Buelow and Allen Brookes, Rogue Wave Software
May 25, 2005
At Rogue Wave Software our experience with XML Schema is mainly through our XML data binding product, XML Object Link, which is part of our web services product LEIF. XML Object Link has similarities to products such as JAXB and Castor but, since we generate C++ code rather than Java, we may have some unique experiences.
Our customers have found this product very useful in manipulating XML data programmatically and in most cases customer schemas use a subset of XML Schema that we have no problems supporting. The main issues that we have found come from challenges in implementing certain schema features, from dealing with invalid schemas, from versioning, and from C++ compiler limitations.
There are many features in XML Schema that have no equivalent in C++ or have no simple equivalent. These include choice, complexType restriction, and mixed content, among others. Choice requires that we provide storage for all possible choice types and provide an indication of which is the current choice. Since C++ has neither a singly rooted object hierarchy nor reflection, this results in excessive storage and an unnatural interface.
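A minimal sketch of the storage and interface cost described above, assuming a hypothetical choice between an xsd:int element "id" and an xsd:string element "name". The class name ContactKey and all of its members are invented for illustration and are not XML Object Link output.

```cpp
#include <cassert>
#include <string>

// Hypothetical schema fragment being modeled:
//   <xsd:choice>
//     <xsd:element name="id"   type="xsd:int"/>
//     <xsd:element name="name" type="xsd:string"/>
//   </xsd:choice>
class ContactKey {
public:
    // Indicator for which branch of the choice is currently set.
    enum Choice { NONE, ID, NAME };

    ContactKey() : current_(NONE), id_(0) {}

    Choice currentChoice() const { return current_; }

    // Setting one branch resets the other and updates the indicator.
    void setId(int v) { id_ = v; name_.clear(); current_ = ID; }
    void setName(const std::string& v) { name_ = v; id_ = 0; current_ = NAME; }

    // Callers must check currentChoice() first; the asserts guard misuse.
    int getId() const { assert(current_ == ID); return id_; }
    const std::string& getName() const { assert(current_ == NAME); return name_; }

private:
    Choice current_;
    int id_;            // storage is held even when NAME is selected
    std::string name_;  // storage is held even when ID is selected
};
```

Every branch has its own member, so the object grows with the number of alternatives, and callers have to test the indicator before calling an accessor.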
ComplexType restriction also does not have a natural implementation. Restriction provides for substitutability, which in C++ is best gained using inheritance. If we use inheritance to model restriction, however, the data type used to represent a piece of XML Schema content can change between the parent type and the child type. Restricting xsd:any, for example, has this problem.
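The mismatch can be sketched as follows, assuming a hypothetical base type whose content is an xsd:any wildcard and a restriction that narrows the wildcard to a single xsd:int element "count". The names Envelope and CountEnvelope are invented for illustration.

```cpp
#include <string>
#include <vector>

// Base type: the content is a wildcard, so the only representation
// that fits every document is a list of unparsed element strings.
class Envelope {
public:
    virtual ~Envelope() {}
    void addAny(const std::string& rawXml) { any_.push_back(rawXml); }
    const std::vector<std::string>& getAny() const { return any_; }
private:
    std::vector<std::string> any_;
};

// Restricted type: the wildcard becomes one <count> element of xsd:int.
// The natural representation is now an int, but inheritance still
// exposes the base class's string-based storage and accessors.
class CountEnvelope : public Envelope {
public:
    CountEnvelope() : count_(0) {}
    void setCount(int c) { count_ = c; }
    int getCount() const { return count_; }
private:
    int count_;
};
```

A CountEnvelope can be passed wherever an Envelope is expected, which gives the substitutability that restriction promises, but it carries two representations of the same content and still accepts addAny calls that the restricted schema type would forbid.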
Mixed content is not particularly hard to implement, but providing a natural interface for accessing and maintaining the string content is quite difficult.
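One possible shape for such an interface, sketched under the assumption of a hypothetical mixed-content type with two child elements, "who" and "when". The class Note and its slot-based text accessors are invented for illustration and are only one of several possible designs.

```cpp
#include <cstddef>
#include <string>
#include <vector>

// Hypothetical instance being modeled:
//   <note>Call <who>Bob</who> at <when>noon</when>.</note>
class Note {
public:
    void setWho(const std::string& v) { who_ = v; }
    void setWhen(const std::string& v) { when_ = v; }
    const std::string& getWho() const { return who_; }
    const std::string& getWhen() const { return when_; }

    // Text is addressed by position: slot 0 precedes <who>, slot 1 sits
    // between <who> and <when>, and slot 2 follows <when>.
    void setText(std::size_t slot, const std::string& t) {
        if (text_.size() <= slot) text_.resize(slot + 1);
        text_[slot] = t;
    }
    std::string textAt(std::size_t slot) const {
        return slot < text_.size() ? text_[slot] : std::string();
    }

    // Concatenates text and element values in document order.
    std::string flatten() const {
        return textAt(0) + who_ + textAt(1) + when_ + textAt(2);
    }

private:
    std::string who_;
    std::string when_;
    std::vector<std::string> text_;
};
```

The element accessors read naturally, but the caller has to know which numbered slot a piece of text belongs to, and the numbering becomes harder to explain once child elements are optional or repeated.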
Due to resource limitations and the focus of the product we have chosen to not implement all of XML Schema, at least not all at once. This has caused a number of problems since customers expect their schemas to work, even if the documentation clearly says that the feature they want has not yet been implemented.
One incident involved the use of mixed content. It has been our position that, since we are focusing on XML for representing data and not text documents, mixed content is not needed. We had a potential customer say they needed mixed content. Examination of their schema indicated that, even though it used mixed content, a simple change to the schema would have allowed all the same documents without using mixed content. Unfortunately, changing the customer’s schema was not an option.
The schema was generated by a tool that, for some reason, chose a representation that was not appropriate. This is a reasonably common occurrence. Even if we did support all of schema this would still present a problem since these inappropriate choices can have a significant effect on performance.
Many customer issues come from schemas that are not valid. In almost all cases this is the result of a schema generated by a tool. The expectation from customers is that we will support such cases even though it can be shown that their schemas are, in fact, not valid. This has forced us to add support for common mistakes.
Common examples are ambiguous schemas (generally due to wildcards), circular includes of schemas, illegal restriction, and problems with target namespaces.
Other customers want enhancements to schema to be supported. In many cases this has to do with customers wanting to reduce the size of documents, either by not requiring or by not serializing items that bloat the documents. These include such things as xsi:type, xsi:nil, and namespace declarations.
Our original approach to versioning matched what we saw as the XML Schema approach to versioning. That is, for new versions, we use a new namespace. Our product will create a new library with a new set of symbols in a new C++ namespace. This requires customers to update software to deal with new versions. Our customers tell us they want something less heavy-handed, in which the addition of new elements and attributes does not break existing software.
This seems to go against the fundamental design of schema since schema creates a closed content model. Our product follows this model and breaks if additional elements are present.
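The difference between the closed model and what customers ask for can be sketched as a policy for child elements that the generated code does not recognize. The function acceptChildren, the UnknownPolicy enum, and the element names in the usage note are hypothetical and do not describe the interface of our product.

```cpp
#include <set>
#include <stdexcept>
#include <string>
#include <vector>

// How to treat a child element that the known content model does not list.
enum UnknownPolicy { REJECT_UNKNOWN, SKIP_UNKNOWN };

// Returns the child element names that are accepted for processing.
// REJECT_UNKNOWN mirrors a closed content model: any unlisted element
// is an error. SKIP_UNKNOWN ignores unlisted elements instead.
std::vector<std::string> acceptChildren(const std::vector<std::string>& children,
                                        const std::set<std::string>& known,
                                        UnknownPolicy policy) {
    std::vector<std::string> accepted;
    for (std::size_t i = 0; i < children.size(); ++i) {
        if (known.count(children[i]) != 0) {
            accepted.push_back(children[i]);
        } else if (policy == REJECT_UNKNOWN) {
            throw std::runtime_error("unexpected element: " + children[i]);
        }
        // SKIP_UNKNOWN: fall through and drop the element silently.
    }
    return accepted;
}
```

With a known set of "name" and "price", a newer document that adds a "discount" element fails under REJECT_UNKNOWN and is read as two elements under SKIP_UNKNOWN. Skipping makes older software tolerant of additions, at the cost of no longer detecting documents that are invalid against the schema it was built from.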
Many C++ compilers have hard upper limits for the number of symbols that can be processed in a single compilation unit or stored in a single shared library. We have encountered a number of customer schemas that cause our product to generate so many symbols that the compilers fail. An obvious solution is to break the compilation up into smaller chunks but schema does not lend itself easily to this.
There are a number of different conclusions that one might draw from our experiences. One conclusion would be that we would benefit from a profile of schema that removes those features that are outside the domain of using XML to represent data and those features that have no equivalent in object-oriented programming languages. For such a profile to really work, tools that generate schemas, and businesses that create their own schemas, must adhere to the profile. Given that there are tools that do not even adhere to the XML Schema spec, this may be too much to hope for.
Another conclusion is that we need to implement all of XML Schema and all common misuses in order to satisfy our customer base. This conclusion is more realistic but, as we have implemented more and more of the edge cases, we find that our data model gets stretched to the breaking point.