Teaching and Modeling with XML Schema

1. INTRODUCTION

This report is based both on the experience of teaching[6] XML Schema in the graduate program at UC Berkeley's School of Information Management and Systems (SIMS)[7] and on the author's use of XML for modeling documents for Mathematics. The Center for Document Engineering[8] teaches graduate students in the SIMS program very detailed knowledge of XML technologies and their applications. Each year, the students complete master's projects, many of which use XML heavily. Typically, modeling XML documents is at the core of these projects, and so XML Schema has been used quite extensively.

XML Schema is taught at multiple levels within the XML-related courses at SIMS and we've noticed consistent stumbling blocks that the students struggle with as they develop their understanding. Some of these are just adjusting to the process of modeling and typing XML while learning a new syntax. On the other hand, some of these problems relate to areas in which XML Schema could use improvement.

Also, the author has been using XML Schema in several projects relating to encoding of Mathematical documents [1] and computations [2]. These Mathematics applications make heavy use of XML and all have XML Schemata associated with their data. Many of the type systems in these applications make heavy use of substitution groups successfully.

This report attempts to enumerate the issues with teaching, using, and modeling with XML Schema from the perspectives of both the new and the experienced user.

2. RAINBOWS OF NAMESPACES

One of the hardest things to explain in XML Schema is what namespace is associated with an element, its contained content, and its attributes given a type definition. The undecorated name in the element syntax that names a declaration or definition is typically misleading to the author of the schema. Only after users have become experienced in reading the XML Schema syntax do they understand the use of QNames and NCNames in declarations or definitions.

While some of this can be explained as just getting used to a new syntax, there is one critical point that should be made. When a definition of a type is made, its local declarations take the namespace of the containing schema document (e.g. the targetNamespace on the [xs:]schema ancestor). This means you need to make a "careful dance" if you want to have types in their own namespaces.

People who come from programming languages like Java, where everything is placed in a package, will naturally gravitate towards making type libraries where the target namespace is different for the different type libraries. Unfortunately, those same people will gravitate towards making local element declarations as well. When those elements are qualified, they've just associated an element with the type's namespace and not necessarily with the namespace they expect to use in their document. Even worse, when the elements aren't qualified, they have no namespace at all.

An example might clarify the problem. Suppose we create a type library with a base type as follows:

<xs:schema xmlns:xs="http://www.w3.org/2001/XMLSchema"
   targetNamespace="http://www.example.com/schemata/people/2005" 
   elementFormDefault="qualified">

<xs:complexType name="Person">
  <xs:sequence>
    <xs:element name="name" type="xs:string"/>
  </xs:sequence>
</xs:complexType>

</xs:schema>

We then create another schema document which uses that library and extends it:

<xs:schema xmlns:xs="http://www.w3.org/2001/XMLSchema"
   xmlns:base="http://www.example.com/schemata/people/2005" 
   targetNamespace="http://www.example.com/schemata/people/extended/2005" 
   elementFormDefault="qualified">

<xs:import namespace="http://www.example.com/schemata/people/2005"/>

<xs:complexType name="PersonWithNickname">
  <xs:complexContent>
    <xs:extension base="base:Person">
      <xs:sequence>
        <xs:element name="nickname" type="xs:string"/>
      </xs:sequence>
    </xs:extension>
  </xs:complexContent>
</xs:complexType>

</xs:schema>

Finally, we author the schema for the document that uses this type library:

<xs:schema xmlns:xs="http://www.w3.org/2001/XMLSchema"
   xmlns:people="http://www.example.com/schemata/people/extended/2005" 
   targetNamespace="http://www.example.com/schemata/people-list/2005" 
   elementFormDefault="qualified">

<xs:import namespace="http://www.example.com/schemata/people/extended/2005"/>

<xs:element name="person-list">
  <xs:complexType>
    <xs:sequence>
      <xs:element name="person" type="people:PersonWithNickname" maxOccurs="unbounded"/>
    </xs:sequence>
  </xs:complexType>
</xs:element>

</xs:schema>

The confusing part is what the instance looks like. The root element and its contained 'person' element have the namespace of the document. Unfortunately, the 'name' and 'nickname' elements are forced to use the namespaces of the type libraries in which they were declared. As a result, we get three namespaces in the observed document:

<d:person-list xmlns:d="http://www.example.com/schemata/people-list/2005"
               xmlns:e="http://www.example.com/schemata/people/extended/2005"
               xmlns:p="http://www.example.com/schemata/people/2005">
   <d:person>
      <p:name>R. Alexander Milowski</p:name>
      <e:nickname>Alex</e:nickname>
   </d:person>
</d:person-list>

This "rainbow of namespaces" is confusing. In fact, it typically is not what the schema author intended. Authors then need to completely reorganize their type definitions to get one namespace, but there is no way to have type definitions in their own namespace and have all the elements in one document namespace.
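For comparison, here is a sketch of the instance for the unqualified case mentioned earlier. It assumes the two type libraries omit elementFormDefault="qualified" (so the default, unqualified, applies) while the document schema stays as written above:

```xml
<d:person-list xmlns:d="http://www.example.com/schemata/people-list/2005">
   <d:person>
      <!-- local elements from the type libraries are now in no namespace -->
      <name>R. Alexander Milowski</name>
      <nickname>Alex</nickname>
   </d:person>
</d:person-list>
```

The rainbow is gone, but the document now mixes namespaced and non-namespaced elements, which is rarely what the author wanted either.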

What is really wanted here is to have "structural types" for which the local element declarations take on some enclosing namespace. This can be approximated by changing the imports to includes. The problem is that if two schemata with different namespaces include these definitions, a processor cannot tell that they are the same definition by the type name. Even though the definition is exactly the same and the structures are isomorphic, the type names are different.
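A minimal sketch of the include-based approximation follows. It assumes the base library gives up its targetNamespace and is stored in a file named people-types.xsd; both the file name and this arrangement are illustrative assumptions, not part of the schemas above.

```xml
<!-- people-types.xsd (hypothetical file name): no targetNamespace, so its
     components take on the namespace of whichever schema includes them -->
<xs:schema xmlns:xs="http://www.w3.org/2001/XMLSchema"
   elementFormDefault="qualified">

<xs:complexType name="Person">
  <xs:sequence>
    <xs:element name="name" type="xs:string"/>
  </xs:sequence>
</xs:complexType>

</xs:schema>
```

```xml
<xs:schema xmlns:xs="http://www.w3.org/2001/XMLSchema"
   xmlns:d="http://www.example.com/schemata/people-list/2005"
   targetNamespace="http://www.example.com/schemata/people-list/2005"
   elementFormDefault="qualified">

<!-- include rather than import: Person is now named in the document namespace -->
<xs:include schemaLocation="people-types.xsd"/>

<xs:element name="person-list">
  <xs:complexType>
    <xs:sequence>
      <xs:element name="person" type="d:Person" maxOccurs="unbounded"/>
    </xs:sequence>
  </xs:complexType>
</xs:element>

</xs:schema>
```

With this arrangement the 'name' element lands in the document namespace, but the type is now d:Person. A second schema that includes the same file gets its own, differently named Person.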

This is a real problem for usability in that people will often develop type libraries and assign them namespaces. They might even go further and assign different namespaces within their library. They will not necessarily realize what they have done to the instance until later.

My recommendation to schema authors is to start with the instance, understand what elements and attributes you want to appear, and decide what namespaces they should use. Then make your schema conform to that. That is, don't do anything that would force strange uses of namespaces. But this may force schema authors to create types in a way that they might not want to: they may have to give up typing facilities because those facilities would cause extra namespaces in the observed document.

The rationale for this recommendation is that the XML document is what programs process and what people author. The more namespaces you have, the more difficult the document becomes to produce or process, and so more errors may be introduced.

3. CONFIGURATION AND DEPLOYMENT

One of the most difficult challenges in using XML Schema validation has been configuring the different tools to find the correct schema documents. Although simple applications can use xsi:schemaLocation attributes, this does not work for interchange situations (e.g. web services, content management systems, etc.) where the authored xsi:schemaLocation attribute is usually incorrect.

A partial solution to this is to use OASIS XML Catalogs[5] for schema location. This has been implemented in Xerces[3] and NetBeans[4] as well as other products. This works quite well except that:

There is no way to map the "no namespace" to a schema document.
Strange things happen to URNs (e.g. they translate to ISO public identifiers).
It isn't clear that every declaration in a catalog relates to a schema.
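As a rough illustration of the catalog approach, the sketch below maps namespace names and one authored location hint to local schema documents. The schemas/ directory and the systemId value are hypothetical, and whether a given tool consults 'uri' entries, 'system' entries, or both when locating a schema is tool-specific.

```xml
<catalog xmlns="urn:oasis:names:tc:entity:xmlns:xml:catalog">
  <!-- map a namespace name to a local schema document -->
  <uri name="http://www.mathdoc.org/schema/paper/2005/us" uri="schemas/paper.xsd"/>
  <uri name="http://www.mathdoc.org/schema/thesis/2005/us" uri="schemas/thesis.xsd"/>
  <!-- map an authored schema location (hypothetical URL) to the same local copy -->
  <system systemId="http://www.mathdoc.org/schema/paper.xsd" uri="schemas/paper.xsd"/>
</catalog>
```

Nothing in these entries says that the target is a schema rather than some other resource, and there is no obvious value of 'name' that stands for the absence of a namespace.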

Further, there is a real need to be able to package a set of schema documents and identify their namespaces so that an application knows what types, elements, attributes, etc. should be available for a particular domain. I call this a "universe". That is, a set of self-consistent definitions and declarations that represent the "known world".

This packaging of a set of schema documents could be as simple as an enumeration of namespace-to-document mappings, in a form that could also accommodate the "no namespace" schema:

<schema-set name="mathdoc">
 <schema namespace='http://www.mathdoc.org/schema/paper/2005/us' src='paper.xsd'/>
 <schema namespace='http://www.mathdoc.org/schema/thesis/2005/us' src='thesis.xsd'/>
 <schema namespace='http://www.mathdoc.org/schema/problemset/2005/us' src='problemset.xsd'/>
 <schema namespace='http://www.mathdoc.org/schema/solutionset/2005/us' src='solutionset.xsd'/>
 <schema namespace='http://www.mathdoc.org/schema/slides/2005/us' src='slides.xsd'/>
 <schema namespace='http://www.mathdoc.org/schema/syllabus/2005/us' src='syllabus.xsd'/>
 <schema namespace='http://www.mathdoc.org/schema/review/2005/us' src='review.xsd'/>
</schema-set>

Optionally, it would be really nice for applications to identify the root elements that should be allowed:

<schema-set name="mathdoc"
   xmlns:p='http://www.mathdoc.org/schema/paper/2005/us'
   xmlns:t='http://www.mathdoc.org/schema/thesis/2005/us'
>
   <root name="p:paper"/>
   <root name="t:thesis"/>
...
</schema-set>

4. SUBSTITUTION GROUPS SHOULD BE FIRST-CLASS DEFINITIONS

Substitution groups are fantastic tools but they are limited in a number of ways:

There is no declaration of the substitution group. Schema tools must search for the occurrence of the substitutionGroup attribute.
There is no way for an element to belong to more than one substitution group.
There is no way to refer to a substitution group within a content model and make exceptions for validation (e.g. the substitution group xhtml:inline except xhtml:b or xhtml:i).

One solution to this is to make substitution groups their own definition and then allow the substitutionGroup attribute to be a list of QName values.

Substitution groups are very important for type-based modeling of XML documents. They provide functionality similar to generics in programming language type systems, a feature that was recently added to the Java type system. That is, they let you declare a complex type that has a slot whose members must conform to some base type. Subsequently, any member of the substitution group can be placed in that slot without special syntax (i.e. no xsi:type).
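The "slot" idea can be sketched as follows. The namespace, type names, and element names are invented for illustration and are not taken from the Mathematics schemata mentioned earlier.

```xml
<xs:schema xmlns:xs="http://www.w3.org/2001/XMLSchema"
   xmlns:m="http://www.example.com/schemata/math/2005"
   targetNamespace="http://www.example.com/schemata/math/2005"
   elementFormDefault="qualified">

<!-- head of the group: abstract, so only its members appear in instances -->
<xs:complexType name="Expression" abstract="true"/>
<xs:element name="expression" type="m:Expression" abstract="true"/>

<!-- a member: its type derives from the head's type -->
<xs:complexType name="Number">
  <xs:complexContent>
    <xs:extension base="m:Expression">
      <xs:attribute name="value" type="xs:decimal" use="required"/>
    </xs:extension>
  </xs:complexContent>
</xs:complexType>
<xs:element name="number" type="m:Number" substitutionGroup="m:expression"/>

<!-- another member whose content has a slot for any member of the group -->
<xs:complexType name="Sum">
  <xs:complexContent>
    <xs:extension base="m:Expression">
      <xs:sequence>
        <xs:element ref="m:expression" minOccurs="2" maxOccurs="unbounded"/>
      </xs:sequence>
    </xs:extension>
  </xs:complexContent>
</xs:complexType>
<xs:element name="sum" type="m:Sum" substitutionGroup="m:expression"/>

</xs:schema>
```

An instance can then nest m:sum and m:number elements freely without xsi:type. Note that the group itself is visible only through the substitutionGroup attributes scattered across the member declarations.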

The main problem is that because they aren't defined with their own definition, they are somewhat obscure to the new or experienced user. In addition, you can't make something part of a new substitution group without changing the original definition. This limits re-purposing of schema declarations and severely limits substitution groups.

5. MODELING CONTROL

5.1. A Rational Built-in Simple Type

The fact that XML Schema doesn't allow rational numbers to be typed is almost like leaving out the number zero. Decimal numbers are wholly insufficient for representation of exact numbers.

Let's not make comparisons to programming languages like Java. They too forgot rational numbers. Fortunately, as Java is a programming language, you can add rational numbers into the language as a class. You cannot add rational numbers into XML Schema as a primitive datatype.

There are plenty of situations where you need rational numbers. Because they aren't part of the XML Schema simple type system, they can't be sorted or compared in languages like XSLT or XQuery. You must first convert them to a floating point approximation to get them to sort correctly.
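The closest approximation available today is a pattern over the lexical form, as in this sketch (the type name, element name, and namespace are illustrative assumptions):

```xml
<xs:schema xmlns:xs="http://www.w3.org/2001/XMLSchema"
   xmlns:m="http://www.example.com/schemata/math/2005"
   targetNamespace="http://www.example.com/schemata/math/2005"
   elementFormDefault="qualified">

<!-- accepts lexical forms such as "3/4", "-7/2", or "5" -->
<xs:simpleType name="RationalText">
  <xs:restriction base="xs:token">
    <xs:pattern value="[+\-]?[0-9]+(/[1-9][0-9]*)?"/>
  </xs:restriction>
</xs:simpleType>

<xs:element name="ratio" type="m:RationalText"/>

</xs:schema>
```

This constrains only how the value is written. The value space is still that of strings, so "1/2" and "2/4" are unequal and any ordering is textual rather than numeric.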

5.2. Restricting Simple Content Complex Types

If you extend a simple type to add an attribute there is no way to further restrict the simple typed content. It would be very nice to be able to restrict the simple type content of a complex type whose content is a simple type restriction.

5.3. Wildcard Control

Wildcard control is insufficient. The following cases really need to be covered:

Anything but a list of excluded namespaces.
Exclusions of certain elements from a particular namespace.
Pattern matching on a namespace name (e.g. anything from my domain).
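For reference, the sketch below shows the kind of control that wildcards do offer: everything outside the target namespace, or an explicit list of allowed namespaces. The target namespace and type names are illustrative.

```xml
<xs:schema xmlns:xs="http://www.w3.org/2001/XMLSchema"
   targetNamespace="http://www.example.com/schemata/notes/2005"
   elementFormDefault="qualified">

<xs:complexType name="Extensible">
  <xs:sequence>
    <!-- any element from a namespace other than the target namespace -->
    <xs:any namespace="##other" processContents="lax"
            minOccurs="0" maxOccurs="unbounded"/>
  </xs:sequence>
</xs:complexType>

<xs:complexType name="XHTMLOnly">
  <xs:sequence>
    <!-- any element from an explicit list of allowed namespaces -->
    <xs:any namespace="http://www.w3.org/1999/xhtml" processContents="lax"
            minOccurs="0" maxOccurs="unbounded"/>
  </xs:sequence>
</xs:complexType>

</xs:schema>
```

The namespace attribute accepts only the special tokens and a list of namespace names to allow. There is no list of names to exclude, no per-element exclusion, and no pattern.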

5.4. Elements at the Top Level of Content Models

The restriction that disallows 'element' or 'any' at the top level of a content model seems arbitrary. If there is a technical reason, then it could easily be solved by saying that an element at the top level of a content model is automatically wrapped in a sequence. Writing a bare element there is a common error among both new and experienced users.
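A small sketch of the error and its fix, reusing the Person type from section 2:

```xml
<xs:schema xmlns:xs="http://www.w3.org/2001/XMLSchema"
   targetNamespace="http://www.example.com/schemata/people/2005"
   elementFormDefault="qualified">

<!-- What users tend to write first. A schema processor rejects it because
     xs:element may not be a direct child of xs:complexType:

     <xs:complexType name="Person">
       <xs:element name="name" type="xs:string"/>
     </xs:complexType>
-->

<!-- The accepted form wraps the single element in a sequence -->
<xs:complexType name="Person">
  <xs:sequence>
    <xs:element name="name" type="xs:string"/>
  </xs:sequence>
</xs:complexType>

</xs:schema>
```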

5.5. Attributes Dictating Content Models

There are many situations (e.g. ATOM) where an attribute value dictates the content model of an element. That is, for the particular value of an attribute, only specific content may appear. It would be nice to map these attribute values to different complex types rather than force the modeler to "union" the types and then have application-level validation of the structure of the element.

An example of this is the 'content' element from ATOM[9]. The 'type' attribute controls what content can be contained according to some rules. The following is a short list of some of those rules:

When type='text', the content must be a string.
When type='xhtml', the content must be an XHTML div element.
When type='text/xml' the content must be well-formed XML.

A schema author could easily write a type for each of those rules:

When type='text':

<xs:complexType>
   <xs:simpleContent>
      <xs:extension base="xs:string">
         <xs:attribute name="type" type="xs:string" fixed="text"/>
      </xs:extension>
   </xs:simpleContent>
</xs:complexType>

When type='xhtml', the content must be an XHTML div element.

<xs:complexType>
   <xs:sequence>
      <xs:element ref="xhtml:div"/>
   </xs:sequence>
   <xs:attribute name="type" type="xs:string" fixed="xhtml"/>
</xs:complexType>

When type='text/xml' the content must be well-formed XML.

<xs:complexType>
   <xs:sequence>
      <xs:any namespace="##other" maxOccurs="unbounded"/>
   </xs:sequence>
   <xs:attribute name="type" type="xs:string" fixed="text/xml"/>
</xs:complexType>

And then validation should choose between these types based on the attribute value of 'type'.
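Without such a mechanism, the three types have to be merged into one permissive "union" type, roughly as in the sketch below. The target namespace is a placeholder rather than the real ATOM namespace, and the declaration is an illustration, not the ATOM schema.

```xml
<xs:schema xmlns:xs="http://www.w3.org/2001/XMLSchema"
   targetNamespace="http://www.example.com/schemata/feed/2005"
   elementFormDefault="qualified">

<xs:element name="content">
  <!-- mixed content plus a wildcard: permissive enough for all three cases -->
  <xs:complexType mixed="true">
    <xs:sequence>
      <xs:any namespace="##other" processContents="lax"
              minOccurs="0" maxOccurs="unbounded"/>
    </xs:sequence>
    <xs:attribute name="type" use="required">
      <xs:simpleType>
        <xs:restriction base="xs:string">
          <xs:enumeration value="text"/>
          <xs:enumeration value="xhtml"/>
          <xs:enumeration value="text/xml"/>
        </xs:restriction>
      </xs:simpleType>
    </xs:attribute>
  </xs:complexType>
</xs:element>

</xs:schema>
```

This accepts, for example, type='text' with element content, so the rule tying the attribute value to the content has to be enforced by the application.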

5.6. "Some of These" Models

There is a real need to have a variant of 'all' where:

Everything is optional.
But something must occur.

That is, all the elements can occur in any order, they are all optional, but the containing element cannot be empty.
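The gap can be seen in a sketch with just two elements (the type and element names are illustrative). The 'all' version permits an empty element, and the version that forbids it has to enumerate the orderings by hand:

```xml
<xs:schema xmlns:xs="http://www.w3.org/2001/XMLSchema"
   targetNamespace="http://www.example.com/schemata/contact/2005"
   elementFormDefault="qualified">

<!-- any order, everything optional, but empty content is also accepted -->
<xs:complexType name="ContactAll">
  <xs:all>
    <xs:element name="email" type="xs:string" minOccurs="0"/>
    <xs:element name="phone" type="xs:string" minOccurs="0"/>
  </xs:all>
</xs:complexType>

<!-- at least one must occur: each ordering is spelled out explicitly -->
<xs:complexType name="ContactSome">
  <xs:choice>
    <xs:sequence>
      <xs:element name="email" type="xs:string"/>
      <xs:element name="phone" type="xs:string" minOccurs="0"/>
    </xs:sequence>
    <xs:sequence>
      <xs:element name="phone" type="xs:string"/>
      <xs:element name="email" type="xs:string" minOccurs="0"/>
    </xs:sequence>
  </xs:choice>
</xs:complexType>

</xs:schema>
```

With two elements this is tolerable. The number of branches grows quickly as elements are added, which is why a built-in variant of 'all' would help.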

6. CONCLUSIONS

At the Center for Document Engineering, we have been successful at both teaching and using XML Schema. As a whole, it is not as bad as many people believe. But XML Schema does have rough edges.

The largest and most difficult part of teaching XML Schema is explaining how namespaces interact with schemata, schema documents, definitions versus declarations, and the instance. Much of the desire to fix the problem described in section 2 comes from wanting to lessen this difficulty.

Also, fixing the configuration and deployment aspects of XML Schema described in section 3 by providing a standard and reasonable alternative to xsi:schemaLocation will help enormously. New users have a very difficult time configuring tools to find their schema documents. They often turn to xsi:schemaLocation to fix this only to put off the real problem of schema location till deployment.

Finally, promoting substitution groups to their own definition will make them far less opaque. They are widely misunderstood, and that is partially due to the fact that they have no explicit definition. Yet, they are very important to type-oriented XML.