Health Level Seven: XML Schema User Experience Report

Paul V. Biron, Kaiser Permanente

Introduction

Health Level Seven (HL7) is an ANSI-accredited SDO (standards developing organization) that develops standards for the exchange of clinical/administrative medical information (we leave the financial/insurance stuff to X12, another SDO).

We use UML as our modeling language and XML as the wire format (one of possibly many) for our messages. We generate schema documents from a representation of our UML models (though not XMI). This document details our experiences in designing our instances and schemas. We don't have an overall conclusion for this report, but we hope that in discussing it with other Workshop participants we can find solutions to some of the problems reported here.

Table of Contents

Aspects We Like
  1. XML Syntax
  2. Separation Of Elements From Their Types
  3. Type Derivation
Problems With Interoperability
  1. Complicated Writing Style
  2. Features Incorrectly (or Not At All) Implemented
Problems With Expressiveness Of The Schema Language
  1. Extend Only At End
  2. Lack Of Co-Occurrence Constraints
  3. Wildcard Deficiencies
  4. Component Identity Vagueness
  5. Limitations On All Groups

Aspects We Like

I'd like to start off this report by stating several aspects of XML Schema that we really appreciate. Please do not interpret the fact that this section is much shorter than the other two sections as indicating that we dislike the current spec more than we like it…it's just natural to go into more detail on the items that one would like to see changed.

1. XML Syntax

The benefit of having an XML syntax for XML Schema documents is incalculable for all of the obvious reasons.

2. Separation Of Elements From Their Types

The benefits of this are also enormous. We await XPath 2.0 reaching Rec status with bated breath!! We believe that type-aware XPath will revolutionize the way people process XML instances. While there have always been DTD-based data binding tools, their results have never been very satisfactory and we believe that data binding tools are almost useful ;-) precisely because of the introduction of the element/type distinction in XML Schema.

3. Type Derivation

Despite several limitations spelled out below, the fact that XML Schema's notion of derivation maps so cleanly to similar concepts in UML and object-oriented programming languages is extremely useful. While it is true that UML and modern OO languages are more comfortable with extension, I think it is false to contend that they don't have the notion of restriction, as is often asserted. In particular, UML's notion of restriction is simply to place constraints on a vacuous specialization of a class. In the case of OO languages, while it is true that it is not possible to define one type as a "restriction" of another (in any OO language that I know), restrictions can be easily handled by a small bit of code in constructor and/or setter methods.
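To make the last point concrete, here is a minimal Python sketch of a "restriction" enforced in a setter. The class names Quantity and NonNegativeQuantity are hypothetical and are not HL7 types; the sketch only illustrates the general idea of narrowing a value space in a subclass.

```python
class Quantity:
    """Hypothetical base type: any numeric value is accepted."""

    def __init__(self, value):
        self.value = value  # routed through the property setter below

    @property
    def value(self):
        return self._value

    @value.setter
    def value(self, new_value):
        self._check(new_value)      # validate first, assign only if valid
        self._value = new_value

    def _check(self, new_value):
        # The base type places no constraint beyond "is a number".
        if not isinstance(new_value, (int, float)):
            raise TypeError("value must be numeric")


class NonNegativeQuantity(Quantity):
    """A 'restriction' of Quantity: same shape, narrower value space."""

    def _check(self, new_value):
        super()._check(new_value)
        if new_value < 0:
            raise ValueError("value must not be negative")
```

Every NonNegativeQuantity is still usable wherever a Quantity is expected, which is the property that derivation by restriction is meant to guarantee.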

As described below, we wish that certain limitations could be lifted in XML Schema so that we could take more advantage of derivation.

Problems With Interoperability

HL7's experience with designing schemas that work across a broad array of tools has been extremely disheartening. We have experienced interoperability problems among applications within three broad classes of tools: 1) validators; 2) schema-aware instance editors; and 3) data binding tools. Finding two different data binding tools that will generate useful code from a specific construct in a schema (e.g., mixed content) can be difficult. To make matters worse, massaging our schemas so that an array of data binding tools is happy often causes validators to barf on the schemas, etc. It has been a nightmare, to say the least.

1. Complicated Writing Style

It is our belief that much of the interoperability problem is the result of the style in which the XML Schema spec is written (this applies equally well to Datatypes as it does to Structures). It is very dense and interconnected. Trying to follow the various internal cross-references can be like trying to escape from a labyrinth. It is true that after a certain amount of time one gets a degree of familiarity with the style and the difficulty subsides.

The problem, of course, is that many implementors don't have the time to spend with the spec that it takes to become that familiar with it. As a result, many implementors misinterpret major portions of the specification, resulting, of course, in non-interoperation of various processors.

And if implementors have a hard time understanding the finer points of the spec, there is often little hope that end users will be able to figure things out. Thus, many users are at the mercy of whatever processor they use to determine "what is right". That is, unlike with many other standards, it is very difficult for the average end user to decide "is this an error in my schema/instance or a bug in my processor?". I can't tell you how often I've seen people do the following: when faced with error messages from their processor, they go into the schema document and almost randomly start changing things until their processor stops complaining…kind of like the way some people debug programs. The problem with this method, of course, is that unlike with your average language compiler, the error messages from the processor are as likely to be the result of bugs in the processor as of problems in the schema/instance :-(

The world would be done a great service if the spec could simply be rewritten so that it is easier to understand.

A related issue is the number of errata that have been issued against the spec. When one processor has implemented a given erratum and another hasn't, there are bound to be interoperability problems. I firmly believe that one of the reasons there have been so many errata is that the spec is so difficult to read that even the WG members were often unsure of what specific sections meant, and that we didn't always realize that changing one piece had an impact on some seemingly unrelated portion of the spec.

2. Features Incorrectly (or Not At All) Implemented

We have come across a wide array of XML Schema features that we want/need to use in the schemas that we generate that are either incorrectly implemented or simply not implemented at all in many different tools that users of our standards want to base their products on. This has caused us untold problems.

While not an exhaustive list, the following have been particularly problematic, with the category of processor that we've had problems with:

There are probably more but I'm running out of time to finish this report so I guess I'll leave it at that.

Many discussions of interoperability problems center around things like "let's find that minimal set of features that everyone gets right" and produce a profile of XML Schema with just those features. We believe this is the wrong direction to go…since many of the features that we use heavily do not appear on many people's "minimal profile". Rather, we would prefer to address the core problems that are responsible: 1) the readability of the spec; 2) education on how to implement specific features.

Problems With Expressiveness Of The Schema Language

We have encountered several problems because of limitations in the expressiveness of the schema language. These are described below, ranked from most severe to least.

1. Extend Only At End

For each UML class in a given model we generate a complex type. The content model of the type is a sequence of the attributes (UML class attributes, obviously, not XML attributes) and associations of the given class, where the child elements for the class attributes come before the child elements for the associations.

When one UML class specializes another class (by adding attributes and/or associations) we would like to be able to define the complex type for the specialization as an extension of the type for the first class. However, the only way we can currently do that is to break the rule that all child elements representing class attributes come before child elements for associations. This is because of the "extend only at the end" rule.

For example, consider the following simple UML model, with a general Entity class, with LivingSubject and Person specializations that each add additional attributes.

[Figure: simple UML model showing the Entity class with LivingSubject and Person specializations]

Note: the player and scoper associations go to another class called a Role; e.g. a certain Person plays the Role of patient, scoped by a certain Hospital (Entity).

We need our instances to look like:

        <entity>
           <name/>
           <player>
              …
           </player>
           <scoper>
              …
           </scoper>
        </entity>

        <livingSubject>
           <name/>          <!-- from Entity -->
           <birthTime/>
           <desceasedTime/>
           <player>         <!-- from Entity -->
              …
           </player>
           <scoper>         <!-- from Entity -->
              …
           </scoper>
        </livingSubject>

        <person>
           <name/>          <!-- from Entity -->
           <birthTime/>     <!-- from LivingSubject -->
           <desceasedTime/> <!-- from LivingSubject -->
           <address/>
           <race/>
           <player>         <!-- from Entity -->
              …
           </player>
           <scoper>         <!-- from Entity -->
              …
           </scoper>
        </person>
                        

However, because of the "extend only at end" rule there is no way to derive the type for LivingSubject from Entity or Person from LivingSubject and maintain this child element order. We would have to have the following:

        <person>
           <name/>
           <player>
              …
           </player>
           <scoper>
              …
           </scoper>
           <birthTime/>
           <desceasedTime/>
           <address/>
           <race/>
        </person>
                        

which we actually tried for a while, but that element ordering was confusing to our users. As a result, we have had to give up on using extension at all! [Actually, we use extension in several other places but not to model the class hierarchy in our UML model.] An important part of our processing model is that if a receiver does not understand what a Person is, it should still be able to process the information as an instance of a LivingSubject or Entity. Naturally, if we could define the types as extensions of each other then handling this processing model would be very simple…just walk up the base type chain in the PSVI until you get to a type that you understand.

To accommodate this processing model we have had to add fixed attributes in each type whose value is the name of the UML class it represents, and receiving applications must maintain their own table of the class hierarchy. This is very unfortunate!
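As a rough illustration of the burden this places on receivers, the following Python sketch shows the kind of hand-maintained table and lookup involved. The table PARENT_OF and the function nearest_understood are hypothetical names; the class name passed in is assumed to have been read from the fixed attribute in the instance.

```python
# Hypothetical table a receiver would have to maintain by hand:
# each UML class name mapped to its immediate generalization.
PARENT_OF = {
    "Person": "LivingSubject",
    "LivingSubject": "Entity",
    "Entity": None,
}


def nearest_understood(class_name, understood):
    """Walk up the hierarchy until reaching a class the receiver understands.

    Returns None if no class on the chain is understood.
    """
    current = class_name
    while current is not None:
        if current in understood:
            return current
        current = PARENT_OF.get(current)  # None for roots and unknown names
    return None
```

A receiver that only knows about Entity and LivingSubject would call nearest_understood("Person", {"Entity", "LivingSubject"}) and process the element as a LivingSubject. With type extension, the same information would come from the base type chain rather than from a separate table.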

2. Lack Of Co-Occurrence Constraints

As with many other users, we have a need to express co-occurrence constraints. During the WG's discussions on this issue I introduced the distinction between occurrence-based and value-based co-occurrence constraints. In the first case, the attributes and content model of a given type are conditional on the occurrence of some other element/attribute; in the latter, the attributes and content model are conditioned on the value of some other element/attribute.

While HL7 has at least one need for value-based constraints, we really need occurrence-based constraints and would gladly settle for that.

It is often the case with clinical information that the values of certain UML class attributes or an entire UML class at the distal end of an association are unknown or null. XML Schema provides the xsi:nil feature to indicate that required child elements that are missing should not cause the parent element to be treated as invalid.

However, there are two reasons we can't use xsi:nil (in isolation) to signal these unknown values. First, there are often clinical and/or legal reasons that make it necessary to say why some value is unknown.

Imagine a person is in an auto accident and is rushed to the emergency room unconscious. Their spouse, who was not injured in the accident, is asked for vital clinical information by the admitting nurse: e.g., is the patient allergic to any medications? If the spouse is unsure whether the patient is allergic, it is necessary to indicate in the admission message that this question was asked but the answer was unknown, as this is a very different state of affairs from simply not including any information about allergies…if the patient has a reaction to some medication given to them in the emergency room, the hospital and doctors have a certain amount of legal protection against malpractice since it is documented that they at least asked! HL7 has developed a very rich vocabulary of "null flavors" to cover all of the various cases of why required information might be unknown. xsi:nil does not give us any way to convey these "null flavors" (we tried to get that capability in while XML Schema was still in development but the rest of the WG felt it didn't make the 80/20 cut…and they may have been right). Instead, we have a separate nullFlavor attribute whose value is drawn from the controlled vocabulary we have developed.

As might be expected, there has been a large amount of confusion among those implementing the HL7 standard about the relationship between our nullFlavor attribute and xsi:nil.

Second, in our instances, the values of UML class attributes are (largely) represented as XML attributes on the element that represents the UML class attribute itself. For example,

        <livingSubject>
           …
           <birthTime value='1962-06-17T06:31:00-800'/>
           …
        </livingSubject>
                        

xsi:nil has no effect on how the presence or absence of attributes affects local validity…it applies only to child elements. Because of this, we have had to make all XML attributes optional…which has the very unfortunate consequence that an instance may be "structurally" valid but contain no actual information. For instance, the instance fragment above is actually valid even though it contains none of the values of UML class attributes. Note: we represent the values of UML class attributes as XML attributes because it drastically reduces the size of our messages…at one time we had developed an element-only language but our community found the bloat to be excessive (on the order of doubling or tripling message size in some cases).

Thus, what we really need is a way to say "if the nullFlavor attribute is present then no child elements and, more importantly in most cases, no other attributes are allowed." Without this capability the efficacy of using our schemas for validation is drastically reduced.
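Since the schema cannot express this rule, it has to be checked by application code after schema validation. The following Python sketch shows one way such a check might look; the function name null_flavor_violations, the sample instance, and the 'UNK' value are illustrative assumptions, and the sketch assumes nullFlavor is an unqualified attribute.

```python
import xml.etree.ElementTree as ET


def null_flavor_violations(root):
    """Return the tags of elements that carry nullFlavor plus anything else.

    Occurrence-based rule: if nullFlavor is present, the element may have
    no other attributes and no child elements.
    """
    violations = []
    for elem in root.iter():
        if "nullFlavor" not in elem.attrib:
            continue
        other_attrs = [name for name in elem.attrib if name != "nullFlavor"]
        has_children = len(elem) > 0
        if other_attrs or has_children:
            violations.append(elem.tag)
    return violations


# Illustrative instance: the second child breaks the rule.
doc = ET.fromstring(
    "<livingSubject>"
    "<birthTime nullFlavor='UNK'/>"
    "<name nullFlavor='UNK' value='x'/>"
    "</livingSubject>"
)
print(null_flavor_violations(doc))  # ['name']
```

A check like this lives outside the schema, so every receiving application has to implement it separately, which is exactly the duplication a schema-level co-occurrence constraint would avoid.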

3. Wildcard Deficiencies

XML Schema provides for the appearance of wildcards in a content model. While there is a great deal of capability embedded in wildcards (controlling the namespace in question, strict/lax/skip processing, etc.) there are two limitations that HL7 has bumped into.

The first is the lack of a "typed" wildcard. We have certain cases where the distal class of a UML association is not known at "design time"; all that is known is that it will be a specialization of some given class. That is, instead of saying "any element from namespace foo goes here", we have cases where we need to say "any element of type bar (or any type derived from bar) goes here".

We have gotten around this limitation by using substitution groups, which have a certain degree of "typed wildcard"-ness. However, using substitution groups is not ideal for us, as it forces us to make certain elements global that we would otherwise rather have be locally scoped. Granted, because of the problems identified with extend only at the end, even having "typed wildcards" would be of limited value since our schema type hierarchy is not as rich as we would really like.

The second is the oft-cited problem of harmful interactions between wildcards and UPA. This problem surfaces for us in that we have a construct, called encapsulated data (or ED), that is modeled after MIME media types. That is, values of type ED have a mediaType (e.g., image/jpeg or text/xml), may be base64 or "text" encoded, may have an optional "thumbnail" or "abstract/summary" representation, etc. If the mediaType has an appropriate value then the content may be XML markup. The "thumbnail" property is recursively defined as an ED where "thumbnail" is restricted out, and it comes in the content model prior to the other child elements (all optional), including a wildcard to allow for the arbitrary XML content. That is, we would like to have a type such as:

        <xs:complexType name='ED' mixed='true'>
           <xs:sequence>
              <xs:element name='thumbnail' minOccurs='0'/>
              <xs:any/>
           </xs:sequence>
           …
        </xs:complexType>
                        

Because of the wildcard--UPA interaction, we have had to limit the extent to which elements from the HL7 namespace are allowed in the arbitrary XML markup. In particular, we've had to say that only 'root' elements from our namespace are allowed, as follows:

        <xs:complexType name='ED' mixed='true'>
           <xs:sequence>
              <xs:element name='thumbnail' minOccurs='0'/>
              <xs:choice minOccurs='0'>
                 <xs:any namespace='##other'/>
                 <xs:any namespace='##local'/>
                 <xs:element ref='HL7RootElement'/>
              </xs:choice>
           </xs:sequence>
           …
        </xs:complexType>
        <xs:element name='HL7RootElement' abstract='true'/>
        <xs:element name='Message123' substitutionGroup='HL7RootElement'>
        …
        </xs:element>
        <xs:element name='Message567' substitutionGroup='HL7RootElement'>
        …
        </xs:element>
                        

We believe that the "weak wildcard" direction that the WG is exploring for XML Schema 1.1 will be a definite improvement in this area.

4. Component Identity Vagueness

Because XML Schema never completely came to grips with the notion of component identity, it is not possible to derive one complex type from another when the "base" type is defined in terms of anonymous types. Of course, the workaround is to name all types, just in case one might want to derive something in the future…but that just needlessly pollutes the type symbol-space.

Without good reason, global variables should be avoided in programming. We believe that global types and elements should equally be avoided unless there is good reason for them.

5. Limitations On All Groups

In Extend Only At End we discussed how we generate one complex type per UML class in our models, with the content model being a sequence of elements, one for each of the UML class attributes, followed by one for each of the UML associations coming from that UML class. In fact, what we'd like to have is something like the following:

        <xs:complexType name='Class'>
           <xs:sequence>
              <xs:all>
                 <xs:element name='attr1'/>
                 <xs:element name='attr2'/>
                 <xs:element name='attr3'/>
              </xs:all>
              <xs:all>
                 <xs:element name='assoc1'/>
                 <xs:element name='assoc2'/>
              </xs:all>
           </xs:sequence>
           …
        </xs:complexType>
                        

However, because of the limitations that XML Schema imposes on the all compositor (both the limitation on cardinalities of the particles within the group as well as the fact that an all group can only appear at the "root" of a content model) this is not possible. Thus, we have had to define an arbitrary ordering for UML class attributes and associations and ensure that all of our tools enforce this ordering. That's just one more thing that can go wrong and we would like to see the limitations on the all compositor lifted.
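For comparison, the content model we would like to express is easy to state in code. The Python sketch below checks a list of child element names against "all of the attributes in any order, then all of the associations in any order"; the function name matches_two_all_groups is hypothetical, and the sketch assumes every listed element is required exactly once, as in the schema fragment above.

```python
def matches_two_all_groups(child_names, attributes, associations):
    """Check children against two consecutive all groups.

    The first len(attributes) children must be exactly the attribute
    elements in any order; the remaining children must be exactly the
    association elements in any order.
    """
    n = len(attributes)
    head, tail = child_names[:n], child_names[n:]
    return (sorted(head) == sorted(attributes)
            and sorted(tail) == sorted(associations))
```

For example, the child order attr2, attr1, attr3, assoc2, assoc1 would be accepted, while any order that places an association before an attribute would be rejected.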