W3C Schema Experience Report

Submitted by: Dan Vint, Sr Technical Architect, ACORD
dvint@acord.org

1 BACKGROUND

ACORD is a non-profit standards organization for the Insurance Industry. We primarily represent the Life and Property and Casualty (P&C) areas in the US and Large Commercial and Reinsurance (RLC) in Europe and the US. Currently we support three XML standards, one for each of the areas mentioned previously. We build standards that specify the data format for messages that transfer information between agents, carriers, and other parties.

My background is primarily publishing and SGML, which turned into XML and data when I started work for a dot com that was building an Insurance Portal. In 1999 and 2000, Lexica, the dot com I worked for, built some XML schema tools around an early (simpler) schema draft and was proposing an Insurance standard schema in competition with ACORD. We were also members of ACORD to monitor what was going on there. During 1999 and 2000, ACORD was building an XML standard to replace or enhance its proprietary EDI format, known as AL3.

In 2000, DTDs were the only thing released and supported, so we based the ACORD design for P&C around those capabilities, but we also tracked the data types that we wanted to support once schemas became official. I believe we had two, maybe three, releases (one every six months) that were DTD only before we started producing a supporting schema. We kept our schema features limited to those that could be supported by DTDs and that did not affect the content of the data stream. So we have had schemas available for about three years now and are looking at our next major release, which will be based completely on schemas and will remove any requirement to keep backwards compatibility with our earlier standard as well as with DTDs.

2 ISSUES WE CAME ACROSS

2.1 Complexity of the Standard

Perhaps the biggest problem we came across was trying to interpret and understand the specification itself. The Primer was indispensable in trying to understand the features that were added on top of basic XML capabilities, and in understanding things like how DTDs or a DOCTYPE declaration are still needed to support notations and entities, and that groups, along with base types, are sort of a replacement for entities.

Where this causes the most problems is that we have a hard time finding two parsers that will agree on the rules. XML Spy has for the longest time been the tool of choice among our members, but it was the least reliable (until about v5) at reporting errors. We found that the MS-XML and Xerces-J parsers were at least consistent in finding the same errors, while they would not report the problems that Spy and others did. We have based our validation around these two tools whenever we introduce a new schema feature.

The problems in this area seem to revolve around three issues:

  1. Complexity of the specification and understanding all the error conditions.

  2. Support of functionality. It was not always clear which features were supported, and whether tools were actually validating our content and schema design or just ignoring things they did not understand.

  3. Interpretations of the schema features. These differences are hard to avoid, and item #1 makes them much more likely to happen.

2.2 Versioning, Namespaces, Finding the “Standard”, and Hints

SchemaLocation is defined as a hint in the specification and is not required. Nothing in the Schema specification identifies what “standard” or “contract” a data stream was produced against. For me and for ACORD this is a major problem. I want to be able to look at the data stream and know at least which version of our standard it was written against. I also want this functionality to be understood by parsers out of the box, without my having to do additional programming. I feel this is one of the biggest areas where this standard has fallen down, and you will see that we have come up with a workaround that many folks feel goes against typical best practices for namespaces.

First, I understand the idea that people may want to override the information about schema location in the data stream, or that they might want to use tools other than XML Schema for parsing. So why not support or enhance the catalog standard rather than making these fields optional? Now, a file name is not the best way to identify the contract, as I've put it above, but I would start with this change.

What I'm looking for is something like a Public Identifier that identifies the standard by some “standard” name, including a version. In a catalog I could then use this identifier to establish any processing I want to do. If I could create an entry that indicates the identifier and then, for each application being used, its appropriate supporting files, I think this would satisfy many folks. So for instance:

<cat:Identifier name="ACORD/PCS/v1.0.0">
  <cat:processor name="XSD"><schema>acord-pcs-v1-codes.xsd</schema></cat:processor>
  <cat:processor name="Schematron"><schema>acord-pcs.v1.sch</schema></cat:processor>
</cat:Identifier>

So with the above, I know the standard I'm processing against, and for each tool in my processing chain I can identify the appropriate schema on my system. You might even associate multiple identifiers with a single namespace to support the notion of a common vocabulary, which seems to be a big reason for not adding a version to the namespace.

Right now, the only feature I have that will trigger a response from just about any application is to change the namespace with each release of our standard. As a consumer of data from multiple sources, I need the ability to identify the standard the data is based upon and its version.
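
As a rough illustration of that workaround (the namespace URI and file name here are made up, not ACORD's actual ones), each release gets its own namespace, so any namespace-aware tool immediately notices that a document was written against a different version:

<Policy xmlns="http://www.example.org/acord/pcs/v1.1.0"
        xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance"
        xsi:schemaLocation="http://www.example.org/acord/pcs/v1.1.0 acord-pcs-v1.1.0.xsd">
  <!-- message content -->
</Policy>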

2.3 Tool Implementation and Support

Generally we have found that tools built around W3C standards interact well with each other, or at least seem to support most of the functionality. Where we have come into serious problems is with tools like code generators, where there is no W3C specification to control what they should do. So, for instance, we get issues with:

  1. The developer disagreeing with features included in XML and thus in Schemas. “Elements and attributes do the same things, so my generator will not support attributes.” “Entities and groups? You don't need to use them, so my generator will not support them.” Many of these implementers seem to want to recreate these standards based upon their own biases rather than trying to truly support the standards.

  2. These tools seem to look at the world only from the particular language they are generating. So in many cases simple features that are used in everyday XML and Schema designs are overlooked. A big feature I was surprised was a problem is groups. I may use these in my schema design, and they may not map to Java constructs, but if the implementer understood them as substitutions, they would see that a simple preprocessing step could level the schema design down to the few features they want to support, rather than simply not generating the groups (see the sketch after this list). Related to this is support for redefine. Code generators seem to want to treat this as a run-time redefinition when it really should just be implemented as a new schema: process the redefining schema understanding that it is changing definitions in the related files, but the final outcome is supposed to be a single definition. It is just like issuing a new version of a standard, just using some schema functions.

  3. Another example, which is more of a mismatch between XML and programming languages, concerns the use of attributes at the element (data leaf) level. We have implemented a design where text elements carry the xml:lang attribute to support language identification, and all our elements at this level have an optional ID attribute to facilitate business rule error reporting. These features are no problem for XML tools, but code generators want to create a full class definition instead of keeping a simple type a simple type with some additional properties. Because of the additional overhead of all these classes, we are usually pushed not to place attributes on anything at the data level. Now, with xml:lang I have a trade-off between one group that wants to use standard functionality and the general support in XML tools for identifying a language change, versus those that don't want the attribute in their code generation.

  4. Code generators seem to be the worst offenders, but by the time I work out all the issues between the various tools, I have a very flat and simple design with very few common features from XML, let alone from the Schema standard. As ACORD is not the one buying the software, we have very little pull with the vendors when we see these problems occurring, and our members seem to find it easier to get us to not use a particular feature rather than getting the vendor to properly support (or at least report on) the functionality they do support. ACORD cannot make recommendations for products, so we can't even steer people away from these tools (and generally we aren't asked).
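
To make the group point concrete, here is a minimal sketch (the element and type names are invented for illustration) of a named xs:group. A generator that does not want to emit groups could simply inline the referenced content as a preprocessing step, since the two forms validate identically:

<xs:group name="PersonName">
  <xs:sequence>
    <xs:element name="GivenName" type="xs:string"/>
    <xs:element name="Surname" type="xs:string"/>
  </xs:sequence>
</xs:group>

<xs:complexType name="InsuredParty">
  <xs:sequence>
    <!-- equivalent to writing GivenName and Surname directly here -->
    <xs:group ref="PersonName"/>
    <xs:element name="BirthDt" type="xs:date" minOccurs="0"/>
  </xs:sequence>
</xs:complexType>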

Should the W3C consider building a full programming language that works with XML natively and can then easily interface with these other languages and their paradigms? I've used Omnimark in the past. In the days of SGML it was the only tool available that understood SGML natively, unlike Perl, which typically managed text strings. XSLT has taken away much of the need for Omnimark, but even XSLT says it is only for standard transformations, not full-on programming. Also, making it an XML document may have been a little overboard. Could we define an embeddable (maybe compilable) language like Omnimark that is developed and maintained to understand XML manipulation, and define an interface between it and other languages? DOM and SAX do not answer the general problem.

2.4 Extensibility

My organization has the following requirements for extensions and restrictions:

  1. No matter what method is used to define an extension or restriction to an element, internal content choice group, or code list, the resulting data stream should not be any different. So a technique that requires an xsi:type value would not fit this requirement, because we do not want any of what we call "schema artifacts" to appear in the data stream (see the sketch after this list). If every possible method for defining an extension required an xsi:type value, then it would be allowed. We want as little impact on the data stream as possible.

  2. We want to tightly validate the original content as well as the extensions.

  3. Where possible, extensions should be based upon the core elements in our standard. So if someone is creating a new aggregate element and needs an EffectiveDate element, and our standard has one, we encourage them to reuse our element rather than creating a new one.

  4. The extension methods should be supported at least by the core tools on Java and MS platforms. So we test with Xerces-J and MS-XML parsers.

  5. We would like to have one standard method of handling extensions and restrictions; not one solution for restriction, another for extension and maybe another for handling code values.

  6. We want values added to an ACORD list to be identified as such, indicating their non-standard existence. Essentially we would like to use QName to identify company-unique values, but this doesn't work in all cases, most notably where numbers are the logical value, or where ACORD doesn't own (manage the content of) a standard list, such as the ISO country or currency codes (not a good example of something that would likely be extended).
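
For requirement 1, this is the kind of "schema artifact" we want to keep out of the data stream; the element, type, and namespace names below are purely hypothetical:

<Policy xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance"
        xmlns:acme="http://www.example.com/acme-extensions"
        xsi:type="acme:PolicyExtensionType">
  <EffectiveDate>2004-01-01</EffectiveDate>
</Policy>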

Note that we can generally live with the restriction that when elements are added, they are placed at the end of the original definition. We can live with these schema requirements, but we still have requirements of our own for which we have not found a good solution.

We have not found a way to do all of the above; we can answer different parts, but not all of the requirements. We essentially have two approaches currently and would like to find the ultimate solution that answers all of the above.

Solution 1

We are using redefine for both extension and restriction of element and group content. We are defining our code lists in two steps to allow for extension, and all our elements are globally defined. Our codes are first defined under a unique name, with no enumerated values, in a base schema. We then use redefine to define the content of these lists in a separate redefining schema.

We figure this at least allows someone to easily manage this file by adding or subtracting values from the list. Not a perfect solution, but it seems to be the only workaround that allows this. We have those code lists that are controlled by ACORD, which are defined as QName, and we have those external standard lists that we define as token but do not actually enumerate the values, although members may extend these lists.

To handle the use of numbers as code values (like 401k), we have simply made it a requirement that if you want to use a number, you add a leading underscore (‘_401k’) to make it an acceptable value.
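
A minimal sketch of the two-step approach (the type name and code values are invented for illustration). The base schema declares the list with no values:

<!-- codes-base.xsd: the list exists but is deliberately empty -->
<xs:schema xmlns:xs="http://www.w3.org/2001/XMLSchema">
  <xs:simpleType name="PlanTypeCd">
    <xs:restriction base="xs:QName"/>
  </xs:simpleType>
</xs:schema>

The redefining schema, which can be edited to add or remove values, then supplies the enumeration:

<!-- codes.xsd: the values live here and can be extended -->
<xs:schema xmlns:xs="http://www.w3.org/2001/XMLSchema">
  <xs:redefine schemaLocation="codes-base.xsd">
    <xs:simpleType name="PlanTypeCd">
      <xs:restriction base="PlanTypeCd">
        <xs:enumeration value="Pension"/>
        <xs:enumeration value="_401k"/> <!-- leading underscore per the rule above -->
      </xs:restriction>
    </xs:simpleType>
  </xs:redefine>
</xs:schema>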

Solution 2

Another group is using a standard element called <Extensions> that has the xsd:any element as its content. This allows for the validation of the core standard. We have also been able to use redefine on these elements to specify the actual content model (without the xsd:any), so we get tight validation (a sketch of this pattern follows below). This works with Xerces but doesn't work with MS-XML. It also requires the use of redefine to get restriction, so we have two different ways of achieving what we want. We have not implemented this to allow extension of internal choice groups, where we would get into trouble with non-deterministic content models:

((a | b | c | extension)?, d?, e?, extension?)

Without a required element between the choice group and the final extension element, this is non-deterministic. Most of our designs are made up of optional elements; very few things are actually required. So in general this also fails the requirement to allow us to add choices to the choice list.
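
Returning to the <Extensions> hook itself, here is a rough sketch of the pattern (the names and file references are invented; our actual schemas differ). The core schema leaves the hook wide open, and a partner's redefining schema narrows it to a concrete content model; as noted above, we found Xerces accepts this style of restriction while MS-XML does not:

<!-- core schema (acord-core.xsd): the open hook -->
<xs:complexType name="ExtensionsType">
  <xs:sequence>
    <xs:any namespace="##any" processContents="lax"
            minOccurs="0" maxOccurs="unbounded"/>
  </xs:sequence>
</xs:complexType>
<xs:element name="Extensions" type="ExtensionsType"/>

<!-- partner schema: redefine narrows the hook to real elements -->
<xs:redefine schemaLocation="acord-core.xsd">
  <xs:complexType name="ExtensionsType">
    <xs:complexContent>
      <xs:restriction base="ExtensionsType">
        <xs:sequence>
          <xs:element name="RiskScore" type="xs:decimal" minOccurs="0"/>
        </xs:sequence>
      </xs:restriction>
    </xs:complexContent>
  </xs:complexType>
</xs:redefine>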

Note that some of us also don't like the use of substitution groups, because they allow too much magic to happen without being easily traced by a human without tool help. My point is that if I allow the use of substitution groups, I cannot easily look at my schema and determine the final content model if substitutions can occur. This then makes troubleshooting a complex data stream even more difficult. If we make element foo the head of a substitution group, I have to search the entire schema to see whether there are any elements referencing this head element. Also, this only addresses the use of a choice group; what is the consistent approach to allow me to extend a sequence or an all group?

2.5 Keeping the W3C standards in Synch

I know there is work going on in this area, but I haven't seen any results. XML was issued, then namespaces, which didn't really work with DTDs but were needed to make XSLT work. Then schemas came out and seemed to tweak the meaning of a namespace by publishing information at the namespace URL. XSLT and XPath were then not able to deal with the new functionality of schemas because they came out before the Schema spec. We seem to have a house of cards, and the only way to keep up with the latest “definition” of things is to be working with the new cutting-edge specifications. I know this is a hard requirement to fulfill, but we need to issue standards as needed and also make sure all the supporting standards are updated at nearly the same time, not years later.

2.6 XML Special Attributes

Personally, I was surprised that the XML features xml:lang and xml:space, which are magic attributes in the XML specification, needed 3 additional files to be included in my schema if I want to use these attributes in a schema. This seems over-engineered. Why aren't they just built in and understood, as they are with DTDs? The namespace attribute xmlns is understood without any additional effort, so why not xml:lang? We now have xml:id coming out; how is this supposed to be handled in a schema? I'm assuming it will be added alongside xml:space and xml:lang, but I don't really see why it can't just be magic like xmlns.
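
For reference, this is roughly the plumbing required today before xml:lang can be referenced in a declaration (the schema location shown is the W3C-hosted copy; the attribute group name is invented):

<xs:schema xmlns:xs="http://www.w3.org/2001/XMLSchema"
           xmlns:xml="http://www.w3.org/XML/1998/namespace">

  <!-- import the schema for the xml: namespace before any xml:* attribute can be used -->
  <xs:import namespace="http://www.w3.org/XML/1998/namespace"
             schemaLocation="http://www.w3.org/2001/xml.xsd"/>

  <!-- only after the import can a declaration say ref="xml:lang" -->
  <xs:attributeGroup name="CommonAttrs">
    <xs:attribute ref="xml:lang"/>
    <xs:attribute name="id" type="xs:ID"/>
  </xs:attributeGroup>
</xs:schema>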

2.7 When is a Simple Type Really Complex?

This is a schema definition that is contrary to the way I think about the declaration process. Rather than having a simple type that can have either simple or complex content, the specification has a complex type that can have simple content. Every time I start a new schema I trip over this. When I think about creating a text string 64 characters in length with an associated xml:lang attribute, I see this as extending the simple type, so I expect it to be a simple type feature. Instead, it is a complex type definition that is based upon simple content. This one I live with, but it is definitely the reverse of the logic I'm applying, and it always trips me up.
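
As a concrete sketch of the case described above (the type names are invented, and the xml:lang reference assumes the import shown in section 2.6), the 64-character string is a simple type, but the moment an attribute is added the declaration must become a complex type with simple content:

<xs:simpleType name="String64">
  <xs:restriction base="xs:string">
    <xs:maxLength value="64"/>
  </xs:restriction>
</xs:simpleType>

<!-- adding xml:lang forces a complexType, even though the content stays simple -->
<xs:complexType name="LangString64">
  <xs:simpleContent>
    <xs:extension base="String64">
      <xs:attribute ref="xml:lang"/>
    </xs:extension>
  </xs:simpleContent>
</xs:complexType>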

2.8 Schema profiles

Perhaps a solution to some of the tool issues I have brought up is to go into the XML Schema standard and break it up into several profiles of increasing complexity, up to full support of the standard. This would at least make it easier for vendors to say “we are XML Schema Profile 1 compliant,” and everyone would understand that this meant a certain set of features. Instead, we have random implementations and understandings of the specification, and it is usually left to the unwary user, at the most inopportune moment, to find out that a feature is not supported. A parser or tool should not be allowed to claim compliance until it can support at least the minimal profile.