Notes on schema resolution

C. M. Sperberg-McQueen

7 December 2001



This document describes some issues connected with the task of constructing a schema, preparatory to using it for schema validation as described in the W3C Recommendation XML Schema 1.0.
It is made available in the hope that it will clarify to implementors what kinds of user control over schema resolution and schema-validation would be useful. Some of the contents reflect the considered opinion of the XML Schema WG (at least, some of the contents are taken from the WG's official responses to comments on our spec), but as a whole this document is the responsibility of the named author, and not of anyone else.

1. Why are the conformance rules so lax?

One of the last-call comments on the XML Schema specification came from Murray Altheim; it was entitled Tighten conformance rules? in the last-call issues document.
In his review of the specification, Murray Altheim wrote:
Part 1 : Structures
§6.1 Layer 1: Summary of the schema-validation core
Another instance of befuddlement. How can this be considered acceptable? (highlighting mine):
The obligation of a schema-aware processor as far as the schema-validation core is concerned is to implement the definitions of schema-valid given below in Schema Validation of Documents (§7.2). Neither the choice of element information item to be schema-validated, nor which of three means of initiating validation are used, is within the scope of this specification.
Part 1 : Structures
§7.9 Missing Sub-components
I've tried three or four times to write up something about this section. Because of my incomplete understanding of the rest of the spec it's difficult to confidently summarize, but my reaction in general is one of mild shock. I long for the days of 'draconian' error handling, and can only attempt to imagine a Web where §7.9 becomes the norm for XML processing.
The rules governing schema-validation and conformance, and their rationales, are (roughly) these:
A Within a document, the schemaLocation attribute can be used on any element to provide a suggestion for where to locate a (not 'the') schema for a particular namespace.
(Rationale: since there may be any number of things at the URI which identifies the namespace, and since content negotiation is currently, ah, imperfectly and incompletely implemented by software and incompletely understood by the average user, it's useful to have a safety valve for cases where the namespace name is not enough.)
[NM adds:] We have been over this ground many times, and without repeating all the arguments, that is far from the only reason why a namespace URI might not identify the schema document. I think we have thoroughly analyzed and carefully stated exactly what we want to say about the senses in which a namespace URI does or does not potentially locate the schema. We have similarly agonized over exactly how to state the options available to a processor in dealing with the namespace URI.
I strongly urge you to not try to restate that particular aspect of our specification, which we worked so hard to design and to wordsmith.
B The schemaLocation attribute is, formally, a hint, not an instruction. It may be taken as a claim that a schema for the namespace in question may be found at the location indicated. The schema validator is not required to take the hint. The exact method by which a schema validator finds a schema is out of scope and system dependent. We expect schema validators to use mechanisms like command-line options and arguments, menus, environment variables, and any other user-interface mechanism implementors think their users will find helpful.
(Rationale: if I am receiving data from you, either I trust you or I validate the data. If I don't trust your claim that the document is valid, how on earth can I be expected to trust your claim that the schema at a given URI is the one we agreed to validate against? I can't be. So I need to have the right to tell the schema processor, "I don't care what the other guy said is a good schema, the schema I trust for this namespace is right here." Since the authoritative word must come from the user, not the document, and since we don't want to interfere with user interface design, it would be a huge mistake to prescribe a particular approach to allowing the user to say where to find schemas. Obviously, a processor can provide a 'trust the schemaLocation' option which will work in many cases.)
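The precedence described in this rationale can be sketched in a few lines. The function name, the bindings, and the URIs below are invented for illustration; nothing here is prescribed by the spec, which deliberately leaves the resolution mechanism out of scope.

```python
# A toy sketch (invented names, not from the spec) of the precedence
# described above: a user-supplied binding outranks the document's
# schemaLocation hint, which is only a fallback, never an instruction.

def resolve_schema(namespace, user_bindings, document_hints):
    """Return a schema location for `namespace`, or None.

    user_bindings:  {namespace: location} supplied at invocation time
                    (command-line option, menu, environment variable, ...).
    document_hints: {namespace: location} collected from xsi:schemaLocation
                    pairs in the instance document.
    """
    # The authoritative word comes from the user, not the document.
    if namespace in user_bindings:
        return user_bindings[namespace]
    # The hint is only a claim; a processor MAY choose to take it.
    return document_hints.get(namespace)

user = {"http://example.org/po": "file:///schemas/po-trusted.xsd"}
hints = {"http://example.org/po": "http://other.example/po.xsd",
         "http://example.org/misc": "http://other.example/misc.xsd"}

print(resolve_schema("http://example.org/po", user, hints))
# file:///schemas/po-trusted.xsd  (the user's word wins over the hint)
print(resolve_schema("http://example.org/misc", user, hints))
# http://other.example/misc.xsd   (no user binding, so take the hint)
```

A processor's 'trust the schemaLocation' option amounts to passing an empty `user_bindings` here.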
C The schemaLocation attribute also implies a claim that the relevant parts of the document conform to that schema.
(Rationale: we discussed this; some WG members would have been just as happy for schemaLocation to mean 'a schema for this NS is over there', without embodying any claims about the validity of the document. The WG as a whole, however, thought it preferable to adopt the view that the schemaLocation attribute further embodies a claim to validity against the schema referred to.)
[HT adds:] As the spec says: "On the other hand, in case a document author (human or not) created a document with a particular schema in view, and warrants that some or all of the document is schema-valid per that schema, we provide the schemaLocation and noNamespaceSchemaLocation [attributes] (in the XML Schema instance namespace, that is, http://www.w3.org/1999/XMLSchema-instance) (hereafter xsi:schemaLocation and xsi:noNamespaceSchemaLocation)."
[NM adds:] One subtlety relating to "C". Though one could argue that it is an obscure case, it is quite plausible for a document to provide schemaLocations for namespaces that it does not explicitly reference. For example, the attribute:
     width="3 in"
is not qualified, but it is quite possible that the simple type used to validate the attribute contents is in a target namespace for which the document is free to offer a schemaLocation hint.
I don't think we should push too hard on the notion that the document is warranting its validity with respect to a certain schema or definition of a target namespace. I think there are fewer risks in stating that a schemaLocation allows the opportunity for the instance document to provide hints on the likely location of useful schemas for particular target namespaces. There is certainly no requirement that I am aware of that such namespaces be explicitly referenced in the document, or that they play a particular role in validation (though in practice, few processors would bother to find a schema document known to be for a namespace completely irrelevant to the validation).
D The presence of a schemaLocation attribute does not constitute a request for validation.
(Rationale: there are many situations in which a document should be read, possibly by a processor which understands how to validate it, but does not need to be, or SHOULD NOT be, validated. A request for validation is a transaction between a user and a piece of software, or between two pieces of software. It is not a declarative fact about a document. It is best left to a user interface.)
E If more than one schema location is suggested for a particular namespace, it is not an error, but no particular priority is assigned among them.
(Rationale: they are HINTS, right?)
F A validation process may start at any element in the document and work down.
(Rationale: Launching a validation process is taken to be a matter between a user and a piece of software, or between two pieces of software. It may sometimes be important to validate the entire document; sometimes only certain parts of the document need to be validated. Since the presence of a schemaLocation attribute does not constitute a request for validation (and its absence cannot be taken as a binding request not to validate), the user is free to select any point as the starting point. It may be expected that some schema validators will, by default, start at the top of the document. But it is important that they are not REQUIRED to do so.)
G A validation process may work in strict mode, lax mode, or skip mode. It may -- or rather, it must -- switch from mode to mode on the basis of the {process contents} property on the relevant schema component.
(Rationale: For some applications, it's essential to check every element and every attribute, and to insist that they be declared, roughly as in a DTD.
For some applications (black-box applications), it's essential to be able to specify that the schema applies only to some outer envelope, which contains well-formed XML as a payload, and that the payload does not need to conform to the schema and should be skipped entirely. Think of defining an information retrieval protocol like Z39.50 as a set of XML messages going back and forth. The envelope needs to conform to the schema, but the payload does not need to conform, and it would normally be a waste of cycles to try to validate the payload.
For some applications (white box applications), there may be a payload which need not be validated, and the elements in it need not be declared, but if elements are encountered for which declarations are available, they should be validated. In a template in an XSL stylesheet, for example, I may not care about validating the elements in the target namespace. But if I see another XSL element inside a target element, I probably do want to validate it.
So strict, skip, and lax are each necessary, because each describes a plausible approach to validation and to coexistence of schemas and namespaces.)
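The difference between the three modes can be made concrete with a small sketch. In the spec, {process contents} belongs to wildcard particles in a content model; here it is hung directly on invented per-element declaration records, and the tree shape is made up, purely to keep the example readable.

```python
# Simplified sketch of strict/lax/skip. In the spec, {process contents}
# is a property of wildcard components in a content model; here it is
# hung directly on invented per-element declaration records to keep the
# example small enough to read.

def assess(element, declarations, mode):
    """Return a list of (element-name, problem) pairs under `mode`."""
    problems = []
    if mode == "skip":
        return problems                      # black-box payload: don't look
    decl = declarations.get(element["name"])
    if decl is None:
        if mode == "strict":                 # strict: everything must be declared
            problems.append((element["name"], "no declaration"))
        child_mode = mode                    # lax: undeclared, so not assessed,
                                             # but keep looking for declared kin
    else:
        # A declaration may switch the mode for the children.
        child_mode = decl.get("processContents", mode)
    for child in element.get("children", []):
        problems.extend(assess(child, declarations, child_mode))
    return problems

# An envelope whose declaration says the payload is to be skipped:
decls = {"envelope": {"processContents": "skip"}}
doc = {"name": "envelope",
       "children": [{"name": "mystery-payload", "children": []}]}

print(assess(doc, decls, "strict"))   # [] -- the payload is never examined
```

With a declaration that lacks the skip instruction, the same document would fail under strict assessment, since mystery-payload has no declaration.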
H An application may be guided by the {process contents} property on the relevant schema components, but it NEED NOT be.
(Rationale: the schema may have been devised for skip-processing, but for my purposes I may insist on lax or strict processing. My business partners may not care about the contents of the payload, but for my purposes I want to know that if the payload contains anything that claims to be a purchase order, then it jolly well conforms to my schema for purchase orders.
[HST adds:] As currently written, the spec does not provide for conforming schema processors to ignore the 'processContents' attribute directly, i.e. when encountered, its impact on the [validation attempted] and [validity] properties must be as specified. Applications which want to be stricter have options (these are in the spec.) for documents which are apparently valid (their root EII has [validity] != 'not').
I think this is sufficient, but if we want to allow processors to force processContents to be stricter than a schema specifies, we need to add this to the spec.
I If in the schema the relevant {process contents} property has the value 'strict' or 'lax' or 'skip', this may be interpreted as a declarative statement that documents which conform to this schema must have no errors when processed in the specified mode. It follows that if a schema processor processes a black-box payload (declared with processContents='skip') in lax mode, and finds an error, the error in question is not a schema-validity error.
(Rationale: all schema processors should give the same results, as regards schema validity. If the schema says something should be skip-conformant, you do have the right to check it in strict or lax mode, but you and your processor do not have the right to call failure to conform to the rules of strict or lax mode a schema validity error. As long as the distinction is made between failure to conform with the restrictions laid out in the schema, and other failures, all is well. You might also want a processor to check to make sure the document is in ASCII, not UTF-8 or UTF-16. That's your right, and it's OK. But the processor is not allowed to claim that a UTF-16 document is ill formed on that account.)
[HST adds:] This point is important for the proper understanding of points G and H above: you can define your own validation property, say [strict validity], and get your processor to compute it, but you can't produce a PSV Infoset that records strict validity in the [validity] property -- the spec. defines what that property means, and you can't change that.
I believe that the commentator was mostly shocked by rules B and F; I have included the others partly because I think they help make the picture more complete, and partly because some of them are becoming hobbyhorses of mine. And also because if I am wrong about any of them, now would be a better time to learn it than later.
The summary above may also be helpful for responding to issue 183.
I believe the WG has hashed out the rationales for our various validation rules at enough length that everyone is satisfied that what we have is what is needed, and that there won't be any consensus for changing rules B and F as implicitly suggested by issue LC-177.

2. Locating schemas

[This section came out of an email exchange with Tony Coates on xml-dev.]
At 2001-05-04 10:05, Tony Coates wrote:
On 04/05/2001 15:43:01 C. M. Sperberg-McQueen wrote:
In other words, the relation of namespace to schema is many-to-many, not one-to-one. This turns out to be a hard pill for some to swallow, but I think it is time to accept the logical consequences of our designs. (The people I know who want a one-to-one relation are, as far as I can tell, still fighting the battles involved in the development of the namespaces rec. Let it go, friends! Let it go!)
Yes, the intersection between schemas, namespaces, and versioning is certainly proving to be cathartic for me too. However, in this many-to-many world, how does my application determine which versions of all the many imported and included schemas apply to a particular instance document?
It's definitely handy to be able to know exactly what declarations were used; that's one reason XML Schema allows a processor to expose the Schema components to downstream applications as part of the post-schema-validation infoset. It's not required (because some small-footprint processors may not need this functionality), but it may be something that affects your choice of processor (much the way DTD-validation or lack thereof, or DTD-awareness or lack thereof may affect your choice of XML 1.0 processors, or support for RANK and DATATAG might affect your choice of SGML processors -- positively or negatively).
So, concretely, you consult (in the output of your schema processor) the information in the [schema information] property of the element (or the element information item, if you prefer) at which validation began. In particular, you want the [schema components] properties for the namespaces you care about, or -- if all of your declarations come in via schema documents -- the [schema documents] property.
Be aware that "your application needs to be flexible and robust" is not a suitable answer, because in the financial world contextual mistakes can be worth millions of dollars. I need to know exactly which version of each element was used, and I really need this information to be available with the DOM tree that my parser returns.
Agreed. One reason that the mechanism for binding namespaces to schemas is left so wide open in XML Schema 1.0 is that applications should not have to be "flexible and robust" in accepting data.
In some cases (including, I suspect, the financial world), it is even more handy not just to know afterwards what declarations were used, but to be able to control which declarations are used. I see from your later postings to xml-dev that this is not what you are in fact after, though I agree with Len Bullard that this is going to be important for lots of applications.
I think (speaking for myself) that there is room for experimentation here, and certainly a need for experience with different ways of going about it. I can imagine several rules, which might be combined in different orders. (N.B. the order in which I list these rules is random, not intended as a proposed ordering.) In fact, there are several distinct kinds of rules:
Hard-coding:
  • A hard-code the schemas for the namespaces you care about into your software (hard-coding a schema into your software -- what a lovely way of showing you care ... sounds dumb, but it's the way HTML browsers have worked for a long time)
  • B hard-code a list of schema locations for the namespaces you care about; your software consults those locations for the version you consider right
  • C assume that every element is declared with the ur-type, and every attribute with the simple ur-type (useful mostly as a fallback, I guess)
Overt user control:
  • D pass a set or sequence of (namespace-name, schema document) pairs to the processor at invocation time, e.g. as a command-line option
  • E when you need a schema, ask the user (interactively?) where to get it (some XML editors do this with DTDs, if they can't find it any other way -- this plan is probably most useful as the last attempt in a series, when all else fails)
Trust the document:
  • F dereference the namespace name and take what the server gives you (or: then use RDDL to find what you want)
  • G dereference the URIs given as schemaLocation hints and take what the server gives you
Indirection through catalogs or paths:
  • H use a series of regular expressions into which you substitute the namespace name, or parts of the namespace name, or the URI given as a schemaLocation hint, or parts of it, and treat each one as a system identifier (the way the sgml-public-map variable works in psgml, or the way sgmls entity resolution used to work before catalog support)
  • I use the namespace name (viewed as a PUBLIC identifier) to consult an SGML-Open catalog file
  • J use the namespace name (viewed as a SYSTEM identifier) to consult an SGML-Open catalog file
  • K use the schemaLocation hint (viewed as a SYSTEM identifier) to consult an SGML-Open catalog file
  • L use the namespace name or schemaLocation hint to consult an SGML-Open catalog file looking for a NAMESPACE or SCHEMA keyword (this is an extension, for now, but Oasis might be persuaded to build some keyword into the new version of catalogs)
  • M use the system identifier you got from the SGML-Open catalog to consult the SGML-Open catalog file again (keep going until you have a system identifier for which the catalog owner has not provided a redirection)
  • N accept an invocation-time parameter which specifies where to find catalog files
Not all of these are equally useful (I have always hated systems that made me work with option H), and if some catalog support (N and one or more of I-L, and optionally M) is provided, it's not clear that run-time options (D, E) will be needed for most users. For some applications, hard-coding things may be the way to go, for the namespaces one cares most about: it depends on the deployment conditions.
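A resolver chain combining a few of these options can be sketched as follows. The catalog entries, rewrite patterns, and URIs are all invented, and a real implementation would sit behind whatever entity-resolution interface the processor exposes; the point is only the ordering of the lookups.

```python
import re

# Toy resolver chain combining a few of the options above: catalog
# lookup keyed on the namespace name (option I/J) or on the
# schemaLocation hint (option K), a regex rewrite rule (option H),
# and taking the hint as-is (option G) as last resort.
# All catalog entries, patterns, and URIs here are invented.

def make_resolver(catalog, rewrites):
    def resolve(namespace, hint=None):
        if namespace in catalog:                   # option I/J
            return catalog[namespace]
        if hint is not None and hint in catalog:   # option K
            return catalog[hint]
        for pattern, template in rewrites:         # option H
            m = re.match(pattern, namespace)
            if m:
                return m.expand(template)
        return hint                                # option G: trust the document

    return resolve

resolve = make_resolver(
    catalog={"http://example.org/po": "file:///schemas/po.xsd"},
    rewrites=[(r"http://example\.org/ns/(.+)", r"file:///schemas/\1.xsd")],
)

print(resolve("http://example.org/po"))
# file:///schemas/po.xsd
print(resolve("http://example.org/ns/invoice"))
# file:///schemas/invoice.xsd
print(resolve("http://elsewhere.example/x", hint="http://mirror.example/x.xsd"))
# http://mirror.example/x.xsd
```

Option M (repeated catalog redirection) would amount to looping the catalog lookup until the result is no longer a catalog key; option N amounts to how the `catalog` argument gets populated at invocation time.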
The situation you describe sounds fine if you don't expect to have to version your elements, because their meaning isn't expected to change with time. This will be true in some areas, but certainly completely incorrect for others. You cannot always invent a new name just because the semantics change in some way.
Versioning seems to be a hard challenge -- I get the impression it's hard in part because we want contradictory things from a versioning mechanism. Sometimes the test of a successful versioning mechanism appears to be that it allows us to label both version 1 and version 2 of language X as just 'X', so that we don't need to change all the labels in our data when we move to a new processor (or vice versa). At other times the test of a successful versioning mechanism appears to be that it allows us to tell specifically which version(s) of a specification our data or software is compatible with, so that we can fail quickly and avoid catastrophic errors (the processor for Boolean-Language 1.0 failed on the Boolean-2.0 data, because it ignored the new alternate notation for 'not' -- or rather, it didn't fail, it only produced catastrophically erroneous results, smiling cheerfully all the while).
But I think both forms of versioning can be supported if we give the user sufficient control over which schemas get bound to which namespaces.
Case 1. Language X has two versions, both labeled 'X'. I know my data is in version 2, so I tell the processor to use the schema for version 2 -- that is, I tell the processor to bind namespace X to schema X2.
Case 2. Language X has two versions, both labeled 'X'. I don't know which version is used by the data you just sent me, but I need to know, so I tell the outermost layer on my system to try first using the schema for version 2, and if that doesn't work then try the schema for version 1, and let me know which worked.
Case 3. As for Case 2, but I don't actually care which version of language X is used by the data you just sent me. Either I do the same as in Case 2 (but ignore the information about which version it was), or I tell the processor to use the 'any-X' version of the schema, which accepts all documents in either version. (If the schema language I am using has the 'determinism' rule we inherited from ISO 8879, the union schema may also find it necessary to accept a few documents which aren't legal in either version -- one reason some people would like the community to move away from the determinism rule.)
Case 4. Language X has two versions, with namespace names X1 and X2. I know my data is in X2, and I tell the processor to bind the schema for X2 to the namespace name X2.
Case 5. Language X has two versions, X1 and X2, and I don't want to maintain two schemas for it, so I write a union schema and bind both namespace names to it. (Or I say "I don't care what other people do, I'm binding both X1 and X2 to the schema for X2", much the same way that on my system the catalog entry for HTML 4.0 Transitional points to the DTD for HTML 4.0 Strict -- or did for a while.)
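The "try version 2, then version 1, and tell me which worked" outer layer of Case 2 is easy to sketch. The validators below are stand-in predicates invented for the example; a real system would invoke a schema processor once per candidate, each time with the corresponding namespace-to-schema binding.

```python
# Sketch of the Case 2 outer layer: try each candidate schema binding
# in order and report which one accepted the document. The validators
# here are invented stand-in predicates; a real system would run a
# schema processor with each candidate binding in turn.

def first_accepting(document, candidates):
    """candidates: sequence of (label, validator) pairs, tried in order.
    Return the label of the first validator that accepts, else None."""
    for label, validator in candidates:
        if validator(document):
            return label
    return None

# Invented stand-ins: X2 requires a 'currency' field, X1 forbids it.
accepts_v2 = lambda doc: "currency" in doc
accepts_v1 = lambda doc: "currency" not in doc

version = first_accepting({"amount": "3", "currency": "EUR"},
                          [("X2", accepts_v2), ("X1", accepts_v1)])
print(version)   # X2
```

Case 3's "I don't care which" is the same loop with the returned label ignored, and Case 1 is the degenerate single-candidate call.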
Nor do I really see us wanting to have a separate namespace for each individual element (and wouldn't that be a great bandwidth blow-out for our instance documents). There seems to be a piece of the puzzle that is missing, certainly for enterprise usage.
Versioning is certainly still an ongoing challenge. How to label things so that we can, as far as possible, allow old data to work with new processors, allow old processors to work with new data when they can do so safely, and allow old processors to detect, reliably, when they need to fail on new data rather than risk processing it -- if anyone has solved that problem, a lot of us would like to know how.