This document describes some issues connected with the
task of constructing a schema, preparatory to using it for
schema validation as described in the W3C Recommendation
XML Schema 1.0.
It is made available in the hope that it will clarify to
implementors what kinds of user control over schema resolution
and schema-validation would be useful. Some of the contents
reflect the considered opinion of the XML Schema WG (at least,
some of the contents are taken from the WG's official responses
to comments on our spec), but as a whole this document is the
responsibility of the named author, and not of anyone else.
1. Why are the conformance rules so lax?
One of the last-call comments on the XML Schema specification
was one from Murray Altheim, which was entitled Tighten
conformance rules? in the last-call issues document.
In his review of the specification, Murray Altheim wrote:
Part 1 : Structures
§6.1 Layer 1: Summary of the schema-validation
core
Another instance of befuddlement. How can this be considered
acceptable? (highlighting mine):
The obligation of a schema-aware processor as far as the
schema-validation core is concerned is to implement the definitions of
schema-valid given below in Schema Validation of Documents
(§7.2). Neither the choice of element information item to be
schema-validated, nor which of three means of initiating validation
are used, is within the scope of this specification.
Part 1 : Structures
§7.9 Missing Sub-components
I've tried three or four times to write up something about this
section. Because of my incomplete understanding of the rest of the
spec it's difficult to confidently summarize, but my reaction in
general is one of mild shock. I long for the days of 'draconian'
error handling, and can only attempt to imagine a Web where §7.9
becomes the norm for XML processing.
The rules governing schema-validation and conformance, and their
rationales, are (roughly) these:
A Within a document, the schemaLocation attribute can be used on any
element to provide a suggestion for where to locate a (not 'the')
schema for a particular namespace.
(Rationale: since there may be any number of things at the URI
which identifies the namespace, and since content negotiation is
currently, ah, imperfectly and incompletely implemented by
software and incompletely understood by the average user, it's
useful to have a safety valve for cases where the namespace name
is not enough.)
[NM adds:]
We have been over this ground many times, and without repeating all
arguments, that is far from the only reason why a namespace URI might not
identify the schema document. I think we have thoroughly analyzed and
carefully stated exactly what we want to say about the senses in which a
namespace URI does or does not potentially locate the schema. We have
similarly agonized over exactly how to state the options available to
a processor in dealing with the namespace URI.
I strongly urge you to not try to restate that particular
aspect of our specification, which we worked so hard to design and to
wordsmith.
B The schemaLocation attribute is, formally, a hint, not an
instruction. It may be taken as a claim that a schema for the
namespace in question may be found at the location indicated. The
schema validator is not required to take the hint. The exact
method by which a schema validator finds a schema is out of scope
and system dependent. We expect schema validators to use
mechanisms like command-line options and arguments, menus,
environment variables, and any other user-interface mechanism
implementors think their users will find helpful.
(Rationale: if I am receiving data from you, either I trust you or
I validate the data. If I don't trust your claim that the
document is valid, how on earth can I be expected to trust your
claim that the schema at a given URI is the one we agreed to
validate against? I can't be. So I need to have the right to
tell the schema processor, "I don't care what the other guy said
is a good schema, the schema I trust for this namespace is right
here." Since the authoritative word must come from the user,
not the document, and since we don't want to interfere with user
interface design, it would be a huge mistake to prescribe a
particular approach to allowing the user to say where to find
schemas. Obviously, a processor can provide a 'trust the
schemaLocation' option which will work in many cases.)
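Rule B can be sketched concretely: a processor consults user-supplied bindings first and falls back to the in-document hint only when the user has opted in to trusting schemaLocation. The class and parameter names here (SchemaResolver, user_bindings, trust_hints) are invented for illustration; no real processor's API is implied.

```python
# A sketch of rule B: schemaLocation is a hint, not an instruction, and
# the authoritative word comes from the user, not the document.
# All names and URLs below are invented for illustration.

class SchemaResolver:
    """Resolve a namespace to a schema location, preferring user bindings."""

    def __init__(self, user_bindings=None, trust_hints=False):
        self.user_bindings = dict(user_bindings or {})  # namespace -> location
        self.trust_hints = trust_hints  # the opt-in 'trust schemaLocation' switch

    def resolve(self, namespace, hint=None):
        # The user's binding always wins over anything the document says.
        if namespace in self.user_bindings:
            return self.user_bindings[namespace]
        # Fall back to the document's hint only if the user opted in.
        if self.trust_hints and hint is not None:
            return hint
        return None  # no schema located; assessment may proceed laxly

resolver = SchemaResolver(
    user_bindings={"http://example.org/po": "file:///schemas/po-v2.xsd"},
    trust_hints=True,
)
# The user binding wins over the in-document hint:
print(resolver.resolve("http://example.org/po", hint="http://evil.example/po.xsd"))
```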
C The schemaLocation attribute also implies a claim
that the relevant parts of the document conform to that schema.
(Rationale: we discussed this; some WG members would have been
just as happy for schemaLocation to mean
'a schema for this NS
is over there', without embodying any claims about the validity of
the document. The WG as a whole, however, thought it preferable
to adopt the view that the schemaLocation attribute further embodies
a claim to validity against the schema referred to.)
[HT adds:] As the spec says:
"On the other hand, in case a document author (human or not)
created a document with a particular schema in view, and warrants
that some or all of the document is schema-valid per that schema,
we provide the schemaLocation and noNamespaceSchemaLocation
[attributes] (in the XML Schema instance namespace, that is,
http://www.w3.org/1999/XMLSchema-instance) (hereafter
xsi:schemaLocation and xsi:noNamespaceSchemaLocation)."
[NM adds:] One subtlety relating to "C". Though one could
argue that it is an obscure case, it is quite plausible for a document
to provide schemaLocations for namespaces that it does not explicitly
reference. For example, the attribute
width="3 in"
is not qualified, but it is quite possible that the simple type used to
validate the attribute contents is in a target namespace for which the
document is free to offer a schemaLocation hint.
I don't think we should push too hard on the notion that the
document is warranting its validity with respect to a certain schema
or definition of a target namespace. I think there are fewer risks
in stating that a schemaLocation allows the opportunity for the
instance document to provide hints on the likely location of useful
schemas for particular target namespaces. There is certainly no
requirement that I am aware of that such namespaces be explicitly
referenced in the document, or that they play a particular role in
validation (though in practice, few processors would bother to find a
schema document known to be for a namespace completely irrelevant to
the validation.)
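For reference, an xsi:schemaLocation value is a whitespace-separated sequence of namespace-name/location pairs, and the attribute may appear on any element. A short stdlib sketch of gathering every hint in a document follows; the sample document and URLs are invented, and the instance namespace shown is the one from the published 1.0 Recommendation (the draft text quoted above uses the earlier 1999 URI).

```python
# Collect every (namespace -> suggested location) hint in a document.
# Sample document and schema URL are invented for illustration.

import xml.etree.ElementTree as ET

XSI = "http://www.w3.org/2001/XMLSchema-instance"

def collect_hints(root):
    """Return {namespace: [locations...]} from all xsi:schemaLocation attributes."""
    hints = {}
    for el in root.iter():
        value = el.get(f"{{{XSI}}}schemaLocation")
        if value is None:
            continue
        tokens = value.split()
        # The value alternates: namespace name, then a suggested location.
        for ns, loc in zip(tokens[0::2], tokens[1::2]):
            hints.setdefault(ns, []).append(loc)
    return hints

doc = """<po:order xmlns:po="http://example.org/po"
    xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance"
    xsi:schemaLocation="http://example.org/po http://example.org/po.xsd">
  <po:item/>
</po:order>"""

print(collect_hints(ET.fromstring(doc)))
```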
D The presence of a schemaLocation attribute does not constitute a
request for validation.
(Rationale: there are many situations in which a document should
be read, possibly by a processor which understands how to validate
it, but does not need to be, or SHOULD NOT be, validated. A
request for validation is a transaction between a user and a piece
of software, or between two pieces of software. It is not a
declarative fact about a document. It is best left to a user
interface.)
E If more than one schema location is suggested for a particular
namespace, it is not an error, but no particular priority is
assigned among them.
(Rationale: they are HINTS, right?)
F A validation process may start at any element in the document and
work down.
(Rationale: Launching a validation process is taken to be a matter
between a user and a piece of software, or between two pieces of
software. It may sometimes be important to validate the entire
document; sometimes only certain parts of the document need to be
validated. Since the presence of a schemaLocation attribute does not
constitute a request for validation (and its absence cannot be
taken as a binding request not to validate), the user is free to
select any point as the starting point. It may be expected that
some schema validators will, by default, start at the top of the
document. But it is important that they are not REQUIRED to do
so.)
G A validation process may work in strict mode, lax mode, or skip
mode. It may -- or rather, it must --
switch from mode to mode on the basis of the
{process contents} property on the relevant schema component.
(Rationale: For some applications, it's essential to check every
element and every attribute, and to insist that they be declared,
roughly as in a DTD.
For some applications (black-box applications), it's essential to
be able to specify that the schema applies only to some outer
envelope, which contains well-formed XML as a payload, and that
the payload does not need to conform to the schema and should be
skipped entirely. Think of defining an information retrieval
protocol like Z39.50 as a set of XML messages going back and
forth. The envelope needs to conform to the schema, but the
payload does not need to conform, and it would normally be a waste
of cycles to try to validate the payload.
For some applications (white box applications), there may be a
payload which need not be validated, and the elements in it need
not be declared, but if elements are encountered for which
declarations are available, they should be validated. In a
template in an XSL stylesheet, for example, I may not care about
validating the elements in the target namespace. But if I see
another XSL element inside a target element, I probably do want to
validate it.
So strict, skip, and lax are each necessary, because each
describes a plausible approach to validation and to coexistence of
schemas and namespaces.)
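The mode switching in G can be made concrete with a toy sketch. The declaration table and node shape below are stand-ins for the real schema component model, and a processContents entry on a declaration stands in for the {process contents} of a wildcard in that element's content model; none of this is the spec's actual data model.

```python
# Toy sketch of strict / lax / skip processing driven by {process contents}.
# Elements are dicts; 'declarations' is a made-up stand-in for a schema.

def validate(element, declarations, mode="strict"):
    """Walk an element tree; return a list of (element-name, problem) findings."""
    findings = []
    if mode == "skip":
        return findings  # skip: the whole subtree is left unexamined
    decl = declarations.get(element["name"])
    if decl is None:
        if mode == "strict":
            findings.append((element["name"], "undeclared element"))
        # lax: undeclared elements are passed over without error
        child_mode = mode
    else:
        # A wildcard's {process contents} dictates the mode for the content.
        child_mode = decl.get("processContents", "strict")
    for child in element.get("children", []):
        findings.extend(validate(child, declarations, child_mode))
    return findings

doc = {"name": "envelope", "children": [
    {"name": "payload", "children": [{"name": "mystery"}]}]}

declarations = {"envelope": {"processContents": "skip"}}
print(validate(doc, declarations))      # payload subtree is skipped entirely
print(validate(doc, {"envelope": {}}))  # strict: undeclared children reported
```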
H An application may be guided by the {process contents}
property on the relevant schema components, but it NEED NOT be.
(Rationale: the schema may have been devised for skip-processing,
but for my purposes I may insist on lax or strict processing. My
business partners may not care about the contents of the payload,
but for my purposes I want to know that if the payload contains
anything that claims to be a purchase order, then it jolly well
conforms to my schema for purchase orders.)
[HST adds:]
As currently written, the spec does not provide for conforming
schema processors to
ignore the 'processContents' attribute directly, i.e. when
encountered, its impact on the [validation attempted] and [validity]
properties must be as specified. Applications which want to be
stricter have the following options (these are in the spec) for
documents which are apparently valid (their root EII has [validity] != 'not'):
a) Reject any documents whose top-level [validation attempted] is
anything other than 'full';
b) Scan the PSV infoset for elements and attributes with no [type
definition] property (these were covered by 'skip' or 'lax')
and revalidate them strictly.
I think this is sufficient, but if we want to allow processors to
force processContents to be stricter than a schema specifies, we need
to add this to the spec.
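Option (b) above can be sketched over a toy PSVI-like tree: walk the result, collect the items that carry no type definition, and hand those subtrees to a second, strict pass. The node shape and property names below are illustrative, not the real PSVI.

```python
# Find the items that 'skip' or 'lax' left unassessed, so an application
# can revalidate them strictly. The node/property shape is invented.

def untyped_items(node, path=""):
    """Yield paths of elements and attributes with no type definition."""
    here = f"{path}/{node['name']}"
    if node.get("type_definition") is None:
        yield here
    for attr in node.get("attributes", []):
        if attr.get("type_definition") is None:
            yield f"{here}/@{attr['name']}"
    for child in node.get("children", []):
        yield from untyped_items(child, here)

psvi = {
    "name": "envelope", "type_definition": "EnvelopeType",
    "children": [
        {"name": "payload", "type_definition": None,   # covered by 'skip'
         "children": [{"name": "mystery", "type_definition": None}]},
    ],
}
print(list(untyped_items(psvi)))
# the application can now run these subtrees through a strict pass
```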
I If in the schema the relevant {process contents} property has the
value 'strict' or 'lax' or 'skip', this may be interpreted as a
declarative statement that documents which conform to this schema
must have no errors when processed in the specified mode. It
follows that if a schema processor processes a black-box payload
(declared with processContents='skip') in lax mode, and finds an
error, the error in question is not a schema-validity error.
(Rationale: all schema processors should give the same results, as
regards schema validity. If the schema says something should be
skip-conformant, you do have the right to check it in strict or
lax mode, but you and your processor do not have the right to call
failure to conform to the rules of strict or lax mode a schema
validity error. As long as the distinction is made between
failure to conform with the restrictions laid out in the schema,
and other failures, all is well. You might also want a processor
to check to make sure the document is in ASCII, not UTF-8 or
UTF-16. That's your right, and it's OK. But the processor is not
allowed to claim that a UTF-16 document is ill formed on that
account.)
[HST adds:] This point is important for the proper understanding
of points G and H above: you can define
your own validation property, say [strict validity], and get your
processor to compute it, but you can't produce a PSV Infoset that
records strict validity in the [validity] property -- the
spec. defines what that property means, and you can't change that.
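The rationale for I amounts to bookkeeping: a processor may run whatever extra checks the user asks for (here, the ASCII-only policy from the example above), but it must keep their findings in a separate class from schema-validity errors. A minimal sketch, with invented category names:

```python
# Keep schema-validity findings and application-policy findings distinct,
# per rule I: an extra check must never masquerade as a validity error.

def assess(document_bytes, schema_findings):
    """Merge schema-validity findings with application-policy findings,
    keeping the two classes separate."""
    report = {"schema-validity": list(schema_findings), "application-policy": []}
    try:
        document_bytes.decode("ascii")
    except UnicodeDecodeError:
        # A legitimate extra check, but NOT a schema-validity (or
        # well-formedness) error: the document merely violates our policy.
        report["application-policy"].append("document is not ASCII-only")
    return report

report = assess("<a>caf\u00e9</a>".encode("utf-8"), schema_findings=[])
print(report)
```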
I believe that the commentator was mostly shocked by rules B and F; I
have included the others partly because I think they help make the
picture more complete, and partly because some of them are becoming
hobbyhorses of mine. And also because if I am wrong about any of them,
now would be a better time to learn it than later.
The summary above may also be helpful for responding to issue 183.
I believe the WG has hashed out the rationales for our various
validation rules at enough length that everyone is satisfied that what
we have is what is needed, and that there won't be any consensus for
changing rules B and F as implicitly suggested by issue LC-177.
2. Locating schemas
[This section came out of an email exchange with Tony Coates on
xml-dev.]
At 2001-05-04 10:05, Tony Coates wrote:
On 04/05/2001 15:43:01 C. M. Sperberg-McQueen wrote:
In other words, the relation of namespace to schema is
many-to-many, not one-to-one. This turns out to be a hard pill for
some to swallow, but I think it is time to accept the logical
consequences of our designs. (The people I know who want a
one-to-one relation are, as far as I can tell, still fighting the
battles involved in the development of the namespaces rec. Let it
go, friends! Let it go!)
Yes, the intersection between schemas, namespaces, and versioning is
certainly proving to be cathartic for me too. However, in this
many-to-many world, how does my application determine which versions
of all the many imported and included schemas apply to a particular
instance document?
It's definitely handy to be able to know exactly what declarations
were used; that's one reason XML Schema allows a processor to expose
the Schema components to downstream applications as part of the
post-schema-validation infoset. It's not required (because some
small-footprint processors may not need this functionality), but it
may be something that affects your choice of processor (much the way
DTD-validation or lack thereof, or DTD-awareness or lack thereof may
affect your choice of XML 1.0 processors, or support for RANK
and DATATAG might affect your choice of SGML processors
-- positively or negatively).
So, concretely, you consult (in the output of your schema processor)
the information in the [schema information] property of the element
(or the element information item, if you prefer) at which validation
began. In particular, you want the [schema components] properties
for the namespaces you care about, or -- if all of your declarations
come in via schema documents -- the [schema documents] property.
Be aware that "your application needs to be flexible and robust" is
not a suitable answer, because in the financial world contextual
mistakes can be worth millions of dollars. I need to know exactly
which version of each element was used, and I really need this
information to be available with the DOM tree that my parser
returns.
Agreed. One reason that the mechanism for binding namespaces to
schemas is left so wide open in XML Schema 1.0 is that applications
should not have to be "flexible and robust"
in accepting data.
In some cases (including, I suspect, the financial world), it is even
more handy not just to know afterwards what declarations were used,
but to be able to control which declarations are used. I see from
your later postings to xml-dev that this is not what you are in fact
after, though I agree with Len Bullard that this is going to be
important for lots of applications.
I think (speaking for myself) that there is room for experimentation
here, and certainly a need for experience with different ways of going
about it. I can imagine several rules, which might be combined in
different orders. (N.B. the order in which I list these rules is
random, not intended as a proposed ordering.) In fact, there are
several distinct kinds of rules:
Hard-coding:
A hard-code the schemas for the namespaces you care
about into your software (hard-coding a schema into your software
-- what a lovely way of showing you care ... sounds dumb, but
it's the way HTML browsers have worked for a long time)
B hard-code a list of schema locations for the
namespaces you care about; your software consults those locations for
the version you consider right
C assume that every element is declared with the
ur-type definition, and every attribute with the simple ur-type
definition (useful mostly as a fallback, I guess)
Overt user control:
D pass a set or sequence of (namespace-name, schema
document) pairs to the processor at invocation time, e.g. as a
command-line option
E when you need a schema, ask the user (interactively?)
where to get it (some XML editors do this with DTDs, if they can't
find it any other way -- this plan is probably most useful as the
last attempt in a series, when all else fails)
Trust the document:
F dereference the namespace name and take what the
server gives you (or: then use RDDL to find what you want)
G dereference the URIs given as
schemaLocation hints and take what the server gives you
Indirection through catalogs or paths:
H use a series of regular expressions into which you
substitute the namespace name, or parts of the namespace name, or the
URI given as a schemaLocation hint, or parts of it, and
treat each one as a system identifier (the way the
sgml-public-map variable works in psgml, or the way
sgmls entity resolution used to work before catalog support)
I use the namespace name (viewed as a PUBLIC
identifier) to consult an SGML-Open catalog file
J use the namespace name (viewed as a SYSTEM
identifier) to consult an SGML-Open catalog file
K use the schemaLocation hint (viewed as
a SYSTEM identifier) to consult an SGML-Open catalog file
L use the namespace name or
schemaLocation hint to consult an SGML-Open catalog
file looking for a NAMESPACE or SCHEMA keyword (this
is an extension, for now, but OASIS might be persuaded to build some
keyword into the new version of catalogs)
M use the system identifier you got from the SGML-Open
catalog to consult the SGML-Open catalog file again (keep going until
you have a system identifier for which the catalog owner has not
provided a redirection)
N accept an invocation-time parameter which specifies
where to find catalog files
Not all of these are equally useful (I have always hated systems that
made me work with option H), and if some catalog support (N and one or
more of I-L, and optionally M) is provided, it's not clear that
run-time options (D, E) will be needed for most users. For some
applications, hard-coding things may be the way to go, for the
namespaces one cares most about: it depends on the deployment
conditions.
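One way to combine several of these rules is an ordered chain of resolvers, each mapping a (namespace, hint) pair to a location or None, with the first answer winning. The mappings, patterns, and URLs below are invented for illustration; real deployments would substitute their own tables and catalogs.

```python
# A chain of resolution strategies: hard-coded table (A/B/D), regex
# template (H), then trust-the-hint (G). All data here is invented.

import re

def from_table(table):
    # strategies A/B/D: a hard-coded or invocation-time namespace -> location map
    return lambda ns, hint: table.get(ns)

def from_template(pattern, replacement):
    # strategy H: substitute (parts of) the namespace name into a template
    def resolver(ns, hint):
        new, n = re.subn(pattern, replacement, ns)
        return new if n else None
    return resolver

def from_hint(ns, hint):
    # strategy G: trust the schemaLocation hint, if any
    return hint

def resolve(ns, hint, chain):
    for resolver in chain:
        loc = resolver(ns, hint)
        if loc is not None:
            return loc
    return None  # strategy E (ask the user) would go here as a last resort

chain = [
    from_table({"http://example.org/po": "file:///local/po.xsd"}),
    from_template(r"^http://example\.org/(\w+)$", r"file:///mirror/\1.xsd"),
    from_hint,
]
print(resolve("http://example.org/po", None, chain))   # the table wins
print(resolve("http://example.org/inv", None, chain))  # the template applies
print(resolve("urn:other", "http://their.example/x.xsd", chain))  # falls through to the hint
```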
The situation you describe sounds fine if you don't expect to have
to version your elements, because their meaning isn't expected to
change with time. This will be true in some areas, but certainly
completely incorrect for others. You cannot always invent a new
name just because the semantics change in some way.
Versioning seems to be a hard challenge -- I get the impression it's
hard in part because we want contradictory things from a versioning
mechanism: sometimes the test of a successful versioning mechanism
appears to be that it allows us to label both version 1 and version 2
of language X as just 'X', so we don't need to change all the labels
in our data when we move to a new processor (or vice versa), and at
other times it appears that the test of a successful versioning
mechanism is that it allows us to tell specifically which version(s)
of a specification data, or software, is compatible with, so that we
can fail quickly and avoid catastrophic errors (the processor for
Boolean-Language 1.0 failed on the Boolean-2.0 data, because it
ignored the new alternate notation for 'not' -- or rather, it didn't
fail, it only produced catastrophically erroneous results, smiling
cheerfully all the while).
But I think both forms of versioning can be supported if we give the
user sufficient control over which schemas get bound to which
namespaces.
Case 1. Language X has two versions, both labeled 'X'. I know my
data is in version 2, so I tell the processor
to use the schema for version 2 -- that is, I tell the processor
to bind namespace X to schema X2.
Case 2. Language X has two versions, both labeled 'X'. I don't know
which version is used by the data you just
sent me, but I need to know, so I tell the outermost layer on my
system to try first using the schema for version 2, and if that
doesn't work then try the schema for version 1, and let me know
which worked.
Case 3. As for Case 2, but I don't actually care which version
of language X is used by the data you just sent me. Either I
do the same as in Case 2 (but ignore the information about which
version it was), or I tell the processor to use the 'any-X' version of
the schema, which accepts all documents in either version. (If the
schema language I am using has the 'determinism' rule we inherited
from ISO 8879, the union schema may also find it necessary to accept a
few documents which aren't legal in either version -- one reason some
people would like the community to move away from the determinism
rule.)
Case 4. Language X has two versions, with namespace names X1 and X2.
I know my data is in X2, and I tell the processor to bind the schema
for X2 to the namespace name X2.
Case 5. Language X has two versions, X1 and X2, and I don't want to
maintain two schemas for it, so I write a union schema and bind both
namespace names to it. (Or I say "I don't care what other people do,
I'm binding both X1 and X2 to the schema for X2", much the same way
that on my system the catalog entry for HTML 4.0 Transitional points
to the DTD for HTML 4.0 Strict -- or did for a while.)
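Case 2 can be sketched as a simple fallback loop: try the version-2 schema first, fall back to version 1, and report which one succeeded. The "validators" here are toy predicates standing in for real schema-validation calls; the alternate-notation detail echoes the Boolean-Language example above.

```python
# Case 2: try schemas in preference order and report which version matched.
# The predicate 'validators' below are invented stand-ins for real ones.

def validate_with_fallback(document, schemas):
    """schemas: ordered (label, validator) pairs; return the first label
    whose validator accepts the document, or None if none does."""
    for label, validator in schemas:
        if validator(document):
            return label
    return None

# Toy stand-ins: X 2.0 allows an <alt-not/> notation that X 1.0 does not.
v2_accepts = lambda doc: "<bad/>" not in doc
v1_accepts = lambda doc: "<bad/>" not in doc and "<alt-not/>" not in doc

doc = "<x><alt-not/></x>"
print(validate_with_fallback(doc, [("X 2.0", v2_accepts), ("X 1.0", v1_accepts)]))
# reports which version's schema the document turned out to be valid against
```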
Nor do I really see us wanting to have a separate namespace for each
individual element (and wouldn't that be a great bandwidth blow-out
for our instance documents). There seems to be a piece of the
puzzle that is missing, certainly for enterprise usage.
Versioning is certainly still an ongoing challenge. How to label
things so that we can, as far as possible, allow old data to work with
new processors, allow old processors to work with new data when they
can do so safely, and allow old processors to detect, reliably, when
they need to fail on new data rather than risk processing it -- if
anyone has solved that problem, a lot of us would like to know how.