This is an archived snapshot of W3C's public Bugzilla bug tracker, decommissioned in April 2019.

Bug 5293 - Subsumption
Summary: Subsumption
Status: CLOSED FIXED
Alias: None
Product: XML Schema
Classification: Unclassified
Component: Structures: XSD Part 1
Version: 1.1 only
Hardware: PC Windows XP
Importance: P1 major
Target Milestone: ---
Assignee: C. M. Sperberg-McQueen
QA Contact: XML Schema comments list
Whiteboard: restriction cluster
Keywords: resolved
Reported: 2007-11-28 10:16 UTC by Michael Kay
Modified: 2008-05-24 08:50 UTC

Description Michael Kay 2007-11-28 10:16:09 UTC
XSDL 1.1 defines subsumption by intent: D is a valid restriction of B if all instances of D are valid instances of B.

On the whole I think this is a good thing; it encourages implementations to devise the best possible algorithms for determining subsumption, in contrast to 1.0, which required implementations to reproduce an algorithm that was known to be imperfect.

I wonder, however, whether we have set the bar too high. Consider a case like:

B: a(bc?)|b(ac?)|b(ca)|d|e|fc?

D: a&b

It is fairly easy to see by inspection that B subsumes D. But a glance at the references cited in the spec shows that the algorithms for doing this in the general case are extremely complex. In fact, they seem to be right at the bleeding edge of computer science; from reading the papers I would not be 100% convinced that a complete and provably correct algorithm is known.

And even where an algorithm exists, (a) it may have performance characteristics that make it impractical, (b) it is extremely difficult to test, and (c) the more complex cases that it has to deal with will never be encountered in real life.

At present a product is non-conformant if it fails to detect that B subsumes D in the above case (and in similar but much more complex cases). I think this is too demanding.

I think there is an alternative which is much more practical from an engineering viewpoint. We should simply say that if the implementation is unable to determine statically (one way or the other) whether or not B subsumes D, then it must validate instances at run-time against both. It's then up to implementations to decide how much effort to invest in static analysis, and although some implementations may fail to detect some invalid schemas, they will still agree on the set of valid instances.
 
Note that this is very similar to the solution we have adopted for handling conditional type assignment, and it is also (in effect) the way we handle assertions. So it's not something radically new.
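
To make the run-time fallback concrete, here is a rough sketch in
Python (the names are illustrative only, not proposed spec text):
translate both content models into ordinary regular expressions over
single-letter element names, expanding the all group a&b into its two
permutations by hand, and check an element's children against both.

  import re

  BASE    = re.compile(r'a(bc?)|b(ac?)|b(ca)|d|e|fc?')  # B
  DERIVED = re.compile(r'ab|ba')                        # D: a&b expanded

  def validate_against_both(children):
      """children: the element's child-name sequence, e.g. 'ab'."""
      ok_d = DERIVED.fullmatch(children) is not None
      ok_b = BASE.fullmatch(children) is not None
      if ok_d and not ok_b:
          # The instance has exposed a latent schema error:
          # D was not in fact subsumed by B.
          raise ValueError('D is not a valid restriction of B')
      return ok_d

  print(validate_against_both('ab'))  # True: valid against both D and B
  print(validate_against_both('d'))   # False: valid against B only

When the static check succeeds, an implementation simply gets to skip
the test against the base model; how much effort to invest in making
it succeed is then a quality-of-implementation question.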
Comment 1 C. M. Sperberg-McQueen 2007-11-29 04:30:40 UTC
I think the spec would be much simpler if the semantics of restriction
were simply that if type R restricts type B, then any element governed
by R must be valid against B, as well as satisfying whatever constraints
are expressed by the declaration of R.  Call this 'restriction as
intersection'.  (If a processor can prove that anything satisfying the 
content model of R must necessarily also satisfy the content model of B, 
or can generate a content model for internal use that captures just the
legal sequences of children, then that may be a useful optimization at 
validation time.)  Alas, the WG decided instead to require that R be
provably subsumed by B, and to make it simple to check by imposing a few
simple rules on the structure of R's content model.  Somewhere along the
way, the rules outgrew the description 'simple' and became the 
thicket of conditions and case statements and exceptions and special
cases familiar to readers of 1.0.

If I understand you correctly, you are proposing something similar to
restriction-as-intersection, in the sense that a processor can validate 
by checking against both B and R, instead of attempting to prove that B 
subsumes R, but visibly different from it, in that it's still an error 
for R not to be subsumed by B -- it just happens to be an error
which processors are not obliged to notice until confronted at validation
time by an instance which illustrates the discrepancy.

To take a simple example, if B's content model is (a,b?) and R's content 
model is (a,c?), then restriction-as-intersection would call R legal and 
the effective content model of R would be (a).  Your proposal, if I 
understand it correctly, would say that in principle, R is not legal, 
although a conforming validator is not obliged to notice the fact unless 
the instance contains an element governed by R, with children "a c".
Have I got that right?
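
(As a toy illustration of the intersection reading, assuming both
finite languages are simply enumerated by hand:)

  L_B = {'a', 'ab'}   # (a,b?)
  L_R = {'a', 'ac'}   # (a,c?)
  print(L_B & L_R)    # {'a'}: the effective content model is just (a)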

My heart still yearns for the simplicity of restriction-as-intersection,
but I for one think I'd be happy to accept your proposal as a
substitute, unless the WG sees the light and wants to move to r-as-i.

I note in passing that if we define failure of R to be subsumed by B as
an error, but an error which conforming processors are not obligated to
detect, then our story about errors and error detection has changed from
the story we had converged on in Seattle.  There, the minutes say we had
consensus on the proposition that conforming processors should be obligated
to detect and report errors.  The proposal here seems to entail a partition
of errors into two classes:  those a processor is required to detect and
report, and those which a processor is allowed to fail to detect.  This
distinction has always made sense to me (I learned it from ISO 8879),
but we should not fall into it without seeing that it's a change from both of
our earlier positions (namely [1] that on error all bets are off and 
behavior is unconstrained, and [2] that conforming processors are obligated
to detect and report all errors in schemas and schema documents and may
recover from them, albeit not silently).

Still, all this said, I have to say that I don't think subsumption checking
is actually as bleeding-edge as the bug description suggests.  In cases with
large exponents, it may involve the construction of automata with large 
numbers of states (or the calculation of derivatives with large numbers of
terms) and require a lot of tedious calculation.  The algorithms cited in 
the spec try to make it unnecessary to unroll all the exponents, or to 
unroll them all the way, and while it's true that the authors in question 
have all published algorithms in which others have found bugs, still 
in the end content models define regular languages, and in the very worst 
case we know that you can get correct results by translating into standard 
representations of regular languages and using textbook algorithms from there.
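
To make that concrete, here is a rough sketch of the textbook route in
Python (the names are purely illustrative): content models become
regular expressions over single-letter element names, all groups are
expanded into choices of permutations beforehand, and inclusion is
decided with Brzozowski derivatives.  Unbounded repetition would need
an extra 'star' case, omitted here because the examples above have none.

  EMPTY, EPS = ('empty',), ('eps',)   # the empty language; the empty string

  def sym(a):
      return ('sym', a)

  def alt(r, s):
      # Smart constructor: flatten, drop EMPTY, dedupe.  The frozenset
      # gives the associativity/commutativity/idempotence normalization
      # that keeps the set of distinct derivatives finite.
      items = set()
      for x in (r, s):
          if x == EMPTY:
              continue
          items |= x[1] if x[0] == 'alt' else {x}
      if not items:
          return EMPTY
      if len(items) == 1:
          return next(iter(items))
      return ('alt', frozenset(items))

  def cat(r, s):
      if r == EMPTY or s == EMPTY:
          return EMPTY
      if r == EPS:
          return s
      if s == EPS:
          return r
      return ('cat', r, s)

  def opt(r):                         # r?  ==  epsilon | r
      return alt(EPS, r)

  def nullable(r):
      t = r[0]
      if t == 'eps': return True
      if t in ('empty', 'sym'): return False
      if t == 'alt': return any(nullable(x) for x in r[1])
      return nullable(r[1]) and nullable(r[2])            # 'cat'

  def deriv(r, a):
      # Brzozowski derivative: the suffixes of words in L(r) after a.
      t = r[0]
      if t in ('empty', 'eps'): return EMPTY
      if t == 'sym': return EPS if r[1] == a else EMPTY
      if t == 'alt':
          out = EMPTY
          for x in r[1]:
              out = alt(out, deriv(x, a))
          return out
      head = cat(deriv(r[1], a), r[2])                    # 'cat'
      return alt(head, deriv(r[2], a)) if nullable(r[1]) else head

  def alphabet(r):
      t = r[0]
      if t == 'sym': return {r[1]}
      if t == 'alt': return set().union(*(alphabet(x) for x in r[1]))
      if t == 'cat': return alphabet(r[1]) | alphabet(r[2])
      return set()

  def subsumes(b, d):
      """True iff L(d) is a subset of L(b)."""
      sigma = alphabet(b) | alphabet(d)
      seen, todo = set(), [(d, b)]
      while todo:
          rd, rb = todo.pop()
          if (rd, rb) in seen:
              continue
          seen.add((rd, rb))
          if nullable(rd) and not nullable(rb):
              return False    # some word is in L(d) but not in L(b)
          todo.extend((deriv(rd, x), deriv(rb, x)) for x in sigma)
      return True

  # B: a(bc?)|b(ac?)|b(ca)|d|e|fc?    D: a&b, expanded to ab|ba
  a, b, c, d, e, f = map(sym, 'abcdef')
  B = alt(alt(alt(alt(alt(cat(a, cat(b, opt(c))),
                          cat(b, cat(a, opt(c)))),
                      cat(b, cat(c, a))), d), e),
          cat(f, opt(c)))
  D = alt(cat(a, b), cat(b, a))
  print(subsumes(B, D), subsumes(D, B))   # True False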
Comment 2 Michael Kay 2007-11-29 09:51:06 UTC
>If I understand you correctly, you are proposing something similar to restriction-as-intersection

Not quite. I wasn't proposing that users should be encouraged to think that it's perfectly OK to define (abc?) as a restriction of (ab), with the only adverse consequence being that elements containing a c aren't valid. I was trying to find wording that says this isn't a valid restriction, but not all conformant processors will report the error.
Comment 3 Pete Cordell 2007-11-29 10:14:05 UTC
Re:
> B: a(bc?)|b(ac?)|b(ca)|d|e|fc?
> 
> D: a&b

I suppose it is too late to insist that the structure of the derived type should mirror the structure of the base type.  e.g. to restrict:

B: a(bc?)|b(ac?)|b(ca)|d|e|fc?

you'd have to do:

D: a(bc{0})|(b(ac?)){0}|(b(ca)){0}|d{0}|e{0}|(fc?){0}

It is very ugly, but then the whole restriction syntax is pretty ugly anyway.

Certainly having wildly different structures between the two gives us problems because we try to store the variables of a restricted derivative in the base, and this is not possible if the structures are different.
Comment 4 Michael Kay 2007-11-29 10:22:36 UTC
>I note in passing that if we define failure of R to be subsumed by B as
>an error, but an error which conforming processors are not obligated to
>detect,

One approach might be to introduce the concept of a "latent error": an error in the schema that might not be detected until an instance is presented that reveals it.
Comment 5 Michael Kay 2007-11-29 10:28:20 UTC
>I suppose it is too late to insist that the structure of the derived type should mirror the structure of the base type. 

That gets back to the 1.0 approach which disallows many perfectly reasonable restrictions.

In many ways, starting from scratch, I would prefer a syntax for restriction where you define the differences rather than defining the new type and requiring the system to check consistency. That would be far easier from a maintenance viewpoint, because a change to the base type would then be automatically inherited by its subtypes. We do this for attributes (use="prohibited"), we do it for facet-based restriction, we do it for assertions; it's a shame we can't do it with content models. But that's opening up the discussion too far for comfort.
Comment 6 Michael Kay 2007-11-30 10:45:08 UTC
A further example here: consider the two types

B: ab?|ba?

D: a?&b? [count(*)>0]

where the square brackets indicate an assertion.

These two types are equivalent (both allow the instances a, b, ab, or ba), and as far as I can see our current rules require processors to treat D as a valid restriction of B. But I don't think this can be done without some sophisticated theorem proving (we know they're equivalent because we can enumerate the instances, but that clearly isn't a viable strategy in the general case.)
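
(For this particular pair the enumeration is easy to mechanize; a
throwaway Python check, applying the all group and the assertion by
brute force:)

  from itertools import permutations

  B = {'a', 'ab', 'b', 'ba'}                    # ab?|ba?
  interleavings = {''.join(p) for s in ('', 'a', 'b', 'ab')
                   for p in permutations(s)}    # a?&b?
  D = {w for w in interleavings if len(w) > 0}  # assertion count(*)>0
  print(B == D)                                 # True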

The problem is exacerbated by the fact that all the complexity arises in cases like this that will very rarely arise outwith conformance tests.
Comment 7 Noah Mendelsohn 2007-12-07 01:16:24 UTC
> B: ab?|ba?
> 
> D: a?&b? [count(*)>0]
> 
> where the square brackets indicate an assertion.
> 
> These two types are equivalent (both allow the
> instances a, b, ab, or ba), and as far as I can
> see our current rules require processors to treat
> D as a valid restriction of B. But I don't think
> this can be done without some sophisticated
> theorem proving


Whoa.  Do assertions count in subsumption or restriction checking?  I thought that after two years of trying to make a generalized restriction-requires-subsumption rule, we dropped back to requiring only subsumption of the content models (plus some other side conditions).  A quick look at the constraints seems to bear out that they deal only with subsumption of content models, but I could easily have missed something.

If I'm reading the above example correctly, it presupposes that we are checking the net content allowed by the combination of content models and assertions, and I didn't think that was so.  Am I misremembering our intent?

Also: in some of the earlier comments, it was suggested that we might consider an intersection-based approach.  Am I correct that proponents of that approach do want to retain our requirement that types be intensionally as well as extensionally connected for us to consider one as a derivation of another?  In other words, whatever we do about content checking (and I'm not necessarily in favor of the intersection approach but am willing to hear more), we would in any case continue to require that types be in a parent/child relationship as well?  I assume that was implicit in the earlier comments, but perhaps not.  I certainly would not be OK with dropping that.

Thank you.

Noah
Comment 8 Michael Kay 2007-12-07 09:31:08 UTC
>Whoa.  Do assertions count in subsumption or restriction checking?

You're right, the rule that says "A restricts B iff A's extent is a subset of B's extent" applies to A and B as "content types", not to the type as a whole.

This still leaves some cases that are very hard to prove, notably cases where one of the content types uses an all-group and the other uses a combination of sequence and choice. For example:

B: ab?|ba?|(empty)

D: a?&b? 

It's not clear that requiring all such cases to be analyzed gives a real user benefit. 
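
One practical line of attack (a rough sketch, helper name illustrative)
is to expand an all group of optional elements into the finite set of
child sequences it permits, reducing this case to a comparison of
finite languages:

  from itertools import combinations, permutations

  def all_group_of_optionals(elems):
      """Child sequences allowed by e1? & e2? & ... (finite)."""
      return {''.join(p)
              for k in range(len(elems) + 1)
              for subset in combinations(elems, k)
              for p in permutations(subset)}

  D = all_group_of_optionals('ab')  # a?&b?
  B = {'', 'a', 'ab', 'b', 'ba'}    # ab?|ba?|(empty), enumerated by hand
  print(D <= B, D == B)             # True True

Of course this expansion is exponential in the size of the all group,
which is exactly the cost the static algorithms are trying to avoid.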
Comment 9 David Ezell 2008-01-25 15:52:24 UTC
RESOLVED: As phase-1 agreement to resolve #5293:

(1) Restrictions involving all-groups can be checked at run-time; they need not be checked at compile time.

(2) Failure of the rule, exercised by an instance, is a schema error, not a validation error.

(3) Yes, we are introducing a notion of 'latent error' here.

(4) On the question of the whole type vs. just part of the type, no action under this heading.

AGREED.
Comment 10 C. M. Sperberg-McQueen 2008-05-21 03:45:59 UTC
A wording proposal with two different ways of resolving this issue
(one marked as a change, the other as 'not status quo') is now 
available at 

  http://www.w3.org/XML/Group/2004/06/xmlschema-1/structures.b5293.html
  (member-only link)

Comment 11 C. M. Sperberg-McQueen 2008-05-23 23:23:13 UTC
At its telcon today, the XML Schema WG considered the wording proposal
mentioned in comment #10.  As noted, it contains two ways of resolving
the issue:  adding a rule forbidding derivation of all groups from 
sequences and from any but simple choices, or specifying that run-time
checking of the relevant constraint is allowed when all-groups are involved.

The WG chose run-time checking, and instructed the editors to add a
statement that whether a processor finds the relevant errors statically
in all cases, dynamically in all cases, or sometimes one way and
sometimes the other, is implementation-defined.  (Where a processor
finds some such errors statically and others only dynamically, the
exact boundary between the two is implementation-dependent.)

The proposal has been integrated into the status-quo text with the desired
amendment, so I am marking this issue resolved.  

Michael, as the originator, please indicate etc. etc.  You know what to do
and you know how to do it.