RE: additional constraints validation variant

Schema-dev people. Pardon us while we have an OAGIS-oriented conversation.
I'd like to keep y'all in the loop, since this particular issue is one of
the stickiest of the Schema-induced challenges we faced as we developed
OAGIS on Schema. And challenges to our approach persist. 
 
You may want to skip down to the discussion of the second validation step.
 
We'd love to look ahead to some relief, in the form of future Schema Rec
evolution. Perhaps you Working Group people can read and comment. 
 
Let me know when it goes too far off-topic and I'll stop CCing the list.
 
Read on...
 
-----Original Message-----
From: Paul Kiel [mailto:paul@hr-xml.org]
Sent: Wednesday, July 31, 2002 2:53 PM
To: Mark Feblowitz; xmlschema-dev@w3.org
Cc: David Connelly (E-mail); Duane Krahn (E-mail); Satish Ramanathan
(E-mail); Andrew Warren (E-mail); Kurt A Kanaskie (Kurt) (E-mail); Mark
Feblowitz; Michael Rowell (E-mail)
Subject: Re: additional constraints validation variant
 
Mark,
 
I see you are ahead of the curve as usual.
 
Gosh.
 
>>The reasons we rejected it had to do with complexity: first, it's complex
to manage multiple schemas and to link to "the right" generated schema.
That's not so bad, but can be daunting. 
 
[Paul] A clear issue, I realize.  But not a deal-breaker.  The obvious
suspect would be naming conventions of some type.
 
True. Not impossible, just complex. In fact, nothing in the rationale says
that your suggestion is impossible. It comes down to a judgment call as to
whether it's worth the effort and relative complexity. I liked the idea and
still do. But I am not yet convinced that it's a complete fit with the rest
of the OAGIS 8 architecture. Perhaps.
 
In some cases, though, I fear that the approach may be barely tractable. A
difficult part comes in assembling a Noun from shared Components. A
particular Noun might need only one part of one component, another part of
another, and so on. (Simple XPath expressions stating the required parts are
the easiest to apply and maintain.) Since all components are stored in a
single Components.xsd, a huge number of versions of the Components file
could be needed to accommodate all of the required combinations of
cardinalities. One could pre-assemble the Noun from its components prior to
overlaying the constraints, but then you're doing even more of the parser's
job.
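
To make the fan-out concrete, here is a purely hypothetical sketch (the type
and element names are invented, not taken from the actual Components.xsd):
one shared, all-optional component, and two Nouns that each need a different
subset of it required.

  <!-- Hypothetical fragment of a shared Components.xsd: everything optional -->
  <xs:complexType name="PartyType">
    <xs:sequence>
      <xs:element name="PartyId" minOccurs="0"/>
      <xs:element name="Name"    minOccurs="0"/>
      <xs:element name="Address" minOccurs="0"/>
    </xs:sequence>
  </xs:complexType>

  <!-- Noun A might need only PartyId required; Noun B might need PartyId,
       Name, and Address. Pre-generating constrained schemas means one
       variant of this type (and of the file holding it) per combination. -->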
 
Or one could carve up the components into separate files. Other standards
have done this. This could amount to a significant number of files, with
significant runtime overhead of including/importing each component file
(variant) for each of the many components used. 
 
In addition to assembly from components, there's also the problem of
inheritance: If I have a PurchaseOrder of type Order, my constraints could
(and would) apply to both the local content of the PurchaseOrder and the
inherited content of Order. Still a mere "logisticality" - you'd have to
generate the Order variants and pick the right one for the context. The
trick is in getting your preprocessor to correctly derive the right set of
Order variants for each of PurchaseOrder, SalesOrder, and so on. Fanout,
again.
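
A sketch of that inheritance wrinkle, with invented type names rather than
the actual OAGIS declarations: constraints on a PurchaseOrder have to reach
both the inherited Order content and the local additions, and the
preprocessor has to regenerate that for every derived type.

  <!-- Hypothetical relaxed base type -->
  <xs:complexType name="OrderType">
    <xs:sequence>
      <xs:element name="OrderHeader" minOccurs="0"/>
      <xs:element name="OrderLine"   minOccurs="0" maxOccurs="unbounded"/>
    </xs:sequence>
  </xs:complexType>

  <!-- Hypothetical derivation: a cardinality variant of PurchaseOrderType
       must constrain OrderHeader/OrderLine (inherited) as well as
       PurchaseOrderSchedule (local), so each derived type multiplies the
       set of generated variants. -->
  <xs:complexType name="PurchaseOrderType">
    <xs:complexContent>
      <xs:extension base="OrderType">
        <xs:sequence>
          <xs:element name="PurchaseOrderSchedule" minOccurs="0"/>
        </xs:sequence>
      </xs:extension>
    </xs:complexContent>
  </xs:complexType>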
 
>>Let's say I have a PurchaseOrder used in 6 different contexts. I generate
6 variants from the one relaxed model plus the 6 sets of
separately-specified cardinality constraints. Now I derive an extension to
PurchaseOrder. That means I have 6 more generated variants (36 if I'm
foolish enough to allow extensions to the first 6). Several challenges
arise:
 >>First, I must make sure that my derived, extended set also follows the
original cardinality constraints. This also means that I must invent a
constraint language that mirrors the extensibility of my schema. 
>>The big challenge comes when making sure that any user of the generated,
extended variants uses the correct one, from the correct set. What if I have
two layers of extension? More? In theory, it's possible, but practically
speaking, we guessed that getting the cascading schemaLocation hints correct
would be a significant challenge.
 
[Paul] Not sure I understand.  Could one not create an xslt that would take
any constraint and apply it to a schema?  So applying constraints to
extensions would use the same xslt.  It's only limited by the ability to map
an xpath in a schematron constraint assertion to its schema definitions (not
easy, as anyone who has created stylesheets for schemas knows, but still
doable). 
 
Again, not a deal-breaker, but difficult. Yes, you could assume extension
only. You could come up with an approach whereby a new set of extended xsd
files is generated from the base set of constraints plus the sets for the
overlays: first transform according to the base constraints/XSLT, then
transform further with an extended set for each overlay.  You'd have to be
quite careful that the constraint language has
regular and predictable semantics, compatible with your extension facility.
Correctly crafting that language and the constraints could get tricky, but
not impossible.
 
[Paul] If the derived schemas are static with a clear naming convention,
could not the problems associated with "finding the right one" be mitigated
to some extent?
 
Seems so. The cleverness would not be limited to having a clever storage
scheme; it would also require appropriate adjustments to the schemaLocations
for all import/include statements.
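
For instance (file names purely hypothetical), each generated variant could
carry its context in the file name, and every include/import within it would
have to be rewritten to point at the matching variant:

  <!-- In the generated ProcessPurchaseOrder schema -->
  <xs:include schemaLocation="Components_ProcessPurchaseOrder.xsd"/>

  <!-- In the generated CancelPurchaseOrder schema -->
  <xs:include schemaLocation="Components_CancelPurchaseOrder.xsd"/>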
 
[Paul] Your point on extensions is well taken.  I guess what I would need to
ask is not how much of a driver extension is, but what kind?  My experience
is that for the most part people need only simple extensions, and rarely
need layers upon layers upon layers of extension.  Should the need to
simplify the use of many layers of extension be valued above the ability to
simplify the processing model?  I speak of usability here.   Would a simpler
processing model with simple extensions meet the 80-20 rule?
 
Indeed, it's a tradeoff. Limiting vertical extensibility is an option.
Trouble is, who's to say how many layers would be enough? Let's say you have
the base OAGIS vocabulary. You layer on the HR-XML vocabulary. You're done,
but what about your client organizations? Would they be willing to stop
there, or would they want to customize/localize further?
OAGI opted for a simple and unlimited layering, and also a simple (albeit
two-step) validation model. We tried other options, and things were either
too arbitrarily limited, or simply too messy for the average user (usually
both). 
 
That second validation step, which is arguably duplicative of validating
parser functionality, is what most people seem to be pushing back hard
against. What likely triggers the pushback is the apparent "wastefulness",
plus the mere existence of an extra validation step and the additional
infrastructure it requires. Mostly, though, it offends our sensibilities -
especially those of us who were trained to cherish every cycle. So far,
we've seen no evidence that the second pass is prohibitively expensive in
practice. And, like validation, it can be turned off if needed. Still...
 
Do recall that cardinality checking in validating parsers is not static.
Regardless of the mechanism, there is active code checking for cardinality
violations. The key question is this: what additional overhead must be
devoted to the XSLT processing? If it's done in the same parse (same JVM, if
it's Java), it amounts to the additional cost of applying each XPath
expression (plus some other overhead). Is that an affordable inefficiency?
We need more
data.
 
Since we can imagine a scenario where the constraints can be generated and
trivially evaluated by a validating parser, it seems as though this is
something that Schema could support. The fact that it represents multiple,
alternative constraint sets is what makes it more of a challenge to achieve.
If only derivation by restriction were usable here... Or if Schema supported
an embedded constraint facility. But I repeat myself.
 
Development-time transformation is indeed a cool idea, and one that I tried
to pursue. It may be worth another try. Just be mindful that it must play well
with the remainder of the representation architecture and language features.
 
>>Another, somewhat unrelated reason for rejecting this approach was that we
liked that we could use Schematron for other, non-cardinality-oriented
constraints, such as the much-sought-after co-occurrence constraint. With
development-time generation, the only things that could be transformed into
Schema were the things that were supportable in Schema. Co-occurrence and
other similar constraints could not be supported development-time, simply
because there is no equivalent to transform to in Schema.
 
[Paul] Another good point.  I don't see how this conflicts however.  The
"XSDValidationFlow" - a poor name I know - uses schematron constraints in an
xml file just as the "InstanceValidationFlow" does.  In both cases
co-occurrence constraints could be done in a second-layer validation, as
they would have to be, since they cannot be represented in schema, as you
state.  I see this as a separate issue and not in conflict.
 
Only a conflict if your desire was to do away with the added overhead of
two-pass validation (which, BTW, I don't see on the XSDValidationFlow
diagram). IMHO, a combined architecture that both pre-applies the
cardinality constraints and does post-validation checks would be optimal for
processing efficiency, but I fear that the size and complexity of the
solution would be too great.
 
Thanks for your insights Mark,
Paul
 
 
And for yours,
 
Mark
 
----- Original Message ----- 

From: Mark Feblowitz <mfeblowitz@frictionless.com>
To: 'Paul Kiel' <paul@hr-xml.org>; xmlschema-dev@w3.org
Cc: David Connelly <dconnelly@openapplications.org>; Duane Krahn
<duane.krahn@irista.com>; Satish Ramanathan <Satish.Ramanathan@mro.com>;
Andrew Warren <awarren@openapplications.org>; Kurt A Kanaskie
<kkanaskie@lucent.com>; Mark Feblowitz <mfeblowitz@frictionless.com>;
Michael Rowell <mrowell@openapplications.org>
Sent: Wednesday, July 31, 2002 1:52 PM
Subject: RE: additional constraints validation variant
 
Paul - 
 
Your idea is quite compelling. In fact, it was one of the many we considered
(and ultimately abandoned). I liked this approach so much I even mocked it
up myself.
 
For the unfamiliar, the problem could be summarized as follows: 
 
How does one support multiple uses of "the same" complexType, with different
minimum cardinalities in each content model?
 
The problem arises in OAGIS when we want to apply, e.g., the noun
"PurchaseOrder" in different contexts: CancelPurchaseOrder requires only
minimal, identifying PurchaseOrder content (but could contain any), and
ProcessPurchaseOrder requires most of the PurchaseOrder content. In the
former, most of the content would be optional (minOccurs="0"); in the
latter, most would be required (minOccurs="1").
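
A minimal sketch of that tension, with abbreviated element names rather than
the actual OAGIS 8 declarations: the same content model would need different
minOccurs values depending on the BOD it appears in, and one type cannot
declare both.

  <!-- What CancelPurchaseOrder needs: identification only, the rest optional -->
  <xs:element name="PurchaseOrderId"   minOccurs="1"/>
  <xs:element name="PurchaseOrderLine" minOccurs="0" maxOccurs="unbounded"/>

  <!-- What ProcessPurchaseOrder needs: most of the content required -->
  <xs:element name="PurchaseOrderId"   minOccurs="1"/>
  <xs:element name="PurchaseOrderLine" minOccurs="1" maxOccurs="unbounded"/>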
 
With complexType derivation by restriction being a non-starter for us, we
were forced to come up with an alternative. What we settled on, after months
of painstaking exploration, was a "relaxed" model of all-optional content
(all element content with minOccurs="0"), with separately specified
cardinality constraints layered on via post-validation Schematron
processing. This requires two-pass validation: schema-validation and
Schematron processing. That extra step, although achievable using standard
technologies (schema-validating parser plus XSLT processor), offends some
sensibilities and raises efficiency concerns. (Some efficiency concerns will
be addressed when XSLT processors facilitate schema-validation plus
transformation, which should be very soon).
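
To illustrate the overlay (the rule context and element names here are
invented, not taken from the actual OAGIS constraint files, and sch is
assumed bound to the Schematron namespace): the schema leaves the content at
minOccurs="0", and a per-context Schematron pattern restores the required
cardinalities in the second pass.

  <!-- Hypothetical second-pass cardinality constraints for one context -->
  <sch:pattern name="ProcessPurchaseOrder cardinality">
    <sch:rule context="ProcessPurchaseOrder/DataArea/PurchaseOrder">
      <sch:assert test="PurchaseOrderHeader">
        PurchaseOrderHeader is required in this context.
      </sch:assert>
      <sch:assert test="PurchaseOrderLine">
        At least one PurchaseOrderLine is required in this context.
      </sch:assert>
    </sch:rule>
  </sch:pattern>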
 
Paul's suggested approach is a development-time alternative: rather than
performing interchange-time constraint checking, he proposes that the same
cardinality constraints be used to guide a transformation of the relaxed
schema into other, cardinality-constrained schemas. 
 
The benefits are obvious: no extra runtime machinery is required - only a
schema-validating parser is necessary to check for correct structure, types
and cardinalities.
 
The reasons we rejected it had to do with complexity: first, it's complex to
manage multiple schemas and to link to "the right" generated schema. That's
not so bad, but can be daunting. But the real difficulty comes in managing
the fan-out in the face of further extensions. 
 
Let's say I have a PurchaseOrder used in 6 different contexts. I generate 6
variants from the one relaxed model plus the 6 sets of separately-specified
cardinality constraints. Now I derive an extension to PurchaseOrder. That
means I have 6 more generated variants (36 if I'm foolish enough to allow
extensions to the first 6). Several challenges arise:
 
First, I must make sure that my derived, extended set also follows the
original cardinality constraints. This also means that I must invent a
constraint language that mirrors the extensibility of my schema. 
 
The big challenge comes when making sure that any user of the generated,
extended variants uses the correct one, from the correct set. What if I have
two layers of extension? More? In theory, it's possible, but practically
speaking, we guessed that getting the cascading schemaLocation hints correct
would be a significant challenge.
 
Another, somewhat unrelated reason for rejecting this approach was that we
liked that we could use Schematron for other, non-cardinality-oriented
constraints, such as the much-sought-after co-occurrence constraint. With
development-time generation, the only things that could be transformed into
Schema were the things that were supportable in Schema. Co-occurrence and
other similar constraints could not be supported development-time, simply
because there is no equivalent to transform to in Schema.
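
For anyone unfamiliar with the co-occurrence case, a small hedged sketch
(invented element names, sch assumed bound to the Schematron namespace) of
the sort of rule Schema cannot express but Schematron states directly:

  <!-- Hypothetical co-occurrence rule: a Charge that carries Tax must
       also carry TaxJurisdiction -->
  <sch:rule context="Charge">
    <sch:assert test="not(Tax) or TaxJurisdiction">
      A Charge that specifies Tax must also specify TaxJurisdiction.
    </sch:assert>
  </sch:rule>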
 
I'd be happy to discuss this all further. I'd be even happier if at least
this subset of constraints were somehow incorporated into Schema. 
 
 
Mark
 
 
-----Original Message-----
From: Paul Kiel [mailto:paul@hr-xml.org]
Sent: Wednesday, July 31, 2002 12:18 PM
To: xmlschema-dev@w3.org
Cc: Mark Feblowitz
Subject: additional constraints validation variant
 
Greetings folks,
 
I have been working with the schematron "adding additional constraints"
issues that are most accurately addressed in the OAGIS 8.0 design.  This
design solves quite well the problem of wanting a single general model that
is constrained by context.  Nice job, folks!  (For example, one of our cases
has a general HR-XML TimeCard with contextual variations such as
"DeleteTimeCard", "CreateTimeCard", "UpdateTimeCard", etc.)
 
The use of schematron here is perfect.  I would like to add a wrinkle,
perhaps a variant of this approach.  The links below illustrate two methods
of achieving the same goals, both using schematron to document constraints.
However, where these constraints are applied differs.  
 
The first link, "InstanceValidationFlow", shows how one may use a document
(in this case an HR-XML TimeCard) in a validation flow.  The two-step
approach (parser plus xslt) works well.  
 
http://ns.hr-xml.org/temp/InstanceValidationFlow.gif
 
The second link, "XSDValidationFlow", shows a flow where validation occurs
via a derivation of the schema itself, instead of via a second step over the
instance.  This would maintain the goal of a general model with
context-specific constraints, but without a second validation step (which is
where I get the pushback from my constituents, who otherwise like the use of
schematron).
 
http://ns.hr-xml.org/temp/XSDValidationFlow.gif
 
What do you think of this approach?  I haven't decided if I like it yet, but
I thought enough of it to merit a thread here.  
 
The development of a Constraints2XSD stylesheet would not be simple, but I
would think doable - and reusable!  [I talked with Mark Feblowitz about this
once and he was, I believe, intrigued by it -- Mark, is that the case??]
Might anyone out there be interested in collaboratively creating such an
animal?
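
As nothing more than a starting point for discussion, here is one shape such
a Constraints2XSD stylesheet might take; the constraints vocabulary below is
invented, and the real work - mapping the XPaths in Schematron assertions
onto the schema declarations they constrain - is glossed over.  It is an
identity transform over the relaxed schema that raises minOccurs to 1 for
any element declaration named in a simple list of required elements.

  <!-- Sketch of a Constraints2XSD transform (file and element names hypothetical).
       Assumes the relaxed schema carries explicit minOccurs="0" attributes. -->
  <xsl:stylesheet version="1.0"
      xmlns:xsl="http://www.w3.org/1999/XSL/Transform"
      xmlns:xs="http://www.w3.org/2001/XMLSchema">

    <!-- A simple constraints document, e.g.
         <required><element>PurchaseOrderHeader</element></required> -->
    <xsl:param name="constraints-uri" select="'ProcessPurchaseOrder-constraints.xml'"/>
    <xsl:variable name="required"
        select="document($constraints-uri)/required/element"/>

    <!-- Identity: copy the relaxed schema through unchanged -->
    <xsl:template match="@*|node()">
      <xsl:copy>
        <xsl:apply-templates select="@*|node()"/>
      </xsl:copy>
    </xsl:template>

    <!-- Raise minOccurs to 1 on element declarations named in the constraints -->
    <xsl:template match="xs:element/@minOccurs">
      <xsl:choose>
        <xsl:when test="../@name = $required">
          <xsl:attribute name="minOccurs">1</xsl:attribute>
        </xsl:when>
        <xsl:otherwise>
          <xsl:copy/>
        </xsl:otherwise>
      </xsl:choose>
    </xsl:template>

  </xsl:stylesheet>

Run once per context with a different constraints document, the output would
be the statically named, cardinality-constrained schema that the
XSDValidationFlow hands straight to the parser.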
 
Pluses and Minuses:
Method 1 - InstanceValidationFlow
+ transforming constraints to validating xslt easily replicated (i.e. via
schematron skeleton xsl)
- results in many xslts lying around for validation (one for each context)
- requires another validation layer via XSLT (performance)
 

Method 2 - XSDValidationFlow
+ single validation layer
+ makes most use of parser
- results in many schemas lying around for validation (one for each
context)
- xslt for transformation of constraints to xsd not developed (yet!?!)
 
 
 
 
W. Paul Kiel
HR-XML Consortium
