3754 – UPA-constraint causing principal problems in document authoring

This is an archived snapshot of W3C's public bugzilla bug tracker, decommissioned in April 2019.

Summary: UPA-constraint causing principal problems in document authoring

Status:	CLOSED LATER

Alias:	None

Product:	XML Schema
Classification:	Unclassified
Component:	Structures: XSD Part 1
Version:	1.0/1.1 both
Hardware:	All All

Importance:	P2 enhancement
Target Milestone:	---
Assignee:	C. M. Sperberg-McQueen
QA Contact:	XML Schema comments list

URL:
Whiteboard:	important, hard (consensus elusive)
Keywords:	decided

Depends on:
Blocks:

Reported:	2006-09-19 21:18 UTC by Marie Bilde Rasmussen
Modified:	2007-05-02 11:50 UTC
CC List:	1 user

Description Marie Bilde Rasmussen 2006-09-19 21:18:31 UTC

I have real-life problems with the UPA (Unique Particle Attribution) restriction. I use W3C XML Schemas for document authoring in an intensively schema-aware XML editing system, and I spend a lot of effort designing the schemas to support the authoring process.

Let me state at the outset that I am very satisfied with the choice of W3C XML Schema as our schema language; I find the typing and other features of the standard very useful.

I do computational lexicography at a large Danish publishing house (Gyldendal Publishers). Gyldendal is the largest, market-leading dictionary publisher in Denmark. The texts/data we produce are dictionary data.

The distinction between so-called document-oriented and so-called data-oriented XML does not really fit dictionary entries very well; they can be seen as both types. Our lexicographers/authors create native XML directly. Among other schema design goals, I need the schemas to tell the lexicographer exactly which operations (e.g. element insertion/deletion/renaming) are valid from a given structural position while he edits a dictionary entry in the XML environment. This is why I have a problem of principle with the UPA restriction.

I found my latest example during my attempt to represent the correct word division of Danish dictionary lemmas (also known as keywords, headwords, ...). My grammar looks like this:

( hyphen, ( wordpart, ( ( ( hyphen, blank? ) | ( blank, hyphen? ) )?, wordpart )+ ) ) |
( ( wordpart, ( ( ( hyphen, blank? ) | ( blank, hyphen? ) )?, wordpart )+ ), hyphen? )

This can be reformulated in prose as:
- a word division of a lemma consists of at least two word parts
- between two word parts there may be at most one hyphen and at most one blank: either one, both, or neither. If both occur, they may come in either order
- a lemma may have an initial hyphen (then the part-of-speech of the lemma is "suffix") or a final hyphen (then its POS is "prefix"). The lemma cannot have both an initial and a final hyphen. Most lemmas have neither.

The <hyphen> and <blank> elements represent orthographic hyphens and blanks in the lemma. They do NOT represent the legal division points.
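
As a minimal sketch of the grammar above, assuming each element is encoded as one letter (h = hyphen, b = blank, w = wordpart), the content model can be written as a regular expression and tried against candidate lemmas. The names CODES, SEPARATOR, CORE, LEMMA and is_valid_division are illustrative only and do not come from any schema.

```python
import re

# One letter per element: h = hyphen, b = blank, w = wordpart.
CODES = {"hyphen": "h", "blank": "b", "wordpart": "w"}

SEPARATOR = r"(?:hb?|bh?)?"    # ( ( hyphen, blank? ) | ( blank, hyphen? ) )?
CORE = rf"w(?:{SEPARATOR}w)+"  # wordpart, ( separator, wordpart )+
# Branch 1: initial hyphen. Branch 2: optional final hyphen.
LEMMA = re.compile(rf"h{CORE}|{CORE}h?")


def is_valid_division(elements):
    """Return True if the sequence of element names matches the grammar."""
    encoded = "".join(CODES[name] for name in elements)
    return LEMMA.fullmatch(encoded) is not None


if __name__ == "__main__":
    print(is_valid_division(["wordpart", "hyphen", "wordpart"]))
    print(is_valid_division(["hyphen", "wordpart", "wordpart", "hyphen"]))
```

Python's regular expression engine backtracks, so it accepts this grammar without complaint; the difficulty described in this report is specific to the determinism that XML Schema requires of content models.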

The second branch of the outermost choice violates the UPA constraint: when a hyphen follows a word part, it cannot be determined without looking ahead whether it is a hyphen between two word parts or the final (trailing) hyphen of the lemma.
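
The conflict can be reproduced with a small determinism check. The sketch below builds the position (Glushkov) automaton of a content model and reports every place where two different particles for the same element name may come next. It is a simplified approximation of the UPA check, not the normative rule: it ignores wildcards, substitution groups and numeric occurrence ranges, and the tuple encoding and the function names are assumptions made for this illustration.

```python
from itertools import count


def sym(name): return ("sym", name)
def seq(*parts): return ("seq", *parts)
def alt(*parts): return ("alt", *parts)
def opt(part): return ("opt", part)
def plus(part): return ("plus", part)


def glushkov(model):
    """Return (names, first, follow) for the position automaton of model."""
    names, follow, ids = {}, {}, count()

    def walk(node):
        # Returns (nullable, first positions, last positions) of node.
        kind, children = node[0], node[1:]
        if kind == "sym":
            pos = next(ids)
            names[pos], follow[pos] = children[0], set()
            return False, {pos}, {pos}
        parts = [walk(child) for child in children]
        if kind == "alt":
            return (any(p[0] for p in parts),
                    set().union(*(p[1] for p in parts)),
                    set().union(*(p[2] for p in parts)))
        if kind == "seq":
            nullable, first, last = True, set(), set()
            for c_null, c_first, c_last in parts:
                for p in last:
                    follow[p] |= c_first
                if nullable:
                    first |= c_first
                last = c_last | (last if c_null else set())
                nullable = nullable and c_null
            return nullable, first, last
        c_null, c_first, c_last = parts[0]
        if kind == "plus":
            for p in c_last:
                follow[p] |= c_first  # loop back to the start of the group
            return c_null, c_first, c_last
        return True, c_first, c_last  # kind == "opt"

    _, first, _ = walk(model)
    return names, first, follow


def upa_conflicts(model):
    """List (after, element) pairs where two particles compete for element."""
    names, first, follow = glushkov(model)
    states = [("<start>", first)] + [(names[p], follow[p]) for p in sorted(follow)]
    conflicts = []
    for label, nexts in states:
        seen = set()
        for q in sorted(nexts):
            if names[q] in seen:
                conflicts.append((label, names[q]))
            seen.add(names[q])
    return conflicts


SEPARATOR = opt(alt(seq(sym("hyphen"), opt(sym("blank"))),
                    seq(sym("blank"), opt(sym("hyphen")))))
CORE = seq(sym("wordpart"), plus(seq(SEPARATOR, sym("wordpart"))))
MODEL = alt(seq(sym("hyphen"), CORE), seq(CORE, opt(sym("hyphen"))))

if __name__ == "__main__":
    print(upa_conflicts(MODEL))
```

Under these assumptions the only conflict reported for the full model is a hyphen after a word part, while the first branch taken alone is conflict-free.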

Michael Sperberg-McQueen and Xan Gregg both proposed, on the xmlschema-dev mailing list, that I formulate a less strict grammar and then run a Schematron process on top of it after the initial schema validation has been done. The Schematron process would then check the additional rules that could not be expressed in the W3C XML Schema language because of the UPA restriction. Michael also suggested that, alternatively, I could rename the final hyphen to get rid of the ambiguity (i.e. ambiguity only when seen top-down!).
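
The two-pass idea can be sketched as follows. Python stands in for the Schematron pass here, and the loose first-pass model (any non-empty mix of hyphen, blank and wordpart) is an assumption chosen for illustration; the looser grammar actually proposed on the mailing list is not recorded in this report. All function names and messages are hypothetical.

```python
ALLOWED = {"hyphen", "blank", "wordpart"}


def loose_schema_ok(elements):
    """First pass: a deliberately loose model, ( hyphen | blank | wordpart )+."""
    return bool(elements) and all(name in ALLOWED for name in elements)


def extra_rule_violations(elements):
    """Second pass: the rules the loose model cannot express."""
    # Split the sequence into runs of separators around the word parts:
    # runs[0] precedes the first word part, runs[-1] follows the last one.
    runs, current = [], []
    for name in elements:
        if name == "wordpart":
            runs.append(current)
            current = []
        else:
            current.append(name)
    runs.append(current)

    problems = []
    if len(runs) - 1 < 2:
        problems.append("fewer than two word parts")
    leading, trailing, between = runs[0], runs[-1], runs[1:-1]
    for edge, where in ((leading, "before the first"), (trailing, "after the last")):
        if edge not in ([], ["hyphen"]):
            problems.append(f"only a single hyphen is allowed {where} word part")
    if leading == ["hyphen"] and trailing == ["hyphen"]:
        problems.append("lemma has both an initial and a final hyphen")
    for run in between:
        if run.count("hyphen") > 1 or run.count("blank") > 1:
            problems.append("more than one hyphen or blank between two word parts")
    return problems


if __name__ == "__main__":
    lemma = ["hyphen", "wordpart", "wordpart", "hyphen"]
    print(loose_schema_ok(lemma), extra_rule_violations(lemma))
```

In this sketch a lemma with both an initial and a final hyphen passes the first pass and is rejected only by the second, so a schema-aware editor driven by the loose model alone would still offer that insertion.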

These are nice and very clever solutions. But I would like to take a polemic position on this question, and this is why I now raise the issue with the XML Schema WG.

For (human) document authoring purposes, it is of the greatest importance that the author feels confident that the underlying schema tells him exactly what he is allowed to do and what possibilities he has. Running a post-editing process only to find out that an element you inserted (because the schema-aware software proposed this very operation!) is actually invalid would weaken your confidence in the schema as a true implementation of the editorial principles that govern the type of text you work with.

Furthermore, the renaming strategy might seem neat to the designer and the data consumer (e.g. a data processing engineer), but it would blur the otherwise precise terminology of the grammar. In other words: why claim that a rose is not a rose is not a rose?

If W3C XML Schema is (also) intended to be used in document authoring processes, I would like to ask the WG to reconsider the UPA restriction in this light.

Thank you!
Marie Bilde Rasmussen,
Copenhagen, Denmark

Comment 1 C. M. Sperberg-McQueen 2007-05-02 00:32:36 UTC

The XML Schema Working Group discussed this issue (Bugzilla bug
3754) at some length during a face-to-face meeting at the end of March.

Note first that the UPA constraint of XSD 1.0 has in fact been
eased somewhat by the introduction of weakened wildcards.

There was some sentiment (at least one member of the WG) for going
further and eliminating the Unique Particle Attribution constraint 
entirely, as being irrational and unhelpful.  But those favoring 
that measure were in a distinct minority.  Others felt
that eliminating the constraint was too big a change for XML Schema 1.1
but that it might be worth coming back to later.  

Some in the WG argued that the UPA constraint does provide some help 
for certain kinds of tools and tool development. In this connection,
it was suggested that where determinism is helpful, it would suffice
for the spec to require that the input/output mapping (or in other words
the input / PSVI mapping) of a given complex type be deterministic;
UPA is strictly stronger than such a constraint.  (A non-deterministic
automaton may have a deterministic mapping if each pair of competing
particles will provide the same annotations in the PSVI.)  Unfortunately,
we don't at the moment know enough about the closure properties of
finite-state automata which produce output to be confident about
moving toward a constraint phrased in terms of them.
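
The difference between the two constraints can be illustrated with a toy check. In this sketch, which is an illustration and not a proposal discussed by the Working Group, each particle carries a declared_type field as a hypothetical stand-in for the PSVI annotations it would contribute; the type names are invented.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Particle:
    element: str        # element name the particle matches
    declared_type: str  # hypothetical stand-in for its PSVI annotations
    role: str           # where the particle sits in the content model


def violates_upa(competing_pairs):
    """UPA forbids any pair of particles competing for the same element."""
    return len(competing_pairs) > 0


def mapping_is_deterministic(competing_pairs):
    """The weaker rule only asks that competitors yield the same annotations."""
    return all(a.declared_type == b.declared_type for a, b in competing_pairs)


# The two hyphen particles that compete after a word part in the reported grammar.
inner = Particle("hyphen", "hyphenType", "between two word parts")
final = Particle("hyphen", "hyphenType", "end of lemma")
# A variant in which the final hyphen is given a different type.
retyped = Particle("hyphen", "affixMarkerType", "end of lemma")

if __name__ == "__main__":
    print(violates_upa([(inner, final)]), mapping_is_deterministic([(inner, final)]))
    print(mapping_is_deterministic([(inner, retyped)]))
```

If both hyphen particles resolve to the same declaration, as assumed for the first pair, the content model fails UPA but would satisfy the weaker mapping-based rule.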

In the end, the chair determined that the Working Group did not have
sufficient consensus to make this change, so we agreed to close the
issue without further action.  Since the proposal to ease the UPA
constraint had active support, we chose to give the issue a resolution 
of LATER, indicating that we recommend to any Working Group
preparing a future version of XML Schema that they consider this
issue anew.  But no change will be made for XML Schema 1.1.

Accordingly, I'm marking this issue RESOLVED / LATER.

Dr. Rasmussen, as the originator of the issue, I ask that you update
the record either by changing its status to CLOSED, to indicate that
(however regretfully) you accept the disposition of the question by the
XML Schema Working Group, or else by changing the status to
REOPENED, to indicate that you are dissatisfied with the Working
Group's response and wish to register a formal objection to the
decision (which means in effect that you appeal the decision of the WG to
the Director; all formal objections are reviewed by the Director
of the W3C when specifications progress to certain document
maturity statuses).

I am sorry that the Working Group proved unable to do as you suggested,
and that I (as a supporter of your view) proved unable to persuade my
colleagues in the WG to take a different course of action, but I hope 
that you will accept the decision (for this version of XML Schema,
at least!), and that you will also accept our thanks for raising the
issue and allowing us to discuss it in the light of a concrete example.