
Replicating DTD Functionality

Using XML Schema

W3C XML Schema Working Group

20 April 2000

Status of this document

This document was prepared for the W3C XML Schema Working Group by an ad hoc task force, as a public reply to comments from Murray Altheim and Ann Navarro concerning the replication of DTD-like functionality in XML Schema. The task force members were Michael Sperberg-McQueen, Eve Maler, and Norm Walsh.

The XML Schema WG formally approved this document on 20 April 2000.

This document is publicly accessible.



1. Background

An important goal of the W3C work on XML Schema is to ensure that XML Schemas are at least as expressive as XML 1.0 document type definitions. This document discusses several questions which have arisen concerning the expressive power of XML Schema, and the ability to replicate, in XML Schema, various constructs available in DTDs.

The W3C XML Schema WG thanks those who have raised these questions and given us the opportunity to discuss them here. We appreciate the input, and we hope that this note responds adequately to your concerns. If it does not, please let us know during the Last Call period, which is expected to end on 12 May 2000.

1.1 A note on expressive power

It should be noted at the outset that the phrase expressive power is a technical term in formal language theory. The expressive power of a formalism for defining languages is its ability to delineate precisely all the members of some set of languages. (A language is usually formalized as some set of strings. For our purposes, a language is some set of documents, most precisely some set of documents represented as XML information sets in which the boundaries of parsed entities are not represented.) Given two formalisms P and W, P is more powerful than W if any language which can be defined using W can also be defined using P, and there is at least one language which can be defined using P which cannot be defined using W. Another way to say this is to say that the set of languages whose rules are expressible by P is a proper superset of the set of languages whose rules are expressible by W.
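Stated symbolically, the definition above reads as follows. The notation $\mathcal{L}(F)$ for the set of languages definable in a formalism $F$ is an assumption introduced for this illustration only; it is not used elsewhere in this note.

```latex
\[
  P \text{ is more powerful than } W
  \;\iff\;
  \mathcal{L}(W) \subsetneq \mathcal{L}(P)
\]
\[
  P \text{ is at least as expressive as } W
  \;\iff\;
  \mathcal{L}(W) \subseteq \mathcal{L}(P)
\]
```

The second, weaker relation is the one the goal in section 1 asks for, with P read as XML Schema and W as the XML 1.0 DTD notation.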

The term expressive power thus says nothing about compactness, convenience, or the suggestiveness of the notation. These are all important features in a formalism, though only the first is objectively measurable. They are all (within limits) goals which we hope XML Schema achieves. But they are not what we mean by expressive power.

The goal of making XML Schema as expressive as the XML 1.0 DTD notation means that for any DTD (document type definition), there should be an XML Schema which accepts the same set of documents. This is not the same as ensuring that every construct in a DTD maps one-to-one onto some construct in an XML Schema. It is also not the same as ensuring that a single file in XML Schema notation can be used, in different environments, to define multiple schemas, in the same way that a single DTD file can, by suitable manipulation of parameter-entities and suitable use of conditional sections, be used to generate several different document type definitions (several different sets of effective declarations).
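As a small illustration of "accepts the same set of documents", consider a document carrying a DTD in its internal subset, followed by a schema intended to accept the same memo documents. The element and attribute names (memo, title, para, status) are hypothetical. The schema uses the syntax and namespace of the later XML Schema 1.0 Recommendation, which is an assumption here: the draft current when this note was written differs in detail.

```xml
<?xml version="1.0"?>
<!DOCTYPE memo [
  <!ELEMENT memo  (title, para+)>
  <!ELEMENT title (#PCDATA)>
  <!ELEMENT para  (#PCDATA)>
  <!ATTLIST memo  status (draft | final) "draft">
]>
<memo status="final">
  <title>Example</title>
  <para>Text.</para>
</memo>
```

```xml
<?xml version="1.0"?>
<xs:schema xmlns:xs="http://www.w3.org/2001/XMLSchema">
  <xs:element name="memo">
    <xs:complexType>
      <!-- (title, para+) becomes a sequence with an unbounded last member. -->
      <xs:sequence>
        <xs:element name="title" type="xs:string"/>
        <xs:element name="para" type="xs:string" maxOccurs="unbounded"/>
      </xs:sequence>
      <!-- The enumerated attribute becomes a restriction of NMTOKEN. -->
      <xs:attribute name="status" default="draft">
        <xs:simpleType>
          <xs:restriction base="xs:NMTOKEN">
            <xs:enumeration value="draft"/>
            <xs:enumeration value="final"/>
          </xs:restriction>
        </xs:simpleType>
      </xs:attribute>
    </xs:complexType>
  </xs:element>
</xs:schema>
```

The constructs do not correspond one-to-one (the DTD has three element declarations, the schema one global element with two local ones), which is the point made above: equal expressive power does not require a construct-by-construct mapping.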

The fact is that XML Schema achieves some of the same things as DTDs in rather different ways. Some of the long-established methods of DTD design, including the extensive use of parameter entities for widely differing functions, are not directly transferable to XML Schema, and some kinds of functionality are likely to require the schema designer to think about things in rather different, and sometimes non-obvious, terms. Sometimes this will make things seem odd to anyone accustomed to DTD construction; we hope, however, that seeming odd is not the same thing as being wrong. But it is probably fair to say that so far the XML Schema WG has paid more explicit attention to expressive power than to convenience features, and that the maintainability of the language for XML Schema documents can almost surely be improved.

There are tradeoffs, however, between making schema constructs relate more directly to useful abstractions of document content and structure, on the one hand, and allowing the kinds of tricky DTD manipulation made possible by suitably devious use of parameter entities and conditional sections, on the other. The correct tradeoff can best be found through experiment and discussion.

1.2 List of questions

The questions to which we are responding here are summarized in the chairs' list of candidate issues thus:

The salient parts of the email discussion are reproduced in the discussions below, and the full email messages in which the questions were raised are given in an appendix.

2. Question 2G0: Noncontiguous declarations (multiple ATTLISTs)

Several people have suggested that it would be useful to allow noncontiguous declarations of various kinds. Dave Raggett's work on assertion-based grammars has assumed that the set of assertions about some object need not be contiguous. In that and similar work, no construct in a markup language need be declared in one block or unit. Comments on XML Schema have, however, raised this topic only with respect to attribute declarations, perhaps because in DTDs it is only attribute-list declarations which can non-vacuously occur several times for the same construct (for the same element type).

Murray Altheim describes the desired functionality thus:

the ability to declare attributes from multiple locations within the schema (as provided in DTDs via multiple ATTLISTs, a feature long-demanded in the SGML community and included in Annex K of ISO 8879 TC2, the "Web SGML Adaptations.")

Ann Navarro describes it thus:

1. I need to be able to extend an attribute collection (a template) without defining a new one.

It must be possible to define collections of attributes that can be referenced by elements, and to extend those attribute sets from within arbitrary modules. For example, XHTML defines the Common attribute collection. This collection is referenced by a variety of elements. However, the content of the set is modified by the inclusion of the Legacy module. When this module is included, the style attribute is added to the Common attribute collection, and that addition must be reflected in every element that references it.

There are several cases in which users of DTDs commonly use multiple ATTLIST declarations:

2.1 User extension of base schema

First, a user of a standard DTD may supply an additional ATTLIST in the internal subset of the DTD, to add attributes not defined in the standard DTD or to override the declarations of attributes which are defined in the standard DTD.
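For concreteness, a minimal sketch of this DTD practice follows. The file name standard.dtd and the element and attribute names are hypothetical, and standard.dtd is assumed to declare a status attribute on p with a different default.

```xml
<?xml version="1.0"?>
<!DOCTYPE doc SYSTEM "standard.dtd" [
  <!-- Internal-subset declarations are read first, so where both
       subsets declare the same attribute, the one here is binding. -->

  <!-- Add an attribute that standard.dtd does not declare for p. -->
  <!ATTLIST p  reviewer CDATA #IMPLIED>

  <!-- Override the declaration of an attribute that standard.dtd
       is assumed to declare for p. -->
  <!ATTLIST p  status (draft | final) "final">
]>
<doc>
  <p reviewer="ems">Text.</p>
</doc>
```

Both effects rest on the XML 1.0 rules that multiple attribute-list declarations for one element type are merged and that the first declaration of a given attribute wins.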

XML Schema does not support the addition of attributes to an element in this way without the change being visible to processors. A user who wishes to add some new optional attributes to an existing complex type (call it T) in an existing schema (call it S) can

If T has been defined with an <anyAttribute> element (a place-holder particle allowing undeclared attributes from specified namespaces), then no further work is required. Otherwise, each element instance which carries the new attributes must signal that it belongs to type TL, not just to type T, by carrying the information xsi:type="TL" (with the appropriate namespace prefix for the XML Schema / Instances namespace, and the actual name of type TL).
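A minimal sketch of the derivation route, under several assumptions: the namespace URIs, the file name S.xsd, the element name item and the attribute name reviewer are hypothetical; T is assumed to have complex (not simple) content and to allow empty content; and the syntax is that of the later XML Schema 1.0 Recommendation rather than the draft current when this note was written.

```xml
<?xml version="1.0"?>
<!-- Schema L: imports S and derives TL from T by extension. -->
<xs:schema xmlns:xs="http://www.w3.org/2001/XMLSchema"
           xmlns:s="http://example.org/S"
           targetNamespace="http://example.org/L">
  <xs:import namespace="http://example.org/S" schemaLocation="S.xsd"/>

  <!-- TL accepts everything T accepts, plus one new optional attribute. -->
  <xs:complexType name="TL">
    <xs:complexContent>
      <xs:extension base="s:T">
        <xs:attribute name="reviewer" type="xs:string" use="optional"/>
      </xs:extension>
    </xs:complexContent>
  </xs:complexType>
</xs:schema>
```

An instance element that S declares with type T (here the hypothetical s:item) then signals the derived type explicitly:

```xml
<s:item xmlns:s="http://example.org/S"
        xmlns:l="http://example.org/L"
        xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance"
        xsi:type="l:TL"
        reviewer="ems"/>
```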

If the user wishes to add required attributes, and to prevent the use of the original complex type T (on which the new attributes are not required), then substantially more work is required. The user should import S into L, and for each complex type and element type in S, derive a similar type in L. Changes which require the use of the derived types, rather than the original types, must be propagated upward in the element and type hierarchies as needed. In the worst case, this method is very similar to copying the entire source schema, though it does have the advantage that schema-aware processors can inform downstream applications that the types in question are derived from those of the base schema (which may be enough to allow applications written for schema S to process instances of schema L); this is not possible with DTDs.

2.2 Modular construction and maintenance of a schema

Thinking about the questions raised by Murray Altheim and Ann Navarro has made clear that most of our thinking in the XML Schema WG has been focused on the case where modular construction of a schema involves the construction, at schema-validation time, of a schema constructed from several modules, each defining elements, attributes, etc., in a particular namespace. The case of modular construction of the schema components within a namespace has received less attention.

There appear to be two cases: (1) modularity for the sake of simplifying schema development, documentation, and maintenance, and (2) modularity for the sake of allowing users to select, document by document, from a palette of modules. The first case is treated in this section, the other in the next section.

In a modular DTD, all the attributes associated with a given bit of functionality may be defined in a particular file (using ATTLIST declarations, or for older DTDs using parameter entities). A driver file embeds each module-definition file in sequence. The resulting set of effective declarations could be expressed by a series of declarations in a single DTD file, without the use of multiple ATTLIST declarations, at the cost of making maintenance more difficult and making the DTD harder to read and understand.

A similar effect can be achieved in XML Schema by the following method. For purposes of concreteness, we imagine a schema with modules for (a) basic text and document structure, (b) metadata in a header, (c) common phrase-level and chunk-level elements, (d) hyperlinking, specialized segmentations, and alignment, and (e) linguistic analysis. Let us consider a type T, which carries several attributes, some belonging to the basic text structure module, some to the common-constructs module, and so on.

This method allows for the physical and logical separation of attribute declarations which belong to different modules, and thus has all the same advantages for maintenance of the schema as does the use of multiple attribute-list declarations.
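One way to realize this kind of separation, sketched here on the assumption that named attribute groups are the mechanism used, is to let each module define an attribute group for T and to have the declaration of T refer to all of them. File names, group names, attribute names and the namespace URI are hypothetical, only two of the five modules are shown, and the syntax follows the later XML Schema 1.0 Recommendation.

```xml
<?xml version="1.0"?>
<!-- linking.xsd: the hyperlinking module's attributes for T. -->
<xs:schema xmlns:xs="http://www.w3.org/2001/XMLSchema"
           xmlns="http://example.org/doc"
           targetNamespace="http://example.org/doc">
  <xs:attributeGroup name="T.linking.attrs">
    <xs:attribute name="corresp" type="xs:anyURI"/>
    <xs:attribute name="next" type="xs:anyURI"/>
  </xs:attributeGroup>
</xs:schema>
```

```xml
<?xml version="1.0"?>
<!-- driver.xsd: assembles the modules and declares T. -->
<xs:schema xmlns:xs="http://www.w3.org/2001/XMLSchema"
           xmlns="http://example.org/doc"
           targetNamespace="http://example.org/doc">
  <!-- core.xsd is assumed to define T.core.attrs in the same way. -->
  <xs:include schemaLocation="core.xsd"/>
  <xs:include schemaLocation="linking.xsd"/>

  <xs:complexType name="T" mixed="true">
    <xs:attributeGroup ref="T.core.attrs"/>
    <xs:attributeGroup ref="T.linking.attrs"/>
  </xs:complexType>
</xs:schema>
```

The declaration of T names each module once; the attributes themselves live in, and are maintained with, the module they belong to.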

2.3 Pizza model

In many cases, DTDs are modularized not (or not solely) in order to simplify documentation or maintenance, but to allow the user to generate multiple views of the DTD, including in the set of effective declarations only those declarations which are actually to be used for a particular document or project. The TEI calls this model of DTD construction the pizza model, by analogy with the construction of a pizza on which the customer specifies, at preparation time, some set of toppings which can (in the U.S.) be combined arbitrarily with each other.

Use of the pizza model requires some method of allowing the user to specify a choice among modules: conditional sections are the method most commonly used in DTDs. Since they are discussed in the next section, let us simply assume here that some method of choice is available.

Let us consider the case where the schema described in the previous section is to be modified, to allow the user to decide, document by document, whether the modules for hyperlinking and for linguistic analysis are to be included or not. (We assume that the basic document structure and metadata modules are not optional. The method can readily be extended if they are optional, or if there is a choice among different modules of those types.) In that case, the method works this way:
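A minimal sketch of one way such a choice could be arranged, assuming that each optional module contributes a named attribute group to T and that "leaving a module out" means supplying an empty group of the same name. All file, group and attribute names are hypothetical, the syntax is that of the later XML Schema 1.0 Recommendation, and how the user selects between the two files is left open, as in the text above.

```xml
<?xml version="1.0"?>
<!-- linking-on.xsd: the module is selected. -->
<xs:schema xmlns:xs="http://www.w3.org/2001/XMLSchema">
  <xs:attributeGroup name="T.linking.attrs">
    <xs:attribute name="corresp" type="xs:anyURI"/>
  </xs:attributeGroup>
</xs:schema>
```

```xml
<?xml version="1.0"?>
<!-- linking-off.xsd: the module is not selected; the group exists
     but contributes no attributes. -->
<xs:schema xmlns:xs="http://www.w3.org/2001/XMLSchema">
  <xs:attributeGroup name="T.linking.attrs"/>
</xs:schema>
```

```xml
<?xml version="1.0"?>
<!-- driver.xsd: includes exactly one of the two files above. -->
<xs:schema xmlns:xs="http://www.w3.org/2001/XMLSchema">
  <xs:include schemaLocation="linking-on.xsd"/>
  <xs:complexType name="T" mixed="true">
    <xs:attribute name="id" type="xs:ID"/>
    <xs:attributeGroup ref="T.linking.attrs"/>
  </xs:complexType>
</xs:schema>
```

Under this arrangement the declaration of T is the same in every view; only the file named in the include changes.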

3. Question 2G1: Conditional sections

Murray Altheim describes the desired functionality thus:

the ability to 'switch on and off' declarations using conditional sections and parameter entity keywords

This functionality is not included in XML Schema. This may be a mistake, but since conditional sections do not affect the expressive power of DTDs, it is not in itself a violation of the Requirements document. (Conditional sections effectively allow one DTD file to be used to define multiple effective DTDs. For each of those effective DTDs, an XML Schema equivalent is possible.)

The most obvious work-around within a schema document is to place conditional sections around entity declarations in a DTD and to reference the resulting general entities in the schema document. Virtually all uses of parameter entities can be modeled in this way, though observers are divided on whether they regard this as an advantage or not. It is true that some WG members decided not to make a major issue out of the absence of conditional-inclusion functionality in XML Schema because they observed that conditional sections would still be legal in the DTD of a schema document (strictly, XML 1.0 permits them in the external subset and in external parameter entities, not in the internal subset).
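A minimal sketch of this work-around follows. The file name switches.dtd, the entity names and the attribute names are hypothetical, the schema syntax is that of the later XML Schema 1.0 Recommendation, and the approach assumes a parser that actually reads the external DTD subset.

```xml
<!-- switches.dtd: external DTD subset for the schema document. -->

<!-- Default setting; a declaration of %linking; in the internal
     subset is read first and therefore takes precedence. -->
<!ENTITY % linking "IGNORE">

<![%linking;[
  <!ENTITY linking.attrs
    '<xs:attribute name="corresp" type="xs:anyURI"/>'>
]]>

<!-- Fallback: binding only if the section above was ignored,
     because the first declaration of an entity wins. -->
<!ENTITY linking.attrs "">
```

```xml
<?xml version="1.0"?>
<!DOCTYPE xs:schema SYSTEM "switches.dtd" [
  <!-- Switch the linking attributes on for this view of the schema. -->
  <!ENTITY % linking "INCLUDE">
]>
<xs:schema xmlns:xs="http://www.w3.org/2001/XMLSchema">
  <xs:complexType name="T" mixed="true">
    <xs:attribute name="id" type="xs:ID"/>
    <!-- Expands to the linking attribute declaration, or to nothing. -->
    &linking.attrs;
  </xs:complexType>
</xs:schema>
```

The conditionality is resolved entirely by the XML parser; a schema processor sees only the expanded result.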

Given the opacity of the methods available for controlling conditional sections in DTDs, of course, some DTD developers have preferred to move the handling of conditionality out of the DTD entirely, and to use higher-level systems to generate appropriate DTDs, based on conditions passed to, and evaluated by, specialized processors. Literate programming systems which handle the production of multiple versions of the ‘same’ material can be and have been used to produce the different views of a DTD; the user's choice of modules, in turn, can be expressed through appropriate signals in the public and system identifiers for the DTD. This is suitable for choices among some relatively small number of modules (say, up to ten, for a thousand views of the DTD), but probably not for the common use of conditional sections to allow the user to suppress certain elements (notations, attributes, etc.) entirely.

Another approach to the parameterized assembly or generation of schemas would use XSLT stylesheets or similar mechanisms to transform a schema into a different schema, e.g. by addition, modification, or suppression of certain specified material. Opinions are divided on the wisdom or utility of this approach.
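A minimal sketch of such a transformation, written in XSLT 1.0. The parameter name linking and the group name linking.attrs are hypothetical, the schema namespace is that of the later XML Schema 1.0 Recommendation, and references to the group are assumed to be written without a namespace prefix.

```xml
<?xml version="1.0"?>
<xsl:stylesheet version="1.0"
    xmlns:xsl="http://www.w3.org/1999/XSL/Transform"
    xmlns:xs="http://www.w3.org/2001/XMLSchema">

  <!-- Hypothetical switch: pass linking="off" to suppress the module. -->
  <xsl:param name="linking" select="'on'"/>

  <!-- Identity rule: copy everything not matched more specifically. -->
  <xsl:template match="@*|node()">
    <xsl:copy>
      <xsl:apply-templates select="@*|node()"/>
    </xsl:copy>
  </xsl:template>

  <!-- Copy the linking attribute group, and references to it,
       only when the module has not been switched off. -->
  <xsl:template match="xs:attributeGroup[@name='linking.attrs'
                                         or @ref='linking.attrs']">
    <xsl:if test="$linking != 'off'">
      <xsl:copy>
        <xsl:apply-templates select="@*|node()"/>
      </xsl:copy>
    </xsl:if>
  </xsl:template>
</xsl:stylesheet>
```

The output is an ordinary schema document with no trace of the condition, which is both the attraction of this approach and the source of the reservations about it.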

If conditional-section functionality were to be added to XML Schema, there are two obvious approaches:

In either case, some attribute or child element expressing a condition would need to be provided, as well as some mechanism to allow the user to set conditions appropriately (the analogue of initializing the appropriate magic parameter entities with INCLUDE or IGNORE in a DTD context).

Conditional sections will interact with namespace issues in complex ways: some schema authors will wish to associate different views of their schema documents with different namespaces (where a set of schema documents defines a ‘family’ of namespaces), while others will wish all views of the schema documents to be associated with the same namespace (where a set of schema documents provides a definition of all or part of a namespace which is not always all visible in all uses).

Conditional sections and target-namespace information will also interact with the mechanism for embedding schema documents (<import>, <include>), since some schema integrators may wish to embed the same schema module more than once, using a different view each time. This projected usage is similar to the use of generic functions in programming languages which support them; some existing schema languages (e.g. Sox) have generics, but in general their usage in markup languages is not yet fully understood by all practitioners.

It should be noted that some members of the XML Schema WG have expressed a strong aversion to features for conditional manipulation of the schema document text; such features obscure the meaning of the schema (from programs as well as humans) and are nearly impossible to represent in abstract component form. As a result, they are inclined to suggest leaving conditional manipulation of schema document text to mechanisms like XSLT.

Input from DTD developers and prospective users of XML Schema on this topic is welcome.

4. Question 321: Extending content model

Ann Navarro describes the requirement thus:

2. I need to be able to extend a content model (a template) without defining a new one.

XHTML defines content model sets like Flow, Block, and Inline in the Text Module. Other modules modify these sets, extending the Inline set to include <object>, for example, when the Object module is selected. This modified set must be the content model for every element that references the Inline content set (or the Flow set, which is made up of Block, Inline, and other elements; or the Block set, with is made up of Inline and other elements).

We haven't been able to distinguish this usefully from question 322, but we can distinguish two ways of extending a content model. We discuss one here and one under question 322.

If a content model says (or is intended to say) in effect “A <P> element can contain any mix of PCDATA, phrase-level elements, and list elements”, then the desired effect can be had in this way:

There is no need to modify the content models which include <phrase> and <list> as children, in order to add members to their equivalence classes.

Note that each element may be in only one equivalence class; this helps avoid ambiguity and is not thought to be a serious usability problem; if it presents a problem in practice, please let us know.
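A minimal sketch of the equivalence-class approach. It uses the syntax of the later XML Schema 1.0 Recommendation, in which equivalence classes are called substitution groups (the substitutionGroup attribute); this is an assumption relative to the draft discussed here. The member element names emph, ul and object are illustrative.

```xml
<?xml version="1.0"?>
<xs:schema xmlns:xs="http://www.w3.org/2001/XMLSchema">

  <!-- Heads of the two classes. -->
  <xs:element name="phrase"/>
  <xs:element name="list"/>

  <!-- P: any mix of character data, phrase-level and list elements. -->
  <xs:element name="P">
    <xs:complexType mixed="true">
      <xs:choice minOccurs="0" maxOccurs="unbounded">
        <xs:element ref="phrase"/>
        <xs:element ref="list"/>
      </xs:choice>
    </xs:complexType>
  </xs:element>

  <!-- Initial members of the classes. -->
  <xs:element name="emph" type="xs:string" substitutionGroup="phrase"/>
  <xs:element name="ul" substitutionGroup="list">
    <xs:complexType>
      <xs:sequence>
        <xs:element name="li" type="xs:string" maxOccurs="unbounded"/>
      </xs:sequence>
    </xs:complexType>
  </xs:element>

  <!-- A member added later, possibly by another module: the
       declaration of P above is not touched. -->
  <xs:element name="object" type="xs:string" substitutionGroup="phrase"/>
</xs:schema>
```

Every content model that refers to phrase accepts object as soon as the last declaration is present, which corresponds to extending the Inline set in the XHTML example.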

5. Question 322: Arbitrary extensions to content model

Ann Navarro describes the requirement thus:

3. I need to be able to extend a content model definition for an individual element arbitrarily.

This should be fairly obvious. For instance, XHTML defines %Inline, which serves as the content model for <p>. If I create a new phrasal element which is added to the %Inline collection, I necessarily must be able to extend the content model of <p> for this to work.

The actual use case described (adding new inline elements) is addressed above under Question 321. Note that formally such extensions are not fully arbitrary: they have the effect of adding new alternatives to an existing disjunction (OR-group) in a content model, but do not otherwise affect it.

Fully arbitrary changes to content models are not supported in XML Schema.

Schema authors and schema adapters can extend a complex type on the end by deriving a new complex type by extension. This restriction allows a clear description of the relationship among the sets of children-sequences accepted by the base type and the derived type: each string accepted by the derived type has as a prefix a string accepted by the base type. Other forms of content-model modification may be defined in future versions of XML Schema, but extension by suffix appears to meet the immediate needs of most users who need extension, rather than just restriction, of content models.

If the original declaration said the elements would use the unextended type, then the instance elements will have to signal that they are using the new extended type. This provides a signal to processors which are interested only in the base type, that they will have to ignore some suffix in the string of children.
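A minimal sketch of extension by suffix, with hypothetical type and element names and the syntax of the later XML Schema 1.0 Recommendation.

```xml
<?xml version="1.0"?>
<xs:schema xmlns:xs="http://www.w3.org/2001/XMLSchema">

  <!-- Base type: accepts the children sequence (street, city). -->
  <xs:complexType name="Address">
    <xs:sequence>
      <xs:element name="street" type="xs:string"/>
      <xs:element name="city" type="xs:string"/>
    </xs:sequence>
  </xs:complexType>

  <!-- Derived type: accepts (street, city, postcode); every accepted
       sequence begins with a sequence the base type accepts. -->
  <xs:complexType name="PostalAddress">
    <xs:complexContent>
      <xs:extension base="Address">
        <xs:sequence>
          <xs:element name="postcode" type="xs:string"/>
        </xs:sequence>
      </xs:extension>
    </xs:complexContent>
  </xs:complexType>

  <!-- The element is declared with the unextended type. -->
  <xs:element name="address" type="Address"/>
</xs:schema>
```

Because address is declared with the base type, an instance that uses the extended content signals this with xsi:type:

```xml
<address xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance"
         xsi:type="PostalAddress">
  <street>1 Main St</street>
  <city>Springfield</city>
  <postcode>00000</postcode>
</address>
```

A processor interested only in Address can read the first two children and ignore the rest.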

In cases where it is necessary to change a content model entirely, it will be necessary to define a different complex type and a different element type to use that new complex type. We think this is the correct approach to making schema extensions and modifications processable. (We haven't thought of use cases which require that a type be completely modified with no signal to the application that anything has changed. If there are such use cases, let us know.)

In some cases, the requirement is not to modify, but to specify, the content model. Table modules, for example (e.g. the Oasis table module, which specifies a commonly used table model), may allow or even require the DTD integrator to specify the content model for table cells. There are a few ways of handling cases like this in XML Schema (possibly more).

6. Question 323: Extending attribute list

Ann Navarro describes the requirement thus:

4. I need to be able to extend the attribute list for an individual element arbitrarily.

It must be possible to define arbitrary data types for attribute values. These must include enumerated lists of literal values (e.g. "one", "two", "three"). It must also be possible to extend the enumerated lists after they are defined. For example, the input element has an attribute type. A new module may want to extend the collection of input types by adding a new value to its value set.

The addition of new values to an enumerated type (or more generally, the derivation of simple types by extension rather than restriction) is not supported by XML Schema.

We think this is the correct approach; it allows processors to avoid being blind-sided by non-substitutable extensions to the set of legal attribute values.

The schema author can come close to the described behavior by (a) clever games with general entities, or by (b) defining a new enumerated type (it can't be derived from the old one, since simple types can't be extended), and a new complex type to use it as the value of the attribute, and a new element type to use that new complex type.
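A minimal sketch of option (b), with hypothetical names and the syntax of the later XML Schema 1.0 Recommendation. The new enumerated type restates the old values in full, because it cannot be derived from the old type.

```xml
<?xml version="1.0"?>
<xs:schema xmlns:xs="http://www.w3.org/2001/XMLSchema">

  <!-- Original enumerated type and the types that use it. -->
  <xs:simpleType name="InputType">
    <xs:restriction base="xs:NMTOKEN">
      <xs:enumeration value="text"/>
      <xs:enumeration value="checkbox"/>
    </xs:restriction>
  </xs:simpleType>
  <xs:complexType name="Input">
    <xs:attribute name="type" type="InputType"/>
  </xs:complexType>
  <xs:element name="input" type="Input"/>

  <!-- New enumerated type: the old values repeated, plus one more. -->
  <xs:simpleType name="InputTypeExt">
    <xs:restriction base="xs:NMTOKEN">
      <xs:enumeration value="text"/>
      <xs:enumeration value="checkbox"/>
      <xs:enumeration value="date"/>
    </xs:restriction>
  </xs:simpleType>

  <!-- New complex type and new element type to carry it. -->
  <xs:complexType name="InputExt">
    <xs:attribute name="type" type="InputTypeExt"/>
  </xs:complexType>
  <xs:element name="inputExt" type="InputExt"/>
</xs:schema>
```

The cost is visible in the instance: documents must use the new element type, so the change cannot go unnoticed by processors written for the original one.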

7. Relevant design ideas

A number of design ideas have been floated in the WG's discussions which, if adopted, would affect the topics discussed here. These design ideas have not been adopted by the WG in the past, but they may come up for consideration (or re-consideration) when we resolve last-call comments, so comments on them are potentially useful. They include:

Mention of these ideas in this document should not be taken as a promise that the WG will consider them in its future work, only as an invitation to comment on their potential utility to the XML Schema language.


A. Full text of email

The full text of the relevant email postings from Murray Altheim and Ann Navarro (minus signature blocks and most of the header information) is reproduced here.

A.1 Altheim to Connolly, 16 Feb 2000

Date: Wed, 16 Feb 2000 15:15:50 -0800
From: Murray Altheim <altheim@eng.sun.com>
To: Dan Connolly <connolly@w3.org>
CC: w3c-xml-schema-ig@w3.org
Subject: Re: "more expressive than XML DTDs; " design principles 4.1

Dan Connolly wrote:

It's been mentioned that some of our decisions might be seen to conflict with our requirement about being as expressive as DTDs. [...]

Two features that to my current understanding of schemas are currently missing and seem to be required for any type of schema modularization:

  1. the ability to declare attributes from multiple locations within the schema (as provided in DTDs via multiple ATTLISTs, a feature long-demanded in the SGML community and included in Annex K of ISO 8879 TC2, the "Web SGML Adaptations.")
  2. the ability to 'switch on and off' declarations using conditional sections and parameter entity keywords

If I'm missing something and these features are already provided, I'd be happy to understand how this might be accomplished using XML Schemas.

Thanks,

Murray

A.2 Navarro to Fuchs, 2 Mar 2000

Date: Thu, 02 Mar 2000 15:01:37 -0500
To: Matthew Fuchs <matthew.fuchs@commerceone.com>,
Schema Interest Group <w3c-xml-schema-ig@w3.org>
From: Ann Navarro <ann@hwg.org>
Subject: RE: qualified attributes

In discussion about the current provision in XML Schema for schema-validation of qualified attributes in instances, it has been pointed out that we have currently no way to define a complex type which will schema-validate one of the examples in the Namespace REC, namely

  <x:foo a='1' x:a='1'/>

The problem is not with the two 'a's, but simply with attributes qualified with the same namespace URI as their containing element.

I consider this as a category 1 (the spec is broken) issue, and that it overtakes my Editorial observation of last week [1] that the ‘global’ attribute issue could be settled without action.

There's several other issues we've been hoping to get addressed, namely:

  1. I need to be able to extend an attribute collection (a template) without defining a new one.
  2. I need to be able to extend a content model (a template) without defining a new one.
  3. I need to be able to extend a content model definition for an individual element arbitrarily.
  4. I need to be able to extend the attribute list for an individual element arbitrarily.

I don't think I can do any of these things in XML Schema right now (I'd love to be proven wrong, but we've smacked at it for some time now without headway).

Ann

A.3 Navarro to IG, 2 Mar 2000

Date: Thu, 02 Mar 2000 16:21:45 -0500
To: w3c-xml-schema-ig@w3.org
From: Ann Navarro <ann@hwg.org>
Subject: RE: qualified attributes

At 12:12 PM 3/2/00 -0800, Matthew Fuchs wrote:

Ann,

Could you elaborate on these a bit, in particular could you give a brief example of each (don't worry too much about getting your syntax right). As is, these are too sketchy for me to understand. I believe 3, & 4 are possible but not 2, and I definitely don't understand 1.

Sure. We've discussed some of this previously, some of these examples taken from a previous HTML WG post from Shane McCarron

1. I need to be able to extend an attribute collection (a template) without defining a new one.

It must be possible to define collections of attributes that can be referenced by elements, and to extend those attribute sets from within arbitrary modules. For example, XHTML defines the Common attribute collection. This collection is referenced by a variety of elements. However, the content of the set is modified by the inclusion of the Legacy module. When this module is included, the style attribute is added to the Common attribute collection, and that addition must be reflected in every element that references it.

2. I need to be able to extend a content model (a template) without defining a new one.

XHTML defines content model sets like Flow, Block, and Inline in the Text Module. Other modules modify these sets, extending the Inline set to include <object>, for example, when the Object module is selected. This modified set must be the content model for every element that references the Inline content set (or the Flow set, which is made up of Block, Inline, and other elements; or the Block set, with is made up of Inline and other elements).

3. I need to be able to extend a content model definition for an individual element arbitrarily.

This should be fairly obvious. For instance, XHTML defines %Inline, which serves as the content model for <p>. If I create a new phrasal element which is added to the %Inline collection, I necessarily must be able to extend the content model of <p> for this to work.

4. I need to be able to extend the attribute list for an individual element arbitrarily.

It must be possible to define arbitrary data types for attribute values. These must include enumerated lists of literal values (e.g. "one", "two", "three"). It must also be possible to extend the enumerated lists after they are defined. For example, the input element has an attribute type. A new module may want to extend the collection of input types by adding a new value to its value set.

Hope this helps to see where we're coming from.

Ann

A.4 Navarro to Thompson, 2 Mar 2000

Date: Thu, 02 Mar 2000 16:45:09 -0500
To: ht@cogsci.ed.ac.uk (Henry S. Thompson), Ann Navarro <ann@hwg.org>
From: Ann Navarro <ann@hwg.org>
Cc: w3c-xml-schema-ig@w3.org, "Shane P. McCarron" <shane@aptest.com>
Subject: Re: qualified attributes

At 09:37 PM 3/2/00 +0000, Henry S. Thompson wrote:

Ann Navarro <ann@hwg.org> writes: [Note I've reordered your points]

2. I need to be able to extend a content model (a template) without defining a new one.

XHTML defines content model sets like Flow, Block, and Inline in the Text Module. Other modules modify these sets, extending the Inline set to include <object>, for example, when the Object module is selected. This modified set must be the content model for every element that references the Inline content set (or the Flow set, which is made up of Block, Inline, and other elements; or the Block set, with is made up of Inline and other elements).

This is precisely what element equivalence classes are for, so you have this already.

3. I need to be able to extend a content model definition for an individual element arbitrarily.

This should be fairly obvious. For instance, XHTML defines %Inline, which serves as the content model for p. If I create a new phrasal element which is added to the %Inline collection, I necessarily must be able to extend the content model of p for this to work.

I don't see how this is different from (2) above.

Attributes sets vs. element content model. The concept is the same, but the syntax may or may not be.

This is a commonly requested feature which we have not ever seen a proposal for.

I believe we forwarded this (we being HTML WG) back in late January, but that we're not alone is comforting in a manner :)

For a number of reasons, this request among them, as well as Phil Wadler's observation of the inappropriateness of the use of the phrase 'equivalence class' for what we use it for ('substitution class' might be nearer the mark), I think we might want to look again at providing a unified approach to provide post-facto bottom-up modifications to all potentially disjunctive permissions, i.e. element particles (now covered by 'equivClass'), attribute declarations (not now covered, but obvious how we might) and simple types derived by enumeration.

A new look would be appreciated.

We'll see if we can't find any other stumpers.

Ann