Context-sensitive rules in XML Schema

]> Context-sensitive rules in XML Schema World Wide Web Consortium Cambridge, Sophia-Antipolis, Tokyo 2000

Created in electronic form.

2000-04-27 : CMSMcQ : add reference to Maler and el Andaloussi 2000-04-26 : CMSMcQ : finish revision 2000-04-24 : CMSMcQ :work through, using Steve's comments. Reorder sections and redo intro overview and transitions. 2000-03 : Steve DeRose : reads draft and comments 2000-02-15 : CMSMcQ : first draft begins as part of note to W3C staff explaining the XML Schema WG's decision on issue 208; grows too large, and becomes separate document Context-sensitive rules in XML Schema C. M. Sperberg-McQueen 15 February 2000 rev. 25 April 2000

The element declarations in SGML and XML document type definitions (DTDs) specify the legal contents of an element type. The rules for what can occur within an element are the same, no matter where it occurs. This is true of XML DTDs; it is slightly less true of SGML DTDs, owing to the inclusion and exclusion exceptions constructs present in SGML and omitted from XML. Inclusion and exclusion exceptions are an attempt to deal with the problems I am about to describe. They were omitted from XML because the complexity they introduce seems out of proportion to the modest increase in expressive power they provide. As the examples below should make clear to those familiar with exceptions in SGML, the local definitions of XML Schema are a better approach and can accomplish more. Since the declaration of an element type is thus independent of what, in formal terms, is called the context of the element, the DTD notation in XML 1.0 can be said to define a form of context-free grammar. The concepts of context-free and context-sensitive grammars were first introduced, in their current form, by Chomsky 1957. Detailed discussion can be found in virtually any textbook of parsing, syntax-directed programs, and compiler construction. A good introduction is in Grune.

It is fairly common in markup languages, however, to want slightly different rules for an element depending on where it occurs in the document. Consider, for example, the input element type in HTML with forms. In principle, this should be able to occur anywhere that text (character data) can occur, but only within a form element. So, for example, it should be able to occur within a paragraph, if the paragraph is within a form element, but not otherwise. This is a simple example of context-sensitivity, and it has historically been a source of astonishment to some users of SGML and XML to learn that neither SGML nor XML DTDs are able to express it formally.

This document describes how to use XML Schema to achieve some context-sensitive effects not possible in XML or SGML document type definitions (DTDs). The first section introduces a simple example of a markup language in which some form of context-sensitivity is needed — specifically, the rules governing the appearance of input elements within HTML documents, and shows a simple DTD for that markup language, as it might be written for SGML or XML processors. The context-sensitive constraints cannot be expressed by the formal element declarations, so they must be expressed separately, in normative prose. The second section shows an equivalent XML Schema. The third section uses some imaginary extensions to XML 1.0 DTD syntax to show how the context-sensitive rules could be declared, given a suitable notation. The fourth section shows how the XML Schema language can be used to capture the context-sensitive constraints. The author acknowledges his deep debt, and expresses his deep thanks, to Murata Makoto for discussions of document grammars over the last few years. The example outlined here using local types was originally developed as an application of Murata's proposals to use tree automata in defining document grammars. The complex types used here serve much the same purpose as the non-terminals in Murata's proposals.

This document assumes either a certain degree of familiarity with XML, XML DTDs, HTML, and XML Schema, or a certain degree of willingness to take the details on faith. Since it would be desirable to make it accessible to readers without such familiarity, anyone who is puzzled by some part of the document is invited to contact the author and explain where you got lost, and what information you think would have helped keep you from getting lost.

An example of context-sensitive rules

To illustrate the problem, let us consider a simple markup language we'll call M. M is roughly similar to a small subset of HTML with forms, though it has a slightly different overall document structure.

The basic document structure of M is given by the following element types: html is the document element; each html element contains a head and a body. head is a container for metadata pertaining to the document. meta is an empty element specifying one piece of metadata; the name attribute indicates what kind of metadata, and the content attribute gives the value. body contains the human-readable part of the document; it consists of one or more form, div1, or p elements. div1 is a top-level text division (e.g. a chapter or a section). Elements of type div1 may occur only at the top level; they cannot nest. h1 is the title of a top-level text division. div2 is a second-level text division (e.g. a subsection). Elements of type div2 may occur only within div1 elements; they cannot nest. h2 is the title of a second-level text division. p is a paragraph.

The form element in M is very similar to the form element of HTML: form is the container element for one interactive form; when the user clicks on the appropriate submit button, the data will be transmitted to the server identified by the action attribute, using the method identified by the method attribute. item specifies one input field in the form; its name, type, and value attributes specify the field name, datatype, and default value respectively. The attributes specified for meta, form, and item are given for verisimilitude; they don't actually affect our example, though the methods described in this paper can also be used to give limited context sensitivity to attribute definitions as well as content models.

M has a number of other rules about how these elements interact: The hierarchy (body / div1 / div2 / p) is unaffected by whether forms are present or not; form is transparent to the hierarchy. Forms may appear at any level (directly within body, div1, or div2) but may not appear within other forms. In M, as in HTML, the input element should be able to occur anywhere that text (character data) can occur, but only within a form element. So, for example, it should be able to occur within a paragraph, if the paragraph is within a form element, but not otherwise.

An XML version of the sample DTD might look like this; we use parameter entities in declaring some attributes, in order to stress the nature of the expected value. ]]>

Note that the form element can contain div1 and div2 elements, and each of these can in turn contain a form element. So a form element within a div1 element might also contain a div1 element, thus allowing div1 elements to nest, thus violating rule 1. Even if rule 1 is obeyed, a form element within a div1 element can contain a div2 which in turn contains another form element, thus violating rule 2. And there is no way to specify that input elements may occur within paragraphs only if a form is open, so rule 3 is not enforced by this DTD either. No XML DTD can enforce any of the three rules just given, without outlawing some documents which should be legal.

In SGML, rule 1 can be enforced Well, mostly enforced. See if you can spot the gap in coverage before the text mentions it. by adding exclusion exceptions to the declarations for div1 and div2 ]]> The exclusion exceptions mean that when the effective content model of form varies with context. When a form element occurs directly within the text body, its content model is (p | div2 | div1)*. When it occurs directly within a div1, it is not allowed to have any div1 children, so its effective content model is (p | div2)*. When it occurs directly within a div2, it is not allowed to have any div2 children. And since div2 elements can occur only within div1 elements, a form within a div2 cannot contain div1 elements either, so its effective content model is (p)*. Which is what we want.

Or, almost what we want. Actually, a form which occurs directly within a body element can contain a div2 element, thus allowing div2 to occur outside of any div1. So the exclusion exceptions have allowed us to enforce some, but not all, of the consequences of rule 1.

Rule 2 can be enforced fully by SGML: adding an exclusion exception to the declaration for form, we can prohibit self-nesting at any depth. ]]> The enforcement of prohibitions against self-nesting is one of the things SGML exclusion exceptions do best.

The third rule requires a more complicated kind of context sensitivity, and cannot be expressed in full in either SGML or XML. SGML offers two mechanisms which can be used to enforce rule 3 in part, and the HTML 2.0 and 3.2 DTDs each use one of them. First, we can define input as an inclusion exception on the form element (as was done in the DTD for HTML 2.0). ]]>This ensures that input is legal within form elements, and not otherwise. But it allows input in locations where text is not allowed (e.g. between paragraphs). So we must supply a rule, expressed in prose but not in the DTD, that restricts input elements to appropriate locations within the form.

The second method is to define input as a legal subelement of p and other elements which can contain text, and supply a rule, expressed in prose but not in the DTD, which says that input is only legal if a form element is open. This is what was done in HTML 3.2. ]]>

In neither case can we formally define exactly the constraint we want, with the result that SGML and XML systems guided by the DTD will not enforce this rule; ad hoc software must be written to check to ensure that input elements occur only where they are supposed to.

An XML Schema without context sensitivity

A definition of M using XML Schema, which enforces the context-sensitive rules, will be shown below. In order to introduce the basic notations of XML Schema, however, let us begin with a direct translation of this DTD into XML Schema, without any context-sensitivity.

The html element contains a sequence consisting of a head and a body: ]]>

The head element contains one or more meta elements, which are empty, and carry name and content attributes: ]]>

The body contains one or more form, div1, or p elements, in any order. ]]>

The div1 (first-level division) element contains first a heading (h1) and then one or more form elements (form), second-level divisions (div2), or paragraphs (p). ]]>

Second-level divisions are analogous to first-level divisions, but after their heading they can contain only forms and paragraphs: ]]>

Paragraphs can contain input elements mixed with character data: ]]>

And, finally, forms consist of paragraphs and divisions. Whether a first- or second-level division is allowed within the form ought to depend on whether the form occurs within any division itself -- but in this formulation, it does not depend on that: either level is legal inside the form, no matter what happened outside it. ]]>

The schema uses user-defined types, rather than parameter entities, for specialized attribute datatypes like the set of legal widget types, or the set of legal methods. We define those types by enumerating the legal values: ]]>The uriReference type does not need to be defined; it is a built-in type in XML Schema.

Adding context-sensitivity

The problem with the DTD and schema given above is that the content models are fixed once and for all, and we would like them to vary depending on the context of the element instances.

For example, the declaration for p should take one form if the p is anywhere inside a form element: ]]>but it should take another form otherwise, without input elements: ]]>

If we used parameter entities for the content models, as many DTDs do, to give a name to each of these content models, the result might look like this: ]]>

Inside forms and outside them, p needs to have two distinct content models: ]]>

Similarly form should allow a mix of div1 and p elements, if the form is outside of any div1 element: ]]>but inside a div1 it should allow a div2: ]]>and inside a div2 it should allow no text divisions at all, only paragraphs: ]]>Defined as parameter entities, these three content models take the following form: ]]>

Similar sets of varying declarations could be constructed for div1 and div2; this is left as an exercise for the reader (hint: two of each).

An imaginary extension to DTD notation

In order to make clear just what is needed to formalize the definition of our example markup language, let us use some imaginary extensions to DTD notation. N.B. this is not a proposal for changing or extending DTD notation, just an attempt to provide a more compact way of seeing what is going on. Let us imagine that DTDs allow us to declare element types in the familiar way, and to refer to them in content models in the usual way. But let us imagine further that whenever an element type is referred to in any content model, we can if we wish specify a different content model for elements in the document which match that particular content-model particle. Formally, we can say that conventional element-type declarations bind a generic identifier to a content model, globally, and that our new syntax allows us to override the global binding with a local binding.

Concretely, In a content model, we allow a reference either to a generic identifier (which identifies a top-level element) or to a (GI:content-model) pair, which declares a local binding of that generic identifier to that content model, for elements which match that particular particle. In order to keep the (GI:content-model) pairs simple, we require that such pairs refer to content models by name. Allow a new TYPE declaration for content models, which specifies a content-model name and a content model. (I am not making a real proposal for reform of DTD notation, so I won't worry about simple types or various other complications.)

The parameter entities p-normal, p-in-form, etc., which were defined above, now become content-model names, and are declared with TYPE declarations. ]]>

This imaginary extension of DTDs allows us to capture formally all the rules of our example markup language M.

XML Schema with context-sensitivity

XML Schema can be used to allow declarations to vary by context, in much the same way as the imaginary extensions to DTD notation just described. The key is that in addition to declaring element types at the top level, globally (i.e. as a global binding between an element-type name and a complex type), we can declare element types locally, within the definition of a particular complex type.

Parts of the schema (specifically the declarations for html, head, meta, h1, h2, and input, are the same as before.

For the other elements, first let us declare the various types we need. Within these types we may either refer to global element types (by means of <element ref="..."/>), or define new local element types (by means of <element name="..." type="..."/>). Where we need multiple forms of one element type (or, to put it a different way, when we need the same generic identifier to be bound to different complex types in different contexts), we define named types for each variant. When we don't need such complexity, we just define the element at the global level, as already shown above.

Paragraphs can be of two types, which we name p-normal and p-in-form: ]]>

Forms can occur directly within the body of the text, or within first- or second-level divisions; in each case, we need a distinct complex type, which we name form-in-body, form-in-div1, and form-in-div2. All three of these have the same attributes, so we factor the attribute declarations out into an attribute group: ]]>

First-level divisions can occur directly within the body of the text, or within a form. In the former case, they can contain forms or second-level divisions, or paragraphs: ]]>In the latter case, they cannot contain nested forms, and they use the alternate inside-a-form types for nested divisions and paragraphs: ]]>

Second-level divisions can occur directly within a first-level division, or within a form which is itself inside a first-level division. If no form is open, a form can occur within the div2: ]]>If a form is open outside the div2, no form is allowed inside, only paragraphs: ]]>

We have now defined all the different complex types we need for the context-sensitive elements. It remains only to declare the body element, which specifies types form-in-body, div1-normal, and p-normal for its children: ]]>

The new schema correctly enforces all the rules laid out at the outset: Forms may appear at any level (body, div1, div2) but may not appear within other forms. This is enforced by the document grammar. All elements which can both appear within forms and contain forms have two distinct types, one for each case, and all descendants of a form element are associated with their inside-a-form variant. The hierarchy (body / div1 / div2 / p) is unaffected by whether forms are present or not; form is transparent to the hierarchy. (This is enforced by the context-dependent variations on form, which allow within form only the element types which would have been legal in the form's parent element). The input element is legal within a p if and only if the p appears within an open form. This is enforced by the types which contain paragraphs: if the type is of the inside-a-form variety, p is bound to the appropriate complex type p-in-form, and otherwise it is bound to the type p-normal.

Since these rules cannot be enforced by DTDs, it is clear that local binding of GIs and types makes schemas strictly more powerful (in this respect) than DTDs. Since such local bindings have often been desired in document markup languages (see, for example, Welty and Ide 1999), XML Schema should prove a better (more powerful and more flexible) vehicle for defining such languages than DTDs have been.

Capturing limited context sensitivity in context-free grammars

We now have a fully worked out example of our basic problem. The problem is simply described: we are writing a schema, which in essence is a context-free grammar, and we are trying to use it to express constraints on the data which depend on the context. The solution is equally simply described: for each element whose content model should vary by context, we write distinct forms of that content model, and bind them to the appropriate generic identifier in the appropriate locations.

This technique is an application of a a well-known technique for capturing limited context-sensitivity in context-free grammars. This section describes that basic technique, and describes a general method of expressing context-sensitive constraints in document grammars which allow local binding of generic identifiers to content models.

Basically, we have doubled almost every production rule, to provide a singular and a plural form. This technique trades size of the grammar for power, and it generalizes: any finite number of pieces of context-sensitive information can be carried around in a finite context-free grammar (though not usually in a compact context-free grammar). There are grammar systems which allow a more compact form of specification, and which then generate the long context-free grammar from the compact specifications.

To apply this technique to schemas, the procedure is as follows: Identify the context upon which some variation in the markup language depends; let us call this the top element. For example, the language M has the constraint that the p element should allow input elements as children if the p is within a form element, and otherwise not. The form element is the top element. (For some constraints, there may be multiple tops.) Identify the place where the constraint makes a difference. Call this the bottom element. For example in the constraint just described, p is the bottom element. (For some constraints, there may be multiple bottoms.) Find every element type which can appear between a top and a bottom. In our example, this set includes div1 and div2. Call these the middle elements. (In some cases, middle elements may also be bottom elements.) The (generic identifier : type) pairings need to be changed (doubled) for all bottom and middle elements. For each such element type, there will be some simple or complex type T associated with it. For each type T associated with a bottom element, write a new type T' which enforces the context-sensitive constraint (or allows the context-dependent freedom). In our example, we will write a version of the complex type for p which allows input elements. For each type T associated with a middle element, write a new type T'; call this the context-dependent version of T. T' is just like T except that: If T allows any middle element type E as a child, then in T' the reference to the global element type E is replaced with a local element declaration. The local element declaration will use the same generic identifier, but the type U associated with E will be replaced by U', the context-dependent version of U. Finally, replace the original form of the type for the top element with a context-dependent version of the same type.

References Chomsky, Noam. On certain formal properties of grammars. Information and Control 2 (1959): 137-167. DeRose, Steven J. The SGML FAQ Book: Understanding the Foundation of HTML and XML. Boston, Dordrecht, London: Kluwer Academic, 1997. Questions 5.5 (pp. 105-106) and 6.1 (pp. 139-142) discuss content-model exceptions and touch on issues of context sensitivity. Grune, Dick, and Ceriel J.H. Jacobs. Parsing Techniques: A Practical Guide. New York, London: Ellis Horwood, 1990. Maler, Eve, and Jeanne el Andaloussi. Developing SGML DTDs: From Text to Model to Markup. Upper Saddle River, NJ: Prentice Hall PTR, 1996. Section 6.1.3 discusses the possibility of merging distinct content models, and observes that the choice among them may be determined by context, but the correct choice cannot be checked by the DTD. Berners-Lee, Tim, and Dan Connolly. Hypertext Markup Language - 2.0. RFC 1866. [n.p.]: Internet Engineering Task Force, November 1995. ftp://ftp.isi.edu/in-notes/rfc1866.txt Raggett, Dave. HTML 3.2 Reference Specification. W3C Recommendation 14-Jan-1997. [Cambridge, Sophia-Antipolis, Tokyo]: W3C, 1997. http://www.w3.org/TR/REC-html32 Welty, Christopher, and Nancy Ide. Using the Right Tools: Enhancing Retrieval from Marked-up Documents. CHum 33 (1999): 59-84. Originally delivered at TEI 10, Providence (1997).