The element declarations in SGML and XML document type definitions (DTDs) specify the legal contents of an element type. The rules for what can occur within an element are the same, no matter where it occurs. [1] Since the declaration of an element type is thus independent of what, in formal terms, is called the context of the element, the DTD notation in XML 1.0 can be said to define a form of context-free grammar. [2]
It is fairly common in markup languages, however, to want slightly different rules for an element depending on where it occurs in the document. Consider, for example, the <input> element type in HTML with forms. In principle, this should be able to occur anywhere that text (character data) can occur, but only within a <form> element. So, for example, it should be able to occur within a paragraph, if the paragraph is within a <form> element, but not otherwise. This is a simple example of context-sensitivity, and it has historically been a source of astonishment to some users of SGML and XML to learn that neither SGML nor XML DTDs are able to express it formally.
This document describes how to use XML Schema to achieve some context-sensitive effects not possible in XML or SGML document type definitions (DTDs). The first section introduces a simple example of a markup language in which some form of context-sensitivity is needed -- specifically, the rules governing the appearance of <input> elements within HTML documents, and shows a simple DTD for that markup language, as it might be written for SGML or XML processors. The context-sensitive constraints cannot be expressed by the formal element declarations, so they must be expressed separately, in normative prose. The second section shows an equivalent XML Schema. The third section uses some imaginary extensions to XML 1.0 DTD syntax to show how the context-sensitive rules could be declared, given a suitable notation. The fourth section shows how the XML Schema language can be used to capture the context-sensitive constraints. [3]
This document assumes either a certain degree of familiarity with XML, XML DTDs, HTML, and XML Schema, or a certain degree of willingness to take the details on faith. [4]
To illustrate the problem, let us consider a simple markup language we'll call M. M is roughly similar to a small subset of HTML with forms, though it has a slightly different overall document structure.
The basic document structure of M is given by the following element types:
The <form> element in M is very similar to the <form> element of HTML:
The attributes specified for <meta>, <form>, and <item> are given for verisimilitude; they don't actually affect our example, though the methods described in this paper can also be used to give limited context sensitivity to attribute definitions as well as content models.
M has a number of other rules about how these elements interact:
An XML version of the sample DTD might look like this; we use parameter entities in declaring some attributes, in order to stress the nature of the expected value.
Note that the <form> element can contain <div1> and <div2> elements, and each of these can in turn contain a <form> element. So a <form> element within a <div1> element might also contain a <div1> element, thus allowing <div1> elements to nest, thus violating rule 1. Even if rule 1 is obeyed, a <form> element within a <div1> element can contain a <div2> which in turn contains another <form> element, thus violating rule 2. And there is no way to specify that <input> elements may occur within paragraphs only if a form is open, so rule 3 is not enforced by this DTD either. No XML DTD can enforce any of the three rules just given, without outlawing some documents which should be legal.
In SGML, rule 1 can be enforced [5] by adding exclusion exceptions to the declarations for <div1> and <div2>
The exclusion exceptions mean that when the effective content
model of <form> varies with context. When a <form>
element occurs directly within the text <body>, its content
model is (p | div2 | div1)*
. When it occurs directly
within a <div1>, it is not allowed to have any <div1>
children, so its effective content model is (p | div2)*
.
When it occurs directly within a <div2>, it is not allowed to
have any <div2> children. And since <div2> elements can
occur only within <div1> elements, a <form> within a
<div2> cannot contain <div1> elements either, so its
effective content model is (p)*
. Which is what we
want.
Or, almost what we want. Actually, a <form> which occurs directly within a <body> element can contain a <div2> element, thus allowing <div2> to occur outside of any <div1>. So the exclusion exceptions have allowed us to enforce some, but not all, of the consequences of rule 1.
Rule 2 can be enforced fully by SGML: adding an exclusion exception to the declaration for <form>, we can prohibit self-nesting at any depth.
The enforcement of prohibitions against self-nesting is one of the things SGML exclusion exceptions do best.
The third rule requires a more complicated kind of context sensitivity, and cannot be expressed in full in either SGML or XML. SGML offers two mechanisms which can be used to enforce rule 3 in part, and the HTML 2.0 and 3.2 DTDs each use one of them. First, we can define <input> as an inclusion exception on the <form> element (as was done in the DTD for HTML 2.0).
<!ELEMENT p (#PCDATA)* > <!ELEMENT input EMPTY > <!ELEMENT form (p | div2 | div1)* -(form) +(input) > |
This ensures that <input> is legal within <form> elements, and not otherwise. But it allows <input> in locations where text is not allowed (e.g. between paragraphs). So we must supply a rule, expressed in prose but not in the DTD, that restricts <input> elements to appropriate locations within the form.
The second method is to define <input> as a legal subelement of <p> and other elements which can contain text, and supply a rule, expressed in prose but not in the DTD, which says that <input> is only legal if a <form> element is open. This is what was done in HTML 3.2.
<!ELEMENT p (#PCDATA | input)* > <!ELEMENT input EMPTY > <!ELEMENT form (p | div2 | div1)* -(form) > |
In neither case can we formally define exactly the constraint we want, with the result that SGML and XML systems guided by the DTD will not enforce this rule; ad hoc software must be written to check to ensure that <input> elements occur only where they are supposed to.
A definition of M using XML Schema, which enforces the context-sensitive rules, will be shown below. In order to introduce the basic notations of XML Schema, however, let us begin with a direct translation of this DTD into XML Schema, without any context-sensitivity.
The <html> element contains a sequence consisting of a <head> and a <body>:
<xsd:element name="html"> <xsd:complexType> <xsd:sequence> <xsd:element ref="head"/> <xsd:element ref="body"/> </xsd:sequence> </xsd:complexType> </xsd:element> |
The <head> element contains one or more <meta> elements, which are empty, and carry name and content attributes:
The body contains one or more <form>, <div1>, or <p> elements, in any order.
The <div1> (first-level division) element contains first a heading (<h1>) and then one or more form elements (<form>), second-level divisions (<div2>), or paragraphs (<p>).
Second-level divisions are analogous to first-level divisions, but after their heading they can contain only forms and paragraphs:
Paragraphs can contain <input> elements mixed with character data:
<xsd:element name="p"> <xsd:complexType content="mixed"> <xsd:element ref="input" minOccurs="0" maxOccurs="unbounded"/> </xsd:complexType> </xsd:element> |
And, finally, forms consist of paragraphs and divisions. Whether a first- or second-level division is allowed within the form ought to depend on whether the form occurs within any division itself -- but in this formulation, it does not depend on that: either level is legal inside the form, no matter what happened outside it.
The schema uses user-defined types, rather than parameter entities, for specialized attribute datatypes like the set of legal widget types, or the set of legal methods. We define those types by enumerating the legal values:
The uriReference type does not need to be defined; it is a built-in type in XML Schema.
The problem with the DTD and schema given above is that the content models are fixed once and for all, and we would like them to vary depending on the context of the element instances.
For example, the declaration for <p> should take one form if the <p> is anywhere inside a <form> element:
but it should take another form otherwise, without <input> elements:
If we used parameter entities for the content models, as many DTDs do, to give a name to each of these content models, the result might look like this:
Inside forms and outside them, <p> needs to have two distinct content models:
<!--* inside forms: *--> <!ELEMENT p %p-in-form; > <!--* outside of forms *--> <!ELEMENT p %p-normal; > |
Similarly <form> should allow a mix of <div1> and <p> elements, if the form is outside of any <div1> element:
but inside a <div1> it should allow a <div2>:
and inside a <div2> it should allow no text divisions at all, only paragraphs:
Defined as parameter entities, these three content models take the following form:
<!ENTITY % form-in-body "(div1 | p)*" > <!ENTITY % form-in-div1 "(div2 | p)*" > <!ENTITY % form-in-div2 "(p)*" > |
Similar sets of varying declarations could be constructed for <div1> and <div2>; this is left as an exercise for the reader (hint: two of each).
In order to make clear just what is needed to formalize the definition of our example markup language, let us use some imaginary extensions to DTD notation. N.B. this is not a proposal for changing or extending DTD notation, just an attempt to provide a more compact way of seeing what is going on. Let us imagine that DTDs allow us to declare element types in the familiar way, and to refer to them in content models in the usual way. But let us imagine further that whenever an element type is referred to in any content model, we can if we wish specify a different content model for elements in the document which match that particular content-model particle. Formally, we can say that conventional element-type declarations bind a generic identifier to a content model, globally, and that our new syntax allows us to override the global binding with a local binding.
Concretely,
TYPE
declaration for content models, which
specifies a content-model name and a content model. (I am not making a
real proposal for reform of DTD notation, so I won't worry about simple
types or various other complications.)The parameter entities p-normal,
p-in-form, etc., which were defined above, now become
content-model names, and are declared with TYPE
declarations:
This imaginary extension of DTDs allows us to capture formally all the rules of our example markup language M.
XML Schema can be used to allow declarations to vary by context, in much the same way as the imaginary extensions to DTD notation just described. The key is that in addition to declaring element types at the top level, globally (i.e. as a global binding between an element-type name and a complex type), we can declare element types locally, within the definition of a particular complex type.
Parts of the schema (specifically the declarations for <html>, <head>, <meta>, <h1>, <h2>, and <input>, are the same as before.
For the other elements, first let us declare the various types we
need. Within these types we may either refer to
global element types (by means of <element
ref="..."/>
), or
define new
local element types (by means of <element
name="..." type="..."/>)
. Where we need multiple forms of one
element type (or, to put it a different way, when we need the same
generic identifier to be bound to different complex types in different
contexts), we define named types for each variant. When we don't need
such complexity, we just define the element at the global level, as
already shown above.
Paragraphs can be of two types, which we name p-normal and p-in-form:
<xsd:complexType name="p-normal" content="textonly"/> <xsd:complexType name="p-in-form" content="mixed"> <xsd:element ref="input" minOccurs="0" maxOccurs="unbounded"/> </xsd:complexType> |
Forms can occur directly within the body of the text, or within first- or second-level divisions; in each case, we need a distinct complex type, which we name form-in-body, form-in-div1, and form-in-div2. All three of these have the same attributes, so we factor the attribute declarations out into an attribute group:
First-level divisions can occur directly within the body of the text, or within a form. In the former case, they can contain forms or second-level divisions, or paragraphs:
In the latter case, they cannot contain nested forms, and they use the alternate inside-a-form types for nested divisions and paragraphs:
Second-level divisions can occur directly within a first-level division, or within a form which is itself inside a first-level division. If no form is open, a form can occur within the <div2>:
If a form is open outside the <div2>, no form is allowed inside, only paragraphs:
<xsd:complexType name="div2-in-form"> <xsd:sequence> <xsd:element ref="h2"/> <xsd:element name="p" type="p-in-form" minOccurs="1" maxOccurs="unbounded"/> </xsd:sequence> </xsd:complexType> |
We have now defined all the different complex types we need for the context-sensitive elements. It remains only to declare the <body> element, which specifies types form-in-body, div1-normal, and p-normal for its children:
The new schema correctly enforces all the rules laid out at the outset:
Since these rules cannot be enforced by DTDs, it is clear that local binding of GIs and types makes schemas strictly more powerful (in this respect) than DTDs. Since such local bindings have often been desired in document markup languages (see, for example, Welty and Ide 1999), XML Schema should prove a better (more powerful and more flexible) vehicle for defining such languages than DTDs have been.
We now have a fully worked out example of our basic problem. The problem is simply described: we are writing a schema, which in essence is a context-free grammar, and we are trying to use it to express constraints on the data which depend on the context. The solution is equally simply described: for each element whose content model should vary by context, we write distinct forms of that content model, and bind them to the appropriate generic identifier in the appropriate locations.
This technique is an application of a a well-known technique for capturing limited context-sensitivity in context-free grammars. This section describes that basic technique, and describes a general method of expressing context-sensitive constraints in document grammars which allow local binding of generic identifiers to content models.
Let us consider a very simple example of context sensitivity: subject-verb agreement in English. Consider the grammar
S ::= NP VP NP ::= PN | N | Det N VP ::= V | V NP PN ::= Bill | Ed N ::= students | teachers | student | teacher Det ::= a | some | the V ::= like | likes | heckle | heckles | teaches | teach |
This grammar generates grammatical sentences like “the teacher teaches some students” and “Bill likes Ed”, but also non-grammatical sentences like “the teachers teaches a students”, etc. Number agreement between subject and verb is -- in principle -- context-sensitive information: a singular verb is allowed in the VP only in contexts where the VP follows an NP containing a singular noun. But a context-free grammar can capture this particular bit of context-sensitive information, by distinguishing singular noun phrases and verb phrases from their plural counterparts, thus:
This grammar enforces subject-verb agreement, and thus enforces the context-sensitive subject-verb agreement rule in a context-free grammar.
Basically, we have doubled almost every production rule, to provide a singular and a plural form. This technique trades size of the grammar for power, and it generalizes: any finite number of pieces of context-sensitive information can be carried around in a finite context-free grammar (though not usually in a compact context-free grammar). There are grammar systems which allow a more compact form of specification, and which then generate the long context-free grammar from the compact specifications.
To apply this technique to schemas, the procedure is as follows:
Chomsky, Noam. “On certain formal properties of grammars”. Information and Control 2 (1959): 137-167.
DeRose, Steven J. The SGML FAQ Book: Understanding the Foundation of HTML and XML. Boston, Dordrecht, London: Kluwer Academic, 1997. Questions 5.5 (pp. 105-106) and 6.1 (pp. 139-142) discuss content-model exceptions and touch on issues of context sensitivity.
Grune, Dick, and Ceriel J.H. Jacobs. Parsing Techniques: A Practical Guide. New York, London: Ellis Horwood, 1990.
Berners-Lee, Tim, and Dan Connolly. Hypertext Markup Language - 2.0. RFC 1866. [n.p.]: Internet Engineering Task Force, November 1995. ftp://ftp.isi.edu/in-notes/rfc1866.txt
Maler, Eve, and Jeanne el Andaloussi. Developing SGML DTDs: From Text to Model to Markup. Upper Saddle River, NJ: Prentice Hall PTR, 1996. Section 6.1.3 discusses the possibility of merging distinct content models, and observes that the choice among them may be determined by context, but the correct choice cannot be checked by the DTD.
Raggett, Dave. HTML 3.2 Reference Specification. W3C Recommendation 14-Jan-1997. [Cambridge, Sophia-Antipolis, Tokyo]: W3C, 1997. http://www.w3.org/TR/REC-html32
Welty, Christopher, and Nancy Ide. “Using the Right Tools: Enhancing Retrieval from Marked-up Documents.”CHum 33 (1999): 59-84. Originally delivered at TEI 10, Providence (1997).
[1] This is true of XML DTDs; it is slightly less true of SGML DTDs, owing to the inclusion and exclusion exceptions constructs present in SGML and omitted from XML. Inclusion and exclusion exceptions are an attempt to deal with the problems I am about to describe. They were omitted from XML because the complexity they introduce seems out of proportion to the modest increase in expressive power they provide. As the examples below should make clear to those familiar with exceptions in SGML, the local definitions of XML Schema are a better approach and can accomplish more.
[2] The concepts of context-free and context-sensitive grammars were first introduced, in their current form, by Chomsky 1957. Detailed discussion can be found in virtually any textbook of parsing, syntax-directed programs, and compiler construction. A good introduction is in Grune.
[3] The author acknowledges his deep debt, and expresses his deep thanks, to Murata Makoto for discussions of document grammars over the last few years. The example outlined here using local types was originally developed as an application of Murata's proposals to use tree automata in defining document grammars. The complex types used here serve much the same purpose as the non-terminals in Murata's proposals.
[4] Since it would be desirable to make it accessible to readers without such familiarity, anyone who is puzzled by some part of the document is invited to contact the author and explain where you got lost, and what information you think would have helped keep you from getting lost.
[5] Well, mostly enforced. See if you can spot the gap in coverage before the text mentions it.