Notes on logic grammars and XML Schema

A working paper prepared for the W3C XML Schema Working Group

C. M. Sperberg-McQueen

8 October 2002

N.B. not complete: work in progress

1. Note to the reader
2. Introduction
3. Simple DTD-level checking with definite-clause grammars
- 3.1. Definite-clause grammars
  - 3.1.1. A trivial grammar for a fragment of English (E1)
  - 3.1.2. Calling the parser
  - 3.1.3. A simple attribute grammar (E2)
- 3.2. Representing XML in Prolog
- 3.3. A simple schema
- 3.4. Layer 1: Translation into definite-clause grammar
  - 3.4.1. Top-level rules for element types
  - 3.4.2. Rules for attributes of complex types
  - 3.4.3. Rules for content of complex types
  - 3.4.4. Rules for checking values of simple types
- 3.5. Layer 2a: checking attribute types
- 3.6. Layer 2b: checking for duplicate, defaulted, and required attributes
- 3.7. Evaluation
4. Providing PSVI properties
- 4.1. Introduction to DCTG notation
- 4.2. Another example: the English grammar fragment E1
- 4.3. English grammar with attributes (E2) in DCTG notation
- 4.4. Layer 3: Translation into DCTG notation
- 4.5. Layer 4: PSVI properties
- 4.6. Layer 4a: handling xsi:type
- 4.7. Layer 4b: wildcards with skip and lax processing
- 4.8. Layer 4c: substitution groups
- 4.9. Layer 4d: error handling
- 4.10. Layer 4e: validity and validation-attempted properties
- 4.11. Further work needed for complete representation
5. DCTGs and schema components
6. A two-level system

A. Works cited and further reading
B. To do

This document describes two possible uses of logic grammars in helping elucidate the XML Schema specification. The term logic grammar is used to denote grammars written in logic-programming systems; this paper works with definite-clause grammars (DCGs), which are a built-in part of most Prolog systems, and definite-clause translation grammars (DCTGs), which employ a similar formalism which more closely resembles attribute grammars as described by [Knuth 1968] and later writers and which make it a bit easier to handle complex specifications. Both DCGs and DCTGs can be regarded as syntactic sugar for straight Prolog; before execution, both notations are translated into Prolog clauses in the usual notation.

In its current state, this paper is unfinished. Some highlighted notes about what needs to be done to complete the paper are included.

1. Note to the reader

At the XML Schema WG face to face meeting in Redmond in August 2002, Allen Brown suggested that one way of more or less directly interpreting the inference rules in XML Schema: Formal Description would be to translate them mechanically into a definite-clause translation grammar. This paper is an elaboration of this idea, intended to help the author (and possibly other members of the WG) understand how this might work.

The idea of direct interpretation of formal specifications is not new; many textbooks on formal methods or on specific formal methods (Z, the Vienna Development Method VDM, etc.) discuss it at least as an idea. [Stepney 1993] provides a simple Web-accessible example of the idea: she illustrates the creation of a high-integrity compiler by

specifying the semantics (both dynamic and static) of a procedural programming language formally, in Z
specifying the semantics of a target machine language (or: assembly language) formally, also in Z
specifying the translation from source language to machine language formally
specifying the syntax of the programming language in a definite-clause translation grammar
translating the static semantics, dynamic semantics, and the source-to-machine language translation as grammatical attributes[1] in the DCTG. A Prolog system asked to supply the dynamic semantics of a program fragment serves as an interpreter for the source language; a Prolog system asked to supply the translation semantics acts as a compiler for the source language.

Animation of a specification is probably easiest if the spec is in a formal language. As Stepney's book illustrates, however, it is still possible otherwise, but it involves both the animation of a formal specification and an interpretation of the spec (in the form of a translation into a more formal language); the latter requires careful checking.

This paper attempts to make plausible the claim that a similar approach can be used with the XML Schema specification, in order to provide a runnable XML Schema processor with a very close tie to the wording of the XML Schema specification.

NOTE:

In the final version of this paper, it is expected that brief introductions to the various notations used will be provided. For the first draft, however, the introductions are rather perfunctory and some sections have been left as stubs. Introductory material may be found using Web searches or by consulting standard textbooks. Some pointers are given in the references.

2. Introduction

Any schema defines a set of trees, and can thus be modeled more or less plausibly by a grammar. Schemas defined using XML Schema 1.0 impose some constraints which are not conveniently represented by pure context-free grammars, and the process of schema-validity-assessment defined by the XML Schema 1.0 specification requires implementations to produce information that goes well beyond a yes/no answer to the question "is this tree a member of the set?" For both of these reasons, it is convenient to use a form of attribute grammar to model a schema; logic grammars are a convenient choice.

In the remainder of this paper, I introduce some basic ideas for using logic grammars as a way of animating the XML Schema specification / modeling XML Schema.

The first question that arises is:

How do we use logic grammars to model XML Schema processing?

To this, there are three answers:

Model the schema as a logic grammar; use it to parse document instances and generate the PSVI.
Move up one level of abstraction: model the schema for schemas as a logic grammar; use it to parse schema documents and generate the appropriate schema components.
Do both. This could be done by writing a utility to take a set of components (in the form generated by the logic grammar for schema documents) and generate a corresponding logic grammar (in the form needed for parsing other document instances).

Some problems arise if we use logic grammars to model schemas and perform schema-validity assessment by parsing a document using the logic grammar. These problems are attacked in a series of small examples in which a DCTG is developed which represents a simple schema:

In practice, the XML input document is likely to be presented to us in the form of a tree. How do we apply logic grammars to things that are already trees instead of sequences of tokens?
How do we check XML attributes?
How do we model type assignment, black-box validation, white-box validation?
How do we model the post-schema-validation information set?

Some problems arise if we use a logic grammar to model the rules for constructing schema components, and parse schema documents with that logic grammar as a way of modeling the process of constructing schema components.

How do we use parse-tree attributes to model information-item properties?
How do we use the output of this process to drive schema-validity assessement of document instances? Can we formulate the logic grammars used for schema-validity assessment as properties / attributes of the schema components?

3. Simple DTD-level checking with definite-clause grammars

We start with a simple case: considering only (a subset of) the kind of schema information captured in a document-type definition (DTD) in XML 1.0 DTD syntax, we write a logic grammar to represent a schema or DTD and use it to validate a document. The same technique, of course, can be used to capture and use part of the information in a schema expressed in XML Schema 1.0.

3.1. Definite-clause grammars

Two or three versions of a simple example may be useful in making some characteristics of DCG notation clear.

Readers already familiar with DCGs can skip this section without loss.

3.1.1. A trivial grammar for a fragment of English (E1)

Here is a simple DCG for a trivially small fragment of English, which I'll call E1. It is taken from example 8.1 in [König/Seiffert 1989], an introduction to Prolog for linguists:

%%% Grammar E1, a trivial fragment of English
s --> np, vp.  % A sentence (s) is a noun phrase (np) plus a verb phrase
np --> det, n. % A noun phrase is a determiner plus a noun
np --> n.      % ... or just a noun.
vp --> v, np.  % A verb phrase is a verb and its direct object, an np
vp --> v.      % ... or just the verb (for intransitives).
n --> [mary].  % 'mary', 'john', 'woman, 'man', 'apple' are nouns.
n --> [john].
n --> [woman].
n --> [man].
n --> [apple].
det --> [the]. % 'the' is a determiner
v --> [loves]. % 'loves', 'eats', and 'sings' are verbs.
v --> [eats].
v --> [sings].

This grammar generates sentences like "John loves Mary" and "The man sings" and "The woman eats the apple", as well as the somewhat less plausible "The apple sings the Mary."

When a Prolog system consults a file containing this grammar, the rules of the grammar are automatically translated into canonical Prolog clauses using difference lists:

?- listing([s,np,vp,n,det,v]).

s(A, B) :-
        np(A, C),
        vp(C, B).

np(A, B) :-
        det(A, C),
        n(C, B).
np(A, B) :-
        n(A, B).

vp(A, B) :-
        v(A, C),
        np(C, B).
vp(A, B) :-
        v(A, B).

n([mary|A], A).
n([john|A], A).
n([woman|A], A).
n([man|A], A).
n([apple|A], A).

det([the|A], A).

v([loves|A], A).
v([eats|A], A).
v([sings|A], A).

Yes
?-

Difference lists are pairs of lists such that the second list is a suffix of the first list (i.e. the same from some point or other to the end), for example, the pair [a, b, c, d] and [c, d]. Prolog-based parsers commonly use difference lists to represent the input; they allow the Prolog clauses which recognize structures in the input list to consume as much of the input list (the first list in the pair) as they wish, and pass the rest of the input to the next clause.

For example, given the sentence [the, woman, eats, the, apple], the first rule for np will be called with its first argument bound to the entire input string; after it recognizes the initial noun phrase, the second argument will be bound to the part of the sentence which follows the noun phrase, namely [eats,the,apple].

3.1.2. Calling the parser

The following fragment of a Prolog session log shows how the grammar can be used to test the grammaticality of sentences; it also shows that the grammar could use some refinement. To test whether a given sequence of tokens is an s, we call the Prolog predicate s with two difference-list arguments. To require that the entire input sequence be a legal sentence, we specify the empty list as the second argument.

?-  s([john,loves,mary],[]).

Yes
?- s([the,woman,eats,the,apple],[]).

Yes
?- s([the,man,sings],[]).

Yes
?- s([apple,eats,woman,the],[]).

No
?- s([apple,sings,the,mary],[]).

Yes
?-

We can use a variable as a second argument, instead of the empty list; in this case, Prolog will tell us each prefix of the input list which can be parsed as a sentence according to our grammar:

?- s([the,woman,eats,the,apple],Remainder).

Remainder = [] ;

Remainder = [the, apple] ;

No
?-

In one parse, the entire input is parsed as a sentence, and the remainder is the empty list. An alternative parse, however, is just to take [the,woman,eats] as a sentence, leaving [the,apple] as the remainder.

The XML Schema processor we sketch below will normally call the parser with an empty list as the final parameter, to require that the entire input sequence be consumed. If we wish to see how much of the content of an element belongs to the base type, and how much to an extension of the base type, we can call the parser for the base type, with a variable as the final parameter, and the base type will consume as much of the input as it can. The rest belongs to the extension.

3.1.3. A simple attribute grammar (E2)

We can turn this simple context-free grammar E1 into an attribute grammar by adding parameters to the productions; I will call the attribute grammar E2. In E2, one parameter is used to generate a structural description of the sentence. In this notation, the sentence "John loves Mary" is assigned the structure s(np(pn(john)), vp(v(loves), np(pn(mary)))) -- or, with pretty-printing:

s(
  np(
     pn(john)
    ), 
  vp(
     v(loves), 
     np(
        pn(mary)
       )
    )
 )

On some productions, a second parameter is used to enforce agreement in number between subject and verb and between determiner and noun.

Thus extended with parameters, and a few more words in the lexicon, the grammar looks like this:

%%% E2: a trivial attribute grammar for a fragment of
%%% English, with a synthesized attribute for structural description
%%% and enforcement of number agreement.

s(s(S,P)) --> np(S,Num), vp(P,Num).
np(np(D,N),Num) --> det(D,Num), n(N,Num).
np(np(N),pl) --> n(N,pl).
np(np(N),sg) --> pn(N,sg).
vp(vp(V,O),Num) --> v(V,Num), np(O,_).
vp(vp(V),Num) --> v(V,Num).
n(n(L),Num) --> [L], { lex(L,n,Num) }.
pn(pn(L),Num) --> [L], { lex(L,pn,Num) }.
v(v(L),Num) --> [L], { lex(L,v,Num) }.
det(det(L),Num) --> [L], { lex(L,det,Num) }.

lex(mary,pn,sg).
lex(john,pn,sg).
lex(woman,n,sg).
lex(women,n,pl).
lex(man,n,sg).
lex(men,n,pl).
lex(apple,n,sg).
lex(apples,n,pl).
lex(the,det,_).
lex(some,det,pl).
lex(one,det,sg).
lex(loves,v,sg).
lex(love,v,pl).
lex(eats,v,sg).
lex(eat,v,pl).
lex(sings,v,sg).
lex(sing,v,pl).

In this version of the example, it is worth calling the reader's attention to the Prolog clauses enclosed in braces on the right-hand side of some rules, e.g. in

n(n(L),Num) --> [L], { lex(L,n,Num)}.

These Prolog clauses express constraints on the grammatical construction which are not themselves grammatical constituents. They are sometimes called `guards' (e.g. by [Brown/Blair 1990]); they correspond directly to the `semantic conditions' which are a constituent part of attribute grammars as described by [Alblas 1991]. With the guard, this rule can be paraphrased as saying "A noun (an n), has the structure n(L) and the grammatical number Num, if (a) L (for `lexical form') is a single item in the input list [L], and (b) the relation lex contains the triple (L,n,Num)."

Such guards will be present in many of the rules we use to represent XML Schema in logic grammars.

By calling the grammar rules, we can instantiate a variable with the structural description of the sentence.

?- s(Struc,[john,loves,mary],[]).

Struc = s(np(pn(john)), vp(v(loves), np(pn(mary)))) 

Yes
?- s(S,[the,woman,eats,the,apple],[]).

S = s(np(det(the), n(woman)), vp(v(eats), np(det(the), n(apple)))) 

Yes
?- s(S,[the,man,sings],[]).

S = s(np(det(the), n(man)), vp(v(sings))) 

Yes
?- s(S,[apple,sings,the,mary],[]).

No
?- s(S,[the,apple,sings,mary],[]).

S = s(np(det(the), n(apple)), vp(v(sings), np(pn(mary)))) 

Yes
?-

Note that in a `conventional' use of DCGs, the user or application will call the grammar just once, at the top level, and the parser itself will generate recursive calls for the nested grammatical constructs (e.g. np and vp in our example). Note, however, that we can if we wish make calls to any level of the grammar from arbitrary Prolog code. Note also that we can include arbitrary Prolog code in the right-hand side of rules by wrapping it in braces. By doing so, we can make arbitrarily complex calculations part of the grammar rules; this is one reason that DCGs are not limited to context-free languages.

3.2. Representing XML in Prolog

The SWI-Prolog system includes an SGML/XML parser which conveniently makes SGML and XML documents accessible to Prolog manipulation [Wielemaker 2001].

The parser returns the XML document to the user as a list of items,[2] each item one of:

an atom (for character data)
a structure of the form element(GI,AttributeList,ContentList), where
- GI is an atom representing the element's generic identifier
- AttributeList is a list containing the element's attribute-value specifications in the form Attributename=Value, with both attribute names and attribute values represented as Prolog atoms. Like all Prolog lists, this one is comma-separated, so a list as a whole might look like [id=frobnitzer,rend=blort,type=farble]
- ContentList is a list of items (i.e. atoms representing character data, element(Gi,A,C) structures, entity(N) structures, etc.
a structure of the form entity(Integer) representing an otherwise unrepresentable character
a structure of the form entity(Name) representing an otherwise unrepresentable character for which a named entity is defined
a structure of the form pi(Text) representing a processing instruction

For example, the sample purchase order used as an example in XML Schema Part 0 looks like this in the format described (I have added white space for legibility; the single quotation marks around some atoms are a way of ensuring that they can be read again by a Prolog implementation and recognized as atoms):

[element(purchaseOrder, 
  [orderDate='1999-10-20'], 
  ['\n    ', 
   element(shipTo, 
     [country='US'], 
     ['\n        ', 
      element(name, [], ['Alice Smith']), 
      '\n        ', 
      element(street, [], ['123 Maple Street']), 
      '\n        ', 
      element(city, [], ['Mill Valley']), 
      '\n        ', 
      element(state, [], ['CA']), 
      '\n        ', 
      element(zip, [], ['90952']), 
      '\n    '
     ]), 
   '\n    ', 
   element(billTo, 
     [country='US'], 
     ['\n        ', 
      element(name, [], ['Robert Smith']), 
      '\n        ', 
      element(street, [], ['8 Oak Avenue']), 
      '\n        ', 
      element(city, [], ['Old Town']), 
      '\n        ', 
      element(state, [], ['PA']), 
      '\n        ', 
      element(zip, [], ['95819']), 
      '\n    '
     ]), 
   '\n    ', 
   element(comment, [], ['Hurry, my lawn is going wild!']), 
   '\n    ', 
   element(items, 
     [], 
     ['\n        ', 
      element(item, 
        [partNum='872-AA'], 
        ['\n            ', 
         element(productName, [], ['Lawnmower']), 
         '\n            ', 
         element(quantity, [], ['1']), 
         '\n            ', 
         element('USPrice', [], ['148.95']), 
         '\n            ', 
         element(comment, 
           [], 
           ['Confirm this is electric']), 
         '\n        ']), 
      '\n        ', 
      element(item, 
        [partNum='926-AA'], 
        ['\n            ', 
         element(productName, [], ['Baby Monitor']), 
         '\n            ', 
         element(quantity, [], ['1']), 
         '\n            ', 
         element('USPrice', [], ['39.98']), 
         '\n            ', 
         element(shipDate, [], ['1999-05-21']), 
         '\n        '
        ]), 
      '\n    '
     ]), 
   '\n'
  ])
]

When elements and attributes are namespace-qualified, the SWI-Prolog parser represents the name as a pair of two atoms, separated by a colon. To the right of the colon is the local name; depending on the option specified, the left-hand atom is either the namespace name or the namespace prefix used.

Other Prolog representations of XML are possible, of course, but we can use this one and it already exists.

3.3. A simple schema

Consider the simple schema defined for purchase orders in Part 0 of the XML Schema specification ([W3C 2001a]). Its complex types are relatively simple, involving only sequences of elements, some of which are optional, and some attributes. Some attributes and some elements are defined with named and some with anonymous simple types. This schema will serve admirably to illustrate the translation of schemas to logic grammars.

3.4. Layer 1: Translation into definite-clause grammar

Each element type in the language is translated into three sets of DCG productions:

a top-level rule to match the tag and get the contents
a set of rules for checking the element's attributes
a set of rules defining the element's content; this is the part that looks most like a conventional DCG

3.4.1. Top-level rules for element types

For each element type in the schema, we create one rule in the grammar, which will serve to match the start-tag, get the attributes and contents of the element, and call routines to check the attributes and content against the complex type. Here there is exactly one such rule for each element type, but in later examples we will have several, in order to handle substitution groups. A simple version of these top-level rules would look like this:[3]

< 1 Rules for elements with complex types [File podcg3.pl] > ≡

purchaseOrder --> [element(purchaseOrder,Attributes,Content)],
  {
    purchaseOrderType_atts(Attributes,[]),
    purchaseOrderType_cont(Content,[])
  }.
shipTo --> [element(shipTo,Attributes,Content)],
  {
    usAddress_atts(Attributes,[]),
    usAddress_cont(Content,[])
  }.
billTo --> [element(billTo,Attributes,Content)],
  {
    usAddress_atts(Attributes,[]),
    usAddress_cont(Content,[])
  }.
items --> [element(items,Attributes,Content)],
  {
    t_items_atts(Attributes,[]),
    t_items_cont(Content,[])
  }.
item --> [element(item,Attributes,Content)],
  {
    t_item_atts(Attributes,[]),
    t_item_cont(Content,[])
  }.

Because we get the XML document already in tree form, the top-level rules don't have much work to do. In particular, they do not have to identify the substructure of the element or find its end; that has already been done. Instead, we get the attributes and content already separated out. We launch a separate call on each to parse them according to the appropriate rule for their datatype.[4]

The comment element and others associated with simple types pose an interesting question. We could define them following the pattern above, like this:

comment --> [element(comment,Attributes,Content)],
  {
    xsd_string_atts(Attributes,[]),
    xsd_string_cont(Content,[])
  }.

But we know that the xsd:string type does not have any attributes declared, so Attributes must be the empty list. We can therefore define these elements this way instead:

< 2 Rules for elements with simple types [File podcg3.pl] > ≡

comment --> [element(comment,[],Content)],
  {
    xsd_string_cont(Content,[])
  }.
poname --> [element(name,[],Content)], { xsd_string_cont(Content,[]) }.
street --> [element(street,[],Content)], { xsd_string_cont(Content,[]) }.
city   --> [element(city,[],Content)], { xsd_string_cont(Content,[]) }.
state  --> [element(state,[],Content)], { xsd_string_cont(Content,[]) }.
zip    --> [element(zip,[],Content)], { xsd_decimal_cont(Content,[]) }.

productName --> [element(productName,[],Content)], 
                { xsd_string_cont(Content,[]) }.
quantity    --> [element(quantity,[],Content)], 
                { xsd_integer_cont(Content,[]) }.
'USPrice'   --> [element('USPrice',[],Content)], 
                { xsd_decimal_cont(Content,[]) }.
shipDate    --> [element(shipDate,[],Content)], 
                { xsd_date_cont(Content,[]) }.

3.4.2. Rules for attributes of complex types

For each complex type in the schema, we generate two rules, one we will use to check the attributes, and one to check the contents. For each simple type, we generate one rule for checking the value.

The attributes for a given element can be enumerated; in this first example, we check to make sure each attribute was declared.

< 3 Rules for attributes defined for complex types (v1) > ≡

purchaseOrderType_atts --> [].
purchaseOrderType_atts --> purchaseOrderType_att, purchaseOrderType_atts.
purchaseOrderType_att --> [orderDate=Date].

usAddress_atts --> [].
usAddress_atts --> usAddress_att, usAddress_atts.
usAddress_att --> [country='US']. /* note: fixed value! */

t_items_atts --> [].
t_items_atts --> t_items_att, t_items_atts.
t_items_att  --> no_such_attribute.
/* note that t_items_att is undefined, since there
 * are no attributes for the Items type 
 */

t_item_atts --> [].
t_item_atts --> t_item_att, t_item_atts.
t_item_att  --> [partNum=SKU].

This code is not used elsewhere.

For now, we do not worry about whether required attributes have been specified, or about checking the type of the attribute, or about supplying default values; we will do that later.

3.4.3. Rules for content of complex types

The most conventional-looking part of our grammar is the representation of the content models. A purchase order (being of type purchaseOrderType) has a ship-to address, a billing address, possibly a comment, and a list of items. An element with complex type USAddress has a name, street, city, state, and zip. We handle optional elements by writing a pair of recursive productions:

< 4 [File podcg3.pl] > ≡

purchaseOrderType_cont --> shipTo, billTo, opt_comment, items.
opt_comment --> [].
opt_comment --> comment, opt_comment.

usAddress_cont --> poname, street, city, state, zip.

t_items_cont --> star_item.
star_item    --> [].
star_item    --> item, star_item.

t_item_cont --> productName, quantity, 'USPrice', 
                opt_comment, opt_shipDate.

opt_shipDate --> [].
opt_shipDate --> shipDate.

It's worth noting that since the content models of XML all define regular languages, the right hand sides of the DCG rules for handling the content of elements are all rather simple. The only complexity is introduced by nested choices, occurrence indicators, and the like. In a way, this is the reflection of a fundamental fact about validating XML: the XML structure is given rather than needing to be discovered, and the rules for validation are correspondingly simpler.

A practical consequence of the fact that XML works with regular-right-part grammars while we are working with something more like standard Backus-Naur Form is that whenever we have recursive constructs like star_block, the parse tree described by the DCG does not correspond directly to the XML structure, and when we generate PSVI structures we are going to need to flatten the result to make it fit the XML structure.[5]

There is one other complication in practice: if an element has mixed content, we need to allow atoms representing character data between sub-elements; otherwise, we need to allow only atoms representing white space. For now, we ignore this.

3.4.4. Rules for checking values of simple types

We need some rules for checking simple types. The rules for built-in types of XML Schema, at least, can be written once for all and built in to a library. N.B. In order to get something working quickly, we cut some corners here.

In each case we write two rules, one to check the content of an element by checking to see whether the putative contents are a legal value, and the second to check the value itself. Within these latter rules, it is necessary to remember that the thing we are checking is a Prolog atom created from the lexical form for the putative value. In the case of strings, anything that's a legal lexical form (represented, following the conventions of the SWI-Prolog parser, as an atom) counts as a legal string.

< 5 Checking simple types [File podcg3.pl] > ≡

/* In our representation of XML, character data is represented
 * as atoms.  Need to check on handling of non-ASCII characters */
xsd_string_cont([H|T],T) :- xsd_string_value(H).
xsd_string_value(LF) :- atom(LF).

We check decimal numbers by converting the lexical form from an atom into a sequence of characters and then seeing whether Prolog can create a number out of that sequence of characters. This is actually a bit too broad, at least with the Prolog I am using, since it accepts numbers in scientific notation. But it will do for an illustration.

< 6 Checking numbers [File podcg3.pl] > ≡

/* We'll accept anything as a decimal number if we can convert
 * the atom to a list of character codes which can in turn be
 * converted to a number. 
 */
xsd_decimal_cont([H|T],T) :- xsd_decimal_value(H).
xsd_decimal_value(LF) :- atom_codes(LF,L), number_codes(N,L).
xsd_integer_cont([H|T],T) :- xsd_integer_value(H).
xsd_integer_value(LF) :- 
  atom_codes(LF,L), 
  number_codes(N,L),
  integer(N).

For dates, we really punt for now.

< 7 Checking numbers [File podcg3.pl] > ≡

xsd_date_cont([H|T],T) :- xsd_date_value(H). 
xsd_date_value(LF) :- atom(LF). 
/* well, we really need to check this ... */

With this set of rules, we have a primitive but working program for checking whether a given XML document does or does not conform to the purchase order schema. There are a number of constraints it does not check, however; in the following sections of this document we will refine the code to make it a more persuasive representation of the schema.

3.5. Layer 2a: checking attribute types

One obvious improvement would be to check to make sure that the values of attributes are legitimate representations of the simple types with which the attributes are declared. Here is a revision of the relevant rules:

< 8 Rules for attributes defined for complex types (v2) [File podcg3.pl] > ≡

purchaseOrderType_atts --> [].
purchaseOrderType_atts --> purchaseOrderType_att, purchaseOrderType_atts.
purchaseOrderType_att --> [orderDate=Date],
  { xsd_date_value(Date) }.

usAddress_atts --> [].
usAddress_atts --> usAddress_att, usAddress_atts.
usAddress_att --> [country='US']. /* note: fixed value! */

t_items_atts --> [].
t_items_atts --> t_items_att, t_items_atts.
t_items_att  --> [no_such_attribute].
/* Since it has no equal sign or value, the rule for
 * t_items_att will never match anything.
 * Alternatively we could leave t_items_att undefined, since there
 * are no attributes for the Items type.
 */

t_item_atts --> [].
t_item_atts --> t_item_att, t_item_atts.
t_item_att  --> [partNum=SKU],
  { po_sku_value(SKU) }.

Hand-coding a rule to check SKU values is not too hard, and illustrates how we can check the lexical forms of simple types by writing mini-grammars for them:

< 9 Checking SKU values [File podcg3.pl] > ≡

po_sku_value(LF) :- 
  atom_chars(LF,Charseq),
  sku_value(Charseq,[]).

sku_value --> sku_decimal_part, hyphen, sku_alpha_part.
sku_decimal_part --> digit, digit, digit.
sku_alpha_part --> cap_a_z, cap_a_z.

The mini-grammar for SKU values relies on the SWI-Prolog character classification predicate char_type, though it would be possible to implement rules for digit checking and letter checking in any Prolog system.

< 10 Character classification rules [File podcg3.pl] > ≡

digit --> [Char], { char_type(Char,digit) }.
hyphen --> ['-'].
cap_a_z --> [Char], { char_type(Char,upper) }.

3.6. Layer 2b: checking for duplicate, defaulted, and required attributes

Another obvious improvement would be to check to ensure that attributes are not specified more than once (strictly speaking, the upstream XML processor should be checking this, so this is probably not essential) and that required attributes do have values specified. We have already seen how to check a fixed value: we simply specify the value as part of the grammatical rule for the attribute.

In our sample schema, the only required attribute on any type is the partNum attribute on the item element; it is also the only attribute defined for items. In a case like this one, we could simply write the relevant part of the grammar this way:

t_item_atts --> [partNum=SKU], { po_sku_value(SKU) }.

Or we could even put the constraint into the top-level rule for the item element:

item --> [element(item,[partNum=SKU],Content)],
  {
    po_sku_value(SKU),
    t_item_cont(Content,[])
  }.

In the general case, however, it is going to be simpler to put checking for required, forbidden, and defaulted attributes into a separate rule, which we will here call t_item_att_check:

< 11 Checking for required, forbidden attributes > ≡

t_item_att_check(LAVS) :-
  atts_present(LAVS,[partNum]).

This code is not used elsewhere.

In the general case, we may have to check for required, forbidden, and defaulted attributes, and to merge defaulted attributes into the list of attribute-value specifications. So we can extend the general form of the foo_att_check predicate to:

< 12 Checking for required, forbidden attributes > ≡

t_item_att_check(LAVS,AugmentedLAVS) :-
  atts_present(LAVS,[partNum]),
  atts_absent(LAVS,[]),
  atts_defaulted(LAVS,[],AugmentedLAVS).

This code is not used elsewhere.

We assume here the existence of three special-purpose items for checking the presence or absence of items in lists of attribute-value specifications. A list of attribute-value specifications contains all the attributes in a list of required attributes, if the list of required attributes is empty.

< 13 Utilities for checking required attributes > ≡

atts_present(LAVS,[]).

Continued in < Utilities for checking required attributes 14 > , < Utilities for checking required attributes 15 > , < Utilities for checking forbidden attributes 16 > , < Utilities for checking forbidden attributes 17 > , < Utilities for checking defaulted attributes 18 >
This code is not used elsewhere.

A list of attribute-value specifications contains all the attributes in a list of required attributes, if it contains the first one, and also all the others.

< 14 Utilities for checking required attributes [continues 13 Utilities for checking required attributes] > ≡

atts_present(LAVS,[HRA|RequiredTail]) :-
  att_present(LAVS,HRA),
  atts_present(LAVS,RequiredTail).

And finally, an attribute value specification is present for a given attribute name if it is either at the head of the list, or in the tail of the list.

< 15 Utilities for checking required attributes [continues 13 Utilities for checking required attributes] > ≡

att_present([Attname=Attval|Tail],Attname).
att_present([AVS|Tail],Attname) :-
  att_present(Tail,Attname).

Some readers will be looking, right about now, for a rule of the form

att_present([],Attname) :- ...

We don't have a rule for the empty list, though, because if we have run through the list of attribute-value specifications without finding one for the attribute name we are seeking, we want the predicate to fail. If necessary to prevent later maintenance programmers, or ourselves, from supplying the `missing' induction basis by mistake, we could write:

att_present([],Attname) :- !, fail.

Now for forbidden attributes. (These would be originally optional attributes whose maxOccurs has been set to 0 in a refinement step.) If there are no attribute names in the list of forbidden attributes, then we are done and all is well.

< 16 Utilities for checking forbidden attributes [continues 13 Utilities for checking required attributes] > ≡

atts_absent(LAVS,[]).

If the list of forbidden attribute names is not empty, we check first the head, then (recursively) the tail.

< 17 Utilities for checking forbidden attributes [continues 13 Utilities for checking required attributes] > ≡

atts_absent(LAVS,[HFA|ForbiddenTail]) :-
  not(att_present(LAVS,HFA)),
  atts_absent(LAVS,ForbiddenTail).

Finally, for attributes with defaults we specify a list of attribute-value specifications; for each of these, we check to see whether a value is already specified for an attribute of that name. If it is, we skip the attribute in question; if not, we add it to the list of attribute-value specifications. Either way, we call the predicate recursively to take care of the rest of the defaulted attributes.

< 18 Utilities for checking defaulted attributes [continues 13 Utilities for checking required attributes] > ≡


atts_defaulted(LAVS,[],LAVS).
atts_defaulted(LAVS,[AN=AV|Tail],AugmentedLAVS) :-
  att_present(LAVS,AN),
  atts_defaulted(LAVS,Tail,AugmentedLAVS).
atts_defaulted(LAVS,[AN=AV|Tail],[AN=AV|AugmentedLAVS]) :-
  not(att_present(LAVS,AN)),
  atts_defaulted(LAVS,Tail,AugmentedLAVS).

3.7. Evaluation

So far, we have shown that given a relatively simple schema, it is possible to create a definite-clause grammar which checks the salient constraints defined by the schema:

Each element is checked against the rules for its datatype.
Elements declared with a complex type are checked with regard to their attributes and their content.
- Attributes are checked to see that they were declared and that their value is of the correct type.
- Sets of attribute value specifications are checked to make sure that required attributes have been specified, and forbidden attributes have not. Default values can be supplied for attributes which have them.
- The regular language defined in a content model is translated into a set of definite-clause grammar productions; any legal sequence of children will be recognized, and any illegal sequence of children will fail to be recognized.
- Elements declared with a simple type are checked to ensure that their content is a legal value of the simple type.

The example we have been developing does not illustrate all forms of content models, occurrence indicators, or simple types, but I think it's obvious how to extend the treatment given here.

There are several gaps in coverage, which we will make good in what follows:

handling wildcards
handling character data intermingled with mixed content models, handling whitespace intermingled with non-mixed content models
handling substitution groups
handling xsi:type specifications in the document instance
identifying errors and recovering from them
providing relevant PSVI properties

It will be simplest to show how to address these gaps if we handle the last item first, and make our parser start returning more than a yes/no answer to the question "is this document/element/lexical form ok?" As shown by the examples used to introduce DCG notation, we could do that by adding parameters to the productions in the grammar; it will be more convenient, however, to translate our grammar from DCG form into a related form, that of definite-clause translation grammars. To that, the next part of this paper is dedicated.

4. Providing PSVI properties

To model schema-validity assessment properly, we need to provide more output: specifically, we need to provide information about the input document together with some additional properties (the schema infoset contributions). It's possible to do that in DCG notation, but it rapidly becomes cumbersome. We'll use DCTG notation instead; it was devised to handle grammatical attributes more conveniently than DCG, and to separate the semantics more effectively from the syntax [Abramson 1984].

introduction to DCTG notation
translation into DCTG notation
reconstructing input structure

On this foundation we will later be able to add more:

adding type name information: type name property
black-box (skip) validation
white-box (lax) validation, validation attempted property
error handling (falling back to lax), validity property

Note that this section is not intended to be a complete translation of XML Schema into DCTG, but a sample small enough to follow and large enough to make a persuasive case that all of XML Schema could be translated.

4.1. Introduction to DCTG notation

DCTG notation was invented to make it easier to calculate values for grammatical attributes and to refer to the values of grammatical attributes on other rules.

Here is a simple example. Consider the following grammar for binary strings:

bit ::= '0'.
bit ::= '1'.
bitstring ::= '' /* nothing */.
bitstring ::= bit, bitstring.
number ::= bitstring, fraction.
fraction ::= '.', bitstring.
fraction ::= ''.

We might wish to calculate the length and the (unsigned base-two) value of the bitstring as attributes. Using a yacc-like notation that might look like this. Notice that scale is a top-down attribute and value and fractionalvalue are bottom-up attributes.

bit ::= '0' { $0.bitvalue = 0; }.
bit ::= '1' { $0.bitvalue = power(2,$0.scale); }.
bitstring ::= '' { 
  $0.value = 0; 
  $0.length = 0;
  /* scale doesn't matter here */
}.
bitstring ::= bit, bitstring {
  $0.length = $2.length + 1;
  $1.scale = $0.scale;
  $2.scale = $0.scale - 1;
  $0.value = $1.value + $2.value;
}.
number ::= bitstring, fraction {
  $1.scale = $1.length - 1;
  $0.value = $1.value + $2.fractionalvalue;
}.
fraction ::= '.', bitstring {
  $2.scale = -1;
  $0.fractionalvalue = $2.value;
}.
fraction ::= '' {
  $0.fractionalvalue = 0;
}.

In DCTG notation, this grammar looks like this:[6]

bit ::= [0]
  <:>
  bitval(0,_).

bit ::= [1]
  <:>
  bitval(V,Scale) ::- V is **(2,Scale).

bitstring ::= []
  <:>
  length(0)
  &&
  value(0,_).

bitstring ::= bit^^B, bitstring^^B1
  <:>
  length(Length) ::-
    B1 ^^ length(Length1),
    Length is Length1 + 1
  &&
  value(Value,ScaleB) ::-
    B ^^ bitval(VB,ScaleB),
    S1 is ScaleB - 1,
    B1 ^^ value(V1,S1),
    Value is VB + V1.

number ::= bitstring ^^ B, fraction ^^ F
  <:>
  value(V) ::-
    B ^^ length(Length),
    S is Length-1,
    B ^^ value(VB,S),
    F ^^ fractional_value(VF),
    V is VB + VF.

fraction ::= ['.'], bitstring ^^ B
  <:>
  fractional_value(V) ::-
    S is -1,
    B ^^ value(V,S).

fraction ::= []
  <:>
  fractional_value(0).

As may be seen, each rule in DCTG consists of a left hand side, a right-hand side, and optionally a list of attributes. The right-hand side is separated from the left-hand side by the operator ::= and from the following list of attributes (if any) by the operator <:>. The expression on the right hand side is a sequence of non-terminal symbols, each optionally suffixed with the operator ^^ and a variable name (e.g. bitstring^^B). Each attribute is identified by a Prolog structure (e.g. value(V), whose functor is the name of the attribute and whose arguments are its values. The structure may be followed by the operator ::- (designed to look a lot like the standard Prolog `neck' operator :-) and a series of goals which will be satisfied when the attribute value is to be instantiated. In practice, these goals help to calculate the attribute values. Values of attributes attached to the items on the right-hand side may be referred to by the variable name associated with the item and the name of the attribute, joined by the operator ^^ (e.g. B ^^ bitval(VB,ScaleB).

An EBNF grammar for DCTG notation[7] is:

grammar ::= rule*
rule    ::= lhs '::=' rhs ('<:>' att-spec ('&&' att-spec)*)?
lhs     ::= term
rhs     ::= term (',' term)*
attspec ::= compound-term ('::-' goal (',' goal)*)?
compound-term ::= ATOM '(' term (',' term)* ')'

4.2. Another example: the English grammar fragment E1

Here is another simple example: the trivial fragment of English grammar given above, translated directly into DCTG notation, looks like the following. The only difference from the DCG form is that the separator between the left- and right-hand side of each production is "::=" instead of "-->".

%%% E1 (trivial context-free grammar for a fragment of English)
%%% in DCTG notation.
s ::= np, vp.
np ::= det, n.
np ::= n.
vp ::= v, np.
vp ::= v.
n ::= [mary].
n ::= [john].
n ::= [woman].
n ::= [man].
n ::= [apple].
det ::= [the].
v ::= [loves].
v ::= [eats].
v ::= [sings].

The translation into standard Prolog is similar to that used for DCGs, but instead of adding two arguments to each non-terminal, the DCTG translation adds three. The first additional argument is a Prolog structure with the functor node, described further below. The second and third are (like the two arguments added in DCG translations) difference lists.

The node structure has three arguments:

the non-terminal of the grammar rule (here s, np, vp, etc.
a list of the node structures associated with the items on the right-hand side of the grammar rule
a list of grammatical attributes (in this grammar, these will be empty

The translation of our trivial grammar into standard Prolog is thus:

?- dctg_reconsult('ks81dctg.pl').

Yes
?- listing([s,np,vp,n,det,v]).

:- dynamic s/3.

s(node(s, [A, B], []), C, D) :-
        np(A, C, E),
        vp(B, E, D).

:- dynamic np/3.

np(node(np, [A, B], []), C, D) :-
        det(A, C, E),
        n(B, E, D).
np(node(np, [A], []), B, C) :-
        n(A, B, C).

:- dynamic vp/3.

vp(node(vp, [A, B], []), C, D) :-
        v(A, C, E),
        np(B, E, D).
vp(node(vp, [A], []), B, C) :-
        v(A, B, C).

:- dynamic n/3.

n(node(n, [[mary]], []), A, B) :-
        c(A, mary, B).
n(node(n, [[john]], []), A, B) :-
        c(A, john, B).
n(node(n, [[woman]], []), A, B) :-
        c(A, woman, B).
n(node(n, [[man]], []), A, B) :-
        c(A, man, B).
n(node(n, [[apple]], []), A, B) :-
        c(A, apple, B).

:- dynamic det/3.

det(node(det, [[the]], []), A, B) :-
        c(A, the, B).

:- dynamic v/3.

v(node(v, [[loves]], []), A, B) :-
        c(A, loves, B).
v(node(v, [[eats]], []), A, B) :-
        c(A, eats, B).
v(node(v, [[sings]], []), A, B) :-
        c(A, sings, B).

Yes
?-

The predicate dctg_reconsult(File) is used to translate a DCTG grammar into Prolog clauses and load them; it is provided by [Abramson/Dahl/Paine 1990] and is available from a variety of sources on the net.[8]

A short terminal session should make the nature of the results a bit clearer:[9]

?- s(S,[john,loves,mary],[]), write(S).
node(s, 
     [node(np, 
           [node(n, [[john]], [])], 
           []), 
      node(vp, 
           [node(v, [[loves]], []), 
            node(np, 
                 [node(n, [[mary]], [])], 
                 [])], 
           [])], 
     [])

S = node(s, [node(np, [node(n, [[john]], [])], []), 
node(vp, [node(v, [[loves]], []), node(np, [node(n, 
[...], [])], [])], [])], []) 

Yes
?- s(S,[the,woman,eats,the,apple],[]), write(S).
node(s, 
     [node(np, 
           [node(det, [[the]], []), 
            node(n, [[woman]], [])], 
           []), 
      node(vp, 
           [node(v, [[eats]], []), 
            node(np, 
                 [node(det, [[the]], []), 
                  node(n, [[apple]], [])], 
                 [])], 
           [])], 
     [])

S = node(s, [node(np, [node(det, [[the]], []), node(n, 
[[woman]], [])], []), node(vp, [node(v, [[eats]], []), 
node(np, [node(det, [...], []), node(..., ..., ...)], 
[])], [])], []) 

Yes
?- s(S,[the,man,sings],[]), write(S).
node(s, 
     [node(np, 
           [node(det, [[the]], []), 
            node(n, [[man]], [])], 
           []), 
      node(vp, 
           [node(v, [[sings]], [])], 
           [])], 
     [])

S = node(s, [node(np, [node(det, [[the]], []), 
node(n, [[man]], [])], []), node(vp, [node(v, 
[[sings]], [])], [])], []) 

Yes
?-

The node structure constructed in the first added argument resembles and serves much the same purpose as the structure attribute used in E2, the attribute-grammar version of the English grammar fragment.

4.3. English grammar with attributes (E2) in DCTG notation

The fragment of English grammar E2, which was presented earlier to illustrate the use of DCGs for attribute grammars, may also be used to illustrate DCTGs.

Let's walk through the grammar. A sentence is an NP followed by a VP; the goals enclosed in braces (S^^number(Num), P^^number(Num)) expresses the constraint that the NP and VP must agree in number (the number attribute of the NP S and the number attribute of the VP P must unify with each other).

< 19 English grammar fragment with attributes [File ks92.dctg] > ≡

s ::= np^^S, vp^^P,
  { S^^number(Num), P^^number(Num) }
  <:>
  {Attributes for non-terminal s 20}

Continued in < NP rules 21 > , < VP rules 26 > , < Pre-terminal rules 29 > , < Lexicon 30 >

The non-terminal s has only one grammatical attribute, which we call structure. When s is made up (as here) of an NP and a VP, we represent its structure by a Prolog term with s as its functor and the structure of the NP and VP as its two arguments:

< 20 Attributes for non-terminal s > ≡

  structure(s(Sstr,Pstr)) ::- 
    S^^structure(Sstr), 
    P^^structure(Pstr).

This code is used in < English grammar fragment with attributes 19 >

A noun phrase (NP) is made up of a determiner and a noun; they must agree in number. This covers phrases like "the apple", "the apples", "one apple", "some apples". The agreement rule excludes the phrases "one apples" and "some apple".[10]

< 21 NP rules [continues 19 English grammar fragment with attributes] > ≡

np ::= det^^D, n^^N,
  { D^^number(Num), N^^number(Num) }
  <:>
  {NP structure attribute (for DET+N) 22}

  &&
  {NP number attribute (for DET+N) 23}

Continued in < NP rules, cont'd 24 > , < NP rules, cont'd 25 >

The non-terminal np has two attributes, named structure and number.

The structure of the NP is a Prolog term with the name of the non-terminal (np) as its functor and the constituents as the arguments. (This is the pattern of the structure attribute on all non-terminals, and I won't comment on it again.)

< 22 NP structure attribute (for DET+N) > ≡

  structure(np(Dstr,Nstr)) ::-
    D^^structure(Dstr),
    N^^structure(Nstr)

This code is used in < NP rules 21 >

The number attribute of the NP illustrates an important idiom: the guard in the syntactic part of the rule has already checked the number attributes of the determiner and the noun, to make sure they unify with each other; the variable Num has the value sg or pl already, and we don't need to do any more computation. We just say that whatever Num is, that's the grammatical number of the NP.

< 23 NP number attribute (for DET+N) > ≡

  number(Num).

This code is used in < NP rules 21 >

Noun phrases can also take the form of a plural noun by itself, as in "Men love apples".

< 24 NP rules, cont'd [continues 21 NP rules] > ≡

np ::= n^^N, { N^^number(pl) }
  <:>
  structure(np(Nstr)) ::-
    N^^structure(Nstr)
  &&
  number(pl).

The final form of NP recognized by this grammar is a singular proper noun by itself, as in "John loves Mary".

< 25 NP rules, cont'd [continues 21 NP rules] > ≡

np ::= pn^^N, { N^^number(sg) }
  <:>
  structure(np(Nstr)) ::-
    N^^structure(Nstr)
  &&
  number(sg).

A verb phrase (VP) can include a direct object in the form of a noun phrase:

< 26 VP rules [continues 19 English grammar fragment with attributes] > ≡

vp ::= v^^V, np^^O
  <:>
  structure(vp(Vs,Os)) ::-
    V^^structure(Vs),
    O^^structure(Os)
  &&
  {Number attribute for VP 28}

Continued in < VP rules, cont'd 27 >

or they can be just a verb with no direct object.[11]

< 27 VP rules, cont'd [continues 26 VP rules] > ≡

vp ::= v^^V
  <:>
  structure(vp(Vs)) ::-
    V^^structure(Vs)
  &&
  {Number attribute for VP 28}

Although both the verb and the direct object have a number attribute, only that of the verb counts in determining the value of the number attribute for the VP as a whole.

< 28 Number attribute for VP > ≡

  number(Num) ::- V^^number(Num).

This code is used in < VP rules 26 > < VP rules, cont'd 27 >

Nouns, proper nouns, verbs, and determiners (the `pre-terminal' categories of the grammar) all have rules of the same structure: a token in the string counts as one of these if the lexicon says it's one, and the number attribute has whatever value the lexicon gives.

< 29 Pre-terminal rules [continues 19 English grammar fragment with attributes] > ≡

n ::= [L], { lex(L,n,Num) }
  <:>
  structure(n(L))
  &&
  number(Num).

pn ::= [L], { lex(L,pn,Num) }
  <:>
  structure(pn(L))
  &&
  number(Num).

v ::= [L], { lex(L,v,Num) }
  <:>
  structure(v(L))
  &&
  number(Num).

det ::= [L], { lex(L,det,Num) }
  <:>
  structure(det(L))
  &&
  number(Num).

Finally, the lexicon. To keep things simple, the lexicon here is just a set of facts, with literal values.[12] The entry for the is the exception: it does not have a literal value, but the anonymous variable _ to indicate that the can be either sg or pl.

< 30 Lexicon [continues 19 English grammar fragment with attributes] > ≡

lex(mary,pn,sg).
lex(john,pn,sg).
lex(woman,n,sg).
lex(women,n,pl).
lex(man,n,sg).
lex(men,n,pl).
lex(apple,n,sg).
lex(apples,n,pl).
lex(the,det,_).
lex(some,det,pl).
lex(one,det,sg).
lex(loves,v,sg).
lex(love,v,pl).
lex(eats,v,sg).
lex(eat,v,pl).
lex(sings,v,sg).
lex(sing,v,pl).

A session log illustrates the structure built by the DCTG rules:

?- s(S,[john,loves,mary],[]), write(S).
node(s, 
     [node(np, 
           [node(pn, 
                 [[john]], 
                 [structure(pn(john)), 
                 (number(sg)::-lex(john, pn, sg))])], 
           [ (structure(np(_G292))::-
                node(pn, 
                     [[john]], 
                     [structure(pn(john)), 
                      (number(sg)::-lex(john, pn, sg))])^^structure(_G292)), 
            number(sg)]), 
      node(vp, 
           [node(v, 
                 [[loves]], 
                 [structure(v(loves)), 
                 (number(sg)::-lex(loves, v, sg))]), 
            node(np, 
                 [node(pn, 
                       [[mary]], 
                       [structure(pn(mary)), 
                       (number(sg)::-lex(mary, pn, sg))])], 
                 [(structure(np(_G424))::-
                     node(pn, [[mary]], 
                          [structure(pn(mary)), 
                          (number(sg)::-lex(mary, pn, sg))])
                     ^^structure(_G424)), 
                  number(sg)])], 
           [ (structure(vp(_G351, _G352))::-
                node(v, [[loves]], 
                     [structure(v(loves)), 
                     (number(sg)::-lex(loves, v, sg))])^^structure(_G351), 
                node(np, 
                     [node(pn, [[mary]], 
                           [structure(pn(mary)), 
                           (number(sg)::-lex(mary, pn, sg))])], 
                     [(structure(np(_G424))::-
                         node(pn, [[mary]], 
                              [structure(pn(mary)), 
                              (number(sg)::-lex(mary, pn, sg))])
                            ^^structure(_G424)), 
                      number(sg)])^^structure(_G352)), 
             (number(_G373)::-
                node(v, [[loves]], 
                     [structure(v(loves)), 
                     (number(sg)::-lex(loves, v, sg))])^^number(_G373))])], 
     [(structure(s(_G261, _G262))::-
         node(np, 
              [node(pn, [[john]], 
                    [structure(pn(john)), 
                    (number(sg)::-lex(john, pn, sg))])], 
              [(structure(np(_G292))::-
                  node(pn, [[john]], 
                       [structure(pn(john)), 
                       (number(sg)::-lex(john, pn, sg))])^^structure(_G292)), 
               number(sg)])^^structure(_G261), 
         node(vp, [node(v, [[loves]], 
                        [structure(v(loves)), 
                        (number(sg)::-lex(loves, v, sg))]), 
                   node(np, 
                        [node(pn, [[mary]], 
                              [structure(pn(mary)), 
                              (number(sg)::-lex(mary, pn, sg))])], 
                        [(structure(np(_G424))::-
                            node(pn, [[mary]], 
                                 [structure(pn(mary)), 
                                 (number(sg)::-lex(mary, pn, sg))])
                            ^^structure(_G424)), 
                         number(sg)])], 
                  [(structure(vp(_G351, _G352))::-
                      node(v, [[loves]], 
                           [structure(v(loves)), 
                           (number(sg)::-lex(loves, v, sg))])
                      ^^structure(_G351), 
                      node(np, [node(pn, [[mary]], 
                                     [structure(pn(mary)), 
                                     (number(sg)::-lex(mary, pn, sg))])], 
                               [(structure(np(_G424))::-
                                   node(pn, [[mary]], 
                                        [structure(pn(mary)), 
                                        (number(sg)::-lex(mary, pn, sg))])
                                   ^^structure(_G424)), 
                                number(sg)])^^structure(_G352)), 
                   (number(_G373)::-
                      node(v, [[loves]], 
                           [structure(v(loves)), 
                           (number(sg)::-lex(loves, v, sg))])
                      ^^number(_G373))])
         ^^structure(_G262))])

In examining the structure just shown, the reader will note a great deal of apparent repetition; this results from the high incidence of structure sharing, which is not shown explicitly: in the grammar itself, the variables which have child nodes as values are used quite freely and appear both in the list of children and in the rules for calculating synthetic attributes.

Note also that the values of the attributes are not always pre-calculated: instead, the structure has the rules necessary to perform the evaluation on demand. In the example, the structure attribute for the pre-terminal categories is already completely grounded: it has no variables, but only literal values, while the structure attributes for the higher-level parts of the linguistic structure still have unbound variables.

4.4. Layer 3: Translation into DCTG notation

First, translate the existing purchase-order schema into DCTG notation, without any new functionality.

The ...

4.5. Layer 4: PSVI properties

Add some simple properties to the input instance.

Test local validity (non-recursive).

4.6. Layer 4a: handling xsi:type

There are two approaches to handling xsi:type. We can have the base type parse its part, so we get layers and clear delineation between base and extension, or else we can parse the entire thing. I'll describe both, and implement plan B because the separation between base and extension is not exhibited in the PSVI as defined.

4.7. Layer 4b: wildcards with skip and lax processing

Show how to handle wildcards with skip, lax, and strict processing. The wildcards will have grammatical productions of their own; these grammars will follow rules different from those of normal non-wildcard particles.

4.8. Layer 4c: substitution groups

Substitution groups can be handed by defining multiple element rules with the same left-hand side. If a is in the substitution group of b, then we have both

b --> [element(b,Attributes,Content)],
  {
    b_atts(Attributes,[]),
    b_cont(Content,[])
  }.
a --> [element(a,Attributes,Content)],
  {
    a_atts(Attributes,[]),
    a_cont(Content,[])
  }.

and also

b --> [element(a,Attributes,Content)],
  {
    a_atts(Attributes,[]),
    a_cont(Content,[])
  }.

4.9. Layer 4d: error handling

A simple way to handle errors is to define a fall-back production which accepts pretty much anything, and forces the appropriate grammatical attribute to take the error value. If multiple fallbacks can be defined, each relaxing a different constraint, then we can easily identify which constraint was violated. This may require that we insert a cut into the normal rules, which seems unfortunate.

4.10. Layer 4e: validity and validation-attempted properties

Add grammatical attribute / PSVI property for overall validation outcome (local validity plus validity or otherwise of children, attributes).

4.11. Further work needed for complete representation

According to section 2.4 of [W3C 2001b], to be minimally conforming a translation of an XML Schema into DCTG notation must

implement the schema component constraints (see Appendix C.4 and subsection 6 of each component) I take this to mean that the schemas used by the processor must obey them; the processor may or may not independently enforce them. Proof obligation: prove that the style of DCTG outlined above does satisfy these constraints.
implement the validation rules (see Appendix C.1 and subsection 4 of each component). I take this to mean the processor must work correctly from its component representation to the values of the relevant PSVI instance properties. Proof obligation: prove that the style of DCTG outlined above does correctly implement the validation rules.
implement (i.e. make) the schema information set contributions (see Appendix C.2 and subsection 5 of each component). Proof obligation: make the parser decorate the input with appropriate grammatical attributes representing the relevant properties.

5. DCTGs and schema components

Translating schema documents into schema components (nodes with properties).

A processor may also accept schemas written in the XML transfer syntax; to make our processor do this, we need to meet the two obligations imposed on such processors: in addition to the core conformance requirements outlined above, they must

implement (enforce?) the Schema Representation Constraints (see Appendix C.3 and subsection 3 of each component) Proof obligation: provide routines which check the input schema documents against these constraints, and prove them correct.
follow the rules for mapping from schema documents to schema components. Proof obligation: provide routines which translate from the Prolog representation of an XML schema document into some representation of components; if this representation differs from that described above, provide further routines which map from this initial representation to the working representation described above.

6. A two-level system

We can combine the two applications of DCTGs we have just outlined to create a two-level system:

At one level, a DCTG for schema documents in XML Schema 1.0 notation parses the documents and builds a schema, that is a set of components, represented as Prolog structures in the format defined by DCTG.
This set of components can be translated into a DCTG for performing schema-validity assessment against documents, at the second level.

The only missing piece is the translation of schema components (in the format described in the preceding section) into a definite-clause translation grammar (of the form described in section 3, Adding PSVI properties to the output.

Translating schema components into DCTGs

A. Works cited and further reading

Abramson, Harvey. 1984. "Definite Clause Translation Grammars". Proceedings of the 1984 International Symposium on Logic Programming, Atlantic City, New Jersey, February 6-9, 1984, pp. 233-240. (IEEE-CS 1984, ISBN 0-8186-0522-7)

Abramson, Harvey, and Veronica Dahl. 1989. Logic Grammars. Symbolic Computation AI Series. Springer-Verlag, 1989.

Abramson, Harvey, and Veronica Dahl, rev. Jocelyn Paine. 1990. DCTG: Prolog definite clause translation grammar translator. (Prolog code for translating from DCTG notation to standard Prolog. Note says syntax extended slightly by Jocelyn Paine to accept && between specifications of grammatical attributes, to minimize need for parentheses. Available from numerous AI/NLP software repositotries, including http://www-2.cs.cmu.edu/afs/cs/project/ai-repository/ai/lang/prolog/code/syntax/dctg/0.html, http://www.ims.uni-stuttgart.de/ftp/pub/languages/prolog/libraries/imperial_college/dctg.tar.gz, and http://www.ifs.org.uk/~popx/prolog/dctg/.)

Alblas, Henk. 1991. "Introduction to attribute grammars". Attribute grammars, applications and systems: International Summer School SAGA, Prague, Czechoslovakia, June 4-13, 1991, Proceedings, pp. 1-15. Berlin: Springer, 1991. Lecture Notes in Computer Science, 545.

Bratko, Ivan. 1990. Prolog programming for artificial intelligence. Second edition. Wokingham: Addison-Wesley. xxi, 597 pp.

Brown, Allen L., Jr., and Howard A. Blair. 1990. "A logic grammar foundation for document representation and layout". In EP90: Proceedings of the International Conference on Electronic Publishing, Document Manipulation and Typography, ed. Richard Furuta. Cambridge: Cambridge University Press, 1990, pp. 47-64.

Brown, Allen L., Jr., Toshiro Wakayama, and Howard A. Blair. 1992. "A reconstruction of context-dependent document processing in SGML". In EP92: Proceedings of Electronic Publishing, 1992, ed. C. Vanoirbeek and G. Coray. Cambridge: Cambridge University Press, 1992. Pages 1-25.

Brüggemann-Klein, Anne. 1993. Formal models in document processing. Habilitationsschrift, Freiburg i.Br., 1993. 110 pp. Available at ftp://ftp.informatik.uni-freiburg.de/documents/papers/brueggem/habil.ps (Cover pages archival copy also at http://www.oasis-open.org/cover/bruggDissert-ps.gz).

[Brüggemann-Klein provides a formal definition of 1-unambiguity, which corresponds to the notion of unambiguity in ISO 8879 and determinism in XML 1.0. Her definition of 1-unambiguity can be used to check XML Schema's Unique Particle Attribution constraint by changing every minOccurs value greater than 1 to 1, and every maxOccurs value greater than 1 to unbounded.]

Clocksin, W. F., and C. S. Mellish. Programming in Prolog. Second edition. Berlin: Springer, 1984.

Gal, Annie, Guy Lapalme, Patrick Saint-Dizier, and Harold Somers. 1991. Prolog for natural language processing. Chichester: Wiley, 1991. xiii, 306 pp.

Gazdar, Gerald, and Chris Mellish. 1989. Natural language processing in PROLOG: An introduction to computational linguistics. Wokingham: Addison-Wesley, 1989. xv, 504 pp.

Grune, Dick, and Ceriel J. H. Jacobs. 1990. Parsing techniques: a practical guide. New York, London: Ellis Horwood, 1990. Postscript of the book is available from the first author's Web site at http://www.cs.vu.nl/~dick/PTAPG.html

Knuth, D. E. 1968. "Semantics of context-free languages". Mathematical Systems Theory 2: 127-145.

König, Esther, and Roland Seiffert. Grundkurs PROLOG für Linguisten. Tübingen: Francke, 1989. [= Uni-Taschenbücher 1525]

Stepney, Susan. High-integrity compilation. Prentice-Hall. Available from http://www-users.cs.york.ac.uk/~susan/bib/ss/hic/index.htm. Chapter 3 (Using Prolog) provides a terse introduction to DCTG notation and use.

[W3C 2001] "XML Schema Part 0: Primer", ed. David Fallside. W3C Recommendation, 2 May 2001. [Cambridge, Sophia-Antipolis, Tokyo: W3C] http://www.w3.org/TR/xmlschema-0/.

[W3C 2001] 2001. XML Schema Part 1: Structures, ed. Henry S. Thompson, David Beech, Murray Maloney, and Noah Mendelsohn. W3C Recommendation 2 May 2001. [Cambridge, Sophia-Antipolis, and Tokyo]: World Wide Web Consortium. http://www.w3.org/TR/2001/REC-xmlschema-1-20010502/

[W3C 2001] W3C. 2001. XML Schema Part 2: Datatypes, ed. Biron, Paul V. and Ashok Malhotra. W3C Recommendation 2 May 2001. [Cambridge, Sophia-Antipolis, and Tokyo]: World Wide Web Consortium. http://www.w3.org/TR/2001/REC-xmlschema-2-20010502/

Wielemaker, Jan. "SWI-Prolog SGML/XML parser: Version 1.0.14, March 2001". http://www.swi-prolog.org/packages/sgml2pl.html

For more information on the representation of SGML and XML documents as Prolog structures, see the SWI add-ons documentation [Wielemaker 2001]. Other representations are possible; I have used this one because it's convenient and because representing sibling sets in a list makes it easier to use DCGs and DCTGs.

Definite-clause grammars (DCGs) are introduced in almost any good Prolog textbook: e.g. [Clocksin/Mellish 1984], [Bratko 1990]. They are discussed at somewhat greater length in treatments of Prolog for natural-language processing, including [König/Seiffert 1989], [Gazdar/Mellish 1989], and [Gal et al. 1991]. Most extended discussions show how to use additional arguments to record syntactic structure or handle the semantics of the material.

Definite-clause translation grammars were introduced as a way of making it easier to handle semantics; they provide explicit names for attributes (in the sense of attribute grammars [Knuth 1968]).

B. To do

get a layer 1 grammar working (done 2002-09-13, Denver International Airport)
add layer 2 (done 2002-09-13)
translate to DCTG
make an XSLT stylesheet to do the DCTG generation from a single schema document (?)
layer 3
layer 4

Think about schema composition. N.B. the SWI http_open library can be used to open streams for Web-accessible material, like this:

['d:/usr/lib/prolog/http_open']. % load the library
http_open('http://www.w3.org/2001/XMLSchema.xsd',XSDStream), 
 load_xml_file(stream(XSDStream),XSD), 
 close(XSDStream).

¿ make explicit grammatical attributes for the input infoset properties, as well as those added in the PSVI? (Need to learn a bit more about how SWI Prolog presents this information.)
- Attribute Information Items:
  - [local name]
  - [namespace name]
  - [normalized value]
- Character Information Items:
  - [character code]
- Element Information Items:
  - [local name]
  - [namespace name]
  - [children]
  - [attributes]
  - [in-scope namespaces] or [namespace attributes]
- Namespace Information Items:
  - [prefix]
  - [namespace name]
Write a basic DCTG for schema documents (translation of the schema for schemas). (Write XSLT stylesheet and/or Prolog code to do this job.)
Elaborate the schema-document DCTG to check all required constraints on XML representation of schemas.
Elaborate the schema-document DCTG to provide simple representations of schema components; prove that the mapping follows the rules. Checking unique particle attribution rule can be done by transcribing Anne Brüggemann-Klein's algorithm in [Brüggemann-Klein 1993]. For extra credit, write a diagnostic routine which fires when a non-deterministic model is provided, and proposes a deterministic alternative. Warn the user if this loses any appinfo information.
Add routines to go from the representations of schema components into DCTG notation described in section 3.
Show how (add routines to?) perform schema composition: import, include, redefine.
Go back and fulfil the proof obligations relating to the DCTG representation of schemas.