Notes on logic grammars and XML Schema

A working paper prepared for the W3C XML Schema Working Group

C. M. Sperberg-McQueen

1 December 2003

N.B. not complete: work in progress



This document describes some possible applications of logic grammars to schema processing as described in the XML Schema specification. The term logic grammar is used to denote grammars written in logic-programming systems; this paper works with definite-clause grammars (DCGs), which are a built-in part of most Prolog systems, and definite-clause translation grammars (DCTGs), which employ a similar formalism which more closely resembles attribute grammars as described by [Knuth 1968] and later writers and which make it a bit easier to handle complex specifications. Both DCGs and DCTGs can be regarded as syntactic sugar for straight Prolog; before execution, both notations are translated into Prolog clauses in the usual notation.
In its current state, this paper is unfinished: it is more or less complete through the description of the layer-3 grammar, but layer 4 and the topics which follow it are still fragmentary. Some highlighted notes about what needs to be done to complete the paper are included. The author hopes the work is sketched out in sufficient detail to allow the reviewers to form a solid opinion on it.

1. Introduction

1.1. Background

It is often observed that the needs of precision and completeness in technical specifications are sometimes at odds with the goal of making the specifications easy to read and understand. It is sometimes suggested (e.g. by [Wadler 2000]) that specifications would be less ambiguous, more compact, and easier to understand if they were written not in natural-language prose but in a formal notation, for example in Prolog or in Z or in the Gentzen calculus or in some formalism developed for the purpose. Some propose not to eliminate the prose but to augment it with a formal expression of the same rules. This is a special case of a practice sometimes seriously recommended for developing national or international standards: draft the specification in two languages at once (e.g. English and French) as a way of exposing ambiguity, vagueness, and misunderstandings early and forcing the working group to achieve greater clarity and precision in their work.
If the second language used is a formal language, rather than a second natural language, this practice can have the same advantages, as well as the the added benefit that the resulting specification can be read and understood not only by human readers but also by machines. If the notation used for the specification can be directly interpreted by a machine, then the specification can be tested much more readily for completeness, consistency, and accurate reflection of the developers' design decisions; when a specification can be directly interpreted by machine, the specification itself becomes its own reference implementation.
The work described below may be regarded, in part, as a test of these propositions. I show how the definitions and declarations of a schema expressed in XML Schema can be represented using a formal notation developed for writing attribute grammars in Prolog; in follow-on work I expect to provide Prolog equivalents for the validation rules, info-set contributions, and constraints on schemas given in rather formal English prose by [W3C 2001b] and [W3C 2001c]. If the arguments for formalizing technical specifications are correct, the Prolog representations ought (if they are properly done) to be less ambiguous, more compact, and easier to understand that the formal English of the specification. The lack of ambiguity is guaranteed, as long as the Prolog is syntactically correct. The compactness can easily be measured; for the schema used here as an example, the Prolog is in fact not much more compact than the XML representation of the schema. Whether the Prolog representation is clearer or not is something which individual readers must decide for themselves.[1]

1.2. Direct interpretation (animation) of specifications

The idea of direct interpretation of formal specifications is not new; many textbooks on formal methods or on specific formal methods (Z, the Vienna Development Method VDM, etc.) discuss it at least as an idea. [Stepney 1993] provides a simple Web-accessible example of the idea: she illustrates the creation of a high-integrity compiler by
  • specifying the semantics (both dynamic and static) of a procedural programming language formally, in Z
  • specifying the semantics of a target machine language (or: assembly language) formally, also in Z
  • specifying the translation from source language to machine language formally
  • specifying the syntax of the programming language in a definite-clause translation grammar
  • translating the static semantics, dynamic semantics, and the source-to-machine language translation as grammatical attributes[2] in the DCTG. A Prolog system asked to supply the dynamic semantics of a program fragment serves as an interpreter for the source language; a Prolog system asked to supply the translation semantics acts as a compiler for the source language.
Animation of a specification is probably easiest if the spec is in a formal language. As Stepney's book illustrates, it is still possible otherwise, but it involves both the animation of a formal specification and an interpretation of the spec (in the form of a translation into a more formal language); the latter requires careful checking.
This paper attempts to make plausible the claim that a similar approach can be used with the XML Schema specification, in order to provide a runnable XML Schema processor with a very close tie to the wording of the XML Schema specification. Separate papers will report on an attempt to make good on the claim by building an XML Schema processor using this approach; this paper will focus on the rationale and basic ideas, omitting many details.

1.3. Using logic grammars in schema processing

Any schema defines a set of trees, and can thus be modeled more or less plausibly by a grammar. Schemas defined using XML Schema 1.0 impose some constraints which are not conveniently represented by pure context-free grammars, and the process of schema-validity-assessment defined by the XML Schema 1.0 specification requires implementations to produce information that goes well beyond a yes/no answer to the question “is this tree a member of the set?” For both of these reasons, it is convenient to use a form of attribute grammar to model a schema; logic grammars are a convenient choice.
In the remainder of this paper, I introduce some basic ideas for using logic grammars as a way of animating the XML Schema specification / modeling XML Schema.
The first question that arises is:
  • How do we use logic grammars to model XML Schema processing?
To this, there are three answers:
  • Model the schema as a logic grammar; use it to parse document instances and generate the post-schema-validation infoset (PSVI).
  • Move up one level of abstraction: model the schema for schemas as a logic grammar; use it to parse schema documents and generate the appropriate schema components.
  • Do both. This could be done by writing a utility to take a set of components (in the form generated by the logic grammar for schema documents) and generate a corresponding logic grammar (in the form needed for parsing other document instances).
Some problems arise if we use logic grammars to model schemas and perform schema-validity assessment by parsing a document using the logic grammar. These problems are attacked in a series of small examples in which a DCTG is developed which represents a simple schema:
  • In practice, the XML input document is likely to be presented to us in the form of a tree. How do we apply logic grammars to things that are already trees instead of sequences of tokens?
  • How do we check XML attributes?
  • How do we model type assignment, black-box validation, white-box validation?
  • How do we model the post-schema-validation information set?
Some problems arise if we use a logic grammar to model the rules for constructing schema components, and parse schema documents with that logic grammar as a way of modeling the process of constructing schema components.
  • How do we use parse-tree attributes to model information-item properties?
  • How do we use the output of this process to drive schema-validity assessement of document instances? Can we formulate the logic grammars used for schema-validity assessment as properties / attributes of the schema components?
In the next part of this paper (section 2) I introduce the definite-clause grammar notation, with some simple examples. A simplified representation of the purchase-order schema po1.xsd from the XML Schema tutorial [W3C 2001a] shows how a DCG can be used to support DTD-style validation in a straightforward way In section 3, I describe definite-clause translation grammars (DCTGs), which provide a slightly more convenient notation for complex attribute grammars than do DCGs. The patterns illustrated by the purchase-order schema are elaborated to handle various non-DTD constructs: post-schema-validation properties, error handling, substitution groups, xsi:type, and wildcards. This simple example shows how logic grammars can be applied as representations of schemas, but does not illustrate all aspects of schema validation.
In follow-on work reported in a separate paper, I plan to work systematically through the validation rules of [W3C 2001b] and [W3C 2001c], providing a more systematic representation of schema-validity assessment in logic-grammar terms. In a third paper, I will develop a grammar which reads XML Schema documents and enforces constraints on their XML representation and on the components derived from them, together with routines for compiling schema components into the logic-grammar representation sketched here.

2. Simple DTD-style document checking

We start with a simple case: considering only a subset of the kind of schema information captured in a document-type definition (DTD) in XML 1.0 DTD syntax, we write a logic grammar to represent a schema or DTD and use it to validate a document. The same technique, of course, can be used to capture and use part of the information in a schema expressed in XML Schema 1.0.

2.1. Definite-clause grammars

Before starting on the schema processor, however, it may be helpful to review the basics of definite-clause grammars and their notation. I'll present a simple grammar in two or three versions, to illustrate the use of DCG notation.
Readers already familiar with DCGs can skip this section without loss and continue with 2.2. Readers who don't find this introduction detailed enough may consult one of the Prolog books listed among the references.

2.1.1. Grammar E1: A trivial grammar for a fragment of English

Here is a simple DCG for a trivially small fragment of English, which I'll call E1. It is taken from example 8.1 in [König/Seiffert 1989], an introduction to Prolog for linguists:
%%% Grammar E1, a trivial fragment of English
s --> np, vp.  % A sentence (s) is a noun phrase (np) plus a verb phrase
np --> det, n. % A noun phrase is a determiner plus a noun
np --> n.      % ... or just a noun.
vp --> v, np.  % A verb phrase is a verb and its direct object, an np
vp --> v.      % ... or just the verb (for intransitives).
n --> [mary].  % 'mary', 'john', 'woman, 'man', 'apple' are nouns.
n --> [john].
n --> [woman].
n --> [man].
n --> [apple].
det --> [the]. % 'the' is a determiner
v --> [loves]. % 'loves', 'eats', and 'sings' are verbs.
v --> [eats].
v --> [sings].
This grammar generates sentences like “John loves Mary” and “The man sings” and “The woman eats the apple”, as well as the somewhat less plausible “The apple sings the Mary.”
When a Prolog system consults a file containing this grammar, the rules of the grammar are automatically translated into canonical Prolog clauses using difference lists:
?- listing([s,np,vp,n,det,v]).

s(A, B) :-
        np(A, C),
        vp(C, B).

np(A, B) :-
        det(A, C),
        n(C, B).
np(A, B) :-
        n(A, B).

vp(A, B) :-
        v(A, C),
        np(C, B).
vp(A, B) :-
        v(A, B).

n([mary|A], A).
n([john|A], A).
n([woman|A], A).
n([man|A], A).
n([apple|A], A).

det([the|A], A).

v([loves|A], A).
v([eats|A], A).
v([sings|A], A).

Yes
?- 
Difference lists are pairs of lists such that the second list is a suffix of the first list (i.e. the same from some point or other to the end), for example, the pair [a, b, c, d] and [c, d]. Prolog-based parsers commonly use difference lists to represent the input; they allow the Prolog clauses which recognize structures in the input list to consume as much of the input list (the first list in the pair) as they wish, and pass the rest of the input to the next clause. If a DCG predicate succeeds with its final two arguments unified to the difference-list pair [a, b, c, d] and [c, d], what it means is that the grammar rule recognized the list [a, b].
In the clause for s, we can see how one part of the parser picks up where another leaves off. The input list is A, and the np predicate recognizes a noun phrase at the beginning of A. The portion of the list which follows the noun phrase is assigned to C. The predicate vp then tries to recognize a verb phrase at the beginning of list C. What is left after the verb phrase is assigned, in turn, to list B, which is both ‘the part of the input list after the verb phrase’ (in the predicate vp) and (because the verb phrase can be the last item in a sentence) also ‘the part of the input after the s’ (in the predicate s).
For example, given the sentence [the, woman, eats, the, apple], the first rule for np will be called with its first argument bound to the entire input string; after it recognizes the initial noun phrase, the second argument will be bound to the part of the sentence which follows the noun phrase, namely [eats,the,apple]. The predicate vp will then recognize the verb phrase [eats,the,apple], leaving the empty list [] as the part of the input after the sentence.[3]

2.1.2. Calling the parser

The following fragment of a Prolog session log shows how the grammar can be used to test the grammaticality of sentences; it also shows that the grammar could use some refinement. To test whether a given sequence of tokens is an s, we call the Prolog predicate s with two difference-list arguments. If we specify the empty list as the second argument, we in effect require that the entire input sequence be a legal sentence.
?-  s([john,loves,mary],[]).

Yes
?- s([the,woman,eats,the,apple],[]).

Yes
?- s([the,man,sings],[]).

Yes
?- s([apple,eats,woman,the],[]).

No
?- s([apple,sings,the,mary],[]).

Yes
?- 
We can use a variable as a second argument, instead of the empty list; in this case, Prolog will tell us each prefix of the input list which can be parsed as a sentence according to our grammar:
?- s([the,woman,eats,the,apple],Remainder).

Remainder = [] ;

Remainder = [the, apple] ;

No
?- 
In one parse, the entire input is parsed as a sentence, and the remainder is the empty list. An alternative parse, however, is just to take [the,woman,eats] as a sentence, leaving [the,apple] as the remainder.
The XML Schema processor we sketch below will normally call the parser with an empty list as the final parameter, to require that the entire input sequence be consumed. If we wish to see how much of the content of an element belongs to the base type, and how much to an extension of the base type, we can call the parser for the base type, with a variable as the final parameter, and the base type will consume as much of the input as it can. The rest belongs to the extension.

2.1.3. Grammar E2: A simple attribute grammar

We can turn this simple context-free grammar E1 into an attribute grammar by adding parameters to the productions; I will call the attribute grammar E2. In E2, one parameter is used to generate a structural description of the sentence. In this notation, the sentence “John loves Mary” is assigned the structure s(np(pn(john)), vp(v(loves), np(pn(mary)))) — or, with pretty-printing:
s(
  np(
     pn(john)
    ), 
  vp(
     v(loves), 
     np(
        pn(mary)
       )
    )
 )
On some productions, a second parameter is used to enforce agreement in number between subject and verb and between determiner and noun.
Thus extended with parameters, and a few more words in the lexicon, the grammar looks like this:
%%% E2: a trivial attribute grammar for a fragment of
%%% English, with a synthesized attribute for structural description
%%% and enforcement of number agreement.

s(s(S,P)) --> np(S,Num), vp(P,Num).
np(np(D,N),Num) --> det(D,Num), n(N,Num).
np(np(N),pl) --> n(N,pl).
np(np(N),sg) --> pn(N,sg).
vp(vp(V,O),Num) --> v(V,Num), np(O,_).
vp(vp(V),Num) --> v(V,Num).
n(n(L),Num) --> [L], { lex(L,n,Num) }.
pn(pn(L),Num) --> [L], { lex(L,pn,Num) }.
v(v(L),Num) --> [L], { lex(L,v,Num) }.
det(det(L),Num) --> [L], { lex(L,det,Num) }.

lex(mary,pn,sg).
lex(john,pn,sg).
lex(woman,n,sg).
lex(women,n,pl).
lex(man,n,sg).
lex(men,n,pl).
lex(apple,n,sg).
lex(apples,n,pl).
lex(the,det,_).
lex(some,det,pl).
lex(one,det,sg).
lex(loves,v,sg).
lex(love,v,pl).
lex(eats,v,sg).
lex(eat,v,pl).
lex(sings,v,sg).
lex(sing,v,pl).
In this version of the grammar, it is worth calling the reader's attention to the Prolog clauses enclosed in braces on the right-hand side of some rules, e.g. in
n(n(L),Num) --> [L], { lex(L,n,Num)}.
These Prolog clauses express constraints on the grammatical construction which are not themselves grammatical constituents. They are sometimes called ‘guards’ (e.g. by [Brown/Blair 1990]); they correspond directly to the ‘semantic conditions’ which are a constituent part of attribute grammars as described by [Alblas 1991]. With the guard, this rule can be paraphrased as saying “A noun (an n), has the structure n(L) and the grammatical number Num, if (a) L (for ‘lexical form’) is a single item in the input list [L], and (b) the relation lex contains the triple (L,n,Num).”
Such guards will be present in many of the rules we use to represent XML Schema in logic grammars.
By calling the grammar rules, we can instantiate a variable with the structural description of the sentence.
?- s(Struc,[john,loves,mary],[]).

Struc = s(np(pn(john)), vp(v(loves), np(pn(mary)))) 

Yes
?- s(S,[the,woman,eats,the,apple],[]).

S = s(np(det(the), n(woman)), vp(v(eats), np(det(the), n(apple)))) 

Yes
?- s(S,[the,man,sings],[]).

S = s(np(det(the), n(man)), vp(v(sings))) 

Yes
?- s(S,[apple,sings,the,mary],[]).

No
?- s(S,[the,apple,sings,mary],[]).

S = s(np(det(the), n(apple)), vp(v(sings), np(pn(mary)))) 

Yes
?- 
Note that in a ‘conventional’ use of DCGs, the user or application will call the grammar just once, at the top level, and the parser itself will generate recursive calls for the nested grammatical constructs (e.g. np and vp in our example). Note, however, that we can if we wish make calls to any level of the grammar from arbitrary Prolog code. Note also that we can include arbitrary Prolog code in the right-hand side of rules by wrapping it in braces. By doing so, we can make arbitrarily complex calculations part of the grammar rules; this is one reason that DCGs are not limited to context-free languages.

2.2. Representing XML in Prolog

The SWI-Prolog system includes an SGML/XML parser which conveniently makes SGML and XML documents accessible to Prolog manipulation [Wielemaker 2001].
The parser returns the XML document to the user as a list of items,[4] each item one of:
  • an atom (for character data)
  • a structure of the form element(GI,AttributeList,ContentList), where
    • GI is an atom representing the element's generic identifier
    • AttributeList is a list containing the element's attribute-value specifications in the form Attributename=Value, with both attribute names and attribute values represented as Prolog atoms. Like all Prolog lists, this one is comma-separated, so a list as a whole might look like [id=frobnitzer,rend=blort,type=farble]
    • ContentList is a list of items (i.e. atoms representing character data, element(Gi,A,C) structures, entity(N) structures, etc.
  • a structure of the form entity(Integer) representing an otherwise unrepresentable character
  • a structure of the form entity(Name) representing an otherwise unrepresentable character for which a named entity is defined
  • a structure of the form pi(Text) representing a processing instruction
For example, the sample purchase order po1.xml used as an example in [W3C 2001a] looks like this in the format described (I have added white space for legibility; the single quotation marks around some atoms are a way of ensuring that they can be read again by a Prolog implementation and recognized as atoms):
[element('http://www.example.com/PO1':purchaseOrder, 
  [xmlns:apo='http://www.example.com/PO1', 
   orderDate='1999-10-20'], 
  [element(shipTo, 
    [country='US'], 
    [element(name, [], ['Alice Smith']), 
     element(street, [], ['123 Maple Street']), 
     element(city, [], ['Mill Valley']), 
     element(state, [], ['CA']), 
     element(zip, [], ['90952'])
    ]), 
   element(billTo, 
    [country='US'], 
    [element(name, [], ['Robert Smith']), 
     element(street, [], ['8 Oak Avenue']), 
     element(city, [], ['Old Town']), 
     element(state, [], ['PA']), 
     element(zip, [], ['95819'])
    ]), 
   element('http://www.example.com/PO1':comment, 
    [], 
    ['Hurry, my lawn is going wild!']
   ), 
   element(items, [], 
    [element(item, 
      [partNum='872-AA'], 
      [element(productName, [], ['Lawnmower']), 
       element(quantity, [], ['1']), 
       element('USPrice', [], ['148.95']), 
       element('http://www.example.com/PO1':comment, 
        [], 
        ['Confirm this is electric']
       )
      ]), 
     element(item, 
      [partNum='926-AA'], 
      [element(productName, [], ['Baby Monitor']), 
       element(quantity, [], ['1']), 
       element('USPrice', [], ['39.98']), 
       element(shipDate, [], ['1999-05-21'])
      ])
    ])
  ])
]
When elements and attributes are namespace-qualified, as the purchaseOrder and comment elements are in this example, the SWI-Prolog parser represents the name as a pair of two atoms, separated by a colon. To the right of the colon is the local name; depending on the option specified, the left-hand atom is either the namespace name (as shown here) or the namespace prefix used.
Other Prolog representations of XML are possible, of course, but we can use this one and it already exists.

2.3. A simple schema

Consider the simple schema defined for purchase orders in Part 0 of the XML Schema specification ([W3C 2001a]). Its complex types are relatively simple, involving only sequences of elements, some of which are optional, and some attributes. Some attributes and some elements are defined with named and some with anonymous simple types. This schema will serve admirably to illustrate the translation of schemas to logic grammars. For simplicity of exposition, the translation will be divided into several layers:
  1. translation into DCG form
  2. checking for duplicate and missing attributes, validating attribute types
  3. translation into DCTG form and supplying PSVI properties
  4. supplying properties related to validity, handling invalid input
Some features of XML Schema (handling xsi:type, supporting wildcards, supporting substitution groups) won't be shown here; the purchase-order schema doesn't have any wildcards or substitution groups, and it doesn't have much scope for the use of xsi:type in document instances. But it will be possible, after developing the example, to explain how these features can be handled in logic grammars.

2.4. Layer 1: Translation into definite-clause grammar

Each element type in the language is translated into three sets of DCG productions:
  • a top-level rule to match the tag and get the contents
  • a set of rules for checking the element's attributes
  • a set of rules defining the element's content; this is the part that looks most like a conventional DCG

2.4.1. Top-level rules for element types

For each element type in the schema, we create one rule in the grammar, which will serve to match the start-tag, get the attributes and contents of the element, and call routines to check the attributes and content against the complex type. Here there is exactly one such rule for each element type, but in later examples we will have several, in order to handle substitution groups. A simple version of these top-level rules would look like this:[5]
< 1 Rules for elements with complex types [File podcg3.pl] > ≡
/* Definite-clause grammar for the XML Schema represented in po.xsd.
 * 
 * This DCG was generated by a literate programming system; if
 * maintenance is necessary, make changes to the source (dctgnotes.xml),
 * not to this output file.
 */
purchaseOrder --> [element('http://www.example.com/PO1':purchaseOrder,
    Attributes,Content)],
  {
    purchaseOrderType_atts(Attributes,[]),
    purchaseOrderType_cont(Content,[])
  }.
shipTo --> [element(shipTo,Attributes,Content)],
  {
    usAddress_atts(Attributes,[]),
    usAddress_cont(Content,[])
  }.
billTo --> [element(billTo,Attributes,Content)],
  {
    usAddress_atts(Attributes,[]),
    usAddress_cont(Content,[])
  }.
items --> [element(items,Attributes,Content)],
  {
    t_items_atts(Attributes,[]),
    t_items_cont(Content,[])
  }.
item --> [element(item,Attributes,Content)],
  {
    t_item_atts(Attributes,[]),
    t_item_cont(Content,[])
  }.



Because we get the XML document already in tree form, the top-level rules don't have much work to do. In particular, they do not have to identify the substructure of the element or find its end; that has already been done. Instead, we get the attributes and content already separated out. We launch a separate call on each to parse them according to the appropriate rule for their datatype.[6]
The comment element and others associated with simple types pose an interesting question. We could define them following the pattern above, like this:
comment --> [element('http://www.example.com/PO1':comment,
    Attributes,Content)],
  {
    xsd_string_atts(Attributes,[]),
    xsd_string_cont(Content,[])
  }.
But we know that the xsd:string type does not have any attributes declared, so Attributes must be the empty list. We can therefore define these elements this way instead:
< 2 Rules for elements with simple types [File podcg3.pl] > ≡
comment --> [element('http://www.example.com/PO1':comment,
    [],Content)],
  {
    xsd_string_cont(Content,[])
  }.

poname --> [element(name,[],Content)], 
           { xsd_string_cont(Content,[]) }.
street --> [element(street,[],Content)], 
           { xsd_string_cont(Content,[]) }.
city   --> [element(city,[],Content)], 
           { xsd_string_cont(Content,[]) }.
state  --> [element(state,[],Content)], 
           { xsd_string_cont(Content,[]) }.
zip    --> [element(zip,[],Content)], 
           { xsd_decimal_cont(Content,[]) }.

productName --> [element(productName,[],Content)], 
                { xsd_string_cont(Content,[]) }.
quantity    --> [element(quantity,[],Content)], 
                { xsd_integer_cont(Content,[]) }.
'USPrice'   --> [element('USPrice',[],Content)], 
                { xsd_decimal_cont(Content,[]) }.
shipDate    --> [element(shipDate,[],Content)], 
                { xsd_date_cont(Content,[]) }.




2.4.2. Rules for attributes of complex types

For each complex type in the schema, we generate two rules, one we will use to check the attributes, and one to check the contents. For each simple type, we generate one rule for checking the value.
The attributes for a given element can be enumerated; in this first example, we check to make sure each attribute was declared.
< 3 Rules for attributes defined for complex types (v1) > ≡
purchaseOrderType_atts --> [].
purchaseOrderType_atts --> purchaseOrderType_att, purchaseOrderType_atts.
purchaseOrderType_att --> [orderDate=Date].

usAddress_atts --> [].
usAddress_atts --> usAddress_att, usAddress_atts.
usAddress_att --> [country='US']. /* note: fixed value! */

t_items_atts --> [].
t_items_atts --> t_items_att, t_items_atts.
t_items_att  --> no_such_attribute.
/* note that t_items_att is undefined, since there
 * are no attributes for the Items type 
 */

t_item_atts --> [].
t_item_atts --> t_item_att, t_item_atts.
t_item_att  --> [partNum=SKU]. 

This code is not used elsewhere.

For now, we do not worry about whether required attributes have been specified, or about checking the type of the attribute, or about supplying default values; we will do that later.

2.4.3. Rules for content of complex types

The most conventional-looking part of our grammar is the representation of the content models. A purchase order (being of type purchaseOrderType) has a ship-to address, a billing address, possibly a comment, and a list of items. An element with complex type USAddress has a name, street, city, state, and zip. We handle optional elements by writing a pair of recursive productions:
< 4 [File podcg3.pl] > ≡
purchaseOrderType_cont --> shipTo, billTo, opt_comment, items.
opt_comment --> [].
opt_comment --> comment, opt_comment.

usAddress_cont --> poname, street, city, state, zip.

t_items_cont --> star_item.
star_item    --> [].
star_item    --> item, star_item.

t_item_cont --> productName, quantity, 'USPrice', 
                opt_comment, opt_shipDate.

opt_shipDate --> [].
opt_shipDate --> shipDate.



It's worth noting that since the content models of XML all define regular languages, the right hand sides of the DCG rules for handling the content of elements are all rather simple. The only complexity is introduced by nested choices, occurrence indicators, and the like. In a way, this is the reflection of a fundamental fact about validating XML: the XML structure is given rather than needing to be discovered, and the rules for validation are correspondingly simpler.
A practical consequence of the fact that XML works with regular-right-part grammars while we are working with something more like standard Backus-Naur Form is that whenever we have recursive constructs like star_block, the parse tree described by the DCG does not correspond directly to the XML structure, and when we generate PSVI structures we are going to need to flatten the result to make it represent the XML structure properly.[7]
There is one other complication in practice: if an element has mixed content, we need to allow atoms representing character data between sub-elements; otherwise, we need to allow only atoms representing white space. For now, we step around this problem by instructing the upstream XML parser to strip whitespace from the beginning and ending of CDATA nodes; this does the job for this particular schema.[8] Later, we'll need to find a systematic way to hop over inter-element characters, and to restrict them to white space where appropriate.

2.4.4. Rules for checking values of simple types

We need some rules for checking simple types. The rules for built-in types of XML Schema, at least, can be written once for all and built in to a library. N.B. In order to get something working quickly, we cut some corners here.
In each case we write two rules, one to check the content of an element by checking to see whether the putative contents are a legal value, and the second to check the value itself. Within these latter rules, it is necessary to remember that the thing we are checking is a Prolog atom created from the lexical form for the putative value. In the case of strings, anything that's a legal lexical form (represented, following the conventions of the SWI-Prolog parser, as an atom) counts as a legal string.
< 5 Checking simple types [File podcg3.pl] > ≡
/* In our representation of XML, character data is represented
 * as atoms.  See note on handling of non-ASCII characters */
xsd_string_cont([H|T],T) :- xsd_string_value(H).
xsd_string_value(LF) :- atom(LF).



We check decimal numbers by converting the lexical form from an atom into a sequence of characters and then seeing whether Prolog can create a number out of that sequence of characters. This is actually a bit too broad, at least with the Prolog I am using, since it accepts numbers in scientific notation. But it will do for an illustration.
< 6 Checking numbers [File podcg3.pl] > ≡
/* We'll accept anything as a decimal number if we can convert
 * the atom to a list of character codes which can in turn be
 * converted to a number. 
 */
xsd_decimal_cont([H|T],T) :- xsd_decimal_value(H).
xsd_decimal_value(LF) :- atom_codes(LF,L), number_codes(N,L).
xsd_integer_cont([H|T],T) :- xsd_integer_value(H).
xsd_integer_value(LF) :- 
  atom_codes(LF,L), 
  number_codes(N,L),
  integer(N).




For dates, we really punt for now.
< 7 Checking dates [File podcg3.pl] > ≡
xsd_date_cont([H|T],T) :- xsd_date_value(H). 
xsd_date_value(LF) :- atom(LF). 
/* well, we really need to check this ... */



With this set of rules, we have a primitive but working program for checking whether a given XML document does or does not conform to the purchase order schema. There are a number of constraints it does not check, however; in the following sections of this document we will refine the code to make it a more persuasive representation of the schema.

2.5. Layer 2a: checking attribute types

One obvious improvement would be to check to make sure that the values of attributes are legitimate representations of the simple types with which the attributes are declared. Here is a revision of the relevant rules:
< 8 Rules for attributes defined for complex types (v2) [File podcg3.pl] > ≡
purchaseOrderType_atts --> [].
purchaseOrderType_atts --> purchaseOrderType_att, purchaseOrderType_atts.
purchaseOrderType_att --> [orderDate=Date],
  { xsd_date_value(Date) }.

usAddress_atts --> [].
usAddress_atts --> usAddress_att, usAddress_atts.
usAddress_att --> [country='US']. /* note: fixed value! */

t_items_atts --> [].
t_items_atts --> t_items_att, t_items_atts.
t_items_att  --> [this_will_never_match].
/* Since it has no equal sign or value, the rule for
 * t_items_att will never match anything.
 * Alternatively we could leave t_items_att undefined, since there
 * are no attributes for the Items type.
 */

t_item_atts --> [].
t_item_atts --> t_item_att, t_item_atts.
t_item_att  --> [partNum=SKU],
  { po_sku_value(SKU) }.



Hand-coding a rule to check SKU values is not too hard, and illustrates how we can check the lexical forms of simple types by writing mini-grammars for them. The SKU type is declared this way:
 <xsd:simpleType name="SKU">
  <xsd:restriction base="xsd:string">
   <xsd:pattern value="\d{3}-[A-Z]{2}"/>
  </xsd:restriction>
 </xsd:simpleType>
The pattern can be translated readily into the following grammar operating on the character sequence of the lexical form:
< 9 Checking SKU values [File podcg3.pl] > ≡
po_sku_value(LF) :- 
  atom_chars(LF,Charseq),
  sku_value(Charseq,[]).

sku_value --> sku_decimal_part, hyphen, sku_alpha_part.
sku_decimal_part --> digit, digit, digit.
sku_alpha_part --> cap_a_z, cap_a_z.




The mini-grammar for SKU values relies on the SWI-Prolog character classification predicate char_type, though it would be possible to implement rules for digit checking and letter checking in any Prolog system.
< 10 Character classification rules [File podcg3.pl] > ≡
digit --> [Char], { char_type(Char,digit) }.
hyphen --> ['-'].
cap_a_z --> [Char], { char_type(Char,upper) }.



2.6. Layer 2b: checking for duplicate, defaulted, and required attributes

Another obvious improvement would be to check to ensure that attributes are not specified more than once (strictly speaking, the upstream XML processor should be checking this, so this is probably not essential) and that required attributes do have values specified. We have already seen how to check a fixed value: we simply specify the value as part of the grammatical rule for the attribute.
In our sample schema, the only required attribute on any type is the partNum attribute on the item element; it is also the only attribute defined for items. In a case like this one, we could simply write the relevant part of the grammar this way:
t_item_atts --> [partNum=SKU], { po_sku_value(SKU) }.
Or we could even put the constraint into the top-level rule for the item element:
item --> [element(item,[partNum=SKU],Content)],
  {
    po_sku_value(SKU),
    t_item_cont(Content,[])
  }.
In the general case, however, it is going to be simpler to put checking for required, forbidden, and defaulted attributes into a separate rule, which we will here call t_item_att_check:
< 11 Checking for required, forbidden attributes > ≡
t_item_att_check(LAVS) :-
  atts_present(LAVS,[partNum]).

This code is not used elsewhere.

In the general case, we may have to check for required, forbidden, and defaulted attributes, and to merge defaulted attributes into the list of attribute-value specifications. So we can extend the general form of the foo_att_check predicate to:
< 12 Checking for required, forbidden attributes > ≡
t_item_att_check(LAVS,AugmentedLAVS) :-
  atts_present(LAVS,[partNum]),
  atts_absent(LAVS,[]),
  atts_defaulted(LAVS,[],AugmentedLAVS).

This code is not used elsewhere.

We assume here the existence of three special-purpose items for checking the presence or absence of items in lists of attribute-value specifications. A list of attribute-value specifications contains all the attributes in a list of required attributes, if the list of required attributes is empty.
< 13 Utilities for checking required attributes > ≡
atts_present(LAVS,[]).
Continued in <Utilities for checking required attributes 14>, <Utilities for checking required attributes 15>, <Utilities for checking forbidden attributes 16>, <Utilities for checking forbidden attributes 17>, <Utilities for checking defaulted attributes 18>
This code is not used elsewhere.

A list of attribute-value specifications contains all the attributes in a list of required attributes, if it contains the first one, and also all the others.
< 14 Utilities for checking required attributes [continues 13 Utilities for checking required attributes] > ≡
atts_present(LAVS,[HRA|RequiredTail]) :-
  att_present(LAVS,HRA),
  atts_present(LAVS,RequiredTail).



And finally, an attribute value specification is present for a given attribute name if it is either at the head of the list, or in the tail of the list.
< 15 Utilities for checking required attributes [continues 13 Utilities for checking required attributes] > ≡
att_present([Attname=Attval|Tail],Attname).
att_present([AVS|Tail],Attname) :-
  att_present(Tail,Attname).



Some readers will be looking, right about now, for a rule of the form
att_present([],Attname) :- ...
We don't have a rule for the empty list, though, because if we have run through the list of attribute-value specifications without finding one for the attribute name we are seeking, we want the predicate to fail. If necessary to prevent later maintenance programmers, or ourselves, from supplying the ‘missing’ induction basis by mistake, we could write:
att_present([],Attname) :- !, fail.
Now for forbidden attributes. (These would be originally optional attributes whose maxOccurs has been set to 0 in a refinement step.) If there are no attribute names in the list of forbidden attributes, then we are done and all is well.
< 16 Utilities for checking forbidden attributes [continues 13 Utilities for checking required attributes] > ≡
atts_absent(LAVS,[]).



If the list of forbidden attribute names is not empty, we check first the head, then (recursively) the tail.
< 17 Utilities for checking forbidden attributes [continues 13 Utilities for checking required attributes] > ≡
atts_absent(LAVS,[HFA|ForbiddenTail]) :-
  not(att_present(LAVS,HFA)),
  atts_absent(LAVS,ForbiddenTail).



Finally, for attributes with defaults we specify a list of attribute-value specifications; for each of these, we check to see whether a value is already specified for an attribute of that name. If it is, we skip the attribute in question; if not, we add it to the list of attribute-value specifications. Either way, we call the predicate recursively to take care of the rest of the defaulted attributes.
< 18 Utilities for checking defaulted attributes [continues 13 Utilities for checking required attributes] > ≡

atts_defaulted(LAVS,[],LAVS).
atts_defaulted(LAVS,[AN=AV|Tail],AugmentedLAVS) :-
  att_present(LAVS,AN),
  atts_defaulted(LAVS,Tail,AugmentedLAVS).
atts_defaulted(LAVS,[AN=AV|Tail],[AN=AV|AugmentedLAVS]) :-
  not(att_present(LAVS,AN)),
  atts_defaulted(LAVS,Tail,AugmentedLAVS).



2.7. Evaluation

So far, we have shown that given a relatively simple schema, it is possible to create a definite-clause grammar which checks the salient constraints defined by the schema:
  • Each element is checked against the rules for its datatype.
  • Elements declared with a complex type are checked with regard to their attributes and their content.
    • Attributes are checked to see that they were declared and that their value is of the correct type.
    • Sets of attribute value specifications are checked to make sure that required attributes have been specified, and forbidden attributes have not. Default values can be supplied for attributes which have them.
    • The regular language defined in a content model is translated into a set of definite-clause grammar productions; any legal sequence of children will be recognized, and any illegal sequence of children will fail to be recognized.
    • Elements declared with a simple type are checked to ensure that their content is a legal value of the simple type.
Running a lightly modified version of this grammar over a set of test cases based on po1.xml, the grammar accepted a number of valid variations of the po1.xml example, with
  • additional attributes (namespace declarations, xsi attributes)
  • omission or inclusion of optional elements
  • omission or inclusion of optional attributes
Modification of the grammar was necessary because the grammar as shown does not handle namespace declarations or attributes in the xsi namespace; since all the test files do have namespace declarations, the following rules had to be added to the grammar (I haven't added them above for fear it would clog up the exposition):
purchaseOrderType_att --> magic_att.
usAddress_att --> magic_att.
t_items_att  --> magic_att.
t_item_att  --> magic_att.

magic_att --> [xmlns=NS].
magic_att --> [xmlns:Pre=NS].
magic_att --> ['http://www.w3.org/2001/XMLSchema-instance':Localname=Value].
The grammar detected a number of errors:
  • misspelled generic identifiers and attribute names
  • omission of required elements
  • excess occurrences of elements
  • elements out of sequence
  • extraneous undeclared elements present
  • value other than the prescribed value for a fixed attribute
The following errors were not detected:
  • multiple po:comment elements (the schema allows at most one at the purchaseOrder level and at most one in each item)
        <apo:comment>Hurry, my lawn is going wild!</apo:comment>
        <apo:comment>I don't know how much longer I can hold out.</apo:comment>
    
    This is an error in the grammar above, which has been retained to illustrate the risks of hand-translation. Instead of
    opt_comment --> [].
    opt_comment --> comment, opt_comment.
    
    (which allows arbitrarily many comments), the rule for the optional comment element should read
    opt_comment --> [].
    opt_comment --> comment.
    
  • two orderDate attributes
    <apo:purchaseOrder xmlns:apo="http://www.example.com/PO1"
                       orderDate="1999-10-19"
                       orderDate="1999-10-20"> 
    
    (this can be blamed on the upstream processor, which should probably have rejected the document without producing an infoset, since single occurrence of attribute names is a well-formedness constraint)
  • quantity exceeding the maximum legal value
            <item partNum="926-AA">
                <productName>Baby Monitor</productName>
                <quantity>100</quantity>
                <USPrice>39.98</USPrice>
                <shipDate>1999-05-21</shipDate>
            </item>
    
  • omission of the required partNum attribute
            <item>
                <productName>Baby Monitor</productName>
                <quantity>1</quantity>
                <USPrice>39.98</USPrice>
                <shipDate>1999-05-21</shipDate>
            </item>
    
The following errors were detected, or at least the document was not accepted as valid, but validating the document caused a Prolog error because the number_chars predicate used to validate numbers is apparently not robust against non-numeric arguments.
  • invalid zip code
    <zip>OX2 6NN</zip>
  • wrong datatype for quantity
    <quantity>one</quantity>
  • bad value for USPrice
    <USPrice>USD 39.98</USPrice>
See A note on the test cases below for further details on the test documents.
The example we have been developing does not illustrate all forms of content models, occurrence indicators, or simple types, but I think it's obvious how to extend the treatment given here.
There are several gaps in coverage, which we will make good in what follows:
  • handling wildcards
  • handling character data intermingled with mixed content models, handling whitespace intermingled with non-mixed content models
  • handling substitution groups
  • handling xsi:type specifications in the document instance
  • diagnosing errors (i.e. saying something more useful than “no” or “ERROR: number_chars/2: Syntax error: Illegal number” when an error is encountered) and recovering from them
  • providing relevant PSVI properties
It will be simplest to show how to address these gaps if we handle the last item first, and make our parser start returning more than a yes/no answer to the question “is this document/element/lexical form ok?” As shown by the examples used to introduce DCG notation, we could do that by adding parameters to the productions in the grammar; it will be more convenient, however, to translate our grammar from DCG form into a related form, that of definite-clause translation grammars. To that, the next part of this paper is dedicated.

3. Providing PSVI properties

To model schema-validity assessment properly, we need to provide more output: specifically, we need to provide information about the input document together with some additional properties (the schema infoset contributions). It's possible to do that in DCG notation, but it rapidly becomes cumbersome. We'll use DCTG notation instead; it was devised to handle grammatical attributes more conveniently than DCG, and to separate the semantics more effectively from the syntax [Abramson 1984].
On this foundation we will later be able to add more:
  • adding type name information: type name property
  • black-box (skip) validation
  • white-box (lax) validation, validation attempted property
  • error handling (falling back to lax), validity property
Note that this section is not intended to be a complete translation of XML Schema into DCTG, but a sample small enough to follow and large enough to make a persuasive case that all of XML Schema can be translated.

3.1. Introduction to DCTG notation

DCTG notation was invented to make it easier to calculate values for grammatical attributes and to refer to the values of grammatical attributes on other rules.
Here is a simple example. Consider the following grammar for binary strings:
bit ::= '0'.
bit ::= '1'.
bitstring ::= '' /* nothing */.
bitstring ::= bit, bitstring.
number ::= bitstring, fraction.
fraction ::= '.', bitstring.
fraction ::= ''.
We might wish to calculate the length and the (unsigned base-two) value of the bitstring as attributes. Using a yacc-like notation that might look like this. Notice that scale is a top-down attribute and value and fractionalvalue are bottom-up attributes.
bit ::= '0' { $0.bitvalue = 0; }.
bit ::= '1' { $0.bitvalue = power(2,$0.scale); }.
bitstring ::= '' { 
  $0.value = 0; 
  $0.length = 0;
  /* scale doesn't matter here */
}.
bitstring ::= bit, bitstring {
  $0.length = $2.length + 1;
  $1.scale = $0.scale;
  $2.scale = $0.scale - 1;
  $0.value = $1.value + $2.value;
}.
number ::= bitstring, fraction {
  $1.scale = $1.length - 1;
  $0.value = $1.value + $2.fractionalvalue;
}.
fraction ::= '.', bitstring {
  $2.scale = -1;
  $0.fractionalvalue = $2.value;
}.
fraction ::= '' {
  $0.fractionalvalue = 0;
}.
In DCTG notation, this grammar looks like this:[9]
bit ::= [0]
  <:> bitval(0,_).

bit ::= [1]
  <:> bitval(V,Scale) ::- V is **(2,Scale).

bitstring ::= []
  <:> length(0)
  &&  value(0,_).

bitstring ::= bit^^B, bitstring^^B1
  <:> length(Length) ::-
        B1 ^^ length(Length1),
        Length is Length1 + 1
  &&  value(Value,ScaleB) ::-
        B ^^ bitval(VB,ScaleB),
        S1 is ScaleB - 1,
        B1 ^^ value(V1,S1),
        Value is VB + V1.

number ::= bitstring ^^ B, fraction ^^ F
  <:> value(V) ::-
        B ^^ length(Length),
        S is Length-1,
        B ^^ value(VB,S),
        F ^^ fractional_value(VF),
        V is VB + VF.

fraction ::= ['.'], bitstring ^^ B
  <:> fractional_value(V) ::-
        S is -1,
        B ^^ value(V,S).

fraction ::= []
  <:> fractional_value(0).
As may be seen, each rule in DCTG consists of a left hand side, a right-hand side, and optionally a list of attributes. The right-hand side is separated from the left-hand side by the operator ::= and from the following list of attributes (if any) by the operator <:>. The expression on the right hand side is a sequence of non-terminal symbols, each optionally suffixed with the operator ^^ and a variable name (e.g. bitstring^^B). Each attribute is identified by a Prolog structure, e.g. value(V), whose functor is the name of the attribute and whose arguments are its values. The attribute structure may be followed by the operator ::- (designed to look a lot like the standard Prolog operator :- or ‘neck’) and a series of goals which will be satisfied when the attribute value is to be instantiated. In practice, these goals help to calculate the attribute values. Values of attributes attached to the items on the right-hand side may be referred to by the variable name associated with the item and the name of the attribute, joined by the operator ^^, as for example in B ^^ bitval(VB,ScaleB).
A partial EBNF grammar for DCTG notation[10] is:
grammar ::= rule*
rule    ::= lhs '::=' rhs ('<:>' att-spec ('&&' att-spec)*)?
lhs     ::= term
rhs     ::= term (',' term)*
attspec ::= compound-term ('::-' goal (',' goal)*)?
compound-term ::= ATOM '(' term (',' term)* ')'

3.2. Another example: the English grammar fragment E1

Here is another simple example: the trivial fragment of English grammar given above, translated directly into DCTG notation, looks like the following. The only difference from the DCG form is that the separator between the left- and right-hand side of each production is “::=” instead of “-->”.
%%% E1 (trivial context-free grammar for a fragment of English)
%%% in DCTG notation.
s ::= np, vp.
np ::= det, n.
np ::= n.
vp ::= v, np.
vp ::= v.
n ::= [mary].
n ::= [john].
n ::= [woman].
n ::= [man].
n ::= [apple].
det ::= [the].
v ::= [loves].
v ::= [eats].
v ::= [sings].
The translation into standard Prolog is similar to that used for DCGs, but instead of adding two arguments to each non-terminal, the DCTG translation adds three. The first additional argument is a Prolog structure with the functor node, described further below. The second and third are (like the two arguments added in DCG translations) difference lists.
The node structure has three arguments:
  1. the non-terminal of the grammar rule (here s, np, vp, etc.)
  2. a list of the node structures associated with the items on the right-hand side of the grammar rule
  3. a list of grammatical attributes (in this grammar, this list will be empty)
The translation of our trivial grammar into standard Prolog is thus:
?- dctg_reconsult('ks81dctg.pl').

Yes
?- listing([s,np,vp,n,det,v]).

:- dynamic s/3.

s(node(s, [A, B], []), C, D) :-
        np(A, C, E),
        vp(B, E, D).

:- dynamic np/3.

np(node(np, [A, B], []), C, D) :-
        det(A, C, E),
        n(B, E, D).
np(node(np, [A], []), B, C) :-
        n(A, B, C).

:- dynamic vp/3.

vp(node(vp, [A, B], []), C, D) :-
        v(A, C, E),
        np(B, E, D).
vp(node(vp, [A], []), B, C) :-
        v(A, B, C).

:- dynamic n/3.

n(node(n, [[mary]], []), A, B) :-
        c(A, mary, B).
n(node(n, [[john]], []), A, B) :-
        c(A, john, B).
n(node(n, [[woman]], []), A, B) :-
        c(A, woman, B).
n(node(n, [[man]], []), A, B) :-
        c(A, man, B).
n(node(n, [[apple]], []), A, B) :-
        c(A, apple, B).

:- dynamic det/3.

det(node(det, [[the]], []), A, B) :-
        c(A, the, B).

:- dynamic v/3.

v(node(v, [[loves]], []), A, B) :-
        c(A, loves, B).
v(node(v, [[eats]], []), A, B) :-
        c(A, eats, B).
v(node(v, [[sings]], []), A, B) :-
        c(A, sings, B).

Yes
?- 
The predicate dctg_reconsult(File) is used to translate a DCTG grammar into Prolog clauses and load them; it is provided by [Abramson/Dahl/Paine 1990] and is available from a variety of sources on the net.[11]
A short terminal session should make the nature of the results a bit clearer:[12]
?- s(S,[john,loves,mary],[]), write(S).
node(s, 
     [node(np, 
           [node(n, [[john]], [])], 
           []), 
      node(vp, 
           [node(v, [[loves]], []), 
            node(np, 
                 [node(n, [[mary]], [])], 
                 [])], 
           [])], 
     [])

S = node(s, [node(np, [node(n, [[john]], [])], []), 
node(vp, [node(v, [[loves]], []), node(np, [node(n, 
[...], [])], [])], [])], []) 

Yes
?- s(S,[the,woman,eats,the,apple],[]), write(S).
node(s, 
     [node(np, 
           [node(det, [[the]], []), 
            node(n, [[woman]], [])], 
           []), 
      node(vp, 
           [node(v, [[eats]], []), 
            node(np, 
                 [node(det, [[the]], []), 
                  node(n, [[apple]], [])], 
                 [])], 
           [])], 
     [])

S = node(s, [node(np, [node(det, [[the]], []), node(n, 
[[woman]], [])], []), node(vp, [node(v, [[eats]], []), 
node(np, [node(det, [...], []), node(..., ..., ...)], 
[])], [])], []) 

Yes
?- s(S,[the,man,sings],[]), write(S).
node(s, 
     [node(np, 
           [node(det, [[the]], []), 
            node(n, [[man]], [])], 
           []), 
      node(vp, 
           [node(v, [[sings]], [])], 
           [])], 
     [])

S = node(s, [node(np, [node(det, [[the]], []), 
node(n, [[man]], [])], []), node(vp, [node(v, 
[[sings]], [])], [])], []) 

Yes
?- 
The node structure constructed in the first added argument resembles and serves much the same purpose as the structure attribute used in E2, the attribute-grammar version of the English grammar fragment.

3.3. English grammar with attributes (E2) in DCTG notation

The fragment of English grammar E2, which was presented earlier to illustrate the use of DCGs for attribute grammars, may also be used to illustrate DCTGs.
Let's walk through the grammar. A sentence is an NP followed by a VP; we will call the NP S (‘subject’), and the VP we will call P (‘predicate’).
< 19 English grammar fragment with attributes [File ks92.dctg] > ≡
s ::= np^^S, vp^^P,
  { S^^number(Num), P^^number(Num) }
  <:> {Attributes for non-terminal s 20}

Continued in <NP rules 21>, <VP rules 26>, <Pre-terminal rules 29>, <Lexicon 30>


The goals enclosed in braces (S^^number(Num), P^^number(Num)) together express the constraint that the NP and VP must agree in number: the number attribute of the NP S and the number attribute of the VP P must unify with each other.
The non-terminal s has only one grammatical attribute; let us call it structure. When s is made up (as here) of an NP and a VP, we represent its structure by a Prolog term with s as its functor and the structure of the NP and VP as its two arguments:
< 20 Attributes for non-terminal s > ≡
  structure(s(Sstr,Pstr)) ::- 
    S^^structure(Sstr), 
    P^^structure(Pstr).

This code is used in < English grammar fragment with attributes 19 >

A noun phrase (NP) is made up of a determiner and a noun; they must agree in number. This covers phrases like “the apple”, “the apples”, “one apple”, “some apples”. The agreement rule excludes the phrases “one apples” and “some apple”.[13]
< 21 NP rules [continues 19 English grammar fragment with attributes] > ≡
np ::= det^^D, n^^N,
  { D^^number(Num), N^^number(Num) }
  <:> {NP structure attribute (for DET+N) 22}

  &&  {NP number attribute (for DET+N) 23}

Continued in <NP rules, cont'd 24>, <NP rules, cont'd 25>


The non-terminal np has two attributes, named structure and number.
The structure of the NP is a Prolog term with the name of the non-terminal (np) as its functor and the constituents as the arguments. This is the pattern of the structure attribute on all non-terminals, and I won't comment on it again.
< 22 NP structure attribute (for DET+N) > ≡
  structure(np(Dstr,Nstr)) ::-
    D^^structure(Dstr),
    N^^structure(Nstr)

This code is used in < NP rules 21 >

The number attribute of the NP illustrates an important idiom: the guard in the syntactic part of the rule has already checked the number attributes of the determiner and the noun, to make sure they unify with each other; the variable Num has the value sg or pl already, and we don't need to do any more computation. We just say that whatever Num is, that's the grammatical number of the NP.
< 23 NP number attribute (for DET+N) > ≡
  number(Num).

This code is used in < NP rules 21 >

Noun phrases can also take the form of a plural noun by itself, as in “Men love apples”.
< 24 NP rules, cont'd [continues 21 NP rules] > ≡
np ::= n^^N, { N^^number(pl) }
  <:> structure(np(Nstr)) ::-
        N^^structure(Nstr)
  &&  number(pl).



The final form of NP recognized by this grammar is a singular proper noun by itself, as in “John loves Mary”.
< 25 NP rules, cont'd [continues 21 NP rules] > ≡
np ::= pn^^N, { N^^number(sg) }
  <:> structure(np(Nstr)) ::-
        N^^structure(Nstr)
  &&  number(sg).



A verb phrase (VP) can include a direct object in the form of a noun phrase:
< 26 VP rules [continues 19 English grammar fragment with attributes] > ≡
vp ::= v^^V, np^^O
  <:> structure(vp(Vs,Os)) ::-
        V^^structure(Vs),
        O^^structure(Os)
  &&  {Number attribute for VP 28}

Continued in <VP rules, cont'd 27>


or they can be just a verb with no direct object.[14]
< 27 VP rules, cont'd [continues 26 VP rules] > ≡
vp ::= v^^V
  <:> structure(vp(Vs)) ::-
        V^^structure(Vs)
  &&  {Number attribute for VP 28}




Although both the verb and the direct object have a number attribute, only that of the verb counts in determining the value of the number attribute for the VP as a whole.
< 28 Number attribute for VP > ≡
  number(Num) ::- V^^number(Num).

This code is used in < VP rules 26 > < VP rules, cont'd 27 >

Nouns, proper nouns, verbs, and determiners (the ‘pre-terminal’ categories of the grammar) all have rules of the same structure: a token in the string counts as one of these if the lexicon says it's one, and the number attribute has whatever value the lexicon gives.
< 29 Pre-terminal rules [continues 19 English grammar fragment with attributes] > ≡
n ::= [L], { lex(L,n,Num) }
  <:> structure(n(L))
  &&  number(Num).

pn ::= [L], { lex(L,pn,Num) }
  <:> structure(pn(L))
  &&  number(Num).

v ::= [L], { lex(L,v,Num) }
  <:> structure(v(L))
  &&  number(Num).

det ::= [L], { lex(L,det,Num) }
  <:> structure(det(L))
  &&  number(Num).



Finally, the lexicon. To keep things simple, the lexicon here is just a set of facts, with literal values.[15] The entry for the word the is the exception: it does not have a literal value, but the anonymous variable _ to indicate that the can be either sg or pl.
< 30 Lexicon [continues 19 English grammar fragment with attributes] > ≡
lex(mary,pn,sg).
lex(john,pn,sg).
lex(woman,n,sg).
lex(women,n,pl).
lex(man,n,sg).
lex(men,n,pl).
lex(apple,n,sg).
lex(apples,n,pl).
lex(the,det,_).
lex(some,det,pl).
lex(one,det,sg).
lex(loves,v,sg).
lex(love,v,pl).
lex(eats,v,sg).
lex(eat,v,pl).
lex(sings,v,sg).
lex(sing,v,pl).



A session log illustrates the structure built by the DCTG rules:
?- s(S,[john,loves,mary],[]), write(S).
node(s, 
     [node(np, 
           [node(pn, 
                 [[john]], 
                 [structure(pn(john)), 
                 (number(sg)::-lex(john, pn, sg))])], 
           [ (structure(np(_G292))::-
                node(pn, 
                     [[john]], 
                     [structure(pn(john)), 
                      (number(sg)::-lex(john, pn, sg))])^^structure(_G292)), 
            number(sg)]), 
      node(vp, 
           [node(v, 
                 [[loves]], 
                 [structure(v(loves)), 
                 (number(sg)::-lex(loves, v, sg))]), 
            node(np, 
                 [node(pn, 
                       [[mary]], 
                       [structure(pn(mary)), 
                       (number(sg)::-lex(mary, pn, sg))])], 
                 [(structure(np(_G424))::-
                     node(pn, [[mary]], 
                          [structure(pn(mary)), 
                          (number(sg)::-lex(mary, pn, sg))])
                     ^^structure(_G424)), 
                  number(sg)])], 
           [ (structure(vp(_G351, _G352))::-
                node(v, [[loves]], 
                     [structure(v(loves)), 
                     (number(sg)::-lex(loves, v, sg))])^^structure(_G351), 
                node(np, 
                     [node(pn, [[mary]], 
                           [structure(pn(mary)), 
                           (number(sg)::-lex(mary, pn, sg))])], 
                     [(structure(np(_G424))::-
                         node(pn, [[mary]], 
                              [structure(pn(mary)), 
                              (number(sg)::-lex(mary, pn, sg))])
                            ^^structure(_G424)), 
                      number(sg)])^^structure(_G352)), 
             (number(_G373)::-
                node(v, [[loves]], 
                     [structure(v(loves)), 
                     (number(sg)::-lex(loves, v, sg))])^^number(_G373))])], 
     [(structure(s(_G261, _G262))::-
         node(np, 
              [node(pn, [[john]], 
                    [structure(pn(john)), 
                    (number(sg)::-lex(john, pn, sg))])], 
              [(structure(np(_G292))::-
                  node(pn, [[john]], 
                       [structure(pn(john)), 
                       (number(sg)::-lex(john, pn, sg))])^^structure(_G292)), 
               number(sg)])^^structure(_G261), 
         node(vp, [node(v, [[loves]], 
                        [structure(v(loves)), 
                        (number(sg)::-lex(loves, v, sg))]), 
                   node(np, 
                        [node(pn, [[mary]], 
                              [structure(pn(mary)), 
                              (number(sg)::-lex(mary, pn, sg))])], 
                        [(structure(np(_G424))::-
                            node(pn, [[mary]], 
                                 [structure(pn(mary)), 
                                 (number(sg)::-lex(mary, pn, sg))])
                            ^^structure(_G424)), 
                         number(sg)])], 
                  [(structure(vp(_G351, _G352))::-
                      node(v, [[loves]], 
                           [structure(v(loves)), 
                           (number(sg)::-lex(loves, v, sg))])
                      ^^structure(_G351), 
                      node(np, [node(pn, [[mary]], 
                                     [structure(pn(mary)), 
                                     (number(sg)::-lex(mary, pn, sg))])], 
                               [(structure(np(_G424))::-
                                   node(pn, [[mary]], 
                                        [structure(pn(mary)), 
                                        (number(sg)::-lex(mary, pn, sg))])
                                   ^^structure(_G424)), 
                                number(sg)])^^structure(_G352)), 
                   (number(_G373)::-
                      node(v, [[loves]], 
                           [structure(v(loves)), 
                           (number(sg)::-lex(loves, v, sg))])
                      ^^number(_G373))])
         ^^structure(_G262))])
In examining the structure just shown, the reader will note a great deal of apparent repetition; this results from the high incidence of structure sharing, which is not shown explicitly: in the grammar itself, the variables which have child nodes as values are used quite freely and appear both in the list of children and in the rules for calculating synthetic attributes.
Note also that the values of the attributes are not always pre-calculated: instead, the structure has the rules necessary to perform the evaluation on demand. In the example, the structure attribute for the pre-terminal categories is already completely grounded: it has no variables, but only literal values, while the structure attributes for the higher-level parts of the linguistic structure still have unbound variables.

3.4. Layer 3: Translation into DCTG notation

As a first step toward providing grammatical attributes with PSVI information, we will translate the existing purchase-order schema into DCTG notation, adding grammatical attributes corresponding to some basic information-set properties which are required to be in the input infoset:
  • for Attribute Information Items:
    • [local name]
    • [namespace name]
    • [normalized value]
  • for Element Information Items:
    • [local name]
    • [namespace name]
    • [children]
    • [attributes]
    • [in-scope namespaces] or [namespace attributes]
  • for Namespace Information Items:
    • [prefix]
    • [namespace name]
In principle, we ought perhaps also to provide properties for character information items:
  • for Character Information Items:
    • [character code]
but it seems extraordinarily inconvenient to use grammatical attributes for this purpose. The information involved is obviously present, and can be isolated by using the standard Prolog predicate atom_codes(Atom,String) (or atom_chars(Atom,CharList) and char_code(Atom,ASCII)).
Additionally, we will add some more interesting properties of the PSVI:
  • type definition name, namespace, anonymous, and type
  • schema specified (schema or infoset)
  • validation attempted (always full)
  • validity (always valid, because when the document is not valid, we fail)
Like the DCG version above, the DCTG version of the schema has several distinct kinds of rules:
  • element rules
  • attribute-list rules (for checking the attributes of a complex type)
  • content-model rules (for checking the content of a complex type)
  • simple-type checking rules
The following sections give these in DCTG notation.

3.4.1. Top-level rules for element types

An element rule will serve to match the start-tag and get the attributes and contents of each element; from it, we will call routines to check the attributes and content against the complex type. These differ from the DCG rules in two ways: when we call them, we must specify three arguments, not two, and we provide explicit grammatical attributes for infoset properties. The basic pattern is simple: for any element in namespace n with local name gi and complex type ct, we will construct an appropriate non-terminal symbol nt, and the element rule will look like this:
nt ::= [element(n:gi,Attributes,Content)],
  {
    ct_atts(A,NA,Attributes),
    ct_cont(C,Content,[])
  }
  <:> attributes(A) 
  && namespaceAttributes(NA)
  && children(C) 
  && localName(gi) 
  && namespacename(n)
  && type_definition_anonymous(Boolean)
  && type_definition_namespace(URI)
  && type_definition_name(NCName)
  && type_definition_type(complex)
  && validation_attempted(full)
  && validity(valid)
.
Later, we will add further grammatical attributes, and use values other than full and valid for invalid elements.
Note that the rule for XML attributes is not a simple call to the parser, but a call to a wrapper predicate. Since the SWI parser returns namespace attributes in the same list as other attributes, while the infoset spec requires that they be listed in different properties, the ct_attributes predicate will need to filter the attribute information items into two different lists, one to become the value of the attributes infoset property, and one to become the value of namespaceAttributes.
We also should become a little more systematic about naming conventions. If we continue to use generic identifiers (element type names) directly as names of Prolog predicates, we risk name collisions between elements and predicates defined as part of the parser, or built in to Prolog. To eliminate this risk, we will prefix names taken over from the schema with e_ (for elements), t_ (for types), etc., and we will avoid those prefixes otherwise. If a schema has any names beginning with e_ or t_, this rule may become slightly confusing. But there won't be collisions between schema-based names and other names in the parser.
The purchase order schema po.xsd defines the following fifteen element types: the list gives the simple names which will be used to refer to them in the grammar below, as well as their schema-component designator as defined in Holstege/Vedamuthu 2002. Since their local names are all unique, the grammar below simply uses e_ plus their local names to refer to them. In other schemas, it will be necessary to mangle the names, or generate arbitrary identifiers, in order to distinguish different element types which have the same local names.
  • e_purchaseOrder = /element(purchaseOrder)
  • e_comment = /element(comment)
  • e_shipTo = /complexType(po:PurchaseOrderType)/sequence()/element(shipTo)
  • e_billTo = /complexType(po:PurchaseOrderType)/sequence()/element(billTo)
  • e_items = /complexType(po:PurchaseOrderType)/sequence()/element(items)
  • e_name = /complexType(po:USAddress)/sequence()/element(name)
  • e_street = /complexType(po:USAddress)/sequence()/element(street)
  • e_city = /complexType(po:USAddress)/sequence()/element(city)
  • e_state = /complexType(po:USAddress)/sequence()/element(state)
  • e_zip = /complexType(po:USAddress)/sequence()/element(zip)
  • e_item = /complexType(po:Items)/sequence()/element(item)
  • e_productName = /complexType(po:Items)/sequence()/element(item)/complexType()/sequence()/element(productName)
  • e_quantity = /complexType(po:Items)/sequence()/element(item)/complexType()/sequence()/element(quantity)
  • e_USPrice = /complexType(po:Items)/sequence()/element(item)/complexType()/sequence()/element(USPrice)
  • e_shipDate = /complexType(po:Items)/sequence()/element(item)/complexType()/sequence()/element(shipDate)
The simple purchase-order schema defines four complex types; one is anonymous; we'll use the local name of its host element after the t_ prefix:
  • t_PurchaseOrderType = /complexType(po:PurchaseOrderType)
  • t_USAddress = /complexType(po:USAddress)
  • t_Items = /complexType(po:Items)
  • t_item = /complexType(po:Items)/sequence()/element(item)/complexType()
The elements with complex types get these rules:
< 31 Rules for elements with complex types > ≡
/* e_purchaseOrder: grammatical rule for purchaseOrder element.
   e_purchaseOrder(ParsedNode,L1,L2): holds if the difference
      between L1 and L2 (difference lists) is a purchase order
      element in SWI Prolog notation. 
   And so on for the other element types.
*/
e_purchaseOrder ::= [element('http://www.example.com/PO1':purchaseOrder,
  Attributes,Content)],
  {
    t_PurchaseOrderType_atts(A,NA,Attributes),
    t_PurchaseOrderType_cont(C,Content,[])
  } 
  <:> localname(purchaseOrder)
  && type_definition_anonymous('false')
  && type_definition_namespace('http://www.example.com/PO1')
  && type_definition_name('PurchaseOrderType')
  && type_definition_type(complex)
  {Common infoset properties for elements in po namespace 32}

  .
e_shipTo ::= [element(shipTo,Attributes,Content)],
  {
    t_USAddress_atts(A,NA,Attributes),
    t_USAddress_cont(C,Content,[])
  } 
  <:> localname(shipTo)
  && type_definition_anonymous('false')
  && type_definition_namespace('http://www.example.com/PO1')
  && type_definition_name('USAddress')
  && type_definition_type(complex)
  {Common infoset properties for elements in po namespace 32}


  .
e_billTo ::= [element(billTo,Attributes,Content)],
  {
    t_USAddress_atts(A,NA,Attributes),
    t_USAddress_cont(C,Content,[])
  } 
  <:> localname(billTo)
  && type_definition_anonymous('false')
  && type_definition_namespace('http://www.example.com/PO1')
  && type_definition_name('USAddress')
  && type_definition_type(complex)
  {Common infoset properties for elements in po namespace 32}


  .
e_items ::= [element(items,Attributes,Content)],
  {
    t_Items_atts(A,NA,Attributes),
    t_Items_cont(C,Content,[])
  } 
  <:> localname(items)
  && type_definition_anonymous('false')
  && type_definition_namespace('http://www.example.com/PO1')
  && type_definition_name('Items')
  && type_definition_type(complex)
  {Common infoset properties for elements in po namespace 32}


  .
e_item ::= [element(item,Attributes,Content)],
  {
    t_item_atts(A,NA,Attributes),
    t_item_cont(C,Content,[])
  } 
  <:> localname(item)
  && type_definition_anonymous('true')
  && type_definition_namespace('http://www.example.com/PO1')
  && type_definition_name('t_item')
  && type_definition_type(complex)
  {Common infoset properties for elements in po namespace 32}


  .

This code is used in < Predicates for purchase-order material 84 > < Predicates for purchase-order material 154 >

Note that the type_definition_name property for the item element provides the generated name we use for the type. That this name is not assigned by the schema is clarified by type_definition_anonymous('true').
Since the attributes, children, and namespacename properties have identical definitions for all element types in the purchase-order namespace, we can factor them out into a single code fragment:
< 32 Common infoset properties for elements in po namespace > ≡
  &&  attributes(A)
  &&  namespaceattributes(NA)
  &&  children(C)
  &&  namespacename('http://www.example.com/PO1')
  && validationattempted(full)
  && validity(valid)

This code is used in < Rules for elements with complex types 31 > < Rules for elements with simple types 33 >

The rules for elements with simple types are slightly simpler than those for elements with complex types, but follow the same basic pattern.
In the DCG version of this schema given above, we wrote the comment element and others associated with simple types using a hard-coded requirement for an empty list of attributes. In fact, that is too simple, since such elements may in fact have xsi:type, xsi:nil, xsi:schemaLocation and xsi:noNamespaceSchemaLocation attributes. So we write these element rules with the same basic structure as was used for complex types, except that we use a standard predicate for checking that no attributes outside the xsi namespace were used.
The schema po.xsd defines two simple types: SKU and the anonymous simple type used for quantities:
  • t_quantity = /complexType(po:Items)/sequence()/element(item)/complexType()/sequence()/element(quantity)/simpleType()
  • t_SKU = /simpleType(SKU)
In addition, several built-in simple types are used:
  • t_string = xsd:string
  • t_integer = xsd:integer
  • t_decimal = xsd:decimal
  • t_date = xsd:date
In a full implementation, we'll need to do some more serious name mangling to ensure uniqueness of relatively short, handy names for all types. For now, we just choose the names manually.
The rules for simple types are:
< 33 Rules for elements with simple types > ≡
e_comment ::= [element('http://www.example.com/PO1':comment,Attributes,Content)],
  {Guard to check attributes and content of strings 34}

  <:> localname(comment) 
  {Common infoset properties for elements in po namespace 32}

  {PSVI properties for strings 35}

  .

e_name ::= [element(name,Attributes,Content)],
  {Guard to check attributes and content of strings 34}

  <:> localname(name) 
  {Common infoset properties for elements in po namespace 32}

  {PSVI properties for strings 35}

  .

e_street ::= [element(street,Attributes,Content)],
  {Guard to check attributes and content of strings 34}

  <:> localname(street) 
  {Common infoset properties for elements in po namespace 32}

  {PSVI properties for strings 35}

  .

e_city ::= [element(city,Attributes,Content)],
  {Guard to check attributes and content of strings 34}

  <:> localname(city) 
  {Common infoset properties for elements in po namespace 32}

  {PSVI properties for strings 35}

  .

e_state ::= [element(state,Attributes,Content)],
  {Guard to check attributes and content of strings 34}

  <:> localname(state) 
  {Common infoset properties for elements in po namespace 32}

  {PSVI properties for strings 35}

  .

e_zip ::= [element(zip,Attributes,Content)],
  {
    sT_atts(A,Attributes,[]),
    xsd_decimal_cont(C,Content,[])
  }
  <:> localname(zip) 
  {Common infoset properties for elements in po namespace 32}

  {PSVI properties for decimals 36}

  .

e_productName ::= [element(productName,
    Attributes,Content)],
  {Guard to check attributes and content of strings 34}

  <:> localname(productName) 
  {Common infoset properties for elements in po namespace 32}

  {PSVI properties for strings 35}

  .

e_quantity ::= [element(quantity,
    Attributes,Content)],
  {
    sT_atts(A,Attributes,[]),
    t_quantity_cont(C,Content,[])
  }
  <:> localname(quantity) 
  {Common infoset properties for elements in po namespace 32}

  && type_definition_anonymous('true')
  && type_definition_namespace('http://www.example.com/PO1')
  && type_definition_name('t_quantity')
  && type_definition_type(simple)
  .

e_USPrice ::= [element('USPrice',Attributes,Content)],
  {
    sT_atts(A,Attributes,[]),
    xsd_decimal_cont(C,Content,[])
  }
  <:> localname('USPrice') 
  {Common infoset properties for elements in po namespace 32}

  {PSVI properties for decimals 36}

  .

e_shipDate ::= [element(shipDate,Attributes,Content)],
  {
    sT_atts(A,Attributes,[]),
    xsd_date_cont(C,Content,[])
  }
  <:> localname(shipDate) 
  {Common infoset properties for elements in po namespace 32}

  && type_definition_anonymous('false')
  && type_definition_namespace('http://www.w3.org/2001/XMLSchema')
  && type_definition_name('date')
  && type_definition_type(simple)
  .

This code is used in < Predicates for purchase-order material 84 > < Predicates for purchase-order material 154 >

Just as we factor out the common infoset properties, we can also factor out the checking against frequently used built-in simple types, notably string:
< 34 Guard to check attributes and content of strings > ≡
  {
    sT_atts(A,Attributes,[]),
    xsd_string_cont(C,Content,[])
  }

This code is used in < Rules for elements with simple types 33 >

Similarly, the type identifications for string and decimal are used more than once:
< 35 PSVI properties for strings > ≡
  && type_definition_anonymous('false')
  && type_definition_namespace('http://www.w3.org/2001/XMLSchema')
  && type_definition_name('string')
  && type_definition_type(simple)

This code is used in < Rules for elements with simple types 33 >

< 36 PSVI properties for decimals > ≡
  && type_definition_anonymous('false')
  && type_definition_namespace('http://www.w3.org/2001/XMLSchema')
  && type_definition_name('decimal')
  && type_definition_type(simple)

This code is used in < Rules for elements with simple types 33 >

3.4.2. Rules for attributes

For each complex type, we need to do several things in order to validate all the attributes on occurrence of that type and provide appropriate nodes and infoset properties:
  • The input structure has namespace attributes and other attributes in the same list, while we need them in separate lists so we can assign them to two different infoset properties. So we need to partition the list of attributes. We can perform the partition either before all other processing, or after; doing it afterwards leads to more compact code, so we choose that.
  • For each non-namespace attribute found, we need to validate it: if it is declared, we need to check it against its declared type. If the attribute is declared with a fixed type, we should check that the value given matches the prescribed value. If it is not declared, we should raise an error, but we'll save that for a later layer.
  • We need to ensure that attributes required by the complex type are present and that attributes forbidden by the complex type are not present. For any attributes declared with default values, we need to supply an attribute information item with the default value, if the document didn't supply a value. Rather than trying to interleave this with other tasks, we will perform a separate check on attribute occurrences.
And we want to provide basic infoset properties for the XML attributes, in the form of grammatical attributes in the attribute-grammar sense.
3.4.2.1. Basic pattern
For each complex or simple type dt, the basic pattern of the attribute-checking rule will be:
dt_atts(Lpa,Lpna,Lavs) :-
  lavs_dt(LpaAll,Lavs,[]),    /* parse against grammar of attributes */
  partition(LpaAll,LpaPresent,Lpna),         /* partition the result */
  attocc_dt(LpaPresent,Lpa).      /* check min, max occurrence rules */
The logical variables have the following meanings:
Lpa
List of parsed attributes (i.e. of node() structures of the kind returned by any DCTG rule) for this complex type, including defaulted attributes
Lpna
List of parsed namespace attributes
Lavs
The list of attribute-value specifications provided by the input structure returned by the SWI Prolog parser.
LpaAll
Combined list of parsed-attribute node() structures for all attributes, both namespace attributes and others
LpaPresent
List of parsed-attribute nodes for attributes explicitly assigned values in the document instance (without defaulted attributes)
For each type, a grammar defining the legal attributes will be constructed; if type dt has attributes an1 and an2, of types st1 and st2 respectively, then the core context-free grammar will have a form like this:
lavs_dt ::= [].
lavs_dt ::= avs_dt, lavs_dt.       /* declared attributes */
lavs_dt ::= avs_nsd, lavs_dt.   /* namespace declarations */
lavs_dt ::= avs_xsi, lavs_dt.           /* XSI attributes */

avs_dt ::= [an1=Av], { st1_value(Av) }.
avs_dt ::= [an2=Av], { st2_value(Av) }.
Simple types will, of course, have no declared attributes, and the rules for declared attributes and occurrence-checking (together with the rules for individual attributes) will be omitted. Wildcard support can also be added here when needed.
3.4.2.2. Namespace attributes and XSI attributes
One set of rules for namespace attributes and XSI attributes will suffice:
< 37 Grammar rules for namespace and XSI attributes > ≡
/* avs_nsd: grammatical rule for namespace-attribute specifications */
avs_nsd ::= [xmlns=DefaultNS]
  <:> localname(xmlns)
  && namespacename('http://www.w3.org/2000/xmlns/')
  && normalizedvalue(DefaultNS).
avs_nsd ::= [xmlns:Prefix=NSName]
  <:> localname(Prefix)
  && namespacename('http://www.w3.org/2000/xmlns/')
  && normalizedvalue(NSName).

/* avs_nsd: grammatical rule for XSI attribute specifications */
avs_xsi ::= ['http://www.w3.org/2001/XMLSchema-instance':Localname=Value]
  <:> localname(Localname)
  && namespacename('http://www.w3.org/2001/XMLSchema-instance')
  && normalizedvalue(Value).

This code is used in < Generic utilities for DCTG-encoded schemas 85 > < Generic utilities for DCTG-encoded schemas 158 >

Note that default namespace declarations do have a namespace property, despite not having a prefixed name; this is in accord with Section 2.2 of the Infoset spec, which says “By definition, all namespace attributes (including those named xmlns, whose [prefix] property has no value) have a namespace URI of http://www.w3.org/2000/xmlns/.”
3.4.2.3. Occurrence checking
Each complex type will also have a rule for occurrence-checking, which will take something like the following form (assuming that Lreq, Ldft, and Lnot are lists of required, defaulted, and forbidden attributes:
attocc_dt(LpaPres,LpaAll) :-
  atts_present(LpaPres,Lreq),
  atts_absent(LpaPres,Lnot),
  atts_defaulted(LpaPres,Ldft,LpaAll).
Since the form of the attribute lists has changed (we are now dealing with lists of node structures), we need new forms of atts_present, etc. for this:
< 38 Utilities for checking attribute occurrences > ≡
/* atts_present(Lpa,Lreq):  true if a parsed attribute node
   is present in Lpa for each attribute name in Lreq */
atts_present(LAVS,[]).
atts_present(LAVS,[HRA|RequiredTail]) :-
  att_present(LAVS,HRA),
  atts_present(LAVS,RequiredTail).

/* An attribute name matches if namespace and local part match */
/* att_present(Lpa,Attname):  true if a parsed attribute node
   is present in Lpa which has name Attname */
att_present([Pa|Lpa],NS:Attname) :- 
  Pa^^localname(Attname), 
  Pa^^namespacename(NS).
att_present([Pa|Lpa],Attname) :-
  att_present(Lpa,Attname).
/* no base step: if we reach att_present([],Attname) we want to fail. */

This code is used in < Generic utilities for DCTG-encoded schemas 85 > < Generic utilities for DCTG-encoded schemas 158 >

The rule for checking forbidden attributes is very similar:
< 39 Utility for checking absent attributes > ≡
/* atts_absent(Lpa,Ltabu): true if no attribute named in Ltabu
   is present in Lpa */
atts_absent(LAVS,[]).
atts_absent(LAVS,[H|T]) :-
  not(att_present(LAVS,H)),
  atts_absent(LAVS,T).

This code is used in < Generic utilities for DCTG-encoded schemas 85 > < Generic utilities for DCTG-encoded schemas 158 >

The rule for providing defaults must go through all of the attributes with defaults; this happens in the atts_defaulted predicate in the usual way of recursion on the list.
< 40 Utility for providing defaulted attributes > ≡
/* atts_defaulted(L1,L2,L3): true if L3 has all the attributes in L1,
   plus all of the attributes in L2 which are not also in L1 */
atts_defaulted(Lpa,[],Lpa).
atts_defaulted(Lpa,[Padft|Ldft],LpaAll) :-
  atts_defaulted(Lpa,Ldft,Lpa2),
  att_merge(Lpa2,Padft,LpaAll).
Continued in <Utility for providing defaulted attributes 41>
This code is used in < Generic utilities for DCTG-encoded schemas 85 > < Generic utilities for DCTG-encoded schemas 158 >

For each of these attributes individually, the default value must be added to the list if a value is not already there; this involves recursion on the list of attributes already present. We expect only ever to call this predicate when the first and third arguments (the defaulted attribute and the list into which it is to be merged) are inst