Introduction to

W3C XML Schema 1.0

C. M. Sperberg-McQueen

18 August 2003

http://www.w3.org/People/cmsmcq/2003/xstut.sydney.html



I. Welcome and overview

I.1. Workshop goals

At the end of the morning, you should
  • know what XML Schema is
  • understand why schema languages in general are needed
  • have a general grasp of XML Schema's basic concepts and how to impose various kinds of constraints using XML Schema
  • know how XML syntax is used in XML Schema documents
  • understand how schemas are put together from schema documents and how multiple-namespace documents can be validated

I.2. Non-goals

At the end of the morning, you will not:
  • know how to perform the document and data analysis necessary to define XML vocabularies well
  • have hands-on experience writing a DTD or schema
  • be able to impress computer scientists with a profound understanding of the relative merits of single and multiple inheritance for handling limited context sensitivity in a basically context-free environment
— unless, of course, you already do.

I.3. Rules

Just to let you know what I expect:
  • If you cannot hear me, please interrupt.
  • If something is unclear, please interrupt and ask.
  • If a rabbit hole is tempting, please hold back.
  • Break at 10:30. Sharp.*
* Asterisks in the slides mean that annotation or interpretation is needed; if I don't provide it, ask.

I.4. Acknowledgements

I'm indebted (for general discussions and for specific material included here) to
  • Elaine Brennan
  • Robin Cover
  • David Fallside
  • Michael Hahn
  • Dave Hollander
  • Eve L. Maler
  • Murray Maloney
  • Jeni Tennison
  • Henry S. Thompson
as well as to my colleagues at W3C and in the W3C XML Schema Working Group.

I.5. Workshop overview

  1. Overview and introduction
  2. Basic ideas of XML Schema
  3. A simple example (the duck)
  4. A more complex example (purchase order)
[break]
  1. Simple types and their facets
  2. Complex types and their derivation hierarchy
  3. The post-schema-validation infoset
  4. Some usage questions (modularization, designing for reuse and extension, multi-namespace documents)
  5. Review and conclusion

II. Introduction

II.1. What's a schema?

For our purposes, a schema is
a formal expression
of the structure
of an XML document
and of constraints on the text therein
There are other meanings in DBMS, programming languages, mathematics, and elsewhere. No* relation.
In ISO 8879, the term document type definition has this* meaning.[1]

II.2. What's XML Schema?

XML Schema 1.0 is
A W3C Recommendation
issued in May 2001,
developed by the W3C XML Schema Working Group,
which defines
  • a system of simple and complex types
  • several types of schema components
  • an XML transfer syntax for schema documents
  • rules for schema-validity assessment of XML infosets
  • contributions to a post-schema-validation infoset

II.3. Other schema languages

There are other schema languages for XML document types:
  • Relax NG
  • Schematron
  • XML Data Reduced (XDR)
  • Relax
  • Trex
  • Document structure definition (DSD)
  • Schema for object-oriented XML (SOX)
  • various DTD extensions

II.4. Why use schemas?

So what's wrong with well-formed XML?
  • What is well-formedness?
  • The duck
  • A malformed duck
  • What the computer sees

II.5. What is well-formedness?

A document is well-formed if it obeys all the rules of XML itself:
  • Start-tags match end-tags.
  • Elements nest properly.
  • Attributes are quoted.
  • There is a single root element.
  • All entities used are declared.
  • ...
Any additional constraints are imposed by the application, not by XML.
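A small sketch may make the rules concrete. The fragment below is well-formed; the status attribute and the ampersand line are made up for illustration and are not part of the poem encoding used later.

```xml
<?xml version="1.0"?>
<!-- Hypothetical fragment: every rule in the list above holds. -->
<poem status="draft">              <!-- single root element; attribute value is quoted -->
  <title>The duck</title>          <!-- start-tag matched by its end-tag -->
  <stanza>
    <line>Of a puddle &amp; pond.</line>  <!-- amp is predefined, so no declaration is needed -->
    <line>It bottoms-ups.</line>   <!-- line nests properly inside stanza -->
  </stanza>
</poem>
```

Dropping the quotes around draft, or closing stanza before its last line, would break well-formedness. Misspelling stanza consistently in both tags would not, which is the point of the slides that follow.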

II.6. The duck

by Ogden Nash
Behold the duck.
It does not cluck.
A cluck it lacks.
It quacks.
It is especially fond
Of a puddle or pond.
When it dines or sups
It bottoms-ups.

II.7. The duck

Let us consider a straightforward XML encoding:
<poem>
<title>The duck</title>
<author>Ogden Nash</author>
<stanza>
<line>Behold the duck.</line>
<line>It does not cluck.</line>
<line>A cluck it lacks.</line>
<line>It quacks.</line>
</stanza>
<stanza>
<line>It is especially fond</line>
<line>Of a puddle or pond.</line>
<line>When it dines or sups</line>
<line>It bottoms-ups.</line>
</stanza>
</poem>

II.8. The duck

Even if the data are meaningless, some errors are obvious:
<poem>
<author>Btqra Anfu</author>
<stanza>
<line>Orubyq gur qhpx.</line>
<line>Vg qbrf abg pyhpx.</line>
<line>N pyhpx vg ynpxf.</line>
<line>Vg dhnpxf.</line>
</stanza>
<title>Gur qhpx</title>
<stanza>
<line>Vg vf rfcrpvnyyl sbaq</line>
<line>Bs n chqqyr be cbaq.</line>
<line>Jura vg qvarf be fhcf</line>
<line>Vg obggbzf-hcf.</line>
</stanza>
</poem>

II.9. What the computer sees

What the computer sees, however, is less clear:
<cbrz>
<gvgyr>Gur qhpx</gvgyr>
<nhgube>ol Btqra Anfu</nhgube>
<fgnamn>
<yvar>Orubyq gur qhpx.</yvar>
<yvar>Vg qbrf abg pyhpx.</yvar>
<yvar>N pyhpx vg ynpxf.</yvar>
<yvar>Vg dhnpxf.</yvar>
</fgnamn>
<fgnamn>
<yvar>Vg vf rfcrpvnyyl sbaq</yvar>
<yvar>Bs n chqqyr be cbaq.</yvar>
<yvar>Jura vg qvarf be fhcf</yvar>
<yvar>Vg obggbzf-hcf.</yvar>
</fgnamn>
</cbrz>

II.10. Find the errors

This document is well-formed, but has several typos.
<cbrz>
<gvgyr>Gur qhpx</gvgyr>
<nhgube>ol Btqra Anfu</nhgube>
<fgnamn>
<yvar>Orubyq gur qhpx.</yvar>
<yvar>Vg qbrf abg pyhpx.</yvar>
<yvar>N pyhpx vg ynpxf.</yvar>
<yvar>Vg dhnpxf.</yvar>
</fgnamn>
<fgnanm>
<yyar>Vg vf rfcrpvnyyl sbaq</yyar>
<yyar>Bs n chqqyr be cbaq.</yyar>
<yyar>Jura vg qvarf be fhcf</yyar>
<yvar>Vg obggbzf-hcf.</yvar>
</fgnanm>
</cbrz>

II.11. Now imagine ...

that it's production data:
  • The document is well-formed but has typos.
  • It's not a poem but a purchase-order.
  • Owing to the typos, your order for ten laser printers has become an order for ten gross of laser printers.
  • (And you just learned your supplier isn't good at correcting errors in their computer systems.)

II.12. The Iron Law

Garbage* in, garbage out.
Three questions:
  • Can errors exist? or is every string of bits a possible message?
  • Can errors be found
    • automatically?
    • by clerical inspection?
    • through inspections by highly trained experts?
  • Is the cost of undetected errors
    • trivial?
    • small?
    • large?
    • catastrophic?

II.13. Why well-formedness isn't enough

Well-formed documents
can have errors
with serious consequences
some of which* can be caught mechanically.

II.14. Document grammars

Origin: pragmatic, not theoretical; partial post hoc alignment with formal language theory.
Formal specification of validity conditions → automated validation.
Er, ah, partial formal specification of validity conditions → partial automated validation.
Distinction between
  • document type definition (DTD) and
  • “set of effective formal declarations”
→ division of labor.

II.15. Conceptual layers

Three layers of rules governing data.

II.16. Conceptual layers (2)

Distinguish logical and physical structure.

II.17. Uses of document grammars

Document grammars may have several uses:
  • in the struggle against dirty data (sanity checking)
  • as documentation of the content of data flows
  • as documentation of a contract between data provider and data consumer
  • as specification of client/server protocols
  • validation (to enforce the contract or check the implementation)
  • automation for document authoring
  • reasoning about data and software (query processing, completeness checking for software)

II.18. DTDs as a schema language

SGML/XML DTDs resemble Backus-Naur Form grammars, but:
  • They describe bracketed languages* ...
  • ... so ‘non-terminals’ are visible*.
  • SGML allows inclusion and exclusion exceptions (Rizzi: NP-complete parsing problem for non-bracketed L).
  • They are not purely grammatical (notations, entities).
  • Determinism rule.

II.19. DTDs: special notation

  • compact, clear distinction of levels
  • ad hoc, adds complexity (1/3 of the rules in the SGML grammar)
  • because the notation is different, DTDs require
    • special parsers
    • special editors
    • special processors
  • learning curve? (not really relevant, even if we agreed on which is harder)
  • no datatypes*
  • do not play well with namespaces
  • no formal role for documentation
  • no inheritance (no kind-of information, only part-of)

III. Basic ideas of XML Schema

III.1. XML Schema

  • DTD++, DTD--
  • instance syntax
  • supporting programming-language and database-oriented types (inheritance)
  • schema combination rules
  • better hooks for documentation and semantics

III.2. The XML Schema 1.0 specification

Three parts:
  • Part 0: Primer (non-normative)
  • Part 1: Structures
  • Part 2: Datatypes

III.3. Implementations

Among the most widely used validators:
  • MSXML4 (Microsoft)
  • Xerces-J, Xerces-C++ (Apache)
  • XSV (Henry Thompson, Richard Tobin)
  • Schema Quality Checker (IBM)
  • Multi Schema Validator (Sun)
  • Topologi Schematron Validator (uses MSXML for XML Schema validation)
Editors include:
  • XML Spy (Altova)
  • XMetaL (Corel)
  • ...

III.4. Use cases

  • quality assurance
  • database exchange
  • translation to/from OO systems
  • reuse of schemas and schema fragments
  • smooth evolution of schemas, applications, data, software

III.5. Data-intensive applications

  • electronic commerce
  • Web Services
  • database exchange
  • inter-process communication
  • metadata processing
  • process modeling

III.6. Document-oriented applications

  • publishing
  • Web page design
  • form controls
  • online catalogs
  • multimedia presentations
  • electronic books
  • maps, directories, ...

III.7. Schema-validity and schema-validity assessment

XML Schema 1.0 defines
  • schema-validity assessment:
    input infoset × schema → output infoset (PSVI)
  • schema (≡ set of abstract schema components)
  • schema document (XML representation)
  • mapping from XML transfer syntax to schema components

III.8. Validation

In schema-validity assessment, we
  1. identify an XML element information item* to validate
  2. identify a schema to validate against
  3. assess the schema-validity of the element and its descendants
  4. add validity and type information to the infoset

III.9. Some fundamental ideas

  • The syntax is not the schema.
  • Namespaces are fundamental (but not the same as schemas).
  • Schema-validity assessment is an infoset-to-infoset mapping.
  • We separate tags and types.
  • Types can be simple or complex.
  • Elements and types can be global (top-level) or local.
  • Schemas can be combined.
  • Declarations can use wildcards, type derivation (extension, restriction), substitution groups, application-specific annotations.

III.10. Schema components

XML Schema 1.0 defines fourteen types of component. The most important:
  • element declarations
  • attribute declarations
  • complex type definitions
  • simple type definitions
  • the schema itself
  • annotations
Entities conspicuous by their absence.

IV. A simple example

IV.1. Making a schema document

  • the schema element
  • declaring elements
  • occurrence indicators
  • character data
  • linking the schema and the document
[Shift back and forth to emacs for construction of a simple schema for “The Duck”]

IV.2. The schematic duck (1)

First, let's just declare all the element types:
<xsd:schema 
 xmlns:xsd="http://www.w3.org/2001/XMLSchema">
 <xsd:element name="poem"></xsd:element>
 <xsd:element name="title"></xsd:element>
 <!--* ... etc. ... *-->
</xsd:schema>
Validate:
  • with complete schema
  • omitting some declarations
  • with errors in document (misspellings, order, ...)

IV.3. Linking document and schema

Two ways to link document and schema:
  • inline (clunky, problematic, but easy to understand and implement)
  • out-of-band (not standardized)
An inline example:
<poem 
 xmlns:xsi
 ="http://www.w3.org/2001/XMLSchema-instance"
 xsi:noNamespaceSchemaLocation="tds03.xsd"
 >
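For documents in a namespace, the inline link uses xsi:schemaLocation instead; its value is a list of pairs, namespace name first, then document location. A sketch, assuming a hypothetical file name po.xsd and an example namespace; the children are omitted, so this illustrates the linking attribute only:

```xml
<po:purchaseOrder
    xmlns:po="http://www.example.com/PO1"
    xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance"
    xsi:schemaLocation="http://www.example.com/PO1 po.xsd"
    orderDate="2003-08-18">
  <!-- element content omitted; po.xsd is an assumed file name -->
</po:purchaseOrder>
```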

IV.4. Running the validator

  • XSV: xsv -t -w -o xsvout.xml -s "file:///c:/Program%20Files/XSV/xsv.xsl" file.xml
    • -t = show timings
    • -w = include warnings
    • -o file = write errors to file
    • -s file = include stylesheet file in error output
  • Xerces: java sax.Counter -v -s -f file.xml
    • -v = validate
    • -s = use XML Schema
    • -f = schema full checking (check schema for correctness)
  • Topologi (menu interface)

IV.5. Post-schema-validation infoset

XSV and Xerces-J can also dump the PSVI:
  • XSV: xsv -t -w -o xsvout.xml -s "file:///c:/Program%20Files/XSV/xsv.xsl" -r alt file.xml > psvi.out.xml
    • -t, -w, -o, -s as before
    • -r alt = write PSVI in alternating normal form
    • -r ind = write PSVI in individual normal form
  • Xerces: java sax.Writer -v -s -f -p xni.parser.PSVIParser file.xml > psvi.out.xerces.xml [2]
    • -v, -s, -f as before
    • -p parser = use specified parser in lieu of default

IV.6. Validation outcomes

N.B. there are six outcomes, not two. By [validation attempted], then by [validity]:
  • full
    • valid: OK. Entire subtree valid.
    • invalid: OK. Entire subtree assessed; error here or at some descendant.
    • notKnown: Not possible (contradictory).
  • partial
    • valid: OK. This node assessed and valid. Some descendant skipped.
    • invalid: OK. Problem at this node, or in a child. Also, some descendant skipped.
    • notKnown: OK. This node not assessed (but a descendant was).
  • none (subtree skipped)
    • valid: Not possible (contradictory).
    • invalid: Not possible (contradictory).
    • notKnown: OK. This subtree was skipped.

IV.7. The schematic duck (2a)

Next, let's declare the rules more correctly:
<xsd:schema 
 xmlns:xsd="http://www.w3.org/2001/XMLSchema">
 <xsd:element name="poem">
  <xsd:complexType>
   <xsd:sequence>
    <xsd:element ref="title"/>
    <xsd:element ref="author"/>
    <xsd:element ref="stanza" maxOccurs="unbounded"/>
   </xsd:sequence>
  </xsd:complexType>
 </xsd:element>
 <!--* ... etc. ... *-->
</xsd:schema>

IV.8. The schematic duck (2b)

To handle textual data:
 <xsd:element name="title">
  <xsd:complexType mixed="true"/>
 </xsd:element>	
or
 <xsd:element name="author" type="xsd:string"/>
Validate:
  • with complete schema
  • omitting some declarations
  • with errors in document (misspellings, order, ...)

IV.9. The schematic duck (3)

A better way for textual data:
  <xsd:complexType name="words" mixed="true"/>
which allows us to say simply:
 <xsd:element name="title" type="words"/>
 <xsd:element name="author" type="words"/>
 <xsd:element name="line" type="words"/>
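Putting the pieces together, one possible complete schema document for the duck might look like the sketch below. The declaration of stanza is my assumption (one or more line elements); the rest is assembled from the fragments above.

```xml
<?xml version="1.0"?>
<xsd:schema xmlns:xsd="http://www.w3.org/2001/XMLSchema">

  <!-- Named mixed-content type shared by the text-bearing elements -->
  <xsd:complexType name="words" mixed="true"/>

  <xsd:element name="poem">
    <xsd:complexType>
      <xsd:sequence>
        <xsd:element ref="title"/>
        <xsd:element ref="author"/>
        <xsd:element ref="stanza" maxOccurs="unbounded"/>
      </xsd:sequence>
    </xsd:complexType>
  </xsd:element>

  <xsd:element name="title" type="words"/>
  <xsd:element name="author" type="words"/>

  <!-- Assumption: a stanza is a sequence of one or more lines -->
  <xsd:element name="stanza">
    <xsd:complexType>
      <xsd:sequence>
        <xsd:element ref="line" maxOccurs="unbounded"/>
      </xsd:sequence>
    </xsd:complexType>
  </xsd:element>

  <xsd:element name="line" type="words"/>
</xsd:schema>
```

Against a schema like this, the document with title after the first stanza fails on order, and the document with the misspelled tags fails because no declarations exist for them.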

V. A more complex example: the purchase order

Once more, slowly:
  • schema element
  • element declarations with named types
  • element declarations with anonymous types
  • handling natural-language data: string, mixed content

V.1. The purchase order schema

At the outer level is a schema element:
<xsd:schema 
     xmlns:xsd="http://www.w3.org/2001/XMLSchema"
     xmlns:po="http://www.example.com/PO1"
     targetNamespace="http://www.example.com/PO1"
>
 <!--* declarations and definitions go here *-->
</xsd:schema>
N.B. the schema does not identify a document-root element / start symbol.

V.2. Declaring elements

With named types:
 <xsd:element name="purchaseOrder" 
              type="po:PurchaseOrderType"/>
 <xsd:element name="comment"       
              type="xsd:string"/>

V.3. Declaring elements

With anonymous types:
 <xsd:element name="quantity">
  <xsd:simpleType>
   <xsd:restriction base="xsd:positiveInteger">
    <xsd:maxExclusive value="100"/>
   </xsd:restriction>
  </xsd:simpleType>
 </xsd:element>

V.4. Declaring complex types

 <xsd:complexType name="PurchaseOrderType">
  <xsd:sequence>
   <xsd:element name="shipTo"    type="po:USAddress"/>
   <xsd:element name="billTo"    type="po:USAddress"/>
   <xsd:element ref="po:comment" minOccurs="0"/>
   <xsd:element name="items"  type="po:Items"/>
  </xsd:sequence>
  <xsd:attribute name="orderDate" type="xsd:date"/>
 </xsd:complexType>
  • Note difference between element declaration and element reference.
  • Implicit occurrence information: min = max = 1.
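A sketch of an instance that would match PurchaseOrderType, assuming the schema element shown in V.1 (it sets no elementFormDefault, so locally declared elements are unqualified). The content of shipTo, billTo and items is elided because USAddress and Items are not shown here, and the comment text is made up.

```xml
<po:purchaseOrder xmlns:po="http://www.example.com/PO1"
                  orderDate="2003-08-18">
  <!-- shipTo, billTo, items: local declarations, hence no prefix -->
  <shipTo><!-- USAddress content elided --></shipTo>
  <billTo><!-- USAddress content elided --></billTo>
  <!-- comment: a reference to a global declaration, hence namespace-qualified -->
  <po:comment>Please ship promptly.</po:comment>
  <items><!-- Items content elided --></items>
</po:purchaseOrder>
```

The prefix on po:comment and its absence on the siblings is the visible consequence of the declaration/reference distinction. Omitting po:comment is fine (minOccurs="0"); omitting billTo is not.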

V.5. Character data

 <xsd:element name="comment"       
              type="xsd:string"/>
or as mixed content:
 <xsd:element name="comment">
  <xsd:complexType mixed="true">
  </xsd:complexType>
 </xsd:element>

V.6. Exploring the purchase order schema

Validate:
  • correct purchase order
  • missing billTo
  • invalid product count
  • invalid product number

VI. Simple types and facets

VI.1. Simple datatypes

  • built-in
    • primitive
    • derived
  • user-defined (all derived)

VI.2. Simple type hierarchy

VI.3. Built-in primitive datatypes

  • string
  • boolean
  • decimal, float, double
  • dateTime, time, date, gYearMonth, gYear, gMonthDay, gDay, gMonth
  • duration
  • hexBinary, base64Binary
  • anyURI
  • QName
  • NOTATION

VI.4. Built-in derived datatypes

  • normalizedString, token, language
  • IDREFS, ENTITIES, NMTOKEN, NMTOKENS, Name, NCName, ID, IDREF, ENTITY
  • integer, nonPositiveInteger, negativeInteger, long, int, short, byte, nonNegativeInteger, unsignedLong, unsignedInt, unsignedShort, unsignedByte, positiveInteger

VI.5. Using a built-in type

We can declare built-ins:
<xsd:element name="USPrice"  type="xsd:decimal"/>
<xsd:attribute name="orderDate" type="xsd:date"/>
Or just use them dynamically:
<shipDate xsi:type="xsd:date">2003-08-18</shipDate>

VI.6. Using xsi:type

The special attribute xsi:type can be used to associate specific elements in an instance with types. Conditions:
  • elements only (why?)
  • if schema says type="B" and instance says xsi:type="D", then D must be derived* from B.
Some people think the use of xsi:type for simple types is the easiest way to edge into schema usage.
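The derivation condition can be sketched as follows. The type and element names are hypothetical; quantityType plays the role of B and smallQuantityType the role of D.

```xml
<xsd:schema xmlns:xsd="http://www.w3.org/2001/XMLSchema">
  <!-- B: the type named in the element declaration -->
  <xsd:simpleType name="quantityType">
    <xsd:restriction base="xsd:positiveInteger"/>
  </xsd:simpleType>

  <!-- D: derived from B by restriction -->
  <xsd:simpleType name="smallQuantityType">
    <xsd:restriction base="quantityType">
      <xsd:maxExclusive value="100"/>
    </xsd:restriction>
  </xsd:simpleType>

  <xsd:element name="quantity" type="quantityType"/>

  <!-- In an instance, with the xsi and xsd prefixes declared:
         <quantity xsi:type="smallQuantityType">42</quantity>   accepted; checked against D
         <quantity xsi:type="smallQuantityType">500</quantity>  invalid; not below 100
         <quantity xsi:type="xsd:string">ten</quantity>         invalid; xsd:string is not derived from B
  -->
</xsd:schema>
```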

VI.7. What is an atomic?

Extensional view:
  • a set of values V
  • a set of lexical forms L
  • a mapping from L to V

VI.8. What is an atomic? (take 2)

Intensional view:
  • a base mapping L → V
  • a set of fundamental facets:
    • equality (identity)
    • order (partial, total, none)
    • boundedness
    • cardinality
    • numeric
  • a set of constraining facets:
    • length, minLength, maxLength
    • pattern (constrains lexical space)
    • enumeration
    • whiteSpace
    • maxInclusive, maxExclusive, minInclusive, minExclusive
    • totalDigits, fractionDigits
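The split between lexical space and value space shows up as soon as a value-space facet meets several spellings of one value. A sketch, with a hypothetical type name:

```xml
<xsd:schema xmlns:xsd="http://www.w3.org/2001/XMLSchema">
  <!-- enumeration constrains the value space, not the spelling -->
  <xsd:simpleType name="unitPrice">
    <xsd:restriction base="xsd:decimal">
      <xsd:enumeration value="1.0"/>
    </xsd:restriction>
  </xsd:simpleType>

  <xsd:element name="price" type="unitPrice"/>

  <!-- <price>1.0</price>, <price>1.00</price>, <price>+1</price>:
         three lexical forms mapping to one decimal value; all accepted.
       <price>2</price>: a different value; rejected. -->
</xsd:schema>
```

Had the base type been xsd:string, only the literal 1.0 would be accepted, because for strings the mapping from L to V is the identity.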

VI.9. Derivation of simple types

Simple types can be derived by restricting a facet:
<xsd:simpleType>
  <xsd:restriction 
       base="xsd:positiveInteger">
    <xsd:maxExclusive value="100"/>
  </xsd:restriction>
</xsd:simpleType>
Most facets directly control the value space (and the lexical space indirectly).

VI.10. Derivation of simple types (2)

Some facets control the lexical space directly (and the value space indirectly):
<xsd:simpleType name="SKU">
  <xsd:restriction base="xsd:string">
    <xsd:pattern value="\d{3}-[A-Z]{2}"/>
  </xsd:restriction>
</xsd:simpleType>

VI.11. Regular expressions for the pattern facet

The regular expressions for the pattern facet are mostly conventional:
  • concatenation: ab
  • alternation: a | b
  • repetition: a*b+c?
  • character classes: [a-zA-Z], [^aeiou]
  • single-character escapes: \n, \r, etc.
...

VI.12. Regular expressions (2)

... with some extensions:
  • numeric exponents: a{1,5}
  • class subtraction: [a-zA-Z-[aeiou]]
  • Unicode-property classes: \p{Lu} (characters with property Lu, i.e. upper-case letters), \P{Lu} (negation: characters lacking property Lu)
  • Unicode-block classes: \p{IsBasicLatin} (characters in the Basic Latin block), \P{IsBasicLatin} (negation: characters outside that block)
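A sketch of two of these extensions inside pattern facets; the type and element names are hypothetical.

```xml
<xsd:schema xmlns:xsd="http://www.w3.org/2001/XMLSchema">
  <!-- Two to four upper-case ASCII consonants: class subtraction plus a numeric exponent -->
  <xsd:simpleType name="consonantCode">
    <xsd:restriction base="xsd:string">
      <xsd:pattern value="[A-Z-[AEIOU]]{2,4}"/>
    </xsd:restriction>
  </xsd:simpleType>

  <!-- One upper-case letter followed by lower-case letters, in any script -->
  <xsd:simpleType name="capitalizedWord">
    <xsd:restriction base="xsd:string">
      <xsd:pattern value="\p{Lu}\p{Ll}*"/>
    </xsd:restriction>
  </xsd:simpleType>

  <xsd:element name="code" type="consonantCode"/>

  <!-- <code>BCD</code> matches; <code>BAD</code> (vowel) and <code>BCDFG</code> (five letters) do not.
       A pattern must match the whole lexical form, so no ^ or $ anchors are written. -->
</xsd:schema>
```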

VI.13. Enumerations

Enumerations can be used to specify a list of legal values:
 <xsd:simpleType name="width-keywords">
  <xsd:restriction base="xsd:string">
   <xsd:enumeration value="full"/>
   <xsd:enumeration value="half"/>
   <xsd:enumeration value="none"/>
   <xsd:enumeration value="default"/>
  </xsd:restriction>
 </xsd:simpleType>

VI.14. Non-atomic simple types

  • list (white-space delimited)
  • unions (ordered)

VI.15. Creating a list of dates

Lists can be created by restricting anySimpleType:
<xsd:simpleType name="listofdates">
  <xsd:list itemType="xsd:date"/>
</xsd:simpleType>
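A sketch of the list type in use, and of restricting it further. The names dateRange, holidays and stay are hypothetical.

```xml
<xsd:schema xmlns:xsd="http://www.w3.org/2001/XMLSchema">
  <xsd:simpleType name="listofdates">
    <xsd:list itemType="xsd:date"/>
  </xsd:simpleType>

  <!-- On a list type, the length facets count items, not characters -->
  <xsd:simpleType name="dateRange">
    <xsd:restriction base="listofdates">
      <xsd:length value="2"/>
    </xsd:restriction>
  </xsd:simpleType>

  <xsd:element name="holidays" type="listofdates"/>
  <xsd:element name="stay" type="dateRange"/>

  <!-- <holidays>2003-12-25 2003-12-26 2004-01-01</holidays>   three items
       <stay>2003-08-18 2003-08-22</stay>                      exactly two items, as required -->
</xsd:schema>
```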

VI.16. Creating a union

Unions are similarly restrictions of anySimpleType:
 <xsd:simpleType name="widthType">
  <xsd:union 
   memberTypes
    ="width-keywords xsd:positiveInteger">
  </xsd:union>
 </xsd:simpleType>
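A self-contained sketch of the union in use, with a shortened keyword list and a hypothetical width element:

```xml
<xsd:schema xmlns:xsd="http://www.w3.org/2001/XMLSchema">
  <xsd:simpleType name="width-keywords">
    <xsd:restriction base="xsd:string">
      <xsd:enumeration value="full"/>
      <xsd:enumeration value="half"/>
    </xsd:restriction>
  </xsd:simpleType>

  <!-- Member types are tried in the order listed -->
  <xsd:simpleType name="widthType">
    <xsd:union memberTypes="width-keywords xsd:positiveInteger"/>
  </xsd:simpleType>

  <xsd:element name="width" type="widthType"/>

  <!-- <width>half</width>   validates against the first member
       <width>80</width>     fails the first member, validates against the second
       <width>wide</width>   matches neither; invalid -->
</xsd:schema>
```

Which member actually validated a given value is what the [member type definition] property of the PSVI records.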

VI.17. Examples

  • numbers: decimal, integer, positive integer
  • strings: patterns
  • dates and times: minima and maxima, date format
  • binary data (hex, base 64)
  • lists, unions
[Switch to emacs.]

VII. Complex types

VII.1. Content models

Productions in the document grammar:
  • regular expression-like
  • primitive tokens are elements (recognized by name)
  • sequence, choice, all
  • numeric occurrence indicators
  • determinism rule

VII.2. Content model: example

<xsd:sequence>
 <xsd:element name="name"   type="xsd:string"/>
 <xsd:element name="street" type="xsd:string"/>
 <xsd:element name="city"   type="xsd:string"/>
 <xsd:element name="state"  type="xsd:string"/>
 <xsd:element name="zip"    type="xsd:decimal"/>
</xsd:sequence>

VII.3. Content model: example

Let's allow ourselves three kinds of customers:
<xsd:choice>
 <xsd:element name="indiv" type="po:person"/>
 <xsd:element name="corp" type="po:organization"/>
 <xsd:element name="internal" type="po:dept"/>
</xsd:choice>

VII.4. Content model: example

Mixing choice and sequence:
<xsd:sequence>
 <xsd:choice>
  <xsd:element name="customer" type="po:USAddress"/>
  <xsd:sequence>
   <xsd:element name="shipTo" type="po:USAddress"/>
   <xsd:element name="billTo" type="po:USAddress"/>
  </xsd:sequence>
 </xsd:choice>
 <xsd:element ref="po:comment" minOccurs="0"/>
 <xsd:element name="items"  type="po:Items"/>
</xsd:sequence>

VII.5. Attributes

<xsd:attribute name="orderDate" 
               type="xsd:date"/>
  • use = (prohibited | optional | required) (default is optional)
  • form = (qualified | unqualified) (default is set on schema element)
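A sketch of attribute declarations inside a complex type. The currency attribute and its default are made up for illustration.

```xml
<xsd:schema xmlns:xsd="http://www.w3.org/2001/XMLSchema">
  <xsd:element name="purchaseOrder">
    <xsd:complexType>
      <!-- Attribute declarations follow the content model -->
      <xsd:sequence/>
      <!-- Must appear on every purchaseOrder -->
      <xsd:attribute name="orderDate" type="xsd:date" use="required"/>
      <!-- Optional (the default use); the validator supplies USD when it is absent -->
      <xsd:attribute name="currency" type="xsd:NMTOKEN" default="USD"/>
    </xsd:complexType>
  </xsd:element>
</xsd:schema>
```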

VII.6. Global and local declarations

Elements can be
  • global (top-level) to a namespace, or
  • local to a complex type
Types can be
  • named / global (top-level) to a namespace, or
  • anonymous / local to an element or attribute

VII.7. Wildcards

Two kinds:
  • element wildcard: xsd:any
  • attribute wildcard: xsd:anyAttribute
Two parameters:
  • processContents = (strict | lax | skip)
  • namespace = (##any | ##other | ##targetNamespace | ##local | namespace URI)

VII.8. A schema for xsi:type usage

Some people think the use of xsi:type for simple types is the easiest way to edge into schema usage. Here's one way:
 <xsd:element name="mydoc">
  <xsd:complexType>
   <xsd:sequence>
    <xsd:any 
     namespace="##any" 
     processContents="lax" 
     minOccurs="0" 
     maxOccurs="unbounded">
    </xsd:any>
   </xsd:sequence>
  </xsd:complexType>
 </xsd:element>

VII.9. Examples

  • element declarations
  • extension
  • restriction
  • complex types with simple content
  • wildcards

VIII. Post-schema-validation infoset

VIII.1. Post-schema-validation infoset (PSVI)

XML-Schema validation: infoset → infoset.
  • additions, no changes
  • type assignment information
  • validation-attempted information (full, partial, none)
  • validation-outcome information

VIII.2. Infoset contributions

  • type information:
    • [type definition] (or: [type definition name], [type definition namespace], [type definition type], and [type definition anonymous])
    • [member type definition] if needed
    • [attribute declaration] or [element declaration]
  • default values:
    • [schema default] = default / fixed value
    • [schema specified] = infoset or schema
  • white-space processing: [schema normalized value]
  • validity:
    • [validity] = valid or invalid or notKnown
    • [validation context] = element where validation started
    • [validation attempted] = full or partial or none
    • [schema error code] if needed

VIII.3. Validation outcomes

There are six outcomes, not two. By [validation attempted], then by [validity]:
  • full
    • valid: OK. Entire subtree valid.
    • invalid: OK. Entire subtree assessed; error here or at some descendant.
    • notKnown: Not possible (contradictory).
  • partial
    • valid: OK. This node assessed and valid. Some descendant skipped.
    • invalid: OK. Problem at this node, or in a child. Also, some descendant skipped.
    • notKnown: OK. This node not assessed (but a descendant was).
  • none (subtree skipped)
    • valid: Not possible (contradictory).
    • invalid: Not possible (contradictory).
    • notKnown: OK. This subtree was skipped.

VIII.4. Reflecting / serializing the PSVI

In principle, PSVI is abstract.
In practice, exposed
  • through API
  • through XML serialization
    • additional attributes on input XML
    • various normal-form reflections of PSVI graph

VIII.5. The ‘alternating form’ PSVI

For example, part of a PSVI for a purchase order:
<document xmlns:p="http://www.w3.org/2001/05/PSVInfosetExtension"
          xmlns="http://www.w3.org/2001/05/XMLInfoset"
          xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance">
  <children>
    <element id="g1">
      <namespaceName>http://www.example.com/PO1</namespaceName>
      <localName>purchaseOrder</localName>
      <prefix xsi:nil="true"/>
      <children>
        <character>
          <characterCode>10</characterCode>
          <elementContentWhitespace>true</elementContentWhitespace>
        </character>
        <character>
          <characterCode>32</characterCode>
          <elementContentWhitespace>true</elementContentWhitespace>
        </character>
        <element>
          <namespaceName>http://www.example.com/PO1</namespaceName>
          <localName>shipTo</localName>
          <prefix xsi:nil="true"/>
          <children> 
            ...
         
[Examples as output from XSV and Xerces]

IX. Some questions of usage

IX.1. Type derivation (inheritance)

It turns out to be hard to model stepwise refinement of types:
  • restriction (preserves subset semantics)
  • extension (preserves prefix semantics)
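Both kinds of derivation can be sketched on one base type. The type names and the careOf element are hypothetical; the point is the shape of the two derived content models.

```xml
<xsd:schema xmlns:xsd="http://www.w3.org/2001/XMLSchema">
  <xsd:complexType name="Address">
    <xsd:sequence>
      <xsd:element name="name"   type="xsd:string"/>
      <xsd:element name="careOf" type="xsd:string" minOccurs="0"/>
      <xsd:element name="street" type="xsd:string"/>
      <xsd:element name="city"   type="xsd:string"/>
    </xsd:sequence>
  </xsd:complexType>

  <!-- Extension: the base content model is a prefix of the derived one -->
  <xsd:complexType name="USAddress">
    <xsd:complexContent>
      <xsd:extension base="Address">
        <xsd:sequence>
          <xsd:element name="state" type="xsd:string"/>
          <xsd:element name="zip"   type="xsd:decimal"/>
        </xsd:sequence>
      </xsd:extension>
    </xsd:complexContent>
  </xsd:complexType>

  <!-- Restriction: the content model is restated, here without the optional careOf;
       everything valid against DirectAddress is also valid against Address -->
  <xsd:complexType name="DirectAddress">
    <xsd:complexContent>
      <xsd:restriction base="Address">
        <xsd:sequence>
          <xsd:element name="name"   type="xsd:string"/>
          <xsd:element name="street" type="xsd:string"/>
          <xsd:element name="city"   type="xsd:string"/>
        </xsd:sequence>
      </xsd:restriction>
    </xsd:complexContent>
  </xsd:complexType>
</xsd:schema>
```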

IX.2. Inheritance in document systems

Existing document systems turn out to have a very different model of class systems and inheritance.
  • inheritance of attributes
  • inheritance of locations
XML Schema models these with
  • inheritance (by extension or restriction)
  • substitution groups
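A sketch of a substitution group, which is the closer analogue of inheritance of locations: wherever the head element is allowed, its members are allowed too. The element names are hypothetical.

```xml
<xsd:schema xmlns:xsd="http://www.w3.org/2001/XMLSchema">
  <!-- Head of the group -->
  <xsd:element name="comment" type="xsd:string"/>

  <!-- Members: may appear wherever comment is referenced -->
  <xsd:element name="shipComment"     type="xsd:string" substitutionGroup="comment"/>
  <xsd:element name="customerComment" type="xsd:string" substitutionGroup="comment"/>

  <xsd:element name="order">
    <xsd:complexType>
      <xsd:sequence>
        <!-- Accepts comment, shipComment or customerComment, in any mixture -->
        <xsd:element ref="comment" minOccurs="0" maxOccurs="unbounded"/>
      </xsd:sequence>
    </xsd:complexType>
  </xsd:element>
</xsd:schema>
```

A later schema document can add further members without touching the declaration of order.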

IX.3. Schemas and namespaces

Some (unpleasant) facts of life:
  • Namespaces are not incompatible with document grammars
  • — but they don't play well with DTDs.
  • Namespaces allow us to distinguish mine from not-mine.
  • Namespaces do not provide universal names.
  • The namespace : language relation is 1:n.
  • The language : grammar relation is 1:n.
  • Therefore, the namespace : schema relation is 1:n.
Live with it.

IX.4. Schema layers

We distinguish:
  • schema documents (with single target namespace)
  • schemas (sets of abstract components)
Schema composition operations:
  • import
  • include
  • include with override / redefine
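A sketch of the first two composition operations. The file names and the address namespace are assumptions made up for the example.

```xml
<xsd:schema xmlns:xsd="http://www.w3.org/2001/XMLSchema"
            xmlns:po="http://www.example.com/PO1"
            xmlns:addr="http://www.example.com/address"
            targetNamespace="http://www.example.com/PO1">

  <!-- include: more components for the same target namespace -->
  <xsd:include schemaLocation="po-items.xsd"/>

  <!-- import: components from a different namespace; schemaLocation is only a hint -->
  <xsd:import namespace="http://www.example.com/address"
              schemaLocation="address.xsd"/>

  <!-- Imported components are referred to by qualified name -->
  <xsd:element name="shipTo" type="addr:USAddress"/>
</xsd:schema>
```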

IX.5. Modularization

XML Schema makes it possible to write modular document type definitions:
  • late collection of schema components
  • namespace-aware name matching, validation
  • white-box wildcards (lax / opportunistic)
  • black-box wildcards (skip)

IX.6. Modularizing vocabularies: tasks

The basic requirements for defining modules:
  • control over exposing and hiding
  • a way to refer to items in different modules
  • a way to say “anything from another module goes here”
  • *a way to allow the integrator to say “these specific things from other modules go there”

IX.7. Modularizing vocabularies: techniques

The corresponding techniques in XML Schema:
  • expose by making top-level; hide by making local
  • refer to items using namespaces and qualified names
  • use wildcards to allow unrestricted insertion
  • use substitution groups to allow integrators / extenders to allow specific items to go specific places

IX.8. The tag/type distinction and non-local effects

Consider the HTML input element:
  • legal only in p and similar elements
  • legal only within form elements
SGML DTDs have partial solutions:
  • inclusion exceptions
  • content models

IX.9. Non-local effects in XML Schema

Fundamentally, we trade verbosity for context-sensitivity:
 <xsd:element name="div" type="div-type"/>
 <xsd:element name="div" type="div-in-form-type"/>

 <xsd:element name="p" type="p-type"/>
 <xsd:element name="p" type="p-in-form-type"/>

 <xsd:element name="ul" type="ul-type"/>
 <xsd:element name="ul" type="ul-in-form-type"/>

 <xsd:element name="li" type="li-type"/>
 <xsd:element name="li" type="li-in-form-type"/>
One bit of context information = double the size of grammar.
Cf. van Wijngaarden grammars (infinite size, unlimited context sensitivity).
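In a schema document the paired declarations are local ones, each sitting in a different content model. A much-reduced sketch, with hypothetical type definitions, in which input is reachable only through a p inside form:

```xml
<xsd:schema xmlns:xsd="http://www.w3.org/2001/XMLSchema">
  <!-- p outside a form: text only -->
  <xsd:complexType name="p-type" mixed="true"/>

  <!-- p inside a form: text plus input elements -->
  <xsd:complexType name="p-in-form-type" mixed="true">
    <xsd:sequence>
      <xsd:element name="input" minOccurs="0" maxOccurs="unbounded">
        <xsd:complexType>
          <xsd:attribute name="name" type="xsd:NMTOKEN"/>
        </xsd:complexType>
      </xsd:element>
    </xsd:sequence>
  </xsd:complexType>

  <xsd:element name="body">
    <xsd:complexType>
      <xsd:choice minOccurs="0" maxOccurs="unbounded">
        <!-- Same tag, two types: the binding depends on the enclosing type -->
        <xsd:element name="p" type="p-type"/>
        <xsd:element name="form">
          <xsd:complexType>
            <xsd:sequence>
              <xsd:element name="p" type="p-in-form-type" maxOccurs="unbounded"/>
            </xsd:sequence>
          </xsd:complexType>
        </xsd:element>
      </xsd:choice>
    </xsd:complexType>
  </xsd:element>
</xsd:schema>
```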

IX.10. Determinism

The determinism rule remains controversial:
  • LL(1) guarantees may help implementors
  • All regular languages have a deterministic FSA;
  • ... but not necessarily a deterministic regular expression!
  • Implications for closure under union, intersection.
  • Implications for subsumption tests.
  • Implications for interoperability, single-pass processing.
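A sketch of what the rule (Unique Particle Attribution, in the words of the spec) forbids, and the usual repair. The element names are hypothetical; the rejected model is kept inside a comment so that the schema document itself stays legal.

```xml
<xsd:schema xmlns:xsd="http://www.w3.org/2001/XMLSchema">
  <!-- Not allowed: both branches start with a, so on reading an a element
       the processor cannot tell which declaration applies without lookahead.

         <xsd:choice>
           <xsd:sequence>
             <xsd:element name="a" type="xsd:string"/>
             <xsd:element name="b" type="xsd:string"/>
           </xsd:sequence>
           <xsd:sequence>
             <xsd:element name="a" type="xsd:string"/>
             <xsd:element name="c" type="xsd:string"/>
           </xsd:sequence>
         </xsd:choice>
  -->

  <!-- Same language, deterministic: factor out the common first element -->
  <xsd:element name="pair">
    <xsd:complexType>
      <xsd:sequence>
        <xsd:element name="a" type="xsd:string"/>
        <xsd:choice>
          <xsd:element name="b" type="xsd:string"/>
          <xsd:element name="c" type="xsd:string"/>
        </xsd:choice>
      </xsd:sequence>
    </xsd:complexType>
  </xsd:element>
</xsd:schema>
```

Here the rewrite is easy; the third bullet above says that it is not always possible.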

IX.11. Practical issues

  • XML notation*
  • Linking document and schema
    • namespace name
    • schemaLocation hint
  • Hooks for schema annotation: the annotation element
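A sketch of the annotation hook: documentation is for human readers, appinfo for software. The hints namespace and the render element are made up for the example.

```xml
<xsd:schema xmlns:xsd="http://www.w3.org/2001/XMLSchema">
  <xsd:element name="line" type="xsd:string">
    <xsd:annotation>
      <!-- For people -->
      <xsd:documentation xml:lang="en">One metrical line of a poem.</xsd:documentation>
      <!-- For programs: any well-formed XML, ignored by validation -->
      <xsd:appinfo>
        <render xmlns="http://www.example.com/hints" break="after"/>
      </xsd:appinfo>
    </xsd:annotation>
  </xsd:element>
</xsd:schema>
```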

X. Review and conclusion

X.1. Why schema languages?

  • documentation
  • contract
  • firewall

X.2. Fundamental ideas of XML Schema 1.0

  • conventional data typing as in programming languages and database management systems
  • systematic separation of tags and types
  • capture inheritance
  • support multiple namespace schemas, late integration (wildcards, local/top-level, substitution groups, xsi:type)
  • schemas are not data streams

X.3. Simple types

  • alignment with programming languages and database management systems
  • atomic values
  • lexical spaces
  • limited lists and unions

X.4. Complex types

  • match DTD content models and attributes
  • extension and restriction
  • some alignment with OO systems
  • wildcards

X.5. Deployment

  • the schemaLocation attribute is a hint, not a directive
Pretty much all else follows from that.

Notes

[1] Well, to be pedantic, ISO 8879 defines document type definition as “Rules, determined by an application, that apply SGML to the markup of documents of a particular type.”
[2] Well, strictly speaking, when Xerces is invoked as shown the PSVI dump does not have the necessary namespace declarations. I use sed to fix this before invoking XSLT: java sax.Writer -v -s -f -p xni.parser.PSVIParser file.xml | sed -e "s|<document>|<document xmlns='http://www.w3.org/2001/05/XMLInfoset' xmlns:psv='http://www.w3.org/2001/05/PSVInfosetExtension' xmlns:xsi='http://www.w3.org/2001/XMLSchema-instance'>|" > psvi.out.xerces.xml