A gentle introduction to XML Schema and XML Document Grammars


Unpublished.

Created in electronic form. Some parts are transcribed from earlier notes on loose sheets.

Universitetet i Bergen, HIT-senteret
C. M. Sperberg-McQueen
26 October 2001

Abstract

In the historical development of markup languages, few innovations have been more important than the introduction of the notion of document grammars for constraining documents and defining document types. Document grammars provide a simple, easily understood method of specifying rules for the validity of XML documents. By helping keep data clean, they make it easier to write simpler, more reliable software.

Both SGML (ISO 8879) and XML 1.0 define a specialized notation (the DTD) for defining document grammars; more recently a number of alternative languages have been proposed. The W3C XML Schema language replicates the essential functionality of DTDs, and adds a number of features: the use of XML instance syntax rather than an ad hoc notation, clear relationships between schemas and namespaces, a systematic distinction between element types and data types, and a single-inheritance form of type derivation.

This presentation will outline some of the fundamental features of document grammars for XML, and at the same time introduce the basics of the W3C XML Schema 1.0 language. Some fundamental design issues of schema languages will be discussed, together with the choices made by the W3C group in defining XML Schema.

Overview

  the idea of document grammars
  DTDs as document grammars (example)
  XML Schema: replicating DTDs
  XML Schema: types
  another example
  design issues and research questions

The idea of document grammars

Some models of text

Rectangular:
    text   ::= record
    record ::= CHAR*    // or CHAR{80}
(Card images)

Linear:
    text ::= CHAR*
(Stream editors)

Tree-shaped models

Unlabeled tree:
    text ::= node*
    node ::= node* | CHAR*
(Engelbart's NL/Augment)

Fixed-depth, fixed-label tree:
    text ::= page*
    page ::= line*
    line ::= CHAR*
(Tustep)

Models of marked-up text

ODTAO (one damn thing after another):
    text ::= (CHAR | COMMAND)*
(Many, many programs.) This only looks regular.

Virtual variables:
    command ::= '<' CHAR value '>'
(COCOA, OCP, Tact)

Pseudo-grammars

Three variants:
    command ::= '\begin{' NAME '}' | '\end{' NAME '}'
    command ::= ':' NAME '.'? | ':e' NAME '.'?
    command ::= '<' NAME attributes '>' | '</' NAME '>'
(Scribe, GML, LaTeX)

Document grammars (DGs)

Origin pragmatic, not theoretical. Later aligned (partially) with language theory.

Formal specification of rules → automated validation. (Cf. Algol vs. Fortran.)

Document type definition (DTD) is the set of rules .... N.B. DTD ≠ set of effective formal declarations.

The uses of DGs

Document grammars may have several uses:
  in the struggle against dirty data
  as documentation of a contract between data provider and data consumer
  as documentation of the content of data flows
  as specification of client/server protocols

An example: DTDs as document grammars

DTDs as DGs

DTDs resemble Backus-Naur Form grammars, but:
  They describe bracketed languages ...
  ... so non-terminals are visible.
  SGML allows inclusion and exclusion exceptions (Rizzi: NP-complete parsing problem for non-bracketed L).
  They are not purely grammatical (notations, entities).
  Determinism rule (LL(1) requirement).

Example: limericks

Consider two kinds of poem. The limerick:

  There was a young lady named Bright
  whose speed was much faster than light.
  She set out one day,
  in a relative way,
  and returned on the previous night.

... and canzone:

  Under der linden
  an der heide,
  dâ unser zweier bette was,
  dâ muget ir vinden
  schône beide
  gebrochen bluomen unde gras.
  vor dem walde in einem tal,
  tandaradei,
  schône sanc diu nahtegal.

A document grammar

Limericks and canzone:
    poem      ::= limerick | canzone
    limerick  ::= trimeter trimeter dimeter dimeter trimeter
    trimeter  ::= CHAR+
    dimeter   ::= CHAR+
    canzone   ::= aufgesang abgesang
    aufgesang ::= stollen stollen
    stollen   ::= line+
    abgesang  ::= line+

A DTD

Limericks and canzone: [DTD listing not preserved]
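As a rough sketch, the grammar for limericks and canzone could be transcribed into DTD notation as follows. This is a reconstruction from the grammar rules, not the original listing. The element names simply reuse the non-terminal names, the declaration for line is an assumption (the grammar leaves it undefined), and the internal-subset packaging is chosen only to keep the example self-contained.

```xml
<?xml version="1.0"?>
<!DOCTYPE poem [
  <!-- Each non-terminal of the grammar becomes an element type. -->
  <!ELEMENT poem      (limerick | canzone)>
  <!ELEMENT limerick  (trimeter, trimeter, dimeter, dimeter, trimeter)>
  <!-- CHAR+ is approximated by #PCDATA, which also allows empty content. -->
  <!ELEMENT trimeter  (#PCDATA)>
  <!ELEMENT dimeter   (#PCDATA)>
  <!ELEMENT canzone   (aufgesang, abgesang)>
  <!ELEMENT aufgesang (stollen, stollen)>
  <!ELEMENT stollen   (line+)>
  <!ELEMENT abgesang  (line+)>
  <!-- Assumption: a line is plain character data. -->
  <!ELEMENT line      (#PCDATA)>
]>
<poem>
  <limerick>
    <trimeter>There was a young lady named Bright</trimeter>
    <trimeter>whose speed was much faster than light.</trimeter>
    <dimeter>She set out one day,</dimeter>
    <dimeter>in a relative way,</dimeter>
    <trimeter>and returned on the previous night.</trimeter>
  </limerick>
</poem>
```

Because the language is bracketed, every non-terminal on the left-hand side of a rule appears as a start-tag and end-tag pair in the instance.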

A limerick

[tagged example; markup not preserved]

  There was a young lady named Bright
  whose speed was much faster than light.
  She set out one day,
  in a relative way,
  and returned on the previous night.

[tagged example; markup not preserved]

  Walther
  unter den linden
  an der heide
  da unser zweier bette was
  da mugt ir vinden
  schone beide
  gebrochen bluomen unde gras
  kuste er mich? wol tusentstunt
  tandaradei
  seht wie rot mir ist der munt

Note on the poem DTD

  All the non-terminals show up as tags.
  The trimeter and dimeter lines should scan with 3 and 2 dactyls respectively; this rule is not expressed.
  The two Stollen must have the same number of lines; this rule is not expressed.
  The Abgesang must have more lines than a Stollen, fewer than the Aufgesang; this rule is not expressed.
  No grammar detects the errors in the previous example.

Removing non-terminals

[DTD listing not preserved]

This allows the DTD to record our understanding. But can anyone use that understanding?

The canzone minus explicit Aufgesang

[tagged example; markup not preserved]

  unter den linden
  an der heide
  da unser zweier bette was
  da mugt ir vinden
  schone beide
  gebrochen bluomen unde gras
  kuste er mich? wol tusentstunt
  tandaradei
  seht wie rot mir ist der munt

XML Schema and DTD functionality

XML Schema

  DTD++ (inheritance, real data types)
  DTD-- (no entities)
  instance syntax
  supporting programming-language and database-oriented types
  design problems

The canzone schema v.1

In version 1 of this schema, we imitate the DTD slavishly.

At the outer level is a schema element: [listing not preserved]

N.B. the schema does not identify a document-root element / start symbol.

Declaring elements

[listing not preserved]

  Note difference between element declaration (outer) and element reference (inner).
  Implicit occurrence information: min = max = 1.

Repeated elements

[listing not preserved]

Character data

[listing not preserved] or [listing not preserved]
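Putting the pieces of version 1 together, a schema that mirrors the poem grammar with one declaration per rule might look roughly like this. It is a sketch, not the original listing. It assumes no target namespace and the conventional xsd prefix, and it shows both ways of allowing character data (a string-typed element and a mixed complex type with no children).

```xml
<?xml version="1.0"?>
<xsd:schema xmlns:xsd="http://www.w3.org/2001/XMLSchema">

  <!-- Outer element = declaration; inner elements = references. -->
  <xsd:element name="poem">
    <xsd:complexType>
      <xsd:choice>
        <xsd:element ref="limerick"/>
        <xsd:element ref="canzone"/>
      </xsd:choice>
    </xsd:complexType>
  </xsd:element>

  <xsd:element name="limerick">
    <xsd:complexType>
      <xsd:sequence>
        <!-- Repetition is stated with explicit occurrence counts. -->
        <xsd:element ref="trimeter" minOccurs="2" maxOccurs="2"/>
        <xsd:element ref="dimeter"  minOccurs="2" maxOccurs="2"/>
        <!-- No occurrence attributes: min = max = 1. -->
        <xsd:element ref="trimeter"/>
      </xsd:sequence>
    </xsd:complexType>
  </xsd:element>

  <xsd:element name="canzone">
    <xsd:complexType>
      <xsd:sequence>
        <xsd:element ref="aufgesang"/>
        <xsd:element ref="abgesang"/>
      </xsd:sequence>
    </xsd:complexType>
  </xsd:element>

  <xsd:element name="aufgesang">
    <xsd:complexType>
      <xsd:sequence>
        <xsd:element ref="stollen" minOccurs="2" maxOccurs="2"/>
      </xsd:sequence>
    </xsd:complexType>
  </xsd:element>

  <xsd:element name="stollen">
    <xsd:complexType>
      <xsd:sequence>
        <xsd:element ref="line" maxOccurs="unbounded"/>
      </xsd:sequence>
    </xsd:complexType>
  </xsd:element>

  <xsd:element name="abgesang">
    <xsd:complexType>
      <xsd:sequence>
        <xsd:element ref="line" maxOccurs="unbounded"/>
      </xsd:sequence>
    </xsd:complexType>
  </xsd:element>

  <!-- Character data, first way: a built-in simple type. -->
  <xsd:element name="trimeter" type="xsd:string"/>
  <xsd:element name="dimeter"  type="xsd:string"/>

  <!-- Character data, second way: mixed content with no child elements. -->
  <xsd:element name="line">
    <xsd:complexType mixed="true"/>
  </xsd:element>

</xsd:schema>
```

Any of the top-level element declarations may serve as the root of a valid document, which is why the schema is said not to identify a start symbol.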

Supporting PL/DBMS type notions

Supporting programming-language and DBMS paradigms

  tag/type distinction
  named and anonymous datatypes
  simple datatypes

The tag/type distinction

In programming languages we write: [example not preserved] (Kernighan and Ritchie, The C Programming Language, chapter 6, Structures.)

Distinguish the name used to access the field from its type.

The tag/type distinction

Similarly, in conventional DTDs we write: [example not preserved]

Distinguish the element type name from the name of the standard content-model.

The tag/type distinction

In XML Schema we sometimes write: [example not preserved]

Can we do that for every element type?

N.B. four kinds of type:
  element type (vs. element, element instance)
  data type
  simple type (lexical form has no markup)
  complex type (has element children)

Top-level named types

Named types can be used to capture commonalities: [listing not preserved]

Top-level complex types

... or just to provide a name for a type: [listing not preserved]
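A minimal sketch of both uses of top-level named types, based on the canzone grammar. The type names lines and aufgesangType are hypothetical, chosen here for illustration. The first type captures what stollen and abgesang have in common; the second is used only once and merely gives the content model a name.

```xml
<?xml version="1.0"?>
<xsd:schema xmlns:xsd="http://www.w3.org/2001/XMLSchema">

  <!-- A named type shared by two element declarations. -->
  <xsd:complexType name="lines">
    <xsd:sequence>
      <xsd:element name="line" type="xsd:string" maxOccurs="unbounded"/>
    </xsd:sequence>
  </xsd:complexType>

  <xsd:element name="stollen"  type="lines"/>
  <xsd:element name="abgesang" type="lines"/>

  <!-- A named type used by one element only; the name is the whole point. -->
  <xsd:complexType name="aufgesangType">
    <xsd:sequence>
      <xsd:element ref="stollen" minOccurs="2" maxOccurs="2"/>
    </xsd:sequence>
  </xsd:complexType>

  <xsd:element name="aufgesang" type="aufgesangType"/>

</xsd:schema>
```

Here the element name (the tag) and the type name are separate identifiers, which is the tag/type distinction made explicit.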

Anonymous types

We can hide things using anonymous local types: [listing not preserved]

Note nested declarations and definitions.
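A sketch of the same canzone structure written with anonymous local types, assuming no target namespace. Only canzone is declared at the top level; everything else is declared inside the type definition that uses it, so declarations and definitions nest.

```xml
<?xml version="1.0"?>
<xsd:schema xmlns:xsd="http://www.w3.org/2001/XMLSchema">

  <xsd:element name="canzone">
    <xsd:complexType>                       <!-- anonymous type of canzone -->
      <xsd:sequence>
        <xsd:element name="aufgesang">      <!-- local declaration -->
          <xsd:complexType>
            <xsd:sequence>
              <xsd:element name="stollen" minOccurs="2" maxOccurs="2">
                <xsd:complexType>
                  <xsd:sequence>
                    <xsd:element name="line" type="xsd:string"
                                 maxOccurs="unbounded"/>
                  </xsd:sequence>
                </xsd:complexType>
              </xsd:element>
            </xsd:sequence>
          </xsd:complexType>
        </xsd:element>
        <xsd:element name="abgesang">
          <xsd:complexType>
            <xsd:sequence>
              <xsd:element name="line" type="xsd:string"
                           maxOccurs="unbounded"/>
            </xsd:sequence>
          </xsd:complexType>
        </xsd:element>
      </xsd:sequence>
    </xsd:complexType>
  </xsd:element>

</xsd:schema>
```

The price of hiding is that neither the local elements nor their types can be referred to from elsewhere in the schema.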

Simple datatypes

  built-in
    primitive
    derived
  user-defined

Built-in primitive datatypes

  string
  boolean
  decimal, float, double
  duration, dateTime, time, date, gYearMonth, gYear, gMonthDay, gDay, gMonth
  hexBinary, base64Binary
  anyURI
  QName
  NOTATION

Built-in derived datatypes

  normalizedString, token, language
  IDREFS, ENTITIES, NMTOKEN, NMTOKENS, Name, NCName, ID, IDREF, ENTITY
  integer, nonPositiveInteger, negativeInteger, long, int, short, byte, nonNegativeInteger, unsignedLong, unsignedInt, unsignedShort, unsignedByte, positiveInteger

Hierarchy of simple types

What is an atomic type?

Extensional:
  a set of values V
  a set of lexical forms L
  a mapping from L to V

What is an atomic type? (2)

Intensional:
  a base mapping L → V
  a set of fundamental facets:
    equality (identity)
    order (partial, total, none)
    boundedness
    cardinality
    numeric

What is an atomic type? (3)

Intension, cont'd
  a set of constraining facets:
    length, minLength, maxLength
    pattern (constrains lexical space)
    enumeration
    whiteSpace
    maxInclusive, maxExclusive, minInclusive, minExclusive
    totalDigits, fractionDigits
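To illustrate how constraining facets derive user-defined atomic types from built-in ones, here is a small sketch. The three type names and their particular limits are hypothetical, invented for this example.

```xml
<?xml version="1.0"?>
<xsd:schema xmlns:xsd="http://www.w3.org/2001/XMLSchema">

  <!-- Bounds facets narrow the value space of integer. -->
  <xsd:simpleType name="lineCount">
    <xsd:restriction base="xsd:integer">
      <xsd:minInclusive value="1"/>
      <xsd:maxInclusive value="99"/>
    </xsd:restriction>
  </xsd:simpleType>

  <!-- Enumeration lists the permitted values outright. -->
  <xsd:simpleType name="poemKind">
    <xsd:restriction base="xsd:token">
      <xsd:enumeration value="limerick"/>
      <xsd:enumeration value="canzone"/>
    </xsd:restriction>
  </xsd:simpleType>

  <!-- Pattern constrains the lexical space with a regular expression. -->
  <xsd:simpleType name="shelfMark">
    <xsd:restriction base="xsd:string">
      <xsd:pattern value="[A-Z]{2}[0-9]{3}"/>
    </xsd:restriction>
  </xsd:simpleType>

</xsd:schema>
```

Under these definitions "42" is a valid lineCount and "100" is not, and "AB123" is a valid shelfMark while "ab123" is not.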

Non-atomic simple types

  list (white-space delimited)
  unions (ordered)

Another example

An example

Consider an annotation language for a system which
  tokenizes characters into words
  lemmatizes the words
  analyses the tokens syntactically

Annotation language

The tagging might look like this: [tagged example; markup not preserved]

  Taggeren består av en preprosessor, ...

A DTD

With an XML DTD, we can manage part of the job: [DTD listing not preserved]

We cannot constrain features as we'd like.

A schema for annotation

[schema listing not preserved]

The type feature

[listing not preserved]

The type featureSet

A feature set is (for our purposes) a list of features, separated by white space. This schema type does not forbid duplicate items, though in practice they will not arise because the tagger doesn't produce them.

[listing not preserved]

The word element

[listing not preserved]
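The three pieces of the annotation schema (the feature type, the featureSet type, and the word element) might fit together roughly as follows. This is a sketch only: the feature values and the attribute names lemma and features are hypothetical, not the tagger's actual vocabulary.

```xml
<?xml version="1.0"?>
<xsd:schema xmlns:xsd="http://www.w3.org/2001/XMLSchema">

  <!-- An atomic type: one feature, drawn from a closed list (values assumed). -->
  <xsd:simpleType name="feature">
    <xsd:restriction base="xsd:NMTOKEN">
      <xsd:enumeration value="noun"/>
      <xsd:enumeration value="verb"/>
      <xsd:enumeration value="sg"/>
      <xsd:enumeration value="pl"/>
      <xsd:enumeration value="pres"/>
    </xsd:restriction>
  </xsd:simpleType>

  <!-- A list type: white-space-separated features; duplicates are not forbidden. -->
  <xsd:simpleType name="featureSet">
    <xsd:list itemType="feature"/>
  </xsd:simpleType>

  <!-- The word element: character data plus typed attributes. -->
  <xsd:element name="word">
    <xsd:complexType>
      <xsd:simpleContent>
        <xsd:extension base="xsd:string">
          <xsd:attribute name="lemma"    type="xsd:string"/>
          <xsd:attribute name="features" type="featureSet"/>
        </xsd:extension>
      </xsd:simpleContent>
    </xsd:complexType>
  </xsd:element>

</xsd:schema>
```

With these definitions an instance such as `<word lemma="bestå" features="verb pres">består</word>` would be valid, while a features value containing an unlisted token would not. A DTD can declare such an attribute only as NMTOKENS, which is the constraint it cannot express.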

Design problems and research questions

Some design problems

  Inheritance / type derivation
  Layering
  Schemas and namespaces
  Modularization
  Finding the schemas
  Information for downstream apps

Inheritance

Document systems turn out to have a very clear model of class systems and inheritance.
  inheritance of attributes
  inheritance of locations
  not inheritance of content models

Inheritance in TEI

In the TEI, for example, elements can inherit attributes: [listing not preserved]

Or location in content models: [listing not preserved]

~~Inheritance~~ Type derivation

By contrast, it turns out to be hard to model stepwise refinement of programming-language types:
  restriction (preserves subset semantics)
  extension (preserves prefix semantics)

Depends on point of view:
  content model as list of fields with accessors, defining a record type
  content model as right-hand side of a grammar rule, defining a language
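The two forms of derivation can be sketched as follows. The type names stanza, annotatedStanza and couplet are hypothetical. Extension appends content after the base content model, so the base content survives as a prefix; restriction restates the content model with tighter bounds, so every valid instance of the derived type is also valid against the base.

```xml
<?xml version="1.0"?>
<xsd:schema xmlns:xsd="http://www.w3.org/2001/XMLSchema">

  <!-- Base type: one or more lines. -->
  <xsd:complexType name="stanza">
    <xsd:sequence>
      <xsd:element name="line" type="xsd:string" maxOccurs="unbounded"/>
    </xsd:sequence>
  </xsd:complexType>

  <!-- Extension: the base content, followed by an optional note. -->
  <xsd:complexType name="annotatedStanza">
    <xsd:complexContent>
      <xsd:extension base="stanza">
        <xsd:sequence>
          <xsd:element name="note" type="xsd:string" minOccurs="0"/>
        </xsd:sequence>
      </xsd:extension>
    </xsd:complexContent>
  </xsd:complexType>

  <!-- Restriction: the same content model, narrowed to exactly two lines. -->
  <xsd:complexType name="couplet">
    <xsd:complexContent>
      <xsd:restriction base="stanza">
        <xsd:sequence>
          <xsd:element name="line" type="xsd:string"
                       minOccurs="2" maxOccurs="2"/>
        </xsd:sequence>
      </xsd:restriction>
    </xsd:complexContent>
  </xsd:complexType>

</xsd:schema>
```

Read as a record type, annotatedStanza adds a field; read as a grammar rule, couplet defines a sublanguage of stanza.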

Schema layers

We distinguish:
  schema documents (with single target namespace)
  schemas (sets of abstract components)

Schema composition operations:
  import
  include
  include with override / redefine
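A sketch of a schema document that assembles a schema from several documents. The namespace names and file names are hypothetical, and the two referenced documents are assumed to exist and to declare poem and word respectively. The third operation, redefine, uses the same schemaLocation mechanism as include but wraps modified definitions.

```xml
<?xml version="1.0"?>
<xsd:schema xmlns:xsd="http://www.w3.org/2001/XMLSchema"
            xmlns:p="http://example.org/poem"
            xmlns:a="http://example.org/annotation"
            targetNamespace="http://example.org/poem"
            elementFormDefault="qualified">

  <!-- include: components from another document with the SAME target namespace -->
  <xsd:include schemaLocation="poem-types.xsd"/>

  <!-- import: components belonging to a DIFFERENT namespace -->
  <xsd:import namespace="http://example.org/annotation"
              schemaLocation="annotation.xsd"/>

  <xsd:element name="anthology">
    <xsd:complexType>
      <xsd:sequence>
        <!-- p:poem is assumed to be declared in poem-types.xsd -->
        <xsd:element ref="p:poem" maxOccurs="unbounded"/>
        <!-- a:word is assumed to be declared in annotation.xsd -->
        <xsd:element ref="a:word" minOccurs="0" maxOccurs="unbounded"/>
      </xsd:sequence>
    </xsd:complexType>
  </xsd:element>

</xsd:schema>
```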

Schemas and namespaces

Some (unpleasant) facts of life:
  Namespaces allow us to distinguish mine from not-mine.
  Namespaces do not provide universal names.
  The namespace : language relation is 1:n.
  The language : grammar relation is 1:n.
  Therefore, the namespace : schema relation is 1:n.
  Live with it.

Modularization

XML Schema makes it possible to write modular document type definitions:
  late collection of schema components
  namespace-aware name matching, validation
  white-box wildcards (lax / opportunistic)
  black-box wildcards (skip)
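The two kinds of wildcard can be sketched like this. The element names note, text and payload are hypothetical. With processContents="lax" the validator checks foreign elements when it can find declarations for them and lets them pass otherwise; with "skip" it requires only well-formedness.

```xml
<?xml version="1.0"?>
<xsd:schema xmlns:xsd="http://www.w3.org/2001/XMLSchema">

  <!-- White-box wildcard: validate foreign content opportunistically. -->
  <xsd:element name="note">
    <xsd:complexType>
      <xsd:sequence>
        <xsd:element name="text" type="xsd:string"/>
        <xsd:any namespace="##other" processContents="lax"
                 minOccurs="0" maxOccurs="unbounded"/>
      </xsd:sequence>
    </xsd:complexType>
  </xsd:element>

  <!-- Black-box wildcard: accept any well-formed content unexamined. -->
  <xsd:element name="payload">
    <xsd:complexType>
      <xsd:sequence>
        <xsd:any processContents="skip"
                 minOccurs="0" maxOccurs="unbounded"/>
      </xsd:sequence>
    </xsd:complexType>
  </xsd:element>

</xsd:schema>
```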

Linking document and schema

  namespace name
  schemaLocation hint
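In an instance document the two links might appear as follows. The namespace name and the file name poem.xsd are hypothetical. The xsi:schemaLocation attribute pairs a namespace name with a location, and a processor is free to treat it as a hint and look elsewhere.

```xml
<?xml version="1.0"?>
<poem xmlns="http://example.org/poem"
      xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance"
      xsi:schemaLocation="http://example.org/poem poem.xsd">
  <limerick>
    <trimeter>There was a young lady named Bright</trimeter>
    <trimeter>whose speed was much faster than light.</trimeter>
    <dimeter>She set out one day,</dimeter>
    <dimeter>in a relative way,</dimeter>
    <trimeter>and returned on the previous night.</trimeter>
  </limerick>
</poem>
```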

Post-schema-validation infoset (PSVI)

XML-Schema validation: infoset → infoset.
  additions, no changes
  type assignment information
  validation-attempted information (strict, lax, skip)
  validation-outcome information