<!DOCTYPE TEI.2 PUBLIC '-//TEI//DTD TEI Lite 1.0//EN'
    "../1999/xmllite.dtd" [
<!--*
    "http://www.hcu.ox.ac.uk/TEI/Lite/DTD/teixlite.dtd" [
*-->
<!ENTITY mdash  "&#x2014;" ><!--=em dash-->
<!ENTITY uuml    "&#252;" ><!-- small u, dieresis or umlaut mark -->
<!ENTITY ouml    "&#246;" ><!-- small o, dieresis or umlaut mark -->
<!ENTITY auml    "&#228;" ><!-- small a, dieresis or umlaut mark -->
<!ENTITY acirc   "&#226;" ><!-- small a, circumflex accent -->
<!ENTITY ocirc   "&#244;" ><!-- small o, circumflex accent -->

<!ENTITY larr   "&#x2190;" ><!--/leftarrow /gets A: =leftward arrow-->
<!ENTITY rarr   "&#x2192;" ><!--/rightarrow /to A: =rightward arrow-->
<!ENTITY uarr   "&#x2191;" ><!--/uparrow A: =upward arrow-->
<!ENTITY darr   "&#x2193;" ><!--/downarrow A: =downward arrow-->
<!ENTITY supe   "&#x2287;" ><!--/supseteq R: =superset, equals-->
<!ENTITY sube   "&#x2286;" ><!--/subseteq R: =subset, equals-->
<!ENTITY sub    "&#x2282;" ><!--/subset R: =subset or is implied by-->
<!ENTITY sup    "&#x2283;" ><!--/supset R: =superset or implies-->
<!ENTITY equiv  "&#x2261;" ><!--/equiv R: =identical with-->
<!ENTITY ne     "&#x2260;" ><!--/ne /neq R: =not equal-->
<!ENTITY sime   "&#x2243;" ><!--/simeq R: =similar, equals-->

<!ENTITY eacute  "&#233;" ><!-- small e, acute accent -->
<!ENTITY ccedil  "&#231;" ><!-- small c, cedilla -->
<!ENTITY aring   "&#229;" ><!-- small a, ring -->
<!ENTITY amp    "&#38;#38;" ><!--=ampersand-->
<!ENTITY nbsp    "&#160;" >

<!ENTITY lt     "&#38;#60;"      ><!--=less-than sign R:-->
<!ENTITY gt     ">"      ><!--=greater-than sign R:-->

<!ENTITY simplehierarchy SYSTEM "http://www.w3.org/TR/2001/REC-xmlschema-2-20010502/type-hierarchy.gif" NDATA GIF>
]>
<?xml-stylesheet type="text/xsl" href="tlslides.xsl"?>
<TEI.2>
<teiHeader>
<fileDesc>
<titleStmt>
<title>A gentle introduction to XML Schema and XML Document Grammars</title>
</titleStmt>
<publicationStmt>
<p>Unpublished.</p>
</publicationStmt>
<sourceDesc>
<p>Created in electronic form.  Some parts are transcribed from
earlier notes on loose sheets.</p>
</sourceDesc>
</fileDesc>
</teiHeader>
<text>
<front>
<titlePage>
<docTitle>
<titlePart>A gentle introduction to XML Schema</titlePart>
<titlePart>and XML document grammars</titlePart>
</docTitle>
<titlePart>Universitetet i Bergen</titlePart>
<titlePart>HIT-senteret</titlePart>
<docAuthor>C. M. Sperberg-McQueen</docAuthor>
<docDate>26 October 2001</docDate>
</titlePage>

<div1>
<head>Abstract</head>

<p>In the historical development of markup languages, few innovations
have been more important than the introduction of the notion of
document grammars for constraining documents and defining document
types.  Document grammars provide a simple, easily understood method
of specifying rules for the validity of XML documents.  By helping
keep data clean, they make it easier to write simpler, more reliable
software.
</p>
<p>
Both SGML (ISO 8879) and XML 1.0 define a specialized notation (the
DTD) for defining document grammars; more recently a number of
alternative languages have been proposed. The W3C XML Schema language
replicates the essential functionality of DTDs, and adds a number of
features: the use of XML instance syntax rather than an ad hoc
notation, clear relationships between schemas and namespaces, a
systematic distinction between element types and data types, and a
single-inheritance form of type derivation.
</p>
<p>
This presentation will outline some of the fundamental features of
document grammars for XML, and at the same time introduce the basics
of the W3C XML Schema 1.0 language.  Some fundamental design issues of
schema languages will be discussed, together with the choices made by
the W3C group in defining XML Schema.</p>
</div1>
</front>
<body>

<div0>
<head>Overview</head>

<div1><head>Overview</head>

<list type="bullets">
<item>the idea of document grammars</item>
<item>DTDs as document grammars (example)</item>
<item>XML Schema: replicating DTDs</item>
<item>XML Schema: types </item>
<item>another example</item>
<item>design issues and research questions</item>
</list>
</div1>
</div0>

<div0>
<head>The idea of document grammars</head>
<div1><head>Some models of text</head>
<list><item><p>Rectangular:<eg>
text   ::= record*
record ::= CHAR*    // or CHAR{80} 
</eg>(Card images)</p></item>
<item><p>Linear<eg>text ::= CHAR*</eg>
(Stream editors)</p></item>
</list>
</div1>
<div1>
<head>Tree-shaped models</head>
<list>
<item><p>Unlabeled tree:<eg>
text   ::= node*
node   ::= node* | CHAR*
</eg>(Engelbart's <ident>NL/Augment</ident>)</p></item>
<item><p>Fixed-depth, fixed-label tree:<eg>
text   ::= page*
page   ::= line* 
line   ::= CHAR*
</eg>(Tustep)</p></item>
</list>
</div1>

<div1>
<head>Models of marked-up text</head>
<list>
<item><p>ODTAO:<eg>
text ::= (CHAR | COMMAND)*
</eg>
(Many, many programs.)
This only <emph>looks</emph> regular.</p>
</item>
<item><p>Virtual variables:<eg>
command ::= '&lt;' CHAR value '>'
</eg>
(COCOA, OCP, Tact)</p></item></list>
</div1>
<div1>
<head>Pseudo-grammars</head>
<list>
<item><p>Pseudo-grammars:<eg>
command ::= '\begin{' NAME '}'
          | '\end{' NAME '}'
</eg><eg>
command ::= ':' NAME '.'?
          | ':e' NAME '.'?
</eg><eg>
command ::= '&lt;' NAME attributes '>'
          | '&lt;/' NAME '>'
</eg>(Scribe, GML, LaTeX)</p>
</item>
</list>
</div1>

<div1><head>Document grammars (DGs)</head>
<p>Origin pragmatic, not theoretical.
Later aligned (partially) with language theory.</p>
<p>Formal specification of rules &rarr; automated validation.
(Cf. Algol vs. Fortran.)</p>
<p><term>Document type definition</term> (DTD) is 
<q>the set of <emph>rules</emph> ...</q>.  N.B.
DTD &ne;&nbsp;<q>set of effective formal <emph>declarations</emph></q>.
</p></div1>
<div1>
<head>The uses of DGs</head>
<p>Document grammars may have several uses:<list>
<item>in the struggle against dirty data</item>
<item>as documentation of a contract between data provider 
and data consumer</item>
<item>as documentation of the content of data flows</item>
<item>as specification of client/server protocols</item>
</list></p>
</div1>

</div0>

<div0>
<head>An example: DTDs as document grammars</head>

<div1>
<head>DTDs as DGs</head>
<p>DTDs resemble Backus-Naur Form grammars, but:<list>
<item>They describe <soCalled>bracketed</soCalled> languages* ...</item>
<item>... so <soCalled>non-terminals</soCalled> are visible*.</item>
<item>SGML allows inclusion and exclusion exceptions (Rizzi: NP-complete
parsing problem for non-bracketed L).</item>
<item>They are not purely grammatical (notations, entities).</item>
<item>Determinism rule (LL(1) requirement).</item>
</list>
</p>
</div1>
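<div1>
<head>The determinism rule: a sketch</head>
<p>To illustrate the determinism rule, here is a minimal sketch using
hypothetical element names <ident>x</ident>, <ident>a</ident>,
<ident>b</ident>, and <ident>c</ident> (none of them part of any DTD
discussed here).  The first content model is non-deterministic, because
a parser that sees an <ident>a</ident> cannot tell which branch it is
in without looking ahead; the second accepts the same sequences and is
deterministic:<eg><![CDATA[
<!-- non-deterministic: two competing 'a' tokens -->
<!ELEMENT x ((a, b) | (a, c)) >

<!-- deterministic rewrite accepting the same sequences -->
<!ELEMENT x (a, (b | c)) >
]]></eg>
The two declarations are alternatives: a DTD may declare
<ident>x</ident> only once.</p>
</div1>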


<div1>
<head>Example: limericks</head>
<p>Consider two kinds of poem. The limerick:</p>
<lg>
<l>There was a young lady named Bright</l>
<l>whose speed was much faster than light.</l>
<l rend="indent">She set out one day,</l>
<l rend="indent">in a relative way,</l>
<l>and returned on the previous night.</l>
</lg>
</div1>
<div1>
<head>... and <term>canzone</term></head>

<lg>
<l>Under der linden an der heide,</l>
<l>d&acirc; unser zweier bette was,</l></lg>
<lg>
<l>d&acirc; muget ir vinden sch&ocirc;ne beide</l>
<l>gebrochen bluomen unde gras.</l></lg>
<lg rend="indent">
<l>vor dem walde in einem tal,</l>
<l>tandaradei,</l>
<l>sch&ocirc;ne sanc diu nahtegal.</l>
</lg>
</div1>

<div1>
<head>A document grammar</head>
<p>Limericks and canzone:<eg>
poem     ::= limerick | canzone

limerick ::= trimeter trimeter dimeter 
             dimeter trimeter
trimeter ::= CHAR+
dimeter  ::= CHAR+

canzone   ::= aufgesang abgesang
aufgesang ::= stollen stollen
stollen   ::= line+
abgesang  ::= line+
</eg></p>
</div1>

<div1>
<head>A DTD</head>
<p>Limericks and canzone:<eg><![CDATA[
<!ELEMENT poem (limerick | canzone) >

<!ELEMENT limerick (trimeter, trimeter, 
                    dimeter, dimeter, 
                    trimeter)>
<!ELEMENT trimeter (#PCDATA)>
<!ELEMENT dimeter  (#PCDATA)>

<!ELEMENT canzone   (aufgesang, abgesang) >
<!ELEMENT aufgesang (stollen, stollen) >
<!ELEMENT stollen   (l+) >
<!ELEMENT abgesang  (l+) >
<!ELEMENT l         (#PCDATA) >
]]></eg></p></div1>
<div1>
<head>A limerick</head>
<eg><![CDATA[
<poem>
  <limerick>
    <trimeter>
      There was a young lady named Bright
    </trimeter>
    <trimeter>
      whose speed was much faster than light.
    </trimeter>
    <dimeter>She set out one day,</dimeter>
    <dimeter>in a relative way,</dimeter>
    <trimeter>
      and returned on the previous night.
    </trimeter>
  </limerick>
</poem>
]]></eg></div1>

<div1>
<head>Walther</head>
<eg><![CDATA[
<poem>
 <canzone>
  <aufgesang>
    <stollen>
      <l>unter den linden an der heide</l>
      <l>da unser zweier bette was</l>
    </stollen>
    <stollen>
      <l>da mugt ir vinden schone beide</l>
      <l>gebrochen bluomen unde gras</l>
    </stollen>
  </aufgesang>
  <abgesang>
    <l>kuste er mich? wol tusentstunt</l>
    <l>tandaradei</l>
    <l>seht wie rot mir ist der munt</l>
  </abgesang>
 </canzone>
</poem>
]]></eg>
</div1>


<div1>
<head>Note on the poem DTD</head>
<list>
<item>All the non-terminals show up as tags.</item>
<item>The trimeter and dimeter lines should scan with 3 and 2 dactyls
respectively; this rule is not expressed.</item>
<item>The two Stollen must have the same number of
lines; this rule is not expressed.</item>
<item>The Abgesang must have more lines than a
Stollen, fewer than the Aufgesang; this rule is not expressed.</item>
<item>No grammar detects the errors in the
previous example.</item>
</list>
</div1>

<div1>
<head>Removing non-terminals</head>
<eg><![CDATA[
<!ENTITY % aufgesang "stollen, stollen" >
<!ENTITY % lines     "l+" >
<!ELEMENT canzone   (%aufgesang;, abgesang) >
<!ELEMENT stollen   (%lines;) >
<!ELEMENT abgesang  (%lines;) >
<!ELEMENT l         (#PCDATA) >
]]></eg>
<p>This allows the DTD to record our understanding.
But can anyone <emph>use</emph> that understanding?
</p>
</div1>

<div1>
<head>The canzone minus explicit Aufgesang</head>
<eg><![CDATA[
<canzone>
  <stollen>
    <l>unter den linden an der heide</l>
    <l>da unser zweier bette was</l>
  </stollen>
  <stollen>
    <l>da mugt ir vinden schone beide</l>
    <l>gebrochen bluomen unde gras</l>
  </stollen>
  <abgesang>
    <l>kuste er mich? wol tusentstunt</l>
    <l>tandaradei</l>
    <l>seht wie rot mir ist der munt</l>
  </abgesang>
</canzone>
]]></eg>
</div1>
<!--*
<div1 rend="suppress">
<head>The canzone minus NTs</head>
<eg><![CDATA[
<canzone>
  <l>unter den linden an der heide</l>
  <l>da unser zweier bette was</l>
  <l>da mugt ir vinden schone beide</l>
  <l>gebrochen bluomen unde gras</l>
  <l>kuste er mich? wol tusentstunt</l>
  <l>tandaradei</l>
  <l>seht wie rot mir ist der munt</l>
</canzone>
]]></eg>
</div1>


<div1 rend="suppress">
<head>Removing <emph>all</emph> non-terminals</head>
<eg><![CDATA[
<!ENTITY % stollen   "l+" >
<!ENTITY % aufgesang "%stollen;, %stollen;" >
<!ENTITY % abgesang  "l+" >
<!ELEMENT canzone   (%aufgesang;, %abgesang;) >
<!ELEMENT l         (#PCDATA) >
]]></eg>
<p>ERROR: this DTD is illegal; why?</p>
</div1>
*-->

</div0>

<div0>
<head>XML Schema and DTD functionality</head>

<div1><head>Overview</head>

<list type="bullets">
<item>the idea of document grammars</item>
<item>DTDs as document grammars (example)</item>
<item rend="here">XML Schema: replicating DTDs</item>
<item>XML Schema: types </item>
<item>another example</item>
<item>design issues and research questions</item>
</list>
</div1>

<div1>
<head>XML Schema</head>
<list>
<item>DTD++ (inheritance, real data types)</item>
<item>DTD-- (no entities)</item>
<item>instance syntax</item>
<item>supporting programming-language and database-oriented
types</item>
<item>design problems</item>
</list>
</div1>

<div1>
<head>The canzone schema v.1</head>
<p>In version 1 of this schema, we imitate the DTD slavishly.</p>
<p>At the outer level is a <ident>schema</ident> element:
<eg><![CDATA[
<xsd:schema xmlns:xsd =
  "http://www.w3.org/2001/XMLSchema"
>
 <!--* element and type declarations 
     * go here ... *-->
</xsd:schema>
]]></eg>
N.B. the schema does not identify
a document-root element / start symbol.
</p>
</div1>
<div1><head>Declaring elements</head>
<eg><![CDATA[
 <xsd:element name="canzone">
  <xsd:complexType>
   <xsd:sequence>
    <xsd:element ref="aufgesang"/>
    <xsd:element ref="abgesang"/>
   </xsd:sequence>
  </xsd:complexType>
 </xsd:element>

 <xsd:element name="aufgesang">
  <xsd:complexType>
   <xsd:sequence>
    <xsd:element ref="stollen"/>
    <xsd:element ref="stollen"/>
   </xsd:sequence>
  </xsd:complexType>
 </xsd:element>
]]></eg>
</div1>

<div1><head>Declaring elements</head>

<list>
<item>Note difference between element <term>declaration</term> (outer)
and element <term>reference</term> (inner).</item>
<item>Implicit occurrence information: min = max = 1.</item>
</list>
</div1>
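<div1><head>Occurrence defaults: a sketch</head>
<p>As an illustration of the implicit occurrence information, the
following fragment should be equivalent to the
<ident>aufgesang</ident> declaration, with the default values of
<ident>minOccurs</ident> and <ident>maxOccurs</ident> written out
explicitly:<eg><![CDATA[
 <xsd:element name="aufgesang">
  <xsd:complexType>
   <xsd:sequence minOccurs="1" maxOccurs="1">
    <!-- each reference occurs exactly once -->
    <xsd:element ref="stollen" 
                 minOccurs="1" maxOccurs="1"/>
    <xsd:element ref="stollen" 
                 minOccurs="1" maxOccurs="1"/>
   </xsd:sequence>
  </xsd:complexType>
 </xsd:element>
]]></eg>
The two references could presumably also be collapsed into a single
reference with <ident>minOccurs="2"</ident> and
<ident>maxOccurs="2"</ident>, at the cost of a less slavish imitation
of the DTD.</p>
</div1>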
<div1><head>Repeated elements</head>
<eg><![CDATA[
 <xsd:element name="abgesang">
  <xsd:complexType>
   <xsd:sequence minOccurs="1" 
                 maxOccurs="unbounded">
    <xsd:element ref="l"/>
   </xsd:sequence>
  </xsd:complexType>
 </xsd:element>

 <xsd:element name="stollen">
  <xsd:complexType>
   <xsd:sequence minOccurs="1" 
                 maxOccurs="unbounded">
    <xsd:element ref="l"/>
   </xsd:sequence>
  </xsd:complexType>
 </xsd:element>
]]></eg>
</div1>
<div1><head>Character data</head>
<p>
<eg><![CDATA[
 <xsd:element name="l">
  <xsd:complexType mixed="true">
   <xsd:sequence>
   </xsd:sequence>
  </xsd:complexType>
 </xsd:element>
]]></eg>
or
<eg><![CDATA[
 <xsd:element name="l" type="xsd:string"/>
]]></eg>
</p>
</div1>
</div0>

<div0>
<head>Supporting PL/DBMS type notions</head>


<div1><head>Overview</head>

<list type="bullets">
<item>the idea of document grammars</item>
<item>DTDs as document grammars (example)</item>
<item>XML Schema: replicating DTDs</item>
<item rend="here">XML Schema: types </item>
<item>another example</item>
<item>design issues and research questions</item>
</list>
</div1>

<div1>
<head>Supporting programming-language and DBMS paradigms</head>
<list>
<item>tag/type distinction</item>
<item>named and anonymous datatypes</item>
<item>simple datatypes</item>
</list>
</div1>

<div1><head>The tag/type distinction</head>
<p>In programming languages we write:
<eg><![CDATA[
struct date {
  int  day;
  int  month;
  int  year;
  char mon_name[4];
};
]]></eg>
(Kernighan and Ritchie, <title>The C Programming Language</title>, 
chapter 6, Structures.)</p>
<p>
Distinguish the name used to <emph>access</emph> the field
from its <emph>type</emph>.
</p>

</div1>

<div1><head>The tag/type distinction</head>
<p>Similarly, in conventional DTDs we write:
<eg><![CDATA[
<!ELEMENT p       - O  (%paraContent;)   >
<!ELEMENT foreign - O  (%paraContent;)   >
<!ELEMENT emph    - O  (%paraContent;)   >
<!ELEMENT hi      - O  (%paraContent;)   >
<!ELEMENT list    - O  (head, item*)     >
<!ELEMENT item    - O  (%paraSequence;)  >
<!ELEMENT note    - O  (%paraSequence;)  >
]]></eg>
</p>
<p>Distinguish the <emph>element type</emph> name from
the <emph>name of the standard content-model</emph>.</p>

</div1>

<div1><head>The tag/type distinction</head>
<p>In XML Schema we sometimes write:
<eg><![CDATA[
 <xsd:element name="l" type="xsd:string"/>
]]></eg>
Can we do that for every element type?</p>
<p>N.B. four kinds of <term>type</term><list>
<item>element type (vs. element, element instance)</item>
<item>data type<list>
<item>simple type (lexical form has no markup)</item>
<item>complex type (has element children)</item>
</list>
</item>
</list>
</p>
</div1>


<div1><head>Top-level named types</head>
<p>Named types can be used to capture commonalities:
<eg><![CDATA[
 <xsd:complexType name="lines">
  <xsd:sequence minOccurs="1" 
                maxOccurs="unbounded">
   <xsd:element ref="l"/>
  </xsd:sequence>
 </xsd:complexType>

 <xsd:element name="abgesang" type="lines"/>
 <xsd:element name="stollen" type="lines"/>
]]></eg>
</p>
</div1>

<div1><head>Top-level complex types</head>
<p>... or just to provide a name for a type:
<eg><![CDATA[
<xsd:complexType name="canzoneform">
  <xsd:sequence>
    <xsd:element ref="aufgesang"/>
    <xsd:element ref="abgesang"/>
  </xsd:sequence></xsd:complexType>
<xsd:complexType name="aufgesang">
  <xsd:sequence>
    <xsd:element ref="stollen"/>
    <xsd:element ref="stollen"/>
  </xsd:sequence></xsd:complexType>

<xsd:element name="canzone" 
             type="canzoneform"/>
<xsd:element name="aufgesang" 
             type="aufgesang"/>
]]></eg>
</p>
</div1>

<div1><head>Anonymous types</head>
<p>We can hide things using
anonymous local types:<eg><![CDATA[
<xsd:element name="canzone">
 <xsd:complexType>
  <xsd:sequence>
   <xsd:element name="aufgesang">
    <xsd:complexType>
     <xsd:sequence>
      <xsd:element name="stollen" type="lines"/>
      <xsd:element name="stollen" type="lines"/>
     </xsd:sequence>
    </xsd:complexType>
   </xsd:element>
   <xsd:element name="abgesang" type="lines"/>
  </xsd:sequence>
 </xsd:complexType>
</xsd:element>
]]></eg>
Note nested declarations and definitions.</p>
</div1>

<div1>
<head>Simple datatypes</head>
<p><list>
<item>built-in<list>
<item>primitive</item>
<item>derived</item>
</list>
</item>
<item>user-defined</item>
</list></p>
</div1>
<div1>
<head>Built-in primitive datatypes</head>
<list>
<item>string</item>
<item>boolean</item>
<item>decimal, float, double</item>
<item>duration, dateTime, time, date, gYearMonth, gYear, gMonthDay, gDay,
gMonth</item>
<item>hexBinary, base64Binary</item>
<item>anyURI</item>
<item>QName</item>
<item>NOTATION</item>
</list>
</div1>
<div1>
<head>Built-in derived datatypes</head>
<list>
<item>normalizedString, token, language</item>
<item>
IDREFS, 
ENTITIES, 
NMTOKEN, 
NMTOKENS, 
Name, 
NCName, 
ID, 
IDREF, 
ENTITY
</item>
<item>
integer, 
nonPositiveInteger, 
negativeInteger, 
long, 
int, 
short, 
byte, 
nonNegativeInteger,
unsignedLong, 
unsignedInt, 
unsignedShort, 
unsignedByte, 
positiveInteger 
</item>
</list>
</div1>

<div1>
<head>Hierarchy of simple types</head>
<p>
<figure entity="simplehierarchy">
</figure></p>
</div1>

<div1>
<head>What is an atomic type?</head>
<p>Extensional:
<list>
<item>a set of values V</item>
<item>a set of lexical forms L</item>
<item>a mapping from L to V</item>
</list>
</p>
</div1>
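<div1>
<head>Lexical forms and values: a sketch</head>
<p>As a small illustration of the mapping from L to V, assume
hypothetical elements <ident>flag</ident> (declared with type
<ident>xsd:boolean</ident>) and <ident>n</ident> (declared with type
<ident>xsd:decimal</ident>).  Several lexical forms can denote the
same value:<eg><![CDATA[
<!-- xsd:boolean: four lexical forms, two values -->
<flag>true</flag>  <flag>1</flag>
<flag>false</flag> <flag>0</flag>

<!-- xsd:decimal: three lexical forms, one value -->
<n>1.5</n> <n>1.50</n> <n>+01.5</n>
]]></eg>
A processor that compares values rather than strings would treat the
items in each group as equal.</p>
</div1>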
<div1>
<head>What is an atomic type? (2)</head>
<p>Intensional:
<list>
<item>a base mapping L &rarr; V</item>
<item>a set of <term>fundamental facets</term>:
<list>
<item>equality (identity)</item>
<item>order (partial, total, none)</item>
<item>boundedness</item>
<item>cardinality</item>
<item>numeric</item>
</list>
</item></list>
</p>
</div1>
<div1><head>What is an atomic type? (3)</head>
<p>Intension, cont'd<list>
<item>a set of <term>constraining facets</term>:
<list>
<item>length, minLength, maxLength</item>
<item>pattern (constrains lexical space)</item>
<item>enumeration</item>
<item>whiteSpace</item>
<item>maxInclusive, maxExclusive, minInclusive, minExclusive</item>
<item>totalDigits, fractionDigits</item>
</list>
</item>
</list>
</p>
</div1>
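<div1>
<head>Constraining facets: a sketch</head>
<p>To show how constraining facets derive user-defined types from
built-in ones, here is a minimal sketch.  The type names
<ident>stanzaNumber</ident>, <ident>trCode</ident>, and
<ident>cardImage</ident> are hypothetical, chosen only for
illustration:<eg><![CDATA[
<!-- value-space bound: integers 1 to 99 -->
<xsd:simpleType name="stanzaNumber">
  <xsd:restriction base="xsd:positiveInteger">
    <xsd:maxInclusive value="99"/>
  </xsd:restriction>
</xsd:simpleType>

<!-- lexical-space constraint: 'tr' plus 1 or 2 digits -->
<xsd:simpleType name="trCode">
  <xsd:restriction base="xsd:token">
    <xsd:pattern value="tr[0-9]{1,2}"/>
  </xsd:restriction>
</xsd:simpleType>

<!-- length constraint: at most 80 characters -->
<xsd:simpleType name="cardImage">
  <xsd:restriction base="xsd:string">
    <xsd:maxLength value="80"/>
  </xsd:restriction>
</xsd:simpleType>
]]></eg>
Each derived type accepts a subset of the values of its base type.</p>
</div1>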

<div1>
<head>Non-atomic simple types</head>
<list>
<item>list (white-space delimited)</item>
<item>unions (ordered)</item>
</list>
</div1>
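<div1>
<head>Lists and unions: a sketch</head>
<p>A minimal sketch of the two non-atomic constructions, using the
hypothetical type names <ident>integerList</ident> and
<ident>occurrenceLimit</ident>.  The union is loosely modeled on the
way <ident>maxOccurs</ident> itself accepts either a number or the
keyword <mentioned>unbounded</mentioned>:<eg><![CDATA[
<!-- list: white-space-delimited sequence of integers -->
<xsd:simpleType name="integerList">
  <xsd:list itemType="xsd:integer"/>
</xsd:simpleType>

<!-- union: member types are tried in order -->
<xsd:simpleType name="occurrenceLimit">
  <xsd:union memberTypes="xsd:nonNegativeInteger">
    <xsd:simpleType>
      <xsd:restriction base="xsd:NMTOKEN">
        <xsd:enumeration value="unbounded"/>
      </xsd:restriction>
    </xsd:simpleType>
  </xsd:union>
</xsd:simpleType>
]]></eg>
Because unions are ordered, a literal is assigned to the first member
type that accepts it.</p>
</div1>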

</div0>

<div0>
<head>Another example</head>


<div1><head>Overview</head>

<list type="bullets">
<item>the idea of document grammars</item>
<item>DTDs as document grammars (example)</item>
<item>XML Schema: replicating DTDs</item>
<item>XML Schema: types </item>
<item rend="here">another example</item>
<item>design issues and research questions</item>
</list>
</div1>
<div1>
<head>An example</head>
<p>Consider an annotation language for a system which<list>
<item>tokenizes characters into words</item>
<item>lemmatizes the words</item>
<item>analyses the tokens syntactically</item>
</list></p>
</div1>

<div1>
<head>Annotation language</head>
<p>The tagging might look like this:
<eg><![CDATA[
<w><sf>Taggeren</sf>
   <reading lemma="Taggere" 
            features="subst mask appell be 
                      ent samset"/>
   <reading lemma="Taggeren" 
            features="adj pos m/f ub 
                      ent samset "/></w>
<w><sf>best]]>&aring;<![CDATA[r</sf>
   <reading lemma="best]]>&aring;<![CDATA[" 
            features="verb pres i2 tr5 tr12 
                      tr21 tr22 "/></w>
<w><sf>av</sf>
   <reading lemma="av" 
            features="prep "/></w>
<w><sf>en</sf>
   <reading lemma="en" 
            features="adv"/>
   <reading lemma="en" 
            features="pron pers ent hum"/>
   <reading lemma="en" 
            features="det kvant mask ent"/>
   <reading lemma="ene" 
            features="verb imp tr1 "/></w>
<w><sf>preprosessor</sf>
   <reading lemma="preprosessor" 
            features="subst mask appell ub 
                      ent samset "/></w>
<w><sf>,</sf>
   <reading lemma="$," 
            features="clb &lt;komma>"/>
   <reading lemma="$," 
            features="&lt;komma> "/></w>
...
<!--* Output from Oslo-Bergen tagger for Norwegian
    * http://decentius.hit.uib.no:8005/cl/cgp/test.html
    * XML markup added by hand for didactic purposes
    * 2001-10-25 *-->
]]></eg></p>
</div1>
<div1>
<head>A DTD</head>
<p>With an XML DTD, we can manage part of the job:<eg><![CDATA[
<!ELEMENT w (sf, reading+)>
<!ELEMENT sf (#PCDATA)>
<!ELEMENT reading EMPTY >
<!ATTLIST reading
          lemma    CDATA #REQUIRED
          features CDATA #REQUIRED >
]]></eg>
We cannot constrain <ident>features</ident> as we'd like.</p>
</div1>
<div1>
<head>A schema for annotation</head>
<p>
<eg><![CDATA[
<!DOCTYPE xsd:schema PUBLIC "-//W3C//DTD XMLSCHEMA 200105//EN"
       "http://www.w3.org/2001/XMLSchema.dtd" [
<!ENTITY % schemaAttrs "
  xmlns:this CDATA #IMPLIED
  xmlns:xsd  CDATA #IMPLIED"
>
<!ENTITY % p "xsd:">
<!ENTITY % s ":xsd">
]>
<xsd:schema 
  targetNamespace="http://decentius.hit.uib.no:8005/cl/cgp/test.html" 
  version="2001-10-25"
  xmlns:this="http://decentius.hit.uib.no:8005/cl/cgp/test.html" 
  xmlns:w3c="http://www.w3.org/2001/XMLSchema/typelibrary"
  xmlns:xsd="http://www.w3.org/2001/XMLSchema">
]]></eg>
...
<eg><![CDATA[  
</xsd:schema>
]]></eg>
</p>
</div1>
<div1>
<head>The type <ident>feature</ident></head>
<p>
<eg><![CDATA[
<xsd:simpleType name="feature">
  <xsd:restriction base="xsd:string">
    <xsd:enumeration value="&lt;komma>"/>
    <xsd:enumeration value="adj"/>
    <xsd:enumeration value="adv"/>
    <xsd:enumeration value="appell"/>
    <xsd:enumeration value="be"/>
    <xsd:enumeration value="clb"/>
    <xsd:enumeration value="det"/>
    <xsd:enumeration value="ent"/>
    <xsd:enumeration value="hum"/>
    <xsd:enumeration value="i2"/>
    <xsd:enumeration value="imp"/>
    <xsd:enumeration value="kvant"/>
    <xsd:enumeration value="m/f"/>
    <xsd:enumeration value="mask"/>
    <xsd:enumeration value="pers"/>
    <xsd:enumeration value="pos"/>
    <xsd:enumeration value="prep"/>
    <xsd:enumeration value="pres"/>
    <xsd:enumeration value="pron"/>
    <xsd:enumeration value="samset"/>
    <xsd:enumeration value="subst"/>
    <xsd:enumeration value="tr1"/>
    <xsd:enumeration value="tr12"/>
    <xsd:enumeration value="tr21"/>
    <xsd:enumeration value="tr22"/>
    <xsd:enumeration value="tr5"/>
    <xsd:enumeration value="ub"/>
    <xsd:enumeration value="verb"/>
  </xsd:restriction>
</xsd:simpleType>
]]></eg>
</p>
</div1>
<div1>
<head>The type <ident>featureSet</ident></head>
<p><eg><![CDATA[
<xsd:simpleType name="featureSet">
  <xsd:list itemType="this:feature">

    <xsd:annotation>
      <xsd:documentation xmlns="http://www.w3.org/1999/xhtml">
       <p>A feature set is (for our purposes) 
          a list of features, separated by 
          white space.  This schema type does
          <em>not</em> forbid duplicate items, 
          though in practice they will not 
          arise because the tagger doesn't 
          produce them.</p>
      </xsd:documentation>
    </xsd:annotation>

  </xsd:list>
</xsd:simpleType>
]]></eg>
</p>
</div1>
<div1><head>The <ident>w</ident> element</head>

<p><eg><![CDATA[
  <xsd:element name="w">
    <xsd:complexType mixed="false">
      <xsd:sequence>
        <xsd:element name="sf" type="w3c:text"/>
        <xsd:element name="reading" 
                     minOccurs="1" 
                     maxOccurs="unbounded">
          <xsd:complexType>
            <xsd:sequence/>
            <!--* EMPTY element *-->
            <xsd:attribute name="lemma" 
                           type="xsd:string"
                           use="required"/>
            <xsd:attribute name="features" 
                           type="this:featureSet"
                           use="required"/>
          </xsd:complexType>
        </xsd:element>
      </xsd:sequence>
    </xsd:complexType>
  </xsd:element>
]]></eg>
</p>
</div1>

</div0>

<div0>
<head>Design problems and research questions</head>


<div1><head>Overview</head>

<list type="bullets">
<item>the idea of document grammars</item>
<item>DTDs as document grammars (example)</item>
<item>XML Schema: replicating DTDs</item>
<item>XML Schema: types </item>
<item>another example</item>
<item rend="here">design issues and research questions</item>
</list>
</div1>

<div1>
<head>Some design problems</head>
<list>
<item>Inheritance / type derivation</item>
<item>Layering</item>
<item>Schemas and namespaces</item>
<item>Modularization</item>
<item>Finding the schemas</item>
<item>Information for downstream apps</item>
<!--*	
<item>Non-local effects</item>
<item>Determinism</item>
*-->
</list>
</div1>

<!--*
<div1 rend="suppress">
<head>Design issues and research questions</head>
<p>Problems and design choices:<list>
<item>PL and dbms idioms, tag/type distinction</item>
<item>inheritance</item>
<item>XML transfer syntax</item>
<item>component/XML distinction</item>
<item>limited non-local (pseudo-context-sensitive) effects</item>
<item>determinism</item>
<item>closure (lack of)</item>
<item>treatment of mixed content</item>
<item>namespaces derivation</item>
<item>type derivation</item>
<item>linking documents and schemas</item>
</list>
</p>
</div1>
*-->

<div1>
<head>Inheritance</head>
<p>Document systems turn out to have a very clear model
of class systems and inheritance.
<list>
<item>inheritance of attributes</item>
<item>inheritance of <emph>locations</emph></item>
<item>not inheritance of content models</item>
</list>
</p></div1>
<div1>
<head>Inheritance in TEI</head>
<p>In the TEI, for example, elements can inherit attributes:
<eg><![CDATA[
<!ATTLIST name               %a.global;
                             %a.names;
          type               CDATA               #IMPLIED
          TEIform            CDATA               'name'         >
]]></eg>
</p>
<p>Or location in content models:<eg><![CDATA[
<!ENTITY % m.phrase '%x.phrase %m.data; | %m.edit; | 
           %m.formPointers; | %m.hqphrase; | %m.loc; | 
           %m.phrase.verse; | %m.seg; | %m.sgmlKeywords; | 
           %n.formula; | %n.fw; | %n.handShift;'                >

]]></eg></p>
</div1>

<div1>
<head><del>Inheritance</del> Type derivation</head>
<p>By contrast, it turns out to be hard to model stepwise refinement
of programming-language types:
<list>
<item>restriction (preserves subset semantics)</item>
<item>extension (preserves prefix semantics)</item>
</list>
</p>
<p>Depends on point of view:<list>
<item>content model as list of fields with accessors,
defining a <emph>record type</emph></item>
<item>content model as right-hand side of a grammar rule,
defining a <emph>language</emph></item>
</list>
</p>
</div1>
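<div1>
<head>Type derivation: a sketch</head>
<p>The two forms of derivation can be sketched against the named types
<ident>lines</ident> and <ident>canzoneform</ident> defined earlier.
The type names <ident>couplet</ident> and
<ident>canzoneWithEnvoi</ident> and the element <ident>envoi</ident>
are hypothetical, introduced only for illustration:<eg><![CDATA[
<!-- restriction: every valid 'couplet' is
     also a valid 'lines' (subset) -->
<xsd:complexType name="couplet">
  <xsd:complexContent>
    <xsd:restriction base="lines">
      <xsd:sequence minOccurs="2" maxOccurs="2">
        <xsd:element ref="l"/>
      </xsd:sequence>
    </xsd:restriction>
  </xsd:complexContent>
</xsd:complexType>

<!-- extension: new content follows the base
     content (base model is a prefix) -->
<xsd:complexType name="canzoneWithEnvoi">
  <xsd:complexContent>
    <xsd:extension base="canzoneform">
      <xsd:sequence>
        <xsd:element name="envoi" type="lines"/>
      </xsd:sequence>
    </xsd:extension>
  </xsd:complexContent>
</xsd:complexType>
]]></eg>
Note that the restriction must repeat the content model it narrows,
while the extension states only what it adds.</p>
</div1>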


<div1>
<head>Schema layers</head>
<p>We distinguish:
<list>
<item>schema documents (with single target namespace)</item>
<item>schemas (sets of abstract components)</item>
</list>
</p>
<p>Schema composition operations:<list>
<item>import</item>
<item>include</item>
<item>include with override / redefine</item>
</list>
</p>
</div1>
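<div1>
<head>Schema composition: a sketch</head>
<p>A minimal sketch of the three composition operations in one schema
document.  The target namespace
<ident>http://example.org/poems</ident> and the file names
<ident>limerick.xsd</ident>, <ident>xhtml.xsd</ident>, and
<ident>canzone.xsd</ident> are assumptions made for illustration;
<ident>canzone.xsd</ident> is assumed to define the type
<ident>lines</ident>:<eg><![CDATA[
<xsd:schema 
  xmlns:xsd="http://www.w3.org/2001/XMLSchema"
  xmlns="http://example.org/poems"
  targetNamespace="http://example.org/poems">

  <!-- include: components for the same namespace -->
  <xsd:include schemaLocation="limerick.xsd"/>

  <!-- import: components for another namespace -->
  <xsd:import 
    namespace="http://www.w3.org/1999/xhtml"
    schemaLocation="xhtml.xsd"/>

  <!-- redefine: include, but modify 'lines' -->
  <xsd:redefine schemaLocation="canzone.xsd">
    <xsd:complexType name="lines">
      <xsd:complexContent>
        <xsd:extension base="lines">
          <xsd:attribute name="rend" 
                         type="xsd:string"/>
        </xsd:extension>
      </xsd:complexContent>
    </xsd:complexType>
  </xsd:redefine>
</xsd:schema>
]]></eg>
A redefined type must be derived from itself, which is why
<ident>lines</ident> appears as its own base.</p>
</div1>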


<div1>
<head>Schemas and namespaces</head>
<p>Some (unpleasant) facts of life:
<list>
<item>Namespaces allow us to distinguish <mentioned>mine</mentioned>
from <mentioned>not-mine</mentioned>.</item>
<item>Namespaces do <emph>not</emph> provide universal names.</item>
<item>The <ident>namespace</ident> : <ident>language</ident> 
relation is 1:<ident>n</ident>.</item>
<item>The <ident>language</ident> : <ident>grammar</ident>
relation is 1:<ident>n</ident>.</item>
<item>Therefore, the <ident>namespace</ident> : <ident>schema</ident>
relation is 1:<ident>n</ident>.</item>
</list>
Live with it.</p>
</div1>

<div1>
<head>Modularization</head>
<p>XML Schema makes it possible to write modular document
type definitions:
<list>
<item>late collection of schema components</item>
<item>namespace-aware name matching, validation</item>
<item>white-box wildcards (lax / opportunistic)</item>
<item>black-box wildcards (skip)</item>
</list>
</p>
</div1>
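<div1>
<head>Wildcards: a sketch</head>
<p>The two kinds of wildcard can be sketched as follows.  The element
names <ident>commentary</ident> and <ident>foreignData</ident> are
hypothetical:<eg><![CDATA[
<xsd:element name="commentary">
  <xsd:complexType>
    <xsd:sequence>
      <!-- white box: validate the children if
           declarations for them can be found -->
      <xsd:any namespace="##other" 
               processContents="lax"
               minOccurs="0" 
               maxOccurs="unbounded"/>
    </xsd:sequence>
  </xsd:complexType>
</xsd:element>

<xsd:element name="foreignData">
  <xsd:complexType>
    <xsd:sequence>
      <!-- black box: require only that the
           children be well-formed -->
      <xsd:any namespace="##any" 
               processContents="skip"
               minOccurs="0" 
               maxOccurs="unbounded"/>
    </xsd:sequence>
  </xsd:complexType>
</xsd:element>
]]></eg>
</p>
</div1>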

<div1>
<head>Linking document and schema</head>
<p>
<list>
<item>namespace name</item>
<item><ident>schemaLocation</ident> hint</item>
</list>
</p>

</div1>
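<div1>
<head>Schema location hints: a sketch</head>
<p>A minimal sketch of how a document instance might point to a
schema.  The file names <ident>annotation.xsd</ident> and
<ident>poem.xsd</ident> are assumptions; the namespace name is the
target namespace of the annotation schema shown earlier:<eg><![CDATA[
<!-- namespaced document: pairs of 
     namespace name and location hint -->
<w xmlns=
    "http://decentius.hit.uib.no:8005/cl/cgp/test.html"
   xmlns:xsi=
    "http://www.w3.org/2001/XMLSchema-instance"
   xsi:schemaLocation=
    "http://decentius.hit.uib.no:8005/cl/cgp/test.html
     annotation.xsd">
  ...
</w>

<!-- document without a namespace -->
<poem 
   xmlns:xsi=
    "http://www.w3.org/2001/XMLSchema-instance"
   xsi:noNamespaceSchemaLocation="poem.xsd">
  ...
</poem>
]]></eg>
These attributes are hints only: a processor is free to locate schema
components by other means.</p>
</div1>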

<div1>
<head>Post-schema-validation infoset (PSVI)</head>
<p>XML-Schema validation: infoset &rarr; infoset.
<list>
<item>additions, no changes</item>
<item>type assignment information</item>
<item>validation-attempted information (full, partial, none)</item>
<item>validation-outcome information</item>
</list>
</p>

</div1>
<!--*
<div1 rend="suppress">
<head>Non-local effects</head>
<p>Consider the HTML <ident>input</ident> element:
<list>
<item>legal only in <ident>p</ident> and similar elements</item>
<item>legal only within <ident>form</ident> elements</item>
</list>
</p>
<p>SGML DTDs have partial solutions:<list>
<item>inclusion exceptions</item>
<item>content models</item>
</list>
</p>

</div1>

<div1 rend="suppress">
<head>Non-local effects in XML Schema</head>
<p>Fundamentally, we trade verbosity for context-sensitivity:
<eg><![CDATA[
 <xsd:element name="div" type="div-type"/>
 <xsd:element name="div" type="div-in-form-type"/>

 <xsd:element name="p" type="p-type"/>
 <xsd:element name="p" type="p-in-form-type"/>

 <xsd:element name="ul" type="ul-type"/>
 <xsd:element name="ul" type="ul-in-form-type"/>

 <xsd:element name="li" type="li-type"/>
 <xsd:element name="li" type="li-in-form-type"/>
]]></eg>
</p>
<p>One bit of context information = double the size of grammar.</p>
<p>Cf. van Wijngaarden grammars (infinite size, arbitrary amounts of
context sensitivity).</p>

</div1>

<div1 rend="suppress">
<head>Determinism</head>
<p>The determinism rule remains controversial:<list>
<item>LL(1) guarantees may help implementors</item>
<item>All regular languages have a deterministic FSA;</item>
<item>... but <emph>not</emph> necessarily a deterministic
regular expression!</item>
<item>Implications for closure under union, intersection.</item>
<item>Implications for subsumption tests.</item>
</list>
</p>
</div1>

*-->
</div0>

</body>
<back>

</back>
</text>
</TEI.2>
<!-- Keep this comment at the end of the file
Local variables:
mode: xml
sgml-default-dtd-file:(concat sgmlvol "/SGML/Public/Emacs/teilite.ced")
sgml-omittag:t
sgml-shorttag:t
End:
-->
