Perspectives on XML and related standards

Korpuslinguistik Deutsch

Synchron, diachron, kontrastiv

Universität Würzburg

C. M. Sperberg-McQueen

22 February 2003


I. Overview

Overview

  • markup languages, SGML, XML
  • the XML landscape
  • XML Schema
  • XML Query, XSLT, and XPath

II. Markup languages, SGML, XML

SGML/XML and corpora

  • TEI (Text Encoding Initiative)
  • British National Corpus (CDIF)
  • Corpus Encoding Standard (CES)
  • Edinburgh and Chiba map-task corpora
  • ...

SGML

Standard Generalized Markup Language (SGML)
  • an international standard (ISO 8879:1986)
  • non-proprietary
  • explicit* markup
  • descriptive markup
  • document grammars (~ BNF)
  • flexible character-set declaration

XML

Extensible Markup Language (XML)
  • a de facto standard (W3C Recommendation 1998)
  • subset of SGML
  • Web-friendly
  • easier to parse than SGML
  • few optional features
  • Unicode

Who is the W3C?

The World Wide Web Consortium is a member-supported organization which creates Web standards.
Our mission:
to lead the Web to its full potential.

W3C goals and operating principles

  • universal access
  • semantic Web
  • trust
  • interoperability
  • evolvability (through simplicity, modularity, compatibility, extensibility)
  • decentralization
  • cooler multimedia

Markup

A transcription (re)produces a character sequence.
But virtually all transcriptions of texts do more.
Informally, we sometimes say they ‘add’ information to the text using markup — debatable.

Markup languages, SGML, XML

Consider (virtually) any machine-readable transcription of any text.
                                      1875
                                   PEER GYNT
                                by Henrik Ibsen
  THE CHARACTERS
  ASE, a peasant's widow.
  PEER GYNT, her son.
  TWO OLD WOMEN with corn-sacks. ASLAK, a smith. WEDDING-GUESTS. A
    MASTER-COOK, A FIDDLER, etc.
  A MAN AND WIFE, newcomers to the district.
  SOLVEIG and LITTLE HELGA, their daughters.
  THE FARMER AT HEGSTAD.
  INGRID, his daughter.
  THE BRIDEGROOM and His PARENTS.
  THREE SAETER-GIRLS. A GREEN-CLAD WOMAN.
  THE OLD MAN OF THE DOVRE.
  A TROLL-COURTIER. SEVERAL OTHERS. TROLL-MAIDENS and TROLL-URCHINS. A
    COUPLE OF WITCHES. BROWNIES, NIXIES, GNOMES, etc.
  AN UGLY BRAT. A VOICE IN THE DARKNESS. BIRD-CRIES.
  KARI, a cottar's wife.
  Master COTTON, Monsieur BALLON, Herren VON EBERKOPF and
    TRUMPETERSTRALE, gentlemen on their travels. A THIEF and A RECEIVER.
  ANITRA, daughter of a Bedouin chief.
  ARABS, FEMALE SLAVES, DANCING-GIRLS, etc.
  THE MEMNON-STATUE (singing). THE SPHINX AT GIZEH (muta persona).
  PROFESSOR BEGRIFFENFELDT, Dr. Phil., director of the madhouse at
    Cairo.
  HUHU, a language-reformer from the coast of Malabar. HUSSEIN, an
    eastern Minister. A FELLAH, with a royal mummy.
  SEVERAL MADMEN, with their KEEPERS.
  A NORWEGIAN SKIPPER and HIS CREW. A STRANGE PASSENGER.
  A PASTOR. A FUNERAL-PARTY. A PARISH-OFFICER. A BUTTON-MOULDER. A
    LEAN PERSON.
    The action, which opens in the beginning of the nineteenth
  century, and ends around the 1860's, takes place partly in
  Gudbrandsdalen, and on the mountains around it, partly on the coast
  of Morocco, in the desert of Sahara, in a madhouse at Cairo, at sea,
  etc.
                                   ACT FIRST
  SCENE FIRST
  [A wooded hillside near ASE's farm. A river rushes down the slope.
  On the further side of it an old mill shed. It is a hot day in
  summer.]
  [PEER GYNT, a strongly-built youth of twenty, comes down the
  pathway. His mother, ASE, a small, slightly built woman, follows
  him, scolding angrily.]
  ASE
       Peer, you're lying!
  PEER [without stopping].
       No, I am not!
  ASE
       Well then, swear that it is true!
  PEER
       Swear? Why should I?
  ASE
       See, you dare not!
       It's a lie from first to last.

Further examples

Some transcriptions have more information:
|s001
|l001 ich sâz ûf eime steine
|l002 und dâhte bein mit beine
|l003 dar ûf sazt ich den ellenbogen
or
|b001
|l001a Hw*a/et, we GAR-DENA
|l001b in geardagum t*rym gefrunnon
|l002a Hu t*a *a/ed*elingas 
|l002b ellen fremedon

Annotation

Does this transcription include extra-textual information?
S0CF6003 v
[S [N TROUBLED_JJ [ morning_NNT1 television_NN1 ] 
station_NN1 GMTV_NP1 N] finally_RR [V had_VHD 
[N something_PN1 [Ti to_TO smile_VVI [P about_II
P]Ti]N][Nr last_MD night_NNT1 [Fr when_RRQ [N 
it_PPH1 N][V was_VBDZ revealed_VVN [Fn[N it_PPH1 
N][V gained_VVD [N an_AT1 extra_JJ million_NNO 
viewers_NN2 N][P over_II [N the_AT last_MD two_MC
weeks_NNT2 N]P]V]Fn]V]Fr]Nr]V] ._YSTP S]
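
The word_TAG tokens in the sample above can be separated mechanically from the bracket labels, which is one way to see where content ends and annotation begins. A minimal Python sketch, assuming every token has the form word_TAG with an upper-case alphanumeric tag; the names TOKEN and split_tokens are illustrative and not part of any corpus toolkit.

```python
import re

# Assumption: tokens look like word_TAG; bracket labels such as "[N" or "N]"
# contain no underscore and are therefore skipped.
TOKEN = re.compile(r"([^\s\[\]]+)_([A-Z0-9]+)")

def split_tokens(annotated):
    """Return (word, tag) pairs found in a bracketed, tagged string."""
    return TOKEN.findall(annotated)

sample = "[S [N TROUBLED_JJ [ morning_NNT1 television_NN1 ] station_NN1 GMTV_NP1 N]"
pairs = split_tokens(sample)

# Dropping the tags and brackets recovers the running text of the fragment.
words = " ".join(word for word, _ in pairs)
```

The brackets carry the phrase structure and are ignored here; recovering the tree would need a small stack-based parser.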

Markup languages

More formally, we can oppose content and markup. A markup language:
  • defines a vocabulary to use in markup.
  • specifies how markup constructs can occur (containment, sequence, ...), providing a contract between data sources and data sinks.
  • tells how to distinguish markup from content.

Greetings example

Some simple XML:
<!DOCTYPE greetings [
<!ELEMENT greetings (hello+) >
<!ELEMENT hello (#PCDATA) >
<!ATTLIST hello
          lang CDATA #IMPLIED >
<!ENTITY szlig "&#223;" >
<!ENTITY uuml "&#252;" >
]>

<greetings>
<hello lang="en">Hello, world!</hello>
<hello lang="fr">Bonjour, tout le monde!</hello>
<hello lang="no">Goddag!</hello>
<hello lang="de">Guten Tag!</hello>
<hello lang="de-franken">Gr&uuml;&szlig; Gott!</hello>
</greetings>
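
A parser hands the application content and markup separately, with entity references already expanded. A minimal Python sketch using the standard library's xml.etree.ElementTree on a shortened copy of the greetings document; the variable names are illustrative, and it assumes the underlying expat parser expands entities declared in the internal subset, which it does by default.

```python
import xml.etree.ElementTree as ET

GREETINGS = """<!DOCTYPE greetings [
<!ELEMENT greetings (hello+) >
<!ELEMENT hello (#PCDATA) >
<!ATTLIST hello
          lang CDATA #IMPLIED >
<!ENTITY szlig "&#223;" >
<!ENTITY uuml "&#252;" >
]>
<greetings>
<hello lang="en">Hello, world!</hello>
<hello lang="de">Guten Tag!</hello>
<hello lang="de-franken">Gr&uuml;&szlig; Gott!</hello>
</greetings>"""

root = ET.fromstring(GREETINGS)

# Markup (the lang attribute) becomes the key, content (the text) the value.
by_lang = {hello.get("lang"): hello.text for hello in root.findall("hello")}
```

ElementTree does not validate against the DTD; it only reads the entity declarations.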

III. The XML landscape

The XML landscape: applications

XML is a meta-language for defining markup languages.
Applications of XML are markup languages:
  • XHTML
  • TEI, CES, ...
  • SVG (Scalable Vector Graphics)
  • XSLT (XSL Transformations)
  • MathML
  • ...

The XML landscape: related specs

The original plan:
  • XML (subset of SGML)
  • XLink (subset of HyTime)
  • XSL (Extensible Stylesheet Language) (subset of DSSSL)

Related specs

The plan, as modified by experience:
  • XML
  • Namespaces in XML
  • XML Information Set
  • XLink
  • XPointer
  • XSLT (XSL Transformations)
  • XSL Formatting Objects
  • XPath
  • XML Schema
  • XML Query
  • Document Object Model
  • SOAP (Simple Object Access Protocol)
  • WSDL (Web Services Description Language)
  • ...

IV. XML Schema

XML Schema and document grammars

  • document grammars
  • an example
  • XML Schema

Document grammars (DGs)

Origin pragmatic, not theoretical.
Later aligned (partially) with language theory.
Formal specification of validity rules → automated validation. (Cf. Algol vs. Fortran.)
Distinction between document type definition (DTD) and “the set of effective formal declarations”.

The uses of DGs

Document grammars may have several uses:
  • in the struggle against dirty data
  • as documentation of a contract between data provider and data consumer
  • as documentation of the content of data flows
  • as specification of client/server protocols

DTDs as DGs

SGML/XML DTDs resemble Backus-Naur Form grammars, but:
  • They describe bracketed languages* ...
  • ... so ‘non-terminals’ are visible*.
  • SGML allows ‘inclusion’ and ‘exclusion’ exceptions.
  • They are not purely grammatical (notations, entities).
  • Determinism rule.

A document grammar

Limericks and canzone:
poem     ::= limerick | canzone

limerick ::= trimeter trimeter dimeter 
             dimeter trimeter
trimeter ::= CHAR+
dimeter  ::= CHAR+

canzone   ::= aufgesang abgesang
aufgesang ::= stollen stollen
stollen   ::= line+
abgesang  ::= line+

A DTD

Limericks and canzone:
<!ELEMENT poem (limerick | canzone) >

<!ELEMENT limerick (trimeter, trimeter, 
                    dimeter, dimeter, 
                    trimeter)>
<!ELEMENT trimeter (#PCDATA)>
<!ELEMENT dimeter  (#PCDATA)>

<!ELEMENT canzone   (aufgesang, abgesang) >
<!ELEMENT aufgesang (stollen, stollen) >
<!ELEMENT stollen   (l+) >
<!ELEMENT abgesang  (l+) >
<!ELEMENT l         (#PCDATA) >
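
What validation against these declarations amounts to can be sketched in a few lines of Python: each content model becomes a regular expression over the names of an element's children. This is a hand translation made for illustration, not how an SGML or XML parser is implemented, and the names CONTENT_MODELS, PCDATA_ONLY and is_valid are hypothetical.

```python
import re
import xml.etree.ElementTree as ET

# Hypothetical hand translation of the DTD content models into regular
# expressions over child element names, each name followed by one space.
CONTENT_MODELS = {
    "poem": r"(limerick |canzone )",
    "limerick": r"trimeter trimeter dimeter dimeter trimeter ",
    "canzone": r"aufgesang abgesang ",
    "aufgesang": r"stollen stollen ",
    "stollen": r"(l )+",
    "abgesang": r"(l )+",
}
PCDATA_ONLY = {"trimeter", "dimeter", "l"}  # declared as (#PCDATA)

def is_valid(elem):
    """Check elem and its descendants against the content models above."""
    if elem.tag in PCDATA_ONLY:
        return len(elem) == 0  # character data only, no child elements
    model = CONTENT_MODELS.get(elem.tag)
    if model is None:
        return False  # undeclared element
    # Element content: only white space may appear between children.
    stray = [elem.text or ""] + [child.tail or "" for child in elem]
    if any(s.strip() for s in stray):
        return False
    names = "".join(child.tag + " " for child in elem)
    return bool(re.fullmatch(model, names)) and all(is_valid(c) for c in elem)

good = ET.fromstring(
    "<limerick><trimeter>a</trimeter><trimeter>b</trimeter>"
    "<dimeter>c</dimeter><dimeter>d</dimeter><trimeter>e</trimeter></limerick>"
)
bad = ET.fromstring("<limerick><trimeter>a</trimeter><dimeter>c</dimeter></limerick>")
```

Here is_valid(good) holds and is_valid(bad) does not, because the second limerick lacks three of the five required lines.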

A limerick

<limerick>
  <trimeter>
    There was a young lady named Bright
  </trimeter>
  <trimeter>
    whose speed was much faster than light.
  </trimeter>
  <dimeter>She set out one day,</dimeter>
  <dimeter>in a relative way,</dimeter>
  <trimeter>
    and returned on the previous night.
  </trimeter>
</limerick>

A canzone

<canzone>
  <aufgesang>
    <stollen>
      <l>unter den linden an der heide</l>
      <l>da unser zweier bette was</l>
    </stollen>
    <stollen>
      <l>da mugt ir vinden schone beide</l>
      <l>gebrochen bluomen unde gras</l>
    </stollen>
  </aufgesang>
  <abgesang>
    <l>kuste er mich? wol tusentstunt</l>
    <l>tandaradei</l>
    <l>seht wie rot mir ist der munt</l>
  </abgesang>
</canzone>

Note on the canzone DTD

All the non-terminals show up as tags.
The two Stollen must have the same number of lines; this rule is not expressed.
The Abgesang must have more lines than a Stollen and fewer than the Aufgesang; this rule is not expressed.
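
Rules the DTD cannot express can still be checked by the application after parsing. A minimal Python sketch of the two line-count rules, assuming the input has already been found valid against the canzone DTD; canzone_problems and make_canzone are hypothetical helper names.

```python
import xml.etree.ElementTree as ET

def canzone_problems(canzone):
    """Return the violated line-count rules for a DTD-valid <canzone> element."""
    stollen = [len(s.findall("l")) for s in canzone.findall("aufgesang/stollen")]
    abgesang = len(canzone.findall("abgesang/l"))
    aufgesang = sum(stollen)
    problems = []
    if len(set(stollen)) != 1:
        problems.append("Stollen differ in length")
    if not max(stollen) < abgesang < aufgesang:
        problems.append("Abgesang is not longer than a Stollen and shorter than the Aufgesang")
    return problems

def make_canzone(first, second, closing):
    """Build a canzone with the given line counts (test data only)."""
    line = "<l>x</l>"
    return ET.fromstring(
        "<canzone><aufgesang>"
        f"<stollen>{line * first}</stollen><stollen>{line * second}</stollen>"
        f"</aufgesang><abgesang>{line * closing}</abgesang></canzone>"
    )

# Same shape as the sample canzone: two Stollen of two lines, Abgesang of three.
sample_problems = canzone_problems(make_canzone(2, 2, 3))
```

For the shape of the sample canzone the list of problems is empty.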

Removing non-terminals

<!ENTITY % aufgesang "stollen, stollen" >
<!ENTITY % lines     "l+" >
<!ELEMENT canzone   (%aufgesang;, abgesang) >
<!ELEMENT stollen   (%lines;) >
<!ELEMENT abgesang  (%lines;) >
<!ELEMENT l         (#PCDATA) >

The canzone minus explicit Aufgesang

<canzone>
  <stollen>
    <l>unter den linden an der heide</l>
    <l>da unser zweier bette was</l>
  </stollen>
  <stollen>
    <l>da mugt ir vinden schone beide</l>
    <l>gebrochen bluomen unde gras</l>
  </stollen>
  <abgesang>
    <l>kuste er mich? wol tusentstunt</l>
    <l>tandaradei</l>
    <l>seht wie rot mir ist der munt</l>
  </abgesang>
</canzone>

V. XML Schema and DTD functionality

XML Schema

  • DTD++, DTD--
  • instance syntax
  • support for programming-language and database-oriented types
  • type inheritance
  • ‘complex’ and ‘simple’ types

The canzone schema v.1

In version 1 of this schema, we imitate the DTD slavishly.
At the outer level is a schema element:
<xsd:schema xmlns:xsd="http://www.w3.org/2001/XMLSchema">
 <!--* element declarations go here *-->
</xsd:schema>
N.B. the schema does not identify a document-root element / start symbol.

Declaring elements

 <xsd:element name="canzone">
  <xsd:complexType>
   <xsd:sequence>
    <xsd:element ref="aufgesang"/>
    <xsd:element ref="abgesang"/>
   </xsd:sequence>
  </xsd:complexType>
 </xsd:element>

 <xsd:element name="aufgesang">
  <xsd:complexType>
   <xsd:sequence>
    <xsd:element ref="stollen"/>
    <xsd:element ref="stollen"/>
   </xsd:sequence>
  </xsd:complexType>
 </xsd:element>

Declaring elements

  • Note difference between element declaration (outer) and element reference (inner).
  • Implicit occurrence information: min = max = 1.

Positive closure

 <xsd:element name="abgesang">
  <xsd:complexType>
   <xsd:sequence minOccurs="1" 
                 maxOccurs="unbounded">
    <xsd:element ref="l"/>
   </xsd:sequence>
  </xsd:complexType>
 </xsd:element>

 <xsd:element name="stollen">
  <xsd:complexType>
   <xsd:sequence minOccurs="1" 
                 maxOccurs="unbounded">
    <xsd:element ref="l"/>
   </xsd:sequence>
  </xsd:complexType>
 </xsd:element>
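
Because a schema is itself an XML document, its occurrence information can be read with any XML tool. A minimal Python sketch, assuming the xsd prefix is bound to the XML Schema namespace; it reports the bounds of each declaration's top-level sequence and supplies the default of 1 where the attributes are absent. The function names are illustrative.

```python
import xml.etree.ElementTree as ET

XSD = "{http://www.w3.org/2001/XMLSchema}"

SCHEMA = """<xsd:schema xmlns:xsd="http://www.w3.org/2001/XMLSchema">
 <xsd:element name="aufgesang">
  <xsd:complexType>
   <xsd:sequence>
    <xsd:element ref="stollen"/>
    <xsd:element ref="stollen"/>
   </xsd:sequence>
  </xsd:complexType>
 </xsd:element>
 <xsd:element name="stollen">
  <xsd:complexType>
   <xsd:sequence minOccurs="1" maxOccurs="unbounded">
    <xsd:element ref="l"/>
   </xsd:sequence>
  </xsd:complexType>
 </xsd:element>
</xsd:schema>"""

def occurrence(particle):
    """Return (minOccurs, maxOccurs) for a particle; both default to "1"."""
    return particle.get("minOccurs", "1"), particle.get("maxOccurs", "1")

def sequence_bounds(schema_root):
    """Map each declared element name to the bounds of its top-level sequence."""
    bounds = {}
    for decl in schema_root.findall(XSD + "element"):
        seq = decl.find(XSD + "complexType/" + XSD + "sequence")
        if seq is not None:
            bounds[decl.get("name")] = occurrence(seq)
    return bounds

bounds = sequence_bounds(ET.fromstring(SCHEMA))
```

This only inspects the schema; the Python standard library does not include an XML Schema validator.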

Character data

 <xsd:element name="l">
  <xsd:complexType mixed="true">
   <xsd:sequence>
   </xsd:sequence>
  </xsd:complexType>
 </xsd:element>
or
 <xsd:element name="l" type="xsd:string"/>

VI. XML Query, XSLT, and XPath

XML Query, XSLT, and XPath

Core functions quite distinct:
  • XSLT: formatting.
  • XPath: addressing.
  • XML Query: tree manipulation.

Why does XSL have two parts?

Serious formatting requires both
  • styling text blocks
    • font family (Times Roman, Helvetica, ...)
    • font treatment, weight (italic, bold, demi-bold, ...)
    • measure, color, ...
  • tree transformation.
    • table of contents, list of figures, ...
    • indices
    • footnotes, endnotes
    • running headers and footers

XSLT

A generic tree-transformation tool
  • output-driven
  • input-driven
  • functional* language
  • Turing complete
  • XML syntax
  • untyped

XML Query

An industrial-strength query language for XML data
  • explicit data model
  • formal semantics
  • keyword syntax
  • Turing complete
  • statically typed

XPath

  • What is XPath?
  • XPath data model
  • Expressions as location ladders
  • Axes
  • Long syntax, short syntax

XPath: an addressing language

Many applications need to ‘address’ parts of XML documents:
  • formatting (e.g. XSLT)
  • hyperlinking
  • document construction
  • query / search and retrieval
  • schema / language specification
  • ...
XPath captures the common functionality.

XPath data model

A document is an ordered tree with
  • a root node
  • element nodes
  • text nodes
  • attribute nodes
  • namespace nodes
  • processing instructions
  • comment nodes
No structure sharing. No entity boundaries. Namespace prefixes resolved.
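
The node kinds can be made visible by walking a parsed document. A minimal Python sketch using xml.dom.minidom from the standard library; the DOM is close to the XPath data model but not identical (it has no namespace nodes, for instance), so this is an approximation, and the sample document and the name count_nodes are made up for the illustration.

```python
from collections import Counter
from xml.dom import Node, minidom

KIND = {
    Node.DOCUMENT_NODE: "root",
    Node.ELEMENT_NODE: "element",
    Node.TEXT_NODE: "text",
    Node.PROCESSING_INSTRUCTION_NODE: "processing instruction",
    Node.COMMENT_NODE: "comment",
}

def count_nodes(node, counts=None):
    """Count the nodes of each kind in the tree below node, node included."""
    counts = Counter() if counts is None else counts
    counts[KIND.get(node.nodeType, "other")] += 1
    if node.nodeType == Node.ELEMENT_NODE:
        # Attributes hang off their element; they are not children.
        counts["attribute"] += node.attributes.length
    for child in node.childNodes:
        count_nodes(child, counts)
    return counts

doc = minidom.parseString(
    '<greetings><!-- sample --><?render fast?>'
    '<hello lang="en">Hello, world!</hello></greetings>'
)
counts = count_nodes(doc)
```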

Data model example

greetings.xml drawn as a tree (color-coded).

Data model example

greetings.xml drawn as a sideways tree.

XPath expressions

An expression is a sequence of steps:
/step/step/step/step ...
Each step
  • starts from some current node(s),
  • moves to some result node(s).
Cf. HyTime, TEI location ladders.

XPath steps

A step is
axis::node test [predicate] [predicate] ...
where
axis says what direction to move
node test says which nodes go into result
predicate adds further constraints
E.g. descendant::figure[@rend="svg"]
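
The step above can be tried with Python's xml.etree.ElementTree, which implements a small subset of XPath in its abbreviated syntax only. A minimal sketch on a made-up document; the n attribute is added purely to identify the hits.

```python
import xml.etree.ElementTree as ET

doc = ET.fromstring(
    "<text><div><figure rend='svg' n='1'/><p><figure rend='png' n='2'/></p></div>"
    "<figure rend='svg' n='3'/></text>"
)

# ".//figure[@rend='svg']" is ElementTree's spelling of
# descendant::figure[@rend="svg"], evaluated with <text> as the context node.
hits = doc.findall(".//figure[@rend='svg']")
numbers = [fig.get("n") for fig in hits]
```

The axis (descendant) walks down the tree, the node test (figure) keeps only figure elements, and the predicate keeps only those whose rend attribute is "svg".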

XPath selection axes

  • child (→ e, t, c, p)
  • parent (→ e)
  • attribute (→ a)
  • following, following-sibling (→ e, t, c, p)
  • preceding, preceding-sibling (→ e, t, c, p)
  • self
  • namespace (→ n)
  • ancestor, ancestor-or-self (→ e)
  • descendant, descendant-or-self (→ e, t, c, p)

Simple XPath examples

  • child::Fa (all adverbial-clause children)
  • child::* (all element children)
  • child::text() (all text node children)
  • child::node() (all children)
  • attribute::del
  • attribute::*
  • descendant::N (all noun-phrase descendants)
  • ancestor::S (all sentence [clause] ancestors)
  • ancestor-or-self::S (all sentence [clause] context nodes or ancestors)
  • descendant-or-self::Nr (all context nodes or descendants which are temporal adverbial noun phrases)
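
The descendant and ancestor axes can be emulated in Python to see what they select. A minimal sketch, assuming the parsed sentence is encoded with one element per constituent label and w elements carrying the tag in a t attribute; that encoding, like the helper names, is an assumption made for the illustration. ElementTree keeps no parent pointers, so the ancestor axis needs a parent map.

```python
import xml.etree.ElementTree as ET

# Assumed encoding: one element per constituent label, w elements for words.
SENTENCE = ET.fromstring(
    "<S><N><w t='PPH1'>it</w></N>"
    "<V><w t='VVD'>gained</w>"
    "<N><w t='AT1'>an</w><w t='JJ'>extra</w>"
    "<w t='NNO'>million</w><w t='NN2'>viewers</w></N>"
    "</V></S>"
)

# Child-to-parent map, built once for the whole tree.
PARENT = {child: parent for parent in SENTENCE.iter() for child in parent}

def descendants(node, name):
    """descendant::name (the context node itself is excluded)."""
    return [d for d in node.iter(name) if d is not node]

def ancestors(node, name):
    """ancestor::name, nearest ancestor first."""
    found = []
    while node in PARENT:
        node = PARENT[node]
        if node.tag == name:
            found.append(node)
    return found

noun_phrases = descendants(SENTENCE, "N")
first_word_of_object = SENTENCE.find("V/N/w")
enclosing_clauses = ancestors(first_word_of_object, "S")
```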

Long syntax: more complex

  • self::J (context node if J (adjective phrase), otherwise nothing)
  • child::S/descendant::P
  • child::*/child::N (all N grandchildren)
  • /
  • /descendant::N (all N elements in the document)
  • /descendant::N/child::NN1
  • /descendant::N/child::NN2

Long syntax: predicates

  • child::w[position()=1]
  • child::w[position()=last()-1]
  • child::w[position()>1]
  • following-sibling::w[position()=1]
  • preceding-sibling::N[position()=1]
  • /descendant::S[position()=42]
  • /child::S/child::V[position()=2]/child::Fn[position()=3]
  • child::w[attribute::t="NN1"]
  • child::w[attribute::t='JJ'][position()=2]
  • child::w[position()=5][attribute::t="JJ"]
  • descendant::N[./child::w[attribute::t="AT1"] and ./child::w[attribute::t="JJ"] and ./child::w[attribute::t="NN1"]]
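
Predicates apply from left to right, so their order matters: a positional predicate counts within whatever the earlier predicates have left. A minimal Python sketch with plain lists standing in for the w children of one node; the word list is made up.

```python
# The w children of one hypothetical node, as (form, tag) pairs.
words = [
    ("the", "AT"), ("troubled", "JJ"), ("last", "MD"),
    ("extra", "JJ"), ("million", "NNO"), ("viewers", "NN2"),
]

# child::w[attribute::t='JJ'][position()=2]
# Filter by tag first, then take the second survivor.
adjectives = [w for w in words if w[1] == "JJ"]
second_adjective = adjectives[1:2]

# child::w[position()=5][attribute::t="JJ"]
# Take the fifth word first, then keep it only if it is tagged JJ.
fifth_if_adjective = [w for w in words[4:5] if w[1] == "JJ"]
```

The first expression selects "extra"; the second selects nothing, because the fifth word is "million", which is not tagged JJ.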

XPath short syntax

  • N
  • * (all element children)
  • text() (all text node children)
  • node() (all children)
  • @name
  • @*
  • //S[.//N]
  • //N//S

Sample queries

//S[not(.//N)]
//N[./w[@t='AT1'] and ./w[@t='JJ'] and ./w[@t='NN1']]
w[@t='AT1' and following-sibling::w[1][@t='JJ'] and following-sibling::w[2][@t='NN1']]
//N[./w[@t='AT1' and following-sibling::w[1][@t='JJ'] and following-sibling::w[2][@t='NN1']]]
//w[@t='NP1' and @f=translate(@f, 'abcdefghijklmnopqrstuvwxyz', 'ABCDEFGHIJKLMNOPQRSTUVWXYZ')]
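
Two of the queries above, spelled out procedurally. A minimal Python sketch, assuming a corpus encoded with S and N elements and w elements carrying the tag in a t attribute; the three-sentence corpus is made up. ElementTree's XPath subset has neither not() nor the following-sibling axis, so both conditions are written in plain Python.

```python
import xml.etree.ElementTree as ET

CORPUS = ET.fromstring(
    "<corpus>"
    "<S><N><w t='AT1'>an</w><w t='JJ'>extra</w><w t='NN1'>viewer</w></N></S>"
    "<S><N><w t='AT1'>an</w><w t='NN1'>hour</w><w t='JJ'>late</w></N></S>"
    "<S><V><w t='VVD'>smiled</w></V></S>"
    "</corpus>"
)

def clauses_without_np(root):
    """//S[not(.//N)]: S elements with no N descendant."""
    return [s for s in root.iter("S") if s.find(".//N") is None]

def np_with_tag_sequence(root, tags=("AT1", "JJ", "NN1")):
    """N elements whose w children include the given tags as consecutive w siblings."""
    hits = []
    for n in root.iter("N"):
        seq = [w.get("t") for w in n.findall("w")]
        if any(tuple(seq[i:i + len(tags)]) == tags for i in range(len(seq))):
            hits.append(n)
    return hits

no_np = clauses_without_np(CORPUS)
article_adjective_noun = np_with_tag_sequence(CORPUS)
```

The second function corresponds to the query that uses following-sibling::w[1] and following-sibling::w[2]: the three tags must sit on consecutive w siblings, so the second noun phrase, which has all three tags in a different order, is not selected.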

XPath as query language

XPath and XSLT already widely used for queries.
  • select elements, attributes, strings, numbers
  • select by position in tree, generic identifier, attribute name, value, content
  • co-occurrence constraints
Some drawbacks:
  • no data types
  • no type safety
  • variables are external

XQuery as query language

  • type system
  • FLWOR expressions (FOR ... LET ... WHERE ... ORDER BY ... RETURN ... )
for $d in doc("depts.xml")//deptno
let $e := doc("emps.xml")//emp[deptno = $d]
where count($e) >= 10
order by avg($e/salary) descending
return
   <big-dept>
      {
      $d,
      <headcount>{count($e)}</headcount>,
      <avgsal>{avg($e/salary)}</avgsal>
      }
   </big-dept>
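
For readers more at home in a procedural language, the clauses of the query map onto familiar constructs. A minimal Python sketch of the same logic over already parsed documents, assuming each emp has one deptno child and one numeric salary child; the function name, the wrapping result element and the min_headcount parameter are additions made for the illustration.

```python
import xml.etree.ElementTree as ET
from statistics import mean

def big_departments(depts, emps, min_headcount=10):
    """Python rendering of the FLWOR query; depts and emps are parsed documents."""
    rows = []
    for d in depts.iter("deptno"):                    # for $d in ...//deptno
        e = [emp for emp in emps.iter("emp")          # let $e := ...//emp[deptno = $d]
             if emp.findtext("deptno") == d.text]
        if len(e) >= min_headcount:                   # where count($e) >= 10
            avg = mean(float(emp.findtext("salary")) for emp in e)
            rows.append((avg, d.text, len(e)))
    rows.sort(key=lambda row: row[0], reverse=True)   # order by avg(...) descending
    result = ET.Element("result")                     # wrapper added for convenience
    for avg, deptno, count in rows:                   # return <big-dept>...</big-dept>
        big = ET.SubElement(result, "big-dept")
        ET.SubElement(big, "deptno").text = deptno
        ET.SubElement(big, "headcount").text = str(count)
        ET.SubElement(big, "avgsal").text = str(avg)
    return result

depts = ET.fromstring(
    "<depts><dept><deptno>D1</deptno></dept><dept><deptno>D2</deptno></dept>"
    "<dept><deptno>D3</deptno></dept></depts>"
)
emps = ET.fromstring(
    "<emps>"
    "<emp><deptno>D1</deptno><salary>100</salary></emp>"
    "<emp><deptno>D1</deptno><salary>200</salary></emp>"
    "<emp><deptno>D2</deptno><salary>400</salary></emp>"
    "<emp><deptno>D2</deptno><salary>500</salary></emp>"
    "<emp><deptno>D3</deptno><salary>900</salary></emp>"
    "</emps>"
)
# A threshold of 2 instead of 10 keeps the sample data small.
result = big_departments(depts, emps, min_headcount=2)
```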

Observables

  • XSLT as tool for XML-to-XML transformation
  • easier software development
  • interest in document processing
  • interest in database management systems
  • industrial support

Consequences

  • ubiquity
  • readily available tools
  • pipelining / Lego structures
  • generic, rather than specific, tools

XML Query, XSLT, and XPath

Sample search ...