Practical Extraction of Meaning

from Markup

C. M. Sperberg-McQueen, W3C

Claus Huitfeldt, University of Bergen

* Allen Renear, University of Illinois at Urbana/Champaign

ACH/ALLC 2001

New York University, 15 June 2001



1. Overview

2. Assumptions and Goals

Markup has meaning. We use it to convey
information about the text.
One way to capture that meaning is to list the inferences licensed by markup.
Why?
  • normalization, reduced variation
  • improved retrieval
  • better, smarter tools
  • semantic verification
  • better documentation

3. Assumptions and Goals

Markup has meaning. We use it to convey
information about the text.
One way to capture that meaning is to list the inferences licensed by markup.
Why?
  • normalization, reduced variation
  • improved retrieval
  • better, smarter tools
  • semantic verification
  • better documentation
  • ... but mostly because it's interesting.

4. Related work

  • Simons 1997 (translating between marked up texts and database systems)
  • Sperberg-McQueen and Burnard 1995 (informal introduction to the TEI)
  • Langendoen and Simons 1995 (also TEI)
  • Huitfeldt and others in Bergen
  • Renear and others at Brown University
  • Welty and Ide 1997 (querying using knowledge representation system)
  • Wadler 1999 (XSLT)
  • Ramalho et al. 1999 (semantic verification)
  • Sperberg-McQueen, Huitfeldt, Renear 2000

5. Henry Laurens to Lord William Campbell, 1775

It was be For When we applied to Your Excellency for leave to adjourn it was because we foresaw that we were should continue wasting our own time ...
Papers of Henry Laurens

6. The markup

<p><del>It was be</del> <del>For</del> 
When we applied to Your Excellency for 
leave to adjourn it was because 
we foresaw that we <del>were</del> 
<add>should continue</add> 
wasting our own time ... </p>

7. The immediate inferences

8.

Or, more formally (in Prolog syntax):
  • paragraph(elem-1).
  • deleted(contents(elem-2)).
  • deleted(contents(elem-3)).
  • deleted(contents(elem-4)).
  • added(contents(elem-5)).
Note that we need some way to refer to
  • parts of (components of?) the text
  • words or characters within the content of an element

9. A straw-man proposal

10.

I.e. all properties are inherited.

11. What does property del mean?

Define properties using skeleton sentences: _____ is a paragraph, or _____ has been deleted (or marked as deleted) in the source," ...
Or:
paragraph(____).
deleted(_____).
Fill blanks with reference to element.
In KR, these would be called predicate functions.

12. Problems with the straw-man proposal

It's a start, but:
  • Part and whole are not the same (distributed and non-distributed properties).
  • Some inferences may be incompatible.
  • Properties of n arguments (n > 1)
  • Predicates in real systems frequently take arguments other than "the contents of this element" and "the value of this attribute on this element".

13. Example

Consider:
<doc lang="en">
<p>Wittgenstein wrote:
<q lang="de"><ital>Die Welt ist alles,
was der Fall ist.</ital></q>
It is hard to escape, at first reading,
the suspicion that Wittgenstein is guilty
here of a gross platitude; it is only
after reading the rest of the
<bibl>
<title lang="la">Tractatus</title> 
</bibl> that on returning
to its famous first sentence one appreciates
the depths of its intension.</p>
</doc>

14. Prolog representation

node(n1, element(doc)).
path(n1,[1]).

attr(n1, lang, "en").
node(n3, pcdata("
")).
path(n3,[1,1]).
node(n4, element(p)).
path(n4,[1,1,2]).

node(n5, pcdata("Wittgenstein wrote:
")).
path(n5,[1,1,1]).
node(n6, element(q)).
path(n6,[1,1,1,2]).

attr(n6, lang, "de").
node(n8, element(ital)).
path(n8,[1,1,1,1,1]).

node(n9, pcdata("Die Welt ist alles,
was der Fall ist.")).
path(n9,[1,1,1,1,1]).
node(n10, pcdata("
It is hard to escape, at first reading,
the suspicion that Wittgenstein is guilty
here of a gross platitude; it is only
after reading the rest of the
")).
path(n10,[1,1,3]).
node(n11, element(title)).
path(n11,[1,1,2,4]).

attr(n11, lang, "la").
node(n13, pcdata("Tractatus")).
path(n13,[1,1,2,1]).
node(n14, pcdata(" that on returning
to its famous first sentence one appreciates
the depths of its intension.")).
path(n14,[1,1,5]).
node(n15, pcdata("
")).
path(n15,[1,3]).

15. Straw-man inferences

Immediate inferences:
  • Node n1 is a document (doc(n1).)
  • The document is in English (lang(n1,"en").)
  • Node n4 is a paragraph (p(n4).)
  • Node n6 is a quotation (q(n4).)
  • The quotation is in German (lang(n6,"de").)

16. Calculated inferences

Correctly calculated inferences:
  • The paragraph is in English (lang(n4,"en").)
  • The italics are in German (lang(n8,"de").)
But also some fallacies:
  • The italic phrase is a document (doc(n8).) — part and whole are different.
  • The italics are in English (lang(n8,"en").) — incompatible inferences.
And some gaps:
  • Tractatus is not just a title; it's the title of the book described by the <bibl>.

17. A framework (generic parts)

To describe the meaning of the markup in a document, we will need:
  • representation of the document (Prolog representation / DOM / ...)
  • generic routines for applying predicate functions to document and generating statements about it (ad hoc? XSLT?)

18. Framework (DTD-specific parts)

For each markup language:
  • set of predicate functions (skeleton sentences) describing the meaning to be attached to each construct
  • set of deictic expressions to fill blanks in sentence skeletons / provide arguments for functions
  • categorization of predicates according to inference rules (distributed / non-distributed, ...)
  • (optionally) rules allowing further inferences from the properties directly predicated by the markup (e.g. "if something is an author, and not identified as a corporate author, then it is a person", or "if something is a person, then it is human")

19. System overview

In the abstract, ...

20. Concrete system overview

In practice:

21. A possible goal

Is this the goal?

22. Rejection

It better not be.

23. DTD-specific interpretation

<title level="j">CHum</title>
"node X is a title, with 'level' = 'j'"
"there exists a journal called CHum"

24. A simpler approach

<title level="j">CHum</title>
"node X is a title, with 'level' = 'j'"
"if there exists an element of type 'title' with 'level' = 'j', then there exists a journal whose name is spelled by the contents of the element"

25. Practical extraction of meaning ...

26. Generating the generator

27. An example

Part 1: the document
<doc lang="en">
<p>Wittgenstein wrote:
<q lang="de"><ital>Die Welt ist alles,
was der Fall ist.</ital></q>
It is hard to escape, at first reading,
the suspicion that Wittgenstein is guilty
here of a gross platitude; it is only
after reading the rest of the
<title lang="la">Tractatus</title> that on returning
to its famous first sentence one appreciates
the depths of its intension.</p>
</doc>

28. Part 2: xmltopro.xsl

The XML-to-Prolog translator (fragment):
<xsl:template name="element">
 <xsl:text>node(</xsl:text>
 <xsl:value-of select="generate-id(.)"/>
 <xsl:text>, element(</xsl:text>
 <xsl:value-of select="name()"/>
 <xsl:text>)).&#xA;</xsl:text>
 <xsl:choose>
    <xsl:when test="../..">
     <!--* when we have a parent which has a parent, say who it is *-->
     <xsl:text>parent-child(</xsl:text>
     <xsl:value-of select="generate-id(..)"/>
     <xsl:text>, </xsl:text>
     <xsl:value-of select="generate-id(.)"/>
     <xsl:text>).&#xA;</xsl:text>

     <xsl:text>path(</xsl:text>
     <xsl:value-of select="generate-id(.)"/>
     <xsl:text>,[</xsl:text>
     <xsl:number level="multiple" count="*" format="1,1" />
     <xsl:text>,</xsl:text>
     <xsl:value-of select="position()" />
     <xsl:text>]).&#xA;</xsl:text>

    </xsl:when>
    <xsl:otherwise>
     <!--* when we have a parent which is an element, say who it is *-->
     <xsl:text>document-element(</xsl:text>
     <!--* <xsl:value-of select="./@id"/> *-->
     <xsl:value-of select="generate-id(.)"/>
     <xsl:text>).&#xA;</xsl:text>
    </xsl:otherwise>
 </xsl:choose>
 <xsl:text>&#xA;</xsl:text>
 <xsl:for-each select="@*">
  <xsl:call-template name="attribute"/>
 </xsl:for-each>
</xsl:template>

29. Part 3: XML in Prolog

The Prolog representation of the XML structure:
node(n1, element(doc)).
path(n1,[1]).

attr(n1, lang, "en").
node(n3, pcdata("
")).
path(n3,[1,1]).
node(n4, element(p)).
path(n4,[1,1,2]).

node(n5, pcdata("Wittgenstein wrote:
")).
path(n5,[1,1,1]).
node(n6, element(q)).
path(n6,[1,1,1,2]).

attr(n6, lang, "de").
node(n8, element(ital)).
path(n8,[1,1,1,1,1]).

node(n9, pcdata("Die Welt ist alles,
was der Fall ist.")).
path(n9,[1,1,1,1,1]).
node(n10, pcdata("
It is hard to escape, at first reading,
the suspicion that Wittgenstein is guilty
here of a gross platitude; it is only
after reading the rest of the
")).
path(n10,[1,1,3]).
node(n11, element(title)).
path(n11,[1,1,2,4]).

attr(n11, lang, "la").
node(n13, pcdata("Tractatus")).
path(n13,[1,1,2,1]).
node(n14, pcdata(" that on returning
to its famous first sentence one appreciates
the depths of its intension.")).
path(n14,[1,1,5]).
node(n15, pcdata("
")).
path(n15,[1,3]).

30. Part 4: Formalized TSD for TEI

The formalized tag-set documentation for TEI (in part):
<attribute match="@lang">
<doc>
<p>If an element has a <ident>lang</ident> 
attribute, the attribute value names the 
language of the element's content (unless 
overridden).</p>
</doc>
<rule distributed="true" lang="Prolog">
lang(<de xv="../@id"/>,.) :- !.
</rule>
<doc>
<p>Otherwise, the language is the same as that 
of the parent element.</p>
</doc>
<rule distributed="false" lang="Prolog">
lang(E,L) :- parent(P,E), lang(P,L).
</rule>
</attribute>

31. Part 4b: Formalized TSD for TEI

The formalized tag-set documentation for TEI (formatted):

31.1. @lang

If an element has a lang attribute, the attribute value names the language of the element's content (unless overridden).
lang({../@id,.}) :- !.
Otherwise, the language is the same as that of the parent element.
lang(E,L) :- parent(P,E), lang(P,L).

32. Calculating non-immediate inferences

Non-immediate inferences may be calculated
  • by including appropriate inference rules in Prolog
  • by including appropriate inference rules in XSLT
In Prolog:
lang(E,L) :- parent(P,E), lang(P,L).

33. Non-immediate inferences

In XSLT:
<xsl:template name="calc-lang">
lang(
  <xsl:value-of select="generate-id(.)"/>
 ,
  <xsl:call-template name="inherit-lang"/>
).
</xsl:template>

<xsl:template name="inherit-lang">
 <xsl:choose>
  <xsl:when select="@lang">
   <xsl:value-of select="@lang"/>
  </xsl:when>
  <xsl:otherwise>
   <xsl:for-each select="..">
    <xsl:call-template name="inherit-lang"/>
   </xsl:for-each>
  </xsl:otherwise>
 </xsl:choose>
</xsl:template>
<!--* can you find the bug here? *-->

34. Problems and outlook

  • ontology (what does this mean?)
  • is completeness achievable? is it a goal?
  • housekeeping problems
  • the tenacious agnosticism of TEI

35. Ontology and reference

What do the deictic expressions actually denote?
  • the element (in this encoding)
  • the text (= this rule asserts that in the text there is some (component?) part which is represented by this XML element in this encoding, and that that component has some property.
  • some referent of the text
  • some other encoding or representation of the text
  • the words or characters contained* (except for notes, etc.) within this element (but N.B. deletions, insertions)

36. Ontology examples

  • the element (This TEI header was last updated 2001-06-15.)
  • the text (This work [i.e. Peer Gynt] was created in 1874)
  • some referent of the text (Henrik Ibsen was born in ...)
  • the words or characters contained in this element (The string "Ibsen" is a proper noun here — not the name element)
  • some other encoding or representation of the text (A page-break occurs at this point in the edition of 1874)

37. Completeness

Can we really expect to list "all and only the inferences licensed by the markup"?
No: we cannot list them all (infinite set).
We may be able to identify a basis (a finite set of sentences from which the members of the infinite set follow).
Or maybe not?

38. It's hard to be easy

The TEI's militant agnosticism is hard to model.
Are line-breaks artefactual and inessential aspects of a particular representation of the text?
Or are they essential components of the text?
The TEI does not tell us.
In practice, in the rule for <lb> does (.) refer to the text or to some [other] representation/realization of this text?

39. Outlook

A lot of work remains to be done:
  • complete formalized-TSD for TEI Lite
  • create formalized-TSD for some other DTDs
  • refine DTD for formalized-TSD
  • experiment with calculation of non-immediate inferences (a) in XSLT, (b) in inference engine — what are the tradeoffs, really?
  • experiment with retrieval and other applications
  • theorize the whole enterprise