Meaning and interpretation of markup

C. M. Sperberg-McQueen
World Wide Web Consortium
MIT Laboratory for Computer Science
259 State Road 399
Route 1 Box 380A
Española, NM 87532-9765
Tel. +1 505 747 4224
Fax +1 505 747 1424
Email cmsmcq@acm.org

Claus Huitfeldt
University of Bergen
The HIT Centre
Allégaten 27,
N-5007 Bergen, Norway

Allen Renear
Brown University
Box 1841
92 Thayer Street
Grad Center E
Providence, RI 02912
Email Allen_Renear@brown.edu

This article was published in Markup Languages: Theory & Practice 2.3 (2000): 215-234. A few minor errors have been corrected. (N.B. this electronic copy was derived from the authors' copy and has not yet been proofread in detail against the published article. Corrections made during publication will thus not appear.)


Abstract

SGML and XML markup allows the user to make (licenses) certain inferences about passages in the marked-up material; in particular, markup signals the occurrence of specific features in a document. Some features are distributed, and their occurrences are logically non-countable (italic font is a simple example); others are non-distributed (paragraphs and other standard textual structures, for example).
Formally, the inferences licensed by markup may be expressed as open sentences, whose blanks are to be filled by the contents of an element, by an attribute value, by an individual token of an attribute value, etc. The task of interpreting the meaning of the markup at a particular location in a document may then be formulated as finding the set of inferences about that location which may be drawn on the basis of the markup in the document. Several different approaches to this problem are outlined; one measure of their relative complexity is the complexity of the expressions which are allowed to fill the slots in the open sentences which formally specify the meaning of a markup language.


1. Introduction: The function of markup

Markup is inserted into textual material not at random, but to convey some meaning. [1]
An author may supply markup as part of the act of composing a text; in this case the markup expresses the author's intentions, e.g. as to the structure or appearance of the text. The author creates a section heading, for example, by creating an appropriate element in the document; the content of that element is a section heading because the author says so, and the markup is simply the method by which the author says so. The markup, that is, has performative significance. [2]
In other cases, markup is supplied as part of the transcription in electronic form of pre-existing material. In such cases, markup reflects the understanding of the text held by the transcriber; we say that the markup expresses a claim about the text. The transcriber identifies a section heading in the pre-existing text by transcribing it and tagging it as a section heading; the content of that element is a section heading if the transcriber's interpretation is correct, but other interpreters might disagree; it is plausible to imagine discussions over whether a given way of marking up a text is correct or incorrect. [3]
In the one case, markup is constitutive of the meaning; in the other, it is interpretive. In each case, the reader [4] may legitimately use the markup to make inferences about the structure and properties of the text. For this reason, we say that markup licenses certain inferences about the text. [5]
Let us take a concrete example. Among the papers of the American historical figure Henry Laurens is a draft Laurens prepared of a letter to be sent from the Commons House of Assembly of South Carolina to the royal governor, Lord William Campbell, in 1775. Some words have lines through them, and others are written above the line. The editors of Laurens's papers interpret the lines through words as cancellations, and the words above the lines as insertions; an electronic version of the document using TEI markup and reflecting these interpretations might read thus:
  <p><del>It was be</del> <del>For</del> When we applied
  to Your Excellency for leave to adjourn it was because 
  we foresaw that we <del>were</del> <add>should continue</add> 
  wasting our own time ... </p>
From the del elements, the reader of the document is licensed to infer that the letters It was be, For, and were are marked as deleted; from the add element, the reader may infer that the words should continue have been added. Software might rely on these inferences in the course of making a concordance or displaying a clear text; human readers will rely on them in interpreting the historical document. Note that the markup here stops short of licensing the inference that should continue was substituted for were. The editors could license that inference as well by appropriate markup, if they wished. Human readers may make the inference on their own, given the linguistic context; software cannot safely infer a substitution every time an addition is adjacent to a deletion.
If markup has meaning, it seems fair to ask how to identify the meaning of the markup used in a document, and how to document the meaning assigned to particular markup constructs by specifications of markup languages (e.g. by DTDs and their documentation).
In this paper, we propose an account of how markup licenses inferences, and how to tell, for a given marked up text, what inferences are actually licensed by its markup. We illustrate this account using a simple representation of markup and markup meaning in Prolog. As a side effect of this account of how to interpret markup, we will also provide (1) an account of what is needed in a specification of the meaning of a markup language, (2) a way to determine, for a markup construct in a document, what inferences it licenses, and conversely (3) a way to determine, for any part of a text, what inferences about it are licensed by the markup.
The work described here may thus lead to improved documentation for markup languages; by providing a more formal approach to markup meaning it may make possible a survey of current practice which would be of interest to software developers. A deeper understanding of how to interpret markup should make it easier to normalize markup so as to reduce the variety of ways in which the same information is conveyed within a given body of material, or to improve the performance of query tools. Some applications to problems of semantic verification are also possible.
We begin by describing a simple method of expressing the meaning of SGML or XML element types and attributes; we then identify several ways in which this simple method fails to work in practice. From these failures, we can infer a fundamental distinction between distributive and non-distributive features of texts, which affects the interpretation of markup, as well as some other properties of markup important for its interpretation. We then outline a revised proposal, which is incomplete but which provides a suitable framework for further elaboration; the last section briefly lists some things which must clearly be added to the framework in order to enable the description of existing widely known markup languages.
Related work has been done by [Simons 1997] (in the context of translating between marked up texts and database systems), [Sperberg-McQueen and Burnard 1995] (in an informal introduction to the TEI), [Langendoen and Simons 1995] (also with respect to the TEI), [Huitfeldt 1995] and others in Bergen (in discussions of the Wittgenstein Archive at the University of Bergen, and in critiques of SGML), [Renear et al. 1995] and others at Brown University, and [Welty and Ide 1999] (in a description of systems which draw inferences from markup). Much of this earlier work, however, has focused on questions of subjectivity and objectivity in text markup, or on the nature of text, and the like. The approach taken in this paper is somewhat more formal, while still much less formal and rigorous than that taken by [Wadler 1999] in his recent work on XSLT. Perhaps the closest similarity is to the work of [Ramalho et al. 1999] on semantic verification of marked up documents.

2. A straw-man model of markup interpretation

2.1. Description of the simple model

Let us begin with the simple assumption that start- and end-tags identify some property of the text possessed by the passage which lies between the start-tag and its matching end-tag, and see how far this formulation takes us. It does not, as will be seen, suffice for our purposes, but from observing how it falls short we can see how to build a better account.
Let us consider the example discussed above. We have already described some of the inferences licensed by its markup. How shall we go about deriving those inferences from the marked up document?
It is not our purpose here to propose a specific language or notation for the description of markup languages or for the expression of statements about the document. For purposes of illustration, however, we need some language; we will use Prolog, since it is a convenient and well documented language for the expression of facts and inference rules. For the sake of readers unfamiliar with Prolog, we will provide brief English-language glosses for most Prolog examples; the standard introduction to Prolog is [Clocksin and Mellish 1984].
We represent the document as a tree, in which each node represents either an element or a character; for now, we simply ignore comments, processing instructions, etc. Each node in the tree is represented by a Prolog fact like the following:
node([1,5,2],element(p)).
node([1,5,2,1],element(del)).
node([1,5,2,1,1],pcdata("I")).
node([1,5,2,1,2],pcdata("t")).
node([1,5,2,1,3],pcdata(" ")).
node([1,5,2,1,4],pcdata("w")).
node([1,5,2,1,5],pcdata("a")).
node([1,5,2,1,6],pcdata("s")).
node([1,5,2,1,7],pcdata(" ")).
node([1,5,2,1,8],pcdata("b")).
node([1,5,2,1,9],pcdata("e")).
node([1,5,2,2],pcdata(" ")).
node([1,5,2,3],element(del)).
node([1,5,2,3,1],pcdata("F")).
node([1,5,2,3,2],pcdata("o")).
node([1,5,2,3,3],pcdata("r")).
The first argument of the node predicate is a numeric path expression, a sequence of numbers representing the position of the node in the tree. The path [1,5,2] denotes the second child of the fifth child of the root element; its children are [1,5,2,1], [1,5,2,2], etc. The second argument is a structure, either a term with the functor element and an argument showing the generic identifier of the element's type, or else a term with the functor pcdata with an argument of a single character. The facts above can be read "There is a node at location 1.5.2, and it is an element of type p," "There is a node at location 1.5.2.1.1, and it is the character I," and so on. [6]
Attributes are represented by facts using the predicate attr:
attr([1,5,2],id,implied).
attr([1,5,2],n,implied).
attr([1,5,2],lang,implied).
attr([1,5,2],rend,implied).
attr([1,5,2],teiform,"p").
The first argument is the path expression for the node; the second is the attribute name; the third is either the keyword implied or else the value represented as a quoted string. These facts can be read "The element at node 1.5.2 has an attribute named id, the value of which is implied," and "The element at 1.5.2 has an attribute named teiform, the value of which is the string "p"."

2.2. Inferences licensed by element types

We assume that each element in the document licenses the inference that the contents of that element have some property; for convenience, we use the generic identifier of the element type to denote the property associated with elements of that type (with element type p, we associate property p, and so on).
Interpreting the markup in the document in this way, we can generate an appropriate set of facts:
p([1,5,2]).
del([1,5,2,1]).
del([1,5,2,3]).
add([1,5,2,97]).
This notation is convenient for checking to see whether a known location has some specific property, or for finding all the locations which have some specific property. It is less convenient for finding all the properties which apply to some known location, however, so in practice we will express these facts in a different notation:
property_applies(p,[1,5,2]).
property_applies(del,[1,5,2,1]).
property_applies(del,[1,5,2,3]).
property_applies(add,[1,5,2,97]).
In this notation, the first argument is the name of the property predicated of some location; the second argument is the path expression for that location. These facts can be read "The property p is predicated of node 1.5.2," and so on.
We can specify the meaning of properties like p, del, etc. by formulating sentences with blanks (or, as we shall call them, open sentences), such as "_____ is a paragraph," or "_____ has been deleted (or marked as deleted) in the source," or "____ has been inserted later (e.g. above the line)." The blanks are to be filled in with some reference to an element; we can point at a copy of the document and say "This element is a paragraph," or we can refer to the element by its path expression, as shown in the Prolog examples above.
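The pairing of properties with open sentences can be sketched in the same Prolog notation. The following is a minimal illustration under stated assumptions: the rule deriving property_applies from node facts, and the predicates open_sentence/3 and claim/2, are hypothetical names introduced here, and only four sample nodes are included.

```prolog
% Sample node facts, following the representation described above.
node([1,5,2],    element(p)).
node([1,5,2,1],  element(del)).
node([1,5,2,3],  element(del)).
node([1,5,2,97], element(add)).

% Straw-man rule (assumption): every element predicates of its own
% location the property named by its generic identifier.
property_applies(Property, Loc) :-
    node(Loc, element(Property)).

% Open sentences, one per property (hypothetical predicate).
% The second argument is the blank; it is filled by a reference
% to the element.
open_sentence(p,   Ref, sentence(Ref, 'is a paragraph')).
open_sentence(del, Ref, sentence(Ref, 'has been deleted (or marked as deleted) in the source')).
open_sentence(add, Ref, sentence(Ref, 'has been inserted later (e.g. above the line)')).

% claim(?Loc, ?Sentence): the sentence obtained by filling the blank
% of the relevant open sentence with the path expression Loc.
claim(Loc, Sentence) :-
    property_applies(Property, Loc),
    open_sentence(Property, Loc, Sentence).
```

With these facts, the query ?- claim([1,5,2,97], S). binds S to sentence([1,5,2,97], 'has been inserted later (e.g. above the line)'), which can be read "The element at 1.5.2.97 has been inserted later."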

2.3. Inferences licensed by attributes

The simple model outlined above for elements can be extended slightly to handle attributes. The lang attribute defined by the TEI or by HTML (or the xml:lang attribute defined as part of XML), for example, identifies the language in which the contents of the element are expressed. A specification like lang="eng" licenses the inference that the text so described is written in English. We could write:
property_applies(english,[1]).
or
english([1]).
to record the fact that the document is in English.
With many attributes, however, including lang, it is simpler to view them as expressing not a single-argument predicate, but a two-argument predicate: one argument is the contents of the element on which the attribute appears, the other the value of the attribute. So we will prefer to write forms like these:
language([1],english).
property_applies(language,[1],english).
In order to have a closer parallel with the one-argument properties, however, we choose yet another form for these facts:
property_applies(language(english),[1]).
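Facts of this last form could be derived from attr facts by a rule such as the one sketched below. This is an illustration only: the table language_name/2 is hypothetical, and atoms such as eng stand in for the quoted strings used elsewhere in this paper.

```prolog
% Sample attribute facts (atoms stand in for quoted strings).
attr([1],     lang, eng).
attr([1,5,2], lang, implied).

% Hypothetical table mapping language codes to language names.
language_name(eng, english).
language_name(deu, german).

% The lang attribute read as a relation between a location and a
% language, packaged as a one-argument property language(Language).
% Implied values license no inference at this stage.
property_applies(language(Language), Loc) :-
    attr(Loc, lang, Code),
    Code \== implied,
    language_name(Code, Language).
```

The query ?- property_applies(language(L), [1]). then yields L = english, while the same query for location [1,5,2] fails, since no value is specified there.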

2.4. Referring to an element and its contents

Note that the facts we have been expressing using the Prolog predicate named property_applies all concern particular locations in the document, identified by their path expressions, rather than (as might have been expected) specific words or strings. We write property_applies(del,[1,5,2,1]), not property_applies(del,"It was be"). The reason is simple: when the same word or string appears more than once in the document, it need not occur in elements of the same type. The property del applies not to the string "It was be", but to one particular occurrence of that string (to use the common terminology, it applies to a token, not to a type). With a blackboard and a diagram of the document tree, one can point to a specific node and say "This node has property del." In writing, the path expressions come as close as we can get to the gesture of pointing and saying "This node".

2.5. Propagating inferences downwards (inherited properties)

Now, if it is true that the phrase "It was be" was deleted in the document, it follows that the word "It" in that phrase was deleted, and also that the letter "I" of that word was deleted. Similarly, if the phrase "should continue" was inserted above the line in the document, then the word "should" in that phrase, and the letter "s" of that word, also have the property of having been inserted above the line. And so on. The same holds true for the language of the document: the entire document is in English, and so is the first paragraph, and the first sentence of that paragraph, and the first word of that sentence. (Some may wish to stop short of the claim that the individual characters of the text are in one language or another.)
That is, the property of "having been deleted", or of "having been inserted above the line", or of "being in English", which has formally been attributed to a particular node in the document tree, must be propagated downwards: the property holds for all of the contents of the node. For practical reasons (avoiding circularity), we distinguish between facts which have been read directly from the markup, as it were, and inferences we may make on the basis of those facts. For the former, we use the Prolog predicate property_applies, as shown above. For the latter, we use the predicate infer, which can be defined as follows in Prolog.
infer(Property,Loc) :- node(Loc,element(Property)).
infer(Property,Loc) :- 
     node(Anc,element(Property)),
     descendant(Loc,Anc).
A property can be inferred for a particular location if that location is an element and its generic identifier is the name of the property (or in other terms, if it is directly predicated for that location). It can also be inferred for that location if it is predicated for some other location of which that location is a descendant. Checking for the ancestor-descendant relation is simple, using path expressions: if the second path is a proper prefix of the first path, the two paths denote a descendant and an ancestor.
descendant([H,_|_],[H]).
descendant([H|TDesc],[H|TAnc]) :- descendant(TDesc,TAnc).
(Roughly: the first argument denotes a descendant of the second argument if the two paths start with the same sequence of numbers and the second one ends first.)
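The same proper-prefix test can also be phrased with the list predicate append/3. The sketch below repeats the two clauses given above and sets an alternative beside them; the name descendant_by_prefix is hypothetical, and append/3 is assumed to be available as a built-in or library predicate, as it is in most Prolog systems.

```prolog
% The two clauses given in the text.
descendant([H,_|_], [H]).
descendant([H|TDesc], [H|TAnc]) :- descendant(TDesc, TAnc).

% Alternative sketch: Anc is a non-empty proper prefix of Desc,
% i.e. Desc is Anc followed by at least one further step.
descendant_by_prefix(Desc, Anc) :-
    Anc = [_|_],
    append(Anc, [_|_], Desc).
```

Both versions succeed for ?- descendant([1,5,2,1],[1,5,2]). and fail when the two paths are identical, so that a node does not count as its own descendant. Given a bound first argument, both also enumerate the ancestors of a node: for [1,5,2,1] they return [1], [1,5], and [1,5,2].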
Properties expressed by attributes require a different definition.
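% Directly specified: Loc itself carries attribute Att with value Val.
infer(Prop,Loc) :- 
     attr(Loc,Att,Val),
     \+ Val = implied,
     Prop =.. [Att,Val].
% Inherited: some ancestor Anc of Loc carries the attribute.
infer(Prop,Loc) :-
     attr(Anc,Att,Val),
     \+ Val = implied,
     Prop =.. [Att,Val],
     descendant(Loc,Anc).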
(Infer a property Prop for location Loc if Loc, or some ancestor of Loc, has an attribute Att with a specified (non-implied) value Val, and Prop has the form Att(Val).)

2.6. Summary and illustration

Summarizing this first straw-man proposal, we can say:
  • Each element type E is associated with some property, denoted prop(E).
  • For each instance of element type E in a document, if property P is prop(E), then we can infer that property P is true of E and of all descendants of E.
  • Each attribute A is associated with a two-argument property, which we will denote prop(A).
  • For each element E which has a value for attribute A, if P = prop(A) and V is the value of A on E, then property P(V) is true of E and of all its descendants.
  • The set of inferences licensed by the markup for some location L can be generated thus: for each element E which is an ancestor of L in the document tree, identify each property P which is attributed to E, and assert that L has that property: P(L) for one-argument properties, or P(V,L) for two-argument properties.
Given the full document from which the Laurens example is taken, and knowing that the paragraph in the example has the path address 1.5.2, we can ask a Prolog system what properties can be predicated for that element:
?- infer(Property,[1,5,2]).
Property = p ->;
Property = doc ->;
Property = docbody ->;
Property = teiform([112]) ->;
Property = id([72,76,49,48,51,48,53]) ->;
Property = lang([101,110,103]) ->;
no
?-
The lists of numbers are the default Prolog representation for the strings "p", "HL10305", and "eng". In this case, the paragraph in our example has the properties p (paragraph), doc (document), docbody (body of the document), teiform("p"), id("HL10305"), and lang("eng").
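For readers who wish to see such answers in a more legible form, the character-code lists can be converted to atoms with the standard built-in atom_codes/2. The helper below is a sketch; the names readable/2 and is_code_list/1 are hypothetical, and the property passed in is assumed to be fully instantiated.

```prolog
% readable(+Prop, -Readable): if Prop has the form Name(Codes), with
% Codes a list of integers, replace the list by the corresponding atom;
% otherwise return Prop unchanged.
readable(Prop, Readable) :-
    Prop =.. [Name, Codes],
    is_code_list(Codes),
    !,
    atom_codes(Atom, Codes),
    Readable =.. [Name, Atom].
readable(Prop, Prop).

% is_code_list(+List): List is a proper list of integers.
is_code_list([]).
is_code_list([C|Cs]) :- integer(C), is_code_list(Cs).
```

For example, ?- readable(id([72,76,49,48,51,48,53]), R). yields R = id('HL10305'), and ?- readable(p, R). yields R = p.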
We can also seek properties associated with the paragraph or with any of its descendants:
?- infer(Property,[1,5,2|Tail]).

Property = p
Tail = [] ->;

Property = del
Tail = [1] ->;

Property = del
Tail = [3] ->;

Property = del
Tail = [95] ->;

Property = add
Tail = [97] ->;

Property = person
Tail = [184] ->;

Property = del
Tail = [318] ->;

Property = add
Tail = [320]
->
The inquiry asks to see what properties are associated with any node which begins with the prefix [1,5,2]; the results show the continuation of the path (the Tail) and the property.
Or we can ask which locations have the property del:
?- infer(del,Loc).
Loc = [1,5,2,1] ->;
Loc = [1,5,2,3] ->;
Loc = [1,5,2,95] ->;
Loc = [1,5,2,318] ->;
Loc = [1,5,2,348] ->;
Loc = [1,5,2,717] ->;
Loc = [1,5,2,719,57] ->;
Loc = [1,5,2,866] ->;
Loc = [1,5,2,917] ->
...
Formally, we can describe the straw-man model this way:
  1. The meaning of every element type is expressed by a one-argument predicate whose argument identifies the contents of the element (in our illustrative notation, the argument is the path expression for the element or for any descendant of the element). Equivalently, we can say that the meaning of each element type can be described by an open sentence whose single unbound variable is to be bound to the contents of the element.
  2. The meaning of every attribute is expressed by an open sentence with two unbound variables, one of which is to be bound to the contents of the element and the other to the value of the attribute. In predicate terms, each attribute defines some relation R which holds between the contents of the element and the value of the attribute.
  3. All inferences licensed by any two elements are compatible; the set of inferences to be drawn about any location is the set of predicates asserted for the location or for any ancestor of that location.

3. Problems with the straw-man proposal

There are several problems with this simple model in practice:
  • What is true of an element as a whole is not necessarily true of each part of the element's contents.
  • The inferences licensed by different constructs are not always compatible.
  • Some elements express properties which require more than one argument; some attributes express properties with more than two arguments.
  • The predicates expressed by markup in real systems frequently take arguments other than "the contents of this element" and "the value of this attribute on this element".
We discuss these problems in the following sections.

3.1. Distributed and non-distributed features

The straw-man proposal assumes that if an element has some property, then this property applies equally to each part of the contents. This is true for some properties (like del, the property of having been deleted), but not true for some other properties. The words "I have a dream", for example, are a sentence, but the word "I" in that sentence does not itself have the property of being a sentence, only the property of being part of, or occurring within, a sentence.
We therefore must distinguish distributed properties like del from non-distributed properties like sentence-hood.
Non-distributed properties are true of the element as a whole, but not true of all of the individual words or characters of the content. From the markup
  <P>Reader, I married him.</P>
we can infer the existence of one paragraph, but we cannot infer that the word Reader is itself a paragraph. We can, however, infer that it has the property of being within a paragraph.
Distributed properties, by contrast, are true not only of the element as a whole, but also true of each part of the element. Consider the following example from Tristram Shandy.
<hi rend="gothic">And this Indenture 
further witnesseth</hi> that the said
<hi rend="italic">Walter Shandy</hi>,
merchant, in consideration of the said
intended marriage ...
On the straw-man model, we can infer both that the words And this are rendered in black-letter ('gothic') type and that the word Indenture is similarly rendered. That is, the example as given above is strictly synonymous with the following example: [7]
  <P><HI REND="gothic">And</HI> 
  <HI REND="gothic">this</HI> 
  <HI REND="gothic">Indenture</HI> 
  <HI REND="gothic">further</HI> 
  <HI REND="gothic">witnesseth</HI> that the said
  <HI REND="italic">Walter Shandy</HI>,
  merchant, in consideration of the said
  intended marriage ... </P>
It makes no difference whether the phrase And this Indenture further witnesseth occurs in one or five hi elements: the property of typographic highlighting is distributed equally across every word (in fact, every letter) of the contents. It is as true to say "The word And is in black-letter" as to say "The phrase And this Indenture further witnesseth is in black-letter."
In general, a single hi element containing a phrase is equivalent to a series of adjacent hi elements with the same rend value. Similar statements may be made about any distributed property. In fact, we can define distributed properties as those for which such statements can be made. If an element type x marks a distributed property, then any two adjacent x elements may be joined, or one x element may be split: <x>abc</x><x>def</x> means exactly the same thing as <x>abcdef</x>.
A consequence of this fact is that elements marking distributed properties are not usefully countable, since they could be split or joined without changing the meaning of the text. Elements marking non-distributed properties, by contrast, are usefully countable. [8]
The rules of inference given above must be modified to take into account the distinction between distributed and non-distributed properties, and we need a way to record for each property, in the markup-language documentation, whether it is distributed or non-distributed.
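One way such a modification might look is sketched below. It is an illustration under stated assumptions rather than a finished proposal: the declarations distributed/1 and non_distributed/1 stand for information that the documentation of the markup language would have to supply, the derived property within(P) is a hypothetical name for "occurring within an element with property P", and atoms stand in for single-character strings.

```prolog
% Sample nodes from the Laurens example.
node([1,5,2],     element(p)).
node([1,5,2,1],   element(del)).
node([1,5,2,1,1], pcdata('I')).
node([1,5,2,1,2], pcdata(t)).

% Hypothetical declarations, one per property.
distributed(del).
distributed(add).
non_distributed(p).

descendant([H,_|_], [H]).
descendant([H|TDesc], [H|TAnc]) :- descendant(TDesc, TAnc).

% A property always holds of the element that carries it.
infer(Property, Loc) :-
    node(Loc, element(Property)).
% A distributed property also holds of every descendant.
infer(Property, Loc) :-
    node(Anc, element(Property)),
    distributed(Property),
    node(Loc, _),
    descendant(Loc, Anc).
% A non-distributed property yields, for descendants, only the weaker
% property of occurring within such an element.
infer(within(Property), Loc) :-
    node(Anc, element(Property)),
    non_distributed(Property),
    node(Loc, _),
    descendant(Loc, Anc).
```

Under these rules, ?- infer(del,[1,5,2,1,1]). succeeds (the letter I has been deleted), ?- infer(p,[1,5,2,1,1]). fails (the letter I is not a paragraph), and ?- infer(within(p),[1,5,2,1,1]). succeeds (the letter I occurs within a paragraph).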

3.2. Overrides and incompatibilities

The second problem in the straw-man proposal is that it assumes that if markup licenses the inference that some element E has some distributed property P, then all of the contents of E have property P, regardless of what else might be in the document.
Consider the following sample document, which uses a TEI-like lang attribute:
<doc lang="en">
<p>Wittgenstein wrote:
<q lang="de"><ital>Die Welt ist alles,
was der Fall ist.</ital></q>
It is hard to escape, at first reading,
the suspicion that Wittgenstein is guilty
here of a gross platitude; it is only
after reading the rest of the
<title lang="la">Tractatus</title> that on returning
to its famous first sentence one appreciates
the depths of its intension.</p>
</doc>
Given the definition of lang above, we are licensed to infer, from this document, that the contents of the doc element (path 1) are in English, and that the contents of the q element (path 1.1.22) are in German:
?- infer(lang("en"),[1]).
yes
?- infer(lang("de"),[1,1,22]).
yes
?-
The q element, however, is itself contained indirectly in the doc element.
The straw-man model, which simply takes the union of all the inferences licensed by the enclosing elements, sees no problem here, and happily infers (1) that the words "Die Welt ist alles, was der Fall ist" are in German (inference licensed by the lang attribute on the q element), and (2) that the same words are in English (inference licensed by the lang attribute on the doc element).
?- infer(lang("en"),[1,1,22]).
yes
?-
Our knowledge of natural languages allows us to observe that these two inferences are in fact incompatible, but we would prefer to find a way to notice this incompatibility without relying on human intelligence.
The rules of inference given above must be modified to take into account the fact that some inferences are legitimate only if no overriding information is available. In particular, for inherited properties like lang or (in HTML) b (bold) or i (italic), it is legitimate to infer a property for a location if that property is predicated of some ancestor and no intervening ancestor predicates a conflicting property.
To enable this, we need the ability, in the markup language documentation, to specify which properties are compatible with each other and which override or conflict with each other.
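For the simple case in which any two different values of the same attribute conflict, the override rule can be sketched as a search for the nearest ancestor-or-self which specifies a value. The sketch below makes that assumption explicit and covers only lang; the predicates nearest_value/3 and parent/2 are hypothetical names, atoms stand in for quoted strings, only the nodes needed for the illustration are listed, and append/3 is assumed to be available.

```prolog
% Selected nodes and attributes of the sample document.
node([1],        element(doc)).
node([1,1],      element(p)).
node([1,1,22],   element(q)).
node([1,1,22,1], element(ital)).
attr([1],        lang, en).
attr([1,1],      lang, implied).
attr([1,1,22],   lang, de).
attr([1,1,22,1], lang, implied).

% parent(+Loc, -Parent): drop the last step of a path; the root
% has no parent.
parent(Loc, Parent) :-
    append(Parent, [_], Loc),
    Parent = [_|_].

% nearest_value(+Att, +Loc, ?Val): Val is the value of Att specified
% on Loc itself or, failing that, on the nearest ancestor which
% specifies one. A nearer value overrides all more distant ones.
nearest_value(Att, Loc, Val) :-
    attr(Loc, Att, V),
    V \== implied,
    !,
    Val = V.
nearest_value(Att, Loc, Val) :-
    parent(Loc, Parent),
    nearest_value(Att, Parent, Val).

% Assumption: lang is inherited, and different values conflict.
infer(lang(Val), Loc) :-
    nearest_value(lang, Loc, Val).
```

With this definition ?- infer(lang(X),[1,1,22,1]). yields only X = de, and ?- infer(lang(en),[1,1,22]). fails, while ?- infer(lang(en),[1,1]). still succeeds by inheritance from the doc element.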

3.3. N-ary predicates

The third major problem with the straw-man proposal is that it assumes that all elements express single-argument predicates, and all attributes two-argument predicates, and that the arguments are invariably "the contents of this element" and "the value of this attribute". In practice, predicates which express the meaning of markup constructs often take larger numbers of arguments, and the arguments are often somewhat more varied and complex.
In a TEI-encoded bibliography, for example, the title element within each bibl element normally indicates not only that the contents of the element are a title, but also that they are the title of the item described by the bibl element. It is not hard to stipulate a form for predicates with arities of two, three, or higher; several obvious approaches may be mentioned.
We can simply define various forms of the property_applies predicate to accommodate multiple arguments. If element 1.2.5.3 is a bibl element, and 1.2.5.3.4 is its title element, we might write
property_applies(title_of,[1,2,5,3,4],[1,2,5,3]).
Or in order to simplify later search of the database and manipulation of terms, we might make all occurrences of property_applies take exactly two arguments, one identifying the property and the other a list of arguments:
property_applies(title_of,[[1,2,5,3,4],[1,2,5,3]]).
Or we can use a complex expression to identify the property being predicated of a location, as shown above in the discussion of attributes. Instead of naming the property with a simple atom, we name it with a functor which takes one or more arguments, e.g. lang("en"). Our bibliographical example would then be
property_applies(title_of([1,2,5,3]),[1,2,5,3,4]).
in which the property predicated of element 1.2.5.3.4 is the property "being the title of the bibliographic item represented by element 1.2.5.3". This method is well known from functional programming; it amounts to replacing an n-ary function with a function of arity n-1 which in turn returns a unary function. We have split off the first argument of the title_of predicate; we could equally well split off the other, to yield
property_applies(has_title([1,2,5,3,4]),[1,2,5,3]).
which predicates, of element 1.2.5.3, the property of "being a bibliographic item whose title is in element 1.2.5.3.4".
The choice of notation for such n-ary predicates will affect the convenience of a system, but each of the notations shown conveys the same basic information.
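That the notations are interconvertible can be illustrated with the Prolog operator =.. (which relates a term to the list of its functor and arguments). The sketch below starts from the list notation and derives the form in which the property is named by a functor; the predicate name curried/2 is hypothetical.

```prolog
% The bibliographic example in list notation.
property_applies(title_of, [[1,2,5,3,4],[1,2,5,3]]).

% curried(?Prop, ?Loc): the same fact, with the first argument split
% off as the location and the remaining arguments folded into the
% term which names the property.
curried(Prop, Loc) :-
    property_applies(Name, [Loc|Rest]),
    Prop =.. [Name|Rest].
```

The query ?- curried(P, L). yields P = title_of([1,2,5,3]) and L = [1,2,5,3,4], which corresponds to the fact property_applies(title_of([1,2,5,3]),[1,2,5,3,4]) shown above.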

3.4. Arguments of predicates and deictic expressions

The straw-man proposal defines the meaning of markup constructs by means of open sentences with single blanks. In the straw-man model, the blank is always filled in the same way: with an expression denoting a node of which the property is predicated. To support the n-ary expressions which are necessary for real markup languages, we need some way to identify what kind of information belongs in each blank in the open sentence. If we associate each blank with such an expression, the result will be what we call a sentence skeleton. A sentence skeleton is not a sentence, but it can become a sentence when it is fleshed out by filling in the blanks.
The expressions which specify what goes in the blanks will typically need to have meanings like "the contents of this element" or "the value of attribute A on this element", or "the nearest ancestor of type bibl". They can be thought of as starting from some point of reference ("this element") and pointing or gesturing toward some other part of the document. The standard linguistic term for such location-relative pointing is deixis; expressions which perform it are called deictic expressions.
Markup languages vary in the forms of deixis they require. For simple languages, it might suffice to have expressions like contents(this) or value(attribute-name,this) (as we might formalize the expressions needed for the straw-man model). For the TEI, it is clear that we will also need to be able to express concepts like first-ancestor(ancestor-gi,this) which denotes the nearest ancestor from the reference point which has a given generic identifier.
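A small evaluator for these three deictic expressions, and its use in fleshing out a sentence skeleton, might be sketched as follows. Everything here is illustrative: the predicates deixis/3, skeleton/3, fill/2, fill_blanks/3 and parent/2 are hypothetical names, first_ancestor is written with an underscore to make it a legal Prolog atom, the sample facts are assumed, and append/3 is assumed to be available.

```prolog
% Sample facts: a bibl element and its title child.
node([1,2,5,3],   element(bibl)).
node([1,2,5,3,4], element(title)).
attr([1,2,5,3,4], lang, en).

% parent(+Loc, -Parent): drop the last step of a path.
parent(Loc, Parent) :-
    append(Parent, [_], Loc),
    Parent = [_|_].

% deixis(+Expression, +This, -Referent): evaluate a deictic
% expression relative to the reference point This.
deixis(contents(this), This, This).
deixis(value(AttName, this), This, Val) :-
    attr(This, AttName, Val).
deixis(first_ancestor(GI, this), This, Anc) :-
    parent(This, Parent),
    (   node(Parent, element(GI))
    ->  Anc = Parent
    ;   deixis(first_ancestor(GI, this), Parent, Anc)
    ).

% A sentence skeleton: element type, relation, and one deictic
% expression per blank.
skeleton(title, title_of, [contents(this), first_ancestor(bibl, this)]).

% fill(+Loc, -Sentence): flesh out the skeleton for the element at Loc.
fill(Loc, Sentence) :-
    node(Loc, element(GI)),
    skeleton(GI, Relation, Blanks),
    fill_blanks(Blanks, Loc, Args),
    Sentence =.. [Relation|Args].

fill_blanks([], _, []).
fill_blanks([E|Es], Loc, [A|As]) :-
    deixis(E, Loc, A),
    fill_blanks(Es, Loc, As).
```

The query ?- fill([1,2,5,3,4], S). yields S = title_of([1,2,5,3,4],[1,2,5,3]), that is, "the contents of 1.2.5.3.4 are the title of the item described by 1.2.5.3".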
A suitable set of deictic expressions, or a suitable small language for the construction of deictic expressions, will make it possible to capture, in sentence skeletons, the meaning of various common markup idioms which cannot be captured by the straw-man model. Among the idioms which any general account of meaning in SGML and XML must handle are:
  • context dependency: the meaning of an element may depend on its context; trivial examples include the TEI head element and TEI's hi and foreign, which can mean 'not-Roman' and 'not-English' in one context, and 'not-italic' and 'not-German' in others.
  • ordinal position, relative or absolute; dependence of meaning upon ordinal position is seldom an explicit feature of markup languages, but dependence of processing based on position is a standard feature of style-sheet languages.
  • milestone elements; these convey information by position in the beginning-to-end scan of the linear form of the document, rather than by position in the tree.
  • linking: out-of-line or 'standoff' markup conveys information about location L based not only on open elements, but on elements which point at L or some ancestor of L.
  • upward propagation: the meaning of an element may depend in part on its contents; this is unusual in colloquial SGML/XML systems, but is a regular feature of proposals to eliminate attributes from markup languages.
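Of the idioms just listed, the milestone case is perhaps the least tree-like, and a small sketch may make it concrete. The Python fragment below finds the milestone most recently passed in a start-to-end scan, relying on the fact that ElementTree's iter() visits elements in document order. The pb element name, the sample document, and the function name are assumptions made for illustration; the position of the target is taken to be that of its start-tag.

```python
import xml.etree.ElementTree as ET

# Hypothetical sample: pb is an empty page-break milestone.
SAMPLE = (
    '<div><p>one <pb n="1"/>two</p>'
    '<p>three <pb n="2"/>four</p><p>five</p></div>'
)

def last_milestone(root, target, milestone_gi):
    """Return the most recent milestone_gi element that precedes
    target in document order, or None if there is none."""
    latest = None
    for node in root.iter():  # iter() walks in document order
        if node is target:
            return latest
        if node.tag == milestone_gi:
            latest = node
    return None  # target was not found in this tree

root = ET.fromstring(SAMPLE)
paragraphs = root.findall("p")
```

With this sample, the first paragraph begins before any page break, the second begins on page 1, and the third on page 2, although no pb element is an ancestor of any paragraph. The information is recovered from position in the linear scan alone.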
Other methods of associating markup with meaning are imaginable, but we believe a survey of existing DTDs will show that all or virtually all current practice is covered by any model of interpretation which encompasses the complications just outlined, and possibly by a model which encompasses even fewer of them.
It is clear that the idioms described above can all be captured by expressions in the W3C's XPath expression language, or in the TEI's extended-pointer notation, or in the caterpillar language defined by [Brüggemann-Klein and Wood 2000]. We conjecture that actual markup languages now in common use will need only a small portion of these languages. If by "the meaning of markup" we mean something other than the rules for typographic style, then we are unaware of any markup language in wide use which assigns different meanings to the same element type depending on its ordinal position. It is an open question how much of XPath or the extended-pointer notation is actually needed to document existing markup languages; this would be a useful topic for further empirical research.
We observe that a simple measure of the complexity of the semantics associated with an element type or attribute may be found in the number of blanks in the sentence skeleton, in the complexity of the deictic expressions which are to fill them, and in the amount or kind of memory required to allow full generation of the inferences licensed by markup in a particular text. Similar measures may be applied to markup languages as a whole or to various ways of marking up the same text (what is the most complex deictic expression in the definition of the language or the document being marked up? What is the total complexity of all the deictic expressions for all the slots of all the elements and attributes? etc.).
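One way such measures might be computed is sketched below in Python. Everything here is an assumption made for illustration: sentence skeletons are modelled as a predicate name plus one deictic expression per blank, deictic expressions as nested tuples, and the "complexity" of an expression as the number of operators it applies. The memory requirement mentioned above is not modelled.

```python
# A skeleton is modelled (hypothetically) as a predicate name plus one
# deictic expression per blank; an expression is a nested tuple whose
# first item is the operator name, e.g. ("contents", "this").
SKELETONS = {
    "title": ("is_title_of", [("contents", "this"),
                              ("first-ancestor", "bibl", "this")]),
    "hi": ("is_highlighted", [("contents", "this")]),
    "author": ("wrote", [("contents", "this"),
                         ("value", "n",
                          ("first-ancestor", "bibl", "this"))]),
}

def expression_size(expr):
    """Count operator applications in one deictic expression."""
    if not isinstance(expr, tuple):
        return 0  # a bare argument such as "this" or a GI
    return 1 + sum(expression_size(arg) for arg in expr[1:])

def skeleton_complexity(skeleton):
    """Return (number of blanks, total expression size)."""
    _predicate, blanks = skeleton
    return len(blanks), sum(expression_size(b) for b in blanks)

def language_complexity(skeletons):
    """Largest single-expression size and total size over a language."""
    sizes = [expression_size(b) for _, blanks in skeletons.values()
             for b in blanks]
    return max(sizes, default=0), sum(sizes)
```

Other size functions (nesting depth, for instance, or the class of path language required) could be substituted for expression_size without changing the overall shape of the calculation.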

4. A framework for describing the meaning of markup

The description of the straw-man proposal and its problems allows us to sketch a general framework for describing the meaning of markup, one which we believe will be adequate, if not for all possible markup languages, at least for the publicly documented markup languages in widest use today.
To describe the meaning of the markup in a document, it suffices to generate the set of inferences about the document which are licensed by the markup. In some ways, we can regard the meaning of the markup as being constituted, not only described, by that set of inferences. We propose a system for generating that set of inferences which consists of:
  • some representation of the document (analogous to the Prolog representation described above, or to the representation provided by the W3C Document Object Model)
  • a set of sentence skeletons describing the meaning to be attached to each construct in the markup language (a sentence skeleton is a natural-language sentence, or a Prolog predicate, or an expression in some other formal system, which contains blanks which are to be filled in with information from the document itself; each blank is associated with a deictic expression showing how to fill it in)
  • some set of deictic expressions which can be associated with the blanks in the sentence skeletons; in the discussion above, we have seen the need for expressions meaning "the contents of this element", "the value of this attribute", "the nearest ancestor of type bibl", etc.
  • some categorization of predicates according to the rules governing inferences from them; the distinction between distributed and non-distributed properties described above is a simple example
  • some generic routines for generating statements about the document by applying the skeleton sentences to it and substituting appropriate concrete values (e.g. path expressions, strings, etc.) for the deictic expressions in the sentence skeletons
  • (optionally) rules allowing further inferences from the properties directly predicated by the markup (e.g. "if something is an author, and not identified as a corporate author, then it is a person", or "if something is a person, then it is human")
The systematic representation of marked up documents in electronic form has been the topic of a great deal of discussion and need not occupy us here.
The form to be used for the sentence skeletons appears to depend, at first glance, upon the system to be used to derive inferences about the document. If Prolog is to be used for inferencing (as above), then the set of basic sentences about the document will need to be in Prolog syntax. Other inference engines will have their own input syntax. For direct interfaces to the user, it may be desirable to be able to formulate sentences in English or some other natural language. In practice, however, the form of the sentence skeletons depends less on the inference engine than upon the method to be used to generate full sentences, by filling in the blanks in the sentence skeletons with information from the document. A Prolog-based blank-filler will naturally require the sentence skeletons to be expressed in Prolog; for an XSLT-based blank-filler, the sentence skeletons may take the form of XSLT templates.
Of course, we can generate Prolog forms, XSLT forms, or other forms of the sentence skeletons readily if we express the sentences themselves, and the blanks within them, as XML or SGML elements. A concrete proposal for such an XML notation is a topic for future research.
We have experimented with Prolog clauses which apply the skeleton sentences to the document and generate the basic description of the document, and with ad hoc programs to do the same job. On balance, however, we think it is probably best to accomplish this task with a more declarative mechanism, such as that provided by XSLT. It would be a useful project to produce a set of XSLT stylesheets to
  • read the skeleton sentences describing some DTD
  • generate from them an XSLT stylesheet for documents conforming to that DTD
  • in that stylesheet, generate base sentences by filling in the blanks in the skeleton sentences with the appropriate information from the document instance
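The blank-filling step itself can be illustrated with a short sketch. The fragment below is written in Python rather than XSLT, purely for compactness; it is not the stylesheet-generating project described above. The skeleton table, the Prolog-like output syntax, and the sample document are all assumptions made for illustration.

```python
import xml.etree.ElementTree as ET

# Hypothetical sample document, invented for illustration.
SAMPLE = ('<bibl n="b1"><author>Goethe</author>'
          '<title lang="deu">Faust</title></bibl>')

# Deictic expressions: functions from a reference element to a value.
def contents(this):
    return "".join(this.itertext())

def value(name):
    return lambda this: this.get(name)

# Sentence skeletons keyed by element type: a template with numbered
# blanks, plus one deictic expression per blank (all hypothetical).
SKELETONS = {
    "author": [("author({0})", [contents])],
    "title": [("title({0})", [contents]),
              ("lang({0}, {1})", [contents, value("lang")])],
}

def generate(root):
    """Apply every skeleton to every matching element, in document order."""
    sentences = []
    for element in root.iter():
        for template, blanks in SKELETONS.get(element.tag, []):
            fillers = [blank(element) for blank in blanks]
            sentences.append(template.format(*fillers))
    return sentences

sentences = generate(ET.fromstring(SAMPLE))
```

For the sample document this produces the three base sentences author(Goethe), title(Faust), and lang(Faust, deu). The generic routine knows nothing about any particular element type; all language-specific knowledge resides in the skeleton table, which is the property a declarative mechanism is meant to secure.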
The language of deictic expressions is another topic for further research. It is clear that we need to be able to refer to ancestors of various kinds, to siblings, and possibly to children. In order to capture the meaning of markup constructs involving ID, IDREF, and pointers like XPointer or TEI extended pointers, we will need deictic expressions to follow such pointers. Milestone elements, standoff markup, and the various methods used to evade SGML and XML's prohibition on overlapping structures may also require specialized kinds of deixis.
We have seen already that textual properties expressed by markup must be categorized as distributed or non-distributed, in order to license the correct set of inferences. We have also seen a need to distinguish properties which can coexist (such as lang(eng) and del) from properties which cannot coexist (such as lang(eng) and lang(deu)). Each such distinction is reflected in the rules for valid inference. It is not clear whether it is possible to construct some typology of properties which is guaranteed to suffice for all possible markup languages; it seems possible that the set of relevant distinctions is unbounded in principle, although bounded and rather small in practical markup systems. A better account of the relevant properties of properties might help guide the design of future markup languages, and might help guide software developers in identifying necessary functionality to support arbitrary markup languages well.
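A minimal sketch of how these two distinctions might be recorded and consulted follows. The category tables, the function names, and the representation of spans as character offsets are assumptions made for illustration; a real system would need a richer typology, as the paragraph above suggests.

```python
# Hypothetical categorization of predicates.
DISTRIBUTED = {"italic", "lang"}  # hold of every part of the span
EXCLUSIVE = {"lang"}              # at most one value per location

def holds_of_subspan(predicate, span, subspan):
    """May we infer that predicate holds of subspan, given that it
    holds of span? Spans are (start, end) character offsets."""
    if subspan == span:
        return True
    inside = span[0] <= subspan[0] and subspan[1] <= span[1]
    return inside and predicate in DISTRIBUTED

def conflicts(facts):
    """Return pairs of facts that cannot coexist at one location.
    Each fact is a (predicate, value) pair."""
    clashes = []
    for i, (pred_a, val_a) in enumerate(facts):
        for pred_b, val_b in facts[i + 1:]:
            if pred_a == pred_b and pred_a in EXCLUSIVE and val_a != val_b:
                clashes.append(((pred_a, val_a), (pred_b, val_b)))
    return clashes
```

Under these assumptions, italic may be inferred of any part of an italic span, while paragraph may be inferred only of the span itself; lang(eng) and del coexist without complaint, while lang(eng) and lang(deu) at the same location are reported as a clash.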
One reason to wish for a more formal account of markup language interpretation (apart from any intrinsic interest one may feel the topic has) is that formalization of markup interpretation in terms of predicates and licensed inferences allows us to imagine building systems which use the markup to reason about the documents in ways that go well beyond the straightforward reasoning shown in this paper. Systems of this kind are projected by [Welty and Ide 1999] and (more concretely) by [Ramalho et al. 1999]. It is not clear whether such systems will be useful primarily in checking documents for semantic plausibility and normalizing variant forms of markup to achieve more uniform tagging, or whether the inferences made by such systems can lead to surprising or dramatic improvements in retrieval or other processing, as suggested by Welty and Ide. It is certain, however, that in order to do their work systems of this kind require axioms which relate the various predicates of the markup language and allow inferences to be drawn.
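The kind of axiom such systems require can be illustrated with the two example rules given in the framework above ("if something is an author, and not identified as a corporate author, then it is a person" and "if something is a person, then it is human"). The Python sketch below applies them by simple forward chaining. The predicate names are hypothetical, and the treatment of "not identified as" as absence from the known facts is an assumption of the sketch.

```python
def infer(facts):
    """Apply the two example rules until no new facts appear.
    Facts are (predicate, individual) pairs."""
    known = set(facts)
    changed = True
    while changed:
        changed = False
        for predicate, who in list(known):
            new = None
            # Absence of a corporate_author fact is read as "not corporate".
            if predicate == "author" and ("corporate_author", who) not in known:
                new = ("person", who)
            elif predicate == "person":
                new = ("human", who)
            if new is not None and new not in known:
                known.add(new)
                changed = True
    return known
```

Given only the fact that Goethe is an author, the routine adds that Goethe is a person and is human; given an author also marked as a corporate author, it adds nothing.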
Within the framework just outlined, a great deal of work remains to be done in addition to the topics for further research already mentioned in passing.
We expect to continue the work outlined here by constructing a more detailed representation of the proposed framework, similar in kind to the Prolog representation of the straw-man model. Other instantiations of the framework would also be useful.
Systematic creation of sentence skeletons for existing markup languages such as TEI, HTML, ISO 12083, or DocBook would help test the adequacy of the framework, and would (we believe) tend to improve the precision with which the meaning of constructs in these languages is expressed. Many markup languages will use the same, or overlapping, sets of deictic expressions; practical work with existing or projected markup languages might help establish the relative merits of various notations for deictic expressions and sentence skeletons. In particular, it would be interesting to see whether XPath, TEI extended-pointer notation, and caterpillar notation are all equally well suited as a notation for the necessary deictic expressions, and whether they produce the same results when used to measure complexity of markup languages and markup applications.
It would also be rewarding to see whether the skeleton sentences of different markup languages could be defined in terms of a common set of properties (semantic primitives). If it proved possible to define markup languages in terms of a common vocabulary of properties, it might open the door to more useful systems for automated and semi-automated transformations from one markup language to another.

A. Bibliography

[Biggs and Huitfeldt 1997] Biggs, Michael, and Claus Huitfeldt. 1997. "Philosophy and Electronic Publishing. Theory and Metatheory in the Development of Text Encoding". The Monist 80 no. 3: 348-367. http://hhobel.phl.univie.ac.at/mii

[Brüggemann-Klein and Wood 2000] Brüggemann-Klein, Anne, and Derick Wood. 2000. "Caterpillars: a context specification technique." Markup Languages: Theory & Practice 2.1: 81-106.

[Clocksin and Mellish 1984] Clocksin, W. F., and C. S. Mellish. 1984. Programming in Prolog. Second edition. Berlin: Springer-Verlag.

[DeRose et al. 1990] DeRose, Steve, et al. 1990. "What is Text, Really?" Journal of Computing in Higher Education 1: 3-26.

[Huitfeldt 1995] Huitfeldt, Claus. 1995. "Multi-Dimensional Texts in a One-Dimensional Medium." CHum 28: 235-241.

[Langendoen and Simons 1995] Langendoen, D. Terence, and Gary F. Simons. 1995. "Rationale for the TEI Recommendations for Feature-Structure Markup." CHum 29.3 (1995): 191-209.

[Laurens 1985] [Laurens, Henry]. 1985. "Commons House of Assembly to Lord William Campbell." The Papers of Henry Laurens, ed. David R. Chesnutt et al. (Columbia, S.C.: University of South Carolina Press, 1985) Vol. 10, pp. 305-308.

[Pichler 1993] Pichler, Alois. 1993. "What is Transcription, Really?" Paper presented at ACH/ALLC '93, Georgetown.

[Ramalho et al. 1999] Ramalho, José Carlos, Jorge Gustavo Rocha, José João Almeida, and Pedro Henriques. 1999. "SGML documents: Where does quality go?" Markup Languages: Theory & Practice 1.1 (1999): 75-90.

[Renear et al. 1995] Renear, Allen, David G. Durand, and Elli Mylonas. 1995. "Refining our notion of what text really is: the problem of overlapping hierarchies." In Research in Humanities Computing (Oxford: Oxford University Press, 1995). Originally delivered at ALLC/ACH '92.

[Simons 1997] Simons, Gary F. 1997. "Conceptual Modeling versus Visual Modeling: A Technological Key to Building Consensus." CHum 30.4 (1997): 303-319.

[Sperberg-McQueen 1991] Sperberg-McQueen, C. M. 1991. "Text in the Electronic Age: Textual Study and Text Encoding, with Examples from Medieval Texts". L&LC 6 (1991): 34-46.

[Sperberg-McQueen and Burnard 1994] Sperberg-McQueen, C. M., and Lou Burnard, ed. 1994. Guidelines for Electronic Text Encoding and Interchange (TEI P3). Chicago, Oxford: ACH, ALLC, and ACL, 1994.

[Sperberg-McQueen and Burnard 1995] Sperberg-McQueen, C. M., and Lou Burnard. 1995. "The Design of the TEI Encoding Scheme." CHum 29 (1995): 17-39.

[Wadler 1999] Wadler, Philip. 1999. "A formal semantics of patterns in XSLT." Paper presented at Markup Technologies '99. [Later published in Markup Languages: Theory & Practice 2.2 (2000): 183-202.]

[Welty and Ide 1999] Welty, Christopher, and Nancy Ide. 1999. "Using the Right Tools: Enhancing Retrieval from Marked-up Documents." CHum 33 (1999): 59-84. Originally delivered at TEI 10, Providence (1997).

[Wimsatt and Beardsley 1946] Wimsatt, W. K., and Monroe Beardsley. 1946. "The intentional fallacy." Sewanee Review 54 (1946); rpt. in Wimsatt's The Verbal Icon (Lexington, Ky., 1954).


Notes

[1] For simplicity, we formulate our discussion in terms of SGML or XML markup, applied to documents or texts. Similar arguments can be made, however, mutatis mutandis, with respect to SGML and XML markup used to delimit fields and records in database extracts or similar material, and we believe our discussion applies to such material, even though the term document may feel unnatural for it. The models of interpretation we describe can in some cases be applied to non-SGML-based markup systems, though not necessarily to all. We also simplify our discussion by restricting our attention for the most part to the conventional uses of markup to convey information about the text in which it is embedded. When we say "Markup is inserted into textual material ... to convey some meaning," we mean in the normal case. It is of course entirely possible to insert markup in a text for other reasons: e.g., to inflate the size of the file, to deceive a colleague or an indexing engine, or (presupposing the existence of an audience) to make a joke. We believe such uses of markup are best thought of as a form of language game, rather than as a different set of rules for interpreting the markup being used. In the case of ironic or humorous markup, indeed, the humor or irony can only be appreciated if the declarative meaning of the markup is understood by all parties.
[2] Some theorists of text will object that we here fall prey to the intentional fallacy ([Wimsatt and Beardsley 1946]). Perhaps we do; we believe that authorial intention is at least important, and possibly conclusive, in the creation of certain textual structures. This fact is not affected by the medium; an author creates a paragraph break on paper by breaking the line and indenting; it would be difficult to know what to make of a claim that the result was not a paragraph break. But even if we are wrong, and authorial intentions are no more binding than any interpretation, we believe that our central assumption holds: markup licenses certain inferences about a text.
[3] That markup is interpretive has become a bit of a commonplace in the recent academic discussion of markup, though details and emphases differ. See [Biggs and Huitfeldt 1997], [DeRose et al. 1990], [Huitfeldt 1995], [Renear et al. 1995], [Pichler 1993], and even [Sperberg-McQueen 1991] and [Sperberg-McQueen and Burnard 1995].
[4] For all practical purposes, readers include software which processes marked up documents.
[5] Since interpretive markup can in theory be false, the licensed inferences are not necessarily sound, but the reliability of the information in markup is not our topic here; we are concerned only with giving some account of how markup systems associate information (reliable or not) with markup in the first place.
[6] This representation is selected to simplify the argument, not for efficiency in processing. Systems for practical work will normally use some other representation.
[7] Well, almost strictly. One version licenses the inference that the blanks between words are highlighted in gothic, and the other does not. Software might or might not have enough knowledge of typography to realize that there is no visible difference here. On the other hand, some software might be sophisticated enough to render blanks differently depending on the typeface in use, so it might make a difference after all, in which case the blanks would need to be brought inside the hi elements.
[8] For any distributed property, however, we can construct parallel non-distributed properties which are usefully countable. Instead of marking some property P, for example, we can define a property smallest-P, which applies to portions of the text such that they have property P but no smaller portion of the text can be said to have property P, or largest-P, which applies to portions of the text such that they, but no larger contiguous span in the text, have property P. We might define an italicSpan element to mark the longest contiguous runs of italic characters in the text, and require that the characters before and after any occurrence of the element be non-italic.