GODDAG: A Data Structure for Overlapping Hierarchies

C. M. Sperberg-McQueen ¹ and Claus Huitfeldt ²

¹ World Wide Web Consortium / MIT Laboratory for Computer Science
cmsmcq@acm.org

² University of Bergen / Humanistic Information Technology Centre
Claus.Huitfeldt@hit.uib.no

[[This paper was published in DDEP-PODDP 2000, ed. P. King and E.V. Munson, Lecture Notes in Computer Science 2023 (Berlin: Springer, 2004), pp. 139-160. The version given here differs in some minor ways from the published volume. It was generated from a pre-final version of the text, so it may differ in some details from the final publication. Footnote and bibliographic references are styled slightly differently. Post-publication additions and corrections are enclosed in double square brackets.]]

Abstract. Notations like SGML and XML represent document structures using tree structures; while this is in general a step forward from earlier systems, it creates certain difficulties for the representation of documents in which the structures of interest are not properly nested. Overlapping structures, discontinuous structures, and material which occurs in different orders in different parts, views, or versions of a document are all problems for SGML and XML. Overlapping structures have received attention from a variety of authors on SGML and XML, who have proposed various solutions including the use of non-SGML notations with translation into SGML for processing, the use of the CONCUR feature of SGML, exploitation of conditional marked sections in the DTD and document instance, the imposition of various kinds of unusual interpretations on SGML/XML elements as milestones or as fragments of some larger ‘virtual’ element, or the use of detailed annotation separate from the base text being annotated.

An alternative is the use of a non-SGML/XML notation which does not require that elements form a hierarchical structure. One such notation, MECS, was developed by one of the authors and has been used in practice for over a decade. Unfortunately, the element structure of a MECS document cannot conveniently be represented as a tree, so that MECS processors lack the assistance provided to SGML/XML processors by the unifying assumption of a simple standard data structure for the document. We propose a data structure for representing documents with overlapping structures (including MECS documents). As in the conventional tree representation of SGML and XML, elements are represented by nodes in a graph, and the character data content of the document by labels on the leaves of the graph. We use a directed acyclic graph in which an arc a → b indicates that node b is a child of node a. Unlike SGML/XML trees, our graph structure allows children to have multiple parents. In the general form of the data structure, an ordering is imposed on the children of each node; this gives the data structure its name: general ordered-descendant directed acyclic graph (GODDAG). A restricted form of GODDAG, in which an ordering is imposed on the leaves of the graph, cannot handle multiple orderings of the same material but can represent any legal MECS document.

The data structure here proposed should be useful in the representation of naturally occurring documents with complex structures; it may also be useful in other applications.

1. The Problem of Overlap
- 1.1. Document Structure and Interpretation
- 1.2. Overlap
2. Dealing with Overlap
- 2.1. Non-SGML Notations
- 2.2. CONCUR
- 2.3. Marked Sections and Entity References
- 2.4. Milestones
- 2.5. Fragmentation
- 2.6. Standoff Markup
- 2.7. MECS
3. GODDAG
- 3.1. Informal Presentation
- 3.2. Restricted GODDAGs
- 3.3. Generalized GODDAGs
- 3.4. Applications
- 3.5. Open Questions

References

1. The Problem of Overlap

1.1. Document Structure and Interpretation

Those concerned with digital documents are increasingly accustomed to viewing documents as having a hierarchical structure. Such a structure lends itself to electronic representation using a tree structure, in which each node in the tree is labeled to show the kind of document component represented by that subtree. The tree structure can in turn be linearized using, for example, the syntax of SGML [6] or XML [4], as in this example (from Bashō):

<lg>
<l>Summer grass --</l>
<l>all that's left</l>
<l>of warriors' dreams.</l>
</lg>

Here, an <lg> (line group) element contains three <l> elements, each <l> element representing a verse line, and the <lg> element representing the entire haiku.

We can represent the tree graphically as shown in Fig. 1.

Fig. 1. A simple document hierarchy

A slightly more complex example from the posthumous notebooks of Ludwig Wittgenstein (VW 115, p. 7, Fig. 2) shows how the nodes of the tree can be associated with different types of components: here, a (paragraph) element (p) contains two sentences (<s> nodes); within one of them, some words have been deleted (<del>) and others added (<add>).¹

<p><s><del>Der Anblick</del> <add>Das Bild
</add><del>der</del> <add>einer</add>
menschlichen Gestalt sowie die menschliche
Gestalt selbst sind uns wohlvertraute
Gegenst&auml;nde.</s>
<s>Von einem Wiedererkennen aber ist hier
keine Rede.</s></p>

The start-tags mark the beginning of a passage characterized by some feature (e.g. being a paragraph, or having been deleted), the matching end-tags mark the end of the passage. In a markup language with conventional interpretation, the label on each node in the tree identifies some feature possessed by the subtree it dominates. Sometimes the feature is possessed by the subtree as a unit, as with the p element here; sometimes it is inherited by each part of the subtree, as with the del and add elements here.

Fig. 2. Wittgenstein ms 115, p. 7

To find all the characteristics attributed to a given passage by the markup of the document, it suffices to walk the tree from leaf to root, collecting predicates.²

1.2. Overlap

In some texts, however, it is difficult to see how to construct a tree structure which represents all the information of interest in a natural way. Consider, for example, the first lines of Ibsen's Peer Gynt I.i:

AASE: Peer, du lyver!
PEER GYNT : Nej, jeg gjør ej!
AASE: Naa, saa band paa, det er sant!
PEER GYNT: Hvorfor bande?
AASE: Tvi, du tør ej!
Alt ihob er Tøv og Tant!

In English:

AASE: Peer, you're lying!
PEER GYNT : No, I'm not!
AASE: Well then, swear to me it's true.
PEER GYNT: Swear? why should I?
AASE: See, you dare not!
Every word of it's a lie.

Dramatically, this passage consists of five speeches, which might be encoded thus:

<sp who='Aase'> Peer, you're lying! </sp>
<sp who='Peer'> No, I'm not! </sp>
<sp who='Aase'> Well then, swear to me
                   it's true! </sp>
<sp who='Peer'> Swear? why should I? </sp>
<sp who='Aase'> See, you dare not!
                Every word of it's a lie!
</sp>

Metrically, by contrast, it consists of four lines of verse:

<l> Peer, you're lying!  No, I'm not! </l>
<l> Well then, swear to me it's true! </l>
<l> Swear? why should I?  See, you dare not! </l>
<l> Every word of it's a lie! </l>

Each of these two views of the material corresponds to a different tree structure. The dramatic structure is shown in Fig. 3,

Fig. 3. Tree of dramatic view.

the metrical structure in Fig. 4.

Fig. 4. Tree of metrical view

For simplicity of interpretation, it might be best simply to combine these two trees, as in Fig. 5. Unfortunately, the resulting graph is not a tree, and has no natural representation as a legal SGML or XML document.

Fig. 5. Tangled hierarchy

2. Dealing with Overlap

Several methods have been proposed in the literature for dealing with the problem of overlap.

2.1. Non-SGML Notations

In many ways the simplest approach is to use some notation other than SGML or XML.

In an early article on the problem of overlap, Barnard et al. [2] note that one method of handling overlap is simply to ignore the rules of SGML and XML for nesting; such an approach to our example would produce an encoding something like this:

<l>
<sp who='Aase'> Peer, you're lying! </sp>
<sp who='Peer'> No, I'm not! </sp>
</l>
<l>
<sp who='Aase'> Well then, swear to me
                   it's true! </sp>
</l>
<l>
<sp who='Peer'> Swear? why should I? </sp>
<sp who='Aase'> See, you dare not!
</l>
<l>
                Every word of it's a lie!
</l>
</sp>

The failure of the elements to nest properly makes it impossible to create an SGML document type definition (DTD) for the document, and thus impossible to use SGML or XML tools to process it.

Barnard et al.'s second proposal is to keep both sets of elements in a master version of the document, and to apply filters to the document before feeding it into an SGML system. Filtering out the start- and end-tags for <l> elements, for example, yields a dramatic structure that can be parsed with SGML or XML processors; filtering out the <sp> tags (and optionally the <stage> elements) similarly yields the verse structure. While this approach has certain virtues of simplicity, it does have some disadvantages. It is not possible to validate the master version of the document using standard tools. As Barnard et al. point out, it requires ad hoc filter programs outside the SGML/XML environment. Because the SGML/XML form is not the master form, it is difficult to apply standard SGML/XML tools for editing the document. And since the SGML/XML processors only see one structure at a time, it is impossible to examine both structures and their interaction, unless ad hoc tools are written for the task. (Barnard et al. do not mention this, and apparently do not regard it as a problem.) Perhaps for these reasons, the filtering approach described by Barnard et al. seems not to have been adopted by any software developers or text encoding projects.

2.2. `CONCUR`

In SGML, an optional feature named CONCUR is provided, which allows for “multiple concurrent structural views” of the document (ISO 1986 [[6]] sec. C.3.1). Barnard et al. [2] mention CONCUR as a possible solution, and more recently Sperberg-McQueen and Huitfeldt [9] have actively encouraged its use.

The example from Peer Gynt might look like this, if tagged using concurrent markup:

<div1 type="act">
<(D,V)div2 type="scene">
<(V)l>
<(D)sp who='Aase'> Peer, you're lying! </(D)sp>
<(D)sp who='Peer'> No, I'm not! </(D)sp>
</(V)l>
<(V)l>
<(D)sp who='Aase'> 
                   Well then, swear to me
                   it's true! </(D)sp>
</(V)l>
<(V)l>
<(D)sp who='Peer'> Swear? why should I? </(D)sp>
<(D)sp who='Aase'> See, you dare not!
</(V)l>
<(V)l>             Every word of it's a lie!</(D)sp>
</(V)l>
...
</(D,V)div2>
</div1>

Note that each tag carries, in parentheses, the names of the root elements of the concurrent document types to which it applies; when the tag belongs to all concurrent views, the list may be exhaustive, as shown in the <div2> tags, or omitted, as shown in the <div1> tags.

The tagging strongly resembles the non-nesting markup described above, but because each document type (here both D and V) has a tree structure, each can be given a full document-type definition, parsed, and interpreted in a straightforward way using the normal conventions of SGML and XML. The only difference is that instead of one tree for the document, there are multiple trees, sharing the same frontier. (If entity declarations differ in the different DTDs, the trees will have slightly different frontiers.)

The CONCUR feature, however, is not widely regarded as a solution for the overlap problem. It is clear from the SGML standard (ISO 8879) and the commentary of its editor [5] that the developers of SGML expected this to arise primarily in cases where it was desirable to record both the source and the result of some processing (e.g. both the logically tagged source and the formatted output of a layout program) in the same document; Goldfarb writes [5] (p. 304) “I therefore recommend that CONCUR not be used to create multiple logical views of a document, such as verse-oriented and speech-oriented views of poetry. Hypertext links are more appropriate for such applications....” Barnard et al. 1988 [2] object to CONCUR because it is not widely implemented, because it increases the volume of markup in the document, and because it has complex interactions with various markup-minimization features. The rejection of all markup-minimization features in XML has in the meantime robbed the last two objections of most of their force, but the lack of implementation support has been an incentive for practitioners to develop other methods of handling overlap. More substantial reservations about CONCUR are those expressed by Murata [8]: CONCUR is not able to model, in any convenient way, the deletion, insertion, duplication, or reordering of data in the various views.

2.3. Marked Sections and Entity References

Barnard et al. 1988 [2] propose to handle overlap with an elaborate scheme involving marked sections and specially defined entities, which will make either the tags for dramatic structure, or those for metrical structure, visible to an SGML parser, but not both. In some ways, this method resembles their proposal to maintain markup for both hierarchies in the master copy of the document, and filter the master into a single-hierarchy form for processing, with the difference that the filtering is here handled within, rather than outside, the SGML system.

Like the filtering approach, this method provides access to one hierarchical structure at a time, but not to both at once. This drawback, coupled with the lack of support for this style of markup in SGML editors, may be responsible for the fact that this proposal seems to have had little or no effect on the practice of those working with SGML systems.

This technique is not available in XML systems, because it relies on conditional sections in the document instance, and on entity replacement texts containing start- or end-tags, but not both. XML does not allow conditional sections in the instance, and it requires entity replacement text to be well-formed.

2.4. Milestones

The technique we call milestone elements was proposed by Barnard et al. in 1988 [2]; the term itself comes from the Guidelines of the Text Encoding Initiative (TEI) [1], which uses this technique for marking pagination, typographic lineation, and other asynchronous phenomena.

The basic idea is simple: of the various competing hierarchies implicit in an overlapping structure, one is chosen as the dominant hierarchy, and the start and end of what would have been elements in the other hierarchies are represented using empty elements. In sequential processing, the processor must change state as it passes such an element, in much the same way as our estimate of the distance remaining to a destination changes when we pass a milestone by the road.

Our example from Peer Gynt might be represented this way in XML, with line breaks in the metrical view represented as milestone elements named lb (‘line break’):³

<sp who='Aase'><lb n="1"/> 
  Peer, you're lying! </sp>
<sp who='Peer'> 
  No, I'm not! </sp>
<sp who='Aase'><lb n="2"/>
  Well then, swear to me it's true! </sp>
<sp who='Peer'><lb n="3"/>
  Swear? why should I? </sp>
<sp who='Aase'> 
  See, you dare not! 
<lb n="4"/> 
  Every word of it's a lie!
</sp>

In some cases (as described in Barnard et al. 1995 [3]), it may be desirable to provide empty elements both for the start- and for the end-tags of the elements in the non-dominant views of the text; in this particular example, it is unnecessary as each line ends only when the next begins.

The milestone technique, while convenient for data capture and simple sequential processing of the document, requires foregoing any validation of the subordinate views by the SGML or XML parser. The interpretation and processing required for milestone elements differs substantially from that for other element types. The tree structure associated with the XML fragment just given represents the dramatic structure normally, but the metrical structure must be, in effect, decoded rather than being exhibited directly by the tree structure.

2.5. Fragmentation

It is possible to gain slightly better validation of the markup, and to reduce the divergence in interpretation and processing needed for speeches and metrical lines, if we use the technique of fragmentation introduced (as far as we know) by the TEI Guidelines [1].

As with milestone elements, one hierarchy is chosen as the dominant one. Elements in the other hierarchies are represented normally as elements, as long as they nest properly within elements of the dominant hierarchy. Where an element in a subordinate view spans an element boundary in the dominant view, that element is broken into as many pieces as necessary to make each piece nest within an element of the dominant hierarchy. Attributes are used to signal that the resulting elements are, from a logical point of view, only fragments of a larger virtual element. The TEI encoding of our lines from Ibsen uses fragmentation of the verse lines:

<sp who="Aase">
  <l part="i">Peer, you're lying!</l>
</sp>
<sp who="Peer">
  <stage>without stopping</stage>
  <l part="f">No, I'm not!</l>
</sp>
<sp who="Aase">
  <l part="n">Well then, swear to me it's true!</l>
</sp>
<sp who="Peer">
  <l part="i">Swear? Why should I?</l>
</sp>
<sp who="Aase">
  <l part="f">See, you dare not!</l>
  <l part="n">Every word of it's a lie.</l>
</sp>

In the TEI DTD, the part attribute is used to specify that an <l> element represents a complete metrical line (part="n"), the initial part of the metrical line (part="i"), a middle part (part="m"), or the final part of the metrical line (part="f"). Applications which want to reconstruct fragmented metrical lines take any <l> element not marked as partial, or any sequence of <l> elements consisting of one initial, zero or more medial, and one final part. The SGML or XML parser does not ensure that initial, medial, and final parts occur in a proper order, which means some application-specific validation is necessary, but the parser does ensure proper nesting of <l> elements and proper matchup of start- and end-tags, which makes fragmentation frequently preferable to milestone elements.

It is worth noting that fragmentation provides a convenient method of handling discontinuous segments of a text (such as direct discourse of quotations interrupted by attributions), but no way of handling material which appears in other than the logical order.

2.6. Standoff Markup

A final method of handling overlap is to use what the TEI Guidelines call “out-of-line markup”, and what McKelvie et al. [7] (sec. 3.1) call “standoff annotation”.

In standoff markup, hierarchical structures may be represented by markup stored separately from the content of the structures, and connected to the content by hyperlinks. This allows the representation of multiple conflicting hierarchical structures, as well as allowing annotation of read-only material. The TEI <join> element allows us to represent the metrical lines of our example as virtual elements created by combining the line fragments actually encoded:

<sp who="Aase">
  <l id="L1a">Peer, you're lying!</l>
</sp>
<sp who="Peer">
  <stage>without stopping</stage>
  <l id="L1b">No, I'm not!</l>
</sp>
<sp who="Aase">
  <l id="L2">Well then, swear to me it's true!</l>
</sp>
<sp who="Peer">
  <l id="L3a">Swear? Why should I?</l>
</sp>
<sp who="Aase">
  <l id="L3b">See, you dare not!</l>
  <l id="L4" >Every word of it's a lie.</l>
</sp>

The <join> elements themselves each signal the existence of a virtual element of type <l>, created by concatenating the elements pointed at by the targets attribute:

<joinGrp result="l" targOrder="y" targType="L">
<join scope="branches" targets="L1a L1b"/>
<join scope="branches" targets="L3a L3b"/>
</joinGrp>

In practice, standoff markup plays a relatively minor role in the TEI DTD, primarily because it places a relatively heavy burden on data capture and on processors. A much more systematic exploration of the idea is given by McKelvie et al. [7], who describe a system in which various kinds of markup are systematically encoded and stored separately from the material they annotate, in order to allow multiple analyses of the same material. A tool suite is available for combining the base material with the standoff markup, in a way roughly analogous to that suggested by Barnard et al., except that instead of stripping out markup not relevant to the particular view desired by the application, the tools for standoff markup introduce the markup relevant to the desired view.

Standoff markup can handle both discontinuous virtual elements and virtual elements whose content appears out of order in the document. It suffers, in the view of some, from the lack of any convenient way to test, when the markup is stored in the standoff format, whether the markup actually obeys any constraints expressed in a DTD or other notation.

2.7. MECS

All the SGML/XML-based methods of handling overlap described so far have in common that they are more or less awkward. Some require applications to see one or the other, but not both, hierarchical structures; the others privilege one structure by making it the dominant hierarchy and requiring various forms of special processing, and special rules for interpreting the tree structure of the document.

One way to avoid the awkwardness is to take seriously the first method of handling overlap proposed by Barnard et al. 1988: use a non-SGML notation which does not require that elements nest. While there is no very widely used notation of this type — unless one counts the ill-formed HTML generated by users or programs unfamiliar with SGML and supported by the lenient error handling of HTML browsers — the MECS (Multi-Element Code System) notation developed by the Wittgenstein Archive at the University of Bergen provides a useful way to gain most of the benefits of descriptive markup without requiring proper nesting of elements. MECS was developed for the transcription of Wittgenstein's unpublished notebooks; it is essential to provide a clear account of the physical organization of the page, as well as of the logical structure of the text, without privileging either.

In MECS,

start-tags (in the form <e/) mark the start of a feature
end-tags (/e>) mark the end of a feature
no-element codes (<e>) mark a feature with location but no content (these correspond to SGML empty elements)

There are some other convenience features, which we will not describe here.

Elements are not required to nest. In MECS notation, our example might look like this (with some extra white space):

<sp/<speaker/Aase/speaker>
<l/ Peer, you're lying!            /sp>
<sp/<speaker>Peer/speaker>
    No, I'm not!              /l>  /sp>
<sp/<speaker/Aase/speaker>
<l/ Well then, swear to me it's true! /l>
/sp>
<sp/<speaker/Peer/speaker>
<l/ Swear? Why should I?           /sp>
<sp/<speaker/Aase/speaker>
    See, you dare not!        /l>
<l/ Every word of it's a lie! /l>  /sp>

MECS provides a simple notation, which ensures (as does XML) that no DTD is required in order to read the document correctly, and which (unlike SGML and XML) makes it possible to ensure a one-to-one mapping between features in the text and start-/end-tag pairs in the encoding.

MECS does, however, have some problems: it is entirely possible to encode texts with gratuitous overlap of elements which has no ascertainable meaning, it lacks any notation for cases in which two elements of the same type overlap each other, its element structure cannot be represented as a tree, and it provides no formalism for defining a document grammar analogous to a DTD.

The rest of this paper is devoted to defining a data structure intended to stand in the same relation to MECS as trees stand in relation to SGML and XML documents.

3. GODDAG

3.1. Informal Presentation

Let us begin with a trivial example of a document with overlapping structures:

<s/<a/ John <b/ likes /a> Mary /b>/s>

If for this example we draw the containment relations among the <s>, <a>, and elements in the usual way, ignoring for the moment the lack of nesting, the result may be a graph like that shown in Fig. 6.

Fig. 6. A trivial overlap example: John likes Mary

If we add explicit nodes for the #PCDATA chunks in the document, the graph becomes more manageable (Fig. 7).

/
Fig. 7. Another view of John likes Mary.

Overlap can be represented by graphs that are very like trees, but in which nodes may have multiple parents.

Overlap is multiple parentage.

Applying this idea to our Peer Gynt example yields the graph shown in Fig. 8.

Fig. 8. Peer Gynt 1.1

In such a graph, we can apply many of the conventions usual in the construction and interpretation of trees representing document structures: nodes have containment relations, each node can be said to dominate some subgraph (the nodes reachable from it), one node dominates a second if and only if the second node is completely contained within the first, the features associated with nodes by the labels on the nodes are possessed by the subgraphs dominated by the nodes, etc.

The following paragraphs provide a slightly more formal characterization of such graphs, describe algorithms for constructing them, and briefly discuss some applications.

3.2. Restricted GODDAGs

We distinguish two varieties of graph for representing documents with overlap: a restricted form which directly corresponds to MECS documents, and a generalized form which may not always have a representation in MECS.

Because nodes in such graphs have descendants, and because those descendants are ordered, we refer to these graphs as generalized ordered-descendant directed acyclic graph (GODDAG) structures.

In each case,

A GODDAG is a directed acyclic graph in which each node is either a leaf node or a nonterminal node.
Each leaf node is labeled with a string of zero or more characters.
Each nonterminal node is labeled with a generic identifier.
Directed arcs connect nodes; if an arc a → b exists, we say that node a directly dominates node b; we also call b a child of a. If a directed path of length zero or more from a to b exists, we say that node a dominates node b. If the path is of length one or more, then we call b a descendant of a.
Leaf nodes are those which are not the starting point of any arc.
Nonterminals are all others.
For any three nodes a, b, and c, if arcs a → b and b → c exist, then there is not an arc a → c. This amounts to saying that no node dominates another node both directly and indirectly.

In restricted GODDAGs, the following constraints apply:

The leaf nodes form a sequence. For any two leaf nodes a and b,

(a) a b (and b precedes a), or

(c) a = b (in which case they are the same node)

Each leaf node except the first has a predecessor, and each leaf node except the last has a successor. The sequence of leaf nodes we call the frontier.
Each node dominates some contiguous subsequence of leaf nodes. I.e. if node n dominates leaf nodes a and b, and a < b, then for any leaf node c such that a < c < b, node n dominates c.
No two nodes dominate the same subsequence of the frontier. I.e. for any two nodes a and b, there will be at least one leaf node c such that a dominates c but b does not dominate c, or vice versa.

In generalized GODDAGs, these constraints are relaxed. (See below.)

From a MECS document, a restricted GODDAG structure can be constructed as follows. The procedure requires a list of open elements, which is empty at the beginning of the procedure. We conceive of the MECS document as a series of segments, where each segment is either a single tag, or else the sequence of zero or more data characters between two tags. Tag segments and character data segments alternate: between any two tag segments there is at least one character data segment, and vice versa.

To construct a restricted GODDAG we read and process one segment s of the document at a time.

If s is a start-tag, then

(a) Make an element node n labeled with the appropriate generic identifier

(b) For each element node e in the list of open elements, add an arc e → n.

(c) Add n to the list of open elements.
If s is a character data segment, then

(a) Create a PCDATA node p labeled with the appropriate string.

(b) For each element node e in the list of open elements, add an arc e → p.
If s is an empty-element tag, then

(a) Make an element node n labeled with the appropriate generic identifier

(b) For each element node e in the list of open elements, add an arc e → n.

(c) Make a PCDATA node p labeled with the empty string.

(d) Add an arc n → p.
If s is an end-tag, then

(a) Find the element node n created for the corresponding start-tag, and remove n from the list of open elements.

(b) If n is not the most recently created element node in the list of open elements, then for each node m which was created more recently than n remove the arc n → m. These elements are not wholly contained in n and thus should not be dominated by n.

(c) Record the set of nodes dominated by n for later use. (This is the set of nodes x for which an arc n → x exists.)

At this point, there may be illegal arcs from n to other nodes (n may be dominating some other nodes both directly and indirectly). We remove them thus:

For each descendant d of n, examine the set of nodes dominated by d. For each node m dominated by d, remove the arc n → m if it exists.

Note that in following this algorithm,

Every node created is labeled either with a string or with a generic identifier.
Every node labeled with a generic identifier is given at least one outgoing arc; no node labeled with a string is given any outgoing arc.
At completion of the end-tag processing for any node n, any arcs n → d have been removed if d is dominated by any descendant of n. Therefore, for each node n there is no node which is both directly and indirectly dominated by n.
If the leaf nodes are numbered in the order they are read, then we have a total ordering on the leaf nodes, and each leaf node d occurring between the start-tag and the end-tag for any nonterminal node n is dominated by n: an arc n → d is created when d is read, and the arc is removed only if n indirectly dominates d. If the arc is not removed, n dominates d directly in the resulting graph; if the arc is removed, n dominates d indirectly. Since this is true for every d between the start- and end-tags of n, we can conclude that n dominates a contiguous subsequence of the leaf nodes.
Since every start-tag s is immediately followed by a character data segment d (possibly containing the empty string), and d is dominated by the node for s but not by any node for elements with start-tags occurring later in the document, it follows that no two nodes dominate subsequences of the frontier which begin with the same character data segment. It follows in turn that no two nodes dominate the same subsequence of the frontier.

These characteristics of the resulting graph suffice to make it fall within the definition of a restricted GODDAG as given above.

It is easy to see that a fairly simple traversal of the GODDAG structure can be used to serialize the restricted GODDAG structure as a MECS document, as long as no two elements which overlap each other also have the same generic identifier.

Let us examine a simple example. In processing the simple document

<s/<a/ John <b/ likes /a> Mary /b>/s>

we will process, in turn, eleven segments:

The start-tag <s/.
A character-data segment containing the empty string.
The start-tag <a/.
A character-data segment containing the string " John ".
The start-tag <b/.
A character-data segment containing the string " likes ".
The end-tag /a>.
A character-data segment containing the string " Mary ".
The end-tag /b>.
A character-data segment containing the empty string.
The end-tag /s>.

After the first six of these, the list of open elements includes a, b, and s, and the graph has the form shown in Fig. 9.

Fig. 9. GODDAG construction after the word “likes” is read

At this point we encounter our first end-tag, for <a>. Since the <a> element was not the most recently opened, we know that the element, which was opened more recently, does not nest properly within <a>. So we remove the arc a → b (Fig. 10).

Fig. 10. GODDAG construction after “/a>” is read

There are no doubly dominated nodes, so no further arcs are removed. After adding the character-data segment for " Mary ", processing the end-tag for the element does not cause any arcs to be removed. After processing the empty string between the and <s> end-tags, the graph is shown in Fig. 11.

Fig. 11. GODDAG construction after “/s>” is read

At this point, we remove the direct arcs from s to the leaf nodes other than the first and last. The result is shown in Fig. 12.

Fig. 12. GODDAG construction at completion

3.3. Generalized GODDAGs

In a generalized GODDAG, the constraints on the graph are relaxed:

For any nonterminal node n, the set of arcs from n to some other node are ordered. For any two arcs n → a and n → b,

(a) n → a < n → b (in which case we say that node a precedes b among the children of n, or

(b) n → a > n → b (in which case we say that node b precedes a among the children of n, or

(c) n → a = n → b (in which case they are the same arc)
There is no single ordering on the leaf nodes, and requirement that the leaf nodes dominated by any node be contiguous. (In the absence of a single ordering, there is no way to express the notion that a set of leaf nodes is or is not contiguous.)
There is no requirement that any two nodes dominate different sets of leaf nodes.

3.4. Applications

The principal motivation for defining GODDAGs is to provide a data structure for representing documents with overlapping structures. In the majority of cases, these can be represented using restricted GODDAGs. The similarities between tree structures and GODDAGs allow the same methods of interpreting the meaning of markup to be followed: feature values can be inherited from a parent, overridden by a descendant, and so on. There is some chance for conflict and confusion, since with multiple parents, it is possible that different parents have different and incompatible values for the same feature, etc.; the formal specification of the meaning of a markup language will need to specify how to interpret such cases.

GODDAGs can be useful in the construction of various kinds of software for processing complex documents. We sketch two applications here: the elimination of ‘spurious’ overlap and the representation of virtual elements.

Systems which allow overlapping markup often suffer from what we term spurious overlap: the (usually unintentional) use of overlapping markup where the structure of the document can be captured adequately without overlap. When two elements overlap, they define three distinct regions dominated by one or the other of the overlapping elements, or both:


<a/ John <b/ likes /a> Mary /b>

If any one of the regions is empty, then the overlap is spurious: the document can be rewritten without overlap, without changing the interpretation of any character of the document. Here are three examples of spurious overlap.


<a/<b/ John likes /a> Mary /b>
<a/ John likes <b//a> Mary /b>
<a/ John likes <b/ Mary /a>/b>

A GODDAG structure lacks spurious overlap if certain conditions hold. We define the function leafset as returning the set of non-empty-string leaf nodes dominated by a given node. A GODDAG structure is clean if the following conditions hold:

If leafset(b) ⊂ leafset(a), then a → b.

If leafset(b) = leafset(a), then either a → b or b → a.

Clean GODDAGs can be created by constructing a restricted GODDAG, and then deleting every leaf node labeled with the empty string except those dependent on empty elements and ensuring that when the leafset of one node is a proper superset of the leafset of another, the first node dominates the second. Such an algorithm produces the following reformulations of the examples just given:


<b/<a/ John likes /a> Mary /b>
<a/ John likes /a><b/ Mary /b>
<a/ John likes <b/ Mary /b>/a>

The second application of GODDAGs we wish to mention is the representation of virtual elements as they are defined in the TEI Guidelines. Such virtual elements are represented in a generalized GODDAG as nodes like any other, and their children can be searched and processed in the same way as the children of normal elements.

3.5. Open Questions

A number of open questions remain for further work:

formal proof that the transformation of a MECS document into a GODDAG structure, followed by the serialization of the GODDAG structure into a MECS document, produces a new MECS document equivalent to the old
demonstration of the use of GODDAG structures in processing documents with overlap (e.g. in indexing and search and retrieval applications)
algorithms for translating SGML/XML documents which use any of the techniques described in the first part of this paper into GODDAG structures, and vice versa

References

1. Association for Computers and the Humanities (ACH), Association for Computational Linguistics (ACL), and Association for Literary and Linguistic Computing (ALLC). 1994. Guidelines for Electronic Text Encoding and Interchange (TEI P3), ed. C. M. Sperberg-McQueen and Lou Burnard. Chicago, Oxford: TEI, 1994.

2. Barnard, David, Ron Hayter, Maria Karababa, George Logan, and John McFadden. 1988. “SGML-based markup for literary texts: Two problems and some solutions.” Computers and the Humanities 22: 265-276.

3. Barnard, David T., Lou Burnard, Jean-Pierre Gaspart, Lynne A. Price, C. M. Sperberg-McQueen, and Giovanni Battista Varile. 1995. “Hierarchical encoding of text: Technical problems and SGML solutions.” Computers and the Humanities 29: 211-231.

4. Bray, Tim, Jean Paoli, and C. M. Sperberg-McQueen, ed. Extensible Markup Language (XML) 1.0 [Cambridge, Mass., Sophia-Antipolis, Tokyo]: World Wide Web Consortium, 1998. [[3d ed. 2004 edited by the above with Eve Maler and François Yergeau. Available at http://www.w3.org/TR/REC-xml/]]

5. Goldfarb, Charles F. The SGML Handbook. Oxford: Clarendon Press, 1990.

6. International Organization for Standardization (ISO). 1986. ISO 8879: Information processing — Text and office systems — Standard Generalized Markup Language (SGML). [Geneva]: ISO, 1986.

7. McKelvie, D., C. Brew, and H. S. Thompson. 1998. “Using SGML as a basis for data-intensive natural language processing.” Computers and the Humanities 31: 367-388.

8. Murata, M. 1995. “File format for documents containing both logical structures and layout structures.” Electronic publishing 8: 295-317.

9. Sperberg-McQueen, C. M., and Claus Huitfeldt. 1999. “Concurrent document hierarchies in MECS and SGML.” Literary & Linguistic Computing 14.1: 29-42.

10. Sperberg-McQueen, C. M., Claus Huitfeldt, and Allen Renear. 2000. “Meaning and Interpretation of Markup.” Markup Languages Theory & Practice [forthcoming]. [[Published in 2.3 (2000): 215-234. Available at http://www.w3.org/People/cmsmcq/2000/mim.html]] Paper originally presented at ALLC/ACH 2000, Glasgow, and at Extreme Markup Languages 2000, Montréal.