A directed-graph data structure for text manipulation

C. M. Sperberg-McQueen
Computer Center
University of Illinois at Chicago
Chicago, Illinois, USA

[Abstract of a talk given at the conference The Dynamic Text, 9th International Conference on Computers and the Humanities (ICCH) and 16th International Association for Literary and Linguistic Computing (ALLC) Conference, at the University of Toronto, June 1989]


This work came to mind recently when I heard the paper “A Fresh Computational Approach to Textual Variation” by Desmond Schmidt and Domenico Fiormonte at the conference Digital Humanities 2006, the first International Conference of the Alliance of Digital Humanities Organizations (ADHO), at the Sorbonne in Paris earlier this month. So I have unearthed the abstract and put it on the Web. Apart from the tagging in TEI Lite, the addition of this note and the identifying information above, and the omission of the now outdated address at the bottom of the abstract, the text is unchanged from pp. 69-70 of the 1989 conference guide.
C. M. Sperberg-McQueen
Española, New Mexico
21 July 2006

Written texts are conventionally represented in data processing as linear arrays of characters (or tokens, or records). Our text editors and concordance programs commonly assume that texts are simple streams of characters; our programming texts teach programmers to represent them so. But texts are not, as a class, exclusively linear. And the linear model, although convenient, provides no help in dealing with texts when they resist reduction to linearity, as even simple texts often do.
For some non-linear aspects of texts, proper data structures are well known. Simple textual hierarchy (including, at least, footnotes, running titles, and marginalia) can be handled even by very conventional text programs. Structure editors, SGML editors and processors, and outlining programs may model the hierarchy of a text much more completely, representing in a tree-like data structure the text's segmentation into chapters, sections, paragraphs, and sentences (or into cantos, stanzas, and lines; or into volumes, pages, and typographic lines). A second type of non-linearity comprises internal and external cross references and the distinction of passages of various kinds (e.g. material to be read on the first reading and material intended to be read only later). This non-linearity is less simple to handle well than the first, but the solution -- hypertext -- is very well known, even though not widely implemented.
A third type of textual non-linearity is less well served by existing data structures: textual variation. Texts do not always exist only in single forms. Culturally significant texts rarely do: if they are not preserved in multiple divergent manuscripts, they will have been printed in multiple divergent editions. But textual variation does not affect only culturally important texts. Letters, textbooks, software manuals, -- anything written can and often does exist in multiple forms. But the characteristics of variant texts -- an inescapable brute fact of our cultural history -- can be modeled by linear data structures only indirectly and awkwardly.
Conceptually, just as a simple text can be viewed (with some simplification) as a single stream of characters, so multiple versions of the same text can be viewed as so many parallel streams of characters. In this model, the complex text becomes a two dimensional array, a linear array of (texts represented as) linear arrays of characters. This model appeals in its simplicity, but apart from its theoretical problems (even a single text state need not be truly linear) it wastes storage whenever several versions have the same reading, and wastes processing power because to display a summary of the agreements and variations among the text states, it must collate each text against the base text from scratch.
I will outline in my paper a more economical solution to this problem. In this non-linear model, the multiple versions of a text are imagined not as so many parallel, non-intersecting lines, but as curves that intersect, run together for a while, and then split apart again, like the channels in a river delta. Unlike the channels of most river deltas, the versions of a text often merge again after splitting. The data structure takes its name from one riverine delta where such reunion of the channels does occur; I have christened it the "Rhine Delta" structure. Unlike the two-dimensional model of complex texts, this structure stores passages in which all versions agree only once; it is thus more economical of space. It also records the agreements and divergences of manuscripts structurally, which makes the task of preparing a critical apparatus a much simpler computational task.
Formally, the Rhine Delta structure is a directed graph, each node of which is labeled with one token of the text and with the symbols of the manuscripts which contain that token. Each arc linking two tokens is labeled with the symbols of the manuscripts in which the two tokens follow each other. There is a single starting node and a single ending node. If one follows all the arcs labeled with the symbol of a specific manuscript, one visits, in turn, nodes representing each token of that manuscript, in sequence. Passages where all the manuscripts agree are marked by nodes and arcs bearing all the manuscript symbols. Passages where they disagree will have as many paths through the passage as there are manuscript variants.
It can be shown that from this structure we can, for any variant, produce all the conventional views of linear text and perform all the usual operations (deletion, insertion, replacement, travel, search and replace, block move, etc.). Moreover, we can readily generate the various conventional views of complex texts: base text with apparatus, texts in parallel columns, text in parallel horizontal lines. Unlike other methods of handling textual variation, the Rhine Delta has no computational bias toward any single base text state; the user pays no penalty for wishing to view the text in an alternate version, with an apparatus keyed to that version.
Like a textual apparatus, a specific Rhine Delta structure is not wholly determined by a specific configuration of textual variants; more than one segmentation of variants and grouping of manuscripts can reflect the same textual facts. The Rhine Delta structure can reflect structurally the editor's preference for arranging, delimiting, separating, and grouping the textual variations. Or one can generate various alternative forms for the critical apparatus by mechanical operations on the Rhine Delta structure.
The Rhine Delta structure does introduce some complications into the treatment of textual hierarchies and cross references, but there appears to be no reason that Rhine Delta structures and hierarchical text structures could not both be integrated into a hypertext system, providing powerful tools for manipulating text. The form of such a combined hierarchic, hypertextual Rhine Delta structure will be sketched briefly in my paper.
The Rhine Delta structure has a number of possible applications in scholarly and non-scholarly text processing. Textual critics, and those who do not trust textual critics, will be able to to examine all the variant forms of a text separately and in parallel, with an apparatus keyed to whichever version they are reading at the moment. Those responsible for writing texts that must be prepared in various parallel, synchronized versions (system-specific software documentation, texts intermingling introductory overviews with esoteric digressions intended for later reading, project descriptions to be submitted to multiple funding agencies with different agendas, ...) will be able to create and maintain such documents more easily with a text editor that understands the basic concepts of textual variation and file synchronization. Collaborators could propose revisions or alternate readings in each other's texts without destroying the other alternatives, so that the document editor, in making the final version, can have all the suggestions laid out cleanly against each other.
The structure here described has almost certainly been invented before, though not widely publicized. It has parallels with hypertext data structures (although it appears to be finer grained than most hypertext systems), as well as with transition networks and deterministic finite automata. Structures very like it have surely been used sporadically before. But though it would be foolhardy to claim much originality for the idea of the Rhine Delta, it seems worthwhile to consider it in some detail as a data structure and discover its characteristics, advantages, and drawbacks. Only such consideration can lead to improvements in our ability to handle the inescapable, recalcitrant, stubborn non-linearity of the texts we study and write.