Improving the Structure of Digital Publications in the Computer Science Domain
Case Study Rhetorical Article Model SWASD Task
November 27, 2009, Tudor Groza
Dissemination, an important phase of scientific research, can be seen as a communication process between scientists. They expose and support their findings, while discussing claims stated in related scientific publications. This communication takes place over the course of several publications, where each paper itself contains a rhetorical discourse structure laying out supportive evidence for the raised claims. Unfortunately, the semantics of the discourse structure is usually hidden within the content expressed by the writer in the publication, and is thus hard for the reader to discover directly.
The common approach to presenting or representing scientific publications is the typical (linear) print layout. This has worked well until now, as most publications were "consumed" in their printed form. With the increasing use of the World Wide Web, and its growing influence on the dissemination process, we can clearly observe a shift from paper publications to digital documents. Together with this shift, we realize that the linear representation structure is becoming outdated. Generally, the structure of a document has an important influence on the perception of its content. Thus, a well-organized publication that follows a common thread (a "red wire") will always be better understood and analyzed than one with a poor or chaotic structure, even if the content of the latter is not necessarily poor.
We propose an open-standard, widely (re)usable format for digital publications, with the goal of externalizing (i.e. articulating the tacit knowledge into explicit concepts) the rhetorical roles carried by the blocks of text composing the publication. Generally, externalization has a dual form. On the one hand, scientific publications intrinsically represent a form of cognitive externalization, making explicit the scientists’ thoughts. On the other hand, in order to make these publications more accessible to computation, and more specifically on the Web, so that information can be more easily navigated, compared and understood, we need a formal externalization, i.e. a step from freely expressed text to machine-processable structures.
The range of activities that computer science researchers usually perform is quite broad, and depends on a series of contexts. To simplify the description, we could categorize these contexts into: actual research, project administration / management, dissemination and supervision. Three of these four contexts rely on “using” scientific literature to achieve different goals, whether it is reading about the state of the art in order to compare it against one's own solution (actual research), writing about or presenting this solution to the community (dissemination), or teaching students how to do research and disseminate it (supervision). Therefore, scientific publications represent one of the centerpieces of the daily activities of a typical CS researcher, and their structure represents one of the key elements to their understanding and acceptance.
3. Use case
3.1 Use Case #1: Finding appropriate information within publications
Task: Alice wants to find out the contribution of a set of publications, with the goal of writing a state of the art survey.
Today: In order to find, and consequently understand, the contribution of the publications under scrutiny, Alice would have to read each and every one of them, which is clearly an overwhelming and cumbersome task. Alternatively, she could use one of the existing search engines to find publications that contain some specific keywords, but this would only help her retrieve the set of publications, not reach their actual contribution.
Future: Having the rhetorical block structure externalized and attached to the digital publications would enable a richer and more expressive searching and browsing experience. On the one hand, Alice would be able to quickly spot the CONTRIBUTION blocks within the publication and possibly restrict her reading to those, thus reducing the time usually spent on reading the entire publication. On the other hand, being able to formulate queries over content specific to such blocks could already improve the quality (and possibly the quantity) of the set of publications relevant for her (e.g. contribution: “rhetorical structure of scientific publications”).
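As an illustration of the block-scoped query mentioned above, the following is a minimal sketch in Python. It assumes that each publication is already available as a list of text blocks annotated with a rhetorical role. The class names (Block, Publication), the function names, the BACKGROUND role and the sample corpus are hypothetical and serve only as an example; only the CONTRIBUTION role and the query phrase come from the use case itself.

```python
from dataclasses import dataclass, field
from typing import List, Tuple


@dataclass
class Block:
    role: str  # rhetorical role of the block, e.g. "CONTRIBUTION"
    text: str


@dataclass
class Publication:
    title: str
    blocks: List[Block] = field(default_factory=list)


def blocks_with_role(publication: Publication, role: str) -> List[Block]:
    """Return only the blocks that carry the given rhetorical role."""
    return [b for b in publication.blocks if b.role == role]


def search_by_role(publications: List[Publication], role: str,
                   phrase: str) -> List[Tuple[str, str]]:
    """Case-insensitive phrase search restricted to blocks with one role."""
    needle = phrase.lower()
    hits = []
    for pub in publications:
        for block in blocks_with_role(pub, role):
            if needle in block.text.lower():
                hits.append((pub.title, block.text))
    return hits


# Hypothetical sample corpus, invented for this illustration.
corpus = [
    Publication("Paper A", [
        Block("BACKGROUND",
              "Prior work studies the rhetorical structure of scientific publications."),
        Block("CONTRIBUTION", "We introduce a faster indexing scheme."),
    ]),
    Publication("Paper B", [
        Block("CONTRIBUTION",
              "We model the rhetorical structure of scientific publications."),
    ]),
]

hits = search_by_role(corpus, "CONTRIBUTION",
                      "rhetorical structure of scientific publications")
for title, text in hits:
    print(title, "->", text)
```

In this sketch a plain keyword search would return both papers, whereas the role-scoped query returns only Paper B, because Paper A mentions the phrase solely in a BACKGROUND block.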
3.2 Use Case #2: Publication authoring
Task: Alice writes (or contributes to) a scientific publication.
Today: The authoring process of a publication has remained more or less unchanged since the beginning of science. Usually, Alice would create the linear structure of the publication and then lay out her argumentative thread within this structure. This process, together with its outcome (i.e. the document), currently suffers from two major problems. Firstly, over the last decade, we have observed a clear shift from printed publications to electronic publications. While the former need the typical linear (print-driven) structure, the latter could make use of novel structuring approaches that would enable the externalization of both the semantics of the content and its argumentative support. Secondly, considering the current information overload, the existence of an explicit semantic structure within the publication (created already at authoring time) would reduce the overhead of post-processing the publications in order to externalize this semantic structure.
Future: The usual authoring environments will have developed appropriate modules for supporting the author in creating a rhetorical structure, e.g. based on a template, and perhaps depending on the domain. In fact, we envision two possible scenarios:
- Alice starts from a given template of a rhetorical structure, following a core model, which she then customizes according to the domain of the publication. For example, if the publication is in the Computer Science domain, she would use only the core model, while if the publication is in Experimental Biology, she would add a module that captures specific rhetorical blocks for this domain. This scenario has two advantages: firstly, Alice would have from the start a clear structure which she can fill in, and secondly, each block of the structure would have a clear rhetorical role (unlike today's publication sections), easy to mine, retrieve and browse.
- For more conservative researchers, the authoring process could be the same, i.e. following a linear structure, and the authoring environment would automatically propose types of rhetorical roles for the publication's paragraphs (again, depending on a chosen domain). While this scenario is more complex, it could bring an interesting novelty to the authoring process, as the environment could, on demand, re-shape the entire publication based on the rhetorical structure. This would give Alice the opportunity to examine the argumentative thread both from the linear and from the rhetorical perspective. Additionally, as in the previous scenario, the rhetorical structure would be explicit and easy to mine, retrieve and browse.
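The two scenarios above can be sketched in Python under a few stated assumptions. The role names in CORE_MODEL and the MATERIALS_AND_METHODS block of the Experimental Biology module are hypothetical placeholders, not the roles of any actual model. The sketch also assumes that each paragraph already carries a role, whether chosen by the author or proposed by the authoring environment; how such a proposal would be computed is not shown.

```python
from typing import Dict, List, Tuple

# Hypothetical core model: role names are placeholders for illustration.
CORE_MODEL: List[str] = ["BACKGROUND", "CONTRIBUTION", "DISCUSSION"]

# Hypothetical domain modules that extend the core model.
DOMAIN_MODULES: Dict[str, List[str]] = {
    "Computer Science": [],  # core model only
    "Experimental Biology": ["MATERIALS_AND_METHODS"],
}


def build_template(domain: str) -> Dict[str, List[str]]:
    """Scenario 1: an empty rhetorical template (core model plus domain module)."""
    roles = CORE_MODEL + DOMAIN_MODULES.get(domain, [])
    return {role: [] for role in roles}


def reshape(paragraphs: List[Tuple[str, str]]) -> Dict[str, List[str]]:
    """Scenario 2: regroup linearly ordered (role, text) paragraphs by role.

    The original order of the paragraphs is preserved within each role.
    """
    view: Dict[str, List[str]] = {}
    for role, text in paragraphs:
        view.setdefault(role, []).append(text)
    return view


cs_template = build_template("Computer Science")
bio_template = build_template("Experimental Biology")

# A linear draft whose paragraphs have already been assigned roles.
linear_draft = [
    ("BACKGROUND", "Publications are mostly consumed linearly."),
    ("CONTRIBUTION", "We propose an explicit rhetorical structure."),
    ("BACKGROUND", "Existing formats hide the discourse structure."),
]
rhetorical_view = reshape(linear_draft)

print(list(bio_template))
print(rhetorical_view["BACKGROUND"])
```

Keeping the linear draft and the rhetorical view as two projections of the same annotated paragraphs is one possible design. It would let the author switch between the two perspectives without maintaining two copies of the text.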
Acknowledgements and References:
The Use Case reuses to a large extent content from three publications:
- Anita de Waard and Gerard Tel. The ABCDE Format. Enabling Semantic Conference Proceedings. In Proc. of SemWiki 2006 at ESWC 2006.
- Tudor Groza, Siegfried Handschuh, Tim Clark, Simon Buckingham Shum and Anita de Waard. A Short Survey of Discourse Representation Models. In Proc. of the SWASD Workshop at ISWC 2009.
- Tudor Groza, Alexander Schutz, Siegfried Handschuh. SALT: A Semantic Approach For Generating Document Representations. In Proc. of ACM DocEng 2007.
There can obviously exist additional categories of contexts, or the ones that we have mentioned could have been divided in a different manner. Nevertheless, for the sake of our argumentation, these four categories are sufficient.
We consider three out of the four contexts, by assuming that the second context (project administration / management) is a purely administrative one, including, for example, budgeting and creating / filling / distributing all kinds of administrative forms.
As a remark, we are not advocating against reading, but pointing out that if one requires a specific piece of information, then reading an entire set of publications is not really feasible.