The ABCDE Format: Publishing Semantic Conference Papers

Anita de Waard (Utrecht University/Elsevier B.V.)

(Annotation:

DC.Title The ABCDE Format: Publishing Semantic Conference Papers
DC.Creator.PersonalName Anita de Waard
DC.Creator.PersonalName.Address anita@cs.uu.nl
DC.Creator.2 Simon Pepping
DC.Creator.Address.2 s.pepping@elsevier.com
DC.Subject *** I.7.: Editing, Text
DC.Subject Semantic Web
DC.Subject Wikis
DC.Type Text.Proceedings
DC.Identifier http://www.semwiki.org/2006/
DC.Identifier URN dewaardsemwiki2006
DC.Language ISO639-1 en
DC.Date.X-MetadataLastModified ISO8601 2006-02-02

/Annotation)

We believe that the best way to present a narrative to a computer is to let the author explicitly create a rich semantic structure for the article during writing.
As conceptual structures become the central bearer of information, a set of structured documents can be integrated to form a ‘knowledge network’, or structured package of related knowledge regarding a topic.
We propose an open-standard, widely (re)usable format, the ABCDE Format, for proceedings and workshop contributions that can be easily mined, integrated and consumed by semantic browsers and wikis.
This format contains the following elements: an Annotation with Dublin Core metadata; three paragraphs containing Background, Contribution (of the author(s) to the field) and Discussion, and a set of Entities that are linked to internal or external identifiers, such as project names, references, concept names, etc., which can be formatted in RDF.
There is no abstract in an ABCDE document - instead, within the B, C and D paragraphs the author denotes 'core' sentences. Upon retrieval or rendering of the article, these can be extracted to form a structured abstract of the article - where one can jump directly to the core of the Background, Contribution or Discussion.
We aim to work on different incarnations of this format and open it up to modification and development.
A proposal to mine this format for Facts is also presented to this Workshop.

Background

(Background:

(Entity: Object (type: text) = (Background paragraph), relation = Footnote, Subject (type: text) = "This background is a copy of the Introduction paragraph of (Entity: Object (type: text) = (this contribution), relation = Reference, Subject (type: URI) = http://labs.elsevier.com/resources/adw/papers/SWDaysDeWaard1209.pdf). It is an essential property of semantic conference contributions that they can be composed in a modular format, i.e. linking to or reusing parts of existing documents.")

“There is a growing mountain of research. But there is increased evidence that we are being bogged down today as specialization extends. The investigator is staggered by the findings and conclusions of thousands of other workers - conclusions which he cannot find time to grasp, much less to remember, as they appear.”
(Bush, 1945)

Scientists are increasingly unable to process the ever-increasing flood of scientific literature that surrounds them. Biomedical literature, for instance, grows by over 500,000 publications each year (Cohen, 2005). In a recent study on user needs among British archaeologists, 71% of the respondents felt that information was being produced of which they were unaware (Jones, 2001). Besides the problems in keeping up with one’s own field, it becomes more and more difficult to access adjacent domains of science. Furthermore, scientists want to know not only which publications contain specific words, and how to rank them by relevance, but also what knowledge is contained within the papers, and how it relates to their existing knowledge. For example, cell biologists might want to know: “What functions of this gene are known?” Astronomers might ask “What radiation patterns have we seen in red-dwarf stars?” or “What theories does this new observation support?” Ideally, a new publication should situate itself within the existing knowledge context of the reader, and show how it affects or alters this context.

There have been many efforts to combat information overload in science. Abstracts were developed in the sixties and seventies. Although they are shorter to read, abstracts do not provide a full summary of the work described in the document, nor do they offer any way to integrate the document into existing knowledge. Metadata is a broad term covering many different types of information, but generally includes the bibliographic reference to a document, and descriptors such as keywords. Metadata helps retrieve an article when descriptive elements (author, title) are known. The main function of a keyword list is to classify the article in a category. But neither provides any direct insight into the knowledge conveyed within the body of a scientific paper.

Text mining and information extraction are methods specifically developed to find relevant information in unstructured texts and encode the information in a structured form, like a database record (Couto, 2003). In theory, text mining is the perfect solution for transforming factual knowledge from publications into database entries. However, automatically identifying concepts such as genes and proteins poses many problems; see e.g. Mons (2005) and Cohen (2005). Moreover, computational linguists have not yet developed tools that can analyse more than 30% of English sentences correctly and transform them into a structured formal representation. For this, the papers still need to be handled by a curator (Rebholz-Schuhmann, 2005).

The main problem with automatically extracting information from scientific articles is that the genre of the scientific publication has developed to be an indivisible information unit (see e.g. Bazerman (1998)). The scientific paper is a self-contained narrative, created anew in each iteration, with specific genre characteristics that minimize the potential of identification, content reuse and knowledge integration. All this rhetorical freedom comes at the expense of usability in a computer-centered environment. The linear narrative was fine when we still read and wrote on paper, but the changing (digital) environment in which scientists live and work calls for a changing fundamental unit of communication.
(Core1: We believe that the best way to present a narrative to a computer is to let the author explicitly create a rich semantic structure for the article during writing /Core1) (see also de Waard, 2005). At a high level, this structure will consist of self-contained modular elements or entities, and discourse relationships between such elements (within a text, and between texts). The tension between these self-contained ‘knowledge elements’ or conceptual structures, and the meaning conveyed in the conventional narrative of the document as a whole, poses an interesting topic of study in terms of both knowledge modeling and rhetoric/discourse studies.
(Core2: As conceptual structures become the central bearer of information, a set of structured documents can be integrated to form a ‘knowledge network’, or structured package of related knowledge regarding a topic. /Core2) This can be envisaged (and modeled) as a network of nodes and relationships, and can be seen as an incarnation of the ‘intelligent data’ ideal that the Semantic Web is meant to enable (Berners-Lee, 2001). The purpose of this project is to examine such a new form of structuring, and the authoring, editing and retrieval processes needed to use it. Specifically, we are interested in representing conference proceedings in a new way. Semantic browsers such as PiggyBank [ ] and semantic collaborative authoring tools such as Semantic Wiki [ ] are paving the road for distributed, semantic communities to communicate.

/Background)

Contribution

(Contribution: (Core3: We propose an open-standard, widely (re)usable format, the ABCDE Format (ABCDEF), for proceedings and workshop contributions that can be easily mined, integrated and consumed by semantic browsers and wikis. /Core3) This format can be created in several file formats: LaTeX, XML, a Microsoft Word template or a simple text file. It is characterised by the following elements:

A - Annotation. Each record contains a set of metadata that follows the Dublin Core standard. Minimal required fields are Title, Creator, Identifier and Date.
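
As a minimal sketch of how a tool might check the Annotation, the Python below parses "DC.<Field> <value>" lines of the kind used in the header of this paper and reports which of the four minimal fields are absent. The function names, and the rule that a qualified field such as DC.Date.X-MetadataLastModified counts toward its base element (Date), are assumptions made for illustration and are not prescribed by the format.

```python
# Hypothetical helper names; only the four required elements come from the text.
REQUIRED = ("Title", "Creator", "Identifier", "Date")


def parse_annotation(text):
    """Return (field, value) pairs from lines of the form 'DC.<Field> <value>'."""
    pairs = []
    for line in text.splitlines():
        line = line.strip()
        if not line.startswith("DC."):
            continue  # skip blank lines and the Annotation delimiters
        key, _, value = line.partition(" ")
        pairs.append((key[len("DC."):], value.strip()))
    return pairs


def missing_required(pairs):
    """Return the required elements that no field (qualified or not) provides."""
    # Assumption: 'Creator.PersonalName' satisfies the 'Creator' requirement.
    present = {field.split(".")[0] for field, _ in pairs}
    return [name for name in REQUIRED if name not in present]


sample = """\
DC.Title The ABCDE Format: Publishing Semantic Conference Papers
DC.Creator.PersonalName Anita de Waard
DC.Identifier http://www.semwiki.org/2006/
DC.Date.X-MetadataLastModified ISO8601 2006-02-02
"""

print(missing_required(parse_annotation(sample)))
```

A record that passes this check has at least one field for each required element; whether the values themselves are well formed (for example, a valid ISO 8601 date) would need a separate check.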

B, C, D - Background, Contribution, Discussion. The main body of text consists of three sections:

Background, describing the positioning of the research, ongoing issues and the central research question;
Contribution, describing the work the authors have done: any concrete things created, programmed, or investigated;
Discussion, containing a discussion of the work done, a comparison with other work, and implications and next steps.

These section headings need to exist somewhere in the metadata of the article, but they can be hidden markup. Also, each of the sections can have different, and differently named, subheadings.

E - Entities. Throughout the text, entities such as references, personal names, project websites, etc. are identified by:

The text linking to an entity (and/or its URI, e.g. in XPath)
The type of link (reference, footnote, website, etc.)
The linking URI, if present
The text for the link

In other words, the entity link can be described as an RDF statement.
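
To make the RDF reading concrete, here is a small Python sketch that holds the four parts of an entity link listed above and prints them as one N-Triples-style statement. The class name, the field names, the example.org namespace and the choice of the linking location as the statement's subject are all assumptions for illustration; the format itself does not fix a vocabulary.

```python
from dataclasses import dataclass
from typing import Optional

# Hypothetical namespace for link types; not defined by the ABCDE Format.
ABCDE_NS = "http://example.org/abcde#"


@dataclass
class EntityLink:
    source: str                 # URI of the place in the text that carries the link
    link_type: str              # e.g. "reference", "footnote", "website"
    target_uri: Optional[str]   # the linking URI, if present
    link_text: str              # the text for the link

    def to_ntriples(self) -> str:
        """Render the link as '<subject> <predicate> object .'"""
        predicate = f"<{ABCDE_NS}{self.link_type}>"
        if self.target_uri:
            obj = f"<{self.target_uri}>"
        else:
            # No URI available: fall back to the link text as a plain literal.
            escaped = self.link_text.replace("\\", "\\\\").replace('"', '\\"')
            obj = f'"{escaped}"'
        return f"<{self.source}> {predicate} {obj} ."


link = EntityLink(
    source="http://example.org/papers/abcde#background",  # hypothetical location
    link_type="reference",
    target_uri="http://labs.elsevier.com/resources/adw/papers/SWDaysDeWaard1209.pdf",
    link_text="this contribution",
)
print(link.to_ntriples())
```

Plain string formatting is used here to keep the sketch free of dependencies; a real implementation would more likely rely on an RDF library and an agreed vocabulary for the link types.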

(Core4:/ There is no abstract in an ABCDE document - instead, within the B, C and D paragraphs the author denotes 'core' sentences. Upon retrieval or rendering of the article, these can be extracted to form a structured abstract of the article - where one can jump directly to the core of the Background, Contribution or Discussion. /Core4)
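
The extraction step can be sketched in a few lines of Python. The sketch assumes the plain-text incarnation used in this paper, in which sections are delimited as "(Background: ... /Background)" and core sentences as "(CoreN: ... /CoreN)", and it tolerates the "(CoreN:/" variant of the opening marker. The function name and the regular expressions are illustrative assumptions, not part of the format.

```python
import re

# Section and core-sentence delimiters as they appear in the plain-text version.
SECTION_RE = re.compile(
    r"\((Background|Contribution|Discussion):(.*?)/\1\)", re.DOTALL
)
CORE_RE = re.compile(r"\(Core(\d+):/?\s*(.*?)\s*/Core\1\)", re.DOTALL)


def structured_abstract(document):
    """Map each section name to the list of core sentences found inside it."""
    abstract = {}
    for section in SECTION_RE.finditer(document):
        name, body = section.group(1), section.group(2)
        # Collapse internal whitespace so multi-line core sentences read cleanly.
        abstract[name] = [" ".join(m.group(2).split()) for m in CORE_RE.finditer(body)]
    return abstract


sample = """
(Background: Some context. (Core1: Authors should add structure while writing. /Core1) More text. /Background)
(Contribution:(Core2: We propose a format. /Core2) Details follow. /Contribution)
(Discussion: (Core3:/ The format is open to change. /Core3) /Discussion)
"""

for name, sentences in structured_abstract(sample).items():
    print(f"{name}: {' '.join(sentences)}")
```

Because the result is keyed by section name, a renderer could turn each entry into a link that jumps to the corresponding section of the full text.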

/Contribution)

Discussion

(Discussion:

(Core5:/ We aim to work on different incarnations of this format and open it up to modification and development. /Core5) The point is to offer a flexible structure that can live in semantic environments such as Semantic Wikis (SemWeb, OntoWeb) and browsers (such as Haystack or PiggyBank). The aim is that, by adding markup, the discovery and integration of information are enhanced, by and for the Semantic Web community. One possible development would be the creation of a conference program consisting of "core-contribution" sentences that link to the contributions, as a quick way to browse the papers presented. Another would be to mine all the links to a project website and connect them to the website, each linked to the paragraph in the contribution where the project was mentioned. It is my aim to present at least a subset of the papers presented to this workshop in such a format. Ideally, ABCDE papers should be much easier to mine and integrate. (Core6:/A proposal to mine this format for Facts is also presented to this Workshop. /Core6)

/Discussion)