ABCDE Format: Publishing Semantic Conference Papers
DC.Creator.PersonalName Anita de Waard
DC.Creator.2 Simon Pepping
DC.Subject *** I.7.: Editing, Text
DC.Subject Semantic Web
DC.Identifier URN dewaardsemwiki2006
DC.Language ISO639-1 en
DC.Date.X-MetadataLastModified ISO8601 2006-02-02
Object (type: text) = (Background paragraph), relation = Footnote,
Subject (type: text) = "This background is a copy of the Introduction
paragraph of (Entity: Object (type: text) = (this contribution),
relation = Reference, Subject (type: URI) = http://labs.elsevier.com/resources/adw/papers/SWDaysDeWaard1209.pdf)).
It is an essential property of semantic conference contributions that they
can be composed in a modular format, i.e. linking to or reusing parts of existing
“There is a growing mountain of research. But there is increased evidence that we are being bogged down today as specialization extends. The investigator is staggered by the findings and conclusions of thousands of other workers - conclusions which he cannot find time to grasp, much less to remember, as they appear.”
Scientists are increasingly unable to process the ever-increasing flood of scientific literature that surrounds them. Biomedical literature, for instance, grows by over 500,000 publications each year (Cohen, 2005). In a recent study on user needs among British archaeologists, 71% of the respondents felt that information was produced of which they were unaware (Jones, 2001). Next to problems in accessing one’s own field, it becomes more and more difficult to access adjacent domains of science. Furthermore, scientists do not only want to know what publications contain specific words, and how to rank them by relevance, but what knowledge is contained within the papers, and how it relates to their existing knowledge. For example, cell biologists might want to know: “What functions of this gene are known?” Astronomers might ask “What radiation patterns have we seen in red-dwarf stars?” or “What theories does this new observation support?” Ideally, a new publication should situate itself within the existing knowledge context of the reader, and show how it affects or alters this context.
There have been many efforts to combat information overload in science. Abstracts were developed in the sixties and seventies. Although they are quicker to read than the full text, abstracts do not provide a full summary of the work described in the document, nor do they offer any way to integrate the document into existing knowledge. Metadata is a broad term covering many different types of information, but generally includes the bibliographic reference to a document, and descriptors such as keywords. Metadata helps retrieve an article when descriptive elements (author, title) are known. The main function of a keyword list is to classify the article in a category. But neither provides any direct insight into the knowledge conveyed within the body of a scientific paper.
Text mining and information extraction are methods specifically developed to find relevant information in unstructured texts and encode the information in a structured form, like a database record (Couto, 2003). In theory, text mining is the perfect solution to transforming factual knowledge from publications into database entries. However, automatically identifying concepts such as genes and proteins poses many problems, see e.g. Mons (2005) and Cohen (2005). Moreover, computational linguists have not yet developed tools that can analyse more than 30% of English sentences correctly and transform them into a structured formal representation. For this, the papers still need to be handled by a curator (Rebholz-Schuhmann, 2005).
The main problem
with automatically extracting information from scientific articles is that the
genre of the scientific publication has developed to be an indivisible information
unit (see e.g. Bazerman (1998)). The scientific paper is a self-contained narrative,
created anew in each iteration, with specific genre characteristics that minimize
the potential of identification, content reuse and knowledge integration. All
this rhetorical freedom comes at the expense of usability in a computer-centered
environment. The linear narrative was fine when we still read and wrote on paper,
but the changing (digital) environment in which scientists live and work calls
for a changing fundamental unit of communication.
(Core1: We believe that the best way to present a narrative to a computer is to let the author explicitly create a rich semantic structure for the article during writing /Core1) (see also de Waard, 2005). At a high level, this structure will consist of self-contained modular elements or entities, and discourse relationships between such elements (within a text, and between texts). The tension between these self-contained ‘knowledge elements’ or conceptual structures, and the meaning conveyed in the conventional narrative of the document as a whole, poses an interesting topic of study in terms of both knowledge modeling and rhetoric/discourse studies.
(Core2: As conceptual structures become the central bearer of information, a set of structured documents can be integrated to form a ‘knowledge network’, or structured package of related knowledge regarding a topic. /Core2) This can be envisaged (and modeled) as a network of nodes and relationships, and can be seen to form an incarnation of the ‘intelligent data’ ideal, which the Semantic Web is meant to enable (Berners-Lee, 2001). The purpose of this project is to examine such a new form of structuring, and the authoring, editing and retrieval processes needed to use it. Specifically, we are interested in representing conference proceedings in a new way. Semantic Browsers such as PiggyBank [ ] and semantic collaborative authoring tools such as Semantic Wiki [ ] are paving the road for distributed, semantic communities to communicate.
(Contribution:(Core3: We propose an open-standard, widely (re)usable format, the ABCDE Format (ABCDEF), for proceedings and workshop contributions that can be easily mined, integrated and consumed by semantic browsers and wikis. /Core3) This format can be produced in several forms: LaTeX, XML, a Microsoft Word template or a simple text file. It is characterised by the following elements:
A - Annotation. Each record contains a set of metadata that follows the Dublin Core standard. Minimal required fields are Title, Creator, Identifier and Date.
B, C, D - Background, Contribution, Discussion. The main body of text consists of three sections:
These section headings need to exist somewhere in the metadata of the article - but they can be hidden markup. Also, each of the sections can have different, and differently named, subheadings.
E - Entities. Throughout the text, entities such as references, personal names, project websites, etc. are identified by:
In other words, the entity link can be described as an RDF statement.
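The A and E elements above can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the `DC.<Element> value` line syntax is borrowed from the header of this contribution, the Object/relation/Subject fields follow the footnote notation used at the top of the paper, and the names `parse_annotation`, `missing_fields` and `EntityLink` are hypothetical helpers, not part of the format.

```python
from dataclasses import dataclass

# Minimal required Dublin Core fields named in the Annotation element.
REQUIRED_FIELDS = ("Title", "Creator", "Identifier", "Date")


def parse_annotation(lines):
    """Collect 'DC.<Element>[.<Qualifier>] value' lines into a dict keyed by element."""
    record = {}
    for line in lines:
        line = line.strip()
        if not line.startswith("DC."):
            continue  # not an annotation line
        key, _, value = line.partition(" ")
        element = key.split(".")[1]  # 'DC.Creator.PersonalName' -> 'Creator'
        record.setdefault(element, []).append(value.strip())
    return record


def missing_fields(record):
    """Return the required fields that have no value in the record."""
    return [name for name in REQUIRED_FIELDS if not record.get(name)]


@dataclass(frozen=True)
class EntityLink:
    """One entity statement, in the paper's Object / relation / Subject order."""
    object_type: str
    object_value: str
    relation: str
    subject_type: str
    subject_value: str

    def as_triple(self):
        # A three-part statement, kept in the order the paper writes it.
        return (self.object_value, self.relation, self.subject_value)


# Assumed sample header; the Date field is deliberately left out.
sample = [
    "DC.Title ABCDE Format: Publishing Semantic Conference Papers",
    "DC.Creator.PersonalName Anita de Waard",
    "DC.Identifier URN dewaardsemwiki2006",
    "DC.Language ISO639-1 en",
]
print(missing_fields(parse_annotation(sample)))  # -> ['Date']

# The entity statement taken from the footnote of this contribution.
link = EntityLink(
    "text", "this contribution", "Reference", "URI",
    "http://labs.elsevier.com/resources/adw/papers/SWDaysDeWaard1209.pdf",
)
print(link.as_triple())
```

Keeping the link as a plain three-part tuple leaves open how it would be serialised; mapping it onto a concrete RDF syntax is a separate decision that this sketch does not make.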
(Core4:/ There is no abstract in an ABCDE document - instead, within the B, C and D paragraphs the author denotes 'core' sentences. Upon retrieval or rendering of the article, these can be extracted to form a structured abstract of the article - where one can jump directly to the core of the Background, Contribution or Discussion. /Core4)
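As a rough illustration of how core sentences might be pulled out at rendering time, the following Python sketch scans plain text for the `(CoreN: ... /CoreN)` markers used in this contribution. The regular expression and the function name `extract_core_sentences` are assumptions made for the example; the format itself does not prescribe an extraction method, and grouping the results under Background, Contribution and Discussion would additionally require the section markup.

```python
import re

# Matches '(Core3: text /Core3)' and the '(Core4:/ text /Core4)' variant.
CORE_PATTERN = re.compile(r"\(Core(\d+):/?\s*(.*?)\s*/Core\1\)", re.DOTALL)


def extract_core_sentences(text):
    """Return (number, sentence) pairs in document order, whitespace normalised."""
    return [
        (int(number), " ".join(body.split()))
        for number, body in CORE_PATTERN.findall(text)
    ]


# Assumed sample text, shortened from the markers used in this paper.
sample_text = (
    "(Core1: We believe the author should create the structure. /Core1) "
    "Some supporting text that is not part of the abstract. "
    "(Core4:/ There is no abstract in an ABCDE document. /Core4)"
)

for number, sentence in extract_core_sentences(sample_text):
    print(f"Core{number}: {sentence}")
```

The back-reference `\1` ensures that an opening marker is only paired with the closing marker carrying the same number.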
(Core5:/ We aim to work on different incarnations of this format and open it up to modification and development. /Core5) The point is to offer a flexible structure that can live in semantic environments such as Semantic Wikis (SemWeb, OntoWeb) and browsers (such as Haystack or PiggyBank). The aim is that, by adding markup, the discovery and integration of information are enhanced, by and for the Semantic Web community. One possible development would be the creation of a conference program consisting of "core-contribution" sentences that link to the contributions, as a quick way to scroll around the papers presented. Another would be to mine all the links to a project website and connect them to the website, linked to the paragraph in the contribution where the project was mentioned. It is my aim to present at least a subset of papers presented to this workshop in such a format. Ideally, ABCDE papers should be much easier to mine and integrate. (Core6:/A proposal to mine this format for Facts is also presented to this Workshop. /Core6)
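The link-mining idea mentioned above could look roughly like the following Python sketch, which records for every URL the paragraphs in which it occurs. It assumes a contribution has already been split into a list of paragraph strings; the function name `mine_links`, the URL pattern and the example address are illustrative assumptions only.

```python
import re

# Simple URL pattern; stops at whitespace, parentheses and square brackets.
URL_PATTERN = re.compile(r"https?://[^\s()\[\]]+")


def mine_links(paragraphs):
    """Map each URL to the 1-based numbers of the paragraphs that mention it."""
    index = {}
    for number, paragraph in enumerate(paragraphs, start=1):
        for url in URL_PATTERN.findall(paragraph):
            url = url.rstrip(".,;")  # drop trailing sentence punctuation
            index.setdefault(url, []).append(number)
    return index


# Assumed sample paragraphs with a hypothetical project address.
sample_paragraphs = [
    "The project is described at http://example.org/project for reference.",
    "This paragraph contains no links.",
    "We return to the project (http://example.org/project).",
]

print(mine_links(sample_paragraphs))
```

An index of this kind is the raw material for linking a project website back to the paragraphs that mention it.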