
Scientific Discourse Structure Group HCLS IG W3C


The exponential growth of the World Wide Web in the last decade brought with it an explosion in the information space. Similarly, in the area of scientific literature, the number of publishing venues (journals, conferences, workshops, etc.) is increasing substantially. The biomedical and pharmaceutical domains are among the most heavily affected. For example, MedLine now hosts over 18 million articles and has a growth rate of 0.5 million articles/year (around 1300 articles/day). This makes finding and associating relevant work for a particular field a cumbersome task.

One of the main reasons for this problem is that indexing publications based only on syntactic resources is no longer sufficient. In a typical publication, authors state claims, positions or arguments in relation to their own achievements or to the results achieved by other researchers. These epistemic items are the key to decoding the rhetoric captured within the publications' content, so identifying them could be a possible solution to the information overload problem. Additionally, these items can have different granularities, ranging from single sentences to entire paragraphs, and may contain important domain knowledge.

The goal of this group is to research and develop a formal structure describing these discourse knowledge items, considering the following aspects:

  • Granularity: We will start by forming a coarse-grained structure that models larger spans of text carrying a rhetorical role, and then move to a finer granularity with the goal of modeling sentences or phrases, together with their associated rhetorical relations.
  • Domain-specificity: As we target scientific literature in general, and the biomedical and pharmaceutical domains in particular, we need to be able to capture specific needs that emerge from these different directions. Consequently, we will create a core model, general enough to be applicable to all domains, to which we will attach several domain-specific modules.
  • Domain knowledge: While our focus is on the rhetorical roles of the discourse spans, these may also contain domain concepts that might be of interest to the user (or machine). For the time being, we will not address particular domain knowledge present in the scientific content, but will design the rhetorical structures so that they can easily be linked to specific domain ontologies.
  • Creation mechanisms: Developing the model will not solve the above-mentioned problem unless we apply it. In this direction, we will look at a priori application (i.e. at the time of authoring / submission of a scientific publication or material), as well as a posteriori application (i.e. enriching the scientific content with rhetorical structures after it has already been published).
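
As an illustration only, the aspects above (nested granularity levels, a core model extended by domain-specific modules, and a hook for links to domain ontologies) could be sketched as a small data structure. Every name here (Granularity, DiscourseSpan, CORE_ROLES, BIOMEDICAL_ROLES, validate) is hypothetical and is not part of any model the group has defined; the role names are borrowed loosely from the text above.

```python
from dataclasses import dataclass, field
from enum import Enum
from typing import List


class Granularity(Enum):
    # Hypothetical levels, from coarse to fine.
    SECTION = "section"
    PARAGRAPH = "paragraph"
    SENTENCE = "sentence"
    PHRASE = "phrase"


# Assumed core rhetorical roles, meant to apply to all domains.
CORE_ROLES = {"claim", "position", "argument"}

# Assumed domain-specific module that extends the core.
BIOMEDICAL_ROLES = {"hypothesis", "experimental_result"}


@dataclass
class DiscourseSpan:
    text: str
    role: str
    granularity: Granularity
    # Placeholder for links to concepts in domain ontologies (e.g. URIs).
    concept_uris: List[str] = field(default_factory=list)
    # Finer-grained spans nested inside this one.
    children: List["DiscourseSpan"] = field(default_factory=list)


def validate(span, modules=()):
    """Return True if every role in the span tree belongs to the core
    roles or to one of the supplied domain modules."""
    allowed = CORE_ROLES.union(*modules)
    return span.role in allowed and all(
        validate(child, modules) for child in span.children
    )


# A paragraph-level claim containing a sentence-level hypothesis.
sentence = DiscourseSpan("...", "hypothesis", Granularity.SENTENCE)
paragraph = DiscourseSpan("...", "claim", Granularity.PARAGRAPH,
                          children=[sentence])

core_only = validate(paragraph)                          # core roles only
with_module = validate(paragraph, [BIOMEDICAL_ROLES])    # core + module
```

With only the core roles the nested "hypothesis" span is rejected, while adding the assumed biomedical module accepts it. Keeping the modules as separate sets is one simple way to keep the core general and attach domain-specific vocabulary as needed.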


  • O1: Study and develop a model, i.e. a set of elements (a core model plus additional domain-specific modules), that captures the rhetorical aspects of scientific publications at different levels of granularity.
  • O2: In the long run, integrate the developed model in different environments to improve the user experience of authoring, retrieving and browsing.
  • O3: Create a mock-up annotated document collection using this element set for each use case.
  • O4: Disseminate these models and document systems to the rest of the W3C and the community at large.

Use Cases

  1. Helping Drug Discovery Through Hypothesis-Based Knowledge Bases
  2. Improving the Structure of Digital Publications in the Computer Science Domain
  3. Information Enhancement and Improved Search of Biomedical Publications

Conference phone number

  • Dial-In #: +1.617.761.6200 (Cambridge, MA)
  • Dial-In #: + (Nice, France)
  • Dial-In #: +44.117.370.6152 (Bristol, UK)
  • Participant Access Code: 42572 ("HCLS2")
  • IRC Channel: irc.w3.org port 6665 channel #HCLS2 (see W3C IRC page for details, or see Web IRC) [direct IRC link http://www.mibbit.com/chat/?server=irc.w3.org:6665&channel=%23hcls2]




Medium Grained

Earlier Models and Related

Elsevier Examples

Model alignment

[Image: Scientific Discourse elements of a paper mod C.jpg]


  • Anita deWaard (Elsevier)
  • Mike Taylor (Elsevier)
  • Tudor Groza (DERI)
  • Paolo Ciccarese (Mass General Hospital, Harvard Medical School)
  • Tim Clark (Mass General Hospital, Harvard Medical School)
  • Keith Gutfreund (Elsevier)
  • Jodi Schneider (DERI)
  • Alex Passant (DERI)
  • Matthias Samwald (DERI)
  • Ron Daniel (Elsevier)
  • Joanne Luciano (Tetherless World Constellation, Rensselaer Polytechnic Institute)