Scientific Discourse Structure Group HCLS IG W3C
The exponential growth of the World Wide Web in the last decade brought with it an explosion in the information space. Similarly, in the area of scientific literature, the number of publishing venues (journals, conferences, workshops, etc.) is increasing substantially. The biomedical and pharmaceutical domains are particularly heavily affected. For example, MedLine now hosts over 18 million articles and grows by about 0.5 million articles per year (around 1300 articles per day). This makes finding and associating relevant work for a particular field a cumbersome task.
One of the main reasons for this problem is that indexing publications based only on syntactic resources is no longer sufficient. The typical publication process consists of authors stating claims, positions or arguments in relation to their own achievements or to the results achieved by other researchers. These epistemic items are the key to decoding the rhetoric captured within a publication's content, so identifying them could be a possible solution to the information overload problem. Additionally, these items can have different granularities, ranging from single sentences to entire paragraphs, and may contain important domain knowledge.
The goal of this group is to research and develop a formal structure describing these discourse knowledge items, considering the following aspects:
- Granularity: We will start by forming a coarse-grained structure, modeling larger spans of text that carry a rhetorical role and then lower the granularity with the goal of modeling sentences or phrases, together with their associated rhetorical relations.
- Domain-specificity: As we target scientific literature in general, and the biomedical and pharmaceutical domains in particular, we need to be able to capture specific needs that emerge from these different directions. Consequently, we will create a core model, general enough to be applicable to all domains, to which we will attach several domain-specific modules.
- Domain knowledge: While our focus is on the rhetorical roles of the discourse spans, these may also contain domain concepts that might be of interest to the user (or machine). For the time being, we will not address particular domain knowledge present in the scientific content, but will design the rhetorical structures in such a way that they can easily be linked to specific domain ontologies.
- Creation mechanisms: Developing the model will not solve the above-mentioned problem unless we apply it. In this direction, we will look at a priori application (i.e. at the time of authoring or submission of a scientific publication or material) as well as a posteriori application (i.e. enriching the scientific content with rhetorical structures after it has already been published).
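The aspects above can be illustrated with a minimal sketch of one possible data structure. It is not the group's model: the class names, the role labels beyond "claim", "position" and "argument", the "hypothesis" role in the example module, and the example.org URI are all assumptions made for illustration. It shows a core set of rhetorical roles, a domain-specific module that extends it, items at two granularities, and an optional link from an item to a domain ontology concept.

```python
from __future__ import annotations

from dataclasses import dataclass, field
from enum import Enum


class Granularity(Enum):
    COARSE = "coarse"  # larger spans of text, e.g. a section or paragraph
    FINE = "fine"      # sentences or phrases


# Core roles taken from the text above; any other label is hypothetical.
CORE_ROLES = frozenset({"claim", "position", "argument"})


@dataclass
class DomainModule:
    """Hypothetical domain-specific extension adding roles to the core set."""
    name: str
    extra_roles: frozenset[str]


@dataclass
class DiscourseItem:
    """A span of text that carries a rhetorical role."""
    text: str
    role: str
    granularity: Granularity
    # Optional links to concepts in a domain ontology (URIs are placeholders).
    concept_uris: list[str] = field(default_factory=list)
    # Finer-grained items nested inside a coarser one.
    children: list[DiscourseItem] = field(default_factory=list)


def allowed_roles(modules: list[DomainModule]) -> set[str]:
    """Core roles plus the roles contributed by each attached module."""
    roles = set(CORE_ROLES)
    for module in modules:
        roles |= module.extra_roles
    return roles


def invalid_roles(item: DiscourseItem, modules: list[DomainModule]) -> list[str]:
    """Collect roles in the item tree that no attached module permits."""
    allowed = allowed_roles(modules)
    found = [] if item.role in allowed else [item.role]
    for child in item.children:
        found.extend(invalid_roles(child, modules))
    return found


# Example: a coarse block containing one fine-grained, domain-specific item.
biomed = DomainModule("biomedical", frozenset({"hypothesis"}))
sentence = DiscourseItem(
    text="Protein X may regulate pathway Y.",
    role="hypothesis",
    granularity=Granularity.FINE,
    concept_uris=["http://example.org/ontology#ProteinX"],
)
section = DiscourseItem(
    text="Discussion section ...",
    role="argument",
    granularity=Granularity.COARSE,
    children=[sentence],
)
```

In this sketch, `invalid_roles(section, [biomed])` returns an empty list, while `invalid_roles(section, [])` reports the "hypothesis" role, because only the core roles are permitted when no domain module is attached.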
- O1: Study and develop a model, i.e. a set of elements (a core model plus additional domain-specific modules), that captures the rhetorical aspects of scientific publications at different levels of granularity.
- O2: In the long run, integrate the developed model in different environments to improve the user experience of authoring, retrieving and browsing.
- O3: Make a mocked-up annotated document collection using this element set for each use case.
- O4: Disseminate these models and document systems to the rest of the W3C and the community at large.
- Helping Drug Discovery Through Hypothesis-Based Knowledge Bases
- Improving the Structure of Digital Publications in the Computer Science Domain
- Information Enhancement and Improved Search of Biomedical Publications
Conference phone number
- Dial-In #: +1.617.761.6200 (Cambridge, MA)
- Dial-In #: +33.4.89.06.34.99 (Nice, France)
- Dial-In #: +44.117.370.6152 (Bristol, UK)
- Participant Access Code: 42572 ("HCLS2")
- IRC Channel: irc.w3.org port 6665 channel #HCLS2 (see W3C IRC page for details, or see Web IRC) [direct IRC link http://www.mibbit.com/chat/?server=irc.w3.org:6665&channel=%23hcls2]
- January 30 2012
- January 23 2012
- (more missing here)
- October 10 2011
- Needs review (lots of missing ones)
- July 25 2011: Ping Wang, Rensselaer Polytechnic - A Semantically-Enabled Provenance-Aware Water Quality Portal
- July 11 2011: Adrian Walker - Application Semantics via Rules in Open Vocabulary Executable English
- June 20 2011: Anita de Waard - Executable Papers
- June 6 2011: Merce Crosas - Data Citation Principles
- May 23 2011: Joanne Luciano - SADI; Alex Garcia - RDFising Biomedical Docs
- May 2 2011: BioRDF Demonstrator - Collaboration
- April 18 2011: Discussion on medium-grained ontologies and alignment
- April 11 2011: Jodi Schneider: Medium-grained ontologies and alignment
- April 4 2011: Report on the Open Annotation Consortium Workshop
- March 28 2011
- March 21 2011
- March 14 2011
- March 7 2011
- February 7 2011
- January 31 2011
- December 20 2010
- December 13 2010
- November 29 2010
- November 15 2010
- October 7 2010
- September 27 2010
- Face-to-Face July 12 2010
- June 21 2010
- June 7 2010
- May 10 2010
- April 19 2010
- March 1 2010
- February 15 2010
- January 25 2010
- January 11 2010
- December 14 2009
- December 1 2009
- November 23 2009
- Blocks ontology aka ORB (Ontology of Rhetorical Blocks), see also draft note
- RDF Version of Coarse-Grained Document
- Coarse-Grained structure from different sources
- 3 docs for ORB are now posted here using NLM DTD
- Medium Grain alignment
- Anita's proposal on Medium Granularity
- Comparison Grid with blocks and subblocks
Earlier Models and Related
- Anita de Waard (Elsevier)
- Mike Taylor (Elsevier)
- Tudor Groza (DERI)
- Paolo Ciccarese (Mass General Hospital, Harvard Medical School)
- Tim Clark (Mass General Hospital, Harvard Medical School)
- Keith Gutfreund (Elsevier)
- Jodi Schneider (DERI)
- Alex Passant (DERI)
- Matthias Samwald (DERI)
- Ron Daniel (Elsevier)
- Joanne Luciano (Tetherless World Constellation, Rensselaer Polytechnic Institute)