HCLSIG/SWANSIOC/Document-Annotation-Subtask/UseCases/1

Information Enhancement and Improved Search of Biomedical Publications

prepared by Paolo Ciccarese, October 5, 2010 (derived from and connected to the Rhetorical Structure Use Case 3 by Tim Clark, December 4, 2009)

Text Mining Related Use Case: This use case is connected to the usage of entity recognition tools, which establish the initial links from free text to terminology/ontology systems.

1. Introduction

Biomedical research publications are typically presented as free text. Linking consolidated knowledge to new research publications, through the lexical elements that represent controlled terms, is valuable: it expands the explicit information content of a publication and enhances search through structured metadata.

Available terminology systems include names and synonyms for entities (genes, proteins, anatomical and cellular structures, organisms, etc.); processes (pathways, reactions, functional processes); conditions (diseases, phenotypes, etc.); reagents (antibodies, biological models, gene constructs, etc.); hypotheses, claims and research questions (as in SWAN); and information resources (publications, websites, databases). Furthermore, the networks of information attached to elements of such term systems (e.g., Entrez Gene, UniProt, Neuroscience Information Framework) are becoming increasingly important research tools because they embody consolidated knowledge.

2. Use Case

Given documents such as journal articles (http://tinyurl.com/2e5lgkb) and news items (http://tinyurl.com/29owd63), we want to enable the RECS (Run, Encode, Curate, Share) process:

  • Run entity recognition tools such as:
   1. NCBO Annotator web service: a Web service that annotates a piece of free text with related ontology concepts and terms
   2. Textpresso: a text-mining system for scientific literature
  • Encode the results with their provenance.
   Desired data:
   1. Annotated document
   2. Annotated document/text fragment
   3. Associated semantic entity identified by a URI
   4. Service that generated the results
   5. Agent who ran the service
   6. Date of generation of the text mining results 
  • Curate the results, recording provenance. Text mining tools can produce incorrect annotations. We want users/curators to be able to accept, reject, or discuss each result while keeping track of every step of the curation process (each recorded step is called a 'curation token' here).
   Desired data:
   1. Curation action type: accept (possibly refined as acceptAsExactMatch, acceptAsNarrowMatch, acceptAsBroadMatch) or rejectAsWrong
   2. Textual description
   3. Curator (creator of the curation token)
   4. Date of curation
   5. Target annotation
   6. Previous curation token in the curation chain
  • Share the curated annotation in a common RDF-based format and separately from the original document.
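
The Encode, Curate and Share steps above can be sketched as a small data model. This is only an illustration under stated assumptions, not a proposed format: the class and function names, the annotation `uri` field, and the `http://example.org/recs#` namespace are all hypothetical, since the use case does not yet name a vocabulary. The fields mirror the "Desired data" lists, the `previous` link models the curation chain, and the provenance is written out as N-Triples so it can be stored separately from the original document.

```python
from dataclasses import dataclass
from datetime import date
from typing import List, Optional, Tuple

# Hypothetical namespace: the use case does not name a vocabulary.
EX = "http://example.org/recs#"
XSD_DATE = "http://www.w3.org/2001/XMLSchema#date"


@dataclass(frozen=True)
class Annotation:
    uri: str            # identifier of the annotation itself (assumed)
    document: str       # annotated document
    fragment: str       # annotated text fragment
    entity: str         # associated semantic entity, identified by a URI
    service: str        # service that generated the result
    agent: str          # agent who ran the service
    generated_on: date  # date the text mining result was generated


@dataclass(frozen=True)
class CurationToken:
    action: str                                 # e.g. "accept", "rejectAsWrong"
    description: str                            # textual description
    curator: str                                # creator of the curation token
    curated_on: date                            # date of curation
    target: Annotation                          # target annotation
    previous: Optional["CurationToken"] = None  # previous token in the chain


def curation_chain(latest: CurationToken) -> List[CurationToken]:
    """Follow `previous` links and return the tokens oldest first."""
    chain = []
    token: Optional[CurationToken] = latest
    while token is not None:
        chain.append(token)
        token = token.previous
    chain.reverse()
    return chain


def literal(text: str) -> str:
    """Quote a plain N-Triples literal, escaping backslash, quote, newline."""
    escaped = text.replace("\\", "\\\\").replace('"', '\\"').replace("\n", "\\n")
    return '"' + escaped + '"'


def to_ntriples(ann: Annotation) -> str:
    """Serialize the provenance of one annotation as N-Triples lines."""
    day = literal(ann.generated_on.isoformat())
    rows: List[Tuple[str, str]] = [
        ("document", f"<{ann.document}>"),
        ("fragment", literal(ann.fragment)),
        ("entity", f"<{ann.entity}>"),
        ("service", f"<{ann.service}>"),
        ("agent", f"<{ann.agent}>"),
        ("generatedOn", f"{day}^^<{XSD_DATE}>"),
    ]
    return "\n".join(f"<{ann.uri}> <{EX}{prop}> {obj} ." for prop, obj in rows)
```

Because each token points at its predecessor, a curator's later "accept" does not overwrite an earlier "rejectAsWrong": `curation_chain` recovers the full discussion in order. Curation tokens could be serialized the same way as annotations; that is omitted here for brevity.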

3. Open Issues

a. Defining the curation token's 'action type' in response to text mining results

Examples inspired by the Simple Knowledge Organization System (SKOS) model are:

  • Right, Wrong
  • Right, Too broad, Wrong
  • Right, Too broad, Wrong, Too narrow
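
To make the three candidate vocabularies above easier to compare, they can be written down as alternative value sets over one enumeration. This is a sketch only: the names `Verdict`, `SCHEMES`, `check`, and the scheme labels are hypothetical, and nothing here settles which option should be adopted.

```python
from enum import Enum


class Verdict(Enum):
    """Candidate curation verdicts listed in the open issue."""
    RIGHT = "Right"
    TOO_BROAD = "Too broad"
    TOO_NARROW = "Too narrow"
    WRONG = "Wrong"


# The three options from the list above; the scheme labels are assumptions.
SCHEMES = {
    "two-valued": (Verdict.RIGHT, Verdict.WRONG),
    "three-valued": (Verdict.RIGHT, Verdict.TOO_BROAD, Verdict.WRONG),
    "four-valued": (
        Verdict.RIGHT,
        Verdict.TOO_BROAD,
        Verdict.WRONG,
        Verdict.TOO_NARROW,
    ),
}


def check(scheme: str, verdict: Verdict) -> Verdict:
    """Return the verdict if the chosen scheme allows it, else raise."""
    if verdict not in SCHEMES[scheme]:
        raise ValueError(f"{verdict.value!r} is not part of the {scheme} scheme")
    return verdict
```

Each scheme is a superset of the previous one, so a curation token recorded under a coarser scheme remains valid if a finer scheme is adopted later, while the reverse does not hold.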