prepared by Tim Clark, December 4, 2009
Biomedical research is both experimental and observational, as well as theoretical, and has a long history of developing rich terminology systems to classify its findings. These systems include names and synonyms for entities (genes, proteins, anatomical and cellular structures, organisms, etc.); processes (pathways, reactions, functional processes); conditions (diseases, phenotypes, etc.); reagents (antibodies, biological models, gene constructs, etc.); hypotheses, claims and research questions (as in SWAN); and information resources (publications, websites, databases).
Furthermore, networks of information related to elements in such term systems (e.g., Entrez Gene, UniProt, NIF) are becoming increasingly important as research tools as they embody consolidated knowledge.
However, biomedical research publications, which are the major means of communicating new research findings and situating them within (or exploding) consolidated knowledge, are typically presented as free text. Several projects (WhatIzIt, Reflect, etc.) have shown the value of linking consolidated knowledge to new research publications through the lexical elements that represent controlled terms. In so doing, we can expand the explicit information content of the publication and also enhance search through structured metadata.
This approach depends critically on the performance of "entity recognition" tools, which establish the initial links from free text to terminology systems. The performance of such tools is context-dependent.
Biomedical researchers; computer scientists working in (a) entity recognition and (b) semantic interoperability applications.
3. Use Case
As noted above, whether an algorithm can successfully detect elements of controlled terminologies depends on the context in which the lexical elements that may represent such terms appear. For example, does "APP" refer to "amyloid precursor protein" or to a software application in the Apple iPhone App Store? Is "MD" a credential or does it refer to Marek's Disease? Is "PD" "Parkinson's Disease" or is it the "pD putative transmembrane protein", gene id 3678400, in Spiroplasma citri?
We believe that merely defining and applying a simple, coarse-grained model of rhetorical structure, one that corresponds to the rhetorical purpose of major document sections, could dramatically improve the performance of entity recognition software in such ambiguous cases, reducing the need for human intervention and improving the cost-effectiveness of the software.
At a minimum, major rhetorical blocks should be defined for methods and materials; references; experimental results; discussion; and provenance (author and publication information).
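To make the idea concrete, the following is a minimal Python sketch of how a coarse-grained rhetorical block label might be used to narrow the candidate senses of an ambiguous token. It is an illustration only, not an existing tool: the names Block, CANDIDATES, PLAUSIBLE_KINDS and resolve, the sense categories ("credential", "disease", "protein", "software"), and the table of which categories are plausible in which block are all assumptions introduced here. A real entity recognition system would draw its candidates from actual terminology resources and would likely weight senses rather than filter them outright.

```python
from enum import Enum


class Block(Enum):
    """Coarse rhetorical blocks, following the minimum set listed above."""
    PROVENANCE = "provenance"
    METHODS = "methods and materials"
    RESULTS = "experimental results"
    DISCUSSION = "discussion"
    REFERENCES = "references"


# Hypothetical candidate senses for ambiguous tokens, as (kind, label) pairs.
# The examples are taken from the use case text; the "kind" labels are assumed.
CANDIDATES = {
    "MD": [("credential", "Doctor of Medicine"),
           ("disease", "Marek's Disease")],
    "PD": [("disease", "Parkinson's Disease"),
           ("protein", "pD putative transmembrane protein")],
    "APP": [("protein", "amyloid precursor protein"),
            ("software", "software application")],
}

# Assumed table of which sense kinds are plausible in each rhetorical block.
PLAUSIBLE_KINDS = {
    Block.PROVENANCE: {"credential"},
    Block.METHODS: {"disease", "protein", "software"},
    Block.RESULTS: {"disease", "protein"},
    Block.DISCUSSION: {"disease", "protein"},
    Block.REFERENCES: {"credential", "disease", "protein", "software"},
}


def resolve(token, block):
    """Return the candidate senses of `token` that are plausible in `block`.

    If the block rules out every candidate, all candidates are returned
    unchanged so that the case can still be passed to a human curator.
    """
    senses = CANDIDATES.get(token, [])
    kept = [s for s in senses if s[0] in PLAUSIBLE_KINDS[block]]
    return kept if kept else senses


if __name__ == "__main__":
    for token, block in [("MD", Block.PROVENANCE),
                         ("MD", Block.RESULTS),
                         ("APP", Block.DISCUSSION),
                         ("PD", Block.RESULTS)]:
        print(token, "in", block.value, "->", resolve(token, block))
```

Under these assumed tables, "MD" narrows to the credential in the provenance block and to Marek's Disease in experimental results, while "PD" in experimental results still has two candidates. Block-level context of this kind reduces ambiguity but does not remove it in every case, so some human review or finer-grained context would still be needed.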