HCLSIG BioRDF Subgroup/aTags

From W3C Wiki

BioSIOC task: Representing biomedical statements with SIOC and associative tags (aTags)

Task Objectives

  1. Identify biomedical datasets that can be represented with SIOC and aTags
  2. Represent the statements in these datasets with SIOC/aTags and host them as linked data
  3. Analyse the results: which kinds of research problems can be addressed with aTags, and where are extensions needed?


aTag datasets


  • Matthias Samwald
  • Kei Cheung

Definition: Associative Tags (aTags)

An aTag is a simple, unordered set of entities. Entities in this set are regarded as associated in some way. The notion of 'association' is loosely defined by the aTag specification, but it also depends on the pragmatics that arise in the communities that use aTags. In the context of the biomedical domain, 'associated' means that the entities are involved in some kind of causal, biologically relevant phenomenon.

The process of applying aTags to a document or database entry differs from the current process of adding tags in the following ways:

  • The primary intention of creating aTags is not the categorization of the document, but the representation of the key facts inside the document. Key facts in the biomedical domain might be, for example, “Protein A interacts with protein B” or “Overexpression of protein A in tissue B is the cause of disease C”.
  • An aTag consists of a set of associated entities. The size of the set is arbitrary, but will typically lie between 2 and 5 entities. For example, the fact “Protein A binds to protein B” can be represented with an aTag consisting of the three entities “Protein A”, “Molecular interaction” and “Protein B”. Similarly, the fact “Overexpression of protein A in tissue B is the cause of disease C” can be represented with an aTag consisting of the four entities “Overexpression”, “Protein A”, “Tissue B” and “Disease C”.
  • Each document or database entry can be described with an arbitrary number of such aTags. Each aTag can be linked to the relevant portions of text or data at a fine level of granularity.
  • The entities in an aTag are not simple strings, but resources that are part of ontologies and RDF/OWL-enabled databases. For example, “Protein A” and “Protein B” are resources that are defined in the UniProt database, whereas “Molecular Interaction” is a class in the branch of biological processes of the Gene Ontology. They are identified with their URIs.

This makes it possible to integrate information encoded in aTags from various sources. For example, if one data source mentions “Protein A binds to protein B”, and another data source mentions that “Overexpression of protein A in tissue B is the cause of disease C”, both sources will use the same URI for the representation of “Protein A”, e.g. “http://uniprot.org/example/protein_A”. This re-use of existing entities and URIs will be facilitated by the web applications and widgets created by the project.

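Once the aTags from these different sources are aggregated, it is possible to pose a query such as “show me molecules that are associated with molecules that are associated with disease C”. With the two example statements above, this yields “protein B” as an answer, because protein B is associated with protein A, which is in turn associated with disease C.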
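The aggregation and the two-step query can be sketched in a few lines of Python. This is only an illustration under stated assumptions: every URI except the “protein_A” example is made up, the set of entities that count as molecules is hard-coded rather than looked up in an ontology, and the helper names are hypothetical.

```python
# All URIs are placeholders; only the protein_A URI appears in the text above.
PROTEIN_A = "http://uniprot.org/example/protein_A"
PROTEIN_B = "http://uniprot.org/example/protein_B"
MOLECULAR_INTERACTION = "http://example.org/onto/molecular_interaction"
OVEREXPRESSION = "http://example.org/onto/overexpression"
TISSUE_B = "http://example.org/onto/tissue_B"
DISEASE_C = "http://example.org/onto/disease_C"

# Each aTag is an unordered set of entity URIs.
atags_source_1 = [frozenset({PROTEIN_A, MOLECULAR_INTERACTION, PROTEIN_B})]
atags_source_2 = [frozenset({OVEREXPRESSION, PROTEIN_A, TISSUE_B, DISEASE_C})]

# Because both sources use the same URI for protein A,
# aggregating them is a plain set union.
all_atags = set(atags_source_1) | set(atags_source_2)

# Assumption: type information would normally come from the ontologies.
MOLECULES = {PROTEIN_A, PROTEIN_B}


def associated_with(entity, atags):
    """Return every entity that shares at least one aTag with `entity`."""
    related = set()
    for tag in atags:
        if entity in tag:
            related |= tag
    related.discard(entity)
    return related


def molecules_linked_via_molecules(entity, atags, molecules):
    """Return (first_hop, second_hop) sets of molecules for `entity`."""
    first_hop = associated_with(entity, atags) & molecules
    second_hop = set()
    for molecule in first_hop:
        second_hop |= associated_with(molecule, atags) & molecules
    return first_hop, second_hop


first_hop, second_hop = molecules_linked_via_molecules(
    DISEASE_C, all_atags, MOLECULES
)
print(first_hop)   # molecules directly associated with disease C
print(second_hop)  # molecules associated with those molecules
```

With only these two aTags, the first hop from disease C reaches protein A and the second hop reaches protein B. In a linked-data setting the same question would more likely be asked as a query over the hosted RDF, but the set-based view shows why shared URIs are what make the join possible.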

Relevant RDF/OWL resources containing entities that can be used in aTags

Relevant datasources for deriving a foundation of statements

  1. ++ PDSP Ki Database -- Neuroscientific ligand-receptor interaction
  2. ++ BindingDB http://www.bindingdb.org/bind/index.jsp
  3. Molecular interactions
  4. ++ OMIM Gene-Disease associations http://www.ncbi.nlm.nih.gov/Omim/omimfaq.html#download
  5. ++ Textmining results: Whatizit, Neurocommons Textmining, LifeSKIM KB, others
  6. ++ BAMS (brain module connectivity)
  7. ++ GOA
  8. ++ PubChem Bioassay ftp://ftp.ncbi.nih.gov/pubchem/Bioassay/
  9. ++ PharmGKB Example of an entry. Data available for download?
  10. Database of Protein Subcellular Localization (is this really original data??) http://www.bioinfo.tsinghua.edu.cn/~guotao/intro.html
  11. ++ LOCATE (subcellular protein locations, looks good!) http://locate.imb.uq.edu.au/
  12. Eukaryotic Subcellular Localization DataBase http://gpcr2.biocomp.unibo.it/esldb/download.htm
  13. eggNOG evolutionary genealogy of genes: Non-supervised Orthologous Groups http://eggnog.embl.de/
  14. CluSTr http://www.ebi.ac.uk/clustr/
  15. + Panther (classifies genes by their functions) http://www.pantherdb.org/
  16. DroID - the Drosophila Interactions Database http://www.droidb.org/DBdescription.jsp
  17. Human Ageing Genomic Resources http://genomics.senescence.info/
  18. + Drugbank http://www.drugbank.ca
  19. ++ COSMIC Catalogue of Somatic Mutations in Cancer
  20. ++ SIDER
  21. ++ BrainMaps http://brainmaps.org/connectivity2list.php?cmd=reset

Relevant datasources, ontologies and taxonomies for providing entities for tagging

  1. DBpedia
  2. + Biological process
  3. + Cell type
  4. + Cellular component
  5. + Chemical entities of biological interest (boosted over other chemical entities)
  6. Common Anatomy Reference Ontology
  7. + Foundational Model of Anatomy (subset)
  8. + Human disease
  9. + Mammalian phenotype
  10. + Molecular function
  11. + Phenotypic quality
  12. + Protein Ontology (PRO)
  13. Sequence types and features
  14. - Evidence codes

Templates for aTag generation from existing data

${ligand} binds to ${receptor} with ${value-partition-of-binding-affinity} (Ki value of ${ki-value}[, Organism ${organism}][, Tissue: ${tissue}]). Reported by [${author}[, ${year}]].

aTags: ligand, receptor, value-partition-of-binding-affinity
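One way such a template could be applied to a database record is sketched below. It assumes that square brackets mark optional parts, that the “Reported by” clause is dropped when no author is known, and that each entity arrives as a (label, URI) pair. The record values and URIs are invented for illustration and do not come from any of the listed databases.

```python
def render_binding_statement(record):
    """Return (sentence, atag) for one ligand-receptor binding record."""
    ligand_label, ligand_uri = record["ligand"]
    receptor_label, receptor_uri = record["receptor"]
    affinity_label, affinity_uri = record["affinity_partition"]

    # Mandatory Ki value, followed by the optional bracketed parts.
    details = [f"Ki value of {record['ki_value']}"]
    if record.get("organism"):
        details.append(f"Organism {record['organism']}")
    if record.get("tissue"):
        details.append(f"Tissue: {record['tissue']}")

    sentence = (
        f"{ligand_label} binds to {receptor_label} with {affinity_label} "
        f"({', '.join(details)})."
    )

    # Assumption: without an author the whole "Reported by" clause is omitted.
    if record.get("author"):
        source = record["author"]
        if record.get("year"):
            source += f", {record['year']}"
        sentence += f" Reported by {source}."

    # The aTag holds only the three entities named in the template's aTags line.
    atag = frozenset({ligand_uri, receptor_uri, affinity_uri})
    return sentence, atag


# Made-up record; labels, URIs and values are placeholders.
example_record = {
    "ligand": ("Ligand X", "http://example.org/ligand/X"),
    "receptor": ("Receptor Y", "http://example.org/receptor/Y"),
    "affinity_partition": ("high affinity", "http://example.org/affinity/high"),
    "ki_value": "12 nM",
    "organism": "Homo sapiens",
    "author": "Example Author",
}

sentence, atag = render_binding_statement(example_record)
print(sentence)
print(sorted(atag))
```

The human-readable sentence carries the Ki value and provenance, while the aTag itself stays a small set of URIs.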

Open questions

  • How should "datatype" properties be represented?
    • As datatype properties
    • As objects / value partitions
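
The two options can be contrasted with a small sketch for a Ki value. The bin boundaries, property name and URIs below are hypothetical and are not taken from the aTag specification; they only show what each choice would look like in practice.

```python
# Hypothetical upper bounds in nanomolar, checked in ascending order.
AFFINITY_PARTITIONS = [
    (10.0, "http://example.org/affinity/high"),
    (1000.0, "http://example.org/affinity/medium"),
]
LOW_AFFINITY = "http://example.org/affinity/low"


def as_datatype_property(ki_nm):
    """Option 1: keep the measured number as a literal value."""
    return {"property": "http://example.org/prop/kiValue", "value": ki_nm}


def as_value_partition(ki_nm):
    """Option 2: map the number to an entity that can be a member of an aTag."""
    for upper_bound, partition_uri in AFFINITY_PARTITIONS:
        if ki_nm < upper_bound:
            return partition_uri
    return LOW_AFFINITY


print(as_datatype_property(5.0))
print(as_value_partition(5.0))
```

A literal keeps the exact measurement but cannot be an entity in the set, whereas a value partition fits the set-of-entities model at the cost of losing precision.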