HCLSIG BioRDF Subgroup/Tasks/Using SW Technologies to Find Small Molecules that Bind to Proteins

Using SW Technologies to Find Small Molecules that Bind to Proteins

Task Objectives

  1. Define the scientific problem as a use case.
  2. Assess technical feasibility and interest.
  3. Write more detailed technical specs if there's sufficient interest.


BrianOsborne, EricMiller

Problem statement for this use case

Using a Gene or Protein query to find small molecules or drugs.


The researcher would like to be able to start with a gene or protein name or identifier as query and find compounds that bind to the protein or proteins encoded by this gene.

This query could be applied specifically to genes or proteins of interest to neuroscientists but is not limited to this biological domain.

Scientific Question

The binding of a small molecule to a given protein frequently has a biological consequence, such as inhibition of the protein's activity if the protein is an enzyme or a receptor. The researcher is interested in small molecules because they provide important information on possible therapeutics for a given disease. Alternatively, the small molecule could be used experimentally as a means to investigate the role of the protein in vivo or in vitro.


It is not easy to get small molecule binding data when starting with protein or gene data using public databases, though many commercial databases contain this sort of data. Currently, the only way to perform this task using public data is to copy and paste from one Web page to the next.

Certain databases are clearly heading towards this sort of functionality but have not yet arrived. PubChem is associating MeSH terms with compound entries, but the ability to query with these terms is not available (and MeSH is not sufficiently granular for the queries described here). ChemBank associates compounds with GO and MeSH terms, and one can query with these terms but, again, this is a different query than the ones discussed here.

This could be a compelling demonstration of Semantic Web technology and a useful scientific tool as well. See Critique for counterarguments.


Entrez Gene is NCBI's gene database. It currently does not link directly to any database of small molecule-protein interactions (it contains links to BIND and HPRD, databases of protein-protein interactions). It does, however, have conserved protein domain information on the proteins encoded by its genes.

SMID (Small Molecule Interaction Database) contains information on protein domains and their physical interactions with small molecules. Note: one cannot query SMID directly with protein or gene names or identifiers, only with domain identifiers.

Swissprot is a subset of Uniprot. It has extensive protein domain curation, probably with more domains per gene/protein than Entrez Gene.

All of these databases (Entrez Gene, Uniprot, SMID) are accessible through public Web interfaces, but SMID appears to use HTTP's POST and is not readily accessible to the programmer. However, all data can be downloaded as data files (XML, or RDF in the case of Uniprot); see Data Sources.

Vocabulary Requirements

No vocabularies or ontologies are required outside of the sets of identifiers used by the relevant databases. These sets include but are not limited to the set of all domain identifiers in SMART and the set of all compound identifiers in SMID or PubChem (PubChem Substance).


  • small molecule - a low-molecular-weight chemical compound. All metabolites and almost all drugs, drug candidates, and hormones are considered small molecules. It follows that a small molecule may occur naturally or may be synthetic. "Small" is an arbitrary threshold here, roughly less than 1000 daltons.

Data Sources



This archive contains text files and a SQL script. Running the SQL script creates and loads a MySQL database with the SMID data in the text files. For example:

 >mysql -u root < smidCMD.sql

In order to retrieve this file you'll need to sign up for a free account at Blueprint.

NCBI Entrez Gene


A directory containing many sub-directories, essentially for different species. Files would need to be converted to XML using NCBI's gene2xml tool.


Uniprot RDF, http://expasy3.isb-sib.ch/~ejain/rdf/data/


The following are examples of genes from Entrez Gene or proteins from Swissprot and the small molecules that bind their protein domains, from SMID. The examples are given as hyperlinks for simplicity. Both Entrez Gene and Swissprot entries are shown. The genes (Entrez Gene) or their proteins (Swissprot) are known to be involved in Parkinson's Disease.

Entrez Gene uses the CDD database (e.g. "cd00196"); Swissprot does not. Both use the SMART database (e.g. "smart00647" or "SM00647").
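
Because the same SMART domain can appear as "smart00647" in one source and "SM00647" in another, identifiers would need to be normalized before records from Entrez Gene and Swissprot can be joined. A minimal Python sketch follows. It assumes the two forms differ only in prefix and letter case and that the numeric part is always five digits, as in the examples here. The function name and the choice of the "SM" form as canonical are illustrative and not part of any database's API.

```python
import re

# Matches the two SMART spellings mentioned above: "smart00647" and "SM00647".
# Assumption: the numeric part is always five digits, as in the examples.
_SMART_ID = re.compile(r"^(?:smart|sm)(\d{5})$", re.IGNORECASE)


def normalize_smart_id(identifier):
    """Return the identifier in the 'SM00647' form, or None if it is not a SMART id."""
    match = _SMART_ID.match(identifier.strip())
    if match is None:
        return None
    return "SM" + match.group(1)
```

CDD identifiers such as "cd00196" do not match the pattern and return None, so they can be routed to a separate lookup.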


Gene or protein links

http://www.ncbi.nlm.nih.gov/entrez/query.fcgi?db=gene&cmd=Retrieve&dopt=full_report&list_uids=5071 (Entrez Gene).

http://www.expasy.ch/uniprot/O60260 (Swissprot).

Protein domain links

The PARK2 protein has 2 Conserved Domains, smart00647/IBR (http://www.ncbi.nlm.nih.gov/Structure/cdd/cddsrv.cgi?uid=554) and cd00196/UBQ (http://www.ncbi.nlm.nih.gov/Structure/cdd/cddsrv.cgi?uid=5394).

Small molecule links

The UBQ domain binds Cu++ (http://smid.blueprint.org/SMInfo.php?het=1094) and Mg++ (http://smid.blueprint.org/SMInfo.php?het=166), according to SMID.

The IBR domain binds Zn++ (http://smid.blueprint.org/SMInfo.php?het=172).

In order to access these pages you'll need to sign up for a free account at Blueprint.
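
The PARK2 example amounts to a two-step lookup: gene to protein domains, then protein domains to small molecules. The Python sketch below illustrates that chain using only the values quoted in this example. The dictionaries are hand-copied stand-ins for the Entrez Gene and SMID downloads, and all variable and function names are hypothetical.

```python
# Hand-copied from the PARK2 example above; a real implementation would load
# these tables from the Entrez Gene and SMID downloads instead.
GENE_TO_DOMAINS = {
    "5071": ["smart00647", "cd00196"],  # Entrez Gene ID for PARK2
}
DOMAIN_NAMES = {"smart00647": "IBR", "cd00196": "UBQ"}
DOMAIN_TO_LIGANDS = {
    "smart00647": ["Zn++"],
    "cd00196": ["Cu++", "Mg++"],
}


def small_molecules_for_gene(gene_id):
    """Follow gene -> domain -> small molecule and return (domain, ligand) pairs."""
    pairs = []
    for domain_id in GENE_TO_DOMAINS.get(gene_id, []):
        for ligand in DOMAIN_TO_LIGANDS.get(domain_id, []):
            pairs.append((DOMAIN_NAMES.get(domain_id, domain_id), ligand))
    return pairs


print(small_molecules_for_gene("5071"))
# [('IBR', 'Zn++'), ('UBQ', 'Cu++'), ('UBQ', 'Mg++')]
```

An unknown gene identifier yields an empty list rather than an error, which matches the situation where a gene has no domains recorded in SMID.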


Gene or protein links

http://www.ncbi.nlm.nih.gov/entrez/query.fcgi?db=gene&cmd=Retrieve&dopt=full_report&list_uids=11315 (Entrez Gene)

http://www.expasy.ch/uniprot/Q99497 (Swissprot)

Protein domain links

The PARK7 protein contains a DJ-1_PfpI domain (http://www.sanger.ac.uk/cgi-bin/Pfam/getacc?PF01965) and other protein domain links.

Small molecule links

The DJ-1_PfpI domain binds acetic acid (http://smid.blueprint.org/SMInfo.php?het=744) as well as 5 other small molecules.

In order to access these pages you'll need to sign up for a free account at Blueprint.

Chemical Taxonomies

Small molecule binding can also be described in terms of which molecular fragments in a compound are critical to binding to a specific site within a protein. This is a key concept in Drug R&D and is further described in the Chemical Taxonomies and Structure UseCase.


In order to determine whether the project is worth doing, a discussion of its weaknesses is necessary. This section departs a bit from the use case as it begins to discuss technical specs.

  • There is nothing specific to neuroscience or specific neurological diseases in this use case.
  • The data sets may not have all relevant information. This is particularly true of the SMID data set since it is based upon solved 3D structures from PDB, necessarily a small set of proteins and ligands. Furthermore, not all ligands for a given protein have been determined as parts of 3D structures.
  • Small molecules may bind but not through domains that are found in the domain databases used by SMID.
  • Entrez Gene data is problematic - its fields have not been mapped to any RDF (Swissprot data is already available as Uniprot RDF).
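
To make the last point concrete, the Python sketch below shows one way the gene-to-domain links from an Entrez Gene record might be written out as N-Triples. The `http://example.org/` namespace and the `hasDomain` predicate are hypothetical placeholders, since this use case does not define an RDF mapping for Entrez Gene fields. Only the gene and domain identifiers come from the examples on this page.

```python
EX = "http://example.org/"  # hypothetical namespace, not a published vocabulary


def gene_domains_to_ntriples(gene_id, domain_ids):
    """Return one N-Triples line per gene-domain link."""
    lines = []
    for domain_id in domain_ids:
        subject = f"<{EX}gene/{gene_id}>"
        predicate = f"<{EX}hasDomain>"  # hypothetical predicate
        obj = f"<{EX}domain/{domain_id}>"
        lines.append(f"{subject} {predicate} {obj} .")
    return lines


for line in gene_domains_to_ntriples("5071", ["smart00647", "cd00196"]):
    print(line)
```

A real mapping would need agreed URIs for genes, domains, and predicates so that the output could be merged with the existing Uniprot RDF.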


  1. 1st Deliverable - in progress
    • Complete use case
  2. 2nd Deliverable
    • General technical specification
  3. 3rd Deliverable

Possible supports and dependencies

Related resources

Tools and Services

Timeline for Task Completion

  • Stage 1 (3-month goals)
  • Stage 2 (6-month goals)