Nor any other part belonging to a gene

Biological and medical sciences make use of diverse data, much of it distributed across the globe. Data sharing standards for life sciences research should offer an opportunity to relieve data of the enormous overhead required for repetitious customized data integration. Each desirable data source in biological research tends to have been imported into multiple organizations, with new tools having to be created in each case. Efficiency in data sharing is one promise which standards offer. A second promise is that data relationships can be requested, permitting any investigator to rapidly explore a network of relationships defined by life sciences experiments and other data curation on the web. For these promises to be met, it is critical that the precision with which data and data relationships are presented be adequate for the decision-making process which we hope to impact as a community.

The first step in making data accessible is to label records, facilitating retrieval. However, when biological entities are named, it is often true that these labels do not constitute categories whose membership indicates meeting any set of necessary and sufficient conditions. For example, the notion of gene provides us with a biological term which is part of a natural language, and does not in general provide a diagnosable, technically definable entity. Nevertheless, by being cognizant of the utility of scientific concepts, we can avoid the data loss associated with computing across poorly behaved categories.

Unlike the Bard's rose, that which we call a gene by any other name may not smell as sweet. The entities which modern biologists and medical scientists study are not so easily observed as a flower. A gene is not an entity we can touch, clearly recognize, or even capture as an image. This heritable unit, contrary to oversimplified definitions, is not simply a stretch of the DNA polymer; a gene is a multifaceted concept. The not-necessarily-contiguous set of chromosomal intervals to which a transcript sequence maps in a eucaryote genome does not capture the various notions of gene. The functional role in the organism, the interactions of a DNA with protein complexes in cells within organisms, is not a property of a DNA sequence alone, especially not the transcribed RNA alone. Additionally, there exist gene regulatory properties of DNA, difficult to describe in simple data representations. However, to be clear-thinking scientists it is the observations and phenomena that we need to record and to share.

Data sharing is ideally a communication between publisher and user. In the life sciences, the responsibility of both sides to communicate well, for the publisher to make it clear exactly what is being shared, and for the user to use the data with due scientific diligence, is not at the moment the norm for web-delivered data.

As an example, let us consider a mythical set of data sources that a scientist wishes to use, and see how superficial naming can lead to trouble. What this cartoon scientist, Dr. Project, seeks is to gain further understanding of a particular gene, G000001. Dr. Project has an in-house microarray experiment report which states that G000001 is overexpressed in the tumors investigated relative to control tissue, with a 4.8 fold increase (p < 0.001). Now the scientist wishes to learn what sort of protein the gene encodes, and whether or not publicly available data corroborates the overexpression story. First, we have a biological sequence evaluated by a consortium, BioC. This annotation is based on a prediction algorithm, FunctionalProteins; G000001 is annotated as belonging to a gene class TumorSuppressor. Second, we have microarray data from Bliss University (BU). The microarray data from BU lists G000001 as not differentially expressed in comparison of the tumor type in question to control. No sequences for the hybridization are reported, so we simply have the gene label. Additionally, supplemental material for the published work from BU provides the fact that a TaqMan experiment for this gene confirmed no difference between tumor and control. Again, no sequences for TaqMan primers are shared. Dr. Project investigates a little deeper, and notes that the in-house microarray measured G000001 twice; the other set of measurements reveals only a 1.4 fold change (p < 0.2). Given a busy schedule, and despite the fact that the TumorSuppressor prediction makes it clear that G000001 could be one of a number of potentially interesting targets for research, the work from BU by reputable authors is enough to shelve interest in the gene. Dr. Project concludes that the significant result was simply spurious. In fact she is happy to see the confusing tale of an overexpressed tumor suppressor leave her desk.

It is sad, but true, that simply looking up information on the web can be as harmful as it can be useful to research. The reason the story is disappointing is that Dr. Project did not have enough information to understand what she was seeing. The in-house microarray included in its design two hybridization probes. The first, which spanned two exons, targets sequence typically expressed in transcripts from the locus. The second included sequence from far downstream, designed to target an EST which, once upon a time, appeared in a tumor sample. That latter hybridization probe yielded the intriguing overexpression data. Moreover, the second transcript targeted by the probe does not encode the domains responsible for the TumorSuppressor annotation. Instead it includes sequence from another locus, G123456. The in-house annotations could not handle more than one gene label, the annotation database being a "Gene-Centric" view of the data (considered a great improvement at the time of its establishment). Too easily, an important observation has been lost completely: the increase in a transcript which resulted from a critical chromosomal deletion has led to lack of tumor suppression. Such missed opportunities likely occur quite often.

A more clear-minded approach requires that we recognize the hierarchies of information. The locus encodes proteins of the type TumorSuppressor. Specific transcripts may or may not encode a specific domain. A sequence may map well or poorly to a genome, may cover sequence that is annotated as more than one gene, and typically covers more than one protein-domain encoding sequence. However, the data available at our mythical research institution and our mythical university all failed to make use of the most important information, the specifics of the primary biological reagents: in this case, nucleotide sequence.
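To make the contrast concrete, here is a minimal sketch in Python of the two views of Dr. Project's in-house measurements. The fold changes and gene labels come from the story above; the probe identifiers, field names, and function names are hypothetical, invented only for illustration, and the p-values are stored as the upper bounds quoted in the report.

```python
from collections import defaultdict
from dataclasses import dataclass


@dataclass(frozen=True)
class ProbeMeasurement:
    probe_id: str           # hypothetical identifier for the hybridization probe
    gene_labels: tuple      # every locus the probe's target transcript touches
    target: str             # what the probe was designed against
    fold_change: float      # tumor relative to control
    p_upper_bound: float    # the report gives only "p < value"


# The two in-house measurements from the story (identifiers are made up).
MEASUREMENTS = [
    ProbeMeasurement("probe_1", ("G000001",),
                     "two exons typical of the locus", 1.4, 0.2),
    ProbeMeasurement("probe_2", ("G000001", "G123456"),
                     "downstream EST seen in a tumor sample", 4.8, 0.001),
]


def gene_centric_view(rows):
    """Keep a single gene label per row, as the in-house database did."""
    view = defaultdict(list)
    for row in rows:
        view[row.gene_labels[0]].append(row.fold_change)
    return dict(view)


def reagent_centric_view(rows):
    """Key each observation by its reagent; gene labels become annotations."""
    return {row.probe_id: row for row in rows}


def probes_touching(rows, gene_label):
    """Retrieve by gene label without discarding the reagent detail."""
    return [row.probe_id for row in rows if gene_label in row.gene_labels]
```

In the gene-centric view the two results appear as conflicting numbers under one label, G000001, and the link to G123456 is gone. In the reagent-centric view the same retrieval by gene label still works, but each number stays attached to the probe that produced it.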

We are experiencing a burgeoning of variety in biological data being shared on the web. At the same time, the quantity of data is increasing rapidly. Beware: proteomics data will be far more difficult to process and relate to other data than the transcript data in the fairytale example. Genomic mapping, when you have a good handle on the art, provides a framework in which the hierarchies and other relationships between observations make biological sense. Standardizing data formats will help. Standardizing data content at the same time is critical. Ensuring the validity of scientific data will never be as straightforward as validating an HTML document. However, the containers and how they are populated across a data release can be summarized.

So, this paper is a plea to create standards and recommendations for data content with as much fervor as one does for data formats. I will end with a few hints as to what could help standardize data sharing for the life sciences.

Always include fields for the primary biological reagent
- What type is it (DNA, antibody, small molecule)?
- Where is it (bound to a surface, bound to a bead, bound to a histological sample, in solution)?
- Is it labeled? If so, how (dye, radioactive isotope)?
- What is its specificity of interactions (DNA sequence, antigens, known IC50 values)?
Always include the species (genome)
- Whenever possible, refer to mapping on standard genome sequence, with full version information.
- Note aberrant genome information conspicuously (SNPs, deletions, insertions), as these will make sequence handling more difficult for algorithms and investigators.
Avoid annotating with poorly behaved categories alone. (Although providing the gene symbol may be useful, such a bland label should not be provided without, for example, the sequences of the PCR primers used, in the case of TaqMan reactions, or microarray hybridization probe sequences.)
Create validation tools which check data, and alert the publisher (and user) to overall properties of the published data, such as the fact that Gene Symbol is a populated tag, but that no Primary Reagent tag is found.
Always version data releases.
- Never replace data, even errors. Keep URLs to erroneous information intact, adding a deprecation notice and, one can hope, a reference to better data.
- Include version information in data, not just with data.
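As an illustration of the validation hint above, the following Python sketch summarizes how the containers in a data release are populated and flags records that carry a gene symbol without a primary reagent. It is only a sketch under stated assumptions: the field names (`gene_symbol`, `primary_reagent`, `species`, `genome_build`, `release_version`) and the example records are hypothetical, not part of any existing standard, and the reagent sequence shown is a placeholder.

```python
# Hypothetical field names; a real standard would define its own tags.
REQUIRED_FIELDS = ("primary_reagent", "species", "genome_build", "release_version")


def is_populated(record, field):
    """True if the field is present and holds a non-empty value."""
    value = record.get(field)
    if value is None:
        return False
    if isinstance(value, str):
        return bool(value.strip())
    return bool(value)


def summarize_release(records):
    """Count populated fields across a release and collect warnings."""
    total = len(records)
    fields = ("gene_symbol",) + REQUIRED_FIELDS
    counts = {f: sum(is_populated(r, f) for r in records) for f in fields}

    warnings = []
    for field in REQUIRED_FIELDS:
        if counts[field] < total:
            warnings.append(
                f"{field}: populated in {counts[field]} of {total} records")

    # The specific problem from the story: a bland label with no reagent.
    label_only = sum(
        is_populated(r, "gene_symbol") and not is_populated(r, "primary_reagent")
        for r in records)
    if label_only:
        warnings.append(
            f"{label_only} record(s) carry a gene symbol but no primary reagent")
    return counts, warnings


# Two made-up records; all values are placeholders for illustration.
EXAMPLE_RELEASE = [
    {"gene_symbol": "G000001", "primary_reagent": "",
     "species": "example species", "genome_build": "",
     "release_version": "1.0"},
    {"gene_symbol": "G000001",
     "primary_reagent": {"type": "DNA", "sequence": "ACGTACGT"},
     "species": "example species", "genome_build": "example-build-1",
     "release_version": "1.0"},
]

if __name__ == "__main__":
    counts, warnings = summarize_release(EXAMPLE_RELEASE)
    for warning in warnings:
        print("WARNING:", warning)
```

A check of this kind says nothing about whether the science is valid; it only reports, for publisher and user alike, which containers in the release were actually filled.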

Hugh Salamon, PhD

Dr. Salamon leads Computational Biology at Berlex Biosciences in Richmond, California.

hugh_salamon@berlex.com