Best Practices: Mapping Experimental Data to RDF

**Material has been moved to GoogleDoc, please email Susie to request access**

Much data has been made available in a Linked Data representation. Many of these data sources are effectively catalogs of information about particular entities and are therefore fairly straightforward to convert into RDF. For example, DrugBank is a catalog of information about drugs, and ClinicalTrials.gov is a catalog of information about ongoing clinical trials.

The aim of this work is to generate recommendations on how to make more complex experimental data sets available as Linked Data. The chosen approach should simplify querying of an individual data set, federated querying, and aggregation of data across data sources. It should also ease the integration of the data set with related Linked Data. Finally, the recommendations should describe a scalable approach and a workflow.

The data set that this work will focus on is ADNI (Alzheimer's Disease Neuroimaging Initiative). This data set includes information about people who are either normal, have mild cognitive impairment, have Alzheimer's Disease, or have converted from one group to another. Typically, certain background information is collected for each patient, e.g. gender, age, and SNP profile. Additional data is then collected during each visit, e.g. CSF levels of tau and abeta, results of ADAS COG and mini mental cognition tests, and MRI scans. A second data source is still needed, and questions that span data sources need to be identified.

ADNI data is made available to interested researchers in spreadsheets.

The goal of this work is to address the following questions:

1. Are some relational schemas easier to map to RDF than others?

D2R provides an automated process to generate the mapping file, which converts every table into a class. This approach did not yield satisfactory results for a database with a normalized schema, largely because Third Normal Form modeling seeks to eliminate data redundancies, not to reflect real-world objects such as patients or medical images.

In dimensional modeling, a logical design technique for data warehouses, data are grouped into coherent categories that more closely mimic reality. This makes the mapping of dimensional representations to RDF classes more straightforward, and enables the default D2R mapping process to yield better results. Further, hierarchies in the dimension tables may help to indicate RDF classes and their relationships.

If table column headers are converted to classes, instead of the default literal values, they can be used to link to external ontologies.

The depivoted format of the fact tables can be converted to RDF. Occasionally, a column in the fact table may contain values that can be used as predicates. In this case, using a d2rq:dynamicProperty may be sufficient to define all properties for the fact table at once. The mapping becomes independent of the properties listed in the fact table, and remains valid as new rows are added to the table.
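
The idea of a data-driven predicate can be sketched outside D2R. The following is a minimal Python illustration, not D2R mapping syntax: the fact-table rows, the column layout, and the `http://example.org/adni/` namespace are all assumptions made for the example. It shows why a mapping of this kind stays valid when rows with new measurement names are added.

```python
# Hypothetical depivoted fact-table rows: (patient_id, visit, measure, value).
FACT_ROWS = [
    ("P001", "baseline", "tau", 85.2),
    ("P001", "baseline", "abeta", 140.0),
    ("P002", "month6", "tau", 91.7),
]

BASE = "http://example.org/adni/"  # assumed namespace, not from the source


def rows_to_triples(rows, base=BASE):
    """Emit one triple per fact row; the predicate is built from the 'measure' column."""
    triples = []
    for patient_id, visit, measure, value in rows:
        subject = f"{base}observation/{patient_id}/{visit}"
        predicate = f"{base}vocab/{measure}"  # data-driven, no per-measure code
        triples.append((subject, predicate, value))
    return triples
```

Because the predicate is computed from the row, a row with a previously unseen measure (say, a hypothetical `ptau`) produces a new property without any change to the function.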

Forming ExperimentalResult classes on a pivoted table requires that column names of the table are parsed and mapped to literal values or URIs. Where a D2R mapping would normally create an instance for every table row, this use case requires the mapping to create a new instance for every table cell (for selected columns). This is equivalent to depivoting the table before applying the mapping. The D2R release we used did not have this functionality. Consequently, we did the depivoting operation in the database instead.
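
The depivoting step itself is simple to express. The sketch below is in Python rather than SQL and is only an illustration of the operation, not the code that was run in the database; the column names and sample values are hypothetical.

```python
def depivot(rows, id_columns, value_columns):
    """Return one record per (row, selected column) cell of a pivoted table."""
    long_rows = []
    for row in rows:
        for column in value_columns:
            if row.get(column) is None:
                continue  # an empty cell yields no result record
            record = {key: row[key] for key in id_columns}
            record["measure"] = column  # former column name becomes data
            record["value"] = row[column]
            long_rows.append(record)
    return long_rows


# Hypothetical pivoted table: one row per patient visit, one column per measure.
WIDE = [
    {"patient_id": "P001", "visit": "baseline", "tau": 85.2, "abeta": 140.0},
    {"patient_id": "P002", "visit": "baseline", "tau": 91.7, "abeta": None},
]

LONG = depivot(WIDE, ["patient_id", "visit"], ["tau", "abeta"])
```

Each record in `LONG` corresponds to one table cell, which is the granularity needed to create one ExperimentalResult instance per measurement.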

2. What are the best tools for mapping relational databases to RDF?

- The SQL Server and Oracle databases were mapped to RDF using D2R Server 0.7.

3. Is it ever OK to change a relational representation to simplify mapping to RDF?

4. What are the best practices for mapping data sources to RDF/OWL?

- The mapping to public ontologies can be handled in a number of ways by D2R. For example, volume measurements of brain regions on MRI images were linked to the gross anatomy section of NIF using lookup tables. The lookup table can be stored in the D2R file (using d2rq:TranslationTable) or in the database (and used in a d2rq:join). We prefer the latter solution, but note that this approach restricts the lookup table to being in the same database as the data. When SNPs or genes were mapped to Bio2RDF (http://bio2rdf.org/), the database values were automatically mapped to ontology classes by D2R (using d2rq:uriPattern or d2rq:uriSqlExpression).
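
The two linking styles above (a lookup table versus a URI pattern) can be illustrated with a small Python sketch. This is not D2R code: it only mimics the `@@table.column@@` placeholder convention used in D2RQ URI patterns, applies no URL escaping, and uses hypothetical table names, region codes, and target URIs.

```python
import re

# D2RQ-style placeholder of the form @@table.column@@.
PLACEHOLDER = re.compile(r"@@(\w+)\.(\w+)@@")

# Hypothetical lookup table: database region codes -> ontology URIs.
REGION_TO_URI = {
    "HIPPO_L": "http://example.org/anatomy/left-hippocampus",
    "HIPPO_R": "http://example.org/anatomy/right-hippocampus",
}


def expand_uri_pattern(pattern, row):
    """Replace each @@table.column@@ placeholder with the matching row value."""
    def substitute(match):
        table, column = match.groups()
        return str(row[f"{table}.{column}"])
    return PLACEHOLDER.sub(substitute, pattern)


def translate(value, table=REGION_TO_URI):
    """Look up a database value; return None when it has no mapping."""
    return table.get(value)
```

For example, `expand_uri_pattern("http://bio2rdf.org/dbsnp:@@snp.rs_id@@", {"snp.rs_id": "rs429358"})` yields `http://bio2rdf.org/dbsnp:rs429358`, where the `snp` table and the `dbsnp:` prefix are assumptions made for illustration. The pattern approach needs no stored table, while the lookup approach can map arbitrary local codes to ontology terms.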

To encode experimental results in RDF, the experimental conditions need to be uniquely specified. For example, to be able to correctly interpret a measured value, it needs to be clear which patient is being referred to, on which visit, and what exactly was measured. One option is to define properties for the Patient class for every type of experiment.

We decided to encode every experimental result (the measured value and the experimental conditions) in an ExperimentalResult class and link out to the corresponding Patient, Visit and Image classes.

Defining a subclass of the ExperimentalResult class for every measurement type (e.g. a ClinicalDementiaRating, a HippocampalVolume, a SystolicBloodPressure…) was impractical due to the large number of observations in the data sets.

Alternatively, the measurement type can be encoded in an ExperimentalResult property name (e.g. hasClinicalDementiaRating). Unlike subclass definitions, property definitions for a large number of measurement types do not each require separate D2R code, because a single d2rq:dynamicProperty statement can generate them. However, some experimental conditions are hard to describe in a property name and are difficult to use in queries (e.g. the property hasStandardDeviationOfCorticalThicknessOfRightTransverseTemporalCortex). Furthermore, using an owl:sameAs construct to define equivalence between these properties and public ontologies would take us into OWL Full.

We decided to use two properties to specify the experimental conditions and measurement value of an ExperimentalResult, namely hasResultType and hasValue. The ResultType class can contain multiple properties to specify the experimental conditions fully (a case report form, unit of measurement, etc.) and can be used as a bridge to public ontologies.
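
A minimal sketch of the resulting shape, written as plain Python tuples rather than actual D2R output, may help. Only `ExperimentalResult`, `hasResultType` and `hasValue` come from the text above; the namespace, the `hasPatient` and `hasVisit` link names, and the identifiers are assumptions made for illustration.

```python
BASE = "http://example.org/adni/"  # assumed namespace, not from the source
VOCAB = BASE + "vocab/"


def experimental_result_triples(result_id, patient_id, visit_id, result_type, value):
    """Describe one measurement as an ExperimentalResult with links to its context."""
    result = f"{BASE}result/{result_id}"
    return [
        (result, "rdf:type", VOCAB + "ExperimentalResult"),
        # Hypothetical link properties to the Patient and Visit instances.
        (result, VOCAB + "hasPatient", f"{BASE}patient/{patient_id}"),
        (result, VOCAB + "hasVisit", f"{BASE}visit/{visit_id}"),
        # The ResultType instance carries units, case report form, ontology links.
        (result, VOCAB + "hasResultType", f"{BASE}resulttype/{result_type}"),
        (result, VOCAB + "hasValue", value),
    ]


TRIPLES = experimental_result_triples("r1", "P001", "P001-baseline", "HippocampalVolume", 3.4)
```

With this shape, a query can select all results of one type by matching on the `hasResultType` object instead of on a long, type-specific property name.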

5. Should the resulting RDF always mirror the real world?

6. How should the RDF map to global ontologies?

- BioPortal was used to identify public ontologies that best map to the entities in the clinical data sets. Selected terms from these ontologies were linked into a Common Resource Ontology (CRO) that was loaded into an instance of the openRDF Sesame triple store.

Classes were defined in the CRO to avoid repeating class definitions for every data source. For classes available in public ontologies, the CRO builds a comprehensive representation of a domain by importing a standard set of orthogonal ontologies using the guidelines described in MIREOT.

Using an internal ontology presents some advantages:

- Scientists may have strong preferences for particular ontologies. When there is no general agreement about which ontology to use, we can include the definition of a proxy class in the CRO. The proxy can be linked to a number of public ontologies using URI aliases.
- Building a SPARQL query requires knowledge as to which ontology was selected during the mapping phase. This information can be retrieved from the CRO.
- Using Semantic MediaWiki technology, the CRO can be maintained or extended by the scientists themselves.
- As Semantic MediaWiki stores its data in RDF, it can be used as a metadata repository for data source discovery. This functionality is not well supported by SPARQL (Williams, 2010).
- Not all class definitions that were required for the mappings were available in public ontologies (e.g. subscores for the AD Assessment Scale Cognition). These definitions could be included within the CRO.

7. Does all data need to be mapped to consistent models to achieve interoperability?

8. How should metadata (reification) be handled?

9. What should be done if there are gaps in the current ontology landscape?

10. How should the URIs and namespaces be determined?

11. How can the original data be augmented with additional insights?

12. How can tools be created to utilize the linked data once it is available?

13. Which entities should be classes and which should be instances?

Explore EAV best practices and applicability to mapping relational data to RDF.
