HCLSIG/LODD/Interlinking/Metadata

From W3C Wiki

Metadata on links between datasets

Use Case: LODD

Given the number of data items to be interlinked on the Web of Data, an automatic way to create these interlinks at scale is highly desirable. However, without information about which tools generated the links, when the links were generated, and what mapping scores were used to filter the mapping results, users have less confidence in automatically generated interlinks. There is therefore a need for a vocabulary that provides metadata about links created by automatic interlink discovery tools between any two or more data sets on the Web.

Link metadata that is useful to know includes:

  • linkage tool used to create the link
  • linkage tool version
  • link confidence as a number (percentage) (in Silk this would be the generated similarity value)
  • rule set used to create the link (at least a link, not necessarily a version number)
  • the person or organization that created the links
  • previous versions of links

Such metadata could be useful for queries like:

  • search for the confidence value for the interlinking between two given data items
  • TODO: more query examples
  • See also [1]

The following existing vocabularies can be used to describe some of this metadata.

voiD (Vocabulary of Interlinked Datasets)

A collection of links is regarded as a Dataset in voiD. You can provide some basic metadata including provenance information about this dataset using voiD and DC terms. For example, to describe the set of links we created for Drugbank and RDF-TCM, we can have:

<http://purl.org/net/tcm/id/linkset/11>
    rdf:type            void:Linkset ;
    void:target         :drugbank ;
    void:target         :rdftcm ;
    void:linkPredicate  owl:sameAs ;
    dcterms:creator     <http://www.anjeve.de/foaf.rdf> ;
    dcterms:created     "2009-06-30"^^xsd:date ;
    dcterms:replaces    <http://purl.org/net/tcm/id/linkset/10> .


Provenance Vocabulary

Currently, the Provenance Vocabulary can be used to describe provenance information about a data item, such as a link between data items.

For example, to describe a gene from RDF-TCM and how it was created by querying a SPARQL endpoint, we can write the following:

<http://purl.org/net/tcm/tcm.lifescience.ntu.edu.tw/data/gene/MAPT>
    prv:createdBy [
        rdf:type         prv:DataCreation ;
        prv:performedAt  "2009-10-15T15:44:12.845Z"^^xsd:dateTime ;
        # The data item was created from the result of a query.
        prv:usedData [
            rdf:type  prvTypes:QueryResult ;
            rdf:type  <http://www.w3.org/2004/03/trix/rdfg-1/Graph> ;
            prv:retrievedBy [
                rdf:type         prv:DataAccess ;
                prv:performedAt  "2009-10-15T15:44:12.845Z"^^xsd:dateTime
            ] ;
            # The query result was created by executing a SPARQL query.
            prv:createdBy [
                rdf:type  prvTypes:QueryExecution ;
                prv:performedBy [
                    rdf:type      prvTypes:DataCreatingService ;
                    prv:usedData  <http://hcls.deri.org/sparql/> ;
                    prv:usedGuideline [
                        rdf:type    prv:DataItem ;
                        rdf:type    prvTypes:SPARQLQuery ;
                        rdf:type    <http://spinrdf.org/sp#Describe> ;
                        rdfs:label  "DESCRIBE <http://purl.org/net/tcm/tcm.lifescience.ntu.edu.tw/id/gene/MAPT>"
                    ]
                ]
            ]
        ]
    ] .


However, we still need to decide how to represent a link as a data item: either with RDF reification or with OWL 2 annotation properties.
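To make the two options concrete, here is a minimal sketch of each, using the owl:sameAs link between a Drugbank target and an RDF-TCM gene and the linkset URI from the voiD example. The choice of dcterms:isPartOf as the attached metadata is an assumption made for illustration; neither option has been adopted here. Note that for the OWL 2 variant, the metadata property would also have to be declared as an owl:AnnotationProperty.

```turtle
@prefix rdf:     <http://www.w3.org/1999/02/22-rdf-syntax-ns#> .
@prefix owl:     <http://www.w3.org/2002/07/owl#> .
@prefix dcterms: <http://purl.org/dc/terms/> .

# Option 1: RDF reification. The link becomes a resource of type
# rdf:Statement, and metadata is attached to that resource.
[] rdf:type       rdf:Statement ;
   rdf:subject    <http://www4.wiwiss.fu-berlin.de/drugbank/resource/targets/359> ;
   rdf:predicate  owl:sameAs ;
   rdf:object     <http://purl.org/net/tcm/tcm.lifescience.ntu.edu.tw/id/gene/TYMS> ;
   dcterms:isPartOf <http://purl.org/net/tcm/id/linkset/11> .   # illustrative metadata

# Option 2: OWL 2 axiom annotation. The same link is described as an
# owl:Axiom, and the metadata is carried by an annotation property.
dcterms:isPartOf rdf:type owl:AnnotationProperty .

[] rdf:type               owl:Axiom ;
   owl:annotatedSource    <http://www4.wiwiss.fu-berlin.de/drugbank/resource/targets/359> ;
   owl:annotatedProperty  owl:sameAs ;
   owl:annotatedTarget    <http://purl.org/net/tcm/tcm.lifescience.ntu.edu.tw/id/gene/TYMS> ;
   dcterms:isPartOf       <http://purl.org/net/tcm/id/linkset/11> .   # illustrative metadata
```

In both cases the description does not assert the link itself; the plain owl:sameAs triple would still need to be published alongside it.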

Oddlinker Vocabulary

The following structure can be used to provide metadata about a link between two data items:

@prefix oddlinker: <http://data.linkedmdb.org/resource/oddlinker/> .

<http://purl.org/net/tcm/id/linkage_run/11>
    oddlinker:linkage_date    "2009-07-31"^^xsd:date ;
    oddlinker:linkage_method  :silk ;
    rdf:type                  oddlinker:linkage_run .

:silk  foaf:homepage  <http://www4.wiwiss.fu-berlin.de/bizer/silk/> .

<http://purl.org/net/tcm/id/interlink/1>
    oddlinker:link_source  <http://www4.wiwiss.fu-berlin.de/drugbank/resource/targets/359> ;
    oddlinker:link_target  <http://purl.org/net/tcm/tcm.lifescience.ntu.edu.tw/id/gene/TYMS> ;
    oddlinker:link_type    owl:sameAs ;
    oddlinker:linkage_run  <http://purl.org/net/tcm/id/linkage_run/11> ;
    dcterms:isPartOf       <http://purl.org/net/tcm/id/linkset/11> ;
    rdf:type               oddlinker:interlink .


However, oddlinker cannot express the confidence of an interlink, and we still need to obtain official documentation for the oddlinker vocabulary.
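One way to fill the confidence gap would be to attach an additional property to the interlink resource. The sketch below is purely hypothetical: the ex: namespace, the ex:confidence property, and the value 0.95 are all invented for illustration and are not part of oddlinker or any other vocabulary mentioned on this page.

```turtle
@prefix xsd: <http://www.w3.org/2001/XMLSchema#> .
@prefix ex:  <http://example.org/linkmeta#> .   # hypothetical namespace

# Hypothetical: record the similarity value produced by the linkage tool
# (in Silk, the generated similarity value) on the interlink resource.
<http://purl.org/net/tcm/id/interlink/1>
    ex:confidence "0.95"^^xsd:decimal .   # made-up value, for illustration only
```

Because the interlink is already a resource with its own URI, a property like this could be added without changing the existing oddlinker description.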