User:Rcygania2/S14 Revision

From RDF Data Shapes Working Group

This is a proposed revision of David Martin's user story S14: Object Reconciliation. David's original text and the accompanying discussion are preserved at the end of the page.

S14: Quality assurance for object reconciliation

  • Created by: David Martin
  • Revised by: Richard Cyganiak

In data integration activities, tools such as Silk or Limes may be used to discover entity coreferences. Entity coreferences are pairs of different identifiers, often in different datasets, that refer to the same entity. Detected coreferences are often recorded as owl:sameAs triples. Coreference detection may be one step in an object reconciliation pipeline.

It would be nice if shapes could flexibly state conditions for checking that the identity of objects has been correctly recorded; that is, conditions under which a same-as link should be present between two identifiers or, conversely, conditions that reveal mis-identified same-as links.

For example (movies domain):

  • If source1.movie.title is highly similar (by some widely adopted string similarity function, perhaps plugged in through an extension interface) to source2.film.title and source1.movie.release-date.year is identical to source2.film.initial-release, then an owl:sameAs triple should be present
  • If source1.movie.title is identical to source2.film.title and source1.movie.release-date.year is within two years of source2.film.initial-release, then an owl:sameAs triple should be present
  • If source1.movie.directors has the same set of values as source2.film.directed-by and source1.movie.title is highly similar to source2.film.title, then an owl:sameAs triple should be present
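
The three example conditions can be sketched as plain predicates over pairs of records. This is only an illustration, not a proposed shapes syntax: the record layout, the identifiers, the sample data, the 0.9 threshold, and the use of Python's difflib as the "pluggable" similarity function are all assumptions made for this sketch, and "within two years" is read as a difference of at most two.

```python
from difflib import SequenceMatcher

# Hypothetical sample data; the keys mirror the property paths used in the story.
SOURCE1 = {
    "s1:m1": {"title": "Blade Runner", "year": 1982, "directors": {"Ridley Scott"}},
    "s1:m2": {"title": "Solaris", "year": 1972, "directors": {"Andrei Tarkovsky"}},
}
SOURCE2 = {
    "s2:f1": {"title": "Blade Runner", "initial_release": 1982, "directed_by": {"Ridley Scott"}},
    "s2:f2": {"title": "Solaris", "initial_release": 2002, "directed_by": {"Steven Soderbergh"}},
}

def similar(a, b, threshold=0.9):
    """Stand-in for a pluggable string similarity function (threshold is an assumption)."""
    return SequenceMatcher(None, a.lower(), b.lower()).ratio() >= threshold

def rule1(m, f):
    # Highly similar titles and identical release year.
    return similar(m["title"], f["title"]) and m["year"] == f["initial_release"]

def rule2(m, f):
    # Identical titles and release years at most two years apart.
    return m["title"] == f["title"] and abs(m["year"] - f["initial_release"]) <= 2

def rule3(m, f):
    # Same set of directors and highly similar titles.
    return m["directors"] == f["directed_by"] and similar(m["title"], f["title"])

RULES = (rule1, rule2, rule3)

def missing_same_as(source1, source2, same_as):
    """Return pairs for which some rule fires but no sameAs link is recorded."""
    missing = []
    for id1, m in source1.items():
        for id2, f in source2.items():
            expected = any(rule(m, f) for rule in RULES)
            recorded = (id1, id2) in same_as or (id2, id1) in same_as
            if expected and not recorded:
                missing.append((id1, id2))
    return missing
```

Calling missing_same_as(SOURCE1, SOURCE2, set()) reports the Blade Runner pair as a missing link, while the two Solaris records (different year and director) are not expected to be linked.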

The intent here is not that the validation process should produce the expected owl:sameAs triples. We assume that some other tool or process has already produced these triples. The purpose of these validation rules is to perform quality assurance, or sanity checks, on the output of these other tools or processes. Thus, the quality or completeness of the generated linkset could be assessed.

We note however that object reconciliation tools could be driven by constraints like those given above. So potentially, an object reconciliation tool and a validator could use the same input constraints. Thus, this story straddles the boundaries between constraint checking and inference.
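
As a rough illustration of how one set of input constraints could serve both purposes, the sketch below uses a single rule in two modes: once to generate the expected links, and once to check an existing linkset for completeness. The function names, record layout, and sample data are hypothetical, and "within two years" is again read as a difference of at most two.

```python
def same_movie(m, f):
    # One of the story's example conditions: identical titles,
    # release years at most two years apart.
    return m["title"] == f["title"] and abs(m["year"] - f["initial_release"]) <= 2

def expected_links(source1, source2, rule):
    """All identifier pairs for which the rule holds."""
    return {
        (id1, id2)
        for id1, m in source1.items()
        for id2, f in source2.items()
        if rule(m, f)
    }

def reconcile(source1, source2, rule):
    """Inference mode: produce the links the rule calls for."""
    return expected_links(source1, source2, rule)

def validate(source1, source2, rule, same_as):
    """Validation mode: report expected links absent from an existing linkset."""
    return expected_links(source1, source2, rule) - same_as

# Hypothetical sample data.
movies = {"a": {"title": "Alien", "year": 1979}}
films = {
    "x": {"title": "Alien", "initial_release": 1980},
    "y": {"title": "Aliens", "initial_release": 1986},
}
```

Here reconcile(movies, films, same_movie) yields the pair ("a", "x"), and validate returns an empty set once that pair is present in the linkset being checked.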


Below are the original text and discussion:

S14: Object Reconciliation

Created by: David Martin

As an aid in data integration activities, it would be nice if shapes could flexibly state conditions by which to check that identity of objects has been correctly recorded; that is, check conditions under which 2 objects in a KB should explicitly represent the same real-world thing. For example (movies domain), I'd like to say:

if source1.movie.title is highly similar (by some widely adopted measure, or some measure that I can plug in to a tool) to source2.film.title AND source1.movie.release-date.year is identical to source2.film.initial-release, then it should be stated that they are the same movie

OR

if source1.movie.title is identical to source2.film.title AND source1.movie.release-date.year is close (say, < 2 years difference) to source2.film.initial-release then it should be stated that they are the same movie

OR

if source1.movie.directors has the same set of values as source2.film.directed-by AND source1.movie.title is highly similar to source2.film.title then it should be stated that they are the same movie

OR ....

(HK: This story sounds more like an inferencing problem than constraint checking. CONSTRUCT { ?this owl:sameAs ?other } WHERE { ... pattern } which can be expressed using spin:rule. Fuzzy string matching like "title highly similar to another title" may require some SPARQL extension if it cannot be expressed using regex).

(DM: Good point, Holger, and I generally agree. And I'm not wedded to this particular story. But isn't the boundary between inferencing and constraint checking inevitably very blurry? I mean, the essence of this example is meant to be: if there's an object X with property P1, and an object Y with property P2, and the value of P1 is related to the value of P2 in the following way: ... then *there must be* a sameAs relation between X and Y. As with many constraints, the intent here is to check completeness. That's constraint-like, right?)

(ericP: The SPARQL semantics take as input an RDF graph, which may or may not be the result of inferencing. SPARQL 1.1 Entailment Regimes identify certain ways of acquiring an input graph but the semantics are still defined in terms of a graph. I propose that we define validation/verification as an operation over a graph in order to let people plug in what suits their purpose.)

Tthibodeau: This seems a valid user story, about blending data from 2 sources, in testing for whether the 2 sources are making statements about the same entity (or not). Istanbul vs Constantinople -- have different names, but same Geo Coordinates, etc. That might be the same for my purposes... or maybe the label matters in my shape.