Use Case Metadata Merging

Name

Metadata Merging

Owner

Kai Eckert

Provenance Dimensions

  • Primary: Attribution
  • Secondary: Understanding, Trust

Background and Current Practice

This use case is taken from the DC-09 conference article. Please refer to that paper for more details.

Libraries have to deal with metadata from various sources. Usually the data is simply transformed to a common internal format (see Use_Case_Crosswalk_Maintenance), but sometimes several records from different sources describe the same resource. In this case, one either has to choose a single source or merge the metadata records.

A specific example is the subject information for a given resource, which can be provided by various means (created manually or automatically). While manually created subject headings would generally be preferred, it is nevertheless desirable to also store subject headings from other sources and make them accessible.

It is important that there are no compromises regarding the quality of the resulting metadata.

Goal

The goal is to prevent information loss when merging metadata from different sources. To achieve this, provenance information has to be provided at the statement level.

Use Case Scenario

In addition to the metadata itself, we store the following information for every element:

  • the source used, as well as some characteristics of the source (e.g. automatic or manual indexing)
  • the rank for the subject heading, if one is given by the source

Note that this information is specific to subject headings and might be extended for other metadata fields or applications. With this information, at least the following advanced queries can be supported:

  • Merging annotation sets: Without the provenance information, we just get the union of all statements. By making use of it, we can recover the metadata statements of a specific source.
  • Extended queries on the merged annotations: We can query the data by certain criteria, such as using only manually created statements, or only statements whose rank is higher than a given threshold.
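
The two kinds of query above can be illustrated with a minimal Python sketch. It is not the implementation from the DC-09 article; the class `SubjectStatement`, the helper functions, the source names (`library_catalog`, `auto_indexer`), and the sample data are all hypothetical and only serve to show how statement-level provenance keeps the merged data queryable.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class SubjectStatement:
    """One subject heading plus its provenance (hypothetical model)."""
    resource: str                 # identifier of the described resource
    subject: str                  # the subject heading assigned to it
    source: str                   # which source supplied the statement
    manual: bool                  # True for manual, False for automatic indexing
    rank: Optional[float] = None  # rank given by the source, if any


def merge(*record_sets):
    """Concatenate statement sets; provenance travels with each statement."""
    merged = []
    for records in record_sets:
        merged.extend(records)
    return merged


def plain_union(statements):
    """What remains if provenance is dropped: bare (resource, subject) pairs."""
    return {(s.resource, s.subject) for s in statements}


def from_source(statements, source):
    """Recover the statements contributed by one source."""
    return [s for s in statements if s.source == source]


def manual_only(statements):
    """Keep only manually created statements."""
    return [s for s in statements if s.manual]


def ranked_above(statements, threshold):
    """Keep statements whose source-given rank exceeds the threshold."""
    return [s for s in statements if s.rank is not None and s.rank > threshold]


# Hypothetical sample data: one manual and one automatic source.
library = [
    SubjectStatement("doc:1", "Metadata", "library_catalog", manual=True),
    SubjectStatement("doc:1", "Provenance", "library_catalog", manual=True),
]
auto = [
    SubjectStatement("doc:1", "Metadata", "auto_indexer", manual=False, rank=0.9),
    SubjectStatement("doc:1", "Semantic Web", "auto_indexer", manual=False, rank=0.4),
]

merged = merge(library, auto)
```

With this data, `plain_union(merged)` contains three pairs, because the heading "Metadata" was supplied by both sources and the duplicate collapses. `from_source(merged, "auto_indexer")` still returns both automatic statements, and `ranked_above(merged, 0.5)` keeps only the automatic "Metadata" statement.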

Problems and Limitations

This use case requires provenance at the statement level, which has to be supported by the underlying infrastructure. RDF offers two mechanisms that support this: reification and named graphs.
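
To make the difference between the two mechanisms concrete, here is a rough sketch that models RDF statements as plain Python tuples rather than using an RDF library. The terms `rdf:Statement`, `rdf:subject`, `rdf:predicate`, and `rdf:object` are the standard RDF reification vocabulary; the property `ex:source`, all `ex:` identifiers, and the helper functions are assumptions made up for this illustration.

```python
def reify(statement_id, triple, source):
    """Describe one triple as an rdf:Statement and attach its source.

    Note: these triples only describe the statement; they do not
    assert the original triple itself.
    """
    s, p, o = triple
    return [
        (statement_id, "rdf:type", "rdf:Statement"),
        (statement_id, "rdf:subject", s),
        (statement_id, "rdf:predicate", p),
        (statement_id, "rdf:object", o),
        (statement_id, "ex:source", source),  # hypothetical provenance property
    ]


def in_named_graph(graph_name, triples):
    """Turn triples into quads by adding the graph name as a fourth element."""
    return [(s, p, o, graph_name) for (s, p, o) in triples]


# Hypothetical statement: a subject heading assigned by an automatic indexer.
triple = ("ex:doc1", "dc:subject", "ex:Metadata")

# Reification: provenance is attached to each statement individually.
reified = reify("ex:stmt1", triple, "ex:autoIndexer")

# Named graphs: provenance is attached once, to the graph holding the statements.
quads = in_named_graph("ex:autoIndexerGraph", [triple])
graph_provenance = [("ex:autoIndexerGraph", "ex:source", "ex:autoIndexer")]
```

In this sketch, reification needs five triples for every annotated statement, whereas the named-graph variant records the source once for all statements in the same graph.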

If the user is supposed to use this additional information, the retrieval interface has to be adapted so that the user can select between the different sources. But it is also possible to hide this from the user and use the data only internally.

Unanticipated Uses

Other use cases that require provenance at the statement level, such as Use_Case_Crosswalk_Maintenance.

Existing Work

Working examples using RDF reification can be found in the DC-09 conference article.