DCAT MetaShare Mapping

From Linked Data for Language Technology Community Group

Introduction

The META-SHARE schema [1] defines metadata for language resources to improve their entry, indexing and search across language resource catalogues. Mapping this schema to an RDF vocabulary offers an opportunity to use the W3C Data Catalog Vocabulary (DCAT) Recommendation, which fulfils a similar metadata description function for the broader population of datasets published on the web.

Motivation

Mapping

With reference to the MetaShare spreadsheet working document, MetaShare vocabulary elements can be mapped to the high-level DCAT model as follows.

Make LanguageResource a DCAT Dataset

The core of the mapping is to make the ms:LanguageResource class a subclass of dcat:Dataset.

Comment [PL]: The ms:LanguageResource subsumes datasets (corpora, lexical/conceptual resources and language descriptions) and tools/services used for their processing. So, it might be better to use the dcat:Dataset as a subclass of ms:LanguageResource and have ms:corpus etc. as subclasses of dataset. The only problem with this might be if we cannot add subclasses to a class coming from a different vocabulary.
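As a minimal sketch of what the main proposal entails (ms:LanguageResource as a subclass of dcat:Dataset), the following plain-Python example applies the RDFS subclass rule to a small set of triples. The resource ex:myCorpus and the helper infer_types are hypothetical and only serve the illustration; prefixed names are kept as plain strings rather than full URIs. The alternative raised in the comment above would simply reverse the direction of the rdfs:subClassOf triple.

```python
RDF_TYPE = "rdf:type"
SUBCLASS_OF = "rdfs:subClassOf"


def infer_types(triples):
    """Return the triples plus every rdf:type triple entailed by rdfs:subClassOf."""
    triples = set(triples)
    changed = True
    while changed:
        changed = False
        subclass_pairs = {(s, o) for s, p, o in triples if p == SUBCLASS_OF}
        for s, p, o in list(triples):
            if p != RDF_TYPE:
                continue
            for sub, sup in subclass_pairs:
                if o == sub and (s, RDF_TYPE, sup) not in triples:
                    triples.add((s, RDF_TYPE, sup))
                    changed = True
    return triples


# The core mapping axiom proposed in this section.
schema = [("ms:LanguageResource", SUBCLASS_OF, "dcat:Dataset")]

# A hypothetical language resource described with the MetaShare class only.
data = [("ex:myCorpus", RDF_TYPE, "ms:LanguageResource")]

entailed = infer_types(schema + data)
```

Under this axiom, any DCAT-aware catalogue that applies RDFS inference would also see ex:myCorpus as a dcat:Dataset, which is what lets the dataset-level properties be reused without loss.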

This allows the following properties of ms:LanguageResource to be mapped to the following standard dcat or dcterms properties of dcat:Dataset with no loss of data values:

Comment [PL]: There's an ongoing discussion for language; I think we should have the same treatment for language overall

Comment [PL]: It might be better to use a relation such as owl:sameAs for this mapping, since the ms:resourceShortName can also be mapped to the same element, and it is better to keep them distinct. Note also that at this point, this should be a property both of the dataset and the distribution.

Comment [PL]: There are also other properties that can be mapped from MS to dcat, as follows:

Separate LanguageResource metadata from metadata of its Accessible Form

DCAT distinguishes between the Dataset itself and its accessible forms, e.g. a downloadable file published on the web. This is provided by the property dcat:distribution with domain dcat:Dataset and range of Class dcat:Distribution.
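The dataset/distribution split can be illustrated with a small sketch, again using plain tuples with prefixed names as strings. All ex: identifiers, the media type values and the helper distributions_of are assumptions made for the example; only dcat:Dataset, dcat:Distribution, dcat:distribution and dcat:mediaType come from DCAT.

```python
# One dataset published as two distributions in different formats.
triples = [
    ("ex:myCorpus", "rdf:type", "dcat:Dataset"),
    ("ex:myCorpus", "dcat:distribution", "ex:myCorpus-xml"),
    ("ex:myCorpus", "dcat:distribution", "ex:myCorpus-txt"),
    ("ex:myCorpus-xml", "rdf:type", "dcat:Distribution"),
    ("ex:myCorpus-xml", "dcat:mediaType", "application/xml"),
    ("ex:myCorpus-txt", "rdf:type", "dcat:Distribution"),
    ("ex:myCorpus-txt", "dcat:mediaType", "text/plain"),
]


def distributions_of(triples, dataset):
    """List the distributions linked from a dataset via dcat:distribution."""
    return sorted(
        o for s, p, o in triples
        if s == dataset and p == "dcat:distribution"
    )


found = distributions_of(triples, "ex:myCorpus")
```

Descriptive metadata stays on ex:myCorpus, while format-specific details hang off each distribution.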

This allows a data source to be published in multiple places, to be published in different formats and to be published under different license conditions. Metashare currently does not support this distinction between the dataset and its distributions.

Comment [PL]: In fact, MS already supports such a distinction but only as regards licensing conditions and accessibility mode (e.g. downloadable vs. accessed via an interface). What is needed is therefore some rearrangement of certain elements to be more compatible with DCAT.

To include this would involve the following:

  • introduce a new class ms:LanguageResourceDistribution as a subclass of dcat:Distribution, and then use the dcat:distribution property to associate different distributions with a LanguageResource dataset.

Comment [PL]: in the XSD implementation, this is the distributionInfo component; otherwise, ok.

  • ms:url can be removed as a property of ms:LanguageResource and replaced by dcat:downloadURL property for ms:LanguageResourceDistribution.

Comment [PL]: see above for ms:url. The dcat:downloadURL property for ms:LanguageResourceDistribution should replace the ms:downloadLocation.

  • ms:size can be removed as a property of ms:LanguageResource and replaced by dcat:byteSize property for ms:LanguageResourceDistribution, provided we standardise on byte being the only (and therefore assumed) value of ms:sizeUnit.

Comment [PL]: size should be further discussed; size in bytes can be automatically provided for resources, but the tendency is to use other measurements (e.g. sentences, words, n-grams etc.) for language resources which are more meaningful.

  • ms:mimeType can be removed as a property of ms:LanguageResource and replaced by the dct:format or dcat:mediaType property for ms:LanguageResourceDistribution.
  • ms:license can be removed as a property of ms:LanguageResource and replaced by dct:license or dct:rights property for ms:LanguageResourceDistribution.

Comment [PL]: This should be a pointer to the license module that is currently discussed. In the XSD implementation, it is included in the distributionInfo.

  • ms:availabilityStartDate can be removed as a property of ms:LanguageResource and replaced by dct:issued property for ms:LanguageResourceDistribution.

Comment [PL]: This is not the intended meaning of ms:availabilityStartDate; in MS, this is coupled with ms:availabilityEndDate and defines when (or up to when) a resource can be made available. I'm looking more closely at the release/issue dates etc. to find a more appropriate mapping.

Comment [PL]: There are also other properties that can be mapped from MS to dcat, as follows:
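The rearrangement proposed in the bullet list of this section can be sketched as a function that splits a flat MetaShare-style record into a dataset description and a distribution description. This is only an illustration under stated assumptions: the record layout, the function split_record, the "-dist" naming scheme and the choice of dcat:mediaType and dct:license (rather than dct:format or dct:rights) are hypothetical. ms:availabilityStartDate is deliberately left at dataset level because its mapping is disputed in the comments, and ms:size is moved only when ms:sizeUnit is byte, following the proviso in the list.

```python
# MS property -> distribution-level property, as proposed in the list.
MOVES = {
    "ms:url": "dcat:downloadURL",
    "ms:size": "dcat:byteSize",
    "ms:mimeType": "dcat:mediaType",
    "ms:license": "dct:license",
}


def split_record(record, dataset_id):
    """Split a flat record into (dataset, distribution) property dicts."""
    dist_id = dataset_id + "-dist"  # hypothetical naming scheme
    dataset = {"rdf:type": "ms:LanguageResource", "dcat:distribution": dist_id}
    distribution = {"rdf:type": "ms:LanguageResourceDistribution"}
    size_in_bytes = record.get("ms:sizeUnit") == "byte"

    for prop, value in record.items():
        if prop == "ms:size" and not size_in_bytes:
            # Sizes in words, sentences etc. are not byte sizes; leave them as-is.
            dataset[prop] = value
        elif prop == "ms:sizeUnit":
            # The unit is implied by dcat:byteSize, so keep it only for non-byte sizes.
            if not size_in_bytes:
                dataset[prop] = value
        elif prop in MOVES:
            distribution[MOVES[prop]] = value
        else:
            dataset[prop] = value
    return dataset, distribution


record = {
    "ms:resourceName": "My Corpus",
    "ms:url": "http://example.org/corpus.zip",
    "ms:size": 1024,
    "ms:sizeUnit": "byte",
    "ms:mimeType": "application/zip",
}
dataset, distribution = split_record(record, "ex:myCorpus")
```

A record whose size is given in words or sentences keeps ms:size and ms:sizeUnit on the dataset, which leaves room for the more meaningful size measures mentioned in the comments.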

References

[1] META-SHARE XML Schema on github, latest version (3.0): https://github.com/metashare/META-SHARE/tree/master/misc/schema/v3.0