Warning:
This wiki has been archived and is now read-only.

Guidance on the Provision of Metadata

From Data on the Web Best Practices

Intro

This section of the Data on the Web Best Practices document will include best practices for the provision of metadata.

The Data on the Web ecosystem has an underlying architecture that involves actors with three main roles: data Publisher, data Consumer and data Broker. The Broker holds information that can help the Consumer to find, access and process data published by the Publisher. Published data is a central entity in this ecosystem. One way of helping the Consumer to execute these tasks is to provide data about data: metadata. Metadata may contain information about content, provenance, licensing, data quality, data access (URI schemes, APIs), content semantics, etc.

As an example of a Broker, consider CKAN, a data management system that makes data accessible by providing tools to streamline publishing, sharing, finding and using data; it is used by data.gov, dados.gov.br and others. CKAN holds metadata (provided by Publishers) that helps Consumers to find data. CKAN is a registry and can also serve as a repository for the data to be consumed. At the same time, data published in CKAN implementations can come in multiple formats, such as CSV. Once a Consumer chooses some data to use from a Publisher, she needs another kind of metadata to understand how to access the data and its semantics. Taking the CSV example, the W3C CSV on the Web Working Group has the mission of providing technologies whereby data-dependent applications on the Web can achieve higher interoperability when working with datasets in CSV (Comma-Separated Values) or similar formats.
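To make the CSV example concrete, the sketch below builds a minimal CSV on the Web (CSVW) metadata document describing a hypothetical file. The file name, column names, datatypes and licence are invented for illustration; only the overall shape (`@context`, `url`, `tableSchema`) follows the CSVW metadata vocabulary.

```python
import json

# A minimal sketch of a CSVW metadata document for a hypothetical
# "population.csv". All values below are illustrative assumptions.
csvw_metadata = {
    "@context": "http://www.w3.org/ns/csvw",
    "url": "population.csv",
    "dc:title": "Population by country",
    "dc:license": "http://creativecommons.org/licenses/by/4.0/",
    "tableSchema": {
        "columns": [
            {"name": "country", "titles": "Country", "datatype": "string"},
            {"name": "year", "titles": "Year", "datatype": "gYear"},
            {"name": "population", "titles": "Population", "datatype": "integer"},
        ]
    },
}

print(json.dumps(csvw_metadata, indent=2))
```

A metadata file like this, published alongside the CSV itself, is what lets a Consumer's application interpret each column without guessing.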

Considering the different tasks that can be executed in the Data on the Web ecosystem, we can categorize metadata in different types:

  • Metadata Types for Search
    • Content Description
    • Provenance
    • License
    • Revenue
    • Credentials
    • Quality / Metrics
    • Release Schedule
    • Data Format
    • Data Access
  • Metadata Types for Use
    • URI Design Principles
    • Machine Access to Data
    • API specification
    • Format Specification

Metadata can be provided in two forms: human-readable and machine-readable. It is important to provide both forms of metadata in order to reach humans and applications. In the case of machine-readable metadata, the use of standard vocabularies should be encouraged as a way of enhancing common semantics. For example, data provenance could be described using PROV-O, a W3C Recommendation that provides a set of classes, properties, and restrictions that can be used to represent and interchange provenance information generated in different systems and under different contexts.
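As a sketch of what machine-readable provenance with PROV-O might look like, the snippet below builds a small JSON-LD description linking a dataset to the agent and activity that produced it. The dataset URI, agent and activity are hypothetical; only the PROV-O class and property names (`prov:Entity`, `prov:wasAttributedTo`, `prov:wasGeneratedBy`) come from the vocabulary.

```python
import json

# Hypothetical provenance for a dataset, expressed with PROV-O terms
# serialised as JSON-LD. URIs are invented for illustration.
provenance = {
    "@context": {"prov": "http://www.w3.org/ns/prov#"},
    "@id": "http://example.org/dataset/1",
    "@type": "prov:Entity",
    "prov:wasAttributedTo": {
        "@id": "http://example.org/agency",
        "@type": "prov:Agent",
    },
    "prov:wasGeneratedBy": {
        "@id": "http://example.org/activity/census-2010",
        "@type": "prov:Activity",
    },
}

print(json.dumps(provenance, indent=2))
```

Because the same PROV-O terms are used across systems, a Consumer's tooling can merge provenance produced by different Publishers.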

In the next sections, each one of the metadata types will be treated in more detail.

Metadata Types for Search

VoID and DCAT cover most of the categories below, although they could of course be extended.

The LOD Cloud also comes with recommendations on how to describe LOD datasets on CKAN.
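As a sketch of how VoID covers some of the search-oriented categories below (content description, data access), the snippet builds a minimal VoID description in JSON-LD for a hypothetical linked dataset. The URIs and title are invented; the `void:` terms are from the VoID vocabulary.

```python
import json

# A minimal VoID description of a hypothetical dataset, in JSON-LD.
# All URIs below are illustrative assumptions.
void_description = {
    "@context": {
        "void": "http://rdfs.org/ns/void#",
        "dct": "http://purl.org/dc/terms/",
    },
    "@id": "http://example.org/dataset/books",
    "@type": "void:Dataset",
    "dct:title": "Example book catalogue",          # content description
    "void:sparqlEndpoint": {"@id": "http://example.org/sparql"},   # data access
    "void:dataDump": {"@id": "http://example.org/dumps/books.nt"}, # data access
}

print(json.dumps(void_description, indent=2))
```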

Content Description

Provenance

See this mail.

License

Other similar (research) proposals exist, such as L4LOD.

Revenue

Credentials

Quality / Metrics

Release Schedule

Data Format

Data Access

Metadata Types for Use

URI Design Principles

Machine Access to Data

API specification

Format Specification

Intrinsic vs Extrinsic Metadata

(New section contributed by Mark Harrison. Please feel free to modify and/or move elsewhere. This is in response to http://www.w3.org/2013/dwbp/track/actions/54 )

It might be helpful to think about the different kinds of metadata that can be used for describing or discovering a dataset. Some of it can be considered intrinsic, i.e. about the scope and semantics of the data itself (irrespective of the data format or the mechanism through which it is accessed). Other metadata is more extrinsic in nature, describing the data's availability (which may be specific to particular formats and access mechanisms). The ideas below are not definitive or authoritative; they are intended as a starting point for discussion and further development. They might also be useful for Hadley's discussion about alternative approaches to data catalogues.

Intrinsic properties described by metadata

  • Concepts (e.g. SKOS concepts about the knowledge domain for which this data is relevant)
  • Classes (the real-world categories of data object or thing being described or linked in the dataset - making use of classes in defined vocabularies)
  • Predicates (the properties, attributes or semantic relationships that link one thing to another thing or to a literal value such as a string, integer, floating point number or date - making use of predicates in defined vocabularies)
  • Geographic scope (for which region of the world is this data relevant? a continent, country, state/province, city, or geographic zone (a polygon mapped via latitude/longitude co-ordinates))
  • Temporal scope (for which time interval is this dataset relevant? what is the start time and end time for the collection of data included in the dataset?)
  • Spatial granularity (how detailed is the coverage of the data in space or geography - what is the spatial sampling frequency or granularity? per country, per state, per postcode area, per square kilometer, etc.)
  • Temporal granularity (how detailed is the coverage of the data in time - what is the time sampling frequency, period or interval? per year, per month, per week, per day, per hour, per minute, per second, etc.)
  • Source provenance (who / which organisation collected or generated the data? How was it collected?)
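One plausible way to express several of the intrinsic properties above is with DCAT and Dublin Core terms in JSON-LD, as sketched below. The dataset, SKOS concept and coverage values are invented, and the term choices (`dcat:theme` for concepts, `dct:spatial` for geographic scope, `dct:temporal` for temporal scope) are one possible mapping, not the only one.

```python
import json

# Sketch of intrinsic dataset properties using DCAT / Dublin Core terms.
# All identifiers and values are hypothetical.
intrinsic = {
    "@context": {
        "dcat": "http://www.w3.org/ns/dcat#",
        "dct": "http://purl.org/dc/terms/",
    },
    "@id": "http://example.org/dataset/rainfall",
    "@type": "dcat:Dataset",
    "dcat:theme": {"@id": "http://example.org/concepts/meteorology"},  # SKOS concept
    "dct:spatial": {"@id": "http://example.org/places/france"},        # geographic scope
    "dct:temporal": "2000-01-01/2009-12-31",                           # temporal scope (interval)
}

print(json.dumps(intrinsic, indent=2))
```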

Extrinsic properties described by metadata

  • Format (in which format(s) can the data be accessed? e.g. XML, CSV, TSV, JSON, JSON-LD, RDF/XML, Turtle, N-Triples, etc.)
  • Variants (e.g. different human-language translations) of data
  • Access mechanisms (though which mechanism can the data be accessed? e.g. SPARQL endpoints, Linked Data Platform, REST interfaces, SOAP-based web services, etc.)
  • Transformation provenance (how has the raw data been processed? e.g. to improve its quality, to transform it into a different format, or to combine it with another dataset to add valuable context)
  • Licence information (under which licence is the data made available?)
  • Review information (feedback from data users about how useful the data was - did they rate it as good quality data? what was their rating?)
  • Example usage information (how has this data been used by others? links to examples of data mash-ups and visualisations, links to other datasets that were used in combination to achieve a particular purpose)
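Several of the extrinsic properties above (format, access mechanism, licence) map naturally onto a DCAT Distribution, sketched below in JSON-LD. All URIs are hypothetical, and this is one illustrative mapping rather than a definitive one.

```python
import json

# Sketch of extrinsic properties as a DCAT Distribution in JSON-LD.
# URIs and values below are illustrative assumptions.
distribution = {
    "@context": {
        "dcat": "http://www.w3.org/ns/dcat#",
        "dct": "http://purl.org/dc/terms/",
    },
    "@id": "http://example.org/dataset/rainfall/csv",
    "@type": "dcat:Distribution",
    "dcat:mediaType": "text/csv",                           # format
    "dcat:accessURL": {"@id": "http://example.org/data/rainfall.csv"},  # access mechanism
    "dct:license": {"@id": "http://creativecommons.org/licenses/by/4.0/"},  # licence
}

print(json.dumps(distribution, indent=2))
```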


It might be argued that a simple Intrinsic vs Extrinsic classification is too limited or too simplistic, and that these characteristics really fit on a sliding scale between Very Intrinsic and Very Extrinsic, with some middle ground in between. However, I hope that this thought process helps us to think about how data users might search for matching datasets, what kind of selection criteria they might use to find the kind of data they are looking for, and what related data could be helpful.

Editors and Contributors

Carlos Laufer
Makx Dekkers
Carlos Iglesias
Bernadette Farias Loscio
Mark Harrison

Links and References