Use Case Working Space

From Dataset Exchange Working Group

This document was for draft discussions and is now inactive. Further development has moved to structured documentation on GitHub at https://github.com/w3c/dxwg/blob/gh-pages/ucr/index.html

//Use Case template

Status values

Indicates the review status and maturity level of the use case, one of:

  • Initial review state: UC created but not discussed yet
  • Intermediate review state: UC is being discussed, either initially or after re-opening a closed UC
  • Final review state: UC accepted for inclusion in the Note document
  • Final review state: work on the UC was discontinued (because it is a duplicate, out of scope, etc.). It may conditionally transition to "open" status again.

Tag values

Cut and paste then edit this list for convenience:

#Aggregate #Content_negotiation #Coverage #Dataset_concept #DCAT #Documentation #Lifecycle #Meta #Packaging #Profile #Provenance #Publication #Quality #Referencing #Representation #Resolution #Roles #Semantics #Service #Space #Time #Usage_control

#Aggregate The UC is an aggregate of individual UCs, and requires rework/splitting up
#Content_negotiation HTTP Content negotiation by media type and #Profile. Deliverable reference tag.
#Coverage Temporal, spatial etc. extension/coverage
#Dataset_concept UCs challenging the notion of Dataset concept
#DCAT The UC refers to DCAT core and provides related requirements. Deliverable reference tag.
#Documentation Documentation of informal hints for prospective users, e.g. summary, example data, usage guide
#Lifecycle Status, versioning etc.
#Meta A meta-use case provides a point of entry to a topic to be handled by subordinate, detailed use cases
#Packaging Archiving (TAR), compression (GZ) etc.
#Profile A profile identifies the structural and/or semantic constraints of a target resource (compared to its serialization, i.e. media type). The Profile tag likewise refers to the Application Profile specification. Deliverable reference tag.
#Provenance Data lineage, modification history etc.
#Publication (Re)publication and dissemination of Dataset information
#Quality (Relative) quality of data stated or assessed by a testing suite
#Referencing Means to reference entities in DCAT according to requirements of distinct communities
#Representation Media type, data type, encoding
#Resolution (Quantifiable) spatio-temporal etc. resolution and precision
#Roles Agent roles involved in creation, modification etc. of the data
#Semantics Content-related models (dimensions, conceptual classification) etc.
#Service Dynamic, parametrized distribution approaches (PULL, PUSH)
#Space Modeling of spatial aspects
#Time Modeling of temporal aspects
#Usage_control Access control and usage permissions (extending access control)

Use case name

Status:

Identifier: ID$No

Creator: (your name)

Deliverable(s): (DCAT1.1, AP Guidelines, Content Negotiation)
Which deliverable or deliverables does this UC relate to?

Tags

Optional space-separated list of tags out of the above catalog (extend on demand)

Stakeholders

Mandatory list of stakeholders experiencing the problem. When describing the stakeholder, please be as specific as possible (e.g., data consumer, data producer, data publisher, program, etc.) and avoid using the term user.

Problem statement

Mandatory statement of the current situation, including a description of the problem, the stakeholders experiencing the problem, and what the stakeholder(s) are expected to supply (i.e. what contextual knowledge are they expected to have available) and/or receive to resolve the problem they are experiencing. In describing the stakeholder, please be as specific as possible (e.g., data consumer, data producer, data publisher, program, etc.) and avoid using the term user.

Existing approaches

Optional references to standards and examples of established approaches with a potential for reuse in DCAT

Links

Optional link list to documents and projects this use case refers to

Requirements

Mandatory requirements suggested by this use case

  • Imperative sentences, each starting with a verb and describing an individual task needed to solve the stated problem

Related use cases

Optional references to related local (refer to anchor identifier [[#Id...]]) and remote use cases (e.g. POE-WG UCs).

Comments

Optional section for editorial comments, suggestions and their interactive resolution

//End Use Case template

DCAT packaged distributions [ID1]

Status:

Identifier: ID1

Creator: Makx Dekkers

Deliverable(s): DCAT1.1

Tags

#DCAT #Documentation #Packaging #Representation

Stakeholders

Data publisher

Problem statement

In practice, distributions are sometimes made available in a packaged or compressed format. For example, a group of files may be packaged in a ZIP file, or a single large file may be compressed. The current specification of DCAT allows the package format to be expressed in dct:format or dcat:mediaType but it is currently not possible to specify what types of files are contained in the package.

Existing approaches

An example of an approach is the way ADMS defines Representation Technique, which could be used to describe the type of data in a ZIP file, e.g. dcat:mediaType="https://www.iana.org/assignments/media-types/application/zip"; adms:representationTechnique="https://www.iana.org/assignments/media-types/text/csv".
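A minimal sketch of this approach in Turtle (the resource name is hypothetical):

a:Distribution a dcat:Distribution ;
# The package itself is a ZIP file...
  dcat:mediaType <https://www.iana.org/assignments/media-types/application/zip> ;
# ...containing CSV files
  adms:representationTechnique <https://www.iana.org/assignments/media-types/text/csv> .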

Links

Requirements

  • Define a way to describe/specify the packaging of files in a Distribution.

Detailing and requesting additional constraints (profiles) beyond content types [ID2]

Status:

Identifier: ID2

Creator: Ruben Verborgh

Deliverable(s): AP Guidelines, Content Negotiation

Tags

#Profile #Representation

Stakeholders

data consumer, data producer, data publisher

Problem Statement

While a content type such as application/json identifies the kind of parser a client needs for a given representation, it does not cover all assumptions of the server. In practice, the server will often follow a much stricter pattern than “everything that is valid JSON”, restricting itself to one or more subsets of JSON. For the purpose of this use case, we refer to such subsets generically as "profiles". A profile captures additional structural and/or semantic constraints in addition to the media type. Note that one profile might be used across different media types: for instance, a profile could be applied to multiple RDF syntaxes.

In order to inform clients that a representation conforms to a certain profile, servers should be able to explicitly indicate which profile(s) a response conforms to. This then allows the client to make the additional structural and/or semantic interpretations that are allowed within that profile.

Clients and servers should be able to indicate their compatibility and/or preference for certain profiles. This enables clients to request a resource in a specific profile, in addition to the specific content type it requests. A client should be able to determine which profiles a server supports, and with which content types. A client should be able to look up more information about a certain profile.

One example of such a profile is a specific DCAT Application Profile, but many other profiles can be created. For example, another profile could indicate that the representation uses a certain vocabulary.
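By way of illustration, the following hypothetical HTTP exchange sketches such a negotiation. The Accept-Profile and Content-Profile header names follow a proposal discussed in an IETF Internet-Draft and are illustrative only, not standardised:

GET /dataset/123 HTTP/1.1
Accept: text/turtle
Accept-Profile: <http://example.org/profiles/my-dcat-profile>

HTTP/1.1 200 OK
Content-Type: text/turtle
Content-Profile: <http://example.org/profiles/my-dcat-profile>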

Existing approaches

  • HTTP content negotiation (by media type, language, …)

Links

Requirements

  • Create a sufficiently wide definition of an application profile (probably generic "profile", not specific to "application" but rather to "purpose")
  • Create a way for the server to indicate one or more profiles the response conforms to
  • Create a way to negotiate profiles between clients and servers
  • Create a way to list the profiles supported by a server
  • Create a way to retrieve more information about a profile

Related use cases

  • #ID3 considers that resources can conform to multiple profiles.
  • #ID5 requests two abilities also covered by the present use case: clients should be able to see all available profiles supported by a server, and to request metadata about a profile.
  • #ID30 presents highly similar requirements as the present use case from the perspective of harvesting.

Comments

(None yet.)

Responses can conform to multiple, modular profiles [ID3]

Status:

Identifier: ID3

Creator: Ruben Verborgh

Deliverable(s): AP Guidelines, Content Negotiation

Tags

#Profile #Representation

Stakeholders

data consumer, data producer, data publisher

Problem Statement

A response of a server can conform to multiple content types. For example, a JSON-LD response conforms to the following content types: application/octet-stream, application/json, application/ld+json (even though only one of them will typically be indicated).

Similarly, the response of a server can conform to multiple profiles. For example, a profile X could demand that all persons are described with the FOAF vocabulary, and a profile Y could demand that all books are described with the Schema.org vocabulary. Then a response that uses FOAF for people and Schema.org for books clearly conforms to both profiles. In contrast to content types, it is informative to list both profiles, since their conformance is independent.

Therefore, servers should be able to indicate, if they wish to do so, that a response conforms to multiple profiles. Clients should also be able to specify their preference for one or multiple profiles.

This enables a modular design of profiles, which can be combined when appropriate. With content types, only hierarchical combinations are possible. For example, a JSON-LD document is always a JSON document. However, with profiles, this is not necessarily the case: some of them might allow orthogonal combinations (as is the case in the vocabulary example above).
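For illustration, the following minimal sketch shows a response body that conforms to both of the hypothetical profiles above: profile X (persons described with FOAF) and profile Y (books described with Schema.org):

a:Person a foaf:Person ;
# Conforms to profile X
  foaf:name "Alice" .

a:Book a schema:Book ;
# Conforms to profile Y
  schema:name "An Example Book" ;
  schema:author a:Person .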

Existing approaches

  • conformance to multiple content types

Links

Requirements

  • Create a sufficiently wide definition of an application profile (probably generic "profile", not specific to "application" but rather to "purpose").
  • Create a way for the server to indicate conformance to multiple profiles.
  • Create a way for a client to indicate its preference for multiple profiles.
  • Create a way to do the above across content types.

Related use cases

  • #ID2 makes the case for profile support, which the present use case extends to multiple modular profiles.

Comments

(None yet.)

Dataset Versioning Information [ID4]

Status:

Identifier: ID4

Creator: Nandana Mihindukulasooriya

Deliverable(s): DCAT1.1

Tags

#DCAT #Lifecycle #Provenance #Dataset_concept #Aggregate

Stakeholders

  • data producers that produce versioned datasets
  • data consumers that consume versioned datasets

Problem statement

Most datasets that are maintained long-term and evolve over time have distributions of multiple versions. However, the current DCAT model does not cover versioning in sufficient detail. Being able to publish dataset version information in a standard way will help both producers publishing their data in data catalogues or archiving data, and consumers who want to discover new versions of a given dataset. There are also similarities between software versioning and dataset versioning; for instance, some data projects release daily dataset distributions, major/minor releases, etc. We can probably use some of the lessons learned from software versioning. Several existing dataset description models extend DCAT to provide versioning information, for example the HCLS Community Profile.

Links:

Requirements:

  • A definition of what is meant by "version" in this context, and of how it relates to dataset and distribution, should be provided.
  • Different versioning scenarios should be supported (e.g., dataset evolution, conversions/translations, granularities/subsets).
  • Each version should provide a version identifier and other relevant metadata.
  • It should be possible to provide metadata about when a version was created (released).
  • It should be possible to provide identifiers for the previous/next versions, when applicable (i.e. when versions are in chronological order)
  • It should be possible to state what has been changed, when applicable (i.e. when versions are in chronological order)
  • It should be possible to discover versions of a given dataset in a catalog.
  • W3C DWBP guidelines on versioning: BP7. Provide a version indicator, BP8. Provide version history, BP11. Assign URIs to dataset versions and series
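As an illustration of some of these requirements, the following minimal sketch uses the PAV properties adopted by the HCLS Community Profile mentioned above (resource names are hypothetical):

a:DatasetV2 a dcat:Dataset ;
# Version identifier
  pav:version "2.0" ;
# Link to the previous version
  pav:previousVersion a:DatasetV1 ;
# When this version was released
  dcterms:issued "2017-05-01"^^xsd:date .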

Related use cases

Comments

Discover available content profiles [ID5]

Status:

Identifier: ID5

Creator: Rob Atkinson

Deliverable(s): DCAT1.1, AP Guidelines, Content Negotiation

Tags

#Content_negotiation #Profile #Referencing #Representation #Resolution #Semantics #Service

Stakeholders

  • user agents mediating access to dataset distributions, especially via services
  • users wishing to browse data availability via Linked Data or other Web environments
  • users wishing to search for data with access methods conforming to specific capabilities

Problem statement

There are multiple reasons to provide different information about the same concept. If Linked Data is to be based on URI object identifiers, and these are to relate to the real-world entity rather than to specific implementations (i.e. information records), then it is inevitable that different sets of information will be required for different purposes.

Consider a request for the boundary of a country with a coastline. If the coastline is included as a property, this may amount to many megabytes of detail. Alternatively, a generalised simple coastline may be provided, or a single point, or the coastline may not be required at all. (In reality there may be many different versions of a coastline based on different legal definitions, practices, or approximation methods.)

Furthermore, in any graph-based response, the depth of traversal of the graph is always a choice. Consider a request to the GBIF taxonomy service to search for a biological species. A response typically includes not just the species, but potentially more information about the taxonomic hierarchy (kingdom, phylum, family, genus, etc.), possible synonyms, and possibly a wide range of metadata about name sources, usages and history. There is a need to offer different choices of how deep such a traversal of relationships should be undertaken and returned.

Different information models (response schemas), and different choices of content within the same schema, constitute necessary options, and there may be a large number of these.

Thus there is a need for discovering which profiles a service offers for a given resource, and for a canonical machine-readable graph of metadata about what such offerings consist of and how they may be invoked. This may be as simple as providing a profile name, or may cover content profile, schema choice, encoding and languages.

Note that the Linked Data API implementation used by the UK Linked Data effort includes the notion of _view parameters in URI requests. These are "named collections of properties", but the API does not provide a means to attach metadata about what such views consist of. Equivalent HTTP header-based profile negotiation would still need to address this requirement in the same way as agent-driven negotiation (https://www.w3.org/Protocols/rfc2616/rfc2616-sec12.html): what is required is a minimal set of metadata and extension mechanisms for this.

Support for a specific profile is also a powerful search axis, potentially encompassing the full suite of semantic specification and resource interoperability requirements. Thus metadata about profile support can be used for both discovery and mediated traversal via forms of content negotiation.

Links:


Requirements:

  • A canonical means to request that a server provide a list of the profiles it supports, and a canonical means to provide metadata about those aspects of a profile that are specifically tied to the content negotiation specifications - including format, language, profile name and, if relevant, data schema.
  • An extension mechanism to allow servers to provide additional metadata about profiles beyond those standardised aspects. This would ideally be a recursive use of profiles, where the server advertises which profiles such metadata is available in. A sketch follows below.
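Purely as an illustration of these requirements, a machine-readable description of profile offerings might look like the following sketch; all ex: terms are hypothetical placeholders for whatever vocabulary is eventually standardised:

a:Service ex:offersProfile a:SimpleBoundaryProfile , a:DetailedBoundaryProfile .

a:SimpleBoundaryProfile a ex:Profile ;
  dcterms:title "Country boundary with generalised coastline"@en ;
# Aspects tied to the content negotiation specifications
  ex:format <https://www.iana.org/assignments/media-types/application/geo+json> ;
  ex:schema a:SimpleFeatureSchema .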

Related use cases

DCAT Distribution to describe web services [ID6]

Status:

Identifier: ID6

Creator: Jonathan Yu, CSIRO

Deliverable(s): DCAT1.1

Tags:

#DCAT #Representation #Service

Stakeholders:

Data provider, data consumer

Problem statement:

Users often access datasets via web services. DCAT provides constructs for associating a resource described by dcat:Dataset with dcat:Distribution descriptions. However, the Distribution class provides only the dcat:accessURL and dcat:downloadURL properties for users to access/download something. It would be useful for users to gain more information about the web service endpoint and how to interact with the data. If information about the web service is known, together with appropriate identifiers for the data, then users can understand the additional context and then invoke a call to the web service to access/download the dataset resource or subsets of it.

Links:

  • Existing CKAN catalogues such as data.gov.au use a DCAT plugin to represent dataset entries, which lacks a consistent/precise way to associate a dataset description with a distribution and its appropriate web service implementation/interface. See http://data.gov.au/dataset/2016-soe-lan-soil-classification for an example, as well as http://data.gov.au/dataset/116eb634-fc0b-42d8-ae27-b876a12c4f6a.rdf (which overloads dct:format to describe the web service interface for some of the dcat:Distribution descriptions).
  • The following is work proposing a lightweight method for describing web services and interfaces in RDF/OWL. The resulting descriptions could then be specialisations or associations of dcat:Distribution, allowing users and catalog implementations to standardise on how different endpoints are used.

https://github.com/CSIRO-LW-LD/dpn-ontology/wiki/Design:-DPN-module-for-services

Requirements:

  • Some way to associate web services precisely, via dcat:Distribution, with dcat:Dataset, such as by reference to well-known web service interfaces (e.g. OGC WFS and WMS, OPeNDAP, REST APIs). A sketch follows below.
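A minimal sketch of such an association, anticipating the pattern shown later in use case ID18; the resource names and the OGC service-type URI are assumptions:

a:Dataset a dcat:Dataset ;
  dcat:distribution [ a dcat:Distribution ;
    dcat:accessURL <http://example.org/geoserver/wfs> ;
# The endpoint conforms to a well-known service interface
    dcterms:conformsTo <http://www.opengis.net/def/serviceType/ogc/wfs> ] .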

Support associating fine-grained semantics for datasets and resources within a dataset [ID7]

Status:

Identifier: ID7

Creator: Jonathan Yu, Simon Cox (CSIRO)

Deliverable(s): DCAT1.1, AP guidelines, Content negotiation

Tags:

#Semantics

Stakeholders:

Data provider, data consumer

Problem statement:

We want to be able to describe a dataset record using properties appropriate to the dataset type. This is especially the case for datasets in the geoscience domain, e.g. an observation of a "feature" in the real world, captured using a sensor, about some property. There are emerging practices on how to represent these semantics for the data; however, DCAT currently only supports associating a dcat:Dataset, via dcat:theme, with a skos:Concept. Data providers could extend/specialise dcat:theme to provide specific semantics about the association between dcat:Dataset and the 'theme', but is this enough? Furthermore, there are broad/aggregated semantics at the dataset level (e.g. observations in the Great Barrier Reef) and then fine-grained semantics for elements within a dataset (e.g. sea surface temperature observations in the Great Barrier Reef). Users need a way to view the aggregated collection-level metadata and the associated semantics, and then they need a way to view record-level metadata to obtain/filter on specific information, e.g. instrument/sensor used, spatial feature, observable property, quantity kind, etc.

Properties from the W3C Semantic Sensor Network SOSA ontology and QUDT may be useful in this context.
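A minimal sketch of one possible direction, specialising dcat:theme as suggested above; all ex: properties are hypothetical:

# Hypothetical specialisation of dcat:theme
ex:observedProperty rdfs:subPropertyOf dcat:theme .

a:SSTDataset a dcat:Dataset ;
# Broad, collection-level theme
  dcat:theme a:GreatBarrierReef ;
# Fine-grained semantics
  ex:observedProperty a:SeaSurfaceTemperature ;
  ex:sensorUsed a:Radiometer .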

Links:

  • Examples of representing dataset metadata at the collection level and at the fine grained record level via the THREDDS server here:

http://dapds00.nci.org.au/thredds/catalogs/fx3/catalog.html

Requirements:

  • Recommendations and mechanisms for data providers to describe Datasets with fine grained semantics (e.g. instrument/sensor used, spatial feature, observable property, quantity kind).
  • Recommendations and mechanisms for data consumers and data catalogues implementers to allow filtering/subsetting Datasets and their elements with fine grained semantics (e.g. instrument/sensor used, spatial feature, observable property, quantity kind).

Scope or type of dataset with a DCAT description [ID8]

Status:

Identifier: ID8

Creator: Simon Cox (CSIRO)

Deliverable(s): DCAT1.1, AP Guidelines

Tags:

#Dataset_concept #DCAT #Representation

Stakeholders:

Data catalogue

Problem statement:

Some users of DCAT may want to apply it to resources that not everyone would consider a 'dataset'. Some examples are text documents, source code, controlled vocabularies, and ontologies. It is not clear what kinds of resources may be described with DCAT, or how one would describe the types listed.

Links:

Requirements:

  • Guidance about the expected scope for DCAT
  • A way for a DCAT description to indicate the 'type' of dataset involved (the semantic type, not the media-type).
  • Guidance on use of dc:type or similar for DCAT records.
  • Recommendation on genre or "semantic" type vocabularies.
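A minimal sketch of the last two requirements, using dcterms:type with the DCMI Type Vocabulary (resource names are hypothetical):

a:Report a dcat:Dataset ;
  dcterms:type <http://purl.org/dc/dcmitype/Text> .

a:SourceCode a dcat:Dataset ;
  dcterms:type <http://purl.org/dc/dcmitype/Software> .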

Common requirements for scientific data [ID9]

Status:

Identifier: ID9

Creator: Andrea Perego - European Commission, Joint Research Centre (JRC)

Deliverable(s): DCAT1.1

Tags

#DCAT #Documentation #Meta #Provenance #Quality #Referencing #Roles

Stakeholders

data consumer, data producer, data publisher

Problem statement

The European Commission's Joint Research Centre (JRC) is a multidisciplinary research organization with the mission of supporting EU policies with independent evidence throughout the whole policy life-cycle.

In order to provide a single access and discovery point to JRC data, a corporate data catalogue was launched in 2016, where datasets are documented using a modular metadata schema consisting of a core profile, defining the elements common to all metadata records, and a set of domain-specific extensions.

The reference metadata standard used is the DCAT application profile for European data portals [DCAT-AP] (the de facto EU standard metadata interchange format), together with the related domain-specific extensions - namely, [GeoDCAT-AP] for geospatial metadata and [StatDCAT-AP] for statistical metadata. The core profile of the JRC metadata schema does not, however, use [DCAT-AP] as is: it complements it with a number of metadata elements that have been identified as most relevant across scientific domains and that are required to support data citation.

More precisely, the most common cross-domain requirements identified at JRC are the following:

  • Ability to indicate dataset authors.
  • Ability to describe data lineage.
  • Ability to give potential data consumers information on how to use the data ("usage notes").
  • Ability to link to scientific publications about a dataset.
  • Ability to link to input data (i.e., data used to create a dataset).

Existing approaches

[VOCAB-DCAT] does not provide guidance on how to model this information. [DCAT-AP] and [GeoDCAT-AP] partially support these requirements - namely, the specification of dataset authors (dcterms:creator [DCTerms]), data lineage (dcterms:provenance [DCTerms]), and input data (dcterms:source [DCTerms]). For the two remaining requirements, the JRC metadata schema makes use of dcterms:isReferencedBy [DCTerms] (publications) and vann:usageNote [VANN] (usage notes).

These solutions allow a simplified description of the dataset context that can be used for multiple purposes - such as assessing the quality and fitness for use of a dataset, or identifying the datasets most commonly used as input data. Additional details could be provided by representing these relationships more precisely with "qualified" forms, using vocabularies such as [PROV-O], [VOCAB-DQV], or [VOCAB-DUV]: for instance, the relationship between a dataset and its input data can be complemented with the model used for processing them, and possibly with additional information on the data generation workflow.
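A minimal sketch of this approach, using the properties named above (resource names are hypothetical):

a:Dataset a dcat:Dataset ;
# Dataset author
  dcterms:creator a:Author ;
# Data lineage
  dcterms:provenance a:LineageStatement ;
# Input data
  dcterms:source a:InputDataset ;
# Scientific publication about the dataset
  dcterms:isReferencedBy <http://example.org/publications/paper-1> ;
# Usage notes
  vann:usageNote <http://example.org/datasets/1/usage> .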

Links

Requirements

  • Being able to indicate dataset authors.
  • Being able to describe data lineage.
  • Being able to give potential data consumers information on how to use the data ("usage notes").
  • Being able to link to scientific publications about a dataset.
  • Being able to link to input data (i.e., data used to create a dataset).
  • Being able to link to the software used to produce the data.

Related use cases

Requirements for data citation [ID10]

Status:

Identifier: ID10

Creator: Andrea Perego - European Commission, Joint Research Centre (JRC)

Deliverable(s): DCAT1.1

Tags

#DCAT #Provenance #Quality #Referencing #Roles #Time

Stakeholders

data consumer, data producer, data publisher

Problem statement

Data citation is gaining more and more importance as a way to recognize the scientific value of research data, by treating them in the way scientific publications have traditionally been treated.

Requirements for data citation include:

  • Describing data with the information that is typically used to create a bibliographic entry (e.g., authors, publication year, publisher).
  • Associating data and, whenever possible, the related resources (authors, publisher, input data, publications), with persistent identifiers.

Existing approaches

A study has been carried out at the European Commission's Joint Research Centre (JRC) to create mappings between [DataCite] (the current de facto standard for data citation) and [DCAT-AP].

The results show that [DCAT-AP] covers most of the required [DataCite] metadata elements, but some of them are missing. In particular:

  • Mandatory elements:
    • Dataset creator (but [GeoDCAT-AP] supports it)
  • Recommended elements:
    • [DCAT-AP] does not cover all the types of identifiers, dates, contributors and resources supported by [DataCite]
  • Optional elements:
    • Funding reference

Guidance should be provided on how to model this information in order to enable data citation also in records represented with [VOCAB-DCAT] and related application profiles.

Links

Requirements

  • Being able to specify the basic mandatory information for data citation - i.e., dataset authors, title, publication year, publisher, persistent identifier.
  • Being able to specify additional, although not mandatory, information relevant for data citation - e.g., dataset contributors and funding reference.
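A minimal sketch of the basic citation metadata listed above, using [DCTerms] properties (resource names are hypothetical; 10.5555 is the example DOI prefix):

a:Dataset a dcat:Dataset ;
# Persistent identifier
  dcterms:identifier "https://doi.org/10.5555/example" ;
# Dataset author(s)
  dcterms:creator a:Author ;
  dcterms:title "Example dataset"@en ;
# Publication year
  dcterms:issued "2017"^^xsd:gYear ;
  dcterms:publisher a:Publisher .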

Related use cases

Modeling identifiers and making them actionable [ID11]

Status:

Identifier: ID11

Creator: Andrea Perego - European Commission, Joint Research Centre (JRC)

Deliverable(s): DCAT1.1

Tags

#DCAT #Referencing

Stakeholders

data consumer, data producer, data publisher

Problem statement

A number of different (possibly persistent) identifiers are widely used in the scientific community, especially for publications, but now increasingly for authors and data.

Different approaches are used for representing them in RDF; best practices are needed to enable their effective use across platforms. More importantly, they need to be made actionable, irrespective of the platforms they are used in.

Encoding identifiers as HTTP URIs seems to be the most effective way of making them actionable. Notably, quite a few identifier schemes can be encoded as dereferenceable HTTP URIs, and some of them also return machine-readable metadata (e.g., DOIs, ORCIDs). Alternatively, they can still be encoded as literals, especially if there is a need to know the identifier "type". In such a case, a common identifier type registry would ensure interoperability.

Another issue concerns the ability to specify primary and secondary identifiers. This may be a requirement when resources are associated with multiple identifiers.

Existing approaches

When identifiers are encoded as HTTP URIs, the usual approach to model primary and alternative identifiers is to use the former as the resource URI, whereas the latter are specified using owl:sameAs. In this case, the information about the identifier type is not explicitly specified and can be derived only from the URI syntax - although this is not always possible.

To model identifiers as literals, [VOCAB-DCAT] uses dcterms:identifier, but it makes no distinction between primary and alternative identifiers, nor does it capture the identifier type. For alternative identifiers, [DCAT-AP] recommends the class adms:Identifier [VOCAB-ADMS], which can be used to specify the identifier type, plus additional information - namely, the identifier scheme agency and the identifier issue date. It is worth noting that adms:Identifier has the primary purpose of describing the identifier itself, which makes it less suitable for linking purposes.

Finally, a number of vocabularies have defined specific properties for modeling identifier types, such as prism:doi [PRISM] and bibo:doi [BIBO] for DOIs. Moreover, starting from version 3.2, [SCHEMA-ORG] has defined a super-property schema:identifier for all the identifier-specific properties already used in [SCHEMA-ORG].

An alternative approach is to denote the identifier type with an RDF datatype. In such a case, the same property can be used to specify the identifier - e.g., dcterms:identifier. This solution has the advantage that all literals used as identifiers can easily be found (one just has to look up / search for the same property), whereas the datatype can be used to filter specific identifier types.
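A minimal sketch of the two literal-based approaches described above; the ex:doi datatype is hypothetical:

a:Dataset a dcat:Dataset ;
# (a) alternative identifier described with adms:Identifier
  adms:identifier [ a adms:Identifier ;
    skos:notation "10.5555/example" ;
    adms:schemeAgency "DataCite" ] ;
# (b) identifier type denoted by an RDF datatype
  dcterms:identifier "10.5555/example"^^ex:doi .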

KC: Note that the libraries/archives community has identifiers that are not (yet) actionable, like ISBNs and ISSNs. These can be coded as dcterms:identifier strings, but the string itself is not unique. It is not clear how these fit into the overall picture, but perhaps we can task someone to bring a specific proposal.

Links

Requirements

  • Encode identifiers as dereferenceable HTTP URIs
  • Being able to model the identifier type
  • Being able to model primary and alternative identifiers

Related use cases

Modeling data lineage [ID12]

Status:

Identifier: ID12

Creator: Andrea Perego - European Commission, Joint Research Centre (JRC)

Deliverable(s): DCAT1.1

Tags

#DCAT #Provenance #Quality #Referencing #Roles

Stakeholders

data consumer, data producer, data publisher

Problem statement

Documentation of data lineage is crucial to ensure transparency about how data are created and to facilitate their reproducibility. These have been traditional requirements for scientific data, but they are currently becoming relevant in other communities as well, especially in the public sector when data are used in support of policy making.

Data lineage is typically specified in more or less detailed documentation targeted at humans. In very few cases this information is represented in a formal, machine-readable way, enabling a (semi-)automated data processing workflow that can be used to re-run the experiment from which the data were produced.

Existing approaches

[DCAT-AP] uses property dcterms:provenance [DCTerms] to specify human-readable documentation of data lineage, which can be either embedded in the metadata or provided in a document linked from the metadata record itself. Moreover, dcterms:source can be used to refer to the input data.

[PROV-O] can be used in order to provide a machine-readable description of data lineage, but best practices on how to use it consistently are missing.
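A minimal sketch combining the two approaches (resource names are hypothetical):

a:Dataset a dcat:Dataset ;
# Human-readable lineage
  dcterms:provenance [ a dcterms:ProvenanceStatement ;
    rdfs:label "Interpolated from weather station measurements."@en ] ;
# Machine-readable derivation, using [PROV-O]
  dcterms:source a:InputDataset ;
  prov:wasDerivedFrom a:InputDataset .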

Links

Requirements

  • Being able to specify data lineage in a human-readable way.
  • Being able to specify data lineage in a machine-readable way.

Related use cases

Modeling agent roles [ID13]

Status:

Identifier: ID13

Creator: Andrea Perego - European Commission, Joint Research Centre (JRC)

Deliverable(s): DCAT1.1

Tags

#DCAT #Provenance #Referencing #Roles

Stakeholders

data consumer, data producer, data publisher

Problem statement

Each metadata standard has its own set of agent roles, and each uses its own vocabularies / code lists. E.g., the latest version (2014) of [ISO-19115] has 20 roles, and [DataCite] has even more.

Two of the main issues concern (a) how to ensure interoperability across roles defined in different standards, and (b) whether it makes sense to support all of them across platforms. The latter point follows from a common issue in metadata standards that support multiple roles with overlapping semantics (e.g., the difference between a data distributor and a data publisher is not always clear). In these scenarios, whenever metadata are not created by specialists, roles frequently end up being used inconsistently.

As far as research data are concerned, agent roles are important to denote the type of contribution provided by each individual / organization in producing the data.

Moreover, in some cases an additional requirement is to specify the temporal dimension of a role - i.e., the time frame during which an individual / organisation played a given role - and perhaps other information as well - e.g., the organisation where the individual held a given position while playing that role.

Existing approaches

[DCTerms] defines a limited number of agent roles as properties. [VOCAB-DCAT] re-uses some of them (in particular, dcterms:publisher), and defines a new one, namely dcat:contactPoint. [DCAT-AP] and [GeoDCAT-AP] provide guidance on the use of other [DCTerms] roles - in particular, dcterms:creator and dcterms:rightsHolder. However, the role properties defined in [DCTerms] and [VOCAB-DCAT] model just a subset of the agent roles defined in other standards. Moreover, they cannot be used to associate a role with other information concerning its temporal / organizational context.

[PROV-O] could be used for this purpose by using a “qualified attribution”. This is, for instance, the approach used in [GeoDCAT-AP] to model agent roles defined in [ISO-19115] but not supported in [DCTerms] and [VOCAB-DCAT]:

a:Dataset a dcat:Dataset; 
  prov:qualifiedAttribution [ a prov:Attribution ;
# The agent role, as per ISO 19115
    dcterms:type <http://inspire.ec.europa.eu/metadata-codelist/ResponsiblePartyRole/owner> ;
# The agent playing that role
    prov:agent [ a foaf:Organization ;
      foaf:name "European Union"@en ] ] .

However, to address the different use cases, such “qualified roles” should be compatible with the corresponding non-qualified forms, and each should be inferable from the other. For instance, the example above is considered in [GeoDCAT-AP] as equivalent to:

a:Dataset a dcat:Dataset;
  dcterms:rightsHolder [ a foaf:Organization ;
    foaf:name "European Union"@en ] .

Links

Requirements

  • Being able to model different types of agent roles

Related use cases

Data quality modeling patterns [ID14]

Status:

Identifier: ID14

Creator: Andrea Perego - European Commission, Joint Research Centre (JRC)

Deliverable(s): DCAT1.1

Tags

#DCAT #Documentation #Meta #Provenance #Quality #Referencing

Stakeholders

data consumer, data producer, data publisher

Problem statement

Used in its broader sense, the notion of "data quality" covers different aspects that may vary depending on the domain.

They include, but are not limited to:

  • Fitness for purpose.
  • Data precision / accuracy.
  • Compliance with given quality benchmarks, standards, specifications.
  • Quality assessments based on data review / users' feedback.

In order to provide a mechanism for the consistent representation of data quality, the most frequently used data quality aspects should be identified, based on existing standards (e.g., [ISO-19115]) and practices. Such aspects should also be used to identify possible common modeling patterns.

Existing approaches

Solutions for modeling data quality have been defined in [DCAT-AP], [GeoDCAT-AP], [StatDCAT-AP], [VOCAB-DQV], and [VOCAB-DUV]. They cover the following aspects:

  • Metadata conformance with a metadata standard.
  • Data conformance with a given data schema/model.
  • Data conformance with a given reference system (spatial or temporal).
  • Data conformance with a given quality specification / benchmark.
  • Associating data with a quality report.
  • Spatial / temporal resolution.
  • Data quality assessments expressed with quantitative test results.
  • Data quality assessments via users’ feedback.

Notably, the first four aspects (those related to "conformance") follow a common pattern, in that the reference vocabularies all model them using the property dcterms:conformsTo [DCTerms].
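A minimal sketch of that shared pattern (resource names are hypothetical; the EPSG URI is one commonly used identifier for a spatial reference system):

a:Dataset a dcat:Dataset ;
# Conformance with a data schema/model, a reference system, and a quality benchmark
  dcterms:conformsTo a:DataModel ,
    <http://www.opengis.net/def/crs/EPSG/0/4326> ,
    a:QualityBenchmark .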

Links

Requirements

  • Provide patterns for a consistent modeling of the different aspects of data quality

Related use cases

Modeling data precision and accuracy [ID15]

Status:

Identifier: ID15

Creator: Andrea Perego - European Commission, Joint Research Centre (JRC)

Deliverable(s): DCAT1.1

Tags

#DCAT #Quality #Referencing #Resolution

Stakeholders

data consumer, data producer, data publisher

Problem statement

Understanding the level of precision and accuracy of a dataset is fundamental to verify its fitness for purpose. This is typically denoted in terms of spatial or temporal resolution, but other dimensions are also possible.

Some metadata standards include elements for specifying precision. For instance, the latest version (2014) of [ISO-19115] supports specifying spatial resolution in terms of scale (e.g., 1:1,000,000), distance - further split into horizontal ground distance, vertical distance, and angular distance - and level of detail. However, [VOCAB-DCAT] does not provide guidance on how to model this information.

Actually, for some time [VOCAB-DCAT] included a property dcat:granularity to model precision, which was dropped in the final version of the vocabulary (see ISSUE-10 and, in particular, the mail proposing to drop the property).

Existing approaches

This issue was raised during the development of [VOCAB-DQV], and a solution has been proposed for modeling data precision in terms of spatial resolution - expressed as equivalent scale (e.g., 1:1,000,000) or distance (e.g., 1 km) - and data accuracy as a percentage; see [VOCAB-DQV], Section 6.13 Express dataset precision and accuracy. Notably, the same approach can be followed to model temporal resolution.

[SDWBP] addresses this problem as well, re-using the approach defined in [VOCAB-DQV]; additionally, it provides an example of how to specify accuracy by stating conformance with a quality standard - see [SDWBP], Best Practice 14: Describe the positional accuracy of spatial data.
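A minimal sketch along the lines of the example in [VOCAB-DQV], Section 6.13; the metric and the resource names are placeholders:

a:Dataset a dcat:Dataset ;
  dqv:hasQualityMeasurement a:ResolutionMeasurement .

a:ResolutionMeasurement a dqv:QualityMeasurement ;
# A user-defined dqv:Metric for spatial resolution as distance
  dqv:isMeasurementOf a:spatialResolutionAsDistance ;
# e.g. 1 km, expressed in metres (unit specification omitted)
  dqv:value "1000.0"^^xsd:decimal .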

Links

Requirements

  • Being able to specify dataset precision
  • Being able to specify dataset accuracy

Related use cases

Modeling conformance test results on data quality [ID16]

Status:

Identifier: ID16

Creator: Andrea Perego - European Commission, Joint Research Centre (JRC)

Deliverable(s): DCAT1.1

Tags

#DCAT #Quality #Referencing

Stakeholders

data consumer, data producer, data publisher

Problem statement

One of the ways of expressing data quality is to state whether or not a given dataset is conformant with a given quality standard / benchmark.

[ISO-19115] supports a way of modeling this information, by allowing one to state whether a given dataset passed a given conformance test. Moreover, [INSPIRE-MD] extends this approach by supporting an additional possible result, namely "not evaluated".

Another approach is provided by the [EARL] vocabulary, which provides a generic mechanism to model test results. More precisely, [EARL] supports the following possible outcome values (quoting from Section 2.7 OutcomeValue Class):

earl:passed
Passed - the subject passed the test.
earl:failed
Failed - the subject failed the test.
earl:cantTell
Cannot tell - it is unclear if the subject passed or failed the test.
earl:inapplicable
Inapplicable - the test is not applicable to the subject.
earl:untested
Untested - the test has not been carried out.

Existing approaches

[VOCAB-DQV] allows one to specify data conformance with a reference quality standard / benchmark. However, this can model only one of the possible scenarios - i.e., when data are conformant.

[GeoDCAT-AP] provides an alternative and extended way of expressing "conformance" by using [PROV-O], allowing the specification of additional information about conformance tests (when the test was carried out, by whom, etc.), as well as different conformance test results (namely, conformant, not conformant, not evaluated).

An example of the [GeoDCAT-AP] [PROV-O]-based representation of conformance is provided by the following code snippet:

a:Dataset a dcat:Dataset ;
  prov:wasUsedBy a:TestingActivity .
  
a:TestingActivity a prov:Activity ;
  prov:generated a:TestResult ;
  prov:qualifiedAssociation [ a prov:Association ;
# Here you can specify which is the agent who did the test, when, etc.
    prov:hadPlan a:ConformanceTest ] .
      
# Conformance test result
a:TestResult a prov:Entity ;
  dcterms:type <http://inspire.ec.europa.eu/metadata-codelist/DegreeOfConformity/conformant> .

a:ConformanceTest a prov:Plan ;
# Here you can specify additional information on the test
  prov:wasDerivedFrom <http://data.europa.eu/eli/reg/2014/1312/oj> .

# Reference standard / specification
<http://data.europa.eu/eli/reg/2014/1312/oj> a prov:Entity, dcterms:Standard ;
  dcterms:title "Commission Regulation (EU) No 1089/2010 of 23 November 2010 implementing 
                 Directive 2007/2/EC of the European Parliament and of the Council as regards 
                 interoperability of spatial data sets and services"@en ;
  dcterms:issued "2010-11-23"^^xsd:date .

The example states that the reference dataset is conformant with the Commission Regulation (EU) No 1089/2010 of 23 November 2010 implementing Directive 2007/2/EC of the European Parliament and of the Council as regards interoperability of spatial data sets and services. Since this case corresponds to the scenario supported in [VOCAB-DQV], the [PROV-O]-based representation above is equivalent to:

a:Dataset a dcat:Dataset ;
  dcterms:conformsTo <http://data.europa.eu/eli/reg/2014/1312/oj> .
  
# Reference standard / specification
<http://data.europa.eu/eli/reg/2014/1312/oj> a prov:Entity, dcterms:Standard ;
  dcterms:title "Commission Regulation (EU) No 1089/2010 of 23 November 2010 implementing 
                 Directive 2007/2/EC of the European Parliament and of the Council as regards 
                 interoperability of spatial data sets and services"@en ;
  dcterms:issued "2010-11-23"^^xsd:date .

Links

Requirements

  • Being able to express conformance with a given quality standard / benchmark
  • Being able to express data quality conformance test results

Related use cases

Data access restrictions [ID17]

Status:

Identifier: ID17

Creator: Andrea Perego - European Commission, Joint Research Centre (JRC)

Deliverable(s): DCAT1.1

Tags

#DCAT #Documentation #Usage_control

Stakeholders

data consumer, data producer, data publisher

Problem statement

The types of possible access restrictions on a dataset are one of the key filtering criteria for data consumers. For instance, while searching in a data catalogue, I may not be interested in data I cannot access (closed data), or in data requiring me to provide personal information (such as data that is accessible to anyone, but only after registration).

Moreover, it is often the case that different distributions of the same dataset are released with different access restrictions. For instance, a dataset containing sensitive information (such as personal data) should not be publicly accessible, although it may be possible to openly release a distribution in which these data are aggregated and/or anonymized.

Finally, whenever data are not publicly available, an explanation of the reason why they are closed should be provided - especially when the data are maintained by public authorities, or are the outcome of publicly funded research activities.

Existing approaches

[DCAT-AP] models this information at the dataset level by using property dcterms:accessRights [DCTerms], and defines three possible values:

Public
Definition: Publicly accessible by everyone.
Usage note/comment: Permissible obstacles include: registration and request for API keys, as long as anyone can request such registration and/or API keys.
Restricted
Definition: Only available under certain conditions.
Usage note/comment: This category may include: resources that require payment, resources shared under non-disclosure agreements, resources for which the publisher or owner has not yet decided if they can be publicly released.
Non-public
Definition: Not publicly accessible for privacy, security or other reasons.
Usage note/comment: This category may include resources that contain sensitive or personal information.

In addition to this, the JRC extension to [DCAT-AP] uses property dcterms:accessRights also at the distribution level, with the following possible values:

no limitations
The distribution can be anonymously accessed
registration required
The distribution can be accessed by anyone, but only after registration
authorization required
The distribution can be accessed only by authorized users
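A minimal sketch of both levels (resource names are hypothetical; the EU Publications Office access-right URI is an assumption):

a:Dataset a dcat:Dataset ;
# Dataset-level access rights, as in [DCAT-AP]
  dcterms:accessRights <http://publications.europa.eu/resource/authority/access-right/PUBLIC> ;
  dcat:distribution [ a dcat:Distribution ;
# Distribution-level access rights, as in the JRC extension
    dcterms:accessRights a:RegistrationRequired ] .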

Links

Requirements

  • Being able to specify access restrictions at both dataset and distribution level
  • Being able to specify why a dataset or distribution is not publicly accessible

Related use cases

Modeling service-based data access [ID18]

Status:

Identifier: ID18

Creator: Andrea Perego - European Commission, Joint Research Centre (JRC)

Deliverable(s): DCAT1.1

Tags

#DCAT #Documentation #Service

Stakeholders

data consumer, data producer, data publisher

Problem statement

This use case concerns how to model dataset distributions available via services / APIs (e.g., a SPARQL endpoint) rather than via direct file download. In such cases, it is necessary to know how to query the service / API to get the data. Moreover, an additional issue is that a service may provide access to more than one dataset. As a consequence, users do not know how to get access to the relevant subset of data accessible via a service / API.

Although this is a domain-independent issue, it is a key one in the geospatial domain, where data are typically made accessible via services (e.g., a view or download service) that require specific clients to be used. In metadata, the link to such services usually points to an XML document describing the service's "capabilities". This of course puzzles non-expert users, who expect instead to get the actual "data".

Some catalogue platforms (such as GeoNetwork and, to some extent, CKAN) are able to make this transparent for some services (typically, view services), but not for all. It would therefore be desirable to agree on a cross-domain and cross-platform approach to deal with this issue.

In [VOCAB-DCAT], the option of accessing data via a service / API is explicitly mentioned, with the recommendation to use dcat:accessURL to point to it. However, this property is meant to be used, generically, for indirect data download, so it is not enough to know that the URL points to a service endpoint rather than to a download page.

Actually, for some time [VOCAB-DCAT] included a class dcat:WebService (a subclass of dcat:Distribution) to specify that data is available via a service / API. Other subclasses of dcat:Distribution were also defined to specify direct data access (dcat:Download) and data access via an RSS/Atom feed (dcat:Feed). All these subclasses were dropped in the final version of the vocabulary (see ISSUE-8 / ISSUE-9, and the related discussion).

Existing approaches

A proposal to address this issue has been elaborated in the framework of the DCAT-AP implementation guidelines (see issue DT2: Service-based data access), where two main requirements have been identified:

  1. Denote distributions as pointing to a service / API, and not directly to the actual data.
  2. Provide a description of the API / service interface, along with the relevant query parameters, that can be directly used by software agents - either to access the data, or to make data access transparent to end users.

As far as point (1) is concerned, the proposal is to associate with distributions the following information:

  • Whether the access / download URL of a distribution points to data or to a service / API (dcterms:type).
  • In the latter case, we include the specification the service/API conforms to (dcterms:conformsTo).

An example is provided by the following code snippet. Here, the distribution's access URL points to a service implemented using the [WMS] standard of the Open Geospatial Consortium (OGC):

a:Dataset a dcat:Dataset; 
  dcat:distribution [ a dcat:Distribution ;
    dct:title "GMIS - WMS (9km)"@en ;
    dct:description "Web Map Service (WMS) - GetCapabilities"@en ;
    dct:license <http://publications.europa.eu/resource/authority/licence/COM_REUSE> ;
    dcat:accessURL <http://gmis.jrc.ec.europa.eu/webservices/9km/wms/meris/?dataset=kd490> ;
# The distribution points to a service
    dct:type <http://publications.europa.eu/resource/authority/distribution-type/WEB_SERVICE> ;
# The service conforms to the WMS specification
    dct:conformsTo <http://www.opengis.net/def/serviceType/ogc/wms> ] .

About point (2) (i.e., providing a description of the API / service interface), a number of options have been discussed (e.g., describing a service/API using an OpenSearch document), but no final decision has been taken.

Links

Requirements

  • Being able to denote distributions as pointing to a service / API, and not directly to the actual data.
  • Being able to provide a human- and machine-readable description of the API / service, and its interface.

Related use cases

Guidance on the use of qualified forms [ID19]

Status:

Identifier: ID19

Creator: Andrea Perego - European Commission, Joint Research Centre (JRC)

Deliverable(s): DCAT1.1

Tags

#DCAT #Documentation #Meta #Provenance #Quality #Referencing #Roles

Stakeholders

data consumer, data producer, data publisher

Problem statement

In most cases, the relationships between datasets and related resources (e.g., author, publisher, contact point, publications / documentation, input data, model(s) / software used to create the dataset) can be specified with simple binary properties available from widely used vocabularies such as [DCTerms] and [VOCAB-DCAT].

However, there may be a need to provide additional information concerning, e.g., the temporal context of a relationship, which requires a more sophisticated representation, similar to the "qualified" forms used in [PROV-O].

Besides [PROV-O], vocabularies such as [VOCAB-DQV] and [VOCAB-DUV] can be used for this purpose. However, there is a need for guidance on how to use them consistently, since the lack of modeling patterns makes it difficult to aggregate this information across metadata records and catalogs.

Moreover, it is important to define mappings between qualified and non-qualified forms (e.g., along the lines of what has been done in [PROV-DC]), not only to make their semantic relationships clear (e.g., dcterms:source is the non-qualified form of prov:qualifiedDerivation), but also to enable metadata sharing and re-use across catalogs that may support only one of the two forms (qualified / non-qualified).

Existing approaches

[GeoDCAT-AP] makes use of both qualified and non-qualified forms to model agent roles and data quality conformance test results.
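A minimal sketch of the mapping mentioned above between a non-qualified form and its qualified [PROV-O] counterpart (resource names are hypothetical):

# Non-qualified form
a:Dataset dcterms:source a:InputDataset .

# Equivalent qualified form
a:Dataset prov:qualifiedDerivation [ a prov:Derivation ;
  prov:entity a:InputDataset ] .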

Links

Requirements

  • Availability of modeling patterns for qualified forms
  • Being able to map qualified and non-qualified forms

Related use cases

Modelling resources different from datasets [ID20]

Status:

Identifier: ID20

Creator: Andrea Perego - European Commission, Joint Research Centre (JRC)

Deliverable(s): DCAT1.1

Tags

#DCAT #Meta #Service

Stakeholders

data consumer, data producer, data publisher

Problem statement

[VOCAB-DCAT] makes use of quite a general definition of dataset (quoting from [VOCAB-DCAT], Section 5.3 Class: Dataset: "A collection of data, published or curated by a single agent, and available for access or download in one or more formats."), which is meant to be used as broadly as possible (as stated in ISSUE-62).

As such, it could theoretically be used to model a variety of resources - including documents, software, images and audio-visual content. However, the solution adopted in [VOCAB-DCAT] is not able to address the following scenarios:

  1. Suppose that a data catalog includes records about data, as well as documents and software. If all of them are modeled just with dcat:Dataset, it is not possible for users to restrict their search to the specific type of resource they are interested in.
  2. In addition, suppose that the catalog also includes records about Web-based services (e.g., a SPARQL endpoint or any of the services used for geospatial data): can a service be considered a "dataset"? How should it be modeled?

These two scenarios are not hypothetical; they reflect what is typically included, e.g., in catalogs following the [ISO-19115] or [DataCite] standards, which model the documented resources in different ways, and both of which support records about services.

Existing approaches

[GeoDCAT-AP] provides a mechanism to model three of the more than 20 resource types supported in [ISO-19115] - namely, dataset, dataset series, and service.

The adopted approach is as follows:

  • Datasets and dataset series are modeled with dcat:Dataset.
  • The specific dataset "type" (dataset and dataset series) is denoted by using dcterms:type [DCTerms].
  • Services are modeled as dcat:Catalog in the case of a catalogue service, and with dctype:Service [DCTerms] in all other cases.
  • The type of service (discovery, download, view, etc.) is modeled by using dcterms:type.

A similar approach has been adopted in the study carried out by the European Commission's Joint Research Centre (JRC) to map [DataCite] to [DCAT-AP].

[DataCite] supports 14 resource types. Most of them fall into the generic [VOCAB-DCAT] definition of "dataset", so they are modeled with dcat:Dataset. Moreover, the DCMI Type Vocabulary [DCTerms] is used to model both the dataset "type" and those resource types that cannot be modeled as datasets (events, physical objects, services).
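A minimal sketch of the [GeoDCAT-AP] approach described above; the INSPIRE code-list URIs are assumptions used for illustration:

a:DatasetSeries a dcat:Dataset ;
  dcterms:type <http://inspire.ec.europa.eu/metadata-codelist/ResourceType/series> .

a:DownloadService a dctype:Service ;
  dcterms:type <http://inspire.ec.europa.eu/metadata-codelist/SpatialDataServiceType/download> .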

Links

Requirements

  • Being able to specify the dataset "type" (data, documents, software)
  • Being able to model resources that are not datasets (services, events)

Related use cases

Machine actionable link for a mapping client [ID21]

Status:

Identifier: ID21

Creator: Stephen Richard, Columbia University

Deliverable(s): (DCAT1.1, AP Guidelines, Content Negotiation)

Problem statement:

A geologic unit dataset has various service distributions, e.g. an OGC WFS v1.1.1 serving GeoSciML v3 GeologicUnit, GeoSciML Portrayal GeologicUnit, or GeoSciML v4 GeologicUnit; an OGC WMS v1.3.0 layer portrayed according to stratigraphic age, lithology, or stratigraphic unit; and an ESRI feature service. A user's map client software has a catalog search capability and requires GeoSciML v4 encoding in order to function correctly. The metadata must provide sufficient information about the distributions for the catalog client to filter the results offered to the user to only those services that offer such a distribution.

Links:

Requirements:

  • Metadata 'distribution' elements need a content model (see the white paper referenced above) associated with links, to communicate the expected protocol and the interchange formats (information model and encoding/serialization) available via the link. Content negotiation does not solve the problem very well because it requires the client to play a guessing game; it is better to explicitly identify the 'profile' for a link's behavior in the metadata so the client can pick the 'affordance' it needs.

Template link in metadata [ID22]

Status:

Identifier: ID22

Creator: Stephen Richard, Columbia University

Deliverable(s): (DCAT1.1, AP Guidelines, Content Negotiation)

Problem statement:

A dataset is offered via an OData endpoint, and the distribution link is a template with several parameters for which the user must provide values to obtain a valid response. The client must have a means to know the valid value domains for the parameters. This could be via a link to an OpenSearch or URI-template description document, or via metadata elements associated with the link that define the parameters and their domains.
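A minimal sketch of such a distribution; both ex: properties are hypothetical, and the template follows RFC 6570:

a:Distribution a dcat:Distribution ;
  dcat:accessURL <http://example.org/odata/Observations> ;
# Hypothetical property holding an RFC 6570 URI template
  ex:uriTemplate "http://example.org/odata/Observations{?startDate,endDate}" ;
# Hypothetical pointer to a document describing the parameters' value domains
  ex:parameterDescription <http://example.org/odata/Observations/opensearch.xml> .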

Links:

Requirements:

  • Support use of URI templates for distribution links.

Data Quality Vocabulary (DQV) Wish List left by the DWBP WG [ID23]

Status:

Identifier: ID23

Creator: Riccardo Albertoni (Consiglio Nazionale delle Ricerche), Antoine Isaac (VU University Amsterdam and Europeana)

Deliverable(s): (DCAT1.1, AP Guidelines, Content Negotiation)

Tags

#Meta #Quality

Stakeholders

Data publisher, data consumer, catalog maintainer, application profile publisher

Problem statement

As discussed in the recent W3C recommendation DWBP, “The quality of a dataset can have a big impact on the quality of applications that use it. As a consequence, the inclusion of data quality information in data publishing and consumption pipelines is of primary importance.” DQV is a new RDF vocabulary which extends DCAT with additional properties and classes suitable for expressing the quality of DCAT datasets and distributions. It defines concepts such as measures and metrics to assess the quality of user-defined quality dimensions, but it also places much importance on allowing many actors to assess the quality of datasets and publish their annotations, certificates and opinions about a dataset. The W3C DWBP Working Group left a list of possible topics to be developed which were not in scope or could not be covered by the DWBP group; in particular, some of the wishes left for the Data Quality Vocabulary (DQV) seem to be related to the activity of this group.

The list below groups some of the DQV wishes by the most likely impacted DXWG deliverable. Each requirement in the list might be expanded into a separate use case after a first scrutiny by the group. Some of the DQV wishes might be included either as use cases or as group issues. The choice of the most appropriate way of inclusion is affected by the level of commitment that DCAT1.1 will make about quality documentation, and by how much DCAT will rely on DQV for documenting dataset quality.

Links

VOCAB-DQV

DWBP Wish List

Requirements

  • DCAT deliverable - requirements for DCAT extension or, in case DCAT1.1 entirely relies on DQV for addressing the quality documentation, the next round of DQV specification;
    • Updating Data Quality Vocabulary wrt updates in W3C Permissions and Obligation Expression vocabulary, as per https://github.com/w3c/dxwg/issues/5
    • Specifying in which part of the dataset the quality issue is present as raised by Amrapali Zaveri and Anisa Rula (see DWBP mail and related to the use case ID18-Modeling conformance test results on data quality)
    • Adding attributes for the severity of a quality problem, as per discussion with Amrapali Zaveri and Anisa Rula (see DWBP mail)
    • Discuss adding attributes for the 'provenance' of a quality measurement in a part of a dataset, as per discussion with Amrapali Zaveri (see DWBP mail in DWBP mailing list)
    • Elaborate on Parameters for Quality Metrics (DWBP-Issue-223)
    • Multilingual Translation for DQV
    • Should we rename QualityCertificate? The current name is a little misleading: it suggests a quality certificate rather than an annotation pointing to a quality certificate.
  • AP Guidelines - Concrete case study
    • Guidance on how to specify integrity and cardinality constraints when defining an AP
      • Concrete case study: DQV wants to have more integrity conditions (in SHACL?) to enhance interoperability between DQV implementations.
    • Guidance on how to express alignment/compatibility between profiles (somehow related to the notions implied in the use cases ID16-Quality modeling patterns and ID21-Guidance on the use of qualified forms)
      • Concrete case study: alignment between DQV and the quality features of the HCLS dataset profile (DWBP-Issue-221)
      • Concrete case study: alignment between DQV and other quality vocabularies, trying to get these vocabularies to use DQV patterns instead of maintaining different wheels (Radulovic et al., Fürber et al., sister ontologies of daQ).
      • Guidance on how DQV can work with a quality statistics vocabulary shall be provided with future versions of the DQV documentation.
    • Guidance on how to publish an Application Profile


Related use cases

Harmonising INSPIRE-obligations and DCAT-distribution [ID24]

Status:

Identifier: ID24

Creator: Thomas D'haenens, Informatie Vlaanderen

Deliverable(s): DCAT1.1, AP Guidelines

Within our government agency we are struggling to combine two targets. On one side, we have a European obligation to share datasets about a wide range of topics (from environment to transport to ...), following the INSPIRE guidelines. These are for a major part in the spirit of georeferenceable datasets, are based on ISO standards, and go into much more detail than DCAT does. On the other side, we also have an open data policy and implementations based on DCAT (a much leaner metadata vocabulary).

We're now working on a way to map the INSPIRE-based descriptions to DCAT-based descriptions. Since INSPIRE is a European Regulation (thus obligatory for all European countries), this work ought to be done on a supranational level. At the least, I believe guidelines and mapping rules should be defined within both DCAT(-AP) and INSPIRE to enhance interoperability. The starting point should be that a dataset must be described only once (of course).

Links:
To see the INSPIRE-based metadata catalog, see https://metadata.geopunt.be/zoekdienst/apps/tabsearch/index.html?hl=dut (in Dutch)

Requirements:
Specific supranational guidelines, mapping tool

Distribution and synchronization of catalog information [ID25]

Status:

Identifier: ID25

Creator: Jaroslav Pullmann, Christian Mader (Fraunhofer)

Deliverable: DCAT1.1

Keywords: Catalog cardinality, Data distribution, Usage control

While the operation and co-existence of Catalog instances is out of scope here, the information model should support scenarios where Datasets are copied among a number of (specialized) Catalogs. By leveraging dedicated standards (ODRL, PROV-O) or extensions to DCAT, policies should e.g. define the distribution targets of the Dataset (copies to royalty-free catalogs only, no copies in the case of exclusive hosting) and its subsequent handling (keep the copy synchronized, preserve and display provenance information).
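
For example, a minimal Turtle sketch using ODRL (all URIs hypothetical) of an "exclusive hosting" policy that prohibits copying the Dataset to further Catalogs:

@prefix dcat: <http://www.w3.org/ns/dcat#> .
@prefix odrl: <http://www.w3.org/ns/odrl/2/> .

<http://data.example.org/datasets/7>
  a dcat:Dataset ;
  odrl:hasPolicy <http://data.example.org/policies/exclusive-hosting> .

<http://data.example.org/policies/exclusive-hosting>
  a odrl:Policy ;
  odrl:prohibition [
    odrl:target <http://data.example.org/datasets/7> ;
    odrl:action odrl:distribute      # no further distribution/copying
  ] .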

Requirements

  • Allow for explicit control of Dataset publication at dedicated Catalogs

Extension points to 3rd party vocabularies for modeling significant aspects of data exchange [ID26]

Status:

Identifier: ID26

Creator: Jaroslav Pullmann, Christian Mader (Fraunhofer)

Deliverables: DCAT1.1

Tags

#Meta

Stakeholders

DXWG members

Problem statement

Considering DCAT as a high-level model for data exchange, agree on significant aspects that are missing so far and define extension points (typically properties) for the re-use and integration of existing standards, application profiles etc. The reference listing of aspects deemed relevant is based on an evaluation of DXWG use cases and ISO 19115:2014:

  • Identification of Data(sub)sets (e.g. for purposes of citation)
  • Lineage, provenance and versioning (sources and processes applied)
  • Content description (internal data structure and semantics)
  • Context (spatial, temporal, socio-economic)
  • Reference system (spatial, temporal, socio-economic)
  • Quality, ratings and recommendations (?)
  • Distribution options (dynamic distribution, coverage of representation-, packaging- and compression formats)
  • Usage control, licensing (usage constraints and obligations, e.g. pricing)
  • Maintenance (scope and frequency of maintenance)

Requirements

  • Identify missing aspects of dataset descriptions (see reference listing)
  • Define corresponding extension points (ports)

Modeling temporal coverage [ID27]

Status:

Identifier: ID27

Creator: Andrea Perego - European Commission, Joint Research Centre (JRC)

Deliverable(s): DCAT1.1

Tags

#DCAT #Coverage #Documentation #Time

Stakeholders

data consumer, data producer, data publisher

Problem statement

[VOCAB-DCAT] uses dcterms:temporal [DCTerms] to specify the temporal coverage of a dataset, but does not provide guidance on how to specify the start / end date.

Actually, the only relevant example provided in [VOCAB-DCAT] makes use of a URI operated by reference.data.gov.uk, denoting a time interval modeled using [OWL-TIME]. Such a sophisticated representation may be relevant for some use cases, but it is quite cumbersome when the requirement is simply to specify a start / end date, and it makes it difficult to use temporal coverage as a filtering mechanism during the discovery phase.

Existing approaches

To address this issue, [VOCAB-ADMS] makes use of properties schema:startDate and schema:endDate [SCHEMA-ORG].

[DCAT-AP] follows the same approach.
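
A minimal sketch of that pattern (dataset URI hypothetical):

@prefix dcat:    <http://www.w3.org/ns/dcat#> .
@prefix dcterms: <http://purl.org/dc/terms/> .
@prefix schema:  <http://schema.org/> .
@prefix xsd:     <http://www.w3.org/2001/XMLSchema#> .

<http://data.example.org/datasets/1>
  a dcat:Dataset ;
  dcterms:temporal [
    a dcterms:PeriodOfTime ;
    schema:startDate "2016-01-01"^^xsd:date ;   # start of the covered period
    schema:endDate   "2016-12-31"^^xsd:date     # end of the covered period
  ] .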

(Alejandra Gonzalez-Beltran) Adding a few links to other existing approaches:

- DATS: https://docs.google.com/spreadsheets/d/1aHj_Qvlr7Sf4DlU4uc37PQEOPha8jTxMEdyL_Q7geLQ/edit#gid=0
- DataCite: http://schema.datacite.org/meta/kernel-4.0/doc/DataCite-MetadataKernel_v4.0.pdf
- Google datasets (schema.org): https://developers.google.com/search/docs/data-types/datasets

Links

  • [VOCAB-ADMS], Section 5.2.10 Period of time
  • [DCAT-AP]
  • OWL-Time (2017 revision) includes the following general-purpose predicates
    • time:hasTime to associate a time:TemporalEntity with anything
    • time:hasBeginning to associate a time:Instant with anything, though with the entailment that the subject is itself a time:TemporalEntity (or a member of a subclass, which is generally not hard)
    • time:hasEnd to associate a time:Instant with anything, though with the entailment that the subject is itself a time:TemporalEntity (or a member of a subclass, which is generally not hard)
  • SOSA/SSN Ontology defines sosa:phenomenonTime and sosa:resultTime, adapted from ISO 19156 (Observations and measurements)
    • sosa:phenomenonTime refers to world time
    • sosa:resultTime refers to data acquisition time
      • also briefly discussed was stimulusTime - being the time the act of data acquisition started (complementing resultTime, which is when it finished)
  • ISO 19156 also has om:validTime - being the time interval during which use of the result is recommended (important for forecasting applications)
  • OGC Met Ocean working group / UK MetOffice recognize the following temporal properties of ensemble forecast data
    • simulation event time
    • analysis time (aka run time or reference time)
    • assimilation window
    • datum time
    • forecast computation time
    • validity time
    • partial forecast time
    • re-analysis event time
    • forecast model run collection time

Requirements

  • Being able to specify the start / end date of the temporal coverage of a dataset

Related use cases

Modeling reference systems [ID28]

Status:

Identifier: ID28

Creator: Andrea Perego - European Commission, Joint Research Centre (JRC)

Deliverable(s): DCAT1.1

Tags

#DCAT #Documentation #Quality #Representation #Space

Stakeholders

data consumer, data producer, data publisher

Problem statement

One of the key pieces of information necessary to correctly interpret geospatial data is the spatial coordinate reference system used. For instance, a coordinate reference system can denote the order in which coordinates are specified (latitude / longitude or longitude / latitude), whether coordinates denote points, lines, surfaces or volumes, and which unit of measurement is used.

This information is normally included in geospatial metadata since, depending on the coordinate reference system used, a dataset may or may not be usable for specific use cases. Users can thus filter the relevant datasets during the discovery phase.

Used more broadly, the notion of "reference system" can be applied to other data as well. For instance, consider a dataset consisting of a set of measurements expressed as numbers. Are they percentages, or quantities using a specific unit of measurement?

Existing approaches

[SDWBP] addresses this issue in Best Practice 8, and illustrates a number of options that can be followed.

[GeoDCAT-AP] models this information by specifying data conformance with a given standard, as done in [VOCAB-DQV], which, in this case, is a spatial or temporal reference system. As far as spatial reference systems are concerned, they are denoted by the HTTP URIs operated by the OGC CRS register (see [SDWBP], Example 22):

@prefix ex:      <http://data.example.org/datasets/> .
@prefix dcat:    <http://www.w3.org/ns/dcat#> .
@prefix dcterms: <http://purl.org/dc/terms/> .
@prefix skos:    <http://www.w3.org/2004/02/skos/core#> .
@prefix xsd:     <http://www.w3.org/2001/XMLSchema#> .

ex:ExampleDataset 
  a dcat:Dataset ;
  dcterms:conformsTo <http://www.opengis.net/def/crs/EPSG/0/32630> .

<http://www.opengis.net/def/crs/EPSG/0/32630> 
  a dcterms:Standard, skos:Concept ;
  dcterms:type <http://inspire.ec.europa.eu/glossary/SpatialReferenceSystem> ;
  dcterms:identifier "http://www.opengis.net/def/crs/EPSG/0/32630"^^xsd:anyURI ;
  skos:prefLabel "WGS 84 / UTM zone 30N"@en ;
  skos:inScheme <http://www.opengis.net/def/crs/EPSG/0/> .

Links

Requirements

  • Being able to specify the reference system(s) used in a dataset

Related use cases

Modeling spatial coverage [ID29]

Status:

Identifier: ID29

Creator: Andrea Perego - European Commission, Joint Research Centre (JRC)

Deliverable(s): DCAT1.1

Tags

#DCAT #Coverage #Documentation #Space

Stakeholders

data consumer, data producer, data publisher

Problem statement

The "spatial" or "geographic coverage" of a dataset denotes the geographic area of the phenomena described in the dataset itself.

How dataset spatial coverage is specified varies depending on the domain and metadata standards used. However, the different solutions basically make use of two approaches (not mutually exclusive):

  1. The geographic area is denoted by a geographical name, possibly by using an identifier from a gazetteer (e.g., Geonames) or a registry concerning, e.g., administrative units (see, e.g., the NUTS).
  2. The geographic area is denoted by its "geometry", i.e., the geographic coordinates denoting its boundaries, its representative point (such as its centroid) or its bounding box.

Geometries are typically used when it is necessary to denote an arbitrary geographic area, which may not correspond to a specific geographical name. Examples include (but are not limited to) satellite images and data from sensors. Geometries are also used in existing data catalogs for discovery and filtering purposes (e.g., this feature is supported in GeoNetwork and CKAN). Moreover, spatial queries are supported by the majority of the existing triple stores (including those not supporting [GeoSPARQL]).

[VOCAB-DCAT] allows the specification of the spatial coverage of a dataset by using dcterms:spatial [DCTerms], and includes an example making use of an HTTP URI from Geonames denoting a geographical area.
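
For instance (hypothetical dataset URI; the Geonames URI denotes the Netherlands):

@prefix dcat:    <http://www.w3.org/ns/dcat#> .
@prefix dcterms: <http://purl.org/dc/terms/> .

<http://data.example.org/datasets/2>
  a dcat:Dataset ;
  # Spatial coverage given as a geographical name from a gazetteer
  dcterms:spatial <http://sws.geonames.org/2750405/> .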

However, no guidance is provided on how to denote arbitrary regions with a "geometry" (i.e., a point, a bounding box, a polygon), which is the typical way spatial coverage is specified in geospatial metadata.

The issue is particularly problematic since the existing vocabularies model this information in very different ways. Moreover, geometries can be expressed in a number of formats (e.g., [GML], WKT, GeoJSON [RFC7946]). This situation makes it difficult to use information on spatial coverage effectively, e.g., to support spatial search and filtering.

Existing approaches

[SDWBP] provides a comprehensive guidance on how to specify geometries in the Best Practices under Section 12.2.2 Geometries and coordinate reference systems.

As far as metadata are concerned, one of the documented approaches is the solution adopted in [GeoDCAT-AP], which models spatial coverage by using the property locn:geometry [LOCN], and recommends encoding the geometry using [GML] and/or [WKT] - see [SDWBP], Example 15:

@prefix dcat:      <http://www.w3.org/ns/dcat#> .
@prefix dcterms:   <http://purl.org/dc/terms/> .
@prefix geosparql: <http://www.opengis.net/ont/geosparql#> .
@prefix locn:      <http://www.w3.org/ns/locn#> .

<http://www.ldproxy.net/bag/inspireadressen/> a dcat:Dataset ;
  dcterms:title "Adressen"@nl ;
  dcterms:title "Addresses"@en ;
  dcterms:description "INSPIRE Adressen afkomstig uit de basisregistratie Adressen,
                   beschikbaar voor heel Nederland"@nl ;
  dcterms:description "INSPIRE addresses derived from the Addresses base registry,
                   available for the Netherlands"@en ;
  dcterms:isPartOf <http://www.ldproxy.net/bag/> ;
  dcat:theme <http://inspire.ec.europa.eu/theme/ad> ;
  dcterms:spatial [
    a dcterms:Location ;
    locn:geometry
# Bounding box in WKT
      "POLYGON((3.053 47.975,7.24 47.975,7.24 53.504,3.053 53.504,3.053 47.975))"^^geosparql:wktLiteral ,
# Bounding box in GML
      "<gml:Envelope srsName=\"http://www.opengis.net/def/crs/OGC/1.3/CRS84\">
         <gml:lowerCorner>3.053 47.975</gml:lowerCorner>
         <gml:upperCorner>7.24  53.504</gml:upperCorner>
       </gml:Envelope>"^^geosparql:gmlLiteral ,
# Bounding box in GeoJSON
      "{ \"type\":\"Polygon\",\"coordinates\":[[
           [3.053,47.975],[7.24,47.975],[7.24,53.504],[3.053,53.504],[3.053,47.975]
         ]] }"^^https://www.iana.org/assignments/media-types/application/geo+json
  ] .

Links

Requirements

  • Being able to specify spatial coverage with geometries

Related use cases

Comments

  • Jaroslav: As discussed in the telcon on 3rd of July, please elaborate on the types of regions that motivate the annotation of arbitrary geometries (e.g. unnamed regions like oceans, landscapes in the wild) and their potential usage (e.g. searching for datasets by pointing/selection within a map GUI, inferring relationships between datasets by geographical overlap, co-location and proximity)

Andrea Perego says: @Jaroslav, I've revised the UC accordingly, along the lines of what I said in https://lists.w3.org/Archives/Public/public-dxwg-wg/2017Jul/0026.html

Standard APIs for metadata profile negotiation [ID30]

Status:

Identifier: ID30

Creator: Andrea Perego - European Commission, Joint Research Centre (JRC)

Deliverable(s): Content negotiation

Tags

#Content_negotiation #Profile #Service

Stakeholders

data consumer, data producer, data publisher

Problem statement

Cross-catalog harvesting is not a recent practice. Standard catalog services, such as [OAI-PMH] and [CSW], have been designed to support this functionality. However, in the past, this was typically done across catalogs of homogeneous resources, usually pertaining to the same domain.

This has changed in recent years, especially with the publication of cross-sector catalogs of government data. A notable example is the European Data Portal, which harvests metadata from both cross-sector and thematic catalogs across EU Member States. In this scenario, one of the issues to be addressed is the heterogeneity of the metadata standards and harvesting protocols used across catalogs.

A partial solution is provided by the development of harmonized mappings between metadata standards (see, e.g., the geospatial and statistical extensions to [DCAT-AP]), and by enabling catalog platforms, such as CKAN and GeoNetwork, to support multiple harvesting protocols and to map different metadata standards into their internal representation.

An alternative approach is to enable catalogs to provide metadata in different profiles, using a standard harvesting protocol. Notably, standard protocols such as [OAI-PMH] and [CSW] already support serving records in different metadata schemas and serializations, by using specific query parameters. So, what is needed is an API-independent mechanism that can be used by clients with the existing catalog service protocols.

HTTP content negotiation may be the most viable solution, since HTTP is the protocol Web-based catalog services make use of. However, although HTTP allows metadata to be served in different formats, it does not support the ability to negotiate the metadata profile.

Existing approaches

The GeoDCAT-AP API was designed to enable [CSW] endpoints to serve [ISO-19115] metadata based on the [GeoDCAT-AP] profile, by using the standard [CSW] interface - i.e., parameters outputSchema (for the metadata profile) and outputFormat (for the metadata format).

HTTP content negotiation is supported to determine the returned metadata format, without the need to use the outputFormat parameter. The ability to also negotiate the profile would enable a client to query a [CSW] endpoint without needing to know the supported harvesting protocol.

Besides the resulting RDF serialisation of the source [ISO-19115] records, the API returns a set of HTTP Link headers, using the following relationship types:

  • derivedfrom: The URL of the source document, containing the ISO 19139 records.
  • profile: The URI of the metadata schema used in the returned document.
  • self: The URL of the document returned by the API.
  • alternate: The URL of the current document, encoded in an alternative serialization.

It is worth noting that, in its current definition, the relationship type alternate denotes just a different serialization, and so it cannot be used to list possible alternative metadata schemas.

Links

Requirements

  • Being able to negotiate a metadata profile via HTTP

Related use cases

Modeling funding sources [ID31]

Status:

Identifier: ID31

Creator: Alejandra Gonzalez-Beltran

Deliverable(s): DCAT1.1

Problem statement:

Many datasets (or catalogs) are produced with support from a sponsor/funder (e.g. scientific datasets that result from a study funded by a funding organisation, or datasets produced by governmental organisations), and the ability to describe and group them by funder is important across domains.
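
A possible sketch reusing schema:funder [SCHEMA-ORG] (all example.org URIs hypothetical; DCAT itself currently has no funder slot):

@prefix dcat:   <http://www.w3.org/ns/dcat#> .
@prefix foaf:   <http://xmlns.com/foaf/0.1/> .
@prefix schema: <http://schema.org/> .

<http://data.example.org/datasets/6>
  a dcat:Dataset ;
  schema:funder <http://data.example.org/orgs/funding-agency> .

<http://data.example.org/orgs/funding-agency>
  a foaf:Organization ;
  foaf:name "Example Funding Agency" .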

Links:

DATS github: https://github.com/biocaddie/WG3-MetadataSpecifications
And in particular grant schema: https://github.com/biocaddie/WG3-MetadataSpecifications/blob/master/json-schemas/grant_schema.json
Issue in schema.org tracker related to DATS: https://github.com/schemaorg/schemaorg/issues/1196
Issue in schema.org related to funding of a person/project/creative work: https://github.com/schemaorg/schemaorg/issues/383
Issue on improving dataset descriptions in schema.org tracker: https://github.com/schemaorg/schemaorg/issues/1083

Requirements:

Support the description of the funder of a dataset or catalog

Related use cases:

Relationships between Datasets [ID32]

Status:

Identifier: ID32

Creator: Alejandra Gonzalez-Beltran

Deliverable(s): DCAT1.1

Tags

#DCAT #Publication

Stakeholders

data consumer, data producer, data publisher

Problem statement

Datasets are related in many different ways, e.g. the relationships between the different versions of a dataset, 'has part' relationships between datasets, derivation, aggregation.

Examples of relationships:

  • aggregation
    • the Dryad repository defines the concept of a collection of datasets, for example for datasets related by topic; e.g. see the collection about Galapagos finches: http://datadryad.org/handle/10255/dryad.148
    • the Gene Expression Omnibus (GEO) repository has the concept of series for related data
  • derivation
    • in the Investigation/Study/Assay (ISA) model, it is possible to represent the workflow from raw data to processed data and to indicate the process that yielded the new data
  • citation
    • to represent data citation

See the list of relationTypes given in the DataCite schema: http://schema.datacite.org/meta/kernel-4.0/include/datacite-relationType-v4.xsd
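
A minimal Turtle sketch (hypothetical URIs) showing how some of these relationships could be stated with existing DCTerms and PROV-O properties:

@prefix dcat:    <http://www.w3.org/ns/dcat#> .
@prefix dcterms: <http://purl.org/dc/terms/> .
@prefix prov:    <http://www.w3.org/ns/prov#> .

# Aggregation: a collection dataset and one of its members
<http://data.example.org/collections/finches>
  a dcat:Dataset ;
  dcterms:hasPart <http://data.example.org/datasets/finches-1973> .

# Derivation: processed data derived from raw data
<http://data.example.org/datasets/finches-processed>
  a dcat:Dataset ;
  prov:wasDerivedFrom <http://data.example.org/datasets/finches-raw> .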

(Makx Dekkers) Specific cases of relationships that I have come across:

  • a dataset that contains multi-annual budget data (e.g. for a multi-annual programme) but also contains the data for individual years -- this could be as a spreadsheet with worksheets for each year and a sheet with the sum for the whole period
  • two datasets that contain the same data but differ in other aspects than format, for example currency, measurement units, resolution, projection -- if we adopt an understanding that distributions of a dataset may only differ in format

Existing approaches

See also the DataCite Metadata Kernel documentation: http://schema.datacite.org/meta/kernel-4.0/doc/DataCite-MetadataKernel_v4.0.pdf

Links

Requirements

  • Ability to represent the different relationships between datasets, including:
    • ability to represent the relationships between different versions of a dataset
    • ability to represent collections of datasets, to describe their inclusion criteria and to define the 'hasPart'/'partOf' relationships
    • ability to represent derivation, e.g. processed data that is derived from raw data

Related use cases:

Comments

It might not be easy to provide an exhaustive list of relation types, so maybe we will need a generic way for data producers to specify what the relationship is.

Summarization/Characterization of datasets [ID33]

Status:

Identifier: ID33

Creator: Alejandra Gonzalez-Beltran
Deliverable(s): DCAT1.1, AP Guidelines

Tags

#DCAT

Stakeholders

Data consumer, data producer, data publisher

Problem statement

Summary/descriptive statistics that characterize a dataset are important elements for getting a high-level overview of the dataset. This is particularly important for datasets that are not publicly accessible, but whose access could be requested under certain conditions.

Existing approaches

The HCLS dataset description includes a number of statistics for RDF datasets: https://www.w3.org/TR/hcls-dataset/
For healthcare data, there is the Automated Characterization of Health Information at Large-scale Longitudinal Evidence System (ACHILLES): https://www.ohdsi.org/analytic-tools/achilles-for-data-characterization/

Links

Requirements

  • To support the summarization/characterization of a dataset through summary statistics and similar metrics, perhaps by providing a pattern for publishers to supply statistics about a dataset


Related use cases

Comments

Relationships between Distributions of a Dataset [ID34]

Status:

Identifier: ID34

Creator: Makx Dekkers

Deliverable(s): DCAT1.1


Tags

#DCAT #Documentation #Packaging #Semantics

Stakeholders

Data publishers

Problem statement

DCAT defines a Distribution as "Represents a specific available form of a dataset. Each dataset might be available in different forms, these forms might represent different formats of the dataset or different endpoints. Examples of distributions include a downloadable CSV file, an API or an RSS feed". It turns out that people read this differently. The main interpretations are that (a) the data in different Distributions of the same Dataset differs *only* in format, i.e. the distributions contain the same data points in different representations, and (b) the data in different Distributions might be related in other ways, for example by containing different data points for similar observations, as in the same kind of data for different years.

Existing approaches

In the current situation, a variety of approaches can be observed. In an analysis of the data in the DataHub (see link), at least five different approaches could be observed.

Links

Requirements

Revise the definition of Distribution, making it clearer what a distribution is and what it is not, in order to provide better guidance for data publishers.

Datasets and catalogues [ID35]

Status:

Identifier: ID35

Creator: Makx Dekkers

Deliverable(s): (DCAT1.1)

Tags

#DCAT #Documentation

Stakeholders

Data publisher

Problem statement

The DCAT model contains a hierarchy of the main entities: a catalogue contains datasets and a dataset has associated distributions. This model does not contemplate a situation in which datasets exist outside of a catalogue, while in practice datasets may be exposed on the Web as individual entities without the description of a catalogue. Also, it may be inferred from the current model that a dataset, if it is defined as part of a catalogue, is part of only one catalogue; no consideration is given to the practice that datasets may be aggregated – for example when the European Data Portal aggregates datasets from national data portals.
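
For illustration, nothing in the current vocabulary prevents stating the following (hypothetical URIs), but the model gives no guidance on how the two descriptions relate or which one is authoritative:

@prefix dcat: <http://www.w3.org/ns/dcat#> .

# The same dataset listed by a national portal and by an aggregating portal
<http://national.example.org/catalogue>
  a dcat:Catalog ;
  dcat:dataset <http://data.example.org/datasets/5> .

<http://european.example.org/catalogue>
  a dcat:Catalog ;
  dcat:dataset <http://data.example.org/datasets/5> .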

Requirements

Clarify the relationships between datasets and zero, one or multiple catalogues. In particular, consideration of approaches to harvesting and aggregation – when descriptions of datasets are copied from one catalogue to another – contemplating the way that relationships between the descriptions can be maintained and how identifiers can be assigned that allow for linking back to the source descriptions.

Cross-vocabulary relationships [ID36]

Status:

Identifier: ID36

Creator: Makx Dekkers

Deliverable(s): (DCAT1.1)

Tags

#DCAT #Documentation #VoID #Data_Cube

Stakeholders

Data publisher

Problem statement

In the context of W3C working and interest groups (e.g. SWIG, GLD, DWBP) several overlapping vocabularies have been developed for the description of datasets: DCAT, VoID and Data Cube. These vocabularies define similar concepts, but it is not entirely clear how these concepts are related. For example, all three vocabularies define a notion of ‘dataset’ – dcat:Dataset, void:Dataset and qb:DataSet. These notions are similar but not entirely equivalent. For example, it has been argued that void:Dataset and qb:DataSet are more like a dcat:Distribution than a dcat:Dataset.

Requirements

Clarify how these approaches are similar or different and how they interact, for example in the form of guidelines on how to create a DCAT description of a VoID or Data Cube dataset.

Europeana profile ecosystem: representing, publishing and consuming application profiles of the Europeana Data Model (EDM) [ID37]

Status:

Identifier: ID37

Creator: Valentine Charles, Antoine Isaac

Deliverable(s): (AP guidelines, Content Negotiation)

Tags

#Content_negotiation #Documentation #Profile #Publication #Semantics

Stakeholders

  • Europeana: aggregates metadata describing more than 55 million cultural heritage objects from a wide variety of institutions (libraries, museums, archives and audiovisual archives) across Europe.
  • Europeana data providers: Cultural Heritage institutions providing content and metadata to Europeana
  • "Intermediate" domain metadata aggregators, gathering metadata for institutions on a specific domain (music, archaeology, theater…) and making it available for Europeana (and other data consumers)
  • Any consumer of CH metadata such as general research infrastructures, specialized virtual research environments, publishers & bibliographic agencies.

Problem statement

The metadata aggregated by Europeana is described using the Europeana Data Model (EDM), whose goal is to ensure interoperability between various cultural heritage data sources. EDM has been developed to be as re-usable as possible. It can be seen as an anchor to which various finer-grained models can be attached, ensuring their interoperability at a semantic level. The alignments done between EDM and other models such as CIDOC-CRM allow the definition of adequate application profiles that enable the transition from one model to another without hindering the interoperability of the data. Currently, Europeana itself maintains data in two flavours of EDM, each with a specific XML Schema (for RDF/XML data):

  • "EDM external": The metadata aggregated by Europeana from its data providers is being validated against the EDM external XML schema prior to being loaded into the Europeana database.
  • "EDM internal": This schema is meant for validation and enrichment inside the Europeana ingestion workflow where data is reorganised to add "proxies" to distinguish provider data from Europeana data and certain properties are added to support the portal or the API. It is not meant to be used by data providers. The metadata complying with this schema is outputted via the Europeana APIs.

Both "external" and "internal" schemas are available at https://github.com/europeana/corelib/tree/master/corelib-solr-definitions/src/main/resources/eu

Because XML can’t capture all the constraints expressed in the EDM, an additional set of rules was defined using Schematron and embedded in the XML schema. These technical choices impose limitations on the constraints that can be checked, and result in a validation approach less suitable for Linked Data (XML imposes a document-centric approach).

Europeana is not the only one designing and consuming different profiles of EDM in its ecosystem.

  • The Digital Public Library of America has created its Metadata Application Profile (MAP), based on EDM
  • Intermediate domain metadata aggregators have explored developing profiles of EDM that represent the specificity of their domain. One of the main motivations is that they can use these profiles to ingest, exploit and/or re-publish datasets with less data loss than if they used the 'basic' EDM ingested by Europeana. Some Europeana data providers and aggregators have started to experiment with Semantic Web technologies to represent their own application profiles of EDM:
    • Europeana Sounds
    • Digital Manuscripts to Europeana
    • Performing Arts
    • etc

Finally, some third-party sources of interest (esp. authority data, thesauri, gazetteers) use models that are building blocks of EDM, like SKOS (i.e. EDM can itself be seen as an application profile / extension of SKOS). Sometimes these sources publish their data in different flavours at once (e.g. http://viaf.org), which makes data consumption both easier (a consumer can find the data elements it can consume) and more difficult (a consumer has to separate elements of interest from irrelevant ones).

Europeana has identified two types of AP:

  • A refinement of EDM is any kind of specialisation of EDM to meet specific needs of the data provider, typically with specific constraints on the existing EDM elements.
  • An extension to EDM is required when existing EDM classes and properties cannot represent the semantics of providers’ data with sufficient details.

Currently, data providers who would like to provide their data to Europeana using their own profiles are unable to do so, even when these profiles would be 'compatible' with the Europeana one for ingestion (which typically happens in the case of a basic EDM extension that adds fields on top of the Europeana profile). This is chiefly because of XML rigidities: Europeana ingestion expects a reference to only one profile/schema, and will not recognize profiles that are compatible with it.

Requirements

  • Each application profile needs to be documented, preferably by showing/reusing what is common across profiles
  • Machine-readable specifications of application profiles need to be easily publishable, and to optimize re-use of existing specifications.
  • Application profiles need a rich expression for the validation of metadata
  • Data publishers (data providers, intermediary aggregators, Europeana and DPLA) need to be able to indicate the profile to which a certain piece of data (a record describing an individual cultural object, or a whole dataset) belongs.
  • Data publishers need to be able to serve different profiles of the same data via the same data publication channel (Web API)
  • Data consumers (intermediary aggregators, Europeana and DPLA, data consumers) need to be able to specify the profile they are interested in
  • Europeana needs to be able to accept data described using EDM extensions that are compatible with its EDM-external profile, whether it doesn't ingest this data entirely (i.e. some elements will be left out as they are useless for the main Europeana Collections portal) or it does ingest it (e.g. for Thematic Collections portals or domain-specific applications that Europeana or third parties would develop)

Links

Time-related aspects [ID38]

Status:

Identifier: ID38

Creator: Jaroslav Pullmann with contributions by Andrea Perego, Simon Cox et al.

Deliverable(s): (DCAT1.1)

Tags

#Meta #Time #Coverage #Quality #Resolution #Lifecycle #Usage_control

Stakeholders

Data authors, data publishers, data consumers

Problem statement

There is an evident demand for capturing various types of time-related information in DCAT. This meta use case provides a topic overview and a summary of general requirements on temporal statements shared among detailed use cases, each dealing with an individual aspect.

There are two basic layers where temporal modeling applies: the content layer (a) and the publication life-cycle layer (b). The former refers to the different time dimensions of the data and its elicitation process, i.e. occurrence (phenomenon), overall coverage (scope), observation time etc. The latter considers stages of the DCAT publication process, independently of any domain or content.

While the use cases differ with regard to the purpose and interpretation of the temporal expressions, some general patterns become apparent. There are references to singular or recurrent expressions, either named (last week, the Middle Ages, Thanksgiving Day) or formal and numeric (e.g. ISO 8601). These might be relative (today, P15M) or absolute, and may represent an instant or an interval.

The description of evidence and motivation in context for these expressions is delegated to sub-use cases.

Requirements

TBD

Related use cases

Comments

Possible use cases at content level (a)

  • Temporal coverage, done - ID27 exists, expresses the boundaries of the dataset's phenomenon times (first, last)
  • Temporal resolution of a time series (sampling/observation rate), implied by ID15
  • Profile recommendation on a standardized annotation of phenomenon time for single values (e.g. sosa:phenomenonTime)
  • TBD

Possible use cases at life-cycle and publishing level (b)

  • Creation and modification time, already covered by dct:issued and dct:modified
  • Data Retention (related to usage control): The copy of Dataset should be removed after this date
  • Expiry of data: The data is considered outdated / unsupported after this date
  • Expiry of record: The record will become obsolete after this point (and e.g. should be removed from catalog)
  • TBD

DCAT Metadata profile integration [ID39]

Status:

Identifier: ID39

Creator: Lieven Raes, Thomas D'haenens

Deliverable(s): (DCAT1.1)

Tags

#DCAT #Meta #Publication #Referencing

Stakeholders

Data authors, data publishers, data consumers

Problem statement

In the field, we see people describing their datasets confronted with different regulations/profiles etc., each with its own target/goal. Slowly we're starting to cross domain boundaries (especially between geo and open data - on a high level), but the process is still hard. This is partly due to the lack of guidelines/recommendations on a higher level (W3C, OGC).

E.g. within the OTN (Open TransportNet) project, harmonization work has been done on different levels (more info: https://www.slideshare.net/plan4all/white-paper-data-harmonization-interoperability-in-opentransportnet). The risk exists that when everyone starts to do so, we lose interoperability along the way.

Existing approaches

GeoDCAT-AP is a first attempt at bridging the gap between Geo and Open - https://joinup.ec.europa.eu/node/139283. Within Informatie Vlaanderen, a project is running to combine the two worlds in one catalogue with an automated mapping - https://www.w3.org/2016/11/sdsvoc/SDSVoc16_PPT_v02


Links

See above

Requirements

  • (at least) a generic recommendation/guideline on how to proceed with this problem
  • (if possible) a start with a joint W3C-OGC-INSPIRE-JRC effort to harmonize standards regarding dataset descriptions

Related use cases

Comments

Discoverability by mainstream search engines [ID40]

Status:

Identifier: ID40

Creator: Rob Atkinson

Deliverable(s): (DCAT1.1, AP Guidelines)

Tags

Stakeholders

data publishers, search engines, data users

Problem statement

Major search engines use mechanisms formalised via schema.org to extract structured metadata from Web resources. It is possible, but not a given, that some may directly support DCAT in future. Regardless, consideration should be given to exposing DCAT content using equivalent schema.org elements - and this may perhaps be a case for content negotiation by profile, where equivalent schema.org properties are entailed in a DCAT graph.

Existing approaches

Schema.org defines a range of equivalent properties
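
A sketch of the kind of mapping axioms that could support such entailment; whether these are exact equivalences (rather than, e.g., subclass relations) is precisely what would need to be agreed:

@prefix dcat:   <http://www.w3.org/ns/dcat#> .
@prefix owl:    <http://www.w3.org/2002/07/owl#> .
@prefix schema: <http://schema.org/> .

# Candidate class mappings (illustrative, not agreed)
dcat:Dataset      owl:equivalentClass schema:Dataset .
dcat:Catalog      owl:equivalentClass schema:DataCatalog .
dcat:Distribution owl:equivalentClass schema:DataDownload .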

Links

Requirements

  • Define schema.org equivalents for DCAT properties to support entailment of schema.org-compliant profiles of DCAT records.

Related use cases

Comments

DCAT is under consideration for support by at least one search engine, but the relatively low current total volume of DCAT content makes production support for this uncertain.

Vocabulary constraints [ID41]

Status:

Identifier: ID41

Creator: Karen Coyle

Deliverable(s): (AP Guidelines)

Tags

Optional space-separated list of tags out of the above catalog (extend on demand)

Stakeholders

Data producers, data consumers. In particular this facilitates sharing between different data consumers.

Problem statement

When considering using data produced by someone else, it is necessary to know not only what their vocabulary terms are, but how those terms are used. This means that you need to know

  • which terms are mandatory
  • what the cardinality rules are
  • what the valid values are
  • what dependencies exist between elements of the vocabulary
  • etc.

It would be ideal if the profile could be translated into a validation language (such as ShEx or SHACL). If not, it should at least be able to link to such a language.
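
As an illustration, the kinds of constraints listed above can be expressed in SHACL (itself written in Turtle); the shape URI and the chosen constraints are hypothetical:

@prefix sh:      <http://www.w3.org/ns/shacl#> .
@prefix dcat:    <http://www.w3.org/ns/dcat#> .
@prefix dcterms: <http://purl.org/dc/terms/> .
@prefix ex:      <http://example.org/shapes#> .

ex:DatasetShape
  a sh:NodeShape ;
  sh:targetClass dcat:Dataset ;
  sh:property [
    sh:path dcterms:title ;
    sh:minCount 1 ;     # mandatory term
    sh:maxCount 1       # cardinality rule
  ] ;
  sh:property [
    sh:path dcterms:accrualPeriodicity ;
    # valid values restricted to a fixed list
    sh:in ( <http://publications.europa.eu/resource/authority/frequency/ANNUAL>
            <http://publications.europa.eu/resource/authority/frequency/MONTHLY> )
  ] .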

Existing approaches

Links

Requirements

  • Profiles must support declaration of vocabulary constraints

Related use cases

Comments

Metadata Guidance Rules [ID42]

Status:

Identifier: ID42

Creator: Karen Coyle

Deliverable(s): (AP Guidelines)

Tags

Optional space-separated list of tags out of the above catalog (extend on demand)

Stakeholders

Data consumers

Problem statement

The GLAM communities (galleries, libraries, archives, museums) produce metadata based on a small set of known guidance rules. These rules determine choices made in creating the metadata such as: form of names for people, families and organizations; selection of primary titles; use of vocabularies like language lists, subject lists, genre and form lists, geographic designators. There needs to be a place in a profile to indicate which of the relevant standards was used in producing the metadata.

Existing approaches

The primary metadata format used by libraries includes these, but that is a very narrow case.

Links

Requirements

  • It must be possible to include in the profiled vocabulary, for a set of data, an indication of which guidance rules were applied.

Related use cases

Comments

Description of dataset compliance with standards [ID43]

Status:

Identifier: ID43

Creator: Alejandra Gonzalez-Beltran

Deliverable(s): DCAT1.1, AP Guidelines

Tags

#DCAT

Stakeholders

Data consumer, data producer, data publisher

Problem statement

Dataset distributions may or may not comply with different types of standards, e.g. they may be represented in specific formats, follow specific content guidelines, be annotated with specific ontologies, or comply with standards for describing their use of identifiers. Compliance with specific standards is useful information for data consumers, data producers and data publishers, and it may help identify how to use a dataset, what tools may be needed, etc. DCAT currently supports describing the file format of a dataset distribution, but it is not possible to indicate compliance with other types of standards.
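
One candidate pattern, sketched below with hypothetical URIs, is to reuse dcterms:conformsTo on the distribution alongside its format:

@prefix dcat:    <http://www.w3.org/ns/dcat#> .
@prefix dcterms: <http://purl.org/dc/terms/> .

<http://data.example.org/datasets/4/distributions/1>
  a dcat:Distribution ;
  dcat:mediaType "text/tab-separated-values" ;                               # file format
  dcterms:conformsTo <http://standards.example.org/content-guideline/v1> ,   # content guideline (hypothetical)
                     <http://standards.example.org/identifier-policy/v2> .   # identifier standard (hypothetical)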


Existing approaches

Links

Requirements

  • Ability to describe the standards to which the dataset conforms.

Related use cases

https://www.w3.org/2017/dxwg/wiki/Use_Case_Working_Space#Vocabulary_constraints

Comments

Identification of versioned datasets and subsets [ID44]

Status:

Identifier: ID44

Creator: Jaroslav Pullmann, Keith Jeffery

Deliverable(s): (DCAT1.1)

Tags

#Dataset_concept #Lifecycle #Publication #Referencing

Stakeholders

Data consumers

Problem statement

A prerequisite for communicating, annotating or linking a dataset (or a defined part of it) is its unambiguous identification. Since a dataset and its distributions might evolve over time, the identification method has to take their versioning into account. The respective distributions might differ significantly in terms of media type and further serialization properties, and should therefore have distinct identifiers.

Existing approaches

While DCAT currently does not support resource versioning, subsets (slices) and derivations of a Dataset might be specified as separate, related Dataset instances. Each one is exposed by a set of dedicated Distribution resources identified by a resolvable URI. These Distribution URIs are used to refer to a particular materialization of the (abstract) Dataset. Their design preferably follows RESTful URI naming conventions. Referencing Distribution metadata has the benefit of providing access to related properties, e.g. usage restrictions and licensing. Conversely, when there are multiple independent copies of a Dataset's metadata across Catalogs, this method suffers from generating alternative identifiers for the same resource (i.e. the same access/download target).
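
A sketch of this approach (hypothetical URIs), with version information carried in RESTfully designed Distribution URIs:

@prefix dcat:    <http://www.w3.org/ns/dcat#> .
@prefix dcterms: <http://purl.org/dc/terms/> .

# A subset (slice) published as a separate, related Dataset
<http://data.example.org/datasets/survey>
  a dcat:Dataset ;
  dcterms:hasPart <http://data.example.org/datasets/survey/2017> .

# A version-specific, resolvable Distribution URI for the subset
<http://data.example.org/datasets/survey/2017>
  a dcat:Dataset ;
  dcat:distribution <http://data.example.org/datasets/survey/2017/v2/csv> .

<http://data.example.org/datasets/survey/2017/v2/csv>
  a dcat:Distribution ;
  dcat:downloadURL <http://download.example.org/survey/2017/v2.csv> .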

Requirements

  • Define a means to capture the identity of a serialized DCAT Data(sub)set

Related use cases

Comments

Annotating data quality [ID45]

Status:

Identifier: ID45

Creator: Makx Dekkers

Deliverable(s): (DCAT1.1)

Tags

#DCAT #Quality #Publication

Stakeholders

data producer, data publisher, data consumer (of statistical data)

Problem statement

In many cases, data producers and data publishers may want to inform the data consumers about the quality aspects of the data so that consumers better understand the possibilities and risks of using and reusing the data.

Data producers may have human-readable, textual information or more precise machine-readable information either as part of their publication process or as external resources that they can attach to the description of the dataset.

Existing approaches

The European StatDCAT application profile for data portals in Europe specifies the optional use of the property dqv:hasQualityAnnotation with a range of a subclass of oa:Annotation from the Open Annotation Model https://www.w3.org/ns/oa, which allows annotations to be either embedded text or an external resource identified by a URI.
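
A minimal sketch of that pattern (hypothetical URIs):

@prefix dcat: <http://www.w3.org/ns/dcat#> .
@prefix dqv:  <http://www.w3.org/ns/dqv#> .
@prefix oa:   <http://www.w3.org/ns/oa#> .

<http://data.example.org/datasets/3>
  a dcat:Dataset ;
  dqv:hasQualityAnnotation [
    a dqv:QualityAnnotation ;
    oa:hasTarget <http://data.example.org/datasets/3> ;
    # The body may be embedded text or, as here, an external resource
    oa:hasBody <http://reports.example.org/quality/3.pdf> ;
    oa:motivatedBy dqv:qualityAssessment
  ] .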

Links

Requirements

Define a way to associate quality-related information with Datasets.

Related use cases

  • ID9
  • ID14
  • ID15
  • ID23
  • ID26
  • ID28

Comments

Profile support for input functions [ID46]

Status:

Identifier: ID46

Creator: Karen Coyle

Deliverable(s): (AP Guidelines)

Tags

#Profile #Semantics #Documentation

Stakeholders

user interface developers, data input staff

Problem statement

Profiles can be used to drive input forms for staff creating the data. To facilitate this, as many features as possible of a good input environment need to be supported. Profiles need to have suitable rules for the validation of values, such as date forms and pick lists. There need to be human-readable definitions of terms and, if needed, instructions for input that would accompany a property and its value.
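
A sketch of how SHACL's non-validating annotation properties (sh:name, sh:description) could carry the two documentation levels requested below (shape URI hypothetical):

@prefix sh:      <http://www.w3.org/ns/shacl#> .
@prefix dcterms: <http://purl.org/dc/terms/> .
@prefix xsd:     <http://www.w3.org/2001/XMLSchema#> .
@prefix ex:      <http://example.org/shapes#> .

ex:IssuedDateShape
  a sh:PropertyShape ;
  sh:path dcterms:issued ;
  sh:datatype xsd:date ;                 # value validation (date form)
  sh:name "Date issued"@en ;             # 1) short definition for the input form
  sh:description "Enter the formal issuance date as YYYY-MM-DD."@en .   # 2) input and editing guidance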

Existing approaches

Links

Optional link list to documents and projects this use case refers to

Requirements

  • Profiles must support the functions of input forms
  • Profiles must be able to fully define valid values for properties
  • Profiles must have properties for at least two levels of documentation: 1) short definition 2) input and editing guidance

Related use cases

Comments

Define update method [ID47]

Status:

Identifier: ID47

Creator: Karen Coyle

Deliverable(s): (DCAT1.1)

Tags

Optional space-separated list of tags out of the above catalog (extend on demand)

Stakeholders

data consumers

Problem statement

In the library environment, datasets are issued as periodic aggregated (and up-to-date) files with daily or weekly changes to that file as supplements. The change files have new records that are additions to the file, changed records that must replace the record with the same identifier in the file, and deleted records that must result in the matching record being removed from the local copy of the file.

Existing approaches

Links

Requirements

  • Users of the dataset must be able to discover the update method used with the dataset, such as whether each new dataset entirely supersedes previous ones (is stand-alone), or whether there is a base dataset with files that effect updates to that base.

Related use cases

Comments

Profile relation to validation [ID48]

Status:

Identifier: ID48

Creator: Karen Coyle

Deliverable(s): (AP Guidelines)


Tags

Optional space-separated list of tags out of the above catalog (extend on demand)

Stakeholders

data producer, data publisher, validation program(s)

Problem statement

Project X has decided to make its datasets available as open access, downloadable. They do not know who will find the datasets useful but assume that some potential users are outside of Project X's immediate community. They need a way to describe their metadata and its usage such that anyone can work with the datasets, and they hope to do this with a profile that is machine-readable, human-understandable, and that defines the criteria for valid data.

Project X has some datasets that have a separate validation process, for example XML datasets with XML schemas. For some other datasets, there is no validation code; for those, the profile will need to suffice. Also, because Project X cannot know the identity of all of the users of its data, much less their technical capabilities, it cannot assume that the users can make use of available validation code. For this reason, each profile needs to be usable both with and without the validation code that the project can provide.

Existing approaches

Links

Optional link list to documents and projects this use case refers to

Requirements

  • The profile should clarify any relationship between profiles and available validation documents or code
  • The profile must be usable with or without validation documents or code
  • The profile should substitute for coded validation (e.g. ShEx, SHACL) where the latter does not exist or may not be usable by all recipients

Related use cases

Comments

Dataset business context [ID49]

Status:

Identifier: ID49

Creator: Peter Brenton, Simon Cox (CSIRO)

Deliverable(s): (DCAT1.1, AP Guidelines, Project Ontology (?))

Tags

#DCAT #Documentation #Lifecycle #Provenance #Roles #Semantics #Service #Usage_control

Stakeholders

data consumer, data producer, data publisher

Problem statement

It is helpful and often essential to know the business context in which one or more datasets are created and managed, in particular concerning the project, program or initiative through which the dataset was generated. These are typically associated with funding or policy.

The business context links associated entities participating in a project. Projects can be an umbrella or unifying entity for one or many datasets which share the same project context.

DCAT or users of DCAT have often used externally defined classes for associated concepts from FOAF and the W3C Organization ontology, but there is currently no slot or guidance about how to relate a dataset to its business context. Moreover, there is no general agreement on a class for 'Project'. Such a class might include spatial, temporal, social, descriptive and financial information. There are a number of discipline- or domain-specific Project classes (see Links below), but there does not appear to be anything available which is sufficiently expressive and generic.

As part of the DXWG there might be an opportunity to define a basic ontology for projects and related concepts. This should have a tight scope and few dependencies, similar to the approach used in W3C Organization ontology.

Existing approaches

  • VIVO Project
  • PPSR Core

Links

Requirements

  • class for Project
  • predicate to be used with DCAT to relate a Dataset to a Project

Related use cases

Comments