Copyright © 2015 W3C® (MIT, ERCIM, Keio, Beihang). W3C liability, trademark and document use rules apply.
This document provides best practices related to the publication and usage of data on the Web designed to help support a self-sustaining ecosystem. Data should be discoverable and understandable by humans and machines. Where data is used in some way, whether by the originator of the data or by an external party, such usage should also be discoverable and the efforts of the data publisher recognized. In short, following these best practices will facilitate interaction between publishers and consumers.
This section describes the status of this document at the time of its publication. Other documents may supersede this document. A list of current W3C publications and the latest revision of this technical report can be found in the W3C technical reports index at http://www.w3.org/TR/.
This version of the document shows its expected scope and future direction. A template is used to show the "what", "why" and "how" of each best practice. Comments are sought on the usefulness of this approach and the expected scope of the final document.
This document was published by the Data on the Web Best Practices Working Group as a Working Draft. This document is intended to become a W3C Recommendation. If you wish to make comments regarding this document, please send them to public-dwbp-wg@w3.org (subscribe, archives). All comments are welcome.
Publication as a Working Draft does not imply endorsement by the W3C Membership. This is a draft document and may be updated, replaced or obsoleted by other documents at any time. It is inappropriate to cite this document as other than work in progress.
This document was produced by a group operating under the 5 February 2004 W3C Patent Policy . W3C maintains a public list of any patent disclosures made in connection with the deliverables of the group; that page also includes instructions for disclosing a patent. An individual who has actual knowledge of a patent which the individual believes contains Essential Claim(s) must disclose the information in accordance with section 6 of the W3C Patent Policy .
This document is governed by the 1 September 2015 W3C Process Document.
This section is non-normative.
The best practices described below have been developed to encourage and enable the continued expansion of the Web as a medium for the exchange of data. The growth of open data by governments across the world [ OKFN-INDEX ], the increasing publication of research data encouraged by organizations like the Research Data Alliance [ RDA ], the harvesting and analysis of social media, crowd-sourcing of information, the provision of important cultural heritage collections such as at the Bibliothèque nationale de France [ BNF ] and the sustained growth in the Linked Open Data Cloud [ LODC ], provide some examples of this phenomenon.
In broad terms, data publishers aim to share data either openly or with controlled access. Data consumers (who may also be producers themselves) want to be able to find and use data, especially if it is accurate, regularly updated and guaranteed to be available at all times. This creates a fundamental need for a common understanding between data publishers and data consumers. Without this agreement, data publishers' efforts may be incompatible with data consumers' desires.
Publishing data on the Web creates new challenges, such as how to represent, describe and make data available in a way that it will be easy to find and to understand. In this context, it becomes crucial to provide guidance to publishers that will improve consistency in the way data is managed, thus promoting the reuse of data and fostering trust in the data among developers, whatever technology they choose to use, increasing the potential for genuine innovation.
This document sets out a series of best practices that will help publishers and consumers face the new challenges and opportunities posed by data on the Web.
Best practices cover different aspects related to data publishing and consumption, like data formats, data access, data identifiers and metadata. In order to delimit the scope and elicit the required features for Data on the Web Best Practices, the DWBP working group compiled a set of use cases [ UCR ] that represent scenarios of how data is commonly published on the Web and how it is used. The set of requirements derived from these use cases was used to guide the development of the best practices.

The Best Practices proposed in this document are intended to serve a more general purpose than the practices suggested in Best Practices for Publishing Linked Data [ LD-BP ] since they are domain-independent and, whilst they recommend the use of Linked Data, they also promote best practices for data on the Web in formats such as CSV [ RFC4180 ] and JSON [ RFC4627 ]. The Best Practices related to the use of vocabularies incorporate practices that stem from Best Practices for Publishing Linked Data where appropriate.
This section is non-normative.
This document provides best practices to those who publish data on the Web. The best practices are designed to meet the needs of information management staff, developers, and wider groups such as scientists interested in sharing and reusing research data on the Web. While data publishers are our primary audience, we encourage all those engaged in related activities to become familiar with it. Every attempt has been made to make the document as readable and usable as possible while still retaining the accuracy and clarity needed in a technical specification.

Readers of this document are expected to be familiar with some fundamental concepts of the architecture of the Web [ WEBARCH ], such as resources and URIs, as well as a number of data formats. The normative element of each best practice is the intended outcome. Possible implementations are suggested and, where appropriate, these recommend the use of a particular technology such as CSV, JSON and RDF. A basic knowledge of vocabularies and data models would be helpful to better understand some aspects of this document.
This section is non-normative.
This document is concerned solely with best practices that:
As noted above, whether a best practice has or has not been followed should be judged against the intended outcome , not the possible approach to implementation which is offered as guidance. A best practice is always subject to improvement as we learn and evolve the Web together.
This section is non-normative.
In general, the Best Practices proposed for publication and usage of Data on the Web refer to datasets and distributions. Data is published in different distributions, which is a specific physical form of a dataset. By data, "we mean known facts that can be recorded and that have implicit meaning" [ Navathe ]. These distributions facilitate the sharing of data on a large scale, which allows datasets to be used by several groups of data consumers, without regard to purpose, audience, interest, or license. Given this heterogeneity and the fact that data publishers and data consumers may be unknown to each other, it is necessary to provide some information about the datasets which may also contribute to trustworthiness and reuse, such as: structural metadata, descriptive metadata, access information, data quality information, provenance information, license information and usage information.

Another important aspect of publishing and sharing data on the Web concerns the architectural basis of the Web, as discussed in [ WEBARCH ]. The DWBP document is mainly interested in the Identification principle, which says that URIs should be used to identify resources. In our context, a resource may be a whole dataset or a specific item of a given dataset. All resources should be published with stable URIs, so that they can be referenced and linked, via URI, to one another.

The following diagram illustrates the dataset composition (data values and metadata) together with other components related to dataset publication and usage. Data values correspond to the data itself and may be available in different forms; these forms might represent different formats of one or more distributions, which should be defined by the publisher considering data consumers' expectations. The Metadata component corresponds to the additional information that describes the dataset and its distributions, helping the manipulation and reuse of the data. In order to allow easy access to the dataset and its corresponding distributions, multiple Dataset Access mechanisms should be available. Finally, to promote interoperability among datasets it is important to adopt Data Vocabularies and Standards.
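As a minimal sketch of these concepts, the following Turtle fragment uses the Data Catalog Vocabulary [ VOCAB-DCAT ] to describe a dataset with two distributions; all example.org names are hypothetical.

    @prefix ex:   <http://example.org/> .
    @prefix dcat: <http://www.w3.org/ns/dcat#> .
    @prefix dct:  <http://purl.org/dc/terms/> .

    ex:dataset-001
        a dcat:Dataset ;                              # the abstract dataset
        dct:title "Weather forecast of MyCity" ;
        dcat:distribution ex:dataset-001-csv , ex:dataset-001-json .

    ex:dataset-001-csv
        a dcat:Distribution ;                         # one physical form of the dataset
        dcat:downloadURL <http://example.org/files/dataset-001.csv> ;
        dcat:mediaType "text/csv" .

    ex:dataset-001-json
        a dcat:Distribution ;
        dcat:downloadURL <http://example.org/files/dataset-001.json> ;
        dcat:mediaType "application/json" .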
This section is non-normative.
The openness and flexibility of the Web create new challenges for data publishers and data consumers. In contrast to conventional databases, for example, where there is a single data model to represent the data and a database management system (DBMS) to control data access, data on the Web allows for the existence of multiple ways to represent and to access data. Furthermore, publishers and consumers may be unknown to each other and be part of entirely disparate communities with different norms and in-built assumptions, so that it becomes essential to provide information about data structure, quality, provenance and any terms of use.

The following diagram summarizes some of the main challenges faced when publishing or consuming data on the Web. These challenges were identified from the DWBP Use Cases and Requirements [ UCR ] and are described by one or more questions and, as presented in the diagram, each one of these challenges is addressed by one or more best practices.
This section is non-normative.
In order to encourage data publishers to adopt the DWBP, the list below describes the main benefits of applying the DWBP. Each benefit represents an improvement in the way datasets are made available on the Web.
The figure below shows the benefits that data publishers will gain with adoption of the best practices. Section 17 presents a table that relates Best Practices to Benefits.
Reuse
Access
Discoverability
Processability
Trust
Interoperability
Linkability
Comprehension
This section presents the template used to describe Data on the Web Best Practices.
Best Practice Template
Short description of the BP, including the relevant RFC2119 keyword(s)

Why
This section answers two crucial questions:
Intended Outcome
What it should be possible to do when a data publisher follows the best practice.
Possible Approach to Implementation
A description of a possible implementation strategy is provided. This represents the best advice available at the time of writing but specific circumstances and future developments may mean that alternative implementation methods are more appropriate to achieve the intended outcome.
How to Test
Information on how to test the BP has been met. This might or might not be machine testable.
Evidence
Information about the relevance of the BP. It is described by one or more relevant requirements as documented in the Data on the Web Best Practices Use Cases & Requirements document.
Benefits
This section contains the best practices to be used by data publishers in order to help them and data consumers to overcome the different challenges faced when publishing and consuming data on the Web. One or more best practices were proposed for each one of the previously described challenges. Each BP is related to one or more requirements from the Data on the Web Best Practices Use Cases & Requirements document.
This example serves as a basis for elaboration that will be described in subsequent sections. It helps to illustrate how best practices may be applied.
When necessary, RDF examples will be used to show the result of the application of some best practices. RDF examples in this document are written in Turtle syntax [ TURTLE ] and in JSON-LD [ JSON-LD ]. In the current version, examples are presented just in Turtle syntax.
The Web is an open information space, where the absence of a specific context, such as a company's internal information system, means that the provision of metadata is a fundamental requirement. Data will not be discoverable or reusable by anyone other than the publisher if insufficient metadata is provided. Metadata provides additional information that helps data consumers better understand the meaning of data, its structure, and to clarify other issues, such as rights and license terms, the organization that generated the data, data quality, data access methods and the update schedule of datasets.

Metadata can be of different types. These types can be classified in different taxonomies, with different grouping criteria. For example, a specific taxonomy could define three metadata types according to descriptive, structural and administrative features. Descriptive metadata serves to identify a dataset, structural metadata serves to understand the structure in which the dataset is distributed, and administrative metadata serves to provide information about the version, update schedule etc. A different taxonomy could define metadata types with a scheme according to tasks where metadata are used, for example, discovery and reuse.
Best Practice 1: Provide metadata
Metadata must be provided for both human users and computer applications
Why
Providing metadata is a fundamental requirement when publishing data on the Web because data publishers and data consumers may be unknown to each other. It is therefore essential to provide information that helps data consumers, i.e., human users and computer applications, to understand the data as well as other important aspects that describe a dataset or a distribution.
Intended Outcome
It must be possible for humans to understand the metadata, which makes it human-readable metadata.

It should be possible for computer applications, notably user agents, to process the metadata, which makes it machine-readable metadata.
Possible Approach to Implementation
Possible approaches to provide human readable metadata:

How to Test
For human readable metadata, check that a human user can understand the metadata associated with a dataset.

For machine readable metadata, access the same URL either with a user agent that accepts a more data oriented format or with a tool that extracts the data from an HTML page.
Evidence
Relevant requirements : R-MetadataAvailable, R-MetadataDocum, R-MetadataMachineRead
Benefits
Best Practice 2: Provide descriptive metadata
The overall features of datasets and distributions must be described by metadata
Why
Explicitly providing dataset descriptive information allows user agents to automatically discover datasets available on the Web and allows humans to understand the nature of the dataset and its distributions.
Intended Outcome
It should be possible for humans to understand the nature of the dataset and its distributions.

It should be possible for user agents to automatically discover datasets and distributions.
Possible Approach to Implementation
Discovery metadata should include the following overall features of a dataset:
Discovery metadata should include the following overall features of a distribution:
The machine readable version of the discovery metadata may be provided according to the vocabulary recommended by W3C to describe datasets, i.e. the Data Catalog Vocabulary [ VOCAB-DCAT ]. This provides a framework in which datasets can be described as abstract entities.
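For example, a minimal machine readable description of a dataset's overall features might look as follows in Turtle (the dataset and publisher names are hypothetical):

    @prefix ex:   <http://example.org/> .
    @prefix dcat: <http://www.w3.org/ns/dcat#> .
    @prefix dct:  <http://purl.org/dc/terms/> .
    @prefix xsd:  <http://www.w3.org/2001/XMLSchema#> .

    ex:dataset-001
        a dcat:Dataset ;
        dct:title "Weather forecast of MyCity" ;
        dct:description "Weekly weather forecast of MyCity." ;
        dcat:keyword "weather", "forecast", "MyCity" ;
        dct:issued "2015-05-01"^^xsd:date ;
        dct:publisher ex:mycity-council ;
        dcat:distribution ex:dataset-001-csv .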
How to Test
Check that the metadata for the dataset itself includes the overall features of the dataset.
Check if a user agent can automatically discover the dataset.
Evidence
Relevant requirements : R-MetadataAvailable , R-MetadataMachineRead , R-MetadataStandardized
Benefits
Best Practice 3: Provide locale parameters metadata
Information about locale parameters (date, time, and number formats, language) should be described by metadata.
Why
Providing locale parameters metadata helps data consumers, i.e., human users and computer applications, to understand and to manipulate the data, improving the reuse of the data. A locale is a set of parameters that defines specific data aspects, such as language and the formatting used for numeric values and dates. Providing information about the locality for which the data is currently published aids data users in interpreting its meaning. Date, time, and number formats can have very different meanings, despite similar appearances. Making the language explicit allows users to determine how readily they can work with the data and may enable automated translation services.
Intended Outcome
It should be possible for human users and computer applications to interpret the meaning of dates, times and numbers accurately by referring to locale information.
Possible Approach to Implementation
Locale parameters metadata should include the following information:
The machine readable version of the discovery metadata may be provided according to the vocabulary recommended by W3C to describe datasets, i.e. the Data Catalog Vocabulary [ VOCAB-DCAT ].
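As a hedged sketch, the language of a dataset can be stated with the Dublin Core dct:language property (names are hypothetical; other locale parameters, such as date and number formats, would typically be documented alongside):

    @prefix ex:  <http://example.org/> .
    @prefix dct: <http://purl.org/dc/terms/> .

    ex:dataset-001
        dct:title "Weather forecast of MyCity"@en ;
        dct:language "en" .   # dates in the distributions are assumed to follow ISO 8601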
How to Test
Check that the metadata for the dataset itself includes the language in which it is published and that all numeric, date, and time fields have locale metadata provided either with each field or as a general rule.
Evidence
Relevant requirements : R-FormatLocalize , R-MetadataAvailable
Benefits
Best Practice 4: Provide structural metadata
Information about the schema and internal structure of a distribution must be described by metadata
Why
Providing information about the internal structure of a distribution can be helpful when exploring or querying the dataset. Besides, structural metadata provides information that helps to understand the meaning of the data.
Intended Outcome
It should be possible for humans to understand the internal structure or schema of a distribution.
It should be possible for user agents to automatically process the structural metadata about a distribution.
Possible Approach to Implementation
Structural metadata is available according to the format of a specific distribution and may be provided within separate documents or embedded into the document. For more details see the links below.

How to Test
Check that the distribution itself includes structural information about the data organization.
Check if a user agent can automatically process the structural information about the distribution.
Evidence
Relevant requirements : R-MetadataAvailable
Benefits
A license is a very useful piece of information to be attached to data on the Web. As defined by the Dublin Core Metadata Initiative [ DC-TERMS ], a license is a legal document giving official permission to do something with the data with which it is associated. According to the type of license adopted by the publisher, there might be more or fewer restrictions on sharing and reusing data. In the context of data on the Web, the license of a dataset can be specified within the data, or outside of it, in a separate document to which it is linked.
Best Practice 5: Provide data license information
Data license information should be available
Why
The presence of license information is essential for data consumers to assess the usability of data. User agents, for example, may use the presence/absence of license information as a trigger for inclusion or exclusion of data presented to a potential consumer.
Intended Outcome
It should be possible for humans to understand possible restrictions placed on the use of a distribution.
It should be possible for machines to automatically detect the data license of a distribution.
Possible Approach to Implementation
The machine readable version of the data license metadata may be provided using one of the following vocabularies that include properties for linking to a license:
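For example, a license can be attached to a distribution with the Dublin Core dct:license property (the distribution name is hypothetical):

    @prefix ex:  <http://example.org/> .
    @prefix dct: <http://purl.org/dc/terms/> .

    ex:dataset-001-csv
        dct:license <http://creativecommons.org/licenses/by/4.0/> .   # Creative Commons Attribution 4.0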
How to Test
Check that the metadata for the dataset itself includes the data license information.
Check if a user agent can automatically detect the data license of the dataset.
Evidence
Relevant use cases : R-LicenseAvailable and R-MetadataMachineRead
Benefits
Provenance originates from the French term "provenir" (to come from), which is used to describe the curation process of artwork as art is passed from owner to owner. Data provenance, in a similar way, is metadata that allows data providers to pass details about the data history to data users. Provenance becomes particularly important when data is shared between collaborators who might not have direct contact with one another, either due to proximity or because the published data outlives the lifespan of the data provider projects or organizations. The Web brings together business, engineering, and scientific communities, creating collaborative opportunities that were previously unimaginable. The challenge in publishing data on the Web is providing an appropriate level of detail about its origin. The data producer may not necessarily be the data provider, and so collecting and conveying this corresponding metadata is particularly important. Without provenance, consumers have no inherent way to trust the integrity and credibility of the data being shared. Data publishers in turn need to be aware of the needs of prospective consumer communities to know how much provenance detail is appropriate.
Best Practice 6: Provide data provenance information
Data provenance information should be available.
Why
Without accessible data provenance, data consumers will not know the origin or history of the published data.
Intended Outcome
It should be possible for humans to know the origin or history of the dataset.
It should be possible for machines to automatically process the provenance information about the dataset.
Possible Approach to Implementation
The machine readable version of the data provenance may be provided according to the ontology recommended by W3C to describe provenance information, i.e., the Provenance Ontology [ PROV-O ].
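A minimal sketch using PROV-O might record the responsible agent and the generation time of a dataset (names are hypothetical):

    @prefix ex:   <http://example.org/> .
    @prefix prov: <http://www.w3.org/ns/prov#> .
    @prefix xsd:  <http://www.w3.org/2001/XMLSchema#> .

    ex:dataset-001
        a prov:Entity ;
        prov:wasAttributedTo ex:mycity-weather-agency ;              # who produced the data
        prov:generatedAtTime "2015-05-01T09:00:00Z"^^xsd:dateTime .  # when it was generated

    ex:mycity-weather-agency a prov:Agent .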
How to Test
Check that the metadata for the dataset itself includes the provenance information about the dataset.
Check if a computer application can automatically process the provenance information about the dataset.
Evidence
Relevant requirements : R-ProvAvailable , R-MetadataAvailable
Benefits
Data quality is commonly defined as "fitness for use" for a specific application or use case. It can affect the potentiality of the applications that use data; as a consequence, its inclusion in the data publishing and consumption pipelines is of primary importance. Usually, the assessment of quality involves different kinds of quality dimensions, each representing groups of characteristics that are relevant to publishers and consumers. Measures and metrics are defined to assess the quality for each dimension [ DQV ]. There are heuristics designed to fit specific assessment situations that rely on quality indicators, namely, pieces of data content, pieces of data meta-information, and human ratings that give indications about the suitability of data for some intended use.
Best Practice 7: Provide data quality information
Data Quality information should be available.
Why
Data quality might seriously affect the suitability of data for specific applications, including applications very different from the purpose for which it was originally generated. Documenting data quality significantly eases the process of dataset selection, increasing the chances of reuse. Independently from domain-specific peculiarities, the quality of data should be documented and known quality issues should be explicitly stated in metadata.
Intended Outcome
It should be possible for humans to have access to information that describes the quality of the dataset and its distributions.

It should be possible for machines to automatically process the quality information about the dataset and its distributions.
Possible Approach to Implementation
The machine readable version of the dataset quality metadata may be provided according to the vocabulary that is being developed by the DWBP working group, i.e., the Data Quality Vocabulary [ DQV ].
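Assuming the draft DQV namespace and terms, quality information might be expressed along these lines (the metric and measurement names are hypothetical):

    @prefix ex:  <http://example.org/> .
    @prefix dqv: <http://www.w3.org/ns/dqv#> .
    @prefix xsd: <http://www.w3.org/2001/XMLSchema#> .

    ex:dataset-001-csv
        dqv:hasQualityMeasurement ex:measurement-001 .

    ex:measurement-001
        a dqv:QualityMeasurement ;
        dqv:isMeasurementOf ex:completeness-metric ;   # a quality metric defined elsewhere
        dqv:value "0.98"^^xsd:double .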
How to Test
Check that the metadata for the dataset itself includes quality information about the dataset.
Check if a computer application can automatically process the quality information about the dataset.
Evidence
Relevant requirements : R-QualityMetrics , R-DataMissingIncomplete , R-QualityOpinions
Benefits
Datasets published on the Web may change over time. Some datasets are updated on a scheduled basis, and other datasets are changed as improvements in collecting the data make updates worthwhile. In order to deal with these changes, new versions of a dataset may be created. Dataset versioning has been the subject of numerous discussions; however, there is no consensus about when to create a new version of a dataset. In the following we present some scenarios where a new dataset, i.e. a new version of the existing dataset, should be created to reflect the corresponding update.

The creation of multiple datasets to represent time series as well as spatial series, e.g. the same kind of data for different regions, is in general not considered as multiple versions of the same dataset. In this case, each dataset covers a different observation about the world and should be treated as a new dataset instead of a new version of an existing dataset. This is the case of a dataset that collects data about the weekly weather forecast of a given city, where every week a new dataset should be created to store the data about that specific week. Even for small changes it is important to keep track of the different dataset versions to make the dataset trustworthy. Publishers should remember that a given dataset may be in use by one or more data consumers, and they should be notified about the creation of new versions, or it should be possible to automatically identify different versions of the same dataset. Different types of dataset updates need a consistent, informative approach to versioning, so data consumers can understand and work with the changing data.
Best Practice 8: Provide versioning information
Information about dataset versioning should be available.
Why
Version information makes a dataset uniquely identifiable. Uniqueness can be used by data consumers to determine how data has changed over time and to determine specifically which version of a dataset they are working with. Good data versioning enables consumers to understand if a newer version of a dataset is available. Explicit versioning allows for repeatability in research, enables comparisons, and prevents confusion. Using unique version numbers that follow a standardized approach can also set consumer expectations about how the versions differ.
Intended Outcome
It should be possible for data consumers to easily determine which version of the dataset they are working with.
Possible Approach to Implementation
The precise method adopted for providing versioning information may vary according to the context, however there are some basic guidelines that can be followed, for example:
The Web Ontology Language provides a number of annotation properties for version information [ OWL2-QUICK-REFERENCE ], and the Provenance, Authoring and Versioning ontology [ PAV ] likewise provides a number of annotation properties for version information.
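For instance, version information might be expressed with the OWL annotation property owl:versionInfo together with PAV's versioning links (the dataset names are hypothetical):

    @prefix ex:  <http://example.org/> .
    @prefix owl: <http://www.w3.org/2002/07/owl#> .
    @prefix pav: <http://purl.org/pav/> .

    ex:dataset-001
        owl:versionInfo "1.2" ;                       # human-oriented version label
        pav:version "1.2" ;
        pav:previousVersion ex:dataset-001-v1-1 .     # link to the preceding version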
How to Test
Check that a unique version number or date is provided with the metadata describing the dataset.
Evidence
Relevant requirements : R-DataVersion
Benefits
Best Practice 9: Provide version history
A version history about the dataset should be available.
Why
In creating applications that use data, it can be helpful to understand the variability of that data over time. Interpreting the data is also enhanced by an understanding of its dynamics. Determining how the various versions of a dataset differ from each other is typically very laborious unless a summary of the differences is provided.
Intended Outcome
It should be possible for data consumers to understand how the dataset typically changes from version to version and how any two specific versions differ.
Possible Approach to Implementation
Provide a list of published versions and a description for each version that explains how it differs from the previous version. An API can expose a version history with a single dedicated URL that retrieves the latest version of the complete history.
How to Test
Check that a list of published versions is available, and that each version is described.
Evidence
Relevant requirements : R-DataVersion
Benefits
Best Practice 10: Avoid Breaking Changes to Your API

Avoid changes to your API that break client code, and communicate any changes in your API to your developers when evolution happens.

Why

When developers implement a client for your API, they may rely on specific characteristics that you have built into it, such as the schema or the details of each response. Avoiding breaking changes in your API minimizes breakage to client code. Communicating changes when they do occur allows developers to take action.

Intended Outcome

Developer code will continue to work, and if changes are made, developers will have sufficient time and information to adapt their code. That will enable them to address changes that would otherwise cause breakage.

Possible Approach to Implementation

When improving your API, focus on adding new calls rather than changing how existing calls work. Existing clients can ignore such changes and will continue functioning. If using a fully RESTful style, you should be able to avoid changes that affect developers by keeping home resource URIs constant and changing only elements that your users do not call directly. If you need to change your data in ways that are not compatible with the extension points that you initially designed, then a completely new design is required, and this will be a breaking change. In that case, it's best to implement the changes as a new API. If using any other architectural style, use versioning to indicate changes that affect client code. Indicate the version in the response header. Major version numbers should be reflected in your URIs or in request headers. When versioning in URIs, include the version number as far to the left as possible. Keep the previous version available for developers whose code has not yet been adapted to the new version.

How to Test

Be sure that client code is still working after changes, and ask for feedback from developers.
Evidence
Relevant requirements : R-DataVersion
Identifiers take many forms and are used extensively in every information system. Data discovery, usage and citation on the Web depends fundamentally on the use of HTTP (or HTTPS) URIs: globally unique identifiers that can be looked up by dereferencing them over the Internet [ RFC3986 ]. It is perhaps worth emphasizing some key points about URIs in the current context.
Best Practice 11: Use persistent URIs as identifiers of datasets
Datasets must be identified by a persistent URI.
Why
Adopting a common identification system enables basic data identification and comparison processes by any stakeholder in a reliable way. Persistent URIs are an essential pre-condition for proper data management and reuse.
Intended Outcome
Datasets or information about datasets, must be discoverable and citable through time, regardless of the status, availability or format of the data.
Possible Approach to Implementation
To be persistent, URIs must be designed as such and backed up by organizational commitments. There have been a number of articles written on this topic, as the table below shows. Advice includes, for example:

Avoid stating ownership (e.g. http://education.data.gov.uk/ministryofeducation/id/school/123456)

Avoid version numbers (e.g. http://education.data.gov.uk/doc/school/v01/123456)

Avoid using auto-increment (e.g. http://education.data.gov.uk/id/school/123456 and http://education.data.gov.uk/id/school/123457)

Avoid query strings (e.g. http://education.data.gov.uk/doc/school?id=123456)

Avoid file extensions (e.g. http://education.data.gov.uk/doc/schools/123456.csv)

URIs can be long. In a dataset of even moderate size, storing each URI is likely to be repetitive and obviously wasteful. Instead, define locally unique identifiers for each element and provide data that allows them to be converted to globally unique URIs programmatically. The Metadata Vocabulary for Tabular Data [ tabular-metadata ] provides mechanisms for doing this within tabular data such as CSV files, in particular using URI template properties such as the about URL property.

Where a data publisher is unable or unwilling to manage its URI space directly for persistence, an alternative approach is to use a redirection service such as purl.org. This provides persistent URIs that can be redirected as required so that the eventual location can be ephemeral. The software behind such services is freely available so that it can be installed and managed locally if required.
Digital Object Identifiers (DOIs) offer a similar alternative. These identifiers are defined independently of any Web technology but can be appended to a 'URI stub.' DOIs are an important part of the digital infrastructure for research data and libraries.
How to Test
Check that each dataset in question is identified using a URI that has been assigned under a controlled process as set out in the previous section. Ideally, the relevant Web site includes a description of the process and a credible pledge of persistence should the publisher no longer be able to maintain the URI space themselves.
Evidence
Relevant requirements : R-UniqueIdentifier , R-Citable
Benefits
Best Practice 12: Use persistent URIs as identifiers within datasets
Datasets should use and reuse other people's URIs as identifiers where possible.
Why
The power of the Web lies in the Network effect . The first telephone only became useful when the second telephone meant there was someone to call; the third telephone made both of them more useful yet. Data becomes more valuable if it refers to other people's data about the same thing, the same place, the same concept, the same event, the same person, and so on. That means using the same identifiers across datasets and making sure that your identifiers can be referred to by other datasets. When those identifiers are HTTP URIs, they can be looked up and more data discovered.
These ideas are at the heart of the 5 Stars of Linked Data where one data point links to another, and of Hypermedia where links may be to further data or to services (or more generally 'affordances') that act on or relate to the data in some way. Examples include bug reporting mechanisms, processors, visualization engines, sensors, actuators etc. In both Linked Data and Hypermedia, the emphasis is put on the ability for machines to traverse from one resource to another following links that express relationships.
That's the Web of Data.
Intended Outcome
That one data item can be related to others across the Web creating a global information space accessible to humans and machines alike.
Possible Approach to Implementation
This is a topic in itself and a general document such as this can only include superficial detail.
Developers know that very often the problem they're trying to solve will have already been solved by other people. In the same way, if you're looking for a set of identifiers for obvious things like countries, currencies, subjects, species, proteins, cities and regions, Nobel prize winners – someone's done it already. The steps described for discovering existing vocabularies [ LD-BP ] can readily be adapted.
If you can't find an existing set of identifiers that meet your needs then you'll need to create your own, following the patterns for URI persistence so that others will add value to your data by linking to it.
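As a sketch of the idea, a dataset can point at an identifier that somebody else maintains instead of minting its own (the subject name is hypothetical; the DBpedia URI is an existing identifier):

    @prefix ex:  <http://example.org/> .
    @prefix dct: <http://purl.org/dc/terms/> .

    ex:dataset-001
        dct:spatial <http://dbpedia.org/resource/London> .   # reuse an existing URI for the place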
How to Test
Check that within the dataset, references to things that don't change or that change slowly, such as countries, regions, organizations and people, are referred to by URIs or by short identifiers that can be appended to a URI stub. Ideally the URIs should resolve; however, they have value as globally scoped variables whether they resolve or not.
Evidence
Relevant requirements : R-UniqueIdentifier
Benefits
Best Practice 13: Assign URIs to dataset versions and series
URIs should be assigned to individual versions of datasets as well as the overall series.
Why
Like documents, many datasets fall into natural series or groups. For example:
In different circumstances, it will be appropriate to refer separately to each of these examples (and many like them).
Intended Outcome
It should be possible to refer to a specific version of a dataset and to concepts such as a 'dataset series' and 'the latest version.'
Possible Approach to Implementation
The W3C provides a good example of how to do this. The (persistent) URI for this document is http://www.w3.org/TR/2015/WD-dwbp-20150224/. That identifier points to an immutable snapshot of the document on the day of its publication. The URI for the 'latest version' of this document is http://www.w3.org/TR/dwbp/ which is an identifier for a series of closely related documents that are subject to change over time. At the time of publication, these two URIs both resolve to this document. However, when the next version of this document is published, the 'latest version' URI will be changed to point to that.
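One possible way to make the relationship between a dataset series and its individual versions explicit in metadata is with the Dublin Core versioning properties (all names are hypothetical):

    @prefix ex:  <http://example.org/> .
    @prefix dct: <http://purl.org/dc/terms/> .

    ex:dataset-001  dct:hasVersion ex:dataset-001-W1 , ex:dataset-001-W2 .  # the series

    ex:dataset-001-W1  dct:isVersionOf ex:dataset-001 .   # an immutable snapshot
    ex:dataset-001-W2  dct:isVersionOf ex:dataset-001 .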
How to Test
Check that each version of a dataset has its own URI, and that logical groups of datasets are also identifiable.
Evidence
Relevant requirements : R-UniqueIdentifier , R-Citable
Benefits
The formats in which data is made available to consumers are a key aspect of making that data usable. The best, most flexible access mechanism in the world is pointless unless it serves data in formats that enable use and reuse. Below we detail best practices in selecting formats for your data, both at the level of files and that of individual fields. W3C encourages use of formats that can be used by the widest possible audience and processed most readily by computing systems. Source formats, such as database dumps or spreadsheets, used to generate the final published format, are out of scope. This document is concerned with what is actually published rather than internal systems used to generate the published data.
Best Practice 14: Use machine-readable standardized data formats
Why
As data becomes more ubiquitous, and datasets become larger and more complex, processing by computers becomes ever more crucial. Posting data in a format that is not machine readable places severe limitations on the continuing usefulness of the data. Data becomes useful when it has been processed and transformed into information.
Using non-standard data formats is costly and inefficient, and the data may lose meaning as it is transformed. On the other hand, standardized data formats enable interoperability as well as future uses, such as remixing or visualization, many of which cannot be anticipated when the data is first published. The use of non-proprietary data formats should also be considered, since it increases the possibilities for use and reuse of data.

Intended Outcome

It should be possible for machines to easily read and process data published on the Web. It should be possible for data consumers to use computational tools typically available in the relevant domain to work with the data. It should be possible for data consumers who want to use or reuse the data to do so without investment in proprietary software.

Possible Approach to Implementation

Make data available in a machine readable standardized data format that is easily parseable, including but not limited to CSV, XML, Turtle, NetCDF, JSON and RDF.

How to Test

Check that the data format conforms to a known machine-readable data format specification.
Evidence
Relevant requirements : R-FormatMachineRead , R-FormatStandardized , R-FormatOpen
Benefits
Best Practice 15: Provide data in multiple formats
Data should be available in multiple data formats.
Why
Providing data in more than one format reduces costs incurred in data transformation. It also minimizes the possibility of introducing errors in the process of transformation. If many users need to transform the data into a specific data format, publishing the data in that format from the beginning saves time and money and prevents errors many times over. Lastly it increases the number of tools and applications that can process the data.
Intended Outcome
It should be possible for data consumers to work with the data without transforming it.
Possible Approach to Implementation
Consider the data formats most likely to be needed by intended users, and consider alternatives that are likely to be useful in the future. Data publishers must balance the effort required to make the data available in many formats, but providing at least one alternative will greatly increase the usability of the data.
How to Test
Check that the complete dataset is available in more than one data format.
Evidence
Relevant requirements : R-FormatMultiple
Benefits
Data is often represented in a structured and controlled way, making reference to a range of vocabularies, for example, by defining types of nodes and links in a data graph or types of values for columns in a table, such as the subject of a book, or a relationship "knows" between two persons. Additionally, the values used may come from a limited set of pre-existing values or resources: for example object types, roles of a person, countries in a geographic area, or possible subjects for books. Such vocabularies ensure a level of control, standardization and interoperability in the data. They can also serve to improve the usability of datasets. Say a dataset contains a reference to a concept described in several languages; such a reference allows applications to localize their display or their search depending on the language of the user.
According to W3C, vocabularies define the concepts and relationships (also referred to as "terms" or "attributes") used to describe and represent an area of concern. Vocabularies are used to classify the terms that can be used in a particular application, characterize possible relationships, and define possible constraints on using those terms. Several categories of vocabularies have been coined, for example, ontology, controlled vocabulary, thesaurus, taxonomy, code list, semantic network.
There is no strict division between the artifacts referred to by these names. “Ontology” tends however to denote the vocabularies of classes and properties that structure the descriptions of resources in (linked) datasets. In relational databases, these correspond to the names of tables and columns; in XML, they correspond to the elements defined by an XML Schema. Ontologies are the key building blocks for inference techniques on the Semantic Web. The first means offered by W3C for creating ontologies is the RDF Schema [ RDF-SCHEMA ] language. It is possible to define more expressive ontologies with additional axioms using languages such as those in The Web Ontology Language [ OWL2-OVERVIEW ].
On the other hand, “controlled vocabularies”, “concept schemes”, “knowledge organization systems” enumerate and define resources that can be employed in the descriptions made with the former kind of vocabulary. A concept from a thesaurus, say, “architecture”, will for example be used in the subject field for a book description (where “subject” has been defined in an ontology for books). For defining the terms in these vocabularies, complex formalisms are most often not needed. Simpler models have thus been proposed to represent and exchange them, such as the ISO 25964 data model [ ISO-25964 ] or W3C 's Simple Knowledge Organization System [ SKOS-PRIMER ].
Best Practice 16: Use standardized terms

Standardized terms should be used to provide data and metadata
Why
The need for standardized code lists and other commonly used terms for data values and for describing metadata is to avoid, as much as possible, ambiguity and clashes in the terms chosen for data and metadata information. The key reason is to be able to refer to the standards body/organization which defines the term or code as a clear reference.
Intended Outcome
The benefit of using standardized code lists and other commonly used terms is to enable interoperability and consensus among data publishers and consumers.
Possible Approach to Implementation
An approach to implementation is the case of a vocabulary developed within a Working Group or a standardized body such as the W3C .
The Open Geospatial Consortium (OGC) could define the notion of granularity for geospatial datasets, while the [DCAT] vocabulary reuses the same notion applied to catalogs on the Web.
How to Test

Check that the terms or codes to be used are defined by a standards organization/working group or body such as IETF, OGC, W3C, etc.
Evidence
Relevant requirements : R-MetadataStandardized , R-QualityComparable
Benefits
Best Practice 17: Reuse vocabularies

Shared vocabularies should be used to provide metadata

Why

Reusing vocabularies increases interoperability and reduces redundancies, encouraging reuse of the data. Shared vocabularies capture a consensus of the community about a specific domain. The reuse of shared vocabularies to describe metadata helps the automatic processing of data and metadata. Shared vocabularies should especially be used to describe both structural metadata as well as other types of metadata (descriptive, provenance, quality and versioning).

Intended Outcome

It should be possible to automatically compare two or more datasets when they use the same vocabulary to describe metadata, so that an initial effort can be made towards understanding, characterizing and tracking data evolution within a dataset.

It should be possible for machines to automatically process the metadata that describes a dataset.
Possible Approach to Implementation
The Standard Vocabularies section of the W3C Best Practices for Publishing Linked Data [ LD-BP ] provides guidance on the discovery, evaluation and selection of existing vocabularies.
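As a simple sketch, a dataset description that reuses widely shared vocabularies (Dublin Core, FOAF) rather than inventing equivalent terms might read:

    @prefix ex:   <http://example.org/> .
    @prefix dct:  <http://purl.org/dc/terms/> .
    @prefix foaf: <http://xmlns.com/foaf/0.1/> .

    ex:dataset-001
        dct:title "Weather forecast of MyCity" ;   # reused term, not a bespoke ex:title
        dct:publisher ex:mycity-council .

    ex:mycity-council
        foaf:name "MyCity Council" .               # reused term, not a bespoke ex:name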
How to Test

Check that terms or attributes used do not replicate those defined by vocabularies in common use within the same domain.
Evidence
Relevant requirements : R-MetadataStandardized , R-VocabReference
Benefits
Best Practice 18: Choose the right formalization level

When creating or reusing a vocabulary, a data publisher should opt for a level of formal semantics that fits the data and applications.

Why
Why
Formal
semantics
may
help
one
to
establish
precise
specifications
that
support
establishing
the
intended
meaning
of
the
vocabulary
and
the
performance
of
complex
tasks
such
as
reasoning.
On
the
other
hand,
complex
vocabularies
require
more
effort
to
produce
and
understand,
which
could
hamper
their
re-use,
reuse,
as
well
as
the
comparison
and
linking
of
datasets
exploiting
them.
Highly
formalized
data
is
also
harder
to
exploit
by
inference
engines:
for
example,
using
an
OWL
class
in
a
position
where
a
SKOS
concept
is
enough,
or
using
OWL
classes
with
complex
OWL
axioms
raises
the
formal
complexity
of
the
data
according
to
the
OWL
Profiles
[
OWL2-PROFILES
].
Data
producers
should
therefore
seek
to
identify
the
right
level
of
formalization
for
particular
domains,
audiences
and
tasks,
and
maybe
offer
different
formalization
levels
when
one
size
does
not
fit
all.
Intended Outcome
The data supports all application cases but should not be more complex to produce and reuse than necessary.
Possible Approach to Implementation
Identify the "role" played by the vocabulary for the datasets: say, providing classes and properties used to type resources and provide the predicates for RDF statements, or elements in an XML Schema, as opposed to providing simple concepts or codes that are used for representing attributes of the resources described in a dataset. When simpler models are enough to convey the necessary semantics, represent vocabularies using them. For instance, for Linked Data, SKOS may be preferred for simple vocabularies as opposed to formal ontology languages like OWL; see for example how concept schemes and code lists are used in the RDF Data Cube Recommendation [ VOCAB-DATA-CUBE ].
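For instance, a simple subject list can be represented with SKOS, avoiding the extra formal machinery of OWL when it is not needed (the scheme and concept names are hypothetical):

    @prefix ex:   <http://example.org/> .
    @prefix skos: <http://www.w3.org/2004/02/skos/core#> .

    ex:subjects a skos:ConceptScheme .

    ex:architecture
        a skos:Concept ;
        skos:prefLabel "Architecture"@en ;
        skos:inScheme ex:subjects .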
How to Test
For formal knowledge representation languages, applying an inference engine on top of the data that uses a given vocabulary does not produce too many statements that are unnecessary for target applications.
Evidence
Relevant requirements : R-VocabReference , R-VocabDocum , R-QualityComparable
Benefits
Sensitive data is any designated data or metadata that is used in limited ways and/or intended for limited audiences. Sensitive data may include personal data, corporate or government data, and mishandling of published sensitive data may lead to damages to individuals or organizations. To support best practices for publishing sensitive data, data publishers should identify all sensitive data, assess the exposure risk, determine the intended usage, data user audience and any related usage policies, obtain appropriate approval, and determine the appropriate security measures taken to protect the data, which should also account for secure authentication and use of HTTPS.
At
times,
because
of
sharing
policies
sensitive
data
may
not
be
available
in
part
or
in
its
entirety.
Data
unavailability
represents
gaps
that
may
affect
the
overall
analysis
of
datasets.
To
account
for
unavailable
data,
data
publishers
should
publish
information
about
unavoidable
data
gaps.
Best
Practice
21:
Preserve
people's
right
to
privacy
Data
must
not
infringe
a
person's
right
to
privacy.
Why
Data publishers should preserve the privacy of individuals where the release of personal information would endanger safety (unintended accidents) or security (deliberate attack). Private information might include: full name, home address, mail address, national identification number, IP address (in some cases), vehicle registration plate number, driver's license number, face, fingerprints or handwriting, credit card numbers, digital identity, date of birth, birthplace, genetic information, telephone number, login name, screen name, nickname, health records, etc.
Data publishers should identify all personal data, assess the exposure risk, determine the intended usage, data user audience and any related usage policies, obtain appropriate approval, and determine the appropriate security measures needed to protect the data, including secure authentication and use of HTTPS for sensitive data transmission.
Intended Outcome
Data that can identify an individual person must not be published without their consent.
Possible Approach to Implementation
The data publisher should establish a security plan for publishing data and metadata. The plan should include preparatory steps to ensure personal data is protected or removed prior to publication. All steps need to be followed prior to the publication of new data or new data formats, particularly binary formats (word processing, spreadsheets, etc.) that may embed personal metadata in files. Identify any personal data exposure risks. Write a security plan for publishing data and metadata that includes clear guidelines to follow. Prior to publication, put security measures in place and follow them. In preparation for publication, review the data to ensure compliance.
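As one illustration, the following minimal sketch (plain Python; the file name, the detection patterns and the notion of a "suspicious value" are assumptions made for the example, not a prescribed method) scans a CSV distribution for values that look like personal data as part of such a pre-publication review:

    # A minimal sketch of a pre-publication review step: scan a CSV
    # distribution for values that look like personal data.
    import csv
    import re

    PATTERNS = {
        "email": re.compile(r"[\w.+-]+@[\w-]+\.[\w.]+"),
        "phone": re.compile(r"\+?\d[\d\s().-]{7,}\d"),
    }

    def find_personal_data(path):
        """Return (row, column, kind) triples for suspicious values."""
        hits = []
        with open(path, newline="", encoding="utf-8") as f:
            for row_no, row in enumerate(csv.DictReader(f), start=1):
                for column, value in row.items():
                    for kind, pattern in PATTERNS.items():
                        if value and pattern.search(value):
                            hits.append((row_no, column, kind))
        return hits

    for hit in find_personal_data("dataset.csv"):
        print("possible personal data:", hit)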
How to Test
Write and test a plan for reviewing, curating and vetting data prior to publication.
Evidence
Relevant requirements : R-SensitivePrivacy
Best Practice 20: Provide data unavailability reference
References to data that is not open, or that is available under restrictions different from those of the referring dataset, should provide an explanation about how the referred-to data can be accessed and who can access it.
Why
Publishing online documentation about unavailable data due to sensitivity issues provides a means for publishers to explicitly identify knowledge gaps. This provides a contextual explanation for consumer communities thus encouraging use of the data that is available.
Intended Outcome
Publishers should provide information about data that is referred to from the current dataset but that is unavailable or only available under different conditions.
Possible Approach to Implementation
Data publishers may publish an HTML document that gives a human-readable explanation for the data unavailability. RDF may be used to provide a machine-readable version of the same information. If appropriate, consider editing the server's 4xx response page(s) to provide the information.
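For example, a minimal sketch (assuming the Flask microframework; the route, reason and contact details are illustrative) of a 4xx response that explains the unavailability in both human- and machine-readable form:

    # A minimal sketch: a 403 response that explains why referenced
    # data is unavailable, negotiated for humans or machines.
    from flask import Flask, request, jsonify, render_template_string

    app = Flask(__name__)

    EXPLANATION = {
        "status": "restricted",
        "reason": "Contains personal data; available to accredited researchers only.",
        "contact": "mailto:data-office@example.org",
    }

    @app.route("/datasets/hospital-admissions")
    def restricted_dataset():
        # Content negotiation: JSON for machines, HTML for humans.
        if request.accept_mimetypes.best == "application/json":
            return jsonify(EXPLANATION), 403
        return render_template_string(
            "<h1>Data unavailable</h1><p>{{ reason }} Contact {{ contact }}.</p>",
            **EXPLANATION), 403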
How to Test
If the dataset includes references to other data that is unavailable, check whether an explanation is available in the metadata and/or description of it.
Evidence
Relevant requirements : R-AccessLevel
Benefits
Providing easy access to data on the Web enables both humans and machines to take advantage of the benefits of sharing data using the Web infrastructure. By default, the Web offers access using Hypertext Transfer Protocol (HTTP) methods. This provides access to data at an atomic transaction level. However, when data is distributed across multiple files or requires more sophisticated retrieval methods, different approaches can be adopted to enable data access, including bulk download and APIs.
One approach is packaging data in bulk using non-proprietary file formats (for example, tar files). Using this approach, bulk data is generally pre-processed server side, where multiple files or directory trees of files are provided as one downloadable file. When bulk data is being retrieved from non-file-system solutions, depending on the data user communities, the data publisher can offer APIs to support a series of retrieval operations representing a single transaction.
For data that is streaming to the Web in "real time" or "near real time", data publishers should publish data or use APIs to enable immediate access, allowing access to critical time-sensitive data such as emergency information, weather forecasting data, or published system metrics. In general, APIs should be available to allow third parties to automatically search and retrieve data published on the Web.
On a further note, it can be observed that data on the Web is essentially about the description of entities identified by a unique, Web-based identifier (a URI). Once the data is dumped and sent to an institute specialised in digital preservation, the link with the Web is broken (dereferencing), but the role of the URI as a unique identifier still remains. In order to increase the usability of preserved dataset dumps it is relevant to maintain a list of these identifiers.
Best Practice 21: Provide bulk download
Data should be available for bulk download.
Why
When Web data is distributed across many URLs and logically organized as one container, accessing the data in bulk is useful. Bulk access provides a consistent means to handle the data as one container. Without it, accessing the data individually is cumbersome, leading to inconsistent approaches to handling the container.
Intended Outcome
It should be possible to download data on the Web in bulk. Data publishers should provide a way either through bulk file formats or APIs for consumers to access this type of data.
Possible Approach to Implementation
Depending on the nature of the data and consumer needs, possible approaches could include bulk file formats or APIs.
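As one illustration (a minimal sketch using only Python's standard library; the directory and archive paths are assumptions), bulk data can be pre-packaged server side as a single downloadable archive:

    # A minimal sketch: pre-package a directory tree of data files as
    # one downloadable .tar.gz archive.
    import tarfile
    from pathlib import Path

    def package_bulk(source_dir="data/observations",
                     archive="bulk/observations.tar.gz"):
        Path(archive).parent.mkdir(parents=True, exist_ok=True)
        with tarfile.open(archive, "w:gz") as tar:
            # arcname roots the archive at the dataset name rather
            # than the full server path.
            tar.add(source_dir, arcname="observations")

    package_bulk()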
How to Test
Humans can retrieve copies of preprocessed bulk data through existing tools such as a browser. Clients can test bulk access through an API or through queries to Web resources with discoverable metadata about the bulk data.
Evidence
Relevant requirements : R-AccessBulk
Benefits
Best Practice 22: Use Web Standardized Interfaces
It is recommended to use URIs, HTTP verbs, HTTP response codes, MIME types, typed HTTP Links and content negotiation when designing APIs for accessing data.
Why
APIs that use HTTP verbs, URIs and response codes leverage developers' existing knowledge, making it easier for them to build applications that access data. Using a standardized interface also helps to avoid tight coupling between requests and responses, making for an API that can readily be used by many clients. It also assures sustainability, as "the technologies that make up this foundation include the Hypertext Transfer Protocol (HTTP), Uniform Resource Identifier (URI), markup languages such as HTML and XML, and Web-friendly formats" [ RICHARDSON ].
Intended Outcome
Possible Approach to Implementation
There are many RESTful development frameworks available. If you are already using a web development framework that supports building REST APIs, consider using that. If not, consider an API-specific framework that uses REST.
One implementation type to consider is a hypermedia API: an API that responds with links rather than with data alone. Even for an API that is not truly RESTful, using hypermedia can be helpful for making the API self-documenting. RESTful APIs use hypermedia as the engine of application state (HATEOAS). Because state is controlled by links that can be examined and used on the fly, the underlying code can change without affecting client code, making your API evolvable.
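For instance, a minimal sketch (assuming Flask; the resource names and link relations are illustrative) of an interface that relies on URIs, HTTP verbs, standard status codes and embedded links:

    # A minimal sketch: URIs identify resources, HTTP verbs and status
    # codes carry the protocol semantics, and responses embed typed
    # links (hypermedia) so clients can discover related resources.
    from flask import Flask, jsonify

    app = Flask(__name__)

    DATASETS = {"observations": {"title": "Hourly observations"}}

    @app.route("/datasets/<name>", methods=["GET"])
    def get_dataset(name):
        dataset = DATASETS.get(name)
        if dataset is None:
            return jsonify(error="not found"), 404   # standard status code
        body = dict(dataset)
        body["_links"] = {                           # hypermedia controls
            "self": f"/datasets/{name}",
            "distribution": f"/datasets/{name}/download",
        }
        return jsonify(body), 200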
How to Test
Evidence
Relevant requirements : R-AccessBulk
Best Practice 23: Serving data and resources with different formats
It is recommended to use content negotiation for serving data available in multiple formats.
Why
It is possible to serve data in an HTML page that mixes human-readable and machine-readable content; RDFa could be used to mix HTML content with semantic data. In some cases, however, such a page is subject to scraping by applications that want to extract the data. When structured data is mixed with HTML, or is written in Turtle or JSON-LD, it is recommended to serve the page using content negotiation, so that different representations are available through the same URI.
This BP will be complemented.
Intended Outcome
It should be possible to serve the same resource with different representations.
Possible Approach to Implementation
A possible approach to implementation is to configure the web server to perform content negotiation for the requested resource. The specific format of the resource's representation can be selected either via the URI or via the Accept header of the HTTP request.
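For example, a minimal sketch (assuming the Python requests library; the URI is illustrative) showing one URI yielding different representations depending on the Accept header:

    # A minimal sketch: the same resource URI negotiated into
    # different representations via the HTTP Accept header.
    import requests

    uri = "http://example.org/datasets/observations"

    for accept in ("text/turtle", "application/ld+json", "text/html"):
        response = requests.get(uri, headers={"Accept": accept})
        print(accept, "->", response.headers.get("Content-Type"))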
How to Test
Evidence
Relevant requirements : R-AccessBulk , R-APIDocumented
Benefits
Best Practice 24: Provide real-time access
When data is produced in real-time, it should be available on the Web in real-time.
Why
The presence of real-time data on the Web enables access to critical time-sensitive data, and encourages the development of real-time Web applications. Real-time access depends on real-time data producers making their data readily available to the data publisher. The necessity of providing real-time access for a given application will need to be evaluated on a case-by-case basis, considering refresh rates, the latency introduced by data post-processing steps, infrastructure availability, and the data needed by consumers. In addition to making data accessible, data publishers may provide additional information describing data gaps, data errors and anomalies, and publication delays.
Intended Outcome
Data should be available at real time or near real time, where real-time means a range from milliseconds to a few seconds after the data creation, and near real time is a predetermined delay for expected data delivery.
Possible Approach to Implementation
Real-time data accessibility may be achieved through two means: publishing the data itself as soon as it is created, or providing APIs that expose it on demand.
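As one possible illustration (a minimal sketch assuming Flask; the sensor reading is a stand-in for a real data source, and Server-Sent Events are only one of several suitable mechanisms), near real-time data can be pushed to consumers as a stream:

    # A minimal sketch: push near real-time data to consumers as
    # Server-Sent Events.
    import json
    import time

    from flask import Flask, Response

    app = Flask(__name__)

    def read_sensor():
        # Stand-in for a real measurement.
        return {"time": time.time(), "value": 42}

    @app.route("/stream")
    def stream():
        def events():
            while True:
                yield f"data: {json.dumps(read_sensor())}\n\n"
                time.sleep(1)  # publication interval
        return Response(events(), mimetype="text/event-stream")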
How to Test
To adequately test real time data access, data will need to be tracked from the time it is initially collected to the time it is published and accessed. [ PROV-O ] can be used to describe these activities. Caution should be used when analyzing real-time access for systems that consist of multiple computer systems. For example, tests that rely on wall-clock time stamps may reflect inconsistencies between the individual computer systems, as opposed to data publication time latency.
Evidence
Relevant requirements : R-AccessRealTime
Benefits
Best Practice 25: Provide data up to date
Data must be available in an up-to-date manner and the update frequency made explicit.
Why
The availability of data on the Web should closely coincide with the data provided at creation time, collection time, or after it has been processed or changed. Carefully synchronizing data publication to the update frequency encourages data consumer confidence and reuse.
Intended Outcome
When new data is provided or data is updated, it must be published to coincide with the data changes.
Possible Approach to Implementation
Implement an API to enable data access. When data is provided by bulk access, new files with new data should be provided as soon as additional data is created or updated. Or, use technologies that are intended to expose data on the Web using interlinked resources, like Activity Streams or Atom.
How to Test
Write a standard operating procedure for the data publisher to keep the published data up to date, and test it by following that procedure.
Evidence
Relevant requirements : R-AccessUpToDate
Benefits
To debate if the goal should be to adhere to a published schedule for updates. Issue-195
Best Practice 26: Document your API
Provide your users with complete information about how to use your API.
Why
The primary consumers of an API are developers. In order to develop against your API, a developer will need to understand how to use it.
Intended Outcome
Developers will be able to code efficiently against your API, and they will make best use of the features you have provided. It is recommended to explain the architecture chosen for the API design, and to show how to invoke each API call and what will be returned from those calls.
Possible Approach to Implementation
Swagger , io-docs , OpenApis , and others provide formats for API documentation.
How to Test
Quality of documentation is related to the usage and feedback from developers. Try to get constant feedback from your users about the documentation.
Best Practice 27: Use an API
Offer an API to serve data.
Why
An API offers the greatest flexibility and processability for consumers of your data. It can enable real-time data usage, filtering on request, and the ability to work with the data at an atomic level. If your dataset is large, frequently updated, or highly complex, an API is likely to be helpful.
Intended Outcome
Developers will have programmatic access to the data for use in their own applications.
Possible Approach to Implementation
If you use a data management platform, such as CKAN, you may be able to simply enable an existing API. Many web development frameworks include support for APIs, and there are also frameworks written specifically for building custom APIs; examples include Swagger, Apigility, Apache CXF, and Restify.
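For instance, a minimal sketch (assuming Flask; the records and the station filter are illustrative) of atomic, filterable access to data that would otherwise only be available in bulk:

    # A minimal sketch: atomic, filterable access to records through
    # an API.
    from flask import Flask, jsonify, request

    app = Flask(__name__)

    RECORDS = [
        {"id": 1, "station": "A", "value": 10},
        {"id": 2, "station": "B", "value": 12},
    ]

    @app.route("/records")
    def list_records():
        station = request.args.get("station")  # filtering on request
        rows = [r for r in RECORDS if station in (None, r["station"])]
        return jsonify(rows)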
How to Test
Use service virtualization to simulate calls to the API and their responses, and make sure that the performance is acceptable.
Best Practice 28: Assess dataset coverage
The coverage of a dataset should be assessed prior to its preservation.
Why
A chunk of Web data is by definition dependent on the rest of the global graph. This global context influences the meaning of the description of the resources found in the dataset. Ideally, the preservation of a particular dataset would involve preserving all of its context: that is, the entire Web of Data.
At ingestion time, the linkage of a Web data dataset dump to already-preserved resources is assessed. The presence of all the vocabularies and target resources it uses is sought in a set of digital archives taking care of preserving Web data. Datasets for which very few of the vocabularies used and/or resources pointed to are already preserved somewhere should be flagged as being at risk.
Intended Outcome
An evaluation of the preservation coverage and external dependencies of a given dataset should be performed.
Possible Approach to Implementation
The assessment can be performed by the digital preservation institute or by the dataset depositor. It essentially consists in checking whether all the resources used are either already preserved somewhere or provided along with the new dataset considered for preservation.
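As one illustration, a minimal sketch (assuming the Python rdflib library; the dump file name and the list of preserved namespaces are assumptions) that lists the external vocabularies a dump depends on and flags those not known to be preserved:

    # A minimal sketch of a coverage check: which vocabularies does a
    # dump depend on, and are they known to be preserved?
    from rdflib import Graph
    from rdflib.namespace import split_uri

    # Namespaces known to be held by a digital archive (illustrative).
    PRESERVED = {"http://www.w3.org/2004/02/skos/core#"}

    g = Graph()
    g.parse("dump.ttl", format="turtle")

    # Predicates reveal which vocabularies the dataset relies on.
    namespaces = set()
    for _, p, _ in g:
        try:
            namespaces.add(split_uri(p)[0])
        except ValueError:
            pass  # predicate URI with no obvious namespace split

    for ns in sorted(namespaces):
        status = "preserved" if str(ns) in PRESERVED else "at risk"
        print(ns, status)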
How to Test
Datasets making references to portions of the Web of Data which are not preserved should receive a lower score than those using common resources.
Evidence
Relevant requirements : R-VocabReference
Benefits
Best Practice 29: Use a trusted serialisation format for preserved data dumps
Data depositors willing to send a data dump for long-term preservation must use a well-established serialisation format.
Why
Web data is an abstract data model that can be expressed in different ways (RDF/XML, JSON-LD, ...). Using a well-established serialisation of this data increases its chances of reuse. Institutes doing digital preservation are tasked with monitoring file format obsolescence. Datasets which were acquired in some format some years ago may have to be converted into another format in order to still be usable with more modern software (see [ ROSENTHAL ]). This task can be made more challenging, or even impossible, if non-standard serialisation formats are used by data depositors.
Intended Outcome
It should be possible to read and load the dataset into a database even if its original software is no longer supported.
Possible Approach to Implementation
Give preference to Web data serialisation formats available as open standards, for instance those provided by the W3C [ FORMATS ].
How to Test
Try to dereference the URI of the data dump with an Accept header set to the format you expect to get, using for example [ cURL ].
Evidence
Relevant requirements : R-FormatStandardized
Benefits
Best Practice 30: Update the status of identifiers
Preserved datasets should be linked with their "live" counterparts.
Why
URI dereferencing is a primary interface to data on the Web. Linking preserved datasets with the original URI inform the data consumer of the status of these resources.
During its life cycle a dataset may undergo several modifications. Although URIs assigned to things are not expected to change, the description of these resource will evolve over time. During this evolution, several snapshots could be made available for preservation and access as versions.
Intended Outcome
A link is maintained between the URI of a resource, the most up-to-date description available for it, and preserved descriptions. If the resource does not exist any more the description should say so and refer to the last preserved description that was available.
Possible Approach to Implementation
There is a variety of HTTP status codes that could be put into use to relate the URI with its preserved description. In particular, 200, 410 and 303 can be used for different scenarios: for example, 200 if the resource is still served, 410 if it no longer exists, and 303 to redirect to a preserved description.
In addition to the status codes, HTTP Link headers can also be used to relate resources to preserved descriptions.
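For example, a minimal sketch (assuming Flask; the URIs and the "memento" link relation are illustrative) of a retired resource answering 410 Gone with a Link header pointing at its preserved description:

    # A minimal sketch: a "live" URI whose resource no longer exists
    # responds 410 Gone and links to the preserved description.
    from flask import Flask, Response

    app = Flask(__name__)

    @app.route("/datasets/census-1991")
    def gone_dataset():
        response = Response("This dataset has been retired.", status=410)
        response.headers["Link"] = (
            '<http://archive.example.org/census-1991>; rel="memento"'
        )
        return response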
How to Test
Check that de-referencing the URI of a preserved dataset returns information about its current status and availability.
Evidence
Relevant requirements : R-AccessLevel , R-PersistentIdentification
Benefits
Publishing data on the Web enables data sharing on a large scale, providing data access to a wide range of audiences with different levels of expertise. Data publishers want to ensure that the data published is meeting data consumer needs, and user feedback is crucial. Feedback has benefits for both data publishers and data consumers, helping data publishers to improve the integrity of their published data and encouraging the publication of new data. Feedback allows data consumers to have a voice, describing usage experiences (e.g. applications using the data), preferences and needs. When possible, feedback should also be publicly available for other data consumers to examine. Making feedback publicly available allows users to become aware of other data consumers, supports a collaborative environment, and lets users see how community experiences, concerns or questions are being addressed.
From a user interface perspective there are different ways to gather feedback from data consumers, including site registration, contact forms, quality ratings selection, surveys and comment boxes for blogging. From a machine perspective the data publisher can also record metrics on data usage or information about specific applications that consumers are currently relying upon. Feedback such as this establishes a communication channel between data publishers and data consumers. In order to quantify and analyze usage feedback, it should be recorded in a machine-readable format. Blogs and other publicly available feedback should be displayed in a human-readable form through the user interface.
This section provides some best practices to be followed by data publishers in order to enable data consumers to provide feedback about the consumed data. This feedback can be intended for humans or machines.
Best Practice 31: Gather feedback from data consumers
Data publishers should provide a means for consumers to offer feedback.
Why
Providing feedback contributes to improving the quality of published data, may encourage publication of new data, helps data publishers understand data consumers needs better and, when feedback is made publicly available, enhances the consumers' collaborative experience.
Intended Outcome
It should be possible for data consumers to provide feedback and rate data in both human and machine-readable formats. The feedback should be Web accessible and it should provide a URL reference to the corresponding dataset.
Possible Approach to Implementation
Provide data consumers with one or more feedback mechanisms including, but not limited to: a registration form, contact form, point and click data quality rating buttons, or a comment box for blogging.
Collect feedback in machine-readable formats to represent the feedback and use a vocabulary to capture the semantics of the feedback information.
How to Test
Evidence
Relevant requirements : R-UsageFeedback , R-QualityOpinions
Benefits
Best Practice 32: Provide information about feedback
Information about feedback should be provided.
Why
Sharing information about feedback allows data consumers to be aware of feedback given by other consumers.
Intended Outcome
It should be possible for humans to have access to information that describes feedback on a dataset given by one or more data consumers.
It should be possible for machines to automatically process feedback information about a dataset.
Possible Approach to Implementation
The machine-readable version of the feedback metadata may be provided according to the vocabulary that is being developed by the DWBP working group , i.e., the Dataset Usage Vocabulary [ DUV ].
How to Test
Evidence
Relevant requirements : R-UsageFeedback , R-QualityOpinions
Benefits
To discuss: enrichment yields derived data, not just metadata. For example, you could take a dataset of scheduled and real bus arrival times and enrich it by adding on-time arrival percentages; the percentages are data, not metadata. Issue-196
To discuss: the meaning of the word "topification". Issue-196
Data enrichment refers to a set of processes that can be used to enhance, refine or otherwise improve raw or previously processed data. This idea and other similar concepts contribute to making data a valuable asset for almost any modern business or enterprise. It also shows the common imperative of proactively using this data in various ways.
This section provides some advice to be followed by data publishers in order to enable data consumers to enrich data.
Best Practice 33: Enrich data by generating new metadata
Data should be enriched whenever possible, generating richer metadata to represent and describe it.
Why
There is a large number of intelligent techniques that can be used to enrich raw or previously treated data and to extract new metadata from it, making data an even more valuable asset. These methods include those focused on data categorization, entity recognition, sentiment analysis, topification, among others. Providing new and richer metadata may help data consumers to better understand the data they are dealing with.
Intended Outcome
Describe a dataset using richer sets of metadata, which can be readable by humans.
Possible Approach to Implementation
The implementation depends on the types of metadata to be produced; it requires the implementation of methods for data categorization, disambiguation, sentiment analysis, among others. After new metadata is extracted, it can be provided as part of an HTML Web page or in any open data format.
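As one illustration, a minimal sketch (plain Python; the records, keywords and categories are assumptions, and keyword matching merely stands in for a real categorization or entity-recognition technique) of enriching records with generated metadata:

    # A minimal sketch: enrich records with generated category
    # metadata; keyword matching stands in for a real method.
    RECORDS = [
        {"id": 1, "description": "Road closed due to flooding"},
        {"id": 2, "description": "New cycle lane opened"},
    ]

    CATEGORIES = {"flooding": "weather", "cycle": "transport"}

    def categorize(text):
        return sorted({cat for kw, cat in CATEGORIES.items()
                       if kw in text.lower()})

    for record in RECORDS:
        record["categories"] = categorize(record["description"])  # new metadata
        print(record)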
How to test
Check whether the metadata extracted by these techniques is in accordance with human knowledge and is readable by humans.
Evidence
Relevant requirements: R-DataEnrichment
Benefits
This section is non-normative.
A dataset is defined as a collection of data, published or curated by a single agent, and available for access or download in one or more formats. A dataset does not have to be available as a downloadable file.
A Citation may be either direct and explicit (as in the reference list of a journal article), indirect (e.g. a citation to a more recent paper by the same research group on the same topic), or implicit (e.g. as in artistic quotations or parodies, or in cases of plagiarism).
From: CiTO
For the purposes of this WG, a Data Consumer is a person or group accessing, using, and potentially performing post-processing steps on data.
From: Strong, Diane M., Yang W. Lee, and Richard Y. Wang. "Data quality in context." Communications of the ACM 40.5 (1997): 103-110.
Data Format is defined as "a specific convention for data representation, i.e. the way that information is encoded and stored for use in a computer system, possibly constrained by a formal data type or set of standards."
From: DH Curation Guide
Data Producer is a person or group responsible for generating and maintaining data.
From: Strong, Diane M., Yang W. Lee, and Richard Y. Wang. "Data quality in context." Communications of the ACM 40.5 (1997): 103-110.
Data representation is any convention for the arrangement of symbols in such a way as to enable information to be encoded by a data producer and later decoded by data consumers.
From: DH Curation Guide
A distribution represents a specific available form of a dataset. Each dataset might be available in different forms; these forms might represent different formats of the dataset or different endpoints. Examples of distributions include a downloadable CSV file, an API or an RSS feed.
A feedback forum is used to collect messages posted by consumers about a particular topic. Messages can include replies to other consumers. Datetime stamps are associated with each message and the messages can be associated with a person or submitted anonymously.
From: (1) SIOC , (2) Annotation#Motivation
To better understand why an annotation [ Annotation-Model ] was created, a SKOS Concept Scheme is used to show inter-related annotations between communities with more meaningful distinctions than a simple class/subclass tree.
Data Preservation is defined by APA as "The processes and operations in ensuring the technical and intellectual survival of objects through time". This is part of a data management plan focusing on preservation planning and meta-data . Whether it is worthwhile to put effort into preservation depends on the (future) value of the data, the resources available and the opinion of the stakeholders (= designated community).
Data Archiving is the set of practices around the storage and monitoring of the state of digital material over the years.
These tasks are the responsibility of a Trusted Digital Repository (TDR), also sometimes referred to as Long-Term Archive Service (LTA) . Often such services follow the Open Archival Information System which defines the archival process in terms of ingest, monitoring and reuse of data.
Provenance originates from the French term "provenir" (to come from), which is used to describe the curation process of artwork as art is passed from owner to owner. Data provenance, in a similar way, is metadata that allows data providers to pass details about the data history to data users.
Data quality is commonly defined as “fitness for use” for a specific application or use case.
File Format is a standard way that information is encoded for storage in a computer file. It specifies how bits are used to encode information in a digital storage medium. File formats may be either proprietary or free and may be either unpublished or open.
A license is a legal document giving official permission to do something with the data with which it is associated.
From: DC-TERMS
A locale is a set of parameters that defines specific data aspects, such as language and formatting used for numeric values and dates.
Machine Readable Data are data formats that may be readily parsed by computer programs without access to proprietary libraries. For example, CSV and the RDF Turtle family for graphs are machine-readable, but PDF and JPEG are not.
From: Linked Data Glossary
Sensitive data is any designated data or metadata that is used in limited ways and/or intended for limited audiences. Sensitive data may include personal data, corporate or government data, and mishandling of published sensitive data may lead to damages to individuals or organizations.
A vocabulary is a collection of "terms" for a particular purpose. Vocabularies can range from simple, such as the widely used RDF Schema , FOAF and Dublin Core Metadata Element Set , to complex vocabularies with thousands of terms, such as those used in healthcare to describe symptoms, diseases and treatments. Vocabularies play a very important role in Linked Data, specifically to help with data integration. The use of this term overlaps with Ontology.
From: Linked Data Glossary
Structured Data refers to data that conforms to a fixed schema. Relational databases and spreadsheets are examples of structured data.
This section is non-normative.
Best Practice | Benefits |
---|---|
Provide Metadata | |
Provide descriptive metadata | |
Provide locale parameters metadata | |
Provide structural metadata | |
Provide data license information | |
Provide data provenance information | |
Provide data quality information | |
Provide versioning information | |
Provide version history | |
Use persistent URIs as identifiers of datasets | |
Use persistent URIs as identifiers within datasets | |
Assign URIs to dataset versions and series | |
Use machine-readable standardized data formats | |
Provide data in multiple formats | |
Use standardized terms | |
Reuse vocabularies | |
Choose the right formalization level | |
Provide data unavailability reference | |
Provide bulk download | |
Follow REST principles when designing APIs | |
Serving data and resources with different formats | |
Provide real-time access | |
Provide data up to date | |
Maintain separate versions for a data API | |
Assess dataset coverage | |
Use a trusted serialisation format for preserved data dumps | |
Update the status of identifiers | |
Gather feedback from data consumers | |
Provide information about feedback | |
Enrich data by generating new metadata |
This section is non-normative.
UC Requirement | Best Practice |
---|---|
R-AccessBulk | Best Practice 19: Provide bulk download, Best Practice 20: Follow REST principles when designing APIs |
R-AccessLevel | Best Practice 18: Provide data unavailability reference, Best Practice 26: Update the status of identifiers |
R-AccessRealTime | Best Practice 21: Provide real-time access |
R-AccessUpToDate | Best Practice 22: Provide data up to date |
R-APIDocumented | Best Practice 20: Follow REST principles when designing APIs |
R-Citable | Best Practice 10: Use persistent URIs as identifiers, Best Practice 11: Assign URIs to dataset versions and series |
R-DataEnrichment | Best Practice 29: Enrich data by generating new metadata |
R-DataIrreproducibility | |
R-DataLifecyclePrivacy | |
R-DataLifecycleStage | |
R-DataMissingIncomplete | |
R-DataVersion | Best Practice 8: Provide versioning information, Best Practice 9: Provide version history, Best Practice 23: Maintain separate versions for a data API |
R-FormatLocalize | Best Practice 3: Provide locale parameters metadata |
R-FormatMachineRead | Best Practice 12: Use machine-readable standardized data formats |
R-FormatMultiple | Best Practice 13: Provide data in multiple formats |
R-FormatStandardized | Best Practice 12: Use machine-readable standardized data formats, Best Practice 25: Use a trusted serialisation format for preserved data dumps |
R-FormatOpen | Best Practice 12: Use machine-readable standardized data formats |
R-GeographicalContext | |
R-GranularityLevels | |
R-MetadataAvailable | Best Practice 1: Provide metadata, Best Practice 2: Provide descriptive metadata, Best Practice 3: Provide locale parameters metadata, Best Practice 4: Provide structural metadata, Best Practice 6: Provide data provenance information |
R-MetadataDocum | Best Practice 1: Provide metadata |
R-MetadataMachineRead | Best Practice 1: Provide metadata, Best Practice 2: Provide descriptive metadata, Best Practice 5: Provide data license information |
R-MetadataStandardized | Best Practice 2: Provide descriptive metadata, Best Practice 5: Provide data license information, Best Practice 14: Use standardized terms |
R-PersistentIdentification | Best Practice 26: Update the status of identifiers |
R-QualityComparable | Best Practice 16: Choose the right formalization level |
R-QualityMetrics | |
R-QualityOpinions | Best Practice 27: Gather feedback from data consumers, Best Practice 28: Provide information about feedback |
R-TrackDataUsages | |
R-UsageFeedback | Best Practice 27: Gather feedback from data consumers, Best Practice 28: Provide information about feedback |
R-VocabDocum | Best Practice 16: Choose the right formalization level |
R-VocabOpen | |
R-VocabReference | Best Practice 15: Reuse vocabularies, Best Practice 16: Choose the right formalization level |
R-VocabVersion | Best Practice 24: Assess dataset coverage |
R-UniqueIdentifier | Best Practice 10: Use persistent URIs as identifiers, Best Practice 11: Assign URIs to dataset versions and series |
R-LicenseAvailable | Best Practice 5: Provide data license information |
R-LicenseLiability | |
R-ProvAvailable | Best Practice 6: Provide data provenance information |
R-SensitivePrivacy | Best Practice 17: Preserve people's right to privacy |
R-SensitiveSecurity | Best Practice 17: Preserve people's right to privacy |
The editors gratefully acknowledge the contributions made to this document by all members of the working group and the chairs: Hadley Beeman, Steve Adler, Yaso Córdova, Deirdre Lee.
Changes since the previous version include: