Abstract

This document provides best practices related to the publication and usage of data on the Web designed to help support a self-sustaining ecosystem. Data should be discoverable and understandable by humans and machines. Where data is used in some way, whether by the originator of the data or by an external party, such usage should also be discoverable and the efforts of the data publisher recognized. In short, following these best practices will facilitate interaction between publishers and consumers.

Status of This Document

This section describes the status of this document at the time of its publication. Other documents may supersede this document. A list of current W3C publications and the latest revision of this technical report can be found in the W3C technical reports index at http://www.w3.org/TR/.

This early version of the document shows its expected scope and future direction. A template is used that shows the what, why and how of each best practice. Comments are sought on the usefulness of this approach and the expected scope of the final document.

This document was published by the Data on the Web Best Practices Working Group as a First Public Working Draft. This document is intended to become a W3C Recommendation. If you wish to make comments regarding this document, please send them to public-dwbp-wg@w3.org (subscribe, archives). All comments are welcome.

Publication as a First Public Working Draft does not imply endorsement by the W3C Membership. This is a draft document and may be updated, replaced or obsoleted by other documents at any time. It is inappropriate to cite this document as other than work in progress.

This document was produced by a group operating under the 5 February 2004 W3C Patent Policy. W3C maintains a public list of any patent disclosures made in connection with the deliverables of the group; that page also includes instructions for disclosing a patent. An individual who has actual knowledge of a patent which the individual believes contains Essential Claim(s) must disclose the information in accordance with section 6 of the W3C Patent Policy.

This document is governed by the 1 August 2014 W3C Process Document.


1. Introduction

This section is non-normative.

Note

This is the first working draft of this document to be formally published and all text is explicitly open to review. In this first version, 27 best practices are proposed and more than 25 requirements have been addressed. However, there are still many ongoing discussions in the group about important subjects, like terminology and scope. The group is still working towards consensus on the meaning and application of some important terms including: vocabulary, metadata, data model, data format, schema and dataset. Other important issues relate to the definition of best practices for vocabularies. For example it is not clear if providing advice to data publishers to create and re-use vocabularies is in the scope of the DWBP. Likewise, there is not yet consensus on whether data preservation is in scope. The Working Group is therefore particularly keen to receive comments to ensure that its future work is relevant, useful and comprehensive. Please send comments to public-dwbp-wg@w3.org (subscribe, archives).

The best practices described below have been developed to encourage and enable the continued expansion of the Web as a medium for the exchange of data. The growth of open data by governments across the world [OKFN-INDEX], the increasing publication of research data encouraged by organizations like the Research Data Alliance [RDA], the harvesting and analysis of social media, crowd-sourcing of information, the provision of important cultural heritage collections such as at the Bibliothèque nationale de France [BNF] and the sustained growth in the Linked Open Data Cloud [LODC], provide some examples of this phenomenon.

In broad terms, data publishers aim to share data either openly or with controlled access. Data consumers (who may also be producers themselves) want to be able to find and use data, especially if it is accurate, regularly updated and guaranteed to be available at all times. This creates a fundamental need for a common understanding between data publishers and data consumers. Without this agreement, data publishers' efforts may be incompatible with data consumers' desires.

Publishing data on the Web creates new challenges, such as how to represent, describe and make data available in a way that is easy to find and to understand. In this context, it becomes crucial to provide guidance to publishers that improves consistency in the way data is managed, thus promoting data re-use and fostering trust in the data among developers, whatever technology they choose to use, increasing the potential for genuine innovation.

This document sets out a series of best practices that will help publishers and consumers face the new challenges and opportunities posed by data on the Web.

Best practices cover different aspects related to data publishing and consumption, like data formats, data access, data identification and metadata. In order to delimit the scope and elicit the required features for Data on the Web Best Practices, the DWBP working group compiled a set of use cases [UCR] that represent scenarios of how data is commonly published on the Web and how it is used. The set of requirements derived from these use cases was used to guide the development of the best practices.

The Best Practices proposed in this document are intended to serve a more general purpose than the practices suggested in Best Practices for Publishing Linked Data [LD-BP]: they are domain-independent and, whilst they recommend the use of Linked Data, they also promote best practices for data on the Web in formats such as CSV and JSON. The Best Practices related to the use of vocabularies incorporate practices that stem from Best Practices for Publishing Linked Data where appropriate.

2. Conformance

As well as sections marked as non-normative, all authoring guidelines, diagrams, examples, and notes in this specification are non-normative. Everything else in this specification is normative.

The key words MUST and SHOULD are to be interpreted as described in [RFC2119].

3. Audience

This section is non-normative.

This document provides best practices to those who publish data on the Web. The best practices are designed to meet the needs of information management staff, developers, and wider groups such as scientists interested in sharing and re-using research data on the Web. While data publishers are our primary audience, we encourage all those engaged in related activities to become familiar with it. Every attempt has been made to make the document as readable and usable as possible while still retaining the accuracy and clarity needed in a technical specification.

Readers of this document are expected to be familiar with some fundamental concepts of the architecture of the Web [WEBARCH], such as resources and URIs, as well as a number of data formats. The normative element of each best practice is the intended outcome. Possible implementations are suggested and, where appropriate, these recommend the use of a particular technology such as CSV, JSON or RDF. A basic knowledge of vocabularies and data models would be helpful to better understand some aspects of this document.

4. Scope

This section is non-normative.

This document is concerned solely with best practices that:

  • are unique to publishing or re-using data on the Web;
  • encourage publication or re-use of data on the Web.

As noted above, whether a best practice has or has not been followed should be judged against the intended outcome, not the specific implementation which is offered as guidance. A best practice is always subject to improvement as we learn and evolve the Web together.

5. Data on the Web Challenges

The openness and flexibility of the Web create new challenges for data publishers and data consumers. In contrast to conventional databases, for example, where there is a single data model to represent the data and a database management system (DBMS) to control data access, data on the Web allows for the existence of multiple ways to represent and to access data. Furthermore, publishers and consumers may be unknown to each other and be part of entirely disparate communities with different norms and in-built assumptions so that it becomes essential to provide information about data structure, quality, provenance and any terms of use. The following list summarizes some of the main challenges faced when publishing or consuming data on the Web. These challenges were identified from the DWBP Use Cases and Requirements [UCR] and are described by one or more questions.

Metadata
What kind of metadata should be considered when describing data on the Web?
How can metadata be provided in a machine readable way?
Data Identification
How can unique identifiers be provided for data resources?
How should URIs be designed and managed for persistence?
Data Formats
What kind of data formats should be considered when publishing data on the Web?
Data Vocabularies
How can existing vocabularies be used to provide semantic interoperability?
How can a new vocabulary be designed if needed?
Data Licenses
How can data licenses be made machine readable?
How can license information about data published on the Web be provided/gathered?
Data Provenance
How can data provenance information about data published on the Web be provided/gathered?
Data Quality
How can data quality information about data on the Web be provided/gathered?
Sensitive Data
How can data be published without infringing a person's right to privacy or an organization's security?
Data Access
What kind of data access should be considered when publishing data on the Web?
What requirements should be taken into account when deciding how to make data available on the Web?
Data Versioning
How can different versions of a dataset be tracked and managed?
Data Preservation
How can publishers decide when and how data on the Web should be archived?
Feedback
How can user feedback about data consumed from the Web be gathered?

Each one of these challenges originated one or more requirements, as documented in the use cases document. The development of the Data on the Web Best Practices was guided by these requirements, in such a way that each best practice should have at least one of these requirements as evidence of its relevance.

Issue 1
Should we include a section to present a glossary with definitions (or link to existing definitions) of the main terms used in the doc? Issue-134

6. Best Practices Template

This section presents the template used to describe Data on the Web Best Practices.

Best Practice Template

Short description of the BP, including the relevant RFC2119 keyword(s)

Why

This section answers two crucial questions:

  • Why is this unique to publishing or re-using data on the Web?
  • How does this encourage publication or re-use of data on the Web?
A full text description of the problem addressed by the best practice may also be provided. It can be any length but is likely to be no more than a few sentences.

Intended Outcome

What it should be possible to do when a data publisher follows the best practice.

Possible Approach to Implementation

A description of a possible implementation strategy is provided. This represents the best advice available at the time of writing but specific circumstances and future developments may mean that alternative implementation methods are more appropriate to achieve the intended outcome.

How to Test

Information on how to test whether the BP has been met. This might or might not be machine testable.

Evidence

Information about the relevance of the BP. It is described by one or more relevant requirements as documented in the Data on the Web Best Practices Use Cases & Requirements document.

Issue 2
Is this the correct template? Where exactly should the normative statement be? This is Issue-146.

7. Best Practices Summary

8. The Best Practices

This section contains the best practices to be used by data publishers in order to help them and data consumers to overcome the different challenges faced during the data on the Web lifecycle. One or more best practices were proposed for each one of the previously described challenges. Each BP is related to one or more requirements from the Data on the Web Best Practices Use Cases & Requirements document.

Issue 3
Do we over-use RFC2119? Is it used correctly? This issue may be applied to all BPs. Issue-135
Issue 4
Some sections of the document have a technological bias. Issue-144

8.1 Metadata

Metadata is data about data. It provides additional information that helps consumers better understand the meaning of data and its structure, and clarifies other issues, such as rights and license terms, the organization that generated the data, data quality, data access methods, the update schedule of datasets, etc.

Metadata can be used to help tasks such as dataset discovery and re-use, and can be assigned at different levels of granularity, from a single property of a resource to a whole dataset, or to all datasets from a specific organization.

Metadata can be of different types. These types can be classified in different taxonomies, with different grouping criteria. For example, a specific taxonomy could define three metadata types according to descriptive, structural and administrative features. Descriptive metadata serves to identify a dataset, structural metadata serves to understand the format(s) in which the dataset is distributed and administrative metadata serves to provide information about the version, update schedule etc. A different taxonomy could define metadata types with a scheme according to tasks where metadata are used, for example, discovery and re-use.

The Web is an open information space; in the absence of a specific context, such as a company's internal information system, metadata is essential. Data will not be discoverable or reusable by anyone other than the publisher if insufficient metadata is provided.

Data consumers must be able to:

  1. discover the data;
  2. understand the nature and structure of the data, i.e. what the data describes and how it does it;
  3. find out the origin of the data and under what terms it may be used.

Not only should data consumers be able to read the metadata; machines should also be able to automatically process the metadata associated with a given dataset. It is also important that standard terms are used when specifying metadata, both to help the automatic processing of metadata and to improve data interoperability.

This section presents best practices to help data publishers to face challenges related to metadata. Initially, general best practices are presented (Document data, Use machine-readable formats to provide metadata and Use standard terms to define metadata), then best practices concerning specific kinds of metadata are proposed. These specific best practices are specializations of the more general best practices.

Best Practice 1: Document data

A metadata document must be published together with the data

Why

Providing information about data, i.e. metadata, in a way that data consumers may understand is fundamental when publishing data on the Web. Given that data publishers and data consumers may be unknown to each other, it becomes essential to provide information that helps data consumers to understand the data as well as other important aspects that describe a dataset.

Intended Outcome

It must be possible for humans to read metadata that describes a dataset, i.e. the metadata must be human readable.

Possible Approach to Implementation

Metadata for humans is most easily provided as part of an HTML Web page.

How to Test

Check that a human user can easily read the metadata associated with a dataset.

Evidence

Relevant requirements: R-MetadataDocum

Issue 5
Should we mention multilingualism in BP Document data? This issue applies to all BPs about providing documentation for humans. Issue-142

Best Practice 2: Use machine-readable formats to provide metadata

Metadata in machine-readable formats must be published together with the data

Why

Metadata can be used by machines, notably search engines, to discover and classify data. Further metadata can be used by machines to process the data once discovered.

Intended Outcome

It should be possible for computer applications, notably search tools, to locate and process the metadata easily, i.e. the metadata must be machine readable.

Possible Approach to Implementation

Metadata for machines is best provided either as an alternative representation of the Web page in a serialization format such as Turtle or JSON-LD (for RDF), or embedded in the HTML page, again as [JSON-LD], [HTML-RDFA] or [Microdata].

If the multiple formats are published separately, they should be served from the same URL using content negotiation. Maintenance of multiple formats is best achieved by generating each available format on the fly based on a single source of the metadata.
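
As a non-normative illustration, the following sketch (using hypothetical example.org URIs) shows basic dataset metadata expressed in Turtle with the DCAT and Dublin Core vocabularies; an equivalent description could be embedded in an HTML page as JSON-LD or RDFa.

  @prefix dcat: <http://www.w3.org/ns/dcat#> .
  @prefix dct:  <http://purl.org/dc/terms/> .
  @prefix xsd:  <http://www.w3.org/2001/XMLSchema#> .

  <http://example.org/dataset/bus-stops>        # hypothetical dataset URI
      a dcat:Dataset ;
      dct:title       "Bus stops of MyCity"@en ;
      dct:description "Locations of the bus stops operated by MyCity transport."@en ;
      dct:publisher   <http://example.org/organization/mycity-transport> ;
      dct:issued      "2015-02-24"^^xsd:date .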

How to Test

Access the same URL either with a user agent that accepts a more data oriented format, such as RDF, or a tool that extracts the data from an HTML page such as the RDF Translator.

Evidence

Relevant requirements: R-MetadataMachineRead

Best Practice 3: Use standard terms to define metadata

Standard terms should be used for metadata definition.

Why

The provision of metadata is fundamental to data on the Web. It is the shop window, the user manual and the conditions of use. It is possible to provide this information in an almost infinite number of ways: maker, author, originator, person responsible and source are among the many near synonyms for creator, and those examples are all in a single language. The task of finding and processing relevant data among the vast amounts available on the Web, for people and the computer systems they use, is made substantially more achievable if different publishers use the same terms as each other or, better, use common identifiers for those terms, however the terms are presented to humans. Only in this way can the same tools and methods be used for multiple tasks.

Intended Outcome

Metadata should be provided using standard vocabularies.

Issue 6
MUST or SHOULD? Issue-133. This issue could readily be applied to all BPs.

Possible Approach to Implementation

Metadata is best provided using RDF vocabularies. Each term in an RDF vocabulary has its own HTTP URI to which labels and descriptions in multiple languages can be attached. If labels are not provided in your language, you can add them. If your context demands that a term be specialized, you can create a sub class or sub property but retain the benefits of the original vocabulary's semantics.

Detailed advice on the best practices for the selection, use and extension of vocabularies is provided in Best Practices for Publishing Linked Data [LD-BP].
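
As a sketch of this approach (all example.org identifiers below are hypothetical), a publisher re-uses the standard term dct:creator rather than inventing its own "maker" property, and where a more specific meaning is needed defines a sub-property that retains the original semantics:

  @prefix rdf:  <http://www.w3.org/1999/02/22-rdf-syntax-ns#> .
  @prefix rdfs: <http://www.w3.org/2000/01/rdf-schema#> .
  @prefix dct:  <http://purl.org/dc/terms/> .

  # Hypothetical specialized property retaining the semantics of dct:creator
  <http://example.org/def/sensorOperator>
      a rdf:Property ;
      rdfs:subPropertyOf dct:creator ;
      rdfs:label   "sensor operator"@en, "opérateur du capteur"@fr ;
      rdfs:comment "The agent that operated the sensor which produced the dataset."@en .

  # The dataset description re-uses the standard term directly where it fits
  <http://example.org/dataset/air-quality>
      dct:creator <http://example.org/organization/mycity-environment> .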

How to Test

Check that standard vocabularies have been used wherever possible. In particular mark as an error instances where vocabularies such as [DC-TERMS] and [VOCAB-DCAT] could have been used but were not.

Evidence

Relevant requirements: R-MetadataStandardized

8.1.1 Data Discovery

In order to be discoverable on the Web, typically through a search engine or a data portal's own search function, it is essential to provide the kind of information that such tools need. Unlike a natural language document that, to a greater or lesser extent, can be processed and its content 'understood,' a dataset must be described and it is that description that forms the basis of any classification, or indexing.

Best Practice 4: Provide discovery metadata

The overall features of a dataset must be described by metadata.

This best practice is a specialization of the higher level Use machine-readable formats to provide metadata.

Why

Explicitly providing data discovery information allows user agents to automatically discover datasets available on the Web.

Intended Outcome

User agents must be able to discover datasets.

Possible Approach to Implementation

The vocabulary recommended by W3C is the Data Catalog Vocabulary [VOCAB-DCAT].

This provides a framework in which datasets can be described as abstract entities with one or more distributions, that is, means of accessing the data. This might be through one or more downloads or APIs. Aspects covered include:

  • The title and a description of the data.
  • The data format(s) in which the data could be downloaded (e.g. XML, CSV, TSV, JSON, JSON-LD, RDF/XML, Turtle, N-Triples etc.)
  • Any variants (e.g. different human-language translations) of data.
  • Access mechanisms through which the data can be accessed, e.g. SPARQL endpoints, Linked Data Platform [LDP], REST interfaces, SOAP-based web services, etc. (see Data Access).
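
A minimal sketch of such a description in Turtle, covering the aspects listed above with hypothetical example.org resources, might look like this:

  @prefix dcat: <http://www.w3.org/ns/dcat#> .
  @prefix dct:  <http://purl.org/dc/terms/> .
  @prefix xsd:  <http://www.w3.org/2001/XMLSchema#> .

  <http://example.org/dataset/bus-stops>
      a dcat:Dataset ;
      dct:title       "Bus stops of MyCity"@en ;
      dct:description "Locations and names of the bus stops operated by MyCity transport."@en ;
      dcat:keyword    "bus", "public transport" ;
      dct:publisher   <http://example.org/organization/mycity-transport> ;
      dct:issued      "2015-02-24"^^xsd:date ;
      dcat:distribution <http://example.org/dataset/bus-stops/csv> .

  <http://example.org/dataset/bus-stops/csv>
      a dcat:Distribution ;
      dcat:downloadURL <http://example.org/downloads/bus-stops.csv> ;
      dcat:mediaType   "text/csv" .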

How to Test

A human test might simply be to use an appropriate search tool and check that the dataset is discoverable as expected since that is the intended outcome. However, a more structured test would be to ensure that the basic metadata fields listed above are filled. It may be possible to machine-test this using the work of the RDF Data Shapes Working Group.

Evidence

Relevant requirements: R-MetadataAvailable, R-MetadataMachineRead, R-MetadataStandardized

8.1.2 Locale Parameters

A locale is a set of parameters that defines specific data aspects, such as the language and the formatting used for numeric values and dates. When publishing data on the Web, it is important to provide such information in order to improve the common understanding between data publishers and data consumers. Some data fields may differ subtly but significantly with changes in locale. Providing information about the locale for which the data is published aids data users in interpreting its meaning. Date, time, and number formats can have very different meanings, despite similar appearances. Making the language explicit allows users to determine how readily they can work with the data and may enable automated translation services.

Best Practice 5: Provide locale parameters metadata

Information about locale parameters (date, time, and number formats, language) should be made available.

This best practice is a specialization of the higher level Use machine-readable formats to provide metadata.

Why

Providing locale parameters helps data consumers to understand and to manipulate the data, improving the re-use of the data.

Intended Outcome

It should be possible for data consumers to interpret the meaning of dates, times and numbers accurately by referring to locale information.

Possible Approach to Implementation

Provide locale metadata for date, time, and number fields, and include the language in which the data is published in the dataset metadata. Where an international format specification exists, e.g., ISO 8601 for dates and times, use it.
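
A brief sketch in Turtle (hypothetical dataset URI): the dataset metadata declares its language, and typed literals make the date format explicit via ISO 8601; locale rules for individual fields of a distribution can be documented analogously in its metadata.

  @prefix dcat: <http://www.w3.org/ns/dcat#> .
  @prefix dct:  <http://purl.org/dc/terms/> .
  @prefix xsd:  <http://www.w3.org/2001/XMLSchema#> .

  <http://example.org/dataset/bus-stops>
      a dcat:Dataset ;
      dct:language <http://id.loc.gov/vocabulary/iso639-1/en> ;
      dct:issued   "2015-02-24"^^xsd:date ;          # ISO 8601 date, unambiguous across locales
      dct:description "Numeric fields use '.' as the decimal separator."@en .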

How to Test

Check that the metadata for the dataset itself includes the language in which it is published and that all numeric, date, and time fields have locale metadata provided either with each field or as a general rule.

Evidence

Relevant requirements: R-FormatLocalize, R-MetadataAvailable

8.1.3 Data Licenses

A license is a very useful piece of information to be attached to data on the Web. As defined by the Dublin Core Metadata Initiative, a license is a legal document giving official permission to do something with the data with which it is associated. According to the type of license adopted by the publisher, there might be more or fewer restrictions on sharing and re-using data. In the context of data on the Web, the license of a dataset can be specified within the data, or outside of it, in a separate document to which it is linked.

Best Practice 6: Provide data license information

Data license information should be available

This best practice is a specialization of the higher level Use machine-readable formats to provide metadata.

Why

The presence of license information is essential for assessing the usability of data. User agents, including search tools, may use the presence/absence of license information as a trigger for inclusion or exclusion of data presented to a potential consumer. Even though the license may be presented in natural language, where data links to the URL of a well known license, the user agent may be able to present the well known features to the potential consumer.

Intended outcome

Machines should be able to automatically detect whether a given dataset does or does not carry a license.

Possible Approach to Implementation

There are several well known vocabularies that include properties for linking to a license. These include Dublin Core [DC-TERMS], Creative Commons [CC-ABOUT], schema.org [SCHEMA-ORG] and XHTML [XHTML-VOCAB]. There are also a number of machine readable rights languages, including The Creative Commons Rights Expression Language [ccREL], the Open Data Rights Language [ODRL] and the Open Data Rights Statement Vocabulary [ODRS].

Links to the license can be provided from the data itself, from an HTML page that describes the data (via a Link element), or via an HTTP Link Header, the latter two with a @rel value of license.
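
For example, a dataset's machine-readable metadata might link to a well-known license with the Dublin Core license property. In this sketch the Creative Commons license URI is real while the dataset URI is hypothetical:

  @prefix dct: <http://purl.org/dc/terms/> .

  <http://example.org/dataset/bus-stops>
      dct:license <http://creativecommons.org/licenses/by/4.0/> .

  # An equivalent HTTP Link header could be:
  #   Link: <http://creativecommons.org/licenses/by/4.0/>; rel="license"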

Further information about open data licensing can be found in the Publisher's Guide to Open Data Licensing, published by the Open Data Institute [ODI-LICENSING].

How to Test

Check for the presence of one or more of:

  1. an RDF predicate;
  2. an HTML Link element;
  3. an HTTP Link header;

that links the dataset to a license and/or rights information.

Evidence

Relevant requirements: R-LicenseAvailable, R-MetadataMachineRead

8.1.4 Data Provenance

Provenance originates from the French term "provenir" (to come from), which is used to describe the curation process of artwork as art is passed from owner to owner. Data provenance, in a similar way, is metadata that allows data providers to pass details about the data's history to data users. Provenance becomes particularly important when data is shared between collaborators who might not have direct contact with one another, whether because of distance or because the published data outlives the lifespan of the data provider's projects or organizations.

The Web brings together business, engineering, and scientific communities creating collaborative opportunities that were previously unimaginable. The challenge in publishing data on the Web is providing an appropriate level of detail about its origin. The data publisher may not necessarily be the data provider and so collecting and conveying this corresponding metadata is particularly important. Without provenance, consumers have no inherent way to trust the integrity and credibility of the data being shared. Data publishers in turn need to be aware of the needs of prospective consumer communities to know how much provenance detail is appropriate.

Best Practice 7: Provide data provenance information

Data provenance information should be available

This best practice is a specialization of the higher level Use machine-readable formats to provide metadata.

Why

Without accessible data provenance, consumers will not know the origin or history of the published data.

Data provenance is metadata that corresponds to data. Data provenance relies upon existing vocabularies that make provenance easily identifiable such as the Provenance Ontology [PROV-O].

Intended Outcome

Data published on the Web should include, or link to, provenance information.

Possible Approach to Implementation

Data provenance can be published in a number of ways:

  1. Use the Provenance Ontology [PROV-O] to describe data provenance.
  2. Use an appropriate level of detail that will be meaningful to the intended audience.
  3. Write the data provenance in a machine readable form such as Turtle or RDF/XML, or embed it in an HTML page using [JSON-LD] or [HTML-RDFA], as sketched below.
  4. Verify that the data provenance references the published data.
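
A minimal provenance sketch using [PROV-O] in Turtle (hypothetical example.org resources) could state who produced the dataset, how, and when:

  @prefix prov: <http://www.w3.org/ns/prov#> .
  @prefix xsd:  <http://www.w3.org/2001/XMLSchema#> .

  <http://example.org/dataset/bus-stops>
      a prov:Entity ;
      prov:wasAttributedTo <http://example.org/organization/mycity-transport> ;
      prov:wasGeneratedBy  <http://example.org/activity/2015-survey> ;
      prov:generatedAtTime "2015-02-24T09:30:00Z"^^xsd:dateTime .

  <http://example.org/activity/2015-survey>
      a prov:Activity ;
      prov:used <http://example.org/dataset/raw-gps-traces> .

  <http://example.org/organization/mycity-transport>
      a prov:Agent, prov:Organization .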

How to Test

The PROV Implementation Report [PROV-IMP] lists a number of validator tools that can be used to test for the presence of provenance information provided using [PROV-O].

Evidence

Relevant requirements: R-ProvAvailable, R-MetadataAvailable

8.1.5 Data Quality

Data quality is commonly defined as “fitness for use” for a specific application or use case. It can affect the suitability of applications that use the data; as a consequence, its consideration in data publishing and consumption pipelines is of primary importance.

Usually, the assessment of quality involves different kinds of quality dimensions, each representing groups of characteristics that are relevant to publishers and consumers. Measures and metrics are defined to assess the quality for each dimension. There are heuristics designed to fit specific assessment situations that rely on quality indicators, namely, pieces of data content, pieces of data meta-information, and human ratings that give indications about the suitability of data for some intended use.

Which dimensions and metrics to adopt may largely depend on the specific application scenario, or even on the data domain. A systematic review of dimensions and metrics adopted in the context of Linked Open Data can be found in the recent literature (e.g., see [ZAVERI]).

Best Practice 8: Provide data quality information

Data Quality information should be available

This best practice is a specialization of the higher level Use machine-readable formats to provide metadata.

Why

Data quality might seriously affect the suitability of data for specific applications, including applications very different from the purpose for which it was originally generated. Documenting data quality significantly eases the process of data selection, increasing the chances of re-use.

Intended Outcome

Information about the quality of the data should be provided for humans. Ideally it is also made available in a machine-readable manner for processing by applications.

Possible Approach to Implementation

Depending on the application domain, information pertaining to quality may rely on specific quality metrics or on user feedback and opinions. Specific quality metadata fields may or may not be explicitly included in the metadata vocabularies adopted by catalogs. Independently of domain-specific peculiarities, the quality of data should be documented and known quality issues should be explicitly stated in metadata.
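
Pending a dedicated quality vocabulary (one is being developed by this group), a simple sketch is to state known issues directly in the metadata and link to a fuller, human-readable quality report (hypothetical URIs):

  @prefix dct:  <http://purl.org/dc/terms/> .
  @prefix rdfs: <http://www.w3.org/2000/01/rdf-schema#> .

  <http://example.org/dataset/bus-stops>
      dct:description "Known quality issue: coordinates for roughly 3% of stops are estimated rather than surveyed."@en ;
      rdfs:seeAlso <http://example.org/dataset/bus-stops/quality-report> .   # human-readable quality report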

How to Test

Check whether the metadata explicitly includes a description of the quality of the data.

Evidence

Information about the relevance of the BP is described by requirements documented in the Data on the Web Best Practices Use Cases & Requirements document: Requirements for Data Quality

Issue 7
Should we provide more specific/ detailed strategies on how to attach quality info in metadata apart from the use of the data quality vocabulary the group is working on? Issue-116

8.1.6 Data Versioning

Data on the web often changes over time. Many datasets are updated on a scheduled basis, such as census data or funding data that changes every fiscal year. Other datasets are changed as improvements in collecting the data make updates worthwhile. Still other data changes in real time or near real time. All these types of data need a consistent, informative approach to versioning, so that data consumers can understand and work with the changing data. The following best practices address issues that arise in tracking and managing different versions of datasets.

Best Practice 9: Provide versioning information

Data that will be updated over time should be assigned a version number or, at a minimum, a version date, and that identifier should be distributed with the data.

This best practice is a specialization of the higher level Use machine-readable formats to provide metadata.

Why

Version information makes a dataset uniquely identifiable. Uniqueness enables consumers to determine how the data changes across time and whether they are working with the latest version of a dataset. Good versioning helps them to determine when to update to a newer version. Explicit versioning allows for repeatability in research, enables comparisons, and prevents confusion. Using version numbers that follow a standardized approach can also set consumer expectations about how the versions differ.

Intended Outcome

It should be possible for data consumers to easily determine which version of the data they are working with.

Possible Approach to Implementation

The precise method adopted for providing versioning information may vary according to the context, however there are some basic guidelines that can be followed, for example:

  • Include a version number as part of the metadata for the dataset.
  • Use a consistent numbering scheme with a meaningful approach to incrementing digits, such as [SchemaVer].
  • Provide a description of what has changed since the previous version. The Web Ontology Language provides a number of annotation properties for version information [OWL2-QUICK-REFERENCE] and the Provenance Ontology [PROV-O] defines several types of link between versions.
  • If the data is made available through an API, the URI used to request the latest version of the data should not change as the versions change, but it should be possible to request a specific version through the API.
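
A sketch of the first and third guidelines in Turtle, using the OWL version annotation mentioned above (dataset URI hypothetical):

  @prefix dct: <http://purl.org/dc/terms/> .
  @prefix owl: <http://www.w3.org/2002/07/owl#> .
  @prefix xsd: <http://www.w3.org/2001/XMLSchema#> .

  <http://example.org/dataset/bus-stops/v1.1>
      owl:versionInfo "1.1" ;
      dct:issued      "2015-02-24"^^xsd:date ;
      dct:description "Version 1.1: adds accessibility information for each stop; no stops removed."@en .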

How to Test

Check that a unique version number or date is provided with the metadata describing the dataset.

Evidence

Relevant requirements: R-DataVersion

Best Practice 10: Provide version history

A version history should be available for versioned data.

Why

In creating applications that use data, it can be helpful to understand the variability of that data over time. Interpreting the data is also enhanced by an understanding of its dynamics. Determining how the various versions of a dataset differ from each other is typically very laborious unless a summary of the differences is provided.

Intended Outcome

It should be possible for data consumers to understand how the data typically changes from version to version and how any two specific versions differ.

Possible Approach to Implementation

Provide a list of published versions and a description for each version that explains how it differs from the previous version. An API can expose a version history with a single dedicated URL that retrieves the latest version of the complete history.
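
For instance, successive versions can be linked with the PROV-O revision property so that the history can be traversed, and the version-independent dataset URI can list all published versions (a sketch with hypothetical URIs):

  @prefix prov: <http://www.w3.org/ns/prov#> .
  @prefix dct:  <http://purl.org/dc/terms/> .

  <http://example.org/dataset/bus-stops/v1.1>
      prov:wasRevisionOf <http://example.org/dataset/bus-stops/v1.0> ;
      dct:description    "Adds accessibility information for each stop."@en .

  <http://example.org/dataset/bus-stops>       # version-independent identifier
      dct:hasVersion <http://example.org/dataset/bus-stops/v1.0> ,
                     <http://example.org/dataset/bus-stops/v1.1> .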

How to Test

Check that a list of published versions is available, and that each version is described.

Evidence

Relevant requirements: R-DataVersion

8.2 Data Identification

Identifiers are simply conventional labels that allow us to distinguish what is being identified from anything else. Identifiers have been used extensively in every information system, making it possible to refer to any particular element. On the Web, a friendly and uniform system of identification is required to enable data persistence and re-use, and is a crucial element in the process of sharing and connecting data.

Best Practice 11: Use unique identifiers

Data, datasets and the different dataset versions must be associated with a unique identifier.

Why

Adopting a common identification system enables basic data identification and comparison processes to be carried out reliably by any stakeholder. Unique identifiers are an essential pre-condition for proper data management and re-use.

Intended Outcome

Data and datasets must be discoverable and citable.

A number of basic requirements should be fulfilled during the definition of an identification system:

  1. Ensure that data and datasets can be identified by their unique identifiers.
  2. The unique identifier needs to remain persistent through time, regardless of the status of the associated resource (data or dataset).
  3. Each version of a dataset must be identified by a URI.
  4. Always use a consistent and uniform, but also flexible and extensible, structure for the composition of identifiers.

Possible Approach to Implementation

The architecture of the World Wide Web [WEBARCH] is based on links between resources identified by globally unique identifiers. These can identify people, places and concepts just as much as the more familiar URLs for Web pages, images etc. HTTP IRIs, the internationalized version of URIs (themselves a super-set of URLs), are a fundamental aspect of the Web.

Internationalized Resource Identifiers (IRIs) [RFC3987] should be used when needed. The choice between URIs and IRIs depends on the intended audience: for a global audience one should use language- and culture-neutral URIs; for a local audience one could use IRIs.

Example 1: URI and IRIs
http://example.com/中国                  # For Chinese speaker: good mnemonic - non-Chinese speakers: bad mnemonic
http://example.com/china                 # For Chinese speaker: acceptable - non-Chinese speakers: better

Some best practices to adopt while using URIs as a data identification system:

  • Associate every data, or dataset, resource with a unique identifier using a representational URI.
  • Use the HTTP protocol to ensure that the resolution of any URI on the Web is possible.
  • When someone looks up an identifier URI, always provide useful information or metadata.
  • Build standard URIs that follow a well-defined and extensible scheme or pattern to provide consistency and uniformity.
  • Avoid broken URIs. In the event that a resource has been modified or deleted, those changes must be communicated using the appropriate response code [RFC7231]. If the resource has changed location, HTTP 3XX codes should be used, whereas if the resource has been deleted a HTTP 410 code should be used.
  • Do not expose information on the technical implementation of the resources represented within the URI. Any information about the underlying technology should be omitted (e.g. file extensions).
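
A hypothetical URI scheme following these practices might look as follows (the pattern, not the example.org domain, is the point):

http://example.org/dataset/bus-stops                 # version-independent identifier for the dataset
http://example.org/dataset/bus-stops/v1.1            # identifier for a specific version
http://example.org/dataset/bus-stops/stop/1234       # identifier for an individual data item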

How to Test

Check that there is a documented scheme of Web identifiers for the data in question. For any of the existing identifiers test that the associated data can always be retrieved by means of its identifier and in a technologically neutral way.

Evidence

Relevant requirements: R-UniqueIdentifier, R-Citable

8.3 Data Formats

The formats in which data is made available to consumers are a key aspect of making that data usable. The best, most flexible access mechanism in the world is pointless unless it serves data in formats that enable use and reuse. Below we detail best practices in selecting formats for your data, both at the level of files and that of individual fields. W3C encourages use of formats that can be used by the widest possible audience and processed most readily by computing systems. Source formats, such as database dumps or spreadsheets, used to generate the final published format, are out of scope. This document is concerned with what is actually published rather than internal systems used to generate the published data.

Best Practice 12: Use machine-readable standardized data formats

Data must be available in a machine-readable standardized data format that is adequate for its intended or potential use.

Why

As data becomes more ubiquitous, and datasets become larger and more complex, processing by computers becomes ever more crucial. Posting data in a format that is not machine readable places severe limitations on the continuing usefulness of the data. Using non-standard data formats is costly and inefficient, and the data may lose meaning as it is transformed. On the other hand, standardized data formats enable interoperability as well as future uses, such as remixing or visualization, many of which cannot be anticipated when the data is first published.

Intended Outcome

Published data on the Web must be readable and processable by typical computing systems. Any data consumer who wishes to work with the data and is authorized to do so must be able to do so with computational tools typically available in the relevant domain.

Possible Approach to Implementation

Consider which data formats potential users of the data are most likely to have the necessary tools to parse. Non-proprietary data formats include, but are not limited to, CSV, NetCDF, XML, JSON and RDF. Standard data formats as well as the use of standard data vocabularies will better enable machines to process the data.

How to Test

Check that the data format conforms to a known machine-readable data format specification in current use among anticipated data users.

Evidence

Relevant requirements: R-FormatMachineRead, R-FormatStandardized

Issue 8
Should we use machine-readable standardized data formats BP be split in two? Issue-138

Best Practice 13: Use open data formats

Data should be available in a nonproprietary data format.

Why

Open data formats are usable by anyone. Proprietary data formats may be difficult or impractical for some data users to view or parse. Thus, the use of open data formats increases the possibilities for use and re-use of data.

Intended Outcome

It should be possible for any person who wants to use or re-use the data to do so without investment in proprietary software.

Possible Approach to Implementation

Make data available in open data formats including but not limited to CSV, XML, Turtle, NetCDF and JSON.

How to Test

Check if it is possible to read, process, and store the data without using any proprietary software package.

Evidence

Relevant requirements: R-FormatOpen

Best Practice 14: Provide data in multiple formats

Data should be available in multiple data formats.

Why

Providing data in more than one format reduces costs incurred in data transformation. It also minimizes the possibility of introducing errors in the process of transformation. If many users need to transform the data into a specific data format, publishing the data in that format from the beginning saves time and money and prevents errors many times over. Lastly it increases the number of tools and applications that can process the data.

Intended Outcome

It should be possible for data consumers to work with the data without transforming it.

Possible Approach to Implementation

Consider the data formats most likely to be needed by intended users, and consider alternatives that are likely to be useful in the future. Data publishers must balance the effort required to make the data available in many formats against the benefit, but providing at least one alternative will greatly increase the usability of the data.
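
One way to advertise the available formats is through the dataset's metadata, for example as multiple DCAT distributions (a sketch with hypothetical URIs):

  @prefix dcat: <http://www.w3.org/ns/dcat#> .

  <http://example.org/dataset/bus-stops>
      dcat:distribution <http://example.org/dataset/bus-stops/csv> ,
                        <http://example.org/dataset/bus-stops/json> .

  <http://example.org/dataset/bus-stops/csv>
      a dcat:Distribution ;
      dcat:downloadURL <http://example.org/downloads/bus-stops.csv> ;
      dcat:mediaType   "text/csv" .

  <http://example.org/dataset/bus-stops/json>
      a dcat:Distribution ;
      dcat:downloadURL <http://example.org/downloads/bus-stops.json> ;
      dcat:mediaType   "application/json" .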

How to Test

Check that the complete dataset is available in more than one data format.

Evidence

Relevant requirements: R-FormatMultiple

8.4 Data Vocabularies

Issue 9
The section needs terminological discussion on whether we keep "vocabularies", which could be replaced by "data models" or "schemas" and whether we should remove "controlled vocabularies" from the picture. Issue-134
Issue 10
A big part of the section (starting by the section name) is biased towards linked data technology. It should be completed with other references and alternative implementation approaches. Issue-144

Data is often represented in a structured and controlled way, making reference to a range of vocabularies, for example by defining types of nodes and links in a data graph, or types of values for columns in a table, such as the subject of a book or a relationship “knows” between two persons. Additionally, the values used may come from a limited set of pre-existing values or resources: for example object types, roles of a person, countries in a geographic area, or possible subjects for books. Such vocabularies ensure a level of control, standardization and interoperability in the data. They can also serve to improve the usability of datasets. If, say, a dataset contains a reference to a concept described in several languages, that reference allows applications to localize the display of search results depending on the language of the user.

According to W3C, vocabularies define the concepts and relationships (also referred to as “terms”) used to describe and represent an area of concern. Vocabularies are used to classify the terms that can be used in a particular application, characterize possible relationships, and define possible constraints on using those terms. Several categories of vocabularies have been coined, for example, ontology, controlled vocabularies, thesaurus, taxonomy, semantic network.

There is no strict division between the artifacts referred to by these names. “Ontology” tends however to denote the vocabularies of classes and properties that structure the descriptions of resources in (linked) datasets. In relational databases, these correspond to the names of tables and columns; in XML, they correspond to the elements defined by an XML Schema. Ontologies are the key building blocks for inference techniques on the Semantic Web. The first means offered by W3C for creating ontologies is the RDF Schema [RDF-SCHEMA] language. It is possible to define more expressive ontologies with additional axioms using languages such as those in The Web Ontology Language [OWL2-OVERVIEW].

On the other hand, “controlled vocabularies”, “concept schemes”, “knowledge organization systems” enumerate and define resources that can be employed in the descriptions made with the former kind of vocabulary. A concept from a thesaurus, say, “architecture”, will for example be used in the subject field for a book description (where “subject” has been defined in an ontology for books). For defining the terms in these vocabularies, complex formalisms are most often not needed. Simpler models have thus been proposed to represent and exchange them, such as the ISO 25964 data model [ISO-25964] or W3C's Simple Knowledge Organization System [SKOS-PRIMER].

This section presents best practices for data vocabularies accessible as URI sets on the Web, which are applicable to any kind of vocabulary.

Best Practice 15: Document vocabularies

Vocabularies should be clearly documented.

Why

Documentation defines what is within the vocabulary, and the better the documentation, the higher the possibility of re-using the vocabulary and the datasets built with it.

Intended Outcome

The vocabulary should be human-readable.

Possible Approach to Implementation

A vocabulary may be published together with human-readable Web pages, as detailed in the recipes for serving vocabularies with HTML documents in the Best Practice Recipes for Publishing RDF Vocabularies [SWBP-VOCAB-PUB]. Elements from the vocabulary are defined with attributes containing human-understandable labels and definitions, such as rdfs:label, rdfs:comment, dc:description, skos:prefLabel, skos:altLabel, skos:note, skos:definition, skos:example, etc. Documentation may benefit from the additional presence of visual documentation such as the UML-style diagram of the W3C Organization Ontology [ORG].
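
For instance, a class in a vocabulary can carry human-readable labels, comments and examples directly, which documentation pages can then render (a sketch with a hypothetical vocabulary term):

  @prefix rdfs: <http://www.w3.org/2000/01/rdf-schema#> .
  @prefix owl:  <http://www.w3.org/2002/07/owl#> .
  @prefix skos: <http://www.w3.org/2004/02/skos/core#> .

  <http://example.org/def/transport#BusStop>
      a owl:Class ;
      rdfs:label   "Bus stop"@en, "Arrêt de bus"@fr ;
      rdfs:comment "A designated place where buses stop for passengers to board or alight."@en ;
      skos:example "The stop 'Main Street / 2nd Avenue' served by lines 10 and 12."@en .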

How to Test

Check that a human user can understand the documentation associated with a vocabulary.

Evidence

Relevant requirements: R-VocabDocum

Best Practice 16: Share vocabularies in an open way

Vocabularies should be shared in an open way

Why

Sharing vocabularies in an open way may increase the usage of a data vocabulary and help to understand the relationships among different vocabularies.

Intended Outcome

The vocabulary should be available for data consumers to use or re-use.

Possible Approach to Implementation

Provide the vocabulary under an open license such as Creative Commons Attribution License CC-BY [CC-ABOUT]. Create entries for the vocabulary in repositories such as LOV, Prefix.cc, Bioportal and the European Commission's Joinup.

How to Test

Check that an open license is available by looking for a URL or link to the document where the license is provided.

Evidence

Relevant requirements: R-VocabOpen

Best Practice 17: Vocabulary versioning

Vocabularies should include versioning information

Why

Versioning information helps to maintain compatibility over time by providing a way to compare different versions as the vocabulary evolves.

Intended Outcome

It must be possible to identify changes to a vocabulary over time.

Possible Approach to Implementation

A vocabulary may be given an identifier for 'the latest version' that remains stable over time, even as the vocabulary evolves. In addition, each version of the vocabulary has its own identifier. W3C documents provide an example. The latest version of this document is always found at http://www.w3.org/TR/dwbp/ but individual versions each have their own URL as well so that its evolution can be tracked and specific versions pointed to if required.

Several vocabularies, including OWL [OWL2-OVERVIEW] and schema.org [SCHEMA-ORG], include properties for version numbers.
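
As a sketch, an OWL ontology header can carry both the stable identifier and per-version information (hypothetical URIs):

  @prefix owl: <http://www.w3.org/2002/07/owl#> .

  <http://example.org/def/transport>                 # stable, version-independent identifier
      a owl:Ontology ;
      owl:versionInfo  "2.0" ;
      owl:versionIRI   <http://example.org/def/transport/2.0> ;
      owl:priorVersion <http://example.org/def/transport/1.0> .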

How to Test

Check that different versions of a vocabulary can be easily identified.

Evidence

Relevant requirements: R-VocabVersion

Best Practice 18: Re-use vocabularies

Existing reference vocabularies should be re-used where possible

Why

Re-using vocabularies increases interoperability and reduces redundancies between vocabularies, encouraging re-use of the data.

Intended Outcome

Datasets (and vocabularies) should re-use core vocabularies.

Possible Approach to Implementation

The Standard Vocabularies section of the W3C Best Practices for Publishing Linked Data [LD-BP] provides guidance on the discovery, evaluation and selection of existing vocabularies.

How to Test

Check that terms used do not replicate those defined by vocabularies in common use within the same domain.

Evidence

Relevant requirements: R-VocabReference

Best Practice 19: Choose the right formalization level

When creating or re-using a vocabulary for an application, a data publisher should opt for a level of formal semantics that fits the data and applications.

Why

Formal semantics may help one to establish precise specifications that convey the intended meaning of the vocabulary and support the performance of complex tasks such as reasoning. On the other hand, complex vocabularies require more effort to produce and understand, which could hamper their re-use, as well as the comparison and linking of datasets that exploit them. Highly formalized data is also harder for inference engines to exploit: for example, using an OWL class in a position where a SKOS concept is enough, or using OWL classes with complex OWL axioms, raises the formal complexity of the data according to the OWL Profiles [OWL2-PROFILES]. Data producers should therefore seek to identify the right level of formalization for particular domains, audiences and tasks, and maybe offer different formalization levels when one size does not fit all.

Intended Outcome

The data should support all application cases but should not be more complex to produce and re-use than necessary.

Possible Approach to Implementation

Identify the "role" played by the vocabulary for the datasets, say, providing classes and properties used to type resources and provide the predicates for RDF statements, or elements in an XML Schema, as opposed to providing simple concepts or codes that are used for representing attributes of the resources described in a dataset. When simpler models are enough to convey the necessary semantics, represent vocabularies using them. For instance, for Linked Data, SKOS may be preferred for simple vocabularies as opposed to formal ontology languages like OWL; see for example how concept schemes and code lists are used in the RDF Data Cube Recommendation [QB].

How to Test

For formal knowledge representation languages, check that applying an inference engine to data that uses a given vocabulary does not produce a large number of statements that are unnecessary for target applications.

Evidence

Relevant requirements: R-VocabReference, R-VocabDocum, R-QualityComparable

Issue 11
The best practice on formalization above (especially sections "Intended outcome" and "How to test") should be re-written in a more technology-neutral way. Issue-144

8.5 Sensitive Data

Sensitive data is any designated data or metadata that is used in limited ways and/or intended for limited audiences. Sensitive data may include personal data, corporate or government data, and mishandling of published sensitive data may lead to damage to individuals or organizations.

To support best practices for publishing sensitive data, data publishers should identify all sensitive data, assess the exposure risk, determine the intended usage, data user audience and any related usage policies, obtain appropriate approval, and determine the appropriate security measures needed to protect the data. Appropriate security measures should also account for secure authentication and use of HTTPS.

At times, because of sharing policies sensitive data may not be available in part or in its entirety. Data unavailability represents gaps that may affect the overall analysis of datasets. To account for unavailable data, data publishers should publish information about unavoidable data gaps.

Best Practice 20: Preserve people's right to privacy

Data must not infringe a person's right to privacy.

Why

Data publishers should preserve the privacy of individuals where the release of personal information would endanger safety (unintended accidents) or security (deliberate attack). Privacy information might include: full name, home address, mail address, national identification number, IP address (in some cases), vehicle registration plate number, driver's license number, face, fingerprints, or handwriting, credit card numbers, digital identity, date of birth, birthplace, genetic information, telephone number, login name, screen name, nickname, health records etc.

Data publishers should identify all personal data, assess the exposure risk, determine the intended usage, data user audience and any related usage policies, obtain appropriate approval, and determine the appropriate security measures needed to protect the data, including secure authentication and use of HTTPS for data transmission.

Intended Outcome

Data that can identify an individual person must not be published without their consent.

Possible Approach to Implementation

The data publisher should establish a security plan for publishing data and metadata. The plan should include preparatory steps to ensure personal data is protected or removed prior to publication. All steps need to be followed prior to the publication of new data or new data formats, particularly binary formats (word processing, spreadsheets, etc.) that may embed personal metadata in files.

Identify any personal data exposure risks. Write a security plan for publishing data and metadata that includes clear guidelines to follow. Prior to publication, put security measures in place and follow them. In preparation for publication, review the data to ensure compliance.

How to Test

Write and test a plan for reviewing, curating and vetting data prior to publication.

Evidence

Relevant requirements: R-SensitivePrivacy

Preserve organization's security

Data should not infringe an organization's security (local government, national government, business).

Why

What

Intended Outcome

Possible Approach to Implementation

How to Test

Evidence

Relevant requirements: R-SensitiveSecurity

Best Practice 21: Provide data unavailability reference

References to data that is not open, or is available under different restrictions to the origin of the reference, should provide context by explaining how or by whom the referred to data can be accessed.

Why

Publishing online documentation about unavailable data due to sensitivity issues provides a means for publishers to explicitly identify knowledge gaps. This provides a contextual explanation for consumer communities thus encouraging use of the data that is available.

Intended Outcome

Publishers should provide information about data that is referred to from the current dataset but that is unavailable or only available under different conditions.

Possible Approach to Implementation

Data publishers may publish an HTML document that gives a human-readable explanation for data unavailability. RDF may be used to provide a machine readable version of the same information. If appropriate, consider editing the server's 4xx response page(s) to provide the information.

How to Test

If the dataset includes references to other data that is unavailable, check whether an explanation is available in the metadata and/or description of it.

Evidence

Relevant requirements: R-DataUnavailabilityReference

Issue 12
Should we use SHOULD or MUST on BP for Sensitive Data? Issue-123

8.6 Data Access

Providing easy access to data on the Web enables both humans and machines to take advantage of the benefits of sharing data using the Web infrastructure. By default, the Web offers access using Hypertext Transfer Protocol (HTTP) methods. This provides access to data at an atomic transaction level. However, when data is distributed across multiple files or requires more sophisticated retrieval methods different approaches can be adopted to enable data access, including bulk download and APIs.

One approach is packaging data in bulk using non-proprietary file formats (for example zip or tar files). Using this approach, bulk data is generally pre-processed server side, where multiple files or directory trees of files are provided as one downloadable file. When bulk data is being retrieved from non-file-system solutions, depending on the data user communities, the data publisher can offer APIs to support a series of retrieval operations representing a single transaction.
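
As an illustration of the bulk packaging approach, the sketch below (in Python; the directory and archive names are hypothetical placeholders) pre-processes a directory tree of data files into a single gzip-compressed tar archive that can then be exposed at one URL for download.

    # A minimal sketch: package a directory of data files into one
    # downloadable archive. Paths are hypothetical placeholders.
    import tarfile
    from pathlib import Path

    def build_bulk_archive(source_dir: str, archive_path: str) -> None:
        with tarfile.open(archive_path, "w:gz") as tar:
            for item in sorted(Path(source_dir).rglob("*")):
                if item.is_file():
                    # Store paths relative to the source directory.
                    tar.add(item, arcname=item.relative_to(source_dir))

    if __name__ == "__main__":
        build_bulk_archive("published-data/2015-02", "2015-02-bulk.tar.gz")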

For data that is streamed to the Web in “real time” or “near real time”, data publishers must publish data or use APIs to enable immediate access, allowing access to critical time-sensitive data such as emergency information, weather forecasting data, or published system metrics. In general, APIs should be available to allow third parties to automatically search and retrieve data published on the Web.

Best Practice 22: Provide bulk download

Data should be available for bulk download.

Why

When Web data is distributed across many URLs but logically organized as one container, accessing the data in bulk is useful. Bulk access provides a consistent means of handling the data as one container. Without it, accessing the data individually is cumbersome, leading to inconsistent approaches to handling the container.

Intended Outcome

It should be possible to download data on the Web in bulk. Data publishers should provide a way, either through bulk file formats or APIs, for consumers to access this type of data.

Possible Approach to Implementation

Depending on the nature of the data and consumer needs, possible approaches could include:

  • Preprocessing a copy of the data into a compressed archive format so that the data is more easily accessible as one URL. This is particularly useful for handling data that changes infrequently or on a scheduled basis.
  • Hosting an API such as a REST or SOAP service that dynamically retrieves individual data and returns a bulk container. This approach is useful for capturing a snapshot of the data. The API can also be useful for consumers to customize what they want included or excluded.
  • Hosting a database, web page, or SPARQL endpoint that contains discoverable metadata [VOCAB-DCAT] describing the container and the data URLs associated with the container (a minimal sketch of such a metadata record follows this list).
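
As a sketch of the third approach above, the following Python snippet builds a minimal DCAT-style [VOCAB-DCAT] description of a bulk distribution as JSON-LD; the dataset URI, title and download URL are hypothetical placeholders.

    # A minimal sketch: a DCAT-style description of a bulk distribution,
    # serialized as JSON-LD. All URIs and literals are hypothetical.
    import json

    catalog_record = {
        "@context": {"dcat": "http://www.w3.org/ns/dcat#",
                     "dct": "http://purl.org/dc/terms/"},
        "@id": "http://example.org/dataset/bus-stops",
        "@type": "dcat:Dataset",
        "dct:title": "Bus stops of MyCity",
        "dcat:distribution": {
            "@type": "dcat:Distribution",
            "dcat:downloadURL": "http://example.org/downloads/bus-stops-bulk.tar.gz",
            "dcat:mediaType": "application/gzip",
        },
    }

    print(json.dumps(catalog_record, indent=2))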

How to Test

Humans can retrieve copies of preprocessed bulk data through existing tools such as a browser. Clients can test bulk access through an API or through queries to Web resources with discoverable metadata about the bulk data.

Evidence

Relevant requirements: R-AccessBulk

Best Practice 23: Follow REST principles when designing APIs

APIs for accessing data should follow REST architectural approaches.

Why

Considering RESTful architectural aspects when designing an API facilitates easier development, reuse of pre-existing infrastructure (the Web), and a shorter learning curve for developers who want to build applications that access the data. It also supports sustainability, as "the technologies that make up this foundation include the Hypertext Transfer Protocol (HTTP), Uniform Resource Identifier (URI), markup languages such as HTML and XML, and Web-friendly formats" [RICHARDSON]. Furthermore, it avoids the need for specialized clients or for service registries such as UDDI.

APIs are frequently built using other approaches, such as SOAP. In the context of data on the Web, however, the architecture of the Web itself, as described by the REST architectural style, offers the same entry point for humans and machines to access data. If humans can already access the data via URLs, the same resources can also offer multiple representations in different formats and support content negotiation between applications.

Intended Outcome

  • It should be possible for machines to access data in a variety of formats from the same URI through content negotiation (see the sketch after this list).
  • It should be possible for data consumers to access data using a browser as a client.
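
As a sketch of content negotiation from the consumer side (using the Python requests library; the URI is hypothetical and the server is assumed to offer both JSON and CSV representations):

    # A minimal sketch of content negotiation against a hypothetical URI.
    import requests

    URI = "http://example.org/data/bus-stops"

    as_json = requests.get(URI, headers={"Accept": "application/json"})
    as_csv = requests.get(URI, headers={"Accept": "text/csv"})

    print(as_json.headers.get("Content-Type"))  # e.g. application/json
    print(as_csv.headers.get("Content-Type"))   # e.g. text/csv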

Possible Approach to Implementation

Design RESTful APIs using HTTP and good pragmatic REST principles. There is no single agreed set of principles for REST APIs; some are implicitly defined by the HTTP standard, while others have emerged by consensus or are still under discussion. The following rules are widely adopted so far (a minimal sketch applying several of them follows this list):

  • Use hierarchical, readable and technology agnostic Uniform Resource Identifiers (URIs) to address resources in a consistent way.
  • Use the URI path to convey your Resources and Collections model.
  • Use nouns rather than verbs (except for controllers that do not correspond to resources). Use HTTP verbs instead to operate on Collections and Resources.
  • Use standard HTTP methods according to their expected default behavior. GET requests and query parameters should not alter state.
  • Use HTTP headers to provide metadata and to negotiate the serialization of data formats. Support multiple formats.
  • Use HTTP status codes (including error codes) according to their original purpose.
  • Simplify associations. Use query parameters to hide complexity and provide filtering, sorting, field selection and paging for collections.
  • Version your API. Never release an API without a version and make the version mandatory.
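
The sketch below applies several of these rules (using the Python Flask framework; the resource names, version prefix and sample data are hypothetical placeholders, not part of any particular standard).

    # A minimal sketch of a versioned, noun-based REST API.
    from flask import Flask, jsonify, request

    app = Flask(__name__)

    DATASETS = {"bus-stops": {"title": "Bus stops of MyCity", "records": 1200}}

    # Version in the URI; nouns for collections and resources.
    @app.route("/v1/datasets", methods=["GET"])
    def list_datasets():
        # Query parameters provide filtering and paging; GET does not alter state.
        limit = int(request.args.get("limit", 10))
        return jsonify(sorted(DATASETS)[:limit])

    @app.route("/v1/datasets/<dataset_id>", methods=["GET"])
    def get_dataset(dataset_id):
        if dataset_id not in DATASETS:
            # Use HTTP status codes according to their original purpose.
            return jsonify({"error": "dataset not found"}), 404
        return jsonify(DATASETS[dataset_id])

    if __name__ == "__main__":
        app.run()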

How to Test

Use API testing tools to verify that the API follows the RESTful design principles above, for example that resources respond to standard HTTP methods and support content negotiation.

Evidence

Relevant requirements: R-AccessBulk

Best Practice 24: Provide real-time access

Where data is produced in real-time, it should be available on the Web in real-time.

Why

The presence of real-time data on the Web enables access to critical time-sensitive data and encourages the development of real-time Web applications. Real-time access depends on real-time data producers making their data readily available to the data publisher. The necessity of providing real-time access for a given application will need to be evaluated on a case-by-case basis, considering refresh rates, latency introduced by data post-processing steps, infrastructure availability, and the data needed by consumers. In addition to making data accessible, data publishers may provide additional information describing data gaps, data errors and anomalies, and publication delays.

Intended Outcome

Data should be available in real time or near real time, where real time means a range from milliseconds to a few seconds after data creation, and near real time means a predetermined delay for expected data delivery.

Possible Approach to Implementation

Real-time data accessibility may be achieved through two means:

  • Push - as data is produced, the producer communicates it to the data publisher, either by disseminating the data directly to the publisher or by making shared storage accessible.
  • On-Demand (Pull) - available real-time data is made available upon request. In this case, data publishers provide an API to facilitate these read-only requests (a minimal sketch follows below).

In addition to providing data access, giving access to error conditions, anomalies, and instrument "housekeeping" data enhances the ability of real-time applications to interpret and convey real-time data quality to consumers, and so helps ensure credibility.
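
A minimal sketch of the on-demand (pull) approach is given below, again using the Python Flask framework; the measurement source and field names are hypothetical placeholders. Each record carries its creation timestamp so that consumers can judge freshness.

    # A minimal sketch of a read-only pull API for near-real-time data.
    from datetime import datetime, timezone
    from flask import Flask, jsonify

    app = Flask(__name__)

    def read_latest_measurement():
        # In a real system this would read from the live data source.
        return {"station": "river-gauge-42", "level_m": 1.87,
                "created_at": datetime.now(timezone.utc).isoformat()}

    @app.route("/v1/measurements/latest", methods=["GET"])
    def latest():
        return jsonify(read_latest_measurement())

    if __name__ == "__main__":
        app.run()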

How to Test

To adequately test real-time data access, data will need to be tracked from the time it is initially collected to the time it is published and accessed. [PROV-O] can be used to describe these activities. Caution should be used when analyzing real-time access for systems that consist of multiple computer systems. For example, tests that rely on wall-clock timestamps may reflect inconsistencies between the individual computer systems rather than data publication latency.
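
As a sketch of such a test (assuming each published record carries a machine-readable creation timestamp, as in the pull example above; the endpoint is hypothetical), a client can compare the creation time with the time of access, bearing in mind the clock-synchronization caveat just mentioned.

    # A minimal sketch: estimate publication latency for near-real-time data.
    # Comparing wall-clock times across machines is only indicative unless
    # the clocks are synchronized.
    from datetime import datetime, timezone
    import requests

    record = requests.get("http://example.org/v1/measurements/latest").json()
    created = datetime.fromisoformat(record["created_at"])
    latency = datetime.now(timezone.utc) - created
    print("Approximate publication latency: %.1f s" % latency.total_seconds())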

Evidence

Relevant requirements: R-AccessRealTime

Best Practice 25: Provide data up to date

Data must be available in an up-to-date manner and the update frequency made explicit.

Why

The availability of data on the Web should closely coincide with the time the data is created or collected, or with the time at which it is processed or changed. Carefully synchronizing data publication with the update frequency encourages data consumer confidence and re-use.

Intended Outcome

When new data is provided or data is updated, it must be published to coincide with the data changes.

Possible Approach to Implementation

Implement an API to enable data access. When data is provided through bulk access, new files with new data should be provided as soon as additional data is created or updated.

How to Test

Write a standard operating procedure for the data publisher to keep the published test data on the Web site up to date.

Following the standard operating procedure (a minimal test-client sketch follows these steps):

  • Write test client to access published data.
  • Access data and save first copy locally.
  • Publish an updated version of data.
  • Access data and save second copy locally.
  • Compare first copy to second copy to verify change.
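
A minimal sketch of such a test client is given below (the dataset URL is a hypothetical placeholder); it downloads the published data before and after an update and compares checksums to verify the change.

    # A minimal sketch of the test procedure above: download the published
    # data twice and verify that it changed.
    import hashlib
    import requests

    URL = "http://example.org/downloads/bus-stops-bulk.tar.gz"

    def checksum(url: str) -> str:
        return hashlib.sha256(requests.get(url).content).hexdigest()

    first = checksum(URL)    # access data and save first copy
    input("Publish an updated version of the data, then press Enter...")
    second = checksum(URL)   # access data again
    print("Data changed" if first != second else "Data unchanged")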

Evidence

Relevant requirements: R-AccessUptodate

Best Practice 26: Maintain separate versions for a data API

If data is made available through an API, the API itself should be versioned separately from the data. Old versions should continue to be available.

Why

Developers need to be made aware of changes to an API so that they can update their code to use it. When an API is changed, as opposed to when the data it makes available is changed, releasing it as a new version makes it possible to gracefully transition from the old version to the new one. Keeping the older versions available avoids breaking applications that cannot be updated.

Intended Outcome

It should be possible for developers to transition easily from one version of the API to another. Applications that are impractical to transition should continue to work. The API version should not be updated when data versions are updated, only when the API itself changes, and that should be infrequent.

Possible Approach to Implementation

Release updates to your API under a slightly different base URI so that older versions remain available under the previous base URI. For example, http://myapi.org/v1/dogs/alfred retrieves the older version of data about a dog named Alfred, and http://myapi.org/v2/dogs/alfred retrieves the newer version of data about Alfred. Keeping the version number as far to the left as possible in the API call allows developers to switch to the newer version with the least effort.
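
A minimal sketch of this approach is shown below (using the Python Flask framework; the resource names and sample data are hypothetical placeholders). The old base URI keeps working while the new version is served alongside it.

    # A minimal sketch of API versioning in the base URI.
    from flask import Flask, jsonify

    app = Flask(__name__)

    DOGS_V1 = {"alfred": {"name": "Alfred", "age": 3}}
    DOGS_V2 = {"alfred": {"name": "Alfred", "age": 3, "breed": "collie"}}  # v2 adds a field

    @app.route("/v1/dogs/<dog_id>")
    def dog_v1(dog_id):
        # The old version remains available so existing clients keep working.
        if dog_id not in DOGS_V1:
            return jsonify({"error": "not found"}), 404
        return jsonify(DOGS_V1[dog_id])

    @app.route("/v2/dogs/<dog_id>")
    def dog_v2(dog_id):
        if dog_id not in DOGS_V2:
            return jsonify({"error": "not found"}), 404
        return jsonify(DOGS_V2[dog_id])

    if __name__ == "__main__":
        app.run()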

How to Test

Existing calls to the API should continue to work when the API is updated. New calls to a slightly different base URI should retrieve data according to the new rules.

Evidence

Relevant requirements: R-DataVersion

8.7 Data Preservation

Data preservation is a well understood and commonly performed task for static and self-contained data. This commonly includes the following steps:

The model most commonly referred to is the Open Archival Information System [OAIS]. Many digital preservation institutions are implementing this model or some variant of it. Web pages can be preserved following the same strategies, considering a Web site as a static, self-contained data set that can be preserved as a snapshot at a fixed point in time. When preserving data on the Web, some new elements have to be taken into account:

The preservation of Web data should generally focus on the preservation of the description of entities.

Issue 13
The Working group has not yet reached consensus on whether data preservation is in scope and, if so, which aspects of it. Issue 143.

8.8 Feedback

Publishing data on the Web enables data sharing on a large scale, providing data access to a wide range of audiences with different levels of expertise. Data publishers want to ensure that the published data meets data consumer needs, and user feedback is crucial to this. Feedback has benefits for both data publishers and data consumers, helping data publishers to improve the integrity of their published data, as well as encouraging the publication of new data. Feedback allows data consumers to have a voice, describing usage experiences (e.g. applications using the data), preferences and needs. When possible, feedback should also be publicly available for other data consumers to examine. Making feedback publicly available allows users to become aware of other data consumers, supports a collaborative environment, and lets users see whether community experiences, concerns or questions are currently being addressed.

From a user interface perspective there are different ways to gather feedback from data consumers, including site registration, contact forms, quality rating selections, surveys and comment boxes for blogging. From a machine perspective the data publisher can also record metrics on data usage or information about specific applications that consumers are currently relying upon. Feedback such as this establishes a communication channel between data publishers and data consumers. In order to quantify and analyze usage feedback, it should be recorded in a machine-readable format. Blogs and other publicly available feedback should be displayed in a human-readable form through the user interface.

This section provides some best practices to be followed by data publishers in order to enable data consumers to provide feedback about the consumed data. This feedback can be intended for humans or machines.

Best Practice 27: Gather feedback from data consumers

Data publishers should provide a means for consumers to offer feedback.

Why

Providing feedback contributes to improving the quality of published data, may encourage the publication of new data, helps data publishers understand data consumers' needs better and, when feedback is made publicly available, enhances the consumers' collaborative experience.

Intended Outcome

It should be possible for data consumers to provide feedback and rate data in both human and machine-readable formats. The feedback should be Web accessible and it should provide a URL reference to the corresponding dataset.

Possible Approach to Implementation

Provide data consumers with one or more feedback mechanisms including, but not limited to: a registration form, contact form, point and click data quality rating buttons, or a comment box for blogging.

Collect feedback in a machine-readable format and use a vocabulary to capture the semantics of the feedback information (a minimal sketch of such a record is given below).
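
As an illustration, the sketch below captures a single feedback item in a simple machine-readable form; the field names, rating scale and URLs are hypothetical placeholders rather than an agreed vocabulary.

    # A minimal sketch of a machine-readable feedback record.
    import json
    from datetime import datetime, timezone

    feedback = {
        "dataset": "http://example.org/dataset/bus-stops",  # URL of the data concerned
        "submitted_at": datetime.now(timezone.utc).isoformat(),
        "rating": 4,                                         # e.g. on a 1-5 scale
        "comment": "Stop coordinates are accurate, but route identifiers are missing.",
        "contact": "mailto:consumer@example.org",
    }

    print(json.dumps(feedback, indent=2))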

How to Test

  • Demonstrate how feedback can be collected from data consumers.
  • Verify that the feedback is persistently stored. If the feedback is made publicly available verify that a URL links back to the published data being referenced.
  • Check that the feedback format conforms to a known machine-readable format specification in current use among anticipated data users.

Evidence

Relevant requirements: R-UsageFeedback, R-QualityOpinions

9. Conclusions

A. Acknowledgements

The editors gratefully acknowledge the contributions made to this document by all members of the working group and the chairs: Hadley Beeman, Steve Adler, Yaso Córdova, Deirdre Lee.

B. References

B.1 Normative references

[RFC2119]
S. Bradner. Key words for use in RFCs to Indicate Requirement Levels. March 1997. Best Current Practice. URL: https://tools.ietf.org/html/rfc2119
[RFC3987]
M. Duerst; M. Suignard. Internationalized Resource Identifiers (IRIs). January 2005. Proposed Standard. URL: https://tools.ietf.org/html/rfc3987
[WEBARCH]
Ian Jacobs; Norman Walsh. Architecture of the World Wide Web, Volume One. 15 December 2004. W3C Recommendation. URL: http://www.w3.org/TR/webarch/

B.2 Informative references

[BNF]
Bibliothèque nationale de France. Reference information about authors, works, topics. URL: http://data.bnf.fr/
[CC-ABOUT]
Creative Commons: About Licenses URL: http://creativecommons.org/licenses/
[DC-TERMS]
Dublin Core Metadata Initiative. Dublin Core Metadata Initiative Terms, version 1.1. 11 October 2010. DCMI Recommendation. URL: http://dublincore.org/documents/2010/10/11/dcmi-terms/.
[HTML-RDFA]
Manu Sporny. HTML+RDFa 1.1 - Second Edition. 16 December 2014. W3C Proposed Edited Recommendation. URL: http://www.w3.org/TR/html-rdfa/
[ISO-25964]
Stella Dextre Clarke et al. ISO 25964 – the international standard for thesauri and interoperability with other vocabularies. URL: http://www.niso.org/schemas/iso25964/
[JSON-LD]
Manu Sporny; Gregg Kellogg; Markus Lanthaler. JSON-LD 1.0. 16 January 2014. W3C Recommendation. URL: http://www.w3.org/TR/json-ld/
[LD-BP]
Bernadette Hyland; Ghislain Auguste Atemezing; Boris Villazón-Terrazas. Best Practices for Publishing Linked Data. 9 January 2014. W3C Note. URL: http://www.w3.org/TR/ld-bp/
[LDP]
Steve Speicher; John Arwe; Ashok Malhotra. Linked Data Platform 1.0. 16 December 2014. W3C Proposed Recommendation. URL: http://www.w3.org/TR/ldp/
[LODC]
Max Schmachtenberg; Christian Bizer; Anja Jentzsch; Richard Cyganiak. The Linking Open Data Cloud Diagram. URL: http://lod-cloud.net/
[Microdata]
Ian Hickson. HTML Microdata. 29 October 2013. W3C Note. URL: http://www.w3.org/TR/microdata/
[OAIS]
ISO/TC 20/SC 13. Space data and information transfer systems -- Open archival information system (OAIS) -- Reference model. 21 August 2012. ISO Standard. URL: http://www.iso.org/iso/home/store/catalogue_ics/catalogue_detail_ics.htm?csnumber=57284
[ODI-LICENSING]
Open Data Institute. Publisher's Guide to Open Data Licensing. URL: http://theodi.org/guides/publishers-guide-open-data-licensing
[ODRL]
Renato Iannella; Susanne Guth; Daniel Paehler; Andreas Kasten. ODRL Version 2.0 Core Model. 24 April 2012. W3C Community Group Specification. URL: http://www.w3.org/community/odrl/two/model/
[ODRS]
Leigh Dodds. Open Data Rights Statement Vocabulary. 29 July 2013. URL: http://schema.theodi.org/odrs/
[OKFN-INDEX]
Open Knowledge Foundation. Global Open Data Index. URL: http://index.okfn.org/
[ORG]
Dave Reynolds. The Organization Ontology. 16 January 2014. W3C Recommendation. URL: http://www.w3.org/TR/vocab-org/
[OWL2-OVERVIEW]
W3C OWL Working Group. OWL 2 Web Ontology Language Document Overview (Second Edition). 11 December 2012. W3C Recommendation. URL: http://www.w3.org/TR/owl2-overview/
[OWL2-PROFILES]
Boris Motik; Bernardo Cuenca Grau; Ian Horrocks; Zhe Wu; Achille Fokoue. OWL 2 Web Ontology Language Profiles (Second Edition). 11 December 2012. W3C Recommendation. URL: http://www.w3.org/TR/owl2-profiles/
[OWL2-QUICK-REFERENCE]
Jie Bao; Elisa Kendall; Deborah McGuinness; Peter Patel-Schneider. OWL 2 Web Ontology Language Quick Reference Guide (Second Edition). 11 December 2012. W3C Recommendation. URL: http://www.w3.org/TR/owl2-quick-reference/
[PROV-IMP]
Trung Dong Huynh; Paul Groth; Stephan Zednik. PROV Implementation Report. 30 April 2013. W3C Working Group Note. URL: http://www.w3.org/TR/prov-implementations/
[PROV-O]
Timothy Lebo; Satya Sahoo; Deborah McGuinness. PROV-O: The PROV Ontology. 30 April 2013. W3C Recommendation. URL: http://www.w3.org/TR/prov-o/
[QB]
Richard Cyganiak; Dave Reynolds. The RDF Data Cube Vocabulary. 16 January 2014. W3C Recommendation. URL: http://www.w3.org/TR/vocab-data-cube/
[RDA]
Research Data Alliance. URL: http://rd-alliance.org
[RDF-SCHEMA]
Dan Brickley; Ramanathan Guha. RDF Schema 1.1. 25 February 2014. W3C Recommendation. URL: http://www.w3.org/TR/rdf-schema/
[RFC7231]
R. Fielding, Ed.; J. Reschke, Ed.. Hypertext Transfer Protocol (HTTP/1.1): Semantics and Content. June 2014. Proposed Standard. URL: https://tools.ietf.org/html/rfc7231
[RICHARDSON]
Leonard Richardson; Sam Ruby. RESTful Web Services: Web services for the real world. 2007. O'Reilly Media.
[SCHEMA-ORG]
Schema.org. URL: http://schema.org/
[SKOS-PRIMER]
Antoine Isaac; Ed Summers. SKOS Simple Knowledge Organization System Primer. 18 August 2009. W3C Note. URL: http://www.w3.org/TR/skos-primer
[SWBP-VOCAB-PUB]
Diego Berrueta; Jon Phipps. Best Practice Recipes for Publishing RDF Vocabularies. 28 August 2008. W3C Note. URL: http://www.w3.org/TR/swbp-vocab-pub/
[SchemaVer]
Alex Dean. Introducing SchemaVer for semantic versioning of schemas. 2014. URL: http://snowplowanalytics.com/blog/2014/05/13/introducing-schemaver-for-semantic-versioning-of-schemas/
[UCR]
Deirdre Lee; Bernadette Farias Lóscio; Phil Archer. Data on the Web Best Practices Use Cases & Requirements. Note. URL: http://www.w3.org/TR/dwbp-ucr/
[VOCAB-DCAT]
Fadi Maali; John Erickson. Data Catalog Vocabulary (DCAT). 16 January 2014. W3C Recommendation. URL: http://www.w3.org/TR/vocab-dcat/
[XHTML-VOCAB]
XHTML 2 Working Group. XHTML Vocabulary. 27 October 2010. URL: http://www.w3.org/1999/xhtml/vocab
[ZAVERI]
Amrapali Zaveri; Anisa Rula; Andrea Maurino; Ricardo Pietrobon; Jens Lehmann; Sören Auer. Quality Assessment for Linked Data: A Survey. submitted to Semantic Web Journal. URL: http://www.semantic-web-journal.net/system/files/swj773.pdf
[ccREL]
Hal Abelson; Ben Adida; Mike Linksvayer; Nathan Yergler. ccREL: The Creative Commons Rights Expression Language. 1 May 2008. W3C Member Submission. URL: http://www.w3.org/Submission/ccREL/