Share-PSI 2.0 logo

Best Practice: Enable quality assessment of open data

21 July 2016

This version
http://www.w3.org/2013/share-psi/bp/eqa-20160721/
Latest version
http://www.w3.org/2013/share-psi/bp/eqa/
Previous version
http://www.w3.org/2013/share-psi/bp/eqa-20160627/

This is one of a set of Best Practices developed by the Share-PSI 2.0 Thematic Network.

Share-PSI Best Practice: Enable quality assessment of open data by Share-PSI 2.0 is licensed under a Creative Commons Attribution 4.0 International License.


Outline

Data Quality (DQ) is primarily perceived to be a subjective term: what suffices, or is “good enough”, for one person might be inferior for another. “Suffice” here means being suitable to fulfil a certain need in a process. However, besides this subjective aspect, there is an objective view of DQ which can be measured and which helps to establish provable and comprehensible DQ metrics. Adherence to standards, enforced by tools which in turn are embedded in and used by processes, will help to raise DQ. In order to raise DQ sustainably, measures need to be in place all along the data pipeline and not only at the providing front end. DQ improvement has to be considered a process rather than a one-time measure.

Challenge

The proliferation of open data as a means to foster open innovation processes towards improved or new products and services, to increase transparency and to perform self-empowered impact measurement of policies also raises concerns about the quality of the provided resources. The early assumption that more data, even of uncertain origin and quality, will unconditionally result in better decisions as long as the right algorithms are used, has given way again to the insight that the principle of garbage in, garbage out still holds true. This fact, combined with rising concerns regarding data platform usability, data literacy and trust, puts the quality aspect into focus. Ironically, government Data Quality has only recently become an issue, primarily because governments started to release datasets as Open Data, which enables stakeholders to exercise their rights of citizen oversight. Bringing together data from diverse sources for the first time makes some data issues, such as missing data, obvious, but even more so deficiencies which arise from poor or absent Master Data Management.

Solution

Traditional metrics to assess Data Quality, such as accuracy, applicability and understandability, remain relevant and, in the realm of Open Data, are extended by measures such as openness, timeliness and primacy. Work carried out in the European Commission's Open Data Support project suggests the following aspects to consider:

  • Accuracy: Is the data correctly representing the real-world entity or event?
  • Consistency: Is the data free of contradictions?
  • Availability: Can the data be accessed now and over time?
  • Completeness: Does the data include all data items representing the entity or event?
  • Conformance: Is the data following accepted standards?
  • Credibility: Is the data based on trustworthy sources?
  • Processability: Is the data machine-readable?
  • Relevance: Does the data include an appropriate amount of data?
  • Timeliness: Is the data representing the actual situation and is it published soon enough?

DQ improvement measures have to be in place all along the (open) data life cycle; otherwise quality measures will be perceived as an additional burden, causing effort and costing money. Also note that the Open Data Life Cycle is, indeed, a cycle, which suggests setting up data improvement measures as a process rather than a one-time measure.

Why is this a Best Practice?

Poor DQ will reduce data users' trust and prevent an open data market from unfolding. Investment in DQ will pay off internally for the administration, as the potential for interoperable data services rises, as well as externally, as it becomes easier for data users to blend together datasets from diverse sources to create added-value services.

How do I implement this Best Practice?

Implementation of this BP requires addressing the problem from a technical as well as organisational perspective.

Technically, DQ can be raised by adhering to conventions, norms and standards. However, the adoption of conventions, norms and standards requires governance at various levels. Setting up governance structures is typically the responsibility of the CIO or someone in charge with comparable powers and duties.

  • It is within the CIO's responsibility to provide guidance on how to structure and implement ICT systems which use common and agreed conventions, norms and standards.
  • The CIO should be responsible for identifying semantically equivalent data entities, describing standards according to which these data entities should be modelled, and monitoring adherence to these standards.

Common data entities should, where possible, be modelled according to the Core Vocabularies.

CSV files could be annotated using W3C's CSV on the Web Recommendations, which also include a formalised model to describe the columns of CSV files.
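
As a concrete illustration, the sketch below writes a minimal CSV on the Web metadata file for a hypothetical data.csv; the column names, datatypes and title are assumptions chosen only for this example and are not prescribed by this Best Practice.

```python
import json

# Minimal CSVW (CSV on the Web) metadata sketch for a hypothetical "data.csv".
# Column names, datatypes and the title are illustrative assumptions.
metadata = {
    "@context": "http://www.w3.org/ns/csvw",
    "url": "data.csv",
    "dc:title": "Population by municipality",
    "tableSchema": {
        "columns": [
            {"name": "municipality", "titles": "Municipality", "datatype": "string"},
            {"name": "population", "titles": "Population", "datatype": "integer"}
        ],
        "primaryKey": "municipality"
    }
}

# By convention the metadata file is published next to the CSV file,
# e.g. as data.csv-metadata.json.
with open("data.csv-metadata.json", "w", encoding="utf-8") as f:
    json.dump(metadata, f, indent=2)
```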

Dataset descriptions should be made according to the DCAT-AP vocabulary.
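
A minimal sketch of such a description, using the DCAT terms on which DCAT-AP builds, is shown below with the rdflib library; the dataset and distribution URIs, titles and file locations are invented for illustration, and a reasonably recent rdflib release (one that ships a DCAT namespace) is assumed.

```python
from rdflib import Graph, Literal, URIRef
from rdflib.namespace import DCAT, DCTERMS, RDF

# Hypothetical identifiers, used for illustration only.
DATASET = URIRef("http://data.example.org/dataset/population-2016")
DISTRIBUTION = URIRef("http://data.example.org/dataset/population-2016/csv")

g = Graph()
g.bind("dcat", DCAT)
g.bind("dct", DCTERMS)

# Minimal dcat:Dataset description covering title, description and distribution.
g.add((DATASET, RDF.type, DCAT.Dataset))
g.add((DATASET, DCTERMS.title, Literal("Population by municipality 2016", lang="en")))
g.add((DATASET, DCTERMS.description, Literal("Example dataset description.", lang="en")))
g.add((DATASET, DCAT.distribution, DISTRIBUTION))

g.add((DISTRIBUTION, RDF.type, DCAT.Distribution))
g.add((DISTRIBUTION, DCAT.accessURL, URIRef("http://data.example.org/files/population-2016.csv")))
g.add((DISTRIBUTION, DCTERMS.format, Literal("CSV")))

print(g.serialize(format="turtle"))
```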

During the data publishing stage, the W3C Data Quality Vocabulary (DQV) can be used. This provides a framework in which the quality of a dataset can be described either by the publisher or the wider audience.
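
To make this concrete, the hedged sketch below attaches a single quality measurement to the dataset described above, using the DQV terms dqv:hasQualityMeasurement, dqv:QualityMeasurement, dqv:isMeasurementOf, dqv:computedOn and dqv:value; the metric URI and the measured value are invented for illustration.

```python
from rdflib import BNode, Graph, Literal, Namespace, URIRef
from rdflib.namespace import RDF, XSD

DQV = Namespace("http://www.w3.org/ns/dqv#")

DATASET = URIRef("http://data.example.org/dataset/population-2016")
# Hypothetical metric: the share of records that pass a completeness check.
COMPLETENESS_METRIC = URIRef("http://data.example.org/metrics/completenessRatio")

g = Graph()
g.bind("dqv", DQV)

measurement = BNode()
g.add((DATASET, DQV.hasQualityMeasurement, measurement))
g.add((measurement, RDF.type, DQV.QualityMeasurement))
g.add((measurement, DQV.isMeasurementOf, COMPLETENESS_METRIC))
g.add((measurement, DQV.computedOn, DATASET))
g.add((measurement, DQV.value, Literal("0.98", datatype=XSD.double)))

print(g.serialize(format="turtle"))
```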

Tools can automatically check a certain range of DQ domains, such as adherence to the declared encoding (for example UTF-8) or the structural regularity of CSV files.
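
A minimal sketch of such a check in Python is shown below: it verifies that a file decodes with the claimed encoding and that every CSV row has the same number of fields as the header. The file name and encoding are assumptions for the example; production checks would typically rely on dedicated validation tools.

```python
import csv

def check_csv(path, encoding="utf-8"):
    """Collect simple structural problems: bad encoding or ragged rows."""
    problems = []
    try:
        with open(path, encoding=encoding, newline="") as f:
            reader = csv.reader(f)
            header = next(reader, [])
            for line_no, row in enumerate(reader, start=2):
                if len(row) != len(header):
                    problems.append(
                        f"line {line_no}: {len(row)} fields, expected {len(header)}"
                    )
    except UnicodeDecodeError as err:
        problems.append(f"file is not valid {encoding}: {err}")
    return problems

# Example usage with a hypothetical file.
for issue in check_csv("data.csv"):
    print(issue)
```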

For assessing the quality of the dataset itself prior to publication, e.g. when publishing statistical data in RDF, an RDF Data Cube validator (PDF) can be used.
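
The validator referenced above is a stand-alone tool; purely to illustrate the kind of check it performs, the sketch below runs a simplified version of one RDF Data Cube integrity constraint (every qb:Observation should be linked to a qb:DataSet) as a SPARQL ASK query with rdflib. The input file name is an assumption, and this is not a substitute for a full validator.

```python
from rdflib import Graph

# Simplified variant of an RDF Data Cube integrity check: the ASK query
# returns true if some qb:Observation has no qb:dataSet link.
IC_OBSERVATION_WITHOUT_DATASET = """
PREFIX qb: <http://purl.org/linked-data/cube#>
ASK {
  ?obs a qb:Observation .
  FILTER NOT EXISTS { ?obs qb:dataSet ?ds . }
}
"""

g = Graph()
g.parse("statistics.ttl", format="turtle")  # hypothetical input file

violated = bool(g.query(IC_OBSERVATION_WITHOUT_DATASET).askAnswer)
print("constraint violated" if violated else "constraint satisfied")
```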

To enrich the data with quality assessment information and track provenance in an RDF integration process, a tool such as UnifiedViews can be used.
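
UnifiedViews itself is configured through its own pipeline interface; as a hedged illustration of the underlying idea of recording provenance for an integration step, the sketch below uses the W3C PROV-O vocabulary with rdflib. All resource URIs and the timestamp are invented, and this is not how UnifiedViews is programmed, only a sketch of the provenance statements such a process could produce.

```python
from rdflib import Graph, Literal, Namespace, URIRef
from rdflib.namespace import RDF, XSD

PROV = Namespace("http://www.w3.org/ns/prov#")

# Hypothetical resources: the published dataset, its raw source and the
# integration activity that produced it.
PUBLISHED = URIRef("http://data.example.org/dataset/population-2016")
SOURCE = URIRef("http://data.example.org/raw/population-extract")
ACTIVITY = URIRef("http://data.example.org/activity/integration-run-42")

g = Graph()
g.bind("prov", PROV)

g.add((PUBLISHED, RDF.type, PROV.Entity))
g.add((SOURCE, RDF.type, PROV.Entity))
g.add((ACTIVITY, RDF.type, PROV.Activity))
g.add((PUBLISHED, PROV.wasDerivedFrom, SOURCE))
g.add((PUBLISHED, PROV.wasGeneratedBy, ACTIVITY))
g.add((ACTIVITY, PROV.endedAtTime, Literal("2016-07-21T12:00:00", datatype=XSD.dateTime)))

print(g.serialize(format="turtle"))
```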

Organisationally:

  • The CIO should implement a data governance framework which comprises data architecture management, meta-data management, and master data management (MDM).
  • The importance of data as a mission-critical asset can be raised by establishing the role of the Chief Data Officer (CDO).
  • The principles of ISO 8000, such as vocabulary usage, semantic encoding, provenance, accuracy and completeness, can be taken into account.

The obligatory use of minimal widespread technical standards such as UTF-8 could be enforced by legal measures or by order of the federal CIO.

To assess the publishing process, consider the steps described by ODI Certificates (or similar).

Further reading

Where has this best practice been implemented?

Country | Implementation | Contact Point
Austria | Mission Statement of the Sub-working Group Quality Assurance of Open Data Portals of the Cooperation Open Government Data Austria | Cooperation OGD Austria
UK | Cross platform character encoding profile |
UK | ODI Certificate for the Westminster City Council | Westminster City Council
Serbia | Validating RDF Data Cube Models | Valentina Janev, Mihailo Pupin Institute, University of Belgrade, Belgrade, Serbia
Finland | Valmistele ja avaa (Prepare and open), Section 3.6. Tiedon viimeistely ja laatu (Finishing the data and data quality) | Prime Minister's Office Finland

References

Contact Info

Original Authors: Johann Höchtl, Valentina Janev

Contributors: Muriel Foulonneau, Lorenzo Canova

Editors: Valentina Janev, Johann Höchtl

Issue Tracker

Any matters arising from this BP, including implementation experience, lessons learnt, places where it has been implemented or guides that cite this BP, can be recorded and discussed on the project's GitHub repository.
