Data quality notes

From Data on the Web Best Practices


From the charter

[The Quality and Granularity Description Vocabulary] is foreseen as an extension to DCAT to cover the quality of the data, how frequently is it updated, whether it accepts user corrections, persistence commitments etc. When used by publishers, this vocabulary will foster trust in the data amongst developers.

Some important design questions:

  • the vocabulary could be an extension of DCAT, not repeating any of its elements, or be entirely new. The former is highly preferred. We will work out the model first and then try to map it to DCAT.
  • should quality and granularity vocabs be split?

Scoping and requirements from DWBP WG

Relevant use cases and requirements

For an updated analysis of UCR from the perspective of data quality, see this page (still ongoing work - currently assigned to Antoine & Deirdre)

Sources for use cases, requirements and challenges:

Relevant best practices

For an updated analysis of BP from the perspective of data quality, see this page (still ongoing work - currently assigned to Riccardo & Christophe)

Previous work: Prior to the current FPWD, the WG identified a number of best practices here. The following have been noted to be quality-focused:

Note from Antoine: I fail to see now why only some of these BPs would be more relevant for quality than others in the table, e.g. MET02 Complete metadata, MOD5 Data models (vocabularies) conformance or TIM01 Timeliness updates. Probably the group will have to go through all BPs and judge!

The following have been noted to be granularity-focused:

There might also be some relevant pointers in the Guidance on the Provision of Metadata.

Initial work

Open issues:

  • ISSUE-55: The word "granularity" can mean many things: scope, city/state/country, data aggregation
  • ISSUE-64: Jeremy T's expression of concern over 'data must be complete': not realistic; better to say where it isn't complete

Side (BP doc) issues:

  • ISSUE-116: Best Practices for Data Quality - Insertion of specific strategies apart from DATA QUALITY Vocabulary
  • ISSUE-117: Should Data quality vocabulary be mentioned as specific strategy in BP Document?

Closed issues:

  • ISSUE-65: How to carry forward the data quality issue - more use cases? available options? text only? machine readable dimensions?

Scoping and requirements from other activities

Suggested requirements:

The Quality Vocabulary should:

  • define general quality metrics, but allow for inclusion of additional domain-specific metrics (list taken from slide 8 of this presentation; credit: Makx Dekkers/Open Data Support/PwC/CC-BY (c) 2013 European Commission):
    • accuracy;
    • availability;
    • completeness;
    • conformance;
    • consistency;
    • credibility;
    • processability;
    • relevance;
    • timeliness.
    • (other potential dimensions from the 'Quality Assessment for Linked Open Data: A Survey' paper: availability, licensing, interlinking, security, performance, accuracy, consistency, conciseness, reputation, believability, verifiability, objectivity, completeness, amount-of-data, relevancy, representational-conciseness, representational-consistency, understandability, interpretability, versatility)
  • address reputation/certification issues
  • address liability/indemnity
  • support objective metrics (publisher & user)
  • support subjective opinions (publisher & user)
  • cover provenance (how the data was created/collected, and by whom)
  • support concept of status (controlled vocabulary: e.g. legally definitive, informative, validated)
  • consider support for SLAs
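
To make the suggested requirements above more concrete, here is a minimal Python sketch of one possible data model. It is not a proposal for the vocabulary itself. All class and field names (Dimension, Source, Measurement, Opinion, QualityDescription) are hypothetical, as are the example dataset URI and the "spatial-resolution" metric. The sketch only covers general dimensions plus domain-specific ones, objective metrics and subjective opinions from publishers and users, and a status value. Provenance, liability and SLAs are left out.

```python
from dataclasses import dataclass, field
from enum import Enum
from typing import List, Optional


class Dimension(Enum):
    """General quality dimensions, copied from the list above."""
    ACCURACY = "accuracy"
    AVAILABILITY = "availability"
    COMPLETENESS = "completeness"
    CONFORMANCE = "conformance"
    CONSISTENCY = "consistency"
    CREDIBILITY = "credibility"
    PROCESSABILITY = "processability"
    RELEVANCE = "relevance"
    TIMELINESS = "timeliness"


class Source(Enum):
    """Who made the statement (hypothetical two-value split)."""
    PUBLISHER = "publisher"
    USER = "user"


@dataclass
class Measurement:
    """An objective metric value for one dimension."""
    dimension: str  # a Dimension value, or any domain-specific name
    value: float
    source: Source


@dataclass
class Opinion:
    """A subjective, free-text statement about one dimension."""
    dimension: str
    text: str
    source: Source


@dataclass
class QualityDescription:
    """Quality statements attached to one dataset (hypothetical model)."""
    dataset: str
    status: Optional[str] = None  # e.g. "legally definitive", "informative", "validated"
    measurements: List[Measurement] = field(default_factory=list)
    opinions: List[Opinion] = field(default_factory=list)

    def dimensions(self) -> List[str]:
        """Return the sorted names of all dimensions that were described."""
        names = {m.dimension for m in self.measurements}
        names |= {o.dimension for o in self.opinions}
        return sorted(names)


# Example usage with made-up values.
desc = QualityDescription(dataset="http://example.org/dataset/1", status="validated")
desc.measurements.append(
    Measurement(Dimension.COMPLETENESS.value, 0.93, Source.PUBLISHER))
desc.measurements.append(
    Measurement("spatial-resolution", 10.0, Source.PUBLISHER))  # domain-specific
desc.opinions.append(
    Opinion(Dimension.RELEVANCE.value, "Useful for city-level planning", Source.USER))
print(desc.dimensions())
```

Storing the dimension as a plain string rather than as a Dimension member is one way to let domain-specific metrics sit alongside the general ones. A real vocabulary would more likely identify dimensions by URI.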

The WG should address data granularity, where data granularity refers to the level of detail within the dataset (precision).

Links, related work

Defining quality

Vocabulary-related work

Deployment opportunities

Implementation in CKAN

(If we meet their requirements, as discussed e.g. in this thread.)

Data quality vocabulary elements could be added to , if they don't have an equivalent there already.