Data quality notes

From Data on the Web Best Practices


From the charter

[The Quality and Granularity Description Vocabulary] is foreseen as an extension to DCAT to cover the quality of the data, how frequently is it updated, whether it accepts user corrections, persistence commitments etc. When used by publishers, this vocabulary will foster trust in the data amongst developers.

Some important design questions:

  • the vocabulary could be an extension of DCAT, not repeating any of its elements, or be entirely new. The former is highly preferred. We will work out the model first and then try to map it to DCAT.
  • should the quality and granularity vocabularies be split?

Scoping and requirements from DWBP WG

Relevant use cases and requirements

For an updated analysis of UCR from the perspective of data quality, see this page (still ongoing work - currently assigned to Antoine & Deirdre)

Relevant best practices

First work: prior to the current FPWD, the WG identified a number of best practices here. The following have been noted to be quality-focused: QUA01 Complete data, QUA02 Primary data, QUA03 Built-in data sharing systems, QUA04 Quality assurance, QUA05 Feedback mechanisms, QUA06 Provide support, QUA07 Link to external references. Some others are quality-related as well, such as MET02 Complete metadata, MOD5 Data models (vocabularies) conformance, and TIM01 Timeliness updates. The following has been noted to be granularity-focused: GRA01 Maximum granularity.

For an updated analysis of BP from the perspective of data quality, see this page (still ongoing work - currently assigned to Riccardo & Christophe)

Initial work

All issues and actions on Data Quality vocabulary

Scoping and requirements from other activities

Suggested requirements:

The Quality Vocabulary should:

  • define general quality metrics, but allow for inclusion of additional domain-specific metrics (list taken from slide 8 of this presentation; credit: Makx Dekkers/Open Data Support/PwC/CC-BY (c) 2013 European Commission):
    • accuracy;
    • availability;
    • completeness;
    • conformance;
    • consistency;
    • credibility;
    • processability;
    • relevance;
    • timeliness.
    • (other potential dimensions from the paper 'Quality Assessment for Linked Open Data: A Survey': availability, licensing, interlinking, security, performance, accuracy, consistency, conciseness, reputation, believability, verifiability, objectivity, completeness, amount-of-data, relevancy, representational-conciseness, representational-consistency, understandability, interpretability, versatility)
  • address reputation/certification issues
  • address liability/indemnity
  • support objective metrics (publisher & user)
  • support subjective opinions (publisher & user)
  • cover provenance (how the data was created/collected, and by whom)
  • support concept of status (controlled vocabulary: e.g. legally definitive, informative, validated)
  • consider support for SLAs
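
To make the suggested requirements more concrete, here is a minimal Python sketch of a data structure that could hold such quality statements. It is an illustration only and not a proposal for the vocabulary itself: the names QualityStatement, DatasetQuality, GENERAL_DIMENSIONS and EXAMPLE_STATUSES are hypothetical, and the status values are just the examples given in the list above.

```python
from dataclasses import dataclass, field
from typing import Any, List, Optional, Set

# General dimensions, copied from the list of general quality metrics above.
GENERAL_DIMENSIONS = {
    "accuracy", "availability", "completeness", "conformance",
    "consistency", "credibility", "processability", "relevance",
    "timeliness",
}

# Example status values only; the actual controlled vocabulary is undecided.
EXAMPLE_STATUSES = {"legally definitive", "informative", "validated"}


@dataclass
class QualityStatement:
    """One statement about a dataset (hypothetical structure)."""
    dimension: str    # a general or domain-specific dimension
    value: Any        # a measured value or a free-text opinion
    source: str       # "publisher" or "user"
    objective: bool   # True for a metric, False for a subjective opinion


@dataclass
class DatasetQuality:
    """Quality statements attached to one dataset (hypothetical structure)."""
    dataset: str
    status: Optional[str] = None
    extra_dimensions: Set[str] = field(default_factory=set)
    statements: List[QualityStatement] = field(default_factory=list)

    def add(self, statement: QualityStatement) -> None:
        # Accept general dimensions plus any registered domain-specific ones.
        allowed = GENERAL_DIMENSIONS | self.extra_dimensions
        if statement.dimension not in allowed:
            raise ValueError(f"unknown dimension: {statement.dimension}")
        if statement.source not in ("publisher", "user"):
            raise ValueError(f"unknown source: {statement.source}")
        self.statements.append(statement)

    def by_dimension(self, dimension: str) -> List[QualityStatement]:
        # Return all statements recorded for the given dimension.
        return [s for s in self.statements if s.dimension == dimension]


# Usage: one objective publisher metric and one subjective user opinion.
quality = DatasetQuality(
    dataset="http://example.org/dataset/1",  # placeholder identifier
    status="validated",
    extra_dimensions={"spatial-resolution"},  # domain-specific example
)
quality.add(QualityStatement("completeness", 0.97, "publisher", True))
quality.add(QualityStatement("relevance", "useful for us", "user", False))
```

In this sketch, objective metrics and subjective opinions share one structure and differ only in a flag, while domain-specific dimensions are registered per dataset. Both are design choices made for brevity and not conclusions of the WG.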

The WG should address data granularity, where data granularity refers to the level of detail within the dataset (precision).

Links, related work

Defining quality

Vocabulary-related work

Deployment opportunities

Implementation in CKAN

If we meet their requirements (as discussed, e.g., in this thread).

Data quality vocabulary elements could be added to , if they don't have equivalent there already