Data quality notes
- 1 From the charter
- 2 Scoping and requirements from DWBP WG
- 3 Initial work
- 4 Scoping and requirements from other activities
- 5 Deployment opportunities
From the charter
[The Quality and Granularity Description Vocabulary] is foreseen as an extension to DCAT to cover the quality of the data, how frequently it is updated, whether it accepts user corrections, persistence commitments etc. When used by publishers, this vocabulary will foster trust in the data amongst developers.
Some important design questions:
- the vocabulary could be an extension of DCAT that does not repeat any of its elements, or it could be entirely new. The former is highly preferred. We will work out the model first and then try to map it to DCAT.
- should quality and granularity vocabs be split?
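To make the "extension, not repetition" option concrete, here is a minimal sketch of a dataset description that reuses existing DCAT and Dublin Core terms and adds only the properties DCAT lacks. The `qv:` namespace, the property names `qv:acceptsUserCorrections` and `qv:persistenceCommitment`, and the example dataset IRI are all hypothetical placeholders, not terms agreed by the WG.

```python
import json

# Prefixes for the JSON-LD context. "dcat" and "dct" are the real DCAT and
# Dublin Core Terms namespaces; "qv" is a made-up namespace for illustration.
CONTEXT = {
    "dcat": "http://www.w3.org/ns/dcat#",
    "dct": "http://purl.org/dc/terms/",
    "qv": "http://example.org/quality-vocab#",  # hypothetical
}


def describe_dataset(iri, title, frequency_iri, accepts_corrections, persistence):
    """Build a JSON-LD description that extends DCAT without repeating it."""
    return {
        "@context": CONTEXT,
        "@id": iri,
        "@type": "dcat:Dataset",
        # Reused terms: title and update frequency already exist.
        "dct:title": title,
        "dct:accrualPeriodicity": {"@id": frequency_iri},
        # New (hypothetical) terms: only what DCAT does not cover.
        "qv:acceptsUserCorrections": accepts_corrections,
        "qv:persistenceCommitment": persistence,
    }


doc = describe_dataset(
    iri="http://example.org/dataset/bus-stops",  # hypothetical dataset
    title="Bus stops",
    frequency_iri="http://purl.org/cld/freq/daily",
    accepts_corrections=True,
    persistence="Available at this IRI until at least 2020",
)
print(json.dumps(doc, indent=2))
```

In this sketch the update frequency is carried by the existing `dct:accrualPeriodicity` property, so the new namespace stays limited to what is genuinely missing.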
Scoping and requirements from DWBP WG
Relevant use cases and requirements
For an updated analysis of UCR from the perspective of data quality, see this page (still ongoing work - currently assigned to Antoine & Deirdre)
Sources for use cases, requirements and challenges:
- current editor's draft of the use cases document
- latest published version of the use cases document
- challenges that the use cases editors have pulled from our use cases. (See all 4 worksheets/tabs in the document)
Relevant best practices
For an updated analysis of BP from the perspective of data quality, see this page (still ongoing work - currently assigned to Riccardo & Christophe)
Previous work: Prior to the current FPWD, the WG identified a number of best practices here. The following have been noted to be quality-focused:
- QUA01 Complete data
- QUA02 Primary data
- QUA03 Built-in data sharing systems
- QUA04 Quality assurance
- QUA05 Feedback mechanisms
- QUA06 Provide support
- QUA07 Link to external references
Note from Antoine: I fail to see now why only some of these BPs would be more relevant for quality than others in the table, e.g. MET02 Complete metadata, MOD5 Data models (vocabularies) conformance or TIM01 Timeliness updates. Probably the group will have to go through all BPs and judge!
The following have been noted to be granularity-focused:
There might also be some relevant pointers in the Guidance on the Provision of Metadata
- Minutes from London's F2F discussion on data quality
- PROV scenarios and quality
- Phil's first thoughts and UML diagram of Q&G Voc
- A mail thread that touches on quality
- Share-PSI workshop, March 2015: Makx's summary, raw notes
- Data Quality Vocabulary (DQV): Very early draft conceptual scheme
- ISSUE-55: The word "granularity" can mean many things: scope, city/state/country, data aggregation
- ISSUE-64: Jeremy T's expression of concern over 'data must be complete': not realistic, better to say where it isn't complete
Side (BP doc) issues:
- ISSUE-116: Best Practices for Data Quality - Insertion of specific strategies apart from DATA QUALITY Vocabulary
- ISSUE-117: Should Data quality vocabulary be mentioned as specific strategy in BP Document?
- ISSUE-65: How to carry forward the data quality issue - more use cases? available options? text only? machine readable dimensions?
Scoping and requirements from other activities
The Quality Vocabulary should:
- define general quality metrics, but allow for inclusion of additional domain-specific metrics (list taken from slide 8 of this presentation. Credit: Makx Dekkers/Open Data Support/PwC/CC-BY (c) 2013 European Commission)
- (other potential dimensions from the 'Quality Assessment for Linked Open Data: A Survey' paper: availability, licensing, interlinking, security, performance, accuracy, consistency, conciseness, reputation, believability, verifiability, objectivity, completeness, amount-of-data, relevancy, representational-conciseness, representational-consistency, understandability, interpretability, versatility)
- address reputation/certification issues
- address liability/indemnity
- support objective metrics (publisher & user)
- support subjective opinions (publisher & user)
- provenance (how was the data created/collected, by whom)
- support concept of status (controlled vocabulary: e.g. legally definitive, informative, validated)
- consider support for SLAs
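As a rough illustration of how several of the requirements above fit together (general versus domain-specific metrics, objective versus subjective statements, publisher versus user, and a controlled status value), here is a minimal data-model sketch. All class, field, and function names are hypothetical and chosen for this example only; the status values are simply the examples given in the list, not an agreed controlled vocabulary.

```python
from dataclasses import dataclass, field
from typing import List, Optional, Union

# Example status values copied from the requirement list; a real controlled
# vocabulary would have to be defined by the WG (assumption).
STATUS_VALUES = {"legally definitive", "informative", "validated"}
ASSESSORS = {"publisher", "user"}


@dataclass(frozen=True)
class Metric:
    """A quality metric attached to a dimension such as 'completeness'."""
    name: str
    dimension: str
    domain_specific: bool = False  # False means a general metric


@dataclass(frozen=True)
class Assessment:
    """One statement about a dataset: a measurement or an opinion."""
    metric: Metric
    value: Union[float, str]
    assessed_by: str           # "publisher" or "user"
    subjective: bool = False   # True for opinions, False for measurements

    def __post_init__(self) -> None:
        if self.assessed_by not in ASSESSORS:
            raise ValueError(f"unknown assessor: {self.assessed_by!r}")


@dataclass
class QualityDescription:
    """Quality information attached to one dataset, identified by its IRI."""
    dataset: str
    status: Optional[str] = None
    assessments: List[Assessment] = field(default_factory=list)

    def __post_init__(self) -> None:
        if self.status is not None and self.status not in STATUS_VALUES:
            raise ValueError(f"status not in controlled list: {self.status!r}")

    def objective(self) -> List[Assessment]:
        return [a for a in self.assessments if not a.subjective]

    def opinions(self) -> List[Assessment]:
        return [a for a in self.assessments if a.subjective]


# Usage: one objective publisher measurement and one subjective user opinion.
completeness = Metric("fraction of non-empty cells", "completeness")
timetable_fit = Metric("fit for timetable apps", "relevancy", domain_specific=True)

desc = QualityDescription("http://example.org/dataset/bus-stops", status="validated")
desc.assessments.append(Assessment(completeness, 0.97, "publisher"))
desc.assessments.append(Assessment(timetable_fit, "good", "user", subjective=True))

print(len(desc.objective()), len(desc.opinions()))
```

Keeping the assessor and the subjective flag as separate fields means a publisher's opinion and a user's measurement can both be expressed, which the list above seems to call for.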
WG should address data granularity, where data granularity refers to the level of detail within the dataset (precision)
- Semantic Web data quality wiki and pointers
- CKAN quality discussion
- Open Data and Metadata Quality presentation by Makx and PwC
- Quality Assessment for Linked Open Data: A Survey. Amrapali Zaveri, Anisa Rula, Andrea Maurino, Ricardo Pietrobon, Jens Lehmann, Sören Auer. NB: Anisa Rula and Andrea Maurino from University of Milano-Bicocca from COMSODE are willing to help and perhaps join the WG
- Kevin Roebuck Data Quality: High-impact Strategies
- Mark David Hansen. Zero Defect Data: Tackling the Corporate Data Quality Problem. 1991
- Joshua Tauberer. Open Government Data. Section 5.2 Data Quality: Precision, Accuracy, and Cost
- Sharon Dawes. Open data quality: a practical view. 2012
- Stefan Urbanek. Data Quality: What is It?
- Thomas R. Bruce, Diane Hillmann. The Continuum of Metadata Quality: Defining, Expressing, Exploiting
- Bernadette Loscio et al. Using Information Quality for the Identification of Relevant Web Data Sources
- DIACHRON daQ model and quality dimension framework by Jeremy
- Luzzu A Quality Assessment Framework for Linked Data
- 72 best practices for OpenData (mentioned in charter)
- papers on assessing trustworthiness of datasets by Davide Ceolin (VU).
- quality for data on research objects (slides 21-26)
- Work by Monica Scannapieco et al.: Data quality under the computer science perspective; Data quality at a glance
- A Metrics-Driven Approach for Quality Assessment of Linked Open Data
- Socio-technical Impediments of Open Data
- Risk Analysis to Overcome Barriers to Open Data
- The Sebastopol principles
- ISO 8000 Data quality series.
- ISO 25012 Data quality model.
- Share-PSI workshop has a session to discuss quality aspects
- Bruce and Hillmann on metadata quality in a LD context
- Dataset Quality Vocabulary (daQ)
- W3C accessibility Evaluation and Report Language (EARL)
- Dublin Core collection update frequency
- Schema.org http://schema.org/Dataset class
- Prov-O Ontology
- HCLS Community Profile as an example of DCAT/VoID profile
Implementation in CKAN
If we meet their requirements (as discussed e.g. in this thread).
Data quality vocabulary elements could be added to schema.org http://schema.org/Dataset, if they don't already have an equivalent there