Quality and Granularity Description Vocabulary

From Data on the Web Best Practices

Class Diagram

I would be surprised if any of the classes and properties in the diagram below survive for very long. The diagram is designed to help us discuss the possibilities and make progress. It's meant to reflect the various headings etc. in the text below. The only classes/properties knowingly excluded at this point are the ones related to the ODI Expert certificate, because they'd be easy to define. IMO the question there is whether we want to add them at all rather than what they should be.

[Figure: DCAT Extension Data Quality & Granularity class diagram (very rough)]


Open Data Support Fields

The discussion held during the London f2f meeting was instructive. Makx referred us to work he's done under the EU ISA Programme, in particular slide 8 from this presentation. Those points are listed below as headings.

Accuracy

Is the data correctly representing the real world entity or event?

Expansion

This looks like 90% of the problem we're trying to solve here!

See also granularity (below).

It might be useful to have a description of why data is or isn't accurate

Availability

Can the data be accessed now and over time?

Expansion

Since a dcat:Dataset is an abstract thing, it might be available at any point in time: past, present or future. We already have dcterms:issued, so two properties come to mind:

  • dcat:verifiedAvailableOn {date} (the last time someone/something checked that the dataset was accessible, probably applies to a dcat:Distribution, not dcat:Dataset)
  • dcat:availableUntilAtLeast {date} (potentially a date on which the dataset is expected to be withdrawn)

Other questions that come to mind: how do we indicate that the dataset is expected to be available 'for the foreseeable future?'
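To make the discussion concrete, here is a minimal sketch of how a consumer might use the two proposed properties. It is plain Python with no RDF library. Both property names are only the proposals above, not existing DCAT terms, and the function name, the status labels and the 30-day `max_age_days` threshold are assumptions made up for this example.

```python
from datetime import date, timedelta
from typing import Optional


def availability_status(today: date,
                        verified_available_on: Optional[date],
                        available_until_at_least: Optional[date],
                        max_age_days: int = 30) -> str:
    """Classify a distribution using the two proposed (hypothetical) properties."""
    # The publisher's stated minimum availability date has already passed.
    if available_until_at_least is not None and today > available_until_at_least:
        return "past-guaranteed-availability"
    # Nobody has recorded a successful access check.
    if verified_available_on is None:
        return "never-verified"
    # The last successful check is older than the chosen threshold.
    if today - verified_available_on > timedelta(days=max_age_days):
        return "verification-stale"
    return "recently-verified"


print(availability_status(date(2014, 6, 1), date(2014, 5, 20), None))
```

In this sketch a missing availableUntilAtLeast value simply means "no known withdrawal date". That is one possible reading of "for the foreseeable future", but it cannot be distinguished from a publisher who just didn't say.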

Completeness

Does the data include all data items representing the entity or event?

Expansion

As phrased, I see a discussion of the Open World Assumption coming up. This also looks similar to Relevance (below). Is this like "you have enough data here to conclude X but not Y?"

Conformance

Is the data following accepted standards?

Expansion

This looks similar to void:vocabulary, i.e. list the vocabularies used. But the data might instead follow a profile (machine readable or otherwise) or a convention.
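Listing the vocabularies used is the part of this that could be computed. The sketch below is one rough heuristic, assuming the data is available as (subject, predicate, object) tuples of IRI strings and that a vocabulary can be identified by cutting each predicate IRI at its last '#' or '/'. The function names and sample triples are invented for the example.

```python
def namespace_of(iri: str) -> str:
    """Return the IRI up to and including its last '#' or '/'."""
    cut = max(iri.rfind("#"), iri.rfind("/"))
    return iri[:cut + 1] if cut >= 0 else iri


def vocabularies_used(triples):
    """Collect the distinct namespaces of the predicates in (s, p, o) triples."""
    return sorted({namespace_of(p) for _, p, _ in triples})


# Invented sample data using real Dublin Core and DCAT predicate IRIs.
triples = [
    ("http://example.org/ds/1", "http://purl.org/dc/terms/title", "Spending 2014"),
    ("http://example.org/ds/1", "http://purl.org/dc/terms/issued", "2014-04-01"),
    ("http://example.org/ds/1", "http://www.w3.org/ns/dcat#keyword", "spending"),
]

print(vocabularies_used(triples))
# ['http://purl.org/dc/terms/', 'http://www.w3.org/ns/dcat#']
```

This only looks at predicates, so classes that appear as objects of rdf:type are missed, and it says nothing about whether the data follows a profile or convention.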

Consistency

Does the data contain contradictions?

Expansion

That might be computable (especially if the data follows a formal ontology). It might also be a statement about whether the data has been cleaned/checked in some way.
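As one example of what "computable" could mean here, the sketch below flags subjects that have more than one distinct value for a property that is supposed to be single-valued. It assumes triples are plain tuples and that the caller supplies the set of single-valued properties. All names and the sample data are invented. Treating two different values as a contradiction is a simplification: under OWL semantics, two IRI values for a functional property would be inferred to be the same individual rather than reported as an error.

```python
from collections import defaultdict


def functional_property_conflicts(triples, functional_properties):
    """Map (subject, property) to its values wherever a single-valued
    property has more than one distinct value."""
    values = defaultdict(set)
    for s, p, o in triples:
        if p in functional_properties:
            values[(s, p)].add(o)
    return {key: sorted(objs) for key, objs in values.items() if len(objs) > 1}


# Invented sample data: two conflicting population figures for one entity.
triples = [
    ("ex:London", "ex:population", "8300000"),
    ("ex:London", "ex:population", "7500000"),
    ("ex:London", "ex:label", "London"),
    ("ex:London", "ex:label", "Londres"),
]

print(functional_property_conflicts(triples, {"ex:population"}))
# {('ex:London', 'ex:population'): ['7500000', '8300000']}
```

The two labels are not reported because ex:label was not declared single-valued. A check like this could feed a statement that the data has been cleaned or checked in some way.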

Credibility

Is the data based on trustworthy sources?

Expansion

Sounds like a job for PROV-O

Processability

Is the data machine readable?

Expansion

Maybe the value of this is one of the 5 stars of LOD?
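If the value were a star rating, it could be derived from a handful of yes/no facts about a distribution. A minimal sketch follows, assuming the usual cumulative reading of the 5 stars (open licence, structured data, non-proprietary format, URIs used to identify things, links to other data). The function and parameter names are invented for the example.

```python
def lod_stars(open_licence: bool,
              structured: bool,
              non_proprietary_format: bool,
              uses_uris: bool,
              linked_to_other_data: bool) -> int:
    """Return a 0-5 star rating; each star requires all of the earlier ones."""
    stars = 0
    for criterion in (open_licence, structured, non_proprietary_format,
                      uses_uris, linked_to_other_data):
        if not criterion:
            break
        stars += 1
    return stars


# A CSV file under an open licence: structured and non-proprietary, no URIs.
print(lod_stars(True, True, True, False, False))  # 3
```

Because the stars are cumulative, a proprietary spreadsheet stops at two stars even if the other criteria happen to be true.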

Relevance

Does this data include an appropriate amount of data?

Expansion

For what? (see Completeness above)

Timeliness

Is the data representing the actual situation, and is it published soon enough?

Expansion

I read this as 'how long is there between the data being created/curated and it being published?' The value space should be a period of time. The ODI certificates have a lot more to say about this.
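Under that reading, the value could be computed from two dates that are often already present. The sketch below assumes the creation date comes from dcterms:created and the publication date from dcterms:issued, and expresses the gap as a whole number of days in the form of an ISO 8601 duration. The function names are invented for the example.

```python
from datetime import date


def publication_lag_days(created: date, issued: date) -> int:
    """Days between creation (dcterms:created) and publication (dcterms:issued)."""
    if issued < created:
        raise ValueError("issued date is earlier than created date")
    return (issued - created).days


def as_iso8601_duration(days: int) -> str:
    """Format a whole number of days as an ISO 8601 duration, e.g. 'P29D'."""
    return f"P{days}D"


lag = publication_lag_days(date(2014, 1, 31), date(2014, 3, 1))
print(lag, as_iso8601_duration(lag))  # 29 P29D
```

Whether a given lag counts as "soon enough" depends on the kind of data, so this only produces the period of time, not a judgement about it.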

Computed Metrics

The f2f discussed various scenarios for metrics being published. Are we talking about user feedback? Computed metrics? Numbers? Textual comments?

Do we need a text field other than the existing dcterms:description? Do we need a dcat:ComputedMetric class (with inverse properties between it and dcat:Dataset)?

Granularity

What does this mean? If the data is geographical then it's the resolution of the data. If it's spending data then it might be the frequency (dcterms:accrualPeriodicity) or the smallest itemised amount (e.g. £500 for UK local government spending). The number of data points (triples, records or whatever) is unlikely to be a useful guide.

There is a difference between accuracy and precision.

The consensus is that granularity in this context means the level of detail within the data (a.k.a. precision). This is quite separate from accrualPeriodicity and from accuracy: if the granularity is to the nearest $100 but a figure is out by $300, that's not accurate. If it's out by $10 it is still accurate, even if a granularity of $100 is not very precise.
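The dollar example can be written down as a small check. This is only a sketch: it assumes "to the nearest $100" means a figure is accurate when the error is no more than half the granularity, which is one possible threshold and not something the group agreed. The function name is invented.

```python
def is_accurate(reported: float, actual: float, granularity: float) -> bool:
    """True if the error is within half the stated granularity (an assumed rule)."""
    return abs(reported - actual) <= granularity / 2


# Granularity of $100: a figure out by $10 is still accurate, just not precise.
print(is_accurate(1200, 1210, 100))  # True
# The same granularity, but the figure is out by $300: not accurate.
print(is_accurate(1200, 1500, 100))  # False
```

The granularity is the same in both calls; only the error changes, which is the sense in which precision and accuracy are separate properties of the data.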

ODI Certificates

The ODI Certificates ask data publishers a series of questions, the answers to which then lead to the production of a certificate.

Question: should we have a property such as dcat:hasCertificate to point to a Certificate?

Raw Certificate

The needs of the Raw Certificate are covered by DCAT.

Pilot Certificate

This raises a number of issues:

  1. Whether the data contains personal data. If so, has it been anonymised? If not, do you have permission to publish it?
  2. The issue of time period between creation and publication raised above.

Other than that, DCAT seems to cover it.

Standard Certificate

This raises the issue of licences and rights (which are treated differently). And it refers to the ODRS - the Open Data Rights Statement Vocabulary which DWBP will treat separately.

On privacy, ODI demands that:

  • anonymisation processes are independently audited
  • if private data is published, there is a document that states you have a legal right to publish it
  • an impact assessment of publishing the private data is required and that impact assessment must itself be independently audited

It would be easy enough to create properties and classes to cover this - should we?

Aside - there's text in the Standard Certificate documentation relevant to data citation (i.e. data usage)

Other issues raised:

  • backups - there should be regular offsite backups of the data and those backups should be public. That might be handled by a simple pointer to another dcat:Distribution, perhaps with a subproperty of dcat:distribution (dcat:backupDistribution ?) and/or a property of dcat:backupFrequency?
  • expectations of API availability, rate limiting details (SLA?) - a link to a doc is all that's asked for by the certificate.
  • Machine readable service description (see Hydra Community Group as possible way forward on this)
  • Communication channels - dcat:contactPoint has a range of VCard but ODI asks for more - suggests we look at SIOC. ODI only asks for a forum to exist and for the dataset to link to a page that tells people where to find the forum etc.

Expert

These certificates include specific provisions tailored to base registers and the like, namely machine-readable versions of:

  • a copyright notice or statement
  • the copyright year
  • the copyright holder

and in jurisdictions that include database rights:

  • the database right year
  • the database right holder

It would be easy to create DCAT terms to cover these.

I find this statement interesting: "Data at the expert level of the certificate should be an essential part of the operation of your organisation; you should be able to provide a guarantee that it will continue to be available for a long time." i.e. you should be using the data you publish yourself, it shouldn't be a separate curated version of your internal data. That could be captured in a description like "master data" or "live data." It also brings to mind the idea of "definitive data" i.e. in law, whatever this dataset says is true is the legal state of affairs (even if it's actually wrong).