ACTION-153: Completeness as one of the quality dimensions

Dear all,

During the F2F I got an action to look at completeness as one of the quality dimensions [1].

At least for me, it was then about trying to gather completeness-related material from our use cases and best practices. Of course there is more about completeness, e.g. in my own (cultural heritage) domain, but I would rather focus on our stuff first, as the outside world is wide [2] and going through everything is far beyond one action.

So my starting point is the pre-F2F gathering of quality-related aspects in the use cases [3]. Completeness (as represented by the reqs R-DataMissingIncomplete and R-QualityCompleteness) is mentioned in many UCs:
1 ASO: Airborne Snow Observatory
4 BuildingEye: SME use of public data
10 The Land Portal
12 LusTRE: Linked Thesaurus fRamework for Environment
14 Mass Spectrometry Imaging (MSI)
15 OKFN Transport WG
16 Open City Data Pipeline
18 Resource Discovery for Extreme Scale Collaboration (RDESC)
19 Recife Open Data Portal
20 Retrato da Violência (Violence Map)
22 Tabulae - how to get value out of data
24 Uruguay Open Data Catalog

The wiki page at [3] has all quality-related extracts from the UC document.
Most of these cases talk in very general terms (e.g. 'dataset must be complete'), which strongly hints that completeness is indeed expected to be an indicator for quality.

However, I could find only one use case that really defines concretely what completeness means in its context: it's UC #12, LusTRE, with Riccardo's paper [4]. It is focused on completeness of owl:sameAs linksets, i.e. sets of owl:sameAs links between two different datasets. Its goal is to reflect how datasets can be 'complemented' via a linkset. Based on a small set of indicators (number of types, mappable types, etc.), it proposes 3 completeness measures:
- the extent to which a linkset covers (all) types involved in its subject or object datasets.
- the level of completeness of a linkset with respect to the (linkable) types involved in its datasets.
- the percentage of entities of a selected type considered in the linkset.
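
To give a feel for what such measures could look like, here is a rough Python sketch. It is only a loose reading of the three bullets above, not the definitions from [4]: the data layout (a dict from entity to type, links as pairs) and the function names type_coverage, type_completeness and entity_coverage are assumptions made up for illustration.

```python
from typing import Dict, List, Set, Tuple

# One owl:sameAs link: (entity in subject dataset, entity in object dataset).
Link = Tuple[str, str]


def linked_types(links: List[Link], subject_types: Dict[str, str]) -> Set[str]:
    """Types of subject-dataset entities that appear in at least one link."""
    return {subject_types[s] for s, _ in links if s in subject_types}


def type_coverage(links: List[Link], subject_types: Dict[str, str]) -> float:
    """Share of ALL types in the subject dataset touched by the linkset."""
    all_types = set(subject_types.values())
    if not all_types:
        return 0.0
    return len(linked_types(links, subject_types)) / len(all_types)


def type_completeness(links: List[Link], subject_types: Dict[str, str],
                      mappable_types: Set[str]) -> float:
    """Share of the linkable (mappable) types touched by the linkset.

    Which types count as mappable is an input here (an assumption).
    """
    if not mappable_types:
        return 0.0
    covered = linked_types(links, subject_types) & mappable_types
    return len(covered) / len(mappable_types)


def entity_coverage(links: List[Link], subject_types: Dict[str, str],
                    selected_type: str) -> float:
    """Share of entities of one selected type that occur in the linkset."""
    entities = {e for e, t in subject_types.items() if t == selected_type}
    if not entities:
        return 0.0
    linked = {s for s, _ in links} & entities
    return len(linked) / len(entities)
```

For instance, with four entities of three types and links on two of them, one would call type_coverage(links, subject_types) to see how many of the three types the linkset reaches, and entity_coverage(links, subject_types, "Concept") to see how many concepts are actually linked.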

One can say that linksets are a very specific case, as completeness is 'derived' from datasets. Still, this case is the only one I've seen with indicators and measures for completeness.


Actually, there is another UC that brings concrete hints about completeness: UC #3, Bio2RDF [5].
That one doesn't mention explicit completeness-related reqs. However, it does present a number of indicators that I think could relate to completeness:
    total number of triples
    number of unique subjects
    number of unique predicates
    number of unique objects
    number of unique types
    unique predicate-object links and their frequencies
    unique predicate-literal links and their frequencies
    unique subject type-predicate-object type links and their frequencies
    unique subject type-predicate-literal links and their frequencies
    total number of references to a namespace
    total number of inter-namespace references
    total number of inter-namespace-predicate references
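
As an illustration, the first few indicators in this list are simple counts over the triples of a dataset. The Python sketch below shows one way they might be computed; it is not how Bio2RDF does it. Triples are assumed to be plain (subject, predicate, object) string tuples, and the function name dataset_indicators and the result keys are made up for this example.

```python
from typing import Dict, Iterable, Tuple

Triple = Tuple[str, str, str]  # (subject, predicate, object) as plain strings

RDF_TYPE = "http://www.w3.org/1999/02/22-rdf-syntax-ns#type"


def dataset_indicators(triples: Iterable[Triple]) -> Dict[str, int]:
    """Compute a handful of basic count indicators for a set of triples."""
    triples = list(triples)  # allow several passes over the input
    return {
        "total_triples": len(triples),
        "unique_subjects": len({s for s, _, _ in triples}),
        "unique_predicates": len({p for _, p, _ in triples}),
        "unique_objects": len({o for _, _, o in triples}),
        # Types are taken to be the objects of rdf:type statements.
        "unique_types": len({o for _, p, o in triples if p == RDF_TYPE}),
    }
```

Such counts say how big a dataset is, but not how big it should be; to turn them into a completeness measure one would still need some reference to compare against.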

But I see there is an issue raised precisely about it [6], questioning whether it relates to quality. If we decide that it's not the case, then the Bio2RDF UC does not have much to say about completeness!

Best,

Antoine

[1] http://www.w3.org/2013/dwbp/track/actions/153
[2] https://www.w3.org/2013/dwbp/wiki/Data_quality_notes#Links.2C_related_work
[3] https://www.w3.org/2013/dwbp/wiki/Quality_Aspects_In_Use_Cases
[4] http://www.edbt.org/Proceedings/2013-Genova/papers/workshops/a8-albertoni.pdf
[5] http://www.w3.org/TR/2015/NOTE-dwbp-ucr-20150224/#UC-Bio2RDF
[6] http://www.w3.org/2013/dwbp/track/issues/164

Received on Friday, 8 May 2015 07:11:29 UTC