RE: DQV - metrics related to the completeness dimension

All,

Aren't we making this too complex? 

It seems to me that in certain cases there can be 'absolute' measures of
quality. For example, if I publish a dataset with air quality observations
from 50 measuring stations, I can state that the dataset is complete because
it contains all observations from all measuring stations, or that it is not
complete because observations from stations X and Y are missing. This is not
subjective at all.
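
For illustration, a minimal sketch of such an 'absolute' check (in Python;
the station identifiers are made up):

    # Completeness of the station set, computed against the publisher's
    # own list of stations -- an objective, non-subjective ratio.
    expected_stations = {f"ST{i:02d}" for i in range(1, 51)}  # all 50 stations
    observed_stations = expected_stations - {"ST17", "ST42"}  # X and Y missing

    completeness = len(observed_stations) / len(expected_stations)
    print(f"{completeness:.0%}")  # 96% -- the same answer for every user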

Defining quality as "fitness for use" allows for a discussion about the
completeness of the approach behind the dataset, e.g. someone can argue that
my measurements of air quality do not include parameters that are crucial
for their research and therefore the measurements are "incomplete". I would
argue that the dataset is still "complete". 

In my mind, the three completeness metrics (schema completeness, population
completeness, column completeness) as formulated by Nandana point mainly to
the quality of the approach, since they talk about "required attributes", not
to the quality of the dataset itself. If you replace the phrases "required
attributes" and "required population" with "observed attributes" and
"observed population", you have an objective measure of the completeness of
the dataset.
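
In code terms (a rough sketch; the attribute names are invented):

    dataset_attributes = {"station", "timestamp", "NO2", "PM10"}

    # Against its own observed attributes, the dataset is objectively complete:
    observed = dataset_attributes
    print(len(dataset_attributes & observed) / len(observed))  # 1.0

    # Against one particular user's required attributes, it may not be:
    required = {"station", "timestamp", "NO2", "PM10", "O3"}
    print(len(dataset_attributes & required) / len(required))  # 0.8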

Of course, if a user has particular requirements in terms of the set of
attributes, and a dataset contains a different set of attributes, that
dataset may not be "fit for (this user's) use", but it could still be 100%
complete with respect to its own set of attributes and population.

Makx.

From: Debattista, Jeremy [mailto:Jeremy.Debattista@iais.fraunhofer.de] 
Sent: 30 September 2015 09:26
To: Steven Adler <adler1@us.ibm.com>
Cc: Nandana Mihindukulasooriya <nmihindu@fi.upm.es>; Data on the Web Best
Practices Working Group <public-dwbp-wg@w3.org>
Subject: Re: DQV - metrics related to the completeness dimension

What you said is true, Steven, and (in principle) I would agree on avoiding
universal completeness in favour of a more sustainable measure. On the other
hand, your solution is highly subjective and thus very hard to calculate. It
would be nice to have such an index score, but I'm not quite sure that it
will work in practice, as there are many factors that have to be considered.

Cheers,

Jer

On 30 Sep 2015, at 03:42, Steven Adler <adler1@us.ibm.com> wrote:

You can avoid "universal" completeness by allowing publishers and consumers
to publish their confidence level in the data. The combination of confidence
attributes would be calculated as an index of confidence and doubt, like a
set of product reviews. This method is more organic to how the data has been
used and is being used.
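
Very roughly, something like this (a sketch only; the simple average is just
a placeholder for a real aggregation rule):

    # Publishers and consumers each publish a confidence score in [0, 1];
    # the scores are aggregated like product reviews into confidence/doubt.
    ratings = [("publisher", 0.9), ("consumer", 0.7), ("consumer", 0.4)]

    confidence = sum(score for _, score in ratings) / len(ratings)
    print(f"confidence: {confidence:.2f}, doubt: {1 - confidence:.2f}")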

Just a thought.

Best Regards,

Steve

Motto: "Do First, Think, Do it Again"

From: Nandana Mihindukulasooriya <nmihindu@fi.upm.es>
To: Data on the Web Best Practices Working Group <public-dwbp-wg@w3.org>
Date: 09/27/2015 08:07 PM
Subject: DQV - metrics related to the completeness dimension

Hi all,

In the F2F (re: action-153), we talked about the difficulties of defining
metrics for measuring completeness and the need for examples. Here's some
input from a project we are working on at the moment. 

TL;DR version

It's hard to define universal completeness metrics that suit everyone.
However, completeness metrics can be defined for concrete use cases or
specific contexts of use. In the case of RDF data, a closed world assumption
has to be applied to calculate completeness. 

Longer version

Quality is generally defined as "fitness for *use*". Further, completeness
is defined as "The degree to which subject data associated with an entity
has values for all expected attributes and related entity instances *in a
specific context of use*" [ISO 25012]. It's important to note that both
definitions emphasize that the perceived quality depends on the intended
use. Thus, a dataset that is fully complete for one task might be quite
incomplete for another.

This is why it is not easy to define a metric that universally measures the
completeness of a dataset. However, for a concrete use case, such as
calculating some economic indicators of Spanish provinces, we can define a
set of completeness metrics.

In this case, we can define three metrics (a rough sketch of how they could
be computed follows below):
(i) Schema completeness, i.e. the degree to which required attributes are
not missing in the schema. In our use case, the attributes we are interested
in are the total population, unemployment level, and average personal income
of a province, and schema completeness is calculated using those attributes.
(ii) Population completeness, i.e. the degree to which elements of the
required population are not missing in the data. In our use case, the
population we are interested in is all the provinces of Spain, and
population completeness is calculated against them.
(iii) Column completeness, i.e. the degree to which the values of the
required attributes are not missing in the data. Column completeness is
calculated using the schema and the population defined before and the facts
in the dataset.
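
For example (toy data; the attribute and province names are just
placeholders):

    # Required schema and population for this use case.
    required_attributes = {"totalPopulation", "unemploymentLevel", "averageIncome"}
    required_population = {"Madrid", "Sevilla", "Lleida"}  # stand-in for all provinces

    # Facts in the dataset: (province, attribute) pairs that have a value.
    facts = {("Madrid", "totalPopulation"), ("Madrid", "unemploymentLevel"),
             ("Sevilla", "totalPopulation")}

    attrs = {a for _, a in facts}
    schema_completeness = len(attrs & required_attributes) / len(required_attributes)

    provinces = {p for p, _ in facts}
    population_completeness = len(provinces & required_population) / len(required_population)

    cells = {(p, a) for p in required_population for a in required_attributes}
    column_completeness = len(facts & cells) / len(cells)

    print(schema_completeness, population_completeness, column_completeness)
    # roughly 0.67, 0.67, 0.33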

With these metrics, we can now measure the completeness of the dataset for
our use case. As we can see, those metrics are quite specific to it. Later,
if we have another use case, say about Spanish movies, we can define a
different set of schema, population, and column completeness metrics, and
the same dataset will have different values for those metrics.

If data providers foresee some specific use cases, they might be able to
define concrete completeness metrics and make them available as quality
measures. If not, data consumers can define more specific completeness
metrics for their use cases and measure values for those metrics. These
completeness metrics can be used to evaluate the "fitness for use" of
different datasets for a given use case. To calculate population
completeness, the required population must be known. The required attributes
and other schema constraints might be expressed using SHACL shapes [1].

In the case of RDF data, we apply a closed world assumption and consider
only the axioms and facts included in the dataset. Also, if the use case
involves linksets, other metrics such as interlinking completeness can be
used.
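
For instance, with rdflib (a sketch; the example vocabulary is invented, and
under the closed world assumption a value is missing if no triple asserts it):

    from rdflib import Graph, Namespace

    EX = Namespace("http://example.org/")  # invented vocabulary

    g = Graph()
    g.parse(data="""
        @prefix ex: <http://example.org/> .
        ex:Madrid  ex:totalPopulation 6750000 ; ex:unemploymentLevel 0.10 .
        ex:Sevilla ex:totalPopulation 1950000 .
    """, format="turtle")

    required_attributes = [EX.totalPopulation, EX.unemploymentLevel, EX.averageIncome]
    required_population = [EX.Madrid, EX.Sevilla, EX.Lleida]

    # Closed world: only triples present in the graph count as values.
    filled = sum(1 for p in required_population for a in required_attributes
                 if g.value(p, a) is not None)
    print(filled / (len(required_population) * len(required_attributes)))  # ~0.33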

Hope this helps us discuss completeness metrics more concretely. It will be
interesting to hear about other experiences in defining completeness
metrics, and about counterexamples where it is easy to define universal
completeness metrics.

Best Regards,
Nandana

[1] http://w3c.github.io/data-shapes/shacl/
