Re: Data Quality Vocabulary - feedback welcome!

Dear Werner,

I am Jeremy, one of the DQV contributors and the creator of daQ. I would like to tell you, beforehand, that my views might not be shared by the DQV editors or the DWBP group in general.

many thanks for sharing your working draft. We have been using a recent draft version of the vocabulary in a project [1] that integrates and enriches metadata of cultural heritage objects in order to provide recommendations to users, contextualised by their current work. We perform metadata quality assessment of the records received from providers and after mapping to a common data model.

We found that DQV covers most of our requirements, and integrates smoothly with W3C PROV which is already part of our data model. The proposed DQV has been found to be compact and intuitive to use.

That’s great to hear!


There are, however, a few observations from the use of DQV that we'd like to share with the group.

1. daq:metric
a. multiple values of one metric

We found that there are metrics where a single output value may not be sufficient. In particular, this applies to statistics (which are listed as one dimension in the draft spec). For example, one may want to express the mean, min or max of a metric over a dataset, or provide an absolute and a relative (normalized) value for the same metric. Of course this could be done by defining multiple metrics, but then one would need a mechanism to group/link them or express their dependency.

In the EBU, the working group on quality control [2] has defined a data model for the somewhat related problem of describing quality of audiovisual content (with XML serialisations so far, not RDF). This model supports multiple output values that can be typed. For DQV, this could for example be achieved by having multiple values, and defining subproperties of daq:value.

In theory, there is only one value for a metric; the others are derivatives. With a daq:Observation, such derivatives should be easy to define by creating an external Data Cube measure property.
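
As a rough sketch of what I mean, here is a small Python example that writes out the triples for one observation carrying its main value plus two derived values. Triples are plain tuples of prefixed names rather than a real RDF library, and every ex: name (the metric, the observation and the two derived measure properties) is made up for illustration; only daq:Observation, daq:metric, daq:value and qb:MeasureProperty come from the vocabularies themselves.

```python
# Triples are plain (subject, predicate, object) tuples of prefixed names.
# Every ex: name is hypothetical and only illustrates the idea.
def observation_triples(obs, metric, value, derived):
    """Describe one observation: its main value plus derived measures."""
    triples = [
        (obs, "rdf:type", "daq:Observation"),
        (obs, "daq:metric", metric),
        (obs, "daq:value", repr(value)),
    ]
    for prop, derived_value in sorted(derived.items()):
        # Each derivative gets its own Data Cube measure property.
        triples.append((prop, "rdf:type", "qb:MeasureProperty"))
        triples.append((obs, prop, repr(derived_value)))
    return triples


triples = observation_triples(
    "ex:obs1",
    "ex:labelCoverageMetric",
    0.82,
    {"ex:minValue": 0.40, "ex:maxValue": 0.97},
)
for s, p, o in triples:
    print(f"{s} {p} {o} .")
```

The metric keeps a single daq:value, and the min and max hang off the same observation through their own measure properties, so no extra grouping mechanism is needed.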

b. parameters

For some metrics, input parameters could be required. E.g., there have been recent publications on metadata quality which use weights or target values in the metrics. For descriptions with quality measurements that are self-contained, it would be required to include the values of such parameters in the description of the metric.

A daq:Metric (which is the equivalent class of dqv:Metric) has the property daq:requires. The purpose of that property is exactly for input parameters.


c. daq:expectedDataType

This property from DAQ is defined to have range xsd:anySimpleType. While it seems useful to define the expected data type for a metric, a simple type may be too narrow: in many cases a metric will be determined on a data record or a subgraph.

It will be taken into consideration - although I’m not sure it works well with Data Cube. Could you please provide us (or me) with an example where a quality metric returns a data record or subgraph?


2. Dimensions and categories

The dimensions proposed seem quite high-level, so it is difficult to think of categories that are more general and group dimensions. In contrast, it seems in some cases desirable to have a level between dimensions and metrics. For example, we are dealing with assessing mapping quality. The metrics fall in the dimension of accuracy (i.e., does the output of the mapping process represent the object less accurately), and form a specific group there. To make the distinction of the different levels more confusing, the note in 7.3 Processability currently says "Level on the 5-star scale", which sounds more like a metric than a dimension (there could of course be metrics aggregating results from other metrics; daq:requires could be used to express such a dependency).
We are not sure if there is a strong need for categories; we would rather propose to consider nesting multiple levels of dimensions to allow grouping.

I’m not sure if I understood “nesting multiple levels of dimensions” correctly, but a category groups a set of dimensions which have a common type of information as a quality indicator. For example, the Accessibility category groups dimensions such as Availability, Security and Performance. Each of these dimensions has a number of different metrics, each assessing a different aspect of the dimension. This is how we define Category-Dimension-Metric in daQ:

A Quality Dimension is a characteristic of a dataset relevant to the consumer (e.g. Availability of a dataset).

A Quality Metric is a concrete quality measure for a concrete quality indicator, usually associated with a measuring procedure. This assessment procedure returns a score, which we also call the value of the metric. There are usually multiple metrics per dimension; e.g., availability can be measured by the accessibility of a SPARQL endpoint, or of an RDF dump. The value of a metric can be numeric (e.g., for the metric “human-readable labelling of classes, properties and entities”, the percentage of entities having an rdfs:label or rdfs:comment) or boolean (e.g. whether or not a SPARQL endpoint is accessible).

A Category is a group of quality dimensions in which a common type of information is used as quality indicator (e.g. Accessibility, which comprises not only availability but also dimensions such as security or performance). Grouping the dimensions into categories helps to organise the space of all quality aspects, given their large number.
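
To make the three levels concrete, here is a minimal Python sketch of the hierarchy using the examples from the definitions above. The class and function names are my own and purely illustrative (they are not part of daQ or DQV); the point is only that a category contains dimensions and a dimension contains metrics.

```python
from dataclasses import dataclass, field


@dataclass
class Metric:
    name: str


@dataclass
class Dimension:
    name: str
    metrics: list = field(default_factory=list)


@dataclass
class Category:
    name: str
    dimensions: list = field(default_factory=list)


# Example taken from the definitions above: Accessibility groups
# Availability, Security and Performance.
availability = Dimension(
    "Availability",
    [Metric("SPARQL endpoint accessible"), Metric("RDF dump accessible")],
)
accessibility = Category(
    "Accessibility",
    [availability, Dimension("Security"), Dimension("Performance")],
)


def metrics_in(category):
    """Collect the names of all metrics under a category, across dimensions."""
    return [m.name for d in category.dimensions for m in d.metrics]


print(metrics_in(accessibility))
```

In this picture, a group of mapping-quality metrics would sit under the Accuracy dimension as a set of metrics, rather than requiring an extra nesting level.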


I hope this helps.

Best Regards,
Jeremy

Received on Wednesday, 4 November 2015 11:15:20 UTC