Warning:
This wiki has been archived and is now read-only.

Requirements From FPWD BP

From Data on the Web Best Practices

Parent page: https://www.w3.org/2013/dwbp/wiki/Data_quality_notes#Scoping_and_requirements_from_DWBP_WG

On the requirement elicitation/refinement process.

Each BP in the FPWD was considered, asking what requirements it could suggest for quality-related information, treated as a special case of open [meta]data that must be shared on the web.

The requirements are listed in the table below, which tracks the following pieces of information.

  • Best practice: the BP from which inspiration has been taken;
  • Requirement: a verbose sentence explaining the requirement. Requirements are extracted by analogy with the BPs written in the FPWD, or drawn from my experience in the field;
  • Competency questions: some of the requirements can be turned into a set of “competency questions (CQs)” for the quality vocabulary. Other requirements do not pertain to the terms that should be included in the quality vocabulary; rather, they suggest either (i) requirements for an infrastructure managing quality for datasets or (ii) requirements on how the vocabulary under development should be made available/documented. Although these kinds of requirements could be ignored, at least in this phase of the vocabulary development, the aforementioned “non-CQ requirements” are labelled in the Competency questions column as Infrastructural requirement and General Vocabulary Requirement respectively.

Competency questions are a first attempt to find “concrete requirements”.

I have to admit that in this early phase I tried to be imaginative, including CQs that might not be in the scope of the quality vocabulary, so I wouldn't be surprised if the group decides to discard part of the requirements I have listed. In some cases, competency questions might suggest introducing terms that substantially or partially overlap with other known vocabularies (e.g., daQ, PROV-O, DCAT, RDFCUBE). When that happens, the corresponding vocabularies are indicated in the Vocabularies to check column, so that the group can decide in a second phase whether to include overlapping terms as (i) brand new terms in the quality vocabulary, (ii) specializations of a well-known vocabulary (e.g., via rdfs:subPropertyOf, rdfs:subClassOf), or (iii) terms reused from the other vocabularies' namespaces.
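As a sketch of option (ii), a quality-specific term could be declared as a specialization of an existing well-known property rather than as an unrelated new term. The ex: namespace and the term ex:qualityNote below are purely hypothetical illustrations, not part of any published vocabulary; rdfs: and dcterms: are real.

```turtle
@prefix rdfs:    <http://www.w3.org/2000/01/rdf-schema#> .
@prefix dcterms: <http://purl.org/dc/terms/> .
@prefix ex:      <http://example.org/quality#> .  # hypothetical namespace

# Option (ii): specialize an existing well-known property
# instead of minting an unrelated new term.
ex:qualityNote rdfs:subPropertyOf dcterms:description ;
    rdfs:label   "quality note"@en ;
    rdfs:comment "A human-readable note about the quality of a dataset."@en .
```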

Requirements

no. Requirement Competency questions Vocabularies to check Best Practice
1 It should be possible for computer applications, notably search tools, to locate and process the quality of datasets easily; this implies that quality information should be both human- and machine-readable. General Vocabulary Requirement -
  • Best Practice 2: “Use machine-readable formats to provide metadata”
  • Best Practice 8: “Provide data quality information”
2 Quality should be stated via a standardized vocabulary, possibly expressed in RDF with HTTP URIs for terms and multilingual descriptions of terms. [Of course, other kinds of technological approaches (e.g., schema.org/microdata) can also be considered in order to make the quality vocabulary appealing to communities not so interested in Linked Data.] General Vocabulary Requirement - Best Practice 3: “Use standard terms to define metadata”
3 It should be possible for consumers to interpret the meaning of quality. For example, quality should be provided mentioning the quality dimensions measured and the metrics adopted, but also the scale and range of those metrics.
  • Is the associated quality score good enough?
  • What does a given metric stand for? / What does a given metric measure?
daQ, RDFCUBE Best Practice 5: “Provide locale parameters metadata”
4 Quality info should be licensed; search tools and humans should know if and how they can use it.

Infrastructural requirement. What licence is associated with quality data/info X? Does the FPWD assume metadata is openly available? I can't identify the exact excerpt saying that.

DCAT, VOID
  • Best Practice 6: “Provide data license information”
  • Best Practice 8: “Provide data quality information”
5 It should be possible to determine the provenance of quality information.
  • Who has provided the quality info X?
  • Is the quality authoritatively provided?
  • Is the quality certified? Who has certified the quality?
  • Is the quality evaluated live (i.e., continuously)?
  • What service / program has been adopted to perform that quality assessment?
PROV-O
  • Best Practice 7: “Provide data provenance information”
  • Best Practice 8: “Provide data quality information”
6 Data quality might be expressed according to different quality dimensions, relying on metrics or feedback opinions. In particular, (i) results from cross-domain as well as domain-specific metrics/measures should be representable in the quality vocabulary; (ii) the set of quality metrics/measures and quality dimensions considered in a quality assessment should be left open.
  • What kind of quality representation is provided? (Metric-based, feedback opinion, description of known quality issues, [any others?])
  • Which quality metric has been deployed?
  • What kinds of quality dimensions have been evaluated?
  • Which metric can be deployed to measure a certain quality dimension (e.g., Data Completeness)?
daQ Best Practice 8: “Provide data quality information”
7 Known quality issues should be documented (at least) for human consumption.
  • What quality issues are discussed for dataset X with respect to a set of specific quality dimensions Y?
  • What quality issues are discussed for dataset X disregarding quality dimensions?
- Best Practice 8: “Provide data quality information”
8 Quality information should be associated with a specific release/distribution of a dataset, and date-time info about when the evaluation was performed should be indicated, so that changes in quality over time can be tracked.
  • When was quality [or even a particular metric or dimension of quality] last assessed for dataset X [or even a distribution Y of dataset X]?
  • Does the newest release/distribution of dataset X have better quality than the previous one?
daQ, PROV-O, DCAT Best Practice 10: “Provide version history”
9 Every first-class-citizen quality concept, such as quality dimensions, metrics, etc., should have a unique ID [possibly an HTTP IRI?]. General Vocabulary Requirement - Best Practice 11: “Use unique identifiers”
10 Quality should be available in non-proprietary formats General Vocabulary Requirement - Best Practice 13: “Use open data formats”
11 Quality should be available in multiple formats General Vocabulary Requirement - Best Practice 14: Provide data in multiple formats
12 The quality vocabulary may be published together with human-readable Web pages, as detailed in the recipes for serving vocabularies with HTML documents in the Best Practice Recipes for Publishing RDF Vocabularies [SWBP-VOCAB-PUB]. Elements of the vocabulary are defined with attributes containing human-understandable labels and definitions, such as rdfs:label, rdfs:comment, dc:description, skos:prefLabel, skos:altLabel, skos:note, skos:definition, skos:example, etc. Documentation may benefit from the additional presence of visual documentation such as the UML-style diagram of the W3C Organization Ontology [ORG]. General Vocabulary Requirement - Best Practice 15: Document vocabularies
13 Provide the quality vocabulary under an open license such as Creative Commons Attribution License CC-BY [CC-ABOUT]. Create entries for the vocabulary in repositories such as LOV, Prefix.cc, Bioportal and the European Commission's Joinup. General Vocabulary Requirement - Best Practice 16: Share vocabularies in an open way
14 Quality Vocabulary should include versioning information General Vocabulary Requirement - Best Practice 17: Vocabulary versioning
15 Existing reference vocabularies should be re-used where possible General Vocabulary Requirement - Best Practice 18: Re-use vocabularies
16 When creating or re-using a vocabulary for an application, a data publisher should opt for a level of formal semantics that fits the data and applications. General Vocabulary Requirement - Best Practice 19: Choose the right formalization level
17 Is there any special requirement for dealing with sensitive data when publishing quality data/info? I would say no, but perhaps the answer should be double-checked by the experts in sensitive data who are part of the group. - -
  • Best Practice 20: Preserve people's right to privacy
  • Preserve organization's security
  • Best Practice 21: Provide data unavailability reference
18 Quality data should be available for bulk download.
  • Where can data about the quality of datasets X, Y, Z be downloaded?
DCAT, VOID Best Practice 22: Provide bulk download
19 If APIs for accessing quality data are developed, they should follow REST architectural approaches. Infrastructural requirement - Best Practice 23: Follow REST principles when designing APIs
20 Where data is produced in real-time, quality data should be available on the Web in real-time. Infrastructural requirement - Best Practice 24: Provide real-time access
21 Quality data must be available in an up-to-date manner and the update frequency made explicit.
  • Is the quality of dataset X evaluated on a regular basis?
  • How often is the quality of dataset X evaluated?
- Best Practice 25: Provide data up to date
22 If quality data is made available through an API, the API itself should be versioned separately from the data. Old versions should continue to be available. Infrastructural requirement - Best Practice 26: Maintain separate versions for a data API
23 Data publishers should provide a means for consumers to offer quality feedback.
  • Is there any quality feedback provided for dataset X?
  • Who has provided the quality feedback for dataset X?
  • When was the feedback provided?
- Best Practice 27: Gather feedback from data consumers
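To make requirement 5 above more concrete, the provenance CQs could be answered with PROV-O terms, along the lines of the Turtle sketch below. The prov: terms are real PROV-O vocabulary; the ex: resources are hypothetical placeholders.

```turtle
@prefix prov: <http://www.w3.org/ns/prov#> .
@prefix ex:   <http://example.org/> .            # hypothetical resources

# The quality information is an entity whose provenance is tracked with PROV-O.
ex:qualityInfoX a prov:Entity ;
    prov:wasAttributedTo ex:certificationAgency ;   # who provided / certified it
    prov:wasGeneratedBy  ex:assessmentRun42 .

# The assessment itself is an activity, linked to the tool that performed it.
ex:assessmentRun42 a prov:Activity ;
    prov:wasAssociatedWith ex:qualityAssessmentTool .
```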
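Likewise, for requirement 6, a metric-based quality observation might be shaped as follows. Every ex: term here is a hypothetical illustration; daQ defines terms of a similar shape, and its specification should be checked before reusing or mirroring them.

```turtle
@prefix xsd: <http://www.w3.org/2001/XMLSchema#> .
@prefix ex:  <http://example.org/quality#> .     # hypothetical terms throughout

# A metric-based observation: which metric, on which dataset, what value.
ex:obs1 a ex:QualityObservation ;
    ex:computedOn ex:datasetX ;
    ex:metric     ex:completenessMetric ;        # the metric deployed
    ex:value      "0.87"^^xsd:double .

# The metric is linked to the quality dimension it measures.
ex:completenessMetric ex:measuresDimension ex:CompletenessDimension .
```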
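For requirement 8, quality information could be attached to one specific DCAT distribution of a dataset together with the date-time of the evaluation, so that quality can be compared across releases. The dcat: and prov: terms are real; ex:qualityOf and the ex: resources are hypothetical.

```turtle
@prefix dcat: <http://www.w3.org/ns/dcat#> .
@prefix prov: <http://www.w3.org/ns/prov#> .
@prefix xsd:  <http://www.w3.org/2001/XMLSchema#> .
@prefix ex:   <http://example.org/> .            # hypothetical resources/terms

ex:datasetX a dcat:Dataset ;
    dcat:distribution ex:distributionY .

# Quality info scoped to one distribution, stamped with the evaluation time.
ex:qualityInfo1 a prov:Entity ;
    ex:qualityOf ex:distributionY ;              # hypothetical property
    prov:generatedAtTime "2015-03-01T10:00:00Z"^^xsd:dateTime .
```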

Open issues to be discussed

  • Quality might be associated with a dataset as well as with a dataset's distribution. Shall we agree on that?
    • Christophe: My gut feeling is that it is more important to focus on the quality of the data itself. Data consumers would be more interested in knowing if the data is valuable to them before seeing if the API is well designed.
    • Antoine: For time-specific distributions (versions) then yes clearly. For other kinds of distributions, I'd tend to say yes too, but I don't know what sort of quality metrics we could find. The quality for an API will be very different from the one of a bulk download.
  • It is quite clear that there are different possible ways to represent info about quality, e.g., metric-based, feedback opinions, descriptions of known quality issues.
(i) Apart from the aforementioned, are there any other commonly adopted ways to provide/represent the quality of a dataset?
(ii) Which of the possible ways to represent quality does the group want in the scope of the quality vocabulary?
  • Is there any special requirement for dealing with sensitive data when publishing quality data/info? I would say no, but perhaps the answer should be double-checked by those DWBP participants who are expert in managing sensitive data.
  • Issue-116: Should we provide more specific/ detailed strategies on how to attach quality info in metadata apart from the use of the data quality vocabulary?

TODO

  • are there other competency questions that might make sense?
  • to refine/split/group CQs as needed [probably to be done after the FPWD BP requirements and UCS requirements are merged]
  • any other "known vocabulary" to check?
  • We could make a point that data that is being properly preserved is of higher quality. This could be expressed by having a term describing what preservation is being done.
  • make the distinction between what is coming directly from the best practices, and what results from creative interpretation of them, or from the reading of related work.
  • Many of the requirements are general metadata reqs that we are saying apply to quality metadata too. While that's true, does it offer enough added-value?
  • #12, #13, #14: I'd be tempted to remove, or at least seriously downplay them so that they don't consume superfluous bandwidth in our work. If we publish the voc at W3C then we will have to meet these requirements in the right way, anyway.
  • all the requirements categorised as "general vocabulary requirements", except req no. 11 "quality should be available in multiple formats", express things that are quite settled but do not suggest anything in terms of entities and relations to include in the vocabulary. They can just be exploited to prevent objections about the way we will publish the vocabulary.
  • all the requirements categorised as "infrastructural requirements" can be disregarded. They are about APIs etc. and not very pertinent to the goal of our exercise.
  • Instead, requirements providing concrete CQs should be double-checked, first by you, Christophe, and Deirdre, and then by the whole group. I haven't been too imaginative, but getting the requirements double-checked is probably the only way to reach consensus and to avoid people complaining about the requirements in the future.