Conformance for Vocabularies?
Yesterday's Government Linked Data WG meeting was dedicated to one of the vocabularies we're working on: DCAT. We were joined on the call by Rufus Pollock of the Open Knowledge Foundation whose CKAN platform is a critical use case for the vocabulary.
As DCAT is on the Recommendation Track, it should include a conformance statement. W3C's Quality Assurance Framework has more detail on this but the basic idea is simple enough: it should be possible to verify that an implementation adheres to the specification. For specs like XML, HTML and CSS that's a straightforward concept and Working Groups routinely produce test suites against which implementations can be tested: if the input is A the output should be B or the behavior should be C.
But vocabulary terms don't fit this paradigm.
Let's stick with DCAT as the example. The aim is to provide a set of terms for describing datasets published in a catalog, but — and here's the killer — the use of all terms is optional. The vocabulary includes terms like creation date and the date of last modification along with things like title, publisher and license, because the working group believes them to be useful information about a dataset. OK — but what if a particular set of metadata omits, say, the creation date? Does that make the description non-conformant? No. At least, not in terms of the spec.
Individual implementations, like CKAN, are free to state that provision of a metadata term is mandatory for datasets on their platforms, and/or that values for properties must be selected from pre-defined lists, but that's application specific. What we really mean by conformance to DCAT is: "if you provide the creation date for a dataset, this is the property to use; this is the property to use to provide the title of the dataset" and so on. It also means "don't invent your own terms for anything listed here." Hmm… that's rather more woolly than "given input A the output should be B."
One of the terms in DCAT is dcterms:spatial. Where a dataset applies to a specific area, this is the term to use to provide that information. But suppose the dataset publisher is a public administration in France. Is it wrong to define a new term of ex:commune and use that instead of dcterms:spatial? After all, the concept of a Commune in France is well established and has a specific meaning.
Vocabularies like Dublin Core, FOAF and schema.org do not include conformance statements at all. Should we even try to include one for DCAT and the others we're working on in the GLD WG? The gut instinct is that yes we should. The conformance statement should encourage the use of terms within the vocabulary for properties and classes that are exactly or approximately covered. But I admit that wording that conformance statement is going to be tricky and that providing a test suite seems, well, unlikely.
I don't know whether a set of rules for using a vocabulary is the way to go. Coming from the library world, I know that such rules can become very unwieldy. I think it would be better to name some best practices and present some example descriptions. Because in the end standards result from common usage and won't be created by decree.
If you do write a best practices/conformance document, one option for solving the problem you provided as example might be: Asking people to either use specific properties which are listed or to use sub-properties of these. E.g., the dcterms:spatial example could be solved by requiring DCAT users to either use dcterms:spatial or a subproperty of dcterms:spatial (which might make sense for ex:commune). Doing this would promote the interlinking of vocabularies which is a good thing.
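A minimal sketch of how a consumer might honour that sub-property suggestion, assuming the sub-property declarations are available as a simple child-to-parent mapping. The mapping, the ex:arrondissement term and the function names are hypothetical and for illustration only; a real implementation would read rdfs:subPropertyOf statements from the published vocabularies and allow for multiple parents.

```python
# Hypothetical sub-property declarations: child -> parent (rdfs:subPropertyOf).
# Assumption: each property has at most one declared parent.
SUB_PROPERTY_OF = {
    "ex:commune": "dcterms:spatial",
    "ex:arrondissement": "ex:commune",
}


def is_covered_by(prop, target, sub_property_of=SUB_PROPERTY_OF):
    """Return True if prop is target or a (transitive) sub-property of it."""
    seen = set()
    while prop is not None and prop not in seen:
        if prop == target:
            return True
        seen.add(prop)  # guards against accidental cycles in the mapping
        prop = sub_property_of.get(prop)
    return False


def spatial_values(description):
    """Collect values given with dcterms:spatial or any declared sub-property."""
    return [
        value
        for prop, value in description
        if is_covered_by(prop, "dcterms:spatial")
    ]


# A dataset description as (property, value) pairs.
description = [
    ("dcterms:title", "Budget 2012"),
    ("ex:commune", "Lyon"),
]
```

With this in place, an aggregator that only knows about dcterms:spatial can still pick up the value published with ex:commune, which is the interlinking benefit described above.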
As PhilA suggests, the question of conformance for vocabulary adopters is challenging. As I stated on the W3C GLD call yesterday, I believe a vocabulary recommendation can be no more than an enumeration of recommended terms for providers to use. The problem is for relying parties such as metadata aggregators and application/service developers; with no clear statement of which terms are required, which are optional, and how to deal with additional terms, development can be a moving target and unnecessarily complex.
I propose the following:
My main point is having something fixed that can be cited by the community.
Dare I say it, but would extra use of OWL help? Beyond simply creating a list of classes, properties etc. in RDFS, we could use OWL axioms to provide extra information, e.g. to say that all datasets should have a value for dct:spatial. So in Manchester OWL syntax we'd have axioms like:
Class: dcat:Dataset SubClassOf: dct:spatial some geo:SpatialThing
i.e. all dcat Datasets have at least one value of dct:spatial
Yes, OWL is open world, but people are working on closed-world interpretations. This approach would at least give machine-readable ways of getting more semantics out of these vocabularies.
Make sense or have I missed your point?
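To make the closed-world reading of that axiom concrete, here is a rough sketch that treats one graph as complete and lists the datasets with no dct:spatial value. It is not an OWL reasoner: the triples are plain tuples, the function name and sample data are invented for illustration, and it does not check that the value is a geo:SpatialThing.

```python
RDF_TYPE = "rdf:type"


def datasets_missing(triples, prop="dct:spatial", cls="dcat:Dataset"):
    """Closed-world check: instances of cls with no value for prop in this graph."""
    instances = {s for s, p, o in triples if p == RDF_TYPE and o == cls}
    with_value = {s for s, p, o in triples if p == prop}
    return sorted(instances - with_value)


# Hypothetical sample data as (subject, predicate, object) tuples.
triples = [
    ("ex:ds1", "rdf:type", "dcat:Dataset"),
    ("ex:ds1", "dct:spatial", "ex:France"),
    ("ex:ds2", "rdf:type", "dcat:Dataset"),
    ("ex:ds2", "dct:title", "Bus timetables"),
]

missing = datasets_missing(triples)
```

Under open-world semantics the absence of a dct:spatial value for ex:ds2 would not be an inconsistency; the check only reports a problem because it assumes the graph in hand is the whole story.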
Thanks for these comments.
Adrian and John E - I think there's full agreement (as there was at the meeting yesterday) that the place to note that given properties are required, optional or "fill it if you like but we'll ignore it" is separate from the spec; rather, such statements are tied to an implementation or perhaps a best practices document. And yes, John E - the place for this discussion to continue is on the GLD mailing list rather than the Team blog where comments are (manually) moderated before publication.
John G - Using OWL to recognise the presence or absence of mandatory fields is certainly an approach to suggest, yes, although again I'd see that as application-specific and not part of the actual vocabulary itself.
"Using OWL to recognise the presence or absence of mandatory fields is certainly an approach to suggest, yes, although again I'd see that as application-specific and not part of the actual vocabulary itself."
Coming originally from the same background as Adrian, I agree that rules for conformance with a vocabulary can get fairly complex and hard to maintain. However, I think SPIN RDF (http://spinrdf.org/) might be of help here. It offers a clear and structured formalization for testing conformance with defined constraints using SPARQL queries. The SPIN constraints are attached directly to the classes the instances of which they test, which also helps to achieve a sensible code organization supporting maintenance. To get an impression of what the conformance rules might look like, there is an example for Schema.org (http://topbraid.org/spin/schemaspin).
Until people are actively using DCAT data to publish and consume descriptions of datasets it seems a bit premature to define too many constraints. Aren't you required to have implementations as part of the REC track? I would keep the constraints to the bare minimum needed for DCAT to satisfy the requirements you have. If implementation experience proves that you need to provide guidance to implementors I think you should use RFC 2119 to express them. I see that the current DCAT specification references RFC 2119, but doesn't seem to use it much.
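As a sketch of what RFC 2119 style guidance could look like when checked by an application, the snippet below attaches MUST, SHOULD and MAY levels to a few properties and reports errors and warnings separately. The profile is entirely hypothetical: DCAT itself makes every term optional, and the levels shown here are the kind of thing an individual platform might choose for itself.

```python
# Hypothetical application profile; these levels are NOT part of DCAT.
PROFILE = {
    "dct:title": "MUST",
    "dct:license": "SHOULD",
    "dct:issued": "MAY",
}


def check(description, profile=PROFILE):
    """Report missing MUST properties as errors and missing SHOULDs as warnings."""
    present = set(description)
    errors = sorted(
        p for p, level in profile.items() if level == "MUST" and p not in present
    )
    warnings = sorted(
        p for p, level in profile.items() if level == "SHOULD" and p not in present
    )
    return {"conformant": not errors, "errors": errors, "warnings": warnings}


# A description is a mapping from property to value.
report = check({"dct:title": "Bus timetables"})
```

Here the description passes because the only MUST property is present, while the missing dct:license is surfaced as a warning rather than a failure. Properties outside the profile are simply ignored.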
One could ask to begin with, standing stubbornly on formal logic ground, why such a question could arise at all. As long as DCAT (or any other vocabulary) is expressed using formal semantics of RDFS, with a pinch of OWL where needed (as suggested by John Goodwin), whatever you assert using its terms which is not inconsistent with its semantics is conformant. If a field is required, just put a cardinality constraint etc.
That said, I would go along with Adrian to say that good practices and examples are the best way to avoid misinterpretations by those users not keen to run reasoners to check whether their implementation is consistent, bearing in mind that this implies the challenge of defining in which closed part of the open and linked world the reasoning is performed.
Including a clear natural language definition or description for each term, and translation in as many natural languages as possible, seems the first obvious step - unfortunately not achieved by a lot of published vocabularies!
Very good post Phil, and very interesting comments.
I agree with John Erickson, we have to move this to the GLD "Best Practices" thread.
For example, for SKOS we have qSkos [1], which measures the quality of SKOS concept schemes. Also in our group we have OOPS (OntOlogy Pitfall Scanner!) [2], which helps you to detect some of the most common pitfalls appearing when developing ontologies. I'm thinking of extending OOPS to cover the ABox pitfalls for a given vocabulary/ontology.
Thank you Jindrich for the pointers, I wasn't aware of those.
[1] http://www.w3.org/2001/sw/wiki/QSKOS
[2] http://www.oeg-upm.net/oops