Data on the Web? Here's How
I want a revolution.
Not a political one, and certainly not a violent one, but a revolution nonetheless.
A revolution in the way people think about the way data is shared on the Web, whether openly or not. This is where I typically start talking about people using the Web as a glorified USB stick. That is, using the Web to do no more than transfer data from A to B in a way that could be just as easily achieved by putting it on a USB stick and sending it through the post.
The Web is so much more than that. To quote from the Architecture of the World Wide Web, it’s: “… a remarkable information space of interrelated resources, growing across languages, cultures, and media.” It’s the connectivity of ideas and facts between people who are unknown to each other that is so exciting and that has such profound implications.
But how to do it right? As Rebecca Williams of GovEx, formerly of data.gov, tweeted recently: “looking at 'open data portals' to gather your best practices in metadata and licensing is very backwards, they're almost all doing it wrong.” I wouldn’t go as far as to say they’re almost all doing it wrong, but it is true that there is a need for a reference for how to do it right.
Which is what today’s Data on the Web Best Practices Recommendation is all about.
It’s taken four years: from planning the workshop, to setting up the Working Group, to working out what the heck the scope really is and fixing the relationship with the (externally funded) Share-PSI Project, to honing a set of 35 Best Practices that are actionable without being overly prescriptive.
The first one is absolute motherhood and apple pie: provide metadata. It sounds silly, and one can argue that if you’re sharing data on the Web and not providing metadata then you’re probably quite keen for no one to find it, let alone use it. Best Practice 9 says “Use Persistent URIs as identifiers of datasets” and BP 10 says “Use persistent URIs as identifiers within datasets.” In my view these two are at the heart of the difference between using the Web as a glorified USB stick and using it as a global information space. The implementation report cites many examples of this, from the Brazilian federal government’s Compras públicas do governo federal to Macedonia’s St Cyril and Methodius University’s Linked Drugs project, from the Auckland War Museum’s API to the UK’s Acropolis project.
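To make the metadata and persistent-URI points a little more concrete, here is a minimal sketch of a dataset description that is identified by a URI rather than by a file name. It uses terms from the DCAT and Dublin Core vocabularies expressed as JSON-LD; the `describe_dataset` helper, the `data.example.org` URIs and the bus-stops dataset are all made up for illustration and are not taken from the Recommendation.

```python
import json

# Hypothetical persistent URI for a dataset; data.example.org is a placeholder.
DATASET_URI = "https://data.example.org/id/dataset/bus-stops"


def describe_dataset(uri, title, license_uri, downloads):
    """Build a minimal JSON-LD description of a dataset using DCAT terms.

    `downloads` is a list of (download_url, media_type) pairs, one per
    distribution of the same dataset.
    """
    return {
        "@context": {
            "dcat": "http://www.w3.org/ns/dcat#",
            "dct": "http://purl.org/dc/terms/",
        },
        "@id": uri,  # the persistent URI is the dataset's identifier
        "@type": "dcat:Dataset",
        "dct:title": title,
        "dct:license": {"@id": license_uri},
        "dcat:distribution": [
            {
                "@type": "dcat:Distribution",
                "dcat:downloadURL": {"@id": url},
                "dcat:mediaType": media_type,
            }
            for url, media_type in downloads
        ],
    }


metadata = describe_dataset(
    DATASET_URI,
    "Bus stops (illustrative example)",
    "https://creativecommons.org/licenses/by/4.0/",
    [
        (DATASET_URI + ".csv", "text/csv"),
        (DATASET_URI + ".json", "application/json"),
    ],
)

print(json.dumps(metadata, indent=2))
```

The point of the sketch is that other people's data can link to `DATASET_URI`, and the description says where the files are and under what licence, which is something a USB stick cannot do.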
There are Best Practices around areas you’d probably expect, like provenance and licensing, and maybe less obviously things like data enrichment and data archiving. These are topics in their own right of course and the general Data on the Web Best Practices document can only act as a basis. At W3C, further work is currently under way, for example, to standardize ODRL for machine-readable permissions and obligations, and the Spatial Data on the Web WG is building directly on DWBP in its own Best Practice document. There’s always more to say – and there are always different ways of working.
Data on the Web Best Practices doesn’t prescribe the use of any particular technology other than Web basics. Each BP has an intended outcome, such as BP14’s “As many users as possible will be able to use the data without first having to transform it into their preferred format,” or BP23’s “Developers will have programmatic access to the data for use in their own applications, with data updated without requiring effort on the part of consumers. Web applications will be able to obtain specific data by querying a programmatic interface.” From there, each BP offers possible approaches to implementation with some examples. If you can achieve the same intended outcome with a different technology, go ahead: you’re still following best practice.
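One possible way to reach BP14’s intended outcome is HTTP content negotiation: the same dataset is offered in several formats and the server picks one based on the request’s `Accept` header. The following is only a sketch under stated assumptions. The `parse_accept` and `choose_format` functions and the file names are hypothetical, and the matching is simplified (it ranks by `q` value only and ignores the finer precedence rules of the HTTP specification). Other technologies can reach the same outcome equally well.

```python
# Hypothetical mapping from media type to the file holding that representation.
AVAILABLE = {
    "text/csv": "bus-stops.csv",
    "application/json": "bus-stops.json",
    "text/turtle": "bus-stops.ttl",
}


def parse_accept(header):
    """Parse an Accept header into (media_type, q) pairs, highest q first."""
    prefs = []
    for part in header.split(","):
        part = part.strip()
        if not part:
            continue
        fields = [f.strip() for f in part.split(";")]
        media_type = fields[0].lower()
        q = 1.0  # a missing q parameter means full preference
        for param in fields[1:]:
            name, _, value = param.partition("=")
            if name.strip().lower() == "q":
                try:
                    q = float(value)
                except ValueError:
                    q = 0.0  # treat a malformed q as "not acceptable"
        prefs.append((media_type, q))
    # sorted() is stable, so equal q values keep their header order
    return sorted(prefs, key=lambda pref: pref[1], reverse=True)


def choose_format(accept_header, available=AVAILABLE, default="text/csv"):
    """Return the media type to serve, or None if nothing acceptable exists."""
    for media_type, q in parse_accept(accept_header):
        if q <= 0:
            continue
        if media_type in available:
            return media_type
        if media_type == "*/*":
            return default
        if media_type.endswith("/*"):
            prefix = media_type[:-1]  # "text/*" becomes "text/"
            for candidate in available:
                if candidate.startswith(prefix):
                    return candidate
    return None


print(choose_format("text/turtle;q=0.9, application/json"))
print(choose_format("application/xml"))
```

A server built along these lines would answer a `None` result with a "406 Not Acceptable" response, or simply fall back to a default format; which is better depends on the consumers you expect.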
The Working Group as a whole was chartered not just to create a set of Best Practices but to help foster an ecosystem of data sharing. Part of this is addressed in two vocabularies, one for describing the usage of a dataset (through use in an application, citation in someone else’s work etc.) and one for describing quality. Quality is rarely an objective fact but the vocabulary provides a framework in which statements about quality can be made.
DWBP is not just about government data. GS1, the body behind the world’s product bar codes, contributed to the work and has already leveraged it in their proposed GS1 SmartSearch. In the world of scientific research, the Pacific Northwest National Laboratory is advocating the work in its publishing of climate simulation datasets on the Earth System Grid Federation and in its Atmosphere to Electrons (A2e) Data Archive and Portal (DAP). Los Alamos and Lawrence Berkeley National Laboratories are also using the document to improve the way data is shared online. Importantly for research data, W3C's Data on the Web Best Practices are fully aligned with the FAIR principles.
It’s always encouraging when you hear other people referring to your work, and DWBP got a lot of mentions at the Smart Descriptions and Smarter Vocabularies workshop (SDSVoc) last year (report soon, I promise). And we’ve had compliments from many quarters. I’d like to end by noting two unusual features of the Working Group. First, all three active chairs and three of the group’s editors are women. Second, this was the first W3C WG that had such strong participation from Brazil.
It’s been a privilege to work with such a terrific group of revolutionaries from all over the world.