W3C

Possible future directions for data on the Web

As I enter my final days as a member of the W3C Team*, I’d like to record some brief notes for what I see as possible future directions in the areas in which I’ve been most closely involved, particularly since taking on the ‘data brief’ four years ago.

Foundations

The Data on the Web Best Practices, which became a Recommendation in January this year, forms the foundation. As I highlighted at the time, it sets out the steps anyone should take when sharing data on the Web, whether openly or not, encouraging the sharing of actual information, not just information about where a dataset can be downloaded. A domain-specific extension, the Spatial Data on the Web Best Practices, is now all but complete. There again, the emphasis is on making data available directly on the Web so that, for example, search engines can make use of it directly and not just point to a landing page from which a dataset can be downloaded, which is what I call using the Web as a glorified USB stick.

Spatial Data

That specialized best practice document is just one output from the Spatial Data on the Web WG, in which we have collaborated with our sister standards body, the Open Geospatial Consortium, to create joint standards. Plans are being laid for a long-term continuation of that relationship, which has exciting possibilities in VR/AR, the Web of Things, Building Information Models and Earth Observations, as well as a best practices document looking at statistical data.

Research Data

Another area in which I very much hope W3C will work closely with others is research data: life sciences, astronomy, oceanography, geology, crystallography and many more ‘ologies.’ Supported by the VRE4EIC project, the Dataset Exchange WG was born largely from this area and is leading to exciting conversations with organizations including the Research Data Alliance, CODATA, and even the UN. This is in addition to, not a replacement for, the interests of governments in the sharing of data. Both communities are strongly represented in the DXWG, which will, if it fulfills its charter, make big improvements in interoperability across different domains and communities.

Linked Data

Figure: The Gartner Hype Cycle, a line graph showing an initial peak of inflated expectations, followed by the trough of disillusionment, the slope of enlightenment and the plateau of productivity. CC BY-SA, Jeremykemp at English Wikipedia.

The use of Linked Data continues to grow; if we accept the Gartner Hype Cycle as a model, then I believe that, following the Trough of Disillusionment, we are well onto the Slope of Enlightenment. I see it used particularly in environmental and life sciences, government master data and cultural heritage. That is, it’s used extensively as a means of sharing and consuming data across departments and disciplines. However, it would be silly to suggest that the majority of Web developers are building their applications on SPARQL endpoints. Furthermore, if you make a full SPARQL endpoint openly available, it’s relatively easy for someone to write a query so computationally expensive that it brings the system down. That’s why the BBC, OpenPHACTS and others don’t make their SPARQL endpoints publicly available. Would you make your SQL interface openly available? Instead, they provide a simple API that runs straightforward queries in the background that a developer never sees. In the case of the BBC, even their API is not public, but it powers a lot of the content on their Web site.

The upside of this approach is that through those APIs it’s easy to access high value, integrated data as developer-friendly JSON objects that are readily dealt with. From a publisher’s point of view, the API is more stable and reliable. The irritating downside is that people don’t see, and therefore don’t recognize, the Linked Data infrastructure behind the API, which allows the value of the technology to be continually questioned.
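To make that pattern concrete, here is a minimal sketch in Python of what such an API layer might do: accept a couple of simple parameters, turn them into one fixed and bounded SPARQL query, and flatten the standard SPARQL JSON results into plain objects. This is not how the BBC or OpenPHACTS actually do it. The namespace `http://example.org/id/`, the function names and the result cap are all assumptions made up for illustration, and the step that sends the query to a private endpoint is deliberately left out.

```python
import re

# Hypothetical namespace for the resources this API exposes (an assumption).
BASE_URI = "http://example.org/id/"
MAX_LIMIT = 100  # hard cap so no request can ask for an unbounded result set

# The only query shape the API will ever run; callers cannot alter it.
LABEL_QUERY = """\
PREFIX rdfs: <http://www.w3.org/2000/01/rdf-schema#>
SELECT ?label WHERE {{
  <{uri}> rdfs:label ?label .
}}
LIMIT {limit}"""


def build_label_query(resource_id: str, limit: int = 10) -> str:
    """Turn two simple API parameters into a fixed, bounded SPARQL query."""
    # Accept only plain identifiers so nothing can be injected into the query.
    if not re.fullmatch(r"[A-Za-z0-9_-]+", resource_id):
        raise ValueError(f"invalid resource id: {resource_id!r}")
    bounded = max(1, min(int(limit), MAX_LIMIT))
    return LABEL_QUERY.format(uri=BASE_URI + resource_id, limit=bounded)


def flatten_results(sparql_json: dict) -> list:
    """Reduce SPARQL JSON results to plain {variable: value} objects."""
    return [
        {var: term["value"] for var, term in binding.items()}
        for binding in sparql_json["results"]["bindings"]
    ]


# A request for far too many rows is quietly capped at MAX_LIMIT.
query = build_label_query("programme-42", limit=5000)

# A hand-written response in the SPARQL JSON results layout.
sample = {
    "head": {"vars": ["label"]},
    "results": {"bindings": [{"label": {"type": "literal", "value": "Example"}}]},
}
print(flatten_results(sample))  # prints [{'label': 'Example'}]
```

The developer calling such an API only ever sees the identifier going in and the JSON coming out, which is exactly why the Linked Data behind it goes unnoticed.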

Semantic Web, AI and Machine Learning

The main Semantic Web specs were updated at the beginning of 2014 and there are no plans to review the core RDF and OWL specs any time soon. However, that doesn’t mean that there aren’t still things to do.

One spec that might get an update soon is JSON-LD. The relevant Community Group has continued to develop the spec since it was formally published as a Rec and would now like to put those new specs through Rec Track. Meanwhile, the Shapes Constraint Language, SHACL, has been through something of a difficult journey but is now at Proposed Rec, attracting significant interest and implementation.

But what I hear from the community is that the most pressing ‘next thing’ for the Semantic Web should be what I call ‘annotated triples.’ RDF is pretty bad at describing and reflecting change: someone changes job, a concert ticket is no longer valid, the global average temperature is now y not x, and so on. Furthermore, not all ‘facts’ are asserted with equal confidence. Natural Language Processing, for example, might recognize a ‘fact’ within a text with only 75% certainty.

It’s perfectly possible to express these now using Named Graphs. However, in talks I’ve given recently where I’ve mentioned this, including to the team behind Amazon’s Alexa, there has been strong support for the idea of a syntax that would allow each triple to be extended with ‘validFrom’, ‘validTo’ and ‘probability’. Other possible annotations might relate to privacy, provenance and more. Such annotations may be semantically equivalent to creating and annotating a named graph, and RDF 1.1 goes a long way in this direction, but I’ve received a good deal of anecdotal evidence that a simple syntax might be a lot easier to process. This is very relevant to areas like AI, deep learning and statistical analysis.
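As a rough illustration of the kind of processing such annotations would make easy, here is a small Python sketch of triples carrying ‘validFrom’, ‘validTo’ and ‘probability’. It is a data model only, not a proposed syntax. The class and function names, the `ex:` identifiers and the choice to treat both ends of the validity interval as inclusive are all assumptions of mine.

```python
from dataclasses import dataclass
from datetime import date
from typing import Iterable, List, Optional


@dataclass(frozen=True)
class AnnotatedTriple:
    """A triple plus the three annotations discussed above (hypothetical model)."""
    subject: str
    predicate: str
    obj: str
    valid_from: Optional[date] = None  # None: no known start of validity
    valid_to: Optional[date] = None    # None: still valid
    probability: float = 1.0           # confidence with which the fact is asserted

    def holds_on(self, day: date) -> bool:
        # Assumption: both ends of the validity interval are inclusive.
        if self.valid_from is not None and day < self.valid_from:
            return False
        if self.valid_to is not None and day > self.valid_to:
            return False
        return True


def facts_on(triples: Iterable[AnnotatedTriple], day: date,
             min_probability: float = 0.0) -> List[AnnotatedTriple]:
    """Keep the triples valid on `day` and asserted with enough confidence."""
    return [t for t in triples
            if t.holds_on(day) and t.probability >= min_probability]


facts = [
    # Someone changes job: the old fact ends, the new one begins.
    AnnotatedTriple("ex:alice", "ex:worksFor", "ex:acme",
                    valid_from=date(2010, 1, 1), valid_to=date(2016, 12, 31)),
    AnnotatedTriple("ex:alice", "ex:worksFor", "ex:initech",
                    valid_from=date(2017, 1, 1)),
    # A 'fact' extracted from text by NLP with only 75% certainty.
    AnnotatedTriple("ex:alice", "ex:livesIn", "ex:paris", probability=0.75),
]

for t in facts_on(facts, date(2017, 6, 1), min_probability=0.9):
    print(t.subject, t.predicate, t.obj)  # prints ex:alice ex:worksFor ex:initech
```

The same information could be held as one named graph per triple with the annotations attached to the graph. The point of the sketch is only that per-triple fields make filtering by time and confidence a one-line operation.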

These sorts of topics were discussed at ESWC recently and I very much hope that there will be a W3C workshop on them next year, perhaps leading to a new WG. A project proposal was submitted to the European Commission recently that would support this, and others interested in the topic should get in touch.

Other possible future work in the Semantic Web includes a common vocabulary for sharing the results of data analysis, natural language processing etc. The Natural Language Interchange Format, for example, could readily be put through Rec Track.

Vocabularies and schema.org

Common vocabularies, maintained by the communities they serve, are an essential part of interoperability. Whether for researchers, governments or businesses, better and easier maintenance of vocabularies and a more uniform approach to sharing mappings, crosswalks and linksets must be a priority. Internally at least, we have recognized for years that W3C needs to be better at this. What’s not so widely known is that we can do a lot now. Community Groups are a great way to get a bunch of people together to work on your new schema and, if you want it, you can even have a www.w3.org/ns namespace (either directly or via a redirect). Again, subject to an EU project proposal being funded, there should be money available to improve our tooling in this regard.

W3C will continue to support the development of schema.org, which is transforming the amount of structured data embedded within Web pages. If you want to develop an extension for schema.org, a Community Group and a discussion on public-vocabs@w3.org are the places to start.

Summary

To summarize, my personal priorities for W3C in relation to data are:

  1. Continue and deepen the relationship with OGC for better interoperability between the Web and geospatial information systems.
  2. Develop a similarly deep relationship with the research data community.
  3. Explore the notion of annotating RDF triples for context, such as temporal and probabilistic factors.
  4. Be better at supporting the development of vocabularies and their agile maintenance.
  5. Continue to promote the Linked Data/Semantic Web approach to data integration that can sit behind high value and robust JSON-returning APIs.

I’ll be watching …
