W3C study of practices and tooling for Web data standardisation

Dave Raggett dsr@w3.org, W3C Data Activity Lead, December 2017

This study has been made with support from the Open Data Institute and Innovate UK.

Abstract

The Web has had a huge impact on how we exchange and access information. W3C is the leading standards development organisation for Web technologies, and has hosted community development of standards for both the Web of pages used by people and the Web of data used by services. This report covers a study of W3C practices and tooling for Web data standardisation. A lengthy questionnaire was used to solicit input from a wide range of stakeholders. The feedback will be used as a starting point for making W3C a more effective, more welcoming and more sustainable venue for communities seeking to develop Web data standards and exploit them to create value-added services.

The report starts with an introduction to the Web of data and W3C’s standardisation activities. This is followed by a look at the design of the questionnaire and the feedback obtained for each of its sections. After this comes a section on the challenges for measuring the popularity of standards, the need to support the communities that develop and use them, and how to gather feedback that can be used to improve standards and identify gaps where new standards are needed. The report closes with a look at the potential of multidisciplinary approaches including AI, Computational Linguistics and Cognitive Science to transform the process of creating standards, and to evolve the Semantic Web into the Cognitive Web.

Introduction

The Web is the World’s most successful vendor neutral distributed information system, enabling people to access applications and services right across the World from their smart phones, tablets, laptops and other computing devices. The Web is founded on the three pillars of addressing, document formats and network protocols. For the Web of pages as viewed with Web browsers, this involves URLs for addressing resources accessed by the Hypertext Transfer Protocol (HTTP), and document formats such as the Hypertext Markup Language (HTML), Cascading Style Sheets (CSS), image formats like JPEG and PNG, and Web page scripts using JavaScript.

Complementing the Web of pages, there is the Web of data, which ranges from small amounts of data to vast datasets, and which is either open to all or restricted to a few. Data can be consumed by Web pages, downloaded for local processing, or accessed via network APIs that support remote processing. Data is often published without prior coordination with other publishers — let alone with precise modeling or common vocabularies. Standard data exchange formats, models, tools and guidance are needed to facilitate Web-scale data integration and processing.

This report surveys W3C's completed and ongoing work on Web Data standards, and looks to the future with a study of the challenges facing communities that are seeking to exploit the opportunities provided by the Web of Data. A lengthy questionnaire was created to elicit feedback from stakeholders across a broad range of topics, including the kinds of data standards of interest, sustainability and governance, scaling challenges, tooling and practices, liaisons, outreach and community building, and miscellaneous feedback on W3C groups. The analysis of this feedback will help W3C to improve its value proposition for communities seeking to develop and exploit Web Data standards, as part of W3C's mission to bring the Web to its full potential.

An introduction to the Web of Data

This section of the report will look at the different kinds of standards that form the basis for the Web of Data. We will then review the different kinds of standardisation groups at W3C, and the current groups with an interest in the Web of Data.

What kinds of standards, why and for whom?

The principal purpose of standards is to enable interoperability and facilitate the growth in services. Public services including government departments and cities are increasingly making data freely available for interested parties to make use of and add value to. Interoperability depends upon knowing the data formats and the vocabularies used for data items. Some common formats include Comma Separated Values, JSON (JavaScript Object Notation) and XML. To understand individual data items, you need to know their format, e.g. a number or string, and what they represent, e.g. a house number or street name. For values that represent physical measurements, you need to know the units and the scaling factor, as well as what is being measured, e.g. the level of Nitrogen Dioxide pollution at a given street location.
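To make this concrete, here is a minimal sketch, in the spirit of W3C's metadata vocabulary for tabular data (CSV on the Web), of how a data provider might describe one column of a CSV file so that consumers know its format, its meaning, its unit and its scaling factor. The property names, URIs and values below are illustrative assumptions, not taken from any actual dataset or normative specification.

```python
# A hypothetical description of one column in a CSV dataset, loosely in the
# spirit of W3C's "Metadata Vocabulary for Tabular Data" (CSVW). All names
# and URIs below are illustrative only.
no2_column = {
    "name": "no2_level",                    # column header in the CSV file
    "datatype": "decimal",                  # lexical format of each value
    "propertyUrl": "http://example.org/vocab#NO2Concentration",
    "unit": "http://example.org/units#MicrogramsPerCubicMetre",
    "scalingFactor": 1.0,                   # multiplier to apply to raw values
}

def interpret(raw_value: str, column: dict) -> float:
    """Turn a raw CSV cell into a measurement in the declared unit."""
    return float(raw_value) * column["scalingFactor"]

print(interpret("42.5", no2_column))        # 42.5 µg/m³ of NO2 at this location
```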

The development of services is simplified if different data providers use common representations for their data. We therefore need a way for data providers to describe their datasets and a means to reference definitions shared with other data providers. These descriptions may include constraints that can be used to validate the data as a basis for checking for internal consistency. Communities of data providers and consumers have a common interest in defining and using such standards. The kinds of standards will vary considerably. Some are community based, whilst others require international agreements involving a more formal approach for how they are developed and maintained.

W3C aims to support lightweight community based standards that can be incubated within W3C Community Groups, and if appropriate, transferred to Working Groups where a formal standards track process is desired, e.g. for core standards where a greater level of scrutiny is needed.

Resource Description Framework and Linked Data

In his 1989 proposal for the Web, Tim Berners-Lee included a diagram depicting an example of a semantic network based upon named resources with labelled links between them.

1990 Web proposal

This idea was developed into W3C’s Resource Description Framework (RDF), where URLs are used for both resources and link labels. Each link (also known as a triple) thus consists of URLs for the subject, predicate and object, respectively. A URL acts both as a name and as a means to get further information, by dereferencing it via an HTTP GET request. Over the years, W3C has developed a suite of standards around RDF.

Some standards relate to the use of RDF to define models, e.g. RDF Core, RDF Schema and OWL. Others define data exchange formats for RDF, e.g. RDF/XML, N-Triples, Turtle, TriG, and JSON-LD. The Linked Data Platform (LDP) defines how to use HTTP for reading and writing triples. SPARQL is a query and update language for RDF, analogous to SQL for relational data. SHACL provides a means to express validity constraints on a set of triples.
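A minimal sketch of these pieces working together, using the Python rdflib library (assumed to be installed; JSON-LD output needs rdflib 6.0 or later). All example.org URIs are hypothetical.

```python
# Parse a tiny Turtle document, iterate over its triples, query it with
# SPARQL, and re-serialise it as JSON-LD. All example.org URIs are invented.
from rdflib import Graph

turtle_data = """
@prefix ex: <http://example.org/> .
ex:sensor42 ex:measures  ex:NitrogenDioxide ;
            ex:locatedAt ex:HighStreet .
"""

g = Graph()
g.parse(data=turtle_data, format="turtle")

# Each triple is a (subject, predicate, object) tuple of URIs.
for s, p, o in g:
    print(s, p, o)

# SPARQL plays the role that SQL plays for relational data.
results = g.query("""
    PREFIX ex: <http://example.org/>
    SELECT ?place WHERE { ex:sensor42 ex:locatedAt ?place }
""")
for row in results:
    print(row.place)

# The same graph can be exchanged in other RDF serialisations, e.g. JSON-LD.
print(g.serialize(format="json-ld"))
```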

W3C standardisation groups

W3C hosts different kinds of standardisation groups according to the level of maturity of the work in question.

Previous Groups

A number of relevant Working and Interest Groups are now closed.

For more details, see: closed groups

Current Working Groups

For more details, see: current Working Groups

Current Interest Groups

For more details, see: current Interest Groups

Current Community Groups

There are many current W3C Community Groups with an interest in Web Data standards.

These groups vary considerably in how active they are and the kinds of opportunities they are addressing. Many groups make heavy use of GitHub for collaborative development of documents, e.g. use cases and requirements, specifications and test suites, primers and other introductory materials.

For more details, see: W3C Community Groups

What’s Driving Work on Web Data Standardisation?

The Web of Data has been growing steadily. One measure of this is the Linked Open Data Cloud diagram. Here is the May 2007 version:

2007 Linked Open Data Cloud

The 2017 version, shown below, indicates the rapid growth in open data over the last 10 years. It was created by Andrejs Abele, John P. McCrae, Paul Buitelaar, Anja Jentzsch and Richard Cyganiak. See http://lod-cloud.net/

2017 Linked Open Data Cloud

The diagrams above show datasets that have been published as Linked Data using HTTP in a variety of RDF data formats. Web Data is also available in other formats, e.g. JSON, Comma Separated Values (CSV), and embedded in PDF. The ability to integrate different data sources is dependent on standards for both the data formats and the data models along with the means to relate terms in different datasets.

The emergence of the Internet of Things is resulting in an increasing amount of data from a wide variety of sensors. Much of this is using incompatible platforms and standards, resulting in data silos. Over time the demand for services that combine different data sources will help to drive demand for open standards. This will in turn facilitate open markets of data and services and this will further drive demand for data. Another source of vast amounts of data is the scientific community combined with interest in virtual research environments.

It is easy to coordinate and work on a shared vocabulary if there is a small, well-knit community. However, this becomes very much harder as the community size grows, and when there are uncoupled or only weakly coupled communities. It is therefore inevitable that different communities will develop rival vocabularies, and that these will address different or perhaps overlapping requirements due to differences in context. Some use cases may call for a greater level of detail than others, which can make a vocabulary cumbersome for simpler use cases. When integrating data across such vocabularies, it becomes challenging to relate terms from the different vocabularies. One example of this is where units of measure are needed for sensor readings: the abbreviations are not universal and may have different meanings in different fields.

Another challenge relates to dealing with evolving APIs. In some cases, this may just be a matter of ignoring named arguments that a given software client doesn’t know about. In other cases, there may be a need to negotiate over which version of the API is used so that a server is able to support both current and legacy clients.
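As a sketch of one common negotiation pattern (not a standardised mechanism), a client can state the API version it understands in a media-type parameter and then inspect which version the server actually served. The endpoint URL, the media type and the "version" parameter below are all hypothetical.

```python
# Hypothetical version negotiation over HTTP using the requests library.
import requests

def process_v2(data):
    print("v2 payload:", data)

def process_v1(data):
    # Legacy handling: ignore any named fields this client doesn't know about.
    print("v1 payload:", data)

response = requests.get(
    "https://api.example.org/observations",             # hypothetical endpoint
    headers={"Accept": "application/json; version=2"},  # version the client wants
)
served = response.headers.get("Content-Type", "")
(process_v2 if "version=2" in served else process_v1)(response.json())
```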

Questionnaire on W3C practices and tooling for Web Data standardisation

The questionnaire was designed to address a broad range of questions, and this resulted in a long form that took respondents considerable time to fill out. I am extremely grateful for the time they devoted to this. Where practical, multiple-choice questions were used to facilitate the generation of graphical presentations. However, the breadth of stakeholders made it impractical to cover everyone’s specific choices, so the questionnaire made extensive use of free-form text fields for open-ended questions. This report provides a summary of the points covered in the free-form text fields, along with a preliminary analysis. This questionnaire is just the first stage, and the idea is to follow up with a broader discussion of the choices available to make W3C a better venue for communities to work on Web data standards.

The questionnaire was not limited to W3C Member organisations or people involved in W3C Community Groups. The questionnaire was publicised via a W3C blog post, and emails to all of the relevant W3C groups. People were encouraged to spread the word further using their social connections.

About You

The questionnaire starts with a section titled “About you” which asks for the respondent’s name, email address, organisation, organisation’s website, primary location (country), and the organisation’s interest in data standards. The name and email address were requested so that the respondent could be contacted with any follow-up questions regarding their input.

Here is a chart of the countries provided by respondents. The question used a free-text field, and this resulted in people using different names for the same country, e.g. US, USA, United States. The data thus required some post-processing (a sketch of which follows the chart).

respondent countries
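The post-processing amounts to mapping variant spellings onto one canonical country name before counting. The sketch below illustrates the idea; the alias table is invented for illustration and is not the one actually used.

```python
# Normalise free-text country names, then count responses per country.
from collections import Counter

ALIASES = {
    "us": "United States", "usa": "United States",
    "united states": "United States",
    "uk": "United Kingdom", "great britain": "United Kingdom",
}

def normalise(country: str) -> str:
    key = country.strip().lower()
    return ALIASES.get(key, country.strip().title())

responses = ["US", "USA", "United States", "UK", "France"]
print(Counter(normalise(c) for c in responses))
# Counter({'United States': 3, 'United Kingdom': 1, 'France': 1})
```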

The following organisations contributed to the questionnaire results:

Acando AS
Adobe Systems
Agency for Digitisation
Alexandria Consulting LLC
Azavea
BT
callas software GmbH
Camara dos Deputados
Centre for eResearch and Digital Innovation - Federation University Australia
Collective[i]
ContentMine
CPE Lyon
Cray Inc
Data Unity
Deutsche Nationalbibliothek
DHS
Dinador Ltd.
Dow Chemical
DSS Ltd.
Dublin Core Metadata Initiative (DCMI)
EC
ePanstwo Foundation
Ephox Corporation
Extremadura University
Geonovum
German Medicines Manufacturers Association
hbz
HES-SO
High Latitudes
Hokukahu LLC
IMATI-CNR
INRA
INRIA
interactive instruments
ISCAP-IPP, Portugal
KOOP
Lawrence Berkeley National Laboratory
Legal Up
Linked Data Factory
Met Office
Metalinkage
Mikros Image
Natural Resources Canada, Government of Canada
Networked Planet
NIST
Nuance
Open Data Institute
Ordnance Survey
OWASP
PayEx
Porism Limited
ReportLab Europe Ltd
Schneider Electric
StratML Committee
The ODI Australian Network
Thought Transfer Research
Trinity College Dublin
Ubiquity Press
Universidad Politécnica de Madrid
University of Glasgow
University of Kent
University of Minho
University of North Florida
University of Queensland
University of Southampton
Web3D Consortium
Wolters Kluwer
Yodata

Several people responded independently on their own behalf.

What kind of data standards

This section of the questionnaire gathers information about the interest in application sectors, approaches to data access, approaches for discovery of data and services, stability vs agility of standards, the importance of standards for data formats, data vocabularies, data models, terms & conditions (licenses), privacy policies, payments, versioning, longevity of standards, the role of W3C for registering namespaces, and internationalisation.

importance of application sectors

importance of stability

importance of agility

importance of standards for data formats

importance of standards for data vocabularies

importance of standards for data models

importance of standards for terms and conditions

importance of standards for privacy policies

importance of standards for payments for access to data

importance of standards for versioning

interest in using W3C domain for vocabularies

Sustainability and governance

This section of the questionnaire considers how to fund and oversee the social and physical infrastructure needed to support standardisation.

Scaling challenges

This section invites respondents to comment on and provide suggestions for how to address scaling challenges for developing and maintaining data standards.

Tooling and practices

This section of the questionnaire gathers feedback that will help W3C review the tools available to standardisation groups and the associated practices. As an example, W3C groups have made increasing use of GitHub for collaborative specification development, despite GitHub being originally designed for software development teams.

Liaisons, outreach and community building

This section of the questionnaire sought input on the work done on reaching out beyond the standardisation group as a basis for successful standards.

Miscellaneous feedback on W3C groups

In this final section of the questionnaire, respondents were asked to describe which W3C groups they are involved in, what is working well, what problems they’ve seen, and their suggestions for improvements.

Tracking adoption and interest in Web Data standards

How successful are standards? To answer this question, we need a way to measure the level of interest in particular standards. At the time this report was written, W3C had done surprisingly little to measure interest in its standards.

One approach that could be implemented with modest resources would be to exploit the W3C website server logs, looking at the requests for W3C technical reports and other documents from Working Groups, Interest Groups, Community Groups and Business Groups. W3C’s privacy policy states that W3C does not perform behavioural tracking of users. Client IP addresses, and the HTTP Referer and User-Agent fields, are logged to allow traffic to be analysed. The collected data is only used for server administration, site improvement, usage statistics, and Web protocol research.

The Webmaster helped with the analysis of the server logs for a selection of URLs corresponding to W3C technical reports relevant to Web data standardisation. In keeping with the privacy policy, IP addresses are only kept for a relatively short period of time - a quarter of a year. This allows us to look at the popularity of different technical reports, and to see which countries the requests came from.
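A sketch of the kind of per-report counting described above, assuming logs in the common Apache format. The log file name, the regular expression and the set of report short names are illustrative, not the actual analysis scripts used.

```python
# Count requests per technical report from a web server access log.
import re
from collections import Counter

REPORTS = {"vocab-ssn", "owl-time", "shacl", "json-ld", "turtle", "dwbp"}
pattern = re.compile(r'"GET /TR/([^/" ]+)/?[^"]*"')

counts = Counter()
with open("access.log") as log:             # hypothetical log file
    for line in log:
        m = pattern.search(line)
        if m and m.group(1) in REPORTS:
            counts[m.group(1)] += 1

for report, n in counts.most_common():
    print(report, n)
```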

The following figure shows the number of times each report was requested in the period covered (August - November 2017).

Technical Report Request Count for one quarter

One observation is that the huge popularity of the Semantic Sensor Network ontology (vocab-ssn) and the Time Ontology in OWL (owl-time) is due to both having recently become W3C Recommendations. The Shapes Constraint Language (shacl) became a W3C Recommendation three months earlier, and has a similar download count to many other Linked Data technical reports. This suggests that reports are initially very popular, but that this popularity rapidly decays to a background level. There are exceptions, e.g. the Data on the Web Best Practices (dwbp), which shows persisting popularity. To track popularity patterns, W3C would need to regularly record the request counts for each technical report, e.g. on a monthly basis.

Another observation is that JSON-LD is more popular than other Linked Data serialisation formats, and is followed by Turtle. Other Linked Data formats such as n-triples and n-quads are much less popular. JSON-LD defines a way to use the JavaScript Object Notation (JSON) to represent Linked Data. Its relative popularity points to the huge popularity of JSON amongst web developers, superseding the previous high levels of interest in XML.

The GeoLite2 dataset was used to derive the country from the client IP address as a basis for assessing which countries were most interested in Web data standardisation (a sketch of this lookup follows the chart). The results show a very long tail of countries with small download counts. Take for example the Semantic Sensor Network ontology. This is most popular in the USA (88932 downloads), followed by China (65703), UK (38430), Netherlands (25625), France (25000), Germany (24541), fading to a single download each for South Sudan and the Central African Republic. Here is the data as a pie chart; note that there were downloads from 214 countries, not all of which are listed due to lack of room. The mapping isn’t perfect: “IPAddressnotfound” and “Republicof” are cases where the algorithm failed to work correctly.

pie chart for downloads by country
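A sketch of the country lookup, assuming the GeoLite2 country database and the geoip2 Python package are installed. The database file name and IP addresses are illustrative; the error case hints at how labels like “IPAddressnotfound” can arise.

```python
# Derive a country per client IP address and tally downloads per country.
from collections import Counter
import geoip2.database
import geoip2.errors

reader = geoip2.database.Reader("GeoLite2-Country.mmdb")  # hypothetical path

def country_of(ip: str) -> str:
    try:
        return reader.country(ip).country.name or "Unknown"
    except geoip2.errors.AddressNotFoundError:
        return "IP address not found"   # one source of the odd labels above

client_ips = ["128.30.52.100", "203.0.113.7"]   # stand-ins for logged IPs
print(Counter(country_of(ip) for ip in client_ips))
```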

Further work is needed to figure out a sustainable solution for collecting such statistics across the W3C site on a long term basis and presenting the results in a way that can inform decisions on how W3C invests its limited resources.

The level of interest in Web data standards could be tracked in other ways, for example, citations from websites and research publications, and by providing a registration form for users. To make it worthwhile for users to register, this could be tied to a community based support service for W3C specifications. Community maintained support services are increasingly popular with companies as a way to provide good quality support at a lower cost. SMEs and independent consultants can benefit as their reputation as contributors boosts their business opportunities. Further investigation is needed on the detailed requirements and investment needed to kick start this approach.

Dealing with the challenges of heterogeneity

As more and more people want to provide or consume data on the Web, this will increase the demand for open standards for data vocabularies. People will for the most part be interested in using existing vocabularies where appropriate. The challenge is then how to discover and assess such vocabularies, especially when they have been developed by isolated communities. Adopting an existing vocabulary has its risks - the vocabulary could have been designed for different requirements and be overly cumbersome in a different context or fail to adequately cover the chosen use cases.

A related challenge is that people from a like-minded background tend to think in similar ways, and tend not to make their shared assumptions explicit, instead going directly into the details of the solution they envisage. This makes it hard for people from different communities to evaluate a given vocabulary to see if it is a good fit.

The Web is world-wide, but people may be separated by living in different countries, speaking different languages, or working in different industries. With uncoupled or weakly coupled communities, and only partially overlapping requirements, we can expect the emergence of vocabularies that play similar roles, but which aren’t directly compatible. This creates challenges for services that need to integrate data from multiple such vocabularies.

In the simplest case, a term in one vocabulary can be declared as the same as a term in a different vocabulary. More generally, a term in one vocabulary might be declared as equivalent to a graph in another vocabulary. For instance, a single term might be used to indicate a combination of a unit of measure and a scaling factor, e.g. milliamps for electrical current. A second vocabulary could express these separately.
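The simplest case can already be expressed with standard mapping terms such as owl:sameAs, owl:equivalentProperty and skos:exactMatch, as in this minimal sketch. The two vocabularies are hypothetical, and the rdflib Python library is assumed to be available.

```python
# Declare term-level alignments between two hypothetical vocabularies using
# standard OWL and SKOS mapping properties.
from rdflib import Graph

alignment = """
@prefix owl:  <http://www.w3.org/2002/07/owl#> .
@prefix skos: <http://www.w3.org/2004/02/skos/core#> .
@prefix a:    <http://vocab-a.example.org/> .
@prefix b:    <http://vocab-b.example.org/> .

a:current owl:equivalentProperty b:electricalCurrent .
a:mA      skos:exactMatch        b:milliampere .
"""

g = Graph()
g.parse(data=alignment, format="turtle")
print(len(g), "mapping triples loaded")
```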

More generally still, terms may be relatable only in specific contexts. This can be compared to human languages, e.g. the words used for waterways such as rivers, streams, brooks, etc., where the taxonomies of words in different languages don’t correspond directly. For instance, to pick the right word, you may need to know whether the river in question flows into the sea or merges into another river.

This suggests the need for a way to describe how to transform Linked Data graphs to replace one vocabulary with another, potentially with some form of defaults when required. This is something where experimentation is needed, and should lead to open standards for Linked Data transformation languages. Is this something that W3C should be driving, and if so, how?
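One existing building block for such experiments is SPARQL CONSTRUCT, which rewrites triples from one vocabulary into another. The minimal sketch below splits a hypothetical combined “milliamps” term into a separate value, unit and scaling factor; all vocabularies here are invented for illustration, and rdflib is assumed.

```python
# Transform a graph from vocabulary A into vocabulary B with SPARQL CONSTRUCT.
from rdflib import Graph

g = Graph()
g.parse(data="""
@prefix a: <http://vocab-a.example.org/> .
a:reading1 a:milliamps 120 .
""", format="turtle")

transformed = g.query("""
PREFIX a: <http://vocab-a.example.org/>
PREFIX b: <http://vocab-b.example.org/>
CONSTRUCT {
    ?s b:value ?v ;
       b:unit  b:ampere ;
       b:scale 0.001 .      # milli -> base unit
}
WHERE { ?s a:milliamps ?v }
""").graph

print(transformed.serialize(format="turtle"))
```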

Another approach involves so called “upper ontologies”. These define domain concepts in terms of underlying general concepts that are applicable across domains. It can be challenging to understand how to relate domain specific concepts to these very general concepts, and likewise to implement software that can take advantage of these definitions.

The promise of AI and the Cognitive Web

The difficulty of manually creating complex ontologies can in principle be avoided through the use of machine learning algorithms that are applied to a training corpus. One approach for this makes use of a synthesis of cognitive science, AI, computational linguistics and sociology, building upon progress in each of these fields, enabling conversational cognitive agents that can be trained and assessed using lessons expressed in natural language.

This necessitates a means to translate natural language into semantic graphs, and back again for natural language generation. Cognitive architectures like John R. Anderson's pioneering work on ACT-R have proven themselves in terms of replicating common characteristics of human memory and learning. This points to opportunities for extending Linked Data with persistent link strengths and exponentially decaying node activation levels. Procedural knowledge can be expressed using production rules, and trained using reinforcement learning algorithms.
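As a concrete illustration, ACT-R’s published base-level learning equation models an item’s activation as growing with each use and decaying with time since use (a power-law decay, a close cousin of the decaying activation levels suggested above). The sketch below is a toy rendering of that formula, not an ACT-R implementation.

```python
# ACT-R base-level activation: B = ln( sum over past uses of (now - t)^-d ),
# with decay d typically around 0.5 (Anderson's ACT-R literature).
import math

def base_level_activation(use_times, now, decay=0.5):
    """Activation of a memory item given the times it was previously used."""
    return math.log(sum((now - t) ** -decay for t in use_times))

# A node used at t=1s and t=60s is more active at t=61s than one used
# only once at t=1s.
print(base_level_activation([1, 60], now=61))   # recently reinforced
print(base_level_activation([1], now=61))       # single old use
```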

Cognitive agents will require support for episodic memory and counterfactual reasoning (i.e. knowledge about what/when and what/if), both for learning from narratives and as a means to support a level of self-awareness as a basis for monitoring progress and deciding when to switch to different ways of thinking, the importance of which has been emphasised by Marvin Minsky.

Linked Data uses explicit concepts with nodes connected by labelled arcs. This makes it easier to provide explanations as compared to approaches based upon artificial neural networks and deep learning. However, Linked Data can also be represented using vector spaces and tensor expressions for implementations based upon neural networks. Much remains to be done on exploring how to apply vector spaces to rich graph representations and procedural rule sets, and there is considerable potential for addressing the statistical basis for reasoning in terms of what has been found useful in past experience, as compared with the emphasis on logical inference and completeness found in conventional approaches to ontologies. This is also relevant to mimicking the human ability to track changes in the meaning of words based upon their patterns of usage.
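One well-known technique for representing Linked Data in a vector space is TransE-style embedding (Bordes et al., 2013), in which a triple is scored by how close head-plus-relation lies to tail. The sketch below uses toy, untrained vectors purely to show the scoring function; a real system would learn the embeddings from a training corpus.

```python
# TransE-style plausibility scoring for triples over toy random embeddings.
import numpy as np

rng = np.random.default_rng(0)
dim = 8
embed = {name: rng.normal(size=dim) for name in
         ["sensor42", "locatedAt", "HighStreet", "NitrogenDioxide"]}

def score(h, r, t):
    """Higher (less negative) means the triple fits the embedding better."""
    return -np.linalg.norm(embed[h] + embed[r] - embed[t])

print(score("sensor42", "locatedAt", "HighStreet"))
print(score("sensor42", "locatedAt", "NitrogenDioxide"))
```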

In the long term, this can be expected to change the nature of standardisation from a direct consideration of linked data vocabularies to the curation of a corpus of training materials as based upon an agreed set of use cases. At its simplest, this involves examples and counter examples for data fields, as input to a machine learning algorithm. Natural language descriptions could be used to relate data fields to what they represent, e.g. the address of a house or flat. Such descriptions can also be used to define taxonomies of terms including generalisations and exceptions. There has been plenty of work on extracting named entities from text, but so far much less on understanding narratives as would be needed for natural language descriptions of use cases.

Conclusions

The opportunities for data on the Web are huge, both for publicly shared open data, and for data exchanged business to business. This potential is critically dependent on standards to enable interoperability, to reduce the effort and risk involved, and to unlock the network effect. This study of W3C practices and tooling for Web data standardisation has gathered feedback from a wide range of stakeholders on many different aspects of standardisation.

The rise of the Internet of Things will accelerate the need for work on standards for vocabularies that describe devices, services and the context in which they are situated. The same applies to the rise of open data published by governments and other organisations, including the availability of scientific data for virtual research environments.

There are many challenges to be overcome, as outlined in the preceding sections.

There is much to improve on the current status. The W3C home page for the Web of Data needs revamping and bringing alive with regular news posts and links to useful resources. The Web of Data needs greater visibility within the W3C Team and Membership, and with the public at large. Whilst the W3C Community Groups programme has been very successful, with a large number of groups, there is a lack of guidance for communities interested in developing standards. For W3C to step up to the challenge of the huge potential demand for community standards, new approaches will be needed to sustain the level of resources required.

Web developers often express negative sentiments about the Semantic Web, and this can in part be attributed to a “them and us” attitude with respect to people working on Linked Data and the Semantic Web. It is not helped by the perceived complexity often associated with OWL ontologies and the esoteric focus of much of the published work. This gulf needs to be bridged by a greater focus on simpler approaches that are a good fit for the use cases of interest to Web developers. A community-supported forum aimed at Web developers, for exchanging information on use cases and accounts of how they were solved in a simple way, would be a big help.

This study will be used as a starting point for further discussion on how to improve the services that W3C Data Activity offers for communities interested in developing Web data standards.

The role of the W3C in supporting development of specific standards

What kinds of new standards are needed for accelerating the adoption of data on the Web? This includes metadata standards, e.g. relating to privacy, terms & conditions, machine interpretable licenses, and payments. To assist with discovery, there is a need for websites to be able to describe data services in a standard way that facilitates indexing by search engines. This could be done in collaboration with schema.org. W3C is already working on updating the Data Catalog Vocabulary as a basis for describing datasets (see Dataset Exchange WG), but there is a gap when it comes to discovery of network APIs. There are existing solutions for describing RESTful APIs, but these focus on the data types rather than the semantics. The work on thing descriptions in the Web of Things Working Group seems relevant, along with fresh ideas for mapping JSON to Linked Data. Is there a need for a meta vocabulary to facilitate discovery of vocabularies? Another area ripe for investigation is the potential for a new standard for a rule language for context based mappings between Linked Data vocabularies with partially overlapping semantics.

The role of W3C as an incubator for standards

What should W3C be doing to better support Working Groups and Community Groups? This could include better guidance on how to run effective Community Groups, and advice on the different kinds of standards, how to incubate them, and how to progress them along the standards track. What could W3C do to give Community Groups greater control over their home pages? What is needed to support training and outreach as part of the process of building momentum around new standards at various stages in their lifecycle? What changes to how groups are formed would provide the resources needed to provide better tooling? Is there a role for community-maintained support services as part of this? This could include tools for facilitating the sharing of advice and experience across different community groups. As data on the Web expands to cover new areas, what can W3C do to make it easier for communities with related goals to discover each other? The current study described in this report should be seen as a precursor to an ongoing dialogue on the many questions raised. Perhaps it is time to consider organising a W3C workshop on how to better address the challenges of developing and supporting Web data standards? This could be co-organised with other organisations with shared goals, e.g. the Open Data Institute, and is likely to involve finding sponsors to cover some of the costs of running the workshop.

Acknowledgements

Grateful acknowledgements are due to the Open Data Institute and Innovate UK for funding this study. I would especially like to thank Leigh Dodds (ODI) for his efforts in coordinating the project and introducing me to others working on different aspects of data standardisation.


Dave Raggett, W3C