W3C study of practices and tooling for Web data standardisation

Dave Raggett dsr@w3.org, W3C Data Activity Lead, December 2017

This study has been made with support from the Open Data Institute and Innovate UK.

Abstract

The Web has had a huge impact on how we exchange and access information. W3C is the leading standards development organisation for Web technologies, and has hosted community development of standards for both the Web of pages used by people and the Web of data used by services. This report covers a study of W3C practices and tooling for Web data standardisation. A lengthy questionnaire was used to solicit input from a wide range of stakeholders. The feedback will be used as a starting point for making W3C a more effective, more welcoming and more sustainable venue for communities seeking to develop Web data standards and exploit them to create value-added services.

The report starts with an introduction to the Web of data and W3C’s standardisation activities. This is followed by a look at the design of the questionnaire and the feedback obtained for each of its sections. After this comes a section on the challenges for measuring the popularity of standards, the need to support the communities that develop and use them, and how to gather feedback that can be used to improve standards and identify gaps where new standards are needed. The report closes with a look at the potential of multidisciplinary approaches including AI, Computational Linguistics and Cognitive Science to transform the process of creating standards, and to evolve the Semantic Web into the Cognitive Web.

Introduction

The Web is the World’s most successful vendor neutral distributed information system, enabling people to access applications and services right across the World from their smart phones, tablets, laptops and other computing devices. The Web is founded on the three pillars of addressing, document formats and network protocols. For the Web of pages as viewed with Web browsers, this involves URLs for addressing resources accessed by the Hypertext Transfer Protocol (HTTP), and document formats such as the Hypertext Markup Language (HTML), Cascading Style Sheets (CSS), image formats like JPEG and PNG, and Web page scripts using JavaScript.

Complementing the Web of pages, there is the Web of data, which ranges from small amounts of data to vast datasets, and which is either open to all or restricted to a few. Data can be consumed by Web pages, downloaded for local processing, or accessed via network APIs that support remote processing. Data is often published without prior coordination with other publishers — let alone with precise modeling or common vocabularies. Standard data exchange formats, models, tools and guidance are needed to facilitate Web-scale data integration and processing.

This report surveys W3C's completed and ongoing work on Web Data standards, and looks to the future with a study of the challenges facing communities that are seeking to exploit the opportunities provided by the Web of Data. A lengthy questionnaire was created to elicit feedback from stakeholders across a broad range of topics, including the kinds of data standards of interest, sustainability and governance, scaling challenges, tooling and practices, liaisons, outreach and community building, and miscellaneous feedback on W3C groups. The analysis of this feedback will help W3C to improve its value proposition for communities seeking to develop and exploit Web Data standards, as part of W3C's mission to bring the Web to its full potential.

An introduction to the Web of Data

This section of the report will look at the different kinds of standards that form the basis for the Web of Data. We will then review the different kinds of standardisation groups at W3C, and the current groups with an interest in the Web of Data.

What kinds of standards, why and for whom?

The principal purpose of standards is to enable interoperability and facilitate the growth in services. Public services including government departments and cities are increasingly making data freely available for interested parties to make use of and add value to. Interoperability depends upon knowing the data formats and the vocabularies used for data items. Some common formats include Comma Separated Values, JSON (JavaScript Object Notation) and XML. To understand individual data items, you need to know their format, e.g. a number or string, and what they represent, e.g. a house number or street name. For values that represent physical measurements, you need to know the units and the scaling factor, as well as what is being measured, e.g. the level of Nitrogen Dioxide pollution at a given street location.
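To make this concrete, here is a minimal sketch, in the spirit of W3C's metadata vocabulary for tabular data (CSV on the Web), of how a data provider might describe one column of a CSV file so that consumers know its format, its meaning, its unit and its scaling factor. The property names, URIs and values below are illustrative assumptions, not taken from any actual dataset or normative specification.

```python
# A hypothetical description of one column in a CSV dataset, loosely in the
# spirit of W3C's "Metadata Vocabulary for Tabular Data" (CSVW). All names
# and URIs below are illustrative only.
no2_column = {
    "name": "no2_level",                    # column header in the CSV file
    "datatype": "decimal",                  # lexical format of each value
    "propertyUrl": "http://example.org/vocab#NO2Concentration",
    "unit": "http://example.org/units#MicrogramsPerCubicMetre",
    "scalingFactor": 1.0,                   # multiplier to apply to raw values
}

def interpret(raw_value: str, column: dict) -> float:
    """Turn a raw CSV cell into a measurement in the declared unit."""
    return float(raw_value) * column["scalingFactor"]

print(interpret("42.5", no2_column))        # 42.5 µg/m³ of NO2 at this location
```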

The development of services is simplified if different data providers use common representations for their data. We therefore need a way for data providers to describe their datasets and a means to reference definitions shared with other data providers. These descriptions may include constraints that can be used to validate the data as a basis for checking for internal consistency. Communities of data providers and consumers have a common interest in defining and using such standards. The kinds of standards will vary considerably. Some are community based, whilst others require international agreements involving a more formal approach for how they are developed and maintained.

W3C aims to support lightweight community based standards that can be incubated within W3C Community Groups, and if appropriate, transferred to Working Groups where a formal standards track process is desired, e.g. for core standards where a greater level of scrutiny is needed.

Resource Description Framework and Linked Data

In his 1989 proposal for the Web, Tim Berners-Lee included a diagram depicting an example of a semantic network based upon named resources with labelled links between them.

1990 Web proposal

This idea was developed into W3C’s Resource Description Framework (RDF), where URLs are used for both resources and link labels. Each link (also known as a triple) thus consists of URLs for the subject, predicate and object, respectively. A URL acts both as a name and as a means to get further information, by dereferencing it via an HTTP GET request. Over the years, W3C has developed a suite of standards around RDF.

Some standards relate to the use of RDF to define models, e.g. RDF Core, RDF Schema and OWL. Others define data exchange formats for RDF, e.g. RDF/XML, N-Triples, Turtle, TriG, and JSON-LD. The Linked Data Platform (LDP) defines how to use HTTP for reading and writing triples. SPARQL is a query and update language for RDF, analogous to SQL for relational data. SHACL provides a means to express validity constraints on a set of triples.
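A minimal sketch of these pieces working together, using the Python rdflib library (assumed to be installed; JSON-LD output needs rdflib 6.0 or later). All example.org URIs are hypothetical.

```python
# Parse a tiny Turtle document, iterate over its triples, query it with
# SPARQL, and re-serialise it as JSON-LD. All example.org URIs are invented.
from rdflib import Graph

turtle_data = """
@prefix ex: <http://example.org/> .
ex:sensor42 ex:measures  ex:NitrogenDioxide ;
            ex:locatedAt ex:HighStreet .
"""

g = Graph()
g.parse(data=turtle_data, format="turtle")

# Each triple is a (subject, predicate, object) tuple of URIs.
for s, p, o in g:
    print(s, p, o)

# SPARQL plays the role that SQL plays for relational data.
results = g.query("""
    PREFIX ex: <http://example.org/>
    SELECT ?place WHERE { ex:sensor42 ex:locatedAt ?place }
""")
for row in results:
    print(row.place)

# The same graph can be exchanged in other RDF serialisations, e.g. JSON-LD.
print(g.serialize(format="json-ld"))
```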

W3C standardisation groups

W3C hosts different kinds of standardisation groups according to the level of maturity of the work in question.

Previous Groups

A number of relevant Working and Interest Groups are now closed.

For more details, see: closed groups

Current Working Groups

For more details, see: current Working Groups

Current Interest Groups

For more details, see: current Interest Groups

Current Community Groups

There are many current W3C Community Groups with an interest in Web Data standards.

These groups vary considerably in how active they are and the kinds of opportunities they are addressing. Many groups make heavy use of GitHub for collaborative development of documents, e.g. use cases and requirements, specifications and test suites, primers and other introductory materials.

For more details, see: W3C Community Groups

What’s Driving Work on Web Data Standardisation?

The Web of Data has been growing steadily. One measure of this is the Linked Open Data Cloud diagram. Here is the May 2007 version:

2007 Linked Open Data Cloud

The 2017 version, shown below, indicates the rapid growth in open data over the last 10 years. It was created by Andrejs Abele, John P. McCrae, Paul Buitelaar, Anja Jentzsch and Richard Cyganiak. See http://lod-cloud.net/

2017 Linked Open Data Cloud

The diagrams above show datasets that have been published as Linked Data using HTTP in a variety of RDF data formats. Web Data is also available in other formats, e.g. JSON, Comma Separated Values (CSV), and embedded in PDF. The ability to integrate different data sources is dependent on standards for both the data formats and the data models along with the means to relate terms in different datasets.

The emergence of the Internet of Things is resulting in an increasing amount of data from a wide variety of sensors. Much of this is using incompatible platforms and standards, resulting in data silos. Over time the demand for services that combine different data sources will help to drive demand for open standards. This will in turn facilitate open markets of data and services and this will further drive demand for data. Another source of vast amounts of data is the scientific community combined with interest in virtual research environments.

It is easy to coordinate and work on a shared vocabulary if there is a small, well-knit community. However, this becomes very much harder as the community size grows, and when there are uncoupled or only weakly coupled communities. It is therefore inevitable that different communities will develop rival vocabularies, and that these will address different or perhaps overlapping requirements due to differences in context. Some use cases may call for a greater level of detail than others, which can make a vocabulary cumbersome for simpler use cases. When integrating data across such vocabularies, it becomes challenging to relate terms from the different vocabularies. One example of this is where units of measure are needed for sensor readings: the abbreviations are not universal and may have different meanings in different fields.

Another challenge relates to dealing with evolving APIs. In some cases, this may just be a matter of ignoring named arguments that a given software client doesn’t know about. In other cases, there may be a need to negotiate over which version of the API is used so that a server is able to support both current and legacy clients.
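As a sketch of one common negotiation pattern (not a standardised mechanism), a client can state the API version it understands in a media-type parameter and then inspect which version the server actually served. The endpoint URL, the media type and the "version" parameter below are all hypothetical.

```python
# Hypothetical version negotiation over HTTP using the requests library.
import requests

def process_v2(data):
    print("v2 payload:", data)

def process_v1(data):
    # Legacy handling: ignore any named fields this client doesn't know about.
    print("v1 payload:", data)

response = requests.get(
    "https://api.example.org/observations",             # hypothetical endpoint
    headers={"Accept": "application/json; version=2"},  # version the client wants
)
served = response.headers.get("Content-Type", "")
(process_v2 if "version=2" in served else process_v1)(response.json())
```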

Questionnaire on W3C practices and tooling for Web Data standardisation

The questionnaire was designed to address a broad range of questions, and this resulted in a long form that took respondents considerable time to fill out. I am extremely grateful for the time they devoted to this. Where practical, multiple-choice questions were used to facilitate the generation of graphical presentations. However, the breadth of stakeholders made it impractical to cover everyone’s specific choices, so the questionnaire made extensive use of free-form text fields for open-ended questions. This report provides a summary of the points covered in the free-form text fields, along with a preliminary analysis. This questionnaire is just the first stage, and the idea is to follow up with a broader discussion of the choices available to make W3C a better venue for communities to work on Web data standards.

The questionnaire was not limited to W3C Member organisations or people involved in W3C Community Groups. The questionnaire was publicised via a W3C blog post, and emails to all of the relevant W3C groups. People were encouraged to spread the word further using their social connections.

About You

The questionnaire starts with a section titled “About you” which asks for the respondent’s name, email address, organisation, organisation’s website, primary location (country), and the organisation’s interest in data standards. The name and email address were requested so that the respondent could be contacted with any follow-up questions regarding their input.

Here is a chart of the countries provided by respondents. The question used a free-text field, and this resulted in people using different names for the same country, e.g. US, USA, United States. The data thus required some post-processing (a sketch of which follows the chart).

respondent countries
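The post-processing amounts to mapping variant spellings onto one canonical country name before counting. The sketch below illustrates the idea; the alias table is invented for illustration and is not the one actually used.

```python
# Normalise free-text country names, then count responses per country.
from collections import Counter

ALIASES = {
    "us": "United States", "usa": "United States",
    "united states": "United States",
    "uk": "United Kingdom", "great britain": "United Kingdom",
}

def normalise(country: str) -> str:
    key = country.strip().lower()
    return ALIASES.get(key, country.strip().title())

responses = ["US", "USA", "United States", "UK", "France"]
print(Counter(normalise(c) for c in responses))
# Counter({'United States': 3, 'United Kingdom': 1, 'France': 1})
```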

The following organisations contributed to the questionnaire results:

Acando AS
Adobe Systems
Agency for Digitisation
Alexandria Consulting LLC
Azavea
BT
callas software GmbH
Camara dos Deputados
Centre for eResearch and Digital Innovation - Federation University Australia
Collective[i]
ContentMine
CPE Lyon
Cray Inc
Data Unity
Deutsche Nationalbibliothek
DHS
Dinador Ltd.
Dow Chemical
DSS Ltd.
Dublin Core Metadata Initiative (DCMI)
EC
ePanstwo Foundation
Ephox Corporation
Extremadura University
Geonovum
German Medicines Manufacturers Association
hbz
HES-SO
High Latitudes
Hokukahu LLC
IMATI-CNR
INRA
INRIA
interactive instruments
ISCAP-IPP, Portugal
KOOP
Lawrence Berkeley National Laboratory
Legal Up
Linked Data Factory
Met Office
Metalinkage
Mikros Image
Natural Resources Canada, Government of Canada
Networked Planet
NIST
Nuance
Open Data Institute
Ordnance Survey
OWASP
PayEx
Porism Limited
ReportLab Europe Ltd
Schneider Electric
StratML Committee
The ODI Australian Network
Thought Transfer Research
Trinity College Dublin
Ubiquity Press
Universidad Politécnica de Madrid
University of Glasgow
University of Kent
University of Minho
University of North Florida
University of Queensland
University of Southampton
Web3D Consortium
Wolters Kluwer
Yodata

Several people responded independently on their own behalf.

What kind of data standards

This section of the questionnaire gathers information about the interest in application sectors, approaches to data access, approaches for discovery of data and services, stability vs agility of standards, the importance of standards for data formats, data vocabularies, data models, terms & conditions (licenses), privacy policies, payments, versioning, longevity of standards, the role of W3C for registering namespaces, and internationalisation.

importance of application sectors

importance of stability

importance of agility

importance of standards for data formats

importance of standards for data vocabularies

importance of standards for data models

importance of standards for terms and conditions

importance of standards for privacy policies

importance of standards for payments for access to data

importance of standards for versioning

interest in using W3C domain for vocabularies

Sustainability and governance

This section of the questionnaire considers how to fund and oversee the social and physical infrastructure needed to support standardisation.

Scaling challenges

This section invites respondents to comment on and provide suggestions for how to address scaling challenges for developing and maintaining data standards.

Tooling and practices

This section of the questionnaire gathers feedback that will help W3C review the tools available to standardisation groups and the associated practices. As an example, W3C groups have made increasing use of GitHub for collaborative specification development, despite GitHub being originally designed for software development teams.

Liaisons, outreach and community building

This section of the questionnaire sought input on the work done on reaching out beyond the standardisation group as a basis for successful standards.

Miscellaneous feedback on W3C groups

In this final section of the questionnaire, respondents were asked to describe which W3C groups they are involved in, what is working well, what problems they’ve seen, and their suggestions for improvements.

Tracking adoption and interest in Web Data standards

How successful are standards? To answer this question, we need a way to measure the level of interest in particular standards. At the time this report was written, W3C had done surprisingly little to measure interest in its standards.

One approach that could be implemented with modest resources would be to exploit the W3C website server logs, looking at the requests for W3C technical reports and other documents from Working Groups, Interest Groups, Community Groups and Business Groups. W3C’s privacy policy states that W3C does not perform behavioural tracking of users. Client IP addresses, and the HTTP Referer and User-Agent fields, are logged to allow traffic to be analysed. The collected data is only used for server administration, site improvement, usage statistics, and Web protocol research.

The Webmaster helped with the analysis of the server logs for a selection of URLs corresponding to W3C technical reports relevant to Web data standardisation. In keeping with the privacy policy, IP addresses are only kept for a relatively short period of time - a quarter of a year. This allows us to look at the popularity of different technical reports, and to see which countries the requests came from.
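A sketch of the kind of per-report counting described above, assuming logs in the common Apache format. The log file name, the regular expression and the set of report short names are illustrative, not the actual analysis scripts used.

```python
# Count requests per technical report from a web server access log.
import re
from collections import Counter

REPORTS = {"vocab-ssn", "owl-time", "shacl", "json-ld", "turtle", "dwbp"}
pattern = re.compile(r'"GET /TR/([^/" ]+)/?[^"]*"')

counts = Counter()
with open("access.log") as log:             # hypothetical log file
    for line in log:
        m = pattern.search(line)
        if m and m.group(1) in REPORTS:
            counts[m.group(1)] += 1

for report, n in counts.most_common():
    print(report, n)
```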

The following figure shows the number of times each report was requested in the period covered (August - November 2017).

Technical Report Request Count for one quarter

One observation is that the huge popularity of the Semantic Sensor Network ontology (vocab-ssn) and the Time Ontology in OWL (owl-time) is due to both having recently become W3C Recommendations. The Shapes Constraint Language (shacl) became a W3C Recommendation three months earlier, and has a similar download count to many other Linked Data technical reports. This suggests that reports are initially very popular, but that this popularity rapidly decays to a background level. There are exceptions, e.g. the Data on the Web Best Practices (dwbp), which shows persisting popularity. To track popularity patterns, W3C would need to regularly record the request counts for each technical report, e.g. on a monthly basis.

Another observation is that JSON-LD is more popular than other Linked Data serialisation formats, and is followed by Turtle. Other Linked Data formats such as n-triples and n-quads are much less popular. JSON-LD defines a way to use the JavaScript Object Notation (JSON) to represent Linked Data. Its relative popularity points to the huge popularity of JSON amongst web developers, superseding the previous high levels of interest in XML.

The GeoLite2 dataset was used to derive the country from the client IP address as a basis for assessing which countries were most interested in Web data standardisation (a sketch of this lookup follows the chart). The results show a very long tail of countries with small download counts. Take for example the Semantic Sensor Network ontology. This is most popular in the USA (88932 downloads), followed by China (65703), UK (38430), Netherlands (25625), France (25000), Germany (24541), fading to a single download each for South Sudan and the Central African Republic. Here is the data as a pie chart; note that there were downloads from 214 countries, not all of which are listed due to lack of room. The mapping isn’t perfect: “IPAddressnotfound” and “Republicof” are cases where the algorithm failed to work correctly.

pie chart for downloads by country
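A sketch of the country lookup, assuming the GeoLite2 country database and the geoip2 Python package are installed. The database file name and IP addresses are illustrative; the error case hints at how labels like “IPAddressnotfound” can arise.

```python
# Derive a country per client IP address and tally downloads per country.
from collections import Counter
import geoip2.database
import geoip2.errors

reader = geoip2.database.Reader("GeoLite2-Country.mmdb")  # hypothetical path

def country_of(ip: str) -> str:
    try:
        return reader.country(ip).country.name or "Unknown"
    except geoip2.errors.AddressNotFoundError:
        return "IP address not found"   # one source of the odd labels above

client_ips = ["128.30.52.100", "203.0.113.7"]   # stand-ins for logged IPs
print(Counter(country_of(ip) for ip in client_ips))
```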

Further work is needed to figure out a sustainable solution for collecting such statistics across the W3C site on a long term basis and presenting the results in a way that can inform decisions on how W3C invests its limited resources.

The level of interest in Web data standards could be tracked in other ways, for example, citations from websites and research publications, and by providing a registration form for users. To make it worthwhile for users to register, this could be tied to a community based support service for W3C specifications. Community maintained support services are increasingly popular with companies as a way to provide good quality support at a lower cost. SMEs and independent consultants can benefit as their reputation as contributors boosts their business opportunities. Further investigation is needed on the detailed requirements and investment needed to kick start this approach.

Dealing with the challenges of heterogeneity

As more and more people want to provide or consume data on the Web, this will increase the demand for open standards for data vocabularies. People will for the most part be interested in using existing vocabularies where appropriate. The challenge is then how to discover and assess such vocabularies, especially when they have been developed by isolated communities. Adopting an existing vocabulary has its risks - the vocabulary could have been designed for different requirements and be overly cumbersome in a different context or fail to adequately cover the chosen use cases.

A related challenge is that people from a like-minded background tend to think in similar ways, and tend not to make their shared assumptions explicit, instead going directly into the details of the solution they envisage. This makes it hard for people from different communities to evaluate a given vocabulary to see if it is a good fit.

The Web is world-wide, but people may be separated by living in different countries, speaking different languages, or working in different industries. With uncoupled or weakly coupled communities, and only partially overlapping requirements, we can expect the emergence of vocabularies that play similar roles, but which aren’t directly compatible. This creates challenges for services that need to integrate data from multiple such vocabularies.

In the simplest case, a term in one vocabulary can be declared as the same as a term in a different vocabulary. More generally, a term in one vocabulary might be declared as equivalent to a graph in another vocabulary. For instance, a single term might be used to indicate a combination of a unit of measure and a scaling factor, e.g. milliamps for electrical current. A second vocabulary could express these separately.
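The simplest case can already be expressed with standard mapping terms such as owl:sameAs, owl:equivalentProperty and skos:exactMatch, as in this minimal sketch. The two vocabularies are hypothetical, and the rdflib Python library is assumed to be available.

```python
# Declare term-level alignments between two hypothetical vocabularies using
# standard OWL and SKOS mapping properties.
from rdflib import Graph

alignment = """
@prefix owl:  <http://www.w3.org/2002/07/owl#> .
@prefix skos: <http://www.w3.org/2004/02/skos/core#> .
@prefix a:    <http://vocab-a.example.org/> .
@prefix b:    <http://vocab-b.example.org/> .

a:current owl:equivalentProperty b:electricalCurrent .
a:mA      skos:exactMatch        b:milliampere .
"""

g = Graph()
g.parse(data=alignment, format="turtle")
print(len(g), "mapping triples loaded")
```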

More generally still, terms may be relatable only in specific contexts. This can be compared to human languages, e.g. the words used for waterways such as rivers, streams, brooks, etc., where the taxonomies of words in different languages don’t correspond directly. For instance, to pick the right word, you may need to know whether the river in question flows into the sea or merges into another river.

This suggests the need for a way to describe how to transform Linked Data graphs to replace one vocabulary with another, potentially with some form of defaults when required. This is something where experimentation is needed, and should lead to open standards for Linked Data transformation languages. Is this something that W3C should be driving, and if so, how?
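One existing building block for such experiments is SPARQL CONSTRUCT, which rewrites triples from one vocabulary into another. The minimal sketch below splits a hypothetical combined “milliamps” term into a separate value, unit and scaling factor; all vocabularies here are invented for illustration, and rdflib is assumed.

```python
# Transform a graph from vocabulary A into vocabulary B with SPARQL CONSTRUCT.
from rdflib import Graph

g = Graph()
g.parse(data="""
@prefix a: <http://vocab-a.example.org/> .
a:reading1 a:milliamps 120 .
""", format="turtle")

transformed = g.query("""
PREFIX a: <http://vocab-a.example.org/>
PREFIX b: <http://vocab-b.example.org/>
CONSTRUCT {
    ?s b:value ?v ;
       b:unit  b:ampere ;
       b:scale 0.001 .      # milli -> base unit
}
WHERE { ?s a:milliamps ?v }
""").graph

print(transformed.serialize(format="turtle"))
```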

Another approach involves so called “upper ontologies”. These define domain concepts in terms of underlying general concepts that are applicable across domains. It can be challenging to understand how to relate domain specific concepts to these very general concepts, and likewise to implement software that can take advantage of these definitions.

The promise of AI and the Cognitive Web

The difficulty of manually creating complex ontologies can in principle be avoided through the use of machine learning algorithms that are applied to a training corpus. One approach for this makes use of a synthesis of cognitive science, AI, computational linguistics and sociology, building upon progress in each of these fields, enabling conversational cognitive agents that can be trained and assessed using lessons expressed in natural language.

This necessitates a means to translate natural language into semantic graphs, and back again for natural language generation. Cognitive architectures like John R. Anderson's pioneering work on ACT-R have proven themselves in terms of replicating common characteristics of human memory and learning. This points to opportunities for extending Linked Data with persistent link strengths and exponentially decaying node activation levels. Procedural knowledge can be expressed using production rules, and trained using reinforcement learning algorithms.
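As a concrete illustration, ACT-R’s published base-level learning equation models an item’s activation as growing with each use and decaying with time since use (a power-law decay, a close cousin of the decaying activation levels suggested above). The sketch below is a toy rendering of that formula, not an ACT-R implementation.

```python
# ACT-R base-level activation: B = ln( sum over past uses of (now - t)^-d ),
# with decay d typically around 0.5 (Anderson's ACT-R literature).
import math

def base_level_activation(use_times, now, decay=0.5):
    """Activation of a memory item given the times it was previously used."""
    return math.log(sum((now - t) ** -decay for t in use_times))

# A node used at t=1s and t=60s is more active at t=61s than one used
# only once at t=1s.
print(base_level_activation([1, 60], now=61))   # recently reinforced
print(base_level_activation([1], now=61))       # single old use
```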

Cognitive agents will require support for episodic memory and counterfactual reasoning (i.e. knowledge about what/when and what/if), both for learning from narratives and as a means to support a level of self-awareness as a basis for monitoring progress and deciding when to switch to different ways of thinking, the importance of which has been emphasised by Marvin Minsky.

Linked Data uses explicit concepts with nodes connected by labelled arcs. This makes it easier to provide explanations as compared to approaches based upon artificial neural networks and deep learning. However, Linked Data can also be represented using vector spaces and tensor expressions for implementations based upon neural networks. Much remains to be done on exploring how to apply vector spaces to rich graph representations and procedural rule sets, and there is considerable potential for addressing the statistical basis for reasoning in terms of what has been found useful in past experience, as compared with the emphasis on logical inference and completeness found in conventional approaches to ontologies. This is also relevant to mimicking the human ability to track changes in the meaning of words based upon their patterns of usage.
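One well-known technique for representing Linked Data in a vector space is TransE-style embedding (Bordes et al., 2013), in which a triple is scored by how close head-plus-relation lies to tail. The sketch below uses toy, untrained vectors purely to show the scoring function; a real system would learn the embeddings from a training corpus.

```python
# TransE-style plausibility scoring for triples over toy random embeddings.
import numpy as np

rng = np.random.default_rng(0)
dim = 8
embed = {name: rng.normal(size=dim) for name in
         ["sensor42", "locatedAt", "HighStreet", "NitrogenDioxide"]}

def score(h, r, t):
    """Higher (less negative) means the triple fits the embedding better."""
    return -np.linalg.norm(embed[h] + embed[r] - embed[t])

print(score("sensor42", "locatedAt", "HighStreet"))
print(score("sensor42", "locatedAt", "NitrogenDioxide"))
```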

In the long term, this can be expected to change the nature of standardisation from a direct consideration of linked data vocabularies to the curation of a corpus of training materials as based upon an agreed set of use cases. At its simplest, this involves examples and counter examples for data fields, as input to a machine learning algorithm. Natural language descriptions could be used to relate data fields to what they represent, e.g. the address of a house or flat. Such descriptions can also be used to define taxonomies of terms including generalisations and exceptions. There has been plenty of work on extracting named entities from text, but so far much less on understanding narratives as would be needed for natural language descriptions of use cases.

Conclusions

The opportunities for data on the Web are huge, both for publicly shared open data, and for data exchanged business to business. This potential is critically dependent on standards to enable interoperability, to reduce the effort and risk involved, and to unlock the network effect. This study of W3C practices and tooling for Web data standardisation has gathered feedback from a wide range of stakeholders on many different aspects of standardisation.

The rise of the Internet of Things will accelerate the need for work on standards for vocabularies that describe devices, services and the context in which they are situated. The same applies to the rise of open data published by governments and other organisations, including the availability of scientific data for virtual research environments.

There are many challenges to be overcome, as outlined in the preceding sections.

There is much to improve on the current status. The W3C home page for the Web of Data needs revamping and bringing alive with regular news posts and links to useful resources. The Web of Data needs greater visibility within the W3C Team and Membership, and with the public at large. Whilst the W3C Community Groups programme has been very successful, with a large number of groups, there is a lack of guidance for communities interested in developing standards. For W3C to step up to the challenge of the huge potential demand for community standards, new approaches will be needed to sustain the level of resources required.

Web developers often express negative sentiments about the Semantic Web, and this can in part be attributed to a “them and us” attitude with respect to people working on Linked Data and the Semantic Web. It is not helped by the perceived complexity often associated with OWL ontologies and the esoteric focus of much of the published work. This gulf needs to be bridged by a greater focus on simpler approaches that are a good fit for the use cases of interest to Web developers. A community-supported forum aimed at Web developers, for exchanging information on use cases and accounts of how they were solved in a simple way, would be a big help.

This study will be used as a starting point for further discussion on how to improve the services that W3C Data Activity offers for communities interested in developing Web data standards.

The role of the W3C in supporting development of specific standards

What kinds of new standards are needed for accelerating the adoption of data on the Web? This includes metadata standards, e.g. relating to privacy, terms & conditions, machine interpretable licenses, and payments. To assist with discovery, there is a need for websites to be able to describe data services in a standard way that facilitates indexing by search engines. This could be done in collaboration with schema.org. W3C is already working on updating the Data Catalog Vocabulary as a basis for describing datasets (see Dataset Exchange WG), but there is a gap when it comes to discovery of network APIs. There are existing solutions for describing RESTful APIs, but these focus on the data types rather than the semantics. The work on thing descriptions in the Web of Things Working Group seems relevant, along with fresh ideas for mapping JSON to Linked Data. Is there a need for a meta vocabulary to facilitate discovery of vocabularies? Another area ripe for investigation is the potential for a new standard for a rule language for context based mappings between Linked Data vocabularies with partially overlapping semantics.

The role of W3C as an incubator for standards

What should W3C be doing to better support Working Groups and Community Groups? This could include better guidance on how to run effective Community Groups, and advice on the different kinds of standards, how to incubate them, and how to progress them along the standards track. What could W3C do to give Community Groups greater control over their home pages? What is needed to support training and outreach as part of the process of building momentum around new standards at various stages in their lifecycle? What changes to how groups are formed would provide the resources needed to provide better tooling? Is there a role for community-maintained support services as part of this? This could include tools for facilitating the sharing of advice and experience across different community groups. As data on the Web expands to cover new areas, what can W3C do to make it easier for communities with related goals to discover each other? The current study described in this report should be seen as a precursor to an ongoing dialogue on the many questions raised. Perhaps it is time to consider organising a W3C workshop on how to better address the challenges of developing and supporting Web data standards? This could be co-organised with other organisations with shared goals, e.g. the Open Data Institute, and is likely to involve finding sponsors to cover some of the costs of running the workshop.

Acknowledgements

Grateful acknowledgements are due to the Open Data Institute and Innovate UK for funding this study. I would especially like to thank Leigh Dodds (ODI) for his efforts in coordinating the project and introducing me to others working on different aspects of data standardisation.


Dave Raggett, W3C