W3C Workshop on Web Standardization for Graph Data

Creating Bridges: RDF, Property Graph and SQL

Monday 4th March to Wednesday 6th March 2019, Berlin, Germany

Introduction

Data is increasingly important for all organisations, especially with the rise of IoT and Big Data. The falling costs of storage and processing are driving interest in extracting competitive value from ever larger amounts of data through analytics and data-hungry AI algorithms. In addition, organisations are seeking to exploit opportunities for sharing data within emerging digital ecosystems. W3C has an extensive suite of standards relating to data, developed over two decades of experience. These include core standards for RDF, the Semantic Web and Linked Data.

A W3C Workshop is now planned for early 2019 on emerging standardisation opportunities, for example: query languages for graph databases; improvements for handling link annotations (i.e. embracing property graphs); support for enterprise-wide knowledge graphs; forms of reasoning suited to incomplete, uncertain and inconsistent knowledge, AI and Machine Learning; approaches for transforming data between vocabularies with overlapping semantics; signed graphs; and what's next for remote access to data and information services. In addition, W3C hosts many Community Groups working on data standards, and we are interested in what is needed to better support work on vocabulary standards.

See this Workshop's Call for Participation. Further background is given below.

Graph Databases and Link Annotations

Businesses have relied on relational databases (RDBMS) for many years, using SQL for query and update. More recently we have seen the rise of NoSQL databases, which address the need for flexible handling of unstructured data with key-value stores, document stores and graph databases. One example is CouchDB, which uses JSON for data storage with ready support for replication for speedy access at different sites. NoSQL is a good fit when you need the agility to deal with ever-changing data models.

Most NoSQL databases store disconnected sets of data. This is a drawback when you want to deal with connected data and graphs, which is where graph databases come into their own: they enable links across data, forming fine-grained networks of information. Graph databases may also support robust transactions with ACID guarantees in the event of errors, power failures and so on, as found in RDBMS. Examples of graph databases include Apache TinkerPop, AllegroGraph, Amazon Neptune and the Neo4j Graph Platform; see Wikipedia for a longer list.

W3C's RDF uses URIs (Web addresses) for nodes and link labels in directed graphs. This has the advantage that URIs can be dereferenced to obtain further information, making for a Web of linked data. In particular, nodes can be dereferenced to graphs on remote databases. URIs further provide globally unique identifiers for vocabulary terms, allowing people to be sure of the terms they are using, and enabling data joins across different data sources. This forms the basis for W3C's Web ontology language OWL and has encouraged massive growth in Linked Open Datasets, see, e.g., John P. McCrae's linked open data cloud diagram for 2018.
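The way shared URIs enable joins across independent data sources can be sketched in a few lines of plain Python, with triples as tuples. The FOAF terms are a real vocabulary; the example.org URIs and data are hypothetical.

```python
# Sketch: RDF triples as (subject, predicate, object) tuples, where URIs
# act as globally unique identifiers across data sources.
FOAF_KNOWS = "http://xmlns.com/foaf/0.1/knows"
FOAF_NAME = "http://xmlns.com/foaf/0.1/name"

# Two independent data sources referring to the same person by the same URI.
source_a = [
    ("http://example.org/alice", FOAF_KNOWS, "http://example.org/bob"),
]
source_b = [
    ("http://example.org/bob", FOAF_NAME, "Bob"),
]

# Merging RDF graphs is just set union; shared URIs make the join automatic.
merged = set(source_a) | set(source_b)

# Find the names of everyone Alice knows, spanning both sources.
known = {o for s, p, o in merged if p == FOAF_KNOWS}
names = [o for s, p, o in merged if p == FOAF_NAME and s in known]
print(names)  # ['Bob']
```

In a real system the URIs would additionally be dereferenceable, so a query engine could fetch further triples about `http://example.org/bob` on demand.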

Property graphs allow properties (key/value pairs) to be associated with both nodes and links in directed graphs. This allows you to annotate links with information such as the start and stop times for when the link is valid, its provenance, a statement about its quality and so forth. RDF allows for such annotations using the RDF reification vocabulary, but this is awkward. Perhaps it is time to consider updating the RDF core to allow links to be used as the subject or object of other links, along with corresponding extensions to RDF serialisation formats and query languages (e.g. SPARQL), and an examination of the correspondence between the RDF and property graph data models and their query languages. One such proposal is RDF* and SPARQL*. Extending RDF in this way could also help with the lack of portability of graph data across property graph databases. Another fertile area is graph schema, where there is continuing work on advancing SHACL; see e.g. SHACL Advanced Features.
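The contrast between classic reification and the RDF* idea can be illustrated with a small sketch. The `ex:` terms below are hypothetical and namespaces are abbreviated for readability; the point is the shape of the data, not any particular syntax.

```python
# Sketch contrasting standard RDF reification (four extra triples) with
# the RDF* idea of using a quoted triple directly as a subject.
RDF = "rdf:"

statement = ("ex:alice", "ex:worksFor", "ex:acme")

# Classic reification: a fresh node describes the statement, and the
# annotation (here, a start date) hangs off that node.
reified = [
    ("_:s1", RDF + "type",      RDF + "Statement"),
    ("_:s1", RDF + "subject",   statement[0]),
    ("_:s1", RDF + "predicate", statement[1]),
    ("_:s1", RDF + "object",    statement[2]),
    ("_:s1", "ex:since",        "2017-01-01"),
]

# RDF*-style: the triple itself is the subject of the annotation,
# mirroring a property-graph edge carrying a key/value pair.
rdf_star = [
    (statement, "ex:since", "2017-01-01"),
]

print(len(reified), len(rdf_star))  # 5 vs 1 triples for the same claim
```

The annotation count is the practical difference: reification multiplies the triples to store and query, while the RDF*-style form keeps the link and its annotation directly connected, as in a property graph.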

There is a very noticeable increase in the use of graph databases, time-series databases and allied technologies, see e.g. DB-Engines' recent rankings. This is reflected in new products and services, in highly active research and in new technical approaches. It is also reflected in a widely felt need for stronger standardization in the area of graph data management.

It would be good to achieve a productive and interoperable boundary between the adjacent worlds of RDF and Property Graphs. At a minimum, information exchange and mutual awareness should help standards work in these two related worlds. Their relationship to the predominant SQL standard is equally important. This workshop will therefore seek to include people involved in ISO discussions on initiating work on a standard Graph Query Language (GQL) for property graphs. This is intended to complement and extend current work on SQL Property Graph Querying (SQL PGQ) for SQL:2020.

Starting from industrial languages like openCypher and PGQL, research and industry partners in the Linked Data Benchmark Council have published on future directions for property graph querying, including query composition and multiple named graphs, path elements, and regular path queries. There is also very strong interest in the property graph data community in graph schema/constraints, an area that has not been addressed systematically in any of the existing languages, but which is beginning to be worked on in SQL PGQ.
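One of the features mentioned above, regular path queries, can be sketched without any query engine: an RPQ matches paths whose edge labels satisfy a regular expression. The sketch below evaluates the hypothetical pattern `knows+` (one or more `knows` edges) over a small edge list via breadth-first search; the nodes and labels are made up for illustration.

```python
from collections import deque

# Toy edge list in property-graph style: (source, label, target).
edges = [("alice", "knows", "bob"),
         ("bob", "knows", "carol"),
         ("carol", "worksWith", "dave")]

def knows_plus(start):
    # BFS following only 'knows' edges == evaluating the RPQ  knows+
    reachable, frontier = set(), deque([start])
    while frontier:
        node = frontier.popleft()
        for s, label, o in edges:
            if s == node and label == "knows" and o not in reachable:
                reachable.add(o)
                frontier.append(o)
    return reachable

print(sorted(knows_plus("alice")))  # ['bob', 'carol']
```

Note that `dave` is not matched: the final edge has a different label, which is exactly the kind of distinction a standardised path expression syntax would need to pin down.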

Similarly, it would be worth examining the relationship of RDF to time-series databases and geospatial databases, given their importance for IoT data. For instance, what are the implications for query languages when graph data incorporates indexing for temporal and spatial annotations?

Enterprise-wide knowledge graphs

Given the increasingly strategic importance of data, and the challenges of integrating large numbers of data silos, almost all enterprises are likely to need to build large internal knowledge graphs. These are actively maintained models of collections of entities, their semantic types, their properties and interrelationships. Enterprise-wide knowledge graphs provide a powerful tool for organisations with respect to data management and data governance. Whilst the data and processes are likely to be distributed across an enterprise, a uniform framework for data governance will make it much easier for organisations to extract value from their data. In principle, RDF and Linked Data provide a generic foundation for enterprise-wide knowledge graphs. There are many potential discussion areas, including federated control and storage of data, and resilience in the face of faults and cyber attacks.

Beyond deduction: incomplete, uncertain and inconsistent knowledge, AI and Machine Learning – the emergence of the "Sentient Web"

Real-world data is often incomplete and uncertain, and may even include inconsistencies and errors. Sound deductive reasoning alone is no longer adequate, and we need to consider other forms of reasoning that are rational given the statistics of prior knowledge and past experience: inductive reasoning over commonalities across a set of examples; abductive reasoning, seeking explanations for a set of observations; analogical reasoning, based upon structural similarities with other problems; spatial reasoning, e.g. road navigation; temporal reasoning over intervals and points in time; causal reasoning, e.g. on plans for achieving certain goals; social reasoning, e.g. on social relationships and status within a group; and emotional reasoning, e.g. on how someone's feelings are likely to be affected by certain actions.

Reasoning from experience requires that graph databases embody a computational treatment of statistics. Here we can turn to cognitive science and psychology for accounts of how humans manage this: in particular, decay, where memories become harder to retrieve over time unless reinforced, and interference, where more useful memories tend to hide less useful ones. In addition, it is common to distinguish two styles of processing: intuition, which is fast and autonomous, using heuristic short-cuts (efficient algorithms operating in parallel over graph data); and reasoning, which is slow and considered (sequential rule execution on graph data), and which can be divided into analytic reasoning and reflective reasoning (reasoning about reasoning). Mimicking human memory and processing would allow computers to better handle massive amounts of real-world data. What requirements does this imply for standards for graph data and associated rule languages?
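One well-studied account of decay from cognitive science is the base-level learning equation used in the ACT-R cognitive architecture, where a memory's activation is the log of a power-law-decayed sum over its past uses. The sketch below illustrates that idea only; it is not part of any graph data standard, and the decay rate of 0.5 is simply the conventional ACT-R default.

```python
import math

# Base-level activation, loosely following ACT-R's base-level learning:
#   activation = ln( sum over past uses j of (now - t_j) ** -d )
# where t_j is the time of each past use and d is the decay rate.
def activation(use_times, now, d=0.5):
    return math.log(sum((now - t) ** -d for t in use_times))

# A fact reinforced recently and often is easier to retrieve...
recent = activation(use_times=[90, 95, 99], now=100)
# ...than one last touched long ago.
stale = activation(use_times=[1, 2, 3], now=100)
print(recent > stale)  # True
```

A graph database embodying such a scheme could use activation scores to rank retrieval candidates, with interference arising naturally when high-activation facts crowd out low-activation ones.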

Artificial Intelligence and Machine Learning (AI/ML) depend on the availability of data. In some cases, this would benefit from standards that allow data to be pooled from different sources. Standards for metadata would help with measures to counter bias, a widely acknowledged challenge for deep learning. What other related opportunities are there? Deep learning has been very successful, but requires very large amounts of training data, and requires that the statistics of the data it is applied to reflect those of the training data. Recent work has identified the opportunity for continuous learning with neural processing over real-time data in combination with symbolic graph data and rules, e.g. for scene understanding in computer vision, involving induction and abduction over video frames. What are the implications for graph data standards?

Note: further W3C Workshops on AI are under consideration.

Scalability of vocabularies across communities

Whilst it is desirable to use the same vocabularies, it is inevitable that different communities, with differing needs, will come up with vocabularies that vary in their semantics and structure. This poses challenges for creating services that span different communities and increases friction for open markets of services. Is it timely to work on context-sensitive languages for mapping data between vocabularies with overlapping semantics? What is needed to encourage re-use of existing vocabularies where appropriate? This includes provision for different communities to share their experiences, use cases and requirements, along with recommendations on best practices. How can this be made sustainable given the running costs for the social and physical infrastructure?

APIs and digital signatures

What's next for remote access to data and information services? Web developers have become used to REST based APIs for accessing and manipulating textual representations of data resources, using a predefined set of stateless operations. This may incur multiple round trips and under/over fetching of data. Newer approaches seek to overcome these weaknesses. JSON API offers a flexible means to express APIs for data, whilst GraphQL provides a means to use JSON to express data queries for multiple resources in the same request. JSON Schema can be used to annotate and validate data expressed in JSON. W3C's Web of Things seeks to simplify access to things such as sensors, actuators and related information services. Things are exposed to applications as software objects with properties, actions and events, hiding the details of the underlying protocols. The kinds of things, their capabilities and the context in which they are situated are expressed as Linked Data.

An open question is whether there are sufficiently shared requirements to justify work on standardising a JavaScript API for Linked Data, with the benefits of reduced effort for deploying applications and easier re-use of code, compared with having to cope with heterogeneous sets of APIs. Another potential discussion topic concerns the range of challenges for security and trust in relation to data, for instance the role of digital signatures for RDF graphs in verifying that a graph hasn't been tampered with since it was produced by a trusted entity. A related role for digital signatures is in annotations by third parties with trusted vetting procedures for credentials; see W3C's Verifiable Claims Working Group, which seeks to make expressing and exchanging credentials that have been verified by a third party easier and more secure on the Web.
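The core of graph tamper-detection is producing a stable fingerprint of the graph. The toy sketch below assumes a graph with no blank nodes, so simply sorting the serialised triples gives a canonical form; real RDF graph signing must canonicalise blank node labels (e.g. via the URDNA2015 algorithm) before hashing, and would then sign the digest rather than just compare it.

```python
import hashlib

# Sketch: fingerprinting an RDF graph so tampering can be detected.
# Assumes no blank nodes, so sorting serialised triples yields a
# stable, order-independent digest.
def graph_digest(triples):
    lines = sorted(" ".join(t) for t in triples)
    return hashlib.sha256("\n".join(lines).encode("utf-8")).hexdigest()

g1 = {("ex:a", "ex:knows", "ex:b"), ("ex:b", "ex:name", '"Bob"')}
g2 = set(g1)

print(graph_digest(g1) == graph_digest(g2))   # same graph, same digest
g2.add(("ex:a", "ex:name", '"Mallory"'))      # tampering changes the digest
print(graph_digest(g1) == graph_digest(g2))   # now False
```

Because set iteration order plays no part in the digest, two parties holding the same triples always compute the same fingerprint, which is the property a signature scheme for RDF graphs needs.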

What is W3C?

W3C is a voluntary standards consortium that convenes companies and communities to help structure productive discussions around existing and emerging technologies, and offers a Royalty-Free patent framework for Web Recommendations. W3C develops work based on the priorities of our members and our community.