W3C Workshop on Web Standardization for Graph Data

Creating Bridges: RDF, Property Graph and SQL

Monday 4th March to Wednesday 6th March 2019, Berlin, Germany

Introduction

Data is increasingly important for all organisations, especially with the rise of IoT and Big Data. The falling costs of storage and processing are driving interest in extracting competitive value from ever larger amounts of data through analytics and data-hungry AI algorithms. In addition, organisations are seeking to exploit opportunities for sharing data within emerging digital ecosystems. W3C has an extensive suite of standards relating to data, developed over two decades of experience. These include core standards for RDF, the Semantic Web and Linked Data.

A W3C Workshop is now planned for early 2019 on emerging standardisation opportunities, for example: query languages for graph databases; improvements for handling link annotations (i.e. embracing property graphs); support for enterprise-wide knowledge graphs; forms of reasoning suited to incomplete, uncertain and inconsistent knowledge, AI and Machine Learning; approaches for transforming data between vocabularies with overlapping semantics; signed graphs; and what's next for remote access to data and information services. In addition, W3C hosts many Community Groups working on data standards, and we are interested in what is needed to better support work on vocabulary standards.

See this Workshop's Call for Participation. Further background is given below.

Graph Databases and Link Annotations

Businesses have relied on relational databases (RDBMS) for many years, using SQL for query and update. More recently we have seen the rise of NoSQL databases, which address the need for flexible handling of unstructured data with key-value stores, document stores and graph databases. One example is CouchDB, which uses JSON for data storage with ready support for replication for speedy access at different sites. NoSQL is a good fit when you need the agility to deal with ever-changing data models.

Most NoSQL databases store disconnected sets of data. This is a drawback when you want to deal with connected data and graphs, which is where graph databases come into their own: they enable links across data, forming fine-grained networks of information. Graph databases may also support robust transactions with ACID guarantees in the event of errors, power failures and so on, as found in RDBMS. Examples of graph databases include Apache TinkerPop, AllegroGraph, Amazon Neptune and the Neo4j Graph Platform; see Wikipedia for a longer list.

W3C's RDF uses URIs (Web addresses) for nodes and link labels in directed graphs. This has the advantage that URIs can be dereferenced to obtain further information, making for a Web of linked data. In particular, nodes can be dereferenced to graphs on remote databases. URIs further provide globally unique identifiers for vocabulary terms, allowing people to be sure of the terms they are using, and enabling data joins across different data sources. This forms the basis for W3C's Web ontology language OWL and has encouraged massive growth in Linked Open Datasets, see, e.g., John P. McCrae's linked open data cloud diagram for 2018.
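The way shared URIs enable joins across independent data sources can be sketched in a few lines of plain Python, with triples as tuples. The FOAF terms are a real vocabulary; the example.org URIs and data are hypothetical.

```python
# Sketch: RDF triples as (subject, predicate, object) tuples, where URIs
# act as globally unique identifiers across data sources.
FOAF_KNOWS = "http://xmlns.com/foaf/0.1/knows"
FOAF_NAME = "http://xmlns.com/foaf/0.1/name"

# Two independent data sources referring to the same person by the same URI.
source_a = [
    ("http://example.org/alice", FOAF_KNOWS, "http://example.org/bob"),
]
source_b = [
    ("http://example.org/bob", FOAF_NAME, "Bob"),
]

# Merging RDF graphs is just set union; shared URIs make the join automatic.
merged = set(source_a) | set(source_b)

# Find the names of everyone Alice knows, spanning both sources.
known = {o for s, p, o in merged if p == FOAF_KNOWS}
names = [o for s, p, o in merged if p == FOAF_NAME and s in known]
print(names)  # ['Bob']
```

In a real system the URIs would additionally be dereferenceable, so a query engine could fetch further triples about `http://example.org/bob` on demand.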

Property graphs allow properties (key/value pairs) to be associated with both nodes and links in directed graphs. This allows you to annotate links with information such as the start and stop times for when the link is valid, its provenance, a statement about its quality and so forth. RDF allows for such annotations using the RDF reification vocabulary, but this is awkward. Perhaps it is time to consider updating the RDF core to allow links to be used as the subject or object of other links, along with corresponding extensions to RDF serialisation formats and query languages (e.g. SPARQL), and an examination of the correspondence between the RDF and property graph data models and their query languages. One such proposal is RDF* and SPARQL*. Extending RDF in this way could also help with the lack of portability of graph data across property graph databases. Another fertile area is graph schema, where there is continuing work on advancing SHACL; see e.g. SHACL Advanced Features.
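The contrast between classic reification and the RDF* idea can be illustrated with a small sketch. The `ex:` terms below are hypothetical and namespaces are abbreviated for readability; the point is the shape of the data, not any particular syntax.

```python
# Sketch contrasting standard RDF reification (four extra triples) with
# the RDF* idea of using a quoted triple directly as a subject.
RDF = "rdf:"

statement = ("ex:alice", "ex:worksFor", "ex:acme")

# Classic reification: a fresh node describes the statement, and the
# annotation (here, a start date) hangs off that node.
reified = [
    ("_:s1", RDF + "type",      RDF + "Statement"),
    ("_:s1", RDF + "subject",   statement[0]),
    ("_:s1", RDF + "predicate", statement[1]),
    ("_:s1", RDF + "object",    statement[2]),
    ("_:s1", "ex:since",        "2017-01-01"),
]

# RDF*-style: the triple itself is the subject of the annotation,
# mirroring a property-graph edge carrying a key/value pair.
rdf_star = [
    (statement, "ex:since", "2017-01-01"),
]

print(len(reified), len(rdf_star))  # 5 vs 1 triples for the same claim
```

The annotation count is the practical difference: reification multiplies the triples to store and query, while the RDF*-style form keeps the link and its annotation directly connected, as in a property graph.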

There is a very noticeable increase in the use of graph databases, time-series databases and allied technologies, see e.g. DB-Engines' recent rankings. This is reflected in new products and services, in highly active research and in new technical approaches. It is also reflected in a widely felt need for stronger standardization in the area of graph data management.

It would be good to achieve a productive and interoperable boundary between the adjacent worlds of RDF and Property Graphs. At a minimum, information exchange and mutual awareness should help standards work in these two related worlds. Their relationship to the predominant SQL standard is equally important. This workshop will therefore seek to include people involved in ISO discussions on initiating work on a standard Graph Query Language (GQL) for property graphs. This is intended to complement and extend current work on SQL Property Graph Querying (SQL PGQ) for SQL:2020.

Starting from industrial languages like openCypher and PGQL, research and industry partners in the Linked Data Benchmark Council have published on future directions for property graph querying, including query composition and multiple named graphs, path elements, and regular path queries. There is also very strong interest in the property graph data community in graph schema/constraints, an area that has not been addressed systematically in any of the existing languages, but which is beginning to be worked on in SQL PGQ.
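One of the features mentioned above, regular path queries, can be sketched without any query engine: an RPQ matches paths whose edge labels satisfy a regular expression. The sketch below evaluates the hypothetical pattern `knows+` (one or more `knows` edges) over a small edge list via breadth-first search; the nodes and labels are made up for illustration.

```python
from collections import deque

# Toy edge list in property-graph style: (source, label, target).
edges = [("alice", "knows", "bob"),
         ("bob", "knows", "carol"),
         ("carol", "worksWith", "dave")]

def knows_plus(start):
    # BFS following only 'knows' edges == evaluating the RPQ  knows+
    reachable, frontier = set(), deque([start])
    while frontier:
        node = frontier.popleft()
        for s, label, o in edges:
            if s == node and label == "knows" and o not in reachable:
                reachable.add(o)
                frontier.append(o)
    return reachable

print(sorted(knows_plus("alice")))  # ['bob', 'carol']
```

Note that `dave` is not matched: the final edge has a different label, which is exactly the kind of distinction a standardised path expression syntax would need to pin down.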

Similarly, it would be worth examining the relationship of RDF to time-series databases and geospatial databases, given their importance for IoT data. For instance, what are the implications for query languages when graph data incorporates indexing for temporal and spatial annotations?

Enterprise-wide knowledge graphs

Given the increasingly strategic importance of data, and the challenges of integrating large numbers of data silos, almost all enterprises are likely to need to build large internal knowledge graphs. These are actively maintained models of collections of entities, their semantic types, their properties and interrelationships. Enterprise-wide knowledge graphs provide a powerful tool for organisations with respect to data management and data governance. Whilst the data and processes are likely to be distributed across an enterprise, a uniform framework for data governance will make it much easier for organisations to extract value from their data. In principle, RDF and Linked Data provide a generic foundation for enterprise-wide knowledge graphs. There are many potential discussion areas, including federated control and storage of data, and resilience in the face of faults and cyber attacks.

Beyond deduction: incomplete, uncertain and inconsistent knowledge, AI and Machine Learning – the emergence of the "Sentient Web"

Real-world data is often incomplete and uncertain, and may even include inconsistencies and errors. Sound deductive reasoning alone is no longer adequate, and we need to consider other forms of reasoning that are rational given the statistics of prior knowledge and past experience: inductive reasoning over commonalities across a set of examples; abductive reasoning, seeking explanations for a set of observations; analogical reasoning, based upon structural similarities with other problems; spatial reasoning, e.g. road navigation; temporal reasoning over intervals and points in time; causal reasoning, e.g. on plans for achieving certain goals; social reasoning, e.g. on social relationships and status within a group; and emotional reasoning, e.g. on how someone's feelings are likely to be affected by certain actions.

Reasoning from experience requires that graph databases embody a computational treatment of statistics. Here we can turn to cognitive science and psychology for accounts of how humans manage this: in particular, decay, where memories become harder to retrieve over time unless reinforced, and interference, where more useful memories tend to hide less useful ones. In addition, it is common to distinguish two styles of processing: intuition, which is fast and autonomous, using heuristic short-cuts (efficient algorithms operating in parallel over graph data); and reasoning, which is slow and considered (sequential rule execution on graph data), and which can be divided into analytic reasoning and reflective reasoning (reasoning about reasoning). Mimicking human memory and processing would allow computers to better handle massive amounts of real-world data. What requirements does this imply for standards for graph data and associated rule languages?
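One well-studied account of decay from cognitive science is the base-level learning equation used in the ACT-R cognitive architecture, where a memory's activation is the log of a power-law-decayed sum over its past uses. The sketch below illustrates that idea only; it is not part of any graph data standard, and the decay rate of 0.5 is simply the conventional ACT-R default.

```python
import math

# Base-level activation, loosely following ACT-R's base-level learning:
#   activation = ln( sum over past uses j of (now - t_j) ** -d )
# where t_j is the time of each past use and d is the decay rate.
def activation(use_times, now, d=0.5):
    return math.log(sum((now - t) ** -d for t in use_times))

# A fact reinforced recently and often is easier to retrieve...
recent = activation(use_times=[90, 95, 99], now=100)
# ...than one last touched long ago.
stale = activation(use_times=[1, 2, 3], now=100)
print(recent > stale)  # True
```

A graph database embodying such a scheme could use activation scores to rank retrieval candidates, with interference arising naturally when high-activation facts crowd out low-activation ones.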

Artificial Intelligence and Machine Learning (AI/ML) depend on the availability of data. In some cases, this would benefit from standards that allow data to be pooled from different sources. Standards for metadata would help with measures to counter bias, a widely acknowledged challenge for deep learning. What other related opportunities are there? Deep learning has been very successful, but requires very large amounts of training data, and requires that the statistics of the data it is applied to reflect those of the training data. Recent work has identified the opportunity for continuous learning with neural processing over real-time data in combination with symbolic graph data and rules, e.g. for scene understanding in computer vision, involving induction and abduction over video frames. What are the implications for graph data standards?

Note: further W3C Workshops on AI are under consideration.

Scalability of vocabularies across communities

Whilst it is desirable to use the same vocabularies, it is inevitable that different communities, with differing needs, will come up with vocabularies that vary in their semantics and structure. This poses challenges for creating services that span different communities and increases friction for open markets of services. Is it timely to work on context-sensitive languages for mapping data between vocabularies with overlapping semantics? What is needed to encourage re-use of existing vocabularies where appropriate? This includes provision for different communities to share their experiences, use cases and requirements, along with recommendations on best practices. How can this be made sustainable given the running costs for the social and physical infrastructure?

APIs and digital signatures

What's next for remote access to data and information services? Web developers have become used to REST based APIs for accessing and manipulating textual representations of data resources, using a predefined set of stateless operations. This may incur multiple round trips and under/over fetching of data. Newer approaches seek to overcome these weaknesses. JSON API offers a flexible means to express APIs for data, whilst GraphQL provides a means to use JSON to express data queries for multiple resources in the same request. JSON Schema can be used to annotate and validate data expressed in JSON. W3C's Web of Things seeks to simplify access to things such as sensors, actuators and related information services. Things are exposed to applications as software objects with properties, actions and events, hiding the details of the underlying protocols. The kinds of things, their capabilities and the context in which they are situated are expressed as Linked Data.

An open question is whether there are sufficiently shared requirements to justify work on standardising a JavaScript API for Linked Data, with the benefits of reduced effort for deploying applications and easier re-use of code, compared with having to cope with heterogeneous sets of APIs. Another potential discussion topic concerns the range of challenges for security and trust in relation to data, for instance the role of digital signatures for RDF graphs in verifying that a graph hasn't been tampered with since it was produced by a trusted entity. A related role for digital signatures is in annotations by third parties with trusted vetting procedures for credentials; see W3C's Verifiable Claims Working Group, which seeks to make expressing and exchanging credentials that have been verified by a third party easier and more secure on the Web.
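The core of graph tamper-detection is producing a stable fingerprint of the graph. The toy sketch below assumes a graph with no blank nodes, so simply sorting the serialised triples gives a canonical form; real RDF graph signing must canonicalise blank node labels (e.g. via the URDNA2015 algorithm) before hashing, and would then sign the digest rather than just compare it.

```python
import hashlib

# Sketch: fingerprinting an RDF graph so tampering can be detected.
# Assumes no blank nodes, so sorting serialised triples yields a
# stable, order-independent digest.
def graph_digest(triples):
    lines = sorted(" ".join(t) for t in triples)
    return hashlib.sha256("\n".join(lines).encode("utf-8")).hexdigest()

g1 = {("ex:a", "ex:knows", "ex:b"), ("ex:b", "ex:name", '"Bob"')}
g2 = set(g1)

print(graph_digest(g1) == graph_digest(g2))   # same graph, same digest
g2.add(("ex:a", "ex:name", '"Mallory"'))      # tampering changes the digest
print(graph_digest(g1) == graph_digest(g2))   # now False
```

Because set iteration order plays no part in the digest, two parties holding the same triples always compute the same fingerprint, which is the property a signature scheme for RDF graphs needs.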

What is W3C?

W3C is a voluntary standards consortium that convenes companies and communities to help structure productive discussions around existing and emerging technologies, and offers a Royalty-Free patent framework for Web Recommendations. W3C develops work based on the priorities of our members and our community.