Next steps for RDF: Keep the core and pave the cowpaths

Richard Cyganiak ● Linked Data Research Centre at DERI ● NUI Galway, Ireland
richard@cyganiak.de

Six years after their publication, a critical look at the core RDF specifications and the entire RDF stack is definitely justified. Nevertheless, a conservative approach should be taken. RDF adoption has gained momentum and this is not the time to make RDF a moving target. The main opportunities are in codifying practices that have emerged outside of W3C working groups, most importantly the Turtle syntax, the Named Graphs data model, and the follow-your-nose interpretation of URIs in RDF graphs. Furthermore, there is an opportunity for making some underused features of the RDF model more useful by improving their support throughout the stack.

A position paper submitted to the W3C Workshop RDF Next Steps

Introduction

More than six years have passed since RDF became a standard. Much has happened since: SPARQL has made programmatic access to large RDF databases practical. The W3C's Technical Architecture Group has settled the httpRange-14 issue, answering the question of how RDF should be deployed on the Web, and paving the road towards the rise of Linked Data. RDF has seen increased use in new areas, from social networks through library catalogs and e-commerce to official government data. It has moved from the Artificial Intelligence and Computer Science departments of universities to the business world, and has taken hold in multinational companies as well as startups. Certainly, the RDF community has learned a lot in these six years. So, should RDF evolve? Is it time to fix the blunders of the past, to embrace change and to design a better RDF 2.0 for the future?

The answer is a complex one, since it has to strike a balance between fixing what's broken in the stack and not derailing a moving train that has gained momentum.

This document proposes a set of boundary conditions that any successful update to the RDF core specifications should meet; states some high-level goals for such an update; and lists some specific issues that could be addressed. Nothing particularly novel is mentioned in those lists, because it is the author's opinion that W3C can best contribute by standardizing one approach where there are already multiple non-interoperable approaches to the same problem, or where there already is a single de-facto standard way of doing things that could benefit from approval as a W3C Recommendation.

Goals and boundary conditions

Before we can get into detail, it is worth thinking through some boundary conditions that must be met by any effort at revisiting the foundational specifications of the RDF stack.

1. Standards are a tool for achieving interoperability. This is their purpose and reason for existence. The first question to be asked is: Does it improve interoperability?
2. Do not make RDF a moving target. The existing RDF specifications, but more so the tools and libraries, the sites and datasets and applications, represent an enormous investment. We should be conscious of this, and protect investments that have already been made. RDF adoption benefits from the network effect: Each new tool, dataset and application increases the value of other compatible ones. A disruptive version change would negate much of this effect.
3. No speculative development within W3C working groups. More so than in the early years of the Semantic Web project, there is now a large and innovative community outside of W3C that can develop, implement and test new ideas. The W3C's role should be one of alignment of existing similar efforts, with the goal of establishing interoperability between vendors. Working groups should not design things from scratch where there are no existing efforts to draw experience from.
4. A focus on serving the areas where RDF has proven to be successful. RDF has been used for many things, but it appears that it has succeeded only at a few. RDF has roots in knowledge representation, and it has been used as a language for browser configuration files and for RSS news syndication and in other areas, but these efforts have remained niche applications or failed. RDF seems to do well when used for loosely-coupled data integration. It seems to do well as a webby graph data model. It seems to do well as a language for embedding data in web pages. Future development should be dictated by the needs of these areas.

What goals should an effort to update the core RDF standards have? If the boundary conditions listed above are to be met, many goals that might be attractive in theory are clearly not feasible. As an extreme example, designing a new RDF 2.0 from scratch, incorporating everything that has been learned since 2004, is clearly not desirable as it violates conditions 1, 2 and 3. Similarly, no new logic foundation for RDF should be standardized, as it fails conditions 3 and 4.

Two general goals seem desirable and achievable.

Putting de-facto standards on the W3C Recommendation track. The community has adopted certain technologies that have become so widespread that they can be seen as de facto standards. As long as this state persists, the W3C-recommended RDF technology stack poorly reflects reality and is not sufficiently complete to be actually implementable without reference to tacit knowledge that is dispersed throughout the community. Prime candidates that seem to enjoy widespread consensus are the Turtle syntax, the Named Graphs data model, and the follow-your-nose interpretation of the meaning of HTTP URIs.
Aligning the stack. While the set of standards has grown to cover higher layers of the Semantic Web “layer cake”, time pressure in working groups and other difficulties have led to orphaned features, obvious omissions, and incomplete interfaces between standards. For example, RDFa has no reasonable syntax for RDF lists. SPARQL cannot properly query RDF containers.

Do not change the interoperable core of RDF

Before talking more about things that should be done, it is worth noting some flaws of the RDF stack that, despite frequent expressions of pain from the practitioners' community, are better left untouched.

Should we allow literal subjects? No. Literals are not allowed in the subject position for rather accidental and historical reasons. There is no solid design argument for not allowing them as subjects. The SPARQL specification has taken first steps towards relaxing the restriction. But this restriction is not a problem in practice. The constraint may be unnecessary, but it doesn't preclude any major usage scenarios, and the situations in which it causes pain are limited. A change to the model would ripple through every syntax and every implementation. This cost is not justified.

Should we fix RDF/XML? No. Experience has shown that RDF/XML is not a good format. It is complex, verbose, and exhibits rather arbitrary restrictions in the graphs that can be serialized. But it has a redeeming feature: After all these years, there are reliable and interoperable RDF/XML parsers for most major computing platforms. Modifying RDF/XML would negate this benefit. The community has to accept that we are stuck with a poor XML syntax, and focus energies on friendlier syntaxes. With Turtle and RDFa, good alternatives are now readily available.

Should we abolish blank nodes? No. They are much reviled, but they are occasionally useful, and people can be taught not to use them.

Paving the cowpaths: De-facto standards

Turtle: a friendly RDF syntax. Much has been said about the harm that has been done to RDF adoption by the RDF/XML syntax. The solution is not to fix RDF/XML; the solution is to put a better syntax on equal footing with RDF/XML. This syntax is Turtle. Unlike a few years ago, Turtle implementations are now almost as readily available as RDF/XML implementations. The main obstacle to wider use of Turtle is its lack of W3C Recommendation status. Turtle is already a W3C Team Submission. Rubber-stamping it as a Recommendation would also straighten the path towards updating core RDF documents, such as the RDF Primer, with versions that use Turtle examples throughout the document.
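To give a feel for what a friendly syntax looks like, the following is a minimal Python sketch that writes a few triples in a small subset of Turtle. The example.org URIs, the sample data, and the `shorten` and `to_turtle` helpers are assumptions made for illustration. The sketch does not escape literals and is not a conformant Turtle serializer.

```python
from collections import defaultdict

# Hypothetical sample data. The FOAF namespace is real; the example.org URIs are made up.
PREFIXES = {
    "foaf": "http://xmlns.com/foaf/0.1/",
    "ex": "http://example.org/people/",
}

TRIPLES = [
    ("http://example.org/people/alice", "http://xmlns.com/foaf/0.1/name", '"Alice"'),
    ("http://example.org/people/alice", "http://xmlns.com/foaf/0.1/knows",
     "http://example.org/people/bob"),
    ("http://example.org/people/bob", "http://xmlns.com/foaf/0.1/name", '"Bob"'),
]


def shorten(term, prefixes):
    """Render a term as a quoted literal, a prefixed name, or a full <URI>."""
    if term.startswith('"'):  # already a quoted literal, pass through unchanged
        return term
    for prefix, namespace in prefixes.items():
        if term.startswith(namespace):
            local = term[len(namespace):]
            if local.isalnum():  # only abbreviate simple local names
                return f"{prefix}:{local}"
    return f"<{term}>"


def to_turtle(triples, prefixes):
    """Serialize triples into a small Turtle subset, grouping by subject."""
    lines = [f"@prefix {p}: <{ns}> ." for p, ns in prefixes.items()]
    lines.append("")
    by_subject = defaultdict(list)
    for s, p, o in triples:
        by_subject[s].append((p, o))
    for s, pairs in by_subject.items():
        # Predicate-object pairs that share a subject are joined with ";"
        body = " ;\n    ".join(
            f"{shorten(p, prefixes)} {shorten(o, prefixes)}" for p, o in pairs
        )
        lines.append(f"{shorten(s, prefixes)} {body} .")
    return "\n".join(lines)


print(to_turtle(TRIPLES, PREFIXES))
```

The printed result declares the two prefixes and then lists each subject once, with its properties separated by semicolons. This is the kind of compact, readable form that makes Turtle attractive for examples in documents such as the RDF Primer.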

Named Graphs. Managing context, provenance and graph updates is extremely important in almost any RDF application. The solution is the Named Graphs data model. It is already part of the SPARQL Recommendation, is widely implemented outside of SPARQL as well, and is generally well understood. It should be elevated to a separate Recommendation. Besides codifying existing practice, this would be a welcome support for those practitioners who are trying to improve the general state of provenance tracking and metadata on the RDF-based Web and who are currently fighting a somewhat uphill battle because of Named Graphs' relative obscurity. Furthermore, a Named Graphs standard could galvanize research on the upper layers of the Semantic Web stack, where the availability of rich context information, along with a standard model for its representation, is a key requirement.
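The core idea of the Named Graphs model can be sketched in a few lines of Python: a dataset is a collection of graphs, each identified by a URI. The `Dataset` class, its method names, and all URIs below are hypothetical and only illustrate how graph names support provenance lookups and graph-level updates; they are not taken from any specification or library.

```python
from collections import defaultdict

# Hypothetical graph names and data, for illustration only.
CRAWL_MONDAY = "http://example.org/graphs/crawl-monday"
CRAWL_TUESDAY = "http://example.org/graphs/crawl-tuesday"
ALICE = "http://example.org/people/alice"
MBOX = "http://xmlns.com/foaf/0.1/mbox"


class Dataset:
    """A set of named graphs: each graph name (a URI) maps to a set of triples."""

    def __init__(self):
        self.graphs = defaultdict(set)

    def add(self, graph_name, s, p, o):
        """Add one triple to the graph with the given name."""
        self.graphs[graph_name].add((s, p, o))

    def graphs_asserting(self, s, p, o):
        """Provenance lookup: which named graphs contain this triple?"""
        return sorted(
            name for name, triples in self.graphs.items() if (s, p, o) in triples
        )

    def replace_graph(self, graph_name, triples):
        """Graph-level update: swap the whole content of one named graph."""
        self.graphs[graph_name] = set(triples)


ds = Dataset()
ds.add(CRAWL_MONDAY, ALICE, MBOX, "mailto:alice@example.org")
ds.add(CRAWL_TUESDAY, ALICE, MBOX, "mailto:alice@example.org")
ds.add(CRAWL_TUESDAY, ALICE, MBOX, "mailto:a.smith@example.org")

# The same triple is asserted by both crawls.
print(ds.graphs_asserting(ALICE, MBOX, "mailto:alice@example.org"))

# Discarding Monday's crawl leaves Tuesday's statements untouched.
ds.replace_graph(CRAWL_MONDAY, [])
print(ds.graphs_asserting(ALICE, MBOX, "mailto:alice@example.org"))
```

Because the graph name is itself a URI, further statements (who published the graph, when it was retrieved) can be made about it in ordinary RDF, which is what makes the model useful for context and provenance.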

Codifying follow-your-nose. RDF statements are assertions about the world. But to understand what a statement means, one has to know what the URIs refer to. One has to know what they name. Despite the centrality of URIs in the RDF data model, the RDF specifications have nothing to say about how a URI actually receives its meaning. This needs fixing. It is possible to get a coherent picture of the process by referring to a number of other documents, in particular the httpRange-14 TAG Finding, the Architecture of the World Wide Web document, the Cool URIs for the Semantic Web Note, and a number of documents published by enthusiasts outside of W3C. Further progress towards codification was made by the TAG's AWWSW task force. Finally completing this job is an important companion to the standardization of Named Graphs; both together allow for a solid account of Web document metadata and thus context information for RDF data published on the Web.
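As a rough illustration of the follow-your-nose process, the Python sketch below resolves a URI to the document that describes it, covering the two common patterns: hash URIs (strip the fragment and fetch the document) and 303 redirects. To keep it self-contained, it runs against a simulated web. `FAKE_WEB`, `describing_document`, and every URI and response in it are assumptions made for this example, not part of any W3C document.

```python
from urllib.parse import urldefrag

# Simulated web: URI -> (HTTP status, payload). For a 303 the payload is the
# redirect target; for a 200 it is the document body. Everything here is made up.
FAKE_WEB = {
    "http://example.org/vocab": (200, "RDF document defining the vocabulary terms"),
    "http://example.org/id/alice": (303, "http://example.org/doc/alice"),
    "http://example.org/doc/alice": (200, "RDF document describing Alice"),
}


def describing_document(uri, web=FAKE_WEB, max_redirects=5):
    """Return (document_uri, body) for the document that describes `uri`."""
    # Hash URIs: the fragment is never sent to the server, so drop it first.
    url, _fragment = urldefrag(uri)
    for _ in range(max_redirects):
        status, payload = web.get(url, (404, None))
        if status == 200:
            return url, payload  # reached a retrievable document
        if status == 303:
            url = payload  # "See Other": the description lives elsewhere
        else:
            raise LookupError(f"cannot dereference {uri}: HTTP {status}")
    raise LookupError(f"too many redirects for {uri}")


print(describing_document("http://example.org/vocab#Person"))
print(describing_document("http://example.org/id/alice"))
```

In both cases the client ends up with a document URI that is distinct from the URI it started with, which is what allows metadata about the document to be kept apart from statements about the thing it describes.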

Aligning the stack

Among the RDF stack's many features, there are some that are rarely liked, rarely used, sometimes mis-used, and generally in a poor shape in terms of actual deployment and tool support.

The existence of underused features is not a major problem. At worst, it increases the cost of conformant implementation, and it might lure newbies down a wrong road. Nevertheless, it is worth exploring the reasons for the lack of love for those features. In some cases, they may have been made redundant by newer developments. In other cases it could simply be a lack of proper support in some other layer of the stack. SPARQL in particular makes it very hard to use some RDF features successfully, because it is so opinionated about which parts of the RDF model's full richness it supports. In such cases, there might be an opportunity for making the overall stack better and richer by extending SPARQL's coverage. If, on the other hand, a convincing argument can be made against adding support for these features in SPARQL and elsewhere, then perhaps the features ought to be deprecated in the base RDF model. Notable examples:

The RDF Containers rdf:Alt, rdf:Bag and rdf:Seq are rarely used, and if it weren't for a strong reliance on rdf:Seq in RSS 1.0, they would probably be forgotten. They suffer from a lack of clear semantics, from a lack of purpose, and from redundancy with newer features such as rdf:List.

RDF Lists are much more widely used, and especially the RDF syntax of OWL relies on them heavily. But they are poorly supported throughout the stack. SPARQL has no syntax for querying them. Their representation in RDFa and in N-Triples is horrible. The implementation record in RDF APIs and in RDF visualizers is spotty.
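The difference between the two collection mechanisms is easier to see at the triple level. The Python sketch below encodes the same items once as an rdf:Seq container and once as an rdf:List, then reads the list back. The rdf: terms are the standard ones; the helper names, the blank node labels and the sample items are assumptions made for illustration.

```python
RDF = "http://www.w3.org/1999/02/22-rdf-syntax-ns#"
FIRST, REST, NIL = RDF + "first", RDF + "rest", RDF + "nil"


def seq_to_triples(node, items):
    """Encode items as an rdf:Seq: one membership triple per item."""
    triples = [(node, RDF + "type", RDF + "Seq")]
    triples += [(node, f"{RDF}_{i}", item) for i, item in enumerate(items, start=1)]
    return triples


def list_to_triples(items):
    """Encode items as an rdf:List: a chain of nodes linked by rdf:rest."""
    if not items:
        return NIL, []
    nodes = [f"_:n{i}" for i in range(len(items))]  # hypothetical blank node labels
    triples = []
    for i, item in enumerate(items):
        next_node = nodes[i + 1] if i + 1 < len(items) else NIL
        triples.append((nodes[i], FIRST, item))
        triples.append((nodes[i], REST, next_node))
    return nodes[0], triples


def read_list(head, triples):
    """Walk a well-formed rdf:List from its head node to rdf:nil."""
    first = {s: o for s, p, o in triples if p == FIRST}
    rest = {s: o for s, p, o in triples if p == REST}
    items, node = [], head
    while node != NIL:  # the number of hops depends on the list length
        items.append(first[node])
        node = rest[node]
    return items


authors = ['"Alice"', '"Bob"', '"Carol"']
head, list_triples = list_to_triples(authors)
print(len(seq_to_triples("_:seq", authors)), len(list_triples))
print(read_list(head, list_triples))
```

Reading a list requires following rdf:rest an unknown number of times, which a query made of a fixed number of triple patterns cannot express. This is one way to understand why lists are awkward to query and verbose to write in triple-per-line syntaxes.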

Reification is a controversial feature. The facts are that it is very rarely used in published RDF and that it has no formal semantics. This author's opinion is that it is misdesigned and that Named Graphs are a superior approach for dealing with the context of a statement in the typical case. If Named Graphs are accepted as a W3C Recommendation, then it would be worth exploring if reification can be handled as a special case of single-statement Named Graphs.
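To make the comparison concrete, the Python sketch below annotates one statement with its source in two ways: through the reification vocabulary, and through a named graph that holds just that statement. This is only one possible reading of "single-statement Named Graphs", not a worked-out proposal. The example.org URIs, the `ex:source` property and the helper names are hypothetical.

```python
RDF = "http://www.w3.org/1999/02/22-rdf-syntax-ns#"

# Hypothetical statement and metadata, for illustration only.
STATEMENT = (
    "http://example.org/people/alice",
    "http://xmlns.com/foaf/0.1/knows",
    "http://example.org/people/bob",
)
SOURCE = "http://example.org/vocab/source"
CRAWL = "http://example.org/crawls/1"


def reify(statement_node, s, p, o):
    """Describe a statement with the four standard reification triples."""
    return [
        (statement_node, RDF + "type", RDF + "Statement"),
        (statement_node, RDF + "subject", s),
        (statement_node, RDF + "predicate", p),
        (statement_node, RDF + "object", o),
    ]


def as_named_graph(graph_name, s, p, o):
    """Hold the statement itself in a named graph that contains nothing else."""
    return {graph_name: {(s, p, o)}}


# Reification: metadata hangs off a node that describes the statement.
reified = reify("_:stmt1", *STATEMENT) + [("_:stmt1", SOURCE, CRAWL)]

# Named graph: metadata hangs off the graph name.
graph_name = "http://example.org/graphs/stmt1"
dataset = as_named_graph(graph_name, *STATEMENT)
metadata = [(graph_name, SOURCE, CRAWL)]

print(len(reified), STATEMENT in reified)
print(len(dataset[graph_name]), STATEMENT in dataset[graph_name])
```

In the reified form, the original triple is not among the triples produced; it has to be reassembled from the description. In the named graph form, the triple is stored as is, and only the annotation refers to the graph name.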

Custom datatypes, beyond the XML Schema datatypes such as xsd:int and xsd:date, are very rarely used. Finding examples where they are used in a sensible way is rather hard, and most uses of non-XSD datatypes on the public Web can be classified as either mistakes, or redundant re-definitions of XSD types, or questionable modelling (such as using datatypes for currencies and units of measurement). A main reason might be the lack of a well-documented method of associating a definition with a custom datatype URI.

This document will not express an opinion on what ought to be done about each of these particular features, but general options include:

Putting better support of the feature throughout the RDF stack on the agenda for future working groups.
Publishing W3C Notes that comment on a feature's most effective use, including recommendations about the (perhaps more frequent) situations where it should not be used.
Deprecation of the feature.

Either way, a first step would be to understand why these features are not widely supported and deployed.

Conclusion

Innovation in the RDF community is ongoing and healthy, and after six years it is time to revisit the older layers of the RDF stack. Nevertheless, this document recommends a conservative approach. Any updates to the stack must fulfill a number of boundary conditions. Most of all, they must increase interoperability, and they must not require sweeping changes that affect all or most existing RDF tools, libraries and applications. Furthermore, changes should benefit the areas where RDF is most successful—data integration and data exchange via web protocols—because otherwise the changes are unlikely to deliver benefits that offset the cost of disruption.

A number of possible areas of work have been identified. These tasks could be tackled by an “RDF Maintenance” or “RDF Housekeeping” working group with a narrowly defined charter.