RDF Syntaxes 2.0
David Beckett Position paper
W3C 2010 RDF Next Steps workshop

Author
David Beckett
Email
dave@dajobe.org
Date
10 April 2010
Web
http://www.dajobe.org/

Abstract

This is David Beckett's personal position paper for the W3C RDF Next Steps Workshop primarily dealing with how RDF syntaxes (serializations of the RDF graph) could be worked on in a future RDF working group.

This is my personal position and not that of my employer.

Introduction

My background and interest are in the lower levels of the semantic web stack - RDF triple data, URIs, formats and storage - rather than higher levels such as inference, reasoning and trust. I worked on the 2004 revision of RDF as editor of the RDF/XML (Revised) W3C Recommendation[4] and co-editor of the RDF Test Cases W3C Recommendation[5], and in 2008 I co-authored a W3C Team Submission on the Turtle[6] RDF syntax. I have also implemented the RDF model, multiple syntaxes and SPARQL querying in the Redland libraries[7] over the period 2000-present.

With the focus on the practical, I will outline my position primarily on syntax issues as well as some important model-related changes.

In general, updating RDF should focus on improving the items that implementation experience has shown to need it. It should NOT add features that have not been tested in multiple systems in practice over some time. This work should not be a research project.

RDF Model Updates

There are a few RDF model parts that should be deprecated (or removed if that seems possible), in particular reification which turned out not to be widely used, understood or implemented even in the RDF 2004 update.

In terms of additions, there is one major feature that should be added, since toolkits and SPARQL have implicitly or explicitly supported it for some time: Named Graphs. There should be an RDF data model concept of a graph with a name, and of a set of graphs forming a dataset. The use cases can be taken from the Linked Data community and from SPARQL query execution, especially the forthcoming SPARQL 1.1 changes. This might mean turning an RDF statement into a 4-part 'quad', which is already a common implementation technique: RDF triples are stored in "quadstores" rather than "triplestores". Other model choices are possible, such as triples belonging to (contained by) a graph with a name. Graph names or graph literals have mostly been seen as URIs (IRIs), sometimes blank nodes but hardly ever as RDF literals.
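To make the two model choices concrete, here is a minimal Python sketch, not taken from any toolkit. The `Quad` and `Dataset` names, the use of plain strings for terms and the `example.org` URIs are all assumptions for illustration. It stores triples inside named graphs and flattens them to quads on demand, so both views describe the same data.

```python
from collections import defaultdict
from typing import NamedTuple, Optional


class Quad(NamedTuple):
    """A statement plus the name of the graph that contains it (hypothetical type)."""
    subject: str
    predicate: str
    obj: str
    graph: Optional[str]  # None stands for the unnamed (default) graph


class Dataset:
    """A set of graphs, each a set of triples, keyed by graph name."""

    def __init__(self):
        self._graphs = defaultdict(set)

    def add(self, quad):
        # "Triples contained by a named graph" view of the same statement.
        self._graphs[quad.graph].add((quad.subject, quad.predicate, quad.obj))

    def graph(self, name):
        """Return a copy of the triples in the named graph (empty if unknown)."""
        return set(self._graphs.get(name, set()))

    def quads(self):
        """Flatten the dataset into quads, as a quadstore would hold them."""
        for name, triples in self._graphs.items():
            for s, p, o in triples:
                yield Quad(s, p, o, name)


ds = Dataset()
ds.add(Quad("http://example.org/a", "http://example.org/knows",
            "http://example.org/b", "http://example.org/g1"))
ds.add(Quad("http://example.org/a", "http://example.org/knows",
            "http://example.org/c", "http://example.org/g2"))
print(sorted(ds.quads()))
```

Either representation can be derived from the other, which is why the choice between "quad" and "triple in a named graph" matters more for the specification text and the syntaxes than for storage.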

The RDF statement (subject, predicate, object) has always been asymmetric in that there are RDF terms that cannot be put into the subject or predicate that are allowed in the object. It would be worth considering making an RDF statement a 3-tuple of any RDF Term. That would allow blank-node predicates, literal subjects and literal predicates. However nice this would be for the semantics, the major problem is that most existing serializations would not support it; RDF/XML especially would be hard to update (see below). My position is that if this change is made, the consequences for serialization should be seriously considered.
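The asymmetry can be written down as two position checks. This is a sketch only: the `Term` class and both function names are made up for illustration, and the first check encodes the 2004 rules as summarised above (URI or blank node subject, URI predicate, any term as object).

```python
from dataclasses import dataclass

KINDS = ("iri", "blank", "literal")


@dataclass(frozen=True)
class Term:
    """An RDF term reduced to its kind and lexical value (hypothetical type)."""
    kind: str   # one of KINDS
    value: str


def valid_rdf2004(s, p, o):
    """Position rules of the 2004 model: literals only as objects, URI predicates."""
    return s.kind in ("iri", "blank") and p.kind == "iri" and o.kind in KINDS


def valid_generalized(s, p, o):
    """Symmetric alternative: any RDF term is allowed in any position."""
    return all(t.kind in KINDS for t in (s, p, o))


name = Term("iri", "http://example.org/name")
alice = Term("literal", "Alice")
node = Term("blank", "b1")

print(valid_rdf2004(alice, name, node))      # literal subject
print(valid_generalized(alice, name, node))
```

A statement with a literal subject passes only the second check, and it is exactly these statements that most existing syntaxes have no way to write.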

There are some minor additions and corrections that can be made to the model such as replacing "RDF URI References" with "IRI References".

RDF Syntax Updates

I was involved with specifying N-Triples, RDF/XML and Turtle formats as well as implementing other RDF syntaxes. In this section I outline my position on future RDF syntaxes.

N-Triples

In general: leave it alone, it works well for the job it was designed for. If the RDF model changes to a quad (4-ary RDF Term) model, the specification should be updated for that. If the model changes to a triple + scoped graph model then something more like Turtle would be appropriate to use.

If the specification does have to change for model needs, then the main extension people have made to N-Triples is adding prefixes (in effect a subset of Turtle and N3). This can be seen in several existing specifications such as the RDF Primer, and is also widely used when teaching RDF (URIs are too long for slides and new users). If this is done, the best approach would be to add the Turtle/N3-like @prefix and XML-like QNames (CURIEs?), although heading along that route will need careful explanation of namespaces, prefixes and allowed names.
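As an illustration of how small that extension is, the following Python sketch expands a prefixed variant back into plain N-Triples. It is not a conforming parser: the token patterns, the allowed form of prefixed names and the `expand` function are assumptions covering a deliberately small subset, and datatypes are only recognised when written as full URIs.

```python
import re

# Matches a Turtle-style directive such as: @prefix ex: <http://example.org/> .
PREFIX = re.compile(r'^@prefix\s+([A-Za-z][\w-]*):\s*<([^>]*)>\s*\.$')

# Tokens of a small subset of N-Triples plus prefixed names.
TOKEN = re.compile(r'''
    <[^>]*>                                          # absolute URI reference
  | "(?:[^"\\]|\\.)*"(?:@[A-Za-z-]+|\^\^<[^>]*>)?    # literal, optional language or datatype
  | _:[A-Za-z][A-Za-z0-9]*                           # blank node
  | [A-Za-z][\w-]*:[\w-]+                            # prefixed name (assumed form)
  | \.                                               # statement terminator
''', re.VERBOSE)


def expand(lines):
    """Yield plain N-Triples lines with every prefixed name expanded."""
    prefixes = {}
    for raw in lines:
        line = raw.strip()
        if not line or line.startswith("#"):
            continue
        directive = PREFIX.match(line)
        if directive:
            prefixes[directive.group(1)] = directive.group(2)
            continue
        out = []
        for token in TOKEN.findall(line):
            if token[0] in '<"_.':
                out.append(token)  # already a full N-Triples term
            else:
                prefix, local = token.split(":", 1)
                out.append("<" + prefixes[prefix] + local + ">")
        yield " ".join(out)


doc = [
    '@prefix ex: <http://example.org/> .',
    'ex:a ex:name "a:b" .',
]
for triple in expand(doc):
    print(triple)
```

The literal `"a:b"` is left alone because literals are tokenised before prefixed names. This is the kind of rule about "allowed names" that a specification would have to spell out.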

In retrospect, if there is one thing I would change if I were re-specifying N-Triples now, it would be to move it from 7-bit ASCII to Unicode UTF-8. This would introduce the need for new test cases to deal with escaping and UTF-8 issues (bad UTF-8 encoding, encoding of delimiters etc). I would not change the line-based specification, since that has proved very useful for UNIX command-line filtering and streaming approaches to RDF processing.
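The difference between the two encodings shows up as soon as a literal contains a non-ASCII character. The Python sketch below escapes a string using the `\uXXXX` and `\UXXXXXXXX` forms that 7-bit N-Triples relies on, then shows the UTF-8 bytes the same text would occupy if written directly. The function name is an assumption, and the last lines show the kind of bad-encoding input that new test cases would need to cover.

```python
def escape_ntriples_ascii(text):
    """Escape a literal's text so that the result is 7-bit ASCII."""
    out = []
    for ch in text:
        cp = ord(ch)
        if ch == "\\":
            out.append("\\\\")
        elif ch == '"':
            out.append('\\"')
        elif ch == "\n":
            out.append("\\n")
        elif ch == "\r":
            out.append("\\r")
        elif ch == "\t":
            out.append("\\t")
        elif 0x20 <= cp <= 0x7E:
            out.append(ch)                 # printable ASCII passes through
        elif cp <= 0xFFFF:
            out.append("\\u%04X" % cp)     # 4 uppercase hex digits
        else:
            out.append("\\U%08X" % cp)     # 8 uppercase hex digits
    return "".join(out)


word = "caf\u00e9"
print(escape_ntriples_ascii(word))   # ASCII-only form of the literal text
print(word.encode("utf-8"))          # bytes a UTF-8 based format would carry

# A UTF-8 based format must also decide what to do with malformed input.
try:
    b"caf\xe9".decode("utf-8")
except UnicodeDecodeError as err:
    print("rejected:", err.reason)
```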

In terms of presentation, the current specification should be moved to a separate REC-track document, with any errata folded in and the existing test cases made more prominent.

RDF/XML

RDF/XML[4] has many flaws that I discussed in detail in 2003 in [1]. To the list in that paper, add the lack of support for named graphs.

My position is to leave this format alone. It is not perfect, but it works for (part of) the job it was designed for - machine-to-machine transmission of a single RDF graph - and it has been widely implemented and tested.

The current document could be updated for clarity and fixing some errata and ambiguity. It might need a better way to explain the syntax to RDF triples mapping, although the current one plus the test cases has seemed fine over the years.

If the specification has to be changed, make the syntax simpler by deprecating or removing these parts of the syntax:

New XML Syntaxes

There have been several attempts over the years to make new XML syntaxes for RDF graphs, such as RXR in [2] by the author, TRiX[8] (an XML version of Turtle) and more recently GRIT[9]. None of these gained any substantial traction in implementations. The problem of writing down a graph in a sequential document representing a tree, such as the XML DOM, has proved too hard to solve in a way that is both easy to do and clear.

I recommend that any future WG does NOT attempt to make a new XML syntax, even if the RDF model changes. The current state of the art in data model syntaxes is in the area of textual syntaxes such as JSON or Turtle that are both easy for humans to create/read as well as possible for machines to interpret. The focus should be to make it easy for people, which means "not XML" in 2010.

This is why I created Turtle as a conservative subset of N3, and I contend that this syntax style and approach was successful: Turtle has become widely used and implemented, even without a W3C REC-track document available to define it. The next section discusses Turtle specification needs.

One issue for new syntaxes is whether they should allow encoding a single graph, a set of graphs or both. There are pros and cons to each approach, since sometimes you want to know you have just one graph at hand. If a graph is always named by a document URI, a syntax that holds a single graph per document could work, with the graph name embedded inside: for example, in a similar fashion to @base in Turtle, there could be an @graph directive to name the graph in the current document.
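To show what such a directive might involve for a reader, here is a purely hypothetical Python sketch. No existing syntax defines @graph. The directive's form (modelled on @base), the fallback to the document URI and the `split_graph_name` function are all assumptions.

```python
import re

# Hypothetical directive, written like Turtle's @base: @graph <http://example.org/g> .
GRAPH_DIRECTIVE = re.compile(r'^@graph\s+<([^>]*)>\s*\.$')


def split_graph_name(document, document_uri):
    """Return (graph name, remaining text) for a single-graph document.

    The graph name defaults to the URI the document was fetched from and is
    overridden by an @graph directive if one is present.
    """
    name = document_uri
    body = []
    for line in document.splitlines():
        directive = GRAPH_DIRECTIVE.match(line.strip())
        if directive:
            name = directive.group(1)
        else:
            body.append(line)
    return name, "\n".join(body)


text = """@graph <http://example.org/graphs/people> .
<http://example.org/a> <http://example.org/knows> <http://example.org/b> ."""

name, triples = split_graph_name(text, "http://example.org/people.ttl")
print(name)
print(triples)
```

A document without the directive keeps the name of the URI it came from, so a reader always knows it has exactly one graph at hand.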

Turtle

Turtle[6] has been very successful when measured by the number of uses in explaining RDF, in examples in specifications, and in implementations. It has been recognized as a format that is easy to author and easy to read. The current Team Submission document has issues that need resolving, especially in providing a much clearer mapping to RDF triples, although the test cases have been sufficient for implementers to figure that out. Turtle needs better alignment with the SPARQL triple pattern language, since there are differences in QName / CURIE formats as well as some other minor differences. Turtle's design has been conservative - standardise what people use - rather than adding new syntax items and hoping people use them.

JSON Formats

A standard JSON encoding of RDF is possible to create, and is a good idea in order to align better with the current web development language space focused on JavaScript. JSON does, however, make encoding RDF rather verbose and harder to read, since the format has no native URI datatype and no way to abbreviate URIs. There have been several approaches to improving on this with QNames (CURIEs) or other abbreviations, with moderate success. These tend to generate patterns that look like Turtle written in JSON, with prefixes, blocks and sequences of predicate/objects or objects. This work should be done with a strong focus on usability in JavaScript, and it may even be worth creating a standard JS API for the RDF graph. It should take note of, and potentially be based on, existing RDF JSON work such as RDF/JSON[10], Freebase / Acre[11], irON[12], SIMILE Exhibit[14] and RDFj[15], and should be aligned with the SPARQL JSON result format[13] that may be developed by the current SPARQL WG. The proliferation of formats here indicates that a standardization effort may be of great value.
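The verbosity is easy to see in a small Python sketch. The layout below (subject, then predicate, then a list of object records with "type" and "value" keys) resembles the RDF/JSON[10] style, but it is written from memory as an illustration and should not be read as that specification. The helper names are assumptions.

```python
import json


def uri(value):
    """Object record for a URI; JSON has no URI type, so it is tagged."""
    return {"type": "uri", "value": value}


def literal(value, lang=None, datatype=None):
    """Object record for a literal, with optional language or datatype."""
    node = {"type": "literal", "value": value}
    if lang:
        node["lang"] = lang
    if datatype:
        node["datatype"] = datatype
    return node


def to_json_graph(triples):
    """Group (subject, predicate, object record) triples by subject and predicate."""
    doc = {}
    for s, p, o in triples:
        doc.setdefault(s, {}).setdefault(p, []).append(o)
    return doc


triples = [
    ("http://example.org/a", "http://example.org/name", literal("Alice", lang="en")),
    ("http://example.org/a", "http://example.org/knows", uri("http://example.org/b")),
]
print(json.dumps(to_json_graph(triples), indent=2))
```

Two short statements already need full URIs spelled out three times each, which is the pressure that pushes JSON formats towards Turtle-like prefixes.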

Binary RDF Format

Do not do this - the case for binary XML has been made only slowly and it has had little take-up. There are other approaches, such as the portable application object serializations Protocol Buffers and Thrift, that can be used in co-operation with streaming compression libraries, all of which are widely available.

Summary

My position is that the RDF next steps should be cautious and primarily (90%) based on existing implementation experience, not on research. If it does not have 2 or 3 major, complete and independent implementations today, it should not be done.

References

[1] A retrospective on the development of the RDF/XML Revised Syntax,
David Beckett, ILRT Tech Report, University of Bristol, 2003. http://www.dajobe.org/2003/05/iswc/paper.html

[2] Modernising Semantic Web Markup,
David Beckett, paper presented at XML Europe 2004. http://www.dajobe.org/papers/xmleurope2004/

[3] RDF Syntaxes 2.0,
David Beckett, January 2010. http://journal.dajobe.org/journal/posts/2010/01/24/rdf-syntaxes-2-0/

[4] RDF/XML Syntax Specification (Revised),
David Beckett (ed.), W3C Recommendation 10 February 2004. http://www.w3.org/TR/2004/REC-rdf-syntax-grammar-20040210/

[5] RDF Test Cases,
Jan Grant and David Beckett (eds.), W3C Recommendation 10 February 2004. http://www.w3.org/TR/2004/REC-rdf-testcases-20040210/

[6] Turtle - Terse RDF Triple Language,
David Beckett and Tim Berners-Lee, W3C Team Submission 14 January 2008. http://www.w3.org/TeamSubmission/2008/SUBM-turtle-20080114/

[7] Redland Libraries,
David Beckett. http://librdf.org/

[8] RDF Triples in XML,
J.J. Carroll and P. Stickler, HP Labs Technical Report HPL-2003-268, 11 February 2004. http://www.hpl.hp.com/techreports/2003/HPL-2003-268.html

[9] GRIT - Grokkable RDF Is Transformable,
Niklas Lindström, January 2010. http://code.google.com/p/oort/wiki/Grit

[10] RDF/JSON,
Talis Inc. http://n2.talis.com/wiki/RDF_JSON_Specification

[11] Freebase Data,
Metaweb. http://www.freebase.com/docs/data

[12] irON,
Frédérick Giasson and Michael Bergman. http://openstructs.org/iron/iron-specification

[13] Serializing SPARQL Query Results in JSON,
Kendall Grant Clark, Lee Feigenbaum and Elias Torres (eds.), W3C Working Group Note 18 June 2007. http://www.w3.org/TR/rdf-sparql-json-res/

[14] SIMILE Exhibit JSON Format,
MIT, April 2008. http://simile.mit.edu/wiki/Exhibit/Understanding_Exhibit_Database

[15] RDFj,
Mark Birbeck, Dec 2009. http://code.google.com/p/backplanejs/wiki/Rdfj