Here’s what I think should be standardized at some point, soon,
in the Semantic Web infrastructure. These items are at various
levels of maturity; some are probably ready for a W3C Working Group
right now, while others are in need of research. They are mostly
orthogonal and most can be handled in independent efforts. (I would
lean against forming a single RDF Working Group to handle all of
this; that would be slower, I think.)
To be clear, when I say “RDF 2” I mean it like OWL 2: an important step
forward, but still compatible with version 1. I’m not interested in
breaking any existing RDF systems, or even in causing their users
significant annoyance. In some traditions, where the major version
number is only incremented for incompatible changes, this would be
called a 1.1 release. In contrast, at W3C we normally signal a
major, incompatible change by changing the name, not the version
number. (And we rarely do that: the closest examples I can think of
are CSS->XSL, PICS->POWDER, and HTML->XHTML.) The nice thing
about using a different name is it makes clear that users each
decide whether to switch, and the older design might live on and
even win in the end. So if you want to make deep, incompatible
changes to RDF, please pick a new name for what you’re proposing,
and don’t assume everyone will switch.
This is partially a trip report for ISWC, because the
presentations and especially the hallway and lounge conversations
helped me think about all this.
Note that although I work for W3C, this is certainly not a
statement of what W3C will do next. It’s not my decision, and even
if it were, there would be a lot of community discussion first.
This is just my own opinion, subject to change after a little more
sleep. Formally the decisions about how to allocate W3C resources
among the different possible standards efforts are made by W3C
management guided by the folks who provide those resources, via
their representatives on the Advisory
Committee (AC). If the direction of the W3C is important to you
or your business, it may be worthwhile to join and participate in
that process.
1. RDF and XML interoperation
There’s a pretty big divide between RDF and XML in the real
world. It’s a bit like any divide between different programming
languages or different operating systems: users have to pick which
technology family to adopt and invest in. It’s hard to switch,
later, because of all the investment in tools, built systems,
education, and even social networks. (People who use some
technology build social and professional relationships with other people
who use the same technology. Thus we have an XML community, an RDF
community, etc. Few people are motivated to be in both
communities.)
I think we should have better tools for bridging the gap,
technologically, so that when data is published in XML, it’s easy
for RDF consumers to use it, and when the data is published in RDF,
it’s easy for XML consumers to use it.
The leading W3C answer is GRDDL, which I think is pretty
good, but could use some love. I’d like to see support for the
transforms being in Javascript, which I think is probably the
dominant language these days for writing code that’s going to run
on someone else’s computer. It certainly has a bigger community
than XSLT. I’d probably support Java bytecode, too.
I would also like to see some way to support third-party GRDDL,
where the transform is provided by someone not associated with
either the data provider or data consumer. Nova Spivack gave a
keynote where he talked about this feature of
T2. They’re focused on HTML not XML, but the solution is
probably the same.
Beyond GRDDL, I think there’s room for a special data
format that bridges the gap. I’ve called it “rigid rdf” or
“type-tagged xml” in the past: it’s a sub-language of RDF/XML, or a
style of writing XML, which can be read by RDF/XML parsers and is
also amenable to validation and processing using XML schemas.
Basically you take away all choices one has in serializing
RDF/XML.
I note that the
Cambridge Communiqué is ten years old, this month. It proposed
schema annotation as an approach, and that’s not a bad one, either.
I haven’t heard of anyone working on it recently, but maybe that
will change if the XML community starts to see more need to export
RDF.
Amusingly, while I was talking to Gary Katz from MarkLogic
about this, he mentioned XSPARQL as a possible solution, and
I pointed out Axel
Polleres (XSPARQL project leader) was sitting right next to us.
So, they got to talk about it. XSPARQL doesn’t excite me,
personally, because I don’t use either SPARQL or XQuery, but
objectively, yes, it might solve the problem for some significant
userbase.
2. Linked Data Inference
For me, an essential element of a working Linked Data ecosystem
is automatic translation of data between vocabularies. If you
provide data about the migration of frogs in one vocabulary, and my
tools are looking for it in another one, the infrastructure should
(in many cases) be able to translate for us. We need this because
we can’t possibly agree on one vocabulary (for any given domain)
that we’ll all use for all time. Even if we can agree for now,
we’ll want this so that we can migrate to another vocabulary some
time in the future.
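As a rough illustration of the simplest kind of translation, here is a minimal sketch that rewrites predicates from one vocabulary into another using a hand-written mapping. The two frog vocabularies, their IRIs, and the mapping table are all hypothetical, invented for this example; real translation would usually need rules richer than one-to-one predicate renaming.

```python
# Hypothetical vocabularies; none of these IRIs come from a real ontology.
VOCAB_A = "http://example.org/vocab-a#"
VOCAB_B = "http://example.org/vocab-b#"

# Assumed one-to-one mapping from predicates in vocabulary A to vocabulary B.
PREDICATE_MAP = {
    VOCAB_A + "sightedAt": VOCAB_B + "observationSite",
    VOCAB_A + "species": VOCAB_B + "taxon",
}


def translate(triples, predicate_map):
    """Rewrite mapped predicates; triples with unmapped predicates pass through."""
    return [(s, predicate_map.get(p, p), o) for (s, p, o) in triples]


data = [
    ("http://example.org/frog/17", VOCAB_A + "species", "Rana temporaria"),
    ("http://example.org/frog/17", VOCAB_A + "sightedAt", "http://example.org/pond/3"),
    ("http://example.org/frog/17", "http://example.org/other#note", "tagged"),
]
translated = translate(data, PREDICATE_MAP)
```

A tool looking for data in vocabulary B could run its queries over `translated` instead of `data`. The interesting question, taken up below, is where the mapping comes from.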
Inference using OWL (and its subsets like RDFS) provides some of
this, but I don’t think it’s enough. RIF fills in some more, but
the WG did not think much about this use case, and there might be
some glue missing. Maybe we can get a WG Note out of RIF to help this
along.
I’d like us to be clear about first principles: when you’re
given an RDF graph, and you’re looking for more information that
might be useful, you should dereference the predicate IRIs to learn
about what kinds of inference you’re entitled to do. And then,
given resources and suitable reasoners, you should do it. That is,
the use of particular IRIs as predicates implies certain
things, as defined by the IRI’s owner. The graph is invoking
certain logics by using those IRIs. (Of course you can always infer
things that were not implied, but as among humans, those
“inferences” are really just guesses you are making. They have
quite a different status from true implications.)
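The first step of that procedure, working out which documents to fetch for a given graph, can be sketched as follows. This is only an illustration under my own assumptions: triples are plain Python tuples, the example IRIs are made up, and the sketch stops before any network access or reasoning.

```python
from urllib.parse import urldefrag


def documents_to_dereference(triples):
    """Collect the distinct documents behind the predicate IRIs of a graph.

    A hash IRI such as http://example.org/vocab-a#species is served by the
    document http://example.org/vocab-a, so the fragment is dropped. A slash
    IRI has no fragment and is dereferenced as-is.
    """
    docs = []
    for _subject, predicate, _object in triples:
        doc = urldefrag(predicate).url
        if doc not in docs:
            docs.append(doc)
    return docs


# Hypothetical graph; the IRIs are placeholders.
graph = [
    ("http://example.org/frog/17", "http://example.org/vocab-a#species", "Rana temporaria"),
    ("http://example.org/frog/17", "http://example.org/vocab-a#sightedAt", "http://example.org/pond/3"),
    ("http://example.org/frog/17", "http://example.org/terms/name", "Frog 17"),
]
docs = documents_to_dereference(graph)
```

Each document in `docs` would then be fetched and inspected for whatever the IRI’s owner has published there (OWL axioms, RIF rules, or something else) before handing the graph to a suitable reasoner.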
If this is put together properly, and the logics are constructed
in the right form, I think we’ll get the dynamic, on demand
translation I’m looking for. I imagine RIF could be very useful for
this, but reasoner plugins written in Javascript or Java bytecode
could be a better solution in some cases.
Some of my thinking here is in my workshop keynote slides, but later
conversations with various folks, especially Pat Hayes and TimBL,
helped it along. There’s more work to do here. I think it’s pretty
small, but crucial.
3. Presentation Syntaxes
RDF, OWL, and RIF all have hideous primary exchange syntaxes and
some decent not-W3C-recommended alternative serializations. I’m not
really sure what can practically be done here that hasn’t been
done.
At the very least, I’d like to see a nice RDF-friendly presentation
syntax for RIF. A bit like N3, I suppose. I did some work on this;
maybe I can finish it up, and/or someone else can run with it.
OWL 2 has 3+n syntaxes, where
n is the number of RDF syntaxes we have. Exactly one of those
syntaxes is required of all consumers, for interchange. I’ll be
interested to see how this plays out in the market.
4. Multi-Graph Syntax
Most systems that work with RDF handle multiple graphs at the
same time. Sometimes they do this by storing the triples in a quad
store, with the fourth entry being a graph identifier. This works
pretty well, and SPARQL supports querying such things.
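For readers who haven’t seen one, a quad store can be sketched in a few lines. This is a toy in-memory version with names of my own choosing (`QuadStore`, `add`, `match`), not the design of any particular product; real stores add indexes and persistence.

```python
from collections import defaultdict


class QuadStore:
    """Toy in-memory quad store: triples grouped by a graph identifier."""

    def __init__(self):
        self._graphs = defaultdict(set)

    def add(self, s, p, o, graph):
        """Store the triple (s, p, o) in the named graph."""
        self._graphs[graph].add((s, p, o))

    def match(self, s=None, p=None, o=None, graph=None):
        """Yield (s, p, o, graph) quads matching the pattern; None is a wildcard."""
        graphs = [graph] if graph is not None else list(self._graphs)
        for g in graphs:
            for (ts, tp, to) in self._graphs.get(g, ()):
                if ((s is None or s == ts)
                        and (p is None or p == tp)
                        and (o is None or o == to)):
                    yield (ts, tp, to, g)


# Hypothetical data: the same subject described in two source graphs.
store = QuadStore()
store.add("ex:frog17", "ex:species", "Rana temporaria", "http://example.org/graph/a")
store.add("ex:frog17", "ex:species", "Rana arvalis", "http://example.org/graph/b")
store.add("ex:pond3", "ex:depth", "2", "http://example.org/graph/a")

all_species = sorted(store.match(p="ex:species"))
only_a = sorted(store.match(graph="http://example.org/graph/a"))
```

Leaving `graph` unset searches every graph at once, while setting it restricts the match to one graph, which is roughly the distinction SPARQL makes between querying with and without a `GRAPH` clause.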
We don’t have a way to exchange multiple graphs in the same
document, however. N3 has graph literals (originally called
contexts), and there was some work under the term named graphs, which is kind
of the opposite approach.
Personally, I don’t yet understand the use case for
interchanging multiple graphs in one document, so I’m not sure
where to go with this.
Hmmm. I guess RIF could be used for this. You can write RDF
triples as RIF frame facts, and the rif:Document format allows
multiple rulesets, each with an optional IRI identifier, in the
same document. ETA: RIF also gives you an exchange syntax where you
can syntactically put literals in the subject and use bnodes as
predicates, if you want. But now you’re technically exchanging RIF
Frames instead of RDF Triples.
5. RDF Graph Validation
When writing software that operates on RDF data, it’s really
nice to know the shape of the data you’ll find. It’s even nicer if
software can check that that’s actually what you got, and if
reasoners can work to fill in any missing pieces.
I don’t exactly understand how important or unimportant this is.
It’s closely related to the Duck Typing debate.
Whatever mechanisms make duck typing work (eg exception handling,
reflection, side-effect-free programming) probably help folks be
okay without graph validation. But I think folks trained on
C++/Java or XML Schema would be much happier with
RDF if it had this.
The easiest solution might be using rigid RDF. One could
probably also do it with SPARQL, essentially publishing the graph
patterns that will match the data in the expected graphs.
The most interesting and weird approach is to use OWL. Of
course, OWL is generally used to express knowledge and reason about
some application domain, like books, genes, or battleships. But
it’s possible to use OWL to express knowledge about RDF
graphs about the application domain. In the first case, you
say every book has one or more authors, who are humans. In the
second case, you say every book-node-in-a-valid-graph has one or
more author links to a human-node in the same graph. At least
that’s the general idea. I don’t know if this can actually be made
to work, and even if it can, it risks confusing new OWL users about
one of the subjects they’re already seriously prone to get
wrong.
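To make the second reading concrete, here is a minimal sketch of the check it describes, written as ordinary code rather than OWL. The `Book`, `Human`, and `author` terms live in a hypothetical example.org namespace, and the check is deliberately closed-world: it only looks at what is in the given graph, which is exactly where it differs from normal OWL reasoning.

```python
RDF_TYPE = "http://www.w3.org/1999/02/22-rdf-syntax-ns#type"
EX = "http://example.org/books#"  # hypothetical vocabulary for this sketch


def invalid_books(triples):
    """Return book nodes that lack an author link to a human node in this graph."""
    triples = set(triples)
    humans = {s for (s, p, o) in triples if p == RDF_TYPE and o == EX + "Human"}
    books = {s for (s, p, o) in triples if p == RDF_TYPE and o == EX + "Book"}
    with_author = {s for (s, p, o) in triples if p == EX + "author" and o in humans}
    return sorted(books - with_author)


graph = [
    (EX + "book1", RDF_TYPE, EX + "Book"),
    (EX + "book1", EX + "author", EX + "alice"),
    (EX + "alice", RDF_TYPE, EX + "Human"),
    (EX + "book2", RDF_TYPE, EX + "Book"),
    (EX + "book2", EX + "author", EX + "bob"),  # bob is never typed as Human here
]
problems = invalid_books(graph)
```

Here `book2` is reported, because the graph never says that `bob` is a human. Under the first reading, a reasoner would instead conclude that `bob` must be a human (given a suitable range axiom), which is the confusion described above.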
6. Editorial Issues
Finally, I’d like some portions of the 2004 RDF spec rewritten,
to better explain what’s really going on and guide people who
aren’t heavily involved in the community. This could just be a
Second Edition, with no need for RDF 2, because no implementation
changes would be involved.
I’d like us to include some practical advice about when/how to
use List/Seq/Bag/Alt, and reification, maybe going so far as to
deprecate some of them (IMHO, all but List). Maybe bring in some of
the best-practice stuff on publishing and n-ary relations.
I understand Pat Hayes would like to explain blank nodes
differently, explicitly introducing the notion of “surfaces” (what
I would call knowledge bases, probably). Personally, I’d love to go
one step farther and get rid of all “graph” terminology, instead
just using N-Triples as the underlying formalism, but I might be a
minority of one on that.
ETA: Of course we should also change “URI-Reference” to “IRI”,
and stuff like that.
Okay, that’s my list. What’s yours? (For long replies, I suggest
doing it on your own blog, and using trackback or posting a link
here to that posting.) Discussion on semantic-web@w3.org is fine,
too.