User:Rcygania2/RDF Datasets and Stateful Resources

From RDF Working Group Wiki
Due to the requirement for SPARQL compatibility, we seem to be converging on a very simple abstract syntax. Removing terminological baggage, we can define it like this:

An RDF dataset consists of a possibly empty default graph and zero or more IRI-graph-pairs. The IRIs are unique within an RDF dataset.
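As a rough illustration, the definition above could be modelled along the following lines. This is only a sketch: the class name `RDFDataset` and the method `add_pair` are made up for this example, terms are plain strings, and a graph is simply a frozen set of (subject, predicate, object) tuples.

```python
from dataclasses import dataclass, field


@dataclass
class RDFDataset:
    """A possibly empty default graph plus zero or more IRI-graph pairs."""
    # A graph is modelled as a frozenset of (s, p, o) string tuples.
    default_graph: frozenset = frozenset()
    # Keying the pairs by IRI makes each IRI unique within the dataset.
    named: dict = field(default_factory=dict)

    def add_pair(self, iri, graph):
        """Add an IRI-graph pair, rejecting an IRI that is already used."""
        if iri in self.named:
            raise ValueError(f"IRI already used in this dataset: {iri}")
        self.named[iri] = frozenset(graph)


ds = RDFDataset()
ds.add_pair(
    "http://example.org/foaf.rdf",
    {("http://example.org/foaf.rdf#me", "foaf:name", '"Alice"')},
)
```

Adding a second pair with the same IRI raises a `ValueError` in this sketch; whether an implementation rejects or merges in that case is a design choice the definition does not settle.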

But this still leaves many questions unanswered:

  • What terminology shall we attach?
  • What do the IRIs actually identify?
  • How does it work for dereferenceable IRIs?
  • How does it explain change over time?
  • What are the formal semantics?
  • Is an RDF dataset something that has truth value and can be asserted?

This document is an attempt at providing one coherent account that answers these questions.

Not a proposal for standardization

The purpose of this document is not to provide a draft design for standardization in RDF 1.1. I believe that this proposal covers more than we should actually standardize. We should standardize something simpler: a subset of it.

The purpose of this document is to show that the simple abstract syntax defined above is a sufficient foundation on top of which coherent and powerful solutions can be defined.

I hope that this creates confidence that we don't design ourselves into a corner by adopting the simple abstract syntax.

Benefits of the presented design

I believe that the presented design has the following nice properties:

  • Clearly answers the question of what the IRIs denote
  • Works exceptionally well for the “web convention” for datasets (where the graph in an IRI-graph pair is what we get when dereferencing the IRI)
  • Is compatible with other conventions for the use of the IRIs
  • Has formally defined and useful semantics
  • Can cover many interesting use cases by defining additional vocabulary and semantic extensions
  • Nicely fits REST terminology
  • Provides nice intuitions around change over time

Stateful resources

So, what do the IRIs in these IRI-graph-pairs denote?

They denote stateful resources.

Stateful resources are resources that have state, and the state can be expressed as an RDF graph.

We can accept the intuition that the state of a resource may change over time, but that it only has one state at any given time. A stateful resource doesn't necessarily ever have to change its state—it can be immutable.

We note that the notion of a stateful resource is identical to the notion of resource in REST, and that it's reasonable to think of this state as the thing that REST (REpresentational State Transfer) is about.

We will not go into the question of what kinds of things exactly can have state, and what exactly may or may not be a reasonable state for particular kinds of resources. We can accept whatever answer works for REST.

But it is certainly the case that, if we dereference an IRI i and get back a 200 status code along with a representation that encodes an RDF graph G, then it would be reasonable to conclude that G is the state of (the resource denoted by) i.

State pairs

The IRI-graph-pairs shall henceforth be called state pairs because they associate a resource with its state.

State pairs, like RDF triples, have truth value.

Recall that an RDF triple <s,p,o> is true if p denotes a relationship, and the relationship holds between the resources denoted by s and o.

A state pair <i,G> is true if the state relationship holds between the resource denoted by i and the graph G.

The state relationship can be thought of as a function from IRIs to graphs. The dereference function is a reasonable state relationship, but others (less likely to be interoperable) are possible.

It is important to keep in mind that the state of a resource cannot necessarily be understood as “true” in some objective sense. If i denotes the ramblings of a madman, then the state may be an RDF graph G containing the most outlandish nonsense. G would be false, but <i,G> would be true, because G is the state of the resource denoted by i.
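The difference between the truth of a state pair and the truth of its graph can be sketched in a few lines. Everything here is an assumption made for illustration: the `STATE` table stands in for the state relationship, `WORLD` stands in for "what is actually the case", and the `ex:` names are invented.

```python
# Hypothetical state relationship: IRI -> the graph that is its state.
STATE = {
    "http://example.org/madman": frozenset({
        ("ex:Moon", "ex:madeOf", "ex:Cheese"),
    }),
}

# Hypothetical set of ground triples we take to hold in the world.
WORLD = frozenset({("ex:Moon", "ex:madeOf", "ex:Rock")})


def state_pair_is_true(iri, graph):
    """<i,G> is true if G is the state of the resource denoted by i."""
    return STATE.get(iri) == graph


def graph_is_true(graph):
    """A ground graph is true if every one of its triples holds."""
    return graph <= WORLD


ramblings = STATE["http://example.org/madman"]
pair_truth = state_pair_is_true("http://example.org/madman", ramblings)
graph_truth = graph_is_true(ramblings)
```

Here `pair_truth` is `True` while `graph_truth` is `False`: the pair correctly records what the madman says, even though what he says is false.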

rdf:StatefulResource and rdf:state

Let's define a property, rdf:state, that denotes the state relationship (whatever that relationship actually is).

The range of this property is the class of RDF graphs, rdf:Graph. The only way we provide to write down graphs in our concrete syntaxes is by writing down state pairs.

The domain of this property is, unsurprisingly, rdf:StatefulResource.

Technically speaking, rdf:state is an owl:FunctionalProperty: A resource can only have one state at a time.

Thus, a state pair <i,G> implies a number of triples (glossing over the fact that we can't actually write down triples involving G):

?i a rdf:StatefulResource.
?G a rdf:Graph.
?i rdf:state ?G.
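In a programming language we can sidestep the "can't write down triples involving G" problem by letting the graph object itself appear as a term. A minimal sketch, assuming graphs are frozensets of string tuples and using the hypothetical helper name `implied_triples`:

```python
def implied_triples(iri, graph):
    """Return the triples implied by the state pair <iri, graph>.

    The graph object itself is used as a term, which ordinary RDF
    concrete syntaxes do not allow.
    """
    return {
        (iri, "rdf:type", "rdf:StatefulResource"),
        (graph, "rdf:type", "rdf:Graph"),
        (iri, "rdf:state", graph),
    }


g = frozenset({("<foaf.rdf#me>", "rdf:type", "foaf:Person")})
implied = implied_triples("<foaf.rdf>", g)
```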

rdf:semantics and dataset entailment

Now, the state of a stateful resource is an RDF graph. RDF graphs have truth values. RDF graphs entail other RDF graphs. The entailed graphs of the state of a resource can be extremely interesting. This is what the rdf:semantics property is about.

If <i,G1> is a true state pair, and G1 entails G2, then the following statement holds (again, glossing over the fact that we can't write down triples involving graphs):

?i rdf:semantics ?G2.

For example, given this state pair (written down in TriG syntax):

<foaf.rdf> { <foaf.rdf#me> a foaf:Person; foaf:name "Alice". }

we could infer the following triples and graphs under dataset entailment (written down in a sort of N3-like syntax that allows triples involving graphs):

<foaf.rdf> rdf:semantics { <foaf.rdf#me> a foaf:Person; foaf:name "Alice". }
<foaf.rdf> rdf:semantics { <foaf.rdf#me> a foaf:Person. }
<foaf.rdf> rdf:semantics { [] foaf:name "Alice". }
<foaf.rdf> rdf:semantics { <foaf.rdf#me> a foaf:Agent. }

The first one follows trivially from the state pair. The others follow under simple entailment (2 and 3) and under RDFS+FOAF-entailment (4).
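The simple-entailment cases can be checked mechanically. The sketch below relies on the interpolation lemma from RDF Semantics (G1 simply entails G2 if some instance of G2 is a subgraph of G1) and uses a naive backtracking search. It assumes terms are plain strings, blank nodes are the strings starting with `_:`, and the function names are invented for this example. It does not implement RDFS or FOAF reasoning, so the fourth inference is out of its reach.

```python
def is_bnode(term):
    """Blank nodes are modelled as strings starting with '_:'."""
    return term.startswith("_:")


def simply_entails(g1, g2):
    """True if some instance of g2 is a subgraph of g1."""
    def match(remaining, binding):
        if not remaining:
            return True
        (s, p, o), rest = remaining[0], remaining[1:]
        for (s1, p1, o1) in g1:
            if p != p1:
                continue
            new = dict(binding)
            ok = True
            for term, target in ((s, s1), (o, o1)):
                if is_bnode(term):
                    # A blank node must map to the same term everywhere.
                    if new.setdefault(term, target) != target:
                        ok = False
                        break
                elif term != target:
                    ok = False
                    break
            if ok and match(rest, new):
                return True
        return False

    return match(list(g2), {})


state = {
    ("<foaf.rdf#me>", "rdf:type", "foaf:Person"),
    ("<foaf.rdf#me>", "foaf:name", '"Alice"'),
}
second = simply_entails(state, {("<foaf.rdf#me>", "rdf:type", "foaf:Person")})
third = simply_entails(state, {("_:x", "foaf:name", '"Alice"')})
fourth = simply_entails(state, {("<foaf.rdf#me>", "rdf:type", "foaf:Agent")})
```

With this state graph, `second` and `third` come out true, and `fourth` comes out false because it needs the FOAF schema plus RDFS entailment.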

Note that the choice of name here, rdf:semantics, is inspired by, but not quite compatible with, cwm's log:semantics. The log:semantics property actually corresponds closer to rdf:state.

Formalism

Here is my feeble attempt at a formal definition. I need to think more about this.

A dataset interpretation has, in addition to the usual stuff:

  • A set ISR of stateful resources
  • A set IG of RDF graphs

If E is a state pair <i,G> then I(E) = true if E is in IEXT(I(rdf:state))

If E is an RDF dataset then I(E) = true if its default graph is true and all of its state pairs are true

<i,G2> is in IEXT(I(rdf:semantics)) if <i,G1> is in IEXT(I(rdf:state)) and G1 entails G2

IG = ICEXT(I(rdf:Graph))

ISR = ICEXT(I(rdf:StatefulResource))

Some axiomatic triples:

rdf:state rdfs:domain rdf:StatefulResource.
rdf:state rdfs:range rdf:Graph.
rdf:state rdfs:subPropertyOf rdf:semantics.
rdf:semantics rdfs:domain rdf:StatefulResource.
rdf:semantics rdfs:range rdf:Graph.
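To see what these axiomatic triples buy us, here is a small sketch that applies the standard RDFS patterns for rdfs:domain, rdfs:range and rdfs:subPropertyOf to a single rdf:state triple. The function name `rdfs_closure` is made up, only these three patterns are implemented, and the string `"G"` is just a placeholder standing in for a graph.

```python
AXIOMS = {
    ("rdf:state", "rdfs:domain", "rdf:StatefulResource"),
    ("rdf:state", "rdfs:range", "rdf:Graph"),
    ("rdf:state", "rdfs:subPropertyOf", "rdf:semantics"),
    ("rdf:semantics", "rdfs:domain", "rdf:StatefulResource"),
    ("rdf:semantics", "rdfs:range", "rdf:Graph"),
}


def rdfs_closure(triples):
    """Apply the domain, range and subPropertyOf patterns to a fixpoint."""
    facts = set(triples) | AXIOMS
    while True:
        new = set()
        for (s, p, o) in facts:
            for (a, q, b) in facts:
                if a != p:
                    continue
                if q == "rdfs:domain":
                    new.add((s, "rdf:type", b))
                elif q == "rdfs:range":
                    new.add((o, "rdf:type", b))
                elif q == "rdfs:subPropertyOf":
                    new.add((s, b, o))
        if new <= facts:
            return facts
        facts |= new


closure = rdfs_closure({("<foaf.rdf>", "rdf:state", "G")})
```

The closure then contains the type triples for `<foaf.rdf>` and `G`, as well as `<foaf.rdf> rdf:semantics G`, which matches the "follows trivially from the state pair" case.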

I have toyed with the idea of a “state extension”, similar to class extensions and property extensions, for stateful resources. The state extension of a stateful resource might be the set of all interpretations that satisfy the resource's state graph. But maybe this is not necessary.

Different kinds of state relationships

One nice thing here is that the concept of a state relationship makes it easy to formally explain various conventions.

For example, let's say we have a crawler that does nothing but look for CSV files, and if it finds any, converts them to RDF and stores them in an RDF dataset. Now the state function here is not the usual “dereference and try parsing as RDF” function, but a slightly different one that goes “dereference and try parsing as CSV”. This is obviously not going to be terribly compatible with the expectations of others, but may be completely appropriate in some internal scenarios.

The same goes if we want to use arbitrary resources as the IRIs, and stuff their descriptions into the graphs: We just consider all resources as stateful, and their state is in fact their description. It's just a very weird state function.

Semantic extensions can put additional constraints onto the state relationship, for example “web dataset entailment” could demand that the state relationship is a superset of the dereference-and-parse-as-RDF relationship.
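The CSV crawler scenario can be sketched by treating the state function as a pluggable parameter. Everything below is hypothetical: `FAKE_WEB` stands in for actual dereferencing, and the way rows and columns are turned into triples is an arbitrary choice made for this example, not an established CSV-to-RDF mapping.

```python
import csv
import io

# Stand-in for the web: IRI -> (media type, body). Invented sample data.
FAKE_WEB = {
    "http://example.org/people.csv": ("text/csv", "name,age\nAlice,30\n"),
    "http://example.org/logo.png": ("image/png", ""),
}


def csv_state(iri):
    """State function: 'dereference and try parsing as CSV'."""
    media_type, body = FAKE_WEB[iri]
    if media_type != "text/csv":
        return None  # this state function assigns no state to the resource
    triples = set()
    for n, row in enumerate(csv.DictReader(io.StringIO(body))):
        subject = f"{iri}#row{n}"
        for column, value in row.items():
            triples.add((subject, f"{iri}#{column}", f'"{value}"'))
    return frozenset(triples)


def build_state_pairs(iris, state_function):
    """Collect the state pairs <iri, state_function(iri)> that exist."""
    pairs = {}
    for iri in iris:
        graph = state_function(iri)
        if graph is not None:
            pairs[iri] = graph
    return pairs


pairs = build_state_pairs(FAKE_WEB, csv_state)
```

Swapping `csv_state` for a "dereference and parse as RDF" function would give the usual web convention without touching `build_state_pairs`.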

Explaining owl:imports

Stated as a rule over N3-like syntax:

?g1 rdf:semantics { ?g1 owl:imports ?g2 }.
?g2 rdf:state ?imported.
  =>
?g1 rdf:semantics ?imported.

This shows why it's good to have both rdf:state and rdf:semantics: The one records the exact graph and is functional; the other just claims that something holds true within the state of the resource. I don't think this would be possible in a semantics that only does graph baptism (because we can't “modify” the importing graph).
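The rule can be run as a tiny forward-chaining loop. This is a sketch under simplifying assumptions: `state` maps each IRI to its single state graph, `semantics` maps each IRI to the set of graphs known to hold for it, and "G1 rdf:semantics { t }" is approximated by checking whether t occurs in one of those graphs. The `ex:` names are invented.

```python
def apply_imports_rule(state, semantics):
    """Apply the owl:imports rule until nothing new is derived.

    state:     dict IRI -> state graph (rdf:state, functional)
    semantics: dict IRI -> set of graphs that hold (rdf:semantics)
    """
    changed = True
    while changed:
        changed = False
        for g1, graphs in semantics.items():
            for graph in list(graphs):
                for (s, p, o) in graph:
                    if s == g1 and p == "owl:imports" and o in state:
                        if state[o] not in graphs:
                            graphs.add(state[o])
                            changed = True
    return semantics


state = {
    "ex:a": frozenset({("ex:a", "owl:imports", "ex:b")}),
    "ex:b": frozenset({("ex:B", "rdfs:subClassOf", "ex:C")}),
}
# rdf:state is a subproperty of rdf:semantics, so start from the states.
semantics = {iri: {graph} for iri, graph in state.items()}
apply_imports_rule(state, semantics)
```

After the call, the state of `ex:b` is among the graphs that hold for `ex:a`, while `state["ex:a"]` itself is untouched. That is the point made above: rdf:state stays exact and functional, and only rdf:semantics grows.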

Open issues

  • Blank node scope? Probably it's ok to say that blank nodes are shared
  • Any point in defining rdf:ImmutableResource?
  • Come up with a nice example why rdf:state is needed—something where we really want to know exactly what triples are in the state
  • Formalism needs to be better
  • Is the entailment regime for determining rdf:semantics the same as the entailment regime we use for the entire dataset?
    • Anything said in the default graph (including the state pairs, which can be considered part of the default graph) should probably be true in any interpretation of any state graph too, no? Test case: State graph talks about one of the stateful resources
    • Perhaps better to say that it's a good idea to interpret the individual graphs with the same state function as the default graph
    • Perhaps better to say that it's a good idea to assume that stuff randomly published on the web should be interpreted with the deref+parse state function
  • Do we really need rdf:state and rdf:semantics as properties? Maybe see how a minimal thing without would work
  • RDF datasets are mathematical constructs. We probably do need a term for mutable datasets. Otherwise, what's the thing in an RDF store? What does a TriG file's URI identify? “Graph Store” is not a great term once we go beyond SPARQL Update. How about RDF dataspace as a mutable version of RDF dataset?