Status of this Document

The document evolved from a set of notes on issues related to recording provenance in RDF data. It represents only the thoughts of the author and not any sort of recommendation or note by the W3C. The author hopes that these notes will be helpful in evaluating the issues and moving toward a popular solution. There is currently no commitment to maintain any particular version of this document. Future revisions are likely to replace this version, $Revision: 1.55 $. The author will maintain anchor tags within this document. Obsolete anchor tags will be listed in the obsolete anchors section.

Abstract

While described as a meta-data format, RDF is being used to express data that is not explicitly about documents. Applications like RDF query and annotation create scenarios where it would be convenient to express data with provenance information about that data. RDF provides no model-level division between data and meta-data (contrast with UML with its data, model, meta-model and meta-meta-model distinctions) so RDF should allow for mixed data and meta-data. This document describes ways to convey or query attributions, context, and provenance within the RDF model and by extending the RDF model.

Issues

This paper discusses approaches to recording provenance with an eye to the following issues.

Round-tripping versus paper-trail.
Should a document look identical as it passes from agent to agent, or should it be altered to reflect the history?
Protection against spoofing.
If a document reflects its history, how does one distinguish these added statements from the arcs in the original documents?
Intuitive model.
For consistency's sake, making assertions about assertions should be analogous to making any other assertions.
Simplicity of syntax.
The RDF/XML syntax for making assertions about RDF graphs should not repulse potential users.
Consistency of data.
This is, of course, paramount, though, as explained later, consistency is a bit of a continuum.
@@@
@@@

Provenance

Many RDF database implementations associate some source properties with a Statement. How, or even whether, these sources are made available to the application varies. The implementation styles (conjecture here) boil down to one of two approaches, quads and formulas.

Quads

Quad-based systems decorate each triple with an identifier for a group of triples with which it is associated. This identifier may be a simple URI or it may be a pointer to a richer collection of information about the group.

subject  predicate  object                     source
annot1   annotates  doc1                       doc2
annot1   context    xpointer(/html/body/p[1])  doc2
check3   payee      marja                      doc18
check3   amount     18 USD                     doc18

two distinct graphs with different colored arcs

Algae
stored in an RDF database, used in Annotea
Haystack
 

This system makes it easy to join data from multiple sources as the source is merely another attribute of the underlying data structure. The ever-popular triplesMatching query traverses the triples from all sources. For example, the following would find everything that annotates doc1.

annotators = triplesMatching(?, annotates, doc1, ?)
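
As a minimal sketch of the quad approach, the table above can be held in memory with the source as a fourth column. Python is used here purely for illustration; QuadStore and triples_matching are hypothetical names and not the API of Algae or Haystack, and None plays the role of the ? wildcard.

```python
from typing import NamedTuple

class Quad(NamedTuple):
    subject: str
    predicate: str
    object: str
    source: str

class QuadStore:
    """Hypothetical in-memory quad store; not the API of Algae or Haystack."""

    def __init__(self):
        self.quads = []

    def add(self, s, p, o, source):
        self.quads.append(Quad(s, p, o, source))

    def triples_matching(self, s=None, p=None, o=None, source=None):
        # None acts as the "?" wildcard from the pseudocode above.
        pattern = (s, p, o, source)
        return [q for q in self.quads
                if all(want is None or want == got
                       for want, got in zip(pattern, q))]

store = QuadStore()
store.add("annot1", "annotates", "doc1", "doc2")
store.add("annot1", "context", "xpointer(/html/body/p[1])", "doc2")
store.add("check3", "payee", "marja", "doc18")
store.add("check3", "amount", "18 USD", "doc18")

# One query spans every source because the source is just another column.
annotators = store.triples_matching(p="annotates", o="doc1")
```

Each result still carries its source, so the caller can see that the match came from doc2 without a second lookup.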

Formulas

Another set of RDF databases does not tag the triples with their source, but instead aggregates triples by a context which may be associated with a source.

context   triples
0x8ce532  subject  predicate  object
          annot1   annotates  doc1
          annot1   context    xpointer(/html/body/p[1])
0x8cfcd8  subject  predicate  object
          check3   payee      marja
          check3   amount     18 USD

two distinct boxes containing graphs

The application may then associate a context with some set of properties like source document.

cwm
formulas are associated with the source document via the log:semantics property. See the travel use case.
redland
API functions like librdf_model_add_statements add statements to a model (analogous to a context).
jena
(statement interface, model interface)

This approach is perhaps more faithful to the standard model for RDF — the triples have only three components. Crossing source boundaries requires iteration across known sources:

annotators = () // empty set
for each source in sources
  annotators .= source.triplesMatching(?, annotates, doc1)
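
The same iteration can be sketched in Python, under the assumption that each context is simply a set of three-component triples keyed by a context identifier. The contexts dictionary and the triples_matching helper are illustrative names only; they do not correspond to the cwm, redland or jena interfaces.

```python
# Hypothetical context-keyed store: each context holds plain 3-component triples.
contexts = {
    "0x8ce532": {
        ("annot1", "annotates", "doc1"),
        ("annot1", "context", "xpointer(/html/body/p[1])"),
    },
    "0x8cfcd8": {
        ("check3", "payee", "marja"),
        ("check3", "amount", "18 USD"),
    },
}

def triples_matching(triples, s=None, p=None, o=None):
    """Return the triples in one context that match; None is a wildcard."""
    pattern = (s, p, o)
    return {t for t in triples
            if all(want is None or want == got
                   for want, got in zip(pattern, t))}

# Crossing source boundaries means visiting every known context in turn.
annotators = set()
for context_id, triples in contexts.items():
    annotators |= triples_matching(triples, p="annotates", o="doc1")
```

Note that the merged result no longer records which context supplied each triple; keeping that information requires carrying context_id alongside the match.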

These two implementation strategies were described to illustrate the following point: Neither of these approaches actually describes the relationship between the source and the triple. Instead, they are only associated via an implementation data structure and implementation-specific APIs. This has led to varied special-purpose solutions and great opportunity for improvement (meaning the state of the art is not so good). A unified solution to this would solve problems in implementation, serialization, and querying (and rules).

Propositional Attitude

When Marja publishes doc2 which contains the statement annot1 annotates doc1, she is making an assertion in a protocol that does not allow caveats to interpretation. That is, a conventional RDF document asserts a conjunction of all of the statements in the document which may be capriciously reduced to any subset of the statements. Any caveats (like the ever-popular sentence suffix ...NOT!) may not be assumed to be understood and therefore must not be part of the interpretation of the publication.

Marja's publication may be stored with the propositional attitude says (Marja says "annot1 annotates doc1"). In all following examples, we use says when we take some data at face value and associate it with the publisher. If we learn of Marja's assertion from Bob, we may say Bob says "Marja says 'annot1 annotates doc1.'" If we model a limited number of propositional attitudes, we can shortcut some apparently redundant chains such as Marja says "Marja says 'annot1 annotates doc1.'" If we minimize the number of propositional attitudes, we can avoid modeling assertions that may be uselessly subtle like Marja says "Marja believes 'Marja is convinced that 'annot1 annotates doc1.''"
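
The shortcut for redundant chains can be illustrated with a small Python sketch. It assumes a single propositional attitude, says, and one simplification rule (X says "X says S" reduces to X says S); the Says class and the shortcut function are hypothetical names introduced only for this illustration.

```python
from dataclasses import dataclass
from typing import Union

@dataclass(frozen=True)
class Says:
    who: str
    what: Union["Says", str]   # a quoted statement or a nested attribution

def shortcut(claim):
    """Collapse 'X says "X says S"' into 'X says S' (an assumed rule)."""
    if isinstance(claim, str):
        return claim
    inner = shortcut(claim.what)
    if isinstance(inner, Says) and inner.who == claim.who:
        return inner
    return Says(claim.who, inner)

stmt = "annot1 annotates doc1"
redundant = Says("Marja", Says("Marja", stmt))   # collapses to Marja says stmt
relayed = Says("Bob", Says("Marja", stmt))       # different speakers: kept as is
```

With more than one attitude (believes, is convinced that, ...) the rule would need a case for every pair of attitudes, which is one reason to keep the set small.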

Serialization

Many RDF documents express no information about the triples within the document; they merely present a flat set of data. When answering a query, it is often preferable to express not only the data that fits the query, but also some information about the source or treatment of that data. The main approaches have been broken down as follows:

Processor-added Properties

Within a particular application, it is possible to isolate a property that will not occur in that application data. This property can be added to serialized graphs to identify provenance information. This approach asserts a fictitious relationship between an object in the graph and the provenance information for that graph. It does not attempt to convey the actual relationship between any arc in the graph and the provenance. A good example of the special-purpose solution is Annotea's use of the property attribution identifying the source of at least one of the statements. For example:

<?xml version="1.0"?>
<!-- session-id 1026827315.344508 -->
<r:RDF
 xmlns:d="http://purl.org/dc/elements/1.0/"
 xmlns:http="http://www.w3.org/1999/xx/http#"
 xmlns:a="http://www.w3.org/2000/10/annotation-ns#"
 xmlns:r="http://www.w3.org/1999/02/22-rdf-syntax-ns#">
   <thread:Reply r:about="http://annotest.w3.org/annotations/reply/1026276020.748051"
    a:context="/home/kahan/.amaya/annotations/annotDtH15Q.html#xpointer(/html[1])"
    d:title="Reply to Annotation of Welcome to Amaya"
    a:created="2002-07-10T11:32:27"
    xmlns:thread="http://www.w3.org/2001/03/thread#"
    d:date="2002-07-10T11:48:25"
    d:creator="José">
      <ns:attribution
       r:resource="http://annotest.w3.org/annotations/attribution/1026276020.748051"
       xmlns:ns="http://www.w3.org/2001/12/attributions/ns#" />
      <r:type
       r:resource="http://www.w3.org/2001/12/replyType#Disagree" />
      <a:body>
         <r:Description
          http:ContentType="application/xhtml+xml"
          http:ContentLength="205"
          r:about="http://annotest.w3.org/annotations/body/1026276511.598918">
            <http:Body
             r:parseType="Literal">
  <html>
  <head>
    <title>Annotation of Annotation of Welcome to Amaya</title>
  </head>
  
  <body>
  <p>a replly test</p>
  </body>
  </html>
</http:Body>
         </r:Description>
      </a:body>
      <thread:root
       r:resource="file:///home/kahan/.amaya/annotations/annotDtH15Q.html" />
      <thread:inReplyTo
       r:resource="file:///home/kahan/.amaya/annotations/annotDtH15Q.html" />
      <d:creator>
         <r:Description
          addr:firstName="José" addr:name="Kahan"
          xmlns:addr="xmlns:http://www.w3.org/2000/08/palm56/addr#" />
      </d:creator>
   </thread:Reply>
</r:RDF>

Problems

Apart from the stated weakness that the stated relationship is fallacious, it is trivial to make similar or misleading claims in the initial graph. For example, the poster of the above data could have stated that Marja, instead of José, created that data. Internally, the system knows whose authentication token was used when posting the data so no information is lost or confused. However, when serializing this data, the service is then faced with two unpleasant alternatives for dealing with the potentially confusing or misleading duplicate statements:

ignore them
The server sends back a thread:Reply with two d:creator arcs indicating different creators of the data.
eliminate them
The server removes the d:creator arc from the posted data and replaces it with one based on the authentication token used to post the data.

Quoting by Reference

The simplest solution is to avoid mixing meta-data with the data that it describes. That is, segregate data from the assertions about the data. Once the data is in its own document, another document may be created to make assertions about that data. For instance, the d:creator arc in the above example would not be added, but would be asserted in another document. The answer to a query for annotations would be indirect, as the response would not be the annotations, but instead assertions about who made the annotations. The annotations would be retrieved in a separate transaction once the client parses the first reply.

Property ID

The RDF syntax allows for an ID attribute on the propertyElt production. From the model and syntax specification: "The value of the ID attribute, if specified, is the identifier for the resource that represents the reification of the statement." This allows for one to associate arbitrary properties, such as the attribution, with the ID of an assertion. This example comes from algae's output:

+----------------------------+-----------------------------+-----------------------------+
|              interpretAlgae|                             |                             |
|----------------------------|                             |                             |
|                           p|                            s|                            o|
|----------------------------|-----------------------------|-----------------------------|
|http://www.w3.org/e/f/p1.rdf|http://www.w3.org/c/d/ob1.rdf|http://www.w3.org/c/d/ob2.rdf|
|http://www.w3.org/e/f/p2.rdf|http://www.w3.org/c/d/ob1.rdf|http://www.w3.org/c/d/ob3.rdf|
|http://www.w3.org/e/f/p2.rdf|http://www.w3.org/c/d/ob1.rdf|http://www.w3.org/c/d/ob4.rdf|
+----------------------------+-----------------------------+-----------------------------+
<rdf:RDF
 xmlns:rdf="http://www.w3.org/1999/02/22-rdf-syntax-ns#">
   <rdf:Description
    rdf:about="http://www.w3.org/c/d/ob1.rdf">
      <f:p2.rdf
       rdf:ID="s1"
       xmlns:f="http://www.w3.org/e/f/"
       rdf:resource="http://www.w3.org/c/d/ob3.rdf" />
      <f:p2.rdf
       rdf:ID="s2"
       xmlns:f="http://www.w3.org/e/f/"
       rdf:resource="http://www.w3.org/c/d/ob4.rdf" />
      <f:p1.rdf
       rdf:ID="s3"
       xmlns:f="http://www.w3.org/e/f/"
       rdf:resource="http://www.w3.org/c/d/ob2.rdf" />
   </rdf:Description>
   <rdf:Description
    rdf:ID="s1">
      <rdf:attribution
       rdf:resource="http://www.w3.org/a/b/doc2.rdf" />
   </rdf:Description>
   <rdf:Description
    rdf:ID="s2">
      <rdf:attribution
       rdf:resource="http://www.w3.org/a/b/doc2.rdf" />
   </rdf:Description>
   <rdf:Description
    rdf:ID="s3">
      <rdf:attribution
       rdf:resource="http://www.w3.org/a/b/doc1.rdf" />
   </rdf:Description>
</rdf:RDF>

Unfortunately, while the syntax is specified, it is not clear what triples are created by a property with an ID on it. For instance, does the above entail

ob1.rdf p1.rdf ob2.rdf .
ob1.rdf p2.rdf ob3.rdf .
ob1.rdf p2.rdf ob4.rdf .
s1 attribution <http://www.w3.org/a/b/doc2.rdf> .
s2 attribution <http://www.w3.org/a/b/doc2.rdf> .
s3 attribution <http://www.w3.org/a/b/doc1.rdf> .

which does not reflect the connection between the statements and the documents, or

s1 rdf:subject ob1.rdf .
s1 rdf:predicate p2.rdf .
s1 rdf:object ob3.rdf .
s2 rdf:subject ob1.rdf .
s2 rdf:predicate p2.rdf .
s2 rdf:object ob4.rdf .
s3 rdf:subject ob1.rdf .
s3 rdf:predicate p1.rdf .
s3 rdf:object ob2.rdf .
s1 attribution <http://www.w3.org/a/b/doc2.rdf> .
s2 attribution <http://www.w3.org/a/b/doc2.rdf> .
s3 attribution <http://www.w3.org/a/b/doc1.rdf> .

which sets the awkward precedent that statements with an ID are at a different level of interpretation than those without. This would imply that all statements at the same level of interpretation in the same document must all have an ID on the propertyElts if any have an ID.

Reified paper trail.

Procedurally, it is simple to use RDF Reification to enumerate a chain of custody (sequence of ownership of the document) so long as one reifies each statement and attaches properties to the reification to state the source. This implies that one needs to re-reify anything already reified.

marja POSTs annot1 to server1:

    <rdf:Description rdf:about="annot1">
        <a:annotates rdf:resource="doc1"/>
    </rdf:Description>

simple arc annot1 annotates doc1

or in N3:

annot1 annotates doc1.

server2 GETs annot1 and reifies the above statement, adding an Attribution to store the authentication identity:

# annot1 annotates doc1.
s1 a rdf:Statement;
   rdf:subject annot1;
   rdf:predicate annotates;
   rdf:object doc1;
   # tie statement to attributions attrib1.
   attrib:attribution attrib1.
# information about this attribution.
attrib1 attrib:authUser [ email:mbox marja ].

four arcs resulting from reifying annot1 annotates doc1, plus an attribution to a bnode with mailto of marja

This model reflects the entire chain of custody. In order to tell an agent that is acting in "find assertions and who said them" mode that these are reified only for the purpose of quotation, we can add that the Statements are also of type ReifyToQuote. RDF databases with the ability to store the attribution properties, in this case attrib:authUser, will be able to know that the reason the statement was reified was to communicate attribution information and may de-reify s1 and store the attrib:authUser in the context for that statement.

   s1 a attrib:ReifyToQuote.
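
The reification step performed by server2 can be sketched as a small Python function. Triples are plain tuples of strings here, and reify is a hypothetical helper rather than part of any RDF toolkit; the property names are those used in the N3 above.

```python
def reify(triple, statement_id, attribution, quote=True):
    """Return the arcs that describe `triple` as a reified, attributed statement."""
    s, p, o = triple
    arcs = [
        (statement_id, "rdf:type", "rdf:Statement"),
        (statement_id, "rdf:subject", s),
        (statement_id, "rdf:predicate", p),
        (statement_id, "rdf:object", o),
        # tie the statement to its attribution
        (statement_id, "attrib:attribution", attribution),
    ]
    if quote:
        # mark the statement as reified only for the purpose of quotation
        arcs.append((statement_id, "rdf:type", "attrib:ReifyToQuote"))
    return arcs

arcs = reify(("annot1", "annotates", "doc1"), "s1", "attrib1")
```

The arcs describing attrib1 itself (for example its attrib:authUser) are not produced by this sketch and would be added separately.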

Interpretation of Reified Statements

In order to make good use of the reified statement, we must de-reify it in some way. In general, a processor should not assume that it knows why a statement was reified. For instance, the reification may hide a quotation (A says "B"), disjunction (A or B) or negation (NOT A). The data channel could be defined to carry the additional semantics that it is not only encoded in RDF, but that the RDF will be reified RDF with defined properties to identify sources. Such a protocol may be a practical way for a query service to report query solutions with the sources of those solutions. For example, a query service could be defined to report query results this way. It is also possible, and maybe preferable, to derive the propositional attitude from the graph rather than data channel semantics (in keeping with the ideals of self-describing documents). The attrib:ReifyToQuote property is defined for this purpose.

The above model could be stored in a simple triple store or it could be interpreted and stored in a database with context or attributions (or provenience or context or ...):

Statements                                   Attributions
subject  predicate  object  attribution      name     authUser  ...
annot1   annotates  doc1    attrib1          attrib1  marja
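
A de-reifying loader for such a database might look like the following Python sketch. It assumes the reified arcs arrive as plain tuples and unfolds only statements typed attrib:ReifyToQuote, since a processor should not assume why any other statement was reified. The dereify function and the statement s9 (about a made-up annot2) are hypothetical and exist only for the illustration.

```python
def dereify(arcs):
    """Fold reification arcs back into (subject, predicate, object, attribution) rows.

    Only statements typed attrib:ReifyToQuote are unfolded; others are skipped.
    """
    by_statement = {}
    for s, p, o in arcs:
        by_statement.setdefault(s, {}).setdefault(p, []).append(o)
    rows = []
    for props in by_statement.values():
        if "attrib:ReifyToQuote" not in props.get("rdf:type", []):
            continue
        rows.append((props["rdf:subject"][0], props["rdf:predicate"][0],
                     props["rdf:object"][0], props["attrib:attribution"][0]))
    return rows

arcs = [
    ("s1", "rdf:type", "rdf:Statement"),
    ("s1", "rdf:type", "attrib:ReifyToQuote"),
    ("s1", "rdf:subject", "annot1"),
    ("s1", "rdf:predicate", "annotates"),
    ("s1", "rdf:object", "doc1"),
    ("s1", "attrib:attribution", "attrib1"),
    # Reified for some other, unknown reason: left alone by dereify.
    ("s9", "rdf:type", "rdf:Statement"),
    ("s9", "rdf:subject", "annot2"),
    ("s9", "rdf:predicate", "annotates"),
    ("s9", "rdf:object", "doc1"),
]
rows = dereify(arcs)
```

The resulting row corresponds to the Statements half of the table above; the Attributions half would be filled from the arcs that describe attrib1.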

The paper trail works through an arbitrary number of intermediate agents. For instance, if Ralph GETs server2?w3c_annotation=annot1:

# s1 a rdf:Statement.
s1t a rdf:Statement;
    rdf:subject s1;
    rdf:predicate rdf:type;
    rdf:object rdf:Statement;
    # tie statement to attributions attrib2.
    a attrib:ReifyToQuote;
    attrib:attribution attrib2.
# s1 rdf:subject annot1.
s1s a rdf:Statement;
    rdf:subject s1;
    rdf:predicate rdf:subject;
    rdf:object annot1;
    # tie statement to attributions attrib2.
    a attrib:ReifyToQuote;
    attrib:attribution attrib2.
# s1 rdf:predicate annotates.
s1p a rdf:Statement;
    rdf:subject s1;
    rdf:predicate rdf:predicate;
    rdf:object annotates;
    # tie statement to attributions attrib2.
    a attrib:ReifyToQuote;
    attrib:attribution attrib2.
# s1 rdf:object doc1.
s1o a rdf:Statement;
    rdf:subject s1;
    rdf:predicate rdf:object;
    rdf:object doc1;
    # tie statement to attributions attrib2.
    a attrib:ReifyToQuote;
    attrib:attribution attrib2.
# s1 attrib:attribution attrib1.
s1a a rdf:Statement;
    rdf:subject s1;
    rdf:predicate attrib:attribution;
    rdf:object attrib1;
    # tie statement to attributions attrib2.
    a attrib:ReifyToQuote;
    attrib:attribution attrib2.
# attrib1 attrib:authUser [ email:mbox marja ].
s1u a rdf:Statement;
    rdf:subject attrib1;
    rdf:predicate attrib:authUser;
    rdf:object _1;
    # tie statement to attributions attrib2.
    a attrib:ReifyToQuote;
    attrib:attribution attrib2.
s1ub a rdf:Statement;
    rdf:subject _1;
    rdf:predicate email:mbox;
    rdf:object marja;
    # tie statement to attributions attrib2.
    a attrib:ReifyToQuote;
    attrib:attribution attrib2.
# information about this attribution.
attrib2 attrib:trustedHost server1.

Later we discuss a set of criticisms of reification and some proposed work-arounds.

  1. Superman Problem
  2. Incompleteness Problem

SOAP shortcut to reified paper trail

One can use SOAP to envelop the originally stated statements and provide the paper trail in an ordered set of headers. The non-"find assertions and who said them" mode interpretation is identical to the above example.

<?xml version="1.0" encoding="iso-8859-1"?>
<env:Envelope xmlns:env="http://www.w3.org/2003/05/soap-envelope"
   xmlns:attrib="http://www.w3.org/2001/12/attributions/ns#"
   xmlns:rdf="http://www.w3.org/1999/02/22-rdf-syntax-ns#"
   xmlns:email="http://www.w3.org/2000/10/swap/pim/email#">
  <env:Header>
    <attrib:history href="#quote1" env:encoding="http://www.w3.org/2001/12/attributions/#SOAP">
      <rdf:RDF>
        <attrib:Holder rdf:ID="attrib1">
          <attrib:authUser rdf:parseType="Resource">
            <email:mbox rdf:resource="marja@w3.org"/>
          </attrib:authUser>
        </attrib:Holder>
      </rdf:RDF>
    </attrib:history>
    <attrib:history href="#quote1" env:encoding="http://www.w3.org/2001/12/attributions/#SOAP">
      <rdf:RDF>
        <attrib:Holder rdf:ID="attrib2">
          <attrib:trustedHost rdf:resource="server1"/>
        </attrib:Holder>
      </rdf:RDF>
    </attrib:history>
  </env:Header>
  <env:Body>
    <attrib:quote ID="quote1">
      <rdf:RDF xmlns:a="http://www.w3.org/2000/10/annotation-ns#">
        <rdf:Description rdf:about="annot1">
          <a:annotates rdf:resource="doc1"/>
        </rdf:Description>
      </rdf:RDF>
    </attrib:quote>
  </env:Body>
</env:Envelope>

One attractive feature of this shortcut is that the total message is supplemented by a constant amount as the chain of custody grows. Contrast this with the overt reification which multiplies the message size by at least five and property identification which multiplies it by two.
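
The growth rates can be made concrete with a rough Python calculation. It assumes that each holder reifies every triple it receives into five arcs (rdf:type, rdf:subject, rdf:predicate, rdf:object and one attrib:attribution) and ignores the triples that describe each attribution, so the reified figure is a lower bound. The function names and the header_triples parameter are illustrative assumptions.

```python
def reified_size(n_triples, hops, arcs_per_statement=5):
    """Triples after each of `hops` holders reifies everything it received.

    arcs_per_statement=5 counts rdf:type, rdf:subject, rdf:predicate,
    rdf:object and one attrib:attribution arc; triples describing the
    attribution itself are ignored, so this is a lower bound.
    """
    return n_triples * arcs_per_statement ** hops

def enveloped_size(n_triples, hops, header_triples=1):
    """Triples when each holder instead adds one fixed-size header block."""
    return n_triples + hops * header_triples

# (hops, reified size, enveloped size) for a message of one triple
sizes = [(h, reified_size(1, h), enveloped_size(1, h)) for h in range(4)]
```

For a single original triple, three holders give at least 125 triples under overt reification, while the enveloped form grows by one header block per holder.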

Superman Problem

One criticism of reification is the anticipated vulnerability to the "superman problem":

  1. Lois Lane: Superman can fly.
  2. Lex Luther: Clark Kent is Superman.
  3. Lois+Lex+substitution: Clark Kent can fly.

Bear in mind that my understanding of Superman comics and movies is not complete, but I believe that the premises in 1 and 2 are common to most of the Superman comics.

The goal here is to keep Lois from stating 3 as it is not her intention.
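
The unwanted conclusion is easy to reproduce. The following Python sketch applies substitution while ignoring who said what; statements are tuples with the speaker as a fourth component, and naive_substitution is a hypothetical function written only to show the failure.

```python
# Attributed statements: (subject, predicate, object, who said it).
facts = [
    ("Superman", "can", "fly", "Lois Lane"),
    ("Clark Kent", "is", "Superman", "Lex Luther"),
]

def naive_substitution(facts):
    """Apply 'a is b' to every statement, ignoring who said what."""
    derived = []
    for a, _, b, _ in (f for f in facts if f[1] == "is"):
        for s, p, o, who in facts:
            if p != "is" and s == b:
                # The original speaker is wrongly kept as the sole source.
                derived.append((a, p, o, who))
    return derived

derived = naive_substitution(facts)
```

The derived statement attributes "Clark Kent can fly" to Lois Lane alone, which is exactly statement 3 presented as if it were her intention.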

Opaque Reification

One proposed solution to the superman problem in RDF is to encode beliefs in strings so that no unwanted substitution will occur on them. This is a variation of referential opacity.

# Superman can fly.
s6 a rdfre:Statement;
   rdfre:subject [ rdfre:uriOfConstant "Superman" ] ;
   rdfre:predicate [ rdfre:uriOfConstant "can" ] ;
   rdfre:object [ rdfre:uriOfConstant "fly" ] ;
   attrib:attribution attrib6.
# information about this attribution.
attrib6 attrib:authUser "Lois Lane".

There are many inferences we could investigate here, but those based on is are the least constrained and the most likely to accidentally match data where they ought not be applied. The inferences implied by the semantics of RDFS and OWL (save owl:sameAs which means is) would not match any reified data. It is hard to imagine a legitimate use case in, say, banking rules, that would act on the arc subj rdf:property bank:owes that was not tailored to deal with graphs reified for a specific purpose.

This requires a protocol/tool that knows which bits of the universe may be mixed and parses the strings into another knowledge base. But how do we characterize which substitutions are warranted, or even desirable? For instance, the N3 above says that Lois says "Superman can fly". If she adds "ManOfSteel owl:sameAs Superman" and Lex adds "ClarkKent owl:sameAs Superman", we have three options for deductive closures:

  1. Lois says "ManOfSteel can fly."
  2. Lois says "ClarkKent can fly."
  3. Lex says "ClarkKent can fly."

The latter two seem pretty much out of line, but if Lois utters the document { Superman can fly. ManOfSteel owl:sameAs Superman. }, did she imply { ManOfSteel can fly. } ? If so, there is an implied rule that permits owl:sameAs inferences to work on statements in the same document (see document-closed sameAs rule below).
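
One possible reading of such a document-closed rule is sketched below in Python. It is an assumption about how the rule might behave, not a definition of it: each author's document is a list of triples, aliases are applied in one direction only (alias for target) and only to statements in the same document. The function name and data layout are hypothetical.

```python
documents = {
    "Lois": [("Superman", "can", "fly"),
             ("ManOfSteel", "owl:sameAs", "Superman")],
    "Lex": [("ClarkKent", "owl:sameAs", "Superman")],
}

def document_closed_same_as(documents):
    """Substitute owl:sameAs aliases only within the document that states them."""
    said = {}
    for author, triples in documents.items():
        aliases = [(s, o) for s, p, o in triples if p == "owl:sameAs"]
        derived = set()
        for s, p, o in triples:
            if p == "owl:sameAs":
                continue
            for alias, target in aliases:
                if s == target:
                    derived.add((alias, p, o))
        said[author] = derived
    return said

said = document_closed_same_as(documents)
```

Under this reading only option 1 is derived: Lois says "ManOfSteel can fly", and nothing about ClarkKent is attributed to either Lois or Lex.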

In some sense, the best constraint to apply to substitution and other logical deductions is the most conservative. For instance, only assume that the statements made by a provably consistent individual over the interval of their attention span are mutually consistent. This may be a bit constrictive so we may choose a somewhat looser gauge. For instance, use all "current" (non obsolesced) documents, but tread delicately and don't write any checks.

Reporting Proofs with Results

If we ask "Does Lois believe that the person she knows as Clark Kent can fly?" the answer is no. If we ask "Does Lois believe that someone we know to be Clark Kent can fly?" the answer is yes. Further, if we are looking for someone to wash windows in our 50th floor apartment we may ask "Who do we know that knows someone that can fly?". Our hope now lies in the solution set:

we know       can fly
people:Lois   people:Clark_Kent
people:Lois   people:Superman

with the proof:

# Superman can fly.
s6 a rdf:Statement;
   rdf:subject people:Superman;
   rdf:predicate natLang:can;
   rdf:object natLang:fly;
   attrib:attribution attrib6.
# information about this attribution.
attrib6 attrib:authUser "Lois Lane".
# Clark Kent is Superman.
s7 a rdf:Statement;
   rdf:subject people:Clark_Kent;
   rdf:predicate natLang:is;
   rdf:object people:Superman;
   attrib:attribution attrib7.
# information about this attribution.
attrib7 attrib:authUser "Lex Luther".
# Clark Kent can fly.
s8 a rdf:Statement;
   rdf:subject people:Clark_Kent;
   rdf:predicate natLang:can;
   rdf:object natLang:fly;
   attrib:attribution attrib8.
# information about this attribution.
attrib8 attrib:authUser "Lois+Lex+substitution".

Being sensitive to personal boundaries that maintain comic book plots, we look at ground facts for the answer we got and make sure we don't divulge any secrets. The above premise provides us the information we need to tread delicately. The substitution component of 3 tips us off to the fact that we ought to reveal the implication to neither Lois nor Lex (especially since Lex may yet be unaware that his nemesis can fly).

Despite the fact that the semantic web reduces any hope Lex Luther had of keeping his revelation obscured, we must model our statement reporting with the presumption that not all the people will agree all the time (being neither reliably foolish nor reliably wise).

If Lex makes his assertion with an owl:sameAs property, rather than a foaf:name property, a processor may be tempted to alter Lois's statement to read

# Clark Kent can fly.
s6 a rdf:Statement;
   rdf:subject people:Clark_Kent;
   rdf:predicate natLang:can;
   rdf:object natLang:fly;
   attrib:attribution attrib6.
# information about this attribution.
attrib6 attrib:authUser "Lois".

The semantics of sameAs seem to imply that interpretation.

Such an owl:sameAs statement indicates that two URI references actually refer to the same thing: the individuals have the same "identity".

If the provenance of the ground facts for an inference is preserved, there may be no need to hide the references in the reified triples. A problem arises, however, when the reified data is shown to the world and the world then does substitution on those triples. Algae may give the client all the information it needs to know that Lois did not say that Clark Kent could fly, but if Algae dumps the database in the M&S-reified form, and another reasoner substitutes "Clark Kent" for "Superman", a client of the substituted data will be told that Lois did say that Clark Kent can fly. This may force serialization in the referentially opaque style described above.

Incompleteness Problem

Another difficulty with inferences over reified data is the completeness of the conclusions. For instance, if the rdfs:domain of annotates is an Annotation, we can conclude from the statement annot1 annotates doc1. that annot1 is an Annotation. This rule will not match the reified form and will not imply the expected conclusion. The Inference Responsibility section describes some rules tailored to work with reified data.
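
The mismatch can be shown with a toy rule engine in Python. The sketch implements only the rdfs:domain rule over tuples of strings; rdfs_domain_rule and the domains mapping are illustrative names, not part of any RDFS implementation.

```python
def rdfs_domain_rule(triples, domains):
    """Infer (s, rdf:type, C) for each (s, p, o) where p has rdfs:domain C."""
    return {(s, "rdf:type", domains[p]) for s, p, o in triples if p in domains}

domains = {"annotates": "Annotation"}

asserted = {("annot1", "annotates", "doc1")}
reified = {
    ("s1", "rdf:type", "rdf:Statement"),
    ("s1", "rdf:subject", "annot1"),
    ("s1", "rdf:predicate", "annotates"),
    ("s1", "rdf:object", "doc1"),
}

from_asserted = rdfs_domain_rule(asserted, domains)   # annot1 is an Annotation
from_reified = rdfs_domain_rule(reified, domains)     # nothing is concluded
```

In the reified graph, annotates appears only as the object of an rdf:predicate arc, so the rule never sees it in the predicate position.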

Referential Opacity

It is clear that reasoning over quoted data must be handled differently than that over the data which we are to take as "fact". In the example above, Lois said that "Superman can fly". We tried first reification and then opaque reification to create a relationship between Lois and the sentence "Superman can fly". Both of these left terms exposed that were vulnerable to substitution. Reification and Opaque Reification are transformations of the data intended to hide it from the "normal" substitution rules. If we abandon the goal of identifying the nodes in the obscured statements, we can obscure the whole statement in an opaque string that hides it from substitution on all but the statement level. Complete referential opacity is a more direct way to accomplish this, but has no existing representation in the conventional RDF syntax.

Traditional referential opacity involves the quoting of an entire sentence. This protects any of the terms in the sentence from substitution. The superman problem is avoided as the symbol "Superman" is not visible for substitution. Further, as no substitution will occur, completeness is not an issue; no conclusions are drawn at all. This approach establishes no RDF model-level relationship between the directly asserted statements and the quoted statements. Instead it extends the RDF database data structure and relies on an API to make this information available to the application.

One could still write a rule that said if Lois says "Superman can fly" then Lois says "Clark Kent can fly". Also, rules that act on substrings of quoted rules may lose this protection, but they are effectively cheating by breaking the sentence components back down to a set of not quite opaque terms.

Parsetype Quote

Parsetype quote is an extension to the RDF XML syntax that allows the serialization of collections of unasserted facts. They are syntactically set apart from the enveloping RDF by their inclusion in an XML element:

<rdf:Description rdf:about="Marja">
  <attrib:says rdf:parseType="quote">
    <rdf:Description rdf:about="annot1">
        <a:annotates rdf:resource="doc1"/>
    </rdf:Description>
  </attrib:says>
</rdf:Description>

As a syntactic feature, parsetype quote may be bound to any of the above model implications. For instance, it may be defined to assert either the reified or opaque-reified graphs described above, or provide a serialization for full referential opacity. In the latter case, where the enveloping RDF was being parsed into one context, the quoted RDF will be parsed into another. This is analogous to a lisp quoted expression, (car '(+ 2 3)), where the list is not evaluated but is still parsed into the list compile tree instead of being left as the less useful string "(+ 2 3)".

Per the RDF syntax last call, any parseType that is neither Resource nor Collection is assumed to be equivalent to Literal. Thus the naive parser will assert that Marja says "<rdf:Description.../rdf:Description>" and the savvy parser will parse those quotes into a new model or context. For backward compatibility, it is necessary that the savvy parser also assert that Marja says "<rdf:Description.../rdf:Description>".

Regardless of the model implications, this syntactic approach nests conveniently -- after another holder gets their paws on it and passes it on we get a very small change to the serialization:

<rdf:Description rdf:about="Server1">
  <attrib:says rdf:parseType="quote">
    <rdf:Description rdf:about="Marja">
      <attrib:says rdf:parseType="quote">
        <rdf:Description rdf:about="annot1">
            <a:annotates rdf:resource="doc1"/>
        </rdf:Description>
      </attrib:says>
    </rdf:Description>
  </attrib:says>
</rdf:Description>

This maps trivially to the N3 notion of formulas:

Server1 attrib:says {
  Marja attrib:says {
    annot1 annotates doc1.
  }.
}.
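
The nesting lends itself to a simple recursive walk. In the Python sketch below, a formula is modelled as a list of triples whose object may itself be a formula; this representation and the chains_of_custody function are assumptions made for illustration and do not reflect how cwm stores formulas.

```python
# A formula is a list of triples; an object may itself be a formula (a quote).
message = [
    ("Server1", "attrib:says", [
        ("Marja", "attrib:says", [
            ("annot1", "annotates", "doc1"),
        ]),
    ]),
]

def chains_of_custody(formula, holders=()):
    """Yield (triple, holders) for each innermost triple, outermost holder first."""
    for s, p, o in formula:
        if p == "attrib:says" and isinstance(o, list):
            yield from chains_of_custody(o, holders + (s,))
        else:
            yield (s, p, o), holders

trail = list(chains_of_custody(message))
```

Each additional holder adds one level of nesting and one entry to the holders tuple, mirroring the small change to the serialization noted above.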

Datatype Quote

Another option is to introduce a new data type to RDF/XML. This data type entails the graph obtained by parsing the data as RDF/XML. This is seen as a way to add support for quoting to naive parsers as many will have an extensible way to dispatch parsers for serializations of new data types.

<rdf:Description rdf:about="Marja">
  <attrib:says rdf:parseType="Literal" rdf:datatype="http://example.org/ns#quote">
    <rdf:Description rdf:about="annot1">
        <a:annotates rdf:resource="doc1"/>
    </rdf:Description>
  </attrib:says>
</rdf:Description>

Unfortunately, the RDF syntax last call defines the data type of parseType="Literal" to be XMLLiteral and thus does not allow an rdf:datatype attribute on the same propertyElt as a parseType="Literal".

Querying With Trust

Assuming some variation of either the opaque or the non-opaque schemes described above, how do we construct queries invoking data from trusted sources? A simple query for the subjects of triples that annotate doc1 may look like this:

algae
ask ( ?what annotates doc1 )
collect ( ?what )
cwm
this log:forall ?what .
{ ?what annotates doc1 } log:implies { query solution ?what } .

This simple search for matching facts is sometimes called naive optimism and is reminiscent of a typical search engine query where the requester asks all the world. If proofs with provenance are reported, and the requester considers the solutions based on relative levels of trust, the process can be described as cautious optimism and is analogous to the search engine requester reviewing the search results in light of some trust with known domains or authors or apparent expertise on the part of the author. The popularity of "googling" for useful data implies that these optimistic approaches may be of great practical use.

Above, we've been discussing moving out one layer of reference, to where the database does not carry facts like A annotates B but instead that doc2 says that A annotates B. If we follow the opaque reification model described above and view this query as a trust problem, we may ask a question trusting a particular document (doc2):

algae
ask ( doc2 says ?statement1 .
      ?statement1 re2:subject ?ns .
      ?ns uriOfConstant ?what .
      ?statement1 re2:predicate ?np .
      ?np uriOfConstant "annotates" .
      ?statement1 re2:object ?no .
      ?no uriOfConstant "doc1" )
collect ( ?what )
cwm
this log:forall ?what,?statement1,?ns,?np,?no .
{ doc2 says ?statement1 .
  ?statement1 re2:subject ?ns .
  ?ns uriOfConstant ?what .
  ?statement1 re2:predicate ?np .
  ?np uriOfConstant "annotates" .
  ?statement1 re2:object ?no .
  ?no uriOfConstant "doc1" } log:implies { query solution ?what } .
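To make the shape of this query concrete, here is a minimal Python sketch that evaluates the same pattern over a small in-memory list of triples. The node names (`st1`, `n1`, and so on), the second document `doc5`, and the helper functions are hypothetical; `says`, `re2:subject`, `re2:predicate`, `re2:object` and `uriOfConstant` are used as in the queries above.

```python
TRIPLES = [
    # doc2 says "annot1 annotates doc1", in the opaque reified form.
    ("doc2", "says", "st1"),
    ("st1", "re2:subject", "n1"),
    ("n1", "uriOfConstant", "annot1"),
    ("st1", "re2:predicate", "n2"),
    ("n2", "uriOfConstant", "annotates"),
    ("st1", "re2:object", "n3"),
    ("n3", "uriOfConstant", "doc1"),
    # doc5 says "annot9 annotates doc1"; a query trusting only doc2 must skip it.
    ("doc5", "says", "st2"),
    ("st2", "re2:subject", "n4"),
    ("n4", "uriOfConstant", "annot9"),
    ("st2", "re2:predicate", "n5"),
    ("n5", "uriOfConstant", "annotates"),
    ("st2", "re2:object", "n6"),
    ("n6", "uriOfConstant", "doc1"),
]


def objects(triples, subject, predicate):
    """All objects o such that (subject, predicate, o) is in the triples."""
    return [o for (s, p, o) in triples if s == subject and p == predicate]


def constant_of(triples, statement, role):
    """Follow statement --role--> node --uriOfConstant--> value."""
    return [value
            for node in objects(triples, statement, role)
            for value in objects(triples, node, "uriOfConstant")]


def said_subjects(triples, doc, predicate, obj):
    """Subjects ?what such that doc says '?what predicate obj'."""
    results = []
    for statement in objects(triples, doc, "says"):
        if (predicate in constant_of(triples, statement, "re2:predicate")
                and obj in constant_of(triples, statement, "re2:object")):
            results.extend(constant_of(triples, statement, "re2:subject"))
    return results
```

Calling `said_subjects(TRIPLES, "doc2", "annotates", "doc1")` plays the role of the query above: only statements reachable through `doc2 says` are considered.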

Where the relationship is to the containing document and not to whoever uttered the document, both algae and cwm have shortcuts. In algae, the shortcut is via a constraint predicate (in the XPath sense of the word) and in N3 it is via the special property log:semantics.

algae
ask ( ?what annotates doc1 [%ATTRIB == <doc2>] )
collect ( ?what )
cwm
this log:forall ?what .
{ doc2 log:semantics [ log:conclusion [ log:includes { ?what annotates doc1 } ] ] }
 log:implies { query solution ?what } .

Neither of these examples uses the opaque reification. cwm has a notion of formulas which have no defined arc relationship to the statements within the formula. It is possible to simultaneously treat formulas as an extension to the database data structure and as an encoding of the triples, but it seems unlikely to be practical.

Existential Opacity

The vulnerability of opaque reification is that rules can match the bNode that is the subject, predicate or object of the reified statement. Making these bNodes into existential variables will avoid this problem.

:y n3:serialization """this forSome x, y, z.   x y z. x uri "Superman".  y uri "can".  z uri "fly"."""

Serializing this within conventional RDF is not feasible; however, it is possible to use a data type to mark an extended syntax for RDF.

Inference Responsibility

In architecting the principles and data on the Semantic Web, it is important to consider what responsibility a data publisher has for data that may be inferred from their publication. It would not be fair to hold someone responsible for a rule that says every time you use the rdf:type Annotation, you owe the Annotea team a cookie. It would, however, be fair to say that if MyCheck is a subclass of Check and I write a MyCheck to you, I am also responsible for writing a Check to you. There are a variety of places we may choose to draw a circle and say the publisher is responsible for all inferences defined inside the circle.

document-closed sameAs

This rule makes the social meaning assumption that a document implies not just the ground facts stated in the document, but a deductive closure of statements as implied by a set of predicates in that document, specifically, those predicates defined in rdfs and owl. This rule is a small extension of the fundamental assumption that a document is a conjunction of statements, without which it would be impossible to draw any meaning from a document with more than one arc. The example includes only the rules for an unsafe deduction of owl:sameAs:

algae
fwrule head (?l owl:sameAs ?r .
             ?l ?p ?o)
       body (?r ?p ?o)
fwrule head (?l owl:sameAs ?r .
             ?r ?p ?o)
       body (?l ?p ?o)
fwrule head (?l owl:sameAs ?r .
             ?s ?p ?l)
       body (?s ?p ?r)
fwrule head (?l owl:sameAs ?r .
             ?s ?p ?r)
       body (?s ?p ?l)
ask ( doc2 says ?statement1 .
      ?statement1 re2:subject ?ns .
      ?ns uriOfConstant ?what .
      ?statement1 re2:predicate ?np .
      ?np uriOfConstant "annotates" .
      ?statement1 re2:object ?no .
      ?no uriOfConstant "doc1" .)
collect ( ?what )
cwm
this log:forall ?l, ?r, ?s, ?p, ?o .
{ ?l owl:sameAs ?r .
  ?l ?p ?o } log:implies { ?r ?p ?o } .
{ ?l owl:sameAs ?r .
  ?r ?p ?o } log:implies { ?l ?p ?o } .
{ ?l owl:sameAs ?r .
  ?s ?p ?l } log:implies { ?s ?p ?r } .
{ ?l owl:sameAs ?r .
  ?s ?p ?r } log:implies { ?s ?p ?l } .
this log:forall ?what,?statement1,?ns,?np,?no .
{ doc2 says ?statement1 .
  ?statement1 re2:subject ?ns .
  ?ns uriOfConstant ?what .
  ?statement1 re2:predicate ?np .
  ?np uriOfConstant "annotates" .
  ?statement1 re2:object ?no .
  ?no uriOfConstant "doc1" } log:implies { query solution ?what } .

This model will be interpreted as asserting that doc2 says the deductive closure of owl:sameAs applied to doc2. For instance, if doc2 stated that doc3 owl:sameAs doc1, we would learn that doc2 says "annot1 annotates doc3." This contradicts the exact semantics of says, which were identified as "take some data at face value and associate it with the publisher". We can invent a new propositional attitude called implies and create a much more complicated rule:

algae
fwrule head (?doc says ?s1 .
             ?s1 re2:subject ?l .
             ?s1 re2:predicate owl:sameAs .
             ?s1 re2:object ?r .
             ?doc says ?s2 .
             ?s2 re2:subject ?l .
             ?s2 re2:predicate ?p .
             ?s2 re2:object ?o)
       body (?doc implies ?s3 .
             ?s3 re2:subject ?r .
             ?s3 re2:predicate ?p .
             ?s3 re2:object ?o)
fwrule head (?doc says ?s1 .
             ?s1 re2:subject ?l .
             ?s1 re2:predicate owl:sameAs .
             ?s1 re2:object ?r .
             ?doc says ?s2 .
             ?s2 re2:subject ?r .
             ?s2 re2:predicate ?p .
             ?s2 re2:object ?o)
       body (?doc implies ?s3 .
             ?s3 re2:subject ?l .
             ?s3 re2:predicate ?p .
             ?s3 re2:object ?o)
fwrule head (?doc says ?s1 .
             ?s1 re2:subject ?l .
             ?s1 re2:predicate owl:sameAs .
             ?s1 re2:object ?r .
             ?doc says ?s2 .
             ?s2 re2:subject ?s .
             ?s2 re2:predicate ?p .
             ?s2 re2:object ?l)
       body (?doc implies ?s3 .
             ?s3 re2:subject ?s .
             ?s3 re2:predicate ?p .
             ?s3 re2:object ?r)
fwrule head (?doc says ?s1 .
             ?s1 re2:subject ?l .
             ?s1 re2:predicate owl:sameAs .
             ?s1 re2:object ?r .
             ?doc says ?s2 .
             ?s2 re2:subject ?s .
             ?s2 re2:predicate ?p .
             ?s2 re2:object ?r)
       body (?doc implies ?s3 .
             ?s3 re2:subject ?s .
             ?s3 re2:predicate ?p .
             ?s3 re2:object ?l)
ask ( doc2 says ?statement1 .
      ?statement1 re2:subject ?ns .
      ?ns uriOfConstant ?what .
      ?statement1 re2:predicate ?np .
      ?np uriOfConstant "annotates" .
      ?statement1 re2:object ?no .
      ?no uriOfConstant "doc1" .)
collect ( ?what )
cwm
this log:forall ?doc, ?s1, ?s2, ?s3, ?l, ?r, ?s, ?p, ?o .
{ ?doc says ?s1 .
  ?s1 re2:subject ?l .
  ?s1 re2:predicate owl:sameAs .
  ?s1 re2:object ?r .
  ?doc says ?s2 .
  ?s2 re2:subject ?l .
  ?s2 re2:predicate ?p .
  ?s2 re2:object ?o } log:implies { ?doc implies ?s3 .
                                    ?s3 re2:subject ?r .
                                    ?s3 re2:predicate ?p .
                                    ?s3 re2:object ?o } .
{ ?doc says ?s1 .
  ?s1 re2:subject ?l .
  ?s1 re2:predicate owl:sameAs .
  ?s1 re2:object ?r .
  ?doc says ?s2 .
  ?s2 re2:subject ?r .
  ?s2 re2:predicate ?p .
  ?s2 re2:object ?o } log:implies { ?doc implies ?s3 .
                                    ?s3 re2:subject ?l .
                                    ?s3 re2:predicate ?p .
                                    ?s3 re2:object ?o } .
{ ?doc says ?s1 .
  ?s1 re2:subject ?l .
  ?s1 re2:predicate owl:sameAs .
  ?s1 re2:object ?r .
  ?doc says ?s2 .
  ?s2 re2:subject ?s .
  ?s2 re2:predicate ?p .
  ?s2 re2:object ?l } log:implies { ?doc implies ?s3 .
                                    ?s3 re2:subject ?s .
                                    ?s3 re2:predicate ?p .
                                    ?s3 re2:object ?r } .
{ ?doc says ?s1 .
  ?s1 re2:subject ?l .
  ?s1 re2:predicate owl:sameAs .
  ?s1 re2:object ?r .
  ?doc says ?s2 .
  ?s2 re2:subject ?s .
  ?s2 re2:predicate ?p .
  ?s2 re2:object ?r } log:implies { ?doc implies ?s3 .
                                    ?s3 re2:subject ?s .
                                    ?s3 re2:predicate ?p .
                                    ?s3 re2:object ?l } .
this log:forall ?what,?statement1,?ns,?np,?no .
{ doc2 says ?statement1 .
  ?statement1 re2:subject ?ns .
  ?ns uriOfConstant ?what .
  ?statement1 re2:predicate ?np .
  ?np uriOfConstant "annotates" .
  ?statement1 re2:object ?no .
  ?no uriOfConstant "doc1" } log:implies { query solution ?what } .

These examples show only the rules to implement owl:sameAs. The complete rules for OWL Full are lengthy to enumerate.
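The separation between what a document says and what it implies can be sketched in Python. The function name and the plain (subject, predicate, object) tuples are assumptions made for brevity; the sketch works on de-reified triples and, like the four implies rules above, applies a single substitution step rather than a full fixpoint. Dropping triples that were already said is a further choice made here to keep the two attitudes visibly apart.

```python
def implied_by_same_as(said):
    """Return the triples a document implies through one step of owl:sameAs
    substitution, excluding the triples it literally says.

    `said` is an iterable of (subject, predicate, object) tuples that one
    document says. The four branches mirror the four rules above.
    """
    said = set(said)
    implied = set()
    for (l, p1, r) in said:
        if p1 != "owl:sameAs":
            continue
        for (s, p, o) in said:
            if s == l:
                implied.add((r, p, o))  # l p o  =>  r p o
            if s == r:
                implied.add((l, p, o))  # r p o  =>  l p o
            if o == l:
                implied.add((s, p, r))  # s p l  =>  s p r
            if o == r:
                implied.add((s, p, l))  # s p r  =>  s p l
    return implied - said


# What doc2 says, following the example in the text.
DOC2_SAYS = {
    ("doc3", "owl:sameAs", "doc1"),
    ("annot1", "annotates", "doc1"),
}

# Keep the two propositional attitudes in separate sets.
DOC2 = {"says": DOC2_SAYS, "implies": implied_by_same_as(DOC2_SAYS)}
```

With this data, "annot1 annotates doc3" appears only under `implies`, so a query restricted to `says` still sees doc2's data at face value.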

Conclusion

Though rdfs and owl have defined several graph patterns (ranging from simple arcs like rdfs:subPropertyOf to more complex graphs like owl:constraint objects) which have rule implications, there are no established forms of these rules for manipulating quoted graphs. Further, there is no "standard" way to even serialize quoted material. The ability to talk about data and meta-data in the same breath is seen as a strong point of RDF so it seems that some conventions for how to do that would markedly increase interoperability and decrease ambiguity for those publishing such data.

Presented above are some options for modeling data and meta-data in the same document. Each has its benefits and drawbacks; none is without cost. Thus it seems that understanding of, and perhaps consensus on, some set of these approaches will not be driven by the recognition of "the right choice". The need to actually express data and meta-data and understand the limitations of our expression will drive us to identify and describe a (hopefully small, though larger than zero) set of models.

Mumble Foo

Data quoting data establishes different propositional attitudes for the layers of data. The directly asserted data, likely just a collection of he said/she said patterns, is taken as "asserted" by the publisher. The quoted data is taken with more of a grain of salt — different rules apply to the different propositional attitudes. Reification encodes transformations of quoted data in the "asserted" attitude.

# Superman can fly.
s6 a rdf:Statement;
   rdf:subject people:Superman;
   rdf:predicate natLang:can;
   rdf:object natLang:fly;
   attrib:attribution attrib6.
# information about this attribution.
attrib6 attrib:authUser "Lois Lane".
# Clark Kent is Superman.
s7 a rdf:Statement;
   rdf:subject people:Clark_Kent;
   rdf:predicate natLang:is;
   rdf:object people:Superman;
   attrib:attribution attrib7.
# information about this attribution.
attrib7 attrib:authUser "Lex Luthor".

Lex made us a liar when we de-reified his statement and applied that rule to Lois's statements. By doing this substitution, we lead people to the conclusion that Lois says "Clark Kent can fly." On the other hand, maybe he's the liar: shouldn't "Clark Kent is Superman" be further qualified?
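The substitution described here can be traced step by step in a short Python sketch. The tuple layout and the two function names are hypothetical; statements are keyed by their attribution nodes (`attrib6`, `attrib7`) from the listing above.

```python
# Reified statements with their attribution: (attribution, subject, predicate, object).
QUOTED = [
    ("attrib6", "people:Superman", "natLang:can", "natLang:fly"),
    ("attrib7", "people:Clark_Kent", "natLang:is", "people:Superman"),
]


def dereify_identities(quoted, attribution):
    """Take one source's 'is' statements at face value, as (alias, original) pairs."""
    return [(s, o) for (who, s, p, o) in quoted
            if who == attribution and p == "natLang:is"]


def substitute(quoted, identities):
    """Rewrite the other quoted statements using the identities while keeping
    their original attribution. This is the questionable step."""
    derived = []
    for (alias, original) in identities:
        for (who, s, p, o) in quoted:
            if p != "natLang:is" and s == original:
                derived.append((who, alias, p, o))
    return derived


DERIVED = substitute(QUOTED, dereify_identities(QUOTED, "attrib7"))
```

The single derived statement says that Clark Kent can fly and is still credited to `attrib6`, although nothing under that attribution ever mentioned Clark Kent.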

Should these statements with different propositional attitudes be included in the same graph?

See Also

Drew McDermott's issues with reification
Drew asserts that using reification for handling "opaque" contexts "is a classic example of fixing a bug with a bigger bug."
design discussion plea on www-annotation
a proposal for handling attributions in the Annotea server
Embedding RDF in SOAP
 
SOAP Encoding RDF
 
Attribution Theory
do we want to add stuff from attributions in the sociology sense? -- the "why", not "who" sense?

Obsolete Anchors

Identify source in RDF encoding
re-named and migrated to Property ID

SPARQL Queries

Given an atom entry <http://whereami.example/entry.atom> with changing content:

20061202T00:00 -- { :I :nearestAirport :TYO }
20061203T00:00 -- { :I :nearestAirport :CDG }
20061204T00:00 -- { :I :nearestAirport :BOS }

you could:

CVS Log

$Log: Overview.html,v $
Revision 1.55  2012/04/03 15:49:05  eric
~ fixed an encoding error

Revision 1.54  2006/11/25 02:46:44  eric
+ notes from a conversation with bblfish on #swig at 2006-11-25T02:07:51Z

Eric Prud'hommeaux
$Id: Overview.html,v 1.55 2012/04/03 15:49:05 eric Exp $