Re: Modified proposal for 'provenance triple', ISSUE-110 from Niklas Lindström on 2011-09-01 (public-rdfa-wg@w3.org from September 2011)

From: Niklas Lindström <lindstream@gmail.com>
Date: Thu, 1 Sep 2011 14:44:05 +0200
To: Ivan Herman <ivan@w3.org>
Cc: Gregg Kellogg <gregg@kellogg-assoc.com>, W3C RDFWA WG <public-rdfa-wg@w3.org>
Message-ID: <CADjV5jfMud9q4-Wre0oHy8ciZ=Cqr6sb3p0hpyPCM7eKZf4mAA@mail.gmail.com>
2011/9/1 Ivan Herman <ivan@w3.org>:
> Niklas,
>
> forget about URIRef, named graphs, quads, etc....
>
> As far as I know, if you have a http://www.example.org/bla.ttl file containing the following triples:
>
> <> a:b <http://example.org> .
>
> Then <> stands for the base of the containing turtle file, ie, it stands for <http://www.example.org/bla.ttl>. That turtle file may be generated from the RDFa file http://www.example.org/bla.html but the base URI in that RDFa file is different...

Yes, if that is the only data in the file, <> stands for
<.../bla.ttl>. But if that Turtle file was generated from an RDFa file
where the base URI is different (as one would expect), the conversion
to Turtle was lossy. Either the parsed graph didn't contain an
absolute URI for the subject of the triple, or the serializer failed
to output a @base directive (which it has to do if it writes relative
URIs).
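
To make the lossy case concrete, here is a minimal sketch in Python
that uses only urljoin from the standard library. The helper names
(resolve, effective_base) are mine and purely illustrative, and the
code only models IRI resolution; it is not a Turtle or RDFa parser.

```python
from urllib.parse import urljoin

RDFA_BASE = "http://www.example.org/bla.html"  # base of the RDFa source
TURTLE_URL = "http://www.example.org/bla.ttl"  # where the Turtle is served


def resolve(reference, base):
    """Resolve an IRI reference against a base; "" yields the base itself."""
    return urljoin(base, reference)


def effective_base(retrieval_url, base_directive=None):
    """Base used when reading Turtle: an explicit @base wins over the URL."""
    if base_directive is None:
        return retrieval_url
    return urljoin(retrieval_url, base_directive)


# Subject the RDFa processor produces for about="".
original_subject = resolve("", RDFA_BASE)

# Reading "<> ..." back from the Turtle file when no @base was written.
lossy_subject = resolve("", effective_base(TURTLE_URL))

# Reading the same "<> ..." when the serializer also wrote a @base
# directive carrying the RDFa base.
kept_subject = resolve("", effective_base(TURTLE_URL, RDFA_BASE))

print(original_subject)
print(lossy_subject)
print(kept_subject)
```

Without the directive the subject silently becomes the .ttl URL; with
it, the subject is the same IRI the RDFa processor produced.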

> Now... I realize that if all this is done through a service returning some turtle content, I am not sure what the base uri means in the turtle serialization. But one thing is sure: this is _not_ the same as the base URI of the original RDFa file (unless of course a @base is put into the file explicitly)

Very true, it is not.

Best regards,
Niklas



> Ivan
>
>
> On Sep 1, 2011, at 14:06 , Niklas Lindström wrote:
>
>> Ivan, Gregg,
>>
>> I'm quite sure that Gregg is correct. Ivan, you say "URI referring to
>> the processor graph". But there is no predefined means of determining
>> the IRI for a graph *within* a graph. Any RDF format (including RDFa)
>> which only deals with triples has no means to even express what the
>> "containing graph" is (in the quad sense). You may of course express
>> information about the document (base) URI though.
>>
>> Correct me if I'm wrong, but since the conceptual RDF model doesn't
>> include quads (only reification), it isn't even currently clear what
>> "graphs of graphs" are, apart from the instrumental approach taken by
>> e.g. SPARQL to express how you can store and query different contexts.
>>
>> Anyway, there is no special meaning in RDF/XML to rdf:about="", in
>> Turtle to <>, nor in RDFa to about="" (or href="", resource=""). They
>> are syntactic mechanisms for expressing an empty relative IRI, which
>> a processor turning this syntax into triples *must* (AFAIK) resolve
>> against the document base to produce an absolute IRI. All these
>> syntaxes have optional means of supplying this base, and processors
>> should by default use the URL (commonly a http or file URI), System ID
>> or similar, and also provide a means to programmatically supply the
>> base URI.
>>
>> So I'm a bit lost here I'm afraid, as to what you mean with <>, Ivan,
>> if you *don't* mean the base URI.
>>
>> .. The fact that RDFLib actually preserves URIRef("") as a kind of
>> "absolute relative reference" seems like a bug, or at most an esoteric
>> feature to preserve a syntactic form which doesn't represent any valid
>> RDF concept.
>>
>> Now, I'm not saying that the topic itself is unimportant. I've dealt
>> with it a lot when storing data in quad stores -- regularly creating
>> named graphs based on input document URIs, and relating the named
>> graph IRI to this input source (with e.g. dc:source or
>> foaf:primaryTopic). In this way, a user of an RDFa processor may store
>> the resulting triples into a named graph within e.g. a quad store. And
>> if an RDF API supports named graphs (and graphs of graphs), the
>> resulting graph from an RDFa document can reasonably be named (with
>> an IRI) and a triple be added relating this named graph to the source
>> document IRI. But this mechanism of minting graph IRIs and adding data
>> about them (e.g. relating them to the source document(s)) is beyond
>> what RDFa should specify.
>>
>> (It's not uncommon AFAIK to use the actual document IRI for this in
>> SPARQL, albeit this is logically conflating the document and the
>> graph.)
>>
>> In any case, the RDFa syntax is a syntax for RDF triples, and not
>> quads, so it cannot express facts about the relationship (if any)
>> between a named graph and any of the resources described therein.
>> Neither should it. Named graphs and provenance are orthogonal to all
>> triple syntaxes, and should be kept separate from them.
>>
>> Best regards,
>> Niklas
>>
>>
>>
>> On Tue, Aug 30, 2011 at 8:58 AM, Ivan Herman <ivan@w3.org> wrote:
>>>
>>> On Aug 30, 2011, at 07:25 , Gregg Kellogg wrote:
>>>
>>>> On Aug 29, 2011, at 5:56 AM, Ivan Herman wrote:
>>>>
>>>>> After our discussion and the last telco, and subsequent emails, I would like to modify the proposal.
>>>>>
>>>>> Proposal: for each RDFa source, the processor graph should contain one triple of the sort
>>>>>
>>>>> - subject: URI referring to the processor graph (typically <> in Turtle, or @about="" in RDF/XML, though an implementation MAY define a specific URI for that purpose)
>>>>> - predicate: http://www.w3.org/ns/rdfa#hasSource (see also discussion below)
>>>>> - object: the initial value of the base URI, as defined in 7.2 of the RDFa Core document
>>>>
>>>> Processor Graph? I thought we had discussed placing it in the default graph.
>>>
>>> I am very sorry. Yes, I meant the default graph...
>>>
>>>>
>>>> As I discussed before, <> or @about="" end up resolving to the document's IRI or html>head>base, as they describe relative IRIs. It seems that what we need is an empty IRI output, so that another processor encountering a serialization of the original document will see that the document at a new IRI continues to describe the original location. Consider the following:
>>>>
>>>> <html>
>>>>   <head>
>>>>     <base href="http://example.org/original"/>
>>>>   </head>
>>>>   <body about="">
>>>>     <p property="dc:title">Document Title</p>
>>>>   </body>
>>>> </html>
>>>>
>>>> This will generate the following:
>>>>
>>>> @base <http://example.org/original> .
>>>> <> dc:title "Document Title" ; rdfa:hasSource <> .
>>>
>>> Well... if this is the way you generate it, then of course there is an issue. But that is a serialization problem. On the RDF concept level there is no such thing as a relative URI, only absolute ones. Without the @base Turtle directive, this code is
>>>
>>> <http://example.org/original> dc:title "Document Title" ; rdfa:hasSource <http://example.org/original> .
>>>
>>> which is of course not what you would want to generate; you would want, instead,
>>>
>>> <http://example.org/original> dc:title "Document Title" .
>>> <> rdfa:hasSource <http://example.org/original> .
>>>
>>> This just shows that the usage of @base _in the serialization_ might indeed be misleading.
>>>
>>>
>>>
>>>>
>>>> What you might want instead would be the following:
>>>>
>>>> <> rdfa:hasSource <http://example.org/original> .
>>>> <http://example.org/original> dc:title "Document Title" .
>>>>
>>>> The problem is that as soon as the document is parsed, <> is given an actual URI (the base of the document being parsed), so I don't quite see how we accomplish this.
>>>>
>>>>> I have chosen the simplest possible way for the predicate URI, namely to define one for ourselves, which may not be the best. Ideas that came up during the discussion
>>>>>
>>>>> - powder:describedby : but is it correct that the RDF content 'describes' the HTML content? That may not necessarily be the case, it may give additional data that is not in the HTML
>>>>>
>>>>> - foaf:primaryTopic (Virtuoso seems to use that): "property relates a document to the main thing that the document is about.", says the foaf spec; this is, in my view, closer than powder:describedby
>>>>
>>>> I think this is most appropriate.
>>>
>>> As I said, I am not 100% happy with this, but I can live with it:-)
>>>
>>>
>>> Cheers
>>>
>>> Ivan
>>>
>>>
>>>>
>>>>> - dcterms has a provenance property, but its range is defined as a 'ProvenanceStatement', which would then create (via RDFS) extra type information on the original data, and I do not think that is fine
>>>>>
>>>>> - The provenance vocabulary (http://purl.org/net/provenance/ns#) also has some predicates but, just as dcterms, it contains a number of range specifications that yield extra types on the original base URI. I am not sure that is o.k. If we disregard that, then prv:accessedResource is probably the best one[1]; it generates a type information of 'internet Resource'[2], which is fairly harmless. The problem is whether prv is stable enough for a Rec, though.
>>>>>
>>>>> - The draft of the provenance model of the Prov WG[3] seems to have a hasOriginalSource predicate (in section 6.4), but I am not sure whether this is stable.
>>>>>
>>>>>
>>>>> The stable thing is to use our own predicate, and maybe define a sub-property relationship later when the provenance WG's terms gel. Alternatively, we can ask the Prov WG for their advice. I can live with primaryTopic, but it does not feel _really_ right either.
>>>>>
>>>>> Ivan
>>>>>
>>>>>
>>>>>
>>>>> [1] http://trdf.sourceforge.net/provenance/ns.html#accessedResource
>>>>> [2] http://ontologydesignpatterns.org/ont/web/irw.owl#WebResource
>>>>> [3] http://dvcs.w3.org/hg/prov/raw-file/default/model/ProvenanceModel.html
>>>>>
>>>>> ----
>>>>> Ivan Herman, W3C Semantic Web Activity Lead
>>>>> Home: http://www.w3.org/People/Ivan/
>>>>> mobile: +31-641044153
>>>>> PGP Key: http://www.ivan-herman.net/pgpkey.html
>>>>> FOAF: http://www.ivan-herman.net/foaf.rdf
>>>>
>>>
>>>
Received on Thursday, 1 September 2011 12:44:54 UTC