PROV-AQ responses to Stian's review (part 2) from Graham Klyne on 2013-03-11 (public-prov-wg@w3.org from March 2013)

From: Graham Klyne <graham.klyne@zoo.ox.ac.uk>
Date: Mon, 11 Mar 2013 09:55:41 +0000
To: W3C provenance WG <public-prov-wg@w3.org>, Stian Soiland-Reyes <soiland-reyes@cs.manchester.ac.uk>
Message-ID: <513DAA1D.8010706@zoo.ox.ac.uk>
Stian part 2 (http://lists.w3.org/Archives/Public/public-prov-wg/2013Jan/0121.html)

 >>> My responses are prefixed like this.

Summary:
========

PROV-AQ is a very interesting document, because it describes how to
connect provenance to the world, or more specifically to resources on
the Internet. For my own domain of scientific workflow preservation,
there is a particular need for this kind of standardization as
currently there is no recognized mechanism for a service to provide
provenance data in any form.

The core concepts of PROV-AQ are very easy to understand, simple to
use and clearly scoped. The document is however at times heavy to
read, as edge cases are often explored in detail before introducing
the main concepts and how a functionality is to be used.

The terminology is a bit odd compared to the rest of the PROV
documents, I particularly wonder why the authors are using the term
target-URI rather than entity-URI; however I understand this is
careful treading, as in this particular document there is necessarily
a lot of talk about *resources*.

 >>> I think the terminology is now more closely aligned to other PROV specs.  I 
think the remaining differences are due to different intent, and hopefully these 
have been clarified.


It is unclear as to whether PROV-AQ can and should be used for finding
non-PROV provenance descriptions, such as alternative models (OPM,
DCTerms), application-specific resources (logfiles, commit logs), and
human-readable documents (HTML, Word). My view: "PROV-AQ MAY be used
for such purposes, but that PROV-AQ provenance descriptions SHOULD be
available as PROV. PROV SHOULD be represented as PROV-O RDF, and MAY
be represented in other W3C specified PROV serializations.".
 >>> See issue http://www.w3.org/2011/prov/track/issues/428
 >>> I mostly agree with Stian's position here
 >>> I'm not sure if we actually need to say anything about non-PROV formats
 >>> The introduction now explicitly states:
[[
Most mechanisms described in this note are independent of the provenance format 
used, and may be used to access provenance in any available format. For 
interoperable provenance publication, use of PROV-O represented in a 
standardized RDF format is recommended. Where alternative formats are available, 
selection may be made by content negotiation.
]]


I find that the section about pingback service is out of scope for a
PROV-AQ service, and therefore below (point 56) suggest an alternative
approach where the pingback service simply receives a link that a
provenance service may later return or include in its store. I don't
distinguish between 'forward' and 'backward' provenance, so for me
"has provenance" means I will find some provenance data where this
entity ("target-URI") is present - but the WG might have a different
view and could want to distinguish between the two directions, as
popular resources could accumulate a lot of forward traces.
 >>> Comments at point 56.  These changes have been substantially adopted.


Detailed review - numbering continues from previous email:
=======


4. Provenance query service

35) "the naming authority associated with the target-URI is not the
same as the service offering provenance descriptions" - why is this a
problem?

"multiple services have provenance descriptions about the same
resource" - why is this a problem?

Neither of these seems like a problem from the previous bits of this
specification. Section 3 specifically allows multiple provenance-URIs
and doesn't require these to be hosted at the same "naming authority".

I think what you are trying to say in these two is something like:

* "third-party providers of provenance descriptions who can't use the
mechanisms of Section 3 because the target-URI is outside their
control"
 >>> Yes, I've revised the text to reflect this intent, using substantially this 
wording.

36) "the service associated with the target-URI is not accessible for
adding additional information when handling retrieval requests"
I don't know what this means.  Which service? Adding on retrieval? Not
accessible?
 >>> Covered by revised wording above.


37) "query services may provide additional control over what
provenance is returned"
perhaps change "control" to "filters" - make it sound like a good
thing when there is too much provenance!
 >>> Changed.

38) I suggest to add consideration:
"query services may support more complex queries such as "which
entities were derived from entities attributed to agent X""
 >>> I've reworked the motivation taking this on board.


39) "such usage is not described here" -> ".. not described here"
 >>> OBE - text no longer exists


40) "use the information obtained to query for required provenance."
...  add "according to the specified query mechanism"
 >>> Agree
 >>> Updated with similar


41) "Dereferencing a provenance query service URI" --> "... service-URI"
 >>> OK (sect 4.1)
 >>> Done.



42) "this specification does not preclude the use of non-RDF formats"
JSON-LD <http://json-ld.org/spec/latest/json-ld-syntax/> is growing in
popularity, should we perhaps propose a JSON-LD context? I think it
would be quite straightforward, and I actually managed to do it in
about 15 minutes (including learning the syntax).
 >>> I regard this option as being covered by "RDF (in any of its common 
serializations...)"
 >>> I'm quite open to use of JSON-LD, but I feel it may be too early to push this
 >>> The current text sticks with "The service description presented here may be 
supplied as RDF (in any of its common serializations as determined by HTTP 
content negotiation),"
 >>> See also: http://www.w3.org/2011/prov/track/issues/622

[...example elided...]


43) As shown in the complete example in 4.1.3, the
ProvenanceQueryService is not connected to the DirectQueryService or
sd:Service. Given that services don't have a general name, it would be
difficult for implementers to know if a node in the graph is a service
or just happens to be further/additional data (for instance details
about the publisher of the service). It also means I can't mention at
all a service, without implying that I am somehow providing it as part
of my service description.

I therefore suggest that the ProvenanceQueryService should link to the
services using a term like prov:describesService - see modified
example:
 >>> My intention was that they could be located by type, but I would be happy 
to include a prov:describesService relation

@prefix prov: <http://www.w3.org/ns/prov#> .
@prefix sd: <http://www.w3.org/ns/sparql-service-description#> .
@prefix dcterms: <http://purl.org/dc/terms/> .
@prefix foaf: <http://xmlns.com/foaf/0.1/> .

<> a prov:ProvenanceQueryService ;
     prov:describesService <#direct>, <#sparql> ;
     dcterms:publisher <#us> .

<#us> a foaf:Organization ;
    foaf:name "and not a service!" .

<#direct> a prov:DirectQueryService ;
   prov:provenanceUriTemplate "?target={+uri}"
   .
<#sparql> a sd:Service ;
     sd:endpoint </sparql/> ;
     sd:supportedLanguage sd:SPARQL11Query .


The added advantage of this is that you can do the bnode shorthand
when you don't quite know or care what to call your service
entries:

<> a prov:ProvenanceQueryService ;
     prov:describesService
       [ a prov:DirectQueryService ;
         prov:provenanceUriTemplate "?target={+uri}" ],
       [ a sd:Service ;
         sd:endpoint </sparql/> ;
         sd:supportedLanguage sd:SPARQL11Query
       ] .

 >>> I like that - it has the added advantage of making the relationship between 
a service description document and an individual service description more explicit.
 >>> Done


44) I suggest renaming the verbose prov:ProvenanceQueryService to
prov:ServiceDescription. We don't need to say Provenance because of
the namespace. It's also not a service itself, just descriptions. This
avoids confusion whether the DirectQueryService is a
ProvenanceQueryService. Combined with the prov:describesService from
above, the distinction should be clear.
 >>> Done

45) This protocol typically combines the target-URI with the
service-URI to formulate an HTTP GET request, according to the
following convention:

Typically..? Is this not meant to *define* the protocol? Remove "typically".

 >>> Partly due to other changes, this has been reworked to require the URI to 
be defined by the URI template in the service description.


46) "provenance description for the resource-URI"
  - while I like "resource-URI" over "target-URI" (and perhaps
entity-URI even more) - I think this is a typo.  --> target-URI
 >>> Changed to target-uri.  (We considered using entity-uri throughout, but 
this would not have covered activities.)


47) "Any server that implements this protocol and receives a request
URI in this form SHOULD return a provenance description for the
resource-URI embedded in the query component, where that URI is the
result of percent-decoding the value associated with the
provenance-resource key" - a bit heavy and cryptic sentence. What is
"the value associated with the provenance-resource key"?
 >>> Sect 4.2
 >>> Re-worked.

48) "If the supplied resource-URI includes a fragment identifier, the
'#' MUST be %-encoded as %23 when constructing the provenance-URI
value; similarly, any '&' character in the resource-URI must be
%-encoded as %26 [[RFC3986]]."  - I am a bit uncertain about this -
are you implying that only those characters need to be escaped? What
about "%"? It should be clearly specified if a URL like
http://example.com/with%20spaces should be sent along as-is with %20,
or double-encoded as %2520.  I agree that it's very important to
highlight that # and & must be %-encoded as they would otherwise fall
out - but it should also here clearly indicate the regular encoding.
As this is getting a bit long - perhaps split into a second paragraph
which is only about encoding. (Ie. first paragraph says what is to be
returned, etc, second paragraph just details about the URI encoding)
 >>> This has been substantially re-worked.  Some of the discussion has been 
moved to a supporting note.
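
For illustration only, the encoding question in point 48 can be sketched in Python. This assumes the target-URI is carried as a single query parameter value and that every reserved character, including '%', is percent-encoded; the re-worked PROV-AQ text may specify something different. The function names and the service-URI http://prov.example.org/query are hypothetical, and the "target" key is borrowed from the example template in point 43.

```python
from urllib.parse import quote, unquote


def encode_target(target_uri: str) -> str:
    """Percent-encode every character outside the RFC 3986 unreserved set."""
    # safe="" means '/', ':', '?', '#', '&' and '%' are all encoded.
    return quote(target_uri, safe="")


def direct_query_uri(service_uri: str, target_uri: str, key: str = "target") -> str:
    """Build <service-URI>?<key>=<encoded target-URI> (assumed convention)."""
    return f"{service_uri}?{key}={encode_target(target_uri)}"


def decode_target(encoded: str) -> str:
    """Server side: a single percent-decoding step recovers the target-URI."""
    return unquote(encoded)


example = direct_query_uri(
    "http://prov.example.org/query",  # hypothetical service-URI
    "http://example.com/with%20spaces?a=1&b=2#frag",
)
print(example)
```

Under this assumption an existing %20 is sent as %2520, '#' as %23 and '&' as %26, so nothing in the target-URI can be mistaken for part of the request URI's own query or fragment.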

49) "If the provenance described by the request does not exist in the
server, a 404 Not Found response code SHOULD be returned."

This section does not define other error conditions, like what the
server should do if access is restricted. Obviously the regular HTTP
status codes apply, but it might be worth pointing out that the server
is not required to make such responses public - so it might for
instance require authentication with 401, or 'hide' the existence of a
response with 404. " This status code is commonly used when the server
does not wish to reveal exactly why the request has been refused, or
when no other response is applicable.".

 >>> I've re-worked the text


Probably this is out of scope - but I was thinking that it could be
useful if the server could return 403 Forbidden, for instance because
it refuses to give provenance details for resources that are not 'his'
(not under example.com for instance). It could return a text/uri-list
of the base URIs which the server will support.
(This is a slight abuse of text/uri-list because there might be no
resource with that particular URI - more appropriate would be a list
of URI templates, but there is no media type for that.)
 >>> I agree it's out of scope.
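
As a rough sketch of the status-code choices discussed in point 49, the following Python function picks a response code for a provenance request. It is only an illustration of the options raised in the review, not the behaviour PROV-AQ defines; the function name and its boolean parameters are hypothetical.

```python
from http import HTTPStatus


def choose_status(known: bool, restricted: bool, authorized: bool,
                  hide_existence: bool = False) -> HTTPStatus:
    """Pick an HTTP status for a provenance request (illustrative policy only)."""
    if not known:
        # Provenance for the target-URI is unknown to the server.
        return HTTPStatus.NOT_FOUND
    if restricted and not authorized:
        # The server may ask for credentials, or conceal that anything exists.
        return HTTPStatus.NOT_FOUND if hide_existence else HTTPStatus.UNAUTHORIZED
    return HTTPStatus.OK


print(choose_status(known=True, restricted=True, authorized=False))
```

A server that does not want to reveal which resources it holds provenance for would set hide_existence, making a restricted resource indistinguishable from an unknown one.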


50) "does not exist in the server"  --> change to "is unknown to the
server" - as there is no requirement that the provenance resource is
on the same server. (and neither should there be!)
 >>> Done as part of above.


51) "should be capable of returning RDF using the vocabulary defined
by [PROV-O], in any standard RDF serialization (e.g. RDF/XML), or any
other standard serialization of the Provenance Model specification
[PROV-DM]."  - both "any" change to "a" - only one of them is needed,
not all - which 'any' might imply!
 >>> sect 4.2
 >>> Re-worked


52) "other standard serialization (..) PROV-DM"  - Is this something
we've defined somewhere? How would you know if say PROV JSON is a
standard serialization?
 >>> You'd know because it's defined in a standard specification :^)
 >>> The intent is to leave the way open to future standards.
 >>> The text is re-worked, and now more open to any format through content 
negotiation.
 >>> See also: http://www.w3.org/2011/prov/track/issues/428


53) "A provenance query service SHOULD  be capable of returning RDF
... , or any other standard serialization of the Provenance Model
specification"
- it is unclear if second part is covered by the SHOULD or not.   I
can see 4 interpretations:

a) Service SHOULD return PROV-O RDF, and MAY return other PROV serializations

b) Service SHOULD return ( either PROV-O RDF or other PROV serialization )

c) Service SHOULD return at least one of ( PROV-O RDF, other PROV
serialization)  (ie.  simply "one of the PROV serializations")

d) Service SHOULD return PROV-O RDF.   Other PROV serializations could
be used. (no MAY/SHOULD).

I would recommend a) above - as then the clients would have some
reasonable expectation about what is generally supported, rather than
having to build in support for PROVXML, PROV-N, etc. just because they
are all covered by the same SHOULD of b).
 >>> Agree -- see comments above at introduction to review.
 >>> SHOULD dropped in re-work
 >>> See also: http://www.w3.org/2011/prov/track/issues/428


54) "Previously, section 3. Locating provenance descriptions has
described use of HTTP Link: header fields and HTML <link> elements to
indicate provenance query services. Beyond that, this specification
does not define any specific mechanism for discovering query services.
"  - this forgot about section 3.3 Resource represented as RDF.
 >>> Section 4.3
 >>> "RDF statements" added

5. Forward provenance

   S: Link: <http://acme.example.org/pingback/super-widget>;
           rel=http://www.w3.org/ns/prov#provPingback


55) I would rename this to just "pingback" why double "prov"?

           rel=http://www.w3.org/ns/prov#pingback
 >>> Done
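
A client that wants to find the pingback URI has to pick it out of the Link header. The following Python sketch shows one way to do that, assuming the renamed relation http://www.w3.org/ns/prov#pingback and a simplified Link syntax (one rel parameter per link, quoted or unquoted). The function name is hypothetical, and a production client would want a full RFC 5988 parser.

```python
import re
from typing import List

PINGBACK_REL = "http://www.w3.org/ns/prov#pingback"

# Matches "<uri>; rel=value" or "<uri>; rel="value""; other parameters are ignored.
LINK_PATTERN = re.compile(r'<([^>]*)>\s*;\s*rel\s*=\s*"?([^",;]+)"?')


def pingback_uris(link_header_value: str) -> List[str]:
    """Return the targets of all links whose rel includes the pingback relation."""
    found = []
    for match in LINK_PATTERN.finditer(link_header_value):
        target, rel_value = match.group(1), match.group(2)
        # A quoted rel value may hold several space-separated relation types.
        if PINGBACK_REL in rel_value.split():
            found.append(target)
    return found


header = ('<http://acme.example.org/pingback/super-widget>; '
          'rel=http://www.w3.org/ns/prov#pingback')
print(pingback_uris(header))
```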


  A consumer of the resource, or some other system, may perform an HTTP POST 
operation to the pingback URI where the POST request body contains provenance in 
one of the recognized provenance description formats. For interoperability, a 
ping-back receiving service should be able to accept at least PROV-O provenance 
presented as RDF/XML or Turtle.

56) I think this kind of "provenance posting" (and hence intended
provenance-URI creation) sounds out of scope for a pingback service
and probably also for this whole document. There are many existing
protocols on how to manage and create resources, such as AtomPub,
WebDav (uggh..), SFTP, etc. I don't think we need to go into that area
to define yet another way on how to create HTTP resources.

I would not expect to have to post my actual provenance to the
service, which implies that the service then should keep this and
present it willy-nilly to others as its own.  This document also does
not say much about what the server is expected or not to do with this,
or how it can refuse provenance which it does not like or permit.

I would rather think that a pingback service should work like
pingbacks in blogs, where the pingback simply gives the blog a URI of
a third-party site which talks about a given blog post at the pingback
host.

[Details from original message elided]

 >>> This proposal has been adopted and discussed with Stian.  I think it does 
indeed sit better with the goals of PROV-AQ.


6. Security considerations

  When retrieving a provenance URI from a document, steps should be taken to 
ensure the document itself is an accurate copy of the original whose author is 
being trusted (e.g. signature checking, or use of a trusted secure web service).
57) What is "document" above? Should this refer to section 3.2?
 >>> Yes - cross-ref section 3.2, 3.3
 >>> Discussion moved to 1.3, and cross-ref added.

58) A paragraph should be added about cross-site request forgery and
distributed denial-of-service attacks, similar to my blurb above:

When clients and servers are retrieving submitted URIs such as
provenance descriptions and following or registering links; reasonable
care should be taken to prevent malicious use such as distributed
denial of service attacks (DDoS), cross-site request forgery (CSRF),
spamming and hosting of inappropriate materials. Reasonable
preventions might include same-origin policy, HTTP authorization, SSL,
rate-limiting, spam filters, moderation queues, user acknowledgements
and validation. It is out of scope for this document to specify how
such mechanisms work and should be applied.

 >>> I'm not sure how CSRF applies here:  my understanding is that that's a 
browser issue, not a general application issue
 >>> I've added this, but have an outstanding query about CSRF

Provenance descriptions may provide a route for leakage of privacy-related 
information


59) We should also add something obvious like:

Accessing provenance services might reveal to the service and
third-parties information which is considered private, including which
resources a client has taken interest in. For instance, a browser
extension which collects all provenance data for a resource which is
being saved to the local disk, could be revealing user interest in a
sensitive resource to a third-party site listed by prov:hasProvenance
or prov:hasQueryService relation. A detailed query submitted to a
third-party provenance query service might be revealing personal
information such as social security numbers.
 >>> Worked in

B. Names added to prov: namespace


60) Broken definition links: DirectQueryService, provenanceURITemplate
 >>> Fixed.


61) Where can I download the OWL for the additional relations?
 >>> Placeholders pointing into mercurial added, with TODO to fix.


62) After table, add a note like "In addition, PROV-AQ reuses these
terms from the SPARQL service description vocabulary: sd:AA sd: BB"
 >>> Actually, I don't think PROV-AQ is re-using those terms so much as 
providing a framework within which they, and others, MAY be applicable.
 >>> The intent of this summary was to provide a summary of terms over and above 
other PROV-x specs that are in the prov: namespace.
 >>> No change


It is tempting to think of prov:DirectQueryService as a particular kind of 
prov:ProvenanceQueryService (..)



63) This section can be deleted if you follow my previous suggestion
to rename the latter to prov:ServiceDescription and add
prov:describesService relation. (See 43/44 above)
 >>> Yes, the explicit relation makes that clearer
 >>> Done - deleted



C. References
I have NOT checked the validity or correctness of most of these links.

Should not SPARQL-SD and URI-template be given as normative
references, as this specification depends on them?

 >>> Ivan confirms we cannot have normative references.
Received on Monday, 11 March 2013 10:01:48 UTC