Re: PROV-ISSUE-333 (review-prov-dm-constraints-wd5): issue to collect feedback on prov-dm-constraints wd5 [prov-dm-constraints] from James Cheney on 2012-04-09 (public-prov-wg@w3.org from April 2012)

From: James Cheney <jcheney@inf.ed.ac.uk>
Date: Mon, 9 Apr 2012 23:51:25 +0100
To: Provenance Working Group <public-prov-wg@w3.org>
Message-Id: <3E48B90A-5C3B-4AFF-A082-FAF565AEBDB0@inf.ed.ac.uk>

Here are my comments.

Review questions:

> Can the document be released as a next public working draft? If no, what are
> the blocking issues?

I don't think it's ready. Putting myself in the shoes of a developer, what exactly would it mean to implement the PROV-DM-CONSTRAINTS recommendation? It isn't clear what a compliant implementation must do, may do, or should do.

More concretely, there are at least four kinds of semi-formal boxes in the document:

- constraints (failure to satisfy a constraint is bad?)
- inferences (which provide additional information; do we need to check that inferences don't violate constraints?)
- definitions (of some PROV-N assertions in terms of others; usually called constraints too)
- "interpretations" - which seem to be inferences about event ordering mostly

and it is not clear to me what it means to satisfy or apply them.

> * Is the structure of the document approved?

PROV-DM-CONSTRAINTS is not especially well-organized. A lot of the early material is redundant given PROV-DM.

What is the difference between definitional constraints and inferences, "interpretations", and the remaining miscellaneous constraints? Why not have one big constraint section, organized analogously to the main data model description? There are a lot of forward references to "interpretations" (and the meaning of this term, or the difference between an "interpretation" and a "constraint" or "inference", is never discussed).

> * Can the short name of the document be confirmed (in particular, for prov-n,
> prov-dm-constraints, since request needs to be sent for publication)?

I think the names are OK, assuming this way of splitting things sticks. However, concerning the content, there now seems to be enough overlap between PROV-DM-CONSTRAINTS and PROV-SEM that it may make sense to merge them somehow, as otherwise there will be a lot of duplication and there are parts of PROV-DM-CONSTRAINTS that don't make much sense unless we say a little about the intended meaning/semantics of the records.

High-level comments:

* Both the main DM and the constraints documents are missing a clear description of the main problems we are trying to overcome, which (in my view) include:

- We are trying to deal sanely with descriptions of "things that can change over time"; much of the complexity comes from this; in the common case where things aren't changing or change slowly, much of the complexity (specifically alternate/specialization) can be avoided

- different users/applications may make different (equally valid) subjective decisions about where to draw boundaries between entities, artifacts, etc. and what events to consider important

- provenance needs to allow for different perspectives on the same situation, for example to allow duplicate elimination where multiple descriptions of the same entity exist (Royal Society), or to allow for disambiguation when multiple things can have similar descriptions (person in chair)

- The purpose of the constraints document is (I think) to help manage this complexity by identifying constraints that we can check to determine whether a provenance record is minimally sane, defining some complex concepts in terms of other simpler ones, and identifying inferences that can be used to fill in implicit information (which might itself be relevant to checking consistency).

* Most of the constraints themselves seem sensible, and I will add a section to the semantics indicating which are satisfied by the current version and if not, why.

* The document doesn't adequately explain what the constraints, inferences, definitions and "interpretations" are for, or how an implementation might satisfy them.

One rationalization (as suggested by Graham) may be: PROV-DM instances can be syntactically well-formed but nonsensical. To rule this out, PROV-DM-CONSTRAINTS introduces a class of "valid" or "strict" PROV-DM data. Valid PROV-DM has to satisfy some constraints, and in addition, supports some inferences (which must, after being applied, not violate the constraints either).

However, even if this is the idea (it is never said plainly), it's not clear to me how an implementation should interpret them. Many of the constraints are of the form "if such and such holds then something else must hold". What are the consequences if not? Should an application reject the data? or add the missing data?

Similarly, some inferences say "if such and such holds then some other things must hold", but are written in such a way that it's not clear whether you mean "the data model instance must already have the additional records" or "we consider the additional records to be implicitly part of the model". If the latter, this seems to imply that new ids will be created at run time for unknown entities, e.g. in Inference:wasRevision.
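
To make the two readings concrete, here is a minimal sketch in Python. Everything in it is an assumption made for illustration: records are plain tuples, the rule is a toy one (a derivation implies some activity that generated one entity and used the other) rather than a rule quoted from the draft, and the names check, complete and the "_:aN" identifiers are hypothetical.

```python
import itertools

# Toy rule (an assumption, not quoted from the draft):
# wasDerivedFrom(e2, e1) implies there is some activity a with
# wasGeneratedBy(e2, a) and used(a, e1).

def missing_activities(records):
    """Return the derivations for which no witnessing activity is recorded."""
    generated = {(r[1], r[2]) for r in records if r[0] == "wasGeneratedBy"}
    used = {(r[1], r[2]) for r in records if r[0] == "used"}
    activities = {a for (_, a) in generated}
    missing = []
    for r in records:
        if r[0] != "wasDerivedFrom":
            continue
        _, e2, e1 = r
        if not any((e2, a) in generated and (a, e1) in used for a in activities):
            missing.append(r)
    return missing

def check(records):
    """Reading 1: the instance must already contain the implied records."""
    return not missing_activities(records)

def complete(records):
    """Reading 2: implied records are added, inventing ids for unknown activities."""
    fresh = (f"_:a{i}" for i in itertools.count())
    result = set(records)
    for _, e2, e1 in missing_activities(records):
        a = next(fresh)  # hypothetical fresh id; collisions with existing ids are ignored here
        result.add(("wasGeneratedBy", e2, a))
        result.add(("used", a, e1))
    return result

instance = {("wasDerivedFrom", "e2", "e1")}
print(check(instance))            # reading 1: the instance as given
print(check(complete(instance)))  # reading 2: the instance after completion
```

Under the first reading the bare derivation is rejected; under the second it is accepted, but only because the implementation invented an activity id that nobody asserted. The document should say which behaviour is intended.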

* The status of missing values is unclear. Moreover, in wasRevisionOf and Quotation, the missing agent is specified as meaning either no agent exists or one exists but is not identified. This seems to indicate that we cannot make any assumption about the missing value's existence or value; I guess the default case is that we assume the missing value exists but is unknown. But it's not clear what difference this makes to an implementation - under what circumstances would it matter?

Detailed comments:

Third, fourth and fifth paragraphs of 2.1.2 are unclear - what does "it is anticipated that" mean? The same as "we expect that in practice"? Or "implementations had better..."?

"instantaneous event to *be* inferred"

- concept of actual verification of ordering constraints is vague

Sec. 2.2 - heading "Attributes in Entities and Beyond" is unspecific. Why not just "Attributes"? The second paragraph (which is a complete sentence) is also quite convoluted.

Here, as in a lot of places, the text is extremely verbose/indirect where it doesn't need to be, e.g. "It is the purpose of attributes in PROV-DM to help fix some aspect of entities" -> "Attributes in PROV-DM describe some aspects of entities".

"period comprised between" -> "period between"

"alternative entity" - I think you are alluding to "alternativeOf", without making it clear to a reader that "alternative" refers to the PROV-DM concept. Perhaps "An alternative entity that describes the same thing"

In example: "expressed *as*:". Also, why not an example in PROV-O or PROV-N syntax?

"more important" - what is the metric for importance? I think you are basically saying that we don't assume an absolute ground truth with respect to which we can judge correctness or completeness of descriptions.

"belong to a variety of PROV-DM objects" - maybe "can be associated with" instead of "belong"? Also, is "object" used in the same sense as in the semantics?

Sec 2.3. Last paragraph: "When this is the case, this specification defines such inferences" - I think it's more accurate to say "This specification defines some such inferences" - otherwise it sounds like we're claiming to have a complete axiomatization of possible inferences, which I don't think we do.

Sec. 2.4. Some of the problems discussed here are relevant whether or not we consider accounts.

Also, "must" and "may" are used to constrain hypothetical account mechanisms. I have no idea how to check such a constraint.

What is the "set of descriptions" of an account? MUST it increase monotonically with time?

Since PROV-DM doesn't specify how accounts can be handled, or provide an abstraction specifying how implementations could provide for accounts, I don't see any point in saying anything in PROV-DM-CONSTRAINTS about accounts, unless we have concrete examples in mind.

sec. 2.5. "some value SHOULD be assumed to exist" - If I understand correctly, this means that implementations can ignore this requirement if there is a strong reason to. But I don't understand who is doing the assuming and what the effect on the implementation is. Can an implementor satisfy this recommendation simply by saying in the documentation that she assumes missing optional values exist, or is there something that the implementation actually has to do in order to fulfil this expectation?

sec. 3.1.1. For entity, we also don't assume that the attributes uniquely identify the entity (or underlying thing), right? Examples of the various (non)properties would help.

- Also, here is the first of many forward references labeled "interpretation: ... see blah blah". These make zero sense the first time you read through the document.

Sec. 3.1.2. "However, an activity *record*"

- The bullet point under "further considerations" is useful information that could be said earlier, or in PROV-DM.

Sec. 3.1.3 "This instantaneous event encompasses a description of the modalities of generation of this entity by this activity, by means of key-value pairs". This is opaque. Suggest: "Generation events can have attributes that describe how the entity was generated by the activity."

- Constraint unique-generation-time: isn't the activity also unique?

Sec. 3.1.6. The constraint wasInformedBy-definition is missing the identifiers on wasGeneratedBy and used.

- The term information flow is used without explanation that this means "communication"

Sec. 3.1.8. Similarly, the constraint here is missing ids on wasGeneratedBy and wasStartedBy.

Is the constraint really a definition?

Sec. 3.2.1 has no explanation.

Sec. 3.2.2 has two sentences of explanation, but there is not any context.

Sec. 3.2.3 has no explanation apart from a forward reference to a later constraint.

Sec. 3.3.1. "since of e2" -> "since e2"

Sec. 3.3.4 Traceability-inference: line 5 is incomplete. Also, the section talks about "the definition of tracedTo" - do the traceability-inference rules constitute a (recursive) definition of tracedTo? Or are they just some rules for inferring tracedTo and there could be others that don't conform to the rules? What is the whole definition, if these are just parts?
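
The difference between the two readings can be sketched as follows. This is only an illustration under stated assumptions: the two rules used (a derivation implies tracedTo, and tracedTo is transitive) stand in for whatever the complete rule set is, and the function and variable names are hypothetical.

```python
def traced_to(base_edges):
    """Least relation containing base_edges and closed under transitivity."""
    closure = set(base_edges)
    changed = True
    while changed:
        changed = False
        for (x, y) in list(closure):
            for (y2, z) in list(closure):
                if y == y2 and (x, z) not in closure:
                    closure.add((x, z))
                    changed = True
    return closure

# Derivations asserted in some instance (hypothetical ids).
derived = {("e3", "e2"), ("e2", "e1")}
inferred = traced_to(derived)

# A tracedTo record asserted directly, with no derivation behind it.
asserted = {("e9", "e0")}

print(("e3", "e1") in inferred)  # follows from the rules
print(asserted <= inferred)      # is the bare assertion justified by the rules?
```

If the rules are a definition, tracedTo is exactly the computed closure and the bare assertion tracedTo(e9, e0) is not justified by anything. If they are only some rules for inferring tracedTo, that assertion is acceptable and the closure is merely a lower bound.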

- Why use superscripts on e in traceability-assertion?

Sec. 3.4.1. Anti-symmetry counter example: I suggest saying that we don't assume antisymmetry because we don't assume that two different entities that happen to have the same information about the same thing are the same entity. (However, we can easily accommodate antisymmetry in the formal semantics.)

The example about the email, printed version, and thoughts is too vague to be useful. Thoughts are not the same kind of thing as emails, which are not the same kind of thing as printouts.

Sec. 3.4.2. Appears to tacitly assume that alternateOf is defined in terms of specialization, which I believe has been revisited as a result of email discussion already.

- The customerInChair example is confusing, since it seems to use the same entity id customerInChair to refer to different things at different times.

Sec. 4. I'm not sure of the rationale for unique-description-in-account.

- In the example, why list both alternateOf links, since it is (at least) symmetric?

Sec. 5. "to be meaningful" - is this the same as "valid" in the sense I suggested above? What are the consequences of failing to be meaningful (by violating some constraints)?

- "that such *an* instantiated"

- "the four kind of " -> "the five kinds of"

- "By transitivity of generation-precedes-usage, generation of an entity precedes its invalidation" - This is ONLY true if the entity is ever used! I think it is better to explicitly give a constraint "usage-precedes-invalidation" dual to "generation-precedes-usage".
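
A small sketch of why the transitivity argument depends on a usage existing. It assumes, for illustration only, that the ordering is derived from two kinds of edges per entity (generation before each usage, each usage before invalidation); the event names "gen", "inv" and "use1" and both function names are hypothetical.

```python
def precedes(edges, start, goal):
    """True if goal is reachable from start by following 'precedes' edges."""
    seen, stack = set(), [start]
    while stack:
        node = stack.pop()
        for (a, b) in edges:
            if a == node and b not in seen:
                seen.add(b)
                stack.append(b)
    return goal in seen

def ordering_edges(usages):
    """Edges for one entity: generation before each usage, each usage before invalidation."""
    edges = set()
    for u in usages:
        edges.add(("gen", u))   # generation-precedes-usage
        edges.add((u, "inv"))   # usage-precedes-invalidation
    return edges

print(precedes(ordering_edges(["use1"]), "gen", "inv"))  # entity used once
print(precedes(ordering_edges([]), "gen", "inv"))        # entity never used
```

With one usage the ordering between generation and invalidation is derivable; with none there are no edges at all, so nothing relates the two events unless a constraint states it directly.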

- Just after "wasStartedByAgent-ordering" - "A similar constraints exists" -> "...constraint..."

- wasAttributedWith should be wasAttributedTo. Similarly, "attributed with" in the preceding paragraphs should be "attributed to".

Sec. 6. The numbering of prior sections at the beginning of this section is off.

- Second paragraph: seems like a long-winded and circular way of saying "We assume that each entity is generated exactly once."

- "said not *to* be". Also, is "structurally well-formed" the same as my "valid" notion?

- Overall, as for the discussion of accounts earlier in sec. 2.4, I don't really see the point of talking about what happens when we merge two accounts since PROV does not specify anything about accounts.

Sec. 7 - The constraints about collections (and to some extent the collections mechanisms themselves) seem preliminary and not particularly strongly motivated to me. I agree that collections are important, but I am not sure I agree that they're well understood enough to merit standardization.

- Moreover, I'm not sure I agree with collection-unique-derivation. Suppose

c1 = {}
c2 = {a:1}, obtained by inserting (a,1) into c1
c3 = {a:1,b:2}, obtained by inserting (b,2) into c2

I might want to be able to say that c3 is obtained by inserting (a,1), (b,2) into c1. Why can't I?
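
As a sanity check on the example, here is a minimal sketch that models collections as Python dicts. It reads the example as a chain c1, c2, c3 (an assumption about the intended indices), and the insert helper is hypothetical, not a PROV-DM operation.

```python
def insert(collection, *pairs):
    """Return a new collection with the given key-value pairs added."""
    result = dict(collection)
    result.update(pairs)
    return result

c1 = {}
c2 = insert(c1, ("a", 1))   # one pair inserted into c1
c3 = insert(c2, ("b", 2))   # one pair inserted into c2

# The same c3 also results from a single two-pair insertion into c1.
print(c3 == insert(c1, ("a", 1), ("b", 2)))
```

Both descriptions yield the same contents for c3, so a constraint that allows only one derivation per collection forces a choice between two equally accurate accounts.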

- Do collection-derivation relations imply alternateOf? (And for that matter, does wasRevision?)


Received on Monday, 9 April 2012 22:51:53 UTC