Dm-constraints review 2012 May 17 by Lebo

From Provenance WG Wiki

Reviewing after changes from last round of feedback.

2012 May 17, Tim Lebo

James' request:


"defines the induced notion of equivalence"

From what is "equivalence" induced? From "Inferences and Definitions"? If so, please make this explicit.


Suggest splitting up:

"In separate sections we consider additional constraints specific to collections and accounts (section 6. Collection Constraints and section 7. Account Constraints)."

->

"In separate sections we consider additional constraints specific to collections (link: section 6) and accounts (link: section 7), respectively."



"PROV-DM is a conceptual data model for provenance (realizable using different serializations such as PROV-N, PROV-O, or PROV-XML)."


"PROV-DM is a conceptual data model for provenance which is realizable using different serializations such as PROV-N, PROV-O, or PROV-XML."

4) First paragraph of 1.2 Purpose seems to beat up DM a bit too much without balancing it with "what it is good for". Regardless, the point is well taken and the distinction is drawn clearly.

5) Third paragraph quickly lists what the spec provides, but does not tie the pieces together. What is the relation between the contents of sections 2-5? (Are sections 6 and 7 just section 5 applied to two topics?) THIS POINT IS CRITICAL, since it provides the overview the reader needs to start digging into the details provided. (Some of what is said in section 2 can help here, but it should also be said here, I think.) Alternatively, destroy this paragraph in its entirety and let "2. Compliance" establish the relations between the sections.

6) The "2. Compliance" section is a nice new component. It grabs developers by the horns; there's no question why this document matters (interoperability!).

7) Regarding:

"Should we specify a way for PROV instances to say whether they are meant to be validated or not? Seems outside the scope of this document, may require changes to PROV-N."

This can be done by describing Bundles using the constructs we already have.


"In this section, we describe inferences and definitions that may be used on provenance data, and a notion of equivalence on PROV instances."

I thought that section 4 did "equivalence" - why is it being talked about in 3, too?


Regarding, "TODO: Is this re-inventing blank nodes in PROV-DM, and do we want to do this?" - I don't think that it is. You're identifying new resources, which you can name [with URIs].


Perhaps missing style on defined_exp in "provenance expression defined_exp is defined" ?


Breaking section 3 by component is a nice organizational structure. Thanks for letting me use something I already know! :-)


It's rather annoying that one can't select and copy the names of the inferences and definitions. Though, I'm guessing this is a respec problem that is not easily fixed.


Are we using semicolons now? i.e., "wasAssociatedWith(-; a, ag, -, -)" not "wasAssociatedWith(-,a, ag, -, -)"


Section 4: "we define a notion of equivalence of PROV s" - what is "PROV s"?


Section 3's intro should hint towards how Section 4's equivalence will be used to check inferences (with a link to section 4). This should be straightforward, because equivalence is already mentioned "defined provenance expressions can be replaced by their definitions, and vice versa."


Why does Section 4's equivalence check inferences and constraints (and not definitions)? Constraints are in section 5, which I haven't read yet (as a top-to-bottom reader). Where did definitions go in the mix? Section 4 talks about one of two things that I've read about and another that I haven't. I'm falling into meta-wondering about the document. What is the difference between all of these things? I thought I had it, but I lost it.

suggest in the intro of section 4:

  • citing (and linking) "From section 3" when saying "checking inferences"
  • citing (and linking) "upcoming in section 5" when saying "checking constraints" (to acknowledge that you know I haven't read it yet, and don't expect me to have read it)
  • mention why one doesn't "check definitions".


Section 4.1

"Activities also allow for an optional start time attribute. If both are specified"

"both" what?


section 4.1

"Unless otherwise specified, when an optional attribute is not present in a statement, some value should be assumed to exist for this attribute, though it is not known which"

Does this include all of Derivation's hadActivity, hadUsage, and hadGeneration? That seems like it imposes a lot of verbosity. And fighting the identity problem between (e.g.) the Generation's hadUsage and its hadActivity's qualifiedUsage seems like a nightmare. Is this handled?


4.2 Normalization

You're defining it here and have used it before. Previous uses were intuitive, but it might be nice to link from previous uses to this section. However, adding the link down might invite a distraction for the user (when I'm arguing for a natural linear reading).


You've been using 'instance' a lot, and I've been okay with it. But its use in the def of Normalization seems to be "bigger" ("more statements") than I've grown to conceptualize it in reading the document so far (which I think is caused by looking at so many inferences that look at small chunks and not a whole lot of chunks).

Are we not talking about bundles when we say "instances"? :-)


I'm surprised that "closure" isn't mentioned in the section on normal form.

21) I love how short section 4 is.


"This section defines a collection of constraints on PROV instances. A PROV instance is valid if, after applying all possible inference and definition rules from Section 2, the resulting instance satisfies all of the constraints specified in this section." is great, and should also be in section 1 and/or 2. This starts to give the sense of order of operations one applies inf/defs, norms, and constraints.


section 1.2 Purpose says "structural constraints" is the first of two types of constraints (with ordering), while section 5 says "uniqueness constraints" are the first of two types. Did structural go away? It seems that the def/inf handles the structure now.


What does "Attribute uniqueness constraints? " mean?


"We assume that the various identified objects of PROV-DM have unique statements describing them within a PROV instance.

Given an entity identifier e, there is at most one expression entity(e,attrs), where attrs is some set of attribute-values. "

Should be explained or motivated more. It seems to me that this is trying to impose a constraint on the serialization, but I don't think that's the intent.
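
One possible reading of the quoted constraint, sketched as a check on an instance treated as data rather than on any serialization. The representation (a list of `(identifier, attributes)` pairs) and the function name `duplicated_entity_ids` are assumptions for illustration only.

```python
from collections import Counter
from typing import Dict, Iterable, List, Tuple

# Hypothetical encoding of entity(e, attrs): (identifier, attribute dict).
EntityStatement = Tuple[str, Dict[str, str]]


def duplicated_entity_ids(statements: Iterable[EntityStatement]) -> List[str]:
    """Return identifiers that appear in more than one entity statement."""
    counts = Counter(identifier for identifier, _attrs in statements)
    return sorted(identifier for identifier, n in counts.items() if n > 1)


example = [
    ("e1", {"type": "document"}),
    ("e1", {"version": "2"}),   # second statement about e1 violates "at most one"
    ("e2", {}),
]
violations = duplicated_entity_ids(example)
```

Whether the intended rule is "reject such an instance" or "merge the two statements about e1" is exactly what the text should explain; the sketch only detects the situation.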


Agree. The paragraph after "The following discussion is unclear: what is being said here, and why?" seems to be distracting.


The use of triangles in the ordering constraints leaves me uncertain about which vertical line they are constraining. I worry that it is not the "very next one". I would be more confident if a yellow horizontal "edge" actually touched the two vertical edges that it was constraining (and the yellow triangle could be on top of that edge).


I think it would help if the "corresponding edges between entities and activities" were the same visual style as the vertical line marking the time the Usage, generation and derivation occurred. A matching visual style provides a Gestalt that matches the concept. I am looking at subfigures b and c in 5.2.0


Is Invalidation missing in "The following figure summarizes the ordering constraints" paragraph?


The figure is not labeled. "The following figure summarizes the ordering constraints"


The link to goes "too far down" and one cannot see the narrative that discusses it. I imagine the same is true for other constraints.


"entities have lifetimes: they are generated, then can be used, revised, or other entities "

How can entities be revised? Doesn't that go against what an entity is?


"As with activities, entities have lifetimes: they are generated, then can be used, revised, or other entities can be derived from them, and finally are invalidated"

"are invalidated" -> "may be invalidated"?


5.2.2 figure is not labeled


The figure in 5.2.2 should have vertical lines with visual styles that match the diagonal arrow that they go with.


Figure 5.2.3 is not labeled.


Although section 6 is in progress, the meta discourse should be done sooner rather than later to situate itself into the document and guide its content.


6.2 "The state of a collection is only known to the extent that a chain of derivations starting from an empty collection can be found." - but the last section just gave us the convenience memberOf :-/


section 8.1 highlighting some stuff that I'd like to see kept (somewhere):

"From a provenance viewpoint, it is important to identify a partial state of something, i.e. something with some aspects that have been fixed, so that it becomes possible to express its provenance (i.e. what caused the thing with these specific aspects)."

perhaps we can avoid discussing entity vs. thing?

"An entity is a thing one wants to provide provenance for and whose situation in the world is described by some fixed attributes. An entity has a characterization interval, or lifetime, defined as the period between its generation event and its invalidation event. An entity's attributes are established when the entity is created and describe the entity's situation and (partial) state during an entity's lifetime."

"A different entity (perhaps representing a different user or system perspective) may fix other aspects of the same thing, and its provenance may be different. Different entities that are aspects of the same thing are called alternate, and the PROV-DM relations of specialization and alternate can be used to link such entities."


"Besides entities, a variety of other PROV-DM objects have attributes, including activity, generation, usage, start, end, communication, attribution, association, responsibility, and derivation. Each object has an associated duration interval (which may be a single time point), and attribute-value pairs for a given object are expected to be descriptions that hold for the object's duration. "

"However, the attributes of entities have special meaning because they are considered to be fixed aspects of underlying, changing things. This motivates constraints on alternateOf and specializationOf relating the attribute values of different entities. "

"In order to describe the provenance of something during an interval over which relevant attributes of the thing are not fixed, it is required to create multiple entities, each with its own identifier, characterization interval, and fixed attributes, and express dependencies between the various entities using events. For example, if we want to describe the provenance of several versions of a document, involving attributes such as authorship that change over time, we need different entities for the versions linked by appropriate generation, usage, revision, and invalidation events. "

"There is no assumption that the set of attributes is complete, nor that the attributes are independent or orthogonal of each other. There is no assumption that the attributes of an entity uniquely identify it. "


I hesitate on:

"Two different entities that are aspects of different things can have the same attributes."

because of "are aspects of different things", which I think could be removed while preserving the same meaning.


8.2 highlighting narrative that I would like to see stick around:

"An activity is delimited by its start and its end events; hence, it occurs over an interval delimited by two instantaneous events. However, an activity record need not mention start or end time information, because they may not be known. An activity's attribute-value pairs are expected to describe the activity's situation during its interval,"

KILL( i.e. an interval between two instantaneous events, namely its start event and its end event. )

"An activity is not an entity. Indeed, an entity exists in full at any point in its lifetime, persists during this interval, and preserves the characteristics that makes it identifiable. In contrast, an activity is something that occurs, happens, unfolds, or develops through time, but is typically not identifiable by the characteristics it exhibits at any point during its duration. This distinction is similar to the distinction between 'continuant' and 'occurrent' in logic [Logic]."


Section 8.3 doesn't have much use for me.


Something from:

"Although time is critical, we should also recognize that provenance can be used in many different contexts within individual systems and across the Web. Different systems may use different clocks which may not be precisely synchronized, so when provenance records are combined by different systems, we may not be able to align the times involved to a single global timeline. Hence, PROV-DM is designed to minimize assumptions about time. "

should probably make its way back in, but could use some trimming and clarification.

Other than that, 8.3 is covered enough in other places and could go away.


8.4.1 doesn't say anything to me.


I'd like to see 8.4.2 go back into the main document. It is a nice reference.


That first paragraph in 8.5 isn't saying much to me. If it stays, it should go into DM.