From Provenance WG Wiki
Currently, the DM uses variable numbers of arguments per relation, since some arguments are optional.
- Entity, activity and agent take an id, which is required, and attributes, which are optional (activity also takes optional start and end times).
- Many relations have an id, which is often optional.
- Many relations have 2 "principal arguments" plus 1-3 other optional (non-id, non-attribute) arguments. (In some cases, there is a further constraint such as "at least one of the following optional attributes must be present".)
- According to section 2.6 of PROV-DM, optional arguments can now be either omitted or replaced with "-".
- There are two kinds of "optional": "if missing, assume an unknown value exists" and "if missing, the value is absent". The general rule is that missing means "unknown", but there are a few ad hoc exceptions (I've collected them in section 7 of PROV-CONSTRAINTS).
- Also, attributes can be used to provide additional values where missing is "absent value", so there are two mechanisms for absent values that apply in different situations.
Thus, for example, the following are legal:
as well as any instances of the above where id or t is replaced with "-". It also seems that the attribute list may be replaced by "-" in PROV-DM, but not in PROV-N.
I think this leaves many opportunities for confusion, especially when there are 2 or more optional arguments (notably for wasDerivedFrom). While in examples it is usually clear whether an identifier is supposed to be an entity, activity or agent, in a large PROV description it may not be easy to tell. Understanding the intended meaning (for example, in order to translate to RDF) may therefore require several passes over the data.
For example, in:
wasDerivedFrom(x,y,z,w)
there are (at least) six possibilities:
- x is derivation id, y is entity, z is entity, w is activity
- x is derivation id, y is entity, z is entity, w is generation
- x is derivation id, y is entity, z is entity, w is use
- x is entity, y is entity, z is activity, w is generation
- x is entity, y is entity, z is activity, w is use
- x is entity, y is entity, z is generation, w is use
and to determine which we need to do a first pass over the document and figure out which types the ids have.
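The six readings listed above for a four-argument wasDerivedFrom expression can be reproduced mechanically. The following is only a sketch in Python: the slot layout (an optional id, two required entities, then optional activity, generation and use in that order) is an assumption inferred from the list above, and the function name `interpretations` is made up for illustration.

```python
from itertools import combinations

# Assumed slot layout, inferred from the six cases listed above:
# an optional id, two required entities, then up to three optional
# arguments (activity, generation, use) kept in that order.
OPTIONAL_SLOTS = ["activity", "generation", "use"]


def interpretations(args):
    """Return every way to assign roles to a positional argument list."""
    results = []
    for has_id in (True, False):
        prefix = (["derivation id"] if has_id else []) + ["entity", "entity"]
        n_optional = len(args) - len(prefix)
        if n_optional < 0 or n_optional > len(OPTIONAL_SLOTS):
            continue
        # combinations() keeps the relative order of OPTIONAL_SLOTS.
        for chosen in combinations(OPTIONAL_SLOTS, n_optional):
            roles = prefix + list(chosen)
            results.append(dict(zip(args, roles)))
    return results


for reading in interpretations(["x", "y", "z", "w"]):
    print(reading)
```

Under these assumptions the loop prints six role assignments, in the same order as the list above. None of them can be ruled out without knowing the types of x, y, z and w.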
0. I think having optional attribute lists is fine, but suggest that we disallow "-" for the missing attribute list, since one can just omit it or write "" already. The grammar already seems to do this.
1. To make it easier to tell whether the first argument is an id or the id is omitted, I suggest that we use a separator other than the comma, such as ";", to separate the id from the remaining arguments.
For example, wasGeneratedBy(id;e,a,t,attrs). (For uniformity we could do this with entity, activity, agent too, but I think it's not needed.)
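To illustrate why a distinct separator helps, here is a minimal Python sketch of how a reader of the argument text could find the id without knowing any types. It assumes the argument text contains no bracketed attribute list and that ";" appears at most once; `split_id` is a hypothetical helper, not part of any PROV tool.

```python
def split_id(arg_text):
    """Split 'id;e,a,t' into ('id', ['e', 'a', 't']).

    The id is None when no ';' is present, i.e. when the id was omitted.
    """
    head, sep, tail = arg_text.partition(";")
    if sep:
        ident, rest = head.strip(), tail
    else:
        ident, rest = None, head
    return ident, [a.strip() for a in rest.split(",")]


print(split_id("g1;e,a,t"))  # id present
print(split_id("e,a,t"))     # id omitted
```

With a comma as the only separator, the second call could not be told apart from the first without looking up what `e` refers to.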
2. To avoid confusion between "missing=unknown" and "missing=absent", I suggest that we use positional arguments for "missing = unknown" and attributes for "missing=absent".
For example, wasAssociatedWith(id,ag,act,plan,attrs) would become wasAssociatedWith(id,ag,act,[prov:hadPlan=plan] ^^ attrs) - the optional plan argument becomes an attribute because it is interpreted as absent, rather than unknown.
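As a rough sketch of this rewriting, the Python function below folds an optional plan into the attribute list. It models attributes as a list of (name, value) pairs, which is an assumption made for illustration; `plan_as_attribute` is a hypothetical name.

```python
def plan_as_attribute(plan, attrs):
    """Return the attribute list with the plan folded in, if there is one.

    A plan of None means "absent": no prov:hadPlan attribute is added.
    """
    extra = [("prov:hadPlan", plan)] if plan is not None else []
    return extra + list(attrs)


print(plan_as_attribute("p1", [("ex:role", "operator")]))
print(plan_as_attribute(None, [("ex:role", "operator")]))
```

The point of the design is that an absent plan simply leaves no trace in the attribute list, so no positional placeholder is needed for it.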
3. I suggest that we standardize the following convention for the remaining optional arguments:
To avoid ambiguity, optional arguments (other than id and attrs) must either all be omitted or all given in the syntax, with the missing ones explicitly written "-". Then, "-" would always denote an unknown value, never an absent value.
For example, it would no longer be legal to write
where the generation of e2 is omitted. Instead, if any optional arguments are filled in, then the rest need to be filled in with "-".
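The all-or-nothing convention is easy to apply mechanically. Below is a small Python sketch, assuming the optional positions of a relation are known as an ordered list of slot names; the slot names and the helper `fill_optionals` are hypothetical.

```python
def fill_optionals(slots, known):
    """Return the optional-argument list required by the convention.

    slots: ordered names of the optional positions (hypothetical names).
    known: mapping from slot name to a supplied value.
    """
    unexpected = set(known) - set(slots)
    if unexpected:
        raise ValueError(f"unexpected slots: {sorted(unexpected)}")
    if not known:
        return []  # nothing supplied: omit all optional arguments
    # Something supplied: every position is written, "-" meaning unknown.
    return [known.get(slot, "-") for slot in slots]


slots = ["activity", "generation", "use"]
print(fill_optionals(slots, {}))
print(fill_optionals(slots, {"activity": "a", "use": "u"}))
```

In the second call the missing generation is written as "-", so the number of arguments alone tells a reader which position each value occupies.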
To summarize, this proposal would ensure that each relation has at most 8 variants, that each expression can be unambiguously parsed on its own without first guessing type information, that ids are easier to locate, and that there is a uniform rule for determining whether a missing value should be interpreted as an unknown but present value, or as an absent value.
Moreover, while the current approach may technically be unambiguously parsable (I'm not sure), if it is then it is probably fragile with respect to changes or extensions. The convention I outline above should be more robust so that we don't have to keep checking for ambiguity.
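One way to arrive at the bound of 8 variants is to treat a relation as having three independent yes/no choices: id given or not, optional arguments all given or all omitted, and attribute list given or not. That reading is my assumption about where the number comes from. The Python sketch below enumerates the resulting forms for the wasGeneratedBy example; treating `e` and `a` as the principal arguments and `t` as the only optional one is likewise assumed for illustration.

```python
from itertools import product


def variants(name, principals, optionals):
    """List every surface form allowed under the proposed conventions."""
    forms = []
    for with_id, with_opts, with_attrs in product((True, False), repeat=3):
        args = list(principals)
        if with_opts:
            args += optionals      # all optional positions are written
        if with_attrs:
            args.append("attrs")
        prefix = "id;" if with_id else ""
        forms.append(f"{name}({prefix}{','.join(args)})")
    return forms


for form in variants("wasGeneratedBy", ["e", "a"], ["t"]):
    print(form)
```

A relation with no optional arguments would produce duplicate forms here and so has only 4 distinct variants, which is why the bound is "at most" 8.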
This would obviously affect PROV-N. For the purpose of examples the main changes required would be adding semicolons, making "absent value" parameters into attributes, and adding "-"'s for missing "unknown value" parameters. I realize that this could be a lot of work.