Prov-XML Identifiers

From Provenance WG Wiki

Purpose

This document discusses how to represent identifiers (prov:id) and how to reference identified elements (prov:ref) in PROV-XML.

Requirements

  1. Allow for 'scruffy' provenance
  2. Support referencing provenance records from external serializations in different formats (PROV-O, PROV-N)
  3. Play nice with XML tooling

Other Considerations

  • from PROV-DM Qualified Name:
    • PROV-DM stipulates that a qualified name can be mapped into an IRI by concatenating the IRI associated with the prefix and the local part.
  • are all of these requirements mutually exclusive?
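
The PROV-DM mapping from a qualified name to an IRI can be sketched in a few lines. This is only an illustration: the function name and the convention of storing the default namespace under an empty-string key are assumptions made for this sketch and are not part of PROV-DM or PROV-XML.

```python
def expand_qualified_name(qualified_name, namespaces):
    """Map 'prefix:local' to an IRI by concatenating the prefix IRI and the local part.

    `namespaces` is a dict of prefix -> namespace IRI. The empty-string key
    standing for the default namespace is an assumption of this sketch.
    """
    prefix, separator, local = qualified_name.partition(":")
    if not separator:
        # No colon: treat the whole value as a local name in the default namespace.
        prefix, local = "", qualified_name
    if prefix not in namespaces:
        raise KeyError(f"undeclared namespace prefix: {prefix!r}")
    return namespaces[prefix] + local


namespaces = {"ex": "http://example.com/ns/ex#"}
print(expand_qualified_name("ex:e1", namespaces))
```

An undeclared prefix raises an error here rather than being silently dropped, which is the stricter of the two interpretations discussed later on this page.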

Possible approaches to identifiers and references in PROV-XML

In the schema prov:id and prov:ref are defined as XML attributes.

ID/IDREF

Use type xs:ID for prov:id and xs:IDREF for prov:ref

 <xs:attribute name="id" type="xs:ID"/>
 <xs:attribute name="ref" type="xs:IDREF"/>

Constraints

Validity constraint: ID
Values of type ID must match the Name production. A name must not appear more than once in an XML document as a value of this type; i.e., ID values must uniquely identify the elements which bear them.


Validity constraint: One ID per Element Type
No element type may have more than one ID attribute specified.


Validity constraint: ID Attribute Default
An ID attribute must have a declared default of #IMPLIED or #REQUIRED.


Validity constraint: IDREF
Values of type IDREF must match the Name production, and values of type IDREFS must match Names; each Name must match the value of an ID attribute on some element in the XML document; i.e. IDREF values must match the value of some ID attribute.

Advantages

  1. ID recognized by XML tools as an identifier type
    1. There is a uniqueness constraint on prov:id values (scope global to document)
  2. IDREF recognized by XML tools as a reference to an identified element
    1. A prov:ref must match the prov:id of some identified element in the document

Disadvantages

  1. lexical space is the same as the unqualified XML name (known as the xs:NCName datatype)
    1. ID and IDREF values cannot contain colons or whitespace, and cannot start with a digit
    2. URIs and qualified names are not valid IDs because both contain colons.
  2. entity/relation records defined in different bundles in the same document cannot have the same prov:id value
  3. An ID attribute must be declared #IMPLIED or #REQUIRED, while PROV-DM defines identifiers as optional
    1. note: #IMPLIED means the attribute may be omitted, which is consistent with xmllint not complaining when prov:id is left optional
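
To make the lexical restriction concrete, here is a rough check of which candidate identifiers would be acceptable as xs:ID values. It is a sketch only: the regular expression is an ASCII-only approximation of the NCName production (the real production also admits many non-ASCII characters), and the function name is made up for this example.

```python
import re

# ASCII-only approximation of NCName (assumption: non-ASCII name characters,
# which the real production allows, are rejected by this sketch).
_ASCII_NCNAME = re.compile(r"[A-Za-z_][A-Za-z0-9_.-]*")


def is_ascii_ncname(value):
    """Return True if value fits the ASCII subset of NCName (usable as xs:ID)."""
    return _ASCII_NCNAME.fullmatch(value) is not None


for candidate in ["e1", "ex:e1", "0001", "http://example.com/ns/ex#e1", "my id"]:
    print(candidate, is_ascii_ncname(candidate))
```

Of the candidates listed, only "e1" passes; the qualified name, the digit-initial name, the full URI, and the value containing whitespace are all rejected.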

QName

Use type xs:QName for both prov:id and prov:ref

 <xs:attribute name="id" type="xs:QName"/>
 <xs:attribute name="ref" type="xs:QName"/>

Constraints

TODO

Advantages

  1. Closest type to PROV-DM QualifiedName
  2. Schema validators will test for the existence of a namespace prefix used in a QName (e.g. "ex:foo" is invalid if the prefix "ex" is not declared)

Disadvantages

  1. No uniqueness constraint on prov:id
  2. The value of prov:ref is not required to match any existing prov:id
  3. Full URIs (e.g. http://example.com/ns/ex#e1) are not valid values of xs:QName; you must use a namespace prefix.
  4. Local names that start with a digit (e.g. "ex:0001") are not supported
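
Since the schema validator would not enforce uniqueness or reference integrity for QName-typed attributes, an application would need to check them itself. The following is a minimal sketch of such a check using Python's standard library. It compares the attribute values as plain strings (it does not expand prefixes to IRIs) and ignores bundle scoping; the function name and the sample document are illustrative assumptions.

```python
import xml.etree.ElementTree as ET
from collections import Counter

PROV_NS = "http://www.w3.org/ns/prov#"
ID_ATTR = f"{{{PROV_NS}}}id"
REF_ATTR = f"{{{PROV_NS}}}ref"

# Hypothetical sample: ex:e1 is declared twice and ex:e9 is never declared.
SAMPLE = """<prov:document xmlns:prov="http://www.w3.org/ns/prov#"
    xmlns:ex="http://example.com/ns/ex#">
  <prov:entity prov:id="ex:e1"/>
  <prov:entity prov:id="ex:e1"/>
  <prov:wasDerivedFrom>
    <prov:generatedEntity prov:ref="ex:e1"/>
    <prov:usedEntity prov:ref="ex:e9"/>
  </prov:wasDerivedFrom>
</prov:document>"""


def check_ids_and_refs(xml_text):
    """Return (duplicate prov:id values, prov:ref values with no matching prov:id)."""
    root = ET.fromstring(xml_text)
    ids = Counter(el.get(ID_ATTR) for el in root.iter() if el.get(ID_ATTR) is not None)
    refs = {el.get(REF_ATTR) for el in root.iter() if el.get(REF_ATTR) is not None}
    duplicates = sorted(value for value, count in ids.items() if count > 1)
    dangling = sorted(refs - set(ids))
    return duplicates, dangling


print(check_ids_and_refs(SAMPLE))
```

Whether duplicates or dangling references should be treated as errors at all depends on how much weight the 'scruffy' provenance requirement is given.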

anyURI

Use type xs:anyURI for both prov:id and prov:ref

 <xs:attribute name="id" type="xs:anyURI"/>
 <xs:attribute name="ref" type="xs:anyURI"/>

Constraints

  1. TODO

Advantages

  1. Alignment with PROV-O requirement (from RDF) that identifiers be URIs.
  2. URIs are valid values for prov:id
  3. prov:id="ex:0001" does not cause validation to fail (because anyURI values are not validated)

Disadvantages

  1. No uniqueness constraint on prov:id
  2. The value of prov:ref is not required to match any existing prov:id
  3. URI correctness does not appear to be validated by schema parsers (citation)
    1. Schema validators do not recognize the namespace component of a QName-formed id (e.g. "ex:foo")
    2. Schema validators will not test the existence of a namespace prefix mentioned in a QName-formed id (e.g. "ex" from "ex:foo" can be undeclared)


When prov:id has type xs:anyURI, the following XML validates successfully even though the namespace prefix foo is not declared:

<prov:document
    xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance"
    xmlns:xsd="http://www.w3.org/2001/XMLSchema"
    xmlns:prov="http://www.w3.org/ns/prov#"
    xmlns:ex="http://example.com/ns/ex#">
	
    <prov:entity prov:id="foo:001"/>

</prov:document>

Since the namespace prefix foo is undeclared, this identifier cannot be expanded into a URI and is not a valid PROV-DM Qualified Name (unless we decide that an undeclared prefix is dropped, so that the identifier is interpreted as just the local name "001").
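
An application-level check could catch this case even though the schema validator does not. The sketch below uses the namespace events of Python's standard-library parser to collect declared prefixes and reports prov:id values whose prefix was never declared. It makes two simplifying assumptions: declarations are treated as document-wide (no per-element scoping), and every id containing a colon is taken to be QName-formed, so a full URI such as http://example.com/x would be reported too. The function name is hypothetical.

```python
import io
import xml.etree.ElementTree as ET

PROV_ID = "{http://www.w3.org/ns/prov#}id"

SAMPLE = """<prov:document
    xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance"
    xmlns:xsd="http://www.w3.org/2001/XMLSchema"
    xmlns:prov="http://www.w3.org/ns/prov#"
    xmlns:ex="http://example.com/ns/ex#">
    <prov:entity prov:id="foo:001"/>
</prov:document>"""


def undeclared_id_prefixes(xml_text):
    """Return prov:id values whose prefix has no namespace declaration."""
    declared = set()
    missing = []
    source = io.BytesIO(xml_text.encode("utf-8"))
    for event, payload in ET.iterparse(source, events=("start-ns", "start")):
        if event == "start-ns":
            prefix, _uri = payload
            # Simplification: a prefix stays "declared" for the rest of the document.
            declared.add(prefix)
        else:
            value = payload.get(PROV_ID)
            if value and ":" in value:
                prefix = value.split(":", 1)[0]
                if prefix not in declared:
                    missing.append(value)
    return missing


print(undeclared_id_prefixes(SAMPLE))
```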

ID/IDREF, XLink, and XPointer

  • Use type xs:ID for prov:id and xs:IDREF for prov:ref

 <xs:attribute name="id" type="xs:ID"/>
 <xs:attribute name="ref" type="xs:IDREF"/>

  • Use XLink's xlink:type="simple" to simplify referencing between two elements. The simple type specifies only one locator (target).
  • Use an XPointer fragment identifier for simpler referencing of local and remote IDs.
  • Note that an XLink can link to an entire document, while an XPointer can link to a specific element (identified by an ID) within a document.

Example

  • Example use of ID/IDREF along with XPointers to reference entities across prov-xml documents.

http://www.example.com/trace1.provx:

 <prov:document
   xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance"
   xmlns:xsd="http://www.w3.org/2001/XMLSchema"
   xmlns:xlink="http://www.w3.org/1999/xlink"
   xmlns:prov="http://www.w3.org/ns/prov#">
   <prov:entity prov:id="e1"/>
 </prov:document>

http://www.example.com/trace2.provx:

 <prov:document
   xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance"
   xmlns:xsd="http://www.w3.org/2001/XMLSchema"
   xmlns:xlink="http://www.w3.org/1999/xlink"
   xmlns:prov="http://www.w3.org/ns/prov#">
   <prov:entity prov:id="e2"/>
 </prov:document>

http://www.example.com/trace3.provx:

 <prov:document
   xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance"
   xmlns:xsd="http://www.w3.org/2001/XMLSchema"
   xmlns:xlink="http://www.w3.org/1999/xlink"
   xmlns:prov="http://www.w3.org/ns/prov#">
   <prov:wasDerivedFrom xsi:type="prov:Derivation">
     <prov:generatedEntity xlink:type="simple" xlink:href="http://www.example.com/trace2.provx#e2"/>
     <prov:usedEntity xlink:type="simple" xlink:href="http://www.example.com/trace1.provx#e1"/>
   </prov:wasDerivedFrom>
   <prov:wasDerivedFrom>
     <prov:generatedEntity xlink:type="simple" xlink:href="http://www.example.com/trace2.provx#e2"/>
     <prov:usedEntity xlink:type="simple" xlink:href="http://www.example.com/trace1.provx#e1"/>
     <prov:type xsi:type="xsd:string">physical transform</prov:type>
   </prov:wasDerivedFrom>
 </prov:document>

Constraints

  1. Dependent on the ID/IDREF approach for identifiers.
  2. Use of XLink and XPointer assumes that each prov-xml document is URL-accessible.

Advantages

  1. Leverages existing W3C Recommendations on ID/IDREF, XLink, and XPointer.
  2. Use of xlink:type="simple" maintains simplicity.
  3. Leverages ID/IDREF for internal references in the same prov-xml document.
  4. Leverages XLink and XPointer when references need to span across multiple prov-xml documents.
  5. Use of XPointer fragment identifier can allow for more complex references via xpaths if needed.

Disadvantages

  1. Needs more community implementations of XLinks.
  2. Use of XLink and XPointer assumes that each prov-xml document is URL-accessible.
  3. Shares disadvantages with the ID/IDREF approach, e.g. the restriction on colons in IDs.

Analysis

  • Use of an XPointer xpath fragment allows for more complex references.

 trace.provx#xpointer(xpath)

  • But use of the simpler XPointer fragment identifier simplifies usage and adoption.

 trace.provx#xpointer(id("entity1"))

can be expressed simply as:

 trace.provx#entity1

allowing concurrent unique references to IDs within a prov-xml document via IDREF and across multiple prov-xml documents via XPointer fragment identifier.

  • XLink 1.1's XLink Attribute Usage Patterns indicate that the syntax can be even simpler: xlink:type="simple" is optional when xlink:href is present. The prov:generatedEntity link could therefore be reduced to:

 <prov:generatedEntity xlink:href="http://www.example.com/trace2.provx#e2"/>

  • The use of IDREF for local references could be replaced uniformly with XLinks to local IDs:

 <prov:generatedEntity xlink:href="#e2"/>
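
For illustration, a consumer of such links might split each xlink:href into the target document and the target ID roughly as follows. This is a sketch under stated assumptions: the function name is made up, only the shorthand form and the xpointer(id("...")) form shown above are recognised, and any other XPointer expression is returned unchanged rather than evaluated.

```python
import re
from urllib.parse import urldefrag

# Matches the explicit form xpointer(id("entity1")) or xpointer(id('entity1')).
_XPOINTER_ID = re.compile(r"""xpointer\(id\((["'])(.+?)\1\)\)""")


def resolve_reference(href):
    """Split an href into (document URL, target ID).

    An empty document URL means the reference is local to the current document.
    """
    document, fragment = urldefrag(href)
    match = _XPOINTER_ID.fullmatch(fragment)
    # Shorthand fragments (e.g. "#e2") already are the ID; other XPointer
    # expressions are passed through as-is in this sketch.
    target_id = match.group(2) if match else fragment
    return document, target_id


for href in [
    "http://www.example.com/trace2.provx#e2",
    'trace.provx#xpointer(id("entity1"))',
    "#e2",
]:
    print(resolve_reference(href))
```

With this split, local references (empty document part) and cross-document references can be handled by the same code path, which is the uniformity the last bullet argues for.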

Analysis

ID/IDREF

Stephan: ID and IDREF are the native way to use identifiers in XML, but they introduce too many restrictions (no QNames or URIs, and identified records must exist in the local document) for PROV-XML to adopt them. I do not at this time recommend using ID/IDREF.

Hook: This native identifier option is part of the W3C Recommendation on ID/IDREF. The requirement that identified records exist and be unique within a local document may actually help to promote more well-formed usage of *.provx documents. Additionally, ID/IDREF can be checked by XML validators. If our requirement for "scruffiness" has the highest priority, then this is not a suitable option for a prov-xml identifier.

QName

Stephan: While QName has a few restrictions that the PROV-DM Qualified Name does not (e.g. "ex:001" is not allowed), I think it is the closest native XML type to the PROV-DM Qualified Name. Schema validators understand a QName to be an optional namespace prefix followed by a local name, and they will test that any prefix used in a QName value is declared. I think this is a good restriction, and it matches the PROV-DM Qualified Name definition, which states that a qualified name can be mapped into an IRI by concatenating the IRI associated with the prefix and the local part. At this time I recommend staying with QName.

anyURI

Stephan: anyURI values are not validated by parsers, so they can contain pretty much anything. This means there is no guarantee that they are valid URIs. Namespace prefixes used in prov:id values are not recognized as such, and therefore their existence will not be tested. Undeclared prefixes will make it difficult to expand the identifier value into an IRI as stated in the PROV-DM Qualified Name definition. I do not at this time recommend using anyURI.

ID/IDREF, XLink, and XPointer

Hook: This option follows native support for XML identifiers and references. With the use of xlink:type="simple", it maintains simplicity while achieving capabilities such as referencing other *.provx documents (via XLink) as well as their specific entities or other elements (via XPointer). The use of XLink and XPointer, in simple type mode, provides a formalism for referencing distributed bundles defined in separate *.provx documents. This will be the case in some domains such as Earth Science data processing, where each distributed data system will [ideally] be generating its own bundle traces as part of a data product's lineage. This option inherits the benefits and drawbacks of the ID/IDREF approach. If the "scruffiness" requirement is high priority, then this option is not viable due to ID/IDREF's requirement for uniqueness and its constraints on ID values.

Custom type

Stephan: Can we define our own type whose constraints are validated by the schema validator? If we define our own 'qualified name', will we be able to define it such that the schema validator tests for the existence of the namespaces used? I am not in favor of defining our own type unless it has significant benefits over using a well-known XML type such as xs:QName or xs:anyURI.

Hook: Though we can define our own custom type, it may be more difficult to validate via traditional XSD schema validation methods. It could be validated with Schematron, a rule-based validation language that can check conditions such as existence and uniqueness. Note that Schematron is standardized as ISO/IEC 19757 (Document Schema Definition Languages, DSDL), Part 3: Rule-based validation. However, this deviates from the mainstream schema validation approaches more frequently employed by implementors.
