User:Rcygania2/B-Scopes

From RDF Working Group Wiki
Revision as of 23:23, 21 November 2012 by Rcygania2

This is a proposal to modify the design of blank nodes in RDF.

Proposed Specification Changes — Scope-As-Bijection Version

This is a new version of the proposal. It formalizes the concept of a “scope” more rigorously, and makes it an “add-on” that explains blank node identifiers, rather than making it an inherent part of the blank node definition.

Make the following changes in RDF Concepts:

3.4 Blank Nodes

Replace the current text of the section with this:

[[ The blank nodes in an RDF graph are drawn from an infinite set. This set is disjoint from the set of all IRIs and the set of all literals. Otherwise, this set of blank nodes is arbitrary. ]]

3.5 Blank Node Identifiers and Their Scope

Add a new subsection:

[[ A blank node identifier is a Unicode string that identifies a blank node within some local context, called a scope. A scope has an associated 1:1 mapping (bijection) between the set of all blank node identifiers and a set of blank nodes. Scopes are subject to the following rules:

  • The sets of blank nodes in any two scopes are disjoint.
  • Every RDF document forms its own scope.
  • Scope boundaries outside of RDF documents (for example, in RDF stores) are implementation-dependent.
  • Other specifications MAY impose additional rules, including constraints on the syntax of a scope's blank node identifiers.

A fresh blank node is any blank node that is not yet used within its scope.

An RDF graph is copied into a scope by replacing each blank node in the graph with a fresh blank node in the target scope. Occurrences of one blank node in multiple triples are all replaced with the same fresh blank node. If none of the source's blank node identifiers are used in the target scope, copying into a scope can be achieved by simply re-using the same blank node identifiers in the new scope.

The merge of two RDF graphs is the result of copying both graphs into a target scope. The result is a single graph where all blank nodes are in the same scope, and where any blank node identifiers that occurred in both input graphs have been replaced in order to avoid clashes.

Note: Blank node identifiers are local identifiers, and therefore may have to be re-allocated when RDF data crosses a system boundary, in order to avoid clashes. This possible re-allocation upon boundary crossing is formalized as “copying graphs between scopes”. In the simplest case, a system may have a single self-contained scope, and perform this operation only when RDF documents are read (to avoid clashes with blank node identifiers already in the system) or written (to comply with syntax restrictions). ]]
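
The rules in this subsection can be sketched in code. The following Python sketch is illustrative only and is not part of the proposed specification text. The names BlankNode, Scope, copy_graph, and merge are hypothetical; triples are modeled as plain tuples and IRIs as strings. Because the set of all blank node identifiers is infinite, the bijection is materialized lazily, and "used" is assumed to mean "already materialized in the scope".

```python
import itertools


class BlankNode:
    """An opaque blank node; it carries no identifier of its own."""


class Scope:
    """A scope: a lazily materialized bijection identifier <-> blank node."""

    def __init__(self):
        self._node_by_id = {}
        self._id_by_node = {}
        self._counter = itertools.count()

    def node(self, identifier):
        """Return the blank node that `identifier` denotes in this scope."""
        if identifier not in self._node_by_id:
            node = BlankNode()  # a new object, so scopes stay disjoint
            self._node_by_id[identifier] = node
            self._id_by_node[node] = identifier
        return self._node_by_id[identifier]

    def identifier(self, node):
        """Inverse direction of the bijection; KeyError for foreign nodes."""
        return self._id_by_node[node]

    def fresh(self):
        """Return a blank node not yet used within this scope."""
        while True:
            candidate = f"b{next(self._counter)}"
            if candidate not in self._node_by_id:
                return self.node(candidate)


def copy_graph(graph, target):
    """Copy a graph (a set of triples) into the target scope."""
    replacement = {}

    def convert(term):
        if not isinstance(term, BlankNode):
            return term
        if term not in replacement:
            # One fresh node per source node, shared by all its occurrences.
            replacement[term] = target.fresh()
        return replacement[term]

    return {tuple(convert(term) for term in triple) for triple in graph}


def merge(graph_a, graph_b, target):
    """Merge two graphs by copying both into the target scope."""
    return copy_graph(graph_a, target) | copy_graph(graph_b, target)


doc1, doc2, store = Scope(), Scope(), Scope()
x1, x2 = doc1.node("x"), doc2.node("x")  # same identifier, different scopes
g1 = {(x1, "ex:knows", x1)}
g2 = {(x2, "ex:knows", "ex:alice")}
merged = merge(g1, g2, store)
```

In this sketch the identifier "x" denotes two different blank nodes because doc1 and doc2 are different scopes, and the merged graph contains two distinct blank nodes that both belong to the store scope.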

3.6 Replacing Blank Nodes with IRIs

Delete the first paragraph of the current text:

[[ Blank nodes do not have identifiers in the RDF abstract syntax. The blank node identifiers introduced by some concrete syntaxes have only local scope and are purely an artifact of the serialization. In situations where stronger identification is needed… ]]

Add instead:

[[ This specification does not provide a mechanism for referencing blank nodes across scope boundaries. In situations where stronger identification is needed… ]]

Proposed Specification Changes — Classic Version

This is the version of the proposal discussed in the November 21 call. It was criticized for baking scopes right into the definition of blank nodes.

Make the following changes in RDF Concepts:

3.4 Blank Nodes

Replace the section's current text with:

[[ A blank node is a blank node identifier (a Unicode string) in a scope.

A scope is the context in which a blank node identifier refers to a particular blank node. The same identifier in a different scope refers to a different blank node. Every RDF document forms its own, self-contained scope. The handling of scopes outside of RDF documents (for example, in RDF stores) is implementation-dependent. Other specifications MAY impose additional scoping rules.

Blank node equality: Two blank nodes are equal if and only if their blank node identifiers are equal and they are in the same scope.

A fresh blank node is a blank node with a blank node identifier that is new and unique within its scope.

An RDF graph is copied into a scope by replacing each blank node in the graph with a fresh blank node in the target scope. Note that occurrences of one blank node in multiple triples are all replaced with the same fresh blank node. If none of the source's blank node identifiers are used in the target scope, copying into a scope can be achieved by simply re-using the same blank node identifiers in the new scope.

The merge of two RDF graphs is the result of copying both graphs into a target scope. The result is a single graph where all blank nodes are in the same scope, and where any blank node identifiers that occurred in both input graphs have been replaced in order to avoid clashes.

Note: Blank node identifiers are not required to be globally unique, and therefore may have to be re-allocated when RDF documents are parsed or serialized, or when RDF data otherwise crosses a system boundary. This boundary-crossing is formalized as “copying graphs between blank node scopes”.

A specification or implementation that defines its own self-contained scope may restrict the syntax of blank node identifiers allowed within its scope. For example, many concrete RDF syntaxes impose such restrictions.

In the simplest case, a system may be a single self-contained scope, and perform reallocation only if needed when RDF documents are read (to avoid clashes with blank node identifiers already in the system) or written (to comply with syntax restrictions). ]]
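
As an illustration of the classic definition, a blank node can be modeled as an (identifier, scope) pair whose equality follows the rule quoted above. This is a minimal Python sketch under assumptions, not proposed specification text: the class names Scope and BlankNode, the `used` set, and the "b0", "b1", … naming scheme are all hypothetical.

```python
from dataclasses import dataclass


class Scope:
    """A scope; compared by identity, so every instance is distinct."""

    def __init__(self):
        self.used = set()  # identifiers already in use in this scope

    def fresh(self):
        """Return a blank node whose identifier is new within this scope."""
        n = 0
        while f"b{n}" in self.used:
            n += 1
        return BlankNode(f"b{n}", self)


@dataclass(frozen=True)
class BlankNode:
    """A blank node: an identifier together with the scope it lives in."""

    identifier: str
    scope: Scope

    def __post_init__(self):
        # Record the identifier as used in its scope.
        self.scope.used.add(self.identifier)


doc_a, doc_b = Scope(), Scope()
same = BlankNode("x", doc_a) == BlankNode("x", doc_a)       # True
different = BlankNode("x", doc_a) == BlankNode("x", doc_b)  # False
new_node = doc_a.fresh()  # identifier "b0", since only "x" is in use
```

The generated dataclass equality compares both fields, which matches "equal if and only if their blank node identifiers are equal and they are in the same scope".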

3.5 Replacing Blank Nodes with IRIs

Delete the first paragraph of the current text:

[[ Blank nodes do not have identifiers in the RDF abstract syntax. The blank node identifiers introduced by some concrete syntaxes have only local scope and are purely an artifact of the serialization. ]]

Add instead:

[[ This specification does not provide a mechanism for referencing blank nodes across scope boundaries. Blank node identifiers are not globally scoped and therefore do not allow such reference. ]]

Background

Requirements

  • Consistency with all resolutions the WG has made so far
  • No changes to other specs beyond Concepts and Semantics necessary
  • No changes to implementations necessary

Motivation

  • The distinction between blank nodes and blank node identifiers is vague and confusing.
  • Blank node identifiers have clear scope, but the specs don't always make that as clear as they should.
  • In practice, blank nodes also have clear scope (they cannot “move around” arbitrarily), but the specs don't acknowledge that, leading to confusion.
  • As we standardize models and languages for working with multiple graphs, this disconnect becomes a larger problem.
  • An example of this disconnect: There is a widely held misconception in the RDF community that graphs cannot share blank nodes. This stems from the fact that in practice they rarely do (except in SPARQL stores with a union default graph) and almost never need to, and from the fact that the specs neither rule it out nor explicitly allow it.
  • The RDF core specs are rather abstract and can get away without discussing the scope of blank nodes, but more concrete specs built on this foundation need to address the issue, and they do so in awkward and sometimes incompatible ways; cf. the treatment of blank nodes in SPARQL query results, and the different assumptions regarding graphs sharing blank nodes in SPARQL Update and R2RML.
  • The notion of a “fresh blank node”, often used when describing algorithms and mappings that generate RDF graphs, is hard to explain in terms of a single universal arbitrary set of blank nodes.

Origins of the design

  • Pat's “RDF surfaces”
  • Richard's “blank node sequences”
  • Ted's mantra that snapshotting yields new blank nodes
  • ISSUE-107 discussions

The design can be seen as an attempt to take some ideas from Pat's “RDF surfaces” proposal (the notion that blank nodes “live” on a particular surface and therefore have scope, and the notion that graphs can be “copied” from one surface onto another) and fit them into the WG's existing framework of RDF datasets and g-boxes, while ignoring the other ideas of the proposal (different kinds of surfaces, bundling surfaces into codices, etc.).

Previous draft (outdated!)

Below is an earlier draft of the design. I believe it's inferior to the new one above, but it is retained here because it goes into more detail on a couple of points and may be helpful in understanding the motivation.

Definitions

A b-scope is a scope for blank node identifiers.

A blank node is a pair consisting of a blank node identifier and a b-scope. The blank node identifier uniquely identifies the blank node within the b-scope. If the same blank node identifier is used in two different b-scopes, then we have two different blank nodes. Two blank nodes are equal if and only if their blank node identifiers are equal and they are in the same b-scope.

Note: B-scopes do not need to be explicitly modelled or managed in most implementations. They are theoretical constructs that allow us to talk more formally about “system boundaries” and what happens when data containing blank nodes crosses such a system boundary.

Note: Only blank nodes are bound to a b-scope in this proposal. RDF triples, RDF graphs, and RDF datasets are not. This means RDF graphs and RDF datasets can contain blank nodes from multiple b-scopes, and multiple graphs or datasets can share the same blank node.

Note: When declaratively describing the structure of an RDF graph, it is often convenient to use the concept of a “fresh blank node”. This is an arbitrary blank node that has not yet been used within a given b-scope. This implies that an implementation will either keep track of all the identifiers that are already in use within the scope, or alternatively will have some sort of sequence generator that can dispense a new identifier that is guaranteed not to have been dispensed before.
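
The two implementation strategies mentioned in this note might look roughly as follows. This Python sketch is an assumption-laden illustration: the class names TrackingAllocator and SequenceAllocator, and the identifier schemes "b0", "b1", … and "gen0", "gen1", …, are hypothetical and not taken from any specification.

```python
import itertools


class TrackingAllocator:
    """Keeps track of every identifier already in use within the scope."""

    def __init__(self, in_use=()):
        self.in_use = set(in_use)

    def fresh(self):
        # Search for the first candidate that is not in use yet.
        n = 0
        while f"b{n}" in self.in_use:
            n += 1
        identifier = f"b{n}"
        self.in_use.add(identifier)
        return identifier


class SequenceAllocator:
    """Dispenses identifiers from a counter and never repeats one."""

    def __init__(self, prefix="gen"):
        self.prefix = prefix
        self._counter = itertools.count()

    def fresh(self):
        return f"{self.prefix}{next(self._counter)}"


tracking = TrackingAllocator(in_use={"b0", "b2"})
first, second = tracking.fresh(), tracking.fresh()  # "b1", then "b3"
sequence = SequenceAllocator()
third, fourth = sequence.fresh(), sequence.fresh()  # "gen0", then "gen1"
```

Note that the sequence approach only yields fresh blank nodes if no other identifier in the scope can start with the reserved prefix; that reservation is an assumption of the sketch.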

An RDF graph can be copied into a b-scope by systematically replacing all the graph's blank nodes with fresh ones in the target scope. The original and the copy are thus guaranteed not to share any blank nodes. If the source and target scopes are different, and the blank node identifiers do not occur elsewhere in either scope, this can be achieved by simply using the same blank node identifiers in both graphs.

The merge of two RDF graphs is the result of copying both graphs into a target b-scope.

Note: The set union of two graphs maintains the meaning of the graphs only if the graphs don't share blank nodes. This motivates the “graph merge” operation.
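
A small sketch can show the difference between union and merge. It assumes, for illustration only, that blank nodes are modeled as (identifier, scope name) tuples and IRIs as strings; the helper names is_blank, copy_graph, and blank_nodes are hypothetical, and copy_graph relies on the caller to pick a prefix that is unused in the target scope.

```python
def is_blank(term):
    """Blank nodes are modeled as (identifier, scope_name) pairs."""
    return isinstance(term, tuple)


def copy_graph(graph, scope_name, prefix):
    """Replace each blank node with a fresh one in the named scope.

    Freshness is assumed: the caller must pick a prefix unused in the scope.
    """
    replacement = {}
    copied = set()
    for triple in graph:
        new_triple = []
        for term in triple:
            if is_blank(term):
                if term not in replacement:
                    replacement[term] = (f"{prefix}{len(replacement)}", scope_name)
                term = replacement[term]
            new_triple.append(term)
        copied.add(tuple(new_triple))
    return copied


def blank_nodes(graph):
    """Collect the distinct blank nodes that occur in a graph."""
    return {term for triple in graph for term in triple if is_blank(term)}


shared = ("x", "store")  # one blank node that occurs in both graphs
g1 = {(shared, "rdf:type", "ex:Person")}
g2 = {(shared, "rdf:type", "ex:Dog")}

union = g1 | g2
merged = copy_graph(g1, "target", "a") | copy_graph(g2, "target", "b")
```

Here the union contains a single blank node and so states that one thing is both a person and a dog, which neither input graph says on its own. The merge contains two blank nodes and only states that there is a person and that there is a dog.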

Note: A graph and any of its copies are isomorphic, and are equivalent under any entailment regime.

B-scopes in practice

Blank nodes in RDF documents: A document in a concrete RDF syntax always forms its own separate and self-contained b-scope. For example, taking a snapshot of a g-box, and serializing it in Turtle, creates fresh blank nodes in a new b-scope that is unique to the Turtle g-text. Also, parsing an RDF document implies that there is some target b-scope, and yields fresh blank nodes in the target scope.

Blank nodes in SPARQL and in graph stores: Other specifications that use RDF may place stronger constraints on the management of b-scopes. For example, SPARQL Update is most easily explained by saying that the entire graph store forms a single self-contained b-scope, as blank nodes can be shared between g-boxes in the store, but (for the time being) not between graph stores.

Blank nodes in implementations: Where specifications don't constrain the use of b-scopes, implementations are free to define their own rules. For example, a large RDF processing system may maintain only a single b-scope, and any incoming data that contains blank nodes will first be “adopted” by copying its graphs into that b-scope. Or it may treat each graph/dataset data structure as a separate b-scope, meaning that re-allocation of blank node identifiers may be needed when two such data structures need to be combined into one.

If all blank nodes within a system or within a data structure are guaranteed to be in a single scope, then the scope doesn't need to be explicitly tracked for each blank node, and therefore the blank node identifiers can be treated as being the blank nodes. This is, in fact, what most if not all implementations do today.
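
A single-scope system of this kind could be sketched as follows. This is a hypothetical Python illustration, not a description of any existing implementation: the class SingleScopeStore and its adopt method are invented names, all terms are plain strings, and a term is assumed to be a blank node identifier exactly when it starts with "_:". Identifiers are re-allocated only when they clash with identifiers already in the store.

```python
class SingleScopeStore:
    """A store with one scope, so identifiers can stand in for blank nodes."""

    def __init__(self):
        self.triples = set()
        self.used = set()  # blank node identifiers in use in the store
        self._next = 0

    def _fresh(self):
        """Allocate an identifier that is not yet used in the store."""
        while f"_:b{self._next}" in self.used:
            self._next += 1
        identifier = f"_:b{self._next}"
        self.used.add(identifier)
        return identifier

    def adopt(self, document_triples):
        """Copy a parsed document into the store's scope."""
        labels = {term for triple in document_triples for term in triple
                  if term.startswith("_:")}
        clashing = labels & self.used
        # Identifiers that do not clash are re-used as they are; reserve them
        # first so that freshly allocated identifiers cannot collide with them.
        self.used |= labels - clashing
        renamed = {label: self._fresh() for label in sorted(clashing)}
        for triple in document_triples:
            self.triples.add(tuple(renamed.get(term, term) for term in triple))


store = SingleScopeStore()
store.adopt([("_:a", "ex:name", "Alice")])
store.adopt([("_:a", "ex:name", "Bob"), ("_:c", "ex:knows", "_:a")])
```

After the second call, the store keeps "_:a" for the first document's blank node, re-uses "_:c" unchanged, and re-allocates the second document's "_:a" to a new identifier, so the two documents do not end up sharing a blank node.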