Copyright © 2010-2023 World Wide Web Consortium. W3C® liability, trademark and permissive document license rules apply.
RDF [RDF11-CONCEPTS] describes a graph-based data model for making claims about the world and provides the foundation for reasoning upon that graph of information. At times, it becomes necessary to compare the differences between sets of graphs, digitally sign them, or generate short identifiers for graphs via hashing algorithms. This document outlines an algorithm for normalizing RDF datasets such that these operations can be performed.
This section describes the status of this document at the time of its publication. A list of current W3C publications and the latest revision of this technical report can be found in the W3C technical reports index at https://www.w3.org/TR/.
This document describes the URDNA2015 algorithm for canonicalizing RDF datasets, which was the input from the W3C Credentials Community Group published as [CCG-RDC-FINAL]. There are other canonicalization algorithms actively being considered by the Working Group – notably [Hogan-Canonical-RDF]; future versions of this document may change accordingly. See Issue 6: Compare the two algorithms, and decide on basis for our work and Issue 10: C14N choice criteria for further discussion.
At the time of publication, [RDF11-CONCEPTS] is the most recent recommendation defining RDF datasets and [N-QUADS], however work on an updated specification is ongoing within the W3C RDF-star Working Group. Some dependencies from relevant updated specifications are provided normatively in this specification with the expectation that a future update to this specification will replace those with normative references to updated RDF specifications.
This document was published by the RDF Dataset Canonicalization and Hash Working Group as a Working Draft using the Recommendation track.
Publication as a Working Draft does not imply endorsement by W3C and its Members.
This is a draft document and may be updated, replaced or obsoleted by other documents at any time. It is inappropriate to cite this document as other than work in progress.
This document was produced by a group operating under the W3C Patent Policy. W3C maintains a public list of any patent disclosures made in connection with the deliverables of the group; that page also includes instructions for disclosing a patent. An individual who has actual knowledge of a patent which the individual believes contains Essential Claim(s) must disclose the information in accordance with section 6 of the W3C Patent Policy.
This document is governed by the 2 November 2021 W3C Process Document.
This section is non-normative.
When data scientists discuss canonicalization, they do so in the context of achieving a particular set of goals. Since the same information may sometimes be expressed in a variety of different ways, it often becomes necessary to transform each of these different ways into a single, standard representation. With a standard representation, the differences between two different sets of data can be easily determined, a cryptographically-strong hash identifier can be generated for a particular set of data, and a particular set of data may be digitally-signed for later verification.
In particular, this specification is about normalizing RDF datasets, which are collections of graphs. Since a directed graph can express the same information in more than one way, it requires canonicalization to achieve the aforementioned goals and any others that may arise via serendipity.
Most RDF datasets can be normalized fairly quickly, in terms of algorithmic time complexity. However, those that contain nodes that do not have globally unique identifiers pose a greater challenge. Normalizing these datasets presents the graph isomorphism problem, a problem that is believed to be difficult to solve quickly. More formally, it is believed to be an NP-Intermediate problem, that is, neither known to be solvable in polynomial time nor NP-complete. Fortunately, existing real world data is rarely modeled in a way that manifests this problem and new data can be modeled to avoid it. In fact, software systems can detect a problematic dataset and may choose to assume it's an attempted denial of service attack, rather than a real input, and abort.
This document outlines an algorithm for generating a canonical serialization of an RDF dataset given an RDF dataset as input. The algorithm is called the Universal RDF Dataset Canonicalization Algorithm 2015 or URDNA2015.
There are different use cases where graph or dataset canonicalization are important:
A canonicalization algorithm is necessary, but not necessarily sufficient, to handle many of these use cases. The use of blank nodes in RDF graphs and datasets has a long history and creates inevitable complexities. Blank nodes are used for different purposes:
Furthermore, RDF semantics dictate that deserializing an RDF document results in the creation of unique blank nodes, unless it can be determined that on each occasion, the blank node identifies the same resource. This is due to the fact that blank node identifiers are an aspect of a concrete RDF syntax and are not intended to be persistent or portable. Within the abstract RDF model, blank nodes do not have identifiers (although some RDF store implementations may use stable identifiers and may choose to make them portable). See Blank Nodes in [RDF11-CONCEPTS] for more information.
RDF does have a provision for allowing blank nodes to be published in an externally identifiable way through the use of Skolem IRIs, which allow a given RDF store to replace the use of blank nodes in a concrete syntax with IRIs, which then serve to repeatably identify that blank node within that particular RDF store; however, this is not generally useful for talking about the same graph in different RDF stores, or other concrete representations. In any case, a stable blank node identifier defined for one RDF store or serialization is arbitrary, and typically not relatable to the context within which it is used.
This specification defines an algorithm for creating stable blank node identifiers repeatably for different serializations possibly using individualized blank node identifiers of the same RDF graph (dataset) by grounding each blank node through the nodes to which it is connected, essentially creating Skolem blank node identifiers. As a result, a graph signature can be obtained by hashing a canonical serialization of the resulting normalized dataset, allowing for the isomorphism and digital signing use cases. As blank node identifiers can be stable even with other changes to a graph (dataset), in some cases it is possible to compute the difference between two graphs (datasets), for example if changes are made only to ground triples, or if new blank nodes are introduced which do not create an automorphic confusion with other existing blank nodes. If any information which would change the generated blank node identifier, a resulting diff might indicate a greater set of changes than actually exists.
Add descriptions for relevant historical discussions and prior art:
It's important for both patent and person credit reasons to include the full history.
Manu has offered to do this.
This document is a detailed specification for an RDF dataset canonicalization algorithm. The document is primarily intended for the following audiences:
To understand the basics in this specification you must be familiar with basic RDF concepts [RDF11-CONCEPTS]. A working knowledge of graph theory and graph isomorphism is also recommended.
This section is non-normative.
The following typographic conventions are used in this specification:
markup
markup definition reference
markup external definition reference
This area would provide more information about the step involved.
For example, the following output snippet might describe the operation of an implementation using the [YAML] format.
Notes are in light green boxes with a green left border and with a "Note" header in green. Notes are always informative.
Examples are in light khaki boxes, with khaki left border, and with a numbered "Example" header in khaki. Examples are always informative. The content of the example is in monospace font and may be syntax colored. Examples may have tabbed navigation buttons to show the results of transforming an example into other representations. Code examples are generally given in a Turtle or TriG format for brevity, where each line represents a single triple or quad. Additionally, have the following implied directives: BASE <http://example.com/> PREFIX : <#> Following the Turtle/TriG syntax rules, blank nodes always appear in the `_:xyz` format.
As well as sections marked as non-normative, all authoring guidelines, diagrams, examples, and notes in this specification are non-normative. Everything else in this specification is normative.
The key words MUST and MUST NOT in this document are to be interpreted as described in BCP 14 [RFC2119] [RFC8174] when, and only when, they appear in all capitals, as shown here.
U+000A
).
_:
that is used as an identifier for a
blank node. Blank node identifiers
are typically implementation-specific local identifiers; this document
specifies an algorithm for deterministically specifying them._:
string
to differentiate them from other nodes in the graph. This affects the
canonicalization algorithm, which is based on calculating a hash over the representations of quads in this format.
true
and false
A
and B
),
using Unicode Codepoint Collation,
as defined in [XPATH-FUNCTIONS],
which defines a
total ordering
of strings comparing code points.
Canonicalization is the process of transforming an input dataset to its serialized canonical form. That is, any two input datasets that contain the same information, regardless of their arrangement, will be transformed into the same serialized canonical form. The problem requires directed graphs to be deterministically ordered into sets of nodes and edges. This is easy to do when all of the nodes have globally-unique identifiers, but can be difficult to do when some of the nodes do not. Any nodes without globally-unique identifiers must be issued deterministic identifiers.
This specification defines a normalized dataset to include stable identifiers for blank nodes, practical uses of which will always generate a canonical serialization of such a dataset.
In time, there may be more than one canonicalization algorithm and, therefore, for identification purposes, this algorithm is named the "Universal RDF Dataset Canonicalization Algorithm 2015" (URDNA2015).
This statement is overly prescriptive and does not include normative language. This spec should describe the theoretical basis for graph canonicalization and describe behavior using normative statements. The explicit algorithms should follow as an informative appendix.
This section is non-normative.
To determine a canonical labeling, URDNA2015 considers the information connected to each blank node. Nodes with unique first degree information can immediately be issued a canonical identifier via the Issue Identifier algorithm. When a node has non-unique first degree information, it is necessary to determine all information that is transitively connected to it throughout the entire dataset. 4.6 Hash First Degree Quads defines a node’s first degree information via its first degree hash.
Hashes are computed from the information of each blank node. These hashes encode the mentions incident to each blank node. The hash of a string s, is the lower-case, hexadecimal representation of the result of passing s through a cryptographic hash function. URDNA2015 uses the SHA-256 hash algorithm.
When performing the steps required by the canonicalization algorithm, it is helpful to track state in a data structure called the canonicalization state. The information contained in the canonicalization state is described below.
c14n
, for issuing canonical
blank node identifiers.
During the canonicalization algorithm, it is sometimes necessary to issue new identifiers to blank nodes. The Issue Identifier algorithm uses an identifier issuer to accomplish this task. The information an identifier issuer needs to keep track of is described below.
c14n
is a proper initial value for the
identifier prefix that would produce
blank node identifiers like c14n1
.0
.At the time of writing, there are several open issues that will determine important details of the canonicalization algorithm.
Generalized RDF is described in RDF 1.1 Concepts and Abstract Syntax.
It removes restrictions on the type of RDF term that can occur in any slot in a quad tuple - literals as subjects or predicates, blank nodes as predicates etc. By implication, that would include RDF-start quoted triples.
RDF 1.1 separately changed "RDF dataset" to allow blank nodes for in the graph slot.
Generalized RDF does arise - for example, in some rules systems.
Covering generalized RDF gives some future proofing.
I completely agree with the importance of the "herd-privacy" canonicalization proposed in #4 (comment) by @dlongley when we use c14n with selective disclosure. However, if I understand it correctly, we would still have to improve the above algorithm; it seems to me that the following normalized datasets CX1 and CX2 are not modified via the above transformation, i.e., CX1==CY1 and CX2==CY2.
CX1 (obtained from JSON-LD Playground) (==CY1)
_:c14n0 <http://schema.org/name> "Alice" .
_:c14n0 <http://schema.org/spouse> _:c14n1 .
_:c14n1 <http://schema.org/name> "Bob" .
CX2 (obtained from JSON-LD Playground) (==CY2)
_:c14n0 <http://schema.org/name> "Carl" .
_:c14n1 <http://schema.org/name> "Alice" .
_:c14n1 <http://schema.org/spouse> _:c14n0 .
Therefore, even if Alice selectively hides the statement about her spouse, anyone can easily guess whether Bob or Carl is Alice's spouse based on the canonicalized identifiers or the order of unrevealed statement:
CY1 with selective disclosure
_:c14n0 <http://schema.org/name> "Alice" .
_:c14n0 <http://schema.org/spouse> _:c14n1 .
### 3rd statement is unrevealed ####
CY2 with selective disclosure
### 1st statement is unrevealed ####
_:c14n1 <http://schema.org/name> "Alice" .
_:c14n1 <http://schema.org/spouse> _:c14n0 .
What we actually wanted seemed like the following result:
CY1'
_:c14n0 <http://schema.org/name> "Alice" .
_:c14n1 <http://schema.org/name> "Bob" .
_:c14n0 <http://schema.org/spouse> _:c14n1 .
CY2'
_:c14n0 <http://schema.org/name> "Alice" .
_:c14n1 <http://schema.org/name> "Carl" .
_:c14n0 <http://schema.org/spouse> _:c14n1 .
I am trying to figure out a solution, but haven't found one yet so just submitting this issue at the moment...
In the meeting on 2022-10-12, we discussed criteria that can be used to make a choice between alternative choices in specific steps in the c14n algorithm. The initial list of suggestions is below. We need to formalize this and, IMO, include it in the explainer doc.
In the meeting on 2022-10-12, there was mention of needing to say which choices were made in generating the hash.
We already have the existing URDNA2015 as the canonicalization algorithm. Anything this working group does that makes any change to that will need to have a way to declare which algorithm was used to produce the resultant canonical form.
There may be good reasons for different hashing and signing algorithms.
We need a naming scheme of choices made, together with a way to transmit that information.
See #4 about the output of canonicalization.
From the CG spec:
An additional input to this algorithm should be added that
allows it to be optionally skipped and throw an error if any
equivalent related hashes were produced that must be permuted
during step 5.4.4. For practical uses of the algorithm, this step
should never be encountered and could be turned off, disabling
canonizing datasets that include a need to run it as a security
measure.
As a starting point for preparing Privacy Considerations (#70), I am trying to organize some existing discussions about privacy aspects of the URDNA2015, including:
While the discussions here mainly focus on URDNA2015, I believe we can apply similar arguments to the other RDF canonicalization algorithms.
Privacy considerations here are primarily worth discussing when the canonicalization scheme is used for privacy-respecting signed RDF dataset.
The following issues are worth discussing when the canonicalization scheme is used for privacy-respecting signed RDF datasets and are likely acceptable for other use cases. One of the former examples is a verifiable credential with selective disclosure.
Selective disclosure is the ability for someone to share only some of the statements from a signed dataset, without harming the ability of the recipient to verify the authenticity of those selected statements. (Note: copied from Dave's comment in #4)
The normalized dataset, the output of the canonicalization algorithm described in this specification, may leak partial information about undisclosed statements and help the adversary correlate the original and disclosed datasets.
If a dataset contains at least two blank nodes, the canonical labeling can be exploited to guess the undisclosed quad in the dataset.
For example, let us assume we have the following dataset to be signed. (Note: this person is fictitious, prepared only to make this example work.)
# original dataset
_:b0 <http://schema.org/address> _:b1 .
_:b0 <http://schema.org/familyName> "Jarrett" .
_:b0 <http://schema.org/gender> "Female" . # gender === Female
_:b0 <http://schema.org/givenName> "Ali" .
_:b0 <http://www.w3.org/1999/02/22-rdf-syntax-ns#type> <http://schema.org/Person> .
_:b1 <http://schema.org/addressCountry> "United States" .
_:b1 <http://www.w3.org/1999/02/22-rdf-syntax-ns#type> <http://schema.org/PostalAddress> .
Using URDNA2015, we can obtain the normalized dataset with canonical labels sorted in the canonical (code-point) order.
# normalized dataset
_:c14n0 <http://schema.org/addressCountry> "United States" .
_:c14n0 <http://www.w3.org/1999/02/22-rdf-syntax-ns#type> <http://schema.org/PostalAddress> .
_:c14n1 <http://schema.org/address> _:c14n0 .
_:c14n1 <http://schema.org/familyName> "Jarrett" .
_:c14n1 <http://schema.org/gender> "Female" . # gender === Female
_:c14n1 <http://schema.org/givenName> "Ali" .
_:c14n1 <http://www.w3.org/1999/02/22-rdf-syntax-ns#type> <http://schema.org/Person> .
The signer (or issuer) can generate a signature for the dataset by first hashing each statement and then signing them using a multi-message digital signature scheme like BBS+. The resulting dataset with signature is passed to the holder, who can control whether or not to disclose each statement while maintaining their verifiability.
Let us say that the holder wants to show her attributes except for gender
to a verifier. Then the holder should disclose the following partial dataset. (Note: proofs omitted here for brevity)
# disclosed dataset
_:c14n0 <http://schema.org/addressCountry> "United States" .
_:c14n0 <http://www.w3.org/1999/02/22-rdf-syntax-ns#type> <http://schema.org/PostalAddress> .
_:c14n1 <http://schema.org/address> _:c14n0 .
_:c14n1 <http://schema.org/familyName> "Jarrett" .
### 5th statement is unrevealed ##
_:c14n1 <http://schema.org/givenName> "Ali" .
_:c14n1 <http://www.w3.org/1999/02/22-rdf-syntax-ns#type> <http://schema.org/Person> .
However, in this example, anyone can guess the unrevealed statement by exploiting the canonical labels and order.
Since the dataset was sorted in the canonical order, we can get to know that the hidden statement must start with _:c14n1 <http://schema.org/[f-g]
, which helps us guess that the hidden predicate is <http://schema.org/gender>
with high probability.
Alternatively, we can assume that the guesser already has such knowledge via the public credential schema.
Then, if the canonical labeling produces different results depending on the gender value, we can use it to deduce the gender value.
In fact, this example produces different results depending on whether the gender is Female
or Male
.
(Note: I ignored the other types of gender just for brevity)
The following example shows that "gender = Male" yields different canonical labeling.
# hypothetical normalized dataset
_:c14n0 <http://schema.org/address> _:c14n1 .
_:c14n0 <http://schema.org/familyName> "Jarrett" .
_:c14n0 <http://schema.org/gender> "Male" . # gender === Male
_:c14n0 <http://schema.org/givenName> "Ali" .
_:c14n0 <http://www.w3.org/1999/02/22-rdf-syntax-ns#type> <http://schema.org/Person> .
_:c14n1 <http://schema.org/addressCountry> "United States" .
_:c14n1 <http://www.w3.org/1999/02/22-rdf-syntax-ns#type> <http://schema.org/PostalAddress> .
So the verifier should have obtained the following dataset if gender
had the value Male
, which differs from the revealed dataset, so the verifier can conclude that the gender
is Female
.
# hypothetical disclosed dataset
_:c14n0 <http://schema.org/address> _:c14n1 .
_:c14n0 <http://schema.org/familyName> "Jarrett" .
### 3rd statement is unrevealed ##
_:c14n0 <http://schema.org/givenName> "Ali" .
_:c14n0 <http://www.w3.org/1999/02/22-rdf-syntax-ns#type> <http://schema.org/Person> .
_:c14n1 <http://schema.org/addressCountry> "United States" .
_:c14n1 <http://www.w3.org/1999/02/22-rdf-syntax-ns#type> <http://schema.org/PostalAddress> .
Note that we can use the same approach to guess non-boolean values if the range of possible values is still a reasonable (small) size for a guesser to try all the possibilities.
By making the canonicalization process private, we can prevent a brute-forcing attacker from trying to see the labeling change by trying multiple possible attribute values.
For example, we can use a HMAC instead of a hash function in the canonicalization algorithm. Alternatively, we can add a secret random nonce (always undisclosed) into the dataset.
Note that these workarounds force dataset issuers and holders to manage shared secrets securely.
We also note that these workarounds adversely affect the unlinkability described below because canonical labeling now varies depending on the secret shared by the issuer and the holder, which will help correlate them.
The canonical order can leak unrevealed information even without canonical labelings.
Let us assume that the holder has the following signed dataset, sorted in the canonical (code-point) order.
:a <http://schema.org/children> "Albert" .
:a <http://schema.org/children> "Alice" .
:a <http://schema.org/children> "Allie" .
:a <http://schema.org/name> "John Smith" .
If the holder wants to hide the statement for their second child for any reason, the disclosed dataset now looks like this:
:a <http://schema.org/children> "Albert" .
### 2nd statement is unrevealed ##
:a <http://schema.org/children> "Allie" .
:a <http://schema.org/name> "John Smith" .
Knowing that these statements are sorted in the canonical order, we can guess that the hidden statement must start with :a <http://schema.org/children> "Al
, which leaks the subject (:a
), predicate (<http://schema.org/children>
) and the first two letters of the object ("Al"
) in the hidden statement.
To avoid this leakage, the dataset issuer can randomly shuffle the normalized statements before signing and issuing them to the holder, preventing others from guessing undisclosed information from the canonical order.
However, similar to the workarounds mentioned above, this workaround also adversely affects unlinkability. This is because there are
Unlinkability ensures that no correlatable data are used in a signed dataset while still providing some level of trust, the sufficiency of which must be determined by each verifier. (Note: based on the description in the VC Data Integrity spec)
While canonical sorting works better for unlinkability, canonical labeling can be exploited to break it.
The total number of canonical labelings for a dataset with
It means that the herd constructed as a result of selective disclosure will be split into
For example, let us assume that an employee of the small company "example.com" shows its employee ID dataset without their name like this:
# disclosed dataset
### 1st statement is unrevealed ##
_:c14n0 <http://schema.org/worksFor> _:c14n1 .
_:c14n0 <http://www.w3.org/1999/02/22-rdf-syntax-ns#type> <http://schema.org/Person> .
_:c14n1 <http://schema.org/address> _:c14n2 .
_:c14n1 <http://schema.org/geo> _:c14n3 .
_:c14n1 <http://schema.org/name> "example.com" .
_:c14n2 <http://schema.org/addressCountry> "United States" .
_:c14n3 <http://schema.org/latitude> "0.0" .
_:c14n3 <http://schema.org/longitude> "0.0" .
The verifier can always trace this person without knowing their name if this company has only three employees with the following employee ID datasets.
# normalized dataset 1
_:c14n0 <http://schema.org/address> _:c14n1 .
_:c14n0 <http://schema.org/geo> _:c14n3 .
_:c14n0 <http://schema.org/name> "example.com" .
_:c14n1 <http://schema.org/addressCountry> "United States" .
_:c14n2 <http://schema.org/name> "Jayden Doe" .
_:c14n2 <http://schema.org/worksFor> _:c14n0 .
_:c14n2 <http://www.w3.org/1999/02/22-rdf-syntax-ns#type> <http://schema.org/Person> .
_:c14n3 <http://schema.org/latitude> "0.0" .
_:c14n3 <http://schema.org/longitude> "0.0" .
# normalized dataset 2
_:c14n0 <http://schema.org/address> _:c14n1 .
_:c14n0 <http://schema.org/geo> _:c14n2 .
_:c14n0 <http://schema.org/name> "example.com" .
_:c14n1 <http://schema.org/addressCountry> "United States" .
_:c14n2 <http://schema.org/latitude> "0.0" .
_:c14n2 <http://schema.org/longitude> "0.0" .
_:c14n3 <http://schema.org/name> "Morgan Doe" .
_:c14n3 <http://schema.org/worksFor> _:c14n0 .
_:c14n3 <http://www.w3.org/1999/02/22-rdf-syntax-ns#type> <http://schema.org/Person> .
# normalized dataset 3
_:c14n0 <http://schema.org/name> "Johnny Smith" .
_:c14n0 <http://schema.org/worksFor> _:c14n1 .
_:c14n0 <http://www.w3.org/1999/02/22-rdf-syntax-ns#type> <http://schema.org/Person> .
_:c14n1 <http://schema.org/address> _:c14n2 .
_:c14n1 <http://schema.org/geo> _:c14n3 .
_:c14n1 <http://schema.org/name> "example.com" .
_:c14n2 <http://schema.org/addressCountry> "United States" .
_:c14n3 <http://schema.org/latitude> "0.0" .
_:c14n3 <http://schema.org/longitude> "0.0" .
The canonicalization in this example produces different labelings for these three employees, which helps anyone to correlate their activities even if they do not reveal their names in the dataset.
By determining some "template" for each anonymous set (or herd) and fixing the canonical labeling and canonical order used in the anonymous set, we can achieve a certain unlinkability.
Alternatively, we might be able to generate a kind of "on-the-fly" template proposed in Dave's comment in #4, which seems ineffective in some cases (see #8).
I must admit I hate the name "URDNA2015". I never remember what URDNA stands for (I always associate it with some weird gene...) and the 2015 is irrelevant.
Shouldn't we come with a simpler name? We could simply call it RCH, for example and, if we need it, we can use the "usual" W3C versioning habit of calling it RCH 1.0.
Just a thought. But it is still time to make this change.
there is a PR that contributed a reference to some of this in multiformats/multicodec
the pr title says 'urdna' but the pr added the name using 'urdca'. Which is more... canonical.... normal.... whatever. What should we put it in the multicodec name
column? :)
I imagine this decision has editorial implications for the spec itself? It still says 'urdna2015' quite a bit and does not contain 'urdca'. So why should multibase say 'urdca'?
Relates to:
Starting with my comment here: #86 (comment)
Some discussion spawned around the need, as an optional output, a mapping of quad input indices to quad output indices for selective disclosure use cases.
@gkellogg made this comment:
Just be be clear on what we're talking about, the input dataset is unordered and no blank node labels are persistent or possibly even present.
Which implies that we might want to also take an ordered list of quads as an optional alternative input to the algorithm. Or perhaps we can describe the RDF abstract dataset as being optionally represented as such -- for the case where this mapping output is desirable. Notably, the presence (or lack thereof) of input blank node labels in this case is not relevant.
The canonicalization algorithm converts an input dataset into a normalized dataset. This algorithm will assign deterministic identifiers to any blank nodes in the input dataset.
This section is non-normative.
URDNA2015 canonically labels an RDF dataset
by assigning each blank node a canonical identifier.
In URDNA2015, an RDF dataset D
is represented as a set of quads of the form < s, p, o, g >
where the graph component g
is empty if and only if the
triple < s, p, o >
is in the default graph.
It is expected that, for two RDF datasets,
URDNA2015 returns the same canonically labeled list of quads
if and only if the two datasets are isomorphic (i.e., the same modulo blank node identifiers).
URDNA2015 consists of several sub-algorithms. These sub-algorithms are introduced in the following sub-sections. First, we give a high level summary of URDNA2015.
This section is non-normative.
This has the effect of initializing the blank node to quads map, and the hash to blank nodes map, as well as instantiating a new canonical issuer.
This establishes the blank node to quads map, relating each blank node with the set of quads of which it is a component.
Literal components of
quads are not subject to any normalization.
As noted in
Section 3.3
of [RDF11-CONCEPTS],
literal term equality
is based on the
lexical form,
rather than the literal value,
so two literals "01"^^xs:integer
and "1"^^xs:integer
are treated as distinct resources.
This step creates a hash for every blank node in the input document. Some blank nodes will lead to a unique hash, while other blank nodes may share a common hash.
This step establishes the canonical identifier for blank nodes having a unique hash, which are recorded in the canonical issuer.
Log the assigned canonical identifiers.
This step establishes the canonical identifier for blank nodes having a shared hash. This is done by creating unique blank node identifiers for all blank nodes traversed by the Hash N-Degree Quads algorithm, running through each blank node without a canonical identifier in the order of the hashes established in the previous step.
Log hash and identifier list for this iteration.
This list establishes an order for those blank nodes sharing a common first-degree hash.
b
.The previous step created temporary identifiers for the blank nodes sharing a common first degree hash, which is now used to generate their canonical identifiers.
In Step 5.2, hash path list was created with an ordered set of results. Each result contained a temporary issuer which recorded temporary identifiers associated with a particular blank node identifier in identifier list. This step processes each returned temporary issuer, in order, and allocates canonical identifiers to the temporary identifier mappings contained within each temporary issuer, creating a full order on the remaining blank nodes with unissued canonical identifiers.
Log newly issued canonical identifiers.
This step populates the normalized dataset with quads substituting the original blank node identifiers, with the newly established canonical blank node identifiers.
This algorithm issues a new blank node identifier for a given existing blank node identifier. It also updates state information that tracks the order in which new blank node identifiers were issued. The order of issuance is important for canonically labeling blank nodes that are isomorphic to others in the dataset.
The algorithm maintains an issued identifiers map to
relate an existing blank node identifier from the input dataset
to a new blank node identifier using a given identifier prefix
(c14n
) with new identifiers issued by appending an incrementing number.
For example, when called for a blank node identifier such as e3
,
it might result in a issued identifier of c14n1
.
The algorithm takes an identifier issuer I and an existing identifier as inputs. The output is a new issued identifier. The steps of the algorithm are:
This algorithm calculates a hash for a given blank node across the quads in a dataset in which that blank node is a component. If the hash uniquely identifies that blank node, no further examination is necessary. Otherwise, a hash will be created for the blank node using the algorithm in 4.8 Hash N-Degree Quads invoked via 4.4 Canonicalization Algorithm.
This section is non-normative.
To determine whether the first degree information of a node n is unique, a hash is assigned to its mention set, Qn. The first degree hash of a blank node n, denoted hf(n), is the hash that results from 4.6 Hash First Degree Quads when passing n. Nodes with unique first degree hashes have unique first degree information.
For consistency, blank node identifiers used in Qn
are replaced with placeholders in a canonical n-quads serialization of that quad.
Every blank node component is replaced with either a
or z
,
depending on if that component is n or not.
The resulting serialized quads are then code point ordered, concatenated, and hashed. This hash is the first degree hash of n, hf(n).
This section is non-normative.
This algorithm takes the canonicalization state and a reference blank node identifier as inputs.
a
,
otherwise, use the blank node identifier
z
.Log the inputs and result of running this algorithm.
This algorithm calculates a hash for a given blank node across the quads in a dataset in which that blank node is a component for which the hash does not uniquely identify that blank node. This is done by expanding the search from quads directly referencing that blank node (the mention set), to those quads which contain nodes which are also components of quads in the mention set, called the gossip path. This process proceeds in every greater degrees of indirection until a unique hash is obtained.
The 'path' terminology could also be changed to better indicate what a path is (a particular deterministic serialization for a subgraph/subdataset of nodes without globally-unique identifiers).
This section is non-normative.
Usually, when trying to determine if two nodes in a graph are equivalent, you simply compare their identifiers. However, what if the nodes don't have identifiers? Then you must determine if the two nodes have equivalent connections to equivalent nodes all throughout the whole graph. This is called the graph isomorphism problem. This algorithm approaches this problem by considering how one might draw a graph on paper. You can test to see if two nodes are equivalent by drawing the graph twice. The first time you draw the graph the first node is drawn in the center of the page. If you can draw the graph a second time such that it looks just like the first, except the second node is in the center of the page, then the nodes are equivalent. This algorithm essentially defines a deterministic way to draw a graph where, if you begin with a particular node, the graph will always be drawn the same way. If two graphs are drawn the same way with two different nodes, then the nodes are equivalent. A hash is used to indicate a particular way that the graph has been drawn and can be used to compare nodes.
When two blank nodes have the same first degree hash, extra steps must be taken to detect global, or N-degree, distinctions. All information that is in any way connected to the blank node n through other blank nodes, even transitively, must be considered.
To consider all transitive information, the algorithm traverses and encodes all possible paths of incident mentions emanating from n, called gossip paths, that reach every unlabeled blank node connected to n. Each unlabeled blank node is assigned a temporary identifier in the order in which it is reached in the gossip path being explored. The mentions that are traversed to reach connected blank nodes are encoded in these paths via related hashes. This provides a deterministic way to order all paths coming from n that reach all blank nodes connected to n without relying on input blank node identifiers.
This algorithm works in concert with the main canonicalization algorithm to produce a unique, deterministic identifier for a particular blank node. This hash incorporates all of the information that is connected to the blank node as well as how it is connected. It does this by creating deterministic paths that emanate out from the blank node through any other adjacent blank nodes.
Ultimately, the algorithm selects a shortest gossip path, distributing canonical identifiers to the unlabeled blank nodes in the order in which they appear in this path. The hash of this encoded shortest path, called the N-degree hash of n, distinguishes n from other blank nodes in the dataset.
For clarity, we consider a gossip path encoded via the string s to be shortest provided that:
For example, abc is shorter than bbc, whereas abcd is longer than bcd.
The following provides a high level outline for how the N-degree hash of n is computed along the shortest gossip path. Note that the full algorithm considers all gossip paths, ultimately returning the hash of the shortest encoded path.
As described above in step 2.3, HN recurses on each unlabeled blank node when it is first reached along the gossip path being explored. This recursion can be visualized as moving along the path from n to the blank node ni that is receiving a temporary identifier. If, when recursing on ni, another unlabeled blank node nj is discovered, the algorithm again recurses. Such a recursion traces out the gossip path from n to nj via ni.
The recursive hash r(i) is the hash returned from the completed recursion on the node ni when computing hN(n). Just as hN(n) is the hash of Dn, we denote the data to hash in the recursion on ni as Di. So, r(i) = h(Di). For each related hash x ∈ Hn, Rn(x) is called the recursion list on which the algorithm recurses.
This section is non-normative.
The inputs to this algorithm are the canonicalization state, the identifier for the blank node to recursively hash quads for, and path identifier issuer which is an identifier issuer that issues temporary blank node identifiers. The output from this algorithm will be a hash and the identifier issuer used to help generate it.
Log the inputs to the algorithm.
quads is the mention set of identifier.
This loop calculates the related hash Hn for other blank nodes within the mention set of identifier.
s
, o
, or
g
based on whether component is a
subject, object,
graph name, respectively.Include the logs for each iteration of the Hash Related Blank Node algorithm and the resulting Hn.
This loop explores the gossip paths for each related blank node sharing a common hash to identifier finding the shortest such path (chosen path). This determines how canonical identifiers for otherwise commonly hashed blank nodes are chosen.
Each path is represented by the concatenation of the identifiers for each related blank node – either the issued identifier, or a temporary identifier created using a copy of issuer. Those for which temporary identifiers were issued are later recursed over using this algorithm.
Log the value of related hash and state of data to hash.
_:
, followed by
the canonical identifier for related, to path.
A canonical identifier may have been generated before calling this algorithm, if it was issued from an earlier call to Hash First Degree Quads algorithm. There is no reason to recurse and apply the algorithm to any related blank node that has already been assigned a canonical identifier. Furthermore, using the canonical identifier also further distinguishes it from any temporary identifier, allowing for even greater efficiency in finding the chosen path.
Temporarily labeled nodes have identifiers recorded in issuer copy, which is later used to recursively call this algorithm, so that eventually all nodes are given canonical identifiers.
_:
, followed by the result, to path.If path is already longer than the prospective chosen path, we can terminate this iteration early.
path is used to generate a hash at a later step; in this respect, it is similar to
the Hash First Degree Quads algorithm which
uses the serialization of quads in nquads for hashing. For the sake of consistency, the
nquad representation of blank node identifiers is used in these steps, hence the
usage of the _:
string.
Log related and path.
The prospective path is extended with the hash resulting from recursively calling this algorithm on each related blank node issued a temporary identifier.
Log recursion list and path.
_:
, followed by
the result, to path.<
, the hash in
result, and >
to path.If path is already longer than the prospective chosen path, we can terminate this iteration early.
Log chosen path and data to hash.
This section describes the process of creating a serialized [N-Quads] representation of a normalized dataset.
The serialized canonical form of a normalized dataset
is an N-Quads document [N-QUADS]
created by representing each quad from the normalized dataset
in canonical n-quads form,
sorting them into code point order,
and concatenating them. (Note that each canonical N-Quads statement ends with a new line,
so no additional separators are needed in the concatenation.)
The resulting document has a media type of application/n-quads
,
as described in C. N-Quads Internet Media Type, File Extension and Macintosh File Type
of [N-QUADS].
When serializing quads in canonical n-quads form, components which are blank nodes MUST be serialized using the canonical label associated with each blank node from the map component of the normalized dataset.
This section is non-normative.
Add text that warns implementers using this specification in selective disclosure schemes that graph structure might reveal information about the entity disclosing the information. For example, knowing that a blank node contains two triples vs. five triples might reveal that the entity that is disclosing the information is a part of a subclass of a population, which might be enough to disclose information beyond what the discloser intended to disclose.
This section is non-normative.
Add text that warns that attackers can construct datasets which are known to take large amounts of compute time to canonize. The algorithm has a mechanism to detect and prevent this sort of abuse, but implementers need to ensure that they think holistically about their system such as what happens if they don't have rate limiting on a service exposed to the Internet and they are the subject of a DDoS. Default mechanisms that prevent excessive use of compute when an attacker sends a poisoned dataset might be different from system to system.
Add text that warns implementers that, while the algorithm has a mathematical proof associated with it that has had peer review, and while a W3C WG has reviewed the algorithms and that there are multiple interoperable implementations, that a formal proof using a system such as Coq isn't available at this time. We are highly confident of the correctness of the algorithm, but we will not be able to say with 100% certainty that it is correct until we have a formal, machine-based verification of the proof. Any system that utilizes this canonicalization mechanism should have a backup canonicalization mechanism, such as JCS, or other mitigations, such as schema-based validation, ready in the event that an unrecoverable flaw is found in this algorithm.
This section is non-normative.
TBD
This section is non-normative.
This example illustrates a more complicated example where the same paths through blank nodes are duplicated in a graph, but use different blank node identifiers.
_:e0 :p1 _:e1 .
_:e1 :p2 "Foo" .
_:e2 :p1 _:e3 .
_:e3 :p2 "Foo" .
The following is a summary of the more detailed execution log found here.
This example illustrates another complicated example of nodes that are doubly connected in opposite directions.
_:e0 :next _:e1 .
_:e0 :prev _:e1 .
_:e1 :next _:e0 .
_:e1 :prev _:e0 .
The example is not explored in detail, but the execution log found here shows examples of more complicated pathways through the algorithm
This section defines a canonical form of N-Quads which has a completely specified layout. The grammar for the language remains unchanged.
Canonical N-Quads updates and extends
Canonical N-Triples in [N-TRIPLES]
to include graphLabel
.
While the N-Quads syntax [N-QUADS] allows choices for the representation and layout of RDF data,
the canonical form of N-Quads provides a unique syntactic representation of any quad.
Each code point
can be represented by only one of
UCHAR
,
ECHAR
,
or unencoded character,
where the relevant production allows for a choice in representation.
Each quad is represented entirely on a single line with specified white space.
Canonical N-Quads has the following additional constraints on layout:
subject
,
predicate
,
object
,
and graphLabel
,
each of which MUST be a single space (U+0020
).http://www.w3.org/2001/XMLSchema#string
MUST NOT use the datatype IRI part of the literal,
and are represented using only STRING_LITERAL_QUOTE.
HEX
MUST use only uppercase letters ([A-F]
).U+0008
(BS
),
U+0009
(HT
),
U+000A
(LF
),
U+000C
(FF
),
U+000D
(CR
),
U+0022
("
), and
U+005C
(\
)
MUST be encoded using ECHAR
.
Characters in the range from U+0000
to U+001F
and U+007F
(DEL
)
that are not represented using ECHAR
MUST be represented by UCHAR
.
All other characters MUST be represented by their native [UNICODE] representation.EOL
MUST be a single U+000A
.EOL
MUST be provided.This section is non-normative.
A previous version of this algorithm has light deployment. For purposes of identification, the algorithm is called the "Universal RDF Graph Canonicalization Algorithm 2012" (URGNA2012), and differs from the stated algorithm in the following ways:
g
, instead of z
.<
and >
; there
were no delimiters.p
, for property, when the related blank node was a
subject and the value r
, for reverse or reference, when
the related blank node was an object. Since URGNA2012 only normalized
graphs, not datasets, there was no use of the graph name position.p
as position.
r
as position.
map
)
map
)
map
)
map
)
This section is non-normative.
xyz
format for blank node identifiers, instead of
_:xyz
. See Issue 46
for the discussion.
simple
flag,
which was unused in existing implementations.
The original design of the algorithm was to use the
assigned canonical blank node identifier,
if available, instead of _:a
or _:z
,
similar to how it is used in
the related hash algorithm, but this text never made it into the spec
before implementations moved forward.
Therefore, the hashes never change,
making the loop based on the simple
flag that calls this algorithm unnecessary.
See Issue 23 for the discussion.
_:
which is a serialization artifact.
This is still required in the algorithms, but the distinction
between what is an identifier, and the serialization form
is clarified.This section is non-normative.
The editors would like to thank Jeremy Carroll for his work on the graph canonicalization problem, Gavin Carothers for providing valuable feedback and testing input for the algorithm defined in this specification, Sir Tim Berners Lee for his thoughts on graph canonicalization over the years, Jesús Arias Fisteus for his work on a similar algorithm.
Members of the RDF Dataset Canonicalization and Hash Working Group Group included Ahamed Azeem, Ahmad Alobaid, Andy Seaborne, Benjamin Goering, Brent Zundel, Dan Brickley, Dan Yamamoto, Dave Longley, David Lehn, Gregg Kellogg, Ivan Herman, Jesse Wright, Kazue Sako, Leonard Rosenthol, Mahmoud Alkhraishi, Manu Sporny, Markus Sabadello, Michael Prorock, Phil Archer, Pierre-Antoine Champin, Sebastian Crane, Ted Thibodeau, Timothée HAUDEBOURG, and Tobias Kuhn.
Acknowledge CCG members.
Referenced in:
Referenced in:
Referenced in:
Referenced in:
Referenced in:
Referenced in:
Referenced in:
Referenced in:
Referenced in:
Referenced in:
Referenced in:
Referenced in:
Referenced in:
Referenced in:
Referenced in:
Referenced in:
Referenced in:
Referenced in:
Referenced in:
Referenced in:
Referenced in:
Referenced in:
Referenced in:
Referenced in:
Referenced in:
Referenced in:
Referenced in:
Referenced in: