RDF Dataset Canonicalization

Abstract

RDF [RDF11-CONCEPTS] describes a graph-based data model for making claims about the world and provides the foundation for reasoning upon that graph of information. At times, it becomes necessary to compare the differences between sets of graphs, digitally sign them, or generate short identifiers for graphs via hashing algorithms. This document outlines an algorithm for normalizing RDF datasets such that these operations can be performed.

Canonicalization is the process of transforming an input dataset to its serialized canonical form. That is, any two input datasets that contain the same information, regardless of their arrangement, will be transformed into the same serialized canonical form. The problem requires directed graphs to be deterministically ordered into sets of nodes and edges. This is easy to do when all of the nodes have globally-unique identifiers, but can be difficult to do when some of the nodes do not. Any nodes without globally-unique identifiers must be issued deterministic identifiers.

Note

This specification defines a normalized dataset to include stable identifiers for blank nodes, practical uses of which will always generate a canonical serialization of such a dataset.

In time, there may be more than one canonicalization algorithm and, therefore, for identification purposes, this algorithm is named the "Universal RDF Dataset Canonicalization Algorithm 2015" (URDNA2015).

Editor's note

This statement is overly prescriptive and does not include normative language. This spec should describe the theoretical basis for graph canonicalization and describe behavior using normative statements. The explicit algorithms should follow as an informative appendix.

This section is non-normative.

To determine a canonical labeling, URDNA2015 considers the information connected to each blank node. Nodes with unique first degree information can immediately be issued a canonical identifier via the Issue Identifier algorithm. When a node has non-unique first degree information, it is necessary to determine all information that is transitively connected to it throughout the entire dataset. 4.6 Hash First Degree Quads defines a node’s first degree information via its first degree hash.

Hashes are computed from the information of each blank node. These hashes encode the mentions incident to each blank node. The hash of a string s is the lowercase hexadecimal representation of the result of passing s through a cryptographic hash function. URDNA2015 uses the SHA-256 hash algorithm.
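
As a minimal sketch of the hash definition above, the following uses Python's standard hashlib module. It assumes the string is encoded as UTF-8 before hashing (the text here does not name an encoding), and the function name hash_string is illustrative only.

```python
import hashlib


def hash_string(s: str) -> str:
    """Return the lowercase hexadecimal SHA-256 digest of s.

    Assumption: s is encoded as UTF-8 before hashing.
    """
    return hashlib.sha256(s.encode("utf-8")).hexdigest()


if __name__ == "__main__":
    print(hash_string("abc"))
```

The digest is always 64 lowercase hexadecimal characters, which is what makes code point ordering of hashes well defined later in the algorithm.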

When performing the steps required by the canonicalization algorithm, it is helpful to track state in a data structure called the canonicalization state. The information contained in the canonicalization state is described below.

blank node to quads map: A map that relates a blank node identifier to the quads in which it appears in the input dataset.
hash to blank nodes map: A map that relates a hash to a list of blank node identifiers.
canonical issuer: An identifier issuer, initialized with the prefix c14n, for issuing canonical blank node identifiers.
Editor's note
Mapping all blank nodes to use this identifier spec means that an RDF dataset composed of two different RDF graphs will use different identifiers than that for the graphs taken independently. This may happen anyway, due to automorphisms, or overlapping statements, but an identifier based on the resulting hash along with an issue sequence number specific to that hash would stand a better chance of surviving such minor changes, and allow the resulting information to be useful for RDF Diff.

During the canonicalization algorithm, it is sometimes necessary to issue new identifiers to blank nodes. The Issue Identifier algorithm uses an identifier issuer to accomplish this task. The information an identifier issuer needs to keep track of is described below.

identifier prefix: The identifier prefix is a string that is used at the beginning of a blank node identifier. It should be initialized to a string that is specified by the canonicalization algorithm. When generating a new blank node identifier, the prefix is concatenated with an identifier counter. For example, c14n is a proper initial value for the identifier prefix that would produce blank node identifiers like c14n1.
identifier counter: A counter that is appended to the identifier prefix to create a blank node identifier. It is initialized to 0.
issued identifiers map: An ordered map that relates existing identifiers to issued identifiers, to prevent issuance of more than one new identifier per existing identifier, and to allow blank nodes to be reassigned identifiers some time after issuance.
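
The canonicalization state and the identifier issuer described above could be modelled roughly as follows. This is only a sketch: the class and field names are illustrative, quads are assumed to be tuples of strings, and a plain dict stands in for the ordered map because Python dicts preserve insertion order.

```python
from dataclasses import dataclass, field
from typing import Dict, List, Tuple

Quad = Tuple[str, ...]  # assumption: each term is stored as a string


@dataclass
class IdentifierIssuer:
    identifier_prefix: str = "c14n"
    identifier_counter: int = 0
    # Maps existing identifiers to issued identifiers, in issuance order.
    issued_identifiers: Dict[str, str] = field(default_factory=dict)


@dataclass
class CanonicalizationState:
    # Blank node identifier -> quads in which it appears.
    blank_node_to_quads: Dict[str, List[Quad]] = field(default_factory=dict)
    # Hash -> list of blank node identifiers that produced it.
    hash_to_blank_nodes: Dict[str, List[str]] = field(default_factory=dict)
    canonical_issuer: IdentifierIssuer = field(default_factory=IdentifierIssuer)


state = CanonicalizationState()
```

Using default_factory gives every state its own maps and its own issuer, so two canonicalization runs cannot share mutable data by accident.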

Editor's note

At the time of writing, there are several open issues that will determine important details of the canonicalization algorithm.

Issue 7: Support generalized RDF

Generalized RDF is described in RDF 1.1 Concepts and Abstract Syntax.

It removes restrictions on the type of RDF term that can occur in any slot of a quad: literals as subjects or predicates, blank nodes as predicates, etc. By implication, that would include RDF-star quoted triples.

RDF 1.1 separately changed "RDF dataset" to allow blank nodes in the graph slot.

Generalized RDF does arise - for example, in some rules systems.

Covering generalized RDF gives some future proofing.

Issue 8: "Herd-privacy" canonicalization

I completely agree with the importance of the "herd-privacy" canonicalization proposed in #4 (comment) by @dlongley when we use c14n with selective disclosure. However, if I understand it correctly, we would still have to improve the above algorithm; it seems to me that the following normalized datasets CX1 and CX2 are not modified via the above transformation, i.e., CX1==CY1 and CX2==CY2.

CX1 (obtained from JSON-LD Playground) (==CY1)

_:c14n0 <http://schema.org/name> "Alice" .
_:c14n0 <http://schema.org/spouse> _:c14n1 .
_:c14n1 <http://schema.org/name> "Bob" .

CX2 (obtained from JSON-LD Playground) (==CY2)

_:c14n0 <http://schema.org/name> "Carl" .
_:c14n1 <http://schema.org/name> "Alice" .
_:c14n1 <http://schema.org/spouse> _:c14n0 .

Therefore, even if Alice selectively hides the statement about her spouse, anyone can easily guess whether Bob or Carl is Alice's spouse based on the canonicalized identifiers or the position of the unrevealed statement:

CY1 with selective disclosure

_:c14n0 <http://schema.org/name> "Alice" .
_:c14n0 <http://schema.org/spouse> _:c14n1 .
### 3rd statement is unrevealed ####

CY2 with selective disclosure

### 1st statement is unrevealed ####
_:c14n1 <http://schema.org/name> "Alice" .
_:c14n1 <http://schema.org/spouse> _:c14n0 .

What we actually wanted seemed like the following result:

CY1'

_:c14n0 <http://schema.org/name> "Alice" .
_:c14n1 <http://schema.org/name> "Bob" .
_:c14n0 <http://schema.org/spouse> _:c14n1 .

CY2'

_:c14n0 <http://schema.org/name> "Alice" .
_:c14n1 <http://schema.org/name> "Carl" .
_:c14n0 <http://schema.org/spouse> _:c14n1 .

I am trying to figure out a solution, but haven't found one yet so just submitting this issue at the moment...

Issue 10: C14N choice criteria documentation

In the meeting on 2022-10-12, we discussed criteria that can be used to make a choice between alternative choices in specific steps in the c14n algorithm. The initial list of suggestions is below. We need to formalize this and, IMO, include it in the explainer doc.

ease of implementation
existing incubation / use in the marketplace
time / resource complexity in solving common datasets
time / resource complexity in solving complex (or poison?) datasets
Existence of formal proofs for the algorithms
Demonstration of review of formal proofs for the algorithms
reusing existing primitives that are available on various platforms
cover real life examples

Issue 11: Recording the canonicalization and hashing applied.

In the meeting on 2022-10-12, there was mention of needing to say which choices were made in generating the hash.

We already have the existing URDNA2015 as the canonicalization algorithm. Anything this working group does that makes any change to that will need to have a way to declare which algorithm was used to produce the resultant canonical form.

There may be good reasons for different hashing and signing algorithms.

We need a naming scheme of choices made, together with a way to transmit that information.

See #4 about the output of canonicalization.

Issue 16: Optionally fail on duplicates in Hash N-Degree Quads?

From the CG spec:

An additional input to this algorithm should be added that
allows it to be optionally skipped and throw an error if any
equivalent related hashes were produced that must be permuted
during step 5.4.4. For practical uses of the algorithm, this step
should never be encountered and could be turned off, disabling
canonizing datasets that include a need to run it as a security
measure.

Issue 84: Some privacy considerations

As a starting point for preparing Privacy Considerations (#70), I am trying to organize some existing discussions about privacy aspects of the URDNA2015, including:

Dave's comment in #4
Blank node labels may leak information · Issue #60 · w3c-ccg/ldp-bbs2020
Academic articles by Vasilis Kalos (1), (2)

While the discussions here mainly focus on URDNA2015, I believe we can apply similar arguments to the other RDF canonicalization algorithms.

The privacy considerations below matter primarily when the canonicalization scheme is used for privacy-respecting signed RDF datasets; they are likely acceptable for other use cases. One example of such a dataset is a verifiable credential with selective disclosure.

Selective disclosure is the ability for someone to share only some of the statements from a signed dataset, without harming the ability of the recipient to verify the authenticity of those selected statements. (Note: copied from Dave's comment in #4)

The normalized dataset, the output of the canonicalization algorithm described in this specification, may leak partial information about undisclosed statements and help the adversary correlate the original and disclosed datasets.

If a dataset contains at least two blank nodes, the canonical labeling can be exploited to guess the undisclosed quad in the dataset.

For example, let us assume we have the following dataset to be signed. (Note: this person is fictitious, prepared only to make this example work.)

# original dataset
_:b0 <http://schema.org/address> _:b1 .
_:b0 <http://schema.org/familyName> "Jarrett" .
_:b0 <http://schema.org/gender> "Female" .  # gender === Female
_:b0 <http://schema.org/givenName> "Ali" .
_:b0 <http://www.w3.org/1999/02/22-rdf-syntax-ns#type> <http://schema.org/Person> .
_:b1 <http://schema.org/addressCountry> "United States" .
_:b1 <http://www.w3.org/1999/02/22-rdf-syntax-ns#type> <http://schema.org/PostalAddress> .

Using URDNA2015, we can obtain the normalized dataset with canonical labels sorted in the canonical (code-point) order.

# normalized dataset
_:c14n0 <http://schema.org/addressCountry> "United States" .
_:c14n0 <http://www.w3.org/1999/02/22-rdf-syntax-ns#type> <http://schema.org/PostalAddress> .
_:c14n1 <http://schema.org/address> _:c14n0 .
_:c14n1 <http://schema.org/familyName> "Jarrett" .
_:c14n1 <http://schema.org/gender> "Female" .  # gender === Female
_:c14n1 <http://schema.org/givenName> "Ali" .
_:c14n1 <http://www.w3.org/1999/02/22-rdf-syntax-ns#type> <http://schema.org/Person> .

The signer (or issuer) can generate a signature for the dataset by first hashing each statement and then signing them using a multi-message digital signature scheme like BBS+. The resulting dataset with signature is passed to the holder, who can control whether or not to disclose each statement while maintaining their verifiability.

Let us say that the holder wants to show her attributes except for gender to a verifier. Then the holder should disclose the following partial dataset. (Note: proofs omitted here for brevity)

# disclosed dataset
_:c14n0 <http://schema.org/addressCountry> "United States" .
_:c14n0 <http://www.w3.org/1999/02/22-rdf-syntax-ns#type> <http://schema.org/PostalAddress> .
_:c14n1 <http://schema.org/address> _:c14n0 .
_:c14n1 <http://schema.org/familyName> "Jarrett" .
### 5th statement is unrevealed ##
_:c14n1 <http://schema.org/givenName> "Ali" .
_:c14n1 <http://www.w3.org/1999/02/22-rdf-syntax-ns#type> <http://schema.org/Person> .

However, in this example, anyone can guess the unrevealed statement by exploiting the canonical labels and order.

Since the dataset was sorted in the canonical order, we can infer that the hidden statement must start with _:c14n1 <http://schema.org/[f-g], which helps us guess that the hidden predicate is <http://schema.org/gender> with high probability.
Alternatively, we can assume that the guesser already has such knowledge via the public credential schema.

Then, if the canonical labeling produces different results depending on the gender value, we can use it to deduce the gender value.
In fact, this example produces different results depending on whether the gender is Female or Male.
(Note: I ignored the other types of gender just for brevity)

The following example shows that "gender = Male" yields different canonical labeling.

# hypothetical normalized dataset
_:c14n0 <http://schema.org/address> _:c14n1 .
_:c14n0 <http://schema.org/familyName> "Jarrett" .
_:c14n0 <http://schema.org/gender> "Male" .  # gender === Male
_:c14n0 <http://schema.org/givenName> "Ali" .
_:c14n0 <http://www.w3.org/1999/02/22-rdf-syntax-ns#type> <http://schema.org/Person> .
_:c14n1 <http://schema.org/addressCountry> "United States" .
_:c14n1 <http://www.w3.org/1999/02/22-rdf-syntax-ns#type> <http://schema.org/PostalAddress> .

So the verifier should have obtained the following dataset if gender had the value Male, which differs from the revealed dataset, so the verifier can conclude that the gender is Female.

# hypothetical disclosed dataset
_:c14n0 <http://schema.org/address> _:c14n1 .
_:c14n0 <http://schema.org/familyName> "Jarrett" .
### 3rd statement is unrevealed ##
_:c14n0 <http://schema.org/givenName> "Ali" .
_:c14n0 <http://www.w3.org/1999/02/22-rdf-syntax-ns#type> <http://schema.org/Person> .
_:c14n1 <http://schema.org/addressCountry> "United States" .
_:c14n1 <http://www.w3.org/1999/02/22-rdf-syntax-ns#type> <http://schema.org/PostalAddress> .

Note that we can use the same approach to guess non-boolean values if the range of possible values is still a reasonable (small) size for a guesser to try all the possibilities.
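
The guessing attack above can be sketched as follows. This is an illustration only: a real attacker would run a URDNA2015 implementation on each candidate dataset, whereas here the two canonical forms quoted above are used as a lookup table. The names CANONICAL_BY_GUESS, strip_hidden, and consistent_guesses are hypothetical.

```python
GENDER_PREDICATE = "<http://schema.org/gender>"
RDF_TYPE = "<http://www.w3.org/1999/02/22-rdf-syntax-ns#type>"

# Stand-in for running the canonicalization on each candidate value:
# these are the two normalized datasets quoted in the text.
CANONICAL_BY_GUESS = {
    "Female": [
        '_:c14n0 <http://schema.org/addressCountry> "United States" .',
        f"_:c14n0 {RDF_TYPE} <http://schema.org/PostalAddress> .",
        "_:c14n1 <http://schema.org/address> _:c14n0 .",
        '_:c14n1 <http://schema.org/familyName> "Jarrett" .',
        '_:c14n1 <http://schema.org/gender> "Female" .',
        '_:c14n1 <http://schema.org/givenName> "Ali" .',
        f"_:c14n1 {RDF_TYPE} <http://schema.org/Person> .",
    ],
    "Male": [
        "_:c14n0 <http://schema.org/address> _:c14n1 .",
        '_:c14n0 <http://schema.org/familyName> "Jarrett" .',
        '_:c14n0 <http://schema.org/gender> "Male" .',
        '_:c14n0 <http://schema.org/givenName> "Ali" .',
        f"_:c14n0 {RDF_TYPE} <http://schema.org/Person> .",
        '_:c14n1 <http://schema.org/addressCountry> "United States" .',
        f"_:c14n1 {RDF_TYPE} <http://schema.org/PostalAddress> .",
    ],
}

# The disclosed dataset from the text, without the unrevealed statement.
DISCLOSED = [
    '_:c14n0 <http://schema.org/addressCountry> "United States" .',
    f"_:c14n0 {RDF_TYPE} <http://schema.org/PostalAddress> .",
    "_:c14n1 <http://schema.org/address> _:c14n0 .",
    '_:c14n1 <http://schema.org/familyName> "Jarrett" .',
    '_:c14n1 <http://schema.org/givenName> "Ali" .',
    f"_:c14n1 {RDF_TYPE} <http://schema.org/Person> .",
]


def strip_hidden(lines):
    """Drop the statement the holder would hide (the gender statement)."""
    return [line for line in lines if GENDER_PREDICATE not in line]


def consistent_guesses(disclosed, canonical_by_guess):
    """Return the candidate values whose canonical form matches what was disclosed."""
    return [
        guess
        for guess, lines in canonical_by_guess.items()
        if strip_hidden(lines) == disclosed
    ]


if __name__ == "__main__":
    print(consistent_guesses(DISCLOSED, CANONICAL_BY_GUESS))
```

Only the candidate whose labeling agrees with the disclosed statements survives, which is why a small value range makes the hidden value recoverable.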

By making the canonicalization process private, we can prevent a brute-forcing attacker from trying to see the labeling change by trying multiple possible attribute values.
For example, we can use an HMAC instead of a hash function in the canonicalization algorithm. Alternatively, we can add a secret random nonce (always undisclosed) into the dataset.
Note that these workarounds force dataset issuers and holders to manage shared secrets securely.
We also note that these workarounds adversely affect the unlinkability described below because canonical labeling now varies depending on the secret shared by the issuer and the holder, which will help correlate them.
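
A minimal sketch of the HMAC workaround, assuming HMAC-SHA-256 is substituted wherever the algorithm would otherwise call SHA-256. The function name keyed_hash and the example keys are hypothetical, and key management is out of scope here.

```python
import hashlib
import hmac


def keyed_hash(secret_key: bytes, s: str) -> str:
    """Return the lowercase hex HMAC-SHA-256 of s under secret_key.

    Assumption: s is UTF-8 encoded, and this replaces the plain hash.
    """
    return hmac.new(secret_key, s.encode("utf-8"), hashlib.sha256).hexdigest()


if __name__ == "__main__":
    statement = '_:a <http://schema.org/gender> "Female" .'
    # Different secrets give unrelated digests for the same input, so a
    # guesser without the key cannot reproduce the labeling.
    print(keyed_hash(b"issuer-holder-secret-1", statement))
    print(keyed_hash(b"issuer-holder-secret-2", statement))
```

Because the digest depends on the key, the resulting canonical labels also depend on it, which is the source of the linkability concern noted above.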

The canonical order can leak unrevealed information even without canonical labelings.

Let us assume that the holder has the following signed dataset, sorted in the canonical (code-point) order.

:a <http://schema.org/children> "Albert" .
:a <http://schema.org/children> "Alice" .
:a <http://schema.org/children> "Allie" .
:a <http://schema.org/name> "John Smith" .

If the holder wants to hide the statement for their second child for any reason, the disclosed dataset now looks like this:

:a <http://schema.org/children> "Albert" .
### 2nd statement is unrevealed ##
:a <http://schema.org/children> "Allie" .
:a <http://schema.org/name> "John Smith" .

Knowing that these statements are sorted in the canonical order, we can guess that the hidden statement must start with :a <http://schema.org/children> "Al, which leaks the subject (:a), predicate (<http://schema.org/children>) and the first two letters of the object ("Al") in the hidden statement.
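
The leak can be computed mechanically. A statement that sorts between two neighbours must share their common prefix, so the sketch below (the function name leaked_prefix is illustrative) takes the statements on either side of the gap and returns what is revealed.

```python
import os.path


def leaked_prefix(previous_line: str, next_line: str) -> str:
    """Return the prefix any statement sorted between the two lines must have."""
    # commonprefix compares the strings character by character.
    return os.path.commonprefix([previous_line, next_line])


if __name__ == "__main__":
    before = ':a <http://schema.org/children> "Albert" .'
    after = ':a <http://schema.org/children> "Allie" .'
    print(leaked_prefix(before, after))
```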

To avoid this leakage, the dataset issuer can randomly shuffle the normalized statements before signing and issuing them to the holder, preventing others from guessing undisclosed information from the canonical order.
However, similar to the workarounds mentioned above, this workaround also adversely affects unlinkability. This is because there are $n!$ different permutations for shuffling $n$ statements, and whichever one is used will help correlate the dataset.
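
As a rough sketch of the shuffling workaround, assuming the issuer draws randomness from the operating system via secrets.SystemRandom (this choice, and the function name shuffled, are assumptions rather than anything specified here):

```python
import math
import secrets


def shuffled(statements):
    """Return a randomly ordered copy of the statements."""
    rng = secrets.SystemRandom()
    result = list(statements)
    rng.shuffle(result)
    return result


if __name__ == "__main__":
    statements = [
        ':a <http://schema.org/children> "Albert" .',
        ':a <http://schema.org/children> "Alice" .',
        ':a <http://schema.org/children> "Allie" .',
        ':a <http://schema.org/name> "John Smith" .',
    ]
    print(shuffled(statements))
    # Number of possible orderings for n statements is n!.
    print(math.factorial(len(statements)))
```

For the four statements above there are 4! = 24 possible orderings, and the particular ordering chosen acts as a correlatable fingerprint of the dataset.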

Unlinkability ensures that no correlatable data are used in a signed dataset while still providing some level of trust, the sufficiency of which must be determined by each verifier. (Note: based on the description in the VC Data Integrity spec)

While canonical sorting works better for unlinkability, canonical labeling can be exploited to break it.
The total number of canonical labelings for a dataset with $n$ blank nodes is $n!$, which is not controllable by the issuer.
It means that the herd constructed as a result of selective disclosure will be split into $n!$ pieces due to the canonical labeling, which reduces unlinkability.

For example, let us assume that an employee of the small company "example.com" shows their employee ID dataset, with their name withheld, like this:

# disclosed dataset
### 1st statement is unrevealed ##
_:c14n0 <http://schema.org/worksFor> _:c14n1 .
_:c14n0 <http://www.w3.org/1999/02/22-rdf-syntax-ns#type> <http://schema.org/Person> .
_:c14n1 <http://schema.org/address> _:c14n2 .
_:c14n1 <http://schema.org/geo> _:c14n3 .
_:c14n1 <http://schema.org/name> "example.com" .
_:c14n2 <http://schema.org/addressCountry> "United States" .
_:c14n3 <http://schema.org/latitude> "0.0" .
_:c14n3 <http://schema.org/longitude> "0.0" .

The verifier can always trace this person without knowing their name if this company has only three employees with the following employee ID datasets.

# normalized dataset 1
_:c14n0 <http://schema.org/address> _:c14n1 .
_:c14n0 <http://schema.org/geo> _:c14n3 .
_:c14n0 <http://schema.org/name> "example.com" .
_:c14n1 <http://schema.org/addressCountry> "United States" .
_:c14n2 <http://schema.org/name> "Jayden Doe" .
_:c14n2 <http://schema.org/worksFor> _:c14n0 .
_:c14n2 <http://www.w3.org/1999/02/22-rdf-syntax-ns#type> <http://schema.org/Person> .
_:c14n3 <http://schema.org/latitude> "0.0" .
_:c14n3 <http://schema.org/longitude> "0.0" .

# normalized dataset 2
_:c14n0 <http://schema.org/address> _:c14n1 .
_:c14n0 <http://schema.org/geo> _:c14n2 .
_:c14n0 <http://schema.org/name> "example.com" .
_:c14n1 <http://schema.org/addressCountry> "United States" .
_:c14n2 <http://schema.org/latitude> "0.0" .
_:c14n2 <http://schema.org/longitude> "0.0" .
_:c14n3 <http://schema.org/name> "Morgan Doe" .
_:c14n3 <http://schema.org/worksFor> _:c14n0 .
_:c14n3 <http://www.w3.org/1999/02/22-rdf-syntax-ns#type> <http://schema.org/Person> .

# normalized dataset 3
_:c14n0 <http://schema.org/name> "Johnny Smith" .
_:c14n0 <http://schema.org/worksFor> _:c14n1 .
_:c14n0 <http://www.w3.org/1999/02/22-rdf-syntax-ns#type> <http://schema.org/Person> .
_:c14n1 <http://schema.org/address> _:c14n2 .
_:c14n1 <http://schema.org/geo> _:c14n3 .
_:c14n1 <http://schema.org/name> "example.com" .
_:c14n2 <http://schema.org/addressCountry> "United States" .
_:c14n3 <http://schema.org/latitude> "0.0" .
_:c14n3 <http://schema.org/longitude> "0.0" .

The canonicalization in this example produces different labelings for these three employees, which helps anyone to correlate their activities even if they do not reveal their names in the dataset.

By determining some "template" for each anonymous set (or herd) and fixing the canonical labeling and canonical order used in the anonymous set, we can achieve a certain unlinkability.
Alternatively, we might be able to generate a kind of "on-the-fly" template proposed in Dave's comment in #4, which seems ineffective in some cases (see #8).

Issue 87: Naming the algorithm (Bike-shedding danger ahead!)

I must admit I hate the name "URDNA2015". I never remember what URDNA stands for (I always associate it with some weird gene...) and the 2015 is irrelevant.

Shouldn't we come with a simpler name? We could simply call it RCH, for example and, if we need it, we can use the "usual" W3C versioning habit of calling it RCH 1.0.

Just a thought. But it is still time to make this change.

Issue 88: Should multiformats refer to URDNA or URDCA?

There is a PR that contributed a reference to some of this in multiformats/multicodec.

The PR title says 'urdna' but the PR added the name using 'urdca'. Which is more... canonical... normal... whatever. What should we put in the multicodec name column? :)

I imagine this decision has editorial implications for the spec itself? It still says 'urdna2015' quite a bit and does not contain 'urdca'. So why should multibase say 'urdca'?

Relates to:

Issue 89: Support ordered input dataset or "list of quads" and optional mapping from input indices to output indices

Starting with my comment here: #86 (comment)

Some discussion arose around the need for an optional output: a mapping from quad input indices to quad output indices, for selective disclosure use cases.

@gkellogg made this comment:

Just to be clear on what we're talking about, the input dataset is unordered and no blank node labels are persistent or possibly even present.

Which implies that we might want to also take an ordered list of quads as an optional alternative input to the algorithm. Or perhaps we can describe the RDF abstract dataset as being optionally represented as such -- for the case where this mapping output is desirable. Notably, the presence (or lack thereof) of input blank node labels in this case is not relevant.

The canonicalization algorithm converts an input dataset into a normalized dataset. This algorithm will assign deterministic identifiers to any blank nodes in the input dataset.

This section is non-normative.

URDNA2015 canonically labels an RDF dataset by assigning each blank node a canonical identifier. In URDNA2015, an RDF dataset D is represented as a set of quads of the form < s, p, o, g > where the graph component g is empty if and only if the triple < s, p, o > is in the default graph. It is expected that, for two RDF datasets, URDNA2015 returns the same canonically labeled list of quads if and only if the two datasets are isomorphic (i.e., the same modulo blank node identifiers).

URDNA2015 consists of several sub-algorithms. These sub-algorithms are introduced in the following sub-sections. First, we give a high level summary of URDNA2015.

Initialization. Initialize the state needed for the rest of the algorithm using 4.2 Canonicalization State.
Compute first degree hashes. Compute the first degree hash for each blank node in the dataset using 4.6 Hash First Degree Quads.
Canonically label unique nodes. Assign canonical identifiers via 4.5 Issue Identifier Algorithm, in Unicode code point order, to each blank node whose first degree hash is unique.
Compute N-degree hashes for non-unique nodes. For each repeated first degree hash (proceeding in Unicode code point order), compute the N-degree hash via 4.8 Hash N-Degree Quads of every unlabeled blank node that corresponds to the given repeated hash.
Canonically label remaining nodes. In Unicode code point order of the N-degree hashes, issue canonical identifiers to each corresponding blank node using 4.5 Issue Identifier Algorithm. If more than one node produces the same N-degree hash, the order in which these nodes receive a canonical identifier does not matter.
Finish. Return the serialized canonical form of the normalized dataset.

This section is non-normative.

Example 2: Unique hashes

This example illustrates the Canonicalization Algorithm where all blank nodes have unique first-degree hashes.

Figure 1 An illustration of a graph resulting in unique hashes.

:p :q _:e0 .
:p :r _:e1 .
_:e0 :s :u .
_:e1 :t :u .

Step 2 is called twice, with each blank node (e0 and e1) in the input dataset to populate blank node to quads map:

Table 1 Blank node to quads map for unique hashes
`blank node`	`Q`
`e0`	`:p :q e0 .` `e0 :s :u .`
`e1`	`:p :r e1 .` `e1 :t :u .`
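
Populating the blank node to quads map for this example could look roughly like the sketch below. It assumes each quad is a tuple of term strings, that blank nodes are written with a leading "_:", and that the graph component is omitted for the default graph. The function name is illustrative.

```python
from collections import defaultdict

QUADS = [
    (":p", ":q", "_:e0"),
    (":p", ":r", "_:e1"),
    ("_:e0", ":s", ":u"),
    ("_:e1", ":t", ":u"),
]


def build_blank_node_to_quads_map(quads):
    """Relate each blank node identifier to the quads that mention it."""
    mapping = defaultdict(list)
    for quad in quads:
        # dict.fromkeys drops repeated terms while keeping their order, so a
        # quad mentioning the same blank node twice is only recorded once.
        for term in dict.fromkeys(quad):
            if term.startswith("_:"):
                mapping[term[2:]].append(quad)
    return dict(mapping)


if __name__ == "__main__":
    for node, quads in build_blank_node_to_quads_map(QUADS).items():
        print(node, quads)
```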

Step 3 generates the first-degree hash for each blank node, which is explored further in Example 4:

Table 2 Hash to blank nodes map for unique hashes
`hash`	`blank node(s)`
`21d1dd5ba21f3dee9d76c0c00c260fa6f5d5d65315099e553026f4828d0dc77a`	`e0`
`6fa0b9bdb376852b5743ff39ca4cbf7ea14d34966b2828478fbf222e7c764473`	`e1`

Step 4 creates canonical identifiers for each blank node which has a unique hash:

Table 3 Canonical identifiers for blank nodes with unique hashes
`blank node`	`canonical identifier`
`e0`	`c14n0`
`e1`	`c14n1`

Step 5 has no effect, as there are no remaining blank nodes without canonical identifiers.

Step 6 generates the normalized dataset by replacing blank node identifiers in the original input with their canonical identifiers:

:p :q _:c14n0 .
:p :r _:c14n1 .
_:c14n0 :s :u .
_:c14n1 :t :u .

Example 3: Shared hashes

This example illustrates the Canonicalization Algorithm where some blank nodes share a first-degree hash, because hashing the statements that mention them produces the same result.

Figure 2 An illustration of a graph resulting in shared hashes.

:p :q _:e0 .
:p :q _:e1 .
_:e0 :p _:e2 .
_:e1 :p _:e3 .
_:e2 :r _:e3 .

Step 2 is called four times, with each blank node (e0, e1, e2, and e3) to populate blank node to quads map:

Table 4 Blank node to quads map for shared hashes
`blank node`	`Q`
`e0`	`:p :q e0 .` `e0 :p e2 .`
`e1`	`:p :q e1 .` `e1 :p e3 .`
`e2`	`e0 :p e2 .` `e2 :r e3 .`
`e3`	`e1 :p e3 .` `e2 :r e3 .`

Step 3 generates the first-degree hash for each blank node, which is explored further in Example 5 (note that the hashes for e0 and e1 are shared):

Table 5 Hash to blank nodes map for shared hashes
`hash`	`blank node(s)`
`3b26142829b8887d011d779079a243bd61ab53c3990d550320a17b59ade6ba36`	`e0`, `e1`
`15973d39de079913dac841ac4fa8c4781c0febfba5e83e5c6e250869587f8659`	`e2`
`7e790a99273eed1dc57e43205d37ce232252c85b26ca4a6ff74ff3b5aea7bccd`	`e3`

Step 4 creates canonical identifiers for each blank node which has a unique hash:

Table 6 Canonical identifiers for blank nodes with shared hashes
`blank node`	`canonical identifier`
`e2`	`c14n0`
`e3`	`c14n1`
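
Steps 3 and 4 for this example can be sketched as follows, taking the first-degree hashes from Table 5 as given rather than computing them. The function name label_unique is hypothetical, and the counter-based naming assumes a fresh issuer with the prefix c14n.

```python
# First-degree hashes copied from Table 5.
FIRST_DEGREE_HASHES = {
    "e0": "3b26142829b8887d011d779079a243bd61ab53c3990d550320a17b59ade6ba36",
    "e1": "3b26142829b8887d011d779079a243bd61ab53c3990d550320a17b59ade6ba36",
    "e2": "15973d39de079913dac841ac4fa8c4781c0febfba5e83e5c6e250869587f8659",
    "e3": "7e790a99273eed1dc57e43205d37ce232252c85b26ca4a6ff74ff3b5aea7bccd",
}


def label_unique(first_degree_hashes, prefix="c14n"):
    """Issue canonical identifiers to blank nodes whose hash is unique.

    Returns the issued identifiers and the remaining (shared) hash entries.
    """
    hash_to_blank_nodes = {}
    for node, digest in first_degree_hashes.items():
        hash_to_blank_nodes.setdefault(digest, []).append(node)

    issued = {}
    # sorted() on str compares by code point and returns a new list, so
    # deleting entries from the dict inside the loop is safe.
    for digest in sorted(hash_to_blank_nodes):
        nodes = hash_to_blank_nodes[digest]
        if len(nodes) > 1:
            continue  # shared hash: handled later via Hash N-Degree Quads
        issued[nodes[0]] = f"{prefix}{len(issued)}"
        del hash_to_blank_nodes[digest]
    return issued, hash_to_blank_nodes


if __name__ == "__main__":
    issued, remaining = label_unique(FIRST_DEGREE_HASHES)
    print(issued)
    print(remaining)
```

The entry left over in the second return value is the shared hash for e0 and e1, which is what Step 5 goes on to process.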

Step 5 is run separately on e0 and e1, which share the same hash. Each run uses its own instance of a temporary issuer (as explored in Example 7) and appends to the hash path list the hash result and temporary identifier mappings returned by the Hash N-Degree Quads algorithm for that blank node:

temporary issuer mappings

hash fbc300de5afafd97a4b9ee1e72b57754dcdcb7ebb724789ac6a94a5b82a48d30

temporary issuer mappings

hash 2c0b377baf86f6c18fed4b0df6741290066e73c932861749b172d1e5560f5045

Step 5.3 creates the canonical identifiers for the temporary identifiers established in the previous step running in order of the hash component from each result. This updates the canonical issuer with the following mappings:

Table 9 Blank node to canonical identifiers
`blank node`	`canonical identifier`
`e0`	`c14n3`
`e1`	`c14n2`
`e2`	`c14n0`
`e3`	`c14n1`
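
The ordering performed in Step 5.3 can be sketched as below. Several things are assumed here: each result is reduced to a (hash, identifiers in temporary issuance order) pair, each temporary issuer is assumed to have recorded only the node it was created for, and the pairing of the two hashes with e0 and e1 is inferred from Table 9 rather than stated. The function name label_shared_nodes is hypothetical.

```python
def label_shared_nodes(hash_path_list, canonical_ids, prefix="c14n"):
    """Issue canonical identifiers in code point order of the result hashes.

    hash_path_list: list of (hash, [existing identifiers in the order the
    temporary issuer issued them]).
    canonical_ids: identifiers issued so far; updated and returned.
    """
    for _result_hash, issued_order in sorted(hash_path_list, key=lambda r: r[0]):
        for existing_identifier in issued_order:
            # Nodes that already have a canonical identifier keep it.
            if existing_identifier not in canonical_ids:
                canonical_ids[existing_identifier] = f"{prefix}{len(canonical_ids)}"
    return canonical_ids


if __name__ == "__main__":
    # Assumed pairing of N-degree hashes to nodes (see lead-in).
    hash_path_list = [
        ("fbc300de5afafd97a4b9ee1e72b57754dcdcb7ebb724789ac6a94a5b82a48d30", ["e0"]),
        ("2c0b377baf86f6c18fed4b0df6741290066e73c932861749b172d1e5560f5045", ["e1"]),
    ]
    already_issued = {"e2": "c14n0", "e3": "c14n1"}
    print(label_shared_nodes(hash_path_list, already_issued))
```

Under these assumptions the result agrees with Table 9: the node with the smaller N-degree hash receives c14n2 and the other receives c14n3.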

Step 6 generates the normalized dataset by replacing blank node identifiers in the original input with their canonical identifiers:

:p :q _:c14n2 .
:p :q _:c14n3 .
_:c14n0 :r _:c14n1 .
_:c14n2 :p _:c14n1 .
_:c14n3 :p _:c14n0 .
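
Step 6 for this example amounts to relabeling each quad and sorting the serialized lines in code point order. The sketch below assumes quads are tuples of term strings with blank nodes written as "_:" plus the identifier; the abbreviated terms such as :p are kept as they appear in the example, and the function names are illustrative.

```python
CANONICAL_IDS = {"e0": "c14n3", "e1": "c14n2", "e2": "c14n0", "e3": "c14n1"}

INPUT_QUADS = [
    (":p", ":q", "_:e0"),
    (":p", ":q", "_:e1"),
    ("_:e0", ":p", "_:e2"),
    ("_:e1", ":p", "_:e3"),
    ("_:e2", ":r", "_:e3"),
]


def relabel(term, canonical_ids):
    """Replace a blank node identifier with its canonical identifier."""
    if term.startswith("_:"):
        return "_:" + canonical_ids[term[2:]]
    return term


def serialize(quads, canonical_ids):
    """Return the relabeled statements, sorted in code point order."""
    lines = [
        " ".join(relabel(term, canonical_ids) for term in quad) + " ."
        for quad in quads
    ]
    return sorted(lines)


if __name__ == "__main__":
    print("\n".join(serialize(INPUT_QUADS, CANONICAL_IDS)))
```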

Create the canonicalization state.

Explanation

This has the effect of initializing the blank node to quads map, and the hash to blank nodes map, as well as instantiating a new canonical issuer.
For every quad Q in input dataset:
1. For each blank node that is a component of Q, add a reference to Q from the map entry for the blank node identifier identifier in the blank node to quads map, creating a new entry if necessary.
  
  Explanation
  
  This establishes the blank node to quads map, relating each blank node with the set of quads of which it is a component.
  
  Note
  Literal components of quads are not subject to any normalization. As noted in Section 3.3 of [RDF11-CONCEPTS], literal term equality is based on the lexical form, rather than the literal value, so two literals "01"^^xs:integer and "1"^^xs:integer are treated as distinct resources.
Logging

Log the state of the blank node to quads map:
For each key n in the blank node to quads map:

Explanation

This step creates a hash for every blank node in the input document. Some blank nodes will lead to a unique hash, while other blank nodes may share a common hash.
1. Create a hash, h_f(n), for n according to the Hash First Degree Quads algorithm.
2. Add h_f(n) and n to hash to blank nodes map, including repetitions, creating a new entry if necessary.
Logging

Log the results from the Hash First Degree Quads algorithm.
For each hash to identifier list map entry in hash to blank nodes map, code point ordered by hash:

Explanation

This step establishes the canonical identifier for blank nodes having a unique hash, which are recorded in the canonical issuer.
1. If identifier list has more than one entry, continue to the next mapping.
2. Use the Issue Identifier algorithm, passing canonical issuer and the single blank node identifier, identifier in identifier list to issue a canonical replacement identifier for identifier.
3. Remove the map entry for hash from the hash to blank nodes map.
Logging

Log the assigned canonical identifiers.
For each hash to identifier list map entry in hash to blank nodes map, code point ordered by hash:

Explanation

This step establishes the canonical identifier for blank nodes having a shared hash. This is done by creating unique blank node identifiers for all blank nodes traversed by the Hash N-Degree Quads algorithm, running through each blank node without a canonical identifier in the order of the hashes established in the previous step.
Logging

Log hash and identifier list for this iteration.
1. Create hash path list where each item will be a result of running the Hash N-Degree Quads algorithm.
  
  Explanation
  
  This list establishes an order for those blank nodes sharing a common first-degree hash.
2. For each blank node identifier n in identifier list:
  1. If a canonical identifier has already been issued for n, continue to the next blank node identifier.
  2. Create temporary issuer, an identifier issuer initialized with the prefix b.
  3. Use the Issue Identifier algorithm, passing temporary issuer and n, to issue a new temporary blank node identifier b_n to n.
  4. Run the Hash N-Degree Quads algorithm, passing the canonicalization state, n for identifier, and temporary issuer, appending the result to the hash path list.
    Logging
    
    Include logs for each call to Hash N-Degree Quads algorithm.
3. For each result in the hash path list, code point ordered by the hash in result:
  
  Explanation
  
  The previous step created temporary identifiers for the blank nodes sharing a common first degree hash, which is now used to generate their canonical identifiers.
  1. For each blank node identifier, existing identifier, that was issued a temporary identifier by identifier issuer in result, issue a canonical identifier, in the same order, using the Issue Identifier algorithm, passing canonical issuer and existing identifier.
    
    Explanation
    
    In Step 5.2, hash path list was created with an ordered set of results. Each result contained a temporary issuer which recorded temporary identifiers associated with a particular blank node identifier in identifier list. This step processes each returned temporary issuer, in order, and allocates canonical identifiers to the temporary identifier mappings contained within each temporary issuer, creating a full order on the remaining blank nodes with unissued canonical identifiers.
  Logging
  
  Log newly issued canonical identifiers.
For each quad, q, in input dataset:

Explanation

This step populates the normalized dataset with quads substituting the original blank node identifiers, with the newly established canonical blank node identifiers.
1. Create a copy, quad copy, of q and replace any existing blank node identifier n using the canonical identifiers previously issued by canonical issuer.
2. Add quad copy to the normalized dataset.
Logging

Log the state of the canonical issuer at the completion of the algorithm.
Return the serialized canonical form of the normalized dataset.
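
The closing steps above (5.3, 6, and 7) can be sketched in Python. This is a minimal illustration under stated assumptions, not the normative algorithm: quads are tuples of N-Quads term strings with `None` for a missing graph name, each entry of `hash_path_list` is a hypothetical `(hash, temp_issued)` pair whose `temp_issued` dict lists existing identifiers in the order the temporary issuer saw them, and the serialized canonical form is assumed to be the N-Quads lines sorted in code point order. The function names are illustrative only.

```python
def issue_canonical_ids(hash_path_list, canonical_issued, prefix="c14n"):
    """Step 5.3: walk results in hash order and issue canonical identifiers."""
    for _hash, temp_issued in sorted(hash_path_list, key=lambda result: result[0]):
        for existing in temp_issued:  # dicts preserve issuance order
            if existing not in canonical_issued:
                # Simplification: the counter equals the number of issued ids.
                canonical_issued[existing] = f"{prefix}{len(canonical_issued)}"
    return canonical_issued


def relabel(quad, canonical_issued):
    """Step 6.1: copy a quad, swapping blank node ids for canonical ones."""
    return tuple(
        "_:" + canonical_issued[term[2:]]
        if term is not None and term.startswith("_:") else term
        for term in quad
    )


def serialize(quads, canonical_issued):
    """Steps 6 and 7: relabel every quad, then emit sorted N-Quads lines."""
    lines = []
    for quad in quads:
        terms = [t for t in relabel(quad, canonical_issued) if t is not None]
        lines.append(" ".join(terms) + " .\n")
    return "".join(sorted(lines))


# Hypothetical hashes "aa" and "bb" stand in for real N-degree hashes.
hash_path_list = [("bb", {"e1": "b0"}), ("aa", {"e0": "b0"})]
issued = issue_canonical_ids(hash_path_list, {})
quads = [("_:e1", "<http://example.com/#p>", "_:e0", None)]
print(serialize(quads, issued), end="")
# _:c14n1 <http://example.com/#p> _:c14n0 .
```

Because the result with hash `aa` sorts first, `e0` receives `c14n0` even though it appears second in the list.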

This algorithm issues a new blank node identifier for a given existing blank node identifier. It also updates state information that tracks the order in which new blank node identifiers were issued. The order of issuance is important for canonically labeling blank nodes that are isomorphic to others in the dataset.

The algorithm maintains an issued identifiers map to relate an existing blank node identifier from the input dataset to a new blank node identifier using a given identifier prefix (c14n) with new identifiers issued by appending an incrementing number. For example, when called for a blank node identifier such as e3, it might result in an issued identifier of c14n1.

The algorithm takes an identifier issuer I and an existing identifier as inputs. The output is a new issued identifier. The steps of the algorithm are:

If there is a map entry for existing identifier in issued identifiers map of I, return it.
Generate issued identifier by concatenating identifier prefix with the string value of identifier counter.
Add an entry mapping existing identifier to issued identifier to the issued identifiers map of I.
Increment identifier counter.
Return issued identifier.
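
A minimal Python sketch of these steps follows. The class name `IdentifierIssuer`, its attribute names, and the starting counter value of 0 are assumptions made for illustration (the value 0 is chosen to match labels such as c14n0 that appear in the examples).

```python
class IdentifierIssuer:
    """Issues new blank node identifiers and remembers the order of issuance."""

    def __init__(self, prefix="c14n"):
        self.prefix = prefix
        self.counter = 0   # assumed to start at 0
        self.issued = {}   # existing id -> issued id, in issuance order

    def issue(self, existing):
        # Step 1: reuse an identifier that was already issued.
        if existing in self.issued:
            return self.issued[existing]
        # Steps 2-4: build the identifier, record it, increment the counter.
        issued = f"{self.prefix}{self.counter}"
        self.issued[existing] = issued
        self.counter += 1
        # Step 5: return the new identifier.
        return issued


issuer = IdentifierIssuer("c14n")
print(issuer.issue("e7"))  # c14n0
print(issuer.issue("e3"))  # c14n1
print(issuer.issue("e7"))  # c14n0 (already issued, so returned unchanged)
```

Python dictionaries preserve insertion order, so `issued` doubles as the record of the order in which identifiers were handed out.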

This algorithm calculates a hash for a given blank node across the quads in a dataset in which that blank node is a component. If the hash uniquely identifies that blank node, no further examination is necessary. Otherwise, a hash will be created for the blank node using the algorithm in 4.8 Hash N-Degree Quads invoked via 4.4 Canonicalization Algorithm.

This section is non-normative.

To determine whether the first degree information of a node n is unique, a hash is assigned to its mention set, Q_n. The first degree hash of a blank node n, denoted h_f(n), is the hash that results from 4.6 Hash First Degree Quads when passing n. Nodes with unique first degree hashes have unique first degree information.

For consistency, blank node identifiers used in Q_n are replaced with placeholders in the canonical n-quads serialization of each quad. Every blank node component is replaced with a if that component is n, and with z otherwise.

The resulting serialized quads are then code point ordered, concatenated, and hashed. This hash is the first degree hash of n, h_f(n).

This section is non-normative.

Example 4: Unique hashes

This example illustrates hashing quads containing blank nodes where hashing the statements mentioning those blank nodes generates unique results.

Figure 3 An illustration of a graph resulting in unique hashes.

:p :q _:e0 .
:p :r _:e1 .
_:e0 :s :u .
_:e1 :t :u .

The algorithm will be called twice, with each blank node (e0 and e1). Running the algorithm with the reference node e0 results in the following quads, after replacing blank nodes:

:p :q _:a .
_:a :s :u .

These are then serialized to canonical n-quads form: '<http://example.com/#p> <http://example.com/#q> _:a .\n_:a <http://example.com/#s> <http://example.com/#u> .\n', concatenated and hashed using the hash algorithm (SHA-256) resulting in 21d1dd5ba21f3dee9d76c0c00c260fa6f5d5d65315099e553026f4828d0dc77a.

The algorithm is run a second time with the reference node e1 resulting in the following quads:

:p :r _:a .
_:a :t :u .

These are then serialized to canonical n-quads form: '<http://example.com/#p> <http://example.com/#r> _:a .\n_:a <http://example.com/#t> <http://example.com/#u> .\n', concatenated and hashed as before resulting in 6fa0b9bdb376852b5743ff39ca4cbf7ea14d34966b2828478fbf222e7c764473.

Thus the generated hashes each reference just a single blank node, allowing the canonicalization algorithm to use only the Hash First Degree Quads algorithm.

Example 5: Shared hashes

This example illustrates hashing quads containing blank nodes where hashing the statements mentioning those blank nodes produces overlapping results.

Figure 4 An illustration of a graph resulting in shared hashes.

:p :q _:e0 .
:p :q _:e1 .
_:e0 :p _:e2 .
_:e1 :p _:e3 .
_:e2 :r _:e3 .

The algorithm will be called four times, with each blank node (e0, e1, e2, and e3). Running the algorithm with the reference node e0 results in the following quads, after replacing blank nodes:

:p :q _:a .
_:a :p _:z .

Which hashes to: 3b26142829b8887d011d779079a243bd61ab53c3990d550320a17b59ade6ba36.

Note that using reference node e1 results in the same quads, and thus results in the same hash.

Using the reference node e2 results in the following quads:

_:z :p _:a .
_:a :r _:z .

Which hashes to: 15973d39de079913dac841ac4fa8c4781c0febfba5e83e5c6e250869587f8659.

Lastly, using the reference node e3 results in the following quads:

_:z :p _:a .
_:z :r _:a .

Which hashes to: 7e790a99273eed1dc57e43205d37ce232252c85b26ca4a6ff74ff3b5aea7bccd.

The hashes for e2 and e3 are unique, but e0 and e1 share a common hash. Distinguishing them requires the Hash N-Degree Quads Algorithm, because quads further removed from the direct mentions must be considered to determine a unique hash.
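
The outcome of this example can be sketched as a grouping of blank nodes by first degree hash. The hash values below are copied from the example. The helper names, and the assumption that uniquely hashed nodes receive canonical identifiers in code point order of their hashes, are illustrative.

```python
from collections import defaultdict

first_degree_hashes = {
    "e0": "3b26142829b8887d011d779079a243bd61ab53c3990d550320a17b59ade6ba36",
    "e1": "3b26142829b8887d011d779079a243bd61ab53c3990d550320a17b59ade6ba36",
    "e2": "15973d39de079913dac841ac4fa8c4781c0febfba5e83e5c6e250869587f8659",
    "e3": "7e790a99273eed1dc57e43205d37ce232252c85b26ca4a6ff74ff3b5aea7bccd",
}


def group_by_hash(hashes):
    """Build a hash -> [blank node ids] map."""
    groups = defaultdict(list)
    for node, digest in hashes.items():
        groups[digest].append(node)
    return groups


def split_unique(groups):
    """Label nodes with unique hashes; collect the rest for N-degree hashing."""
    canonical, shared = {}, []
    for digest in sorted(groups):  # code point order of the hashes
        nodes = groups[digest]
        if len(nodes) == 1:
            canonical[nodes[0]] = f"c14n{len(canonical)}"
        else:
            shared.append(nodes)
    return canonical, shared


canonical, shared = split_unique(group_by_hash(first_degree_hashes))
print(canonical)  # {'e2': 'c14n0', 'e3': 'c14n1'}
print(shared)     # [['e0', 'e1']]
```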

This algorithm takes the canonicalization state and a reference blank node identifier as inputs.

Initialize nquads to an empty list. It will be used to store quads in canonical n-quads form.
Get the list of quads quads from the map entry for reference blank node identifier in the blank node to quads map.
For each quad quad in quads:
1. Serialize the quad in canonical n-quads form and append the result to nquads, applying the following special rule:
  1. If any component in quad is a blank node, then serialize it using a special identifier as follows:
    1. If the blank node's existing blank node identifier matches the reference blank node identifier then use the blank node identifier a, otherwise, use the blank node identifier z.
Sort nquads in Unicode code point order.
Return the hash that results from passing the sorted and concatenated nquads through the hash algorithm.
Logging

Log the inputs and result of running this algorithm.
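
The steps above can be sketched in Python as follows. This is an illustration under stated assumptions rather than a reference implementation: quads are tuples of already-serialized N-Quads terms (with `None` for a missing graph name), blank nodes are written as `_:label`, the `:` prefix in the examples is assumed to expand to `http://example.com/#`, and the helper names are hypothetical.

```python
import hashlib

EX = "http://example.com/#"  # assumed expansion of the ":" prefix


def serialize_for(reference, quad):
    """Serialize one quad, replacing blank nodes with _:a or _:z."""
    terms = []
    for term in quad:
        if term is None:  # no graph name component
            continue
        if term.startswith("_:"):
            term = "_:a" if term == reference else "_:z"
        terms.append(term)
    return " ".join(terms) + " .\n"


def hash_first_degree_quads(blank_node_to_quads, reference):
    nquads = [serialize_for(reference, q) for q in blank_node_to_quads[reference]]
    nquads.sort()  # Python compares strings by Unicode code point
    return hashlib.sha256("".join(nquads).encode("utf-8")).hexdigest()


def build_map(quads):
    """Build the blank node to quads map (the mention sets)."""
    mapping = {}
    for quad in quads:
        for term in quad:
            if term is not None and term.startswith("_:"):
                bucket = mapping.setdefault(term, [])
                if quad not in bucket:
                    bucket.append(quad)
    return mapping


# The graph from the "Shared hashes" example.
quads = [
    (f"<{EX}p>", f"<{EX}q>", "_:e0", None),
    (f"<{EX}p>", f"<{EX}q>", "_:e1", None),
    ("_:e0", f"<{EX}p>", "_:e2", None),
    ("_:e1", f"<{EX}p>", "_:e3", None),
    ("_:e2", f"<{EX}r>", "_:e3", None),
]
blank_node_to_quads = build_map(quads)
for node in sorted(blank_node_to_quads):
    print(node, hash_first_degree_quads(blank_node_to_quads, node))
```

In this sketch `_:e0` and `_:e1` produce identical serializations and therefore identical hashes, while `_:e2` and `_:e3` do not.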

This algorithm generates a hash for some blank node component of a quad, considering its position within that quad. This is used as part of the Hash N-Degree Quads algorithm to characterize the blank nodes related to some particular blank node within their mention sets.

This algorithm creates a hash to identify how one blank node is related to another. It takes the canonicalization state, a related blank node identifier, a quad, an identifier issuer, issuer, and a string position as inputs.

Initialize a string input to the value of position.
If position is not g, append <, the value of the predicate in quad, and > to input.
If there is a canonical identifier for related, or an identifier issued by issuer, append the string _:, followed by that identifier (using the canonical identifier if present, otherwise the one issued by issuer) to input.

Explanation

If a canonical identifier was already issued for related, it will be in the canonical issuer contained within canonicalization state. Otherwise, the temporary issuer instance may already have a mapping for related.
Otherwise, append the result of the Hash First Degree Quads algorithm, passing related to input.

Explanation

If no identifier, canonical or temporary, has already been issued, a new identifier is created using the Hash First Degree Quads algorithm.
Return the hash that results from passing input through the hash algorithm.

Explanation

This resulting string is used to generate a hash; in this respect, it is similar to the Hash First Degree Quads algorithm which uses the serialization of quads in nquads for hashing. For the sake of consistency, the nquad representation of identifier is used in this step, hence the appearance of the _: string.
Logging

Log the inputs and result of running this algorithm.
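
A minimal Python sketch of these steps is shown below. The parameter layout is an assumption made for illustration: the canonical and temporary issuers are represented by plain dictionaries of issued identifiers, first degree hashes are looked up in a precomputed dictionary instead of calling the Hash First Degree Quads algorithm, and the predicate is stored as a bare IRI string in position 1 of the quad tuple.

```python
import hashlib


def hash_related_blank_node(related, quad, position,
                            canonical_issued, temp_issued, first_degree_hashes):
    """Hash how `related` is connected to the blank node being processed."""
    predicate = quad[1]
    data = position  # "s", "o", or "g"
    if position != "g":
        data += "<" + predicate + ">"
    if related in canonical_issued:
        # A canonical identifier takes precedence.
        data += "_:" + canonical_issued[related]
    elif related in temp_issued:
        data += "_:" + temp_issued[related]
    else:
        # No identifier issued yet: fall back to the first degree hash.
        data += first_degree_hashes[related]
    return hashlib.sha256(data.encode("utf-8")).hexdigest()


quad = ("e0", "http://example.com/#p", "e2", None)
print(hash_related_blank_node("e2", quad, "o", {"e2": "c14n0"}, {}, {}))
```

With these inputs the string that gets hashed is `o<http://example.com/#p>_:c14n0`.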

This algorithm calculates a hash for a given blank node across the quads in a dataset in which that blank node is a component for which the hash does not uniquely identify that blank node. This is done by expanding the search from quads directly referencing that blank node (the mention set), to those quads which contain nodes which are also components of quads in the mention set, called the gossip path. This process proceeds in ever greater degrees of indirection until a unique hash is obtained.

Editor's note

The 'path' terminology could also be changed to better indicate what a path is (a particular deterministic serialization for a subgraph/subdataset of nodes without globally-unique identifiers).

This section is non-normative.

Usually, when trying to determine if two nodes in a graph are equivalent, you simply compare their identifiers. However, what if the nodes don't have identifiers? Then you must determine if the two nodes have equivalent connections to equivalent nodes all throughout the whole graph. This is called the graph isomorphism problem. This algorithm approaches this problem by considering how one might draw a graph on paper. You can test to see if two nodes are equivalent by drawing the graph twice. The first time you draw the graph the first node is drawn in the center of the page. If you can draw the graph a second time such that it looks just like the first, except the second node is in the center of the page, then the nodes are equivalent. This algorithm essentially defines a deterministic way to draw a graph where, if you begin with a particular node, the graph will always be drawn the same way. If two graphs are drawn the same way with two different nodes, then the nodes are equivalent. A hash is used to indicate a particular way that the graph has been drawn and can be used to compare nodes.

When two blank nodes have the same first degree hash, extra steps must be taken to detect global, or N-degree, distinctions. All information that is in any way connected to the blank node n through other blank nodes, even transitively, must be considered.

To consider all transitive information, the algorithm traverses and encodes all possible paths of incident mentions emanating from n, called gossip paths, that reach every unlabeled blank node connected to n. Each unlabeled blank node is assigned a temporary identifier in the order in which it is reached in the gossip path being explored. The mentions that are traversed to reach connected blank nodes are encoded in these paths via related hashes. This provides a deterministic way to order all paths coming from n that reach all blank nodes connected to n without relying on input blank node identifiers.

This algorithm works in concert with the main canonicalization algorithm to produce a unique, deterministic identifier for a particular blank node. This hash incorporates all of the information that is connected to the blank node as well as how it is connected. It does this by creating deterministic paths that emanate out from the blank node through any other adjacent blank nodes.

Ultimately, the algorithm selects a shortest gossip path, distributing canonical identifiers to the unlabeled blank nodes in the order in which they appear in this path. The hash of this encoded shortest path, called the N-degree hash of n, distinguishes n from other blank nodes in the dataset.

For clarity, we consider a gossip path encoded via the string s to be shortest provided that:

The length of s is less than or equal to the length of any other gossip path string s′.
If s and s′ have the same length (as strings), then s is code point ordered less than or equal to s′.

For example, abc is shorter than bbc, whereas abcd is longer than bcd.
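
The ordering described in this overview can be sketched as a Python sort key: compare by string length first, then by code point order. The function names are illustrative, and the sketch covers only this informal description.

```python
def gossip_path_key(path):
    """Sort key: shorter strings first, ties broken by code point order."""
    return (len(path), path)


def is_shorter(s, other):
    return gossip_path_key(s) < gossip_path_key(other)


candidates = ["bbc", "abcd", "abc", "bcd"]
print(min(candidates, key=gossip_path_key))     # abc
print(sorted(candidates, key=gossip_path_key))  # ['abc', 'bbc', 'bcd', 'abcd']
```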

The following provides a high level outline for how the N-degree hash of n is computed along the shortest gossip path. Note that the full algorithm considers all gossip paths, ultimately returning the hash of the shortest encoded path.

Compute related hashes. Compute the related hash set H_n for n, i.e., all first degree mentions between n and another blank node. Note that this includes both unlabeled blank nodes and those already issued a canonical identifier (labeled blank nodes).
Explore mentions. Given the related hash x in H_n, record x in the data to hash D_n. Determine whether each blank node reachable via the mention with related hash x has already received an identifier.
1. Record the identifiers of labeled nodes. If a blank node already has an identifier, record its identifier in D_n once for every mention with related hash x. Skip to the next related hash in H_n and repeat step 2.
2. Distribute and record temporary identifiers to unlabeled nodes. For each unlabeled blank node, assign it a temporary identifier according to the order in which it is reached in the gossip path, recording its given identifier in D_n (including repetitions). Add each unlabeled node to the recursion list R_n(x) in this same order (omitting repetitions).
3. Recurse on newly labeled nodes. For each n_i in R_n(x)
  1. Record its identifier in D_n
  2. Append < r(i) > to D_n where r(i) is the data to hash that results from returning to step 1, replacing n with n_i.
Compute the N-degree hash of n. Hash D_n to return the N-degree hash of n, namely h_N(n). Return the updated issuer I_n that has now distributed temporary identifiers to all unlabeled blank nodes connected to n.

As described above in step 2.3, the algorithm recurses on each unlabeled blank node when it is first reached along the gossip path being explored. This recursion can be visualized as moving along the path from n to the blank node n_i that is receiving a temporary identifier. If, when recursing on n_i, another unlabeled blank node n_j is discovered, the algorithm again recurses. Such a recursion traces out the gossip path from n to n_j via n_i.

The recursive hash r(i) is the hash returned from the completed recursion on the node n_i when computing h_N(n). Just as h_N(n) is the hash of D_n, we denote the data to hash in the recursion on n_i as D_i. So, r(i) = h(D_i). For each related hash x ∈ H_n, R_n(x) is called the recursion list on which the algorithm recurses.

This section is non-normative.

Example 7: Shared hashes

This example revisits Example 3 to illustrate the operation of the Hash N-Degree Quads Algorithm on the blank nodes (e0 and e1) which resulted in a shared result from the Hash First Degree Quads Algorithm. The other blank nodes (e2 and e3) had unique hashes when processed by the Hash First Degree Quads Algorithm, so canonical identifiers have already been issued for them, as recorded in the canonical issuer.

Table 10 Established canonical identifiers from canonical issuer
`blank node`	`canonical identifier`
`e2`	`c14n0`
`e3`	`c14n1`

Figure 5 An illustration of a graph resulting in shared hashes.

:p :q _:e0 .
:p :q _:e1 .
_:e0 :p _:e2 .
_:e1 :p _:e3 .
_:e2 :r _:e3 .

The algorithm will be called twice, with each blank node which did not result in a unique hash from the Hash First Degree Quads Algorithm (e0 and e1). The map entry for e0 in the blank node to quads map results in the following quads:

:p :q _:e0 .
_:e0 :p _:e2 .

When called, the temporary issuer has the following mappings:

Step 3 iterates over each of these quads to find blank node components to populate H_n with the result of running the Hash Related Blank Node algorithm using the position of that component within the quad.

Table 12 Hash for `e2` related to `e0`
`related`	`quad`	`position`	`hash`
`e2`	`e0 :p e2 .`	`o`	`29cf7e2279...7252ca60fa`

which results in the following H_n:

Table 13 Blank node list for hash
`related hash`	`blank node list`
`29cf7e22790bc2ed395b81b3933e5329fc7b25390486085cac31ce7252ca60fa`	`[ e2 ]`

Step 5 iterates over each related hash that is a key in H_n; each key can map to one or more related blank nodes. For each key, the algorithm determines the shortest gossip path between the two nodes (here, e0 and e2). In this case, the hash maps to just e2, for which a canonical identifier has already been issued (c14n0), so there is a single permutation resulting in one candidate path. The resulting chosen path is _:c14n0.

The string data to hash is composed of the single related hash and the chosen path: 29cf7e22790bc2ed395b81b3933e5329fc7b25390486085cac31ce7252ca60fa_:c14n0, which hashes to fbc300de5afafd97a4b9ee1e72b57754dcdcb7ebb724789ac6a94a5b82a48d30, which, along with the temporary issuer used to traverse these paths, is the result of the algorithm:

temporary issuer mappings

hash fbc300de5afafd97a4b9ee1e72b57754dcdcb7ebb724789ac6a94a5b82a48d30

The algorithm is called again for e1.

The map entry for e1 in the blank node to quads map results in the following quads:

:p :q _:e1 .
_:e1 :p _:e3 .

Step 3 again calculates blank node components using Hash Related Blank Node algorithm.

Table 16 Hash for `e3` related to `e1`
`related`	`quad`	`position`	`hash`
`e3`	`e1 :p e3 .`	`o`	`b7956ea1d6...ff6b098216`

which results in the following H_n:

Table 17 Blank node list for hash
`related hash`	`blank node list`
`b7956ea1d654d5824496eb439a1f2b79478bd7d02d4a115f4c97cbff6b098216`	`[ e3 ]`

Step 5 runs in essentially the same manner, mapping just e3 having the canonical identifier c14n1, so there is again a single permutation resulting in one candidate path. The resulting chosen path is _:c14n1.

The string data to hash is composed of the single related hash and the chosen path: b7956ea1d654d5824496eb439a1f2b79478bd7d02d4a115f4c97cbff6b098216_:c14n1, which hashes to 2c0b377baf86f6c18fed4b0df6741290066e73c932861749b172d1e5560f5045, which, along with the temporary issuer used to traverse these paths, is the result of the algorithm:

temporary issuer mappings

hash 2c0b377baf86f6c18fed4b0df6741290066e73c932861749b172d1e5560f5045

Issue 16: Optionally fail on duplicates in Hash N-Degree Quads? question

An additional input to this algorithm should be added that allows it to be optionally skipped and throw an error if any equivalent related hashes were produced that must be permuted during step 5.4.4. For practical uses of the algorithm, this step should never be encountered and could be turned off, so that, as a security measure, datasets that would need it are not canonicalized.

The inputs to this algorithm are the canonicalization state, the identifier for the blank node to recursively hash quads for, and path identifier issuer which is an identifier issuer that issues temporary blank node identifiers. The output from this algorithm will be a hash and the identifier issuer used to help generate it.

Logging

Log the inputs to the algorithm.

Create a new map H_n for relating hashes to related blank nodes.
Get a reference, quads, to the list of quads from the map entry for identifier in the blank node to quads map.

Explanation

quads is the mention set of identifier.
Logging

Log the quads from the mention set of identifier.
For each quad in quads:

Explanation

This loop calculates the related hash H_n for other blank nodes within the mention set of identifier.
1. For each component in quad, where component is the subject, object, or graph name, and it is a blank node that is not identified by identifier:
  1. Set hash to the result of the Hash Related Blank Node algorithm, passing the blank node identifier for component as related, quad, issuer, and position as either s, o, or g based on whether component is a subject, object, graph name, respectively.
  2. Add a mapping of hash to the blank node identifier for component to H_n, adding an entry as necessary.
Logging

Include the logs for each iteration of the Hash Related Blank Node algorithm and the resulting H_n.
Create an empty string, data to hash.
For each related hash to blank node list mapping in H_n, code point ordered by related hash:

Explanation

This loop explores the gossip paths for each related blank node sharing a common hash to identifier, finding the shortest such path (chosen path). This determines how canonical identifiers for otherwise commonly hashed blank nodes are chosen.

Each path is represented by the concatenation of the identifiers for each related blank node – either the issued identifier, or a temporary identifier created using a copy of issuer. Those for which temporary identifiers were issued are later recursed over using this algorithm.
Logging

Log the value of related hash and state of data to hash.
1. Append the related hash to the data to hash.
2. Create a string chosen path.
3. Create an unset chosen issuer variable.
4. For each permutation p of blank node list:
  Logging
  
  Log each permutation p.
  1. Create a copy of issuer, issuer copy.
  2. Create a string path.
  3. Create a recursion list, to store blank node identifiers that must be recursively processed by this algorithm.
  4. For each related in p:
    1. If a canonical identifier has been issued for related by canonical issuer, append the string _:, followed by the canonical identifier for related, to path.
      Explanation
      
      A canonical identifier may have been generated before calling this algorithm, if it was issued from an earlier call to Hash First Degree Quads algorithm. There is no reason to recurse and apply the algorithm to any related blank node that has already been assigned a canonical identifier. Furthermore, using the canonical identifier also further distinguishes it from any temporary identifier, allowing for even greater efficiency in finding the chosen path.
    2. Otherwise:
      1. If issuer copy has not issued an identifier for related, append related to recursion list.
        
        Explanation
        
        Temporarily labeled nodes have identifiers recorded in issuer copy, which is later used to recursively call this algorithm, so that eventually all nodes are given canonical identifiers.
      2. Use the Issue Identifier algorithm, passing issuer copy and related, and append the string _:, followed by the result, to path.
    3. If chosen path is not empty and the length of path is greater than or equal to the length of chosen path and path is greater than chosen path when considering code point order, then skip to the next permutation p.
      
      Explanation
      
      If path is already longer than the prospective chosen path, we can terminate this iteration early.
    Explanation
    
    path is used to generate a hash at a later step; in this respect, it is similar to the Hash First Degree Quads algorithm which uses the serialization of quads in nquads for hashing. For the sake of consistency, the nquad representation of blank node identifiers is used in these steps, hence the usage of the _: string.
    Logging
    
    Log related and path.
  5. For each related in recursion list:
    
    Explanation
    
    The prospective path is extended with the hash resulting from recursively calling this algorithm on each related blank node issued a temporary identifier.
    Logging
    
    Log recursion list and path.
    1. Set result to the result of recursively executing the Hash N-Degree Quads algorithm, passing the canonicalization state, related for identifier, and issuer copy for path identifier issuer.
      Logging
      
      Log related and include logs for each recursive call to Hash N-Degree Quads algorithm.
    2. Use the Issue Identifier algorithm, passing issuer copy and related; append the string _:, followed by the result, to path.
    3. Append <, the hash in result, and > to path.
    4. Set issuer copy to the identifier issuer in result.
    5. If chosen path is not empty and the length of path is greater than or equal to the length of chosen path and path is greater than chosen path when considering code point order, then skip to the next p.
      
      Explanation
      
      If path is already longer than the prospective chosen path, we can terminate this iteration early.
  6. If chosen path is empty or path is less than chosen path when considering code point order, set chosen path to path and chosen issuer to issuer copy.
5. Append chosen path to data to hash.
  Logging
  
  Log chosen path and data to hash.
6. Replace issuer, by reference, with chosen issuer.
Return issuer and the hash that results from passing data to hash through the hash algorithm.
Logging

Log issuer and results from passing data to hash through the hash algorithm.
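
The steps of this algorithm can be sketched in Python as follows. This is a simplified illustration, not a reference implementation, and omits logging. It assumes that quads are tuples of serialized N-Quads terms (`None` for a missing graph name), that blank nodes are written as `_:label` and used as keys of the blank node to quads map, and that the canonicalization state is passed as a map plus a canonical issuer. The `Issuer` class and all function names are hypothetical.

```python
import hashlib
from itertools import permutations


def sha256(text):
    return hashlib.sha256(text.encode("utf-8")).hexdigest()


class Issuer:
    """Minimal identifier issuer; `issued` keeps issuance order."""

    def __init__(self, prefix, issued=None, counter=0):
        self.prefix, self.issued, self.counter = prefix, dict(issued or {}), counter

    def issue(self, existing):
        if existing not in self.issued:
            self.issued[existing] = f"{self.prefix}{self.counter}"
            self.counter += 1
        return self.issued[existing]

    def copy(self):
        return Issuer(self.prefix, self.issued, self.counter)


def is_bnode(term):
    return term is not None and term.startswith("_:")


def first_degree_hash(quads, reference):
    lines = []
    for quad in quads[reference]:
        terms = [("_:a" if t == reference else "_:z") if is_bnode(t) else t
                 for t in quad if t is not None]
        lines.append(" ".join(terms) + " .\n")
    return sha256("".join(sorted(lines)))


def related_hash(quads, canonical, related, quad, issuer, position):
    # The predicate term already carries its angle brackets.
    data = position if position == "g" else position + quad[1]
    if related in canonical.issued:
        data += "_:" + canonical.issued[related]
    elif related in issuer.issued:
        data += "_:" + issuer.issued[related]
    else:
        data += first_degree_hash(quads, related)
    return sha256(data)


def worse(path, chosen_path):
    # Early-exit test from steps 5.4.4.3 and 5.4.5.5.
    return bool(chosen_path) and len(path) >= len(chosen_path) and path > chosen_path


def explore(quads, canonical, issuer, permutation, chosen_path):
    """Build the path for one permutation; return None if it is abandoned."""
    issuer_copy, path, recursion_list = issuer.copy(), "", []
    for related in permutation:  # step 5.4.4
        if related in canonical.issued:
            path += "_:" + canonical.issued[related]
        else:
            if related not in issuer_copy.issued:
                recursion_list.append(related)
            path += "_:" + issuer_copy.issue(related)
        if worse(path, chosen_path):
            return None
    for related in recursion_list:  # step 5.4.5
        digest, result_issuer = hash_n_degree_quads(quads, canonical, related, issuer_copy)
        path += "_:" + issuer_copy.issue(related) + "<" + digest + ">"
        issuer_copy = result_issuer
        if worse(path, chosen_path):
            return None
    return path, issuer_copy


def hash_n_degree_quads(quads, canonical, identifier, issuer):
    h_n = {}
    for quad in quads[identifier]:  # step 3
        for index, position in ((0, "s"), (2, "o"), (3, "g")):
            term = quad[index]
            if is_bnode(term) and term != identifier:
                key = related_hash(quads, canonical, term, quad, issuer, position)
                h_n.setdefault(key, []).append(term)
    data_to_hash = ""
    for key in sorted(h_n):  # step 5, code point ordered by related hash
        data_to_hash += key
        chosen_path, chosen_issuer = "", None
        for permutation in permutations(h_n[key]):
            explored = explore(quads, canonical, issuer, permutation, chosen_path)
            if explored and (not chosen_path or explored[0] < chosen_path):
                chosen_path, chosen_issuer = explored
        data_to_hash += chosen_path
        issuer = chosen_issuer
    return sha256(data_to_hash), issuer
```

A caller would first issue a temporary identifier for the starting node (for example `temporary = Issuer("b"); temporary.issue("_:e0")`) and then call `hash_n_degree_quads(quads, canonical, "_:e0", temporary)`, receiving the hash and the issuer that recorded the temporary identifiers along the chosen path.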

This section is non-normative.

This example illustrates a more complicated case where the same paths through blank nodes are duplicated in a graph, but use different blank node identifiers.

Figure 6 An illustration of a graph with duplicated paths.

_:e0 :p1 _:e1 .
_:e1 :p2 "Foo" .
_:e2 :p1 _:e3 .
_:e3 :p2 "Foo" .

The following is a summary of the more detailed execution log found here.

Example 9: Duplicate Paths

Step 2 of the Canonicalization Algorithm is called four times, with each blank node (e0, e1, e2, and e3) to populate blank node to quads map:

Table 19 Blank node to quads map for duplicate paths
`blank node`	`Q`
`e0`	`_:e0 :p1 _:e1 .`
`e1`	`_:e0 :p1 _:e1 .` `_:e1 :p2 "Foo" .`
`e2`	`_:e2 :p1 _:e3 .`
`e3`	`_:e2 :p1 _:e3 .` `_:e3 :p2 "Foo" .`

Step 3 generates the first-degree hash for each blank node.

For e0, the Hash First Degree Quads algorithm uses the nquad _:a :p1 _:z . to calculate the hash 24da9a4406b4e66dffa10ad3d4d6dddc388fbf193bb124e865158ef419893957.

For e1, the Hash First Degree Quads algorithm uses the nquads _:z :p1 _:a . and _:a :p2 "Foo" . to calculate the hash a994e40b576809985bc0f389308cd9d552fd7c89d028c163848a6b2d33a8583a.

For e2, the Hash First Degree Quads algorithm uses the nquad _:a :p1 _:z . to calculate the hash 24da9a4406b4e66dffa10ad3d4d6dddc388fbf193bb124e865158ef419893957.

For e3, the Hash First Degree Quads algorithm uses the nquads _:z :p1 _:a . and _:a :p2 "Foo" . to calculate the hash a994e40b576809985bc0f389308cd9d552fd7c89d028c163848a6b2d33a8583a.

Table 20 Hash to blank nodes map shared hashes
`hash`	`blank node(s)`
`24da9a4406b4e66dffa10ad3d4d6dddc388fbf193bb124e865158ef419893957`	`e0` and `e2`
`a994e40b576809985bc0f389308cd9d552fd7c89d028c163848a6b2d33a8583a`	`e1` and `e3`

Step 4 would create canonical identifiers for each blank node which has a unique hash, but no blank node has a unique hash.

Step 5 is run separately on e0, e1, e2, and e3, which share hash values. Each run uses a separate instance of a temporary issuer to create the hash path list, composed of the hash result and temporary identifier mappings from the Hash N-Degree Quads algorithm for each of these blank nodes.

The Hash N-Degree Quads algorithm is first called for hash 24da9a4406b4e66dffa10ad3d4d6dddc388fbf193bb124e865158ef419893957 and blank node e0 related to the nquad _:e0 :p1 _:e1 ..

`blank node`	`temporary identifier`
`e0`	`b0`

Step 3 calls the Hash Related Blank Node algorithm for each position in this quad:

Table 21 Hash for `e1` related to `e0`
`related`	`quads`	`position`	`hash`
`e1`	`z :p1 a .` `a :p2 "Foo" .`	`o`	`a994e40b57...2d33a8583a`

which results in the following H_n:

Table 22 Blank node list for hash
`related hash`	`blank node list`
`a994e40b576809985bc0f389308cd9d552fd7c89d028c163848a6b2d33a8583a`	`[ e1 ]`

Step 5 iterates over each related hash which is a key in H_n, which can have one or more related blank nodes.

At Step 5.4.5, the recursion list is [ e1 ], so the Hash N-Degree Quads algorithm is called recursively resulting in the following:

temporary issuer mappings

`original identifier`	`temporary identifier`
`e0`	`b0`
`e1`	`b1`

hash c484f98e6cbf9e21f287433c8b1caa7f1486fd61d84ab220a494bf8184751b8c

Back in step 5.4.5.4, the path and issuer copy are: _:b1_:b1<c484f98e6cbf9e21f287433c8b1caa7f1486fd61d84ab220a494bf8184751b8c> and {e0: b0, e1: b1}.

This results in the chosen path _:b1_:b1<c484f98e6cbf9e21f287433c8b1caa7f1486fd61d84ab220a494bf8184751b8c> and data to hash 3d96946f27fc34a78e8d067135a1cb1b77083aebc4b2c6cbdc536f067242686c_:b1_:b1<c484f98e6cbf9e21f287433c8b1caa7f1486fd61d84ab220a494bf8184751b8c>.

temporary issuer mappings

`original identifier`	`temporary identifier`
`e0`	`b0`
`e1`	`b1`

hash 39d609fcd8236b74c70744f492cd2baaf0a55765b380ff9e0811ce23e2f409d7

The Hash N-Degree Quads algorithm is next called for the same hash 24da9a4406b4e66dffa10ad3d4d6dddc388fbf193bb124e865158ef419893957 but this time with blank node e2 related to the nquad _:e2 :p1 _:e3 ..

`blank node`	`temporary identifier`
`e2`	`b0`

Step 3 calls the Hash Related Blank Node algorithm for each position in this quad:

Table 25 Hash for `e3` related to `e2`
`related`	`quads`	`position`	`hash`
`e3`	`z :p1 a .` `a :p2 "Foo" .`	`o`	`3d96946f27...067242686c`

which results in the following H_n:

Table 26 Blank node list for hash
`related hash`	`blank node list`
`3d96946f27fc34a78e8d067135a1cb1b77083aebc4b2c6cbdc536f067242686c`	`[ e3 ]`

Step 5 iterates over each related hash which is a key in H_n, which can have one or more related blank nodes.

temporary issuer mappings

hash fbc300de5afafd97a4b9ee1e72b57754dcdcb7ebb724789ac6a94a5b82a48d30

Step 5.3 back in the Canonicalization Algorithm creates canonical identifiers for the temporary identifiers just issued:

`blank node`	`canonical identifier`
`e0`	`c14n0`
`e1`	`c14n1`
`e2`	`c14n2`
`e3`	`c14n3`

Next, back in step 5.1, now using hash a994e40b576809985bc0f389308cd9d552fd7c89d028c163848a6b2d33a8583a, canonical identifiers have already been created for the two blank nodes e1 and e3, so no further processing is necessary.

Step 6 ends with the canonical issuer containing the following mappings:

`blank node`	`canonical identifier`
`e0`	`c14n0`
`e1`	`c14n1`
`e2`	`c14n2`
`e3`	`c14n3`

This example illustrates another complicated case: nodes that are doubly connected in opposite directions.

Figure 7 An illustration of a graph with back and forth links between nodes.

_:e0 :next _:e1 .
_:e0 :prev _:e1 .
_:e1 :next _:e0 .
_:e1 :prev _:e0 .

The example is not explored in detail, but the execution log found here shows examples of more complicated pathways through the algorithm.
