TDM Reservation Protocol (TDMRep)

Final Community Group Report

This version:
https://www.w3.org/2022/tdmrep/
Latest editor's draft:
https://w3c.github.io/tdm-reservation-protocol/
Editor:
Laurent Le Meur (EDRLab)

Abstract

This specification defines a simple and practical Web protocol, capable of expressing the reservation of rights relative to text & data mining (TDM) applied to lawfully accessible Web content, and to ease the discovery of TDM licensing policies associated with such content.

This initiative is a technical answer to the constraints set by Article 4 of the new European Directive on copyright and related rights in the Digital Single Market.

Status of This Document

This section is non-normative.

This specification was published by the Text and Data Mining Reservation Protocol Community Group. It is not a W3C Standard nor is it on the W3C Standards Track. Please note that under the W3C Community Final Specification Agreement (FSA) other conditions apply. Learn more about W3C Community and Business Groups.

GitHub Issues are preferred for discussion of this specification. Alternatively, you can send comments to our mailing list. Please send them to public-tdmrep@w3.org (subscribe, archives).

1. Introduction

This section is non-normative.

In addition to their significance in the context of scientific research, text and data mining techniques (TDM) are widely used both by private and public entities to analyse large amounts of data (including copyright protected content like text, images, video etc.) in different areas of life and for various purposes, including for government services, complex business decisions and the development of new applications or technologies.

In a digital environment, TDM usage of copyright protected works can be subject to different terms and conditions, depending on the legal framework. In generic terms, an act of reproduction is required before TDM can be applied to content accessible on the Web; international laws stipulate that such an act of reproduction is subject to authorization by rightsholders. So far, analyzing and processing the terms and conditions of a website, contacting rightsholders, seeking permission and concluding licensing agreements have required time and resources.

In such a context, a machine-readable solution which streamlines the communication of TDM rights and licenses available for online copyrighted content is necessary to facilitate the development of TDM applications and reduce the risks of legal uncertainty for TDM actors. Such a solution, which shall rely on a consensus between rightsholders and TDM actors, will optimize the capacity of TDM actors to lawfully access and process useful content at large scale.

The Directive on copyright and related rights in the Digital Single Market or EU Directive 2019/790, better known as the "DSM Directive" (DSM meaning Digital Single Market), introduces two exceptions or limitations to the rights of rightsholders on lawfully accessible content, for reproductions and extractions for the purposes of TDM:

In its Article 3, a mandatory exception for research organisations and cultural heritage institutions which carry out TDM for the purposes of scientific research.

In its Article 4, an exception for any organisation willing to carry out TDM for any purpose other than scientific research, including commercial purposes, which applies on the condition that the use of content for TDM has not been expressly reserved by its rightsholders in an appropriate manner, such as machine-readable means.

These TDM exceptions apply to TDM usage in the European Union in relation to content from European and foreign rightsholders. Outside of the EU, where the DSM legislation does not apply, these exceptions do not apply: the exclusive rights of rightsholders to authorize acts of reproduction are maintained. In such cases, no TDM can be performed without the explicit authorisation of these rightsholders: in these countries, the absence of a reservation of rights by rightsholders cannot be considered as an implicit authorization to reproduce copyrighted content for TDM purposes, and advocating fair use or a similar rule is legally uncertain, as these actions are judged on a case-by-case basis.

The “opt-out” mechanism introduced by the DSM Directive is therefore a real opportunity for TDM actors and publishers across countries to define a machine-readable technique able to express not only if TDM rights on specific Web content are reserved or not, but also how rightsholders can be contacted and which licenses are available, if any. This is a tremendous help for TDM actors from all countries looking for legal certainty.

2. Terminology

This section is non-normative.

Rightsholder

Person or organization that owns the legal rights to something, in our case Web resources (source: Wiktionary).

Publisher

Person or organization that makes Web resources available to the public.

TDM Actor

Person or organization practicing TDM (on Web resources in our case).

TDM Agent

Software accessing Web resources for TDM purposes.

TDM License

Description of the terms and conditions by which a TDM Actor can process a given Web resource.

TDM Policy

Description of the kind of TDM Licenses a TDM Actor may obtain from a Rightsholder.

TDM Rights

Rights to process a Web resource via TDM techniques, for a certain purpose (e.g. scientific research, commercial).

Web Resource

Identifiable thing available on the Web (source: Wikipedia). Web resources are located using URLs.

Web page

Web resource formatted in HTML.

3. Conformance

This section is non-normative.

As well as sections marked as non-normative, all authoring guidelines, diagrams, examples, and notes in this specification are non-normative. Everything else in this specification is normative.

The key words MAY, MUST, SHOULD, and SHOULD NOT in this document are to be interpreted as described in BCP 14 [RFC2119] [RFC8174] when, and only when, they appear in all capitals, as shown here.

4. Requirements

This section is non-normative.

The technical specification shall:

5. Declaring the reservation of TDM Rights

The goal of this protocol is to allow a rightsholder to declare his choice regarding text & data mining of Web resources he controls, thereby allowing recipients of that declaration to adjust their scraping behavior, or to reach a separate agreement with the rightsholder that satisfies all parties.

Such a preference is expressed via two complementary properties, tdm-reservation and tdm-policy.

5.1 tdm-reservation

tdm-reservation is an integer.

tdm-reservation meaning
1 TDM rights are reserved. If a TDM Policy is set, TDM Agents MAY use it to get information on how they can acquire from the rightsholder an authorization to mine the content.
0 TDM rights are not reserved. TDM agents can mine the content for TDM purposes without having to contact the rightsholder.

Other values are considered protocol errors. In such a case the TDM Agents MUST consider that tdm-reservation is unset.
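
As an illustration of the rule above, the following Python sketch normalises a raw tdm-reservation value. The function name parse_tdm_reservation is hypothetical, and two points are assumptions rather than requirements of this specification: string forms are accepted (because header fields and meta elements carry text), and JSON booleans are rejected.

```python
from typing import Optional


def parse_tdm_reservation(raw: object) -> Optional[int]:
    """Return 0 or 1 for a valid tdm-reservation value, or None when unset."""
    # Assumption: JSON true/false are not valid, although bool is an int in Python.
    if isinstance(raw, bool):
        return None
    if isinstance(raw, int):
        return raw if raw in (0, 1) else None
    if isinstance(raw, str):
        # Assumption: surrounding whitespace is tolerated in textual values.
        text = raw.strip()
        return int(text) if text in ("0", "1") else None
    # Any other type is a protocol error, so the value is treated as unset.
    return None
```

A return value of None stands for "unset", which keeps protocol errors distinct from an explicit 0.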

Note

The "opt-out" option specified by Article 4 of the DSM Directive is expressed by the use of tdm-reservation with a value of 1.

5.2 tdm-policy

tdm-policy is a URL pointing to a TDM Policy set by the rightsholder.

Note

The presence of tdm-policy when the value of tdm-reservation is 0 is not considered a protocol error. TDM Agents SHOULD NOT process tdm-policy in this case.

A TDM Policy is considered human readable if its content-type is text/html. It is considered machine-readable if its content-type is either application/json or application/ld+json.

Note

Being unable to access or parse a TDM Policy is not considered a protocol error. In such a case, the TDM Agent MUST consider that there is no way to know at this time which conditions would allow it to process the resource.
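
The content-type rule of this section can be sketched in Python as follows. The function name and the three labels it returns are hypothetical; ignoring media type parameters such as charset and comparing case-insensitively are assumptions of this sketch.

```python
def classify_policy_content_type(content_type: str) -> str:
    """Map a Content-Type value to 'human', 'machine' or 'unknown'."""
    # Drop parameters such as "; charset=utf-8" and normalise the case.
    media_type = content_type.split(";", 1)[0].strip().lower()
    if media_type == "text/html":
        return "human"
    if media_type in ("application/json", "application/ld+json"):
        return "machine"
    # Neither form is recognised: the agent cannot learn the conditions here.
    return "unknown"
```

An 'unknown' result is not a protocol error; it only means the agent has no way to know at this time which conditions would apply.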

6. Protocol

This specification provides three complementary techniques for expressing rightsholders' choices. These three techniques correspond to different situations and technical skills a Publisher may have.

6.1 TDM File on the Origin Server

The TDM file on the origin server is a mechanism for declaring site-wide rightsholder's choices in a file hosted on the origin server of the Web content a TDM Agent wishes to mine.

An origin server that receives a valid GET request targeting this resource MUST send either a successful response containing a machine-readable representation of the site-wide rightsholder's choices, as defined below, or a sequence of redirects that leads to such a representation. Failure to provide access to such a representation implies that the origin server does not implement this protocol.

This specification defines a JSON file named tdmrep.json, which MUST be hosted in the /.well-known directory of a Web server.

This file contains an array of JSON objects; each object represents a rule and contains three properties:

To evaluate if the URL of a Web resource is subject to a given pattern, a TDM Agent MUST match the paths inferred from the pattern against the URL. The matching SHOULD be case sensitive. The most specific match found MUST be used. The most specific match is the first in sequence.

If no match is found, a TDM Agent MUST consider that tdm-reservation is unset for the given URL.

If a percent-encoded US-ASCII character is encountered in the URI, it MUST be unencoded prior to comparison, unless it is a reserved character in the URI as defined by RFC3986 or the character is outside the unreserved character range. The match evaluates positively if and only if the end of the path from the rule is reached before a difference in octets is encountered.

6.1.1 Use of regular expressions

There are many variants of regular expressions. In order to simplify the work of TDM Agents, this specification is re-using the specification and wording of the robots.txt draft-koster-rep-00 2.2.3.

TDM Agents MUST allow the following special characters:

Character Description Example
"$" Designates the end of the match pattern. "tdm-policy: /this/path/exactly$"
"*" Designates 0 or more instances of any character. "tdm-policy: /this/*/then"

If TDM Agents match special characters verbatim in the URI, they MUST use "%" encoding. For example:

Pattern URI
/path/foo-%24 https://provider.com/path/foo-$

Note

This URL matching notation is subject to interpretation. For the sake of interoperability, TDM Agents should follow the rules detailed by Google in How Google interprets the robots.txt specification, section "URL matching based on path values".

6.1.2 Examples

This section is non-normative.

In the following example, a rightsholder wants to "opt-out" for every file present on a Web server.

tdmrep.json is therefore simply structured as:

[
  {
  "location": "/",
  "tdm-reservation": 1
  }
]

In the following example, a Web server is hosting three groups of files. The rightsholder of the first group of files (PDF documents) wants to express that TDM rights are reserved on these files with no way to acquire a TDM License. The rightsholder of the second group of files (HTML pages) wants to express that TDM rights are reserved with a TDM Policy. TDM rights are not reserved for all JPEG images contained in the third group.

In this example, the first group is a set of files stored in /directory-a; the second group is stored in /directory-b/html and the third group in /directory-b/images.

tdmrep.json is therefore structured as:

[
  {
  "location": "/directory-a/",
  "tdm-reservation": 1
  },
  {
  "location": "/directory-b/html/",
  "tdm-reservation": 1,
  "tdm-policy":"https://provider.com/policies/policy.json"
  },
  {
  "location": "/directory-b/images/*.jpg",
  "tdm-reservation": 0
  }
]
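
The rules of such a file could be evaluated along the following lines. This is a minimal Python sketch, not a reference implementation: pattern_to_regex and find_rule are hypothetical names, the pattern is translated into a regular expression as one possible strategy, and the percent-encoding normalisation described earlier is deliberately left out. The "first match in sequence" rule is applied as stated in this section.

```python
import re
from typing import List, Optional
from urllib.parse import urlsplit


def pattern_to_regex(pattern: str) -> "re.Pattern":
    """Translate a location pattern into a regular expression."""
    # A trailing "$" designates the end of the match pattern.
    anchored = pattern.endswith("$")
    body = pattern[:-1] if anchored else pattern
    # "*" designates 0 or more characters; every other character is literal.
    regex = ".*".join(re.escape(part) for part in body.split("*"))
    return re.compile(regex + (r"\Z" if anchored else ""))


def find_rule(rules: List[dict], url: str) -> Optional[dict]:
    """Return the first rule whose location matches the URL path, else None."""
    path = urlsplit(url).path or "/"
    for rule in rules:
        # re.match anchors at the start of the path and is case sensitive.
        if pattern_to_regex(rule["location"]).match(path):
            return rule
    # No match: tdm-reservation is considered unset for this URL.
    return None
```

With the three rules of the second example, a URL under /directory-b/images/ ending in .png matches no rule, so tdm-reservation stays unset for it.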

6.2 TDM Header Field in HTTP Responses

The TDM Header Field is a mechanism for declaring a choice in an HTTP response ([RFC7230]).

In the following example, the rightsholder expresses that TDM rights are reserved on these files with no way to acquire a TDM License. The server returns a tdm-reservation header field with value 1.

HTTP/1.1 200 OK
Date: Wed, 14 Jul 2021 12:07:48 GMT
Content-type: image/jpeg
tdm-reservation: 1

In the following example, a TDM License may be acquired. The server returns a tdm-reservation header field with value 1 and a tdm-policy header field pointing to a TDM Policy.

HTTP/1.1 200 OK
Date: Wed, 14 Jul 2021 12:07:48 GMT
Content-type: text/html
tdm-reservation: 1
tdm-policy: https://provider.com/policies/policy.json
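
On the agent side, reading these two header fields might look like the Python sketch below. The function name read_tdm_headers is hypothetical, and the headers are assumed to be available as a simple name-to-value mapping. Dropping tdm-policy unless tdm-reservation is 1 follows the SHOULD NOT of section 5.2 for the value 0 and is an assumption of this sketch for the unset case.

```python
from typing import Mapping, Optional, Tuple


def read_tdm_headers(
    headers: Mapping[str, str],
) -> Tuple[Optional[int], Optional[str]]:
    """Extract (tdm-reservation, tdm-policy) from HTTP response headers."""
    # HTTP field names are case-insensitive, so normalise them first.
    lowered = {name.lower(): value.strip() for name, value in headers.items()}
    raw = lowered.get("tdm-reservation")
    # Any value other than "0" or "1" is a protocol error: treat it as unset.
    reservation = int(raw) if raw in ("0", "1") else None
    # The policy is only kept when TDM rights are reserved.
    policy = lowered.get("tdm-policy") if reservation == 1 else None
    return reservation, policy
```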

6.3 TDM Metadata in HTML Content

TDM Metadata in HTML Content is a mechanism for declaring a choice embedded in HTML content.

tdm-reservation is expressed as the value of the name attribute of a meta element, and tdm-policy as the value of the name attribute of a second meta element. In both cases, the value of the property is carried by the content attribute of the element.

In the following example, an HTML document is associated with a TDM Policy:

<!DOCTYPE html>
<html lang="en">
  <head>
    <meta charset="utf-8">
    <meta name="tdm-reservation" content="1">
    <meta name="tdm-policy" content="https://provider.com/policies/policy.json">
    <title>Document title</title>
  </head>
  <body>
    ...
    <!-- body content -->
    ...
  </body>
</html>
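
A TDM Agent could extract these meta elements with Python's standard html.parser module, as in the sketch below. The class and function names are hypothetical, and keeping only the first occurrence of each property is an assumption; this specification does not say how duplicates are to be handled.

```python
from html.parser import HTMLParser
from typing import Dict


class TdmMetaParser(HTMLParser):
    """Collect tdm-reservation and tdm-policy from meta elements."""

    def __init__(self) -> None:
        super().__init__()
        self.found: Dict[str, str] = {}

    def handle_starttag(self, tag, attrs):
        if tag != "meta":
            return
        attributes = dict(attrs)
        name = (attributes.get("name") or "").lower()
        content = attributes.get("content")
        # Assumption: the first occurrence of each property wins.
        if name in ("tdm-reservation", "tdm-policy") and content is not None:
            self.found.setdefault(name, content.strip())


def read_tdm_meta(html_text: str) -> Dict[str, str]:
    """Return the TDM properties declared in an HTML document, as strings."""
    parser = TdmMetaParser()
    parser.feed(html_text)
    parser.close()
    return parser.found
```

The returned values are raw strings; validating tdm-reservation (0 or 1, anything else unset) remains a separate step.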

6.4 Processing priority

Rightsholders SHOULD only use one of the techniques specified in the previous section. But in case a Web server is badly configured, TDM Agents need a way to unambiguously define rightsholder's choices. This is why the following processing rules are specified.

A TDM Agent MUST check the presence of a TDM file on the origin server before it starts scraping the content of the Web server.

Note

A TDM Agent will keep in cache the content of the TDM file, usually as an in-memory object, so that it can check its rules against every Web resource it fetches from the origin server.
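
One way to organise such a cache is sketched below in Python. All names are hypothetical. The fetch callable is an assumption standing in for the agent's HTTP client: it is expected to follow redirects and return the response body, or None when no successful response was obtained. Treating an unparsable file like a missing one is also an assumption of this sketch.

```python
import json
from typing import Callable, Dict, List, Optional
from urllib.parse import urlsplit

WELL_KNOWN_PATH = "/.well-known/tdmrep.json"


def tdm_file_url(resource_url: str) -> str:
    """Build the tdmrep.json URL for the origin of a resource URL."""
    parts = urlsplit(resource_url)
    return f"{parts.scheme}://{parts.netloc}{WELL_KNOWN_PATH}"


class TdmFileCache:
    """Fetch each origin's TDM file once and keep the parsed rules in memory."""

    def __init__(self, fetch: Callable[[str], Optional[str]]) -> None:
        # Hypothetical callable: returns the body for a URL, or None on failure.
        self._fetch = fetch
        self._rules: Dict[str, Optional[List[dict]]] = {}

    def rules_for(self, resource_url: str) -> Optional[List[dict]]:
        """Return the cached rules for the origin, or None if unavailable."""
        url = tdm_file_url(resource_url)
        if url not in self._rules:
            self._rules[url] = self._load(url)
        return self._rules[url]

    def _load(self, url: str) -> Optional[List[dict]]:
        body = self._fetch(url)
        if body is None:
            return None  # the origin server does not implement the protocol
        try:
            data = json.loads(body)
        except ValueError:
            return None  # assumption: unparsable content is treated as absent
        return data if isinstance(data, list) else None
```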

A TDM Agent MUST check the presence of a TDM Header Field in every HTTP header it gets from fetching a resource on the Web server. The values of tdm-reservation and tdm-policy found in this header supersede any value inferred from a TDM file on the origin server.

A TDM Agent MUST check the presence of TDM Metadata found in HTML content fetched from the Web server. The values of tdm-reservation and tdm-policy found here supersede previous values.
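
The processing order above can be sketched in Python as follows. The function name resolve_choice and the dictionary shape are hypothetical. One point is an interpretation rather than something this specification states: a source that declares a valid tdm-reservation is taken as a whole, so its tdm-policy (or the absence of one) replaces the policy inherited from an earlier source.

```python
from typing import Dict, Optional

Choice = Dict[str, object]


def resolve_choice(
    from_file: Optional[Choice],
    from_header: Optional[Choice],
    from_meta: Optional[Choice],
) -> Choice:
    """Combine the three sources; later sources supersede earlier ones."""
    resolved: Choice = {"tdm-reservation": None, "tdm-policy": None}
    # Order matters: TDM file, then HTTP header field, then HTML metadata.
    for source in (from_file, from_header, from_meta):
        if source and source.get("tdm-reservation") in (0, 1):
            # Assumption: the source replaces both properties together.
            resolved = {
                "tdm-reservation": source["tdm-reservation"],
                "tdm-policy": source.get("tdm-policy"),
            }
    return resolved
```

With this reading, a header field carrying only tdm-reservation: 1 removes a policy that was inferred from the TDM file, which matches the "no way to acquire a TDM License" intent of the first header example.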

7. Expressing a TDM Policy

Policies are machine-readable structures referenced from the tdm-policy property defined in the specification. They provide ways for TDM Actors to contact content rightsholders and they offer details about available TDM licenses. Thus, they facilitate the acquisition of TDM licenses from rightsholders by TDM Actors.

The format of policies defined in this specification is a profile of the Open Digital Rights Language 2.2 [ODRL].

Note

This specification assumes basic knowledge of the ODRL model and vocabulary.

7.1 Specification of the TDM Profile

7.1.1 JSON-LD context of a Policy

The @context of a Policy MUST be "http://www.w3.org/ns/odrl.jsonld".

A tdm alias MUST be added to the context if "tdm" prefixed properties are used in the Policy, and its value MUST be http://www.w3.org/ns/tdmrep#.

Note

ODRL Policies also require an identifier, expressed as a URI.

{
  "@context": [
    "http://www.w3.org/ns/odrl.jsonld",
    {"tdm": "http://www.w3.org/ns/tdmrep#"}
  ],
  "uid": "https://provider.com/policies/policy-a",
  ...
}

7.1.2 Type of a Policy

The @type of a Policy MUST have Offer as value.

{
  "@context": [
    "http://www.w3.org/ns/odrl.jsonld",
    {"tdm": "http://www.w3.org/ns/tdmrep#"}
  ],
  "uid": "https://provider.com/policies/policy-a",
  "@type": "Offer",
  ...
}

7.1.3 Identification of the profile

A Policy MUST have a profile property with value http://www.w3.org/ns/tdmrep

{
  "@context": [
    "http://www.w3.org/ns/odrl.jsonld",
    {"tdm": "http://www.w3.org/ns/tdmrep#"}
  ],
  "uid": "https://provider.com/policies/policy-a",
  "@type": "Offer",
  "profile": "http://www.w3.org/ns/tdmrep",
  ...
}
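
The constraints on @context, @type, profile and uid described so far can be checked with a short Python sketch such as the one below. The function name and the wording of the messages are hypothetical. The sketch works on the compact JSON form shown in the examples and does not perform JSON-LD expansion; it accepts @context either as a single string or as an array, which is an assumption based on those examples.

```python
from typing import List

ODRL_CONTEXT = "http://www.w3.org/ns/odrl.jsonld"
TDM_NAMESPACE = "http://www.w3.org/ns/tdmrep#"
TDM_PROFILE = "http://www.w3.org/ns/tdmrep"


def check_policy_header(policy: dict) -> List[str]:
    """Return the problems found in @context, @type, profile and uid."""
    problems: List[str] = []
    context = policy.get("@context")
    entries = context if isinstance(context, list) else [context]
    if ODRL_CONTEXT not in entries:
        problems.append("@context must include " + ODRL_CONTEXT)
    for entry in entries:
        # A tdm alias, when present, must map to the TDMRep namespace.
        if isinstance(entry, dict) and "tdm" in entry:
            if entry["tdm"] != TDM_NAMESPACE:
                problems.append("tdm alias must map to " + TDM_NAMESPACE)
    if policy.get("@type") != "Offer":
        problems.append("@type must be Offer")
    if policy.get("profile") != TDM_PROFILE:
        problems.append("profile must be " + TDM_PROFILE)
    if not policy.get("uid"):
        problems.append("uid is missing")
    return problems
```

An empty list means that none of these checks failed; it says nothing about the assigner or permission properties.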

7.1.4 Assigner

A Policy MUST contain one assigner property. The assigner property of the Offer MUST use a limited number of vCard properties ([vcard-rdf]):

  • "fn": full name of the rightsholder, as a string
  • "nickname": acronym of the rightsholder, as a string
  • "hasEmail": email address of the rightsholder, as a string starting with "mailto:"
  • "hasAddress": postal address of the rightsholder, as an object containing "vcard:street-address", "vcard:postal-code", "vcard:locality" and "vcard:country-name" as a set of strings
  • "hasTelephone": telephone of the rightsholder, as a string starting with "tel:"
  • "hasURL": URL of a Web page containing information about TDM License acquisition.

{
  "@context": [
    "http://www.w3.org/ns/odrl.jsonld",
    {"tdm": "http://www.w3.org/ns/tdmrep#"}
  ],
  "uid": "https://provider.com/policies/policy-a",
  "@type": "Offer",
  "profile": "http://www.w3.org/ns/tdmrep",
  "assigner": {
    "uid": "https://provider.com",
    "vcard:fn": "Provider Name",
    "vcard:nickname": "PRV",
    "vcard:hasEmail": "mailto:contact@provider.com",
    "vcard:hasAddress": {
      "vcard:street-address": "111 Street Address",
      "vcard:postal-code": "5555",
      "vcard:locality": "Espérance",
      "vcard:country-name": "France"
    },
    "vcard:hasTelephone": "tel:+61755555555",
    "vcard:hasURL": "https://provider.com/tdm/licensing.html" 
  },
  ...
}

7.1.5 Permissions

A Policy MUST contain one permission property. It SHOULD contain no obligation nor prohibition property.

7.1.5.1 Expressing the target and action of a permission

The mandatory target of a permission, which is expressed via the target property, MUST be a URI identifying the collection of resources involved in the policy.

Note

TDM Agents will use this property in their messages to publishers, to identify a collection of resources they wish to mine. This identifier shall therefore properly identify a specific collection of resources and be well known to their publisher.

The target URL is not necessarily dereferenceable. Accessing this URL may end with an HTTP error (403 in many cases): this is not a processing error.

The mandatory action of a permission, which is expressed via the action property, MUST be the following:

7.1.5.1.1 tdm:mine

Definition: analyse, via automated analytical technique, text and data in digital form in order to generate information which includes but is not limited to patterns, trends and correlations.

Label: Text & Data Mine

Identifier: http://www.w3.org/ns/tdmrep#mine

Included in: http://www.w3.org/ns/odrl/2/use

{
  "@context": [
    "http://www.w3.org/ns/odrl.jsonld",
    {"tdm": "http://www.w3.org/ns/tdmrep#"}
  ],
  "@type": "Offer",
  "profile": "http://www.w3.org/ns/tdmrep",
  "uid": "https://provider.com/policies/policy-a",
  ...
  "permission": [{
    "target": "https://provider.com/research-papers",
    "action": "tdm:mine"
    }
  ]
}

7.1.5.2 Expressing the duty to contact the rightsholder before getting a permission

The duty to obtain verifiable consent before performing TDM on content is expressed by adding a duty property to the Permission. The duty is expressed as an action property with an obtainConsent value.

{
  "@context": [
    "http://www.w3.org/ns/odrl.jsonld",
    {"tdm": "http://www.w3.org/ns/tdmrep#"}
  ],
  "@type": "Offer",
  "profile": "http://www.w3.org/ns/tdmrep",
  "uid": "https://provider.com/policies/policy-a",
  ...
  "permission": [{
      "target": "https://provider.com/research-papers",
      "action": "tdm:mine",
      "duty": [{
        "action": "obtainConsent"
        }
      ]
    }
  ]
}

7.1.5.3 Expressing the duty to compensate financially the rightsholder

The duty to compensate financially the mining of content is expressed by adding a duty property to the Permission. The duty is expressed as an action property with a compensate value.

{
  "@context": [
    "http://www.w3.org/ns/odrl.jsonld",
    {"tdm": "http://www.w3.org/ns/tdmrep#"}
  ],
  "@type": "Offer",
  "profile": "http://www.w3.org/ns/tdmrep",
  "uid": "https://provider.com/policies/policy-a",
  ...,
  "permission": [{
      "target": "https://provider.com/research-papers",
      "action": "tdm:mine",
      "duty": [{
        "action": "compensate"
        }
      ]
    }
  ]
}

7.1.5.4 Expressing a constraint on the type of usage

The permission to mine content for a given type of usage only is expressed by adding a constraint property to the Permission. The usage type is expressed as a purpose value on the leftOperand property; the operator property takes eq as its value, and the rightOperand property takes one of the following values:

7.1.5.4.1 tdm:research

Definition: designates research purposes.

Label: Research purpose

Identifier: http://www.w3.org/ns/tdmrep#research

Included in: http://www.w3.org/ns/odrl/2/rightOperand

7.1.5.4.2 tdm:non-research

Definition: designates non-research purposes, including commercial ones.

Label: Non-research purpose

Identifier: http://www.w3.org/ns/tdmrep#non-research

Included in: http://www.w3.org/ns/odrl/2/rightOperand

{
  "@context": [
    "http://www.w3.org/ns/odrl.jsonld",
    {"tdm": "http://www.w3.org/ns/tdmrep#"}
  ],
  "@type": "Offer",
  "profile": "http://www.w3.org/ns/tdmrep",
  "uid": "https://provider.com/policies/policy-a",
  ...
  "permission": [{
      "target": "https://provider.com/research-papers",
      "action": "tdm:mine",
      "constraint": [{
        "leftOperand": "purpose",
        "operator": "eq",
        "rightOperand": "tdm:research"
        }
      ]
    }
  ]
}
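
To show how a TDM Agent might read the duties and the purpose constraint of a permission, here is a minimal Python sketch. The function name and the keys of the returned dictionary are hypothetical. The sketch assumes the compact JSON form used in the examples of this section, with duty and constraint given as arrays, and does not perform JSON-LD expansion.

```python
from typing import Dict, Optional


def summarize_permission(permission: dict) -> Dict[str, object]:
    """Summarise the duties and the purpose constraint of one permission."""
    duties = [duty.get("action") for duty in permission.get("duty", [])]
    purpose: Optional[str] = None
    for constraint in permission.get("constraint", []):
        # Only "purpose eq <value>" constraints are defined by this profile.
        if (constraint.get("leftOperand") == "purpose"
                and constraint.get("operator") == "eq"):
            purpose = constraint.get("rightOperand")
    return {
        "target": permission.get("target"),
        "is_tdm": permission.get("action") == "tdm:mine",
        "needs_consent": "obtainConsent" in duties,
        "needs_payment": "compensate" in duties,
        "purpose": purpose,
    }
```

A purpose of None means that the permission carries no constraint on the type of usage.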

7.2 Full Examples

This section is non-normative.

In this example, the rightsholder requires TDM Actors to contact them in order to obtain licensing rights. The rightsholder provides detailed contact information using the W3C vCard Ontology.

Important note: TDM Actors that benefit from Article 3 of the DSM Directive do not have to comply with this requirement.

{
  "@context": [
    "http://www.w3.org/ns/odrl.jsonld",
    {"tdm": "http://www.w3.org/ns/tdmrep#"}
  ],
  "@type": "Offer",
  "profile": "http://www.w3.org/ns/tdmrep",
  "uid": "https://provider.com/policies/policy-a",
  "assigner": {
    "uid": "https://provider.com",
    "vcard:fn": "Provider",
    "vcard:nickname": "PRV",
    "vcard:hasEmail": "mailto:contact@provider.com",
    "vcard:hasAddress": {
      "vcard:street-address": "111 Street Address",
      "vcard:postal-code": "5555",
      "vcard:locality": "Espérance",
      "vcard:country-name": "France"
    },
    "vcard:hasTelephone": "tel:+61755555555",
    "vcard:hasURL": "https://provider.com/tdm/licensing.html"
  },
  "permission": [{
      "target": "https://provider.com/research-papers",
      "action": "tdm:mine",
      "duty": [{
        "action": "obtainConsent"
        }
      ]
    }
  ]
}

In this example, the rightsholder expresses that non-research Actors from any country can mine its content if they agree to pay a fee.

{
  "@context": [
    "http://www.w3.org/ns/odrl.jsonld",
    {"tdm": "http://www.w3.org/ns/tdmrep#"}
  ],
  "@type": "Offer",
  "profile": "http://www.w3.org/ns/tdmrep",
  "uid": "https://provider.com/policies/policy-a",
  "assigner": {
    "uid": "https://provider.com",
    "vcard:fn": "Provider",
    "vcard:hasEmail": "mailto:contact@provider.com"
  },
  "permission": [{
      "target": "https://provider.com/research-papers",
      "action": "tdm:mine",
      "duty": [{
        "action": "compensate"
        }
      ],
      "constraint": [{
        "leftOperand": "purpose",
        "operator": "eq",
        "rightOperand": "tdm:non-research"
        }
      ]
    }
  ]
}

A. References

A.1 Normative references

[ODRL]
ODRL Information Model 2.2. Renato Iannella; Serena Villata. W3C. 15 February 2018. W3C Recommendation. URL: https://www.w3.org/TR/odrl-model/
[RFC2119]
Key words for use in RFCs to Indicate Requirement Levels. S. Bradner. IETF. March 1997. Best Current Practice. URL: https://www.rfc-editor.org/rfc/rfc2119
[RFC7230]
Hypertext Transfer Protocol (HTTP/1.1): Message Syntax and Routing. R. Fielding, Ed.; J. Reschke, Ed. IETF. June 2014. Proposed Standard. URL: https://httpwg.org/specs/rfc7230.html
[RFC8174]
Ambiguity of Uppercase vs Lowercase in RFC 2119 Key Words. B. Leiba. IETF. May 2017. Best Current Practice. URL: https://www.rfc-editor.org/rfc/rfc8174
[vcard-rdf]
vCard Ontology - for describing People and Organizations. Renato Iannella; James McKinney. W3C. 22 May 2014. W3C Working Group Note. URL: https://www.w3.org/TR/vcard-rdf/