Text and Data Mining Reservation Protocol Community Group

The goal of this Group is to facilitate Text and Data Mining (TDM) Reservation Protocol in Europe and elsewhere, by specifying a simple and practical machine-readable solution, capable of expressing the reservation of TDM rights - following the rules set by the new European DSM Directive / Art.4 - and the availability of machine-readable licenses for TDM actors.

Note: Community Groups are proposed and run by the community. Although W3C hosts these conversations, the groups do not necessarily represent the views of the W3C Membership or staff.

Minutes, September 7th, 2021

Status:

During its early meetings, the group worked on what TDM means in practice, the vocabulary to be used during the project, the goals and requirements for a technical solution, created several use cases and compiled a set of past and existing initiatives with a similar scope. The group then agreed on three alternative technical solutions for expressing the reservation of TDM rights: one based on http headers, another based on a file hosted on the origin server, and a third based on html meta tags. These three solutions correspond to different situations and technical skills. 

In May, the group defined a machine-readable TDM policy which details how a rightsholder can be contacted and the conditions under which a TDM license can be acquired. It was agreed that TDM Policies would be defined as a profile of ODRL 2. In June, a complete draft of the specification (https://w3c.github.io/tdm-reservation-protocol/spec/) was written, which includes details like the priority by which the 3 techniques must be processed by TDM Agents and how TDM Agents should react to protocol errors.

From July to August, details of the specification were discussed and the group requested advice and prototyping from content providers and TDM Actors.

Participants: Giulia, Robin, Claudio, Fred, Laurent

Issues #23 and #20: these issues are strongly related to possible protocol errors (made by content providers) and how TDM Agents should react to such errors. As a general mechanism, when a TDM Agent detects a protocol error, the fallback is to consider that tdm-reservation is unset. tdm-reservation=2 with no tdm-policy property is a kind of protocol error, but applying the general rule in this case was controversial.

The decision of the group is to remove this value 2, which settles the controversy and simplifies the spec at the same time.

  • a TDM Agent which sees tdm-reservation = 1 and NO tdm-policy property will assume that there is NO way to get rights to use the content
  • a TDM Agent which sees tdm-reservation = 1 and a tdm-policy property set will assume that there is a way to get rights to use the content and will decide whether to try to get more information or stop there.
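
The decision above, combined with the general fallback rule for protocol errors, can be sketched as follows. This is only an illustrative sketch, not text from the specification: the function name and the returned labels are hypothetical, and treating the value "0" as "TDM rights are not reserved" is an assumption based on the three states discussed in earlier meetings.

```python
from typing import Mapping


def interpret_tdm_properties(props: Mapping[str, str]) -> str:
    """Classify the TDM properties seen by a TDM Agent (hypothetical helper).

    Returns one of: "unset", "not-reserved", "reserved", "reserved-with-policy".
    """
    raw = props.get("tdm-reservation")
    if raw is None:
        # The rightsholder provided no information.
        return "unset"

    value = raw.strip()
    if value == "0":
        # Assumption: "0" means that TDM rights are not reserved.
        return "not-reserved"
    if value == "1":
        # A policy is optional; its presence signals that rights may be acquired.
        policy = props.get("tdm-policy", "").strip()
        return "reserved-with-policy" if policy else "reserved"

    # Any other value (including the removed value "2") is a protocol error:
    # the fallback is to consider that tdm-reservation is unset.
    return "unset"
```

With this reading, an agent that gets "reserved" stops there, while an agent that gets "reserved-with-policy" is free to decide whether to fetch the policy.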

The issues will be closed as soon as the modification of the specification is approved.

Minutes, July 27th, 2021

Status:

During its early meetings, the group worked on what TDM means in practice, the vocabulary to be used during the project, the goals and requirements for a technical solution, created several use cases and compiled a set of past and existing initiatives with a similar scope.

The group then agreed on three alternative technical solutions for expressing the reservation of TDM rights: one based on http headers, another based on a file hosted on the origin server, and a third based on html meta tags. These three solutions correspond to different situations and technical skills. 

In May, the group defined a machine readable TDM policy which details how a rightsholder can be contacted and conditions in which a TDM license can be acquired. It was agreed that TDM Policies would be defined as a profile of ODRL 2.

In June, a complete draft of the specification (https://w3c.github.io/tdm-reservation-protocol/spec/) was written, which includes details like the priority by which the 3 techniques must be processed by TDM Agents.

Participants: Giulia, Robin, Claudio, Laurent

Agenda: discuss specification details

Issue #18: the order in which TDM Agents must check the 3 possible techniques has been integrated into the spec. The participants in the call are comfortable with the proposal, but it is important that other participants check this in detail.

Issue #20 is about tdm property values in case of protocol errors. The draft specification considers that if there is a protocol error, tdm-reservation is considered unset. In this case the TDM Agent will act as if the rightsholder had not provided any information. This is especially true if tdm-reservation = 2 but tdm-profile is not set. 

tdm-reservation = 2 indicates that “tdm rights are reserved and a policy is available”. If we consider that a missing (or unreachable) tdm-profile is not a protocol error, then it would be simpler to remove this value 2 and keep only value 1 with an optional tdm-profile. In such a case, when the policy is not set or unreachable, the TDM Actor must simply not process the content during this round of scraping. Issue #23 was opened to discuss this aspect.

There will be no call in August, therefore please use Github issues to discuss the draft specification, which will be updated at least once during August.

The next call of the group is planned for September 7th, at the usual time (3 pm UTC), via the usual Zoom (ask the chairs for details if needed).

The draft specification is ready

A complete version of the draft specification is now available at https://w3c.github.io/tdm-reservation-protocol/spec/
Please review it and open Github issues or comment on existing ones as you see fit.
If some of you have started developing prototypes, please check this version of the spec and let us know if important information is missing.
Proposal: hold a call on Tuesday July 27th, at 3 pm UTC (= usual time), to discuss whether the draft needs enhancements for the completion of prototypes.

Minutes, June 1st, 2021

Status:

During its early meetings, the group worked on what TDM means in practice, the vocabulary to be used during the project, the goals and requirements for a technical solution, created several use cases and compiled a set of past and existing initiatives with a similar scope.

Then, the group agreed on three alternative technical solutions for expressing the reservation of TDM rights: one based on http headers, another based on a file hosted on the origin server, and a third based on html meta tags. These three solutions correspond to different situations and technical skills. The priority in which these techniques should be processed by TDM Agents is still to be decided.

During recent meetings, the group defined a machine readable TDM policy which details how a rightsholder can be contacted and conditions in which a TDM license can be acquired. It has been agreed by the group that TDM Policies will be defined as a profile of ODRL 2 and a detailed proposal has been written.

Participants: Tzviya, Jean-Baptiste, Robin, Giulia, Laurent

Agenda: discuss open issues

Issue #15, Should the vcard prefix be explicitly declared?
It is closed, as it has been tested that re-defining the vcard prefix in Policy instances is not useful.

Issue #13, Which domain name for the ODRL profile used for TDM Policies?
The W3C management accepts the use of the W3C domain name for identifying the properties defined in the Policy specification. We will now have to work on the details of which URL is chosen.

Issue #14, Should we merge the file on the origin server with the ODRL TDM Policy?
The current comments do not allow us to reach a consensus. During the discussion, it appears that the participants in the call are in favor of keeping both structures separate, as this keeps a clean separation between the core deliverable of the group (related to Article 4 of the DSM Directive) and the optional, advanced feature (offering TDM Policies). The final choice will be made during the next meeting.

Issue #16, Property names and values of TDM-a and TDM-b
The beauty contest is open. Several ideas are exchanged during the call; we’ll make a choice during the next meeting.

Next steps:

The chairs will now start writing a detailed specification, using the W3C document style and tools (ReSpec). The drafts will be discussed during the summer and we should be able to finalize the specification in October 2021.
As soon as property names are chosen, interested parties can start prototyping, by setting TDM properties on one side and by developing test scripts on the other side. We hope to be able to demo the result by October as well.

Minutes, May 4th, 2021

Status: 

In previous meetings, the group has worked on what TDM means in practice, the vocabulary to be used during the project, the goals and requirements for a technical solution, the creation of several use cases. They have also compiled a set of past and existing initiatives with a similar scope. 

Then, the group discussed how rightsholders can declare to TDM Agents the reservation of TDM Rights. Three different technical solutions were selected: one based on http headers, another based on a file hosted on the origin server, and a third based on html meta tags. These three solutions correspond to different situations and technical skills. The priority in which these techniques should be processed by TDM Agents is still to be decided.

Using one of these techniques, a rightsholder can express either that “TDM rights are not reserved”, “TDM rights are reserved” or “TDM rights are reserved but a TDM license can be acquired”. 

In the latter case, it is interesting for both parties (rightsholder and TDM Actor) to create a machine-readable TDM policy which details under which conditions a TDM license can be acquired.

The group is therefore now defining the model of such TDM policy.

Participants: Aziz Ndao, Tzviya, Leonard, Jean-Baptiste, Giulia, Laurent, Carlo Lavizzari, Mathilde

Agenda: discuss the draft of TDM Policy properties 

The TDM Policy is currently drafted in a shared GDoc. No serialization format has been chosen yet, but ODRL 2 is the best candidate so far.

The discussion focuses on some properties: 

  • Content identifier.
    • STM is using DOI
    • We may decide that it takes the form of a URI, depending on the serialization format. 
    • But using this content id forbids using the same policy for N resources: this is a big issue. 
  • Rightsholder name & contact info 
    • We could add a rightsholder identifier.
  • Rights reservation (values are no, yes, yes + a TDM policy)
    • if a TDM agent has reached this policy, this is because the rights reservation value is “yes”. This value will always be “yes”, therefore its use is not obvious. 
  • “where” property: this is initially meant to express “EU” vs “non-EU”. 
    • The EU Directive already sets rules about TDM policies in the EU for research purposes, therefore no TDM policy defined by a rightsholder is applicable to “research” in the “EU”. 
    • Participants are wondering if going to this level of detail in the policy is useful.
    • Both “where” and “when” are at the same conceptual level: either both are present or both absent from the TDM policy model. 

Minutes, April 20, 2021

Participants: Brendan, Giulia, Laurent, Claudio, Jean-Baptiste, Steve

Status: 

In previous meetings, the group has worked on what TDM means in practice, the vocabulary to be used during the project, the goals and requirements for a technical solution, the creation of several use cases. They have also compiled a set of past and existing initiatives with a similar scope. 

Then, the group discussed how rightsholders can declare to TDM Agents the reservation of TDM Rights. Three different technical solutions were selected: one based on http headers, another based on a file hosted on the origin server, and a third based on html meta tags. These three solutions correspond to different situations and technical skills. The priority in which these techniques should be processed by TDM Agents is still to be decided.

Using one of these techniques, a rightsholder can express either that “TDM rights are not reserved”, “TDM rights are reserved” or “TDM rights are reserved but a TDM license can be acquired”. 

In the latter case, it is interesting for both parties (rightsholder and TDM Actor) to create a machine-readable TDM policy which details under which conditions a TDM license can be acquired.

Defining what such TDM policy will be is the new task of the group. 

Agenda: Presentation of RightsML as an ODRL use case; TDM policy properties.

Brendan (IPTC) presents RightsML.  

RightsML is a “profile” of ODRL. “Policies” are related to time, geography and media distribution channels. RightsML 2.0 is based on ODRL 2.2 (the latest version). 

See dev.iptc.org for RightsML samples. 

ODRL is expressed in RDF and can be serialized in RDF/XML, JSON-LD or Turtle. To understand ODRL, we should start with the ODRL 2 Core Model. The “ODRL Profile Best Practices” gives hints on how we can build a TDM profile. 

People can express “permissions” and define “duties” like “payment” or “attribution”. Constraints can be combined. “tickets” and “offer” are used when no “assignee” is known in advance. Possible “actions” are “derive”, “transform”, “translate”, which fit with TDM usage. Possible “constraints” are “payment”, “geospatial …” …
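
To make these notions more concrete, here is a minimal sketch of an ODRL 2.2 “Offer” built as a Python dictionary and serialized as JSON-LD. It is not the TDM profile discussed by the group: the helper name and all example.com URLs are hypothetical, and the choice of a single “derive” permission with an “attribute” duty is only an illustration using terms from the ODRL vocabulary.

```python
import json


def build_tdm_offer(policy_uid: str, assigner: str, target: str) -> dict:
    """Build an illustrative ODRL Offer (hypothetical helper, not the TDM profile)."""
    return {
        "@context": "http://www.w3.org/ns/odrl.jsonld",
        # An Offer names an assigner but no assignee, which fits the case
        # where the TDM Actor is not known in advance.
        "@type": "Offer",
        "uid": policy_uid,
        "permission": [
            {
                "assigner": assigner,
                "target": target,
                "action": "derive",
                # A duty attached to the permission: attribution is required.
                "duty": [{"action": "attribute"}],
            }
        ],
    }


if __name__ == "__main__":
    offer = build_tdm_offer(
        "https://example.com/policies/tdm-1",   # hypothetical policy identifier
        "https://example.com/rightsholder",      # hypothetical rightsholder URI
        "https://example.com/articles/42.xml",   # hypothetical content URI
    )
    print(json.dumps(offer, indent=2))
```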

We could create a profile of ODRL, with a very limited set of possible actions and constraints. 

It would be versioned, so that new features can be added over time and we could follow the evolution of the ODRL spec. 

Q. Who is part of the ODRL group today? It is a Community Group. Renato Iannella (Monegraph) is still the guide. A core of half a dozen people works on it. 

Q. Is RightsML successful? RightsML has some level of uptake now, e.g. Reuters is using it. It took time… Most newspapers don’t parse RightsML and rely on the old-fashioned textual expression of rights.

Q. Is there ready-to-integrate software able to process ODRL? There are some proofs of concept. IPTC has developed one in Python. Reuters did their own implementation. 

see www.w3.org/community/odrl/implementations.

Q. Was there any collaboration with Google? IPTC didn’t talk about RightsML with Google. 

Giulia presents TDM license properties

The document was drafted from information received from participants, especially Steve from Elsevier.

Types of assignees can be commercial organizations, non-commercial organizations, information scientists. 

Elsevier is using the term “authorized users”, i.e. users authorized to apply TDM on content in a given company. 

“use case” is about the kind of usage (creating a knowledge graph, extracting trends …), but describing it in a machine-readable language can be difficult, because concepts are nuanced. 

“data types” is about XML, PDF ….

“License data” is copyrighted material, e.g. XML content; “derived data” is e.g. an index built upon XML content.

Minutes, April 06, 2021

Participants: Giulia, Claudio, Jean-Baptiste, Brendan, Rama  

Agenda: discuss the “file at the origin” and “html meta” proposals. Discuss if the three current proposals can be seen as complementary. 

Laurent, regarding the issue #10 opened by Claudio: splitting html pages into identified sections is technically doable (via id attributes); associating different sections with different TDM rules is technically doable also (via XML idref attributes). But such a solution would not be html-friendly and would be complex to handle for both publishers and TDM Agents. We may think about it for a version 2 of the specification, but introducing this now would be really dangerous imho. 

Claudio: I understand and agree with the complexity for a version 1. 

Giulia : should http header properties be registered by a standards body? 

Brendan: if we choose to name these properties with a “X-” as a prefix, we don’t need to have any registration. If not we should have our properties registered by IANA to make the solution more “standard”. 

Laurent: regarding the “file” proposal, we have to choose between regexps and something simplistic like the robots solution. 

Jean-Baptiste : the solution should be as simple as possible. Even JS compatible regexps are so wide that they can be badly implemented. Implementing the robots.txt solution (using * and $ only) is more robust. 

Rama and Claudio agree. 

Laurent: ok I’ll change the “file” proposal accordingly. We’ll have to study the use of * and $ in detail. 

Laurent : note that in the “file” proposal, if a file does not match any rule, it is covered by default by the exception allowed by Article 4. 

Giulia, Jean-Baptiste, Rama : please add an example with 2 files treated specifically in a directory, an example with a $, an example of the difference in processing between matching files and directories.

Giulia : in robots meta (which specifies no index, no follow), does the rule apply only to the html content or also to the resources referenced in the html document? 
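
As an illustration of the robots.txt-style matching discussed above, here is a minimal sketch in Python. It assumes that a pattern is a path prefix, that * matches any sequence of characters and that a trailing $ anchors the end of the path. The rule list, the “first matching rule wins” order and the function names are hypothetical; they are not taken from the “file” proposal.

```python
import re
from typing import List, Optional, Tuple


def pattern_to_regex(pattern: str) -> "re.Pattern[str]":
    """Translate a robots.txt-style pattern (only * and $) into a regex."""
    anchored = pattern.endswith("$")
    body = pattern[:-1] if anchored else pattern
    # Escape everything literally, then turn each * into "any characters".
    regex = ".*".join(re.escape(part) for part in body.split("*"))
    return re.compile("^" + regex + (r"\Z" if anchored else ""))


def path_matches(pattern: str, path: str) -> bool:
    """True if the path matches the pattern (prefix match unless $ is used)."""
    return pattern_to_regex(pattern).match(path) is not None


def reservation_for(path: str, rules: List[Tuple[str, int]]) -> Optional[int]:
    """Return the value of the first matching rule (assumed order), else None.

    None means no rule matched: the Article 4 exception applies by default.
    """
    for pattern, value in rules:
        if path_matches(pattern, path):
            return value
    return None


# Hypothetical rules: two files treated specifically inside a directory,
# the rest of the directory left unreserved, and all .jpg images reserved.
RULES = [
    ("/reports/annual.pdf$", 1),
    ("/reports/summary.pdf$", 1),
    ("/reports/", 0),
    ("/images/*.jpg$", 1),
]
```

In this sketch, "/reports/" matches every path in the directory because it is a prefix, whereas "/reports/annual.pdf$" matches that single file only; without the $, it would also match "/reports/annual.pdf.bak".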

Brendan, Laurent: not sure there is a definite robots specification for that. We’ll have to check. 

Brendan: See noimageindex in Google documentation at https://developers.google.com/search/docs/advanced/robots/robots_meta_tag

Laurent: warning, this is Google specific information, aimed at avoiding ambiguities. Still not sure that the robots spec gives a definitive answer. 

Laurent: Therefore, do you think that the 3 solutions proposed so far are complementary and all 3 can be implemented by every TDM Agent in the world, or should we already make choices? 

Giulia: the two first solutions are generic (any media): pros and cons are tied to preferences of technical providers. The third one is vertical (html only), therefore a bit different, its interest is that a publisher can embed the information without any implication of a technical provider. 

Laurent: as we have a vertical solution now (for html), other vertical solutions may be proposed. A warning: each time we add an alternative solution, we force EVERY TDM Agent to implement it.

Participants: all agree that the 3 current solutions are useful and complementary, and that they are so simple that every TDM Agent can implement the 3 solutions. We’ll need an order of priority in case of conflict. 
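
The need for an order of priority can be sketched as follows: the TDM Agent queries each technique in turn and keeps the first answer it gets. The order used in the example is a placeholder, since no priority had been decided at this point, and the lookup functions are hypothetical stand-ins for real http header, origin file and html meta checks.

```python
from typing import Callable, Dict, Optional, Sequence

# A lookup takes a URL and returns a tdm-reservation value, or None when the
# technique provides no information for that URL.
Lookup = Callable[[str], Optional[int]]


def resolve_reservation(url: str, lookups: Sequence[Lookup]) -> Optional[int]:
    """Return the value given by the first technique that has an answer."""
    for lookup in lookups:
        value = lookup(url)
        if value is not None:
            return value
    return None  # no technique provided any information


def make_table_lookup(table: Dict[str, int]) -> Lookup:
    """Build a stub lookup from a fixed table (for illustration only)."""
    return lambda url: table.get(url)


if __name__ == "__main__":
    url = "https://example.com/a.html"  # hypothetical resource
    from_headers = make_table_lookup({})
    from_file = make_table_lookup({url: 1})
    from_meta = make_table_lookup({url: 0})
    # With this placeholder order, the file-based value wins over the meta tag.
    print(resolve_reservation(url, [from_headers, from_file, from_meta]))
```

Keeping the order as a parameter makes the conflict explicit: swapping two lookups changes the result for any resource on which the techniques disagree.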

Giulia : should we also express the rightsholder name? the date of the declaration? 

Claudio : if I get a “don’t mine” instruction, I’d like to know who said that and why. Therefore such metadata would be useful.

Laurent: we have a charter to provide instructions to machines for TDM purposes; semantic information is out of scope here. If publishers want to provide metadata about their content, there are standard solutions for that (json-ld + schema.org on html content,  IPTC for images, XML for PDF and media content). 

Next steps: 

a/ We’ll see if other proposals come to the table. 

b/ We will now start discussing the licence format: what publishers need to declare and which existing initiatives can be reused? could we define standard licenses ready for use? The experience of the IPTC with RightsML (based on ODRL2), EPC / Copyright Hub (also using ODRL2), STM Association, Crossref TDM service, ARDITO will be really useful for that. 

Minutes, March 23, 2021

Agenda

Open discussion about the first potential technical solution, titled “Proposal based on http headers“.

Participants

Giulia, Giuseppe, Claudio, Jean-Baptiste, Steve, Leonard, Aziz, Brendan

Log of the discussion

Leonard: many people with a CMS don’t control their own servers. Sharepoint and other CMS are not so easy to control … Our solution should be usable for anybody putting content on the web.  Therefore http headers should not be the only solution. But we must be careful and also avoid too many solutions. 

Jean-Baptiste: The proposed solution would be the easiest for my company. e-distributors / aggregators control their servers. I agree that a different solution could be added for specific use cases. 

Claudio: if people cannot control their servers, they should go to a proper e-distributor, who will e.g. develop a CMS plug-in. 

Claudio : if you want to defend your content, the burden is on you. 

Steve: what is the legal minimum of functionalities required for our solution?

Laurent: a simple flag expressing an opt-out from the exception, as defined by the Article 4. Everything else is “cherry on the cake”.

Leonard: in the proposed solution, the license is referenced via a URL. We’ll have to be careful about potential same origin issues (CORS  issues). And decide if URLs can be relative or must be absolute. 
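
The two questions raised here can be illustrated with the Python standard library: resolving a license reference (relative or absolute) against the URL of the resource, and checking whether the result stays on the same origin. The function names are hypothetical, and whether relative references are allowed at all was still an open question.

```python
from typing import Optional, Tuple
from urllib.parse import urljoin, urlsplit

_DEFAULT_PORTS = {"http": 80, "https": 443}


def resolve_license_url(resource_url: str, license_ref: str) -> str:
    """Resolve a license reference against the URL of the resource.

    An absolute reference is returned unchanged; a relative one is
    resolved against the resource URL.
    """
    return urljoin(resource_url, license_ref)


def origin(url: str) -> Tuple[str, Optional[str], Optional[int]]:
    """Return the (scheme, host, port) triple, filling in default ports."""
    parts = urlsplit(url)
    port = parts.port or _DEFAULT_PORTS.get(parts.scheme)
    return (parts.scheme, parts.hostname, port)


def same_origin(url_a: str, url_b: str) -> bool:
    """True if both URLs share scheme, host and port."""
    return origin(url_a) == origin(url_b)


if __name__ == "__main__":
    resource = "https://example.com/books/1.html"  # hypothetical resource
    for ref in ("/licenses/tdm.json", "https://rights.example.org/tdm.json"):
        resolved = resolve_license_url(resource, ref)
        print(resolved, same_origin(resource, resolved))
```

The same-origin check only tells the agent that a license reference leaves the origin of the content; what to do in that case is a policy decision.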

Leonard: the proposal assumes that the license will be either a human readable or machine readable license. Is there a use case where both will be provided by a rightsholder?

Laurent: offering both human and machine readable licenses implies a risk of discrepancy between both expressions.

Giulia: I think the need is exclusive and human readable is useful for less tech advanced rightsholders, a sort of fallback. 

Claudio: I imagine the reverse; machine readable should be mandatory, human readable is a plus.  

Jean-Baptiste: many French publishers will decide on a human readable license. 

Leonard: As a TDM Actor, if I mis-interpret a human readable license, I am legally liable; therefore we should impose machine readable licenses. 

Steve: we could define a limited set of templated licenses to ease the use of machine-readable licenses. 

Leonard: maybe we are missing a value TDM-a = “0.5”. Something between “TDM rights are reserved” and “TDM rights are not reserved”.

Laurent: isn’t it what TDM-a = “2” is for? a notion of “TDM rights may be reserved or not, depending on the license you acquire”?

Steve: outside of the EU, I want to reserve TDM rights for commercial companies. If TDM Actors pay a fee they get specific TDM rights.  How will people be able to express such a thing?

Giulia: when we study the license format, we’ll have to find a format which allows conditional information (if the use is commercial; at this location; before this date). 

Conclusion

For the next call, the co-chairs will prepare a “Proposal based on a file hosted on the origin website” and if possible a “Proposal based on meta properties in html documents”.

We will use the same anti-bikeshedding TDM-a and TDM-b properties.

Minutes of the 09/03/2021 call

Introduction

The number of participants of the Community Group has now reached 25 people. 

Validation of the charter 

The proposed charter of the CG is accessible on Github on https://github.com/w3c/tdm-reservation-protocol/blob/main/charter.md. It is similar to other W3C CG charters, taken as examples. 

The Charter is approved unanimously by the participants. Community members who didn’t attend the meeting are kindly invited to send comments, if any, before the next meeting. 

Review of the goals and requirements

During the discussion, it is made clear that the requirements are related to the technical specification, not the implementations of the specification; said differently, rightsholders will have a flexible use of the different features of the specification. For instance they will be able to express a licence only if available, and choose between expressing their licences via human readable information **or** machine-readable information.

To make it clearer, the Core requirements will be introduced via the phrase “The technical specification shall”. 

Note: It is nevertheless important to note that TDM Actors will have to implement every feature in their TDM Agents, therefore we’ll have to be very careful not to multiply options.

Because they cannot be verified against a technical solution, what was called “High level requirements” is now called “Goals”. 

Because “Primary” and “Secondary” Requirements were not clear, these are renamed “Core Requirements” (what the technical solution must fulfill) and “Additional Requirements” (which will complement Core Requirements in specific cases). The second slot is currently empty; the Group may decide if items should be added here during our work.

The Goals and the Requirements are approved unanimously by the participants. Community members who didn’t attend the meeting are kindly invited to send comments, if any, before the next meeting. 

Review of the Vocabulary

One of the comments made on the Goals was that “publishers” was used but not defined. A simple definition was therefore added to the vocabulary. This definition would be more accurate for distributors of digital publications on the Web. Calling them “publishers” is not perfect but still usable in our context, where we refer to publishing as the act of making content available to a public.

The Vocabulary is approved unanimously by the participants. 

Workplan

How do we plan our work, avoid multiple threads of discussion and foster participation? 

The way to go, agreed during the call, is to proceed in sequence, focussing on one solution at a time, from the simplest to more elaborate solutions. Next steps will be: 

  1. Study a solution based on http headers;
  2. Study a solution based on a central file hosted on a Web server hosting Web resources;
  3. Study a solution based on the embedding of metadata in Web resources. 

As inspiration can come from past and existing initiatives (https://github.com/w3c/tdm-reservation-protocol/blob/main/docs/initiatives.md), we will start by studying the robots.txt initiative and the co-chairs, with the help of the participants, will try to schedule presentations of other initiatives by their experts. For instance, RightsML (IPTC) and Assert your rights (Copyright Hub) could be the subject of such presentations.

How will participants be able to propose technical solutions for discussion within the group?

Different solutions are discussed: Google Docs, Github Wiki and raw Github documents in markdown. The issue with Google Docs is that they are separate from the Github framework in which we are working. The issue with Github Wiki pages is that the history of changes is difficult to see and Issues or Pull Requests cannot be synchronized with changes. 

The final choice is therefore to rely on pure Github documents, using Pull Requests. A How To written by Ivan Herman (https://iherman.github.io/misc-notes/docs/BasicGitHubContributionIntro) will be of great help for newcomers to Github. 

The co-chairs will initiate the work by providing a template document for use in the first step of our work, i.e. a solution based on http headers. 

Minutes of the kickoff meeting

12 people were present on the call, the agenda being: 

  • brief presentation of the participants.
  • Q/A on the context of this project
  • review of the proposed charter
  • review of the proposed vocabulary
  • review of the proposed requirements
  • decision of the frequency and preferred date-time for the next calls.

Many thanks to the participants who have posted on the mailing list a short introduction about themselves and their interest for the project. All participants are invited to do so.

The following decisions were taken: 

  1. the group will have a call every two weeks, on Tuesdays, at 11 am EST = 17:00 CET/CEST (= 4 pm UTC in Winter, 3 pm UTC in Summer).
  2. we’ll discuss asap how we deal with the winter/summer time change. 
  3. we’ll use the same Zoom details we used for the first call. This information must be requested from the co-chairs.
  4. the CG home page (https://www.w3.org/community/tdmrep/) is a great starting point to find information: posts (like this one), Github link, Mailing List.
  5. the use of IRC and scribes is not decided: this can be complex for newcomers, therefore we’ll talk about this later only.  
  6. participants are invited to comment on the current open issues 1 to 5.
  7. participants are invited to read and comment as issues (*) the documents provided on Github as source for discussion.
  8. the co-chairs will prepare for the next meeting a proposal of evolution of the vocabulary and requirements, using as a source the current open issues 1 to 5. 
  9. the co-chairs will add a page to the documentation, describing the robots.txt technology for non technicians.

Therefore the next call is planned for Tuesday 09 March at 11 am EST = 4 pm UTC = 17:00 CET.

PS: TDM actors, please join the discussion; the door is more than open.

(*) Github offers “issues” but also “discussions”. We could decide to use the latter, maybe simpler for non technicians.