Text and Data Mining Reservation Protocol Community Group

The goal of this Group is to facilitate Text and Data Mining (TDM) Reservation Protocol in Europe and elsewhere, by specifying a simple and practical machine-readable solution, capable of expressing the reservation of TDM rights - following the rules set by the new European DSM Directive / Art.4 - and the availability of machine-readable licenses for TDM actors.

w3c/tdm-reservation-protocol

Note: Community Groups are proposed and run by the community. Although W3C hosts these conversations, the groups do not necessarily represent the views of the W3C Membership or staff.

final reports / licensing info

date name commitments
TDM Reservation Protocol, version 1 Licensing commitments

Minutes, March 23, 2021

Agenda

Open discussion about the first potential technical solution, titled “Proposal based on http headers”.

Participants

Giulia, Giuseppe, Claudio, Jean-Baptiste, Steve, Leonard, Aziz, Brendan

Log of the discussion

Leonard: many people with a CMS don’t control their own servers. SharePoint and other CMSs are not so easy to control… Our solution should be usable by anybody putting content on the web. Therefore http headers should not be the only solution. But we must be careful and also avoid too many solutions.

Jean-Baptiste: The proposed solution would be the easiest for my company. e-distributors / aggregators control their servers. I agree that a different solution could be added for specific use cases. 

Claudio: if people cannot control their servers, they should go to a proper e-distributor, who will e.g. develop a CMS plug-in. 

Claudio: if you want to defend your content, the burden is on you.

Steve: what is the minimum functionality legally required of our solution?

Laurent: a simple flag expressing an opt-out from the exception, as defined by Article 4. Everything else is the “cherry on the cake”.

Leonard: in the proposed solution, the license is referenced via a URL. We’ll have to be careful about potential same-origin issues (CORS issues), and decide whether URLs can be relative or must be absolute.

Leonard: the proposal assumes that the license will be either a human readable or a machine readable license. Is there a use case where both will be provided by a rightsholder?

Laurent: offering both human and machine readable licenses implies a risk of discrepancy between both expressions.

Giulia: I think the need is exclusive and human readable is useful for less tech advanced rightsholders, a sort of fallback. 

Claudio: I imagine the reverse; machine readable should be mandatory, human readable is a plus.  

Jean-Baptiste: many French publishers will decide on a human readable license. 

Leonard: As a TDM Actor, if I mis-interpret a human readable license, I am legally liable; therefore we should impose machine readable licenses. 

Steve: we could define a limited set of templated licenses to ease the use of machine-readable licenses. 

Leonard: maybe we are missing a value TDM-a = “0.5”. Something between “TDM rights are reserved” and “TDM rights are not reserved”.

Laurent: isn’t that what TDM-a = “2” is for? A notion of “TDM rights may be reserved or not, depending on the license you acquire”?

Steve: outside of the EU, I want to reserve TDM rights for commercial companies. If TDM Actors pay a fee they get specific TDM rights.  How will people be able to express such a thing?

Giulia: when we study the license format, we’ll have to find a format which allows conditional information (if the use is commercial; at this location; before this date). 
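Note: Leonard’s question about relative versus absolute license URLs can be illustrated with a minimal Python sketch. It only shows how a TDM Agent might resolve a license reference against the URL of the resource it was found on, and how it might test whether both share an origin. The function names and the example URLs are hypothetical; nothing here was decided by the Group.

```python
from urllib.parse import urljoin, urlsplit


def resolve_license_url(resource_url: str, license_ref: str) -> str:
    """Resolve a possibly relative license reference against the resource URL.

    An absolute reference is returned unchanged; a relative one is resolved
    against the resource URL, following standard URL resolution rules.
    """
    return urljoin(resource_url, license_ref)


def same_origin(url_a: str, url_b: str) -> bool:
    """Compare scheme, host and port, the three parts of a web origin.

    Simplification: an explicit default port (e.g. ":443") is not normalised,
    so it is treated as different from an omitted port.
    """
    a, b = urlsplit(url_a), urlsplit(url_b)
    return (a.scheme, a.hostname, a.port) == (b.scheme, b.hostname, b.port)


# Hypothetical values, for illustration only.
resource = "https://example.com/books/a.html"
print(resolve_license_url(resource, "/licenses/tdm.json"))
print(same_origin(resource, "https://rights.example.org/license.json"))
```

If relative URLs are allowed, every TDM Agent must apply the same resolution rules; requiring absolute URLs removes that step at the cost of longer values.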

Conclusion

For the next call, the co-chairs will prepare a “Proposal based on a file hosted on the origin website” and if possible a “Proposal based on meta properties in html documents”.

We will use the same anti-bikeshedding TDM-a and TDM-b properties.
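Note: as an illustration of how a TDM Agent might read the placeholder properties from an http response, here is a minimal Python sketch. It rests on assumptions that are not in these minutes: that TDM-a = “0” means rights are not reserved and “1” means they are reserved, and that TDM-b carries a license URL. Only the meaning of “2” was described during the call. The function and class names are hypothetical.

```python
from dataclasses import dataclass
from typing import Mapping, Optional

# ASSUMPTION: the meanings of "0" and "1" are guesses made for this sketch;
# only "2" ("may be reserved or not, depending on the license") was discussed.
ASSUMED_MEANINGS = {
    "0": "not reserved",
    "1": "reserved",
    "2": "depends on license",
}


@dataclass
class TdmStatus:
    reservation: str            # interpreted TDM-a value
    license_url: Optional[str]  # ASSUMPTION: TDM-b holds a license URL


def interpret_tdm_headers(headers: Mapping[str, str]) -> TdmStatus:
    """Interpret the placeholder TDM-a / TDM-b headers of an http response."""
    # http header names are case-insensitive, so normalise them first.
    lowered = {name.lower(): value.strip() for name, value in headers.items()}
    raw = lowered.get("tdm-a")
    if raw is None:
        reservation = "not expressed"
    else:
        reservation = ASSUMED_MEANINGS.get(raw, "unknown")
    return TdmStatus(reservation, lowered.get("tdm-b") or None)


# Hypothetical response headers, for illustration only.
print(interpret_tdm_headers({"TDM-a": "2", "TDM-b": "https://example.com/license"}))
```

Keeping “not expressed” distinct from “unknown” lets an agent tell a missing header apart from a value it does not understand.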

Minutes of the 09/03/2021 call

Introduction

The number of participants of the Community Group has now reached 25 people. 

Validation of the charter 

The proposed charter of the CG is accessible on Github on https://github.com/w3c/tdm-reservation-protocol/blob/main/charter.md. It is similar to other W3C CG charters, taken as examples. 

The Charter is approved unanimously by the participants. Community members who didn’t attend the meeting are kindly invited to send comments, if any, before the next meeting. 

Review of the goals and requirements

During the discussion, it is made clear that the requirements are related to the technical specification, not the implementations of the specification; said differently, rightsholders will have a flexible use of the different features of the specification. For instance they will be able to express a licence only if available, and choose between expressing their licences via human readable information **or** machine-readable information.

To make it clearer, the Core requirements will be introduced via the phrase “The technical specification shall”. 

Note: It is nevertheless important to note that TDM Actors will have to implement every feature in their TDM Agents, therefore we’ll have to be very careful not to multiply options.

Because they cannot be verified against a technical solution, what was called “High level requirements” is now called “Goals”. 

Because “Primary” and “Secondary” Requirements were not clear, these are renamed “Core Requirements” (what the technical solution must fulfill) and “Additional Requirements” (which will complement Core Requirements in specific cases). The second slot is currently empty; the Group may decide during our work whether items should be added there.

The Goals and the Requirements are approved unanimously by the participants. Community members who didn’t attend the meeting are kindly invited to send comments, if any, before the next meeting. 

Review of the Vocabulary

One of the comments made on the Goals was that “publishers” was used but not defined. A simple definition was therefore added to the vocabulary. This definition would be more accurate for distributors of digital publications on the Web. Calling them “publishers” is imperfect but still usable in our context, where we refer to publishing as the act of making content available to a public.

The Vocabulary is approved unanimously by the participants. 

Workplan

How do we plan our work, avoid multiple threads of discussion and foster participation? 

The way to go, agreed during the call, is to proceed in sequence, focusing on one solution at a time, from the simplest to more elaborate solutions. Next steps will be:

  1. Study a solution based on http headers;
  2. Study a solution based on a central file hosted on a Web server hosting Web resources;
  3. Study a solution based on the embedding of metadata in Web resources. 

As inspiration can come from past and existing initiatives (https://github.com/w3c/tdm-reservation-protocol/blob/main/docs/initiatives.md), we will start by studying the robots.txt initiative and the co-chairs, with the help of the participants, will try to schedule presentations of other initiatives by their experts. For instance, RightsML (IPTC) and Assert your rights (Copyright Hub) could be the subject of such presentations.

How will participants be able to propose technical solutions for discussion within the group?

Different solutions are discussed: Google Docs, Github Wiki and raw Github documents in markdown. The issue with Google Docs is that they are separate from the Github framework in which we are working. The issue with Github Wiki pages is that the history of changes is difficult to see and Issues or Pull Requests cannot be synchronized with changes.

The final choice is therefore to rely on pure Github documents, using Pull Requests. A How To written by Ivan Herman (https://iherman.github.io/misc-notes/docs/BasicGitHubContributionIntro) will be of great help for newcomers to Github. 

The co-chairs will initiate the work by providing a template document for use in the first step of our work, i.e. a solution based on http headers. 

Minutes of the kickoff meeting

12 people were present on the call, the agenda being: 

  • brief presentation of the participants.
  • Q/A on the context of this project
  • review of the proposed charter
  • review of the proposed vocabulary
  • review of the proposed requirements
  • decision of the frequency and preferred date-time for the next calls.

Many thanks to the participants who have posted on the mailing list a short introduction about themselves and their interest in the project. All participants are invited to do so.

The following decisions were taken: 

  1. the group will have a call every two weeks, on Tuesdays, at 11 am EST = 17:00 CET/CEST (= 4 pm UTC in Winter, 3 pm UTC in Summer).
  2. we’ll discuss asap how we deal with the winter/summer time change. 
  3. we’ll use the same Zoom details we used for the first call. This information must be requested from the co-chairs.
  4. the CG home page (https://www.w3.org/community/tdmrep/) is a great starting point to find information: posts (like this one), Github link, Mailing List.
  5. the use of IRC and scribes is not decided: this can be complex for newcomers, so we’ll talk about this later.
  6. participants are invited to comment on the current open issues 1 to 5.
  7. participants are invited to read and comment as issues (*) the documents provided on Github as source for discussion.
  8. the co-chairs will prepare for the next meeting a proposal of evolution of the vocabulary and requirements, using as a source the current open issues 1 to 5. 
  9. the co-chairs will add a page to the documentation, describing the robots.txt technology for non technicians.

Therefore the next call is planned for Tuesday 09 March at 11 am EST = 16:00 UTC = 17:00 CET.

PS: TDM actors, please join the discussion; the door is more than open.

(*) Github offers “issues” but also “discussions”. We could decide to use the latter, maybe simpler for non technicians. 

Kickoff date for the TDM Reservation Protocol CG

Dear participants in the Text and Data Mining Reservation Protocol CG (TDMrep in short), Doodle has delivered its oracle:

The kickoff of the project will happen on Tuesday 23 Feb, 16:00 UTC (local datetime via dateandtime) and will last one hour. This time is not great for Asian participants, therefore the co-chairs will contact them before the meeting to get their inputs on the topics to be discussed.

We will use Zoom for the call, please contact the co-chairs to get details.

The proposed agenda is as follows:

  • brief presentation of the participants.
  • Q/A on the context of this project
  • review of the proposed charter
  • review of the proposed vocabulary
  • review of the proposed requirements
  • decision of the frequency and preferred date-time for the next calls.

Please read these documents before the call. Please also drop 3 lines presenting yourself and your interest in the CG on the public-tdmrep@w3.org mailing list, so that we don’t spend too much time on the initial roundtable.

Best regards,
Giulia and Laurent, co-chairs

Kickoff of the TDM Reservation Protocol CG

Dear participants in the Text and Data Mining Reservation Protocol CG (TDMrep in short), it has taken some time for the co-chairs to invite you to a first call; this is simply because we were waiting for a sufficient number of participants representing TDM actors. We are currently 13 participants in this CG, and we are expecting other people to join now that we have reached ACReps from organizations which have a foot in TDM.

Therefore it is now time to have a kickoff call, and Giulia and I propose to have it between Feb. 15th and 23rd.

For that purpose we have set up this Doodle. Please fill it in to signal your availability.

We will use Zoom for the call, please contact the co-chairs to get details (it is bad practice to leave that in a blog post).

Before the call, you can read the documents which have been put in place in the project Github space. The most important is certainly the set of requirements which should be discussed and – if possible – validated during the first call. A proposed charter for the CG is also to be discussed and validated asap.

The repository also includes a proposal for a common vocabulary, an attempt to define TDM and some useful extracts of the EU DSM Directive (Articles 3 and 4). All good reads.

Giulia and I hope to meet you all at this kickoff call 🙂

Context of the TDM Reservation Protocol initiative

In addition to their significance in the context of scientific research, text and data mining techniques (TDM) are widely used both by private and public entities to analyse large amounts of data (including copyright protected content like text, images, video etc.) in different areas of life and for various purposes, including for government services, complex business decisions and the development of new applications or technologies.

In a digital environment, TDM usage of copyright protected works can be subject to different terms and conditions, depending on the legal framework. In generic terms, an act of reproduction is required before TDM can be applied to content accessible on the Web; international laws stipulate that such an act of reproduction is subject to authorization by rightsholders. So far, analyzing and processing the terms and conditions of a website, contacting rightsholders, seeking permission and concluding licensing agreements require time and resources.

In such a context, a machine-readable solution which streamlines the communication of TDM rights and licenses available for online copyrighted content is necessary to facilitate the development of TDM applications and reduce the risks of legal uncertainty for TDM actors. Such a solution, which shall rely on a consensus between rightsholders and TDM actors, will optimize the capacity of TDM actors to lawfully access and process useful content at large scale.

The EU Directive 2019/790, better known as the DSM Directive (DSM meaning Digital Single Market), introduces two exceptions or limitations to the rights of rightsholders on lawfully accessible content, for reproductions and extractions for the purposes of TDM:

  • In its Article 3, a mandatory exception for research organisations and cultural heritage institutions which carry out TDM for the purposes of scientific research.
  • In its Article 4, an exception for any organisation willing to carry out TDM for any purpose other than scientific research, including commercial purposes, which applies on the condition that the use of content for TDM has not been expressly reserved by its rightsholders in an appropriate manner, such as machine-readable means.

These TDM exceptions apply to TDM usage in the European Union in relation to content from European and foreign rightsholders. Outside of the EU, where the DSM legislation does not apply, the said exception does not apply: exclusive rights of rightsholders to authorize acts of reproduction are maintained. In such cases, no TDM can be performed without the explicit authorisation of these rightsholders: in these countries, the absence of a reservation of rights by rightsholders cannot be considered as an implicit authorization to reproduce copyrighted content for TDM purposes, and invoking fair use or a similar rule is legally uncertain, as these actions are judged on a case-by-case basis.

The “opt-out” mechanism introduced by the DSM Directive is therefore a real opportunity for TDM actors and publishers across countries to define a machine-readable technique able to express not only if TDM rights on specific Web content are reserved or not, but also how rightsholders can be contacted and which licenses are available, if any. This is a tremendous help for TDM actors from all countries looking for legal certainty.

Call for Participation in TDM Reservation Protocol Community Group

The TDM Reservation Protocol Community Group has been launched:


The goal of this Group is to facilitate TDM in Europe and elsewhere, by specifying a simple and practical machine-readable solution, capable of expressing the reservation of TDM rights – following the rules set by the new European DSM Directive / Art.4 – and the availability of machine-readable licenses for TDM actors.


In order to join the group, you will need a W3C account. Please note, however, that W3C Membership is not required to join a Community Group.

This is a community initiative. This group was originally proposed on 2021-01-18 by Laurent Le Meur. The following people supported its creation: Laurent Le Meur, Ivan Herman, ANNE BERGMAN-TAHON, Giulia Marangoni, Catherine Blache. W3C’s hosting of this group does not imply endorsement of the activities.

The group must now choose a chair. Read more about how to get started in a new group and good practice for running a group.

We invite you to share news of this new group in social media and other channels.

If you believe that there is an issue with this group that requires the attention of the W3C staff, please email us at site-comments@w3.org

Thank you,
W3C Community Development Team