Text and Data Mining Reservation Protocol Community Group
The goal of this Group is to facilitate Text and Data Mining (TDM) Reservation Protocol in Europe and elsewhere, by specifying a simple and practical machine-readable solution, capable of expressing the reservation of TDM rights - following the rules set by the new European DSM Directive / Art.4 - and the availability of machine-readable licenses for TDM actors.
Update on meetings and presentations of the TDM protocol
On 5th June a webinar on the protocol and how to implement it was organized by FEP and EDRLab; more than 70 publishers attended, and positive feedback was received.
On July 11th, in Brussels, the TDM protocol was presented by AIE at the “Seminar on best practices for opting-out of generative ML training”, organized by Open Future. AIE and FEP attended the event, which was an opportunity to exchange views with organizations representing other rightsholders in the content industry, the European Commission, AI experts, and other projects/initiatives offering solutions for machine-readable opt-out, namely the C2PA coalition and Spawning. The latter integrates different opt-out methods in order to provide a service to AI companies that, given a URL as input, checks whether an opt-out is associated with the resource that AI players intend to use.
Collaboration with Spawning AI
After some exchanges, Spawning AI has already partially integrated the opt-out solution developed by the TDM Rep CG into their service, and they are open to collaborating further with the CG.
Discussion on possible developments of the protocol
EDRLab presented an overview of the different opt-out initiatives that are in touch with our CG. Some of them are media-specific (like the ones by IPTC and C2PA) and provide solutions at the content metadata level; others, like Spawning AI (and the TDM Rep protocol), are applicable to any content type, at the URL level. Even though the different solutions (content-specific and not content-specific) are complementary and can coexist in line with the different standards and practices in the content industry, there are significant differences in the semantic approach adopted by IPTC and C2PA on the one hand, and TDM Rep on the other: in particular, the different solutions reflect different views on whether the TDM concept covers all or only some AI usages, and whether indexing by search engines could be part of the opt-out. Such discrepancies are partly due to the different legal frameworks (US vs. EU) in which these initiatives were developed.
Considering the rapid evolution of AI applications, and the ongoing discussion in the creative industries on rights reservation and licensing for AI, the CG agreed to continue to monitor the situation and exchange with the other initiatives in this field before taking any decision on the possible refinement of the protocol with new properties or values.
In the short term, it was agreed that:
the CG will check if the semantics of the protocol can be further clarified at the level of the specifications, to prevent any ambiguity and facilitate interoperability among different solutions.
the CG will work on a FAQ for non-technical readers that will further clarify the meaning of the TDM opt-out in light of the EU legal framework and will provide practical insight to adopters on how to implement it in the context of AI.
Implementation in EPUB files
Given the increasing interest from the publishing sector – including, among CG members, Mondadori, Penguin Random House, and the STM association – in the integration of the TDM protocol in EPUB files, it was agreed that the CG will liaise with the W3C Publishing Community Group and the Publishing Business Group, which follow EPUB-related developments, via EDRLab (which is a member of both groups).
Particularly, it was agreed that:
On behalf of the CG, EDRLab will send to the W3C Publishing Business Group a proposal to be discussed during their next meeting in September;
Should CG members have views or suggestions on the integration of TDM Rep in EPUB, they are requested to share them within the CG mailing list at their earliest convenience, so that they can be taken into account in the framework of the collaboration with the W3C Publishing Business and Community Groups
A FAQ for non-tech users: the group agreed to work on a FAQ; for more details see above;
Keeping track of early adopters: group members are invited to share information about new adopters of the protocol on the CG mailing list. The list of early adopters will be published on the CG website in order to give it visibility. Early adopters are also encouraged to publicize their adoption of the protocol on their own websites.
The TDMRep Community Group had its first 2023 call on April 4th. Several new members of the CG joined the call, from the International Association of STM Publishers, the CCC (Copyright Clearance Center) and Taylor & Francis Group.
It was the opportunity to remind members about some useful links:
Several threads of discussion will be developed during the coming months:
a) about the relationship between our work and AI/ML training, especially in the scope of Generative AI.
We will ask the EU Commission if they have a view about the applicability of Article 4 of the DSM Directive to AI/ML training. We will also try to get direct information about this issue from the publishers community and legal experts in the field of digital publishing technologies.
b) about a possible addition of usage details to the TDMRep specification.
We will study whether there is a requirement from publishers/rightsholders to differentiate between different usages of their content in the framework of AI, e.g. allow AI/ML training, or allow all but Generative AI training. Other working groups – namely C2PA and the IPTC – are taking this approach.
c) about the inclusion of “opt-out” information inside content items.
Several publishers have expressed a requirement to attach their “opt-out” decision to the content they publish. Several working groups – especially C2PA and the IPTC – are working on adding usage rights to PDF, JPEG and other types of files.
We will study whether we should extend our specifications to support this use case, or whether we should rather maintain a liaison with other working groups and ensure that our respective specifications are compatible (and therefore complementary).
In the short term, we will ask representatives of C2PA and IPTC to present their current work to our community group during the next two calls. These calls will be announced to the group members on the mailing list.
Giulia and I had a quick call to discuss a practical path to the next steps of our work. Here is our proposal:
1/ Study if “the use of data for the purposes of training an AI/ML model” is the same thing as “the use of data for the purposes of data mining”, especially in the scope of the EU DSM Directive. For that we need access to documents formalizing the position of experts in this field, we need lawyers around the table, and the opinion of EU Commission people would be welcome.
2/ Discuss if we should evolve the specification in order to embed TDMRep properties in different types of resources (images, PDF and EPUB come to mind). One important question is what should be found in such resources: TDMRep properties (as we defined them) or detailed usage rights information (as the IPTC is defining them)?
3/ If the answer to issue 2 is “yes, we should embed TDMRep properties in resources”, then how should we embed them? (XMP comes to mind here.)
Could we start with issue 1? AI and LLMs are the big thing currently, and there is traction for protecting artistic content against free commercial use in AI products.
We therefore propose to hold a Zoom call on this topic on April 4th, 14:00 UTC (16:00 CEST), using this URL: https://us06web.zoom.us/j/84639131689?pwd=QXl1T1R1UldjdG1XR00rWG5ZQUlZQT09
Please indicate if you’re willing to join, and if you have a specific expertise in the AI vs TDM field.
After the release of our final report, in February 2022, we had the opportunity to present the project to several publishers and publishers associations in European countries and start discussing possible pilots in order to gather feedback on the first release of the protocol.
We found that interest in the “TDM opt-out” solution is growing fast, in particular since ChatGPT was released. GPT-3, the LLM (Large Language Model) behind ChatGPT, is an impressive example of the potential of AI agents when processing digital publications, and rightsholders from different sectors of the content industries are looking for an effective machine-readable solution for managing TDM rights on their content.
The TDMRep effort is especially interesting for the IPTC (International Press Telecommunications Council), which develops standards for the News Industry and has started exchanging ideas with us.
This group will therefore continue its work on refining the specification and study possible extensions to meet the requirements of specific sectors of the content industry and to facilitate the integration of our work with other metadata standards (such as the IPTC standard for photos) and content formats (e.g. to embed TDMRep metadata in PDF resources).
As multiple EU countries are moving towards the implementation of the DSM directive, publishers are looking for solutions. The Community Group will now focus on communication, its targets being the European Commission, TDM Actors and publishers.
In order to communicate effectively, the Community Group will have to work on its governance and define some tools (maybe a presentation website, recorded demos, a brochure, a logo …). We’ll therefore organize multiple calls in 2022.
If you are aware of events which could be good venues for presentations, please contact the CG chairs. We’re currently thinking about the IPTC Spring Meeting 2022 (May) and the Digital Publishing Summit 2022 (June).
The consensus is that the specification corresponds to the requirements and is easy to implement, which corresponds to its main goals.
One remaining issue is the de-duplication of requests for licenses from TDM Actors. If many resources are linked to the same policy, on what grounds should a TDM Actor send one request, or multiple requests? Not every Publisher will set a `target` property in its TDM Policy. It seems to the participants that the URL of the TDM Policy should be used as an identifier, and therefore as a way to send a generic email stating a “request for license to mine a set of resources pointing at the TDM Policy with URL xxxx”, optionally with the list of resource URLs the TDM Actor is willing to mine. This needs to be clarified in a best practices guide.
The CG will stay open for feedback about the specification until the end of the year 2021, will update it when needed, and will then release the specification as a Final Report.
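The de-duplication idea discussed above can be illustrated with a minimal sketch. It assumes the TDM Actor has already collected (resource URL, TDM Policy URL) pairs while crawling; the function names `group_by_policy` and `draft_request` and the example URLs are hypothetical and are not part of the specification.

```python
from collections import defaultdict


def group_by_policy(resources):
    """Group resource URLs by the TDM Policy URL they point at.

    `resources` is an iterable of (resource_url, policy_url) pairs.
    Resources without a policy URL are skipped, since there is no
    policy to address a license request to.
    """
    groups = defaultdict(set)
    for resource_url, policy_url in resources:
        if policy_url:
            groups[policy_url].add(resource_url)
    # One entry per policy: the policy URL acts as the identifier.
    return {policy: sorted(urls) for policy, urls in groups.items()}


def draft_request(policy_url, resource_urls):
    """Build the text of a single, generic license request for one policy."""
    lines = [
        "Request for license to mine a set of resources pointing at "
        f"the TDM Policy with URL {policy_url}"
    ]
    # The list of resources is optional; include it when it is known.
    lines += [f"- {url}" for url in resource_urls]
    return "\n".join(lines)


if __name__ == "__main__":
    crawled = [
        ("https://example.com/a", "https://example.com/policy.json"),
        ("https://example.com/b", "https://example.com/policy.json"),
        ("https://example.org/c", None),
    ]
    for policy, urls in group_by_policy(crawled).items():
        print(draft_request(policy, urls))
```

With this approach a TDM Actor sends one request per distinct policy URL rather than one per resource, whether or not the policy sets a `target` property.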
The co-chairs and the FEP (Federation of European Publishers) will turn to the European Commission for planning demos or some other representation of the work done so far, and by this raise awareness about the project.
The CCC (Copyright Clearance Center) reports that WIPO is holding webinars every 2 weeks on copyright infrastructure and new developments. It is a good avenue for promoting our work.
We hope that Press Publishers will seize the project, as they are among the organizations which will benefit most from the success of the solution.
The group feels that we should create a landing page introducing the project (and therefore certainly acquire a specific domain name), from which multimedia communication can be developed, like an overview of the project and How To for the 3 techniques we propose.
An incentive for TDM Actors could be to create a label (“Clean scraping”?) and a logo, for those who adopt the solution.
We’ll have to discuss the governance of the project after the Final Report of the CG is released. Should it be ex-nihilo, or based on an existing organization?
The proposed agenda for the TDM Reservation Protocol Community Group meeting on Thursday 26 October 2021 (15:00 UTC to 16:00 UTC) happening as part of W3C TPAC 2021, is as follows:
Introduction to the TDM Reservation Protocol CG
Finalization of prototypes in view of presentations to interested parties
Wider communication plan: how can we make sure that TDM Actors are aware of this initiative?
The co-chairs invite community group members to propose further agenda items that they would like to be added to the meeting by sending their suggestions via the public-tdmrep CG mailing list.
The meeting will take place via Zoom and the link for the meeting will soon be published on the TPAC Wiki.
If you are planning to attend the CG meeting, you are invited to register for TPAC 2021 so that you can attend other sessions in the conference if you are interested to do so. To register for TPAC 2021, click here.
During its early meetings, the group worked on what TDM means in practice, the vocabulary to be used during the project, and the goals and requirements for a technical solution; it also created several use cases and compiled a set of past and existing initiatives with a similar scope. The group then agreed on three alternative technical solutions for expressing the reservation of TDM rights: one based on HTTP headers, another based on a file hosted on the origin server, and a third based on HTML meta tags. These three solutions correspond to different situations and technical skills.
In May, the group defined a machine readable TDM policy which details how a rightsholder can be contacted and conditions in which a TDM license can be acquired. It was agreed that TDM Policies would be defined as a profile of ODRL 2. In June, a complete draft of specification (https://w3c.github.io/tdm-reservation-protocol/spec/) was written, which includes details like the priority by which the 3 techniques must be processed by TDM Agents and how TDM Agents should react to protocol errors.
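As an illustration of two of the three techniques, here is a minimal sketch of how a TDM Agent might read the properties from HTTP response headers and from HTML meta tags. It assumes the header field names and the meta `name` values are `tdm-reservation` and `tdm-policy`, as the properties are called in these notes; the helper names are hypothetical, the file-based technique is left out, and the order passed to `first_declaration` is only a placeholder, since the normative processing priority is defined in the specification.

```python
from html.parser import HTMLParser

PROPERTIES = ("tdm-reservation", "tdm-policy")


class _MetaCollector(HTMLParser):
    """Collect <meta name="tdm-..." content="..."> declarations."""

    def __init__(self):
        super().__init__()
        self.found = {}

    def handle_starttag(self, tag, attrs):
        if tag != "meta":
            return
        attributes = dict(attrs)
        name = (attributes.get("name") or "").lower()
        content = attributes.get("content")
        if name in PROPERTIES and content is not None:
            # Keep the first declaration of each property.
            self.found.setdefault(name, content)


def from_http_headers(headers):
    """Extract TDM properties from a mapping of HTTP response headers."""
    lowered = {key.lower(): value for key, value in headers.items()}
    return {name: lowered[name] for name in PROPERTIES if name in lowered}


def from_html_meta(html):
    """Extract TDM properties from the meta tags of an HTML document."""
    collector = _MetaCollector()
    collector.feed(html)
    return collector.found


def first_declaration(*candidates):
    """Return the first candidate that declares tdm-reservation, else {}."""
    for candidate in candidates:
        if "tdm-reservation" in candidate:
            return candidate
    return {}
```

A caller would pass the results in whatever order the specification requires, for example `first_declaration(from_http_headers(h), from_html_meta(page))` if headers were to be checked first.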
From July to August, details of the specification were discussed and the group requested advice and prototyping from content providers and TDM Actors.
Issues #23 and #20: these issues are strongly related to possible protocol errors and how TDM Agents should react to such errors (made by content providers). As a general mechanism, in case protocol errors are detected by TDM Agents, the fallback is to consider that tdm-reservation is unset. tdm-reservation=2 with no tdm-policy property is a kind of protocol error, but applying the general rule in this case was controversial.
The decision of the group is to remove value 2, which ends the controversy and simplifies the spec at the same time.
a TDM Agent which sees tdm-reservation = 1 and NO tdm-policy property will assume that there is NO way to get rights to use the content
a TDM Agent which sees tdm-reservation = 1 and a tdm-policy property set will assume that there is a way to get rights to use the content and will decide if it tries to get more information or stop there.
The issues will be closed as soon as the modification of the specification is approved.
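The decision above can be sketched as a small interpretation function. This is only an illustration, not text from the specification: the `Reservation` enum and the `interpret` function are hypothetical names, the value 0 is assumed to mean that TDM rights are not reserved, and any other value (including the removed value 2) is treated here as a protocol error that falls back to "unset".

```python
from enum import Enum


class Reservation(Enum):
    UNSET = "unset"
    NOT_RESERVED = "not reserved"
    RESERVED_NO_POLICY = "reserved, no policy"
    RESERVED_WITH_POLICY = "reserved, policy available"


def interpret(tdm_reservation, tdm_policy=None):
    """Interpret the tdm-reservation / tdm-policy pair seen by a TDM Agent."""
    value = None if tdm_reservation is None else str(tdm_reservation).strip()
    if value == "1":
        if tdm_policy and tdm_policy.strip():
            # There is a way to get rights; the agent decides whether
            # to fetch the policy or to stop there.
            return Reservation.RESERVED_WITH_POLICY
        # Rights are reserved and there is no way to acquire them.
        return Reservation.RESERVED_NO_POLICY
    if value == "0":
        # Assumption: 0 means TDM rights are not reserved.
        return Reservation.NOT_RESERVED
    # Missing or unknown values: fall back to "unset" (general rule
    # for protocol errors).
    return Reservation.UNSET
```

For example, `interpret("1")` yields `Reservation.RESERVED_NO_POLICY`, while `interpret("2", "https://example.com/policy.json")` yields `Reservation.UNSET` under the assumptions stated above.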
During its early meetings, the group worked on what TDM means in practice, the vocabulary to be used during the project, the goals and requirements for a technical solution, created several use cases and compiled a set of past and existing initiatives with a similar scope.
The group then agreed on three alternative technical solutions for expressing the reservation of TDM rights: one based on HTTP headers, another based on a file hosted on the origin server, and a third based on HTML meta tags. These three solutions correspond to different situations and technical skills.
In May, the group defined a machine readable TDM policy which details how a rightsholder can be contacted and conditions in which a TDM license can be acquired. It was agreed that TDM Policies would be defined as a profile of ODRL 2.
Issue #18: the order in which TDM Agents must check the 3 possible techniques has been integrated into the spec. The participants in the call are comfortable with the proposal, but it is important that other participants check this in detail.
Issue #20 is about tdm property values in case of protocol errors. The draft specification considers that if there is a protocol error, tdm-reservation is considered unset. In this case the TDM Agent will act as if the rightsholder had not provided any information. This is especially true if tdm-reservation = 2 but tdm-policy is not set.
tdm-reservation = 2 indicates that “tdm rights are reserved and a policy is available”. If we consider that a missing (or unreachable) tdm-policy is not a protocol error, then it would be simpler to remove value 2 and keep only value 1 with an optional tdm-policy. In such a case, when the policy is not set or is unreachable, the TDM Actor must simply not process the content during this round of scraping. Issue #23 was opened to discuss this aspect.
There will be no call in August, therefore please use Github issues to discuss the draft specification, which will be updated at least once during August.
The next call of the group is planned for September 7th, at the usual time (3 pm UTC), via the usual Zoom (ask the chairs for details if needed).
A complete version of the draft specification is now available at https://w3c.github.io/tdm-reservation-protocol/spec/. Please review it and open Github issues or comment on existing ones as you see fit. If some of you have started developing prototypes, please check this version of the spec and let us know if important information is missing. Proposal: hold a call on Tuesday July 27th, at 3 pm UTC (= usual time), to discuss whether the draft needs some enhancements for the completion of prototypes.