Community & Business Groups

Proposed Group: Update robots.txt standards Community Group

The Update robots.txt standards Community Group has been proposed by Hans Petter Blindheim:


Robots.txt is currently based on opting out: you state what you do not want your website to be a part of.

This is hard to maintain (almost a full-time job right now) if you do not wish for your website's content to be used for, e.g., training AI, market research (e.g. price robots), non-search-engine databases and more.

This proposal is to update the type of instructions robots.txt supports so that it is instead treated as an opt-in, where you give instructions based on the intent of robots rather than with a wildcard or in granular detail.

Example:
Agent-group: searchengines

Applies to all robots that seek to update, process or maintain websites for search engine databases. It does not grant permission to use scraped data for AI purposes (that should have its own Agent-group).

Also, the absence of instructions should be treated as not having opted in, and for robots working on behalf of AI there might need to be additional instructions (e.g. max-snippet, and whether you require a citation if your content is used to provide an answer).
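
To make this concrete, an opt-in robots.txt under this proposal could look something like the sketch below. The Agent-group values and the Max-snippet and Require-citation lines are hypothetical illustrations of intent-based instructions, not existing robots.txt directives:

Agent-group: searchengines
Allow: /

Agent-group: ai-training
Allow: /blog/
Max-snippet: 150
Require-citation: yes

# No Agent-group entry for e.g. market-research robots: since nothing has
# been opted in for them, they receive no permissions.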


You are invited to support the creation of this group. Once the group has a total of five supporters, it will be launched and people can join to begin work. In order to support the group, you will need a W3C account.

Once launched, the group will no longer be listed as “proposed”; it will be in the list of current groups.

If you believe that there is an issue with this group that requires the attention of the W3C staff, please send us email at site-comments@w3.org

Thank you,
W3C Community Development Team

7 Responses to Proposed Group: Update robots.txt standards Community Group

  • Hello,

    Can you say more about how this group’s work would relate to the IETF’s work on the Robots Exclusion Protocol?

    Reply

    • Hans Petter Blindheim

      I can only hope for what may come from this (and, to be upfront, if that is to happen I will need a lot of help, as this is my first rodeo and my first time in the saddle), which is to challenge this standard in order to:

      – Rather than have it work as an exclusion (opt-out) protocol, have it work as an inclusion (opt-in) protocol
      – Instead of having to spend countless hours reading up on different bots, their handles and their support of robots.txt (or perhaps the need to block them at the server level, if that is what you desire), focus more on common use cases (the robot's intention)
      – Try to address the growth of AI bots and how they use content, in order to keep some level of control

      Reply

  • Hans,

    Given that the Robots Exclusion Protocol is active work of the IETF, I want to recommend that you reach out to them with your ideas rather than launch a W3C Community Group. Here is the home page for that effort:
    https://datatracker.ietf.org/doc/rfc9309/

    (I accidentally clicked “support this group” but I recommend that you pursue this in the IETF.)

    Reply

    • Hans Petter Blindheim

      Thank you for the suggestion. I have now reached out to the people who created and signed the “Robots Exclusion Protocol” in the hope of getting their thoughts on this. However, three of them are employed at Google (and while I have met one of them and know him to be concerned with the betterment of the internet, such a flip of the standard would go against Google's interests, since it would mean Google could be held liable and would likely also need to spend a lot of time and money to adhere to new standards). So I am not sure whether that will bear fruit, even though I do think this would be a better move than waiting for regulation to do the same.

      Reply

  • If this really is an extension to robots.txt, then it should probably happen at the IETF.

    However, the problem space that it addresses seems to be mostly about rights reservations. If so, it might be worth reformulating this as an iteration on TDMRep (https://www.w3.org/2022/tdmrep/). TDMRep lacks a lot of expressive power, but it could be extended.
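
    As a rough illustration only (the authoritative syntax is in the TDMRep report linked above; the field names and values below are just a sketch of how it is commonly described), a site can declare a text-and-data-mining rights reservation through a /.well-known/tdmrep.json file along these lines:

    [
      {
        "location": "/*",
        "tdm-reservation": 1,
        "tdm-policy": "https://example.com/tdm-policy.json"
      }
    ]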

    I would like to further recommend that:

    1) The group is welcoming to relatively non-technical people because part of the work needs to be done by lawyers.

    2) It can move relatively fast, which reusing TDMRep would help with.

    3) It’s clearly independent from some vendors’ attempts to own the conversation.

    This is an important brick to preserve quality web content.

    Reply

    • Hans Petter Blindheim

      This might be a venue to pursue, but wouldn’t it be easier to just tag them in here?

      Also, I don’t think this would suffice even if it could be used, for several reasons, such as:

      – Most web service owners do not possess, themselves or via contractors, developers and other parties, sufficient knowledge to do this

      – Even if they did gain knowledge of the ability, controlling this at a page level would be costly and complex, when the end purpose should be that rights are respected; uses that go beyond what web service owners can reasonably be expected to be comfortable with simply lack regulation

      – This would not limit the load on the server from bots requesting documents (robots.txt would/could; I see no reason why any argument should be made that web service owners should be expected to cover this cost)

      – And a whole lot of CMS providers, hosting solutions, etc. are not set up in a manner that would offer web service owners this level of control (so in practice, for most, this venue would not give the same outcome even if there is interest in resolving it – unless they pay a lot to change providers and migrate their online presence)

      Reply

  • Thanks @Robin.

    @Hans, we’ll hold off creating this group until we can get a bit more consensus on where the natural home for this work is. I look forward to hearing your thoughts on Robin’s comments.

    Reply
