Call for Participation in Update robots.txt standards Community Group

W3C Team | Posted on: February 2, 2024

The Update robots.txt standards Community Group has been launched:

Robots.txt currently works on an opt-out basis: you list what you do not want your website to be a part of.

This is hard to maintain (almost a full-time job right now) if you do not want your website's content to be used for, e.g., training AI, market research (e.g. price robots), non-search-engine databases, and more.

This proposal is to update the types of instructions robots.txt supports so that it is treated as an opt-in, where you can give instructions based on the intent of robots rather than by wildcard or in granular detail.

Example:
Agent-group: searchengines

Applies to all robots that seek to update, process, or maintain websites for search engine databases. Does not grant permission to use scraped data for AI purposes (this should have its own Agent-group).

Also, the absence of instructions should be treated as not having opted in. For robots working on behalf of AI, additional instructions might be needed (e.g. max-snippet, or whether you require a citation when your content is used to provide an answer).
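
To make the opt-in idea concrete, here is a minimal sketch of how a crawler might interpret such a file. Only the `Agent-group` line comes from the proposal text above. The `Allow` line, the `ai-training` group name, and the helper functions `parse_agent_groups` and `is_allowed` are assumptions made purely for illustration; the group has not defined any syntax.

```python
def parse_agent_groups(text):
    """Map each Agent-group name to the path prefixes it opts in to.

    Assumes a hypothetical syntax: an "Agent-group: <name>" line followed
    by zero or more "Allow: <path-prefix>" lines.
    """
    groups = {}
    current = None
    for raw in text.splitlines():
        line = raw.split("#", 1)[0].strip()  # drop comments and whitespace
        if not line or ":" not in line:
            continue
        field, value = (part.strip() for part in line.split(":", 1))
        field = field.lower()
        if field == "agent-group":
            current = value.lower()
            groups.setdefault(current, [])
        elif field == "allow" and current is not None and value:
            groups[current].append(value)
    return groups


def is_allowed(groups, intent, path):
    """Opt-in semantics: anything not explicitly allowed is denied."""
    prefixes = groups.get(intent.lower(), [])
    return any(path.startswith(prefix) for prefix in prefixes)


EXAMPLE = """\
Agent-group: searchengines
Allow: /
"""

groups = parse_agent_groups(EXAMPLE)
print(is_allowed(groups, "searchengines", "/articles/1"))  # True
print(is_allowed(groups, "ai-training", "/articles/1"))    # False
```

The second call returns False because no `ai-training` group appears in the file, which mirrors the rule that the absence of instructions means not having opted in.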

In order to join the group, you will need a W3C account. Please note, however, that W3C Membership is not required to join a Community Group.

This is a community initiative. This group was originally proposed on 2023-10-02 by Hans Petter Blindheim. The following people supported its creation: Hans Petter Blindheim, Brijesh Gohil, Robin Berjon, Max Gendler, Hallison Brancalhão, Gustavo Henrique Quinalha. W3C’s hosting of this group does not imply endorsement of the activities.

The group must now choose a chair. Read more about how to get started in a new group and good practice for running a group.

We invite you to share news of this new group in social media and other channels.

If you believe that there is an issue with this group that requires the attention of the W3C staff, please email us at site-comments@w3.org.

Thank you,
W3C Community Development Team

One Response to Call for Participation in Update robots.txt standards Community Group

Jonathan ✪ May 3, 2024 at 6:00 pm

This cannot plausibly work.

robots.txt is already an opt-in system, by virtue of not being enforced.

Even if the entries in robots.txt were opt-in, all crawlers would still have to comply with the scheme. Since most crawlers don’t care, what’s the point?

Surely, the answer should lie in a standard deny/accept protocol as HTTP headers within web servers. If this was the case, crawlers would actively be denied if you don’t want them, or accepted if you do.

IMHO, changing a scheme from opt-out to opt-in for an existing opt-in system is a waste of energy. This should go directly to the web server.
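
The server-side deny/accept idea raised in the comment above could look roughly like the following WSGI sketch. No such protocol exists today: the `Crawler-Intent` request header, the intent names, and the `ALLOWED_INTENTS` policy are all hypothetical, chosen only to illustrate the shape of the idea.

```python
# Hypothetical site policy: the crawler intents this site accepts.
ALLOWED_INTENTS = {"searchengines"}


def app(environ, start_response):
    """Minimal WSGI app that denies crawlers whose declared intent is not accepted."""
    # WSGI exposes a "Crawler-Intent" request header as HTTP_CRAWLER_INTENT.
    # The header name itself is an assumption, not part of any standard.
    intent = environ.get("HTTP_CRAWLER_INTENT")
    if intent is not None and intent.strip().lower() not in ALLOWED_INTENTS:
        start_response("403 Forbidden", [("Content-Type", "text/plain")])
        return [b"Crawling for this purpose is not permitted.\n"]
    start_response("200 OK", [("Content-Type", "text/plain")])
    return [b"Page content.\n"]
```

In this sketch the server refuses the request outright instead of relying on the crawler to read and honour a file. It still depends on the crawler declaring its intent honestly, since a request without the header is served as normal.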
