The Impact of Generative AI on the Web

Meeting minutes

<Vagner_NIC_br_> start meeting

<Vagner_NIC_br_> diogo: agenda

<Vagner_NIC_br_> diogo: data collection on the Web

<Vagner_NIC_br_> diogo: integration of LLMs in a search engine

<Vagner_NIC_br_> diogo: production and publication of synthetic contents on the Web

<Vagner_NIC_br_> diogo: we hope to anticipate challenges, understand the issues, and plan to use the input at the IGF meeting.

<Vagner_NIC_br_> diogo: three main questions to be discussed

<Vagner_NIC_br_> ... 1) what are the limits on scraping web data to train generative AI

<Vagner_NIC_br_> ... 2) what are the potential impacts of incorporating LLMs

<Vagner_NIC_br_> ... 3) how can the Web help detect AI-generated content

<npdoty> @@: directive in the EU, regarding Text and Data Mining, and allows for opt-out through machine-readable means

<npdoty> ... generic definition, but it is possible

<npdoty> ... can't specify every other reason that a publisher might not want their data to be used

<npdoty> ... Community Group, TDM Reservation Protocol, machine-readable way for publishers to opt-out .. and how to contact me if you want to use the data

<npdoty> ... some publishers are using it, in France, Sweden, etc.

<npdoty> ... people still fear that companies like Google who use AI for translation, snippets, etc.

<npdoty> ... need to be sure that saying no to TDM doesn't mean no to indexing

<npdoty> gendler: thanks for starting us off. News Corp is a conglomerate of newspapers and television

<ReinaldoFerraz__> +present

<npdoty> ... three distinct questions: 1) how users can express opinions; 2) scraping content on the Web generally; 3) how can the Web help detect AI-generated content

<npdoty> ... US news publishers have been thinking about this question, don't have a formal proposal, but do have an idea that we have started to build consensus around in the US

<npdoty> ... see the best place to do the work is a technical standards body

<npdoty> ... purpose is not something we can express easily in machine-readable form

<npdoty> ... don't have specific law to be aiming for in the US

<Vagner_NIC_br_> s /xxxx/how can Web help detect AI-generated content/

<npdoty> ... things that aren't covered by terms of service

<npdoty> ... crawlers indicate that they don't have any way to know that scraping for a particular purpose isn't allowed

<npdoty> ... need to be able to indicate what purpose the scraper is grabbing this for

<npdoty> ... arguing about how specific the purposes should be

<npdoty> ... would prefer to align this with copyright law in general, taking inspiration from Creative Commons

<npdoty> ... attribution, commercial, derivative works

<npdoty> ... similar formatting to how robots.txt works

<npdoty> ... interested in feedback from technical experts
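
The robots.txt-style, per-crawler opt-out discussed above can be sketched as follows. This is a minimal illustration, not a proposal from the session: it uses Python's standard `urllib.robotparser`, and the crawler names `ExampleAITrainingBot` and `ExampleSearchBot`, the robots.txt content, and the `example.org` URL are all hypothetical. It shows that a publisher can refuse one crawler while still allowing indexing by others, and also that robots.txt only distinguishes crawlers by name, not by the purpose of the crawl.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical robots.txt: refuse a (made-up) AI training crawler,
# allow every other crawler, including search indexers.
ROBOTS_TXT = """\
User-agent: ExampleAITrainingBot
Disallow: /

User-agent: *
Allow: /
"""


def build_parser(text: str) -> RobotFileParser:
    """Parse robots.txt text held in memory (no network access)."""
    parser = RobotFileParser()
    parser.parse(text.splitlines())
    return parser


parser = build_parser(ROBOTS_TXT)
url = "https://example.org/article"  # hypothetical page

# The decision depends only on the crawler's declared name.
can_index = parser.can_fetch("ExampleSearchBot", url)
can_train = parser.can_fetch("ExampleAITrainingBot", url)

print("search crawler allowed:", can_index)
print("training crawler allowed:", can_train)
```

Because the rule is keyed on the user agent, a single crawler that both indexes and collects training data cannot be given a split answer this way, which is the gap the purpose-based signals discussed in this session are meant to address.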

<Zakim> dsinger, you wanted to add a different main question

<npdoty> dsinger: embedded URLs make up a big part of the Web. but even for AI engines that generate references to the Web, those citations are just a plausible set of citations, rather than the actual data source that is relied upon

<npdoty> ... and some don't provide citations/urls at all

<npdoty> ... no assurance that the same AI engine will give the same output to a different person

<npdoty> ... not a Web technology, if it's not pointing in or out

<npdoty> ... is it an existential threat to the Web at all?

<npdoty> DiogoCortiz: may point to the second question on the slide

<fremy> q+

<npdoty> ... different players trying to include yyyy

<npdoty> ... Google is reading my website and using the response, and may make a reference to my website. but users still might not use my website any more

<npdoty> ... and so it could change the traffic distributed on the web, or move the power to search engines

I can take over scribe

tzviya: One of the things we have at Wiley, a publisher of books and journals, I work on governance ...
… whether we like it or not, Gen AI exists. Whether or not it's an existential issue, the toothpaste is out of the tube....

<Zakim> tzviya, you wanted to ask if we are writing for the AI engines or AI users

<tzviya> https://syntheticmedia.partnershiponai.org

tzviya: what specs can we write for the folks that create and use the engines? It is worth taking a look at the materials that have been created in the Responsible Practices for Synthetic Media...
… these are best practices created by orgs you will know well. While the proposed TDM solution is good, it's unclear whether it would be respected when added as a signal on the web.

npdoty: Nick Doty, CDT. I believe it's very important to have these publisher perspectives in the room. Important industry...
… I urge that when we look at these controls, we look past copyright. All these big companies will sue each other, good luck. Most of us will not be part of those lawsuits, but we will have plenty of interests....
… privacy is an issue that is not going to be well covered by copyright issues....

<Zakim> npdoty, you wanted to comment on something not focused purely on copyright

npdoty: In response to David Singer, there is some standards work that we can do around provenance.

fremy: Thank you very much, I wanted to address the question of whether not having links destroys the web as we know it. Naturally, there will be a stronger connection between search engines and data content providers...
… most of the queries that matter to people are questions of a real-time nature: sports, political events...
… basically if you want to provide a service like G search or Bing, these things will exist, but people will create and provide high quality data.

<bdekoz> hey can somebody repost the link to the google/meta document about generative content guidelines?

danbri_: I would like to respond to David, I believe that the concept of the website may be archaic in ten years. Some people think this is a threat, others think it's massive hype....

<bdekoz> thanks!

danbri_: it could be as big as the web, could be a blessing, could be a curse. We've made a change here, around computers who understand natural languages. How does that change issues?

<Zakim> danbri_, you wanted to respond to David

danbri_: it could be good for technical reasons, we could summarize millions of pages with an LLM, turn those into 10,000 pages, a tweet, a tag, it might be much better from an eco-wide perspective. What will be the need for browser bookmarks? Could these systems adjust how we use browsers, how they're organized? Some things will be expedited...
… possible that medical publishers will advance faster with LLM help.

<Zakim> dsinger, you wanted to offer that we could write guidelines and best practices

dsinger: Answering my own question, we tend to think that we can write a specification to fix these issues, but I think we should focus on guidelines, for things that help build best practices....
… if you train on systems that work on high quality writing, but not truth, you can't then claim your system builds truth.

Juanita: I find that it would be interesting to figure out how we can democratize information and education around the world....
… there are issues with governments that may try to block access to generative AI, and we should think of these issues.

diogo: Quick response to this, we are from Brazil and we have similar questions around this. We have research around the topic of bias in LLMs, for example...
… in ChatGPT in Portuguese, in Google Bard, there is a lot of bias especially with our cultural work. The bot responds in Portuguese, but it's clear the content is based on American culture....
… we need to make sure more data is in those LLMs, so that a global perspective is covered.

brian: I think it's an interesting question, as we discuss the shape of the web post gen-AI, we need to think of gen-AI as potentially having a 'napster moment', where this new technology affects how people interact with the web...
… how will people use information without permission, how will that change things similarly to how things didn't matter if they were physical.

bdekoz: Hello, I am from Africa, and very excited about this conversation around Gen AI. I'm glad we've woken up to all of the possibilities...
… a question from the perspective of those who create content. A question for the big companies: are you going to block generated content from your systems? Someone can create genAI content just in an attempt to monetize it; will you penalize those who do not create work from scratch? Will you protect journalists who need to put in the groundwork on these questions and on creation?

danbri: For images, we've collaborated with the ? to help with embedding information into the metadata in order to help us understand whether it came out of a generative tool...
… but not subtle enough. Could be a completely generated image, or something that upscaled a low quality photo...
… text is different. I think there's been a risk of LLM models being used by all sorts of supports. It's hard to find whether something is from a generative model as opposed to a human writer. You end up with material that is part you, part machine. Should that be on the content creators? People may want to find cheaters, but will that take away tools for accessibility....
… there is no single answer on what we can rush into, we need to be careful about how we handle this.

Diogo: Google has updated their policy, so that if you create political content with AI systems, you need to label this content. But how you catch those seems to be a technical challenge.

bdekoz: I'm Benjamin from Mozilla. On the limits of scraping content, and protecting creators, robots.txt is not sufficient. We need something like a generative.json with more options than what we're getting out of robots.txt....
… creators need prompted metadata or watermarks that will clearly indicate when Gen systems are used, and artistic style should be referenced and labeled....
… we should make sure this information is connected to the image.

diogo: There have been legal questions around how we can or cannot copyright prompts. Some say prompts are like an essay, and so they are significant enough to be covered by the copyright laws.

<tzviya> https://www.w3.org/TR/webmachinelearning-ethics/

<Zakim> tzviya, you wanted to ask what standards/guidance w3c can write

tzviya: I think we should focus on what kinds of standards or guidelines we can actually write in W3C. Referencing the Ethical Principles for Web Machine Learning....

<zolkis> https://www.w3.org/TR/webmachinelearning-ethics/

tzviya: it walks through many of the issues, including the possible exploitation of private information. We have not focused enough on issues outside of copyright. One that was briefly touched on was bias in a system becoming worse as it gets built upon. You need human intervention, but how do you write guidance for what that intervention needs to be? We certainly need more guidance.

<bdekoz> diogo: beyond copyright, in EU/USA there are VARA rights for artists that allow disclaiming artworks and disassociating creators from works that become broken or mangled by their owners

humera: My question is in reference to how the economy will be transformed. What will the impact of search results be when the ads are embedded into the chat bot? How will ad blockers affect this? How will this transform the economics of the web and how ads play a role?

anssik: I would like to follow up on tzviya's point on ethical principles. I was part of the taskforce that worked on this. We lost some contributors, and welcome new members to take part and contribute to the guidelines. I will put a link in IRC. Please follow up!

<anssik> https://www.w3.org/TR/webmachinelearning-ethics/

Juanita: are there any standards that could be written to protect ChatGPT against governments or bad actors that would be interested in creating misinformation versions of ChatGPT?

<npdoty> https://c2pa.org/

nick: So many great ideas, it sounds like we have a few areas of work to consider. I wanted to follow up on some of the ideas about provenance. The need to say where this image or writing came from, or the multiple steps that are involved. One source to look at is the Coalition for Content Provenance and Authenticity (C2PA)...

<Zakim> npdoty, you wanted to comment on content provenance, https://c2pa.org/

nick: this is related to a problem we've already seen, how has this image been altered. We have many questions about misinformation, we don't need AI to make misinformation....
… this may be a mechanism that many people find useful. How do we indicate how our content was developed, and what steps were taken.

david: I think you're absolutely right, this problem and the problem of misinformation are intertwined. We've made a conclusion jump, that we don't know enough, and so we have to not say much. We should be much more urgent to talk about it.

<bdekoz> npdoty: is there a way for human content creators to indicate authenticity or embed some kind of private key that can be authenticated later?

<Zakim> dsinger, you wanted to talk about the need to discuss and be visible

<npdoty> gendler: prompted by npdoty, generative ai often talked about as a completely new thing

<tzviya> +1 AI is not new

<npdoty> ... thinking along similar lines has happened. generative ai is only exacerbating issues that have already existed for the web

<dsinger> Because we don't know how to solve a problem, we think we should avoid discussing it or talking about it. That's a conclusion jump. In discussion, and in publication of discussion papers or guidelines, we may well make progress.

<npdoty> ... people have been working on them, but now ai may have accelerated the concern

<npdoty> ... look at work that has already been done

<AWK_> Resource for image provenance metadata: https://contentauthenticity.org/

brian: I wanted to follow up on my earlier point, when we discuss things such as provenance, we're talking about being able to trace a relatively discrete thing from one point to another point...

<npdoty> +1 AWK, that was a work item from the c2pa group that I was mentioning

<bdekoz> AWK_: thanks

brian: in the world of AI, I'm seeing many points being thrown into one. I don't understand how you indicate the 100 million provenances...
… if a new input is put into the machine, how do we plot those inputs? We need to think of the world in a new way.

<npdoty> gendler: useful again to look at the work that has already been done

<npdoty> ... algorithmic auditing work may be relevant to large language models

<npdoty> ... potential references to look at from CDT

Diogo: we solicit comments on a conclusion.

Diogo: the idea of this session was to generate insights, not to discuss one specific point but to hear many perspectives on the impact of LLMs and generative AI on the web...
… we are seeing different types of concerns, and we will clearly be having more conversations in the coming years, because it's not just on the technical side, there will be regulations that will impact how the technical side should work...

<bdekoz> npdoty: can you link to the google proposal plz?

Diogo: for example in Brazil, they are trying to pass a new law for copyright that is quite different from EU systems...

<npdoty> google has described interest in this work, not a specific technical proposal: https://blog.google/technology/ai/ai-web-publisher-controls-sign-up/

Diogo: it allows research, but companies will not be allowed to scrape. So the technical ways will have to be different for different regulations.

<Zakim> npdoty, you wanted to comment on tdmrep and google proposal re robots.txt opt-out

nick: In terms of the concrete work, and where to continue the conversation, for the specific area of reserving rights, there has been a public proposal from Google where they would like to start a conversation. The question is whether the TDMRep CG is the right place to discuss this, or a new CG.

?: could technologists help detect AI generated content?

+1 tzviya on GAI CG

<danbri_> re iptc, acquireLicense schema —- https://x.com/brendanquinn/status/1375011632137580544

<npdoty> I'd also be happy to see a GAI CG

<cpn> +1 to a CG

Diogo: certainly that will be the challenge in the next few years, but it seems like the issue for text will be significantly different to the work for images. Different companies are using different approaches, but how do we add watermarks to a piece of content...

<bdekoz> +1 to a CG

<Vagner_NIC_br_> close queue

Diogo: can we create a standard or guideline on how we detect if this is posted on the web? Is that possible?

<danbri_> For the non-AI misinformation world, we have ClaimReview and MediaReview schemas, eg see. https://fullfact.org/blog/2020/mar/how-we-spoke-77-fact-checkers-and-helped-21-them-start-using-claim-review-schema/

<cpn> also, to dwsinger's point on misinformation, we had a Credible Web CG, now inactive. can we restart it?

<Vagner_NIC_br_> close queue

david: I would like to make sure that this discussion doesn't evaporate if we leave the room. Should we write a paper or two? That may lead to guidelines, that may lead to specifications. Let's get a CG together to publish things, publicize concerns, etc.

<npdoty> I don't think we should do it slowly, but I'm happy to in parallel work on guidelines and some technical proposals

<Laurent__> about TDM Reservation Protocol: https://www.w3.org/community/tdmrep/


13 September 2023
