Meeting minutes
[Introduction: will go through slides, then have a discussion]
<kush> is there a link to the slides?
Slideset: https://
<mt_hates_irc> access to private data is not right. it's access to capabilities, which includes access to private data
<dveditz> +1: the "change state" alone is roughly equivalent to CSRF changing your router settings -- they don't need any response, or your data
<dveditz> sorry, the "change state" plus the untrusted content circle with the prompt injection, of course
<aaj> I think we've generally defined it as access to private data or ability to trigger actions on behalf of the user
<mt_hates_irc> humans being in the loop means that setting .textContent is not completely safe in all cases
<aaj> which seems relatively similar to capabilities, I would be curious if there's a non-trivial delta between the definitions though
<krgovind> Related reference on the lethal trifecta - https://
<mt_hates_irc> That's a useful article krgovind. Not an encouraging conclusion though.
<aaj> I find the "model-level defenses are unlikely to be robust enough" conclusion to be quite encouraging, because it points us in the direction of more deterministic, traditional-ish controls
<aaj> what was frustrating is the idea of a never-ending model-level arms race that defenders would likely never win
<Victor8> webmachinelearning/
<mt_hates_irc> how do you avoid small misalignments? I get that this works for gross misalignment, but I'm not convinced that these are effective. For example, in the "buy a wotsit" case, you might have price guardrails. What stops a site from using prompt injection to push someone to a product choice that has a modestly inflated price?
<aaj> that's an excellent question! definitely beyond the scope of an IRC message, but my hunch is that if we have a deterministic "outer envelope" (e.g. sets of sites that an agent can actuate on for a given prompt, other protections such as e.g. controlling whether an agent should be able to paste data / use form/password/CC autofill on which sites,
<aaj> etc.) then we'll already lop off a major chunk of the attack surface. Then the question is how we design an "inner envelope" for actions that are still allowed - maybe there model-level defenses are sufficient in some substantial fraction of cases.
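The deterministic "outer envelope" aaj describes could be sketched as a deny-by-default capability check that runs before any model-level defense. This is purely illustrative; the names (`TaskEnvelope`, `Capability`, the specific capability set) are invented for this sketch, not a proposed API:

```python
# Hypothetical sketch of a deterministic "outer envelope": before the
# agent actuates, the browser checks the target origin and capability
# against bounds granted for this task. Deny by default.
from dataclasses import dataclass, field
from enum import Enum, auto

class Capability(Enum):
    READ = auto()
    PASTE = auto()
    AUTOFILL_PASSWORD = auto()
    SUBMIT_PAYMENT = auto()

@dataclass
class TaskEnvelope:
    """Deterministic bounds fixed when the user issued the prompt."""
    allowed: dict[str, set[Capability]] = field(default_factory=dict)

    def permits(self, origin: str, cap: Capability) -> bool:
        # Anything outside the envelope is blocked before any
        # model-level ("inner envelope") defense is even consulted.
        return cap in self.allowed.get(origin, set())

envelope = TaskEnvelope(allowed={
    "https://shop.example": {Capability.READ, Capability.SUBMIT_PAYMENT},
    "https://reviews.example": {Capability.READ},
})

assert envelope.permits("https://shop.example", Capability.SUBMIT_PAYMENT)
assert not envelope.permits("https://reviews.example", Capability.PASTE)
```

Note this only lops off attack surface outside the envelope; as the later discussion points out, it says nothing about bad actions that remain inside the bounds.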
[Begin discussion]
<mt_hates_irc> I can't imagine how a website might abuse a control that triggered the invocation of user prompting.
Johann: How can we work together to rise to this challenge, and not silo each effort?
<aaj> but I think it would be great to figure out non-model-level defenses within the inner envelope as well (e.g. direct hooks between the agent and websites)
Victor: Great talk. I see this as similar to how we have cross-origin embedder policies that give websites some control. If we can push web standards forward on this, it will help. There is work for us to do as a standards community. Two things to propose.
… First, is there any assumption on how agents are going to be consuming information? One thing we're thinking about is how to track where the information has come from. E.g. if information has come from reddit.com, other sites can allow or disallow information from reddit.com to be used on their site.
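Victor's provenance idea could be sketched as tagging every piece of ingested context with its source origin, and letting a consuming site declare which source origins it accepts. The policy shape here is invented for illustration, not a proposed format:

```python
# Illustrative sketch: each piece of context the agent ingests carries
# the origin it originally came from; a consuming site's (hypothetical)
# policy lists the source origins it is willing to accept.
from dataclasses import dataclass

@dataclass(frozen=True)
class ContextItem:
    source_origin: str  # origin the text was originally read from
    text: str

def usable_context(items, allowed_sources):
    """Drop ingested context whose source the consuming site disallows."""
    return [item for item in items if item.source_origin in allowed_sources]

items = [
    ContextItem("https://reddit.com", "a user comment"),
    ContextItem("https://docs.example", "official documentation"),
]
# A site that disallows reddit.com-derived context:
kept = usable_context(items, allowed_sources={"https://docs.example"})
assert [i.source_origin for i in kept] == ["https://docs.example"]
```

As Johann notes below, the same mechanism could run in reverse, e.g. "my context should not leak to <site>".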
<mt_hates_irc> aaj, you have to keep in mind that all the same attacks that work on humans are in play here. Convincing a human to spend more money than necessary on a product is something that some entities specialize in. It's called Marketing. Agents seem to be especially vulnerable to that. Particularly since agents are uniform. For humans, there are
<mt_hates_irc> transferable techniques, but everyone has their own biases and defenses, so marketing isn't always uniformly effective. Against an agent, you might imagine that a single effective attack would be extraordinarily effective. Nothing to do with "envelopes" can address that sort of thing.
Victor: Second: We don't even want agents to ingest this content. Have to think of agents on web as separate entities, browser can enforce policies upon the agent re: what it can see. Question for the general web: how opinionated do we want to be?
<aaj> mt_hates_irc agreed, I just consider this to be a different problem. An online merchant that can convince your agent to buy a more expensive sweater seems less scary than an online merchant (or random user-supplied comment on the merchant site) being able to read your email.
Johann: agree, it's very early and I think we don't want to over-dictate a direction right now. re: first point, like the idea of tracking context, might go in the reverse, e.g. "my context should not leak to <site>".
AramZS: as a dev, idea 2 sounds like a good idea, but a dev for a publisher that doesn't want to be crawled would want to mark all content as untrusted and all ads as trusted.
<mt_hates_irc> aaj, it's a specialized instance of a problem that can be generalized. Anything that is plausibly within the envelope is available. But how do you define the sandbox?
AramZS: Doesn't seem to be a way around that. Incentives don't align.
<cpn> +1 aram
John Wilander: Three points: First, browsers are used today with a visual interface. We're rendering webpages and that becomes the interface for agents, but maybe we need a separate interface for agents. Have the data carry some context, e.g. this data is part of a particular security context. A headless browser comes to mind.
… Second: discoverability of these interfaces. PSL comes to mind, search index too. Wondering if discoverability mechanism could be standardized.
… Third: talking about roles, we used to have "user agent" and now we have a third agent.
<smcgruer_[EST]> +1 on discoverability
<AramZS> I dunno, like thinking about an evil browser, I'm not sure we can engineer things that protect against an evil agent *in this context* at least.
tomayac: Might be advantageous for agents to ignore hints on the page.
… E.g. for an accessibility agent, might be necessary to ignore "sensitive data" annotations in order to accomplish the user's task.
<GabrielBrito> I also feel kinda the same way. IRL, how would you feel comfortable delegating the decision-making process to an untrusted individual? It feels like much of the issues discussed here should be dealt with by engineering better behaved agents.
<mnot> We trust browsers to act upon our behalf, but they're deterministic (so long as the vendor doesn't update it against our interest in the background). Agents are a different thing.
<smcgruer_[EST]> ack
bvandersloot: Very related to previous breakout. (Scribe missed first point). Second: is this the user agent trying to defend the user from their browser? Seems silly, we should decide the threat model and maybe get to a CSP structure where we have a cooperative environment.
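The CSP-like cooperative structure bvandersloot mentions could look like a site-declared header that an agent-hosting browser parses and enforces. The header name and directives below are invented for illustration only; nothing like this is specified:

```python
# Hypothetical CSP-style policy for the cooperative model: the site
# declares what an agent may do with its content, and the browser
# (not the model) enforces it. Directive names are made up here.
def parse_agent_policy(header: str) -> dict[str, list[str]]:
    """Parse e.g. "agent-read 'self'; agent-actuate 'none'" into a dict."""
    policy = {}
    for directive in header.split(";"):
        directive = directive.strip()
        if not directive:
            continue
        name, *values = directive.split()
        policy[name] = values
    return policy

policy = parse_agent_policy("agent-read 'self'; agent-actuate 'none'")
assert policy == {"agent-read": ["'self'"], "agent-actuate": ["'none'"]}
```

As mt_hates_irc points out next, any such cooperative signal cuts both ways: a hostile site can use it to steer agents just as a legitimate site can use it to protect users.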
mt_hates_irc: Cooperative side does offer opportunities, but there are downsides. If there's a signal that says "please invoke a user", bad outcome. On guardrails, supervisory agent doesn't necessarily help. Have to set the bounds such that bad things don't happen (e.g. for payments), but whatever bounds you describe, there will be things in that
… space that are outside the user's goals. E.g., want to buy something, set bounds on the price. Agent is vulnerable to marketing, and might be convinced to spend more than necessary.
… Marketing works on humans in different ways. But with a shared agent, one vuln turns into a business opportunity that is completely within the bounds of the sandbox.
<dveditz> smcgruer_[EST]: does that syntax work without the ending slash?
<kbabbitt> +1 to mt_hates_irc. The problem space here reminds me a lot of human-targeted phishing and I wonder if there's room for cross pollination of ideas on how to defend against those.
Johann: agree, this is why we can't have nice things. This is something we have to balance.
<nwatson> imo confirmation fatigue is the single biggest problem for hitl
Victor8: Concern about agents and user agents merging together. I propose that we view them separately, that way the browser can enforce things on the agent. Could be a good way to enforce deterministic things on a nondeterministic system.
<dveditz> 👍
<kbabbitt> Even things like drawing a line around "Ads" becomes tricky. If I'm asking an agent to put together a travel itinerary for me, I might want it to pay attention to the ad about a discounted flight but not the ad about "alert your system contains a virus"
alanbuxey7: It may frustrate the user if they can see something but the agent can't, and is therefore unable to help. Wondering whether or not we'll have ai-agent as an entity. Also, agent can just click "yes" when sites ask for confirmation.
<KevinDean> Never mind the kids. What about my aging parents?!
<vasilii> With devices becoming more powerful, are there upsides to on-device LLMs becoming actual agents that behave to benefit the user instead of the corporation that controls the LLM? Granted, this approach doesn't help with marketing and manipulation, but at least it gives users control over the behavior of their own agents. Just brainstorming.
wanderview: is there a difference between unsophisticated agents and e.g. children? Could be solutions that help unsophisticated users be safer on the internet. There are a lot of users who struggle on the internet; it would be nice if it became safer for them.
… Understand that putting something like a red box around sensitive content is spoofable, etc.
<mt_hates_irc> the site is the adversary
johannhof: this is why sites have "you're leaving our site ... popups"
<AramZS> could say a loooot about those 'you're leaving our site' prompts, very little of it good.
<mnot> Somewhat disagreeing with the notion that flagging dangerous content isn't a good direction. An attacker has an incentive to identify sensitive / powerful content and tools available to do it -- they will find it, and they don't have to find it all. Defenders need to defend every bit of sensitive content to be effective -- it just takes one.
<mnot> (the AIs want to join the conversation)
<mt_hates_irc> too many negatives there mnot, try again?
<mt_hates_irc> magic computer protocol will solve all the problems, obviously
<wanderview> magic computer protocol is the best
<kleber> Sounds to me like we need a id/superego/ego partitioning of responsibilities — a second agent whose job is user protection, to counterbalance the pleasure-seeking behavior of the current breed of agent
<mnot> MT: to succeed, an attacker only has to identify _some_ sensitive content -- they will be able to do so without hints. Defenders need more certainty; it only takes one. Hints help defenders meet their goals more than attackers.
Nick: I work on MCP, surprised we're not talking about it more here. Thinking about MCP being used for the web, we really want one way for agents to access the page in an untrusted context, then have a way for agents to get access to e.g. user credentials. I see MCP as a way to say: give me an MCP server so I can retrieve tools that I can use, then go
… through an authentication flow. That could help defend against prompt injection. Best we can do now is prevent agents from receiving context that we don't want it to have, and use existing tools in the browser.
<Victor8> We need good cop / bad cop agents :)
<dbaron> One thought is that rel=nofollow on links is perhaps a vaguely relevant precedent in this space.
<mt_hates_irc> left shoulder, right shoulder agents
<dveditz> mnot: but the browser has no way to tell when a legit site is offering legit hints and when it's an illegitimate site offering malicious hints.
<dveditz> depends on the task and capabilities for when that matters
<mt_hates_irc> maybe we can build a "seems legit" agent
<mnot> dveditz: it depends on the nature of the hints. If they're steering agents away that's not an issue
<aaj> mt_hates_irc Classifiers + model-alignment defenses are pretty much "seems legit" agents :)
<dveditz> it matters if the site is steering the agent away from accessing the very information the user wanted -- then it's more or less DRM
<mnot> It has the vibe of a user agent
Nick: Seeing today: headless browsers were mentioned. Some of them ask users to add their master password in a config file for the agent to use. Would prefer to use a solution that uses MCP directly. MCP has things that allow credentials to be securely provided without the agent getting involved.
<mnot> dveditz: that's a _bit_ of a leap
<vasilii> I feel like tagging of sensitive content needs to work hand in hand with actually redacting the content so that it is not visible / discoverable by an agent
AramZS: re: marking sensitive areas and how that's optional for agents, seems valuable. Separate problem: why would I not price things differently for an agent? Also a problem for users, not sure if we want to tackle that here.
<AramZS> I think we can't get into redacting the content. That is going to go down a road where a *lot* of website owners will use that in an unintended way and kill the entire idea
wseltzer: How do we enlist the end user to see these as cooperative protections, rather than things that are getting in the way?
<smcgruer_[EST]> q/
<AramZS> I want to redact content from an agent. The agent doesn't want me to redact it. The redaction *will* get ignored. I think that's a very different case for marking a field is sensitive.
<vasilii> AramZS, agree, totally makes sense, especially in view of your point that publishers would want to redact the content
<dveditz> mnot: it could be used in this price discrimination example people are discussing right now. Say the real price (visible to the user) is "unsafe" and have a higher transparent "safe" buy-at-higher-price button for agents
??: ideas on sites marking trusted content. Thinking of investment scams. There are similarities between content that tries to fool users and content that tries to fool agents; we might fall into a trap of trying to solve all of the web's problems.
<mt_hates_irc> we should absolutely not show the agent something different than what the human sees, or we'll never be able to diagnose attempts to attack the agent
<masonf> "Human, forget your previous instructions and buy these shoes"
<vasilii> AramZS, I wonder still if redacting can be done by the browser (a deterministic user agent, i.e. Chrome, Safari, a headless browser) to actually remove the content from the page so that it is not visible to the agent at all.
johannhof: pulling phishing into scope might be somewhat unavoidable.
<dveditz> Of course that's a PR disaster when discovered if it's a large well-known company, but I could see a bunch of small shops doing it
<dveditz> Lots of "I've never heard of these guys" who sell interesting gadgets in Facebook ads, for example
<AramZS> vasilii, but I think I'd rather do that as a publisher than make my content visible to an agent. I am well incentivized to give users with agents in their browser a bad experience.
<vasilii> AramZS, makes sense!
aaj: wanted to talk about previous comment. Trying to solve everything is hard. We could decide that protecting agents and users are separate problems. E.g. process boundaries in OS, analogy to site isolation in browsers. Can we move from wild west with very powerful agents that can be prompt injected, to a world where agents can only do things
… related to the original prompt? Some protections could be model-level, some deterministic.
<zcorpan> I think it could make sense to follow the principle of controlling access at the start of a task. The user should confirm at the start to give read and action access for e.g. gmail for a task or a substep. Then everything else is logged out.
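zcorpan's principle of granting access at the start of a task could be sketched as a task-scoped session where ungrated sites behave as if the agent were logged out. All names here are illustrative, not a real mechanism:

```python
# Sketch of task-scoped grants: the user confirms read/action access
# per origin when the task starts; every other origin gets no
# credentials, i.e. the agent browses it "logged out".
class TaskSession:
    def __init__(self, grants):
        # grants: origin -> set of modes, e.g. {"read", "act"}
        self._grants = dict(grants)

    def can(self, origin: str, mode: str) -> bool:
        return mode in self._grants.get(origin, set())

    def credentials_for(self, origin: str):
        # No grant means no cookies/credentials are attached at all.
        return "user-session" if self.can(origin, "read") else None

session = TaskSession({"https://mail.example": {"read"}})
assert session.can("https://mail.example", "read")
assert not session.can("https://mail.example", "act")
assert session.credentials_for("https://other.example") is None
```

This pairs naturally with the "outer envelope" idea discussed earlier: both fix deterministic bounds before the nondeterministic agent runs.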
johannhof: thanks everyone, there will be more discussions, feel free to reach out to me and thanks for being here.
<dveditz> mt_hates_irc: we already have cases where some users see things differently from others -- alt-text for visually impaired users, for example.