Meeting minutes
<gb> https://github.com/webmachinelearning/meetings/issues/35
Repository: webmachinelearning/webmcp
Welcome
Anssi: welcome to the W3C Web Machine Learning CG F2F at TPAC 2025
… I'm Anssi Kostiainen, Intel, the chair of the CG, also chairing the WG that works closely with this CG, which serves as an incubator
… as a recap from yesterday:
… this CG is a group where new ideas are discussed, explored and incubated before formal standardization
… past CG spec incubations include e.g. WebNN API, Model Loader
… since last year, we've expanded the scope of the CG
Anssi: the CG has added in scope and delivered a number of new incubations
Anssi: we have delivered first versions of new built-in AI APIs:
… Prompt API
… Writing Assistance APIs
… Translator and Language Detector APIs
… and at the explainer stage we have:
… WebMCP API
… Proofreader API
reillyg: can we add a short presentation on the built-in APIs and developer feedback before we dive into individual APIs?
anssik: sure, remind me at the start of that session
Anssi: this CG has grown significantly over the last year similarly to its sister WG
… obviously the intersection of the web and AI is exciting and this group is the place where the future Web AI experiences are incubated ahead of broad market adoption
… the year-over-year growth rate of this group is around +30% for both organizations and participants, so the diversity is growing too
https://
Anssi: many businesses looking to adopt the technologies developed in this group have joined, which is important as it allows us to capture real-world feedback early on
… the nature of these "task-specific APIs" is that they're easy to adopt if your requirements match
… when more control over the experience is required, WebNN API provides the lower-level primitives to build your own
… if you registered as a CG participant, please join us at the table
… observers are welcome to join the table too subject to available space
Anssi: we use Zoom for a hybrid meeting experience, please join using the link in the meeting invite
Anssi: we use IRC for official meeting minutes and for managing the speaker queue
… please join the #webmachinelearning IRC channel, link in the meeting invite and agenda:
https://
webmachinelearning/
<gb> Issue 35 WebML WG/CG F2F Agenda - TPAC 2025 (Kobe, Japan) (by anssiko)
Anssi: to put yourself on the queue type in IRC "q+"
… during the introductions round, we'll try to record everyone's participation on IRC with:
… Present+ Firstname_Lastname
… please check that your participation is recorded on IRC
Agenda bashing
Anssi: the F2F agenda was built collaboratively with CG participants
webmachinelearning/
<gb> Issue 35 WebML WG/CG F2F Agenda - TPAC 2025 (Kobe, Japan) (by anssiko)
Anssi: any last-minute proposals or updates?
WebMCP
Repository: webmachinelearning/webmcp
Intro & demo
Anssi: the WebMCP abstract reads:
… "Enabling web apps to provide JavaScript-based tools that can be accessed by AI agents and assistive technologies to create collaborative, human-in-the-loop workflows."
… TL;DR unpacking:
… - WebMCP API is a new JS API that allows a web developer to expose their web app functionality as "tools"
… - web developer is in control what tools the web page exposes and how they function
… - the web page acts as an MCP Server equivalent, but implements the tools in client-side script, not backend
… - tools are simply JS functions with associated natural language descriptions and structured schemas
… - the natural language descriptions allow AI agents to invoke the programmatic "tools" API
… - there's also a complementary declarative WebMCP API being worked on, reusing ARIA role-* attributes
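[ Editorial note: a minimal illustrative sketch of an imperative tool declaration. It assumes the navigator.modelContext registry referenced later in these minutes; the registerTool method name, option shape and the addItemToCart helper are illustrative, not final API. ]

navigator.modelContext.registerTool({
  name: "add_to_cart",
  description: "Adds a product to the shopping cart by its product id",
  inputSchema: {
    type: "object",
    properties: { productId: { type: "string" } },
    required: ["productId"]
  },
  // The tool body is plain client-side JavaScript running in the page.
  async execute({ productId }) {
    await addItemToCart(productId); // hypothetical page helper
    return "Added product " + productId + " to the cart.";
  }
});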
Anssi: the WebMCP proposal is a synthesis of two proposals from Microsoft and Google
… the group has now converged on the initial proposal documented in the explainer and supplementary proposal document that contains API shape and code examples
Anssi: the group has received early implementation experience and feedback from OSS community contributors Alex Nahas and Jason McGhee, which has been very valuable
Anssi: both these OSS projects have explored the problem space ahead of browser implementations and without some of the constraints of a browser implementation
… both Alex and Jason are active participants of the group
Kush: it was a good surprise that the Google and Microsoft teams had very similar APIs in mind, a good signal for future convergence
… we've started with a very simple API shape, but as we're advancing, it looks like browsers will have an important role to play as a trusted mediator
… we want to make sure developers have control over how their site can be used by agents
… also nice to see interest from the community in this space
sushraja: this started from seeing the enthusiasm of the community around MCP, and we were pleasantly surprised to see Google with a similar proposal
… the explainer is still very open to changes
… lots still need to be figured out in terms of API shapes, capabilities, security, ...
… very open to feedback, in particular from potential users of the API
Anssi: in this demo Alex demonstrates a user interacting with an AI agent built into the browser
… Alex has a conversation with an agent using a voice interface and asks the agent to search for specific information (the WebMCP explainer) and send a short summary of this information via email to a specific email address
… in another demo, Alex uses the Prompt API integration
… the demo uses deterministic API access to tools provided by the website and does not do any DOM parsing or computer vision-based image processing of screenshots
… that means it is secure and auditable
… in the spirit of the human-in-the-loop workflow, the end user can actually choose which tools exposed by the website are accessible to the agent
… all compute happens on the client side, the model powering the agent runs on the client
… this is more light-weight and privacy-preserving approach compared to cloud-based agentic usage
Anssi: in the second part of the demo, google.com has defined tools such as get_page_title, extract_search_query, run_search, get_search_results
… remember these tools are just JS functions declared by the website, injected into the model's context so it is aware of them
… WebMCP tools make it easy to progressively enhance your existing websites and make them agentic
… Prompt API input "run_search for webmcp and tell me what it is"
… calls the same tools as in the voice model demo
… get_page_title tool call returns "Google" to confirm the page is Google's homepage
… the run_search tool is invoked to navigate to Google search results for "webmcp"
… all interaction happens inside browser, model execution, tool calls, events
… agent can be stopped at any time by the user
anssik: For this demo the extension injects the tools. In production the website would inject these tools.
mfoltz: How close was the polyfill to the explainer?
anssik: Unsure. You can check out the repo. It's conceptually aligned.
kush: I have another demo I can show which matches the explainer.
<kush> https://
https://
kush: This demo shows integration with MCP UI.
… Allows embedding UI within the conversation with the agent.
… That UI declares its tools with WebMCP.
… The agent doesn't need to hit the UI's backend. It's all a client-side app.
… In this case it's not a full website but a small UI embedded in a chat like we're seeing in the ecosystem.
Built-in agent ideation
ningxin: (Referencing slides) This is very similar to what we just saw in the demo.
… Web application registers capabilities through WebMCP.
… Browser has built-in small language model.
… Investigating whether a site can call into this agent to perform a task using tools provided by the web app itself.
… Investigating whether this can be run entirely locally.
anssik: We can share this idea in an issue and foster further discussion.
kbx: +1 to this idea.
… the browser also has capabilities that we might want to expose.
kush: There are two issues related to this on WebMCP.
… 1. Exposing built-in tools to agent.
… 2. Are you hoping for a tool to expose built-in agent to external agent?
ningxin: No. The prompt comes from the user, e.g. a chatbox on the site.
kush: Could the in-browser and site agent work together?
ningxin: The web app can customize this.
… We're considering use by app itself as well as extensions.
… Extensions can use this API to enhance the browser agent.
Communication with the TAG
Anssi: issue #35
<gb> Issue 35 Communication with the TAG (by xiaochengh) [Agenda+]
Anssi: the TAG provided the following discussion points:
webmachinelearning/
<gb> Issue 35 Communication with the TAG (by xiaochengh) [Agenda+]
Anssi: I'd like to prime this session by unpacking the TAG discussion points followed by a summary of the group's current posture wrt these points
Anssi: - Motivation: "Frontend agent integration is useful and should be supported on the Web"
… the group agrees with the TAG, the WebMCP effort's focus is to establish interoperable interfaces to enable frontend agent integration, motivation section of the explainer expands on this
WebMCP Explainer > Motivation
… - Generality: "Whatever the WebMCP efforts introduce to the web standards should be general enough to support different protocols"
… the explainer states "the WebMCP API [is] as agnostic as possible" and explains the API only reuses the "tools" base primitive from MCP, similarly to how the Prompt API reuses "tools"
… the group's charter codifies coordination expectations with the AI Agent Protocol CG, and I expect this group's participants to explore implementations atop other emerging protocols that may gain traction
webmachinelearning/
https://
… - Privacy and Security: "The P&S aspects can be challenging, and we would also like to see more explorations."
… the group has initiated a dedicated workstream for privacy and security, documented in issue #45 to be discussed next
… - Declarative API: "A declarative API can be quite useful and cover use cases that an imperative API can't cover"
… the group agrees with the TAG and has developed and prototyped a declarative API alongside the imperative API, the latest proposal is PR #26 and related discussion in issue #22
<gb> Issue 45 Privacy & security considerations for WebMCP (by victorhuangwq) [Agenda+]
<gb> Issue 22 Declarative API Equivalent (by EisenbergEffect)
<gb> Pull Request 26 add explainer for the declarative api (by MiguelsPizza) [Agenda+]
Anssi: "[TAG] would like to encourage further development of the high-level API that the WebML [CG] is currently working on" suggests this group is on the right track
<christianliebel> cpn: webmachinelearning/
<gb> Issue 35 Communication with the TAG (by xiaochengh) [Agenda+]
Anssi: per issue discussion Xiaocheng's personal view (and not TAG consensus) is "WebMCP should be built bottom-up on lower-level primitives"
… Khushal explains in his comment how current websites don't have a way to expose "how to use this site" as semantic actions, and that such an API must be standardized for the browser to know how to use such actions programmatically
… this is the rationale for why "tools" was chosen as the abstraction for WebMCP, to allow interoperability between websites and browsers
… the "tools" abstraction enables websites and web browsers to talk to each other programmatically without needing the browsers to scrape the content or run computer vision models on page screenshots to understand where to click
webmachinelearning/
jyasskin: That was a good summary.
… A lot of this feedback was not full TAG consensus.
… Loudest concern is that MCP is not going to last. Want to avoid stuff being stuck in the platform that won't match a future protocol.
… There was some uncertainty about whether the current design satisfies that request.
… I think it might.
… Also concern about JSON schema being production ready here. I think that it is good enough.
anssik: My takeaway from the feedback is that high-level is the right direction.
sushraja: There was a suggestion to try something lower level.
… Building a JS API which would allow multiple LLMs to understand tools that are available.
… This would be challenging across multiple library versions.
RESOLUTION: Continue development of WebMCP as the high-level API as per the TAG guidance. Coordinate with the AI Agent Protocol CG on new protocols.
Privacy & security considerations
Anssi: issue #45
<gb> Issue 45 Privacy & security considerations for WebMCP (by victorhuangwq) [Agenda+]
Anssi: the group has kicked off a workstream to look into privacy and security considerations as requested by the TAG in #35
<gb> Issue 35 (by xiaochengh) [Agenda+]
Anssi: please note also breakout session "Agentic Browsing and the Web's Security Model" by Johann tomorrow
<gb> Issue 25 Agentic Browsing and the Web's Security Model (by johannhof) [session]
<Victor> anssik I have prepared a slide show for the privacy and security considerations that I can share after the coffee break perhaps?
<Ehsan> Ehsan Toreini +
Anssi: the issue opened by Victor surfaces three key areas for deeper discussion
… 1. Prompt injection attacks, which is already mentioned in issue #11
<gb> #11
Anssi: 2. Misrepresentation of intent in the WebMCP tool
… 3. Personalization scraping / fingerprinting through over parametrization
… I will break this into separate subtopics
… (permissions part of the solution space is discussed in issue #44, we'll discuss that separately later today)
<gb> Issue 44 Managing action specific permissions (by khushalsagar) [Agenda+]
<Zakim> dom, you wanted to note that over parametrization is much more invasive than just fingerprinting
<Fazio> security means a few different things to people with disabilities. It could mean am I entering the information in the right place, is this what I think it is
dom: thank you for the presentation, I'd like to remind of the fingerprinting issue
… I don't want a site to know a user is pregnant or is visiting Japan at the moment, for example
… privacy issue is quite substantial
Victor: fingerprinting is not the only concern; this feature could be abused in other malicious usage scenarios too
<msw> (not audible to remote participants)
Fazio: we did user research with an American fast-food chain; users had not made any purchases online because of security concerns, they weren't sure the information was being entered in the right fields; confidence that the user is doing the right thing is a consideration
anssik: we can make a comparison with keeping a hand on the steering wheel in a self-driving car
victor: conversely, people get more and more comfortable with a self-driving car over time
<Fazio> Data annotation is crucial
victor: but we're at the start of the journey and so it's important to make sure the user stays involved
… what can we do in the WebMCP design to make it better? (vs what's left to agents)
<Fazio> Standardized annotations
Anssi: web specifications don't prescribe browser experiences; should WebMCP prescribe agent experiences - I don't have an answer
Victor: neither do I
… from a privacy and security perspective, WebMCP is going to be good for the general privacy and security because it provides a standard way to interact with the Web, rather than computer vision and automated interactions
kush: we were trying to nail down where the mitigations lie - e.g. agent mitigations vs browser mitigations
… for prompt injection, using a classifier to protect against it is something that would be on the agent
… for injection in the output, this could be based on heuristics in the agent, or a hint from the developer that the tool itself is agentic
… for fingerprinting, the browser and the agent could manage it through some form of consent, as done for e.g. autofills
… the spec itself could recommend what browsers should do without being prescriptive about how
… e.g. if something is marked as PII, user consent before sharing is required
… re misrepresentation of intent and the ambiguity of whom to blame, maybe the consuming agent can keep a ledger of actions that it took; but that doesn't help deal with ambiguity of language itself
… not sure how to solve this in particular in the case where the ambiguity is adversarial
… Another question is how to expose this to agents not built-in in the browser
… which forces to clarify the responsibility of agents vs browsers in more detail
Mark: in terms of UI - we don't want to dictate UI, but some principles: see what happens in real time, make it auditable, tool schema providing enough context
… we can get feedback through implementation experience
Kenji: to what extent would this be worse with 3rd-party scripts embedded in your doc?
Kush: if you include content in your 1st-party content, it already has access to the same content as the 1st party
reillyg: 3rd-party scripts are a well-known attack vector, but the agent might benefit from more context on the identity of scripts
johannhof: these agentic features should force us to rethink some of our threat models
… with agents having access to a shared context, they may be a vector for sharing information cross-origin
kenji: maybe restricting this to a service worker would help
johannhof: maybe but we would need a more in depth threat model to determine that
cpn: I'm from the BBC - what about misinterpretation of the output of the websites that leads to unexpected outcomes for the user? e.g. in the context of hallucination
victor: MCP folks have already put thought into this, at least in the non-malicious case - tool annotations that can say e.g. that a tool is read-only
… vs a tool that can be destructive
Tarek: re PII, I wonder if we need a mechanism to explicitly mark content as PII, e.g. on the web page
<Victor> johannhof We would probably need multiple solutions to help - some to help with cross-origin concerns, some to help protect PII, provide semantic markup. Looking forward to your talk
kush: this might be worth considering as new DOM primitives
johannhof: reducing user prompts to an absolute minimum is likely desirable, so working on a direction that assumes user validation feels wrong
… annotating something as PII will have a limited effect on how an agent treats the data once it is in its context
… this PII should maybe be gated behind a tool itself to avoid this
<johannhof> Victor - yeah, honestly I'm not sure I can cover it all, definitely looking forward to working on this stuff together :)
AlexNahas: re PII, with annotations, tool call may be handled by the browser first and decide whether the output should go to the model context
LeoLee: re language misinterpretation, word meanings vary across regions too - maybe this should be exposed to the agent
<Fazio> follow plain language guidance for internationalization
johannhof: high level, in the short or mid term, we're relying on a lot of custom agent-defined protections to protect the user because we can't standardize all the mitigations (e.g. classifying injections, limiting to relevant tool calls)
… one way to approach this is with tests, with evals - even though sharing those is controversial in the industry
… e.g. with examples of prompt injections that try to get password sharing
kush: in the context of misrepresentation of intent - you need to get a tool schema across browsers; how do we add tests for non-deterministic APIs
<Victor> oh wow that's true
kush: but we could add tests that has prompts that should or should not invoke tools for a given schema
Victor: we shouldn't expect WebMCP to solve all the problems
sushraja: elicitations might address some of the concerns
… a taxonomy of all possible PII is too hard, but having an elicitation that requires asking user consent, or instruct the agent not to share info with other origins
Kush: let's write this up in a markdown as a first step
Managing action specific permissions
Anssi: issue #44
<gb> Issue 44 Managing action specific permissions (by khushalsagar) [Agenda+]
Anssi: we opened this separate issue to discuss the permission model, informed by the TAG feedback in #35
<gb> Issue 35 Communication with the TAG (by xiaochengh) [Agenda+]
Anssi: Khushal's initial proposal:
… - The browser manages a global: "can you be agentic on this site" permission.
… - Action specific permissions. Say an action is destructive (deletes some files). The site would likely want user consent before that action is taken.
… the question is whether and how to persist the action specific permissions per site?
… tools are ephemeral from the browser's perspective, but NOT ephemeral from the site's perspective
… consider these 2 states are indistinguishable from the browser (or MCP client's) perspective
… the site (or MCP server) no longer provides this tool
… the tool is currently disabled
kush: the main issue I landed on: for any kind of sensitive action, outlining clearly who is responsible for gaining user consent (browser/agent vs site)
<Victor> dom I have sent the document to you
kush: if the agent/browser is doing it, it's going to be non-deterministic since it will depend on interpretation by the agent
… conversely, the site knows deterministically which tool is sensitive
… but we also want to avoid double-prompting, so making the browser the default consent dialog source
… and have tools identify if they require user consent
… what string should show up on the consent dialog? coming deterministically from the site or generated by the agent?
… can this be declarative or would this depend on tool execution?
… elicitation is for cases where a complex UI is needed to continue the task (e.g. a payment transaction)
… don't have a sketch of an API yet
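[ Editorial note: as kush notes, no API sketch exists yet; the following is purely illustrative of the kind of per-tool consent annotation being discussed. The requiresUserConsent and destructive flags are hypothetical, as is the registerTool method name. ]

navigator.modelContext.registerTool({
  name: "delete_all_files",
  description: "Permanently deletes all files in the user's workspace",
  requiresUserConsent: true, // hypothetical: ask the browser/agent to gather consent before execution
  destructive: true,         // hypothetical hint, in the spirit of MCP tool annotations
  inputSchema: { type: "object", properties: {} },
  async execute() { /* ... */ }
});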
kush: a browser could refuse to execute WebMCP on abusive web sites
johannhof: this should be discussed explicitly in the threat model
… some discussion of preventing elicitation to happen
sushraja: if you're thinking of persistence, we need an identity for the tool which we don't have today - maybe a hash of the description?
… the user may want to allow now or always
kush: if this is delegated to the browser, the tool may not need to care
sushraja: part of the question is what guarantees would be tied to that hash/id
<kush> WebMCP needs an API for the site to request browser provided consent flows for each tool execution
dom: providing browser prompts on the site might complicate the UX; an example is the Geolocation API where the website should explain when it needs access
… maybe a boolean API works; any information under control of the website blurs the boundary
johannhof: website generated text is not allowed in prompts, a Chrome policy
… there is a difference between how you give out this information and clicking a button
dom: a consent dialog involving PII is scarier
kush: if you share PII that is browser-driven, this is about site exposing a tool and the user agreeing with the agent's action
… the site gets information it otherwise would not have access to
reillyg: I think I lost track of what we are trying to protect against
… the site decides what it requires and builds UX for it; "sure you want to pay for this" needs no browser UX
… what is the threat model we want to push this to browser UX?
kush: UX reasons, one was minimizing context switches, no need to foreground the tab
… another establishing a clear responsibility
reillyg: we have 3 regions in the UI: content area, browser chrome and agent UX
… the question is, if we ask for consent, is the consent question in the agent UX area different from browser UX?
… is asking a question at the bottom of the conversation the same as an alert dialog in the omnibox?
johannhof: personally I think all should be consistent agent UX
dom: clear threat models would help here
johannhof: a lot of this is delegated to the user agent already
… no strong definition how these UI elements are shown exactly
Tarek: confused about prompt granularity
… not going to show 10 different permission prompts
… full scenario for using an agent on a web page, shopping on a webpage
… is there a protocol for this? if I'm on a website, these could be called hundreds of times
kush: an analogue is "cookie prompts"
johannhof: all the agent makers must be good at these probabilistic protections to ship products
Tarek: you trust the origin?
Fazio: a journey map visualization to help illustrate the flow?
<sushraja> webmachinelearning/
<gb> https://github.com/webmachinelearning/webmcp/issues/44
PROPOSED RESOLUTION: WebMCP needs a threat model to evaluate the role of browser mediated consent for tool execution
<sushraja> Please leave open issues and continue to discussion here webmachinelearning/
<Em> these types of conversations about consent of a user for an agent's actions are already happening in many other standards organizations. WebMCP can leverage OAuth discussions from IETF where the larger MCP community is working through these same concerns of what the agent is allowed to do on behalf of a user
RESOLUTION: WebMCP needs a threat model to evaluate the role of browser mediated consent for tool execution.
WebMCP accessibility, ARIA mapping via Declarative API
Anssi: PR #26
<gb> Pull Request 26 add explainer for the declarative api (by MiguelsPizza) [Agenda+]
<Tarek> Em: agreed, also wondering about duplication with the AI agents working group
Anssi: Brandon noted "aria-label and aria-description and others for labelling and describing elements and so it might be helpful to define WebMCP mappings/behaviors for these instead of introducing entirely new HTML attributes."
alex: WebMCP describes the tools in JS
… the declarative API is a proposal for a pure HTML version of MCP
… rather than declare them in JS, we declare them in HTML, and the different form fields are all contained in the markup
… e.g. with a tool-name attribute and a tool-description, and the rest of the schema for the tool is inferred from the children elements, including aria attributes
… there are still lots of questions about what the execution of the tool would do
… the current proposal is to submit the form, with the expectation that the server would respond with JSON based on the presence of an HTTP header indicating execution in a tool context
… with some challenges in how to handle HTTP and redirects in that context
… but it's interesting to repurpose accessibility annotations in that context
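[ Editorial note: a purely illustrative sketch of the declarative direction described above, expressed as DOM calls; the tool-name / tool-description attribute names come from the discussion and are not final. ]

// Annotate an existing form so it can be exposed as a tool; the input schema
// would be inferred from the form's child fields, including their aria-* attributes.
const form = document.querySelector("#store-locator");
form.setAttribute("tool-name", "locate_store");
form.setAttribute("tool-description", "Finds the nearest store for a given postal code");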
matatk: APA is happy to look at this pull request - we haven't done that yet
… what's really important is to ensure that semantics that are added are useful to people
… some of it targets for screen readers, but not just them
… we had a breakout about making some of these annotations for machines
… which can then help e.g. with cognitive disabilities
… e.g. to help navigation destinations
… there are other things where we see overlap between Agents & accessibility: in many case, accessibility needs to increase machine interpretability for assistive technologies
… but it's important to make sure the aria attributes used in the agent context don't end up overloading people with information
LeoLee: there might be web sites that abuse aria attributes to "escape" from agents
… and thus deteriorate the accessibility experience
alex: this should be opt-in to avoid overloading models
… e.g. only expose tools annotated with MCP annotations
reillyg: I like the idea of trying to describe things better to make assistive agents more useful
… in terms of declarative vs imperative, the former should be focused on explaining to the agent how to interact with the website through the normal human interface
… what makes me nervous is the assumption the server should respond with a specialized machine readable output, it should be the regular result of the underlying normal interaction
… it should be grounded on the UI and human workflow of the site; for a JSON-based approach, the imperative approach feels appropriate
kush: +1 to avoid that divergence between UI and response
… also, if this is an opt-in approach, we should provide ways to give MCP descriptions independently of ARIA to avoid polluting the screen reader experience
kush: open question on whether they should ship together
Define the API for in-page Agents to use a site's declared tools
Anssi: issue #51
<gb> #51
Anssi: currently we have an API for a site to declare tools, "the registry":
Anssi: we also need an API for an agent _embedded on the site_ to use these tools, and we think the browser can help mediate which agent is accessing the site's functionality at a given time
… notably, here the agent is running within the same origin as the site
… this issue was informed by issue #43, rephrasing Alex's thinking there:
<gb> Issue 43 Clarifying the scope of the proposal (by 43081j)
Anssi: - "WebMCP server" is the `navigator.modelContext` object that acts as the registry which tools are declared on
… - "WebMCP client" is the API for listing, calling, and listening for tool changes callable by an agent _embedded on the site_, e.g.:
navigator.modelContext.listTools, navigator.modelContext.executeTool, navigator.modelContext.onToolListChanged
webmachinelearning/
<gb> Issue 43 Clarifying the scope of the proposal (by 43081j)
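[ Editorial note: a minimal sketch of how an agent embedded on the site might use the client-side surface named above; exact signatures are not settled, and the event-handler shape and the run_search arguments are assumptions based on the earlier demo. ]

// List the tools the page has declared and invoke one of them.
const tools = await navigator.modelContext.listTools();
console.log(tools.map(t => t.name));

navigator.modelContext.onToolListChanged = () => {
  // refresh the embedded agent's view of the available tools
};

const result = await navigator.modelContext.executeTool("run_search", { query: "webmcp" });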
kush: this is a derivative discussion from #43 about the scope of the API
… when people embed a 3rd-party library to add agents on their site (à la Zendesk)
… it would be convenient to make that integration easier
… initially, there was no support to integrate this for lack of good use case
dom: this feels like a convenience function?
kush: exactly, to avoid browsers jumping at each other
… say "I want to start executing tools and if another agent is interacting with the site I can avoid stomping on each other"
dom: how about: if you provide WebMCP, we recommend you record your server in a global function
… some other groups are recommending exposing standard library as an SDK
johannhof: this seems like a pretty big tangent, maybe a separate proposal for the group?
reillyg: if we decide there's an explicit way to expose tools, then we'd include a mitigation in the browser feature
johannhof: a separate proposal for integration with the Prompt API warrants its own issue
kush: there's probably a lighter-weight way to deal with this problem?
… anything mediated by the browser could be a lightweight API like "an external agent wants to connect to you", and the site can terminate that connection
… someone declares the tools and by doing this we ensure there's only one task using the tool at a time
dom: only change to core WebMCP would be if the tool can or cannot run concurrently
reillyg: web developers can proceed when the initial tool call via a button press is completed, and resume with a new call then
anssik: we should define the various flavours of agents (incl e.g. in-site agents)
reillyg: the type of agents I can imagine are browser-agents and same-site agents
dom: how about agent connected to the site?
reillyg: browser-integrated agent, OS-provided agent, in-page agent
dom: an agent that specializes in one origin only
reillyg: extension developers could build origin-expert agents
dom: security considerations are different for different flavors
johannhof: feels beyond the scope of what an MVP for WebMCP needs to address, except to the extent it affects WebMCP itself (e.g. a connect function to help manage multi connections)
PROPOSED RESOLUTION: revisit that issue (likely in a different proposal) if there is a need for browser mediated interactions between browser agents and in-site agents - no clear need to address this in WebMCP atm
<Em> fyi created an issue about identity within WebMCP. I think many of the current conversations around consent prompting to users and what agents have access to can be resolved by the agent not acting *as* the user, but using OAuth for the agent to act *on behalf of* the user. webmachinelearning/
<gb> Issue 54 Challenging assumptions of Identity within WebMCP (by EmLauber)
RESOLUTION: revisit that issue (likely in a different proposal) if there is a need for browser mediated interactions between browser agents and in-site agents - no clear need to address this in WebMCP atm
Should we support cross-origin Agents across frame boundaries
Anssi: issue #52
<gb> Issue 52 Should we support cross-origin Agents across frame boundaries (by khushalsagar) [Agenda+]
Anssi: current thinking is tools can only exist on the top-level browsing context:
… "Only a top-level browsing context, such as a browser tab can be a model context provider."
https://
Anssi: this design does not support the following use cases:
… - a webpage which embeds an Agent
… - an Agent which embeds a cross-origin webpage and wants to access its WebMCP functionality
… proposal to consider a permission policy allowlist for WebMCP
https://
Anssi: question on granularity: "this origin is allowed to see all tools", "this origin is allowed to see a subset of tools: X, Y and Z"
kush: this is another derivative of #43; this is about an agent embedded in a cross-origin iframe
<gb> Issue 43 Clarifying the scope of the proposal (by 43081j)
kush: the question is whether the browser needs to provide a shared interface instead of relying e.g. on ad-hoc postMessage communication
kush: should we allow embedded iframe to expose tools to the browser-agent? separately, should the embedder be able to access embedded tools?
johannhof: for the former, part of the question is whether the user has the mental model to understand what happens
… that's an assumption we've made for permissions
… imagine a locateStore tool - this should be available to the agent seamlessly
kush: the challenge is to enforce origin isolation for data provided to the tool
… this could be something the embedder could allow or disallow
johannhof: if so, I think it should be disallowed by default
kbx: +1 on disallow by default
… #43 was talking about an in-site agent, which feels out of scope
<gb> Issue 43 Clarifying the scope of the proposal (by 43081j)
tomayac: re iframes, some pages (e.g. shopping pages) advertise for themselves via an iframe with a different origin
… it would be problematic if these types of embedding were perceived as separate and not accessible by the agent
… e.g. if the agent couldn't invoke a tool provided by the iframe
emily: I work at MS Identity - one part missing is authentication of the pages
… we should ensure the agent only has access to pages authenticated for that user
… incl scenario where federated authentication is used
… to avoid leakage of sensitive data
kush: from the WebMCP API surface, there is no distinction on the state of authentication
… in terms of data-leakage, origin isolation needs to be applied in any case even without embedding
em: if WebMCP calls a tool that interacts with an authenticated backend, then ensuring limits around authentication context matters
reillyg: tools are always calling JS functions in the page (and thus inherit the authentication status of the page)
… there are separate proposals to be able to shift the caller of the tool from the agent in the browser to a server-side agent with its context, incl identity (e.g. to allow chatgpt to continue an operation started in a browser agent context)
… but that's outside the scope of MCP
ErikAnderson: MS Teams has a complex architecture for their own tabbing infrastructure, incl via iframes
… if we use a permission policy mechanism, it would be interesting to see if this could or should be attached to actual visibility of the content
johannhof: or allowing to toggle permission on and off dynamically
sushraja: the problem with iframes is the risk of tool naming collisions
… the top-level frame could delegate tool "focus" to a given iframe dynamically
dom: this points out the need for the top-level frame to be the coordinator
kush: the problem of name collision exists in multi-tab scenarios
sushraja: I've assumed we were working only with the tab in focus
johannhof: I'm not too worried about the multi-tab situation where the agent should be able to make the distinction
kush: should we punt on embedded iframes for now? or make this UA dependent?
johannhof: are there ways to detect name collisions in the current spec?
kush: name collision across multiple services is bound to happen, with or without the Web
ErikAnderson: is there a concrete use case to help drive this conversation?
Mark_Foltz: maybe polyfilling with postMessage to get a better sense of the need and potential API shape
dom: if so, would we treat same-origin and cross-origin differently?
johannhof: I don't see why not to allow same-site
sushraja: we would be facing name collisions
johannhof: I think we should actually go with a permission policy that would work in x-origin context
kush: again, the need for agents to disambiguate across name collisions is something that agents will need to fix
leo: even the same-origin case is not necessarily anchored in a use case
johannhof: ultimately, whether first-party sites do the tools themselves or delegate to same- or x-origin iframes shouldn't be relevant
kush: I think the main question is one of implementation cost; let's allow a bit more time to figure out implementation challenges before proceeding
anssik: we should also make sure to get more of johannhof's time in our calls
Built-in AI APIs Overview
Prompt API
Slideset: https://
MikeWasserman: the rest of the slides are a more in-depth look at technical topics - we might need to agenda-bash which topics to focus on today
Anssi: reillyg and I have identified a few issues for Prompt and Writing Assistance
Prompt API
TAG design review
<gb> CLOSED Issue 1093 Prompt API (by domenic) [Progress: in progress] [Venue: WebML CG] [Resolution: lack of consensus] [Topic: Machine Learning] [Focus: API design (pending)] [Focus: Web architecture (pending)] [Focus: Internationalization
<gb> … (pending)]
Anssi: we requested TAG design review in May and received a review response in August
… however, the initial TAG review response was withdrawn due to inaccuracies
reillyg: in terms of the brand new TAG review, there are concerns about the locality of the models - feedback from developers seems to indicate they're good enough
Anssi: we received new review feedback 40 minutes ago
reillyg: with the caveat that the APIs are only provided on devices with sufficient performance characteristics - a question of whether we can provide a fallback for lower-end devices
… a question on the cost of computing - concerns about sites abusing the user's system resources
… as the crypto craze demonstrated in the past
… this may be something that browser implementors should work towards, detecting abuse of computing power
… a really good question on whether or not we should assume the model executes locally
… the current API mentions hybrid options, but we should probably clarify we've only done the local option in existing implementations, and review the security and privacy implications of a cloud-based approach
… in a cloud approach, there may be concerns about resource consumption (network, AI subscription) and possible additional user profiling (e.g. the level of subscription the user has access to can be a proxy for their means)
… discussion about downloadprogress with all its complexity - TAG suggesting we should make it simpler
… and developers also pushing back on having to make the decision to download or not
… On model versions and updates - concern about update frequency (possibly a suggestion to limit those from a browser perspective), but also more importantly, the issues around interop between sessions (and interop between browsers across different types of models)
… some positive experience in that regard from early experimentation, but still needs to be confirmed
… On Input/Output languages, risks of fingerprinting with a possible solution to restrict the languages to those that fit the user context
… Memory management: destroy() being redundant, good question
… JSON Schema standardization status
… On Tool use - more examples needed to help the TAG analyse the spec
… likewise for structured output
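[ Editorial note: for context on the downloadprogress point above, a minimal sketch of the availability/download pattern as currently described in the Prompt API explainer; option names and event details may still change. ]

const availability = await LanguageModel.availability();
if (availability !== "unavailable") {
  const session = await LanguageModel.create({
    monitor(m) {
      // fires repeatedly while the model downloads, with e.loaded in [0, 1]
      m.addEventListener("downloadprogress", (e) => {
        console.log(`Downloaded ${Math.round(e.loaded * 100)}%`);
      });
    }
  });
  console.log(await session.prompt("Say hello in Japanese."));
}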
<sushraja> +present Sushanth_Rajasankar
anssik: this is great feedback
kbx: re hybrid model support, are there lessons from WebSpeech?
reillyg: I started reading the TAG review of the WebSpeech API to get a sense of that
… it looks like it's still an open question how to deal with that aspect
dom: we should ensure we work with WebSpeech on this
Tarek: we're looking at making some of our APIs support cloud-based models
… which raises questions about cost and subscription (hence authentication)
… if it needs authentication, it should be applicable across different features
reillyg: developers really want to be able to enforce on-device e.g. for alignment with their privacy/security policy
… (with a fallback of having themselves determine which cloud service to use)
Tarek: Apple Intelligence has local vs private cloud compute
reillyg: we would need to validate whether this trusted environment in the cloud would match their needs from a privacy/security perspective
tarek: access to a trusted environment needs a key exchange, which the browser doesn't necessarily have access to
<kbx> flower.ai
tarek: https://
kush: what's the motivation for the hybrid fallback to happen in the browser vs in a JS lib?
Rob: to make it easier for developers - make it work on all browsers without consideration of their performance status
dom: if you need to integrate two different APIs it makes the DX worse for the Prompt API
rob: the explainer says we want to enable hybrid approaches; nothing we're doing precludes it atm
kbx: some implementors might choose a hybrid approach, but it's not clear the current API allows for it
sushraja: re structured output, we throw an exception if the browser doesn't understand a given JSON Schema
reillyg: we need to be specific about what needs to be supported for interop
sushraja: re WebSpeech, there are limits in terms of how much context gets kept - typically split at silent spots
… not sure what happens when the limits are hit
anssik: we'll coordinate the responses
reillyg: our team will go through the feedback and propose responses/file relevant issues
anssik: noting there isn't TAG consensus
Add image input resizing or tiling options
Anssi: issue #133
<gb> Issue 133 not found
Anssi: this is another issue from Mike, opened a few months ago, so we also have some of Domenic's comments from before he stepped down from the editor role
… the proposal is as the subject says, to add image input resizing or tiling options
webmachinelearning/
<gb> Issue 133 Add image input resizing or tiling options (by michaelwasserman) [Agenda+]
Anssi: the request is motivated by the following use cases:
… - Resizing (the default right now) is useful for an overview of an image, e.g. rough descriptions of photos.
… - Tiling would be useful for analyzing high-resolution details from a larger image, e.g. OCR from a large document.
… Domenic suggested a higher-level knob such as detail: "low" vs. detail: "high" would be more forward looking
… and asked the group to research other model provider APIs
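[ Editorial note: a hedged sketch of what a higher-level knob could look like, following Domenic's detail: "low" / "high" suggestion; no such option exists today, the multimodal prompt shape follows the current explainer as best understood, and scannedPageBlob is a placeholder for an image Blob. ]

const session = await LanguageModel.create({ expectedInputs: [{ type: "image" }] });
const text = await session.prompt([{
  role: "user",
  content: [
    { type: "text", value: "Extract all text from this scanned page." },
    // detail is hypothetical: "high" could request tiling, "low" a single resize
    { type: "image", value: scannedPageBlob /* , detail: "high" */ }
  ]
}]);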
sushraja: today the API exposes the min/max resolution for images, we only normalize to tokens consumed
<msw> webmachinelearning/
<gb> Issue 84 Exposing max image / audio limits? (by domenic) [enhancement]
sushraja: but that isn't sufficient for developers to know what they can provide and what is supported
reillyg: the OCR example is also illustrative: you need to tile the image and tile it in a way that's useful for OCR (e.g. avoid breaking over lines)
… combining model-specific knowledge and app-specific (e.g. is this vertical or horizontal text)
… exposing min/max of images is the only way I can see, but it hardcodes a particular constraint of current models
tomayac: the developer expectation would be that a higher quality image leads to better OCR, but if a low quality image is already higher resolution than what the model will resize to, it breaks a logical assumption
reillyg: my only concern is the additional complexity we might regret later
… there is a similar issue with audio and the supported sample rate
kbx: and the duration of audio as well
reillyg: if the OpenAI API is successful with min/max, that's a good sign, but we should learn more about these properties
kbx: another item is the number of channels
reillyg: images have width/height/depth/color
… also context window size
tomayac: also the shape into which it may get transformed might inform tiling
sushraja: I don't think the developers want resolution for OCR - but what's the minimum font size they support for OCR
… that may be pretty constant (human readable font size)
kbx: that might be discoverable through testing on well-known content
RESOLUTION: The group will research other model provider APIs for image input resizing or tiling options. Also consider audio input.
Tool Use: decouple execution and formalize function calls and responses
Anssi: issue #159
<gb> #159
Anssi: Mike reports the initial Prompt API design integrated JS tool execution within the prompt() call itself
… this design prioritized API simplicity
… the trade-off of this design is it reduces granular control and direct LLM interactions
… to address this, Mike suggests realigning initial tool use integrations with Prompt API objectives, in three points:
… - (1) "Provide essential Language Model tool types: i.e. Function Declarations (FD), Function Calls (FC), and Function Responses (FR)"
<kbx> Actual 159 issue is here: webmachinelearning/
<gb> Issue 159 Tool Use: decouple execution and formalize function calls and responses (by michaelwasserman) [tools] [Agenda+]
Anssi: - (2) "Offer fine-grained client control over tool execution loops used for agentic integrations."
… - (3) "Align with patterns established by major LLM APIs (OpenAI, Gemini, Claude)."
… the motivation for the three:
… (1) is used by API clients to inspect, reconstruct, and test session history
… (2) enables clients to define the looping patterns, error handling, limits, etc.
… (3) empower clients to use the Prompt API more interchangeably with server-based APIs.
… about the proposal
… currently the Prompt API using a closed-loop model
… where the browser process operates as a hidden agent, looping on (model prediction → tool execution → response observation) until a final text response was ready
… proposal to move to API-centric, open-loop model for tool execution where:
… - prompt() method returns a structured Function Call (FC) object to the client
… - Function Call (FC) object manages the execution and Function Response (FR) feedback loop
… this maximizes developer control and observability
… Function Declarations (FD) also need not provide execute functions for now
… there's a comparison table between Closed-Loop (Original) and Open-Loop (Proposed)
… Open-Loop is better in Developer Control, Debuggability and Industry Alignment dimensions
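[ Editorial note: a hedged sketch contrasting the two models. The closed-loop shape follows the current explainer; the open-loop shape is purely illustrative since no API has been agreed; fetchWeather is a hypothetical page helper. ]

// Closed-loop (current): the tool carries an execute() callback and the
// browser loops internally until it has a final text answer.
const session = await LanguageModel.create({
  tools: [{
    name: "get_weather",
    description: "Returns the current weather for a city",
    inputSchema: { type: "object", properties: { city: { type: "string" } } },
    async execute({ city }) { return JSON.stringify(await fetchWeather(city)); }
  }]
});
const answer = await session.prompt("What is the weather in Kobe?");

// Open-loop (proposed, illustrative only): the Function Declaration has no
// execute(); prompt() returns structured Function Calls for the client to run.
const openSession = await LanguageModel.create({
  tools: [{ name: "get_weather", description: "Returns the current weather for a city",
            inputSchema: { type: "object", properties: { city: { type: "string" } } } }]
});
const turn = await openSession.prompt("What is the weather in Kobe?");
for (const call of turn.functionCalls ?? []) {
  const weather = await fetchWeather(call.args.city);           // client executes the call
  await openSession.prompt([{ role: "tool", name: call.name,    // Function Response fed back
                              content: JSON.stringify(weather) }]);
}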
Mike: working with Microsoft on this proposal
… function calls and responses as first-class APIs
… our concerns were around API clients needing to capture calls to execute functions and turn them into representations of calls and responses that can be replayed later
… and what constitutes a response that gets added to the initial prompt for the session
Mike: something else to look at is other LLM APIs, so web app developers can target the Prompt API or cloud-based APIs more easily
… if they're developing an agent inside the site, we want the Prompt API aligned with cloud-based solutions
… some of the intro slides animate this
… also invite Sushant and Frank to chime in, we've worked together on this
… feedback from this group would be helpful
… we want to structure the API so that it allows building a good production API around tool use
[ Mike revisits the Built-in AI APIs Overview slides shared earlier. ]
[ Slides shown are the "Tool Use: JavaScript Example" and "Tool Use: Design alternatives" for a weather tool ]
[ Demo polyfill by Nathan ]
<tomayac> Video link https://
reillyg: one of the reasons this makes sense is the semantics are provided by the developer for the tool and they get complicated if you ask the developer to re-prompt the tool
… we want to avoid designing a complicated API; it is deceptively simple, but the semantics hide a lot of detail
sushraja: each model has a different way to represent the tool call and we want to avoid model dependency
… open-loop allows the model to be replaced with a cloud-based one
… the only negative is it increases implementation complexity
kush: interestingly this is fundamentally different; the responsibility is on the developer to manage everything that goes into the context window
… once they resolve the tool call they may want to modify the call
… there is room to align the syntax for LM tool declaration across both the APIs even if the pattern is not
reillyg: there was some discussion: if you provide no tools then the model will not use them
… because there's a question for each call to prompt, are you expecting a tool call at that very moment
… that affects what string encoding the underlying implementation will do
sushraja: enabling and disabling tool calls between prompts?
reillyg: tool call needs to be added to expectedOutput
sushraja: presence of the tools can be used, so developers don't need to provide the list of tools
reillyg: availability API assumes the same options are available
kbx: we got feedback from authors that in practice there's a dynamic condition where the open-loop model helps: if something changes in the middle of the journey, it can adapt
… if everything happens without issues, closed-loop works as well
sushraja: state changes we need to handle act differently between the two
Tarek: on our side we had issues depending on the model, some models did not work, bad accuracy, especially true for small models
… if we use this API on different underlying models, say on Edge and Chrome, which API do we use so that the tool call works?
… the tools are going to make the interop issues harder
sushraja: without this API developers will craft a special system prompt instead
reillyg: proposed API abstracts out some differences, we are working with model development teams to ensure the models can work with tools
… this is a possible interop issue for some models
sushraja: do we assume building a workflow that expects LLM to make the tool call?
kush: want to understand how the model's context looks like when a tool call is invoked
… if there's a failure, the developer does not handle the tool calls, do the tools get ignored in the next prompt?
reillyg: in the two versions of the polyfill: in closed-loop, on input and output stream failures the browser catches the errors
… in open-loop you get a chunk that, instead of text, is the error signal
kush: ImageBufferSource as input to tool call?
Tarek: you can use base64 images
sushraja: if the model wants to do multiple tool calls, then you need two responses and a prompt
reillyg: in the open-loop model you can respond in a way the model is not trained for
… the model is trained so it might not do the right thing
… do we track whether the model expects a tool call?
reillyg: in the future models might support async tool calling
tomayac: footgun-wise, open-loop syntax makes me nervous
… this is recursion, streams, JS developers are not so familiar with these
… the awkward ergonomics of the open-loop model are challenging for developers
reillyg: chunk type that is a tool call, this is only a property of open-loop design
… there are different flavors of tool calling, in closed-loop model you don't get any chunks
sushraja: if you have a closed-loop model, the implementation will silently eat tokens
… to be able to restore the session you need to see all the tokens
sushraja: also need to think of security implications
RESOLUTION: Further solicit feedback on both closed-loop and open-loop models to understand developer ergonomics issues.
Writing Assistance APIs
Repository: webmachinelearning/writing-assistance-apis
User Activation requirements for Session Creation cause undue friction
Anssi: issue #83 and PR #86 (merged)
<gb> #86
<gb> #83
Anssi: an issue from Isaac
… Built-in AI APIs currently consume transient user activation when their availability is "downloadable"
… this adds friction when a site wants to create a Summarizer once the content to be summarized comes into view
… now the site must create the session earlier (think warm-up), consuming the transient activation, then get the user to act on the page again to obtain transient activation again
… transient activation indicates a user has recently pressed a button or performed some other user interaction, and when a transient activation is consumed, it is "deactivated"
… sticky activation by contrast persists until the end of the session
… the proposal is to relax the user activation requirements to require sticky activation, rather than consuming transient activation
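[ Editorial note: a short sketch of the difference for reference, using the standard navigator.userActivation object and the Summarizer API; Summarizer options are omitted and summarizeButton, output and article are hypothetical page elements. ]

// Transient activation: only valid for a short window after e.g. a click, and
// (before this change) consumed by create() while the model is "downloadable".
summarizeButton.addEventListener("click", async () => {
  const summarizer = await Summarizer.create();
  output.textContent = await summarizer.summarize(article.textContent);
});

// Sticky activation: true once the user has interacted with the page at all and
// never consumed, which is what the merged PR now requires for the downloadable state.
if (navigator.userActivation.hasBeenActive) {
  // safe to create the Summarizer lazily, e.g. when content scrolls into view
}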
Anssi: Reilly asks could we relax this to only require sticky activation for the downloadable state?
… Isaac agrees
… this suggests a similar change to the Language Detector
… the PR was reviewed and merged, any comments from the group?
reillyg: the Translator API is excluded because this behaviour is tied up with the anti-fingerprinting mechanism regarding which model is downloaded
… consider two models for download controls
… one model for everything and one for developer-specified
… proposal is for APIs where there is only one model
sushraja: I'd like to see more transparency about the developer scenarios where using the Prompt API in the background is a real ask, as this competes with the TAG feedback
… do you get developer feedback this is preferred?
… background tabs playing audio?
kbx: developers competing to capture the gesture, some get it some don't
Erik: audio players have restrictions: if you open a new tab but don't focus it, they can't play
… if we open 10 new tabs and all call Prompt API at the same time, do you need to wait until the tab is visible?
reillyg: this change only impacts the behaviour when the model is downloadable, not when the model has been already downloaded
Markus: does the user know how much needs to be downloaded
… what if I accidentally download a lot of data while roaming and the data costs a lot?
reillyg: I can imagine a more complex download policy if you're on a metered network to not download
… we refuse to do anything if you're not in a situation where you can download
reillyg: if folks have concerns please chime in on the issue
Discuss the privacy implications of using a paid cloud option
Anssi: issue #84
<gb> Issue 84 Discuss the privacy implications of using a paid cloud option (by jyasskin) [Agenda+]
Anssi: this issue was filed on the writing-assistance-apis repo, but it refers to the Prompt API explainer, which makes a general statement that I believe applies across all built-in AI APIs:
"Allowing hybrid approaches, e.g. free users of a website use on-device AI whereas paid users use a more powerful API-based model."
Anssi: Jeffrey points out there's a risk that this API could reveal to a website whether its user is wealthy (cloud available) or not (local-only)
… Tom suggests this may not be as strong a signal of wealth as e.g. device manufacturer information that is already disclosed
… any comments from the group?
reillyg: we haven't discussed the cloud option too much
Markus: if someone uses cloud do they care about privacy?
… someone sends a request to your model and it returns I'm a cloud version?
reillyg: TAG noted even if we don't provide information on client vs cloud it can be inferred from the response received
reillyg: if we as browser vendors want to ensure that users of any means have access to these models, as a service, then that is an incentive to find a solution
… alternative is to delegate all to the developer
sushraja: WebGL/WebGPU do not have this consideration?
kbx: some implementers would do this differently, maybe on a server if it is free
Tarek: in Chrome implementation, do you fall back to the model already available on the system?
reillyg: Chrome on Android uses the model that ships with Android
sushraja: similarly on Windows, we have a plan to contribute the OS provided model
Tarek: for the cloud option, what if we can use Bring Your Own Model?
Erik: can of worms; punting may make it harder to respond later
reillyg: that would be going in the opposite direction to the Web Speech API
Wrap up
Anssi: thank you for your active participation and exciting discussions on our incubations, agentic web and built-in AI capabilities!
… similarly to yesterday, interested folks are welcome to join us for a dinner
… the plan would be to again meet in the Portopia Hotel (adjacent to the Kobe International Conference Center) lobby at 18:15 to coordinate on transport and restaurants, more restaurants should be open today than yesterday so more options!
… restaurant options: