W3C

– DRAFT –
RQTF teleconference.

20 August 2025

Attendees

Present
janina, jasonjgw, John_Paton, stacey
Regrets
-
Chair
-
Scribe
janina, stacey

Meeting minutes

Accessibility of Machine Learning and Generative AI.

jasonjgw: Status: Janina is working on editing, especially Scott's additions

jasonjgw: Stacey raised a question on list

<jasonjgw> Janina: this work will continue - status is as given last week.

No Tooling AI Agents

jasonjgw: Are there issues raised by these tools that aren't addressed in current guidance?

stacey: Describes recent experience ...

Jasonjgw: There's a whole class of authoring systems that use an LLM to generate code for web content. The problem is that the output of an LLM in response to a prompt is not very predictable: it draws on a large body of examples and generates based on what you asked for, and it isn't deterministic. It's hard to guarantee it will satisfy a set of requirements.

Janina: if it's unpredictable there's no way to know it will satisfy the requirements, but why is it unpredictable?

jason: It's the nature of how they work: the fundamental mechanism tries to predict what the next tokens in the sequence will be, based on the prompt. It's predicting based on its training, and even the people who created it don't reliably know how it will respond under particular conditions. LLMs, whether producing text, code, or images, are not readily predictable in what they're going to do. There might be ways to mitigate that: run tests externally and ask it to make corrections. When programming, for instance, have it run the code, and whatever errors result are given back to the LLM so it can correct them. There are ways to mitigate the unpredictability, but no reliable method to ensure what it produces will satisfy a set of requirements.
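[Scribe's note: as a sketch of why sampling makes LLM output non-deterministic, consider a toy next-token step. The vocabulary and probabilities below are invented for illustration; a real model derives such a distribution from a neural network conditioned on the prompt.]

```python
import random

# Toy next-token distribution. In a real LLM these probabilities come
# from the model, conditioned on the prompt and all prior tokens.
NEXT_TOKEN_PROBS = {
    "The button": [("is", 0.5), ("opens", 0.3), ("submits", 0.2)],
}

def sample_next_token(context, rng):
    """Sample the next token from the (made-up) distribution for context."""
    tokens, weights = zip(*NEXT_TOKEN_PROBS[context])
    return rng.choices(tokens, weights=weights, k=1)[0]

# Two runs over the same prompt can diverge, because each step samples
# from a probability distribution rather than returning one fixed answer.
run1 = sample_next_token("The button", random.Random(1))
run2 = sample_next_token("The button", random.Random(7))
print(run1, run2)
```

Even with identical input, different random draws can yield different continuations, which is the sense in which the output "isn't deterministic."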

janina: if that's inherent, that should be something we write about

Janina: We should come up with a non-expert explanation of why this is so. On that basis we can say "this is why you can't rely on it, and here are the trade-offs," rather than "it got this wrong, or got that translation wrong." The latter isn't as satisfying as an explanation of why it's not reliable. To put it another way: given a set of inputs it does the work, but a couple of sentences about what that work is gives it only a kind of database to draw on, and the results are then unpredictable because of the series of tokens. Explaining this so that people who really know their WCAG can follow it could be useful.

Jason: there's a report on the risks of general-purpose AI, an international report by specialists in machine learning, with good coverage in those sections

janina: are we pointing to those in our wiki? we need to read those

jason: What's interesting is that when these systems are initially trained, they have no prior information in them about language. Everything they know and operationalize, including how syntax works, is derived by the training process from examples. That's fairly spectacular: no programming tells them how sentences work. It's all built up through the machine learning process and training.

janina: training isn't deterministic. It's "here's a lot of stuff, go look for patterns."

jason: see what errors are and use that...

janina: how does it know an error from not an error?

john: I think it's composed of what other programs used. The general problem is that these tools are trying to enable programming for non-programmers. The people they're aimed at aren't supposed to need the skills to know what the problems are, so there isn't the opportunity in the target audience for the oversight to ask "has it done the job properly?" If you ask it to create a page and say "I want it to be accessible," you have to have a way to know whether it's correct. It can tick the boxes but still fail one way or another. Do we know of any tools that can be used by AI to test this?

janina: there is one in APA, and others are being sold. How reliable they are is another question. We're about to publish another doc that says you still need a human.

john: need a human without the AI, and still need a human with the AI

jason: the point is that the unpredictability of LLM output is inherent in the nature of how LLMs function

Jason: not an accident

jason: try your best. If you had an LLM, I wouldn't want to enter into a contract with guarantees about what it's going to do...

john: For people who are using LLMs, prompt engineering is knowing the secret sauce of how you ask the question to get what you want. Look at unit tests: ask it to produce the software and to build in unit tests to make sure it's still working. That's the kind of thing you can put through with prompt engineering. The unpredictability doesn't worry me, since you can build the tests as well as the creation, but we still need a human to look at it and say it's doing what's needed, with no side effects or big data leaks. We need that oversight. I don't know how to prove that, but a lot of industry professionals say vibe coding is brilliant but you need to check it. Spell checkers aren't always correct.
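[Scribe's note: a minimal sketch of the test-then-correct loop described above. The `fake_llm` function is invented for illustration; a real setup would call an actual model, and the passing result would still need human review.]

```python
def fake_llm(prompt):
    """Stand-in for a real LLM call: returns buggy code on the first try,
    then a 'corrected' version when the prompt contains error feedback."""
    if "failed" in prompt:
        return "def add(a, b):\n    return a + b"
    return "def add(a, b):\n    return a - b"  # first draft is wrong

def unit_test(code):
    """Run the generated code against a fixed test; return an error string or None."""
    namespace = {}
    exec(code, namespace)
    try:
        assert namespace["add"](2, 3) == 5
    except AssertionError:
        return "test failed: add(2, 3) != 5"
    return None

def generate_with_feedback(task, max_rounds=3):
    """Generate code, run the tests, and feed any errors back to the model."""
    prompt = task
    for _ in range(max_rounds):
        code = fake_llm(prompt)
        error = unit_test(code)
        if error is None:
            return code  # tests pass; a human should still review it
        # Feed the error back to the model and try again.
        prompt = task + "\nPrevious attempt: " + error
    raise RuntimeError("no passing version within the round limit")

code = generate_with_feedback("Write add(a, b) returning the sum.")
```

The loop mitigates unpredictability (the buggy first draft is caught and corrected), but, as noted above, it cannot guarantee the absence of problems the tests don't cover.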

janina: can't just rely on a spell checker or you're going to be embarrassed

jason: qualification: the user runs the risk of the erroneous output that it can generate

janina: that may be an acceptable risk, and may be part of the success. You don't need PhD-level quality content all the time; close may be close enough. But not when it's brain surgery.

jason: when asked to generate accessible HTML, they give mixed-quality results; they don't do it reliably at this point

john: when you think of the stuff that gets flagged as accessible in training, quite a lot is being flagged as accessible when it's not

janina: how much better would an LLM be if given stuff by a human that's considered accessible? Can we build such a thing?

jason: back to a key point of the issues paper: these systems have an unreliability about them, and users have to deal with that in some way. We don't want that landing on a person with a disability who can't check it.

janina: the need to check reliability needs to be reinforced. I don't know that it's really seeped into everyone's consciousness yet. It's a key point: once you know it, it's obvious, but until you do... Sometimes the most obvious points are the hardest to make in the work, precisely because they are so obvious.

jason: like alt text: only so much can be put into it; I'd need to call someone to check it for me, and I won't know whether it's correct.

janina: some people aren't the best person to give correct alt text on an image (ex: if it's technical, a non-expert may not be the best person as the description could be wrong in key ways). No easy solution other than having someone who's capable.

janina: what if you trained the model on specific texts to increase the probability of reliable results? That's why the distinctions matter.

jason: there are reinforcement learning techniques to improve accuracy, and a whole discipline concerned with how to set up the training process to improve results. But even the large players aren't there yet.

Miscellaneous topics.

janina: Baseline is about web tech and web content development: it tells you which features are widely supported across modern browsers. Use these structures and you'll get reliable results across all browsers. This is great, and Baseline is important to the web dev community. What's not in Baseline, and what's being brought to APA, is associating accessibility support with those commonly used and well-supported objects. Trying to get Baseline to include accessibility is what's going on, and we'll learn more in the next hour. There will be a recording.

Jason: where does this get defined?

janina: in a community group at W3C. Not sure how and in what form. Baseline is something you can reference to see whether something you use is well supported.

janina: i don't know what the source is like. I think we'll learn in the next hour with APA.

Minutes manually created (not a transcript), formatted by scribe.perl version 244 (Thu Feb 27 01:23:09 2025 UTC).

Diagnostics

Maybe present: jason, john

All speakers: Janina, jason, jasonjgw, john, stacey

Active on IRC: janina, jasonjgw, JPaton, stacey