Meeting minutes
Chris Cuellar (Bocoup)
<Joe_Humbert6> are we in the wrong zoom?
<Joe_Humbert2> Only Murray and I are in zoom
<murray_moss> Guessing it's the wrong Zoom...
<murray_moss> Joe_Humbert2 I got the right Zoom from https://
<Joe_Humbert2> yup
boaz: we're getting started. reminder about the code of conduct and the anti-trust guidance. We're not recording today
boaz: we're doing a high-level overview of the ARIA-AT system
boaz: we're doing intros
<Ben_Tillyer_> Any objections if I stop the camera from panning around?
No objections if you know how to stop the camera panning!
boaz: we're here to talk about ARIA-AT testing and our approach to testing. Compared to some of the specifics we got into yesterday
boaz: we're testing AT interop based on test cases from APG. It's the brainchild of Matt King, with input from Bocoup based on its experience contributing to web-platform-tests
boaz: I have slides with image descriptions. the slide link is in the github issue for this breakout
boaz: we're using verdict-based testing
<boaz> link: https://
boaz: We start by writing tests based on examples from APG and collect screen reader utterances from manual testers to reach a verdict on whether the correct meaning was conveyed by the AT
boaz: if we have a good utterance, then we have stable software
boaz: if from the verdict we get a bad utterance, we either change the software or the tests and restart the verdict process
boaz: if a new platform or AT version comes out, we have an automation bot that re-checks the verdict. If the results pass and the utterances have not changed, then we have stable software. If not, then we restart the process. This is how we do regression testing against new versions of AT.
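A minimal TypeScript sketch of the verdict loop described above; the type and function names are illustrative only and are not taken from the ARIA-AT codebase.

    // Illustrative sketch of the verdict-based testing loop (hypothetical names).
    type Verdict = 'correct' | 'incorrect';

    interface UtteranceResult {
      testId: string;
      utterance: string; // speech collected from the AT by a manual tester or the bot
      verdict: Verdict;  // human judgement: was the correct meaning conveyed?
    }

    interface Baseline {
      testId: string;
      utterance: string; // utterance that previously earned a 'correct' verdict
    }

    // Regression check run when a new AT or platform version ships:
    // if the utterance is unchanged, the earlier human verdict still stands;
    // otherwise the test goes back through the manual verdict process.
    function needsReverification(baseline: Baseline, fresh: UtteranceResult): boolean {
      return fresh.utterance !== baseline.utterance || fresh.verdict === 'incorrect';
    }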
boaz: any questions?
boaz: next, we will describe the ARIA AT working mode.
<Charli> Would someone re-share the link to the slide deck, please?
boaz: in the first step, we are doing research on the test plan
boaz: in the second step of draft test plan review, we have at least 2 testers collect utterances and get a verdict
boaz: step 3 is candidate test plan review where we determine the verdict
boaz: here test admins ask AT vendors to make product changes
boaz: if there's agreement we go to step 4. if there's disagreement then we facilitate a conversation to determine where the change needs to happen, on the software side or the AT side. This can restart the whole testing process.
boaz: if there's agreement, this brings us to recommended test plan reporting stage (step 4). Here we go to issue triage for recommended test plan reporting (step 5). This is where we start the regression testing against newer versions of the AT. If we detect a regression, then we restart the testing process.
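A sketch of the working-mode progression in TypeScript; the stage names follow the description above, but the code itself is hypothetical and not the project's implementation.

    // Hypothetical encoding of the five working-mode stages.
    type Stage =
      | 'research'        // step 1: research the test plan
      | 'draftReview'     // step 2: at least 2 testers collect utterances and reach a verdict
      | 'candidateReview' // step 3: AT vendors asked to make product changes
      | 'recommended'     // step 4: recommended test plan reporting
      | 'issueTriage';    // step 5: issue triage and regression testing against newer AT versions

    // Disagreement in candidate review, or a regression found during triage,
    // restarts the testing process (modeled here as a return to draft review).
    function nextStage(current: Stage, agreementOrPass: boolean): Stage {
      switch (current) {
        case 'research':        return 'draftReview';
        case 'draftReview':     return 'candidateReview';
        case 'candidateReview': return agreementOrPass ? 'recommended' : 'draftReview';
        case 'recommended':     return 'issueTriage';
        default:                return agreementOrPass ? 'issueTriage' : 'draftReview';
      }
    }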
boaz: Any questions about the working mode?
Is there a way to automate person A or person B?
boaz: Because we're not testing against a spec, we're relying on human interpretation to come up with consensus on what the utterances should be. We're not trying to automate that away
<Patrick_H_Lauke> human judgement needed at the moment, which i personally welcome...
Matt_King: the whole community group is making the call
Matt_King: we're only testing JAWS and NVDA on Windows and VoiceOver on Mac. This is a foundational phase, but we have hopes to go way beyond that. We have to figure out how to address iOS and Android. That's a big challenge in front of us. And then going beyond screen readers
boaz: if there's disagreement, there's conversation in the community group
jcraig: there's an example of the checkbox on VoiceOver that was resolved with further conversation. The "unchecked" state is the implicit default for a checkbox, not verbosely stated speech.
Matt_King: there's a lot that can happen when running a screen reader test. There can be lots of extra speech that comes through because we're testing in the real world. If there are unexpected side effects, we categorize them as moderate or severe. We try to track things like unexpected cursor movement or extra verbosity. There are only two levels of severity. This is to make sure the testers have a consistent understanding among the group
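A small sketch of how the two-level severity categorization could be recorded; the field names here are invented for illustration, not the project's schema.

    // Unexpected side effects observed during a screen reader test run.
    // Only two severity levels, so testers categorize consistently.
    type Severity = 'moderate' | 'severe';

    interface UnexpectedBehavior {
      description: string; // e.g. "extra verbosity" or "unexpected cursor movement"
      severity: Severity;
    }

    interface TestRunObservations {
      testId: string;
      unexpectedBehaviors: UnexpectedBehavior[];
    }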
boaz: there's been numerous changes to ATs based on this process. We need to compile that list
Matt_King: hopefully the default is no changes are needed in the test because the tests can be really simple. In those cases, we hope to get consensus and move on to the recommended phases quickly
Patrick_H_Lauke: what about internationalization?
Matt_King: we don't currently have a plan for that. That's an important scoping question. We're trying to accomplish a level of interop but not totally debug the ATs. If there's localization bugs, is that an interop or AT bug?
Patrick_H_Lauke: I don't know enough to determine if i18n is in scope. There may be more ambiguous utterances in other languages, or there may even be cases where the AT in one language has a completely different utterance (e.g. not announcing the "checked" state, for whatever reason)
<Zakim> jcraig, you wanted to discuss i18n
jcraig: localization would be complicated to test because different localized strings are coming from different places. I acknowledge your example case of ambiguity in string localization. Ideally, there won't be that much of a difference. Sometimes these strings come from different parts of the OS stack. But that's not the core of what I understand the ARIA-AT project is trying to test.
jcraig: Unless there's a scenario with a polyfill that breaks a core feature, like earlier versions of MathJax, which broke VoiceOver. (Thankfully resolved now.) That type of problem might be detectable with additional localization testing.
Ben_Tillyer_: Are you looking at PC Talker?
Ben_Tillyer_: second question - how are you getting the sound? How exactly are you collecting the utterances?
Ben_Tillyer_: last question - do you get responses from the screen reader about focus change or window change events?
Matt_King: When we start to add more ATs, for example a Japanese-first screen reader, I think our framework should still work, though we'd have more localization work to make sure of that.
jugglinmike: I'll explain the basics of the automation. We use AT Driver. Our AT Driver servers speak a WebDriver-like protocol. We maintain one server tied to NVDA; the other one communicates with macOS. They have different strengths and weaknesses. We don't collect any events other than speech.
Matt_King: But we do get the screen reader response to a focus change event. We capture that with AT Driver
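A hedged sketch of collecting speech through an AT Driver server over WebSocket; the port, path, and message names ("interaction.pressKeys", "interaction.capturedOutput") are assumptions based on the AT Driver draft and the servers mentioned above, so check the spec and the specific server before relying on them.

    import WebSocket from 'ws';

    // Hypothetical local AT Driver endpoint; real servers may differ.
    const ws = new WebSocket('ws://localhost:4382/session');

    const spokenOutput: string[] = [];

    ws.on('message', (raw) => {
      const msg = JSON.parse(raw.toString());
      // Speech is the only kind of event this automation collects.
      if (msg.method === 'interaction.capturedOutput') {
        spokenOutput.push(msg.params.data);
      }
    });

    ws.on('open', () => {
      // Ask the AT to press a key sequence, then wait for captured speech events.
      ws.send(JSON.stringify({
        id: 1,
        method: 'interaction.pressKeys',
        params: { keys: ['ArrowDown'] },
      }));
    });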
<Zakim> estellevw, you wanted to say can someone put a link to the current tests in this chat?
gregwhitworth: is there an expectation to add mobile support?
boaz: yes
gregwhitworth: Is there an expectation to make the driver and test suite more available?
boaz: Yes, people are starting to copy this process. The software is open source. People are using this verdict-based approach to test web apps, or just using the automation in their dev tool chain.
<gregwhitworth> Here is one we did at MS in the past: https://
gregwhitworth: That's an example from Narrator to check for UI regressions
gregwhitworth: It would be useful to see how the tests are authored as well.
boaz: An SDK that evolved from this would be a great downstream benefit from this software
boaz: we'll hold the queue as I step through a demo.
boaz: there are interop reports, data management, and the test queue. For interop reports, we can show the percentage of passing tests per AT and browser combination that we're testing. The test plans have more info about the tests in each plan for every combo of browser and AT.
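A sketch of the pass-percentage figure behind the interop report view; the data shape is invented for illustration.

    // Hypothetical per-combination results behind the interop report.
    interface ComboResults {
      at: string;      // e.g. "NVDA"
      browser: string; // e.g. "Chrome"
      passing: number;
      total: number;
    }

    // Percentage of passing tests for each AT and browser combination.
    function passPercentage(r: ComboResults): number {
      return r.total === 0 ? 0 : Math.round((r.passing / r.total) * 100);
    }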
boaz: in data management, we can see the state of each of the test plans. It shows us the overall status, what stage of the working mode it's in.
Matt_King: we did a major refactor of our test format last year. We track the dates in all of our test reports. The older ones will be harder to read and understand.
<jcraig> side mention that 65% result is disputed (invalid expectations IMO). Matt and I are discussing it and other results later today. two issues of several more. w3c/
boaz: In the Test Queue page, you can run the tests. In Candidate Review, we manage the process of working with AT vendors. Each AT has a list of where they're at with the test plan support.
boaz: In the test plan page, you can see all the steps of a test plan. We can take actions on each test plan
boaz: This is an open source web app. You can find it on Github and try forking it. You're also welcome to contribute!
Matt_King: Collecting speech utterances is difficult. It's hard to tell when we'll be ready to take on Braille. Testing on mobile is our next big priority.
Patrick_H_Lauke: What happens when the problem is not on the AT side but on the browser side?
boaz: Acacia is more geared toward testing the accessibility APIs
Patrick_H_Lauke: ARIA-AT assumes that the browser is working correctly
jcraig: There are multiple layers of testing at each step of the stack to attempt to ensure everything is working correctly. Granular testing is more reliable for automation. WPT accessibility testing is all inside the browser, Acacia is testing how the accessibility comes out of the browser to the accessibility API, and ARIA-AT is testing what happens after the browser, in the screen readers (for now).
boaz: There's other a11y testing initiatives. WPT has one. There's the Acacia project. Many lanes of testing
Ben_Tillyer_: If the verdicts are more subjective than objective, is this interop even preferable for users?
<spectranaut_> Patrick_H_Lauke: here is the acacia project, adding browser exposed accessibility API testing to WPT: https://
boaz: We don't have a process beyond the CG to check in if changes in ATs are desirable for end users. We would really like to do this
Ben_Tillyer_: how can we help?
boaz: we're looking for testers; you can join the community group. If you have resources to get broader user input, come to the CG.
Matt_King: When it comes to the long-term health of this project, we need a broader funding base. This has been solely funded by Meta so far. We need to get the message out about how important AT interoperability is. We need to figure out how to replicate something like wpt.fyi for this initiative. We'd love to discuss that with folks.
<Zakim> jcraig, you wanted to mention driver complexity in the context of Greg's question about providing this via Chrome Dev Tools
jcraig: Screen readers tend to be highly privileged within the OS. Getting this into Chrome DevTools would be a long-term security process. It's probably not feasible in the short term.
aaronlev: I would love to use this to catch regressions in Chrome.
boaz: Test your browsers too with this!
Thanks everyone!
<gsnedders> RRSAgent, make minutes