Meeting minutes
ChrisCuellar: What sets ARIA-AT apart from other accessibility testing frameworks or platforms you may have already encountered is that it is really pushing forward the concept of interoperability within screen readers themselves
ChrisCuellar: It's pushing beyond the boundaries of the browser
ChrisCuellar: This effort started with the ARIA-AT community group in 2016
ChrisCuellar: in that time, the framework has evolved to a high degree of sophistication
ChrisCuellar: this involves writing tests, running tests, and even automating test execution
ChrisCuellar: We're here to share an overview and give status updates
ChrisCuellar: We want to take a deeper dive into the infrastructure--how the tests work, how they are structured, and the underlying methodology
ChrisCuellar: And we'd like to give an indication of where the program is headed and share some pointers on how folks can get involved
ChrisCuellar: We're sharing a screenshot of the ARIA-AT app which documents support levels for our test plans
ChrisCuellar: It features a grid describing screen reader / web browser pairs
ChrisCuellar: Initially, we've been testing those pairs' renderings of design patterns from the ARIA Authoring Practices Guide
ChrisCuellar: The goal of the program is to help AT vendors improve interop through testing
ChrisCuellar: We've taken a lot of inspiration from the web-platform-tests project
ChrisCuellar: Along those lines, we're hoping to get more granular with accessibility-related features
Matt_King: What makes this different from other interop efforts is that normally, you start with a standard and have everyone test to that standard. Here, there is no standard for how ATs should behave. We're trying to solve that, but not by writing a standard. We're starting with tests as a basis to drive consensus about basic expectations
Matt_King: The biggest value to developers is having confidence that the experience you are designing is truly accessible across all platforms
ChrisCuellar: There's been a lot of evolution in this project's six-year lifespan
ChrisCuellar: We've learned a lot about the difficulty in testing against the accessibility stack
ChrisCuellar: We have a concept we've been calling the "four-mile journey"
ChrisCuellar: It describes what happens to the code that web authors write in order for it to reach AT users
ChrisCuellar: The first mile is about the code as authored by web developers. That's supported by the ARIA Authoring Practices Guide, WCAG, etc.
ChrisCuellar: The second mile is the accessibility tree. That's really the territory of browser developers. At this stage, we're able to track interop via web-platform-tests. There's been a lot of innovation in recent years around exposing that tree for testing
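For illustration, a minimal sketch of what a "second mile" check can look like in web-platform-tests, assuming the get_computed_role helper that testdriver.js provides; the element selector and expected role are made up for the example.

```typescript
// Sketch of a "second mile" check: ask the browser for the role it
// computed and exposed in its accessibility tree, WPT-style.
// These globals are provided by testharness.js / testdriver.js when the
// test runs inside web-platform-tests; they are declared here only so the
// sketch stands alone.
declare function promise_test(fn: () => Promise<void>, name: string): void;
declare function assert_equals(actual: unknown, expected: unknown): void;
declare const test_driver: { get_computed_role(el: Element): Promise<string> };

promise_test(async () => {
  const group = document.querySelector('[role="radiogroup"]')!;
  const role = await test_driver.get_computed_role(group);
  assert_equals(role, "radiogroup");
}, "radiogroup role is exposed in the accessibility tree");
```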
ChrisCuellar: At the third mile, we have the operating systems' accessibility APIs. Before we reach the screen reader, we have to pass through the operating system. There, we're dealing with AAM tests
ChrisCuellar: It gets harder and harder to access each layer we're describing here. But efforts are underway to tap into and to test the accessibility APIs. That's a new frontier for the web-platform-tests
ChrisCuellar: The last mile is what we've been talking about--it's where ARIA-AT really lives. It's the behavior of the ATs (e.g. screen readers) themselves
ChrisCuellar: We're validating that the various ATs we support (JAWS, NVDA, and VoiceOver at the moment) provide a roughly equivalent experience
Matt_King: The idea that the screen readers should behave "more or less the same" is an area that I'm sure many in attendance today will want to interrogate. That's where the ARIA-AT community group spends most of its efforts
ChrisCuellar: There are many distinct projects under active development for testing at each of these "miles"
ChrisCuellar: Since 2018, we have developed an overall approach to building consensus. I think that's the most unique part of our work. There is so much conversation between testers and AT vendors themselves
ChrisCuellar: We have a repeatable, scalable, and automatable test structure
ChrisCuellar: We have a testing and reporting platform
ChrisCuellar: And we have integrated JAWS, NVDA, and VoiceOver automation
Nigel: Is there a reason why TalkBack is not in that list?
ChrisCuellar: Why yes there is!
ChrisCuellar: It's on the roadmap!
ChrisCuellar: Mobile in general is something we started to work on this year, and we made some good progress on Android
ChrisCuellar: Likewise, we're also interested in moving beyond the English language
Matt_King: Our current scope is limited largely by resource availability
Matt_King: Our plans have shrunk over the years. Back in 2018, we were much more optimistic about the progress we would have made by this point. We wanted more screen reader/browser pairs, etc
Matt_King: This work all hinges on the availability of automation. Without that, it becomes impossible to keep up with the releases of new versions of platforms
Matt_King: So we continued narrowing our scope to a point where we could find success given our resources
Matt_King: But we've been designing everything to avoid limiting extensions to other kinds of ATs, other languages, etc
Nigel: Do ATs have a standard protocol for reporting their state?
ChrisCuellar: We're not trying to start with specs and standards--we're backing into that via testing. However, one area that is under development (one that is powering automation) is a standard protocol for remotely controlling ATs
ChrisCuellar: We're calling it AT-Driver, and it's modeled after WebDriver BiDi from the W3C
ChrisCuellar: That's enabling us to do the work driving consensus in AT users' experience
Nigel: So you've got some stimulus, you're expecting to observe the behavior, and what are you actually observing?
Matt_King: We're at the final level
ChrisCuellar: Right, "what is the attached device actually doing?"
ChrisCuellar: Initially, it started with pressing keys, but it's starting to evolve into more generic "user intents"
jugglinmike: We're actually capturing the text being spoken by the screen reader. So it's text data from the screen readers.
… AT-Driver is implemented as a WebDriver BiDi-style protocol. It speaks over WebSockets. We've implemented it in NVDA, on macOS, and in JAWS. As a separate effort, we're hoping that this will have other implementations.
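A rough sketch of what a client of an AT-Driver-style server could look like over WebSockets (Node.js with the ws package assumed); the endpoint, port, and the command and event names are illustrative rather than quoted from the AT Driver spec.

```typescript
// Hypothetical AT-Driver client: send a BiDi-style JSON command and listen
// for the speech the screen reader produced. Names are illustrative; the
// real vocabulary is defined by the AT Driver specification.
import WebSocket from "ws";

const socket = new WebSocket("ws://localhost:4382/session");

socket.on("open", () => {
  // Commands carry an id so responses can be correlated with them.
  socket.send(JSON.stringify({
    id: 1,
    method: "interaction.pressKeys", // illustrative command name
    params: { keys: ["x"] },
  }));
});

socket.on("message", (data) => {
  const message = JSON.parse(data.toString());
  // Events deliver the captured speech that ARIA-AT compares against its
  // assertions.
  if (message.method === "interaction.capturedOutput") { // illustrative event name
    console.log("Screen reader said:", message.params.data);
  }
});
```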
ChrisCuellar: Yeah! And hearing that, I was wondering: are there other implementations or use-cases that would be valued by folks here?
ChrisCuellar: What drew you here to this talk today?
florian: The use cases I've had are internationalization-related
florian: I've assumed that the well-trodden English paths are the best-tested
florian: So I'm concerned with how the accessibility tree is rendered in internationalization contexts
… I'm curious about CJK-related transforms
ChrisCuellar: So it would be useful to you to get the final output just to verify?
florian: Yeah, in a WPT-style context
… So if implementers really insist on what they're doing, we can recognize this and have a conversation
… Another case where I've wanted this is also related to CJK use-cases. Ruby is an assistive tool for sighted users who, for whatever reason, lack knowledge about the rendered text and need help interpreting it
… Bad information in this context is worse than no information
… There are language-specific things that happen with some technologies where verification is especially important
Nigel: One of my use-cases is in an implementation that sends audio description text via an aria-live=polite element
… and there's a related use-case where, if you imagine that you have a video that only has a description and it has hard-of-hearing subtitles
… In the BBC's player, we send the subtitles to the screen reader. Let's say that you have two people watching this video, and one of them can't hear, and one can't see. Your screen reader is on, and your subtitles are on. It would be really good to have a repeatable mechanism to understand the user-experience
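A minimal sketch of the live-region pattern Nigel describes, assuming the timed text comes from a video's <track> element; the selector and class name are illustrative.

```typescript
// Route timed text (audio description or subtitles) to screen reader users
// by updating a polite live region; screen readers queue the new text
// without interrupting current speech.
const liveRegion = document.createElement("div");
liveRegion.setAttribute("aria-live", "polite");
liveRegion.className = "visually-hidden"; // hidden visually, still exposed to AT
document.body.append(liveRegion);

function announceCue(text: string): void {
  liveRegion.textContent = text;
}

// Illustrative wiring to a <track id="descriptions"> element's cue changes.
const trackEl = document.querySelector<HTMLTrackElement>("#descriptions");
if (trackEl) {
  const textTrack = trackEl.track;
  textTrack.addEventListener("cuechange", () => {
    const cue = textTrack.activeCues?.[0] as VTTCue | undefined;
    if (cue) announceCue(cue.text);
  });
}
```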
Chiara: Sometimes, others think about people with physical impairments. Of course we know this work is also about people of advanced age
… We want to be sure that these experiences are designed properly
… Also, my manager asked me to implement something in a website because the target audience is over 70 years old.
ChrisCuellar: Thanks, everyone!
ChrisCuellar: Let's get into how this all works operationally
ChrisCuellar: The effort is hosted by the W3C
ChrisCuellar: And that's informed a very rigorous process design
ChrisCuellar: If you ever join a Community Group call, you'll hear Matt_King assigning Test Plans to testers (here on the slides, we're looking at a "radio group" test plan--specifically one that relies on aria-activedescendant)
ChrisCuellar: Generally, we want to have test plans executed by two testers. We're looking to corroborate the results
Matt_King: Right. The test plans are authored to be as specific as possible, but there's still plenty of room for people to make mistakes.
ChrisCuellar: So, reviewing a test plan like this one I'm sharing for JAWS and Chrome, you can see that there are a lot of instructions.
ChrisCuellar: Then, you get a list of different commands--these are steps in the test. You can see that here, we have one command for what happens after you hit the "x" key. Here, it's when JAWS is in a specific mode.
ChrisCuellar: Following that, we have the captured output from JAWS
ChrisCuellar: We're not just making assertions against the output itself. This is where the role of human testers is critical. Subjective judgements have to be made about the output
ChrisCuellar: We have a set of assertions about the output. E.g. "was the list boundary conveyed?"
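A hedged sketch of the shape of the data involved; the actual ARIA-AT test format differs in its details, but this illustrates the command / captured-output / assertion / verdict structure just described.

```typescript
// Illustrative types only; not the real ARIA-AT test format.
interface Assertion {
  statement: string;            // e.g. "The boundary of the list is conveyed"
  priority: "MUST" | "SHOULD";
}

interface Command {
  keys: string[];               // e.g. ["X"]
  atMode?: string;              // e.g. "virtual cursor active"
}

// A verdict is assigned by a human tester, not computed from the output.
type Verdict = "pass" | "fail";

interface TestResult {
  command: Command;
  capturedOutput: string;       // the speech recorded from the screen reader
  verdicts: { assertion: Assertion; verdict: Verdict }[];
}
```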
ChrisCuellar: Sometimes, people have different results, and we talk about that in the community group. The bot really helps with the velocity of this task. Today, the human testers work to verify that what the bot reports matches their own experience (rather than enter the data manually themselves)
Matt_King: The process is that: someone writes the test plan, then at least two people run it. Once we've ironed out the behaviors, we move from "draft review" to a state where we can run the test plan whenever a new version of the AT under test is released
Matt_King: Ultimately, this test is kind of defining "what do we mean when we say that the checkbox is supported?" What does that mean in real life? By writing these tests and gaining consensus with screen reader developers, everyone can have a shared understanding of what an element in HTML (or a role, state, or property in ARIA) means for users
ChrisCuellar: All of these get finalized into reports that we publish. Those give an overall sense of how the AT/browser combinations are performing
florian: This sounds reminiscent of something we used to have in Opera software for visual tests.
… Is this approach something that is or can be integrated with how WPT does tests?
jugglinmike: In WPT, there are ref tests that are somewhat relevant to this discussion. But the level of fuzziness involved in this kind of testing is different. We have a concept of verdicts in ARIA-AT. It's not enough to say that an assertion is passing or failing. The verdicts are subjective and fallible.
florian: In the pre-WPT days at Opera, we had reftests AND visual tests.
… In some cases, a fuzz factor would be sufficient
… I think we had cases in the visual context where there could be tremendous variability, and it would be obvious to a human if the result was right or wrong, but it would be very difficult to encode that in a query
ChrisCuellar: I think it might be a non-goal to get this into WPT, given that the level of infrastructure required to run screen readers seems undesirable for WPT maintainers
Matt_King: In the first 1.5 years, we researched what exists already and whether we could fit into off-the-shelf solutions. The result was that we really did need to build a bespoke solution
Matt_King: Over the years, we've learned about what we can and cannot abstract
Matt_King: We've had to make decisions about test design--how abstract or concrete to write the tests. Working with concrete tests has allowed us, in time, to see the opportunities for abstractions
florian: It seems valuable, though I'm sure it would involve a lot of work
… In CSS WG, we did not consult with the people who work on the system to learn what is feasible
… When people implemented our work, they got something wrong because there were no tests, and that was bad. Our system may have been great, but they did something else
Matt_King: We want people to be able to propose expectations for assistive technologies, place a test in this system, run the tests, and learn about the implementations' behavior
florian: I was hoping for a separation between the people writing the implementations and the people writing the tests. That's the WPT parallel I'm interested in
Matt_King: Agreed. We think this platform brings a lot of value to the community in terms of moving interoperability forward. We're trying to find the best way to work it into the needs of those working in these spaces
Matt_King: Testing new features and experimental implementations--are they potentially able to deliver the value to end-users that we want them to?
ChrisCuellar: This slide I'm showing now demonstrates how we're reporting support levels
ChrisCuellar: We've been testing against the APG patterns, and we're trying to open the door to writing other kinds of tests
Matt_King: The project actually has a lot of different needs
Matt_King: One is enlisting people to write tests for ARIA and HTML features
Matt_King: We also want to make this platform itself work--there is infrastructure and implementation work (three bots at the moment, and a desire for more)
Matt_King: If there are people who are passionate about this space and want to see assistive technology interoperability become a thing (or you know someone who is, or you know someone who can recruit talent), that would be a big help
Matt_King: You can come to us directly, or you can join the Community Group and we'll welcome you there