W3C

- DRAFT -

ARIA and Assistive Technologies Community Group Teleconference

28 Aug 2019

Attendees

Present
Matt_King, Jean-Francois_Hector
Regrets
Chair
Matt King
Scribe
Jean-Francois_Hector

Contents


<scribe> scribe: Jean-Francois_Hector

Granularity of tests

MCK: "How granular do the assertions need to be" might not be the right question.

Been thinking about how to reframe that question.

One possibility is that assertions are very granular (i.e. as specific as key commands)

But maybe what we need is a way to describe expectations about what a screen reader should do in circumstances X and Y

e.g. When reading a checkbox in reading mode, all screen readers must communicate the role, name and state of the checkbox

Not sure what to call that. Maybe it's not an assertion. Or a requirement. Maybe it's an expectation.

Maybe we could make that list of expectations in a generic way for all screen readers. Then the next activity would be to associate each expectation with a list of more granular assertions.

In doing this, it might be that some experienced screen reader users only need the expectations + the list of key commands as a reminder, so that they don't need to get bogged down in a list of 54 granular assertions.

I.e. maybe we need both the high level 'expectations', and granular 'assertions' – and have the language to describe both

JF: It's helpful to think in terms of outcomes we want / don't want

MCK: People who are reading a report of a screen reader's support don't need to look at a long table of all the key commands. They need to know just things at a higher level. We don't want them to have to read through the 73 things that didn't work and formulate their own understanding of what worked and didn't work.

It'd be simpler for someone to just look at a report and see that there are 10 things that are fully supported, 5 that don't work, and 7 that have partial support.

Another thing I'm trying to avoid is having to write every single assertion by hand.

If I have 5 expectations, we should be able to automatically generate the low level assertions based on the characteristics of the expectations.

e.g. expectation: a widget role is announced in reading mode. From that expectation we could automatically generate a list of more granular assertions for each screen reader.
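A minimal sketch of how such generation might work (the function name, screen reader list and key commands below are illustrative assumptions, not anything decided in the meeting):

    # Hypothetical list of screen readers under test.
    SCREEN_READERS = ["JAWS", "NVDA", "VoiceOver"]

    def expand_expectation(role, mode, properties, commands_by_screen_reader):
        """Expand one high level expectation into granular, per screen reader
        assertions: one per key command and ARIA property covered."""
        assertions = []
        for sr in SCREEN_READERS:
            for command in commands_by_screen_reader[sr]:
                for prop in properties:
                    assertions.append(
                        f"{sr}: pressing {command} in {mode} mode "
                        f"conveys the {prop} of the {role}"
                    )
        return assertions

    # Expectation: "a checkbox's role, name and state are announced in reading mode".
    # The key commands listed here are placeholders, not documented bindings.
    checkbox_assertions = expand_expectation(
        role="checkbox",
        mode="reading",
        properties=["role", "name", "state"],
        commands_by_screen_reader={
            "JAWS": ["Down Arrow", "X quick key"],
            "NVDA": ["Down Arrow", "X quick key"],
            "VoiceOver": ["VO+Right Arrow"],
        },
    )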

Web developers won't know the screen reader commands. We want it to be easy for them to see that an ARIA attribute has 82% support, but that the bits that aren't supported might not be crucial to their users' experience.

So, another outcome that we don't want is this: it's not our goal to design the screen reader user experience. So I want to avoid statements such as "JAWS Insert+UpArrow should say a specific thing in order to support this particular UI component".

It's not for us to decide which key commands screen reader developers should implement. We can write expectations (at a high level), and then screen reader developers can tell us how their product fulfills them (including key commands).

This last point might be the most important of the three, when it comes to our stakeholders.

JF: So we could write high level, generic, hard-to-argue-with expectations, and screen reader developers could tell us what key commands honour these expectations. And then we perform the tests based on these key commands.

MCK: Yes, hard to argue with expectations, that everyone can agree with. And the specific key commands, screen reader developers could "tell us that" through their documentation.

JF: And then we can run tests on the granular assertions generated in this way. And the granular test data we get can be automatically aggregated back up to the level of the high level expectations.
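A rough sketch of that aggregation step; the support labels and the rule for assigning them are placeholders, not decisions from the meeting:

    def summarise(results_by_expectation):
        """Roll granular pass/fail results up to one label per expectation.
        results_by_expectation maps an expectation to a list of booleans,
        one per granular assertion tested."""
        summary = {}
        for expectation, outcomes in results_by_expectation.items():
            if all(outcomes):
                summary[expectation] = "full support"
            elif any(outcomes):
                summary[expectation] = "partial support"
            else:
                summary[expectation] = "no support"
        return summary

    print(summarise({
        "checkbox role, name and state announced in reading mode": [True, True, True],
        "checkbox state change announced when toggled": [True, False],
    }))
    # -> full support for the first expectation, partial support for the second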

MCK: The puzzle is: how do we characterise each expectation? It could be characterised/described along two axes. One is a screen reader axis, the other is an ARIA axis. Our data model could help us automatically determine the intersection point for each expectation. And that could semi-automatically generate a candidate list of granular assertions for that expectation.

Axis might not be the right word, it's just a picture I have in my head.

Each of these is like a hierarchical progression. E.g. in ARIA we can say whether something is a Widget or a structure. And there are categories of widgets. The ARIA ontology is part of what characterises an expectation.

Our descriptive ontology for screen readers could be things like:
• They have reading modes and interaction modes
• There are high level categories of user tasks (e.g. 'reading', 'operating' – we need to agree on these words)

And there are categories of screen reader commands that could be associated with certain aspects of the ARIA ontology.

We could categorise screen reader commands according to the types of actions that they perform, and the type of things that they perform them on. For example, there are interaction commands that are only applicable to ARIA Widget elements.

So by listing the ARIA characteristics of something and the screen reader characteristics, and correlating those two types of information, we should be able to generate (semi-automatically) a candidate list of granular assertions (i.e. a particular screen reader command in a particular mode should report the name and state of a particular type of widget).
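A speculative sketch of that correlation, with made-up category names, showing how the ARIA characteristics of a test target could select which categories of screen reader commands produce candidate assertions:

    # Placeholder command categories, each scoped to kinds of ARIA things.
    COMMAND_CATEGORIES = {
        "reading navigation": {"widget", "structure"},
        "interaction": {"widget"},
    }

    def relevant_command_categories(aria_characteristics):
        """Return the command categories whose scope overlaps the ARIA
        characteristics of the thing being tested."""
        return [
            name
            for name, applies_to in COMMAND_CATEGORIES.items()
            if applies_to & aria_characteristics
        ]

    # A checkbox is a widget, so both reading navigation and interaction
    # commands yield candidate assertions; a heading (a structure) would
    # only pick up reading navigation commands.
    print(relevant_command_categories({"widget"}))     # both categories
    print(relevant_command_categories({"structure"}))  # reading navigation only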

So this is my thinking about how we can rephrase this problem, away from asking ourselves whether we have high level or granular assertions, and reframing the question in terms of "maybe we need both".

MCK: One way that we could test this is: instead of generating a single spreadsheet that tries to put everything onto one sheet, what if we start by generating a spreadsheet that has all the high level expectations, and then another spreadsheet which would say, for a given high level expectation, exactly how someone would test it.

And then we could test the aggregation of the granular data (maybe we could even automate that in Excel?)

MCK: On the interface/table that a tester uses, there might be:
• at least one row for each key command
• Test result
• Anything needed to make sure that testing that key command in that circumstance is a repeatable thing (e.g. mode)
• Something about result and output (JF: I didn't get that)

"at least one row for each key command" because, for a single key command, we might want to have separate rows for name, role and value (for example)

One question is to figure out whether we aggregate some key commands (e.g. Tab and Shift+Tab), or whether we keep them on separate rows

So the idea of the spreadsheet would be to test how granular we want to go. E.g. would we ever consider 2 keyboard commands so closely related that we would test their behaviour as one (i.e. if one fails, it's as if both fail). E.g. left arrow and right arrow, or up and down. Maybe because we think that in 99% of cases if one fails both fail

There would be one row for each ARIA attribute (e.g. role is an attribute, state is an attribute, accessible name is an attribute). In some cases several states apply (e.g. aria-expanded, and aria-haspopup)
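A sketch of what a single row of that tester table might hold; the field names and example values are illustrative only:

    from dataclasses import dataclass

    @dataclass
    class TestRow:
        key_command: str      # possibly aggregated, e.g. "Tab / Shift+Tab"
        mode: str             # e.g. "reading" or "interaction", so the test is repeatable
        aria_attribute: str   # e.g. "role", "accessible name", "state (aria-checked)"
        expected_output: str  # what the screen reader should convey
        result: str = ""      # filled in by the tester, e.g. "pass" or "fail"

    row = TestRow(
        key_command="Down Arrow",
        mode="reading",
        aria_attribute="state (aria-checked)",
        expected_output="the checked state is conveyed",
    )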

I'm hoping that in the UI, we would never be this granular. I'm expecting that even for a simple example like checkbox, being this granular would generate 60-70 rows. And for something like menubar, 200-500 rows.

We would record granular data from the tests. But I'm hoping that we would be able to ask a shorter list of questions that would allow the system to report data for every granular assertion

e.g. when testing a checkbox in reading mode, we could tell the tester what the high level expectation is, and the list of key commands, and ask them to test these expectations with these key commands, and ask them whether there were failures. If the answer is yes, we might ask them whether there was a problem with the name, or a problem with the role, or a problem with the value. And based on that, we could ask whether the probl[CUT]

(these could be recorded as checkboxes that a tester would tick)

JF: I understand. We'd need to see how much more difficult a dynamic interface like this would be to design and build

MCK: Next steps:

• Try to do this in Excel to explain the concept.

• For us to agree on the language that we would use, both for describing the different types of data in the data model, and for generically describing screen reader capabilities, and screen reader users' activities. Without that generic language, we can't write clear expectations. We want consistency in the way that expectations are worded, and consistent understanding of what they mean.

Summary of Action Items

Summary of Resolutions

[End of minutes]

Minutes manually created (not a transcript), formatted by David Booth's scribe.perl version 1.154 (CVS log)
$Date: 2019/08/28 17:00:48 $
