W3C

– DRAFT –
Bi-Weekly Meeting of Assistive Technology Automation Subgroup of ARIA-AT Community Group

24 April 2023

Attendees

Present
jugglinmike, michael_fairchild, mzgoddard, Sam_Shaw
Regrets
-
Chair
-
Scribe
Matt_King, Sam_Shaw

Meeting minutes

Classifying parts of an AT response

jugglinmike: some responses can be nonverbal, e.g., a mode change

We started this discussion last meeting.

It is related to separating testing into two steps: collection of responses and assignment of verdicts

So far, we have conflated those two steps.

We might need to recognize the difference to separate what automation can do from what it can't do.

MK: The first thing we need to discuss is: what is the anatomy of a response?

Matt_King: It's more like, what parts of a response do we need to characterize?

Matt_King: Do we need to characterize 100 percent of an AT response to a command?

Matt_King: One example: when we use the full settings, the AT can give a lot of information about how to operate a control. Generally speaking, that response is something we ignore. The one exception is that, as a human doing this evaluation, if that info is incorrect, then we say we observed an undesirable behavior

Matt_King: In general, however, we don't really care about these responses

Matt_King: So do we need to characterize every element of a response, like those kinds of responses we don't really care about?

michael_fairchild: That's a good point.

Matt_King: Should one of our characterizations be "Don't care"?

Matt_King: So machines will gather everything, but humans only answer questions based on the assertions. Maybe this isn't the right question

michael_fairchild: We could say that there are aspects we don't care about

Matt_King: Z brought up last time that there are specific things like "the state is conveyed" and we haven't treated those undesirable behaviors as assertions. They are more like negative assertions

mzgoddard: in normal web development, you don't normally test that a certain error isn't thrown

mzgoddard: currently the way we document the underlying assumption is with undesirable assertions
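
As a rough illustration of what such a negative assertion might look like in code, here is a minimal TypeScript sketch; all identifiers (UndesirableBehavior, assertNoUndesirableBehavior) are hypothetical and not part of any existing ARIA-AT API:

    // Hypothetical sketch: a negative assertion checks that an
    // undesirable behavior did NOT occur, rather than that an
    // expected behavior did.
    interface UndesirableBehavior {
      id: string;      // e.g. "excessVerbosity"
      pattern: RegExp; // text that would indicate the behavior
    }

    function assertNoUndesirableBehavior(
      response: string,
      behaviors: UndesirableBehavior[]
    ): string[] {
      // Returns the ids of any undesirable behaviors detected;
      // an empty array means the negative assertion passed.
      return behaviors
        .filter((b) => b.pattern.test(response))
        .map((b) => b.id);
    }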

jugglinmike: undesirable behaviors could also be called exceptional behaviors

Matt_King: I don't know if "exceptional" captures the undesirable aspects. They are both problematic and undesirable

michael_fairchild: I'd argue that "exceptional" makes sense in a programmatic context, but not as well in a human context

jugglinmike: How about "erroneous"?

Matt_King: Maybe we don't need to wordsmith or rename unexpected behaviors. We need to decide if we formally treat those as assertions in this classification of AT responses

Matt_King: Right now, for human testing, it's a yes or no on every single test whether an unexpected behavior occurred; if one did, the human tester will describe which one occurred

Matt_King: So we are in agreement: there is a clear difference between collecting responses and analyzing responses

Matt_King: How should we label these two steps in the process? Are "collecting" and "analyzing" the right words?

jugglinmike: I have been using those words. We should also discuss using the term "verdict assignment"

Matt_King: The analysis is more than just verdict assignment.

Unless each part of the response is a different assertion, it's more than just verdict assignment.

mzgoddard: What would be expected to be stored in a database for a verdict assignment?

Matt_King: For example, in the case of excess verbosity, the response is yes or no. Would that be a verdict assignment?

jugglinmike: Part of me thinks it's better to explicitly classify which part of the response was excess verbosity, rather than just yes or no

jugglinmike: that only makes sense to me if there are going to be multiple text-based AT responses; then you would be disambiguating something

Matt_King: So we have an input field there for the user to describe the unexpected behaviors. So I expect that would be stored, and that is more than just verdict assignment
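
To make the database question concrete, here is one possible shape for a stored result, sketched in TypeScript; the field names are hypothetical, not the actual ARIA-AT schema:

    // Hypothetical record for one test result; a sketch for
    // discussion, not the real ARIA-AT database schema.
    interface TestResultRecord {
      testId: string;
      command: string;            // the AT command exercised
      collectedResponse: string;  // raw AT output (speech text)
      assertionVerdicts: Array<{
        assertionId: string;
        verdict: "pass" | "fail";
      }>;
      unexpectedBehaviorOccurred: boolean;
      // Free-text description entered by the tester when an
      // unexpected behavior occurred; this is the part that
      // goes beyond simple verdict assignment.
      unexpectedBehaviorDescription?: string;
    }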

mzgoddard: So the output of the analysis is a numerical value and an assessment

mzgoddard: The first step for automating verdicts is using existing matching data for test and response, but we would need humans to assign the initial verdicts because the machine response is not going to match the human response without lots of modification.

Matt_King: There is a part of me that is wondering whether, when a human is running a test today, we want to automate the collection of the responses they observe

Matt_King: we still need the human there to collect a response the automated collector didn't collect

Matt_King: There will be parts that are not collectable at the start, it will require more development

mzgoddard: I think that's a good goal, but that will be tough to achieve on someone's system.

mzgoddard: While we are the ones developing that tool, it may leave someone's machine in a state that stops responding because of a bug in our stuff.

Matt_King: I think we could use the NVDA add-on to try this out

Matt_King: I'm wondering if there is a world in which you start the test by pressing the start button on a webpage, the human performs a command, the machine collects the output, then the human presses the stop test button

mzgoddard: I think there may be a security concern there

Matt_King: We could try to normalize parts of the human responses, then have the NVDA add-on collect responses, do string compares between them, and work towards a convention that way

Matt_King: It's sort of an in-between step between collecting and analyzing responses
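
A minimal sketch of the normalize-and-compare step being described, assuming deliberately simple rules (lowercasing, stripping punctuation, collapsing whitespace); the real normalization rules would have to be worked out experimentally:

    // Sketch only: normalize two transcriptions of the same AT
    // response so superficial differences don't read as conflicts.
    function normalize(response: string): string {
      return response
        .toLowerCase()
        .replace(/[.,;:!?'"()]/g, "") // strip punctuation
        .replace(/\s+/g, " ")         // collapse whitespace
        .trim();
    }

    function responsesMatch(human: string, machine: string): boolean {
      return normalize(human) === normalize(machine);
    }

    // Example: these differ only in case and punctuation.
    responsesMatch("Checkbox, not checked.", "checkbox not checked");
    // => true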

Matt_King: I guess we have to have consensus before proceeding

mzgoddard: I think we could store verdicts for automated responses with the human verdicts and responses

Matt_King: I just had an idea: let's look at how we do it today

Matt_King: we might have what we need

Matt_King: a human runs a test, then a machine runs a whole test plan

Matt_King: we already have this code that looks for conflicts between two people

Matt_King: it does this by comparing assertion verdicts

Matt_King: you could have a similar set of code that just looks for conflicts in output, after normalizing output on both sides

Matt_King: Then you have normalized output

Matt_King: In the case where there is conflict after normalizing the output, a human can rerun the test and review. If the output matches, the human can update the verdict

Matt_King: We would need a couple of different buttons in the interface if the runner was a machine, for the human to review the test

Matt_King: If there are conflicts, then we would review with our normal conflict resolution process
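
A sketch of what that output-level conflict check might look like, reusing the hypothetical normalize() from the earlier sketch; RunOutput and findOutputConflicts are invented names, not existing ARIA-AT code:

    // Sketch: compare a human run and a machine run of the same
    // test plan, flagging commands whose normalized output differs.
    type RunOutput = Record<string, string>; // response per command

    function findOutputConflicts(
      humanRun: RunOutput,
      machineRun: RunOutput
    ): string[] {
      return Object.keys(humanRun).filter(
        (commandId) =>
          normalize(humanRun[commandId]) !==
          normalize(machineRun[commandId] ?? "")
      );
    }

    // Any commands returned here would go through the normal
    // human conflict-resolution process described above.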

jugglinmike: Can you say more about what an equivalent output could be?

Matt_King: I don't know what the machine recorded responses will look like

Matt_King: Right now, we are operating under the assumption that the machine output could differ from the human output

jugglinmike: One issue could be homophones, or localization

jugglinmike: one aspect we haven't talked about is how and when the order of responses matters

Matt_King: I have a hard time imagining that the order will change between humans and machines with the same configuration; it would have to be a configuration issue

jugglinmike: I'm thinking of a mode switch

jugglinmike: the human might interpret the events differently than the machine would

Matt_King: We may need a way to code for Events

jugglinmike: Events is a new term for this conversation

jugglinmike: Right now we say, here are the things that happened in a response; I don't know if the data structure captures the order of these responses

Matt_King: The response back from the API call will match the output the human would hear

mzgoddard: I think we need to write this into the spec

mzgoddard: When do we record the speech response? When it begins or when it ends?

mzgoddard: The human may perceive a different start and end of a response than a machine would

jugglinmike: That's related to what I was thinking, but different

jugglinmike: I'm still thinking about the need for process and UI for recognizing equivalency in responses.
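
One way to make the ordering and timing questions above concrete: store a response as an ordered sequence of typed events rather than a single string. This is only a sketch, and the event kinds and field names are invented for illustration:

    // Sketch: an AT response as an ordered sequence of events,
    // so that ordering (e.g. a mode switch before speech) and
    // start/end timing are preserved. Event kinds are illustrative.
    type ATEvent =
      | { kind: "speech"; text: string; startMs: number; endMs: number }
      | { kind: "modeSwitch"; mode: string; startMs: number; endMs: number }
      | { kind: "sound"; name: string; startMs: number; endMs: number };

    // Sorting by start time makes the observed order explicit,
    // regardless of how the events were collected.
    function orderEvents(events: ATEvent[]): ATEvent[] {
      return [...events].sort((a, b) => a.startMs - b.startMs);
    }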

Matt_King: It's clear to me that for some of these things we just need to move forward with experimental implementations and see what issues we run into

Matt_King: We should put together a framework on how we want to do that

Matt_King: We can't make it all perfect at the outset. Let's get real-world situations, then figure out how to deal with them

Matt_King: I think we have a strong sense of what we need to anticipate!

Minutes manually created (not a transcript), formatted by scribe.perl version 210 (Wed Jan 11 19:21:32 2023 UTC).

Diagnostics

Succeeded: s/Mike:/jugglinmike/

No scribenick or scribe found. Guessed: Sam_Shaw

Maybe present: Matt_King, MK

All speakers: jugglinmike, Matt_King, michael_fairchild, MK, mzgoddard

Active on IRC: jugglinmike, Matt_King, Sam_Shaw