19:00:49 RRSAgent has joined #aria-at
19:00:53 logging to https://www.w3.org/2023/04/24-aria-at-irc
19:00:59 Zakim has joined #aria-at
19:01:17 RRSAgent, start the meeting
19:01:17 I'm logging. I don't understand 'start the meeting', jugglinmike. Try /msg RRSAgent help
19:01:33 Zakim, start the meeting
19:01:33 RRSAgent, make logs Public
19:01:35 please title this meeting ("meeting: ..."), jugglinmike
19:01:54 meeting: Bi-Weekly Meeting of Assistive Technology Automation Subgroup of ARIA-AT Community Group
19:02:05 present+ jugglinmike
19:02:18 present+ michael_fairchild
19:02:25 present+ mzgoddard
19:02:31 present+ Sam_Shaw
19:03:46 Matt_King has joined #aria-at
19:03:47 mzgoddard has joined #aria-at
19:04:28 scribe+
19:07:12 TOPIC: Classifying parts of an AT response
19:11:31 jugglinmike: some responses can be nonverbal, e.g. a mode change
19:11:51 We started this discussion last meeting.
19:12:21 It is related to separating testing into 2 steps: collection of responses and assignment of verdicts
19:12:36 So far, we have conflated those two steps.
19:13:03 We might need to recognize the difference to separate what automation can do from what it can't.
19:13:48 scribe+
19:14:21 Matt_King: The first thing we need to discuss is: what is the anatomy of a response?
19:15:00 Matt_King: It's more like, what parts of a response do we need to characterize?
19:15:33 Matt_King: Do we need to characterize 100 percent of an AT response to a command?
19:16:51 Matt_King: One example: when we use the full settings, they can give a lot of information about how to operate a control. Generally speaking, that response is something we ignore. The one exception is that, as a human doing this evaluation, if that info is incorrect, then we say we observed an undesirable behavior
19:17:46 Matt_King: In general, however, we don't really care about these responses
19:18:19 Matt_King: So do we need to characterize every element of a response, including those kinds of responses we don't really care about?
19:18:34 michael_fairchild: That's a good point.
19:18:57 Matt_King: Should one of our characterizations be "Don't care"?
19:19:49 Matt_King: So machines will gather everything, but humans only answer questions based on the assertions. Maybe this isn't the right question
19:20:03 michael_fairchild: We could say that there are aspects we don't care about
19:21:36 Matt_King: Z brought up last time that there are specific things like "the state is conveyed", and we haven't treated those undesirable behaviors as assertions. They are more like negative assertions
19:22:18 mzgoddard: in normal web development you don't normally test that a certain error isn't thrown
19:22:55 mzgoddard: currently the way we document the underlying assumption is with undesirable assertions
19:23:42 jugglinmike: undesirable behaviors could also be called exceptional behaviors
19:24:25 Matt_King: I don't know if "exceptional" captures the undesirable aspect. They are both problematic and undesirable
19:25:14 michael_fairchild: I'd argue that "exceptional" makes sense in a programmatic context, but not as well in a human context
19:25:34 jugglinmike: How about erroneous?
19:26:57 Matt_King: Maybe we don't need to wordsmith or rename unexpected behaviors. We need to decide if we formally treat those as assertions in this classification of the AT responses
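A rough sketch of the classification being discussed above, in TypeScript. All names here (ResponseSegment, SegmentClassification, 'dontCare') are hypothetical illustrations, not part of the ARIA-AT test format:

```typescript
// Hypothetical model of how parts of a collected AT response might be
// classified; none of these names come from the ARIA-AT project itself.

// How a tester (or a tool) characterizes one part of a response.
type SegmentClassification =
  | 'assertion'            // supports or contradicts a stated assertion
  | 'undesirableBehavior'  // a "negative assertion": something that must NOT happen
  | 'dontCare';            // e.g. verbose operating hints we deliberately ignore

interface ResponseSegment {
  text: string;            // the spoken or braille output, if any
  nonverbal: boolean;      // e.g. a mode-change sound with no speech
  classification: SegmentClassification;
}

// A full response to one command is an ordered list of segments.
type ATResponse = ResponseSegment[];
```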
19:28:31 Matt_King: Right now, for human testing, it's a yes or no on every single test whether an unexpected behavior occurred; if yes, the human tester then describes which one occurred
19:29:52 Matt_King: So we are in agreement: there is a clear difference between collecting responses and analyzing responses
19:30:21 Matt_King: How should we label these two steps in the process? Are "collecting" and "analyzing" the right words?
19:31:00 jugglinmike: I have been using those words. We should also discuss using the term "verdict assignment"
19:32:30 Matt_King: The analysis is more than just verdict assignment.
19:33:39 Unless each part of the response is a different assertion, it's more than just verdict assignment.
19:34:36 mzgoddard: What would be expected to be stored in a database for a verdict assignment?
19:34:53 Matt_King: For example, in the case of excess verbosity, the response is yes or no; would that be a verdict assignment?
19:35:46 jugglinmike: Part of me thinks it's better to explicitly classify which part of the response was excess verbosity, rather than just yes or no
19:36:29 jugglinmike: that only makes sense to me if there are going to be multiple text-based AT responses; then you would be disambiguating something
19:37:28 Matt_King: So we have an input field there for the user to describe the unexpected behaviors. I expect that would be stored, and that is more than just verdict assignment
19:38:51 mzgoddard: So the output of the analysis is a numerical value and an assessment
19:41:13 mzgoddard: The first step for automating verdicts is using existing matching data for tests and responses, but we would need humans to assign the initial verdicts, because the machine response is not going to match the human response without lots of modification.
19:41:57 Matt_King: Part of me is wondering, if a human is running a test today, whether we want to automate the collection of the responses they observe
19:42:20 Matt_King: we still need the human there to collect any response the automated collector didn't collect
19:43:10 Matt_King: There will be parts that are not collectable at the start; it will require more development
19:43:40 mzgoddard: I think that's a good goal, but it will be tough to achieve on someone's system.
19:44:23 mzgoddard: While we are the ones developing that tool, it may leave someone's machine in a state where it stops responding because of a bug in our stuff.
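Assuming the two-step split discussed above, a minimal sketch of what each step might persist; every field name here is illustrative, not the working group's actual schema:

```typescript
// Step 1: collection. Raw observations only, no judgment attached.
interface CollectedResponse {
  testId: string;
  commandId: string;
  collector: 'human' | 'machine';
  output: string;                         // everything observed, verbatim
  collectedAt: Date;
}

// Step 2: analysis. Verdicts plus the free-text detail discussed above,
// which is why analysis is more than verdict assignment alone.
interface AnalysisRecord {
  responseId: string;                     // points at a CollectedResponse
  assertionVerdicts: Record<string, 'pass' | 'fail'>;
  unexpectedBehaviorOccurred: boolean;
  unexpectedBehaviorDescription?: string; // the input field mentioned above
  excessVerbosity?: string;               // which part of the output was excess
}
```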
19:45:11 Matt_King: I think we could use the NVDA add-on to try this out
19:46:12 Matt_King: I'm wondering if there is a world in which you start the test by pressing a start button on a webpage, the human performs a command, the machine collects the output, then the human presses the stop test button
19:46:21 mzgoddard: I think there may be a security concern there
19:49:12 Matt_King: We could try to normalize parts of the human responses, then have the NVDA add-on collect responses, do string compares between them, and work toward a convention that way
19:49:40 Matt_King: It's sort of an in-between step between collecting and analyzing responses
19:50:05 Matt_King: I guess we have to have consensus before proceeding
19:50:41 mzgoddard: I think we could store verdicts for automated responses alongside the human verdicts and responses
19:50:59 Matt_King: I just had an idea; let's look at how we do it today
19:51:07 Matt_King: we might have what we need
19:51:20 Matt_King: a human runs a test, then a machine runs a whole test plan
19:51:32 Matt_King: we already have code that looks for conflicts between two people
19:51:43 Matt_King: it does this by comparing assertion verdicts
19:52:11 Matt_King: you could have a similar set of code that just looks for conflicts in output, after normalizing output on both sides
19:52:21 Matt_King: Then you have normalized output
19:53:06 Matt_King: In the case where there is a conflict after normalizing the output, a human can rerun the test and review. If the output matches, the human can update the verdict
19:53:53 Matt_King: We would need a couple of different buttons in the interface, if the runner was a machine, for the human to review the test
19:54:10 Matt_King: If there are conflicts, then we would review with our normal conflict resolution process
19:54:37 jugglinmike: Can you say more about what an equivalent output could be?
19:54:56 Matt_King: I don't know what the machine-recorded responses will look like
19:55:30 Matt_King: Right now, we are operating under the assumption that the machine output could differ from the human output
19:56:04 jugglinmike: One issue could be homophones, or localization
19:56:28 jugglinmike: one aspect we haven't talked about is how and when the order of responses matters
19:57:38 Matt_King: I have a hard time imagining that the order will differ between a human and a machine with the same configuration; it would have to be a configuration issue
19:57:51 jugglinmike: I'm thinking of a mode switch
19:58:19 jugglinmike: the human might interpret the events differently than the machine would
19:59:35 Matt_King: We may need a way to code for events
20:00:07 jugglinmike: "Events" is a new term for this conversation
20:00:38 jugglinmike: Right now we say "here are the things that happened in a response"; I don't know if the data structure captures the order of those responses
20:01:04 Matt_King: The response back from the API call will match the output the human would hear
20:01:19 mzgoddard: I think we need to write this into the spec
20:01:41 mzgoddard: When do we record the speech response? When it begins or when it ends?
20:02:14 mzgoddard: The human may perceive a different start and end of a response than a machine would
20:02:52 jugglinmike: That's related to what I was thinking, but different
20:03:41 jugglinmike: I'm still thinking about the need for process and UI for recognizing equivalency in responses.
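A sketch of the normalize-then-compare idea Matt_King describes; the normalization rules here (lowercasing, stripping punctuation, collapsing whitespace) are placeholder assumptions, and as jugglinmike notes, plain string cleanup cannot handle homophones or localization:

```typescript
// Placeholder normalization: lowercase, strip punctuation, collapse whitespace.
function normalizeOutput(raw: string): string {
  return raw
    .toLowerCase()
    .replace(/[.,;:!?]/g, ' ')
    .replace(/\s+/g, ' ')
    .trim();
}

// Mirrors the existing human-vs-human conflict check, applied instead to
// normalized output from a human run and a machine run of the same test.
function outputsConflict(humanOutput: string, machineOutput: string): boolean {
  return normalizeOutput(humanOutput) !== normalizeOutput(machineOutput);
}

// Example: these two do not conflict once normalized.
console.log(outputsConflict('Checkbox, not checked.', 'checkbox not checked')); // false
```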
20:04:31 Matt_King: It's clear to me that for some of these things we just need to move forward with experimental implementations and see what issues we run into
20:04:42 Matt_King: We should put together a framework for how we want to do that
20:05:07 Matt_King: We can't make it all perfect at the outset. Let's get real-world situations, then figure out how to deal with them
20:05:31 Matt_King: I think we have a strong sense of what we need to anticipate!
20:06:20 rrsagent, make minutes
20:06:21 I have made the request to generate https://www.w3.org/2023/04/24-aria-at-minutes.html Matt_King
20:47:15 jugglinmike has left #aria-at
21:00:10 jongund has joined #aria-at
21:30:43 Zakim has left #aria-at