19:00:49 RRSAgent has joined #aria-at
19:00:53 logging to https://www.w3.org/2023/04/24-aria-at-irc
19:00:59 Zakim has joined #aria-at
19:01:17 RRSAgent, start the meeting
19:01:17 I'm logging. I don't understand 'start the meeting', jugglinmike. Try /msg RRSAgent help
19:01:33 Zakim, start the meeting
19:01:33 RRSAgent, make logs Public
19:01:35 please title this meeting ("meeting: ..."), jugglinmike
19:01:54 meeting: Bi-Weekly Meeting of Assistive Technology Automation Subgroup of ARIA-AT Community Group
19:02:05 present+ jugglinmike
19:02:18 present+ michael_fairchild
19:02:25 present+ mzgoddard
19:02:31 present+ Sam_Shaw
19:03:46 Matt_King has joined #aria-at
19:03:47 mzgoddard has joined #aria-at
19:04:28 scribe+
19:07:12 TOPIC: Classifying parts of an AT response
19:11:31 jugglinmike: some responses can be nonverbal, e.g. a mode change
19:11:51 We started this discussion last meeting.
19:12:21 It is related to separating testing into 2 steps: collection of responses and assignment of verdicts
19:12:36 So far, we have conflated those two steps.
19:13:03 We might need to recognize the difference to separate what automation can do from what it can't.
19:13:48 scribe+
19:14:21 Matt_King: The first thing we need to discuss is: what is the anatomy of a response?
19:15:00 Matt_King: It's more like, what parts of a response do we need to characterize?
19:15:33 Matt_King: Do we need to characterize 100 percent of an AT response to a command?
19:16:51 Matt_King: One example: when we use the full settings, they can give a lot of information about how to operate a control. Generally speaking, that response is something we ignore. The one exception is that, as a human doing this evaluation, if that info is incorrect, then we say we observed an undesirable behavior
19:17:46 Matt_King: In general, however, we don't really care about these responses
19:18:19 Matt_King: So do we need to characterize every element of a response, including those kinds of responses we don't really care about?
19:18:34 michael_fairchild: That's a good point.
19:18:57 Matt_King: Should one of our characterizations be "Don't care"?
19:19:49 Matt_King: So machines will gather everything, but humans only answer questions based on the assertions. Maybe this isn't the right question
19:20:03 michael_fairchild: We could say that there are aspects we don't care about
19:21:36 Matt_King: Z brought up last time that there are specific things like "the state is conveyed", and we haven't treated those undesirable behaviors as assertions. They are more like negative assertions
19:22:18 mzgoddard: in normal web development you don't normally test that a certain error isn't thrown
19:22:55 mzgoddard: currently the way we document the underlying assumption is with undesirable assertions
19:23:42 jugglinmike: undesirable behaviors could also be called exceptional behaviors
19:24:25 Matt_King: I don't know if "exceptional" captures the undesirable aspect. They are both problematic and undesirable
19:25:14 michael_fairchild: I'd argue that "exceptional" makes sense in a programmatic context, but not as well in a human context
19:25:34 jugglinmike: How about erroneous?
19:26:57 Matt_King: Maybe we don't need to wordsmith or rename unexpected behaviors. We need to decide if we formally treat those as assertions in this classification of the AT responses
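A rough sketch of the classification being discussed above, in TypeScript. All names here (ResponseSegment, SegmentClassification, 'dontCare') are hypothetical illustrations, not part of the ARIA-AT test format:

```typescript
// Hypothetical model of how parts of a collected AT response might be
// classified; none of these names come from the ARIA-AT project itself.

// How a tester (or a tool) characterizes one part of a response.
type SegmentClassification =
  | 'assertion'            // supports or contradicts a stated assertion
  | 'undesirableBehavior'  // a "negative assertion": something that must NOT happen
  | 'dontCare';            // e.g. verbose operating hints we deliberately ignore

interface ResponseSegment {
  text: string;            // the spoken or braille output, if any
  nonverbal: boolean;      // e.g. a mode-change sound with no speech
  classification: SegmentClassification;
}

// A full response to one command is an ordered list of segments.
type ATResponse = ResponseSegment[];
```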
19:28:31 Matt_King: Right now, for human testing, it's a yes or no on every single test whether an unexpected behavior occurred; if yes, the human tester then describes which one occurred
19:29:52 Matt_King: So we are in agreement: there is a clear difference between collecting responses and analyzing responses
19:30:21 Matt_King: How should we label these two steps in the process? Are "collecting" and "analyzing" the right words?
19:31:00 jugglinmike: I have been using those words. We should also discuss using the term "verdict assignment"
19:32:30 Matt_King: The analysis is more than just verdict assignment.
19:33:39 Unless each part of the response is a different assertion, it's more than just verdict assignment.
19:34:36 mzgoddard: What would be expected to be stored in a database for a verdict assignment?
19:34:53 Matt_King: For example, in the case of excess verbosity, the response is yes or no; would that be a verdict assignment?
19:35:46 jugglinmike: Part of me thinks it's better to explicitly classify which part of the response was excess verbosity, rather than just yes or no
19:36:29 jugglinmike: that only makes sense to me if there are going to be multiple text-based AT responses; then you would be disambiguating something
19:37:28 Matt_King: So we have an input field there for the user to describe the unexpected behaviors. I expect that would be stored, and that is more than just verdict assignment
19:38:51 mzgoddard: So the output of the analysis is a numerical value and an assessment
19:41:13 mzgoddard: The first step for automating verdicts is using existing matching data for tests and responses, but we would need humans to assign the initial verdicts, because the machine response is not going to match the human response without lots of modification.
19:41:57 Matt_King: Part of me is wondering, if a human is running a test today, whether we want to automate the collection of the responses they observe
19:42:20 Matt_King: we still need the human there to collect any response the automated collector didn't collect
19:43:10 Matt_King: There will be parts that are not collectable at the start; it will require more development
19:43:40 mzgoddard: I think that's a good goal, but it will be tough to achieve on someone's system.
19:44:23 mzgoddard: While we are the ones developing that tool, it may leave someone's machine in a state where it stops responding because of a bug in our stuff.
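Assuming the two-step split discussed above, a minimal sketch of what each step might persist; every field name here is illustrative, not the working group's actual schema:

```typescript
// Step 1: collection. Raw observations only, no judgment attached.
interface CollectedResponse {
  testId: string;
  commandId: string;
  collector: 'human' | 'machine';
  output: string;                         // everything observed, verbatim
  collectedAt: Date;
}

// Step 2: analysis. Verdicts plus the free-text detail discussed above,
// which is why analysis is more than verdict assignment alone.
interface AnalysisRecord {
  responseId: string;                     // points at a CollectedResponse
  assertionVerdicts: Record<string, 'pass' | 'fail'>;
  unexpectedBehaviorOccurred: boolean;
  unexpectedBehaviorDescription?: string; // the input field mentioned above
  excessVerbosity?: string;               // which part of the output was excess
}
```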
19:45:11 Matt_King: I think we could use the NVDA add-on to try this out
19:46:12 Matt_King: I'm wondering if there is a world in which you start the test by pressing a start button on a webpage, the human performs a command, the machine collects the output, then the human presses the stop test button
19:46:21 mzgoddard: I think there may be a security concern there
19:49:12 Matt_King: We could try to normalize parts of the human responses, then have the NVDA add-on collect responses, do string compares between them, and work toward a convention that way
19:49:40 Matt_King: It's sort of an in-between step between collecting and analyzing responses
19:50:05 Matt_King: I guess we have to have consensus before proceeding
19:50:41 mzgoddard: I think we could store verdicts for automated responses alongside the human verdicts and responses
19:50:59 Matt_King: I just had an idea; let's look at how we do it today
19:51:07 Matt_King: we might have what we need
19:51:20 Matt_King: a human runs a test, then a machine runs a whole test plan
19:51:32 Matt_King: we already have code that looks for conflicts between two people
19:51:43 Matt_King: it does this by comparing assertion verdicts
19:52:11 Matt_King: you could have a similar set of code that just looks for conflicts in output, after normalizing output on both sides
19:52:21 Matt_King: Then you have normalized output
19:53:06 Matt_King: In the case where there is a conflict after normalizing the output, a human can rerun the test and review. If the output matches, the human can update the verdict
19:53:53 Matt_King: We would need a couple of different buttons in the interface, if the runner was a machine, for the human to review the test
19:54:10 Matt_King: If there are conflicts, then we would review with our normal conflict resolution process
19:54:37 jugglinmike: Can you say more about what an equivalent output could be?
19:54:56 Matt_King: I don't know what the machine-recorded responses will look like
19:55:30 Matt_King: Right now, we are operating under the assumption that the machine output could differ from the human output
19:56:04 jugglinmike: One issue could be homophones, or localization
19:56:28 jugglinmike: one aspect we haven't talked about is how and when the order of responses matters
19:57:38 Matt_King: I have a hard time imagining that the order will differ between a human and a machine with the same configuration; it would have to be a configuration issue
19:57:51 jugglinmike: I'm thinking of a mode switch
19:58:19 jugglinmike: the human might interpret the events differently than the machine would
19:59:35 Matt_King: We may need a way to code for events
20:00:07 jugglinmike: "Events" is a new term for this conversation
20:00:38 jugglinmike: Right now we say "here are the things that happened in a response"; I don't know if the data structure captures the order of those responses
20:01:04 Matt_King: The response back from the API call will match the output the human would hear
20:01:19 mzgoddard: I think we need to write this into the spec
20:01:41 mzgoddard: When do we record the speech response? When it begins or when it ends?
20:02:14 mzgoddard: The human may perceive a different start and end of a response than a machine would
20:02:52 jugglinmike: That's related to what I was thinking, but different
20:03:41 jugglinmike: I'm still thinking about the need for process and UI for recognizing equivalency in responses.
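A sketch of the normalize-then-compare idea Matt_King describes; the normalization rules here (lowercasing, stripping punctuation, collapsing whitespace) are placeholder assumptions, and as jugglinmike notes, plain string cleanup cannot handle homophones or localization:

```typescript
// Placeholder normalization: lowercase, strip punctuation, collapse whitespace.
function normalizeOutput(raw: string): string {
  return raw
    .toLowerCase()
    .replace(/[.,;:!?]/g, ' ')
    .replace(/\s+/g, ' ')
    .trim();
}

// Mirrors the existing human-vs-human conflict check, applied instead to
// normalized output from a human run and a machine run of the same test.
function outputsConflict(humanOutput: string, machineOutput: string): boolean {
  return normalizeOutput(humanOutput) !== normalizeOutput(machineOutput);
}

// Example: these two do not conflict once normalized.
console.log(outputsConflict('Checkbox, not checked.', 'checkbox not checked')); // false
```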
20:04:31 Matt_King: It's clear to me that for some of these things we just need to move forward with experimental implementations and see what issues we run into
20:04:42 Matt_King: We should put together a framework for how we want to do that
20:05:07 Matt_King: We can't make it all perfect at the outset. Let's get real-world situations, then figure out how to deal with them
20:05:31 Matt_King: I think we have a strong sense of what we need to anticipate!
20:06:20 rrsagent, make minutes
20:06:21 I have made the request to generate https://www.w3.org/2023/04/24-aria-at-minutes.html Matt_King
20:47:15 jugglinmike has left #aria-at
21:00:10 jongund has joined #aria-at
21:30:43 Zakim has left #aria-at