W3C

Accessibility Conformance Testing Teleconference

03 Apr 2017

See also: IRC log

Attendees

Present
Wilco, MaryJo, Charu, Shadi, Kathy, Romain, Moe, Anne
Regrets
Chair
Wilco, MaryJo
Scribe
Romain

Contents

Test case repository https://www.w3.org/WAI/GL/task-forces/conformance-testing/wiki/Test_case_repository
ACT benchmark https://www.w3.org/WAI/GL/task-forces/conformance-testing/wiki/Benchmark_requirements

<Wilco> https://www.w3.org/WAI/GL/task-forces/conformance-testing/wiki/Test_case_repository

Test case repository https://www.w3.org/WAI/GL/task-forces/conformance-testing/wiki/Test_case_repository

wilco: we want to put together a set of test cases for running automated test tools
... something very straightforward
... 2 parts:
... 1. a small test case (HTML file) with good/bad a11y practice
... 2. some metadata on the SC and which element passes or fails the criteria
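
[Not from the call: a minimal sketch of what one such test case plus its metadata could look like, written in TypeScript for illustration; the field names (successCriterion, selector, expected) are assumptions, not an agreed format.]

  // Hypothetical shape for one test case: a small HTML snippet paired with
  // metadata naming the Success Criterion and the expected outcome.
  interface TestCase {
    successCriterion: string;        // e.g. "1.1.1"
    html: string;                    // content of the small HTML file
    selector: string;                // the element the expectation applies to
    expected: "passed" | "failed";   // outcome a correct checker should report
  }

  // A "bad practice" sample: an image with no text alternative.
  const missingAltText: TestCase = {
    successCriterion: "1.1.1",
    html: '<img src="chart.png">',
    selector: "img",
    expected: "failed",
  };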

<Wilco> https://www.w3.org/WAI/GL/task-forces/conformance-testing/wiki/Testing_Resources

wilco: we started putting together a list of resources that could help with testing (OOA, axe Rules, QuailJS, etc.)
... do you have your own list internally that you might be using and willing to share?

anne: Siteimprove has a list, I can ask whether we're willing to share

kathy: we have a list on github

wilco: IBM, you have the Va11yS project

moe: yes, with samples for the WCAG techniques
... are you looking for working code, or rules? we also have a new a11y testing dashboard, with rules in English

wilco: we're looking specifically for test cases

charu: then I don't think our dashboard would be useful

moe: Va11yS can be useful

wilco: can you tell us more?

moe: it's a GitHub repo, we took the code snippets from the WCAG techniques and put them in individual HTML files
... served on github.io
... we have some additional ones, but the ones from the WCAG techniques are the same (but available as live code)

wilco: how did you document the results you want to get out of this?

moe: we wanted to provide a simple way for developers to see live working samples
... we're not using it as a bucket of tests to run and produce results
... just to demonstrate how to properly code for a11y
... I'm not saying it can't be used for regression testing; it's a potential use, but not something we did

wilco: if we wanted to do something like that, how would we start?

<MoeKraft> https://ibma.github.io/Va11yS/

moe: we have no negative test cases

wilco: most of the tests written for tools are probably negative test cases
... romain, you had a list as well?

romain: it's a short list that we just set up to manually evaluate checker tools, but might not be useful in this context

wilco: the question is how do we want to put our tests together to have something that tool developers can easily run against?

charu: we may have a benchmark / regression test bucket. I will look into it and see if we can share that

wilco: with aXe core, we're using a test library, I can imagine doing something similar
... as long as the test snippets are accessible through an HTTP request and with metadata, there can be a way to test against it, does that make sense?
... Anne, Stein Erik, thoughts?
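
[Not from the call: a minimal sketch of how a tool could fetch such test cases over HTTP and check itself against them, assuming a runtime that provides fetch(); the manifest fields and runTool() are hypothetical placeholders, not any real tool's API.]

  // Hypothetical: fetch a list of test cases (location, SC, expected
  // outcome) over HTTP, fetch each HTML file, and run a checker on it.
  interface CaseEntry {
    url: string;
    successCriterion: string;
    expected: "passed" | "failed";
  }

  // Stand-in for whatever API a given checker exposes.
  declare function runTool(html: string, sc: string): Promise<"passed" | "failed">;

  async function runSuite(manifestUrl: string): Promise<void> {
    const cases: CaseEntry[] = await (await fetch(manifestUrl)).json();
    for (const c of cases) {
      const html = await (await fetch(c.url)).text();
      const actual = await runTool(html, c.successCriterion);
      console.log(`${c.url}: expected ${c.expected}, got ${actual}`);
    }
  }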

anne: I'm thinking that might be a place to send my developers instead of me :-)

wilco: we could try to take all the test cases and put them in a single repository, but that would probably be difficult to maintain over time

charu: I was wondering, are we planning to vet all these cases at some point? we have a whole bunch from different places
... say we are testing for 1.1.1 and say the test cases for OOA pass but those from QuailJS fail.
... do we plan to vet the cases and have a list of valid cases that can be used?

wilco: I'm pretty confident in the cases which come from WCAG, but for the rest I don't know much

charu: Va11yS are not really test cases but just HTML elements, we may have to do some formatting. They're not test cases, they're valid code samples
... we could create test cases off of them

wilco: what's the difference you see between test cases and code samples?

charu: you want several test cases for a single SC, code samples are more about elements

wilco: ok, each of them is its own test case, but we'd need to combine these
... it seems to me that the number of cases already makes a reasonable list, the test samples would be valuable
... what would be the next step for us? we have a reasonably-sized bunch of code snippets for HTML elements, how to make a single suite of tests from that?

romain: maybe we need to look at the test format first, how easy it is to run from tools, and then see if it's easy to transform these test materials?

wilco: I think we didn't want to touch the tests from the various projects

anne: I didn't have the time to look at the list of the tests, but there might be a difference between tests targeting a single technique and tests for a whole SC?
... our tests are linked to a technique, I guess some others might be linked to a SC, so there might be some differences in the granularity of the tests
... we have to make sure they're aligned

wilco: we want to get a big chunk of test cases and map them to SC
... from there we can map to the Rules

stein erik: do we have a way to link techniques to SC, to say if they are mutually exclusive?

scribe: there are cases where implementing 2 different techniques could actually cause problems for another technique. do we have any way to describe these relations?

wilco" can we have an example?

stein erik: I will come up with an example

wilco: I can think of it as an accessibility support issue, so I would say no. but otherwise we would address that in the rule itself, so it probably shouldn't be part of the test case
... I have a suggestion:
... if we take all of these test repositories, can the owners create a metadata file that describes where to get the HTML files and what the expected outcome should be?
... if we document that in a common format I would say we'd be able to run the tools against the tests automatically
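
[Not from the call: a minimal sketch of what such a metadata file could describe, written as a TypeScript value for illustration (in practice it might be a JSON file per repository); the field names and example.org URLs are assumptions, not an agreed format.]

  // Hypothetical metadata a repository owner could publish for their cases.
  interface TestCaseEntry {
    url: string;                     // where to fetch the HTML test file
    successCriterion: string;        // WCAG SC the case targets
    expected: "passed" | "failed";   // outcome a correct tool should report
  }

  const manifest: TestCaseEntry[] = [
    { url: "https://example.org/cases/img-no-alt.html",
      successCriterion: "1.1.1", expected: "failed" },
    { url: "https://example.org/cases/img-with-alt.html",
      successCriterion: "1.1.1", expected: "passed" },
  ];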

anne: Can I take one step back and look at an example?
... for 2.4.5 ("multiple ways") there is a need to combine techniques

wilco: right
... the granularity of our tests is at the SC level, so we'd need either an example using both techniques, or the example would be a violation.
... charu, what do you think of a metadata file for the test suite?

charu: it seems it can be useful, I'd like to see an example
... if some kind of scripting is able to get the locations of all the files, it can be useful

wilco: charu, do you have some time maybe next week to work on a proposal for this?

charu: I don't think I can commit to that...

wilco: I don't think we're in a super hurry with that one. for now let's just create an issue to start working on this

ACTION moe to create an issue to kick-start the work on a test metadata description file

<trackbot> Error finding 'moe'. You can review and register nicknames at <http://www.w3.org/WAI/GL/task-forces/conformance-testing/track/users>.

ACT benchmark https://www.w3.org/WAI/GL/task-forces/conformance-testing/wiki/Benchmark_requirements

[document reads: "coming soon..."]

wilco: the deliverable is to write up what exactly we want to do with benchmarking rules
... the way I see benchmarking: there's gonna have to be a way to take real-world websites, reviewed by accessibility experts, evaluate those on a per-page, per-SC level, and also evaluate those through the rules
... when doing that multiple times, we get a sense of how accurate the rule is
... how often the rules pick up an a11y violation, and how often they don't pick up a violation
... then improve them over time
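
[Not from the call: a minimal sketch of one way this accuracy idea could be quantified, comparing expert verdicts with rule verdicts for the same page and SC; the two rates computed here are an assumption about what "accuracy" would mean, not an agreed measure.]

  // Hypothetical: compare expert judgements with rule results for the same
  // page/SC pairs; report how often the rules miss real violations and how
  // often they flag something the experts did not.
  interface Verdict {
    page: string;        // page URL
    sc: string;          // success criterion, e.g. "1.1.1"
    violation: boolean;  // true if a violation was found
  }

  function benchmark(expert: Verdict[], rule: Verdict[]) {
    const key = (v: Verdict) => `${v.page}|${v.sc}`;
    const ruleFound = new Map<string, boolean>();
    for (const v of rule) ruleFound.set(key(v), v.violation);

    let missed = 0, expertViolations = 0, falseAlarms = 0, ruleViolations = 0;
    for (const e of expert) {
      const r = ruleFound.get(key(e));
      if (r === undefined) continue;      // rule produced no result here
      if (e.violation) {
        expertViolations++;
        if (!r) missed++;                 // expert found it, rule did not
      }
      if (r) {
        ruleViolations++;
        if (!e.violation) falseAlarms++;  // rule flagged it, expert did not
      }
    }
    return {
      missedRate: expertViolations ? missed / expertViolations : 0,
      falseAlarmRate: ruleViolations ? falseAlarms / ruleViolations : 0,
    };
  }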

romain: one issue is that it's gonna be a problem to document a benchmarking session if the real-world site evolves afterwards

wilco: it's gonna take a lot of resource to do the evaluation
... you're gonna need to work with fairly recent data on that

anne: we have a bunch of customers who are happy to report when we have false positives
... we don't see all the weird edge cases in many sites, so we need to test many sites to identify some issues
... sometimes an issue pops up after several years

stein erik: there are some challenges with what you mention, if you only do the evaluation per-page and per-SC, there is some degree of uncertainty on whether the manual check and automated check apply to the same problem

scribe: we need to see what's efficient and what's effective and balance that accordingly

charu: all our internal product teams use the tools, so we collect all the edge cases, false positives, and go from there
... a customer base using the rules is essential
... I don't know if there is a way to capture everything in a test site that you create, there are so many different possible scenarios

wilco: right, that's why the benchmarking should be based on existing web sites
... there are a lot of unknowns
... we're all relying heavily on user feedback to improve the quality of our tools
... it's a kind of whack-a-mole

anne: when a11y consultants use our tools, they'll find other things than when our users use them
... there could be a combination of looking into the tool and into the site
... the a11y consultants are more technical

wilco: all these points about the granularity and up-to-date data are absolutely valid

[wilco stuck on his own thoughts...]

wilco: what if we do our benchmarking by putting the rules in a draft or proposal mode, and they stay like that until they're implemented and we get enough feedback on them?

charu: I like that idea, we do something similar to that.
... some rules are in 'beta'

wilco: what is then the point when you're taking them out of beta?

charu: maybe 3 months or so, to see if we need any tweaking or if they're efficient

wilco: so a fixed time period? certain rules are applied more frequently than others

charu: good point. we just started this recently...

anne: after internal testing we just usually release them and then wait for some customer testing
... we have to put it out to the real world to get real insight

wilco: with axe we played around the idea of having experimental rules
... I will take these ideas and will work on a proposal for the benchmark
... in the mean time, worth noting:
... we have called for consensus, our FPWD has been approved!
... shadi will publish it tomorrow or Wednesday

Summary of Action Items

Summary of Resolutions

[End of minutes]

Minutes formatted by David Booth's scribe.perl version 1.152 (CVS log)
$Date: 2017/04/03 16:07:07 $