See also: IRC log
Sujasree: leading Deque team in India
Debra: engagement manager for the Deque federal account
... want to understand the direction and stay on top of it
... not a coder but happy to help
<Wilco> https://github.com/w3c/wcag-act/issues/81
WF: previous discussion, two weeks ago, brought this up
... topic keeps coming up
... confusion about what was meant
... initially was meant as a mechanism to figure out if rules will
generate false positives in practice
... measure their accuracy
... no real solution proposed
... initially thought about comparing tools to manually tested results
... but that idea is changing
... more about collecting feedback by using our rules
... let users try out the rules, and react to feedback
... IBM and Deque have a kind of "beta" or "experimental" approach
... until tests are confirmed
SES: so validation means it is accepted until someone complains?
WF: up for discussion
SES: so what would the use be of the benchmarking?
... not sure we will receive comments
... unless you have a mechanism to ensure testing
... but that is not a guarantee
... so what is the purpose then?
<MoeKraft> Shadi: This is what I understood Alistair was proposing: when you develop a rule, you develop test cases along with the rule. The two approaches are not mutually exclusive.
<MoeKraft> Shadi: W3C maturity model, things can be in draft or testing phase; however, we need at least a minimal amount of testing, and ask for feedback for further validation.
WF: already part of the spec to write test cases
... that filters out the known potential problems
... but that is not benchmarking
... to respond to SteinErik: before rules are put in the repository, they have test cases
... additionally a feedback mechanism
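(An illustrative sketch, not an agreed format: the test cases that accompany a proposed rule could pair an input snippet with the outcome the rule is expected to return, so known failure conditions are pinned down before publication. The names and outcome strings below are hypothetical.)

    # Hypothetical shape of the test cases shipped with a proposed rule.
    TEST_CASES = [
        {"input": '<img src="logo.png" alt="Company logo">', "expected": "passed"},
        {"input": '<img src="logo.png">', "expected": "failed"},
    ]

    def failing_cases(rule, cases=TEST_CASES):
        """Report the cases a rule implementation gets wrong; `rule` is any
        callable taking an HTML snippet and returning an outcome string."""
        return [c for c in cases if rule(c["input"]) != c["expected"]]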
DM: already an existing repository of failure conditions?
WF: yes. currently different tools have their test case repositories, but want to merge these
DM: so common place to validate the rules
WF: thoughts?
... maybe need to include positive feedback too
SES: concern about the usefulness of the information we receive
... no clear view yet
<MoeKraft> Shadi: We definitely are changing from what we originally had in mind for benchmarking, at least what we have in our work statement. If we do have a test suite, we could have tools run it by themselves to provide information on how well developers support the test suite.
<MoeKraft> Shadi: Not sure how many tool vendors would want to expose false positives.
<MoeKraft> Shadi: There would have to be self-declaration. But this could be a criterion, forcing some useful information to come back here. If tool implementations do not report that the rule comes back cleanly, then it stays experimental.
<MoeKraft> Shadi: Test cases would need to be versioned too.
<MoeKraft> Shadi: Would be complex because of regression.
<MoeKraft> Shadi: Encourage someone who proposes a rule to get it implemented first. Plus, we would get more competition because tool vendors would try to implement and get a green light.
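(Illustrative only: versioning the test cases, as suggested above, might mean stamping each self-declared implementation report with the suite version it ran against, so regressions can be traced. All field names and values below are hypothetical.)

    # Hypothetical self-declaration report from one tool vendor.
    report = {
        "tool": "ExampleChecker",         # hypothetical vendor tool
        "rule": "SC4-1-1-idref",
        "test_suite_version": "2017-06",  # version the cases were run against
        "cases_passed": 12,
        "cases_failed": 0,  # non-zero results would keep the rule experimental
    }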
WF: so what would be the evidence that rules work in practice?
<MoeKraft> Wilco: What would be the evidence that a rule is accurate?
WF: if enough tools implement it?
<MoeKraft> Shadi: First criterion: hope there is an active community that constantly reviews proposed rules. This is the first level of checking. Rules are proposed by vendors.
<MoeKraft> Shadi: If a rule is accepted by a competitor, this is a good sign. Consensus building. Raise the bar to 3 independent implementations. Disclose how many vendors implement a rule.
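(A minimal sketch of the criterion as discussed, assuming the bar is three independent implementations that report the rule running cleanly against the test cases; the flag names are illustrative.)

    # Hypothetical maturity flag derived from clean, independent implementations.
    def maturity_flag(clean_implementations, required=3):
        return "verified" if clean_implementations >= required else "experimental"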
WF: hesitant, because implementation is only proof of accuracy if the implementing tools produce no false positives
... some rules indicate a level of accuracy
... for example, if someone proposes a test rule with only 70% accuracy
DM: can run against failure conditions, but can also check for false positives
... then there is semi-automated testing, where human intervention is needed
... may or may not test to the full standard
... like maybe only 50%
... so it is a scale of testing
WF: agree that some rules are more reliable than others
DM: reliability and completeness
SES: yes, but what does accurate actually mean?
... consistent and repeatable versus correct
WF: "average between false positives and false negatives" (reads out from spec)
<Wilco> https://w3c.github.io/wcag-act/act-rules-format.html#quality-accuracy
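(One possible reading of that definition, sketched for illustration: accuracy as one minus the average of the false positive and false negative rates. The spec linked above is authoritative.)

    # Sketch: accuracy from a rule's confusion counts, under the reading above.
    def accuracy(fp, tn, fn, tp):
        fp_rate = fp / (fp + tn)  # share of actually-passing cases wrongly failed
        fn_rate = fn / (fn + tp)  # share of actually-failing cases wrongly passed
        return 1 - (fp_rate + fn_rate) / 2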
SAZ: think we are really trying to avoid a central gate-keeping group, so this can scale up
... but need minimum bar of acceptance defined by the test cases
... this can be increased over time, as new situations and new technologies emerge
... may even need to pull rules at some point
WF: can publish rules at any time, with different maturity flags
... fits with the W3C process
SES: agree
MJM: so do I
SAZ: me too
WF: maybe not more than one flag
... just something like "beta" or "experimental"
SES: rely on implementers providing information that the tools function
WF: when implementers find an issue and make a change, we want to encourage them to provide that feedback
SAZ: ideally by adding a new test case
WF: want to make sure that the rules stay in sync
... implementers should give feedback by way of test cases
... does not work unless implementers share their test cases
... need iterative cycles, but need a way to do that
... have to encourage tool vendors
MK: guess some vendors will not want to share all their rules
WF: so can't take rules unless test cases are shared back
MK: how to phrase that?
SES: fair expectation to set that rules will be shared, because it is a quality check
WF: does somebody want to take over writing up this part?
... also need a name change
SES: how about validation?
WF: think publication requirements
... talking about how test rules get posted on the W3C website
SAZ: can take this over
... like the idea of incentives
... and the cycle that the incentives drive
SES: happy to work on this too
RESOLUTION: SES will head up drafting the publication/validation/benchmarking piece, with SAZ supporting
WF: MaryJo, maybe you can help too
DM: what are the failure conditions of a rule?
WF: think that is what we call the test procedure
DM: clients want to know what in the rules triggers the "fail"
<Wilco> https://auto-wcag.github.io/auto-wcag/rules/SC4-1-1-idref.html
WF: that is the test procedure
... SteinErik, draft by next week?
SES: yes, will try
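(A rough sketch, not the actual auto-wcag procedure: the kind of check the SC4-1-1-idref rule linked above describes is that every ID referenced by an IDREF attribute resolves to an element in the document. The attribute list here is an assumption.)

    from html.parser import HTMLParser

    IDREF_ATTRS = {"aria-labelledby", "aria-describedby", "for"}  # assumed subset

    class IdrefCollector(HTMLParser):
        def __init__(self):
            super().__init__()
            self.ids = set()
            self.refs = []  # (attribute name, referenced id)

        def handle_starttag(self, tag, attrs):
            for name, value in attrs:
                if name == "id" and value:
                    self.ids.add(value)
                elif name in IDREF_ATTRS and value:
                    # aria-labelledby may hold a space-separated list of IDs
                    self.refs.extend((name, ref) for ref in value.split())

    def broken_idrefs(html):
        """Return IDREF references that do not resolve to an element."""
        parser = IdrefCollector()
        parser.feed(html)
        return [(attr, ref) for attr, ref in parser.refs if ref not in parser.ids]

    # Example: the dangling reference to "missing" should be reported.
    print(broken_idrefs('<span id="lbl">Name</span>'
                        '<input aria-labelledby="lbl missing">'))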
SK: still behind but will catch up
<MoeKraft> +1
DM: amazing effort! complicated, but excellent to address
... shared effort
SK: do we create our own test files?
WF: we have some test cases, will send you the link