See also: IRC log
Sujasree: leading Deque team in India
Debra: engagement manager for the Deque federal account
... want to understand the direction and stay on top of it
... not a coder but happy to help
<Wilco> https://github.com/w3c/wcag-act/issues/81
WF: previous discussion, two weeks ago, brought this up
... topic keeps coming up
... confusion about what was meant
... initially was meant as a mechanism to figure out if rules will
generate false positives in practice
... measure their accuracy
... no real solution proposed
... initially thought about comparing tools to manually tested results
... but that idea is changing
... more about collecting feedback by using our rules
... let users try out the rules, and react to feedback
... IBM and Deque have a kind of "beta" or "experimental" approach
... until tests are confirmed
SES: so validation means it is accepted until someone complains?
WF: up for discussion
SES: so what would the use be of the benchmarking?
... not sure we will receive comments
... unless you have a mechanism to ensure testing
... but that is not a guarantee
... so what is the purpose then?
<MoeKraft> Shadi: This is what I understood Alistair was proposing: when you develop a rule, you develop test cases along with the rule. The two approaches are not mutually exclusive.
<MoeKraft> Shadi: W3C maturity model, things can be in draft or testing phase; however, we need at least a minimal amount of testing, and ask for feedback for further validation.
WF: already part of the spec to write test cases
... that filters out the known potential problems
... but that is not benchmarking
... to respond to SteinErik: before rules are put in the repository, they have test cases
... additionally a feedback mechanism
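(An illustrative sketch, not an agreed format: the test cases that accompany a proposed rule could pair an input snippet with the outcome the rule is expected to return, so known failure conditions are pinned down before publication. The names and outcome strings below are hypothetical.)

    # Hypothetical shape of the test cases shipped with a proposed rule.
    TEST_CASES = [
        {"input": '<img src="logo.png" alt="Company logo">', "expected": "passed"},
        {"input": '<img src="logo.png">', "expected": "failed"},
    ]

    def failing_cases(rule, cases=TEST_CASES):
        """Report the cases a rule implementation gets wrong; `rule` is any
        callable taking an HTML snippet and returning an outcome string."""
        return [c for c in cases if rule(c["input"]) != c["expected"]]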
DM: already an existing repository of failure conditions?
WF: yes. currently different tools have their test case repositories, but want to merge these
DM: so common place to validate the rules
WF: thoughts?
... maybe need to include positive feedback too
SES: concern about the usefulness of the information we receive
... no clear view yet
<MoeKraft> Shadi: We definitely are changing from what we originally had in mind for benchmarking, at least what we have in our work statement. If we do have a test suite, we could have tools run it by themselves to provide information on how well developers support the test suite.
<MoeKraft> Shadi: Not sure how many tool vendors would want to expose false positives.
<MoeKraft> Shadi: There would have to be self-declaration. But this could be a criterion, forcing some useful information to come back here. If tool implementations do not report that the rule comes back cleanly, then it stays experimental.
<MoeKraft> Shadi: Test cases would need to be versioned too.
<MoeKraft> Shadi: Would be complex because of regression.
<MoeKraft> Shadi: Encourage someone who proposes a rule to get it implemented first. Plus, we would get more competition because tool vendors would try to implement and get a green light.
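(Illustrative only: versioning the test cases, as suggested above, might mean stamping each self-declared implementation report with the suite version it ran against, so regressions can be traced. All field names and values below are hypothetical.)

    # Hypothetical self-declaration report from one tool vendor.
    report = {
        "tool": "ExampleChecker",         # hypothetical vendor tool
        "rule": "SC4-1-1-idref",
        "test_suite_version": "2017-06",  # version the cases were run against
        "cases_passed": 12,
        "cases_failed": 0,  # non-zero results would keep the rule experimental
    }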
WF: so what would be the evidence that rules work in practice?
<MoeKraft> Wilco: What would be the evidence that a rule is accurate?
WF: if enough tools implement it?
<MoeKraft> Shadi: First criterion: hope there is an active community that constantly reviews proposed rules. This is the first level of checking. Rules are proposed by vendors.
<MoeKraft> Shadi: If a rule is accepted by a competitor, this is a good sign. Consensus building. Raise the bar to 3 independent implementations. Disclose how many vendors implement a rule.
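(A minimal sketch of the criterion as discussed, assuming the bar is three independent implementations that report the rule running cleanly against the test cases; the flag names are illustrative.)

    # Hypothetical maturity flag derived from clean, independent implementations.
    def maturity_flag(clean_implementations, required=3):
        return "verified" if clean_implementations >= required else "experimental"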
WF: hesitant, because implementation is only proof of accuracy if the implementing tools produce no false positives
... some rules indicate a level of accuracy
... for example, if someone proposes a test rule with only 70% accuracy
DM: can run against failure conditions, but can also check for false positives
... then there is semi-automated testing, where human intervention is needed
... may or may not test to the full standard
... like maybe only 50%
... so it is a scale of testing
WF: agree that some rules are more reliable than others
DM: reliability and completeness
SES: yes, but what does accurate actually mean?
... consistent and repeatable versus correct
WF: "average between false positives and false negatives" (reads out from spec)
<Wilco> https://w3c.github.io/wcag-act/act-rules-format.html#quality-accuracy
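(One possible reading of that definition, sketched for illustration: accuracy as one minus the average of the false positive and false negative rates. The spec linked above is authoritative.)

    # Sketch: accuracy from a rule's confusion counts, under the reading above.
    def accuracy(fp, tn, fn, tp):
        fp_rate = fp / (fp + tn)  # share of actually-passing cases wrongly failed
        fn_rate = fn / (fn + tp)  # share of actually-failing cases wrongly passed
        return 1 - (fp_rate + fn_rate) / 2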
SAZ: think we are really trying to avoid a central gate-keeping group, so this can scale up
... but need minimum bar of acceptance defined by the test cases
... this can be increased over time, as new situations and new technologies emerge
... may even need to pull rules at some point
WF: can publish rules at any time, with different maturity flags
... fits with the W3C process
SES: agree
MJM: so do I
SAZ: me too
WF: maybe not more than one flag
... just something like "beta" or "experimental"
SES: rely on implementers providing information that the tools function
WF: when implementers find an issue and make a change, we want to encourage them to provide that feedback
SAZ: ideally by adding a new test case
WF: want to make sure that the rules stay in sync
... implementers should give feedback by way of test cases
... does not work unless implementers share their test cases
... need iterative cycles, but need a way to do that
... have to encourage tool vendors
MK: guess some vendors will not want to share all their rules
WF: so can't take rules unless test cases are shared back
MK: how to phrase that?
SES: fair expectation to set that rules will be shared, because it is a quality check
WF: does somebody want to take over writing up this part?
... also need a name change
SES: how about validation?
WF: think publication requirements
... talking about how test rules get posted on the W3C website
SAZ: can take this over
... like the idea of incentives
... and the cycle that the incentives drive
SES: happy to work on this too
RESOLUTION: SES will head up drafting the publication/validation/benchmarking piece, with SAZ supporting
WF: MaryJo, maybe you can help too
DM: what are the failure conditions of a rule?
WF: think that is what we call the test procedure
DM: clients want to know what in the rules triggers the "fail"
<Wilco> https://auto-wcag.github.io/auto-wcag/rules/SC4-1-1-idref.html
WF: that is the test procedure
... SteinErik, draft by next week?
SES: yes, will try
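(A rough sketch, not the actual auto-wcag procedure: the kind of check the SC4-1-1-idref rule linked above describes is that every ID referenced by an IDREF attribute resolves to an element in the document. The attribute list here is an assumption.)

    from html.parser import HTMLParser

    IDREF_ATTRS = {"aria-labelledby", "aria-describedby", "for"}  # assumed subset

    class IdrefCollector(HTMLParser):
        def __init__(self):
            super().__init__()
            self.ids = set()
            self.refs = []  # (attribute name, referenced id)

        def handle_starttag(self, tag, attrs):
            for name, value in attrs:
                if name == "id" and value:
                    self.ids.add(value)
                elif name in IDREF_ATTRS and value:
                    # aria-labelledby may hold a space-separated list of IDs
                    self.refs.extend((name, ref) for ref in value.split())

    def broken_idrefs(html):
        """Return IDREF references that do not resolve to an element."""
        parser = IdrefCollector()
        parser.feed(html)
        return [(attr, ref) for attr, ref in parser.refs if ref not in parser.ids]

    # Example: the dangling reference to "missing" should be reported.
    print(broken_idrefs('<span id="lbl">Name</span>'
                        '<input aria-labelledby="lbl missing">'))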
SK: still behind but will catch up
<MoeKraft> +1
DM: amazing effort! complicated, but excellent to address
... shared effort
SK: do we create our own test files?
WF: we have some test cases, will send you the link