Wilco: Talked about this last week. The idea is that we use the word aggregation in two ways: one for results from any rule, leading to pass / fail; second, through composed rules.
... There were a couple of suggestions.
... one was composition definition.
... Do people feel this is different enough, that it needs a different term?
Alistair: It's not correct for composed rules.
Anne: We could get rid of test definition and other definition.
Anne: I've put in a PR.
<Wilco> https://github.com/w3c/wcag-act/pull/247/files
Anne: We could move things around, to remove the heading "test definition".
... We tried to find another term for aggregation in composed rules, which led to the idea that we probably don't need it.
... Applicability and Expectation should be moved out of the rule structure.
Wilco: I like this, it makes sense.
... The proposal side-steps the issue, which is a good way to do it.
... Anne proposes to drop the Applicability / Expectation sections, then add sub-headings into the Applicability / Expectation sections.
Moe: Shadi's comment meant that he wanted to move the section out under the aggregation definition.
... Final section - Shadi recommends we move out, to further simplify things.
Wilco: So is Shadi's proposal different from this?
Moe: His recommendation is to move out the final two paragraphs.
... The bit defining composed rules - he thought that was too much in that area.
Anne: If we go by my proposal we need to clean up all references to aggregation in composed rules.
... Really aggregation is the business logic in the composed rules.
... Happy to do the clean up in the PR.
Moe: I'll assign that to you Anne.
Wilco: Talked about this last week, without coming to a conclusion.
... In summary, what we found out was that the way we've defined false positives is the way we should define false negatives. How do we want this to work?
... Do we even want a section like this in the doc?
... If we write up these rules and compare rule sets, e.g. company A gets company B's rules and wants them to be x% accurate.
... Do we want to define a measure of accuracy?
<MoeKraft> I have to drop. Have a good one everyone.
Wilco: Many of us are measuring accuracy - to say that should be part of the standard is a stretch.
... On aXe core we don't measure accuracy. We're only interested in false positives. They are reported as bugs, then fixed.
... We do comparisons - it depends on what you are testing. The more sophisticated a company's accessibility use, the less return they get from a tool.
Anne: We cannot expect companies to have accessibility experts around. So defining accuracy would reduce the number of people using it.
Wilco: It is a ratio of false positives to false negatives.
Anne: Should it not just be based on running the tool against the ACT rule test suite?
... It does not say how you benchmark against the test cases.
... Does this work for manual testing?
Alistair: It should work just the same.
Kathy: We work with a group to establish the manual methodology. We don't then run it against other test suites to get this comparison.
... When we talk about false positives and false negatives we typically talk about results from an automated test.
... It seems to be more about a tool failing.
Wilco: I suppose between different implementations accuracy can change.
Kathy: In writing the rules, it's mostly an interpretation of the requirements, and how conformance can be checked.
... If something changes we need to change a rule, or update a rule.
... A false positive indicates that either the rule needs to be changed, a new rule needs adding, or the rule test has been badly done.
Charu: False positive / negative indicates a change to the rule.
... It boils down to the rule being right and the test cases being right.
... The rule could be inaccurate or the implementation could be inaccurate.
Wilco: For example, one of the rules is that images with different sources don't have the same text alternative. If the accessible name is the same, the alt should be the same.
... There is no hard requirement saying that that is an issue. But in reality it flags a problem.
... This rule has a potential to have false positives. How do we decide to publish it?
... How do we know that? That is why we wanted to include false positives.
Alistair: Your speed to mend a test is what drives your ability to release a test which is less certain.
Kasper: For rules like that - false positives might not be an issue if we include a manual rule.
... Find all cases where you have different images - you then take the two into a composed rule.
... That is similar to what we do in other places.
Anne: Techniques might be releasable bi-weekly.
Wilco: They did move out of TR space to update more frequently.
... This is a question raised by the AG chairs, relating to false positives.
... Ultimately we can't do accuracy the way we have it. We're simply moving to a model where we try to fix things as soon as possible.
... It is also solved by asking for it to be in a number of tools.
Alistair: you could say a rule needs to have been in a tool in the wild for 6 months before it's acceptable.
Wilco: Loads of people have experimental rules running before they are released into public.
... We are reaching a conclusion - can we say we can remove the benchmarking section from the rule format?
<Wilco> https://www.w3.org/TR/act-rules-format/#quality-accuracy
Wilco: Maybe we need a clearer understanding of what it means to have something implemented.
... It's a question of do we want it in the rules format or can that live somewhere else.
... Should we define what it means to implement a rule.
... The rules format does not say when a rule can be published by the W3C.
... Maybe linking to that from the rules format.
MaryJoe: Rules format is just the format. The acceptance should be handled elsewhere.
... Don't muddy the waters of a normative doc with non-normative stuff.
Anne: From a usability perspective, it would make sense to link to the things needed to use it and publish them.
Wilco: we have a solution. Which I'll capture in the ticket.
Kathy: Would it be clearer to separate rule accuracy from benchmarking.
... How would benchmarking be performed would be a separate topic.
Wilco: Are you suggesting keeping the accuracy definition?
Kathy: I think so.
Wilco: It seems the way we handle our accuracy is to fix the things being reported.
... We don't allow a percentage of false positives.
... Shall we take an action Moe to update along these lines, and take in Shadi's comments too?
... I'll reach out to Moe, and come up with ideas for some of the comments which were left.
... Anything else on accuracy?