AGWG Teleconference – 29 April 2021

Meeting minutes

I can do both blocks if I get a 5 min breather :)

<Jennie> *I can scribe at hour 2

Alastair: Session 1 - Testing. Shares slide presentation in the Zoom meeting.

Etiquette for this meeting - using the queue, using topics on the queue ("to say, to ask, to [anything]"), keeping points short, avoiding metaphors and allegories

Context of meeting and short summary of AGWG and Silver Merge plans

What types of tests to include?

Which tests to include in conformance?
… Please reference the W3C Code of Ethics and Professional Conduct if need be.

<jeanne> https://www.w3.org/Consortium/cepc/

<jeanne> Zoom info <- https://www.w3.org/2017/08/telecon-info_ag-ftf

<alastairc> https://www.w3.org/WAI/GL/wiki/Meetings/vFtF_2021#Session_1_-_Testing

<AWK> +AWK

Alastair: We have a mix of people who have been in Silver and AGWG.
… FPWD is a starting point not the end point.
… We can use guidelines as examples to talk through things, we aren't updating guidelines today. We will be informed of solutions to the issues per our discussions today.

<jaunita_george> Could you define the acronym used in the previous slide?

Alastair: Declaring scope is important; testing is conducted against scope.

<jaunita_george> Thanks!

<Rachael> FPWD: First Public Working Draft

Alastair: FPWD = first public working draft

AGWG = Accessibility Guidelines Working Group

Thanks JF!

<jaunita_george> Thanks JF!

Alastair: AG will be focusing more on 3.0, with a smaller group to work on WCAG 2.3. Joint meetings, with the goal of merging in the near future; date to be determined.

<bruce_bailey> i can scribe for our 2pm slot

ShawnL: That covers the agenda very well.

Alastair: Talks through the schedule: Session 1 - Testing, then the second two-hour session, Session 2 - Scoring, then Session 3 - Conformance.

<bruce_bailey> slide url again pls ?

Alastair: Overarching themes - simplicity and objectivity with need for flexibility around functional needs.

<alastairc> https://docs.google.com/presentation/d/1eUbNUGFaqbI87tx7vVMvDwxT8GNsAHD-SCWYddgEWEE/edit#slide=id.gd5abc90fa9_0_10

Alastair: Speaks to WCAG 3 Structure, Guidelines, Outcomes, Methods and how that relates to Functional Needs. Functional Needs describes a gap in one's ability.

<Rachael> Functional needs draft if needed: https://www.w3.org/WAI/GL/WCAG3/2020/functional-needs/

Alastair: Requirements review - Multiple ways to measure, flexible maintenance and extensibility, multiple ways to display and technology neutral.

<Rachael> Full requirements document if needed: https://www.w3.org/TR/wcag-3.0-requirements/#requirements

Alastair: readability and usability, regulatory environment, motivation and scope.

JF: On slide 7, guidelines , outcomes and methods. Anything on testing?

Alastair: Methods talk to testing and scoring.

<Zakim> Rachael, you wanted to say that slide was mainly to synch vocab

JF: Testing was not present on the main slide, just wondering where that was in context.

Rachael: We were looking to streamline on words used to describe context of conversation.

JF: Ok, thank you.

Alastair: What types of tests to include - Granular Testing and Holistic Testing. Granular could be subjective but clearly defined tests.

Holistic Testing - includes but is not limited to maturity model, people in seats, Assistive Technology (AT), heuristic testing

Jeanne: The heuristic testing was included in FPWD

<Fazio_> For those interested in following the Maturity Model Testing, here's our working doc https://docs.google.com/document/d/1Y5EO6zkOMrbyePw5-Crq8ojmhn9OCTRQ6TlgB0cE6YE

Jeanne: Say we are testing a small website: we would run the tests that are included in the methods for each of the guidelines, and they'd get a score for each test. They'd take that score and apply it to the outcome, which includes a scoring level. This gives a similar rating scale to compare.
… when an org is testing, they need to test for critical errors.

<jaunita_george> Fazio, could you email the document to me? jaunita_george@navyfederal.org -- I sadly can't use Google Docs on this machine

Jeanne: if it prevents users from completing a task, it fails.

Here is the information on critical errors from WCAG 3.0 FPWD https://www.w3.org/TR/wcag-3.0/#critical-errors

<Fazio_> mental fatigue

Errors located anywhere within the view that stop a user from being able to use that view (examples: flashing, keyboard trap, audio with no pause);

<Fazio_> also called cognitive overload

Errors that when located within a process stop a user from completing a process (example: submit button not in tab order); and

Errors that when aggregated within a view or across a process stop a user from using the view or completing the process (example: a large amount of confusing, ambiguous language).
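A minimal sketch of the "any critical error fails" rule as described by Jeanne, using the three kinds of critical error quoted above. The `Finding` class, the `kind` labels and the `blocks_user` flag are hypothetical names chosen for illustration; they are not defined in the FPWD, and whether a finding blocks the user is assumed to be a tester's judgement.

```python
from collections import Counter
from dataclasses import dataclass

# Hypothetical labels for the three kinds of critical error listed above.
VIEW = "view"            # anywhere in a view; stops use of that view
PROCESS = "process"      # inside a process; stops completion of the process
AGGREGATE = "aggregate"  # blocking only when many occur together


@dataclass
class Finding:
    description: str
    kind: str          # one of VIEW, PROCESS, AGGREGATE
    blocks_user: bool  # tester's judgement: does this stop the user?


def critical_by_kind(findings):
    """Count the findings that stop the user, grouped by kind."""
    return Counter(f.kind for f in findings if f.blocks_user)


def has_critical_error(findings):
    """Any single blocking finding counts as a critical error."""
    return any(f.blocks_user for f in findings)


findings = [
    Finding("keyboard trap in date picker", VIEW, True),
    Finding("submit button not in tab order", PROCESS, True),
    Finding("one slightly ambiguous label", AGGREGATE, False),
]
print(has_critical_error(findings))
print(dict(critical_by_kind(findings)))
```

Grouping by kind is only a reporting convenience here; the pass/fail decision in this sketch depends solely on whether any finding blocks the user.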

<Zakim> Lauriat_, you wanted to clarify the "people in seats" testing note

Outcome rating and score : https://www.w3.org/TR/wcag-3.0/#outcome-rating

ShawnL: On Holistic testing, the people in seats testing relates to that vs. debate around usability testing as there is different usability testing.

Alastair: Is it correct that nobody was arguing on the topic of granular testing?

Jeanne: There was one comment on less subjectivity on testing in WCAG 3 as WCAG 2 can be subjective.

Jeanne: We have 300 comments on FPWD , there were around 5 or 6 on tests not being objective enough.

Testing objectivity

Jeanne: we want to be more objective and what types of tests to include and what to do on subjective tests.

<Zakim> Rachael, you wanted to say example of heuristic testing is "provide affordances"

Rachael: To clarify on Holistic Testing, we wanted to look at affordances, i.e. does a button look like a button.

Alastair: Do you think that would trigger people's comments on subjectivity?

Rachael: The more objective it is, the fewer tests we can incorporate.

<Zakim> JustineP, you wanted to ask or confirm that the presence of one critical error will result in a score of overall failure in conformance or is it possible to pass with one critical error?

Jaunita: I want to caution on subjectivity. Inconsistent results make governments and courts less willing to accept them. Objectivity is key in the legal world. Automated rubric scoring, based on the inputs provided, would be beneficial. The more individuals can do this without experts, the clearer the new standard would be.

<Fazio_> I've mentioned this over the years but Neuropsychological evaluations are a good benchmark

Justine: On the critical error topic: if alt text is lacking for all images, does this result in a critical error?

<Zakim> answer, you wanted to react to JustineP

Jeanne: Any critical errors, you will not pass.

Jeanne: We did ask for feedback on this. Whether critical errors fail the entire product is an open question regarding what is mathematically possible.

Wilco: I wanted to ask Rachael on layers of subjectivity statement.

Rachael: I'm not sure how we have the conversation. I think we can say that automated testing has higher objectivity and manual testing is more subjective. There is a continuum that we can talk toward. We know we need simplicity.

<Francis_Storr> In the FPWD, if "any image of text without an appropriate text alternative needed to complete a process" is missing alt text, that's a critical failure. https://www.w3.org/TR/wcag-3.0/#text-alternatives

Rachael: we are looking for adoption and tolerance points

Wilco: Almost sounds like repeatability aspect of testing.

<Zakim> JF, you wanted to note that increased subjectivity is codifying the need for subject matter expertise

Rachael: Yes, we could say that.

<Fazio_> good point JF

<jeanne> Plain language experts to help us make it easier for non experts to use the spec

<Zakim> bruce_bailey, you wanted to ask if heuristics defined somewhere ? -- it is not in FCPWD

<jon_avila> That heuristic testing is at the gold and silver levels not at bronze, right?

JF: The more subjectivity we put into WCAG 3, the greater the need for subject matter experts. Concerned about baking in the need for subject matter experts. For example, plain language experts reviewing our content: we should be making it clearer before it gets to them.

BruceB: I am wondering where holistic is used: https://www.w3.org/TR/wcag-3.0/#holistic-tests

<Lauriat_> jon_avila: not yet determined, and a likely topic in the third session today (continuing from previous conversations)

ShawnL: It is essentially a cognitive walkthrough. It is at least a walkthrough of the overall flow.

<jon_avila> Could bronze be sufficient from a court perspective to determine conformance to laws like ADA?

Automated evaluation- Evaluation conducted using software tools, typically evaluating code-level features and applying heuristics for other tests.

<Zakim> sajkaj, you wanted to note our Challenges doc breaks objectivity/subjectivity as quantitative vs qualitative

<sajkaj> https://raw.githack.com/w3c/wcag/conformance-challenges-5aside/conformance-challenges/index.html

<JF> Usability Heuristic: https://www.nngroup.com/articles/ten-usability-heuristics/

Janina: I wanted to talk to the Challenges document; points to a branch copy, and to assessing quantitatively vs. qualitatively.

<Francis_Storr> probably the most well-known set of heuristics for interface design are Jakob Nielsen's "10 Usability Heuristics for User Interface Design" https://www.nngroup.com/articles/ten-usability-heuristics/

Janina: The quantitative set could change, but the threshold could be there. A necessary threshold needs to be present.
… If we can break them down, e.g. AI in alt text analysis: finding colors used in alt text probably means it is not great alt text.

<Zakim> alastairc, you wanted to say that objective doesn't mean you can't use percentages / categories

Alastair: There are areas that impact people with disabilities that are qualitative in nature, but more granular testing methods may benefit them. Objectivity would be similar to what we have now, e.g. this is an alt text of zero, this is an alt text of 4, for scoring. Similar to what we have now, but scoring a bit differently to move toward objectivity and testing reliability.

<Lauriat_> +1 to Alastair (covered what I wanted to say)

Detlev: I wanted to add to the issue of criticality and critical errors. When a person defines something as critical, that is an important part. With alternative text, it does help with running automated tests. E.g. missing alt text in a menu would be critical, vs. an alt in a call to action, which would not be deemed a critical stopper.
… Heading levels being wrong could be annoying, but could also be very important and critical.

<Fazio_> Neuropsyche evals are globally accepted measurements of seemingly subjective criteria but with standardized, objective methods - worth exploring how thats accomplished

<Chuck__> ChrisLoiselle: My comment is on the use of the word heuristic and holistic and how it relates to 3.0

<JF> @Fazio how does Neuropsyche evals scale across millions of web sites?

<shadi> +1 to contextualization and breaking down into more granular requirements/checks

<alastairc> JF - millions of people would be the equivalent.

<Chuck__> ChrisLoiselle: I did a search, ...conducted using software tools... we go into holistics in detail, but heuristics is listed once in the doc.

<Chuck__> ChrisLoiselle: Next version it may need to be explained in more detail.

<jaunita_george> Is there a way scoring could be automated? Could an scoring tool be created to help people generate consistent results?

<jon_avila> There are 2 types of heuristic - another means user discovered

<Wilco> many accessibility tools use heuristics

<sajkaj> Yes, and a definition of any term, e.g. "heuristic" should not use the same term in the definition!

<Zakim> AbiJ, you wanted to say robustness of qualitative methods is very important

<AWK_> Pls unmute me in zoom AC

<jon_avila> Nielsen article on heuristic testing https://www.nngroup.com/articles/how-to-conduct-a-heuristic-evaluation/

<Jennie> +1 to AbiJ, and more repeatable which is important when doing product evaluations/comparisons

<Lauriat_> +1 to AbiJ, really looking forward to getting into creating some possible examples of this with ACT folks

AbiJ: I would strongly endorse analyzing both qualitatively and quantitatively. I think that would make the guidelines more readable and accessible to people. I feel a shortcoming of WCAG 2 is that it is difficult to implement, i.e. what is subjective and what is not. If WCAG 3 does have that, it would add value.

<Rachael> We will discuss conformance levels in the third session (See slide 20)

<JF> +1 to AWK - that sounds vaguely familiar to the feedback that Makoto brought to us from his Japanese workshop

AWK: I wanted to plus 1 to Janina. The conformance level conversation needs to be looked at. What Janina was talking to is incredibly important. The reality is that websites are not meeting WCAG 2. One of the questions from ITI was about a conformance level around programmatic testing, perhaps in combination with critical tests. Not sure of the order of topics, but wanted to bring it up.

<jeanne> ack

<Zakim> jeanne, you wanted to address tester reliability in defining rating scales

Alastair: If there are conversations about conformance, we can talk to that in this session around automated or manual testing, framing it around the testing conversation.

Jeanne: One of the goals we have is to improve the reliability of testing for testers. E.g. how good is alt text, what is the quality of alt text? Putting in a qualitative scale, with good instructions on what makes a 2 vs. a 4 score, is beneficial on subjectivity issues. That would make it more reliable.

<Detlev> +1 to jeanne

Jaunita: Can people input items into a rubric framework that automates this process? Making people and other parties accountable is the key piece. Subjectivity and inconsistency would be a barrier to the uptake of the requirement and would impact users.

<JF> +1 to Juanita

<JustineP> +1 to Juanita

Wilco: To Jeanne, why would the scale help with subjectivity? It seems like more work, with the possibility of misinterpretation, and could lead to more subjectivity.

Jeanne: Our assertion, which we want to work with ACT on, is that where we have existing subjectivity, such as in WCAG 2, we want to make it easier by giving descriptions of the bands on different levels.

<Zakim> alastairc, you wanted to say there is currently a reliance on experts

Jake: Rationales on specific subjectivity and then judge against those rationales. Up to us to come up with rationales.

Alastair: We have to undo some of the changes or results they have received from testing as they haven't understood guidelines. E.g. Nielsen's heuristics are general; there are also different ones to look at.

<alastairc> A guideline can be easy to understand, easy to test, or short. Pick any two.

<bruce_bailey> i like that !

Alastair: We are aiming to be easy to understand and test.

<Lauriat_> At the hour, do we have a new scribe?

Alastair: The types of granular testing will need to be proved, but it is a direction worth exploring.

<Zakim> jeanne, you wanted to say that Silver needs more help with scoring models, because we don't have a lot of people with testing experience in the group. AGWG has that expertise

<Jennie> *Scribe change after Jeanne, Chris?

Jeanne: When the Silver Task Force looked at this, we were looking at testing, scoring and levels. This was against a time constraint for FPWD.
… The issue was around expertise around testing and migration of WCAG 2.1 and 2.2 into WCAG 3. AGWG has a lot more testing expertise. For example, clear words and testing of clear words.
… Silver does need AG help around these areas.

<ChrisLoiselle> Thanks, Jennie!

Alastair: Yes, thank you Jeanne
… I would like to work out the decision points. Are there particular issues that we can frame, and bring to some sort of conclusion.

Shawn: 1 aspect we are speaking to is whether to remove subjectivity from the guidelines.

<johnkirkwood> LS: small meeting

Shawn: A separate topic is what type of tests do we have in order to support the tests either way

Alastair: Thank you.

Chuck: I want to address Wilco's question about how a scale might worsen subjectivity
… We will discuss this more in the scoring part of the agenda
… If most of the people were testing, sometimes we got (missed a portion)
… 2 different people might come to different results
… We got 8 of 9 or even 10 sometimes
… In our current scoring model, that is a very big difference
… I am going to make a proposal that most are covered
… We think that both those results would fit and manage subjectivity

<Zakim> shadi, you wanted to respond on scale/weighting and requirement breakdown

Shadi: I also want to react to the scaling aspect

<Wilco> "most" means more than half to me my friend. Not sure that helps.

Shadi: One other axis is the scope of the tests
… For WCAG 2 - it has subjectivity, which is fine, but sometimes the requirements have a broad scope
… The text alternative requirement requires objects to have text alternatives, and of a certain quality

<JF> +1 to Shadi. 2 tests

Shadi: this gets broken down with what Detlev was saying
… We could break this down even more on functional items (buttons, links) - these could have a higher score
… Just by constructing the requirements in a way to be more specific on a shorter scope I think we would automatically get more of a feeling
… of levels of severity

<sajkaj> +1 to shadi on functional object alt being more critical than on other graphics

<jeanne> +1 Shadi to more granularity in testing.

Shadi: If we put such a scale on the current requirement, I would agree with Wilco that it would provide more subjectivity
… But breaking it down would get us further

<jaunita_george> +1 to shadi's comment on granularity

<Zakim> Chuck__, you wanted to say how a scale can help

<Rachael> +1 to shadi

Shadi: I don't think it is just the type of tests, but how the tests are constructed, and we have a problem with that

<Lauriat_> +1 to Shadi's point on test construction, definitely!

<AbiJ> +1 to shadi

Shadi: I think with good presentation you could show only the parts that you need to

<jon_avila> I agree with Shadi that this would help to bring objectivity.

Detlev: I want to address Wilco's question about scale
… I think the problem is, if you rely on things that can only be pass/fail, then there may be cases where, from a barrier point of view,
… it's a minor thing, and you would have to call it a fail

<jeanne> +1 Detlev to have the user point of view

Detlev: and this can be wrong in terms of the impact on the user
… You have to look at the context

<Lauriat_> +1 to Detlev, this gets into the larger scoring and critical failure aspects of testing

Detlev: When I was suggesting test cases for 4 evaluators to address
… The simple, automated tests help, but you need both
… You can have a check for an alt attribute, but then you need to check to see if they make sense
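A minimal sketch of the two-step check Detlev describes: an automated pass that finds images with no `alt` attribute, followed by a list of images whose `alt` text a person still has to judge. It uses only Python's standard `html.parser`; the class and function names are hypothetical, and this is an illustration rather than an ACT rule or a WCAG method.

```python
from html.parser import HTMLParser


class ImgAltCheck(HTMLParser):
    """Sort <img> elements into automated failures and manual-review items."""

    def __init__(self):
        super().__init__()
        self.missing_alt = []   # no alt attribute at all: detectable by a tool
        self.needs_review = []  # alt present: a person judges whether it makes sense

    def handle_starttag(self, tag, attrs):
        if tag != "img":
            return
        attr_map = dict(attrs)
        src = attr_map.get("src", "")
        if "alt" not in attr_map:
            self.missing_alt.append(src)
        else:
            # A valueless alt attribute is reported as None; treat it as empty.
            self.needs_review.append((src, attr_map["alt"] or ""))


def check_images(html):
    checker = ImgAltCheck()
    checker.feed(html)
    return checker.missing_alt, checker.needs_review


page = '<img src="a.png"><img src="b.png" alt="Company logo"><img src="c.png" alt="">'
missing, review = check_images(page)
print(missing)
print(review)
```

The automated step can only establish that an attribute exists; whether "Company logo" or an empty `alt` is appropriate in context remains a manual, qualitative check.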

JF: I want to go back to the statement that other fields have these kinds of subjective evaluations
… How many have requirements that are being taken up by regulators

<jaunita_george> +1 JF

JF: The regulators are depending on us to define success

<JF> I keep hearing the voice of Jamie Knight, "After my presentation, you will have a better understanding of the needs of exactly one person with autism".

JF: I'm worried we will fall into the 80/20 rule

<SuzanneTaylor> +1 on Detlev's bands with examples within the bands - non-experts are much better served by examples than by carefully constructed specialist language

JF: After hearing Jamie Knight say (see quote above)

<alastairc> q/

JF: Bruce has talked about having different types of currencies - maybe this is something we need to revisit

<bruce_bailey> +1 for multiple currencies !

DM: My hope with the next major version is that it is simpler, more accurate
… Some have found they don't have enough expertise, but others are asking for it to be more objective
… We have discussed conflicting views: holistic, others - we have to balance this with what people want that are waiting for this standard
… One thought: with 3.0 it seems like we are introducing a lot of new concepts
… Let's make it more plain language whenever possible
… We have a lot of ambitious goals
… Would it make sense to introduce it in stages?
… We have a flat structure, with the methods and guidelines

<jaunita_george> +1 to introducing it in stages

DM: Then we migrate WCAG 2 into that new structure
… Then we expand from there

Alastair: In terms of what we've got in front of us today

<Zakim> Wilco, you wanted to ask what the score is supposed to represent

Alastair: Going from the first public working draft to WCAG 3.0 - what we are trying to include for testing

Wilco: What are we scoring against?
… It might seem better to see if it is better or not
… I think that was achieved in the 1st draft
… Colour contrast, for example, uses a scoring method
… It does not take into account how important a particular text is
… That context is really important in establishing how high a thing should be scored
… I'm not convinced we can even do that because it is subjective
… I'm not sure we should even try to do that

<JF> +1 to Wilco, with the added observation that the current scoring isn't granular enough

Wilco: Does it make sense to do scoring for everything?
… Does everything need a scale like that?

<Zakim> Rachael, you wanted to say the granularity was a concern raised in issues

Rachael: I agree not everything does
… We are intending to have that conversation
… I agree with Shadi's point, if we can break things down into granularity, we can have a scale to address the quality of the alt text
… From issues, there is concern about breaking them down because it increases the burden
… We have to think about how to reduce it when we break things down into smaller pieces

<Zakim> jeanne, you wanted to react to Wilco to respond to Wilco that FPWD defines tasks, or process

Jeanne: I want to respond to Wilco
… The first public working draft attempted to establish that context
… By having the organization that owns the product or website say this is what the user is trying to accomplish on this page
… This establishes the process
… That can establish how critical the particular problem is
… The example with alt text - if a missing alt text prevents the user from accomplishing the task, that is a critical error
… But if it is in the footer, or not in the area of the task trying to be accomplished, then it is not a critical error, then it could pass

<Zakim> alastairc, you wanted to say the testing, leading to scoring should map to impact on the user

Jeanne: We are allowing people to have a small number of minor errors as long as they don't interfere with the process that the user is trying to go through
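A minimal sketch of the context rule Jeanne describes, assuming the site owner has declared which elements belong to the process: a missing alt text inside the process is treated as critical, while one outside it (for example in the footer) is a minor error. The function name and element ids are hypothetical.

```python
def split_missing_alt(missing_alt_ids, process_ids):
    """Split images lacking alt text by whether they sit inside the declared process."""
    in_process = set(process_ids)
    critical = [i for i in missing_alt_ids if i in in_process]
    minor = [i for i in missing_alt_ids if i not in in_process]
    return critical, minor


# Hypothetical element ids; the process is whatever the site owner declares.
checkout_process = ["cart-icon", "pay-button-image"]
missing = ["pay-button-image", "footer-badge"]
critical, minor = split_missing_alt(missing, checkout_process)
print(critical, minor)
```

The outcome depends entirely on the declared process, so the same missing alt text can be critical under one declaration and minor under another.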

Alastair: To respond to Wilco regarding what the score should represent - I would like it to represent what level of barriers are there to the end user
… That should represent how good a job the entity has done at accounting for accessibility
… People talk about subjectivity in WCAG 2
… I was looking at 2 sites at the same time
… Both had 12 Level A fails
… You couldn't differentiate between the sites, though one was generally pretty good
… I would like it to map better to the impact it has on the end users

<Fazio_> the simplest way to do it is to get large sample sizes and break it down by percentile. Let govt decide whats acceptable, pass, fail, etc

Alastair: To try to move the conversation forward
… In terms of what we are trying to add to the First Public Working Draft
… The subjectivity of 3.0 should be the same as WCAG 2 in terms of the granular testing

<JF> the same OR LESS

<bruce_bailey> it should not be MORE subjective

<jaunita_george> +1 JF

Alastair: per guideline, basically

<david-macdonald> SHould not be more subjective

Alastair: Is anyone arguing it should be much more or much less?

<bruce_bailey> it would nice if it was less, that is good

Shadi: To react to Rachael on adding more requirements
… I don't think we are adding more requirements - we are making it more transparent

<Wilco> +1

<bruce_bailey> but IMHO subjective of 2x SC is manageable

<Lauriat_> +1

<Rachael> I agree with you Shadi, I just wanted to point out the feedback we are getting from issues

Shadi: Most testers will have some sort of spreadsheet where they break things down, and test
… We are making it more transparent, and making it easier for those that are less expert in testing
… It is not adding requirements, it might be adding lines
… By doing that, we use subjectivity
… It depends on how you want to define subjectivity
… Will we want qualitative tests for alt text? Yes

<jeanne> +1 to Shadi that the granularity improves the subjectivity and makes it more reliable/repeatable

Shadi: But, dividing it into navigational elements - that already becomes more clear
… It is still the same qualitative check

<Zakim> Lauriat_, you wanted to mention that with scoring and conformance still to solidify, additional types of tests do not necessarily mean all tests become required. It depends on how we define conformance for that particular guideline and as a whole for the level of conformance

<jaunita_george> +1 Shadi

Alastair: Good point

<AbiJ> +1 some SCs already require multiple test e.g. 1.3.1 so providing more differentiated tests increases accuracy

Shawn: I want to give a preview of the next subtopic
… We still have scoring and overall conformance left to solidify
… Additional types of tests does not mean all tests are required
… It is more of: we want to look at the range of tests to assess how well people have met the guidelines
… We also want to talk about which to include for conformance
… We have a lower level of conformance than might be used in a regulatory situation
… We could have levels that are more or less the same level of subjectivity, but to Shadi's point, change how you navigate that subjectivity
… And then heuristic evaluations and AT tests

<Zakim> JF, you wanted to ask how do we know what the user's primary "task" is?

Shawn: It is more about what we want to make available to those writing WCAG 3.0

JF: I would like to comment on what Jeanne said about activities - what is the primary purpose
… How do we know what the process is?
… It is common that in the footer are social media icons
… There can be a background image
… when trying to tweet the article - that may be the primary purpose for sharing an article

Alastair: Either the organization, or whoever's doing the conformance statement defines that

<Fazio_> that's what we agreed on AC

JF: That assumes you know what I am trying to do on your website. You may have broad guesses, but you may not really know
… You may not have known that I wanted to share it on Twitter
… When we make subjective determinations - 80% may want to read the article
… We could still be failing 20%

Wilco: I wanted to give my perspective on Alastair's question
… I don't like the word subjectivity because I think we are using it differently
… Lots of organizations have additional documentation on how to test WCAG, which creates differences
… The fact that they have to do that is a real problem
… It shouldn't just apply to granular testing
… I don't think that should be a restriction on what type of testing should be in WCAG 3
… Usability testing can have a good degree of consistency if you do enough of it

<Zakim> Rachael, you wanted to state that I do think that we need to expand a small amount to include tests like affordances

Rachael: I also want to go back to Alastair's question.

<shadi> +1 to Wilco

Rachael: I would like to see the granular tests include a little more context

<jaunita_george> +1 to Wilco

Rachael: I think that we have a certain level of tests, for example for affordances, that do not make the cut in 2.x, and I want to see that expanded

Alastair: I was thinking along the same lines as Wilco - subjectivity is not the term to be using for outcomes for WCAG 3

<jaunita_george> +1

Alastair: Could we all agree that, for granular testing, we are all on board with it being at least as good or better?

<Rachael> Proposal: At least as good or better intertester reliability

<jeanne> +1

<Lauriat_> +1

<Francis_Storr> +1

<Wilco> -1 needs to get better

<sajkaj> +1 to Alastair

inter rater reliability?

<Lauriat_> +1 to Wilco, I included that in my yes

<Rachael> +1 trying for better

<SuzanneTaylor> +1 on inter-tester reliability as a better measure of success for "subjectivity" issue

<david-macdonald> at inter rater reliability = or better+1

<bruce_bailey> +1 to inter-rater reliability (as opposed to focus on "subjectivity")

<jeanne> +1 trying for better reliability metric

Alastair: I think that would answer quite a lot of issues that have been raised

<MelanieP> Better

<AbiJ> +1

<jaunita_george> +1 trying for better too.

<Azlan_> +1

Alastair: On Wilco's 2nd point, I have some disagreements about usability testing being consistent when required in a standard

<Zakim> jeanne, you wanted to suggest a discussion and straw poll on what types of tests for the next draft

<Rachael> draft RESOLUTION: For WCAG 3, testing will be at least as good or better intertester reliability

Jeanne: I also want to suggest that we wrap up this part of the subjectivity conversation
… I would like to move to the next question.

<bruce_bailey> +1

Alastair: I think we have +1s on Rachael's proposal

<JustineP> +1

<jaunita_george> +1

Alastair: This is not a concrete resolution
… Maybe it is something we can incorporate into the requirements

<Rachael> Draft RESOLUTION: For WCAG 3, testing will improve intertester reliability

Alastair: Would anyone object to saying improve?

Bruce: I think we can only say "don't make it worse" because we don't have great measures of the current inter-rater reliability

<jaunita_george> Could we work on a phased approach for WCAG 3? I think we might be trying to do everything at once -- which might hurt progress.

Alastair: Within individual conversations, it can be good, but across different orgs it may not match up

Wilco: There is some research, but it is dated

Alastair: I think that was looked at as part of the Silver research
… It should be fairly easy to improve on that

Wilco: I think it was 70% or so

Alastair: Improving might not be difficult

Shadi: There is also Markel Vigo's research at the University of Manchester
… We actually have concrete specific improvements in the ACT group
… We have identified issues, things that can be made more granular, more understandable

<jaunita_george> Could you share those recommendations Shadi?

<Zakim> bruce_bailey, you wanted to mention Trusted Tester and The Baseline

<Wilco> totally agree

Shadi: I think we already have existing improvements for WCAG 2, so we should be confident that we can continue to improve

Bruce: The (named an article) is flawed - the summary description has flaws
… This is one of the motivations for the Trusted Tester at DHS
… Their whole focus was having inter-rater reliability
… For people that are new to testing
… That is the entire point of the DHS Trusted Tester program
… It is aimed at beginner testers
… Abstracting the Trusted Tester approach, we are calling the baseline
… A less objective version of the WCAG criteria that acts more like a checklist
… The lack of a good sense of inter rater reliability is very important

Alastair: Would anyone disapprove

<shadi> +1 to Bruce

<jaunita_george> +1 Bruce's comment

Alastair: that WCAG 3 testing will improve inter rater reliability?

AWK: It is ironic that we are trying to agree on a statement that we may not be able to get inter rater reliability on

<Wilco> +1

AWK: It can be aspirational, but expecting a measure on this, by a university, it may be difficult, and create a bar we cannot clear

<Rachael> Draft RESOLUTION: For WCAG 3, testing will aim to improve intertester reliability

AWK: Sounds good to me

<laura> +1

<Zakim> Lauriat_, you wanted to add a hopefully more minor point of our setting a goal of improving, but not reaching 100% consistency

Shawn: I want to make a more minor point
… While we are aiming to improve it, we are not working towards 100% consistency

<bruce_bailey> Here is the current Trusted Tester testing script

Shawn: We are not going to get 100% inter tester reliability

<bruce_bailey> https://section508coordinators.github.io/TrustedTester/

Shawn: We are striving to improve only

<Zakim> jeanne, you wanted to respond to AWK that we have a testing group already to look at this

Jeanne: We do have a subgroup that is working on testing the spec itself, and doing these types of evaluations
… Francis is leading that group
… They have 5 metrics and reliability is one of them
… We are just calling it reliability

Alastair: I think we all came to agreement
… Any -1s?
… This helps us create responses to lots of the testing issues

JF: To Shawn's point - this isn't going to be perfect
… Do we set a minimum baseline?
… The 80/20 rule - are we going to try to qualify it?

Shawn: I think the baseline is where we are with WCAG 2x
… I don't think we have enough sample tests to have some sort of idea of how much we can improve it
… This would be good to work on in joint sessions with ACT

Wilco: I would be in favour of at least trying to measure it

<jaunita_george> +1

Alastair: Can we put that on a yes we will measure it?
… That is in our comments around subjectivity

<JustineP> Agree that measuring reliability is critical

JF: Can we get it in the resolution?

<Rachael> Draft RESOLUTION: For WCAG 3, testing will aim to improve intertester reliability and will work on testing to determine this

JG: I want to see if there is a possibility of developing tooling that could assist with this
… Is there a way we can endorse tooling, or help lower the barrier to entry for those less familiar with the standard and grading, to develop a more consistent score?

Alastair: We had been assuming there would be tools to help with scoring
… There is a discussion about that

<Zakim> jeanne, you wanted to react to jaunita_george to talk about proof of concept tools

Jeanne: We are working on proof of concept tools
… We expect to rely on the industry to create the tools
… We received feedback that one tooling organization would like more information about the information behind the rule

<Wilco> A test suite would be helpful

Jeanne: Then we will not focus on building the tool ourselves

<JF> +1, plus our stated goal of using ACT rules format

<Lauriat_> +1, and also a lot of opportunity for other types of tools in this space

Resolution: For WCAG 3, testing will aim to improve intertester reliability and will work on testing to determine this

Alastair: Moving on to which tests to include in conformance

Which tests to include in conformance

Jeanne: This conversation has been very helpful
… I would like to propose for the next few drafts we focus on including binary testing and qualitative testing
… I don't have an opinion on what to call it

<Lauriat_> +1 to that proposal, let's expand in one space at a time

Jeanne: Start pursuing greater granularity, more qualitative evaluations

<Zakim> jeanne, you wanted to propose we focus on binary and rating types of tests for the next few draft and ask for AGWG and ACT help in drafting some example tests.

Jeanne: And get people from AGWG to help with that

<Zakim> Rachael, you wanted to ask Jeanne if that means the first three test types?

Rachael: you mean the 1st three test types?

Jeanne: I think so, yes
… The types of testing we have today - introduce quantitative in the next few drafts
… To build this out in more detail

JF: I want to go back to the resolution
… I'm looking for the word "measure"

<Rachael> Draft RESOLUTION: For WCAG 3, testing will aim to improve intertester reliability and will work on testing to measure this

JF: I want to be more granular and be able to measure we are achieving it

Alastair: ok

Resolution: For WCAG 3, testing will aim to improve intertester reliability and will work on testing to measure this

<Zakim> bruce_bailey, you wanted to suggest alt text quality as low hanging fruit for adjectival scoring proof of concept

JF: thank you

<Rachael> Subjective: For WCAG 3, testing will be at least as good or better intertester reliability

Bruce: I want to suggest that we do scoring of the quality of alt text as a proof of concept
… to say yes we can have adjectival ratings

<jaunita_george> +1 Bruce

<JF> +1, and that kind of test also allows for a bit of a "walk-through" evaluation

<Lauriat_> +1

Alastair: Yes, good area to focus on for the proof of concept
… Jeanne, who could they contact to volunteer to work on that?

Jeanne: Makoto is leading the subgroup, but you can get in touch with me and I can make the connection

<Rachael> heuristic (likely misnamed): For controls which are necessary to move forward or backwards through a process, spacing and/or font styling are not used as the only visual means of conveying that the control is actionable.

Alastair: Start with Jeanne

<Lauriat_> I'll happily help with that as I can, with the caveat that I go on vacation in a matter of hours, so more likely will help during ACT sessions.

<jeanne> jspellman@spellmanconsulting.com

AWK: I would like to understand Jeanne's proposal
… I also have one
… Because of the discussion about tests that are subjective
… It would be great for the group to focus on starting with that as the baseline
… To be able to say: here's the things we think we can require that can be handled either in an automated way
… Or are so important they must be tested manually

<sajkaj> +1 to AWK

AWK: Gold is going to be hard
… It will be presumably more subjective
… We can build up from the bottom

Janina: I want to +1 that

<JF> @bruce, we have a starting point already: https://www.w3.org/WAI/tutorials/images/decision-tree/

Janina: Whatever we call that set, there is a 2nd question
… There is value in knowing what can be automated
… We can have higher requirements - where you run those tests regularly
… Then run the subjective parts

<shadi> +1 to JF's starting point idea

Alastair: Do Jeanne, Rachael, or Shawn want to respond?

Jeanne: I'm trying to craft a question so we can include it in a later section

Alastair: I think it is very difficult to rely on the results of usability testing to contribute to some kind of conformance

<jeanne> question+ Should we have a structure where a site or product must meet all automated tests before doing manual or qualitative tests?

Alastair: I do think it is possible to have some kind of question - have you done usability testing for a particular topic
… Have you acted on the results?
… Then it is not relying on the outcome
… Question: Can anyone outline what format the AT testing would take?

<bruce_bailey> +1 to "have you done usability testing" yes/no

Alastair: I'm not clear how that would contribute to a score

<Zakim> alastairc, you wanted to comment on usability testing

<Zakim> Lauriat_, you wanted to speak to AT testing, from my perspective

Shawn: I was thinking of it in terms of the people in seats usability testing

<jaunita_george> +1 to the usability testing suggestion from Alastair

Shawn: Transparency in process for validating how accessible something is
… It gives a framework for expressing what kinds of AT testing has been done

<Zakim> JF, you wanted to ask about AT testing and the test matrix (OS + Browser + AT tool X versions [current & -1?])

JF: My concern around the AT testing - we also have to build out a matrix about what that looks like

<jaunita_george> +1 JF

JF: OS, browser, etc - it gets unmanageable quickly

<Zakim> Lauriat_, you wanted to react to JF

Shawn: This onus would go on the people making the claim
… We would say: here is a structure you can use for recording this

<AbiJ> Expertise of the user in AT can be a differentiator between AT testing and diverse user testing

Shawn: that responsibility would be on the people making the claim

<sajkaj> Could we require AT support a11y standards? e.g. ARIA?

+1 to AbiJ

Alastair: We have reached time

<Lauriat_> +1 to AbiJ

<jaunita_george> +1 to AbiJ

<sajkaj> Though we actually don't!

<sajkaj> i.e. end at 2300Z

<sajkaj> Jeanne: You had it right the first time. It's an rrsagent command

<sajkaj> 5-7PM Boston = 21:00-23:00 -- so, no

<sajkaj> Meeting resumes at 1700Z. Wiki info at:

<sajkaj> https://www.w3.org/WAI/GL/wiki/Meetings/vFtF_2021

<sajkaj> Maybe that should be "automatable tests?"

<Lauriat_> +1 to sajkaj, though a different term may help avoid that confusion

<ChrisLoiselle> same here if need be on scribing

<Jennie_> * shocked face here

<ChrisLoiselle> Chuck, is this the correct slide deck ? https://docs.google.com/presentation/d/1eUbNUGFaqbI87tx7vVMvDwxT8GNsAHD-SCWYddgEWEE/edit#slide=id.gd5abc90fa9_0_111

<Chuck> https://docs.google.com/presentation/d/1eUbNUGFaqbI87tx7vVMvDwxT8GNsAHD-SCWYddgEWEE/edit#slide=id.gd5abc90fa9_0_111

Charles Adams (Chuck): Any new members?
… 1st session was testing, this session focuses on Scoring
… reminder of etiquette, use queue and include topic (slide 2 in deck)
… keep comments short and literal
… overall themes, slide 3
… need for simplicity and objectivity balanced against flexibility
… recapping from session one (slide 16)
… Resolution: For WCAG 3, testing will aim to improve inter-tester reliability and will work on testing to measure this
… Agreed that the framing of inter-tester reliability (or reliability for short) was a more productive way to approach testing than discussing subjectivity, since WCAG 2 has subjectivity. We want to improve, if possible.
… Agreed that AGWG members would assist the Alt Text Subgroup to work on writing qualitative evaluation that can be rated with improved inter-tester reliability. This can include breaking it into more granular tests and outcomes.
… HT Shadi and Detlev
… slide 17 recap, Guidelines, Outlines, methods

<jeanne> Slide Deck <- https://docs.google.com/presentation/d/1eUbNUGFaqbI87tx7vVMvDwxT8GNsAHD-SCWYddgEWEE/edit#slide=id.gd5abc90fa9_0_111

Jeanne Spellman speaks to slide 18, tabular comparison of WCAG 2 mapping to analog in WCAG 3
… slide 19, conformance comparison WCAG 2 vs WCAG 3
… in 2, unit of conformance is page, in 3 evaluate by site or product (or logical subset)

<AWK> +AWK

<Zakim> Lauriat_, you wanted to note that we arrived on processes and views: "Conformance is defined only for processes and views." from https://www.w3.org/TR/wcag-3.0/#defining-conformance-scope

Shawn Lauriat: in the FPWD we defined processes and views
… can build up to site vs product

Jeanne: trying to give an overview of the FPWD and some big picture concepts
… returning to slide 19, SC rated by A/AA/AAA. For 3, we did not want to do that because of disparate treatment to disability groups
… 3 does have the concept of Critical Errors (three types), which are a hard fail
… not a close correspondence to A/AA/AAA, but some similarity
… 2 SC are perfection or fail, 3 has a point system
… 2 AA level ended up being used by regulators, with 3 we recommend Bronze for regulation

<alastairc> Currently, you can pick your pages as well.

Jeanne: in 2 all SC are binary T/F, in 3 guidelines customize tests and scoring as most appropriate

John Foliot: Concern with scoring..

<Lauriat_> +1 to JF calling out the dual role of A, AA, AAA, definitely good to consider as we move forward!

Jeanne: Please defer until December publication.

<johnkirkwood> +1

<ToddLibby> +1

JF: Reminder that A, AA, AAA was based on impact on users AND content creators

Chuck: Thank you.

Jeanne: Lawyer from OCR wanting to know why A, AA, AAA
… apparently comes up in court cases, and things around AAA not getting sufficient attention

<sajkaj> If anyone has a copy of that paper, I desperately need citations from it!

Jeanne: want to make sure that different disability categories are treated equitably

<JF> equally or equitably?

Jeanne: Slide 20, big picture idea is to have a score at outcome level
… threshold for total score would include score by functional need / disability category
… this addresses situations like where for flashing seizures, there is basically one test...

<Zakim> Chuck, you wanted to ask about "total score" vs average score

<ChrisLoiselle> https://www.w3.org/TR/wcag-3.0/#overall-scores

Jeanne: whereas for usable without vision there would be dozens of tests

<JF> equally: in the same manner or to the same extent. equitably: in a fair and impartial manner.

Chuck: asks for clarification between total score and average scores, and notes the FPWD picked 3.5 as minimum

Jeanne: Correct, mostly we are looking for mean average within a functional category, but thresholds for each FPC

<Fazio> JF has a point

Jeanne: would not average score between two fpc
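
As a rough sketch of the point Jeanne makes here: a mean is taken within each functional category and each category is checked against the threshold on its own, so a category with many well-rated tests cannot mask a category with one poorly rated test. The 3.5 minimum is the figure mentioned above; the category names, the ratings and the function names are hypothetical.

```python
from statistics import mean

MINIMUM = 3.5  # minimum mentioned in the discussion, applied per category here

def category_means(outcome_ratings):
    """Mean 0-4 outcome rating within each functional category."""
    return {cat: mean(ratings) for cat, ratings in outcome_ratings.items()}

def meets_threshold(outcome_ratings, minimum=MINIMUM):
    """True only if every category reaches the minimum on its own."""
    return all(m >= minimum for m in category_means(outcome_ratings).values())

# Hypothetical ratings: many tests apply to one category, a single test to the other.
ratings = {
    "without vision": [4, 4, 3, 4, 4, 4, 4, 3, 4, 4],
    "flashing / seizures": [2],
}

pooled = mean(r for rs in ratings.values() for r in rs)

print(category_means(ratings))
print(round(pooled, 2))          # pooling everything hides the weak category
print(meets_threshold(ratings))  # the per-category check does not
```

In this made-up data the pooled mean clears 3.5 while the single-test category does not, which is why the per-category check gives a different answer from a pooled average.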

JF: There is a difference between equity and equality.

<ChrisLoiselle> I think may help on the discussion https://www.mentalfloss.com/article/625404/equity-vs-equality-what-is-the-difference

<alastairc> I think you're agreeing...

Jeanne: We want the results (of testing) to be equitable, even though the number of tests for two different FPC categories is not equal.

<jon_avila> Article on equity vs. equality https://culturalorganizing.org/the-problem-with-that-equity-vs-equality-graphic/

<sajkaj> I want to suggest spoons problem isn't just a COGA challenge

Chuck: To JF, if we are to determine that a large number of small issues (spoons problem) is blocking for COGA issues just as much as with other issues in wcag 2.

JF: Yes, that is part of it. Not just the impact and numbers level. But not all issues are created equal.

Jeanne: What we are saying is that disability category impact is equitable, but not by count of number of SC (or WCAG 1 checkpoints).
… so part of the idea behind Critical Errors is a way to reflect this reality.

<Zakim> sajkaj, you wanted to say I think we're in violent agreement; but we are still normalizing our terminology

<Zakim> alastairc, you wanted to suggest that we shouldn't try to equalise by disability, but say whether you have made a reasonable effort to accommodate

Chuck: Want to say that we could be talking about this for quite a while, and we have other important issues on the table. Equity versus equality is important.

Janina Sajka (sajkaj): We are getting better at normalizing the language we are using for these issues.

<Zakim> JF, you wanted to test a hypothesis

Alastair Campbell: We have some consensus in the WCAG 2x space for setting responsibility between developers and site owners and what technology users bring to a page.

JF: Is a preference for transcripts over captioning an example of the equity that 3 aims for?

<JF> blind functional need: audio description. deaf functional need: captions deaf/blind functional need: transcripts

<jon_avila> I agree with Jeanne. The scores will roll up to disability categories and a score at that level must be met -so while we might have more criteria that apply to one disability - the score of the roll up would need to be the same.

<Zakim> Lauriat_, you wanted to say I think we're all vehemently agreeing and have the same goals in this admittedly complicated space, and why we need to work through scoring in a way that support these goals

Jeanne: No. It is about the count of A and AA SC for blindness versus the count of AAA SC for COGA issues.

<JF> +1 Shawn

<sajkaj> +1 to SL

<Ben> +1

<jaunita_george> +1

Shawn: Seems like we are in strong agreement on big picture goals.

<KimD> +1 to moving on

<jon_avila> +1 - we are all saying the same thing.

Jeanne: Coming back to slide, and scoring. There are a few ways to address this.
… for example maintain binary T/F, obviously some objection to that
… we could aim for rating by percentage
… we could aim for likert ratings
… we could have more granular point based system

Jeanne: Example of Text Alternatives in FPWD

<PeterKorn> [stepping away for ~30 min]

<JF> far too easy to game the outcomes

Jeanne: in document today was automated versus manual

<jaunita_george> +1 JF

Jeanne: Could report percentage of images with alt
… Manual evaluation could use automated score plus look for critical errors (missing alt on images needed for path / process)
… Example from this morning, was icon in navigation missing alt: that is a critical error

<JF> disagree with that conclusion

Jeanne: missing alt for decorative images not a critical error

<MelissaD> +1 to concern with gaming

David McDonald: So we need count of images used with all pages in a task, so all pages in flow, all images encountered in that flow?
… sites always can be changing, so experience different from day to day

<ChrisLoiselle> On David and Jeanne's point , https://www.w3.org/TR/wcag-3.0/#defining-conformance-scope In many cases, content changes frequently, causing evaluation to be accurate only for a specific moment in time;

Jeanne: Definitely needs to be a snapshot.

JF: Seems like this could be gamed too easily.

<Zakim> Lauriat_, you wanted to respond to JF in that we can't prevent people from misrepresenting their conformance, they wouldn't need to add images to pages to lie.

<alastairc> Could define a non-visible image with null alt text as not in scope.

JF: for example add 100 spacer gifs all with alt="" and now the relative percent of images with good alt is much higher

<Zakim> JF, you wanted to note that a script could be created to append 100 1 X 1 "spacer gifs" with alt="" on a page and it would then pass

Shawn: With conformance claims, we assume people are not lying.

<Ben> +1 to JF, a definition of an image that is "useful to the task" could help that issue? One for another day probably...

Shawn: this is not a new problem, but if someone wants to inflate their score currently, it is easy to do.

<JF> +1 Wilco - the needs of regulators

Wilco Fiers: Disagree, as the point of scoring is to have something that reflects actual accessibility.
… If we allow loopholes, people will game the system and exploit them. And they should. So we need to have scoring that reflects that reality.

<Zakim> JF, you wanted to talk about "critical images" and "tasks"

<Zakim> sajkaj, you wanted to say we don't do compliance, that's a regulatory function

Chuck: This is not a new issue. Queue is long, so try to stay on topic.

<JF> but if we want the regulators to take up WCAG 3, it has to meet their needs as well!

Janina: We should be focused on people that want to do the right thing. If someone games the system, they risk being judged by someone else.

<Wilco> +1

<Lauriat_> +1 to approaching this as "How do we best do this", since we don't need to do it this particular way.

<Zakim> alastairc, you wanted to say we need to align the tests to what's useful for users

Shadi Abou-Zahra: I don't agree with the word gaming.
… developer would be following the rules.
… it is about conforming or not. It is what the system offers.

<jeanne> question+ to look at the inverted score where the barriers or errors are counted instead of the overall score.

<Chuck> +1

<Zakim> Jennie_, you wanted to discuss compliance by vendors and employees

<jeanne> question+ How do we address attempted gaming?

Alastair: We can add scoping, for example decorative images are not included in the score.

<JF> +1 Jennie. W3C can't toss this over the wall and then wash our hands...

<johnkirkwood> +1 Jennie

Alastair: Current approach is to look for fails, not counts of successes, so less subject to this sort of gaming.
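
A small sketch of the arithmetic behind this exchange, under assumed data: padding a page with spacer images that carry alt="" raises a "percentage of images with alt" figure, while a count of failures, or a percentage taken only over in-scope (non-decorative) images, is unaffected. The image representation and function names are hypothetical and do not come from the FPWD.

```python
def percent_with_alt(images):
    """Percentage of images that carry an alt attribute at all (alt="" counts)."""
    return 100 * sum(img["alt"] is not None for img in images) / len(images)

def missing_alt(images):
    """Images with no alt attribute: the failures a tester would report."""
    return [img for img in images if img["alt"] is None]

def in_scope(images):
    """Exclude images marked decorative with alt="" from the scored set."""
    return [img for img in images if img["alt"] != ""]

# Hypothetical page: 5 images with text alternatives, 5 with no alt attribute.
page = [{"alt": "Company logo"}] * 5 + [{"alt": None}] * 5
# The same page padded with 100 spacer images that all have alt="".
padded = page + [{"alt": ""}] * 100

print(percent_with_alt(page))               # 50.0
print(round(percent_with_alt(padded), 1))   # 95.5
print(len(missing_alt(padded)))             # 5
print(percent_with_alt(in_scope(padded)))   # 50.0
```

This only covers presence of an alt attribute; it says nothing about whether the text alternative is any good, which is the qualitative part discussed in session 1.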

<Zakim> AbiJ, you wanted to speak to the importance of the granularity of tests

Jennie Delisi: Agree with Janina, that if implemented, many of us work at the state or government levels where the scoring could be taken quite literally.

Chuck: Jeanne has this as a question in the FCPWD, so there will be more conversation.

Abi James: Want to emphasize that current WCAG 2 can also be gamed, especially with all the emphasis on automated testing.

Chuck: Moving on to slide 23, Outcome Rating. Ratings from 0 to 4.

<Zakim> Jennie_, you wanted to ask what happens between 69 and 70 percent

<Zakim> JF, you wanted to ask "which" process?

Chuck: potential thresholds for each tier were suggested. Percentage ranges are very much arbitrary.

<Rachael> We have a slide at the end that is specifically to discuss scope, including the need to better define or move past process

Chuck: idea is to look at big picture idea.
… These 0 - 4 ratings also potentially map to adjectival ratings
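
To make the slide's idea concrete, here is a minimal sketch of mapping a pass percentage to a 0-4 rating with an adjectival label. The threshold values are invented for illustration (the minutes describe the real ranges as arbitrary), and the labels other than "very poor (0)", which is quoted from the FPWD later in these minutes, are assumptions. Any critical error forces a rating of 0, following that same quoted text.

```python
# Hypothetical lower bounds (percent) for ratings 4 down to 1; not the slide's values.
THRESHOLDS = [(95, 4), (80, 3), (70, 2), (50, 1)]
# Labels are assumed for illustration, apart from "Very Poor" for 0.
LABELS = {0: "Very Poor", 1: "Poor", 2: "Fair", 3: "Good", 4: "Excellent"}

def outcome_rating(percent_passed, critical_errors=0):
    """Map a pass percentage to a 0-4 rating; any critical error forces 0."""
    if critical_errors > 0:
        return 0
    for lower_bound, rating in THRESHOLDS:
        if percent_passed >= lower_bound:
            return rating
    return 0

for pct in (69, 70):
    rating = outcome_rating(pct)
    print(pct, rating, LABELS[rating])

print(outcome_rating(99, critical_errors=1))
```

The 69 versus 70 case shows the cliff that Jennie queued to ask about: with any fixed cut-off, a one-point change in percentage can move the rating by a whole tier.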

<Fazio> wireframes can be used to define processes for evaluation

<Fazio> this could drive good design standards too

JF: Assumes well defined process, but even social media links are starting a new view

Wilco: These percentages seem to reflect only automated testing, is that correct?

<Zakim> Chuck, you wanted to answer John's question on what "process"

Jeanne: Yes, but that is not the long term intention.

<jeanne> question+ Should critical errors be across a process or by the view?

<sajkaj> Re Wilco: Also, not all images are of equal value. We know about "presentational," but a function/control is arguably more critical than an illustration

<Lauriat_> From https://www.w3.org/TR/wcag-3.0/#defining-conformance-scope and you can also declare conformance by "view" (to get beyond page-only definitions of a view)

Jeanne: the deadlines for publishing were tight. We did not mean for this to be automated testing only. But yes, that is the characterization in the FPWD.

Chuck: An important point that I want to come back to is that the entity making the conformance claim gets to say what is the process.

<JF> my concern as well Shadi

JF: If the main drop-down navigation menu has 35 items, is that not 35 processes?

Chuck: We acknowledge that the current phrasing is not sufficient. Please focus on the intent.

<sajkaj> Automated testing can identify use of alt="spacer.gif" and it can also identify 1-pixel images

<jeanne> +1 to exploring Shadi's idea - perhaps that is where we should start with the joint ACT meeting?

<Lauriat_> +1

Shadi: The current approach of process seems like a red herring.

Scoring Issues

Shadi: Stepping back, we are trying to think about what a process should be, what that should look like.
… images in footer without alt tags are not getting in the way.

Chuck: moving on, slide 24, how to handle scoring

Testing degrees of success is not efficient

Need for more than pass/fail but also need better explanation

Need to allow some small failures

Slides have links to issues.

How should the scoring be done at the outcome level?

Binary (WCAG 2.x, FPWD)

Percentage (FPWD at testing level)

Rating scale (FPWD at testing and outcome (confusing))

Points

Jeanne: Picked these issues out because they are representative

<Zakim> Jennie_, you wanted to ask about the percentages

Jennie: Are we talking about percentages as they were in the previous slide?

Jeanne: We had agreed we'd specify which significant digit to average to.

<Jennie_> thank you

Jeanne: We are looking at the possibility of using something like this at the outcome level. Previously we were looking at the outcome level.
… We could do percentage calculations at the outcome level.

<Chuck> https://github.com/w3c/silver/issues/508

<jeanne> I should have said "rounding" instead of "averaging"

Chuck: Was sort of a heavy process. We said we'd look at making it more efficient. You had said some people misunderstood this was a manual process.

Jeanne: Every place we intend counting should be tool assisted. The rating scale should be very efficient and have a tester quickly be able to assess the overall scope of what's tested.
… On an overall scale be able to say where it fits on the scale. So they have guidance to say which is okay.
… It could vary. In an example where you can go beyond the requirement, such as visual contrast, as long as you meet the minimum contrast you could get more points for going beyond the minimum.

<Zakim> alastairc, you wanted to ask if it is really a change (in practice) to count instances of fails?

Alastair: In our current reporting we report instances, include everywhere that fails alt texts. But also we attach importance to particular instances.
… Wondering if other people report instances, if so is that going to make the reporting any more difficult?

<Zakim> JF, you wanted to ask are 5 critical failures worse than 1 critical failure? Does the number of critical failures have an impact on scoring? (How? Why?)

<AbiJ> +1 to Alastair to practice of reporting instances and fix priority

JF: It depends on who receives the report. I ask if 5 critical errors are worse than 1?

<Lauriat_> +1 to JF's point about the helpfulness of granular scoring even within levels of conformance

JF: Having a score below compliance is still a useful metric. I would like to measure progress, in addition to seeing the minimum bar.

<alastairc> +1 on showing progress, although I think that would work as proposed?

<jon_avila> a +1 to measuring progress.

<jaunita_george> +1

<Lauriat_> +1, alastairc, I think it would

<Rachael> I think that would work as proposed as well

David: When I test, I test a set of pages manually, and crawl to test automatically. I'll scan some of the automated and give details on one of them, and tell them to look at automated results for others.
… I find that if developers are given a report, people glaze over if it has more than 100 issues.

Chuck: As far as reporting goes, VPAT allows for partial conformance and indicates issue by issue where you don't comply.

<JF> @chuck - should we?

Chuck: I presumed that other entities described the reporting requirements of WCAG 3.
… I want to ask John: should we track the count of critical failures?

JF: There is value in that. I suggest our scoring mechanism should be an evolving thing that entities can use to track progress well.

<Zakim> Chuck, you wanted to say that nothing prevents companies from recording the information they find of value

JF: In a 4 point scale you lose granularity. If I can reduce the number of issues, and my scoring counts number of issues and severity. That's a powerful tool that aids content creators in doing the right thing.
… Getting that feedback quickly is really useful.
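
A minimal sketch of the kind of tracking JF and Wilco describe: count open issues by severity in two snapshots and report the change. The severity labels, the issue structure and the function names are hypothetical; the minutes do not define a format for this.

```python
from collections import Counter

def issue_summary(issues):
    """Count open issues by severity label."""
    return Counter(issue["severity"] for issue in issues)

def progress(before, after):
    """Change in issue count per severity between two snapshots (negative = fewer)."""
    b, a = issue_summary(before), issue_summary(after)
    return {sev: a[sev] - b[sev] for sev in sorted(set(b) | set(a))}

# Hypothetical snapshots of the same site, taken at two points in time.
first_audit = [{"severity": "critical"}] * 5 + [{"severity": "minor"}] * 12
second_audit = [{"severity": "critical"}] * 1 + [{"severity": "minor"}] * 10

print(dict(issue_summary(first_audit)))
print(progress(first_audit, second_audit))
```

Both snapshots still contain a critical error, so both would rate 0 under a rule where any critical error is a hard fail, yet the tally shows four fewer critical issues. That is the granularity JF says a 4-point scale loses.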

<Chuck> wilco: I want to highlight the point john made. Scoring model will add to the workload. We are tracking data that we did not track before.

<Chuck> wilco: Often you assign a priority and track it, that's what is used in metrics. Strange that we aren't using that same model. Those are the metrics that orgs use. How many issues and the impact.

<JF> +1 Wilco - we previously spoke about "Dashboards" but that conversation seems to have gone away

<Chuck> Wilco: That seems a better fit than current proposal.

<Zakim> AbiJ, you wanted to talk about different regulatory and cultural approaches

<JustineP> +1 Wilco

<JenniferC__> +1 for John Foliot and Wilco Fiers

<Fazio> maturity model?

Abi: There may be different cultural requirements on this. In the public sector we have to report. Having a mechanism to show progress is very useful. We're looking to show where we need to do further work.

<JF> no David - specific "Jira" issues

<Fazio> we have an ICT dev cycle dimension

<ChrisLoiselle> For reference points - Per WCAG 3 FPWD , https://www.w3.org/TR/wcag-3.0/#critical-errors . Also in WCAG 3 In addition, critical errors within selected processes will be identified and totaled. Any critical errors will result in score of very poor (0). https://www.w3.org/TR/wcag-3.0/#bronze, Views and processes MUST NOT have critical errors. Conformance to this specification at the bronze level does not mean every requirement in every

<ChrisLoiselle> guideline is fully met. Bronze level means that the content in scope does not have any critical errors and meets the minimum percentage of.

<Zakim> alastairc, you wanted to say we are increasing granularity of what is tested, 2.x is very flat

Abi: Development teams care to make things pass when they work on it. Leadership teams want to know where the risk is. Compliance may be more important in some areas than in others. Having a robust methodology would be of value in other environments & countries.

<JF> +1 to Abi - Alastair spoke to "demonstration of effort"

<ToddLibby> +1 for John & Wilco as well

Alastair: A lot was built on WCAG 2, like VPATs and reporting. Part of this process is to draw out the most useful bits and put them in the standard.
… Currently people have to add impact. The levels don't really help on impact because it's not associated with task.
… Adding that bit more granularity could be built on with more reporting.

<Zakim> Lauriat_, you wanted to continue my habit of making meta comments in that I think whatever scoring and conformance mechanism we end up with, we'll need to support these different needs, so more to provide the building blocks, rather than the complete system

<Chuck> +1 to shawn

Shawn: Whatever scoring we end up with, the goal is not to create an end-to-end system that is ideal, but to create the building blocks to support different tracking systems.

<Zakim> Chuck, you wanted to talk to efficiency

Shawn: It simplifies the things we're doing, as opposed to making a process that may not align with their needs.

Chuck: If we just have the data requirements. How an organisation tracks that can be left to the organisation.

<jon_avila> With what's proposed we could still measure scores and progress for disability category at view level such as 3.4, 3.5, etc. So I think we can pull metrics out of the proposed scoring.

Chuck: But what they do with the count of critical errors and non-critical errors. They can track that data.
… Whether or not it is more cumbersome, there is more data from the FPWD to track. I wonder if efficiency has to relate to the quantity of work, as opposed to the quantity of data.

Shawn: This is where validation with stakeholders is going to be essential.

<Zakim> bruce_bailey, you wanted to ask Wilco et al if "impact" can not be subjective

Bruce: Impact is pretty important, but it isn't captured in a report like a VPAT.

<jeanne> isn't "impact" the critical error?

David: Do we aspire to make a statement of if WCAG 3 will be as hard to evaluate as WCAG 2?

<jaunita_george> +1

<jeanne> -1 that we will need to include more disability needs and that will add to the evaluation.

David: The gains we'll have is worth making it harder to test.

Shadi: Think of this from the perspective of how will this make it better. Transparency is what we want.

<jaunita_george> +1 for Shadi's comment

Shadi: I think the simpler the scoring method is, the less requirements we have for reporting.
… Let's take WCAG 2. Scoring is fairly opaque. There are assumptions about the levels and requirements. Two sites can have very similar scores but be very different in accessibility.
… It would be good to have references on how the score was achieved and calculated. But if we have a fairly simple model, such as counting the errors, it is transparent enough.

<JF> +1 to Shadi

Shadi: The other issue is what data to collect; that could be left up to whoever wants the data.

<Chuck> +1 jon_avila

Jon: The current proposal can show progress improving for disability categories. What I hear is that customers want to improve their score, or they want to improve access for people. They want to know which issues, if fixed, will make their score more compliant.
… I think we can have that with tests that map to disability groups, and also knowing the score we can calculate what technique would be most effective.

<Chuck> wilco: Suggest for scoring model I hope we can keep it lean.

<Chuck> wilco: That's what Shadi suggested as well. The more data we collect the more complicated it becomes.

<Chuck> wilco: With questionable benefit to the end user. I like the binary solution.

<jaunita_george> I know I keep asking this, but can we create a scoring tool to make it easier?

<Chuck> wilco: I'd like to consider using that for the most part for using a scoring system where it's necessary to keep things simpler.

<Chuck> wilco: Bruce asked about impact... it might be beneficial to have an informative part of WCAG 3 to say "if you want a standardized way to prioritize your issues, here's how"

<Chuck> wilco: But not part of conformance model itself.

<Zakim> JF, you wanted to respond to Bruce

John: On impact, I used to do a workshop where we'd take errors and prioritise them. One thing was that we could prioritise based on different measures. Impact changes based on the role.
… Binary is complicated, but even the more complicated. A cognitive walkthrough is essentially a series of yes/no questions.

<Zakim> alastairc, you wanted to suggest that bronze = WCAG 2.x level of effort, but Silver/gold should track higher and affect process.

Alastair: As a broad brush, it would be useful if bronze was roughly the same effort as WCAG 2 AA.
… Not sure how binary scores contribute to the overall score. At the guideline level, however things happen within that guideline / outcome, it contributes a known thing to the final score. Does that free us up to experiment with something like an alt text test that is binary vs an alt text test using a rating scale?
… I think you can compare how they track against effort level, reliability and impact to the end user.

<Zakim> Chuck, you wanted to remind everyone of Jeanne's suggestion this morning

Alastair: Can we then focus on making the tests as best we can, with some experimentation.

Chuck: This morning Jeanne suggested developing some binary tests. Nothing prevents us from creating binary tests, while others explore different types of tests.
… That doesn't prevent us from exploring other options.

<JF> is that a good thing or a bad thing Sheri?

Sheri: I want to make sure that everyone understands, the more data is generated, the larger the trail of evidence that can be used in court. Even if generated in court, it is still discoverable evidence.

<jeanne> What if we are allowing minor errors?

<jaunita_george> +1 Sheri's comment

<JF> The #1 phrase related to WCAG = "It depends"

Sheri: On the fence on whether that is a good thing. If the company does very poorly, it's a good thing. But if you're close to perfect, that can be problematic.

<Zakim> Wilco, you wanted to clarify binary

<Chuck> Wilco: I want to clarify, I meant preference for outcomes being binary instead of on a scale.

<jaunita_george> That's why we need to have a very solid definition of "conformance"

Chuck: I believe we can create all the binary tests we want, but we can explore other options too.

<Zakim> kathyeng, you wanted to say binary + critical, scoring is difficult for reliability

<bruce_bailey> https://github.com/w3c/silver/issues/508

Kathy: From trusted tester perspective, it wouldn't be too difficult to add a critical reading. But it would be difficult to get testers to score on a 0-4 rating consistently. I think the interrater reliability would suffer from adding the scoring part.

<jaunita_george> +1

<jeanne> +1 to Kathy and that if we make it too hard to learn, we suffer.

Rachael: We have binary tests, but we are moving away from the concept that everything must either pass or fail. Binary is still a rating system, it's just a two-point rating system. It's just a question of how many divisions we want to have.

Chuck: Moving on to issue 463. There is a lot of positive commentary regarding our effort to expand beyond pass/fail.

<AWK> https://github.com/w3c/silver/issues/463

Chuck: This particular issue suggested it needs a better explanation.

<Chuck> https://github.com/w3c/silver/issues/463

Chuck: We've delved into the conversation about scoring at the outcome level. The options are binary, percentage, rating scale, or points.
… Wilco expressed a preference for binary. This repeatedly came up in the issues that were raised.
… Some had a preference for the WCAG 2 style, others were very supportive.

Rachael: We want a general idea of where the WG & TF are. Understanding where everyone falls on this is helpful.

<Chuck> poll: Binary, Multiple styles (including binary)

<jaunita_george> Binary (so that the standard stays legally enforceable)

<Detlev> instances or pages??

<alastairc> I think it is important to explore multiple styles more

Chuck: If you support the system being limited to binary, select that. Otherwise multiple styles.

<Chuck> multiple styles

<Ben> Multiple styles

<AbiJ> Multiple styles

<KimD> Multiple styles

<Francis_Storr> multiple styles

<JakeAbma> multiple styles

<Sheri_B-H> multiple styles

<Rachael> multiple styles

<Lauriat_> Multiple styles

<Azlan> multiple styles

Chuck: This particular question is irrelevant to what the scope is. It works whatever the scope is.

<AWK> Binary first, additional styles second

<JF> more granular scoring

<Jennie_> multiple styles, especially if the scoring rubric for rating scales is clearly defined and accounts for inter-rater reliability

<JakeAbma> +1 to Binary first, additional styles second

<laura> Binary first, additional styles second

<bruce_bailey> My preference is to keep working at making 0 - 4 ratings possible

<jeanne> multiple styles, more granular. Also interested in barrier walkthru method

<shadi> +1 to Detlev

<JustineP> binary first, additional styles second

<JakeAbma> Binary first, additional styles second, keep working at making 0 - 4 ratings possible

Detlev: If you had different instances where certain criteria apply. If you have to decide to give 1.1.1 a pass or a fail for a unit that has many instances, you have to decide if you're going to tolerate a few missing, or if you're quite strict.
… I don't agree that it's all the same.

<jaunita_george> A scoring system can work if we are very clear and granular on how things are scored.

<Chuck> wilco: I don't understand the level we are voting on. I'm suggesting binary scores for outcomes where possible.

<Chuck> wilco: It's 3 levels of conformance. There are different things within that.

<jon_avila> Explore new model beyond WCAG 2.0

<JF> +1 Wilco: Good, Better, Best

<Zakim> JF, you wanted to ask about Shadi's suggestion about counting-up instances

John: I suggest a more granular scoring mechanism. Shadi proposed counting instances of problems. I'm struggling with the 0-4 scoring. It does not give the granularity I believe we need to track progress.
… 4 is perfect. 3.5 is minimum, and below it is not good enough. I don't see us gaining much from that.
… I'd like something far more granular.
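
[Illustrative sketch of the model under discussion: a minimal Python example, assuming that 0-4 outcome ratings are simply averaged and compared against the 3.5 minimum John mentions. The function names, the averaging rule, and any sample ratings are hypothetical and are not taken from the FPWD.]

```python
from statistics import mean

# Assumption: the minimum average rating mentioned in the discussion.
MINIMUM_AVERAGE = 3.5


def average_rating(ratings):
    """Average a list of 0-4 outcome ratings (hypothetical aggregation rule)."""
    if not ratings:
        raise ValueError("at least one rating is required")
    if any(r < 0 or r > 4 for r in ratings):
        raise ValueError("ratings must be between 0 and 4")
    return mean(ratings)


def meets_minimum(ratings, minimum=MINIMUM_AVERAGE):
    """Return True when the averaged rating reaches the minimum."""
    return average_rating(ratings) >= minimum


def gap_to_minimum(ratings, minimum=MINIMUM_AVERAGE):
    """Return how far the average falls short of the minimum (0.0 if met)."""
    return max(0.0, minimum - average_rating(ratings))
```

Under these assumptions the pass/fail result is still binary at the top level; the gap value is the only part that would help track progress below the minimum, and it has no more granularity than the underlying 0-4 ratings provide.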

<jon_avila> 3.4 is useful because it shows how far you are from 3.5

<jeanne> We know that 3.5 is too high from the testing we already have done

Shadi: Color contrast could clearly be done in multiple steps. For other things, rating a text alternative 1 - 4 can be quite difficult. You get very inconsistent results. Multiplying that by the number of instances on the page makes it more complex.
… What is the test? How to do the test comes from the nature of the test itself.

<Zakim> bruce_bailey, you wanted to say not wcag2 not so binary

<david-macdonald> does anyone post claims?

Bruce: I don't think the 2.0 model is as binary as you characterise it. There are not as many people posting conformance claims. If things get missed, there is continuous improvement to work in that.

<alastairc> All public sector orgs in Europe are posting a form of conformance claim.

Bruce: The question is are we doing good enough.
… I don't think we should worry too much about this. There is already the ambiguity in the current model.

<Chuck> +1 to shadi

Shadi: Binary is in how the test is constructed. We want to avoid rigidness of WCAG 2. Binary or non-binary isn't the issue.

<bruce_bailey> +1 to that distinction Shadi is making

<jeanne> +1 to avoiding rigidity

<Zakim> alastairc, you wanted to say we need to make it more concrete and test it

Jon: A lot of contracts are written to conform to a standard, and most of the time nobody does. It would be really good to get more granular data on what substantial conformance actually means for user impact.

<KimD> +1 to Shadi's comment re: rigidity - that's the issue

<Ben> +1 to jon, WCAG is binary as a standard to conform to, but the fact that companies interpret and report on WCAG in a non-binary form is a whole different topic.

Alastair: This feels quite abstract. It would help to carry on exploring various ways to do tests, and come up with a variety of different requirements, using different methods, and see how it works.
… So long as they all contribute to a final score on conformance.

Chuck: If you look at WCAG 2.0, it isn't fully binary. Exploring different ways to test appeals to me as well.

<jon_avila> Thank you to everyone!

<laura> Thanks.

<Ben> Look forward to seeing you in 2h!

<JF> present+

<Detlev> maybe someone sober

<Ben> ..or awake!

Session 3 - Conformance

ShawnL: Reminds of mtg P & Q -- use q+ with q+ to say or q+ to ask

ShawnL: Also reminder to keep points brief, there are many points (and many participants); please avoid metaphors and allegories;

<Lauriat_> https://w3c.github.io/PWETF/

Rachael: Slide 28 is current state

<jeanne> Slides <- https://docs.google.com/presentation/d/1eUbNUGFaqbI87tx7vVMvDwxT8GNsAHD-SCWYddgEWEE/edit#slide=id.gd5abc90fa9_0_116

Rachael: We were at single item test -- "atomic" test -- agreement to explore different ways to do that in 2nd session

Rachael: Also agreement to do better on inter-evaluator agreement -- different people should reach similar conclusions

Rachael: Now we get to overarching conformance level discussion

Rachael: FPWD has three levels: Bronze (similar to A/AA/AAA and considered minimum);

Rachael: Silver and Gold were proposed to be some kind of holistic testing -- but not defined yet in FPWD

Rachael: Notes we've touched on many of these questions during earlier conversations today

Rachael: Should we have a level lower than Bronze?

<AWK> +AWK

Rachael: One proposal was a level based on automated testing -- perhaps this would be qualification for going to a higher level

Jennie: State of Minnesota review was concerned that Bronze not quite A/AA; Could we write something like "Silver Level" required by Minnesota?

Rachael: W3C wouldn't define what is ideal; that's regulatory

Rachael: suggest we make that a question for later discussion

Jennie: meaning will it be spelled out so that's easy enough for us to decide?

[agreement that the idea is the same, but stated differently]

jf: Notes Makoto's workshop feedback ...

jf: Asks about scoring content free applications

<jeanne> question+ How do we score content free components or templates

jf: believe that 3 levels may not be sufficient

PeterKorn: Lot I like about requiring a level set that could be automated

PeterKorn: Notes there are currently AAA items that could be testable but we may not want to require for B

PeterKorn: So we may need more nuance in what goes into levels

<jaunita_george> We might want to consider types of content like elearning content that isn't built the same way as other online experiences

<bruce_bailey> i think ALL current AAA sc are testable

<Zakim> MichaelC, you wanted to s/automated/automatable/

MichaelC2: "Automated" is a problem for me because it implies using tools; prefers automatable

<Lauriat_> +1 to MichaelC

+1 to Michael

<Chuck> +1

<kathyeng> +1

<Ben> +1

<JF> +1 with a thought towards "automatable using the ACT Rule Set"

awk: +1 to Peter because we want to be smart about what we require even if it's automatable

awk: What's automatable is clearly insufficient

awk: Have heard people have a concern about stopping at such a level because it doesn't support a full experience for users

<Rachael> Suggested reframed question: "Should we have a lower level that includes an agreed upon subset of automatable tests?"

awk: But, if a large chunk of the web were nightly tested with what is automatable, wouldn't that flush out data we need to help make the experience better and get organizations incented?

<PeterKorn> +1 to AWK. Having a clearly defined "thing" (whatever exactly that is) that is 100% automatable that we incent folks to do is valuable.

awk: If we set it up in a way that was attainable, that could help ease people into higher levels than if B were the entry point

awk: There are very few sites that meet A/AA

<Zakim> alastairc, you wanted to say automated testing doesn't align with end user experience

alastairc: Would not be in favor of automatable testing

alastairc: picking the automatable might focus on certain pwd categories

alastairc: potential that could be a lower level though also not yet convinced

alastairc: few go for level A

<Detlev> +1 to Alistair

<Detlev> sorry Alastair

<Zakim> jeanne, you wanted to propose an introductory "level" that is based on EasyChecks. https://www.w3.org/WAI/test-evaluate/preliminary/

<jeanne> https://www.w3.org/WAI/test-evaluate/preliminary/

jeanne: is link to Easy Checks

jeanne: also not in favor of a level; but like that it could be another doc that would be an intro level for beginners or small orgs

<alastairc> Plus we could have a percentage score that you can improve on, until you do get to the baseline of bronze.

jeanne: recalls someone suggested WCAG 3 could be multiple normative docs

<Zakim> bruce_bailey, you wanted to mention that AAA SC being in the same doc is useful.

shawn: Notes that there are additional options as opposed to we need to explore them all now

bruce_bailey: in favor of another lower level based on teaching people

bruce_bailey: If one is bringing a site up to speed the first time, it makes sense starting in the A level

bruce_bailey: The levels have been helpful that way

bruce_bailey: When we wrote 2.0 we didn't know regulators would pick up AA; that was unknown

bruce_bailey: And AA is a pretty high bar

<alastairc> Our typical audits find more A issues than AA issues :-/

bruce: Notes Canada did AA with key exceptions--captions, maps

bruce_bailey: Let regulators figure out their levels

bruce_bailey: Let's figure out our part

<Zakim> Ben, you wanted to speak to levels lower than bronze

<Zakim> Chuck, you wanted to say that a lower than bronze might facilitate components and frameworks

<jeanne> +1 to describing the low score rather than giving a level

Ben: Concerned about implications of levels describing clearly what they are

chuck: suggest the two could work well together auto and human

<Sheri_B-H> +1 to chuck, and I would include design systems in the "raw components" list

PeterKorn: Notes that what is automatable changes over time especially now with AI

PeterKorn: so if we go down that path we should have a lighter weight mechanism for growing that level without the full W3C TR process

<Lauriat_> +1 to PeterKorn on the maintainability point

<AWK> +1 to PK

PeterKorn: important we continue to keep up with advances so that results continue to be most meaningful

<Zakim> JF, you wanted to ask whether we sidestep "conformance" altogether, and simply focus on "scores", with levels based on minimum scores

jaunita_george: Suggests possible separate guidance for certain content types, i.e. educational

jf: Suggests looking at reqs and thinking from points perspective rather than A/AA/AAA

jf: might help us not get our comparison categories confused

jf: Could support more customizable testing for particular content type

jf: the more one tests, the more one may improve how well one does

<Zakim> Makoto, you wanted to share feedback from Japanese experts

<Wilco> Interesting idea John

Makoto: Notes regulators will decide on levels; a11y isn't all or nothing

Makoto: Strong and numerous requests from Japan to induce people to try making content accessible by providing a more attainable level

<JF> +1 Makoto

<jaunita_george> +1

Makoto: many private orgs who aim at A as first step because it's hard to meet some AA

<alastairc> Ok, good to know that some areas do start with single-A and easychecks.

Makoto: Also some people who start with basics like Easy Checks because there's no legal pressure in Japan

Makoto: if WCAG3 becomes an international standard, Japan will follow it

Makoto: We should create conformance scheme that can meet needs

Wilco: Notes many of us work for large orgs that can support people spending lots of time working on WCAG

Wilco: much of the web can't do that

<JF> +1 Wilco

<jaunita_george> +1

Wilco: would like us to preserve a lower entry barrier

Wilco: believe AAA much less used; very few even try and would like to see us encourage making the higher levels meaningful and achievable

<JF> To Wilco's point: Over 99 percent of America's 28.7 million firms are small businesses. The vast majority (88 percent) of employer firms have fewer than 20 employees, and nearly 40 percent of all enterprises have under $100k in revenue.

<JF> source: https://www.jpmorganchase.com/institute/research/small-business/small-business-dashboard/economic-activity#:~:text=Over%2099%20percent%20of%20America's,under%20$100k%20in%20revenue.

Wilco: also thinks it a bit odd that the same reqs apply to small orgs and mega orgs

<Zakim> alastairc, you wanted to say that if more things become automatable over time, we shouldn't pick guidelines based on that, it should be based on the user-impact. And do we need another level if we have a score?

Wilco: Would like higher level that does attract

alastairc: to things becoming more automatable argues for not basing a level on that

alastairc: an argument against on top of not aligning with user needs not supported at that level

alastairc: Seems Bronze was picked as most common around AA though I hear Makoto's point

<JF> +1 Alastair - all progress is good progress

<jaunita_george> Should we write practical guidance for regulators and courts to use when applying a standard? I want us to aim to think about where the standards fit in future interpretations of Title III of the ADA and other similar laws around the globe.

alastairc: notes that 50% is better than 30% even if one is aiming at 60%

alastairc: suggests some kind of normalization across functional outcomes

shawn: Want to point out that levels defined by difficulty of reaching is one approach; but another is level by support for a11y

shawn: Suggest we consider a system that better supports beginners; and/or who are overwhelmed with 2.x;

shawn: Or, should WCAG3 more explicitly support a lower level

Rachael: is to be based on difficulty of achieving? or on support provided?

shawn: Hearing we want a way to support our users better

<Zakim> Lauriat_, you wanted to point out the difference between levels by difficulty in implementing vs. levels by level of accessibility for people with disabilities

<Zakim> Rachael, you wanted to say that while small organizations can't meet it, often the companies and toolkits they use can get close

Rachael: Notes her business works with small orgs, typically they're on a CMS that helps them; and they pass or fail based on CMS support

<Wilco> +1, fair point Rachael

shadi: concerned about separating by org size

<jaunita_george> +1 Rachael

<johnkirkwood> +1 to Rachael

<jeanne> +1 for not giving lighter requirements for the large component, template, and authoring tools

shadi: we're straying into law/policy and that brings up undue burden which is out of scope for W3C

<Lauriat_> +1 to shadi

<AWK> +1 to Shadi's point

<Ben> +1 to Shadi

<Wilco> +1, that was not what I intended to suggest

<jaunita_george> +1

<Rachael> The ADA at least has slightly more flexibility for small orgs while not actually changing the base requirements

shadi: though maybe type of website may be a useful org principle

<Zakim> Rain, you wanted to ask, following on Wilco, if it makes sense to set levels to meet requirements based on what the product is doing. For example, a site to buy flowers is very different from a site alerting people to fire evacuations

Rain: Shadi and Rachael resonate with me, but nervous about different levels of expectations

<Rachael> +1 Rain

<johnkirkwood> +1

Rain: smaller orgs might be the same groups who are trying to serve pwd the most and should do better than minimum; especially if there is a threat to life or health

<shadi> +1

Rain: we don't want to have orgs stray into levels that aren't supportable

<KimD> +1 to Rain, also great point by Rachael re CMS/website hosts/builders/providers

<Rachael> proposed straw poll: Should we have a lower level of conformance? Yes/No

shawn: having a lower level of conformance appears to not have support

shawn: but having a defined set at such a level seems something useful

<Wilco> Yes

<JF> Not sure that's the right question to be asking

<Rachael> Janina: I am intrigued by defining what can be automatable...

<JF> +1 Janina

<jaunita_george> +1

<Rachael> ... it also has an opportunity for component libraries. We've talked about it some in conformance challenges but may want to dive into it. We may want to say more about it.

<Rachael> ...That is where people get their basic frameworks.

<jeanne> question?

<Rachael> question+ Discuss more about how to handle frameworks and CMSes

juan: concerned about a lower standard

juan: many make sites accessible because they have to

sajkaj: Want to note we can define the lower level as "necessary but insufficient"

juan: should we make suggestions to regulators around the implications here?

<Zakim> JF, you wanted to follow up on Rachael's point

jf: Recalls a goal for WCAG3 to roll in ATAG & UAAG reqs; much of that is already there but won't score the same way as a completed site that people use

jf: back to suggesting focus on points

jf: notes specific numbers immaterial but would map progress

<Ben> +1 to JFs explanation of different totals for different product types

<Zakim> bruce_bailey, you wanted to observe that SBA orgs might warrant a break, but Wix and Square Space are NOT small businesses !

<jaunita_george> +1

<JF> +1 - focus on measurement, let the regulators set minimums

bruce: agrees it doesn't need to be a formal level, just a way to note key milestone

<jeanne> ben said earlier that we could give a verbal description of scores below bronze

bruce_bailey: Notes the impact assessment government does;

<johnkirkwood> +1 to Bruce

bruce_bailey: so these CMS systems are large companies and should actually support a11y better so small orgs can do better

shawn: our CMS could say we have a score of X will imply ...

jf: yes progress toward B

jf: believe we're in a silo now and should instead think about a11y of whatever thing we're describing

jf: so maximum for the CMS is X; how do you do against that measure?

<Zakim> alastairc, you wanted to ask what specifically would be involved (e.g. functional outcomes?), and say that we do have a lower standard than bronze now - single A.

alastairc: what would be the diff if regulators said some orgs need to meet X and other orgs X+y

alastairc: notes that A is lower than B

<Zakim> Rachael, you wanted to clarify question

Rachael: Would like to clarify JF's question

Rachael: do we want some named category that means a threshold

sajkaj: Suggest "threshold" is a good name

Rachael: believe we should first answer that

<JF> should we even be defining bronze, silver, gold?

<bruce_bailey> +1 to one question at a time!

shawn: so we have b/s/g now; should we add a -B level?

Rachael: and perhaps how many levels? 3? 4? 5?

shawn: notes reqs for motivation is source of b/s/g

<alastairc> I'm not familiar enough with the proposals for silver/gold, but I wonder if they could be combined?

<Lauriat_> https://www.w3.org/TR/wcag-3.0-requirements/#motivation

<Lauriat_> Motivation: The Guidelines motivate organizations to go beyond minimal accessibility requirements by providing a scoring system that rewards organizations which demonstrate a greater effort to improve accessibility.

Wilco: why wouldn't we equate AA to Silver? We could do that

Rachael: Agrees. That's an option

<alastairc> So question is more like: Should we have a lower level more equivalent to single-A?

Jennie: Notes that 3 seems easier for people to grok

Jennie: has there been conversation on how more levels would be perceived? What would promote understanding and adoption

shawn: Research did uncover that current naming has confused people

shawn: Where schools grade A-F, A is great, and that leads to misunderstanding

<Zakim> jeanne, you wanted to the problem of putting AA-type guidance in another level

<JF> same problem, different name

Jeanne: I do want to caution people about the problem of putting guidelines into level in a way that might result in unbalanced representation of different disability groups

<Zakim> sajkaj, you wanted to say mixing automatable and requires human evaluation in a lower level would be counter productive

Janina: what is valuable in defining an automated set, is the power of what technology can do
… we could have this type of level, but make it very clear that this level is not sufficient

<Zakim> kathyeng, you wanted to say participant level

<Ben> "Wooden spoon" comes to mind

<bruce_bailey> +1 to participation medals!

<JF> +1 Kathy, and add that to what Alastair noted about demonstration of effort

Kathy Eng: What about a participant medal - they tried, but they have not achieved bronze yet. I would not support providing a lower level of conformance, beyond something like a participant medal

<bruce_bailey> Eric Eggert (yatil) talked about "wood" metals being a thing

AWK: I don't feel strongly about 3 versus 4 levels. But I do think we need to think very hard about what Bronze actually means

<alastairc> Is it more about sites trying to meet A vs AA?

<Wilco> +100, we need to bring more organisations on board

AWK: we need to be cautious about believing that not having an easy on-ramp will force widespread Bronze compliance

<KimD> +1 to AWK, especially onramp

<Zakim> Chuck, you wanted to ask if there are legal considerations.

<Rachael> To your point, based on webaim's test at https://webaim.org/projects/million/ 98.1% of home pages had detectable WCAG 2 failure

Charles Adams: if Bronze is the legal minimum, and there is an earlier level, does claiming compliance with the earlier level put an org at legal risk?

Peter Korn: it would be better if we stay away from claiming what will or will not be a legal level

<JF> +1 Peter we should focus on measurement, and leave "levels" to others

<alastairc> suggestion: Bronze = between single-A and easy-checks, Silver = AA(ish), Gold = AA+ (whatever has been proposed for silver/gold)

Charles Adams: Could someone be sued, regardless of who has set the conformance, based on the idea of claiming a named lower level?

Peter Korn: Helpful clarification. Needs more consideration. Good question.

<Rachael> counter suggestion: Bronze = based on minimum score aggregated from outcomes, Silver = based on higher score aggregated from outcomes, Gold based on additional testing

<Ben> +1 to Shadi re considering the test may be easy to run, but the solution may be complex

Shadi: about the lower level being based on automation, will that truly be an easier level, if an organization doesn't know how to respond to the test results?

<Zakim> bruce_bailey, you wanted to ask where did the idea come from that w3c could/should tell regulators what to do?

<jaunita_george> +1 to Shadi's comment

<alastairc> +1, automation doesn't equate to the outcome

Bruce: Is it even the W3C's role to make a recommendation as to what should be regulated?
… for WCAG 2.0 I don't recall conversations about regulation during the development of the standards

<JF> +1 Bruce

<johnkirkwood> +1 to Bruce!

Bruce: we should be writing the best requirements we can, regardless of any thoughts about government actions

Charles Adams: We should make the requirements attractive to regulators...

<Lauriat_> https://www.w3.org/TR/wcag-3.0-requirements/#regulatory-environment

Bruce Bailey: it is the fact that the guidelines are *good* that makes them attractive

<Lauriat_> The Guidelines provide broad support, including Structure, methodology, and content that facilitates adoption into law, regulation, or policy, and clear intent and transparency as to purpose and goals, to assist when there are questions or controversy.

<bruce_bailey> none of that addresses bronze / silver / gold

<Zakim> sajkaj, you wanted to say I just realized I may be misapprehended--I feel automatable is useful in describing successful a11y, not because it pertains to regulatory adoption

Janina: I wanted to clarify as I've been speaking strongly for a lower level

<Jennie> Possible labels: Beginner? Student?

<Rachael> suggested straw poll: Should we have a lower level more equivalent to A/AA?

Janina: I'm not proposing this for regulation. I'm proposing this as it is useful for getting the message about accessibility out there and getting people started. If they are willing to run the tool, they are interested in this to some extent, and we can use this to leverage the value of automation.

<Rachael> suggested straw poll: Should we have a lower level than one equivalent to A/AA?

JF: To me, the things that are really important are clear intent and transparency. Everyone must understand how a content owner got their score. This way, different countries can choose different numerical scores.

<johnkirkwood> regulatory need: clear and unambiguous

Rachael & Shawn: discuss straw poll

<JF> How does that work with ATAG Rachael?

<Rachael> I think that is a different question set

<JF> Why?

Alastair: If Bronze is similar to AA, then the lower level would be similar to A, so I think the question should be whether we need something lower than Bronze

Peter Korn: We don't have enough guidelines currently to make this decision

<JF> the reality is that if you score below bronze, you have something below bronze...

<jeanne> JF, because ATAG applies to a subset of possible profiles -- authoring tools -- that is narrower than what this discussion is about.

Shawn: Disagrees because we are not voting on a level of conformance, but rather an easier on-ramp

Peter Korn: Even still, we will need to revisit once we have more guidelines

<Rachael> draft resolution: We support a simpler on ramp below bronze that is not conformance, and will explore this further later

<PeterKorn> I can get behind that statement.

Shawn Lauriat: Agree, we are setting aside the question of whether or not there is a lower level, but we can likely agree now that we support an on-ramp

<sajkaj> +1 -- could be in a best practices, even

<jaunita_george> Maybe a conformance roadmap? Like a path to Silver?

<JF> @Jeanne, sure, but what about when authoring tools are part of a larger offering? Members of the accessibility community were furious when WordPress initially rolled out Gutenberg, which went backwards in accessibility

AWK: I would agree if the statement said "that is not necessarily conformance"

<Chuck> +1

<Sheri_B-H> +1 to onramp but not conformance

<Rachael> +1

<KimD> +1

<alastairc> +1

Shawn: yes, it is about supporting the use case, leaving the question of conformance for later

<Jennie> +1

<Wilco> +1

<AWK> +1

<Lauriat_> +1

<AngelaAccessForAll> +1

<kathyeng> +1

<JF> 0

<PeterKorn> +1

<bruce_bailey> +1

<Ben_> +1

<Rain> +1 to onramp but not conformance

<johnkirkwood> +1

<Francis_Storr> +1

<jeanne> +1 to onramp but not conformance

<Detlev> +1

<Makoto> +1

<jaunita_george> +1 to a roadmap/onramp to conformance

Resolution: We support a simpler on ramp below bronze that is not conformance, and will explore this further later

Shawn: Sees lots of +1, some with qualifications, which we will note for the next draft

<Rachael> options: https://docs.google.com/document/d/1BjH_9iEr_JL8d7sE7BoQckmkpaDksKZiH7Q-RdDide4/edit#heading=h.r8n8wkp3rutl

Rachael and Shawn: discuss next topic and choose a narrowly scoped discussion of Options for Levels

<JF> Profiles

Rachael: Can we discuss the pros and cons of subsets of tests versus overall scores? Let's not discuss types of products.

Rachael's question is about the pros and cons of aggregate points, whether against profiles or not.

Sheri: Points are much more granular. 750 vs 900 much more meaningful than Bronze vs Silver

<JF> +1 Sheri, I've previously referenced FICO scores as a model

<Zakim> jeanne, you wanted to say that I don't oppose aggregate scores, but I am concerned that a company could accrue points that are "less expensive" than others that are critical to certain disabilities, for example, the expense of supporting people with hearing disabilities could be ignored. We have to protect against that.

<alastairc> Only using scores skips Functional categories, so you could end up with several being left out

<JF> more expensive = more points

Jeanne: I don't oppose aggregate scores, but I am concerned that we have to put in protections so that companies don't skip over the more expensive supports
… for example, captions and ASL might be more expensive and be dropped
… so I still think we need to have minimums by disability

Rachael: Agree that would need minimums by disability

<Zakim> JF, you wanted to also note that we have the functional requirements minimums

Rachael: thresholds versus points can be discussed without the idea of product profiles. The question to the group is which direction we want to think in, rather than exactly which option to choose.

Sheri: agrees with Jeanne. Has seen examples where people were making decisions based on how they would be reflected on a VPAT.

<bruce_bailey> In terms of what is attractive to regulators, here is the pointer to the preamble of the Original 508 Standards (12/21/2000) where the U.S. Access Board explains (comment/response) why WCAG 1.0 was not more closely adopted:

<bruce_bailey> https://www.federalregister.gov/d/00-32017/p-124

<Rachael> as a point, thresholds can also be by number of errors where the lowest number of errors gets the highest medal

Shadi: yes, worries that too much scoring will draw attention to the wrong things. I think increasingly sites are nearly at WCAG 2 AA. Would like something similar to the levels that we currently have, but perhaps with a different balance of disabilities, etc.

<jeanne> The problem with weighting in this example (how hard it is to implement) is that it can't be standardized. What is hard for VMware could be very easy for WordPress, for example. Legacy software change is much harder to implement than new software change

Shadi: there could perhaps be a threshold, where a few minus points are allowed, but after a certain number, you would no longer pass

<Detlev> +1 to Shadi

<Rachael> As a note, we all agreed at the last face to face not to do weighting

Jeanne: the weighting in the example that Sheri gave, based on how difficult things are to implement, would be very difficult to do at the W3C level, because it varies so much between products

Rachael: Yes, we had agreed not to do weighting

Shawn: weighting would still be possible within individual orgs

<JF> thresholds

<Rachael> suggested straw poll: Option 1) Levels by thresholds Option 2) Levels by subsets of tests

<Ben_> 1

<Chuck> 1

<jeanne> 1

<Lauriat_> 1

<Makoto> 1

<Francis_Storr> 1

<jaunita_george> Would thresholds include subsets of tests?

<Sheri_B-H> 1

Rachael: Option 1 is new, Option 2 is how WCAG 2 does it

<Rachael> 1 but can live with 2

<Sheri_B-H> David Fazio who had to drop off also says 1

<AWK> 1, but I believe that we will need to test that it works once we have enough outcomes

Shawn: I see 1s coming in

<Detlev> don't understand options well enough to give meaningful answer

<Jennie> Same as Detlev

<bruce_bailey> i was listening, but also not clear on thresholds!

<AngelaAccessForAll> Agree with Detlev

Jeanne: Can you give a summary of the straw poll?

<Detlev> what's 'a group of tests'?

Rachael: Option 1 is where you complete a group of tests which result in a numerical score and score leads to a level

Rachael: Option 2 is where you have sets of tests (A, AA, AAA) and the subsets of the tests are what define the levels (like WCAG 2.x)

<alastairc> So when you have thresholds, you wouldn't have to test as many things at a lower level?

<JF> Question: if I score "80%" on thresholds is it the same as scoring "80%" of the subset?

Detlev: can you clarify "group of tests"?

<JF> +1 Shadi - each subset has minimum thresholds

Shadi: I thought I was suggesting a combination of both, where you do have the subsets, but you also have the thresholds

<bruce_bailey> +1 for concrete examples !

Rachael and Shawn: decide to clarify the question further with examples

Action: Draw up examples of differences and revisit

<david-macdonald> seems like a very important question so appreciate cycling back

<Chuck> I think we can set the table at the very least.

Shawn: No resolution on that question, but we have next steps

<JF> or, providing a maturity model accrues X number of points

<JF> at any level

Rachael: A Maturity Model is being created. Some options have placed it as the Gold level of WCAG 3.0 compliance, some have suggested it be separate from WCAG 3 compliance, and some have suggested it be just a part of what Gold is

Sheri: We have been working on a Maturity Model with 7 dimensions
… each dimension = slice of behavior
… we have been making sure it will work regardless of WCAG level, regardless of the type of organization
… each of the 7 dimensions is scored from 0 to 4

<jeanne> Sheri, can you give an example of a dimension? Just name them?

Shadi: I think this is a great resource.
… The issue for me is how the two connect.
… it would be very difficult for a Web a11y auditor to get access to the information to support the model

<kathyeng> +1 to Shadi

Shadi: great work, but I feel it is a separate effort

<JF> +1 to Shadi

<PeterKorn> +1 to Shadi's comments.

Shadi: some organizations also outsource all of this

<Wilco> +1

Sheri: We have addressed the outsourcing issue within the model
… I'm not sure how I feel about it - there is great value in it, but it also makes sense to publish on the side as a note

Shawn: Does the separate document mean it must be a note?

<Zakim> alastairc, you wanted to ask how much overlap their work has with the WCAG 3 draft?

Michael Cooper/Sheri: There are a number of options, can decide later

<bruce_bailey> @alastair pls post link in irc if you have it handy

<Chuck> It's WCAG agnostic

Alastair: Maturity model does not show a lot of overlap with WCAG 3. You mentioned it could work with WCAG 2 or WCAG 3.

Sheri: It was a deliberate choice to not tie it to WCAG 3

<PeterKorn> Apologies, I need to drop a bit early.

Sheri: Since WCAG 3 will not *replace* WCAG 2, the maturity model is separate

<Zakim> JF, you wanted to note there is a non-trivial cost to documenting a maturity model, which is a negative for smaller businesses

<Rachael> draft straw poll: Publish maturity model as separate document

JF: there is a non-trivial cost to documenting this; small businesses may not be able to do this; might make it a barrier to achieving gold. All for publishing it, as it is very useful, but WCAG should measure outcome, not process

David: It doesn't seem that W3C could really tie conformance to culture change, HR policy, etc. Seems it needs to be a separate document.

<Chuck> proposed Poll: Should the maturity model go out as a separate document? yes/no

<alastairc> At gold level, I think we could cross-reference and add points above the baseline of our usual testing.

Wilco: I really like this idea. There is a potential solution in this to many things that are not documented in an international standard - and will help make changes for their specific users. But, should be separate.

<Rachael> +1 Ben, my experience too

<KimD> A note for future considerations. Can this be paired with the lower/nonconformant level? Could be interesting to discuss in the future.

Ben Tillyer: In my experience it is possible to score highly in a maturity model, while also providing a low-accessibility product. I don't think we should boost a product's score based on a maturity model.

<johnkirkwood> +1 to Ben’s concerns

<Chuck> Poll: Should the maturity model go out as a separate document? yes/no

<Chuck> yes

<david-macdonald> Yes

Sheri: One big motivator, though, is to ensure that a product can/will be expected to *remain* accessible

<jeanne> We have always talked about requiring bronze regardless of maturity model

<Rachael> yes

<kathyeng> yes

<Ben_> Yes

<Wilco> yes

<alastairc> yes

<jaunita_george> Yes

<Lauriat_> yes

<Makoto> yes

<AngelaAccessForAll> Yes

<johnkirkwood> yes

yes

<Francis_Storr> yes

<Detlev> y

<Jennie> yes

<JF> separate

<jeanne> no because it needs the exposure of WCAG3

<bruce_bailey> yes

<sajkaj> +1

<KimD> 0

Shawn: We could publish as a separate document and then later have WCAG 3 reference it

<JF> we've not discussed Wilco's proposal for a suite of documents, so...

<KimD> use "may"?

<jeanne> +1 to "may"

Resolution: The maturity model may be published as a separate document.

<jeanne> "we will pursue publishing Maturity Model as a separate document"

<Detlev> good night to all

<Rachael> Good night

<AWK> +1 to Jeanne's wording

<Lauriat_> +1

<JF> bye all

<Chuck> s /The maturity model may be published as a separate document./We will pursue publishing Maturity Model as a separate document./

<Francis_Storr> good work today. Thanks, all

<Rachael> Thank you all. That was a lot but made good progress.

<david-macdonald> bye all

<Makoto> sayonara

<kathyeng> thank you!
