W3C

- DRAFT -

Silver Task Force & Community Group

02 Jun 2020

Attendees

Present
jeanne, Chuck, MichaelC, Rachael, ChrisLoiselle, CharlesHall, JF, Lauriat, Fazio, OmarBonilla, Joshue108, KimD, Makoto, sajkaj, Jan
Regrets
Chair
Shawn, jeanne
Scribe
ChrisLoiselle

Contents


<scribe> Scribe:ChrisLoiselle

Jake's work on conformance

Jeanne: Let us review Jake's presentation

<Chuck> +1 to both

Jake: Should I share my screen or just talk to the links?

Jeanne: You can share your screen.

<JakeAbma> https://docs.google.com/drawings/d/1nRhQ2DWCGe8imhZgV8972IWSRSwCtmKaOnZOB6qm3gk/edit

Jake: Shares his screen. Talks to the diagram link in IRC for the Benchmark Scoring Experiment. It gathers work from the past 10 years or so. It is a blend of technical pass/fail tests; then there is general benchmark testing.

The left side of the chart is user needs, the right side is user satisfaction, and the middle is how we get there.

The four-step process is: 1) Technical, 2) A11y-in-use, 3) Benchmark, 4) Total Score.

<JF> element level testing?

Technical and A11y-in-use are done at the same time, and then you iterate on the next element. The benchmarks are based on standardized questions: a fixed set of questions applied to each functional outcome / guideline.

There is flexibility in scope: it can be the full site or a full page. It can also be a feature or widget. You can also just do steps 1 and 2, without 3, e.g. for the building blocks of a content management system.

How do we make sure that functional outcomes are part of this? They would be part of the tests.

In step 2, the A11y-in-use test, you'd make sure it is valid and normalized. The conditions would be standardized, with a set of answers for how you judge your questions.
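As a rough illustration of the per-element flow described above, here is a minimal sketch, assuming a per-element record that carries the step-1 technical result and the step-2 a11y-in-use rating and keeps the two subtotals separate. The names and numbers are invented placeholders, not data from the experiment.

    from dataclasses import dataclass
    from typing import List, Tuple

    @dataclass
    class ElementResult:
        name: str              # e.g. "login form heading" (hypothetical)
        technical_pass: bool   # step 1: pass/fail against the technical test
        in_use_rating: int     # step 2: 1 (easy) .. 5 (difficult)

    def subtotals(results: List[ElementResult]) -> Tuple[float, float]:
        """Return the step-1 pass rate and the step-2 average rating separately."""
        total = len(results)
        pass_rate = sum(r.technical_pass for r in results) / total
        avg_in_use = sum(r.in_use_rating for r in results) / total
        return pass_rate, avg_in_use

    page = [
        ElementResult("page title", True, 1),
        ElementResult("login form heading", True, 2),
        ElementResult("footer heading", False, 4),
    ]
    print(subtotals(page))  # (0.666..., 2.333...)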

JF: On step 4, the total score, the subtotals are added up: subtotal technical, subtotal a11y-in-use, and subtotal benchmark. Are time and frequency reviewed and included in the total score?

How do those integrate?

Jake: You don't necessarily add the scores up; it may not need to be a combined total. Jake uses the example of a car's gas and oil gauges: they measure different things, but together they help the user know where they stand on the car's health.

<Fazio> This is why Silver needs a maturity model

JF: Consistent scoring would be the end goal for testing, so that many people can replicate the score by following the examples presented.


JF: A score vs. a collection of scores (gold, silver, bronze)?

Jake: It's a choice we need to make: do we want to add them up or keep them separate? Multiple scores or one score, we need to agree on what we want.

Jeanne: Let us save the conversation on whether we want separate currencies or a total.

<Fazio> I would think we want separate and total

<Zakim> jeanne, you wanted to ask Jake to give the sample benchmark questions

<JakeAbma> https://docs.google.com/document/d/1mbdhGQ_Zh5OCRpsAasDRTJOQfuHweicy430IFMTnsKM/edit#heading=h.jsvl7ipdtjjj

Jake: Shares the Benchmark Experiment 1 - Silver document.

Talks to general benchmark questions and severity.

There are usually between 1 and 6 questions, and you apply those to a task.

Rating of severity: 1 = easy, 5 = difficult.

<Jan> It's a Likert scale

Task questions: strongly disagree = 1, strongly agree = 5.

<Lauriat> +1 to Jan, it did seem nicely familiar.

A standard set of easy questions. The people who judge these would be auditors, developers, or user testers onsite/offsite. We'd like to standardize the questions and conditions; that creates the baseline.
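As an illustration of how the standardized 1-5 answers might be rolled up into a benchmark subtotal for one task, here is a small sketch. The question wording, the scores, and the simple averaging rule are assumptions, not the group's agreed method.

    from statistics import mean

    # Each benchmark question maps to the 1-5 Likert answers given by the
    # people judging the task (auditors, developers, or user testers).
    # Question texts and scores are invented placeholders.
    answers = {
        "I could complete the task without assistance": [4, 5, 3],
        "The headings helped me find the login form":   [2, 3, 2],
    }

    per_question = {q: mean(scores) for q, scores in answers.items()}
    task_benchmark = mean(per_question.values())  # benchmark subtotal for this task
    print(per_question, round(task_benchmark, 2))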

Jake: We will fill in the data within the next week or so for a real example.

<JakeAbma> https://docs.google.com/spreadsheets/d/1iCJfyMtcsSq7GHmwnc4aTNguadRfGDa0H8FBZMaJpcQ/edit#gid=1825600441

Jake shares the Benchmarking login - top task spreadsheet.

Talks to typical elements on a page (search, log in) and what results we'd end up with for the framework and testing.

We'd review the technical analysis of the headings guideline and its functional outcomes: 19 instances (per JF's example). It's not about the numbers right now, just the system I'm showcasing.

There are columns in the technical table for instances, fails, and average. This leads to step 2 ("there is a heading defined"), which is a11y-in-use and uses the easy-to-difficult scoring. We'd need to look at functional outcomes and the relationships between guidelines.

We'd use step 2 to evaluate the headings against the headings questions, judging each heading on its own.
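For illustration, a sketch of how the "instances / fails / average" columns could be computed for the 19-heading example. The individual results are random placeholders; only the column idea comes from the spreadsheet being shown.

    import random

    random.seed(0)
    # 19 heading instances: a step-1 pass/fail plus a step-2 rating
    # (1 = easy .. 5 = difficult). Values are invented for the sketch.
    instances = [(random.random() > 0.2, random.randint(1, 5)) for _ in range(19)]

    count = len(instances)                                        # "instances" column
    fails = sum(1 for ok, _ in instances if not ok)               # "fails" column
    avg_rating = sum(rating for _, rating in instances) / count   # "average" column
    print(count, fails, round(avg_rating, 2))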

JF: 19 instances were provided in the example in step 1, so in step 2, we'd evaluate 19 headings?

Jake: Yes.
... Sorry I missed the summary of what you were saying, can you provide it in IRC?

<jeanne> +1 JF about the problem of the questions

<Chuck> JF: There are 2 aspects of headings: one is semantic markup, one is understanding/accuracy.

<Chuck> JF: A screen reader may be able to identify a heading, but can a sighted user understand the visual presentation?

Jake: A severe problem would cause issues. A total severity of 100% would mean unacceptable, referencing the total score summary table.
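A hypothetical mapping from total severity to an acceptability label, for illustration only. The single point taken from the discussion is that 100% severity means unacceptable; the other thresholds and labels are assumed.

    def acceptability(severity_pct: float) -> str:
        # 100% severity is "unacceptable" per the discussion; the lower
        # bands are placeholder thresholds, not agreed values.
        if severity_pct >= 100:
            return "unacceptable"
        if severity_pct >= 50:
            return "poor"
        if severity_pct >= 25:
            return "fair"
        return "good"

    print(acceptability(100), acceptability(60), acceptability(10))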

Chuck: Assume I'm an author and I score it high. Then an auditing company reviews the same page; e.g. the markup is not done correctly, so the heading is not graded the same as a result. What do we do about scoring and uniformity?

Jake: It must be consistent, reliable and valid. I agree.
... talks to chart 1 and the standardized questions and conditions on how to judge the questions. It would be the next version of the adjectival rating.

<JakeAbma> https://docs.google.com/spreadsheets/d/1iCJfyMtcsSq7GHmwnc4aTNguadRfGDa0H8FBZMaJpcQ/edit#gid=633158340

Adjectival rating tab on the benchmark scoring experiment spreadsheet.

<bruce_bailey> to be clear, i just did a copy/paste from what was the current version of the Headings Understanding doc

Jake: Chuck, you'd need to know the guidelines, good and bad; that is the background of the baseline. If a content creator judges their score a certain way and JF judges differently, we'd need to point to a document that standardizes unacceptable or acceptable scores.

CharlesH: How do we review, for example, a footer that is unrelated to the log-in page but has a heading on the page? Or a log-in form that may have headings? I.e., narrow vs. large scope?

Jake: It depends on how we set up the conformance claim approach.
... You can claim conformance on a certain page or a certain widget. You could do a technical check. The optional questions for headings can apply or not (i.e., are you doing it for a certain area of the page or the whole page)?

<Lauriat> I really like this flexibility paired with transparency.

Jake: It is flexible in the variety of steps you can do and in how the conformance is scoped.

<JF> +1 to Charles. Regulators aren't interested in accessibility claims on a login page - they are interested in the *Site's* accessibility overall

CharlesH: The challenge is on user testing and how conformance would be scoped. What is the conformance of the site?

<Lauriat> Conformance has more uses than just regulation that we need to support, though?

<jeanne> Jeanne notes that this is not requiring butt-in-seats testing. But it could easily include user testing with people with disabilities.

Jake: Top tasks would then expand to the whole page, if the page is scoped, or to the site, if the site is scoped to conform. Testing with real users would use the same questions, all standardized.

<Fazio> Those have to come from real users

Jake: The benchmark task questions would have answers everyone could give. Each step would have different scoring. It would probably take 5 to 10 percent more testing time, but you get a lot more results and insight into where the problems are.

<Fazio> Benchmark questions I mean need to be answered by PWD's

JF: On the needs of regulators vs. the benchmark scoring, we have another need, which is ongoing conformance. Granular vs. non-granular: is the site accessible?

Jake: The automated check is not the concern, to a degree. Accessibility in use: how is the footer accessible? How is the log-in accessible? How is it not?

DavidF: Real users with real disabilities would be needed with task questions for step 3.

Jake: Clean and simple presentation vs. a complicated process: you'd score the latter with a 4 or 5, for difficult. Real user testing would provide better data for this, yes.

DavidF: Would scoring by principles, vs. steps 1, 2, and 3, be better than the granular approach? Looking toward equality around the world.

Jake: For the POUR principles, we can add that in, on how to test a specific guideline, but we wanted to step away from that to flatten it.
... The usability part and the end score for a top task from benchmarking: how severe is it?

Summary of Action Items

Summary of Resolutions

[End of minutes]

Minutes manually created (not a transcript), formatted by David Booth's scribe.perl version (CVS log)
$Date: 2020/06/02 14:31:14 $
