APA & ARIA: The Future of Accessibility APIs

26 October 2021


aaronlev, becky, IrfanA, jamesn, jasonjgw, jcraig, Jemma, Joshue108, kirkwood, NeilS, PaulG, SamKanta, SteveNoble
Joshue108, Matthew_Atkinson

Meeting minutes



Pronunciation Spec Discussion

JS: Thanks for joining - let's look at this

Bridge differences from engineering perspectives.

Can Mark or Irfan kick this off?

So we can share perspectives etc?

Single vs multiple attributes..

<IrfanA> https://www.w3.org/TR/spoken-html/

PG: The goal is to create authoring capabilities in HTML

We have identified a gap in specs and APIs

This is augmentation of AX Tree

there are two candidates - one is a single attribute tbd

Also there is a multi attribute approach data-ssml currently

Currently tech based values
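
For reference, the two candidate syntaxes under discussion look roughly like this (a sketch based on the Spoken Presentation in HTML draft; exact attribute names and JSON shape are still subject to change):

```html
<!-- Single-attribute candidate: one data-ssml attribute carrying JSON -->
<span data-ssml='{"phoneme": {"ph": "pɪˈkɑːn", "alphabet": "ipa"}}'>pecan</span>

<!-- Multi-attribute candidate: one data-* attribute per SSML property -->
<span data-ssml-phoneme-ph="pɪˈkɑːn"
      data-ssml-phoneme-alphabet="ipa">pecan</span>
```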

Irf: We need to find a way to expose this

JS: The reason it is prefixed with data- is that this is how it is defined in HTML

Once we have an implementation, then we go to the HTML group, and ask for a reserved prefix

We are a way off that.

But we need to get a POC built, etc., and make sure it works.

Then we can get reserved prefix etc

<Zakim> bkardell_, you wanted to ask about tag review

BK: Is the spoken HTML idea reviewed by TAG?

Seems like a good idea.

JS: We did that last year - we heard from them: don't ask the parser to change

Our current approach is within the scope of current parsing capability

BK: I've seen two diff interpretations around the use of these attributes.

Where can we discuss?

JS: Happy to discuss now.

BK: has heard different interpretations of this

I think data attributes are fine

Some have expressed to me subtler interpretations that data-* is for something narrower; I'd like to see if I can get them to share that discussion.

Can we open an issue?

JS: Yes

It may be on the HTML spec - we are following their guidelines.

To drive consistent TTS output in various environments.

Matthew mentioned the approach in personalization, which is using data- to drive it over there.

To drive personalization

JS: You can't get to a W3C REC using data-

CR is as far as it will go.

That's the sandbox for data-

keep implementations to allay cross site concerns

JS: <gives overview of process and IP issues>

<And how to progress specs>

JS: Does that help?

BK: That's not new, but just sharing a counter-interpretation

This seems like a good use case to begin discussion

JS: Mentions other specs using this approach

Do we have a preference, is the crux here?

JS: Others ?

JS: Dave Tseng, how does that sound?

Is multiple attribute preferable?

Paul: Did mention an affinity for the multi attribute approach. There is no corollary for JSON as a value.

JS: That is one view from one AT , which is fine.

The difference for AT is around approach

The group is more interested in JSON, as it is a single target, selector - info is picked up, the AX can abstract that, augment and provide info

JS: The direct read group may be different - things that power our speech recognition devices.

GlenG: We do not have a problem parsing the HTML

Making fewer calls, from an AT angle, is good - especially if noisy.

JSON, is good so we can get it all at once

In a single attribute we can do that.

MK: Mentions read aloud tools

Text Help have a preference for single attribute

MS has immersive reader capabilities

How would they use pronunciation cues

CS: Jumping out of the A11y APIs seems like a big step.

How did that happen?

JS: We see use cases that provide benefit for AT that doesn't use the AX tree.

LW: From those who train and teach - while the JSON attribute is unfamiliar - adding more attributes could be confusing

We have spent years explaining around applied semantics, and the ground work of understanding the A11y tree

And it could be confusing for adoption with a different approach

SN: Pearson has an implementation of the single attribute approach - just to channel what Paul G says..

<discusses runtime performance>

Sniffing and selecting tons of attributes in the DOM will be worse than teasing out JSON via a processor

<Matthew_Atkinson> The GitHub SteveNoble mentioned is https://github.com/w3c/pronunciation/issues/86#issue-904400398

When a TTS player has to rip through the data the round trip is brutal

SN: Performance will be impacted in the read aloud environment, it's a concern.

PG: With ARIA live the A11y tree gets updated - there is a concern about dynamic content - when there is a special region, the browser has to catch up

cyns: aaronlev/David Tseng: any thoughts?

aaronlev: Concerns around how these things will be misused by authors (c.f. live regions). What is the ideal markup that we would want?

janina: The ideal would be to make SSML a native citizen of HTML. Concern around it not being possible to change validators (not necessarily parsers as mentioned above).

aaronlev: How about providing a separate SSML file?

janina/becky: Don't think that was considered.

aaronlev: Want to consider: what's the potential for misuse. Also: platform Accessibility APIs can often be extended to provide more information.

PaulG: Made a note about homonym attacks (in the document).

aaronlev: Concern around empowering authors to give the users a bad experience (c.f. ARIA can be misused in this way). Interested to hear Glen Gordon [who's on the line]'s thoughts on this. Examples such as inconsistencies inter-site or intra-site.

aaronlev: Using a single element could increase the risk that the AT can't present things consistently to users?

<Zakim> cyns, you wanted to ask about css pronunciation

cyns: Is there a relationship between this and the pronunciation functionality being proposed for CSS, or are they different use cases? Also concern around author misuse: remember when everyone made all the fonts small and light gray? Worried that a lot of things will get sped up.

<PaulG> We covered CSS Speech gap analysis here https://w3c.github.io/pronunciation/gap-analysis/

janina: I think the CSS work is orthogonal to what we're trying to do; we did a gap analysis [URL above] that may have more info.

<PaulG> section #3

<bkardell_> could we ++ tink in the queue before me?

<becky> sure will do that bkardell

janina: Possibility for misuse: anything that allows extra functionality could be mis-used. Need to be aware of it. But this allows us to do things such as mixing languages within a book (such as historical text).

<cyns> Can someone drop a link to the use cases in here?

janina: This also helps disambiguate wind/wind and tear/tear. Much opportunity for improvement. It's not a proposal that the entire document needs to be marked up for pronunciation. In most cases TTS engines will do reasonably well.

<PaulG> https://w3c.github.io/pronunciation/use-cases/

jamesn: Worried about over-use and mis-use. Not sure how we counter this. Yes, it is necessary in certain cases. Not a screen reader user, so unsure if this is a problem: company names. Can see people wanting to put this into their company name. Is it a problem if a company's site pronounces it correctly, but everywhere else on the web it's incorrect?

<Zakim> jamesn, you wanted to echo that authors WILL use things if they are available - just because they can

tink: To answer your question jamesn, I would find it useful to hear the canonical company name pronunciation. Can get too used to how the AT pronounces it.

tink: CSS Speech is catering for a specific set of use cases: it's trying to make the auditory experience less tedious.

tink: Yes it can be misused: HTML, ARIA, XML are all misused, but think we can mitigate against this, but not stop it.

tink: For now the CSS Speech media type isn't supported by UAs, sadly. But, different use cases.

bkardell_: The use cases are different, but problems are similar in that we need to affiliate nodes with values. Presumably wouldn't have one SSML document for the entire page, nor hundreds/thousands (would be very slow to load from network).

aaronlev: Was spitballing, though; we can load a hundred images for a document. Can we put element IDs in SSML documents? The idea is mainly to avoid adding noise to the markup of the document.

bkardell_: Whilst the single JSON attribute could be ugly, can also see the benefit of keeping the info together.

bkardell_: Brought this up in the MathML meeting as well, but we could polyfill something like this with existing technologies? Then authors wouldn't need to create the cumbersome JSON attributes.

<Zakim> PaulG, you wanted to comment "learning" from authors

aaronlev: This feels a bit like [CSS] background images; it's changing the presentation as opposed to the semantics?

PaulG: For linking, we did talk briefly about linking external resources (for the next stage of the spec). If SSML came to the document as a first-class citizen like SVG we would look into that.

PaulG: Performance: TTS uses a lot of heuristics to determine pronunciation. Reducing the need for heuristics may mitigate some performance hit.

PaulG: Voice assistants might start to learn correct pronunciations e.g. for company names from their official sites.

janina: e.g. Versailles is pronounced differently based on location.

<bkardell_> interesting point PaulG

mhakkinen: We have a lot of need for pronunciation in education (ref Pearson's work discussed before). We have looked for a standard solution, e.g. PLS, the Pronunciation Lexicon Specification [scribe note: PaulG mentioned this just above].

mhakkinen: We want screen readers, read-aloud tools, etc. to benefit. Another example: pharmaceutical products. And another: television/film/movie program guides (character names, actor names, etc.)

janina: We wanted to discuss this near-term problem but didn't intend to take the whole time for this; will summarize.

janina: Hearing from aaronlev that browsers aren't expected to be a blocker as to which approach is taken. Need more feedback from AT vendors. Is that a reasonable summary?

aaronlev: We _can_ implement anything; we still would need to look carefully at the proposal. We'd want good markup, good API support, good AT support; an end-to-end plan. Doesn't sound like all options have been looked at yet.

cyns: Have a similar view to aaronlev. The single-attribute approach feels counterintuitive for authors. It doesn't feel very HTML-like. Concerned about readability.

aaronlev: JSON can be hard to read.

cyns: In general, it is OK but as an attribute value it is hard to read.

mhakkinen: From an authoring tool perspective, authors don't necessarily need to see the output HTML. We have tools already that allow authors to provide pronunciation hints that are intuitive to use. We need a standard way for ATs and others to consume it.

tink: Is the idea with the single attribute that the JSON will be in the HTML code, or some external file that will be linked?

PaulG: Our current implementations/experimentations have the attribute value embedded in the HTML.

tink: How about an external file?

PaulG: We've had discussions about this before; have not yet found/developed method to do external linking.

tink: Providing common rules is very much like CSS and could be of benefit here.

PaulG: Agree; would be great to have first-class SSML support.

cyns: Concerns around readability; if it's an external file this is less so. Could this just use CSS?

<Zakim> jcraig, you wanted to point out that external file would violate the AT privacy guidelines from web platform design principles that Leonie helped author

bkardell_: There are efforts ongoing to allow authors to create CSS-like languages. (c.f. Houdini)

<bkardell_> but it isn't really AT specific, it would apply to many speech agents

jcraig: The web platform design principles mention the importance of making AT _not_ detectable. Would be good to have SSML in the document, but requesting an external file would be detectable.

<Zakim> cyns, you wanted to say that pronunciation could be used by other things besides AT

cyns: I think use cases for this extend beyond AT, so not sure this would be useful for fingerprinting. Don't want to end up with what looks like inline CSS.

<bkardell_> "hey <assistant> read this" is a thing I use all the time - those would be indistinguishable

janina: Referencing external files could be helpful to avoid repetition.

aaronlev: Not sure if proposed, but: for the use case where changing the name of a product/address/company, sounds like we could use a dictionary. Problem: every time that name/phrase/word is announced you'd have to wrap its markup.

PaulG: We discussed this. Some tags like prosody or voice can control an entire block. Others like pauses weren't there originally, so need an extra <span>, with single or multi-attribute.
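
To illustrate the point about pauses: a break has no text of its own, so it ends up attached to a wrapper element under either approach (a sketch following the draft's single-attribute syntax; the exact markup is still open):

```html
<!-- A pause inserted via an empty wrapper span -->
Take a deep breath <span data-ssml='{"break": {"time": "750ms"}}'></span> and begin.
```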

<jcraig> indistinguishable depends on many factors of entropy... client + accessed this other file + other factors might equal reasonable certainty of AT... FWIW, I think pronunciation rules are necessary. Just trying to point out the complications wrt that particular design principle

PaulG: The single attribute would, at first, encourage authors to summarize an entire block of text all at once, thus making it hard to update the pronunciation if the text changes.

PaulG: Would thus need help for developers to keep those in sync

PaulG: If everything is chopped up (multi-attribute), as a developer I think this would be easier, especially for hand-coding devs. Interested as to others' views.

jamesn: Replying to jcraig around detection. We _could_ require browsers to always fetch these files (is an additional complication, but could be managed).

jcraig: Absolutely agree that pronunciation rules need to be defined in some format; just wanted to raise the issue. Has Accessibility API design implications.

<bkardell_> embeddable as a css-like would work too and no extra fetch

janina: Seems everyone's agreed on the _need_ but we are still unsure as to single/multiple attributes, and there is the second-order question of external file.

Joanmarie: If ATs want a single attribute, but authors want multiple attributes, or vice-versa, the implementation could be to take all the single attributes and parse them all together.

Joanmarie: We should consider what's best for authors, as a result.

aaronlev: I feel there are many proposals that haven't been made yet, so should continue offline. But for the dictionary resource proposal: this could be something the AT fetches itself (circumventing the privacy issues; allowing caching across sites/domains).

aaronlev: Seems odd to me that we're going to be saying how to pronounce things, but only in one place.

<PaulG> "pronunciation" is only for phonemes. There are many more aural expressions from SSML that this spec would allow for.

aaronlev: ...where it'd be more useful if that was everywhere.

<Zakim> jcraig, you wanted to ask if l10n/i18n was discussed in this context earlier and to mention l10n both with languages as well as with TTS capabilities

jcraig: aaronlev: are you implying there's a need for a global registry?

aaronlev: Not sure, but worth looking into. Consistency is important.

jcraig: Has l10n and i18n been discussed? E.g. homonyms may be pronounced differently in different languages/locales. Also different TTS voices may be able to pronounce Spanish and English, but not Chinese. Has any of the rules discussion covered this?

janina: We have discussed those nuances and the need to disambiguate them. The problem is that defining what the correct pronunciation is will change (e.g. wind/wind).

[ scribe note: jcraig wished to raise having been delayed in joining, so missed some prior discussion ]

<jcraig> close/close is a more common homonym in UI in English

janina: Another example is English, but at different times in history, as pronunciations evolve.

PaulG: A dictionary would be limited to phonemes. We have an example that's wider than this [Vincent Price reading The Raven]; covering audio "performance".

PaulG: Devs are guided towards specifying the language of the document, and the TTS does the rest. But there is contextual info (such as location) that might impact accent, vernacular, local place names, and that's part of what we're aiming to provide.

PaulG: Voice packs being able to support different pronunciations is another issue that we would need to resolve as an industry, but isn't something we can solve in the spec. Some pre-reading, or meta tags, could be added to encourage assistants/AT to load specific voice packs/TTS capabilities to ensure a good experience for the user.

janina: Maybe the voice packs issue is a metadata issue.

janina: Want to revisit Joanmarie's suggestion, as that could give us a path forward. If authoring is easier in multi-attributes, as long as the UAs can expose what the ATs need, that could address this. We should explore this.

janina: My concern is if we were to have conflicting views across UAs. Joanmarie's suggested approach could help us address the UA-AT aspect. Does that sound good?

<jcraig> +1 to glen

<jamesn> +1 to glen

Glen: I don't think authors will unanimously agree on whether single-, or multi-attribute approach is easier.

jcraig: +1; depends on tooling

janina: I think we have to presume tooling.

cyns: Still thinking readability is important.

bkardell_: There should be some experimentation, particularly with a CSS-like solution. There was discussion of polyfills in last year's TPAC? How are they getting the SSML to the AT?

SteveNoble: Authoring: as mhakkinen said, the people authoring this stuff every day are using authoring tools.

+1 to general philosophical view that readability is important, though I am not an implementation expert in this field!

SteveNoble: [demonstrates some content that has been marked up for pronunciation]
… The authoring tool identifies this as "alternate text for text-to-speech" that allows users to highlight a word, e.g. melancholy, and provide alternate text, e.g. melancollie that the system turns into SSML.
… There are also tables of words and "how to spell these phonetically in the system"
… e.g. dinosaur may be expressed as dinosore
… Authors may be creating 1,000 SSML fragments per week (though they don't know it as SSML) to correct the way that the TTS pronounces things.
… [Compares this to creating MathML with a WYSIWYG editor]
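
The substitutions SteveNoble describes map naturally onto SSML's sub element; in the attribute model under discussion that might look something like this (illustrative only; the respellings are the ones from the demo):

```html
<!-- Author-supplied phonetic respelling, analogous to SSML <sub alias="..."> -->
<span data-ssml='{"sub": {"alias": "melancollie"}}'>melancholy</span>
<span data-ssml='{"sub": {"alias": "dinosore"}}'>dinosaur</span>
```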

mhakkinen: To echo what SteveNoble said, for classroom materials, many states have specific guidelines on pronunciation and we've had to spend time tweaking text so that it'll be pronounced with the right sort of pronunciation or pausing.

mhakkinen: We've tried to do this without authors having to learn SSML.

mhakkinen: We've had to create hacks to support whatever sorts of AT/read-aloud we are using at delivery time. We have not necessarily been able to get this into screen readers.

mhakkinen: E.g. if we altered the way the screen reader pronounced things, this could really confuse Braille users.

mhakkinen: We don't think this is a challenge for authors with the correct tooling.

<Zakim> jcraig, you wanted to demo something similar from the iOS VoiceOver settings

jcraig: [Demonstrates VoiceOver Settings > Speech > Pronunciation for some of our names]

jcraig: You can speak how you want the term to be pronounced, and the device will interpret this and offer options (that it reads back) from which you can choose.

jcraig: Users can do this across the system. Perhaps this could be exposed through WebKit. I don't have a strong preference for whether that's via an attribute, or in a dictionary defined in a page script block or external resource.

<Zakim> cyns, you wanted to say that WYSIWYG editors don't address my concerns about human readability of markup. You shouldn't have to use a special tool to write or read markup

cyns: Special tools shouldn't be needed to make markup readable.

cyns: Have you looked at using a polyfill to pull all of the info into the Accessibility API's description field?

<bkardell_> thanks cyns that was the q I was asking too - your articulation was better

mhakkinen: We've not done anything specifically for screen readers; our use cases are wider (e.g. read aloud tools).

mhakkinen: We prefer a standards-based approach.

<PaulG> The authoring standard also allows for scenarios like kiosks where an individual's AT/voice assistance solution may not be integrated with the content

SteveNoble: Our support internally is for our own TTS system and read and write extension. TextHelp is another vendor that supports SSML (single-attribute).

<Zakim> jamesn, you wanted to say that to me this looks like the kind of overuse I fear

mhakkinen: Some years ago we prototyped a custom element that allows you to specify pronunciation and a Braille label, but this didn't solve the problem of getting the screen reader to direct content specifically to TTS vs Braille.

jamesn: Can see the publishing approach SteveNoble demo'd working when you have control over the TTS, but this standards approach is much more general. This doesn't seem like an appropriate use case for the wider web.

jasonjgw: Trying to maximize author convenience and ACKing that this will differ across authors. The ability to define information globally and at the individual text element level seems to have got agreement. There's some flexibility on the UA side as to how it's represented in the markup, and it seems possible to tailor the delivery via the Accessibility API for ATs in a way that will maximize efficiency there.

jasonjgw: This has some parallels to the work NeilS demo'd at TPAC last week about how to provide disambiguating information on MathML. They're considering the same problems wrt how to specify the markup side and the delivery side. We should aim to produce a similar approach in both cases.

<Zakim> jcraig, you wanted to agree with ETS comments that any polyfill implementable today may help speech users, but would be harmful to braille users. the standards approach takes longer but is the right path.

jasonjgw: That might help the discussion along. There were broader issues raised in the agenda, but these seem to have specific parallels across the work of different groups.

<bkardell_> I agree it is very hard for me to not see the interrelationship here -- they might not be the exact same thing, but they certainly seem to have some intersection of concerns

jcraig: Wanted to +1 the ETS comments: bending existing pronunciation rules in specific contexts would be harmful to Braille in the general context.

jcraig: The standards approach is the right approach for our use-cases; don't see a polyfill approach working.

NeilS: Our (MathML)'s question was: if we're polyfilling, what's the target? Can't use aria-label as would negatively affect Braille. There is no target in the Accessibility API. Seems we have to add something.

<Zakim> jcraig, you wanted to respond

jcraig: +1; the MathJax polyfill is a good example as it degrades the user experience when the platform has wider features (such as conversion to Nemeth Braille, which is bypassed by the polyfill).

becky: Does anyone want to provide a summary, or next steps?

bkardell_: The single-attribute version is not pretty, but if we could figure out how to plumb that down so that it could be used for polyfills/idea experimentation, we could always sugar on top of it (e.g. like with CSS, you can use inline style attributes, but normal humans authoring HTML wouldn't—we could have a similar abstraction).

<Zakim> cyns, you wanted to ask if we can have the broader api discussion in the second session

cyns: Could we have the broader AAPI discussion in the second session?

mhakkinen: Helpful discussion; lots for the TF to consider.

jcraig: Is there a subset of the Spoken Presentation in HTML draft that you'd recommend (some things such as prosody have been mentioned as out of scope)? In order to get this on an implementation schedule, suggest cutting it down and agreeing on any non-controversial aspects, as could then get implementations behind runtime flags.

janina: One thing holding us back is to decide on the either/or with respect to attributes.

jcraig: Could include a dictionary inside a <script> tag in the page, and only resolve the attribute issue later when you need to provide pronunciation on specific elements.
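
jcraig's page-level dictionary is only a suggestion at this point, but it might be sketched like this (the script type and JSON shape here are invented for illustration and are not part of any spec):

```html
<script type="application/pronunciation+json">
{
  "pecan": {"phoneme": {"ph": "pɪˈkɑːn", "alphabet": "ipa"}},
  "GIF":   {"sub": {"alias": "jif"}}
}
</script>
```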

becky: janina: thanks everyone!

(general agreement on productive discussion from everyone)

<becky> matthew are you good posting the minutes to apa?

Minutes manually created (not a transcript), formatted by scribe.perl version 136 (Thu May 27 13:50:24 2021 UTC).



Maybe present: BK, bkardell_, CS, cyns, Glen, GlenG, Irf, janina, janina/becky, Joanmarie, JS, LW, Matthew_Atkinson, mhakkinen, MK, Paul, PG, SN, tink