W3C

– DRAFT –
Next Directions for Voice and the Web Breakout

18 October 2021

Attendees

Present
Ben_Tamblyn, Ben_Tillyer, Bev_Corwin, Debbie_Dahl, Elika_Etemad, Francis_Storr, James_Nurthen, Jennie_Delisi, John_Kirkwood, Karen_Myers, Kim_Patch, Kaz_Ashimura__W3C, Makoto_Murata__DAISY, Masakazu_Kitahara, Sam_Kanta, Takio_Yamaoka__Yahoo_Japan, Tomoaki_Mizushima__IRI
Regrets
-
Chair
Kaz, Debbie
Scribe
fantasai

Meeting minutes

Presentation

<BC> Hello

kaz: Thanks for joining this breakout session

kaz: This is a breakout session on new directions for Voice and Web

kaz: There was a breakout panel during AC meeting

kaz: discussion about how to improve web speech capabilities in general

kaz: There were several breakout sessions previously (previous TPAC??)

kaz: We want to summarize situation and figure out how to improve

<kaz> slides

kaz: First, reviewing existing standards and requirements for voice and web

kaz: Then would like to look into the issue of interop among voice agents

kaz: Then think about potential voice workshop

kaz: If you have any questions please raise your hand on Zoom chat, or type q+ on IRC

[slide 2]

kaz: Existing mechanisms for speech interfaces

kaz: We used to have markup languages like VoiceXML and SSML

kaz: There was also the CSS Speech module

kaz: And Web Speech API

kaz: Lastly there's the specification for Spoken Presentation in HTML, currently a Working Draft

kaz: The most popular one is the Web Speech API, but it is not a W3C Recommendation, only a Community Group report

kaz: so that's a question
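[Editorial aside: a minimal, non-normative sketch of the Web Speech API mentioned above, assuming a browser that exposes speechSynthesis and a (possibly vendor-prefixed) SpeechRecognition constructor; the phrases and settings are illustrative only.]

    // Speech synthesis: have the browser read a sentence aloud.
    const utterance = new SpeechSynthesisUtterance("Welcome to the breakout session.");
    utterance.lang = "en-US";
    utterance.rate = 1.0;   // speaking rate
    utterance.pitch = 1.0;  // voice pitch
    window.speechSynthesis.speak(utterance);

    // Speech recognition: the constructor is still prefixed in some browsers.
    const Recognition = window.SpeechRecognition || window.webkitSpeechRecognition;
    if (Recognition) {
      const recognizer = new Recognition();
      recognizer.lang = "en-US";
      recognizer.interimResults = false;
      recognizer.onresult = (event) => {
        // Log the top transcript of the first result.
        console.log("Heard:", event.results[0][0].transcript);
      };
      recognizer.start();
    }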

[slide 3]

kaz: Voice agents are getting more and more popular, and very useful

kaz: Need improved voice agents

[slide 4]

kaz: Interoperability of voice agents

kaz: Voice agents can be local or on the cloud side

kaz: most are proprietary, and not based on actual standards

kaz: The Web Speech API is very convenient but not a standard yet

kaz: Desktop and mobile apps, various implementations

kaz: how can we get them to interoperate with each other?

kaz: Do we need some standards-based infrastructure?

kaz: Voice Interaction CG chaired by ddahl has been working on interop issues

kaz: will meet next week during TPAC

<kaz> Voice Interaction CG

[slide 5]

ddahl: Our CG has been working on voice and web, focusing on interop among intelligent personal assistants right now

ddahl: We've noticed that these assistants (like Siri, Cortana, Alexa, etc.)

ddahl: they really have a lot in common in terms of what they are useful for

ddahl: Like a web page, their goal is to help users find information, learn things, be entertained, and also provide intelligent personal assistance

ddahl: They communicate with servers on the internet, which contribute functionality in service of their goals

ddahl: The two types of interaction are different because a web page is primarily a graphical UI and an IPA is primarily voice interaction

ddahl: But there are some arbitrary differences also

ddahl: web page rendered in browser; IPA in a proprietary platform

ddahl: but that's an arbitrary architectural difference that devs of IPAs have chosen to use

ddahl: web pages run in any browser

ddahl: but IPAs only run on their own platform

ddahl: If you have an Amazon function, it can't run on the Web and it can't run on your phone

ddahl: it runs only on its own proprietary smart speaker

ddahl: similarly, web pages are found through the very familiar URL mechanism or a search engine

ddahl: IPA is found through its proprietary platform, however that platform chooses to make it available

ddahl: So finding functionality is purely proprietary

[slide 6]

[slide depicts a diagram of the IPA architecture]

ddahl: Focus on the three major boxes

ddahl: The first box is the data capture part of the functionality

ddahl: In the case of an IPA, most typically we want to capture speech

ddahl: compared to web page, we're capturing user input

ddahl: The function in the middle basically does the intelligent parts of the processing

ddahl: This is analogous to a browser

ddahl: On the right we have connection to other functionalities

ddahl: other IPAs or other web sites

ddahl: Found through a search engine, DNS, or a combination

ddahl: In the rightmost part of this box we find other functionalities

ddahl: e.g. the websites themselves, in the case of an IPA some other IPA

ddahl: For example, looking for a shopping site

ddahl: we want to find it interoperably from the UI

ddahl: That's architecture that we're looking at

ddahl: seems parallel to Web

ddahl: we'd like to be able to make those alignments possible

ddahl: and use as much of the existing Web infrastructure as possible for IPAs to be interoperable

[slide 7]

kaz: There are many issues emerging these days

kaz: So we'd like to organize a dedicated W3C workshop to summarize the current situation, the pain points, and discuss how we could solve and improve the situation

kaz: by providing, e.g., a forum for joint discussion by related stakeholders

kaz: I've created a dedicated GH issue in the strategy repo

https://github.com/w3c/strategy/issues/221

kaz: Please join the workshop and give your thoughts, pain points, solutions

Discussion

kaz: Any questions, comments?

kaz: Murata-san, you were very interested in a11y in general and also interaction of ruby and speech

kaz: interested in this workshop?

MURATA_: Yes, I'm interested, and wondering: what are the obstacles to the existing specifications?

MURATA_: Why are they not widely used?

kaz: There are various approaches to this

kaz: e.g. markup-based approach like VoiceXML/SSML

kaz: and CSS-based approach

kaz: and JS-based approach

kaz: So we should think about how to integrate all these mechanisms into common speech platform

kaz: and have content authors and applications able to use various features for controlling speech freely and nicely

kaz: that kind of integration should be one discussion point for the workshop as well

kaz: You have been working on text information. Part of this, pronunciation specification, should also be included

MURATA_: yes

kaz: any other questions/comments/opinions/ideas?

MURATA_: Let me report one thing about EPUB

MURATA_: EPUB3 has included SSML and PLS

MURATA_: But now EPUB3 is heading for Recommendation

MURATA_: and some in WG don't want to include features that are not widely implemented

MURATA_: so WG decided to move SSML and PLS to a separate note, which is maintained by the EPUB WG

MURATA_: But that spec is detached from mainstream EPUB

MURATA_: Not intended to be a Recommendation in the near future

MURATA_: On the other hand, I know some Japanese companies use SSML and PLS

MURATA_: One company uses PLS and a few use SSML

MURATA_: In particular, the biggest textbook publisher in Japan uses SSML

MURATA_: And I hear the cost of an ebook is 3-4 times higher if you try to really incorporate SSML and make everything sound natural

MURATA_: For textbooks, wrong pronunciation is very problematic, especially for new language learners

MURATA_: It is therefore worth the cost for these cases

MURATA_: But it is not cost-effective for broader materials

MURATA_: So SSML-based approach can't scale

MURATA_: But I'm more optimistic about PLS

MURATA_: In Japanese manga and novels, character names are unreadable

MURATA_: If you use PLS, you only have to describe each name once

MURATA_: Dragon Slayer is very common, but doesn't read well using text to speech

MURATA_: I'm hoping that PLS would make things better
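[Editorial aside: a rough illustration of the trade-off Murata-san describes, using the hypothetical character name "Aldaran". With SSML the pronunciation has to be marked up inline wherever the name appears, which is labor-intensive; with PLS it is declared once in a lexicon that the synthesizer applies to every occurrence.]

    <!-- SSML 1.1: pronunciation marked up inline at each occurrence -->
    <speak version="1.1" xmlns="http://www.w3.org/2001/10/synthesis" xml:lang="en">
      <phoneme alphabet="ipa" ph="ˈɑːldəˌrɑːn">Aldaran</phoneme> drew his sword.
    </speak>

    <!-- PLS 1.0: pronunciation declared once and reused everywhere -->
    <lexicon version="1.0"
             xmlns="http://www.w3.org/2005/01/pronunciation-lexicon"
             alphabet="ipa" xml:lang="en">
      <lexeme>
        <grapheme>Aldaran</grapheme>
        <phoneme>ˈɑːldəˌrɑːn</phoneme>
      </lexeme>
    </lexicon>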

kaz: As former Team contact for Voice group, I love SSML 1.1 and PLS 1.0

kaz: I would like to see the potential for improving those specifications further

kaz: Also, there's possibility that we might want an even newer mechanism to achieve the requirements

kaz: For example, during the AC meeting Léonie mentioned it may be a good time to restart speech work in W3C

kaz: Personally I would like to say Yes!

kaz: So I think a workshop would be a good starting point for that direction

kaz: Any other viewpoints?

ddahl: I want to say something about why these things were not implemented in browsers

ddahl: Since those early specifications, technology has gotten much stronger

ddahl: previously, speech recognition did not work well

ddahl: now text to speech works much better also

ddahl: So I think much of this was marginalized; it didn't work well, so people wouldn't use it

ddahl: It was considered something that wouldn't have anything to do with the Web

ddahl: but now the tech is far better than it was at the time

ddahl: It really does make sense to look at how it is used in the browser

<kaz> fantasai: CSS and PLS seem to be very different

<kaz> ... CSS is about styling

<kaz> ... not closely tied with each other

<kaz> ... you definitely can't have only the CSS speech module, but you could use it to extend what exists

<kaz> ... cue sound, pauses, etc.

<kaz> ... shifting volume, etc.

<kaz> ... can't change spoken pronunciation itself

<kaz> ... you said maybe we need new technology

<kaz> ... what is missing that we need to create technology for?

kaz: I was thinking about how to integrate various modalities

kaz: that are not interoperable currently

kaz: also how to implement dialog processing for interactive services

kaz: and possible integration with IoT services

kaz: like in 2001: A Space Odyssey, using voice as a key for opening the door

kaz: maybe because I'm working on WoT and Smart Cities as well

kaz: my dream is to apply voice technology as part of user interfaces for IoT and smart cities

kim: I have a lot of opinions on what's needed. I've used voice interfaces for 20+ years

kim: I had to work totally hands-free for 3 years

kim: Now I also use a Wacom tablet

kim: Speech is not really well integrated with other forms of input

kim: If speech was well implemented, many people would use a little bit. A few people would use for everything.

kim: There's so much that is not there

kim: You were talking about it being siloed, and that's one of the problems

kim: for example, when you have keyboard shortcuts

kim: Sometimes you can change it, and that's great

kim: But you can only link them to letters now. It would be great to integrate with speech

kim: Instead of thinking of speech as just another input method, how do you put it alongside other inputs?

kim: It should be something with good defaults and works alongside everything else

kim: Getting there more with Siri etc.

kim: If you say "search the web for green apples" it's faster than typing

<Jennie> +1 to Kim Patch - would also see a need for sounds/vocal melodies. Some cannot articulate clear words but can make a melody.

kim: but big gaps, I think because of the underlying technology

kim: But I think speech has a ton of potential

kim: I can show some of it using custom stuff

kim: that really has not been realized

kim: But it's also used some places where it shouldn't be used

kim: Send is a really bad one-word speech command!

kim: I see a lot of stuff being implemented that is not well thought through

kim: It's too bad that more of us don't use a little bit of speech

<Jennie> * kaz sure

kim: There are also some problems, e.g. needing to have a good microphone

kim: Engines are getting better, but you have to make sure it didn't record something totally off the wall

takio: Thanks for presentation today

takio: I'm new around here, not sure about this specification

takio: but I'm concerned about emotional things (?)

takio: e.g. if ...

takio: If someone is laughing or angry, this may be dropped

takio: So I'm concerned about these specifications, if they take care of emotional expression

takio: Also asking about intermediate formats

takio: e.g. ...

takio: e.g. emotional info is important for that person

kaz: For example, some telecom companies or research companies have been working on extracting emotion info from speech

kaz: and trying to deal with that information once we've extracted some of it

<kaz> EmotionML

kaz: There is a dedicated specification to describe emotional information, named EmotionML

kaz: As Debbie also mentioned, speech tech has improved a lot over the last 10 years

<Zakim> ddahl, you wanted to talk about chatbots on websites

kaz: We might want to also rethink EmotionML

ddahl: Something I've been noticing about websites recently:

ddahl: complex websites especially tend to have a chatbot

ddahl: Seems like a failure of the website, that users can't find the information they're looking for

ddahl: so they add a chatbot to help find information quickly

ddahl: A very interesting characteristic of voice is that it is semantic

ddahl: It doesn't require the same kind of navigation that you need in a complex website

ddahl: theoretically you ask for what you want and you go there

ddahl: chatbots are normally not voice-enabled, but they are natural-language enabled

ddahl: and that's an area where we can have some synergy between traditional websites and voice interaction

kaz: That's a good use case

kaz: Reminds me of my recent TV

kaz: It has great capabilities, but there are so many menus

kaz: I'm not really sure how to use all these given the complicated menus

kaz: but it has speech recognition, so I can simply talk to that TV

kaz: "I'd like to watch Dragon Slayer"

ddahl: That's an amazing use case, because traditionally TVs and DVRs were held up as examples of poor user interfaces

ddahl: Too difficult to even set the time, without lots of struggle

ddahl: So need to think about how to cut through layers of menus and navigation with voice and natural language

kaz: These days even TV devices use web interface for their UI

kaz: TV menu is a kind of web application

kaz: that implies a speech interface is a good solution

Jennie: I thought Kim's point about redirecting keyboard shortcuts was excellent

Jennie: I can see a use case for people who use speech but have limited use of vocalization

Jennie: If there was a way to program a melodic phrase to use instead of a keyboard shortcut

Jennie: similar to physical gesture on mobile device

Jennie: Would be helpful for people who are limited, to control devices

Jennie: using a shortcut or a shorthand melodic phrase

Jennie: for people who are hospitalized or have limited mobility

Kim: In early days ...

Kim: But one thing that worked really well was blowing to close the window

Kim: 5-6 years ago someone was experimenting with that in an engine

Kim: I think it would work well both for folks who have difficulty vocalizing, and would be neat for other people as well

Kim: but would have to be easy to do

Jennie: Needs to be easy to do, but would be interesting to adapt

Kim: 10 years ago I was working with people who are gesture specialists, and trying to get a grant for combined speech + gesture

<Jennie> +1 to Kim P!

Kim: A couple of gestures, a couple of sounds, would add a lot to many use cases

Kim: True mixed input

ddahl: That was an interesting point about gestures; it reminded me of the recently published requirements for natural language interfaces

ddahl: They mentioned sign language interpretation in natural language interfaces

ddahl: that is obviously gesture based

ddahl: that's still in the research world

ddahl: but thinking about gesture-based input

ddahl: could be personal gestures

ddahl: or formal language gestures, like sign language

ddahl: but that would help a lot of people

Kim: With mixed input, you can use multiple inputs at the same time that don't have to be aware of each other

Kim: When pointing, computer knows where you're pointing

Kim: ...

Kim: Computer doesn't have to be aware of this

<kaz> Natural Language Interface Accessibility User Requirements

Jennie: One of the other questions I had, since I'm not as familiar with the specs

Jennie: for touchscreen devices and computers

Jennie: we have ways to control for tremors or repeated actions to choose the right one to respond to

Jennie: Do we have any consideration for that in voice, e.g. stuttering, to control which sounds the voice assistant would listen to?

<Ben> Afraid I don't, sorry!

ddahl: I don't know of anything like that. Would be very useful

ddahl: Probably some research, especially for stuttering, because it's a very common problem

ddahl: but still in the research world right now

Kim: In the days of Dragon Dictate, you had to pause between words

Kim: For people who had serious speech problems, this worked well

Kim: and so they stuck with it even as speech input became more natural and looked for phrases

Kim: Speech seems remarkably good at understanding people with a lot of halting, almost better than accents

Kim: I've been surprised how well it deals with stutters

kaz: So probably during the workshop we should cover those cases as well, and what the actual pain points are

Kim: Something else to think about

Kim: There's a time for natural language

Kim: And there's a time where it's a lot more useful to have good default set of commands, one way to say something (maybe a few) and let the user change anything they want

Kim: Dragon made a mistake, I think, in giving 24 different ways to say "go to the end of the line"

<Ben> This is a link to a research paper titled "A DATASET FOR STUTTERING EVENT DETECTION FROM PODCASTS WITH PEOPLE WHO STUTTER". It might be useful reading material on the subject

Kim: If you have good defaults, it's much easier to teach someone

Kim: I think it's really important to think when natural language is better UX and when good default set of commands that can be learned easily and have structure is good

Kim: The type of interaction, and what fits, has to be considered

Wrap-up

ddahl: Should we try to list topics for the workshop?

<Jennie> *Thanks for sharing that study Ben

kaz: yes that's a good idea

kaz: Starting with existing standards within W3C first

kaz: Specifications including natural language interface requirements, recent work as well

ddahl: Some technologies haven't found their way to any specs

ddahl: Like speaker recognition

ddahl: Is there any value in including that in a standard?

ddahl: What are the pain points in a11y? What would be valuable to do in voice?

ddahl: maybe think about some disabilities that involve voices, either in speaking or hearing

ddahl: what can we do with text to speech that would cover some of the issues around pronunciation spec

ddahl: and SSML

ddahl: I guess EmotionML would be an interesting presentation

ddahl: Looking at emotions being expressed in text or speech would add a lot to the users' perception of what the web page is trying to say

Kim: There was some research at MIT using a common-sense database

Kim: They found it increased recognition a certain percent, but people's perception was that it was more than twice as good

Kim: I guess because it took out the most stupid mistakes

Kim: So the user experience was a lot better

kaz: So I will revise the workshop proposal based on today's discussion

kaz: Kim, please give us further comments in the workshop committee

<kaz> workshop proposal

kaz: It would be great if more participants in this session could join the committee

<kaz> ashimura@w3.org

kaz: you can directly give your input in GH or contact me at my W3C email address

<Jennie> *Thank you - very interesting!

kaz: OK, time to adjourn

kaz: Thank you everyone!

<kaz> [adjourned]

<BC> Thank you

Minutes manually created (not a transcript), formatted by scribe.perl version 136 (Thu May 27 13:50:24 2021 UTC).