W3C

- DRAFT -

Voice Agents Breakout

27 Oct 2020

Attendees

Present
Kaz_Ashimura, Phil_Archer, Becky_Gibson, Chris_Needham, Dan_Burnett, Dan_Zhou, Deborah_Dahl, Erika_Miguel, Fuqiao_Xue, Jan-Ivar_Bruaroey, Jason_White, Joshue_O_Connor, Kazuaki_Arai, Kazuhiro_Hoya, Ken_Ogiso, Max_Froumentin, Mark_Hakkinen, Michael_McCool, Paul_Grenier, Pavel_Pomerantsev, Roy_Ran, Shigeru_Fujimura, Neil_Soiffer, Steve_Lee, Takashi_Minamii, Tomoaki_Mizushima, Xiaoqian_Wu, Zoltan_Kis, jiexuan_Gao
Regrets
Chair
Kaz
Scribe
cpn, kaz

Contents


<kaz> scribenick: cpn

Background

https://github.com/w3c/strategy/issues/221 Voice Agents Workshop proposal

Kaz: Lots of technologies, speech technologies, Google Voice Agent, Alexa available
... Can use voice technology with TV sets, kiosk services
... At a breakout at TPAC 2019 there was discussion on the need to improve voice agent technology, especially for web services
... So many viewpoints and expected use cases. Focus on interaction with smart devices, from web browsers, smart navigation, accessibility
... What is missing, from a global viewpoint?
... We've had lots of comments on the GitHub issue
... Can we identify the missing features, user needs, and developer needs?
... Smarter interactions, short and clear commands, smarter dialog model between human and the system
... Support for various languages
... Advanced voice technology: style, expression, feeling, emotion
... Input and output entities from various vendors
... Timing, how and when, using which modality
... Typing, handwriting, voice
... A possible session could be on underlying technologies: dialog management from a research viewpoint
... Protocols for data transfer
... State transition management, improved model for voice interaction
... Also horizontal platform requirements: discovery, privacy, security, accessibility and usability
... Examples of related use cases: voice agents, smartphones, smart speakers, connected car, smart homes, IoT in general
... Example of user asking a TV to play something. A more human interaction is useful
... We need multiple stakeholders to participate
... Looking for participants, and people to join the workshop's programme committee

Q&A

Phil: Thank you for doing this. I'm interested in taking part
... At the last TPAC meeting, there was a desire to update SMIL
... GS1 would be interested. Asking the TV to order a pizza, "where from?" Does the user or the TV choose?
... Any contact with the Open Voice Network? US retailer, Target, is behind it

<xfq> https://openvoicenetwork.org/

Kaz: I agree, SMIL is important. I also work for the WoT group, they're thinking about something similar for serialization of device based services
... Your input is welcome, would you like to be on the programme committee?

<ddahl> we might be mixing up smil and ssml

Phil: Yes

<phila_> Open Voice network

Mark: Helped start up the APA pronunciation TF. My organization is interested in this, as we're looking to solve pronunciation in text-to-speech
... Language learning on web and mobile
... I'm interested to hear about emergence of SMIL in this. I was involved in SMIL 1 and 2.
... What we're trying to solve in education is more auditory presentation of content, e.g., by voice assistants
... Make it easy to support, students get better experience regardless of modality
... I'm interested in participating in the workshop
... Other publishers, such as Pearson could be interested

<Roy> pronunciation TF https://www.w3.org/WAI/pronunciation/

Kaz: The Publishing BG chair is also interested in this activity

Mark: I'm interested in the programme committee

Xiaoqian: Work for W3C in China. Working on MiniApp, the vendors see a strong need for a markup language to monitor the application by voice. If there's interest in a markup language, the MiniApp vendors would be pleased to try to implement

Kaz: There was a voice browser WG maybe 15 years ago. We visited Beijing for i18n of speech synthesis
... Either a markup or an API approach, involving Chinese stakeholders is important

Deborah: I'm worried about some of the potential topics. Some are looking to make improvements in the fundamental voice technology
... Our focus could be better put to interoperability, similar to HTML as underlying markup for different browsers

<xiaoqian> +1 to ddahl to stay focus

Kaz: We can't work on AI, but an easier way to improve usability may be to use some template like SISR. Need to look at use cases, what kind of things need to improve
... A possibility could be standardizing the interface between AI services and web services
... Need to look into the detail on that

Deborah: The Voice Interaction CG is looking into this. We have an architecture with standard communication channels

Kaz: Feedback from that CG activity would be great.

<ddahl> https://www.w3.org/community/voiceinteraction/

Joshue: I work in the area of emerging web technologies.
... I wanted to understand the focus of this work, in relation to devices like Alexa and Siri.
... Where does this sit? Is it about doing similar in the browser?
... In APA WG there is work to fine tune improvements for text to speech for users

Kaz: Good point. We need to see what's available for smart speakers, browsers. We should also think about pronunciation for different languages, not just English

Phil: Do you have a sense of where this work might go? A possible WG, for example?

Kaz: The conclusions will be determined by the workshop itself. But if possible, if we get many actual needs and use cases, we can create a WG and start standardization based on those needs

Phil: One of the reasons we zeroed in on SMIL at the meeting last year, is that it's something that could be done.
... Voice assistants is a competitive market, you'd need people from the assistant vendors in the room

Kaz: Another possibility is an Interest Group. There are use cases from various stakeholders, relating to different technologies. We could bring those to a new possible SMIL group, HTML, CSS, separately. Gap analysis needs to be done first, then think about potential WGs after we've done that.
... I'd like to see which level of improvement is needed, and which process is most appropriate for those needs, as part of the workshop

<kaz> scribenick: kaz

Chris: A couple of things
... W3C has the Web Speech API already
... Revitalizing it might be part of this work
... Speech synthesis and recognition
... Another aspect is user privacy
... Voice recording, etc.

<xfq> https://w3c.github.io/web-roadmaps/mobile/userinput.html#exploratory-work

<ddahl> the Web Speech API is only implemented in Chrome as far as I know

<ddahl> I'm not sure there are very many applications using it though

<jib> SpeechSynthesis appears to be in most browsers https://developer.mozilla.org/en-US/docs/Web/API/SpeechSynthesis

Chris: for us, BBC, would like to see some standardization for multiple devices
... would see desire from vendors
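As noted in the side comments above, Web Speech API support varies across browsers (SpeechSynthesis is widely available, SpeechRecognition much less so), so pages using it need feature detection. A minimal sketch, assuming a browser environment; the utterance settings shown are illustrative, not prescriptive:

```javascript
// Feature-detect the Web Speech API before using it; support varies by browser.
function speak(text) {
  if (typeof window === "undefined" || !("speechSynthesis" in window)) {
    console.log("Web Speech API not available in this environment");
    return false;
  }
  const utterance = new SpeechSynthesisUtterance(text);
  utterance.lang = "en-GB"; // BCP 47 language tag; pick one the platform has a voice for
  utterance.rate = 1.0;     // speaking rate (0.1 to 10)
  window.speechSynthesis.speak(utterance);
  return true;
}

speak("Hello from the Web Speech API");
```

In environments without the API (or where the user has denied it), `speak` falls back gracefully rather than throwing.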

<cpn> scribenick: cpn

Kaz: The Web Speech API is a CG note, so updating it based on needs and use cases would be useful
... Also interaction with interactive TV is important to you?

Chris: Yes

Kaz: Privacy and interaction with cloud services. How to manage the whole sequence of devices and applications could be a possible topic

Michael: Regarding IoT, should coordinate with the WoT WG. There's a lot of overlap

Paul: I'm from the pronunciation TF. I'd like to see the workshop address better handling of accents. This may need configuration, constraint matching, to handle non-native English accents. If language processing is done in the cloud, phrase capture is brief. There should be better device-level control over capturing those clips, e.g., waiting until an entire instruction is complete, to help reduce user frustration
... Timing and accent for speech disabilities would be huge improvements

Kaz: I'm also interested in those aspects, from my previous research

Paul: Not sure I could be on the programme committee, depends on level of commitment

Kaz: We can share the work among the committee members

Deborah: We have some big players in the virtual agent space, but there are some open source efforts, e.g., Almond from Stanford University
... An open source smart speaker, Mycroft. It has an intelligent agent. There may be more that would be interested in this activity

<paul_grenier> https://almond.stanford.edu/

<paul_grenier> https://mycroft.ai/

Kaz: Good point, we should invite them

Deborah: I could find contact information for them

<Zakim> phila_, you wanted to ask about timing

Phil: Do you have a schedule for when to run the workshop?

Kaz: I was planning to hold it this year, but it would probably be sometime next year
... Based on this feedback, can update the workshop proposal
... Early next year, possibly

Max: I would like to see something about content creation. UK Government would like to see its content made available to web and voice agents. People ask simple questions, e.g., how to renew my passport. We don't want to write the content twice.

<mhakkinen> +1 to Max's authoring once to delivering on web and voice

Max: We need to be accurate in both web and voice modalities.

Phil: One area for me is a clinical setting. There, I want to be able to talk to a box of medicine to ask the dosage for a patient. How would the information be structured?
... An interaction with the physical object, then voice interaction

Paul: From a developer perspective, it seems that embedding metadata in the content would be a way to get some of the smart speaker vendors on board
... This gives author control, by adding metadata to the web content, e.g., with RDFa
... That might be a way to bring them in, taking some existing schema and applying it to this new purpose
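Paul's suggestion of embedding metadata with an existing schema could look roughly like the following sketch, which applies RDFa with schema.org's `Drug`/`DoseSchedule` vocabulary to Phil's medicine-dosage scenario. The product name and values are hypothetical:

```html
<!-- Hypothetical sketch: RDFa with schema.org terms, so a voice agent
     could answer "what is the dosage?" from the page content itself -->
<div vocab="https://schema.org/" typeof="Drug">
  <span property="name">Examplamol</span>
  (<span property="dosageForm">tablet</span>),
  <span property="doseSchedule" typeof="DoseSchedule">
    <span property="doseValue">500</span>
    <span property="doseUnit">mg</span>
    <span property="frequency">every 6 hours</span>
  </span>
</div>
```

The point of the approach is author control: the structured data lives in the same content served to browsers, so web and voice delivery stay in sync.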

Mark: From my organization's perspective, we wanted to push the pronunciation work, as we want correct spoken presentation across different voice devices
... The APA website has an example of voice interaction with pronunciation errors. If you mark-up the content it can help, but there's no standard way to do that yet

Kaz: Does this work also include language translation, for education purposes, etc.?

Paul: Pronunciation hints are embedded in HTML, but it's still for the developer to provide translation. If a page is offered in French and English, the French pronunciation hints would have to be provided for that language pack
... For some pronunciations in some language we have incomplete voice packs
... There's more emphasis on the author in that model
... It may take a while to become automated, needs good ML models
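The "pronunciation hints embedded in HTML" Paul describes are, in the Pronunciation TF's draft work, expressed as SSML-derived attributes on elements. A sketch of the attribute-based approach the TF has explored; `data-ssml` is a draft convention, not a standard, and the values here are illustrative:

```html
<!-- Draft convention explored by the APA Pronunciation TF (not yet standardized):
     SSML-like hints carried as JSON in a data-ssml attribute -->
<p>
  Take the
  <span data-ssml='{"say-as": {"interpret-as": "characters"}}'>A11Y</span>
  route.
</p>
```

As Paul notes, per-language hints like these would have to be authored separately for each language pack a page offers.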

Kaz: It's a good use case. We need to look at the detailed scenario

Next steps

Kaz: The next step should be to form the programme committee, then create the webpages with topics for discussion, and think about scheduling
... Thanks to those of you who agreed to join the programme committee
... Others, please let me know if you're interested
... I'll create a mailing list for programme committee discussion

[adjourned]

Summary of Action Items

Summary of Resolutions

[End of minutes]

Minutes manually created (not a transcript), formatted by David Booth's scribe.perl version (CVS log)
$Date: 2020/10/27 15:45:52 $