W3C

- DRAFT -

Voice assistants: opportunities for standardisation

18 Sep 2019

Attendees

Present
Irfan, Léonie, (tink), Chris_Needham, mhakkinen, scheib, dsr, Kaz_Ashimura, Dan_Brickley
Regrets
Chair
PhilA
Scribe
cpn

Contents


<scribe> scribenick: cpn

[introductions from Phil, Leonie, Marco]

Phil: A11y is a use case, other applications in healthcare, driving, etc
... I know this is an important area, want to find out what we could do
... There are 5 different CGs on voice
... some addressing the same thing, mostly inactive
... voice interaction with the web isn't new
... also voice output is important, eg, for BBC
... none of this gives a clear direction on where we want to go
... [block diagram]
... [demo video from MIT]
... Open Voice Network
... it's a rare example of a voice assistant with a male voice
... add to shopping list important for retailers
... Intel and Capgemini are also involved in this
... APIs are needed, for intents and slots, training data (privacy implications), history of conversation context
... SSML, avoid writing code for each individual platform
... where is the common interest?
... what level of interest is there, and where to continue the conversation?

<dsr> 1998 W3C workshop on voice browsers

Phil: what are your motivations, pain points, etc?

<dsr> https://www.w3.org/Voice/1998/Workshop/

Previous W3C work

Dave: The workshop in 1998 led to specs for speech synthesis and speech recognition, such as SSML
... Describing the dialog you have with a voice assistant is complex
... Wanted to separate that from the synthesis and recognition
... Work done on APIs

<dsr> https://developer.mozilla.org/en-US/docs/Web/API/Web_Speech_API

Dave: This is the MDN page for the Web Speech API
... Browsers support synthesis, but few support recognition
... Then there's the relationship between voice interaction and chatbots
... Voice recognition has improved, so we now have good-quality speech recognition; the problem now is the text interaction
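
For illustration (not discussed verbatim in the meeting), a minimal TypeScript sketch of the recognition half of the Web Speech API; the prefixed lookup reflects the patchy, Chromium-only support noted above:

    // Recognition support is patchy: Chromium exposes it under a webkit prefix,
    // so look the constructor up dynamically rather than assuming it exists.
    const SpeechRecognitionImpl =
      (window as any).SpeechRecognition ?? (window as any).webkitSpeechRecognition;

    if (SpeechRecognitionImpl) {
      const recognition = new SpeechRecognitionImpl();
      recognition.lang = "en-GB";
      recognition.interimResults = false;

      recognition.onresult = (event: any) => {
        // Each result holds one or more alternatives with confidence scores.
        const transcript = event.results[0][0].transcript;
        console.log(`Heard: ${transcript}`);
      };

      recognition.start();
    } else {
      console.warn("SpeechRecognition is not supported in this browser");
    }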

<dsr> https://developer.amazon.com/docs/custom-skills/create-the-interaction-model-for-your-skill.html

Dave: This is the Amazon developer page for creating Alexa skills
... There's a declarative way to define intents and slots
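
As a sketch of what that declarative definition looks like, a trimmed interaction model following the JSON format in the Amazon page above (the intent, slot, and type names here are made up for illustration):

    // Trimmed interaction model: one intent with a slot, sample utterances,
    // and a custom slot type. All names here are illustrative only.
    const interactionModel = {
      "interactionModel": {
        "languageModel": {
          "invocationName": "coffee shop",
          "intents": [
            {
              "name": "OrderCoffeeIntent",
              "slots": [{ "name": "size", "type": "SIZE_TYPE" }],
              "samples": ["order a {size} coffee", "get me a {size} coffee"]
            }
          ],
          "types": [
            {
              "name": "SIZE_TYPE",
              "values": [
                { "name": { "value": "small" } },
                { "name": { "value": "large" } }
              ]
            }
          ]
        }
      }
    };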

<dsr> https://github.com/w3c/strategy/issues/134

Dave: There's a range of conversation markup languages available, AIML, BOTML
... What are the strengths and weaknesses of these?
... What's the business value?
... Improve customer service using chatbots
... Includes not being annoying, where the agent on websites often gets in the way
... We could have a CG, organise a W3C workshop
... Can we get the commercial companies interested?

<scheib> https://github.com/slightlyoff/declarative_web_actions mentioned by Aaron G.

Aaron_Gustafson: There's the declarative web and Web Actions, a generalised approach to interactions
... Declarative Web Actions is a way in the Web App Manifest to declare interactions with assistants such as Cortana, Siri, etc
... A way to tie into the operating system
... With Cortana, we had a similar thing
... Placeholders for keywords with alternate phrasing
... It used slots and a similar architecture; intents and key phrasings were used for triggering
... Talk to Alex Russell
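
The explainer linked above is an early proposal, so the following manifest fragment is a purely hypothetical sketch of the idea (none of these field names are taken from the actual explainer):

    // Purely hypothetical manifest sketch; the real proposal may differ.
    const manifest = {
      "name": "Example Store",
      // A declared action an OS-level assistant could surface, with a slot
      // placeholder and alternate trigger phrasings.
      "actions": [
        {
          "name": "search",
          "url": "/search?q={query}",
          "phrases": ["search for {query}", "find {query}"]
        }
      ]
    };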

Dave: A company could create an agent, or allow third parties to plug in, which is more scalable

Vincent: Working on Chrome and Google Assistant
... The market is changing rapidly, so it's a challenging time to do standardisation work
... What we'll have in a few years might be quite different
... Architecture challenging because of changing technology, and businesses in this space are moving fast and differentiating themselves

<aarongu> Declarative Web Actions: https://github.com/slightlyoff/declarative_web_actions

Vincent: SSML has been adopted and extended by Amazon and Google
... I was advocating use of the standardised parts of SSML, enables ingest of content from third parties
... Things can move faster by not using standards
... With the appropriate parties engaged, we'll find people receptive to adding enhancements to SSML

<aarongu> Cortana’s Voice Command Definition (for reference) https://docs.microsoft.com/en-us/uwp/schemas/voicecommands/voice-command-elements-and-attributes-1-2

Vincent: Other foundational technologies are speech recognition and speech generation
... Companies don't need standardisation there; they're moving fast
... How are users using agents? In many ways. For embedded agents in web pages, I don't see large usage
... Instead, appliance scenarios, as an input modality to the computer as a whole
... Using the assistant at the mobile OS level
... Smart speakers
... On laptops and desktops, there's less usage
... The best thing we can do at W3C is make web content as navigable and actionable as possible by OS level agents
... And build aspects of those agents into the browser
... Alexa and Siri attempt to get structured data from the web, e.g., schema.org
... These queries work best: fact- or structure-based queries give good responses
... Navigating a website with unique offerings isn't handled very well
... Having a page that responds to certain actions such as Ctrl+S for save, and having an associated voice action, has value
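
A small sketch of that idea in TypeScript: the page already exposes a single save() entry point via a keyboard shortcut, and a voice action would ideally invoke the same entry point (the voice binding itself is hypothetical, since no standard mechanism exists):

    // One entry point the page already exposes via a keyboard shortcut.
    function save(): void {
      console.log("Saving document...");
    }

    document.addEventListener("keydown", (event: KeyboardEvent) => {
      if (event.ctrlKey && event.key === "s") {
        event.preventDefault(); // suppress the browser's own save dialog
        save();
      }
    });

    // Hypothetically, a declared "save" voice action would call the same
    // save() function; today there is no standard way to declare that binding.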

Leonie: Is there room for new features in SSML, such as effects like "whisper", a quick way to produce specific patterns?

Vincent: Yes. Reciprocity and adjacent attributes. It's complex; there's motivation to improve speech generation, but this is so new it's hard to standardise
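
For context, "whisper" today is a vendor extension rather than standard SSML; a sketch contrasting the two (the amazon:effect element is Amazon's documented extension, linked below; the prosody element is standard SSML 1.1):

    // Standard SSML 1.1 can only approximate a whisper through prosody...
    const standardSsml = `
      <speak version="1.1" xmlns="http://www.w3.org/2001/10/synthesis">
        <prosody volume="x-soft" rate="slow">I have a secret.</prosody>
      </speak>`;

    // ...whereas Amazon documents a vendor extension for the actual effect.
    const amazonSsml = `
      <speak>
        <amazon:effect name="whispered">I have a secret.</amazon:effect>
      </speak>`;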

Leonie: Google are restarting work on Web Speech API, is there interest in formalising that more?

<dsr> Amazon’s extensions to SSML: https://developer.amazon.com/blogs/alexa/post/5c631c3c-0d35-483f-b226-83dd98def117/new-ssml-features-give-alexa-a-wider-range-of-natural-expression

Vincent: Don't know

Dave: Interest from Amazon in extending SSML at W3C

Vincent: Google would also be interested, but other things we're doing are out of scope

Brian: Not everything there is currently supported in browsers

Vincent: Would be good to have an artifact that describes state of SSML support

Markku: The Pronunciation TF is from the APA WG. Coming from education, consuming text-to-speech content
... There are specific requirements for word pronunciation
... A barrier is that HTML content can't host SSML
... Presentation cues in HTML could also be consumed by voice assistants; please participate in the TF

<Irfan> Pronunciation Task Force: https://www.w3.org/WAI/APA/task-forces/pronunciation/
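
One direction that has been explored for this is carrying pronunciation hints on HTML elements rather than embedding SSML documents; a hypothetical sketch of that attribute-based approach (the data-ssml name and JSON shape are illustrative, not a settled standard):

    // Hypothetical attribute-based approach: pronunciation hints ride along
    // on the HTML element instead of requiring a separate SSML document.
    // The "data-ssml" attribute name and JSON value are illustrative only.
    const markup = `
      <p>Read the code
        <span data-ssml='{"say-as": {"interpret-as": "characters"}}'>W3C</span>
        aloud letter by letter.
      </p>`;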

<dsr> Chris: the broadcast industry got together under the EBU to discuss some of these issues, e.g. loudness of voice relative to other content, to present our content using our voice talents

<dsr> Concerns about difficulties of achieving write once run everywhere

<dsr> Need to involve implementers

<dsr> The EBU is expecting to provide a collection of requirements

<dsr> BBC would support work on extending SSML

<scribe> scribenick: cpn

Kaz: There's also PLS, as well as SSML

<tink> Lyrebird is an API that can recreate the voices of real people. Demos on this page https://www.youtube.com/watch?v=YfU_sWHT8mo

Kaz: Also the multi-modal architecture, with SCXML as the mechanism for that purpose, and the EMMA data model

<tink> Lyrebird API here https://www.descript.com/lyrebird-ai

Phil: It sounds like SSML updates are potentially of interest
... Not keen to look at intents?

<meredith> https://www.irccloud.com/pastebin/cBgx1SEk/

<kaz> SSML 1.1

Vincent: I see that as more challenging, companies are differentiating

<kaz> PLS

<dsr> Opportunities for operating system integrated voice agents being able to make use of semantic descriptions (e.g. schema.org) of services exposed by web sites.

Aaron: Thinking about context providers, e.g., weather services advertising specific apps to hook into a voice interface, is interesting

Leonie: As someone producing skills, a way to avoid having to write everything twice is desirable
... There's similarity with conversational models. I suspect the hooks are similar

Vincent: I think there's huge potential, more with SSML than intents though
... The field hasn't started with a standards-first approach

Dave: schema.org has allowed smart search, but also provides hooks for the OS voice assistant. How could we extend schema.org to provide the kinds of voice experiences people are looking for?
... Then the voice vendors would have something common to work with

Aaron: I'd like to be able to ask a website to search for things, and have it know what to do

Omar: I'm working on chatbots. I notice there's a ubiquity: it's on the web page, then FB Messenger, etc
... We're thinking about intents, and whether to do them in the front end or back end
... Would a web standard help with intents? The same goes for speech synthesis, where email or SMS are valid channels for the chatbot
... I'd like to see improvement in interoperability between Alexa and Siri
... For speech recognition, we do nothing, as mobile devices have it built in

Phil: Does the browser have a speech synthesis API?

Brian: Yes, but it's not a great API; it lacks the ability to give richer input than just text
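
To make that limitation concrete, a minimal synthesis call in TypeScript; the utterance exposes a few coarse knobs, but in practice browsers treat the input as plain text rather than SSML:

    // The synthesis side is widely supported, but input is effectively plain
    // text: browsers generally ignore SSML markup passed in here.
    const utterance = new SpeechSynthesisUtterance("Hello from the web page");
    utterance.rate = 1.0;   // coarse, utterance-wide knobs exist...
    utterance.pitch = 1.0;  // ...but nothing like SSML's per-word control
    window.speechSynthesis.speak(utterance);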

Leonie: This is being worked on in a CG; could we bring support, and move it to a WG?

Brian: TAG has given input on Web Speech

Kaz: Multi-application handling was included in the multi-modal architecture
... WoT is working on smart speakers and speech synthesis
... Not suggesting using WoT for this, but we should collaborate

<inserted> WoT PlugFest breakout minutes

<inserted> WoT PlugFest summary slides

Phil: If there were a W3C workshop on this, would you come?

<meredith> reposting as link instead of snippet: https://github.com/w3c/strategy/issues/71#issuecomment-391105060

Phil: We'd need implementers in the room

Vincent: There's a good chance we could get people there

Dave: In preparing the workshop we'd reach out to stakeholders, so we'd first want to make the right contacts, to make it relevant

Vincent: I can help make contacts in Google

Aaron: I can help at Microsoft

Dan: I can also help at Google

<kaz> MMI interoperability test report (as the starting point of what MMI Architecture is like to synchronize multiple agents like messenger and speech)

Dan: We're always happy to try things out at schema.org
... There's speakable, which reads things from news articles. There's work on intents and filling in forms
... We pull in feeds from Netflix, etc, schema.org works well for that
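
For reference, speakable is expressed as schema.org JSON-LD; a minimal sketch (the selector values are illustrative):

    // schema.org JSON-LD with the speakable property: the CSS selectors point
    // at the parts of the article an assistant should read aloud.
    const article = {
      "@context": "https://schema.org",
      "@type": "NewsArticle",
      "headline": "Example headline",
      "speakable": {
        "@type": "SpeakableSpecification",
        "cssSelector": [".headline", ".article-summary"]
      }
    };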

Marco: It seems there's a perfect storm of people in the room to move things forward
... I have issues with SSML

Leonie: Latest update was in 2010

Brian: The Web Speech APIs are also from that time, but work has stopped since then

Leonie: The standards pre-empted the current situation; things have now moved on

Phil: Which group should we join?

Leonie: Voice Assistant Standardisation CG could be restarted

Phil: Thank you everyone

[adjourned]

<mhakkinen> https://www.w3.org/WAI/APA/task-forces/pronunciation/

<phila> Meeting: Voice assistants - what needs standardization?

Summary of Action Items

Summary of Resolutions

[End of minutes]

Minutes manually created (not a transcript), formatted by David Booth's scribe.perl version 1.154 (CVS log)
$Date: 2019/09/18 06:32:54 $
