<scribe> scribenick: cpn
[introductions from Phil, Leonie, Marco]
Phil: A11y is a use case, other
applications in healthcare, driving, etc
... I know this is an important area, want to find out what we
could do
... There are 5 different CGs on voice
... some addressing the same thing, mostly inactive
... voice interaction with the web isn't new
... also voice output is important, eg, for BBC
... none of this gives a clear direction on where we want to
go
... [block diagram]
... [demo video from MIT]
... Open Voice Network
... it's a rare example of a voice assistant with a male
voice
... add to shopping list important for retailers
... Intel and Capgemini are also involved in this
... APIs are needed, for intents and slots, training data
(privacy implications), history of conversation context
... SSML, avoid writing code for each individual platform
... where is the common interest?
... what level of interest is there, and where to continue the
conversation?
<dsr> 1998 W3C workshop on voice browsers
Phil: what are your motivations, pain points, etc?
<dsr> https://www.w3.org/Voice/1998/Workshop/
Dave: Workshop in 1998 led to
specs such as speech synthesis, speech recognition, SSML
... Describing the dialog you have with a voice assistant is
complex
... Wanted to separate that from the synthesis and
recognition
... Work done on APIs
Dave: This is the MDN page for
the Web Speech API
... Browsers support synthesis, but few support
recognition
... Then there's the relationship between voice interaction and
chatbots
... Voice recognition has improved, so we now have good-quality
speech recognition; the remaining problem is the text interaction
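The Web Speech API Dave refers to splits into synthesis (widely shipped) and recognition (vendor-prefixed, and far less widely implemented). A minimal browser sketch, guarded so it is inert outside a browser; `pickVoice` is an illustrative helper name, not part of the API:

```javascript
// Pure helper: pick a voice by BCP 47 language tag, falling back to
// the first available voice. Works without a browser.
function pickVoice(voices, lang) {
  return voices.find((v) => v.lang === lang) || voices[0] || null;
}

// Browser-only usage of the Web Speech API.
if (typeof window !== "undefined" && "speechSynthesis" in window) {
  // Synthesis: speak a string with a chosen (or default) voice.
  const utterance = new SpeechSynthesisUtterance("Hello from the web");
  utterance.voice = pickVoice(speechSynthesis.getVoices(), "en-GB");
  speechSynthesis.speak(utterance);

  // Recognition: feature-detect, since it is prefixed in Chromium
  // and absent elsewhere.
  const Rec = window.SpeechRecognition || window.webkitSpeechRecognition;
  if (Rec) {
    const rec = new Rec();
    rec.onresult = (e) => console.log(e.results[0][0].transcript);
    rec.start();
  }
}
```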
<dsr> https://developer.amazon.com/docs/custom-skills/create-the-interaction-model-for-your-skill.html
Dave: This is the Amazon
developer page for creating Alexa skills
... There's a declarative way to define intents and slots
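The declarative model Dave describes takes roughly this JSON shape in Amazon's documentation (trimmed sketch; the intent name and sample utterances here are illustrative):

```json
{
  "interactionModel": {
    "languageModel": {
      "invocationName": "shopping list",
      "intents": [
        {
          "name": "AddItemIntent",
          "slots": [
            { "name": "item", "type": "AMAZON.Food" }
          ],
          "samples": [
            "add {item} to my list",
            "put {item} on the shopping list"
          ]
        }
      ]
    }
  }
}
```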
<dsr> https://github.com/w3c/strategy/issues/134
Dave: There's a range of
conversation markup languages available, AIML, BOTML
... What are the strengths and weaknesses of these?
... What's the business value?
... Improve customer service using chatbots
... Includes not being annoying, where the agent on websites
often gets in the way
... We could have a CG, organise a W3C workshop
... Can we get the commercial companies interested?
<scheib> https://github.com/slightlyoff/declarative_web_actions mentioned by Aaron G.
Aaron_Gustafson: Declarative Web Actions, a generalised approach to interactions
... Declarative Web Actions is a way in the Web App Manifest to
declare interactions with assistants such as Cortana, Siri,
etc
... A way to tie into the operating system
... Cortana had a similar thing
... Placeholders for keywords with alternate phrasing
... It uses slots, a similar architecture; intents and key
phrasings were used for triggering
... Talk to Alex Russell
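The explainer Aaron references proposes declaring voice-triggerable actions in the Web App Manifest. A hypothetical sketch of the idea (the member names and placeholder syntax here are illustrative assumptions, not the explainer's final syntax):

```json
{
  "name": "Example Shop",
  "actions": [
    {
      "name": "search",
      "url": "/search?q={query}",
      "phrases": ["search for {query}", "find {query}"]
    }
  ]
}
```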
Dave: A company could create an agent, or to allow third parties to plug in, which is more scalable
Vincent: Working on Chrome and
Google Assistant
... The market is changing rapidly, so it's a challenging time
to do standardisation work
... What we'll have in a few years might be quite
different
... Architecture challenging because of changing technology,
and businesses in this space are moving fast and
differentiating themselves
<aarongu> Declarative Web Actions: https://github.com/slightlyoff/declarative_web_actions
Vincent: SSML has been adopted
and extended by Amazon and Google
... I was advocating use of the standardised parts of SSML,
enables ingest of content from third parties
... Things can move faster by not using standards
... With the appropriate parties engaged, we'll find people
receptive to add enhancements to SSML
<aarongu> Cortana’s Voice Command Definition (for reference) https://docs.microsoft.com/en-us/uwp/schemas/voicecommands/voice-command-elements-and-attributes-1-2
Vincent: Other foundational
technologies are speech recognition and speech generation
... Companies don't need standardisation; they're moving
fast
... How are users using agents? In many ways. For embedded agents
in web pages, I don't see large usage
... Instead, appliance scenarios as input modality to the
computer as a whole
... Using the assistant at the mobile OS level
... Smart speakers
... On laptops and desktops, there's less usage
... The best thing we can do at W3C is make web content as
navigable and actionable as possible by OS level agents
... And build aspects of those agents into the browser
... Alexa and Siri attempt to get structured data from the web,
e.g., schema.org
... These queries work best: fact-based or structural
queries give good responses
... Navigating a website with unique offerings isn't handled
very well
... Having a page that responds to certain actions such as
Ctrl+S for save, and having an associated voice action, has
value
Leonie: Is there room for new features in SSML, such as effects like "whisper", as a quick way to produce specific patterns?
Vincent: Yes. Reciprocity and adjacent attributes. It's complex; there's motivation to improve speech generation, but this is so new it's hard to standardise
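The gap Léonie describes is visible in vendor extensions: standard SSML offers `prosody` as the closest portable tool, while a true whisper needs, for example, Alexa's proprietary element. A minimal sketch:

```xml
<speak>
  <!-- Portable SSML: prosody is the closest standard mechanism -->
  <prosody volume="soft" rate="slow">A quiet, slow phrase.</prosody>
  <!-- Vendor extension, Alexa only -->
  <amazon:effect name="whispered">An actual whisper.</amazon:effect>
</speak>
```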
Leonie: Google are restarting work on Web Speech API, is there interest in formalising that more?
<dsr> Amazon’s extensions to SSML: https://developer.amazon.com/blogs/alexa/post/5c631c3c-0d35-483f-b226-83dd98def117/new-ssml-features-give-alexa-a-wider-range-of-natural-expression
Vincent: Don't know
Dave: Interest from Amazon in extending SSML at W3C
Vincent: Google would also be interested, but other things we're doing are out of scope
Brian: Not everything there is currently supported in browsers
Vincent: Would be good to have an artifact that describes state of SSML support
Markku: Pronunciation TF from APA
WG. Coming from education, consuming text to speech
content
... Specific requirements for word pronunciation
... A barrier is that the HTML content can't host SSML
... Presentation cues in HTML could also be consumed by voice
assistants, please participate in the TF
<Irfan> Pronunciation Task Force: https://www.w3.org/WAI/APA/task-forces/pronunciation/
<dsr> Chris: the broadcast industry got together under the EBU to discuss some of these issues, e.g. loudness of voice relative to other content, to present our content using our voice talents
<dsr> Concerns about difficulties of achieving write once run everywhere
<dsr> Need to involve implementers
<dsr> The EBU is expecting to provide a collection of requirements
<dsr> BBC would support work on extending SSML
<scribe> scribenick: cpn
Kaz: There's also PLS, as well as SSML
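The PLS (Pronunciation Lexicon Specification) Kaz mentions lets authors map spellings to pronunciations that synthesisers and recognisers can share. A minimal lexicon (the example word is illustrative):

```xml
<?xml version="1.0" encoding="UTF-8"?>
<lexicon version="1.0"
         xmlns="http://www.w3.org/2005/01/pronunciation-lexicon"
         alphabet="ipa" xml:lang="en-GB">
  <lexeme>
    <grapheme>scone</grapheme>
    <phoneme>skɒn</phoneme>
  </lexeme>
</lexicon>
```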
<tink> Lyrebird is an API that can recreate the voices of real people. Demos on this page https://www.youtube.com/watch?v=YfU_sWHT8mo
Kaz: Also multi-modal architecture, SCXML as the mechanism for that purpose, and EMMA data model
<tink> Lyrebird API here https://www.descript.com/lyrebird-ai
Phil: It sounds like SSML updates
are potentially of interest
... Not keen to look at intents?
<meredith> https://www.irccloud.com/pastebin/cBgx1SEk/
<kaz> SSML 1.1
Vincent: I see that as more challenging, companies are differentiating
<kaz> PLS
<dsr> Opportunities for operating system integrated voice agents being able to make use of semantic descriptions (e.g. schema.org) of services exposed by web sites.
Aaron: Thinking about context providers, e.g., weather services, advertising specific apps to hook into a voice interface, is interesting
Leonie: As someone producing
skills, a way to avoid having to write everything twice is
desirable
... There's similarity with conversational models. I suspect
the hooks are similar
Vincent: I think there's huge
potential, more with SSML than intents though
... It hasn't started with a standards-first approach
Dave: schema.org has allowed
smart search, but also provides hooks for the OS voice assistant. How
could we extend schema.org to provide the kinds of voice
experiences people are looking for?
... Then the voice vendors have something common to work
with
Aaron: I'd like to be able to ask a website to search for things, and it know what to do
Omar: I'm working on chatbots. I
notice there's a ubiquity: the bot is on the web page, then FB
Messenger, etc.
... We're thinking about intents, whether to do in frontend or
back-end
... Would a web standard help with intents? Same for speech
synthesis, where email or SMS are valid channels for the
chatbot
... I'd like to see improvement in interoperability between
Alexa and Siri
... For speech recognition, we do nothing, as mobile devices
have it built in
Phil: Does the browser have a speech synthesis API?
Brian: Yes, but it's not a great API; it lacks the ability to give richer input than just text
Leonie: This is being worked on in a CG; could we bring support and move it to a WG?
Brian: TAG has given input on Web Speech
Kaz: Multi-application handling
was included in the multi-modal architecture
... WoT is working on smart speakers and speech synthesis
... Not suggesting using WoT for this, but we should
collaborate
<inserted> WoT PlugFest breakout minutes
<inserted> WoT PlugFest summary slides
Phil: If there were a W3C workshop on this, would you come?
<meredith> reposting as link instead of snippet: https://github.com/w3c/strategy/issues/71#issuecomment-391105060
Phil: We'd need implementers in the room
Vincent: There's a good chance we could get people there
Dave: In preparing the workshop we'd reach out to stakeholders, so we'd first want to make the right contacts, to make it relevant
Vincent: I can help make contacts in Google
Aaron: I can help at Microsoft
Dan: I can also help at Google
Dan: We're always happy to try
things out at schema.org
... There's speakable, which reads things from news articles.
There's work on intents and filling in forms
... We pull in feeds from Netflix, etc, schema.org works well
for that
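The `speakable` property Dan mentions marks which parts of a page suit text-to-speech. A minimal JSON-LD sketch (the selectors and URL are illustrative):

```json
{
  "@context": "https://schema.org",
  "@type": "WebPage",
  "name": "Example headline",
  "speakable": {
    "@type": "SpeakableSpecification",
    "cssSelector": [".headline", ".summary"]
  },
  "url": "https://example.com/article"
}
```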
Markku: It seems there's a perfect
storm of people in the room to move things forward
... I have issues with SSML
Leonie: Latest update was in 2010
Brian: The Web Speech APIs are also from that time, but work has stopped since then
Leonie: The standards pre-empted the current situation, things have now moved on
Phil: Which group should we join?
Leonie: Voice Assistant Standardisation CG could be restarted
Phil: Thank you everyone
[adjourned]
<mhakkinen> https://www.w3.org/WAI/APA/task-forces/pronunciation/
<phila> Meeting: Voice assistants - what needs standardization?
Present: Irfan, Léonie (tink), Chris_Needham, mhakkinen, scheib, dsr, Kaz_Ashimura, Dan_Brickley