W3C's work on voice and multimodal interaction
Giving users a chance to choose between using their ears or
eyes, their voice or their fingers
Dave Raggett <dsr@w3.org>
W3C Fellow on assignment from Openwave Systems
Brief history of W3C's involvement in voice standards work
- The Web started as a visual medium with HTML and images etc.
- Aural style sheets made it possible to style HTML for rendering
to speech, while keeping the keyboard for input
- W3C got further involved in voice interaction through work on
accessibility for people with visual impairments
- Voice interaction is valuable for hands and eyes free operation,
and for easier input on small devices
- W3C workshop on voice browsers in October 1998
- Voice Browser working group set up in early 1999
- Speech interface framework requirements
- Specifications for VoiceXML, speech grammars, speech synthesis,
call control, ...
For more details, please see http://www.w3.org/Voice
Current status of Voice Browser working group
Requirements:
- Voice dialog requirements - 23 December 1999
- Speech synthesis requirements - 23 December 1999
- Speech grammar requirements - 23 December 1999
- Natural language processing requirements - 23 December 1999
- Pronunciation lexicon requirements - 12 March 2001
- Call control requirements - 13 April 2001
Specifications:
- VoiceXML 2.0 specification in Last Call (24 April 2002)
- Speech Grammar specification about to enter Candidate
Recommendation
- Speech Synthesis specification about to re-enter Last Call
- CCXML call control specification - first Working Draft (21
February 2002)
The working group is now due to be rechartered, and the W3C
Advisory Committee is being asked for its input on this.
A very quick look at VoiceXML
- VoiceXML is being deployed by wireless and wireline operators,
and by companies for various kinds of call centers
- Users dial up to connect to a voice browser running a VoiceXML
interpreter, which in turn contacts a web server to request the
corresponding VoiceXML documents, speech grammars, and audio prompts
- Developers are comfortable with markup and can apply their
skills with server-side scripts and back-end systems
- VoiceXML offers navigation links and form filling, with speech
recognition and DTMF (touch-tone) for input
- Result of recognition activates link or sets value of named
variables
- Judicious blend of declarative and procedural features
Speech Recognition
- Dictation systems now offer good results with training to the
user's voice, a good microphone and quiet conditions
- Speaker independent continuous speech recognition isn't as far
advanced
- Recognition accuracy is increased by supplying a context-free
grammar for what the user is expected to say (see the sketch after
this list)
- This requires carefully designed prompts to get the user to
speak within the specified grammar
- Alternative is to use word spotting, e.g. "bill" or "fault", as
in:
- I have a question about my bill
- I would like to report a fault
- The user can then be transferred to a customer service
representative or to a specific VoiceXML application
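The W3C speech grammar specification has an XML form. Here is a rough
sketch of a word-spotting grammar for the example above (the rule name
is illustrative; GARBAGE is the specification's special rule for
matching arbitrary surrounding speech):

<grammar xmlns="http://www.w3.org/2001/06/grammar"
    version="1.0" xml:lang="en" root="topic">
  <rule id="topic" scope="public">
    <!-- ignore whatever the caller says around the keyword -->
    <ruleref special="GARBAGE"/>
    <one-of>
      <item>bill</item>
      <item>fault</item>
    </one-of>
    <ruleref special="GARBAGE"/>
  </rule>
</grammar>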
Speech Synthesis
- Use of an actor or voice talent to record prompts
- Conventional formant speech synthesis sounds robotic, and is
tiring to listen to
- Non-uniform unit concatenative speech synthesis is much better
and is based on splicing together varying length fragments of
recorded human speech
Audio examples: formant-based synthesis (Festival) versus non-uniform
unit concatenative synthesis (Rhetorical)
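The W3C Speech Synthesis specification defines markup for controlling
how text is rendered to speech. A minimal sketch (the prosody and
break values are purely illustrative):

<speak version="1.0"
    xmlns="http://www.w3.org/2001/10/synthesis" xml:lang="en">
  Your order will be ready in <emphasis>five</emphasis> minutes.
  <break time="500ms"/>
  <prosody rate="slow">Thank you for calling.</prosody>
</speak>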
What does VoiceXML look like?
<vxml version="2.0" xmlns="http://www.w3.org/2001/vxml"
    xml:lang="en">
  <form>
    <!-- collect the destination city using an external grammar -->
    <field name="city">
      <prompt>Where do you want to travel to?</prompt>
      <grammar src="travel.xml#cities"
          type="application/grammar+xml"/>
    </field>
    <!-- the built-in number grammar handles this field -->
    <field name="travellers" type="number">
      <prompt>How many are travelling to
          <value expr="city"/>?</prompt>
    </field>
    <!-- once both fields are filled, send the values to the server -->
    <block>
      <submit next="http://www.acmetravel/bookings"
          namelist="city travellers"/>
    </block>
  </form>
</vxml>
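The interpreter's form interpretation algorithm visits each field
that does not yet have a value, playing its prompt and listening for
input matching its grammar; the final block then submits the
collected values to the web server.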
What is it like to use a VoiceXML application?
Here is an example:
Computer: Welcome to Joe's Pizza ordering service
Computer: Select pizza size from large, medium
or small?
User: large
Computer: what number of these pizzas do you want?
User: two
Computer: Select first topping from mozzarella,
pepperoni and anchovies?
User: mozzarella
Computer: Do you want another topping, yes or no?
User: yes
Computer: Select second topping from mozzarella,
pepperoni and anchovies?
User: pepperoni
Computer: Do you want any other pizzas, yes or no?
...
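A field like the size question can be written with inline options in
place of an external grammar; a sketch:

<field name="size">
  <prompt>Select pizza size from large, medium or small?</prompt>
  <option>large</option>
  <option>medium</option>
  <option>small</option>
</field>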
Recovering from problems
Prompts are carefully designed to elicit simple responses. If the
user fails to respond, or the response can't be understood, the
application provides tapered prompts, for example:
Computer: what number of these pizzas do you want?
User: I reckon, er, two would do the job
Computer: please say the number on its own
User: two
Computer: Select first topping from mozzarella,
pepperoni and anchovies?
...
Recognition confidence scores can be used to determine when to
ask for confirmation, as in the sketch below
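One common pattern tests the field's confidence shadow variable and
only visits a confirmation field when the recognizer is unsure (a
sketch; the 0.5 threshold is purely illustrative):

<field name="city">
  <prompt>Where do you want to travel to?</prompt>
  <grammar src="travel.xml#cities"
      type="application/grammar+xml"/>
  <filled>
    <!-- skip confirmation when the recognizer is confident -->
    <if cond="city$.confidence &gt; 0.5">
      <assign name="confirm" expr="true"/>
    </if>
  </filled>
</field>
<field name="confirm" type="boolean">
  <prompt>Did you say <value expr="city"/>?</prompt>
  <filled>
    <if cond="!confirm">
      <!-- start over on both fields -->
      <clear namelist="city confirm"/>
    </if>
  </filled>
</field>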
Tapered Prompts
Offer progressively more guidance on each turn:
<field name="travellers" type="number">
<prompt count="1">
How many are travelling to <value expr="city"/>?
</prompt>
<prompt count="2">
Please tell me the number of people travelling.
</prompt>
<prompt count="3">
To book a flight, you must tell me the number
of people travelling to <value expr="city"/>.
</prompt>
<nomatch>
<prompt>Please say just a number.</prompt>
<reprompt/>
</nomatch>
</field>
For more examples:
http://www.w3.org/Voice/Guide/
Richer use of natural language
The dialog gets the job done, but is very rigid. With larger
grammars, a more natural interaction style becomes possible, for example:
Computer: Welcome to Joe's Pizza
Computer: What would you like?
User: I would like two large pizzas with mozzarella
and one small pizza with tomatoes and anchovies
Computer: would you like any drinks with that?
User: Sure, 3 large diet cokes, oh and add
pepperoni to the large pizzas
Computer: Is that all?
User: yes
Computer: Okay, that will be ready for you
in 5 minutes
User: thanks
This is impractical today, due to the
difficulties in determining all the ways people could respond.
More flexible natural language understanding is needed to deal
with this.
Multimodal Interaction
Why?
- Allow people to choose when to use their eyes, hands, ears and
mouths
- Speaking is so much easier than thumbing in text
- Sometimes speech is not the answer - too noisy, or it's
inappropriate to speak
- Complementing the transient nature of speech with longer-lived
visual information
- Ink: The use of a stylus for text input, gestures, specialized
notations such as math, music and chemistry, and for diagrams and
artwork
What?
- Name dialling - see who you want to call
- Seeing and hearing as means to browse messages
- Driving directions
- Search, e.g. for flight information
- ...
Multimodal must work on low-end phones!
A multimodal interface to the Web in every pocket!
- Cell phones need to be affordable to people from all walks of
life, which constrains the price
- Small physical size, and desire for long battery life further
constrains memory capacity and processor speed
- Cell phones are expected to support an increasing number of
specifications: XHTML, CSS, JPEG, SVG, SMIL, Java, MP3, scripting
and more ...
- It is therefore essential to be able to offload some of the
burden to the network
- A greater emphasis on the network will also scale better,
allowing for greater flexibility and more innovation without
requiring users to upgrade their phones to enjoy the benefits
W3C multimodal working group charter
- Multimodal work started in Voice Browser Working Group
- Joint W3C/WAP Forum workshop in Hong Kong (September 2000)
- Multimodal Interaction Working Group was formed in February 2002
- Primary focus on 2.5G and 3G mobile networks
- Markup for synchronization across modalities and devices with a
wide range of capabilities
- Support for local and remote speech services (recognition,
prompts and verification)
- Support for local and remote processing of ink entered using
stylus, pen or imaging device
- Support for combination of local visual system and remote dialog
system
- Support for coordination across multiple devices
Technical directions under consideration
- Microsoft-led SALT Forum proposes speech extensions to HTML, and
assumes reasonably powerful devices - PDAs and up
- IBM, Motorola and Opera Software propose XHTML+Voice, which
combines XHTML with VoiceXML for PDAs and up (see the sketch after
this list)
- IBM, Intel and Motorola working on XML format for electronic
ink, called InkXML
- W3C working group is looking at extending W3C event model
across network using XML, but independent of underlying transport
protocol
- Working group is expected to define standard for output
from speech/ink recognizers, plus framework for natural language
understanding
- Coordination with other W3C work, e.g. on XHTML, XForms, SMIL,
SVG, CSS, and Voice Browsers etc.
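As a rough sketch of the XHTML+Voice approach (the markup here is
simplified and illustrative, not verbatim from the published
profile): a VoiceXML form declared in the document head is bound to a
visual field through XML Events, so that focusing the field starts
the spoken dialog.

<html xmlns="http://www.w3.org/1999/xhtml"
    xmlns:vxml="http://www.w3.org/2001/vxml"
    xmlns:ev="http://www.w3.org/2001/xml-events">
  <head>
    <title>Where to?</title>
    <!-- voice handler: asks the user for the city by voice -->
    <vxml:form id="askcity">
      <vxml:field name="city">
        <vxml:prompt>Where do you want to travel to?</vxml:prompt>
        <vxml:grammar src="travel.xml#cities"/>
      </vxml:field>
    </vxml:form>
  </head>
  <body>
    <form action="http://www.acmetravel/bookings">
      <p>City: <input type="text" name="city"
          ev:event="focus" ev:handler="#askcity"/></p>
    </form>
  </body>
</html>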
Relationship with work in other organizations
- W3C work is intended to be complementary to other
organizations
- SIP for setting up multimodal sessions and providing control
throughout the session
- SIP Events are a natural fit for W3C work on distributing XML
events - a register/notify model, though this should be hidden from
application authors
- CATS (formerly MRCP) is the name of new IETF work on a protocol
for control of remote speech recognition, prompt and verification
services
- ETSI Aurora work on increasing resilience of speech recognizers
against transmission fade-outs and aural noise
- Device based preprocessing to address aural noise
- Codec optimized for speech recognition
- Work now picked up by 3GPP along with related work on transport
issues for multimodal services
- ETSI work on standard command and control vocabulary with
bindings for European languages
- WAP Forum may start work on interoperability issues for
multimodal aligned with W3C's work
Last thoughts
- Voice has the potential to be a natural interface to the Web,
opening it up to people with visual impairments
- Designing for voice requires a new set of skills
- Initially voice interfaces will be limited to simple prompted
responses, but over time richer and more natural interaction styles
will become possible
- Electronic ink offers exciting new possibilities for the Web
- Multimodal has the potential to allow you to choose whether you
want to use your voice or your hands and eyes, but this relies on
the application designer taking this into account
- What is the relationship between multimodal and device
independence?
- How does natural language interaction relate to the Semantic
Web?
Questions?