
From Voice Browsers to Multimodal Systems
The W3C Speech Interface Framework
Dave Raggett <dsr@w3.org>, W3C Lead for
Voice/Multimodal. Hong Kong, 1st May 2001
Voice – The Natural Interface available from over a
billion phones
Personal assistant functions:
- Name dialing and Search
- Personal Information Management
- Unified Messaging (mail, Fax & IM)
- Call screening & call routing
Voice Portals
- Access to news, information, entertainment, customer service
and V-commerce (e.g. Find a friend, Wine Tips, Flight info, Find a
hotel room, Buy ringing tones, Track a shipment)
Front-ends for Call Centers
- 90% cost savings over human agents
- Reduced call abandonment rates (IVR)
- Increased customer satisfaction
W3C Voice Browser Working Group
Mission
- Prepare and review markup languages to enable Internet-based
speech applications
- Has published requirements and specifications for languages
in the W3C Speech Interface Framework
- Is now due to be re-chartered with a clarified IP policy
Voice Browser working group members
Alcatel
AnyDevice
Ask Jeeves
AT&T
Avaya
BeVocal
Brience
BT
Canon
Cisco
Comverse
Conversay
EDF
France Telecom
General Magic
Hitachi
HP
IBM
Informio
Intel
IsSound
Lernout & Hauspie
Locus Dialogue
Lucent
Microsoft
Milo
Mitre
Motorola
Nokia
Nortel Networks
Nuance
Philips
Openwave
PipeBeach
SpeechHost
SpeechWorks
Sun Microsystems
Telecom Italia
Telera
Tellme
Unisys
Verascape
VoiceGenie
Voxeo
VoxSurf
Yahoo
W3C Speech Interface Framework
The Voice Browser working group is developing a suite of
specifications. The following diagram shows how each of the
specifications fits into the general framework for voice
browsers:
[Diagram: the W3C Speech Interface Framework]
Published Documents

Voice User Interfaces and VoiceXML
Why use voice as a user interface?
- Far more phones than PCs
- More wireless phones than PCs
- Hands- and eyes-free operation
Why do we need a language for specifying voice dialogs?
- High-level language simplifies application development
- Separates Voice interface from Application server
- Leverage existing Web application development tools
What does VoiceXML describe?
- Conversational dialogs: system and user take turns speaking
- Dialogs based on form-filling metaphor plus events and
links
W3C aims to standardize VoiceXML based upon the VoiceXML 1.0
submission by AT&T, IBM, Lucent and Motorola
VoiceXML Architecture
Brings the power of the Web to Voice

Reaching Out to Multiple Channels

VoiceXML Features
VoiceXML includes the following:
- Menus, Forms and Sub-dialogs
- Hooks for speech grammars, DTMF input and audio
recording
- Variables and scripting
- Throwing and catching of various events
- Following links and submitting data
- Simple Telephony control
- Access to proprietary features via <object>
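The menu and form examples that follow illustrate the first points. As a rough sketch of some of the other features (variables, scripting, event handling and links), a fragment along the following lines could be used; the URLs and field names here are made up for illustration:
<vxml version="1.0">
  <!-- a document-level variable -->
  <var name="attempts" expr="0"/>
  <!-- a link that is active throughout the document -->
  <link next="http://example.com/operator.vxml">
    <grammar type="application/x-jsgf"> operator </grammar>
  </link>
  <form>
    <field name="pin" type="digits">
      <prompt>Please say or key in your PIN.</prompt>
      <noinput>
        <!-- scripting: count the failed attempts -->
        <assign name="attempts" expr="attempts + 1"/>
        <if cond="attempts &gt;= 3">
          <!-- throw an event, caught by the handler below -->
          <throw event="help"/>
        </if>
        <reprompt/>
      </noinput>
      <catch event="help">
        Your PIN is the four digit code printed on your statement.
      </catch>
    </field>
    <block>
      <submit next="http://example.com/login" namelist="pin"/>
    </block>
  </form>
</vxml>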
Example of a simple voice menu
<menu>
  <prompt>
    <speak>
      Welcome to Ajax Travel.
      Do you want to fly to
      <emphasis> New York </emphasis>
      or
      <emphasis> Washington </emphasis>
    </speak>
  </prompt>
  <choice next="http://www.NY...">
    <grammar>
      <choice>
        <item> New York </item>
        <item> Big Apple </item>
      </choice>
    </grammar>
  </choice>
  <choice next="http://www.Wash...">
    <grammar>
      <choice>
        <item> Washington </item>
        <item> The Capital </item>
      </choice>
    </grammar>
  </choice>
</menu>
Example of a simple form with two fields
<form id="weather_info">
  <block>Welcome to the international weather service.</block>
  <field name="country">
    <prompt>What country?</prompt>
    <grammar src="country.gram" type="application/x-jsgf"/>
    <catch event="help">
      Please say the country for which you want the weather.
    </catch>
  </field>
  <field name="city">
    <prompt>What city?</prompt>
    <grammar src="city.gram" type="application/x-jsgf"/>
    <catch event="help">
      Please say the city for which you want the weather.
    </catch>
  </field>
  <block>
    <submit next="/servlet/weather" namelist="city country"/>
  </block>
</form>
VoiceXML Implementations
- BeVocal
- General Magic
- HeyAnita
- IBM
- Lucent
- Motorola
- Nuance
- PipeBeach
- SpeechWorks
- Telera
- Tellme
- VoiceGenie
If your company has an implementation and would like to be
listed, contact Dave Raggett
Reusable Components

Principles:
- Express application at task level rather than interaction
level
- Save development time by reusing tried and tested modules
(see the sketch after the list of examples below)
- Increase consistency among applications
Possible examples include:
- Credit card number
- Date
- Name
- Address
- Telephone number
- Yes/No question
- Shopping cart
- Order status
- Weather
- Stock quotes
- Sport scores
- Word games
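For instance, a reusable date component from the list above could be packaged as its own dialog and invoked from VoiceXML via <subdialog>. The following is only a sketch: the date.vxml component, its returned "date" variable and the URLs are assumptions made for illustration.
<form id="book_flight">
  <!-- a dialog-level variable to hold the result -->
  <var name="depart_date"/>
  <!-- reuse a shared date dialog rather than re-implementing it here -->
  <subdialog name="get_date" src="http://example.com/library/date.vxml">
    <filled>
      <!-- assumes date.vxml returns a variable named "date"
           via <return namelist="date"/> -->
      <assign name="depart_date" expr="get_date.date"/>
    </filled>
  </subdialog>
  <block>
    <submit next="http://example.com/book" namelist="depart_date"/>
  </block>
</form>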
Speech Grammars
- Specify the words and patterns of words that a
speaker-independent recognizer listens for
- May be specified
- Inline as part of a VoiceXML page
- Referenced and stored separately on Web servers
- Context-free grammars
- XML and ABNF formats (interchangeable)
- Action tags for "binding data"
- N-gram statistical language models
An example in both the XML and ABNF syntax; first, in XML:
<rule id="state" scope="public">
  <one-of>
    <item> Oregon </item>
    <item> Maine </item>
  </one-of>
</rule>
When rewritten in ABNF this becomes:
public $state = Oregon | Maine;
Speech Synthesis
Synthesis engines have detailed knowledge about language.
Consider the following text:
Dr. Jones lives at 175 Park Dr. He weighs 175 lb. He
plays bass in a blues band. He also likes to fish; last week he
caught a 20 lb. bass.
The first instance of "Dr" is spoken as "Doctor" but the
second is spoken as "Drive". Numbers in street addresses are (in
the USA) spoken specially. Some words like "bass" have context
dependent pronunciations. The previous text is spoken as if it
were written as:
Doctor Jones lives at one seventy-five Park Drive. He
weighs one hundred and seventy-five pounds. He plays base in a
blues band. He also likes to fish; last week he caught a
twenty-pound bass.
Processing is done in stages:
- Structure analysis
- Text normalization
- Text to Phoneme conversion
- Prosody analysis
- Waveform production.
The W3C speech synthesis markup language gives authors control
over each of these steps.
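As a rough sketch of what that control might look like in markup: element names such as speak, say-as, phoneme, break, prosody and emphasis follow the working drafts, but the attribute names and values below are illustrative assumptions rather than quotations from the specification.
<speak>
  <!-- text normalization: hint that the first "Dr." is a title
       and that 175 is part of a street address -->
  <say-as type="name">Dr. Jones</say-as> lives at
  <say-as type="address">175 Park Dr.</say-as>
  <break/>
  <!-- text-to-phoneme conversion: force the "base" pronunciation -->
  He plays <phoneme ph="b ey s">bass</phoneme> in a blues band.
  <!-- prosody: slow down and stress the punch line -->
  <prosody rate="slow">
    Last week he caught a <emphasis>twenty pound</emphasis> bass.
  </prosody>
</speak>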
Lexicon Markup Language
Why?
- Accurate pronunciations are essential in EVERY speech
application
- Platform default lexicons do not give 100% coverage of user
speech

The lexicon can be used, for example, to get the synthesis
engine to say "either" as /ay th r/ and to get the recognition
engine to accept this pronunciation as well as the alternate
pronunciation: /iy th r/.
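The lexicon markup itself had not yet been published, so purely as a hypothetical sketch, an entry for "either" with both pronunciations might look something like the following; every element name here is an illustrative assumption, not the actual LexiconML syntax:
<lexicon>
  <!-- hypothetical entry: one written form, two accepted pronunciations -->
  <lexeme>
    <grapheme>either</grapheme>
    <phoneme>ay th r</phoneme> <!-- preferred pronunciation for synthesis -->
    <phoneme>iy th r</phoneme> <!-- also accepted by the recognizer -->
  </lexeme>
</lexicon>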
LexiconML - Key Requirements
- Meets both synthesis and recognition requirements
- Pronunciations for any language (including tonal)
- reuse standard alphabets, support for suprasegmentals
- Multiple pronunciations per word
- Alternate orthographies
- Spelling variations: "colour" and "color"
- Alternative writing systems: Japanese Kanji and Kana
- Abbreviations and acronyms, e.g. Dr., BT
- Homophones, e.g. "read" and "reed" (same sound)
- Homographs, e.g. "read" and "read" (same spelling)
Natural Language Semantics ML
A standard for representing the output from language
understanding. This markup will be generated and consumed by
software rather than by people. It builds upon XForms for
representing the values and types of data extracted from spoken
utterances.
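As an illustration only, the interpretation of a weather query might be represented along the following lines; the element names (result, interpretation, input, instance) are sketched from the direction of the working draft and should be read as assumptions rather than the final syntax:
<result>
  <!-- one interpretation of the spoken utterance, with a confidence score -->
  <interpretation confidence="0.85">
    <input mode="speech">what is the weather in San Francisco</input>
    <instance>
      <!-- values extracted from the utterance, typed as in XForms -->
      <city>San Francisco</city>
      <country>USA</country>
    </instance>
  </interpretation>
</result>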

Interaction Style
- Voice user interfaces needn't be dull
- Choose prompts to reflect an explicit choice of
personality
- Introduce variety in prompts rather than always repeating the
same thing
- Politeness, helpfulness and sense of humor
- Target different groups of users e.g. Gen Y
- Allow users to select personality (skin)
Call Control
Give developers control over speech and telephony resources at
the network edge.

- Examples
- Call management—Place outbound call, conditionally
answer inbound call, outbound fax
- Call leg management—Create, redirect, interact while on
hold
- Conference management—Create, join, exit
- Intersession communication—Asynchronous events
- Interpreter context—Invoke, terminate
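No call control markup had been published at this point, so the following is a purely hypothetical sketch of the kinds of operation listed above: placing an outbound call and joining it to a conference. Every element and attribute name here is invented for illustration.
<!-- hypothetical call control markup, for illustration only -->
<session>
  <!-- call management: place an outbound call -->
  <createcall id="leg1" dest="tel:+85212345678"/>
  <!-- conference management: create a conference and join the new leg -->
  <createconference id="conf1"/>
  <join leg="leg1" conference="conf1"/>
  <!-- intersession communication: raise an asynchronous event
       for another session to catch -->
  <send event="conference.started" target="supervisor-session"/>
</session>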
Multimodal Interaction = Voice + Displays
- Say which city you want weather for and see the information
on your phone:
"What is the weather in San Francisco?"
- Say which bands/CDs you want to buy and confirm the choices
visually:
"I want to place an order for 'Hotshot' by Shaggy"
W3C work on multimodal
- Multimodal applications
- Voice + Display + Key pad + Stylus etc.
- User is free to switch between voice interaction and use of
display/key pad/clicking/handwriting
- July 2000 Published Multimodal Requirements Draft
- Demonstrations of Multimodal prototypes at Paris face to face
meeting of Voice Browser WG
- Joint W3C/WAP Forum workshop on Multimodal – Hong Kong
September 2000
- February 2001 – W3C publishes Multimodal Request for
Proposals
- Plan to set up Multimodal Working Group later this year
assuming we get appropriate submission(s)
Requirements
- Primary market is mobile wireless
- cell phones, personal digital assistants and cars
- Timescale is driven by deployment of 3G networks
- Input modes:
- speech, keypads, pointing devices, and electronic ink
- Output modes:
- speech, audio, and bitmapped or character cell displays
- Architecture should allow for both local and remote speech
processing
VoiceXML IP Issues
- Technical work on VoiceXML 2.0 is proceeding well
- Related specifications for speech grammars, speech synthesis,
natural language semantics, the lexicon, and call control have
been or will shortly be published
- But publication of the VoiceXML 2.0 working draft is held up
over IP issues (although an internal version is accessible to all
W3C Members)
- W3C and VoiceXML Forum Management are in the process of
developing a formal Memorandum of Understanding
- W3C Patent Policy allows for both Royalty-free and RAND
working groups
- W3C is convening a Patent Advisory Group to recommend
detailed IP Policy for re-chartering the Voice Browser
Activity