Using XML for Voice Applications

Philipp Hoschka
Interaction Domain Leader
W3C/INRIA

Why the Interest in Voice?

"Web approach" lowers telephone service development cost
On small mobile devices (3G)
- voice input instead of keyboard
- voice output instead of screen

"Thumbs are the new fingers ..."

Mercury News, 23 March 2002:

Thumbs are the new fingers for GameBoy youth

The use of gadgets such as mobile phones and GameBoys has caused 
a physical mutation in young people's hands, according to 
a British Sunday newspaper.  
New research carried out in nine cities around the world shows that 
the thumbs of people under the age of 25 have taken over as the 
hand's most dextrous digit, said The Observer.  
...
"Discovering that the younger generation has taken to using 
thumbs in a completely different way and are instinctively using 
thumbs where the rest of us are using our index fingers is 
particularly interesting."

W3C Voice Activities

Voice Browser
- Telephone-based services
- VoiceXML
- Purely voice-based
Multimodal Interaction
- Focused on 3G mobile devices
- Just launched
- Mix voice with web page

What is a Voice Browser?

Browser: "What is your departure airport?"
Caller: "Nice"
Browser: "Where do you want to fly to?"
Caller: "Paris"
Browser: "At what time?"
...

Example VoiceXML Users

Amazon: customers can use their voice to check the status of recent orders from any phone, any time
AT&T: US toll-free directory assistance
Merrill Lynch: Merrill Lynch Service Network. Routes calls from Merrill's thousands of financial consultants to internal subject matter experts.
Yahoo: Call 1-800-MY-YAHOO
- listen to email and voicemail
- look up address book
- listen to stock quotes, weather, sports, news
Demo

Companies Involved

Editors of Working Drafts
1. AT&T
2. Dynamicsoft
3. IBM
4. Lucent
5. Motorola
6. Nuance Communications
7. PipeBeach
8. SpeechWorks International
9. Tellme
10. ...

Why XML for Voice?

Leverage Web success factors for voice
- Entry-level development with text-based editor
- Learn by reading other people's documents
- Dynamically generated content possible
- Content stored on any web server
- Web developers familiar with angle-brackets

System Architecture

[Figure: voice browser architecture]

W3C XML Markup Languages for Voice

Dialogue Control: VoiceXML
Speech Recognition: Speech Recognition Grammar Syntax
- also: non-XML version
Speech Synthesis: Speech Synthesis Markup Language

VoiceXML

XML-based programming language for voice applications
Need to describe
- System prompt
- Expected user response
- Action on expected response
- Action on unexpected response
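
The four items above can be sketched outside VoiceXML as a small data structure plus a loop. This is only an illustration in Python, not VoiceXML and not how any particular voice browser is implemented; the names `Field`, `run_field`, `on_match` and `on_nomatch` are hypothetical, and the caller's speech is simulated with a list of strings.

```python
from dataclasses import dataclass
from typing import Callable, Iterable


@dataclass
class Field:
    """One question in a dialogue (all names here are hypothetical)."""
    prompt: str                        # system prompt
    expected: frozenset                # expected user responses (lower case)
    on_match: Callable[[str], str]     # action on expected response
    on_nomatch: Callable[[str], str]   # action on unexpected response


def run_field(field: Field, answers: Iterable[str], max_tries: int = 3):
    """Ask until an expected response arrives or the tries run out.

    Returns (value, transcript); value is None if nothing matched.
    """
    transcript = []
    answers = iter(answers)
    for _ in range(max_tries):
        transcript.append("Browser: " + field.prompt)
        heard = next(answers, None)
        if heard is None:              # no more simulated input
            break
        transcript.append("Caller: " + heard)
        key = heard.strip().lower()
        if key in field.expected:
            transcript.append("Browser: " + field.on_match(heard))
            return key, transcript
        transcript.append("Browser: " + field.on_nomatch(heard))
    return None, transcript


departure = Field(
    prompt="What is your departure airport?",
    expected=frozenset({"nice", "paris"}),
    on_match=lambda city: "Flying from " + city + ".",
    on_nomatch=lambda _: "Sorry, I did not understand.",
)

# First answer is not an expected response, second one is.
value, log = run_field(departure, ["Niece airport", "Nice"])
```

A VoiceXML document expresses the same four pieces declaratively, so the loop itself is left to the voice browser.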

Example: Prompt

<prompt>
  <audio>
    Welcome to the <say-as type="acronym">W3C</say-as>
    Voice <say-as type="acronym">XML</say-as> server.
    Would you like to have more information about the
    architecture domain, the document formats domain, the 
    interaction domain, the technology and society domain 
    or the Web Accessibility Initiative ?
  </audio>
</prompt>

Techniques to Improve User Interface

"Barge-In": User can interrupt prompt with answer
Alternate prompts: Vary prompt for input
"Mixed Dialogue": User can give a response that does not answer the question
Browser: "When will you arrive at the hotel?"
User: "I need to rent a car"
Browser: "Which company do you prefer?"
...
See VoiceXML 2.0 Working Draft

Speech Recognition Grammar Syntax

[Figure: speech recognizer]

Example Rule

<rule id="city">
  <one-of>
     <item>Rio de Janeiro</item>
     <item>Rio</item>
     <item>Paris</item>
     ...
  </one-of>
</rule>
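
To make the role of such a rule concrete, here is a rough Python sketch that reads the rule above and checks a recognised phrase against its alternatives. It assumes the recogniser has already turned speech into text, which is the hard part and is not shown; the helper names `load_alternatives` and `match` are hypothetical, and the elided "..." entry is simply left out.

```python
import xml.etree.ElementTree as ET

# The rule from the slide, without the elided "..." entry.
RULE_XML = """
<rule id="city">
  <one-of>
     <item>Rio de Janeiro</item>
     <item>Rio</item>
     <item>Paris</item>
  </one-of>
</rule>
"""


def load_alternatives(rule_xml: str):
    """Return (rule id, list of phrases listed under <one-of>)."""
    rule = ET.fromstring(rule_xml)
    items = [item.text.strip() for item in rule.findall("./one-of/item")]
    return rule.get("id"), items


def match(utterance: str, alternatives):
    """Return the alternative equal to the utterance (ignoring case
    and extra spaces), or None if no alternative matches."""
    heard = " ".join(utterance.lower().split())
    for phrase in alternatives:
        if heard == phrase.lower():
            return phrase
    return None


rule_id, cities = load_alternatives(RULE_XML)
```

With this sketch, `match("rio de janeiro", cities)` yields the first item, while a city that is not listed yields `None`, which corresponds to the "unexpected response" case in the dialogue.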

Dealing with Pronunciations

<item lang-list="en, fr">Rio de Janeiro</item>

Speech Synthesis Markup Language

Synthesis steps
- Analysis of text structure: text markup, punctuation
- Text normalisation: $200 -> 200 dollars
- Text-to-phoneme conversion
  Phoneme: basic unit of sound in language
- Prosody analysis: Emphasis, speaking rate, ...
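
The text normalisation step can be illustrated with a very small Python sketch that covers only the dollar example from the list above. Real synthesisers handle far more cases and are language dependent; the function name `normalise_dollars` is hypothetical.

```python
import re


def _spoken(m: "re.Match") -> str:
    """Turn one "$<digits>" match into "<digits> dollar(s)"."""
    amount = m.group(1)
    unit = "dollar" if int(amount) == 1 else "dollars"
    return f"{amount} {unit}"


def normalise_dollars(text: str) -> str:
    """Rewrite every "$<digits>" in the text, e.g. "$200" becomes
    "200 dollars". Cents and other currencies are not handled."""
    return re.sub(r"\$(\d+)", _spoken, text)
```

For instance, `normalise_dollars("$200")` gives `"200 dollars"`; later steps (text-to-phoneme conversion, prosody) would then work on the normalised string.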

Markup for Text Normalisation

"say-as" Element

<say-as type="currency">$20.45</say-as>
<!-- twenty dollars and forty-five cents -->

Spoken text is language and platform dependent
"say-as" supports
- Acronym ("USA")
- Number
  - Ordinal ("sixth")
  - Digits: read digit by digit ("6 1 7")
- Date format (day, month, year; month, day, year; ...)
- ...
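
As a rough illustration of how a synthesiser might act on "say-as", the Python sketch below expands two of the cases listed above. It is a sketch under assumptions: the type value "acronym" appears in the prompt example earlier in these slides, while "digits" is a stand-in label chosen here and not taken from the draft; spelling an acronym letter by letter is one plausible reading; `render_say_as` is a hypothetical helper.

```python
import xml.etree.ElementTree as ET


def render_say_as(markup: str) -> str:
    """Return the text a synthesiser might speak for one say-as element."""
    elem = ET.fromstring(markup)
    if elem.tag != "say-as":
        raise ValueError("expected a say-as element")
    text = (elem.text or "").strip()
    kind = elem.get("type")
    if kind == "acronym":
        # Spell letter by letter: "USA" -> "U S A"
        return " ".join(text)
    if kind == "digits":  # assumed label, not from the draft
        # Read digit by digit: "617" -> "6 1 7"
        return " ".join(ch for ch in text if ch.isdigit())
    # Unknown type: leave the text for the default normalisation.
    return text
```

Because the spoken form is language and platform dependent, a real implementation would select these expansions per language rather than hard-code them.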

Changing Voices

<voice gender="female" category="child">
  Mary had a little lamb
</voice>

Markup for Prosody

"emphasis" element
"break" element: control pause between words
"prosody" element
- Speaking rate ("fast", "medium", ...)
- Volume
- ...

Including Recorded Audio

<audio src="prompt.au">
  What city do you want to fly from ?
</audio>
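
The text inside the "audio" element serves as a fallback that is spoken when the recording cannot be played. A minimal Python sketch of that decision follows; the function name `choose_output` and the `exists` parameter are hypothetical, and a local file check stands in for fetching the recording from a web server.

```python
import os
import xml.etree.ElementTree as ET

AUDIO_XML = """
<audio src="prompt.au">
  What city do you want to fly from ?
</audio>
"""


def choose_output(markup: str, exists=os.path.isfile):
    """Return ("play", src) if the recording is available,
    otherwise ("speak", fallback text) for the synthesiser."""
    elem = ET.fromstring(markup)
    src = elem.get("src")
    # Collapse the indentation and line breaks of the fallback text.
    fallback = " ".join((elem.text or "").split())
    if src and exists(src):
        return ("play", src)
    return ("speak", fallback)
```

Passing `exists` as a parameter keeps the sketch easy to try without an actual prompt.au file on disk.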

Example Implementations

Tellme Studio allows anyone to create a voice application and access it via phone
- http://studio.tellme.com
- +1-880-555-TELL
- +1-408-678-446 (International)
VXI interpreter (Open source, SourceForge)
BeVocal
HeyAnita
IBM Voice Server SDK
Motorola
Nuance
PipeBeach
Telera
VoiceGenie
...

Status of Drafts at W3C

VoiceXML 2.0: Working Draft
Speech Grammar: Last Call Ended
Speech Synthesis: Last Call Ended

Summary

Lots of development around VoiceXML
Bringing development of Voice applications to the masses
Fun to play with - try it out!

More Information

```
http://www.w3.org/Voice/
```