W3C Voice Browser Workshop Minutes

13th October 1998, Cambridge, Mass

Summary

The workshop brought together people from a broad variety of organizations working on voice technology. The presentations covered a diverse range of approaches, and helped to reveal variations in requirements. The workshop participants were broadly split into two groups according to whether they felt voice interaction required its own markup language or whether HTML and style sheets could be adapted to meet this need. There was strong support for W3C to set up an Interest Group to study opportunities for joint work on Web standards for Voice Interaction.

Introduction:

Dave Raggett/W3C/HP (DR): W3C founded Oct 94 MIT Cambridge, next year INRIA joined, Aug 96 Keio became 3rd host. Over 270 diverse members, vendor neutral forum for development. Advisory Committee helps run W3C, meets semi-annually. Tim Berners-Lee, inventor of Web, directs. Some W3C team members visiting engineers, Dave Raggett from HP. Work starts as working drafts, then proposed recommendation then Recommendation. Working Groups, Interest Groups where development goes.

DR HTML created by Tim at CERN. Very simple, enabled people to access common docs. Expanded on at NCSA. Standard of HTML 2.0, represented status mid 1994. Fall 1995 W3C brought Web vendors around the table to bring together evolving standard into HTML 3.2, then eventually 4.0. Building in other features. Accessibility, tables, forms, multimedia. Also Cascading Style Sheets. May 1998 workshop on future of HTML. Need to think beyond desktop systems. Now need to re-formulate HTML within XML. Browsers need to be able to repurpose content. Work with performance profile: what a particular device will have to do: syntax, semantics. Content providers will be able to know who supports what. Proposal is to have profiles written in RDF (W3C's word for metadata). Also tied in with development of Mobile Access, following workshop in Japan earlier this year. Transformation tools allow content to be re-purposed to match device profiles. Source doesn't even have to be in HTML.

DR What should W3C role be in promoting this? Allow higher quality speech synthesis and more accurate speech recognition. Need ways to combine different media more flexibly, and to ensure accessibility in doing so. Towards the end of the day we should evaluate whether to launch an ongoing interest group or working group.

Presentation: Markku Hakkinen (MH), Productivity Works (PW)

MH Taken a journey through accessibility, voice browsing. Looked at requirements of voice browsing for people with visual disabilities, and the HTML requirements. Go through telephone browsing & other carry over applications. Several years ago was building an internet-based information kiosk, exploring issues in Web access for visually impaired. Took it as a challenge. Technology for visually impaired is screen reader, renders into speech or Braille output. PW looked at whether could bypass visual display step.

MH T.V. Raman, at Adobe, developed Emacspeak which also looks at foundation HTML. Tables, forms. Look more at what is behind the scene. What does the underlying HTML mean, what is it used for. W3C took on the Web Accessibility Initiative. Some successes in terms of table mark-up that gives more semantic information.

MH Wanted non-visual client to be an equal client to visual browsers. Also wanted to exploit ACSS (Audio CSS) for navigation. Define a tag set that would define how elements would render into speech. Went through first prototype of PWWebSpeak. Rudimentary. Not as well synchronized between text & speech display, problematic for people with learning disabilities. Later added support for some of the access features in HTML 4.0 such as titles, etc. Some issues on support of access key. First support of voice browsing by phone using PWWebSpeak pioneered by an organization in Japan in 1997.

MH Also developed protocol for Digital Talking Book standard: Hybrid Full Text/Full Digital Audio content. Can be device independent: phone, CD's, over the Web. Interoperability testing last week went excellently.

MH Voice Browsing is here to stay. What is missing from it so far? Form-filling? How do you know intent of form? Are there ways to pre-fill it w/out need for dialog? Also, voice-browsing toolkit, where does VB and CTI app development begin? Also: all Web content needs to be authored so it is universally accessible, device-independent, client-independent.

MH Demo: of a voice-browser. "Page is ready. Mark Hakkinen. Eight msgs. Want to read NYTimes?" MH is navigating by hand on a speaker phone. 50 different functions to navigate by hand. Have been working with several speech recognition engines to drive these over the phone.

MH Demo: of a SMIL player. Prototype talking book player. Showing navigation by tree structure, by top level headings, or by lower level headings. In a SMIL presentation can increase the time scale, to speed listening. Can run over telephone browsing. Runs over a telephone system in Japan.

Tomasz Imielinski (TI)/Rutgers: Like to hear more about applications interface.

MH use Webspeak engine. can use different layers? voice recognition layer on top of underlying object. skip to an element. navigate hyperlinks.

TI do the keys recognize touch-tone?

MH looking at a variety of interfaces, touch-tone among them.

Michael Wynblatt (MW)/Siemens: how much effort is involved in building a SMIL demonstration like the one shown?

MH authoring process very straightforward. begin with document. import into recording studio system. there are parts of the system that do the synchronization. meant for someone who works in a talking book recording studio to build these books.

DR have you tried this with streaming audio?

MH not yet tested that.

DR challenge is to predict the time needed to fill the buffer. Also, was that a skilled reader recording it?

MH not always, often untrained volunteer, kind of person who records in talking book studio.

DR if you don't have a pre-recorded audio, how well can you automate it?

MH also looking at speech synthesis in hand-helds.

Presentation: Jim Thatcher (JT), IBM Special Needs System.

Great that the first two presentations deal with accessibility. People with visual disabilities have had to put up with information in visual two-dimensional space for too long. Had the first ? in 1970's. Then the first screen reader in 1980's was IBM's. Now the term has become generic. In the 1990's had screen reader for OS/2. Now have Home Page Reader. Mark & his colleagues are leaders in that field. Blind readers put up with an awful lot. The process of trying to render all content with speech is difficult. Home Page Reader released in Japan in 1997. IBM Japan. Surprised the Special Needs organization in the States. Chieko Asakawa is the main designer. She also was involved in the translation of the OS/2 screen reader. Special Needs Systems in IBM Japan doing Home Page Reader 2. Receives HTML from Netscape. Uses IBM's ViaVoice Outloud. How about using a high-quality text-to-speech... the DEC 10 years ago produced about the same quality speech. Delighted with the quality of speech in ViaVoice. Use the numeric keypad as input device. Don't do anything with CSS, SMIL, DHTML, JavaScript. It's extremely difficult to present content to blind readers. It's designed by blind people for blind people.

There's a stop key, a play key, a fast-forward key. And previous, current, next item. Current link key. Can enter a URL. Can search on page or net. Lots of settings. Variable speech rate. Blind users are used to text to speech. ViaVoice goes 340 words per minute. Requirement is to go up to 700 words per minute. People learn to hear at very high rates. There's a page summary. Number of titles, tables, forms, items in length. A "where am I" key uses that context. Like a gross way of saying the structural information.

Hardest part of access is: people who browse visually filter a lot of information, including dynamic content. Need to be able to match this. If your search engine has 22 matches... Don't have to listen to all the ads. ACB (American Council of the Blind) Website gives you a "skip over navigation" option at top of page. Another feature of via voice is that links are spoken in a female voice... can be adjusted. Have to know what links are as you read, in context. Really believe that was key. One of the first requests for beta, how to turn off, can now change to pings. Searching, skipping, jump keys. Fast forward is a scanning strategy. Then lots of different jump keys. Jump between structures.

Several other features still very boring despite efforts of the WAI group. Mailer works well with access keys. Home page reader menu provides menus for navigation. HTML 4.0 support: use of headers... decided to put in speaking headers of table cells. But problem in implementation, resulted in too much verbiage for some tables.

Andrew Forest, AT&T: Can we hear a sample of Via Voice

JT: at a break, yes.

Rajeev Argawal, Texas Instruments: given the problem with pressing keys, have you experimented with speech recognition, you'd only need 50 words for commands.

JT: will, haven't yet.

MI: am using home reader in Japanese. But English speech quality is poor.

JT: Pro-Talker speech engine works poorly in English. Should

JT: need to be able to support different speech engines to get better quality sound.

JB: hope that effective speech recognition combined with speech synthesis is available soon; increases usability for general population in eyes-busy & hands-busy situations, as well as for other disabilities such as people with physical disabilities who can't use keyboard.

[coffee break]

Voice Browser workshop, Oct 13, 1998, after mid-morning break:

[Dave: check the notes for your own presentation. There are several one-to-two minute gaps from when I was talking with the AV guy]

Presentation: Dave Raggett (DR)/W3C/HP Will present content of a note that DR & Or Ben-Natan did in a W3C Note earlier. The bigger the market the better. If you can get voice browsers to access as much content as possible... Pouring content from a database. For dealing with today's content, should be possible to use heuristic techniques. Many pages that prevent even obvious gestures towards accessibility. Helpful to have for instance alt attributes for image maps.

DR CSS2 -- number of things for positioning. Will enable phase-out of use of tables for layout. If the data model is more explicit, can give a good browsing experience aurally. In next generation HTML, will try to improve forms, to support aural presentation. Mobile Interest Group also looking at issues involved in repurposing content. CSS2 has a range of features for speech: speech rate, voice family, pitch, spatial direction, volume, stress, richness. Pause, cue properties. Features compare well with SABLE mark-up language for speech synthesis. Pronunciation hints from authors help. Options for handling selection errors. Prompting. Capability to support pronunciation dictionaries & to provide aural cues.

DR How do you give an attribute for a heading? There are new techniques from HTML 4 for this. But most of the content out there is not marked up that way. Selection concept is to add new events to HTML:
- OnSelectTimeOut: fires if you haven't selected something in a certain timeframe
- OnSelectionError: allows author to specify options if the wrong option is selected
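
To make the two proposed events concrete, here is a minimal Python sketch of one selection step. It is only an illustration: the `select` function, the handler parameters, and the sample options are all hypothetical and are not taken from the Note or from any browser API.

```python
from typing import Callable, Optional, Sequence


def select(
    options: Sequence[str],
    heard: Optional[str],
    on_select_timeout: Callable[[], str],
    on_selection_error: Callable[[str], str],
) -> str:
    """Resolve one spoken selection against a list of options.

    `heard` is the recognized utterance, or None when the caller said
    nothing before the (hypothetical) timeout expired.
    """
    if heard is None:
        # Nothing selected in time: plays the role of OnSelectTimeOut.
        return on_select_timeout()
    normalized = heard.strip().lower()
    for option in options:
        if option.lower() == normalized:
            return f"selected: {option}"
    # Utterance matched no option: plays the role of OnSelectionError.
    return on_selection_error(heard)


options = ["news", "weather", "stocks"]


def reprompt() -> str:
    return "Please choose: " + ", ".join(options)


def explain(heard: str) -> str:
    return f"'{heard}' is not an option. " + reprompt()
```

Calling `select(options, None, reprompt, explain)` returns the re-prompt, while an unknown word such as "sports" is routed to the error handler.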

DR Other approaches include adding an attribute to HTML. Use of switch element from SMIL (if you don't use this, use this instead) this could be added directly to HTML. When you want to use speech recognition, handy for author. Consider development of speech grammar format for commands/ or selection acknowledgement.

DR If this group interested, W3C can develop briefing package for formal activity in this area. Opportunity to drive next generation of the Web.

Discussion:

___? Your slide about scanning -- you said digital scanning easy? Any comments about that?

DR Lots of practical experience about that in this room, didn't go into in depth. Also should look at DOM and scripting.

Presentation: George White (GW) /General Magic

GW Portico service gives voice access to intelligent agent. to business information, intelligent messaging, important public information, exec call handling research assistant, investment manager. customized newspaper. allows synchronization of information onto PDA's & make it available while traveling etc. toll free dial in service. our architecture is... bank of pc's, SR servers, text to speech servers...expanded view... dish that downloads.

GW demo over the phone to Portico. Going through email. Taking commands of playback email. Then recorded a speech message. Uses an auditory desktop in place of a visual metaphor. Accepts barge-in (interrupt). Checking on mail forwarding capabilities. Also making an appointment. Personable interface. Requested fax. Demoing call-back of old voice mail messages. Going to address book. Narrates menu options. Going to stock report for General Magic.

GW uses in car driving to work, works in a noisy environment. This phone had a higher misrecognition rate than usual. User needs an audio player. Access to the Web is done behind the scenes to get news and stock quotes. Working on building in an explicit portal to the Web. Acquired NetPhonic within the past year. Looking at VoxML which could make pages more browsable, scopable. Also doing customized interfaces inside General Magic.

Discussion:

Jim Colson, IBM: on slide saw synchronize GUI & voice, but didn't see GUI in demo.

GW yes we have GUI at portico.com and we're synchronizing with that so can see as browse visually.

John Burger, MITRE; what vocabulary size for SR engine.

GW it is vendor independent. perplexity: 15,000 stock quotes. we do it phonetically. Other pts in achieving highest quality of grammar recognition over phone. Linguist types in phonetic strings w/ co-articulatory phenomena.

JBurger: more about dialog tree?

GW: built a huge dialog tree, w/ help of linguists, scripters, hollywood folks...

Rajeev/TI: has dynamic grammar? and context sensitive help?

GW: yes. uses concept of graceful help. Barge-in is a great help. but want to track how often user has to interrupt; two options available. Novice option detects multiple looping, gives different level of support. Expert level also available.

JBrewer: what was going on in email reply demo, that didn't sound like speech-to-text-to-speech, rather digitized.

GW: reply becomes an audio file that is attached to the email message, and user needs a real audio player to access it.

Presentation: David Stallard (DS)/BBN

DS VADAR: Telephone interface to military cargo shipments

DS EMALL Interface: Telephone interface to Defense Logistics Agency's EMALL Website.

DS Talk'n'Travel: phone interface to commercial air travel Websites. free-form language input allowed.

DS Dialog System Architecture: Dialog manager, written in Java except for voice recognizer, too slow. DIABOLIC dialog rule language. Scripting language for Dialog Manager objects. Rule elicits constraints on attribute. Prompt provides content of what to say. Action is what to do following assimilation of that constraint. Meaning is represented as frame structure; language understanding done based on simple patterns; language generation also template driven. Three stages: fetching; parsing; groveling (crawling through the tree structure to find the data that's needed).
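
The rule structure described above (a rule elicits a constraint on an attribute, carries a prompt, and names an action to run once the constraint is assimilated) can be sketched roughly as follows. This is not DIABOLIC syntax, which the minutes do not show. The `Rule` class, the function names, and the shipment attributes are hypothetical stand-ins, and the frame is simplified to a flat dictionary.

```python
from dataclasses import dataclass
from typing import Callable, Dict, List, Optional

Frame = Dict[str, str]  # meaning kept as a flat attribute -> value frame


@dataclass
class Rule:
    attribute: str                   # constraint this rule elicits
    prompt: str                      # what to say to elicit it
    action: Callable[[Frame], None]  # what to do once it is assimilated


def next_prompt(rules: List[Rule], frame: Frame) -> Optional[str]:
    """Return the prompt of the first rule whose attribute is still unfilled."""
    for rule in rules:
        if rule.attribute not in frame:
            return rule.prompt
    return None  # every constraint is known; nothing left to ask


def assimilate(rules: List[Rule], frame: Frame, attribute: str, value: str) -> None:
    """Store a constraint in the frame and run the matching rule's action."""
    frame[attribute] = value
    for rule in rules:
        if rule.attribute == attribute:
            rule.action(frame)


log: List[str] = []
rules = [
    Rule("origin", "Where is the shipment leaving from?",
         lambda f: log.append("origin set")),
    Rule("destination", "Where is it going?",
         lambda f: log.append("ready to look up")),
]
frame: Frame = {}
```

The dialog manager would loop on `next_prompt` until it returns `None`, feeding each recognized answer through `assimilate`.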

DS Poses question: is HTML suitable for voice access? A lot of tags that aren't semantic that are in the way, have to wade through those. Need to wind up doing natural language understanding more than should have to. and the page layouts do not facilitate good voice access; have to repackage the info a lot. Need to find specific info on a table.

DS What things would improve? XML. Having explicit markup of semantic content. VoxML work at Motorola will help. Our language is more procedural, descriptive, whereas VoxML is more declarative. Ours is more expressive, but probably harder to use, but may help with client-side processing.

DS Demo: dialing in. querying re a new TCN. spelling out. tracking military shipment between air force bases. it goes out and gets the info from the Web. And then it checks to make sure the caller is really done.

DS Discussion:

Or Ben-Natan (OBN), Microsoft: Want to understand more about VoxML, why need to upgrade or replace HTML? You wanted semantic description of info to create an automatic dialog to rephrase questions or present alternatives?

DS allow to reformulate data. need to be able to get at that flight data.

OBN you hope to get that from VoxML?

DS VoxML would enable you to create scripts that have all the data in it.

DR couldn't that be supplied by an agent interface? W3C next generation should help.

Michael Wynblatt/ Siemens: talk about the trade-offs between this approach, and going straight to a database through a phone call-in standard.

Brian Altman, Applied Technologies: Meta tags from W3C, and other developments, that relate to this, have you looked at any of that?

DS not yet, no.

Tomasz Imielinski/Rutgers: how domain dependent is this? What would be involved in porting it to another domain?

DS less than a month, we've done it before.

Presentation: Rajeev Argawal (RA), Texas Instruments: Voice Browsing the Web for information access.

RA there will be an explosion in speech-based access systems & we need standards. Functional categories: (1) web browsing with speech-enabled interface. (2) limited information access: info from Web but designed for speech & scripted somehow. (3) spoken dialog systems: monitors everything the client does. Sometimes these categories overlap.

RA Spoken dialog systems: (1) graph-based. represents entire dialog interaction, mostly system initiated (2) frame-based. info needed to complete user query, mostly mixed initiative, user can specify slots (3) plan-based. .... more portable.

RA Texas Instruments developments follow:

RA Web Browser, Voice Browser. speakable links, bookmarks, browser commands, smart pages, no audio support

RA InfoPhone get info on flights, stocks, weather, voice i/o only, not much display, customizable

RA Dialog Manager: conversation between human & machine takes place at domain independent level mixed initiative frame basis, either can start a sub dialog

RA Remote E-Mail, Voice-mail: can filter, categorize, navigate email; will develop voice-mail send & receive

RA Voice navigation: maps & directions, etc businesses coupled with GPS on client side

RA Design Issues: - need better UI for Voice I/O - portable dialog managers... - for wireless: ...?

RA Need additional handlers for errors: OnRejectionError for when the SR engine doesn't have enough confidence; OnHelpRequest. Need a "Speech" media descriptor; maybe look at JSGF (Java Speech Grammar Format).

RA It helps to have dynamic grammar capability. Extensions will be most effective for Web browsing and limited information access apps, less so for mere Web-enabling of an IVR (Interactive Voice Response) system. IVR systems are graph-based. Merely speech-enabling that interface won't be most helpful; the system-initiated prompting and client responding are frustrating.

RA Frame-based approaches enable people to just say what they want to say.

GW Excellent presentation. IVR different perspective. Consider social aspect of dialogs. Sore point is that IVR takes control away from the user.

___?/Microsystems. Most people don't like even using touch tone systems. There are usability issues with touch tone.

Ramish/ fixed grammar does improve accuracy; but with links, people may want to add words before or after; should have options such as free text. With LVCSR-based systems, will the words be transcribed automatically? How much flexibility to have?

RA it's the dynamic grammars that improve the recognition. smart pages have embedded grammar. knows what to expect you to say. you look at a generic page. there is a way to incorporate LVCSR (large vocabulary continuous speech recognition); can just go to the link.
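
One way to picture a dynamic grammar for speakable links is to rebuild the set of expected phrases from each page as it loads. The Python sketch below does this with the standard library HTML parser. It is an assumption about the general technique, not TI's implementation; `LinkCollector`, `build_grammar`, and `recognize` are hypothetical names, and `recognize` stands in for a real speech recognizer by doing an exact text match.

```python
from html.parser import HTMLParser
from typing import Dict, List, Optional, Tuple


class LinkCollector(HTMLParser):
    """Collect (link text, href) pairs from a page."""

    def __init__(self) -> None:
        super().__init__()
        self.links: List[Tuple[str, str]] = []
        self._href: Optional[str] = None
        self._text: List[str] = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            self._href = dict(attrs).get("href")
            self._text = []

    def handle_data(self, data):
        if self._href is not None:
            self._text.append(data)

    def handle_endtag(self, tag):
        if tag == "a" and self._href is not None:
            text = " ".join("".join(self._text).split())
            if text:
                self.links.append((text, self._href))
            self._href = None


def build_grammar(html: str) -> Dict[str, str]:
    """Map each speakable phrase (lower-cased link text) to its target."""
    collector = LinkCollector()
    collector.feed(html)
    collector.close()
    return {text.lower(): href for text, href in collector.links}


def recognize(grammar: Dict[str, str], utterance: str) -> Optional[str]:
    """Return the link target for an utterance, or None if out of grammar."""
    return grammar.get(" ".join(utterance.lower().split()))
```

Because the grammar only contains the phrases on the current page, the recognizer has a small, known vocabulary at each step, which is the accuracy benefit RA describes.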

[Break for lunch.]

Minutes Voice Browser WS Afternoon

Philipp Hoschka, W3C

Towards Improving Audio Web Browsing

surfing the web in a car/over the phone
want to render everything on the web
using HTML, not specific markup
download HTML - analyze structure - render in audio as good as possible
browsing modes
basic mode
content mode: navigation sections == clusters of links, skip over those, link density metric used to determine which is one
navigation mode: jump only to navigation sections
headline mode
can change modes while browsing
problem: deducing intended meaning from the HTML: bold tags instead of h2, table for layout, appropriate order for frames
html extensions:
explicit tag for delineating navigation index sections
language marking
phonetics
applet/script: meaningful alternative needed
...
extend SMIL with events
jim xxx: Q: which software synthesizer used for language detection ? does it take time to switch ?
A: any sapi compliant engine, takes time to switch
Q: headline mode: based on headers - any commercial sites using those ?
A: shrug (not really)
Q: when you go to yahoo, what about links there ?
A: will use normal mode for yahoo
Q: navigation clusters, navigation density - we tried that and we failed
A: yes these are heuristics
A: see reference in paper for details
Q (Sun): the page you showed as bad example won many awards
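
The content mode above relies on a link density metric to spot navigation sections. The minutes do not give the exact formula, so the Python sketch below assumes one plausible definition: the share of a section's non-whitespace text characters that sit inside links. The 0.6 threshold and the function names are likewise assumptions, and as the discussion notes this is only a heuristic.

```python
from html.parser import HTMLParser


class _DensityParser(HTMLParser):
    """Count text characters inside and outside <a> elements."""

    def __init__(self) -> None:
        super().__init__()
        self.depth = 0        # nesting depth of open <a> elements
        self.link_chars = 0
        self.total_chars = 0

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            self.depth += 1

    def handle_endtag(self, tag):
        if tag == "a" and self.depth > 0:
            self.depth -= 1

    def handle_data(self, data):
        n = len("".join(data.split()))  # ignore whitespace
        self.total_chars += n
        if self.depth:
            self.link_chars += n


def link_density(fragment: str) -> float:
    """Fraction of non-whitespace text characters that are link text."""
    parser = _DensityParser()
    parser.feed(fragment)
    parser.close()
    if parser.total_chars == 0:
        return 0.0
    return parser.link_chars / parser.total_chars


def is_navigation_section(fragment: str, threshold: float = 0.6) -> bool:
    """Treat a section as navigation when most of its text is links."""
    return link_density(fragment) >= threshold
```

A row of links separated by bars scores close to 1 and would be skipped in content mode, while a paragraph with a single inline link scores low and would be read.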

Considerations in Producing a Commercial Voice Browser

current use of alt tags is counter-productive for speech recognition due to improper use.
alt tags that properly reflect the associated graphic (say what you see) can help speech browsing considerably.
we need style guidelines to have voice-browsable pages
improvement suggestions:
voice-friendly html
style guidelines: avoid using a url as the link text, use text instead that people can read (reading a url is hard), same for e-mail address
avoid ambiguous links: click here, click here, ...
avoid term click on page, select is better (less mouse-centric)
...
avoid server-side image map
...
need to address hardware/software clashes in standards
Peter XX, XX systems: where do you get pronunciations from ?
A: from dictionary, and prop algorithm if not in dictionary
Q (Rami XX): what about search ?
A: (missed)
Q: nils holland, att: great demo. what about patents ?
A: provisional patents pending on a lot of these things, trademarks as well, talk to us if you're interested
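
The style guidelines listed above lend themselves to a simple automated check on link text. The Python sketch below is one hypothetical way to do it. The phrase list, the regular expressions, and the problem labels are assumptions chosen for illustration and were not part of the presentation.

```python
import re
from typing import List

# Phrases that say nothing about the link target when read aloud.
AMBIGUOUS = {"click here", "here", "more", "link"}
URL_RE = re.compile(r"^(https?://|www\.)\S+$", re.IGNORECASE)
EMAIL_RE = re.compile(r"^[^@\s]+@[^@\s]+\.[^@\s]+$")


def lint_link_text(text: str) -> List[str]:
    """Return the guideline problems found in one link's visible text."""
    problems: List[str] = []
    cleaned = " ".join(text.split())
    lowered = cleaned.lower()
    if lowered in AMBIGUOUS:
        problems.append("ambiguous")
    if URL_RE.match(cleaned):
        problems.append("raw-url")
    if EMAIL_RE.match(cleaned):
        problems.append("raw-email")
    if re.search(r"\bclick\b", lowered):
        problems.append("mouse-centric")
    return problems
```

For example, `lint_link_text("Click here")` reports both an ambiguous and a mouse-centric label, while descriptive text such as "Workshop agenda" passes cleanly.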

PhoneBrowser: A Web-Content-Programmable Speech Processing Platform

telephone access to the WWW
no pc needed
make use of existing content, no additional content needs to be created
user doesn't have to buy extra equipment
internet access provider buys phone browser
applications for browsing via voice commands:
have a device that doesn't have keyboard (TV, ...
hands & eyes busy: car
handicapped
rich grammar specification tags
grammar specification in attribute, or referenced via url
have a language called GSL that they used for a long time
can also add semantics
HTML as finite state controller (TTS = text-to-speech)
need ocr in web images/video (...)
Spyglass announcement: will cooperate with Lucent
can support up to 6 users simultaneously
72 channels per box
Q (michael windblad, CNET): can't browse within a document ?
A: no you can do that. finite state systems work fine, despite criticism
Q: peter wolf: how do you do pronunciations ?
A: dictionary and pronunciation rules
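
The idea of HTML as a finite state controller can be illustrated by treating each page as a state, its text as the prompt to speak, and its links as the transitions that a recognized utterance may follow. The Python sketch below is a toy model under those assumptions. The `PAGES` table, the `run` function, and the wording of the prompts are hypothetical and do not describe the PhoneBrowser platform itself.

```python
from typing import Dict, List, Tuple

# state name -> (prompt to speak, {spoken phrase: next state})
Pages = Dict[str, Tuple[str, Dict[str, str]]]

PAGES: Pages = {
    "home": ("Welcome. Say news or weather.",
             {"news": "news", "weather": "weather"}),
    "news": ("Here are the headlines. Say home.", {"home": "home"}),
    "weather": ("Today is sunny. Say home.", {"home": "home"}),
}


def run(pages: Pages, start: str, utterances: List[str]) -> Tuple[str, List[str]]:
    """Walk the page graph; return the final state and everything spoken."""
    state = start
    spoken = [pages[state][0]]
    for utterance in utterances:
        prompt, links = pages[state]
        target = links.get(utterance.strip().lower())
        if target is None:
            # Out-of-grammar input: stay in the same state and re-prompt.
            spoken.append("Sorry, I did not understand. " + prompt)
        else:
            state = target
            spoken.append(pages[state][0])
    return state, spoken
```

Running `run(PAGES, "home", ["news", "sports", "home"])` follows one link, rejects the unknown word without leaving the news page, and then returns home.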

SABLE: A Standard for TTS Markup

initiative to develop a standard
relationship to CSS
requirements
...
multilingual
easier to use
should not require specialist knowledge of speech science
but: should be able to use it
portable system
extensible system
tagset: speaker directive: emph, break, pitch, rate, volume, audio, engine (talk to particular engine), pron, language (of text), speaker (age, gender... of speaker)
text description: sayas (different ways of treating text, e.g. how to pronounce day format), div
x-me-pron: non standard tag for extensions
why not CSS ? W3C interested in common content - aural CSS only one way to render a web page that can be rendered also differently. need a special purpose markup for e-mail etc., so you don't have to send e-mail in HTML
public effort, specification publicly available
Q (Dave Raggett): did you do any work on converting into aural CSS and HTML ?
A: no
Q (Rajev XX): what is the relation to jsml ?
A: based on ssml (?) from edinburgh university, relationship also to .., jsml did similar thing - we tried to merge
Q: will java adopt sable ?
A: partly - issues with multilinguality
Q (Michael Windblat, CNN): what about sapi (sp??) ?
A: sapi not consistently implemented, some are not implementable (mood), choice of tags could be improved, not structured
Q: missing implementations do not mean that spec is bad
A: ok, missing structure representation is a problem - get the cues right - any additional hints on structure are valuable

ADML - the language to create AudioWeb; hyperlinked collection of audio pages

access to web using touch tone phone and TTS
ADML - Audio Data Manipulation Language - simple authoring language to create audio web pages
accessing via voice is a completely different thing than visual access, needs completely different structure
have a specific language for this
working on making it XML compatible
background: wireless, mobile access to the web
students developed services based on content on the web - audio services
audio requires different style of authoring
services: weather, stocks, horoscope, rutgers library, audio resumes, us news, ...
it is not only a matter of presentation, it is a matter of structure - we have thus rejected the concept of style sheets - it is not just visual aspect plus style
we need a separate, XML defined language such as ADML
HTML with additions will not be a solution
agree with motorola
you need a new standard - but not html - look at hdml, same idea
student task: "take this website and convert it to an audio service" within one semester
lessons learned
audio pages must be small
browsing is not desirable - the more direct the access to information the better
browsing is not a feature, but a bug (on the phone)
tools to create audio content - both at compile time and at run time
Q (dave ..., motorola): experience with touch-tone only ?
A: yes, but working on speech input. we have features in language for speech input
Q (dave raggett): what is the problem with HTML ? do you think it can be fixed in the future ?
A: seen a lot of encouraging signs in html 4.0. didn't mean to sound too negative about html. html has high learning curve and was designed for visual information. many audio pages will correspond to one visual page. you will not get away by turning your current website into audio service

Access to WWW through an audio device (ATT)

voice browsing based on html is not viable - we don't think html can do this
Interactive voice response systems no good
html-based systems are better: a dream
We used our own markup language for IVR creation
found limitations of html
html for VB: reasons for enthusiasm
html interpreted as a state machine
text -> speech
input --> little dialogues
hyperlink --> audio icons
Q: why is html for IVR systems no good ?
A: IVR no good: people are impatient - non-rapid dialogues aren't good - announcer talks too fast and you want to repeat it - html: will come to that
criticism of dave raggett / Or Ben-Natan example
voice augmentation
need to add content regarding presentation to html
examples: initial prompt, help does what ? what outer navigational options should be available
representational contents cannot be calculated from an original html document
need own language: PML and VoxML
Requirements:
emphasis on content, control flow, dialogues, handling of exceptional situations
abstraction - must be declarative - not contain any scripting, speech api's, C++, ...
Control flow: affected by previous input, dialog events, important asynchronous events
HTML reuse: static scoping, possibly hyperlinks, META, LINK, FORM, SELECT
speech standards reuse: text-to-speech markup (e.g. sable consortium), speech grammar markup
Conclusion:
content providers want to target IVR directly by specific language, not by html, even with extensive style
fully explicit markup needed for voice rendering
could be part of modularized HTML
Q (Rajev XX, XX): voxml - still for graph based system to implement it for a web scenario ?
A: yes, based on flow-diagrams, still preferred means to build IVRs, very simple
Q: plans for extensions ?
A: yes, will evolve, would like to see VXML
Q (Or Ben-Natan, Microsoft): statement: don't understand why you are pessimistic about HTML. instead, you criticised my example. criticism was valid. want to work on making html better for IVR - we would support this sort of effort. we've been implementing IVR based on html - we're very happy with it, and so are our customers. successful. don't think a new language is needed. modifying an existing standard is hard - but we should think before doing something new
Q (sun): statement: agree that html is not the right approach. what about dialogue system ?
A: need to remind user of state - example: prompt in example

Panel

where should we be moving next ?

Jim Thatcher

we need heuristics - liked link density metric

what's next ? hope there will be an interest group on voice browser - we'll participate

Mark Hakkinen

diversity of opinion on use of html

html is usable for voice browsing, especially 4.0

need more information like this - what is the purpose of this element etc. ?

our audience are non-visual people, don't think that voxml addresses this

like to see extensions of html4

???,???

there will be an alternative way to access web, and it will probably be voice

liked VXML proposal

we also need agent controls

ken bartlett, html writers guild

main concern; writing html

hard enough to get people to do something simple like including alt tags

not sure whether they want to learn another language

or present content in two forms

i think: use html, adapt it, extend it, rather than doing a separate language

perceived difficulties could be improved if web designers doing practical work would have a better understanding of how to use existing means in html to build better web sites

Richard Sproat

speech output interest

There are a number of proposals to control output of voice synthesizers: CSS, SABLE, etc. It is critical that there be at least one scheme, but if there is more than one, we need standard tools for conversion. However, we don't need half a dozen schemes.

Researchers should be involved, not only user community of speech synthesizer.

Some properties can be quite contentious issues, e.g. pitch-range. It's not really clear what the set of tags should be. Best deciders: people working on the speech synthesizers. We need a lot of interaction with people that do the technology.

ATT

need both extensions to HTML and a new language

W3C could be forum to coordinate this

Lucent

less change to html

not enough to have own language

siemens

we have heuristics

reduce ambiguity in markup language so that heuristics work better

make it easier to do interactive presentations

SMIL could help

Or Ben-Natan, Microsoft

html is a defacto standard for defining information

heard a lot about that html is not good enough for IVR

not convinced

html needs to be improved, but not much is needed

are we going to make new standard for each new device ?

Rajev

need a frame-based mechanism in the language - specify what kinds of slots a frame has - grammar based slots

Charles Hemphill, Conversational Computing

Surprised about focus on telephone access. In our product, html is a good match. we don't want to get rid of the screen for the voice browsing application. Voice dialog over the telephone without a screen is a challenge.

Rutgers

whether html or a new language: there are web-aware users and web-unaware users. first group knows the web site, and wants to go to the third page. web-unaware (former IVR users): html extensions may not be appropriate

problem about html: my language could be defined as a subset of html - my problem: cannot imagine designing for audio user and visual user at the same time - completely different designs

only against html when talking about web-unaware users - makes a lot of sense for handicapped users

support special interest group

Judy

universal design explanation

explanation of WAI

www.w3.org/WAI

brewer@w3.org

use universal design approach to repurpose content for different uses - this may apply to some of the questions discussed today

agree with ken: people won't do content several times

Floor

Dave Raggett: any questions ? what should future work address ?

Windblad, CNET: link density metric and other metrics: we have a paper at IEEE Multimedia

new language or not: both may be needed. question: when do you need what ? under which context ?

jim close (?): browsing is not the goal: important point. think about kind of content we're trying to render. notion of separation of content and presentation makes sense. new markup language for new device: yes, see WAP forum, so seems needed. deliver content, decide about rendering once you know the device capabilities. "voice browser" seems to cover a lot of different devices. not sure we are going to find one answer (opinion to previous questions)

lucent: number of people have made distinction between browsing and IVR: mechanisms are very similar. repurposing content: depends on how hard repurposing content is. still a lot of research needs to be done. big advantage in repurposing.

rutgers: looked at html, html with css, hdml, ... analyzed pros and cons. there are compelling arguments why a different language is necessary in HDML. if these are true, then voice access needs its own markup language even more. lot of work presented today didn't target phone, but accessibility. thus natural orientation to repurpose html content. europe is far ahead of us in installation of mobile devices. see fusion of small devices with speech output and screen. there seems to be agreement that a separate language for mobile is needed, see hdml - why do small devices get their own language, but voice doesn't ?

??: hdml goal was to optimize for low bandwidth (based on WML)

Or Ben-Natan (microsoft): hdml means proliferation of standards - you will have five different languages for different devices - need one language for all devices - don't start a new language every time

att: when to use html, when something else: html excellent choice for talking books, with SMIL extensions. html does not make sense in transaction-based system - example: ups tracking website gives no clue on how to create voice-based site derived from website - imagine from an 800 number reaching this page - there is almost no overlap - completely different designs - don't believe html is a solution for this purpose

mark (response): not all people using talking books are experienced users - elderly, young children use it. html page is useful to non-web-aware user if well-designed. a lot of the extra content on ups is just bad authoring. what about advertising and voice browsing ? we just skip it

lucent: distinction between ...: tried using html. could do rendering of information, but couldn't do what we thought was best. moved to dialogue driven approach, provided better results/better experience for users - capitalize on capabilities of specific user interfaces - wonder if RDF/profiling won't bloat the whole system up and make it too complicated. ...

rajev: html or different kinds of system: ... client-server solution ... for web page designer, designed forms interface - can leave it at that when going to voice browser - unless he gets more money to do voice version -

agree with ora, just add something to html, so that author can gradually improve the web page for voice - avoid a different language

gragget systems (?): display size: difference between dtmf and speech input. speech input allows many more than 12 choices at a time. don't have to go into a directed graph. shouldn't use extensions that force you to use the old way of doing things

sun: five info representations: html - substantial limitations, but pragmatism says: this will be used - CSS: ... grammars need to be provided to speech recognizer, more than extracting links is needed - happy to provide spec - voxml, ...: three speech oriented markup - how do you minimize effort for web designers ? took that to heart. idea: same way as CSS/stylesheets, provide navigation info in a separate part - navigation document, break down html document into navigation parts

ken: we can learn from IVR - ... - if you make it a different language, rest of the world doesn't have access to things that could be useful ...

john burger, mitre: participated at html WG - next version of html will likely be number of XML modules. decomposing html - core html, tables, forms. add an additional module, not part of html proper. hard division between html and completely different markup language disappears - add embellishments using external modules - should allow much easier repurposing of HTML - check out current HTML WG - if you think it's appropriate, please join, we may need the expertise

Rutgers: one web page corresponds to one audio site - many to many mapping. language may be able to solve it, but i'm unsure

Or Ben-Natan: html in several subparts: as long as we don't allow overlapping, great idea - even if you special-purpose design an audio site, you can still do that in HTML, even if it doesn't give the best visual experience

ken hore, lucent: pml had success, ora had success. whether it's extensions or new language: depends how complicated. when we started pml: how can we make access for phone users easier. how can we make creating phone services easier for web designers. certain percentage of the phone services can be built with html. talked to ivr people ... even general magic is similar to ivr. goal: make it easier to make these services. ivr people should be able to do more complicated things - lots of bloat in the language when using html only. you can also do some things only in html

??: (general magic ??): phone without a display is an anachronism: numbers don't support this - 800 million phones vs. not many PCs. new on web: ... different about gui: voice interfaces require dialogues because you don't have click and point. didn't hear much discussion of what constitutes a dialogue. we may need a standard for this.

Ramesh Sarukkai: ivr difficult, tedious: why not adopt keyword strategy for accessing web content ? keyword-structuring of web pages. dialogue-oriented structure for form filling. focus on modifications of html for speech recognizers - grammars. cues to speech recognizer would be good to integrate.

lucent: advocating less new language. you want high quality performance, and you need new language for that. not convinced that we've taken enough advantage of what we already have - should try that

steve lilley: many of you used html for presentation, not powerpoint. not so great display. same is true for designing a speech app. far beyond capabilities of html. you can do something simple, but not too fancy. there are still many applications that html can be used for

separate: access to the web, sophisticated dialogue-based systems in interest group

lucent: need a more natural two-way conversation with systems

dave: need help for doing a briefing package

??: shameless plug for speech recognition. when you have only audio access, you need to do the best with what you have. opposite for speech recognition

Brainstorming session

who would like to be involved in IG

fair number (80 % of about 30 people)

set up a mailing list, use this to clarify some of the ideas in this session

ken: identify specific needs to check whether they should be extensions or a new language - laundry list

lucent: what can't be done as an extension of an existing language ? make a list

Or Ben-Natan: need a guideline - at which point is html not enough ? i haven't gotten there yet - want to solve most of the problems for most of the people - not everything for everyone - there's always C++ for applications

dave: speech grammar formats needed

rutgers: next version of html will have xml blocks - so problem is solved

dave: comment on frames versus graphs - can you say more ?

rajev: graph-based systems always have the limitation that you need to specify the whole state transition diagram - if something new happens, you don't know how to handle this
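
A small sketch may make the limitation concrete. The states, events, and the package-tracking scenario below are invented for illustration and do not describe any system discussed at the workshop; the point is only that a graph-based dialogue can follow just the edges its designer wrote down.

```python
# Hypothetical state-transition table for a tiny phone dialogue.
# Every (state, event) pair the system can handle must be listed here.
TRANSITIONS = {
    ("start", "track package"): "ask_number",
    ("ask_number", "number given"): "report_status",
    ("report_status", "goodbye"): "end",
}


def step(state: str, event: str) -> str:
    """Follow one edge of the graph; fail if the edge was never specified."""
    try:
        return TRANSITIONS[(state, event)]
    except KeyError:
        raise ValueError(f"no transition from {state!r} on {event!r}") from None


state = step("start", "track package")      # an edge the designer anticipated

try:
    step(state, "change delivery address")  # an event nobody drew an edge for
    handled = True
except ValueError:
    handled = False                         # the graph has no answer here
```

Covering the unexpected request means adding new edges, and possibly new states, by hand for every place in the dialogue where it might occur.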

repurposing content: need engines to convert between old and new content (??)

markup that you use for synthesis can also be used for recognition