BBN Position Paper on Conversational Web Access

David Stallard
BBN Technologies
stallard@bbn.com

Abstract

We describe current telephone-to-web dialog projects at BBN, as well as some of the problems we experienced in building them. Building on this work, we present our thoughts on why the web isn't currently very suitable for voice-only conversational access, and how it might be made better.

Current Telephone-to-Web Projects at BBN

BBN's Speech and Language group is currently involved in three different DARPA-sponsored projects that provide spoken dialog access over the telephone to specific websites. All three projects are built on top of a common dialog architecture, which includes a dialog manager, a speech recognizer, a language understanding component, a language generation component, and a commercial speech synthesizer.

The first, VADAR (formerly GTNPhone [Stallard98]), is a telephone dialog interface to GTN (the Global Transportation Network), a US DOD logistics website that tracks the whereabouts and itinerary of military cargo shipments worldwide. Individual shipments are tracked via a unique identifier called a TCN (Transportation Control Number), similar in spirit to a FedEx tracking number. VADAR allows the user to query the whereabouts and status of TCNs that are either read in over the phone or referred to by a previously assigned short name. It computes and transmits an appropriate HTTP request to the GTN server, parses the HTML that comes back, and renders the answer in English for a speech synthesizer. A second, related project is an interface to another DOD logistics website, Emall, which allows users to order supplies rather than merely ask about their whereabouts.

The Emall voice interface allows users to use the telephone to ask for a part by ID code, specify a quantity and a supplier from a list of those stocking the part, and add the order to a virtual shopping cart. When the user is finished, he can finalize the order by specifying his credit card and have the items in the cart shipped to him. While the interaction with the website information is live, the ordering transaction is not actually carried out, as the system is currently at the demonstration and proof-of-concept stage.

Both of these projects are characterized by limited, prescriptive and mostly system-directed speech recognition and language interaction. Our third and most ambitious project, Talk'n'Travel, is a telephone interface to commercial travel websites which lifts these restrictions to permit unrestricted spoken language access and mixed user/system initiative. It allows for negotiation between system and user when the user's constraints cannot all be met. The following is an example dialog:

S: How can I help you?
U: I need a flight to Denver monday morning
S: What city are you leaving from?
U: Boston. I prefer nonstop.
S: I'm sorry, there are no available nonstops in the morning. I can get you a nonstop at 12:30 PM instead. Would that be acceptable?
U: What time does it arrive?
S: 4:15 PM
U: OK.

All three projects are built on a common dialog architecture, which includes a speech recognizer, language understanding and generation modules, a COTS speech synthesizer, an HTTP backend interface and HTML parser, and a dialog manager which orchestrates the operation of all these components. The language understanding and generation components work with meaning frames, which consist of a type and a set of attribute/value pairs. The following is an example meaning frame which could represent the utterance "I want a flight from Boston to Denver that leaves before 3 pm":

   (flight
      (orig (city boston))
      (dest (city denver))
      (dept-time (le 1500)))
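
A frame of this kind maps naturally onto a nested data structure. The following Python sketch reads the parenthesized notation above into a dictionary. The function names parse_sexpr and to_frame and the dictionary layout are illustrative assumptions, not part of the BBN system.

```python
def parse_sexpr(text):
    """Read one parenthesized expression into nested Python lists."""
    tokens = text.replace("(", " ( ").replace(")", " ) ").split()

    def read(pos):
        # Returns (parsed item, index of the next unread token).
        if tokens[pos] == "(":
            items = []
            pos += 1
            while tokens[pos] != ")":
                item, pos = read(pos)
                items.append(item)
            return items, pos + 1
        return tokens[pos], pos + 1

    tree, _ = read(0)
    return tree


def to_frame(tree):
    """Split a parsed frame into its type and its attribute/value pairs."""
    frame_type, *pairs = tree
    return {"type": frame_type, "attrs": {attr: value for attr, value in pairs}}


FRAME_TEXT = """
(flight
   (orig (city boston))
   (dest (city denver))
   (dept-time (le 1500)))
"""

frame = to_frame(parse_sexpr(FRAME_TEXT))
print(frame["type"])
print(frame["attrs"]["dept-time"])
```

In this sketch every value stays a string or a nested list, so the constraint on departure time is kept as the pair of an operator and a bound rather than interpreted.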

The dialog manager operates according to a transition network of pattern-action rules which specify what to say to the user and what action to take in response to what he says, as well as (optionally) how to constrain the recognizer to increase accuracy. Rules are written in a specification language we have devised, called DIABOLIC. The following is an example rule:

  (ELICIT
     (PROMPT (ASK-ABOUT nextUnconstraintAttribute))
     (ACTION
       (isUnderspecified (goto ELICIT))
       (isFetchReady (goto LAUNCH-QUERY))
       (isSpecified (goto OFFER))
       (isConflict (goto NEGOTIATE))))

This rule tells the system to prompt the user for the next unconstrained attribute in some pre-specified precedence order (origin, destination, departure date, departure time, etc.), and specifies which rule to transition to next according to which of its branch conditions holds true of the user's response in the context of the foregoing discourse. The "isUnderspecified" branch is taken if the system does not yet have enough information to specify a flight, so it loops back to the ELICIT state to ask about the next attribute. The "isConflict" branch is taken if the user's current specifications conflict with one another, and so on. Transitions may also include extra side-effect actions to execute, such as setting a state register to some value.
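
The control flow of such a rule can be sketched in Python as a table of (condition, target) branches that is scanned in order. This is only an illustration, not the DIABOLIC interpreter: the attribute names, the set of attributes assumed sufficient for a query, the first-match-wins ordering, and the way each condition is computed (in particular, treating an empty result list as a conflict) are all assumptions.

```python
# Prompting precedence, following the order named in the text.
# The attribute names themselves are assumptions.
PRECEDENCE = ["orig", "dest", "dept-date", "dept-time"]
REQUIRED = ["orig", "dest", "dept-date"]  # assumed minimum for a query


def next_unconstrained(state):
    """Return the first attribute in precedence order not yet in the frame."""
    for attr in PRECEDENCE:
        if attr not in state["frame"]:
            return attr
    return None


# Hypothetical branch conditions over a state holding the frame so far
# and the results of the last query (None if no query has been run).
def is_underspecified(state):
    return any(attr not in state["frame"] for attr in REQUIRED)


def is_fetch_ready(state):
    return not is_underspecified(state) and state["results"] is None


def is_specified(state):
    return bool(state["results"])


def is_conflict(state):
    return state["results"] == []


RULES = {
    "ELICIT": [
        (is_underspecified, "ELICIT"),
        (is_fetch_ready, "LAUNCH-QUERY"),
        (is_specified, "OFFER"),
        (is_conflict, "NEGOTIATE"),
    ],
}


def step(rule_name, state):
    """Return the name of the next rule: the first branch whose condition holds."""
    for condition, target in RULES[rule_name]:
        if condition(state):
            return target
    raise ValueError("no branch of " + rule_name + " applies")


state = {"frame": {"orig": "boston"}, "results": None}
print(next_unconstrained(state), step("ELICIT", state))
```

With only the origin known, the sketch prompts for the destination and stays in ELICIT; once the assumed required attributes are filled and no query has been run, it moves to LAUNCH-QUERY.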

As a general theme, we view dialog as a co-specification of an information structure (the frame) between user and system, whether that information structure specifies a database query, an update, or a set of goals to be satisfied. Different dialog strategies (and thus different rule sets) will dictate different interactions with the user, some lengthier or more directive, some less so.

How can the Web be made more useful for dialog access?

The systems sketched above differ in a key respect from a true voice browser: they are separate systems which treat the website simply as a backend database from which to receive data. This has an obvious disadvantage. Nothing in the page actually specifies, in a formal machine-readable way, which element is, say, a city, and which a departure time. As a result we have to write fragile, ad hoc code that tries to extract information from HTML parse trees based on purely contingent facts (such as the order of TD elements of a TABLE, or textual labels), and which can thus break any time the format of the document is changed. In other words, a hack.
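
The fragility is easy to reproduce. The following Python sketch, using only the standard library's html.parser, assigns meaning to table cells purely by position. The page contents, the column order, and the names CellCollector and extract_flight are invented for illustration and do not reflect any actual site or the BBN code.

```python
from html.parser import HTMLParser


class CellCollector(HTMLParser):
    """Collect the text of every TD cell, grouped by table row."""

    def __init__(self):
        super().__init__()
        self.rows = []
        self._in_cell = False

    def handle_starttag(self, tag, attrs):
        if tag == "tr":
            self.rows.append([])
        elif tag == "td":
            if not self.rows:  # tolerate a TD that appears before any TR
                self.rows.append([])
            self._in_cell = True
            self.rows[-1].append("")

    def handle_endtag(self, tag):
        if tag == "td":
            self._in_cell = False

    def handle_data(self, data):
        if self._in_cell:
            self.rows[-1][-1] += data.strip()


def extract_flight(html):
    """Assume, purely by convention, that the columns are origin, destination, time."""
    collector = CellCollector()
    collector.feed(html)
    row = collector.rows[0]
    return {"orig": row[0], "dest": row[1], "dept-time": row[2]}


PAGE_V1 = "<table><tr><td>Boston</td><td>Denver</td><td>12:30 PM</td></tr></table>"
# The same data after a hypothetical redesign that moves the time column first.
PAGE_V2 = "<table><tr><td>12:30 PM</td><td>Boston</td><td>Denver</td></tr></table>"

print(extract_flight(PAGE_V1))
print(extract_flight(PAGE_V2))
```

On the second page the extractor raises no error; it simply reports the departure time as the origin city, which is the kind of silent breakage described above.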

One obvious thing that will help is the advent of XML, in which the constituent parts of any information structure are explicitly called out by label. XML elements have semantically labeled subelements that are much like attributes of a frame, and so a meaning frame representing, say, a flight information request can be matched much more straightforwardly to a document whose elements explicitly represent semantic objects like flights.
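
As a rough illustration of that matching, the Python sketch below compares a frame against an XML element by label rather than by position, using the standard library's xml.etree.ElementTree. The element names, the frame layout, and the matches function are assumptions made for the example; no particular XML vocabulary is implied.

```python
import xml.etree.ElementTree as ET

# A hypothetical XML rendering of one flight; the element names are invented.
FLIGHT_XML = """
<flight>
  <dept-time>1230</dept-time>
  <dest>Denver</dest>
  <orig>Boston</orig>
</flight>
"""


def element_to_attrs(element):
    """Map each labeled subelement to its text, like the attributes of a frame."""
    return {child.tag: (child.text or "").strip() for child in element}


def matches(frame, element):
    """True if the element has the frame's type and agrees on every attribute it names."""
    attrs = element_to_attrs(element)
    return element.tag == frame["type"] and all(
        attrs.get(name) == value for name, value in frame["attrs"].items()
    )


flight = ET.fromstring(FLIGHT_XML.strip())
request = {"type": "flight", "attrs": {"orig": "Boston", "dest": "Denver"}}
print(matches(request, flight))
```

Because the lookup is by label, the order in which the subelements appear in the document has no effect on the result.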

I doubt in any case that dialog systems can be retrofitted onto HTML documents without major control changes. Yankelovich has demonstrated quite convincingly how simply translating a graphical interface to a speech interface fails to create a useful or usable system [Yankelovich95], and the web seems unlikely to be different. Websites are designed to take advantage of the much higher "symbol bandwidth" that the visual medium provides, and thus pump out much more information in response to a request than one would ever stand to hear over the telephone. HTML document layout, such as for forms, does a good job of visually showing options, indicating defaults and extras, etc, but does not provide a direct specification for a dialog strategy.

What is needed is a way of annotating documents with some sort of dialog network control structure that can control what the user is asked, and when, and deal with his replies accordingly. Such a dialog control structure must be capable not only of constraining the recognizer to increase accuracy and carrying out side-effect actions, but also of fetching another document altogether. A language like DIABOLIC, which already includes a GOTO for transitioning to a new state, could be augmented with a FETCH operator which transitions to a new page, and thus to the network associated with that page. FETCH could take a URL, plus optional register values (collected in the course of the dialog), and perform either a GET or POST as appropriate.
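
One way to picture the HTTP side of such an operator is the following Python sketch, which turns a URL and a set of register values into a GET or POST request using the standard library's urllib. FETCH is only a proposal here, so everything in the sketch is an assumption: the build_fetch name, the register names, and the example.com URL. The request is constructed but deliberately not sent.

```python
from urllib.parse import urlencode
from urllib.request import Request


def build_fetch(url, registers=None, method="GET"):
    """Build (but do not send) the HTTP request a FETCH transition might issue.

    Assumes the URL carries no query string of its own.
    """
    query = urlencode(registers or {})
    if method == "GET":
        # Register values travel in the query string.
        target = url + "?" + query if query else url
        return Request(target, method="GET")
    if method == "POST":
        # Register values travel in a form-encoded body.
        return Request(
            url,
            data=query.encode("ascii"),
            headers={"Content-Type": "application/x-www-form-urlencoded"},
            method="POST",
        )
    raise ValueError("unsupported method: " + method)


# Register values collected in the course of a hypothetical dialog.
registers = {"orig": "Boston", "dest": "Denver"}

get_request = build_fetch("http://example.com/flights", registers)
post_request = build_fetch("http://example.com/order", registers, method="POST")
print(get_request.get_method(), get_request.full_url)
print(post_request.get_method(), post_request.data)
```

Passing either request to urllib.request.urlopen would perform the actual fetch; the page that comes back would then supply the dialog network to transition into.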

Ultimately, I believe the question has to be asked whether people really want general telephone web browsing, or whether they just want access to particular important information in more than one way. I expect it's really the latter: nobody wants to place a call to Little Mikey's homepage unless it's providing a crucial service to them. Documents and services with such crucial content will have to be designed with both uses in mind. The ideal would not be to generate a dialog system from a graphical representation or vice versa. Rather, it would be to have a neutral mechanism for representing such content that allows one to generate a graphical webpage, a telephone dialog system, or even something on the continuum between these two extremes, such as a system for access from a small screen.

References