Michael K. Brown
Stephen C. Glinski
Bernard P. Goldman
Brian C. Schmult
Lucent Technologies Inc.
Murray Hill, NJ 07974
The PhoneBrowser is a system for browsing the World Wide Web using only a telephone as the terminal. In its simplest form, the user hears a verbal description of each visited Web page using a method called HyperVoice with Text-To-Speech (TTS) synthesis. Different synthesized voices are used to signify particularly interesting text on the page, most notably hyperlink titles. Other text styles, such as bold text or heading text, may also have special voices assigned. The HyperVoice description of page layout includes information about images, forms, tables, etc. To the extent possible, information about the content of the page is summarized and transformed into a concise verbal form without heavy reliance on special programming.
At any time the user can ask questions to get greater detail or can speak Hyperlink titles into a speech recognizer, interrupting TTS output, to navigate to other Web pages. Other speech commands can control operation of the browser and how the information is rendered. In this way the user has control over the presentation and navigation processes. Thus, the PhoneBrowser makes the Web accessible to traveling business people and to the 60% of the U.S. market that does not own a computer.
The PhoneBrowser is also a programmable platform that gives the general population of Web page authors the means for building Interactive Voice Response (IVR) systems without having to own any IVR equipment. This is a significant departure from conventional means for building IVR applications that typically must be developed on expensive special IVR hardware. The PhoneBrowser is intended for ownership by Internet Service Providers (ISP's) who provide services to both the Internet and telephone communities. Thus, the PhoneBrowser creates a new class of Web applications, specifically IVR, for the general Web population.
Truly useful real-world speech understanding and synthesis applications have only emerged as products within the last decade. There are several reasons for this, perhaps the greatest being the rapid decrease in the cost/performance ratio of Personal Computers (PC's). Improvements in acoustic modeling and language modeling have also contributed substantially. Small desktop systems are now sufficiently fast and accurate to handle many interesting single-user applications; however, large speech processing platforms capable of handling many users simultaneously are still quite expensive. Furthermore, creating speech controlled telephony content is still beyond the ability of most small business and home users because of cost and the lack of a means for programming. The PhoneBrowser platform addresses these issues.
The PhoneBrowser is the result of a combination of several technologies, including the Speech Controlled Web Browser and TelePortal technologies. In addition, we add new capabilities, referred to as HyperVoice, that allow Web pages to be verbally rendered by describing the page in terms of its structure or content, using synthesized voices to augment the information content delivered to the user.
In the next section the PhoneBrowser system is described in detail including HTML (HyperText Markup Language) parsing and analysis. Section 3 introduces generalizations of the PhoneBrowser application, making the system useful to the much larger WWW community. Finally we close with some remarks about future plans.
The PhoneBrowser system consists of: the Web browser engine that retrieves Web pages; the HyperVoice processor that performs analysis of Web pages and generates verbal descriptions; the TTS synthesizer; the automatic grammar generator that produces speech recognition grammars from Web Pages; the speech recognizer; and the spoken command interpreter. The TTS synthesizer and speech recognizer are general systems that are part of the base platform. Typical base platforms for the PhoneBrowser system are the Intuity/Conversant system and the Lucent Speech Server.
The special Web browser for PhoneBrowser supports only a subset of the typical Web browser functions since it does not need to display any visual information. Therefore it does not process any image or video data. The browser obtains text and audio data from the Web, plays back the audio data, and delivers the text to a HyperText Markup Language (HTML) parser for preprocessing for use by the HyperVoice and grammar generation processors.
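As an illustration of the preprocessing step described above, the following minimal sketch separates page text from hyperlink titles and skips image data. It is not the PhoneBrowser parser; it assumes Python's standard html.parser module, and the names LinkTextParser and parse_page are hypothetical.

```python
from html.parser import HTMLParser


class LinkTextParser(HTMLParser):
    """Collects page text and (title, href) pairs from anchors; images are ignored."""

    def __init__(self):
        super().__init__()
        self.text_chunks = []   # page text in document order
        self.links = []         # (link title, URL) pairs
        self._href = None       # href of the anchor currently open, if any
        self._title_parts = []  # text seen inside the open anchor

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            self._href = dict(attrs).get("href")
            self._title_parts = []

    def handle_data(self, data):
        data = " ".join(data.split())  # collapse runs of whitespace
        if not data:
            return
        self.text_chunks.append(data)
        if self._href is not None:
            self._title_parts.append(data)

    def handle_endtag(self, tag):
        if tag == "a" and self._href is not None:
            self.links.append((" ".join(self._title_parts), self._href))
            self._href = None


def parse_page(html):
    """Return (text chunks, hyperlinks) for one page of HTML."""
    parser = LinkTextParser()
    parser.feed(html)
    parser.close()
    return parser.text_chunks, parser.links
```

The text chunks would feed the verbal description, while the link list would feed grammar generation.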
The PhoneBrowser may respond to either voice commands or DTMF signals in one of three modes: DTMF only where descriptions include phrases to associate button numbers with interactive information; voice only where the most concise form of description is given using TTS voices; and both DTMF and voice where the longer description form is given and numbers may be spoken. The DTMF only mode may be desirable when using PhoneBrowser in a noisy environment like a busy city street or crowd of people because background noise might be interpreted as voice commands by PhoneBrowser. The voice only mode is generally most desirable because it produces the most rapid page descriptions.
The HyperVoice processor takes parsed HTML and further analyzes the page to identify structure under section headings, tables, frames, and forms. In general the verbal description will consist of TTS output from the page text, plus descriptions of sizes, locations and possibly other information about images and other items on the page. PhoneBrowser will immediately start to describe a new Web page upon retrieval, using the various TTS voices to indicate special elements of the page. The user can command PhoneBrowser to pause, back up, skip ahead, etc., much like controlling an audio tape player, except that content elements such as sentences and paragraphs can be skipped.
Tables may be used for page layout only or may be true tabulations. The page analysis determines which is most likely and generates descriptions accordingly. True tabulations are described as tables. Tables used for page layout purposes are not described explicitly but table element locations may be described if deemed important. Inspection mode can be used to override table treatment when PhoneBrowser hides table descriptions.
Frames can be handled in two ways: the full page description method and the frame focus method. The full page description method merges the information from all frames into a single context that allows the user to verbally address all elements independently of the frames. The frame focus method allows the user to specify a frame to be described or inspected and focus voice commands on that frame.
Forms are described in terms of field title labels. Fields are addressable by speaking field titles. General items can be entered into form fields by spelling. Inspection can be used to obtain menu choices.
The speech recognition grammar and vocabulary are automatically generated from the HTML of the Web page. This is the key feature of PhoneBrowser that makes it useful for building IVR applications on the Web. The parsed HTML is analyzed for section titles and Hyperlinks that are to become voice sensitive. A sub-grammar is constructed for each title by generating all possible ways of speaking subsets of the title. In addition all other browser voice commands are mixed in and a complete grammar is compiled into an optimized finite-state network. This network is loaded into the speech recognizer to constrain the possible sequences of words that can be recognized.
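The sub-grammar construction for a single title can be sketched as follows. This is one possible reading, assuming that "subsets of the title" means non-empty word subsets spoken in their original order; the function name title_phrases is hypothetical, and compilation into a finite-state network is not shown.

```python
from itertools import combinations


def title_phrases(title):
    """Return every non-empty, order-preserving word subset of a title."""
    words = title.lower().split()
    phrases = set()
    for size in range(1, len(words) + 1):
        # combinations() keeps the words in their original order
        for picked in combinations(words, size):
            phrases.add(" ".join(picked))
    return phrases
```

A title of n distinct words yields 2^n - 1 phrases under this reading, so a practical system would presumably bound the title length or prune unlikely phrases.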
The vocabulary words are extracted from the grammar. These words are partially processed by TTS to create a list of phonetic transcriptions in symbolic form. The same phonemes are used in both the speech recognizer and TTS systems so the symbolic phonetic descriptions, once loaded into the recognizer, tell the recognizer how the vocabulary words are pronounced, thus making it possible for the PhoneBrowser to recognize virtually any spoken word.
Normally PhoneBrowser describes Web pages to the user via TTS output. The user controls the PhoneBrowser by speaking over the TTS output, thus "barging in". Echo cancellation is used to remove TTS output from the speech recognition input so that speech recognition is unaffected by TTS. When the user speaks for a sufficiently long period, TTS is interrupted, speech recognition is performed, and the speech recognizer output is interpreted into a PhoneBrowser command.
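The "sufficiently long period" test can be sketched as a count of consecutive speech frames. The per-frame speech decision, the frame length, and the threshold are all assumptions here, and the name barge_in_frame is hypothetical.

```python
def barge_in_frame(speech_flags, min_frames):
    """Return the index of the frame at which TTS should be interrupted.

    speech_flags holds one boolean per audio frame (True means speech was
    detected after echo cancellation). Returns None if the user never
    speaks for min_frames consecutive frames.
    """
    run = 0  # length of the current unbroken run of speech frames
    for index, is_speech in enumerate(speech_flags):
        run = run + 1 if is_speech else 0
        if run >= min_frames:
            return index
    return None
```

Requiring an unbroken run keeps short noises, such as a cough, from interrupting the page description.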
During the grammar generation process voice command interpretation tables are established for use later in the interpretation phase. For browser commands a table of possible command phrases associates computer instructions with each phrase. No ambiguous browser command phrases are defined. In the case of processing a Hyperlink the Uniform Resource Locator (URL) is associated with all possible subsets of the Hyperlink title. Section titles can be handled in a manner similar to local Hyperlinks. Later, when a title word is spoken, the associated URL(s) are retrieved.
Sometimes more than one URL and/or browser command will be retrieved when the spoken title words are not unique. In this case a simple dialog is initiated and the user is given a choice of full title descriptions that can be selected either by spoken number or by speaking an unambiguous title phrase. If the phrase is still ambiguous a new possibly smaller list of possible choices will be given. The user can back up at any time if the selection process has not yielded the desired choices. In this way the user can refine the list and converge on one choice.
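The interpretation table and the refinement dialog can be sketched together. This is an illustrative reconstruction rather than the PhoneBrowser implementation: the function names and the example titles and URLs are hypothetical, and phrase generation assumes order-preserving word subsets.

```python
from itertools import combinations


def phrases(title):
    """All non-empty, order-preserving word subsets of a title, lowercased."""
    words = title.lower().split()
    return {
        " ".join(picked)
        for size in range(1, len(words) + 1)
        for picked in combinations(words, size)
    }


def build_table(links):
    """Map each speakable phrase to the set of (title, URL) pairs it may mean."""
    table = {}
    for title, url in links:
        for phrase in phrases(title):
            table.setdefault(phrase, set()).add((title, url))
    return table


def interpret(table, spoken, candidates=None):
    """Return the matches for a spoken phrase, narrowed to candidates if given.

    One match means navigate. Several matches mean the full titles should
    be read back as a numbered list and the user asked again.
    """
    matches = table.get(spoken.lower(), set())
    if candidates is not None:
        matches = matches & set(candidates)
    return sorted(matches)


def pick_by_number(choices, spoken_number):
    """Select from a numbered list that was read to the user (1-based)."""
    return choices[spoken_number - 1]
```

With links titled "Local Weather Report" and "Local Traffic Report", speaking "report" returns both entries; a follow-up of "weather", restricted to those candidates, or a spoken number then leaves a single choice.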
The PhoneBrowser is not only a speech controlled Web browser, but can also be used by the general WWW population to build IVR applications. The advantage of this approach is the elimination of the need for the small business or personal user to own any special equipment. Typical IVR platforms are expensive, often costing more than $50,000. Therefore only moderately large businesses or ISP's can afford to own this equipment. The PhoneBrowser is currently built on such an IVR platform and hence is also an expensive system to own. However, since IVR applications on the PhoneBrowser can be programmed by simply writing HTML or XML (eXtensible Markup Language) "programs" on the WWW while obtaining PhoneBrowser service from an ISP, the small user does not need to make any large initial investment. PhoneBrowser could also be implemented on a PC with Internet service and a voice modem to provide a home solution.
Each ordinary Hyperlink title is processed to produce sub-grammars that allow all spoken subsequences of the words in the title. For general IVR applications the content developer can write more complex grammars by inserting a <GRAMMAR> tag followed by a grammar written in GSL (Grammar Specification Language) followed by a </GRAMMAR> tag. Alternatively, the author can insert 'GRAMMAR=' or 'GRAMMAR_FILE=' attributes into an anchor with either a grammar specification in GSL or a reference to a file containing GSL statements. These tags and attributes are ignored by ordinary browsers, thus making them unintrusive. Using these methods many entirely different phrases can be used to address the same URL.
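A sketch of how a page's grammar markup might be collected is given below. It assumes Python's standard html.parser module; the class name GrammarExtractor, the example page, and the file name weather.gsl are hypothetical. GSL syntax is not reproduced here, so the grammar body is a placeholder string rather than valid GSL.

```python
from html.parser import HTMLParser


class GrammarExtractor(HTMLParser):
    """Collects author-supplied grammar text and per-anchor grammar attributes."""

    def __init__(self):
        super().__init__()
        self.page_grammars = []    # text found between <GRAMMAR> and </GRAMMAR>
        self.anchor_grammars = []  # (href, inline grammar, grammar file) triples
        self._in_grammar = False

    def handle_starttag(self, tag, attrs):
        attrs = dict(attrs)  # HTMLParser lowercases tag and attribute names
        if tag == "grammar":
            self._in_grammar = True
            self.page_grammars.append("")
        elif tag == "a" and ("grammar" in attrs or "grammar_file" in attrs):
            self.anchor_grammars.append(
                (attrs.get("href"), attrs.get("grammar"), attrs.get("grammar_file"))
            )

    def handle_data(self, data):
        if self._in_grammar:
            self.page_grammars[-1] += data

    def handle_endtag(self, tag):
        if tag == "grammar":
            self._in_grammar = False


# Hypothetical page: the grammar body is a placeholder, not real GSL.
PAGE = """
<GRAMMAR>PLACEHOLDER GSL TEXT</GRAMMAR>
<A HREF="/weather.html" GRAMMAR_FILE="weather.gsl">Weather</A>
<A HREF="/news.html">News</A>
"""

extractor = GrammarExtractor()
extractor.feed(PAGE)
extractor.close()
```

Anchors without grammar attributes, such as the second link, would fall back to the automatically generated title sub-grammar.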
Local <GRAMMAR> scope comprises the entire definition for the current URL. Included files can contain surrounding grammar definitions. Macros can be defined either within the local <GRAMMAR> scope or can reside in included files. All macros have global scope within the Web page.
Using local applet/application code gives the IVR system developer the means to perform operations on either the server or the client. In the typical PhoneBrowser application Java code might be used to perform operations at the server that could, in turn, control remote devices through the Internet or the PSTN using additional hardware at the remote end.
Since HTML pages on the Web form an implicit finite state network, this network can be used to create a combined dialog and control system. Even without an applet language a dialog system can be built in this way. Response timeout, for example, can be obtained by using the HTTP refresh capability. The resulting system provides dialog control of the presentation of information to the user.
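The idea of pages as dialog states can be sketched with a small transition table. The page names, phrases, and the 10-second delay are hypothetical; the refresh entry stands in for the HTTP refresh mechanism mentioned above.

```python
# Each page is a state: spoken phrases lead to other pages, and an optional
# refresh entry (delay in seconds, target page) acts as the response timeout.
PAGES = {
    "/menu.html": {
        "links": {"weather": "/weather.html", "news": "/news.html"},
        "refresh": (10, "/help.html"),
    },
    "/help.html": {"links": {"menu": "/menu.html"}, "refresh": None},
    "/weather.html": {"links": {"menu": "/menu.html"}, "refresh": None},
    "/news.html": {"links": {"menu": "/menu.html"}, "refresh": None},
}


def next_page(current, spoken=None, silence_seconds=0):
    """Return the page to visit next, or the current page if nothing applies."""
    page = PAGES[current]
    if spoken is not None and spoken in page["links"]:
        return page["links"][spoken]
    refresh = page["refresh"]
    if spoken is None and refresh is not None and silence_seconds >= refresh[0]:
        return refresh[1]
    return current
```

Here a caller who stays silent on the menu page for 10 seconds is moved to a help page, which is the timeout behavior described above expressed as an ordinary state transition.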
We have described a system for browsing the WWW and a means for programming IVR applications on the WWW without requiring the ownership of equipment by content developers or heavy reliance on new programming languages. Simple extensions are provided for rich grammar specification. The PhoneBrowser system is an HTML programmable system that takes spoken commands and delivers verbal descriptions of WWW pages. With this seemingly basic capability the PhoneBrowser provides a general mechanism to the WWW community for developing a new class of WWW applications.
Much research is still needed for making improvements to PhoneBrowser. Research is needed in the following areas:
Summarization of textual content for conciseness
OCR and analysis of images for verbal rendering
Email to speech conversion
Large vocabulary speech recognition for forms/email input
Speaker verification for secure access
Better WWW programming languages
As these technologies evolve the PhoneBrowser will gain in utility. This process is expected to continue for quite a few more years given the current state of technology.