Purpose

This page aims to outline potential user needs and requirements for Voice Agents and related issues. This relates to ongoing work and research in the APA/Research Questions Task Force (RQTF). We hope to identify where there are gaps in WCAG 3.0 as they relate to Voice Agents, their accessibility and usability as well as potential privacy implications for people with disabilities.

NOTE: These definitions (mostly but not all from Wikipedia) are used to initiate discussion on this topic, and help us work out the scope for this project.

Definitions: Natural Language Interfaces

A natural-language user interface is a type of human-computer interface in which verbs, phrases and clauses act as UI controls for creating, selecting and modifying data in software applications. A natural-language search engine would in theory find targeted answers to user questions. As opposed to keyword search, natural-language search attempts to use natural-language processing to understand the nature of the question and then to search and return a subset of the web that contains the answer to the question.

The input and the output may be provided in any of several modalities, including text (e.g., entered via a keyboard or displayed visually), or speech (e.g., using speech recognition for input and text to speech for output).

Examples of natural language interfaces include:

Interactive voice response (IVR) - uses voice and DTMF tones input via a keypad.
Voice control in products such as Smart TVs - these are conversational interfaces distinct from the general-purpose ‘smart assistants’, since voice is a secondary interaction mechanism rather than the primary one (users would struggle to control their TV solely via voice commands).
An automated chat application embedded in a Web page, in which the user communicates with a software agent rather than with another person. Such an application could be used, for instance, by an organization to process basic customer service inquiries.
General-purpose conversational agents that offer a range of services to the user – answering a variety of questions, playing multimedia content, home automation, etc. The agent may be available as part of a desktop or mobile platform, or may be implemented in a stand-alone device such as a “smart speaker” or a home appliance.
An educational application that uses natural language interaction to evaluate or to improve a student’s competence in a particular skill or field of study. For instance, such an application could be used as an aid to second language acquisition.
A classic “text adventure” game in which natural language is used to solve problems and make choices in an interactive story.
A service robot in a building that can answer a limited range of questions and respond to users’ commands in natural language.

MC: these examples focus on speech interfaces, yet the definition clearly is more general than that. Google search supports natural language questions better and better, and - at least in my experience - that's all text. By contrast, my bank forces me to use a speech interface when I call, yet it doesn't recognize natural language at all, it only recognizes certain keywords (and doesn't tell you what they are). That underscores that the issues of *natural language* and *speech* are separate areas whose requirements should be explored separately, even though there are many circumstances where they will be used together in a single product.

Accessibility challenges for Natural Language Interfaces

A speech-only natural language interface, offering only speech input and speech output, may be inaccessible to those with hearing or speech-related disabilities. Other accessibility requirements that ought to be identified and documented are:

Sensory requirements – not only the ability for the user to choose among multiple means of input and output, but also within each mode, such as support for adjusting speech rate and volume, or the style properties of displayed text.
Cognitive requirements, for example to facilitate the discovery of features of the interface – what can the system do? Reminders and other memory aids, the use of AAC symbols for communication, etc.
Physical requirements, such as for entirely touch-free interaction with the system (particularly applicable if the natural language interface is offered in specialized hardware such as a vehicle or a home appliance).

Some unresolved research problems that have been identified include:

Sign language interaction.
Brain-computer interface interaction.

MC: These relate mostly to the speech side of the question, though the topic of cognitive requirements gets into actual natural language. What are the *natural language* requirements when separated from speech?

Multi modality and Natural Language Interfaces

There is an open question around the degree to which multimodality needs to be supported by a device for it to be considered accessible. RQTF needs to look at where natural language UIs need to cover both voice and text; however, is it feasible to require every UI instance to support both? Are we limiting the scope to a class of device that rarely occurs in the wild? For example, a blind user may use a voice agent whereas a deaf user would use a text or GUI-based device. Both can be accessible to their respective cohort but may still have accessibility considerations. These may include timeouts, cognitive considerations and graceful handling of accents or spelling errors.

MC: Note these issues are basic accessibility conformance that would apply to any modality, so aren't unique to this topic.

Some further research questions:

How often is a text-based natural language UI seen as preferable by a user cohort to a GUI?
Is it only in a small selection of cases (i.e. where the possible range of inputs is too wide to offer a multiple choice selection)?
Should a natural language interface itself support both text input and output, or is it enough for the software to support them when the user has appropriate hardware, such as a mobile phone or tablet, to gain access to this functionality? Is the ability to control the device via another medium, either hardware or software, sufficient? For example, a robot vacuum cleaner or microwave may have a natural language interface that can be controlled by another device. It is arguable that in these cases the natural language interface accessed via voice control is used as an accessible alternative to the visual interface or display.

Advantages of Natural Language Interfaces

What are the specific advantages of Natural Language Interfaces?

A potential advantage of natural language interfaces is efficiency: formulating a natural language sentence can be faster than searching a complex hierarchy of menus and dialogues. Word prediction can also facilitate the entry of natural language even further if an on-screen keyboard or other, relatively slow input mechanism is used.

Natural language interfaces also suit situations where the scope of commands is too wide to offer a selection of options, and they remain useful for as long as that scope stays wide. Once it is narrowed down, a menu of options can be presented, which is often faster than typing. Other use cases to consider are:

Where written language is a central part of the product’s use (e.g. a language learning application).
Where conversation is a central part of the product’s use (e.g. companionship products).

Questions around Natural Language Interface usage

If we take the chatbot model as an example, there are some areas we need to look at:

Are there other examples of textual natural language interfaces where written language or conversation is not central to the product’s use?
Do we know of any research on the acceptability of textual natural language interfaces? Is there a bias against chatbots due to poor implementation?

Limitations of Natural Language Interfaces

Textual natural language interfaces have a free-form interaction style, but voice agents usually require the user to word their command in a particular way. Would the need to combine two types of AI, one handling more free-form textual or other modality input and one handling stricter keyword command input, be problematic? There is potential for greater error, as well as for intrusive error correction mechanisms being needed to double-check inputs.

MC: I would argue that a tool that will "require the user to word their command a particular way" is not a natural language interface. It's an unnatural language interface. But it may still have the full set of requirements from the speech accessibility side of it.

Deaf users and Natural Language Interfaces

For Deaf people in the UK (those who have BSL as a first language) learning written English is very hard. They may not have the stepping stone of spoken English in the first place, so this adds great complexity to reading. However, some Deaf people read and write clearly in English, and they are referred to as bilingual. Many Deaf people do need to use some level of written English, but it comes with a higher cognitive load.

Many Deaf users can find big blocks of text intimidating, and it is best to use simple English if BSL is not an option.

An area of research is:

Would a text-based interface using a BSL word order be helpful to Deaf users?
What if an AI could understand both an English and BSL-based word order and handle it gracefully?
How can short BSL clips or BSL diagrams be used as a part of Natural Language Interfaces?

MC: Looking to the future, sign language should be recognized as a vehicle for natural language. The different grammar means the natural language processing would be different, and the different modality means the recognition process would be different. But it still fits into the category, and we should advocate for its emergence. This means there are two tracks here - one for sign language users for whom the spoken / written language supported by the system is a second language, and one for sign language users interacting via sign language.

Captioning and Natural Language Interfaces

Hard of hearing and deafened people benefit from captioned speech and/or lip-reading. The text and lip-reading augment the information received through the speech, so captioning can be helpful. Lip-reading alone doesn’t convey all the sounds (‘forty’ and ‘fourteen’ look the same on the lips) but there are ways of representing the missing sounds using hand shapes. Combining speech with captions, or speech with video of the speaker's face, can help Deaf users; speech combined with both video of the face and captions would be best.

Language variation and cognitive load

An area of research is:

Is there less cognitive load in countries where the written grammar has a more cohesive and less contradictory set of rules than English?
There is at least one Chinese sign language that is based on one of the Chinese writing systems (presumably simplified Chinese), but we need to look at this.

Definitions: Smart agents

"Smart Agents" are services incorporating multiple input and output modalities. While a Voice Agent is an agent that takes input via speech only, a Smart Agent is an agent that takes multiple inputs and has the ability to abstract instructions into usable functions. For example, Siri, Alexa, Cortana and similar agents are accessed through mobile phones, watches, laptops, televisions and speakers (e.g. "Echo Dot"), among others. That is, there is the abstract "Smart Agent" as well as its instantiation through different devices (e.g. appliances).

Definitions: Voice-user interface

A voice-user interface (VUI) makes spoken human interaction with computers possible, using speech recognition to understand spoken commands and answer questions, and typically text to speech to play a reply. A voice command device (VCD) is a device controlled with a voice user interface. Virtual assistants, such as Siri, Google Assistant, and Alexa, are examples of VUIs.

Definitions: Speech recognition

Speech recognition enables the recognition and translation of spoken language into text by computers. It is also known as automatic speech recognition (ASR), computer speech recognition or speech to text (STT). It incorporates knowledge and research in the computer science, linguistics and computer engineering fields.

Definitions: Interactive voice response

Interactive voice response (IVR) is a technology that allows humans to interact with a computer-operated phone system through the use of voice and DTMF tones input via a keypad. In telecommunications, IVR allows customers to interact with a company’s host system via a telephone keypad or by speech recognition, after which services can be inquired about through the IVR dialogue.
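
As a rough illustration of the definition above, the following Python sketch shows an IVR-style menu that accepts either a DTMF digit or a spoken keyword and announces both for every option. The menu entries, action names and the `prompt` and `route` functions are hypothetical and invented for this example; a real IVR platform would supply its own speech recognition and call handling.

```python
from dataclasses import dataclass
from typing import Optional, Tuple


@dataclass(frozen=True)
class MenuOption:
    digit: str    # DTMF key the caller can press
    keyword: str  # spoken keyword the caller can say instead
    action: str   # hypothetical name of the service to route to


# Hypothetical menu, used only for illustration.
MENU: Tuple[MenuOption, ...] = (
    MenuOption("1", "balance", "check_balance"),
    MenuOption("2", "payments", "make_payment"),
    MenuOption("0", "operator", "transfer_to_operator"),
)


def prompt(menu: Tuple[MenuOption, ...] = MENU) -> str:
    """Announce every option with both its keypad digit and its keyword."""
    parts = [f"for {o.keyword}, press {o.digit} or say '{o.keyword}'" for o in menu]
    return "You can: " + "; ".join(parts) + "."


def route(user_input: str, menu: Tuple[MenuOption, ...] = MENU) -> Optional[str]:
    """Return the action for a DTMF digit or a recognised keyword, else None."""
    text = user_input.strip().lower()
    words = text.split()
    for option in menu:
        if text == option.digit or option.keyword in words:
            return option.action
    return None
```

Announcing the accepted keywords in the prompt is one possible way to avoid the situation where a caller has to guess which words the system recognises.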

Defining scope

NOTE: Complexity in defining scope in this space relates to the need to separate the mode of interaction from the service or agent itself. These are primary issues to consider if the scope is to be sensible.

With the definitions above in mind, and with an understanding of the various parts of the stack they describe, we can now consider exploring:

Natural Language Interfaces - or Conversation Agents
The entire space of "Smart Agents" and all their proliferation
Only "Voice Agents" (mostly hardware) without background system
"Voice or Speech Agents" including relevant "Smart Agent" functionality relating to background services

Current thinking in RQTF is tending towards the need for a modality-independent descriptor for this work. Some suggestions are:

'Interactive Agents: Accessibility User Requirements for multi-modal interaction'
'Conversational Agents Accessibility User Requirements - Natural Language processing'
'Digital Assistants - Accessibility User Requirements for multi-modal interaction'
'Smart Agents - Accessibility User Requirements for multi-modal interaction'

NOTE: As with these terms, what is both in and out of scope needs to be defined.

Privacy and dependency concerns

For example, if we consider privacy: there are aspects that relate to how the system as a whole treats sensitive data as well as aspects that are specific to voice interaction (e.g. eavesdropping). Both should be covered as much as is reasonable but there is potential for scope creep in this context.

Another potentially critical part of the discussion will be the "locus of control": does the voice agent, for example, need to provide a text stream or just the API to relay a text stream from the smart agent in the background? For example, televisions do not provide captions, but they relay captions coming from the broadcaster. A television with the captioning function does not guarantee accessibility; it needs captions coming from the broadcaster to ensure accessibility. So there are dependencies that we need to consider for this work to be effective without the scope ballooning.

While we would be partially referencing "Smart Agents", this would cover only the parts directly related to voice interaction. It is also worth noting that scoping this work to focus on "Voice Agents", including relevant "Smart Agent" functionality, would allow us to expand on this in the future with other user needs and requirements documents.

Voice Agent Accessibility

In relation to people with disabilities and the use of Voice Agents, some accessibility considerations are:

How do current accessibility standards relate to Voice Agents?
What are the gaps in new standards like WCAG 3.0?
What are the main areas of difficulty for particular user groups when interacting with Voice Agents?
Are there particular usability issues that affect people with disabilities?

Voice Agent privacy

In relation to people with disabilities and the use of Voice Agents, some privacy considerations are:

How can people with disabilities be enabled to understand the privacy requirements, i.e. the complexities around privacy policies, how privacy settings are presented, and the expectations of privacy around using Voice Agents?
How can privacy be protected when using Voice Agents, and what is the potential impact of fingerprinting on people with disabilities when using them? What protections are needed for vulnerable people?
What are the privacy concerns for people with disabilities when using Voice Agents? Can they be used to profile users?

[DRAFT] User Needs for Voice Agents

Users should not be frustrated and confused by Voice Agents - and need to have a clear path to perform specific tasks.
Users may need a gentle introduction to Voice Agents in order to interact with them successfully. This may mean a hybrid approach to UIs, or tutorials.
Users need to be able to discover interaction affordances.
Users need to understand Voice Agent state, especially when data or input is being processed, so as to avoid the 'Gulf of evaluation' effect.
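
One way to think about the need to understand Voice Agent state is as a small state machine whose every transition produces a status message that can be rendered in whatever modality the user has chosen (speech, text, a light, haptics). The Python sketch below is illustrative only: the state names, the allowed transitions and the status wording are assumptions made for this example, not a description of any existing agent.

```python
from enum import Enum
from typing import Dict, List, Set


class AgentState(Enum):
    IDLE = "idle"
    LISTENING = "listening"
    PROCESSING = "processing"
    RESPONDING = "responding"
    ERROR = "error"


# Allowed transitions (an assumption made for this sketch).
TRANSITIONS: Dict[AgentState, Set[AgentState]] = {
    AgentState.IDLE: {AgentState.LISTENING},
    AgentState.LISTENING: {AgentState.PROCESSING, AgentState.IDLE, AgentState.ERROR},
    AgentState.PROCESSING: {AgentState.RESPONDING, AgentState.ERROR},
    AgentState.RESPONDING: {AgentState.IDLE, AgentState.LISTENING},
    AgentState.ERROR: {AgentState.IDLE, AgentState.LISTENING},
}

# Hypothetical status wording; a real agent would localise and adapt this.
STATUS_TEXT: Dict[AgentState, str] = {
    AgentState.IDLE: "Ready.",
    AgentState.LISTENING: "Listening.",
    AgentState.PROCESSING: "Working on your request.",
    AgentState.RESPONDING: "Here is the answer.",
    AgentState.ERROR: "Something went wrong.",
}


class VoiceAgentStatus:
    """Tracks the agent state and records a status message for every change."""

    def __init__(self) -> None:
        self.state = AgentState.IDLE
        self.announcements: List[str] = []

    def move_to(self, new_state: AgentState) -> str:
        """Change state if the transition is allowed and return the status text."""
        if new_state not in TRANSITIONS[self.state]:
            raise ValueError(f"cannot go from {self.state.value} to {new_state.value}")
        self.state = new_state
        message = STATUS_TEXT[new_state]
        self.announcements.append(message)
        return message
```

Because every state change yields a message, the agent never processes input silently; how that message is presented is left to the output modality.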

[DRAFT] Requirements for Voice Agents

When speech is not being processed, this should be handled gracefully and consistently.
Voice Agent error recovery should be graceful and avoid firing low-level error messages. Users will need fallback mechanisms for Voice Agent error recovery.
Users need a graceful introduction to using a Voice Agent, and need to understand it.
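
The error recovery requirement above could be approached along the following lines: re-prompt in plain language a limited number of times, then offer a fallback to another input mode instead of exposing a low-level error. This Python sketch simulates recognition results as a list in which `None` stands for an utterance that was not understood; the prompts, the retry limit and the `run_dialogue` function are assumptions made for illustration.

```python
from dataclasses import dataclass, field
from typing import List, Optional, Sequence

# Hypothetical wording; no low-level error codes are shown to the user.
REPROMPTS = (
    "Sorry, I didn't catch that. Could you say it again?",
    "I'm still having trouble. You could try saying it a different way.",
)
FALLBACK = "I could not understand. You can type your request or use the keypad instead."


@dataclass
class Outcome:
    understood: Optional[str]  # recognised request, or None
    prompts: List[str] = field(default_factory=list)  # messages given to the user
    fell_back: bool = False  # True if another input mode was offered


def run_dialogue(recognition_results: Sequence[Optional[str]],
                 max_retries: int = 2) -> Outcome:
    """Walk through simulated recognition results, re-prompting then falling back."""
    outcome = Outcome(understood=None)
    for attempt, result in enumerate(recognition_results):
        if result is not None:
            outcome.understood = result
            return outcome
        if attempt < max_retries:
            # Gentle re-prompt; reuse the last wording if retries exceed prompts.
            outcome.prompts.append(REPROMPTS[min(attempt, len(REPROMPTS) - 1)])
        else:
            # Retries exhausted: offer another input mode instead of failing.
            outcome.prompts.append(FALLBACK)
            outcome.fell_back = True
            return outcome
    return outcome
```

For instance, `run_dialogue([None, "lights on"])` re-prompts once and then succeeds, while three consecutive failures end with the fallback message.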

Issues with discoverability, voice interaction affordances, and the mental model of interactions and affordances. Where the output suggests an input that cannot actually be used, be flexible in the types of input that are accepted.

Safety factors - struggling with automated work [talk with Ted - dashboard user interface.]
Map interactions to common sense patterns.
How will the gulf of evaluation/execution work with Voice Agents?

To do:

Josh to liaise with pronunciation TF.

Relevant Links

Acknowledgements

Thanks to Jason White, John Paton, Michael Cooper, Judy Brewer, Janina Sajka, Shadi Abou-Zahra and the Education and Outreach Working Group (EOWG) for input on scope and definitions. This is the work of the RQTF, and is supported by the EC-funded WAI-Guide Project.