Natural Language Usage -- Issues and Strategies for Universal Access to Information

Author: Lisa Seeman, UB Access

Description of the problem

Ambiguous use of language creates problems with translation, causes misunderstandings, and reduces accessibility for people with cognitive disabilities. Poor or complicated use of language often restricts the compatibility of data with semantic based systems. For example, translation and simplification tools often translate words based on a meaning that was not intended by the author. This is particularly common when the word ambiguity cannot be resolved from context using the surrounding words and grammar rules, but requires other knowledge sources to determine the meaning of a sentence. Other examples of problematic language usage include secondary and implied meanings, such as those typical of sarcasm. However, with the evolution of referencing and semantically aware Web technologies, there is a new opportunity to provide better clarifying information, interfaces and integration with semantic based systems.

Expanding device independent design

A device is defined as an apparatus through which a user can perceive and interact with the web. Traditionally, device independent design relies on text, or on text equivalents for other media, that different devices can render. However, if the definition of a device includes apparatus that process the text into a form that is more useful to the user, such as a symbolic rendering, then the requirement for a text equivalent becomes a requirement for concept mapping.

Accessibility

Currently, sites intended to be user-friendly to people with cognitive disabilities rely on the use of simple language, illustrations or symbolic representations such as BLISS (a symbolic language invented by Charles Bliss in 1949). Not surprisingly, most otherwise accessible sites are reluctant to do this. However, it is now possible to enable typical Web content to be converted to a simplified or symbolic representation at the user end. This would require encoding that removes the uncertainties typical of natural language and annotates the relative importance of content. Symbolic representations that could then be created include Sign, BLISS and other natural languages used by people with impairments.

Note that such a system would allow any abstruse document, such as a legal document, to be simplified without compromising the original content. This form of device independence allows more people, disabled or not, access to the concepts within the content. Interestingly enough, BLISS was invented by a concentration camp survivor who knew both the need for an internationally understood language and the result of disguising dangerous concepts with clever words.

The requirement for concept mapping holds true for any device or system that works on knowledge rather than data. For example, data mining systems often give many false results because of the discrepancy between the text (data) and the concepts that the text is meant to represent (knowledge). The result may be close in terms of data but irrelevant in terms of knowledge. Expanding device independence implies compatibility with knowledge processing systems.

Requirements

Change in the natural language, or simplification at the user's end, requires the development of a standard architecture which:

facilitates human inter-operation, such as computer supported collaborative work across differences in natural language; and
enables adaptations of textual content that materially improve access to information for people with reading-related disabilities or who need translation; and
enables ambiguous content, in multi-media Internet formats, to be machine understandable, and interpretable by semantic based systems.

This will enable applications such as data mining across languages, correct translation of ambiguous text, simplifications, and more accessible, user friendly interfaces. These interfaces would be user-friendly to people who use different languages, are linguistically impaired, or are without written language.

ILS (Interoperable Language Standard) - A Possible Implementation

Annotations, such as RDF (Resource Description Framework) annotations, allow an author to make statements about the content, sections of the content or even specific words. These capabilities are being used to create vocabularies for accessibility, providing alternatives for content and form for documents. The advantage of this implementation is the combination of defaults (common standard interpretations, lexicons and mapping of text to concepts), and exceptions (overrides that map text to meanings within a defined area).

Rules

Additional, accompanying information (such as annotations in a header or linked file) would contain lexicon information about the use of language on a page, so that from a machine perspective, the content is specific.

For example, word ambiguities would be resolved by generalized links to context-based rules, with overriding annotations for exceptions. Resolving a word ambiguity can also resolve the syntactic ambiguity that it causes. In our example sentence, "Fasten the assembly with the lever", an annotation on the word "with" that defines it as meaning "using" would resolve the syntactic ambiguity. A potential implementation of this methodology could be an RDF type of annotation such as: This word at this location IS DEFINED as that word at that location, where the term IS DEFINED would itself be defined within the ILS ontology.
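As an illustration only, the IS DEFINED statement can be sketched as a subject, predicate, object triple in plain Python. The namespace, the document and lexicon addresses, the word addressing scheme and the name isDefined are all assumptions made for this sketch; none of them is part of a published ILS vocabulary.

```python
from typing import NamedTuple

# Hypothetical namespace for the ILS ontology (an assumption, not a real vocabulary).
ILS = "http://example.org/ils#"


class Statement(NamedTuple):
    subject: str    # the word at a given location in the document
    predicate: str  # a term defined in the ILS ontology
    obj: str        # the lexicon entry that gives the intended sense


def word_at(doc: str, sentence: int, word: int) -> str:
    """Build an illustrative pointer to one word in a document."""
    return f"{doc}#s{sentence}w{word}"


tokens = "Fasten the assembly with the lever".split()
with_index = tokens.index("with")

# "This word at this location IS DEFINED as that word at that location."
annotation = Statement(
    subject=word_at("http://example.org/manual", 1, with_index),
    predicate=ILS + "isDefined",
    obj="http://example.org/lexicon#using",
)
```

A user agent reading this statement would know that "with" in this sentence means "using", so the prepositional phrase attaches to the verb rather than to the noun phrase.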

Pronouns would have a second type of annotation. Instead of referencing the pronoun to a lexicon definition, one would reference the pronoun to the noun or object to which it refers, thus resolving the semantic ambiguity. In our example sentence, "Start the engine and keep it running", an annotation on the word "it" that points to the word "engine" would resolve the semantic ambiguity. This can be expressed using the standard RDF type of annotation as: This word at this location REFERS TO that word at that location - where the term REFERS TO would itself be defined within the ILS ontology.
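A minimal sketch of how a user agent might consume a REFERS TO annotation follows. It assumes, purely for illustration, that words are addressed by their position in the sentence and that the annotation is available as a mapping from pronoun position to referent position; the function name expand_pronouns is hypothetical.

```python
def expand_pronouns(tokens: list[str], refers_to: dict[int, int]) -> str:
    """Replace each annotated pronoun with the noun it REFERS TO."""
    out = []
    for i, tok in enumerate(tokens):
        if i in refers_to:
            # Substitute the referent for the pronoun.
            out.append("the " + tokens[refers_to[i]])
        else:
            out.append(tok)
    return " ".join(out)


tokens = "Start the engine and keep it running".split()

# REFERS TO annotation: the word "it" points to the word "engine".
refers_to = {tokens.index("it"): tokens.index("engine")}

simplified = expand_pronouns(tokens, refers_to)
```

Here simplified holds the sentence with the pronoun replaced by its referent, which is the kind of explicit form a simplification or translation tool could work from.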

Defaults

Assuming that different types of ambiguities can be resolved using the ILS technique, ILS as described so far would still not be a practical solution to uncertainties in natural language, because of the practical constraints of adding an annotation to each potential uncertainty. This problem can be addressed through a clear and established set of defaults. In some cases defaults could be expressed as a series of grammatical rules. For example, the default reference for each pronoun can be the preceding noun. A pronoun would only require a separate annotation when its reference differs from the default. One may also supply a default lexicon with default meanings for each word. Cascading lexicons, or RDF statements pointing to a separate meaning for any individual word, can override this meaning. Using this information a user agent can render the simplified or translated content correctly. Default grammar rules could also be referenced.
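The interplay of defaults and overrides might look something like the following sketch. The lexicon entries, the function names and the assumption that noun positions are already known (for example from a part-of-speech tagger) are all hypothetical and serve only to illustrate the cascade.

```python
from collections import ChainMap

# Default lexicon shared by all documents (illustrative entries only).
default_lexicon = {"with": "using", "bank": "financial institution"}
# Page-level lexicon that overrides the defaults for one document.
page_lexicon = {"bank": "river edge"}

# Lookups consult the page lexicon first, then fall back to the default.
lexicon = ChainMap(page_lexicon, default_lexicon)


def meaning(word, position, word_overrides):
    """A per-word annotation wins over the cascaded lexicons."""
    return word_overrides.get(position, lexicon.get(word))


def resolve_pronoun(pronoun_index, noun_indices, overrides):
    """Default rule: a pronoun refers to the nearest preceding noun,
    unless an explicit REFERS TO annotation overrides it."""
    if pronoun_index in overrides:
        return overrides[pronoun_index]
    preceding = [i for i in noun_indices if i < pronoun_index]
    return max(preceding) if preceding else None


tokens = "Start the engine and keep it running".split()
noun_indices = {2}  # position of "engine", assumed known

default_referent = resolve_pronoun(5, noun_indices, overrides={})
```

With these defaults the example sentence needs no annotation at all, since "it" already resolves to "engine". An author would only add an override where the default gives the wrong answer.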

Barriers to adoption

Undoubtedly, there would be an overhead in proofing a document and annotating exceptions. However, this overhead could be reduced by:

A wide range of secondary, application specific, lexicons and grammar rules
User friendly authoring tools that simplify identifying exceptions and adding annotations

It is intended that the work involved will be something like running a grammar and spell check in a word processor.

More importantly, however, the reusability of ILS annotations and the wide range of potential benefits will justify the time investment in most situations. For example, having been annotated once, a document may then be translated into multiple languages, simplified, and processed more effectively by knowledge systems such as data mining tools and search engines. It can be concluded that, to promote the accessibility implications of ILS, ILS will need to be implemented across a broad range of applications.

Limitations

It needs to be stated that we are not anticipating a complete resolution of text to concepts. However, we believe we can achieve:

a significant increase in access to concepts; and
a flexible and adaptable architecture.

It is also anticipated that some authors will create annotated content more effectively than others. Again, good authoring tools will help resolve these issues.

Examples of Uncertainties

Word ambiguity

Word ambiguity occurs when a word can have more than one meaning. Word ambiguities are typically resolved based on context using the surrounding words and grammar rules. Syntactic ambiguity occurs when there is more than one possible syntactic parse for a grammatical sentence. Consider, for example, the sentence "Fasten the assembly with the lever". This may be either an instruction to fasten the assembly using a lever, or an instruction to fasten the assembly which has a lever attached to it. The prepositional phrase "with the lever" can be attached either to the verb or to the noun phrase object. However, a syntactic ambiguity is often caused by a word ambiguity. In our example, the word "with" is ambiguous: it could mean "using" or "connected to".

Semantic ambiguity

Semantic ambiguity occurs when other knowledge sources are required to determine the meaning of a sentence. For example, in the sentence "Start the engine and keep it running", the fact that "it" refers to the engine is not inferable from the single clause "keep it running". The ambiguity is caused by the difficulty in resolving the pronoun.

Traditional approaches used to resolve ambiguity

1. Semantic rules based on content. Semantic and grammar-based rules are used in methods and techniques for reducing ambiguity, which can achieve more than 90% monolingual effectiveness [Resolving Ambiguity for Cross-language Retrieval]. These rules are less effective in resolving syntactic ambiguity, where the sentence itself can have multiple interpretations. Completely successful computational modeling requires a parallel psycholinguistic investigation of the distribution of ambiguity across the domain of human languages [Modeling Ambiguity], in which we still have far to go. This leads to the conclusion that, in reality, no operational computer system existing today is capable of determining the intended meanings of words in discourse [Language Ambiguity].

2. Restricted and controlled languages. Kant [Coping with Ambiguity in Knowledge-based Natural Language Analysis] implements the strategy of constraining the input text via a sub-language that limits the words and phrases in the lexicon to a single sense and restricts the syntactic constructions which are allowed in the controlled grammar. Related semi-formal and formal representation techniques include UML (Unified Modelling Language). The Daniel Berry approach [Linguistic Sources of Ambiguity] is to use checklists, scenario-based reading, and agendas for human checking of ambiguities. However, it cannot be assumed or even hoped that a system that restricts human expression will ever be widely adopted for Internet-based content.

RDF (Resource Description Framework) and Accessibility

Although the RDF standard may provide a possible implementation of a solution for the problems discussed in this document, it is not the only potential implementation. The need to address concept mapping is independent from the general promotion of semantic web compatible accessibility support.

However, it may interest the reader that the ILS proposal outlined above is an example of a potential direction for the evolution of accessibility, where full accessibility support would be implemented through the use of the semantic web and/or the Resource Description Framework (RDF) standards.

RDF usage could solve accessibility problems in the following situations:

Simplified and annotated or multimedia content, required for accessibility for some groups, is inappropriate for other audiences
The original rendering is incapable of change, such as when web authors are unable to use simplified language
The Web author is not concerned with accessibility
Content relies on markup languages that do not support accessibility.

RDF has the following additional advantages over traditional accessibility techniques:

XML schema can be annotated to increase accessibility usage, which could have an effect on an immeasurable number of documents based on that schema.
One can provide multiple alternatives / conditional content in different media. For instance, an auditory rendering of a visual aid might be more appropriate than text in some contexts.
User profiles can be attached to Web content and alternatives, so renderings can be optimized to the individual user.
Compatibility with metadata, knowledge-based services and the semantic web
Further promotion of device independence principles
In some cases, it may in fact be easier to provide accessibility through RDF, as a single statement can render multiple elements of accessibility.

Current implementations

User agents are part of the solution: at the user end, they could translate text into clearer, simpler language, symbols and pictures. Groups such as WWAACI are working on such user agents.

There are several different implementations of the Annotea protocol.

A substantial authoring interface is built into Amaya, although this restricts the types of annotations that can be made to a pre-defined list. Some simple tools have been provided by the W3C SWAD-E project for creating any kind of annotation and posting it to a server.

The Snufkin and Annozilla tools provide access to annotations for Internet Explorer and Mozilla respectively.

There are at least two Annotea servers available - the W3C's, written in Perl as a module for Apache, and Brent Hendrick's server written in Python as a module for the Zope application server.

In Jim Ley's work on annotations, Annotea has been used as a storage mechanism for linking an object in WordNet with an image.

The use of this in Dan Brickley's RDF censorware shows the way to what is being attempted, although instead of censoring graphic content the idea is to enhance text.

Background links:

Important sites for the Language focus include:

Concept codes by the WWAACI

BLISS Symbolics

W3C Annotate project

Wordnet at Princeton

The W3C Web Ontology working group

The Boeing site controlled English checker

List of all the efforts of the US government to publish in plain language:

Xplanation.com provides language services that use Web technologies and controlled language

UB Access is working with RDF and accessibility integration and their tools use RDF. See examples of RDF usage to support accessibility and more information on ILS.

References

[BLISS] Graphic, meaning based communication system invented by Charles Bliss in 1949.

[WCAG10] "Web Content Accessibility Guidelines 1.0", W. Chisholm, G. Vanderheiden, and I. Jacobs, eds., 5 May 1999. This WCAG 1.0 Recommendation is http://www.w3.org/TR/1999/WAI-WEBCONTENT-19990505/.

[WCAG20] "Web Content Accessibility Guidelines 2.0", Ben Caldwell, W. Chisholm, Jason White and G. Vanderheiden, eds., 28 August 2002

[RDF] "Resource Description Framework (RDF) Model and Syntax", O. Lassila and R. Swick, eds., 22 February 1999. W3C Recommendation.

[RDF Schemas] "Resource Description Framework (RDF) Schema Specification 1.0", D. Brickley and R.V. Guha, eds., 27 March 2000. W3C Candidate Recommendation.

[Modeling Ambiguity] “Modeling the Effect of Cross-Language Ambiguity on Human Syntax Acquisition”, William Gregory Sakas, In: Proceedings of CoNLL-2000 and LLL-2000, Lisbon, Portugal, 2000. http://cnts.uia.ac.be/conll2000/abstracts/06166sak.html

[Resolving Ambiguity for Cross-language Retrieval] “Resolving Ambiguity for Cross-language Retrieval”, Lisa Ballesteros, W. Bruce Croft (1998) http://citeseer.nj.nec.com/ballesteros98resolving.html##

[Coping with Ambiguity in Knowledge-based Natural Language Analysis] “Coping with Ambiguity in Knowledge-based Natural Language Analysis” Kathryn L. Baker, Alexander M. Franz, Pamela W. Jordan, Center for Machine Translation and Department of Philosophy (2001) Carnegie Mellon University http://www.lti.cs.cmu.edu/Research/Kant/PDF/flairs.pdf

[Language Ambiguity] “Language Ambiguity: A Curse and a Blessing”, Cecilia Quiroga-Clare, Translation Journal and the Author 2003 http://accurapid.com/journal/23ambiguity.htm#

[Linguistic Sources of Ambiguity] “From Contract Drafting to Software Specification: Linguistic Sources of Ambiguity”, Daniel M. Berry, University of Waterloo, Ambiguity in Natural Language Requirements Specifications http://se.uwaterloo.ca/~dberry/ambiguity.res.html

Author: Lisa Seeman from UB Access - please feel free to contact me with any comments

Acknowledgements: WCAG and PF team, Charles McCathieNevile, Al Gilman