This extended abstract is a contribution to the Easy-to-Read on the Web Symposium. The contents of this paper were not developed by the W3C Web Accessibility Initiative (WAI) and do not necessarily represent the consensus view of its membership.

Bridging the Gap between Pictographs and Natural Language

Vincent Vandeghinste, Centre for Computational Linguistics, University of Leuven. vincent@ccl.kuleuven.be

1. Problem Description

Digital pictograph communication environments, such as the WAI-NOT environment (www.wai-not.org), which is aimed at users with cognitive disabilities, accept input in two forms. Users can select pictographs from a two-level category system, or they can type text, which is then converted into pictographs. The conversion from text to pictographs currently relies on straightforward string matching only, which causes two types of problems. First, a word may not match the name of any pictograph, so no pictograph is generated; this happens, for instance, with conjugated verbs, because no lemmatisation takes place. Second, a word may match the name of the wrong pictograph, so a wrong pictograph is generated. Both types of problems make the message converted into pictographs harder to understand.

For instance, the Dutch sentence 'ik kom naar huis' (E: I am coming home) is converted into

ik (E: I) kom (E: bowl) naar (E: to) huis (E: house)

instead of

ik (E: I) kom (E: come) naar (E: to) huis (E: house)
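The behaviour of plain string matching can be illustrated with a minimal sketch. The dictionary below is a hypothetical stand-in for a pictograph lexicon (English glosses take the place of the actual images), and the function name `convert` is an assumption made for illustration; it is not the WAI-NOT implementation.

```python
from typing import List, Optional, Tuple

# Hypothetical lexicon: pictograph name -> English gloss standing in for the image.
PICTOGRAPHS = {
    "ik": "I",
    "kom": "bowl",    # the noun 'kom', not the verb form
    "komen": "come",  # only the infinitive is listed
    "naar": "to",
    "huis": "house",
}

def convert(sentence: str) -> List[Tuple[str, Optional[str]]]:
    """Map each word to a pictograph by exact string match, or None if unmatched."""
    return [(word, PICTOGRAPHS.get(word)) for word in sentence.lower().split()]

print(convert("ik kom naar huis"))  # 'kom' is matched to the wrong pictograph
print(convert("hij komt"))          # the conjugated verb 'komt' is not matched at all
```

In this sketch the first call shows the second problem type (a wrong match for 'kom') and the second call shows the first problem type (no match for a conjugated verb).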

2. Background

Instead of alphabetic writing, people with reading and writing disabilities can use pictograph communication sets, such as those provided for Dutch-speaking people by Beta (www.betavzw.be) or Sclera (www.sclera.be).

Online pictograph communication environments allow people with reading and writing disabilities to communicate over the internet. These environments, such as WAI-NOT, allow users to send emails and chat, using pictographs, text, or a mixture of both as input.

WAI-NOT is a project that develops applications tailoring IT to people with cognitive disabilities, such as a web site adapted to different intellectual levels, with pictographic support wherever possible. One application allows users to send emails with the help of pictographs through an adapted e-mail client.

These messages are sent over the internet (encoded) in text form and converted back (decoded) into pictographs in the WAI-NOT environment. Encoding and decoding are currently done with simple string mapping: word forms are linked to pictures. As a result, many word forms cannot be converted into pictures because they are not catalogued.

Web sites such as www.widgit.com show pictographs when the mouse hovers over the English text. When a word corresponds to multiple pictographs due to ambiguity, all the matching pictographs are shown. Widgit seems to apply lemmatisation to English text, as conjugated verbs and plural nouns seem to match pictographs that correspond to the lemma. Alternatively, they might simply have mapped alternate word forms onto the same pictograph.

The approach described in this paper is currently applied to a closed environment but could easily be embedded into freely accessible web pages, resulting in web sites similar to www.widgit.com, but with disambiguation, showing only one pictograph per word or word group.

3. Approach

To reduce the gap between natural language text and pictographs, we collected a corpus of over 3000 email messages sent by users of the Beta pictograph set with the WAI-NOT system. We apply part-of-speech tagging, which gives each word a detailed label indicating, for instance, whether it is a noun, a verb, or an adjective, and whether it is plural. We also apply lemmatisation, the mapping of a word onto its dictionary form. As tagger and lemmatiser we used Frog (Van den Bosch et al. 2007), which uses the D-Coi tag set (Van Eynde 2005) and is reported to be more than 96% accurate at the word level.

We manually mapped every word-tag-lemma combination that occurred more than 29 times to the correct pictograph, when available. These combinations account for 60.50% of the words in the corpus. Note that by using word-tag-lemma combinations, we avoid mistakes such as the one in the example, provided that the tagger and lemmatiser label the sentence properly. For the lower-frequency words, we checked whether the word form matched a pictograph and, if not, whether the word's lemma could be mapped onto a pictograph. This implies that for the lower-frequency items, wrong pictographs (such as the one for "kom" in the example) are not avoided.
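The three-stage lookup described above can be sketched as follows. This is only an illustration under stated assumptions: the tokens are tagged by hand with simplified labels ("verb", "noun") rather than real D-Coi tags produced by Frog, and the names `Token`, `CHECKED`, `LEXICON` and `lookup`, as well as the dictionary contents, are hypothetical.

```python
from typing import Dict, NamedTuple, Optional, Tuple

class Token(NamedTuple):
    word: str   # surface form as typed by the user
    tag: str    # part-of-speech label (simplified, hypothetical)
    lemma: str  # dictionary form

# Hypothetical manually checked mapping for frequent word-tag-lemma combinations.
CHECKED: Dict[Tuple[str, str, str], str] = {
    ("kom", "verb", "komen"): "come",
}

# Hypothetical pictograph lexicon keyed by pictograph name (glosses stand in for images).
LEXICON: Dict[str, str] = {
    "kom": "bowl",
    "komen": "come",
    "fiets": "bicycle",
}

def lookup(token: Token,
           checked: Dict[Tuple[str, str, str], str] = CHECKED,
           lexicon: Dict[str, str] = LEXICON) -> Tuple[Optional[str], str]:
    """Return (pictograph, stage), trying the most reliable source first."""
    key = (token.word, token.tag, token.lemma)
    if key in checked:            # 1. manually checked word-tag-lemma combination
        return checked[key], "checked"
    if token.word in lexicon:     # 2. word form matches a pictograph name
        return lexicon[token.word], "token"
    if token.lemma in lexicon:    # 3. lemma matches a pictograph name
        return lexicon[token.lemma], "lemma"
    return None, "none"

print(lookup(Token("kom", "verb", "komen")))              # resolved by the checked mapping
print(lookup(Token("fietsen", "noun", "fiets")))          # resolved through the lemma
print(lookup(Token("kom", "verb", "komen"), checked={}))  # falls through to the token stage
```

The last call mimics a combination that was too infrequent to be checked manually: the word form matches first, so the verb 'kom' still receives the 'bowl' pictograph, which is the limitation noted for lower-frequency items.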

4. Challenges

The basic challenge lies in the correct semantic disambiguation of words. Simple string mapping cannot distinguish between the different meanings a given string can have, which leads to wrong pictograph decoding and probably to miscommunication. Determining a word's correct part of speech, and consequently its correct lemma, is a first step towards word sense disambiguation. A second challenge lies in mapping words that do not occur in the pictograph lexicon onto pictographs. Here lemmatisation is an obvious first step.

5. Outcomes

With the baseline system, 60.44% of all words in the 121740-word corpus could be converted into a pictograph. With our approach, the manually checked most frequent word-tag-lemma combinations cover 60.50% of the words in the corpus; another 14.81% could be matched by the token (but possibly with a wrong disambiguation), and another 2.10% by using the lemma instead of the word form. In total, 77.40% of the words in the corpus could be converted, a relative improvement of 28.06% over the baseline.

While we were developing the system, users composed new emails. On this unseen data set of 78800 words, the baseline system could convert only 41.33%. With our approach, the system could convert 60.30% of the words into pictographs, a relative improvement of 45.9%.
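The relative improvements reported in this section follow from the coverage percentages as (improved - baseline) / baseline. The short sketch below recomputes them from the figures given in the text; the function name `relative_improvement` is chosen for illustration. The three components for the development corpus sum to 77.41%, so the reported total of 77.40% presumably reflects rounding of the individual components.

```python
def relative_improvement(baseline: float, improved: float) -> float:
    """Relative gain in coverage, in percent of the baseline coverage."""
    return (improved - baseline) / baseline * 100

# Coverage percentages as reported in the text.
corpus_gain = relative_improvement(60.44, 77.40)  # development corpus
unseen_gain = relative_improvement(41.33, 60.30)  # unseen emails

# Checked combinations + token matches + lemma matches.
components = round(60.50 + 14.81 + 2.10, 2)

print(f"{corpus_gain:.2f}%")  # 28.06%
print(f"{unseen_gain:.1f}%")  # 45.9%
print(components)             # 77.41
```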

6. Future Research

The upgrade of the WAI-NOT environment with the described improvements is expected soon. In future upgrades we want to apply more complex natural language processing techniques, such as word sense disambiguation, and to link all the pictographs to the Cornetto lexical semantic database (Vossen et al. 2008). We will also work on the translation between Beta, which is more or less a word-by-word conversion of Dutch, and Sclera, in which several lexical semantic concepts are combined in one pictograph.

Acknowledgements

This research is done in the Picto project, funded by Steunfonds Marguerite-Marie Delacroix.

References

Van den Bosch, A., Busser, G.J., Daelemans, W., and Canisius, S. (2007). An efficient memory-based morphosyntactic tagger and parser for Dutch. In F. van Eynde, P. Dirix, I. Schuurman, and V. Vandeghinste (eds.), Selected Papers of the 17th Computational Linguistics in the Netherlands Meeting. Leuven, Belgium: Centre for Computational Linguistics. pp. 99-114.
Van Eynde, F. (2005). Part-of-speech tagging en lemmatisering. Protocol for the Annotators in D-Coi.
Vossen, P., Maks, I., Segers, R., and van der Vliet, H. (2008). Integrating Lexical Units, Synsets, and Ontology in the Cornetto Database. In N. Calzolari, K. Choukri, B. Maegaard, J. Mariani, J. Odijk, S. Piperidis, and D. Tapias (eds.), Proceedings of the Sixth International Conference on Language Resources and Evaluation (LREC'08). Marrakech, Morocco.