This extended abstract is a contribution to the Easy-to-Read on the Web Symposium. The contents of this paper were not developed by the W3C Web Accessibility Initiative (WAI) and do not necessarily represent the consensus view of its membership.
User-generated content (UGC) has transformed the way that information is handled online. In this paradigm shift, users create, share and consume textual information that is likely to present informal features such as poor formatting, misspellings, phonetic transliterations, slang or lexical variants (Ritter et al., 2010). These texts, found in social networks, chats or blogs, usually offer poor accessibility for people with cognitive disabilities or people not familiar with these non-standard language deviations. Moreover, additional challenges appeared when social web gaming entered the mainstream thanks to new technologies such as HTML5, which gave support to hundreds of graphically rich multiplayer web games. The HTML5 canvas element is merely a low-level drawing surface, and text rendered with this element lacks support for automatic accessibility or localisation tools. Thus, the informal and noisy textual UGC found in in-game chats is usually difficult to understand for both people and accessibility tools such as screen readers and text simplification or normalisation applications. Also, some online gaming communities develop their own subculture and vocabulary, which can exclude newcomers. To overcome these challenges, web and social multiplayer game developers should normalise user input in order to provide alternate clean texts. For this reason we propose TENOR, a multilingual text normalisation web service that aims to help web and online game developers process textual UGC in real time so that it can be understood by the majority of users.
Making UGC more accessible to everybody is a relevant issue that is gaining a lot of attention in the research community. Approaches such as text normalisation (Melero et al., 2012; Han and Baldwin, 2011), text simplification (Simple Wikipedia) and projects such as accessible Twitter (Easy Chirp) constitute good contexts for making progress in this area. Regarding the problem of UGC accessibility in games, to our knowledge this topic remains largely unexplored. Related approaches focus on breaking physical and cognitive gameplay barriers rather than on solving textual communication issues between players. Because multiplayer web gaming can only be understood as a social activity, newcomers or impaired users who are not able to understand or write in-game messages to other players are cut off from an important communication channel, which degrades their overall gaming experience or even renders them unable to play at all.
TENOR, our text normalisation tool, works in real time by first identifying out-of-vocabulary (OOV) English and Spanish words. A search for substitution candidates is then performed in a word lattice using language models in combination with lexical and phonetic edit distances (Mosquera and Moreda, 2012). TENOR filters non-printable characters and non-standard punctuation symbols, and replaces special combinations of punctuation and characters, such as emoticons, with their appropriate textual equivalent. Common word transformations, slang terms, word lengthenings and transliterations are also detected and replaced. The application has been adapted to work in real time as a RESTful web service using JSON for data transfer, with the aim of providing a lightweight API for web applications.
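The steps described above can be illustrated with a minimal sketch. This is not TENOR's implementation: the vocabulary, slang table, emoticon table and the names `normalise` and `normalise_token` are hypothetical, and the word lattice, language models and phonetic distance are replaced here by a plain Levenshtein distance over a toy lexicon.

```python
import re

# Toy resources, all hypothetical stand-ins for real lexicons.
VOCAB = {"that", "was", "so", "good", "great", "game",
         "see", "you", "later", "the", "is"}
SLANG = {"u": "you", "l8r": "later", "gg": "good game"}
EMOTICONS = {":)": "smile", ":(": "sad", ":D": "laugh"}


def edit_distance(a: str, b: str) -> int:
    """Levenshtein distance computed with a rolling one-row table."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        curr = [i]
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            curr.append(min(prev[j] + 1,          # deletion
                            curr[j - 1] + 1,      # insertion
                            prev[j - 1] + cost))  # substitution
        prev = curr
    return prev[-1]


def normalise_token(token: str, max_distance: int = 1) -> str:
    """Map one whitespace-delimited token to a cleaner form if possible."""
    if token in EMOTICONS:
        return EMOTICONS[token]
    word = token.lower()  # crude handling of wrong-cased words
    if word in VOCAB:
        return word
    if word in SLANG:
        return SLANG[word]
    # Word lengthening: collapse runs of 3+ letters to two, then to one.
    for replacement in (r"\1\1", r"\1"):
        shortened = re.sub(r"(.)\1{2,}", replacement, word)
        if shortened in VOCAB:
            return shortened
    # Fall back to the closest in-vocabulary word (sorted for determinism).
    best = min(sorted(VOCAB), key=lambda v: edit_distance(word, v))
    if edit_distance(word, best) <= max_distance:
        return best
    return token  # leave unknown tokens untouched


def normalise(text: str) -> str:
    """Normalise a chat message token by token."""
    return " ".join(normalise_token(t) for t in text.split())


print(normalise("that was sooo goood :)"))  # that was so good smile
```

A real service would rank several candidates per OOV word with a language model instead of accepting the single nearest entry, which is why the distance threshold is kept deliberately strict here.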
In this study, we found that very few chat corpora collected from Web 2.0 games are publicly available. For this reason, we used a dataset of 3847 texts composed of in-game chat logs from Team Fortress 2, a multiplayer first-person shooter, and compared the normalisation results with those obtained in previous studies using chat logs from Kongregate, an online portal of web games, and Twitter, a microblogging site.
We have studied the variation of several text readability indexes before and after the lexical normalisation process (see Table 1). The analysis of the obtained results shows that readability increases in both chat gaming genres and Twitter texts. Normalised texts are not only easier to understand after processing Web 2.0 specific features such as emoticons, slang or wrong-cased words (see Table 2), but their average grade level also decreases. These results demonstrate that the use of a text normalisation web service can improve the accessibility of textual UGC in web games and applications for people with cognitive disabilities.

Table 1: Readability indexes before and after normalisation

Dataset | ARI | COLEMAN-LIAU | LIX | FOG |
---|---|---|---|---|
Twitter Original | 4.68 | 9.13 | 22.85 | 3.8 |
Twitter Normalised | 3.58 | 8.04 | 21.87 | 3.66 |
Kongregate Original | 2.88 | 4.4 | 12.93 | 2.27 |
Kongregate Normalised | 1.62 | 3.44 | 13.24 | 2.36 |
TF2Logs Original | 3.73 | 4.08 | 12.85 | 1.91 |
TF2Logs Normalised | 1.01 | 2.43 | 10.70 | 1.75 |

Table 2: Web 2.0 specific features before and after normalisation

Dataset | OOV words | Emoticons | Slang words | Wrong-cased words |
---|---|---|---|---|
Twitter Original | 0.17 | 0.01 | 0.05 | 0.11 |
Twitter Normalised | 0.1 | 0.0 | 0.03 | 0.02 |
Kongregate Original | 0.31 | 0.02 | 0.19 | 0.35 |
Kongregate Normalised | 0.12 | 0.0 | 0.05 | 0.02 |
TF2Logs Original | 0.43 | 0.01 | 0.27 | 0.61 |
TF2Logs Normalised | 0.22 | 0.0 | 0.03 | 0.07 |
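
For reference, three of the readability indexes reported in Table 1 can be computed from simple text counts using their standard published formulas. The sketch below is illustrative only: the regex-based sentence and word splitting and the function names are assumptions, not the procedure used to produce the table, and the FOG index is omitted because it additionally requires syllable counting.

```python
import re


def text_stats(text: str) -> dict:
    """Count sentences, words, characters and long words with naive regexes."""
    sentences = [s for s in re.split(r"[.!?]+", text) if s.strip()]
    words = re.findall(r"[A-Za-z0-9]+", text)
    if not words:
        raise ValueError("text contains no words")
    return {
        "sentences": max(len(sentences), 1),
        "words": len(words),
        # Alphanumeric characters only; an approximation of letter counts.
        "chars": sum(len(w) for w in words),
        # LIX treats words longer than six characters as long words.
        "long_words": sum(1 for w in words if len(w) > 6),
    }


def ari(s: dict) -> float:
    """Automated Readability Index."""
    return (4.71 * s["chars"] / s["words"]
            + 0.5 * s["words"] / s["sentences"] - 21.43)


def coleman_liau(s: dict) -> float:
    """Coleman-Liau index from letters and sentences per 100 words."""
    letters_per_100 = 100 * s["chars"] / s["words"]
    sentences_per_100 = 100 * s["sentences"] / s["words"]
    return 0.0588 * letters_per_100 - 0.296 * sentences_per_100 - 15.8


def lix(s: dict) -> float:
    """LIX: average sentence length plus percentage of long words."""
    return s["words"] / s["sentences"] + 100 * s["long_words"] / s["words"]


stats = text_stats("The cat sat. The dog ran.")
print(round(ari(stats), 2), round(coleman_liau(stats), 2), lix(stats))
```

Lower values indicate text that is easier to read, so a drop between an "Original" and a "Normalised" row corresponds to improved readability under that index.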
We plan to integrate the normalisation API into an open source web multiplayer game in order to evaluate TENOR output by collecting statistics and user feedback. This human evaluation will help to identify the strengths and weaknesses of our approach with different kinds of users and to discover future improvements.