This extended abstract is a contribution to the Easy-to-Read on the Web Symposium. The contents of this paper were not developed by the W3C Web Accessibility Initiative (WAI) and do not necessarily represent the consensus view of its membership.

Improving the Readability of User-generated Content in Web Games Using Text Normalisation

1. Problem Description

User-generated content (UGC) has transformed the way that information is handled online. In this paradigm shift, users create, share and consume textual information that is likely to present informal features such as poor formatting, misspellings, phonetic transliterations, slang or lexical variants (Ritter et al., 2010). These texts, found in social networks, chats or blogs, usually offer poor accessibility for people with cognitive disabilities or people unfamiliar with these non-standard language deviations. Moreover, additional challenges appeared when social web gaming entered the mainstream thanks to new technologies such as HTML5, which support hundreds of graphically rich multiplayer web games. The HTML5 canvas element is merely a low-level drawing surface, and text rendered with this element lacks support for automatic accessibility or localisation tools. Thus, the informal and noisy textual UGC found in in-game chats is usually difficult to understand for both people and accessibility tools such as screen readers and text simplification or normalisation applications. Also, some online gaming communities develop their own subculture and vocabulary, which can exclude newcomers. To overcome these challenges, developers of web and social multiplayer games should normalise user input in order to provide alternative clean texts. For this reason we propose TENOR, a multilingual text normalisation web service that aims to help web and online game developers process textual UGC in real time in a way that can be understood by the majority of users.

2. Background

Making UGC more accessible to everybody is a relevant issue that is gaining a lot of attention in the research community. Approaches such as text normalisation (Melero et al., 2012; Han and Baldwin, 2011), text simplification (Simple Wikipedia) and projects such as Accessible Twitter (Easy Chirp) constitute good contexts for making progress in this area. To our knowledge, the problem of UGC accessibility in games remains largely unexplored. Related approaches focus on breaking physical and cognitive gameplay barriers rather than on solving textual communication issues between players. Because multiplayer web gaming can only be understood as a social activity, newcomers or impaired users who are unable to understand or write in-game messages to other players are cut off from an important communication channel, which degrades their overall gaming experience or even renders them unable to play at all.

3. Approach

TENOR, our text normalisation tool, works in real time by first identifying out-of-vocabulary (OOV) English and Spanish words. A search for substitution candidates is then performed in a word lattice using language models in combination with lexical and phonetic edit distances (Mosquera and Moreda, 2012). TENOR filters non-printable characters and non-standard punctuation symbols, and replaces special combinations of punctuation and characters, such as emoticons, with their appropriate textual equivalents. Common word transformations, slang terms, word lengthenings and transliterations are also detected and replaced. The application has been adapted to run as a RESTful web service using JSON for data transfer, with the aim of providing a lightweight API for web applications.
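The kind of processing described above can be illustrated with a minimal Python sketch. It is not TENOR's implementation: the lexicon, the emoticon and slang tables, the distance threshold and all function names are hypothetical, and the candidate search uses only a plain lexical (Levenshtein) edit distance over a toy word list, whereas the tool combines language models with lexical and phonetic distances over a word lattice.

```python
import re

# Toy resources: illustrative assumptions, not TENOR's actual lexicons.
LEXICON = {"that", "was", "so", "cool", "see", "you", "later",
           "great", "game", "happy"}
EMOTICONS = {":)": "happy", ":(": "sad"}
SLANG = {"u": "you", "l8r": "later", "gr8": "great"}
MAX_DISTANCE = 2  # hypothetical cut-off for accepting a candidate


def edit_distance(a: str, b: str) -> int:
    """Levenshtein distance, computed one row at a time."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        curr = [i]
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            curr.append(min(prev[j] + 1,          # deletion
                            curr[j - 1] + 1,      # insertion
                            prev[j - 1] + cost))  # substitution
        prev = curr
    return prev[-1]


def squeeze_variants(word: str):
    """Yield the word with runs of 3+ repeated characters cut to two, then one."""
    yield re.sub(r"(.)\1{2,}", r"\1\1", word)
    yield re.sub(r"(.)\1{2,}", r"\1", word)


def normalise_token(token: str) -> str:
    """Normalise one whitespace-delimited token."""
    if token in EMOTICONS:
        return EMOTICONS[token]
    word = token.lower()
    if word in LEXICON:          # in-vocabulary: nothing to do
        return word
    if word in SLANG:
        return SLANG[word]
    for variant in squeeze_variants(word):   # word lengthening
        if variant in LEXICON:
            return variant
    # Remaining OOV word: take the closest lexicon entry if it is near enough.
    best = min(sorted(LEXICON), key=lambda cand: edit_distance(word, cand))
    distance = edit_distance(word, best)
    if distance <= MAX_DISTANCE and distance < len(word):
        return best
    return token                 # no plausible candidate: leave unchanged


def normalise(text: str) -> str:
    """Normalise a chat message token by token."""
    return " ".join(normalise_token(t) for t in text.split())


print(normalise("that was sooo coool :)"))
print(normalise("see u l8r"))
```

With these toy tables, the lengthened words are reduced to "so" and "cool", the emoticon is replaced by a word, and the slang tokens are expanded. A realistic system would need full English and Spanish lexicons and context-aware candidate ranking.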

4. Challenges

In this study, we have found that there are very few publicly available chat corpora collected from Web 2.0 games. For this reason, we have used a dataset of 3847 texts, composed of in-game chat logs from Team Fortress 2, a multiplayer first-person shooter, and compared the normalisation results with those obtained in previous studies using chat logs from Kongregate, an online portal of web games, and Twitter, a microblogging site.

5. Outcomes

We have studied the variation of several text readability indexes before and after the lexical normalisation process (see Table 1). The analysis of the obtained results shows that readability increases in both chat gaming genres and Twitter texts. Normalised texts are not only easier to understand after processing Web 2.0 specific features such as emoticons, slang or wrong-cased words (see Table 2), but their average grade level also decreases. These results demonstrate that the use of a text normalisation web service can improve the accessibility of textual UGC in web games and applications for people with cognitive disabilities.

Table 1: Variation of several readability metrics.

Dataset                 ARI    COLEMAN-LIAU   LIX     FOG
Twitter Original        4.68   9.13           22.85   3.8
Twitter Normalised      3.58   8.04           21.87   3.66
Kongregate Original     2.88   4.4            12.93   2.27
Kongregate Normalised   1.62   3.44           13.24   2.36
TF2Logs Original        3.73   4.08           12.85   1.91
TF2Logs Normalised      1.01   2.43           10.70   1.75

Table 2: Average frequencies of several text features before and after normalisation.

Dataset                 OOV words   Emoticons   Slang words   Wrong-cased words
Twitter Original        0.17        0.01        0.05          0.11
Twitter Normalised      0.1         0.0         0.03          0.02
Kongregate Original     0.31        0.02        0.19          0.35
Kongregate Normalised   0.12        0.0         0.05          0.02
TF2Logs Original        0.43        0.01        0.27          0.61
TF2Logs Normalised      0.22        0.0         0.03          0.07
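
For readers unfamiliar with the indexes in Table 1, the following Python sketch computes two of them, ARI and LIX, from their standard published formulas. The tokenisation rules (regular expressions for words and sentences) and the function names are our own assumptions for illustration. The tooling used to produce the figures in Table 1 is not specified here, so this sketch is not expected to reproduce them exactly.

```python
import re


def _words(text: str) -> list:
    """Split text into word tokens (letters and digits, optional apostrophe part)."""
    return re.findall(r"[A-Za-z0-9]+(?:'[A-Za-z]+)?", text)


def _sentence_count(text: str) -> int:
    """Count sentences by splitting on ., ! and ?; text without them is one sentence."""
    return len([s for s in re.split(r"[.!?]+", text) if s.strip()])


def ari(text: str) -> float:
    """Automated Readability Index: 4.71*(chars/words) + 0.5*(words/sentences) - 21.43."""
    words = _words(text)
    if not words:
        raise ValueError("text contains no words")
    chars = sum(ch.isalnum() for w in words for ch in w)
    sentences = _sentence_count(text)
    return 4.71 * chars / len(words) + 0.5 * len(words) / sentences - 21.43


def lix(text: str) -> float:
    """LIX: words/sentences + 100 * (words longer than six characters) / words."""
    words = _words(text)
    if not words:
        raise ValueError("text contains no words")
    long_words = sum(1 for w in words if len(w) > 6)
    return len(words) / _sentence_count(text) + 100 * long_words / len(words)


sample = "The cat sat. The dog ran away quickly."
print(round(ari(sample), 2), round(lix(sample), 2))
```

Both indexes grow with average word and sentence length, which is why expanding a short slang token into a longer standard word can raise a score even when the text becomes easier to read.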

6. Future Research

We plan to integrate the normalisation API into an open source web multiplayer game in order to evaluate TENOR output by collecting statistics and user feedback. This human evaluation will help to identify the strengths and weaknesses of our approach with different kinds of users and to discover future improvements.

References

  1. Ritter, A., Cherry, C., Dolan, B. (2010) Unsupervised modeling of Twitter conversations. In HLT ’10: Human Language Technologies: The 2010 Annual Conference of the North American Chapter of the Association for Computational Linguistics, pages 172–180, Los Angeles, USA.
  2. Han, B., Baldwin, T. (2011) Lexical normalisation of short text messages: Makn sens a #twitter. Proceedings of the 49th Annual Meeting of the Association for Computational Linguistics: Human Language Technologies, pages 368–378, Portland, Oregon, USA, June. Association for Computational Linguistics.
  3. Melero, M., Costa-Jussà, M. R., Domingo, J., Marquina, M., Quixal, M. (2012) Holaaa!! writin like u talk is kewl but kinda hard 4 NLP. Proceedings of the Eighth International Conference on Language Resources and Evaluation (LREC'12).
  4. Mosquera, A., Moreda, P. (2012) TENOR: A Lexical Normalisation Tool for Spanish Web 2.0 Texts. Text, Speech and Dialogue - 15th International Conference, TSD 2012, Brno, Czech Republic, September 3-7, 2012. Proceedings. Lecture Notes in Computer Science 7499, Springer, pages 535–542.
  5. Easy Chirp (2012) Accessible Twitter. Available: http://www.easychirp.com. Last accessed 1 August 2012.
  6. Wikipedia (2012) Simple Wikipedia. Available: http://simple.wikipedia.org. Last accessed 1 August 2012.