This extended abstract is a contribution to the Easy-to-Read on the Web Symposium. The contents of this paper were not developed by the W3C Web Accessibility Initiative (WAI) and do not necessarily represent the consensus view of its membership.

Easy-to-read text characteristics across genres

Katarina Mühlenbock. DART, Sahlgrenska University Hospital, katarina.muhlenbock@vgregion.se
Mats Lundälv. DART, Sahlgrenska University Hospital, mats.lundalv@vgregion.se
Sandra Derbring. DART, Sahlgrenska University Hospital, sandra.derbring@vgregion.se

1. Problem Description

Traditional readability indices and formulas have mainly rested on shallow features such as word and sentence length, while deeper linguistic features that contribute to text understandability have been ignored. Furthermore, the needs of specific groups of readers have generally been overlooked, and easy-to-read texts have predominantly been produced to fit a broad audience including second-language learners, readers with dyslexia, beginning readers and persons with cognitive disabilities. We suggest an NLP (natural language processing) method for readability assessment of texts, based on deep linguistic features. We also propose some properties that might characterize text genre and correspond to a certain level of interest and intelligibility for a narrower target group. Finally, in an ongoing project we build on findings from previous activities and suggest a method for supplying easy-to-read newspaper texts with symbol support for persons in need of AAC (Augmentative and Alternative Communication).

2. Background

Readability is closely related to both genre and text type. The genre denotes varieties of literature that employ different textual conventions (Biber and Conrad, 2009), while the text type refers to variation within and across genres. The notion of linguistic co-occurrence, analysed in terms of underlying "dimensions" of variation, is central to linguistic analyses of genres and text types, and the linguistic content of a dimension comprises a group of linguistic features that co-occur with a markedly high frequency in texts (Biber, 1993). Traditional readability metrics, such as LIX for Swedish (Björnsson, 1968), do not convey any information about the impact of genre features on text complexity. This implies that an ordinary text from the light reading genre may exhibit the same LIX value as an easy-to-read informative text for community citizens.
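For reference, LIX is computed from surface counts only, which is why it cannot reflect genre. The following is a minimal Python sketch of the standard formula (mean sentence length plus the percentage of words longer than six letters). The regex-based tokenisation and sentence splitting are simplifying assumptions made for illustration, not the preprocessing used in the study.

```python
import re


def lix(text: str) -> float:
    """Return the LIX value: words/sentences + 100 * long_words/words.

    A word counts as "long" when it has more than six letters.
    Tokenisation here is a rough assumption: words are runs of letters,
    and sentences end at '.', '!' or '?'.
    """
    words = re.findall(r"[^\W\d_]+", text)
    sentences = [s for s in re.split(r"[.!?]+", text) if s.strip()]
    if not words or not sentences:
        raise ValueError("text must contain at least one word and one sentence")
    long_words = [w for w in words if len(w) > 6]
    return len(words) / len(sentences) + 100 * len(long_words) / len(words)


if __name__ == "__main__":
    sample = "The cat sat. The elephant remembered everything."
    print(round(lix(sample), 2))
```

Because only word and sentence lengths enter the calculation, two texts with very different vocabulary or syntax can receive the same value.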

In a recent study (Heimann Mühlenbock, forthcoming), we have organised linguistic features into different categories in order to achieve a multilevel analysis of readability. The study is corpus-based, using texts from three different genres and two different text types - easy-to-read texts and ordinary texts. All texts were cross-compared statistically across genre as well as text type, with regard to linguistic features at the surface, vocabulary, sentence and idea density levels. In addition, several text classification tasks aimed at distinguishing between easy-to-read and ordinary texts were performed and evaluated. The results from both statistical analyses and automated text classification support the hypothesis that easy-to-read texts differ from ordinary texts at various language levels depending on genre, and that the influence of such features is underestimated when relying on traditional readability indices. In addition, we suggest that the impact of certain textual features should be taken into consideration when choosing texts for specific readers. Among other attempts in this direction, we have supplied easy-to-read news on the internet with symbol support (Mühlenbock et al., 2008). This small-scale project was found to be successful, and is further explored in an ongoing project.

As an essential complement to addressing this range of text characteristics, we also suggest that more attention be directed at providing infrastructural multilingual and multi-modal lexicon support for a wide range of user needs in relation to text comprehension.

3. Approach

A corpus of easy-to-read texts from three different genres - fiction, community information and daily news - was compiled. The fiction part was further subdivided into texts targeted at children and at adults. Quantitative analysis of 23 different linguistic features related to surface structure, vocabulary load, sentence structure and idea density was performed on documents from each of the genres in the easy-to-read corpus, and the results were compared to the corresponding feature values in documents from the same genres in a corpus of ordinary texts. Literature aimed at children was expected to differ from that aimed at adults with respect to these features. Significance tests generally revealed large variations between the text types, but also that the features involved were clustered at different linguistic levels depending on genre. In addition to the descriptive statistical analysis, a binary text classification experiment was carried out. The reason for this follow-up was twofold: it was expected to strengthen hypotheses about the influence of different features on text complexity and hence readability, and also to shed further light on the overall genre impact. The classification task was performed with a support vector machine, an algorithm that has been found to work well in a number of studies (Joachims, 1998; Yu, 2008). The results were evaluated by classification accuracy, which can be seen as an indication of the relevance of the specific features implemented in the classifier.
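The binary classification step can be illustrated with a small linear support vector machine. The sketch below is an assumption-laden stand-in rather than the setup used in the study: it trains on the hinge loss with plain subgradient descent, the function names are hypothetical, and the two-dimensional feature vectors are invented, already standardised toy values (the study used 23 features that are not reproduced here).

```python
from typing import List, Sequence, Tuple

Vector = Sequence[float]


def train_linear_svm(X: Sequence[Vector], y: Sequence[int],
                     epochs: int = 50, lr: float = 0.1,
                     lam: float = 0.01) -> Tuple[List[float], float]:
    """Fit weights w and bias b by subgradient descent on the L2-regularised hinge loss.

    Labels must be +1 (here: easy-to-read) or -1 (here: ordinary text).
    """
    w = [0.0] * len(X[0])
    b = 0.0
    for _ in range(epochs):
        for xi, yi in zip(X, y):
            margin = yi * (sum(wj * xj for wj, xj in zip(w, xi)) + b)
            # The regularisation term shrinks the weights at every step.
            w = [wj * (1 - lr * lam) for wj in w]
            if margin < 1:
                # The sample violates the margin: move the hyperplane towards it.
                w = [wj + lr * yi * xj for wj, xj in zip(w, xi)]
                b += lr * yi
    return w, b


def predict(w: Vector, b: float, x: Vector) -> int:
    """Return +1 or -1 depending on the side of the hyperplane."""
    score = sum(wj * xj for wj, xj in zip(w, x)) + b
    return 1 if score >= 0 else -1


def accuracy(w: Vector, b: float, X: Sequence[Vector], y: Sequence[int]) -> float:
    """Share of samples whose predicted label matches the true label."""
    hits = sum(1 for xi, yi in zip(X, y) if predict(w, b, xi) == yi)
    return hits / len(y)


# Invented, standardised toy features (not data from the study).
X_toy = [(-2.0, -1.0), (-1.0, -2.0), (1.0, 2.0), (2.0, 1.0)]
y_toy = [-1, -1, 1, 1]

if __name__ == "__main__":
    w, b = train_linear_svm(X_toy, y_toy)
    print(accuracy(w, b, X_toy, y_toy))
```

The accuracy printed here is measured on the training data only. A meaningful evaluation would use held-out documents or cross-validation, and in practice an established SVM library would normally replace this hand-written trainer.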

After completion of the descriptive statistical analysis and the text classification evaluation, we performed a small-scale experiment on easy-to-read newspaper texts on the web. Given that the easy-to-read newspaper texts investigated proved to be internally homogeneous in terms of vocabulary load, we wanted to further explore the idea that this genre might be appropriate for additional AAC support in order to make it more understandable for certain groups of readers.

4. Challenges

NLP approaches are heavily dependent upon consistency with regard to preprocessing methods and tools. The results from large-scale quantitative studies can easily be jeopardized by incompatible tagsets or parsing conventions.

5. Outcomes

Both descriptive statistical analysis and text classification evaluation reinforced the idea that genre-specific text characteristics must be taken into consideration when choosing easy-to-read texts on the web. To exemplify: the easy-to-read news genre proved to differ from its ordinary counterpart at all feature levels, including the surface level, but the classification accuracy increased to nearly 100% when deeper linguistic features were utilised in the analysis, as compared to 76% when the analysis was restricted to surface measures alone as in LIX (Björnsson, 1968). Ordinary newspaper texts had a mean LIX value of 40, while the easy-to-read texts from the same genre showed a mean LIX value of 35, although the two text types proved to be significantly different in mean word length as well as mean sentence length (p < 0.001, t-test). Principal component analysis showed that features at all the mentioned levels contributed to the classification results. Given this, these features may also be used to improve guidelines for writing text.
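A comparison of per-document means such as the one reported above can be sketched with a two-sample t statistic. The abstract does not state which t-test variant was used, so Welch's unequal-variance version is an assumption here, and the function name is hypothetical. The sketch returns the statistic and the degrees of freedom only; obtaining a p-value would additionally require the t distribution from a statistics library.

```python
import math
import statistics
from typing import Sequence, Tuple


def welch_t(a: Sequence[float], b: Sequence[float]) -> Tuple[float, float]:
    """Return (t statistic, degrees of freedom) for Welch's two-sample t-test.

    a and b hold one value per document, for example mean word length.
    """
    if len(a) < 2 or len(b) < 2:
        raise ValueError("each sample needs at least two observations")
    # Squared standard error contributed by each sample.
    va = statistics.variance(a) / len(a)
    vb = statistics.variance(b) / len(b)
    t = (statistics.mean(a) - statistics.mean(b)) / math.sqrt(va + vb)
    # Welch-Satterthwaite approximation of the degrees of freedom.
    df = (va + vb) ** 2 / (va ** 2 / (len(a) - 1) + vb ** 2 / (len(b) - 1))
    return t, df
```

A large absolute t value relative to the degrees of freedom indicates that the two groups differ in their mean, which is the kind of evidence behind the reported significance level.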

Attempts at providing graphic symbol support for an existing Swedish easy-to-read news service (8 SIDOR) are currently under way in a project that utilizes the results from an earlier small-scale experiment (Mühlenbock et al., 2008). The project exploits the Concept Coding Framework (CCF) multimodal vocabulary resources developed in the recently completed AEGIS project (Lundälv and Derbring, 2012), complemented with some specific materials for the news genre.

6. Future Research

There are no simple fixes to problems with text comprehension. A range of complementary actions is needed to meet the complexity of user needs. These include continued research for a deeper understanding of the different aspects of what easy-to-read text may be, as well as major, joint efforts to develop standardised infrastructural multilingual and multi-modal lexicon resources for widespread use in web and other ICT-based services. As suggested in this paper, text comprehension enhancements could consist of identifying text genres, matching suitable genres to specific groups, and integrating symbol or picture support.

References

Biber, D. (1993). The Multi-Dimensional Approach to Linguistic Analyses of Genre Variation: An Overview of Methodology and Findings. Computers and the Humanities, 26:331-345.
Biber, D., Conrad, S. (2009). Register, Genre, and Style. Cambridge, UK: Cambridge University Press.
Björnsson, C.H. (1968). Läsbarhet. Stockholm: Bokförlaget Liber.
Heimann Mühlenbock, K. (Forthcoming). I see what you mean - Assessing readability for specific target groups. Ph.D. University of Gothenburg.
Joachims, T. (1998). Text categorization with support vector machines: Learning with many relevant features. Lecture Notes in Computer Science (ECML-98), 1398:137-142.
Lundälv, M., Derbring, S. (2012). Towards General Cross-Platform CCF Based Multi-Modal Language Support. In: Miesenberger, K.; Karshmer, A.; Klaus, J.; Zagler, W., eds. Proceedings of Computers Helping People with Special Needs, 13th International Conference, ICCHP 2012, Linz, Austria, July 11-13, 2012. Berlin, Heidelberg: Springer-Verlag, pp. 261-268. DOI:10.1007/978-3-642-31534-3_40
Mühlenbock, K., Roxendal, J., Rudberg, J., Lundälv, M. (2008). Symbol supported news text on the Internet - A corpus-based approach. Second Scandinavian Language Technology Conference. Stockholm, Sweden, Nov. 20-21, 2008.
Yu, B. (2008). An evaluation of text classification methods for literary study. Literary and Linguistic Computing, 23(3):327-343.