This extended abstract is a contribution to the Easy-to-Read on the Web Symposium. The contents of this paper were not developed by the W3C Web Accessibility Initiative (WAI) and do not necessarily represent the consensus view of its membership.
Readers with cognitive disabilities often experience difficulties understanding complicated words and deriving their meaning from the context (e.g. Devlin, 1999; Carroll et al., 1998). In order to make online texts more accessible to these groups of readers, their content should follow easy-to-read guidelines (Freyhoff et al., 1998; Mencap, 2002) and thus be stripped of "complex words or phrases that could be replaced with more commonly used words" (Cooper et al., 2010). We propose an approach to simplifying lexical content, based on an empirical analysis of a parallel corpus of original and manually simplified texts in Spanish. We focus on the treatment of reporting verbs (RepV), i.e. verbs that introduce both direct and indirect speech when reporting a speaker's language (Quirk et al., 1985), as a specific type of lexical unit that human editors have treated rather consistently. The present work is part of the Simplext project (Saggion et al., 2011), aimed at developing an automatic text simplification system for Spanish in order to make newspaper articles more accessible to readers with cognitive disabilities. The treatment of reporting verbs is just one element within the lexical module of this system. Constructions containing these verbs are particularly common in the journalistic genre. The simplification of these expressions could, therefore, enhance the readability of these texts for people with cognitive disabilities, and thus improve their access to these essential sources of information (Freyhoff et al., 1998).
Automatic lexical simplification is mainly seen as a task of synonym substitution: a difficult word is substituted with its simpler synonym, the main criterion of difficulty being word frequency (Carroll et al., 1998). However, because polysemy is common, unfortunate substitutions may occur if context is not taken into consideration (De Belder et al., 2010). While the problem of automatic text simplification has been relatively well addressed for English (e.g. Devlin and Unthank, 2006; Devlin and Tait, 1998) and Portuguese (e.g. Aluísio et al., 2008), the Simplext project (Saggion et al., 2011) is, to the best of our knowledge, the first one aimed at producing a text simplification system for Spanish. Our approach to lexical simplification involves a combination of lexical substitution based on synonymy and word sense disambiguation (Bott et al., 2012), and a set of rule-based transformations applied to different categories of lexical units (e.g. numerical expressions, ethnic adjectives, etc.). The case of the reporting verbs belongs to the latter category.
Since there are no large comparable corpora for Spanish as there are for English (e.g. the "original" Wikipedia and the Simple English Wikipedia), we compiled a modest corpus of 40 short news articles in Spanish (topics: international news and culture), published online and obtained from the news agency Servimedia. The texts were manually simplified by trained human editors, following a series of easy-to-read guidelines derived by a group of experts for the purpose of the Simplext project. After aligning the two sets of texts (original and simple) on the sentence level, we analysed all cases of lexical substitution of reporting verbs.
We observed the following:
The phenomenon of not entirely consistent substitution of all reporting verbs with "decir" can be rationalised from the following angles:
|ES (example)||EN (translation)|
|El PSOE [afirma] que España pierde a un "gigante de la escena" con la muerte de Manuel Alexandre.||The SSWP [confirm] that Spain has lost a "giant on stage" with Manuel Alexandre's death.|
|Muere el actor Manuel Alexandre. El Partido Socialista Obrero Español [señaló] su pena por la muerte del actor. El Partido Socialista [dijo] que Manuel Alexandre ha sido un extraordinario actor. Ha sido un actor que ha participado en los momentos más importantes del cine español. El Partido Socialista también [ha dicho] que el actor amaba su trabajo.||The actor Manuel Alexandre dies. The Spanish Socialist Workers' Party [indicated] their grief at the actor's death. The Socialist Party [said] that Manuel Alexandre was an extraordinary actor. He was an actor that participated in the most important moments of Spanish cinematography. The Socialist Party also [said] that the actor loved his job.|
|ES (example)||EN (translation)||Pattern|
|Original||El Museo del Prado acogerá en 2014 una gran exposición dedicada a El Greco, con motivo del IV centenario del fallecimiento del pintor, [según anunció] este martes la presidenta de la Comunidad de Madrid, Esperanza Aguirre.||The president of the Community of Madrid, Esperanza Aguirre, [announced] this Tuesday that in 2014 the Prado Museum is going to house a large exhibition dedicated to El Greco, on the occasion of the fourth centenary of the painter's death.||"según" (according to) + RepV|
|Simplified||Esperanza Aguirre, presidenta de la Comunidad de Madrid, [anunció la exposición].||Esperanza Aguirre, the president of the Community of Madrid, [announced the exhibition].||V + Object|
However, the European Guidelines for the Production of Easy-to-Read Information for People with Learning Disability stipulate that only the simplest and most common words should be used in texts written for this target audience; that long words should be avoided; and that the same term should be used consistently to refer to the same concept, regardless of stylistic considerations (Freyhoff et al., 1998). Therefore, we opted for substituting all RepV with "decir" (say), which is both the most common and the most general reporting verb (Quirk et al., 1985; Bosque Muñoz and Demonte Barreto, 1999) and shorter than any of its semantic equivalents. We also found that such substitutions eliminate polysemy, as is the case with the verb "indicar", which in Spanish means both "point" (the literal meaning) and "point out" (non-literal meaning). As stated in the WCAG 2.0 guidelines, the use of non-literal meaning should be avoided in easy-to-read writing.
Our rules currently recognise 31 different RepV in 20 different patterns, thus covering additional cases which were not found in our corpus. An illustration of two of these patterns is given in Table 4 (English translation does not always structurally mimic the original):
|ES (example)||EN (translation)||Pattern|
|Juan Antonio Bardem [indicó que] esa industria cayó en calidad...||Juan Antonio Bardem [pointed out that] the quality of this industry suffered...||RepV + relative pronoun introducing a clause|
|Las pequeñas compañías...se vieron afectadas, [según explicó] el secretario general de la SGAE.||Small companies...have been affected, [as] the Secretary General of the SGAE [explained].||"según" (according to) + RepV|
The patterns were derived bearing in mind structural restrictions posed by different verbs and the inability to always fit both the original and the substitute verb in the same syntactic environment (Section 4).
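The two patterns illustrated above can be sketched as regular-expression rules. This is only a minimal illustration, not the Simplext implementation: the mapping `REPV_TO_DECIR`, the function name `simplify_reporting_verbs`, and the handful of inflected forms are assumptions made for the example, and a real system would need morphological analysis to choose the matching form of "decir".

```python
import re

# Hypothetical mapping from inflected reporting-verb forms to the matching
# form of "decir"; the actual list of 31 verbs is not reproduced here.
REPV_TO_DECIR = {
    "indicó": "dijo",
    "explicó": "dijo",
    "señaló": "dijo",
    "afirma": "dice",
}

_FORMS = "|".join(map(re.escape, REPV_TO_DECIR))

# Pattern A: RepV followed by "que" introducing a clause.
REPV_QUE = re.compile(rf"\b({_FORMS})(\s+que)\b")
# Pattern B: "según" (according to) followed by RepV.
SEGUN_REPV = re.compile(rf"\b(según\s+)({_FORMS})\b")


def simplify_reporting_verbs(sentence: str) -> str:
    """Replace listed reporting verbs with 'decir' in the two patterns only."""
    sentence = REPV_QUE.sub(
        lambda m: REPV_TO_DECIR[m.group(1)] + m.group(2), sentence
    )
    sentence = SEGUN_REPV.sub(
        lambda m: m.group(1) + REPV_TO_DECIR[m.group(2)], sentence
    )
    return sentence


print(simplify_reporting_verbs(
    "Juan Antonio Bardem indicó que esa industria cayó en calidad"
))
```

Because a verb is only replaced inside one of the listed patterns, a literal use such as "indicó el camino" (pointed the way) is left untouched in this sketch.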
The difficulties we encountered mainly concern syntactic restrictions. Due to their differences in argument structure, not all RepV combine with other elements of the sentence in the same manner. The verb "advertir" (warn), for example, can combine in two different ways: (1) taking two objects [warn O1 about O2]; or (2) being followed by a clause [warn that CLAUSE]. The substitute "decir" only fits in the second structure, and inserting it in the first structure would require further syntactic transformations we are currently not able to carry out. Furthermore, many verbs used to report speech are polysemic, such as "underline", "add", "conclude", etc. We, therefore, need to restrict the rules to those contexts where (a) the verb in the original is indeed a RepV (and is not used with another meaning); and (b) the verb "decir" fits the original structure without requiring additional syntactic transformations. This somewhat reduces recall, but yields perfect precision, eliminating all cases of false positives (Section 5).
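Conditions (a) and (b) can be read as a guard that is evaluated before any substitution takes place. The sketch below is a simplified illustration under stated assumptions: the word lists, the function names `can_substitute` and `substitute`, and the example sentences are invented for this example, and "directly followed by que" is used as a rough stand-in for the structural check described above.

```python
# Hypothetical word lists, for illustration only.
REPORTING_FORMS = {"advirtió": "dijo", "afirmó": "dijo"}
# Forms kept off the list because they stay ambiguous even before a clause.
AMBIGUOUS_FORMS = {"recordó"}


def can_substitute(tokens: list[str], i: int) -> bool:
    """Return True if tokens[i] is a listed RepV directly followed by 'que'."""
    form = tokens[i].lower()
    # Condition (a): only listed, unambiguous reporting verbs qualify.
    if form in AMBIGUOUS_FORMS or form not in REPORTING_FORMS:
        return False
    # Condition (b): "decir" only fits the [verb + que CLAUSE] structure.
    return i + 1 < len(tokens) and tokens[i + 1].lower() == "que"


def substitute(sentence: str) -> str:
    """Apply the substitution only where the guard holds."""
    tokens = sentence.split()
    out = [
        REPORTING_FORMS[tok.lower()] if can_substitute(tokens, i) else tok
        for i, tok in enumerate(tokens)
    ]
    return " ".join(out)


# Clause structure: substituted.
print(substitute("El alcalde advirtió que lloverá"))
# Two-object structure: left unchanged.
print(substitute("El alcalde advirtió a los vecinos del peligro"))
```

Skipping every uncertain case is what trades recall for precision: the guard never rewrites a sentence it cannot handle, at the cost of leaving some genuine reporting verbs in place.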
We tested our rules on a set of 40 randomly chosen news articles, ten from each of four topics: national news, international news, culture, and society. The rules achieved perfect precision (P=1) and reasonable recall (R=0.74), giving an F-score of 0.85. More than 25% of the texts had more than three substitutions, with a maximum of six substitutions in a single text. Therefore, we believe our rules could be a significant contribution to a larger module of lexical simplification, a hypothesis to be tested at a later evaluation stage.
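Assuming the reported F-score is the balanced F-measure (the harmonic mean of precision and recall), which is consistent with the figures given, the value can be checked as follows:

```latex
F = \frac{2PR}{P + R}
  = \frac{2 \times 1 \times 0.74}{1 + 0.74}
  = \frac{1.48}{1.74}
  \approx 0.85
```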
As for the targets our rules failed to recognise, the majority were cases of structures not covered by the rules or RepV not included in our list. For example, the verb "recordar" is polysemic even in the structures covered by the rules (it means both "remember" and "remind"), and therefore, it could not be included on the list of verbs to be recognised.
We plan to expand our RepV list by testing the rules on a larger corpus that would cover a wider variety of topics within the journalistic genre, with the aim of expanding the scope of the application of our rules. Even though, as part of the Simplext project, we have implemented an initial version of a lexical substitution system based on extracting synonyms from the Spanish OpenThesaurus (Bott et al., 2012), we have found that only a third of the verbs on our current list are synonymous with "decir" according to the thesaurus, and, therefore, direct rule-based substitution actually works better in this case. Current corpus analysis revealed similar changes applied as consistently to other types of lexical units, such as ethnic adjectives or numerical expressions. We are working on the derivation of rules for as many of these cases as possible (Bautista et al., 2012), with the aim of implementing them all together in a single lexical simplification module of the Simplext system.
This work is part of the project Simplext: An automatic system for text simplification, file number TSI-020302-2010-84 (http://www.simplext.es). We also gratefully acknowledge the fellowship RYC-2009-04291 from Programa Ramón y Cajal 2009, Ministerio de Economía y Competitividad, Secretaría de Estado de Investigación, Desarrollo e Innovación, Spain.