This extended abstract is a contribution to the Text Customization for Readability Symposium. The contents of this paper were not developed by the W3C Web Accessibility Initiative (WAI) and do not necessarily represent the consensus view of its membership.
In the past, PDF and text customization didn’t seem to go together well. The introduction of tagged PDF in 2001 did not help much. The publication of the international standard ISO 14289 – usually referred to as PDF/UA – may change the situation, as it establishes clear quality criteria that a well-tagged PDF must meet. Assuming that well-tagged PDFs may become more common in the near future, we wanted to find out whether and how such PDFs would lend themselves to text customization.
The currently available options to achieve a certain degree of text customizability for a PDF file are:
None of these options work well or reliably for non-trivial PDF documents.
As our goal was not to develop yet another heuristic tool that guesses the semantic structure of an arbitrary PDF's content, we focused exclusively on tagged PDFs. Tagged PDFs – at least when they are tagged well – transport not only the raw content, such as words, graphics objects or images, but also the logical structure of the document, including the intended sequence of its content objects and the semantic type of its parts, such as headings, paragraphs, lists, tables, tables of contents and so on. In addition, a well-tagged PDF will have alternate text for non-text content. As a consequence, a well-tagged PDF has everything that is needed to derive a customizable text-centric representation of its content.
Our goal was not to research approaches for or problems with text customization as such.
Exporting PDF content to HTML – as opposed to a rich text format or a text processing format – seemed like a very logical decision as every user will have a browser available to display HTML. Actual text customization can then be achieved through the use of CSS style sheets as well as through features built into some browsers, like text scaling or page scaling, or application of user CSS styles.
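The export idea described above can be illustrated with a minimal Python sketch. It is not the prototype's code: the `StructElem` class is a hypothetical in-memory stand-in for a PDF structure element, the `TAG_MAP` table covers only a small subset of the standard structure types, and the `data-pdf-tag` attribute and the `custom.css` file name are assumptions made for illustration. Reading the structure tree out of a real PDF file is deliberately left out.

```python
from dataclasses import dataclass, field
from html import escape

# Illustrative subset of PDF standard structure types and the HTML elements
# they could map to. This table is an assumption, not the prototype's mapping.
TAG_MAP = {
    "Document": "article",
    "H1": "h1", "H2": "h2", "H3": "h3",
    "P": "p",
    "L": "ul", "LI": "li",
    "Table": "table", "TR": "tr", "TH": "th", "TD": "td",
}


@dataclass
class StructElem:
    """Hypothetical stand-in for one element of a tagged PDF's structure tree."""
    tag: str                      # structure type, e.g. "H1" or "P"
    text: str = ""                # text content owned directly by this element
    alt: str = ""                 # alternate text for non-text content
    children: list = field(default_factory=list)


def to_html(elem):
    """Serialize an element and its children in their logical order."""
    if elem.tag == "Figure":
        # Non-text content is represented by its alternate text.
        return f'<p class="figure-alt">{escape(elem.alt)}</p>'
    # Structure types without an obvious HTML equivalent fall back to <div>.
    html_tag = TAG_MAP.get(elem.tag, "div")
    inner = escape(elem.text) + "".join(to_html(c) for c in elem.children)
    return f'<{html_tag} data-pdf-tag="{escape(elem.tag)}">{inner}</{html_tag}>'


def to_page(root, stylesheet="custom.css"):
    """Wrap the converted content in a page that links one style sheet."""
    return (
        '<!DOCTYPE html><html><head><meta charset="utf-8">'
        f'<link rel="stylesheet" href="{escape(stylesheet)}">'
        f"</head><body>{to_html(root)}</body></html>"
    )


doc = StructElem("Document", children=[
    StructElem("H1", "Annual Report"),
    StructElem("P", "Revenue & costs rose."),
    StructElem("Figure", alt="Bar chart of revenue by year"),
])
page = to_page(doc)
```

Because the presentation lives entirely in the linked style sheet, swapping the `stylesheet` argument is enough to switch between customizations without touching the converted content.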
The main research goals were:
Text customization was tested through the exemplary use of several specific CSS style sheets for low vision, dyslexia and easy reading of long documents.
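To make the idea of purpose-specific style sheets concrete, the following Python sketch generates a small CSS rule set from a reading profile. The profile names, fonts, sizes and colors are illustrative assumptions only; they are not the style sheets used in the tests, and they are not recommendations for any particular group of readers.

```python
from dataclasses import dataclass


@dataclass
class ReadingProfile:
    """Hypothetical set of presentation preferences for one reader group."""
    font_family: str
    font_size_pt: int
    line_height: float
    foreground: str
    background: str
    max_width_em: int


# Illustrative values only; real style sheets would be tuned with users.
PROFILES = {
    "low-vision": ReadingProfile("Arial, sans-serif", 24, 1.5, "#ffff00", "#000000", 40),
    "dyslexia": ReadingProfile("OpenDyslexic, sans-serif", 14, 1.8, "#222222", "#fdf6e3", 35),
    "long-read": ReadingProfile("Georgia, serif", 13, 1.6, "#111111", "#ffffff", 32),
}


def build_css(profile):
    """Return a CSS rule for <body> that applies the given profile."""
    return (
        "body {\n"
        f"  font-family: {profile.font_family};\n"
        f"  font-size: {profile.font_size_pt}pt;\n"
        f"  line-height: {profile.line_height};\n"
        f"  color: {profile.foreground};\n"
        f"  background-color: {profile.background};\n"
        f"  max-width: {profile.max_width_em}em;\n"
        "}\n"
    )


low_vision_css = build_css(PROFILES["low-vision"])
```

Keeping the preferences in data rather than in hand-written CSS makes it straightforward to offer further profiles or let readers adjust individual values.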
A number of challenges were encountered throughout the development of the prototypical tagged PDF to HTML converter:
A prototypical implementation of a "convert tagged PDF to HTML" tool was developed and used on a wide range of PDF files, mostly from federal government agency websites in Germany, Switzerland and Denmark.
The experience was illuminating:
In terms of actually applying text customization beyond the CSS-based diagnostic view, we found the following:
Figure 1: The screenshot of the CSS-based diagnostic view for the HTML exported from a tagged PDF shows how colored structure labels make it easy to assess the quality of the logical structure.
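A diagnostic view of this kind can be approximated with CSS generated content. The Python sketch below builds such a style sheet under one assumption: each exported HTML element carries its original structure type in a hypothetical `data-pdf-tag` attribute. The attribute name and the colors are illustrative and are not taken from the prototype.

```python
# Illustrative label colors per structure type (assumed, not the prototype's).
LABEL_COLORS = {
    "H1": "#c62828",
    "H2": "#ad1457",
    "P": "#1565c0",
    "L": "#2e7d32",
    "LI": "#558b2f",
    "Table": "#6a1b9a",
}


def diagnostic_css(colors, default="#616161"):
    """Build CSS that prefixes every tagged element with a colored label."""
    rules = [
        # Generic rule: show the structure type as a small label before the
        # element's content, using the value of the data-pdf-tag attribute.
        "[data-pdf-tag]::before {\n"
        "  content: attr(data-pdf-tag);\n"
        f"  background: {default};\n"
        "  color: #ffffff;\n"
        "  font: bold 0.7em sans-serif;\n"
        "  padding: 0 0.3em;\n"
        "  margin-right: 0.4em;\n"
        "}"
    ]
    # Per-type rules override only the label's background color.
    for tag, color in colors.items():
        rules.append(f'[data-pdf-tag="{tag}"]::before {{ background: {color}; }}')
    return "\n".join(rules)


css = diagnostic_css(LABEL_COLORS)
```

Since the labels are produced purely by the style sheet, the same HTML can be shown with or without them, which keeps the diagnostic view separate from the reading views.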
A slightly refined version of the prototypical implementation will be released as a plug-in for Adobe Acrobat in November 2012 under the name "callas pdfGoHTML", and will be available free of charge from the callas software website.
One important aspect that needs more research is that tagged PDF can express logical content structure that is difficult, if not impossible, to map to HTML. It would be desirable to develop robust strategies for converting certain types of logical (sub-)structures in PDF into HTML equivalents.
Special thanks go to Silas S. Brown. After having run into his style sheet generator for low vision users, I contacted Silas for permission to use his style sheets. Not only did Silas grant such permission right away, he was also very helpful and offered valuable additional advice. Special thanks also go to Abelardo Gonzalez, who is the developer behind the OpenDyslexic font, which he makes available free of charge.