This paper is a contribution to the Text Customization for Readability Online Symposium. It was not developed by the W3C Web Accessibility Initiative (WAI) and does not necessarily represent the consensus view of W3C staff, participants, or members.
How feasible is text customization for PDF?
1. Problem Description
In the past, PDF and text customization didn’t seem to go together well. The introduction of tagged PDF in 2001 did not help much. The publication of the international standard ISO 14289 – usually referred to as PDF/UA – may change the situation, as it establishes clear quality criteria that a well tagged PDF must meet. Assuming that well tagged PDFs may become more common in the near future, we wanted to find out whether and how such well tagged PDFs would lend themselves to text customization.
The currently available options to achieve a certain degree of text customizability for a PDF file are:
- Visually reflow the PDF page contents in the order in which the content is encoded (Adobe Reader and Adobe Acrobat); problems: only works well for very simple PDFs
- Customize background and foreground colors or contrast of text in PDF (Adobe Reader and Adobe Acrobat); problems: only works well for very simple or for text only PDFs
- Export PDF to a format that is more amenable to text customization, like Rich Text or HTML; problems: most tools do not make use of logical structure, thus the quality of the exported content depends mostly on the heuristics and overall quality of the given tool
None of these options work well or reliably for non-trivial PDF documents.
As our goal was not to develop yet another heuristic tool that guesses the semantic structure of an arbitrary PDF's content, we focused exclusively on tagged PDFs. Tagged PDFs – at least when they are tagged well – transport not only the raw content, like words, graphics objects or images, but also the logical structure of the document, including the intended sequence of its content objects and the semantic type of its parts, like headings, paragraphs, lists, tables, tables of contents and so on. In addition, a well tagged PDF will have alternate text for non-text content. As a consequence, a well tagged PDF has everything that is needed to derive a customizable text-centric representation of its content.
Our goal was not to research approaches for or problems with text customization as such.
Exporting PDF content to HTML – as opposed to a rich text format or a text processing format – seemed like a very logical decision as every user will have a browser available to display HTML. Actual text customization can then be achieved through the use of CSS style sheets as well as through features built into some browsers, like text scaling or page scaling, or application of user CSS styles.
The main research goals were:
- How useful is it to convert a well tagged PDF to HTML, and then use CSS to adjust the presentation of its content?
- How good is the tagging quality of typical real world tagged PDF files, based on the use of a special diagnostic CSS style sheet that visualizes the logical structure?
Text customization was tested using several example CSS style sheets for low vision, dyslexia and easy reading of long documents.
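To illustrate how such style sheets could be derived from individual reader preferences, here is a minimal Python sketch. It is not the style sheets used in this project: the `ReadingPrefs` class, the `user_css` function, the two presets and all concrete values (font sizes, colors, the "OpenDyslexic" font family name) are assumptions made for illustration only.

```python
from dataclasses import dataclass


@dataclass
class ReadingPrefs:
    """Hypothetical reader preferences; every default is an illustrative assumption."""
    font_family: str = "sans-serif"
    font_size_pt: int = 14
    line_height: float = 1.5
    foreground: str = "#000000"
    background: str = "#ffffff"
    max_width_em: int = 40


def user_css(prefs: ReadingPrefs) -> str:
    """Render the preferences as a single CSS rule for the exported HTML body."""
    return (
        "body {\n"
        f"  font-family: {prefs.font_family};\n"
        f"  font-size: {prefs.font_size_pt}pt;\n"
        f"  line-height: {prefs.line_height};\n"
        f"  color: {prefs.foreground};\n"
        f"  background-color: {prefs.background};\n"
        f"  max-width: {prefs.max_width_em}em;\n"
        "}\n"
    )


# Two illustrative presets (assumed values, not taken from any published style sheet).
LOW_VISION = ReadingPrefs(font_size_pt=24, foreground="#ffff00", background="#000000")
DYSLEXIA = ReadingPrefs(font_family='"OpenDyslexic", sans-serif', line_height=1.8)

if __name__ == "__main__":
    print(user_css(LOW_VISION))
```

The resulting text could be saved as a user style sheet or linked from the exported HTML. Because the presets are plain data, each reader can adjust single values without editing CSS by hand.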
A number of challenges were encountered throughout the development of the prototypical tagged PDF to HTML converter:
- Not all PDF standard tags can be mapped directly to HTML tags (for example "TOC" and "TOCI" for tables of contents and items in a table of contents, "Lbl" enclosing the bullet points in unordered lists)
- Not all constellations of tags that are possible in tagged PDF are acceptable in HTML (for example a Table tag inside a P tag)
- A tagged PDF may contain custom tags whereas custom tags are not a known concept in HTML
- The semantic quality of the extracted HTML structure relies heavily on the quality of the tagging structure in the PDF (garbage in, garbage out)
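The first three challenges can be made concrete with a small Python sketch. It is one possible strategy under stated assumptions, not the converter described in this paper: the mapping tables, the `pdf-` class prefix, the choice of `div` and `span` as fallbacks, and the `(tag, children)` tree representation are all hypothetical.

```python
import re
from html import escape

# Assumed direct mapping from PDF standard structure types to HTML elements.
DIRECT = {
    "P": "p", "H1": "h1", "H2": "h2", "H3": "h3", "H4": "h4", "H5": "h5", "H6": "h6",
    "L": "ul", "LI": "li", "Table": "table", "TR": "tr", "TH": "th", "TD": "td",
}
# Standard types without a direct HTML counterpart: generic element plus a class.
FALLBACK = {"TOC": "div", "TOCI": "div", "Lbl": "span", "LBody": "div"}
# HTML elements that must not appear inside <p>.
BLOCK = {"table", "ul", "div", "p"}


def element_for(tag):
    """Return (html_element, css_class_or_None) for a PDF structure tag."""
    if tag in DIRECT:
        return DIRECT[tag], None
    slug = re.sub(r"[^a-z0-9]+", "-", tag.lower())
    if tag in FALLBACK:
        return FALLBACK[tag], "pdf-" + slug
    # Unknown tag: treat it as a custom tag and keep its name as a class.
    return "div", "pdf-custom-" + slug


def to_html(node):
    """Serialize a (tag, children) tree; children are nodes or text strings."""
    if isinstance(node, str):
        return escape(node)
    tag, children = node
    name, cls = element_for(tag)
    has_block_child = any(
        not isinstance(c, str) and element_for(c[0])[0] in BLOCK for c in children
    )
    if name == "p" and has_block_child:
        # A table inside a paragraph is legal in tagged PDF but not in HTML.
        name, cls = "div", "pdf-p"
    attr = f' class="{cls}"' if cls else ""
    inner = "".join(to_html(c) for c in children)
    return f"<{name}{attr}>{inner}</{name}>"


if __name__ == "__main__":
    tree = ("P", ["See ", ("Table", [("TR", [("TD", ["x"])])])])
    print(to_html(tree))
```

Keeping the original PDF tag name in a class attribute means no structure information is lost, so a style sheet can still address "TOC", "Lbl" or custom tags even though HTML has no matching element.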
A prototypical implementation of a "convert tagged PDF to HTML" tool was developed and used on a wide range of PDF files, mostly from federal government agency websites in Germany, Switzerland and Denmark.
The experience was illuminating:
- the CSS based diagnostic view we developed throughout the project made it possible to quickly – in about 10 to 15 seconds for a PDF with up to 100 pages – assess the overall tagging quality of a PDF
- the tagging quality varies drastically; of the roughly 1400 PDFs on a German federal government agency's website, only a quarter (about 350) are tagged at all; of those tagged PDFs about 10% to 15% are tagged well, which amounts to roughly 35 to 50 files
- for these well tagged PDFs text customization looks very promising – whether for armchair reading, special color or contrast settings catering to low vision users or optimization for dyslexic users
In terms of actually applying text customization beyond the CSS based diagnostic view, we found the following:
- we made use of a number of example CSS style sheets suitable for low vision users and for dyslexic users
- as the low vision style sheet generator offered by Silas S. Brown shows, there are not just two or three such style sheets that make sense; rather, each low vision user may have slightly different needs and preferences. Nevertheless, offering a handful of sample styles for users with disabilities is of substantial educational value for just about anybody else, and makes quality assurance easier for those who wish to create tagged PDFs that make text customization as easy as possible.
Figure 1: The screenshot from the CSS based diagnostic view for the HTML exported from a tagged PDF shows how colored structure labels make it easy to assess the quality of the logical structure.
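A diagnostic view with colored structure labels can be approximated with CSS generated content. The following Python sketch produces such rules; it is an illustration under assumptions, not the diagnostic style sheet used in this project. The `diagnostic_css` function, the `EXAMPLE_COLORS` table and the specific colors and declarations are hypothetical.

```python
# Assumed color per HTML element; any distinguishable palette would do.
EXAMPLE_COLORS = {
    "h1": "#b00020",
    "h2": "#c64600",
    "p": "#1a5fb4",
    "li": "#26a269",
    "td": "#813d9c",
}


def diagnostic_css(colors):
    """Build CSS that outlines each element and prefixes it with its name."""
    rules = []
    for element, color in colors.items():
        # Outline the element so its extent is visible.
        rules.append(f"{element} {{ outline: 2px solid {color}; }}")
        # Insert a small colored label showing the element name.
        rules.append(
            f'{element}::before {{ content: "{element.upper()}"; '
            f"background: {color}; color: #ffffff; "
            "font: bold 0.7em sans-serif; padding: 0 0.3em; margin-right: 0.3em; }"
        )
    return "\n".join(rules)


if __name__ == "__main__":
    print(diagnostic_css(EXAMPLE_COLORS))
```

With labels like these, a page where every block is marked "P" and no heading labels appear points to poor tagging at a glance.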
A slightly refined version of the prototypical implementation will be released as a plug-in for Adobe Acrobat in November 2012 under the name "callas pdfGoHTML", and will be available free of charge from the callas software website.
6. Future Research
One important aspect that needs more research is the problem that tagged PDF can express logical content structure that is difficult if not impossible to map into HTML. It would be desirable to develop robust strategies for conversion of certain types of logical (sub-) structures in PDF into HTML equivalents.
Special thanks go to Silas S. Brown – after having run into his style sheet generator for low vision users, I contacted Silas for permission to use his style sheets. Not only did Silas grant such permission right away, he was also very helpful and offered valuable additional advice. Special thanks also go to Abelardo Gonzalez, who is the developer behind the Open Dyslexic font, which he makes available free of charge.
- Brown, Silas S. (2012). Stylesheets for low vision. Available: http://people.ds.cam.ac.uk/ssb22/css/. Last accessed 24th September 2012.
- Gonzalez, Abelardo (2012). OpenDyslexic, a Free Dyslexia Font. Available: http://abbiecod.es/2011/12/22/open-dyslexic-a-free-dyslexia-font/. Last accessed 20th October 2012.
- International Organization for Standardization (2012). ISO 14289-1:2012 Document management applications – Electronic document file format enhancement for accessibility – Part 1: Use of ISO 32000-1 (PDF/UA-1).