PDF Techniques
for Web Content Accessibility Guidelines
W3C Internal Working Draft 8 November 2000
- This version:
- http://www.w3.org/WAI/GL/2000/12/pdf.html
- Latest version:
- http://www.w3.org/WAI/GL/2000/12/pdf.html
- Editors:
- Loretta Guarino Reid, Adobe Systems
- Katie Haritos-Shea, Paradigm Solutions Corp. & NTIS - US Dept. of Commerce
- Wendy Chisholm, W3C
Status
This document is a Draft.
This document has been produced as part of the W3C Web Accessibility
Initiative. The goal of the Web Content Guidelines Working Group is discussed
in the Working Group charter. Last updated 2001-03-15.
Please send comments on this document to w3c-wai-gl@w3.org.
Abstract
This document describes techniques for creating accessible Adobe Portable
Document Format (PDF) content (refer to PDF Reference Second Edition, Version
1.3). This document is intended to help authors of Web content who wish to
claim conformance to "Web Content Accessibility Guidelines 1.0" ([WCAG10]).
Because PDF is a Page Description Language, not intended to be edited
directly, these techniques are intended particularly for the developers of
authoring tools that generate PDF as an output format.
For each technique, we identify the version of PDF in which the language
support is first available. Where no version is specified, the technique can
be applied in all versions of PDF. (Some of these items refer to language
features in PDF 1.4, which has not yet been released.)
This document is part of a series of documents about techniques for
authoring accessible Web content. For information about the other documents in
the series, please refer to "Techniques for Web Content Accessibility
Guidelines 1.0" WCAG 1.0
PDF Techniques
Guideline 1. Presentation:
Design content that allows presentation according to the user's needs and
preferences
Provide text equivalents for images and graphics
- Checkpoint:
- Technique:
- WCAG 1.0 Checkpoints in section 2:
1.1 Provide a text equivalent for every non-text
element. [Priority 1]
- WCAG 2.0 Checkpoints in section 2:
- NA yet
Identify the natural language of all text in the document
Can't decide if this comes under here or Interaction,
Comprehension or Technology Considerations
Use the language tagging facilities (Lang) to specify the natural language
of all text in the document. (PDF 1.4)
- Checkpoint:
- Technique:
- WCAG 1.0 Checkpoints in section 4:
4.1 Clearly identify changes in the natural language of
a document's text and any text equivalents (e.g., captions).
[Priority 1]
4.3 Identify the primary natural language of a
document. [Priority 3]
- WCAG 2.0 Checkpoints in section 4:
- NA yet
Foreground and Background colors
- Checkpoint:
- Technique:
Acrobat lets the user control foreground and background color of a
document. Text in the original document will use the foreground color.
Background elements will use the background color. A background element is
defined to be any rectangle that is aligned with the edges of the page (not
skewed or rotated) and that covers 50% or more of the page.
Avoid drawing rectangles behind text that are not background
elements. Even if the rectangle color matches the background color, when
the user changes the background color, the rectangle will not change and
may cause contrast problems with the new foreground color.
- Example: A document with a white background may draw
a title by placing characters on a pink rectangle. Since the pink
rectangle will not be recognized as part of the background, when the
user sets the foreground color to white and the background color to
black, the white characters will be hard to see on the pink
background.
Images are not affected by the background and foreground color
settings, so text should not be placed on top of images.
- Example: Black text will be visible on a
pastel-colored background image, but yellow text may not be, and
changing the background color to black will not change the colors in
the image.
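The background-element rule described above can be sketched as a small check. This is only an illustration, not Acrobat's implementation: the `Rect` type, the `is_background_element` name, and the reading of "aligned with the edges of the page" as "an unrotated rectangle" are assumptions made for this sketch.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Rect:
    """Rectangle in page units (hypothetical helper type)."""
    x: float
    y: float
    width: float
    height: float
    rotation_degrees: float = 0.0  # 0 means edges are parallel to the page edges

    @property
    def area(self) -> float:
        return self.width * self.height


def is_background_element(rect: Rect, page: Rect) -> bool:
    """Return True if rect is unrotated and covers 50% or more of the page."""
    if rect.rotation_degrees != 0:
        return False
    # Area of the part of the rectangle that actually lies on the page.
    overlap_w = min(rect.x + rect.width, page.x + page.width) - max(rect.x, page.x)
    overlap_h = min(rect.y + rect.height, page.y + page.height) - max(rect.y, page.y)
    if overlap_w <= 0 or overlap_h <= 0:
        return False
    return (overlap_w * overlap_h) / page.area >= 0.5


page = Rect(0, 0, 612, 792)
full_page_fill = Rect(0, 0, 612, 792)   # treated as background
title_box = Rect(72, 700, 468, 40)      # e.g. the pink rectangle behind a title
```

Under these assumptions `full_page_fill` is a background element and `title_box` is not, which is why the title box keeps its color when the user changes the background color.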
- WCAG 1.0 Checkpoints in section 7:
2.2 Ensure that foreground and background color
combinations provide sufficient contrast when viewed by someone having
color deficits or when viewed on a black and white screen.
[Priority 3]
- WCAG 2.0 Checkpoints in section 7:
- NA yet
Guideline 2, Interaction:
Design content that allows interaction according to the user's needs and
preferences
Generate text that is accessible
Can't decide if this comes under here or Technology
Considerations
- Checkpoint:
- Technique:
Generate the text of your document so that it can be extracted reliably in
logical reading order.
Render characters and words in reading order.
- Render words, and characters within words, in reading order within the
page-content stream. The ReverseChars marked content may be used
when rendering right-to-left text that will be typeset left-to-right
(PDF 1.4).
- Serves the same function as the non-text element.
- May contain structured content or metadata.
Deprecated Technique Examples:
- Some PDF authoring applications save space by rendering all the
characters in one font at a time. Hence, the PDF file may render
all the bold characters on a page, then all the normal weight
characters. Generally, this will cause characters not to be
rendered in reading order.
- A PDF page may render all characters left-to-right,
top-to-bottom. For a multi-column document, this does not render
the words in reading order, since the first line of the second
column will be rendered before the second line of the first
column.
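The multi-column problem above can be illustrated with a short sketch. The `TextRun` type and its `column` and `line` fields are hypothetical; a real authoring tool would take this information from its own layout model.

```python
from typing import List, NamedTuple


class TextRun(NamedTuple):
    column: int  # hypothetical: column index known to the authoring tool
    line: int    # line number within the column, counted from the top
    text: str


runs = [
    TextRun(0, 0, "First column, line one. "),
    TextRun(0, 1, "First column, line two. "),
    TextRun(1, 0, "Second column, line one. "),
    TextRun(1, 1, "Second column, line two. "),
]


def page_scan_order(items: List[TextRun]) -> List[TextRun]:
    """Deprecated ordering: top-to-bottom, then left-to-right across the page."""
    return sorted(items, key=lambda r: (r.line, r.column))


def reading_order(items: List[TextRun]) -> List[TextRun]:
    """Preferred ordering: finish each column before starting the next."""
    return sorted(items, key=lambda r: (r.column, r.line))


print("".join(r.text for r in page_scan_order(runs)))
print("".join(r.text for r in reading_order(runs)))
```

In the page-scan ordering, the first line of the second column is emitted before the second line of the first column, so text extracted from the content stream is interleaved.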
Separate words explicitly with spacing characters.
- Separate words explicitly with spacing characters. Do not rely on
the location of the characters or the division of characters into
showstring operations to indicate word breaks. Note that this implies
that lines of text for western languages usually end with a trailing space
character.
Consider rendering the two line example:
Now is the winter
of our discontent.
-
Correct Technique Example:
- Position to the beginning of line 1
- Show String ("Now is the winter ")
- Position to the beginning of line 2
- Show String ("of ")
- Position to the beginning of "our"
- Show String ("our ")
- Position to the beginning of "discontent"
- Show String ("discontent. ")
Deprecated Technique Example:
- Position to the beginning of line 1
- Show String ("Now is the winter")
- Position to the beginning of line 2
- Show String ("of")
- Position to the beginning of "our"
- Show String ("our")
- Position to the beginning of "discontent"
- Show String ("discontent.")
- Note: In the deprecated example above, there are
no spacing characters at the end of either line, and there are no
spacing characters between the words of the second line.
Use soft hyphens and hard hyphens appropriately.
- Use a soft hyphen, identified by a character that maps to the Unicode
value U+00AD or 173 decimal, when a line-break hyphen is introduced into
the middle of a word.
- Example: If the word "father-in-law" is
hyphenated after the first syllable ("fa-ther-in-law"), the first
hyphen should be a soft hyphen, and the second and third hyphens
should be hard hyphens.
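A consumer of the text might use the distinction as in the sketch below. The `join_lines` function is a hypothetical helper; it assumes each line is delivered as a separate string and that lines not ending in a soft hyphen already carry their trailing space.

```python
from typing import List

SOFT_HYPHEN = "\u00ad"  # U+00AD, 173 decimal


def join_lines(lines: List[str]) -> str:
    """Rebuild logical text: drop line-end soft hyphens, keep hard hyphens."""
    parts = []
    for line in lines:
        if line.endswith(SOFT_HYPHEN):
            parts.append(line[:-1])  # the word continues on the next line
        else:
            parts.append(line)
    return "".join(parts)


lines = ["My fa" + SOFT_HYPHEN, "ther-in-law arrived. "]
print(join_lines(lines))  # My father-in-law arrived.
```

If the first hyphen had been written as a hard hyphen, the extracted word would have been "fa-ther-in-law", and searching for "father-in-law" would fail.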
Use the ActualText attribute.
- If characters are not rendered using the showstring operation, they
must be marked in the page as a Span element with an ActualText value
reflecting the desired Unicode value. (PDF 1.4)
- Example: Suppose the word "Arthur" is rendered
using an illuminated A.
The structure subtree for this word might contain
<Figure> graphics or image for illuminated A
<Span> "rthur"
The <Figure> structural element should have the ActualText
attribute, with value "A", so the word "Arthur" could be
extracted.
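The "Arthur" example can be sketched with a toy structure tree. The `StructElem` class and `extract` function are hypothetical and stand in for whatever structure model an application uses; they are not part of any PDF library.

```python
from dataclasses import dataclass, field
from typing import List, Optional


@dataclass
class StructElem:
    tag: str
    text: str = ""                     # text painted with show-string operators
    actual_text: Optional[str] = None  # ActualText attribute, if present
    kids: List["StructElem"] = field(default_factory=list)


def extract(elem: StructElem) -> str:
    """Return the element's text, preferring ActualText over its contents."""
    if elem.actual_text is not None:
        return elem.actual_text
    return elem.text + "".join(extract(kid) for kid in elem.kids)


word = StructElem("Span", kids=[
    StructElem("Figure", actual_text="A"),  # illuminated A drawn as graphics
    StructElem("Span", text="rthur"),
])
print(extract(word))  # Arthur
```

Without the ActualText value on the Figure element, the same traversal would yield only "rthur".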
Ensure that all character codes map reliably to Unicode.
- Within a PDF page, each show string operation operates on a sequence of
Character Codes interpreted in the associated font. Every such
sequence of character codes must map unambiguously into a sequence of
Unicode code points. Mapping is done as follows:
1.5.1 If the Font contains a
ToUnicode entry, convert the Character Code to Unicode via the
ToUnicode CMap.
1.5.2 If the Font uses one
of the PDF predefined encodings MacRomanEncoding, MacExpertEncoding,
or WinAnsiEncoding (perhaps as modified by a Differences array in
the font's encoding resource), use the Differences array or Appendix
D of the PDF Reference Manual to convert the Character Code to an
Adobe glyph name. Then use the Adobe glyph name to look up the
corresponding Unicode value.
1.5.3 If the Font uses one
of the predefined CMaps listed in Table 5.14 on page 320 of the PDF
Reference Manual except Identity-H and Identity-V, convert the
Character Code to a Unicode value via the following steps.
1.5.3.1 Obtain the Registry and Ordering
of the predefined CMap from the CIDSystemInfo of the
appropriate CMap.
1.5.3.2 Concatenate the Registry and the
Ordering according to the format
"Registry-Ordering-UCS2" to obtain a second CMap name,
e.g. "Adobe-Japan1-UCS2". Obtain that CMap.
1.5.3.3 Index into the predefined CMap,
using the Character Code, and obtain an Intermediate Value.
1.5.3.4 Index into the CMap obtained in
step 1.5.3.2, using the Intermediate Value, and obtain a
Unicode Value.
If any of these four steps fails, e.g. there is no CMap of that name
or the indexing value is missing or undefined in the CMap, then
there is no mapping of the character code to Unicode.
1.5.4 If the font is a Type
0 font whose descendant CIDFont uses the Adobe-Japan1, Adobe-Korea1,
Adobe-CNS1, or Adobe-GB1 character collection, as specified in the
CIDSystemInfo dictionary, follow the same steps as in 1.5.3 to
obtain the character code mapping.
1.5.5 If the Font is a Type
1 font whose character names are taken from the Adobe standard Latin
character set and the set of named characters in the Symbol font,
documented in Appendix C, use the corresponding Unicode value found
by looking up the glyph name.
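The first three mapping steps can be sketched as follows. This is a simplified illustration, not a complete implementation: fonts are modelled as plain dictionaries, the tables are tiny excerpts, and the CMap name "Example-H" and the intermediate value 34 are placeholders invented for the example.

```python
from typing import Optional

# Tiny illustrative excerpts; real tables are far larger.
GLYPH_NAME_TO_UNICODE = {"A": 0x0041, "Euro": 0x20AC}
WIN_ANSI = {65: "A"}  # character code -> Adobe glyph name
# Predefined CMap name -> (Registry, Ordering, code -> intermediate value).
PREDEFINED_CMAPS = {"Example-H": ("Adobe", "Japan1", {0x41: 34})}
UCS2_CMAPS = {"Adobe-Japan1-UCS2": {34: 0x0041}}


def to_unicode(font: dict, code: int) -> Optional[int]:
    """Map a character code to a Unicode value, or None if there is no mapping."""
    # 1.5.1: a ToUnicode CMap is used when present.
    if "ToUnicode" in font:
        return font["ToUnicode"].get(code)
    encoding = font.get("Encoding")
    # 1.5.2: predefined encoding, possibly modified by a Differences array.
    if encoding == "WinAnsiEncoding":
        name = font.get("Differences", {}).get(code, WIN_ANSI.get(code))
        return GLYPH_NAME_TO_UNICODE.get(name)
    # 1.5.3: predefined CMap, then the Registry-Ordering-UCS2 CMap.
    if encoding in PREDEFINED_CMAPS:
        registry, ordering, cmap = PREDEFINED_CMAPS[encoding]
        ucs2 = UCS2_CMAPS.get(f"{registry}-{ordering}-UCS2")
        intermediate = cmap.get(code)
        if ucs2 is None or intermediate is None:
            return None  # any failed step means there is no mapping
        return ucs2.get(intermediate)
    return None


print(hex(to_unicode({"Encoding": "WinAnsiEncoding"}, 65)))  # 0x41
```

Returning `None` rather than guessing mirrors the rule that a failed step means the character code has no Unicode mapping.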
- WCAG 1.0 Checkpoints in section 1:
??? Ideas anyone [Priority
3]
- WCAG 2.0 Checkpoints in section 1:
- NA yet
Provide structural grouping
Provide logical structure
- Checkpoint:
- Technique:
- Provide logical structure (PDF Reference Manual Section 8.4.3) for the
document. Map structure types to the standard structure types described
in Adobe Technical Note #5401. (PDF 1.3)
- Tag artifacts in the page contents.
- Tag artifacts in the page contents with the /Artifact marked content,
so that users can control how and whether they are included in the
contents of the document. Artifacts are one of the following:
3.2.1 Artifacts of the printing
process, like crop-box markings or the document file name.
3.2.2 Artifacts of the
pagination of the document, that is, elements that would be absent or
present in a much different form if the document were always one big
page, like running headers and page numbers.
3.2.3 Artifacts of the layout
process and typographic style, like a horizontal rule above a
footnote.
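Once artifacts are tagged, a consumer can decide whether to include them. The sketch below assumes a hypothetical `ContentItem` record whose `artifact` field holds a label such as "printing", "pagination", or "layout" for tagged artifacts and `None` for real content; these labels are illustrative names for the three categories above.

```python
from typing import List, NamedTuple, Optional


class ContentItem(NamedTuple):
    text: str
    artifact: Optional[str] = None  # None for real document content


def document_text(items: List[ContentItem], include_artifacts: bool = False) -> str:
    """Join page content, skipping tagged artifacts unless the user asks for them."""
    return "".join(
        item.text for item in items
        if include_artifacts or item.artifact is None
    )


page = [
    ContentItem("Annual Report ", artifact="pagination"),  # running header
    ContentItem("Sales rose in the third quarter. "),
    ContentItem("12 ", artifact="pagination"),              # page number
]
print(document_text(page))  # Sales rose in the third quarter.
```

Untagged, the running header and page number would be read out in the middle of the story on every page.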
- WCAG 1.0 Checkpoints in section 3:
- 5.1 For data tables, identify row and column
headers. [Priority 1]
5.2 For data tables that have two or more logical levels
of row or column headers, use markup to associate data cells and header
cells. [Priority 1]
5.4 If a table is used for layout, do not use any
structural markup for the purpose of visual formatting. [Priority
2]
12.3 Divide large blocks of information into more
manageable groups where natural and appropriate. [Priority 2]
13.6 Group related links, identify the group (for user
agents), and, until user agents do so, provide a way to bypass the
group. [Priority 3]
- WCAG 2.0 Checkpoints in section 3:
- NA yet
Document navigation
- Checkpoint:
- Technique:
Use bookmarks to provide navigation aids into a document.
- Content transforms gracefully when mechanisms provided by the
author are not supported or turned off but the content is still usable
and readable by the user.
Use links within a document.
- Markup languages, multimedia formats, software interface standards,
etc., vary in their support of accessibility. When choosing which
technologies to use, consider how easy it is to apply these guidelines.
Where feasible, favor technologies that:
If the value of the link does not describe the target clearly and
accurately, provide Alt attributes.
- Markup languages, multimedia formats, software interface standards,
etc., vary in their support of accessibility. When choosing which
technologies to use, consider how easy it is to apply these guidelines.
Where feasible, favor technologies that:
Provide a user name (/TU key) for all form fields
- In Acrobat, this field is called the Short Description.
- WCAG 1.0 Checkpoints in section 6:
14.3 Create a style of presentation that is consistent
across pages. [Priority 3]
9.4 Create a logical tab order through links, form
controls, and objects. [Priority 3]
- WCAG 2.0 Checkpoints in section 6:
- NA yet
Guideline 3, Comprehension:
Make it as easy as possible to use and understand
Provide expansions for acronyms and abbreviations (PDF 1.4)
- Checkpoint:
- Technique:
Note: this guideline applies only where the content provides its own
user interface
-
- Checkpoint:
- Technique:
(for example as a form or programmatic object).
Need example
- WCAG 1.0 Checkpoints in section 5:
4.2 Specify the expansion of each abbreviation or
acronym in a document where it first occurs. [Priority 3]
- WCAG 2.0 Checkpoints in section 5:
- NA yet
Guideline 4, Technology considerations:
Design for compatibility and interoperability
Set document protections to permit access
- Checkpoint:
- Technique:
Set the data access restrictions on the document to permit the
contents to be accessed.
-
- Checkpoint:
- Technique:
In PDF 1.3 and earlier, permit the text and graphics in the document
to be copied.
-
- Checkpoint:
- Technique:
In PDF 1.4, set accessibility permission for the document.
- Checkpoint:
- Technique:
- WCAG 1.0 Checkpoints in section 8:
??? Ideas anyone
[Priority 3]
- WCAG 2.0 Checkpoints in section 8:
- NA yet
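The permission techniques above can be sketched as a check on the standard security handler's permission flags. The bit positions used here (bit 5 for copying text and graphics, bit 10 for accessibility extraction, numbered from 1) are taken from the permission flags as later published for PDF 1.4 and should be treated as an assumption relative to this draft. The function name is hypothetical, and keying the decision on the PDF version is a simplification.

```python
from typing import Tuple

# Assumed bit values (1-based bit numbering for the permission flags).
COPY_TEXT_AND_GRAPHICS = 1 << 4  # bit 5
ACCESSIBILITY_EXTRACT = 1 << 9   # bit 10, meaningful from PDF 1.4 on


def permits_assistive_access(permissions: int, pdf_version: Tuple[int, int]) -> bool:
    """Return True if the document's flags let assistive technology read the content."""
    if pdf_version >= (1, 4):
        return bool(permissions & ACCESSIBILITY_EXTRACT)
    # PDF 1.3 and earlier: access depends on the copy permission.
    return bool(permissions & COPY_TEXT_AND_GRAPHICS)


print(permits_assistive_access(COPY_TEXT_AND_GRAPHICS, (1, 3)))  # True
print(permits_assistive_access(COPY_TEXT_AND_GRAPHICS, (1, 4)))  # False
```

The second call shows why a PDF 1.4 authoring tool needs to set the accessibility permission explicitly rather than rely on the copy permission.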
PDF Glossary
- Accessibility permission
NEW 01-01-08
- A PDF file can be encrypted (PDF 1.1) to protect its
contents from unauthorized access. PDF's standard security handler
defines a set of access privileges for a document, including privileges
such as modifying the document's contents, copying text and graphics
from the document, and printing the document. In PDF 1.4, this set
includes accessibility permission, which controls whether the contents
of the document are available via standard accessibility APIs to screen
readers and other assistive technology.
- ActualText value NEW 01-01-08
- Sometimes characters are rendered by graphics
commands other than showstring. For instance, an illuminated character
may be rendered by an image or a series of graphics commands. In these
situations, the ActualText property is used to identify the character
being rendered. This character may be concatenated with adjoining text
to form a word.
- Adobe glyph name NEW 01-01-08
- The name of a character in the Adobe standard
character encodings, in Appendix D of the PDF 1.3 Reference Manual. The
encodings list characters, character names, and character codes used in
platform standard encodings.
- Artifacts NEW
00-12-14
- A page element that is a side effect of rendering,
rather than an intrinsic part of the document or story. For example,
artifacts of the printing process might include crop-box markings or the
document file name printed outside the crop box. Artifacts of the
pagination of a document are elements that would be absent (or present
in a different form) if the document were always one very big page. So
pagination artifacts include running headers and page numbers. A horizontal rule
above a footnote would be an artifact of the layout process and
typographic style.
- Characters NEW 00-12-14
- A character is a printable symbol having phonetic or
pictographic meaning and usually forming part of a word of text,
depicting a numeral, or expressing grammatical punctuation. A character
is generally one of a limited number of symbols, including the letters
of a particular language's alphabet, the numerals in the decimal number
system, and certain special symbols such as the ampersand (&) and
"at sign" (@). Several standards of computer encoding have been developed
for characters. The most commonly used in personal computers is ASCII.
IBM mainframe systems use Extended Binary-Coded Decimal Interchange
Code (EBCDIC). A new standard, Unicode, is
supported by the Windows NT system. A distinction is sometimes made
between a character and a glyph. In this distinction, a character can be
distinguished from other characters in terms of meaning and sound and a
glyph is the graphic image used to portray
the character. In different implementations, a character can have more
than one possible glyph, and a glyph can represent more than one
possible character.
- Character Codes NEW 00-12-14
- (a la Loretta) A show string is the encoded
representation of a sequence of non-negative integers. Each of those
integers is a Character Code. The interpretation of a show string
depends on the associated font: some fonts imply a one-byte
representation while others imply a more complicated representation.
A coded character set is a mapping from a set of integers to a set of
characters. This mapping is generally 1:1 (i.e., bijective); for example,
the code position 65 in ASCII maps only to "A", and it is the only
position that maps to "A".
There are several standard coded character sets. The most widely used is
ASCII, generally in its Latin-1 dialect (the ASCII coded character set,
encoded directly as single-byte values), or UTF-8 (the Unicode coded
character set, encoded with an 8-bit transformation method), with
Unicode slowly becoming more common; EBCDIC and Baudot are extinct
except in legacy systems. A coded character set may include letters,
digits, punctuation, control codes, various mathematical and typographic
symbols, and other characters. Each character in the set is represented
by a unique character code (or "code position").
- Column headers NEW 00-12-14
- @@
- CMap
NEW 01-01-08
- A CMap specifies the mapping from character codes to
character selectors (CIDs, character names, or character codes) in one
or more associated fonts or CIDFonts. It serves a function analogous to
the Encoding dictionary for a simple font. A CMap also specifies the
writing mode - horizontal or vertical - for any CIDFont with which the
CMap is combined.
Also a CMap (character map) file specifies the correspondence between
character codes and the CID (character identifier) numbers used to
identify characters. For composite (Type 0) fonts, it is the equivalent
to the concept of an encoding in a simple font. A CMap can describe a
mapping from multiple-byte codes to thousands of characters in a large
CID-keyed font.
- Concatenate NEW 00-12-14
- To combine character strings, to join together two or
more files or lists to form one big one. Example: The Unix cat command
can be used to concatenate files.
- Crop box NEW
01-01-08
- The crop box defines the region to which the contents of the page are
to be clipped (cropped) when displayed or printed.
- Data tables NEW 00-12-14
- @@
- Expansion NEW 00-12-14
- @@
- Form fields NEW 01-01-08
- @@
- Glyph NEW
00-12-14
- An image used in the visual representation of characters; roughly speaking, how a
character looks. A font is a set of glyphs. In the simple case, for a
given font (typeface and size), each character corresponds to a single
glyph but this is not always the case, especially in a language with a
large alphabet where one character may correspond to several glyphs or
several characters to one glyph (a character encoding). A glyph can be
an alphabetic or numeric font or some other symbol that pictures an
encoded character. The following quote is from a document written as
background for the Unicode character set standard. An ideal
characterization of characters and glyphs and their relationship may be
stated as follows: A character conveys distinctions in meaning or
sounds. A character has no intrinsic appearance. A glyph conveys
distinctions in form. A glyph has no intrinsic meaning. One or more
characters may be depicted by one or more glyph representations
(instances of an abstract glyph) in a possibly context dependent
fashion. Glyph is from a Greek word for "carving."
- Indexing value NEW 00-12-14
- @@
- Line-break hyphen
NEW 00-12-14
- Hyphens that you add explicitly by entering the dash
character are called line-break or hard hyphens. A hard hyphen is always
set; for example, the hyphen in "cost-effective." A soft hyphen, by
contrast, is set only when a word that is not normally hyphenated
falls at the end of a line and must be broken for proper type spacing.
Word processors use two basic techniques to perform hyphenation. The
first employs an internal dictionary of words that indicates where
hyphens may be inserted. The second uses a set of logical formulas to
make hyphenation decisions. The dictionary method is more accurate but
is usually slower. The most sophisticated programs use a combination of
both methods. Most word processors allow you to override their own
hyphenation rules and define yourself where a word should be
divided.
- Link
text NEW 00-12-14
- The rendered text content of a link.
- MacRomanEncoding, MacExpertEncoding,
or WinAnsiEncoding NEW 01-01-08
- The regular font encodings used for Latin-text fonts
on Mac OS and Windows systems are named MacRomanEncoding and
WinAnsiEncoding, respectively. Additionally, an encoding named
MacExpertEncoding is used with "expert" fonts that contain additional
characters useful for sophisticated typography. Complete details of
these encodings and the characters present in typical fonts are found in
Appendix D of the PDF Version 1.3 Reference Manual.
- Map,
mapped NEW 00-12-14
- @@.
- Markup NEW
00-12-14
- @@
- Objects NEW
00-12-14
- An object is an identifiable, encapsulated entity
that provides one or more services requested by a client. Objects can
refer to the objects in OOP (object-oriented programming) or the objects
in OLE (Object Linking and Embedding). In object-oriented programming,
objects are the things you think about first in designing a program and
they are also the units of code that are eventually derived from the
process. In between, each object is made into a generic class of object
and even more generic classes are defined so that objects can share
models and reuse the class definitions in their code. Each object is an
instance of a particular class or subclass with the class's own methods
or procedures and data variables. An object is what actually runs in the
computer. An object can be a spell checker or a piece of a graphics
program used to draw squares or circles. Do you remember the crazy story
people used to try to tell about a word processor where you could pick
all of your favorite pieces (favorite spell checker, grammar checker,
text editor, font manager, etc.) and piece them together to form the
ultimate customizable word processor? Well, those pieces are objects. In
OLE, an object is a piece of a document, a graphic, or some multimedia.
In general multimedia terms, an object is a stored data element, such as
a video clip, an audio file, or a graphic representation of an
object.
- Page-content stream
NEW 01-01-08
- A page's content stream contains operands and
operators used to place "paint" on a page in selected areas. By
executing the actions described in the page content stream, an
application builds up the image of the page described by the
stream.
- NEW 00-12-14
- @@
- ReverseChars NEW 00-12-14
- Font characteristics may suggest that right-to-left
text be typeset left-to-right. The ReverseChars marked content indicates
that the show strings within the marked content are individually
reversed in reading order.
- NEW 00-12-14
- @@.
- Showstring NEW 00-12-14
- (a la Loretta) The strings that are the arguments to
the PDF and PostScript text-showing operators that show text on a page.
The show string is interpreted as a sequence of character codes
identifying the glyphs to be painted.
- Soft hyphen NEW 00-12-14
- (a la Loretta) A character that is used to mark
conditional hyphenation points. Unicode and ISO_Latin-1 code-point 0xAD.
A hyphen that will only be set if the word falls at the end of a line
which is too long, and has to be broken. Hyphens inserted automatically
by a hyphenation utility are called discretionary or soft hyphens. Word
processors use two basic techniques to perform hyphenation. The first
employs an internal dictionary of words that indicates where hyphens may
be inserted. The second uses a set of logical formulas to make
hyphenation decisions. The dictionary method is more accurate but is
usually slower. The most sophisticated programs use a combination of
both methods. Most word processors allow you to override their own
hyphenation rules and define yourself where a word should be
divided.
- Trailing space character
NEW 01-01-08
- A white space character inserted into the text for a
page after the last word on a line. A trailing space character is not
needed to produce the correct page image, but is important for
determining word breaks in the text of the page.
- Type 0
font, Type 1 font NEW 01-01-08
- Type 0 font: a composite font, that is, a font
composed of other fonts, organized hierarchically.
Type 1 font: a font represented using the Adobe Type 1 Font Format. A
Type 1 font program is a stylized PostScript program that describes
glyph shapes.
- Typographic style NEW 00-12-14
- @@.
- Unicode NEW
00-12-14
- A character coding scheme that uses 16 bits for each
character, designed to extend the capabilities of ASCII, which uses
seven bits. Nearly all letters and symbols in all languages can be
represented in a standard way with Unicode. The first 128 characters of
Unicode are identical to those in standard ASCII. Unicode is an entirely
new idea in setting up binary codes for text or script characters.
Officially called the Unicode Worldwide Character Standard, it is a
system for "the interchange, processing, and display of the written
texts of the diverse languages of the modern world." It also supports
many classical and historical texts in a number of languages. Currently,
the Unicode standard contains 57709 distinct coded characters derived
from 24 supported language scripts. These characters cover the principal
written languages of the world. Originally Unicode was designed to be
universal, unique, and uniform, i.e., the code was to cover all major
modern written languages (universal), each character was to have exactly
one encoding (unique), and each character was to be represented by a
fixed width in bits (uniform). Parallel to the development of Unicode an
ISO/IEC standard was being worked on that put a large emphasis on being
compatible with existing character codes such as ASCII or ISO Latin 1.
To avoid having two competing 16-bit standards, in 1992 the two teams
compromised to define a common character code standard, known both as
Unicode and BMP. Since the merger the character codes are the same but
the two standards are not identical. The ISO/IEC standard covers only
coding while Unicode includes additional specifications that help
implementation. Unicode is not a glyph encoding. The same character can
be displayed as a variety of glyphs, depending not only on the font and
style, but also on the adjacent characters. A sequence of characters can
be displayed as a single glyph or a character can be displayed as a
sequence of glyphs. Which of these applies is often font
dependent.
- Unicode value NEW 00-12-14
- (a la Loretta) Unicode value or code point: The
Unicode Consortium defined a set of sixteen-bit code points, 57709 of
which are currently assigned and named Unicode Characters. The lowest
65536 code points in ISO 10646-1 1993 are identical to the Unicode
Standard and are sometimes called the Basic Multilingual Plane. See http://www.unicode.org
- User name (/TU key) NEW 01-01-08
- Any interactive form field may contain the optional
/TU entry in its dictionary. This entry, known as the user name or short
description, is used to identify this field when generating an error
message or naming the field to a screen reader.
- Word breaks NEW 01-01-08
- Applications divide the text of a page into words;
word breaks are the points in the text stream that separate adjoining
words. Different applications may use different rules for defining
words; for example, one application may consider everything between
white space characters to be a word. Another application may not include
leading or trailing punctuation as part of a word.
This document last modified on 01-05-17
by Katie Haritos-Shea
@ Home
or
Katie Haritos-Shea @ Work, Paradigm
Solutions Corporation, and,
National Technical Information Service (NTIS),
United States Department of Commerce