Technique PDF7: Performing OCR on a scanned PDF document to provide actual text

Applicability

Scanned PDF documents

Description

The intent of this technique is to ensure that visually rendered text is presented in such a manner that it can be perceived without its visual presentation interfering with its readability.

A document that consists of scanned images of text is inherently inaccessible because the content of the document is images, not searchable text. Assistive technologies cannot read or extract the words; users cannot select, edit, resize, or reflow text nor can they change text and background colors; and authors cannot manipulate the PDF for accessibility.

For these reasons, authors should use actual text rather than images of text, using an authoring tool such as Microsoft Word or Oracle Open Office to author and convert content to PDF.

If authors do not have access to the source file and authoring tool, optical character recognition (OCR) can be used to convert the scanned images of text in the PDF to actual text. Adobe Acrobat Pro can then be used to make that text accessible.

Examples

Example 1: Generating actual text rather than images of text using Adobe Acrobat Pro

This example is shown with Adobe Acrobat Pro. There are other software tools that perform similar functions.

This example uses a simple one-page scanned image of text. To ensure that actual text is stored in the document, perform the following steps:

  1. Select Tools → Scan & OCR.
  2. In the Scan & OCR toolbar, select Insert and then either From File or From Scanner.
  3. OCR converts images of words and characters to actual text. How much is recognized depends on the resolution of the scan and the clarity of the text. Text that Acrobat Pro is not sure it recognized correctly is listed as an "OCR suspect".
  4. To fix the suspects, in the Scan & OCR toolbar, choose Recognize Text → Correct Recognized Text. Acrobat Pro presents each suspect one at a time, which can then be corrected.
  5. Using the Accessibility Tags panel, add tags to the document.
  6. Test for accessibility: Accessibility Tool → Accessibility Check.

The following image shows a scanned one-page document in Adobe Acrobat Pro.

A scanned page in Acrobat Pro showing a cheese recipe.

The next image shows the converted content after adding tags to the document. It will be necessary to use the Reading Order tool and the Tags panel to tag the content properly. In this example, the Reading Order tool was used to mark the image of the hand as a decorative image (artifact) so that it is hidden from assistive technology (see PDF4). The recipe title was tagged as a first-level heading.

A tagged converted page in Acrobat Pro showing a cheese recipe. The name of the recipe is a first-level heading, and the ingredients are a list.

Note: Acrobat Pro may automatically add tags when the file is run through OCR.

This example is shown in operation in the working example of generating actual text and the result of tagging text created with OCR.
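
As noted above, other software tools perform similar functions. One such tool is the open-source OCRmyPDF command-line program, which this technique does not itself mention; it is used here only as an illustration. The following Python sketch assembles and runs an OCRmyPDF command that adds a text layer to a scanned PDF. The helper names build_ocr_command and run_ocr and the file names are hypothetical, and the sketch assumes OCRmyPDF is installed separately. Text produced this way still needs to be corrected, tagged, and checked as described in steps 4 through 6.

```python
import shutil
import subprocess
from pathlib import Path


def build_ocr_command(src, dst, language="eng"):
    """Return the argument list for an OCRmyPDF run (nothing is executed here)."""
    return [
        "ocrmypdf",
        "--language", language,  # Tesseract language code of the scanned text
        "--deskew",              # straighten crooked scans before recognition
        "--skip-text",           # leave pages that already contain text untouched
        str(src),
        str(dst),
    ]


def run_ocr(src, dst, language="eng"):
    """Run OCRmyPDF if it is installed; raise a clear error otherwise."""
    if shutil.which("ocrmypdf") is None:
        raise RuntimeError("ocrmypdf was not found on PATH")
    # check=True raises CalledProcessError if OCRmyPDF reports a failure.
    subprocess.run(build_ocr_command(src, dst, language), check=True)
    return Path(dst)
```

Calling run_ocr("scan.pdf", "scan-ocr.pdf") would write a new file and leave the scanned original unchanged, which makes it easy to compare the two while correcting recognition errors.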

Tests

Procedure

  1. For each page converted to text using OCR, check that the resulting PDF has been converted correctly, in one of the following ways:

    1. Read the PDF document with a screen reader or a tool that reads aloud, listening to hear that all text is read correctly and in the correct reading order.
    2. Save the document as text and check that the converted text is complete and in the correct reading order.
    3. Use a tool that is capable of showing the converted content to open the PDF document and verify that all text was converted and is in the correct reading order.
    4. Use a tool that exposes the document through the accessibility API and verify that all text was converted and is in the correct reading order.
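
The second check (saving the document as text) can be partly automated. The following Python sketch is one possible approach, not part of the technique itself: it takes the saved text and a list of phrases that the tester expects to find, in reading order, and reports which phrases are missing and which appear out of order. The function names and the sample phrases in the note below are hypothetical. The check does not replace reading the output, because it only covers the phrases supplied.

```python
import re


def normalize(text):
    """Lowercase and collapse whitespace so line wrapping does not matter."""
    return re.sub(r"\s+", " ", text).strip().lower()


def check_reading_order(extracted_text, expected_phrases):
    """Return (missing, out_of_order) for phrases expected in reading order."""
    haystack = normalize(extracted_text)
    missing, out_of_order = [], []
    cursor = 0  # position just after the last phrase found in order
    for phrase in expected_phrases:
        needle = normalize(phrase)
        position = haystack.find(needle, cursor)
        if position != -1:
            cursor = position + len(needle)
        elif needle in haystack:
            # Present, but only before a phrase that should precede it.
            out_of_order.append(phrase)
        else:
            missing.append(phrase)
    return missing, out_of_order
```

For a saved text file whose content is "Simple Cheese", "Ingredients", and "Directions" on successive lines, passing those three phrases in that order returns two empty lists. A non-empty first list points to text the OCR step failed to recognize, and a non-empty second list points to a reading order problem.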

Expected Results

  • #1 is true.