OCR & Web Textual Icons

Daniel Dardailler - danield@w3.org
Last Updated: 18 April 2000

Abstract

The purpose of this page is to help us understand how today's OCR (Optical Character Recognition) technology can deal with the textual icons commonly found on the Web.

Problem statement

The images displayed at the bottom of this page all share the same property: they all represent some pieces of text using pixels and have therefore lost their original machine-readable representation, i.e. their logical information as individual character codes forming words (e.g. represented in ASCII or Unicode).

Some are simple one-liners; others are multiline or animated text. They also have in common that they are very easy to find on the Web: the vast majority of pages use them to some extent. Why? Because page designers like to control the exact style (colors, fonts, layout) of their creations, and Web technologies do not yet offer them this control in a reliable way (but it's coming: CSS, SVG, SMIL, etc.).

The issue with these images is that, having lost their machine readability, they are only accessible to users who can see (let's put aside tactile graphics for now). This seems obvious, but it's not: most of the "normal" text on the Web, like the lines you're reading now, is made available by its authors in HTML or XML and not as pixels, and it is accessible to most users whether or not they use their eyes. That's because their machines can read the text (the character codes, that is) and easily transform it into speech output, braille output, large print, mobile phone screen output, etc. There is some degradation, of course (you don't get the fancy colors, fonts or layout), but you get the content, which is probably what you're looking for.

But have these images really lost their machine readability? Well, no, not really: there are machines and software that can read text from pixels and produce a character-based encoding of it; this is called character recognition. The question is: how hard is it to recognize a given piece of text? Does the text have to be in black and white, horizontal only, in a plain font, etc.? One can imagine that some years from now character recognition will have made so much progress that these icons will all be understandable, but I don't think that's the case today. In the meantime, people should provide a textual description of the images in addition to the textual information carried by the pixels; it seems redundant, but it is necessary.

Purpose of this page

On April 5th 2000, I sent a message pointing to this page to comp.ai.doc-analysis.ocr, with no answer so far.

I hope to reach experts in the OCR field who can help me get an idea of how OCR software deals with these icons today.

The kind of questions I have in mind:

Can OCR software extract text from these images given only the pixels of the image? All of them, some of them, which ones?
Do non-English icons pose a particular problem?
Does knowledge of the AREA coordinates (for image maps) help in any way?
Are OCR "engines" only available as black boxes that operate on paper, or can they be adapted for online Web usage?
Are there any public domain OCR engines of good quality?

Please send your comments and answers to these questions to Daniel Dardailler (in addition to any forum that originally pointed you to this page).

Thanks.

Simple Images (no MAP, i.e. no clickable sensitive areas):

These images are usually expressed on the Web using the IMG element:

  <IMG SRC="wai.gif" ALT="Web Accessibility Initiative">

The issue is: if the ALT is not provided, and the image file name (like "wai.gif") is not useful, there's not much a non-graphical user can do with this information - except if OCR can help!
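
As a rough illustration of the situation just described, here is a minimal Python sketch that scans a page for IMG elements with no usable ALT text, i.e. the images for which OCR would be the only remaining source of text. The names MissingAltFinder and images_needing_ocr are invented for this example, and treating an empty or whitespace-only ALT as "missing" is an assumption of the sketch (authors sometimes use an empty ALT on purpose for decorative images).

```python
from html.parser import HTMLParser


class MissingAltFinder(HTMLParser):
    """Collect the SRC of every IMG element that has no usable ALT text."""

    def __init__(self):
        super().__init__()
        self.missing = []

    def handle_starttag(self, tag, attrs):
        # html.parser hands us tag and attribute names in lowercase,
        # so <IMG SRC=...> and <img src=...> are treated alike.
        if tag != "img":
            return
        attributes = dict(attrs)
        alt = attributes.get("alt")
        # Assumption of this sketch: an absent, empty or blank ALT
        # means the image is a candidate for OCR.
        if alt is None or not alt.strip():
            self.missing.append(attributes.get("src"))


def images_needing_ocr(html_text):
    """Return the SRC values of IMG elements lacking ALT text."""
    finder = MissingAltFinder()
    finder.feed(html_text)
    finder.close()
    return finder.missing
```

Fed the IMG example above, the function would report nothing, since the ALT is present; remove the ALT attribute and "wai.gif" would be listed.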

Images with MAPs and AREA:

These images have an associated MAP, or list of sensitive areas that can help OCR to delineate particular zones of interest (where text might be present).

Example:

  <IMG SRC="google.gif" USEMAP="#map" ALT="Google">
  <MAP NAME="map">
    <AREA SHAPE="rect" COORDS="493,58,595,103"
          HREF="jobs.html" ALT="We're Hiring!">
    <AREA SHAPE="rect" COORDS="381,57,488,104"
          HREF="about.html" ALT="About Google">
  </MAP>

(don't click on these images, I'm not sure they'll go anywhere).
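
To make the idea of using AREA coordinates more concrete, here is a small Python sketch that reads the rectangular AREA elements of a page and turns each COORDS value into a (left, top, right, bottom) box that could be cut out of the image and handed to an OCR engine. It is only a sketch under stated assumptions: the names AreaRectCollector, text_zones and crop are invented for this example, only rectangles with plain pixel coordinates are handled, and the image is assumed to be already decoded into a list of pixel rows. Whether cropping to these zones actually improves recognition is exactly the open question asked above.

```python
from html.parser import HTMLParser


class AreaRectCollector(HTMLParser):
    """Collect the rectangular AREA zones declared in a page."""

    def __init__(self):
        super().__init__()
        self.zones = []

    def handle_starttag(self, tag, attrs):
        if tag != "area":
            return
        attributes = dict(attrs)
        # In HTML 4 a missing SHAPE attribute means a rectangle.
        shape = (attributes.get("shape") or "rect").lower()
        if shape != "rect":
            return  # circles and polygons are ignored in this sketch
        try:
            x1, y1, x2, y2 = [
                int(value) for value in (attributes.get("coords") or "").split(",")
            ]
        except ValueError:
            return  # wrong number of values, or not plain pixel integers
        # Normalise so that left <= right and top <= bottom.
        box = (min(x1, x2), min(y1, y2), max(x1, x2), max(y1, y2))
        self.zones.append({"box": box, "alt": attributes.get("alt")})


def text_zones(html_text):
    """Return one dict per rectangular AREA: its box and its ALT (or None)."""
    collector = AreaRectCollector()
    collector.feed(html_text)
    collector.close()
    return collector.zones


def crop(pixels, box):
    """Cut a box out of an image given as a list of rows of pixel values."""
    left, top, right, bottom = box
    return [row[left:right] for row in pixels[top:bottom]]
```

When the ALT of an AREA is present, as in the example above, it could also serve as a reference against which to compare the text an OCR engine extracts from the same zone.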

Daniel Dardailler

Last modified: Tue Apr 18 11:47:31 MEST 2000