The ALT-server
("An eye for an alt")

W3C / Web Accessibility Initiative (WAI)
An accessibility/collaboration project proposal
Daniel Dardailler


This memo was written between July 1997 and January 1998, most of it right after Christmas 1997 (on a personal note, I haven't written any of the algorithms since, ten years on).

It has no official W3C status; it is not even a W3C Note.


For most people, visiting a web page is a one-to-one experience where one client program, the user's browser, gets and presents resources coming from one information provider, usually the provider's web server. It's the client/server paradigm as we understand it. The client usually gets an "initial" HTML file from which it derives a complete presentation made of pieces found in the document itself (text, markup, style, alt text, etc.) and additional pieces fetched by going back to the provider's servers (images, audio, longdesc, etc.).

It doesn't always have to be that way. The web addressing and transport architecture is flexible enough that the set of resources making up one's web session can seamlessly integrate pieces coming from independent providers, or chains of providers.

This paper presents the application of this principle to Web Accessibility. The design of a system to retrieve and generate missing textual descriptions of particular HTML elements (such as images) is examined, as well as the human collaboration foundation on which it is based.

Web Accessibility

Web Accessibility covers a very broad set of issues. There are of course different kinds of disabilities to consider, such as visual or hearing impairments, which all relate to different types of access denial (e.g. a missing caption for an audio stream, or the inability to linearize the content of a table for speech output). For the purpose of this paper, we will focus on one important aspect of accessibility for non-visual user agents: the textual description, or rather the lack thereof, attached to graphical images on the web, i.e. the well-known missing ALT text in HTML. However, we believe the system presented can be generalized to other kinds of resources.

The current situation is the following: when presented with a piece of HTML containing an image, a non-visual browser needs to "degrade gracefully" by presenting the user with a textual version of the image (which can be output as speech or braille). Most such systems currently look for this textual information only in the ALT attribute of the image's IMG element. With progress happening in the browser area, other ways to find this textual/alternate text will soon be implemented, such as looking in the TITLE attribute, the HTTP stream, or the filename part of the URL.
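The lookup order just described can be sketched as a small function. This is only a sketch in Python: the function name and the assumption that the IMG element's attributes are available as a dict are illustrative, not part of the original proposal.

```python
from urllib.parse import urlparse
import os


def alt_text_fallback(attrs):
    """Return the best available textual description for an IMG element.

    attrs: dict of the IMG element's attributes, e.g. {"src": ..., "alt": ...}.
    Tries ALT first, then TITLE, then falls back to the filename part of
    the URL, mirroring the lookup order described above.
    """
    for key in ("alt", "title"):
        text = attrs.get(key, "").strip()
        if text:
            return text
    # last resort: derive something from the filename part of the URL
    path = urlparse(attrs.get("src", "")).path
    name, _ext = os.path.splitext(os.path.basename(path))
    return name or "[image]"
```

A browser using this chain still comes up empty-handed (or with a raw filename) for the many images published with no ALT or TITLE at all, which is the gap the rest of this paper addresses.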

What characterizes these approaches is that the textual description can only come from the origin document or the provider server.


As we alluded to in the introduction, another way of getting this information is to ask a different server altogether. Suppose there was a web server somewhere on the Internet whose primary job was to serve textual descriptions of other servers' images. A non-visual browser (such as lynx) would then just have to query it when an image's ALT and TITLE attributes are missing, and use the result in its presentation.

Let's consider an example expressed in pseudo HTTP sequences of queries and replies:
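Such a sequence might look like the following sketch (the host names, page, and image shown here are illustrative, not from the original; the /ALT?url= syntax follows the service scheme discussed below):

```
# the browser fetches the page and finds an IMG with no ALT text
GET http://www.acme.com/page.html
  --> ... <IMG src="logo.gif"> ...

# the browser then asks the alt server to describe that image
GET http://altserver.org/ALT?url=http://www.acme.com/logo.gif
  --> "The Acme company logo"
```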

We'll look at the server issues later on, but for now, let's concentrate on the client side.

The important line in the example above is the query to the alt server.


It's just a regular HTTP request that provides the server with the URL of an image and expects the ALT text for this image back.

The /SERVICE?name=value scheme can of course be generalized to handle different types of resources and more information about a given resource.

Consider the following examples:


which asks the server for a linear/textual version of the fourth table found at the given URL,


which asks for a caption of the audio track found at the given URL.

 GET /ALT?;isA;line=2

which tells the server that the image is used in the context of an A tag (a link anchor) and appears on line 2 of the document, making it a higher priority to handle.

I think it's easy to see what one can achieve there. Two things are worth mentioning. First, the performance hit is nothing to really worry about: we merely add one web request to the overall building of a page, something graphical browsers do all the time. Second, the configuration needed on the browser side is really small: declaring the name of the alt server to query. So it's mostly transparent for the non-visual browser user.
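On the client side, the whole mechanism amounts to building one extra request and tolerating its failure. A minimal sketch in Python; the server name altserver.example.net and the /ALT?url= service path are assumptions for illustration:

```python
import urllib.request
from urllib.parse import quote

# hypothetical alt server name: the one configurable item on the browser side
ALT_SERVER = "altserver.example.net"


def alt_query_url(image_url):
    """Build the alt-server request URL for a given image URL."""
    return "http://%s/ALT?url=%s" % (ALT_SERVER, quote(image_url, safe=""))


def fetch_alt_text(image_url, timeout=2):
    """Ask the alt server for a description; return None if unavailable.

    A failure here must never break page rendering, so network errors are
    swallowed and the browser simply falls back to its usual behavior.
    """
    try:
        with urllib.request.urlopen(alt_query_url(image_url), timeout=timeout) as resp:
            text = resp.read().decode("utf-8", "replace").strip()
            return text or None
    except OSError:
        return None
```

Note that the image URL is percent-encoded before being embedded as a query value, so that slashes and colons in it don't confuse the alt server's parameter parsing.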

The implementation of this new GET functionality in a given browser (like lynx or Amaya) is trivial.


At the beginning of the previous section, we assumed that "there was a web server somewhere on the Internet whose primary job was to serve textual description of other servers' images".

How do we go about implementing this alt server?

Basically, I envision two ways of generating alt text for images.

The first is automatic extraction, the second human generation.

I will not expand on the first, as this is an area of advanced research (shape and pattern recognition). I'll just mention that for an entire category of images, those representing text using large fonts and colors, there exist algorithms (e.g. OCR) that could be used to extract the characters out of the graphics. A centralized server is well suited to integrate the latest and greatest solution in one location while readily serving the entire community.

The second way, human generation, is where I see the power of the web as a collaboration tool best applied.

This is how it could work.

The ALT server logically maintains a list of tuples

   (image-url, textual-description, state)

where state is one of to-be-described, being-described, and described.

Processing works as follows:

The part with the form filling needs to be detailed.

Each time a sighted user accesses the form (see annex), the to-be-described image with the highest priority is presented, and its state moves to being-described.

The sighted user can then enter the description in an input field next to the image and submit the form to the server, which validates the text and either moves the entry to the described state or just unlocks it by moving it back to the to-be-described state (invalid might mean empty text, for instance).

The locking is necessary due to the asynchronous nature of web form filling: several users could access and fill the "same" form at the same time, and we only need one description per image.
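The state transitions and locking described above can be sketched as a small in-memory structure. This is only a sketch in Python with illustrative names; a real server would also persist the list and expire stale being-described locks from users who abandon the form:

```python
# the three states of the (image-url, textual-description, state) tuples
TBD, BD, D = "to-be-described", "being-described", "described"


class AltStore:
    """In-memory version of the alt server's tuple list."""

    def __init__(self):
        self.entries = {}  # url -> {"desc": str or None, "state": str}

    def add(self, url):
        """Register an image URL with no known description yet."""
        self.entries.setdefault(url, {"desc": None, "state": TBD})

    def checkout(self):
        """Hand the next undescribed image to a sighted user, locking it."""
        for url, e in self.entries.items():
            if e["state"] == TBD:
                e["state"] = BD  # lock so two users don't get the same form
                return url
        return None

    def submit(self, url, desc):
        """Accept a description, or unlock the entry if it is invalid."""
        e = self.entries[url]
        if desc and desc.strip():
            e["desc"] = desc.strip()
            e["state"] = D
        else:
            e["state"] = TBD  # invalid (e.g. empty text): release the lock
```

The checkout/submit pair is what makes concurrent form filling safe: once an image is handed out, it is not offered to anyone else until the description comes back or the lock is released.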

So this is the basic principle of the alt text server: use the eyes of sighted web volunteers to help those who cannot see.

The reason why this system can work is based on two facts:

In addition, the number of images with no description should go down as content providers' awareness of accessibility is raised and authoring tools are improved.

If the automatic extraction part is improved, this will also diminish the number of images actually needing some human collaboration.

Regarding implementation on the server side, see the annex for pseudo code. I expect a first version handling the base service to take a week of programming. A more complete version (generating reports, ranking, and doing more automation) could take a couple to several months.


[ to be developed ]


1 - Form layout

Example of the form used to query the sighted user.

Welcome to the ALT-server filling form

You are about to describe:

(entered Dec 25 1997 and was queried 12 times)

[image shown here: "No Alt by definition"]

Enter the description(*) of this image

Select Language:

Email: Name:

(*) The description should be short and to the point: e.g. "an American Express credit card", "a dog", "a map with a magnifier". No need to add "an image of" or "this is a".
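The server-side validation could also normalize submissions along these lines. A sketch in Python; the function name and the exact phrase list are illustrative assumptions, not from the original:

```python
def normalize_description(text):
    """Trim a submitted description and drop redundant lead-in phrases.

    Phrases like "an image of" or "this is a" add nothing for a
    non-visual user, so they are stripped (illustrative phrase list).
    """
    text = " ".join(text.split())  # collapse whitespace
    lowered = text.lower()
    # longer phrases first, so "this is a dog" becomes "dog", not "a dog"
    for prefix in ("an image of ", "image of ", "this is a ", "this is "):
        if lowered.startswith(prefix):
            text = text[len(prefix):]
            break
    return text
```

An empty result after normalization would then be treated as an invalid submission, unlocking the entry as described earlier.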

Link to an Advanced query form providing n-at-a-time filling, site- or URL-targeted filling, and database dumps ranked by image site name, describer id, base image file name, etc.

2 - Simple pseudo code for ALT-server script

This script handles both the queries for alt descriptions and the insertion of alt descriptions by sighted users, for a hypothetical server hosted at

// for now ignore lang, date entered, number of queries, id of describer,
//      checking for dups, validity, security, and additional services
//      (ranking of bad sites, good describers, etc.)
//  INPUT: 3 cases
// 1 (asking for textual description of url)
// 2 (giving a textual description for url)
//                                 desc="A credit card logo"
// 3 (asking for form to fill in desc for a url)
//  OUTPUT: see RETURN statement below 
// maintains a persistent list of [url, desc, state]
//    with state = d, bd, tbd  (described, being described, to be described)

if url
  if !desc   // case 1 : asking for textual description of url
    if (url in list)
      if (list[url].state = d)
        RETURN list[url].desc
      else
        RETURN no desc
    else
      add url to list
      list[url].state = tbd
      RETURN no desc

  else  // case 2 : giving a textual description for url
        // should check list[url].state = bd and desc valid
    list[url].state = d
    list[url].desc = desc
    RETURN ok

else // case 3 : no param, asking for form to fill in desc for a url
   get top url with list[url].state = tbd
   list[url].state = bd   // should check url valid with HEAD
   RETURN form HTML with embedded image url
