Printing for JSTOR

A position paper for the W3C workshop on High-Quality Printing from the Web

Last updated April 11, 1996

This paper raises more questions than it answers. We need to provide high-quality printing of scanned page images initially, and eventually of other forms such as SGML and PDF.

Established in August 1995, JSTOR is an independent not-for-profit organization created with the assistance of The Andrew W. Mellon Foundation to help the scholarly community take advantage of advances in information technology. In pursuing this mission, JSTOR has adopted a system-wide perspective, taking into account the sometimes conflicting needs of scholars, libraries and publishers. With an initial focus on core scholarly journals, the primary objectives of JSTOR are:

to improve dramatically access to journal literature for faculty, students, and other scholars by linking bit-mapped images of journal pages to a powerful search engine;
to mitigate some of the vexing economic problems of libraries by easing storage problems (thereby saving prospective capital costs involved in building more shelf space), and also by reducing operating costs associated with retrieving back issues and reshelving them;
to address issues of conservation and preservation such as broken runs, mutilated pages, and long-term deterioration of paper copy; and
to assist scholarly associations and other publishers in making the transition to electronic modes of publication while protecting their traditional values and financial stability.

JSTOR Overview

The current focus of JSTOR is on "digitizing the back files". We have digitized the complete run of several journals, from Volume 1, Number 1, up to a "moving wall" (5 years for most of the journals). The digital form includes high-resolution (600DPI) black-and-white scans of the pages, an electronic table of contents for each issue, and searchable OCR text of each issue.

The on-line interface consists of a browser and a set of search forms. Once the user has selected an article, the scanned pages are reduced to 75-100DPI gray-scale GIF images for on-screen viewing. The user may also print the article at the full 600DPI resolution. Our initial printing strategy requires a "helper application", which downloads the scanned pages from the server, converts them to PostScript, and sends them to a selected printer.
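The reduction from 600DPI black-and-white scans to 75-100DPI gray-scale images corresponds to a linear scale factor of 8 (for 75DPI) or 6 (for 100DPI). One common way to do this, offered here only as an illustrative sketch and not as a description of the actual JSTOR conversion, is a box filter: each block of factor x factor bilevel pixels becomes one gray pixel whose darkness reflects how many pixels in the block were black. The function name and the in-memory representation (rows of 0/1 values, 1 meaning black) are assumptions made for the example.

```python
def downsample_to_gray(bitmap, factor):
    """Box-filter a bilevel bitmap down by an integer factor.

    bitmap: list of rows, each a list of 0/1 values (1 = black); assumed layout.
    Returns rows of gray values in 0..255, where 255 is white.
    Partial blocks at the right and bottom edges are dropped.
    """
    height, width = len(bitmap), len(bitmap[0])
    block_area = factor * factor
    gray = []
    for top in range(0, height - factor + 1, factor):
        row = []
        for left in range(0, width - factor + 1, factor):
            # Count the black pixels inside this factor x factor block.
            black = sum(
                bitmap[y][x]
                for y in range(top, top + factor)
                for x in range(left, left + factor)
            )
            row.append(255 - (255 * black) // block_area)
        gray.append(row)
    return gray


# Tiny 4x4 example reduced by a factor of 2 (a real page would use 6 or 8).
sample = [
    [1, 1, 0, 0],
    [1, 1, 1, 0],
    [0, 0, 0, 0],
    [0, 0, 0, 1],
]
reduced = downsample_to_gray(sample, 2)
```

With a factor of 8, each output pixel can take one of 65 gray levels, which is why a bilevel scan can still look smooth on screen at reduced resolution.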

We are currently investigating other printing methods, including CUPID, downloading PostScript files, and downloading PDF files.

Printing Requirements and Issues

The basic requirement for printing from JSTOR is that the user be able to get the highest-quality representation that we can provide. A second requirement is to be able to print an entire article with "one click." Neither of these is satisfied by the "Print" command in the web browser: the on-screen representation is at reduced resolution from the original, and includes only a single page at a time.

The page images are currently stored as G4 FAX-compressed TIFF images, a format that is not directly supported by stock web browsers or by printers. Thus, some translation is necessary. The TIFF format is relatively well compressed, an important factor in minimizing download time (it still runs an average of 120Kb per page). Compression methods exist that can reduce the page size by another factor of 10 or more (e.g., Cartesian Products), but they are even less standard.
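To put the 120Kb figure in perspective, the following back-of-the-envelope calculation compares it with the uncompressed size of a page and estimates the download time. Several inputs are assumptions that do not come from our measurements: a US Letter page (8.5 x 11 inches), "Kb" read as kilobytes, and a 28.8 kbit/s modem link. Journal page sizes and user connections vary, so the results are rough indications only.

```python
DPI = 600                        # scan resolution, from the text
PAGE_WIDTH_IN = 8.5              # assumed page width (US Letter)
PAGE_HEIGHT_IN = 11.0            # assumed page height (US Letter)
COMPRESSED_BYTES = 120 * 1024    # "120Kb" read as kilobytes (assumption)
MODEM_BITS_PER_SECOND = 28_800   # assumed link speed


def raw_bilevel_bytes(dpi, width_in, height_in):
    """Approximate uncompressed size of a 1-bit-per-pixel page, ignoring row padding."""
    width_px = round(dpi * width_in)
    height_px = round(dpi * height_in)
    return width_px * height_px // 8


def download_seconds(num_bytes, bits_per_second):
    """Ideal transfer time with no protocol overhead."""
    return num_bytes * 8 / bits_per_second


raw = raw_bilevel_bytes(DPI, PAGE_WIDTH_IN, PAGE_HEIGHT_IN)
ratio = raw / COMPRESSED_BYTES
seconds = download_seconds(COMPRESSED_BYTES, MODEM_BITS_PER_SECOND)

print(f"uncompressed page: {raw} bytes")
print(f"compression ratio: about {ratio:.0f}:1")
print(f"download time per page: about {seconds:.0f} s")
```

Under these assumptions the uncompressed page is about 4.2 million bytes, the existing compression is roughly 34:1, and each page still takes roughly half a minute to download, so a twenty-page article would occupy the link for more than ten minutes.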

Our current "print helper" application supports only PostScript printers (preferably PostScript level 2). With the current limited test community, this is marginally acceptable. As the user community grows, it will become increasingly unacceptable, and further application development will be required. An advantage of direct PS level 2 translation is that the body of the page image can be passed through uninterpreted. This advantage is lost when using a generic printer driver interface, so printing will take longer. To reduce or eliminate the need for a helper application, we may decide to provide directly printable downloadable representations such as PostScript or (HP)PCL. We will then be in the unenviable position of having to generate any number of different printer languages.
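The pass-through mentioned above is possible because PostScript level 2 provides a CCITTFaxDecode filter, so the printer itself can decompress Group 4 data. The sketch below is illustrative only and is not the code of our print helper: it wraps already-extracted G4 data in a one-page level 2 program. The function name is hypothetical, the extraction of the compressed strip from the TIFF file is assumed to have happened elsewhere, and the data is hex-encoded for safe transport (which doubles its size; a binary-clean channel could carry the bytes directly). Details such as the TIFF fill order and photometric interpretation are ignored, and the output has not been validated against any particular printer.

```python
import binascii


def g4_to_postscript(g4_data, width_px, height_px, dpi=600):
    """Wrap raw Group 4 fax data in a one-page PostScript level 2 program.

    g4_data: compressed image body, assumed already extracted from the TIFF.
    The compressed bytes are only hex-encoded, never decompressed here.
    """
    width_pt = width_px * 72.0 / dpi     # PostScript units are 1/72 inch
    height_pt = height_px * 72.0 / dpi
    hex_data = binascii.hexlify(g4_data).decode("ascii")
    hex_lines = [hex_data[i:i + 78] for i in range(0, len(hex_data), 78)]
    lines = [
        "%!PS-Adobe-3.0",
        "%%LanguageLevel: 2",
        "%%Pages: 1",
        "%%EndComments",
        "%%Page: 1 1",
        "gsave",
        f"{width_pt:g} {height_pt:g} scale",
        "/DeviceGray setcolorspace",
        # Hex decoder on the program file, then the fax decoder on top of it.
        "/hexsrc currentfile /ASCIIHexDecode filter def",
        f"/g4src hexsrc << /K -1 /Columns {width_px} /Rows {height_px} >>"
        " /CCITTFaxDecode filter def",
        "{",
        f"  << /ImageType 1 /Width {width_px} /Height {height_px}",
        "     /BitsPerComponent 1 /Decode [0 1]",
        f"     /ImageMatrix [{width_px} 0 0 {-height_px} 0 {height_px}]",
        "     /DataSource g4src >> image",
        "  hexsrc flushfile",   # discard anything left before the end marker
        "} exec",
        *hex_lines,
        ">",                    # end-of-data marker for ASCIIHexDecode
        "grestore",
        "showpage",
        "%%EOF",
    ]
    return "\n".join(lines) + "\n"


# Three placeholder bytes stand in for real G4 data in this example.
ps_page = g4_to_postscript(b"\x00\x01\xff", 5100, 6600)
```

A generic printer driver interface offers no equivalent hook: the helper would have to decompress each page to a full bitmap before handing it over.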

To summarize, our printing requirements are:

Print full-resolution page images.
Print full article, rather than just the single displayed page.
Minimize download and printing time.
Support a diverse, not necessarily computer-savvy, user community.
Thus, support a variety of printers and computing platforms.
Minimize the options the user must "wade through" in order to print.
Allow the user to print to their own, or a nearby, printer.
Minimize the time the user must wait for the printing process, before she or he can return to work.

Some issues that we have identified include:

Centralized versus local printing -- does printing take place from a centralized server or through the user's workstation? If printing is centralized, how is it administered?
Printer selection in a centralized printing solution.
Use of a "print helper" application -- allows us to "control" the print translation process, and it can be made "small" and unobtrusive, but raises a host of problems. These include:
- Coding, compiling, and maintaining the helper application for many different platforms, some of which we may not have access to.
- Dealing with various printing architectures (especially messy on Unix platforms).
- PostScript generation versus use of OS-provided printer driver. How do we decide which to use?
- Notifying users when a helper upgrade is available.
Providing printable representations for download. In this case, the "print helper" runs on our server, producing a print file, which the user then downloads and prints using a vendor-supplied or third-party application. For example, PostScript can be easily printed using lpr on Unix. If we produce PDF files, then we push the print helper problem off onto Adobe. On the other hand, when a PDF file is downloaded, it must be loaded into Acrobat and displayed before it can be printed, increasing the burden on the user. The download time is probably also longer than in the print helper case.
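As an illustration of the download-and-print case on Unix, a downloaded PostScript file can be handed to lpr, optionally naming a printer with -P. The sketch below only shows the shape of that step; the function names, the file name article.ps, and the printer name are hypothetical, and it assumes an lpr command is installed and that the selected printer accepts PostScript.

```python
import subprocess


def lpr_command(ps_path, printer=None):
    """Build the argument list for printing a file with lpr."""
    argv = ["lpr"]
    if printer:
        argv += ["-P", printer]   # -P selects a named print queue
    argv.append(ps_path)
    return argv


def print_postscript(ps_path, printer=None):
    """Send the file to the spooler; raises if lpr reports a failure."""
    subprocess.run(lpr_command(ps_path, printer), check=True)


# Hypothetical usage: print a downloaded article on a queue called "lab-ps".
command = lpr_command("article.ps", printer="lab-ps")
```

The simplicity is specific to Unix with a PostScript printer; other platforms and printers need a different last step, which is the variety of printing architectures raised above.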

As we move into "current file" deployment, the printing situation will only get more complex, because we will have several source representations. It seems certain that we will begin deploying SGML-based documents in the near future. Most of the issues identified above will apply to printing these documents, as well, although a minimal solution is to translate the SGML to HTML for both display and printing. However, crucial structural and formatting information can be lost in this translation.

Conclusion

Our desire to provide the highest quality product to our customers has led us to investigate and partially deploy a number of alternative printing solutions. The "Print button" provided by web browsers is completely insufficient for our needs. We would prefer not to have to roll our own solution in isolation, and are eager to coordinate with others in solving our common printing problems.