
The perspective of the scientific publisher

Sebastian Rahtz

and

Herbert van Zijl

Elsevier Science

Oxford / Amsterdam

May 1st 1998

2 Background

Elsevier Science is one of the main scientific publishers in the world:

primary research journals in the major fields
over 120,000 papers a year
long-term storage in SGML since the early 1990s for structure
long-term storage in PDF for page typography and reprints
a specialized DTD developed over the last 6-7 years, to meet particular needs in, e.g., math and bibliographies

3 Document storage in HTML?

HTML's markup model does not allow essential tasks like:

imposition of editorial control
generation of navigational aids, such as indices, directly from the document itself
generation of rich cross-document (or even intra-document, such as bibliographical citation) links
addressing or management of objects smaller or larger than a single document
efficient re-use of document components
search within semantically significant components of a document
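As an illustration of the last point, here is a minimal sketch of a search restricted to one kind of component. It assumes a hypothetical article markup (the element names `article`, `abstract`, `head` and `para` are invented for this example and are not Elsevier's DTD) and uses Python's standard XML parser. The same query against flat HTML could only match every word on the page.

```python
import xml.etree.ElementTree as ET

# Hypothetical article markup; element names are assumptions for illustration.
ARTICLE = """
<article>
  <title>Thermal properties of thin films</title>
  <abstract>We measure conductivity in thin films.</abstract>
  <body>
    <section><head>Results</head><para>Conductivity rose with thickness.</para></section>
    <section><head>Discussion</head><para>Films behave unlike bulk material.</para></section>
  </body>
</article>
"""


def search_component(root, tag, term):
    """Return the text of every <tag> element whose text content contains term."""
    hits = []
    for elem in root.iter(tag):
        text = "".join(elem.itertext()).strip()
        if term.lower() in text.lower():
            hits.append(text)
    return hits


root = ET.fromstring(ARTICLE)
# Search the abstract only, then the section headings only.
abstract_hits = search_component(root, "abstract", "conductivity")
head_hits = search_component(root, "head", "conductivity")
```

The term occurs in the abstract and in a paragraph, but a search scoped to headings finds nothing, which is the distinction that structure-aware search depends on.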

4 Elsevier's current usage of HTML

Electronic journals since 1995, generating HTML from SGML; since 1997 Science Direct, an extremely large inter-linked database of articles.

Pre-extraction of key fields (such as author names) for external indexing and searching
Linking of articles to backup resources, such as abstract databases
Use of PDF to provide better quality printout than that which can be derived from HTML
Use of fixed size GIF images to display mathematics and special symbols
Variable presentation granularity, e.g.
- summary bibliographical details only
- front matter only
- front matter, plus figures and references
- full article
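The four levels of granularity above can be sketched as selections over one structured source. This is only an illustration under stated assumptions: the component names (`title`, `authors`, `abstract`, `body`, `figures`, `references`) and the mapping of each view to components are hypothetical, not the production process described here.

```python
import xml.etree.ElementTree as ET

# Hypothetical views, modelled on the four granularities listed above.
VIEWS = {
    "summary": ["title", "authors"],
    "front": ["title", "authors", "abstract"],
    "front_figs_refs": ["title", "authors", "abstract", "figures", "references"],
    "full": ["title", "authors", "abstract", "body", "figures", "references"],
}

# Hypothetical article with placeholder content.
ARTICLE = """
<article>
  <title>Sample title</title>
  <authors>A. Author</authors>
  <abstract>Sample abstract.</abstract>
  <body>Sample body text.</body>
  <figures>Figure 1</figures>
  <references>Reference 1</references>
</article>
"""


def render_view(root, view):
    """Return a new <article> holding only the components of the chosen view."""
    out = ET.Element("article")
    for tag in VIEWS[view]:
        for child in root.findall(tag):
            out.append(child)
    return out


root = ET.fromstring(ARTICLE)
summary = [child.tag for child in render_view(root, "summary")]
front_plus = [child.tag for child in render_view(root, "front_figs_refs")]
```

Each view is a list of component names, so adding a new granularity means adding one entry to the table rather than building another conversion.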

5 Common problems

No semantic information is left in the target HTML file about links
Linking to external resources is static, in the simple HTML model, and cannot easily accommodate changes in the resources
The target HTML does not allow flexible and dynamic printing, because of the lack of semantic information in the markup
The fixed rendering of math and special symbols is expensive in development and production time, and is seriously inflexible.
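The static-linking problem can be made concrete with a small sketch. It contrasts a link whose target is frozen when the HTML is generated with one resolved through a lookup table when the page is served. The identifier scheme, the `RESOLVER` table and the example.org addresses are all hypothetical; this is one possible form of indirection, not a description of any existing system.

```python
# Hypothetical resolver table: stable identifier -> current location.
RESOLVER = {
    "abs:12345": "https://abstracts.example.org/v1/12345",
}


def static_link(url, text):
    """Hard-wired link: the target is frozen when the HTML is generated."""
    return f'<A HREF="{url}">{text}</A>'


def resolved_link(identifier, text, resolver):
    """Indirect link: the target is looked up when the page is served."""
    return f'<A HREF="{resolver[identifier]}">{text}</A>'


# Generated once, at conversion time.
frozen = static_link(RESOLVER["abs:12345"], "Abstract")

# The external resource moves; only the resolver table is updated.
RESOLVER["abs:12345"] = "https://abstracts.example.org/v2/12345"
fresh = resolved_link("abs:12345", "Abstract", RESOLVER)
```

After the move, `frozen` still points at the old address, while `fresh` follows the updated table.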

6 What's wrong

Production processes are

costly to set up
not yielding products that are flexible enough for the user

All the flexibility that we introduce is at the generation end of the process. Parallel 'canned' variants, or on-the-fly reconversion, let readers display a document as the full article, a summary of headings, or just the front matter.

7 Does CSS help?

Cascading Style Sheets allow semantics to be derived in a roundabout way from the presentation style markup.

<H2 class="sectionHead">Results</H2>

is more useful than just

<H2>Results</H2>

but this poor man's architectural mapping is hardly flexible enough. Applications like scientific publishing require features like re-ordering the components of a document, and selecting subsets of it.
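To show what the class attribute does and does not buy, here is a minimal sketch that recovers section headings from HTML by looking for the `sectionHead` class used above. It uses Python's standard `html.parser` module; the `SectionHeadCollector` class is written for this example only.

```python
from html.parser import HTMLParser


class SectionHeadCollector(HTMLParser):
    """Collect the text of elements whose class attribute includes 'sectionHead'."""

    def __init__(self):
        super().__init__()
        self.heads = []
        self._open_tag = None  # tag name of the heading currently being read
        self._buf = []

    def handle_starttag(self, tag, attrs):
        classes = (dict(attrs).get("class") or "").split()
        if self._open_tag is None and "sectionHead" in classes:
            self._open_tag = tag
            self._buf = []

    def handle_endtag(self, tag):
        if tag == self._open_tag:
            self.heads.append("".join(self._buf).strip())
            self._open_tag = None

    def handle_data(self, data):
        if self._open_tag is not None:
            self._buf.append(data)


HTML = '<H2 class="sectionHead">Results</H2><P>Text</P><H2>Results</H2>'
collector = SectionHeadCollector()
collector.feed(HTML)
collector.close()
heads = collector.heads
```

Only the classed heading is recognized; the plain `<H2>` is indistinguishable from any other use of that tag. The class tells us what the heading is, but says nothing about where its section ends, which is why re-ordering or subsetting remains out of reach.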

8 Future plans and requirements

Elsevier Science can develop its Web-based offerings in at least three ways (although the priority of these is arguable):

Switching flexibility in presentation to the client side of the process, by preserving semantic markup across the delivery, instead of pre-rendering it to HTML
Increasing sophistication and richness in linking, either inside the document database, or to external resources
More interactive documents, with embedded applications

9 Towards XML

Not surprisingly, these three directions coincide with what XML offers:

Client-level applications can render the information in different ways.
Linking adds value to basic material. The extended link and pointer mechanism proposed for XML has many applications in scientific publication.
Scientific publishing can make use of special-purpose markup languages, like those for mathematics and chemistry. With the semantics of math formulae, or molecular models, we can produce richer products.

10 Example

For example, a simple feature like 'back-referencing' from bibliographies currently requires pre-processing, and devious add-in scripts to hard-wire an interface; but if the functionality were a standard feature of XLink-enabled XML browsers, we could reduce our work to style sheets.
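The pre-processing involved can be sketched as follows: walk the structured source, and for each bibliography entry record the places that cite it. The markup (`para`, `cite`, `bibitem`, with `id` and `ref` attributes) is a hypothetical stand-in with placeholder content, and the function is an illustration of the idea rather than the scripts actually in use.

```python
import xml.etree.ElementTree as ET
from collections import defaultdict

# Hypothetical article markup with placeholder bibliography entries.
ARTICLE = """
<article>
  <body>
    <para id="p1">Earlier work <cite ref="b1"/> was extended in <cite ref="b2"/>.</para>
    <para id="p2">We follow <cite ref="b1"/>.</para>
  </body>
  <bibliography>
    <bibitem id="b1">First reference</bibitem>
    <bibitem id="b2">Second reference</bibitem>
  </bibliography>
</article>
"""


def back_references(root):
    """Map each bibliography id to the ids of the paragraphs that cite it."""
    backrefs = defaultdict(list)
    for para in root.iter("para"):
        for cite in para.iter("cite"):
            backrefs[cite.get("ref")].append(para.get("id"))
    return {
        item.get("id"): backrefs.get(item.get("id"), [])
        for item in root.iter("bibitem")
    }


refs = back_references(ET.fromstring(ARTICLE))
```

The resulting table is what currently has to be hard-wired into the generated pages; a browser that understood the links itself would make this step unnecessary.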

11 Do we need HTML?

If our future lies in providing flexibility on the client side of delivery, we could become essentially independent of HTML. It then becomes a browser decision whether to convert XML markup into HTML for presentation, or render it directly.

12 Directions for HTML 1: PDF-like

HTML can evolve further in the direction of PDF. HTML can define the low-level functionality of screen documents, freeing browser writers to concentrate on the user interface instead of the complexities of rendering XML and XSL/CSS directly. Or should PGML take over this role?

13 Is HTML like PostScript?

PostScript was a huge breakthrough: it gave developers a single interface language for many different rendering engines, and made typesetting a consumer product. HTML brought multimedia authoring to the average consumer by providing a single, accessible language. PostScript gave birth to PDF, which added all the functionality needed for screen as well as paper rendering; HTML has slowly acquired more and more presentation features, and a style sheet language.

14 Differences between PDF and HTML

PDF files usually carry fonts with them, which makes them a bit larger than HTML files
While PDF files contain line- and page-breaking information, the real difference is the word-level justification, hyphenation, etc., and white-space placement.
Neither format retains much worthwhile semantic structure; any indexing is either crude crunching of every word, or uses predefined catalogue structures
HTML's style sheet mechanism is more efficient than the fixed layout of PDF, but only allows simplistic adjustment by the end-user.

15 Directions for HTML 2: back to basics

Split HTML into three parts:

A markup language for simple documents
A standardized set of modular XML applications for
1. hyper-linking
2. tables
3. forms
4. metadata
Add frames to CSS?

16 Conclusions

The vital steps:

Break out the simple semantic component of HTML to an XML application to enable XML adoption
Complete the work on XLink
Propose a standard XML DTD for tables
Propose a standard XML DTD for forms

1 The Future of HTML