1 The Future of HTML

The perspective of the scientific publisher
Sebastian Rahtz
and
Herbert van Zijl
Elsevier Science
Oxford / Amsterdam

May 1st 1998

2 Background

Elsevier Science is one of the main scientific publishers in the world:

3 Document storage in HTML?

HTML's markup model does not allow essential tasks like:

4 Elsevier's current usage of HTML

Electronic journals since 1995, generating HTML from SGML; since 1997 Science Direct, an extremely large inter-linked database of articles.
  1. Pre-extraction of key fields (such as author names) for external indexing and searching
  2. Linking of articles to backup resources, such as abstract databases
  3. Use of PDF to provide better quality printout than that which can be derived from HTML
  4. Use of fixed size GIF images to display mathematics and special symbols
  5. Variable presentation granularity, e.g.
    • summary bibliographical details only
    • front matter only
    • front matter, plus figures and references
    • full article
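
The granularity levels above can be treated as selections over one semantically marked-up source. The following is a minimal sketch, not a description of the Science Direct implementation: the element names (`front`, `body`, `figure`, `references`, `bibitem`) and the level names are invented for illustration.

```python
import xml.etree.ElementTree as ET

# Hypothetical article markup; every element name here is an assumption.
ARTICLE = """
<article>
  <front>
    <title>Results on X</title>
    <author>A. Author</author>
    <abstract>Short summary.</abstract>
  </front>
  <body>
    <section><head>Introduction</head><para>Text.</para></section>
    <figure id="f1"><caption>A figure.</caption></figure>
  </body>
  <references>
    <bibitem id="b1">A. Author, Journal, 1997.</bibitem>
  </references>
</article>
"""

# Each presentation level is only a choice of which parts to keep.
LEVELS = {
    "summary": ["front/title", "front/author"],
    "front": ["front"],
    "front+figures+references": ["front", ".//figure", "references"],
    "full": ["*"],
}


def select(article, level):
    """Return a new <article> holding only the parts required by `level`."""
    view = ET.Element("article")
    for path in LEVELS[level]:
        view.extend(article.findall(path))
    return view


article = ET.fromstring(ARTICLE)
summary = select(article, "summary")
partial = select(article, "front+figures+references")
full = select(article, "full")
```

With the semantics still present in the source, adding a new level is a one-line change to the table rather than a new conversion run.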

5 Common problems

  1. No semantic information is left in the target HTML file about links
  2. In the simple HTML model, linking to external resources is static, and cannot easily accommodate changes in those resources
  3. The target HTML does not allow flexible and dynamic printing, because of the lack of semantic information in the markup
  4. The fixed rendering of math and special symbols is expensive in development and production time, and is seriously inflexible.
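
One way to picture the static-linking problem: if a link keeps its meaning (what kind of resource, which identifier) and the address is computed only at delivery time, a change in the external resource needs one table entry changed, not every article regenerated. The sketch below assumes a hypothetical `SemanticLink` record and resolver table; the addresses are placeholders, not real services.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class SemanticLink:
    kind: str        # what the link means, e.g. "abstract"
    identifier: str  # stable identifier of the target, not its address


# Hypothetical resolver table; the base addresses are placeholders.
RESOLVERS = {
    "abstract": "http://abstracts.example.org/record/{id}",
    "fulltext": "http://journals.example.org/article/{id}",
}


def resolve(link, resolvers=RESOLVERS):
    """Turn a semantic link into a concrete address at delivery time."""
    return resolvers[link.kind].format(id=link.identifier)


link = SemanticLink(kind="abstract", identifier="A123")
before = resolve(link)

# The abstract database moves: only the table changes, the link does not.
moved = dict(RESOLVERS, abstract="http://new-abstracts.example.org/{id}")
after = resolve(link, moved)
```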

6 What's wrong

All the flexibility that we introduce is at the generation end of the production process. Parallel 'canned' variants, or on-the-fly reconversion, let readers display a document as a full article, a summary of headings, or just front matter.

7 Does CSS help?

Cascading Style Sheets allow semantics to be derived in a roundabout way from the presentation style markup.

  <H2 class="sectionHead">Results</H2>

is more useful than just

  <H2>Results</H2>

but this poor man's architectural mapping is hardly flexible enough. Applications like scientific publishing require features like re-ordering the components of a document, and selecting subsets of it.
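
To show how far the class attribute goes, here is a minimal sketch that uses it as a hook for picking out section headings. It relies only on Python's standard `html.parser` module; the `articleTitle` class and the sample page are assumptions made for the example.

```python
from html.parser import HTMLParser


class ClassCollector(HTMLParser):
    """Collect the text of every element carrying a given class value."""

    def __init__(self, wanted):
        super().__init__()
        self.wanted = wanted
        self.open_tag = None  # tag currently being captured, if any
        self.found = []

    def handle_starttag(self, tag, attrs):
        classes = (dict(attrs).get("class") or "").split()
        if self.open_tag is None and self.wanted in classes:
            self.open_tag = tag
            self.found.append("")

    def handle_data(self, data):
        if self.open_tag is not None:
            self.found[-1] += data

    def handle_endtag(self, tag):
        if tag == self.open_tag:
            self.open_tag = None


# Hypothetical page; only the sectionHead class comes from the text above.
PAGE = """
<H1 class="articleTitle">A sample article</H1>
<H2 class="sectionHead">Results</H2>
<P>Some text.</P>
<H2>Acknowledgements</H2>
<H2 class="sectionHead">Discussion</H2>
"""

collector = ClassCollector("sectionHead")
collector.feed(PAGE)
section_heads = collector.found
```

The unclassed heading is invisible to the selection, and nothing in the markup says which paragraphs belong to which section, so re-ordering or subsetting whole components is still out of reach.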

8 Future plans and requirements

Elsevier Science can develop its Web-based offerings in at least three ways (although the priority of these is arguable):
  1. Switching flexibility in presentation to the client side of the process, by preserving semantic markup across the delivery, instead of pre-rendering it to HTML
  2. Increasing sophistication and richness in linking, either inside the document database, or to external resources
  3. More interactive documents, with embedded applications

9 Towards XML

Not surprisingly, these three directions coincide with XML:
  1. Client-level applications can render the information in different ways.
  2. Linking adds value to basic material. The extended link and pointer mechanism proposed for XML has many applications in scientific publication.
  3. Scientific publishing can make use of special-purpose markup languages, like those for mathematics and chemistry. With the semantics of math formulae, or molecular models, we can produce richer products.

10 Example

For example, a simple feature like 'back-referencing' from bibliographies currently requires pre-processing, and devious add-in scripts to hard-wire an interface; but if the functionality were a standard feature of XLink-enabled XML browsers, we could reduce our work to writing style sheets.
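
The pre-processing involved in back-referencing is roughly the following: walk the citations and invert them, so that each bibliography entry knows where it was cited. This is a sketch under assumed markup (the `cite`, `bibitem`, and `section` elements and their attributes are hypothetical), not our production code.

```python
import xml.etree.ElementTree as ET
from collections import defaultdict

# Hypothetical article markup; element and attribute names are assumptions.
ARTICLE = """
<article>
  <section id="s1"><para>As shown in <cite ref="b1"/> and <cite ref="b2"/>.</para></section>
  <section id="s2"><para>See again <cite ref="b1"/>.</para></section>
  <references>
    <bibitem id="b1">First reference.</bibitem>
    <bibitem id="b2">Second reference.</bibitem>
  </references>
</article>
"""


def back_references(article):
    """Map each bibliography id to the ids of the sections that cite it."""
    cited_from = defaultdict(list)
    for section in article.iter("section"):
        for cite in section.iter("cite"):
            cited_from[cite.get("ref")].append(section.get("id"))
    return {item.get("id"): cited_from[item.get("id")]
            for item in article.iter("bibitem")}


article = ET.fromstring(ARTICLE)
backrefs = back_references(article)
```

A browser that can traverse links in both directions would make this inversion step unnecessary on the production side.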

11 Do we need HTML?

If our future lies in providing flexibility on the client side of delivery, we could become essentially independent of HTML. It then becomes a browser decision whether to convert XML markup into HTML for presentation, or render it directly.

12 Directions for HTML 1: PDF-like

HTML can evolve further in the direction of PDF. HTML can define the low-level functionality of screen documents, freeing browser writers to concentrate on the user interface instead of the complexities of rendering XML and XSL/CSS directly. Or should PGML take over this role?

13 Is HTML like PostScript?

PostScript was a huge breakthrough: it gave developers a single interface language for many different rendering engines, and made typesetting a consumer product. HTML brought multimedia authoring to the average consumer by providing a single, accessible language. PostScript gave birth to PDF, which added all the functionality needed for screen as well as paper rendering; HTML has slowly acquired more and more presentation features, and a style sheet language.

14 Differences between PDF and HTML

  1. PDF files usually carry fonts with them, which makes them a bit larger than HTML files
  2. While PDF files contain line and page breaking information, the real difference is the word-level justification, hyphenation, etc., and white-space placement.
  3. Neither format retains much worthwhile semantic structure: any indexing is either crude crunching of every word, or uses predefined catalogue structures
  4. HTML's style sheet mechanism is more efficient than the fixed layout of PDF, but only allows simplistic adjustment by the end-user.

15 Directions for HTML 2: back to basics

Split HTML into three parts:
  1. A markup language for simple documents
  2. A standardized set of modular XML applications for
    1. hyper-linking
    2. tables
    3. forms
    4. metadata
  3. Add frames to CSS?

16 Conclusions

The vital steps: