Requirements for Web Publication and Packaging

From Digital Publishing Interest Group

(Definition of an archive at IETF: "Archives are used to collect multiple data objects together into a single object for easier portability and storage." See The Archive Top-Level Media Type for File Archives, Draft. Although we use the term "Package", this definition may be useful nevertheless.)

Introduction

As publishing shifts to the Open Web Platform (OWP), we have expanded our concept of a package beyond the traditional view of a nested folder structure. The central concept is a generic Web Publication; an online publication is merely a reading mode, a state, of the Web Publication. It is necessary to point to the publication with a stable identifier, but with the ever-increasing abilities of the OWP to function as both a rendition engine and data warehouse using tools like Service Workers, the concept of a traditional package of static folders may become secondary in nature. Concerns such as file size limits largely fall away because the OWP itself imposes none, although browsers and other user agents do impose limits of their own. Nevertheless, packages that exist (possibly) apart from the network still have a role to play as units that can be stored or transferred. This concept is essential to the business models that currently dominate the publishing industry for, e.g., digital books.

Thus, we have defined three states of existence for a Web Publication:

  1. Online state (i.e., a collection of documents on the Web that can be considered as a logical unit, a "Web Publication")
  2. Cached state (i.e., a collection of documents in the reading engine, local system, etc., that can be considered as a logical unit and which is mostly indistinguishable from the online state)
  3. Portable state (i.e., a package, that is, a collection of documents packed into one unit for, e.g., a file system or network transfer, but whose content is mostly indistinguishable from the online state)

There are some requirements for the Web Publication that are unique to the portable state, but most requirements are common to all three states. We would like to arrive at a specification (or sets of specifications) that would allow smooth transition among states, and the differences should be restricted so that the management (render, interact with, store, etc.) of a Web Publication would be independent from the original state. E.g., a document in a Portable State should become Cached through some simple, and localized, actions (e.g., in a specialized service worker), and that should be transparent for the higher level rendering and interaction aspects of the user agent.
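The transition from the Portable State to the Cached State can be illustrated with a minimal sketch. It assumes, purely for illustration, that the package is a ZIP archive and that the cache is a plain mapping from absolute URL to content; neither choice is made by this document, and the names `pack`, `unpack_into_cache`, and the `example.org` base URL are hypothetical.

```python
import io
import zipfile


def pack(resources: dict[str, bytes]) -> bytes:
    """Portable state: pack resources (relative URL -> content) into one unit."""
    buffer = io.BytesIO()
    with zipfile.ZipFile(buffer, "w") as archive:
        for relative_url, body in resources.items():
            archive.writestr(relative_url, body)
    return buffer.getvalue()


def unpack_into_cache(package: bytes, base_url: str) -> dict[str, bytes]:
    """Cached state: key every packed resource by the URL it would have online."""
    cache: dict[str, bytes] = {}
    with zipfile.ZipFile(io.BytesIO(package)) as archive:
        for name in archive.namelist():
            cache[base_url + name] = archive.read(name)
    return cache


package = pack({"index.html": b"<h1>Title</h1>", "css/style.css": b"h1{}"})
cache = unpack_into_cache(package, "https://example.org/pub/")
# Higher layers look resources up by URL and need not know the original state.
print(sorted(cache))
```

The only state-specific step is `unpack_into_cache`; everything that reads from `cache` afterwards is unaware of whether the content came from a package or from the network.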

This document builds on the use cases outlined in the DPUB IG's use case collection.

Requirements for the Portable State

Portability/Persistence

The portable state (i.e., the package) must be functionally independent of Web access. It must be network-, browser-, and install-independent. It should remain usable over time.

Use cases

The use cases below are not the same as caching (with or without service workers) although the borderline is indeed fuzzy:

  • Pilot’s manuals are used in-flight when there is no access to the internet (and we sure hope that is not what comes between the pilot and a safe landing)
  • Full publication ported from one user agent to a second user agent, without network
  • Digital archive services, such as Chronopolis or OCLC's Digital Archive, need to ingest and store a complete, self-contained item. Such archives may not depend on network access.
  • A package may contain a data set component. Working with that data set requires computational resources, but it should not depend on the network or on any particular reader technology for viewing or editing.

Requirements that are valid for all three states

Note that some of these requirements may be trivially available for one or the other states; the emphasis is on the fact that these are shared requirements whose fulfillment is part of the smooth transition among states.

Streaming

Streaming means that the content is continuously received by, and presented to, an end user while it is being delivered by a provider. Here this may refer either to a content part of the Web Publication (e.g., an audio segment) or to the Web Publication as a whole.
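As a minimal sketch of this definition, the generator below hands each chunk to the consumer as soon as it has been read, instead of waiting for the whole resource. The name `stream_chunks` and the tiny chunk size are illustrative assumptions; a real user agent would stream over the network rather than from an in-memory buffer.

```python
import io
from typing import BinaryIO, Iterator


def stream_chunks(source: BinaryIO, chunk_size: int = 4) -> Iterator[bytes]:
    """Yield the content piece by piece as it becomes available."""
    while True:
        chunk = source.read(chunk_size)
        if not chunk:
            break
        yield chunk


video = io.BytesIO(b"0123456789")  # stands in for a large media resource
for chunk in stream_chunks(video):
    # Presentation can start with the first chunk; the rest is still pending.
    print(f"presenting {len(chunk)} bytes")
```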

Use cases

  • A publication contains a 4-minute video. The user wants to begin watching the video as soon as she pushes "play". She does not want to wait for the full download before she is able to watch it.
  • For some reason the Web Publication may be very large, e.g., due to a large amount of embedded data. The user does not want to wait for everything to be accessed before being able to consume the main content.

Random access to content

It must be possible for a client to fetch components of a package in any order, or to fetch multiple components at the same time, without having to read the entire package. This means, for example, that it should be possible to make a specific request based on a general description, an initial table of contents, etc., of the Portable Document to get the necessary information on the content of the full Web Publication.
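A container format with an index makes this straightforward. The sketch below assumes a ZIP archive, which this document does not mandate: the archive's central directory serves as the list of components, and a single component can then be read without reading the others. The function names are hypothetical.

```python
import io
import zipfile


def build_package(components: dict[str, bytes]) -> bytes:
    """Pack named components into one archive (illustrative helper)."""
    buffer = io.BytesIO()
    with zipfile.ZipFile(buffer, "w") as archive:
        for name, body in components.items():
            archive.writestr(name, body)
    return buffer.getvalue()


def list_components(package: bytes) -> list[str]:
    """Read only the archive index, not the component bodies."""
    with zipfile.ZipFile(io.BytesIO(package)) as archive:
        return archive.namelist()


def read_component(package: bytes, name: str) -> bytes:
    """Fetch one component by name, in any order."""
    with zipfile.ZipFile(io.BytesIO(package)) as archive:
        return archive.read(name)


package = build_package({
    "toc.html": b"<ol>...</ol>",
    "abstract.html": b"<p>Abstract</p>",
    "data.csv": b"x,y\n1,2\n",
})
print(list_components(package))
print(read_component(package, "abstract.html"))  # data.csv is never read
```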

Use cases

  • Scholarly publications may include interactive data files. (See "Publication with Data" or "Publication with Interactive Data" for further details.) The user wants to read the content of the article and access (and/or interact with) the data only when it is necessary.
  • EsteemedJournalPublisher would like to offer the users of the EsteemedJournal of Chemistry the opportunity to read only the abstracts of the journals. The Web Publication for the journal as a whole must offer the user a list (table of contents) of abstracts. (See "Streamlined Access to Disjoint Package Components".)
  • Web Publications have a method to define at least one "reading" (consumption) order of their components. These may include the possibility of consuming components out of order, or skipping elements (e.g., for preview).
  • Web Publications may have to offer multiple renditions of the same content (e.g., multiple languages, reflowable and fixed layout).

External (non embedded) assets

A Web Publication may include references to external content that is not embedded into the publication in any of the states. The Web Publication should be able to specify which resources are embedded into the publication and which are external.
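One way to express this distinction is a per-resource flag in some form of manifest. The sketch below is only an illustration: the `Resource` record, its `embedded` field, and the URLs are hypothetical and do not correspond to any manifest format defined here.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Resource:
    url: str
    embedded: bool  # True: part of the publication in every state; False: external


def partition(resources: list[Resource]) -> tuple[list[str], list[str]]:
    """Split a resource list into (embedded URLs, external URLs)."""
    embedded = [r.url for r in resources if r.embedded]
    external = [r.url for r in resources if not r.embedded]
    return embedded, external


resources = [
    Resource("article.html", embedded=True),
    Resource("figures/fig1.png", embedded=True),
    Resource("https://data.example.org/survey-2016.nc", embedded=False),
]
embedded, external = partition(resources)
print(embedded)
print(external)
```

A user agent moving the publication into the portable state would pack only the embedded list and leave the external references as links.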

Use cases

  • A scholarly publication may refer to scientific data to support its scientific assertions and/or to provide evidence. However, that data may be far too large to embed in a Web Publication, or there may be legal obstacles to embedding it; the publication should then include an external reference to the data.
  • An art book on music may refer to specific concerts or recordings, but those may be too long, and therefore the data too large, to be directly embedded in the Web Publication itself.

Access to the media type of all constituents

This is more a technical than a usage requirement: generic reading systems, browsers, etc., usually behave differently depending on the media type of a document; this information should therefore be available for every constituent.
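In the online state the media type normally arrives with the HTTP response; in the other states it has to come from somewhere else. The sketch below guesses it from the file extension using Python's standard `mimetypes` module. Guessing from the extension is an assumption made for brevity; a package format could equally record the declared type for each constituent explicitly.

```python
import mimetypes


def media_type(name: str) -> str:
    """Return a media type for a constituent, with a generic fallback."""
    guessed, _encoding = mimetypes.guess_type(name)
    return guessed or "application/octet-stream"


def media_types(names: list[str]) -> dict[str, str]:
    """Build a name -> media type table for all constituents."""
    return {name: media_type(name) for name in names}


print(media_types(["chapter1.html", "style.css", "cover.png"]))
```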

Stable links (same link whether on/offline)

The Web Publication, and its components, should have stable identifiers usable for linking, citations, annotations, etc. Components within the Publication should be able to refer to one another with active links that are usable independently of the current state.

This requirement is trivially true for the online state; the way cached systems (e.g., through Service Workers) are implemented means that it should also be easy to meet for cached states. Internal cross-references (e.g., via a systematic usage of relative URIs) make this easily achievable for portable states, too; however, accessing the Web Publication via a unique identifier when it is in a portable state is more complicated (the system must make available some sort of mapping between the generic identifier and the local copy).
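Both halves of this argument can be sketched briefly. The example assumes hypothetical base locations (`https://example.org/pub/` online, `file:///media/usb/pub/` for a local copy) and a hypothetical helper `to_local`; it shows that a relative link resolves consistently in either state, and that a simple prefix mapping can translate the generic identifier to the local copy.

```python
from urllib.parse import urljoin


def resolve(document_url: str, link: str) -> str:
    """Resolve a relative link against the URL of the document that contains it."""
    return urljoin(document_url, link)


def to_local(url: str, canonical_base: str, local_base: str) -> str:
    """Map a stable (canonical) URL onto the local copy of the publication."""
    if not url.startswith(canonical_base):
        raise ValueError(f"{url} is not part of this publication")
    return local_base + url[len(canonical_base):]


ONLINE = "https://example.org/pub/"
LOCAL = "file:///media/usb/pub/"

# The same relative link works from the online and from the portable copy.
print(resolve(ONLINE + "ch1.html", "ch2.html#sec1"))
print(resolve(LOCAL + "ch1.html", "ch2.html#sec1"))

# A citation that uses the stable identifier is mapped to the local copy.
print(to_local(ONLINE + "ch2.html#sec1", ONLINE, LOCAL))
```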

Use cases

  • Rosita's thesis cites an article in the American Geophysical Union's monthly journal. She is comfortable referring to the Web Publication version of the article only if she knows that the link will still function in 6 months.
  • Creating an annotation for a Web Publication means using an annotation target, i.e., a hyperlink anchor in one of the components. The annotation (whether stored within the Web Publication or not) should remain valid regardless of the current state. (See the "Publication plus Annotations" use case for further details.)

Update new components only

It should be possible to update only the new components of a package.
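One possible way to find out which components need updating is to compare content digests. The sketch below assumes that both the reader's copy and the publisher's current version can be summarized as a mapping from component name to SHA-256 digest; this summary and the function names are illustrative assumptions, not a mechanism defined by this document.

```python
import hashlib


def manifest(components: dict[str, bytes]) -> dict[str, str]:
    """Summarize a publication as component name -> SHA-256 digest."""
    return {name: hashlib.sha256(body).hexdigest() for name, body in components.items()}


def components_to_update(local: dict[str, str], remote: dict[str, str]) -> list[str]:
    """Names of components that are new or whose content has changed."""
    return sorted(name for name, digest in remote.items() if local.get(name) != digest)


local = manifest({"ch1.html": b"Chapter 1", "ch2.html": b"Chapter 2"})
remote = manifest({
    "ch1.html": b"Chapter 1",
    "ch2.html": b"Chapter 2 (corrected)",
    "ch3.html": b"Chapter 3",
})
print(components_to_update(local, remote))
```

Only the components named in the result have to be transferred; unchanged ones stay as they are in the reader's copy.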

Use cases

  • Errata: a journal publisher issues corrections to several components of a multi-part compendium, and the reader may choose to incorporate them into their existing copy of the compendium.
  • Textbook updates: a textbook publisher decides to regularly update specific chapters, videos, or other components within the larger textbook in order to keep it current.
  • Data sets may be updated independently of the natural language component of the resource.
  • Annotations made by a colleague, a classmate, a friend, etc., should be reflected in one's own copy of the Web Publication. (See the "Publication plus Annotations" use case for further details.)

Package within a package (nesting, hierarchy)

The package must be able to contain other packages, i.e., it must be possible to nest packages into a hierarchical structure.
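Nesting can be sketched with the same illustrative assumption of a ZIP container: an inner package is simply stored as one component of the outer package, and a component is addressed by a path of names leading through the hierarchy. The helper names and file names are hypothetical.

```python
import io
import zipfile


def pack(components: dict[str, bytes]) -> bytes:
    """Pack named components into one archive (illustrative helper)."""
    buffer = io.BytesIO()
    with zipfile.ZipFile(buffer, "w") as archive:
        for name, body in components.items():
            archive.writestr(name, body)
    return buffer.getvalue()


def read_nested(package: bytes, path: list[str]) -> bytes:
    """Follow `path` through nested packages; all but the last name are inner packages."""
    data = package
    for name in path:
        with zipfile.ZipFile(io.BytesIO(data)) as archive:
            data = archive.read(name)
    return data


article = pack({"index.html": b"<h1>Article 1</h1>"})
compendium = pack({"toc.html": b"<ol>...</ol>", "articles/a1.zip": article})
print(read_nested(compendium, ["articles/a1.zip", "index.html"]))
```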

Use cases

  • Several journal articles are published with unique identifiers. The publisher plans a compendium of all articles related to brain surgery within a specific period.
  • Data sets across several articles may be combined into one larger dataset for researchers to work with.

Access to package by simultaneous users

Multiple users must be able to access and interact with the package simultaneously. This may involve annotating, triggering scripts, filling in forms, or watching videos.

This is trivially available for the online state, and also for the portable state (modulo access control). It may be more complex for cached states.

Use cases

  • Students in a classroom who use the same Web Publication read the same content, although at different speeds and possibly in a different reading order
  • Family members share the same eBook for reading, stored (in one of the possible states) on a family server

Encryption and Obfuscation

It must be possible to encrypt and/or obfuscate the Web Publication, or its components, in a way that ensures interoperability and that is preserved when the publication moves from one state to another. The ability to decrypt or encrypt may depend on the credentials of the user.

This is related to other requirements (update of new components, random access to content), combined with existing encryption and obfuscation techniques.

Use cases

  • Educational publication in a classroom may include information that is only available to the teacher
  • Internal documents (e.g., flight manuals) may be partially obfuscated to ensure a need-to-know access to the information

Signatures

The package and its components should include a hook for inclusion of digital signatures.

This is related to other requirements (update of new components, random access to content), combined with existing digital signature techniques.
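What such a hook might look like can be sketched as follows. The example computes a digest for every component and then authenticates the list of digests, so that any change to any component invalidates the result. Note that it uses an HMAC with a shared key as a stand-in, because Python's standard library has no public-key signature primitive; a real digital signature would use an asymmetric scheme instead. All names are hypothetical.

```python
import hashlib
import hmac
import json


def component_digests(components: dict[str, bytes]) -> dict[str, str]:
    """Per-component SHA-256 digests: the data a signature would cover."""
    return {name: hashlib.sha256(body).hexdigest() for name, body in components.items()}


def sign(components: dict[str, bytes], key: bytes) -> str:
    """Stand-in for a signature: HMAC-SHA256 over the sorted digest list."""
    payload = json.dumps(component_digests(components), sort_keys=True).encode("utf-8")
    return hmac.new(key, payload, hashlib.sha256).hexdigest()


def verify(components: dict[str, bytes], key: bytes, signature: str) -> bool:
    """Check that the components still match the stored signature value."""
    return hmac.compare_digest(sign(components, key), signature)


key = b"demo-key"  # illustrative only
contract = {"contract.html": b"<p>Terms</p>", "annex.html": b"<p>Annex</p>"}
signature = sign(contract, key)
print(verify(contract, key, signature))

tampered = dict(contract, **{"annex.html": b"<p>Altered annex</p>"})
print(verify(tampered, key, signature))
```

Because the digests are computed per component, the same structure also supports signing or re-signing individual components after a partial update.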

Use cases

  • The integrity of a legal document must be ensured through a signature by the appropriate authorities.