(Disclaimer: many of the thoughts and opinions in this blog are my own, and not necessarily shared by the W3C as a whole. Putting it another way: I wrote this blog with my W3C team’s hat off…)
Document Web or Web Operating System?
Let’s look at the definition of the Web on, say, Wikipedia. Here is what we find:
The World Wide Web […] is an information space where documents and other Web resources are identified by Uniform Resource Locators (URLs), interlinked by hypertext links, and can be accessed via the Internet.
Note the word “documents” (emphasis is mine).
In its first, say, 10-15 years the Web was seen as some sort of a giant library of (interlinked) information. Web pages were, mostly, documents with an essentially textual content although, possibly, illustrated with images and videos. In contrast, the force that drives the Web today is mostly Web Applications. The Web has turned into some sort of a universal Web Operating System which makes the differences among Windows, Linux, Android, iOS, or macOS mostly disappear, and which is used to run sophisticated programs from email clients to online computer games, from virtual reality applications to teleconferencing utilities. That is the main thrust of today’s developments.
Is this a problem? Not really, one is tempted to say, because all those applications are undeniably useful (or fun…), and having a universal Web Operating System is a good thing. It simplifies the life of many. The features used by Web Applications, i.e., the underlying rich APIs, provide useful interaction, transformation, or animation possibilities that more traditional Web pages can also exploit. And that is also a good thing.
So do we have a problem? Well… maybe yes. Not due to the fact that Web Applications (and the underlying infrastructure) exist but because, in my view, the pendulum has swung to an extreme. In spite of the amazing facilities offered by CSS, of the advances of internationalization on the Web, or of new font technologies, one nevertheless has the impression that the Web has forgotten its “roots”, that the only important goal is to make Web Applications as fast and responsive as possible, and everything else is deemed irrelevant or at least not important. We have reached a situation where browsers optimize their operations so that 3D animation and Virtual Reality work with hitherto unimaginable speed, but most browsers are not able to display the oldest and most universal language of scientific discourse, namely mathematics. It has become easy to display pull-down menus with carefully calibrated, colorful (but, often, futile) transitions but displaying poetry on a Web page remains a challenge. Systems are optimized for short messages surrounded by dancing pictures and emojis but handling really long text in a readable manner, using ergonomically proven techniques like temporary bookmarks (in the original, etymological sense of the word) on a simple paragraph, or paginating the content, is still experimental and in the realm of (a few) extensions as opposed to being part of the core. The Web has contributed to making us forget about the value of long, carefully written and curated content in favor of short texts of 128 characters…
In short, we have lost the concept of a “Web of Documents”. Personally I believe this is a problem.
Haven’t (traditional) publishers developed solutions already?
Unfortunately, not really. To see some of the problems they are facing we have to realize that the term “publishers” covers many different communities, even if we concentrate on the question of how they interact with the Web.
Take journal and magazine publishers. This includes publishers of scholarly journals and conference proceedings as well as industrial reports and publicity materials. As of today, this community mostly produces PDF as the output of their workflow with, in some cases, a presence on the Web that serves just as a landing page to get to those PDF files. They do it because they have, usually, very high demands on the typesetting and general output of their materials (either for aesthetic reasons or due to the traditions of a particular community) and their “digital” publication is merely a digital copy of their paper publication. Magazines, for example, have barely moved away from that approach. I do not really consider that as part of the Web; it is just providing digital access to, essentially, printed or printable material.
This approach begins to break down, though. First of all, one of the peculiar aspects of, e.g., scholarly publishing is that publishers “publish”, say, conference proceedings consisting of a series of PDF files formatted as if they were printed documents even though the printed version is not produced any more. Which is odd, if one thinks about it. And the usage of such files is becoming a major issue for the consumers of our digital age: PDF may be problematic for accessibility or for reading on a mobile device. Anybody who has tried to provide a peer review of a scholarly paper on a tablet or a phone knows how unnerving that can be.
Some of these publishers recognize that the future is to move fully to the Web, but then they hit numerous issues that need solutions. In many cases a publication in those communities must have a unique and immutable identity so that others can refer to it. (People’s careers in, for example, the scholarly world may depend on a clear identification of their publications!) But a publication usually consists of many different resources: not only the core textual content, but all the accompanying images, photographs, and data files together, and the identity should refer to all these resources and not only to one of them. It is the collection that counts, not a single HTML page.
Also, journal publishers, depending on the discipline, must include mathematics, biochemical diagrams, or extracts of sheet music. They may have to abide by various typesetting rules that have evolved over centuries and that scholars are not ready to give up. Publications must include a number of metadata elements. And, last but not least, scholars may want to have access to the full publication even if they are offline; after all, what is a better place to read a paper than on a plane or on a train while commuting?
Traditional “trade” publishers, e.g., those that publish “simply” books (novels, textbooks, recipes, or travel books) have gone down the EPUB way. Though different in origin, the problems they have hit are fairly similar to what, say, journal or academic publishers have come to face. (It must be noted that EPUB is not for books only although, so far, the book publishing industry is its biggest user. EPUB is perfectly usable for other types of publications.)
The most obvious issue is that an EPUB publication is off the Web. Just like the aforementioned PDF files, EPUB publications may be downloaded from the Web, but they rely on special “reading devices”, as the jargon goes. This in spite of the fact that, at its core, EPUB relies on Web technologies. Forgetting the packaging aspect for a moment, EPUB is, essentially, a Web site. A Web site that, beyond the core content in HTML, CSS, or SVG, contains additional information that makes the Web site a publication. It specifies that it is, actually, a collection of resources (just like a modern, HTML-based scholarly publication is such a collection), it contains metadata, it contains information about a table of contents, it has a strong notion of identity for the collection, etc. Note that EPUB, as a Web site, also uses technologies that are barely used elsewhere on the Web like SMIL (albeit a relatively small profile thereof). EPUB has features to turn the content into an audio book, and the accessibility requirements imposed by EPUB are more stringent than on the Web in general. Finally, EPUB is packaged (into a derivative of ZIP): the reason is partly a business/distribution issue, but partly because, just like a scholarly publication, an EPUB publication must be available offline or for archival purposes in, e.g., national libraries.
Is (the current version of) EPUB a solution for a Web of Documents (if one forgets about the packaging again)? Not as it stands today. The definition of EPUB, which was done independently, was always dependent on what was developed elsewhere (mostly at W3C), and was therefore often behind the most up-to-date technologies. It still uses XML syntaxes for some of its “configuration” (today we would say “manifest”) data, which has become, over the years, a heresy for many in the Web community. It relies on XHTML. It is impossible to “just” unpack an EPUB file onto a Web site and expect the result to be “just” enjoyable. (There are of course extensions or applications on top of Web browsers, but they are highly non-trivial and complex, ahem, Web Applications acting on the packaged EPUB publications, much like pdf.js is used in many browsers to handle PDF files.)
A way forward: work together!
The recent evolution at W3C may have a profound and welcome effect on the development of a Document Web. Mid last year W3C and IDPF (the organization that specified EPUB) decided to join forces to form what is now called “Publishing@W3C”. This was followed by the creation of several new groups at W3C, including the Publishing Working Group, that began their work right before summer. Without going into too much detail, one can say that the goal of the Working Group (as well as of Publishing@W3C in general) is to make significant steps towards a more comprehensive Web of Documents.
What is the direction this Working Group has taken? A precursor of the Working Group, the (now closed) Digital Publishing Interest Group, coined the term “Web Publication” for, well, publications on the Web. It described a vision where publications are wholly part of the Web. This term has been taken up by the Working Group in its charter; the term refers to a collection of Web resources that has a unique identity (i.e., a URL) for the whole collection as opposed to any individual constituent. It should have some sort of a manifest, providing the information that is necessary for publications (table of contents, metadata, references to the constituent resources, etc.). It should be available offline or online; it should be packageable, if necessary (the term “Packaged Web Publication” is used to differentiate that state) to encompass relevant business models or requirements for long term archival. User actions, like search or annotations, and preference settings for font size or background color should be based on the collection (i.e., the full publication) rather than necessarily restricted to a specific constituent resource. (After all, users want to search for a term in the whole book, and not in the current chapter only…) But probably the most important aspect of the planned work is that all these features should be achieved through additions to Web technologies, and not (I repeat: not) some sort of a “fork” of the current Web.
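To make the idea of “a collection with one identity” a bit more tangible, here is a minimal sketch in TypeScript. It is purely illustrative: the interface and field names (WebPublicationManifest, readingOrder, etc.), the example.org URLs, and the inline text field are my own assumptions for the sake of the example, and not something the Working Group has defined. It only shows that the table of contents and a search can operate on the whole publication rather than on a single page.

```typescript
// Hypothetical shape of a Web Publication manifest; all names are illustrative only.
interface PublicationResource {
  url: string;    // address of one constituent resource
  title?: string; // optional label, used for the table of contents
  text?: string;  // plain-text content, inlined only to keep this sketch self-contained
}

interface WebPublicationManifest {
  id: string;                          // the one URL that identifies the whole collection
  metadata: Record<string, string>;    // e.g., title, language
  readingOrder: PublicationResource[]; // the constituent resources, in order
}

// The table of contents is derived from the collection, not from a single page.
function tableOfContents(pub: WebPublicationManifest): string[] {
  return pub.readingOrder.map((r, i) => r.title ?? `Section ${i + 1}`);
}

// Search runs over every resource of the publication, not only the current one.
// Returns the URLs of the resources whose text contains the term (case-insensitive).
function searchPublication(pub: WebPublicationManifest, term: string): string[] {
  const needle = term.toLowerCase();
  return pub.readingOrder
    .filter((r) => (r.text ?? "").toLowerCase().includes(needle))
    .map((r) => r.url);
}

// A made-up publication with three chapters.
const novel: WebPublicationManifest = {
  id: "https://example.org/novel/",
  metadata: { title: "An Example Novel", language: "en" },
  readingOrder: [
    { url: "https://example.org/novel/ch1.html", title: "Chapter 1", text: "The pendulum swings." },
    { url: "https://example.org/novel/ch2.html", text: "Nothing to see here." },
    { url: "https://example.org/novel/ch3.html", title: "Chapter 3", text: "A pendulum at rest." },
  ],
};
```

Calling searchPublication(novel, "pendulum") looks at all three chapters at once, which is exactly the behavior a reader expects from a book and does not get from a set of unrelated Web pages.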
The Working Group will need to look at ways to reintroduce features on audio-video-text synchronization using technologies other than SMIL, because the latter has been ignored by the current Web developments; it should rely, as much as possible, on technologies that already exist or are under development elsewhere. It has to reuse, wherever possible and worthwhile, technologies developed elsewhere; for example, whilst the current work on Web Application Manifests is aimed (of course…) at Web Applications, a similar and, hopefully, compatible approach may be used for publications, too. It should embrace the latest evolution of Web Technologies, like Service Workers, to make offline access to a publication a simple matter. Or, as packaging remains necessary for business or archival reasons, reuse any work on Web packaging, should the Web community move in this direction.
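As an illustration of the offline idea, here is a rough sketch of the “cache first” logic that a Service Worker could apply to the resources of a publication. To keep it runnable outside a browser, it does not use the real Service Worker or Cache APIs: OfflineStore and Fetcher are hypothetical stand-ins of my own for the browser cache and for the network, and nothing here is taken from the Working Group’s documents.

```typescript
// Stand-in for the network: in a browser this role would be played by fetch().
type Fetcher = (url: string) => Promise<string>;

// Stand-in for the browser cache: a simple in-memory store keyed by URL.
class OfflineStore {
  private entries = new Map<string, string>();
  get(url: string): string | undefined {
    return this.entries.get(url);
  }
  put(url: string, body: string): void {
    this.entries.set(url, body);
  }
}

// "Install" step: fetch and store the whole collection while the network is available.
async function precache(store: OfflineStore, urls: string[], fetcher: Fetcher): Promise<void> {
  for (const url of urls) {
    store.put(url, await fetcher(url));
  }
}

// "Fetch" step: answer from the store first; fall back to the network otherwise,
// and keep a copy of what the network returned for the next time.
async function cacheFirst(store: OfflineStore, url: string, fetcher: Fetcher): Promise<string> {
  const cached = store.get(url);
  if (cached !== undefined) {
    return cached;
  }
  const body = await fetcher(url);
  store.put(url, body);
  return body;
}
```

The point of the sketch is that the list of URLs handed to precache would come from the publication’s manifest: once the collection is known as a whole, making all of it available on a plane or on a train becomes a fairly mechanical step.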
Simple, traditional Web pages should also be considered as Web Publications either out of the box or with very little extra work. That means that reasonable defaults should be provided so that systems (browsers, dedicated applications, etc.) could automatically offer Web pages extra facilities that were, so far, not available (e.g., pagination of long texts). It should be possible to extend the capabilities of Web browsers with new formats like mathematics or chemical markup; obviously, we cannot expect browser manufacturers to solve everything by themselves. A simple path should be available for existing publications (journal articles, EPUBs, etc.) to become Web Publications, to make the transition for publishers easy and smooth.
Note that Web Publications, because they are essentially Web sites, need the possibilities offered by that “Web Operating System”, too. Not necessarily for the publication of a simple novel but, for example, for educational materials that may include interactive tests, direct connection and interaction with sophisticated server-side facilities, or inclusion of all possible types of media. (Consider an educational textbook that is on the borderline of a book and an application providing online tests, quizzes, etc.) Also, at least at first, handling Web Publications within the browser will probably need some polyfills or other forms of extensions. The big advantage is that, via such extensions, reading a novel or an article will rely on the same user interface, the same look-and-feel as any Web site, as opposed to how a PDF or an EPUB document is consumed today. That shows that the choice should not be a Web of Applications or a Web of Documents; the future should be a Web of Applications and a Web of Documents. The pendulum should get to an equilibrium in the middle.
The first results
The Publishing Working Group has now reached a significant milestone: it has published the First Public Working Draft for Web Publications. As is often the case for such first drafts, the document does not provide detailed technical solutions for all issues like the essential information items for Web Publications, manifests, security issues, locations and identifiers, or reading enhancements. There may be an outline for a solution for some of those, and merely a number of specific technical questions and open issues for others. But it gives a direction to where the WG thinks we should go. Also, many of the problems faced by publishers are actually not addressed by this Working Group, because they represent issues that other Working Groups are chartered for (e.g., CSS), and the job of the Publishing Working Group will be to work hand in hand with those groups.
The main value, for me, of publishers joining (at last!) the efforts to develop the Web further is to help us find the right balance that I believe we have lost. We must continue developing a Web Operating System for all the good reasons. But we also must spend our efforts to rejuvenate a Web of Documents, i.e., a universal library of human knowledge and culture, where publishing political essays and poetry, scholarly articles, or novels can occur as a direct continuation of one of the oldest human endeavors that is present in all cultures and traditions: publishing. Such a Web of Documents should embrace all the possibilities offered by a Web of Applications, but this should happen without giving up the facilities, the aesthetics, or the ergonomic features of traditional publishing. Just as it may be a pleasure to open a beautifully typeset and illustrated art book, it should be the same pleasure opening up a similar page on the Web, without losing the additional possibilities that the Web has given us and that may also revolutionize publishing.
Web Publications should put the paradigm of a document on the Web back in the spotlight. Not in opposition to Web Applications but to complement them. (Web) Publications should become first class entities on the Web. This should lead to the right balance between a Web of Applications and a Web of Documents as two complementary faces of the World Wide Web.