World-Wide Web: The Information Universe

Tim Berners-Lee, Robert Cailliau, Jean-François Groff, Bernd Pollermann
CERN, 1211 Geneva 23, Switzerland

Abstract

The World-Wide Web (W3) initiative is a practical project to bring a global information universe into existence using available technology. This article describes the aims, data model, and protocols needed to implement the ``web'', and compares them with various contemporary systems.

The Dream

Pick up your pen, mouse or favorite pointing device and press it on a reference in this document - perhaps to the author's name, or organization, or some related work. Suppose you are directly presented with the background material - other papers, the author's coordinates, the organization's address and its entire telephone directory. Suppose each of these documents has the same property of being linked to other original documents all over the world. You would have at your fingertips all you need to know about electronic publishing, high-energy physics or for that matter Asian culture. If you are reading this article on paper, you can only dream, but read on.

Since Vannevar Bush's article [1], men have dreamed of extending their intellect by making their collective knowledge available to each individual by using machines. Computers give us two practical techniques for the man-knowledge interface. One is hypertext, in which links between pieces of text (or other media) mimic human association of ideas. The other is text retrieval, which allows associations to be deduced from the content of text. In the first case, the reader's operation is typically to click with a mouse (or type in a reference number) - in the second case, it is to supply some words representing that which he desires. The W3 ideal world allows both operations, and provides access from any browsing platform.

Reality

Existing research projects and commercial products are not far from achieving parts of this dream.
The Xanadu system [2] is an ambitious distributed hypertext project. Existing hypertext systems (see for example [3, 4]) tend to be restricted to the local or distributed file system, and often are developed with a limited set of platforms in mind. Contemporary information retrieval and access systems such as Alex [5], Gopher [6], Prospero [7] and WAIS [8] cover a wide area without the hypertext functionality.

Merging the techniques of hypertext, information retrieval, and wide-area networking produces the W3 model. This poses specific requirements on document naming schemes, protocols, and data representation.

The W3 data model

The W3 model uses both paradigms of hypertext link and text search in a complementary fashion, as neither can replace the functionality of the other. Figure 1 shows how a personalized web of information is built from these operators:

[Figure: sample documents joined by arrows of two kinds, Link and Search, and numbered (1) to (4). Recoverable labels include ``My home page'', ``Group resources'', ``The phone book'', ``Joe in phone book'', ``Joe Bloggs'', ``Encyclopaedia'' and ``ATP''.]

Fig. 1: A web of links and indexes. The W3 model involves hypertext links and index searches. The reader starts at the home page (1), and quickly uses his own links, group-wide or public links to find resources. Indexes such as the phone book (2) are represented as documents with the possibility of inputting search words. The result is a virtual hypertext document (3) which points to the documents found (4).

Features to note are:

  . Information need only be represented once, as a reference may be made instead of making a copy;
  . Links allow the topology of the information to evolve, so modeling the state of human knowledge at any time without constraint;
  . The web stretches seamlessly from small personal notes on the local workstation to large databases on other continents;
  .
    Indexes are documents, and so may themselves be found by searches, and/or following links. An index is represented to the user by a ``cover page'' which describes the data indexed and the properties of the search engine.
  . The documents in the web do not have to exist as files: they can be ``virtual'' documents generated by a server in response to a query or document name. They can therefore represent views of databases, or snapshots of changing data (such as the weather forecast, financial information, etc.).

A pleasing, and useful, aspect is that almost all existing information systems can be represented in terms of the W3 model. A menu becomes a page of hypertext, with each element linked to a different destination. The same is true of a directory, whether part of a hierarchical or cross-linked system. The notion of many named indexes within the web allows a given search engine and database to be visible with several different addresses, each representing different options for the search algorithm. For example, the index /library/books/ti+au/substring may give a title and author search, whereas /library/books/text/exact may give an exact-word full-text search. Addresses are discussed in more detail below.

Publishing

From the information provider's point of view, existing information systems may be ``published'' as part of the web simply by giving access to the data through a small server program. The data itself, and the software and human procedures which manage it, are left entirely in place. This approach has allowed, for example, a mainframe-based document storage and index system to be opened up to access from all platforms in the organization. To see how this is done requires a brief overview of the W3 architecture.

W3 Architecture

Hypertext and text retrieval systems have been available for many years, and a valid question is why a global system has not already come into existence. Traditional answers to this question are the lack of

  .
    A common naming scheme for documents
  . Common network access protocols
  . Common data formats for hypertext

Most research in hypertext systems (the Xanadu project excepted) has focussed on the user interface and authoring questions, rather than the questions of wide-area and long-term distribution. These architectures have assumed that users share a common application program running on computers (often of the same type) which share a common file system.

The W3 architecture must cope with a widely distributed heterogeneous set of computers running different applications which use different preferred data formats. This requires a client-server model. The client has the responsibility for resolving a document address into a document using its repertoire of network protocols. The server provides data in a simple hypertext or plain text form, or, by negotiation with the client, in any other data format.

[Figure: browsing platforms labelled Mac, PC, X, dumb and NeXT; a central band labelled ``Addressing scheme + Common Protocol + Format Negotiation''; and boxes labelled Gateways, Server, Existing data and Network News.]

Fig. 2: The W3 architecture in outline.

It may be more difficult initially to develop a generic hypertext browser than a specific front-end for a particular information system. However, the de-coupling of the client and server programs by the ``information bus'' pays off as more clients and servers are plugged in and universal readership is achieved. Writing a server for new data is generally a simple task because it requires no human interface programming.

Document Naming

The fulcrum on which the document universe rests is the scheme for naming documents. A document name provides a method for the client to find the server, and for the server to find the document. In the W3 model, a name can also specify a part of the document to be selected by the displaying application.
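
The three roles of a name described above (locating the server, locating the document on that server, and selecting a part of it) can be illustrated with a minimal Python sketch. This section does not give the concrete name syntax, so the layout prefix://server/document#part is an assumption borrowed from later URL conventions, and the function name, the host and the ``#results'' part are hypothetical.

```python
from urllib.parse import urlsplit


def split_document_name(name: str) -> dict:
    """Split a document name into the pieces a client and server act on.

    Assumed layout (not specified in this section):
        prefix://server/document#part
    """
    pieces = urlsplit(name)
    return {
        "protocol": pieces.scheme,        # prefix naming the access protocol
        "server": pieces.netloc,          # lets the client find the server
        "document": pieces.path,          # lets the server find the document
        "part": pieces.fragment or None,  # portion picked out by the displaying application
    }


# Hypothetical name: the host and the "#results" part are invented for
# illustration; the path is the full-text index address quoted in the article.
example = split_document_name(
    "http://info.example.org/library/books/text/exact#results"
)
print(example)
```

Only the client needs the first two fields to reach a server; the server needs only the document field, and the part is left to the displaying application.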
Although a document name is normally hidden in the hypertext syntax transferred over the link, in practice it must sometimes be referred to by people, and passed through applications (such as mail) which are not yet hypertext-aware. It must therefore ideally be composed of printable characters, and manageably short.

Any lasting reference to a document must be a logical name rather than a physical address. That is, it should refer to a document's registration with some ``publishing'' organization rather than any physical location, so that its location may later be moved. The client is therefore prepared to follow several stages of translation by name servers before finding a final document server. Similarly, a document name should not contain any transitory information, such as the particular formats available for a document, or its length.

The W3 naming scheme fulfills these requirements, but is otherwise open to the addition of new protocols as technology evolves. For this purpose a prefix is used to identify the protocol (and therefore naming scheme) to be used. Clients which do not have that protocol in their repertoire refer to a gateway for translation.

Protocols

The W3 clients are built on a common core of networking code for information access. This core provides access using widely deployed internet protocols such as

  . File Transfer Protocol - FTP [9]
  . Network News Transfer Protocol - NNTP [10]
  . Access to mounted file systems.

A new search and retrieve protocol, known as HTTP, was found necessary. Faster than FTP for document retrieval, it also allows index search. HTTP is similar in implementation to the internet protocols above, and similar in functionality to the WAIS protocol. Some differences are discussed below.

Document Formats

The Dexter data model of hypertext [11] provided a conceptual model for hypertext systems, and the HyTime standard [12] formalizes hypertext at a high level.
The W3 project defines a concrete syntax in the SGML style for basic hypertext as used for menus, search results, and on-line hypertext documentation. Every W3 browsing application is able to parse this simple format (see Fig. 3). In the pilot phase of the project, this format was all that was required, but in the second phase, format negotiation between client and server will allow the exchange of information in any medium using any mutually acceptable representation.

WAIS and the Web

From the point of view of the W3 dream, the WAIS protocol represents a significant advance on the search and retrieve (SR) protocol standard Z39.50/ISO-10163, by being stateless, and introducing a persistent name. The document names used are local to the containing database, but these names may be appended to the database name and host address to form a universal W3 address. In this way, WAIS indexes and servers can be represented in the web. A gateway program, running at CERN and available for general use, provides this mapping.

The WAIS model uses separate ``source'' files to describe indexes. The WAIS-W3 gateway keeps caches of these files, using them to build descriptive ``cover pages'' for indexes.

[Figure: a server, working from the original data, sends one document to two clients. The window client shows the title ``PFD Error codes'', the heading ``Error Codes'', the sentence ``Codes returned by the PFD program include'' and a bulleted list: No paper in tray; No people in room; No data in file. The terminal client shows the same document with the heading in capitals (``ERROR CODES''), the reference marked as ``PFD[1]'', list items marked with ``o'', and the prompt ``1-9, Return for more, Help or Quit:''.]

Fig. 3: Sending hypertext data over the network in a high level (logical) representation allows optimum presentation according to the facilities of the reader's platform.
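
The idea of sending one logical representation and letting each client choose its own presentation can be sketched in a few lines of Python. This is an illustration under stated assumptions, not the W3 format or any real client: the LogicalDocument class, the two render functions and the link target ``pfd-manual'' are all hypothetical, and the sample text is taken from the figure.

```python
from dataclasses import dataclass


@dataclass
class LogicalDocument:
    """Hypothetical logical form of a simple hypertext page."""
    title: str
    heading: str
    intro: list   # runs of (text, link target or None)
    items: list   # plain list entries


def render_window(doc: LogicalDocument) -> str:
    """Window client: mixed-case heading, '.' bullets, and links left as
    plain text because a pointing device can select them directly."""
    intro = "".join(text for text, _target in doc.intro)
    lines = [doc.title, doc.heading, intro]
    lines += [f"  . {item}" for item in doc.items]
    return "\n".join(lines)


def render_terminal(doc: LogicalDocument):
    """Terminal client: capitalised heading, 'o' bullets, and each link
    tagged with a reference number the reader can type."""
    targets = []
    runs = []
    for text, target in doc.intro:
        if target is not None:
            targets.append(target)
            text = f"{text}[{len(targets)}]"
        runs.append(text)
    lines = [doc.title, doc.heading.upper(), "".join(runs)]
    lines += [f"  o {item}" for item in doc.items]
    return "\n".join(lines), targets


# Sample content from the figure; "pfd-manual" is an invented link target.
page = LogicalDocument(
    title="PFD Error codes",
    heading="Error Codes",
    intro=[("Codes returned by the ", None), ("PFD", "pfd-manual"),
           (" program include", None)],
    items=["No paper in tray", "No people in room", "No data in file"],
)
print(render_window(page))
terminal_text, link_targets = render_terminal(page)
print(terminal_text)
```

Because the server sends only the logical structure, neither renderer needs to know about the other, and a new platform only requires a new render function.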