This document is an analysis provided by the authors and carries no endorsement by the Consortium.
As we begin the XML Schema design [XMLSchema] and examine the RDF Schema design [RDFSchema], this document acknowledges the input we have received on how they fit together and how they should fit together, and invites further exploration.
Please send comments to email@example.com (archive).
This section represents the status of this document at the time this version was published. It will become outdated if and when a new version is published. The latest status is maintained at the W3C.
The World Wide Web is a universal information space. As a medium for human exchange, it is becoming mature, but we are just beginning to build a space where automated agents can contribute--just beginning to build the Semantic Web. The RDF Schema design [RDFSchema] and XML Schema design [XMLSchema] began independently, but we explore a common model in which they fit together as interlocking pieces of Semantic Web technology.
The architecture of the World Wide Web provides users with a simple hypertext interface to a variety of remote resources, from static documents purely for human consumption to interactive data services. HTML, the data format that facilitated the widespread deployment of the Web, started by adding URI-based linking to word-processor-style rich text to provide basic global hypertext functionality. The addition of forms to HTML provided a minimal but functional user interface to interactive data services.
While this HTML infrastructure has facilitated a revolution in global information technology, it suffers from the inevitable limitations of a "one size fits all" solution: rich document structures are lost as the content is squeezed into the primitive structures of HTML. Similarly, the cost of squeezing rich data structures into and out of HTML is paid in efficiency and integrity.
Now that the Web has reached critical mass as a medium for human communication, the next phase is to build the "Semantic Web". The Semantic Web is a Web that includes documents, or portions of documents, describing explicit relationships between things and containing semantic information intended for automated processing by our machines.
XML began as a project to address HTML's limitations on structured documents, by selecting a simple-to-implement yet extensible subset of SGML for use on the Web. It has emerged as the infrastructure for structured data interchange as well.
Meanwhile, in our effort to address the impact of the Web on society, the W3C membership came together to develop the Platform for Internet Content Selection (PICS), which provides users with the ability to select content based on labels provided by information providers or other sources. A critical component of PICS is the rating system description, a sort of schema; every PICS label points to a description, in the Web, of the fields in the label.
PICS was designed as a first step toward generalized labels that would allow any party in the Web to make claims about the qualities of resources: endorsements, terms and conditions for use, and so on. The Metadata Activity addresses the necessary work to complete the picture: structured labels, rules, integration with digital signatures. The PICS label design was generalized to a model of information as directed labeled graphs (DLGs). This was known as the RDF model, and a serialization was defined in XML syntax. PICS rating systems were incorporated as special cases in the design of RDF Schemas.
XML documents have a mechanism for self-description as well: the DTD. As the use of XML became more diverse and intense, the limitations of the aged DTD design became acute, especially in the areas of data typing, modularity, and reuse; soon the W3C began work in the XML Activity on a new generation of schemas for XML.
The initial expectation was that RDF would be simply layered on top of XML, with minimal interaction. But then the RDF design started to include a "namespace" facility for connecting XML element names to web addresses, which was closely related to a long-standing design discussion in XML (and SGML before that). Similarly, the RDF requirements for datatypes like integer and date were shared with many other XML based formats.
Over time, the interactions grew. The Document Object Model (DOM), which started as an effort to harmonize HTML scripting facilities in browsers, expanded in scope to include XML and became a foundational Application Programming Interface (API) in many Web software platforms and structured data repositories. Software built on these platforms sees RDF not as XML streams but as DOM objects. The emergence of the transformation component of the Extensible Style Language (XSL) as a useful component in its own right sheds new light on many of the syntactic design issues in RDF. The benefits of a syntax that is easy to manipulate with the DOM and XSL were not evident in the early stages of the RDF design.
At the Query Language Workshop [QL98], a number of applications were being designed using XML to encode DLG data, and it was clear that the syntax used by the RDF community was not as direct as that assumed by some others. To some, mapping XML elements directly to graph edges (rather than nodes) was a closer, more natural mapping. Under this stronger analogy, statements about RDF's arcs and XML's elements have implications for each other. This suggested a need to define precisely what that mapping is, thereby determining the architectural connection between future work on the Semantic Web and other applications of XML schemas.
First, we review some of the requirements for the Semantic Web. Second, we review the data models of several systems whose data is under strong pressure to be accessible directly in semantic form. For each, we delineate the mapping where it is evident and outline the areas where specification work is required.
Traditionally, both documents and databases have been strongly typed; that is, the producer and consumer have prior agreement on the structure of the information units. But this by itself is not sufficient for the long-term health of the Semantic Web. The Semantic Web must permit distributed communities to work independently to increase the Web of understanding, adding new information without insisting that the old be modified. This approach allows the communities to resolve ambiguities and clarify inconsistencies over time while taking maximum advantage of the wealth of backgrounds and abilities reachable through the Web. Therefore the Semantic Web must be based on a facility that can expand as human understanding expands. This facility must be able to capture information that links independent representations of overlapping areas of knowledge.
The XML 1.0 specification [XML98] takes a large step toward enabling the interchange of information even with a party that is able to recognize only a portion of a document. XML specifies the syntactic constraint called well-formedness. Well-formedness is a fundamental tool for allowing documents to include extended information while remaining processable by older "down-level" applications.
Old engineering habits suggest that for every document there must exist, in a single place, a complete enumeration of every markup feature present in that document. While this notion of XML validity is appropriate for many application contexts, it is too strong a constraint to place on the Semantic Web.
Mixing of vocabularies is a critical feature for the Web [BC98]. Of the evolutionary requirements on protocols [HTTPNG98], the first two of three also apply to data formats:
The requirement for extensibility is that extended applications do not require agreement across the whole Internet; rather, it suffices:
- that conforming peers supporting a particular protocol extension or feature can employ it with no prior agreement;
- that it is possible for one party having a capability for a new protocol to require that the other party either understand and abide by the new protocol or abort the operation; and
- that negotiation of capabilities is possible.
Incremental decentralized development of Semantic Web applications requires documents to be able to contain an ad hoc mixture of features from multiple application domains. The combinatoric issues make it impractical to predefine document types that encompass all the possible vocabulary sets. Instead, the XML Namespace facility [XMLNS99] allows this vocabulary mix-in. The Resource Description Framework Model and Syntax Recommendation [RDF99] leverages the XML Namespace facility throughout.
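As a concrete sketch of this vocabulary mix-in, the following Python fragment uses the standard library's `xml.etree.ElementTree` to parse a document that combines two namespaces. The namespace URIs and element names are hypothetical illustrations, not drawn from any actual vocabulary:

```python
# A sketch of vocabulary mix-in via XML Namespaces, parsed with Python's
# standard library. The namespace URIs here are invented for illustration.
import xml.etree.ElementTree as ET

doc = """
<report xmlns:bib="http://example.org/bibliography"
        xmlns:geo="http://example.org/geography">
  <bib:title>Survey of River Deltas</bib:title>
  <geo:coordinates>31.0N 29.9E</geo:coordinates>
</report>
"""

root = ET.fromstring(doc)
# ElementTree expands each prefixed name to {namespace-uri}localname,
# so an application can pick out only the vocabularies it understands
# and safely skip the rest.
for child in root:
    print(child.tag)
```

The point of the expansion step is that element names become globally unique, so two communities can coin the same local name without collision.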
When design work on RDF began, the only XML schema facility available was the DTD, which lacks support for decentralized evolution. Since then, the XML Schema work [XMLSchema] has proposed ways to compose strongly typed documents using XML Namespaces.
Any given XML document is finite, as is any table in a relational database. But the Web is unbounded. The design of the Web fundamentally differed from traditional hypertext systems in sacrificing link integrity for scalability. While any party can (and should!) maintain link consistency within some part of the Web, no tool that looks at the Web as a whole can assume consistency.
This emerged as a critical design distinction at the workshop [QL98]. Assuming a finite repository, a query processor can assume it has the total information; it can decide that, for example, there are no elements or records that satisfy a query. But while there are services that process queries of the form "find all links to X in the Web," they cannot decide that there are no links to X, but only that their necessarily incomplete knowledge of the Web includes no links to X.
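The distinction between these two query regimes can be sketched in a few lines of Python. The link data and the three-valued answer below are illustrative assumptions, not part of any specification:

```python
# A sketch of the bounded (closed-world) vs. unbounded (open-world)
# query distinction. The link set is an invented example.
known_links = {("pageA", "pageX"), ("pageB", "pageX")}

def links_to_bounded(target):
    # Bounded repository: the processor has total information,
    # so absence of a match is a definitive "no".
    return any(dst == target for _, dst in known_links)

def links_to_web(target):
    # Unbounded Web: a crawler's knowledge is necessarily incomplete,
    # so absence of a match means only "none known", never "none exist".
    if any(dst == target for _, dst in known_links):
        return True
    return None  # unknown, not False

print(links_to_bounded("pageY"))  # False: definitively no such record
print(links_to_web("pageY"))      # None: no such link is known
```

The design choice is simply that a Web-scale query service must distinguish "known false" from "not known", whereas a bounded repository may conflate them.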
The workshop showed that the problem of querying a bounded XML collection has been solved in the research and industrial settings, and is perhaps a commodity, amenable to standardization by now. But work presented there and elsewhere [Craw90] showed that while the unbounded query problem is perhaps at the research stage, the research is maturing, and we should take care not to prevent the transition into products and commodity technology. We should do what we can to see that data is recorded in a global context, because [Ber98a]:
[...] we expect this data, while limited and simple within an application, to be combined, later, with data from other applications into a Web. Applications which run over the whole web must be able to use a common framework for combining information from all these applications.
For example, HTML linking requires that all links from a resource be expressed in the content of that resource. But that is a limitation of HTML, not a limitation of Web architecture [Ber90]. The design and deployment of HTML did not prevent the design and deployment of out-of-line XML Links [XLink98], which allow links from a resource to be expressed anywhere in the web.
In the same way that HTML and XML Linking allow authors to lead readers from any place in the Web to any other place in the Web, data in the Semantic Web must be able to relate anything to anything.
A requirement, therefore, of a data model for the Semantic Web is that there should be no fundamental constraint relating what is said, what it is said about, and where it is said.
To encompass the universe of network-accessible information [Ber92], the Semantic Web must provide a way of exposing information from different systems. Because these systems may use a variety of internal data models, some generic, low-level concept of data common to all of them is required. For example, at the W3C Query Language Workshop [QL98] the directed, labeled graph (DLG) model was a common underlying model among many systems.
Another challenge of the Semantic Web, then, is to support the mapping of the existing and future systems onto the Web, preserving the universality of the Web and also the properties of the local systems. Optimizations - such as being able to enumerate and index all objects of a given type - that are important to the local operation of a system do not scale to the Web.
An example of this is the scoping of identifiers. In the object-oriented model, the variables in an object are declared when the object type is declared. Entity-relationship models similarly are an optimization of a model that presumes an enumerable set of properties. The Semantic Web should be able to represent these constrained models but, as with link consistency, we must relax absolute constraints to achieve scalability; when an object is exported to the Web, the "anything can say anything about anything" rule allows assertions to be made about the object expressing things which were not foreseen in the original definition of that object.
The mechanism adopted in RDF [RDF99] to manage the expression of constraints is to make all objects, all relationships, all types, and even all assertions "first class objects" on the Web. That is, they have their own URIs and are not constrained at the fundamental level to be combined in any particular way. By giving first class identifiers to types, relationships, and assertions, we allow the Semantic Web to make assertions about itself.
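The effect of giving every assertion its own identifier can be sketched as follows. All URIs and property names below are hypothetical, invented only to show the shape of the mechanism:

```python
# A sketch of "everything is a first-class object": each statement is
# stored under its own URI, so further statements can be made about the
# statement itself. All URIs and property names are invented.
statements = {}

def assert_triple(stmt_uri, subject, predicate, obj):
    statements[stmt_uri] = (subject, predicate, obj)

# An ordinary statement about a document:
assert_triple("http://example.org/stmt/1",
              "http://example.org/doc", "ex:creator", "Alice")

# Because statement 1 has a URI, another party can assert something
# about that statement itself -- here, who claims it:
assert_triple("http://example.org/stmt/2",
              "http://example.org/stmt/1", "ex:assertedBy",
              "http://example.org/registry")
```

Nothing in the store distinguishes statements about documents from statements about statements; that uniformity is what lets the Semantic Web describe itself.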
All statements found on the Web occur in some context. Applications need this context in order to determine the trustworthiness of the statements; that is, the machinery of the Semantic Web does not assert that all statements found on the Web are "true". Truth - or more pragmatically, trustworthiness - is evaluated by, and in the context of, each application that processes the information found on the Web. These are not new issues; the commerce and financial communities have evolved techniques to manage the exchange of information (goods) without requiring perfect trust [Reagle96, Geer98].
Just as the design of the Web sacrificed link integrity for scalability, the "all knowledge about my thing is contained here" notion cannot hold when databases and objects are exported to the Web. A great benefit to relaxing this assumption will be that, just as hypertext links connect different information systems, the Semantic Web will connect data from vastly different systems, allowing complex and far-reaching processing of a wide store of available data.
We will need to consider one optimization that RDF does not currently address and that is found in database systems. This occurs with operations on composite objects and is frequently represented as containment (in the sense of storage). Operations such as deletion and comparison on structured objects frequently make use of such containment relationships. While this local containment constraint also does not scale fully to the Web, it is an example of a relationship that should be expressible in the Semantic Web.
The relationship between the tree structure of an XML document and a graph structure was discussed at the workshop [QL98]. The participants, coming from diverse backgrounds, agreed that a shared data model was the cornerstone of the design of any query language, and, in fact, a precursor to meaningful design discussion. The relational calculus [Codd70] underlying the Structured Query Language (SQL) is a good example.
Section 2.1 of the XML specification defines the way the elements in a document form a tree, and Section 5 of the RDF specification defines the RDF data model as a directed, labeled graph (DLG). The XML syntax of RDF reflects the difference between these models. And it seemed at first glance that the design of a query language for XML must start with a tree model, whereas a query language for RDF must start with a DLG model.
A DLG model has been shown [GMW99] to be useful for serializing a semistructured database in XML. Other work [LayA98, DSB98, BLR99] also discusses modeling graphs in XML. This work does not propose that all XML documents should be modeled as DLGs -- the order of elements generally gets lost, for example -- but it does show that RDF is not the only XML application with a requirement to represent DLGs in XML.
These models use XML ID/IDREF to supplement parent/child relationships expressed by element containment. While ID/IDREF works within a single document, these designs depend on XML Linking [XLink98], i.e. well-known constructs for making connections across documents, much the way metadata applications require that the basic structure of RDF assertions be visible to all systems, even systems that don't understand the semantics of the assertions.
The content selection application added an operational requirement that assertions be visible by inspection, i.e. without reference to a schema. Expectations from HTML suggest that authors have the option to express simple XML Linking constructs [XLink98] directly in the document, and do not necessarily need to work with a DTD or schema.
From the workshop discussion, it was clear that a direct mapping between XML elements and graph arcs was a strong design. Future work is needed to address how to identify semantic arcs and cross-links within an arbitrary XML document.
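The direct element-to-arc mapping can be sketched with Python's standard `xml.etree.ElementTree`: each child element is read as a labeled edge from the node named by the parent to the node given by the child's content. The document vocabulary below is invented for illustration:

```python
# A sketch of the direct mapping from XML elements to graph arcs: each
# child element becomes a labeled edge (subject, predicate, object).
# The element names and values are invented.
import xml.etree.ElementTree as ET

doc = """
<person id="alice">
  <name>Alice</name>
  <worksFor>Example Corp</worksFor>
</person>
"""

root = ET.fromstring(doc)
node = root.get("id")
# Each child element reads as an arc out of the parent node.
edges = [(node, child.tag, child.text) for child in root]
print(edges)
# e.g. [('alice', 'name', 'Alice'), ('alice', 'worksFor', 'Example Corp')]
```

The open question noted above is precisely how a processor would decide, for an arbitrary XML document, which elements carry this arc interpretation and which are purely structural.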
The Semantic Web model is very closely connected with the relational database model [Ber98b]. A collection of RDF statements about a node corresponds to a row in a table. A database join is a splicing of graphs. Relational databases are optimized to handle large numbers of instances of statements using the same property, and there might be corresponding optimizations in an XML serialization for large volumes of similar data. But we should expect that the basic structures that support serializing relational databases can be shared with the RDF DLG data model.
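The row-to-statements correspondence and the join-as-splice view can both be sketched in a few lines. The table contents and property names below are invented for illustration:

```python
# A sketch of the correspondence noted above: a relational row maps to
# a set of statements about one node, and a join splices subgraphs at a
# shared value. Table contents and property names are invented.
employees = [{"id": "e1", "name": "Alice", "dept": "d7"}]
departments = [{"id": "d7", "label": "Research"}]

def row_to_statements(row, key="id"):
    # Each non-key cell becomes one statement about the row's node.
    node = row[key]
    return [(node, prop, value) for prop, value in row.items() if prop != key]

graph = []
for row in employees + departments:
    graph.extend(row_to_statements(row))

# A join on dept = department id is a splice of the two subgraphs
# at the shared node "d7":
spliced = [
    (subj, "deptLabel", label)
    for (subj, p1, dept) in graph if p1 == "dept"
    for (s2, p2, label) in graph if s2 == dept and p2 == "label"
]
print(spliced)  # [('e1', 'deptLabel', 'Research')]
```

Note that the graph form makes no use of the fact that every employee row has the same columns; that regularity is exactly the optimization a serialization for bulk tabular data might want to recover.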
In a table there are many records with the same set of properties. An individual cell (which corresponds by analogy to an RDF property) is strongly typed. Combination rules in RDBs tend to be loosely enforced; a query can join tables on any columns that match in datatype, without any check on the semantics. Joins across arbitrary fields are another case where constraint enforcement in the Semantic Web cannot be absolute; the Semantic Web is not designed merely as a new data model - it is designed to support the linking of data from many different models.
Much of the object-oriented world has to do with the modeling of functions on objects. For the Semantic Web, the data model for the serializations of objects when they are stored or transmitted is also of interest [Chang98].
The serialization of an object can be considered to be a series of data fields expressing different properties of the object. In most O-O systems the type of an object denotes constraints on the methods (functions) supported by the object. The serialization of the object data is often considered to be an internal matter, and is hidden from the caller of methods. Under this design principle, XML technology is only of interest for the serialization of the remote method calls (which is outside the scope of this paper) and cannot be used to provide interoperability between implementations.
However, when interoperability is a goal for object serializations, then it becomes reasonable to put objects on the Web. In this case, designing self-describing serialization formats using XML makes the objects more robust across time [KR97]. XML vocabularies for representing inheritance are needed when object systems are serialized on the Semantic Web. These mechanisms are desired by RDF and can be shared across other XML applications.
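A self-describing serialization of this kind can be sketched as follows: each field of the object is written as a named XML element rather than as an opaque byte layout, so a later reader can recover the fields it recognizes and skip the rest. The class and field names are hypothetical:

```python
# A sketch of a self-describing object serialization in XML. Field names
# travel with the data, making the format robust as the class evolves.
# The Account class and its fields are invented for illustration.
import xml.etree.ElementTree as ET

class Account:
    def __init__(self, owner, balance):
        self.owner = owner
        self.balance = balance

def serialize(obj):
    # One element per instance field, named after the field itself.
    root = ET.Element(type(obj).__name__)
    for field, value in vars(obj).items():
        child = ET.SubElement(root, field)
        child.text = str(value)
    return ET.tostring(root, encoding="unicode")

xml_form = serialize(Account("Alice", 42))
print(xml_form)
# <Account><owner>Alice</owner><balance>42</balance></Account>
```

A down-level reader that knows only `owner` can still extract it, which is the robustness-across-time property the text describes; representing class inheritance in such a format is the part the text identifies as future work.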
A large number of applications store and communicate information that takes the form of logical expressions. For example, configuration files defining access control, specifications of capability profiles ([CONNEG98], [CCPP98]) show a need for not only structure but logical combination. Future work must be chartered to provide a common vocabulary for this logic language.
Knowledge representation systems (e.g., the Knowledge Interchange Format (KIF) and Cyc [Cyc95]) include not only logical-level information but also expressive power that includes quantification and inference. The basic DLG model of RDF provides a very natural base for the expression and interchange of such data, but future work is needed to define common terms for these extensions to the power of the language. There may be a use for some XML shorthand to make such expressions sufficiently concise. While these systems also make assumptions about having full access to information that fail to scale to the Web, a comparison of the models suggests two areas of impact when a typical KR system is represented on the Semantic Web. Many KR systems are built to work with primarily one node representing any given concept, while in general the Semantic Web may have many independently created nodes that in fact represent the same thing. Also, KR systems in practice store certain kinds of hints in order to help algorithms perform particular types of queries on the data. The Web in general will guarantee neither one node per concept nor the presence of query hints, and the Semantic Web must therefore be able to function without these constraints.
We have shown the importance of a common architecture for tree-structured documents and directed labeled graphs. We have also shed new light on some of the design decisions in the XML syntax used by RDF.
We have discussed the way contemporary data models (relational, object, knowledge representation) relate to a unified Semantic Web Architecture. We look forward to elaborating these connections in future work.
1 The presumption that a complete specification of the objects within a document must exist in one place appears in many places: SGML, Ada, SQL, and many object-oriented programming systems. At the same time, the import of many different interfaces into a program module (in Ada, C, etc.) also demonstrates the concept of creating a new module using a mixture of independently created vocabularies.