Web Architecture

Dan Connolly

Park University Department of Information and Computer Science
26 Oct 2005

with thanks to Ian Jacobs for his xml.gov presentation

Postscript: The Park ACM club is hosting an archived video stream of the presentation.

$Revision: 1.4 $ of $Date: 2005/10/28 17:30:21 $

Overview

W3C's Technical Architecture Group (TAG)
Overview of Architecture of the World Wide Web
Architecture Review of US Patent Office Web Site

W3C's Technical Architecture Group

W3C's Technical Architecture Group was chartered in 2001:

"to document and build consensus around principles of Web architecture and to interpret and clarify these principles when necessary."

Roles: write, coordinate, mediate

TAG Participants

TAG participants elected and appointed at our September 2002 meeting in Vancouver:

TAG, posing in front of same motorbike as in Sep 2002, in Vancouver

Norman Walsh, Sun Microsystems. Docbook guy
Paul Cotton, Microsoft. fulltext industry
Chris Lilley, W3C. SVG lead
David Orchard, BEA. Web Services.
Roy Fielding. HTTP spec editor. REST thesis
Tim Berners-Lee, W3C Director
Stuart Williams, HP
Dan Connolly, W3C
Tim Bray, Sun. XML 1.0 co-editor
Ian Jacobs, W3C. Tech writer

TAG Participants, all dressed up

Henry Thompson, W3C and U. Edinburgh. XML Schema editor
Norman Walsh
David Orchard
Vincent Quint, INRIA
Tim Berners-Lee
Dan Connolly
Roy Fielding
Noah Mendelsohn, IBM

Why an Architecture Document?

To distill ten years of experience with the hypertext Web
To help developers of Web technologies avoid pitfalls
To provide guidance to users, site managers, software designers on promoting a robust Web
To build consensus around concepts and terms
To learn humility...

Community Brings Issues to TAG

23 Jan 2002: Potential architecture issue brought to TAG's attention on public list www-tag@w3.org (archive):
- HTTP GET deprecated in XForms Last Call Working Draft (background)
29 Jan 2002: TAG accepts issue whenToUseGet-7.

Teleconference and mailing list discussions ensue.

TAG Explores Problem Space

What makes HTTP GET important?

HTTP GET designed so that URI alone encodes interaction; allows linking
Safe/unsafe distinction in protocol enables user agent support
Requests with no side-effects enable caching of results:
- At global networking scale, require fast data exchange
- Use caching to improve performance (see HTTP/1.1 study)
- Design protocols that support caching
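
The caching point above can be illustrated with a minimal sketch. It assumes a hypothetical `fetch` callable standing in for a real network request and a plain in-memory dictionary as the cache; neither comes from the TAG finding. Only GET results are reused, since a safe request can be repeated without side effects, and the URI alone serves as the cache key.

```python
from typing import Callable, Dict

# HTTP/1.1 treats GET and HEAD as safe methods.
SAFE_METHODS = {"GET", "HEAD"}


class CachingClient:
    """Toy client that reuses responses to GET requests only."""

    def __init__(self, fetch: Callable[[str, str], str]) -> None:
        self._fetch = fetch                  # hypothetical origin-server call
        self._cache: Dict[str, str] = {}

    def request(self, method: str, uri: str) -> str:
        method = method.upper()
        if method == "GET" and uri in self._cache:
            return self._cache[uri]          # answered without contacting the origin
        body = self._fetch(method, uri)
        if method == "GET":
            self._cache[uri] = body          # the URI alone is the cache key
        elif method not in SAFE_METHODS:
            self._cache.pop(uri, None)       # an unsafe request may have changed state
        return body
```

In this sketch a second GET for the same URI never reaches the origin, while every POST does. A real cache would also honor expiry and validation headers, which are left out here.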

8 Apr 2002: Dan Connolly receives assignment to write strawman proposal. This evolves into a draft finding.

TAG Coordinates to Build Consensus

4 May 2002: Making connection to Web Services, David Orchard proposes SOAP HTTP GET Binding Version 0.1
3 Jun 2002: David Orchard receives assignment to ask the XMLP Working Group to include the GET method in the SOAP 1.2 (then a Working Draft) HTTP binding.
10 Jun 2002: TAG approves finding URIs, Addressability, and the use of HTTP GET.
10 Nov 2002: TAG announces agreement regarding use of GET.

Groups Document Consensus

24 Jun 2003: SOAP Version 1.2 Part 2: Adjuncts becomes a W3C Recommendation, with GET as part of HTTP binding (see section 4.1.2)
22 Sep 2003: TAG accepts revised finding URIs, Addressability, and the use of HTTP GET and POST
14 Oct 2003: XForms 1.0 becomes a W3C Recommendation, with support for HTTP GET.
9 Dec 2003: Architecture Document to Last Call with discussion of safe interactions in section 3.5

Ongoing: Marking safe operations in WSDL

Negotiation tactics

March 2004: TAG struggling to get webarch document done
Dan was more optimistic than Paul Cotton:
"Let the record show that PC and DC have bet a dinner that at this rate we won't (PC) or we will (DC) get to Recommendation by 2005." -- W3C TAG meeting notes, March 2004

It's a REC! Party!

15 Dec 2004: With a week or two to spare, W3C issued webarch as a Recommendation (with press release and all).
Paul graciously conceded the bet
Connolly family dines on Paul's nickel

What type of information is in the Architecture Document, Findings?

Properties we desire of the Web, and
Design choices to achieve them.

Example related to previous issue:

Principles, Constraints, Good Practice Notes

Agents do not incur obligations by retrieving a representation ("GET is safe").

Rationale

Benefits of URI addressability: linking, bookmarking, caching
Benefits of distinction in protocols of safe/unsafe: user agent alerts, caching

Stories and examples

Examples of safe (lookup) and unsafe (credit card purchase) interactions
Considerations for sensitive data
Practical considerations, ephemeral limitations

Architecture Tripod

Identification
Interaction
Representation

Identification I: Why URIs?

Value of common syntax for global identifiers:

"Great multiplicative power of reuse derives from the fact that all languages use URIs as identifiers: This allows things written in one language to refer to things defined in another language. The use of URIs allows a language to leverage the many forms of persistence, identity, and various forms of equivalence." -- URIs, Addressability, and the use of HTTP GET and POST

Power in the network effect.
Extension through URI schemes

Identification II: URI Usage

Comparison. Key to Semantic Web, caches
Dereference. Discussed below in Interactions

Due to global scope, URIs also used outside of Web protocols (e.g., as database keys).
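
URI comparison can be sketched roughly as follows. This is an illustration under assumptions, not the procedure from the Architecture Document: it applies only a small subset of the normalization rules in RFC 3986 (case-insensitive scheme and host, default port, empty path), and the function names and the example.org host are hypothetical.

```python
from urllib.parse import urlsplit, urlunsplit

DEFAULT_PORTS = {"http": 80, "https": 443}


def normalize(uri: str) -> str:
    """Lowercase scheme and host and drop a default port; leave the rest alone.

    Userinfo and IPv6 literals are ignored for brevity.
    """
    parts = urlsplit(uri)
    scheme = parts.scheme.lower()
    host = (parts.hostname or "").lower()
    port = parts.port
    if port is not None and port != DEFAULT_PORTS.get(scheme):
        host = f"{host}:{port}"
    # An empty path after an authority is equivalent to "/".
    path = parts.path or ("/" if host else "")
    return urlunsplit((scheme, host, path, parts.query, parts.fragment))


def same_uri(a: str, b: str) -> bool:
    """Compare two URIs after the limited normalization above."""
    return normalize(a) == normalize(b)
```

With these rules, `HTTP://WWW.Example.ORG:80/patents/p6678889` and `http://www.example.org/patents/p6678889` compare equal, while two URIs whose paths differ only in case do not, because paths are case-sensitive.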

Interaction I: Dereferencing a URI

Communication between agents involves URIs, messages, data
Dereference a URI, get back a representation of resource state
Representation consists of representation data and metadata (e.g., Internet Media Type).
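
The three points above can be modeled in a few lines. This sketch replaces the network with an in-memory table; the `Representation` class, the `RESOURCES` table, and the weather URI are assumptions made for illustration only.

```python
from dataclasses import dataclass
from typing import Dict


@dataclass(frozen=True)
class Representation:
    data: bytes        # representation data
    media_type: str    # metadata: Internet Media Type


# Hypothetical resource state held by a server, keyed by URI.
RESOURCES: Dict[str, Representation] = {
    "http://weather.example.com/oaxaca": Representation(
        data=b"<html><body>Oaxaca: sunny</body></html>",
        media_type="text/html",
    ),
}


def dereference(uri: str) -> Representation:
    """Return a representation of the state of the resource identified by uri."""
    try:
        return RESOURCES[uri]
    except KeyError:
        raise LookupError(f"no representation available for {uri}") from None
```

The caller gets back both the bytes and the media type, and needs the media type to know how to interpret the bytes.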

Interaction II: Dereferencing a URI (illustration)

A resource (Oaxaca Weather Info) is identified by a particular URI and is represented by pseudo-HTML content

Interaction III: Managing Representations

Internet Media Type
Representations evolve over time as resource, technology evolves
Consistency in representation increases trust in URI
Content negotiation facilitates evolution
Fragment identifier semantics and content negotiation
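
As a rough illustration of how content negotiation lets a server offer a newer format next to an older one at the same URI, the sketch below picks a media type from an HTTP Accept header. It is a simplified assumption of how a server might do this: it handles q-values and `*/*` but not partial wildcards such as `text/*`, and the function names are hypothetical.

```python
from typing import Dict, List, Optional, Tuple


def parse_accept(header: str) -> List[Tuple[str, float]]:
    """Parse an Accept header into (media type, q) pairs, highest q first."""
    prefs = []
    for item in header.split(","):
        fields = [f.strip() for f in item.split(";")]
        if not fields[0]:
            continue
        q = 1.0                              # q defaults to 1 when absent
        for param in fields[1:]:
            name, _, value = param.partition("=")
            if name.strip().lower() == "q":
                try:
                    q = float(value)
                except ValueError:
                    q = 0.0                  # treat a malformed q as "not acceptable"
        prefs.append((fields[0].lower(), q))
    # sorted() is stable, so ties keep the order the client listed them in.
    return sorted(prefs, key=lambda p: p[1], reverse=True)


def negotiate(header: str, available: Dict[str, bytes]) -> Optional[str]:
    """Pick the available media type the client prefers, or None."""
    for media_type, q in parse_accept(header):
        if q <= 0:
            continue
        if media_type in available:
            return media_type
        if media_type == "*/*" and available:
            return next(iter(available))     # server's first-listed variant
    return None
```

A client sending `text/html;q=0.8, application/xhtml+xml` would receive the XHTML variant if the server has one, and older clients that list only `text/html` keep working against the same URI.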

Interaction IV: Issues Raised by Interaction

Safe, unsafe interactions
Sensitive data
Access control independent of identification (cf. Deep Linking Finding)
Metadata from representation provider authoritative. Other behavior ok, but requires transparency for user.

Representation: Data formats

Data formats used to organize representation data
Data format considerations: binary v. text, extensibility, versioning, composition, modularization
Hypertext
XML-based data formats: links, namespaces, qnames, media types

Architecture Review of US Patent Office Web Site

Review of the United States Patent and Trademark Office Web site revealed:

HTTP GET used for database lookup (good)
HTTP GET used for unsafe interactions (not good)
URI for patent is actually URI for search (not optimal)
POST used to protect sensitive login data (design choice)

HTTP GET used for database lookup

HTML "GET" form used for database lookup:

   <form action="/netacgi/nph-Parser" method="GET">

Use GET for queries, searches, database lookups.
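
A GET form turns its fields into a query string, so the resulting URI alone encodes the lookup and can be linked or bookmarked. A minimal sketch, assuming a hypothetical search endpoint on example.org (the field name `s1` is the one visible in the search URIs on these slides):

```python
from urllib.parse import parse_qs, urlencode, urlsplit


def search_uri(action: str, **fields: str) -> str:
    """Build the URI a browser would request when submitting a GET form."""
    return f"{action}?{urlencode(fields)}"


def search_fields(uri: str) -> dict:
    """Recover the form fields from such a URI."""
    return parse_qs(urlsplit(uri).query)


# Hypothetical endpoint; the real form action is relative to the USPTO site.
uri = search_uri("http://search.example.org/lookup", s1="hypertext")
```

Here `uri` is `http://search.example.org/lookup?s1=hypertext`, and `search_fields(uri)` returns the fields again, which is what makes the query repeatable from the link alone.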

HTTP GET used for unsafe interactions (not good)

Modifying the state of the shopping cart is unsafe, since it produces a side effect:

Search with keyword "hypertext"
Select Patent 6,678,889: "Systems, methods and computer program ...."
Add to cart, view "Quantity" (1)
Hit back button, add to cart, view "Quantity" (2)

"Add to Cart" an HTML link:

   <a href=".../AddToShoppingCart?docNumber=6,678,889...">...

I cannot link to the shopping cart from this slide; a search engine or pre-fetching agent might increment the counter (cf. SVG 1.2, section 11.8).

In HTML, use "POST" form for unsafe operations.
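
The difference can be sketched with a toy cart in which only POST may change state. This is an assumed design for illustration, not the USPTO implementation; the `Cart` class and its `handle` method are hypothetical.

```python
from typing import Dict


class Cart:
    """Toy shopping cart resource."""

    def __init__(self) -> None:
        self.quantities: Dict[str, int] = {}

    def handle(self, method: str, doc_number: str) -> int:
        """Return the quantity for doc_number; only POST may change it."""
        method = method.upper()
        if method == "POST":
            # Unsafe interaction: the user is accountable for this change.
            self.quantities[doc_number] = self.quantities.get(doc_number, 0) + 1
        elif method != "GET":
            raise ValueError(f"unsupported method: {method}")
        # GET falls through to here and only reads state.
        return self.quantities.get(doc_number, 0)
```

With this split, a crawler or pre-fetching agent that follows links (GET) can view the cart any number of times without changing the quantity, and a user agent can warn before re-sending a POST when the back button is used.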

URI for patent is actually URI for search (not optimal)

What might a URI for a patent look like?

   http://www.uspto.gov/patents/p6678889

Note that this is globally unambiguous; better than "6678889"
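
Minting such a URI from a patent number is mechanical. The sketch below assumes the illustrative URI space shown above, which is a suggestion on this slide and not an actual USPTO service; the `patent_uri` helper is hypothetical.

```python
import re

# Illustrative URI space from the slide, not a real service.
PATENT_BASE = "http://www.uspto.gov/patents/"


def patent_uri(number: str) -> str:
    """Map a patent number such as "6,678,889" to one stable URI."""
    digits = re.sub(r"[,\s]", "", number)    # drop thousands separators and spaces
    if not re.fullmatch(r"[0-9]+", digits):
        raise ValueError(f"not a patent number: {number!r}")
    return f"{PATENT_BASE}p{digits}"
```

Both "6,678,889" and "6678889" map to the same URI, so every reference to the patent uses one identifier regardless of how the number was typed.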

Search produced this URI for search on "hypertext":

   http://patft.uspto.gov/...s1=hypertext&OS=hypertext...

Search produced this URI for search by patent number 6,678,889:

   http://patft.uspto.gov/...s1=6,678,889.WKU....

Why are these URIs different if this is the same patent?

Cost of Arbitrarily Different URIs

At first, I thought these were arbitrarily different URIs for the same resource. If so, machines cannot compare them reliably, so:

Interferes with caching
Semantic Web does not work
Site management more complex

Identify Results of Search, not Search

Resource only indirectly identified as query result.

My expectation is not to bookmark search, but result of search
Search might return different results another day; I want to refer to the patent.

POST used to protect sensitive login data (design choice)

GET allows URIs (bookmarking, back button), but we don't want sensitive data in URI.
Choices include:
- GET with HTTP Basic Authentication over SSL: Sensitive data in HTTP headers, so allows bookmarking. User agent manages passwords.
- POST over SSL
However, SSL has a cost as well.
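
To show where the sensitive data travels in the first choice, here is a minimal sketch of the header that HTTP Basic Authentication adds to a GET request. The helper name is hypothetical; the credentials are the well-known example from the HTTP authentication specification.

```python
import base64
from typing import Dict


def basic_auth_headers(user: str, password: str) -> Dict[str, str]:
    """Build the Authorization header for HTTP Basic Authentication."""
    token = base64.b64encode(f"{user}:{password}".encode("utf-8")).decode("ascii")
    # Base64 is an encoding, not encryption, hence the need for SSL.
    return {"Authorization": f"Basic {token}"}


headers = basic_auth_headers("Aladdin", "open sesame")
```

The credentials end up in `headers`, not in the URI, so the URI stays safe to bookmark or link.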

Think about these architecture issues, tradeoffs during design! See URIs, Addressability, and the use of HTTP GET and POST

Future work

Issues the TAG has not resolved for First Edition
Other systems that make use of URI space: Web Services, Semantic Web
Internationalized URIs (IRIEverywhere-27)
XML canonicalization
Binary XML (binaryXML-30)
Mixing XML Namespaces (mixedUIXMLNamespace)