The Web borrows familiar concepts from physical media (e.g., the notion of a "page") and overlays them on top of a networked infrastructure (the Internet) and a digital presentation medium (browser software). This is a convenient abstraction, but when social or legal concepts and frameworks relating to documents, publishing and speech are applied to the Web, the analogies can be misleading. For example, publishing a page on the Web is fundamentally different from printing and distributing a page in a magazine or book.
Communication is often subject to governance: legislation, legal opinion, regulation, convention and contract; these are ways in which society looks to enforce norms, for example, around copyright, censorship, and privacy. But there is often a mismatch between governance intended to apply to the Web (usually based on the analogy with physical media) and the technology and architecture used to create it.
This document is intended to inform future social and legal discussions about the Web by describing its architecture, that is, the ways in which the Web's technical facilities operate to store, publish and retrieve information, and by providing definitions for terminology as used within the Web's technical community. Specifically, this document has the following goals:
This section describes the status of this document at the time of its publication. Other documents may supersede this document. A list of current W3C publications and the latest revision of this technical report can be found in the W3C technical reports index at http://www.w3.org/TR/.
The publication of this Note indicates that work has ended on this specification for the time being. Work on this document may restart in the future in the TAG or in another Working Group when we have consensus on how to proceed.
This document was published by the Technical Architecture Group as a Working Group Note. If you wish to make comments regarding this document, please send them to email@example.com (subscribe, archives). All comments are welcome.
Publication as a Working Group Note does not imply endorsement by the W3C Membership. This is a draft document and may be updated, replaced or obsoleted by other documents at any time. It is inappropriate to cite this document as other than work in progress.
This document was produced by a group operating under the 5 February 2004 W3C Patent Policy. W3C maintains a public list of any patent disclosures made in connection with the deliverables of the group; that page also includes instructions for disclosing a patent. An individual who has actual knowledge of a patent which the individual believes contains Essential Claim(s) must disclose the information in accordance with section 6 of the W3C Patent Policy.
In recent months there have been several legal actions against individuals and organizations for making available material that is illegal or seditious, or that may be copyrighted. We argue in section 4.3 Including that the manner in which the material is made available is important and should be taken into consideration. Similarly, as we explain in section 4.2 Caching and Relaying, there are several kinds of intermediaries that store and/or integrate material from other sources. The role of these intermediaries is quite different from that of websites that provide original content, and the laws should (and in some cases do) distinguish between them.
As the Web has worked its way into the fabric of our lives and access to the Internet is likened to free speech as a fundamental right, it is also increasingly subject to governance. By governance we mean the general idea of societal controls, whether by legislative, regulatory, court order, contractual, or other means. Unfortunately, a number of problems arise when dealing with governance of the Internet.
The goal of this document is to clarify technology. If it informs policy-makers and thus helps make better policies it will have succeeded in its goals. A secondary goal is to point out that many restrictions on the use of Web material that are commonly written into Terms and Conditions are better implemented by technology. For example, if a Website does not want other sites linking to specific pages it is more effective not to provide URIs for those pages rather than to include this restriction in the Terms and Conditions.
Readers who are interested in legal opinion and case citations related to linking and the role of intermediaries, the most common causes of lawsuits, as well as other related matters may want to consult ChillingEffects.org and LinksandLaw.com.
The act of viewing a web page is a complex interaction between a user's browser and any number of web servers. Unlike reading a book, viewing a web page involves copying the data held on the servers onto the user's computer, if only temporarily. Logic encoded within the page may cause more copying to take place, perhaps from other servers. The combined material may be displayed or otherwise used within the original page often without the user's explicit knowledge or consent. For an end user, it is usually impossible to tell whether a given image or video displayed within a page originates from the server the page comes from or from some other location.
In addition to browsers and webservers, many other kinds of servers live on the Web. Proxy servers and services that combine and repackage data from other sources may also retain copies of this material. These intermediary services may transform, translate or rewrite some of the material that passes through them, to enhance the user's experience of the web page or for their own purposes. Still other services on the web, such as search engines and archives, make copies of content as a matter of course. This is, in part, to facilitate the indexing necessary for their operation, and in part to provide value to their users and to the original authors of the web page. These intermediaries, we argue, should be treated differently by the law based on how much control they have over the underlying material and how they process it.
Examples of the kind of legal questions that have arisen related to material that originates on other sites include:
The Wikipedia page on Copyright aspects of hyperlinking and framing discusses these and several other examples.
Many content publishers seek to control the use of their content on the Web. In some cases, they employ means that do not take into account the Web's true architecture, and do not use the technical mechanisms available to them. A few illustrative examples are provided below.
Licenses that describe how material may be copied and altered by others tend not to distinguish between a proxy compressing a web page to make it load faster and someone editing and republishing the page on their own website. To illustrate, the Creative Commons Attribution-NoDerivs license defines the following terms (emphasis added):
- "Adaptation" means a work based upon the Work, or upon the Work and other pre-existing works, such as a translation, adaptation, derivative work, arrangement of music or other alterations of a literary or artistic work, or phonogram or performance and includes cinematographic adaptations or any other form in which the Work may be recast, transformed, or adapted including in any form recognizably derived from the original, except that a work that constitutes a Collection will not be considered an Adaptation for the purpose of this License. For the avoidance of doubt, where the Work is a musical work, performance or phonogram, the synchronization of the Work in timed-relation with a moving image ("synching") will be considered an Adaptation for the purpose of this License.
- "Distribute" means to make available to the public the original and copies of the Work through sale or other transfer of ownership.
- "Reproduce" means to make copies of the Work by any means including without limitation by sound or visual recordings and the right of fixation and reproducing fixations of the Work, including storage of a protected performance or phonogram in digital form or other electronic medium.
Consider, now, the following questions:
Terms and Conditions statements on websites also list acceptable and unacceptable behavior on a site, with any browsing on the site implicitly indicating acceptance of the terms. These generally do not take into account the behavior of proxies. For instance, one standard set of Terms and Conditions includes:
You must not:
(a) republish material from this website (including republication on another website);
(b) sell, rent or sub-license material from the website;
(c) show any material from the website in public;
(d) reproduce, duplicate, copy or otherwise exploit material on our website for a commercial purpose;
(e) edit or otherwise modify any material on the website; or
(f) redistribute material from this website except for content specifically and expressly made available for redistribution (such as our newsletter)
It is not possible to view material on the web without it being downloaded onto your computer, so forbidding downloading except for caching purposes essentially means that people cannot view the page. In addition, many proxies automatically transform the documents that pass through them, for example to compress them so that they take up less bandwidth for mobile consumption or to introduce advertisements into pages that are accessed free of charge. Should this be prohibited?
Limits placed on the use of a website often include limitations on automatic indexing of the website, without exceptions for search engines that make the website discoverable or archives that ensure its longevity. For example, the set of terms and conditions quoted above goes on to say:
You must not conduct any systematic or automated data collection activities (including without limitation scraping, data mining, data extraction and data harvesting) on or in relation to our website without our express written consent.
Search engines rely on systematic data collection from websites in order to provide users with accurate search results, and archives do so in order to retain websites for posterity. So these terms and conditions, if adhered to strictly, put the website out of the reach of search engines and hence make it undiscoverable; surely this is not in the best interest of the website. Another problem is that automated agents (webcrawlers, spiders, robots) that gather information from the web are unable to read these terms and conditions; the only things they understand are the technical signals that a website provides about what is permitted. See more on this below.
As another example, the terms and conditions for gsig.com include:
Use of Materials: Upon your agreement to the Terms, GSI grants you the right to view the site and to download materials from this site for your personal, non-commercial use. You are not authorized to use the materials for any other purpose. If you do download or otherwise reproduce the materials from this Site, you must reproduce all of GSI's proprietary markings, such as copyright and trademark notices, in the same form and manner as the original.
You may not use any "deep-link", "page-scrape", "robot", "spider" or any other automatic device, program, algorithm or methodology or any similar or equivalent manual process to access, acquire, copy or monitor any portion of the Site or any of its content, or in any way reproduce or circumvent the navigational structure or presentation of the Site.
It would be simpler and more effective for the site to use technical means for controlling what webcrawlers, spiders or robots access on the site, namely a robots.txt file, which is a set of machine-processable instructions telling automated web agents what they can and cannot do. They could also exempt automated web agents from the Terms and Conditions as discussed below.
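To illustrate how an automated agent reads such technical signals, the following is a minimal sketch using Python's standard urllib.robotparser module. The robots.txt content, the crawler name "ExampleBot" and the example.com URIs are hypothetical and chosen only for illustration.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical robots.txt content: the paths and the crawler name
# "ExampleBot" are assumptions made for this illustration only.
ROBOTS_TXT = """\
User-agent: ExampleBot
Disallow: /

User-agent: *
Disallow: /private/
"""


def build_parser(robots_txt: str) -> RobotFileParser:
    """Parse robots.txt text into a parser that can answer access queries."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return parser


parser = build_parser(ROBOTS_TXT)

# A well-behaved crawler asks before fetching each URI.
blocked_everywhere = parser.can_fetch("ExampleBot", "http://www.example.com/index.html")
blocked_private = parser.can_fetch("OtherBot", "http://www.example.com/private/a.html")
allowed_public = parser.can_fetch("OtherBot", "http://www.example.com/index.html")
```

Note that these instructions are advisory: they only affect agents that choose to consult and obey them.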
Many sites have a linking policy that limits what links can be made to the site from other sites. These conditions are not backed up through relatively simple technical mechanisms that would prevent such links from being made. For example, the website at quotec.co.uk has a linking policy that includes:
Links pointing to this website should not be misleading.
Appropriate link text should be always be used.
From time to time we may update the URL structure of our website, and unless we agree in writing otherwise, all links should point to http://www.quotec.co.uk.
You must not use our logo to link to this website (or otherwise) without our express written permission.
You must not link to this website using any inline linking technique.
You must not frame the content of this website or use any similar technology in relation to the content of this website.
Technically it is straightforward to prevent linking to pages that the website does not want others to link to: you simply do not give these pages URLs or make the URLs undiscoverable. This is likely to be more effective than asking people to read and adhere to the Terms and Conditions. Several techniques for controlling linking and inclusion are discussed in section 6. Techniques.
Legislation that governs the possession and distribution of unlawful material (such as child pornography, information that is under copyright or material that is legally suppressed through a gag order) needs to exempt certain types of services, such as caching or hosting, as it would be impractical for such services to police all the material that passes through their servers. This does not, however, always happen, and intermediaries are often held accountable for material that did not originate with them and over which they have no control. An example of good legislation that does exempt intermediaries is the UK's Coroners and Justice Act 2009, Schedule 13; from the Explanatory Notes (emphasis added):
Paragraphs 3 to 5 of [Schedule 13] provide exemptions for internet service providers from the offence of possession of prohibited images of children in limited circumstances, such as where they are acting as mere conduits for such material or are storing it as caches or hosts.
This section summarises the terminology that is used within this document. More detail about each of the terms is given in the rest of the document.
The concept of publishing on the web has evolved as the web's ecosystem has enlarged and diversified, and as the capabilities of browsers and the web standards that they implement have developed. There is no single definition of what publishing on the web means. Instead, there are a number of activities that could be viewed as publication or distribution in a legal sense. This section describes some of these activities and how they work.
The basic form of publication on the web is hosting. A server hosts a file if it stores the file on disk or generates the file from data that it stores, and that file did not (to the server's knowledge) originate elsewhere on the web.
The presence of data on a server does not necessarily mean that the organisation that owns and maintains the server has an awareness of the presence of the data or its content. Many websites are hosted on shared hardware that is owned by a service provider that stores and serves data at the direction of controlling individuals and organisations which determine the data they provide on the site. Because of this, multiple servers may host the same file at different locations.
There are many different types of service provider. Some exercise practically no control over the software and data that they host, merely providing a base platform on which code can run. Others may focus on particular types of content, such as images (e.g. Flickr), videos (e.g. YouTube) or messages (e.g. Twitter). Also, there may be many service providers involved in the publication of a particular file on the web: some providing hardware, others providing different kinds of publishing support.
Some service providers automatically perform transformations on material that they host, as a service, such as converting to alternative formats, clipping or resizing, or marking up text. When they sign up to a service, controllers explicitly or implicitly enter into an agreement that grants the service provider a license to perform transformations on the material that they upload.
Service providers that host particular types of material often employ automatic filters to prevent the publication of unlawful material, but it is impossible for a service provider to detect and filter out everything that might be unlawful.
To add to the complexity of this area, it is possible for each of the following to be in different jurisdictions:
and be controlled by different laws and conventions.
Some servers provide access to files that are hosted elsewhere on the web: on an origin server that holds the original version of the file. These files might be stored on the server and provided again at a later time (a caching proxy in this document), or might simply pass through the server in response to a request (a relaying server in this document).
It is often impossible to tell whether a server is providing a stored response or whether it has made a new request to the origin server and is serving the results of that request. Servers commonly store the results of some requests and not others, acting as a caching proxy some of the time and as a relaying server the rest.
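The distinction between serving a stored response and relaying a fresh one can be sketched as follows. This is a simplified illustration, not a description of any particular proxy: the CachingProxy class, its cacheable flag and the "stored"/"relayed" labels are hypothetical, and a real proxy would decide what to store from the HTTP headers of each response.

```python
from typing import Callable, Dict, Tuple


class CachingProxy:
    """Toy intermediary that acts as a caching proxy or a relaying server."""

    def __init__(self, fetch_from_origin: Callable[[str], bytes]) -> None:
        # The function used to contact the origin server is injected so the
        # sketch does not depend on any network access.
        self._fetch = fetch_from_origin
        self._store: Dict[str, bytes] = {}

    def get(self, uri: str, cacheable: bool = True) -> Tuple[bytes, str]:
        """Return the response body and how it was obtained."""
        if cacheable and uri in self._store:
            # Caching proxy behaviour: the origin server is not contacted.
            return self._store[uri], "stored"
        body = self._fetch(uri)
        if cacheable:
            self._store[uri] = body
        # Relaying behaviour: the response passed through from the origin.
        return body, "relayed"
```

The client receives the same bytes in both cases; only the second element of the returned pair, which a real client never sees, reveals which path was taken.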
In both cases, the file the caching or relaying server provides might be different from the original web content that was accessed from the origin server. For example:
Caching and relaying servers are extremely useful on the web. There are four main types of caching and relaying servers discussed here: proxies, archives, search engines and reusers. The distinctions between them are summarised in the table below.
| |proxies|archives|search engines|reusers|
|---|---|---|---|---|
|purpose|increase network performance|maintain historical record|locate relevant information|better understand information|
|refreshing|based on HTTP headers|never|variable|based on HTTP headers|
|retrieval|on demand|proactive|proactive|usually on demand|
|URI use|usually uses same URI|uses new URI|uses new URI|uses new URI|
Archives aim to catalog and provide access to some web content to provide an on-going historical record. They use crawlers to fetch pages and other web content from the portion of the web that they cover, and store them on their own servers, along with metadata about the pages, including when each was retrieved. They then may provide access to the stored copies of the web content at particular historical dates, enabling people to see how pages used to appear.
Archives are often run by institutions that have a legal mandate and responsibility to keep a historical record, such as a legal deposit. Although their primary purpose is long term record-keeping, they often make this material available online as well. They might restrict access to the data for a period of time after it is collected, for security or privacy reasons, and may respond to legally-backed removal requests. Users might use archives for research, but also to access information that has otherwise been removed from the Web.
When they are made available to the public, archived pages are often distinguishable by end users from the original page using banners placed within the page or having the original page appear within a frame. The links (both to other pages and to embedded web content such as images) are usually rewritten so that when the user interacts with the page, they are taken to the version of the linked web content at the same point in time. "Dark archives" do not make their content available to the public.
Search engines aim to catalog and analyze as many web pages as they can, so that they can direct users to appropriate information in response to a search. They use crawlers to fetch pages and other web content from the web, analyse them and store them on their own servers to support further analysis.
Search engines are mostly interested in indexing web content and providing links to it rather than in the content itself. They may or may not copy the page itself, but they always store metadata about the page, derived from the information in the page and other information on the web, such as what other pages link to it.
Search engines play an important role in the web by enabling people to find information, including that which would otherwise be lost or is temporarily unavailable. When a user views a stored page from a search engine, it is usually obvious that the search engine is involved (from the URI of the page and from banners or framing), that the content originally came from somewhere else, and where it came from. The links within the page are not usually rewritten.
A server that is a reuser fetches information from one or more origin servers and either provides an alternative URI for the same page or adds value to it by reformatting it or combining it with other data. Good examples are the BBC Wildlife Finder, which incorporates information from Wikipedia, Animal Diversity Web and other sources, and triplr.org, which converts data from one format to another as a service.
Reusers that do not change the information from the origin server may be used to simplify access to the origin server (by mapping simple URLs to a more complex query) or to provide a route around gateways or the same-origin policy (as servers are not limited in where they access web content from).
Since reused information is, by design, seamlessly integrated into a page that is served from the reuser, people viewing that page will not generally be aware that the information originates from elsewhere. The URIs used for the pages will be those of the reuser. Licenses on the material may require attribution, and even when they do not, it is good practice for reusers to indicate where the material originates.
A web page written in HTML may include other web content, such as images, video, scripts, stylesheets, data and other HTML. The HTML in a web page refers to this external web content using markup. For example, the <img> element uses the src attribute to refer to an image which should be shown within the page. Material that is included within a web page may appear to be a hosted copy to the user of a website, but in fact may come from somewhere else, entirely outside the control of the owner of the web page.
HTML supports several different mechanisms for including external web content in a web page but they all work in essentially the same way. When a user navigates to a web page, the browser automatically fetches all the included web content into its local cache and executes them or displays them within the page.
Inclusion is different from hosting, copying or disseminating a file because the information is never stored on, nor passes through, the server that hosts the web page doing the including. As such, although the included web content is an essential component of the page to make it appear and function as a whole, the server of the web page does not have control over content which may change without its knowledge.
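The following sketch shows how the included content of a page can be listed, and how it can be determined which of it comes from a server other than the one hosting the page. It uses Python's standard html.parser module; the InclusionFinder class is hypothetical, considers only a handful of elements with a src attribute, and the example.com and example.org URIs are placeholders.

```python
from html.parser import HTMLParser
from urllib.parse import urljoin, urlsplit


class InclusionFinder(HTMLParser):
    """Collect the URIs that a browser would fetch for src-based inclusion."""

    # A deliberately small subset of the elements that include content.
    INCLUDING_TAGS = {"img", "script", "iframe", "video", "audio"}

    def __init__(self, page_uri: str) -> None:
        super().__init__()
        self.page_uri = page_uri
        self.included = []  # list of (absolute URI, is_third_party)

    def handle_starttag(self, tag, attrs):
        if tag not in self.INCLUDING_TAGS:
            return
        src = dict(attrs).get("src")
        if not src:
            return
        # Relative references are resolved against the URI of the page.
        absolute = urljoin(self.page_uri, src)
        third_party = urlsplit(absolute).netloc != urlsplit(self.page_uri).netloc
        self.included.append((absolute, third_party))


finder = InclusionFinder("http://www.example.com/articles/page.html")
finder.feed(
    '<p><img src="/logo.png">'
    '<img src="http://images.example.org/photo.jpg">'
    '<script src="app.js"></script></p>'
)
```

The server hosting page.html never handles the bytes of photo.jpg; only the browser, after reading the markup, requests it from images.example.org.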
When scripts or HTML are included into web pages, the included content may itself include other content (which may include still more and so on). The author of the original web page can choose what content it wants to include, but does not have control over the choice of the subsequently included content. The publishers of included content might change the content at any time, possibly without warning. This has been used to include third-party images without permission, or to substitute the image with something distasteful or to redirect to a link that performed an unintended action on the user's behalf; see Preventing MySpace Hotlinking.
Some of the web content that is used within a page may be invisible to the user. An example is a hidden image that is used for tracking purposes: each time a user navigates to the page, the hidden image is requested; the server uses the information from the request of the image to build a picture of the visitors to the site.
This facility can be used for malicious purposes. The <img> element can point to any URI (not just an image) and causes a GET request to that URI. If a website has been constructed such that GET requests cause an action to be carried out (such as logging out of a website), a page that includes this "image" will cause the action to take place.
Although linking and inclusion (or embedding or transclusion) are often confounded, they are fundamentally different.
In an article discussing the decision by the National Newspapers of Ireland to charge for links to articles that appear within their pages, the author says: "There's the fact that naming a work's title does not and cannot be copyright infringement not under US law ... A link (or the URL inside it) is little more than a name, so arguably the same rule would apply. And even if it is more than a name, the URL can be regarded as a factual statement (you can find the content here) and facts arguably cannot be copyrighted in the US (some courts disagree)."
This is clearly not the case with inclusion. If you include material from another site, especially if you do it without attribution, you can be prosecuted for copyright infringement and if the material is judged to be seditious or otherwise unlawful you can be prosecuted for distributing inappropriate material.
The discussion highlights how difficult it can be for end users (both human and machine) to be aware of the original source of content on the web, and the ways in which it may have been changed en route to them. Controllers of content need to make clear how that content can be used elsewhere, both through human-readable prose and by using technical barriers that can be used to limit access. Third parties that use content that originates elsewhere, whether proxies, reusers or linkers, should also follow good practices in transformation, reuse and linking to information. This is discussed in more detail below.
Publishers can control access to pages in several ways. In addition to not giving out URIs to these pages, they can control access using the Referer HTTP header, which indicates the last page that was referenced. If it was not a page on your own site, then you can redirect to your site's home page, for example. It is also possible to present a dialog to check whether the contractual terms have been read, to confirm that the user is over 18, or to ask for a password.
You can also use a cookie, for example, to start a session only when a page is accessed through a given gateway page and reject or provide an alternative path for requests that don't have the cookie set.
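The Referer and cookie checks described above might be combined along the following lines. This is a minimal sketch under stated assumptions: the host name, the home page URI, the cookie name gateway_session and the decide function are all hypothetical, and the headers are assumed to be available as a plain dictionary.

```python
from http.cookies import SimpleCookie
from urllib.parse import urlsplit

OWN_HOST = "www.example.com"          # hypothetical site
HOME_PAGE = "http://www.example.com/"  # where other visitors are redirected
SESSION_COOKIE = "gateway_session"     # hypothetical cookie set by a gateway page


def decide(headers: dict) -> tuple:
    """Return (status, location) for a request to a protected page."""
    # The Referer header names the page that linked to this one, if any.
    referer_host = urlsplit(headers.get("Referer", "")).hostname
    came_from_own_site = referer_host == OWN_HOST

    # The cookie shows that the user already passed through the gateway page.
    cookies = SimpleCookie(headers.get("Cookie", ""))
    has_session = SESSION_COOKIE in cookies

    if came_from_own_site or has_session:
        return (200, None)
    # Everyone else is sent to the home page instead of the deep link.
    return (302, HOME_PAGE)
```

Because the Referer header may be omitted or set arbitrarily by the client, a check of this kind discourages casual deep linking rather than providing strong access control.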
The User-Agent HTTP header, which indicates the identity of the user agent making the request, is particularly useful in preventing access from crawlers and search engines. A robots.txt file on the website can also be used for this purpose.
The domain name or IP address of the client making the connection can also be used to prevent specific reusers from accessing material.
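As an illustration of filtering on the User-Agent header and on the client's address, the sketch below uses Python's standard ipaddress module. The agent tokens, the blocked network (taken from a range reserved for documentation) and the is_blocked function are hypothetical.

```python
import ipaddress

# Hypothetical values: substrings of User-Agent strings to refuse, and
# networks from which requests are refused.
BLOCKED_AGENT_TOKENS = ("examplebot", "examplescraper")
BLOCKED_NETWORKS = [ipaddress.ip_network("192.0.2.0/24")]


def is_blocked(user_agent: str, client_ip: str) -> bool:
    """Decide whether to refuse a request based on who appears to make it."""
    agent = user_agent.lower()
    if any(token in agent for token in BLOCKED_AGENT_TOKENS):
        return True
    address = ipaddress.ip_address(client_ip)
    return any(address in network for network in BLOCKED_NETWORKS)
```

The User-Agent value is supplied by the client and can be changed at will, so this check is effective mainly against agents that identify themselves honestly.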
As well as the techniques above, which can be used to control any access to pages, it's also possible to provide additional control over the inclusion of content in a third-party's web pages.
To prevent an HTML page from being embedded within a frame, publishers can include a script that checks whether the document is the top document in the window.
The Cross-Origin Resource Sharing Working Draft [CORS] defines a set of HTTP headers that can be used to give the publisher of the third-party resource greater control over access to their resources. These are usually used to open up cross-origin access to resources that publishers want to be reused, such as JSON or XML data exposed by APIs, by indicating to the browser that the resource can be fetched by a cross-origin script.
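A server might choose its CORS response headers roughly as follows. This is a sketch only: the set of allowed origins and the cors_headers function are hypothetical, while Access-Control-Allow-Origin is the response header defined by [CORS] and Origin is the request header sent by the browser.

```python
# Hypothetical set of origins whose scripts may read this publisher's data.
ALLOWED_ORIGINS = {"http://app.example.org"}


def cors_headers(request_origin):
    """Return the extra response headers for a cross-origin request."""
    if request_origin in ALLOWED_ORIGINS:
        return {
            # Tells the browser that scripts from this origin may read the response.
            "Access-Control-Allow-Origin": request_origin,
            # The response differs per Origin, so caches must key on it.
            "Vary": "Origin",
        }
    # Without the header, the browser withholds the response from the script.
    return {}
```

The enforcement happens in the browser: the server still sends the response, and a client that is not a browser is unaffected by these headers.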
The Embed-Only-From-Origin HTTP header is also currently under discussion by the Web Applications Working Group and described within the Cross-Origin Resource Embedding Restrictions Editor's Draft [CORER]. This would enable publishers to control which origins are able to embed the content they publish into their pages.
Publishers should ensure actions are not taken on behalf of their users in response to an HTTP GET request on a URI, as otherwise sites are open to security breaches through inclusions, as described in section 4.3 Including. It is also good practice to check the Referer header in these cases to prevent actions being taken as the result of the submission of forms within other websites' web pages, unless that functionality is desired.
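The two checks recommended above can be sketched as a single guard that a site runs before performing any action on a user's behalf. The host name and the may_perform_action function are hypothetical, and treating POST as the only action-carrying method is a simplifying assumption.

```python
from urllib.parse import urlsplit

OWN_HOST = "www.example.com"  # hypothetical site


def may_perform_action(method: str, headers: dict) -> bool:
    """Allow a state-changing action only for POSTs submitted from our own pages."""
    # GET must stay free of side effects: an included "image" or script on
    # another site can trigger a GET without the user's knowledge.
    if method.upper() != "POST":
        return False
    # Reject form submissions that originate from other websites' pages.
    referer_host = urlsplit(headers.get("Referer", "")).hostname
    return referer_host == OWN_HOST
```

A site that wants to accept form submissions from third-party pages would relax the second check; the first should hold regardless.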
There are a number of HTTP headers [HTTP11] that enable content providers to indicate whether a proxy should cache a given page and for how long it should keep the copy. These are described in section 13: Caching in HTTP. For example, a server can use the Cache-Control: no-store HTTP header to indicate that a particular response should not be cached by a proxy server.
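To show how a proxy might interpret these headers, here is a simplified sketch. The function names are hypothetical, the parser splits on commas and so does not handle quoted arguments that themselves contain commas, and the storage rule covers only the no-store and private directives.

```python
def parse_cache_control(value: str) -> dict:
    """Split a Cache-Control header value into {directive: argument or None}."""
    directives = {}
    for part in value.split(","):
        part = part.strip()
        if not part:
            continue
        name, _, argument = part.partition("=")
        directives[name.strip().lower()] = argument.strip().strip('"') or None
    return directives


def proxy_may_store(headers: dict) -> bool:
    """A shared cache must not store responses marked no-store or private."""
    directives = parse_cache_control(headers.get("Cache-Control", ""))
    return "no-store" not in directives and "private" not in directives


def freshness_lifetime(headers: dict):
    """Seconds the copy may be kept, or None if the header gives no lifetime."""
    directives = parse_cache_control(headers.get("Cache-Control", ""))
    # For shared caches such as proxies, s-maxage takes precedence over max-age.
    max_age = directives.get("s-maxage") or directives.get("max-age")
    if max_age is not None and max_age.isdigit():
        return int(max_age)
    return None
```

For example, a response carrying Cache-Control: public, max-age=3600 may be stored and reused for an hour, whereas one carrying Cache-Control: no-store must be fetched again for every request.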
Publishers of websites can also indicate which pages should not be fetched or indexed by any search engine or other crawler, using a robots.txt file or the robots <meta> element [META]. They can indicate other characteristics of web pages, such as how frequently they might change and their importance on the website, using sitemaps [SITEMAPS]. More sophisticated publishers may use the Automated Content Access Protocol (ACAP) extensions [ACAP] to attempt to indicate access policies.
Publishers can also use the canonical link relationship to indicate a canonical URI for a page, which should be used by search engines and other reusers to reference a given page.
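A crawler or reuser could read these in-page signals along the following lines. This sketch uses Python's standard html.parser module; the IndexingHints class and the example markup are hypothetical, and only the robots <meta> element and the canonical link relationship are handled.

```python
from html.parser import HTMLParser


class IndexingHints(HTMLParser):
    """Collect robots directives and the canonical URI declared by a page."""

    def __init__(self) -> None:
        super().__init__()
        self.robots = set()    # e.g. {"noindex", "nofollow"}
        self.canonical = None  # URI the publisher wants others to reference

    def handle_starttag(self, tag, attrs):
        attributes = dict(attrs)
        if tag == "meta" and (attributes.get("name") or "").lower() == "robots":
            content = attributes.get("content") or ""
            self.robots.update(
                token.strip().lower() for token in content.split(",") if token.strip()
            )
        elif tag == "link" and "canonical" in (attributes.get("rel") or "").lower().split():
            self.canonical = attributes.get("href")


hints = IndexingHints()
hints.feed(
    '<html><head>'
    '<meta name="robots" content="noindex, nofollow">'
    '<link rel="canonical" href="http://www.example.com/article">'
    '</head></html>'
)
```

A well-behaved search engine that finds noindex would leave the page out of its index, and a reuser that finds a canonical URI would use it when linking back to the page.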
The Cache-Control: no-transform HTTP header indicates that a proxy server must not change the original content, nor the Content-Encoding, Content-Range and Content-Type headers. For example, a proxy server must not convert a TIFF served with Cache-Control: no-transform into a JPG, nor should it rewrite links within an HTML page.
Websites can include a license that describes how the information served by the website can be reused by others.
Just as with HTTP headers, robots.txt and sitemaps, there can be no technical guarantees that crawlers will honor license information within a site. However, to give well behaved crawlers a chance of identifying the license under which a page is published, websites should:
- use xhv:license to indicate the license of included content.
This section describes techniques that you should use when operating a website that incorporates material from other sources, whether by caching, transforming or simply linking.
As described in section 6.3 Controlling Caching and section 6.4 Controlling Processing, there are a number of HTTP headers and other conventions that indicate how origin servers intend other servers to treat the resources that they publish. Servers that cache or reuse data from origin servers should obey these headers.
Proxies must use the Via HTTP header when they handle requests to origin servers, to indicate their involvement in the response to the user's original request. Proxies which perform transformations on content must include a Warning: 214 Transformation applied HTTP header in the response.
These and other recommendations for proxies which perform transformations are included in the Guidelines for Web Content Transformation Proxies 1.0.
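The way a proxy records its involvement can be sketched as a function that annotates the response headers before passing them on. The annotate_response function and the proxy host name used with it are hypothetical; the Via and Warning header formats follow HTTP/1.1 [HTTP11].

```python
def annotate_response(headers: dict, proxy_name: str, transformed: bool) -> dict:
    """Return a copy of the response headers with the proxy's involvement recorded."""
    out = dict(headers)

    # Via lists every intermediary the message passed through, in order.
    via_entry = f"1.1 {proxy_name}"
    out["Via"] = f'{out["Via"]}, {via_entry}' if "Via" in out else via_entry

    if transformed:
        # Warning 214 tells the recipient that the content is not the
        # origin server's original representation.
        warning = f'214 {proxy_name} "Transformation applied"'
        out["Warning"] = f'{out["Warning"]}, {warning}' if "Warning" in out else warning
    return out
```

For instance, annotate_response({"Content-Type": "text/html"}, "proxy.example.net", True) adds both a Via entry and a 214 warning, while a proxy that only relays the response would pass transformed=False and add the Via entry alone.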
Many licenses require reusers of information to provide attribution to the original source of the material. This attribution must be human-readable, so that users understand where the material came from, and may also be computer-readable, which enables automated tools to track the use of material on the web.
The wording and positioning of attribution is usually dictated by the license under which the material is made available. For example, if you use the free icons available from Axialis Software their license includes:
If you use the icons in your website, you must add the following link on each page containing the icons (at the bottom of the page for example): Icons by Axialis Team
The HTML code for this link is: <a href="http://www.axialis.com/free/icons">Icons</a> by <a href="http://www.axialis.com">Axialis Team</a>
If there is no explicit guidance about the location of attribution, it is recommended that attribution to material from a third party appear as close to the actual material as possible. Methods to make the attribution machine-readable include:
- the cite attribute on the <blockquote> element, where a portion of a page is quoted within your own site
- the dc:source property with microformats, microdata or RDFa to indicate the source of a portion of the page (identified through an
An example of clear attribution of material from another site is that of the BBC Wildlife Finder; the following screenshot shows the attribution within the page on the Pygmy Three-toed Sloth.
There are a number of practices around linking to third-party sites that can help users and automated agents understand the relationship between your website and the third parties. These include:
- rel="nofollow" for links where the link is not meant to imply approval; these will not be used by search engines when determining the relevance for a page
- rel="external" for links to third-party web pages; this can be used as the basis of styling, such as an image that indicates the user will be taken to a separate site
There are a number of techniques that can be used to track
which links are followed from a website. Methods that
rewrite the links within a web page to point to an
interstitial ("you are leaving this website") page or
through a script can mislead the user and any automated
agents about the target of the link. It is better to use a
script to capture
onclick or other events and
redirect the user at that point.
In conclusion, publishing on the Web is different from print publishing. This document has enumerated some of these differences, especially those relevant to licensing, attribution and copyright issues.
Recommended practices include:
Many thanks to Thinh Nguyen, Rigo Wenning, Wendy Seltzer and other TAG members for their reviews and comments on earlier versions of this draft, and to Robin Berjon for ReSpec.js.