Publishing and Linking on the Web

Abstract

The Web borrows familiar concepts from physical media (e.g., the notion of a "page") and overlays them on top of a networked infrastructure (the Internet) and a digital presentation medium (browser software). This is a convenient abstraction, but when social or legal concepts and frameworks relating documents, publishing and speech are applied to the Web, the analogies ~~often do not suffice. Publishing~~ can be misleading, for example, publishing a page on the Web is fundamentally different from printing and distributing a page in a magazine or book.

Communication is often subject to governance: legislation, legal opinion, regulation, convention and contract; these are ways in which society looks to enforce norms, for example, around copyright, censorship, ~~privacy~~ and ~~other areas.~~ privacy. But there is often a mismatch between governance intended to apply to the Web (usually based on the analogy with physical media) and the technology and architecture used to create it.

This document is intended to inform future social and legal discussions about the architecture of the ~~Web by clarifying~~ Web: the ways in which the Web's technical facilities operate to store, publish and retrieve information, and by providing definitions for terminology as used within the Web's technical community. ~~This~~ Specifically, this document ~~also describes~~ has the ~~technical and operational impact~~ following goals:

Clarify that ~~does or could result from legal constraints on publishing,~~ linking ~~and transformation on the Web. One particular mismatch discussed in this document between~~ - providing the ~~Web and physical media~~ location of some content - is ~~the way~~ different in ~~which publication can happen without any judgment or review by Web subsystems based on actions of end-users, either~~ intention from including - embedding content developed elsewhere within your own - and should be treated differently by ~~web intermediaries~~ the law.
Intermediaries, such as ~~proxies, archives,~~ proxies and caches, and more controversially, search ~~engines,~~ engines and agggregators that merely refer to or by collect content ~~transformation services.~~ developed elsewhere, should be treated differently than websites that create orginal content.
In many cases, websites put legal restrictions on how the material they make available may be linked or reused. Such ~~sub-systems cannot exercise judgment, even if required to~~ restrictions can be better enforced by ~~governance.~~ using mechanisms provided by the web rather than by legal terms of use.

2. Introduction

The act of viewing a web page is a complex interaction between a user's browser and any number of web servers. Unlike reading a book, viewing a web page involves copying the data held on the servers onto the user's computer, if only temporarily. Logic encoded within the page may cause more copying to take ~~place — of images, videos and other files,~~ place, perhaps from other ~~servers, that are~~ servers. The combined material may be displayed or otherwise used within the original page — often without the user's explicit knowledge or consent. For an end user, it is usually impossible to tell whether a given image or video displayed within a page originates from the server the page comes from or from some other location.

In addition to browsers and webservers, many other kinds of servers live on the Web. Proxy servers and services that combine and repackage data from other sources may also retain copies of this material. These intermediary services may transform, translate or rewrite some of the material that passes through them, to enhance the user's experience of the web page or for their own purposes. Still other services on the web, such as search engines and archives, make copies of content as a matter of course. This is is, in ~~part~~ part, to facilitate the indexing necessary to for their operation, and in part to ~~enable presentation of search results, to~~ provide value to their users and to the original authors of the web page. These intermediaries, we argue, should be treated differently by the law based on how much control they have over the underlying material and how they process it.

~~Clark et.al. [ TUSSLE ] characterize~~ Examples of the ~~Internet as~~ kind of legal questions that have arisen related to material that originates on other sites include:

Is it unlawful to link to a ~~system where different kinds~~ page that contains unlawful material? Is it unlawful to embed unlawful material within a page? See http://act.demandprogress.org/sign/dhscomplaint/ .
Does embedding an image that you do not have a license to copy within a web page constitute a copyright infringement? Does creating a thumbnail version of ~~entities interact, often with conflicting goals. Managing these conflicts, they call them "tussles", they argue is crucial~~ that image constitute a copyright infringement? Perfect 10, Inc. v. Amazon.com, Inc.
If a proxy automatically rewrites scripts to combine and compress them, changes markup in the ~~harmonious development~~ page or compresses images, are these classed as adaptations of the ~~internet.~~ original material?
If a browser uses an online service to translate the text of a web page, is this classed as an adaptation of that web page?
Can linking to a page that contains defamatory material constitute libel? See Canada Supreme Court: Hyperlinks cannot libel (BBC) .

The Wikipedia page on Copyright aspects of hyperlinking and framing discusses these and several other examples.

1.1 2.1 Background

Many content publishers ~~and other entities have sought~~ seek to control the use of their content on the Web. In some cases, they ~~have employed~~ employ means that do not take into account the Web's true architecture, and ~~they have~~ do not ~~used~~ use the technical mechanisms available to them. A few illustrative examples are provided ~~as background~~ below.

Licenses that describe how material may be copied and altered by others tend not to distinguish between a proxy compressing a web page to make it load faster and someone editing and republishing the page on their own website. To illustrate, the Creative Commons Attribution-NoDerivs defines the terms (emphasis added):

Adaptation

means a work based upon the Work, or upon the Work and other pre-existing works, such as a translation, adaptation, derivative work , arrangement of music or other alterations of a literary or artistic work, or phonogram or performance and includes cinematographic adaptations or any other form in which the Work may be recast, transformed, or adapted including in any form recognizably derived from the original, except that a work that constitutes a Collection will not be considered an Adaptation for the purpose of this License. For the avoidance of doubt, where the Work is a musical work, performance or phonogram, the synchronization of the Work in timed-relation with a moving image ("synching") will be considered an Adaptation for the purpose of this License.

Distribute

means to make available to the public the original and copies of the Work through sale or other transfer of ownership .

Reproduce

means to make copies of the Work by any means including without limitation by sound or visual recordings and the right of fixation and reproducing fixations of the Work, including storage of a protected performance or phonogram in digital form or other electronic medium.

~~Consider~~ Consider, now, the following questions:

Does using the automatic Google Translate service (which can be invoked automatically within Chrome) count as an "adaptation" of a Work, given that it's a translation, or is it only non-automated translations that would count?
When a document is made available on a website, the original owner still retains their ~~copy —~~ copy, it is not transferred — , so does this count as "distribution"?
If someone accesses a document released under this license on the web, it has to be copied onto their machine in order to be displayed; does this mean anyone accessing the document "reproduces" it?

Terms and Conditions statements on websites also list acceptable and unacceptable ~~behaviour~~ behavior on a site, with any browsing on the site implicitly indicating acceptance of the terms. These generally do not take into account the ~~behaviour~~ behavior of proxies. For instance, one standard set of Terms and Conditions includes:

You may view, download for caching purposes only , and print pages from the website for your own personal use, subject to the restrictions set out below and elsewhere in these terms of use.

You must not:

(a) republish material from this website (including republication on another website);

(b) sell, rent or sub-license material from the website;

(c) show any material from the website in public;

(d) reproduce, duplicate, copy or otherwise exploit material on our website for a commercial purpose;

(e) edit or otherwise modify any material on the website ; or

(f) redistribute material from this website except for content specifically and expressly made available for redistribution (such as our newsletter)

It is not possible to view material on the web without it being downloaded onto your computer, so forbidding downloading except for caching purposes essentially means that people cannot view the page. In addition, many proxies automatically transform the documents that pass through them, for example to compress them so that they take up less bandwidth for mobile consumption or to introduce advertisments into pages that are accessed free of charge. Should this be prohibited?

Limits placed on the use of a website often include limitations on automatic indexing of the website, without exceptions for search engines that make the website discoverable or archives that ensure its longevity. For example, the ~~same~~ set of terms and conditions ~~as described~~ quoted above ~~includes:~~ goes on to say:

You must not conduct any systematic or automated data collection activities (including without limitation scraping, data mining, data extraction and data harvesting) on or in relation to our website without our express written consent.

Search engines rely on systematic data collection from websites in order to provide users with accurate search results, and archives do so in order to retain websites for posterity. So, these terms and conditions, if adhered to strictly, put the website out of the reach of search engines and hence makes it undiscoverable; surely this is not in the best interest of the website. Another problem is that automated agents ~~— webcrawlers, spiders and robots —~~ (webcrawlers, spiders, robots) that gather information from the web are unable to read these terms and conditions; the only things they understand are the technical signals that a website provides about what is permitted. See more on this below.

As another example, the terms and conditions for gsig.com include:

Use of Materials: Upon your agreement to the Terms, GSI grants you the right to view the site and to download materials from this site for your personal, non-commercial use. You are not authorized to use the materials for any other purpose. If you do download or otherwise reproduce the materials from this Site, you must reproduce all of ~~GSI’s~~ GSI?s proprietary markings, such as copyright and trademark notices, in the same form and manner as the original.

...

You may not use any ~~“deep-link”, “page-scrape”, “robot”, “spider”~~ "deep-link", "page-scrape"?, "robot", "spider" or any other automatic device, program, algorithm or methodology or any similar or equivalent manual process to access, acquire, copy or monitor any portion of the Site or any of its content, or in any way reproduce or circumvent the navigational structure or presentation of the Site.

~~However,~~ It would be simpler and more effective for the site ~~does not~~ to use ~~the primary~~ technical ~~method of actually~~ means for controlling what webcrawlers, spiders or robots access on the site, namely a robots.txt file which is a set of machine processable instructions instructing automated web agents what thay can and cannot do. They could also exempt automated web agents from the Terms and Conditions as discussed below.

Many sites have a linking policy that limits what links can be made to the site from other sites. These conditions are not backed up through relatively simple technical mechanisms that would prevent such links from being made. For example, the website at quotec.co.uk has a linking policy that includes:

Links pointing to this website should not be misleading.

Appropriate link text should be always be used.

From time to time we may update the URL structure of our website, and unless we agree in writing otherwise, all links should point to http://www.quotec.co.uk.

You must not use our logo to link to this website (or otherwise) without our express written permission.

You must not link to this website using any inline linking technique.

You must not frame the content of this website or use any similar technology in relation to the content of this website.

Technically it is straightforward to prevent linking to pages that the website does not want others to link to: you simply do not give these pages URLs or make the URLs undiscoverable. This is likely to be more effective than asking people to read and adhere to the Terms and Conditions. ~~Section 4. Techniques discusses several~~ Several techniques for controlling linking and ~~inclusion.~~ inclusion are discussed in section 6. Techniques .

Legislation that governs the possession and distribution of unlawful material (such as child pornography, information that is under copyright or material that is legally suppressed through a gag order ) ~~often~~ needs to exempt certain types of services, such as caching or hosting, as it would be impractical for ~~the people running those~~ such services to police all the material that passes through their servers. This does not, however, always happen and intermediaries are often held accountable for material that did not originate with them and have no control over. An example of good legislation that does ~~this in~~ exempt intermediaries the UK is the Coroners and Justice Act 2009 Schedule 13 ; from the Explanatory Notes (emphasis added):

Paragraphs 3 to 5 of [Schedule 13] provide exemptions for internet service providers from the offence of possession of prohibited images of children in limited circumstances, such as where they are acting as mere conduits for such material or are storing it as caches or hosts .

Examples of the kind of legal questions that have arisen with respect to publishing and linking on the Web are: Is it unlawful to link to a page that contains unlawful material? Is it unlawful to embed unlawful material within a page? See http://act.demandprogress.org/sign/dhscomplaint/ . Does embedding an image that you do not have a license to copy within a web page constitute a copyright infringement? Does creating a thumbnail version of that image constitute a copyright infringement? Perfect 10, Inc. v. Amazon.com, Inc. If a proxy automatically rewrites scripts to combine and compress them, changes markup in the page or compresses images, are these classed as adaptations of the original material? If a browser uses an online service to translate the text of a web page, is this classed as an adaptation of that web page? Can linking to a page that contains defamatory material constitute libel? See Canada Supreme Court: Hyperlinks cannot libel (BBC) . The Wikipedia page on Copyright aspects of hyperlinking and framing discusses these and several other examples. 1.2 Scope of this Document This document does not aim to address whether particular activities on the web are illegal or legal; this is outside the scope of the TAG. Instead, it aims to: provide an explanation of the way that material is published on the web to help inform people writing licenses and legislation provide definitions of terms related to web publishing and linking that may be useful within licenses and legislation describe the technical measures that websites can take to reinforce any restrictions that they place on the use of content they make available on the web describe the mechanisms by which websites that reuse material can ensure they meet known restrictions on the use of that material, for example through attribution

3. 4. Publishing

The concept of publishing on the web has evolved as the web's ecosystem has enlarged and diversified, and as the capabilities of browsers and the web standards that they implement have developed. There is no single definition of what publishing on the web means. ~~Instead~~ Instead, there are a number of activities that could be viewed as publication or distribution in a legal sense. This section describes ~~each~~ some of these activities and how they work.

3.1 4.1 Hosting

The basic form of publication on the web is hosting. A server hosts a file if it stores the file on disk or generates the file from data that it stores, and that file did not (to the server's knowledge) originate elsewhere on the web.

Diagram showing a browser accessing web content from a hosting server and creating a copy of it within its cache. — Fig. 1 Accessing hosted pages

The presence of data on a server does not necessarily mean that the organisation that owns and maintains the server has an awareness of the presence of the data or its content. Many websites are hosted on shared hardware that is owned by a service provider that stores and serves data at the direction of controlling individuals and organisations which determine the data they provide on the site. Because of this, multiple servers may host the same file at different URIs. For example, an artist could upload the same image to multiple servers, which then store the image and serve it to others. Fig. 2 Uploading files to a server A controller uploads a file to a hosting server, which is then accessed by a browser. locations.

There are many different types of service provider. Some ~~may~~ exercise practically no control over the software and data that they host, merely providing a base platform on which code can run. Others may focus on particular types of content, such as images (e.g. Flickr), videos (e.g. YouTube) or messages (e.g. Twitter). Also, there may be many service providers involved in the publication of a particular file on the web: some providing hardware, others providing different kinds of publishing support.

~~Fig. 3 Transforming files on a server A controller uploads a file to a hosting server, which is then automatically transformed into a different format which is accessed by a browser.~~

Some service providers automatically perform transformations on material that they host, as a service, such as converting to alternative formats, clipping or resizing, or marking up text. When they sign up to a service, controllers explicitly or implicitly enter into an agreement with the service provider that grants them a license to perform transformations on the material which they upload.

Service providers that host particular types of material often employ automatic filters to prevent the publication of unlawful material, but it is impossible for a service provider to detect and filter out everything that might be unlawful.

To add to the complexity of this area, it is possible for each of the following to be in different jurisdictions:

the individual or organisation who controls the data
the service provider(s)
the physical servers that host the data

and be controlled by different laws and conventions.

3.2 4.2 Caching and Relaying

Some servers provide access to files that are hosted elsewhere on the web: on an origin server that holds the original version of the file. These files might be stored on the server and provided again at a later ~~time,~~ time (a caching proxy in ~~which case, for the purposes of~~ this ~~document, it is termed a caching server ,~~ document), or might simply pass through the server in response to a ~~request, in which case, for the purposes of this document, it is termed a~~ request (a relaying server . in this document).

Diagram showing a caching proxy passing on to a browser a copy of content from elsewhere — Fig. 4 2 Accessing ~~files~~ web content via a caching ~~server~~ proxy

It is ~~usually~~ often impossible to tell whether a server is providing a stored response or ~~wehther~~ whether it has made a new request to the origin server and is serving the results of that request. Servers commonly store the results of some requests and not others, acting as a caching ~~server~~ proxy some of the time and as a relaying server the rest.

In both cases, the file the caching or relaying server provides ~~may~~ might be different from the original ~~file~~ web content that was accessed from the origin server. For example:

~~links~~ Links within an HTML ~~page may~~ content might be rewritten so that they point to ~~pages~~ content that ~~are~~ is also served by the caching ~~server or relaying server~~ proxy.
Javascript and CSS files ~~may~~ might be combined and compressed to provide speedier ~~access~~ access.
~~banners may~~ Banners might be added within an HTML page to highlight that the page is a copy of an original from somewhere ~~else~~ else.
~~files may~~ Content might be compressed or converted to different ~~formats~~ formats.
~~wholly~~ Wholly new ~~documents may~~ content might be created that bring together information from multiple different ~~sources~~ sources.

Caching and relaying servers are extremely useful on the web. There are four main types of caching and relaying servers discussed here: proxies, archives , search engines and reusers. The distinctions between them are summarised in the table below.

	proxy	archive	search engine	reuser
purpose	increase network performance	maintain historical record	locate relevant information	better understand information
refreshing	based on HTTP headers	never	variable	based on HTTP headers
retrieval	on demand	proactive	proactive	usually on demand
URI use	usually uses same URI	uses new URI	uses new URI	uses new URI

3.2.1 Proxies A proxy is an application that sits in between a server (such as a website) and a client (such as a browser). On the web, caching proxies are often used to speed up users' experience of the web by caching pages and other resources onto the proxy server so that they can be accessed faster. Some proxies, particularly those used by mobile operators, perform other actions on the content that passes through them, such as rewriting or merging Javascript to make it faster to access. Proxies come in four general flavors: Forward proxies serve a given community, such as a company. The browsers of the members of that community are set up to use the proxy, and whenever they fetch a resource, the request is routed through the proxy. If the proxy already has a cached copy of the resource, it returns that copy rather than forwarding the request on to the origin server. If many people within the community are requesting the same pages, this can speed up their access and lower the bandwidth use of the community. The proxy can also be used to prevent access to unsuitable sites and to carry out virus checking on the content. Reverse proxies sit in front of a (private) server and cache responses from that server. This can reduce the load on the server and speed up responses, which is particularly important when the response to a request takes time to compute. An extreme version of this is a content-delivery network ( CDN ) which operates a set of proxies and directs requests to the nearest of these, saving transmission time as well as processing time. Gateways are proxies which usually do not cache the result of a request: they simply pass on the request to an origin server and pass the response back. Gateways may be used to prevent access to particular parts of the web, or to bypass those blocks. For example, a gateway may be set up that prevents access for users in a particular country to particular websites; another gateway might be set up, at a non-blocked location, that enables people within that country to access those websites by mapping requests on that gateway to requests on the blocked website. Transforming proxies are proxies which transform the Web documents being delivered through them to address issues such as presentation issues in order to speed up access to those documents (often these are used to compress content, for example) [reference to Mobile Web Best Practices working group work on best practices for transforming proxies ]. Note Split browsers (such as Opera Mini and Amazon Silk) where some software components execute on a server and some software components execute on a client device are not discussed further in this document. Since, in these cases, the client and the server are tightly coupled, we consider them to be a single User Agent rather than a client and a proxy. The use of a forward proxy, gateway or transforming proxy may be configured either on an individual machine or transparently for a particular network. Users may have no idea that their requests are channelled through a given proxy, or they may have configured their set-up to use the proxy. Reverse proxies appear to be normal servers to users: it is impossible for a user to tell that their request is actually passed on to a completely different origin server, or where that server is. This is intentional as the origin server in this case is a private one. To improve performance, some proxies, particularly Content Distribution Networks (CDNs), may pre-fetch resources that a page includes since these resources are likely to be requested by the browser soon after the page is viewed. Thus, although generally the contents of a proxy's cache will be determined by the requests that users of that proxy have made, the proxy might also in some cases contain content that no one has ever requested.

3.2.2 4.2.1 Archives

Archives aim to catalog and provide access to some web content to provide an on-going historical record. They use crawlers to fetch pages and other ~~resources~~ web content from the portion of the web that they cover, and store them on their own servers, along with metadata about the pages, ~~particularly~~ including when ~~they were~~ each was retrieved. They then may provide access to the stored copies of the ~~resources~~ web content at particular historical dates, enabling people to see how pages used to appear.

Archives are ~~usually~~ often run by institutions that have a legal mandate and responsibility to keep a historical record, such as a legal deposit . Although their primary purpose is long term record-keeping, they often make this material available online as well. They might restrict access to the data for a period of time after it is collected, for security or privacy reasons, and may respond to legally-backed removal requests. Users might use archives for research, but also to access information that has otherwise been removed from the Web.

When they are made available to the public, archived pages are ~~usually~~ often distinguishable by end users from the original page using banners placed within the page or having the original page appear within a frame. The links (both to other pages and to embedded ~~resources~~ web content such as images) are usually rewritten so that when the user interacts with the page, they are taken to the version of the linked ~~resource~~ web content at the same point in time. "Dark archives" do not make their content available to the public.

3.2.3 4.2.2 Search Engines

Search engines aim to catalog and ~~provide access to~~ analyze as many web pages as they can, so that they can direct users to appropriate information in response to a search. They use crawlers to fetch pages and other ~~resources~~ web content from the web, analyse them and store them on their own servers to support further analysis.

Search engines are mostly interested in indexing ~~resources~~ web content and providing links to them rather than in the content ~~of the resource~~ itself. They ~~might~~ may or may not copy the page itself, but they always store metadata about the page, derived from the information in the page and other information on the web, such as what other pages link to it.

Search engines play an important role in the web in by enabling people to find information, including that which would otherwise be lost or is temporarily unavailable. When a user views a stored page from a search engine, it is usually obvious both that the search engine is involved (from the URI of the page and from banners or framing), that the content originally came from somewhere else, and where it came from. The links within the page are not usually rewritten.

3.2.4 4.2.3 Reusers and Aggregators

~~Data reuse is becoming more prevalent as web servers act as services to others.~~ A server that is a reuser fetches information from one or more origin servers and either provides an alternative URI for the same page or adds value to it by reformatting it or combining it with other data. Good examples are the BBC Wildlife Finder , which incorporates information from Wikipedia , Animal Diversity Web and other sources or triplr.org , which converts ~~RDF~~ data from one format to another as a service.

Diagram showing a reuser building a page from information from two origin servers — Fig. 5 3 Reusing information

Reusers that do not change the information from the origin server may be used to simplify access to the origin server (by mapping simple URLs to a more complex query) or to provide a route around gateways or the same-origin policy (as servers are not limited in where they access ~~resources~~ web content from).

Since reused information is, by design, seamlessly integrated into a page that is served from the reuser, people viewing that page will not generally be aware that the information originates from elsewhere. The URIs used for the pages will be those of the reuser. Licenses on the material may require ~~attribution;~~ attribution and even when it doesn't, it is good practice for reusers to indicate where the material originates.

3.3 Aliasing An alias is a URI that points the browser to another URI on an origin server . A server can automatically redirect a browser (using a HTTP 3XX status code and a Location header). Web pages from a server can do the same thing using a <meta> element with an http-equiv attribute set to Refresh ; this technique is often used with a slight delay to indicate to the user that they are being redirected to another page. Fig. 6 Aliasing documents A browser requests a file from an aliasing server; it responds with a redirection to a document hosted on an origin server and doesn't send any content itself Aliases do not involve any of the information from the origin server passing through or being stored by the redirecting server, but the redirecting server will be able to record when a particular URI is requested. Although it is preferable to only have one URI for a particular resource, redirections are a useful mechanism for managing change on the web. They are used within websites when the structure of the website changes, or between websites when a new website is created that supersedes the first, or to archived information when a host no longer wants to provide access to a file itself. Redirections are also used to provide other services. Link shorteners provide a short URI for a resource that is then redirected to the original URI, and are useful in situations where space is limited such as in print or on Twitter or where the original URL feels long and unwieldy. Depending on their implementation, link-tracking services can use a similar technique to enable servers to analyse which links are followed from their site. Note Add example of how Google records clicks on search results. When aliasing is used, users may not be aware about the eventual target of a link, or the involvement of an aliasing server, both of which are important. Shortened links, for example, hide the target location behind a URI that often has no visible relationship to the eventual destination of the page. Some implementations of link tracking do not change the original destination of the link (such that the status bar on a browser shows the eventual target of the page) but instead use the onclick event to direct the user to the aliasing server. Following a redirection, browsers change the address bar to the new location, but this is often the only indication, and the user may miss it and be unaware of the redirection.

3.4 4.3 Including

A web page written in HTML may include other ~~resources,~~ web content, such as images, video, scripts, stylesheets, data and other HTML. The HTML in a web page refers to ~~these~~ this external ~~resources~~ web content using markup. For example, an <img> element uses the src attribute to refer to an image which should be shown within the page. Material that is included within a web page may appear to be a hosted copy to the user of a website, but in fact may ~~be hosted completely separately,~~ come from somewhere else, entirely outside the control of the owner of the web page.

Diagram showing a page including other web content — Fig. 7 4 Including ~~resources~~ web content

HTML supports several different mechanisms for including ~~other resources~~ external web content in a ~~page. These are listed in section A. Linking Methods ,~~ web page but they all work in ~~basically~~ essentially the same way. ~~Typically, when~~ When a user navigates to a web page, the browser automatically fetches all the included ~~resources~~ web content into its local cache and executes them or displays them within the page.

Inclusion is different from hosting, copying or disseminating a file because the information is never stored on, nor passes through, the server that hosts the web page doing the including. As such, although the included ~~resources are~~ web content is an essential component of the page to make it appear and function as a whole, the server of the web page does not have control over ~~their~~ content which may change without ~~the server's~~ its knowledge.

Users may not be aware that included resources are used within a page at all. When included resources are embedded within the page such that they are visible to a user, it may not be clear that an image or video is from a third-party website rather than the website that they are visiting unless this is explicitly indicated within the content of the page. A resource that is included into a popular page may cause a large number of requests to the server on which the file is published. This can be burdensome to the third-party who hosts the file. Publishers who intend their files to be reused in this way, therefore, typically have terms and conditions that apply to the reuse of those files and may to put in place technical barriers to restrict it. As with normal links, included resources may or may not have the same origin as the page that includes them. Resources such as images and scripts that are included within the web page may be from any site. However, browsers implement a same-origin policy which generally means that third-party resources cannot be fetched and processed by scripts running on the page, for example through XMLHttpRequests [ XMLHTTPREQUEST ]. However, these scripts can write markup into the page which causes the inclusion of such resources.

3.4.1 4.3.1 Inclusion Chains

When scripts or HTML are included into web pages, the included ~~resource~~ content may itself include other ~~resources~~ content (which may include still more and so on). The author of the original web page ~~has control over which resources~~ can choose what content it ~~includes,~~ wants to include, but does not have control over ~~which resources those~~ the choice of the subsequently included ~~resources go on to include.~~ content. The publishers of included ~~resources may~~ content might change the content ~~of those resources~~ at any time, possibly without warning. This has been used ~~in cases where websites included~~ to include third-party images without permission, or to substitute the image with something distasteful or to redirect to a link that performed an unintended action on the user's behalf; see Preventing MySpace Hotlinking .

3.4.2 4.3.2 Hidden Requests

Some of the ~~resources~~ web content that ~~are included~~ is used within a page may be invisible to the user. An example is a hidden image that is used for tracking purposes: each time a user navigates to the page, the hidden image is requested; the server uses the information from the request of the image to build a picture of the visitors to the site.

This facility can be used for malicious purposes. An <img> element can point to any URI (not just an image) and causes a GET request on to that ~~resource.~~ URI. If a website has been constructed such that GET requests cause an action to be carried out (such as logging out of a website), a page that includes this "image" will cause the action to take place.

3.5 4.4 Linking

Linking is a fundamental ~~notion for~~ facility on the web. In fact, it has been argued that linking is what makes the Web the Web. HTML pages ~~use~~ can include links using the <a> ~~elements to insert links~~ element to other pages on the ~~web,~~ web with the href attribute holding the URI for the linked page . Some of the links ~~will~~ may be to be pages from the same origin ; while others will be cross-origin links to pages on third-party's sites that hold related information.

Diagram showing a page linking to other web content — Fig. 8 5 Linking to pages

A user can usually tell where a link is going to take them prior to selecting it ~~through the browser UI (e.g.~~ by "mousing over" ~~it)~~ it or after the link is selected through the status bar in the browser, although some links are overridden by onclick event handling that takes them to a different location. Some websites, such as Wikipedia, use icons to indicate ~~when~~ whether a link is a cross-origin link ~~and when~~ or whether it will take a user to a page on the same server. The use of interstitial pages or dialog boxes which warn the user they are about to leave the site in question can obscure the eventual destination of the link, as discussed in section 3.3 Aliasing . If the link is a cross-origin link (or even in some cases where it is an internal link), the publisher of the origin page will have no control over the content or access policies of the linked page. These are the responsibility of the publisher of that page; the TAG Finding on "Deep Linking" in the World Wide Web [DEEPLINKING] describes the ways in which publishers can control access to their pages and the fundamental principle that addressing (linking to) a page is distinct from accessing it. link.

3.5.1 4.4.1 Linking Out of Control

Traditionally, a user must take a specific action in order to navigate to the linked page, such as by clicking on the link or selecting it with a keystroke or a voice command. In these cases, the linked page cannot be accessed without the user's knowledge and consent ~~(though~~ although they may not know where they will eventually end ~~up).~~ up.

~~But there are three practices used by some~~ Some sites ~~that~~ use elaborate means to obscure whether a link is followed by the user. For example:

Browsers can be made to navigate to another page using scripted navigation , which may simply run automatically (navigating the user to another page after a period of time, for example) and can hide the location of a link, such that users don't know where they will be navigating to. The HTML5 history API [ HTML5 ] enables a script to change the address bar, which can mean that the address bar does not reflect the actual location of the page (although this use is limited to locations that have the same origin as the original page).
Browsers might pre-fetch ~~resources~~ web content that ~~seem~~ seems likely to be visited next, so that the target page is loaded more quickly when the user accesses it. A page can indicate which links should be pre-fetched using the prefetch link relation in a link. For example, a page might indicate that the first result in a list of search results should be fetched before the user actually navigates the link.

~~Browsers may support offline web applications [~~

4. 6. Techniques

The ~~description above about how information is published on the web~~ discussion highlights how difficult it can be for end users (both human and machine) to be aware of the original source of content on the web, and the ways in which it may have been changed en route to them. ~~It also shows that the controllers~~ Controllers of content need to be make clear about how that content can be used elsewhere, both through human-readable prose and by ~~the~~ using technical barriers that ~~they put up that~~ can be used to limit access. Third parties that use that ~~content,~~ content that originates elsewhere, whether proxies, reusers or linkers, should also follow ~~some best~~ good practices in transformation, reuse and ~~links~~ linking to information. ~~4.1 Controllers Once material is put on the public web (that is, on the internet and unprotected by authentication barriers), it is impossible to completely limit how that material~~ This is used through technical means — HTTP headers can be faked, metadata can be ignored. However, there are a number of standard techniques that controllers can use to indicate how they intend their material to be used, which intermediate servers should pay attention to. discussed in more detail below.

4.1.1 6.1 Controlling Access

Publishers can control access to ~~resources that are unprotected by authentication through HTTP, by refusing or redirecting connections~~ pages in several ways. In addition to ~~particular resources based on:~~ not giving out URIs to these pages, they can control access via the Referer HTTP ~~header;~~ header which indicates the last page that was referenced. If it was not a page on your own site, then you can redirect to your site's home page, for example. It is also possible to do this check in JavaScript, which can then be used to bring up a dialog to check whether the contractual terms have been read, to confirm that the user is ~~useful~~ over 18, or to ask for ~~preventing linking~~ a password.

You can also use a cookie, for example, to ~~particular resources from outside~~ start a ~~website, or preventing the inclusion of~~ session only when a ~~resource in another website~~ page is accessed through a given gateway page and reject or provide an alternative path for requests that don't have the cookie set.

The User-Agent HTTP ~~header; this~~ header which indicates the identity of the user agent making the request is particularly useful ~~for~~ in preventing access from crawlers and search engines. A robots.txt file on the website can be also used for the same purpose.

The domain name or IP address of the client making the ~~connection; this may~~ connection can also be ~~useful~~ used to prevent specific reusers from accessing ~~material Password controlled access~~ material.

4.1.2 6.2 Controlling Inclusion

As well as the techniques above, which can be used to control any access to pages, it's also possible to provide additional control over the inclusion of ~~resources~~ content in a third-party's web pages.

~~In the case of~~ To prevent an HTML ~~pages,~~ page from being embedded within a frame, publishers can include a script that checks whether the document is the top document in the ~~window, to prevent it from being embedded within a frame.~~ window.

The Cross-Origin Resource Sharing Working Draft [ CORS ] defines a set of HTTP headers that can be used to give the publisher of the third-party resource greater control over access to their resources. These are usually used to open up cross-origin access to resources that publishers want to be reused, such as JSON or XML data exposed by APIs, by indicating to the browser that the resource can be fetched by a cross-origin script.

Note

A new From-Origin or Embed-Only-From-Origin HTTP header is also currently under discussion by the Web Applications Working Group and described within the Cross-Origin Resource Embedding Restrictions Editor's Draft [CORER]. This would enable publishers to control which origins are able to embed the ~~resources~~ content they publish into their pages.

Publishers should ensure actions are not taken on behalf of their users in response to an HTTP GET on a URI, as otherwise sites are open to security breaches through inclusions, as described in section ~~3.4~~ 4.3 Including . It is also good practice to check the Referer header in these cases to prevent actions being taken as the result of the submission of forms within other website's web pages, unless that functionality is desired.

4.1.3 6.3 Controlling Caching

There are a number of HTTP headers [ HTTP11 ] that enable content providers to indicate whether a proxy should cache a given page and for how long it should keep the copy. These are described in detail within Section 13: Caching in HTTP . For example, a server can use the HTTP header Cache-Control : no-store to indicate that a particular ~~resource~~ response should not be cached by a proxy server.

Publishers of websites can also indicate which pages should not be fetched or indexed by any search engine or archive through robots.txt [ROBOTS] and the robots <meta> element [META]. They can indicate other characteristics of web pages, such as how frequently they might change and their importance on the website, through sitemaps [SITEMAPS]. More sophisticated publishers may use the Automated Content Access Protocol ( ACAP ) extensions [ ACAP ] to attempt to indicate access policies.

Publishers can also use the rel="canonical" link relationship to indicate a canonical URI for a page which should be used by search engines and other reusers to reference a given page.

4.1.4 6.4 Controlling Processing

The Cache-Control: no-transform HTTP header indicates that a proxy server must not change the original content, nor the headers:

Content-Encoding
Content-Range
Content-Type

For example, an proxy server must not convert a TIFF served with Cache-Control: no-transform into a JPG, nor should it rewrite links within an HTML page.

4.1.5 Link Shorteners One reason that people linking to websites use misleading links is when the original URLs are too long to incorporate into space-limited documents, such as short-form posts or in printed media. Although it's possible to use third-party link shortening services, origin websites can also set up link shorteners for their own content, and then use the rel="shortlink" link relationship to point from the original page to the short link for that page.

4.1.6 6.5 Licensing

Websites ~~indicate~~ can include a license that describes how the information ~~within~~ served by the website can be reused by others.

Just as with HTTP headers, robots.txt and sitemaps, there can be no technical guarantees that crawlers will honor license information within a site. However, to give well behaved crawlers a chance of identifying the license under which a page is published, websites should:

~~indicate~~ Indicate the license of a web ~~page~~ page, an image or ~~other resource~~ a video using a mechanism whereby the license can be detected automatically such as
- a <link> or <a> element with rel="license"
- microformats [MICROFORMATS], RDFa [ RDFA-CORE ] or microdata [MICRODATA] with a common property such as cc:license xhv:license to indicate the license of included ~~resources, such as images or videos~~ content./li>
~~ensure~~ Ensure that the conditions of the license are machine-understandable by
- referencing a well-known license (such as a Creative Commons license) through its URI, or
- providing a machine-readable description of the license, for example using the Creative Commons Rights Expression Language .

4.2 6.6 How to Write Good Websites

This section describes ~~the~~ techniques that you should use when operating a website that incorporates material from other sources, whether by caching, transforming or simply linking.

4.2.1 6.6.1 Honouring Honoring Headers

As described in section ~~4.1.3~~ 6.3 Controlling Caching and section ~~4.1.4~~ 6.4 Controlling Processing , there are a number of HTTP headers and other conventions that indicate how an origin ~~server intends~~ servers intend other servers to treat the resources that they publish. Servers that cache or ~~transform~~ reuse data from origin servers should obey these ~~headers, which exist to ensure that the end user receives current information in the intended form.~~ headers.

4.2.2 6.6.2 Adding Headers

Proxies must use the Via HTTP header when they handle requests to origin servers, to indicate their involvement in the response to the user's original request. Proxies which perform transformations on ~~a document~~ content must include a Warning: 214 Transformation applied HTTP header in the response.

These and other recommendations for proxies which perform transformations are included in the Guidelines for Web Content Transformation Proxies 1.0 .

4.2.3 6.6.3 Attribution

Many licenses require ~~the~~ reusers of information to provide attribution to the original source of the material. This attribution must be human-readable, so that users ~~of your website~~ understand where the material came from, and may also be computer-readable, which enables automated tools to track the use of material on the web.

The wording and positioning of attribution is usually dictated by the license under which the material is made available. For example, ~~the license for~~ if you use the free icons available from Axialis Software their license includes:

If you use the icons in your website, you must add the following link on each page containing the icons (at the bottom of the page for example):

Icons by Axialis Team

The HTML code for this link is:
~~<a href="http://www.axialis.com/free/icons">Icons</a> by <a href="http://www.axialis.com">Axialis Team</a>~~
<a href="http://www.axialis.com/free/icons">Icons</a> by <a href="http://www.axialis.com">Axialis Team</a>

If there is no explicit guidance about the location of attribution, it is recommended that attribution to material from a third party appear as close to the actual material as possible. Methods to make the attribution machine-readable include:

the use of the cite attribute on the <blockquote> element, where a portion of a page is quoted within your own site
using the dc:source property with microformats, microdata or RDFa to indicate the source of a portion of the page (identified through an id )

An example of clear attribution of material from another site is that of the BBC Wildlife Finder; the following screenshot shows the attribution within the page on the Pygmy Three-toed Sloth .

A description of the pygmy three-toed sloth is followed by an attribution that reads "This entry is from Wikipedia, the user-contributed encyclopedia. If you find the content in the 'About' section factually incorrect, defamatory or highly offensive you can edit this article at Wikipedia. For more information on our use of Wikipedia please read our FAQ." — Fig. 9 6 Attributing Reused Content

4.2.4 6.6.4 Linking

There are a number of practices around linking to a third-party ~~site~~ sites that can help users and automated agents to understand the relationship between your website and the third parties. These include:

~~use~~ Use rel="nofollow" for links where the link is not meant to imply approval; these will not be used by search engines when determining the relevance for a page
~~use~~ Use rel="external" for links to third-party web pages; this can be used as the basis of styling, such as an image that indicates the user will be taken to a separate site

There are a number of techniques that can be used to track which links are followed from a website. Methods that rewrite the links within a web page to point to an interstitial ("you are leaving this website") page or through a script can mislead the user and any automated agents about the target of the link. It is better to use a script to capture onclick or other events and redirect the user at that point.

Publishing and Linking on the Web

W3C Working Draft TAG Finding 15 August 2012 08 April 2013

Abstract

Status of This Document

Table of Contents

1. Why We Wrote This Document