Copyright © 2012-2013 W3C® (MIT, ERCIM, Keio, Beihang), All Rights Reserved. W3C liability, trademark and document use rules apply.
The
Web
borrows
familiar
concepts
from
physical
media
(e.g.,
the
notion
of
a
"page")
and
overlays
them
on
top
of
a
networked
infrastructure
(the
Internet)
and
a
digital
presentation
medium
(browser
software).
This
is
a
convenient
abstraction,
but
when
social
or
legal
concepts
and
frameworks
relating
documents,
publishing
and
speech
are
applied
to
the
Web,
the
analogies
often
do
not
suffice.
Publishing
can
be
misleading: for
example,
publishing
a
page
on
the
Web
is
fundamentally
different
from
printing
and
distributing
a
page
in
a
magazine
or
book.
Communication
is
often
subject
to
governance:
legislation,
legal
opinion,
regulation,
convention
and
contract;
these
are
ways
in
which
society
looks
to
enforce
norms,
for
example,
around
copyright,
censorship,
privacy
and
other
areas.
But
there
is
often
a
mismatch
between
governance
intended
to
apply
to
the
Web
(usually
based
on
the
analogy
with
physical
media)
and
the
technology
and
architecture
used
to
create
it.
This document is intended to inform future social and legal discussions about the Web by clarifying the ways in which the Web's technical facilities operate to store, publish and retrieve information, and by providing definitions for terminology as used within the Web's technical community. Specifically, this document has the following goals:
This section describes the status of this document at the time of its publication. Other documents may supersede this document. A list of current W3C publications and the latest revision of this technical report can be found in the W3C technical reports index at http://www.w3.org/TR/.
This document was published by the Technical Architecture Group as a Working Draft. This document is intended to become a W3C Recommendation. If you wish to make comments regarding this document, please send them to www-tag@w3.org ( subscribe , archives ). All comments are welcome.
Publication as a Working Draft does not imply endorsement by the W3C Membership.
This
is
a
draft
document
and
may
be
updated,
replaced
or
obsoleted
by
other
documents
at
any
time.
It
is
inappropriate
to
cite
this
document
as
other
than
work
in
progress.
This
document
was
produced
by
a
group
operating
under
the
5
February
2004
W3C Patent Policy.
W3C
maintains
a
public
list
of
any
patent
disclosures
made
in
connection
with
the
deliverables
of
the
group;
that
page
also
includes
instructions
for
disclosing
a
patent.
An
individual
who
has
actual
knowledge
of
a
patent
which
the
individual
believes
contains
Essential
Claim(s)
must
disclose
the
information
in
accordance
with
section
6
of
the
W3C
Patent
Policy
.
In recent months there have been several legal actions against individuals and organizations for making available material that is illegal or seditious, or that may be under copyright. We argue in section 4.3 Including that the manner in which the material is made available is important and should be taken into consideration. Similarly, as we explain in section 4.2 Caching and Relaying, there are several kinds of intermediaries that store and/or integrate material from other sources. The role of these intermediaries is quite different from that of websites that provide original content, and the laws should (and in some cases do) distinguish between them.
As the Web has worked its way into the fabric of our lives and access to the Internet is likened to free speech as a fundamental right, it is also increasingly subject to governance. By governance we mean the general idea of societal controls, whether by legislation, regulation, court order, contract, or other means. Unfortunately, a number of problems arise when dealing with governance of the Internet.
The goal of this document is to clarify the technology; if it informs policy-makers and thus helps make better policies, it will have succeeded. A secondary goal is to point out that many restrictions on the use of Web material that are commonly written into Terms and Conditions are better implemented by technology. For example, if a website does not want other sites linking to specific pages, it is more effective not to provide URIs for those pages than to include this restriction in the Terms and Conditions.
Readers who are interested in legal opinion and case citations related to linking and the role of intermediaries, the most common causes of lawsuits, as well as other related matters may want to consult ChillingEffects.org and LinksandLaw.com .
The
act
of
viewing
a
web
page
is
a
complex
interaction
between
a
user's
browser
and
any
number
of
web
servers.
Unlike
reading
a
book,
viewing
a
web
page
involves
copying
the
data
held
on
the
servers
onto
the
user's
computer,
if
only
temporarily.
Logic encoded within the page may cause more copying to take place — of images, videos and other files, perhaps from other servers. The combined material may be displayed or otherwise used within the original page, often without the user's explicit knowledge or consent.
For
an
end
user,
it
is
usually
impossible
to
tell
whether
a
given
image
or
video
displayed
within
a
page
originates
from
the
server
the
page
comes
from
or
from
some
other
location.
In
addition
to
browsers
and
webservers,
many
other
kinds
of
servers
live
on
the
Web.
Proxy
servers
and
services
that
combine
and
repackage
data
from
other
sources
may
also
retain
copies
of
this
material.
These
intermediary
services
may
transform,
translate
or
rewrite
some
of
the
material
that
passes
through
them,
to
enhance
the
user's
experience
of
the
web
page
or
for
their
own
purposes.
Still
other
services
on
the
web,
such
as
search
engines
and
archives,
make
copies
of
content
as
a
matter
of
course.
This is, in part, to facilitate the indexing necessary for their operation, and in part to enable presentation of search results, to provide value to their users and to the original authors of the web page.
These
intermediaries,
we
argue,
should
be
treated
differently
by
the
law
based
on
how
much
control
they
have
over
the
underlying
material
and
how
they
process
it.
Examples of the kind of legal questions that have arisen related to material that originates on other sites include:
The Wikipedia page on Copyright aspects of hyperlinking and framing discusses these and several other examples.
Many content publishers and other entities seek to control the use of their content on the Web. In some cases, they employ means that do not take into account the Web's true architecture, and they do not use the technical mechanisms available to them. A few illustrative examples are provided as background below.
Licenses that describe how material may be copied and altered by others tend not to distinguish between a proxy compressing a web page to make it load faster and someone editing and republishing the page on their own website. To illustrate, the Creative Commons Attribution-NoDerivs license defines the following terms (emphasis added):
- Adaptation
- means a work based upon the Work, or upon the Work and other pre-existing works, such as a translation, adaptation, derivative work , arrangement of music or other alterations of a literary or artistic work, or phonogram or performance and includes cinematographic adaptations or any other form in which the Work may be recast, transformed, or adapted including in any form recognizably derived from the original, except that a work that constitutes a Collection will not be considered an Adaptation for the purpose of this License. For the avoidance of doubt, where the Work is a musical work, performance or phonogram, the synchronization of the Work in timed-relation with a moving image ("synching") will be considered an Adaptation for the purpose of this License.
- Distribute
- means to make available to the public the original and copies of the Work through sale or other transfer of ownership .
- Reproduce
- means to make copies of the Work by any means including without limitation by sound or visual recordings and the right of fixation and reproducing fixations of the Work, including storage of a protected performance or phonogram in digital form or other electronic medium.
Consider, now, the following questions:
Terms
and
Conditions
statements
on
websites
also
list
acceptable
and
unacceptable
behavior
on
a
site,
with
any
browsing
on
the
site
implicitly
indicating
acceptance
of
the
terms.
These
generally
do
not
take
into
account
the
behavior
of
proxies.
For
instance,
one
standard
set
of
Terms
and
Conditions
includes:
You may view, download for caching purposes only , and print pages from the website for your own personal use, subject to the restrictions set out below and elsewhere in these terms of use.
You must not:
(a) republish material from this website (including republication on another website);
(b) sell, rent or sub-license material from the website;
(c) show any material from the website in public;
(d) reproduce, duplicate, copy or otherwise exploit material on our website for a commercial purpose;
(e) edit or otherwise modify any material on the website ; or
(f) redistribute material from this website except for content specifically and expressly made available for redistribution (such as our newsletter)
It is not possible to view material on the web without it being downloaded onto your computer, so forbidding downloading except for caching purposes essentially means that people cannot view the page. In addition, many proxies automatically transform the documents that pass through them, for example to compress them so that they take up less bandwidth for mobile consumption, or to introduce advertisements into pages that are accessed free of charge. Should this be prohibited?
Limits
placed
on
the
use
of
a
website
often
include
limitations
on
automatic
indexing
of
the
website,
without
exceptions
for
search
engines
that
make
the
website
discoverable
or
archives
that
ensure
its
longevity.
For
example,
the
same
set
of
terms
and
conditions
quoted above goes on to say:
You must not conduct any systematic or automated data collection activities (including without limitation scraping, data mining, data extraction and data harvesting) on or in relation to our website without our express written consent.
Search
engines
rely
on
systematic
data
collection
from
websites
in
order
to
provide
users
with
accurate
search
results,
and
archives
do
so
in
order
to
retain
websites
for
posterity.
So,
these
terms
and
conditions,
if
adhered
to
strictly,
put
the
website
out
of
the
reach
of
search
engines
and
hence
make
it
undiscoverable;
surely
this
is
not
in
the
best
interest
of
the
website.
Another
problem
is
that
automated
agents
(webcrawlers, spiders, robots)
that
gather
information
from
the
web
are
unable
to
read
these
terms
and
conditions;
the
only
things
they
understand
are
the
technical
signals
that
a
website
provides
about
what
is
permitted.
See
more
on
this
below.
As another example, the terms and conditions for gsig.com include:
Use of Materials: Upon your agreement to the Terms, GSI grants you the right to view the site and to download materials from this site for your personal, non-commercial use. You are not authorized to use the materials for any other purpose. If you do download or otherwise reproduce the materials from this Site, you must reproduce all of
GSI's proprietary markings, such as copyright and trademark notices, in the same form and manner as the original....
You may not use any
"deep-link", "page-scrape", "robot", "spider" or any other automatic device, program, algorithm or methodology or any similar or equivalent manual process to access, acquire, copy or monitor any portion of the Site or any of its content, or in any way reproduce or circumvent the navigational structure or presentation of the Site.
It would be simpler and more effective for the site to use the primary technical means of controlling what webcrawlers, spiders or robots access on the site, namely a robots.txt file: a set of machine-processable instructions telling automated web agents what they can and cannot do. The site could also exempt automated web agents from the Terms and Conditions, as discussed below.
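For illustration, a minimal robots.txt along the following lines (the paths and crawler name are hypothetical) tells automated agents which parts of a site they may fetch; compliance is voluntary, so it is a signal to well-behaved crawlers rather than an enforcement mechanism.

```
# robots.txt, served from the root of the website (hypothetical paths)

User-agent: *            # rules for all crawlers
Disallow: /members/      # do not fetch anything under /members/

User-agent: ExampleBot   # a specific crawler, identified by its User-Agent
Disallow: /              # excluded from the entire site
```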
Many sites have a linking policy that limits what links can be made to the site from other sites. These conditions are not backed up through relatively simple technical mechanisms that would prevent such links from being made. For example, the website at quotec.co.uk has a linking policy that includes:
Links pointing to this website should not be misleading.
Appropriate link text should be always be used.
From time to time we may update the URL structure of our website, and unless we agree in writing otherwise, all links should point to http://www.quotec.co.uk.
You must not use our logo to link to this website (or otherwise) without our express written permission.
You must not link to this website using any inline linking technique.
You must not frame the content of this website or use any similar technology in relation to the content of this website.
Technically
it
is
straightforward
to
prevent
linking
to
pages
that
the
website
does
not
want
others
to
link
to:
you
simply
do
not
give
these
pages
URLs
or
make
the
URLs
undiscoverable.
This
is
likely
to
be
more
effective
than
asking
people
to
read
and
adhere
to
the
Terms
and
Conditions.
Several techniques for controlling linking and inclusion are discussed in section 6. Techniques.
Legislation
that
governs
the
possession
and
distribution
of
unlawful
material
(such
as
child
pornography,
information
that
is
under
copyright
or
material
that
is
legally
suppressed
through
a
gag
order
)
often
needs
to
exempt
certain
types
of
services,
such
as
caching
or
hosting,
as
it
would
be
impractical
for
the
people
running
such
services
to
police
all
the
material
that
passes
through
their
servers.
This
does
not,
however,
always
happen
and
intermediaries
are
often
held
accountable
for
material
that
did
not
originate
with
them and that they have no control over.
An example of UK legislation that does exempt intermediaries is the
Coroners
and
Justice
Act
2009
Schedule
13
;
from
the
Explanatory
Notes
(emphasis
added):
Paragraphs 3 to 5 of [Schedule 13] provide exemptions for internet service providers from the offence of possession of prohibited images of children in limited circumstances, such as where they are acting as mere conduits for such material or are storing it as caches or hosts .
This section summarises the terminology that is used within this document. More details about each of the terms are given in the rest of the document.
The
concept
of
publishing
on
the
web
has
evolved
as
the
web's
ecosystem
has
enlarged
and
diversified,
and
as
the
capabilities
of
browsers
and
the
web
standards
that
they
implement
have
developed.
There
is
no
single
definition
of
what
publishing
on
the
web
means.
Instead,
there
are
a
number
of
activities
that
could
be
viewed
as
publication
or
distribution
in
a
legal
sense.
This
section
describes
some
of
these
activities
and
how
they
work.
The basic form of publication on the web is hosting. A server hosts a file if it stores the file on disk or generates the file from data that it stores, and that file did not (to the server's knowledge) originate elsewhere on the web.
The
presence
of
data
on
a
server
does
not
necessarily
mean
that
the
organisation
that
owns
and
maintains
the
server
has
an
awareness
of
the
presence
of
the
data
or
its
content.
Many
websites
are
hosted
on
shared
hardware
that
is
owned
by
a
service
provider
that
stores
and
serves
data
at
the
direction
of
controlling
individuals
and
organisations
which
determine
the
data
they
provide
on
the
site.
Because
of
this,
multiple
servers
may
host
the
same
file
at
different
URIs.
For
example,
an
artist
could
upload
the
same
image
to
multiple
servers,
which
then
store
the
image
and
serve
it
to
others.
Fig. 2: Uploading files to a server. A controller uploads a file to a hosting server, which is then accessed by a browser.
There
are
many
different
types
of
service
provider.
Some
may
exercise
practically
no
control
over
the
software
and
data
that
they
host,
merely
providing
a
base
platform
on
which
code
can
run.
Others
may
focus
on
particular
types
of
content,
such
as
images
(e.g.
Flickr),
videos
(e.g.
YouTube)
or
messages
(e.g.
Twitter).
Also,
there
may
be
many
service
providers
involved
in
the
publication
of
a
particular
file
on
the
web:
some
providing
hardware,
others
providing
different
kinds
of
publishing
support.
Some service providers automatically perform transformations on material that they host, as a service, such as converting to alternative formats, clipping or resizing, or marking up text. When they sign up to a service, controllers explicitly or implicitly enter into an agreement with the service provider that grants them a license to perform transformations on the material which they upload.
Service providers that host particular types of material often employ automatic filters to prevent the publication of unlawful material, but it is impossible for a service provider to detect and filter out everything that might be unlawful.
To add to the complexity of this area, it is possible for each of the following to be in different jurisdictions:
and be controlled by different laws and conventions.
Some
servers
provide
access
to
files
that
are
hosted
elsewhere
on
the
web:
on
an
origin
server
that
holds
the
original
version
of
the
file.
These files might be stored on the server and provided again at a later time (termed a caching proxy in this document), or might simply pass through the server in response to a request (termed a relaying server in this document).
It
is
often
impossible
to
tell
whether
a
server
is
providing
a
stored
response
or
whether
it
has
made
a
new
request
to
the
origin
server
and
is
serving
the
results
of
that
request.
Servers
commonly
store
the
results
of
some
requests
and
not
others,
acting
as
a
caching proxy
some
of
the
time
and
as
a
relaying
server
the
rest.
In both cases, the web content that the caching or relaying server provides might be different from the original web content that was accessed from the origin server. For example:
Caching and relaying servers are extremely useful on the web. There are four main types of caching and relaying servers discussed here: proxies, archives , search engines and reusers. The distinctions between them are summarised in the table below.
| | proxy | archive | search engine | reuser |
|---|---|---|---|---|
| purpose | increase network performance | maintain historical record | locate relevant information | better understand information |
| refreshing | based on HTTP headers | never | variable | based on HTTP headers |
| retrieval | on demand | proactive | proactive | usually on demand |
| URI use | usually uses same URI | uses new URI | uses new URI | uses new URI |
Archives aim to catalog and provide access to some web content to provide an on-going historical record. They use crawlers to fetch web content from the portion of the web that they cover, and store it on their own servers, along with metadata about the pages, including when each was retrieved. They may then provide access to the stored copies of the web content at particular historical dates, enabling people to see how pages used to appear. Archives are often run by institutions that have a legal mandate and responsibility to keep a historical record, such as a legal deposit. Although their primary purpose is long term record-keeping, they often make this material available online as well. They might restrict access to the data for a period of time after it is collected, for security or privacy reasons, and may respond to legally-backed removal requests.
Users
might
use
archives
for
research,
but
also
to
access
information
that
has
otherwise
been
removed
from
the
Web.
When
they
are
made
available
to
the
public,
archived
pages
are
often
distinguishable
by
end
users
from
the
original
page
using
banners
placed
within
the
page
or
having
the
original
page
appear
within
a
frame.
The
links
(both
to
other
pages
and
to
embedded
web content
such
as
images)
are
usually
rewritten
so
that
when
the
user
interacts
with
the
page,
they
are
taken
to
the
version
of
the
linked
web content
at
the
same
point
in
time.
"Dark
archives"
do
not
make
their
content
available
to
the
public.
Search
engines
aim
to
analyze
as
many
web
pages
as
they
can,
so
that
they
can
direct
users
to
appropriate
information
in
response
to
a
search.
They
use
crawlers
to
fetch
web content from the web, analyse it and store it
on
their
own
servers
to
support
further
analysis.
Search
engines
are
mostly
interested
in
indexing
web content
and
providing
links
to
them
rather
than
in
the
content
of
the
resource
itself.
They
may
or
may
not
copy
the
page
itself,
but
they
always
store
metadata
about
the
page,
derived
from
the
information
in
the
page
and
other
information
on
the
web,
such
as
what
other
pages
link
to
it.
Search
engines
play
an
important
role
in
the
web
by
enabling
people
to
find
information,
including
that
which
would
otherwise
be
lost
or
is
temporarily
unavailable.
When
a
user
views
a
stored
page
from
a
search
engine,
it
is
usually
obvious
that
the
search
engine
is
involved
(from
the
URI
of
the
page
and
from
banners
or
framing),
that
the
content
originally
came
from
somewhere
else,
and
where
it
came
from.
The
links
within
the
page
are
not
usually
rewritten.
Data
reuse
is
becoming
more
prevalent
as
web
servers
act
as
services
to
others.
A
server
that
is
a
reuser
fetches
information
from
one
or
more
origin
servers
and
either
provides
an
alternative
URI
for
the
same
page
or
adds
value
to
it
by
reformatting
it
or
combining
it
with
other
data.
Good
examples
are
the
BBC
Wildlife
Finder
,
which
incorporates
information
from
Wikipedia
,
Animal
Diversity
Web
and
other
sources, or
triplr.org
,
which
converts
RDF
data
from
one
format
to
another
as
a
service.
Reusers
that
do
not
change
the
information
from
the
origin
server
may
be
used
to
simplify
access
to
the
origin
server
(by
mapping
simple
URLs
to
a
more
complex
query)
or
to
provide
a
route
around
gateways
or
the
same-origin
policy
(as
servers
are
not
limited
in
where
they
access
web content
from).
Since
reused
information
is,
by
design,
seamlessly
integrated
into
a
page
that
is
served
from
the
reuser,
people
viewing
that
page
will
not
generally
be
aware
that
the
information
originates
from
elsewhere.
The
URIs
used
for
the
pages
will
be
those
of
the
reuser.
Licenses
on
the
material
may
require
attribution, and even when they do not,
it
is
good
practice
for
reusers
to
indicate
where
the
material
originates.
A
web
page
written
in
HTML
may
include
other
web content,
such
as
images,
video,
scripts,
stylesheets,
data
and
other
HTML.
The
HTML
in
a
web
page
refers
to
this external web content
using
markup.
For
example,
an
<img>
element
uses
the
src
attribute
to
refer
to
an
image
which
should
be
shown
within
the
page.
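For example, a minimal sketch (the domains are hypothetical): a page served from example.org can include an image that is hosted on a completely different server.

```html
<!-- Page served from http://example.org/; the browser fetches the image
     directly from images.example.com, and it never passes through example.org -->
<p>A photograph of the venue:</p>
<img src="http://images.example.com/venue.jpg" alt="Photograph of the venue">
```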
Material
that
is
included
within
a
web
page
may
appear
to
be
a
hosted
copy
to
the
user
of
a
website,
but
in
fact
may
be
hosted
completely
separately,
entirely
outside
the
control
of
the
owner
of
the
web
page.
HTML supports several different mechanisms for including external web content in a web page. These are listed in section A. Linking Methods, but they all work in essentially the same way.
When a user navigates to a web page, the browser automatically fetches all the included web content into its local cache and executes or displays it within the page.
Inclusion
is
different
from
hosting,
copying
or
disseminating
a
file
because
the
information
is
never
stored
on,
nor
passes
through,
the
server
that
hosts
the
web
page
doing
the
including.
As such, although the included web content is an essential component of the page, needed to make it appear and function as a whole, the server of the web page does not have control over that content, which may change without its knowledge.
When scripts or HTML are included into web pages, the included content may itself include other content (which may include still more, and so on). The author of the original web page can choose what content it wants to include, but does not have control over the choice of the subsequently included content.
The publishers of included content might change that content at any time, possibly without warning. This has been exploited in cases where websites included third-party images without permission: the publisher of the image substituted it with something distasteful, or redirected the request to a link that performed an unintended action on the user's behalf; see Preventing MySpace Hotlinking.
Some of the web content that is used
within
a
page
may
be
invisible
to
the
user.
An
example
is
a
hidden
image
that
is
used
for
tracking
purposes:
each
time
a
user
navigates
to
the
page,
the
hidden
image
is
requested;
the
server
uses
the
information
from
the
request
for
the
image
to
build
a
picture
of
the
visitors
to
the
site.
This
facility
can
be
used
for
malicious
purposes.
An
<img>
element
can
point
to
any
URI
(not
just
an
image)
and
cause a GET request to that URI.
If
a
website
has
been
constructed
such
that
GET
requests
cause
an
action
to
be
carried
out
(such
as
logging
out
of
a
website),
a
page
that
includes
this
"image"
will
cause
the
action
to
take
place.
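The following sketch (hypothetical URLs) shows both situations: a hidden image used purely for tracking, and an "image" whose URI actually triggers an action when the browser fetches it.

```html
<!-- A 1x1 tracking image: invisible to the user, but every page view
     causes a GET request that the tracking server can log -->
<img src="http://tracker.example.com/pixel.gif" width="1" height="1" alt="">

<!-- If example.com carries out actions in response to GET requests,
     simply rendering this element would log the visitor out of that site -->
<img src="http://example.com/account/logout" alt="">
```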
Linking is a fundamental facility on the web. In fact, it has been argued that linking is what makes the Web the Web. HTML pages can include links to other pages on the web using the <a> element, with the href attribute holding the URI of the linked page.
Some of the links may be to pages from the same origin, while others will be cross-origin links to pages on third-party sites that hold related information.
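For example (hypothetical URLs), a page on example.org might contain both kinds of link:

```html
<!-- Same-origin link: a relative URI that stays on example.org -->
<a href="/about.html">About this site</a>

<!-- Cross-origin link: takes the user to a page on a third-party site -->
<a href="http://example.com/related-article">A related article elsewhere</a>
```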
A
user
can
usually
tell
where
a
link
is
going
to
take
them
prior
to
selecting
it
through
the
browser
UI
(e.g.
by
"mousing
over"
it), or, after the link is selected, through the status bar in the browser,
although
some
links
are
overridden
by
onclick
event
handling
that
takes
them
to
a
different
location.
Some
websites,
such
as
Wikipedia,
use
icons
to
indicate
whether
a
link
is
a
cross-origin
link
or whether
it
will
take
a
user
to
a
page
on
the
same
server.
The
use
of
interstitial
pages
or
dialog
boxes
which
warn
the
user
they
are
about
to
leave
the
site
in
question
can
obscure
the
eventual
destination
of
the
link,
as
discussed
in
section
3.3
Aliasing
.
If
the
link
is
a
cross-origin
link
(or
even
in
some
cases
where
it
is
an
internal
link),
the
publisher
of
the
origin
page
will
have
no
control
over
the
content
or
access
policies
of
the
linked
page.
These
are
the
responsibility
of
the
publisher
of
that
page;
the
TAG
Finding
on
"Deep
Linking"
in
the
World
Wide
Web
[DEEPLINKING]
describes
the
ways
in
which
publishers
can
control
access
to
their
pages
and
the
fundamental
principle
that
addressing
(linking
to)
a
page
is
distinct
from
accessing
it.
Traditionally,
a
user
must
take
a
specific
action
in
order
to
navigate
to
the
linked
page,
such
as
by
clicking
on
the
link
or
selecting
it
with
a
keystroke
or
a
voice
command.
In
these
cases,
the
linked
page
cannot
be
accessed
without
the
user's
knowledge
and
consent, although they may not know where they will eventually end up.
Some sites use elaborate means to obscure whether a link is followed by the user. In addition, a page can use the prefetch link relation to ask the browser to fetch a linked page before the user follows the link. For example, a page might indicate that the first result in a list of search results should be fetched before the user actually navigates the link.
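Such a hint might look like the following sketch (the URL is hypothetical); the browser may fetch the linked page before the user ever selects the link.

```html
<!-- Hint that the first search result is likely to be visited next -->
<link rel="prefetch" href="http://example.com/results/first-hit.html">
```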
Although linking and inclusion (or embedding or transclusion) are often confounded, they are fundamentally different.
In an article discussing the decision by the National Newspapers of Ireland to charge for links to articles that appear within their pages, the author says: "There's the fact that naming a work's title does not and cannot be copyright infringement, not under US law ... A link (or the URL inside it) is little more than a name, so that arguably the same rule would apply. And even if it is more than a name, the URL can be regarded as a factual statement (you can find the content here) and facts arguably cannot be copyrighted in the US (some courts disagree)."

This is clearly not the case with inclusion. If you include material from another site, especially if you do it without attribution, you can be prosecuted for copyright infringement; and if the material is judged to be seditious or otherwise unlawful, you can be prosecuted for distributing inappropriate material.
The discussion above about how information is published on the web highlights
how
difficult
it
can
be
for
end
users
(both
human
and
machine)
to
be
aware
of
the
original
source
of
content
on
the
web,
and
the
ways
in
which
it
may
have
been
changed
en
route
to
them.
It also shows that controllers of content need to make clear how that content can be used elsewhere, both through human-readable prose and by using technical barriers to limit access. Third parties that use content that originates elsewhere, whether proxies, reusers or linkers, should also follow some good practices in transformation, reuse and linking to information.
4.1 Controllers

Once material is put on the public web (that is, on the internet and unprotected by authentication barriers), it is impossible to completely limit how that material is used through technical means: HTTP headers can be faked, metadata can be ignored. However, there are a number of standard techniques that controllers can use to indicate how they intend their material to be used, and which intermediate servers should pay attention to. These are discussed in more detail below.
Publishers can control access to pages in several ways. In addition to not giving out URIs to these pages, they can control access via the Referer HTTP header, which indicates the last page that was referenced: if it was not a page on your own site, then you can redirect to your site's home page, for example. It is also possible to do this check in JavaScript, which can then be used to bring up a dialog to check whether the contractual terms have been read, to confirm that the user is over 18, or to ask for a password. You can also use a cookie, for example to start a session only when a page is accessed through a given gateway page, and reject or provide an alternative path for requests that don't have the cookie set.
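A minimal sketch of the JavaScript variant is shown below (the site URL is hypothetical). Because the Referer header can be faked and scripts can be disabled, this only discourages casual deep linking; it is not an enforcement mechanism.

```html
<script>
  // If the visitor did not arrive from a page on this site,
  // send them to the home page instead of the requested page.
  if (document.referrer.indexOf("http://www.example.org/") !== 0) {
    window.location = "http://www.example.org/";
  }
</script>
```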
The User-Agent HTTP header, which indicates the identity of the user agent making the request, is particularly useful in preventing access from crawlers and search engines. A robots.txt file on the website can also be used for the same purpose.
The domain name or IP address of the client making the connection can also be used to prevent specific reusers from accessing material, as can password-controlled access.
As well as the techniques above, which can be used to control any access to pages, it's also possible to provide additional control over the inclusion of content in a third-party's web pages. To prevent an HTML page from being embedded within a frame, publishers can include a script that checks whether the document is the top document in the window.
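A common form of such a script is sketched below: it compares the current window with the top-level window and breaks out of the frame if they differ.

```html
<script>
  // If this document is not the top-level document, it is being framed;
  // replace the framing page with this page.
  if (window.top !== window.self) {
    window.top.location = window.self.location;
  }
</script>
```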
The Cross-Origin Resource Sharing Working Draft [ CORS ] defines a set of HTTP headers that can be used to give the publisher of the third-party resource greater control over access to their resources. These are usually used to open up cross-origin access to resources that publishers want to be reused, such as JSON or XML data exposed by APIs, by indicating to the browser that the resource can be fetched by a cross-origin script.
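For example, a response carrying the following headers (a sketch; the permitted origin is hypothetical) tells the browser that scripts running on pages from example.org may read this data.

```
HTTP/1.1 200 OK
Content-Type: application/json
Access-Control-Allow-Origin: http://example.org

{"temperature": 18, "unit": "celsius"}
```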
A
new
From-Origin
or
Embed-Only-From-Origin
HTTP
header
is
also
currently
under
discussion
by
the
Web
Applications
Working
Group
and
described
within
the
Cross-Origin
Resource
Embedding
Restrictions
Editor's
Draft
[CORER].
This
would
enable
publishers
to
control
which
origins
are
able
to
embed
the
content
they
publish
into
their
pages.
Publishers
should
ensure
actions
are
not
taken
on
behalf
of
their
users
in
response
to
an
HTTP
GET
on
a
URI,
as
otherwise
sites
are
open
to
security
breaches
through
inclusions,
as
described
in
section
4.3
Including
.
It
is
also
good
practice
to
check
the
Referer
header
in
these
cases
to
prevent
actions
being
taken
as
the
result
of
the
submission
of
forms
within
other
website's
web
pages,
unless
that
functionality
is
desired.
There
are
a
number
of
HTTP
headers
[
HTTP11
]
that
enable
content
providers
to
indicate
whether
a
proxy
should
cache
a
given
page
and
for
how
long
it
should
keep
the
copy.
These
are
described
in
detail
within
Section
13:
Caching
in
HTTP
.
For
example,
a
server
can
use
the
HTTP
header
Cache-Control
:
no-store
to
indicate
that
a
particular
response
should
not
be
cached
by
a
proxy
server.
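A response marked in this way might look like the following sketch; conversely, a header such as Cache-Control: max-age=3600 would allow a cache to keep and reuse the response for up to an hour.

```
HTTP/1.1 200 OK
Content-Type: text/html
Cache-Control: no-store
```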
Publishers
of
websites
can
also
indicate
which
pages
should
not
be
fetched
or
indexed
by
any
search
engine
or
archive
through
robots.txt
[ROBOTS]
and
the
robots
<meta>
element
[META].
They
can
indicate
other
characteristics
of
web
pages,
such
as
how
frequently
they
might
change
and
their
importance
on
the
website,
through
sitemaps
[SITEMAPS].
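For example, the page-level equivalent of a robots.txt rule can be expressed within the page itself:

```html
<!-- Ask crawlers not to index this page and not to follow its links -->
<meta name="robots" content="noindex, nofollow">
```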
More
sophisticated
publishers
may
use
the
Automated
Content
Access
Protocol
(
ACAP
)
extensions
[
ACAP
]
to
attempt
to
indicate
access
policies.
Publishers
can
also
use
the
rel="canonical"
link
relationship
to
indicate
a
canonical
URI
for
a
page
which
should
be
used
by
search
engines
and
other
reusers
to
reference
a
given
page.
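For example (hypothetical URLs), a page that is reachable at several URIs can declare which one should be treated as canonical:

```html
<!-- In the <head> of http://example.org/item?id=42&session=xyz -->
<link rel="canonical" href="http://example.org/item/42">
```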
The
Cache-Control:
no-transform
HTTP
header
indicates
that
a
proxy
server
must
not
change
the
original
content,
nor
the
headers:
Content-Encoding
Content-Range
Content-Type
For
example,
a
proxy
server
must
not
convert
a
TIFF
served
with
Cache-Control:
no-transform
into
a
JPG,
nor
should
it
rewrite
links
within
an
HTML
page.
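A response protected in this way might carry headers along the following lines (a sketch):

```
HTTP/1.1 200 OK
Content-Type: image/tiff
Cache-Control: no-transform
```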
Websites can include a license that describes how the information served by the website can be reused by others.
Just as with HTTP headers, robots.txt and sitemaps, there can be no technical guarantees that crawlers will honor license information within a site. However, to give well behaved crawlers a chance of identifying the license under which a page is published, websites should:
use a <link> or <a> element with rel="license", or the cc:license or xhv:license properties, to indicate the license of included content
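For illustration, a machine-readable license declaration might look like the following sketch; the Creative Commons URI shown is the real BY-ND 3.0 license URI, while the page itself is hypothetical.

```html
<!-- In the <head>: machine-readable license for the page -->
<link rel="license" href="http://creativecommons.org/licenses/by-nd/3.0/">

<!-- Or inline, visible to readers as well as crawlers -->
<a rel="license" href="http://creativecommons.org/licenses/by-nd/3.0/">
  This page is licensed under CC BY-ND 3.0
</a>
```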
This
section
describes
the
techniques
that
you
should
use
when
operating
a
website
that
incorporates
material
from
other
sources,
whether
by
caching,
transforming
or
simply
linking.
As described in section 6.3 Controlling Caching and section 6.4 Controlling Processing, there are a number of HTTP headers and other conventions that indicate how origin servers intend other servers to treat the resources that they publish. Servers that cache or reuse data from origin servers should obey these headers, which exist to ensure that the end user receives current information in the intended form.
Proxies
must
use
the
Via
HTTP
header
when
they
handle
requests
to
origin
servers,
to
indicate
their
involvement
in
the
response
to
the
user's
original
request.
Proxies
which
perform
transformations
on
content
must
include
a
Warning:
214
Transformation
applied
HTTP
header
in
the
response.
These and other recommendations for proxies which perform transformations are included in the Guidelines for Web Content Transformation Proxies 1.0 .
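A transforming proxy's response might therefore carry headers along these lines (a sketch; the proxy host name is hypothetical):

```
HTTP/1.1 200 OK
Via: 1.1 proxy.example.net
Warning: 214 proxy.example.net "Transformation applied"
Content-Type: text/html
```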
Many
licenses
require
the
reusers
of
information
to
provide
attribution
to
the
original
source
of
the
material.
This
attribution
must
be
human-readable,
so
that
users
of
your
website
understand
where
the
material
came
from,
and
may
also
be
computer-readable,
which
enables
automated
tools
to
track
the
use
of
material
on
the
web.
The
wording
and
positioning
of
attribution
is
usually
dictated
by
the
license
under
which
the
material
is
made
available.
For example, if you use the free icons available from Axialis Software, their license includes:
If you use the icons in your website, you must add the following link on each page containing the icons (at the bottom of the page for example):
Icons by Axialis Team

The HTML code for this link is:
<a href="http://www.axialis.com/free/icons">Icons</a> by <a href="http://www.axialis.com">Axialis Team</a>
If there is no explicit guidance about the location of attribution, it is recommended that attribution to material from a third party appear as close to the actual material as possible. Methods to make the attribution machine-readable include:
cite
attribute
on
the
<blockquote>
element,
where
a
portion
of
a
page
is
quoted
within
your
own
site
dc:source
property
with
microformats,
microdata
or
RDFa
to
indicate
the
source
of
a
portion
of
the
page
(identified
through
an
id
)
An example of clear attribution of material from another site is that of the BBC Wildlife Finder; the following screenshot shows the attribution within the page on the Pygmy Three-toed Sloth .
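Combining the methods above, a quoted passage with both human-readable and machine-readable attribution might be marked up as in the following sketch (the source URL is hypothetical):

```html
<!-- The cite attribute records, machine-readably, where the quote came from -->
<blockquote cite="http://example.com/articles/original-piece">
  <p>Quoted passage from the original article...</p>
</blockquote>

<!-- Human-readable attribution placed next to the quoted material -->
<p>Source: <a href="http://example.com/articles/original-piece">example.com</a></p>
```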
There
are
a
number
of
practices
around
linking
to
a
third-party
sites
that
can
help
users
and
automated
agents
to
understand
the
relationship
between
your
website
and
the
third
parties.
These
include:
rel="nofollow"
for
links
where
the
link
is
not
meant
to
imply
approval;
these
will
not
be
used
by
search
engines
when
determining
the
relevance
for
a
page
rel="external"
for
links
to
third-party
web
pages;
this
can
be
used
as
the
basis
of
styling,
such
as
an
image
that
indicates
the
user
will
be
taken
to
a
separate
site
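For example (hypothetical URLs):

```html
<!-- A link that should not pass endorsement on to search engines -->
<a href="http://example.com/user-submitted-page" rel="nofollow">a reader's suggestion</a>

<!-- A link marked as leaving this site, which a stylesheet can decorate
     with an "external link" icon -->
<a href="http://example.com/background-reading" rel="external">further reading elsewhere</a>
```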
There
are
a
number
of
techniques
that
can
be
used
to
track
which
links
are
followed
from
a
website.
Methods
that
rewrite
the
links
within
a
web
page
to
point
to
an
interstitial
("you
are
leaving
this
website")
page
or
through
a
script
can
mislead
the
user
and
any
automated
agents
about
the
target
of
the
link.
It
is
better
to
use
a
script
to
capture
onclick
or
other
events
and
redirect
the
user
at
that
point.
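A sketch of this approach is shown below (the logging endpoint is hypothetical): the href still points at the real destination, so users and automated agents see the true target, while a script records the click separately.

```html
<a href="http://example.com/article" class="tracked">An article elsewhere</a>

<script>
  // Record clicks on tracked links without rewriting their targets.
  document.addEventListener("click", function (event) {
    var link = event.target.closest && event.target.closest("a.tracked");
    if (link) {
      // Fire-and-forget request to this site's own logging endpoint;
      // navigator.sendBeacon would be a more reliable alternative.
      new Image().src = "/log-click?url=" + encodeURIComponent(link.href);
    }
  });
</script>
```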
In
conclusion,
publishing
on
the
Web
is
different
from
print
publishing.
This
document
has
enumerated
some
of
these
differences,
especially
those
relevant
to
licensing,
attribution
and
copyright
issues.
Recommended practices include:
A.2 Including
img
frames
iframes
object
AJAX
script
style
prefetching
offline
applications
Many
thanks
to
Thinh
Nguyen,
Rigo
Wenning,
Wendy
Seltzer
and
other
TAG
members
for
their
reviews
and
comments
on
earlier
versions
of
this
draft,
and
to
Robin
Berjon
for
ReSpec.js
.