Distributed and syndicated content

Abstract

The web has long had formats and mechanisms whereby content which canonically exists at one location is also available in a different form at a different location. Some of the oldest examples include RSS and other machine-readable syndication formats, and the newest include content platforms such as Blendle, Facebook's Instant Articles and Google's AMP Top Stories carousel.

This raises important issues concerning the primacy of URLs and origins on the web, and the ability for users to make judgments about the trustworthiness and provenance of information they encounter while using it.

Distributed content has compelling use cases and is well supported by fundamental web technologies such as hyperlinks and iframes, but some newer approaches can present security and privacy challenges.

1. Definitions

While the word 'syndicated' has long been used to describe the republishing of content by a third party, the emergence and growth of a new generation of mostly proprietary platforms has prompted the use of a new term within the media industry. We note that the term 'distributed content' has gained acceptance in that community (cf. the annual Reuters Institute Digital News Report), and that while the general mechanism can be applied to almost any kind of content, the most popular use case to date has been news.

In recognition of this, we use the term 'distributed content' in this finding, but do not intend to restrict the scope of the finding to news.

2. Trends in distributed content patterns

While there is no specification defining how distributed content platforms work or the behavior or features they exhibit to the user, modern distributed content platforms are characterized by the following trends:

Less attribution: Often, the distributed form does not provide a browser-verified, user-visible indication of the source origin.
Increased visual/functional fidelity: Distributed forms of content can use advanced web platform features and sophisticated styling, making them hard to distinguish from the original.
More frequently complete: The distributed form is increasingly a complete copy of the canonical form rather than a teaser or abstract.
Less canonical source referencing: Distributed forms sometimes do not link to the canonical form in a standardized way, such as with a <link rel="canonical"> element.
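
As an illustration of the last trend, the following minimal Python sketch looks for a <link rel="canonical"> declaration in a distributed copy and reports whether it points away from the host that served the copy. It is an assumption-laden example rather than part of this finding: the function names and the news.example and platform.example URLs are hypothetical, and a real checker would also need to handle HTTP Link headers and malformed markup.

```python
from html.parser import HTMLParser
from typing import Optional
from urllib.parse import urljoin, urlsplit


class CanonicalLinkFinder(HTMLParser):
    """Records the href of the first <link rel="canonical"> element seen."""

    def __init__(self) -> None:
        super().__init__()
        self.canonical_href: Optional[str] = None

    def handle_starttag(self, tag, attrs):
        if tag != "link" or self.canonical_href is not None:
            return
        attributes = dict(attrs)
        # rel holds space-separated tokens that are compared case-insensitively.
        rel_tokens = (attributes.get("rel") or "").lower().split()
        if "canonical" in rel_tokens and attributes.get("href"):
            self.canonical_href = attributes["href"]


def find_canonical_url(html: str, served_from: str) -> Optional[str]:
    """Return the absolute canonical URL declared in html, or None if absent."""
    finder = CanonicalLinkFinder()
    finder.feed(html)
    if finder.canonical_href is None:
        return None
    # Resolve relative hrefs against the URL the copy was served from.
    return urljoin(served_from, finder.canonical_href)


def points_off_platform(html: str, served_from: str) -> bool:
    """True if a canonical link exists and names a different host than served_from."""
    canonical = find_canonical_url(html, served_from)
    if canonical is None:
        return False
    return urlsplit(canonical).netloc.lower() != urlsplit(served_from).netloc.lower()


if __name__ == "__main__":
    # Hypothetical distributed copy served by a platform.
    page = '<html><head><link rel="canonical" href="https://news.example/story"></head></html>'
    print(find_canonical_url(page, "https://platform.example/a/1"))
    print(points_off_platform(page, "https://platform.example/a/1"))
```

A copy for which find_canonical_url returns None gives neither users nor tools a standardized way to get back to the canonical form.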

3. Potential concerns

Distributed content presents the greatest challenge to web architecture when it is high fidelity, has unclear attribution, is complete, lacks a reference to the canonical source, is distributed at scale, and is discovered serendipitously rather than via a conscious-choice opt-in mechanism. These challenges present as:

Source/provenance masking: User agents go to considerable lengths to expose the source of a web page to the user, including a visible URL bar, colorizing of domains, TLS certificate labeling, etc. Distributed content may include labeling and/or branding within the web UI in an attempt to replicate the user agent's native behaviors, but this is a poor substitute for browser-level labeling, since the platform controls that labeling and nothing prevents it from being misleading.
Origin policy: The web applies many security, privacy, and quota constraints to websites using the concept of an Origin. Distributed forms of content lose the connection to these policies when consumed on distribution platforms.
Permissions: Many features of the web require explicit user consent which is then attached to the entire origin. Distributed content that requires such permissions would result in a degraded experience when consumed off-origin, or the permission potentially being assigned to the distribution platform origin.
TLS: URLs and origins, when combined with TLS and the PKI system, provide the canonical proof of provenance of content. Distributed content does not rely on the trust badge of the content creator but rather that of the delivery platform, rendering the efforts of labeling and validating TLS status largely redundant as a signal of where the content came from.
URL fragmentation/pollution: Creating multiple URLs for the same content reduces the value of URLs as a means of identifying things, especially when sharing and linking to content via platforms such as social media. This problem is particularly acute when the distinct URLs exist on different origins, for example on m., wap. or app. subdomains or on .mobi domains.
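
To make the origin-related concerns above concrete, the sketch below computes the (scheme, host, port) tuple that browsers use as an origin and compares a few URLs. It is a simplified illustration, not a full implementation of the origin algorithm (it ignores opaque origins and non-HTTP schemes), and the example.com, news.example and platform.example URLs are hypothetical.

```python
from typing import Optional, Tuple
from urllib.parse import urlsplit

# Default ports for the two schemes this sketch handles.
DEFAULT_PORTS = {"http": 80, "https": 443}

Origin = Tuple[str, str, Optional[int]]


def origin_of(url: str) -> Origin:
    """Return the (scheme, host, port) tuple for an http(s) URL."""
    parts = urlsplit(url)
    scheme = parts.scheme.lower()
    host = (parts.hostname or "").lower()
    # An omitted port means the scheme's default port.
    port = parts.port if parts.port is not None else DEFAULT_PORTS.get(scheme)
    return (scheme, host, port)


def same_origin(first_url: str, second_url: str) -> bool:
    """True if both URLs share scheme, host and port."""
    return origin_of(first_url) == origin_of(second_url)


if __name__ == "__main__":
    # Same content on a mobile subdomain: a different origin.
    print(same_origin("https://example.com/story", "https://m.example.com/story"))
    # A re-hosted copy: the platform's origin, not the publisher's.
    print(same_origin("https://news.example/story",
                      "https://platform.example/news.example/story"))
```

Because permissions, storage quotas and security policies are keyed on this tuple, a copy served from the platform's origin shares them with every other publisher hosted there rather than with its source.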

The web platform has, over time, developed defenses against malicious content which lean on the ability for URLs and Origins to be a primary indicator of identity and trust, such as:

X-Frame-Options
Visibility of URL bar in popup windows
Punycode/Unicode deobfuscation in URL bar
Crowdsourced phishing blocklists and safe browsing lists

Many of these defenses, which are designed to protect users, can be undermined by mechanisms for serving distributed content that conflate the content of many origins within a single origin or remove content from its source.
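
As one example of how such a defense depends on origins, the following sketch approximates the decision a browser makes for the X-Frame-Options header. It is a deliberate simplification under stated assumptions: it considers only the immediate embedding page, treats unrecognized values (including the obsolete ALLOW-FROM) as imposing no restriction, and ignores the Content-Security-Policy frame-ancestors directive. The function names and URLs are hypothetical.

```python
from typing import Optional, Tuple
from urllib.parse import urlsplit

DEFAULT_PORTS = {"http": 80, "https": 443}


def origin_of(url: str) -> Tuple[str, str, Optional[int]]:
    """Return the (scheme, host, port) tuple for an http(s) URL."""
    parts = urlsplit(url)
    scheme = parts.scheme.lower()
    port = parts.port if parts.port is not None else DEFAULT_PORTS.get(scheme)
    return (scheme, (parts.hostname or "").lower(), port)


def framing_allowed(x_frame_options: Optional[str],
                    page_url: str,
                    framer_url: str) -> bool:
    """Approximate whether page_url may be rendered in a frame on framer_url."""
    if x_frame_options is None:
        return True  # No header: this mechanism imposes no restriction.
    value = x_frame_options.strip().upper()
    if value == "DENY":
        return False
    if value == "SAMEORIGIN":
        return origin_of(page_url) == origin_of(framer_url)
    # Assumption: any other value is treated as if the header were absent.
    return True


if __name__ == "__main__":
    # The publisher forbids cross-origin framing of its article.
    print(framing_allowed("SAMEORIGIN",
                          "https://news.example/story",
                          "https://platform.example/reader"))
```

Once a platform re-hosts the article on its own origin, a SAMEORIGIN check compares the platform with itself, so a restriction set by the publisher on the original page no longer applies to the copy.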

4. Recommendations

Sites which facilitate the consumption of distributed content should make efforts to avoid the concerns outlined above. The TAG believes in and hopes to strengthen the origin model, and has encouraged the Web Application Security Working Group in its work on Secure Contexts.

The TAG finds that it is essential to emphasize the value of the browser-level origin authenticity built into the Web platform as opposed to mechanisms that an untrusted content platform may choose to provide. Browsers are literally the "user's agent", and the model of the browser as a trusted gateway protecting the user from untrusted content is fundamental to balancing the needs of the user with the motivations of website owners, and therefore fundamental to the architecture of the Web.

The anchor element is designed to allow one website to refer visitors to content on another website, whilst retaining all the features of the web platform. We encourage distribution platforms to use this mechanism where appropriate. We encourage the loading of pages from original source origins, rather than re-hosted, non-canonical locations. We further discourage rewriting links within content with the purpose of keeping a user within a distribution platform.