Web Tracking and User Privacy in the Age of Ads Business Models

Karl Dubost, Opera Software A.S., March 2011 (Acknowledgements)

This paper outlines some of the issues that Opera Software A.S. might encounter in its businesses and products. It is offered as input to the W3C Workshop on Web Tracking and User Privacy (28/29 April 2011, Princeton, NJ, USA). Many of these challenges were already raised at the W3C Workshop on Privacy for Advanced Web APIs, held in London, UK, in 2010.

Opera Position Summary

Users’ ability to keep control over their own data is of paramount importance to Opera Software. User privacy starts with this control.
Browsers are part of the Web ecosystem: through them, users communicate their personal data, explicitly or implicitly, to online services. This makes the browser a critical piece in the management and communication of these data.
Enable users to opt out of Web tracking.
Provide simple data management mechanisms that users can understand.
Provide built-in secure environments in the browser.
Protocols, UI and policies are all part of this ecosystem for managing and controlling data.

Browsers and Control of Online Identities

Privacy is not just a technological issue. Technologies have a deep impact because they create, enforce, or destroy the social contexts in which privacy exists. That said, we believe that technological solutions alone do not amount to user privacy. Privacy expectations differ with each individual’s culture and with national legislation.

We should be careful about using the word « Privacy » when we sometimes mean being in control of our own identities. Understanding and controlling the aggregation of data that shapes our identities is a core question, and the chosen technologies can have a significant impact on it.

Network Infrastructure

The first source of Web Tracking is the network infrastructure. Internet protocols such as HTTP, SMTP, POP, etc. use IP addresses for communication between two pieces of software. Some systems, such as Web proxies, have given people a way to add a layer of opacity between their real IP address and the service they want to reach. This relies on initial trust in the chosen HTTP proxy. More recently, systems such as Tor, which only a limited number of people use, have created a pool of IP addresses for accessing services « anonymously ».

In the context of geolocation, some devices broadcast their Ethernet addresses, which makes controlling one’s own identity an even deeper challenge.

Over the coming months and years, IPv6 deployment will expand because IPv4 addresses are running out. This will pose a far greater challenge than those addressed by recent initiatives. In theory, IPv6 makes it possible to identify each individual user, because every device can have its own unique IP address.
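To make the concern concrete, here is a minimal sketch of one IPv6 addressing scheme, stateless autoconfiguration with a modified EUI-64 interface identifier, in which the low 64 bits of the address are derived from the device's Ethernet (MAC) address. The MAC address and the two prefixes are made-up documentation values, and the function names are hypothetical. Other addressing schemes exist that do not embed the hardware address, so this illustrates the risk rather than describing every deployment.

```python
import ipaddress


def eui64_interface_id(mac: str) -> bytes:
    """Build a modified EUI-64 interface identifier from a 48-bit MAC address."""
    octets = bytearray(int(part, 16) for part in mac.split(":"))
    if len(octets) != 6:
        raise ValueError("expected a MAC address with six octets")
    octets[0] ^= 0x02  # flip the universal/local bit
    # Insert ff:fe between the two halves of the MAC address.
    return bytes(octets[:3]) + b"\xff\xfe" + bytes(octets[3:])


def autoconfigured_address(prefix: str, mac: str) -> ipaddress.IPv6Address:
    """Combine a /64 network prefix with the MAC-derived interface identifier."""
    network = ipaddress.IPv6Network(prefix)
    if network.prefixlen != 64:
        raise ValueError("this sketch assumes a /64 prefix")
    interface_id = int.from_bytes(eui64_interface_id(mac), "big")
    return ipaddress.IPv6Address(int(network.network_address) | interface_id)


# Hypothetical device seen on two different networks (documentation prefixes).
MAC = "00:1a:2b:3c:4d:5e"
at_home = autoconfigured_address("2001:db8:1::/64", MAC)
at_cafe = autoconfigured_address("2001:db8:2::/64", MAC)

LOW_64_BITS = (1 << 64) - 1
same_device = (int(at_home) & LOW_64_BITS) == (int(at_cafe) & LOW_64_BITS)
```

Under this scheme the network prefix changes from place to place, but the interface identifier does not, so a service that sees both addresses can link them to the same device.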

The first limitation on one’s ability to control one’s identity is the network infrastructure itself.

Browsers

Browsers are the main tool for accessing the Web and communicating with online services. They are a critical tool through which we create our identities. A single action is, most of the time, benign. A long-term aggregation of data, however, creates a profile that we ourselves are not aware of. Users should be aware of what they are doing and should have an environment simple enough to manage their own data inside the browser. What are the sources of these data?

To make interaction with the Web better, browsers offer features that are useful and critical at the same time. Here are a few sources of data that can be collected and used for identification:

Search engine box: All browsers have a system for searching the Web. Each query is sent to a search engine, which can then relate the queries to one specific user or set of users (shared computers).
Pre-filled Forms: Browsers let users store data that is then filled into forms automatically. This can become a source of data collection in some circumstances if the user is not fully aware of the consequences.
Device Configuration: The configuration of the browser and the device (list of fonts, screen size, languages, etc.) is accessible through the DOM and HTTP headers. In many cases the combination is unique enough to identify users, who have almost no control over it.
Cookies: Cookies store a small amount of information, and they have become universal: only a minority of sites do not use them. For each online service, they create a single id used for saving preferences as well as for tracking your behavior while using the service. Some of them are transmitted in clear text over the network. Browsers provide UI to accept or decline cookies. Opera, for example, gives the choice to “accept all cookies”, “accept only the cookies from the domain we are visiting”, or “never accept the cookies”. In addition, the browser can delete all cookies when quitting and can ask for confirmation before recording a cookie. In daily use, none of these solutions is practical: the browsing experience becomes a clicking game. Deleting the cookies also means that the user has to remember and re-enter his/her login and password the next time, or lose his/her preferences. Cookies are also, most of the time, not granular for each type of information: one cookie can be a key id hiding a lot of configuration stored on the server itself. Users cannot choose which information is collected or remembered. They can remove cookies individually through a dashboard. The only built-in limitation is size: a single cookie cannot store more than about 4 KB of information. Nearly all analytics and tracking tools use cookies.
Local/Session storage: This is a new feature from browser vendors that helps design applications which also run offline, by pushing much of the application logic to the client side. It makes it easier to have a fluid interaction in a changing context such as mobile use or difficult network access (countryside). Applications can store about 5 MB per domain name. This creates a lot of opportunities, including a bigger opportunity for collecting data about users and storing them for a longer term. It amplifies the issues created by cookies. As Web applications grow more complex, it is unlikely that users will be able to control in a reasonable way what is kept and communicated.
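The device configuration item in the list above can be illustrated with a small sketch of how a set of configuration values might be reduced to a single identifier. The attribute names, the sample values, and the choice of SHA-256 are assumptions made for illustration; real fingerprinting scripts collect many more signals and are not described in this paper.

```python
import hashlib
import json


def configuration_fingerprint(config: dict) -> str:
    """Hash a browser/device configuration into a short, stable identifier."""
    # Serialize with sorted keys so the same configuration always hashes the same way.
    canonical = json.dumps(config, sort_keys=True, separators=(",", ":"))
    return hashlib.sha256(canonical.encode("utf-8")).hexdigest()[:16]


# Hypothetical values of the kind exposed through the DOM and HTTP headers.
visit_monday = {
    "fonts": ["Arial", "Helvetica", "Times New Roman"],
    "screen": "1280x800",
    "languages": ["fr", "en"],
}
visit_friday = {
    "languages": ["fr", "en"],
    "screen": "1280x800",
    "fonts": ["Arial", "Helvetica", "Times New Roman"],
}
other_device = dict(visit_monday, screen="1920x1080")

monday_id = configuration_fingerprint(visit_monday)
friday_id = configuration_fingerprint(visit_friday)
other_id = configuration_fingerprint(other_device)
```

No cookie or stored data is involved: the same configuration yields the same identifier on every visit, which is why deleting cookies does not help against this source of identification.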

It is interesting to note that some people have proposed strategies that take away users’ ability to control their data. For example, evercookie recreates its tracking identifiers by using different types of containers such as cookies, LSO (Flash cookies), PNG storage, local storage, etc. When the user destroys one container, the others automatically recreate it.
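The respawning behavior described above can be sketched as a simulation in which each storage container is modeled as a plain in-memory dictionary. This is not evercookie's actual code; the class name, the container names, and the `uid` key are hypothetical, and the point is only to show why clearing a single container is not enough.

```python
from typing import Dict, Optional


class RespawningIdentifier:
    """Keeps one identifier mirrored across several simulated storage containers."""

    def __init__(self, containers: Dict[str, Dict[str, str]], key: str = "uid") -> None:
        self.containers = containers
        self.key = key

    def read(self) -> Optional[str]:
        """Return the identifier from the first container that still holds it."""
        for store in self.containers.values():
            value = store.get(self.key)
            if value is not None:
                return value
        return None

    def sync(self, fresh_id: str) -> str:
        """Reuse a surviving identifier if any, then rewrite it into every container."""
        identifier = self.read() or fresh_id
        for store in self.containers.values():
            store[self.key] = identifier
        return identifier


# Three simulated containers, all empty on the first visit.
containers = {"cookie": {}, "local_storage": {}, "flash_lso": {}}
tracker = RespawningIdentifier(containers)

first_visit = tracker.sync("id-1234")   # new identifier written everywhere
containers["cookie"].clear()            # the user deletes their cookies
second_visit = tracker.sync("id-9999")  # the surviving copies win
```

After the second visit the cookie container again holds the original identifier, so the user would have to clear every container at once to actually reset it.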

The issue becomes even more complex with applications that use the Web and the HTTP protocol but have a different chrome, one that does not provide the usual information management options that a browser offers. For example, applications or Web widgets such as feed readers, mailers, etc. use the HTTP protocol and all its possibilities for accessing and interacting with content, while providing few if any options for controlling the data.

Services

We have seen that browsers are repositories of personal data. They are also the mediators in using online services. Many of these services operate because they can use the personal data of users.

The user experience is tied not only to the product itself but also to services that give access to a specific identity in different contexts. Email and data storage, among many others, are common online services with strong implications for Web Tracking and User Privacy.

Tracking can be a matter of analyzing the content itself. For example, a user of an online mail service could require everything to be encrypted and decrypted on the client side, but would then not benefit from search features. On the other hand, many online services operate by profiling user data and sending advertisements. It is therefore important to figure out how to make users properly aware of this, be it on the service itself or in the browser UI.

Strategy Against Web Tracking

In the age of business models based on ads, services strongly resist abandoning tracking features. The more blocking features browsers offer, the more strategies services will create to circumvent them. The Web has been built in a specific social context with trust as a premise. Finding the right mechanism will pose challenges that are not only financial but also technical.

The « Do Not Track » mechanisms are an interesting experiment, but they also rely on a system of trust and assume that people and services will act in good faith.
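The reliance on trust can be seen in a minimal server-side sketch, assuming the proposed `DNT: 1` HTTP request header. The function name and the header dictionaries are hypothetical. The browser can only send the header; whether the check below is ever run, and whether its result is respected, is entirely up to the service.

```python
from typing import Mapping


def tracking_allowed(request_headers: Mapping[str, str]) -> bool:
    """Return False when the request carries a Do Not Track preference of 1."""
    # HTTP header names are case-insensitive, so normalize before looking up.
    normalized = {name.lower(): value.strip() for name, value in request_headers.items()}
    return normalized.get("dnt") != "1"


# Hypothetical requests: one with the preference set, one without.
opted_out = tracking_allowed({"User-Agent": "Opera/9.80", "DNT": "1"})
no_preference = tracking_allowed({"User-Agent": "Opera/9.80"})
```

Nothing in the protocol lets the user verify the outcome, which is why the mechanism needs good faith from services or a legal framework to enforce it.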

We think that strategies related to Web Tracking and Data Control should rely on a few principles:

What data do I input? (And what do I know about this input?)
How do I block access to my local data? (And what do I know about the attempts to access them?)
How do I remove, modify, or create my local data?
How do I remove or modify data hosted on a remote service where I’m identified?
How do I remove or modify data hosted on a remote service where I’m not identified?

Finding the appropriate technologies and protocols that enable users to control their data is of great interest to Opera Software. It is challenging and requires expertise from many areas: legal, technological, UX, security, etc. Some technological choices that are easy to develop might have negative consequences for users, such as giving a false sense of trust and/or security. Sometimes a technology without a legal framework to enforce it will have no practical effect for the user.

The Web industry is facing an interesting question: Do we have to be identified to not be tracked?

Acknowledgements

Charles McCathieNevile, Opera Software
Haakon F. Bratsberg, Opera Software
Kristina N. Kjerstad, Opera Software