18 Sep 2019


toml, christine, weiler, taraw


Kleber: Already two sessions, this is the third session on Ad Selection APIs from the Privacy Sandbox.
...: This is a problem I want to solve, and I would like to persuade you that it is a problem worth solving.
... A published study shows that there is a 52% revenue loss when you remove cookies.
... So some notion of user identity approximately doubles the amount of revenue a publisher receives from having ads on their site.
... The details show a breakdown for the top 500 publishers. It varies by industry. News, for example, loses 62%.
... I would like to make this difference much smaller. However, I want to do so without enabling cross-site linkability.
... In this morning's privacy threat model discussion we listed six properties. I think the one about identifying users across sites, i.e. linking together a user's browsing across the web, is hugely important.
... So I would like to do this in a way that does not allow the recreation of the user's browsing history.
... Two motivations: I like the web and want sites to be able to continue making money.
... Secondly, everything we're talking about to improve privacy could trigger an arms race. If the benefit of winning is billions of dollars, then someone is going to try and win it.
... I would like to make the gap between monetisation on the private web and on the non-private web smaller.
... That's the what and why, finally the how.
... My goal is to figure out how to get back ads targeted at something not on the current page.
... I have two APIs I would specifically like to talk about, but I want to discuss and get help.
... Part 1: What kind of information is ok to use? I don't mean in all circumstances - user consent, state, etc. can change all this.
... Part 2: once we've decided this information is ok to use, how do we avoid unintended consequences, e.g. re-enabling cross-site topics?
... For example, a user may self select ten topics of interest. However if they are the only user interested in these topics then they are now identified.
... The two APIs are FLOC and PIGIN.
... FLOC is about letting ads be targeted, not at you personally, but at a large enough group of people who are like you. Examples of what the browser might know already are your browsing history, interests you volunteer, or perhaps there's ML in the browser learning about your interests.

<dbaron> https://github.com/jkarlin/floc

<toml> +q to note that "Kinda like you" shared with 999 other people can be a pretty darn personal bucket, and can be extremely sensitive even if it's not narrow.
...: The important part of FLOC is the clustering of users based on that set of signals. This is some crypto-black magic I'd rather not get into, but I'd like to assume we can do clustering without sending that data to a server.
... So assume there's a way to group users with a similar set...

Hadley: Question, I don't understand how the browser understands everything I've done and how it reveals that information to an advertiser.

Kleber: First the browser decides you're in a particular cluster, e.g. 1234567.
...: Currently the advertiser observes a third party cookie in your browser.

Hadley: How is it sent?

Kleber: In a specific header sent to the advertiser.
...: The advertiser can choose to target ads at a particular cluster or FLOC.
...: The advertiser can observe how that FLOC behaves in aggregate, but not you specifically.
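
As a rough illustration of the aggregate-only observation described here, the following is a minimal sketch of advertiser-side bookkeeping keyed by cluster ID alone. The header name `X-Example-Cohort` and all function names are hypothetical; the minutes only say the cluster is sent "in a specific header".

```python
from collections import defaultdict

# Hypothetical header name; the minutes do not name the actual header.
COHORT_HEADER = "X-Example-Cohort"


def new_stats():
    """Create an empty per-cluster counter table."""
    return defaultdict(lambda: {"impressions": 0, "clicks": 0})


def record_impression(stats, headers, clicked):
    """Update per-cluster counters from one ad request.

    Only the cluster ID is read from the request; no per-user key is stored.
    """
    cohort = headers.get(COHORT_HEADER)
    if cohort is None:
        return  # the browser sent no cluster ID, so there is nothing to aggregate
    stats[cohort]["impressions"] += 1
    if clicked:
        stats[cohort]["clicks"] += 1


def click_through_rate(stats, cohort):
    """Aggregate click-through rate for one cluster, or None if it was never seen."""
    entry = stats.get(cohort)
    if not entry or entry["impressions"] == 0:
        return None
    return entry["clicks"] / entry["impressions"]
```

The sketch says nothing about other signals in the same request that could be joined with the cluster ID, which is the concern raised in the discussion that follows.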

Hadley: What stops the advertiser linking you to the FLOC?

Kleber: That was a problem we discussed as part of the Privacy Budget discussion. We need to constrain the amount of information that leaks about you. The FLOC is some of those bits of information that leak out about you.
...: Any questions?

Tom: You talk about Privacy Budget as being the mechanism for preventing the leaking of information. However, other things come in like screen resolution, connection characteristics, etc. However, your FLOC could be incredibly personal. Sexual preference, union membership, etc. If it's expressed in your browser history then it can contribute to the FLOC. The Privacy Budget would not cover this.
... This is a feature of the current advertising model, but I would like to stop it.

Brad: The Privacy Budget covers identifiability. The explainer talks both about k-anonymity and about not deviating from current population demographics.

Tom: What kind of limits are those? If it's homosexuality in 10% of the population, how is that prevented?

Kleber: There are two different problems being solved - fingerprinting versus sensitive characteristics.
... So, properties of the FLOC being too personal.
... It is plausible to build a cluster that has better privacy properties than k anonymity.
... This only works if you identify specific characteristics that you want to ensure you do not leak. Then build the clustering mechanism to not reveal those.

Tom: I don't understand what you're saying. I don't understand the difference between a FLOC that identifies people as being gay versus saying that's not possible. I'm not sure we can enumerate all the sensitive characteristics, but let's say there were 100.

Kleber: I can't fairly represent the details, but I can point you to papers on this and link them in the notes. However, I agree the list of sensitive characteristics is not something you can just list out.

Tom: Let's say we can definitely enumerate a set of characteristics to avoid. What is the basis information you need in order to do that?

Kleber: Yes, you need a training set.

Tom: So you would need a whole bunch of people who disclose their browsing history and sexuality and you need to repeat that for every single sensitive characteristic?

Kleber: yes, this is one of the paradoxes of privacy research. To avoid recognising the characteristic you need a system that can recognise the characteristic. T-closeness is the magic word to search on.

<dbaron> .... (K-anonymity and T-closeness)
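
For readers unfamiliar with the two terms, here is a minimal sketch of what the checks could look like for a set of clusters. It assumes a labelled training set of the kind Kleber describes, and it uses total variation distance as the distance measure for a categorical attribute; both the function names and that choice of distance are assumptions of this sketch, not something stated in the session.

```python
from collections import Counter


def is_k_anonymous(cluster_sizes, k):
    """True if every cluster contains at least k users."""
    return all(size >= k for size in cluster_sizes.values())


def distribution(values):
    """Relative frequency of each value in a list."""
    counts = Counter(values)
    total = len(values)
    return {v: c / total for v, c in counts.items()}


def total_variation(p, q):
    """Total variation distance between two discrete distributions."""
    keys = set(p) | set(q)
    return 0.5 * sum(abs(p.get(x, 0.0) - q.get(x, 0.0)) for x in keys)


def is_t_close(clusters, t):
    """True if each cluster's attribute distribution is within t of the overall one.

    `clusters` maps a cluster ID to the list of sensitive-attribute values
    of its members, i.e. the labelled training data.
    """
    everyone = [v for members in clusters.values() for v in members]
    overall = distribution(everyone)
    return all(
        total_variation(distribution(members), overall) <= t
        for members in clusters.values()
    )
```

A set of clusters can pass the k-anonymity check and still fail the t-closeness check, for example when every cluster is large but one of them consists entirely of users sharing a sensitive attribute.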

<Zakim> toml, you wanted to note that "Kinda like you" shared with 999 other people can be a pretty darn personal bucket, and can be extremely sensitive even if it's not narrow.

Tom: The solution I would propose, though it may be naive, is that we simply not disclose so much information by doing things like analysing browsing history.

<englehardt_> +q

John: We've had some offline conversations, but I want to bring them up here. I worry a lot about applying Machine Learning to people's behaviour to decide things about them. Even as part of a FLOC.
... The browser could actively tell the user, "I think you're gay" and "I want to broadcast this" and I don't think users would like this. Why do we want to apply ML to discover this?

<Zakim> christine, you wanted to ask about risk of "identifying" users with enough FLoCs

Christine: You say in the slide the FLOC can change over time. Is there a risk a user could be identified by linking a number of FLOCs? If I know you're in 43a, 43b, etc.

Kleber: If there is a first party site you visit over time where you're signed in, then they will be able to see the FLOC changing over time. They might be able to learn more about you over time. Sites you visit over time do have the opportunity to learn more about you, this is a chance to do that.
...: This doesn't cover changing the model of first party identity. So seeing the change in your FLOC over time is dependent on your ambient identity on the web.

Melanie: I share some of the concerns about not knowing what's sensitive, e.g. gender. At Microsoft, we've seen that some people really do like tailored ads. What about an opt-in model? I think that could be a more valuable model to advertisers.

Kleber: My goal is to figure out in what circumstances it's ok to use this information. User consent is needed and it's a big sliding scale of how involved they were in that choice.
...: In the case where you are using the browsing history, you may be fine sharing your top 10 visited sites and that feeds into a specific cluster. You get to review the information being used to build the FLOC.
... Likewise the user volunteering interests seems like a reasonable thing to use.
... However, it doesn't fully address Tom's concern. Those top 10 interests may be giving away something that they didn't intend.
... Clustering those interests puts you in a cluster of representative interests, not specifically your interests. However an advertiser could still use that cluster to derive something like sexuality.

Tom: That sounds like a good reason not to share any of that information.

Kleber: I think people may have different feelings about sharing seemingly innocuous information that leads to latent discoveries of sensitive information.

Stephen: We are specifically concerned about cross-site tracking that allows advertisers to discover more about users than they expect. This API seems like it still gives away information but is about preventing that arms race. Are we confident this is useful enough?

Kleber: Yes, the balance is this being useful enough for advertisers and it being information that the user is happy to volunteer about themselves.
... Now onto PIGIN.
...: FLOC was about interest based ads, or targeting people who are similar to you.
... PIGIN is about remarketing, aka those ads that follow you around the web.

<dbaron> https://github.com/michaelkleber/pigin
...: Ads based on the advertiser observing you did something in the past where they believe you would be interested in an ad, e.g. you added something to your basket and it's still there 2 hours later.
... Currently these ads are based off of cookies.
... PIGIN is about allowing the advertiser to do this without being able to track you and link your behaviour across sites.
... The advertiser gets to create interest groups and when they see the user do something, they can request the user is put into the interest group.
... It's up to the user agent what it does with that request.
... At some point later when you are browsing around the web, the browser sends a request to the network, and it sends the interest group the user is a member of.
... The browser can't reveal all the interest groups. However we will assume there is some magic crypto black box service that can pick the most valuable interest groups you are a member of that is also appropriately large enough.
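
The selection step described here could be sketched as follows. This is only an illustration of the rule "most valuable group that is also large enough": the function name, the value table, and the size threshold are all hypothetical, and the sketch ignores how the "crypto black box" would learn group sizes without seeing individual memberships.

```python
def pick_interest_group(memberships, group_sizes, group_values, min_size):
    """Return the most valuable group the user is in that has at least min_size members.

    Returns None when no membership is large enough to reveal.
    """
    eligible = [g for g in memberships if group_sizes.get(g, 0) >= min_size]
    if not eligible:
        return None
    return max(eligible, key=lambda g: group_values.get(g, 0.0))
```

For example, with a threshold of 1000, a user in a 5000-member group and a 40-member group would only ever have the first one revealed, even if the second is worth more to an advertiser.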

Tom: Those interest groups still could be very sensitive.

Kleber: Yes, if you visit a site and add something to your cart the site could ask you if it can show you ads. So, if you've been to kites.com and been to Harley Davidson then it may be revealed that you are in both.

Tom: The perceived innocuous nature of multiple data points is at odds with the conjoint data that results. We know the user is not able to reason about the impact of all those data points together.
... They then find they are being targeted on a resulting data point that they did not realise about themselves. This feels like we are setting up data leaks based on seemingly innocuous steps.

Kleber: In the event a person finds themselves targeted on a surprising characteristic, the user can tell that they were targeted on those specific interest groups. We can make it clear that the advertiser must reveal the types of ads targeted against a specific group.
...: The browser can help you find out why you were targeted with those ads.

Tom: I love the idea of creating an audit trail of what ads were served and why. I don't think what you've described covers the case where a combination of innocuous characteristics A, B, C, and D reveals a sensitive characteristic E. The user sees they are targeted on the innocuous ones, but they are not told about the other.
... Inferring things about people is definitely valuable.

Kleber: Would revealing only one of these groups solve this? If you're a New York Times subscriber and see that you're getting Harley ads one week and Kite ads the next, they have all that information.

Tom: I feel like you're making my point - just because I read those articles doesn't mean I want the site to know all these other interests.

Christine: This proposal has the feeling of making the browser more integrated into the ads system and becoming more involved in the ads role.

Kleber: That's fair. It's a new idea of the browser having specific APIs intended for the advertising use case. This is the browser saying that advertising is part of the way the web works and we are giving it first class support.

John: If you're pursuing this it would be good to make the signal to advertisers - broadcast or repository - be a separate thing. Other browsers can then choose to do different things. Whether the interest is deduced, chosen, or falsified.

Kleber: That's the intent. The browser can do whatever it wants, but it has the responsibility to maintain the k-anonymity.

<englehardt_> +q

Kleber: This is an attempt to solve a few points, but I'd love to keep the conversation going on wider topics - e.g. how to allow advertisements to show up in the browser based on more than just the page you're on that moment.
...: Some of these are more long term
... Moving a block of ads into your browser for private information retrieval.
... Or maybe when you're on Site A, that site gives the browser an ad to be displayed at a later date.
... This becomes something more similar to Brave's model where the ML happens in the browser.

Tom: It's a great system and it's going to save the web.

Kleber: Possible.

Stephen: You still run the risk of leaking information about the user even if selection happens on the device. For example if users click on ads that correlate with users that have diabetes then I can track that.

Kleber: That's unrelated to PIGIN or FLOC though, that can happen with other ads.

Tom: We limit that by requiring just one category on an ad.

Kleber: Summary - heavy scepticism.

Summary of Action Items

Summary of Resolutions

[End of minutes]

Minutes manually created (not a transcript), formatted by David Booth's scribe.perl version 1.154 (CVS log)
$Date: 2019/09/18 16:33:18 $
