W3C

– DRAFT –
Improving Web Advertising BG

12 January 2021

Attendees

Present
apascoe, AramZS, arnaud_blanchard, BenHumphry, bleparmentier, bmay, dialtone, dinesh, dkwestbr, ErikAnderson, GarrettJohnson, jeff_burkett_gannett, joelstach, jrobert, jrosewell, Karen, KevinG_, kleber, kris_chapman_, marguin, Mike_Pisula_Xaxis, Mikjuo, mjv, pl_mrcy, shigeki, wbaker, weiler, wseltzer
Regrets
-
Chair
Wendy
Scribe
Karen

Meeting minutes

Wendy: Welcome

Wendy: We are about to get started on our agenda review

Agenda-curation, introductions

Wendy: I'll start with agenda curation and introductions
… we have discussion of Scaup API from Gang
… at the end of the last call I heard folks on irc wanting to get updates on the standardization work
… and updates on proposals we have previously heard; where are they now
… invite people to share updates or request updates
… so that others working on them might prepare that
… Arnaud just today sent a note
… on a proposal for a testing environment for

<wseltzer> https://github.com/criteo/privacy/tree/main/TEETAR

Wendy: cohort bidding experiment
… called TEETAR
… maybe we can get an introduction to that
… Any other business?
… any introductions, new participants who would like to introduce themselves to the group?

Scaup API

Wendy: hearing none, we will jump right to our first agendum
… Gang, thank you for your patience and for returning to us with discussion of Scaup

Gang: thank you, Wendy; good morning everyone
… Let's talk about Scaup
… let's start with the problem itself

<wseltzer> https://github.com/google/ads-privacy/tree/master/proposals/scaup

Gang: at a very high level, we use ML to optimize key business performance indicators
… for publishers it could be retention or user experience
… for advertisers could be most efficient way to reach customers
… we rely on models
… rely on cross-domain data, collecting model training labels from other domains that provide user data
… Today we use third-party cookies to collect cross-site data
… how can we have models to optimize those KPIs
… from that perspective, one problem we tried to address is how to deliver the utility
… to use models to optimize KPIs
… while at the same time delivering a strong privacy guarantee to the user
… so utility and a privacy guarantee together are the problem we try to solve
… In this Scaup explainer, we explore direction of setting up a trusted server that can take trusted user data and train the ML models
… and by monitoring or @ can give a privacy guarantee
… as ML model is a pretty big topic
… if you have ML infrastructure for privacy, you can do many things
… we are shooting for similar audience, or look-alike targeting
… Can you see my screen?

[yes]
… This is the overall architecture I am going to talk about
… it can support audience targeting and other ML use cases in future
… a key thing in this architecture is that the models are trained on two servers, running some secure multi-party computation code
… in the TPAC meetings a few months ago, Ben from Facebook presented a high-level tutorial of multi-party computation
… a high-level summary is in this two-party setup, if one party remains honest
… we can guarantee no user data leak
… we hope to find some practical crypto design to deliver this guarantee that is still practical to industry to solve ML problem
… On top of this infrastructure, a tech company can train the ML model to solve a business problem
… the ML model ownership is at the company
… similar to situation where you train your ML model on AWS, and you still own the right, ownership of the model
… Let's look at architecture at high level
… User navigates on internet across different domains
… lots of things happen, page views, clicks, etc.
… when those events happen, the ATP gets an HTTP request
… can ask the browser to please store this feature on my behalf
… a feature is a description of what happened, i.e., the user clicked on an ad for shoes
… browser takes feature specified by adtech provider
… that is step two
… when user navigates on web for a while
… the browser collects a lot of features from the ATP
… please collect all the features in this time frame
… and aggregate all the features across this time frame
… could be a sum or average
… that result is called a user profile
… can ask browser to upload user profile
… can use for prediction purpose later
… the first key component in this design is where the browser, under instruction of the ATP, constructs
… and uploads the user profile using a strict crypto protocol to the MPC cluster
… the two servers, MPC 1 and MPC 2, will execute the secure ML protocol to train a model
… in this explainer we were not specific about the models you can train
… we can train different models in the MPC
… there is the K-nearest-neighbor model
… a neural net might be feasible to train in the MPC cluster
… So step four, the two servers do the training
… and the training result is a model in the MPC cluster
… at this point we have a model
… and the ATP may ask the browser to please query the model in the MPC cluster to see which user list you should join
… or if model is for other use cases, query model and get a different prediction back
… that is where step five happens
… browser uploads user query and profiles
… the ML prediction result is a number of IGs the browser should join
… the browser joins those IGs
… and step six, the browser allows the adtech industry to monetize using TURTLEDOVE (TD) or some variant
… we are leveraging the TD API
… to enable the ML prediction result
… we are leveraging all the security and privacy efforts put into TD
… we can deliver a guarantee that cross-site user data can be held securely while ML models are trained
… that is a high-level view of how this works
… any questions about how this works?

<wseltzer> https://github.com/google/ads-privacy/tree/master/proposals/scaup#design-overview
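
To make the six steps more concrete, here is a minimal Python sketch of the browser-side part of the flow (steps 2 and 3): storing ATP-supplied feature vectors, aggregating them into a user profile, and splitting the profile into two secret shares for upload. Every name, dimension, and parameter below is a hypothetical illustration; the explainer does not define a concrete API or crypto protocol.

    import secrets

    MOD = 2**16  # toy modulus for the secret-share arithmetic

    # Step 2: feature vectors the ATP asked the browser to store
    # (illustrative 3-dimensional vectors, one per event)
    stored_features = [[1, 0, 3], [0, 2, 1], [1, 1, 0]]

    # Step 3: aggregate the features into a user profile
    # (element-wise sum here; an average is the other option named)
    profile = [sum(column) for column in zip(*stored_features)]

    # Step 3, continued: split the profile into two secret shares,
    # one per MPC server, so neither server sees the profile itself
    share_for_mpc1 = [secrets.randbelow(MOD) for _ in profile]
    share_for_mpc2 = [(p - s) % MOD for p, s in zip(profile, share_for_mpc1)]

    # Steps 4 and 5 (training and querying) happen inside the MPC
    # cluster; step 6 would be the browser joining the interest groups
    # returned by the model, via the TURTLEDOVE machinery.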

Wendy: yes, you have a queue

Aram: thanks for putting this together for us
… there are 8K adtech vendors
… and they provide models
… at what point are there speed or responsiveness issues?
… how would that work?
… second question is what happens if we cannot trust any of the MPC servers?
… trust conversations about Google and bids and AMP
… if we are arguing about whether we can trust Google and AMP
… then trusting an adtech vendor
… seems to be a big obstacle, assuming that every single person is trustworthy in this environment

Gang: First question is whether we should put ML functionality onto browser or onto a server on the Internet
… if you put onto server on Internet, you can run complex models
… how many models can I train inside browser was one question
… I don't have data we can talk about
… I imagine if user is using a cell phone and you are training an ML model on a cell phone, battery could be an issue
… we need to see how many ML models we can train on a user's device
… server side, computation and power are not an issue; it's a trade-off
… Second question you ask is trust model
… should we trust MPCs
… if they become malicious, what is the loss?
… if only one of the two parties is malicious and does bad things to injure the industry or privacy
… if only one, there is a guarantee of no leak
… that is achievable with crypto design
… if both parties collude and are malicious
… we will lose data
… that's a scenario we want to avoid
… Go back to question of who should run the two servers
… and why trust the two servers
… that is a separate discussion we should have

Aram: that is helpful

Gang: if only one party is malicious, we won't have data leak
… that we can guarantee from crypto design perspective

Aram: my question was not how the models are being trained, but whether they have to be stored on the browser

Gang: correct, in this model the models are stored on the MPC servers, not on the user's browser

Aram: thanks

Ben: thanks for the proposal, Gang
… about that last question
… first, you said the models are trained and evaluated in the MPC server
… where is this storage happening?
… I imagine there is
… previous MPC designs Google shared were transient
… sounds like this requires ongoing storage, correct?

Gang: yes, correct
… you can imagine that MPC 1 and 2 are in a web server, like AWS

Ben: do you envision ML models are trained continuously or in a batch?

Gang: probably start at first with batch if first phase is successful

Ben: will it store the secret-shared training data?
… how much space is needed?

Gang: depends upon kind of model to train
… for a neural net, you have to store all the data
… after training the MPC servers can throw away the data, and just keep the models
… that is one scenario if you are training neural net
… if you are training the K-nearest-neighbor model, you have to remember all the neighbors, so that is pretty big
… we don't have any estimation yet on storage requirement for nearest neighbor model

Ben: my basic understanding of how things work
… if browser is sending to server and that is shared
… that would constitute data processing under GDPR
… sounds like it boils down to consent and less good alternatives
… do you envision the user gives consent to process their data?

Gang: I am not a lawyer
… this is not legal advice
… in general, we want to make sure the user understands what is happening
… if this becomes a reality, I imagine it becomes part of the browser API and they have to acknowledge what is their obligation
… may become a browser-vendor decision on how to give such choice

Ben: your answer makes sense

Mehul: thank you Gang for your interesting proposal
… any thought on how the crypto exchange will happen between MPC 1 and 2
… how key exchange happens
… with every @ requires
… more encrypted format, compute the variant
… how does it actually look at ground level

Gang: That is a great question
… to be clear we need to flesh out the crypto design details
… we need to publish those so experts can take a look
… to make sure it guarantees what I claim here
… crypto experts and software engineers need to work together
… to make sure it's practical from latency and cost perspectives
… guarantee correctness, that it's practical
… we are hoping to get details published so that more people can look at it
… so that crypto design makes sense
… this proposal itself does not contain those crypto details
… more of a high level explanation of what we are trying to do

Mehul: MPC browser....is browser giving in first step or @

Gang: We are thinking about partial information, in terms of secret shares
… I should explain how secret shares work
… you split something you don't want people to know into two secret shares
… one secret share goes to MPC 1 the other to MPC 2
… each one looks totally random
… the crypto still allows the two MPC servers to do pre-approved @ to achieve some meaningful result
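
As a minimal illustration of the secret-share idea just described, here is a Python sketch of additive secret sharing; the modulus and function names are assumptions for illustration, not part of the proposal.

    import secrets

    MOD = 2**32  # arithmetic happens in a fixed ring

    def split(value):
        """Split a value into two shares; each alone looks random."""
        share1 = secrets.randbelow(MOD)   # goes to MPC 1
        share2 = (value - share1) % MOD   # goes to MPC 2
        return share1, share2

    def reconstruct(share1, share2):
        """Only a holder of BOTH shares can recover the value."""
        return (share1 + share2) % MOD

    a1, a2 = split(42)
    b1, b2 = split(100)
    assert reconstruct(a1, a2) == 42
    # A pre-approved computation: additive shares can be added
    # share-wise, so the two servers can compute a sum over data
    # neither of them can see on its own.
    assert reconstruct((a1 + b1) % MOD, (a2 + b2) % MOD) == 142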

Mehul: wondering if it can compute the gradient
… or have a more complete model for each of them individually

Gang: no user data is leaked to either server
… that is the guarantee we want to follow
… if the model is still sensitive from the privacy angle, then the model needs to stay in secret shares

Mehul: that answers it... inference requires both servers

Gang: yes, we are thinking about all the relevant details and hope to share with industry

Brian: I have a couple questions
… not sure what you mean by "honest"
… can you flesh that out a bit?

Gang: I can answer from crypto perspective or non-crypto perspectives
… from the crypto perspective, the "honest but curious" model
… means the two servers are executing the crypto protocol correctly
… but they may want to learn more about the data
… they are not intentionally deviating
… that is what 'honest' means
… otherwise, not intentionally compromising business info

Brian: is there deterministic information...
… can maintain a set of data identified with a specific browser?

Gang: they are not designed to
… not designed to hold that information

Brian: so browser passes set of data to MPC server
… and browser goes back and tells me in this context, who I am?

Gang: yes, the browser will ask the MPC server to make a prediction: who am I

Brian: how does browser get info

Gang: in step five A, the browser prepares a query
… it may be the user's profile calculated using data provided in step 2
… the user profile is one of the parameters in five A and B
… the profile is the input to the ML model
… the ML prediction result looks like: you are more similar to the people in Nike's men's running shoes list, therefore you should join the Nike men's running shoes interest group
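
As a rough illustration of the kind of prediction Gang describes (a K-nearest-neighbor lookalike lookup), here is a small Python sketch. In the real design the labeled profiles and the query would remain in secret shares inside the MPC cluster; they appear in the clear here only for readability, and all of the data and names are invented.

    # Hypothetical labeled profiles held by the MPC cluster
    training_profiles = [
        ([5.0, 0.2, 1.0], "nike-mens-running-shoes"),
        ([0.1, 4.0, 0.3], "acme-watches"),
        ([4.5, 0.5, 0.8], "nike-mens-running-shoes"),
    ]

    def predict_interest_group(query, k=3):
        # Rank training profiles by squared Euclidean distance to the query
        by_distance = sorted(
            training_profiles,
            key=lambda item: sum((a - b) ** 2 for a, b in zip(item[0], query)),
        )
        # Majority label among the k nearest neighbors
        labels = [label for _, label in by_distance[:k]]
        return max(set(labels), key=labels.count)

    # A profile similar to the Nike buyers joins that interest group
    print(predict_interest_group([4.8, 0.3, 0.9]))  # nike-mens-running-shoes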

Brian: Do all ATPs share a limited set of ML processes, or is there a facility to put up their own servers and create their own ML processing?

Gang: who will be running MPCs needs to be discussed
… which browser vendors and industry
… there are two MPC servers, operated by two non-colluding parties
… with that assumption, the MPCs do the model predictions on these two servers
… by doing predictions on those non-colluding parties, it is easier to deliver the privacy guarantee
… easier to get their trust with this direction
… does that answer?

Brian: yes
… can you have more than two MPC servers?

Gang: if you look at research papers
… multi-party computation beyond two is possible theoretically
… but there is more cost
… how to get all those parties on board, and the engineering costs to maintain all those servers and the latency costs
… we should discuss as an industry
… make the privacy guarantees and see what industry can afford
… I can go over details quickly
… First part is about how does the ATP tell the browser this is my data, my feature describing this event

<wseltzer> https://github.com/google/ads-privacy/tree/master/proposals/scaup#design-details

Gang: When the user browses the internet, HTTP requests go out, and the ATP gets a request
… if about a page view event
… user visited a page
… can encode what kind of page, sports, outdoor running or a specific brand name
… ATP has freedom to decide @ vector and dimension of the vector
… one dimension could be if this page is about sports
… ATP decides what the feature vector looks like and how to encode information into the feature vector
… when ATP has no direct presence in browser itself
… then receives request via SSP

[missed]
… pass information to the browser and take the feature vector from ATP 1 or the specific model of a specific ATP
… and browser stores specific vector
… Let's look chronologically at what is happening
… each star represents a specific event
… ATP provided a vector describing that event
… red dot is conversion event happened
… when that happens, ATP can collect all the feature events I told you about during this time frame
… so four events in this time frame
… specified by ATP
… ATP asks browser to collect
… do an aggregation, and that result is the user profile to send to server for training or prediction
… ATP has control
… about training features and data and provides security/privacy guarantee
… We talk about MPC server, user profile creation steps
… user profiles are created; this is an example
… the browser splits the profile into secret shares
… each server cannot make any sense out of its share, but whoever has access to all the secret shares can combine them to recover the secret message

<wseltzer> https://github.com/google/ads-privacy/tree/master/proposals/scaup#training-data-collection
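
A small Python sketch of the training-data collection just described: a conversion event (the red dot) triggers collection of the feature vectors stored within the ATP-specified time frame, which are then aggregated into the user profile. The window length, timestamps, and vectors are invented for illustration.

    from datetime import datetime, timedelta

    # (timestamp, feature_vector) pairs the browser stored for one ATP
    events = [
        (datetime(2021, 1, 3), [1, 0, 0]),   # e.g. viewed a page about clocks
        (datetime(2021, 1, 5), [0, 1, 0]),   # e.g. viewed a page about watches
        (datetime(2021, 1, 9), [0, 1, 1]),
        (datetime(2021, 1, 11), [1, 0, 1]),
    ]

    conversion_time = datetime(2021, 1, 12)  # the "red dot" purchase event
    window = timedelta(days=14)              # time frame chosen by the ATP

    # Collect the events that fall inside the time frame...
    in_window = [v for t, v in events
                 if conversion_time - window <= t <= conversion_time]

    # ...and aggregate them (a sum here; an average is the other option)
    profile = [sum(column) for column in zip(*in_window)]
    print(profile)  # [2, 2, 2]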

Wendy: there is another question

Raj: thank you
… one quick question, how does model get shipped over to the MPC servers?

Gang: If you go back to look at this architecture
… browser ships user profile to MPC servers and it trains the model using data collected in A and B
… stays in server
… ATP tells MPC, please train nearest neighbor
… ATP does not ship model to MPC

Raj: what if ATP has own proprietary model they want to run

Gang: when you say proprietary model, say a neural net
… the ATP knows what it needs...
… knows the layers already
… ATP can tell MPC to train the neural net, using these layers
… ATP can do that
… in this design if ATP has the model trained
… and knows result, we are not thinking about uploading an existing model
… Why would you want to upload an existing model?

Raj: try to keep separation of shipping user data; and computing with some proprietary models you can run

Gang: we can talk about the details of this use case; I was not thinking about how to use and upload an existing model; we should talk on GitHub
… now we know how it specifies the training data, let's talk about the privacy guarantees
… if only thinking about a two, non-colluding computation model
… we have no user or business confidential data leak
… if one party goes malicious, the other party remains good, no user data leak

<wseltzer> https://github.com/google/ads-privacy/tree/master/proposals/scaup#privacy-guarantees

Gang: but at that point, the malicious party can screw up the model
… we don't get utility
… predicting garbage if one party goes bad
… if we design crypto protocol carefully, the browser can detect something is wrong with the other server
… and raise an alarm
… if both parties go bad, then we don't have the privacy guarantee any more
… We talked about user privacy
… Transparency and control topic
… as Ben pointed out previously, we need

<wseltzer> https://github.com/google/ads-privacy/tree/master/proposals/scaup#transparency-and-control

Gang: the browser needs to give user some sort of transparency and control
… explain this is happening
… when MPC server asks browser to join an IG
… you joined IG for this reason
… that detail is out of scope for this
… transparency and control is a big topic; we need to address this requirement properly
… Last but not least, the adtech company's privacy is guaranteed
… data split into multiple secret shares, model likely as well
… we want to protect adtech data as confidential as well as user privacy
… Publisher controls
… publishers may want control over which ATP can use data on publisher's domain
… separate discussion on offering publishers capability on Turtledove
… continue that discussion I expect
… that is the proposal we have
… we are looking for industry feedback, browser vendor feedback
… we look forward to working together to see how we can use this proposal to meet privacy and utility goals

Wendy: Thank you, Gang, we have added that link to the repository

Brian: am I correct in the assumption that there is some kind of persistent relationship between ATP and the browser or user
… how does ATP develop the feature vector and know it applies to a specific browser

Gang: let's say ATP gets a request

[missed]
… in that ad request there may not be...front page may be about news
… ATP can decide what information to encode in the feature vector
… Let me use

[missed audio]
… Hawaii vacation on beach
… ATP feature vector may say whether page is vacation
… number two if beach or mountain location
… vector three is location
… ATP is in full control of the number of features and the meaning of each dimension
… if it's to help Nike to sell shoes
… is it about outdoor running or exercising
… that belongs to every ATP
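
For concreteness, a Python sketch of the ATP-defined encoding in this example; the schema and field names are hypothetical, since each ATP chooses its own number of dimensions and their meanings.

    # Hypothetical ATP-defined schema: dimension 1 = vacation page?,
    # dimension 2 = beach location?, dimension 3 = a location code
    def encode_page(page):
        return [
            1.0 if page.get("topic") == "vacation" else 0.0,
            1.0 if page.get("setting") == "beach" else 0.0,
            float(page.get("location_code", 0)),
        ]

    # A page about a Hawaii beach vacation
    vector = encode_page({"topic": "vacation",
                          "setting": "beach",
                          "location_code": 15})  # 15 standing in for Hawaii
    print(vector)  # [1.0, 1.0, 15.0]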

Brian: when you have the diagram of training data collection
… clocks and watches, then purchase
… suggests it tracks browser across time

Gang: that is not intentional
… as those events happen, each event tells the ATP somebody is looking at shoes, watches, cars
… no unique user identifiers to chain those events together
… ATP cannot tell all those events are happening in same browser
… ATP knows if someone viewing page about shoes, it should generate a vector about shoes
… doesn't know if it's from the same browser at all

Ben: how are you envisioning the ATP asking the MPC to use a particular architecture?
… does every single browser send it, or is it a one-off configuration?
… You have this model output to add the browser to an IG
… do you envision future extensions where the browser can do other things besides join an IG?

Gang: in this proposal we want to dump prediction result to IGs
… we want to leverage our existing investment in TD to guarantee user privacy protection
… if industry is comfortable with TD...
… we still have privacy guarantee; want browser vendors to feel comfortable with the prediction result
… if it can help industry, with privacy prediction guarantee, we can talk about it with the browser vendors
… what alternative way of serving prediction results, I have not addressed
… focus on getting first step done
… the configuration, how the ATP tells the MPC to use this architecture; is that a one-off?
… I have not thought about those details
… in the request from the browser
… there needs to be some way for ATPs to tell MPCs that it represents a nearest neighbor model or [missed]
… API and MPC need to agree upon with the browser vendors
… have not thought about those details
… we can file a github issue and discuss there

Wendy: sounds good
… Mehul, something briefly?

Mehul: quick question
… one quick comment, and I will follow up on github
… regarding secure computation
… adtech has a model
… why not do MPC between ATP and browser, why involve MPC1 and 2
… what will it take for computation?

Gang: what does it take for browsers to feel comfortable with the training
… we may need to design a different crypto model

Mehul: if encrypted by browsers, it would require a different crypto protocol
… I will follow-up on github
… that would be interesting
… user trusts browser, but not MPC
… and they keep that information

Gang: let's follow-up

Wendy: We are building up issues for future discussion
… if people want to nominate specific proposals for follow-up
… that could help to start those discussions
… we have a proposal from Arnaud, and I heard other suggestions of material
… Look forward to future meetings and more info on lists and github

<wseltzer> [adjourned]

Wendy: our next meeting is the 19th of January

Minutes manually created (not a transcript), formatted by scribe.perl version 127 (Wed Dec 30 17:39:58 2020 UTC).
