W3C

- DRAFT -

Improving Web Advertising BG
15 Dec 2020

Agenda

Attendees

Present
blassey, Karen, mlerra, joshua_koran, Mikjuo, ionel, gendler, shigeki, jrobert, marguin, eriktaubeneck, ErikAnderson, GarrettJohnson, kris_chapman__, mserrate, lbasdevant, wbaker, jeff_burkett_gannett, dialtone, joelstach, dkwestbr, jrosewell, bmay_, arnaud_blanchard, joelpm, bleparmentier, pl_mrcy, seanbedford, AramZS, apascoe, pedro_alvarado, jonasz, hober, ajknox
Regrets
Chair
wseltzer
Scribe
Karen

Contents


<scribe> scribe: Karen

Wendy: Welcome
... we will give people a few more minutes to join
... start looking over our agenda; agenda curation, introductions and a scheduling note
... We had several new proposals shared with the list
... we have COWBIRD from Matthew Wilson
... and the Augury API and Scaup API from Gang Wang
... if there are any other highlights or other business people want to raise for discussion?
... anything else anyone would like to add to the agenda
... Hearing suggestions of proposals that we might see in the new year
... Looking forward to those
... Note on the agenda I sent around that we won't be meeting the following two Tuesdays
... as people's schedules start to clear out for the end of the year
... This will be our last meeting in 2020
... Thank you all for the work that we have been doing
... We will meet next on January 5th after this
... Any introductions, new participants to this call?

Aditya Desai: I am new from Google

COWBIRD

Wendy: Welcome, Aditya
... we have the COWBIRD proposal

<wseltzer> https://github.com/AdRoll/privacy/blob/main/COWBIRD.md

Wendy: how machine learning optimization might be done in a privacy preserving way
... Matthew do you have something to introduce that?

Matthew: yes, I have a deck and will screen share

<GarrettJohnson> Thanks for making a deck!

Matthew: the proposal is called COWBIRD; Coordinated Optimization without Big Resource Demands
... will get into how this works

[slide: industry context]
... machine learning optimization for buyers is very important
... without feedback on what impressions are worth to us, no one knows what would happen
... there is an asymmetry of information
... know if above or below the fold
... buyer could bid the average, but that turns out not to work
... ML optimization is very important; buyers need to understand value of things they are buying
... this is inspired by MURRE, in some ways a similar proposal
... we found that small models do surprisingly well
... full-size models are on the order of a gigabyte
... with compression, models of roughly 100 kilobytes do OK
... these can be sent out and evaluated
... you can imagine many such models of a hundred kilobytes each

[Motivating observation graphs]

scribe: point is small models do surprisingly well
... at a high level, this is the third-party cookie paradigm

scribe: this is not great from a privacy perspective as personal data leaves device and goes to third parties
... instead of running bidding models in the cloud, we can run bidding models on the device
... only thing that leaves is model gradients
... we will talk about how to use, attack vectors, etc.
... In more detail
... the browser acts as a federated learning platform
... allowing ad-tech companies to optimize toward customizable objectives with customizable features
... How this works

[step one slide]

scribe: First, model has to make its way onto a device
... when the browser goes to an advertiser site

scribe: the advertiser or DSP can say, please download our model and bidding logic
... the model might be weights
... but you need logic to interpret them
... along with the weights, you need functions
... contextual info, browser event history and interest group data
... browser history is something like SPURFOWL
... compute number of impressions shown, advertiser associated with this IG
... to evaluate the model you need function to evaluate those features you have computed
... and the model downloaded to get predictions
... all the model data
... gets downloaded by browser upon request by advertiser when browser goes to advertiser's page
... that is one part of the story
... to be able to evaluate the model
... need to improve the model
... need to get feedback on how you are doing
... the other thing the browser needs is gradient-related information
... browser under control of DSPs or others
... so gradient-related info, you need to know the labels of things, given browser event history
... leveraging core idea in SPURFOWL
... and need a function to determine the model gradient given the label of observations, model features associated with the observation and the model data
... Say something about the labels
... you can optimize for clicks
... a long click
... a click followed by a conversion; you can label however you want as a client of this platform
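
[Illustrative sketch: one possible shape for the model bundle just described, in TypeScript. All names are hypothetical; COWBIRD does not define a concrete API.]

    // Data an ad-tech company asks the browser to download on its site.
    interface ModelBundle {
      modelData: Float32Array; // e.g. ~100 KB of weights
      // Compute features from contextual info, SPURFOWL-style event
      // history, and interest-group data, all on-device.
      featureFn(ctx: object, history: object[], igData: object): Float32Array;
      // Turn features plus weights into a bid.
      bidFn(features: Float32Array, modelData: Float32Array): number;
      // Label an observation: a click, a long click, a click followed by
      // a conversion, whatever the client of the platform chooses.
      labelFn(history: object[]): number;
      // Gradient of the loss given label, features, and current weights.
      gradientFn(label: number, features: Float32Array,
                 modelData: Float32Array): Float32Array;
    }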

[Step 2 slide]

scribe: how do we use this?
... the model related logic is all on device
... device needs to evaluate it to come up with bids

<wseltzer> https://github.com/AdRoll/privacy/blob/main/COWBIRD.md#the-proposal

scribe: for every model and every IG associated with every model
... once bids are sent, they can be evaluated by the browser
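
[Illustrative sketch: the on-device bidding loop of step 2, reusing the hypothetical ModelBundle above.]

    // For every model, and every interest group associated with it,
    // compute a bid; bids are then evaluated by the browser's auction
    // and never need to leave the device.
    function computeBids(bundles: Map<string, ModelBundle>,
                         ctx: object, history: object[]): Map<string, number> {
      const bids = new Map<string, number>();
      for (const [igName, bundle] of bundles) {
        const features = bundle.featureFn(ctx, history, { name: igName });
        bids.set(igName, bundle.bidFn(features, bundle.modelData));
      }
      return bids;
    }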

[Step 3 slide]

scribe: So that is how buyers can submit their bids
... and we need to get feedback on how the things we purchase are performing
... the label and gradient step
... being a bit vague here
... a click could come on the order of minutes or days later
... some subtlety to mention
... for the gradients to be sent back, we need to complete the labels given the SPURFOWL store
... then with model data compute the gradient
... for each association with a DSP
... or model and send those out to the gradient aggregation service
... the gradient aggregation service is what keeps it from leaking personal data
... the nice thing about gradients is that they can be summed and distorted

scribe: generally point in the right direction
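
[Illustrative sketch: step 3 on the device, completing a label from the SPURFOWL-style store, computing the gradient, and shipping only the gradient out. The endpoint is hypothetical.]

    async function reportGradient(bundle: ModelBundle, ctx: object,
                                  history: object[], igData: object) {
      const features = bundle.featureFn(ctx, history, igData);
      const label = bundle.labelFn(history); // e.g. click = 1, no click = 0
      const grad = bundle.gradientFn(label, features, bundle.modelData);
      // Only the gradient leaves the browser, and only to the gradient
      // aggregation service, never directly to the ad-tech company.
      await fetch("https://aggregation.example/gradients",
                  { method: "POST", body: grad.buffer });
    }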

[Step 4 slide]

scribe: Gradient aggregation service sums the gradients collected

[Step 5]

scribe: the ad tech company receives its aggregated gradient and uses it to improve its model
... maybe apply regularization
... and ad tech improves its model; whatever that means to its company
... and releases a new version of its model
... browsers are in background pulling new models
... when the browser starts a new session
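
[Illustrative sketch: step 5 on the ad-tech side. A plain SGD step with L2 regularization is assumed; the proposal leaves the exact update rule to each company.]

    function applyUpdate(weights: Float32Array, aggGrad: Float32Array,
                         lr = 0.01, l2 = 1e-4): Float32Array {
      const next = new Float32Array(weights.length);
      for (let i = 0; i < weights.length; i++) {
        // Gradient step plus L2 regularization toward zero.
        next[i] = weights[i] - lr * (aggGrad[i] + l2 * weights[i]);
      }
      return next; // released as a new version for browsers to pull
    }
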
scribe: a quick analysis of resource usage
... I said a model is roughly 100 kilobytes

<wseltzer> https://github.com/AdRoll/privacy/blob/main/COWBIRD.md#resource-considerations

scribe: if we have a max of 500 models per browser
... would be 50 MB of memory per device
... Crude idea: this would happen when a user starts a session; download relevant models
... Gradients require bandwidth
... could be lower if gradients are sparse; compression could help some but probably not a lot
... It's OK if gradients don't get uploaded or models don't get downloaded when connections are poor
... the process still works as long as people are sending gradients
... some ways to short-circuit these resource requirements; or I should say optimize them
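
[Illustrative back-of-envelope arithmetic for the resource numbers above.]

    const modelBytes = 100 * 1024;           // ~100 KB per model
    const maxModels = 500;                   // cap per browser
    const memBytes = modelBytes * maxModels; // 51,200,000 bytes, about 50 MB
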
scribe: let's talk about some attack vectors
... two avenues for attack
... the only thing that leaves the browser is the gradients
... one way is to program the gradient components
... the ad tech company is in control of features, gradients and what fires
... they can program the gradients
... if this component is on, it means this or that
... or if this has value above this or that
... can be combined in different ways
... I am hoping the gradient aggregation service can mitigate or eliminate these attacks
... gradient aggregation densifies gradients
... gradients can be distorted in a number of ways
... flipping components, scaling, adding noise, clipping
... and gradient will still be useful
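
[Illustrative sketch: one way the aggregation service might distort a summed gradient, clipping its norm and adding Gaussian noise. A real deployment would need carefully calibrated noise; these parameters are placeholders.]

    function sanitize(grad: Float32Array, clipNorm = 1.0,
                      noiseStd = 0.1): Float32Array {
      let norm = 0;
      for (const g of grad) norm += g * g;
      norm = Math.sqrt(norm);
      const scale = norm > clipNorm ? clipNorm / norm : 1;
      const out = new Float32Array(grad.length);
      for (let i = 0; i < grad.length; i++) {
        // Box-Muller transform for Gaussian noise.
        const u = Math.random() || 1e-12;
        const v = Math.random();
        const gauss = Math.sqrt(-2 * Math.log(u)) * Math.cos(2 * Math.PI * v);
        out[i] = grad[i] * scale + noiseStd * gauss;
      }
      return out; // still points roughly in the right direction
    }
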
scribe: Some known shortcomings
... Bidding logic uploaded to this in-browser federated platform would be in the clear
... this opens doors to malicious actors
... DSPs copying another's logic
... setting floors
... a shady publisher hacking their site to give high bids for some models
... it is a serious defect
... if the rest of this proposal is well received, hope we would engineer a solution to this
... I don't know a lot about browsers
... Another shortcoming: the proposal doesn't support contextual targeting out of the box
... some upper funnel stuff you can do; call it look-alike modeling
... but contextual targeting not supported
... the proposal is compatible with TD, TERN
... don't need for retargeting workflow
... one of the things this proposal has over the TD workflow
... it's not clear how to do inference in that type of world
... you have siloed data
... and have to evaluate the model on data silos and stitch that together; it gets very ugly
... that's a fact of federated learning; no centralized data
... makes things hard for data scientists like me
... cannot run a back test on centralized data
... we could figure that out but this is a shortcoming

[lack of centralized data]

scribe: thanks for hearing me out

James: thanks, Matt
... for the presentation
... could you give more detail on how advertiser sites pull down the model
... and on the upper limit of 500
... what would the compute overhead be like?
... could you give insight?

Matt: the compute overhead is for up to 500 models
... I pulled that number out of the air; if a different number makes sense, ok
... at NextRoll we have a regression model
... takes our cloud servers one millisecond on one CPU to evaluate the model
... that's the computation needed
... maybe half a second of computation needed
... a crude comparison but some ballpark numbers we are talking about
... needing half a second to a second to produce these bids and the winner of the auction
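
[Illustrative arithmetic behind the latency estimate, assuming the one-millisecond figure applies per model evaluation.]

    const perModelMs = 1;             // ~1 ms per model evaluation
    const totalMs = perModelMs * 500; // 500 ms, about half a second per placement
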
Matt: the first question was about the advertiser's site and how models get pulled down
... When person goes to site today, they get third party cookie
... instead they could get a little tiny model
... API makes sense
... come to my site as advertiser
... please download the model data; asap for model to be fresh and start computing bids
... would happen ideally right then and there

James: that pulldown would happen first time a user visits a site

Matt: yes

James: a half a millisecond for one CPU
... is per placement or @

Matt: per placement
... probably the same kind of thing
... not sure that changes the estimate that much

James: thank you for the clarification

Charlieharrison: respond to models being public on the device
... thought about this a bit...seems like two cases
... one when you want to protect data from some other JS running on users' device
... in those cases, I think we can protect the model
... second case is when you want to protect model from someone that has local user access
... that is pretty much impossible to hide model
... I don't think @
... still allow you to evaluate the model
... and if you can evaluate enough times, you can figure out the model
... that is the lay of the land of what we are making public

Matt: that is helpful
... don't know if it's a deal breaker
... maybe obfuscation would provide enough protection
... to make it hard to reverse engineer the model

Charlie: I think we need to discuss it
... it seems to me that with a local copy of an obfuscated model

<blassey> thanks

Charlie: if an adversary is dedicated enough, they could learn it, even if fully obfuscated

Matt: different from federated code

Charlie: good point
... to prevent trivial attacks, maybe the obfuscation can work

Jonasz: thanks for proposal
... seems like COWBIRD and SPURFOWL is model to help

[missed]

Matt: answer is no, I have not done federated learning
... it is being done in practice
... tells you something about the feasibility
... at high level
... how quickly you get feedback
... don't want model to always be lagging what is happening in the wild
... we would definitely want a tight feedback loop between releasing new models and pulling gradient info
... what I have seen from Google is that the gradient computations run at night when you plug your phone in
... might take a couple days to train a model, then fine-tuning

Jonasz: when thinking about COWBIRD and SPURFOWL, would it be a complement?

Matt: both SPURFOWL and COWBIRD rely on aggregate reporting
... gradients cannot be sent directly to advertisers
... we have a lot of control over what those computations produce
... if there were no intermediary, that would be a problem
... that is why this proposal has an aggregating intermediary
... so there the attacks are at least very hard

Jonasz: do you need the ability to launch reports from a bidding function?
... those will have different delay considerations

Matt: looking at reporting and bidding side as separate pieces
... you might want to compute metrics that may or may not be limited to bidding
... that could use SPURFOWL only
... in theory I could only bid and use COWBIRD
... but want to give customer metrics, so I would use SPURFOWL
... cannot tell customer number of impressions using COWBIRD
... maybe some way to attack but that is not the idea

Wendy: I have closed queue now

Brian: Very interesting proposal
... I like the idea of taking advantage of the control the browser has over things
... to give out what is needed and not more than what is needed
... raises allocation of resources in browser; who owns what, who manages what
... an issue common to other proposals
... if the browser lands on a publisher's page and develops a model, does the advertiser or a third party own that relationship?

Matt: I don't have a good answer for that
... I have not specified anything along that detail in this proposal, partially intentionally
... I don't have a great response to that; we can figure out together
... Why does ownership of these things matter?
... your first comment was about resources on the device
... this is an open question
... has not been tried
... first thing is to do a toy simulation of this
... how the entities in the ecosystem interact or own this could be done in different ways

Brian: proposal is interesting, but needs to be grounded in who owns what relationships with whom
... but I think we need to figure out those foundational questions before moving forward in any serious way

Matt: we can think about that a little more

Wendy: sounds like an issue to raise and discuss in the repository

Robert: to recap on conversation we had yesterday
... this is a possible way forward on measurement
... the other is server side
... thanks for pushing this
... we could work with you
... progress on the model depends on the data and the structure of the gradients
... being communicated only from the advertiser's site
... a mix of positive and negative labels
... has a pretty significant effect on measurement outcomes
... you need to bring in users who have not interacted with the advertiser's site to balance positives and negatives

Matt: I am confused
... measurement is more SPURFOWL
... this is more concerned with optimizing bidding

Robert: say you are optimizing for those who visit that web site
... you are not getting the full population
... good for retargeting but not brand display

Matt: agree that contextual targeting is not well represented
... when a browser goes to advertiser A's site
... there is no reason the DSP cannot write into IGs
... for advertisers A, B, and C
... so B and C are upper-funnel bidding or ad display for this browser
... this browser has never seen B and C
... that is baked into the TD-style proposals
... not contextual targeting, but you could call that look-alike targeting
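
[Illustrative sketch of the look-alike mechanism just described: on advertiser A's site, the DSP writes the browser into interest groups that advertisers B and C will bid on. The joinAdInterestGroup call is TURTLEDOVE-style and its exact shape is still in flux.]

    const nav = navigator as any; // joinAdInterestGroup is not yet standard
    nav.joinAdInterestGroup({ owner: "dsp.example", name: "advertiser-A-visitors" });
    // Upper funnel: B and C have never seen this browser, but can still
    // bid on these groups and get aggregated gradient feedback on them.
    nav.joinAdInterestGroup({ owner: "dsp.example", name: "advertiser-B-lookalike" });
    nav.joinAdInterestGroup({ owner: "dsp.example", name: "advertiser-C-lookalike" });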

Robert: the problem remains; we can discuss it
... unless the model is running on browsers of people who have not visited the web site
... you cannot create a model that reasons about those who have not visited

Matt: even if the browser has not seen B and C, you are getting gradient feedback on how those browsers perform
... so you can get info on how those behave
... let's continue the conversation on the NextRoll privacy repo and discuss details there

Jukka: I have a question about the privacy of the approach
... what would prevent advertiser from creating a model that acts like a cookie
... and then compile results through the bidding process

Matt: the only way the advertiser gets feedback on the model on the device is via aggregated and noised gradients
... advertiser has some control over those gradients
... the aggregation and noise have to be such that no @ makes it hard
... the hope is that aggregation and noise make it impossible or hard to communicate anything useful back that is not related to training the model

Jukka: where does the auction occur?

Matt: bids should be generated on the browser
... could send bids to a publisher; have code
... or browser can evaluate those bids
... bids themselves never have to leave the browser or go back to advertiser

Jukka: could it be used as mechanism to communicate things

Matt: advertiser doesn't know the value
... advertiser gets back a gradient that is aggregated and has noise

Jukka: I thought it could occur somewhere else

Matt: browser has everything it needs to compute this
... all personal data
... look at bids as personal data
... that never gets seen by advertisers
... only minimal feedback, the gradients come back

Jukka: if bids don't leave the browser it is safe

Matt: I guess it could go to SSP and DSP
... SSP could provide logic on behalf of browser in order to run
... bids are evaluated on browser; could be done in different ways

Jukka: thanks

Wendy: Thanks

Augury API and Scaup API

Wendy: good discussion; ran a bit long
... We apologize to Gang
... would you like to start an introduction of Augury and Scaup with option to continue at the next meeting?

Gang: that would be great
... can give a brief introduction of the tools
... and at next meeting in January we can go through the details
... let's talk about the TD Augury API

<wseltzer> https://github.com/google/ads-privacy/tree/master/proposals/augury

scribe: let me share
... goal of TD Augury API is to simplify Turtledove to be easier for industry to adopt
... in the process we made some key choices

<wseltzer> https://github.com/google/ads-privacy/tree/master/proposals/augury#augury-api

scribe: the diagram shows at a high level how this works
... when browser sends contextual ad request to SSP
... and SSP forwards request to DSP
... the DSP makes a prediction of which interest groups the user is most likely a member of
... then DSP generates some conditional bids
... if user belongs to this IG
... I would bid $1.50 for this
... If another, I would bid $2.00
... DSP returns a set of conditional bids we call Augury bids
... SSP applies controls
... so on and so forth upon those conditional bids
... in same way it applies to contextual bids
... after applying these privacy rules and publisher controls
... the SSP returns the winning contextual bid
... plus any augury bids back to browser in step 5
... in step 6, browser runs a final auction
... from the final contextual bid and any number of augury bids
... this is the high level flow for the TD Augury API
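
[Illustrative sketch: a possible shape for the SSP's response and the browser's final auction in step 6. Field names are hypothetical.]

    interface Bid { bidCpm: number; renderUrl: string; }
    interface AuguryBid extends Bid { interestGroup: string; }
    interface SspResponse {
      contextualBid: Bid;      // the winning contextual bid from step 5
      auguryBids: AuguryBid[]; // e.g. { interestGroup: "hiking", bidCpm: 1.5, ... }
    }
    // Step 6: the browser keeps the highest bid whose interest-group
    // condition it actually satisfies.
    function finalAuction(r: SspResponse, joinedIgs: Set<string>): Bid {
      let best = r.contextualBid;
      for (const b of r.auguryBids) {
        if (joinedIgs.has(b.interestGroup) && b.bidCpm > best.bidCpm) best = b;
      }
      return best;
    }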

scribe: all DSP functions are running in the DSP's servers
... in the TD proposal, that logic for how to generate a bid
... is packaged into the creative, running in the browser, likely as JS
... for buy side
... in this design, creative exclusions, etc. all applied on SSP's server
... so the SSP doesn't have to reimplement that logic in JS
... we keep most of existing

[lost audio]

[back]

[missed]

scribe: at super high level this is what TD Augury API is
... from an overall design perspective
... that is the number one proposal we put onto the github
... talk about another proposal
... that is a high level overview to warm you up
... we can talk about details in the next meeting
... Second proposal we will discuss is called Scaup
... we have TD as a remarketing solution
... how do we do look-alike targeting?

<wseltzer> https://github.com/google/ads-privacy/tree/master/proposals/scaup

scribe: we take a different approach in this proposal than others discussed previously
... we want to have a trusted third-party server
... called MPC1 and MPC2
... key idea is this
... when you visit different web sites
... your browser will fire off various HTTP requests
... your adtech providers will specify a feature vector for each event
... if this is a page view event
... page is about Hawaii vacations or scuba diving

<wseltzer> https://github.com/google/ads-privacy/tree/master/proposals/scaup#design-overview

scribe: the browser, upon request of the ATP, will store those feature vectors in local storage
... the ATP can ask the browser at any time, 'please aggregate all the vectors in this timeframe'
... that would be called a user profile
... in steps 3A and 3B, the browser uploads the profile to the secure computation clusters
... using crypto protocols to secure user profiles
... goal is to build a ML model
... how to find similar audience
... in future, we can program clusters to support additional use cases
... In step four
... the two clusters execute the protocol for ML
... in step 5
... browser, please query the ATP cluster to learn which IGs you should join
... goes back to browser as IG IDs
... after browser joins IGs
... we will rely on TD to serve the adtech industry
... this is a super high-level view of how SCAUP works
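
[Illustrative sketch: building a user profile from per-event feature vectors and additively secret-sharing it between the two clusters (steps 2 through 3B). Real MPC works over finite fields; floating-point shares here are purely illustrative.]

    // Aggregate the ATP's feature vectors for a timeframe into a profile.
    function buildProfile(vectors: Float32Array[], dim: number): Float32Array {
      const profile = new Float32Array(dim);
      for (const v of vectors)
        for (let i = 0; i < dim; i++) profile[i] += v[i];
      return profile;
    }
    // Split the profile so that neither MPC1 nor MPC2 alone learns it.
    function share(profile: Float32Array): [Float32Array, Float32Array] {
      const s1 = new Float32Array(profile.length);
      const s2 = new Float32Array(profile.length);
      for (let i = 0; i < profile.length; i++) {
        s1[i] = Math.random() * 2000 - 1000; // random mask, sent to MPC1
        s2[i] = profile[i] - s1[i];          // remainder, sent to MPC2
      }
      return [s1, s2]; // profile = s1 + s2 only when the clusters cooperate
    }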

[missed]

scribe: result goes back to browser and goes back to adtech industry
... there is a crypto guarantee
... as long as the two servers do not collude, there is no data leak
... we are short on time

Wendy: Thank you for that overview
... we will give you more time to take questions when we return in January
... Brian, you have a quick comment or question?
... otherwise, invite you to comment on github or by email

[everyone will respond offline]

Wendy: Thank you, we will be back in January to dive into more details
... thank you to Karen for scribing and to everyone who, throughout the year,
... helped to present, develop proposals, curate the information, ask questions
... hope everyone has the opportunity for some rest and relaxation over the December breaks
... see you again on January 5th and thanks for the great participation this year

[adjourned]

Summary of Action Items

Summary of Resolutions

[End of minutes]

Minutes manually created (not a transcript), formatted by David Booth's scribe.perl version (CVS log)
$Date: 2020/12/15 17:01:48 $
