<scribe> scribe: Karen
Wendy: Welcome
... we will give people a few more minutes to join
... let's start looking over our agenda: agenda curation,
introductions, and a scheduling note
... We had several new proposals shared with the list
... we have COWBIRD from Matthew Wilson
... and the Augury API and Scaup API from Gang Wang
... are there any other highlights or other business people
want to raise for discussion?
... anything else anyone would like to add to the agenda
... Hearing suggestions of proposals that we might see in the
new year
... Looking forward to those
... A note on the agenda I sent around: we won't be meeting
the following two Tuesdays
... as people's schedules start to clear out for the end of the
year
... This will be our last meeting in 2020
... Thank you all for the work that we have been doing
... We will meet next on January 5th after this
... Any introductions, new participants to this call?
Aditya Desai: I am new from Google
Wendy: Welcome, Aditya
... we have the COWBIRD proposal
<wseltzer> https://github.com/AdRoll/privacy/blob/main/COWBIRD.md
Wendy: it looks at how machine learning
optimization might be done in a privacy-preserving way
... Matthew do you have something to introduce that?
Matthew: yes, I have a deck and will screen share
<GarrettJohnson> Thanks for making a deck!
Matthew: the proposal is called
COWBIRD; Coordinated Optimization without Big Resource
Demands
... will get into how this works
[slide: industry context]
... machine learning optimization for buyers is very
important
... without getting feedback on what the value of impressions is to
us, no one knows what would happen
... there is an asymmetry of information
... e.g., whether the ad is above or below the fold
... buyer could bid the average, but that turns out not to
work
... ML optimization is very important; buyers need to
understand value of things they are buying
... this is inspired by MURRE, in some ways a similar
proposal
... we were surprised to see that small models do surprisingly
well
... full-size models are on the order of a gigabyte
... with compression, models of about 100 kilobytes do OK
... models that small can be sent out and evaluated
... you can imagine many such models
... of a hundred kilobytes each
[Motivating observation graphs]
Matthew: the point is that small models do
surprisingly well
... at a high level, this is the third-party cookie paradigm
Matthew: this is not great from a
privacy perspective, as personal data leaves the device and goes to
third parties
... instead of running bidding models in the cloud, we can run
bidding models on the device
... only thing that leaves is model gradients
... we will talk about how to use, attack vectors, etc.
... In more detail
... the browser acts as a federated learning platform
... allowing ad-tech companies to optimize toward customizable
objectives with customizable features
... How this works
[step one slide]
Matthew: First, the model has to make
its way onto a device
... when the browser goes to the advertiser site
... the advertiser or DSP can
say, please download our model and bidding logic
... the model might be weights
... but you need logic to interpret them
... along with the weights, you need a function over
... contextual info, browser event history and interest group
data
... the browser event history is something like SPURFOWL
... e.g., compute the number of impressions shown for the advertiser
associated with this IG
... to evaluate the model you need a function that takes the
features you have computed
... and the downloaded model data to get predictions
... all the model data
gets downloaded by the browser, at the advertiser's request, when
the browser goes to the advertiser's page
... that is one part of the story
... to be able to evaluate the model
... need to improve the model
... need to get feedback on how you are doing
... the other thing the browser needs to have is gradient-related
information
... computed in the browser, under the control of DSPs or others
... so for gradient-related info, you need to know the labels of
things, given the browser event history
... leveraging a core idea in SPURFOWL
... and you need a function to determine the model gradient given
the label of an observation, the model features associated with the
observation, and the model data
... Let me say something about the labels
... you can optimize for clicks
... a long click
... or a click followed by a conversion; you can label however you
want as a client of this platform
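[Editor's note: a minimal TypeScript sketch of the per-model bundle described above. All names and shapes here are hypothetical illustrations of the prose, not a proposed browser API.]

    // Hypothetical shape of what an advertiser/DSP asks the browser to store:
    // model data (weights), a feature function over local browser state, and
    // a gradient function over labeled observations. Labels are up to the
    // ad-tech company (click, long click, click followed by a conversion).
    interface CowbirdModelBundle {
      modelId: string;
      weights: Float32Array; // the model data itself
      // Compute features from contextual info, the (SPURFOWL-style) browser
      // event history, and interest-group data; all inputs stay on device.
      computeFeatures(localState: {
        contextual: Record<string, number>;
        eventHistory: { type: string; ts: number }[];
        interestGroups: string[];
      }): Float32Array;
      // Compute a model gradient given the label of an observation, the
      // features associated with it, and the current model data.
      computeGradient(
        label: number,
        features: Float32Array,
        weights: Float32Array
      ): Float32Array;
    }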
[Step 2 slide]
Matthew: how do we use this?
... the model-related logic is all on device
... the device needs to evaluate it to come up with bids
<wseltzer> https://github.com/AdRoll/privacy/blob/main/COWBIRD.md#the-proposal
Matthew: bids are computed for every model and every
IG associated with each model
... once bids are generated, they can be evaluated by the browser
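[Editor's note: a minimal sketch of the on-device bid evaluation in step 2, assuming the hypothetical bundle above; a linear model stands in for arbitrary ad-tech bidding logic.]

    // For every stored model and every IG associated with it, produce a bid
    // entirely on device; nothing here leaves the browser.
    function evaluateBid(weights: Float32Array, features: Float32Array): number {
      let bid = 0;
      for (let i = 0; i < weights.length; i++) {
        bid += weights[i] * features[i]; // dot product as stand-in bidding logic
      }
      return bid;
    }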
[Step 3 slide]
Matthew: So that is how buyers can
submit their bids
... and we need to get feedback on how the things we purchase
are performing
... the label and gradient step
... being a bit vague here; a click could come on the order of
minutes or days later
... some subtlety to mention
... for the gradients to be sent back, we need to complete the
labels given the SPURFOWL store
... then with model data compute the gradient
... for each association with a DSP or model, and send those out to
the gradient aggregation service
... the gradient aggregation service is what keeps it from
leaking personal data
... the nice thing about gradients is that they can be summed and
distorted
... and generally still point in the right direction
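[Editor's note: a minimal sketch of the label-and-gradient step, assuming a logistic click model; the proposal leaves the loss and labeling scheme to each ad-tech company.]

    // Gradient of the logistic loss for one labeled observation from the
    // local event store. Gradients like this can be summed and distorted
    // and still point roughly in the right direction.
    function logisticGradient(
      weights: Float32Array,
      features: Float32Array,
      label: 0 | 1 // e.g. click / no click, completed from local history
    ): Float32Array {
      let z = 0;
      for (let i = 0; i < weights.length; i++) z += weights[i] * features[i];
      const p = 1 / (1 + Math.exp(-z)); // predicted click probability
      const grad = new Float32Array(weights.length);
      for (let i = 0; i < weights.length; i++) grad[i] = (p - label) * features[i];
      return grad;
    }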
[Step 4 slide]
Matthew: the gradient aggregation service sums the gradients collected
[Step 5]
Matthew: the ad tech company
receives its aggregated gradient and uses it to improve its
model
... maybe apply regularization
... and ad tech improves its model; whatever that means to its
company
... and releases a new version of its model
... browsers are in background pulling new models
... the browser starts a new session
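[Editor's note: a minimal sketch of the ad-tech-side update in step 5, assuming plain gradient descent with L2 regularization; the proposal leaves the optimizer and what "improve its model" means entirely to each company.]

    // Apply the aggregated (already summed and noised) gradient, with L2
    // regularization, then publish the new model version for browsers to pull.
    function applyAggregatedGradient(
      weights: Float32Array,
      aggregated: Float32Array,
      learningRate = 0.01,
      l2 = 1e-4
    ): Float32Array {
      const next = new Float32Array(weights.length);
      for (let i = 0; i < weights.length; i++) {
        next[i] = weights[i] - learningRate * (aggregated[i] + l2 * weights[i]);
      }
      return next;
    }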
... quick analysis
... of resource usage
... I said the model is roughly 100 kilobytes
<wseltzer> https://github.com/AdRoll/privacy/blob/main/COWBIRD.md#resource-considerations
Matthew: if we have a max of 500
models per browser
... that would be 50 MB of memory per device
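[Editor's note: the 50 MB figure follows directly from the stated numbers: 500 models × ~100 KB per model ≈ 50 MB per device.]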
... a crude idea: this would happen when a user starts a
session; download the relevant models
... Gradients require bandwidth
... could be lower if gradients are sparse; compression could
help some but probably not a lot
... it's OK if gradients don't get uploaded or models don't get
downloaded when connections are poor
... the process still works as long as people are sending gradients
... some ways to short-circuit these resource requirements; or
I should say optimize them
... Talk about some attack vectors
... two avenues for attack
... only thing browsers receive are the gradients
... one way is to program the gradient components
... the ad tech company is in control of features, gradients
and what fires
... they can program the gradients
... if this component is on, it means this or that
... or if this has value above this or that
... can be combined in different ways
... I am hoping the gradient aggregation service can mitigate
or eliminate these attacks
... gradient aggregation densifies gradients
... and can distort them in a number of ways
... flipping components, scaling, adding noise, clipping
... and the gradient will still be useful
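[Editor's note: a minimal sketch of the distortions mentioned, clipping each contribution and adding noise to the sum, in the spirit of differential privacy; all parameters are illustrative, not from the proposal.]

    // Bound any single browser's contribution, then noise the aggregate so
    // individual gradients (and programmed gradient attacks) are obscured.
    function clipGradient(g: Float32Array, maxNorm = 1.0): Float32Array {
      let norm = 0;
      for (const v of g) norm += v * v;
      norm = Math.sqrt(norm);
      const scale = norm > maxNorm ? maxNorm / norm : 1;
      return g.map((v) => v * scale);
    }

    // Standard normal noise via the Box-Muller transform.
    function gaussian(): number {
      const u = 1 - Math.random();
      const v = Math.random();
      return Math.sqrt(-2 * Math.log(u)) * Math.cos(2 * Math.PI * v);
    }

    function noisyAggregate(gradients: Float32Array[], sigma = 0.1): Float32Array {
      const sum = new Float32Array(gradients[0].length);
      for (const g of gradients) {
        const clipped = clipGradient(g);
        for (let i = 0; i < sum.length; i++) sum[i] += clipped[i];
      }
      for (let i = 0; i < sum.length; i++) sum[i] += gaussian() * sigma;
      return sum;
    }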
... Some known shortcomings
... Bidding logic uploaded to this in-browser federated
platform would be in the clear
... this opens doors to malicious actors
... DSPs copying another's logic
... setting floors
... a shady publisher hacking their site to give high bids for
some models
... it is a serious defect
... if the rest of this proposal is well received, hope we
would engineer a solution to this
... I don't know a lot about browsers
... Another shortcoming: the proposal doesn't support contextual
targeting out of the box
... some upper funnel stuff you can do; call it look-alike
modeling
... but contextual targeting not supported
... proposal is compatible with TD, TURN
... don't need for retargeting workflow
... one of the things this proposal has over the TD workflow
... it's not clear how to do inference
... in that type of world
... you have siloed data
... and have to evaluate the model on data silos and stitch that
together; it gets very ugly
... that is a fact of federated learning: no centralized data
... makes things hard for data scientists like me
... cannot run a back test on centralized data
... we could figure that out but this is a shortcoming
[lack of centralized data]
scribe: thanks for hearing me out
James: thanks, Matt
... for the presentation
... could you give more detail on how advertiser sites pull down
the model
... and on the upper limit of 500
... what would the compute overhead be like?
... could you give insight?
Matt: on the compute overhead: we assumed up
to 500 models
... I pulled that number out of the air; if a different number
makes sense, OK
... at NextRoll we have a regression model
... takes our cloud servers one millisecond and @ CPU to
evaluate the model
... that's the computing needed
... maybe half a second of computation needed
... a crude comparison but some ballpark numbers we are talking
about
... needing half a second to a second to produce these bids and the
winner of the auction
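[Editor's note: the half-second ballpark follows from the stated numbers: ~1 ms of CPU per model evaluation × up to 500 models ≈ 0.5 s per bid request.]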
... first question was about advertisers' site and how models
get pulled down
... when a person goes to a site today, they get a third-party
cookie
... instead they could get a little tiny model
... an API makes sense where the advertiser says:
... you've come to my site
... please download the model data, ASAP, so the model is fresh
and can start computing bids
... ideally this would happen right then and there
James: that pulldown would happen the first time a user visits a site?
Matt: yes
James: half a millisecond for
one CPU
... is that per placement or @?
Matt: per placement
... probably the same kind of thing
... not sure that changes the estimate that much
James: thank you for the clarification
Charlie Harrison: responding to
models being public on the device
... thought about this a bit...seems like two cases
... one when you want to protect data from some other JS
running on users' device
... in those cases, I think we can protect the model
... second case is when you want to protect model from someone
that has local user access
... that is pretty much impossible to hide model
... I don't think @
... still allow you to evaluate the model
... and if you can evaluate enough times, you can figure out
the model
... that is the lay of the land of what we are making
public
Matt: that is helpful
... I don't know if it's a deal breaker
... maybe obfuscation would make it hard enough
... to reverse engineer the model
Charlie: I think we need to
discuss it
... it seems to me that with a local copy of an obfuscated model
<blassey> thanks
Charlie: if an adversary is dedicated enough, they could learn it, even if fully obfuscated
Matt: different from federated code
Charlie: good point
... to prevent trivial attacks, maybe the obfuscation can
work
Jonasz: thanks for the proposal
... it seems like COWBIRD plus SPURFOWL is a model to help
[missed]
Matt: answer is no, I have not
done federated learning
... it is being done in practice
... tells you something about the feasibility
... at high level
... how quickly you get feedback
... don't want model to always be lagging what is happening in
the wild
... we would definitely want to keep feedback loop from new
models and pulling gradient info
... from what I have seen at Google, the gradient computations run
at night when you plug your phone in
... might take a couple days to train a model, then
fine-tuning
Jonasz: when thinking about COWBIRD and SPURFOWL, would it be a complement?
Matt: both SPURFOWL and COWBIRD
rely on aggregate reporting
... gradients cannot be sent directly to advertisers
... we have a lot of control over what those computations
produce
... if there is no intermediary
... why this proposal has aggregate intermediary
... so there the attacks are at least very hard
Jonasz: do you need the ability to launch
reports from a bidding function?
... those will have different delay considerations
Matt: looking at reporting and
bidding side as separate pieces
... you might want to compute metrics that may or may not be
limited to bidding
... that could use SPURFOWL only
... in theory I could only bid and use COWBIRD
... but want to give customer metrics, so I would use
SPURFOWL
... cannot tell customer number of impressions using
COWBIRD
... maybe some way to attack but that is not the idea
Wendy: I have closed the queue now
Brian: Very interesting
proposal
... I like the idea of taking advantage of the control the
browser has over things
... to give out what is needed and not more than what is
needed
... it raises questions about allocation of resources in the
browser; who owns what, who manages what
... an issue common to other proposals
... if the browser lands on a publisher's page and develops a model,
does the advertiser or a third party own that relationship?
Matt: I don't have a good answer
for that
... I have not specified anything along that detail in this
proposal, partially intentionally
... I don't have a great response to that; we can figure out
together
... Why does ownership of these things matter?
... your first comment was about resources on the device
... this is an open question
... has not been tried
... first thing is to do a toy simulation of this
... how the entities in the ecosystem interact with or own this
could be done in different ways
Brian: proposal is interesting,
but needs to be grounded in who owns what relationships with
whom
... but I think we need to figure out those foundational
questions before moving forward in any serious way
Matt: we can think about that a little more
Wendy: sounds like an issue to raise and discuss in the repository
Robert: to recap on conversation
we had yesterday
... this is a possible way forward on measurement
... the other is server side
... thanks for pushing this
... we could work with you
... progress on the model depends on the data, and the gradients
... being communicated only from the advertiser site
... skews the mix of positive and negative labels
... which has a pretty significant effect on measurement
outcomes
... you need to bring in users who have not interacted with the
advertiser's site to balance positives and negatives
Matt: I am confused
... measurement is more SPURFOWL
... this is more concerned with optimizing bidding
Robert: say you are optimizing on
users who visit that web site
... you are not getting the full population
... good for retargeting but not brand display
Matt: agreed that contextual
targeting is not well represented
... when a browser goes to advertiser A's site
... no reason DSP cannot write into IG
... advertiser A, B, and C
... so B and C are upper funnel bidding or ad display for this
browser
... this browser has never seen B and C
... that is baked into the TURTLEDOVE-style proposals
... not contextual targeting, but you could call that
look-alike targeting
Robert: the problem remains; we can
discuss it
... that unless the model is running on browsers of people who
have not visited the web site
... you cannot create a model to reason about ...those who have not
visited
Matt: even if the browser has not seen B
and C, you are getting gradient feedback on how those browsers
perform
... so you can get info on how those behave
... let's continue the conversation on the repo and discuss the
details there
Jukka: I have a question about
the privacy of the approach
... what would prevent an advertiser from creating a model that
acts like a cookie
... and then compiling results through the bidding process?
Matt: the only way an advertiser gets
feedback on a model on the device is via aggregated and noised
gradients
... advertiser has some control over those gradients
... the aggregation and noise have to be such that no @ makes it
hard
... we hope the aggregation and noise make it impossible or hard to
communicate anything useful back that is not related to
training the model
Jukka: where does the auction occur?
Matt: bids should be generated on
the browser
... we could send bids to a publisher, which could have code to
evaluate them
... or the browser can evaluate those bids
... bids themselves never have to leave the browser or go back
to advertiser
Jukka: could it be used as a mechanism to communicate things?
Matt: advertiser doesn't know the
value
... advertiser gets back a gradient that is aggregated and has
noise
Jukka: I thought it could occur somewhere else
Matt: browser has everything it
needs to compute this
... all personal data
... look at bids as personal data
... that never gets seen by advertisers
... only minimal feedback, the gradients come back
Jukka: if bids don't leave the browser it is safe
Matt: I guess it could go to SSP
and DSP
... the SSP could provide logic for the browser to run
... bids are evaluated on browser; could be done in different
ways
Jukka: thanks
Wendy: Thanks
Wendy: good discussion; ran a bit
long
... We apologize to Gang
... would you like to start an introduction of Augury and Scaup
with option to continue at the next meeting?
Gang: that would be great
... can give a brief introduction of the tools
... and at next meeting in January we can go through the
details
... let's talk about the TD Augury API
<wseltzer> https://github.com/google/ads-privacy/tree/master/proposals/augury
Gang: let me share
... goal of TD Augury API is to simplify Turtledove to be
easier for industry to adopt
... in the process we made some key choices
<wseltzer> https://github.com/google/ads-privacy/tree/master/proposals/augury#augury-api
Gang: the diagram shows you at a high
level how this works
... the browser sends a contextual ad request to the SSP
... and the SSP forwards the request to the DSP
... the DSP makes a prediction of what groups the user is most
likely a member of
... then DSP generates some conditional bids
... if user belongs to this IG
... I would bid $1.50 for this
... If another, I would bid $2.00
... DSP returns a set of conditional bids we call Augury
bids
... the SSP applies controls
... and so on and so forth upon those conditional bids
... in the same way it applies them to contextual bids
... after applying these privacy rules and
publisher controls
... the SSP returns the winning contextual bid
... plus any augury bids back to browser in step 5
... in step 6, the browser runs a final auction
... between the final contextual bid and any number of augury
bids
... this is the high level flow for the TD Augury API
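[Editor's note: a minimal sketch of the conditional-bid idea with hypothetical field names; the browser alone knows the user's interest groups, so it can resolve the conditionals in the final auction of step 6.]

    // Hypothetical shape of an "augury bid" from steps 3-4: a bid that only
    // applies if the user turns out to belong to a given interest group.
    interface AuguryBid {
      interestGroup: string; // "if the user belongs to this IG..."
      bidUsd: number;        // "...I would bid this much"
      ad: string;
    }

    // Step 6: the browser runs the final auction between the winning
    // contextual bid and whichever augury bids actually apply.
    function finalAuction(
      contextualBid: { bidUsd: number; ad: string },
      auguryBids: AuguryBid[],
      userInterestGroups: Set<string>
    ): { bidUsd: number; ad: string } {
      let winner = contextualBid;
      for (const b of auguryBids) {
        if (userInterestGroups.has(b.interestGroup) && b.bidUsd > winner.bidUsd) {
          winner = { bidUsd: b.bidUsd, ad: b.ad };
        }
      }
      return winner;
    }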
<arnaud_blanchard> if possible
Gang: all DSP functions are
running on the DSP's servers
... in the TD proposal, that logic for how to generate a bid
... is packaged into the creative, running in the browser, likely as
JS
... for buy side
... in this design, creative exclusions, etc. are all applied on the
SSP's server
... so the SSP doesn't have to reimplement that logic in JS
... we keep most of existing
[lost audio ]
[back]
[missed]
Gang: at a super high level, this
is what the TD Augury API is
... from an overall design perspective
... that is the first proposal we put onto the
GitHub
... talk about another proposal
... that is a high level overview to warm you up
... we can talk about details in the next meeting
... the second proposal we will discuss is called Scaup
... we have TD as a remarketing solution
... how do we do look-alike targeting?
<wseltzer> https://github.com/google/ads-privacy/tree/master/proposals/scaup
Gang: we take a different
approach in this proposal than others discussed
previously
... we want to have trusted third-party servers
... called MPC1 and MPC2
... key idea is this
... when you visit different web sites
... your browser will fire off various HTTP requests
... your adtech providers will specify a feature vector for
each event
... e.g., if this is a page view event
... the page is about Hawaii vacations or scuba diving
<wseltzer> https://github.com/google/ads-privacy/tree/master/proposals/scaup#design-overview
Gang: the browser, upon request of the
ATP, will store those feature vectors in local storage
... the ATP can ask the browser at any time, 'please aggregate
all the vectors in this timeframe'
... that would be called a user profile
... in steps 3A and 3B, the browser uploads it to the secure
computation clusters
... using crypto protocols to secure the user profiles
... the goal is to build an ML model
... for how to find similar audiences
... in future, we can program clusters to support additional
use cases
... In step four
... the two clusters execute the protocol for ML
... in step 5
... the browser queries the ATP clusters to learn which IGs it
should join
... goes back to browser as IG IDs
... after browser joins IGs
... we will rely on TD to serve the adtech industry
... that is a super high-level view of how Scaup works
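[Editor's note: a minimal sketch of the profile building and two-server upload described above, assuming the profile is an element-wise sum of feature vectors and using toy additive secret sharing; Scaup's real protocol uses proper cryptographic secret sharing.]

    // Aggregate the stored per-event feature vectors over the requested
    // time window into a user profile (element-wise sum).
    function buildProfile(featureVectors: number[][]): number[] {
      const profile = new Array<number>(featureVectors[0].length).fill(0);
      for (const v of featureVectors) {
        for (let i = 0; i < profile.length; i++) profile[i] += v[i];
      }
      return profile;
    }

    // Steps 3A/3B: split the profile into two additive shares, one per MPC
    // server. Neither share reveals the profile on its own, matching the
    // "as long as the two servers are honest, no leak" guarantee. (Toy
    // version: real schemes use cryptographic randomness, not Math.random.)
    function secretShare(profile: number[]): [number[], number[]] {
      const share1 = profile.map(() => Math.random() * 200 - 100); // random mask
      const share2 = profile.map((v, i) => v - share1[i]);
      return [share1, share2];
    }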
[missed]
Gang: the result goes back to the
browser and back to the adtech industry
... there is a crypto guarantee
... as long as the two servers are honest, there is no data
leak
... we are short on time
Wendy: Thank you for that
overview
... we will give you more time to take questions when we return
in January
... Brian, do you have a quick comment or question?
... otherwise, invite you to comment on github or by email
[everyone will respond offline]
Wendy: Thank you, we will be back
in January to dive into more details
... thank you to Karen for scribing, and to everyone who
throughout the year
... helped to present, develop proposals, curate the
information, and ask questions
... hope everyone has the opportunity for some rest and
relaxation over the December breaks
... see you again on January 5th and thanks for the great
participation this year
[adjourned]