W3C

– DRAFT –
Improving Web Advertising BG

03 August 2021

Attendees

Present
AramZS, bLeparmentier, Brendan_IAB_eyeo, dinesh, dmarti, eriktaubeneck, GarrettJohnson, gendler, hong, imeyers, jeff_burkett_gannett, joelstach, jonasz, Karen, kris_chapman, mallory, Mike_Pisula, mjv, Newton, nlesko, seanbedford, wbaker, weiler, wseltzer
Regrets
-
Chair
Wendy Seltzer
Scribe
Karen, Karen Myers

Meeting minutes

Wendy reviews agenda
… For calendar notes, we will skip next week and return 17 August
… when we will have another proposal on Suishi
… and FalconCNS
… please continue to share agenda items for future discussions
… We seem to have a good crowd, so I will start with first agenda item
… Do we have any agenda curation or suggestions there?
… anyone new to the call who would like to introduce yourself?

Agenda-curation, introductions

Dennis @ from Microsoft

Kip @

Wendy: we use IRC for queuing
… see the bottom of the agenda email
… you will find a web-based IRC client
… we are in the web-adv channel
… use for queuing and minute taking
… please join if you can
… also type "present+" when you join for a record of the participation
… thank you and welcome all

[Wendy reviews agenda again]
… for AOB category, share an update
… asked about contacts for the UK's Competition and Markets Authority
… and anyone interested in learning about the W3C process
… folks there contacted me for a conversation; will be speaking with them later in the week
… I think we are ready for MaskedLARK
… Denis Charles will present

MaskedLARK [Edge] https://github.com/WICG/privacy-preserving-ads/blob/main/MaskedLARK.md

Denis: Let's get started
… I go by Charles

<wseltzer> https://github.com/WICG/privacy-preserving-ads/blob/main/MaskedLARK.md

Denis: I'm from Microsoft and will be presenting the MaskedLARK proposal for aggregation and reporting
… presenting together with Joel Pfeiffer, also from MS
… One of goals
… I'll cover first part which introduces the technical aspects
… the aim, algorithmic and technical details
… how it differs from other proposals, known limitations
… Joel will go over sample workflows for reporting and training
… Those are main goals for the talk
… Background is that third-party cookies are being deprecated
… this has many impacts, including fraud detection, targeting, conversion reporting and modelling
… I will focus on the conversion reporting and modeling pieces
… and what solutions to address that
… the main task, starting point
… for our proposal was an earlier proposal from Google
… makes a promising start to address problems of conversions
… Conversions are connecting activity across web sites
… start from a click, action directed from publisher
… ends up on action from user on advertiser's web page or landing site
… Google proposal addressed...how to get aggregate reports, etc.
… starting points of use cases of data and figure out how ad campaigns are working
… Google proposal had nice property
… does away with assumption that earlier proposals make
… to implement a trusted mediator
… and use that to do conversion reporting
… in Google proposal there was a single trusted entity with abstraction implemented by multiple parties
… and uses differential privacy to aggregate data
… trust requirements are lower
… good starting point
… what did we add on top of that?
… the main pro is that segregated helpers are more palatable than a single trusted mediator
… some limitations
… handles aggregation for reporting but does not address conversion modelling use cases
… an important building block
… given advertiser budget constraints, need to figure how to spend money efficiently
… another important use case of conversion data; other proposal did not address in simple way
… can be addressed with secure learning
… Take Google proposal as starting point and build on top of it to address the needs of modelling
… focus on simple ideas of privacy
… large enough set of folks know what's going on
… so people understand
… and design around it
… this is focus in this case
… with that
… please let me know if questions as we go
… First part of what we wanted to do was clarify how we wanted to look at computational platform
… that platform, browsers provided
… so folks using this computational platform have a way to reason about it and have expectations for protocols
… Starting point is where the modelling happens
… start with something popular, the map-reduce framework
… view of helpers and browser
… is almost a map-reduce cluster
… output of any of these operations automatically has the property that it is private
… so think of browsers as providing a secure map operation
… and helpers only apply a differentially private "reduce" operation
… One important point is the sum
… others to leverage as well
… this is the abstraction to keep in mind
… there are a variety of...
… the users of this foundation are model designers
… Go into formalization
… and what is happening in a concrete way
… requirements look similar to secure aggregation
… Every web browser creates a vector v_i
… helpers compute an output and aggregate
… think of this as a reduce operation taking vectors from various clients
… think of it as "sum" for simplicity
… think of the sum as W
… each coordinate of the sum that is disclosed is drawn at random from a distribution
… some noise added on top of the true value
… may be more noise depending upon how noise parameter is set up
… Also have sparse support constraint, also known as k-Anonymity
… if fewer than k users report a coordinate
… the aggregate does not report that coordinate
… coordinates reported by very few users are garbage collected
… protocol does not @ information
… ways in which the second property can be attacked but will see later
… a very simple notion of privacy that users can reason about
… has some limits
… as part of protocol, think of network
… AdNetwork publicly discloses sum and K
… Aggregation keys
… no hiding of them, treated as public information, although they can be opaque
… ad networks can use
… use a secret-sharing mechanism so helpers don't know true values
… add an unbiased knowledge
… this is another technical point
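[Editor's note: a rough formalization of the two properties described above, reconstructed from the minutes. The Laplace noise and the threshold k are illustrative assumptions; the exact distribution and parameters are not specified here.]

\[
W = \sum_{i=1}^{N} v_i, \qquad
\tilde{W}_j =
\begin{cases}
W_j + \eta_j, \quad \eta_j \sim \mathrm{Lap}(\lambda) & \text{if } |\{\, i : v_{i,j} \neq 0 \,\}| \ge k \\
\text{suppressed (garbage collected)} & \text{otherwise}
\end{cases}
\]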
… Go concretely how things look for counting conversions
… then focus on the new aspects of model training
… Let's say AdNetwork wants to count conversions, how many clicks and conversions happen
… these will look like keys
… and constraint of N users; assume helpers
… in time window T
… encryption of @
… set value of clicks and conversions

[please see detailed slide for the mathematical formulae]
… suppose two helpers
… browser generates two random values
… and creates two shares of the vector
… the first share is the true value plus a random value
… sets the second share to be the negative of the random value
… browser encrypts the two shares v_i,1 and v_i,2
… helper aggregation service computes
… adds additional noise
… reported as output of cluster
… approximately the true value
… since each of the helpers sees, for every client, for every coordinate
… whether it is reporting that coordinate
… they know how many distinct users are reporting for a key
… if fewer than 10 they can garbage collect
… Going through case where all the helpers are operating as expected
… helpers do what they are supposed to do
… Having these two constraints encourages aggregations that AdNetworks are requesting
… but not doing cross-advertiser or cross-zip code
… AdNetwork might not get too much info; get garbage collected and not get value
… Maybe for a metro area
… time window to get enough users
… that is a concrete example of the counting piece
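[Editor's note: a minimal Python sketch of the two-helper counting flow just described. K_ANON and NOISE_SCALE are illustrative assumptions, and both helpers are combined in one function for brevity; in the actual protocol each helper only sees its own shares and adds its own noise.]

    import random
    import numpy as np  # np.random.laplace supplies the added noise

    K_ANON = 10        # k-anonymity threshold mentioned in the talk
    NOISE_SCALE = 1.0  # Laplace noise scale; the real value is a policy choice

    def make_shares(true_value):
        """Browser side: split a value into two additive secret shares."""
        r = random.uniform(-1e6, 1e6)
        return true_value + r, -r  # share_1 + share_2 == true_value

    def aggregate(shares_for_helper_1, shares_for_helper_2):
        """Helper side (combined here): noisy sum, or None if garbage collected."""
        if len(shares_for_helper_1) < K_ANON:  # too few distinct reporters
            return None
        return (sum(shares_for_helper_1) + sum(shares_for_helper_2)
                + np.random.laplace(0.0, NOISE_SCALE))

    # Example: 12 browsers each report one conversion for the same key.
    shares = [make_shares(1.0) for _ in range(12)]
    print(aggregate([a for a, _ in shares], [b for _, b in shares]))  # ~12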
… Modelling brings in complications
… use stochastic gradient descent
… In this case the model does something similar to aggregate reporting
… vectors aggregating from clients
… use models with count features...we can update aggregates by the protocol directly
… label connects activity across two different web sites; click coming from AdNetworks and Publisher site
… bring all this info together
… to create this V
… update the model
… for example, if we use models with counting features
… features can use
… run into problems with easy implementations of this protocol
… using these v_i's
… depends upon action and model
… to apply it may need to compute the loss function on the user's device
… user does not trust helpers to have entire action label
… implies federated learning setup, which could be expensive because
… no real bound on number of models
… number of publishers, adnetworks

[too fast]
… user trusts helper with label
… and done at helper
… but trust in helper is too strong an assumption
… with some probability p, send a label
… loss functions are going to be noisy now
… we came up with solution called Masked Aggregation
… will outline it
… address some of the generalizations on top of it
… we have binary label, conversion model
… estimate probability of conversion
… based on click
… and here one thing we can
… essentially label is either 0 or 1
… browser can send both possible labels
… feature vector
… can do optimization
… browser sending…
… think of the helper creating two gradient vectors
… g_i,0 for no conversion and g_i,1 for conversion
… this particular loss function connects information across multiple parties
… derived only from first party
… but label connects across multiple web sites
… what we will focus on
… This part we know how to do using our secret sharing
… we will do again
… let the browser generate four random numbers
… for the true label…
… if 1 is the true label…
… the alphas are a random number and one minus that number
… browser sends the alphas and betas to the helpers
… summing all of the alphas…
… is going to select the actual true gradient
… basically the browser, in a separate process, allows selection of the true gradients to be summed up and reported to the helpers
… that is essentially the idea
… introduces the real and fake gradients for aggregation
… you can generalize it to real and fake features/models to protect sensitive ad network data
… protecting the feature vector is not the main task
… it's to protect the connection
… of users across multiple web sites
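[Editor's note: a minimal Python sketch of the masked aggregation just described, for one browser and a binary label. Names are illustrative and gradients are scalars for brevity. The mask shares for the true label sum to 1 and for the fake label to 0, so combining the two helpers' masked sums keeps only the true gradient.]

    import random

    def make_mask_shares(true_label):
        """Browser: one pair of mask shares per candidate label."""
        r0 = random.uniform(-1e6, 1e6)
        r1 = random.uniform(-1e6, 1e6)
        target = {0: 0.0, 1: 0.0}
        target[true_label] = 1.0  # true label's masks sum to 1, fake's to 0
        helper1 = {0: r0, 1: r1}
        helper2 = {0: target[0] - r0, 1: target[1] - r1}
        return helper1, helper2

    def masked_partial(mask_shares, gradients):
        """Helper: gradients are computed for BOTH candidate labels, then
        masked and summed; the helper never learns which label was real."""
        return sum(mask_shares[lbl] * gradients[lbl] for lbl in (0, 1))

    # Example: one browser whose true (conversion) label is 1.
    gradients = {0: -0.25, 1: 0.75}  # per-candidate-label gradients
    m1, m2 = make_mask_shares(true_label=1)
    print(masked_partial(m1, gradients) + masked_partial(m2, gradients))  # 0.75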

Wendy: can we pause for a question
… please share slides because scribe cannot capture

bLeparmentier: can we trust a single server
… in two-party settings
… problems from bidding POV
… two parties
… what happens if one demands to double the price, how to do it?
… maybe from privacy standpoint
… a lot of overhead with two servers
… not always efficient
… I see a lot of computation
… might be a good solution, but let's not forget

Charles: yes, that is a cost
… that we have to run these helpers
… we did think about democratizing helpers
… which causes additional complications
… whoever donates computer
… we have @

Bleparmentier: we are all businesses
… significant issue
… another question
… G is @ [garbled]
… take a simple example
… very often we have gradient of size more than 10 million
… here when you will add your noise
… 10 million weight
… may first do that

Charles: sum of @ is what noise is added on top of
… after minibatch is aggregated
… and add on top of sum of the vectors

Bleparmentier: if we have 5 or 10 million, audience is 5 million
… question is are you adding noise on each of the 5 million for privacy
… and if the case, that is concerning
… the ratio is likely to be extremely high
… then it's again...

Charles: @ on gradient and how it affects modeling needs experimentation
… look at dense feature vectors and things of that type
… think that is adequately addressed by proposal
… has more issues as you mention

Bleparmentier: there are a million web sites
… lots of advertisers...cross best performance
… we are working with dense vector of 5, 10 million
… concern when I hear noise of 10m vector
… not private
… but aggregated, full @
… raise this concern
… privacy of vectors of the huge size used in the adtech world can have huge consequences

Charles: reducing vector, to smaller dimension, can be done by adnetwork
… have advertiser as key; clicks and conversions can be aggregated by the first part of the protocol
… then do conversion directly

Charles: smaller features
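[Editor's note: a rough back-of-envelope on the concern raised here, assuming independent noise of fixed scale sigma per coordinate. A gradient coordinate for a feature occurring in n_j examples of the aggregated batch has magnitude on the order of n_j·|g|, so its signal-to-noise ratio is roughly]

\[ \mathrm{SNR}_j \approx \frac{n_j \, |g|}{\sigma} \]

[In a sparse 10-million-dimension model most coordinates have small n_j, which is why per-coordinate noise can dominate; reducing dimensionality, as Charles suggests, raises n_j per coordinate.]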

Bleparmentier: it is very costly to have this very big model
… complexity
… if we utilize it is for a good reason
… maybe costs will be...don't have a number, but wanted to raise this concern

Charles: We are focused on privacy aspect to address the model parts
… it's not perfect, have limits
… you bring up good points on model performance and comes at cost for implementation of helpers

Wendy: two more questions

Ben Savage: really glad to see proposal that tackles conversion
… models; reading with interest
… have three bigger, maybe philosophical questions
… that we have been thinking about ourselves
… First question is what are the viewpoints on conversion modeling itself
… have you collected info from privacy advocates and browsers?
… finding correlations from features and outcomes
… could be age, gender, location
… and you are finding correlations
… can say for this particular person in this country, age, with behaviors
… then can say more statistically likely to engage with a particular ad
… do we as an industry feel ok with that?
… My personal opinion is that it is not telling you what a particular person did
… but there are correlations you are collecting
… what is POV on legitimacy?

Charles: yes, this is philosophical point
… I think the case for conversion modeling is coming from efficiency in the marketplace
… advertisers aligning with user interests
… do something reasonable
… but can try to take this name and convert or not
… the way we are approaching it doesn't explicitly talk about it
… we think the model training is subject to differential privacy and not subject to fine grain; adding noise to the gradients so feature combinations will not be informative to the model
… we do not know what these models are learning
… maybe output of model is not private enough
… important question

Mehul: most important thing, Ben
… is that individual users should not bias the model
… second thing is user should have some control on what is going into the model
… can get highly correlated
… here if this whole thing works
… user could @ those features...will define what kind of privacy

<AramZS> Wait... how do *users* have control over what goes in beyond deleting an interest group that they don't understand and ALSO is an action they are *very* unlikely to do?

Mehul: like not putting a gender
… or remove it
… or pull out that kind of feature
… not a single user
… that is our goal

Ben: I understand that features need to be present across many users
… and won't tell you specific actions of users
… there are plenty of attributes that are common
… this tech allows correlations
… company X learns people with attribute Y are far more likely to buy product Z than others
… MS sees this as a legit use case
… what are POVs of the other browsers and the legitimacy of it

Charles: we have not reached that level of discussion
… we should do that

Mehul: early in proposal
… did not see an easy way to do model training
… thought this could potentially work
… conversion modeling is ...
… thanks for bringing it up

Wendy: We have a few more people in queue and you have more slides and less than 10 minutes

Charles: I will go to last slide and then take remaining questions

<dialtone> 1+

<dialtone> eh, 8.52... find it hard to be able to squeeze in

Charles: the abstracts...what is helping this protocol
… is the existence of bi-@ form
… do subaggregation of data to the parties
… g-1
… calling this property out
… talk about some of the issues and known things and limits
… then go to questions
… There are many known issues in addition to the ones folks brought up
… we should discuss them
… Differentially private map-reduce
… helpers can learn if a key was involved in aggregation
… use same idea
… not sure if this is sufficient
… k-Anonymity can be attacked by adnetwork via ballot stuffing
… differential privacy gives another layer of protection
… one aspect
… feature vector and model is not hidden from helpers
… also pollution attacks from users/browser
… information looks indistinguishable
… may poison model aggregation
… some solutions with expensive crypto
… if user sends model update when activity occurs, this can leak information

<AramZS> I would never assume no collusion.

Charles: essentially an update...to model...feature vector...
… doesn't need to be tied into
… could be batched together
… Solution is limited to models that are continuous functions trained using SGD
… higher cost
… these are some of things we thought of for the proposal
… stop here; no time to go through Joel's part of presentation

Angelina: thank you for this; quite a lot of detail
… I would love to think about all the different use cases and mapping
… so many different models
… as an industry we need to agree on the data points being established
… so there is minimal opportunity to ID specific segments
… if we believe legislation will impede these initiatives
… concern is that we have always been a last touch attribution model
… difficult for clients to adopt other models
… in the past Google provided great opportunity to adopt them
… being able to customize a campaign
… direct model...latency windows have not been discussed
… what are parameters
… we need to start discussing what are the parameters moving forward and have the ability to do customization

Charles: we don't touch on attribution
… perhaps mediated by the browser
… clusters of user proposals
… we take that as @

<Zakim> AramZS, you wanted to say that I'm unclear on the protection this provides to sensitive categories and ask about potential protections (after end of presentation)

Angelina: Great, thank you

Aram: thank you for the detailed presentation
… I am going to suggest some things to address
… will be blunt
… this will target people based on sensitive categories
… targeting payday loans by race, for example
… is like ad targeting specific attributes
… think you need to address sensitive categories
… and how users are going to interact
… is there ability for users to remove themselves
… users tend to remove themselves from complicated situations
… document how you intend users to interact with them
… model training; don't assume no collusion, assume there will be fraud
… thanks for talking with us

Charles: these are good points
… essentially problem that all models have
… we have not thought about it in a privacy-preserving manner
… not making that task easier

Aram: which gets to core of Ben's question, how does this function in future

Charles: collusion gets more hairy
… talk about model first before getting to other assumptions

Wendy: Valentino, is your question quick?

Valentino: unlikely

<wseltzer> https://github.com/WICG/privacy-preserving-ads/

Wendy: proposal is in WICG repository
… assume you welcome discussion there
… If Joel has additional material to share, let's see if we can bring that back to another meeting
… thank you Charles
… see you in two weeks

Minutes manually created (not a transcript), formatted by scribe.perl version 136 (Thu May 27 13:50:24 2021 UTC).
