SV_MEETING_TITLE -- 11 Feb 2013

<johnsimpson> am I in right group?

I was sent here though my last name puts me in C

<johnsimpson> this is last name with S, though right?

<vincent> yes

Hard to know -- Auerbach, Fielding, Doty...

<vincent> I think Nick is in all groups

<dan_auerbach_> I was just asked to lead this section 15 minutes ago or so

That's a good idea -- thanks, Roy

Always an adventure

<johnsimpson> do we have the "questions"

<johnsimpson> how many actually in the room?

<dan_auerbach_> let us know once the room is able to connect and we'll get started

<dan_auerbach_> did everyone get the list of questions? or just the group leaders?

<dan_auerbach_> i can paste them into irc

<dan_auerbach_> if others haven't seen them

<sidstamm> please do

I see Dan, John, Sid, Vincent, Aleecia. Presumably Wendy, like Nick, here to staff

Haven't seen them

<johnsimpson> testing now. anybody hear me

nope

yes

<johnsimpson> i hear you

yes

<johnsimpson> are there "questions"?

where are the questions for the session?

we can be reading meanwhile?

<sidstamm> I'd like to see them too

<sidstamm> :)

seriously?

<sidstamm> paste a url?

Upload the doc please

<wseltzer> we're trying...

thank you Rob

<johnsimpson> how many in actual room?

<robsherman> 1. “Lifetime browsing history” is a phrase that is often used, but never defined clearly. What would LBH mean as a technical matter?

<robsherman> 2. In light of this definition, what technical measures would suppress or delete LBH?

<robsherman> 3. Tying LBH to the previous group discussions of “buckets” or “low-entropy cookies,” how can the latter continue while suppressing or deleting LBH?

<robsherman> 4. Are there any compelling use cases for retaining detailed browsing history beyond a general time limit on retention?

<robsherman> 5. If so, how would you limit those use cases consistent with the goals of: (1) limiting LBH; while (2) enabling “buckets” or “low-entropy cookies”?

<schunter> 1. “Lifetime browsing history” is a phrase that is often used, but never defined clearly. What would LBH mean as a technical matter?

<wseltzer> Mike Zaneis, Rob Sherman, Bryan Sullivan, Sam Silberman

<wseltzer> Adam Turkel

<wseltzer> phone: John Simpson, Dan Auerbach, Aleecia McDonald, Berin Szoka

<wseltzer> ... Sid Stamm

<wseltzer> Room+: Wendy Seltzer, Matthias Schunter

<vincent> wseltzer, I'm here as well you did not hear me?

Dan: going through questions high level, then will focus

(agreement)

Dan: lifetime browsing history defn -- what would LBH mean as a technical matter?

<wseltzer> Phone+ vincent Toubiana (thanks)

Dan: If you have big table keyed with pseudonym and table has URIs and timestamps, that's what I think of as LBH
... Assuming longer than short retention (1 week, 1 mo perhaps) that's my starting point for defn
... use as working defn and keep going?
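
A minimal sketch of the working definition Dan describes: a table keyed by pseudonym that holds URIs and timestamps, flagged once a pseudonym's records span more than a short retention window. The `Visit` record, the `long_histories` helper, the sample rows, and the 30-day default are illustrative assumptions, not anything the group agreed on.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta


@dataclass(frozen=True)
class Visit:
    pseudonym: str       # hypothetical key, e.g. a cookie ID (not the person's name)
    uri: str
    timestamp: datetime


def long_histories(visits, short_retention=timedelta(days=30)):
    """Return the pseudonyms whose records span more than short_retention."""
    first, last = {}, {}
    for v in visits:
        first[v.pseudonym] = min(first.get(v.pseudonym, v.timestamp), v.timestamp)
        last[v.pseudonym] = max(last.get(v.pseudonym, v.timestamp), v.timestamp)
    return {p for p in first if last[p] - first[p] > short_retention}


# Made-up rows for illustration only.
table = [
    Visit("a1", "http://news.example/story", datetime(2013, 1, 1)),
    Visit("a1", "http://shop.example/shoes", datetime(2013, 3, 1)),
    Visit("b2", "http://news.example/story", datetime(2013, 1, 1)),
    Visit("b2", "http://news.example/other", datetime(2013, 1, 5)),
]
print(long_histories(table))  # only "a1" spans more than 30 days
```

Under this reading, what matters is the span of records held under one key, not whether the key reveals who the person is.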

Susan: ?

Matthias: too strong a requirement. Have URL and for some reason you know they came from same person, that's enough
... if all articles without URLs, also LBH
... all books someone has looked at, even without URIs, still LBH

Dan: any dataset known to be same person or device over time?

Matthias: if identifiers of books, not just URIs

Dan: agree

sherman: if no identifier but you know who it is?

Matthias: know it's the same person, even if not who the person is, that's a LBH

sherman: why? Why are we concerned if you cannot link?

Matthias: different question. We're answering what's a LBH

(cross talk)

bryan: find what's alike, determine duration. An individual, to me, it says URI data with an individual whatever that is. Compiled over a *long* period of time. Collected and maintained with that purpose in mind.

<Zakim> bryan, you wanted to say it means to me a set of URI data associated with an individual, compiled over a long period of time, and collected/maintained with that purpose in mind

bryan: something specific, not what you can do with it, but the record that is collected and maintained

Mike: supportive of Peter's intro, thought we have identified an issue we can agree is a consumer privacy issue that might be able to be addressed

<schunter> The history (if not linkable to any person) seems far less critical as compared to a LBH that can be associated with a person.

Mike: 3rd party collection of a *long* history, not defining long yet, point being that's "tracking"
... what we've focused on is the 3rd party tracking
... tried to come up with more transparency and control

<sidstamm> I'm confused about the "for that purpose" part of the definition... what is the purpose referenced by "that"?

Mike: not interested in lifetime browsing history, but want to agree on the scope, and more interested in the other questions on the list
... let's get on to the next questions unless we're not just talking about 3rd party

Dan?: not sure we want to get too deep, other comments?

<bryan> +1 to moving forward to the other questions

<johnsimpson> +1 to Aleecia

aleecia: would include 1st parties in LBH for defn, though perhaps not what we care about under DNT

<dan_auerbach_> +q

matthias: wouldn't just discuss 3rd parties, but may constrain to just 3rd parties

dan: purpose of the dataset shouldn't be part of the defn

+1 dan

dan: 1st or 3rd party shouldn't be part of defn
... moving to Q2

<schunter> In light of this definition, what technical measures would suppress or delete LBH?

<BerinSzoka_> link to the questions?

thanks, Matthias!

<wseltzer> [2. In light of this definition, what technical measures would suppress or delete LBH?]

dan: rough idea of defn. what tech measures suppress / delete?

matthias: long series of events, can regularly suppress linkability. every fixed time you start collecting fresh, not long-term any more

<BerinSzoka_> could someone please share the list of questions?

matthias: if you use cookies and you throw away cookies and set new ones, unless you do new things, that breaks the linkability

dan: hear you're changing the pseudonym

<schunter> Regular breaking the linkability (e.g., by erasing cookies while not using any other linking-ability)

matthias: yes

dan: not enough. also storing IP address, can link

<schunter> Not storing a dataset that can link two "subsequences" of the LBH.

dan: need strong notion beyond moving from one cookie to another.

wendy: if we have long but unid'ed history, addition of one piece of linking data could tie that back
... rotating identifiers to break into shorter periods of time might be useful
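
A minimal sketch of the rotating-identifier idea raised above, assuming the collector derives a per-period pseudonym from a random key that it throws away when the period ends. The `RotatingPseudonymizer` class and the notion of an integer `epoch` are hypothetical. As Dan notes, this alone is not enough if other stored fields (an IP address, for instance) still link the periods.

```python
import hashlib
import hmac
import secrets


class RotatingPseudonymizer:
    """Derive pseudonyms that are stable within an epoch but unlinkable across epochs."""

    def __init__(self):
        self._epoch = None
        self._key = None

    def pseudonym(self, stable_id: str, epoch: int) -> str:
        if epoch != self._epoch:
            # New epoch: draw a fresh random key and forget the old one,
            # so earlier pseudonyms can no longer be recomputed.
            self._epoch = epoch
            self._key = secrets.token_bytes(32)
        mac = hmac.new(self._key, stable_id.encode(), hashlib.sha256)
        return mac.hexdigest()[:16]


p = RotatingPseudonymizer()
week1_a = p.pseudonym("cookie-123", epoch=1)
week1_b = p.pseudonym("cookie-123", epoch=1)
week2 = p.pseudonym("cookie-123", epoch=2)
print(week1_a == week1_b)  # same epoch, same pseudonym
print(week1_a == week2)    # different epoch, different pseudonym
```

The length of an epoch is the policy question the group is debating; the code only shows that the link between epochs can be made to depend on a key nobody keeps.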

dan: reasonable suggestion

<schunter> Interesting question: re-linkability of sub-sequences.

dan: fields that can link between records or data sets, important to look at everything, including time stamps
... can correlate prior records to new ones

<wseltzer> [what about fuzzing of data?]

dan: broadly, want to look at all fields you are collecting and make sure none can correlate
... can go from timestamp to a day or an hour
... quickly through other qs then focus and make progress
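
A minimal sketch of making a timestamp less specific, as Dan suggests (going from an exact time to an hour or a day). The `coarsen` helper and its granularity names are assumptions made for illustration.

```python
from datetime import datetime


def coarsen(ts: datetime, granularity: str = "hour") -> datetime:
    """Drop the fine-grained part of a timestamp before it is stored."""
    if granularity == "hour":
        return ts.replace(minute=0, second=0, microsecond=0)
    if granularity == "day":
        return ts.replace(hour=0, minute=0, second=0, microsecond=0)
    raise ValueError("granularity must be 'hour' or 'day'")


seen = datetime(2013, 2, 11, 14, 37, 52)
print(coarsen(seen, "hour"))  # 2013-02-11 14:00:00
print(coarsen(seen, "day"))   # 2013-02-11 00:00:00
```

Coarser timestamps make it harder to line up an old record with a new one by time alone, at the cost of less precise analytics.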

bryan: confused about suppression
... if it's impossible to correlate then you have suppression is that fair?

dan: yes, or make data less specific

<schunter> If you cannot correlate two browsings that are a long time apart, then you suppressed the LBH.

bryan: url being one piece of the dataset, ok
... different question: what is the tech that will enable decorrelation over time is an arms race. not productive to get into details.
... what we learned from HIPAA is best we've seen, don't know we'll do better

dan: should strive to do better. agree normative lang to specify a technique is not the way to go. but let's brainstorm
... HIPAA missed the mark, we can do better
... let's at least explore even if we don't suggest a particular thing

<sidstamm> aleecia: want to find a nice balance that can suppress while still providing a benefit for privacy and maximize monetary benefit

aleecia: we could just delete all URIs. presumably there are ways the data is useful for industry / profitability - how do we do that more privacy protecting?

matthias: can we delete after 90 or 60 days?

Rob: matthias is looking at me :-)
... timeline is not a LBH
... can click Like button to add things to record, but one-off basis
... not LBH, it's one off, and it's affirmative action from the user

<bryan> I'm unsure of the value, in this working group and DNT context, of focusing on techniques for ensuring long-term records of data are not correlatable. That is pretty deep science for this group. I think at most we can set objectives, and let the market develop techniques that meet the objectives.

Matthias: affirmative action of user is important, not our concern

rob: if you choose to use a tool, it's out of scope

(bryan, thank you for adding what i didn't capture well enough)

dan: not talking about bits and pieces a user affirmatively adds. this is a background thing that happens without the user's knowledge

(speaker phone troubles)

<schunter> -- resolved.

dan: agree that the piece Rob & Matthias are talking about should go into the defn -- not a discrete set of user added items, but something automatic and regularly

(not sure I agree, but there is some line there)

Rob: plugins are short period of time, need for trouble shooting. Not kept more than 90 days.
... then it's not identifiable form

Mike: attribution, analytics, targeting -- vary from short to long

Matthias: ad networks use for more than a year, is that normal for a campaign?

<johnsimpson> how long are "relatively short periods"

Matthias: big data?

<schunter> 1 year: seasonal and campaigning.

Mike: varies. Over a year for seasonal campaigns to adjust inventory or for market research
... need longer than a year
... interesting discussion we can have is other ways to get the insights for an ad model but less identifiable or shorter retention
... you get wide ranges of time, and if first party even wider. Carriers will have lots of reasons to keep in identifiable format for longer.

<wseltzer> aleecia++

dan: makes sense. Digression to ad world, if anyone there can help me understand
... for behavioral targeting, can conclude interested in sports apparel. Have a URL then to several buckets, male, 30-40.
... break into profiles, or use raw URL

rachel: can't speak to URL question, but isn't just you looked at a sports page. History across consumers, and over time, to reach conclusion not related to sports.

<vincent> I think they use full URL when they do retargeting

rachel: use of Crest means Republican, Colgate is Democrat
... inferences and correlations, even if not identifiable

<schunter> I suspect that today, data is just kept to allow later re-mining with new algorithms.

Mike: ways data is currently used, if I were a marketer running super bowl ads, sponsor webpage for it plus TV commercial. 6 months later, want to know if someone came back to your site to get more info
... would want to measure effectiveness of ad campaign
... was it worth it to sponsor the site? Want to measure long term.
... marketers, ad networks, would want to know which creative on that site was more effective.
... immediate conversion may not be what builds longer-term brand recognition
... insights are valuable throughout supply chain

<wseltzer> aleecia: we've heard in the past that buckets change

<wseltzer> ... trying to predict/mine the correlations in advance is difficult

<wseltzer> ... tradeoffs will vary from company to company; some tech to bridge the gaps

<BerinSzoka_> of course, what Aleecia just said assumes that a significant percentage of the market will not be DNT users--which, I'm not sure we can assume, given Microsoft, etc.

aleecia: may be able to get the bulk of the value with buckets rather than URIs, do lose time / money if you need to start data collection from scratch on a new unforeseen topic. Different companies have different costs.
... may be able to sample from non-DNT users

John: can't we draw the inferences and get rid of the URIs?

Mike: don't disagree. Identifying how data is used.
... What would be impacted if group changed focus to what you're jumping ahead to.

John: think we need to understand that. If end goal is to in fact eliminate URIs, let's think of ways to make inferences necessary

Mike: not just inferences. Cost per impression moves to cost per click or cost per action. Purchase funnel and ads paid for in different way
... valuable to know how an action came about
... important to the analytics of the internet
... could carve out ad delivery and reporting
... perhaps this new approach on "data hygiene" for URIs -- sometimes URIs are really necessary
... do we carve them out or can we do better than that?
... can we find a better balance?

matthias: question on campaign measurement. Not a good use. If you do a superbowl campaign, after 90 days wouldn't be able to know if actions were impressed by this campaign or by another
... big reaction in first 30 days, maybe long tail, but not so big

Mike: good use case, but your point is you don't need it for a life time. There is a shorter effective useful life for that URI in that example
... for web analytics and attribution, but maybe not for a full year
... do I want to pay an ad network a year later after they run an ad?

Matthias: use case makes sense, but longevity of browsing history is limited.
... can cut it.

Dan: another question, let's get into use cases for over 90 days. Maybe we can bracket that.

<schunter> ack

<Zakim> vincent, you wanted to mention retargeting

vincent: retargeting is a use case where you need the full, exact URL
... know which products viewed

dan: need exact URL, or the product?

vincent: not sure

matthias: is seasonal common?
... if valentine's day, view flowers, will a year later remember me?

Mike: long time period, likely to try mother's day
... depends on who's doing it
... is it the website, a 3rd party, how granular do they need -- it varies. Plethora of different business models

Matthias: know of any long-term focus companies?

Mike: tried to limit to 3rd party, easier to answer. If 1st party, answer is yes.
... small publisher (missed) gets 90% of traffic in November. may re-target in November.
... ad networks not as much, but 1st parties do

Matthias: 1st parties more likely to keep longer than 3rd parties

Mike: yes, more valuable than for 3rd parties
... 3rd Q on low-entropy cookies, please describe?

Dan: different issue
... keep them separate
... move to client-side solutions
... browser stores user info and selectively doles that out to advertisers
... browser makes decisions about targeting
... low-entropy cookies is a simple way to do this
... instead of unique identifier, set a cookie for "sports person" on millions or thousands of users
... small set of different sorts of cookies, all client side
... don't need to retain it all on the server side
... client-side will evolve over the next year or so
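
A minimal sketch of the low-entropy cookie idea Dan describes, assuming the browser picks one label from a small fixed set based on history it keeps locally, and sends only that label. The label set, the `SITE_LABELS` mapping, and the helper names are hypothetical; no browser is known from this discussion to work this way.

```python
import math
from collections import Counter

# Hypothetical small label set shared by very many users.
BUCKETS = ("sports", "news", "travel", "other")

# Hypothetical client-side mapping from site host to interest label.
SITE_LABELS = {
    "scores.example": "sports",
    "league.example": "sports",
    "daily.example": "news",
    "flights.example": "travel",
}


def choose_bucket(local_history):
    """Pick the most frequent interest label from history held on the client."""
    counts = Counter(SITE_LABELS.get(host, "other") for host in local_history)
    return counts.most_common(1)[0][0] if counts else "other"


def max_bits(n_values):
    """Upper bound on the identifying information a cookie with n_values can carry."""
    return math.log2(n_values)


history = ["scores.example", "league.example", "daily.example"]
cookie_value = choose_bucket(history)
print(cookie_value)            # sports
print(max_bits(len(BUCKETS)))  # 2.0
```

A cookie with four possible values carries at most 2 bits, whereas distinguishing a billion users with a unique identifier takes roughly 30 bits; that gap is the sense in which such a cookie is "low-entropy".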

mike: my publisher members will *hate* that, but thanks for the description

dan: example of super bowl, 6 months later, value of data drops off
... assume a visit / impression is less valuable information than a click or an action
... wondering relative weight of URIs for impression than click or action data

rob: for targeting, we don't do this and Mike just left, we should not lose sight of other use cases
... might want to know if ad campaigns are performing well
... might want to know looked at site after campaign ran and was due to that campaign

rachel: super bowl is just a moment in time. Not the normal case.

peter: Q about 1st and 3rd party.
... 3rd party networks have visibility across more sites. Can ask to delete data / portability from 1st parties, moving that way. Harder to do with 3rd parties users haven't seen.
... seeing as same?

Dan: 3rd parties more likely to not need data as long, either, as 3rd difference

Rachel: FB letting you delete is different from Amazon letting you delete purchase history, which you could not do

Peter: transactions and financial, but not need URI details
... would be purple shirt not the green shirt

sam: long tail and first party issues
... seasonal business for our customers
... want to know how they got the customer in the first place
... how do I acquire new customers

Rachel: if you include browsing history with any identifier, need small business to know XYZ identifier from what source, might be more useful than who the user is.

cross talk

Rachel: still on conversation about what would be necessary for this information?

<sidstamm> my regrets, I have to drop off for a while.

<Zakim> wseltzer_cpdp, you wanted to ask sampling?

Wendy: hearing some uses, and sampling could be effective. Others where it is not.
... could sample time slices or user segments.
... retargeting is specific and sampling does not work
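
A minimal sketch of sampling by user segment, as Wendy suggests: keep detailed records only for a fixed fraction of pseudonyms, chosen deterministically so the same pseudonym is always in or out. The `in_sample` helper and the salt value are assumptions for illustration.

```python
import hashlib


def in_sample(pseudonym: str, rate: float, salt: str = "study-salt") -> bool:
    """Decide deterministically whether a pseudonym falls in the retained sample."""
    digest = hashlib.sha256((salt + pseudonym).encode()).digest()
    # Map the first 8 bytes of the hash to a number in [0, 1).
    position = int.from_bytes(digest[:8], "big") / 2**64
    return position < rate


kept = [p for p in ("a1", "b2", "c3", "d4") if in_sample(p, rate=0.25)]
print(kept)  # the subset of pseudonyms whose detailed records would be retained
```

This fits aggregate uses such as market research; as noted above, it does not help retargeting, which needs the specific user's record.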

dan: been blurring these.
... high level statistics v. exact URI, should keep clear

Rob: Peter's question, assuming LBH is across sites over time

Dan: not so clear to me

Rachel jumps in: unclear

Rob: went to WaPo -

<johnsimpson> yes yes

Rachel interrupts

Rob: do think it different for reasons Peter describes
... can look at retention or not visit a first party, situation is different

<wseltzer> [could be a question to consider: does an LBH across single site pose fewer user concerns than LBH across many sites?]

Rob: doesn't require DNT

Dan: not disagreeing, but for users, understanding what happens on FB is not always clear
... may not have clear mental model on FB
... may not affect DNT discussion though

john: if LBH, 1st, 3rd, 5th and 6th parties. May have different requirements though for 1st and 3rd party.
... but LBH involves all the pages you view on a site if it's kept
... what we do about that is different. But defn is not just x-site

I'd be happy to defn now and add "we may not care about 1st parties"

Rob: contexts are different

See no reason not to defn...

Understanding use cases for long-term retention

Rob: use cases for long periods. Wendy brought up keeping retarget data as different

Dan: other use cases beyond seasonal to need full URI 1 year later?
... anyone able to offer those?

rachel: IP, fraud detection
... verify users for IP perspective
... access to accounts, subscription accounts

<wseltzer> [IP as intellectual property]

Dan: worked in fraud detection in industry, but click data is more useful than impression

Rachel: fraud areas not just in delivery and reporting but also for IP

<johnsimpson> what kind of IP issues?

Dan: would be interested in hearing more
... can we learn more?

Rachel: will see about finding a resource

Sam: subpoenas for data
... if court ordered, that's an exception, and you have to retain and produce it.

matthias: that's a reason for keeping less data

<johnsimpson> Well one of the reasons not to keep data is precisely so it won't be subpoenaed.

matthias: large enterprises have retention policies to avoid costs of discovery

?: as policy, don't keep what you don't need

Dan: we all agree you have to produce data if compelled

Bryan: they can go on quite a long time

Sam?: fraud can be someone breaking into your system and need proof, that's first person

scribe: might want to retain that data

Dan: permitted uses, that's interesting
... what's needed over a year?

Rob: bleeds into permitted uses
... things folks reasonably want to do beyond short span of time
... we do a lot of analytical work on FB
... fake accounts, child predators, don't disclose details
... not fraud or security but site integrity

Dan: would "abuse" work?

Rob: in general, but don't know how you write that

Bryan: terms of use. Need users to follow them.

Rob: broader than that, policy might not say "no child abusing" but we should deal with it

Sam: have same thing

Dan: need to end soon
... go through queue then summarize

wendy: useful in this exercise, different data needs for different uses
... might be the case that no one needs URIs plus time stamps plus sites visited, but someone needs URIs but fuzzy time, someone else needs both but for a subset of users for sampling
... another is URIs and times at suspicion of fraudulent access

<dan_auerbach_> +1

wendy: the more specific we can be,

dan: great idea, and understanding tradeoffs would be great

<wseltzer> aleecia: let's write down definition of LBH

<wseltzer> ... note we're not currently contemplating 1st parties and 3rd parties doing the same things

<wseltzer> ... 3, let's try a strawman on a specific timeframe

Dan: talking about 1st and 3rd parties, not context of data collection
... not an explicit user action generating the data, FB timeline isn't what we mean
... collection of info derived from site visits, from same person / device, would be a LBH
... books looked at on AMZN would be LBH

(that sounds right to me)

Rachel: even if it's not connected to a unique id? If there's no connection, why is there concern?
... the idea you would suppress something not identifiable expands the world
... need some ability to be identified

Dan: has to be some sense in which there's knowledge that things are linked

<wseltzer> [some anonymized info can easily be re-linked to an individual]

Dan: if collection of ISBN numbers and it's random, ok
... if collection all from one person, that makes it a LBH without an identifier

Matthias: can re-identify
... my browsing history for two months, use schunter.org regularly
... could make a good guess it's from me
... search histories have identifying terms
... can make good guesses

rachel: if in buckets?

matthias: that would be ok
... can you re-associate is the question
... if "went to FB, GOOG," that could be ok

rachel: important because (sorry, missed - please fill in)

matthias: if browsing history is shared, k-anon, typical

rachel: how do we get that in the defn?

matthias: in LBH, can only do top 10 sites :-)

dan_auerbach: google.com/dansinbox is identifying
... bucketing to google.com might be reasonable
... smaller sites into sports sites might be useful
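
A minimal sketch of the bucketing Dan describes: a full URL such as google.com/dansinbox is reduced to its host, and smaller sites are folded into a category. The `SITE_CATEGORIES` table and the `bucket` helper are hypothetical.

```python
from urllib.parse import urlsplit

# Hypothetical mapping of small sites to a broader category.
SITE_CATEGORIES = {
    "smallsportsblog.example": "sports sites",
    "localteam.example": "sports sites",
}


def bucket(url: str) -> str:
    """Reduce a URL to its host, or to a category for small sites."""
    if "//" not in url:
        url = "//" + url          # let urlsplit treat a bare "host/path" as a host
    host = urlsplit(url).hostname or ""
    return SITE_CATEGORIES.get(host, host)


print(bucket("google.com/dansinbox"))                        # google.com
print(bucket("http://smallsportsblog.example/scores/2013"))  # sports sites
```

The path is where the identifying detail sits in Dan's example, so dropping it removes that detail while keeping the fact that a large, common site was visited.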

(thanks!)

Dan: k-anon has no ambiguity
... can navigate those waters

rob: being careful that we're not conflating LBH and de-id'ed data

<schunter> ... in theory ... (afaik the def. contains "background knowledge" of the adversary)

rob: two concepts
... in example, amazon could say "here's the list of all the books a person looked at" not sensitive but valuable
... different from "and I can tell Matthias is the person who looked at them"
... no privacy problem
... get worried if one of those books is "Matthias' web mailer" it's linkable.

Bryan?: what is an individual

scribe: if not tied to a person, not individual
... not a history, just a record, if it's not tied

dan: hear you, important to keep LBH separate
... more on this tomorrow with Ed
... papers on re-id
... you have databases and can re-id
... don't need to answer that now, just what is a LBH

<wseltzer> aleecia: Papers, Netflix contest (Narayanan & Shmatikov) Anonymized users can be id'd by reference to another database, and you don't have control over others' databases

<wseltzer> ... k-anonymity and buckets, ways of thinking about the long tail of re-identifiable data

<wseltzer> ... we don't have to solve it here, can set aside with "if you have an unlinkable data-set"

matthias: shorter histories -> easier k-anon
... 4 days of sites, not full URLs, then many users will be the same
... the longer, the more difficult to get k-anon
... month-long history is not as likely as possible with full URLs

Dan: agree

matthias: longer the history, the more difficult the k-anon. the more data, the less likely users have the same profile

Dan: agree on that too
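
A minimal sketch of Matthias's point that longer histories make k-anonymity harder, using made-up data: `k` here is the size of the smallest group of users whose set of visited sites is identical over the first few days. The `k_of` helper and the toy histories are assumptions for illustration.

```python
from collections import Counter


def k_of(histories, days):
    """Smallest number of users sharing the same set of sites over the first `days` days."""
    profiles = Counter(
        frozenset().union(*per_day[:days]) for per_day in histories.values()
    )
    return min(profiles.values())


# Toy data: one set of sites per user per day.
histories = {
    "u1": [{"news"}, {"mail"}, {"sports"}],
    "u2": [{"news"}, {"mail"}, {"cooking"}],
    "u3": [{"news"}, {"mail"}, {"sports"}],
    "u4": [{"news"}, {"mail"}, {"chess"}],
}

print(k_of(histories, 2))  # 4: after two days all four users look alike
print(k_of(histories, 3))  # 1: after three days two users are already unique
```

Each extra day can only split groups further, never merge them, which is why a shorter retained history is easier to keep k-anonymous.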

("just agree" and "disagree" sound similar :-)

Dan: if no timestamp, easier too. fewer fields -> easier to have de-linked data set

<Zakim> bryan, you wanted to mention that the potential that unlinked data is somehow made linkable later is real, but should not impact the compliance of who recorded the unlinked data,

bryan: unlinked data that was recorded but later turns out to be linkable, that data as recorded doesn't represent a browsing history
... that some future party can resurrect it doesn't make it a browsing history

-1

bryan: if there is a fault in this, it is the fault of the person who does the resurrection
... if there's no link to an individual, it doesn't represent a browsing history
... if in the future it's not the fault of the recording company
... the only response if you disagree is not to record anything

matthias: netflix example is nice

bryan: get it's possible to re-link
... but if you've done everything to the state of tech today, you've fulfilled expectations. If addition of other data that's put together and the user didn't authorize it, then it's the party who resurrected that data

Dan: grey area but need to end in 5 minutes
... don't quite share that view. What if dataset was linked to Mr. Man, and he did bad things, and works at EFF, can get down to a few people. Then all it takes is one fact not in the db to identify it

Different argument: being able to relink *is* a current known threat, Bryan

We know this is real. We should account for it.

Rachel: fault is not helpful yet
... that would include identifiers.
... there is the possibility of a browser history that is not identifiable

Bryan: if no identifier, it's not linked, it's not a browsing history. Period.

(full disagreement from me)

Dan: we'll get back to defn
... Bryan disagreeing on defn in that, don't think a set of ISBN numbers from a specific user are an LBH
... don't know which user, just that it's one user
... just a list of movies a person watched

Rob: don't think that's consensus for that. not an LBH

Dan: hearing no consensus there. But is automatic collection of data, rather than affirmative user choice
... has to be retained, haven't picked a time limit
... do we want to say a month as a working limit?
... just as a defn

<johnsimpson> no. a day

<bryan> +1 I agree that "fault" was not the intent of my point, but that the party that saves an unlinked (to a person) set of related browsing records is not recording an individual's browsing history.

Rob: 6 weeks, 90 days, 365
... 30 days too short

<johnsimpson> i'm serious

I can live with 6 weeks

Mike: ad campaign for 1 month at least
... and months are longer than 30 days
... need to batch & process data
... 30 doesn't work

Dan: not sure we want to link this to retention for de-id

Can we agree under 3 months?

Where we still debate, but under 3 months?

Dan: 1st parties may want to keep things longer, use cases for 3rd parties too

Matthias: 3rd parties less likely to need long-term retention

Mike: agree, but marketer may find more useful for longer
... most ad networks won't use it for really long, but marketer may

Rachel: can use de-id'ed but need inferences

?: seasonal is a common thing

Dan: anything else?

when is next session?

<johnsimpson> thanks, Dan

main room when?

thanks Wendy!

<tlr> 3:45pm main room

<vincent> thanks

<wseltzer> thanks, all!
