See also: IRC log
<johnsimpson> am I in right group?
I was sent here though my last name puts me in C
<johnsimpson> this is last name with S, though right?
<vincent> yes
Hard to know -- Auerbach, Fielding, Doty...
<vincent> I think Nick is in all groups
<dan_auerbach_> I was just asked to lead this section 15 minutes ago or so
That's a good idea -- thanks, Roy
Always an adventure
<johnsimpson> do we have the "questions"
<johnsimpson> how many actually in the room?
<dan_auerbach_> let us know once the room is able to connect and we'll get started
<dan_auerbach_> did everyone get the list of questions? or just the group leaders?
<dan_auerbach_> i can paste them into irc
<dan_auerbach_> if others haven't seen them
<sidstamm> please do
I see Dan, John, Sid, Vincent, Aleecia. Presumably Wendy, like Nick, here to staff
Haven't seen them
<johnsimpson> testing now. anybody hear me
nope
yes
<johnsimpson> i hear you
yes
<johnsimpson> are there "questions"?
where are the questions for the session?
we can be reading meanwhile?
<sidstamm> I'd like to see them too
<sidstamm> :)
seriously?
<sidstamm> paste a url?
Upload the doc please
<wseltzer> we're trying...
thank you Rob
<johnsimpson> how many in actual room?
<robsherman> 1. “Lifetime browsing history” is a phrase that is often used, but never defined clearly. What would LBH mean as a technical matter?
<robsherman> 2. In light of this definition, what technical measures would suppress or delete LBH?
<robsherman> 3. Tying LBH to the previous group discussions of “buckets” or “low-entropy cookies,” how can the latter continue while suppressing or deleting LBH?
<robsherman> 4. Are there any compelling use cases for retaining detailed browsing history beyond a general time limit on retention?
<robsherman> 5. If so, how would you limit those use cases consistent with the goals of: (1) limiting LBH; while (2) enabling “buckets” or “low-entropy cookies”?
<schunter> 1. “Lifetime browsing history” is a phrase that is often used, but never defined clearly. What would LBH mean as a technical matter?
<wseltzer> Mike Zaneis, Rob Sherman, Bryan Sullivan, Sam Silberman
<wseltzer> Adam Turkel
<wseltzer> phone: John Simpson, Dan Auerbach, Aleecia MacDonald, Berin Szoka
<wseltzer> ... Sid Stamm
<wseltzer> Room+: Wendy Seltzer, Mathias Schunter
<vincent> wseltzer, I'm here as well you did not hear me?
Dan: going through questions high level, then will focus
(agreement)
Dan: lifetime browsing history defn -- what would LBH mean as a technical matter?
<wseltzer> Phone+ vincent Toubiana (thanks)
Dan: If you have big table keyed
with pseudonym and table has URIs and timestamps, that's what I
think of as LBH
... Assuming longer than short retention (1 week, 1 mo perhaps)
that's my starting point for defn
... use as working defn and keep going?
Susan: ?
Matthias: too strong a
requirement. Have URL and for some reason you know they came
from same person, that's enough
... if all articles without URLs, also LBH
... all books someone has looked at, even without URIs, still
LBH
Dan: any dataset known to be same person or device over time?
Matthias: if identifiers of books, not just URIs
Dan: agree
sherman: if no identifier but you know who it is?
Matthias: know it's the same person, even if not who the person is, that's a LBH
sherman: why? Why are we concerned if you cannot link?
Matthias: different question. We're answering what's a LBH
(cross talk)
bryan: find what's alike, determine duration. An individual, to me, it says URI data with an individual whatever that is. Compiled over a *long* period of time. Collected and maintained with that purpose in mind.
<Zakim> bryan, you wanted to say it means to me a set of URI data associated with an individual, compiled over a long period of time, and collected/maintained with that purpose in mind
bryan: something specific, not what you can do with it, but the record that is collected and maintained
Mike: supportive of Peter's intro, thought we have identified an issue we can agree is a consumer privacy issue that might be able to be addressed
<schunter> The history (if not linkable to any person) seems far less critical as compared to a LBH that can be associated with a person.
Mike: 3rd party collection of a
*long* history, not defining long yet, point being that's
"tracking"
... what we've focused on is the 3rd party tracking
... tried to come up with more transparency and control
<sidstamm> I'm confused about the "for that purpose" part of the definition... what is the purpose referenced by "that"?
Mike: not interested in lifetime browsing history, but want to agree on the scope, and more interested in the other questions on the list
... let's get on to the next questions unless we're not just
talking about 3rd party
Dan?: not sure we want to get too deep, other comments?
<bryan> +1 to moving forward to the other questions
<johnsimpson> +1 to Aleecia
aleecia: would include 1st parties in LBH for defn, though perhaps not what we care about under DNT
<dan_auerbach_> +q
matthias: wouldn't just discuss 3rd parties, but may constrain to just 3rd parties
dan: purpose of the dataset shouldn't be part of the defn
+1 dan
dan: 1st or 3rd party shouldn't
be part of defn
... moving to Q2
<schunter> In light of this definition, what technical measures would suppress or delete LBH?
<BerinSzoka_> link to the questions?
thanks, Matthias!
<wseltzer> [2 2. In light of this definition, what technical measures would suppress or delete LBH?]
dan: rough idea of defn. what tech measures suppress / delete?
matthias: long series of events, can regularly suppress linkability. every fixed time you start collecting fresh, not long-term any more
<BerinSzoka_> could someone please share the list of questions?
matthias: if you use cookies and you throw away cookies and set new ones, unless you do new things, that breaks the linkability
dan: hear you're changing the pseudonym
<schunter> Regular breaking the linkability (e.g., by erasing cookies while not using any other linking-ability)
matthias: yes
dan: not enough. also storing IP address, can link
<schunter> Not storing a dataset that can link two "subsequences" of the LBH.
dan: need strong notion beyond moving from one cookie to another.
wendy: if we have long but
unid'ed history, addition of one piece of linking data could
tie that back
... rotating identifiers to break into shorter periods of time
might be useful
dan: reasonable suggestion
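A minimal sketch of the identifier rotation suggested above, assuming a server that derives its pseudonym from some stable client value. The class name, the 7-day period, and the use of a per-period HMAC key are illustrative assumptions, not anything proposed in the session. As Dan notes, this alone is not enough if other stored fields (IP address, timestamps) can still link the periods.

```python
import hashlib
import hmac
import secrets
from datetime import datetime, timezone


class RotatingPseudonymizer:
    """Hypothetical helper: one pseudonym per fixed period, unlinkable across periods."""

    def __init__(self, period_days: int = 7):
        self.period_days = period_days
        self._period = None  # index of the period the current key belongs to
        self._key = None     # secret key, replaced on every rotation

    def _period_index(self, when: datetime) -> int:
        # Number of whole periods since the Unix epoch.
        return int(when.timestamp()) // (self.period_days * 86400)

    def pseudonym(self, stable_id: str, when: datetime) -> str:
        idx = self._period_index(when)
        if idx != self._period:
            # New period: overwrite the key. The old key is not kept, so
            # pseudonyms from earlier periods cannot be recomputed or matched.
            self._period = idx
            self._key = secrets.token_bytes(32)
        digest = hmac.new(self._key, stable_id.encode(), hashlib.sha256)
        return digest.hexdigest()[:16]
```

Two visits in the same period map to the same pseudonym; a visit in a later period gets an unrelated one, so the stored table splits into short sub-histories instead of one long one.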
<schunter> Interesting question: re-linkability of sub-sequences.
dan: fields that can link between
records or data sets, important to look at everything,
including time stamps
... can correlate prior records to new ones
<wseltzer> [what about fuzzing of data?]
dan: broadly, want to look at all
fields you are collecting and make sure none can
correlate
... can go from timestamp to a day or an hour
... quickly through other qs then focus and make progress
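The timestamp point can be illustrated with a short sketch: instead of storing the exact time of a visit, store only the hour or the day. The function name and the two granularity labels are assumptions made for this example.

```python
from datetime import datetime


def coarsen(ts: datetime, granularity: str = "hour") -> datetime:
    """Reduce a timestamp to the start of its hour or day (hypothetical helper)."""
    if granularity == "hour":
        return ts.replace(minute=0, second=0, microsecond=0)
    if granularity == "day":
        return ts.replace(hour=0, minute=0, second=0, microsecond=0)
    raise ValueError("granularity must be 'hour' or 'day'")
```

Records that share a coarsened timestamp are harder to match against an outside log that has exact times, at the cost of less precise analytics.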
bryan: confused about
suppression
... if it's impossible to correlate then you have suppression
is that fair?
dan: yes, or make data less specific
<schunter> If you cannot correlate two browsings that are a long time apart, then you suppressed the LBH.
bryan: url being one piece of the
dataset, ok
... different question: what is the tech that will enable decorrelation over time is an arms race. not productive to get into details.
... what we learned from HIPAA is best we've seen, don't know we'll do better
dan: should strive to do better.
agree normative lang to specify a technique is not the way to
go. but let's brainstorm
... HIPAA missed the mark, we can do better
... let's at least explore even if we don't suggest a
particular thing
<sidstamm> aleecia: want to find a nice balance that can suppress while still providing a benefit for privacy and maximize monetary benefit
aleecia: we could just delete all URIs. presumably there are ways the data is useful for industry / profitability - how do we do that more privacy protecting?
matthias: can we delete after 90 or 60 days?
Rob: matthias is looking at me
:-)
... timeline is not a LBH
... can click Like button to add things to record, but one-off
basis
... not LBH, it's one off, and it's affirmative action from the
user
<bryan> I'm unsure of the value, in this working group and DNT context, of focusing on techniques for ensuring long-term records of data are not correlatable. That is pretty deep science for this group. I think at most we can set objectives, and let the market develop techniques that meet the objectives.
Matthias: affirmative action of user is important, not our concern
rob: if you choose to use a tool, it's out of scope
(bryan, thank you for adding what i didn't capture well enough)
dan: not talking about bits and pieces a user affirmatively adds. this is a background thing that happens without the user's knowledge
(speaker phone troubles)
<schunter> -- resolved.
dan: agree that the piece Rob & Matthias are talking about should go into the defn -- not a discrete set of user added items, but something automatic and regularly
(not sure I agree, but there is some line there)
Rob: plugins are short period of
time, need for trouble shooting. Not kept more than 90
days.
... then it's not identifiable form
Mike: attribution, analytics, targeting -- vary from short to long
Matthias: ad networks use for more than a year, is that normal for a campaign?
<johnsimpson> how long are "relatively short periods"
Matthias: big data?
<schunter> 1 year: seasonal and campaigning.
Mike: varies. Over a year for
seasonal campaigns to adjust inventory or for market
research
... need longer than a year
... interesting discussion we can have is other ways to get the insights for an ad model but less identifiable or shorter retention
... you get wide ranges of time, and if first party even wider.
Carriers will have lots of reasons to keep in identifiable
format for longer.
<wseltzer> aleecia++
dan: makes sense. Digression to
ad world, if anyone there can help me understand
... for behavioral targeting, can conclude interested in sports
apparel. Have a URL then to several buckets, male, 30-40.
... break into profiles, or use raw URL
rachel: can't speak to URL question, but isn't just you looked at a sports page. History across consumers, and over time, to reach conclusion not related to sports.
<vincent> I think they use full URL when they do retargeting
rachel: use of crest means
republican, colgate is democrat
... inferences and correlations, even if not identifiable
<schunter> I suspect that today, data is just kept to allow later re-mining with new algorithms.
Mike: ways data is currently
used, if I were a marketer running super bowl ads, sponsor
webpage for it plus TV commercial. 6 months later, want to know
if someone came back to your site to get more info
... would want to measure effectiveness of ad campaign
... was it worth it to sponsor the site? Want to measure long
term.
... marketers, ad networks, would want to know which creative
on that site was more effective.
... immediate conversion may not be what builds longer-term
brand recognition
... insights are valuable throughout supply chain
<wseltzer> aleecia: we've heard in the past that buckets change
<wseltzer> ... trying to predict/mine the correlations in advance is difficult
<wseltzer> ... tradeoffs will vary from company to company; some tech to bridge the gaps
<BerinSzoka_> of course, what Aleecia just said assumes that a significant percentage of the market will not be DNT users--which, I'm not sure we can assume, given Microsoft, etc.
aleecia: may be able to get the
bulk of the value with buckets rather than URIs, do lose time /
money if you need to start data collection from scratch on a
new unforeseen topic. Different companies have different
costs.
... may be able to sample from non-DNT users
John: can't we draw the inferences and get rid of the URIs?
Mike: don't disagree. Identifying
how data is used.
... What would be impacted if group changed focus to what
you're jumping ahead to.
John: think we need to understand that. If end goal is to in fact eliminate URIs, let's think of ways to make inferences necessary
Mike: not just inferences. Cost
per impression moves to cost per click or cost per action.
Purchase funnel and ads paid for in different way
... valuable to know how an action came about
... important to the analytics of the internet
... could carve out ad delivery and reporting
... perhaps this new approach on "data hygiene" for URIs --
sometimes URIs are really necessary
... do we carve them out or can we do better than that?
... can we find a better balance?
matthias: question on campaign
measurement. Not a good use. If you do a superbowl campaign,
after 90 days wouldn't be able to know if actions were
impressed by this campaign or by another
... big reaction in first 30 days, maybe long tail, but not so
big
Mike: good use case, but your
point is you don't need it for a life time. There is a shorter
effective useful life for that URI in that example
... for web analytics and attribution, but maybe not for a full
year
... do I want to pay an ad network a year later after they run
an ad?
Matthias: use case makes sense,
but longevity of browsing history is limited.
... can cut it.
Dan: another question, let's get into use cases for over 90 days. Maybe we can bracket that.
<schunter> ack
<Zakim> vincent, you wanted to mention retargeting
vincent: use case where you need
full URL for retargeting need exact URL
... know which products viewed
dan: need exact URL, or the product?
vincent: not sure
matthias: is seasonal
common?
... if valentine's day, view flowers, will a year later
remember me?
Mike: long time period, likely to
try mother's day
... depends on who's doing it
... is it the website, a 3rd party, how granular do they need
-- it varies. Plethora of different business models
Matthias: know of any long-term focus companies?
Mike: tried to limit to 3rd
party, easier to answer. If 1st party, answer is yes.
... small publisher (missed) gets 90% of traffic in November.
may re-target in November.
... ad networks not as much, but 1st parties do
Matthias: 1st parties more likely to keep longer than 3rd parties
Mike: yes, more valuable than for
3rd parties
... 3rd Q on low-entropy cookies, please describe?
Dan: different issue
... keep them separate
... move to client-side solutions
... browser stores user info and selectively doles that out to
advertisers
... browser makes decisions about targeting
... low-entropy cookies is a simple way to do this
... instead of unique identifier, set a cookie for "sports
person" on millions or thousands of users
... small set of different sorts of cookies, all client
side
... don't need to retain it all on the server side
... client-side will evolve over the next year or so
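A rough sketch of the low-entropy cookie idea as Dan describes it: the browser keeps the history locally and exposes only a coarse interest label that many users share. The bucket table, host names, and function names below are hypothetical and chosen only for illustration.

```python
import math
from collections import Counter
from typing import Dict, Iterable
from urllib.parse import urlparse

# Hypothetical site-to-bucket table kept in the browser; hosts are placeholders.
SITE_BUCKETS: Dict[str, str] = {
    "scores.example": "sports",
    "kits.example": "sports",
    "recipes.example": "food",
}


def choose_bucket(visited_urls: Iterable[str], default: str = "general") -> str:
    """Pick the most frequent interest bucket from locally stored history."""
    counts = Counter()
    for url in visited_urls:
        bucket = SITE_BUCKETS.get(urlparse(url).hostname)
        if bucket is not None:
            counts[bucket] += 1
    if not counts:
        return default
    return counts.most_common(1)[0][0]


def max_cookie_bits(num_buckets: int) -> float:
    """Upper bound on the information a single bucket cookie can carry."""
    return math.log2(num_buckets)
```

With 8 possible buckets the cookie carries at most 3 bits, far too little to single out one user, and the URLs themselves never leave the client.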
mike: my publisher members will *hate* that, but thanks for the description
dan: example of super bowl, 6
months later, value of data drops off
... assume a visit / impression is less valuable information
than a click or an action
... wondering relative weight of URIs for impression than click
or action data
rob: for targeting, we don't do
this and Mike just left, we should not lose sight of other use
cases
... might want to know if ad campaigns are performing
well
... might want to know looked at site after campaign ran and
was due to that campaign
rachel: super bowl is just a moment in time. Not the normal case.
peter: Q about 1st and 3rd
party.
... 3rd party networks have visibility across more sites. Can
ask to delete data / portability from 1st parties, moving that
way. Harder to do with 3rd parties users haven't seen.
... seeing as same?
Dan: 3rd parties more likely to not need data as long, either, as 3rd difference
Rachel: FB letting you delete is different from Amazon letting you delete purchase history, which you could not do
Peter: transactions and
financial, but not need URI details
... would be purple shirt not the green shirt
sam: long tail and first party
issues
... seasonal business for our customers
... want to know how they got the customer in the first
place
... how do I acquire new customers
Rachel: if you include browsing history with any identifier, need small business to know XYZ identifier from what source, might be more useful than who the user is.
cross talk
Rachel: still on conversation about what would be necessary for this information?
<sidstamm> my regrets, I have to drop off for a while.
<Zakim> wseltzer_cpdp, you wanted to ask sampling?
Wendy: hearing some uses, and
sampling could be effective. Others where it is not.
... could sample time slices or user segments.
... retargeting is specific and sampling does not work
dan: been blurring these.
... high level statistics v. exact URI, should keep clear
Rob: Peter's question, assuming LBH is across sites over time
Dan: not so clear to me
Rachel jumps in: unclear
Rob: went to WaPo -
<johnsimpson> yes yes
Rachel interrupts
Rob: do think it different for
reasons Peter describes
... can look at retention or not visit a first party, situation
is different
<wseltzer> [could be a question to consider: does an LBH across single site pose fewer user concerns than LBH across many sites?]
Rob: doesn't require DNT
Dan: not disagreeing, but for
users, understanding what happens on FB is not always
clear
... may not have clear mental model on FB
... may not affect DNT discussion though
john: if LBH, 1st, 3rd, 5th and
6th parties. May have different requirements though for 1st and
3rd party.
... but LBH involves all the pages you view on a site if it's
kept
... what we do about that is different. But defn is not just
x-site
+1
I'd be happy to defn now and add "we may not care about 1st parties"
Rob: contexts are different
See no reason not to defn...
Rob: use cases for long periods. Wendy brought up keeping retarget data as different
Dan: other use cases beyond
seasonal to need full URI 1 year later?
... anyone able to offer those?
rachel: IP, fraud detection
... verify users for IP perspective
... access to accounts, subscription accounts
<wseltzer> [IP as intellectual property]
Dan: worked in fraud detection in industry, but click data is more useful than impression
Rachel: fraud areas not just in delivery and reporting but also for IP
<johnsimpson> what kind of IP issues?
Dan: would be interested in
hearing more
... can we learn more?
Rachel: will see about finding a resource
Sam: subpoenas for data
... if court ordered, that's an exception, and you have to
retain and produce it.
matthias: that's a reason for keeping less data
<johnsimpson> Well one of the reasons not to keep data is precisely so it won't be subpoenaed.
matthias: large enterprises have retention policies to avoid costs of discovery
?: as policy, don't keep what you don't need
Dan: we all agree you have to produce data if compelled
Bryan: they can go on quite a long time
Sam?: fraud can be someone breaking into your system and need proof, that's first person
scribe: might want to retain that data
Dan: permitted uses, that's
interesting
... what's needed over a year?
Rob: bleeds into permitted
uses
... things folks reasonably want to do beyond short span of
time
... we do a lot of analytical work on FB
... fake accounts, child predators, don't disclose
details
... not fraud or security but site integrity
Dan: would "abuse" work?
Rob: in general, but don't know how you write that
Bryan: terms of use. Need users to follow them.
Rob: broader than that, policy might not say "no child abusing" but we should deal with it
Sam: have same thing
Dan: need to end soon
... go through queue then summarize
wendy: useful in this exercise,
different data needs for different uses
... might be the case that no one needs URIs plus time stamps
plus sites visited, but someone needs URIs but fuzzy time,
someone else needs both but for a subset of users for
sampling
... another is URIs and times at suspicion of fraudulent
access
<dan_auerbach_> +1
wendy: the more specific we can be,
dan: great idea, and understanding tradeoffs would be great
<wseltzer> aleecia: let's write down definition of LBH
<wseltzer> ... note we're not currently contemplating 1st parties and 3rd parties doing the same things
<wseltzer> ... 3, let's try a strawman on a specific timeframe
Dan: talking about 1st and 3rd
parties, not context of data collection
... not an explicit user action generating the data, FB
timeline isn't what we mean
... collection of info derived from site visits, from same
person / device, would be a LBH
... books looked at on AMZN would be LBH
(that sounds right to me)
Rachel: even if it's not
connected to a unique id? If there's no connection, why is
there concern?
... the idea you would suppress something not identifiable
expands the world
... need some ability to be identified
Dan: has to be some sense in which there's knowledge that things are linked
<wseltzer> [some anonymized info can easily be re-linked to an individual]
Dan: if collection of ISBN
numbers and it's random, ok
... if collection all from one person, that makes it a LBH
without an identifier
Matthias: can re-identify
... my browsing history for two months, use schunter.org
regularly
... could make a good guess it's from me
... search histories have identifying terms
... can make good guesses
rachel: if in buckets?
matthias: that would be ok
... can you re-associate is the question
... if "went to FB, GOOG," that could be ok
rachel: important because (sorry, missed - please fill in)
matthias: if browsing history is shared, k-anon, typical
rachel: how do we get that in the defn?
matthias: in LBH, can only do top 10 sites :-)
dan_auerbach:
google.com/dansinbox is identifying
... bucketing to google.com might be reasonable
... smaller sites into sports sites might be useful
(thanks!)
Dan: k-anon has no
ambiguity
... can navigate those waters
rob: being careful that we're not conflating LBH and de-id'ed data
<schunter> ... in theory ... (afaik the def. contains "background knowledge" of the adversary)
rob: two concepts
... in example, amazon could say "here's the list of all the
books a person looked at" not sensitive but valuable
... different from "and I can tell Matthias is the person who
looked at them"
... no privacy problem
... get worried if one of those books is "Matthias' web mailer"
it's linkable.
Bryan?: what is an individual
scribe: if not tied to a person, not individual
... not a history, just a record, if it's not tied
dan: hear you, important to keep
LBH separate
... more on this tomorrow with Ed
... papers on re-id
... you have databases and can re-id
... don't need to answer that now, just what is a LBH
<wseltzer> aleecia: Papers, Netflix contest (Narayanan & Shmatikov). Anonymized users can be id'd by reference to another database, and you don't have control over others' databases
<wseltzer> ... k-anonymity and buckets, ways of thinking about the long tail of re-identifiable data
<wseltzer> ... we don't have to solve it here, can set aside with "if you have an unlinkable data-set"
matthias: shorter histories ->
easier k-anon
... 4 days of sites, not full URLs, then many users will be the
same
... the longer, the more difficult to get k-anon
... month-long history is not as likely as possible with full
URLs
Dan: agree
matthias: longer the history, the more difficult the k-anon. the more data, the less likely users have the same profile
Dan: agree on that too
("just agree" and "disagree" sound similar :-)
Dan: if no timestamp, easier too. fewer fields -> easier to have de-linked data set
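Matthias's point that longer histories make k-anonymity harder can be shown with a toy calculation. The four users and their per-day site buckets below are made-up illustrative data, and `smallest_group` is a hypothetical helper: it reports k, the size of the smallest set of users whose truncated histories are identical.

```python
from collections import Counter
from typing import Dict, Sequence


def smallest_group(histories: Dict[str, Sequence[str]], days: int) -> int:
    """Return k: the size of the smallest group sharing the same first `days` entries."""
    profiles = Counter(tuple(h[:days]) for h in histories.values())
    return min(profiles.values())


# Toy data (fabricated for illustration): one bucketed site per user per day.
TOY = {
    "u1": ["news", "sports", "mail", "travel"],
    "u2": ["news", "sports", "mail", "food"],
    "u3": ["news", "sports", "shop", "food"],
    "u4": ["news", "sports", "shop", "food"],
}

for days in (1, 2, 3, 4):
    print(days, smallest_group(TOY, days))
```

In this toy set k is 4 for one or two days of history, 2 for three days, and 1 for four days: every extra field or day gives users another chance to differ, which is also why dropping the timestamp helps.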
<Zakim> bryan, you wanted to mention that the potential that unlinked data is somehow made linkable later is real, but should not impact the compliance of who recorded the unlinked data,
bryan: unlinked data that was
recorded but later turns out to be linkable, that data as
recorded doesn't represent a browsing history
... that some future party can resurrect it doesn't make it a
browsing history
-1
bryan: if there is a fault in
this, it is the fault of the person who does the
resurrection
... if there's no link to an individual, it doesn't represent a
browsing history
... if in the future it's not the fault of the recording
company
... the only response if you disagree is not to record
anything
matthias: netflix example is nice
bryan: get it's possible to
re-link
... but if you've done everything to the state of tech today,
you've fulfilled expectations. If addition of other data that's
put together and the user didn't authorize it, then it's the
party who resurrected that data
Dan: grey area but need to end in
5 minutes
... don't quite share that view. What if dataset was linked to
Mr. Man, and he did bad things, and works at EFF, can get down
to a few people. Then all it takes is one fact not in the db to
identify it
Different argument: being able to relink *is* a current known threat, Bryan
We know this is real. We should account for it.
Rachel: fault is not helpful
yet
... that would include identifiers.
... there is the possibility of a browser history that is not
identifiable
Bryan: if no identifier, it's not linked, it's not a browsing history. Period.
(full disagreement from me)
Dan: we'll get back to defn
... Bryan disagreeing on defn in that, don't think a set of
ISBN numbers from a specific user are an LBH
... don't know which user, just that it's one user
... just a list of movies a person watched
Rob: don't think that's consensus for that. not an LBH
Dan: hearing no consensus there.
But is automatic collection of data, rather than affirmative
user choice
... has to be retained, haven't picked a time limit
... do we want to say a month as a working limit?
... just as a defn
<johnsimpson> no. a day
<bryan> +1 I agree that "fault" was not the intent of my point, but that the party that saves an unlinked (to a person) set of related browsing records is not recording an individual's browsing history.
Rob: 6 weeks, 90 days, 365
... 30 days too short
<johnsimpson> i'm serious
I can live with 6 weeks
Mike: ad campaign for 1 month at
least
... and months are longer than 30 days
... need to batch & process data
... 30 doesn't work
Dan: not sure we want to link this to retention for de-id
Can we agree under 3 months?
Where we still debate, but under 3 months?
Dan: 1st parties may want to keep things longer, use cases for 3rd parties too
Matthias: 3rd parties less likely to need long-term retention
Mike: agree, but marketer may
find more useful for longer
... most ad networks won't use it for really long, but marketer
may
Rachel: can use de-id'ed but need inferences
?: seasonal is a common thing
Dan: anything else?
when is next session?
<johnsimpson> thanks, Dan
main room when?
thanks Wendy!
<tlr> 3:45pm main room
<vincent> thanks
<wseltzer> thanks, all!
This is scribe.perl Revision: 1.137 of Date: 2012/09/20 20:19:01
Check for newer version at http://dev.w3.org/cvsweb/~checkout~/2002/scribe/
Guessing input format: RRSAgent_Text_Format (score 1.00)
Succeeded: s/Sam Sherman/Sam Silberman/
Succeeded: s/?:/sherman:/
Succeeded: s/?/bryan/
Succeeded: s/Mike/Bryan/
Succeeded: s/?:/Sam:/
Succeeded: s/rob/dan_auerbach/
No ScribeNick specified. Guessing ScribeNick: aleecia
Inferring Scribes: aleecia
Default Present: BerinSzoka, vincent, +1.415.920.aaaa, johnsimpson, bryan, +1.650.723.aabb, aleecia, sidstamm, +1.425.214.aacc, wseltzer, schunter, Mike_Zaneis, Rachel_Thomas, sherman, Adam_Turkel, Sam_Silberman
Present: BerinSzoka vincent +1.415.920.aaaa johnsimpson bryan +1.650.723.aabb aleecia sidstamm +1.425.214.aacc wseltzer schunter Mike_Zaneis Rachel_Thomas sherman Adam_Turkel Sam_Silberman
WARNING: No meeting title found! You should specify the meeting title like this: <dbooth> Meeting: Weekly Baking Club Meeting
WARNING: No meeting chair found! You should specify the meeting chair like this: <dbooth> Chair: dbooth
Got date from IRC log name: 11 Feb 2013
Guessing minutes URL: http://www.w3.org/2013/02/11-dntd-minutes.html
People with action items:
[End of scribe.perl diagnostic output]