W3C

- DRAFT -

SV_MEETING_TITLE

17 Jan 2013

See also: IRC log

Attendees

Present
Bryan_Sullivan
Regrets
Chair
SV_MEETING_CHAIR
Scribe
JoeHallCDT, yianni

<rvaneijk> When I dial in, I do not see myself in the IRC as dialed in..

<aleecia> Rob, neither do I

<aleecia> Possibly just slow?

<aleecia> But I'm guessing something is broken in the Zakim world

<jmayer> W3C: fixing IRC bots and taking attendance since...

<bryan> zakim appears to be a little sleepy

<aleecia> <groan>

<bryan> BAU

<justin> Getting ready to dial in.

<johnsimpson_> Good morning

<aleecia> I planned to before I got sick

<aleecia> (someone is typing & needs to mute)

<johnsimpson> john

<aleecia> hi

<johnsimpson> testing IRC

joe is scribe… someone remind me how to tell Zakim that and to start notes

<yianni> scribe: JoeHallCDT

Peter Swire: goal is to discuss to what extent De-ID can remove data from scope of the standard

… related: what sort of uses are consistent with compliance with the spec

… if things are used for market research in ways that are entirely de-ID, that should be safe or out of scope

… on the other hand, if explicitly ID'd, standard should apply

… clearly defining uses is crucial

… getting clear on terms, words and such is an important part of this

… instead of having people talking past each other, we want a strong foundation of shared vocabulary

… delighted to have great people in the room and on the phone

… agenda has been sent around

… ground rules for discussion

… this is not an official in-person meeting with 8 weeks notice

… have been told by w3c staff that this can't make decisions towards normative language

… it would be good to agree on terms and definitions

… this should make people more comfortable with claims made in the world

<Wileys> If you share that information externally...

… e.g., unsalted hashes

<jmayer> Could introductions include technical background? It would be helpful to understand who'll be participating from the technical side and who'll be observing from the law/policy perspective.

might want to q that jmayer

… first thing is incentives to de-ID

<aleecia> Do we need to re-introduce ourselves?

… Khaled El Emam will start us off with slides (jlh: not sure how phone peeps will see them)

… then to hashing, persistent ids, putting people in "buckets"

<rvaneijk> please send slides to the list and/or post them on the wiki !

… Yianni will gather qs

… will go around the room, please let us know any technical experience

<aleecia> cannot hear

… Peter, law prof.

… Khaled works at U Toronto, CS background, working on health

Dan Auerbach from EFF, previously worked at Google doing data mining

John Simpson, Consumer Watchdog

Ed Felten, Princeton U.

research and teaching for 18 years

Felix Wu, prof. at Cardozo, PhD in CS from Berkeley

Peter invited Felix based on technical work

Paul Gliss, lawyer from Comcast, worked in De-ID space

Chris Mejia, IAB, dir. of ad technology, tech dir. for DAA

Jeff Wilson, with AOL for 16 years

Marc Groman, NAI

David Wainberg, NAI, undergrad. in CS, web dev. for years

Heather West, Google

Justin Brookman, CDT

Bill Scanell, (probably a lawyer in a suit?) here to assist with communications

Peter McGee from FTC

Shane Wiley, Yahoo!!

Mary Ellen Callahan, Jenner and Block

Aleecia McDonald, PhD engineering

<bryan> Bryan Sullivan, AT&T Director of Service Standards, WAP/Web browsing service architecture and mobile/web standards for AT&T since pre-2000

Adam Turkel, lawyer with AppNexus

Bryan (?), AT&T director of standards

Ho Chun Ho, Comcast, data arch.

Jonathan Mayer, PhD student in CS at Stanford, at Stanford Security Lab

<AHanff> is there a call on now?

Rob van Eijk, PhD student at Leiden University, (very lengthy aff. and background)

<aleecia> Yes, we're on a call now

Vincent Toubiana, Alcatel Lucent, PhD CS

<AHanff> thanks I didnt see it on the icalendar

<rvaneijk> aff: Art. 29 Data Protection Working Party / Dutch DPA

Jules P, from Future of Privacy Forum

<yianni> scribe: yianni

Peter: Getting logistics worked out, brainstorm reasons in advertising and online space
... why people have incentives to de-identify
... self interest, business, or other reasons
... if we understand reasons, we might be able to understand what things will be done in practice


... privacy policy that says you do things in de-identified or anonymized ways
... we do not use PII for certain operations, for example
... risk for not following promises

Marc: people do not de-identify to avoid liability, they do it to mitigate privacy and security risk, then make the promise

Paul: providing comfort to customers is a reason to de-identify

Peter: 2nd, organizations have costs from data breaches (states and Europe)
... expense of sending out notice and going through steps of data breach, if de-id you do not have to disclose

<Wileys> Encrypted is different than de-identified

Jules: big driver, beginning of NAI, big ad networks and crisis around it

<aleecia> In my experience, companies that say they only work with anonymous data mean it in the Latin sense -- literally without name. They do not mean that users are unidentifiable. I think we need to be very careful to keep these ideas separate.

<Marc_> +q

Jules: NAI treated PII and non PII very differently, representing in privacy policy that you tracked PII, you could make notice in opt-out notice
... in PII, need more notice on web page, perhaps an opt-in
... 7 large networks adopted, and forced other partners to follow
... huge driver for ad networks that they make a specific representation of PII and non PII

Peter: are there other legal regimes for de-id?

<jmayer> Rob, could you briefly address EU law?

Paul: regulatory treatment that is different for cable, services provided by cable providers
... makes distinction between personally identified and not identified

<Wileys> Peter - are you suggesting if data is not linked to PII then it is "de-identified"?

Paul: much like NAI, different rules for consent and approval

Marc: data security issues, beyond financial issues, reputational risk is a very large piece of it as well
... privacy incident, costs are much higher than outside counsel and regulatory burdens, for many years people talk about the x company incident

<bryan> Shane, I think the question is whether "is" includes "can be", i.e. data not linked vs non-linkable is by definition non-PII

Peter: NAI, Cable Act, also have HIPAA, GLBA
... if you are outside regime, you do not have regulatory burden

<aleecia> Shane - I think it's abundantly clear that no PII is not the same as non-identifiable (see Paul Ohm's summary paper) but I understand you're asking for Peter's view, which I do not know.

Marc: Privacy Act, privacy impact assessment depends on whether you have individually identifiable information

Peter: inside an organization, you have incentives of access controls, more people can touch if not PII

<Wileys> Bryan, that's my question - is it an absolute position? I've always felt de-identified was "more" than simply not PII.

<Wileys> Aleecia - see above :-)

Peter: data base with financial information, many reasons for access control limits
... for other employees there is a risk of breach if you do not De-identify

Khaled: opt-in consent or opt-out, evidence in health care sector for consent bias
... de-identification allows you to avoid consent bias

<Wileys> PII/Personal Data -> Pseudo/Anonymous -> De-Identified/Unlinkable -> No Value

<rvaneijk> any kind of analytics is very far stretched...

Khaled: Beyond researchers, goes to analytics (bias data because you are missing a certain percent of population)

Peter: having full population better for the researchers, De-ID is a tool to get accurate analytics
... Any other comments on reasons why people do de-identification?

<aleecia> Shane - I can imagine a dataset that removes PII and is also then not re-identifiable. But that's not a general rule. It's probably easier to talk about the type of data we're using. Removing PII is not going to render a server log file "safe," and indeed there might never be PII in the first place, yet still have identifiable data.

Peter: reasons for people to do this, trying to understand the terminology
... Khaled has a book on de-id coming out the beginning of April

<aleecia> Are slides available now?

Peter: Khaled starting with part 2 and his slides

<bryan> Shane, to be clear I was not stating a position, but a question. IMO identity includes a range of attributes only some of which are personal - remove/obscure the personal ones and you're home - science will always find new ways to relink and attribute data to persons, and we should not be trying to chase that rabbit

<Wileys> Slides have not come through on email yet!!!

<rvaneijk> yes,

<justin> I sent ten minutes ago, will resend.

<AHanff> difficult

<aleecia> thank you Shane

<jmayer> Also, lots of paper shuffling etc.

Khaled: walking through process of de-identification

<aleecia> um.

<rvaneijk> sounds off now

Khaled: walk through de-identification we have been using, context will be healthcare
... agree on terminology and general approach to terminology
... basic process they have used is five steps

<Wileys> Bryan, I'm mostly with you there. The key element is what is defined as "personal"...

Khaled: assume we have health data set and want to release for secondary purpose
... first step understand plausible attacks

<jmayer> Where are these five steps sourced from?

Khaled: second, understand variables that can be used
... measure risks, apply de-identification
... Assume a public release or releasing to a known data recipient

<justin> Put your email in chat if you want the slides.

<bryan> In absence of the slides, can someone copy/paste the slide content into IRC?

<Wileys> wileys@yahoo-inc.com

<aleecia> aleecia@aleecia.com

Khaled: very different analysis, public have no controls, known recipient you can have controls and contracts

<vinay> vigoel@adobe.com

<AHanff> a.hanff@think-privacy.com

Khaled: For known data recipient, you have three attacks

<vincent> vincent.toubiana@alcatel-lucent.com

Chris: what type of attack?

<AHanff> are we allowed to comment?

<aleecia> ed@felten.com

<RichLaBarca> rich@addthis.com please

Khaled: re-identification attack

<jmayer> Slides answered, thanks.

<bryan> got the slides, thanks

<AHanff> so can we ask questions?

<hwest> If you have questions, please queue yourself; I'll monitor the queue

<Wileys> Thank you Heather!

<hwest> (Reminder: to put yourself in the queue, just type q+)

Rob: information that is not being disclosed, storing information to make it de-identification, not planning to disclose?

<Wileys> +q

<AHanff> typ[ing

<AHanff> I am typing lol

Khaled: go through same steps if you release to data recipient or internally

<hwest> AHanff, are you just on irc?

<hwest> Go ahead and type your question and I'll convey

<AHanff> no I am on phone too but not on headset

Shane: not mandating from a HIPAA perspective to de-identify, just for a risk management perspective, you would go through same process

<justin> Slides went to list finally, available here: http://lists.w3.org/Archives/Public/public-tracking/2013Jan/0062.html

<aleecia> Thank you Justin

Khaled: contract, allow vendor to continue using the data, need to keep in de-identified manner

<hwest> AHanff, go ahead and type question

Peter: HIPAA puts limits on data uses even internally

<AHanff> I would just like Khaled to acknowledge that known recipient doesn't guarantee confidentiality even with contractual obligations. For example, I read recently that something like 90% of US medical authorities had data leaks in 2012, presumably contracts were in place...

Dan: clarifying, de-identification is a property of data?
... It is not a process

Khaled: in practice you manage the risk of re-identification, de-identification is one tool in the tool box

<hwest> AHanff, feel free to share running comments as the presentation proceeds - they go in the record as well

<AHanff> thanks

Khaled: deliberate re-identification by data recipient, if company signs a contract, as a corporation that company will not try to re-identify
... there may be rogue employees, but probability of company re-identifying would be acceptably low

<AHanff> the evidence would suggest otherwise with so many data leaks surely?

Khaled: contracts are a good risk mitigating activity for first attack

<peterswire> I am aware of the q; will be calling on them at a soon moment

<aleecia> @AHanff, if you have a citation on the 90% figure, would you be so kind as to add that to the wiki?

Khaled: rogue employee re-identifying an ex spouse for example is dependent on internal company controls

<AHanff> I will try and find it yes

Khaled: first attack, as a company would you do it, do you have controls for rogue employees

<aleecia> Thanks, that's higher than I'd heard

Peter: this is a risk management approach

Khaled: most recent guidance of HHS is a risk management approach, UK Commissions also talk about risk management and context based
... regulators approaching as a risk management exercise

David: De-ID is not a binary state, it is rather a description of lower risk (Khaled probability)

Khaled: de-identification has been practiced for last 20 years, CDC, CMS, set thresholds along a continuum
... that is context dependent

<AHanff> aleecia, it was a Ponemon study, there is an article here on it (will add to wiki) http://www2.idexpertscorp.com/press/report-94-of-us-hospitals-suffered-data-breaches-and-45-had-quintuplets/

David: helpful to talk about de-identification as a process and something else as a end goal?

Dan: still fair to say de-identification is a property of data

David: functional definition of de-identification is a function of the context, could be 20 different forms

Khaled: can be multiple de-id versions for the same data base, public versus trusted party

Peter: binary de-identified or not? Under HHS, counts as de-identified if overall risk is low.

Khaled: once you have a spectrum, and cut off in the middle, you turn it into a binary decision

Peter: de-identified is a conclusion term under some regime under some set of facts

<AHanff> but the thresholds are not static, they move constantly depending on the amount of data aggregated about an individual

Peter: yes it is de-identified or no it is not, along the way there is a risk management regime
... de-identified right now is a conclusion term for a regime, we do not have that standard right now in dnt
... does anyone else see it differently?

Jeff: more accurate to say a de-identified data set has been de-identified to a degree

Peter: more or less risk for re-identification

<aleecia> Thank you kindly, Alan. Report (rather than press coverage) available from: http://www2.idexpertscorp.com/ponemon2012/

David: disagree what is identified in the first place, what's de-identified and when, we will have disagreement

Ed: In a given setting, you can ideally establish some scientific basis that risk is some amount, you have a spectrum of risk
... then you are required to be somewhere on the spectrum

<AHanff> I think it is important to note that there are no specific types of data which can guarantee non-re-identification, in fact it is never possible to guarantee non re-identification. Data minimisation can make it less likely, but the way these systems work is the data is always increasing not decreasing, which means the risk is continually increasing as the data resolution increases...

Ed: starting point, scientific basis that data can be exploited with a certain probability
... risk analysis based on sound scientific analysis, not based on what you have done in the past

Chris: process of de-identification, and de-identified data

Peter: defining what counts as de-identified sounds like normative stuff we are not agreeing on today, we are trying to develop language and ways to talk about things to have that conversation

Chris: we do not know the degree, we just know de-id is a thing, so let's talk about good practice

Paul: once you accept risk, then need to put tools on tables, what are the general uses
... then have conversation of what is an acceptable level of risk

<rvaneijk> I agree with Ed. The goal is relevant. If you want to use the data for aggregation is different than trying to accomplish unlinkability

<Wileys> AHanff -> I disagree, there are levels of de-identification/minimization that guarantee non-re-identification. For example, highly aggregated data sets or highly sparse raw data can both guarantee non-re-identification.

Jonathan: stick to substance, universe of attack slide, third bullet point

<AHanff> Wiley, show me the evidence to support that and I will show you a very famous event which shoots it down :)

Jonathan: reasonably say that risk to some sort of data breach is a lot greater if you leave on street, if only CEO can see with contract
... risk is much greater in former, shades of grey are the hard part

<Wileys> 3 people in the world viewed Yahoo.com at a specific moment in time yesterday - please tell me who those people are?

<Wileys> Have fun AHanff (that's an example of a highly aggregated result)

Jonathan: very fact specific things, where real world challenges lie, can we reasonably estimate these sorts of attacks: being hacked, laptop out, rogue employee
... if you can predict crime, we all have a much better use of time

<justin> I don't think we need to argue about really-really-really-really hard to reidentify is technically impossible to reidentify. For purposes of this group, whatever you call that, it will suffice to constitute de-identified data.

Khaled: not predicting crime, but good approaches to manage risk

<AHanff> Wiley, I am glad you chose a search engine, I refer you to the AOL search data which was used to identify anonymous users within 24 hours of being released for "research purposes"

Khaled: develop a series of checklists to evaluate point of disclosure
... at the end of day, probabilities can be assigned

<AHanff> far more anonymised than the data Yahoo has in their logs I should add :)

<Wileys> Thank you Justin - I agree that arguing absolutes in this case is not helpful - that was my point. :-)

<aleecia> Justin - I think that's part of the question at hand

<Wileys> AHanff - completed apple / orange comparison

Khaled: based in part on subjective estimates, but mixtures of different things

<Wileys> completely

<AHanff> no it isn't

<aleecia> The AOL mess was *not* data aggregation

Khaled: the overall answer is that you can do it in a defensible way

<justin> The question at hand is how many "reallys" you need in front of "hard to reidentify"

<aleecia> Shane is right on this one. The AOL mess was replacing one unique id with another.

<Wileys> AHanff - AOL was row level specific data with consistent unique identifiers - my example was a highly aggregated result. Not the same

<AHanff> 3 people visiting Yahoo yesterday at specific time is not data aggregation either, server logs (probably replicated multiple times for backups across their distributed network) provide very exact data

Khaled: deliberate re-id, inadvertent - recognize someone they know (a relative)
... in health care setting, can measure probability that someone knows someone in the database
... Ex. breast cancer, we know the prevalence of breast cancer and average number of friend, we can estimate the chance of inadvertent re-identification
... Data breach, organization that loses data, we know that 27% of health care providers have one breach per year
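
The inadvertent re-identification estimate Khaled describes (condition prevalence combined with the average number of friends) might be sketched roughly as below. This is only an illustration under assumptions that were not stated in the room: acquaintances are treated as independent, each has the condition with probability equal to the prevalence, and the function name and example figures are hypothetical.

```python
def prob_knows_someone(prevalence: float, acquaintances: int) -> float:
    """Chance that at least one of `acquaintances` people has the condition.

    Assumes (for illustration only) that acquaintances are independent and
    that each has the condition with probability `prevalence`.
    """
    if not 0.0 <= prevalence <= 1.0:
        raise ValueError("prevalence must be between 0 and 1")
    if acquaintances < 0:
        raise ValueError("acquaintances must be non-negative")
    # Complement of "none of the acquaintances has the condition".
    return 1.0 - (1.0 - prevalence) ** acquaintances


if __name__ == "__main__":
    # Hypothetical inputs: 1% prevalence, 150 acquaintances.
    print(round(prob_knows_someone(0.01, 150), 3))
```

The result is an upper-level input only; whether an acquaintance is actually recognized in a released data set depends on the fields that remain in it.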

<aleecia> So wait: 27%, or 94%?

Khaled: there are bigger and smaller numbers, but 27% is the most defensible number

<aleecia> That's a rather large change of inputs here

Khaled: we can use the 27% number to assign probability

<Wileys> What does breach have to do with de-identification? Those breaches are to purposely non-de-identified data.

<aleecia> But not our problem, actually

Khaled: demonstration attack - adversary wants to make a point, targeting high risk person
... all you have to do is identify one person

<Wileys> +1 to Aleecia

<peterswire> I see jonathan; will call on soon

Khaled: Directly identifying variables, are the fields in HIPAA

<aleecia> What I've learned: HIPAA's a mess. :-) But we may be able to find useful parts of HIPAA anyway as we sift through this, and it's useful to see what came before.

Peter: people may disagree what is directly identified and a quasi-identifier

Khaled: can be different based on context
... with names remove the names, randomize, generate pseudonyms

<aleecia> Shane -- I realize I don't know what problem you're trying to solve in your dataset. When you talk about not destroying the value, what value is it you're trying to preserve?

<Wileys> +1 to generating pseudonyms as acceptable de-identification practice :-)

Chris: quasi-identifiers, how about ranges, someone fits with a date range, or geo location? Address in HIPAA

<Wileys> Aleecia - typically longitudinal analytical/research value

Khaled: HIPAA safe harbor, dates converted to years

<justin> e.g., it's useful to know that a particular user went to Y!, then FB, then ESPN, etc.

Khaled: when you convert to ranges, you go to expert, you could potentially go to quarter of year or increase to 10 years
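
The date generalization Khaled mentions (exact dates coarsened to a year, a quarter, or a 10-year range) could look something like the sketch below. The function name, the level labels, and the output formats are assumptions made for illustration; the discussion did not specify any of them.

```python
from datetime import date


def generalize_date(d: date, level: str) -> str:
    """Coarsen an exact date to a less identifying range.

    `level` is a hypothetical label: "quarter", "year", or "decade".
    """
    if level == "quarter":
        quarter = (d.month - 1) // 3 + 1
        return f"{d.year}-Q{quarter}"
    if level == "year":
        return str(d.year)
    if level == "decade":
        start = d.year - d.year % 10
        return f"{start}-{start + 9}"
    raise ValueError(f"unknown generalization level: {level}")


if __name__ == "__main__":
    visit = date(2013, 1, 17)
    for level in ("quarter", "year", "decade"):
        print(level, generalize_date(visit, level))
```

Wider ranges put more records into the same group, which lowers re-identification risk at the cost of analytic precision.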

<Wileys> Aleecia - You've already heard this conversation play out between Ed and I (and a few others) on the public email list. :-)

<aleecia> Yes, I've heard and read more than I care to :-) But I couldn't remember what value you were looking for, just the disagreements

Khaled: if you are doing analytics treat as quasi identifiers, ex. software testing, you cannot get rid of fields, you just randomize

<AHanff> my question isn't on direct identifiers

<AHanff> my question is on the 27% figure

<jmayer> Aleecia - industry participants have never explained the value they hope to achieve in detail. It's one of the reasons we haven't made progress.

Khaled: in Ontario 220 John Smiths, people have common names.

<Wileys> Aleecia, outside of permitted uses, the core value sought is analytical (be able to learn and make changes).

Ed: In practice every variable is a quasi identifier?

<Wileys> Jonathan, I thought we had - not sure what more you're looking for.

Khaled: no not really
... example, blood pressure

<aleecia> And you're likely to have a question now that can be answered from data 5 years ago? 2 years ago?

<rvaneijk> would like to bridge the quasi identifier to EU perspective... (queue)

Ed: blood pressure is better than gender

<aleecia> My concern is that your answer there is you don't know

Khaled: what is the chance of adversary knowing your blood pressure

<aleecia> Because, you likely cannot

Ed: the odds my provider will know my blood pressure is high

<Wileys> Aleecia - some researchers at Yahoo! find tremendous value in long-term data as an indicator for near-term data - interesting learnings and value there.

Khaled: hospital can look at, and different controls to stop re-identification

Peter: how likely someone on outside has access to that information and how likely it is to be a match?

<Wileys> Aleecia - a simple example is spelling correction - due to the long tail of possible searches it can take many years to build enough data to predict outcomes for rare terms.

<rvaneijk> is anyone monitoring the queue?

Ed: re-identification is connecting individual to information

<aleecia> I'm sure there is. But if you pull back to a very simple view, you're suggesting that users ask for more privacy, Y! says they will provide more privacy, and then you will retain and study that user. That's a hard thing to explain to a user who just wants to be left alone.

<Wileys> Rob, Peter said in IRC that he'd be coming to the queue soon but that was quite awhile ago

Khaled: all laws protect against identity disclosure, no laws protect against attribute disclosure
... If I release data set and you get attribute disclosure, laws do not prohibit, it's just statistics

<vincent> Wileys, with the spelling correction example, high level aggregation and short term retention are not enough?

<Wileys> Aleecia, I'd argue that once the data is deidentified that user is being left alone - we're now just using an unlinkable data point to improve our services. What are our rights in providing the free service? The most paranoid users need not use our services if we fairly call out that we use data in this way. Fair?

<aleecia> The spelling example is a nice one, thanks. I'm sure there are many, many others. I just don't know how to get you what you want while still actually honoring DNT


... Different governance mechanisms to manage attribute disclosure, but not what we are talking about today

<justin> WileyS, not sure that's the best example. That's first party data that can be stripped of identifiers immediately without significantly diminishing value (like Google Flu Trends).

Ed: arguably the most important aspect of privacy disclosure is not even covered?

<Wileys> Vincent, not short-term retention (not enough volume on rare terms) - but data minimization and de-identification do accomplish the risk minimization goal

Khaled: cannot predict inferences of data sets, but the more you control attribute disclosure you destroy data utility, best to manage with governance

<AHanff> Wileys - no absolutely not fair - first of all what right do you have to label privacy aware users as paranoid - secondly, are you therefore saying people who value privacy should be excluded from digital society?

<Wileys> Justin, agreed - for that use case, that's a great de-identification approach.

Peter: direct identifiers (phone numbers), quasi identifiers (people on outside can make guesses)

<aleecia> I'm pretty sure that saying "we're honoring your request for privacy, but we're still logging everything you did and using it" isn't what users will consider fair. Which, to be clear, matters a lot more than what I think is fair.

<Wileys> Justin, you do need to keep a few data elements around to help provide context (language, country of search, etc.)

Peter: Third thing, attribute disclosure

<peterswire_> I see the q

<Wileys> Aleecia, I believe the de-identification removes the "you" in 'everything you did' in your statement

<AHanff> what you believe is not what regulators and the general public believe, which I think is aleecia's point

<aleecia> Which is where you and Ed have gone many rounds, and I do disagree with your conclusions there.

Ed: list of hundred records and I know one is yours, and all have that diagnosis, I know the attribute without actually identifying

<justin> WileyS, Right, that seems fair, but the re-ID risk seems almost impossibly low.

Joe: that's 100% , others are fuzzier

<peterswire> attribute disclosure as an important distinction says ed felten

Ed: are we trying to protect against attribute disclosure?

<Wileys> Justin - agreed, for that use case - many other use cases aren't as clean cut - that's why it's a good point to start there and go deeper.

Khaled: precedent in research world for attribute disclosure: IRB

<aleecia> I do agree that there are ways to do aggregation to a level as to remove the "you." I do not think that replacing one unique identifier with another unique identifier (hashing) is going to remove the "you"

Khaled: restricts how you do studies, committee oversees

<Wileys> AHanff, could you please source your position? Regulator and general public studies?

Khaled: have a mechanism to agree on the type of inferences you will permit, certain things would be off limits

<vincent> Wileys, I thought Yahoo removes rare terms anyway? are there examples where yahoo is actually a third party?

Joe: risks to population of inference versus benefits?

<AHanff> wileys, regulators, a29wp, eu commission, eu parliamentarians, members of public all people I have worked with and discussed these issues with over the past 6 years

<Wileys> Aleecia, as long as there is no way back to the original user, then I believe the desired outcome has been met (no more 'you')

Khaled: no legislative requirement to worry about attribute disclosure

<AHanff> except you of course :)

<Wileys> AHanff, very much an area of active disagreement - I agree that one extreme side of that debate equates to your position

Felix: We are concerned about inferences of large number of people, but that is different than inferences about one particular person

<peterswire> person is in the group, and can draw inference about them -- attribute disclosure

Khaled: can draw inferences about group memberships, and you belong to that group

<Wileys> Vincent, Yahoo! runs one of the largest 3rd party ad networks on the internet :-)

<AHanff> well absolutely every person I have ever discussed these issues with apart from advertisers, is in that "extreme" - which would suggest that the extreme is actually your segment not mine ;)

Felix: IRB - mitigates discriminating against large group, not concerned about attribute disclosure to specific individual, even if group is not sensitive

Khaled: depends on type of study and what harm that can happen to those individuals or at the group level

<Wileys> AHanff - disagree - if everyone agreed with you then no one would be using online service supported by 3rd party advertising

Dan: Quasi-identifiers: why is not everything a quasi-identifier?

Khaled: have to take into account probability that adversary will have information, for some fields there is no probable path to get that information

<aleecia> Shane - one of the evolutions we're watching is going from "we need to identify a user by name" as what counts for a "you" to "we need to be able to distinguish a single person" such that a GUID counts for a "you"

<AHanff> Wileys, that is a completely invalid response - the VAST majority of digital citizens have no idea that any of this is going on and when they find out, they are outraged

Khaled: has to be information that is generally available

<AHanff> there are countless examples to support that

<aleecia> swapping one GUID for another doesn't actually advance privacy

<aleecia> that's not fair -

<vincent> Wileys, glad to hear :) but how is that related to my question? I was asking for examples of analytical/research that need pseudonymous data and where yahoo is involved as a third party, not a search engine

<aleecia> doesn't advance it by much.

<Wileys> Aleecia - GUID goes one step further than I'm suggesting as that implies it is still "linkable" in a production system.

Mike: What about the practical, how difficult is that inference? (large number of records)

<Wileys> Vincent, anything and everything to do with being a better ad network.

<aleecia> That's what I was just correcting. I agree, there is a minor improvement there, but not enough as to practically matter much.

Khaled: depends on fields you have in data base, and how accurate would the inference be, never count against statistics
... attribute disclosure has to be managed, cannot do so technically without destroying data

<Wileys> AHanff, please reference studies of consumer "outrage"

Khaled: need to have different oversight, evidence so far that is what works
... In practice, you do not get all of the fields in data bases (focus on 6-10 fields), for longitudinal data, repeated over multiple visits
... surveys are more complicated, can deal with database with 100 quasi-identifiers

<aleecia> Shane - let me do a thought experiment. I think we agree that if I got my hands on the raw server logs at Y! that would contain a set of "you"s, and not be non-identified.

Dan: only need to know one thing

<AHanff> Wileys I don't need to, they are there in the public eye - instagram, path, phorm, nebuad, facebook etc etc etc

<AHanff> there is a new one just about every week

Khaled: chance of adversary knowing 5 things or 10 things, chance they know all 100 is very low
... choose a number that is defensible (unlikely to know 30 fields)

<Wileys> Aleecia, depends - if you're suggesting a de-identified data set, you'd find a one-way secret hashed identifier that has been truncated by 50% to purposely create noise (salt). So there is "an" identifier there - but it links to nothing in production systems.

<Wileys> AHanff - thank you for the conversation, I have a good sense of your perspective and ability to defend your statements now.

Khaled: three types of risk
... are you going to re-identify individual in data set, or are you going to match two databases

<AHanff> You should talk to your colleague Justin before discounting my arguments, we know each other very well

Khaled: are you considering maximum risk or average risk (very different)

<aleecia> If you took that raw data over a year (nothing magic, just picking a specific example) and gave me one half of the data raw, and one half you had transformed by replacing GUIDs with your hashed id, I would be able to map between the raw and the hashed data sets.

Khaled: when talking about demonstration attack worry about maximum risk
... with inadvertent, you can use average risk
... what are the appropriate thresholds?

<aleecia> So when you say there is no link to the production system, I disagree.

<Wileys> Aleecia - we keep the datasets completely separate with strict access controls, policy, training, etc. - you wouldn't get both.

<AHanff> oh my, how many times have I head that one and then seen humble pie served lol

Khaled: In practice, the highest risk used is .33 to as low as .05

<aleecia> A different and possibly useful approach, but they *are* linked.

<Wileys> But that is our risk to manage since we make the statement the data is deidentified.

<AHanff> heard*

Khaled: No one releases data with a risk higher than .33, increased precedence for other values
... practical range (court cases, regulatory authorities), choose one of four: .33, .2, .09, .05
... no scientific way to choose value, based on past use and changed over time
... .09 and .05 are used in public disclosure

<aleecia> There might exist something in there I could reluctantly live with while really not liking. :-) (And there might not.) What I'll put my body on the tracks for is the idea that you could then publicly release that data.


Khaled: .33 and .2 are for releases to trusted business partners

scribe: these thresholds are to protect against demonstration attack

<Chris_IAB> Has this deck (being presented currently) been placed into the W3C record?

<justin> Chris_IAB, it's in the mail archives.

scribe: all known attacks have been conducted by academic and media

<Wileys> Aleecia - we have yet another de-identification process for data we release to researchers - so I absolutely agree with you!

scribe: this is maximum risk, no one has a higher risk of re-identification than the level

<Wileys> Chris, it went out to the public mailing list so its now recorded.

scribe: In practice, these numbers are conservative: data changes, imperfect data cause errors
... the numbers used are ceilings on risk, real risk are lower

<aleecia> Shane - could you describe the de-identification for researchers?

scribe: Cell sizes: 3, 5, 11, 20
... the smallest cell sizes (population cell sizes), may be smaller in a sample
... If you create a population with cell size of 5, you can take a sample and have a lower cell size
... number of individuals with same cell of quasi identifiers
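The cell sizes listed above are the reciprocals of the risk thresholds mentioned earlier (1/3 ≈ .33, 1/5 = .2, 1/11 ≈ .09, 1/20 = .05). A minimal sketch of how maximum and average risk could be computed from cell sizes, assuming a toy table whose columns are all quasi-identifiers; the records and function names are hypothetical and not taken from the presentation:

```python
from collections import Counter

# Hypothetical records: every column is assumed to be a quasi-identifier
# (age band, postal prefix, gender).
records = (
    [("30-39", "K1A", "F")] * 3    # a cell of size 3
    + [("40-49", "K2B", "M")] * 5  # a cell of size 5
)

def max_reid_risk(rows):
    """Maximum risk: 1 / size of the smallest cell."""
    cells = Counter(rows)
    return 1 / min(cells.values())

def avg_reid_risk(rows):
    """Average risk over all records: number of cells / number of records."""
    cells = Counter(rows)
    return len(cells) / len(rows)

# Cell sizes named in the discussion and the thresholds they imply.
thresholds = {size: round(1 / size, 2) for size in (3, 5, 11, 20)}

print(max_reid_risk(records))
print(avg_reid_risk(records))
print(thresholds)
```

The gap between the two functions mirrors the distinction drawn above between maximum risk (relevant to a demonstration attack) and average risk (relevant to inadvertent re-identification).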

Ed: have to assume quasi identifiers

Khaled: only a small subset of variables in data set are quasi identifiers

<Wileys> Aleecia - it varies based on the nature of the dataset but general attributes are: older data, no identifiers, data sets highly numerized (example, instead of showing actual category of music, we show only a number representing a category but give no information to provide context for that category).

David: with a cell size of 11, there is a 9% probability of a record being re-identified?
... any single record or one record out of the whole?

Jeff: are 9% of the records identifiable? Public databases have 9% chance of re-identification.

Peter: there has never been a re-identification of properly de-identified database, but 9% risk?

<Wileys> +q

Joe: demonstration attack on HHS database de-identified?

Khaled: the hit rate of re-identification is much lower than those values, never have been able to re-identify at a rate higher than the threshold.

Felix: if you start guessing, you will be right 9% of time, do I care if I know?

Rob: if I were to guess randomly, I would get some right randomly

Felix: you would not know you are right, but you could guess 9%.

<jmayer> This is assuming complete l-diversity among the group?

<aleecia> Shane - that sounds a lot closer to what would be reasonable to provide to users who turn on DNT

Khaled: with unlimited resources, they could verify, but expensive
... how do you choose one of four values?
... public you use .05 or .09. If not public, you look at a number of other factors
... if company have good controls, not as worried about a rogue employee

<dtauerbach> i think the wifi in the room isn't great, i suspect that's the reason

David: do you look at sensitivity of data?

<justin> We'll see what we can do during the break.

<johnsimpson> I am not doing anything.. Don't know why it is happening

Khaled: three things to look at: sensitivity, potential harm, and consent
... motives managed with contract
... with academics and journalist motive to re-identify
... there are checklists for doing this process.
... need a repeatable process to evaluate all of the factors

Chris: is there ever a scenario that there is zero risk if you release data?

Khaled: no

<jmayer> ...but there are systems that can give rigorous bounds on risk if you release data.

Peter: threat models, why would someone attack here, how capable (money, show you're smart)
... might be commercial reasons, upset employees, think of all the reasons why people might attack
... why do we care here, what are the harms, are they very sensitive

<Wileys> Aleecia - I understand that is your perspective of what DNT should mean - as you know I disagree with that position and would interpret a DNT to mean something different (no profiling, not 'no analytics')

Peter: different values of invasion of privacy: complete browsing history available to FBI may upset some advocates

<aleecia> I don't think the FBI is the worst thing possible - we operate in an international climate

Peter: other end of spectrum: not a big deal, no one would care about browsing, little harm or risk around it
... assume different views on invasion of privacy.
... Left side of slide: mitigating controls
... lot of discussion on de-identification has been on publicly disclosed databases
... if you post on internet, smart people will attack, that is purely technical protection
... most of the stuff we are talking about is different: secret databases, set of administrative controls
... privacy act talks about technical, administrative and physical safeguards

<aleecia> Shane - we started this with the idea that DNT would limit collection of data. If we actually did that, I'd relax in other areas. But right now we're talking about no reduction in collection at all. My fear is that we build a system that is deceptive :-)

Peter: that is how a lot of the data protections take place today

<aleecia> When I talk to users, their main concern is not profiling, it's the data collection itself

<Wileys> Aleecia - as long as we're clear with users and the world on exactly what DNT means and how data will be handled then we won't be deceptive

<aleecia> And we're not going to help them with that

Peter: all the different variables would feed into how we think about de-identification

Jonathon: factors that could contribute to or mitigate risk, but no way to eliminate risk

<aleecia> Shane - I agree that being clear is necessary. I disagree that it is sufficient

Jonathon: we do have ways to put rigorous bounds on risk developed by computer scientists

<AHanff> with respect, privacy and data protection are not the same thing. Privacy rights don't exist merely to manage risk, there are rights based around people's desire to lead a private life. So it is irrelevant to say that if data is de-identified it is ok because there is no risk, people have a right (under law in Europe and elsewhere) to refuse to have that data collected in the first place.

Jonathon: we can determine just how much the best adversary can accomplish

<aleecia> If we carefully document that DNT does nothing at all, that's not sufficient :-)

<Wileys> AHanff, you're overstating EU law

<AHanff> actually no I am not, would you like me to quote it verbatim, I worked on it so I know it pretty well...

Jonathon: techniques for rigorous bounds: differential privacy, body of writing on developing advertising analytics without following users around

<Wileys> Aleecia, so we agree on being clear, we disagree on the level of data "scrubbing" that comes with a DNT signal. Progress... :-)

Jonathon: lets make marginal gains, some are more rigorously oriented

<justin> There was disagreement that we should be clear before?

<aleecia> I think you're even agreeing that being clear is not all that's needed

<jmayer> s/lets make/some propose/

<Wileys> AHanff, please share EU case law that supports your position - not your subjective interpretation of the written law.

<Wileys> Aleecia - agreed :-)

Khaled: the managing risk slide is operational

<aleecia> breakfast time, yay

<susanisrael> npdoty can you help me advise zakim that my phone number is 215 286 aajj

<johnsimpson_> test

<johnsimpson_> Shane, problem was the network we were on. Changed network.

<johnsimpson_> hope this is steadier

<susanisrael> npdoty: can you help me communicate with zakim about my phone number? i don't seem to have the syntax right.

<Wileys> John - that didn't seem to do the trick

<Wileys> Hard to follow anything on IRC today with so many connect/disconnect events being thrown up.

Peter: Mike had comment on last slide

<JoeHallCDT> ok, how do I scribe nick me?

<scribe> Scribe: JoeHallCDT

<justin> scribenick: joehallcdt

<moneill2> cookies are not anonymous, they pinpoint an individual/device

<scribe> scribe: JoeHallCDT

Peter: we're not going to debate how strict a standard is

… let's imagine a three-step model

… super strict standard for De-ID, a middle ground and no de-ID

<justin> Speaker was Mike Nolet from AppNexus

thx

… there are choices for businesses to give up a de-ID'd approach if the cost is too high

Mike Nolet: it's not as much cost as competition

… some companies are getting into third party advertising

<moneill2> identifiers in cookies are PII in Europe

Mark Groman: truly believe that the standard we're discussing will have unintended consequences

… some of the things we propose may have a net-negative impact on privacy

<susanisrael> *Joehallcdt if you want me to scribe let me know

<jmayer> So, about that de-identification topic...

… the notion that opt-in consent is all that's needed to over-collect

Peter: we did start with a discussion of incentives for de-ID

… one was compliance with NAI, etc, codes

<moneill2> You have to say what data you gather and what you intend to do with it to get consent

<justin> The FTC sees cookies and IP addresses as "personal information" as well. All information is personal, but some is more personal than others.

<justin> There is a value in incentivizing companies to keep data at pseudonymous instead of real-name identifiers.

gills (?): if we follow de-ID as a privacy protective tool, we can't say that a cookie is PII

<efelten> There is no notion of PII in this standard.

<justin> But this is somewhat off topic.

… you've created an incentive to create PII databases

… PII should matter, if you value de-ID as a way to break the link to the individual

Chris Mejia: agrees with Jonathan!

… we are supposed to do good practices for de-ID and I want to do that.

<susanisrael> *joehallcdt you had marc groman and paul glist speaking before chris iab

Peter: has not had that focus, wants to have common language

<susanisrael> sribenick: susanisrael

<susanisrael> peter swire: let's start talking about hashing

<justin> DNT was proposed as a solution to address pseudonymous third party tracking. I don't think we're going to walk away from that idea at this point.

<susanisrael> khaled: understand that hashing was discussed as a way to protect against cookies or other unique identifiers

<susanisrael> ...if you are hashing without salting, can easily be broken and recover say ss#, so plain hashing not recommended

<Wileys> This makes sense for sharing data externally but not for internal storage of data

<susanisrael> ...if you have [something] that can be added to your value....but challenge for distributed system with salt, you don't want to distribute salt to everyone

<susanisrael> ....have to come up with protocol where salting happens at central location.
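A minimal sketch of why a plain (unsalted) hash of a small identifier space is easy to reverse, and how a secret key held at one central location changes that. The 4-digit identifier space, the key value, and the function names are assumptions for illustration; a real SSN space is larger but still fully enumerable:

```python
import hashlib
import hmac

def unsalted_hash(value: str) -> str:
    # Plain hash: anyone can recompute it for any candidate value.
    return hashlib.sha256(value.encode()).hexdigest()

def keyed_hash(value: str, key: bytes) -> str:
    # Keyed hash (HMAC): recomputing it requires the secret key.
    return hmac.new(key, value.encode(), hashlib.sha256).hexdigest()

def candidate_space():
    # Toy identifier space: every 4-digit number (assumed for brevity).
    return (f"{n:04d}" for n in range(10_000))

def dictionary_attack(target_digest: str):
    # Hash every candidate and compare against the leaked digest.
    for candidate in candidate_space():
        if unsalted_hash(candidate) == target_digest:
            return candidate
    return None

CENTRAL_KEY = b"hypothetical-key-held-centrally"

recovered_plain = dictionary_attack(unsalted_hash("4821"))
recovered_keyed = dictionary_attack(keyed_hash("4821", CENTRAL_KEY))

print(recovered_plain)
print(recovered_keyed)
```

The second attack fails only because the attacker lacks the key, which is why the discussion turns to who is allowed to hold the salt and where hashing happens.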

<susanisrael> [someone] need to know who can hash

<susanisrael> [who was speaking?]

<dtauerbach> efelten

<efelten> s/[someone]/efelten/

<susanisrael> khaled: one alternative is to use public keys that you can distribute and have encrypted value done say within browser

<susanisrael> ...instead of hashing you encrypt

<susanisrael> ...other consideration even with salted values is that you can have frequency attacks...certain names more common...can guess.

<susanisrael> ....so can recover names by looking at frequency. even ss#s. so salting not adequate where there is frequency distribution

<susanisrael> .....with encryption [?] would do it differently each time, frequency not an issue

<susanisrael> .....to the extent its a problem certain fields may be too long to process or transmit [with encryption?].....

<susanisrael> ...so for example you can get encrypted ss# with same character set as actual ss# so you avoid long strings. sometimes practical advantage
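A minimal sketch of the frequency attack described above, assuming a toy list of surnames and an attacker who knows only the public popularity ranking of those names. The `randomized_token` function is a simplified stand-in for the randomized encryption mentioned, not real encryption; all names and values are hypothetical:

```python
import hashlib
import hmac
import os
from collections import Counter

SECRET = b"hypothetical-secret"  # assumed; never visible to the attacker

def salted_hash(value: str) -> str:
    # Deterministic: the same name always yields the same token.
    return hashlib.sha256(SECRET + value.encode()).hexdigest()

def randomized_token(value: str) -> str:
    # Stand-in for probabilistic encryption (NOT real encryption):
    # a fresh random nonce makes every token different.
    nonce = os.urandom(16)
    tag = hmac.new(SECRET, nonce + value.encode(), hashlib.sha256).hexdigest()
    return nonce.hex() + tag

names = ["Smith"] * 5 + ["Jones"] * 3 + ["Zuckerman"]
public_ranking = ["Smith", "Jones", "Zuckerman"]  # most to least common

def frequency_attack(tokens, ranking):
    # Pair the most frequent token with the most common name, and so on.
    ranked_tokens = [token for token, _ in Counter(tokens).most_common()]
    return dict(zip(ranked_tokens, ranking))

guesses = frequency_attack([salted_hash(n) for n in names], public_ranking)
random_tokens = [randomized_token(n) for n in names]

print(guesses[salted_hash("Smith")])
print(len(set(random_tokens)), "distinct tokens for", len(names), "records")
```

With the deterministic salted hash the attacker recovers every name without ever learning the salt; with randomized tokens every record looks unique, so the frequency distribution carries no signal.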

<susanisrael> peter swire: have some observations: lots of hashing in commercial ecosystem. heard yesterday at hhs that unsalted ss# not ok bc easy to do dictionary attack

<Wileys> Good resource on the technical and security details in this area: http://crackstation.net/hashing-security.htm

<susanisrael> .....turning to ed, you have expressed cautions re: hashing.

<susanisrael> ed felten: different scenarios in which hashing fails. doesn't do much without salt.

<susanisrael> ...even with salted hash someone who knows the salt can generally break it or someone who can cause salted function to be evaluated on their behalf.

<susanisrael> ....gives example where you ask one server to compute hash on another. [simplified]

<rvaneijk> A hash turns user data into a pseudonymous identifier

<susanisrael> ...if multiple records contain same salted hash value they can be linked. need to use probabilistic encryption or something like that
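A minimal sketch of the case Ed describes, where the attacker never learns the salt or key but can cause the hash to be evaluated on inputs of their choosing. The `HashingService` class, the email addresses, and the table contents are hypothetical:

```python
import hashlib
import hmac

class HashingService:
    """Hypothetical service that holds the secret key and hashes on request."""

    def __init__(self, key: bytes):
        self._key = key

    def pseudonymize(self, identifier: str) -> str:
        return hmac.new(self._key, identifier.encode(), hashlib.sha256).hexdigest()

service = HashingService(b"key-never-shown-to-the-attacker")

# "De-identified" table keyed by hashed identifier.
deidentified = {
    service.pseudonymize("alice@example.com"): ["news", "sports"],
    service.pseudonymize("bob@example.com"): ["travel"],
}

def reidentify(candidates, oracle, table):
    # The attacker submits known identifiers and looks the results up.
    hits = {}
    for candidate in candidates:
        token = oracle(candidate)
        if token in table:
            hits[candidate] = table[token]
    return hits

found = reidentify(
    ["alice@example.com", "carol@example.com"],
    service.pseudonymize,
    deidentified,
)
print(found)
```

The attack works the same whether the function is a salted hash or a keyed hash, which is the sense in which access to the hashing function is as good as knowing the salt.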

<susanisrael> chris iab: there is hashing then access to salt

<Wileys> We should discuss keyed hashes as being superior to salted hashes (although in the same universe)

<susanisrael> ed felten: not just access to salt. if you have value hash then you can do same dictionary attacks as if you knew salt so not enough to ask if you know salt

<susanisrael> ed felten: can make sophisticated argument .....rare case where hashing is secure

<susanisrael> peter swire: assume people will use hashing and will be long enough not to be broken

<susanisrael> chris iab: how reliable?

<Wileys> One-way hashes don't allow direct reverse identification by themselves - access to the salt/key allows someone to perform a dictionary attack

<susanisrael> ed felten: if you can have hash computed for you just the same as if you can break it

<Wileys> Requires access to the original raw data (if it still exists) and the salt/key

<susanisrael> what are we hashing?

<rvaneijk> In the EU organizational measures are not enough to make hashed values of user data anonymous.

<susanisrael> someone [who is speaking?]: will use admin controls with hashing

<susanisrael> ed: if you can make up inputs and ask people to hash them that is just as good as if you had the salt

<susanisrael> someone: but that is from knowing input and output

<Wileys> Rob, if paired with administrative, technical, and policy/educational, then keyed hashing is considered enough to reach the point of "likely reasonable" to no longer be personal data (de-identified), correct?

<susanisrael> ed felten: what if you take value with identifier and cookie, ask someone to make salted hash, don't tell you the salt, but put it back in your data base

<Wileys> Rob, add "safeguards" after "policy/educational"

<susanisrael> someone: but that assumes you know input and output

<rvaneijk> shane: if you throw away the key, then yes. TomTom was a nice example.

<susanisrael> peter swire: i have observed lots of hashing in ad world. for most sophisticated attackers they may be able to break them

<susanisrael> ...we will eventually have to come to view of how we will discuss all this. so common hashes might be of email address? cookie value?

<Wileys> Rob, if you keep the key in a safeguarded location, limited access, technical controls, etc. - I believe you still reach the bar per the A29WP Opinion from April 2011.

<susanisrael> peter swire: let's take email addresses. if my email is hashed using proper salt, and someone gets output, they can eventually figure out hash and salt

<Wileys> Rob, or was that 2010 - I'll look it up.

<susanisrael> ed felten: can ask that hash be done on known value, and record hashed value in database then can correlate

<rvaneijk> Well, that safeguard is a very high bar, ie a notary, who has a legal obligation to not disclose

<susanisrael> [someone] qu is from whom you are trying to secure the data

<Wileys> Rob, I agree throwing away the key is an absolute end-point, but I'm aiming for the 'likely reasonable' standard

<susanisrael> is it protection at all wrt a particular party that has particular data

<susanisrael> david w. not hashing for hashing's sake. need to figure out from whom you are trying to protect the data, and tailor approach to that

<rvaneijk> Shane, the point is, that if I should not be able to calculate a hash after let's say a year, and expect the same output, such that users can be re-identified.

<rvaneijk> s/if/_/

<susanisrael> khaled: even if we go back to previous model using hash or salted hash, probability of recovering original value is 1, certain

<Wileys> Rob, why? As long as the original key is secure, then there is very low risk of user re-identification

<aleecia> Rob, is that an art 29 position, or your own? (Both are valuable, I'm just trying to get which is what)

<susanisrael> chris iab: assuming you have access to data in first place, right?

<susanisrael> khaled: so final result at end of all risk assessment is still high, still has to be further mitigated

<Wileys> Aleecia, the A29WP position in the opinion paper is not as strict as Rob is stating (in my opinion)

<vincent> Wileys, in the DNT case, are we just considering hashing cookie IDs? if so, I'm not sure it brings any real protection: cookie IDs are opaque anyway

<susanisrael> peter swire: let's see why people might feel strongly

<susanisrael> ...if db is publicly accessible and people can get access then probability of breaking is higher, but david and chris are saying you can limit access

<Wileys> Vincent, keyed hashing coupled with other measures, as well as the cessation of certain business activities (profiling), does meet the goals of DNT in my opinion.

<susanisrael> .[someone]..but ed is saying if you have access to hash and salt -if disconnected doesn't work

<yianni> Jeff Wilson

<susanisrael> david w: i think what we are talking about is that using some form of oneway hash was a useful method of de-identifying

<susanisrael> khaled: depends. must be done in such a way that you can protect against attacks ed is describing which are quite trivial

<vincent> Wileys, well that's not my question :). What type of protection does it bring with regard to the risk of re-identifiication?

<susanisrael> david and khaled back and forth a bit

<rvaneijk> Shane, let's have this discussion in Boston

<susanisrael> khaled: probability that someone attempts to attack, then that they can break hash

<Wileys> Vincent, as long as the original data is not accessible and neither is the key to the hash, then there is very low risk of re-identification (depending on the details housed within the de-identified dataset)

<rvaneijk> Aleecia: formal position within this DNT debate

<susanisrael> ...if low probability of attempt ....hard to make that case

<susanisrael> [someone] isn't probability of reidentification only 1 if you have access to the computer?

<Wileys> Rob - agreed - looking forward to it (the conversation that is, not the horrible weather we're likely to encounter in Boston :-) )

<rvaneijk> :)

<susanisrael> khaled: depends on workflow. may be hashed then go to central db

<yianni> s/someone/Mike Nolet

<aleecia> We need to recruit a new WG member with a big office in the Florida Keys

<Wileys> +1 to Aleecia!

<aleecia> Rob - thanks, that's exactly what I was asking, thank you

<susanisrael> mike nolet : i have unique cookie id on ed. need to get totally random integer, if someone is snooping on all net traffic or has access to pc or net connection

<vincent> Wileys, how is the re-identification risk lower with the hashed cookie ID rather than with the unhashed cookie ID? (that's actually what's discussed right now)

<susanisrael> peter swire: is there a scenario where hashing matters? mike was saying you have to have access to cookie

<susanisrael> chris iab: does it matter if transferring to another party or internally?

<susanisrael> peter swire: we are learning something

<Chris_IAB> this was the equation put on the board: pr (re-identification) = pr (re-id/attempt) x pr (attempt)
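The equation from the board, written out, with one worked case. The numeric values are hypothetical and chosen only to show the arithmetic:

```latex
\Pr(\text{re-id}) = \Pr(\text{re-id} \mid \text{attempt}) \times \Pr(\text{attempt})

% Hypothetical values: a 20% chance of success per attempt, 10% chance of an attempt
\Pr(\text{re-id}) = 0.2 \times 0.1 = 0.02

% If recovery is certain once attempted, as argued for hashed values,
% the first factor is 1 and everything depends on the chance of an attempt
\Pr(\text{re-id}) = 1 \times \Pr(\text{attempt}) = \Pr(\text{attempt})
```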

<susanisrael> jeff? there is industry practice where you hash, independent party enriches by matching, and there is permission to share 7 matches

<rvaneijk> Cookie exchanges are interesting in this context..

<Wileys> Vincent, its lower only if coupled with other factors (multi-factor test) such as seclusion of the key/salt and removal of access/existence from the original dataset.

<susanisrael> ....common identifier can be hashed

<Wileys> +q

<susanisrael> peter: so that is one scenario, do you see usefulness ed?

<susanisrael> shane: the core purpose at yahoo for hashing/keys, is to disconnect that data from use in actual production systems

<justin> "destroy"?

<susanisrael> ...destroys possibility for profiling, targeting. can not be used to modify users experience. but still useful for analysis..

<susanisrael> peter swire: ed or dan does that make sense to you?

<rvaneijk> WileyS, right. the goal is to break the re-identification

<susanisrael> dan: i am confused by that

<aleecia> sigh

<susanisrael> shane: these are always multifactor tests. your purpose in hashing is to not do this. once you add multifactors, it serves purpose

<susanisrael> [someone] if you can get hash function or key it doesn't matter

<susanisrael> shane: good luck. we make key very inaccessible

<yianni> s/someone/Joe Hall

<susanisrael> ed felten: who knows keys?

<susanisrael> shane: keys are very large. systems that are set up to de-identify know key, but human connection to key is not allowed

<susanisrael> felix: so if i understand correctly usefulness is to separate one part of company to another?

<Chris_IAB> dwainberg, in case you missed it, "the key is on a post-it on Shane's desk" (that's a JOKE, btw.. lol)

<susanisrael> shane: really to separate info from another context

<aleecia> Chris - love it!

<susanisrael> felix: 2 people (one w key) are separate

<susanisrael> shane: isolation of key is not only factor.

<Wileys> Chris, LOL

<susanisrael> peter swire: i think its relevant bc hashing and its uses have been talked about in a lot of context. people in ad industry at one end of table, others at other

<susanisrael> khaled: if that separation is strong and defensible, then at least under hipaa that would be ok. if you have good procedures for controlling access to key that's ok

<Wileys> Yay for Yahoo!, we're good by HIPAA standards (too bad we don't handle PHI :-) )

<susanisrael> ....scenarios where regulators have accepted that

<susanisrael> dan auerbach: rotating salt helps a lot

<Chris_IAB> rotating salt is a good practice

<aleecia> rotating salts kills everything shane wants out of the data

<Wileys> Aleecia - we do rotate, but not daily.

<susanisrael> david wainberg: we are saying its not binary, hashing is not perfect, question is how hard does it make it? how hard do we want to make it? what is the context/data involved?

<justin> Rotating salts kills longitudinal view, which is a feature or bug depending on how you look at it.
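A minimal sketch of what rotating the salt or key does to longitudinal linkage, assuming one random key per calendar month. The period labels, cookie value, and function names are hypothetical:

```python
import hashlib
import hmac
import secrets

# One independent random key per period (assumed monthly rotation).
keys = {period: secrets.token_bytes(32) for period in ("2013-01", "2013-02")}

def pseudonym(cookie_id: str, key: bytes) -> str:
    # Keyed hash of the cookie ID under the key for the current period.
    return hmac.new(key, cookie_id.encode(), hashlib.sha256).hexdigest()

jan_first = pseudonym("cookie-123", keys["2013-01"])
jan_second = pseudonym("cookie-123", keys["2013-01"])
feb = pseudonym("cookie-123", keys["2013-02"])

print(jan_first == jan_second)  # records within one period still link
print(jan_first == feb)         # records across periods no longer link
```

Within a period the same cookie maps to the same pseudonym, so analysis inside that window still works; across a rotation the pseudonyms diverge, which is the loss of longitudinal view noted above.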

<Chris_IAB> aleecia, it means Yahoo buys LOTS of post-its (again, marked as a JOKE folks :)

<susanisrael> someone: sounds like its trivial to break it

<aleecia> I go with feature, Shane goes with bug :-)

<susanisrael> david wainberg: what do you mean by trivial

<yianni> s/someone/Joe

<rvaneijk> what really hard means also depends on the purpose, not only on the context

<Wileys> Aleecia, :-)

<susanisrael> david w: depends on combination of technical and administrative

<aleecia> buy stock in 3M, folks! you heard it here first.

<susanisrael> someone: shane is describing intentional inadvertent viewing of data

<yianni> s/someone/mike nolet

<susanisrael> shane: purpose is more than just personal protection--disconnect data from operational systems so utility limited and therefore privacy is increased

<susanisrael> jeff: everyone agrees with ed or should. if you have access to salt, it doesn't work. but if we say salting/hashing does not work, then we are saying passwords on internet don't work

<susanisrael> ....if you have access to hash and salt you could access hashed stored passwords

<aleecia> daily rotated salts is at least a step forward. but having it change only when the janitor tosses out the post its by mistake once a year isn't going to make me happy :-)

<susanisrael> chris iab: what would the alternative? put all raw data out on internet? or not collect any data?

<vincent> WIleys, would not a request like "SELECT User from DB where user visited site1,site2,...,siteN" recreate the link that the hash just deleted?

<Wileys> Aleecia - its a bit more formal/regular than that. Note - I don't use post-its :-)

<susanisrael> ed felten: i have not heard an example here where hashing really helps

<susanisrael> peter swire: i spent 2 years working on crypto policy. if system broken it doesn't work, but in practice it works 99 percent of the time

<Wileys> Vincent, the hash was not meant to hide activity but rather to disconnect identity from operational systems.

<susanisrael> ...i have heard that there are attacks that could be made, but i have heard about administrative controls

<rvaneijk> Passwords are used to verify an identity, based on a shared secret, which is a totally different mechanism

<susanisrael> ....all those seem like things in real world where protection is more than zero though might still be subject to some kinds of attacks

<susanisrael> ed felten: no because these attacks are trivial

<vincent> Wileys, yes but the history of websites visited by a user would help to reconnect the different operational system (the list of website is used as a unique identifier)
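A minimal sketch of Vincent's point that the list of sites visited can itself act as a unique identifier, re-linking a hashed data set to a raw one without touching the hash. The identifiers and site lists are hypothetical:

```python
# Raw log keyed by cookie GUID, and a "de-identified" copy keyed by hash.
raw_log = {
    "guid-A": {"site1", "site2", "site9"},
    "guid-B": {"site3", "site4"},
}
hashed_log = {
    "h-77": {"site1", "site2", "site9"},
    "h-12": {"site3", "site4"},
}

def link_by_history(raw, hashed):
    # Index raw identifiers by their exact set of visited sites.
    by_history = {}
    for guid, sites in raw.items():
        by_history.setdefault(frozenset(sites), []).append(guid)
    # A hashed record is linked when exactly one raw record shares its history.
    links = {}
    for token, sites in hashed.items():
        matches = by_history.get(frozenset(sites), [])
        if len(matches) == 1:
            links[token] = matches[0]
    return links

links = link_by_history(raw_log, hashed_log)
print(links)
```

The join never uses the hash function, the salt, or the key, which is why the replies above turn to URL cleansing and to keeping the two data sets apart.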

<susanisrael> si question: do these attacks in fact happen in companies all the time in the real world?

<peterswire> jonathan -- I see you;

<aleecia> Shane - 3M weeps

<Wileys> Vincent, agreed - so some URL cleansing helps remove this issue - or in the case of searches, attempts to cleanse personal data in queries helps.

<susanisrael> ed felten: if we say we will separate our data base into 2 pieces and only one is hashed, whatever analysis someone wants to do they just need to do one more step

<susanisrael> chris iab: but they would have to have access right?

<Wileys> Vincent, my approach can't guarantee 100% certainty but does meet the "very low risk" bar - or in the EU context, the "likely reasonable" bar.

<susanisrael> jmayer: concrete example: ad company i studied tried to use hashing to do follow on analysis. user had id cookie. then had another cookie. "anonymous"

<justin> If the spec allows for a 30 day short-term retention period, presumably the group would be OK if the salts were rotated at least every 30 days.

<susanisrael> ...idea was that anonymous one was hash with secret salt and would be used for long term things and more private but susceptible to same attacks because you could always correlate with original cookie

<susanisrael> peter swire: jmayer you were giving example, and jeff and chris had questions or comments

<susanisrael> chris iab: you described a bad practice

<susanisrael> ...you don't throw out baby with bath water. Just bc there is one bad practice doesn't mean all hashing worthless

<vincent> Wileys, I don't the "very low risk" bar well enough :) just trying to see what is the type of threat that cookie hashing address

<efelten> We have yet to hear an example where hashing makes any attack appreciably more difficult.

<Wileys> Justin, the spec should not be prescriptive on timeframes and rather, much like HIPAA, should focus on acceptable risk thresholds.

<susanisrael> jmayer: agree there are better engineering practices; but pretty predictable failures; have heard things like figuring out salt or doing dictionary attacks,

<susanisrael> ...but these are not only attacks. there are enormous re-identifiability problems.

<Wileys> Vincent, you don't "?" the "very low risk" bar well enough?

<rvaneijk> Ed, hashing makes sense, if you take out information such that enough collisions appear, that meat a k-anonymity bar.

<aleecia> Justin, I think you're saying: if we're going to have 30 (or more) days for people to take first-logged data to figure out what they have and if they're first or third party while collecting, then we should also be ok with a company holding all data indefinitely, so long as they rotate every 30 days.

<rvaneijk> s/meat/meet/
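A minimal sketch of Rob's collision idea, assuming 1,000 hypothetical user IDs and SHA-256 as the hash: truncation only groups users together when very few bits are kept. The ID format and bit widths are assumptions for illustration:

```python
import hashlib
from collections import Counter

def truncated_hash(identifier: str, bits: int) -> int:
    # Keep only the top `bits` bits of the 256-bit SHA-256 digest.
    digest = hashlib.sha256(identifier.encode()).digest()
    return int.from_bytes(digest, "big") >> (256 - bits)

user_ids = [f"user-{i}" for i in range(1000)]  # hypothetical identifiers

def smallest_bucket(bits: int) -> int:
    # Size of the smallest group of users sharing one truncated value.
    buckets = Counter(truncated_hash(uid, bits) for uid in user_ids)
    return min(buckets.values())

print(smallest_bucket(128))  # half the digest: every user is still alone
print(smallest_bucket(4))    # 16 possible values: users share buckets
```

Keeping half of a 256-bit digest leaves every user with a unique value, so truncation alone does not produce the collisions a k-anonymity bar would need; the smallest bucket grows only when the output space is small relative to the population.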

<dtauerbach> I think the point is that in all the examples so far, hashing is purely a method of operational control, and it is not a great one given engineering challenges

<vincent> Wileys, I don't "know" it well enough, sorry

<susanisrael> ....i think we have an error in the way some people are approaching this. you have fact pattern, try to apply approach. start with specific problem and way to solve and ask if hashing get you there...

<dtauerbach> e.g. you can't have an oracle and that is hard to control in practice

<susanisrael> ....ed is not asking straight up/down vote on metaphysics of hashing...and i have not heard concrete problem and proposed hashing solution that solves the problem

<justin> aleecia, well, we've had different interpretations of the point of the short-term period over time, but basically yes.

<Wileys> Ed, if a dataset were breached in isolation (a single data table), wouldn't you agree that hashing of identifiers in that table (depending on what additional feeds were available) would help deter re-identification?

<susanisrael> peter swire: can industry explain use case where hashing helps?

<susanisrael> david wainberg: can we identify risk that ed and jonathan are concerned about and see if that can be addressed

<aleecia> Justin - ok. So I'm ok with a single short period, but may not be ok with infinite retention even with rotation

<susanisrael> felix? : sounds like we are concerned about internal controls. valuable if you have company where not everyone or no one is careless or malicious

<efelten> What I'm looking for is a specific example--a specific use of hashing, and a specific attack that is made more difficult because of the use of hashing.

<susanisrael> jeff: 3 scenarios where hashing helps. 1: passwords

<susanisrael> 2. if you want to do research internally in large company.....

<dtauerbach> Shane, it depends on the details of the hashing. For example, an unsalted hash of social security numbers in that isolated table does not help at all

<Chris_IAB> new (related) subject: are toilet seat covers effective? (again, humor is my defense mechanism :)

<justin> aleecia, Fair enough, to the extent there is an inherent risk that a delinked 30-day set of urls is inherently identifiable and/or tiable to other 30-day sets.

<Wileys> dtauerbach, agreed - I'm speaking only of salted or keyed hashes.
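
The distinction drawn in this exchange between unsalted and keyed hashes can be sketched as follows. This is a minimal illustration, not anything proposed in the meeting: the 4-digit identifier space, the function names, and the choice of SHA-256 and HMAC are all assumptions made for the demo. It shows that an unsalted hash over a small identifier space falls to simple enumeration, that a keyed hash resists an attacker who lacks the key, and that anyone holding the key (the "oracle" case) can still reverse it.

```python
import hashlib
import hmac
import secrets

def unsalted_hash(identifier: str) -> str:
    """Plain SHA-256 of the identifier, with no salt or key."""
    return hashlib.sha256(identifier.encode()).hexdigest()

def keyed_hash(identifier: str, key: bytes) -> str:
    """HMAC-SHA-256 of the identifier under a secret key."""
    return hmac.new(key, identifier.encode(), hashlib.sha256).hexdigest()

def dictionary_attack(target_digest, candidates, hash_fn):
    """Enumerate every candidate identifier and return the one that matches."""
    for candidate in candidates:
        if hash_fn(candidate) == target_digest:
            return candidate
    return None

# Hypothetical small identifier space (stand-in for something like SSNs).
ID_SPACE = [f"{i:04d}" for i in range(10_000)]

secret_key = secrets.token_bytes(32)
leaked_unsalted = unsalted_hash("4271")
leaked_keyed = keyed_hash("4271", secret_key)

# An outsider with only the leaked table reverses the unsalted hash.
recovered = dictionary_attack(leaked_unsalted, ID_SPACE, unsalted_hash)

# The same outsider, lacking the key, gets nowhere against the keyed hash.
outsider_result = dictionary_attack(leaked_keyed, ID_SPACE, unsalted_hash)

# An insider who holds the key can still reverse the keyed hash.
insider_result = dictionary_attack(
    leaked_keyed, ID_SPACE, lambda c: keyed_hash(c, secret_key)
)
```

In this sketch the protection offered by the keyed hash depends entirely on who can reach the key, which is the operational-control point raised earlier in the discussion.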

<susanisrael> peter swire: so if some risk of internal misuse, but hash passwords or separate research database from where it came from, you reduce risk even...

<susanisrael> if doesn't protect against sophisticated attacks, reduces risk from normal people.

<aleecia> Justin - exactly

<susanisrael> felix: i think we are seeing risk reduction in normal ways. seeing qu from ed re: scenarios

<aleecia> I would guess that at 24 hours I'd be ok. But I'd need to know more. And I think the right way to get at this is not a timeframe, but rather the ability to chain across datasets

<susanisrael> in some sense from tech perspective does not help much but if the data just requires an extra step that may be enough to deter or detect attack from pt of view of internal controls

<susanisrael> mike nolet: re: david's question. what is risk you are talking of reducing

<susanisrael> someone: risk that info on research side is then used to target

<susanisrael> felix? if dnt is 1?

<jmayer> -q

<susanisrael> yes:

<susanisrael> ed felten: cs views attacks at 3 levels. started discussion bc broad claims were made that hashed data should be treated as per se de-identified.

<Wileys> Ed, It was never stated in isolation but as one factor of multiple steps to achieve unlinkability.

<Wileys> Ed, at least not by me

<susanisrael> ...we don't have to talk about hashing or micromanage how people protect, but i don't think we should talk about hashing as total protection

<susanisrael> paul glist: broad claims on both sides. have looked at this as dial. can reduce risk to socially acceptable levels. hashing is not nothing...

<Chris_IAB> +1 to current speaker's point

<susanisrael> ...and not everything. it's a tool. add other tools. it's useful.

<jmayer> There are protections that are effective even if an attacker controls the terminal. That's part of the point.

<susanisrael> johnsimpson: still having trouble figuring out how this relates to DNT. have been talking about protecting data sets with pii.

<dtauerbach> jmayer, for example: hard disk encryption

<susanisrael> chris iab: you may want to have access to uri's for example. but don't need it connected to unique users

<justin> Right, the deidentification method has to take into account the internal misuse angle.

<susanisrael> john simpson: but that's the disconnect bc most people saying that dnt is do not collect

<susanisrael> someone: is that right?

<susanisrael> someon: if there is any identifier you still have a problem

<justin> Someone is justin, someon is jmayer :)

<susanisrael> peter swire: we heard different perspectives:

<susanisrael> * thanks justin

<susanisrael> peter swire...unique identifiers. can you enlighten me? how is going into buckets relevant?

<susanisrael> joe hall asks if adding attributes and using those is unique identifiers

<susanisrael> dan auerbach: better privacy friendly way to add advertising that is targeted. need minimum number of people in a bucket

<rvaneijk> Dan, the minimum buckets make nice micro-segments.

<susanisrael> ...we suggested 1024 is a minimum bar. with that don't need unique identifier, just low entropy cookies
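
The minimum-bucket idea described here can be sketched as a simple size check. This is a hedged illustration only: the 1024 threshold is the figure mentioned in the discussion, while the data layout, function names, and the choice to drop the bucket label for users in undersized buckets are assumptions, not part of the proposal.

```python
from collections import Counter

MIN_BUCKET_SIZE = 1024  # minimum bar mentioned in the discussion

def releasable_buckets(assignments, k=MIN_BUCKET_SIZE):
    """Return the set of bucket labels that contain at least k users."""
    sizes = Counter(assignments.values())
    return {bucket for bucket, count in sizes.items() if count >= k}

def suppress_small_buckets(assignments, k=MIN_BUCKET_SIZE):
    """Replace the label of any undersized bucket with None (no bucket)."""
    allowed = releasable_buckets(assignments, k)
    return {
        user: (bucket if bucket in allowed else None)
        for user, bucket in assignments.items()
    }

# Hypothetical assignment of users to interest buckets.
assignments = {f"user{i}": "sports" for i in range(2000)}
assignments.update({f"rare{i}": "rare-hobby" for i in range(5)})

cleaned = suppress_small_buckets(assignments)
```

Here the five users in the undersized bucket end up with no bucket at all, so no group smaller than the threshold is ever exposed as a targetable segment.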

<susanisrael> heather: might be useful to look at transcript of previous discussion

<jmayer> If you're interested in advertising, analytics, etc. without unique IDs... https://air.mozilla.org/tracking-not-required/

<susanisrael> peterswire: room is not catching fire on this

<susanisrael> chris mejia: i do agree with dan's core premise, that much harder to identify person from a few attributes distilled from all the uris that people visited

<susanisrael> dan auerbach: can keep those collections without unique identifiers

<peterswire> ok, I see aleecia and jonathan

<susanisrael> chris: we agree on that part (harder to identify that way-with quasi identifiers), not necessarily the second part

<susanisrael> .....that is sort of an industry practice

<susanisrael> aleecia: i think we are all getting there. want to separate 2 different parts of dan's description. one is how to do ads without tracking....

<susanisrael> ...but pertinent is here's how you can do de-identification, suggest we focus on the de-id half

<susanisrael> aleecia: ....interesting re: reduced identification risk

<susanisrael> david wainberg: outline of discussion, 3 general models: 1. random unique identifier, interest buckets

<susanisrael> 2. unique identifier associated with buckets, dan proposing buckets only, no identifiers

<susanisrael> dan: maybe what aleecia proposed make sense

<susanisrael> david w: as discussed earlier, what we mean by de-identified requires setting threshold, and we're just jumping to let's break the connection instead of

<susanisrael> ...discussing what is a level of acceptable risk. there are significant consequences to forcing ad industry to do this

<aleecia> what does "not linked at all" mean here?

<susanisrael> peter swire: if not linked at all then outside dnt

<susanisrael> david w: but still some risk

<susanisrael> ed: but gets to idea of attribute disclosure vs record re-identification

<susanisrael> ed: matters a lot what the bucket is: soccer dad vs. aids patient

<aleecia> would like to respond to Ed

<susanisrael> ed: need more than knowing that there is a bucket, some sensitive info has to not be used

<susanisrael> ed: but combos of attributes could identify

<jmayer> Just to be clear, the DAA principles do not prohibit inferences about medical conditions.

<susanisrael> mike: want to come back to theme: understanding what we're trying to accomplish. what is bad stuff we are trying to prevent. seeing a relevant ad?

<aleecia> could we please stay on topic?

<peterswire_> jonathan -- I'm unclear -- are you in the q?

<aleecia> this is an interesting discussion, but not today's agenda

<susanisrael> ...what other bad stuff, scary outcomes, than seeing an ad for something i bought on amazon?

<jmayer> Yep, just testing the limits of Zakim.

<rvaneijk> The HARM is not a relevant factor when it comes to unlinkability

<susanisrael> peter swire: what the harm is in tracking comes up in a lot of settings but not main topic today

<susanisrael> aleecia: want to respond to ed re: which buckets you might care more about, but group decided we would not distinguish, say re: children's data

<susanisrael> ....treating all data same here, which is different than iab daa position

<susanisrael> peter swire: thank you for history but some people do not acknowledge they agreed to that

<susanisrael> jmayer passes

<susanisrael> peter swire: had initial discussions on buckets and learned a bit on dimensions there. talked with mike at break re: example of something you think it would be useful to look at

<aleecia> of note: this is not me *objecting* to treating some data as of more concern. just what the group decided many months ago.

<susanisrael> david wainberg: i thought next step would be taking approach of your favorite slide and start thinking through risks and how to apply techniques to mitigate

<aleecia> if there is new information before the group, Peter & Matthias have the option to reopen

<Wileys> Aleecia - my memory matches yours - we decided to not get bogged down in the "sensitivity" debate and allow self-regulation and laws deal with that item

<susanisrael> peter swire: that is one possible work flow. use khaled's checklist

<susanisrael> ...maybe there are subsets of people willing to do work on that and come back with a draft. let peter know after meeting if you want to work on

<justin> Yes, there has never been anything about "sensitive" data in the compliance spec.

<aleecia> thanks Shane. it was a while ago and pre-dates many folks joining the group. if needed the minutes are out there, but my eagerness to volunteer to find it is not particularly high this week

<susanisrael> chris: i have not gotten an answer to what works and protects data if hashing does not work, assuming we will have data

<justin> Well, apart from that one geolocation section . . .

<susanisrael> khaled: in health context use probabilistic encryption that permits mathematical operations on data

<Wileys> Aleecia, I likewise have no desire to volunteer on that point :-) But would be happy to argue to the same outcome as I believe it was a good decision by the group

<susanisrael> ...encrypt at source in browser....

<Wileys> Justin, agreed - not sure how that snuck through...

<susanisrael> if you want to use those values to do lookup in db not possible for db owner to determine lookup result

<susanisrael> ....efficient process. not much slower than hashing.

<susanisrael> ...using for lookup in large database

<susanisrael> peter swire: on a wednesday call could learn about homomorphic encryption. seeing nods on this

<susanisrael> dan auerbach. talking about fully homomorphic encryption? we are not close?

<susanisrael> khaled: partial
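
As an illustration of probabilistic encryption that permits mathematical operations on data, the sketch below implements a toy version of the Paillier cryptosystem, a well-known partially (additively) homomorphic scheme. Whether Paillier is the scheme used in the health context described here is an assumption. The tiny primes and the function names are for demonstration only and provide no security.

```python
import math
import secrets

def keygen(p: int, q: int):
    """Build a toy Paillier key pair from two distinct primes, using g = n + 1."""
    n = p * q
    lam = (p - 1) * (q - 1) // math.gcd(p - 1, q - 1)  # lcm(p - 1, q - 1)
    mu = pow(lam, -1, n)  # modular inverse of lam mod n (Python 3.8+)
    return n, (lam, mu)

def encrypt(n: int, m: int) -> int:
    """Encrypt 0 <= m < n with fresh randomness r, so repeats look different."""
    n2 = n * n
    while True:
        r = secrets.randbelow(n - 1) + 1  # random value in 1..n-1
        if math.gcd(r, n) == 1:
            break
    return (pow(n + 1, m, n2) * pow(r, n, n2)) % n2

def decrypt(n: int, private_key, c: int) -> int:
    """Recover the plaintext from ciphertext c."""
    lam, mu = private_key
    x = pow(c, lam, n * n)
    return ((x - 1) // n * mu) % n

def add_encrypted(n: int, c1: int, c2: int) -> int:
    """Multiply ciphertexts to add the underlying plaintexts (mod n)."""
    return (c1 * c2) % (n * n)

# Toy parameters: real deployments use primes of 1024 bits or more.
n, private_key = keygen(1009, 1013)
encrypted_sum = add_encrypted(n, encrypt(n, 20), encrypt(n, 22))
total = decrypt(n, private_key, encrypted_sum)
```

The party computing `add_encrypted` never sees 20, 22, or their sum; only the holder of the private key can decrypt the result. The scheme supports addition only, which is what "partial" means in contrast to fully homomorphic encryption.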

<susanisrael> felix: also techniques like differential privacy, adding noise to data. questions whether data still useful, but also protects against some attribute disclosure:

<aleecia> My recollection is Jeff was alone at the time, perhaps one or two people with him at most, and the rest of the group either had the view you have, Shane, or came up with "we don't care, let's talk about something more interesting"

<susanisrael> jeff: with encryption or data modification the criticism of hashing is that if you have key or access you can get around, and same is true for other methods, for example keys

<susanisrael> felix: not wrt noise, which you can't figure out even if you know how noise was added
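
The point about noise can be illustrated with the Laplace mechanism from differential privacy, sketched here for a simple counting query. The function names and the epsilon value are assumptions chosen for the example. Knowing the noise distribution does not reveal the particular noise value drawn for any one release, although averaging many independent releases of the same count does converge on the true value, which is why real systems track a privacy budget.

```python
import random

def laplace_noise(scale: float, rng: random.Random) -> float:
    """Draw Laplace(0, scale) noise as the difference of two exponentials."""
    return rng.expovariate(1.0 / scale) - rng.expovariate(1.0 / scale)

def noisy_count(true_count: int, epsilon: float, rng: random.Random) -> float:
    """Release a count with Laplace noise calibrated to sensitivity 1."""
    # Adding or removing one person changes a count by at most 1,
    # so the noise scale is sensitivity / epsilon = 1 / epsilon.
    return true_count + laplace_noise(1.0 / epsilon, rng)

# Seeded generator for a repeatable demo; a real release would not fix the seed.
rng = random.Random(0)
single_release = noisy_count(100, epsilon=0.5, rng=rng)
```

A smaller epsilon means more noise and stronger protection, at the cost of less accurate counts, which matches the trade-off between privacy and data utility raised in the discussion.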

<susanisrael> ed: let's put off discussion on how it works

<susanisrael> david : interesting but jumping to solution without identifying problems

<susanisrael> felix: noticing that there is symmetry to this. many techniques improve privacy but limit value of data.

<justin> WileyS, at some point we'll have to go back and revisit that piece.

<susanisrael> ....homomorphic encryption does not preserve ability to do many things with data

<Wileys> Justin, we'll never finish this standard if we attempt to define what is "sensitive" in a global marketplace - good luck with that.

<susanisrael> felix: what use are we trying to preserve once data is de-identified. some uses will be preserved, others not

<aleecia> The geoIP part was well locked down, and then Ian rejoined and *did* have new information.

<susanisrael> jmayer: will postpone since postponing methodology discussion

<justin> WileyS, I am not arguing that we should.

<susanisrael> peter swire: thanks to khaled for coming and providing expertise. there was clear explanation of risk based approach used in other settings

<aleecia> We cannot bar geoIP since knowing where people are affects what to do if DNT is unset

<susanisrael> ...we also i think had some terminology gains in a lot of places. de-identified or de-linked are conclusion terms that apply once you have a standard, for example in hipaa....

<aleecia> So we were trying to find a way to say "fine, fine, just pick a large enough geography," and then were hung up in the details on what that means

<susanisrael> .....we also had variety of other terms about direct identifiers and quasi identifiers that will be helpful....

<susanisrael> ....heard interest in presentation for homomorphic encryption...

<susanisrael> ...also heard suggestion re: doing pieces of that one slide--what are harms, risks, people are concerned about, and

<jmayer> If we're going to discuss methodologies, differential privacy and privacy-preserving implementations should make the cut.

<susanisrael> ...in particular for online setting develop use cases we should care about if we are to get to homomorphic encryption.

<susanisrael> ....any other action items?

<susanisrael> ...if you have them after the meeting i welcome those. we are heading to f2f mtg, and want to make progress on this in advance...

<aleecia> thanks, Peter!

<susanisrael> ....thanks to cdt, khaled, all who came

<aleecia> and thanks Susan for scribing so much!

<aleecia> (you want public)

Summary of Action Items

[End of minutes]

Minutes formatted by David Booth's scribe.perl version 1.137 (CVS log)
$Date: 2013-01-17 17:30:47 $
