13:57:43 RRSAgent has joined #DNT 13:57:43 logging to http://www.w3.org/2013/01/17-DNT-irc 13:57:45 bryan has joined #dnt 13:58:09 Zakim, this will be 87225 13:58:09 ok, yianni; I see Team_(dnt)14:00Z scheduled to start in 2 minutes 13:58:14 dtauerbach has joined #dnt 13:58:32 peterswire has joined #dnt 13:58:38 aleecia has joined #dnt 13:58:51 JoeHallCDT has joined #DNT 13:59:18 zakim, code? 13:59:18 the conference code is 87225 (tel:+1.617.761.6200 sip:zakim@voip.w3.org), aleecia 13:59:54 jeffwilson has joined #dnt 14:00:06 When I dial in, I do not see myself in the IRC as dialed in.. 14:00:15 Rob, neither do I 14:00:17 justin has joined #dnt 14:00:17 Paul has joined #DNT 14:00:24 Possibly just slow? 14:00:48 But I'm guessing something is broken in the Zakim world 14:00:58 Wileys has joined #dnt 14:01:03 vincent has joined #dnt 14:01:10 W3C: fixing IRC bots and taking attendance since... 14:01:15 zakim appears to be a little sleepy 14:01:18 johnsimpson has joined #dnt 14:01:21 dwainberg has joined #dnt 14:01:24 14:01:32 BAU 14:01:38 hwest has joined #dnt 14:01:47 Getting ready to dial in. 14:02:00 johnsimpson_ has joined #dnt 14:02:04 Good morning 14:02:11 I planned to before I got sick 14:02:28 peterswire has joined #dnt 14:02:29 efelten_ has joined #dnt 14:02:30 Marc_ has joined #DNT 14:02:54 johnsimpson_ has joined #dnt 14:02:55 (someone is typing & needs to mute) 14:02:57 peterswire has joined #dnt 14:03:00 john 14:03:12 hi 14:03:12 testing IRC 14:03:27 Zakim, this is dnt 14:03:27 ok, yianni; that matches Team_(dnt)14:00Z 14:03:34 efelten_ has joined #dnt 14:03:54 joe is scribe… someone remind me how to tell Zakim that and to start notes 14:03:54 + +1.215.796.aadd 14:03:55 scribe: JoeHallCDT 14:03:58 Zakim, who is on the call? 
14:03:58 On the phone I see [GVoice], Jonathan_Mayer, +1.425.214.aaaa, Aleecia, +1.202.587.aabb, WileyS, ??P9, +1.631.803.aacc, rvaneijk, [CDT], +1.215.796.aadd 14:04:04 efelten_ has joined #dnt 14:04:05 johnsimpson has joined #dnt 14:04:05 present+ Bryan_Sullivan 14:04:33 zakim, aaaa is bryan 14:04:33 +bryan; got it 14:04:34 Peter Swire: goal is to discuss to what extent De-ID can remove data from scope of the standard 14:04:41 johnsimpson_ has joined #dnt 14:04:50 + +1.215.286.aaee 14:04:54 - +1.215.796.aadd 14:04:56 -??P9 14:04:59 … related: what sort of uses are consistent with compliance with the spec 14:05:05 efelten has joined #dnt 14:05:20 … if things are used for market research in ways that are entirely de-ID, that should be safe or out of scope 14:05:34 … on the other hand, if explicitly ID'd, standard should apply 14:05:40 +??P9 14:05:42 … clearly defining uses is crucial 14:05:44 peterswire_ has joined #dnt 14:05:57 … getting clear on terms, words and such is an important part of this 14:06:02 zakim, ??P9 is vincent 14:06:02 +vincent; got it 14:06:07 peterswire has joined #dnt 14:06:07 johnsimpson has joined #dnt 14:06:32 efelten_ has joined #dnt 14:06:38 johnsimpson has joined #dnt 14:06:46 … instead of having people talking past each other, we want a strong foundation of shared vocabulary 14:07:07 … delighted to have great people in the room and on the phone 14:07:12 q? 14:07:19 johnsimpson has joined #dnt 14:07:22 … agenda has been sent around 14:07:35 … ground rules for discussion 14:07:43 … this is not an official in-person meeting with 8 weeks notice 14:07:49 Zakim, who is on the call? 
14:07:49 On the phone I see [GVoice], Jonathan_Mayer, bryan, Aleecia, +1.202.587.aabb, WileyS, +1.631.803.aacc, rvaneijk, [CDT], +1.215.286.aaee, vincent 14:07:58 … have been told by w3c staff that this can't make decisions towards normative language 14:08:30 johnsimpson_ has joined #dnt 14:08:31 … it would be good to agree on terms and definitions 14:08:50 … this should make people more comfortable with claims made in the world 14:08:50 +Peder_Magee 14:08:56 If you share that information externally... 14:08:57 … e.g., unsalted hashes 14:09:18 peterswire_ has joined #dnt 14:09:26 johnsimpson_ has joined #dnt 14:09:30 Could introductions include technical background? It would be helpful to understand who'll be participating from the technical side and who'll be observing from the law/policy perspective. 14:09:42 might want to q that jmayer 14:09:50 … first thing is incentives to de-ID 14:09:58 Do we need to re-introduce ourselves? 14:10:06 johnsimpson has joined #dnt 14:10:31 … Khaled El Emam will start us off with slides (jlh: not sure how phone peeps will see them) 14:10:34 johnsimpson has joined #dnt 14:10:48 … then to hashing, persistent ids, putting people in "buckets" 14:10:52 please send slides to the list and/or post them on the wiki ! 14:11:08 … Yianni will gather qs 14:11:23 + +1.202.257.aaff 14:11:30 johnsimpson has joined #dnt 14:11:31 dwainber_ has joined #dnt 14:11:49 efelten_ has joined #dnt 14:11:53 … will go around the room, please let us know any technical experience 14:11:57 cannot hear 14:11:58 … Peter, law prof. 14:12:12 … Khaled works at U Toronto, CS background, working on health 14:12:22 efelten_ has joined #dnt 14:12:23 + +1.646.722.aagg 14:12:28 johnsimpson has joined #dnt 14:12:31 Dan Auerbach from EFF, worked at Google before doing data mining 14:12:33 Aturkel has joined #DNT 14:12:51 John Simpson, Consumer Watchdog 14:12:55 peterswire has joined #dnt 14:12:58 Ed Felten, Princeton U.
14:13:00 johnsimpson has joined #dnt 14:13:05 research and teaching for 18 years 14:13:17 Felix Wu, prof. at Cardozo, PhD in CS from Berkeley 14:13:21 mecallahan has joined #DNT 14:13:27 Peter invited Felix based on technical work 14:13:36 Paul Gliss, lawyer from Comcast, worked in De-ID space 14:13:46 efelten_ has joined #dnt 14:14:01 Chris Mejia, IAB, dir. of ad technology, tech dir. for DAA 14:14:04 johnsimpson has joined #dnt 14:14:10 Jeff Wilson, with AOL for 16 years 14:14:14 Marc Groman, NAI 14:14:26 David Wainberg, NAI, undergrad. in CS, web dev. for years 14:14:29 Heather West, Google 14:14:33 Justin Brookman, CDT 14:14:50 Bill Scanell, (probably a lawyer in a suit?) here to assist with communications 14:15:04 johnsimpson_ has joined #dnt 14:15:14 Peder Magee from FTC 14:15:31 Shane Wiley, Yahoo! 14:15:32 johnsimpson has joined #dnt 14:15:42 Mary Ellen Callahan, Jenner and Block 14:15:54 Aleecia McDonald, PhD engineering 14:16:04 Bryan Sullivan, AT&T Director of Service Standards, WAP/Web browsing service architecture and mobile/web standards for AT&T since pre-2000 14:16:05 Adam Turkel, lawyer with AppNexus 14:16:16 dwainberg has joined #dnt 14:16:16 Bryan (?), AT&T director of standards 14:16:27 johnsimpson has joined #dnt 14:16:27 peterswire has joined #dnt 14:16:30 dtauerbach has joined #dnt 14:16:36 Ho Chun Ho, Comcast, data arch. 14:16:56 peterswire_ has joined #dnt 14:16:59 AHanff has joined #dnt 14:17:04 Jonathan Mayer, PhD student in CS at Stanford, at Stanford Security Lab 14:17:07 johnsimpson_ has joined #dnt 14:17:40 efelten__ has joined #dnt 14:17:43 is there a call on now? 14:18:09 Rob van Eijk, PhD student at x, (very lengthy aff. and background) 14:18:10 Yes, we're on a call now 14:18:24 Vincent Toubiana, Alcatel Lucent, PhD CS 14:18:25 s/x/Leiden University/ 14:18:28 thanks I didn't see it on the icalendar 14:18:41 efelten_ has joined #dnt 14:18:42 aff: Art.
29 Data Protection Working Party / Dutch DPA 14:18:44 Jules P, from Future of Privacy Forum 14:19:26 scribe: yianni 14:19:31 Brooks has joined #dnt 14:19:32 +[IPcaller] 14:19:38 peterswire has joined #dnt 14:19:53 johnsimpson has joined #dnt 14:19:53 Peter: Getting logistics worked out, brainstorm reasons in advertising and online space 14:20:01 peterswire_ has joined #dnt 14:20:05 ...why people have incentives to de-identify 14:20:16 ...self interest, business, or other reasons 14:20:21 +Brooks 14:20:31 pedermagee has joined #DNT 14:20:36 ...if we understand reasons, we might be able to understand what things will be done in practice 14:20:51 johnsimpson_ has joined #dnt 14:20:54 ...privacy policy that says you do things in de-identified or anonymized ways 14:21:09 ...we do not use PII for certain operations, for example 14:21:13 johnsimpson_ has joined #dnt 14:21:22 ...risk for not following promises 14:22:10 Marc: people do not de-identify to avoid liability, they do it to mitigate privacy and security risk, then make the promise 14:22:12 johnsimpson has joined #dnt 14:22:12 efelten__ has joined #dnt 14:22:24 Paul: providing comfort to customers is a reason to de-identify 14:22:34 johnsimpson_ has joined #dnt 14:22:45 Peter: 2nd, organizations have costs from data breaches, states and Europe 14:22:47 efelten_ has joined #dnt 14:23:05 ...expense of sending out notice and going through steps of data breach, if de-id you do not have to disclose 14:23:06 Encrypted is different than de-identified 14:23:09 peterswire has joined #dnt 14:23:16 johnsimpson has joined #dnt 14:23:31 Jules: big driver, beginning of NAI, big ad networks and crisis around it 14:23:38 peterswire has joined #dnt 14:23:40 In my experience, companies that say they only work with anonymous data mean it in the Latin sense -- literally without name. They do not mean that users are unidentifiable. I think we need to be very careful to keep these ideas separate.
14:24:03 +q 14:24:06 ...NAI treated PII and non PII very differently, representing in privacy policy that you tracked PII, you could make notice in opt-out notice 14:24:14 efelten__ has joined #dnt 14:24:21 ...in PII, need more notice on web page, perhaps an opt-in 14:24:50 johnsimpson_ has joined #dnt 14:24:50 ... 7 large networks adopted, and forced other partners to follow 14:25:20 ...huge driver for ad networks that they make a specific representation of PII and non PII 14:25:32 Peter: are there other legal regimes for de-id? 14:25:33 efelten_ has joined #dnt 14:25:37 Rob, could you briefly address EU law? 14:25:55 johnsimpson_ has joined #dnt 14:25:58 Paul: regulatory treatment that is different for cable, services provided by cable providers 14:26:10 ...makes distinction between personally identified and not identified 14:26:21 Peter - are you suggesting if data is not linked to PII then it is "de-identified"? 14:26:23 peterswire has joined #dnt 14:26:26 ...much like NAI, different rules for consent and approval 14:26:47 peterswire_ has joined #dnt 14:26:52 efelten_ has joined #dnt 14:26:56 johnsimpson has joined #dnt 14:26:57 robsherman has joined #dnt 14:27:15 Marc: data security issues, beyond financial issues, reputational risk is a very large piece of it as well 14:27:53 ...privacy incident, costs are much higher than outside counsel and regulatory burdens, for many years people talk about the x company incident 14:27:57 Shane, I think the question is whether "is" includes "can be", i.e. data not linked vs non-linkable is by definition non-PII 14:28:16 Peter: NAI, Cable Act, also have HIPAA, GLBA 14:28:30 ...if you are outside regime, you do not have regulatory burden 14:28:49 robsherman1 has joined #dnt 14:28:49 Shane - I think it's abundantly clear that no PII is not the same as non-identifiable (see Paul Ohm's summary paper) but I understand you're asking for Peter's view, which I do not know.
14:28:57 Marc: Privacy Act, privacy impact assessment depends on whether you have individually identifiable information 14:29:24 Peter: inside an organization, you have incentives of access controls, more people can touch it if not PII 14:29:29 Bryan, that's my question - is it an absolute position? I've always felt de-identified was "more" than simply not PII. 14:29:35 efelten__ has joined #dnt 14:29:35 Aleecia - see above :-) 14:29:54 ...data base with financial information, many reasons for access control limits 14:30:00 peterswire has joined #dnt 14:30:12 ...for other employees there is a risk of breach if you do not de-identify 14:30:14 efelten_ has joined #dnt 14:30:32 johnsimpson_ has joined #dnt 14:30:39 efelten_ has joined #dnt 14:30:40 Khaled: opt-in consent or opt-out, evidence in health care sector for consent bias 14:30:55 ...de-identification allows you to avoid consent bias 14:31:03 johnsimpson has joined #dnt 14:31:06 efelten_ has joined #dnt 14:31:13 PII/Personal Data -> Pseudo/Anonymous -> De-Identified/Unlinkable -> No Value 14:31:30 any kind of analytics is very far stretched... 14:31:32 johnsimpson has joined #dnt 14:31:35 Khaled: Beyond researchers, goes to analytics (biased data because you are missing a certain percent of population) 14:31:57 Peter: having full population better for the researchers, De-ID is a tool to get accurate analytics 14:31:58 johnsimpson has joined #dnt 14:32:09 ...Any other comments on reasons why people do de-identification? 14:32:32 Shane - I can imagine a dataset that removes PII and is also then not re-identifiable. But that's not a general rule. It's probably easier to talk about the type of data we're using. Removing PII is not going to render a server log file "safe," and indeed there might never be PII in the first place, yet still have identifiable data.
14:32:43 ...reasons for people to do this, trying to understand the terminology 14:32:46 RichLaBarca has joined #DNT 14:32:53 johnsimpson has joined #dnt 14:33:00 ...Khaled has a book on de-id coming out at the beginning of April 14:33:12 Are slides available now? 14:33:12 efelten_ has joined #dnt 14:33:12 ...Khaled starting with part 2 and his slides 14:33:20 Shane, to be clear I was not stating a position, but a question. IMO identity includes a range of attributes only some of which are personal - remove/obscure the personal ones and you're home - science will always find new ways to relink and attribute data to persons, and we should not be trying to chase that rabbit 14:33:21 peterswire_ has joined #dnt 14:33:24 Slides have not come through on email yet!!! 14:33:30 johnsimpson has joined #dnt 14:33:40 yes, 14:33:41 I sent ten minutes ago, will resend. 14:33:42 difficult 14:33:48 thank you Shane 14:33:52 peterswire has joined #dnt 14:33:52 Also, lots of paper shuffling etc. 14:33:55 Khaled: walking through process of de-identification 14:34:14 johnsimpson_ has joined #dnt 14:34:34 um. 14:34:39 johnsimpson has joined #dnt 14:34:39 sounds off now 14:34:42 efelten_ has joined #dnt 14:34:58 Khaled: walk through de-identification we have been using, context will be healthcare 14:35:10 johnsimpson has joined #dnt 14:35:23 ...agree on terminology and general approach to terminology 14:35:35 ...basic process they have used is five steps 14:35:40 Bryan, I'm mostly with you there. The key element is what is defined as "personal"... 14:35:48 ...assume we have health data set and want to release for secondary purpose 14:35:52 robsherman has joined #dnt 14:35:55 ...first step understand plausible attacks 14:36:00 johnsimpson_ has joined #dnt 14:36:03 efelten_ has joined #dnt 14:36:04 Where are these five steps sourced from?
14:36:07 vinay has joined #dnt 14:36:07 ...second, understand variables that can be used 14:36:08 + +1.917.934.aahh 14:36:13 zakim, aahh is vinay 14:36:13 +vinay; got it 14:36:19 ...measure risks, apply de-identification 14:36:31 ...Assume a public release or releasing to a known data recipient 14:36:34 efelten_ has joined #dnt 14:36:37 johnsimpson has joined #dnt 14:36:39 Put your email in chat if you want the slides. 14:36:43 In absence of the slides, can someone copy/paste the slide content into IRC? 14:36:50 wileys@yahoo-inc.com 14:36:51 aleecia@aleecia.com 14:36:53 ...very different analysis, public has no controls, known recipient you can have controls and contracts 14:37:04 vigoel@adobe.com 14:37:07 a.hanff@think-privacy.com 14:37:10 johnsimpson has joined #dnt 14:37:17 ...For known data recipient, you have three attacks 14:37:19 vincent.toubiana@alcatel-lucent.com 14:37:25 Chris: what type of attack? 14:37:28 are we allowed to comment? 14:37:29 ed@felten.com 14:37:34 rich@addthis.com please 14:37:43 Khaled: re-identification attack 14:37:48 Slides answered, thanks. 14:37:55 got the slides, thanks 14:38:05 so can we ask questions? 14:38:07 q+ 14:38:08 q? 14:38:10 q? 14:38:17 If you have questions, please queue yourself; I'll monitor the queue 14:38:21 ack marc_ 14:38:24 ack robsherman 14:38:25 Thank you Heather! 14:38:27 q+ 14:38:49 (Reminder: to put yourself in the queue, just type q+) 14:38:54 johnsimpson has joined #dnt 14:38:57 Rob: information that is not being disclosed, storing information to make it de-identified, not planning to disclose? 14:39:16 ack AHanff 14:39:22 +q 14:39:23 typing 14:39:30 I am typing lol 14:39:31 Khaled: go through same steps if you release to data recipient or internally 14:39:35 AHanff, are you just on irc?
14:39:44 Go ahead and type your question and I'll convey 14:39:45 q+ 14:39:46 no I am on phone too but not on headset 14:40:06 q+ 14:40:09 ack wileys 14:40:12 peterswire_ has joined #dnt 14:40:13 Shane: not mandating from a HIPAA perspective to de-identify, just from a risk management perspective, you would go through same process 14:40:17 Slides went to list finally, available here: http://lists.w3.org/Archives/Public/public-tracking/2013Jan/0062.html 14:40:17 johnsimpson_ has joined #dnt 14:40:18 robsherman1 has joined #dnt 14:40:28 Thank you Justin 14:40:29 q? 14:40:36 Khaled: contract, allow vendor to continue using the data, need to keep in de-identified manner 14:40:47 peterswire has joined #dnt 14:40:58 AHanff, go ahead and type question 14:41:05 Peter: HIPAA puts limits on data uses even internally 14:41:05 I would just like Khaled to acknowledge that known recipient doesn't guarantee confidentiality even with contractual obligations. For example, I read recently that something like 90% of US medical authorities had data leaks in 2012, presumably contracts were in place... 14:41:24 Dan: clarifying, de-identification is a property of data?
14:41:30 ...It is not a process 14:41:37 johnsimpson_ has joined #dnt 14:41:49 Khaled: in practice you manage the risk of re-identification, de-identification is one tool in the tool box 14:41:49 efelten__ has joined #dnt 14:41:50 AHanff, feel free to share running comments as the presentation proceeds - they go in the record as well 14:41:56 thanks 14:42:14 johnsimpson_ has joined #dnt 14:42:20 q+ 14:42:24 efelten_ has joined #dnt 14:42:25 ack hwest 14:42:28 Khaled: deliberate re-identification by data recipient, if company signs a contract, as a corporation that company will not try to re-identify 14:42:28 ack David_MacMillan 14:42:36 ack dtauerbach 14:42:44 q+ 14:42:49 robsherman has joined #dnt 14:42:50 ...there may be rogue employees, but probability of company re-identifying would be acceptably low 14:42:54 efelten__ has joined #dnt 14:43:02 the evidence would suggest otherwise with so many data leaks surely? 14:43:05 ...contracts are a good risk mitigating activity for first attack 14:43:09 I am aware of the q; will be calling on them soon 14:43:23 @AHanff, if you have a citation on the 90% figure, would you be so kind as to add that to the wiki? 14:43:27 ...rogue employee re-identifying an ex spouse for example is dependent on internal company controls 14:43:37 I will try and find it yes 14:43:48 peterswire has joined #dnt 14:43:48 ...first attack, as a company would you do it, do you have controls for rogue employees 14:43:51 robsherman1 has joined #dnt 14:43:52 Thanks, that's higher than I'd heard 14:43:54 efelten_ has joined #dnt 14:44:05 Peter: this is a risk management approach 14:44:14 johnsimpson_ has joined #dnt 14:44:16 peterswire has joined #dnt 14:44:39 Khaled: most recent guidance of HHS is a risk management approach, UK Commissions also talk about risk management and context based 14:44:51 q?
14:44:52 peterswire_ has joined #dnt 14:44:54 ...regulators approaching as a risk management exercise 14:44:57 ack dwainberg 14:45:02 johnsimpson has joined #dnt 14:45:20 David: De-ID is not a binary state, it is rather a description of lower risk (Khaled probability) 14:45:30 efelten__ has joined #dnt 14:45:30 peterswire_ has joined #dnt 14:45:48 Khaled: de-identification has been practiced for last 20 years, CDC, CMS, set thresholds along a continuum 14:45:55 ...that is context dependent 14:46:12 johnsimpson_ has joined #dnt 14:46:13 aleecia, it was a Ponemon study, there is an article here on it (will add to wiki) http://www2.idexpertscorp.com/press/report-94-of-us-hospitals-suffered-data-breaches-and-45-had-quintuplets/ 14:46:13 David: helpful to talk about de-identification as a process and something else as an end goal? 14:46:30 Dan: still fair to say de-identification is a property of data 14:46:37 + +1.646.654.aaii 14:46:47 David: functional definition of de-identification is a function of the context, could be 20 different forms 14:46:57 efelten_ has joined #dnt 14:47:01 schunter has joined #dnt 14:47:03 robsherman has joined #dnt 14:47:08 Khaled: can be multiple de-id versions for the same data base, public versus trusted party 14:47:39 Peter: binary de-identified or not? Under HHS, counts as de-identified if overall risk is low. 14:47:57 johnsimpson has joined #dnt 14:48:05 peterswire has joined #dnt 14:48:15 Khaled: once you have a spectrum, and cut off in the middle, you turn it into a binary decision 14:48:29 Peter: de-identified is a conclusion term under some regime under some set of facts 14:48:30 but the thresholds are not static, they move constantly depending on the amount of data aggregated about an individual 14:48:36 peterswire has joined #dnt 14:48:38 johnsimpson_ has joined #dnt 14:48:47 ...yes it is de-identified or no it is not, along the way there is a risk management regime 14:49:05 q?
14:49:05 ...de-identified right now is a conclusion term for a regime, we do not have that standard right now in dnt 14:49:13 johnsimpson has joined #dnt 14:49:15 ...does anyone else see it differently? 14:49:21 Zakim, q? 14:49:21 I see jmayer on the speaker queue 14:49:33 Jeff: more accurate to say a de-identified data set has been de-identified to a degree 14:49:44 Peter: more or less risk for re-identification 14:49:55 johnsimpson has joined #dnt 14:50:05 q? 14:50:16 dwainber_ has joined #dnt 14:50:17 Thank you kindly, Alan. Report (rather than press coverage) available from: http://www2.idexpertscorp.com/ponemon2012/ 14:50:18 David: disagree what is identified in the first place, what's de-identified and when, we will have disagreement 14:50:36 johnsimpson_ has joined #dnt 14:50:47 Ed: In a given setting, you can ideally establish some scientific basis that risk is some amount, you have a spectrum of risk 14:50:56 ...then you are required to be somewhere on the spectrum 14:50:57 I think it is important to note that there are no specific types of data which can guarantee non-re-identification, in fact it is never possible to guarantee non re-identification. Data minimisation can make it less likely, but the way these systems work is the data is always increasing not decreasing, which means the risk is continually increasing as the data resolution increases...
14:51:14 ...starting point, scientific basis that data can be exploited with a certain probability 14:51:17 johnsimpson has joined #dnt 14:51:28 efelten__ has joined #dnt 14:51:34 Ed: risk analysis based on sound scientific analysis, not based on what you have done in the past 14:51:46 Chris: process of de-identification, and de-identified data 14:51:54 johnsimpson has joined #dnt 14:52:21 peterswire_ has joined #dnt 14:52:21 johnsimpson_ has joined #dnt 14:52:25 Peter: defining what counts as de-identified sounds like normative stuff we are not agreeing on today, we are trying to develop language and ways to talk about things to have that conversation 14:52:42 Chris: we do not know the degree, we just know de-id is a thing, so let's talk about good practice 14:52:54 q? 14:53:08 johnsimpson_ has joined #dnt 14:53:26 Paul: once you accept risk, then need to put tools on the table, what are the general uses 14:53:35 ...then have conversation of what is an acceptable level of risk 14:53:37 I agree with Ed. The goal is relevant. Wanting to use the data for aggregation is different than trying to accomplish unlinkability 14:53:37 q? 14:53:38 johnsimpson has joined #dnt 14:53:48 Chris_IAB has joined #dnt 14:53:53 ack jmayer 14:53:54 AHanff -> I disagree, there are levels of de-identification/minimization that guarantee non-re-identification. For example, highly aggregated data sets or highly sparse raw data can both guarantee non-re-identification.
14:54:14 johnsimpson_ has joined #dnt 14:54:16 efelten_ has joined #dnt 14:54:16 Jonathan: stick to substance, universe of attack slide, third bullet point 14:54:27 Wiley, show me the evidence to support that and I will show you a very famous event which shoots it down :) 14:54:46 efelten__ has joined #dnt 14:54:48 ...reasonably say that risk of some sort of data breach is a lot greater if you leave the data on the street than if only the CEO can see it under contract 14:54:53 peterswire has joined #dnt 14:55:01 ...risk is much greater in former, shades of grey are the hard part 14:55:07 3 people in the world viewed Yahoo.com at a specific moment in time yesterday - please tell me who those people are? 14:55:25 Have fun AHanff (that's an example of a highly aggregated result) 14:55:31 peterswire has joined #dnt 14:55:34 ...very fact specific things, where real world challenges lie, can we reasonably estimate these sorts of attacks: being hacked, laptop out, rogue employee 14:55:37 johnsimpson has joined #dnt 14:55:48 ...if you can predict crime, we all have a much better use of time 14:55:51 I don't think we need to argue about whether really-really-really-really hard to reidentify is technically impossible to reidentify. For purposes of this group, whatever you call that, it will suffice to constitute de-identified data.
14:55:59 Khaled: not predicting crime, but good approaches to manage risk 14:56:08 Wiley, I am glad you chose a search engine, I refer you to the AOL search data which was used to identify anonymous users within 24 hours of being released for "research purposes" 14:56:15 ...develop a series of checklists to evaluate point of disclosure 14:56:19 robsherman1 has joined #dnt 14:56:22 johnsimpson_ has joined #dnt 14:56:24 ...at the end of day, probabilities can be assigned 14:56:28 far more anonymised than the data Yahoo has in their logs I should add :) 14:56:28 Thank you Justin - I agree that arguing absolutes in this case is not helpful - that was my point. :-) 14:56:32 Justin - I think that's part of the question at hand 14:56:49 AHanff - completed apple / orange comparison 14:56:52 ...based in part on subjective estimates, but mixtures of different things 14:56:53 completely 14:56:58 no it isn't 14:56:59 The AOL mess was *not* data aggregation 14:57:02 johnsimpson has joined #dnt 14:57:13 ...the overall answer is that you can do it in a defensible way 14:57:16 The question at hand is how many "reallys" you need in front of "hard to reidentify" 14:57:17 Shane is right on this one. The AOL mess was replacing one unique id with another. 14:57:18 - +1.646.654.aaii 14:57:21 felixwu has joined #DNT 14:57:38 AHanff - AOL was row level specific data with consistent unique identifiers - my example was a highly aggregated result.
Not the same 14:57:43 efelten_ has joined #dnt 14:57:49 3 people visiting Yahoo yesterday at specific time is not data aggregation either, server logs (probably replicated multiple times for backups across their distributed network) provide very exact data 14:57:51 - +1.202.587.aabb 14:57:55 Khaled: deliberate re-id, inadvertent - recognize someone they know (a relative) 14:57:58 robsherman has joined #dnt 14:58:09 ...in health care setting, can measure probability that someone knows someone in the database 14:58:22 johnsimpson_ has joined #dnt 14:58:27 q? 14:58:29 Mike_Nolet has joined #dnt 14:58:29 peterswire has joined #dnt 14:58:46 ...Ex. breast cancer, we know the prevalence of breast cancer and average number of friends, we can estimate the chance of inadvertent re-identification 14:58:55 peterswire has joined #dnt 14:58:55 johnsimpson_ has joined #dnt 14:58:58 robsherman1 has joined #dnt 14:59:13 ...Data breach, organization that loses data, we know that 27% of health care providers have one breach per year 14:59:23 So wait: 27%, or 94%? 14:59:29 ...there are bigger and smaller numbers, but 27% is the most defensible number 14:59:39 efelten__ has joined #dnt 14:59:41 johnsimpson has joined #dnt 14:59:48 That's a rather large change of inputs here 14:59:49 q+ 14:59:56 ...we can use the 27% number to assign probability 14:59:58 What does breach have to do with de-identification? Those breaches are of purposely non-de-identified data. 15:00:00 But not our problem, actually 15:00:15 ...demonstration attack - adversary wants to make a point, targeting high risk person 15:00:19 efelten_ has joined #dnt 15:00:21 johnsimpson has joined #dnt 15:00:22 ...all you have to do is identify one person 15:00:26 +1 to Aleecia 15:00:44 johnsimpson_ has joined #dnt 15:00:49 I see jonathan; will call on soon 15:01:11 Khaled: Directly identifying variables, are the fields in HIPAA 15:01:16 What I've learned: HIPAA's a mess.
:-) But we may be able to find useful parts of HIPAA anyway as we sift through this, and it's useful to see what came before. 15:01:22 efelten_ has joined #dnt 15:01:27 q+ 15:01:39 johnsimpson_ has joined #dnt 15:01:45 q? 15:01:48 Peter: people may disagree on what is directly identifying and what is a quasi-identifier 15:01:55 Khaled: can be different based on context 15:02:10 peterswire_ has joined #dnt 15:02:10 ...with names remove the names, randomize, generate pseudonyms 15:02:22 q? 15:02:24 q? 15:02:40 Shane -- I realize I don't know what problem you're trying to solve in your dataset. When you talk about not destroying the value, what value is it you're trying to preserve? 15:02:41 +1 to generating pseudonyms as acceptable de-identification practice :-) 15:02:43 johnsimpson has joined #dnt 15:03:04 Chris: quasi-identifiers, how about ranges, someone fits within a date range, or geo location? Address in HIPAA 15:03:13 efelten__ has joined #dnt 15:03:16 Aleecia - typically longitudinal analytical/research value 15:03:21 Khaled: HIPAA safe harbor, dates converted to years 15:03:42 e.g., it's useful to know that a particular user went to Y!, then FB, then ESPN, etc. 15:03:48 efelten_ has joined #dnt 15:03:48 ...when you convert to ranges, you go to expert, you could potentially go to quarter of year or increase to 10 years 15:03:52 q- 15:03:56 q? 15:04:01 johnsimpson_ has joined #dnt 15:04:02 Aleecia - You've already heard this conversation play out between Ed and I (and a few others) on the public email list. :-) 15:04:23 johnsimpson_ has joined #dnt 15:04:34 Yes, I've heard and read more than I care to :-) But I couldn't remember what value you were looking for, just the disagreements 15:04:36 Khaled: if you are doing analytics treat as quasi identifiers, ex.
software testing, you cannot get rid of fields, you just randomize 15:04:38 my question isn't on direct identifiers 15:04:54 q+ 15:04:55 my question is on the 27% figure 15:05:01 peterswire has joined #dnt 15:05:02 johnsimpson_ has joined #dnt 15:05:10 Aleecia - industry participants have never explained the value they hope to achieve in detail. It's one of the reasons we haven't made progress. 15:05:29 Khaled: in Ontario there are 220 John Smiths, people have common names. 15:05:31 peterswire has joined #dnt 15:05:35 Aleecia, outside of permitted uses, the core value sought is analytical (be able to learn and make changes). 15:05:37 johnsimpson has joined #dnt 15:05:42 Ed: In practice every variable is a quasi identifier? 15:05:49 Jonathan, I thought we had - not sure what more you're looking for. 15:05:52 efelten__ has joined #dnt 15:06:01 Khaled: no not really 15:06:10 ...example, blood pressure 15:06:12 And you're likely to have a question now that can be answered from data 5 years ago? 2 years ago? 15:06:15 would like to bridge the quasi identifier to the EU perspective... (queue) 15:06:26 efelten_ has joined #dnt 15:06:26 Ed: blood pressure is better than gender 15:06:29 My concern is that your answer there is you don't know 15:06:38 Khaled: what is the chance of adversary knowing your blood pressure 15:06:38 Because, you likely cannot 15:06:51 johnsimpson has joined #dnt 15:06:57 efelten_ has joined #dnt 15:06:58 Ed: the odds my provider will know my blood pressure are high 15:07:00 robsherman has joined #dnt 15:07:17 Aleecia - some researchers at Yahoo! find tremendous value in long-term data as an indicator for near-term data - interesting learnings and value there. 15:07:23 johnsimpson_ has joined #dnt 15:07:30 Khaled: hospital can look at it, and there are different controls to stop re-identification 15:07:57 johnsimpson_ has joined #dnt 15:07:58 Peter: how likely is it that someone on the outside has access to that information and how likely is it to be a match?
15:08:05 robsherman has joined #dnt 15:08:14 Aleecia - a simple example is spelling correction - due to the long tail of possible searches it can take many years to build enough data to predict outcomes for rare terms. 15:08:18 is anyone monitoring the queue? 15:08:29 Ed: re-identification is connecting individual to information 15:08:41 I'm sure there is. But if you pull back to a very simple view, you're suggesting that users ask for more privacy, Y! says they will provide more privacy, and then you will retain and study that user. That's a hard thing to explain to a user who just wants to be left alone. 15:08:41 Rob, Peter said in IRC that he'd be coming to the queue soon but that was quite awhile ago 15:08:47 johnsimpson has joined #dnt 15:08:49 Khaled: all laws protects identify disclosure, no laws protect attribute disclosure 15:09:10 peterswire_ has joined #dnt 15:09:17 johnsimpson_ has joined #dnt 15:09:24 efelten__ has joined #dnt 15:09:44 ...If I release data set and you get attribute disclosure, laws do not prohibit, its just statistics 15:09:45 peterswire_ has joined #dnt 15:09:48 Wileys, with the spelling correction example, high level aggregation and short term retention are not enough? 15:09:50 q+ 15:09:54 Aleecia, I'd argue that once the data is deidentified that user is being left alone - we're now just using an unlinkable data point to improve our services. What are our rights in providing the free service? The most paranoid users need not use our services if we fairly call out that we use data in this way. Fair? 15:09:56 efelten_ has joined #dnt 15:09:59 johnsimpson has joined #dnt 15:10:01 The spelling example is a nice one, thanks. I'm sure there are many, many others. I just don't know how to get you what you want while still actually honoring DNT 15:10:05 . . .Different governance mechanisms to manage attribute disclosure, but not what we are talking about today 15:10:23 WileyS, not sure that's the best example. 
That's first party data that can be stripped of identifiers immediately without significantly diminishing value (like Google Flu Trends). 15:10:24 johnsimpson has joined #dnt 15:10:24 Ed: arguably the most important aspect of privacy disclosure is not even covered? 15:10:43 Vincent, not short-term retention (not enough volume on rare terms) - but data minimization and de-identification do accomplish the risk minimization goal 15:10:49 schunter has joined #dnt 15:10:52 Khaled: cannot predice inferences of data sets, but the more you control attribute disclosure you destroy data utility, best to manage with governance 15:10:55 johnsimpson has joined #dnt 15:10:56 Wileys - no absolutely not fair - first of all what right do you have to label privacy aware users as paranoid - secondly, are you therefore saying people who value privacy should be excluded from digital society? 15:11:10 Justin, agreed - for that use case, that's a great de-identification approach. 15:11:17 Peter: direct identifiers (phone numbers), quasi identifiers (people on outside can make guesses) 15:11:30 q? 15:11:38 johnsimpson_ has joined #dnt 15:11:38 robsherman1 has joined #dnt 15:11:42 I'm pretty sure that saying "we're honoring your request for privacy, but we're still logging everything you did and using it" isn't what users will consider fair. Which, to be clear, matters a lot more than what I think is fair. 15:11:44 q? 15:11:47 Justin, you do need to keep a few data elements around to help provide context (language, country of search, etc.) 
15:12:00 q- 15:12:03 ...Third thing, attribute disclosure 15:12:09 I see the q 15:12:20 Aleecia, I believe the de-identification removes the "you" in 'everything you did' in your statement 15:12:30 peterswire has joined #dnt 15:12:32 johnsimpson_ has joined #dnt 15:12:46 what you believe is not what regulators and the general public believe, which I think is Aleecia's point 15:12:47 Which is where you and Ed have gone many rounds, and I do disagree with your conclusions there. 15:12:48 Ed: list of hundred records and I know one is yours, and all have that diagnosis, I know the attribute without actually identifying 15:12:57 peterswire has joined #dnt 15:12:57 WileyS, Right, that seems fair, but the re-ID risk seems almost impossibly low. 15:12:58 Joe: that's 100%, others are fuzzier 15:13:01 attribute disclosure as an important distinction says ed felten 15:13:11 Ed: are we trying to protect against attribute disclosure? 15:13:19 johnsimpson_ has joined #dnt 15:13:35 Justin - agreed, for that use case - many other use cases aren't as clean cut - that's why it's a good point to start there and go deeper. 15:13:36 Khaled: precedent in research world for attribute disclosure: IRB 15:13:40 I do agree that there are ways to do aggregation to a level as to remove the "you." I do not think that replacing one unique identifier with another unique identifier (hashing) is going to remove the "you" 15:13:50 ...restricts how you do studies, committee oversees 15:13:59 johnsimpson has joined #dnt 15:14:03 q- 15:14:05 AHanff, could you please source your position? Regulator and general public studies? 15:14:09 ...how mechanism to agree on type of inferences you will permit, certain things would be off limits 15:14:16 Wileys, I thought Yahoo removes rare terms anyway? are there examples where yahoo is actually a third party? 15:14:28 Joe: risks to population of inference versus benefits?
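Ed's hundred-records point can be sketched in a few lines of Python. This is only an illustration under assumptions that are not in the discussion: a hypothetical table in which each record carries a quasi-identifier cell label (`cell`) and a sensitive attribute (`diagnosis`). The sketch counts the distinct sensitive values per cell; a cell with a single value discloses the attribute for everyone in it, however many records the cell holds.

```python
from collections import defaultdict


def distinct_sensitive_values(records, cell_key="cell", sensitive_key="diagnosis"):
    """Count distinct sensitive values in each quasi-identifier cell.

    The field names are hypothetical and chosen only for this illustration.
    """
    values_by_cell = defaultdict(set)
    for record in records:
        values_by_cell[record[cell_key]].add(record[sensitive_key])
    return {cell: len(values) for cell, values in values_by_cell.items()}


def disclosed_cells(records, cell_key="cell", sensitive_key="diagnosis"):
    """Return cells where every record shares one sensitive value.

    Knowing only that a person is somewhere in such a cell reveals the
    attribute, even though no individual record has been identified.
    """
    counts = distinct_sensitive_values(records, cell_key, sensitive_key)
    return sorted(cell for cell, n in counts.items() if n == 1)
```

A cell of 100 records that all share one diagnosis is large in the cell-size sense and still ends up in `disclosed_cells`, which is the distinction between identity disclosure and attribute disclosure being drawn above.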
15:14:50 wileys, regulators, a29wp, eu commission, eu parliamentarians, members of public all people I have worked with and discussed these issues with over the past 6 years 15:14:51 Aleecia, as long as there is no way back to the original user, then I believe the desired outcome has been met (no more 'you') 15:14:59 robsherman has joined #dnt 15:15:03 johnsimpson_ has joined #dnt 15:15:03 KHaled: no legislative requirement to worry about attribute disclosure 15:15:05 except you of course :) 15:15:21 AHanff, very much an area of active disagreement - I agree that one extreme side of that debate equates to your position 15:15:30 Felix: We are concern about inferences of large number of people, but that is different than inferences about one particular person 15:15:40 robsherman1 has joined #dnt 15:15:40 efelten__ has joined #dnt 15:15:40 johnsimpson has joined #dnt 15:15:42 person is in the group, and can draw inference about them -- attribute disclosure 15:15:46 Khaled: can draw inferences about group memberships, and you belong to that group 15:15:53 Vincent, Yahoo! runs one of the largest 3rd party ad networks on the internet :-) 15:16:07 well absolutely every person I have ever discussed these issues with apart from advertisers, is in that "extreme" - which would suggest that the extreme is actually your segment not mine ;) 15:16:13 efelten_ has joined #dnt 15:16:14 peterswire has joined #dnt 15:16:28 Felix: IRB - mitigates discriminating against large group, not concern about attribute disclosure to specific individual, even if group is not senstive 15:16:41 q? 15:16:50 Khaled: depends on type of study and what harm that can happen to those individuals or at the group level 15:16:58 AHanff - disagree - if everyone agreed with you then no one would be using online service supported by 3rd party advertising 15:17:00 johnsimpson has joined #dnt 15:17:00 robsherman has joined #dnt 15:17:07 Dan: Quasi-identifiers: why is not everything a quasi identifiers? 
15:17:19 efelten__ has joined #dnt 15:17:26 johnsimpson has joined #dnt 15:17:27 Khaled: have to take into account probability that adversary will have information, some fields there are no probable path to get that information 15:17:29 Shane - one of the evolutions we're watching is going from "we need to identify a user by name" as what counts for a "you" to "we need to be able to distinguish a single person" such that a GUID counts for a "you" 15:17:37 Wiley's that is a completely invalid response - the VAST majority of digital citizens have no idea that any of this is going on and when they find out, they are outraged 15:17:42 ...has to be information that is generally available 15:17:45 there are countless examples to support that 15:17:46 swapping one GUID for another doesn't actually advance privacy 15:17:53 that's not fair - 15:17:56 Wileys, glade to hear :) but how is that related to my question? I was asking for examples of analytical/research that need pseudonymous data and where yahoo is involved as a third party, not a search engine 15:18:01 doesn't advance it by much. 15:18:19 Aleecia - GUID goes one step further than I'm suggesting as that implies it is still "linkable" in a production system. 15:18:27 Mike: What about the practical, how difficult is that inference? (large number of records) 15:18:38 johnsimpson_ has joined #dnt 15:18:42 efelten has joined #dnt 15:18:47 Vincent, anything and everything to do with being a better ad network. 15:18:52 That's what I was just correcting. I agree, there is a minor improvement there, but not enough as to practically matter much. 15:18:54 Khaled: depends on fields you have in data base, and how accurate would the inference be, never count against statistics 15:19:03 q? 
15:19:16 ...attribute disclosure has to be managed, cannot do so technically without destroying data 15:19:25 AHanff, please reference studies of consumer "outrage" 15:19:27 ack dtauerbach 15:19:31 ...need to have different oversight, evidence so far that is what works 15:19:42 hwest has joined #dnt 15:19:43 q? 15:19:51 peterswire has joined #dnt 15:19:56 hwest has joined #dnt 15:20:09 ...In practice, you do not get all of the fields in data bases (focus on 6-10 fields), for longitudnal data, repeated over multiple visits 15:20:28 ...surveys are more complicated, can deal with database with 100 quasi-identifiers 15:20:28 johnsimpson_ has joined #dnt 15:20:34 Shane - let me do a thought experiment. I think we agree that if I got my hands on the raw server logs at Y! that would contain a set of "you"s, and not be non-identified. 15:20:36 Dan: only need to know one things 15:20:44 Wileys I don't need too, they are there in the public eye - instagram, path, phorm, nebuad, facebook etc etc etc 15:20:49 there is a new one just about every week 15:20:58 Khaled: chance of adversary knowing 5 things or 10 things, chance they know all 100 is very low 15:21:07 johnsimpson has joined #dnt 15:21:42 johnsimpson_ has joined #dnt 15:21:45 ...choose a number that is defensable (unlikely to know 30 fields) 15:22:16 Aleecia, depends - if you're suggesting a de-identified data set, you'd find a one-way secret hashed identifier that has been truncated by 50% to purposely create noise (salt). So there is "an" identifier there - but it links to nothing in production systems. 15:22:44 peterswire_ has joined #dnt 15:22:44 AHanff - thank you for the conversation, I have a good sense of your perspective and ability to defend your statements now. 
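A rough Python sketch of the kind of identifier Shane describes (a one-way secret hash, truncated by half). Everything specific here is an assumption for illustration and is not confirmed in the discussion: HMAC-SHA256 as the keyed hash, the key handling, and truncating the hex digest to half its length. How much "noise" truncation adds depends on how short the retained output is, a detail the log does not give.

```python
import hashlib
import hmac


def deidentified_id(production_id: str, secret_key: bytes) -> str:
    """Map a production identifier to a truncated keyed-hash value.

    Assumptions (not from the log): HMAC-SHA256 is the keyed hash and the
    hex digest is cut to half its length.
    """
    digest = hmac.new(secret_key, production_id.encode("utf-8"), hashlib.sha256).hexdigest()
    return digest[: len(digest) // 2]
```

The mapping is deterministic: the same production identifier always yields the same output under the same key, so records for one user remain linkable to each other inside the de-identified data set.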
15:22:46 Khaled: three types of risk 15:22:51 johnsimpson has joined #dnt 15:23:02 efelten_ has joined #dnt 15:23:06 ...are you going to re-identify individual in data set, or are you going to match two databases 15:23:11 You should talk to your colleague Justin before discounting my arguments, we know each other very well 15:23:16 peterswire has joined #dnt 15:23:17 ...are you considering maximum risk or average risk (very different) 15:23:25 If you took that raw data over a year (nothing magic, just picking a specific example) and gave me one half of the data raw, and one half you had transformed by replacing GUIDs with your hashed id, I would be able to map between the raw and the hashed data sets. 15:23:29 ...when talking about demonstration attack worry about maximum risk 15:23:44 ...with inadvertent, you can use average risk 15:23:53 johnsimpson has joined #dnt 15:23:53 ...what are the appropriate thresholds? 15:23:57 So when you say there is no link to the production system, I disagree. 15:24:00 Aleecia - we keep the datasets completely separate with strict access controls, policy, training, etc. - you wouldn't get both. 15:24:26 oh my, how many times have I head that one and then seen humble pie served lol 15:24:26 ...In practice, the highest risk used is .33 to as low as .05 15:24:28 A different and possibly useful approach, but they *are* linked. 15:24:29 But that is our risk to manage since we make the statement the data is deidentified. 15:24:30 heard* 15:24:33 johnsimpson_ has joined #dnt 15:24:44 efelten_ has joined #dnt 15:24:48 ...No one releases data with a risk higher than .33, increased precedence for other values 15:25:05 johnsimpson has joined #dnt 15:25:19 ...practical range (court cases, regulatory authorities), choose one of four: .33, .2, .09, .05 15:25:28 johnsimpson has joined #dnt 15:25:32 ...no scientific way to choose value, based on past use and changed over time 15:25:50 q?
15:25:58 ....09 and .05 are used in public disclosure 15:25:59 peterswire has joined #dnt 15:26:11 There might exist something in there I could reluctantly live with while really not liking. :-) (And there might not.) What I'll put my body on the tracks for is the idea that you could then publicly release that data. 15:26:13 ....33 and .2 are for releases to trusted business partners 15:26:21 johnsimpson_ has joined #dnt 15:26:28 ...these thresholds are to protect against demonstration attack 15:26:30 Has this deck (being presented currently) been placed into the W3C record? 15:26:42 Chris_IAB, it's in the mail archives. 15:26:44 ...all known attacks have been conducted by academics and media 15:26:46 Aleecia - we have yet another de-identification process for data we release to researchers - so I absolutely agree with you! 15:26:49 q+ 15:27:06 ...this is maximum risk, no one has a higher risk of re-identification than the level 15:27:07 johnsimpson has joined #dnt 15:27:11 Chris, it went out to the public mailing list so it's now recorded. 15:27:38 ...In practice, these numbers are conservative: data changes, imperfect data cause errors 15:27:55 ...the numbers used are ceilings on risk, real risks are lower 15:27:57 Shane - could you describe the de-identification for researchers? 15:28:19 johnsimpson_ has joined #dnt 15:28:35 ...Cell sizes: 3, 5, 11, 20 15:28:56 ...the smallest cell sizes (population cell sizes), may be smaller in a sample 15:29:14 johnsimpson_ has joined #dnt 15:29:20 + +1.215.286.aajj 15:29:22 ...If you create a population with cell size of 5, you can take a sample and have a lower cell size 15:29:29 peterswire has joined #dnt 15:29:37 ...number of individuals with same cell of quasi identifiers 15:29:44 Ed: have to assume quasi identifiers 15:29:48 johnsimpson has joined #dnt 15:29:53 q?
15:30:01 peterswire has joined #dnt 15:30:01 Khaled: only a small subset of variables in data set are quasi identifiers 15:30:19 Aleecia - it varies based on the nature of the dataset but general attributes are: older data, no identifiers, data sets highly numerized (example, instead of showing actual category of music, we show only a number representing a category but give no information to provide context for that category). 15:30:49 David: with a cell size of 11, there is a 9% probablility of a record being re-identified? 15:30:51 johnsimpson has joined #dnt 15:31:10 ...any single record or one record out of the whole? 15:31:11 johnsimpson_ has joined #dnt 15:31:36 moneill2 has joined #dnt 15:31:48 Jeff: are 9% of the records identifiable? Public databases have 9% chance of re-identification. 15:31:57 johnsimpson has joined #dnt 15:32:27 johnsimpson_ has joined #dnt 15:32:36 Peter: there has never been a re-identification of properly de-identified database, but 9% risk? 15:32:40 +[IPcaller.a] 15:32:58 +q 15:33:02 peterswire has joined #dnt 15:33:04 johnsimpson has joined #dnt 15:33:07 Joe: demonstration attack on HHS database de-identified? 15:33:33 peterswire_ has joined #dnt 15:33:36 johnsimpson has joined #dnt 15:33:36 zakim, [ipcaller] is me 15:33:36 +moneill2; got it 15:33:45 Khaled: the hit rate of re-identification are much lower that those values, never have been able to re-identify at a rate higher than the threshold. 15:34:08 peterswire has joined #dnt 15:34:10 johnsimpson has joined #dnt 15:34:24 Felix: if you start guessing, you will be right 9% of time, do I care if I know? 15:34:37 peterswire has joined #dnt 15:34:52 Rob: if I were to guess randomly, I would get some right randomly 15:34:54 johnsimpson has joined #dnt 15:35:10 q+ 15:35:20 johnsimpson has joined #dnt 15:35:21 peterswire_ has joined #dnt 15:35:27 Felix: you would not know you are right, but you could guess 9%. 15:35:28 This is assuming complete l-diversity among the group? 
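The cell sizes (3, 5, 11, 20) and the thresholds (.33, .2, .09, .05) line up if risk is read as one over the cell size, which is how David's "cell size of 11, 9%" follows. A minimal Python sketch of that reading, assuming records are dictionaries and that the caller names which fields count as quasi-identifiers (the function and field names are hypothetical, not from the presentation):

```python
from collections import Counter


def cell_sizes(records, quasi_identifiers):
    """Count how many records share each combination of quasi-identifier values."""
    return Counter(tuple(r[q] for q in quasi_identifiers) for r in records)


def maximum_risk(records, quasi_identifiers):
    """Worst-case risk: 1 / (smallest cell size).

    Smallest cell of 3, 5, 11, 20 gives roughly 0.33, 0.2, 0.09, 0.05.
    """
    return 1 / min(cell_sizes(records, quasi_identifiers).values())


def average_risk(records, quasi_identifiers):
    """Mean of 1 / (cell size) over all records.

    Each cell of size f contributes f records at risk 1/f, so the mean
    simplifies to (number of cells) / (number of records).
    """
    cells = cell_sizes(records, quasi_identifiers)
    return len(cells) / len(records)
```

On this reading the 9% figure is a ceiling on the chance that a guess about any single record is right, not a claim that 9% of the records can be re-identified; the average figure is the one Khaled associates with inadvertent re-identification.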
15:35:44 Shane - that sounds a lot closer to what would be reasonable to provide to users who turn on DNT 15:35:49 johnsimpson has joined #dnt 15:35:50 peterswire has joined #dnt 15:36:10 ack dwainber_ 15:36:14 Khaled: with unlimited resources, they could verify, but expensive 15:36:17 johnsimpson has joined #dnt 15:36:53 Khaled: how do you choose one of four values? 15:36:59 johnsimpson has joined #dnt 15:37:15 ...public you use .05 or .09. If not public, you look at a number of other factors 15:37:16 mnolet has joined #dnt 15:37:26 ...if company have good controls, not as worried about a rogue employee 15:37:34 johnsimpson has joined #dnt 15:37:38 i think the wifi in the room isn't great, i suspect that's the reason 15:37:41 David: do you look at sensitivity of data? 15:37:41 We'll see what we can do during the break. 15:37:46 I am not doing anything.. Don't know why it is happening 15:38:12 johnsimpson_ has joined #dnt 15:38:17 Khaled: three things to look at: sensitivity, potential harm, and consent 15:38:20 peterswire has joined #dnt 15:38:32 ...motives managed with contract 15:38:44 ...with academics and journalist motive to re-identify 15:38:44 peterswire has joined #dnt 15:38:53 johnsimpson has joined #dnt 15:39:02 ...they are check lists for doing this process. 15:39:16 ...need a repetable process to evaluate all of the factors 15:39:36 Chris: is there ever a scenario that there is zero risk if you release data? 15:39:37 johnsimpson has joined #dnt 15:39:46 Khaled: no 15:40:07 ...but there are systems that can give rigorous bounds on risk if you release data. 
15:40:12 Peter: threat models, why would someone attack here, how capable (money, show your smart) 15:40:31 johnsimpson_ has joined #dnt 15:40:31 ...might be commercial reasons, upset employees, think of all the reasons why people might attack 15:40:51 ...why do we care here, what are the harms, are they very sensitive 15:41:11 johnsimpson_ has joined #dnt 15:41:23 Aleecia - I understand that are your perspective of what DNT should mean - as you know I disagree with that position and would interpret a DNT to mean something different (no profiling, not 'no analytics') 15:41:24 ...different values of invasion of privacy: complete browsing history available to FBI may upset some advocates 15:41:29 I don't think the FBI is the worst thing possible - we operate in an international climate 15:41:36 peterswire has joined #dnt 15:41:44 ...other specturm: not a big deal, no one would care about browsing, little harm or risk around it 15:41:54 johnsimpson_ has joined #dnt 15:42:04 ...assume different views on invasion of privacy. 15:42:14 ...Left slide of slide: mitigating controls 15:42:24 robsherman1 has joined #dnt 15:42:30 ...lot of discussion on de-identification have been on publically disclosed databases 15:42:32 johnsimpson has joined #dnt 15:42:52 ...if you post on internet, smart people will attack, that is purely technical protection 15:43:11 ...most of the stuff we are talking about is different: secret databases, set of administrative controls 15:43:14 johnsimpson has joined #dnt 15:43:30 ...privacy act talks about technical, administrative and physical safeguards 15:43:36 johnsimpson has joined #dnt 15:43:39 Shane - we started this with the idea that DNT would limit collection of data. If we actually did that, I'd relax in other areas. But right now we're talking about no reduction in collection at all. 
My fear is that we build a system that is deceptive :-) 15:43:44 efelten has joined #dnt 15:43:56 ...that is how a lot of the data protections take place today 15:44:14 q? 15:44:17 When I talk to users, their main concern is not profiling, it's the data collection itself 15:44:17 johnsimpson has joined #dnt 15:44:24 Aleecia - as long as we're clear with users and the world on exactly what DNT means and how data will be handled then we won't be deceptive 15:44:25 And we're not going to help them with that 15:44:27 ...all the different variables would feed into how we think about de-identification 15:44:30 ack wileys 15:44:46 efelten_ has joined #dnt 15:44:50 johnsimpson has joined #dnt 15:44:59 peterswire has joined #dnt 15:45:00 q? 15:45:02 ack jmayer 15:45:19 q? 15:45:29 peterswire has joined #dnt 15:45:30 Jonathon: factors that could contribute to or mitigate risk, but no way to eliminate risk 15:45:32 Shane - I agree that being clear is necessary. I disagree that it is sufficient 15:45:43 ...we do have ways to put rigorous bounds on risk develop by computer scientist 15:45:57 with respect privacy and data protection as not the same thing. Privacy rights don't exist merely to manage risk, there are rights based around people's desire to lead a private life. So it is irrelevant to say that if data is de-identified it is ok because there is no risk, people have a right (under law in Europe and elsewhere) to refuse to have that data collected in the first place. 15:46:00 ...we can determine just how much the best adversary can accomplish 15:46:03 If we carefully document that DNT does nothing at all, that's not sufficient :-) 15:46:09 johnsimpson has joined #dnt 15:46:32 AHanff, you're overstating EU law 15:46:36 johnsimpson has joined #dnt 15:47:00 johnsimpson_ has joined #dnt 15:47:01 actually no I am not, would you like me to quote it verbatim, I worked on it so I know it pretty well... 
15:47:04 ...techniques for rigorous bounds: differential privacy, body of writing on developing advertising analytics without following users around 15:47:12 Aleecia, so we agree on being clear, we disagree on the level of data "scrubing" that comes with a DNT signal. Progress... :-) 15:47:20 +SusanIsrael 15:47:26 ...lets make marginal gains, some are more rigorously oriented 15:47:28 susanisrael has joined #dnt 15:47:32 efelten_ has joined #dnt 15:47:35 johnsimpson_ has joined #dnt 15:47:35 There was disagreement that we should be clear before? 15:47:37 I think you're even agreeing that being clear is not all that's needed 15:47:48 s/lets make/some propose/ 15:47:57 AHanff, please share EU case law that supports your position - not your subjective interpretation of the written law. 15:47:59 q+ 15:48:03 q- 15:48:09 Aleecia - agreed :-) 15:48:09 Khaled: the managing risk slide is operational 15:48:27 -[IPcaller.a] 15:48:31 johnsimpson has joined #dnt 15:48:34 peterswire_ has joined #dnt 15:48:41 breakfast time, yay 15:48:42 - +1.215.286.aajj 15:48:52 -Aleecia 15:48:56 -vincent 15:49:00 johnsimpson has joined #dnt 15:49:03 peterswire has joined #dnt 15:49:23 schunter has joined #dnt 15:49:46 dwainberg has joined #dnt 15:50:00 johnsimpson_ has joined #dnt 15:50:23 robsherman1 has joined #dnt 15:50:34 johnsimpson_ has joined #dnt 15:51:02 -rvaneijk 15:51:04 robsherman has joined #dnt 15:51:04 johnsimpson has joined #dnt 15:51:45 johnsimpson has joined #dnt 15:51:50 peterswire has joined #dnt 15:52:16 johnsimpson has joined #dnt 15:52:16 zakim, aajj is susanisrael 15:52:16 sorry, susanisrael, I do not recognize a party named 'aajj' 15:52:17 peterswire has joined #dnt 15:52:40 zakim, 215 286 aajj is susanisrael 15:52:40 I don't understand '215 286 aajj is susanisrael', susanisrael 15:53:02 johnsimpson_ has joined #dnt 15:53:23 npdoty can you help me advise zakim that my phone number is 215 286 aajj 15:53:42 - +1.646.722.aagg 15:54:02 johnsimpson_ has joined #dnt 15:54:25 
johnsimpson__ has joined #dnt 15:54:55 johnsimpson has joined #dnt 15:55:02 peterswire has joined #dnt 15:55:12 +[IPcaller] 15:55:23 johnsimpson has joined #dnt 15:55:32 zakim, [IPCaller] is me 15:55:32 +moneill2; got it 15:55:52 johnsimpson_ has joined #dnt 15:55:56 zakim, [215 286 aajj] is me 15:55:56 I don't understand '[215 286 aajj] is me', susanisrael 15:55:58 robsherman has joined #dnt 15:56:00 test 15:56:09 efelten has joined #dnt 15:56:39 -SusanIsrael 15:56:40 Shane, problem was the network we were on. Changed network. 15:56:53 efelten_ has joined #dnt 15:56:54 Paul has joined #DNT 15:56:57 robsherman1 has joined #dnt 15:56:58 hope this is stediar 15:56:58 npdoty: can you help me communicate with zakim about my phone number? i don't seem to have the syntax right. 15:57:18 johnsimpson__ has joined #dnt 15:57:30 John - that didn't seem to do the trick 15:57:42 vincent has joined #dnt 15:57:58 Hard to follow anything on IRC today with so many connect/disconnect events being thrown up. 15:58:09 johnsimpson_ has joined #dnt 15:58:36 peterswire_ has joined #dnt 15:58:43 robsherman has joined #dnt 15:58:49 +??P24 15:59:09 peterswire_ has joined #dnt 15:59:10 +rvaneijk 15:59:16 dwainber_ has joined #dnt 15:59:33 zakim, ??P24 is vincent 15:59:33 +vincent; got it 15:59:35 Peter: Mike had comment on last slide 15:59:40 +SusanIsrael 15:59:44 ok, how do I scribe nick me? 15:59:56 Scribe: JoeHallCDT 15:59:58 scribenick: joehallcdt 16:00:06 robsherman has joined #dnt 16:00:14 cookies are not anonymous, they pinpoint an individual/device 16:00:18 Chris_IAB has joined #dnt 16:00:19 hwest has joined #dnt 16:00:36 scribe: JoeHallCDT 16:00:49 jeffwilson has joined #dnt 16:00:52 robsherman1 has joined #dnt 16:01:55 robsherman has joined #dnt 16:02:00 peterswire has joined #dnt 16:02:38 q? 
16:02:52 Peter: we're not going to debate how strict a standard is 16:02:59 … let's imagine a three-step model 16:03:20 … super strict standard for De-ID, a middle ground and no de-ID 16:03:27 Speaker was Mike Nolet from AppNexus 16:03:34 mnolet has joined #dnt 16:03:39 thx 16:04:07 q+ 16:04:26 felixwu has joined #DNT 16:04:32 … there are choices for businesses to give up a de-ID'd approach if the cost is too high 16:04:45 Mike Nolet: it's not as much cost as competition 16:04:55 … some companies are getting into thrid party advertising 16:05:50 identifiers in cookies are PII in Europe 16:06:06 q+ 16:06:09 Mark Groman: truly believe that the standard we're discussing that will have unintended consequences 16:06:20 … some of the things we propose may have a net-negative impact on privacy 16:06:24 *Joehallcdt if you want me to scribe let me know 16:06:25 So, about that de-identification topic... 16:06:42 … the notion that opt-in consent is all that's needed to over-collect 16:07:07 Peter: we did start with a discussion of incentives for de-ID 16:07:18 … one was compliance with NAI, etc, codes 16:07:20 You have to say what data you gather and what you intend to do with it to get consent 16:07:27 The FTC sees cookies and IP addresses as "personal information" as well. All information is personal, but some is more personal than others. 16:07:29 robsherman1 has joined #dnt 16:07:30 dwainberg has joined #dnt 16:08:06 There is a value in incentivizing companies to keep data at pseudonymous instead of real-name idenifiers. 16:08:08 gills (?): if we follow de-ID as a privacy protective tool, we can't say that a cookie is PII 16:08:27 There is no notion of PII in this standard. 16:08:31 But this is somewhat off topic. 16:08:40 … you've created an incentive to create PII databases 16:09:10 … PII should matter, if you value de-ID as a way to break the link to the individual 16:09:19 Chris Mejia: agrees with Jonathan! 
16:09:36 q- 16:09:44 dtauerbach has joined #dnt 16:09:47 … we are supposed to do good practices for de-ID and I want to do that. 16:09:51 q? 16:10:02 *joehallcdt you had marc groman and paul glist speaking before chris iab 16:10:17 Peter: has not had that focus, wants to have comon language 16:11:12 sribenick: susanisrael 16:11:32 peter swire: let's start talking about hashing 16:11:50 DNT was proposed as a solution to address psuedonymous third party tracking. I don't think we're going to walk away from that idea at this point. 16:11:58 khaled: understand that hashing was discussed as a way to protect against cookies or other unique identifiers 16:12:27 ...if you are hashing without salting, can easily be broken and recover say ss#, so plain hashing not recommended 16:13:01 This makes sense for sharing data externally but not for internal storage of data 16:13:01 ...if you have [something] that can be added to your value....but challenge for distributed system with salt, you don't want to distribute salt to everyone 16:13:16 ....have to come up with protocol where salting happens at central location. 16:13:29 [someone] need to know who can hash 16:13:36 [who was speaking?] 16:13:53 efelten 16:13:54 s/[someone]/efelten/ 16:13:58 khaled: one alternative is to use public keys that you can distribute and have encrypted value done say within browser 16:14:05 ...instead of hashing you encrypt 16:14:19 +Aleecia 16:14:32 ...other consideration even with salted values is that you can have frequency attacks...certain names more common...can guess. 16:15:05 ....so can recover names by looking at frequency. even ss#s. so salting not adequate where there is frequency distribution 16:15:32 .....with encryption [?] would do it differently each time, frequency not an issue 16:15:59 peterswire has joined #dnt 16:16:05 .....to the extent its a problem certain fields may be too long to process or transmit [with encryption?]..... 
16:16:37 ...so for example you can get encrypted ss# with same character set as actual ss# so you avoid long strings. sometimes practical advantage 16:17:17 peter swire: have some observations: lots of hashing in commercial ecosystem. heard yesterday at hhs that unsalted ss# not ok bc easy to do dictionary attack 16:17:25 Good resource on the technical and security details in this area: http://crackstation.net/hashing-security.htm 16:17:33 .....turning to ed, you have expressed cautions re: hashing. 16:17:50 ed felten: different scenarios in which hashing fails. doesn't do much without salt. 16:18:19 ...even with salted hash someone who knows the salt can generally break it or someone who can cause salted function to be evaluated on their behalf. 16:18:41 ....gives example where you ask one server to compute hash on another. [simplified] 16:18:47 A hash turns user data into a pseudonymous identifier 16:19:10 ...if multiple records contain same salted hash value they can be linked. need to use probabilistic encryption or something like that 16:19:18 chris iab: there is hashing then access to salt 16:19:48 We should discuss keyed hashes as being superior to salted hashes (although in the same universe) 16:19:52 ed felten: not just access to salt. if you have value hash then you can do same dictionary attacks as if you knew salt so not enough to ask if you know salt 16:20:17 ed felten: can make sophisticated argument .....rare case where hashing is secure 16:20:36 peter swire: assume people will use hashing and will be long enough not to be broken 16:20:43 chris iab: how reliable? 16:20:44 One-way hashes don't allow direct reverse identification by themselves - access to the salt/key allows someone to perform a dictionary attack 16:21:03 ed felten: if you can have hash computed for you, that is just the same as if you can break it 16:21:06 Requires access to the original raw data (if it still exists) and the salt/key 16:21:08 what are we hashing?
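The dictionary attack mentioned above can be sketched in Python as follows. This is a minimal illustration, assuming SHA-256 as the hash, a salt that is simply prepended to the value, and an attacker who holds a list of candidate inputs (for SSNs the whole space is small enough to enumerate); none of these specifics come from the discussion.

```python
import hashlib


def hash_value(value: str, salt: bytes = b"") -> str:
    """Hash a value with an optional prepended salt (SHA-256 is an assumption)."""
    return hashlib.sha256(salt + value.encode("utf-8")).hexdigest()


def dictionary_attack(observed_hashes, candidates, salt: bytes = b""):
    """Recover original values by hashing every candidate and looking for matches.

    With no salt, or with a salt the attacker knows, this needs only the
    list of candidates and the observed hash values.
    """
    lookup = {hash_value(c, salt): c for c in candidates}
    return {h: lookup[h] for h in observed_hashes if h in lookup}
```

The same function covers both cases raised in the room: with `salt` left empty it is the unsalted attack, and with a known salt the attacker simply passes it in.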
16:21:30 In the EU organizational measures are not enough to make hashed values of user data anonymous. 16:21:38 someone [who is speaking?]: will use admin controls with hashing 16:21:59 ed: if you can make up inputs and ask people to hash them that is just as good as if you had the salt 16:22:11 someone: but that is form knowing input and output 16:22:34 Rob, if paired with administrative, technical, and policy/educational, then keyed hashing is considered enough to reach the point of "likely reasonable" to no longer be personal data (de-identified), correct? 16:22:47 ed felten: what if you take value with identifier and cookie, ask someone to make salted hash, don't tell you the salt, but put it back in your data base 16:22:51 Rob, add "safeguards" after "policy/educational" 16:23:12 someone: but that assumes you know input and output 16:23:25 shane: if you throw away the key, then yes. TomTom was a nice example. 16:23:38 peter swire: i have observed lots of hashing in ad world. for most sophisticated attackers they may be able to break them 16:24:03 ...we will eventually have to come to view of how we will discuss all this. so common hashes might be of email address? cookie value? 16:24:41 Rob, if you keep the key in a safeguarded location, limited access, technical controls, etc. - I believe you still reach the bar per the A29WP Option from April 2011. 16:24:45 peter swire: let's take email addresses. if my email is hashed using proper salt, and someone gets output, they can eventually figure out hash and salt 16:24:58 Rob, or was that 2010 - I'll look it up. 
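Ed's point that having inputs hashed on request is as good as holding the salt can be sketched like this. The `HashingService` class is hypothetical and stands in for any component that will compute the keyed hash on demand; HMAC-SHA256 is an assumption. The attacker never sees the key and only calls the service.

```python
import hashlib
import hmac
import secrets


class HashingService:
    """Hypothetical service that holds a secret key and hashes values on request."""

    def __init__(self):
        self._key = secrets.token_bytes(32)  # never revealed to callers

    def pseudonymize(self, value: str) -> str:
        return hmac.new(self._key, value.encode("utf-8"), hashlib.sha256).hexdigest()


def reidentify_via_service(service, observed_pseudonyms, candidates):
    """Map pseudonyms back to inputs by asking the service to hash chosen candidates."""
    table = {service.pseudonymize(c): c for c in candidates}
    return {p: table[p] for p in observed_pseudonyms if p in table}
```

The access controls under discussion therefore have to cover who may invoke the hashing function, not only who may read the key.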
16:25:12 ed felten: can ask that hash be done on known value, and record hashed value in database then can correllate 16:25:25 Well, that safeguard is a very high bar, ie a notary, who has a legal obligation to not disclose 16:25:29 [someone] qu is from whom you are trying to secure the data 16:25:38 Rob, I agree throwing away the key is an absolute end-point, but I'm aiming for the 'likely reasonable' standard 16:25:44 is it protection at all wrt a particular party that has particular data 16:26:35 david w. not hashing for hashing's sake. need to figure out from whom you are trying to protect the data from, and tailor approach to that 16:26:38 Shane, the point is, that if I should not be able to calculate a hash after let's say a year, and expect the same output, such that users can be re-identified. 16:26:53 s/if/_/ 16:27:09 khaled: even if we go back to previous model using hash or salted hash, probability of recovering original value is 1, certain 16:27:30 Rob, why? As long as the original key is secure, then there is very low risk of user re-identification 16:27:31 robsherman has joined #dnt 16:27:34 Rob, is that an art 29 position, or your own? (Both are valuable, I'm just trying to get which is what) 16:27:36 chris iab: assuming you have access to data in first place, right? 16:27:55 khaled: so final result at end of all risk assessment is still high, still has to be further mitigated 16:28:05 Aleecia, the A29WP position in the opinion paper is not as strict as Rob is stating (in my opinion) 16:28:06 Wileys, in the DNT case, are we just considering hashing cookie IDs? 
if so, I'm not sure it brings any real protection: cookie IDs are opaque anyway 16:28:09 peter swire: let's see why people might feel strongly 16:28:41 ...if db is publicly accessible and people can get access then probability of breaking is higher, but david and chris are saying you can limit access 16:29:09 Vincent, keyed hashing coupled with other measures, as well as the cessation of certain business activities (profiling), does meet the goals of DNT in my opinion. 16:29:15 .[someone]..but ed is saying if you have access to hash and salt -if disconnected doesn't work 16:29:28 Jeff Wilson 16:29:46 peterswire_ has joined #dnt 16:29:56 david w: i think what we are talking about is that using some form of oneway hash was a useful method of de-identifying 16:30:21 khaled: depends. must be done in such a way that you can protect against attacks ed is describing which are quite trivial 16:30:26 Wileys, well that's not my question :). What type of protection does it bring with regard to the risk of re-identifiication? 16:30:37 david and khaled back and forth a bit 16:30:55 q? 16:30:56 Shane, let's have this discussion in Boston 16:31:06 khaled: probability that someone attempts to attack, then that they can break hash 16:31:13 robsherman1 has joined #dnt 16:31:25 Vincent, as long as the original data is not accessible and neither is the key to the hash, then there is very low risk of re-identification (depending on the details housed within the de-identified dataset) 16:31:29 Aleecia: formal position within this DNT debate 16:31:32 ...if low probability of attempt ....hard to make that case 16:31:32 dwainber_ has joined #dnt 16:31:50 [someone] isn't probability of reidentification only 1 if you have access to the computer? 16:31:51 Rob - agreed - looking forward to it (the conversation that is, not the horrible weather we're likely to encounter in Boston :-) ) 16:32:05 :) 16:32:08 khaled: depends on workflow. 
may be hashed then go to central db 16:32:14 s/someone/Mike Nolet 16:32:32 We need to recruit a new WG member with a big office in the Florida Keys 16:32:44 +1 to Aleecia! 16:32:47 peterswire_ has joined #dnt 16:32:53 Rob - thanks, that's exactly what I was asking, thank you 16:33:14 mike nolet : i have unique cookie id on ed. need to get totally random integer, if someone is snooping on all net traffic or has access to pc or net connection 16:33:17 peterswire has joined #dnt 16:33:34 Wileys, how is the re-identification risk lower with the hased cookie ID rather than with the unhashed cookie ID? (that's actually what's discussed right now) 16:33:43 peter swire: is there a scenario where hashing matters? mike was saying you have to have access to cookie 16:33:47 Chris_IAB has joined #dnt 16:34:12 chris iab: does it matter if transferring to another party or internally? 16:34:18 peter swire: we are learning something 16:34:20 this was the equation put on the board: pr (re-identification) = pr (re-id/attempt) x pr (attempt) 16:35:00 jeff? there is industry practice where you hash, independent party enriches by matching, and there is permission to share 7 matches 16:35:08 Cookie exchanges are interesting in this context.. 16:35:08 Vincent, its lower only if coupled with other factors (multi-factor test) such as seclusion of the key/salt and removal of access/existance from the original dataset. 16:35:12 ....common identifier can be hashed 16:35:21 +q 16:35:25 peter: so that is one scenario, do you see usefulness ed? 16:35:42 q? 16:35:47 ack Wileys 16:35:51 ack dwainber_ 16:35:53 robsherman has joined #dnt 16:36:03 shane: the core purpose at yahoo for hashing/keys, is to disconnect that data from use in actual production systems 16:36:11 "destroy"? 16:36:36 peterswire_ has joined #dnt 16:36:37 ...destroys possibility for profiling, targeting. can not be used to modify users experience. but still useful for analysis.. 
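The equation put on the board, pr(re-identification) = pr(re-id/attempt) x pr(attempt), can be sketched as a small function. The numeric values below are assumptions chosen only to show the arithmetic. They are not figures from the discussion.

```python
def reidentification_risk(p_reid_given_attempt: float, p_attempt: float) -> float:
    """Overall risk = Pr(re-id | attempt) * Pr(attempt)."""
    for p in (p_reid_given_attempt, p_attempt):
        if not 0.0 <= p <= 1.0:
            raise ValueError("probabilities must lie in [0, 1]")
    return p_reid_given_attempt * p_attempt


# If recovering the original value is certain once someone tries
# (Pr(re-id | attempt) = 1), the overall risk is just Pr(attempt).
# The 0.05 below is an illustrative assumption.
risk_when_recovery_certain = reidentification_risk(1.0, 0.05)

# Lowering either factor lowers the product; 0.5 and 0.5 are again assumptions.
risk_with_both_reduced = reidentification_risk(0.5, 0.5)
```

This makes the disagreement in the room concrete: one side argues the first factor is close to 1 for hashing, the other argues that access controls reduce the second factor.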
16:36:47 peter swire: ed or dan does that make sense to you? 16:36:52 WileyS, right. the goal is to break the re-identification 16:36:57 dan: i am confused by that 16:37:06 sigh 16:37:32 shane: these are always multifactor tests. your purpose in hashing is to not do this. once you add multifactors, it serves purpose 16:37:46 [someone] if you can get hash function or key it doesn't matter 16:37:50 robsherman1 has joined #dnt 16:37:56 shane: good luck. we make key very inaccessible 16:38:07 s/someone/Joe Hall 16:38:07 ed felten: who knows keys? 16:38:29 vincent has joined #dnt 16:38:34 shane: keys are very large. systems that are set up to de-identify know key, but human connection to key is not allowed 16:38:58 felix: so if i understand correctly usefulness is to separate one part of company to another? 16:39:00 dwainberg, in case you missed it, "the key is on a post-it on Shane's desk" (that's a JOKE, btw.. lol) 16:39:14 shane: really to separate info from another context 16:39:25 Chris - love it! 16:39:29 felix: 2 people (one w key) are separate 16:39:44 shane: isolation of key is not only factor. 16:39:59 peterswire has joined #dnt 16:40:05 Chris, LOL 16:40:18 q? 16:40:20 peter swire: i think its relevant bc hashing and its uses have been talked about in a lot of context. people in ad industry at one end of table, others at other 16:40:31 dwainberg has joined #dnt 16:40:49 khaled: if that separation is strong and defensible, then at least under hipaa that would be ok. if you have good procedures for controlling access to key that's ok 16:40:51 Yay for Yahoo!, we're good by HIPPA standards (too bad we don't handle PHI :-) ) 16:41:04 ....scenarios where regulators have accepted that 16:41:16 dan auerbach: rotating salt helps a lot 16:41:21 rotating salt is a good practice 16:41:40 rotating salts kills everything shane wants out of the data 16:41:59 Aleecia - we do rotate, but not daily. 
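A minimal sketch of why rotating the salt or key breaks longitudinal linkage, assuming a keyed hash (HMAC-SHA-256) whose key is replaced by a fresh random one at each rotation period and the old key is discarded. The class name, the cookie value, and the integer period index are hypothetical. Nothing here describes any company's actual system.

```python
import hashlib
import hmac
import secrets


class RotatingKeyHasher:
    """Keyed hash whose key is regenerated whenever the period changes."""

    def __init__(self) -> None:
        self._period = None
        self._key = None

    def pseudonym(self, cookie_id: str, period: int) -> str:
        if period != self._period:
            # New period: draw a fresh random key; the old key is dropped,
            # so earlier pseudonyms can no longer be recomputed.
            self._key = secrets.token_bytes(32)
            self._period = period
        return hmac.new(self._key, cookie_id.encode(), hashlib.sha256).hexdigest()


hasher = RotatingKeyHasher()
same_period_a = hasher.pseudonym("cookie-123", period=0)
same_period_b = hasher.pseudonym("cookie-123", period=0)
next_period = hasher.pseudonym("cookie-123", period=1)
```

Within a period the same cookie maps to the same pseudonym, so analysis still works. Across periods the pseudonyms differ, which is the loss of longitudinal view described above as a feature or a bug depending on the point of view.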
16:42:03 david wainberg: we are saying its not binary, hashing is not perfect, question is how hard does it make it? how hard do we want to make it? what is the context/data involved? 16:42:09 Rotating salts kills longitudinal view, which is a feature or bug depending on how you look at it. 16:42:14 aleecia, it means Yahoo buys LOTS of post-its (again, marked as a JOKE folks :) 16:42:16 someone: sounds like its trivial to break it 16:42:22 I go with feature, Shane goes with bug :-) 16:42:26 david wainberg: what do you mean by trivial 16:42:30 s/someone/Joe 16:42:39 what really hard means also depends on the purpose, not only on the context 16:42:42 Aleecia, :-) 16:42:45 david w: depends on combination of technical and administrative 16:42:51 buy stock in 3M, folks! you heard it here first. 16:43:14 peterswire_ has joined #dnt 16:43:16 someone: shane is describing intentional inadvertent viewing of data 16:43:38 s/someone/mike nolet 16:43:46 shane: purpose is more than just personal protection--disconnect data from operational systems so utility limited and therefore privacy is increased 16:44:28 jeff: everyone agrees with ed or should. if you have access to salt, it doesn't work. but if we say salting/hashing does not work, then we are saying passwords on internet don't work 16:44:46 ....if you have access to hash and salt you could access hashed stored passwords 16:44:55 daily rotated salts is at least a step forward. but having it change only when the janitor tosses out the post its by mistake once a year isn't going to make me happy :-) 16:44:57 q+ 16:45:10 chris iab: what would the alternative? put all raw data out on internet? or not collect any data? 16:45:16 WIleys, would not a request like "SELECT User from DB where user visited site1,site2,...,siteN" recreate the link that the hash just deleted? 16:45:24 Aleecia - its a bit more formal/regular than that. 
Note - I don't use post-its :-) 16:45:26 ed felten: i have not heard an example here where hashing really helps 16:46:00 peter swire: i spent 2 years working on crypto policy. if system broken it doesn't work, but in practice it works 99 percent of the time 16:46:14 Vincent, the hash was not meant to hide activity but rather to disconnect identity from operational systems. 16:46:22 ...i have heard that there are attacks that could be made, but i have heard about administrative controls 16:46:23 peterswire has joined #dnt 16:46:46 Passwords are used to verify an identity, based on a shared secret, which is a totally different mechanism 16:46:48 peterswire has joined #dnt 16:46:51 ....all those seem like things in real world where protection is more than zero though might still be subject to some kinds of attacks 16:47:04 ed felten: no because these attacks are trivial 16:47:17 q+ 16:47:30 Wileys, yes but the history of websites visited by a user would help to reconnect the different operational system (the list of website is used as a unique identifier) 16:47:34 si question: do these attacks in fact happen in companies all the time in the real world? 16:47:39 jonathan -- I see you; 16:47:42 Shane - 3M weeps 16:48:24 Vincent, agreed - so some URL cleansing helps remove this issue - or in the case of searches, attempts to cleanse personal data in queries helps. 16:48:24 ed felten: if we say we will separate our data base into 2 pieces and only one is hashed, whatever analysis someone wants to do they just need to do one more step 16:48:28 ack jmayer 16:48:32 chris iab: but they would have to have access right? 16:48:37 q? 16:49:02 Vicent, my approach can't guarantee 100% certainty but does meet the "very low risk" bar - or in the EU context, the "likely reasonable" bar. 16:49:13 jmayer: concrete example: ad company i studied tried to use hashing to do follow on analysis. user had id cookie. then had another cookie. 
"anonymous" 16:49:41 peterswire_ has joined #dnt 16:50:01 If we the spec allows for a 30 day short-term retention period, presumably the group would be OK if the salts were rotated at least every 30 days. 16:50:01 ...idea was that anonymous one was hash with secret salt and would be used for long term things and more private but susceptible to same attacks because you could always correlate with original cookie 16:50:06 peterswire has joined #dnt 16:50:49 peter swire: jmayer you were giving example, and jeff and crhis had questions or comments 16:51:02 chris iab: you described a bad practice 16:51:30 David has joined #dnt 16:51:42 ...you don't throw out baby with bath water. Just bc there is one bad practice doesn't mean all hashing worthless 16:51:51 Wileys, I don't the "very low risk" bar well enough :) just trying to see what is the type of threat that cookie hashing address 16:52:00 We have yet to hear an example where hashing makes any attack appreciably more difficult. 16:52:27 David_MacMillan_ has joined #dnt 16:52:33 Justin, the spec should not be prescriptive on timeframes and rather, much like HIPPA, should focus on acceptable risk thresholds. 16:52:36 jmayer: agree there are better engineering practices; but pretty predictable failures; have heard things like figuring out salt or doing dictionary attacks, 16:53:00 ...but these are not only attacks. there are enormous re-identifiability problems. 16:53:03 Vincent, you don't "?" the "very low risk" bar well enough? 16:53:04 peterswire has joined #dnt 16:53:28 Ed, hashing makes sense, if you take out information such that enough collissions appear, that meat a k-anonimity bar. 
16:53:30 peterswire has joined #dnt 16:53:35 Justin, I think you're saying: if we're going to have 30 (or more) days for people to take first-logged data to figure out what they have and if they're first or third party while collecting, then we should also be ok with a company holding all data indefinitely, so long as they rotate every 30 days. 16:53:39 s/meat/meet/ 16:53:56 I think the point is that in all the examples so far, hashing is purely a method of operational control, and it is not a great one given engineering challenges 16:53:57 Wileys, I don't "know" it well enough, sorry 16:53:58 ....i think we have an error in the way some people are approaching this. you have fact pattern, try to apply approach. start with specific problem and way to solve and ask if hashing get you there... 16:54:07 e.g. you can't hvae an oracle and that is hard to control in practice 16:54:11 - +1.631.803.aacc 16:54:33 ....ed is not asking straight up;/down vote on metaphysics of hashing...and ihave not heard concrete problem and proposed hashing solution that solves the problem 16:54:43 aleecia, well, we've had different interpretations of the point of the short-term period over time, but basically yes. 16:54:50 Ed, if a dataset were breached in isolation (a single data table), wouldn't you agree that hashing of identifiers in that table (depending on what additional feeds were available) would help deter re-identification? 16:54:55 peter swire: can industry explain use case where hasing helps? 16:55:33 david wainberg: can we identify risk thta ed and jonathan are concerned about it and see if that can be addressed 16:55:37 Justin - ok. So I'm ok with a single short period, but may not be ok with infinite retention even with rotation 16:56:07 +DAvid 16:56:10 q+ 16:56:11 felix? : sounds like we are concerned about internal controls. 
valuable if you have company where not everyone or no one is careless or malicious 16:56:26 What I'm looking for is a specific example--a specific use of hashing, and a specific attack that is made more difficult because of the use of hashing. 16:56:32 jeff: 3 scenarios where hashing helps. 1: passwords 16:56:36 peterswire has joined #dnt 16:56:48 2. if you want to do research internally in large company..... 16:56:50 Shane, it depends on the details of the hashing. For example, an unsalted hash of social security numbers in that isolated table does not help at all 16:57:02 new (related) subject: are toilet seat covers effective? (again, humor is my defense mechanism :) 16:57:09 aleecia, Fair enough, to the extent there is an inherent risk that a delinked 30-day set of urls is inherently identifiable and/or tiable to other 30-day sets. 16:57:11 -[GVoice] 16:57:21 dtauerbach, agreed - I'm speaking only of salted or keyed hashes. 16:57:43 -Jonathan_Mayer 16:57:45 peter swire: so if some risk of internal misuse, but hash passwords or separate research database from where it came from, you reduce risk even.,.. 16:58:04 if doesn't protect against sophisticated attacks, reduces risk from normal people. 16:58:04 +Jonathan_Mayer 16:58:12 Justin - exactly 16:58:20 vincent has joined #dnt 16:58:28 felix: i think we are seeing risk reduction in normal ways. seeing qu from ed re: scenarios 16:58:58 I would guess that at 24 hours I'd be ok. But I'd need to know more. And I think the right way to get at this is not a timeframe, but rather the ability to chain across datasets 16:59:07 in some sense from tech perspective does not help much but if the data just requires an extra step that may be enough to deter or detect attack from pt of view of internal controls 16:59:26 mike nolet: re: david's question. what is risk you are talking of reducing 16:59:39 someone: risk that info on research side is then used to target 16:59:52 felix? if dnt is 1? 
16:59:55 - +1.202.257.aaff 16:59:56 -q 16:59:59 yes: 17:00:08 q? 17:00:18 peterswire_ has joined #dnt 17:00:43 ed felten: cs views attacks at 3 levels. started discussion bc broad claims were made that hashed data should be treated as per se de-identified. 17:00:48 Ed, It was never stated in isolation but as one factor of multiple steps to achieve unlinkability. 17:00:54 Ed, at least not by me 17:01:28 ...we don't have to talk about hashing or micromanage how people protect, but i don't think we should talk about hashing as total protection 17:02:09 paul glist: broad claims on both sides. have looked at this as dial. can reduce risk to socially acceptable levels. hashing is not nothing... 17:02:20 +1 to current speaker's point 17:02:29 ...and not everything. it's a tool. add other tools. it's useful. 17:02:53 There are protections that are effective even if an attacker controls the terminal. That's part of the point. 17:03:08 johnsimpson: still having trouble figuring out how this relates to DNT. have been talking about protecting data sets with pii. 17:03:14 dwainber_ has joined #dnt 17:03:16 peterswire has joined #dnt 17:03:29 jmayer, for example: hard disk encryption 17:03:34 chris iab: you may want to have access to uri's for example. but don't need it connected to unique users 17:03:34 Right, the deidentification method has to take into account the internal misuse angle. 17:03:50 peterswire has joined #dnt 17:03:55 john simpson: but that's the disconnect bc most people saying that dnt is do not collect 17:03:58 q+ 17:04:00 someone: is that right? 17:04:22 someon: if there is any identifier you still have a problem 17:04:39 Someone is justin, someon is jmayer :) 17:04:45 peter swire: we heard different perspectives: 17:04:53 * thanks justin 17:05:31 peter swire...unique identifiers. can you enlighten me? how is going into buckets relevant? 
17:06:20 someone asks if adding attributes and using those is unique identifiers 17:06:32 s/someone/joe hall 17:06:48 peterswire has joined #dnt 17:06:51 dan auerbach: better privacy friendly way to add advertising that is targeted. need minimum number of people in a bucket 17:07:05 Dan, the minimum buckets make nice micro-segments. 17:07:19 peterswire has joined #dnt 17:07:26 ...we suggested 1024 is a minimum bar. with that don't need unique identifier, just low entropy cookies 17:07:48 heather: might be useful to look at transcript of previous discussion 17:07:49 If you're interested in advertising, analytics, etc. without unique IDs... https://air.mozilla.org/tracking-not-required/ 17:07:59 peterswire: room is not catching fire on this 17:08:41 chris mejia: i do agree with dan's core premise, that much harder to identify person from a few attributes distilled from all the uris that people visited 17:08:47 q+ 17:08:56 q- later 17:08:56 dan auerbach: can keep those collections without unique identifers 17:09:21 ok, I see aleecia and jonathan 17:09:37 chris: we agree on that part (harder to identify that way-with quasi identifiers), not necessarily the second part 17:09:43 q? 17:09:45 .....that is sort of an industry practice 17:09:47 q+ 17:09:51 ack aleecia 17:10:10 aleecia: i think we are all getting there. want to separate 2 different parts of dan's description. one is how to do ads without tracking.... 17:10:32 ...but pertinent is here's how you can do de-identification, suggest we focus on the de-id half 17:10:33 peterswire_ has joined #dnt 17:10:50 aleecia: ....interesting re: reduced identificaiton risk 17:11:05 + +1.631.803.aakk 17:11:06 peterswire has joined #dnt 17:11:16 dwainberg has joined #dnt 17:11:20 david wainberg: outline of discusison, 3 general models: 1. random unique identifier, interest buckets 17:11:47 2. 
unique identifier associated with buckets, dan proposing buckets only, no identifiers 17:11:58 dan: maybe what aleecia proposed make sense 17:12:43 davd w: as discussed earlier, what we mean by de-identified requires setting threshold, and we're just jumping to let's break the connection instead of 17:12:58 dwainbe__ has joined #dnt 17:13:16 ...discussing what is a level of acceptable risk. there are significant consequences to forcing ad industry to do this 17:13:22 what does "not linked at all" mean here? 17:13:32 peterswire_ has joined #dnt 17:13:32 peter swire: if not linked at all then outside dnt 17:13:32 q? 17:13:40 davd w: but still some risk 17:14:00 ed: but gets to idea of attribute disclosure vs record re-identificaiton 17:14:01 peterswire_ has joined #dnt 17:14:03 ack dwainber 17:14:17 q+ 17:14:22 ed: matters a lot what the bucket is: soccer dad vs. aids patient 17:14:23 q- later 17:14:40 would like to respond to Ed 17:14:48 ed: need more than knowing that there is a bucket, some sensitive info has to not be used 17:15:01 ed: but combos of attributes could identify 17:15:09 Just to be clear, the DAA principles do not prohibit inferences about medical conditions. 17:15:34 q+ 17:15:37 q+ earlier 17:15:42 mike: want to come back to theme: understanding what we're trying to accomplish. what is bad stuff we are trying to prevent. seeing a relevant ad? 17:15:43 q- earlier 17:15:49 could we please stay on topic? 17:15:56 jonathan -- I'm unclear -- are you in the q? 17:16:02 this is an interesting discussion, but not today's agenda 17:16:02 ...what other bad stuff, scary outcomes, than seeing an ad for something i bought on amazon? 17:16:08 Yep, just testing the limits of Zakim. 17:16:12 The HARM is not a relevant factor when it comes to unlinkability 17:16:26 q? 17:16:28 q? 
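A minimal sketch of the minimum-bucket-size idea and of the point that combinations of attributes can identify, assuming toy records with two hypothetical attributes ("interest" and "region") and a threshold of 2 instead of the 1024 suggested above. The function counts how many records share each attribute combination and reports the buckets that fall below the threshold.

```python
from collections import Counter


def small_buckets(records, attributes, k):
    """Return the attribute combinations shared by fewer than k records."""
    counts = Counter(tuple(r[a] for a in attributes) for r in records)
    return {bucket: n for bucket, n in counts.items() if n < k}


# Toy data; attribute names and values are assumptions for illustration.
records = [
    {"interest": "soccer", "region": "east"},
    {"interest": "soccer", "region": "east"},
    {"interest": "soccer", "region": "west"},
    {"interest": "cars", "region": "east"},
    {"interest": "cars", "region": "east"},
]

# One attribute alone: every bucket holds at least 2 people.
by_interest = small_buckets(records, ["interest"], k=2)

# Two attributes combined: one bucket now holds a single person.
by_interest_and_region = small_buckets(records, ["interest", "region"], k=2)
```

Each bucket on its own can meet the threshold while the combination does not, so the check has to run over the attribute sets that are actually stored together.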
17:16:33 peter swire: what the harm is in tracking comes up in a lot of settings but not main topic today 17:16:33 ack aleecia 17:17:07 aleecia: want to respond to ed re: which buckets you might care more about, but group decided we would not distinguish, say re: childrens data 17:17:14 peterswire has joined #dnt 17:17:23 ....treating all data same here, which is different than iab daa position 17:17:42 ack jmayer 17:17:43 peter swire: thank you for history but some people do not acknowledge they agreed to that 17:17:44 ack jmayer 17:17:47 jmayer passes 17:17:54 peterswire_ has joined #dnt 17:18:29 peter swire: had initial discussions on buckets and learned a bit on dimensions there. talked with mike at break re: example of something you think it would beuseful to look at 17:18:39 of note: this is not me *objecting* to treating some data as of more concern. just what the group decided many months ago. 17:19:00 david wainberg: i thought next step would be taking approach of your favorite slide and start thinking through risks and how to apply techniques to mitigage 17:19:06 if there is new information before the group, Peter & Matthias have the option to reopen 17:19:12 -moneill2.a 17:19:18 Aleecia - my memory matches yours - we decided to not get bogged down in the "sensitivity" debate and allow self-regulation and laws deal with that item 17:19:24 peter swire: that is one possible work flow. use khaled's checklist 17:19:51 ...maybe there are subsets of people willing to do work on that and come back with a draft. let peter know after meeting if you want to work on 17:20:04 Yes, there has never been anything about "sensitive" data in the compliance spec. 17:20:05 thanks Shane. it was a while ago and pre-dates many folks joining the group. 
if needed the minutes are out there, but my eagerness to volunteer to find it is not particularly high this week 17:20:12 q+ 17:20:15 chris: i have not gotten an answer to what works and protects data if hashing does not work, assuming we will have data 17:20:18 Well, apart from that one geolocation section . . . 17:20:45 khaled: in health context use probablistic encryption that permits mathematical operations on data 17:20:45 peterswire_ has joined #dnt 17:20:54 Aleecia, I likewise have not desire to volunteer on that point :-) But would be happy to argue to the same outcome as I believe it was a good decision by the group 17:20:58 ...encrypt at source in browser.... 17:21:10 Justin, agreed - not sure how that snuck through... 17:21:28 if you want to use those values to do lookup in db not possible for db owner to determine lookup result 17:21:35 efelten has joined #dnt 17:21:43 ....efficient process. not much slower than hashing. 17:21:52 ...using for lookup in large database 17:22:12 peter swire: on a wednesday call could learn about homomorphic encryption. seeing nods on this 17:22:28 dan auerbach. talking about fully homomorphic encryption? we are not close? 17:22:33 khaled: partial 17:23:14 dwainberg has joined #dnt 17:23:21 felix: also techniques like differential privacy, adding noise to data. questions whether data still useful, but also protects against some attribute disclosure: 17:23:21 My recollection is Jeff was alone at the time, perhaps one or two people with him at most, and the rest of the group either had the view you have, Shane, or came up with "we don't care, let's talk about something more interesting" 17:23:30 efelten_ has joined #dnt 17:23:38 Q? 
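A minimal sketch of "adding noise to data" in the differential privacy sense, assuming the standard Laplace mechanism applied to a count query with sensitivity 1. The epsilon value, the true count, and the fixed random seed are assumptions for illustration. This is not a production-grade implementation (for example, it ignores floating-point subtleties).

```python
import random


def laplace_noise(scale: float, rng: random.Random) -> float:
    """Laplace(0, scale) sample, as the difference of two exponentials."""
    return rng.expovariate(1.0 / scale) - rng.expovariate(1.0 / scale)


def noisy_count(true_count: float, epsilon: float, rng: random.Random,
                sensitivity: float = 1.0) -> float:
    """Release a count with Laplace noise of scale sensitivity / epsilon."""
    return true_count + laplace_noise(sensitivity / epsilon, rng)


rng = random.Random(0)  # fixed seed only so the sketch is repeatable
samples = [noisy_count(1000, epsilon=0.5, rng=rng) for _ in range(20_000)]
average = sum(samples) / len(samples)
```

Each individual release is perturbed, but the average over many releases stays close to the true count, which illustrates the trade-off raised above between protection and the remaining usefulness of the data. Knowing the noise distribution does not reveal the particular noise value drawn for a given release.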
17:23:54 jeff: with encryption or data modificiaton the criticism of hashing is that if you have key or access you can get around, and same is true for other methods, for example keys 17:24:00 peterswire has joined #dnt 17:24:14 felix: not wrt noise, which you can't figure out even if you know how noise was added 17:24:22 peterswire has joined #dnt 17:24:34 ed: lets put off discussion on how works 17:24:49 david : interesting but jumping to solution without identifying problems 17:25:18 felix: noticing that there is symmetry to this. many techniques improve privacy but limit value of data. 17:25:22 WileyS, at some point we'll have to go back and revisit that piece. 17:25:37 ....homomorphic encryption does not presreve ability to do many things with data 17:25:51 Justin, we'll never finish this standard if we attempt to define what is "sensitive" in a global marketplace - good luck with that. 17:26:04 felix: what use are we trying to preserve once data is de-identified. some uses will be preserved, others not 17:26:09 ack jmayer 17:26:11 The geoIP part was well locked down, and then Ian rejoined and *did* have new information. 17:26:23 jmayer: will postpone since postponing methodology discussion 17:26:44 WileyS, I am not arguing that we should. 17:26:53 peter swire: thanks to khaled for coming and providing expertise. there was clear explanation of risk based approach used in other settings 17:27:15 We cannot bar geoIP since knowing where people are affects what to do if DNT is unset 17:27:28 peterswire has joined #dnt 17:27:31 ...we also i think has some terminology gain in a lot of places. de-identified or de-linked are conclusion terms that apply once you have a standard, for example in hipaa.... 
17:27:42 So we were trying to find a way to say "fine, fine, just pick a large enough geography," and then were hung up in the details on what that means 17:27:50 .....we also had variety of other terms about direct identifiers and quasi identifiers that will be helpful.... 17:28:05 ....heard interest in presentation for homomorphic encryption... 17:28:30 ...also heard suggestion re: doing pieces of that one slide--what are harms, risks, people are concerned about, and 17:28:36 If we're going to discuss methodologies, differential privacy and privacy-preserving implementations should make the cut. 17:28:50 ...in particular for online setting develop use cases we should care about if we are to get to homomorphic encryption. 17:29:07 ....any other action items? 17:29:08 mnolet_ has joined #dnt 17:29:37 ...if you have them after the meeting i welcome those. we are heading to f2f mtg, and want to make progress on this in advance... 17:29:45 -bryan 17:29:47 thanks, Peter! 17:29:49 ....thanks to cdt, khaled, all who came 17:29:50 - +1.631.803.aakk 17:29:51 johnsimpson has left #dnt 17:29:52 -Brooks 17:29:54 -Peder_Magee 17:29:55 and thanks Susan for scribing so much! 17:30:04 - +1.215.286.aaee 17:30:05 -rvaneijk 17:30:07 -Aleecia 17:30:12 rrsagent, make logs public 17:30:18 -vincent 17:30:25 rrsagent, set logs would visible 17:30:36 (you want public) 17:30:40 rrsagent, draft minutes 17:30:40 I have made the request to generate http://www.w3.org/2013/01/17-DNT-minutes.html yianni 17:31:14 -Jonathan_Mayer 17:31:22 -vinay 17:31:39 Ho-Chun_Ho_ has left #dnt 17:31:58 -SusanIsrael 17:34:21 -WileyS 17:45:39 -DAvid 17:50:11 peterswire has joined #dnt 17:56:13 -moneill2 18:05:00 disconnecting the lone participant, [CDT], in Team_(dnt)14:00Z 18:05:02 Team_(dnt)14:00Z has ended 18:05:02 Attendees were Jonathan_Mayer, [GVoice], rvaneijk, +1.425.214.aaaa, Aleecia, +1.202.587.aabb, WileyS, +1.631.803.aacc, [CDT], +1.215.796.aadd, bryan, +1.215.286.aaee, vincent, 18:05:02 ... 
Peder_Magee, +1.202.257.aaff, +1.646.722.aagg, Brooks, +1.917.934.aahh, vinay, +1.646.654.aaii, +1.215.286.aajj, moneill2, SusanIsrael, DAvid, +1.631.803.aakk 18:05:08 efelten has joined #dnt 18:05:55 JoeHallCDT has joined #DNT 18:16:54 efelten has joined #dnt 18:35:59 JoeHallCDT has left #dnt 18:40:12 dwainberg has joined #dnt 18:43:50 dwainber_ has joined #dnt 18:51:42 efelten has joined #dnt 18:55:10 mnolet has joined #dnt 19:12:16 robsherman has joined #dnt 19:31:46 Zakim has left #dnt 19:43:27 npdoty has joined #dnt 20:00:43 dsinger has joined #dnt 21:50:51 hwest has joined #dnt