Tracking Protection Working Group teleconference -- 16 Jan 2013

<kulick> 1.408.836.aaaa -> brad kulick

<Walter> I hear some underage participant

<BrendanIAB> P39 is Mattias I think

<Chris_IAB> Chris Mejia just joined the call from 212

<BrendanIAB> npdoty - I think the keyword you're looking for is "probably" rather than "maybe"

<scribe> scribe: JC

Peterswire: Put in IRC any new phone numbers

<aleecia> please mute :-)

Peterswire: please be on mute if you are talking locally
... scribes will be selected before calls
... hello to everyone
... we will be looking at de-identification issues, which will be important to future calls

<npdoty> issue-191?

<trackbot> ISSUE-191 -- Non-normative Discussion of De-Identification -- raised

<trackbot> http://www.w3.org/2011/tracking-protection/track/issues/191

Peterswire: a new issue 191 was created for this
... for linkability and de-identification
... it is important to get clarity around definitions and problems that have come up

<rvaneijk> Is there a URL with info to participate remotely tomorrow for the de-ID workshop?

Peterswire: two major reports have been sent out on this
... the US document will be discussed today

<aleecia> David I had to try a few times too

Peterswire: Deven McGraw from CDT was involved in it
... the second one was from UK ICO
... links are in today's agenda

<npdoty> http://www.w3.org/wiki/Privacy/De-identification

<rvaneijk> Just to make sure, the UK ICO report is for the UK only...

<rvaneijk> You can not extrapolate it to the EU..

Peterswire: hopefully we can work on advancement of common knowledge
... remember gathering at CDT

<rvaneijk> Is there a URL with info to participate remotely tomorrow for the de-ID workshop?

Peterswire: ylagos@futureofprivacy.org should be emailed if you are attending in person
... one of the rules for discussion is no normative conversations
... same call-in rules as for weekly calls

<Walter> The UK document doesn't even bind the UK

<Walter> And definitely does not bind anyone in Europe

Peterswire: the documents do not bind countries, nor are they necessarily the right way to go

<rvaneijk> the UK document is centered around its definition of personal data.

<Marc_G> Marc 202 265 2736

Peterswire: I gave sample reasons why one might be less strict to use, not to say that these are the correct answers on how we should go on DNT
... post any new documents to the Wiki Nick has setup
... issues we plan to discuss tomorrow at 9:00

<aleecia> cannot understand

<Chris_IAB> can't hear due to background noise

Peterswire: what are incentives to do de-identification
... if we understand reasons, risks, benefits, that can lead to use cases
... second topic, what are some measurements of de-identification. what are risks of reidentification
... what are goals as we define these regimes
... what are the goals of technical safeguards versus administrative safeguards
... number 4 hashing
... what kind of safeguards can it provide
... next issue, use of persistent identifiers
... how is it that various buckets can be updated when deidentification is used
... if there are other descriptive issues that should be identified send them to Peter Swire
... We circulated Deven's slides earlier
... any questions or comments?

<efelten_> Could somebody post a link to Deven's slides?

fielding: can you describe why deidentification is applicable to DNT

<Walter> +1

<Walter> as in, I'd like to get a link to the slides too

peterswire: I see it relevant in a couple of ways
... data collected online is so aggregated it is not considered tracking
... at the other end data is associated with a specific individual such as Peter

<rvaneijk> echo: Could somebody post a link to Deven's slides?

peterswire: knowing where data falls is important to the process

<aleecia> Hi Roy, we're basically working through http://www.w3.org/2011/tracking-protection/drafts/tracking-compliance.html#def-unlinkable

peterswire: second thing, in the compliance spec it is related to the various uses

<justin> The standard says it doesn't apply to data that has been deidentified/delinked. And it's been one of the most debated topics within the group. How is that not relevant?

peterswire: there can be time when data goes into a DB and it should not come out in a way that can be linked to an individual

<Chris_IAB> to the individual or to the unique ID?

peterswire: I'm not saying that the DAA rules are perfect, but they have definitions about how data goes into the system but does not come out
... in an identifiable way

<fielding> justin, because I have no interest in keeping data when DNT:1 is set other than for security purposes

<aleecia> Justin's right that we've discussed anything being permitted for de-id'ed data, but that's not nailed down as we were still working through which defn we would go through

peterswire: because the compliance spec covers the meaning of tracking and others I feel it is relevant

<Wileys> Nick, could you share the Twiki link you created - does it host the presentation being referenced?

chris_iab: when you say individual do you mean unique ID?

peterswire: I'm trying not to make decisions about what I mean

<npdoty> Wileys, wiki page is here http://www.w3.org/wiki/Privacy/De-identification some people have already added more to it

peterswire: sometimes it is associated with a machine or cookie or individual
... that is what I mean by creating a working definition
... so we know what we are referencing in conversations

<Wileys> Thank you Nick

peterswire: I will introduce Deven McGraw

<justin> fielding, I am glad to hear it! But other working group members want to do more with that data. So the previous discussions we had about product improvement/market research have now migrated to the discussion of deidentification.

peterswire: she was very involved in public hearings on deidentification

rvaneijk: Can we get a link to the slides?

<Wileys> Nick, checked the Twiki and can't find the slides

<Brooks> +1 on slides

<Wileys> +q

peterswire: I may have created an error in my email

Shane: can you post the slides?

<Wileys> -q

<Walter> just paste the link in here?

<Wileys> Thank you Peter

Peterswire: I will send these to Nick and he can post them
... Deven you can go

Deven: The slides are mostly text without math
... The guidance that was given on deidentification came from HIPAA and that is where we will start
... HIPAA protects health information in the US, but it is not a data protection law
... most of the data holders in the US are covered by HIPAA
... HealthVault and similar apps are not covered

<aleecia> Roy it's possible that some of the discussion around how to keep data protected may be interesting in the context of data held to prevent fraud / for security. Unclear to me, but I could imagine some cross-over there

Deven: the bad news is HIPAA does not cover all health data
... deven@cdt.org is my email

<susanisrael> i have muted sorry

Deven: when you have data that meets the standard for deidentification it is not covered by the law
... you can do almost anything with deidentified data
... the deidentification standard is a legal one
... there is no specific percentage risk which is established as a baseline
... risk is contextual
... there are two methods that can be used
... the expert method requires someone with expertise to document that the risk is small
... it must be determined who the data is going to and what other data they have
... safeharbor method requires removing ?? amounts of data
... I'm on slide 5
... a code can be assigned to deidentified data to allow data to be reidentified as long as the code is not derived from the individual
... and you cannot disclose the code to the entity you are giving the data to
... this provision permits healthcare entities to be able to reidentify the data when notification is required
... for example in case of an infectious disease

<npdoty> http://www.w3.org/2011/tracking-protection/HealthDe-IdentifiedDataSlides.pdf

<Chris_IAB> sorry, but have the slides that we are reviewing been distributed? I can't seem to find them?

<Chris_IAB> see in now npdoty, thanks

Deven: the assignment of codes is covered in guidance
... on slide 6 let's discuss safeharbor

<Chris_IAB> slides are not numbered

<peterswire> if you view in other mode, you can see the slide numbers

<Chris_IAB> it's a pdf peterswire; which mode are you referring to?

Deven: names, addresses, zip codes, all elements of dates (ages are okay except for the elderly), telephone number, account number, VIN, IP address, URL

<peterswire> ah, I'm viewing in powerpoint

Deven: and any other unique identifying number or code cannot be used

<aleecia> I do not think I understand this "code"

Deven: the trick with safeharbor is you have to remove all of these types of data to be covered
... if this does not work you can use the statistician method, but someone must validate the method
... safeharbor method deems that the data is deidentified and thus unregulated
... it is also a cookbook that tells you how to deidentify
... under the statistical method there are no rules for the statistician
... I have never heard of anyone being held up by a regulator because they did not properly deidentify data
... the standard is to reach low risk of reidentification, not zero risk
... requiring zero risk would remove all utility
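
A minimal sketch of the safe harbor "cookbook" idea described above: remove every listed identifier type and keep the rest. The field names are hypothetical (the minutes name categories, not a schema), the set covers only the categories mentioned on the call, and the age cutoff of 90 is an assumption standing in for "ages are okay except for the elderly".

```python
# Hypothetical field names for the identifier categories mentioned on the call.
SAFE_HARBOR_FIELDS = {
    "name", "address", "zip_code", "birth_date", "admission_date",
    "telephone", "account_number", "vin", "ip_address", "url",
}

ELDERLY_AGE_CUTOFF = 90  # assumption: ages at or above this are top-coded


def strip_safe_harbor_fields(record):
    """Return a copy of the record without the listed identifier fields."""
    cleaned = {k: v for k, v in record.items() if k not in SAFE_HARBOR_FIELDS}
    # Ages are kept unless they single out the elderly.
    age = cleaned.get("age")
    if isinstance(age, int) and age >= ELDERLY_AGE_CUTOFF:
        cleaned["age"] = "90+"
    return cleaned


record = {
    "name": "Pat Example",
    "zip_code": "02139",
    "ip_address": "192.0.2.1",
    "age": 93,
    "diagnosis": "influenza",
}
print(strip_safe_harbor_fields(record))
```

Removing fields this way is mechanical, which is why the method works as a cookbook. As noted on the call, it also removes dates, which is what pushes analytics users toward the statistician method.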

<aleecia> Ahh. 1999. Before a lot of the re-identification work had happened.

Deven: provides rules for contractors
... data use agreements are not required, but a data holder may require an agreement for deidentification
... slide 12 guidance covers who is an expert
... no specific degree or level of education is required, but they will look at that in a review
... no numeric target is given for risk

<npdoty> aleecia, isn't this much more recent guidance? I'm hearing explicit acknowledgement of re-identification -- low risk, not no risk

Deven: multiple algorithms can be used in a single dataset
... as long as datasets cannot be combined for reidentification
... slide 13 shows dataflow

<efelten_> Nick, one example of outdated thinking is the discussion of k-anonymity.

Deven: deidentification can be iterative
... an agreement cannot be a tool of deidentification

<aleecia> the guidance is more recent, I agree. The original text was from '99. That explains why there would be an identifier added back after doing all the de-identifying work -- the risk of that was likely not really appreciated at the same level in '99

Deven: slide 14 and 2.9 of guidance
... you cannot assign a code that is given away with the data

<aleecia> And here we are :-)

<aleecia> So it sounds like they're trying to fix it

<npdoty> efelten, forgive my ignorance, why is discussing k-anonymity outdated?

<peterswire> this is 2012 guidance; original rule drafted in 1999/2000

Deven: however you can disclose a code that has been derived from the data as long as the code and data meet low risk standard
... you can take protected health information and transform it into values for cryptographic hash functions

<efelten_> k-anonymity does not imply any limitation on the analyst's ability to infer sensitive data about individuals, for one thing.
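
A small sketch of the point made about k-anonymity above, with made-up rows and hypothetical column names (`zip3`, `age_band`, `diagnosis`). The table is 3-anonymous on its quasi-identifiers, yet one group shares a single diagnosis, so an analyst who knows a person is in that group learns the sensitive value without re-identifying any row.

```python
from collections import defaultdict


def k_anonymity(rows, quasi_identifiers):
    """Group rows by quasi-identifiers; return (smallest group size, groups)."""
    groups = defaultdict(list)
    for row in rows:
        groups[tuple(row[q] for q in quasi_identifiers)].append(row)
    return min(len(g) for g in groups.values()), groups


# Illustrative data only.
rows = [
    {"zip3": "021", "age_band": "30-39", "diagnosis": "asthma"},
    {"zip3": "021", "age_band": "30-39", "diagnosis": "asthma"},
    {"zip3": "021", "age_band": "30-39", "diagnosis": "asthma"},
    {"zip3": "946", "age_band": "40-49", "diagnosis": "diabetes"},
    {"zip3": "946", "age_band": "40-49", "diagnosis": "flu"},
    {"zip3": "946", "age_band": "40-49", "diagnosis": "asthma"},
]

k, groups = k_anonymity(rows, ["zip3", "age_band"])

# Groups in which every member has the same sensitive value still leak it.
leaky = [
    key for key, members in groups.items()
    if len({m["diagnosis"] for m in members}) == 1
]
print("k =", k, "leaky groups:", leaky)
```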

Deven: but do not give away the formula or hash
... slide 16: remember, when you are using safeharbor to remove 18 types of data, you still have to know if the data can be reidentified
... structured data and free text fields are covered by deidentification rules
... deidentification is aimed at protecting patients and families not staff
... HIPAA rules do not cover healthcare providers
... I will let you know when the guidance does not cover something
... the agency did what congress asked them to do and nothing more

<aleecia> Some of this is really good. But it starts from a point of trying to create incentives for de-id'ing data, presumably because aggregate health information has so much public benefit. Bit different here, but very very interesting to hear what they did

Peterswire: Under safeharbor IP address is PHI. What about cookies or browsing habits?

Deven: there is no guidance on that

<npdoty> I didn't understand IP address as personal health information, but just as information that would have to be removed to de-identify

Deven: you would need to look at what is being examined
... the hospital's website would not necessarily be covered

Peterswire: Is knowing where the patient is logging in from covered

<efelten_> URLs are covered as PHI, right?

Deven: Since web data is covered this could be covered

<efelten_> Or at least URLs are one of the things that have to be removed under the safe harbor.

Deven: that is why there is the catch-all category to catch these types of things, such as cookies

<npdoty> efelten, the latter, yes

Peterswire: have people used one method over the other?

<Wileys> The HIPAA standard for de-identification is focused on 'External Sharing' - whereas our discussions have centered around de-identification for data that is not to be shared externally. I believe it makes sense to have two standards here: internal vs. external

<moneill2> guid in cookie obv. can be used to re-identify

Deven: The analytical folks tend to use statistician method because they need dates

<aleecia> Shane, I could imagine that working

Deven: similarly understanding health trends is difficult with safeharbor method
... best analytics is done with statistically deidentified data

<Marc_G> What about Shane's question or point above?

Peterswire: Can you explain if salts are required with hashing

Deven: I believe the guidance states if you are using a hash, after you hand the data to a recipient they should not be able to reidentify the data
... the risk should be very low and examples are provided for when codes can be provided

<efelten_> In healthcare, providers are given different treatment because they have informed consent from the patient.

Deven: for hashes you cannot provide the key or salt
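
A minimal sketch of why the key or salt matters, assuming SHA-256 and HMAC from the Python standard library; the guidance as summarized here does not name an algorithm, so these are illustrative choices, and the phone-number format is made up. A plain hash of a value from a small domain can be reversed by hashing every candidate. A keyed hash resists that as long as the key stays with the data holder.

```python
import hashlib
import hmac
import secrets


def plain_hash(value):
    """Unkeyed hash: anyone can recompute it from a guessed value."""
    return hashlib.sha256(value.encode("utf-8")).hexdigest()


def keyed_code(value, key):
    """Keyed hash (HMAC): recomputing it requires the secret key."""
    return hmac.new(key, value.encode("utf-8"), hashlib.sha256).hexdigest()


def dictionary_attack(code, candidates):
    """Recipient's attack: hash every plausible value and compare."""
    for candidate in candidates:
        if plain_hash(candidate) == code:
            return candidate
    return None


key = secrets.token_bytes(32)  # kept by the data holder, never disclosed
candidates = [f"555-01{n:02d}" for n in range(100)]  # small, guessable domain
phone = "555-0142"

print(dictionary_attack(plain_hash(phone), candidates))       # value recovered
print(dictionary_attack(keyed_code(phone, key), candidates))  # None
```

The same key must be reused if the data holder wants the same input to map to the same code across releases, which is also what lets the holder re-identify when notification is required.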

<npdoty> scribenick: npdoty

<Wileys> Ed, if the URL is non-specific to a user, then this would not have to be removed (meets 'very low risk' standard)

peterswire: regarding data-use agreements under HIPAA, when does de-identification happen vs. data-use agreements?

deven: data-use agreement is not required when you've reached de-identification (statistically to low risk, or under safe harbor)

<rvaneijk> Shane, a dataset with full URL's contains behavioral information, which is specific to a user

deven: you don't need to execute an agreement with the recipient of your data, they don't need to commit not to re-identify
... if you want to use a data-use agreement as an extra measure of caution, you can do that

<JC_> test

deven: enforced as a matter of contract
... can't use the data-use agreement to get to the low risk of de-identification
... gray area regarding "anticipated recipient"
... because there might be other people who can reidentify this data but you can't
... still raises questions about whether the agreement can limit recipients in a way that changes your statistical needs

<Wileys> Rob - as long as the recipient is not able to leverage the URL history to re-identify the user then it does not need to be stripped.

peterswire: how much the expert's methodology should be public. what level of transparency is required?

deven: not required to document the methodology, but required to maintain evidence for use in response to regulators [did scribe get that right?]
... certainly been to many conferences where computer scientists will share those methodologies for feedback

<aleecia> Shane - it turns out URL history is an effective fingerprint. If "able to" is the threshold, then URLs are certainly going to need to be stripped

deven: if you're willing to attest, put your name as a statistician, you don't have to document the method
... not specified what level of attestation is needed
... I would want enough documentation as the data holder to respond to regulators who knock on my door
... a handful of people who do this on a regular basis, and everybody uses them

<Wileys> Aleecia - if I give you a handful of URLs and ask you to re-identify the individual they belong to, I doubt you'd be able to. This is the recipient test.

deven: gives legal comfort to pick someone who has been regularly used

<efelten_> Actually the test is: if you give her all of your data, can she re-identify.

<Wileys> Ed, agreed - the assembly of the specific data elements is a key factor

<vincent> Wileys, if you include the the timestamps I bet you could re-identify someone even with a few urls

<Chris_IAB> does anyone else hear that?!

<Chris_IAB> missed everything you said during noise

<Walter> we certainly did

peterswire: q regarding categories of information

<Wileys> Vincent, I'm not sure I agree but this does align with my conversation with Ed on the assembly of data elements is key to the determination of "very low risk"

<rvaneijk> WIley, that is the whole point of pixel tagging

deven: not all holders, aimed at hospitals and doctors, and the records they use to treat patients and pay healthcare claims
... of the data that's in those types of records, what elements are most likely to be re-identifiable

<aleecia> Shane - it turns out people visit the same few sites in the long tail. So for me, that's going to be a specific set of four web comics. :-) The set of sites people visit is persistent and often unique

<Walter> justin: your cat sat on the phone?

<Wileys> Rob - pixel tagging through a unique cookie ID is meaningful to me - but since you as a recipient don't have access to my cookie ID platform would not allow you to re-identify an individual

deven: safe harbor categories came around after Latanya Sweeney's reidentification of the governor's record
... data elements that she used are now listed in the safe harbor
... but as we increase the amount of data in the external world, we shouldn't assume every year that the safe harbor makes it a very low risk
... but a lot of public databases are not covered by HIPAA

peterswire: some discussion regarding date of birth, different from other data fields in that it splits the population into 25,000 cells
... what kind of data can be easily searched on the outside? when you're coming up with your definition of very low risk, demographic data or data that lasts with you for a long time is a higher risk
... persists longer and is more easily obtainable from other sources
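
A rough arithmetic sketch of the date-of-birth point above. The figure of 25,000 cells is presumably about 365 days times roughly 70 birth years; that reading, the 50,000-person population, and the assumption that people are spread evenly across cells are all illustrative assumptions, not figures from the call.

```python
DAYS_PER_YEAR = 365.25


def dob_cells(birth_year_span):
    """Number of distinct birth dates across a span of birth years."""
    return int(birth_year_span * DAYS_PER_YEAR)


def expected_people_per_cell(population, cells):
    """Average cell size if people were spread evenly across cells."""
    return population / cells


cells = dob_cells(70)
population = 50_000  # hypothetical town

print("cells from date of birth alone:", cells)
print("average people per cell:", round(expected_people_per_cell(population, cells), 2))

# Each extra field multiplies the number of cells; a two-valued field doubles it.
print("with a two-valued field added:",
      round(expected_people_per_cell(population, cells * 2), 2))
```

When the average cell holds only a couple of people, many cells hold exactly one, which is why a long-lived and easily obtained field such as date of birth raises the risk.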

deven: that's the level of detail in discussion of the statistical methodology

peterswire: thanks very much to Deven
... in person availability in Brussels; next Thursday or Friday, will provide more information

<fielding> I have no doubt that understanding deidentification is useful in general for the privacy of all users [not just those sending DNT]. I don't believe discussing it here is useful because I don't see us redefining what it means in our specs. That's in stark contrast to defining tracking, which hasn't been defined by anyone else, we are specifically chartered to define, and we aren't going to make any real progress until we do. And, no, I don't think that

<fielding> unlinkability is relevant just because someone made it an issue for TCS.

peterswire: I'm not available next Wednesday, Matthias will have a technical call at the usual time
... questions or comments?

<fielding> What about the MIT meeting?

<tedleung> any more details on the f2f?

<adrianba> is there logistics information for the f2f?

peterswire: thanks everybody

<phildpearce> thanks

<bryan> Hi Nick did you see my message?

<npdoty> I'm hearing questions about MIT logistics, and will follow up on the mailing list

<aleecia> thanks, Nick!

<bryan> thanks
