See also: IRC log
<kulick> 1.408.836.aaaa -> brad kulick
<Walter> I hear some underage participant
<BrendanIAB> P39 is Mattias I think
<Chris_IAB> Chris Mejia just joined the call from 212
<BrendanIAB> npdoty - I think the keyword you're looking for is "probably" rather than "maybe"
<scribe> scribe: JC
Peterswire: Put in IRC any new phone numbers
<aleecia> please mute :-)
Peterswire: please be on mute if
you are talking locally
... scribes will be selected before calls
... hello to everyone
... we will be looking at de-identification issues which will be important to future call
<trackbot> ISSUE-191 -- Non-normative Discussion of De-Identification -- raised
Peterswire: a new issue 191 was
created for this
... for linkability and de-identification
... it is important to get clarity around definitions and problems that have come up
<rvaneijk> Is there a URL with info to participate remotely tomorrow for the de-ID workshop?
Peterswire: two major reports
have been sent out on this
... the US document will be discussed today
<aleecia> David I had to try a few times too
Peterswire: Deven McGraw from CDT
was involved in it
... the second one was from UK ICO
... links are in today's agenda
<rvaneijk> Just to make sure, the UK ICO report is for the UK only...
<rvaneijk> You can not extrapolate it to the EU..
Peterswire: hopefully we can work
on advancement of common knowledge
... remember gathering at CDT
<rvaneijk> Is there a URL with info to participate remotely tomorrow for the de-ID workshop?
firstname.lastname@example.org should be emailed if you are
attending in person
... one of the rules for discussion is no normative conversations
... same call in rules for weekly calls
<Walter> The UK document doesn't even bind the UK
<Walter> And definitely does not bind anyone in Europe
Peterswire: the documents do not bind countries or are necessarily the right way to go
<rvaneijk> the UK document is centered around its definition of personal data.
<Marc_G> Marc 202 265 2736
Peterswire: I gave sample reasons why one might be less strict to use, not to say that these are the correct answers on how we should go on DNT
... post any new documents to the Wiki Nick has setup
... issues we plan to discuss tomorrow at 9:00
<aleecia> cannot understand
<Chris_IAB> can't hear due to background noise
Peterswire: what are incentives
to do de-identification
... if we understand reasons, risks, benefits, that can lead to use cases
... second topic, what are some measurements of de-identification. what are risks of reidentification
... what are goals as we define these regimes
... what are goals technical safeguards versus administrative safeguards
... number 4 hashing
... what kind of safeguards can it provide
... next issue, use of persistent identifiers
... how is it that various buckets can be updated when deidentification is used
... if there are other descriptive issues that should be identified send them to Peter Swire
... We circulated Deven's slides earlier
... any questions or comments?
<efelten_> Could somebody post a link to Deven's slides?
fielding: can you describe why deidentification is applicable to DNT?
<Walter> as in, I'd like to get a link to the slides too
peterswire: I see it relevant in
a couple of ways
... data collected online is so aggregated it is not considered tracking
... at the other end data is associated with a specific individual such as Peter
<rvaneijk> echo: Could somebody post a link to Deven's slides?
peterswire: knowing where data falls is important to the process
<aleecia> Hi Roy, we're basically working through http://www.w3.org/2011/tracking-protection/drafts/tracking-compliance.html#def-unlinkable
peterswire: second thing, in the compliance spec it is related to the various uses
<justin> The standard says it doesn't apply to data that has been deidentified/delinked. And it's been one of the most debated topics within the group. How is that not relevant?
peterswire: there can be time when data goes into a DB and it should not come out in a way that can be linked to an individual
<Chris_IAB> to the individual or to the unique ID?
peterswire: I'm not saying that the DAA rules are perfect, but they have definitions about how data goes into the system but does not come out
... in an identifiable way
<fielding> justin, because I have no interest in keeping data when DNT:1 is set other than for security purposes
<aleecia> Justin's right that we've discussed anything being permitted for de-id'ed data, but that's not nailed down as we were still working through which defn we would go through
peterswire: because the compliance spec covers the meaning of tracking and others I feel it is relevant
<Wileys> Nick, could you share the Twiki link you created - does it host the presentation being referenced?
chris_iab: when you say individual do you mean unique ID?
peterswire: I'm trying not to make decisions about what I mean
<npdoty> Wileys, wiki page is here http://www.w3.org/wiki/Privacy/De-identification some people have already added more to it
peterswire: sometimes it is
associated with a machine or cookie or individual
... that is what I mean by creating a working definition
... so we know what we are referencing in conversations
<Wileys> Thank you Nick
peterswire: I will introduce Deven McGraw
<justin> fielding, I am glad to hear it! But other working group members want to do more with that data. So the previous discussions we had about product improvement/market research have now migrated to the discussion of deidentification.
peterswire: she was very involved in public hearings on deidentification
rvaneijk: Can we get a link to the slides?
<Wileys> Nick, checked the Twiki and can't find the slides
<Brooks> +1 on slides
peterswire: I may have created an error in my email
Shane: can you post the slides?
<Walter> just paste the link in here?
<Wileys> Thank you Peter
Peterswire: I will send these to
Nick and he can post them
... Deven you can go
Deven: The slides are mostly text
... The guidance that was given on deidentification came from HIPAA and that is where we will start
... HIPAA protects health information in the US, but it is not a data protection law
... most of the data holders in the US are covered by HIPAA
... HealthVault and similar apps are not covered
<aleecia> Roy it's possible that some of the discussion around how to keep data protected may be interesting in the context of data held to prevent fraud / for security. Unclear to me, but I could imagine some cross-over there
Deven: the bad news is HIPAA does
not cover all health data
... email@example.com is my email
<susanisrael> i have muted sorry
Deven: when you have data that
meets the standard for deidentification it is not covered by
... you can do almost anything with deidentified data
... the deidentification standard is a legal one
... there is no specific percentage risk which is established as a baseline
... risk is contextual
... there are two methods that can be used
... the expert method requires someone with expertise to document that the risk is small
... it must be determined who the data is going to and what other data they have
... safeharbor method requires removing ?? amounts of data
... I'm on slide 5
... a code can be assigned to deidentified data to allow data to be reidentified as long as the code is not derived from the individual
... and you cannot disclose the code to the entity you are giving the data to
... this provision permits healthcare entities to be able to reidentify the data when notification is required
... for example in case of an infectious disease
<Chris_IAB> sorry, but have the slides that we are reviewing been distributed? I can't seem to find them?
<Chris_IAB> see it now npdoty, thanks
Deven: the assignment of codes is
covered in guidance
... on slide 6 let's discuss safeharbor
<Chris_IAB> slides are not numbered
<peterswire> if you view in other mode, you can see the slide numbers
<Chris_IAB> it's a pdf peterswire; which mode are you referring to?
Deven: names, addresses, zip codes, all elements of dates, ages are okay except for the elderly, telephone number, account number, VIN, IP address, URL
<peterswire> ah, I'm viewing in powerpoint
Deven: and any other unique identifying number or code cannot be used
<aleecia> I do not think I understand this "code"
Deven: the trick with safeharbor
is you have to remove all of these types of data to be
... if this does not work you can use the statistician method, but someone must validate the method
... safeharbor method deems that the data is deidentified and thus unregulated
... it is also a cookbook that tells you how to deidentify
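[Editorial illustration] The "cookbook" character of the safe-harbor method can be sketched roughly as below. This is only a minimal sketch, assuming each record is a flat Python dict. The field names and the `SAFE_HARBOR_FIELDS` set are hypothetical and loosely based on the categories read out on the call; they are not the complete list of identifier types in the HIPAA rule, and `strip_safe_harbor` is an illustrative helper, not anything defined by the guidance.

```python
# Hypothetical field names, loosely based on the categories mentioned on the call.
# This is NOT the complete HIPAA safe-harbor list.
SAFE_HARBOR_FIELDS = {
    "name", "address", "zip_code", "birth_date", "admission_date",
    "telephone", "account_number", "vin", "ip_address", "url",
}


def strip_safe_harbor(record: dict) -> dict:
    """Return a copy of the record without any field named in SAFE_HARBOR_FIELDS."""
    return {k: v for k, v in record.items() if k not in SAFE_HARBOR_FIELDS}


# Made-up example record; only the non-listed fields survive.
record = {
    "name": "Example Patient",
    "zip_code": "00000",
    "ip_address": "192.0.2.1",
    "diagnosis": "influenza",
    "age": 42,
}
deidentified = strip_safe_harbor(record)
print(deidentified)
```

Removing a fixed set of fields is simple to audit, which is presumably part of the appeal, but as noted in the discussion it also discards fields such as dates that analysts often need.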
... under the statistical method there are no rules for the statistician
... I have never heard of anyone being held up by a regulator because they did not properly deidentify data
... the standard is to reach low risk of reidentification, not zero risk
... requiring zero risk would remove all utility
<aleecia> Ahh. 1999. Before a lot of the re-identification work had happened.
Deven: provides rules for
... data use agreements are not required, but a data holder may require an agreement for deidentification
... slide 12 guidance covers who is an expert
... no specific degree or level of education is required, but they will look at that in a review
... no numeric target is given for risk
<npdoty> aleecia, isn't this much more recent guidance? I'm hearing explicit acknowledgement of re-identification -- low risk, not no risk
Deven: multiple algorithms can be used in a single dataset
... as long as datasets cannot be combined for reidentification
... slide 13 shows dataflow
<efelten_> Nick, one example of outdated thinking is the discussion of k-anonymity.
Deven: deidentification can be
... an agreement cannot be a tool of deidentification
<aleecia> the guidance is more recent, I agree. The original text was from '99. That explains why there would be an identifier added back after doing all the de-identifying work -- the risk of that was likely not really appreciated at the same level in '99
Deven: slide 14 and 2.9 of
... you cannot assign a code that is given away with the data
<aleecia> And here we are :-)
<aleecia> So it sounds like they're trying to fix it
<npdoty> efelten, forgive my ignorance, why is discussing k-anonymity outdated?
<peterswire> this is 2012 guidance; original rule drafted in 1999/2000
Deven: however you can disclose a
code that has been derived from the data as long as the code
and data meet low risk standard
... you can take protected health information and transform it into values for cryptographic hash functions
<efelten_> k-anonymity does not imply any limitation on the analyst's ability to infer sensitive data about individuals, for one thing.
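[Editorial illustration] The point about k-anonymity can be shown with a small made-up table. This is a sketch under stated assumptions: the rows, the column names (`age_band`, `region`, `diagnosis`) and the helper `k_anonymity` are all hypothetical. The table is 3-anonymous over the quasi-identifiers, yet everyone in one group shares the same diagnosis, so an analyst who knows a person falls in that group learns the sensitive value without re-identifying any single row.

```python
from collections import defaultdict


def k_anonymity(rows, quasi_identifiers):
    """Return (k, groups): k is the size of the smallest group of rows
    that share the same values for all quasi-identifiers."""
    groups = defaultdict(list)
    for row in rows:
        key = tuple(row[q] for q in quasi_identifiers)
        groups[key].append(row)
    return min(len(g) for g in groups.values()), groups


# Made-up data for illustration only.
rows = [
    {"age_band": "30-39", "region": "North", "diagnosis": "diabetes"},
    {"age_band": "30-39", "region": "North", "diagnosis": "diabetes"},
    {"age_band": "30-39", "region": "North", "diagnosis": "diabetes"},
    {"age_band": "40-49", "region": "South", "diagnosis": "asthma"},
    {"age_band": "40-49", "region": "South", "diagnosis": "flu"},
    {"age_band": "40-49", "region": "South", "diagnosis": "asthma"},
]

k, groups = k_anonymity(rows, ["age_band", "region"])
print("k =", k)
for key, members in groups.items():
    diagnoses = {m["diagnosis"] for m in members}
    # A group with a single distinct diagnosis leaks it for every member.
    print(key, "distinct diagnoses:", len(diagnoses))
```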
Deven: but do not give away the
formula or hash
... slide 16 remember when you are using safeharbor to remove 18 types of data you have to know if the data can be reidentified
... structured data and free text fields are covered by deidentification rules
... deidentification is aimed at protecting patients and families not staff
... HIPAA rules do not cover healthcare providers
... I will let you know when the guidance does not cover something
... the agency did what congress asked them to do and nothing more
<aleecia> Some of this is really good. But it starts from a point of trying to create incentives for de-id'ing data, presumably because aggregate health information has so much public benefit. Bit different here, but very very interesting to hear what they did
Peterswire: Under safeharbor IP address is PHI. What about cookies or browsing habits?
Deven: there is no guidance on that
<npdoty> I didn't understand IP address as personal health information, but just as information that would have to be removed to de-identify
Deven: you would need to look at
what is being examined
... the hospital's website would not necessarily be covered
Peterswire: Is knowing where the patient is logging in from covered
<efelten_> URLs are covered as PHI, right?
Deven: Since web data is covered this could be covered
<efelten_> Or at least URLs are one of the things that have to be removed under the safe harbor.
Deven: that is why there is the catch-all category to catch these types of things, such as cookies
<npdoty> efelten, the latter, yes
Peterswire: have people used one method over the other?
<Wileys> The HIPAA standard for de-identification is focused on 'External Sharing' - whereas our discussions have centered around de-identification for data that is not to be shared externally. I believe it makes sense to have two standards here: internal vs. external
<moneill2> guid in cookie obv. can be used to re-identify
Deven: The analytical folks tend to use statistician method because they need dates
<aleecia> Shane, I could imagine that working
Deven: similarly understanding health trends is difficult with safeharbor method
... best analytics is done with statistically deidentified data
<Marc_G> What about Shane's question or point above?
Peterswire: Can you explain if salts are required with hashing
Deven: I believe the guidance
states if you are using a hash, after you hand the data to a
recipient they should not be able to reidentify the data
... the risk should be very low and examples are provided for when codes can be provided
<efelten_> In healthcare, providers are given different treatment because they have informed consent from the patient.
Deven: for hashes you cannot provide the key or salt
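[Editorial illustration] One way to read the hashing discussion is as a keyed hash whose key never leaves the data holder. The sketch below uses HMAC-SHA256 from the Python standard library; that choice, the `pseudonym` helper and the `patient-0001` style identifiers are assumptions made for illustration and are not prescribed by the guidance discussed on the call.

```python
import hashlib
import hmac
import secrets

# The secret key stays with the data holder and is never given to the recipient.
SECRET_KEY = secrets.token_bytes(32)


def pseudonym(identifier: str, key: bytes = SECRET_KEY) -> str:
    """Derive a stable code from an identifier using HMAC-SHA256."""
    return hmac.new(key, identifier.encode("utf-8"), hashlib.sha256).hexdigest()


code_a = pseudonym("patient-0001")
code_b = pseudonym("patient-0001")
code_c = pseudonym("patient-0002")
# The same identifier hashed under a different key yields an unrelated code.
other_key_code = pseudonym("patient-0001", key=secrets.token_bytes(32))

print(code_a == code_b)          # same input, same key: codes match
print(code_a == code_c)          # different input: codes differ
print(code_a == other_key_code)  # different key: codes differ
```

Without the key, a recipient cannot recompute codes from guessed identifiers, whereas an unkeyed hash over a small identifier space can be reversed by hashing every candidate value.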
<npdoty> scribenick: npdoty
<Wileys> Ed, if the URL is non-specific to a user, then this would not have to be removed (meets 'very low risk' standard)
peterswire: regarding data-use agreements under HIPAA, when does de-identification happen vs. data-use agreements?
deven: data-use agreement is not required when you've reached de-identification (statistically to low risk, or under safe harbor)
<rvaneijk> Shane, a dataset with full URL's contains behavioral information, which is specific to a user
deven: you don't need to execute
an agreement with the recipient of your data, they don't need
to commit not to re-identify
... if you want to use a data-use agreement as an extra measure of caution, you can do that
deven: enforced as a matter of
... can't use the data-use agreement to get to the low risk of de-identification
... gray area regarding "anticipated recipient"
... because there might be other people who can reidentify this data but you can't
... still raises questions about whether the agreement can limit recipients in a way that changes your statistical needs
<Wileys> Rob - as long as the recipient is not able to leverage the URL history to re-identify the user then it does not need to be stripped.
peterswire: how much the expert's methodology should be public. what level of transparency is required?
deven: not required to document
the methodology, but required to maintain evidence for use in
response to regulators [did scribe get that right?]
... certainly been to many conferences where computer scientists will share those methodologies for feedback
<aleecia> Shane - it turns out URL history is an effective fingerprint. If "able to" is the threshold, then URLs are certainly going to need to be stripped
deven: if you're willing to
attest, put your name as a statistician, you don't have to
document the method
... not specified what level of attestation is needed
... I would want enough documentation as the data holder to respond to regulators who knock on my door
... a handful of people who do this on a regular basis, and everybody uses them
<Wileys> Aleecia - if I give you a handful of URLs and ask you to re-identify the individual they belong to, I doubt you'd be able to. This is the recipient test.
deven: gives legal comfort to pick someone who has been regularly used
<efelten_> Actually the test is: if you give her all of your data, can she re-identify.
<Wileys> Ed, agreed - the assembly of the specific data elements is a key factor
<vincent> Wileys, if you include the timestamps I bet you could re-identify someone even with a few urls
<Chris_IAB> does anyone else hear that?!
<Chris_IAB> missed everything you said during noise
<Walter> we certainly did
peterswire: q regarding categories of information
<Wileys> Vincent, I'm not sure I agree but this does align with my conversation with Ed on the assembly of data elements is key to the determination of "very low risk"
<rvaneijk> Wiley, that is the whole point of pixel tagging
deven: not all holders, aimed at
hospitals and doctors, and the records they use to treat
patients and pay healthcare claims
... of the data that's in those types of records, what elements are most likely to be re-identifiable
<aleecia> Shane - it turns out people visit the same few sites in the long tail. So for me, that's going to be a specific set of four web comics. :-) The set of sites people visit is persistent and often unique
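[Editorial illustration] The claim that the set of sites a person visits can act as a fingerprint can be checked on any dataset by counting how many users have a set shared with nobody else. The sketch below uses made-up histories; the user labels, the `.example` hostnames and the `unique_fraction` helper are hypothetical.

```python
from collections import Counter


def unique_fraction(histories: dict) -> float:
    """Fraction of users whose set of visited sites is shared with no other user."""
    fingerprints = {user: frozenset(sites) for user, sites in histories.items()}
    counts = Counter(fingerprints.values())
    unique = sum(1 for fp in fingerprints.values() if counts[fp] == 1)
    return unique / len(fingerprints)


# Made-up browsing histories for illustration only.
histories = {
    "u1": ["news.example", "mail.example"],
    "u2": ["mail.example", "news.example"],
    "u3": ["news.example", "comic-a.example", "comic-b.example"],
    "u4": ["news.example", "comic-c.example"],
}
print(unique_fraction(histories))
```

In this toy data the two users who visit only popular sites are indistinguishable, while the two with long-tail sites are each unique.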
<Walter> justin: your cat sat on the phone?
<Wileys> Rob - pixel tagging through a unique cookie ID is meaningful to me - but since you as a recipient don't have access to my cookie ID platform would not allow you to re-identify an individual
deven: safe harbor categories
came around after Latanya Sweeney's reidentification of the
... data elements that she used are now listed in the safe harbor
... but as we increase the amount of data in the external world, we shouldn't assume every year that the safe harbor makes it a very low risk
... but a lot of public databases are not covered by HIPAA
peterswire: some discussion
regarding date of birth, different from other data fields in
that it splits the population into 25,000 cells
... what kind of data can be easily searched on the outside? when you're coming up with your definition of very low risk, demographic data or data that lasts with you for a long time is a higher risk
... persists longer and is more easily obtainable from other sources
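[Editorial illustration] The figure of roughly 25,000 cells for date of birth is consistent with about 70 years of possible birth dates. The arithmetic below is a sketch under assumptions that were not stated on the call: a 70-year span, leap days ignored, and births spread evenly across dates. The `expected_per_cell` helper and the population figure are hypothetical.

```python
DAYS_PER_YEAR = 365      # assumption: leap days ignored
BIRTH_YEAR_SPAN = 70     # assumption: roughly 70 years of birth dates

# Number of distinct (day, month, year) cells a full date of birth can fall into.
cells = DAYS_PER_YEAR * BIRTH_YEAR_SPAN


def expected_per_cell(population: int, extra_cells: int = 1) -> float:
    """Average number of people per cell, assuming an even spread.
    extra_cells multiplies in any additional attribute, such as sex (2)."""
    return population / (cells * extra_cells)


print(cells)
print(expected_per_cell(1_000_000))
print(expected_per_cell(1_000_000, extra_cells=2))
```

Each extra attribute combined with date of birth divides the average cell size further, which is why long-lived demographic fields carry higher re-identification risk.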
deven: that's the level of detail in discussion of the statistical methodology
peterswire: thanks very much to
... in person availability in Brussels; next Thursday or Friday, will provide more information
<fielding> I have no doubt that understanding deidentification is useful in general for the privacy of all users [not just those sending DNT]. I don't believe discussing it here is useful because I don't see us redefining what it means in our specs. That's in stark contrast to defining tracking, which hasn't been defined by anyone else, we are specifically chartered to define, and we aren't going to make any real progress until we do. And, no, I don't think that
<fielding> unlinkability is relevant just because someone made it an issue for TCS.
peterswire: I'm not available
next Wednesday, Matthias will have a technical call at the
... questions or comments?
<fielding> What about the MIT meeting?
<tedleung> any more details on the f2f?
<adrianba> is there logistics information for the f2f?
peterswire: thanks everybody
<bryan> Hi Nick did you see my message?
I'm hearing questions about MIT logistics, and will follow up on the mailing list
<aleecia> thanks, Nick!
Present: BrendanIAB?, dwainberg, walter, +1.408.836.aaaa, Fielding, kulick, +31.65.141.aabb, +1.202.587.aacc, rvaneijk, moneill2, JeffWilson, Chris_IAB, vincent, npdoty, Brooks, [Microsoft], schunter?, Susan_Israel, samsilberman, peterswire, [CDT], Keith_Scarborough, +1.202.331.aadd, +1.646.654.aaee, DAvid, hefferjr, Chris_Pedigo, Aleecia, Lia, RichardWeaver, +1.917.974.aaff, justin, +1.609.258.aagg, Jonathan_Mayer, efelten_, hwest, +1.202.344.aahh, adrianba, Mike_Zaneis, Peder_Magee, WileyS, +1.646.722.aaii, dsinger, +1.425.214.aajj, +1.425.455.aakk, +1.202.265.aall, Marc_G, +1.206.658.aamm, amyc?, Ted_Leung, +44.772.301.aann, +1.213.239.aaoo, schunter, +1.917.318.aapp, Alan
Date: 16 Jan 2013
Minutes: http://www.w3.org/2013/01/16-dnt-minutes.html