ACTION-412, Naming R/Y/G

If the group decides to use a Red/Yellow/Green approach, one question has been how to describe the three stages.  On the one hand, this may seem trivial because the substance means more than the name.  On the other hand, in my view, the names/descriptions are potentially important for two reasons: (1) they provide intellectual clarity about what goes in each group; and (2) they communicate the categories to a broader audience.

I was part of a phone briefing that Shane gave on Friday to FTC participants, including Ed Felten and Paul Ohm.  The briefing was similar to the approach Shane described at Sunnyvale.  In the move from red to yellow, here were examples of what could be scrubbed:

1.  Unique IDs, to one-way secret hash.
2.  IP address, to geo data.
3.  URL cleanse, remove suspect query string elements.
4.  Side facts, remove link out data that could be used to reverse identify the record.
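
As a rough illustration only (this is not from Shane's briefing), the four scrubbing steps above might be sketched in Python as follows.  Everything specific here is an assumption made for the example: the secret key, the record field names ("user_id", "ip", "url", "crm_id"), the list of suspect query parameters, and the tiny geo table that stands in for a real IP-to-location lookup.  A keyed HMAC is used as one plausible reading of "one-way secret hash."

```python
import hashlib
import hmac
from urllib.parse import parse_qsl, urlencode, urlsplit, urlunsplit

# All values below are hypothetical placeholders for illustration.
SECRET_KEY = b"hypothetical-secret-held-by-the-company"
SUSPECT_PARAMS = {"email", "user", "name", "phone", "token"}
LINK_OUT_FIELDS = {"crm_id"}            # side facts that link out to other data
GEO_TABLE = {"192.0.2.": "US-OH"}       # stand-in for a real geo-IP database


def hash_id(unique_id):
    """Step 1: replace a unique ID with a keyed one-way hash."""
    return hmac.new(SECRET_KEY, unique_id.encode("utf-8"), hashlib.sha256).hexdigest()


def ip_to_geo(ip):
    """Step 2: replace an IP address with coarse geo data."""
    for prefix, region in GEO_TABLE.items():
        if ip.startswith(prefix):
            return region
    return "unknown"


def cleanse_url(url):
    """Step 3: drop suspect query string elements from a URL."""
    parts = urlsplit(url)
    kept = [(k, v) for k, v in parse_qsl(parts.query, keep_blank_values=True)
            if k.lower() not in SUSPECT_PARAMS]
    return urlunsplit((parts.scheme, parts.netloc, parts.path,
                       urlencode(kept), parts.fragment))


def red_to_yellow(record):
    """Apply steps 1-4 to one record and return a scrubbed copy."""
    # Step 4: remove link-out side facts (the copy leaves the input untouched).
    yellow = {k: v for k, v in record.items() if k not in LINK_OUT_FIELDS}
    yellow["user_id"] = hash_id(yellow["user_id"])
    yellow["geo"] = ip_to_geo(yellow.pop("ip"))
    yellow["url"] = cleanse_url(yellow["url"])
    return yellow


if __name__ == "__main__":
    red = {
        "user_id": "abc123",
        "ip": "192.0.2.44",
        "url": "http://example.com/page?q=shoes&email=a%40b.com",
        "crm_id": "X9",
    }
    print(red_to_yellow(red))
```

Note that anyone holding the secret key could re-compute the hash for a known ID, so a record scrubbed this way is still linkable inside the company.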

Here are some ways that I’ve thought to describe what gets scrubbed, based on this sort of list:

1.  Remove identifiers (name) and what have been called pseudo-identifiers in the deID debates (phone, passwords, etc.).  But I don’t think there is a generally accepted way to decide what pseudo-identifiers would be removed.

2.  Earlier, I had suggested “direct” and “indirect” identifiers, but I agree with Ed’s objection that these definitions are vague.

3.  I am interested in the idea that going from red to yellow means removing information that is “exogenous” to the system operated by the company.  That is, for names/identifiers/data fields that are used outside of the company, scrub those.  Going to green would remove information that is “endogenous” to the system operated by the company; that is, even those within the company with access to the system could no longer reverse engineer the scrubbing.

When I suggested those terms on the call, someone basically said the terms were academic gobbledygook.  The terms are defined here: http://en.wikipedia.org/wiki/Exogenous.  I acknowledge the gobbledygook point, and the word “exogenous” is probably one only an economist could love.  But I welcome comments on whether the idea is correct – data fields that are generated or observable outside of the company are different from those generated within the company’s system.

4.  If exogenous/endogenous are correct in theory, but gobbledygook in practice, then I wonder if there are plain language words that get at the same idea.  My best current attempt is that red to yellow means scrubbing fields that are “observable from outside of the company” or “outwardly observable.”

So, my suggestion is that red to yellow means scrubbing fields that are “observable from outside of the company” or “outwardly observable.”

If this is correct, then the concept of k-anonymity likely remains relevant.  Keeping broad demographic information such as male/female or age group can be in the yellow zone.  However, a left-handed person under five feet with red hair would in most settings be a bucket too small.
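
To make the "bucket too small" point concrete, here is a minimal sketch of a k-anonymity check in Python.  The field names and the threshold k are assumptions for the example; nothing in the discussion so far fixes a particular value of k.  The function groups records by their combination of demographic fields and reports every combination shared by fewer than k records.

```python
from collections import Counter


def small_buckets(records, quasi_identifiers, k):
    """Return each combination of quasi-identifier values held by fewer than k records."""
    counts = Counter(tuple(r[q] for q in quasi_identifiers) for r in records)
    return {combo: n for combo, n in counts.items() if n < k}


if __name__ == "__main__":
    # Hypothetical records; the fields are illustrative only.
    people = [
        {"sex": "F", "age_group": "25-34", "hair": "brown"},
        {"sex": "F", "age_group": "25-34", "hair": "brown"},
        {"sex": "F", "age_group": "25-34", "hair": "red"},
        {"sex": "M", "age_group": "25-34", "hair": "brown"},
        {"sex": "M", "age_group": "25-34", "hair": "brown"},
    ]
    # Broad demographics alone: every bucket has at least 2 members.
    print(small_buckets(people, ["sex", "age_group"], k=2))
    # Adding a finer attribute isolates one person.
    print(small_buckets(people, ["sex", "age_group", "hair"], k=2))
```

With only sex and age group, no bucket falls below k=2 in this toy data; adding hair color leaves one record alone in its bucket, which is the kind of combination that would need to be generalized or scrubbed.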

Clearly, the group has a variety of issues to address if we decide to go with a three-part R/Y/G approach to de-identification.  The limited goal of this post is to try to help with terminology.  Is it useful to say that the yellow zone means scrubbing data that is “observable from outside of the company”, except for broad demographic data?

Peter

P.S.  After I wrote the above, I realized that "observable from outside of the company" is similar in meaning to what can be "tracked" by those outside of the company.  So scrubbing those items plausibly reduces tracking, at least by other companies.


Prof. Peter P. Swire
C. William O'Neill Professor of Law
Ohio State University
240.994.4142
www.peterswire.net

Beginning August 2013:
Nancy J. and Lawrence P. Huang Professor
Law and Ethics Program
Scheller College of Business
Georgia Institute of Technology

Received on Saturday, 22 June 2013 15:30:07 UTC