Is this anonymized? Or rather, how anonymous is this?
Mexico '24
Here's two of my advisees on a Physics trip proving Einstein correct.
They invited Jed!
Unmask by linking data sets
Suppose that at an event we only know AFFILIATION and DIETARY RESTRICTIONS
Name | Dietary Restriction
Calvin | Dairy
Fred | None
Haiyan | None
Jed | Caffeine
Lucas | Red meat
I was sipping on delicious mochas while Jed was out changing the world.
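A minimal sketch of the linking idea, using the attendee list from the slide. The "anonymized" records and their `response` field are hypothetical, invented only to show the mechanics: any record whose dietary restriction matches exactly one attendee is unmasked.

```python
from collections import defaultdict

# Public side: the attendee list from the slide (name -> dietary restriction).
attendees = {
    "Calvin": "Dairy",
    "Fred": "None",
    "Haiyan": "None",
    "Jed": "Caffeine",
    "Lucas": "Red meat",
}

# "Anonymized" side: hypothetical records with the names removed.
# The response values are invented purely for illustration.
anonymized = [
    {"diet": "Caffeine", "response": "A"},
    {"diet": "None", "response": "B"},
    {"diet": "Dairy", "response": "C"},
]


def link(attendees, anonymized):
    """Pair each anonymized record with every attendee it could belong to."""
    by_diet = defaultdict(list)
    for name, diet in attendees.items():
        by_diet[diet].append(name)
    return [(sorted(by_diet[r["diet"]]), r["response"]) for r in anonymized]


linked = link(attendees, anonymized)
for candidates, response in linked:
    status = "unmasked" if len(candidates) == 1 else "ambiguous"
    print(candidates, response, status)
```

In this toy data, Jed and Calvin are identified outright, while Fred and Haiyan hide behind each other because they share the same value.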
Latanya Sweeney’s "Attack" (1997)
In 1997, Massachusetts Governor and future Libertarian vice-presidential nominee Bill Weld released hospital visit information for all state employees but assured them the data was anonymized.
How do we assess this?
Removed:
Name
SSN
Retained:
DOB
Zip
Food for thought:
How many zip codes are there?
10000 zip codes
How many DOBs are there (including or excluding year)?
365 * ~90 DOBs
How many people are on a voter list in 1997?
2022 US population = 333.3 m
2022 US registered voters = 161.4 m
1997 US population = 272.9 m
~132.1 m registered voters in 1997
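The 1997 figure appears to come from scaling the 2022 registration rate down to the 1997 population. A quick sketch of that arithmetic, assuming (and this is only an assumption) that the share of the population registered to vote was the same in 1997 as in 2022:

```python
# Figures from the slide, in people.
us_pop_2022 = 333.3e6
us_registered_2022 = 161.4e6
us_pop_1997 = 272.9e6

# Assumption: the 2022 registration rate also held in 1997.
registration_rate = us_registered_2022 / us_pop_2022
registered_1997 = registration_rate * us_pop_1997

print(f"registration rate: {registration_rate:.1%}")
print(f"estimated 1997 registered voters: {registered_1997 / 1e6:.1f} m")
```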
10000 * 365 * 90 = ~328.5 m unique combinations of DOB and zip for ~132 m people.
Massachusetts has 533 zip codes, with populations ranging from 20 to 70k people. [src]
How likely are DOB and zip to uniquely identify someone? Too likely.
(Ask a data science professor. We have 3, and I'll introduce you.)
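One rough way to put a number on "how likely": pretend each person lands in one of the ~328.5 m (DOB, zip) combinations uniformly at random, and ask how often a given person has their combination all to themselves. The uniform assumption is mine and is not realistic (people cluster in populous zip codes and common birth years), so treat this as a back-of-the-envelope sketch rather than an estimate.

```python
import math

combinations = 10000 * 365 * 90   # (DOB, zip) combinations from the slide
people = 132.1e6                  # estimated registered voters in 1997

# Under the uniform assumption, a given person is unique when none of the
# other (people - 1) individuals falls into the same combination:
#   P(unique) = (1 - 1/combinations) ** (people - 1)
# log1p keeps the computation accurate for such a tiny 1/combinations.
p_unique = math.exp((people - 1) * math.log1p(-1 / combinations))

print(f"combinations: {combinations / 1e6:.1f} m")
print(f"P(a given person is unique): {p_unique:.2f}")
```

Even in this crude model, roughly two out of three people would be singled out by DOB and zip alone.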
Then-governor Weld was hospitalized for influenza in 1996 (he recovered quickly). Once the data set was released, Latanya Sweeney purchased voter rolls for $20, de-anonymized Bill Weld by DOB and zip, and mailed him a copy of his private medical records!
Latanya Sweeney's paper introducing the concept of k-anonymity has been cited over 8000 times!
Quasi-Identifiers
Key attributes
Name, address, phone number - uniquely identifying!
(Should) always be removed before release.
Quasi-identifiers
Zip, DOB, and state gender marker together uniquely identify 87% of the U.S. population!
My gender marker (X) is relatively uncommon. Good thing I never have to think about that haha.
Can be used for linking anonymized dataset with other datasets...
Recall: some collected data must be made public by law!
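To see how a figure like "87% uniquely identified" would be measured, here is a small sketch that counts how many records have a quasi-identifier combination shared with nobody else. The records and the `fraction_unique` helper are hypothetical, made up for illustration.

```python
from collections import Counter


def fraction_unique(records, quasi_identifiers):
    """Share of records whose quasi-identifier combination occurs exactly once."""
    keys = [tuple(r[q] for q in quasi_identifiers) for r in records]
    counts = Counter(keys)
    return sum(1 for key in keys if counts[key] == 1) / len(keys)


# Hypothetical records; every value here is invented.
records = [
    {"zip": "01002", "dob": "1970-03-14", "gender": "F"},
    {"zip": "01002", "dob": "1970-03-14", "gender": "F"},
    {"zip": "01002", "dob": "1970-03-14", "gender": "X"},
    {"zip": "01003", "dob": "1982-11-02", "gender": "M"},
    {"zip": "01004", "dob": "1995-06-30", "gender": "F"},
]

unique_share = fraction_unique(records, ["zip", "dob", "gender"])
print(f"{unique_share:.0%} of records are unique on zip + DOB + gender")
```

Dropping a column from the quasi-identifier list (say, `gender`) shows how quickly the unique share falls as less is known.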
Classification of Attributes
Sensitive attributes
Personal medical or family information, student records, etc.
May be released for society to benefit from discoverable knowledge.
The MA data set
k-Anonymity: Intuition
The information for each person contained in the released table cannot be distinguished from that of at least k-1 other individuals whose information also appears in the release.
Example:
Try to identify a person in the released table.
You have birth date and state gender marker.
There are (at least) k people in the table with the same birth date and gender.
Any combination of quasi-identifier values present must appear in at least k records.
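The intuition above can be checked mechanically: group the released table by its quasi-identifier columns and look at the smallest group. A minimal sketch, with a hypothetical table and hypothetical helper names (`k_of`, `is_k_anonymous`):

```python
from collections import Counter


def k_of(table, quasi_identifiers):
    """Size of the smallest group of rows sharing the same quasi-identifier values."""
    groups = Counter(tuple(row[q] for q in quasi_identifiers) for row in table)
    return min(groups.values())


def is_k_anonymous(table, quasi_identifiers, k):
    """True if every quasi-identifier combination appears in at least k rows."""
    return k_of(table, quasi_identifiers) >= k


# Hypothetical released table; all values are invented.
released = [
    {"dob": "1970-03-14", "gender": "F", "diagnosis": "flu"},
    {"dob": "1970-03-14", "gender": "F", "diagnosis": "asthma"},
    {"dob": "1982-11-02", "gender": "M", "diagnosis": "flu"},
    {"dob": "1982-11-02", "gender": "M", "diagnosis": "fracture"},
    {"dob": "1995-06-30", "gender": "X", "diagnosis": "flu"},
    {"dob": "1995-06-30", "gender": "X", "diagnosis": "migraine"},
]

qis = ["dob", "gender"]
print("k =", k_of(released, qis))
print("2-anonymous:", is_k_anonymous(released, qis, 2))
print("3-anonymous:", is_k_anonymous(released, qis, 3))
```

Here an attacker who knows a birth date and gender marker can narrow the target down to two rows, but no further, so this table is 2-anonymous and not 3-anonymous with respect to those two columns.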