k-anonymity

Data, Ethics, and Government

Calvin Deutschbein

About Me

Name: Calvin (Deutschbein)
Pronouns: They/Them
Background: Ph.D. Computer Science - UNC Chapel Hill, "Mining Secure Behavior of Hardware Designs"
Email: ckdeutschbein@willamette.edu

Background

We don't know how much data exists in the world.

  • Public sector: voting records, social security, call logs
  • Private sector: browsing history, purchase history, email logs

Data and knowledge can enrich our society

  • Formulating human-centered policies
  • Supporting community-building in digital spaces

Laws and policies require that some collected data be made public

  • For example, campaign donations.
  • Stock levels reveal population-level consumer behaviors

Growth

It is generally accepted that...

  • Google is one of the largest data aggregators
  • Google held approximately 15 exabytes in 2013 [src]
  • Google's reported power use increased about 21% per annum from 2011 to 2019

(12.4/2.6)^(1/8) ≈ 1.21

  • Hard-drives grew 16x from 2012 to 2023 from less than 2 TB [src] to 32 TB [src]
  • Google current storage is around 15 exabytes * 1.21^10 * 16 = 1614 exabytes
  • 10x every 5 years within one company, but # of data companies also grows.
  • I have generated 134 MB of teaching materials in 3 years, or about 0.000000000134 exabytes (1.34 × 10^-10)
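The back-of-the-envelope estimate above can be reproduced in a few lines. This is only a sketch of the slide's arithmetic: the variable names are illustrative, the inputs are the slide's own figures, and it follows the slide in treating power use as a rough proxy for growth.

```python
# Reproduce the slide's growth estimate. All inputs are the slide's figures.
power_2011, power_2019 = 2.6, 12.4   # reported power use (units as reported)
years = 2019 - 2011

# Compound annual growth rate: (end / start) ** (1 / years)
annual_growth = (power_2019 / power_2011) ** (1 / years)

storage_2013_eb = 15   # approximate exabytes held in 2013
drive_growth = 16      # hard drives: under 2 TB (2012) to 32 TB (2023)

# The slide uses a growth rate of 1.21 and compounds it over 10 years.
estimate_eb = storage_2013_eb * 1.21 ** 10 * drive_growth

print(f"annual growth: {annual_growth:.3f}")
print(f"estimated storage: {estimate_eb:.0f} EB")
```

Small changes to the assumed growth rate compound quickly over a decade, so the result is best read as an order of magnitude.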

What About Privacy?

How can we keep individuals safe while still benefitting as a society?

One technique: anonymize data.


But how?

Remove “personally identifying information” (PII)

  • Name, SSN, phone, email, address… what else?
  • Anything that identifies an individual directly
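As a minimal sketch of this first step, the snippet below drops directly identifying fields from a record. The field names and the record itself are made up for illustration; real data sets will name and structure these fields differently.

```python
# Hypothetical field names, based on the examples listed on the slide.
DIRECT_IDENTIFIERS = {"name", "ssn", "phone", "email", "address"}

def strip_direct_identifiers(record: dict) -> dict:
    """Return a copy of the record without directly identifying fields."""
    return {k: v for k, v in record.items() if k not in DIRECT_IDENTIFIERS}

# A made-up record for illustration only.
record = {
    "name": "Example Person",
    "ssn": "000-00-0000",
    "email": "person@example.com",
    "dob": "1970-01-01",
    "zip": "00000",
    "diagnosis": "influenza",
}

released = strip_direct_identifiers(record)
print(released)
```

Note that `dob` and `zip` survive this step untouched.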

Is this enough?

Unmask by linking data sets

Consider the Computer Science faculty who attend networking events in the area.

Name     Specialization
Calvin   Security
Fred     VR/AR
Haiyan   AI/ML
Jed      Big Data
Lucas    Software

There are 5 of us. Come say "hi" any time!

Unmask by linking data sets

Now imagine icebreakers at a networking event found on social media

Embarrassing Secret
Orders oatmilk mochas from Archive Coffee + Bar in downtown Salem
Proved Einstein correct according to Washington Post

Is this anonymized? Or rather, how anonymous is this?

Mexico '24

Here are two of my advisees on a Physics trip proving Einstein correct.

They invited Jed!

Unmask by linking data sets

Suppose that at an event we only know AFFILIATION and DIETARY RESTRICTIONS

Name     Dietary Restriction
Calvin   Dairy
Fred     None
Haiyan   None
Jed      Caffeine
Lucas    Red meat

I was sipping on delicious mochas while Jed was out changing the world.
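The linking step in this example can be sketched as a join on a shared attribute. The attendee list is taken from the table above; the `inferred_restriction` field is an assumption added for illustration (choosing oat milk hints at avoiding dairy) and is not part of the original data.

```python
# Attendee list with dietary restrictions (from the table above).
attendees = [
    {"name": "Calvin", "restriction": "Dairy"},
    {"name": "Fred", "restriction": "None"},
    {"name": "Haiyan", "restriction": "None"},
    {"name": "Jed", "restriction": "Caffeine"},
    {"name": "Lucas", "restriction": "Red meat"},
]

# "Anonymous" secret. The inferred restriction is an illustrative assumption.
secret = {"text": "Orders oatmilk mochas", "inferred_restriction": "Dairy"}

# Link the two data sets on the shared attribute.
matches = [a["name"] for a in attendees
           if a["restriction"] == secret["inferred_restriction"]]

# A single match means the secret is no longer anonymous.
if len(matches) == 1:
    print(f"{secret['text']!r} is probably {matches[0]}")
else:
    print(f"{len(matches)} candidates: {matches}")
```

Neither data set names the mocha drinker on its own; the identification only appears once the two are combined.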

Latanya Sweeney’s "Attack" (1997)

Dr. Latanya Sweeney

Latanya Sweeney’s "Attack" (1997)

In 1997, Mass. Governor and future Libertarian Vice Presidential nominee Bill Weld released hospital visit information for all state employees, but assured employees that the data was anonymized.

The Data

How do we assess this?

Removed:

  • Name
  • SSN

Retained:

  • DOB
  • Zip

Food for thought:

  • How many zip codes are there?
    • 10000 zip codes
  • How many DOBs are there (including or excluding year)?
    • 365 * ~90 DOBs
  • How many people are on a voter list in 1997?
    • 2022 US pop = 333.3 m
    • 2022 US registered voters = 161.4 m
    • 1997 US pop = 272.9 m
    • Scaling the 2022 registration rate to the 1997 population: ~132.1 m registered voters in 1997
  • 10000 * 365 * 90 = ~328.5 m unique combinations of DOB and zip for ~132.1 m people.
  • Mass. has 533 zips ranging from 20 to 70k people. [src]
  • How likely are DOB and zip to uniquely identify someone? Too likely.
  • (Ask a data science professor; we have 3, and I'll introduce you.)
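The estimate above can be worked through in code. The inputs are the slide's figures; the final step assumes people are spread uniformly at random over all DOB and zip combinations, which is an assumption added here for illustration and not a claim from the slide.

```python
import math

# Figures from the slide.
zip_codes = 10_000
birth_dates = 365 * 90                    # day of year times ~90 birth years
pop_2022, voters_2022 = 333.3e6, 161.4e6
pop_1997 = 272.9e6

# Scale the 2022 registration rate to the 1997 population.
voters_1997 = voters_2022 / pop_2022 * pop_1997

combinations = zip_codes * birth_dates
people_per_combination = voters_1997 / combinations

# Assumption (not from the slide): people fall uniformly at random into the
# combinations, so the number of other people sharing one person's
# combination is roughly Poisson with this mean.
p_unique = math.exp(-people_per_combination)

print(f"registered voters, 1997: {voters_1997 / 1e6:.1f} m")
print(f"combinations: {combinations / 1e6:.1f} m")
print(f"people per combination: {people_per_combination:.2f}")
print(f"chance a given person is unique: {p_unique:.0%}")
```

Under the uniform assumption, roughly two in three people would be unique. Real populations are not uniform (some zips hold as few as 20 people), so this is only an order-of-magnitude illustration.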

Latanya Sweeney’s "Attack" (1997)

Then-governor Weld was hospitalized for influenza in 1996 (he recovered quickly). Once the data set was released, Latanya Sweeney purchased voter rolls for $20, de-anonymized Bill Weld by DOB and zip, and mailed him a copy of his private medical records!
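The mechanics of this kind of re-identification can be sketched as a join between a named public list and the "anonymized" release on DOB and zip. Every record below is made up for illustration; none of it reflects the actual voter rolls or hospital data.

```python
# All records below are made up for illustration; none are real.
voter_roll = [  # public list that includes names
    {"name": "Voter A", "dob": "1950-01-01", "zip": "00001"},
    {"name": "Voter B", "dob": "1962-03-04", "zip": "00001"},
    {"name": "Voter C", "dob": "1950-01-01", "zip": "00002"},
]
hospital_visits = [  # "anonymized": name and SSN removed
    {"dob": "1962-03-04", "zip": "00001", "diagnosis": "influenza"},
    {"dob": "1950-01-01", "zip": "00002", "diagnosis": "fracture"},
]

# Index voters by their (DOB, zip) pair.
by_dob_zip = {}
for voter in voter_roll:
    by_dob_zip.setdefault((voter["dob"], voter["zip"]), []).append(voter["name"])

# A visit is re-identified when exactly one voter shares its (DOB, zip).
reidentified = {}
for visit in hospital_visits:
    names = by_dob_zip.get((visit["dob"], visit["zip"]), [])
    if len(names) == 1:
        reidentified[names[0]] = visit["diagnosis"]

print(reidentified)
```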

Bill Weld

Latanya Sweeney’s "Attack" (1997)

Latanya Sweeney's paper introducing the concept of k-anonymity has been cited over 8000 times!

The Paper

Quasi-Identifiers

Key attributes

  • Name, address, phone number - uniquely identifying!
  • (Should) always be removed before release.

Quasi-identifiers

  • Zip, DOB, and state gender marker together uniquely identify 87% of the U.S. population!
  • My gender marker (X) is relatively uncommon. Good thing I never have to think about that haha.
  • Can be used for linking anonymized dataset with other datasets...

Recall: some collected data must be made public by law!

Classification of Attributes

Sensitive attributes

  • Personal medical or family information, student records, etc.
  • May be released for society to benefit from discoverable knowledge.

The MA data set

k-Anonymity: Intuition

The information for each person contained in the released table cannot be distinguished from that of at least k-1 other individuals whose information also appears in the release.

  • Example:
    • Try to identify a person in the released table.
    • You have birth date and state gender marker.
    • There are (at least) k people in the table with the same birth date and gender.

    Any combination of quasi-identifier values present in the table must appear in at least k records.
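The requirement above can be checked mechanically by grouping rows on their quasi-identifier values and taking the size of the smallest group. This is a minimal sketch; the function name `k_of`, the column names, and the rows are all made up for illustration.

```python
from collections import Counter

def k_of(table, quasi_identifiers):
    """Largest k for which the table is k-anonymous: the size of the
    smallest group of rows sharing the same quasi-identifier values."""
    groups = Counter(tuple(row[q] for q in quasi_identifiers) for row in table)
    return min(groups.values()) if groups else 0

# Made-up rows for illustration.
table = [
    {"birth_date": "1990-05-01", "gender": "F", "condition": "asthma"},
    {"birth_date": "1990-05-01", "gender": "F", "condition": "flu"},
    {"birth_date": "1985-11-23", "gender": "X", "condition": "flu"},
    {"birth_date": "1985-11-23", "gender": "X", "condition": "diabetes"},
    {"birth_date": "1985-11-23", "gender": "X", "condition": "asthma"},
]

print(k_of(table, ["birth_date", "gender"]))
```

Which columns count as quasi-identifiers is a judgment call made by whoever releases the data, and the answer depends entirely on that choice.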

k-Anonymity: Protection Model

Given:

  • PT: Private table (that contains sensitive information)
  • RT: Released ("anonymized") table
  • Attributes: A1, A2, …, An

Any combination of quasi-identifier values present in RT must appear in at least k records.
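One way to state this requirement formally, using the notation above, is sketched below. The symbol QI (the set of quasi-identifier attributes) and the bracket notation t[QI] (the values of row t on those attributes) are introduced here for illustration and do not appear on the slide.

```latex
RT \text{ is } k\text{-anonymous with respect to } QI \subseteq \{A_1, \dots, A_n\}
\iff
\forall\, t \in RT:\;
\bigl|\{\, t' \in RT : t'[QI] = t[QI] \,\}\bigr| \ge k
```

Each row is counted among its own matches, so k = 1 holds for any table.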

The MA data set

Generalization

Goal of k-Anonymity:

  • Each record is indistinguishable from at least k-1 other records
  • These k records form an "equivalence class"
  • Example: keep only the first three digits of each zip code
    >>> zip_codes = ["47677", "47602", "47678"]
    >>> zip_codes = [z[:3] + "**" for z in zip_codes]
    >>> zip_codes
    ['476**', '476**', '476**']

Generalization: replace quasi-identifiers with less specific, but semantically consistent values
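A slightly fuller sketch of generalization is shown below: zip codes are truncated and ages are bucketed by decade, and the smallest equivalence class is measured before and after. The zip codes come from the slide; the ages, conditions, bucket width, and function names are assumptions made for illustration.

```python
from collections import Counter

def generalize(row):
    """Coarsen quasi-identifiers: keep 3 zip digits, bucket age by decade."""
    decade = row["age"] // 10 * 10
    return {
        "zip": row["zip"][:3] + "**",
        "age": f"{decade}-{decade + 9}",
        "condition": row["condition"],  # sensitive attribute is left as-is
    }

def smallest_class(table, quasi_identifiers=("zip", "age")):
    """Size of the smallest equivalence class over the quasi-identifiers."""
    classes = Counter(tuple(r[q] for q in quasi_identifiers) for r in table)
    return min(classes.values())

# Zip codes are from the slide; ages and conditions are made up.
raw = [
    {"zip": "47677", "age": 29, "condition": "flu"},
    {"zip": "47602", "age": 22, "condition": "asthma"},
    {"zip": "47678", "age": 27, "condition": "flu"},
]
released = [generalize(r) for r in raw]

print(smallest_class(raw), "->", smallest_class(released))
```

The generalized values stay truthful (every original zip really does start with 476), which is what "semantically consistent" asks for. The cost is precision.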

Example of a 2-Anonymous Table

The MA data set

Limitations: Dimensionality

  • Generalization fundamentally relies on spatial locality
    • Each record must have k close neighbors
  • Real-world datasets may be very sparse
    • Many attributes (dimensions)
      • Netflix Prize (2009) dataset: 17,000 dimensions
      • Amazon customer records: several million dimensions
    • “Nearest neighbor” is very far

If projection to low dimensions is lossy then k-anonymized datasets lose value.
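The sparsity problem can be illustrated with synthetic data: as more attributes are treated as quasi-identifiers, a growing share of records has no neighbor at all. The data below is random and the sizes are arbitrary choices for illustration; it does not model the Netflix or Amazon data sets.

```python
import random
from collections import Counter

random.seed(0)  # fixed seed so the sketch is repeatable

N_RECORDS, MAX_ATTRS, VALUES_PER_ATTR = 1_000, 12, 4

# Synthetic data: each record has MAX_ATTRS attributes with 4 possible values.
records = [tuple(random.randrange(VALUES_PER_ATTR) for _ in range(MAX_ATTRS))
           for _ in range(N_RECORDS)]

def fraction_unique(records, n_attrs):
    """Share of records that are alone in their group when only the
    first n_attrs attributes are treated as quasi-identifiers."""
    groups = Counter(r[:n_attrs] for r in records)
    return sum(1 for r in records if groups[r[:n_attrs]] == 1) / len(records)

for n_attrs in (2, 4, 6, 8, 12):
    print(n_attrs, f"{fraction_unique(records, n_attrs):.2f}")
```

A record that is unique on some attributes stays unique when more are added, so the unique share can only grow with dimensionality. Reaching any k > 1 then requires generalizing away most of the attributes.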

Limitations: Harms

  • Syntactic
    • Focuses on data transformation, not on what can be learned from the anonymized dataset
    • A “k-anonymous” dataset can still leak sensitive information
  • “Quasi-identifier” fallacy
    • Assumes a priori that attacker will not know certain information about their target
  • Relies on locality
    • Destroys utility of many real-world datasets
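The first limitation can be made concrete with a small example, sometimes called a homogeneity attack: a table can satisfy k-anonymity while every member of an equivalence class shares the same sensitive value. The table below is made up for illustration.

```python
from collections import defaultdict

# Made-up released table; zip and age are already generalized.
released = [
    {"zip": "476**", "age": "20-29", "condition": "heart disease"},
    {"zip": "476**", "age": "20-29", "condition": "heart disease"},
    {"zip": "476**", "age": "20-29", "condition": "heart disease"},
    {"zip": "479**", "age": "40-49", "condition": "flu"},
    {"zip": "479**", "age": "40-49", "condition": "cancer"},
    {"zip": "479**", "age": "40-49", "condition": "flu"},
]

# Group the sensitive values by equivalence class.
classes = defaultdict(list)
for row in released:
    classes[(row["zip"], row["age"])].append(row["condition"])

k = min(len(values) for values in classes.values())
print(f"table is {k}-anonymous")

# A class whose members all share one sensitive value reveals that value
# for everyone in it, even though no single row can be picked out.
leaks = {qi: values[0] for qi, values in classes.items()
         if len(set(values)) == 1}
print(leaks)
```

Anyone known to be in their twenties and living in a 476 zip is exposed without ever being singled out. k-anonymity constrains the form of the table, not what it reveals.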