Clustering

Jameson > Hendrik > Calvin

Agenda

Modeling Reminders
K-means Clustering
Hierarchical Clustering
Final Modeling Project

Timing Update

I misnumbered my weeks, so Model 2 is due immediately before the final.
In theory, this means you have more time on model 2?
Anyways:
- Model 2 by 14 Apr
- Final on 21 Apr

Next modeling project

Due next Monday (14 Apr)
Use bank data to predict churn
Using exactly 5 features
Models scored based on AUC
We’ll leave time at the end to get started

Time to Touch on final

Click me

Emphasis

No code
10 features

A hold-out sample is a random sample from a data set that is withheld and not used in the model fitting process. After the model is fit to the main data (the “training” data), it is then applied to the hold-out sample. This gives an unbiased assessment of how well the model might do if applied to new data.

Supervision

Motivating Questions

What is difference between supervised and unsupervised learning?
What is unsupervised learning used for?
What are some challenges with unsupervised learning?

Supervised Learning

Uses labeled data.
Maps inputs to outputs.
Goal: Predict outcomes.
Examples: Classification, regression.

Unsupervised Learning

Uses unlabeled data.
Finds hidden patterns.
Goal: Discover structure.
Examples: Clustering, dimensionality reduction.

The Label Divide

Supervised: Labeled data.
Unsupervised: Unlabeled data.
Label presence is key.

Difference between supervised and unsupervised learning

Uses of Unsupervised

Clustering: Grouping similar data.
- E.g. discover Yemenia novel coffee varietal
- Learn more
  - 18 min + can’t watch on stream.
Anomaly detection: Finding outliers.
Dimensionality reduction: Simplifying data.
Association rule learning: Finding relationships.

Clustering in Action

Customer/market segmentation.
- Why EU entered “esports winter” 18+ months before US?
Document clustering.

Anomaly Detection

Detect fraudulent bank transactions.
- E.g. if I purchased an espresso in Salem, MA
Detect anomalous (computer) network behavior.
- My colleague at MS does this
Predict imminent equipment failure in e.g. manufacturing sector.
- Do you do this in 596?

Dimensionality Reduction

Just a refresh
- Feature extraction.
- Data visualization.
- Reducing noise.

Challenges

Evaluation
- No clear “right” answer.
  - And sometimes there’s someday a right answer, but you have to make the model now.
  - Sometimes the “right” answer is determined by someone in power.
- Subjective evaluation metrics.
  - Non-falsifiable by construction.
- Validating discovered patterns.

Challenges

Interpretation
- Understanding discovered structures.
  - Think in e.g. practice midterm - did you “discover” colleges vs. universities.
  - What about “big school” and “small school”
- Meaningful insights can be hard.
  - Did you find “Colleges that Change Lives”
- Requires or at least benefits from domain expertise.

Challenges

Computational Cost
- Large datasets are common.
- Algorithms can be complex.
- Scalability is a concern.
Computational costs become operating costs

Takeaways

Supervised: Labeled data ⇒ predict label.
Unsupervised: Unlabeled data ⇒ discover patterns.
- Valuable, but challenging.
- Value is correlated with challenge given scarcity.

K-means Clustering Algorithm

Start with k random clusters

Calculate means

Select cluster based on which mean point is closest to

Adjust menas and repeat

Potential Issues

What happens with high dimensionality?
What happens when dimensions aren’t scaled?

Setup

sh <- suppressPackageStartupMessages
sh(library(tidyverse))
sh(library(tidytext))
data(stop_words)
wine <- readRDS(gzcon(url("https://cd-public.github.io/D505/dat/variety.rds"))) %>% rowid_to_column("id")
glimpse(wine)

Rows: 3,657
Columns: 5
$ id          <int> 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16, 17,…
$ variety     <chr> "Pinot_Gris", "Pinot_Noir", "Pinot_Noir", "Pinot_Noir", "P…
$ price       <dbl> 14, 65, 20, 50, 22, 25, 38, 28, 45, 22, 40, 50, 52, 48, 50…
$ points      <dbl> 87, 87, 87, 86, 86, 86, 91, 85, 85, 85, 89, 89, 92, 95, 92…
$ description <chr> "Tart and snappy, the flavors of lime flesh and rind domin…

Relate Word to IDS

word_ids <- wine %>%
  unnest_tokens(word, description) %>%
  anti_join(stop_words) %>% 
  filter(!(word %in% c("wine","flavors","pinot","gris"))) %>% 
  count(id, word) %>% group_by(id) %>% 
  mutate(n = if_else(n>0,1,0)) %>% ungroup()

head(word_ids)

# A tibble: 6 × 3
     id word          n
  <int> <chr>     <dbl>
1     1 acidity       1
2     1 crisp         1
3     1 dominate      1
4     1 fermented     1
5     1 flesh         1
6     1 green         1

Find top words by variety

top_words <- word_ids %>% 
  right_join(wine, by="id") %>%
  count(variety, word) %>% 
  group_by(variety) %>% top_n(3,n) %>%
  ungroup() %>% select(word) %>% distinct()

head(top_words)

# A tibble: 6 × 1
  word  
  <chr> 
1 apple 
2 fruit 
3 oak   
4 pear  
5 cherry
6 finish

Engineer Features

wino <- word_ids %>% 
  filter(word %in% top_words$word) %>% 
  pivot_wider(id_cols = id, names_from = word, values_from = n, values_fill = 0) %>% 
  right_join(wine, by="id") %>% 
  mutate(price=log(price)) %>%
  mutate(price=scale(price), points=scale(points)) %>% 
  select(-id,-variety, -description) %>% drop_na(.)

head(wino)

# A tibble: 6 × 8
    oak finish fruit cherry  pear apple price[,1] points[,1]
  <dbl>  <dbl> <dbl>  <dbl> <dbl> <dbl>     <dbl>      <dbl>
1     1      0     0      0     0     0    -1.14      -0.892
2     0      1     1      0     0     0     0.664     -1.26 
3     0      1     1      1     0     0     0.123      0.592
4     0      1     1      0     0     0    -0.478     -1.63 
5     0      0     1      1     0     0    -0.953     -1.63 
6     0      0     1      1     0     0     0.664     -0.150

Basic K-means cluster

kclust <- kmeans(wino, centers = 3)
kclust$centers

        oak    finish     fruit    cherry       pear      apple      price
1 0.1469933 0.3151448 0.6792873 0.1447661 0.28396437 0.26280624 -1.2307289
2 0.2147588 0.3538316 0.7559130 0.3916746 0.01797540 0.07190161  0.1621558
3 0.2126538 0.3330404 0.7293497 0.4824253 0.04217926 0.07908612  0.8319282
      points
1 -0.6997839
2 -0.4100172
3  0.9709884

Add clusters

wink <- wino %>% mutate(cluster = kclust$cluster)
head(wink)

# A tibble: 6 × 9
    oak finish fruit cherry  pear apple price[,1] points[,1] cluster
  <dbl>  <dbl> <dbl>  <dbl> <dbl> <dbl>     <dbl>      <dbl>   <int>
1     1      0     0      0     0     0    -1.14      -0.892       1
2     0      1     1      0     0     0     0.664     -1.26        2
3     0      1     1      1     0     0     0.123      0.592       3
4     0      1     1      0     0     0    -0.478     -1.63        1
5     0      0     1      1     0     0    -0.953     -1.63        1
6     0      0     1      1     0     0     0.664     -0.150       2

Cluster Visualizer

see_densities <- function(df, names) {
  df %>% 
    pivot_longer(names,names_to = "feature") %>% 
    ggplot(aes(value, fill=cluster))+
    geom_density(alpha=0.3)+
    facet_wrap(~feature)
}

Visualize clusters

see_densities(wink, c("oak", "finish", "fruit"))

Visualize clusters

see_densities(wink, c("cherry", "pear", "apple"))

Visualize clusters

see_densities(wink, c("points","price"))

Try different numbers of clusters

kclusts <- map(1:9, function(k) { 
    kmeans(wino, centers = k)$cluster
  }
)

Examine any one of the clusterings

unlist(kclusts[7])

   [1] 3 7 6 2 2 7 5 6 5 5 6 6 5 1 3 5 6 5 4 7 2 3 2 2 3 3 7 7 3 3 2 6 1 6 5 5 2
  [38] 2 7 1 5 6 3 4 4 4 7 7 3 2 3 7 7 3 2 2 5 3 1 3 2 2 7 2 3 2 2 1 6 4 2 7 2 7
  [75] 7 3 7 7 7 7 7 3 5 6 1 2 2 2 5 5 1 7 3 2 4 1 7 3 3 1 5 1 2 2 3 3 3 3 2 3 2
 [112] 1 4 6 1 5 6 6 1 2 2 2 2 3 7 7 1 7 3 2 4 4 2 7 2 2 2 1 6 6 6 6 6 7 4 3 4 1
 [149] 1 1 4 1 7 4 2 3 5 6 1 1 2 3 7 3 3 2 6 1 5 1 5 3 3 5 1 2 3 5 2 3 2 5 5 2 1
 [186] 5 4 6 1 4 7 4 7 7 3 7 3 3 2 4 6 5 5 6 1 6 1 1 3 4 5 6 3 7 4 6 6 2 2 3 6 1
 [223] 5 7 4 1 6 6 1 1 5 4 5 6 1 5 3 3 4 1 1 5 7 4 4 4 2 2 7 4 4 5 4 6 6 5 5 1 6
 [260] 2 2 4 6 3 4 4 4 7 2 2 4 4 4 2 7 7 5 6 6 5 4 5 2 4 7 6 1 4 1 6 4 4 3 6 5 1
 [297] 1 7 2 3 7 7 3 3 3 5 2 3 2 7 4 4 7 7 4 4 4 6 1 5 1 5 6 4 2 2 1 5 1 5 6 1 5
 [334] 6 3 1 1 5 7 7 2 1 1 5 6 4 2 6 5 5 6 6 6 7 1 2 7 7 7 3 7 4 7 7 6 3 6 7 3 6
 [371] 6 5 6 6 6 5 5 5 7 7 7 4 4 1 2 2 3 3 3 4 2 3 5 1 7 7 3 7 7 6 1 1 5 5 6 6 4
 [408] 2 2 2 3 3 4 5 6 4 7 5 6 2 3 2 2 3 3 2 6 1 6 4 1 1 4 7 1 7 2 1 1 1 3 3 3 2
 [445] 2 2 3 2 7 7 7 7 2 1 5 7 3 1 6 2 3 2 2 7 3 1 1 1 5 2 2 2 1 6 5 4 3 7 3 7 4
 [482] 6 3 4 1 6 7 3 3 2 7 2 3 2 6 6 1 6 6 2 4 5 6 5 1 4 1 5 7 4 4 1 4 2 3 7 7 2
 [519] 2 6 7 4 1 7 3 3 3 2 3 3 3 2 1 1 6 4 6 7 3 3 2 2 7 1 6 1 2 2 7 2 1 4 2 2 6
 [556] 2 4 7 6 7 2 6 1 7 7 7 4 3 7 7 4 5 5 6 5 2 7 1 7 7 1 4 7 2 7 3 3 3 4 1 7 7
 [593] 7 6 7 4 1 1 4 4 4 4 4 3 3 7 7 7 3 7 7 3 7 7 3 2 2 4 4 1 7 2 4 5 1 1 1 1 1
 [630] 4 6 6 6 7 3 2 3 3 7 3 3 2 2 6 4 1 5 3 5 1 6 6 1 5 5 1 3 2 2 2 3 2 3 2 2 7
 [667] 3 3 6 4 7 7 4 4 2 6 5 4 1 1 2 2 6 1 6 1 6 1 1 1 1 5 4 7 2 2 2 7 3 7 7 7 7
 [704] 6 1 6 4 4 7 7 7 7 2 1 5 7 4 3 7 3 3 5 5 2 3 7 2 4 7 7 4 7 7 7 5 4 7 4 4 3
 [741] 6 4 2 1 2 3 5 6 3 2 2 7 2 7 5 6 5 1 2 3 2 3 3 2 2 3 2 2 6 6 1 1 5 1 1 6 2
 [778] 2 3 4 6 6 5 3 2 7 4 3 6 4 4 1 2 3 7 7 3 2 4 6 4 5 2 1 6 1 6 1 3 2 1 6 4 4
 [815] 1 3 3 7 4 4 1 4 4 7 4 7 5 2 2 2 7 3 7 7 3 6 5 1 4 6 1 4 7 2 7 2 2 3 5 4 4
 [852] 6 4 1 5 1 1 4 5 3 4 7 6 1 5 1 6 4 5 6 6 4 4 1 7 2 2 3 3 3 2 7 6 5 5 2 3 3
 [889] 1 3 2 2 1 6 7 7 4 4 7 6 1 1 1 1 7 2 2 4 5 4 7 2 3 3 6 7 7 1 5 4 1 1 4 4 4
 [926] 1 4 6 3 7 7 7 7 7 4 4 6 1 6 5 1 5 4 2 3 5 1 1 6 2 2 7 2 3 3 7 6 1 1 5 6 6
 [963] 1 5 4 7 7 7 3 2 7 3 6 2 1 6 1 1 3 6 1 1 1 4 6 2 5 1 3 1 7 2 1 6 4 5 4 4 2
[1000] 7 4 7 5 6 5 5 2 2 2 4 6 1 4 4 4 6 5 1 3 2 1 1 2 3 7 2 3 3 4 6 7 4 4 7 5 1
[1037] 1 1 7 7 2 6 6 6 1 7 4 4 6 4 3 2 3 3 2 2 1 2 3 2 4 7 7 5 5 5 7 4 7 2 3 7 7
[1074] 7 6 5 6 5 6 3 6 5 5 1 3 2 7 4 4 6 4 3 3 7 2 7 7 6 6 5 1 1 4 4 7 1 5 5 6 7
[1111] 3 7 3 1 1 5 5 5 1 6 1 1 2 5 1 5 6 2 7 2 2 5 6 1 6 6 1 4 5 6 1 6 4 3 2 2 2
[1148] 5 7 4 4 3 2 3 3 2 2 7 7 2 6 6 2 3 3 5 6 5 4 2 2 2 2 6 4 7 4 4 4 3 5 6 6 2
[1185] 2 1 1 6 7 4 7 7 7 7 3 2 3 3 5 5 5 6 6 7 3 6 2 5 6 1 1 6 5 7 7 7 2 7 7 7 2
[1222] 3 2 7 2 7 2 3 5 5 1 2 2 6 7 5 1 1 4 2 7 7 2 3 1 6 1 4 6 2 7 1 7 3 3 2 1 1
[1259] 4 4 1 5 1 1 2 7 3 7 6 4 2 4 7 4 6 6 3 3 1 1 1 6 4 7 2 2 2 1 1 5 4 1 3 5 6
[1296] 5 2 7 7 7 2 7 1 5 7 7 2 1 4 4 4 1 3 7 7 7 4 6 4 1 4 4 4 6 6 4 6 6 4 3 2 3
[1333] 3 7 7 7 6 6 7 7 1 4 1 5 5 6 5 5 7 7 1 3 3 3 4 7 7 3 2 4 6 2 3 2 3 5 1 4 7
[1370] 2 2 1 6 1 6 3 2 3 7 2 1 5 4 7 2 2 2 3 3 3 7 2 1 1 6 4 7 1 4 5 6 4 7 5 1 3
[1407] 3 2 4 3 2 7 4 3 4 7 7 4 2 2 7 7 3 4 1 1 6 4 2 2 2 7 2 7 1 6 1 1 5 6 5 3 3
[1444] 3 7 4 5 3 2 4 7 5 4 2 3 5 1 7 4 7 5 7 6 5 3 4 4 3 3 2 1 4 1 4 1 5 6 3 5 4
[1481] 2 3 4 4 4 6 6 7 7 4 6 1 4 3 4 4 2 2 7 2 2 3 7 7 4 6 6 3 3 7 7 3 5 4 6 5 4
[1518] 4 6 5 1 2 2 7 2 2 3 7 3 2 7 7 4 7 7 3 6 1 4 2 4 4 5 6 7 4 4 1 5 1 1 1 2 6
[1555] 7 7 7 7 3 7 3 2 2 4 4 4 2 7 7 7 2 7 7 3 2 7 7 7 7 7 7 3 7 2 1 1 1 4 6 5 6
[1592] 3 7 5 6 1 4 1 3 4 4 4 2 7 4 6 5 4 6 1 5 1 3 7 3 7 2 3 2 7 2 2 2 3 2 2 2 2
[1629] 3 2 3 5 1 1 5 1 7 6 5 4 6 6 7 7 3 1 1 6 5 1 6 2 2 2 2 6 7 4 7 7 2 3 7 7 3
[1666] 3 4 5 5 4 4 7 7 4 6 4 5 3 4 7 7 4 1 6 5 5 1 1 6 2 2 6 5 1 7 7 7 7 6 5 5 1
[1703] 7 1 7 7 7 7 4 1 4 4 5 7 7 2 2 2 3 1 1 4 6 6 7 3 2 2 3 2 5 5 6 6 1 1 2 3 4
[1740] 4 6 7 4 7 5 5 6 1 4 4 6 1 3 2 2 6 1 6 3 3 7 7 1 5 4 4 4 7 4 6 6 4 3 7 4 3
[1777] 2 2 2 5 6 4 7 3 7 4 3 4 4 4 1 5 2 4 4 4 7 3 2 7 2 3 2 2 2 7 3 7 2 7 2 1 7
[1814] 3 3 3 2 7 7 4 4 2 2 5 6 7 4 4 2 2 3 2 7 3 5 4 3 3 3 7 5 4 4 3 2 2 5 6 4 1
[1851] 5 2 3 2 4 2 5 7 2 1 6 1 1 1 4 4 1 7 7 7 4 7 1 5 5 6 4 4 4 4 1 1 1 6 5 1 4
[1888] 2 1 1 1 6 1 4 2 6 1 4 3 6 1 5 1 6 5 5 3 1 5 7 7 5 7 7 6 5 2 3 2 2 6 4 1 6
[1925] 2 5 5 1 7 7 7 4 7 3 6 4 1 4 4 4 1 1 3 2 7 2 1 4 1 1 6 4 3 2 7 7 7 4 4 4 7
[1962] 6 4 1 3 5 7 4 4 7 2 3 2 2 3 3 3 3 5 1 5 1 1 4 5 3 3 2 3 2 2 2 6 1 6 6 1 3
[1999] 1 1 6 3 1 2 7 2 3 7 4 4 7 7 5 1 6 1 1 5 4 5 4 2 3 3 3 6 1 1 7 4 3 7 7 1 6
[2036] 5 6 1 2 2 3 3 2 7 7 7 7 2 3 1 6 7 7 3 2 4 4 1 4 6 4 7 5 1 5 5 5 1 1 2 7 4
[2073] 4 4 6 1 1 7 7 7 6 6 1 6 1 1 3 2 5 1 7 5 3 4 5 4 5 7 7 4 7 4 7 6 3 4 7 1 1
[2110] 4 6 6 6 1 6 5 3 5 4 5 5 6 6 4 2 2 2 4 3 4 7 7 7 1 4 4 1 4 5 4 6 1 1 3 5 2
[2147] 6 4 2 3 2 7 5 5 6 4 7 7 5 5 7 3 3 2 7 4 6 1 3 4 6 6 1 1 6 6 6 1 2 2 3 3 2
[2184] 7 7 4 1 6 1 4 4 6 2 7 7 3 3 7 7 5 1 2 3 7 5 1 1 5 5 5 3 3 5 7 7 7 2 3 1 3
[2221] 4 4 6 1 4 1 5 7 7 1 7 7 1 6 4 1 4 2 2 2 6 3 7 4 3 7 3 3 2 2 2 3 1 6 2 2 6
[2258] 1 5 4 1 6 7 4 7 2 7 7 2 7 7 7 3 7 2 3 7 7 2 3 7 3 3 7 7 3 2 3 7 4 2 4 4 7
[2295] 7 7 3 4 3 3 4 7 7 7 5 5 7 2 7 7 7 7 4 7 7 7 7 7 7 3 3 7 3 7 5 2 7 4 2 3 3
[2332] 2 7 2 2 5 1 6 7 3 7 7 3 2 3 2 7 3 7 7 7 4 6 7 3 7 2 3 2 2 2 7 4 4 3 3 6 6
[2369] 4 7 3 3 4 6 4 4 5 1 6 3 2 2 2 7 2 3 5 5 1 6 1 6 1 6 5 4 1 4 6 4 7 1 6 6 6
[2406] 6 1 7 5 5 1 7 7 7 7 3 4 7 7 1 4 6 5 1 6 5 5 2 2 6 6 6 6 6 3 3 3 6 7 7 2 2
[2443] 3 2 3 7 7 7 6 6 2 2 2 3 7 4 7 7 3 2 1 6 1 7 7 7 2 2 7 5 4 7 2 6 5 1 2 3 7
[2480] 7 2 3 7 4 6 3 1 2 5 1 3 6 5 4 1 7 4 6 6 2 6 5 7 5 7 2 7 2 3 3 3 4 6 3 3 3
[2517] 3 7 4 4 1 4 7 3 3 3 3 3 3 2 7 3 4 6 7 7 2 6 4 5 6 6 6 5 5 4 6 5 2 2 2 7 7
[2554] 7 3 7 1 3 2 4 2 1 7 6 3 3 2 2 7 6 4 5 7 2 7 3 3 7 3 2 2 3 7 7 5 6 6 6 5 1
[2591] 5 2 7 2 3 2 7 5 4 6 5 2 2 3 2 7 2 7 7 2 3 7 2 4 7 5 6 6 6 1 1 4 1 5 5 4 7
[2628] 1 6 6 1 6 5 1 1 1 5 1 4 4 7 7 7 3 2 6 1 4 4 3 2 4 4 3 2 2 2 3 3 2 6 6 7 3
[2665] 3 7 3 4 1 4 2 3 2 7 5 1 4 5 2 7 6 4 5 1 7 4 7 5 6 3 2 2 5 1 2 2 7 4 4 4 7
[2702] 5 6 5 2 2 3 7 7 3 7 4 7 7 2 2 3 4 7 4 7 3 3 2 2 1 1 4 2 3 2 5 6 2 4 6 7 4
[2739] 5 4 6 4 3 3 6 3 7 7 3 2 3 1 6 1 6 4 2 7 4 2 2 2 4 7 7 6 1 3 5 6 1 7 3 3 7
[2776] 4 7 3 4 4 6 6 4 1 6 3 3 1 4 5 1 6 5 3 2 2 2 4 2 3 2 3 2 3 7 4 4 5 1 1 4 5
[2813] 6 1 1 3 3 5 5 5 1 2 2 3 3 3 1 1 6 7 4 1 5 1 5 2 6 1 4 5 6 5 1 6 6 1 2 5 1
[2850] 7 1 7 7 7 3 3 2 2 2 4 4 6 4 7 6 6 6 4 1 4 7 4 2 1 6 6 4 5 2 2 7 7 3 2 2 4
[2887] 5 6 1 1 7 7 2 2 2 7 2 1 4 5 5 4 2 3 7 4 7 7 1 6 6 6 6 6 1 1 1 4 1 3 2 1 2
[2924] 7 7 4 4 3 2 6 7 2 7 7 7 6 4 5 1 4 4 5 4 5 2 7 3 6 4 3 7 4 4 2 1 6 6 5 1 1
[2961] 1 4 7 2 2 2 2 2 2 2 2 7 3 7 7 3 6 6 6 4 4 6 6 6 1 4 4 1 6 5 4 6 4 2 7 3 7
[2998] 4 6 4 2 2 6 6 1 6 5 2 7 7 7 7 7 7 3 3 3 2 2 2 2 2 1 6 1 4 2 2 5 5 5 5 1 6
[3035] 5 4 6 7 1 1 5 5 5 5 6 6 4 3 7 3 2 7 1 4 3 2 3 3 2 2 1 6 6 5 6 6 1 3 7 4 3
[3072] 3 3 7 7 2 7 4 5 4 7 7 3 7 7 1 1 1 6 4 4 6 6

Add Clusterings

winks <- wino %>% mutate(
  two =   unlist(kclusts[2]),
  three = unlist(kclusts[3]),
  four =  unlist(kclusts[4]),
  five =  unlist(kclusts[5]),
  six =   unlist(kclusts[6]),
  seven = unlist(kclusts[7]),
  eight = unlist(kclusts[8]),
  nine =  unlist(kclusts[9]))
names(winks)

 [1] "oak"    "finish" "fruit"  "cherry" "pear"   "apple"  "price"  "points"
 [9] "two"    "three"  "four"   "five"   "six"    "seven"  "eight"  "nine"

View

ggplot(winks, aes(price, points)) +
  geom_point(aes(color = two))

ggplot(winks, aes(price, points)) +
  geom_point(aes(color = nine))

View

ggplot(winks, aes(price, points)) +
  geom_point(aes(color = three))

ggplot(winks, aes(price, points)) +
  geom_point(aes(color = five))

Just look at one

ggplot(winks %>% mutate(six = as.factor(six)), aes(price, points)) +
  geom_point(aes(color = six))

Group Exercise

\(k\)-means clustering on 1-12 (age - credit) of churn
Label the clusters appropriately
Add customer revenue feature: \[ \begin{align} &0.20&*&\text{Revolving Credit}\\ +&0.25&*&\text{Transaction Count}\\ -&15&*&\text{Contacts Count} \end{align} \]
Density plot clusters vs this revenue & revolving balance

Walkthrough

\(k\)-means clustering on 1-12 (age- credit) of churn

Load BankChurners.rds

bank <- readRDS(gzcon(url("https://cd-public.github.io/D505/dat/BankChurners.rds"))) %>% drop_na(.)
names(bank)

 [1] "Customer_Age"             "Gender"                  
 [3] "Dependent_count"          "Education_Level"         
 [5] "Marital_Status"           "Income_Category"         
 [7] "Card_Category"            "Months_on_book"          
 [9] "Total_Relationship_Count" "Months_Inactive_12_mon"  
[11] "Contacts_Count_12_mon"    "Credit_Limit"            
[13] "Total_Revolving_Bal"      "Avg_Open_To_Buy"         
[15] "Total_Amt_Chng_Q4_Q1"     "Total_Trans_Amt"         
[17] "Total_Trans_Ct"           "Total_Ct_Chng_Q4_Q1"     
[19] "Avg_Utilization_Ratio"    "Churn"

Walkthrough

\(k\)-means clustering on 1-12 (age - credit) of churn

“(age - credit)”

features <- bank %>% select(1:12) %>% drop_na(.)
names(features)

 [1] "Customer_Age"             "Gender"                  
 [3] "Dependent_count"          "Education_Level"         
 [5] "Marital_Status"           "Income_Category"         
 [7] "Card_Category"            "Months_on_book"          
 [9] "Total_Relationship_Count" "Months_Inactive_12_mon"  
[11] "Contacts_Count_12_mon"    "Credit_Limit"

Walkthrough

\(k\)-means clustering on 1-12 (age - credit) of churn

Engineer features

features <- features %>%
  mutate(Gender = as.numeric(Gender=="F"))  %>%
  mutate(Education_Level = as.numeric(Education_Level == "Graduate")) %>%
  mutate(Marital_Status = as.numeric(Marital_Status == "Married")) %>% 
  mutate(Income_Category = as.numeric(Income_Category == "$120K +")) %>% 
  mutate(Card_Category = as.numeric(Card_Category=="Blue"))

head(features)

# A tibble: 6 × 12
  Customer_Age Gender Dependent_count Education_Level Marital_Status
         <dbl>  <dbl>           <dbl>           <dbl>          <dbl>
1           45      0               3               0              1
2           49      1               5               1              0
3           51      0               3               1              1
4           40      1               4               0              0
5           40      0               3               0              1
6           44      0               2               1              1
# ℹ 7 more variables: Income_Category <dbl>, Card_Category <dbl>,
#   Months_on_book <dbl>, Total_Relationship_Count <dbl>,
#   Months_Inactive_12_mon <dbl>, Contacts_Count_12_mon <dbl>,
#   Credit_Limit <dbl>

Walkthrough

\(k\)-means clustering on 1-12 (age - credit) of churn

What if we use 4?

kclust <- kmeans(features, 4)
kclust$centers

  Customer_Age     Gender Dependent_count Education_Level Marital_Status
1     46.35510 0.39942803        2.353670       0.3169685      0.4475691
2     46.30613 0.68655503        2.292473       0.3082426      0.4839863
3     46.35769 0.20961538        2.446154       0.2961538      0.4288462
4     46.35919 0.09427208        2.597852       0.3090692      0.3878282
  Income_Category Card_Category Months_on_book Total_Relationship_Count
1      0.07673975     0.9394662       36.12154                 3.784557
2      0.02308568     0.9991871       35.83369                 3.874980
3      0.13750000     0.8461538       36.02308                 3.785577
4      0.33532220     0.5238663       36.02267                 3.458234
  Months_Inactive_12_mon Contacts_Count_12_mon Credit_Limit
1               2.314109              2.477121     9816.343
2               2.357503              2.435376     3120.581
3               2.330769              2.497115    19420.893
4               2.301909              2.495227    32731.131

Walkthrough

\(k\)-means clustering on 1-12 (age - credit) of churn

Use two clusters (rich and not rich, probably)

kclust <- kmeans(features, 2)
kclust$centers

  Customer_Age    Gender Dependent_count Education_Level Marital_Status
1     46.31353 0.6083274        2.310188       0.3109043      0.4753042
2     46.38567 0.1484241        2.519198       0.2991404      0.4028653
  Income_Category Card_Category Months_on_book Total_Relationship_Count
1      0.03841565     0.9807922       35.90849                 3.848366
2      0.23209169     0.6962751       36.02407                 3.640688
  Months_Inactive_12_mon Contacts_Count_12_mon Credit_Limit
1               2.347888              2.447626      4985.96
2               2.308883              2.492264     26145.26

Walkthrough

Label the clusters appropriately

bank <- bank %>% mutate(cluster = kclust$cluster) %>% 
  mutate(Marxist_Class = ifelse(cluster==1, "Bourgeoisie", "Proletariat")) %>%
  select(-cluster)

Walkthrough

Add customer revenue feature: \[ \begin{align} &0.20&*&\text{Revolving Credit}\\ +&0.25&*&\text{Transaction Count}\\ -&15&*&\text{Contacts Count} \end{align} \]

bank <- bank %>% mutate(Revenue = Total_Revolving_Bal/5 +  Total_Trans_Ct/4 - 15 * Contacts_Count_12_mon)

Walkthrough

Density plot clusters vs this revenue & revolving balance

bank %>% 
  ggplot(aes(Revenue))+
  geom_density(alpha=0.3)+
  facet_wrap(~Marxist_Class)

Walkthrough

Density plot clusters vs this revenue & revolving balance

bank %>% 
  ggplot(aes(Total_Revolving_Bal))+
  geom_density(alpha=0.3)+
  facet_wrap(~Marxist_Class)

Walkthrough

Density plot clusters vs … churn?

bank %>% 
  ggplot(aes(Churn))+
  geom_density(alpha=0.3)+
  facet_wrap(~Marxist_Class)

Hierarchical Clustering

swine <- wino %>% sample_n(200)
hclustr <- hclust(d=dist(swine))
summary(hclustr)

            Length Class  Mode     
merge       398    -none- numeric  
height      199    -none- numeric  
order       200    -none- numeric  
labels        0    -none- NULL     
method        1    -none- character
call          2    -none- call     
dist.method   1    -none- character

Plot the dendrogram

plot(hclustr)
abline(h=3, col="red")

Assign clusters

hclustr <- hclust(d=dist(wino))
cluster <- cutree(hclustr, k=3)
swine <- wino %>% 
  add_column(cluster) %>% 
  mutate(cluster=as_factor(cluster))

head(swine)

# A tibble: 6 × 9
    oak finish fruit cherry  pear apple price[,1] points[,1] cluster
  <dbl>  <dbl> <dbl>  <dbl> <dbl> <dbl>     <dbl>      <dbl> <fct>  
1     1      0     0      0     0     0    -1.14      -0.892 1      
2     0      1     1      0     0     0     0.664     -1.26  1      
3     0      1     1      1     0     0     0.123      0.592 2      
4     0      1     1      0     0     0    -0.478     -1.63  1      
5     0      0     1      1     0     0    -0.953     -1.63  3      
6     0      0     1      1     0     0     0.664     -0.150 2

Visualize clusters

see_densities(swine, c("oak", "finish", "fruit"))

Visualize clusters

see_densities(swine, c("cherry", "pear", "apple"))

Visualize clusters

see_densities(swine, c("points","price"))

What do you see as some of the issues with Hierarchical clustering?

The Problem

c(nrow(wino %>% sample_n(200)), nrow(wino)) # swine <- wino %>% sample_n(200)

[1]  200 3093

Wait a minute.

c(2 ^ 200, 2 ^ 3093, 2 ^ 3093 / 2 ^ 200)

[1] 1.606938e+60          Inf          Inf

References

https://cran.r-project.org/web/packages/broom/vignettes/kmeans.html