Clustering

Jameson > Hendrik > Calvin

Agenda

  1. Modeling Reminders
  2. K-means Clustering
  3. Hierarchical Clustering
  4. Final Modeling Project

Timing Update

  • I misnumbered my weeks, so Model 2 is due immediately before the final.
  • In theory, this means you have more time on model 2?
  • Anyways:
    • Model 2 by 14 Apr
    • Final on 21 Apr

Next modeling project

  1. Due next Monday (14 Apr)
  2. Use bank data to predict churn
  3. Using exactly 5 features
  4. Models scored based on AUC
  5. We’ll leave time at the end to get started

Time to Touch on final

Emphasis

  • No code
  • 10 features

A hold-out sample is a random sample from a data set that is withheld and not used in the model fitting process. After the model is fit to the main data (the “training” data), it is then applied to the hold-out sample. This gives an unbiased assessment of how well the model might do if applied to new data.

Supervision

Motivating Questions

  1. What is difference between supervised and unsupervised learning?
  2. What is unsupervised learning used for?
  3. What are some challenges with unsupervised learning?

Supervised Learning

  • Uses labeled data.
  • Maps inputs to outputs.
  • Goal: Predict outcomes.
  • Examples: Classification, regression.

Unsupervised Learning

  • Uses unlabeled data.
  • Finds hidden patterns.
  • Goal: Discover structure.
  • Examples: Clustering, dimensionality reduction.

The Label Divide

  • Supervised: Labeled data.
  • Unsupervised: Unlabeled data.
  • Label presence is key.
Difference between supervised and unsupervised learning

Uses of Unsupervised

  • Clustering: Grouping similar data.
    • E.g. discover Yemenia novel coffee varietal
    • Learn more
      • 18 min + can’t watch on stream.
  • Anomaly detection: Finding outliers.
  • Dimensionality reduction: Simplifying data.
  • Association rule learning: Finding relationships.

Clustering in Action

  • Customer/market segmentation.
    • Why EU entered “esports winter” 18+ months before US?
  • Document clustering.
    fotoğraf galerisi video resim haber içerikleri yükleyebilir

Anomaly Detection

  • Detect fraudulent bank transactions.
    • E.g. if I purchased an espresso in Salem, MA
  • Detect anomalous (computer) network behavior.
    • My colleague at MS does this
  • Predict imminent equipment failure in e.g. manufacturing sector.
    • Do you do this in 596?

Dimensionality Reduction

  • Just a refresh
    • Feature extraction.
    • Data visualization.
    • Reducing noise.

Challenges

Challenges

  • Evaluation
    • No clear “right” answer.
      • And sometimes there’s someday a right answer, but you have to make the model now.
      • Sometimes the “right” answer is determined by someone in power.
    • Subjective evaluation metrics.
      • Non-falsifiable by construction.
    • Validating discovered patterns.

Challenges

  • Interpretation
    • Understanding discovered structures.
      • Think in e.g. practice midterm - did you “discover” colleges vs. universities.
      • What about “big school” and “small school”
    • Meaningful insights can be hard.
      • Did you find “Colleges that Change Lives”
    • Requires or at least benefits from domain expertise.

Challenges

Takeaways

  • Supervised: Labeled data ⇒ predict label.
  • Unsupervised: Unlabeled data ⇒ discover patterns.
    • Valuable, but challenging.
    • Value is correlated with challenge given scarcity.

K-means Clustering Algorithm

Start with k random clusters

Calculate means

Select cluster based on which mean point is closest to

Adjust menas and repeat

Potential Issues

  • What happens with high dimensionality?
  • What happens when dimensions aren’t scaled?

Setup

sh <- suppressPackageStartupMessages
sh(library(tidyverse))
sh(library(tidytext))
data(stop_words)
wine <- readRDS(gzcon(url("https://cd-public.github.io/D505/dat/variety.rds"))) %>% rowid_to_column("id")
glimpse(wine)
Rows: 3,657
Columns: 5
$ id          <int> 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16, 17,…
$ variety     <chr> "Pinot_Gris", "Pinot_Noir", "Pinot_Noir", "Pinot_Noir", "P…
$ price       <dbl> 14, 65, 20, 50, 22, 25, 38, 28, 45, 22, 40, 50, 52, 48, 50…
$ points      <dbl> 87, 87, 87, 86, 86, 86, 91, 85, 85, 85, 89, 89, 92, 95, 92…
$ description <chr> "Tart and snappy, the flavors of lime flesh and rind domin…

Relate Word to IDS

word_ids <- wine %>%
  unnest_tokens(word, description) %>%
  anti_join(stop_words) %>% 
  filter(!(word %in% c("wine","flavors","pinot","gris"))) %>% 
  count(id, word) %>% group_by(id) %>% 
  mutate(n = if_else(n>0,1,0)) %>% ungroup()

head(word_ids)
# A tibble: 6 × 3
     id word          n
  <int> <chr>     <dbl>
1     1 acidity       1
2     1 crisp         1
3     1 dominate      1
4     1 fermented     1
5     1 flesh         1
6     1 green         1

Find top words by variety

top_words <- word_ids %>% 
  right_join(wine, by="id") %>%
  count(variety, word) %>% 
  group_by(variety) %>% top_n(3,n) %>%
  ungroup() %>% select(word) %>% distinct()

head(top_words)
# A tibble: 6 × 1
  word  
  <chr> 
1 apple 
2 fruit 
3 oak   
4 pear  
5 cherry
6 finish

Engineer Features

wino <- word_ids %>% 
  filter(word %in% top_words$word) %>% 
  pivot_wider(id_cols = id, names_from = word, values_from = n, values_fill = 0) %>% 
  right_join(wine, by="id") %>% 
  mutate(price=log(price)) %>%
  mutate(price=scale(price), points=scale(points)) %>% 
  select(-id,-variety, -description) %>% drop_na(.)

head(wino)
# A tibble: 6 × 8
    oak finish fruit cherry  pear apple price[,1] points[,1]
  <dbl>  <dbl> <dbl>  <dbl> <dbl> <dbl>     <dbl>      <dbl>
1     1      0     0      0     0     0    -1.14      -0.892
2     0      1     1      0     0     0     0.664     -1.26 
3     0      1     1      1     0     0     0.123      0.592
4     0      1     1      0     0     0    -0.478     -1.63 
5     0      0     1      1     0     0    -0.953     -1.63 
6     0      0     1      1     0     0     0.664     -0.150

Basic K-means cluster

kclust <- kmeans(wino, centers = 3)
kclust$centers
        oak    finish     fruit    cherry       pear      apple      price
1 0.1469933 0.3151448 0.6792873 0.1447661 0.28396437 0.26280624 -1.2307289
2 0.2147588 0.3538316 0.7559130 0.3916746 0.01797540 0.07190161  0.1621558
3 0.2126538 0.3330404 0.7293497 0.4824253 0.04217926 0.07908612  0.8319282
      points
1 -0.6997839
2 -0.4100172
3  0.9709884

Add clusters

wink <- wino %>% mutate(cluster = kclust$cluster)
head(wink)
# A tibble: 6 × 9
    oak finish fruit cherry  pear apple price[,1] points[,1] cluster
  <dbl>  <dbl> <dbl>  <dbl> <dbl> <dbl>     <dbl>      <dbl>   <int>
1     1      0     0      0     0     0    -1.14      -0.892       1
2     0      1     1      0     0     0     0.664     -1.26        2
3     0      1     1      1     0     0     0.123      0.592       3
4     0      1     1      0     0     0    -0.478     -1.63        1
5     0      0     1      1     0     0    -0.953     -1.63        1
6     0      0     1      1     0     0     0.664     -0.150       2

Cluster Visualizer

see_densities <- function(df, names) {
  df %>% 
    pivot_longer(names,names_to = "feature") %>% 
    ggplot(aes(value, fill=cluster))+
    geom_density(alpha=0.3)+
    facet_wrap(~feature)
}

Visualize clusters

see_densities(wink, c("oak", "finish", "fruit"))

Visualize clusters

see_densities(wink, c("cherry", "pear", "apple"))

Visualize clusters

see_densities(wink, c("points","price"))

Try different numbers of clusters

kclusts <- map(1:9, function(k) { 
    kmeans(wino, centers = k)$cluster
  }
)

Examine any one of the clusterings

unlist(kclusts[7])
   [1] 3 7 6 2 2 7 5 6 5 5 6 6 5 1 3 5 6 5 4 7 2 3 2 2 3 3 7 7 3 3 2 6 1 6 5 5 2
  [38] 2 7 1 5 6 3 4 4 4 7 7 3 2 3 7 7 3 2 2 5 3 1 3 2 2 7 2 3 2 2 1 6 4 2 7 2 7
  [75] 7 3 7 7 7 7 7 3 5 6 1 2 2 2 5 5 1 7 3 2 4 1 7 3 3 1 5 1 2 2 3 3 3 3 2 3 2
 [112] 1 4 6 1 5 6 6 1 2 2 2 2 3 7 7 1 7 3 2 4 4 2 7 2 2 2 1 6 6 6 6 6 7 4 3 4 1
 [149] 1 1 4 1 7 4 2 3 5 6 1 1 2 3 7 3 3 2 6 1 5 1 5 3 3 5 1 2 3 5 2 3 2 5 5 2 1
 [186] 5 4 6 1 4 7 4 7 7 3 7 3 3 2 4 6 5 5 6 1 6 1 1 3 4 5 6 3 7 4 6 6 2 2 3 6 1
 [223] 5 7 4 1 6 6 1 1 5 4 5 6 1 5 3 3 4 1 1 5 7 4 4 4 2 2 7 4 4 5 4 6 6 5 5 1 6
 [260] 2 2 4 6 3 4 4 4 7 2 2 4 4 4 2 7 7 5 6 6 5 4 5 2 4 7 6 1 4 1 6 4 4 3 6 5 1
 [297] 1 7 2 3 7 7 3 3 3 5 2 3 2 7 4 4 7 7 4 4 4 6 1 5 1 5 6 4 2 2 1 5 1 5 6 1 5
 [334] 6 3 1 1 5 7 7 2 1 1 5 6 4 2 6 5 5 6 6 6 7 1 2 7 7 7 3 7 4 7 7 6 3 6 7 3 6
 [371] 6 5 6 6 6 5 5 5 7 7 7 4 4 1 2 2 3 3 3 4 2 3 5 1 7 7 3 7 7 6 1 1 5 5 6 6 4
 [408] 2 2 2 3 3 4 5 6 4 7 5 6 2 3 2 2 3 3 2 6 1 6 4 1 1 4 7 1 7 2 1 1 1 3 3 3 2
 [445] 2 2 3 2 7 7 7 7 2 1 5 7 3 1 6 2 3 2 2 7 3 1 1 1 5 2 2 2 1 6 5 4 3 7 3 7 4
 [482] 6 3 4 1 6 7 3 3 2 7 2 3 2 6 6 1 6 6 2 4 5 6 5 1 4 1 5 7 4 4 1 4 2 3 7 7 2
 [519] 2 6 7 4 1 7 3 3 3 2 3 3 3 2 1 1 6 4 6 7 3 3 2 2 7 1 6 1 2 2 7 2 1 4 2 2 6
 [556] 2 4 7 6 7 2 6 1 7 7 7 4 3 7 7 4 5 5 6 5 2 7 1 7 7 1 4 7 2 7 3 3 3 4 1 7 7
 [593] 7 6 7 4 1 1 4 4 4 4 4 3 3 7 7 7 3 7 7 3 7 7 3 2 2 4 4 1 7 2 4 5 1 1 1 1 1
 [630] 4 6 6 6 7 3 2 3 3 7 3 3 2 2 6 4 1 5 3 5 1 6 6 1 5 5 1 3 2 2 2 3 2 3 2 2 7
 [667] 3 3 6 4 7 7 4 4 2 6 5 4 1 1 2 2 6 1 6 1 6 1 1 1 1 5 4 7 2 2 2 7 3 7 7 7 7
 [704] 6 1 6 4 4 7 7 7 7 2 1 5 7 4 3 7 3 3 5 5 2 3 7 2 4 7 7 4 7 7 7 5 4 7 4 4 3
 [741] 6 4 2 1 2 3 5 6 3 2 2 7 2 7 5 6 5 1 2 3 2 3 3 2 2 3 2 2 6 6 1 1 5 1 1 6 2
 [778] 2 3 4 6 6 5 3 2 7 4 3 6 4 4 1 2 3 7 7 3 2 4 6 4 5 2 1 6 1 6 1 3 2 1 6 4 4
 [815] 1 3 3 7 4 4 1 4 4 7 4 7 5 2 2 2 7 3 7 7 3 6 5 1 4 6 1 4 7 2 7 2 2 3 5 4 4
 [852] 6 4 1 5 1 1 4 5 3 4 7 6 1 5 1 6 4 5 6 6 4 4 1 7 2 2 3 3 3 2 7 6 5 5 2 3 3
 [889] 1 3 2 2 1 6 7 7 4 4 7 6 1 1 1 1 7 2 2 4 5 4 7 2 3 3 6 7 7 1 5 4 1 1 4 4 4
 [926] 1 4 6 3 7 7 7 7 7 4 4 6 1 6 5 1 5 4 2 3 5 1 1 6 2 2 7 2 3 3 7 6 1 1 5 6 6
 [963] 1 5 4 7 7 7 3 2 7 3 6 2 1 6 1 1 3 6 1 1 1 4 6 2 5 1 3 1 7 2 1 6 4 5 4 4 2
[1000] 7 4 7 5 6 5 5 2 2 2 4 6 1 4 4 4 6 5 1 3 2 1 1 2 3 7 2 3 3 4 6 7 4 4 7 5 1
[1037] 1 1 7 7 2 6 6 6 1 7 4 4 6 4 3 2 3 3 2 2 1 2 3 2 4 7 7 5 5 5 7 4 7 2 3 7 7
[1074] 7 6 5 6 5 6 3 6 5 5 1 3 2 7 4 4 6 4 3 3 7 2 7 7 6 6 5 1 1 4 4 7 1 5 5 6 7
[1111] 3 7 3 1 1 5 5 5 1 6 1 1 2 5 1 5 6 2 7 2 2 5 6 1 6 6 1 4 5 6 1 6 4 3 2 2 2
[1148] 5 7 4 4 3 2 3 3 2 2 7 7 2 6 6 2 3 3 5 6 5 4 2 2 2 2 6 4 7 4 4 4 3 5 6 6 2
[1185] 2 1 1 6 7 4 7 7 7 7 3 2 3 3 5 5 5 6 6 7 3 6 2 5 6 1 1 6 5 7 7 7 2 7 7 7 2
[1222] 3 2 7 2 7 2 3 5 5 1 2 2 6 7 5 1 1 4 2 7 7 2 3 1 6 1 4 6 2 7 1 7 3 3 2 1 1
[1259] 4 4 1 5 1 1 2 7 3 7 6 4 2 4 7 4 6 6 3 3 1 1 1 6 4 7 2 2 2 1 1 5 4 1 3 5 6
[1296] 5 2 7 7 7 2 7 1 5 7 7 2 1 4 4 4 1 3 7 7 7 4 6 4 1 4 4 4 6 6 4 6 6 4 3 2 3
[1333] 3 7 7 7 6 6 7 7 1 4 1 5 5 6 5 5 7 7 1 3 3 3 4 7 7 3 2 4 6 2 3 2 3 5 1 4 7
[1370] 2 2 1 6 1 6 3 2 3 7 2 1 5 4 7 2 2 2 3 3 3 7 2 1 1 6 4 7 1 4 5 6 4 7 5 1 3
[1407] 3 2 4 3 2 7 4 3 4 7 7 4 2 2 7 7 3 4 1 1 6 4 2 2 2 7 2 7 1 6 1 1 5 6 5 3 3
[1444] 3 7 4 5 3 2 4 7 5 4 2 3 5 1 7 4 7 5 7 6 5 3 4 4 3 3 2 1 4 1 4 1 5 6 3 5 4
[1481] 2 3 4 4 4 6 6 7 7 4 6 1 4 3 4 4 2 2 7 2 2 3 7 7 4 6 6 3 3 7 7 3 5 4 6 5 4
[1518] 4 6 5 1 2 2 7 2 2 3 7 3 2 7 7 4 7 7 3 6 1 4 2 4 4 5 6 7 4 4 1 5 1 1 1 2 6
[1555] 7 7 7 7 3 7 3 2 2 4 4 4 2 7 7 7 2 7 7 3 2 7 7 7 7 7 7 3 7 2 1 1 1 4 6 5 6
[1592] 3 7 5 6 1 4 1 3 4 4 4 2 7 4 6 5 4 6 1 5 1 3 7 3 7 2 3 2 7 2 2 2 3 2 2 2 2
[1629] 3 2 3 5 1 1 5 1 7 6 5 4 6 6 7 7 3 1 1 6 5 1 6 2 2 2 2 6 7 4 7 7 2 3 7 7 3
[1666] 3 4 5 5 4 4 7 7 4 6 4 5 3 4 7 7 4 1 6 5 5 1 1 6 2 2 6 5 1 7 7 7 7 6 5 5 1
[1703] 7 1 7 7 7 7 4 1 4 4 5 7 7 2 2 2 3 1 1 4 6 6 7 3 2 2 3 2 5 5 6 6 1 1 2 3 4
[1740] 4 6 7 4 7 5 5 6 1 4 4 6 1 3 2 2 6 1 6 3 3 7 7 1 5 4 4 4 7 4 6 6 4 3 7 4 3
[1777] 2 2 2 5 6 4 7 3 7 4 3 4 4 4 1 5 2 4 4 4 7 3 2 7 2 3 2 2 2 7 3 7 2 7 2 1 7
[1814] 3 3 3 2 7 7 4 4 2 2 5 6 7 4 4 2 2 3 2 7 3 5 4 3 3 3 7 5 4 4 3 2 2 5 6 4 1
[1851] 5 2 3 2 4 2 5 7 2 1 6 1 1 1 4 4 1 7 7 7 4 7 1 5 5 6 4 4 4 4 1 1 1 6 5 1 4
[1888] 2 1 1 1 6 1 4 2 6 1 4 3 6 1 5 1 6 5 5 3 1 5 7 7 5 7 7 6 5 2 3 2 2 6 4 1 6
[1925] 2 5 5 1 7 7 7 4 7 3 6 4 1 4 4 4 1 1 3 2 7 2 1 4 1 1 6 4 3 2 7 7 7 4 4 4 7
[1962] 6 4 1 3 5 7 4 4 7 2 3 2 2 3 3 3 3 5 1 5 1 1 4 5 3 3 2 3 2 2 2 6 1 6 6 1 3
[1999] 1 1 6 3 1 2 7 2 3 7 4 4 7 7 5 1 6 1 1 5 4 5 4 2 3 3 3 6 1 1 7 4 3 7 7 1 6
[2036] 5 6 1 2 2 3 3 2 7 7 7 7 2 3 1 6 7 7 3 2 4 4 1 4 6 4 7 5 1 5 5 5 1 1 2 7 4
[2073] 4 4 6 1 1 7 7 7 6 6 1 6 1 1 3 2 5 1 7 5 3 4 5 4 5 7 7 4 7 4 7 6 3 4 7 1 1
[2110] 4 6 6 6 1 6 5 3 5 4 5 5 6 6 4 2 2 2 4 3 4 7 7 7 1 4 4 1 4 5 4 6 1 1 3 5 2
[2147] 6 4 2 3 2 7 5 5 6 4 7 7 5 5 7 3 3 2 7 4 6 1 3 4 6 6 1 1 6 6 6 1 2 2 3 3 2
[2184] 7 7 4 1 6 1 4 4 6 2 7 7 3 3 7 7 5 1 2 3 7 5 1 1 5 5 5 3 3 5 7 7 7 2 3 1 3
[2221] 4 4 6 1 4 1 5 7 7 1 7 7 1 6 4 1 4 2 2 2 6 3 7 4 3 7 3 3 2 2 2 3 1 6 2 2 6
[2258] 1 5 4 1 6 7 4 7 2 7 7 2 7 7 7 3 7 2 3 7 7 2 3 7 3 3 7 7 3 2 3 7 4 2 4 4 7
[2295] 7 7 3 4 3 3 4 7 7 7 5 5 7 2 7 7 7 7 4 7 7 7 7 7 7 3 3 7 3 7 5 2 7 4 2 3 3
[2332] 2 7 2 2 5 1 6 7 3 7 7 3 2 3 2 7 3 7 7 7 4 6 7 3 7 2 3 2 2 2 7 4 4 3 3 6 6
[2369] 4 7 3 3 4 6 4 4 5 1 6 3 2 2 2 7 2 3 5 5 1 6 1 6 1 6 5 4 1 4 6 4 7 1 6 6 6
[2406] 6 1 7 5 5 1 7 7 7 7 3 4 7 7 1 4 6 5 1 6 5 5 2 2 6 6 6 6 6 3 3 3 6 7 7 2 2
[2443] 3 2 3 7 7 7 6 6 2 2 2 3 7 4 7 7 3 2 1 6 1 7 7 7 2 2 7 5 4 7 2 6 5 1 2 3 7
[2480] 7 2 3 7 4 6 3 1 2 5 1 3 6 5 4 1 7 4 6 6 2 6 5 7 5 7 2 7 2 3 3 3 4 6 3 3 3
[2517] 3 7 4 4 1 4 7 3 3 3 3 3 3 2 7 3 4 6 7 7 2 6 4 5 6 6 6 5 5 4 6 5 2 2 2 7 7
[2554] 7 3 7 1 3 2 4 2 1 7 6 3 3 2 2 7 6 4 5 7 2 7 3 3 7 3 2 2 3 7 7 5 6 6 6 5 1
[2591] 5 2 7 2 3 2 7 5 4 6 5 2 2 3 2 7 2 7 7 2 3 7 2 4 7 5 6 6 6 1 1 4 1 5 5 4 7
[2628] 1 6 6 1 6 5 1 1 1 5 1 4 4 7 7 7 3 2 6 1 4 4 3 2 4 4 3 2 2 2 3 3 2 6 6 7 3
[2665] 3 7 3 4 1 4 2 3 2 7 5 1 4 5 2 7 6 4 5 1 7 4 7 5 6 3 2 2 5 1 2 2 7 4 4 4 7
[2702] 5 6 5 2 2 3 7 7 3 7 4 7 7 2 2 3 4 7 4 7 3 3 2 2 1 1 4 2 3 2 5 6 2 4 6 7 4
[2739] 5 4 6 4 3 3 6 3 7 7 3 2 3 1 6 1 6 4 2 7 4 2 2 2 4 7 7 6 1 3 5 6 1 7 3 3 7
[2776] 4 7 3 4 4 6 6 4 1 6 3 3 1 4 5 1 6 5 3 2 2 2 4 2 3 2 3 2 3 7 4 4 5 1 1 4 5
[2813] 6 1 1 3 3 5 5 5 1 2 2 3 3 3 1 1 6 7 4 1 5 1 5 2 6 1 4 5 6 5 1 6 6 1 2 5 1
[2850] 7 1 7 7 7 3 3 2 2 2 4 4 6 4 7 6 6 6 4 1 4 7 4 2 1 6 6 4 5 2 2 7 7 3 2 2 4
[2887] 5 6 1 1 7 7 2 2 2 7 2 1 4 5 5 4 2 3 7 4 7 7 1 6 6 6 6 6 1 1 1 4 1 3 2 1 2
[2924] 7 7 4 4 3 2 6 7 2 7 7 7 6 4 5 1 4 4 5 4 5 2 7 3 6 4 3 7 4 4 2 1 6 6 5 1 1
[2961] 1 4 7 2 2 2 2 2 2 2 2 7 3 7 7 3 6 6 6 4 4 6 6 6 1 4 4 1 6 5 4 6 4 2 7 3 7
[2998] 4 6 4 2 2 6 6 1 6 5 2 7 7 7 7 7 7 3 3 3 2 2 2 2 2 1 6 1 4 2 2 5 5 5 5 1 6
[3035] 5 4 6 7 1 1 5 5 5 5 6 6 4 3 7 3 2 7 1 4 3 2 3 3 2 2 1 6 6 5 6 6 1 3 7 4 3
[3072] 3 3 7 7 2 7 4 5 4 7 7 3 7 7 1 1 1 6 4 4 6 6

Add Clusterings

winks <- wino %>% mutate(
  two =   unlist(kclusts[2]),
  three = unlist(kclusts[3]),
  four =  unlist(kclusts[4]),
  five =  unlist(kclusts[5]),
  six =   unlist(kclusts[6]),
  seven = unlist(kclusts[7]),
  eight = unlist(kclusts[8]),
  nine =  unlist(kclusts[9]))
names(winks)
 [1] "oak"    "finish" "fruit"  "cherry" "pear"   "apple"  "price"  "points"
 [9] "two"    "three"  "four"   "five"   "six"    "seven"  "eight"  "nine"  

View

ggplot(winks, aes(price, points)) +
  geom_point(aes(color = two))

ggplot(winks, aes(price, points)) +
  geom_point(aes(color = nine))

View

ggplot(winks, aes(price, points)) +
  geom_point(aes(color = three))

ggplot(winks, aes(price, points)) +
  geom_point(aes(color = five))

Just look at one

ggplot(winks %>% mutate(six = as.factor(six)), aes(price, points)) +
  geom_point(aes(color = six))

Group Exercise

  1. \(k\)-means clustering on 1-12 (age - credit) of churn
  2. Label the clusters appropriately
  3. Add customer revenue feature: \[ \begin{align} &0.20&*&\text{Revolving Credit}\\ +&0.25&*&\text{Transaction Count}\\ -&15&*&\text{Contacts Count} \end{align} \]
  4. Density plot clusters vs this revenue & revolving balance

Walkthrough

  1. \(k\)-means clustering on 1-12 (age- credit) of churn
  • Load BankChurners.rds
bank <- readRDS(gzcon(url("https://cd-public.github.io/D505/dat/BankChurners.rds"))) %>% drop_na(.)
names(bank)
 [1] "Customer_Age"             "Gender"                  
 [3] "Dependent_count"          "Education_Level"         
 [5] "Marital_Status"           "Income_Category"         
 [7] "Card_Category"            "Months_on_book"          
 [9] "Total_Relationship_Count" "Months_Inactive_12_mon"  
[11] "Contacts_Count_12_mon"    "Credit_Limit"            
[13] "Total_Revolving_Bal"      "Avg_Open_To_Buy"         
[15] "Total_Amt_Chng_Q4_Q1"     "Total_Trans_Amt"         
[17] "Total_Trans_Ct"           "Total_Ct_Chng_Q4_Q1"     
[19] "Avg_Utilization_Ratio"    "Churn"                   

Walkthrough

  1. \(k\)-means clustering on 1-12 (age - credit) of churn
  • “(age - credit)”
features <- bank %>% select(1:12) %>% drop_na(.)
names(features)
 [1] "Customer_Age"             "Gender"                  
 [3] "Dependent_count"          "Education_Level"         
 [5] "Marital_Status"           "Income_Category"         
 [7] "Card_Category"            "Months_on_book"          
 [9] "Total_Relationship_Count" "Months_Inactive_12_mon"  
[11] "Contacts_Count_12_mon"    "Credit_Limit"            

Walkthrough

  1. \(k\)-means clustering on 1-12 (age - credit) of churn
  • Engineer features
features <- features %>%
  mutate(Gender = as.numeric(Gender=="F"))  %>%
  mutate(Education_Level = as.numeric(Education_Level == "Graduate")) %>%
  mutate(Marital_Status = as.numeric(Marital_Status == "Married")) %>% 
  mutate(Income_Category = as.numeric(Income_Category == "$120K +")) %>% 
  mutate(Card_Category = as.numeric(Card_Category=="Blue"))

head(features)
# A tibble: 6 × 12
  Customer_Age Gender Dependent_count Education_Level Marital_Status
         <dbl>  <dbl>           <dbl>           <dbl>          <dbl>
1           45      0               3               0              1
2           49      1               5               1              0
3           51      0               3               1              1
4           40      1               4               0              0
5           40      0               3               0              1
6           44      0               2               1              1
# ℹ 7 more variables: Income_Category <dbl>, Card_Category <dbl>,
#   Months_on_book <dbl>, Total_Relationship_Count <dbl>,
#   Months_Inactive_12_mon <dbl>, Contacts_Count_12_mon <dbl>,
#   Credit_Limit <dbl>

Walkthrough

  1. \(k\)-means clustering on 1-12 (age - credit) of churn
  • What if we use 4?
kclust <- kmeans(features, 4)
kclust$centers
  Customer_Age     Gender Dependent_count Education_Level Marital_Status
1     46.35510 0.39942803        2.353670       0.3169685      0.4475691
2     46.30613 0.68655503        2.292473       0.3082426      0.4839863
3     46.35769 0.20961538        2.446154       0.2961538      0.4288462
4     46.35919 0.09427208        2.597852       0.3090692      0.3878282
  Income_Category Card_Category Months_on_book Total_Relationship_Count
1      0.07673975     0.9394662       36.12154                 3.784557
2      0.02308568     0.9991871       35.83369                 3.874980
3      0.13750000     0.8461538       36.02308                 3.785577
4      0.33532220     0.5238663       36.02267                 3.458234
  Months_Inactive_12_mon Contacts_Count_12_mon Credit_Limit
1               2.314109              2.477121     9816.343
2               2.357503              2.435376     3120.581
3               2.330769              2.497115    19420.893
4               2.301909              2.495227    32731.131

Walkthrough

  1. \(k\)-means clustering on 1-12 (age - credit) of churn
  • Use two clusters (rich and not rich, probably)
kclust <- kmeans(features, 2)
kclust$centers
  Customer_Age    Gender Dependent_count Education_Level Marital_Status
1     46.31353 0.6083274        2.310188       0.3109043      0.4753042
2     46.38567 0.1484241        2.519198       0.2991404      0.4028653
  Income_Category Card_Category Months_on_book Total_Relationship_Count
1      0.03841565     0.9807922       35.90849                 3.848366
2      0.23209169     0.6962751       36.02407                 3.640688
  Months_Inactive_12_mon Contacts_Count_12_mon Credit_Limit
1               2.347888              2.447626      4985.96
2               2.308883              2.492264     26145.26

Walkthrough

  1. Label the clusters appropriately
bank <- bank %>% mutate(cluster = kclust$cluster) %>% 
  mutate(Marxist_Class = ifelse(cluster==1, "Bourgeoisie", "Proletariat")) %>%
  select(-cluster)

Walkthrough

  1. Add customer revenue feature: \[ \begin{align} &0.20&*&\text{Revolving Credit}\\ +&0.25&*&\text{Transaction Count}\\ -&15&*&\text{Contacts Count} \end{align} \]
bank <- bank %>% mutate(Revenue = Total_Revolving_Bal/5 +  Total_Trans_Ct/4 - 15 * Contacts_Count_12_mon)

Walkthrough

  1. Density plot clusters vs this revenue & revolving balance
bank %>% 
  ggplot(aes(Revenue))+
  geom_density(alpha=0.3)+
  facet_wrap(~Marxist_Class)

Walkthrough

  1. Density plot clusters vs this revenue & revolving balance
bank %>% 
  ggplot(aes(Total_Revolving_Bal))+
  geom_density(alpha=0.3)+
  facet_wrap(~Marxist_Class)

Walkthrough

  1. Density plot clusters vs … churn?
bank %>% 
  ggplot(aes(Churn))+
  geom_density(alpha=0.3)+
  facet_wrap(~Marxist_Class)

Hierarchical Clustering

Hierarchical Clustering

Hierarchical Clustering

Hierarchical Clustering

Hierarchical Clustering

Hierarchical Clustering

swine <- wino %>% sample_n(200)
hclustr <- hclust(d=dist(swine))
summary(hclustr)
            Length Class  Mode     
merge       398    -none- numeric  
height      199    -none- numeric  
order       200    -none- numeric  
labels        0    -none- NULL     
method        1    -none- character
call          2    -none- call     
dist.method   1    -none- character

Plot the dendrogram

plot(hclustr)
abline(h=3, col="red")

Assign clusters

hclustr <- hclust(d=dist(wino))
cluster <- cutree(hclustr, k=3)
swine <- wino %>% 
  add_column(cluster) %>% 
  mutate(cluster=as_factor(cluster))

head(swine)
# A tibble: 6 × 9
    oak finish fruit cherry  pear apple price[,1] points[,1] cluster
  <dbl>  <dbl> <dbl>  <dbl> <dbl> <dbl>     <dbl>      <dbl> <fct>  
1     1      0     0      0     0     0    -1.14      -0.892 1      
2     0      1     1      0     0     0     0.664     -1.26  1      
3     0      1     1      1     0     0     0.123      0.592 2      
4     0      1     1      0     0     0    -0.478     -1.63  1      
5     0      0     1      1     0     0    -0.953     -1.63  3      
6     0      0     1      1     0     0     0.664     -0.150 2      

Visualize clusters

see_densities(swine, c("oak", "finish", "fruit"))

Visualize clusters

see_densities(swine, c("cherry", "pear", "apple"))

Visualize clusters

see_densities(swine, c("points","price"))

What do you see as some of the issues with Hierarchical clustering?

The Problem

c(nrow(wino %>% sample_n(200)), nrow(wino)) # swine <- wino %>% sample_n(200)
[1]  200 3093
  • Wait a minute.
c(2 ^ 200, 2 ^ 3093, 2 ^ 3093 / 2 ^ 200)
[1] 1.606938e+60          Inf          Inf

References

https://cran.r-project.org/web/packages/broom/vignettes/kmeans.html