There are many more 100- and 200-level courses than 500-level courses
So we shouldn’t assume that “05” means the same thing after “5” as it does after “1”
Many prefix \(\times\) hundred-level combinations lack an “01” course at all!
Visualize
PCA of a multivariate Gaussian distribution centered at (1,3) with a standard deviation of 3 in roughly the (0.866, 0.5) direction and of 1 in the orthogonal direction. The vectors shown are the eigenvectors of the covariance matrix scaled by the square root of the corresponding eigenvalue, and shifted so their tails are at the mean.
Why?
What are the primary reasons to use PCA?
Dimensionality reduction
Visualization
Noise reduction
Curse of dimensionality
As the dimensionality of the feature space increases,
the number of possible configurations grows exponentially, and thus
the fraction of configurations covered by a fixed set of observations shrinks.
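A quick numerical sketch of this effect (the bin count and sample size are illustrative choices, not from the source): with 10 bins per axis, a grid over \(d\) dimensions has \(10^d\) cells, so a fixed sample of 1,000 points covers a vanishing fraction as \(d\) grows.

```r
# Illustrative only: 10 bins per axis gives 10^d cells in d dimensions.
# A fixed sample of n points can cover at most n of those cells.
n <- 1000
for (d in c(1, 2, 5, 10)) {
  cells <- 10^d
  coverage <- 100 * min(n, cells) / cells
  cat(sprintf("d = %2d: %g cells, coverage <= %g%%\n", d, cells, coverage))
}
```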
Find a linear combination of variables to create principal components
Maintain as much variance as possible.
Principal components are orthogonal (uncorrelated)
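A minimal sketch of these ideas with base R’s prcomp(), on simulated data matching the figure’s Gaussian (mean (1, 3), standard deviations 3 and 1, major axis along roughly (0.866, 0.5)); the sample size and seed are arbitrary choices:

```r
# Simulate a Gaussian like the one in the figure, then recover its axes with PCA.
set.seed(505)
n <- 1000
z <- cbind(rnorm(n, sd = 3), rnorm(n, sd = 1))   # independent components
R <- cbind(c(0.866, 0.5), c(-0.5, 0.866))        # rotate major axis onto (0.866, 0.5)
x <- t(R %*% t(z)) + matrix(c(1, 3), n, 2, byrow = TRUE)

pca <- prcomp(x, center = TRUE, scale. = FALSE)
pca$rotation   # columns are the principal components (orthogonal unit vectors)
pca$sdev       # roughly 3 and 1: the standard deviation along each component
```

The first principal component recovers the direction of maximum variance; the second is forced to be orthogonal to it.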
Rotation of orthogonal axes
Singular value decomposition
Illustration of the singular value decomposition \(U\Sigma V^*\) of matrix \(M\).
Top: The action of \(M\), indicated by its effect on the unit disc \(D\) and the two canonical unit vectors \(e_1\) and \(e_2\).
Left: The action of \(V^*\), a rotation, on \(D\), \(e_1\) and \(e_2\).
Bottom: The action of \(\Sigma\), a scaling by the singular values \(\sigma_1\) horizontally and \(\sigma_2\) vertically.
Right: The action of \(U\), another rotation.
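The same rotation–scaling–rotation picture can be checked numerically with base R’s svd(); the 2×2 shear matrix below is just an illustrative choice:

```r
# Sketch: svd() factors M into rotations (u, v) and a diagonal scaling (d).
M <- matrix(c(1, 0, 1, 1), 2, 2)   # illustrative 2x2 shear matrix
s <- svd(M)
s$d                                 # singular values, sigma_1 >= sigma_2
# Reconstruct M = U Sigma V* (for real matrices, V* is just t(s$v))
reconstructed <- s$u %*% diag(s$d) %*% t(s$v)
all.equal(M, reconstructed)         # TRUE up to floating point
```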
Libraries Setup
Note that we are adding an id column to the wine dataframe!
Why do we do this?
sh <- suppressPackageStartupMessages
sh(library(tidyverse))
sh(library(tidytext))
sh(library(caret))
sh(library(topicmodels)) # new?
data(stop_words)
sh(library(thematic))
theme_set(theme_dark())
thematic_rmd(bg = "#111", fg = "#eee", accent = "#eee")
wine <- readRDS(gzcon(url("https://cd-public.github.io/D505/dat/variety.rds"))) %>%
  rowid_to_column("id")
bank <- readRDS(gzcon(url("https://cd-public.github.io/D505/dat/BankChurners.rds")))
We imagine that each document may contain words from several topics in particular proportions. For example, in a two-topic model we could say “Document 1 is 90% topic A and 10% topic B, while Document 2 is 30% topic A and 70% topic B.”
Topics
Every topic is a mixture of words. For example, we could imagine a two-topic model of American news, with one topic for “politics” and one for “entertainment.” The most common words in the politics topic might be “President”, “Congress”, and “government”, while the entertainment topic may be made up of words such as “movies”, “television”, and “actor”. Importantly, words can be shared between topics; a word like “budget” might appear in both equally.
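As a sketch of how such a model is fit in code, topicmodels::LDA() takes a document-term matrix, which tidytext::cast_dtm() can build from tidy word counts. The toy counts below are invented for illustration (doc 1 skews “politics,” doc 2 skews “entertainment,” and “budget” is shared):

```r
library(tidyverse)
library(tidytext)
library(topicmodels)

# Invented toy counts for a two-document, two-topic illustration.
toy <- tibble(
  doc  = c(1, 1, 1, 2, 2, 2),
  word = c("president", "congress", "budget", "movies", "actor", "budget"),
  n    = c(5, 4, 2, 5, 4, 2)
)
dtm <- toy %>% cast_dtm(doc, word, n)

lda <- LDA(dtm, k = 2, control = list(seed = 505))
tidy(lda, matrix = "beta")    # per-topic word probabilities
tidy(lda, matrix = "gamma")   # per-document topic proportions
```

The beta matrix gives each topic as a mixture of words, and the gamma matrix gives each document as a mixture of topics, exactly the two views described above.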