<- suppressPackageStartupMessages
sh sh(library(tidyverse))
sh(library(caret))
sh(library(class))
sh(library(ISLR)) # for the "College" dataframe
Setup
- Setup
Dataframe
- We use the
College
dataframe.
head(College)
Private Apps Accept Enroll Top10perc Top25perc
Abilene Christian University Yes 1660 1232 721 23 52
Adelphi University Yes 2186 1924 512 16 29
Adrian College Yes 1428 1097 336 22 50
Agnes Scott College Yes 417 349 137 60 89
Alaska Pacific University Yes 193 146 55 16 44
Albertson College Yes 587 479 158 38 62
F.Undergrad P.Undergrad Outstate Room.Board Books
Abilene Christian University 2885 537 7440 3300 450
Adelphi University 2683 1227 12280 6450 750
Adrian College 1036 99 11250 3750 400
Agnes Scott College 510 63 12960 5450 450
Alaska Pacific University 249 869 7560 4120 800
Albertson College 678 41 13500 3335 500
Personal PhD Terminal S.F.Ratio perc.alumni Expend
Abilene Christian University 2200 70 78 18.1 12 7041
Adelphi University 1500 29 30 12.2 16 10527
Adrian College 1165 53 66 12.9 30 8735
Agnes Scott College 875 92 97 7.7 37 19016
Alaska Pacific University 1500 76 72 11.9 2 10922
Albertson College 675 67 73 9.4 11 9727
Grad.Rate
Abilene Christian University 60
Adelphi University 56
Adrian College 54
Agnes Scott College 59
Alaska Pacific University 15
Albertson College 55
- States the ISLR textbook:
Name | Description |
---|---|
Private |
Public/private indicator |
Apps |
Number of applications received |
Accept |
Number of applicants accepted |
Enroll |
Number of new students enrolled |
Top10perc |
New students from top 10 % of high school class |
Top25perc |
New students from top 25 % of high school class |
F.Undergrad |
Number of full-time undergraduates |
P.Undergrad |
Number of part-time undergraduates |
Outstate |
Out-of-state tuition |
Room.Board |
Room and board costs |
Books |
Estimated book costs |
Personal |
Estimated personal spending |
PhD |
Percent of faculty with Ph.D.’s |
Terminal |
Percent of faculty with terminal degree |
S.F.Ratio |
Student/faculty ratio |
perc.alumni |
Percent of alumni who donate |
Expend |
Instructional expenditure per student |
Grad.Rate |
Graduation rate |
Multiple Regression
- Run a linear regression model with
Grad.Rate
as the dependent variable andPhD
andExpend
as features (variables).- Regard
PhD
andExpend
as two forms of investment in education - in training for instructors, and in resources for students.
- Regard
- Compute and comment on the RMSE.
= lm(Grad.Rate ~ PhD, College)
m1 = lm(Grad.Rate ~ Expend, College)
m2 = lm(Grad.Rate ~ PhD + Expend, College)
m3 = lm(Grad.Rate ~ PhD * Expend, College)
m4 = lm(Grad.Rate ~ ., College)
m5
<- function(m) {
get_rmse <- predict(m, newdata = College)
pred sqrt(mean((College$Grad.Rate - pred)^2))
}
unlist(lapply(list(m1, m2, m3, m4, m5), get_rmse))
[1] 16.34849 15.80482 15.59084 15.57864 12.59685
While neither PhD attainment of teaching faculty nor the institutional expenditure per student are extraordinarily accurate - an error of ~16 with regards to a percentage intuitively feels high - nothing in the data set necessarily naively predicts graduation rate so well, and the interaction of these features predicts better than either feature individually or even both features without an interaction term. So, I expect both educational attainment of faculty and expenditure per student are important parts of delivering a high quality education, and each improves the other’s value, though I would need to separately ensure the coefficients are positive to validate this claim:
m4
Call:
lm(formula = Grad.Rate ~ PhD * Expend, data = College)
Coefficients:
(Intercept) PhD Expend PhD:Expend
4.899e+01 1.002e-01 2.757e-04 8.829e-06
Feature Engineering
- Create 10 total features. Consider:
- Attributes of the student body.
- For example, an acceptance rate, or a percentages of students in other categories vs. accepted/enrolled.
- Costs of the university.
- Some other category, such as related to success, alumni, or faculty.
- Attributes of the student body.
- Remove all rows with a missing value.
- Ensure only
Grad.Rate
and the engineered features remain. - Compute and comment on the RMSE.
<- College %>%
df_all mutate(AcceptRate=Accept/Apps) %>%
mutate(EnrollRate=Enroll/Accept) %>%
mutate(Top10Rate=Top10perc/Enroll) %>%
mutate(Top10Rate=Top25perc/Enroll) %>%
mutate(Cost=Outstate+Room.Board+Books+Personal) %>%
mutate(MS=Terminal-PhD) %>%
mutate(AppToAlum = AcceptRate * EnrollRate * perc.alumni)
<- df_all %>%
df_feat select(-Private,-Apps,-Accept,-Enroll,-Top10perc,-Top25perc,-F.Undergrad,-P.Undergrad,-Outstate,-Room.Board) %>%
select(-Books,-Personal,-PhD,-Terminal,-S.F.Ratio,-perc.alumni,-Expend)
sqrt(mean((df_all$Grad.Rate - predict(lm(formula = Grad.Rate ~ ., data = df_all), newdata = df_all))^2))
[1] 12.36756
sqrt(mean((df_feat$Grad.Rate - predict(lm(formula = Grad.Rate ~ ., data = df_feat), newdata = df_feat))^2))
[1] 13.4235
Adding my novel features did marginally improve predictive power, but removing the initial, provided features, worsened performances versus just using the original data set. I should be more intentional and systematic about: (1) including all elements of the original data frame which may be relevant in my engineered features, and (2) ensuring independence between retained features to ensure that the assumptions of linear models are satisfied.
Classification Methods
- Use either of \(K\)-NN or Naive Bayes to predict whether a college is
Private
. - Explain your choice of technique.
- Report on your Kappa value.
= trainControl(method = "cv", number = 5) control
- We try a \(K\)-NN over a few features.
= df_all %>%
df_knn select(Private, AcceptRate, PhD, Cost, Top10perc)
<- createDataPartition(df_knn$Private, p = 0.8, list = FALSE)
split <- df_knn[split, ]
train_knn <- df_knn[-split, ]
test_knn
= train(Private ~ .,
fit_knn data = train_knn,
method = "knn",
tuneLength = 15,
metric = "Kappa",
trControl = control)
confusionMatrix(predict(fit_knn, test_knn),factor(test_knn$Private))
Confusion Matrix and Statistics
Reference
Prediction No Yes
No 23 7
Yes 19 106
Accuracy : 0.8323
95% CI : (0.764, 0.8874)
No Information Rate : 0.729
P-Value [Acc > NIR] : 0.001745
Kappa : 0.5336
Mcnemar's Test P-Value : 0.030984
Sensitivity : 0.5476
Specificity : 0.9381
Pos Pred Value : 0.7667
Neg Pred Value : 0.8480
Prevalence : 0.2710
Detection Rate : 0.1484
Detection Prevalence : 0.1935
Balanced Accuracy : 0.7428
'Positive' Class : No
- We bin a few features and try Naive.
= df_knn %>%
df_nb mutate(HighAccept=AcceptRate > mean(df_knn$AcceptRate)) %>%
mutate(HighPhD=PhD > mean(df_knn$PhD)) %>%
mutate(HighCost=Cost > mean(df_knn$Cost)) %>%
mutate(HighTop10=Top10perc > mean(df_knn$Top10perc)) %>%
select(-AcceptRate, -PhD, -Cost, -Top10perc)
<- createDataPartition(df_nb$Private, p = 0.8, list = FALSE)
split <- df_nb[split, ]
train_nb <- df_nb[-split, ]
test_nb
= train(Private ~ .,
fit_nb data = train_nb,
method = "naive_bayes",
tuneLength = 15,
metric = "Kappa",
trControl = control)
confusionMatrix(predict(fit_nb, test_nb),factor(test_nb$Private))
Confusion Matrix and Statistics
Reference
Prediction No Yes
No 20 11
Yes 22 102
Accuracy : 0.7871
95% CI : (0.7142, 0.8487)
No Information Rate : 0.729
P-Value [Acc > NIR] : 0.05965
Kappa : 0.4128
Mcnemar's Test P-Value : 0.08172
Sensitivity : 0.4762
Specificity : 0.9027
Pos Pred Value : 0.6452
Neg Pred Value : 0.8226
Prevalence : 0.2710
Detection Rate : 0.1290
Detection Prevalence : 0.2000
Balanced Accuracy : 0.6894
'Positive' Class : No
I expect public schools, due to the forms of regulatory oversight, they experience, to be clustered fairly neatly around certain admission, instruction preparation, and acceptance metrics, with of course a few exceptions for e.g. the “Public Ivies” like UC Berkley and UT Austin. This is reflected in drastically more accurate Kappa values for Naive Bayes, which are reflective, to me, of public universities as a rule being on the same side of the means across various metrics as other public universities.
Classification Techniques
- Predict whether a college is
Private
. - Use model weights.
- Display and comment on an ROC curve.
<- table(df_knn$Private)
counts <- counts["Yes"]
count_y <- counts["No"]
count_n <- max(count_y,count_n)/count_y
weigh_y <- max(count_y,count_n)/count_n
weigh_n
c(count_y,count_n,weigh_y,weigh_n)
Yes No Yes No
565.000000 212.000000 1.000000 2.665094
<- train_knn %>%
train_knn mutate(weight=ifelse(Private=="Yes", weigh_y, weigh_n))
= train(Private ~ .,
fit_weights data = train_knn %>% select(-weight),
method = "naive_bayes",
tuneLength = 15,
metric = "Kappa",
trControl = control,
weights = train_knn$weight)
confusionMatrix(predict(fit_weights, test_knn),factor(test_knn$Private))
Confusion Matrix and Statistics
Reference
Prediction No Yes
No 28 8
Yes 14 105
Accuracy : 0.8581
95% CI : (0.793, 0.9089)
No Information Rate : 0.729
P-Value [Acc > NIR] : 9.277e-05
Kappa : 0.6239
Mcnemar's Test P-Value : 0.2864
Sensitivity : 0.6667
Specificity : 0.9292
Pos Pred Value : 0.7778
Neg Pred Value : 0.8824
Prevalence : 0.2710
Detection Rate : 0.1806
Detection Prevalence : 0.2323
Balanced Accuracy : 0.7979
'Positive' Class : No
In this case, where we had relatively high predictive power to find both private and non-private colleges or universities without weights, adding weights actually disrupted this process and reduced our Kappa. Using weights is not effective in all cases - there are many private colleges and universities, and maybe many different kinds of colleges and universities we find through nearest neighbors or naive bayes, and it is unhelpful to apply weights to public colleges and universities which we already identified relatively effective. It may be worthwhile, however, to add something like a “college or university” feature or something of this nature to further differentiate types of institutions.
Ethics
- Based on your analysis, comment on the for-profit privatization of education, perhaps through the framework advanced by this article:
- Discuss the civic reposibilities of data scientists for:
- Big Data and Human-Centered Computing
- Democratic Institutions
- Education and Educational Policy
- Provide at least one statistical measure for each, such as a RMSE, Kappa value, or ROC curve.
Big Data and Human-Centered Computing
We note that private institutions make up 73% of institutions while educating 43% of students. From our earlier work, we note that private/public predicts graduation rate, and these institutions may be weakly differentiated with regards to measures like acceptance rate, instuctor educational attainment, and cost. A data driven approach to policy may want to investigate closely whether private schools, which enjoy non-profit status and federal loan assistance, are serving students better, or worse, given their lower level of regulatory oversight.
list(
sum(filter(College, Private == "Yes")$Enroll)/sum(College$Enroll),
count(filter(College, Private == "Yes"))/count(College)
)
[[1]]
[1] 0.4260023
[[2]]
n
1 0.7271557
We note that public institutions have dramatically lower cost, including for out-of-state students, in aggregate, possible due to their obligation to serve the broader public through the democratic process. A strong claim, which is partially but not fully supported here, is that democratic pressures dramatically expand access and cut costs, but this at least appears true for the specific niche of higher education.
Democratic Institutions
list(
mean(filter(df_knn, Private == "Yes")$Cost),
mean(filter(df_knn, Private == "No")$Cost)
)
[[1]]
[1] 18149.78
[[2]]
[1] 12793.01
Education and Educational Policy
We note that private universities achieve dramatically lower student-to-faculty ratios, roughly in line with their higher costs…
list(
mean(filter(College, Private == "Yes")$S.F.Ratio),
mean(filter(College, Private == "No")$S.F.Ratio)
)
[[1]]
[1] 12.94549
[[2]]
[1] 17.13915
Yet we do not see lower student-to-faculty ratios necessarily corresponding to a higher graduation rate.
lm(Grad.Rate ~ S.F.Ratio, College)
Call:
lm(formula = Grad.Rate ~ S.F.Ratio, data = College)
Coefficients:
(Intercept) S.F.Ratio
84.217 -1.331
Closing Thoughts
As data scientists with a private school affiliation, we should advocate strongly for democratic process within our institution and our region.