Recently I have been working on a Kaggle competition where participants are tasked with predicting Russian housing prices. In developing a model for the challenge, I came across a few methods for selecting the best regression
model for a given dataset. Let’s load up some data and take a look.
library(ISLR) set.seed(123) swiss <- data.frame(datasets::swiss) dim(swiss)
## [1] 47 6
head(swiss)
## Fertility Agriculture Examination Education Catholic
## Courtelary 80.2 17.0 15 12 9.96
## Delemont 83.1 45.1 6 9 84.84
## Franches-Mnt 92.5 39.7 5 5 93.40
## Moutier 85.8 36.5 12 7 33.77
## Neuveville 76.9 43.5 17 15 5.16
## Porrentruy 76.1 35.3 9 7 90.57
## Infant.Mortality
## Courtelary 22.2
## Delemont 22.2
## Franches-Mnt 20.2
## Moutier 20.3
## Neuveville 20.6
## Porrentruy 26.6
This data provides data on Swiss fertility and some socioeconomic
indicators. Suppose we want to predict fertility rate. Each observation
is as a percentage. In order to assess our prediction ability, create
test and training data sets.
index <- sample(nrow(swiss), 5) train <- swiss[-index,] test <- swiss[index,]
Next, have a quick look at the data.
par(mfrow=c(2,2)) plot(train$Education, train$Fertility) plot(train$Catholic, train$Fertility) plot(train$Infant.Mortality, train$Fertility) hist(train$Fertility)
Looks like there could be some interesting relationships here. The
following block of code will take a model formula (or model matrix) and
search for the best combination of predictors.
library(leaps) leap <- regsubsets(Fertility~., data = train, nbest = 10) leapsum <- summary(leap) plot(leap, scale = 'adjr2')
According to the adjusted R-squared value (larger is better), the best
two models are: those with all predictors and all predictors less
Examination. Both have adjusted R-squared values around .69 – a decent fit. Fit these models so we can evaluate them further.
swisslm <- lm(Fertility~., data = train) swisslm2 <- lm(Fertility~.-Examination, data = train) #use summary() for a more detailed look.
First we'll compute the mean squared error (MSE).
mean((train$Fertility-predict(swisslm, train))[index]^2)
## [1] 44.21879
mean((train$Fertility-predict(swisslm2, train))[index]^2)
## [1] 36.4982
Looks like the model with all predictors is marginally better. We're
looking for a low value of MSE here. We can also look at the Akaike
information criterion (AIC) for more information. Lower is better here
as well.
library(MASS) step1 <- stepAIC(swisslm, direction = "both")
step1$anova
## Stepwise Model Path
## Analysis of Deviance Table
##
## Initial Model:
## Fertility ~ Agriculture + Examination + Education + Catholic +
## Infant.Mortality
##
## Final Model:
## Fertility ~ Agriculture + Education + Catholic + Infant.Mortality
##
##
## Step Df Deviance Resid. Df Resid. Dev AIC
## 1 36 1746.780 168.5701
## 2 - Examination 1 53.77608 37 1800.556 167.8436
Here, the model without Examination
scores lower than the full
model. It seems that both models are evenly matched, though I might be
inclined to use the model without Examination
.
Since we sampled our original data to make the train and test datasets,
the difference in these tests is subject to change based on the training
data used. I encourage anyone who wants to test themselves to change the
set.seed
at the beginning and evaluate the above results again. Even
better: use your own data!
If you learned something or have questions feel free to leave a comment
or shoot me an email.
Happy modeling,
Kiefer