Ridge Regression and the Lasso

In my last post Which linear model is best? I wrote about using
stepwise selection as a method for selecting linear models, which turns
out to have some issues (see this article, and Wikipedia).
This post will be about two methods that slightly modify ordinary least
squares (OLS) regression – ridge regression and the lasso.

Ridge regression and the lasso are closely related, but only the lasso
can select predictors by shrinking some coefficients exactly to zero.
Like OLS, ridge regression seeks coefficient estimates that minimize the
residual sum of squares (RSS). However, ridge regression adds a
‘shrinkage’ penalty – the sum of the squared coefficient estimates –
which pulls the estimates toward zero. The impact of this penalty is
controlled by a tuning parameter, lambda (chosen separately, typically
by cross-validation). Two interesting implications of this design: when
λ = 0 the OLS coefficients are returned, and as λ → ∞ the coefficients
approach zero.
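In symbols (a standard formulation; the notation here is mine, not from the original post), ridge regression chooses the coefficients β to minimize

```latex
\sum_{i=1}^{n}\Big(y_i - \beta_0 - \sum_{j=1}^{p}\beta_j x_{ij}\Big)^2
\;+\; \lambda \sum_{j=1}^{p}\beta_j^2
```

where the first term is the usual RSS and the second is the shrinkage penalty. Setting λ = 0 removes the penalty and recovers OLS, while letting λ grow forces every β_j toward zero; note the intercept β_0 is not penalized.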

To take a look at this, set up a model matrix (removing the intercept
column), store the response variable as y, and create a vector of
candidate lambda values.

swiss <- datasets::swiss
x <- model.matrix(Fertility~., swiss)[,-1]
y <- swiss$Fertility
lambda <- 10^seq(10, -2, length = 100)
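As a quick sanity check (base R only; nothing here beyond the line above), this grid contains 100 candidate values of λ, evenly spaced on the log scale from 10^10 down to 10^-2:

```r
# 100 lambda values, evenly spaced on the log10 scale
lambda <- 10^seq(10, -2, length = 100)
length(lambda)  # 100 values
range(lambda)   # smallest is 0.01, largest is 1e10
```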

First, let's verify that when λ = 0 we get the same coefficients as the
OLS model.

#create test and training sets
library(glmnet)
## Loading required package: Matrix

## Loading required package: foreach

## Loaded glmnet 2.0-10
set.seed(489)
train = sample(1:nrow(x), nrow(x)/2)
test = (-train)
ytest = y[test]

Fit your models.

#OLS
swisslm <- lm(Fertility~., data = swiss)
coef(swisslm)
##      (Intercept)      Agriculture      Examination        Education 
##       66.9151817       -0.1721140       -0.2580082       -0.8709401 
##         Catholic Infant.Mortality 
##        0.1041153        1.0770481
#ridge
ridge.mod <- glmnet(x, y, alpha = 0, lambda = lambda)
predict(ridge.mod, s = 0, exact = T, type = 'coefficients')[1:6,]
##      (Intercept)      Agriculture      Examination        Education 
##       66.9365901       -0.1721983       -0.2590771       -0.8705300 
##         Catholic Infant.Mortality 
##        0.1040307        1.0770215

The differences here are negligible – with λ = 0 the ridge fit
essentially reproduces OLS. Let's see if we can use ridge to improve on
the OLS estimate.

swisslm <- lm(Fertility~., data = swiss, subset = train)
ridge.mod <- glmnet(x[train,], y[train], alpha = 0, lambda = lambda)
#find the best lambda from our list via cross-validation
cv.out <- cv.glmnet(x[train,], y[train], alpha = 0)
## Warning: Option grouped=FALSE enforced in cv.glmnet, since < 3 observations
## per fold
bestlam <- cv.out$lambda.min
#make predictions
ridge.pred <- predict(ridge.mod, s = bestlam, newx = x[test,])
s.pred <- predict(swisslm, newdata = swiss[test,])
#check MSE
mean((s.pred-ytest)^2)
## [1] 106.0087
mean((ridge.pred-ytest)^2)
## [1] 93.02157

Ridge regression performs better on this test set according to the MSE.

#a look at the coefficients
predict(ridge.mod, type = "coefficients", s = bestlam)[1:6,]
##      (Intercept)      Agriculture      Examination        Education 
##      64.90631178      -0.16557837      -0.59425090      -0.35814759 
##         Catholic Infant.Mortality 
##       0.06545382       1.30375306

As expected, most of the coefficient estimates are more conservative.

Let's have a look at the lasso. The big difference here is in the
shrinkage term – the lasso penalizes the sum of the absolute values of
the coefficients rather than their squares, which allows it to shrink
some coefficients exactly to zero.
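In symbols (a standard formulation; the notation here is mine, not from the original post), the lasso minimizes

```latex
\sum_{i=1}^{n}\Big(y_i - \beta_0 - \sum_{j=1}^{p}\beta_j x_{ij}\Big)^2
\;+\; \lambda \sum_{j=1}^{p}\lvert\beta_j\rvert
```

The absolute-value (ℓ1) penalty can push coefficients exactly to zero, which is what lets the lasso drop predictors entirely.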

lasso.mod <- glmnet(x[train,], y[train], alpha = 1, lambda = lambda)
lasso.pred <- predict(lasso.mod, s = bestlam, newx = x[test,])
mean((lasso.pred-ytest)^2)
## [1] 124.1039

The MSE is a bit higher for the lasso estimate. (Note that we reused
the λ chosen by cross-validation for ridge; strictly, we should re-run
cv.glmnet with alpha = 1 to tune λ for the lasso.) Let's check out the
coefficients.

lasso.coef <- predict(lasso.mod, type = 'coefficients', s = bestlam)[1:6,]
lasso.coef

Looks like the lasso places high importance on Education,
Examination, and Infant.Mortality, while shrinking the Catholic and
Agriculture coefficients toward zero. From this we gain some evidence
that Catholic and Agriculture are less useful predictors for this
model. It is likely that they do have some effect on Fertility,
though, since pushing their coefficients toward zero increased the test
MSE.

There is plenty more to delve into here, but I'll leave the details to
the experts. I am always happy to have your take on the topics I write
about, so please feel free to leave a comment or contact me. Oftentimes
I learn just as much from you all as I do in researching the topic I
write about.

I think the next post will be about more GIS stuff – maybe on rasters or
point pattern analysis.

Thank you for reading!

Kiefer

Which linear model is best?


Recently I have been working on a Kaggle competition where participants are tasked with predicting Russian housing prices. In developing a model for the challenge, I came across a few methods for selecting the best regression
model for a given dataset. Let’s load up some data and take a look.

set.seed(123)
swiss <- data.frame(datasets::swiss)
dim(swiss)
## [1] 47  6
head(swiss)
##              Fertility Agriculture Examination Education Catholic
## Courtelary        80.2        17.0          15        12     9.96
## Delemont          83.1        45.1           6         9    84.84
## Franches-Mnt      92.5        39.7           5         5    93.40
## Moutier           85.8        36.5          12         7    33.77
## Neuveville        76.9        43.5          17        15     5.16
## Porrentruy        76.1        35.3           9         7    90.57
##              Infant.Mortality
## Courtelary               22.2
## Delemont                 22.2
## Franches-Mnt             20.2
## Moutier                  20.3
## Neuveville               20.6
## Porrentruy               26.6

This dataset provides a standardized fertility measure along with
several socioeconomic indicators, each indicator given as a percentage,
for Swiss provinces. Suppose we want to predict fertility rate. In
order to assess our prediction ability, create test and training data
sets.

index <- sample(nrow(swiss), 5)
train <- swiss[-index,]
test <- swiss[index,]

Next, have a quick look at the data.

par(mfrow=c(2,2))
plot(train$Education, train$Fertility)
plot(train$Catholic, train$Fertility)
plot(train$Infant.Mortality, train$Fertility)
hist(train$Fertility)

[Plots: Fertility against Education, Catholic, and Infant.Mortality, plus a histogram of Fertility]

Looks like there could be some interesting relationships here. The
following block of code will take a model formula (or model matrix) and
search for the best combination of predictors.

library(leaps)
leap <- regsubsets(Fertility~., data = train, nbest = 10)
leapsum <- summary(leap)
plot(leap, scale = 'adjr2')

[Plot: regsubsets model comparison, ranked by adjusted R-squared]

According to the adjusted R-squared value (larger is better), the two
best models are the one with all predictors and the one with all
predictors except Examination. Both have adjusted R-squared values
around 0.69 – a decent fit. Fit these models so we can evaluate them
further.

swisslm <- lm(Fertility~., data = train)
swisslm2 <- lm(Fertility~.-Examination, data = train)
#use summary() for a more detailed look.

First we'll compute each model's mean squared error (MSE) on the
held-out test set. We're looking for a low value of MSE here.

mean((test$Fertility-predict(swisslm, test))^2)
mean((test$Fertility-predict(swisslm2, test))^2)

We can also look at the Akaike information criterion (AIC) for more
information. Lower is better here as well.
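The MSE computation can be wrapped in a small helper so the arithmetic isn't repeated by hand (a sketch in base R; the function name mse is mine):

```r
# Mean squared error between observed and predicted values
mse <- function(actual, predicted) {
  mean((actual - predicted)^2)
}

# Tiny worked example: errors are 0, 0, 2, so MSE = 4/3
mse(c(1, 2, 3), c(1, 2, 5))
```

With the models above you would call, e.g., mse(test$Fertility, predict(swisslm, test)).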

library(MASS)
step1 <- stepAIC(swisslm, direction = "both")
step1$anova
## Stepwise Model Path 
## Analysis of Deviance Table
## 
## Initial Model:
## Fertility ~ Agriculture + Examination + Education + Catholic + 
##     Infant.Mortality
## 
## Final Model:
## Fertility ~ Agriculture + Education + Catholic + Infant.Mortality
## 
## 
##            Step Df Deviance Resid. Df Resid. Dev      AIC
## 1                                  36   1746.780 168.5701
## 2 - Examination  1 53.77608        37   1800.556 167.8436

Here, the model without Examination scores lower than the full
model. It seems that both models are evenly matched, though I might be
inclined to use the model without Examination.
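As a cross-check, base R's AIC() can compare fitted models directly. Here is a self-contained sketch on the full swiss data (so the numbers will differ from the train-only fits above):

```r
swiss <- datasets::swiss

# The two candidate models from the regsubsets/stepAIC comparison
fit_full   <- lm(Fertility ~ ., data = swiss)
fit_noexam <- lm(Fertility ~ . - Examination, data = swiss)

# Compare the fits; the model with the lower AIC is preferred
AIC(fit_full, fit_noexam)
```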

Since we sampled our original data to make the train and test datasets,
the difference in these tests is subject to change based on the training
data used. I encourage anyone who wants to test themselves to change the
set.seed at the beginning and evaluate the above results again. Even
better: use your own data!

If you learned something or have questions feel free to leave a comment
or shoot me an email.

Happy modeling,

Kiefer