In my last post Which linear model is best? I wrote about using

stepwise selection as a method for selecting linear models, which turns

out to have some issues (see this article, and Wikipedia).

This post will be about two methods that slightly modify ordinary least

squares (OLS) regression – ridge regression and the lasso.

Ridge regression and the lasso are closely related, but only the Lasso

has the ability to select predictors. Like OLS, ridge attempts to

minimize residual sum of squares of predictors in a given model.

However, ridge regression includes an additional ‘shrinkage’ term – the

square of the coefficient estimate – which shrinks the estimate of the

coefficients towards zero. The impact of this term is controlled by

another term, lambda (determined seperately). Two interesting

implications of this design are the facts that when λ = 0 the OLS

coefficients are returned and when λ = ∞, coefficients will approach

zero.

To take a look at this, setup a model matrix (removing the intercept

column), store the independent variable as `y`

, and create a vector of

lambda values.

swiss <- datasets::swiss x <- model.matrix(Fertility~., swiss)[,-1] y <- swiss$Fertility lambda <- 10^seq(10, -2, length = 100)

First, let's prove the fact that when λ = 0 we get the same coefficients

as the OLS model.

#create test and training sets library(glmnet)

```
## Loading required package: Matrix
## Loading required package: foreach
## Loaded glmnet 2.0-10
```

set.seed(489) train = sample(1:nrow(x), nrow(x)/2) test = (-train) ytest = y[test]

Fit your models.

#OLS swisslm <- lm(Fertility~., data = swiss) coef(swisslm)

```
## (Intercept) Agriculture Examination Education
## 66.9151817 -0.1721140 -0.2580082 -0.8709401
## Catholic Infant.Mortality
## 0.1041153 1.0770481
```

#ridge ridge.mod <- glmnet(x, y, alpha = 0, lambda = lambda) predict(ridge.mod, s = 0, exact = T, type = 'coefficients')[1:6,]

```
## (Intercept) Agriculture Examination Education
## 66.9365901 -0.1721983 -0.2590771 -0.8705300
## Catholic Infant.Mortality
## 0.1040307 1.0770215
```

The differences here are nominal. Let's see if we can use ridge to

improve on the OLS estimate.

swisslm <- lm(Fertility~., data = swiss, subset = train) ridge.mod <- glmnet(x[train,], y[train], alpha = 0, lambda = lambda) #find the best lambda from our list via cross-validation cv.out <- cv.glmnet(x[train,], y[train], alpha = 0)

```
## Warning: Option grouped=FALSE enforced in cv.glmnet, since < 3 observations
## per fold
```

bestlam <- cv.out$lambda.min

#make predictions ridge.pred <- predict(ridge.mod, s = bestlam, newx = x[test,]) s.pred <- predict(swisslm, newdata = swiss[test,]) #check MSE mean((s.pred-ytest)^2)

```
## [1] 106.0087
```

mean((ridge.pred-ytest)^2)

```
## [1] 93.02157
```

Ridge performs better for this data according to the MSE.

#a look at the coefficients out = glmnet(x[train,],y[train],alpha = 0) predict(ridge.mod, type = "coefficients", s = bestlam)[1:6,]

```
## (Intercept) Agriculture Examination Education
## 64.90631178 -0.16557837 -0.59425090 -0.35814759
## Catholic Infant.Mortality
## 0.06545382 1.30375306
```

As expected, most of the coefficient estimates are more conservative.

Let's have a look at the lasso. The big difference here is in the

shrinkage term – the lasso takes the absolute value of the coefficient

estimates.

lasso.mod <- glmnet(x[train,], y[train], alpha = 1, lambda = lambda) lasso.pred <- predict(lasso.mod, s = bestlam, newx = x[test,]) mean((lasso.pred-ytest)^2)

```
## [1] 124.1039
```

The MSE is a bit higher for the lasso estimate. Let's check out the

coefficients.

lasso.coef <- predict(lasso.mod, type = 'coefficients', s = bestlam)[1:6,]

Looks like the lasso places high importance on `Education`

,

`Examination`

, and `Infant.Mortality`

. From this we also gain some

evidence that `Catholic`

and `Agriculture`

are not useful predictors for

this model. It is likely that `Catholic`

and `Agriculture`

do have some

effect on `Fertility`

, though, since pushing those coefficients to zero

hurt the model.

There is plenty more to delve into here, but I'll leave the details to

the experts. I am always happy to have your take on the topics I write

about, so please feel free to leave a comment or contact me. Oftentimes

I learn just as much from you all as I do in researching the topic I

write about.

I think the next post will be about more GIS stuff – maybe on rasters or

point pattern analysis.

Thank you for reading!

Kiefer

Thanks, this is very helpful. A small issue I ran into with the #ridge code portion (using glmnet v2.0-13), if exact = T, I had to add x=x, y=y within the predict function:

predict(ridge.mod, s = 0, exact = T, type = ‘coefficients’, x=x, y=y)[1:6,] #explained much better than I could in the function help

LikeLike