# 07.14.13

## Regularization – Predictive Modeling Beyond Ordinary Least Squares Fit

### Introduction to Linear Functions and Regularization

A simple yet powerful prediction **model** assumes that the function is **linear** in the inputs, even when the input consists of hundreds of variables and the number of input variables far outstrips the number of observations. Such prediction models, known as linear regression/classification models, can often outperform fancier non-linear models.

The most popular method of estimating the parameters (used interchangeably with the word coefficients) is the method of **Ordinary Least Squares** (OLS). The linear model can be written as

f(x) = β_0 + ∑ X_j β_j

where j = 1 to p, X is the input vector, and the β_j are the unknown coefficients.

We solve for the coefficients β = (β_0, β_1, …, β_p) that minimize the residual sum of squares (RSS).

RSS(β) = ∑ ( y_i – β_0 – ∑ X_ij β_j )²

where i = 1 to N observations and j = 1 to p variables.
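
As a concrete illustration, the sketch below fits an OLS model on a small synthetic dataset (hypothetical data, not the HIV example used later) with R's lm(), which returns exactly the coefficients that minimize this RSS.

```r
# Minimal OLS sketch on synthetic data (hypothetical, for illustration only).
# lm() returns the beta_0, beta_1, ..., beta_p that minimize the RSS above.
set.seed(1)
N <- 100; p <- 5
X <- matrix(rnorm(N * p), N, p)                    # N observations, p predictors
beta.true <- c(2, -1, 0.5, 0, 0)                   # "true" coefficients
y <- drop(1 + X %*% beta.true + rnorm(N, sd=0.5))  # response with noise
ols.fit <- lm(y ~ X)                               # intercept included by default
coef(ols.fit)                                      # estimated beta_0, ..., beta_p
```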

There are two main reasons why OLS estimation is often unsatisfactory:

- Large variance in prediction accuracy. One way to improve overall accuracy is to **shrink** (or set to zero) some of the coefficients; the overall effect is to prevent or reduce over-fitting.
- With a large number of input predictors, we would like to determine a smaller subset that exhibits the strongest effects, so we see the big picture.

The process of regularization involves a family of penalty **terms** that can be added to the OLS criterion to achieve this shrinkage of the coefficients.

The **Ridge** penalty term shrinks the regression coefficients by introducing the complexity parameter **λ**: the greater the value of λ, the greater the amount of shrinkage. By varying λ, the coefficients are shrunk towards zero (and towards each other).

λ ∑ β_j², where j = 1 to p

While the Ridge penalty applies a proportional shrinkage, the **LASSO** penalty translates each coefficient towards zero by a constant amount λ, truncating at zero. LASSO thereby also performs feature selection; if many features are correlated, LASSO will just pick one.
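
A small sketch of that translation, known as soft-thresholding, is shown below. It is exact only in the orthonormal-design case and is meant for intuition rather than as glmnet's actual solver: each coefficient is moved towards zero by λ and truncated at zero.

```r
# Soft-thresholding sketch: how the LASSO penalty acts on a coefficient
# (exact only for an orthonormal design; shown purely for intuition).
soft.threshold <- function(beta, lambda) sign(beta) * pmax(abs(beta) - lambda, 0)

soft.threshold(c(-3, -0.5, 0.2, 1.5), lambda = 1)
# yields -2.0  0.0  0.0  0.5 -- small coefficients are set exactly to zero
```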

The **Elastic Net** penalty is a **combination** of the LASSO and Ridge regression penalties.

λ ∑ ( α|β_j| + (1 – α) β_j² ), where j = 1 to p

The first term encourages a sparse solution in the coefficients, and the second term encourages highly correlated features to be averaged. The parameter α determines the mix of the penalties and lies between 0 and 1. With α set to 0 we get the Ridge penalty, and with α set to 1 we get the LASSO penalty.
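
The toy sketch below simply evaluates the penalty formula above for a few values of α, to make the mixing behaviour concrete (this is the written-out formula, not glmnet's internal code).

```r
# Evaluate the Elastic Net penalty above for a toy coefficient vector;
# alpha = 0 gives the Ridge penalty, alpha = 1 gives the LASSO penalty.
enet.penalty <- function(beta, lambda, alpha) {
  lambda * sum(alpha * abs(beta) + (1 - alpha) * beta^2)
}

beta <- c(1.5, -0.3, 0, 2)
enet.penalty(beta, lambda = 0.5, alpha = 0)    # pure Ridge
enet.penalty(beta, lambda = 0.5, alpha = 1)    # pure LASSO
enet.penalty(beta, lambda = 0.5, alpha = 0.2)  # Elastic Net mix
```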

### Example

We now demonstrate this with an example dataset with 204 binary attributes and 704 observations.

##### Getting the Data

The R snippet below downloads the dataset from where it is hosted. The data has been previously saved as R objects in the .rda format; loading the file brings the objects hiv.train and hiv.test back into the R session.

`download.file("http://www.shatterline.com/MachineLearning/data/hiv.rda","hiv.rda", mode="wb")`

`load("hiv.rda", verbose=TRUE) #contains hiv.train & hiv.test`
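
A quick sanity check of what was loaded (assuming, as in the full listing at the end, that each object holds a predictor matrix x and a response vector y):

```r
# Sanity-check the loaded objects; structure assumed from the full listing below.
dim(hiv.train$x)      # observations x attributes in the training set
dim(hiv.test$x)       # observations x attributes in the test set
length(hiv.train$y)   # length of the training response
```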

##### Visualizing the Data

The image function in R helps us visualize the dataset, and the plot makes the relatively strong correlation between the variables visible. See the visualize.matrix function in the full code listing below.
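
For reference, the visualization boils down to a single image() call on the binary predictor matrix; below is a trimmed-down version of the visualize.matrix helper from the full listing.

```r
# Trimmed-down version of the visualize.matrix helper (see the full listing):
# plot the sparse binary predictor matrix, one row per observation.
image(1:nrow(hiv.train$x), 1:ncol(hiv.train$x), z = hiv.train$x,
      col = c("darkgreen", "white"),
      xlab = "Observations", ylab = "Attributes")
title(main = "Visualizing the Sparse Binary Matrix", font.main = 4)
```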

##### Fitting/Plotting Data

The code snippet below shows that, with the Ridge penalty, the coefficient shrinkage is proportional to λ.

`fit <- glmnet(hiv.train$x,hiv.train$y, alpha=0) #Ridge penalty`

The code snippet below shows that, with the LASSO penalty, the coefficients **hit exactly zero** (unlike Ridge) as λ increases.

`fit <- glmnet(hiv.train$x,hiv.train$y, alpha=1) #Lasso penalty`

The code snippet below shows a mix of the Ridge and LASSO penalties with the Elastic Net penalty for a specific value of α = 0.2 (α could also be chosen with cross-validation).

`fit <- glmnet(hiv.train$x,hiv.train$y, alpha=0.2) #Elastic Net penalty`

##### Cross-Validation

Ten-fold cross-validation shows us that the number of active variables is approximately 30.

`cv.fit <- cv.glmnet(hiv.train$x,hiv.train$y) #10-fold cross-validation`

`plot(cv.fit)`

`legend("topleft",legend=c("10-fold Cross Validation"))`
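
One way to verify that figure is to count the non-zero coefficients at the λ chosen by cross-validation (a sketch; lambda.min is the λ that minimizes the cross-validated error).

```r
# Count the active (non-zero) coefficients at the cross-validated lambda.
beta.min <- coef(cv.fit, s = "lambda.min")  # sparse vector, includes the intercept
sum(beta.min != 0) - 1                      # active variables, assuming a non-zero intercept
```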

##### Predicting the Test Data with the Model


The code snippet below predicts the test data and overlays the mean test error at every value of λ on the cross-validation plot.

`pred.y <- predict(fit, hiv.test$x) #predict the test data`

`mean.test.error <- apply((pred.y - hiv.test$y)^2,2,mean)`

`points(log(fit$lambda), mean.test.error, col="blue",pch="*")`

`legend("topleft",legend=c("10-fold Cross Validation","Test HIV Data"),pch="*",col=c("red","blue"))`
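
As a follow-up sketch, the λ at which the test error bottoms out can be read off and compared with the cross-validated choice.

```r
# Sketch: lambda with the smallest test-set error, versus the CV choice.
fit$lambda[which.min(mean.test.error)]  # lambda minimizing the test error
cv.fit$lambda.min                       # lambda picked by 10-fold cross-validation
```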

##### Plotting the Regularization Path

The code snippet below shows the regularization path by plotting the coefficients against (the log of) λ. Each curve represents a coefficient in the model. As λ gets smaller, more coefficients enter the model from a zero value.

`plot(fit,xvar="lambda")`

### Code

```r
# Author Auro Tripathy, auro@shatterline.com
# Adapted from ...Trevor Hastie's talk

rm(list=ls())

visualize.matrix <- function(mat) {
  print(names(mat))
  image(1:nrow(mat$x), 1:ncol(mat$x), z=mat$x,
        col = c("darkgreen", "white"),
        xlab = "Observations", ylab = "Attributes")
  title(main = "Visualizing the Sparse Binary Matrix", font.main = 4)
  return (dim(mat$x)) #returns the dimensions of the matrix
}

#---main---#
library(glmnet)
?glmnet

download.file("http://www.shatterline.com/MachineLearning/data/hiv.rda",
              "hiv.rda", mode="wb")
load("hiv.rda", verbose=TRUE) #contains hiv.train & hiv.test

visualize.matrix(hiv.train)
visualize.matrix(hiv.test)
print(length(hiv.train$y)) #length of response variable

fit <- glmnet(hiv.train$x,hiv.train$y, alpha=0) #Ridge penalty
plot(fit)
legend("bottomleft",legend=c("Ridge Penalty, alpha=0"))

fit <- glmnet(hiv.train$x,hiv.train$y, alpha=1) #Lasso penalty
plot(fit)
legend("bottomleft",legend=c("LASSO Penalty, alpha=1"))

fit <- glmnet(hiv.train$x,hiv.train$y, alpha=0.2) #ElasticNet penalty
plot(fit)
legend("bottomleft",legend=c("Elastic Net, alpha=0.2"))

cv.fit <- cv.glmnet(hiv.train$x,hiv.train$y) #10-fold cross-validation
plot(cv.fit)
legend("topleft",legend=c("10-fold Cross Validation"))

pred.y <- predict(fit, hiv.test$x) #predict the test data
mean.test.error <- apply((pred.y - hiv.test$y)^2,2,mean)
points(log(fit$lambda), mean.test.error, col="blue",pch="*")
legend("topleft",legend=c("10-fold Cross Validation","Test HIV Data"),
       pch="*", col=c("red","blue"))

plot(fit,xvar="lambda")
plot(fit,xvar="dev")
```

### References

- Prof. Trevor Hastie's talk
- *The Elements of Statistical Learning: Data Mining, Inference, and Prediction*, Trevor Hastie, Robert Tibshirani, Jerome Friedman