07.14.13
Regularization – Predictive Modeling Beyond Ordinary Least Squares Fit
Introduction to Linear Functions and Regularization
A simple yet powerful prediction model assumes that the function is linear in the inputs, even in cases where the input consists of hundreds of variables and the number of input variables far outstrips the number of observations. Such prediction models, known as linear regression/classification models, can often outperform fancier non-linear models.
The most popular method for estimating the parameters (the words parameters and coefficients are used interchangeably here) is Ordinary Least Squares (OLS). The linear model can be written as
f(X) = β₀ + ∑ Xⱼβⱼ
where the sum runs over j = 1 to p, X = (X₁,…,Xₚ) is the input vector, and the βⱼ are the unknown coefficients.
We solve for the coefficients β = (β₀, β₁,…, βₚ) that minimize the residual sum of squares (RSS):
RSS(β) = ∑ (yᵢ − β₀ − ∑ xᵢⱼβⱼ)²
where the outer sum runs over the i = 1 to N observations and the inner sum over the j = 1 to p variables.
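As a minimal illustration (on simulated data; the hiv.rda example used later is separate), R's built-in lm() function computes these OLS estimates directly:
set.seed(1)
x <- matrix(rnorm(100 * 3), 100, 3)             # N = 100 observations, p = 3 variables
y <- drop(1 + x %*% c(2, 0, -1) + rnorm(100))   # simulated response; true coefficients 1, 2, 0, -1
ols <- lm(y ~ x)                                # lm() minimizes the RSS defined above
coef(ols)                                       # estimated beta_0, beta_1, ..., beta_p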
Reasons why OLS estimation is often unsatisfactory are:
- Prediction accuracy. OLS estimates often have low bias but large variance. Accuracy can frequently be improved by shrinking some of the coefficients (or setting them to zero) at the cost of a little bias; the overall effect is to prevent or reduce over-fitting.
- Interpretation. With a large number of input predictors, one would like to determine a smaller subset that exhibits the strongest effects, so we can see the big picture.
Regularization adds a penalty term, drawn from a family of penalties, to the OLS criterion in order to achieve this shrinkage of the coefficients.
The Ridge penalty shrinks the regression coefficients by introducing a complexity parameter λ: the greater the value of λ, the greater the amount of shrinkage. By varying λ, the coefficients are shrunk towards zero (and towards each other).
While the Ridge penalty does a proportional shrinkage, the LASSO penalty translates each coefficient towards zero by the constant amount λ, truncating at zero. LASSO therefore also performs feature selection; if many features are correlated, LASSO will tend to pick just one of them.
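For reference, in the notation above the two penalized criteria are commonly written as
Ridge:  minimize RSS(β) + λ ∑ βⱼ²
LASSO:  minimize RSS(β) + λ ∑ |βⱼ|
with the sums running over j = 1 to p; the intercept β₀ is left unpenalized.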
The Elastic Net penalty is a combination of the LASSO and Ridge penalties.
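In the glmnet package used below, the penalty added to the RSS takes the form
λ ∑ [ α|βⱼ| + (1 − α) βⱼ²/2 ]
with the sum running over j = 1 to p.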
The first term encourages a sparse solution in the coefficients and the second term encourages highly correlated features to be averaged. The parameter α determines the mix of penalties and lies between 0 and 1. With α set to 0 we get the Ridge penalty, and with α set to 1 we get the LASSO penalty.
Example
We now demonstrate this with an example dataset with 204 binary attributes and 704 observations.
Getting the Data
The R snippet below downloads the dataset from where it is hosted. The data has been previously saved in R's .rda format; loading it creates the objects hiv.train and hiv.test.
download.file("http://www.shatterline.com/MachineLearning/data/hiv.rda","hiv.rda", mode="wb")
load("hiv.rda", verbose=TRUE) #contains hiv.train & hiv.test
Visualizing the Data
The image function in R helps us visualize the dataset; in the resulting plot you can see the relatively strong correlation between the variables. See the visualize.matrix function in the full code listing below.
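The calls themselves, reproduced from the full listing, are simply:
visualize.matrix(hiv.train)
visualize.matrix(hiv.test)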
Fitting/Plotting Data
The code snippet below fits the model with the Ridge penalty; plotting the fit shows the coefficients shrinking smoothly towards zero (but never exactly to zero) as λ increases.
fit <- glmnet(hiv.train$x,hiv.train$y, alpha=0) #Ridge penalty
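Plotting the fit (as in the full code listing) draws one curve per coefficient and makes the shrinkage visible:
plot(fit)   # coefficient paths for the Ridge fit
legend("bottomleft", legend=c("Ridge Penalty, alpha=0"))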
The code snippet below shows that, with the LASSO penalty, coefficients hit exactly zero (unlike Ridge) as λ grows.
fit <- glmnet(hiv.train$x,hiv.train$y, alpha=1) #Lasso penalty
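The feature-selection effect can also be read off the fitted object; printing it lists the number of nonzero coefficients at each value of λ:
print(fit)   # one row per lambda: Df (nonzero coefficients), %Dev (deviance explained), Lambda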
The code snippet below applies the Elastic Net penalty, a mix of the Ridge and LASSO penalties, for a specific value of α=0.2 (α could also be chosen via cross-validation, as sketched after the snippet).
fit <- glmnet(hiv.train$x,hiv.train$y, alpha=0.2) #Elastic Net penalty
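A sketch of how α itself could be chosen by cross-validation (not part of the original listing; a fixed foldid keeps the folds identical across the candidate values of α so the comparison is fair):
set.seed(1)
foldid <- sample(rep(1:10, length.out = length(hiv.train$y)))  # fixed fold assignments
alphas <- c(0, 0.2, 0.5, 0.8, 1)                               # candidate mixing values
cv.errors <- sapply(alphas, function(a)
  min(cv.glmnet(hiv.train$x, hiv.train$y, alpha = a, foldid = foldid)$cvm))
alphas[which.min(cv.errors)]                                   # alpha with the lowest cross-validated error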
Cross-Validation
Ten-fold cross-validation shows us that the number of active variables is approximately 30.
cv.fit <- cv.glmnet(hiv.train$x,hiv.train$y) #10-fold cross-validation
plot(cv.fit)
legend("topleft",legend=c("10-fold Cross Validation"))
Predicting the Test Data with the Model
The code snippet below predicts the test responses at every value of λ, computes the mean squared test error, and overlays it (in blue) on the cross-validation plot above.
pred.y <- predict(fit, hiv.test$x) #predict the test data
mean.test.error <- apply((pred.y - hiv.test$y)^2,2,mean)
points(log(fit$lambda), mean.test.error, col="blue",pch="*")
legend("topleft",legend=c("10-fold Cross Validation","Test HIV Data"),pch="*",col=c("red","blue"))
Plotting the Regularization Path
The code snippet below shows the regularization path by plotting the coefficients against log(λ). Each curve represents one coefficient in the model; as λ gets smaller, more coefficients enter the model from a value of zero.
plot(fit,xvar="lambda")
Code
# Author Auro Tripathy, auro@shatterline.com
# Adapted from ...Trevor Hastie's talk

rm(list=ls())

visualize.matrix <- function(mat) {
  print(names(mat))
  image(1:nrow(mat$x), 1:ncol(mat$x), z=mat$x, col = c("darkgreen", "white"),
        xlab = "Observations", ylab = "Attributes")
  title(main = "Visualizing the Sparse Binary Matrix", font.main = 4)
  return(dim(mat$x)) #returns the dimensions of the matrix
}

#---main---#
library(glmnet)
?glmnet

download.file("http://www.shatterline.com/MachineLearning/data/hiv.rda",
              "hiv.rda", mode="wb")
load("hiv.rda", verbose=TRUE) #contains hiv.train & hiv.test

visualize.matrix(hiv.train)
visualize.matrix(hiv.test)
print(length(hiv.train$y)) #length of response variable

fit <- glmnet(hiv.train$x, hiv.train$y, alpha=0) #Ridge penalty
plot(fit)
legend("bottomleft", legend=c("Ridge Penalty, alpha=0"))

fit <- glmnet(hiv.train$x, hiv.train$y, alpha=1) #Lasso penalty
plot(fit)
legend("bottomleft", legend=c("LASSO Penalty, alpha=1"))

fit <- glmnet(hiv.train$x, hiv.train$y, alpha=0.2) #Elastic Net penalty
plot(fit)
legend("bottomleft", legend=c("Elastic Net, alpha=0.2"))

cv.fit <- cv.glmnet(hiv.train$x, hiv.train$y) #10-fold cross-validation
plot(cv.fit)
legend("topleft", legend=c("10-fold Cross Validation"))

pred.y <- predict(fit, hiv.test$x) #predict the test data
mean.test.error <- apply((pred.y - hiv.test$y)^2, 2, mean)
points(log(fit$lambda), mean.test.error, col="blue", pch="*")
legend("topleft", legend=c("10-fold Cross Validation", "Test HIV Data"),
       pch="*", col=c("red","blue"))

plot(fit, xvar="lambda")
plot(fit, xvar="dev")
References
- Prof. Trevor Hastie's talk
- Trevor Hastie, Robert Tibshirani, and Jerome Friedman, The Elements of Statistical Learning: Data Mining, Inference, and Prediction