Recall:

In the regression setting, the standard linear model

\[Y = \beta_0 + \beta_1X_1+\dots + \beta_pX_p + \epsilon\]

is commonly used to describe the relationship between a response $Y$ and a set of variables $X_1,X_2, . . .,X_p$. We have seen in Chapter 3 that one typically fits this model using ___________.

Why might we want to use another fitting procedure instead of least squares?

To increase Prediction Accuracy:

Provided that the true relationship between the response and the predictors is approximately linear, the least squares estimates will have ________.
If $n >> p$—that is, if $n$, the number of observations, is much larger than $p$, the number of variables—then the least squares estimates tend to also have ________, and hence will perform well on test observations.
However, if $n$ is not much larger than $p$, then there can be a lot of variability in the least squares fit, resulting in _________ and consequently poor _________ on future observations not used in model training.
if $p > n$, then there is no longer a unique least squares coefficient estimate: the variance is infinite so the method cannot be used at all.
By ________ or ________ the estimated coefficients, we can often substantially reduce the _________ at the cost of a negligible increase in ________. This can lead to substantial improvements in the accuracy with which we can predict the response for observations not used in model training.

Model Interpretability:

Including _______ variables leads to unnecessary complexity in the resulting model.
By _______ these variables—that is, by setting the corresponding coefficient estimates to zero—we can obtain a model that is more easily interpreted.

In this chapter, we see some approaches for automatically performing __________ (or ____________) that is,for excluding irrelevant variables from a multiple regression model.

Alternative classes of methods:

Subset Selection
- Best subset selection
- Forward selection
- Backward selection
Shrinkage.
- Lasso
- Ridge
Dimension Reduction.
- PCR
- PLS

1 Subset Selection

1.1 Best subset selection

Step 0: Fit the model with 0 predictors (denoted as $\mathcal{M_0}$) - One model

Step 1: Fit all models with 1 predictor - ____ models.

Find the best model using smallest RSS, or equivalently largest $R^2$. Denote this model as $\mathcal{M_1}$

Step 2: Fit all models with 2 predictors - ____ models.

Find the best model using smallest RSS, or equivalently largest $R^2$. Denote this model as $\mathcal{M_2}$

$\vdots$

Step p: Fit all models with p predictors - ____ models.

Find the best model using smallest RSS, or equivalently largest $R^2$. Denote this model as $\mathcal{M_p}$

Final Step: Select a single best model from among $\mathcal{M_0}, \mathcal{M_1}, \dots, \mathcal{M_p}$ using cross validated prediction error, $C_p$, (AIC), BIC, or adjusted $R^2$.

Note 1:

Here we have created _____ models. Out of those models, we used ____ models to pick ___ final model.
Here, RSS ____ and $R^2$ ______ as the number of predictors _____ in the model.
Low RSS or a high $R^2$ indicates a model with low training error.
BUT, we need a model with low ____ error.
Therefore, we use criterion such as ross validated prediction error, $C_p$, (AIC), BIC, or adjusted $R^2$.

Pros: Simple and conceptually appealing.

Cons: Computationally infeasible for values of $p$ greater than around 40 ($p=40$ $\implies$ $2^{40}$ models)

Example 1

Here we apply the best subset selection approach to the Hitters data. We wish to predict a baseball player’s Salary on the basis of various statistics associated with performance in the previous year.

By default, regsubsets() only reports results up to the best eight-variable model. But the nvmax option can be used in order to return as many variables as are desired. Let’s fit up to a 19-variable model.

regfit <- regsubsets (Salary~.,data=Hitters, nvmax =19)
reg_summary <- summary(regfit)
reg_summary # Print the summary output

## Subset selection object
## Call: regsubsets.formula(Salary ~ ., data = Hitters, nvmax = 19)
## 19 Variables  (and intercept)
##            Forced in Forced out
## AtBat          FALSE      FALSE
## Hits           FALSE      FALSE
## HmRun          FALSE      FALSE
## Runs           FALSE      FALSE
## RBI            FALSE      FALSE
## Walks          FALSE      FALSE
## Years          FALSE      FALSE
## CAtBat         FALSE      FALSE
## CHits          FALSE      FALSE
## CHmRun         FALSE      FALSE
## CRuns          FALSE      FALSE
## CRBI           FALSE      FALSE
## CWalks         FALSE      FALSE
## LeagueN        FALSE      FALSE
## DivisionW      FALSE      FALSE
## PutOuts        FALSE      FALSE
## Assists        FALSE      FALSE
## Errors         FALSE      FALSE
## NewLeagueN     FALSE      FALSE
## 1 subsets of each size up to 19
## Selection Algorithm: exhaustive
##           AtBat Hits HmRun Runs RBI Walks Years CAtBat CHits CHmRun CRuns CRBI
## 1  ( 1 )  " "   " "  " "   " "  " " " "   " "   " "    " "   " "    " "   "*" 
## 2  ( 1 )  " "   "*"  " "   " "  " " " "   " "   " "    " "   " "    " "   "*" 
## 3  ( 1 )  " "   "*"  " "   " "  " " " "   " "   " "    " "   " "    " "   "*" 
## 4  ( 1 )  " "   "*"  " "   " "  " " " "   " "   " "    " "   " "    " "   "*" 
## 5  ( 1 )  "*"   "*"  " "   " "  " " " "   " "   " "    " "   " "    " "   "*" 
## 6  ( 1 )  "*"   "*"  " "   " "  " " "*"   " "   " "    " "   " "    " "   "*" 
## 7  ( 1 )  " "   "*"  " "   " "  " " "*"   " "   "*"    "*"   "*"    " "   " " 
## 8  ( 1 )  "*"   "*"  " "   " "  " " "*"   " "   " "    " "   "*"    "*"   " " 
## 9  ( 1 )  "*"   "*"  " "   " "  " " "*"   " "   "*"    " "   " "    "*"   "*" 
## 10  ( 1 ) "*"   "*"  " "   " "  " " "*"   " "   "*"    " "   " "    "*"   "*" 
## 11  ( 1 ) "*"   "*"  " "   " "  " " "*"   " "   "*"    " "   " "    "*"   "*" 
## 12  ( 1 ) "*"   "*"  " "   "*"  " " "*"   " "   "*"    " "   " "    "*"   "*" 
## 13  ( 1 ) "*"   "*"  " "   "*"  " " "*"   " "   "*"    " "   " "    "*"   "*" 
## 14  ( 1 ) "*"   "*"  "*"   "*"  " " "*"   " "   "*"    " "   " "    "*"   "*" 
## 15  ( 1 ) "*"   "*"  "*"   "*"  " " "*"   " "   "*"    "*"   " "    "*"   "*" 
## 16  ( 1 ) "*"   "*"  "*"   "*"  "*" "*"   " "   "*"    "*"   " "    "*"   "*" 
## 17  ( 1 ) "*"   "*"  "*"   "*"  "*" "*"   " "   "*"    "*"   " "    "*"   "*" 
## 18  ( 1 ) "*"   "*"  "*"   "*"  "*" "*"   "*"   "*"    "*"   " "    "*"   "*" 
## 19  ( 1 ) "*"   "*"  "*"   "*"  "*" "*"   "*"   "*"    "*"   "*"    "*"   "*" 
##           CWalks LeagueN DivisionW PutOuts Assists Errors NewLeagueN
## 1  ( 1 )  " "    " "     " "       " "     " "     " "    " "       
## 2  ( 1 )  " "    " "     " "       " "     " "     " "    " "       
## 3  ( 1 )  " "    " "     " "       "*"     " "     " "    " "       
## 4  ( 1 )  " "    " "     "*"       "*"     " "     " "    " "       
## 5  ( 1 )  " "    " "     "*"       "*"     " "     " "    " "       
## 6  ( 1 )  " "    " "     "*"       "*"     " "     " "    " "       
## 7  ( 1 )  " "    " "     "*"       "*"     " "     " "    " "       
## 8  ( 1 )  "*"    " "     "*"       "*"     " "     " "    " "       
## 9  ( 1 )  "*"    " "     "*"       "*"     " "     " "    " "       
## 10  ( 1 ) "*"    " "     "*"       "*"     "*"     " "    " "       
## 11  ( 1 ) "*"    "*"     "*"       "*"     "*"     " "    " "       
## 12  ( 1 ) "*"    "*"     "*"       "*"     "*"     " "    " "       
## 13  ( 1 ) "*"    "*"     "*"       "*"     "*"     "*"    " "       
## 14  ( 1 ) "*"    "*"     "*"       "*"     "*"     "*"    " "       
## 15  ( 1 ) "*"    "*"     "*"       "*"     "*"     "*"    " "       
## 16  ( 1 ) "*"    "*"     "*"       "*"     "*"     "*"    " "       
## 17  ( 1 ) "*"    "*"     "*"       "*"     "*"     "*"    "*"       
## 18  ( 1 ) "*"    "*"     "*"       "*"     "*"     "*"    "*"       
## 19  ( 1 ) "*"    "*"     "*"       "*"     "*"     "*"    "*"

There are more information in the summary than displayed above. Use the names() function to check out what’s available

Let’s look at adjusted $R^2$ values for all 19 models selected so far.

How many and what variables are in the best model selected using the adjusted $R^2$ criterion?

Let’s create a plot to visualize the change in adjusted $R^2$ criterion.

Which model has the max adjusted $R^2$?

Now what we know how many and what variables to include in the model, we are ready to fit our final model using the adjusted $R^2$ criterion.

Example 2:

In a similar fashion we can use the $C_p$ criterion, and indicate the models with the _______ value using which.min().

Now, let’s look at $Cp$ values for all 19 models.

How many and what variables are in the best model selected using the $Cp$ criterion?

Let’s create a plot to visualize the change in $Cp$ criterion.

Which model has the min $Cp$?

Now what we know how many and what variables to include in the model, we are ready to fit our final model using the $Cp$ criterion.

Example 3:

In a similar fashion we can use the $BIC$ criterion, and indicate the models with the _______ value using which.min(). Continue and find the best model using the $BIC$ criterion.

1.2 Forward Selection

Step 0: Fit the model with 0 predictors (denoted as $\mathcal{M_0}$)

Step 1: Find the best model only adding one predictor to the previous model. Use RSS, or $R^2$ to pick the best. Denote this model as $\mathcal{M_1}$

Step 2: Find the best model only adding one predictor to the previous model. Use RSS, or $R^2$ to pick the best. Denote this model as $\mathcal{M_2}$

$\vdots$

Step p: Find the best model only adding one predictor to the previous model. Use RSS, or $R^2$ to pick the best. Denote this model as $\mathcal{M_p}$

Final Step: Select a single best model from among $\mathcal{M_0}, \mathcal{M_1}, \dots, \mathcal{M_p}$ using cross validated prediction error, $C_p$, (AIC), BIC, or adjusted $R^2$.

Note 2:

Here we have created $1+\frac{p(p+1)}{2}$ models.

Pros

Computational efficiency over the best subset selection

Cons

Forward selection can be applied even in the high-dimensional setting where $n < p$; however, in this case, it is possible to construct submodels $\mathcal{M_0}, \mathcal{M_1}, \dots, \mathcal{M_{n-1}}$ only, since each submodel is fit using least squares, which will not yield a unique solution if $p \geq n$.

Example 4:

Here we apply the Forward Selection approach to the Hitters data. We wish to predict a baseball player’s Salary on the basis of various statistics associated with performance in the previous year.

We can use the regsubsets() function to perform forward selection, using the argument method="forward".

What is the predictor variable selected for the the best one-variable model?
What are the predictor variables selected for the the best two-variable model?
How many and what variables are in the best model selected using the $Cp$ criterion?
Create a plot to visualize the change in $Cp$ criterion.

1.3 Backward Selection (Backward Elimination)

Step 0: Fit the model with all $p$ predictors (denoted as $\mathcal{M_p}$, the full model)

Step 1: Find the best model only removing one predictor from the previous model. Use RSS, or $R^2$ to pick the best. Denote this model as $\mathcal{M_{p-1}}$

Step 2: Find the best model only removing one predictor from the previous model. Use RSS, or $R^2$ to pick the best. Denote this model as $\mathcal{M_{p-2}}$

$\vdots$

Step p: Find the best model only removing one predictor from the previous model. Use RSS, or $R^2$ to pick the best. Denote this model as $\mathcal{M_{0}}$

Final Step: Select a single best model from among $\mathcal{M_0}, \mathcal{M_1}, \dots, \mathcal{M_p}$ using cross validated prediction error, $C_p$, (AIC), BIC, or adjusted $R^2$.

Note 2:

Here we have created $1+\frac{p(p+1)}{2}$ models.

Pros

Computational efficiency over the best subset selection

Cons

Strictly need $n>p$ to use backward elimination

Example 5

Here we apply the backward elimination approach to the Hitters data. We wish to predict a baseball player’s Salary on the basis of various statistics associated with performance in the previous year.

We can use the regsubsets() function to perform forward selection, using the argument method="backward".

What is the predictor variable selected for the the best one-variable model?
What are the predictor variables selected for the the best two-variable model?
How many and what variables are in the best model selected using the $Cp$ criterion?
Create a plot to visualize the change in $Cp$ criterion.

Now that we know how many and what variables to include in the model, we are ready to fit our final model using the $BIC$ criterion.

1.4 Choosing the optimal model

Directly estimate test error, using validation set approach or cross validation approach.
Estimate test error by making an adjustment to the training error to account for the Bias due to overfitting.

The $C_p$ criterion

\[C_p = \frac{1}{n}(RSS + 2d\hat{\sigma}^2)\]

Where $\hat{\sigma}^2$ estimates the variance of the error $\epsilon$, $d$ is the number of predictors in the LS model.

Here $2d\hat{\sigma}^2$ adjust the fact that training tends to underestimate the test error.

The $C_p$ statistic tends to take on a small value for models with a low test error, so when determining which of a set of models is best, we choose the model with the ________ $C_p$ value.

The AIC criterion

\[AIC = \frac{1}{n\hat{\sigma}^2}(RSS + 2d\hat{\sigma}^2)\] $C_p$ and AIC are proportional to each other, and so only $C_p$ is mostly used.

The BIC criterion

\[BIC = \frac{1}{n\hat{\sigma}^2}(RSS + log(n)d\hat{\sigma}^2)\] Like $C_p$, the BIC will tend to take on a small value for a model with a low test error, and so generally we select the model that has the lowest BIC value.

Adjusted $R^2$

\[\text{Adjusted } R^2 = 1- \frac{RSS/(n-d-1)}{TSS/(n-1)}\]

Unlike $C_p$, AIC, and BIC, for which a small value indicates a model with a low test error, a ________ value of adjusted $R^2$ indicates a model with a small test error.

2 Shrinkage Methods

The subset selection methods described in Section 1 involve using least squares to fit a linear model that contains a subset of the predictors. As an alternative, we can fit a model containing all $p$ predictors using a technique that constrains or regularizes the coefficient estimates, or equivalently, that shrinks the coefficient estimates towards zero.

It may not be immediately obvious why such a constraint should improve the fit, but it turns out that shrinking the coefficient estimates can significantly reduce their variance. The two best-known techniques for shrinking the regression coefficients towards zero are _______ regression and the ______.

2.1 Ridge Regression

Recall from Chapter 3 that the least squares fitting procedure estimates $\beta_1, \beta_2,\dots \beta_p,$ using the values that minimize

\[RSS = \sum_{i=1}^n(y_i-\beta_0 - \sum_{j=1}^p\beta_jx_{ij})^2\] Ridge regression is very similar to least squares, except that the coefficients are estimated by minimizing a slightly different quantity:

\[RSS + \lambda\sum_{j=1}^p\beta_j^2\]

$\lambda$ -
$\lambda\sum_{j=1}^p\beta_j^2$ -

(this quantity is small when $\beta_j$s are small)

When $\lambda = 0$ -
When $\lambda \rightarrow \infty$

Note:

Need to standardize predictors before using Ridge regression
For each $\lambda$ value we choose, ridge regression will produce a different set of coefficient estimates.
Our goal is to find the best $\lambda$ value using CV

Squared bias (black), variance (green), and test mean squared error (purple) for the ridge regression predictions on a simulated data set, as a function of $\lambda$

As $\lambda$ increases flexibility of the ridge model decreases. This will lead to decreased variance but increased bias.

Example 6

Note:

We will use the glmnet package in order to perform ridge regression and the lasso. The main function in this package is glmnet(), which can be used to fit ridge regression models.

This function has slightly different syntax from other model-fitting functions that we have encountered thus far in this book. In particular, we must pass in an $x$ matrix as well as a $y$ vector, and we do not use the $y \sim x$ syntax.

Perform ridge regression in order to predict Salary on the Hitters data.

First we use model.matrix() function. Which is particularly useful for creating $x$; not only does it produce a matrix corresponding to the 19 predictors but it also automatically transforms any qualitative variables into dummy variables.

The glmnet() function has an alpha argument that determines what type of model is fit. If alpha=0 then a ridge regression model is fit

By default the glmnet() function performs ridge regression for an automatically selected range of $\lambda$ values. However, here we have chosen to implement the function over a grid of values ranging from $\lambda = 10^{10}$ to $\lambda = 10^{-2}$

In general, we use cross-validation to choose the tuning parameter $\lambda$. We can do this using the built-in cross-validation function, cv.glmnet().

What is the test MSE associated with this value of chosen $\lambda$?

Finally, we refit our ridge regression model on the FULL DATA SET, using the value of $\lambda$ chosen by cross validation, and examine the coefiient estimates.

2.2 Lasso

Disadvantage of Ridge: Unlike best subset, forward, and backward elimination, which will generally select models that involve just a subset of the variables, ridge regression will include all $p$ predictors in the final model.

The penalty $\lambda\sum_{j=1}^p\beta_j^2$ will shrink all of the coefficients towards zero, but it will not set any of them exactly to zero (unless $\lambda = \infty$ ).

The lasso is a relatively recent alternative to ridge regression that overcomes this disadvantage. The lasso coefficients, minimize the quantity:

\[RSS + \lambda\sum_{j=1}^p|\beta_j|\]

When $\lambda = 0$ -
When $\lambda$ is large some coefficient will be exactly equal to zero.
Lasso give ____ models - models only has subset of variables.

Note:

Need to standardize predictors before using Lasso
For each $\lambda$ value we choose, Lasso will produce a different set of coefficient estimates.
Our goal is to find the best $\lambda$ value using CV

$The standardized lasso coefficients on the Credit data set are shown as a function of \lambda$ * Finally fit the model using all of the available observations and selected valu of $\lambda$

Example 7:

Perform Lasso in order to predict Salary on the Hitters data.

Just like in the ridge case we first we use model.matrix() function. Which is particularly useful for creating $x$; not only does it produce a matrix corresponding to the 19 predictors but it also automatically transforms any qualitative variables into dummy variables.

The glmnet() function has an alpha argument that determines what type of model is fit. If alpha=1 then a lasso model is fit

By default the glmnet() function performs lasso for an automatically selected range of $\lambda$ values. However, here we have chosen to implement the function over a grid of values ranging from $\lambda = 10^{10}$ to $\lambda = 10^{-2}$

In general, we use cross-validation to choose the tuning parameter $\lambda$. We can do this using the built-in cross-validation function, cv.glmnet().

However, the lasso has a substantial advantage over ridge regression in that the resulting coefficient estimates are sparse. Here we see that _____ of the 19 coefficient estimates are exactly zero. So the lasso model with λ chosen by cross-validation contains only ____ variables.

Chapter 6 - Linear Model Selection and Regularization

Dr. Lasanthi Watagoda

April 03, 2025

1 Subset Selection

1.1 Best subset selection

1.2 Forward Selection

1.3 Backward Selection (Backward Elimination)

1.4 Choosing the optimal model

2 Shrinkage Methods

2.1 Ridge Regression

2.2 Lasso