Resampling methods involve repeatedly drawing samples from a training set and refitting a model of interest on each sample in order to obtain additional information about the fitted model.

The most commonly used resampling techniques are

  1. Cross-Validation (CV): can be used to estimate the test error associated with a given statistical learning method in order to evaluate its performance, or to select the appropriate level of flexibility.

  2. Bootstrap: used in several contexts, most commonly to provide a measure of accuracy of a parameter estimate or of a given statistical learning method.

1 Cross-Validation

Recall from Chapter 2:

In the absence of a very large designated test set that can be used to directly estimate the test error rate, a number of techniques can be used to estimate this quantity using the available training data.

We consider three methods that estimate the test error rate by holding out a subset of the training observations from the fitting process and then applying the statistical learning method to those held-out observations.

1.1 The Validation Set Approach

Step 1: Randomly divide the available set of observations into two parts: a training set and a test set (or holdout set).

Step 2: Fit the model using the training set.

Step 3: Use the fitted model to predict the responses for the observations in the test set.

Step 4: Find the test MSE (test mean squared error), \(\text{MSE}_{test} = \frac{1}{n_{test}}\sum_{i}(y_i - \hat{y}_i)^2\), where the sum runs over the \(n_{test}\) held-out observations.

The figure above is a schematic display of the validation set approach. A set of \(n\) observations is randomly split into a training set (shown in blue, containing observations 7, 22, and 13, among others) and a validation set (shown in beige, containing observation 91, among others). The statistical learning method is fit on the training set, and its performance is evaluated on the validation set.

Example 1:

For this example we use the Auto data set, which has 392 observations.

  1. Split the 392 observations into two sets: a training set containing 196 of the data points, and a validation set containing the remaining 196 observations.

  2. Fit a simple linear regression model (degree one) to predict mpg using horsepower on the training set.

  3. Calculate the test MSE.

library(ISLR)
set.seed(1)
data("Auto")
dim(Auto)
## [1] 392   9
#1.  

train = sample.int(392, 196) # Training indices

trainSet <- Auto[train, ] # Training data set
dim(trainSet)
## [1] 196   9
testSet <- Auto[-train, ] # Test data set
dim(testSet)
## [1] 196   9
#2.

fit1 <- lm(mpg ~ horsepower, data = trainSet)
fit1
## 
## Call:
## lm(formula = mpg ~ horsepower, data = trainSet)
## 
## Coefficients:
## (Intercept)   horsepower  
##     41.2835      -0.1697
#3.
testPredict <- predict(fit1, testSet) # yhat values

testMSE1 <- mean((testSet$mpg - testPredict)^2)
testMSE1
## [1] 23.26601

Example 2:

Now we fit nine more regression models, with degrees 2 through 10, to predict mpg using horsepower on the same training set defined above. Find the test MSE for each of the nine models.
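One way to carry this out, reusing trainSet, testSet, and testMSE1 from Example 1 together with R's poly() function (a minimal sketch, not the only approach):

#2. Fit polynomial models of degrees 2 through 10 on the training set
#   and compute each one's MSE on the validation set.

testMSE <- rep(NA, 10)
testMSE[1] <- testMSE1 # degree-one test MSE from Example 1

for (d in 2:10) {
  fitD <- lm(mpg ~ poly(horsepower, d), data = trainSet)
  predD <- predict(fitD, testSet)
  testMSE[d] <- mean((testSet$mpg - predD)^2)
}
testMSE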

Example 3

Create a plot of model degree versus test MSE as a visual aid.
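A minimal sketch of such a plot, using the testMSE vector computed in Example 2:

plot(1:10, testMSE, type = "b", pch = 19,
     xlab = "Degree of polynomial", ylab = "Test MSE")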

Note: The validation set MSE for the quadratic fit is considerably smaller than that of the linear (degree-one) fit.

Therefore our final model is the quadratic fit.

In the graph above, the validation set approach was repeated ten times, each time using a different random split of the observations into a training set and a validation set. This illustrates the variability in the estimated test MSE that results from this approach.

Disadvantages of the validation approach:

  1. As seen in the graph above, the validation estimate of the test error rate can be highly variable, depending on precisely which observations are included in the training set and which observations are included in the validation set.

  2. The validation set error rate may tend to overestimate the test error rate, because the model is fit on only a subset (here, half) of the observations.

1.2 Leave-One-Out Cross-Validation (LOOCV)

Step 1: Split the set of observations such that:

  • the validation set has one observation, say \((x_1, y_1)\)
  • the training set has the remaining \(n-1\) observations.

Step 2: Fit the model using the training set (\(n-1\) observations).

Step 3: Predict \(\hat{y}_1\) for the excluded observation using its value \(x_1\).

Step 4: Calculate \(MSE_1 = (y_1 - \hat{y}_1)^2\).

Step 5: Repeat the process for all \(n\) observations, obtaining \(MSE_1, MSE_2, \ldots, MSE_n\).

Step 6: Calculate

\[CV_{(n)} = \frac{1}{n}\sum_{i=1}^{n} MSE_i\]

The figure above is a schematic display of LOOCV. A set of \(n\) data points is repeatedly split into a training set (shown in blue) containing all but one observation, and a validation set that contains only that observation (shown in beige). The test error is then estimated by averaging the \(n\) resulting MSEs. The first training set contains all but observation 1, the second training set contains all but observation 2, and so forth.

Example 4:

For this example we use the Auto data set, which has 392 observations. Fit a simple linear regression model (degree one) to predict mpg using horsepower. Calculate the LOOCV error.

Note: Here, we will perform linear regression using the glm() function rather than the lm() function, because the former can be used together with cv.glm() from the boot package to get the LOOCV error.
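A minimal sketch (cv.glm() returns a delta component whose first entry is the raw cross-validation estimate of the test MSE):

library(boot) # provides cv.glm()

# glm() with its default gaussian family fits the same model as lm()
glmFit <- glm(mpg ~ horsepower, data = Auto)

# with no K argument, cv.glm() performs LOOCV
loocvError <- cv.glm(Auto, glmFit)$delta[1]
loocvError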

Example 5:

Fit ten regression models, with degrees 1 through 10, to predict mpg using horsepower, using the LOOCV approach. Find the LOOCV error for each of the ten models. Which model would you use as your final model, and why?
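One possible sketch, looping over the ten degrees (LOOCV refits each model \(n\) times, so this takes a moment to run):

loocvErrors <- rep(NA, 10)
for (d in 1:10) {
  glmFitD <- glm(mpg ~ poly(horsepower, d), data = Auto)
  loocvErrors[d] <- cv.glm(Auto, glmFitD)$delta[1]
}
loocvErrors # the degree with the smallest LOOCV error is preferred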

Advantages of the LOOCV approach:

  1. Less bias (the training set has \(n-1\) observations every time).
  2. Does not overestimate the test error as much as the validation set approach does.
  3. The results are the same every time: there is no randomness in the splits.

1.3 k-Fold Cross-Validation

Step 1: Divide the set of observations into \(k\) groups, or folds, of approximately equal size.

Step 2: Treat the first fold as a validation set.

Step 3: Fit the model using the remaining \(k-1\) folds.

Step 4: The mean squared error, \(MSE_1\), is then computed on the observations in the held-out fold.

Step 5: This procedure is repeated \(k\) times; each time, a different fold of observations is treated as the validation set.

Step 6: This process results in \(k\) estimates of the test error, \(MSE_1, MSE_2, \ldots, MSE_k\).

Step 7: The \(k\)-fold CV estimate is computed by averaging these values,

\[CV_{(k)} = \frac{1}{k}\sum_{i=1}^{k} MSE_i\]

The figure above is a schematic display of 5-fold CV. A set of \(n\) observations is randomly split into five non-overlapping groups. Each of these fifths acts as a validation set (shown in beige), and the remainder as a training set (shown in blue). The test error is estimated by averaging the five resulting MSE estimates.

Note:

  1. Usually we perform \(k\)-fold CV using \(k=5\) or \(k=10\).

  2. LOOCV is a special case of \(k\)-fold CV in which \(k=n\).

  3. Advantage: with \(k=5\) or \(k=10\), \(k\)-fold CV is computationally much cheaper than LOOCV (the model is fit \(k\) times rather than \(n\) times), and it often gives more accurate estimates of the test error rate because of the bias-variance trade-off.

Example 6:

For this example we use the Auto data set, which has 392 observations. Fit a simple linear regression model (degree one) to predict mpg using horsepower. Calculate the 10-fold CV error.
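A minimal sketch; unlike LOOCV, k-fold CV assigns observations to folds at random, so we set a seed for reproducibility:

set.seed(1) # fold assignment is random
glmFit <- glm(mpg ~ horsepower, data = Auto)
cv10Error <- cv.glm(Auto, glmFit, K = 10)$delta[1]
cv10Error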

Example 7:

Fit ten regression models, with degrees 1 through 10, to predict mpg using horsepower, using the 10-fold CV approach. Find the 10-fold CV error for each of the ten models. Which model would you use as your final model, and why?
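A sketch along the same lines as the LOOCV loop above, but with K = 10 (and noticeably faster):

set.seed(1)
cv10Errors <- rep(NA, 10)
for (d in 1:10) {
  glmFitD <- glm(mpg ~ poly(horsepower, d), data = Auto)
  cv10Errors[d] <- cv.glm(Auto, glmFitD, K = 10)$delta[1]
}
cv10Errors # again, the degree with the smallest CV error is preferred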

1.3.1 Bias-Variance Trade-Off for k-Fold Cross-Validation

| Method | No. of obs used to fit the model | Bias? | Variance? |
|---|---|---|---|
| Validation approach | \(\approx n/2\) | (usually) overestimates the test error, so high bias | High |
| LOOCV | \(n-1\) | (nearly) unbiased | High compared to k-fold CV |
| k-fold CV | \(\approx (k-1)n/k\) (each training set) | Somewhere in between | \(k=5\) or \(10\) gives low variance |

1.4 The Bootstrap

Rather than repeatedly obtaining independent data sets from the population, we instead obtain distinct data sets by repeatedly sampling observations from the original data set with replacement. This procedure is known as the bootstrap.

This approach is illustrated in the following figure on a simple data set.

Note: The bootstrap sample size is \(n\), which is also the size of the original sample. This is usually the case.

The Bootstrap Idea: The original sample approximates the population from which it was drawn. So resamples from this sample approximate what we would get if we took many samples from the population. The bootstrap distribution of a statistic, based on many resamples, approximates the sampling distribution of the statistic, based on many samples.
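To make the idea concrete, here is a minimal sketch with a made-up sample (the data and every number below are purely illustrative):

set.seed(1)
x <- rnorm(100, mean = 50, sd = 10) # a stand-in "original sample"

B <- 1000 # number of bootstrap resamples
bootMeans <- rep(NA, B)
for (b in 1:B) {
  # resample n observations from the original sample, with replacement
  resample <- sample(x, size = length(x), replace = TRUE)
  bootMeans[b] <- mean(resample)
}

mean(bootMeans) # center of the bootstrap distribution
sd(bootMeans)   # bootstrap standard error of the mean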

Example 8:

We will investigate samples taken from the CDC’s database of births. For the North Carolina data set NCBirths2004, we are interested in \(\mu\), the true mean birth weight of all North Carolina babies born in 2004 (the population mean).

  1. What is the average birth weight of an NC baby in this sample? (Here we are looking for the sample mean.)

  2. Find the mean and the standard error of the bootstrap distribution of the mean birth weight of NC babies.

  3. Plot and comment on the bootstrap distribution of the mean. (A sketch covering all three parts follows below.)
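This sketch assumes NCBirths2004 comes from the resampledata package and that the birth weights are stored in a column named Weight; check names(NCBirths2004) if your copy differs:

library(resampledata) # assumed source of NCBirths2004
data("NCBirths2004")

#1. sample mean birth weight (Weight column name is an assumption)
mean(NCBirths2004$Weight)

#2. bootstrap distribution of the mean
set.seed(1)
B <- 10000
bootMeans <- rep(NA, B)
n <- nrow(NCBirths2004)
for (b in 1:B) {
  idx <- sample.int(n, n, replace = TRUE) # resample rows with replacement
  bootMeans[b] <- mean(NCBirths2004$Weight[idx])
}
mean(bootMeans) # mean of the bootstrap distribution
sd(bootMeans)   # bootstrap standard error of the mean

#3. plot the bootstrap distribution
hist(bootMeans, breaks = 30,
     main = "Bootstrap distribution of the mean birth weight",
     xlab = "Mean birth weight")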


Note: We use \(*\) to denote the bootstrap estimates. For example, \(\bar{X}^{*}_1\) would be the mean from the first bootstrap sample.