Resampling methods involve repeatedly drawing samples from a training set and refitting a model of interest on each sample in order to obtain additional information about the fitted model.
The most commonly used resampling techniques are:
- Cross-Validation (CV): can be used to estimate the test error associated with a given statistical learning method in order to evaluate its performance, or to select the appropriate level of flexibility.
- Bootstrap: used in several contexts, most commonly to provide a measure of accuracy of a parameter estimate or of a given statistical learning method.
1 Cross-Validation
Recall from Chapter 2:
- Test Error: The test error is the average error that results from using a statistical learning method to predict the response on a new observation, that is, a measurement that was not used in training the method.
In the absence of a very large designated test set that can be used to directly estimate the test error rate, a number of techniques can be used to estimate this quantity using the available training data.
We consider three methods that estimate the test error rate by holding out a subset of the training observations from the fitting process, and then applying the statistical learning method to those held-out observations.
1.1 The Validation Set Approach
Step 1: Randomly divide the available set of observations into two parts: a training set and a test set (or holdout set).
Step 2: Fit the model using the training set.
Step 3: Use the fitted model to predict the responses for the observations in the test set.
Step 4: Compute the test MSE (test mean squared error).
The figure above is a schematic display of the validation set approach. A set of \(n\) observations is randomly split into a training set (shown in blue, containing observations 7, 22, and 13, among others) and a validation set (shown in beige, containing observation 91, among others). The statistical learning method is fit on the training set, and its performance is evaluated on the validation set.
Example 1:
For this example we use the Auto data set, which has 392 observations. Split the 392 observations into two sets: a training set containing 196 of the data points and a validation set containing the remaining 196 observations. Fit a simple linear regression model (degree one) to predict mpg using horsepower on the training set. Calculate the test MSE.
library(ISLR)  # provides the Auto data set
dim(Auto)
## [1] 392 9
#1.
train = sample.int(392, 196)  # Training indices (results depend on the random split)
trainSet <- Auto[train, ]     # Training data set
dim(trainSet)
## [1] 196 9
testSet <- Auto[-train, ]     # Validation (test) data set
dim(testSet)
## [1] 196 9
#2.
fit1 <- lm(mpg ~ horsepower, data = trainSet)  # Degree-one fit on the training set
fit1
##
## Call:
## lm(formula = mpg ~ horsepower, data = trainSet)
##
## Coefficients:
## (Intercept)   horsepower
##     41.2835      -0.1697
#3.
testPredit <- predict(fit1, testSet)            # yhat values on the test set
testMSE1 <- mean((testSet$mpg - testPredit)^2)  # Test MSE
testMSE1
## [1] 23.26601
Example 2:
Now we fit nine regression models, with degrees 2 through 10, to predict mpg using horsepower, using the same training set that we defined above. Find the test MSE for each of the nine models.
Example 3:
Create a plot of the degree of the model vs. the test MSE as a visual aid.
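A sketch of such a plot, assuming the testMSE vector computed in Example 2:
# Plot polynomial degree against the estimated test MSE.
plot(1:10, testMSE, type = "b",
     xlab = "Degree of Polynomial", ylab = "Test MSE")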
Note: The validation set MSE for the quadratic fit is considerably smaller than that for the linear (first-degree) fit.
Therefore our final model is the quadratic fit.
In the graph above, the validation method was repeated ten times, each time using a different random split of the observations into a training set and a validation set. This illustrates the variability in the estimated test MSE that results from this approach.
Disadvantages of the validation approach:
As seen in the graph above, the validation estimate of the test error rate can be highly variable, depending on precisely which observations are included in the training set and which observations are included in the validation set.
The validation set error rate may tend to overestimate the test error rate, since the model is fit on only a subset of the observations.
1.2 Leave-One-Out Cross-Validation (LOOCV)
Step 1: Split the set of observations such that:
- the validation set has one observation, say \((x_1, y_1)\)
- the training set has the remaining \(n-1\) observations.
Step 2: Fit the model using the training set (\(n-1\) observations).
Step 3: Predict \(\hat{y}_1\) for the excluded observation using its value \(x_1\).
Step 4: Calculate \(MSE_1 = (y_1 - \hat{y}_1)^2\).
Step 5: Repeat the process for all \(n\) observations.
Step 6: Calculate
\[CV_{(n)} = \frac{1}{n}\sum_{i=1}^{n} MSE_i\]
The figure above is a schematic display of LOOCV. A set of n data points is repeatedly split into a training set (shown in blue) containing all but one observation, and a validation set that contains only that observation (shown in beige). The test error is then estimated by averaging the n resulting MSEs. The first training set contains all but observation 1, the second training set contains all but observation 2, and so forth.
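A from-scratch sketch of these steps in R, assuming the Auto data from the ISLR package:
# Manual LOOCV for the degree-one model on the Auto data.
n <- nrow(Auto)
mse <- rep(NA, n)
for (i in 1:n) {
  fit <- lm(mpg ~ horsepower, data = Auto[-i, ])  # fit without observation i
  pred <- predict(fit, Auto[i, ])                 # predict the held-out observation
  mse[i] <- (Auto$mpg[i] - pred)^2                # MSE_i
}
mean(mse)  # CV_(n)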
Example 4:
For this example we use the Auto data set, which has 392 observations. Fit a simple linear regression model (degree one) to predict mpg using horsepower. Calculate the LOOCV Error.
Note: Here, we will perform linear regression using the glm() function rather than the lm() function, because the former can be used together with cv.glm() to get the LOOCV Error.
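A minimal sketch, assuming the boot package (which provides cv.glm()) and the Auto data from ISLR:
library(ISLR)  # Auto data
library(boot)  # cv.glm()
glmFit <- glm(mpg ~ horsepower, data = Auto)  # same fit as lm() for the default gaussian family
cv.glm(Auto, glmFit)$delta[1]  # LOOCV estimate of the test MSE (default K = n)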
Example 5:
Fit ten regression models, with degrees 1 through 10, to predict mpg using horsepower, using the LOOCV approach. Find the LOOCV Errors for each of the 10 models. Which model would you use as your final model and why?
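A sketch, reusing the cv.glm() setup above:
# LOOCV error for polynomial models of degree 1 through 10.
loocvErrs <- rep(NA, 10)
for (d in 1:10) {
  glmFitD <- glm(mpg ~ poly(horsepower, d), data = Auto)
  loocvErrs[d] <- cv.glm(Auto, glmFitD)$delta[1]
}
loocvErrs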
Advantages of the LOOCV approach:
- Less bias (the training set always has \(n-1\) observations).
- Does not overestimate the test error.
- The results are the same every time; there is no randomness in the splits.
1.3 k-Fold Cross-Validation
Step 1: Divide the set of observations into \(k\) groups, or folds, of approximately equal size.
Step 2: Treat the first fold as a validation set.
Step 3: Fit the model using the remaining \(k - 1\) folds.
Step 4: Compute the mean squared error, \(MSE_1\), on the observations in the held-out fold.
Step 5: Repeat this procedure \(k\) times; each time, a different fold of observations is treated as the validation set.
Step 6: This process results in \(k\) estimates of the test error, \(MSE_1, MSE_2, \ldots, MSE_k\).
Step 7: The \(k\)-fold CV estimate is computed by averaging these values,
\[CV_{(k)} = \frac{1}{k}\sum_{i=1}^{k} MSE_i\]
The figure above is a schematic display of 5-fold CV. A set of n observations is randomly split into five non-overlapping groups. Each of these fifths acts as a validation set (shown in beige), and the remainder as a training set (shown in blue). The test error is estimated by averaging the five resulting MSE estimates.
Note:
We usually perform \(k\)-fold CV using \(k=5\) or \(k=10\).
LOOCV is a special case of \(k\)-fold CV in which \(k=n\).
Advantage: \(k\)-fold CV is computationally much cheaper than LOOCV, since the model is fit only \(k\) times rather than \(n\) times.
Example 6:
For this example we use the Auto data set, which has 392 observations. Fit a simple linear regression model (degree one) to predict mpg using horsepower. Calculate the 10-fold CV Error.
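A sketch using cv.glm() with K = 10, reusing the packages loaded above (note that, unlike LOOCV, the folds are chosen at random, so results vary from run to run unless a seed is set):
glmFit <- glm(mpg ~ horsepower, data = Auto)
cv.glm(Auto, glmFit, K = 10)$delta[1]  # 10-fold CV estimate of the test MSE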
Example 7:
Fit ten regression models, with degrees 1 through 10, to predict mpg using horsepower, using the 10-fold CV approach. Find the 10-fold CV Errors for each of the 10 models. Which model would you use as your final model and why?
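A sketch, parallel to the LOOCV loop above:
# 10-fold CV error for polynomial models of degree 1 through 10.
cvErrs10 <- rep(NA, 10)
for (d in 1:10) {
  glmFitD <- glm(mpg ~ poly(horsepower, d), data = Auto)
  cvErrs10[d] <- cv.glm(Auto, glmFitD, K = 10)$delta[1]
}
cvErrs10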
1.3.1 Bias-Variance Trade-Off for k-Fold Cross-Validation
Method | No. of obs used to fit the model | Bias? | Variance? |
---|---|---|---|
Validation approach | \(n/2\) (usually) | Overestimates the test error, so high bias | High |
LOOCV | \(n-1\) | Approximately unbiased | High compared to \(k\)-fold CV |
\(k\)-fold CV | \((k-1)n/k\) (each training set) | Somewhere in between | \(k=5\) or \(10\) gives low variance |
1.4 The Bootstrap
Rather than repeatedly obtaining independent data sets from the population, we instead obtain distinct data sets by repeatedly sampling observations from the original data set with replacement. This process is known as the bootstrap.
This approach is illustrated in the figure below on a simple data set.
Note: The bootstrap sample size is \(n\), which is also the size of the original sample. This is usually the case.
The Bootstrap Idea: The original sample approximates the population from which it was drawn. So resamples from this sample approximate what we would get if we took many samples from the population. The bootstrap distribution of a statistic, based on many resamples, approximates the sampling distribution of the statistic, based on many samples.
Example 8:
We will investigate samples taken from the CDC's database of births. For the North Carolina data, NCBirths2004, we are interested in \(\mu\), the true mean birth weight of all North Carolina babies born in 2004 (the population mean).
- What is the average birth weight of an NC baby in this sample? (Here we are looking for the sample mean.)
- Find the mean and the standard error of the bootstrap distribution of the mean birth weight of NC babies.
- Plot and comment on the bootstrap distribution of the mean.
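A minimal sketch of the bootstrap computation, assuming NCBirths2004 comes from the resampledata package and that the birth weights are in a column named Weight (adjust the names to your data):
library(resampledata)  # NCBirths2004 data (assumed source)
mean(NCBirths2004$Weight)  # sample mean

# Draw B bootstrap resamples of size n (with replacement) and record each mean.
B <- 10000
bootMeans <- rep(NA, B)
for (b in 1:B) {
  resample <- sample(NCBirths2004$Weight, replace = TRUE)  # size n by default
  bootMeans[b] <- mean(resample)
}
mean(bootMeans)  # mean of the bootstrap distribution
sd(bootMeans)    # bootstrap standard error of the mean
hist(bootMeans, xlab = "Mean birth weight",
     main = "Bootstrap distribution of the mean")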
Note: We use * to denote the bootstrap estimates. For example, \(\bar{X}^{*}_{1}\) would be the mean from the first bootstrap sample.