1 What is Statistical Learning?
Statistical learning refers to a ________________.
These tools can be classified as:
1.1 Supervised Learning
This involves building a statistical model for ________________, or ________________, an output based on ________________. Problems of this nature occur in fields as diverse as business, medicine, astrophysics, and public policy.
Examples:
- Spam detection: using supervised classification algorithms, organizations can train a model on messages already labeled as spam or not spam, then use the patterns it has learned to sort new correspondence into spam and non-spam.
- Predicting house/property prices
1.2 Unsupervised Learning
Here, there are ________________ but no supervising ________________; nevertheless we can learn relationships and structure from such data.
Examples:
- Data exploration
- Customer segmentation: suppose we work for a company that sells clothes and we have data on previous customers: how much they spent, their ages, and the day they bought the product. Our task is to find patterns or relationships among these variables that give the company useful information, for example to design marketing strategies, to decide which type of client to focus on to maximize profit, or to identify customer segments where more effort could expand its market.
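The customer segmentation example can be sketched in R. This is only an illustration under stated assumptions: the data below are simulated (not real customer records), the column names `spent`, `age`, and `day` are hypothetical, and k-means with two clusters is one possible choice of method, not one named in the notes.

```r
set.seed(1)  # make the simulated data reproducible

# Simulated, hypothetical customer data: amount spent, age, day of month of purchase
customers <- data.frame(
  spent = c(rnorm(50, mean = 30, sd = 5), rnorm(50, mean = 120, sd = 15)),
  age   = c(rnorm(50, mean = 22, sd = 3), rnorm(50, mean = 45, sd = 6)),
  day   = sample(1:28, 100, replace = TRUE)
)

# Standardize the columns so no single variable dominates the distances
km <- kmeans(scale(customers), centers = 2, nstart = 20)

# Average profile of each segment, on the original scale
aggregate(customers, by = list(segment = km$cluster), FUN = mean)
```

No output variable is used anywhere in this sketch: the segments come only from the observed characteristics, which is what makes the problem unsupervised.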
1.3 Data sets
To provide an illustration of some applications of statistical learning, we briefly discuss three real-world data sets.
- Wage data: Wage
- Stock Market data: Smarket
- Gene Expression data: NCI60
1.3.1 Wage data: Wage
In this application we examine a number of factors that relate to wages for a group of males from the Atlantic region of the United States. In particular, we wish to understand how an employee’s wage is associated with his age and education, as well as with the calendar year.
## year age maritl race education region
## 231655 2006 18 1. Never Married 1. White 1. < HS Grad 2. Middle Atlantic
## 86582 2004 24 1. Never Married 1. White 4. College Grad 2. Middle Atlantic
## 161300 2003 45 2. Married 1. White 3. Some College 2. Middle Atlantic
## 155159 2003 43 2. Married 3. Asian 4. College Grad 2. Middle Atlantic
## 11443 2005 50 4. Divorced 1. White 2. HS Grad 2. Middle Atlantic
## 376662 2008 54 2. Married 1. White 4. College Grad 2. Middle Atlantic
## jobclass health health_ins logwage wage
## 231655 1. Industrial 1. <=Good 2. No 4.318063 75.04315
## 86582 2. Information 2. >=Very Good 2. No 4.255273 70.47602
## 161300 1. Industrial 1. <=Good 1. Yes 4.875061 130.98218
## 155159 2. Information 2. >=Very Good 1. Yes 5.041393 154.68529
## 11443 2. Information 1. <=Good 1. Yes 4.318063 75.04315
## 376662 2. Information 2. >=Very Good 1. Yes 4.845098 127.11574
## 'data.frame': 3000 obs. of 11 variables:
## $ year : int 2006 2004 2003 2003 2005 2008 2009 2008 2006 2004 ...
## $ age : int 18 24 45 43 50 54 44 30 41 52 ...
## $ maritl : Factor w/ 5 levels "1. Never Married",..: 1 1 2 2 4 2 2 1 1 2 ...
## $ race : Factor w/ 4 levels "1. White","2. Black",..: 1 1 1 3 1 1 4 3 2 1 ...
## $ education : Factor w/ 5 levels "1. < HS Grad",..: 1 4 3 4 2 4 3 3 3 2 ...
## $ region : Factor w/ 9 levels "1. New England",..: 2 2 2 2 2 2 2 2 2 2 ...
## $ jobclass : Factor w/ 2 levels "1. Industrial",..: 1 2 1 2 2 2 1 2 2 2 ...
## $ health : Factor w/ 2 levels "1. <=Good","2. >=Very Good": 1 2 1 2 1 2 2 1 2 2 ...
## $ health_ins: Factor w/ 2 levels "1. Yes","2. No": 2 2 1 1 1 1 1 1 1 1 ...
## $ logwage : num 4.32 4.26 4.88 5.04 4.32 ...
## $ wage : num 75 70.5 131 154.7 75 ...
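The output above looks like the result of `head()` and `str()` applied to the data frame. A minimal sketch of how it could be reproduced, assuming the data come from the `ISLR` package (the notes do not say which package was used):

```r
# Assumes the ISLR package is installed: install.packages("ISLR")
library(ISLR)

# First six rows, one row per worker
head(Wage)

# Column types and factor levels
str(Wage)

# Number of rows (workers) and columns (variables)
dim(Wage)
```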
Consider, for example, wage versus age for each of the individuals in the data set.
i) Create a scatter plot (shown here) for wage versus age
ii) Describe the plot you created in i)
iii) Create a scatter plot (shown here) for wage versus year
iv) Describe the plot you created in iii)
v) Create a boxplot (shown here) for wage for each education level.
vi) Describe the plot you created in v)
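One way the three plots could be drawn with base R graphics is sketched below. It assumes the `Wage` data frame from the `ISLR` package; the colors, point symbols, and axis labels are illustrative choices.

```r
# Assumes the ISLR package is installed
library(ISLR)

# Scatter plot of wage versus age
plot(Wage$age, Wage$wage, xlab = "Age", ylab = "Wage", col = "grey", pch = 20)

# Scatter plot of wage versus year
plot(Wage$year, Wage$wage, xlab = "Year", ylab = "Wage", col = "grey", pch = 20)

# Boxplot of wage for each education level
boxplot(wage ~ education, data = Wage, xlab = "Education level", ylab = "Wage")
```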
Clearly, the most accurate prediction of a given man’s wage will be obtained by combining his age, his education, and the year.
Note:
The Wage data involves predicting a ________________ output value.
This is often referred to as a ________________ problem.
1.3.2 Stock Market data: Smarket
In this case we instead wish to predict a non-numerical value—that is, a ________________ output.
## Year Lag1 Lag2 Lag3 Lag4 Lag5 Volume Today Direction
## 1 2001 0.381 -0.192 -2.624 -1.055 5.010 1.1913 0.959 Up
## 2 2001 0.959 0.381 -0.192 -2.624 -1.055 1.2965 1.032 Up
## 3 2001 1.032 0.959 0.381 -0.192 -2.624 1.4112 -0.623 Down
## 4 2001 -0.623 1.032 0.959 0.381 -0.192 1.2760 0.614 Up
## 5 2001 0.614 -0.623 1.032 0.959 0.381 1.2057 0.213 Up
## 6 2001 0.213 0.614 -0.623 1.032 0.959 1.3491 1.392 Up
The goal is to predict whether the index will increase or decrease on a given day using the past 5 days’ percentage changes in the index.
Here the statistical learning problem does not involve predicting a numerical value. Instead it involves predicting whether a given day’s stock market performance will fall into the ___ bucket or the ____ bucket.
Note:
This is known as a _____________ problem.
- Create a boxplot (shown here) of yesterday’s percentage change (Lag1) for each level of the Direction variable
- Is there any indication of an association between the past and present performance of the stock market?
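A possible way to draw this boxplot is sketched below, assuming the `Smarket` data frame from the `ISLR` package. The group means are added here as a rough numerical companion to the plot and are not part of the exercise.

```r
# Assumes the ISLR package is installed
library(ISLR)

# Yesterday's percentage change (Lag1), split by today's Direction
boxplot(Lag1 ~ Direction, data = Smarket,
        xlab = "Today's direction",
        ylab = "Yesterday's percentage change (Lag1)")

# Mean of Lag1 within each Direction group
tapply(Smarket$Lag1, Smarket$Direction, mean)
```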
1.3.3 Gene Expression data: NCI60
The previous two applications illustrate data sets with both input and output variables. However, another important class of problems involves situations in which we only observe ___________ variables, with no corresponding ______________.
Example: In a marketing setting, we might have demographic information for a number of current customers. We may wish to understand which types of customers are similar to each other by grouping individuals according to their observed characteristics.
Note:
This is known as a _____________ problem.
We consider the NCI60 data set, which consists of 6830 gene expression measurements for each of 64 cancer cell lines. Instead of predicting a particular output variable, we are interested in determining whether there are groups, or clusters, among the cell lines based on their gene expression measurements. This is a difficult question to address, in part because there are thousands of gene expression measurements per cell line, making it hard to visualize the data.
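One common way to get a first look at data with thousands of variables is to project each cell line onto its first two principal components. The sketch below assumes the `NCI60` object from the `ISLR` package, which stores the measurements in `NCI60$data`. Principal components analysis is a choice made here for illustration; the paragraph above does not name a method.

```r
# Assumes the ISLR package is installed
library(ISLR)

nci_data <- NCI60$data  # one row per cell line, one column per measurement
dim(nci_data)

# Principal components on standardized measurements
pr_out <- prcomp(nci_data, scale. = TRUE)

# Each point is one cell line, summarized by two numbers instead of thousands
plot(pr_out$x[, 1], pr_out$x[, 2], xlab = "PC1", ylab = "PC2", pch = 19)
```

Points that fall close together in this plot have similar expression profiles along the two directions of greatest variation, which gives a rough visual hint about possible clusters.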
2 Summary of what we learned in this chapter:
3 What do we cover in this class?
In Chapter 2 we introduce the basic terminology and concepts behind statistical learning. This chapter also presents the \(K\)-nearest neighbor classifier, a very simple method that works surprisingly well on many problems.
Chapter 3 reviews linear regression, the fundamental starting point for all regression methods.
A central problem in all statistical learning situations involves choosing the best method for a given application. Hence, in Chapter 5 we introduce cross-validation and the bootstrap, which can be used to estimate the accuracy of a number of different methods in order to choose the best one.
In Chapter 6 we consider a host of linear methods, both classical and more modern, which offer potential improvements over standard linear regression. These include stepwise selection, ridge regression, principal components regression, partial least squares, and the lasso.