Binomial Response

Simulated Data Example 1

# Load libraries
library(dplyr)
library(data.table)
# --------------------------
# Step 1: Create example dataset
# --------------------------
set.seed(123)

n <- 200
students <- data.frame(
  familiarR = rbinom(n, 1, 0.4),             # 0 = No, 1 = Yes
  hoursStudy = rpois(n, lambda = 120),       # study hours
  STAMscore = sample(5:9, n, replace = TRUE) # STAM scores
)

# True model (simulate outcome with known betas)
eta <- -1.3 + 1*students$familiarR + 
       0.01*students$hoursStudy +
       ifelse(students$STAMscore <= 5, -2.5,
              ifelse(students$STAMscore == 6, 0,
                     ifelse(students$STAMscore == 7, 1.2, 1.6)))

prob <- exp(eta) / (1 + exp(eta))
students$passSRM <- rbinom(n, 1, prob)

data.table(students)

     familiarR hoursStudy STAMscore passSRM
         <int>      <int>     <int>   <int>
  1:         0        112         6       0
  2:         1        120         7       1
  3:         0        132         7       0
  4:         1        113         5       0
  5:         1        119         6       1
 ---                                       
196:         0        110         9       1
197:         0        126         7       0
198:         1        130         6       1
199:         0        138         5       0
200:         0        120         8       1

Calculate the of passing Exam SRM for someone who:

is not familiar with R,
has studied the ASM manual for 150 hours,
passed Exam STAM with a score of 7.

1) Fit logistic regression model

2) Predict for given scenario

Scenario: not familiar with R, 150 study hours, STAM score = 7

Compute the Systematic component \(\eta\) (eta)

Systematic component (eta) is : 0.9516063

Calculate the Odds of passing.

Odds of passing: 2.5898663

Calculate the Probability of passing

Probability of passing: 0.7214381

Simulated data example 2

An insurance company wants to estimate the probability of a claim based on driver characteristics and vehicle type. The response variable Claim is binary (1 = claim, 0 = no claim). Predictors include:

Vehicle Body: Coupe, Roadster, Sedan, Station Wagon, Truck, Utility

Driver Gender: Male, Female

Area: A, B, C, D

A probit regression model is fitted to the data:

\[Claim ∼ VehicleBody + Gender + Area\]

Tasks:

Fit a GLM with a probit link using the provided dataset.
Using the fitted model, calculate the estimated probability of a claim for the following scenarios:
1. Male driver, Roadster, Area A
2. Female driver, Sedan, Area B
3. Male driver, Utility, Area D
Compare the results across the three cases. Which combination of characteristics results in the highest predicted probability of a claim? Which results in the lowest?
Briefly interpret what these results mean in the context of insurance risk.

## -------------------------------
## Example: Probit GLM on Insurance Claims
## ----------------------------- --

## Load packages
library(dplyr)

set.seed(123)  # for reproducibility

## Simulate a dataset
n <- 500

VehicleBody <- sample(c("Coupe", "Roadster", "Sedan",
                        "StationWagon", "Truck", "Utility"),
                      size = n, replace = TRUE)

Gender <- sample(c("Male", "Female"),
                 size = n, replace = TRUE)

Area <- sample(c("A", "B", "C", "D"),
               size = n, replace = TRUE)

## True coefficients (like in your summary table)
beta0 <- -1.5
beta_vehicle <- c(Coupe = -0.9, Roadster = -1.0, Sedan = -1.2,
                  StationWagon = -1.1, Truck = -1.1, Utility = -1.3)
beta_gender <- c(Male = -0.03, Female = 0)
beta_area <- c(A = 0, B = 0.1, C = 0.04, D = -0.1)

## Linear predictor
eta <- beta0 +
  beta_vehicle[VehicleBody] +
  beta_gender[Gender] +
  beta_area[Area]

## Probit link: probability = Phi(eta)
p_claim <- pnorm(eta)

## Generate binary response
Claim <- rbinom(n, size = 1, prob = p_claim)

## Put into a dataframe
ins_data <- data.frame(Claim, VehicleBody, Gender, Area)
data.table(ins_data)

     Claim  VehicleBody Gender   Area
     <int>       <char> <char> <char>
  1:     0        Sedan   Male      B
  2:     0      Utility   Male      D
  3:     0        Sedan   Male      D
  4:     0     Roadster Female      D
  5:     0     Roadster Female      B
 ---                                 
496:     0        Coupe   Male      D
497:     0        Truck   Male      A
498:     0      Utility Female      A
499:     0 StationWagon Female      C
500:     0 StationWagon Female      D

1) Fit a probit GLM

2) Prediction:

3) Compare the results across the three cases

4) Interpretation

Nominal Response

Problem Statement

An analyst wants to model which type of car a person buys. The response has three categories:

Sedan (reference/base category)
Van
SUV

Explanatory variables:

Gender: Male, Female
AgeGroup: Under25, 25to54, 55plus

A nominal (multinomial) logistic regression is fit with Sedan as the base outcome. That is, the model estimates separate log-odds equations for Van vs Sedan and SUV vs Sedan, each using the same predictors (Gender, AgeGroup).

Tasks

Simulate a dataset and fit a multinomial logistic regression with base outcome = Sedan.
Using your fitted model, compute and interpret the odds ratio for Gender = Female in the Van vs Sedan equation.
Using your fitted model, compute the predicted probability that a Male age 20 (AgeGroup = Under25) buys an SUV.

1. Data Simulation (Optional to Understand for our class)

library(dplyr)
library(nnet)

set.seed(2025)

# Sample size
n <- 1200

# Simulate predictors
Gender   <- sample(c("Male", "Female"), size = n, replace = TRUE, prob = c(0.5, 0.5))
AgeGroup <- sample(c("Under25", "25to54", "55plus"),
                   size = n, replace = TRUE, prob = c(0.25, 0.55, 0.20))

# Set reference levels
dat <- tibble(Gender, AgeGroup) %>%
  mutate(
    Gender = factor(Gender, levels = c("Male","Female")),
    AgeGroup = factor(AgeGroup, levels = c("25to54","Under25","55plus"))
  )

# True coefficients (to generate outcomes)
beta <- list(
  Van = c("(Intercept)" = 0.10,
          "GenderFemale" = -0.18,
          "AgeGroupUnder25" = -0.11,
          "AgeGroup55plus" = 0.06),
  SUV = c("(Intercept)" = -0.02,
          "GenderFemale" = -0.06,
          "AgeGroupUnder25" = 0.18,
          "AgeGroup55plus" = 0.04)
)

# Indicator variables
X_GenderFemale <- as.numeric(dat$Gender == "Female")
X_Under25      <- as.numeric(dat$AgeGroup == "Under25")
X_55plus       <- as.numeric(dat$AgeGroup == "55plus")

# Linear predictors
eta_van <- beta$Van["(Intercept)"] +
  beta$Van["GenderFemale"] * X_GenderFemale +
  beta$Van["AgeGroupUnder25"] * X_Under25 +
  beta$Van["AgeGroup55plus"] * X_55plus

eta_suv <- beta$SUV["(Intercept)"] +
  beta$SUV["GenderFemale"] * X_GenderFemale +
  beta$SUV["AgeGroupUnder25"] * X_Under25 +
  beta$SUV["AgeGroup55plus"] * X_55plus

# Probabilities (softmax)
exp_van <- exp(eta_van)
exp_suv <- exp(eta_suv)
denom   <- 1 + exp_van + exp_suv

p_sedan <- 1 / denom
p_van   <- exp_van / denom
p_suv   <- exp_suv / denom

# Generate outcome
y <- mapply(function(ps, pv, pu) sample(c("Sedan","Van","SUV"), 1,
                                        prob = c(ps, pv, pu)),
            ps = p_sedan, pv = p_van, pu = p_suv)

dat <- dat %>% mutate(CarType = factor(y, levels = c("Sedan","Van","SUV")))

data.table(dat)

##       Gender AgeGroup CarType
##       <fctr>   <fctr>  <fctr>
##    1:   Male   25to54   Sedan
##    2: Female  Under25     SUV
##    3:   Male   25to54   Sedan
##    4: Female   55plus   Sedan
##    5:   Male   55plus   Sedan
##   ---                        
## 1196: Female  Under25   Sedan
## 1197:   Male  Under25   Sedan
## 1198:   Male   25to54   Sedan
## 1199: Female  Under25   Sedan
## 1200: Female   25to54   Sedan

Model Fitting

2. Using your fitted model, compute and interpret the odds ratio for Gender = Female in the Van vs Sedan equation.

Log-odds coefficient (Female, Van vs Sedan): -0.281
Odds ratio: 0.755

Interpretation:

3. Using your fitted model, compute the predicted probability that a Male age 20 (AgeGroup = Under25) buys an SUV.

Ordinal Response

Problem Statement

A university surveys course satisfaction using an ordinal 3-level scale:

Category	Description
1	Dissatisfied
2	Neutral
3	Satisfied

A cumulative proportional odds model is used with a single explanatory variable mode (In-person vs. Telehealth). All other covariates are identical for the two groups.

Given target probabilities:

Mode	P(Y=1)	P(Y=2)	P(Y=3)
In-person	0.377541	0.440033	0.182426
Telehealth	0.450166	?	?

Goal: Determine the probability of Neutral, \(\Pr(Y=2)\), for the Telehealth group.

Data Simulation (consistent with the worked example)

We simulate data from a proportional-odds (logistic) model with cutpoints \(\alpha_1=-0.5\), \(\alpha_2=1.5\) for the baseline (In-person), and a mode effect \(\beta=0.3\) added to the cumulative logits for Telehealth. These reproduce the In-person probabilities and the Telehealth shift used in the solution.

set.seed(3852)
library(dplyr)
library(MASS)   # polr()
library(tidyr)

# Cutpoints for baseline (In-person)
alpha1 <- -0.5
alpha2 <-  1.5
beta   <-  0.3   # proportional-odds shift for Telehealth

# Helper: logistic CDF
logistic <- function(x) 1/(1 + exp(-x))

# Function to generate ordinal outcomes under proportional odds
# P(Y<=1 | mode) = logistic(alpha1 + beta*mode)
# P(Y<=2 | mode) = logistic(alpha2 + beta*mode)
# Then derive category probabilities
r_ord <- function(n, mode){
  eta1 <- alpha1 + beta*mode
  eta2 <- alpha2 + beta*mode
  p_le1 <- logistic(eta1)
  p_le2 <- logistic(eta2)
  p1 <- p_le1
  p2 <- p_le2 - p_le1
  p3 <- 1 - p_le2
  # sample categories 1,2,3
  draws <- apply(cbind(p1, p2, p3), 1, function(pp) sample(1:3, size=1, prob=pp))
  factor(draws, levels = 1:3, labels = c("1","2","3"))
}

# Simulate data
n_in   <- 5000
n_tele <- 5000
dat_in <- tibble(mode = factor(rep("In-person", n_in), levels = c("In-person","Telehealth")),
                 mode_num = 0)
dat_te <- tibble(mode = factor(rep("Telehealth", n_tele), levels = c("In-person","Telehealth")),
                 mode_num = 1)

dat <- bind_rows(dat_in, dat_te) %>%
  mutate(Y = r_ord(n = n(), mode = mode_num))

data.table(dat)

##              mode mode_num      Y
##            <fctr>    <num> <fctr>
##     1:  In-person        0      2
##     2:  In-person        0      2
##     3:  In-person        0      2
##     4:  In-person        0      1
##     5:  In-person        0      1
##    ---                           
##  9996: Telehealth        1      3
##  9997: Telehealth        1      1
##  9998: Telehealth        1      2
##  9999: Telehealth        1      2
## 10000: Telehealth        1      1

# Check empirical probabilities to verify target structure (large n -> close)
emp <- dat %>% count(mode, Y) %>% group_by(mode) %>% mutate(p = n/sum(n))
data.table(emp)

##          mode      Y     n      p
##        <fctr> <fctr> <int>  <num>
## 1:  In-person      1  1890 0.3780
## 2:  In-person      2  2202 0.4404
## 3:  In-person      3   908 0.1816
## 4: Telehealth      1  2261 0.4522
## 5: Telehealth      2  2009 0.4018
## 6: Telehealth      3   730 0.1460

Fit the Cumulative Proportional Odds Model

We fit an ordinal logistic regression (proportional odds) with MASS::polr, using mode as the single predictor and category 1 < 2 < 3 order.

Compute \(Pr(Y=2∣Telehealth)\) from the fitted model

Poisson Response

Example 1 — Number of Cars (Poisson, log link)

Response: Number of cars
Predictors: \(\ln(\text{Income})\) (continuous), Family size (categorical: 1–2, 3–4, 5+)
Target coefficients (for data generation):
\(\beta_0 = 0.186\), \(\beta_{\ln(\text{Income})}=0.009\),
\(\beta_{\text{Fam:1–2}}=0\) (reference), \(\beta_{\text{Fam:3–4}}=0.137\), \(\beta_{\text{Fam:5+}}=0.355\).

Task: For \(\text{Income}=150{,}000\) and Family size = 4 (i.e., 3–4 category), compute the variance of \(Y\) under the Poisson GLM.

Simulate Data & Fit Model

set.seed(4830)
library(dplyr)

# 1) Simulate predictors
n <- 4000
# Income: log-normal-ish spread centered around ~100k–150k
log_income <- rnorm(n, mean = log(120000), sd = 0.5)
Income     <- exp(log_income)

FamilySize <- sample(c("1-2","3-4","5+"), size = n, replace = TRUE, prob = c(0.45, 0.4, 0.15))
FamilySize <- factor(FamilySize, levels = c("1-2","3-4","5+"))  # set reference = 1-2

# 2) True coefficients (match your worked example structure)
b0   <- 0.186
b_ln <- 0.009
b_fs <- c("1-2" = 0.000, "3-4" = 0.137, "5+" = 0.355)

eta_true <- b0 + b_ln*log(Income) + b_fs[as.character(FamilySize)]
mu_true  <- exp(eta_true)

# 3) Generate Poisson outcomes
Cars <- rpois(n, lambda = mu_true)

dat1 <- tibble(Cars, Income, logIncome = log(Income), FamilySize)
data.table(dat1)

##        Cars    Income logIncome FamilySize
##       <int>     <num>     <num>     <fctr>
##    1:     1  71200.36  11.17325        3-4
##    2:     3  75520.50  11.23216        3-4
##    3:     3 115753.95  11.65922        1-2
##    4:     3 126596.82  11.74876        1-2
##    5:     1  92295.18  11.43275        1-2
##   ---                                     
## 3996:     2  61437.72  11.02578        1-2
## 3997:     0  77520.67  11.25830        3-4
## 3998:     1  47426.84  10.76694        3-4
## 3999:     3 138898.78  11.84150         5+
## 4000:     0 229913.72  12.34546         5+

Fit Poisson GLM (log link)

Compute the Required Variance from the Fitted Model

We want Income=150,000 and Family size = 4: category 3–4 in our coding. For a Poisson GLM, \(Var(Y)=\mu\).

Estimated mean \(\hat{\mu}\): 1.598 Estimated variance \(Var(Y)\): 1.598 (Poisson: Var = Mean)

Example 2 — Claim Count (Poisson Model)

Response Variable: Claim Count
Distribution: Poisson
Link Function: Log
Predictors: Rating class (Standard vs Preferred), Miles driven (in thousands)

Parameter	df	Estimate
Intercept	1	-1.512
Rating class
* Standard	0	0.000
* Preferred	1	-0.301
Miles driven	1	0.020

Problem

Calculate the probability that a Preferred driver who drives 10,000 miles (i.e., 10 units in thousands) submits at least one claim.

set.seed(3852)

m <- 5000
Rating <- sample(c("Standard","Preferred"), size = m, replace = TRUE, prob = c(0.6, 0.4))
Rating <- factor(Rating, levels = c("Standard","Preferred"))  # ref = Standard

# Miles in thousands (nonnegative, skewed)
Miles <- pmax(0, rnorm(m, mean = 9, sd = 3))  # center near 9k–10k

# True coefficients
b0    <- -1.512
b_pref<- -0.301
b_mi  <-  0.020

eta2_true <- b0 + (Rating == "Preferred")*b_pref + b_mi*Miles
mu2_true  <- exp(eta2_true)

Claims <- rpois(m, lambda = mu2_true)

dat2 <- tibble(Claims, Rating, Miles)
data.table(dat2)

##       Claims    Rating     Miles
##        <int>    <fctr>     <num>
##    1:      0  Standard 12.539087
##    2:      0  Standard  5.419065
##    3:      0  Standard  9.173771
##    4:      0 Preferred  6.289731
##    5:      0 Preferred 10.361668
##   ---                           
## 4996:      0  Standard  7.767208
## 4997:      2 Preferred 14.172462
## 4998:      2 Preferred 11.584201
## 4999:      1 Preferred 11.161610
## 5000:      0  Standard  2.603598

Fit Poisson GLM

Compute \(Pr(Y \geq 1)\) for Preferred, 10,000 miles

For Poisson \(Y \sim Pois(\mu)\): \(Pr(Y\geq1)=1−Pr(Y=0)=1−e^ {-\mu}\)

Estimated mean claims \(\hat{\mu}\): 0.182 Estimated Pr(Y): 0.167

Categorical Response

Dr. Lasanthi Watagoda

2025-10-01

Binomial Response

Simulated Data Example 1

1) Fit logistic regression model

2) Predict for given scenario

Scenario: not familiar with R, 150 study hours, STAM score = 7

Simulated data example 2

1) Fit a probit GLM

2) Prediction:

3) Compare the results across the three cases

4) Interpretation

Nominal Response

Problem Statement

Tasks

1. Data Simulation (Optional to Understand for our class)

Model Fitting

2. Using your fitted model, compute and interpret the odds ratio for Gender = Female in the Van vs Sedan equation.

3. Using your fitted model, compute the predicted probability that a Male age 20 (AgeGroup = Under25) buys an SUV.

Ordinal Response

Problem Statement

Data Simulation (consistent with the worked example)

Fit the Cumulative Proportional Odds Model

Compute \(Pr(Y=2∣Telehealth)\) from the fitted model

Poisson Response

Example 1 — Number of Cars (Poisson, log link)

Simulate Data & Fit Model

Fit Poisson GLM (log link)

Example 2 — Claim Count (Poisson Model)

Problem

Fit Poisson GLM