Binomial Response

Simulated Data Example 1

# Load libraries
library(dplyr)
library(data.table)
# --------------------------
# Step 1: Create example dataset
# --------------------------
set.seed(123)

n <- 200
students <- data.frame(
  familiarR = rbinom(n, 1, 0.4),             # 0 = No, 1 = Yes
  hoursStudy = rpois(n, lambda = 120),       # study hours
  STAMscore = sample(5:9, n, replace = TRUE) # STAM scores
)

# True model (simulate outcome with known betas)
eta <- -1.3 + 1*students$familiarR + 
       0.01*students$hoursStudy +
       ifelse(students$STAMscore <= 5, -2.5,
              ifelse(students$STAMscore == 6, 0,
                     ifelse(students$STAMscore == 7, 1.2, 1.6)))

prob <- exp(eta) / (1 + exp(eta))
students$passSRM <- rbinom(n, 1, prob)

data.table(students)
     familiarR hoursStudy STAMscore passSRM
         <int>      <int>     <int>   <int>
  1:         0        112         6       0
  2:         1        120         7       1
  3:         0        132         7       0
  4:         1        113         5       0
  5:         1        119         6       1
 ---                                       
196:         0        110         9       1
197:         0        126         7       0
198:         1        130         6       1
199:         0        138         5       0
200:         0        120         8       1

Calculate the of passing Exam SRM for someone who:

  • is not familiar with R,
  • has studied the ASM manual for 150 hours,
  • passed Exam STAM with a score of 7.

1) Fit logistic regression model

2) Predict for given scenario

Scenario: not familiar with R, 150 study hours, STAM score = 7

  1. Compute the Systematic component \(\eta\) (eta)

Systematic component (eta) is : 0.9516063

  1. Calculate the Odds of passing.

Odds of passing: 2.5898663

  1. Calculate the Probability of passing

Probability of passing: 0.7214381

Simulated data example 2

An insurance company wants to estimate the probability of a claim based on driver characteristics and vehicle type. The response variable Claim is binary (1 = claim, 0 = no claim). Predictors include:

Vehicle Body: Coupe, Roadster, Sedan, Station Wagon, Truck, Utility

Driver Gender: Male, Female

Area: A, B, C, D

A probit regression model is fitted to the data:

\[Claim ∼ VehicleBody + Gender + Area\]

Tasks:

  1. Fit a GLM with a probit link using the provided dataset.

  2. Using the fitted model, calculate the estimated probability of a claim for the following scenarios:

    1. Male driver, Roadster, Area A
    2. Female driver, Sedan, Area B
    3. Male driver, Utility, Area D
  3. Compare the results across the three cases. Which combination of characteristics results in the highest predicted probability of a claim? Which results in the lowest?

  4. Briefly interpret what these results mean in the context of insurance risk.

## -------------------------------
## Example: Probit GLM on Insurance Claims
## ----------------------------- --

## Load packages
library(dplyr)

set.seed(123)  # for reproducibility

## Simulate a dataset
n <- 500

VehicleBody <- sample(c("Coupe", "Roadster", "Sedan",
                        "StationWagon", "Truck", "Utility"),
                      size = n, replace = TRUE)

Gender <- sample(c("Male", "Female"),
                 size = n, replace = TRUE)

Area <- sample(c("A", "B", "C", "D"),
               size = n, replace = TRUE)

## True coefficients (like in your summary table)
beta0 <- -1.5
beta_vehicle <- c(Coupe = -0.9, Roadster = -1.0, Sedan = -1.2,
                  StationWagon = -1.1, Truck = -1.1, Utility = -1.3)
beta_gender <- c(Male = -0.03, Female = 0)
beta_area <- c(A = 0, B = 0.1, C = 0.04, D = -0.1)

## Linear predictor
eta <- beta0 +
  beta_vehicle[VehicleBody] +
  beta_gender[Gender] +
  beta_area[Area]

## Probit link: probability = Phi(eta)
p_claim <- pnorm(eta)

## Generate binary response
Claim <- rbinom(n, size = 1, prob = p_claim)

## Put into a dataframe
ins_data <- data.frame(Claim, VehicleBody, Gender, Area)
data.table(ins_data)
     Claim  VehicleBody Gender   Area
     <int>       <char> <char> <char>
  1:     0        Sedan   Male      B
  2:     0      Utility   Male      D
  3:     0        Sedan   Male      D
  4:     0     Roadster Female      D
  5:     0     Roadster Female      B
 ---                                 
496:     0        Coupe   Male      D
497:     0        Truck   Male      A
498:     0      Utility Female      A
499:     0 StationWagon Female      C
500:     0 StationWagon Female      D

1) Fit a probit GLM

2) Prediction:

3) Compare the results across the three cases

4) Interpretation


Nominal Response

Problem Statement

An analyst wants to model which type of car a person buys. The response has three categories:

  • Sedan (reference/base category)
  • Van
  • SUV

Explanatory variables:

  • Gender: Male, Female
  • AgeGroup: Under25, 25to54, 55plus

A nominal (multinomial) logistic regression is fit with Sedan as the base outcome. That is, the model estimates separate log-odds equations for Van vs Sedan and SUV vs Sedan, each using the same predictors (Gender, AgeGroup).

Tasks

  1. Simulate a dataset and fit a multinomial logistic regression with base outcome = Sedan.
  2. Using your fitted model, compute and interpret the odds ratio for Gender = Female in the Van vs Sedan equation.
  3. Using your fitted model, compute the predicted probability that a Male age 20 (AgeGroup = Under25) buys an SUV.

1. Data Simulation (Optional to Understand for our class)

library(dplyr)
library(nnet)

set.seed(2025)

# Sample size
n <- 1200

# Simulate predictors
Gender   <- sample(c("Male", "Female"), size = n, replace = TRUE, prob = c(0.5, 0.5))
AgeGroup <- sample(c("Under25", "25to54", "55plus"),
                   size = n, replace = TRUE, prob = c(0.25, 0.55, 0.20))

# Set reference levels
dat <- tibble(Gender, AgeGroup) %>%
  mutate(
    Gender = factor(Gender, levels = c("Male","Female")),
    AgeGroup = factor(AgeGroup, levels = c("25to54","Under25","55plus"))
  )

# True coefficients (to generate outcomes)
beta <- list(
  Van = c("(Intercept)" = 0.10,
          "GenderFemale" = -0.18,
          "AgeGroupUnder25" = -0.11,
          "AgeGroup55plus" = 0.06),
  SUV = c("(Intercept)" = -0.02,
          "GenderFemale" = -0.06,
          "AgeGroupUnder25" = 0.18,
          "AgeGroup55plus" = 0.04)
)

# Indicator variables
X_GenderFemale <- as.numeric(dat$Gender == "Female")
X_Under25      <- as.numeric(dat$AgeGroup == "Under25")
X_55plus       <- as.numeric(dat$AgeGroup == "55plus")

# Linear predictors
eta_van <- beta$Van["(Intercept)"] +
  beta$Van["GenderFemale"] * X_GenderFemale +
  beta$Van["AgeGroupUnder25"] * X_Under25 +
  beta$Van["AgeGroup55plus"] * X_55plus

eta_suv <- beta$SUV["(Intercept)"] +
  beta$SUV["GenderFemale"] * X_GenderFemale +
  beta$SUV["AgeGroupUnder25"] * X_Under25 +
  beta$SUV["AgeGroup55plus"] * X_55plus

# Probabilities (softmax)
exp_van <- exp(eta_van)
exp_suv <- exp(eta_suv)
denom   <- 1 + exp_van + exp_suv

p_sedan <- 1 / denom
p_van   <- exp_van / denom
p_suv   <- exp_suv / denom

# Generate outcome
y <- mapply(function(ps, pv, pu) sample(c("Sedan","Van","SUV"), 1,
                                        prob = c(ps, pv, pu)),
            ps = p_sedan, pv = p_van, pu = p_suv)

dat <- dat %>% mutate(CarType = factor(y, levels = c("Sedan","Van","SUV")))

data.table(dat)
##       Gender AgeGroup CarType
##       <fctr>   <fctr>  <fctr>
##    1:   Male   25to54   Sedan
##    2: Female  Under25     SUV
##    3:   Male   25to54   Sedan
##    4: Female   55plus   Sedan
##    5:   Male   55plus   Sedan
##   ---                        
## 1196: Female  Under25   Sedan
## 1197:   Male  Under25   Sedan
## 1198:   Male   25to54   Sedan
## 1199: Female  Under25   Sedan
## 1200: Female   25to54   Sedan

Model Fitting

2. Using your fitted model, compute and interpret the odds ratio for Gender = Female in the Van vs Sedan equation.

  • Log-odds coefficient (Female, Van vs Sedan): -0.281

  • Odds ratio: 0.755

Interpretation:

3. Using your fitted model, compute the predicted probability that a Male age 20 (AgeGroup = Under25) buys an SUV.


Ordinal Response

Problem Statement

A university surveys course satisfaction using an ordinal 3-level scale:

Category Description
1 Dissatisfied
2 Neutral
3 Satisfied

A cumulative proportional odds model is used with a single explanatory variable mode (In-person vs. Telehealth). All other covariates are identical for the two groups.

Given target probabilities:

Mode P(Y=1) P(Y=2) P(Y=3)
In-person 0.377541 0.440033 0.182426
Telehealth 0.450166 ? ?

Goal: Determine the probability of Neutral, \(\Pr(Y=2)\), for the Telehealth group.


Data Simulation (consistent with the worked example)

We simulate data from a proportional-odds (logistic) model with cutpoints \(\alpha_1=-0.5\), \(\alpha_2=1.5\) for the baseline (In-person), and a mode effect \(\beta=0.3\) added to the cumulative logits for Telehealth. These reproduce the In-person probabilities and the Telehealth shift used in the solution.

set.seed(3852)
library(dplyr)
library(MASS)   # polr()
library(tidyr)

# Cutpoints for baseline (In-person)
alpha1 <- -0.5
alpha2 <-  1.5
beta   <-  0.3   # proportional-odds shift for Telehealth

# Helper: logistic CDF
logistic <- function(x) 1/(1 + exp(-x))

# Function to generate ordinal outcomes under proportional odds
# P(Y<=1 | mode) = logistic(alpha1 + beta*mode)
# P(Y<=2 | mode) = logistic(alpha2 + beta*mode)
# Then derive category probabilities
r_ord <- function(n, mode){
  eta1 <- alpha1 + beta*mode
  eta2 <- alpha2 + beta*mode
  p_le1 <- logistic(eta1)
  p_le2 <- logistic(eta2)
  p1 <- p_le1
  p2 <- p_le2 - p_le1
  p3 <- 1 - p_le2
  # sample categories 1,2,3
  draws <- apply(cbind(p1, p2, p3), 1, function(pp) sample(1:3, size=1, prob=pp))
  factor(draws, levels = 1:3, labels = c("1","2","3"))
}

# Simulate data
n_in   <- 5000
n_tele <- 5000
dat_in <- tibble(mode = factor(rep("In-person", n_in), levels = c("In-person","Telehealth")),
                 mode_num = 0)
dat_te <- tibble(mode = factor(rep("Telehealth", n_tele), levels = c("In-person","Telehealth")),
                 mode_num = 1)

dat <- bind_rows(dat_in, dat_te) %>%
  mutate(Y = r_ord(n = n(), mode = mode_num))

data.table(dat)
##              mode mode_num      Y
##            <fctr>    <num> <fctr>
##     1:  In-person        0      2
##     2:  In-person        0      2
##     3:  In-person        0      2
##     4:  In-person        0      1
##     5:  In-person        0      1
##    ---                           
##  9996: Telehealth        1      3
##  9997: Telehealth        1      1
##  9998: Telehealth        1      2
##  9999: Telehealth        1      2
## 10000: Telehealth        1      1
# Check empirical probabilities to verify target structure (large n -> close)
emp <- dat %>% count(mode, Y) %>% group_by(mode) %>% mutate(p = n/sum(n))
data.table(emp)
##          mode      Y     n      p
##        <fctr> <fctr> <int>  <num>
## 1:  In-person      1  1890 0.3780
## 2:  In-person      2  2202 0.4404
## 3:  In-person      3   908 0.1816
## 4: Telehealth      1  2261 0.4522
## 5: Telehealth      2  2009 0.4018
## 6: Telehealth      3   730 0.1460

Fit the Cumulative Proportional Odds Model

We fit an ordinal logistic regression (proportional odds) with MASS::polr, using mode as the single predictor and category 1 < 2 < 3 order.

Compute \(Pr(Y=2∣Telehealth)\) from the fitted model


Poisson Response

Example 2 — Claim Count (Poisson Model)

Response Variable: Claim Count
Distribution: Poisson
Link Function: Log
Predictors: Rating class (Standard vs Preferred), Miles driven (in thousands)

Parameter df Estimate
Intercept 1 -1.512
Rating class
* Standard 0 0.000
* Preferred 1 -0.301
Miles driven 1 0.020

Problem

Calculate the probability that a Preferred driver who drives 10,000 miles (i.e., 10 units in thousands) submits at least one claim.


set.seed(3852)

m <- 5000
Rating <- sample(c("Standard","Preferred"), size = m, replace = TRUE, prob = c(0.6, 0.4))
Rating <- factor(Rating, levels = c("Standard","Preferred"))  # ref = Standard

# Miles in thousands (nonnegative, skewed)
Miles <- pmax(0, rnorm(m, mean = 9, sd = 3))  # center near 9k–10k

# True coefficients
b0    <- -1.512
b_pref<- -0.301
b_mi  <-  0.020

eta2_true <- b0 + (Rating == "Preferred")*b_pref + b_mi*Miles
mu2_true  <- exp(eta2_true)

Claims <- rpois(m, lambda = mu2_true)

dat2 <- tibble(Claims, Rating, Miles)
data.table(dat2)
##       Claims    Rating     Miles
##        <int>    <fctr>     <num>
##    1:      0  Standard 12.539087
##    2:      0  Standard  5.419065
##    3:      0  Standard  9.173771
##    4:      0 Preferred  6.289731
##    5:      0 Preferred 10.361668
##   ---                           
## 4996:      0  Standard  7.767208
## 4997:      2 Preferred 14.172462
## 4998:      2 Preferred 11.584201
## 4999:      1 Preferred 11.161610
## 5000:      0  Standard  2.603598

Fit Poisson GLM

Compute \(Pr(Y \geq 1)\) for Preferred, 10,000 miles

For Poisson \(Y \sim Pois(\mu)\): \(Pr(Y\geq1)=1−Pr(Y=0)=1−e^ {-\mu}\)

Estimated mean claims \(\hat{\mu}\): 0.182 Estimated Pr(Y): 0.167