# Load libraries
library(dplyr)
library(data.table)
# --------------------------
# Step 1: Create example dataset
# --------------------------
set.seed(123)
n <- 200
students <- data.frame(
familiarR = rbinom(n, 1, 0.4), # 0 = No, 1 = Yes
hoursStudy = rpois(n, lambda = 120), # study hours
STAMscore = sample(5:9, n, replace = TRUE) # STAM scores
)
# True model (simulate outcome with known betas)
eta <- -1.3 + 1*students$familiarR +
0.01*students$hoursStudy +
ifelse(students$STAMscore <= 5, -2.5,
ifelse(students$STAMscore == 6, 0,
ifelse(students$STAMscore == 7, 1.2, 1.6)))
prob <- exp(eta) / (1 + exp(eta))
students$passSRM <- rbinom(n, 1, prob)
data.table(students)
familiarR hoursStudy STAMscore passSRM
<int> <int> <int> <int>
1: 0 112 6 0
2: 1 120 7 1
3: 0 132 7 0
4: 1 113 5 0
5: 1 119 6 1
---
196: 0 110 9 1
197: 0 126 7 0
198: 1 130 6 1
199: 0 138 5 0
200: 0 120 8 1
Calculate the of passing Exam SRM for someone who:
Systematic component (eta) is : 0.9516063
Odds of passing: 2.5898663
Probability of passing: 0.7214381
An insurance company wants to estimate the probability of a claim based on driver characteristics and vehicle type. The response variable Claim is binary (1 = claim, 0 = no claim). Predictors include:
Vehicle Body: Coupe, Roadster, Sedan, Station Wagon, Truck, Utility
Driver Gender: Male, Female
Area: A, B, C, D
A probit regression model is fitted to the data:
\[Claim ∼ VehicleBody + Gender + Area\]
Tasks:
Fit a GLM with a probit link using the provided dataset.
Using the fitted model, calculate the estimated probability of a claim for the following scenarios:
Compare the results across the three cases. Which combination of characteristics results in the highest predicted probability of a claim? Which results in the lowest?
Briefly interpret what these results mean in the context of insurance risk.
## -------------------------------
## Example: Probit GLM on Insurance Claims
## ----------------------------- --
## Load packages
library(dplyr)
set.seed(123) # for reproducibility
## Simulate a dataset
n <- 500
VehicleBody <- sample(c("Coupe", "Roadster", "Sedan",
"StationWagon", "Truck", "Utility"),
size = n, replace = TRUE)
Gender <- sample(c("Male", "Female"),
size = n, replace = TRUE)
Area <- sample(c("A", "B", "C", "D"),
size = n, replace = TRUE)
## True coefficients (like in your summary table)
beta0 <- -1.5
beta_vehicle <- c(Coupe = -0.9, Roadster = -1.0, Sedan = -1.2,
StationWagon = -1.1, Truck = -1.1, Utility = -1.3)
beta_gender <- c(Male = -0.03, Female = 0)
beta_area <- c(A = 0, B = 0.1, C = 0.04, D = -0.1)
## Linear predictor
eta <- beta0 +
beta_vehicle[VehicleBody] +
beta_gender[Gender] +
beta_area[Area]
## Probit link: probability = Phi(eta)
p_claim <- pnorm(eta)
## Generate binary response
Claim <- rbinom(n, size = 1, prob = p_claim)
## Put into a dataframe
ins_data <- data.frame(Claim, VehicleBody, Gender, Area)
data.table(ins_data)
Claim VehicleBody Gender Area
<int> <char> <char> <char>
1: 0 Sedan Male B
2: 0 Utility Male D
3: 0 Sedan Male D
4: 0 Roadster Female D
5: 0 Roadster Female B
---
496: 0 Coupe Male D
497: 0 Truck Male A
498: 0 Utility Female A
499: 0 StationWagon Female C
500: 0 StationWagon Female D
An analyst wants to model which type of car a person buys. The response has three categories:
Explanatory variables:
A nominal (multinomial) logistic regression is fit with Sedan as the base outcome. That is, the model estimates separate log-odds equations for Van vs Sedan and SUV vs Sedan, each using the same predictors (Gender, AgeGroup).
library(dplyr)
library(nnet)
set.seed(2025)
# Sample size
n <- 1200
# Simulate predictors
Gender <- sample(c("Male", "Female"), size = n, replace = TRUE, prob = c(0.5, 0.5))
AgeGroup <- sample(c("Under25", "25to54", "55plus"),
size = n, replace = TRUE, prob = c(0.25, 0.55, 0.20))
# Set reference levels
dat <- tibble(Gender, AgeGroup) %>%
mutate(
Gender = factor(Gender, levels = c("Male","Female")),
AgeGroup = factor(AgeGroup, levels = c("25to54","Under25","55plus"))
)
# True coefficients (to generate outcomes)
beta <- list(
Van = c("(Intercept)" = 0.10,
"GenderFemale" = -0.18,
"AgeGroupUnder25" = -0.11,
"AgeGroup55plus" = 0.06),
SUV = c("(Intercept)" = -0.02,
"GenderFemale" = -0.06,
"AgeGroupUnder25" = 0.18,
"AgeGroup55plus" = 0.04)
)
# Indicator variables
X_GenderFemale <- as.numeric(dat$Gender == "Female")
X_Under25 <- as.numeric(dat$AgeGroup == "Under25")
X_55plus <- as.numeric(dat$AgeGroup == "55plus")
# Linear predictors
eta_van <- beta$Van["(Intercept)"] +
beta$Van["GenderFemale"] * X_GenderFemale +
beta$Van["AgeGroupUnder25"] * X_Under25 +
beta$Van["AgeGroup55plus"] * X_55plus
eta_suv <- beta$SUV["(Intercept)"] +
beta$SUV["GenderFemale"] * X_GenderFemale +
beta$SUV["AgeGroupUnder25"] * X_Under25 +
beta$SUV["AgeGroup55plus"] * X_55plus
# Probabilities (softmax)
exp_van <- exp(eta_van)
exp_suv <- exp(eta_suv)
denom <- 1 + exp_van + exp_suv
p_sedan <- 1 / denom
p_van <- exp_van / denom
p_suv <- exp_suv / denom
# Generate outcome
y <- mapply(function(ps, pv, pu) sample(c("Sedan","Van","SUV"), 1,
prob = c(ps, pv, pu)),
ps = p_sedan, pv = p_van, pu = p_suv)
dat <- dat %>% mutate(CarType = factor(y, levels = c("Sedan","Van","SUV")))
data.table(dat)
## Gender AgeGroup CarType
## <fctr> <fctr> <fctr>
## 1: Male 25to54 Sedan
## 2: Female Under25 SUV
## 3: Male 25to54 Sedan
## 4: Female 55plus Sedan
## 5: Male 55plus Sedan
## ---
## 1196: Female Under25 Sedan
## 1197: Male Under25 Sedan
## 1198: Male 25to54 Sedan
## 1199: Female Under25 Sedan
## 1200: Female 25to54 Sedan
Log-odds coefficient (Female, Van vs Sedan): -0.281
Odds ratio: 0.755
Interpretation:
A university surveys course satisfaction using an ordinal 3-level scale:
| Category | Description |
|---|---|
| 1 | Dissatisfied |
| 2 | Neutral |
| 3 | Satisfied |
A cumulative proportional odds model is used with a
single explanatory variable mode (In-person
vs. Telehealth). All other covariates are identical for the two
groups.
Given target probabilities:
| Mode | P(Y=1) | P(Y=2) | P(Y=3) |
|---|---|---|---|
| In-person | 0.377541 | 0.440033 | 0.182426 |
| Telehealth | 0.450166 | ? | ? |
Goal: Determine the probability of Neutral, \(\Pr(Y=2)\), for the Telehealth group.
We simulate data from a proportional-odds (logistic) model with cutpoints \(\alpha_1=-0.5\), \(\alpha_2=1.5\) for the baseline (In-person), and a mode effect \(\beta=0.3\) added to the cumulative logits for Telehealth. These reproduce the In-person probabilities and the Telehealth shift used in the solution.
set.seed(3852)
library(dplyr)
library(MASS) # polr()
library(tidyr)
# Cutpoints for baseline (In-person)
alpha1 <- -0.5
alpha2 <- 1.5
beta <- 0.3 # proportional-odds shift for Telehealth
# Helper: logistic CDF
logistic <- function(x) 1/(1 + exp(-x))
# Function to generate ordinal outcomes under proportional odds
# P(Y<=1 | mode) = logistic(alpha1 + beta*mode)
# P(Y<=2 | mode) = logistic(alpha2 + beta*mode)
# Then derive category probabilities
r_ord <- function(n, mode){
eta1 <- alpha1 + beta*mode
eta2 <- alpha2 + beta*mode
p_le1 <- logistic(eta1)
p_le2 <- logistic(eta2)
p1 <- p_le1
p2 <- p_le2 - p_le1
p3 <- 1 - p_le2
# sample categories 1,2,3
draws <- apply(cbind(p1, p2, p3), 1, function(pp) sample(1:3, size=1, prob=pp))
factor(draws, levels = 1:3, labels = c("1","2","3"))
}
# Simulate data
n_in <- 5000
n_tele <- 5000
dat_in <- tibble(mode = factor(rep("In-person", n_in), levels = c("In-person","Telehealth")),
mode_num = 0)
dat_te <- tibble(mode = factor(rep("Telehealth", n_tele), levels = c("In-person","Telehealth")),
mode_num = 1)
dat <- bind_rows(dat_in, dat_te) %>%
mutate(Y = r_ord(n = n(), mode = mode_num))
data.table(dat)
## mode mode_num Y
## <fctr> <num> <fctr>
## 1: In-person 0 2
## 2: In-person 0 2
## 3: In-person 0 2
## 4: In-person 0 1
## 5: In-person 0 1
## ---
## 9996: Telehealth 1 3
## 9997: Telehealth 1 1
## 9998: Telehealth 1 2
## 9999: Telehealth 1 2
## 10000: Telehealth 1 1
# Check empirical probabilities to verify target structure (large n -> close)
emp <- dat %>% count(mode, Y) %>% group_by(mode) %>% mutate(p = n/sum(n))
data.table(emp)
## mode Y n p
## <fctr> <fctr> <int> <num>
## 1: In-person 1 1890 0.3780
## 2: In-person 2 2202 0.4404
## 3: In-person 3 908 0.1816
## 4: Telehealth 1 2261 0.4522
## 5: Telehealth 2 2009 0.4018
## 6: Telehealth 3 730 0.1460
We fit an ordinal logistic regression (proportional odds) with MASS::polr, using mode as the single predictor and category 1 < 2 < 3 order.
Response: Number of cars
Predictors: \(\ln(\text{Income})\) (continuous),
Family size (categorical: 1–2, 3–4, 5+)
Target coefficients (for data generation):
\(\beta_0 = 0.186\), \(\beta_{\ln(\text{Income})}=0.009\),
\(\beta_{\text{Fam:1–2}}=0\)
(reference), \(\beta_{\text{Fam:3–4}}=0.137\), \(\beta_{\text{Fam:5+}}=0.355\).
Task: For \(\text{Income}=150{,}000\) and Family size = 4 (i.e., 3–4 category), compute the variance of \(Y\) under the Poisson GLM.
set.seed(4830)
library(dplyr)
# 1) Simulate predictors
n <- 4000
# Income: log-normal-ish spread centered around ~100k–150k
log_income <- rnorm(n, mean = log(120000), sd = 0.5)
Income <- exp(log_income)
FamilySize <- sample(c("1-2","3-4","5+"), size = n, replace = TRUE, prob = c(0.45, 0.4, 0.15))
FamilySize <- factor(FamilySize, levels = c("1-2","3-4","5+")) # set reference = 1-2
# 2) True coefficients (match your worked example structure)
b0 <- 0.186
b_ln <- 0.009
b_fs <- c("1-2" = 0.000, "3-4" = 0.137, "5+" = 0.355)
eta_true <- b0 + b_ln*log(Income) + b_fs[as.character(FamilySize)]
mu_true <- exp(eta_true)
# 3) Generate Poisson outcomes
Cars <- rpois(n, lambda = mu_true)
dat1 <- tibble(Cars, Income, logIncome = log(Income), FamilySize)
data.table(dat1)
## Cars Income logIncome FamilySize
## <int> <num> <num> <fctr>
## 1: 1 71200.36 11.17325 3-4
## 2: 3 75520.50 11.23216 3-4
## 3: 3 115753.95 11.65922 1-2
## 4: 3 126596.82 11.74876 1-2
## 5: 1 92295.18 11.43275 1-2
## ---
## 3996: 2 61437.72 11.02578 1-2
## 3997: 0 77520.67 11.25830 3-4
## 3998: 1 47426.84 10.76694 3-4
## 3999: 3 138898.78 11.84150 5+
## 4000: 0 229913.72 12.34546 5+
Compute the Required Variance from the Fitted Model
We want Income=150,000 and Family size = 4: category 3–4 in our coding. For a Poisson GLM, \(Var(Y)=\mu\).
Estimated mean \(\hat{\mu}\): 1.598 Estimated variance \(Var(Y)\): 1.598 (Poisson: Var = Mean)
Response Variable: Claim Count
Distribution: Poisson
Link Function: Log
Predictors: Rating class (Standard vs Preferred), Miles
driven (in thousands)
| Parameter | df | Estimate |
|---|---|---|
| Intercept | 1 | -1.512 |
| Rating class | ||
| * Standard | 0 | 0.000 |
| * Preferred | 1 | -0.301 |
| Miles driven | 1 | 0.020 |
Calculate the probability that a Preferred driver who drives 10,000 miles (i.e., 10 units in thousands) submits at least one claim.
set.seed(3852)
m <- 5000
Rating <- sample(c("Standard","Preferred"), size = m, replace = TRUE, prob = c(0.6, 0.4))
Rating <- factor(Rating, levels = c("Standard","Preferred")) # ref = Standard
# Miles in thousands (nonnegative, skewed)
Miles <- pmax(0, rnorm(m, mean = 9, sd = 3)) # center near 9k–10k
# True coefficients
b0 <- -1.512
b_pref<- -0.301
b_mi <- 0.020
eta2_true <- b0 + (Rating == "Preferred")*b_pref + b_mi*Miles
mu2_true <- exp(eta2_true)
Claims <- rpois(m, lambda = mu2_true)
dat2 <- tibble(Claims, Rating, Miles)
data.table(dat2)
## Claims Rating Miles
## <int> <fctr> <num>
## 1: 0 Standard 12.539087
## 2: 0 Standard 5.419065
## 3: 0 Standard 9.173771
## 4: 0 Preferred 6.289731
## 5: 0 Preferred 10.361668
## ---
## 4996: 0 Standard 7.767208
## 4997: 2 Preferred 14.172462
## 4998: 2 Preferred 11.584201
## 4999: 1 Preferred 11.161610
## 5000: 0 Standard 2.603598
Compute \(Pr(Y \geq 1)\) for Preferred, 10,000 miles
For Poisson \(Y \sim Pois(\mu)\): \(Pr(Y\geq1)=1−Pr(Y=0)=1−e^ {-\mu}\)
Estimated mean claims \(\hat{\mu}\): 0.182 Estimated Pr(Y): 0.167