Hands-On GLM Tutorial: Building a Poisson and Logistic Model in Python

GLM vs. Linear Regression: When to Use Each Model

Understanding when to use a generalized linear model (GLM) versus ordinary least squares linear regression is essential for building appropriate, reliable statistical models. This article explains the assumptions, structure, strengths, and limitations of each approach, outlines common use cases, and gives practical guidance, with examples and R/Python code, to help you pick the right tool.


What is Linear Regression?

Linear regression (ordinary least squares, OLS) models a continuous response variable y as a linear function of predictor variables x1, x2, …, xp plus an error term:

y = β0 + β1 x1 + … + βp xp + ε

Key assumptions:

  • Linearity: expected value of y is a linear combination of predictors.
  • Gaussian errors: ε ~ N(0, σ²).
  • Homoscedasticity: constant variance of errors.
  • Independence: errors are independent.
  • No perfect multicollinearity among predictors.

When these hold, OLS provides unbiased, efficient parameter estimates and straightforward inference (t-tests, F-tests, R²).
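As a quick numeric sketch of OLS, the estimate can be computed directly with NumPy's least-squares solver. The data here are simulated, with hypothetical true coefficients β0 = 2 and β1 = 3:

```python
import numpy as np

# Simulated data (hypothetical): y = 2 + 3*x + Gaussian noise
rng = np.random.default_rng(0)
x = rng.normal(size=200)
y = 2.0 + 3.0 * x + rng.normal(scale=0.5, size=200)

# Design matrix with an intercept column
X = np.column_stack([np.ones_like(x), x])

# OLS estimate: beta_hat = argmin ||y - X b||^2
beta_hat, *_ = np.linalg.lstsq(X, y, rcond=None)
print(beta_hat)  # close to [2, 3]
```

With the assumptions above satisfied by construction, the recovered coefficients land near the true values.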


What is a Generalized Linear Model (GLM)?

A GLM generalizes linear regression to accommodate response variables with error distributions from the exponential family (Normal, Binomial, Poisson, Gamma, etc.). A GLM has three components:

  1. Random component: the distribution of the response (e.g., Binomial for binary outcomes, Poisson for counts).
  2. Systematic component: linear predictor η = β0 + β1 x1 + … + βp xp.
  3. Link function g(·): relates the expected value μ = E[y] to the linear predictor: g(μ) = η.
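The three components can be traced numerically for a logistic GLM. This is a minimal sketch with assumed (not fitted) coefficients:

```python
import numpy as np

# Assumed coefficients for illustration: beta0 = -1, beta1 = 2
beta = np.array([-1.0, 2.0])
x = np.array([0.0, 0.5, 1.0])

# Systematic component: linear predictor eta
eta = beta[0] + beta[1] * x

# Inverse of the logit link maps eta back to the mean: mu = g^{-1}(eta)
mu = 1.0 / (1.0 + np.exp(-eta))
print(mu)  # probabilities in (0, 1)
```

Note that the link acts on the mean, not on the response itself: the random component (here Binomial) still supplies the distribution around μ.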

Common link–distribution pairs:

  • Gaussian with identity link → OLS.
  • Binomial with logit link → logistic regression.
  • Poisson with log link → count models.
  • Gamma with inverse or log link → positive continuous skewed responses.

GLMs relax the normality and homoscedasticity assumptions and allow modeling of non-negative, integer, or bounded responses.


Main Differences (Concise)

  • Response type: OLS expects continuous, unbounded, normally distributed errors. GLM handles many response types (binary, counts, positive skewed).
  • Link function: OLS uses identity link. GLM can use non-identity links (logit, log, inverse), enabling non-linear relationships on the original scale.
  • Error distribution: OLS assumes Gaussian errors. GLM allows exponential-family distributions.
  • Variance structure: OLS assumes constant variance. GLM variance can be a function of the mean (e.g., Var(Y)=μ for Poisson).
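The variance-structure difference is easy to see in simulation: Poisson draws have variance roughly equal to their mean, so the spread grows with the mean rather than staying constant (simulated data, hypothetical means):

```python
import numpy as np

rng = np.random.default_rng(1)
for mean in [2.0, 10.0, 50.0]:
    sample = rng.poisson(lam=mean, size=100_000)
    # For Poisson data, Var(Y) tracks the mean: both columns agree
    print(mean, round(sample.mean(), 2), round(sample.var(), 2))
```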

When to Use Linear Regression (OLS)

Use OLS when:

  • The response is continuous and approximately normally distributed.
  • Residuals show roughly constant variance and independence.
  • Relationship between predictors and outcome is approximately linear (on original scale).
  • Interpretability of coefficients on the original scale is desired.

Examples:

  • Predicting height from age and nutrition.
  • Modeling household electricity consumption (after verifying assumptions).
  • Estimating test scores from study hours and demographics.

Practical checks: histogram/Q–Q plot of residuals, residuals vs fitted values, Breusch–Pagan test for heteroscedasticity, variance inflation factor (VIF) for multicollinearity.


When to Use a GLM

Use a GLM when:

  • The response is binary, counts, proportions, or positive-skewed continuous.
  • Variance changes with the mean (heteroscedasticity linked to mean).
  • You need a link function to map the mean to the linear predictor (e.g., log for multiplicative effects).

Common cases:

  • Binary outcome: logistic regression (Binomial + logit).
  • Count data: Poisson regression (Poisson + log) — use negative binomial if overdispersion.
  • Proportion/ratio data: Binomial with logit or probit; Beta regression for continuous proportions (not standard GLM).
  • Skewed positive data: Gamma with log link.

Examples:

  • Predicting disease presence (yes/no) from biomarkers → logistic.
  • Modeling number of insurance claims → Poisson or negative binomial.
  • Time-to-event rates per exposure (events per person-year) → Poisson with offset.

Practical Model Choice Flow

  1. Identify response type (continuous, binary, count, proportion, positive-skewed).
  2. Inspect distribution and variance patterns.
  3. Start with an appropriate GLM family and link (e.g., binomial/logit for binary).
  4. Check for overdispersion (compare residual deviance to degrees of freedom).
    • If overdispersed in counts, consider negative binomial or quasi-Poisson.
  5. Validate model: residual plots, goodness-of-fit, predictive performance (AIC, cross-validation).
  6. If assumptions fail, consider transformations, generalized additive models (GAMs), mixed models, or nonparametric methods.

Examples

R — Linear regression (OLS)

lm_fit <- lm(y ~ x1 + x2, data = df)
summary(lm_fit)
plot(lm_fit)  # diagnostic plots

R — Logistic regression (GLM)

glm_logit <- glm(y_binary ~ x1 + x2,
                 family = binomial(link = "logit"),
                 data = df)
summary(glm_logit)

Python — OLS and GLM (statsmodels)

import statsmodels.api as sm

# OLS
X = sm.add_constant(df[['x1', 'x2']])
ols = sm.OLS(df['y'], X).fit()
print(ols.summary())

# Logistic regression (GLM); logit is the Binomial family's default link
glm_logit = sm.GLM(df['y_binary'], X, family=sm.families.Binomial()).fit()
print(glm_logit.summary())

Interpreting Coefficients

  • OLS: β represents expected change in y for one-unit change in x (holding others constant).
  • GLM with logit link: β is log-odds change; exponentiate to get odds ratios.
  • GLM with log link: β is log change in expected response; exponentiate to get multiplicative effects (rate ratios).
  • For non-identity links, interpret effects on the scale of the link or transform back to original scale for intuitive interpretation.
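For example, exponentiating hypothetical link-scale coefficients recovers the multiplicative interpretation:

```python
import numpy as np

# Assumed fitted coefficients, for illustration only
beta_logit = 0.7  # log-odds change per unit of x in a logistic model
beta_log = 0.2    # log change in the expected count per unit of x

odds_ratio = np.exp(beta_logit)  # odds multiply by ~2.01 per unit of x
rate_ratio = np.exp(beta_log)    # expected count multiplies by ~1.22
print(odds_ratio, rate_ratio)
```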

Diagnostics & Common Pitfalls

  • Mis-specifying the family/link leads to biased/inconsistent estimates.
  • Overdispersion: common in count data; check deviance/df; use negative binomial or quasi-likelihood.
  • Zero-inflation: many zeros may need zero-inflated or hurdle models.
  • Nonlinearity and interactions: consider polynomial terms, splines, or GAMs.
  • Correlated data: use generalized estimating equations (GEE) or mixed-effects GLMs for clustered/repeated measures.

Quick Decision Table

Scenario                               Likely model
Continuous, approx. normal errors      Linear regression (OLS)
Binary outcome (0/1)                   GLM Binomial (logit/probit link)
Count data (nonnegative integers)      GLM Poisson (or negative binomial)
Proportion from counts                 GLM Binomial (with weights/denominator)
Positive skewed continuous             GLM Gamma (log link)

Summary

  • Use OLS when residuals are roughly normal with constant variance and the response is continuous.
  • Use GLM when the response distribution is non-normal (binary, counts, skewed), or when variance depends on the mean; choose family and link that match the data-generating process.
  • Always validate model assumptions, check diagnostics (overdispersion, residuals), and consider alternatives (transformations, GAMs, mixed models) when assumptions fail.
