library(tidyverse)
if(params$eval.stan) {
  library(rstan)
  rstan_options(auto_write = T)
}
options(htmltools.dir.version = F)
knitr::opts_chunk$set(message = F, warning = F, error = F, echo = T)
blues = c("#00A8E1", "#009CDE", "#0077C8", "#003865", "#BCDDF4")
yellows = c("#FFCD00", "#EAAA00", "#CC8A00", "#833921", "#FFEFBD")
theme_set(theme_bw() +
            theme(axis.text = element_text(size = 10),
                  axis.title.x = element_text(size = 12,
                                              margin = margin(t = 10)),
                  axis.title.y = element_text(size = 12,
                                              margin = margin(r = 10)),
                  strip.text = element_text(size = 10)))

Non-standard dependent variables

Typical introductions to regression introduce two types of models, each with very specific expectations of its dependent variables. Linear regression assumes that the dependent variable can take on any real value (in principle, depending on the observed values of the predictors). Logistic regression is appropriate for binary dependent variables (yes/no, 1/0).

But in the real world, we often want to build regression models for dependent variables that don’t fit into either of these two categories. In the higher ed space, we commonly encounter dependent variables that are constrained in important ways:

In what follows, we will review regression techniques for each of these types of dependent variables. Sample code for each model shows a fit with an R package, plus Stan code for a fully customizable Bayesian implementation (using the rstan package to interface with R). All code assumes you have loaded the tidyverse; other packages are introduced as needed.

To reproduce this document yourself, download nonstandard_regression.Rmd, fake_dataset.csv, and all the .png files in the images folder from the GitHub page. By default, this markdown document does not run the Stan models; to do so (if you’re compiling the document yourself), use the Knit with Parameters... option and check eval.stan. (You’ll also need to download all the .stan files.)

Working dataset

exams.df = read.csv("fake_dataset.csv", header = T, stringsAsFactors = F) %>%
  mutate(days = fct_relevel(days, "Online", "MWF", "TR")) %>%
  rownames_to_column("student.id")

Variables

Our working dataset consists of final exam scores for students enrolled in Underwater Basket-Weaving 101 at State U. It includes the following predictor variables:

  • Phone battery charge before exam (0 - 100)
  • Number of Twitter followers
  • Does the student prefer Luke Skywalker or Han Solo? (binary)
  • Is the student left-handed? (binary)
  • Tests 1, 2, and 3 scores (0 - 100)

Final exam scores are bounded between 0 and 100.

exams.df %>%
  ggplot(aes(x = exam.raw)) +
  geom_histogram() +
  labs(x = "Final exam score", y = "Number of observations",
       title = "Distribution of final exam scores")

(We see a large number of scores at precisely zero. These are probably cases where the student didn’t take the exam at all. The models below could be improved by doing two-stage regression: first predict whether the student scored above 0; then, for those students, predict the actual score. But I haven’t done this here, to keep the focus on bounded regression itself.)

Some initial cleaning

Before fitting any of the models below, let’s center and standardize the continuous predictors. Number of Twitter followers is (unsurprisingly) skewed right, so we’ll take the log.

exams.df = exams.df %>%
  mutate(battery = (battery.raw - mean(battery.raw)) / sd(battery.raw),
         followers = (log(followers.raw + 100) -
                        mean(log(followers.raw + 100))) /
                     sd(log(followers.raw + 100)),
         test.1 = (test.1.raw - mean(test.1.raw)) / sd(test.1.raw),
         test.2 = (test.2.raw - mean(test.2.raw)) / sd(test.2.raw),
         test.3 = (test.3.raw - mean(test.3.raw)) / sd(test.3.raw))

Preparing the data for Stan

To fit models with Stan (as opposed to R), we need the data in a format that Stan can read. All the Stan models below for final exam score accept a dataset in this form:

score.stan.data = list(
  N = nrow(exams.df),
  I = 7,
  X = exams.df %>%
    dplyr::select(battery, followers, skywalker, left, test.1, test.2, test.3),
  y = exams.df$exam.raw
)

Classical linear regression

A standard linear regression models \(y_i\) as a linear function of the predictor variable(s) \(x_i\), plus a normally distributed error term \(\epsilon_i\).

\[y_i = \alpha + \beta x_i + \epsilon_i\]

\[\epsilon_i \sim N(0, \sigma)\]

Equivalently, we can write \(y_i\) as a normally distributed random variable whose mean is a linear function of \(x_i\):

\[y_i \sim N(\alpha + \beta x_i, \sigma)\]

One implication of this model is that, in principle, there are no limits on possible values of \(y\): given appropriate values of \(x\), \(y\) can take on any real value. This property of the model may be undesirable if we happen to know that \(y\) cannot take on certain values (for example, if we know that \(y\) cannot be negative).

In practice, of course, the fact that the model is capable of predicting impossible values may not cause problems. If we know that extreme values of \(x\) are impossible too, then the issue simply won’t come up. (And even if extreme values of \(x\) are possible, extrapolating too far beyond the attested observations is a bad idea; we shouldn’t be doing that anyway.)

But if we have attested observations very close to the hard limits on \(y\), then we very well may find ourselves in a situation where the classical model leads us to predict impossible values of \(y\) for reasonable values of \(x\). Exam scores are a case in point: there are many observations at 0, and a sizeable number close to 100; we want to be able to predict attested values without predicting unattested ones.

Similarly, the classical model does not translate well to cases where \(y\) consists of discrete ordered responses. We can map each value of \(y\) to a number and fit the model to those numbers (A = 4, B = 3, etc.), but we face the same problem that it’s easy to predict “grades” that are impossibly small or large. In addition, discrete ordered responses may not be equally spaced: in a given dataset, the “distance” between A and B may not be the same as the “distance” between B and C. To the extent that this is true, the classical model is not appropriate.

Bounded dependent variables

There are multiple techniques for building regression models for bounded dependent variables. We will review four of them: classical linear regression, censored linear regression, logit-normal regression, and beta regression. What all of the alternative models have in common is that they predict an unrestricted latent variable \(z\) - representing the student’s true knowledge and/or ability - as a function of \(x\), and then map \(z\) to the observed \(y\) values.

Classical linear regression

As shown in the graph of fitted values, a classical linear regression fit to the final exam dataset does indeed predict exam scores of less than 0. If we nevertheless want to use this model, one way to bring its predictions into alignment with the limits on \(y\) is to censor predictions that are over 100 or under 0; the predictions shown below have been manipulated in this way.

Fit the model

# Fit a model in R.
normal.fit = lm(exam.raw ~ battery + followers + skywalker + left + test.1 +
                  test.2 + test.3,
                data = exams.df)
# Fit a model with Stan.
library(rstan)
normal.stan.fit = stan(file = "normal_model.stan",
                       data = score.stan.data,
                       chains = 4, iter = 2000, cores = 4)

Summarize fitted parameters

# Parameters from the R model.
round(coef(summary(normal.fit)), 4)
##             Estimate Std. Error  t value Pr(>|t|)
## (Intercept)  65.4229     0.3073 212.9215   0.0000
## battery       0.8397     0.2066   4.0647   0.0000
## followers     0.0607     0.2013   0.3015   0.7631
## skywalker    -1.2448     0.4034  -3.0856   0.0020
## left          0.1865     0.8576   0.2175   0.8278
## test.1        2.9729     0.2859  10.3995   0.0000
## test.2        7.0235     0.3400  20.6556   0.0000
## test.3       17.2223     0.3217  53.5378   0.0000
# Parameters from the Stan model.
library(tidybayes)
params.normal.df = normal.stan.fit %>%
  gather_draws(beta[i]) %>%
  ungroup() %>%
  mutate(parameter = fct_reorder(case_when(i == 1 ~ "intercept",
                                           i == 2 ~ "battery",
                                           i == 3 ~ "followers",
                                           i == 4 ~ "skywalker",
                                           i == 5 ~ "left",
                                           i == 6 ~ "test.1",
                                           i == 7 ~ "test.2",
                                           i == 8 ~ "test.3"), i)) %>%
  group_by(parameter) %>%
  summarize(Estimate = median(.value),
            `2.5%` = quantile(.value, 0.025),
            `25%` = quantile(.value, 0.25),
            `75%` = quantile(.value, 0.75),
            `97.5%` = quantile(.value, 0.975)) %>%
  ungroup()
params.normal.df
## # A tibble: 8 × 6
##   parameter Estimate `2.5%`   `25%`  `75%` `97.5%`
##   <fct>        <dbl>  <dbl>   <dbl>  <dbl>   <dbl>
## 1 intercept  65.3    64.8   65.1    65.5    65.9  
## 2 battery     0.962   0.579  0.827   1.09    1.36 
## 3 followers   0.0471 -0.343 -0.0857  0.179   0.428
## 4 skywalker  -1.09   -1.82  -1.34   -0.838  -0.385
## 5 left        0.0417 -1.18  -0.379   0.492   1.35 
## 6 test.1      3.38    2.84   3.19    3.56    3.89 
## 7 test.2      7.32    6.74   7.11    7.51    7.92 
## 8 test.3     16.1    15.5   15.9    16.3    16.7

Get predictions and residuals

# Predictions from the R model.
exams.df = exams.df %>%
  mutate(pred.normal = normal.fit$fitted.values,
         pred.normal.out.of.bounds = pred.normal > 100 | pred.normal < 0,
         pred.normal = case_when(pred.normal > 100 ~ 100,
                                 pred.normal < 0 ~ 0,
                                 T ~ pred.normal),
         resid.normal = exam.raw - pred.normal)
# Predictions from the Stan model.
exams.df = exams.df %>%
  mutate(pred.normal.stan = normal.stan.fit %>%
           gather_draws(y_pred[i]) %>%
           group_by(i) %>%
           summarize(pred = median(.value)) %>%
           arrange(i) %>%
           pull(pred),
         pred.normal.stan.out.of.bounds = pred.normal.stan > 100 |
                                          pred.normal.stan < 0,
         pred.normal.stan = case_when(pred.normal.stan > 100 ~ 100,
                                      pred.normal.stan < 0 ~ 0,
                                      T ~ pred.normal.stan),
         resid.normal.stan = exam.raw - pred.normal.stan)

Plot fitted values

# Predictions from the R model.
exams.df %>%
  mutate(alert = pred.normal > 100 | pred.normal < 0) %>%
  ggplot(aes(x = pred.normal, y = exam.raw,
             color = pred.normal.out.of.bounds,
             alpha = pred.normal.out.of.bounds)) +
  geom_point() +
  geom_abline(slope = 1, intercept = 0, color = blues[1]) +
  geom_hline(yintercept = 0) +
  geom_vline(xintercept = 0) +
  scale_x_continuous(limits = c(0, 100)) +
  scale_y_continuous(limits = c(-10, 100)) +
  scale_color_manual(labels = c("No", "Yes"), values = c("black", "red")) +
  scale_alpha_manual(labels = c("No", "Yes"), values = c(0.2, 1)) +
  theme(panel.border = element_blank(),
        axis.ticks = element_blank()) +
  labs(x = "Predicted final exam score", y = "Observed final exam score",
       color = "Prediction outside 0-100", alpha = "Prediction outside 0-100",
       title = "Normal regression", subtitle = "Fitted vs. observed values")

# Predictions from the Stan model.
exams.df %>%
  mutate(alert = pred.normal.stan > 100 | pred.normal.stan < 0) %>%
  ggplot(aes(x = pred.normal.stan, y = exam.raw,
             color = pred.normal.stan.out.of.bounds,
             alpha = pred.normal.stan.out.of.bounds)) +
  geom_point() +
  geom_abline(slope = 1, intercept = 0, color = blues[1]) +
  geom_hline(yintercept = 0) +
  geom_vline(xintercept = 0) +
  scale_x_continuous(limits = c(0, 100)) +
  scale_y_continuous(limits = c(-10, 100)) +
  scale_color_manual(labels = c("No", "Yes"), values = c("black", "red")) +
  scale_alpha_manual(labels = c("No", "Yes"), values = c(0.2, 1)) +
  theme(panel.border = element_blank(),
        axis.ticks = element_blank()) +
  labs(x = "Predicted final exam score", y = "Observed final exam score",
       color = "Prediction outside 0-100", alpha = "Prediction outside 0-100",
       title = "Normal regression", subtitle = "Fitted vs. observed values")

Stan code

data {
  int<lower=0> N; // number of observations
  int<lower=0> I; // number of predictors
  matrix[N,I] X; // matrix of predictors
  vector<lower=0,upper=100>[N] y; // final exam scores
}

transformed data {
  // add an intercept column to the matrix of predictors
  matrix[N,I+1] X_intercept = append_col(rep_vector(1, N), X);
}

parameters {
  vector[I+1] beta; // intercept and predictor coefficients
  real<lower=0> sigma; // scale parameter
}

model {
  // priors
  beta[1] ~ normal(50, 15);
  segment(beta, 2, I) ~ std_normal();
  sigma ~ exponential(0.5);
  // model
  y ~ normal(X_intercept * beta, sigma);
}

generated quantities {
  real y_pred[N] = normal_rng(X_intercept * beta, sigma);
}

Censored regression

An alternative model is censored regression. Here, we predict a latent variable \(z\) that represents the student’s true knowledge/ability; \(z\) is normally distributed with a mean that is a linear function of \(x\).

\[z_i: \mbox{student's true knowledge/ability}\]

\[z_i \sim N(\alpha + \beta x_i, \sigma)\]

The observed values \(y\) are truncated from \(z\): they are 0 or 100 if \(z\) falls below or above those bounds, and identical to \(z\) otherwise. Predictions of the censored model are predicted values of \(z\); we can censor these fitted values to fall between 0 and 100, just as we did for the normal regression model. (Fitted values in the R model using the censReg package are values of \(z\); the code below censors them after extracting them from the model. The Stan code censors fitted values as part of the model.)

\[y_i = \left\{\begin{array}{ll} 100 & z_i \geq 100 \\ 0 & z_i \leq 0 \\ z_i & 0 \lt z_i \lt 100 \end{array}\right.\]

Fit the model

# Fit a model in R.
library(censReg)
censored.fit = censReg(exam.raw ~ battery + followers + skywalker + left +
                         test.1 + test.2 + test.3,
                       left = 0, right = 100, data = exams.df)
# Fit a model with Stan.
library(rstan)
censored.stan.fit = stan(file = "censored_model.stan",
                         data = score.stan.data,
                         pars = c("beta", "sigma", "y_pred"),
                         chains = 4, iter = 2000, cores = 4)

Summarize fitted parameters

# Parameters from the R model.
round(coef(summary(censored.fit)), 4)
##             Estimate Std. error  t value Pr(> t)
## (Intercept)  63.5503     0.3496 181.7722  0.0000
## battery       0.6268     0.2351   2.6658  0.0077
## followers     0.3417     0.2271   1.5045  0.1325
## skywalker    -0.7001     0.4530  -1.5455  0.1222
## left         -0.1587     0.9786  -0.1622  0.8711
## test.1        3.7594     0.3484  10.7891  0.0000
## test.2        8.3002     0.3895  21.3106  0.0000
## test.3       19.5282     0.3709  52.6557  0.0000
## logSigma      2.6640     0.0112 238.7725  0.0000
# Parameters from the Stan model.
library(tidybayes)
params.censored.df = censored.stan.fit %>%
  gather_draws(beta[i]) %>%
  ungroup() %>%
  mutate(parameter = fct_reorder(case_when(i == 1 ~ "intercept",
                                           i == 2 ~ "battery",
                                           i == 3 ~ "followers",
                                           i == 4 ~ "skywalker",
                                           i == 5 ~ "left",
                                           i == 6 ~ "exam.1",
                                           i == 7 ~ "exam.2",
                                           i == 8 ~ "exam.3"), i)) %>%
  group_by(parameter) %>%
  summarize(Estimate = median(.value),
            `2.5%` = quantile(.value, 0.025),
            `25%` = quantile(.value, 0.25),
            `75%` = quantile(.value, 0.75),
            `97.5%` = quantile(.value, 0.975)) %>%
  ungroup()
params.censored.df
## # A tibble: 8 × 6
##   parameter Estimate  `2.5%`  `25%`  `75%` `97.5%`
##   <fct>        <dbl>   <dbl>  <dbl>  <dbl>   <dbl>
## 1 intercept   63.5   62.8    63.3   63.8    64.2  
## 2 battery      0.633  0.152   0.460  0.790   1.09 
## 3 followers    0.348 -0.0824  0.205  0.501   0.785
## 4 skywalker   -0.683 -1.55   -0.998 -0.381   0.159
## 5 left        -0.155 -2.07   -0.809  0.516   1.83 
## 6 exam.1       3.72   3.03    3.48   3.95    4.39 
## 7 exam.2       8.30   7.59    8.06   8.56    9.09 
## 8 exam.3      19.5   18.8    19.3   19.8    20.2

Get predictions and residuals

# Predictions from the R model.
predictors = c("battery", "followers", "skywalker", "left", "test.1", "test.2",
               "test.3")
exams.df$pred.censored = c(
  as.matrix(cbind(rep(1, nrow(exams.df)), exams.df[,predictors])) %*%
  summary(censored.fit)$estimate[c("(Intercept)", predictors),"Estimate"]
)
exams.df = exams.df %>%
  mutate(pred.censored = case_when(pred.censored > 100 ~ 100,
                                   pred.censored < 0 ~ 0,
                                   T ~ pred.censored),
         resid.censored = exam.raw - pred.censored)
# Predictions from the Stan model.
exams.df = exams.df %>%
  mutate(pred.censored.stan = censored.stan.fit %>%
           gather_draws(y_pred[i]) %>%
           group_by(i) %>%
           summarize(pred = median(.value)) %>%
           arrange(i) %>%
           pull(pred),
         resid.censored.stan = exam.raw - pred.censored.stan)

Plot fitted values

# Predictions from the R model.
exams.df %>%
  ggplot(aes(x = pred.censored, y = exam.raw)) +
  geom_point(alpha = 0.2) +
  geom_abline(slope = 1, intercept = 0, color = blues[1]) +
  geom_hline(yintercept = 0) +
  geom_vline(xintercept = 0) +
  scale_x_continuous(limits = c(0, 100)) +
  scale_y_continuous(limits = c(0, 100)) +
  theme(legend.position = "none",
        panel.border = element_blank(),
        axis.ticks = element_blank()) +
  labs(x = "Predicted final exam score", y = "Observed final exam score",
       title = "Censored regression", subtitle = "Fitted vs. observed values")

# Predictions from the Stan model.
exams.df %>%
  ggplot(aes(x = pred.censored.stan, y = exam.raw)) +
  geom_point(alpha = 0.2) +
  geom_abline(slope = 1, intercept = 0, color = blues[1]) +
  geom_hline(yintercept = 0) +
  geom_vline(xintercept = 0) +
  scale_x_continuous(limits = c(0, 100)) +
  scale_y_continuous(limits = c(-10, 100)) +
  theme(panel.border = element_blank(),
        axis.ticks = element_blank()) +
  labs(x = "Predicted final exam score", y = "Observed final exam score",
       title = "Censored regression (Bayesian)",
       subtitle = "Fitted vs. observed values")

Stan code

functions {
  // Get the indices of left-, right-, or non-censored observations.
  // type 1: left
  // type 2: right
  // type 3: non
  int[] inds_censored(vector y, int type) {
    int inds[rows(y)];
    int counter = 0;
    for(n in 1:rows(y)) {
      if((type == 1 && y[n] == 0) ||
         (type == 2 && y[n] == 100) ||
         (type == 3 && y[n] > 0 && y[n] < 100)) {
        counter += 1;
        inds[counter] = n;
      }
    }
    return(inds[1:counter]);
  }
}

data {
  int<lower=0> N; // number of observations
  int<lower=0> I; // number of predictors
  matrix[N,I] X; // matrix of predictors
  vector<lower=0,upper=100>[N] y; // final exam scores
}

transformed data {
  // indices of left-, right-, and non-censored observations
  int inds_left_censored[size(inds_censored(y, 1))] = inds_censored(y, 1);
  int inds_right_censored[size(inds_censored(y, 2))] = inds_censored(y, 2);
  int inds_non_censored[size(inds_censored(y, 3))] = inds_censored(y, 3);
  // add an intercept column to the matrix of predictors
  matrix[N,I+1] X_intercept = append_col(rep_vector(1, N), X);
  // scale outputs between 0 and 1
  // (underflow problems when we try to predict on the original scale)
  vector[N] y_scaled = y / 100;
}

parameters {
  vector[I+1] beta_scaled; // intercept and predictor coefficients
  real<lower=0> sigma_scaled; // scale parameter
}

transformed parameters {
  vector[I+1] beta = beta_scaled * 100;
  real<lower=0> sigma = sigma_scaled * 100;
}

model {
  // priors
  beta_scaled[1] ~ normal(0.5, 0.15);
  segment(beta_scaled, 2, I) ~ std_normal();
  sigma_scaled ~ exponential(5);
  // model
  y_scaled[inds_non_censored] ~ normal(X_intercept[inds_non_censored,] * beta_scaled, sigma_scaled);
  target += normal_lcdf(y_scaled[inds_left_censored] | X_intercept[inds_left_censored,] * beta_scaled,
                        sigma_scaled);
  target += normal_lccdf(y_scaled[inds_right_censored] | X_intercept[inds_right_censored,] * beta_scaled,
                         sigma_scaled);
}

generated quantities {
  real y_pred[N] = normal_rng(X_intercept * beta, sigma);
  for(n in 1:N) {
    if(y_pred[n] > 100) {
      y_pred[n] = 100;
    } else if(y_pred[n] < 0) {
      y_pred[n] = 0;
    }
  }
}

Logit-normal regression

A logit-normal regression assumes that \(y\) has a logit-normal distribution - that is, that the logit of \(y\) has a normal distribution. Although it isn’t strictly necessary, we can write this model using a latent variable \(z\) just as before; \(y\) is then the inverse logit of \(z\) (dividing \(y\) by 100 to yield a variable bounded between 0 and 1, matching the range of the inverse logit function).

\[z_i: \mbox{student's true knowledge/ability}\]

\[z_i \sim N(\alpha + \beta x_i, \sigma)\]

\[\frac{y_i}{100} = \mbox{logit}^{-1}(z_i)\]

Because the logit of 0 or 1 is infinite, a logit-normal regression can’t handle values of \(y\) that are exactly 0 or 1. To avoid this problem, we can offset values of 0 or 1 by a small amount.

score.offset = 0.0001
exams.df = exams.df %>%
  mutate(exam.raw.offset = case_when(exam.raw / 100 == 1 ~ 1 - score.offset,
                                     exam.raw / 100 == 0 ~ score.offset,
                                     T ~ exam.raw / 100))

There are two ways to fit a logit-normal regression in R. This first method is to use the gamlss package, which makes it possible to fit a logit-normal regression directly to the observed values of \(y\). The second method is to fit an ordinary normal linear model to the logit of \(y\). As shown in the code and plots below, the two methods produce identical parameter estimates (within a rounding error).

Fit the model

# Method 1: fit a logit-normal regression directly to the observed values of y.
library(gamlss)
logit.1.fit = gamlss(exam.raw.offset ~ battery + followers + skywalker + left +
                       test.1 + test.2 + test.3,
                     data = exams.df, family = LOGITNO(),
                     control = gamlss.control(trace = F))

# Method 2: fit a normal regression to logit(y).
library(logitnorm)
logit.2.fit = lm(logit(exam.raw.offset) ~ battery + followers + skywalker +
                   left + test.1 + test.2 + test.3,
                 data = exams.df)
# Fit a model with Stan (method 2 only).
library(rstan)
logit.stan.fit = stan(file = "logit_model.stan",
                      data = score.stan.data,
                      chains = 4, iter = 2000, cores = 4)

Summarize fitted parameters

# Parameters from the R model (method 1).
round(summary(logit.1.fit), 4)

# Parameters from the R model (method 2).
round(coef(summary(logit.2.fit)), 4)
##             Estimate Std. Error t value Pr(>|t|)
## (Intercept)  -0.0463     0.0428 -1.0810   0.2798
## battery      -0.0019     0.0288 -0.0646   0.9485
## followers     0.0782     0.0280  2.7876   0.0053
## skywalker     0.0988     0.0562  1.7572   0.0789
## left         -0.0296     0.1195 -0.2477   0.8044
## test.1        0.1387     0.0398  3.4830   0.0005
## test.2        0.7318     0.0474 15.4462   0.0000
## test.3        2.3920     0.0448 53.3688   0.0000
## (Intercept)   0.6075     0.0102 59.3118   0.0000
##             Estimate Std. Error t value Pr(>|t|)
## (Intercept)  -0.0463     0.0428 -1.0801   0.2802
## battery      -0.0019     0.0288 -0.0646   0.9485
## followers     0.0782     0.0281  2.7853   0.0054
## skywalker     0.0988     0.0563  1.7558   0.0792
## left         -0.0296     0.1196 -0.2475   0.8045
## test.1        0.1387     0.0399  3.4801   0.0005
## test.2        0.7318     0.0474 15.4332   0.0000
## test.3        2.3920     0.0449 53.3240   0.0000
# Parameters from the Stan model (method 2 only).
library(tidybayes)
params.logit.df = logit.stan.fit %>%
  gather_draws(beta[i]) %>%
  ungroup() %>%
  mutate(parameter = fct_reorder(case_when(i == 1 ~ "intercept",
                                           i == 2 ~ "battery",
                                           i == 3 ~ "followers",
                                           i == 4 ~ "skywalker",
                                           i == 5 ~ "left",
                                           i == 6 ~ "exam.1",
                                           i == 7 ~ "exam.2",
                                           i == 8 ~ "exam.3"), i)) %>%
  group_by(parameter) %>%
  summarize(Estimate = median(.value),
            `2.5%` = quantile(.value, 0.025),
            `25%` = quantile(.value, 0.25),
            `75%` = quantile(.value, 0.75),
            `97.5%` = quantile(.value, 0.975)) %>%
  ungroup()
params.logit.df
## # A tibble: 8 × 6
##   parameter  Estimate  `2.5%`   `25%`   `75%` `97.5%`
##   <fct>         <dbl>   <dbl>   <dbl>   <dbl>   <dbl>
## 1 intercept -0.0450   -0.132  -0.0743 -0.0168  0.0385
## 2 battery   -0.000757 -0.0571 -0.0206  0.0180  0.0564
## 3 followers  0.0785    0.0232  0.0587  0.0978  0.135 
## 4 skywalker  0.0960   -0.0141  0.0583  0.135   0.212 
## 5 left      -0.0291   -0.259  -0.109   0.0536  0.206 
## 6 exam.1     0.142     0.0657  0.116   0.167   0.216 
## 7 exam.2     0.731     0.640   0.699   0.763   0.818 
## 8 exam.3     2.39      2.30    2.36    2.42    2.48

Get predictions and residuals

# Predictions from the R model (method 1).
exams.df = exams.df %>%
  mutate(pred.logit.1 = fitted(logit.1.fit) * 100,
         resid.logit.1 = exam.raw - pred.logit.1)

# Predictions from the R model (method 2).
exams.df = exams.df %>%
  mutate(pred.logit.2 = invlogit(unname(logit.2.fit$fitted.values)) * 100,
         resid.logit.2 = exam.raw - pred.logit.2)
# Predictions from the Stan model (method 2 only).
exams.df = exams.df %>%
  mutate(pred.logit.stan = logit.stan.fit %>%
           gather_draws(y_pred[i]) %>%
           group_by(i) %>%
           summarize(pred = median(.value)) %>%
           arrange(i) %>%
           pull(pred),
         resid.logit.stan = exam.raw - pred.logit.stan)

Plot fitted values

# Predictions from the R model (method 1).
exams.df %>%
  ggplot(aes(x = pred.logit.1, y = exam.raw)) +
  geom_point(alpha = 0.2) +
  geom_abline(slope = 1, intercept = 0, color = blues[1]) +
  geom_hline(yintercept = 0) +
  geom_vline(xintercept = 0) +
  scale_x_continuous(limits = c(0, 100)) +
  scale_y_continuous(limits = c(0, 100)) +
  theme(legend.position = "none",
        panel.border = element_blank(),
        axis.ticks = element_blank()) +
  labs(x = "Predicted final exam score", y = "Observed final exam score",
       title = "Logit-normal regression (method 1)",
       subtitle = "Fitted vs. observed values")

# Predictions from the R model (method 2).
exams.df %>%
  ggplot(aes(x = pred.logit.2, y = exam.raw)) +
  geom_point(alpha = 0.2) +
  geom_abline(slope = 1, intercept = 0, color = blues[1]) +
  geom_hline(yintercept = 0) +
  geom_vline(xintercept = 0) +
  scale_x_continuous(limits = c(0, 100)) +
  scale_y_continuous(limits = c(0, 100)) +
  theme(legend.position = "none",
        panel.border = element_blank(),
        axis.ticks = element_blank()) +
  labs(x = "Predicted final exam score", y = "Observed final exam score",
       title = "Logit-normal regression (method 2)",
       subtitle = "Fitted vs. observed values")

# Predictions from the Stan model (method 2 only).
exams.df %>%
  ggplot(aes(x = pred.logit.stan, y = exam.raw)) +
  geom_point(alpha = 0.2) +
  geom_abline(slope = 1, intercept = 0, color = blues[1]) +
  geom_hline(yintercept = 0) +
  geom_vline(xintercept = 0) +
  scale_x_continuous(limits = c(0, 100)) +
  scale_y_continuous(limits = c(-10, 100)) +
  theme(panel.border = element_blank(),
        axis.ticks = element_blank()) +
  labs(x = "Predicted final exam score", y = "Observed final exam score",
       title = "Logit-normal regression (Bayesian)",
       subtitle = "Fitted vs. observed values")

Stan code

data {
  int<lower=0> N; // number of observations
  int<lower=0> I; // number of predictors
  matrix[N,I] X; // matrix of predictors
  vector<lower=0,upper=100>[N] y; // final exam scores
}

transformed data {
  // add an intercept column to the matrix of predictors
  matrix[N,I+1] X_intercept = append_col(rep_vector(1, N), X);
  vector[N] y_offset = y / 100; // final exam scores, scaled between 0 and 1 and offset
  for(n in 1:N) {
    if(y_offset[n] == 1) {
      y_offset[n] = 0.9999;
    } else if(y_offset[n] == 0) {
      y_offset[n] = 0.0001;
    }
  }
}

parameters {
  vector[I+1] beta; // intercept and predictor coefficients
  real<lower=0> sigma; // scale parameter
}

model {
  // priors
  beta ~ std_normal();
  sigma ~ exponential(1);
  // model
  logit(y_offset) ~ normal(X_intercept * beta, sigma);
}

generated quantities {
  real y_pred[N] = inv_logit(normal_rng(X_intercept * beta, sigma));
  for(n in 1:N) {
    y_pred[n] = y_pred[n] * 100;
  }
}

Beta regression

A beta regression is similar to the previous examples in that we posit a latent variable \(z\) that represents the student’s true knowledge/ability; \(z\) is exactly equal to a linear function of \(x\).

\[z_i: \mbox{student's true knowledge/ability}\]

\[z_i = \alpha + \beta x_i\]

For notational convenience, define \(\mu\) as the inverse logit of \(z\).

\[\mu_i = \mbox{logit}^{-1}(z_i)\]

We then model \(y\) as a random variable with a beta distribution, parameterized in terms of \(\mu\) (the mean of the distribution, which must be between 0 and 1) and \(\phi\) (a scale parameter, which must be positive). It’s possible to model \(\phi\) as a function of \(x\), just as \(\mu\) is a function of \(x\), but in this example we estimate a single fixed \(\phi\). Like the logit-normal distribution, the beta distribution is supported between 0 and 1.

\[\frac{y_i}{100} \sim \mbox{Beta}(\mu_i \phi, (1 - \mu_i) \phi)\]

Fit the model

# Fit a model with R.
library(betareg)
beta.fit = betareg(exam.raw.offset ~ battery + followers + skywalker + left +
                     test.1 + test.2 + test.3,
                   data = exams.df,
                   link = "logit", link.phi = "log")
# Fit a model with Stan.
library(rstan)
beta.stan.fit = stan(file = "beta_model.stan",
                     data = score.stan.data,
                     chains = 4, iter = 2000, cores = 4)

Summarize fitted parameters

# Parameters from the R model.
round(coef(summary(beta.fit))$mean, 4)
##             Estimate Std. Error z value Pr(>|z|)
## (Intercept)   0.3331     0.0212 15.6909   0.0000
## battery       0.0432     0.0142  3.0446   0.0023
## followers     0.0151     0.0138  1.0961   0.2730
## skywalker    -0.0109     0.0277 -0.3927   0.6946
## left          0.0603     0.0594  1.0155   0.3099
## test.1        0.1259     0.0200  6.2804   0.0000
## test.2        0.3323     0.0234 14.2035   0.0000
## test.3        1.1356     0.0235 48.3989   0.0000
# Parameters from the Stan model.
library(tidybayes)
params.beta.df = beta.stan.fit %>%
  gather_draws(beta[i]) %>%
  ungroup() %>%
  mutate(parameter = fct_reorder(case_when(i == 1 ~ "intercept",
                                           i == 2 ~ "battery",
                                           i == 3 ~ "followers",
                                           i == 4 ~ "skywalker",
                                           i == 5 ~ "left",
                                           i == 6 ~ "exam.1",
                                           i == 7 ~ "exam.2",
                                           i == 8 ~ "exam.3"), i)) %>%
  group_by(parameter) %>%
  summarize(Estimate = median(.value),
            `2.5%` = quantile(.value, 0.025),
            `25%` = quantile(.value, 0.25),
            `75%` = quantile(.value, 0.75),
            `97.5%` = quantile(.value, 0.975)) %>%
  ungroup()
params.beta.df
## # A tibble: 8 × 6
##   parameter Estimate  `2.5%`    `25%`   `75%` `97.5%`
##   <fct>        <dbl>   <dbl>    <dbl>   <dbl>   <dbl>
## 1 intercept   0.333   0.291   0.318   0.347    0.374 
## 2 battery     0.0431  0.0153  0.0337  0.0529   0.0716
## 3 followers   0.0149 -0.0114  0.00603 0.0242   0.0407
## 4 skywalker  -0.0112 -0.0645 -0.0292  0.00806  0.0432
## 5 left        0.0606 -0.0535  0.0204  0.100    0.175 
## 6 exam.1      0.126   0.0866  0.113   0.139    0.166 
## 7 exam.2      0.332   0.286   0.316   0.347    0.377 
## 8 exam.3      1.14    1.09    1.12    1.15     1.18

Get predictions and residuals

# Predictions from the R model.
exams.df = exams.df %>%
  mutate(pred.beta = unname(beta.fit$fitted.values) * 100,
         resid.beta = exam.raw - pred.beta)
# Predictions from the Stan model.
exams.df = exams.df %>%
  mutate(pred.beta.stan = beta.stan.fit %>%
           gather_draws(y_pred[i]) %>%
           group_by(i) %>%
           summarize(pred = median(.value)) %>%
           arrange(i) %>%
           pull(pred),
         resid.beta.stan = exam.raw - pred.beta.stan)

Plot fitted values

# Predictions from the R model.
exams.df %>%
  ggplot(aes(x = pred.beta, y = exam.raw)) +
  geom_point(alpha = 0.2) +
  geom_abline(slope = 1, intercept = 0, color = blues[1]) +
  geom_hline(yintercept = 0) +
  geom_vline(xintercept = 0) +
  scale_x_continuous(limits = c(0, 100)) +
  scale_y_continuous(limits = c(0, 100)) +
  theme(legend.position = "none",
        panel.border = element_blank(),
        axis.ticks = element_blank()) +
  labs(x = "Predicted final exam score", y = "Observed final exam score",
       title = "Beta regression", subtitle = "Fitted vs. observed values")

# Predictions from the Stan model.
exams.df %>%
  ggplot(aes(x = pred.beta.stan, y = exam.raw)) +
  geom_point(alpha = 0.2) +
  geom_abline(slope = 1, intercept = 0, color = blues[1]) +
  geom_hline(yintercept = 0) +
  geom_vline(xintercept = 0) +
  scale_x_continuous(limits = c(0, 100)) +
  scale_y_continuous(limits = c(-10, 100)) +
  theme(panel.border = element_blank(),
        axis.ticks = element_blank()) +
  labs(x = "Predicted final exam score", y = "Observed final exam score",
       title = "Beta regression (Bayesian)",
       subtitle = "Fitted vs. observed values")

Stan code

data {
  int<lower=0> N; // number of observations
  int<lower=0> I; // number of predictors
  matrix[N,I] X; // matrix of predictors
  vector<lower=0,upper=100>[N] y; // final exam scores
}

transformed data {
  // add an intercept column to the matrix of predictors
  matrix[N,I+1] X_intercept = append_col(rep_vector(1, N), X);
  vector[N] y_offset = y / 100; // final exam scores, scaled between 0 and 1 and offset
  for(n in 1:N) {
    if(y_offset[n] == 1) {
      y_offset[n] = 0.9999;
    } else if(y_offset[n] == 0) {
      y_offset[n] = 0.0001;
    }
  }
}

parameters {
  vector[I+1] beta; // intercept and predictor coefficients
  real<lower=0> phi; // scale parameter
}

model {
  // priors
  beta ~ std_normal();
  phi ~ exponential(1);
  // model
  y_offset ~ beta(inv_logit(X_intercept * beta) * phi,
                  (1 - inv_logit(X_intercept * beta)) * phi);
}

generated quantities {
  real y_pred[N] = beta_rng(inv_logit(X_intercept * beta) * phi,
                            (1 - inv_logit(X_intercept * beta)) * phi);
  for(n in 1:N) {
    y_pred[n] = y_pred[n] * 100;
  }
}

Comparing the four models for the final exam dataset

Fundamentally, all four of these models assume that \(y\) is a (possibly non-linear) function of a linear combination of the predictor variables. It’s not surprising, then, that the fitted coefficients for the predictors are highly similar across all four models. (The precise values of the coefficients differ somewhat, but their relative importance is clearly consistent.)

bind_rows(
  normal = data.frame(coef(summary(normal.fit))[2:(length(predictors)+1),c("Estimate", "Std. Error")]) %>%
    rownames_to_column("parameter"),
  censored = data.frame(coef(summary(censored.fit))[2:(length(predictors)+1),c("Estimate", "Std. error")]) %>%
    rownames_to_column("parameter") %>%
    dplyr::select(parameter, Estimate, Std..Error = Std..error),
  logit.normal = data.frame(logit.1.fit.summary[2:(length(predictors)+1),c("Estimate", "Std. Error")]) %>%
    rownames_to_column("parameter"),
  beta = data.frame(coef(summary(beta.fit))[["mean"]][2:(length(predictors)+1),c("Estimate", "Std. Error")]) %>%
    rownames_to_column("parameter"),
  .id = "model"
) %>%
  mutate(model = fct_relevel(gsub("\\.", "-", model),
                             "normal", "censored", "logit-normal", "beta"),
         parameter = fct_relevel(parameter, "battery", "followers",
                                 "skywalker", "left", "exam.1", "exam.2",
                                 "exam.3"),
         parameter = fct_rev(parameter)) %>%
  ggplot(aes(x = parameter, y = Estimate)) +
  geom_point(size = 3, color = blues[1]) +
  geom_linerange(aes(ymin = Estimate + (qnorm(0.25) * Std..Error),
                     ymax = Estimate + (qnorm(0.75) * Std..Error)),
                 size = 2, color = blues[1]) +
  geom_linerange(aes(ymin = Estimate + (qnorm(0.975) * Std..Error),
                     ymax = Estimate + (qnorm(0.025) * Std..Error)),
                 color = blues[1]) +
  geom_hline(yintercept = 0) +
  facet_wrap(~ model, scales = "free_x") +
  coord_flip() +
  labs(x = "Parameter",
       title = "Parameter estimates for the same dataset from normal, censored,\nlogit-normal, and beta regressions")

For the same reasons, the predictions made by the four models are highly correlated with each other. A few differences are worth noting:

  • The censored regression predicts slightly more extreme scores than the normal regression, at both ends (but especially at the low end for this particular dataset). This is because the censored regression takes into account the fact that an observed score of 0 or 100 might actually represent a value of \(z\) less than 0 or greater than 100.

  • Similarly, the logit-normal and beta regressions predict more extreme scores than the normal and censored regressions (more so for the logit-normal regression). This is because of the non-linear function that relates the latent variable \(z\) and the observed score \(y\).

library(GGally)
exams.df %>%
  dplyr::select(normal = pred.normal, censored = pred.censored,
                `logit-normal` = pred.logit.1, beta = pred.beta) %>%
  ggpairs(lower = list(continuous =
    function(data, mapping, ...) {
      ggplot(data = data, mapping = mapping) +
        geom_point() +
        geom_abline(slope = 1, intercept = 0, color = blues[1])
    }
  )) +
  labs(title = "Predictions for the same dataset from normal, censored, logit-normal,\nand beta regressions")

When we look at the residuals from each model, we see that they are fairly comparable. The beta regression struggles the most, with a sizeable number of under-predictions; the model predicts no very high scores at all (the highest prediction is 90.05). The logit-normal regression does fairly well but has a long tail of under-predictions. The normal and censored regressions are extremely similar.

exams.df %>%
  dplyr::select(-ends_with(".stan")) %>%
  pivot_longer(cols = starts_with("resid."),
               names_to = "model",
               names_prefix = "resid.",
               values_to = "resid") %>%
  filter(model != "logit.2" & !grepl("bounds", model)) %>%
  mutate(model = ifelse(model == "logit.1", "logit-normal", model),
         model = fct_relevel(model, "normal", "censored", "logit-normal", "beta"),
         model = fct_rev(model)) %>%
  ggplot(aes(x = model, y = resid)) +
  geom_violin(color = yellows[4], fill = yellows[4]) +
  geom_boxplot(color = "white", width = 0.1, outlier.alpha = 0) +
  stat_summary(fun.y = "median", geom = "point", color = yellows[4]) +
  coord_flip() +
  labs(x = "Model", y = "Residuals")

Comparing the four models via simulation

To explore differences among the four models more systematically, I simulated datasets in which final exam score (scaled to 0-1) was a function of a single predictor variable (which I called “GPA”, although the variable was actually \(N(0, 1)\) distributed). The code that generated the simulations that follow is available here as 01-create_datasets.R; graphs are produced by the code in 02-evaluate_models.R. The simulations varied along several dimensions:

  • Average score: mid (0.5), high (0.8)
  • Error term: narrow (0.2), wide (0.5)
  • Number of observations: 100, 500
  • Strength of association between GPA and score (5 steps)
  • Distribution that generated score from GPA (censored, logit-normal, beta)

I generated 100 simulated datasets for each combination of the above features, and fit four models to each one (normal, censored, logit-normal, and beta). The first question I was interested in was how often these models were able to detect a significant effect of GPA at \(\alpha = 0.05\). The overall patterns were not surprising: models were more likely to find a significant effect when there were many observations, small error terms, large effects of GPA, and when the scores were not clustered near the boundary. However, the specific model fit to the data made little or no difference: all the regression types performed about the same, regardless of the true distribution that had generated the data.

Similarly, the residual patterns appear to depend more on the distribution that generated the data than they do on the model fit to the data. Logit-normal regression has a characteristic long tail when the observed scores fall closer to the boundary, but other residual patterns look remarkably similar. (Gray rectangles in the graph below mark models that match the true distribution that generated the data.)

Ordered discrete dependent variables

In the context of academic assessment, letter grades (“A”, “B”, etc.) are an excellent example of an ordered discrete outcome: A \(>\) B \(>\) C \(>\) D \(>\) E. For purposes of illustration, we can compute letter grades from final exam grades using a standard grading scale: A for 91-100, B for 81-90, and so on.

exams.df = exams.df %>%
  mutate(exam.grade = fct_relevel(case_when(exam.raw > 90 ~ "A",
                                            exam.raw > 80 ~ "B",
                                            exam.raw > 70 ~ "C",
                                            exam.raw > 60 ~ "D",
                                            T ~ "E"),
                                  "E", "D", "C", "B", "A"))

In a real dataset, of course, there’s little point in working with letter grades if we have access to the underlying scores, which provide us with more information. But for purposes of illustration, it’s helpful to be able to assess how much information each regression model is able to recover.

exams.df %>%
  ggplot(aes(x = exam.grade)) +
  geom_bar() +
  labs(x = "Final exam grade", y = "Number of observations",
       title = "Distribution of final exam grades")

Preparing the data for Stan

Both Stan models below for final exam grade accept a dataset in this form:

grade.stan.data = list(
  N = nrow(exams.df),
  I = 7,
  X = exams.df %>%
    dplyr::select(battery, followers, skywalker, left, test.1, test.2, test.3),
  J = 5,
  y = as.numeric(exams.df$exam.grade)
)

Classical linear regression

If we assign each discrete outcome to a number, and order those numbers appopriately, we can fit a classical linear regression that predicts those numeric outcomes. To convert the predictions of the model back to discrete outcomes, we can round them to the nearest integer and map them back to letter grades. For our working dataset, this method performs very poorly; the model doesn’t predict a single “A”!

Fit the model

# Fit a model in R.
discrete.normal.fit = lm(as.numeric(exam.grade) ~ battery + followers +
                           skywalker + left + test.1 + test.2 + test.3,
                         data = exams.df)
# Fit a model with Stan.
library(rstan)
discrete.normal.stan.fit = stan(file = "discrete_normal_model.stan",
                                data = grade.stan.data,
                                chains = 4, iter = 2000, cores = 4)

Summarize fitted parameters

# Parameters from the R model.
round(coef(summary(discrete.normal.fit)), 4)
##             Estimate Std. Error  t value Pr(>|t|)
## (Intercept)   2.8724     0.0228 126.0445   0.0000
## battery       0.1169     0.0153   7.6321   0.0000
## followers    -0.0373     0.0149  -2.4960   0.0126
## skywalker    -0.2107     0.0299  -7.0427   0.0000
## left          0.0340     0.0636   0.5338   0.5935
## test.1        0.2472     0.0212  11.6613   0.0000
## test.2        0.3353     0.0252  13.2974   0.0000
## test.3        0.4919     0.0239  20.6177   0.0000
# Parameters from the Stan model.
library(tidybayes)
params.discrete.normal.df = discrete.normal.stan.fit %>%
  gather_draws(beta[i]) %>%
  ungroup() %>%
  mutate(parameter = fct_reorder(case_when(i == 1 ~ "intercept",
                                           i == 2 ~ "battery",
                                           i == 3 ~ "followers",
                                           i == 4 ~ "skywalker",
                                           i == 5 ~ "left",
                                           i == 6 ~ "exam.1",
                                           i == 7 ~ "exam.2",
                                           i == 8 ~ "exam.3"), i)) %>%
  group_by(parameter) %>%
  summarize(Estimate = median(.value),
            `2.5%` = quantile(.value, 0.025),
            `25%` = quantile(.value, 0.25),
            `75%` = quantile(.value, 0.75),
            `97.5%` = quantile(.value, 0.975)) %>%
  ungroup()
params.discrete.normal.df
## # A tibble: 8 × 6
##   parameter Estimate  `2.5%`    `25%`   `75%`  `97.5%`
##   <fct>        <dbl>   <dbl>    <dbl>   <dbl>    <dbl>
## 1 intercept   2.87    2.83    2.86     2.89    2.92   
## 2 battery     0.117   0.0864  0.106    0.127   0.148  
## 3 followers  -0.0378 -0.0661 -0.0474  -0.0274 -0.00796
## 4 skywalker  -0.210  -0.271  -0.231   -0.190  -0.152  
## 5 left        0.0343 -0.0841 -0.00664  0.0765  0.157  
## 6 exam.1      0.247   0.206   0.233    0.261   0.289  
## 7 exam.2      0.336   0.286   0.319    0.353   0.385  
## 8 exam.3      0.491   0.445   0.475    0.508   0.538

Get predictions

# Predictions from the R model.
exams.df$pred.discrete.normal = round(discrete.normal.fit$fitted.values)
# Predictions from the Stan model.
exams.df = exams.df %>%
  mutate(pred.discrete.normal.stan = discrete.normal.stan.fit %>%
           gather_draws(y_pred[i]) %>%
           group_by(i) %>%
           summarize(pred = median(.value)) %>%
           arrange(i) %>%
           pull(pred))

Plot fitted values

# Predictions from the R model.
exams.df %>%
  mutate(alert = pred.discrete.normal > 5 | pred.discrete.normal < 1) %>%
  ggplot(aes(x = exam.grade, y = pred.discrete.normal,
             color = alert)) +
  geom_jitter(alpha = 0.2) +
  geom_abline(slope = 1, intercept = 0, color = blues[1]) +
  geom_hline(yintercept = 0) +
  geom_vline(xintercept = 0) +
  scale_y_continuous(breaks = -1:5,
                     labels = c("-1", "0",
                                levels(exams.df$exam.grade))) +
  scale_color_manual(values = c("black", "red")) +
  theme(legend.position = "none",
        panel.border = element_blank(),
        axis.ticks = element_blank()) +
  labs(x = "Observed final exam grade", y = "Predicted final exam grade",
       title = "Normal regression", subtitle = "Fitted vs. observed values")

# Predictions from the Stan model.
exams.df %>%
  mutate(alert = pred.discrete.normal.stan > 5 | pred.discrete.normal.stan < 1) %>%
  ggplot(aes(x = exam.grade, y = round(pred.discrete.normal.stan),
             color = alert)) +
  geom_jitter(alpha = 0.2) +
  geom_abline(slope = 1, intercept = 0, color = blues[1]) +
  geom_hline(yintercept = 0) +
  geom_vline(xintercept = 0) +
  scale_y_continuous(breaks = -1:5,
                     labels = c("-1", "0",
                                levels(exams.df$exam.grade))) +
  scale_color_manual(values = c("black", "red")) +
  theme(legend.position = "none",
        panel.border = element_blank(),
        axis.ticks = element_blank()) +
  labs(x = "Observed final exam grade", y = "Predicted final exam grade",
       title = "Normal regression (Bayesian)",
       subtitle = "Fitted vs. observed values")

Stan code

data {
  int<lower=0> N; // number of observations
  int<lower=0> I; // number of predictors
  matrix[N,I] X; // matrix of predictors
  int J; // number of possible grades
  real<lower=1,upper=J> y[N]; // final exam grades
}

transformed data {
  // add an intercept column to the matrix of predictors
  matrix[N,I+1] X_intercept = append_col(rep_vector(1, N), X);
}

parameters {
  vector[I+1] beta; // intercept and predictor coefficients
  real<lower=0> sigma; // scale parameter
}

model {
  // priors
  beta[1] ~ normal(3, 1);
  segment(beta, 2, I) ~ std_normal();
  sigma ~ exponential(0.5);
  // model
  y ~ normal(X_intercept * beta, sigma);
}

generated quantities {
  real y_pred[N] = round(normal_rng(X_intercept * beta, sigma));
}

Ordered logistic regression

Ordered logistic regression models each discrete outcome as a probabilistic function of a latent variable \(z\) with a logistic distribution. The idea is that \(z\) is a continuous variable that represents the true “scale” in question; here, that scale is the student’s performance on the exam. In other situations, the scale may be more abstract; for example, when predicting responses on a Likert scale in a survey (“Strongly agree”, “Somewhat agree”, “Neutral”, etc.), \(z\) might represent the respondent’s true level of agreement. As usual, \(z\) is a function of the predictor variables \(x\).

\[z_i: \mbox{student's performance}\]

\[z_i \sim \mbox{Logistic}(\beta x_i)\]

In addition to the \(\beta\) coefficients, an ordered logistic regression model fits a series of cutpoints \(c\): one less than the total number of possible discrete outcomes. The probability that \(y_i\) is a particular discrete outcome is equal to the proportion of the distribution of \(z_i\) that falls between the two relevant cutpoints.

\[y_i: \mbox{student's grade}\]

\[P(y_i = B) = P(z_i > c_{C|B} \mbox{ and } z_i <= c_{B|A})\]

To predict a single outcome for a given observation, we can either choose the outcome with the highest probability (the method used below) or stochastically choose an outcome weighted by probability.

data.frame(z = seq(-5, 5, 0.01)) %>%
  mutate(d = dlogis(z - 1)) %>%
  ggplot(aes(x = z,
             y = d)) +
  geom_area(stat = "function", fun = function(x) { dlogis(x - 1) },
            fill = "gray", xlim = c(-0.5, 2)) +
  geom_line() +
  geom_vline(xintercept = c(-3, -2, -0.5, 2)) +
  annotate("label", x = 0.7, y = 0.1,
           label = eval(bquote(expression(P(y[i] == B) == .(round(plogis(2) - plogis(-0.5), 2)))))) +
  scale_x_continuous(expression(z[i] == beta*x[i]),
                     breaks = c(-3, -2, -0.5, 2),
                     labels = c("E|D", "D|C", "C|B", "B|A")) +
  scale_y_continuous("") +
  theme(panel.grid.minor.x = element_blank())

Note that I’ve omitted an overall intercept \(\alpha\) from this model. If we try to model both \(\alpha\) and the full set of cutpoints \(c\), the model is nonidentifiable: note that the predictions in the figure above would be exactly the same if we added some arbitrary constant (say, 6.4) to each \(z_i\) and to each cutpoint. An alternative (not shown here) is to include \(\alpha\) but fix one of the cutpoints to a predetermined value (usually 0).

Fit the model

# Fit a model in R.
library(MASS)
ordered.logistic.fit = polr(exam.grade ~ battery + followers + skywalker +
                              left + test.1 + test.2 + test.3,
                            data = exams.df)
# Fit a model with Stan.
library(rstan)
ordered.logistic.stan.fit = stan(file = "ordered_logistic_model.stan",
                                 data = grade.stan.data,
                                 chains = 4, iter = 2000, cores = 4)

Summarize fitted parameters

# Parameters from the R model.
round(coef(summary(ordered.logistic.fit)), 4)
##             Value Std. Error  t value
## battery    0.1523     0.0326   4.6768
## followers  0.0100     0.0312   0.3200
## skywalker -0.2495     0.0619  -4.0291
## left       0.0183     0.1375   0.1333
## test.1     0.9569     0.0614  15.5799
## test.2     1.2242     0.0670  18.2774
## test.3     1.7485     0.0734  23.8085
## E|D       -1.3034     0.0642 -20.2982
## D|C        0.1715     0.0599   2.8646
## C|B        1.7379     0.0651  26.6804
## B|A        3.7776     0.0813  46.4417
# Parameters from the Stan model.
library(tidybayes)
params.ordered.logistic.df = bind_rows(
  ordered.logistic.stan.fit %>%
    gather_draws(beta[i]) %>%
    ungroup() %>%
    mutate(parameter = case_when(i == 1 ~ "battery",
                                 i == 2 ~ "followers",
                                 i == 3 ~ "skywalker",
                                 i == 4 ~ "left",
                                 i == 5 ~ "exam.1",
                                 i == 6 ~ "exam.2",
                                 i == 7 ~ "exam.3"), i) %>%
    dplyr::select(parameter, .value),
  ordered.logistic.stan.fit %>%
    gather_draws(c[i]) %>%
    ungroup() %>%
    mutate(parameter = case_when(i == 1 ~ "E|D",
                                 i == 2 ~ "D|C",
                                 i == 3 ~ "C|B",
                                 i == 4 ~ "B|A")) %>%
    dplyr::select(parameter, .value)
) %>%
  mutate(parameter = fct_relevel(parameter, "battery", "followers", "skywalker",
                                 "left", "exam.1", "exam.2", "exam.3", "E|D",
                                 "D|C", "C|B", "B|A")) %>%
  group_by(parameter) %>%
  summarize(Estimate = median(.value),
            `2.5%` = quantile(.value, 0.025),
            `25%` = quantile(.value, 0.25),
            `75%` = quantile(.value, 0.75),
            `97.5%` = quantile(.value, 0.975)) %>%
  ungroup()
params.ordered.logistic.df
## # A tibble: 11 × 6
##    parameter Estimate  `2.5%`   `25%`   `75%` `97.5%`
##    <fct>        <dbl>   <dbl>   <dbl>   <dbl>   <dbl>
##  1 battery    0.155    0.0913  0.132   0.175   0.218 
##  2 followers  0.00938 -0.0503 -0.0119  0.0307  0.0707
##  3 skywalker -0.249   -0.373  -0.293  -0.207  -0.127 
##  4 left       0.0189  -0.254  -0.0742  0.112   0.283 
##  5 exam.1     0.958    0.841   0.917   0.999   1.08  
##  6 exam.2     1.22     1.09    1.18    1.27    1.36  
##  7 exam.3     1.74     1.60    1.70    1.79    1.89  
##  8 E|D       -1.30    -1.43   -1.35   -1.26   -1.18  
##  9 D|C        0.170    0.0493  0.129   0.213   0.287 
## 10 C|B        1.74     1.61    1.69    1.78    1.87  
## 11 B|A        3.78     3.61    3.72    3.83    3.94

Get predictions

# Predictions from the R model.
exams.df$latent.ordered.logistic = ordered.logistic.fit$lp
exams.df$pred.ordered.logistic = apply(
  ordered.logistic.fit$fitted.values,
  1,
  function(x) { c("E", "D", "C", "B", "A")[which(x == max(x))] }
) %>%
  fct_relevel("E", "D", "C", "B", "A")
# Predictions from the Stan model.
exams.df = exams.df %>%
  mutate(latent.ordered.logistic.stan = ordered.logistic.stan.fit %>%
           gather_draws(y_latent[i]) %>%
           group_by(i) %>%
           summarize(pred = median(.value)) %>%
           arrange(i) %>%
           pull(pred),
         pred.ordered.logistic.stan = ordered.logistic.stan.fit %>%
           gather_draws(y_pred[i]) %>%
           group_by(i) %>%
           summarize(pred = median(.value)) %>%
           arrange(i) %>%
           pull(pred),
         pred.ordered.logistic.stan = c("E", "D", "C", "B", "A")[pred.ordered.logistic.stan],
         pred.ordered.logistic.stan = fct_relevel(pred.ordered.logistic.stan,
                                                  "E", "D", "C", "B", "A"))

Plot fitted values

# Predictions from the R model.
exams.df %>%
  ggplot(aes(x = exam.grade, y = pred.ordered.logistic)) +
  geom_jitter(alpha = 0.2) +
  geom_abline(slope = 1, intercept = 0, color = blues[1]) +
  geom_hline(yintercept = 0) +
  geom_vline(xintercept = 0) +
  theme(legend.position = "none",
        panel.border = element_blank(),
        axis.ticks = element_blank()) +
  labs(x = "Observed final exam grade", y = "Predicted final exam grade")

# Predictions from the Stan model.
exams.df %>%
  ggplot(aes(x = exam.grade, y = pred.ordered.logistic.stan)) +
  geom_jitter(alpha = 0.2) +
  geom_abline(slope = 1, intercept = 0, color = blues[1]) +
  geom_hline(yintercept = 0) +
  geom_vline(xintercept = 0) +
  theme(legend.position = "none",
        panel.border = element_blank(),
        axis.ticks = element_blank()) +
  labs(x = "Observed final exam grade", y = "Predicted final exam grade")

Stan code

data {
  int<lower=0> N; // number of observations
  int<lower=0> I; // number of predictors
  matrix[N,I] X; // matrix of predictors
  int J; // number of possible grades
  int<lower=1,upper=J> y[N]; // final exam grades
}

parameters {
  vector[I] beta; // predictor coefficients
  ordered[J-1] c; // cutpoints
}

model {
  // priors
  beta ~ std_normal();
  // model
  y ~ ordered_logistic(X * beta, c);
}

generated quantities {
  vector[N] y_latent;
  int y_pred[N];
  y_latent = X * beta;
  for(n in 1:N) {
    y_pred[n] = ordered_logistic_rng(y_latent[n], c);
  }
}

Evaluating the two models for the final exam dataset

The ordered logistic regression model clearly outperforms the classical linear regression: 55% accuracy for the ordered logistic regression, 33% for classical linear regression. (There are, of course, better measures of performance for multi-class classifiers, but some of the better ones – such as log-loss – require an estimate of the probability of each class for each observation, which does not come “baked in” with classical linear regression.)

A convenient characteristic of the final exam dataset is that the latent variable \(z\) has a very clear, and observed, real-world interpretation: the student’s numeric score on the exam. (Or very nearly so; the score is bounded, as discussed above, and its relationship to the student’s letter grade is deterministic rather than stochastic.) For this dataset, the ordered logistic regression model does remarkably well at recovering (a linear function of) students’ actual scores, considering the fact that converting scores to letter grades throws away information.

cutpoints.df = coef(summary(ordered.logistic.fit)) %>%
  data.frame() %>%
  rownames_to_column("parameter") %>%
  filter(grepl("\\|", parameter)) %>%
  mutate(real.value = case_when(parameter == "E|D" ~ 60,
                                parameter == "D|C" ~ 70,
                                parameter == "C|B" ~ 80,
                                parameter == "B|A" ~ 90))
library(fuzzyjoin)
exams.df %>%
  ggplot(aes(x = exam.raw, y = latent.ordered.logistic)) +
  geom_point(alpha = 0.2) +
  geom_vline(data = cutpoints.df,
             aes(xintercept = real.value), color = yellows[1]) +
  geom_hline(data = cutpoints.df,
             aes(yintercept = Value), color = yellows[1]) +
  stat_smooth(color = blues[1], fill = blues[1]) +
  geom_label(data = cutpoints.df %>%
              regex_inner_join(data.frame(g = LETTERS[1:5]),
                               by = c(parameter = "g")) %>%
              group_by(g) %>%
              summarize(max.x = max(real.value),
                        min.x = min(real.value),
                        max.y = max(Value),
                        min.y = min(Value)) %>%
              ungroup() %>%
              mutate(max.x = ifelse(g == "A", 100, max.x),
                     min.x = ifelse(g == "E", 0, min.x),
                     max.y = ifelse(g == "A",
                                    max(exams.df$latent.ordered.logistic),
                                    max.y),
                     min.y = ifelse(g == "E",
                                    min(exams.df$latent.ordered.logistic),
                                    min.y)),
            aes(x = min.x + ((max.x - min.x) / 2),
                y = min.y + ((max.y - min.y) / 2),
                label = g),
            color = "black", fill = yellows[1], size = 5) +
  labs(x = "Final exam score", y = "z")

Unordered discrete dependent variables

Finally, our sample dataset has an example of a discrete variable whose values are not intrinsically ordered: the days when the class meets.

Preparing the data for Stan

The Stan model below for meeting days accepts a dataset in this form:

days.stan.data = list(
  N = nrow(exams.df),
  I = 4,
  X = exams.df %>%
    dplyr::select(battery, followers, skywalker, left),
  J = 3,
  y = as.numeric(exams.df$days)
)

Multinomial regression

To assess whether some types of students are more likely to sign up for certain meeting patterns than others, we can fit a multinomial regression model. In this model, we posit \(k\) latent variables \(z_{ik}\), one for each possible outcome \(k\), each representing the propensity of observation \(i\) to fall into category \(k\).

\[z_{ik}: \mbox{propensity of observation $i$ to fall into category $k$}\]

We fit a full set of coefficients \(\alpha\) and \(\beta\) for every outcome \(k\) except one (here, “Online”; if we try to fit coefficients for every outcome, the model is nonidentifiable).

\[z_{ik} = \left\{\begin{array}{ll} 0 & k = 1 \\ \alpha_k + \beta_k x_i & k > 1 \end{array}\right.\]

The probability that observation \(i\) falls into category \(k\) is then defined by the softmax function over \(z_{ik}\):

\[P(y_i = k) = \frac{e^{z_{ik}}}{\sum_{k = 1}^K e^{z_{ik}}}\]

Fit the model

# Fit a model in R.
library(nnet)
multi.fit = multinom(days ~ battery + followers + skywalker + left,
                     data = exams.df)
# Fit a model with Stan.
library(rstan)
multi.stan.fit = stan(file = "multi_model.stan",
                      data = days.stan.data,
                      chains = 4, iter = 2000, cores = 4)

Summarize fitted parameters

# Parameters from the R model.
round(coef(summary(multi.fit)), 4)
##     (Intercept) battery followers skywalker    left
## MWF      1.4519 -0.6833   -0.2912   -0.3584 -3.1092
## TR       0.1086 -0.1500   -0.2087    3.0021 -3.4341
# Parameters from the Stan model.
library(tidybayes)
params.multi.df = multi.stan.fit %>%
  gather_draws(beta[i,j]) %>%
  ungroup() %>%
  mutate(parameter = fct_reorder(case_when(i == 1 ~ "intercept",
                                           i == 2 ~ "battery",
                                           i == 3 ~ "followers",
                                           i == 4 ~ "skywalker",
                                           i == 5 ~ "left",
                                           i == 6 ~ "exam.1",
                                           i == 7 ~ "exam.2",
                                           i == 8 ~ "exam.3"), i),
         outcome = fct_reorder(case_when(j == 1 ~ "MWF",
                                         j == 2 ~ "TR"), j)) %>%
  group_by(parameter, outcome) %>%
  summarize(Estimate = median(.value),
            `2.5%` = quantile(.value, 0.025),
            `25%` = quantile(.value, 0.25),
            `75%` = quantile(.value, 0.75),
            `97.5%` = quantile(.value, 0.975)) %>%
  ungroup()
params.multi.df
## # A tibble: 10 × 7
##    parameter outcome Estimate  `2.5%`   `25%`   `75%` `97.5%`
##    <fct>     <fct>      <dbl>   <dbl>   <dbl>   <dbl>   <dbl>
##  1 intercept MWF        1.44   1.31    1.40    1.49    1.57  
##  2 intercept TR         0.107 -0.0538  0.0537  0.165   0.274 
##  3 battery   MWF       -0.670 -0.792  -0.714  -0.628  -0.551 
##  4 battery   TR        -0.138 -0.265  -0.183  -0.0952 -0.0158
##  5 followers MWF       -0.274 -0.386  -0.313  -0.235  -0.161 
##  6 followers TR        -0.188 -0.307  -0.227  -0.148  -0.0733
##  7 skywalker MWF       -0.406 -0.652  -0.490  -0.325  -0.171 
##  8 skywalker TR         2.94   2.70    2.86    3.02    3.18  
##  9 left      MWF       -2.95  -3.34   -3.09   -2.81   -2.54  
## 10 left      TR        -3.27  -3.63   -3.40   -3.15   -2.89

Get predictions

# Predictions from the R model.
exams.df$pred.multi = apply(
  multi.fit$fitted.values,
  1,
  function(x) { c("Online", "MWF", "TR")[which(x == max(x))] }
) %>%
  fct_relevel("Online", "MWF", "TR")
# Predictions from the Stan model.
exams.df = exams.df %>%
  mutate(pred.multi.stan = multi.stan.fit %>%
           gather_draws(y_pred[i]) %>%
           group_by(i) %>%
           summarize(pred = median(.value)) %>%
           arrange(i) %>%
           pull(pred),
         pred.multi.stan = c("Online", "MWF", "TR")[pred.multi.stan],
         pred.multi.stan = fct_relevel(pred.multi.stan, "Online", "MWF", "TR"))

Plot fitted values

# Predictions from the R model.
exams.df %>%
  ggplot(aes(x = days, y = pred.multi)) +
  geom_jitter(alpha = 0.2) +
  theme(legend.position = "none",
        panel.border = element_blank(),
        axis.ticks = element_blank()) +
  labs(x = "Observed meeting pattern", y = "Predicted meeting pattern")

# Predictions from the Stan model.
exams.df %>%
  ggplot(aes(x = days, y = pred.multi)) +
  geom_jitter(alpha = 0.2) +
  theme(legend.position = "none",
        panel.border = element_blank(),
        axis.ticks = element_blank()) +
  labs(x = "Observed meeting pattern", y = "Predicted meeting pattern")

Stan code

data {
  int<lower=0> N; // number of observations
  int<lower=0> I; // number of predictors
  matrix[N,I] X; // matrix of predictors
  int<lower=1> J; // number of possible meeting day patterns
  int<lower=1,upper=J> y[N]; // meeting days
}

transformed data {
  // add an intercept column to the matrix of predictors
  matrix[N,I+1] X_intercept = append_col(rep_vector(1, N), X);
}

parameters {
  matrix[I+1,J-1] beta; // predictor coefficients for each possible outcome except one
}

transformed parameters {
  matrix[I+1,J] beta_baseline; // add zeros for the first outcome category
  beta_baseline = append_col(rep_vector(0, I + 1), beta);
}

model {
  // priors
  to_vector(beta) ~ std_normal();
  // model
  for(n in 1:N) {
    y[n] ~ categorical_logit((X_intercept[n,] * beta_baseline)');
  }
}

generated quantities {
  int y_pred[N];
  for(n in 1:N) {
    y_pred[n] = categorical_logit_rng((X_intercept[n,] * beta_baseline)');
  }
}