class: center, middle, inverse, title-slide

# Doing Regression When Your Dependent Variables Aren’t Well Behaved

## Salt Lake City R Users Group

### Abby Kaplan, Salt Lake Community College

### May 28, 2020

---
layout: true

<div class="my-footer"><span>https://github.com/kaplanas/nonstandard-regression</span></div>

---
class: inverse, center, middle

# Outline

---

# Outline

+ Some types of dependent variables
+ Focus of this presentation
+ Working dataset
+ Bounded dependent variables
    + Classical linear regression
    + Censored regression
    + Logit-normal regression
    + Beta regression
+ Ordered discrete dependent variables
    + Classical linear regression
    + Ordered logistic regression
+ Unordered discrete dependent variables
    + Multinomial logistic regression

---
class: inverse, center, middle

# Some types of dependent variables

---

# Some types of dependent variables

+ Unbounded continuous variables
    + Linear regression <span style="color:LightGreen">✔</span>
    + I can't think of a single example in higher ed...

--

+ Binary variables
    + Logistic regression <span style="color:LightGreen">✔</span>
    + Did the student pass the class?
    + Did the student return in the next term?
    + Did the student meet with an advisor?

---

# Some types of dependent variables

+ Bounded continuous variables
    + ??? <span style="color:Tomato">✘</span>
    + Test score
    + GPA
    + Section fill percentage

--

+ Ordered discrete variables
    + ??? <span style="color:Tomato">✘</span>
    + Letter grade
    + Likert response (strongly agree, somewhat agree, ...)

--

+ Unordered discrete variables
    + ??? <span style="color:Tomato">✘</span>
    + Choice of section
    + Choice of course
    + Choice of major
    + Choice of...
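---

# Some types of dependent variables

As an aside, here is how these three response types are typically represented in R (a minimal sketch; the variable names are illustrative, not from the exam dataset):

```r
# Bounded continuous: plain numeric; R's type system doesn't know about the limits
gpa = c(2.5, 3.8, 4.0)

# Ordered discrete: an ordered factor, so R knows F < D < C < B < A
grade = factor(c("B", "A", "C"),
               levels = c("F", "D", "C", "B", "A"),
               ordered = TRUE)

# Unordered discrete: a plain factor, with no ranking among levels
major = factor(c("Math", "Art", "Math"))

is.ordered(grade)  # TRUE
is.ordered(major)  # FALSE
```

The ordered/unordered distinction matters downstream: modeling functions like `MASS::polr` require an ordered factor, while `nnet::multinom` wants a plain one.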
---

# Focus of this presentation

+ Higher ed data
    + You can imagine analogous dependent variables in your domain

--

+ Regression, not machine learning
    + Trying to answer questions about causality
    + Want to understand why the model makes the predictions it does

--

+ R code for regression models
    + Fitting, inspecting, predicting
    + Stan code is available on GitHub

---
class: inverse, center, middle

# Dataset: Final exams

---

# Dataset: Final exams

+ Students enrolled in Underwater Basket-Weaving 101

--

+ This is fake data, but realistic
    + Similar things happen when we fit the same models to real datasets
    + (Different variables, but the same idea)

--

+ Dependent variables:
    + Final exam score _(0 - 100)_
    + Final exam grade _(A, B, C, D, F)_
    + Class meeting days _(MWF, TR, Online)_

--

+ Predictor variables:
    + Phone battery charge before exam _(0 - 100, centered and standardized)_
    + Number of Twitter followers _(logged, centered, and standardized)_
    + Luke Skywalker or Han Solo? _(binary)_
    + Left-handed? _(binary)_
    + Tests 1, 2, 3 scores _(0 - 100, centered and standardized)_

---

# Dataset: Final exams

.center[

![](rug2020_presentation_files/figure-html/exam_score_distribution-1.png)<!-- -->

]

---
class: inverse, center, middle

# Linear regression

---

# Classical linear regression

.pull-left[

`$$y_i = \alpha + \beta x_i + \epsilon_i$$`

![](rug2020_presentation_files/figure-html/normal_regression-1.png)<!-- -->

]

--

.pull-right[

`$$\epsilon_i \sim N(0, \sigma)$$`

![](rug2020_presentation_files/figure-html/normal_regression_residuals-1.png)<!-- -->

]

--

.center[.large[

`$$y_i \sim N(\alpha + \beta x_i, \sigma)$$`

]]

---

# Classical linear regression: No limits on y

.pull-left[

`$$y_i \sim N(\alpha + \beta x_i, \sigma)$$`

![](rug2020_presentation_files/figure-html/normal_regression_no_limits-1.png)<!-- -->

]

--

.pull-right[

+ Extrapolating so far beyond the support of the data is not a good idea

{{content}}

]

--

+ But there are scenarios where it's easy to predict impossible values of y

{{content}}

--

+ Example: hard limits on y, data near those limits
    + Test scores
    + GPA
    + Section fill rates

{{content}}

---

# Fitting a classical linear regression

```r
normal.fit = lm(exam.raw ~ battery + followers + skywalker + left + test.1 +
                  test.2 + test.3,
                data = exams.df)
round(coef(summary(normal.fit)), 4)
```

```
##             Estimate Std. Error  t value Pr(>|t|)
## (Intercept)  65.4229     0.3073 212.9215   0.0000
## battery       0.8397     0.2066   4.0647   0.0000
## followers     0.0607     0.2013   0.3015   0.7631
## skywalker    -1.2448     0.4034  -3.0856   0.0020
## left          0.1865     0.8576   0.2175   0.8278
## test.1        2.9729     0.2859  10.3995   0.0000
## test.2        7.0235     0.3400  20.6556   0.0000
## test.3       17.2223     0.3217  53.5378   0.0000
```

---

# Predictions of a classical linear regression

```r
exams.df = exams.df %>%
  mutate(pred.normal = normal.fit$fitted.values,
         pred.normal.out.of.bounds = pred.normal > 100 | pred.normal < 0) %>%
  mutate(pred.normal = case_when(pred.normal > 100 ~ 100,
                                 pred.normal < 0 ~ 0,
                                 T ~ pred.normal),
         resid.normal = exam.raw - pred.normal)
```

---

# Predictions of a classical linear regression

.center[

![](rug2020_presentation_files/figure-html/plot_normal_preds-1.png)<!-- -->

]

---
class: inverse, center, middle

# Censored regression

---

# Censored regression

`$$z_i: \mbox{student's true knowledge/ability}$$`

`$$z_i \sim N(\alpha + \beta x_i, \sigma)$$`

`$$y_i = \Bigg\{\begin{array}{ll} 100 & z_i \geq 100 \\ 0 & z_i \leq 0 \\ z_i & 0 \lt z_i \lt 100 \end{array}$$`

.center[

![](rug2020_presentation_files/figure-html/plot_censored_function-1.png)<!-- -->

]

---

# Censored normal distributions

.center[

![](rug2020_presentation_files/figure-html/plot_censored_distributions-1.png)<!-- -->

]

---

# Fitting a censored regression

```r
library(censReg)
censored.fit = censReg(exam.raw ~ battery + followers + skywalker + left + test.1 +
                         test.2 + test.3,
                       left = 0, right = 100,
                       data = exams.df)
round(coef(summary(censored.fit)), 4)
```

```
##             Estimate Std. error  t value Pr(> t)
## (Intercept)  63.5503     0.3496 181.7722  0.0000
## battery       0.6268     0.2351   2.6658  0.0077
## followers     0.3417     0.2271   1.5045  0.1325
## skywalker    -0.7001     0.4530  -1.5455  0.1222
## left         -0.1587     0.9786  -0.1622  0.8711
## test.1        3.7594     0.3484  10.7891  0.0000
## test.2        8.3002     0.3895  21.3106  0.0000
## test.3       19.5282     0.3709  52.6557  0.0000
## logSigma      2.6640     0.0112 238.7725  0.0000
```

---

# Predictions of a censored regression

```r
predictors = c("battery", "followers", "skywalker", "left",
               "test.1", "test.2", "test.3")
exams.df$pred.censored = c(
  as.matrix(cbind(rep(1, nrow(exams.df)), exams.df[,predictors])) %*%
    summary(censored.fit)$estimate[c("(Intercept)", predictors),"Estimate"]
)
exams.df = exams.df %>%
  mutate(pred.censored = case_when(pred.censored > 100 ~ 100,
                                   pred.censored < 0 ~ 0,
                                   T ~ pred.censored),
         resid.censored = exam.raw - pred.censored)
```

---

# Predictions of a censored regression

.center[

![](rug2020_presentation_files/figure-html/plot_censored_preds-1.png)<!-- -->

]

---
class: inverse, center, middle

# Logit-normal regression

---

# Logit-normal regression

`$$z_i: \mbox{student's true knowledge/ability}$$`

`$$z_i \sim N(\alpha + \beta x_i, \sigma)$$`

`$$\frac{y_i}{100} = \mbox{logit}^{-1}(z_i)$$`

.center[

![](rug2020_presentation_files/figure-html/plot_invlogit_function-1.png)<!-- -->

]

---

# Logit-normal distributions

.center[

![](rug2020_presentation_files/figure-html/plot_logit_distributions-1.png)<!-- -->

]

---

# Adjusting y for a logit-normal regression

+ Divide `\(y\)` by 100 to bound it between 0 and 1
+ `\(y\)` can't be exactly 0 or 1 (the logit is infinite), so offset those values slightly

```r
score.offset = 0.0001
exams.df = exams.df %>%
  mutate(exam.raw.offset = case_when(exam.raw / 100 == 1 ~ 1 - score.offset,
                                     exam.raw / 100 == 0 ~ score.offset,
                                     T ~ exam.raw / 100))
```

---

# Fitting a logit-normal regression: Method 1

```r
library(gamlss)
logit.1.fit = gamlss(exam.raw.offset ~ battery + followers + skywalker + left + test.1 +
                       test.2 + test.3,
                     data = exams.df, family = LOGITNO(),
                     control = gamlss.control(trace = F))
summary(logit.1.fit)
```

```
##             Estimate Std. Error  t value Pr(>|t|)
## (Intercept)  -0.0463     0.0428  -1.0810   0.2798
## battery      -0.0019     0.0288  -0.0646   0.9485
## followers     0.0782     0.0280   2.7876   0.0053
## skywalker     0.0988     0.0562   1.7572   0.0789
## left         -0.0296     0.1195  -0.2477   0.8044
## test.1        0.1387     0.0398   3.4830   0.0005
## test.2        0.7318     0.0474  15.4462   0.0000
## test.3        2.3920     0.0448  53.3688   0.0000
## (Intercept)   0.6075     0.0102  59.3118   0.0000
```

---

# Predictions of a logit-normal regression: Method 1

```r
exams.df = exams.df %>%
  mutate(pred.logit.1 = fitted(logit.1.fit) * 100,
         resid.logit.1 = exam.raw - pred.logit.1)
```

.center[

![](rug2020_presentation_files/figure-html/plot_logit_1_preds-1.png)<!-- -->

]

---

# Fitting a logit-normal regression: Method 2

```r
library(logitnorm)
logit.2.fit = lm(logit(exam.raw.offset) ~ battery + followers + skywalker + left +
                   test.1 + test.2 + test.3,
                 data = exams.df)
round(coef(summary(logit.2.fit)), 4)
```

```
##             Estimate Std. Error  t value Pr(>|t|)
## (Intercept)  -0.0463     0.0428  -1.0801   0.2802
## battery      -0.0019     0.0288  -0.0646   0.9485
## followers     0.0782     0.0281   2.7853   0.0054
## skywalker     0.0988     0.0563   1.7558   0.0792
## left         -0.0296     0.1196  -0.2475   0.8045
## test.1        0.1387     0.0399   3.4801   0.0005
## test.2        0.7318     0.0474  15.4332   0.0000
## test.3        2.3920     0.0449  53.3240   0.0000
```

---

# Predictions of a logit-normal regression: Method 2

```r
exams.df = exams.df %>%
  mutate(pred.logit.2 = invlogit(unname(logit.2.fit$fitted.values)) * 100,
         resid.logit.2 = exam.raw - pred.logit.2)
```

---

# Predictions of a logit-normal regression: Method 2

.center[

![](rug2020_presentation_files/figure-html/plot_logit_2_preds-1.png)<!-- -->

]

---
class: inverse, center, middle

# Beta regression

---

# Beta regression

`$$z_i: \mbox{student's true knowledge/ability}$$`

`$$z_i = \alpha + \beta x_i$$`

`$$\mu_i = \mbox{logit}^{-1}(z_i)$$`

`$$\frac{y_i}{100} \sim \mbox{Beta}(\mu_i \phi, (1 - \mu_i) \phi)$$`

---

# Beta distributions

.center[

![](rug2020_presentation_files/figure-html/plot_beta_distributions-1.png)<!-- -->

]

---

# Fitting a beta regression

```r
library(betareg)
beta.fit = betareg(exam.raw.offset ~ battery + followers + skywalker + left + test.1 +
                     test.2 + test.3,
                   data = exams.df,
                   link = "logit", link.phi = "log")
round(coef(summary(beta.fit))$mean, 4)
```

```
##             Estimate Std. Error  z value Pr(>|z|)
## (Intercept)   0.3331     0.0212  15.6909   0.0000
## battery       0.0432     0.0142   3.0446   0.0023
## followers     0.0151     0.0138   1.0961   0.2730
## skywalker    -0.0109     0.0277  -0.3927   0.6946
## left          0.0603     0.0594   1.0155   0.3099
## test.1        0.1259     0.0200   6.2804   0.0000
## test.2        0.3323     0.0234  14.2035   0.0000
## test.3        1.1356     0.0235  48.3989   0.0000
```

---

# Predictions of a beta regression

```r
exams.df = exams.df %>%
  mutate(pred.beta = unname(beta.fit$fitted.values) * 100,
         resid.beta = exam.raw - pred.beta)
```

.center[

![](rug2020_presentation_files/figure-html/plot_beta_preds-1.png)<!-- -->

]

---
class: inverse, center, middle

# Comparing regression types

---

# Comparing regression types: Parameter estimates

.center[

![](rug2020_presentation_files/figure-html/plot_parameters-1.png)<!-- -->

]

---

# Comparing regression types: Predictions

.center[

![](rug2020_presentation_files/figure-html/plot_predictions-1.png)<!-- -->

]

---

# Comparing regression types: Residuals

.center[

![](rug2020_presentation_files/figure-html/plot_residuals-1.png)<!-- -->

]

---

# Comparing regression types: Simulated datasets

+ I simulated score as a function of "GPA" with known parameters
+ Simulations varied along several dimensions
    + Average score: mid (0.5), high (0.8)
    + Standard deviation of scores: narrow (0.2), wide (0.5)
    + Number of observations: 100, 500
    + Strength of association between GPA and score (5 steps)
    + Distribution that generated score from GPA (censored, logit-normal, beta)
+ 100 simulated datasets per simulation type; fit four models to each

---

# Comparing regression types: Simulated datasets

.center[

<img src="images/simulated_dataset_mid_wide_100.png" width="80%" />

]

---

# Comparing regression types: Simulated datasets

.center[

<img src="images/simulated_dataset_mid_wide_500.png" width="80%" />

]

---

# Comparing regression types: Simulated datasets

.center[

<img src="images/simulated_dataset_high_wide_500.png" width="80%" />

]

---

# Comparing regression types: Simulated datasets

.center[

<img src="images/simulated_dataset_high_narrow_500.png" width="80%" />

]

---

# Comparing regression types: Simulated datasets

.center[

<img src="images/simulated_datasets.png" width="80%" />

]

---

# Comparing regression types: Simulated datasets

.center[

<img src="images/significant_effects_mid_wide_500.png" width="80%" />

]

---

# Comparing regression types: Simulated datasets

.center[

<img src="images/significant_effects.png" width="90%" />

]

---

# Comparing regression types: Simulated datasets

.center[

<img src="images/residuals.png" width="100%" />

]

---
class: inverse, center, middle

# Ordered logistic regression

---

# Discrete ordered responses

.center[

![](rug2020_presentation_files/figure-html/plot_grades-1.png)<!-- -->

]

---

# Fitting a classical linear regression

```r
discrete.normal.fit = lm(as.numeric(exam.grade) ~ battery + followers + skywalker +
                           left + test.1 + test.2 + test.3,
                         data = exams.df)
round(coef(summary(discrete.normal.fit)), 4)
```

```
##             Estimate Std. Error  t value Pr(>|t|)
## (Intercept)   2.8724     0.0228 126.0445   0.0000
## battery       0.1169     0.0153   7.6321   0.0000
## followers    -0.0373     0.0149  -2.4960   0.0126
## skywalker    -0.2107     0.0299  -7.0427   0.0000
## left          0.0340     0.0636   0.5338   0.5935
## test.1        0.2472     0.0212  11.6613   0.0000
## test.2        0.3353     0.0252  13.2974   0.0000
## test.3        0.4919     0.0239  20.6177   0.0000
```

---

# Predictions of a classical linear regression

```r
exams.df$pred.discrete.normal = round(discrete.normal.fit$fitted.values)
```

.center[

![](rug2020_presentation_files/figure-html/plot_discrete_normal_preds-1.png)<!-- -->

]

---

# Logistic regression

`$$z_i: \mbox{student's performance}$$`

`$$z_i \sim \mbox{Logistic}(\alpha + \beta x_i)$$`

`$$y_i: \mbox{whether the student passed}$$`

`$$P(y_i = 1) = P(z_i > 0)$$`

.center[

![](rug2020_presentation_files/figure-html/logistic_regression_schema-1.png)<!-- -->

]

---

# Ordered logistic regression

`$$z_i: \mbox{student's performance}$$`

`$$z_i \sim \mbox{Logistic}(\beta x_i)$$`

`$$y_i: \mbox{student's grade}$$`

`$$P(y_i = B) = P(z_i > c_{C|B} \mbox{ and } z_i \leq c_{B|A})$$`

.center[

![](rug2020_presentation_files/figure-html/ordered_logistic_regression_schema-1.png)<!-- -->

]

---

# Fitting an ordered logistic regression

```r
library(MASS)
ordered.logistic.fit = polr(exam.grade ~ battery + followers + skywalker + left +
                              test.1 + test.2 + test.3,
                            data = exams.df)
round(coef(summary(ordered.logistic.fit)), 4)
```

```
##             Value Std. Error  t value
## battery    0.1523     0.0326   4.6768
## followers  0.0100     0.0312   0.3200
## skywalker -0.2495     0.0619  -4.0291
## left       0.0183     0.1375   0.1333
## test.1     0.9569     0.0614  15.5799
## test.2     1.2242     0.0670  18.2774
## test.3     1.7485     0.0734  23.8085
## F|D       -1.3034     0.0642 -20.2982
## D|C        0.1715     0.0599   2.8646
## C|B        1.7379     0.0651  26.6804
## B|A        3.7776     0.0813  46.4417
```

---

# Predictions of an ordered logistic regression

```r
exams.df$pred.ordered.logistic = apply(
  ordered.logistic.fit$fitted.values, 1,
  function(x) {
    c("F", "D", "C", "B", "A")[which(x == max(x))]
  }
) %>%
  fct_relevel("F", "D", "C", "B", "A")
```

---

# Predictions of an ordered logistic regression

.center[

![](rug2020_presentation_files/figure-html/plot_ordered_logistic_preds-1.png)<!-- -->

]

---

# Latent variable and actual scores

```r
exams.df$latent.ordered.logistic = ordered.logistic.fit$lp
```

.center[

![](rug2020_presentation_files/figure-html/plot_ordered_logistic_latent-1.png)<!-- -->

]

---
class: inverse, center, middle

# Multinomial logistic regression

---

# Discrete unordered responses

.center[

![](rug2020_presentation_files/figure-html/plot_days-1.png)<!-- -->

]

---

# Multinomial logistic regression

`$$z_{ik}: \mbox{propensity of observation } i \mbox{ to fall into category } k$$`

`$$z_{ik} = \Bigg\{\begin{array}{ll} 0 & k = 1 \\ \alpha_k + \beta_k x_{ik} & k > 1 \end{array}$$`

`$$P(y_i = k) = \frac{e^{z_{ik}}}{\sum_{k = 1}^K e^{z_{ik}}}$$`

---

# Multinomial logistic regression

<img src="rug2020_presentation_files/figure-html/plot_multi_schema-1.png" style="display: block; margin: auto;" />

---

# Fitting a multinomial logistic regression

```r
# Fit a model in R.
library(nnet)
multi.fit = multinom(days ~ battery + followers + skywalker + left,
                     data = exams.df)
round(coef(summary(multi.fit)), 4)
```

---

# Predictions of a multinomial logistic regression

```r
exams.df$pred.multi = apply(
  multi.fit$fitted.values, 1,
  function(x) {
    c("Online", "MWF", "TR")[which(x == max(x))]
  }
) %>%
  fct_relevel("Online", "MWF", "TR")
```

---

# Predictions of a multinomial logistic regression

.center[

<img src="rug2020_presentation_files/figure-html/plot_multi_preds-1.png" style="display: block; margin: auto;" />

]
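---

# Predictions of a multinomial logistic regression

The `apply()` trick above picks the most likely class by hand; `predict()` can do the same thing directly. A minimal self-contained sketch (the `toy` data frame is made up, standing in for `exams.df`):

```r
library(nnet)

# Made-up stand-in for the exam data, just to keep the sketch runnable
set.seed(42)
toy = data.frame(
  days = factor(sample(c("Online", "MWF", "TR"), 200, replace = TRUE)),
  battery = rnorm(200)
)

toy.fit = multinom(days ~ battery, data = toy, trace = FALSE)

# Most likely class for each observation (same result as the apply() above)
head(predict(toy.fit, type = "class"))

# Full matrix of class probabilities; each row sums to 1
head(predict(toy.fit, type = "probs"))
```

`type = "probs"` is the multinomial analogue of `fitted.values`, which is handy when you care about how confident the model is, not just its top choice.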