Lab 7

John Zito

Duke University
STA 199 Spring 2025

2025-03-31

Overview

Lab 7 Overview

  • Part 1: all things logistic regression

  • Part 2: a data science assessment survey

Data Science Reasoning Assessment

Data Science Reasoning Assessment

  • The goal of this data science assessment is to accurately measure introductory data science students’ reasoning skills.

  • You will be graded based on completion + honest effort. This assessment is not optional. It is meant to be an individual, closed-notes assessment.

  • At the end, you will be asked whether your anonymized responses can be used to help improve this assessment.

    • Your responses matter!
    • The goal is for this to become a national data science reasoning assessment for introductory courses
    • This starts with improving questions based on your responses

Logistic regression overview

forested data

  • 7107 rows and 19 columns;

  • Each observation (row) is a plot of land;

  • Variables include geographical and meteorological information about each plot, as well as a binary indicator forested (“Yes” or “No”);

  • Given information about a plot that is easy (and cheap) to collect remotely, can we use a model to predict if a plot is forested without actually visiting it (which could be difficult and costly)?

Goal

  • Use the data we’ve already seen to predict if a yet-to-be-observed plot of land is forested;

  • We want a model that does well on data it has never seen before;

  • “Out-of-sample” predictions on new data are more useful than “in-sample” predictions on old data;

Training versus testing data

To mimic this “out-of-sample” idea, we randomly split the data into two parts:

  • training data: this is what the model gets to see when we fit it;
  • test data: withheld. We assess how well the trained model can predict on this data it hasn’t seen before.

Randomly split data into training and test sets

By default it’s a 75%/25% training/test split.

# tidymodels provides initial_split(), training(), and testing()
library(tidymodels)

set.seed(8675309)

forested_split <- initial_split(forested)

forested_train <- training(forested_split)
forested_test <- testing(forested_split)

The split is random, but we want the results to be reproducible, so we “freeze the random numbers in time” by setting a seed. If we don’t tell you exactly what seed to use on an assignment, you can pick any positive integer you want.
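
FYI: if you ever want a different split, initial_split() also has a prop argument. A minimal sketch (not needed for Lab 7, where the default is fine):

# an 80%/20% split instead of the default 75%/25%
forested_split_80 <- initial_split(forested, prop = 0.80)

forested_train_80 <- training(forested_split_80)
forested_test_80 <- testing(forested_split_80)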

Explore: forested or not

ggplot(forested_train, aes(x = lon, y = lat, color = forested)) +
  geom_point(alpha = 0.7) +
  scale_color_manual(values = c("Yes" = "forestgreen", "No" = "gold2")) +
  theme_minimal()

Explore: annual precipitation

ggplot(forested_train, aes(x = lon, y = lat, color = precip_annual)) +
  geom_point(alpha = 0.7) +
  labs(color = "annual\nprecipitation\n(mm × 100)") +
  theme_minimal()

FYI: the response variable must be a factor

forested already comes as a factor, so we’re lucky:

class(forested$forested)
[1] "factor"
levels(forested$forested)
[1] "Yes" "No" 

But if it didn’t, things would not work:

logistic_reg() |>
  fit(as.numeric(forested) ~ precip_annual, data = forested_train)
Error in `check_outcome()`:
! For a classification model, the outcome should be a `factor`, not a `numeric`.
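
If your response variable were stored as character strings instead, one fix (a sketch, assuming the values really are “Yes”/“No”) is to convert it to a factor with explicit levels before fitting:

# convert the outcome to a factor with explicit levels
forested_train_fixed <- forested_train |>
  mutate(forested = factor(forested, levels = c("Yes", "No")))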

FYI: the base level is treated as “failure” (0)

The base level here is “Yes”, so “No” is treated as “success” (1):

levels(forested$forested)
[1] "Yes" "No" 

As a result, this code:

logistic_reg() |>
  fit(forested ~ precip_annual, data = forested_train)

Corresponds to this model:

\[ \text{Prob}( \texttt{forested = "No"} ) = \frac{e^{\beta_0+\beta_1 x}}{1 + e^{\beta_0+\beta_1 x}}. \]

This is not a problem, but it means that in order to interpret the output correctly, you need to understand how your factors are leveled.
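
If you would rather have “Yes” treated as the success (1), one option is to re-level the factor before fitting. A minimal sketch using fct_relevel() from forcats (loaded with the tidyverse):

# make "No" the base level, so "Yes" becomes the "success" category
forested_train_releveled <- forested_train |>
  mutate(forested = fct_relevel(forested, "No"))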

Fitting a logistic regression model

Similar syntax to linear regression:

forested_precip_fit <- logistic_reg() |>
  fit(forested ~ precip_annual, data = forested_train)

tidy(forested_precip_fit)
# A tibble: 2 × 5
  term          estimate std.error statistic   p.value
  <chr>            <dbl>     <dbl>     <dbl>     <dbl>
1 (Intercept)    1.57    0.0557         28.2 6.89e-175
2 precip_annual -0.00190 0.0000602     -31.6 5.98e-219

\[ \log\left(\frac{\hat{p}}{1-\hat{p}}\right) = 1.57 - 0.0019 \times precip. \]
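
Once the model is fit, you can also get predicted probabilities for any precipitation value by handing predict() new data. A minimal sketch (the value 1000 is just an illustration, not from the lab):

predict(
  forested_precip_fit,
  new_data = tibble(precip_annual = 1000),
  type = "prob"
)

This returns .pred_Yes and .pred_No, the same probabilities that augment() attaches later.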

Interpreting the intercept

\[ \begin{aligned} \log\left(\frac{\hat{p}}{1-\hat{p}}\right) &= 1.57 - 0.0019 \times precip\\ \frac{\hat{p}}{1-\hat{p}} &= e^{1.57 - 0.0019 \times precip} . \end{aligned} \]

So when \(precip = 0\), the model predicts that the odds of forested = "No" are \(e^{1.57}\approx 4.8\), on average.
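
A quick sanity check of that arithmetic in the console:

# odds of forested = "No" when precip = 0
exp(1.57)   # roughly 4.8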

Interpreting the slope

At \(precip\):

\[ \frac{\hat{p}}{1-\hat{p}} = {\color{blue}{e^{1.57 - 0.0019 \times precip}}} \]

At \(precip + 1\):

\[ \begin{aligned} \frac{\hat{p}}{1-\hat{p}} &= e^{1.57 - 0.0019 \times (precip + 1)} \\ &= e^{1.57 - 0.0019 \times precip - 0.0019} \\ &= {\color{blue}{e^{1.57 - 0.0019 \times precip}}} \times \color{red}{e^{-0.0019}} \end{aligned} \]

If \(precip\) increases by one unit, the model predicts a decrease in the odds that forested = "No" by a multiplicative factor of \(e^{-0.0019}\approx 0.998\), on average.
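
And the corresponding multiplicative factor:

# multiplicative change in the odds of forested = "No" per one-unit increase in precip
exp(-0.0019)   # roughly 0.998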

Generate predictions for the test data

Augment the test data frame with three new columns on the left that include model predictions (classifications and probabilities) for each row:

forested_precip_aug <- augment(forested_precip_fit, forested_test)
forested_precip_aug
# A tibble: 1,777 × 22
   .pred_class .pred_Yes .pred_No forested  year elevation eastness northness
   <fct>           <dbl>    <dbl> <fct>    <dbl>     <dbl>    <dbl>     <dbl>
 1 Yes             0.711  0.289   No        2005       164      -84        53
 2 Yes             0.968  0.0319  Yes       2003      1031      -49        86
 3 Yes             0.992  0.00806 Yes       2005      1330       99         7
 4 No              0.266  0.734   No        2014       507       44       -89
 5 No              0.263  0.737   No        2014       542      -32       -94
 6 No              0.267  0.733   No        2014       759       -2       -99
 7 No              0.232  0.768   No        2014       119        0         0
 8 No              0.241  0.759   No        2014       419       86       -49
 9 No              0.336  0.664   Yes       2014       569      -97       -21
10 No              0.279  0.721   No        2014       340      -54        83
# ℹ 1,767 more rows
# ℹ 14 more variables: roughness <dbl>, tree_no_tree <fct>, dew_temp <dbl>,
#   precip_annual <dbl>, temp_annual_mean <dbl>, temp_annual_min <dbl>,
#   temp_annual_max <dbl>, temp_january_min <dbl>, vapor_min <dbl>,
#   vapor_max <dbl>, canopy_cover <dbl>, lon <dbl>, lat <dbl>, land_type <fct>

How did the model perform?

  • Our test data have the truth in the forested column;
  • We can compare the predictions in .pred_class to the true values and see how we did;
  • Each plot lands in one of four possibilities: true positive, false negative, false positive, or true negative.
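
yardstick also has a conf_mat() helper that tabulates all four counts as a 2×2 confusion matrix. A minimal sketch (same counts as the count() approach on the next slide, just arranged differently):

conf_mat(forested_precip_aug, truth = forested, estimate = .pred_class)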

Getting the error rates

forested_precip_aug |>
  count(forested, .pred_class) |>
  group_by(forested) |>
  mutate(
    p = n / sum(n),
    decision = case_when(
      forested == "Yes" & .pred_class == "Yes" ~ "sensitivity",
      forested == "Yes" & .pred_class == "No" ~ "false negative",
      forested == "No" & .pred_class == "Yes" ~ "false positive",
      forested == "No" & .pred_class == "No" ~ "specificity",
    )
  )
# A tibble: 4 × 5
# Groups:   forested [2]
  forested .pred_class     n     p decision      
  <fct>    <fct>       <int> <dbl> <chr>         
1 Yes      Yes           683 0.702 sensitivity   
2 Yes      No            290 0.298 false negative
3 No       Yes           140 0.174 false positive
4 No       No            664 0.826 specificity   

FYI: the default threshold is 50%

  • The model produces probabilities: .pred_Yes and .pred_No;

  • The concrete classifications in the .pred_class column come from applying a 50% threshold to these probabilities:

\[ \widehat{\texttt{forested}}= \begin{cases} \texttt{"No"} & \texttt{.pred\_Yes} \leq 0.5\\ \texttt{"Yes"} & \texttt{.pred\_Yes} > 0.5. \end{cases} \]

  • If you want to override that default, you must do so manually.

Change threshold to 0.00

New threshold > New classifications > New error rates

forested_precip_aug |>
  mutate(
    .pred_class = if_else(.pred_Yes <= 0.0, "No", "Yes")
  ) |>
  count(forested, .pred_class) |>
  group_by(forested) |>
  mutate(
    p = n / sum(n),
  )
# A tibble: 2 × 4
# Groups:   forested [2]
  forested .pred_class     n     p
  <fct>    <chr>       <int> <dbl>
1 Yes      Yes           973     1
2 No       Yes           804     1

Change threshold to 0.25

New threshold > New classifications > New error rates

forested_precip_aug |>
  mutate(
    .pred_class = if_else(.pred_Yes <= 0.25, "No", "Yes")
  ) |>
  count(forested, .pred_class) |>
  group_by(forested) |>
  mutate(
    p = n / sum(n),
  )
# A tibble: 3 × 4
# Groups:   forested [2]
  forested .pred_class     n     p
  <fct>    <chr>       <int> <dbl>
1 Yes      Yes           973 1    
2 No       No            179 0.223
3 No       Yes           625 0.777

Change threshold to 0.50

New threshold > New classifications > New error rates

forested_precip_aug |>
  mutate(
    .pred_class = if_else(.pred_Yes <= 0.50, "No", "Yes")
  ) |>
  count(forested, .pred_class) |>
  group_by(forested) |>
  mutate(
    p = n / sum(n),
  )
# A tibble: 4 × 4
# Groups:   forested [2]
  forested .pred_class     n     p
  <fct>    <chr>       <int> <dbl>
1 Yes      No            290 0.298
2 Yes      Yes           683 0.702
3 No       No            664 0.826
4 No       Yes           140 0.174

Change threshold to 0.75

New threshold > New classifications > New error rates

forested_precip_aug |>
  mutate(
    .pred_class = if_else(.pred_Yes <= 0.75, "No", "Yes")
  ) |>
  count(forested, .pred_class) |>
  group_by(forested) |>
  mutate(
    p = n / sum(n),
  )
# A tibble: 4 × 4
# Groups:   forested [2]
  forested .pred_class     n      p
  <fct>    <chr>       <int>  <dbl>
1 Yes      No            474 0.487 
2 Yes      Yes           499 0.513 
3 No       No            731 0.909 
4 No       Yes            73 0.0908

Change threshold to 1.00

New threshold > New classifications > New error rates

forested_precip_aug |>
  mutate(
    .pred_class = if_else(.pred_Yes <= 1.00, "No", "Yes")
  ) |>
  count(forested, .pred_class) |>
  group_by(forested) |>
  mutate(
    p = n / sum(n),
  )
# A tibble: 2 × 4
# Groups:   forested [2]
  forested .pred_class     n     p
  <fct>    <chr>       <int> <dbl>
1 Yes      No            973     1
2 No       No            804     1

Picture how errors change with threshold (th)
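
The plots that appeared here showed how sensitivity and specificity trade off as the threshold moves. A sketch of one way to rebuild that picture (the threshold grid and the map_dfr() loop are my own choices, not from the original slides):

thresholds <- seq(0.05, 0.95, by = 0.05)

error_rates <- map_dfr(thresholds, \(th) {
  forested_precip_aug |>
    mutate(.pred_class = if_else(.pred_Yes <= th, "No", "Yes")) |>
    summarize(
      th = th,
      sensitivity = mean(.pred_class[forested == "Yes"] == "Yes"),
      specificity = mean(.pred_class[forested == "No"] == "No")
    )
})

ggplot(error_rates, aes(x = th)) +
  geom_line(aes(y = sensitivity, color = "sensitivity")) +
  geom_line(aes(y = specificity, color = "specificity")) +
  labs(x = "threshold", y = "rate", color = NULL) +
  theme_minimal()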

But that was tedious

Let’s do “all” of the thresholds and connect the dots:

forested_precip_roc <- roc_curve(forested_precip_aug, 
                                 truth = forested, 
                                 .pred_Yes, 
                                 event_level = "first")
forested_precip_roc
# A tibble: 1,180 × 3
   .threshold specificity sensitivity
        <dbl>       <dbl>       <dbl>
 1   -Inf         0                 1
 2      0.224     0                 1
 3      0.225     0.00124           1
 4      0.228     0.00249           1
 5      0.228     0.00498           1
 6      0.229     0.00622           1
 7      0.229     0.00746           1
 8      0.230     0.00995           1
 9      0.230     0.0124            1
10      0.231     0.0137            1
# ℹ 1,170 more rows

The ROC curve

ggplot(forested_precip_roc, aes(x = 1 - specificity, y = sensitivity)) +
  geom_path() +
  geom_abline(lty = 3) +
  coord_equal() + 
  theme_minimal()

The ROC curve

  • ROC stands for receiver operating characteristic;

  • This visualizes the classification accuracy across a range of thresholds;

  • The more “up and to the left” it is, the better.

  • We can quantify “up and to the left” with the area under the curve (AUC).

The ROC curve

AUC = 1

This is the best we could possibly do: at some threshold the model separates the classes perfectly, so the ROC curve hugs the top-left corner.

AUC = 1 / 2

Don’t waste time fitting a model. Just flip a coin: the ROC curve falls along the diagonal line.

AUC = 0

This is the worst we could possibly do: the model’s rankings are exactly backwards, and the ROC curve hugs the bottom-right corner.

AUC for the model we fit

roc_auc(
  forested_precip_aug, 
  truth = forested, 
  .pred_Yes, 
  event_level = "first"
)
# A tibble: 1 × 3
  .metric .estimator .estimate
  <chr>   <chr>          <dbl>
1 roc_auc binary         0.879

Not bad!

The area under the ROC curve

  • This is a “quality score” for a logistic regression model;

  • When you compute it for a test data set that you set aside, the AUC is a measure of out-of-sample prediction accuracy;

  • AUC is a number between 0 and 1, where 0 is awful, 0.5 is no better than coin flipping, and 1 is perfect, similar in spirit to \(R^2\) for linear regression.

  • The function roc_auc computes it for you, and it takes the same set of arguments as roc_curve.

New commands introduced last week

  • logistic_reg
  • augment
  • roc_curve
  • roc_auc
  • set.seed
  • initial_split, training, testing
  • geom_path

You will use them all on Lab 7, and they should go on your Midterm 2 cheat sheet.
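
Putting them together, a bare-bones sketch of the full workflow from this lab (object names here are just illustrative):

library(tidymodels)

set.seed(8675309)

# split the data
forested_split <- initial_split(forested)
forested_train <- training(forested_split)
forested_test <- testing(forested_split)

# fit a logistic regression on the training data
forested_fit <- logistic_reg() |>
  fit(forested ~ precip_annual, data = forested_train)

# predict on the test data
forested_aug <- augment(forested_fit, forested_test)

# assess with the ROC curve and its AUC
forested_roc <- roc_curve(forested_aug, truth = forested, .pred_Yes, event_level = "first")

ggplot(forested_roc, aes(x = 1 - specificity, y = sensitivity)) +
  geom_path() +
  geom_abline(lty = 3) +
  coord_equal()

roc_auc(forested_aug, truth = forested, .pred_Yes, event_level = "first")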