AE 20: A Dataset’s Life

Nope, it’s not A Hero’s Life, or A Boy’s Life, or A Nun’s Life; it’s A Dataset’s Life. Let’s take a data set through all four steps in our data science workflow:

preparation: import, join, clean, reshape, mutate, etc;
exploration: pictures (ABV!) and a concise set of numerical summaries;
modeling: linear/logistic regression, model selection, etc;
inference: interval estimation and hypothesis testing.

library(tidyverse)
library(tidymodels)

If you’re like me, you are sick of the penguins at this point. So for our last act, let’s banish them to worlds beyond. I packed all of those lil’ beasts onto a rocket ship and took them to both Mars and Venus, where I recorded their weights (in pounds). I then abandoned the ship to hurtle through space so that future stawanana students never have to study them again.

1. Preparation

Load the data

I have two sets of measurements for the same sample of penguins: one set taken on Mars, and another on Venus. They are contained in two separate files that need to be merged:

penguins_venus <- read_csv("data/penguins_venus.csv")
penguins_mars <- read_csv("data/penguins_mars.csv")

glimpse(penguins_venus)

Rows: 344
Columns: 6
$ id                <dbl> 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 1…
$ species           <chr> "Adelie", "Adelie", "Adelie", "Adelie", "Adelie", "A…
$ island            <chr> "Torgersen", "Torgersen", "Torgersen", "Torgersen", …
$ sex               <chr> "male", "female", "female", NA, "female", "male", "f…
$ flipper_length_mm <dbl> 181, 186, 195, NA, 193, 190, 181, 195, 193, 190, 186…
$ weight_lb         <dbl> 8.356999, 7.347528, 6.165162, NA, 9.173660, 8.157105…

glimpse(penguins_mars)

Rows: 344
Columns: 2
$ id        <dbl> 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16, 17, 1…
$ weight_lb <chr> "3.8219301852578837", "3.56777681688403", "1.761707536468984…

Merge the data

Task: Use an appropriate join function to create a new data frame called space_penguins, where each penguin’s weight on either planet is recorded in a separate column:

space_penguins <- full_join(penguins_venus, penguins_mars, by = join_by(id)) |>
  rename(venus_weight = weight_lb.x,
         mars_weight = weight_lb.y)
space_penguins

# A tibble: 344 × 7
      id species island    sex    flipper_length_mm venus_weight mars_weight    
   <dbl> <chr>   <chr>     <chr>              <dbl>        <dbl> <chr>          
 1     1 Adelie  Torgersen male                 181         8.36 3.821930185257…
 2     2 Adelie  Torgersen female               186         7.35 3.567776816884…
 3     3 Adelie  Torgersen female               195         6.17 1.761707536468…
 4     4 Adelie  Torgersen <NA>                  NA        NA    <NA>           
 5     5 Adelie  Torgersen female               193         9.17 3.073547356370…
 6     6 Adelie  Torgersen male                 190         8.16 2.026601878680…
 7     7 Adelie  Torgersen female               181         8.58 3.318306212723…
 8     8 Adelie  Torgersen male                 195        11.9  3.772687060633…
 9     9 Adelie  Torgersen <NA>                 193         8.14 2.603709532256…
10    10 Adelie  Torgersen <NA>                 190         8.10 3.002844015041…
# ℹ 334 more rows

Reshape the data

Task: Use an appropriate pivot function to create a new version of space_penguins where the information about the planet where the measurements were taken is its own column:

space_penguins_long <- space_penguins |>
  pivot_longer(
    cols = c(venus_weight, mars_weight),
    names_to = "planet",
    values_to = "weight_lb"
  )

Error in `pivot_longer()`:
! Can't combine `venus_weight` <double> and `mars_weight` <character>.

Woops! Looks like we might have a problem here. Why is it saying that the Mars weights are type character?

space_penguins |>
  select(mars_weight) |>
  arrange(desc(mars_weight))

# A tibble: 344 × 1
   mars_weight                          
   <chr>                                
 1 this is why we can't have nice things
 2 get wrecked                          
 3 6.93005873361171                     
 4 6.233851410442386                    
 5 6.151607146997191                    
 6 6.054400666505051                    
 7 6.021791903417968                    
 8 6.010224782201169                    
 9 5.99589305540989                     
10 5.844409564530165                    
# ℹ 334 more rows

Oh. That’s really stupid. If we tell it to convert this column to numbers, it will do it, and anything it cannot convert (such as those two pieces of text) will simply be replaced with NA, which is fine by me in this case:

space_penguins <- space_penguins |>
  mutate(
    mars_weight = as.numeric(mars_weight)
  )

space_penguins

# A tibble: 344 × 7
      id species island    sex    flipper_length_mm venus_weight mars_weight
   <dbl> <chr>   <chr>     <chr>              <dbl>        <dbl>       <dbl>
 1     1 Adelie  Torgersen male                 181         8.36        3.82
 2     2 Adelie  Torgersen female               186         7.35        3.57
 3     3 Adelie  Torgersen female               195         6.17        1.76
 4     4 Adelie  Torgersen <NA>                  NA        NA          NA   
 5     5 Adelie  Torgersen female               193         9.17        3.07
 6     6 Adelie  Torgersen male                 190         8.16        2.03
 7     7 Adelie  Torgersen female               181         8.58        3.32
 8     8 Adelie  Torgersen male                 195        11.9         3.77
 9     9 Adelie  Torgersen <NA>                 193         8.14        2.60
10    10 Adelie  Torgersen <NA>                 190         8.10        3.00
# ℹ 334 more rows

Alright, let’s try to reshape again:

space_penguins_long <- space_penguins |>
  pivot_longer(
    cols = c(venus_weight, mars_weight),
    names_to = "planet",
    values_to = "weight_lb"
  ) |>
  mutate(
    planet = if_else(planet == "venus_weight", "Venus", "Mars")
  )
space_penguins_long

# A tibble: 688 × 7
      id species island    sex    flipper_length_mm planet weight_lb
   <dbl> <chr>   <chr>     <chr>              <dbl> <chr>      <dbl>
 1     1 Adelie  Torgersen male                 181 Venus       8.36
 2     1 Adelie  Torgersen male                 181 Mars        3.82
 3     2 Adelie  Torgersen female               186 Venus       7.35
 4     2 Adelie  Torgersen female               186 Mars        3.57
 5     3 Adelie  Torgersen female               195 Venus       6.17
 6     3 Adelie  Torgersen female               195 Mars        1.76
 7     4 Adelie  Torgersen <NA>                  NA Venus      NA   
 8     4 Adelie  Torgersen <NA>                  NA Mars       NA   
 9     5 Adelie  Torgersen female               193 Venus       9.17
10     5 Adelie  Torgersen female               193 Mars        3.07
# ℹ 678 more rows

So, we have the wide view in space_penguins, where each row is a penguin, and we have the long view in space_penguins_long where each row is a penguin/planet. Which of these is “tidy”? As we will see, it depends on the task.

2. Exploration

Pictures

Task: Generate side-by-side density plots of the weights on each planet.

ggplot(space_penguins_long, aes(x = weight_lb, fill = planet)) + 
  geom_density(alpha = 0.5)

So, weights on Mars are systematically lower than weights on Venus.

If we wished to slice-and-dice the data further according to other categorical variables like sex or island, we could do it with one of those darned facet functions (I can never remember which is which or where the ~ goes):

ggplot(space_penguins_long, aes(x = weight_lb, fill = planet)) + 
  geom_density(alpha = 0.5) + 
  facet_grid(island ~ .)

A concise set of numerical summaries

Task: compute the mean, standard deviation, and median of weight on each planet:

space_penguins_long |>
  group_by(planet) |>
  summarize(
    mean_wt = mean(weight_lb, na.rm = TRUE),
    sd_wt = sd(weight_lb, na.rm = TRUE),
    med_wt = median(weight_lb, na.rm = TRUE)
  )

# A tibble: 2 × 4
  planet mean_wt sd_wt med_wt
  <chr>    <dbl> <dbl>  <dbl>
1 Mars      3.58  1.19   3.62
2 Venus     8.46  1.94   8.36

Notice that these numbers match what we saw in the plot. The distributions are symmetric, and so the means and medians are about the same. Furthermore, the spread of the Mars distribution was lower than the spread of the Venus distribution, and we see that in the numbers as well.

3. Modeling

What is the conversion factor between weights on Venus and weights on Mars? If you ask a physicist, I’m pretty sure they will tell you it’s this:

\[ Mars~weight~(lb)\approx Venus~weight~(lb)\times 0.41. \]

So, you weigh less on Mars than you do on Venus. Something something gravity. If we didn’t already know this, could we learn it from our data? Note that is is not a foregone conclusion, because our data are noisy. The weights have been measured with error, and so what should be a perfect straight line relationship won’t be, exactly.

Linear

Task: Create a scatterplot of Venus weights versus Mars weights, and add a line of best fit:

ggplot(space_penguins, aes(x = venus_weight, y = mars_weight)) + 
  geom_point() + 
  geom_smooth(method = "lm")

Plotting “this” versus “that”

There has been some controversy about how to read a phrase like “Venus weights versus Mars weights.” I personally take it to mean that Venus weights go on the \(x\)-axis and Mars weights go on the \(y\)-axis, but some on the teaching team think I am psychotic for thinking this. Based on my informal polling of students, we all would seem to be evenly split on this point. Anyway, you’re not being tested on this, and on an exam I will try to be sufficiently clear so that it doesn’t end up mattering.

Task: Compute and interpret the estimated slope and intercept of the best fit line:

linear_fit <- linear_reg() |>
  fit(mars_weight ~ venus_weight, data = space_penguins)

tidy(linear_fit)

# A tibble: 2 × 5
  term         estimate std.error statistic  p.value
  <chr>           <dbl>     <dbl>     <dbl>    <dbl>
1 (Intercept)     1.17     0.256       4.57 6.80e- 6
2 venus_weight    0.285    0.0295      9.68 1.02e-19

\[ \widehat{Mars~weight} = 1.17 + 0.28\times Venus~weight \]

Interpretations:

intercept: if a penguin weighs zero pounds on Venus, we predict that, on average, they weigh 1.17 pounds on Mars. This obviously makes no sense;
slope: a one pound increase on Venus predicts a 0.28 pounds increase on Mars, on average.

Hmmm. We know that the “true” relationship has an intercept of 0 and a slope of 0.41, so this is not so good.

Don’t read this. You don’t need to know this.

What we’re witnessing here is attenuation bias.

Logistic

Task: fit a logistic regression that predicts the planet the measurement was made on from the value of the measurement. Lower measurements were probably more likely to have been taken on Mars, as we saw.

space_penguins_long <- space_penguins_long |>
  mutate(planet = as_factor(planet))

levels(space_penguins_long$planet)

[1] "Venus" "Mars"

logistic_fit <- logistic_reg() |>
  fit(planet ~ weight_lb, data = space_penguins_long)

tidy(logistic_fit)

# A tibble: 2 × 5
  term        estimate std.error statistic  p.value
  <chr>          <dbl>     <dbl>     <dbl>    <dbl>
1 (Intercept)    14.4      1.51       9.53 1.50e-21
2 weight_lb      -2.60     0.277     -9.39 5.86e-21

\[ \log\left(\frac{\hat{p}}{1-\hat{p}}\right)=14.4-2.6\times weight \]

Interpretations:

intercept: we predict a weight measurement of zero lb to have come from the planet Mars with an odds of \(e^{14.4}\approx 1794075\). Again, doesn’t make a tremendous amount of sense, but in general, the lower the weight, the more likely it was to have been recorded on Mars;
slope: for each one pound increase in the measurement, we predict on average that the odds of the measurement coming from Mars decrease by a multiplicative factor of \(e^{-2.6}\approx 0.074\). That is precipitous!

Task: generate the ROC curve for this model:

log_aug <- augment(logistic_fit, space_penguins_long)

log_roc <- roc_curve(log_aug, 
                     truth = planet, 
                     .pred_Mars, 
                     event_level = "second")

ggplot(log_roc, aes(x = 1 - specificity, y = sensitivity)) +
  geom_path() +
  geom_abline(lty = 3) +
  coord_equal() + 
  theme_minimal()

So, given a measurement, it’s not too hard to predict what planet it was taken on. Our model does a pretty good job.

4. Inference

Back to linear regression now…

Interval estimation

Task: Compute, visualize, and interpret a 95% confidence interval for the slope of the regression line:

observed_fit <- space_penguins |>
  specify(mars_weight ~ venus_weight) |>
  fit()

set.seed(8675309)
boot_fits <- space_penguins |>
  specify(mars_weight ~ venus_weight) |>
  generate(reps = 1000, type = "bootstrap") |>
  fit()

ci_95 <- get_confidence_interval(
  boot_fits,
  point_estimate = observed_fit,
  level = 0.95,
  type = "percentile"
)
ci_95

# A tibble: 2 × 3
  term         lower_ci upper_ci
  <chr>           <dbl>    <dbl>
1 intercept       0.654    1.69 
2 venus_weight    0.228    0.342

visualize(boot_fits) + 
  shade_confidence_interval(ci_95)

We are 95% confident that the true slope is between 0.228 and 0.342.

Hypothesis testing

Task: Test the following hypotheses at the 5% level:

\[ \begin{aligned} H_0&:\beta_1=0\\ H_A&:\beta_1\neq0 \end{aligned} \]

set.seed(20)
null_dist <- space_penguins |>
  specify(mars_weight ~ venus_weight) |>
  hypothesize(null = "independence") |>
  generate(reps = 1000, type = "permute") |>
  fit()

visualize(null_dist) +
  shade_p_value(obs_stat = observed_fit, direction = "two-sided")

null_dist |>
  get_p_value(obs_stat = observed_fit, direction = "two-sided")

# A tibble: 2 × 2
  term         p_value
  <chr>          <dbl>
1 intercept          0
2 venus_weight       0

The p-value is basically zero. This means that, in a world where the null hypothesis is true, and \(\beta_1\) does in fact equal zero, then there is a 0% chance that we would observe an estimated slope as or more extreme than the one we saw. This provides evidence that the null is totally bogus, and so we reject it.

Task: Test the following hypotheses at the 5% level:

\[ \begin{aligned} H_0&:\beta_1=0.41\\ H_A&:\beta_1\neq0.41 \end{aligned} \]

We haven’t exactly learned how to do this, but there’s a duality between confidence intervals and hypothesis testing. Determining how far the estimate is in the tails of the null distribution is the same as determining whether or not the hypothesized value is contained in a confidence interval. Our 95% confidence interval was (0.228, 0.342). This does not contain the hypothesized value of 0.41, meaning that the hypothesized value is not within the range of likely values. So we reject the null at the 5% level (because it was a 95% interval).

Reality check

This is a setting where we sorta know the true value. Physicists are pretty sure the slope is 0.41, so it’s alarming that we just rejected that hypothesis. What’s a boy to do?

BONUS: regression through the origin

This is the (population) linear model we are estimating with data:

\[ Mars~weight = \beta_0+\beta_1\times Venus~weight + \varepsilon. \]

So we are allowing for the possibility that \(\beta_0\neq 0\) even though we know this is logically impossible. If the weight on Venus is zero, then the weight on Mars for the same subject must also be zero. Period.

If we wish to force the intercept to be zero to align with this logic, then we are using the regression-through-the-origin model:

\[ Mars~weight = \beta_1\times Venus~weight + \varepsilon. \]

We still estimate \(\beta_1\) using least squares, but now we’re getting the best fitting line that goes through the origin, as opposed to the best fitting line that could go anywhere.

This is how to fit this in R. That -1 syntax is like saying “remove the intercept.”

better_fit <- linear_reg() |>
  fit(mars_weight ~ venus_weight - 1, data = space_penguins)

tidy(better_fit)

# A tibble: 1 × 5
  term         estimate std.error statistic   p.value
  <chr>           <dbl>     <dbl>     <dbl>     <dbl>
1 venus_weight    0.417   0.00678      61.5 6.26e-186

Hey good lookin’, what’s cookin’! That’s a number I like. So, when we use some domain knowledge to change the model and force it to better match how reality works, we get better results. Here’s how the previous line of best fit and the best-fit-through-the-origin line compare:

ggplot(space_penguins, aes(x = venus_weight, y = mars_weight)) + 
  geom_point() + 
  geom_smooth(method = "lm", se = FALSE) + 
  geom_smooth(method = "lm", se = FALSE, formula = y ~ x - 1, color = "red") + 
  xlim(0, 14) + 
  ylim(0, 7)

If we redo the confidence interval for the slope in this new model, the “true value” is comfortably inside:

observed_fit <- space_penguins |>
  specify(mars_weight ~ venus_weight - 1) |>
  fit()

set.seed(8675309)
boot_fits <- space_penguins |>
  specify(mars_weight ~ venus_weight - 1) |>
  generate(reps = 1000, type = "bootstrap") |>
  fit()

ci_90 <- get_confidence_interval(
  boot_fits,
  point_estimate = observed_fit,
  level = 0.95,
  type = "percentile"
)
ci_90

# A tibble: 1 × 3
  term         lower_ci upper_ci
  <chr>           <dbl>    <dbl>
1 venus_weight    0.404    0.430

visualize(boot_fits) + 
  shade_confidence_interval(ci_90)

Fab.