Joining data

Lecture 7

John Zito

Duke University
STA 199 Spring 2025

2025-02-04

While you wait…

Prepare for today’s application exercise: ae-06-taxes-join

  • Go to your ae project in RStudio.

  • Make sure all of your changes up to this point are committed and pushed, i.e., there’s nothing left in your Git pane.

  • Click Pull to get today’s application exercise file: ae-06-taxes-join.qmd.

  • Wait till the you’re prompted to work on the application exercise during class before editing the file.

Recap: pivoting

Recoding data

What’s going on in this plot?

Can you guess the variable plotted here?

Sales taxes in US states

sales_taxes
# A tibble: 51 × 5
   state      state_tax_rate avg_local_tax_rate combined_rate max_local_tax_rate
   <chr>               <dbl>              <dbl>         <dbl>              <dbl>
 1 Alabama            0.04               0.0529        0.0929             0.075 
 2 Alaska             0                  0.0182        0.0182             0.0785
 3 Arizona            0.056              0.0278        0.0838             0.053 
 4 Arkansas           0.065              0.0295        0.0945             0.0613
 5 California         0.0725             0.016         0.0885             0.0475
 6 Colorado           0.029              0.0491        0.0781             0.083 
 7 Connectic…         0.0635             0             0.0635             0     
 8 Delaware           0                  0             0                  0     
 9 Florida            0.06               0.01          0.07               0.02  
10 Georgia            0.04               0.0338        0.0738             0.05  
# ℹ 41 more rows

Sales tax in swing states

Suppose you’re tasked with the following:

Compare the average state sales tax rates of swing states (Arizona, Georgia, Michigan, Nevada, North Carolina, Pennsylvania, and Wisconsin) vs. non-swing states.

How would you approach this task?

  • Create a new variable called swing_state with levels "Swing" and "Non-swing"
  • Group by swing_state
  • Summarize to find the mean sales tax in each type of state

ae-06-taxes-join

  • Go to your ae project in RStudio.

  • If you haven’t yet done so, make sure all of your changes up to this point are committed and pushed, i.e., there’s nothing left in your Git pane.

  • If you haven’t yet done so, click Pull to get today’s application exercise file: ae-06-taxes-join.qmd.

  • Work through the application exercise in class, and render, commit, and push your edits by the end of class.

mutate() with if_else()

Create a new variable called swing_state with levels "Swing" and "Non-swing".

list_of_swing_states <- c("Arizona", "Georgia", "Michigan", "Nevada", 
                          "North Carolina", "Pennsylvania", "Wisconsin")

sales_taxes <- sales_taxes |>
  mutate(
    swing_state = if_else(state %in% list_of_swing_states,
                          "Swing",
                          "Non-swing")) |>
  relocate(swing_state)

sales_taxes
# A tibble: 51 × 6
   swing_state state       state_tax_rate avg_local_tax_rate combined_rate
   <chr>       <chr>                <dbl>              <dbl>         <dbl>
 1 Non-swing   Alabama             0.04               0.0529        0.0929
 2 Non-swing   Alaska              0                  0.0182        0.0182
 3 Swing       Arizona             0.056              0.0278        0.0838
 4 Non-swing   Arkansas            0.065              0.0295        0.0945
 5 Non-swing   California          0.0725             0.016         0.0885
 6 Non-swing   Colorado            0.029              0.0491        0.0781
 7 Non-swing   Connecticut         0.0635             0             0.0635
 8 Non-swing   Delaware            0                  0             0     
 9 Non-swing   Florida             0.06               0.01          0.07  
10 Swing       Georgia             0.04               0.0338        0.0738
# ℹ 41 more rows
# ℹ 1 more variable: max_local_tax_rate <dbl>

Recap: if_else()

if_else(
1  x == y,
2  "x is equal to y",
3  "x is not equal to y"
)
1
Condition
2
Value if condition is TRUE
3
Value if condition is FALSE

Sales tax in swing states

Compare the average state sales tax rates of swing states vs. non-swing states.

sales_taxes |>
  group_by(swing_state) |>
  summarize(mean_state_tax = mean(state_tax_rate))
# A tibble: 2 × 2
  swing_state mean_state_tax
  <chr>                <dbl>
1 Non-swing           0.0504
2 Swing               0.0546

Sales tax in coastal states

Suppose you’re tasked with the following:

Compare the average state sales tax rates of states on the Pacific Coast, states on the Atlantic Coast, and the rest of the states.

How would you approach this task?

  • Create a new variable called coast with levels "Pacific", "Atlantic", and "Neither"
  • Group by coast
  • Summarize to find the mean sales tax in each type of state

mutate() with case_when()

Create a new variable called coast with levels "Pacific", "Atlantic", and "Neither".

pacific_coast <- c("Alaska", "Washington", "Oregon", "California", "Hawaii")

atlantic_coast <- c(
  "Connecticut", "Delaware", "Georgia", "Florida", "Maine", "Maryland", 
  "Massachusetts", "New Hampshire", "New Jersey", "New York", 
  "North Carolina", "Rhode Island", "South Carolina", "Virginia"
)

sales_taxes <- sales_taxes |>
  mutate(
    coast = case_when(
      state %in% pacific_coast ~ "Pacific",
      state %in% atlantic_coast ~ "Atlantic",
      .default = "Neither"
    )
  ) |>
  relocate(coast)

sales_taxes
# A tibble: 51 × 7
   coast    swing_state state    state_tax_rate avg_local_tax_rate combined_rate
   <chr>    <chr>       <chr>             <dbl>              <dbl>         <dbl>
 1 Neither  Non-swing   Alabama          0.04               0.0529        0.0929
 2 Pacific  Non-swing   Alaska           0                  0.0182        0.0182
 3 Neither  Swing       Arizona          0.056              0.0278        0.0838
 4 Neither  Non-swing   Arkansas         0.065              0.0295        0.0945
 5 Pacific  Non-swing   Califor…         0.0725             0.016         0.0885
 6 Neither  Non-swing   Colorado         0.029              0.0491        0.0781
 7 Atlantic Non-swing   Connect…         0.0635             0             0.0635
 8 Atlantic Non-swing   Delaware         0                  0             0     
 9 Atlantic Non-swing   Florida          0.06               0.01          0.07  
10 Atlantic Swing       Georgia          0.04               0.0338        0.0738
# ℹ 41 more rows
# ℹ 1 more variable: max_local_tax_rate <dbl>

Recap: case_when()

case_when(
1  x > y  ~ "x is greater than y",
2  x < y  ~ "x is less than y",
3  .default = "x is equal to y"
)
1
Value if first condition is TRUE
2
Value if second condition is TRUE
3
Value if neither condition is TRUE, i.e., default value

Sales tax in coastal states

Compare the average state sales tax rates of states on the Pacific Coast, states on the Atlantic Coast, and the rest of the states.

sales_taxes |>
  group_by(coast) |>
  summarize(mean_state_tax = mean(state_tax_rate))
# A tibble: 3 × 2
  coast    mean_state_tax
  <chr>             <dbl>
1 Atlantic         0.0484
2 Neither          0.0545
3 Pacific          0.0355

Sales tax in US regions

Suppose you’re tasked with the following:

Compare the average state sales tax rates of states in various regions (Midwest - 12 states, Northeast - 9 states, South - 16 states, West - 13 states).

How would you approach this task?

  • Create a new variable called region with levels "Midwest", "Northeast", "South", and "West".
  • Group by region
  • Summarize to find the mean sales tax in each type of state

mutate() with case_when()

Who feels like filling in the blanks lists of states in each region? Who feels like it’s simply too tedious to write out names of all states?

list_of_midwest_states <- c(___)
list_of_northeast_states <- c(___)
list_of_south_states <- c(___)
list_of_west_states <- c(___)

sales_taxes <- sales_taxes |>
  mutate(
    coast = case_when(
      state %in% list_of_west_states ~ "Midwest",
      state %in% list_of_northeast_states ~ "Northeast",
      state %in% list_of_south_states ~ "South",
      state %in% list_of_west_states ~ "West"
    )
  )

Joining data

Why join?

Suppose we want to answer questions like:

Is there a relationship between
- number of QS courses taken
- having scored a 4 or 5 on the AP stats exam
- motivation for taking course
- …
and performance in this course?”

Each of these would require joining class performance data with an outside data source so we can have all relevant information (columns) in a single data frame.

Why join?

Suppose we want to answer questions like:

Compare the average state sales tax rates of states in various regions (Midwest - 12 states, Northeast - 9 states, South - 16 states, West - 13 states).

This can also be solved with joining region information with the state-level sales tax data.

Setup

For the next few slides…

x <- tibble(
  id = c(1, 2, 3),
  value_x = c("x1", "x2", "x3")
  )

x
# A tibble: 3 × 2
     id value_x
  <dbl> <chr>  
1     1 x1     
2     2 x2     
3     3 x3     
y <- tibble(
  id = c(1, 2, 4),
  value_y = c("y1", "y2", "y4")
  )

y
# A tibble: 3 × 2
     id value_y
  <dbl> <chr>  
1     1 y1     
2     2 y2     
3     4 y4     

left_join()

left_join(x, y)
# A tibble: 3 × 3
     id value_x value_y
  <dbl> <chr>   <chr>  
1     1 x1      y1     
2     2 x2      y2     
3     3 x3      <NA>   

right_join()

right_join(x, y)
# A tibble: 3 × 3
     id value_x value_y
  <dbl> <chr>   <chr>  
1     1 x1      y1     
2     2 x2      y2     
3     4 <NA>    y4     

full_join()

full_join(x, y)
# A tibble: 4 × 3
     id value_x value_y
  <dbl> <chr>   <chr>  
1     1 x1      y1     
2     2 x2      y2     
3     3 x3      <NA>   
4     4 <NA>    y4     

inner_join()

inner_join(x, y)
# A tibble: 2 × 3
     id value_x value_y
  <dbl> <chr>   <chr>  
1     1 x1      y1     
2     2 x2      y2     

semi_join()

semi_join(x, y)
# A tibble: 2 × 2
     id value_x
  <dbl> <chr>  
1     1 x1     
2     2 x2     

anti_join()

anti_join(x, y)
# A tibble: 1 × 2
     id value_x
  <dbl> <chr>  
1     3 x3     

Summary