Lecture 7
Duke University
STA 199 Spring 2025
2025-02-04
Prepare for today’s application exercise: ae-06-taxes-join
Go to your ae
project in RStudio.
Make sure all of your changes up to this point are committed and pushed, i.e., there’s nothing left in your Git pane.
Click Pull to get today’s application exercise file: ae-06-taxes-join.qmd.
Wait till the you’re prompted to work on the application exercise during class before editing the file.
Can you guess the variable plotted here?
# A tibble: 51 × 5
state state_tax_rate avg_local_tax_rate combined_rate max_local_tax_rate
<chr> <dbl> <dbl> <dbl> <dbl>
1 Alabama 0.04 0.0529 0.0929 0.075
2 Alaska 0 0.0182 0.0182 0.0785
3 Arizona 0.056 0.0278 0.0838 0.053
4 Arkansas 0.065 0.0295 0.0945 0.0613
5 California 0.0725 0.016 0.0885 0.0475
6 Colorado 0.029 0.0491 0.0781 0.083
7 Connectic… 0.0635 0 0.0635 0
8 Delaware 0 0 0 0
9 Florida 0.06 0.01 0.07 0.02
10 Georgia 0.04 0.0338 0.0738 0.05
# ℹ 41 more rows
Suppose you’re tasked with the following:
Compare the average state sales tax rates of swing states (Arizona, Georgia, Michigan, Nevada, North Carolina, Pennsylvania, and Wisconsin) vs. non-swing states.
How would you approach this task?
swing_state
with levels "Swing"
and "Non-swing"
swing_state
Go to your ae project in RStudio.
If you haven’t yet done so, make sure all of your changes up to this point are committed and pushed, i.e., there’s nothing left in your Git pane.
If you haven’t yet done so, click Pull to get today’s application exercise file: ae-06-taxes-join.qmd.
Work through the application exercise in class, and render, commit, and push your edits by the end of class.
mutate()
with if_else()
Create a new variable called swing_state
with levels "Swing"
and "Non-swing"
.
list_of_swing_states <- c("Arizona", "Georgia", "Michigan", "Nevada",
"North Carolina", "Pennsylvania", "Wisconsin")
sales_taxes <- sales_taxes |>
mutate(
swing_state = if_else(state %in% list_of_swing_states,
"Swing",
"Non-swing")) |>
relocate(swing_state)
sales_taxes
# A tibble: 51 × 6
swing_state state state_tax_rate avg_local_tax_rate combined_rate
<chr> <chr> <dbl> <dbl> <dbl>
1 Non-swing Alabama 0.04 0.0529 0.0929
2 Non-swing Alaska 0 0.0182 0.0182
3 Swing Arizona 0.056 0.0278 0.0838
4 Non-swing Arkansas 0.065 0.0295 0.0945
5 Non-swing California 0.0725 0.016 0.0885
6 Non-swing Colorado 0.029 0.0491 0.0781
7 Non-swing Connecticut 0.0635 0 0.0635
8 Non-swing Delaware 0 0 0
9 Non-swing Florida 0.06 0.01 0.07
10 Swing Georgia 0.04 0.0338 0.0738
# ℹ 41 more rows
# ℹ 1 more variable: max_local_tax_rate <dbl>
if_else()
TRUE
FALSE
Compare the average state sales tax rates of swing states vs. non-swing states.
Suppose you’re tasked with the following:
Compare the average state sales tax rates of states on the Pacific Coast, states on the Atlantic Coast, and the rest of the states.
How would you approach this task?
coast
with levels "Pacific"
, "Atlantic"
, and "Neither"
coast
mutate()
with case_when()
Create a new variable called coast
with levels "Pacific"
, "Atlantic"
, and "Neither"
.
pacific_coast <- c("Alaska", "Washington", "Oregon", "California", "Hawaii")
atlantic_coast <- c(
"Connecticut", "Delaware", "Georgia", "Florida", "Maine", "Maryland",
"Massachusetts", "New Hampshire", "New Jersey", "New York",
"North Carolina", "Rhode Island", "South Carolina", "Virginia"
)
sales_taxes <- sales_taxes |>
mutate(
coast = case_when(
state %in% pacific_coast ~ "Pacific",
state %in% atlantic_coast ~ "Atlantic",
.default = "Neither"
)
) |>
relocate(coast)
sales_taxes
# A tibble: 51 × 7
coast swing_state state state_tax_rate avg_local_tax_rate combined_rate
<chr> <chr> <chr> <dbl> <dbl> <dbl>
1 Neither Non-swing Alabama 0.04 0.0529 0.0929
2 Pacific Non-swing Alaska 0 0.0182 0.0182
3 Neither Swing Arizona 0.056 0.0278 0.0838
4 Neither Non-swing Arkansas 0.065 0.0295 0.0945
5 Pacific Non-swing Califor… 0.0725 0.016 0.0885
6 Neither Non-swing Colorado 0.029 0.0491 0.0781
7 Atlantic Non-swing Connect… 0.0635 0 0.0635
8 Atlantic Non-swing Delaware 0 0 0
9 Atlantic Non-swing Florida 0.06 0.01 0.07
10 Atlantic Swing Georgia 0.04 0.0338 0.0738
# ℹ 41 more rows
# ℹ 1 more variable: max_local_tax_rate <dbl>
case_when()
case_when(
1 x > y ~ "x is greater than y",
2 x < y ~ "x is less than y",
3 .default = "x is equal to y"
)
TRUE
TRUE
TRUE
, i.e., default value
Compare the average state sales tax rates of states on the Pacific Coast, states on the Atlantic Coast, and the rest of the states.
Suppose you’re tasked with the following:
Compare the average state sales tax rates of states in various regions (Midwest - 12 states, Northeast - 9 states, South - 16 states, West - 13 states).
How would you approach this task?
region
with levels "Midwest"
, "Northeast"
, "South"
, and "West"
.region
mutate()
with case_when()
Who feels like filling in the blanks lists of states in each region? Who feels like it’s simply too tedious to write out names of all states?
list_of_midwest_states <- c(___)
list_of_northeast_states <- c(___)
list_of_south_states <- c(___)
list_of_west_states <- c(___)
sales_taxes <- sales_taxes |>
mutate(
coast = case_when(
state %in% list_of_west_states ~ "Midwest",
state %in% list_of_northeast_states ~ "Northeast",
state %in% list_of_south_states ~ "South",
state %in% list_of_west_states ~ "West"
)
)
Suppose we want to answer questions like:
Is there a relationship between
- number of QS courses taken
- having scored a 4 or 5 on the AP stats exam
- motivation for taking course
- …
and performance in this course?”
Each of these would require joining class performance data with an outside data source so we can have all relevant information (columns) in a single data frame.
Suppose we want to answer questions like:
Compare the average state sales tax rates of states in various regions (Midwest - 12 states, Northeast - 9 states, South - 16 states, West - 13 states).
This can also be solved with joining region information with the state-level sales tax data.
For the next few slides…
left_join()
right_join()
full_join()
inner_join()
semi_join()
anti_join()