Lab 4

John Zito

Duke University
STA 199 Spring 2025

2025-02-10

While you wait… get your repo ready

  • Log in to RStudio (via your container)

  • Clone the repo & start a new RStudio project

    • Go to the course organization at github.com/sta199-s25 organization on GitHub. Click on the repo with the prefix lab-4.

    • Click on the green CODE button and select Use SSH. Click on the clipboard icon to copy the repo URL.

    • In RStudio, go to FileNew ProjectVersion ControlGit to clone your Lab 4 repo.

  • Update the YAML

    • In lab-4.qmd, update the author field to your name, Render your document, and examine the changes. Then, in the Git pane, click on Diff to view your changes, add a commit message (e.g., “Added author name”), and click Commit. Then, Push the changes to your GitHub repository and, in your browser, confirm that these changes have indeed propagated to your repository.

A review

You have learned a lot thus far

  • Git/GitHub

  • R

    • Plot related functions

      ggplot(), aes(), geom_boxplot(), geom_point(), geom_histogram(), geom_smooth(), geom_line(), geom_beeswarm(), labs(), theme_minimal(), theme() scale_x_continuous(), scale_color_manual()

    • More functions

      glimpse(), nrow(), ncol(), dim(), slice_head(), filter(), arrange(), relocate(), if_else(), case_when(), count(), group_by(), ungroup(), read_csv(), separate(), mutate(), summarize(), pivot_*(), *_join()

mutate() and summarize()

 

 

  • mutate(): modifies existing data frame – creates new columns (i.e., variables) or modifies existing columns. Note that the number of rows does not change.

 

  • summarize(): creates a new data frame – returns one for for each combination of grouping variables. If there is no grouping it will have a single row summarizing all observations

Example: Set up

library(tidyverse)
library(knitr)

df <- tibble(
  col_1 = c("A", "A", "A", "B", "B"),
  col_2 = c("X", "Y", "X", "X", "Y"),
  col_3 = c(1, 2, 3, 4, 5)
)

df #this is used to display the data frame
# A tibble: 5 × 3
  col_1 col_2 col_3
  <chr> <chr> <dbl>
1 A     X         1
2 A     Y         2
3 A     X         3
4 B     X         4
5 B     Y         5

What would be the result of the following code? # rows? # cols? column/variable names?

df |>
  mutate(med_col_3 = median(col_3))


df |>
  summarize(med_col_3 = median(col_3))

Example: no groups

mutate()

df |>
  mutate(med_col_3 = median(col_3))
# A tibble: 5 × 4
  col_1 col_2 col_3 med_col_3
  <chr> <chr> <dbl>     <dbl>
1 A     X         1         3
2 A     Y         2         3
3 A     X         3         3
4 B     X         4         3
5 B     Y         5         3

summarize()

df |>
  summarize(med_col_3 = median(col_3))
# A tibble: 1 × 1
  med_col_3
      <dbl>
1         3

We did not assign any new or existing data frames (e.g., no ??? <-). In particular, we did not write over df (i.e., no df <-), so what will be the result of the following code?

df

Example: assignment

  • It’s the same as when it was originally assigned. It has not been overwritten!
df 
# A tibble: 5 × 3
  col_1 col_2 col_3
  <chr> <chr> <dbl>
1 A     X         1
2 A     Y         2
3 A     X         3
4 B     X         4
5 B     Y         5
  • We will often write a single pipeline and show the result, i.e., no assignment.

  • If you will need to refer to the data frame later, it might be a good idea to assign a name to the data frame. Otherwise, see the result and continue on.

    Note: if you assign the new/updated data frame, the result does not appear in the Console or the rendered document! (Type the name of the variable, e.g., df as shown above, to display the data frame.)

Example: with groups

What if there is grouping?

# group by 1 variable
df |>
  group_by(col_1) |>
  mutate(med_col_3 = median(col_3))


df |>
  group_by(col_1) |>
  summarize(med_col_3 = median(col_3))

# group by 2 variables
df |>
  group_by(col_1, col_2) |>
  mutate(med_col_3 = median(col_3))


df |>
  group_by(col_1, col_2) |>
  summarize(med_col_3 = median(col_3))

If you aren’t sure, try it out and see what happens (e.g., use data frame from Lab 3, Part 1).

pivot_*()

  • Pivoting reshapes the data frame.

  • pivot_longer makes the updated data frame longer (i.e., fewer columns)

  • pivot_wider makes the updated data frame wider (i.e., more columns)

Example: pivot_*()

Let’s examine the number of hours people slept during the week.

How do we go from this…

ppl Mon Tues Weds Thurs Fri
person1 8 7 6 10 8
person2 7 5 4 6 7

…to this?

ppl day hours
person1 Mon 8
person1 Tues 7
person1 Weds 6
person1 Thurs 10
person1 Fri 8
person2 Mon 7
person2 Tues 5
person2 Weds 4
person2 Thurs 6
person2 Fri 7
df_longer <- df |>
  pivot_longer(
    cols = -ppl,
    names_to = "day",
    values_to = "hours"
  )

*_join()

  • Typically we use *_join() to merge data from two data frames (e.g., left_join(), right_join(), full_join(), inner_join()), i.e., create a new data frames with more columns/variables.

    For example, there is useful info in two data frames: x and y. We want to create a new data frame which includes variables from both (e.g., data frame x has student ID numbers and student names and data frame y has student ID numbers and email addresses).

  • Sometimes we use *_join() to filter rows/observations, e.g., find the rows from one data frame that do (or do not) exist in another data frame (e.g., semi_join(), anti_join())
  • Let’s focus on the joins that merge data…

Example: *_join() setup

For the next few slides…

x <- tibble(
  id = c(1, 2, 3),
  value_x = c("x1", "x2", "x3")
  )

x
# A tibble: 3 × 2
     id value_x
  <dbl> <chr>  
1     1 x1     
2     2 x2     
3     3 x3     
y <- tibble(
  id = c(1, 2, 4),
  value_y = c("y1", "y2", "y4")
  )

y
# A tibble: 3 × 2
     id value_y
  <dbl> <chr>  
1     1 y1     
2     2 y2     
3     4 y4     

left_join()

left_join(x, y)
# A tibble: 3 × 3
     id value_x value_y
  <dbl> <chr>   <chr>  
1     1 x1      y1     
2     2 x2      y2     
3     3 x3      <NA>   

Keep all rows from left data frame.

right_join()

right_join(x, y)
# A tibble: 3 × 3
     id value_x value_y
  <dbl> <chr>   <chr>  
1     1 x1      y1     
2     2 x2      y2     
3     4 <NA>    y4     

  Keep all rows from right data frame.

full_join()

full_join(x, y)
# A tibble: 4 × 3
     id value_x value_y
  <dbl> <chr>   <chr>  
1     1 x1      y1     
2     2 x2      y2     
3     3 x3      <NA>   
4     4 <NA>    y4     

 

Keep all rows from both data frames.

inner_join()

inner_join(x, y)
# A tibble: 2 × 3
     id value_x value_y
  <dbl> <chr>   <chr>  
1     1 x1      y1     
2     2 x2      y2     

 

Keep all rows that exist in both data frames.

Example: *_join() more info

  • We could also use *_join() within a pipeline.
x |>
  left_join(y)
  • Which data frame is on the left and which is on the right? x or y?
  • The above code is equivalent to left_join(x, y) since the result before the pipe |> is passed as the first argument to the function after the pipe.
  • In this example x has 2 variables: id and value_x and y has 2 variables: id and value_x, so there was only one common variable between x and yid. We could have been more explicit and used the following code
left_join(x, y, by = join_by(id))

# or alternatively

left_join(x, y, by = join_by(id == id))

Factors

  • Factors are used for categorical variables, e.g., days of the week; religion; low, mid, high

  • Very helpful for ordering (i.e., when numerical and alphabetical ordering don’t cut it!)

  • Examples

    • Friday, Monday, Saturday, Sunday, Tuesday, Thursday, Wednesday

    • Apr, Feb, Jan, July, Jun, Mar, May

    • Agree, Disagree, Neither agree nor disagree, Strongly agree, Strongly disagree

    • Example below (from prepare [r4ds] chp 16.4)

Factor example

Recall from Thursday’s lecture

survey |>
  mutate(
    year = fct_relevel(year, "First-year", "Sophomore", "Junior", "Senior")
    ) |>
  ggplot(aes(x = year)) +
  geom_bar() + 
  labs(
    title = "Number of students by year",
    x = "Year",
    y = "Count"
  )

How is the x-axis ordered in the left and right plots?

This week’s lab

  • Gain more experience with joining and pivoting data frames; and modifying the order of factors.

  • Review Quarto cell options

  • Learn to read data in from Excel spreadsheets (will learn more on Tuesday about this)

  • Datasets

    • More inflation!

    • 2020 and 2024 US Olympic Team rosters

    • Survey regarding medical marijuana in NC

    • mtcars from 1974 Motor Trend US magazine