More practice

Lecture 10

John Zito

Duke University
STA 199 Spring 2025

2025-02-13

Before we begin…

Midterm Exam 1

  • In-class (70%)

    • Thursday February 20 11:45 AM - 1:00 PM;
    • All multiple choice;
    • You should have gotten an email about room assignment;
    • 8.5” x 11” cheat sheet.
  • Take-home (30%)

    • Released Thursday February 20 at 1:00 PM;
    • Due Monday February 24 at 8:30 AM;
    • Basically a mini lab;
    • Open resource (citation policies apply);
    • No collaboration.

See slides from 2/11 for more details.

Code smell

One way to look at smells is with respect to principles and quality: “Smells are certain structures in the code that indicate violation of fundamental design principles and negatively impact design quality”. Code smells are usually not bugs; they are not technically incorrect and do not prevent the program from functioning. Instead, they indicate weaknesses in design that may slow down development or increase the risk of bugs or failures in the future.

Code style

Follow the Tidyverse style guide:

  • Spaces before and line breaks after each + when building a ggplot

  • Spaces before and line breaks after each |> in a data transformation pipeline

  • Proper indentation

  • Spaces around = signs and spaces after commas

  • Lines should not span more than 80 characters; break up long lines by putting each argument on its own line
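
As a small illustration of these rules, here is a sketch using a made-up tibble called toy (not course data). The two pipelines compute the same thing; only the second follows the style guide.

```r
library(dplyr)

# Hypothetical toy data, made up for this illustration
toy <- tibble(
  group = c("a", "a", "b", "b"),
  value = c(1, 2, 3, 4)
)

# Hard to read: no spaces, and everything crammed onto one long line
unstyled <- toy|>filter(value>1)|>group_by(group)|>summarize(mean_value=mean(value))

# Styled: spaces around operators, a line break after each |>,
# and consistent indentation
styled <- toy |>
  filter(value > 1) |>
  group_by(group) |>
  summarize(mean_value = mean(value))

# Style changes readability, not results
identical(unstyled, styled)
```

R does not care which version you write, but whoever reads your code later (including you) will.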

FAQ

Quotes VS no quotes VS backticks

df <- tibble(
  x = c(-2, -0.5, 0.5, 1, 2),
  `2011` = c(-2, -0.5, 0.5, 1, 2)
)
df
# A tibble: 5 × 2
      x `2011`
  <dbl>  <dbl>
1  -2     -2  
2  -0.5   -0.5
3   0.5    0.5
4   1      1  
5   2      2  

Referencing a column in a pipeline:

df |>
  filter("x" > 0)
# A tibble: 5 × 2
      x `2011`
  <dbl>  <dbl>
1  -2     -2  
2  -0.5   -0.5
3   0.5    0.5
4   1      1  
5   2      2  

"x" means the literal character string.

df |>
  filter(x > 0)
# A tibble: 3 × 2
      x `2011`
  <dbl>  <dbl>
1   0.5    0.5
2   1      1  
3   2      2  

x means the column name in df.

df |>
  filter(`x` > 0)
# A tibble: 3 × 2
      x `2011`
  <dbl>  <dbl>
1   0.5    0.5
2   1      1  
3   2      2  

`x` also means the column name in df.

Referencing a column in a pipeline:

df |>
  filter("2011" > 0)
# A tibble: 5 × 2
      x `2011`
  <dbl>  <dbl>
1  -2     -2  
2  -0.5   -0.5
3   0.5    0.5
4   1      1  
5   2      2  

"2011" means the literal character string.

df |>
  filter(2011 > 0)
# A tibble: 5 × 2
      x `2011`
  <dbl>  <dbl>
1  -2     -2  
2  -0.5   -0.5
3   0.5    0.5
4   1      1  
5   2      2  

2011 means the literal number.

df |>
  filter(`2011` > 0)
# A tibble: 3 × 2
      x `2011`
  <dbl>  <dbl>
1   0.5    0.5
2   1      1  
3   2      2  

`2011` means the column name in df.
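
Why do the quoted versions keep every row? A sketch of what is probably going on: the condition never looks at the column, so it evaluates to a single TRUE, and filter() applies that one value to all rows. The comparisons below can be run on their own at the console.

```r
library(dplyr)

df <- tibble(
  x = c(-2, -0.5, 0.5, 1, 2),
  `2011` = c(-2, -0.5, 0.5, 1, 2)
)

# A quoted name is just a string. Comparing a string with a number
# coerces the number to a string, and strings are compared alphabetically.
quoted_x <- "x" > 0          # same as "x" > "0"
quoted_2011 <- "2011" > 0    # same as "2011" > "0"
bare_2011 <- 2011 > 0        # plain numeric comparison

# Each of these is a single TRUE, so filter() keeps every row
c(quoted_x, quoted_2011, bare_2011)
df |>
  filter(TRUE)
```

Backticks are the only way to tell R that 2011 is a name rather than a number.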

Why %in% instead of ==?

Consider adding a season column:

durham_climate
# A tibble: 12 × 4
   month     avg_high_f avg_low_f precipitation_in
   <chr>          <dbl>     <dbl>            <dbl>
 1 January           49        28             4.45
 2 February          53        29             3.7 
 3 March             62        37             4.69
 4 April             71        46             3.43
 5 May               79        56             4.61
 6 June              85        65             4.02
 7 July              89        70             3.94
 8 August            87        68             4.37
 9 September         81        60             4.37
10 October           71        47             3.7 
11 November          62        37             3.39
12 December          53        30             3.43

durham_climate |>
  mutate(
    season = if_else(
      month ????? c("December", "January", "February"),
      "Winter",
      "Not Winter"
    )
  )

With %in%:

durham_climate |>
  mutate(
    season = if_else(
      month %in% c("December", "January", "February"),
      "Winter",
      "Not Winter"
    )
  )
# A tibble: 12 × 5
   month     avg_high_f avg_low_f precipitation_in season    
   <chr>          <dbl>     <dbl>            <dbl> <chr>     
 1 January           49        28             4.45 Winter    
 2 February          53        29             3.7  Winter    
 3 March             62        37             4.69 Not Winter
 4 April             71        46             3.43 Not Winter
 5 May               79        56             4.61 Not Winter
 6 June              85        65             4.02 Not Winter
 7 July              89        70             3.94 Not Winter
 8 August            87        68             4.37 Not Winter
 9 September         81        60             4.37 Not Winter
10 October           71        47             3.7  Not Winter
11 November          62        37             3.39 Not Winter
12 December          53        30             3.43 Winter    

With ==:

durham_climate |>
  mutate(
    season = if_else(
      month == c("December", "January", "February"),
      "Winter",
      "Not Winter"
    )
  )
# A tibble: 12 × 5
   month     avg_high_f avg_low_f precipitation_in season    
   <chr>          <dbl>     <dbl>            <dbl> <chr>     
 1 January           49        28             4.45 Not Winter
 2 February          53        29             3.7  Not Winter
 3 March             62        37             4.69 Not Winter
 4 April             71        46             3.43 Not Winter
 5 May               79        56             4.61 Not Winter
 6 June              85        65             4.02 Not Winter
 7 July              89        70             3.94 Not Winter
 8 August            87        68             4.37 Not Winter
 9 September         81        60             4.37 Not Winter
10 October           71        47             3.7  Not Winter
11 November          62        37             3.39 Not Winter
12 December          53        30             3.43 Not Winter

Why %in% instead of ==?

"January" == c("December", "January", "February")
[1] FALSE  TRUE FALSE
"January" %in% c("December", "January", "February")
[1] TRUE

Punchline

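Inside if_else or case_when, your condition needs to produce exactly one TRUE or FALSE per row, and it needs to answer the question you meant to ask about that row. With ==, R recycles the three-element vector down the column and compares position by position, so no row is ever tested against all three months. Because 12 is a multiple of 3, you do not get an error or even a warning; you just get the wrong answer. Use %in% when you want to test membership in a set of values.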
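
To see the recycling directly, here is a minimal base R sketch. It uses the built-in month.name vector in place of the month column, since both list the months January through December.

```r
winter <- c("December", "January", "February")

# == compares position by position, recycling the shorter vector:
# "January" vs "December", "February" vs "January", "March" vs "February",
# "April" vs "December", and so on. No position happens to match.
equals_result <- month.name == winter
equals_result

# %in% asks, for each month, whether it appears anywhere in winter
in_result <- month.name %in% winter
in_result
```

The == comparison returns twelve FALSE values, which is why every row was labeled "Not Winter". The %in% comparison is TRUE for January, February, and December.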

Four tasks for today

Task 1: Prettifying the plot from ae-07

ggplot(
  durham_climate,
  aes(x = month, y = avg_high_f, group = 1)
) +
  geom_line() +
  geom_point(
    shape = "circle filled", size = 4,
    color = "black", fill = "white", stroke = 1
  ) +
  labs(
    x = "Month",
    y = "Average high temperature (F)",
    title = "Durham climate"
  ) + 
  theme_minimal()

Things to change

  1. Reorder the months chronologically;
  2. Fill the circles with season-specific colors;
  3. Add a legend for these colors to the top of the plot;
  4. Make sure the legend is ordered chronologically by season.

0. Why group = 1?

With it:

ggplot(
  durham_climate,
  aes(x = month, y = avg_high_f, group = 1)
) +
  geom_line() +
  geom_point(
    shape = "circle filled", size = 4,
    color = "black", fill = "white", stroke = 1
  ) +
  labs(
    x = "Month",
    y = "Average high temperature (F)",
    title = "Durham climate"
  ) +
  theme_minimal()

Without it (even though I have geom_line!):

ggplot(
  durham_climate,
  aes(x = month, y = avg_high_f)
) +
  geom_line() +
  geom_point(
    shape = "circle filled", size = 4,
    color = "black", fill = "white", stroke = 1
  ) +
  labs(
    x = "Month",
    y = "Average high temperature (F)",
    title = "Durham climate"
  ) +
  theme_minimal()

Don’t need group for numerical vs numerical:

ggplot(
  durham_climate,
  aes(x = avg_low_f, y = avg_high_f)
) +
  geom_line() +
  geom_point(
    shape = "circle filled", size = 4,
    color = "black", fill = "white", stroke = 1
  ) +
  labs(
    x = "Average low temperature (F)",
    y = "Average high temperature (F)",
    title = "Durham climate"
  ) +
  theme_minimal()

Do need group for categorical vs numerical:

ggplot(
  durham_climate,
  aes(x = month, y = avg_high_f, group = 1)
) +
  geom_line() +
  geom_point(
    shape = "circle filled", size = 4,
    color = "black", fill = "white", stroke = 1
  ) +
  labs(
    x = "Month",
    y = "Average high temperature (F)",
    title = "Durham climate"
  ) +
  theme_minimal()
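
One way to see what group = 1 changes is to inspect the data ggplot2 builds for the line layer. This is a sketch based on my understanding of ggplot2's default grouping: when x is categorical, each category becomes its own group, and geom_line() only connects points within a group. The data frame durham_highs is a stand-in that copies two columns from durham_climate.

```r
library(ggplot2)

# Month names and average highs copied from the durham_climate table
durham_highs <- data.frame(
  month = month.name,
  avg_high_f = c(49, 53, 62, 71, 79, 85, 89, 87, 81, 71, 62, 53)
)

without_group <- ggplot(
  durham_highs,
  aes(x = month, y = avg_high_f)
) +
  geom_line()

with_group <- ggplot(
  durham_highs,
  aes(x = month, y = avg_high_f, group = 1)
) +
  geom_line()

# layer_data() returns the data a layer draws from, including the
# group each row was assigned to
n_groups_without <- length(unique(layer_data(without_group)$group))
n_groups_with <- length(unique(layer_data(with_group)$group))

n_groups_without  # one group per month, so no two points are connected
n_groups_with     # a single group, so one line runs through all 12 points
```

Setting group = 1 assigns every row the same constant group, which tells geom_line() to treat all twelve months as one series.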

1. Reorder the months chronologically

durham_climate |>
  mutate(
    month = fct_relevel(month, month.name)
  ) |>
  ggplot(
    aes(x = month, y = avg_high_f, group = 1)
  ) +
  geom_line() +
  geom_point(
    shape = "circle filled", size = 4,
    color = "black", fill = "white", stroke = 1
  ) +
  labs(
    x = "Month",
    y = "Average high temperature (F)",
    title = "Durham climate"
  ) +
  theme_minimal()
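
What fct_relevel is doing here, as a minimal sketch on the built-in month.name vector rather than the data frame: character data plotted on an axis is ordered alphabetically, and fct_relevel replaces that with an order you choose.

```r
library(forcats)

# A character vector turned into a factor gets alphabetical levels
alphabetical_levels <- levels(factor(month.name))
alphabetical_levels

# fct_relevel() moves the named levels to the front, in the order given;
# passing all twelve month names puts them in calendar order
calendar_levels <- levels(fct_relevel(month.name, month.name))
calendar_levels
```

ggplot2 lays out a categorical axis in level order, so the plot follows the calendar once the levels do.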

2. Fill the circles with season-specific colors

durham_climate |>
  mutate(
    month = fct_relevel(month, month.name),
    season = case_when(
      month %in% c("December", "January", "February") ~ "Winter",
      month %in% c("March", "April", "May") ~ "Spring",
      month %in% c("June", "July", "August") ~ "Summer",
      month %in% c("September", "October", "November") ~ "Fall",
    )
  ) |>
  ggplot(
    aes(x = month, y = avg_high_f, group = 1)
  ) +
  geom_line() +
  geom_point(
    aes(fill = season),
    shape = "circle filled", size = 4,
    color = "black", stroke = 1
  ) +
  scale_fill_manual(
    values = c(
      "Winter" = "lightskyblue1",
      "Spring" = "chartreuse3",
      "Summer" = "gold2",
      "Fall" = "lightsalmon4"
    )
  ) + 
  labs(
    x = "Month",
    y = "Average high temperature (F)",
    title = "Durham climate"
  ) +
  theme_minimal()

3. Add legend for season to top of plot

durham_climate |>
  mutate(
    month = fct_relevel(month, month.name),
    season = case_when(
      month %in% c("December", "January", "February") ~ "Winter",
      month %in% c("March", "April", "May") ~ "Spring",
      month %in% c("June", "July", "August") ~ "Summer",
      month %in% c("September", "October", "November") ~ "Fall",
    )
  ) |>
  ggplot(
    aes(x = month, y = avg_high_f, group = 1)
  ) +
  geom_line() +
  geom_point(
    aes(fill = season),
    shape = "circle filled", size = 4,
    color = "black", stroke = 1
  ) +
  scale_fill_manual(
    values = c(
      "Winter" = "lightskyblue1",
      "Spring" = "chartreuse3",
      "Summer" = "gold2",
      "Fall" = "lightsalmon4"
    )
  ) + 
  labs(
    x = "Month",
    y = "Average high temperature (F)",
    title = "Durham climate"
  ) +
  theme_minimal() + 
  theme(legend.position = "top")

4. Order legend chronologically

durham_climate |>
  mutate(
    month = fct_relevel(month, month.name),
    season = case_when(
      month %in% c("December", "January", "February") ~ "Winter",
      month %in% c("March", "April", "May") ~ "Spring",
      month %in% c("June", "July", "August") ~ "Summer",
      month %in% c("September", "October", "November") ~ "Fall",
    ),
    season = fct_relevel(season, "Winter", "Spring", "Summer", "Fall")
  ) |>
  ggplot(
    aes(x = month, y = avg_high_f, group = 1)
  ) +
  geom_line() +
  geom_point(
    aes(fill = season),
    shape = "circle filled", size = 4,
    color = "black", stroke = 1
  ) +
  scale_fill_manual(
    values = c(
      "Winter" = "lightskyblue1",
      "Spring" = "chartreuse3",
      "Summer" = "gold2",
      "Fall" = "lightsalmon4"
    )
  ) + 
  labs(
    x = "Month",
    y = "Average high temperature (F)",
    title = "Durham climate"
  ) +
  theme_minimal() + 
  theme(legend.position = "top")

Task 2: pivot to replicate this…

Give it a shot in your ae-07-durham-climate-factors file. And don’t worry about prettification. Just get the two lines correct.
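
If you get stuck, here is one possible approach, offered as a sketch rather than the official solution. It assumes the target plot shows average high and average low temperature as two lines over the months. The column names temp_type and avg_temp_f are my own choices, and durham_climate is rebuilt from the table shown earlier so the code runs on its own.

```r
library(tidyverse)

# Values copied from the durham_climate table
durham_climate <- tibble(
  month = month.name,
  avg_high_f = c(49, 53, 62, 71, 79, 85, 89, 87, 81, 71, 62, 53),
  avg_low_f = c(28, 29, 37, 46, 56, 65, 70, 68, 60, 47, 37, 30)
)

# Stack the two temperature columns into one, with a label column
# recording which original column each value came from
durham_long <- durham_climate |>
  mutate(month = fct_relevel(month, month.name)) |>
  pivot_longer(
    cols = c(avg_high_f, avg_low_f),
    names_to = "temp_type",
    values_to = "avg_temp_f"
  )

# One line per temp_type: group tells geom_line() which points to connect
two_lines <- ggplot(
  durham_long,
  aes(x = month, y = avg_temp_f, group = temp_type, color = temp_type)
) +
  geom_line()

two_lines
```

After pivoting there are 24 rows, two per month, and group = temp_type plays the role that group = 1 played in the single-line plot.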

Task 3: recoding and writing to file

  • Read a CSV file

  • Split it into subsets based on features of the data

  • Write out subsets as CSV files

Work on the first part in ae-08-age-gaps-sales-import.qmd.
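
The three steps can be sketched as follows. The data, column names, file names, and the 10-year cutoff are all hypothetical stand-ins, not the contents of the ae-08 file; the toy CSV is written to a temporary folder first so the example is self-contained.

```r
library(readr)
library(dplyr)

# Hypothetical stand-in data, written to a temporary file
csv_path <- file.path(tempdir(), "toy-films.csv")
write_csv(
  tibble(
    title = c("Film A", "Film B", "Film C", "Film D"),
    age_gap = c(2, 15, 7, 30)
  ),
  csv_path
)

# 1. Read the CSV file
films <- read_csv(csv_path, show_col_types = FALSE)

# 2. Split it into subsets based on a feature of the data
#    (the cutoff of 10 years is arbitrary)
small_gap <- films |>
  filter(age_gap <= 10)
large_gap <- films |>
  filter(age_gap > 10)

# 3. Write each subset out as its own CSV file
write_csv(small_gap, file.path(tempdir(), "toy-films-small-gap.csv"))
write_csv(large_gap, file.path(tempdir(), "toy-films-large-gap.csv"))
```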

Age gap in Hollywood relationships

What is the story in this visualization?

Task 4: reading in from Excel (yuck!)

  • Using readr:
    • Most commonly: read_csv()
    • Maybe also: read_tsv(), read_delim(), etc.
  • Using readxl: read_excel()
  • Using googlesheets4: read_sheet(). We haven’t covered this in the videos, but it might be useful for your projects.

Reading Excel files

  • Read an Excel file with non-tidy data

  • Tidy it up!

Work on the second part in ae-08-age-gaps-sales-import.qmd.
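
The sales workbook itself is not reproduced here, so the sketch below uses an example workbook that ships with the readxl package. It only illustrates two arguments that tend to help with messy spreadsheets, skip and col_names; the column names supplied are my own.

```r
library(readxl)

# readxl bundles a few example workbooks; this one stands in for a real file
xlsx_path <- readxl_example("datasets.xlsx")

# List the sheets before deciding which one to read
excel_sheets(xlsx_path)

# Skip the first row (here, the original header) and supply our own
# column names, as one might when a sheet starts with title rows
# or has unhelpful headers
iris_raw <- read_excel(
  xlsx_path,
  sheet = "iris",
  skip = 1,
  col_names = c(
    "sepal_length", "sepal_width",
    "petal_length", "petal_width",
    "species"
  )
)

iris_raw
```

Once the data are in R as a tibble, tidying proceeds with the same dplyr and tidyr tools used on CSV data.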

Sales data

Are these data tidy? Why or why not?

Sales data

What “data moves” do we need to get from the original, non-tidy data to this tidy version?