Lecture 10
Duke University
STA 199 Spring 2025
2025-02-13
In-class (70%)
Take-home (30%)
See slides from 2/11 for more details.
One way to look at smells is with respect to principles and quality: “Smells are certain structures in the code that indicate violation of fundamental design principles and negatively impact design quality”. Code smells are usually not bugs; they are not technically incorrect and do not prevent the program from functioning. Instead, they indicate weaknesses in design that may slow down development or increase the risk of bugs or failures in the future.
Follow the Tidyverse style guide:
Spaces before and line breaks after each + when building a ggplot
Spaces before and line breaks after each |> in a data transformation pipeline
Proper indentation
Spaces around = signs and spaces after commas
Lines should not span more than 80 characters; long lines should be broken up with each argument on its own line
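As a rough illustration of these guidelines, the sketch below runs the same pipeline twice, once crammed onto one line and once styled. The scores data and its column names are made up for this example and are not from the course materials.

```r
library(dplyr)

# Hypothetical toy data, used only to illustrate formatting
scores <- tibble(
  student = c("A", "B", "C", "D"),
  section = c(1, 1, 2, 2),
  score = c(88, 92, 79, 95)
)

# Hard to read: no spaces, no line breaks, everything on one line
hard_to_read<-scores|>group_by(section)|>summarize(mean_score=mean(score),n=n())

# Same computation, styled: a space before and a line break after each |>,
# spaces around = and after commas, one argument per line, indented
styled <- scores |>
  group_by(section) |>
  summarize(
    mean_score = mean(score),
    n = n()
  )

print(styled)
```

Both versions compute the same result; only the second is easy to scan and to debug.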
Referencing a column in a pipeline:
# A tibble: 5 × 2
x `2011`
<dbl> <dbl>
1 -2 -2
2 -0.5 -0.5
3 0.5 0.5
4 1 1
5 2 2
"x" means the literal character string.
x means the column name in df.
"2011" means the literal character string.
2011 means the literal number.
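The code that produced the tibble above is not included in this text, so the following is only a sketch that rebuilds a df with the same printed values and contrasts the different ways of writing x and 2011 inside mutate(). The new column names are made up for illustration. Note that a column whose name is not a valid R name, such as 2011, has to be wrapped in backticks to be referenced as a column.

```r
library(dplyr)

# Rebuild a data frame matching the printed tibble (an assumption about
# how df was created; only its printed values appear in the slides)
df <- tibble(
  x = c(-2, -0.5, 0.5, 1, 2),
  `2011` = c(-2, -0.5, 0.5, 1, 2)
)

refs <- df |>
  mutate(
    x_string = "x",           # literal character string, repeated on every row
    x_column = x * 2,         # the column x
    string_2011 = "2011",     # literal character string
    number_2011 = 2011,       # literal number
    column_2011 = `2011` * 2  # backticks: the column named 2011
  )

print(refs)
```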
%in% instead of ==?
Consider adding a season column:
# A tibble: 12 × 4
month avg_high_f avg_low_f precipitation_in
<chr> <dbl> <dbl> <dbl>
1 January 49 28 4.45
2 February 53 29 3.7
3 March 62 37 4.69
4 April 71 46 3.43
5 May 79 56 4.61
6 June 85 65 4.02
7 July 89 70 3.94
8 August 87 68 4.37
9 September 81 60 4.37
10 October 71 47 3.7
11 November 62 37 3.39
12 December 53 30 3.43
durham_climate |>
  mutate(
    season = if_else(
      month %in% c("December", "January", "February"),
      "Winter",
      "Not Winter"
    )
  )
# A tibble: 12 × 5
month avg_high_f avg_low_f precipitation_in season
<chr> <dbl> <dbl> <dbl> <chr>
1 January 49 28 4.45 Winter
2 February 53 29 3.7 Winter
3 March 62 37 4.69 Not Winter
4 April 71 46 3.43 Not Winter
5 May 79 56 4.61 Not Winter
6 June 85 65 4.02 Not Winter
7 July 89 70 3.94 Not Winter
8 August 87 68 4.37 Not Winter
9 September 81 60 4.37 Not Winter
10 October 71 47 3.7 Not Winter
11 November 62 37 3.39 Not Winter
12 December 53 30 3.43 Winter
durham_climate |>
  mutate(
    season = if_else(
      month == c("December", "January", "February"),
      "Winter",
      "Not Winter"
    )
  )
# A tibble: 12 × 5
month avg_high_f avg_low_f precipitation_in season
<chr> <dbl> <dbl> <dbl> <chr>
1 January 49 28 4.45 Not Winter
2 February 53 29 3.7 Not Winter
3 March 62 37 4.69 Not Winter
4 April 71 46 3.43 Not Winter
5 May 79 56 4.61 Not Winter
6 June 85 65 4.02 Not Winter
7 July 89 70 3.94 Not Winter
8 August 87 68 4.37 Not Winter
9 September 81 60 4.37 Not Winter
10 October 71 47 3.7 Not Winter
11 November 62 37 3.39 Not Winter
12 December 53 30 3.43 Not Winter
[1] FALSE TRUE FALSE
[1] TRUE
Punchline
Inside if_else or case_when, your condition needs to result in a single value of TRUE or FALSE for each row. If it results in multiple values of TRUE/FALSE (a vector of TRUE/FALSE), you will not necessarily get an error or even a warning, but unexpected things could happen.
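The expressions behind the two [1] outputs are not shown in this text. The sketch below uses base R only and assumes they compared a single month against the three winter months. It then repeats the comparison for all twelve months to show why == silently marks every row as "Not Winter".

```r
winter <- c("December", "January", "February")

# One value against three: == compares element by element,
# %in% asks whether the value appears anywhere in the set
single_eq <- "January" == winter
single_in <- "January" %in% winter
print(single_eq)  # [1] FALSE  TRUE FALSE
print(single_in)  # [1] TRUE

# Twelve values against three: == recycles winter four times, so row 1
# (January) is compared to "December", row 2 (February) to "January", ...
recycled_eq <- month.name == winter
membership <- month.name %in% winter

print(any(recycled_eq))        # [1] FALSE
print(month.name[membership])  # [1] "January"  "February" "December"
```

Because 12 is a multiple of 3, the recycling produces no warning, which is why the mistake is easy to miss.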
group = 1?
With it:
Without it (even though I have geom_line!):
Don’t need group for numerical vs numerical:
Do need group for categorical vs numerical:
durham_climate |>
  mutate(
    month = fct_relevel(month, month.name)
  ) |>
  ggplot(
    aes(x = month, y = avg_high_f, group = 1)
  ) +
  geom_line() +
  geom_point(
    shape = "circle filled", size = 4,
    color = "black", fill = "white", stroke = 1
  ) +
  labs(
    x = "Month",
    y = "Average high temperature (F)",
    title = "Durham climate"
  ) +
  theme_minimal()
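One way to see what group = 1 does in a plot like this is to inspect how ggplot2 groups the data for geom_line(). The sketch below uses a three-row stand-in with values taken from the printed durham_climate table; the name temps and the use of layer_data() for inspection are choices made here, not part of the original exercise.

```r
library(ggplot2)
library(tibble)

# Three rows copied from the printed climate table, as a small stand-in
temps <- tibble(
  month = c("January", "February", "March"),
  avg_high_f = c(49, 53, 62)
)

# Categorical x without group: each month becomes its own group,
# so there is no pair of points for geom_line() to connect
p_without <- ggplot(temps, aes(x = month, y = avg_high_f)) +
  geom_line()

# group = 1 puts every row in a single group, so one line is drawn
p_with <- ggplot(temps, aes(x = month, y = avg_high_f, group = 1)) +
  geom_line()

# Count the groups ggplot2 computed for the line layer of each plot
n_groups_without <- length(unique(layer_data(p_without)$group))
n_groups_with <- length(unique(layer_data(p_with)$group))

print(c(without = n_groups_without, with = n_groups_with))
```

With a numerical x there are no discrete variables to split on, so all rows already share one group and the line is drawn without setting group.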
durham_climate |>
  mutate(
    month = fct_relevel(month, month.name),
    season = case_when(
      month %in% c("December", "January", "February") ~ "Winter",
      month %in% c("March", "April", "May") ~ "Spring",
      month %in% c("June", "July", "August") ~ "Summer",
      month %in% c("September", "October", "November") ~ "Fall",
    )
  ) |>
  ggplot(
    aes(x = month, y = avg_high_f, group = 1)
  ) +
  geom_line() +
  geom_point(
    aes(fill = season),
    shape = "circle filled", size = 4,
    color = "black", stroke = 1
  ) +
  scale_fill_manual(
    values = c(
      "Winter" = "lightskyblue1",
      "Spring" = "chartreuse3",
      "Summer" = "gold2",
      "Fall" = "lightsalmon4"
    )
  ) +
  labs(
    x = "Month",
    y = "Average high temperature (F)",
    title = "Durham climate"
  ) +
  theme_minimal()
durham_climate |>
  mutate(
    month = fct_relevel(month, month.name),
    season = case_when(
      month %in% c("December", "January", "February") ~ "Winter",
      month %in% c("March", "April", "May") ~ "Spring",
      month %in% c("June", "July", "August") ~ "Summer",
      month %in% c("September", "October", "November") ~ "Fall",
    )
  ) |>
  ggplot(
    aes(x = month, y = avg_high_f, group = 1)
  ) +
  geom_line() +
  geom_point(
    aes(fill = season),
    shape = "circle filled", size = 4,
    color = "black", stroke = 1
  ) +
  scale_fill_manual(
    values = c(
      "Winter" = "lightskyblue1",
      "Spring" = "chartreuse3",
      "Summer" = "gold2",
      "Fall" = "lightsalmon4"
    )
  ) +
  labs(
    x = "Month",
    y = "Average high temperature (F)",
    title = "Durham climate"
  ) +
  theme_minimal() +
  theme(legend.position = "top")
durham_climate |>
  mutate(
    month = fct_relevel(month, month.name),
    season = case_when(
      month %in% c("December", "January", "February") ~ "Winter",
      month %in% c("March", "April", "May") ~ "Spring",
      month %in% c("June", "July", "August") ~ "Summer",
      month %in% c("September", "October", "November") ~ "Fall",
    ),
    season = fct_relevel(season, "Winter", "Spring", "Summer", "Fall")
  ) |>
  ggplot(
    aes(x = month, y = avg_high_f, group = 1)
  ) +
  geom_line() +
  geom_point(
    aes(fill = season),
    shape = "circle filled", size = 4,
    color = "black", stroke = 1
  ) +
  scale_fill_manual(
    values = c(
      "Winter" = "lightskyblue1",
      "Spring" = "chartreuse3",
      "Summer" = "gold2",
      "Fall" = "lightsalmon4"
    )
  ) +
  labs(
    x = "Month",
    y = "Average high temperature (F)",
    title = "Durham climate"
  ) +
  theme_minimal() +
  theme(legend.position = "top")
Give it a shot in your ae-07-durham-climate-factors file. And don’t worry about prettification. Just get the two lines correct.
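The releveling step matters because a character column turned into a factor gets alphabetical levels by default, and that order drives axis and legend order. A minimal sketch, using a standalone season vector rather than the durham_climate data:

```r
library(forcats)

# Standalone vector for illustration; in the slides season is a column
season <- factor(c("Winter", "Spring", "Summer", "Fall"))

# Default levels are alphabetical
default_levels <- levels(season)

# fct_relevel() moves the named levels to the front, in the order given
season_ordered <- fct_relevel(season, "Winter", "Spring", "Summer", "Fall")
releveled <- levels(season_ordered)

print(default_levels)  # [1] "Fall"   "Spring" "Summer" "Winter"
print(releveled)       # [1] "Winter" "Spring" "Summer" "Fall"
```

Only the level order changes; the values stored in the vector stay the same.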
Read a CSV file
Split it into subsets based on features of the data
Write out subsets as CSV files
Work on the first part in ae-08-age-gaps-sales-import.qmd.
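The read, split, write workflow listed above can be sketched as follows. The data, column names, and file paths here are hypothetical stand-ins written to temporary files, not the dataset used in ae-08.

```r
library(readr)
library(dplyr)

# Hypothetical stand-in data written to a temporary CSV so the example
# runs on its own; in practice you would start from an existing file
input_path <- tempfile(fileext = ".csv")
write_csv(
  tibble(
    id = 1:4,
    category = c("a", "b", "a", "b"),
    value = c(10, 20, 30, 40)
  ),
  input_path
)

# 1. Read a CSV file
records <- read_csv(input_path, show_col_types = FALSE)

# 2. Split it into subsets based on a feature of the data
category_a <- records |>
  filter(category == "a")
category_b <- records |>
  filter(category == "b")

# 3. Write out the subsets as CSV files
path_a <- tempfile(fileext = ".csv")
path_b <- tempfile(fileext = ".csv")
write_csv(category_a, path_a)
write_csv(category_b, path_b)
```

Reading the written files back with read_csv() is a quick way to confirm that each subset contains the rows you expect.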
What is the story in this visualization?
read_csv()
read_tsv(), read_delim(), etc.
read_excel()
read_sheet() – We haven’t covered this in the videos, but it might be useful for your projects
Read an Excel file with non-tidy data
Tidy it up!
Work on the second part in ae-08-age-gaps-sales-import.qmd.
Are these data tidy? Why or why not?
What “data moves” do we need to go from the original, non-tidy data to this, tidy one?