Lab 3
Inflation everywhere
Introduction
This lab gives you more practice with data wrangling and data pipelines, with an emphasis on mastering the subtleties of group_by(), as well as introducing pivoting to the mix.
This lab assumes you’ve completed the labs so far and doesn’t repeat setup and overview content from those labs. If you have not yet done those, you should go back and review the previous labs before starting on this one.
Learning objectives
By the end of the lab, you will…
- Be able to pivot/reshape data using tidyr
- Continue developing your data wrangling skills using dplyr
- Build on your mastery of data visualizations using ggplot2
- Get more experience with data science workflow using R, RStudio, Git, and GitHub
- Further your reproducible authoring skills with Quarto
- Improve your familiarity with version control using Git and GitHub
Getting started
Log in to RStudio, clone your lab-3 repo from GitHub, open your lab-3.qmd document, and get started!
Step 1: Log in to RStudio
- Go to https://cmgr.oit.duke.edu/containers and log in with your Duke NetID and Password.
- Click STA198-199 under My reservations to log into your container. You should now see the RStudio environment.
Step 2: Clone the repo & start new RStudio project
Go to the course organization at github.com/sta199-s25 on GitHub. Click on the repo with the prefix lab-3. It contains the starter documents you need to complete the lab.
Click on the green CODE button, select Use SSH (this might already be selected by default, and if it is, you’ll see the text Clone with SSH). Click on the clipboard icon to copy the repo URL.
In RStudio, go to File ➛ New Project ➛ Version Control ➛ Git.
Copy and paste the URL of your assignment repo into the dialog box Repository URL. Again, please make sure to have SSH highlighted under Clone when you copy the address.
Click Create Project, and the files from your GitHub repo will be displayed in the Files pane in RStudio.
Click lab-3.qmd to open the template Quarto file. This is where you will write up your code and narrative for the lab.
Step 3: Update the YAML
In lab-3.qmd, update the author field to your name, render your document, and examine the changes. Then, in the Git pane, click on Diff to view your changes, add a commit message (e.g., “Added author name”), and click Commit. Then, push the changes to your GitHub repository, and in your browser confirm that these changes have indeed propagated to your repository.
If you run into any issues with the first steps outlined above, flag a TA for help before proceeding.
Packages
In this lab we will work with the tidyverse package, which is a collection of packages for doing data analysis in a “tidy” way.
- Run the code cell labeled load-packages by clicking on the green triangle (play) button for that cell. This loads the package so that its features (the functions and datasets in it) are accessible from your Console.
- Then, render the document, which loads the package so that its features are available to the other code cells in your Quarto document.
Guidelines
As we’ve discussed in lecture, your plots should include an informative title, axes and legends should have human-readable labels, and careful consideration should be given to aesthetic choices.
Additionally, code should follow the tidyverse style. In particular,
- there should be spaces before and line breaks after each + when building a ggplot,
- there should also be spaces before and line breaks after each |> in a data transformation pipeline,
- code should be properly indented,
- there should be spaces around = signs and spaces after commas.
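As a minimal sketch of these style points, the code below uses a small made-up data frame. The names style_demo, style_summary, and style_plot and all of the values are hypothetical and are not part of the lab data.

```r
library(tidyverse)

# Hypothetical toy data, used only to illustrate code style
style_demo <- tibble(
  group = c("a", "a", "b", "b"),
  x = c(1, 2, 1, 2),
  y = c(3, 5, 2, 6)
)

# Space before and line break after each |>, indented continuation lines,
# spaces around = and after commas
style_summary <- style_demo |>
  filter(x > 1) |>
  mutate(y_doubled = y * 2)

# Space before and line break after each + when building a ggplot
style_plot <- ggplot(style_demo, aes(x = x, y = y, color = group)) +
  geom_point() +
  labs(
    title = "Toy example of tidyverse code style",
    x = "X value",
    y = "Y value",
    color = "Group"
  )
```

Breaking long calls such as labs() across lines, one argument per line, also keeps code from running off the page in the PDF.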
Furthermore, all code should be visible in the PDF output, i.e., should not run off the page on the PDF. Long lines that run off the page should be split across multiple lines with line breaks.
Continuing to develop a sound workflow for reproducible data analysis is important as you complete the lab and other assignments in this course. Periodic reminders in this assignment will prompt you to render, commit, and push your changes to GitHub. You should have at least 3 commits with meaningful commit messages by the end of the assignment.
Questions
Part 1
All about group_by()!
Use the following data frame for Question 1 and Question 2:
df <- tibble(
  var_1 = c(50, 20, 70, 10, 100, 30, 40, 80, 120, 60, 90, 110),
  var_2 = c("Pizza", "Burger", "Pizza", "Pizza", "Burger", "Burger",
            "Burger", "Pizza", "Burger", "Pizza", "Pizza", "Burger"),
  var_3 = c("Apple", "Apple", "Pear", "Banana", "Pear", "Banana",
            "Apple", "Apple", "Pear", "Pear", "Banana", "Banana")
)
df
# A tibble: 12 × 3
   var_1 var_2  var_3
   <dbl> <chr>  <chr>
 1    50 Pizza  Apple
 2    20 Burger Apple
 3    70 Pizza  Pear
 4    10 Pizza  Banana
 5   100 Burger Pear
 6    30 Burger Banana
 7    40 Burger Apple
 8    80 Pizza  Apple
 9   120 Burger Pear
10    60 Pizza  Pear
11    90 Pizza  Banana
12   110 Burger Banana
Question 1
Grouping by a single variable.
a. What does the following code chunk do? Run it, analyze the result, and articulate in words what arrange() does.
df |>
  arrange(var_2)
b. What does the following code chunk do? Run it, analyze the result, and articulate in words what group_by() does. Also, comment on how it’s different from the arrange() in part (a).
df |>
  group_by(var_2)
c. What does the following code chunk do? Run it, analyze the result, and articulate in words what the pipeline does.
d. Compare this behavior to the following code chunk. Run it, analyze the result, and articulate in words what the pipeline does, and how its behavior is different from part (c).
Question 2
Grouping by two variables.
a. How many levels does var_2 have? How many levels does var_3 have? How many possible combinations are there of the levels of var_2 and var_3?
b. Write some code that uses basic arithmetic operations to manually compute the average value of var_1 among all the rows that have Burger-Apple. Do the same thing for the rows that have Burger-Banana.
c. You’re probably sick of that, right? What does the following code chunk do? Run it, analyze the result, and articulate in words what the pipeline does and how it compares to part (b). Then, comment on what the message says.
d. What does the following code chunk do? Run it, analyze the result, and articulate in words what the pipeline does, especially mentioning what the .groups argument does. How is the output different from the one in part (c)?
e. What do the following pipelines do? Run both, analyze their results, and articulate in words what each pipeline does. How are the outputs of the two pipelines different?
Render, commit, and push your changes to GitHub with the commit message “Added answers for Questions 1 and 2”.
Make sure to commit and push all changed files so that your Git pane is empty afterward.
Part 2
Inflation across the world
For this part of the analysis you will work with inflation data from various countries in the world over the last 30 years.
country_inflation <- read_csv("data/country-inflation.csv")
Question 3
Get to know the data.
a. glimpse() at the country_inflation data frame and answer the following questions based on the output. How many rows does country_inflation have and what does each row represent? How many columns does country_inflation have and what does each column represent?
b. Display a list of the unique countries included in the dataset. Hint: Another word for “unique” is distinct().
A function that can be useful for part (b) is pull(). Check out its documentation for examples of usage.
Question 4
Which countries had the top three highest inflation rates in 2023? Your output should be a data frame with two columns, country and 2023, with inflation rates in descending order, and three rows for the top three countries. Briefly comment on how the inflation rates for these countries compare to the inflation rate for the United States in that year.
- You might find the function slice_head() helpful here.
- Column names that are numbers are not considered “proper” in R, therefore to select them you’ll need to surround them with backticks, e.g. select(`1993`).
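As a rough illustration of both hints, the sketch below uses a made-up data frame whose column names are years. The names toy_rates, city, and top_two and all of the values are hypothetical and are not the lab data.

```r
library(tidyverse)

# Hypothetical toy data with year-named columns (not the lab data)
toy_rates <- tibble(
  city = c("Alpha", "Beta", "Gamma", "Delta"),
  `2001` = c(2.5, 7.1, 4.0, 1.2),
  `2002` = c(3.0, 6.4, 9.9, 0.8)
)

# Backticks let select() and arrange() refer to the column named 2002;
# slice_head() then keeps the first n rows of the sorted result
top_two <- toy_rates |>
  select(city, `2002`) |>
  arrange(desc(`2002`)) |>
  slice_head(n = 2)

top_two
```

Without the backticks, R would read 2002 as a number instead of a column name.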
Question 5
In a single pipeline,
- calculate the ratio of the inflation in 2023 and inflation in 1993 for each country and store this information in a new column called inf_ratio,
- select the variables country and inf_ratio, and
- store the result in a new data frame called country_inflation_ratios.
Then, in two separate pipelines,
- arrange country_inflation_ratios in increasing order of inf_ratio and
- arrange country_inflation_ratios in decreasing order of inf_ratio.
Which country’s inflation increase is the largest over this time period and by how much? Which country’s inflation decrease is the largest over this time period and by how much?
For this question you’ll once again need to use variables whose names are numbers (years) in your pipeline. Make sure to surround the names of such variables with backticks (`).
Question 6
Reshape (pivot) country_inflation such that each row represents a country/year combination, with columns country, year, and annual_inflation. Then, display the resulting data frame and state how many rows and columns it has.
Requirements:
- Your code must use one of pivot_longer() or pivot_wider(). There are other ways you can do this reshaping move in R, but this question requires solving this problem by pivoting.
- In your pivot_*() function, you must use names_transform = as.numeric as an argument to transform the variable type to numeric as you pivot the data so that in the resulting data frame the year variable is numeric.
- The resulting data frame must be saved as something other than country_inflation so you (1) can refer to this data frame later in your analysis and (2) do not overwrite country_inflation. Use a short but informative name.
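To see what names_transform does during a pivot, here is a small sketch on made-up data. The names toy_wide, toy_long, city, and rate are hypothetical, and the choice of pivot_longer() here is just for this toy example.

```r
library(tidyverse)

# Hypothetical wide data: one row per city, one column per year
toy_wide <- tibble(
  city = c("Alpha", "Beta"),
  `2001` = c(2.5, 7.1),
  `2002` = c(3.0, 6.4)
)

# Column names are character strings by default; names_transform converts
# them while pivoting, so the new year column is numeric
toy_long <- toy_wide |>
  pivot_longer(
    cols = -city,
    names_to = "year",
    values_to = "rate",
    names_transform = as.numeric
  )

toy_long
```

Saving the result under a new name (toy_long) leaves the original wide data frame (toy_wide) untouched.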
The remaining questions require the use of the pivoted data frame.
If you haven’t yet done so, now is a good time to render, commit, and push. Make sure that you commit and push all changed documents and your Git pane is completely empty before proceeding.
Question 7
Use a separate, single pipeline to answer each of the following questions.
Requirement: Your code must use the filter() function for each part, not arrange().
a. What is the highest inflation rate observed between 1993 and 2023? The output of the pipeline should be a data frame with one row and three columns. In addition to code and output, your response should include a single sentence stating the country and year.
b. What is the lowest inflation rate observed between 1993 and 2023? The output of the pipeline should be a data frame with one row and three columns. In addition to code and output, your response should include a single sentence stating the country and year.
c. Putting (a) and (b) together: What are the highest and the lowest inflation rates observed between 1993 and 2023? The output of the pipeline should be a data frame with two rows and three columns.
Question 8
a. Create a vector called countries_of_interest which contains the names of up to five countries you want to visualize the inflation rates for over the years. For example, if these countries are Türkiye and United States, you can express this as follows:
countries_of_interest <- c("Türkiye", "United States")
If they are Türkiye, United States, and Chile, you can express this as follows:
countries_of_interest <- c(
  "Türkiye", "United States", "Chile"
)
So on and so forth… Then, in 1-2 sentences, state why you chose these countries.
Your countries_of_interest should consist of no more than five countries. Make sure that the spelling of your countries matches how they appear in the dataset.
b. In a single pipeline, filter your reshaped dataset to include only the countries_of_interest from part (a), and save the resulting data frame with a new name so you (1) can refer to this data frame later in your analysis and (2) do not overwrite the data frame you’re starting with. Use a short but informative name. Then, in a new pipeline, find the distinct() countries in the data frame you created.
The number of distinct countries in the filtered data frame you created in part (b) should equal the number of countries you chose in part (a). If it doesn’t, you might have misspelled a country name or made a mistake in filtering for these countries. Go back and correct your work.
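The check described above can be sketched on made-up data as follows. The names toy_long, cities_of_interest, toy_subset, and n_found are hypothetical, and one name is misspelled on purpose to show what a mismatch looks like.

```r
library(tidyverse)

# Hypothetical long data (not the lab data)
toy_long <- tibble(
  city = c("Alpha", "Alpha", "Beta", "Beta", "Gamma", "Gamma"),
  year = c(2001, 2002, 2001, 2002, 2001, 2002),
  rate = c(2.5, 3.0, 7.1, 6.4, 4.0, 9.9)
)

# "Gama" is deliberately misspelled, so it will not match any row
cities_of_interest <- c("Alpha", "Gama")

# %in% keeps rows whose city appears in the vector
toy_subset <- toy_long |>
  filter(city %in% cities_of_interest)

# Count the distinct cities that survived the filter
n_found <- toy_subset |>
  distinct(city) |>
  nrow()

# Fewer cities found than requested signals a problem
n_found == length(cities_of_interest)

# List the requested names that never matched
setdiff(cities_of_interest, toy_subset$city)
```

Here the comparison is FALSE and setdiff() points to the misspelled name, which is the cue to go back and fix the vector.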
Question 9
Using your data frame from the previous question, create a plot of annual inflation vs. year for these countries. Then, in a few sentences, describe the patterns you observe in the plot, particularly focusing on anything you find surprising or not surprising, based on your knowledge (or lack thereof) of these countries’ economies.
Requirements for the plot:
- Data should be represented with points as well as lines connecting the points for each country.
- Each country should be represented by a different color line and different color and shape points.
- Axes and legend should be properly labeled.
- The plot should have an appropriate title (and optionally a subtitle).
- Plot should be customized in at least one way – you could use a different than default color scale, or different than default theme, or some other customization.
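One possible way to combine these requirements is sketched below on made-up data. The names toy_long, city, rate, and toy_plot are hypothetical, and the viridis color scale and minimal theme are just one customization among many you could choose.

```r
library(tidyverse)

# Hypothetical long data (not the lab data)
toy_long <- tibble(
  city = c("Alpha", "Alpha", "Beta", "Beta", "Gamma", "Gamma"),
  year = c(2001, 2002, 2001, 2002, 2001, 2002),
  rate = c(2.5, 3.0, 7.1, 6.4, 4.0, 9.9)
)

# Mapping both color and shape to the same variable gives each group its own
# line color, point color, and point shape, with a single merged legend
toy_plot <- ggplot(
  toy_long,
  aes(x = year, y = rate, color = city, shape = city)
) +
  geom_line() +
  geom_point() +
  scale_color_viridis_d() +
  labs(
    title = "Toy rates by city",
    subtitle = "Made-up values for illustration only",
    x = "Year",
    y = "Rate (%)",
    color = "City",
    shape = "City"
  ) +
  theme_minimal()

toy_plot
```

Giving the color and shape legends the same title in labs() is what lets ggplot2 merge them into one legend.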
Now is another good time to render, commit, and push. Make sure that you commit and push all changed documents and your Git pane is completely empty before proceeding.
Wrap-up
Submission
Once you are finished with the lab, you will submit your final PDF document to Gradescope.
Before you wrap up the assignment, make sure all of your documents are updated on your GitHub repo. We will be checking these to make sure you have been practicing how to commit and push changes.
You must turn in a PDF file to the Gradescope page by the submission deadline to be considered “on time”.
To submit your assignment:
- Go to http://www.gradescope.com and click Log in in the top right corner.
- Click School Credentials ➛ Duke NetID and log in using your NetID credentials.
- Click on your STA 199 course.
- Click on the assignment, and you’ll be prompted to submit it.
- Mark all the pages associated with each question. All the pages of your lab should be associated with at least one question (i.e., should be “checked”).
Make sure you have:
- attempted all questions
- rendered your Quarto document
- committed and pushed everything to your GitHub repository such that the Git pane in RStudio is empty
- uploaded your PDF to Gradescope
- selected pages associated with each question on Gradescope
Grading and feedback
- Some of the questions will be graded for accuracy.
- Some will be graded for completion.
- There are also workflow points:
  - for coding style;
  - for committing at least three times as you work through your lab;
  - for pushing your final rendered PDF into your lab repo before the deadline (in addition to uploading it to Gradescope);
  - for overall organization.
You’ll receive feedback on your lab on Gradescope within a week.