Lab 7
Logistic regression
Introduction
All things logistic regression! This is a new statistical model we introduced for modeling a response variable that is binary (categorical with two levels) rather than numerical. If we can do that, then we can use the model for a special kind of prediction called classification.
This lab assumes you’ve completed the labs so far and doesn’t repeat setup and overview content from those labs. If you haven’t done those yet, you should review the previous labs before starting on this one.
Learning objectives
By the end of the lab, you will…
- fit, interpret, predict with, and compare logistic regression models;
- work with the ROC curve and the area under the ROC curve;
- complete a data science assessment.
And, as usual, you will also…
- Get more experience with data science workflow using R, RStudio, Git, and GitHub
- Further your reproducible authoring skills with Quarto
- Improve your familiarity with version control using Git and GitHub
Getting started
Log in to RStudio, clone your lab-7 repo from GitHub, open your lab-7.qmd document, and get started!
Step 1: Log in to RStudio
- Go to https://cmgr.oit.duke.edu/containers and log in with your Duke NetID and Password.
- Click STA198-199 under My reservations to log into your container. You should now see the RStudio environment.
Step 2: Clone the repo & start a new RStudio project
Go to the course organization at github.com/sta199-s25 on GitHub. Click on the repo with the prefix lab-7. It contains the starter documents you need to complete the lab.
Click on the green CODE button and select Use SSH. This might already be selected by default; if it is, you’ll see the text Clone with SSH. Click on the clipboard icon to copy the repo URL.
In RStudio, go to File ➛ New Project ➛ Version Control ➛ Git.
Copy and paste the URL of your assignment repo into the dialog box Repository URL. Again, please make sure to have SSH highlighted under Clone when you copy the address.
Click Create Project, and the files from your GitHub repo will be displayed in the Files pane in RStudio.
Click lab-7.qmd to open the template Quarto file. This is where you will write up your code and narrative for the lab.
Step 3: Update the YAML
In lab-7.qmd, update the author field to your name, render your document, and examine the changes. Then, in the Git pane, click on Diff to view your changes, add a commit message (e.g., “Added author name”), and click Commit. Then, push the changes to your GitHub repository, and in your browser confirm that these changes have indeed propagated to your repository.
If you run into any issues with the first steps outlined above, flag a TA for help before proceeding.
Packages
In this lab, we will work with the
- tidyverse package for doing data analysis in a “tidy” way;
- tidymodels package for modeling in a “tidy” way.
To load them:
- Run the code cell labeled load-packages by clicking on its green triangle (play) button. This loads the packages so that their features (the functions and datasets in them) are accessible from your Console.
- Then, render the document. Rendering also loads the packages, which makes their features available to the other code cells in your Quarto document.
Guidelines
As we’ve discussed in lecture, your plots should include an informative title, axes and legends should have human-readable labels, and careful consideration should be given to aesthetic choices.
Additionally, code should follow the tidyverse style. Particularly,
- there should be spaces before and line breaks after each + when building a ggplot,
- there should also be spaces before and line breaks after each |> in a data transformation pipeline,
- code should be properly indented,
- there should be spaces around = signs and spaces after commas.
Furthermore, all code should be visible in the PDF output, i.e., should not run off the page on the PDF. Long lines that run off the page should be split across multiple lines with line breaks.
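As an illustration of these guidelines, here is a minimal sketch that uses the mpg dataset that ships with ggplot2 (not the lab data). The object name mpg_2008 and the plot labels are assumptions made for this example only.

```r
library(tidyverse)

# Each |> ends its line, and the next step is indented
mpg_2008 <- mpg |>
  filter(year == 2008) |>
  select(displ, hwy, class)

# Each + ends its line; long calls are split across lines
# so that nothing runs off the page in the PDF
ggplot(mpg_2008, aes(x = displ, y = hwy, color = class)) +
  geom_point() +
  labs(
    title = "Highway mileage vs. engine displacement",
    subtitle = "Cars from model year 2008",
    x = "Engine displacement (liters)",
    y = "Highway mileage (miles per gallon)",
    color = "Type of car"
  )
```

Note the spaces around = and after commas, and the human-readable title, axis, and legend labels.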
Continuing to develop a sound workflow for reproducible data analysis is important as you complete the lab and other assignments in this course. This assignment includes periodic reminders to render, commit, and push your changes to GitHub. You should have at least 3 commits with meaningful commit messages by the end of the assignment.
You are also expected to pay attention to code smell in addition to code style and readability. You should review and improve your code to avoid redundant steps (e.g., grouping, ungrouping, and grouping again by the same variable in a pipeline), inconsistent syntax (e.g., using ! to say “not” in one place and - in another place), etc.
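To make the idea of a redundant step concrete, here is a small sketch, again using the mpg dataset from ggplot2 rather than the lab data. The object names smelly and clean are hypothetical.

```r
library(tidyverse)

# Smelly: groups, ungroups, and groups again by the same variable
smelly <- mpg |>
  group_by(class) |>
  mutate(mean_hwy = mean(hwy)) |>
  ungroup() |>
  group_by(class) |>
  summarize(mean_hwy = first(mean_hwy), n = n())

# Cleaner: a single grouping produces the same summary
clean <- mpg |>
  group_by(class) |>
  summarize(mean_hwy = mean(hwy), n = n())
```

Both pipelines return one row per class with the same values; the second one is shorter and easier to read.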
Part 1 - logistic regression
For this exercise, we will work with hotel cancellations. The data describe the demand of two different types of hotels. Each observation represents a hotel booking between July 1, 2015 and August 31, 2017. Some bookings were cancelled (is_canceled = 1) and others were kept, i.e., the guests checked into the hotel (is_canceled = 0). You can view the code book for all variables here.
Question 1 - data preparation
a. Read the data in with read_csv().
b. Transform the is_canceled variable to be a factor with “not canceled” (0) as the first level and “canceled” (1) as the second level. Confirm that you were able to successfully transform the variable by running levels(hotels$is_canceled).
c. Split the data into training (75%) and testing (25%) sets. Report the number of rows in the testing and training sets. NOTE: use set.seed(555).
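The recoding and splitting steps can be sketched on a small made-up data frame. Everything named toy_* below is hypothetical and stands in for the real bookings data. Using the text labels “not canceled” and “canceled” for the factor levels is one reading of the instructions, not the only possible one.

```r
library(tidyverse)
library(tidymodels)

# Toy stand-in for the bookings data (made-up values)
toy_bookings <- tibble(
  is_canceled = rep(c(0, 1), times = c(60, 40)),
  lead_time = seq(1, 100)
)

# Recode the 0/1 indicator as a factor with an explicit level order
toy_bookings <- toy_bookings |>
  mutate(
    is_canceled = factor(
      is_canceled,
      levels = c(0, 1),
      labels = c("not canceled", "canceled")
    )
  )

# The first level should be "not canceled", the second "canceled"
levels(toy_bookings$is_canceled)

# Setting the seed right before the split makes it reproducible
set.seed(555)
toy_split <- initial_split(toy_bookings, prop = 0.75)
toy_train <- training(toy_split)
toy_test <- testing(toy_split)

nrow(toy_train)
nrow(toy_test)
```

With 100 toy rows, the split should put 75 rows in the training set and 25 in the testing set.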
Question 2 - exploratory analysis
a. In a single pipeline, calculate the mean arrival dates (arrival_date_day_of_month) for reservations that were cancelled and reservations that were not cancelled in the training data.
b. Explore attributes of bookings in the training data and summarize your findings in 5 bullet points. You must provide a visualization or summary supporting each finding. Note: This is not meant to be an exhaustive exploration. We anticipate a wide variety of answers to this question.
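A grouped summary in a single pipeline might look like the following sketch. The four-row toy_train data frame and the output column name mean_arrival_day are made up for illustration.

```r
library(tidyverse)

# Tiny made-up training set: two kept bookings, two cancelled
toy_train <- tibble(
  is_canceled = factor(
    c("not canceled", "canceled", "canceled", "not canceled"),
    levels = c("not canceled", "canceled")
  ),
  arrival_date_day_of_month = c(3, 12, 20, 27)
)

# One pipeline: group by the outcome, then average within each group
toy_means <- toy_train |>
  group_by(is_canceled) |>
  summarize(mean_arrival_day = mean(arrival_date_day_of_month))

toy_means
```

Here the result has one row per level of is_canceled: a mean of 15 for “not canceled” and 16 for “canceled”.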
Question 3 - fit a simple model
a. Fit the appropriate model to predict whether a reservation was cancelled from arrival_date_day_of_month and display a tidy summary of the model output. Make sure you use only the training data.
b. Typeset the fitted regression equation.
c. Interpret the slope coefficient in the context of the data and the research question.
d. Calculate the probability that a hotel reservation is cancelled if the arrival date is on the 26th of the month. Based on this probability, would you predict this booking would be cancelled or not cancelled? Explain your reasoning for your classification.
e. Use the augment function to generate predictions for every observation in the test dataset, and make sure you store the results in a variable with an informative name.
f. Compute the error rates (TPR, TNR, FPR, FNR) of the classifications you just produced.
g. Plot the ROC curve for this model and the test data.
Since the level of is_canceled we’re predicting is the second level, we need to set event_level = "second" when calculating the ROC curve.
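The overall workflow (fit, predict, augment, tabulate, ROC) can be sketched on simulated data. Everything named toy_* is hypothetical, the data are randomly generated rather than real bookings, and the predictor and the value 100 were chosen only for illustration.

```r
library(tidyverse)
library(tidymodels)

# Simulated bookings: cancellation is more likely for longer lead times
set.seed(1)
toy_bookings <- tibble(
  lead_time = runif(400, min = 0, max = 300),
  is_canceled = factor(
    if_else(
      runif(400) < plogis(-1.5 + 0.01 * lead_time),
      "canceled",
      "not canceled"
    ),
    levels = c("not canceled", "canceled")
  )
)

toy_split <- initial_split(toy_bookings, prop = 0.75)
toy_train <- training(toy_split)
toy_test <- testing(toy_split)

# Fit a logistic regression on the training data only
toy_fit <- logistic_reg() |>
  fit(is_canceled ~ lead_time, data = toy_train)

tidy(toy_fit)

# Predicted probability for one new booking, two ways:
# via predict(), and by hand from the fitted log-odds
new_prob <- predict(
  toy_fit,
  new_data = tibble(lead_time = 100),
  type = "prob"
)
coefs <- tidy(toy_fit)$estimate
manual_prob <- plogis(coefs[1] + coefs[2] * 100)

# Predictions for every observation in the test set
toy_test_aug <- augment(toy_fit, new_data = toy_test)

# Within each true class, the proportion of each predicted class:
# truth "canceled" rows give TPR and FNR,
# truth "not canceled" rows give TNR and FPR
toy_test_aug |>
  count(is_canceled, .pred_class) |>
  group_by(is_canceled) |>
  mutate(prop = n / sum(n))

# ROC curve, treating the second level ("canceled") as the event
toy_roc <- roc_curve(
  toy_test_aug,
  truth = is_canceled,
  .pred_canceled,
  event_level = "second"
)
autoplot(toy_roc)
```

The hand calculation with plogis() should agree with the .pred_canceled column returned by predict(), because the model describes the log-odds of the second factor level.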
Question 4 - fit an alternative model
a. Fit the appropriate model to predict whether a reservation was cancelled from hotel, arrival_date_month, and lead_time, and display a tidy summary of the model output. Make sure you use only the training data.
b. Interpret the intercept coefficient in the context of the data and the research question.
c. Use the augment function to generate predictions for every observation in the test dataset, and make sure you store the results in a variable with an informative name.
d. Plot the ROC curve for this model and the test data.
Question 5 - selecting a model
a. Plot the ROC curves of the models from Question 3 and Question 4 on the same plot, using different colors for each model and a legend that describes which model is represented with which color.
b. Calculate the AUC (area under the curve) for each model using the roc_auc() function.
c. Based on the results of parts a and b, which model do you prefer? Explain your reasoning.
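One way to overlay ROC curves and compare AUCs is to stack the augmented predictions from both models and group by a label column. The sketch below does this on simulated data. The two toy models, the noise predictor, and the model labels are all assumptions made for illustration, not the models from this lab.

```r
library(tidyverse)
library(tidymodels)

# Simulated data: lead_time is informative, noise is not
set.seed(2)
toy_bookings <- tibble(
  lead_time = runif(400, min = 0, max = 300),
  noise = rnorm(400),
  is_canceled = factor(
    if_else(
      runif(400) < plogis(-1.5 + 0.01 * lead_time),
      "canceled",
      "not canceled"
    ),
    levels = c("not canceled", "canceled")
  )
)

toy_split <- initial_split(toy_bookings, prop = 0.75)
toy_train <- training(toy_split)
toy_test <- testing(toy_split)

fit_noise <- logistic_reg() |>
  fit(is_canceled ~ noise, data = toy_train)
fit_lead <- logistic_reg() |>
  fit(is_canceled ~ lead_time, data = toy_train)

# Stack test-set predictions, labeling which model made them
toy_preds <- bind_rows(
  augment(fit_noise, new_data = toy_test) |>
    mutate(model = "Noise only"),
  augment(fit_lead, new_data = toy_test) |>
    mutate(model = "Lead time")
)

# One ROC curve per model, drawn on the same axes
toy_preds |>
  group_by(model) |>
  roc_curve(
    truth = is_canceled,
    .pred_canceled,
    event_level = "second"
  ) |>
  ggplot(aes(x = 1 - specificity, y = sensitivity, color = model)) +
  geom_path() +
  geom_abline(linetype = "dashed") +
  labs(
    title = "ROC curves for two toy models",
    x = "False positive rate",
    y = "True positive rate",
    color = "Model"
  )

# One AUC per model
toy_auc <- toy_preds |>
  group_by(model) |>
  roc_auc(
    truth = is_canceled,
    .pred_canceled,
    event_level = "second"
  )

toy_auc
```

A curve closer to the top-left corner, and an AUC closer to 1, indicates a model that separates the two classes better; an AUC near 0.5 is about what random guessing achieves.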
Part 2 - data science assessment
Take a data science assessment at https://duke.qualtrics.com/jfe/form/SV_3WUJ8CJw8hlzit0. This will take you about an hour, and it is a mandatory part of your Lab 7 grade. The assessment is graded for completion and an honest effort, not accuracy. You will not be penalized for wrong answers, so please answer every question to the best of your ability. You should not consult other people or any outside resources.
This assessment is part of a research study, and at the end of the assessment, you may choose whether or not you wish to opt into the study. I will never know what you chose. The research administrator will deliver your responses to me regardless of your study participation.