Grammar of data transformation

Lecture 3

John Zito

Duke University
STA 199 Spring 2025

2025-01-21

Now, where was I?

Alison Bechdel

The Bechdel Test

(Dykes to Watch Out For - 1985)

A film passes if it has…

  1. at least two female characters;
  2. who talk to each other;
  3. about something besides a man.

Do JZ’s favorite movies pass?

Double Indemnity (1944) 🥴
Sunset Boulevard (1950) 🥴
Sweet Smell of Success (1957)
One Hundred and One Dalmatians (1961)
Chinatown (1974)
Amadeus (1984)
Goodfellas (1990) 🥴
Bram Stoker’s Dracula (1992)
The Lord of the Rings (2001 - 2003)
Vera Drake (2004)

Our starting point

“We did a statistical analysis of films to test two claims: first, that films that pass the Bechdel test — featuring women in stronger roles — see a lower return on investment, and second, that they see lower gross profits. We found no evidence to support either claim.”

ae-02-bechdel-dataviz

Go to RStudio, confirm that you’re in the ae project, and open the document ae-02-bechdel-dataviz.qmd.

Recap: Code cells (aka code chunks)

  • Cell labels are helpful for describing what the code is doing, for jumping between code cells in the editor, and for troubleshooting

  • message: false hides any messages emitted by the code in your rendered document
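
As a small illustration of the two points above, the body of a code cell might look like the sketch below. The label load-packages and the choice of the dplyr package are assumptions made for this example, not taken from the application exercise. In a .qmd file these lines would sit inside an {r} cell, where Quarto reads the lines starting with #| as cell options.

```r
#| label: load-packages
#| message: false

# Outside Quarto, the two lines above are ordinary R comments.
# Loading dplyr normally emits a startup message about masked functions;
# with message: false, that message is left out of the rendered document.
library(dplyr)
```

The label then shows up in the editor's cell navigation and in render error output, which is what makes it useful for troubleshooting.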

Describing distributions and relationships

Talking about one numerical variable

  • center: what is the “typical” value (mean, median, mode) that the data concentrate around?
  • spread: how concentrated are the data around a typical value?
  • shape: does the distribution have one peak, or many? is it symmetric or skewed?
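
The three ideas above can be sketched numerically in base R. The values below are the roi figures that these slides print for the first ten rows of bechdel (rounded as displayed), so this is only a toy summary and not an analysis of the full data. The variable names are chosen for this example.

```r
# roi values as printed in the slides for the first ten rows of bechdel
roi <- c(5.22, 1.21, 10.6, 3.41, 4.75, 0.819, 4.04, 8.55, 2.77, 2.35)

# Center: two candidates for the "typical" value
center_mean <- mean(roi)
center_median <- median(roi)

# Spread: how tightly the values cluster around the center
spread_sd <- sd(roi)
spread_iqr <- IQR(roi)

# Shape: a mean above the median hints at a right skew
mean_above_median <- center_mean > center_median

print(c(mean = center_mean, median = center_median,
        sd = spread_sd, iqr = spread_iqr))
print(mean_above_median)
```

Comparing the mean with the median is only a rough check on shape; a histogram shows the number of peaks and the skew directly.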

Interaction between shape and center

Histograms provide more detail…

…but boxplots are nice for side-by-side comparisons

Talking about two numerical variables

  • direction: positive or negative
  • shape: linear or nonlinear
  • strength: how close are points to the “trend”
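
For a linear trend, direction and strength can be summarized with a correlation coefficient. The sketch below uses the budget_2013 and gross_2013 values that these slides print for the first ten rows of bechdel; ten rows are far too few to draw conclusions, and the variable names r and direction are chosen for this example.

```r
# budget_2013 and gross_2013 as printed for the first ten rows of bechdel
budget_2013 <- c(13000000, 45658735, 20000000, 61000000, 40000000,
                 225000000, 92000000, 12000000, 13000000, 130000000)
gross_2013 <- c(67878146, 55078343, 211714070, 208105475, 190040426,
                184166317, 371598396, 102648667, 36014634, 304895295)

# Pearson correlation: the sign gives the direction,
# the absolute value gives the strength of the linear trend
r <- cor(budget_2013, gross_2013)
direction <- if (r > 0) "positive" else "negative"

print(direction)
print(round(r, 2))
```

The correlation says nothing about shape: whether the relationship is linear or nonlinear still has to be judged from a scatterplot.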

Strength and direction of linear relationships

Nonlinear relationships

Data transformation

A quick reminder

# Start with the bechdel data frame
bechdel |>
  # Filter for movies with roi greater than 400 (gross is more than 400 times budget)
  filter(roi > 400) |>
  # Select the columns title, roi, budget_2013, gross_2013, year, and clean_test
  select(title, roi, budget_2013, gross_2013, year, clean_test)
# A tibble: 3 × 6
  title                     roi budget_2013 gross_2013  year clean_test
  <chr>                   <dbl>       <dbl>      <dbl> <dbl> <chr>     
1 Paranormal Activity      671.      505595  339424558  2007 dubious   
2 The Blair Witch Project  648.      839077  543776715  1999 ok        
3 El Mariachi              583.       11622    6778946  1992 nowomen   

The pipe |>

The pipe operator passes what comes before it into the function that comes after it as the first argument in that function.

sum(1, 2)
[1] 3
1 |> 
  sum(2)
[1] 3


select(filter(bechdel, roi > 400), title)
# A tibble: 3 × 1
  title                  
  <chr>                  
1 Paranormal Activity    
2 The Blair Witch Project
3 El Mariachi            
bechdel |>
  filter(roi > 400) |>
  select(title)
# A tibble: 3 × 1
  title                  
  <chr>                  
1 Paranormal Activity    
2 The Blair Witch Project
3 El Mariachi            
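
The equivalence of the nested call and the pipeline can be checked directly. Because the full bechdel data frame is not reproduced here, the sketch below builds a five-row stand-in called movies (a name chosen for this example) from rows printed in these slides, with roi rounded as displayed. It assumes dplyr is installed and R is version 4.1 or later, which the |> operator requires.

```r
library(dplyr)

# Five rows copied from output printed in these slides (roi rounded as shown)
movies <- tibble(
  title = c("Paranormal Activity", "The Blair Witch Project", "El Mariachi",
            "21 & Over", "Dredd 3D"),
  roi = c(671, 648, 583, 5.22, 1.21)
)

# Nested form: read from the inside out
nested <- select(filter(movies, roi > 400), title)

# Piped form: read from top to bottom
piped <- movies |>
  filter(roi > 400) |>
  select(title)

# Compare the two results
print(identical(nested, piped))
```

Both forms run the same two function calls in the same order; the pipe only changes the order in which the code is read.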

Code style tip

  • In data transformation pipelines, always
    • put a space before |>
    • put a line break after |>
    • indent the next line of code
  • In data visualization layers, always
    • put a space before +
    • put a line break after +
    • indent the next line of code
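
Both conventions can be seen side by side in the sketch below. It assumes dplyr and ggplot2 are installed and uses a four-row stand-in called movies, built from rows printed in these slides; the names movies, passing, and p and the axis labels are chosen for this example.

```r
library(dplyr)
library(ggplot2)

# Four rows copied from the bechdel output printed in these slides
movies <- tibble(
  title = c("21 & Over", "Dredd 3D", "About Time", "Admission"),
  budget_2013 = c(13000000, 45658735, 12000000, 13000000),
  gross_2013 = c(67878146, 55078343, 102648667, 36014634),
  binary = c("FAIL", "PASS", "PASS", "PASS")
)

# Pipeline: space before |>, line break after it, next line indented
passing <- movies |>
  filter(binary == "PASS") |>
  select(title, budget_2013, gross_2013)

# Plot layers: space before +, line break after it, next line indented
p <- ggplot(passing, aes(x = budget_2013, y = gross_2013)) +
  geom_point() +
  labs(x = "Budget", y = "Gross")
```

With one step per line, a single step can be commented out or reordered without touching the rest of the pipeline or plot.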

The pipe, in action

Find movies that pass the Bechdel test and display their titles and ROIs in descending order of ROI.

Start with the bechdel data frame:

bechdel
# A tibble: 1,615 × 7
   title                   year gross_2013 budget_2013    roi binary clean_test
   <chr>                  <dbl>      <dbl>       <dbl>  <dbl> <chr>  <chr>     
 1 21 & Over               2013   67878146    13000000  5.22  FAIL   notalk    
 2 Dredd 3D                2012   55078343    45658735  1.21  PASS   ok        
 3 12 Years a Slave        2013  211714070    20000000 10.6   FAIL   notalk    
 4 2 Guns                  2013  208105475    61000000  3.41  FAIL   notalk    
 5 42                      2013  190040426    40000000  4.75  FAIL   men       
 6 47 Ronin                2013  184166317   225000000  0.819 FAIL   men       
 7 A Good Day to Die Hard  2013  371598396    92000000  4.04  FAIL   notalk    
 8 About Time              2013  102648667    12000000  8.55  PASS   ok        
 9 Admission               2013   36014634    13000000  2.77  PASS   ok        
10 After Earth             2013  304895295   130000000  2.35  FAIL   notalk    
# ℹ 1,605 more rows

Filter for rows where binary is equal to "PASS":

bechdel |>
  filter(binary == "PASS")
# A tibble: 753 × 7
   title                 year gross_2013 budget_2013   roi binary clean_test
   <chr>                <dbl>      <dbl>       <dbl> <dbl> <chr>  <chr>     
 1 Dredd 3D              2012   55078343    45658735  1.21 PASS   ok        
 2 About Time            2013  102648667    12000000  8.55 PASS   ok        
 3 Admission             2013   36014634    13000000  2.77 PASS   ok        
 4 American Hustle       2013  397915817    40000000  9.95 PASS   ok        
 5 August: Osage County  2013   87609748    25000000  3.50 PASS   ok        
 6 Beautiful Creatures   2013   75392809    50000000  1.51 PASS   ok        
 7 Blue Jasmine          2013  101793664    18000000  5.66 PASS   ok        
 8 Carrie                2013  120268278    30000000  4.01 PASS   ok        
 9 Despicable Me 2       2013 1338831390    76000000 17.6  PASS   ok        
10 Elysium               2013  379242208   120000000  3.16 PASS   ok        
# ℹ 743 more rows

Arrange the rows in descending order of roi:

bechdel |>
  filter(binary == "PASS") |>
  arrange(desc(roi))
# A tibble: 753 × 7
   title                     year gross_2013 budget_2013   roi binary clean_test
   <chr>                    <dbl>      <dbl>       <dbl> <dbl> <chr>  <chr>     
 1 The Blair Witch Project   1999  543776715      839077 648.  PASS   ok        
 2 The Devil Inside          2012  157289709     1014639 155.  PASS   ok        
 3 My Big Fat Greek Wedding  2002  768922942     6475896 119.  PASS   ok        
 4 Chasing Amy               1997   39417963      362810 109.  PASS   ok        
 5 Slacker                   1991    4200140       39349 107.  PASS   ok        
 6 Insidious                 2010  164379554     1602348 103.  PASS   ok        
 7 Paranormal Activity 2     2010  280159759     3204696  87.4 PASS   ok        
 8 Paranormal Activity 3     2011  322170936     5178454  62.2 PASS   ok        
 9 The Last Exorcism         2010  118787648     1922817  61.8 PASS   ok        
10 Cinderella                1997  246710482     4208591  58.6 PASS   ok        
# ℹ 743 more rows

Select columns title and roi:

bechdel |>
  filter(binary == "PASS") |>
  arrange(desc(roi)) |>
  select(title, roi)
# A tibble: 753 × 2
   title                      roi
   <chr>                    <dbl>
 1 The Blair Witch Project  648. 
 2 The Devil Inside         155. 
 3 My Big Fat Greek Wedding 119. 
 4 Chasing Amy              109. 
 5 Slacker                  107. 
 6 Insidious                103. 
 7 Paranormal Activity 2     87.4
 8 Paranormal Activity 3     62.2
 9 The Last Exorcism         61.8
10 Cinderella                58.6
# ℹ 743 more rows
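
To run the same three steps without the course data, the sketch below rebuilds the first ten rows of bechdel as printed earlier in these slides, keeping only title, roi, and binary, with roi rounded as displayed. The names movies and top_passing are chosen for this example, and dplyr is assumed to be installed.

```r
library(dplyr)

# First ten rows of bechdel as printed in these slides (roi rounded as shown)
movies <- tibble(
  title = c("21 & Over", "Dredd 3D", "12 Years a Slave", "2 Guns", "42",
            "47 Ronin", "A Good Day to Die Hard", "About Time", "Admission",
            "After Earth"),
  roi = c(5.22, 1.21, 10.6, 3.41, 4.75, 0.819, 4.04, 8.55, 2.77, 2.35),
  binary = c("FAIL", "PASS", "FAIL", "FAIL", "FAIL",
             "FAIL", "FAIL", "PASS", "PASS", "FAIL")
)

top_passing <- movies |>
  filter(binary == "PASS") |>  # keep only movies that pass the test
  arrange(desc(roi)) |>        # highest roi first
  select(title, roi)           # keep the two columns to display

print(top_passing)
```

select() comes last here because it drops binary, which filter() needs; swapping those two steps would make the filter fail.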

In this class, you will…

Build cakes (ggplot)

Stack dolls (pipe |>)

Master these constructs, and everything will be coming up roses!