Data types and classes

Lecture 8

John Zito

Duke University
STA 199 Spring 2025

2025-02-06

Warm-up

While you wait…

Prepare for today’s application exercise: ae-07-durham-climate-factors

  • Go to your ae project in RStudio.

  • Make sure all of your changes up to this point are committed and pushed, i.e., there’s nothing left in your Git pane.

  • Click Pull to get today’s application exercise file: ae-07-durham-climate-factors.qmd.

  • Wait until you’re prompted to work on the application exercise during class before editing the file.

Regrade request policy

  • Considered for errors in grade calculation or if a correct answer was mistakenly marked as incorrect

  • Not a mechanism for:

    • disputing the number of points deducted for an incorrect response
    • asking for clarification on feedback (come to office hours instead)
  • Due on Gradescope within a week after an assignment is returned

  • The entire assignment may be regraded, which could result in an adjustment in either direction

  • No regrade requests after the final exam has been administered

Data types

How many classes do you have on Tuesdays?

survey
# A tibble: 209 × 3
   Timestamp         How many classes do you have on Tues…¹ `What year are you?`
   <chr>             <chr>                                  <chr>               
 1 2/6/2025 11:33:57 3                                      Sophomore           
 2 2/6/2025 11:37:39 3                                      First-year          
 3 2/6/2025 11:40:55 2                                      Senior              
 4 2/6/2025 11:42:05 3                                      First-year          
 5 2/6/2025 11:42:46 3                                      Senior              
 6 2/6/2025 11:43:28 3                                      Senior              
 7 2/6/2025 11:44:41 3                                      First-year          
 8 2/6/2025 11:44:49 3                                      First-year          
 9 2/6/2025 11:44:51 2                                      Sophomore           
10 2/6/2025 11:44:51 3                                      Sophomore           
# ℹ 199 more rows
# ℹ abbreviated name: ¹​`How many classes do you have on Tuesdays?`

rename() variables

To make them easier to work with…

survey <- survey |>
  rename(
    tue_classes = `How many classes do you have on Tuesdays?`,
    year = `What year are you?`
  )

Variable types

What type of variable is tue_classes?

survey
# A tibble: 209 × 3
   Timestamp         tue_classes year      
   <chr>             <chr>       <chr>     
 1 2/6/2025 11:33:57 3           Sophomore 
 2 2/6/2025 11:37:39 3           First-year
 3 2/6/2025 11:40:55 2           Senior    
 4 2/6/2025 11:42:05 3           First-year
 5 2/6/2025 11:42:46 3           Senior    
 6 2/6/2025 11:43:28 3           Senior    
 7 2/6/2025 11:44:41 3           First-year
 8 2/6/2025 11:44:49 3           First-year
 9 2/6/2025 11:44:51 2           Sophomore 
10 2/6/2025 11:44:51 3           Sophomore 
# ℹ 199 more rows

Variable types

Why isn’t the tue_classes column numeric?

survey |>
  count(tue_classes)
# A tibble: 13 × 2
   tue_classes                  n
   <chr>                    <int>
 1 1                           10
 2 2                           53
 3 2 -3                         1
 4 3                          104
 5 3 classes                    1
 6 4                           28
 7 5                            3
 8 Four                         1
 9 TWO MANY                     1
10 Three                        2
11 Two                          3
12 Two plus a chemistry lab     1
13 three                        1

Let’s clean it up

It’s a huge pain in the rear:

survey <- survey |>
  mutate(
    # Recode each free-text answer to a single number (still stored as text)
    tue_classes = case_when(
      tue_classes == "2 -3" ~ "3",
      tue_classes == "3 classes" ~ "3",
      tue_classes == "Four" ~ "4",
      tue_classes == "TWO MANY" ~ "2",
      tue_classes == "Three" ~ "3",
      tue_classes == "Two" ~ "2",
      tue_classes == "Two plus a chemistry lab" ~ "3",
      tue_classes == "three" ~ "3",
      .default = tue_classes # leave every other response unchanged
    ),
    # Every value now looks like a number, so the conversion is safe
    tue_classes = as.numeric(tue_classes)
  )

survey
# A tibble: 209 × 3
   Timestamp         tue_classes year      
   <chr>                   <dbl> <chr>     
 1 2/6/2025 11:33:57           3 Sophomore 
 2 2/6/2025 11:37:39           3 First-year
 3 2/6/2025 11:40:55           2 Senior    
 4 2/6/2025 11:42:05           3 First-year
 5 2/6/2025 11:42:46           3 Senior    
 6 2/6/2025 11:43:28           3 Senior    
 7 2/6/2025 11:44:41           3 First-year
 8 2/6/2025 11:44:49           3 First-year
 9 2/6/2025 11:44:51           2 Sophomore 
10 2/6/2025 11:44:51           3 Sophomore 
# ℹ 199 more rows

Data types

Data types in R

  • logical
  • double
  • integer
  • character
  • and some more, but we won’t be focusing on those

Logical & character

logical - Boolean values TRUE and FALSE


typeof(TRUE)
[1] "logical"

character - character strings



typeof("First-year")
[1] "character"

Double & integer

double - floating point numerical values (default numerical type)


typeof(2.5)
[1] "double"
typeof(3)
[1] "double"

integer - integer numerical values (indicated with an L)


typeof(3L)
[1] "integer"
typeof(1:3)
[1] "integer"
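
A short sketch of how the two numeric types relate, using only base R. The values are equal, but the objects are stored differently.

```r
# 3 is stored as a double, 3L as an integer
same_value <- 3 == 3L           # TRUE: the values compare equal
same_object <- identical(3, 3L) # FALSE: the types differ

# is.numeric() is TRUE for both doubles and integers
both_numeric <- is.numeric(3) && is.numeric(3L)

# Mixing the two in arithmetic gives a double
mixed_type <- typeof(3L + 0.5)  # "double"

# Division returns a double, even for two integers
division_type <- typeof(6L / 2L) # "double"
```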

Concatenation

Vectors can be constructed using the c() function.

  • Numeric vector:
c(1, 2, 3)
[1] 1 2 3
  • Character vector:
c("Hello", "World!")
[1] "Hello"  "World!"
  • Vector made of vectors:
c(c("hi", "hello"), c("bye", "jello"))
[1] "hi"    "hello" "bye"   "jello"

Converting between types

with intention…

x <- 1:3
x
[1] 1 2 3
typeof(x)
[1] "integer"
y <- as.character(x)
y
[1] "1" "2" "3"
typeof(y)
[1] "character"

Converting between types

with intention…

x <- c(TRUE, FALSE)
x
[1]  TRUE FALSE
typeof(x)
[1] "logical"
y <- as.numeric(x)
y
[1] 1 0
typeof(y)
[1] "double"

Converting between types

without intention…

c(2, "Just this one!")
[1] "2"              "Just this one!"

R will happily convert between various types without complaint when different types of data are concatenated in a vector, and that’s not always a great thing!

Converting between types

without intention…

c(FALSE, 3L)
[1] 0 3
c(1.2, 3L)
[1] 1.2 3.0
c(2L, "two")
[1] "2"   "two"
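
The results above follow a pattern: when c() is given mixed types, it converts everything to the most flexible type present. For the four types covered here, the order is logical, then integer, then double, then character. A quick check with typeof():

```r
# Each call mixes two or more types; typeof() shows which one wins
t1 <- typeof(c(FALSE, 3L))       # "integer"
t2 <- typeof(c(1.2, 3L))         # "double"
t3 <- typeof(c(2L, "two"))       # "character"
t4 <- typeof(c(TRUE, 1.5, "a"))  # "character"
```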

Explicit vs. implicit coercion

Explicit coercion:

When you call a function like as.logical(), as.numeric(), as.integer(), as.double(), or as.character().

Implicit coercion:

Happens when you use a vector in a specific context that expects a certain type of vector.
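
A small sketch contrasting the two, using a made-up vector (attended is hypothetical). In the implicit cases we never ask for a conversion; the function does it because of the type it expects.

```r
# Hypothetical data: did each of four students attend class?
attended <- c(TRUE, FALSE, TRUE, TRUE)

# Implicit: sum() and mean() need numbers, so TRUE becomes 1 and FALSE becomes 0
n_attended <- sum(attended)      # 3
prop_attended <- mean(attended)  # 0.75

# Implicit: paste() needs text, so the integers 1:2 become character strings
labels <- paste("Student", 1:2)  # "Student 1" "Student 2"

# Explicit: we ask for the conversion by name
as_numbers <- as.numeric(attended)               # 1 0 1 1
parsed <- as.logical(c("TRUE", "false", "yes"))  # TRUE FALSE NA
```

Note the NA in the last line: explicit coercion returns NA when a value cannot be interpreted as the requested type.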

Data classes

Data classes

  • Vectors are like Lego building blocks
  • We stick them together to build more complicated constructs, e.g. representations of data
  • The class attribute relates to the S3 class of an object which determines its behaviour
    • You don’t need to worry about what S3 classes really mean, but you can read more about it here if you’re curious
  • Examples: factors, dates, and data frames

Factors

R uses factors to handle categorical variables, i.e., variables that have a fixed and known set of possible values

class_years <- factor(
  c(
    "First-year", "Sophomore", "Sophomore", "Senior", "Junior"
    )
  )
class_years
[1] First-year Sophomore  Sophomore  Senior     Junior    
Levels: First-year Junior Senior Sophomore
typeof(class_years)
[1] "integer"
class(class_years)
[1] "factor"
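
Notice that the levels above are in alphabetical order, which is the default. One way to control the order, sketched here with base R only, is the levels argument of factor(). The name class_years_ordered is made up for this example so it does not overwrite class_years.

```r
# Same data as above, but with the level order stated explicitly
class_years_ordered <- factor(
  c("First-year", "Sophomore", "Sophomore", "Senior", "Junior"),
  levels = c("First-year", "Sophomore", "Junior", "Senior")
)

levels(class_years_ordered)
#> [1] "First-year" "Sophomore"  "Junior"     "Senior"

# The underlying integers now follow the stated order
as.integer(class_years_ordered)
#> [1] 1 2 2 4 3
```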

More on factors

We can think of factors as a character vector (the level labels) and an integer vector (the level numbers) glued together

glimpse(class_years)
 Factor w/ 4 levels "First-year","Junior",..: 1 4 4 3 2
as.integer(class_years)
[1] 1 4 4 3 2
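
Because a factor is integers underneath, converting one straight to a number returns the level numbers, not the labels. A minimal sketch with a made-up factor (f is hypothetical) whose labels happen to look like numbers:

```r
# Hypothetical factor with number-like labels
f <- factor(c("10", "20", "10", "30"))

# Converting directly gives the level numbers
codes <- as.integer(f)
codes
#> [1] 1 2 1 3

# Going through character first gives the labels as numbers
values <- as.numeric(as.character(f))
values
#> [1] 10 20 10 30
```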

Dates

today <- as.Date("2024-09-24")
today
[1] "2024-09-24"
typeof(today)
[1] "double"
class(today)
[1] "Date"

More on dates

We can think of dates as a whole number (the number of days since the origin, 1 Jan 1970, stored as a double) and the origin glued together

as.integer(today)
[1] 19990
as.integer(today) / 365 # roughly 55 yrs
[1] 54.76712
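
Since a date is a day count underneath, arithmetic on dates is arithmetic on days. A short base R sketch; the second date is this lecture's date, used only as an example.

```r
today <- as.Date("2024-09-24")
origin <- as.Date("1970-01-01")

# Stripping the class reveals the stored number of days
days_since_origin <- unclass(today)  # 19990

# Subtracting two dates gives a time difference in days
elapsed <- today - origin            # 19990 days

# Adding a number moves the date forward by that many days
next_week <- today + 7               # "2024-10-01"

# Days from today to the lecture date
days_to_lecture <- as.numeric(as.Date("2025-02-06") - today)  # 135
```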

Data frames

We can think of data frames as vectors of equal length glued together

df <- data.frame(x = 1:2, y = 3:4)
df
  x y
1 1 3
2 2 4
typeof(df)
[1] "list"
class(df)
[1] "data.frame"
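
A quick sketch of what "vectors glued together" means in practice, using base R only. Each column is one element of the underlying list, and columns of mismatched lengths are rejected.

```r
df <- data.frame(x = 1:2, y = 3:4)

# A data frame is a list, with one element per column
is_list <- is.list(df)     # TRUE
n_cols <- length(df)       # 2: the number of columns, not rows
col_lengths <- lengths(df) # each column has 2 values

# Columns of lengths 2 and 3 cannot be glued together: this is an error
bad <- try(data.frame(x = 1:2, y = 1:3), silent = TRUE)
failed <- inherits(bad, "try-error")  # TRUE
```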

Lists

Lists are a generic vector container; vectors of any type can go in them

l <- list(
  x = 1:4,
  y = c("hi", "hello", "jello"),
  z = c(TRUE, FALSE)
)
l
$x
[1] 1 2 3 4

$y
[1] "hi"    "hello" "jello"

$z
[1]  TRUE FALSE

Lists and data frames

  • A data frame is a special list containing vectors of equal length
df
  x y
1 1 3
2 2 4
  • When we use the pull() function, we extract a vector from the data frame
df |>
  pull(y)
[1] 3 4
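
To see the difference between extracting a vector and keeping a data frame, here is a small sketch comparing pull() with select() from dplyr, along with the base R equivalents $ and [[ ]].

```r
library(dplyr)

df <- data.frame(x = 1:2, y = 3:4)

# pull() returns the column itself, as a vector
y_vec <- df |> pull(y)
class(y_vec)
#> [1] "integer"

# select() returns a data frame that has just that column
y_df <- df |> select(y)
class(y_df)
#> [1] "data.frame"

# Base R offers the same extraction with $ and [[ ]]
same_dollar <- identical(y_vec, df$y)       # TRUE
same_bracket <- identical(y_vec, df[["y"]]) # TRUE
```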

Working with factors

Read data in as character strings

survey
# A tibble: 209 × 3
   Timestamp         tue_classes year      
   <chr>                   <dbl> <chr>     
 1 2/6/2025 11:33:57           3 Sophomore 
 2 2/6/2025 11:37:39           3 First-year
 3 2/6/2025 11:40:55           2 Senior    
 4 2/6/2025 11:42:05           3 First-year
 5 2/6/2025 11:42:46           3 Senior    
 6 2/6/2025 11:43:28           3 Senior    
 7 2/6/2025 11:44:41           3 First-year
 8 2/6/2025 11:44:49           3 First-year
 9 2/6/2025 11:44:51           2 Sophomore 
10 2/6/2025 11:44:51           3 Sophomore 
# ℹ 199 more rows

But coerce when plotting

ggplot(survey, mapping = aes(x = year)) +
  geom_bar()

Use forcats to reorder levels

survey |>
  mutate(
    year = fct_relevel(year, "First-year", "Sophomore", "Junior", "Senior")
  ) |>
  ggplot(mapping = aes(x = year)) +
  geom_bar()
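
To check what fct_relevel() does without drawing the plot, we can look at the levels directly. A minimal sketch on a made-up vector (year here is hypothetical, not the survey column):

```r
library(forcats)

# Hypothetical responses
year <- c("Senior", "First-year", "Sophomore", "Junior", "First-year")

# Default: levels are alphabetical
default_levels <- levels(factor(year))
default_levels
#> [1] "First-year" "Junior"     "Senior"     "Sophomore"

# fct_relevel(): levels are in the order we list them
year_ordered <- fct_relevel(year, "First-year", "Sophomore", "Junior", "Senior")
levels(year_ordered)
#> [1] "First-year" "Sophomore"  "Junior"     "Senior"
```

The values themselves do not change, only the order of the levels, and that order is what ggplot uses for the bars.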

A peek into forcats

Reordering levels by:

  • fct_relevel(): hand

  • fct_infreq(): frequency

  • fct_reorder(): sorting along another variable

  • fct_rev(): reversing

Changing level values by:

  • fct_lump(): lumping uncommon levels together into “other”

  • fct_other(): manually replacing some levels with “other”
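
A small sketch of several of these functions on made-up data (year and classes are hypothetical). Looking at levels() before and after is an easy way to see what each one does.

```r
library(forcats)

# Hypothetical data: 3 seniors, 2 first-years, 1 sophomore
year <- factor(c("Senior", "Senior", "Senior", "First-year", "First-year", "Sophomore"))
classes <- c(2, 2, 2, 4, 4, 3)  # hypothetical number of classes per student

levels(year)                    # alphabetical: First-year, Senior, Sophomore

# Reordering levels
by_freq <- fct_infreq(year)             # most common first: Senior, First-year, Sophomore
reversed <- fct_rev(year)               # Sophomore, Senior, First-year
by_classes <- fct_reorder(year, classes) # by median of classes: Senior, Sophomore, First-year

# Changing level values
lumped <- fct_lump(year, n = 1)          # keep the most common level, lump the rest
manual <- fct_other(year, keep = "Senior") # keep Senior, replace the rest

table(lumped)
```

By default both fct_lump() and fct_other() label the combined level "Other".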