Data science ethics

Lecture 11

John Zito

Duke University
STA 199 Spring 2025

2025-02-18

Exam reminders

Basic facts

Worth 20% of your final grade; consists of two parts:

  • In-class: worth 70% of the Midterm 1 grade;

    • Thursday February 20 11:45 AM - 1:00 PM;
    • Take note of your room assignment!
    • All multiple choice;
    • Both sides of one 8.5” x 11” sheet of notes.
  • Take-home: worth 30% of the Midterm 1 grade.

    • Released Thursday February 20 at 1:00 PM;
    • Due Monday February 24 at 8:30 AM;
    • Works just like a mini-lab, only zero collaboration.

Last week’s advice

When the world was your oyster and you had nothing but time…

  • Practice problems: released Thursday February 13;
  • Attend lab: Kahoot on Monday February 17;
  • Old labs: correct parts where you lost points;
  • Old AEs: complete tasks we didn’t get to and compare with key;
  • Code along: watch these videos specifically;
  • Textbook: odd-numbered exercises in the back of Chs. 1, 4, 5, 6.

This week’s advice

Now that you only have forty-eight hours…

  • Study the Kahoot and the practice problems;
  • Study the Lab 4 solutions;
  • Spend some serious time with your cheat sheet;
  • Study old AE keys, and work stuff we didn’t complete.

What if we get snowed out?

  • If classes are canceled: in-class exam is moved to 2/25;

  • If classes are not canceled: in-class exam is 2/20 as planned;

  • Take-home exam is the same regardless;

  • If classes are canceled, Testing Center appointments are canceled, and we’ll cross that bridge when we come to it (it will be a mess).

Misrepresentation

Misrepresenting data science results

Some common ways people do this, either intentionally or unintentionally, include:

  • Claiming causality when it is beyond the scope of inference of the underlying study

  • Distorting axes and scales to make the data tell a different story

  • Mapping by land area instead of by population for issues that depend on and affect people

  • Omitting uncertainty in reporting

Causality - TIME coverage

How plausible is the statement in the title of this article?

Causality - LA Times coverage

What does “research shows” mean?

Causality - Original study

Moore, Steven C., et al. “Association of leisure-time physical activity with risk of 26 types of cancer in 1.44 million adults.” JAMA Internal Medicine 176.6 (2016): 816-825.

  • Volunteers were asked about their physical activity level over the preceding year.
  • Half exercised less than about 150 minutes per week, half exercised more.
  • Compared to the bottom 10% of exercisers, the top 10% had lower rates of esophageal, liver, lung, endometrial, colon, and breast cancer.
  • Researchers found no association between exercising and 13 other cancers (e.g. pancreatic, ovarian, and brain).

Axes and scales - Tax cuts

What is the difference between these two pictures? Which presents a better way to represent these data?
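
The classic trick here is truncating the y-axis. Below is a minimal sketch in R, with numbers resembling the well-known top-tax-rate example (they are stand-ins, not the chart’s actual data), showing how the same two values read very differently depending on where the axis starts:

library(tidyverse)

rates <- tibble(
  scenario = c("Now", "If cuts expire"),  # hypothetical labels
  rate     = c(35, 39.6)                  # stand-in percentages
)

# Misleading: the axis starts just below the smaller bar, so a
# roughly 13% relative difference looks like a several-fold jump
ggplot(rates, aes(x = scenario, y = rate)) +
  geom_col() +
  coord_cartesian(ylim = c(34, 40))

# More honest: bars start at zero, so bar height is proportional
# to the value being displayed
ggplot(rates, aes(x = scenario, y = rate)) +
  geom_col()

When the bars share a zero baseline, their heights can be compared at a glance; the truncated version invites a comparison the data don’t support.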

Axes and scales - Cost of gas

What is wrong with this picture? How would you correct it?

Axes and scales - Cost of gas

# tidyverse loads ggplot2, tibble, and lubridate (for ymd());
# scales provides label_dollar()
library(tidyverse)
library(scales)

df <- tibble(
  date = ymd(c("2019-11-01", "2020-10-25", "2020-11-01")),
  cost = c(3.17, 3.51, 3.57)
)

ggplot(df, aes(x = date, y = cost, group = 1)) +
  geom_point() +
  geom_line() +
  geom_label(aes(label = cost), hjust = -0.25) +
  labs(
    title = "Cost of gas",
    subtitle = "National average",
    x = NULL, y = NULL, 
    caption = "Source: AAA Fuel Gauge Report"
  ) +
  # a date scale keeps the axis honest: unevenly spaced
  # dates stay unevenly spaced
  scale_x_date(
    breaks = ymd(c("2019-11-01", "2020-10-25", "2020-11-01")), 
    labels = c("Last year", "Last week", "Current"),
    guide = guide_axis(angle = 90),
    limits = ymd(c("2019-11-01", "2020-11-29")),
    minor_breaks = ymd(c("2019-11-01", "2020-10-25", "2020-11-01"))
  ) +
  scale_y_continuous(labels = label_dollar())

Axes and scales - COVID in GA

What is wrong with this picture? How would you correct it?
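
One minimal sketch, on simulated counts, of the kind of failure at issue here: when dates are stored as character strings, ggplot treats them as unordered categories, so the x-axis need not be chronological (the column names and numbers below are invented):

library(tidyverse)  # loads ggplot2, dplyr, and lubridate (for dmy())

covid <- tibble(
  date  = c("28Apr2020", "1May2020", "4May2020"),  # simulated dates as strings
  cases = c(310, 200, 120)                         # simulated counts
)

# Broken: character dates are treated as unordered categories, so the
# x-axis sorts them alphabetically ("1May2020", "28Apr2020", "4May2020")
ggplot(covid, aes(x = date, y = cases)) +
  geom_col()

# Fixed: parse the strings into real Date objects so the axis is
# chronological no matter how the rows happen to be ordered
covid |>
  mutate(date = dmy(date)) |>
  ggplot(aes(x = date, y = cases)) +
  geom_col()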


Axes and scales - PP services

What is wrong with this picture? How would you correct it?

Axes and scales - PP services

# tidyverse loads ggplot2 and tibble; scales provides label_number()
library(tidyverse)
library(scales)

pp <- tibble(
  year = c(2006, 2006, 2013, 2013),
  service = c("Abortion", "Cancer", "Abortion", "Cancer"),
  n = c(289750, 2007371, 327000, 935573)
)

ggplot(pp, aes(x = year, y = n, color = service)) +
  geom_point(size = 2) +
  geom_line(linewidth = 1) +
  geom_text(aes(label = n), nudge_y = 100000) +
  geom_text(
    aes(label = year), 
    nudge_y = 200000, 
    color = "darkgray"
  ) +
  labs(
    title = "Services provided by Planned Parenthood",
    caption = "Source: Planned Parenthood",
    x = NULL,
    y = NULL
  ) +
  scale_x_continuous(breaks = c(2006, 2013)) +
  scale_y_continuous(labels = label_number(big.mark = ",")) +
  scale_color_manual(values = c("red", "purple")) +
  # label the lines directly so no legend is needed
  annotate(
    geom = "text",
    label = "Abortions",
    x = 2009.5,
    y = 400000,
    color = "red"
  ) +
  annotate(
    geom = "text",
    label = "Cancer screening\nand prevention services",
    x = 2010.5,
    y = 1600000, 
    color = "purple"
  ) +
  theme_minimal() +
  theme(legend.position = "none")

Maps and areas - Voting map

Do you recognize this map? What does it show?

Maps and areas - Two alternate tales

Maps and areas - Voting percentages

Uncertainty - Catalan independence

On December 19, 2014, the front page of Spanish national newspaper El País read “Catalan public opinion swings toward ‘no’ for independence, says survey”.
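
The issue: survey estimates come with margins of error, and a small “swing” can sit entirely inside one. A minimal sketch with made-up numbers (the proportion and sample size below are invented, not the survey’s):

p_hat <- 0.48   # hypothetical share answering "no"
n     <- 1000   # hypothetical sample size

se  <- sqrt(p_hat * (1 - p_hat) / n)  # standard error of the proportion
moe <- 1.96 * se                      # 95% margin of error

c(lower = p_hat - moe, upper = p_hat + moe)
#> roughly 0.449 to 0.511 -- an interval that straddles 50%

With a sample of this size, the interval spans about six percentage points, so a headline-worthy swing of a point or two is statistically indistinguishable from no change.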


Algorithmic bias

California Proposition 25 (2020)

Popular referendum on 2018’s Senate Bill 10:

  • YES: replace cash bail with “risk assessment.”

    • Democratic Party, Governor Gavin Newsom, League of Women Voters of California, California Medical Association, Democracy for America (progressive PAC), etc.
  • NO: keep the cash bail system.

    • Republican Party, American Bail Coalition, ACLU of Southern California, NAACP, California Asian Pacific Chamber of Commerce, etc.
  • If passed, each county would be empowered to develop a tool that predicts the risk of a suspect reoffending before trial.

  • Judges would consult this prediction to make bail decisions.

What might “risk assessment” look like?

Something we will study after spring break:

Above the line means high risk, and high risk means no bail. Is this progress?
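
To make that concrete, here is a minimal sketch, on simulated data, of what such a tool boils down to: a model that turns a defendant’s record into a predicted probability of reoffending, plus a cutoff that converts the probability into a yes/no bail decision. Everything below (the variables, coefficients, and the 0.5 cutoff) is invented for illustration:

library(tidyverse)

set.seed(199)
n <- 500
defendants <- tibble(
  priors   = rpois(n, 2),       # simulated number of prior offenses
  age      = runif(n, 18, 70),  # simulated age
  reoffend = rbinom(n, 1, plogis(-1 + 0.4 * priors - 0.03 * age))
)

# Fit a logistic regression for the probability of reoffending
fit <- glm(reoffend ~ priors + age, data = defendants, family = binomial)

# "Risk assessment": predicted probabilities plus a hard cutoff
defendants |>
  mutate(
    risk    = predict(fit, type = "response"),
    no_bail = risk >= 0.5   # above the line: high risk, no bail
  )

Whether the cutoff sits at 0.5 or somewhere else is a human choice with major consequences, and under Prop 25 each county would have made that choice for its own tool.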

What happens when we try “predictive policing”?

2016 ProPublica article on algorithm used for rating a defendant’s risk of future crime:

In forecasting who would re-offend, the algorithm made mistakes with black and white defendants at roughly the same rate but in very different ways.

  • The formula was particularly likely to falsely flag black defendants as future criminals, wrongly labeling them this way at almost twice the rate as white defendants.

  • White defendants were mislabeled as low risk more often than black defendants.
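
The two bullets above are statements about group-wise error rates: the false positive rate (flagged high risk among those who did not reoffend) and the false negative rate (rated low risk among those who did). A minimal sketch of how such rates are tabulated, on an invented toy table (the column names and values are made up, not ProPublica’s data):

library(tidyverse)

# Hypothetical toy data: the algorithm's label and the actual outcome
scores <- tibble(
  race       = c("Black", "Black", "Black", "White", "White", "White"),
  high_risk  = c(TRUE,  TRUE,  FALSE, FALSE, FALSE, TRUE),
  reoffended = c(FALSE, TRUE,  TRUE,  FALSE, TRUE,  TRUE)
)

scores |>
  group_by(race) |>
  summarize(
    fpr = mean(high_risk[!reoffended]),   # flagged high risk, didn't reoffend
    fnr = mean(!high_risk[reoffended])    # rated low risk, did reoffend
  )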

Notice anything?

What is common among the defendants who were assigned a high/low risk score for reoffending?

“But race wasn’t in my model”

How can an algorithm that doesn’t use race as input data be racist?

Predicting ethnicity

Improving Ecological Inference by Predicting Individual Ethnicity from Voter Registration Records (Imai and Khan, 2016)

In both political behavior research and voting rights litigation, turnout and vote choice for different racial groups are often inferred using aggregate election results and racial composition. Over the past several decades, many statistical methods have been proposed to address this ecological inference problem. We propose an alternative method to reduce aggregation bias by predicting individual-level ethnicity from voter registration records. Building on the existing methodological literature, we use Bayes’s rule to combine the Census Bureau’s Surname List with various information from geocoded voter registration records. We evaluate the performance of the proposed methodology using approximately nine million voter registration records from Florida, where self-reported ethnicity is available. We find that it is possible to reduce the false positive rate among Black and Latino voters to 6% and 3%, respectively, while maintaining the true positive rate above 80%. Moreover, we use our predictions to estimate turnout by race and find that our estimates yield substantially less bias and root mean squared error than standard ecological inference estimates. We provide open-source software to implement the proposed methodology.
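
The core of the method is Bayes’ rule: the probability of each racial group given a surname is proportional to how common the surname is within that group times the group’s share of the population. A minimal sketch of the surname-only step, with invented numbers (the real method also folds in geocoded location):

# Invented population shares (the prior) and invented rates at which a
# particular surname occurs within each group (the likelihood), in the
# spirit of the Census Bureau's Surname List
prior           <- c(white = 0.62, black = 0.13, hispanic = 0.17, asian = 0.06, other = 0.02)
p_surname_given <- c(white = 1e-5, black = 2e-6, hispanic = 3e-4, asian = 1e-6, other = 5e-6)

# Bayes' rule: posterior is proportional to prior times likelihood
posterior <- prior * p_surname_given
posterior / sum(posterior)   # P(race | surname), normalized to sum to 1

With numbers like these, a surname that is rare overall but relatively common among Hispanic voters dominates the posterior, which is how predict_race() below ends up assigning, e.g., Rivera and Lopez high pred.his values.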

wru package

The aforementioned open-source software is the wru package: https://github.com/kosukeimai/wru.

Do you have any ethical concerns about installing this package?

wru package

Was the publication of this model ethical? Does the open-source nature of the code affect your answer? Is it ethical to use this software? Does your answer change depending on the intended use?

library(wru)        # provides predict_race() and the example `voters` data
library(tidyverse)  # for select()

predict_race(voter.file = voters, surname.only = TRUE) |>
  select(surname, pred.whi, pred.bla, pred.his, pred.asi, pred.oth)
      surname    pred.whi    pred.bla     pred.his    pred.asi    pred.oth
1      Khanna 0.045110474 0.003067623 0.0068522723 0.860411906 0.084557725
2        Imai 0.052645440 0.001334812 0.0558160072 0.719376581 0.170827160
3      Rivera 0.043285692 0.008204605 0.9136195794 0.024316883 0.010573240
4     Fifield 0.895405704 0.001911388 0.0337464844 0.011079323 0.057857101
5        Zhou 0.006572555 0.001298962 0.0005388581 0.982365594 0.009224032
6    Ratkovic 0.861236727 0.008212824 0.0095395642 0.011334635 0.109676251
7     Johnson 0.543815322 0.344128607 0.0272403940 0.007405765 0.077409913
8       Lopez 0.038939877 0.004920643 0.9318797791 0.012154125 0.012105576
10 Wantchekon 0.330697188 0.194700665 0.4042849478 0.021379541 0.048937658
9       Morse 0.866360147 0.044429853 0.0246568086 0.010219712 0.054333479

wru package

me <- tibble(surname = "Zito")

predict_race(voter.file = me, surname.only = TRUE)
  surname  pred.whi   pred.bla   pred.his    pred.asi   pred.oth
1    Zito 0.9220001 0.00419631 0.03968994 0.009652312 0.02446131

California Prop 25 did not pass

The cash bail system was retained:

Choice   Votes       Percent
Yes      7,232,380   43.59%
No       9,358,226   56.41%
  • Reasonable people can debate whether this outcome is good or bad;
  • Every Californian was invited to decide whether statistics and data science should be deployed to make decisions with major social consequences. They opted out;
  • This vote was held in the pre-ChatGPT era. What would the outcome be today? Is the case for YES stronger or weaker?

Another algorithmic decision…

Armies of stats PhDs go to work on these models. They have no training in the ethics of what they’re doing.

A success story?

Data + Model to predict timing of menstrual cycle:

A perfect microcosm of the themes of our course, and maybe one of the real triumphs of data and modeling improving modern life.

…but what if you learned they were selling your data?

Data privacy (aka the reason Tony Soprano ripped the GPS out of his Escalade)

Data privacy

“Your” data

  • Every time we use apps, websites, and devices, our data are collected and then used or sold to others.

  • More importantly, law enforcement, financial institutions, and governments make decisions based on those data that directly affect people’s lives.

Privacy of your data

What pieces of data have you left on the internet today? Think through everything you’ve logged into, clicked on, or checked in to, either actively or automatically, that might be tracking you. Do you know where those data are stored? Who can access them? Whether they are shared with others?

Sharing your data

What are you OK with sharing?

  • Name
  • Age
  • Email
  • Phone number
  • List of every video you watch
  • List of every video you comment on
  • How you type: speed, accuracy
  • How long you spend on different content
  • List of all your private messages (date, time, person sent to)
  • Info about your photos (how, where (GPS), and when they were taken)

What does Google think/know about you?

Have you ever thought about why you’re seeing an ad on Google? Google it! Try to figure out if you have ad personalization on and how your ads are personalized.

Your browsing history

Which of the following are you OK with your browsing history being used for?

  • For serving you targeted ads
  • To score you as a candidate for a job
  • To predict your race/ethnicity for voting purposes

Who else gets to use your data?

Suppose you create a profile on a social media site and share your personal information on your profile. Who else gets to use that data?

  • Companies the social media company has a connection to?
  • Companies the social media company sells your data to?
  • Researchers?

AOL search data leak

OkCupid data breach

  • In 2016, researchers published data on 70,000 OkCupid users, including usernames, political leanings, drug usage, and intimate sexual details

  • The researchers didn’t release the real names and pictures of OkCupid users, but their identities could easily be uncovered from the details provided, e.g. usernames

OkCupid data breach

Some may object to the ethics of gathering and releasing this data. However, all the data found in the dataset are or were already publicly available, so releasing this dataset merely presents it in a more useful form.

Researchers Emil Kirkegaard and Julius Daugbjerg Bjerrekær

Data privacy

When analyzing data that individuals willingly shared publicly on a given platform (e.g. social media), how do you make sure you don’t violate their reasonable expectations of privacy?

Some good news?

Faster, more accurate cancer screening?

Augmenting doctors’ diagnostic capacity so that they make fewer mistakes, treat more people, and focus on other aspects of care:

The Nobel Prize last year

  • AlphaFold2: “predicting 3D structures [of proteins] (y) directly from the primary amino acid sequence (x).”

  • “researchers can now better understand antibiotic resistance and create images of enzymes that can decompose plastic.”

Further reading

How Charts Lie

Getting Smarter about Visual Information

by Alberto Cairo

Calling Bullshit

The Art of Skepticism in a Data-Driven World

by Carl Bergstrom and Jevin West

Machine Bias

by Julia Angwin, Jeff Larson, Surya Mattu, and Lauren Kirchner

Ethics and Data Science

by Mike Loukides, Hilary Mason, and DJ Patil
(free Kindle download)

Weapons of Math Destruction

How Big Data Increases Inequality and Threatens Democracy

by Cathy O’Neil

Algorithms of Oppression

How Search Engines Reinforce Racism

by Safiya Umoja Noble

And more recently…

How AI discriminates and what that means for your Google habit
A conversation with UCLA internet studies scholar Safiya Noble

by Julia Busiek

Parting thoughts

  • At some point in your data science learning journey, you will learn tools that can be used unethically.

  • You might also be tempted to use your knowledge in a way that is ethically questionable, whether because of business goals, the pursuit of further knowledge, or because your boss told you to.

How do you train yourself to make the right decisions (or reduce the likelihood of accidentally making the wrong decisions) at those points?