6 Data

Objectives

This chapter presents the law school dataset (Wightman (1998)) used in the next chapters to illustrates the different methodologies. It then presents the causal graph assumed in the upcoming examples.

Required packages and definition of colours.

# Required packages----
library(tidyverse)

# Graphs----
font_main = font_title = 'Times New Roman'
extrafont::loadfonts(quiet = T)
face_text='plain'
face_title='plain'
size_title = 14
size_text = 11
legend_size = 11

global_theme <- function() {
  theme_minimal() %+replace%
    theme(
      text = element_text(family = font_main, size = size_text, face = face_text),
      legend.text = element_text(family = font_main, size = legend_size),
      axis.text = element_text(size = size_text, face = face_text), 
      plot.title = element_text(
        family = font_title, 
        size = size_title, 
        hjust = 0.5
      ),
      plot.subtitle = element_text(hjust = 0.5),
      legend.position = "bottom"
    )
}

# Seed
set.seed(2025)
colours_all <- c(
  "source" = "#00A08A",
  "reference" = "#F2AD00",
  "naive" = "#000000"
)

This law school dataset contains information collected through a survey conducted from 1991 through 1997 by the Law School Admission Council across 163 law schools in the United States of America (Wightman (1998)). In total, 21,790 law students were tracked through law school, graduation, and sittings for bar exams.

We use the formatted data from De Lara et al. (2024), also used by De Lara et al. (2021) (see the github associated to the paper: https://github.com/lucasdelara/PI-Fair.git).

Each row from the raw data gives information for a student. The following characteristics are available:

race: Race of the student (character: Amerindian, Asian, Black, Hispanic, Mexican, Other, Puertorican, White).
sex: Sex of the student (numeric: 1 female, 2 male).
LSAT: LSAT score received by the student (numeric).
UGPA: Undergraduate GPA of the student (numeric).
region_first: region in which the student took their first bar examination (Far West, Great Lakes, Midsouth, Midwest, Mountain West, Northeast, New England, Northwest, South Central, South East) (character)
ZFYA: standardized first-year law school grades (first year average grade, FYA) (numeric).
sander_index: Sander index of the student: weighted average of normalized UGPA and LSAT scores (however, no details are given for this specific dataset, see Sander (2004), p. 393) (numeric)
first_pf: Probably a binary variable that indicates whether the student passed on their first trial ? No information is given about this variable… (numeric 0/1).

6.1 Data Pre-Processing

We load the data:

df <- read_csv('../data/law_data.csv')

Here is some summary information on this dataset:

summary(df)

      ...1           race                sex             LSAT      
 Min.   :    0   Length:21791       Min.   :1.000   Min.   :11.00  
 1st Qu.: 6516   Class :character   1st Qu.:1.000   1st Qu.:33.00  
 Median :13698   Mode  :character   Median :2.000   Median :37.00  
 Mean   :13732                      Mean   :1.562   Mean   :36.77  
 3rd Qu.:20862                      3rd Qu.:2.000   3rd Qu.:41.00  
 Max.   :27476                      Max.   :2.000   Max.   :48.00  
      UGPA       region_first            ZFYA           sander_index   
 Min.   :0.000   Length:21791       Min.   :-3.35000   Min.   :0.3875  
 1st Qu.:3.000   Class :character   1st Qu.:-0.55000   1st Qu.:0.7116  
 Median :3.300   Mode  :character   Median : 0.09000   Median :0.7696  
 Mean   :3.227                      Mean   : 0.09643   Mean   :0.7669  
 3rd Qu.:3.500                      3rd Qu.: 0.75000   3rd Qu.:0.8274  
 Max.   :4.200                      Max.   : 3.48000   Max.   :1.0000  
    first_pf     
 Min.   :0.0000  
 1st Qu.:1.0000  
 Median :1.0000  
 Mean   :0.8884  
 3rd Qu.:1.0000  
 Max.   :1.0000

Then, we focus on a subset of variables of interest:

df <- df |> 
  select(
    race, # we can take S = race (white/black)
    sex,  # or S = gender
    LSAT, 
    UGPA,
    ZFYA  # Y
  )

We create a dataset where the only protected class is the race, and we focus on Black individuals and White individuals only:

# Table for S = race
df_race <- df |> 
  select(
    race,
    UGPA,
    LSAT,
    ZFYA
  ) |> 
  filter(
    race %in% c("Black", "White")
  ) |> 
  rename(
    S = race,
    X1 = UGPA,
    X2 = LSAT,
    Y = ZFYA
  ) |>  # no NA values
  mutate(
    S = as.factor(S)
  )

And another dataset in which the only protected class is the sex:

# Table for S = gender
df_gender <- df |> 
  select(
    sex,
    UGPA,
    LSAT,
    ZFYA
  ) |> 
  rename(
    S = sex,
    X1 = UGPA,
    X2 = LSAT,
    Y = ZFYA
  ) |>  # no NA values
  mutate(
    S = as.factor(S)
  )

S = Race
S = Gender

Codes used to create the Figure.

ggplot(
  data = df_race, 
  mapping = aes(x = Y, fill = S)
) +
  geom_histogram(
    mapping = aes(y = after_stat(density)), 
    alpha = 0.5, position = "identity", binwidth = 0.5
  ) +
  geom_density(alpha = 0.5) +
  scale_fill_manual(
    "S",
    values = c(
      "Black" = colours_all[["source"]], 
      "White" = colours_all[["reference"]]
    )
  ) +
  labs(
    title = "Race",
    x = "Y",
    y = "Density"
  ) +
  global_theme()

Figure 6.1: Distribution of the standardized first-year law school grades among the two groups, when \(S\) is the race

Codes used to create the Figure.

ggplot(
  data = df_gender, 
  mapping = aes(x = Y, fill = S)) +
  geom_histogram(
    mapping = aes(y = after_stat(density)), 
    alpha = 0.5, position = "identity", binwidth = 0.5
  ) +
  geom_density(alpha = 0.5) +
  scale_fill_manual(
    "S",
    values = c(
      "Black" = colours_all[["source"]], 
      "White" = colours_all[["reference"]]
    )
  ) +
  labs(
    title = "Gender",
    x = "Y",
    y = "Density"
  ) +
  global_theme()

Figure 6.2: Distribution of the standardized first-year law school grades among the two groups, when \(S\) is the gender

6.2 Causal graph

The assumed causal graph we use here is different from that of the different papers De Lara et al. (2024), Kusner et al. (2017), Black, Yeom, and Fredrikson (2020) using the same dataset.

We make the following assumptions:

The sensitive attribute, \(S\) (race), has no parents.
The two other explanatory variables, \(X_1\) (UGPA) and \(X_2\) (LSAT), both directly depend on the sensitive attribute.
The second variable, \(X_2\) (LSAT), also depends on the first variable, \(X_1\) (UGPA). This is done for illustrative purposes, assuming that the score obtained on the LSAT is influenced by the UGPA.
The two variables, \(X_1\) (UGPA) and \(X_2\) (LSAT), cause the target variable \(Y\), i.e., whether the student obtained a high standardized first-year average (ZFYA).

The corresponding Structural Equation Model writes:

\[ \begin{cases} S: \text{ sensitive attribute (race)} \\ X_1 = h_1(S, U_1): \text{ UGPA, dependent on } S \\ X_2 = h_2(S, X_1, U_2): \text{ LSAT, dependent on } S \text{ and } X_1 \\ Y = h_3(X_1, X_2, U_Y): \text{ ZFYA, dependent on } X_1 \text{ and } X_2 \\ \end{cases} \]

where \(U_1\), \(U_2\), and \(U_Y\) are independent error terms.

In R, we construct the upper triangular adjacency matrix to reflect our assumed causal structure:

variables <- colnames(df_race)
# Adjacency matrix: upper triangular
adj <- matrix(
  c(0, 1, 1, 1,
    0, 0, 1, 1,
    0, 0, 0, 1,
    0, 0, 0, 0),
  ncol = length(variables), 
  dimnames = rep(list(variables), 2),
  byrow = TRUE
)

Which can be visualized as follows:

causal_graph <- fairadapt::graphModel(adj)
plot(causal_graph)

The topological order:

top_order <- variables
top_order

[1] "S"  "X1" "X2" "Y"

6.3 Saving objects

save(df_race, file = "../data/df_race.rda")
save(df_gender, file = "../data/df_gender.rda")
save(adj, file = "../data/adj.rda")