We apply the two methods presented in Chapter 1 to build counterfactuals for a categorical feature.
Recall that the first method, outlined in Algorithm 1.1, applies Gaussian optimal transport to an alternative representation of the probability vector in the Euclidean space \(\mathbb{R}^{d-1}\).
The second method applies transport and matching directly within the simplex \(\mathcal{S}_d\), with an appropriate cost function.
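As a reminder of the construction from Chapter 1 (this is only a summary; the chapter gives the details), the first method pushes compositions out of the simplex with the centered log-ratio transform,
\[
\operatorname{clr}(\mathbf{x}) = \left(\log\frac{x_1}{g(\mathbf{x})},\dots,\log\frac{x_d}{g(\mathbf{x})}\right),
\qquad g(\mathbf{x}) = \Big(\prod_{k=1}^{d} x_k\Big)^{1/d},
\]
and then, assuming the transformed scores are Gaussian, \(\mathcal{N}(\boldsymbol{\mu}_0,\boldsymbol{\Sigma}_0)\) in one group and \(\mathcal{N}(\boldsymbol{\mu}_1,\boldsymbol{\Sigma}_1)\) in the other, applies the affine Gaussian optimal transport map
\[
T^\star(\mathbf{z}) = \boldsymbol{\mu}_1 + \boldsymbol{\Sigma}_0^{-1/2}\big(\boldsymbol{\Sigma}_0^{1/2}\boldsymbol{\Sigma}_1\boldsymbol{\Sigma}_0^{1/2}\big)^{1/2}\boldsymbol{\Sigma}_0^{-1/2}(\mathbf{z}-\boldsymbol{\mu}_0),
\]
before mapping the result back to the simplex.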
For a categorical feature \(\mathbf{x}_j\) with \(d\) categories, we fit a model \(\widehat{m}(\mathbf{x}_j|\mathbf{x}_{-j})\) using a multinomial loss, yielding predicted scores \(\widehat{\mathbf{x}}_j\in\mathcal{S}_d\). We first apply Algorithm 1.1 to transport \(\widehat{\mathbf{x}}_j|s=0\) to the counterfactual \(\widehat{\mathbf{x}}_j|s=1\) in \(\mathcal{S}_d\). We then apply Algorithm 1.2 to build the counterfactuals using matching in \(\mathcal{S}_d\).
We illustrate the procedure with the German Credit dataset (Hofmann (1994)). Specifically, we transport the variable Purpose (\(\mathbf{x}_j\)), a factor with ten levels indicating loan purposes, conditioned on gender (\(s\)), encoded as a binary variable.
To facilitate visualization using ternary graphs, we recode Purpose into three categories: “cars,” “equipment,” and “other.” Our goal is to construct a counterfactual loan purpose assuming individuals had a different gender.
library(tidyverse)
── Attaching core tidyverse packages ──────────────────────── tidyverse 2.0.0 ──
✔ dplyr 1.1.4 ✔ readr 2.1.5
✔ forcats 1.0.0 ✔ stringr 1.5.1
✔ ggplot2 3.5.1 ✔ tibble 3.2.1
✔ lubridate 1.9.4 ✔ tidyr 1.3.1
✔ purrr 1.0.2
── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
✖ dplyr::filter() masks stats::filter()
✖ dplyr::lag() masks stats::lag()
ℹ Use the conflicted package (<http://conflicted.r-lib.org/>) to force all conflicts to become errors
library(devtools)
Loading required package: usethis
library(ggtern)
Registered S3 methods overwritten by 'ggtern':
method from
grid.draw.ggplot ggplot2
plot.ggplot ggplot2
print.ggplot ggplot2
--
Remember to cite, run citation(package = 'ggtern') for further info.
--
Attaching package: 'ggtern'
The following objects are masked from 'package:ggplot2':
aes, annotate, ggplot, ggplot_build, ggplot_gtable, ggplotGrob,
ggsave, layer_data, theme_bw, theme_classic, theme_dark,
theme_gray, theme_light, theme_linedraw, theme_minimal, theme_void
# load the functions from our package to perform optimal transport on
# compositional data
load_all("../../")
ℹ Loading transportsimplex
We first define a ggplot2 theme used for the figures.
#' Theme for ggplot2
#'
#' @param ... arguments passed to the theme function
#' @export
#' @importFrom ggplot2 element_rect element_text element_blank element_line unit
#'   rel
theme_paper <- function(...) {
  theme(
    text = element_text(family = "Times New Roman"),
    plot.background = element_rect(fill = "transparent", color = NA),
    legend.text = element_text(size = rel(1.1)),
    legend.title = element_text(size = rel(1.1)),
    legend.background = element_rect(
      fill = "transparent", linetype = "solid", colour = "black"
    ),
    legend.position = "bottom",
    legend.direction = "horizontal",
    legend.box = "vertical",
    legend.key = element_blank(),
    panel.spacing = unit(1, "lines"),
    plot.title = element_text(hjust = 0, size = rel(1.3), face = "bold"),
    plot.title.position = "plot",
    strip.background = element_rect(fill = NA, colour = NA),
    strip.text = element_text(size = rel(1.1))
  )
}
We want to transport the Purpose variable, which has ten levels in the dataset. To make the transport easy to visualize, we merge some of the classes to obtain three main purposes: cars, equipment, and other.
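A possible recoding with forcats is sketched below; the level names and the exact grouping are assumptions, as the original recoding code is not shown.

# merge the ten purpose levels into three categories (illustrative level
# names; the actual levels of Purpose in german.credit may differ)
german.credit <- german.credit |>
  mutate(Purpose = fct_collapse(
    Purpose,
    cars = c("car (new)", "car (used)"),
    equipment = c("furniture/equipment", "radio/television", "domestic appliances"),
    other_level = "other"
  ))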
We fit four models, \(\widehat{m}^{(k)}(\mathbf{x}_j|\mathbf{x}_{-j})\), using a multinomial loss, yielding predicted scores \(\widehat{\mathbf{x}}_{j}^{(k)}\in\mathcal{S}_d\) for \(k \in \{1,2,3,4\}\). The sensitive attribute \(s\) and the credit risk variable (Bad/Good), which is typically the target variable in this dataset but is not used as such here, are excluded from estimation. The four models are:
GAM-MLR (1): A multinomial model with splines for three continuous variables.
GAM-MLR (2): A multinomial model with splines for three continuous variables and seven additional predictors.
Random Forest: A classifier using all available variables.
Gradient Boosting Model: A GBM trained on all available variables.
The estimation of GAM-MLR (1):
library(splines)
require(nnet)
Loading required package: nnet
set.seed(123)
model_glm_1 <- multinom(
  Purpose ~ bs(Credit_amount) + bs(Age) + bs(Duration),
  data = german.credit
)
# weights: 33 (20 variable)
initial value 1098.612289
iter 10 value 984.769274
iter 20 value 982.287732
final value 982.287248
converged
The estimation of GAM-MLR (2), with the seven additional predictors, produces the following trace:
# weights: 105 (68 variable)
initial value 1098.612289
iter 10 value 955.629654
iter 20 value 934.033840
iter 30 value 931.711611
iter 40 value 930.922840
iter 50 value 930.662446
iter 60 value 930.649598
final value 930.649264
converged
The estimation of the random forest:
library(randomForest)
# drop two columns, presumably the sensitive attribute and the
# credit-risk target, which are excluded from estimation
model_rf <- randomForest(Purpose ~ ., data = german.credit[, -c(20, 21)])
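The code for the gradient boosting model is not shown above; the following is a minimal sketch with the gbm package, under the assumption that it mirrors the random forest setup (the hyperparameter values are assumptions).

library(gbm)
set.seed(123)
# multinomial boosting on all predictors, dropping the same two columns
# as for the random forest; hyperparameter values are illustrative
model_gbm <- gbm(
  Purpose ~ .,
  data = german.credit[, -c(20, 21)],
  distribution = "multinomial",
  n.trees = 100,
  interaction.depth = 3,
  shrinkage = 0.1
)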
Table 2.1: Mappings from the purpose categorical variable \(x\) to the compositional one \(\tilde{\mathbf{x}}\), for four individuals of the dataset. The columns cars, equipment, and other report the predicted scores.

| Model | Purpose | Gender | cars | equipment | other |
|---|---|---|---|---|---|
| GAM-MLR(1) | equipment | Male | 0.1838 | 0.6156 | 0.2006 |
| | cars | Female | 0.4086 | 0.4238 | 0.1676 |
| | equipment | Male | 0.1941 | 0.7082 | 0.0977 |
| | cars | Female | 0.4704 | 0.2683 | 0.2613 |
| GAM-MLR(2) | equipment | Male | 0.0922 | 0.7592 | 0.1486 |
| | cars | Female | 0.4680 | 0.2406 | 0.2914 |
| | equipment | Male | 0.1123 | 0.7907 | 0.0971 |
| | cars | Female | 0.5074 | 0.2698 | 0.2228 |
| Random Forest | equipment | Male | 0.2265 | 0.4972 | 0.2762 |
| | cars | Female | 0.3457 | 0.3138 | 0.3404 |
| | equipment | Male | 0.1685 | 0.7609 | 0.0707 |
| | cars | Female | 0.4492 | 0.3476 | 0.2032 |
| Gradient Boosting Model | equipment | Male | 0.0999 | 0.6904 | 0.2097 |
| | cars | Female | 0.5859 | 0.1415 | 0.2725 |
| | equipment | Male | 0.1137 | 0.7613 | 0.1250 |
| | cars | Female | 0.5011 | 0.2523 | 0.2466 |
2.3 Optimal Transport in the Euclidean Representation
We can now apply Algorithm 1.1, with a Gaussian mapping in a Euclidean representation space, to transport the observed \(\widehat{\mathbf{x}}_j|s=0\) (scores for women) to the counterfactual \(\widehat{\mathbf{x}}_j|s=1\) (scores for men) in \(\mathcal{S}_d\).
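The package implements this step for us; the sketch below only illustrates the mechanics. It uses the additive log-ratio (alr) instead of the clr transform used in the chapter's tables, so that the covariance matrices are invertible, and assumes scores_glm_1 is the matrix of predicted scores with a logical vector is_female indicating group membership.

# illustrative sketch only: Gaussian OT between the two groups, in a
# log-ratio representation of the simplex; object names are assumptions
alr <- function(x) log(x[, -ncol(x), drop = FALSE] / x[, ncol(x)])
alr_inv <- function(z) {
  e <- cbind(exp(z), 1)
  e / rowSums(e)
}
sqrtm <- function(M) {
  # symmetric matrix square root via eigendecomposition
  e <- eigen(M, symmetric = TRUE)
  e$vectors %*% diag(sqrt(pmax(e$values, 0)), nrow(M)) %*% t(e$vectors)
}
Z0 <- alr(scores_glm_1[is_female, ])   # women (s = 0)
Z1 <- alr(scores_glm_1[!is_female, ])  # men (s = 1)
m0 <- colMeans(Z0); m1 <- colMeans(Z1)
S0 <- cov(Z0); S1 <- cov(Z1)
S0h <- sqrtm(S0); S0h_i <- solve(S0h)
A <- S0h_i %*% sqrtm(S0h %*% S1 %*% S0h) %*% S0h_i  # OT map matrix
# T(z) = m1 + A (z - m0), applied row-wise (A is symmetric)
T0 <- sweep(sweep(Z0, 2, m0) %*% A, 2, m1, FUN = "+")
transported_women <- alr_inv(T0)  # back to the simplex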
We then create, for each model, a matrix with the predicted scores for each category, i.e., the representation of the categorical variable Purpose in the unit simplex.
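For instance, the score matrices used below can be extracted as follows; this is a sketch, where the object names are assumptions chosen to match their usage later in the section.

# predicted scores in the simplex, one column per purpose category
scores_glm_1 <- predict(model_glm_1, type = "probs")
scores_glm_2 <- predict(model_glm_2, type = "probs")
# without newdata, randomForest returns out-of-bag predictions
scores_rf <- predict(model_rf, type = "prob")
# for the GBM (assumed fitted as sketched earlier), drop the third
# dimension of the returned array
scores_gbm <- predict(model_gbm, n.trees = 100, type = "response")[, , 1]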
We can then compare the proportion of each purpose category in the initial dataset with the average predicted score of each category for each model, and with the average transported predicted score for women.
Code to create the table.
# Proportions of each purpose level by gender in the dataset
prop_purposes <- german.credit |>
  count(Purpose, Gender) |>
  group_by(Gender) |>
  mutate(prop = n / sum(n)) |>
  select(-n) |>
  pivot_wider(names_from = Purpose, values_from = prop) |>
  mutate(type = "Categorical")

get_table_pred_transp <- function(scores, transp_scores) {
  # Average predicted scores for each purpose level by gender
  mean_scores_by_gender <- german.credit |>
    select(Purpose, Gender) |>
    bind_cols(scores) |>
    group_by(Gender) |>
    summarise(across(colnames(!!scores), ~mean(.x))) |>
    mutate(type = "Composition")
  # Average transported predicted score of women for each purpose level
  mean_transp_scores_women <- colMeans(transp_scores) |>
    as_tibble_row() |>
    mutate(type = "Transported", Gender = "Female -> Male")
  mean_scores_by_gender |>
    bind_rows(mean_transp_scores_women)
}

tb_pred_transp_mean <- prop_purposes |>
  mutate(model = "obs") |>
  bind_rows(
    get_table_pred_transp(scores_glm_1, transp_glm_1) |> mutate(model = "glm_1")
  ) |>
  bind_rows(
    get_table_pred_transp(scores_glm_2, transp_glm_2) |> mutate(model = "glm_2")
  ) |>
  bind_rows(
    get_table_pred_transp(scores_rf, transp_rf) |> mutate(model = "rf")
  ) |>
  bind_rows(
    get_table_pred_transp(scores_gbm, transp_gbm) |> mutate(model = "gbm")
  ) |>
  mutate(
    model = factor(
      model,
      levels = c("obs", "glm_1", "glm_2", "rf", "gbm"),
      labels = c(
        "Observed Values",
        "GAM-MLR(1)", "GAM-MLR(2)", "Random Forest", "Gradient Boosting Model"
      )
    )
  ) |>
  relocate(model, .before = Gender) |>
  relocate(type, .after = model)

tb_pred_transp_mean |>
  select(-model) |>
  kableExtra::kbl(booktabs = TRUE, digits = 4) |>
  kableExtra::kable_paper() |>
  kableExtra::add_header_above(c(" " = 2, "Purposes" = 3)) |>
  kableExtra::pack_rows(index = table(tb_pred_transp_mean$model))
Table 2.2: Optimal transport using the \(\operatorname{clr}\) transformation and Gaussian optimal transport on the scores in the database, with two multinomial GAM models to predict scores, a random forest, and a boosting model. For observed values, the observed proportions of purpose categories are reported by gender. Then, for each model, the average predicted scores by gender for each category are shown (Composition). Lastly, the averages of the transported predicted scores for women are reported (Transported). The columns cars, equipment, and other correspond to the three purposes.
| Model | type | Gender | cars | equipment | other |
|---|---|---|---|---|---|
| Observed Values | Categorical | Female | 0.3522 | 0.4464 | 0.2014 |
| | Categorical | Male | 0.3032 | 0.5323 | 0.1645 |
| GAM-MLR(1) | Composition | Female | 0.3487 | 0.4549 | 0.1964 |
| | Composition | Male | 0.3111 | 0.5133 | 0.1757 |
| | Transported | Female -> Male | 0.3102 | 0.5142 | 0.1757 |
| GAM-MLR(2) | Composition | Female | 0.3448 | 0.4584 | 0.1967 |
| | Composition | Male | 0.3195 | 0.5054 | 0.1751 |
| | Transported | Female -> Male | 0.3185 | 0.5057 | 0.1758 |
| Random Forest | Composition | Female | 0.3521 | 0.4519 | 0.1960 |
| | Composition | Male | 0.3223 | 0.4986 | 0.1791 |
| | Transported | Female -> Male | 0.3233 | 0.4969 | 0.1798 |
| Gradient Boosting Model | Composition | Female | 0.3452 | 0.4553 | 0.1995 |
| | Composition | Male | 0.3147 | 0.5118 | 0.1735 |
| | Transported | Female -> Male | 0.3168 | 0.5086 | 0.1746 |
2.3.1 Visualization of Transported Categories
We can then show the counterfactuals on a ternary plot, and graph the displacement interpolation when moving from the factuals (women) to the counterfactuals (men).
Then, for each model, we can show the representation in the simplex of the categorical variable Purpose, by gender (women in red, men in blue), as well as the displacement interpolation.
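A minimal ggtern sketch of such a plot, assuming the objects scores_glm_1, is_female, and transp_glm_1 defined or sketched above:

# factual compositions for women (red) and their transported
# counterparts (blue) on a ternary plot
df_plot <- bind_rows(
  as_tibble(scores_glm_1[is_female, ]) |> mutate(group = "Women (factual)"),
  as_tibble(transp_glm_1) |> mutate(group = "Women -> Men (counterfactual)")
)
ggtern(df_plot, aes(x = cars, y = equipment, z = other, colour = group)) +
  geom_point(alpha = .5, size = .8) +
  scale_colour_manual(NULL, values = c("red", "blue")) +
  theme_paper()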
Let us now use Algorithm 1.2 to create counterfactuals using matching in \(\mathcal{S}_3\).
To that end, we use the wasserstein_simplex() function from our package.
Note
The code takes a while to run (about 40 seconds per model on a 2023 MacBook Pro with an M2 chip). Here, we load results saved from a previous run.
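For each model, the call looks like the sketch below; the exact signature and return value of wasserstein_simplex() are assumptions, with X0_glm_1 and X1_glm_1 denoting the compositions for women and men.

# optimal matching between the two empirical distributions in the simplex
mapping_glm_1 <- wasserstein_simplex(
  as.matrix(X0_glm_1),  # compositions for women (group s = 0)
  as.matrix(X1_glm_1)   # compositions for men (group s = 1)
)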
For each model, we extract the representation of the Purpose characteristic in the simplex (i.e., the predicted scores from the \(k\)-th model, \(\widehat{m}^{(k)}(\mathbf{x}_j|\mathbf{x}_{-j})\)). Let us denote this composition by \(\mathbf{x}_{0,i}^{(k)}\).
Lastly, we compute, for our individual of interest, her counterfactual \(T^\star(\mathbf{x}_{0,i})\), as the weighted average of the characteristics of the individuals from the other group.
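In a sketch, if P denotes the \(n_0 \times n_1\) optimal coupling matrix returned by the matching step (its name and structure are assumptions), the counterfactual of woman \(i\) is obtained as follows.

# normalized matching weights of woman i over the men
w_i <- P[i, ] / sum(P[i, ])
# counterfactual composition: weighted average of the men's compositions
cfact_i_glm_1 <- matrix(w_i, nrow = 1) %*% as.matrix(X1_glm_1)
colnames(cfact_i_glm_1) <- colnames(X1_glm_1)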
We can then plot (Figure 3.2) the representation of the woman of interest obtained for each model (red dot), and her counterfactual obtained by matching (blue dot). We also plot, on the ternary plot, all the women and men. The size of the dots for men is proportional to their matching weights for the woman of interest.
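A possible ggtern sketch of this figure, reusing the assumed names above:

# all men (blue, sized by their matching weight), all women (light red),
# with the woman of interest and her counterfactual highlighted
ggtern(as_tibble(X1_glm_1), aes(x = cars, y = equipment, z = other)) +
  geom_point(aes(size = w_i), colour = "blue", alpha = .3) +
  geom_point(data = as_tibble(X0_glm_1), colour = "red", alpha = .3, size = .5) +
  geom_point(data = as_tibble(X0_glm_1[i, , drop = FALSE]),
             colour = "red", size = 3) +
  geom_point(data = as_tibble(cfact_i_glm_1), colour = "blue", size = 3) +
  scale_size_continuous("weight") +
  theme_paper()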
To finish, let us look more closely at the \(i\)-th woman, for whom we have shown the counterfactual on the ternary plot. The value of the Purpose variable for her is equipment:
german.credit[ind_0[i], "Purpose"]
[1] equipment
Levels: cars equipment other
Using the GAM-MLR(1) model, we obtained the following composition:
X0_glm_1[i, ]
cars equipment other
0.3014103 0.5126399 0.1859498
The closest points in the group of men, obtained using Algorithm 1.2, are:
The weight of the third closest man is much lower than those of the first two. For the two closest men, \(x_{1,j}=\) equipment. It therefore makes sense to suppose that the counterfactual version of woman \(i\), who has an “equipment” credit, is a man with the same purpose.
The counterfactual version for the composition is:
cfact_i_glm_1
cars equipment other
[1,] 0.2594858 0.5800124 0.1605018
The counterfactual categorical value would thus be:
colnames(cfact_i_glm_1)[which.max(cfact_i_glm_1)]
[1] "equipment"
For comparison, using Gaussian OT, the counterfactual would be:
transp_glm_1 |> slice(i)
# A tibble: 1 × 3
cars equipment other
<dbl> <dbl> <dbl>
1 0.263 0.575 0.162