2 Recalibration – From Uncertainty to Precision: Enhancing Binary Classifier Performance through Calibration

2.1 Data Generating Process

We use the same data generating process (DGP) as the one presented in Section 1.1 of Chapter 1. Let us redefine the function that simulates the data.

#' Simulates data
#'
#' @param n_obs number of desired observations
#' @param seed seed to use to generate the data
#' @param alpha scale parameter for the latent probability (if different 
#'   from 1, the probabilities are transformed and it may induce decalibration)
#' @param gamma scale parameter for the latent score (if different from 1, 
#'   the probabilities are transformed and it may induce decalibration)
sim_data <- function(n_obs = 2000, 
                     seed, 
                     alpha = 1, 
                     gamma = 1) {
  set.seed(seed)

  x1 <- runif(n_obs)
  x2 <- runif(n_obs)
  x3 <- runif(n_obs)
  x4 <- runif(n_obs)
  epsilon_p <- rnorm(n_obs, mean = 0, sd = .5)
  
  # True latent score
  eta <- -0.1*x1 + 0.05*x2 + 0.2*x3 - 0.05*x4  + epsilon_p
  # Transformed latent score
  eta_u <- gamma * eta
  
  # True probability
  p <- (1 / (1 + exp(-eta)))
  # Transformed probability
  p_u <- ((1 / (1 + exp(-eta_u))))^alpha

  # Observed event
  d <- rbinom(n_obs, size = 1, prob = p)

  tibble(
    # Event Probability
    p = p,
    p_u = p_u,
    # Binary outcome variable
    d = d,
    # Variables
    x1 = x1,
    x2 = x2,
    x3 = x3,
    x4 = x4
  )
}

2.2 Recalibration Methods

To fit the recalibration methods and then evaluate them, we split the dataset into two sets:

- a calibration set: to train the recalibrator
- a test set: on which we will compute the calibration metrics.

Note

In the general case where the scores are obtained using a classifier, the dataset needs to be split into three parts instead of two:

- a train set: to train the classifier
- a calibration set: to train the recalibrator
- a test set: on which we will compute the calibration metrics.
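The three-way split described in this note can be sketched in base R as follows. This is only an illustration: the helper name `split_three_way()` and the 50/30/20 proportions are assumptions and are not used elsewhere in this chapter.

```r
# Hypothetical helper: split row indices into train/calibration/test sets.
# The proportions are illustrative assumptions, not values from this chapter.
split_three_way <- function(n, prop_train = .5, prop_calib = .3, seed = 1) {
  set.seed(seed)
  shuffled <- sample(seq_len(n))
  n_train <- round(prop_train * n)
  n_calib <- round(prop_calib * n)
  list(
    train_index = shuffled[seq_len(n_train)],
    calib_index = shuffled[n_train + seq_len(n_calib)],
    test_index = shuffled[(n_train + n_calib + 1):n]
  )
}

splits <- split_three_way(n = 2000)
lengths(splits)
```

Each observation lands in exactly one set, so the classifier, the recalibrator, and the metrics never share data.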

As in Chapter 1, we define a function to create the splits.

#' Get calibration/test samples from the DGP
#'
#' @param seed seed to use to generate the data
#' @param n_obs number of desired observations
#' @param alpha scale parameter for the latent probability (if different 
#'   from 1, the probabilities are transformed and it may induce decalibration)
#' @param gamma scale parameter for the latent score (if different from 1, 
#'   the probabilities are transformed and it may induce decalibration)
get_samples <- function(seed,
                        n_obs = 2000,
                        alpha = 1,
                        gamma = 1) {
  set.seed(seed)
  data_all <- sim_data(
    n_obs = n_obs, seed = seed, alpha = alpha, gamma = gamma
  )
  
  # Calibration/test sets----
  data <- data_all |> select(d, x1:x4)
  probas <- data_all |> select(p)

  calib_index <- sample(1:nrow(data), size = .6 * nrow(data), replace = FALSE)
  tb_calib <- data |> slice(calib_index)
  tb_test <- data |> slice(-calib_index)
  probas_calib <- probas |> slice(calib_index)
  probas_test <- probas |> slice(-calib_index)

  list(
    data_all = data_all,
    data = data,
    tb_calib = tb_calib,
    tb_test = tb_test,
    probas_calib = probas_calib,
    probas_test = probas_test,
    calib_index = calib_index,
    seed = seed,
    n_obs = n_obs,
    alpha = alpha,
    gamma = gamma
  )
}

We simulate a single toy dataset to begin with. Simulations made on replications will be done later.

Let us consider a case where the probabilities are distorted using \(\alpha=.25\).

n_obs <- 2000
toy_data <- get_samples(seed = 1, n_obs = n_obs, alpha = .25, gamma = 1)
toy_data$data_all

# A tibble: 2,000 × 7
       p   p_u     d     x1     x2     x3     x4
   <dbl> <dbl> <int>  <dbl>  <dbl>  <dbl>  <dbl>
 1 0.366 0.778     0 0.266  0.872  0.188  0.770 
 2 0.613 0.885     1 0.372  0.967  0.505  0.690 
 3 0.561 0.865     1 0.573  0.867  0.0273 0.650 
 4 0.343 0.765     1 0.908  0.438  0.496  0.0747
 5 0.293 0.736     0 0.202  0.192  0.947  0.903 
 6 0.569 0.869     0 0.898  0.0823 0.381  0.133 
 7 0.345 0.766     0 0.945  0.583  0.698  0.211 
 8 0.705 0.916     0 0.661  0.0704 0.689  0.155 
 9 0.726 0.923     1 0.629  0.528  0.478  0.0545
10 0.673 0.906     1 0.0618 0.472  0.273  0.715 
# ℹ 1,990 more rows
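To make the distortion concrete, the short sketch below recomputes the transformed probabilities for the first three rows displayed above, using the rounded values of `p`. With \(\gamma = 1\), the transformation in `sim_data()` reduces to \(p_u = p^\alpha\). The second part uses \(\gamma = 2\), an illustrative value that is not used in the toy example, to show how rescaling the latent score acts on the probabilities.

```r
# First three true probabilities displayed above (rounded to 3 digits)
p <- c(0.366, 0.613, 0.561)

# Distortion through alpha (gamma = 1): p_u = p^alpha
alpha <- 0.25
p_u_alpha <- p^alpha
round(p_u_alpha, 3)  # matches the p_u column above: 0.778 0.885 0.865

# Distortion through gamma (alpha = 1): the latent score (the logit of p)
# is rescaled before being mapped back to a probability
gamma <- 2  # illustrative value, not the one used in the toy example
p_u_gamma <- plogis(gamma * qlogis(p))
round(p_u_gamma, 3)
```

With \(\alpha < 1\) every probability is pushed upwards, whereas \(\gamma > 1\) pushes probabilities away from 0.5.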

We extract the calib/test datasets with true probabilities:

data_all_calib <- toy_data$data_all |>
    slice(toy_data$calib_index)
data_all_calib

# A tibble: 1,200 × 7
       p   p_u     d     x1      x2    x3     x4
   <dbl> <dbl> <int>  <dbl>   <dbl> <dbl>  <dbl>
 1 0.670 0.905     0 0.262  0.155   0.818 0.0906
 2 0.650 0.898     1 0.975  0.683   0.697 0.367 
 3 0.413 0.802     0 0.229  0.687   0.554 0.734 
 4 0.750 0.930     1 0.0438 0.0907  0.816 0.0173
 5 0.304 0.742     0 0.0275 0.591   0.239 0.872 
 6 0.652 0.899     0 0.753  0.121   0.953 0.704 
 7 0.309 0.746     0 0.0747 0.922   0.557 0.408 
 8 0.355 0.772     0 0.914  0.493   0.205 0.175 
 9 0.555 0.863     0 0.513  0.00726 0.963 0.333 
10 0.425 0.807     0 0.386  0.802   0.313 0.571 
# ℹ 1,190 more rows

data_all_test <- toy_data$data_all |>
    slice(-toy_data$calib_index)
data_all_test

# A tibble: 800 × 7
       p   p_u     d    x1     x2      x3      x4
   <dbl> <dbl> <int> <dbl>  <dbl>   <dbl>   <dbl>
 1 0.613 0.885     1 0.372 0.967  0.505   0.690  
 2 0.498 0.840     1 0.498 0.396  0.566   0.973  
 3 0.402 0.796     1 0.718 0.106  0.0169  0.970  
 4 0.302 0.741     1 0.126 0.0102 0.554   0.507  
 5 0.684 0.909     0 0.382 0.0704 0.869   0.478  
 6 0.457 0.822     1 0.340 0.413  0.822   0.914  
 7 0.566 0.867     0 0.600 0.0802 0.983   0.377  
 8 0.760 0.934     1 0.494 0.277  0.266   0.231  
 9 0.444 0.816     1 0.827 0.0911 0.191   0.00274
10 0.637 0.894     1 0.668 0.277  0.00375 0.653  
# ℹ 790 more rows

2.2.1 Platt Scaling

Platt scaling (Platt et al. 1999) consists of applying logistic regression to \((d,s(x))\) where \(d\) denotes the binary outcome and \(s(x)\) is the vector of predicted scores.

# Logistic regression
lr <- glm(d ~ p_u, family = binomial(link = 'logit'), data = data_all_calib)

The predicted values in the calibration set and in the test set:

score_c_platt_calib <- predict(lr, newdata = data_all_calib, type = "response")
score_c_platt_test <- predict(lr, newdata = data_all_test, type = "response")

Let us create a vector of values to estimate the calibration curve.

linspace <- seq(0, 1, length.out = 100)

We can then use the fitted logistic regression to make predictions on this vector of values:

score_c_platt_linspace <- predict(
  lr, 
  newdata = tibble(p_u = linspace), 
  type = "response"
)

Let us put these values in a tibble:

tb_scores_c_platt <- tibble(
  linspace = linspace,
  p_c = score_c_platt_linspace #recalibrated score
)
tb_scores_c_platt

# A tibble: 100 × 2
   linspace       p_c
      <dbl>     <dbl>
 1   0      0.0000729
 2   0.0101 0.0000817
 3   0.0202 0.0000917
 4   0.0303 0.000103 
 5   0.0404 0.000115 
 6   0.0505 0.000129 
 7   0.0606 0.000145 
 8   0.0707 0.000163 
 9   0.0808 0.000183 
10   0.0909 0.000205 
# ℹ 90 more rows

The predicted probabilities \(p_u\) will then be transformed according to the logistic model depicted in Figure 2.1.

par(mar = c(4.1, 4.1, 2.1, 2.1))
plot(
  data_all_calib$p_u, data_all_calib$d, type = "p", cex = .5, pch = 19,
  col = adjustcolor("black", alpha.f = .4),
  xlab = "p", ylab = "g(p)",
  xlim = c(0,1)
)
lines(
  x = tb_scores_c_platt$linspace, y = tb_scores_c_platt$p_c, 
  type = "l", col = "#D55E00"
)

Figure 2.1: Recalibration Using Platt Scaling
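The fitted Platt model is simply a sigmoid applied to a linear function of the score, \(g(s) = 1 / (1 + e^{-(a + b s)})\). The sketch below checks this on toy data. The data simulated here (uniform true probabilities, distortion with an exponent of 0.25) are assumptions made for the illustration and are not the chapter's dataset.

```r
set.seed(123)
# Toy data (assumption): true probabilities and a distorted score
n <- 1000
p <- runif(n, min = .05, max = .95)
d <- rbinom(n, size = 1, prob = p)
s <- p^0.25  # distorted score, mimicking alpha = .25

# Platt scaling: logistic regression of the outcome on the score
fit <- glm(d ~ s, family = binomial(link = "logit"))
a <- coef(fit)[["(Intercept)"]]
b <- coef(fit)[["s"]]

# The recalibrated score is the sigmoid of a linear function of s
g_manual <- 1 / (1 + exp(-(a + b * s)))
g_predict <- predict(fit, type = "response")
all.equal(g_manual, unname(g_predict))
```

Because the map has only two parameters, it can only stretch and shift the scores on the logit scale. This is why the curve in Figure 2.1 is smooth and monotone.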

2.2.2 Isotonic Regression

Isotonic regression is a nonparametric approach, introduced for calibration by Zadrozny and Elkan (2002), that relies on the pool-adjacent-violators (PAV) algorithm. In a nutshell, it assumes that the scores to be recalibrated (in this chapter, the distorted probabilities \(p_u\)) rank the observations well. Under this assumption, the mapping \(g(\cdot)\) from the scores \(s(x)\) to the recalibrated probabilities \(g(s(x))\) is non-decreasing. It is then possible to use isotonic regression to learn this mapping. The PAV algorithm works as follows:

At a given iteration: consider the ranked examples \(x_{i-1}\) and \(x_{i}\).
- If the current values of the function to be learned are such that \(g(x_{i-1}) \leq g(x_{i})\), nothing changes.
- Otherwise, \(x_{i-1}\) and \(x_{i}\) are called pair-adjacent violators. The values of \(g(x_{i-1})\) and \(g(x_{i})\) are replaced by their mean \((g(x_{i-1}) + g(x_{i})) / 2\). If this move creates an earlier violation (\(g(x_{i-1})\) might now be lower than \(g(x_{i-2})\)), the values of \(g(x_{i-2})\), \(g(x_{i-1})\), and \(g(x_{i})\) are all replaced by the average over the group.
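The steps above can be sketched in a few lines of base R. The helper name `pav()` is hypothetical, and this is a didactic version rather than the implementation used by `isoreg()`. When two adjacent blocks are merged, the new value is the mean over all observations in the merged block, that is, a size-weighted mean of the two block values. The toy outcome vector is assumed to be already sorted by increasing score.

```r
# Hypothetical didactic implementation of pool-adjacent-violators
pav <- function(y) {
  means <- numeric(0)  # current value of each block
  sizes <- integer(0)  # number of observations in each block
  for (value in y) {
    # Start a new block with the current observation
    means <- c(means, value)
    sizes <- c(sizes, 1L)
    # Merge backwards while the last two blocks violate monotonicity
    while (length(means) > 1 &&
           means[length(means) - 1] > means[length(means)]) {
      k <- length(means)
      merged_size <- sizes[k - 1] + sizes[k]
      merged_mean <- (means[k - 1] * sizes[k - 1] + means[k] * sizes[k]) /
        merged_size
      means <- c(means[seq_len(k - 2)], merged_mean)
      sizes <- c(sizes[seq_len(k - 2)], merged_size)
    }
  }
  rep(means, times = sizes)
}

# Binary outcomes sorted by increasing score (toy example)
y <- c(0, 1, 0, 0, 1, 1, 0, 1)
pav(y)
all.equal(pav(y), isoreg(y)$yf)
```

On this toy vector the fitted values are 0, then 1/3 for the next three observations, then 2/3 for the following three, and finally 1.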

Let us compute the isotonic least squares regression on the scores \(p_u\):

iso <- isoreg(x = data_all_calib$p_u, y = data_all_calib$d)

Transforming the fit into a function:

fit_iso <- as.stepfun(iso)

The predicted values on the calibration set and on the test set:

score_c_isotonic_calib <- fit_iso(data_all_calib$p_u)
score_c_isotonic_test <- fit_iso(data_all_test$p_u)

Then, we can use this function to get estimated probabilities at some specific values (linspace):

score_c_isotonic_linspace <- fit_iso(linspace)

Let us recreate the tibble with the recalibrated scores:

tb_scores_c_isotonic <- tibble(
  linspace = linspace,
  p_c = score_c_isotonic_linspace
)

The predicted probabilities \(p_u\) will then be transformed according to the isotonic fit depicted in Figure 2.2.

par(mar = c(4.1, 4.1, 2.1, 2.1))
plot(
  data_all_calib$p_u, data_all_calib$d, type = "p", cex = .5, pch = 19,
  col = adjustcolor("black", alpha.f = .4),
  xlab = "p", ylab = "g(p)",
  xlim = c(0, 1)
)
lines(
  x = tb_scores_c_isotonic$linspace, y = tb_scores_c_isotonic$p_c, 
  type = "l", col = "#D55E00"
)

Figure 2.2: Recalibration Using Isotonic Regression

2.2.3 Beta Calibration

Since the scores are bounded in \([0,1]\), an alternative to fitting a logistic regression on them is beta calibration (Kull, Silva Filho, and Flach 2017). Whereas Platt scaling implicitly assumes that the scores obtained by the classifier are normally distributed within each class, beta calibration assumes that they follow a Beta distribution within each class. The calibration map to estimate is: \[\mu(s;a,b,c) = \frac{1}{1 + \frac{1}{e^c \frac{s^a}{(1-s)^b}}}\]

library(betacal)
# Beta calibration using the package accompanying the paper
bc <- beta_calibration(
  p = data_all_calib$p_u, 
  y = data_all_calib$d, 
  parameters = "abm" # 3 parameters a, b & m
)

[1] -126.7104
[1] 42.94288
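The calibration map \(\mu(s;a,b,c)\) can be rewritten as a logistic function of \(\ln s\) and \(\ln(1-s)\), namely \(\mu(s;a,b,c) = \sigma\left(c + a \ln s - b \ln(1-s)\right)\) with \(\sigma(z) = 1/(1+e^{-z})\). The sketch below checks this identity numerically. The parameter values are arbitrary and chosen only for the illustration; they are not the estimates returned by `beta_calibration()`.

```r
# Beta calibration map, written as in the formula above
beta_map <- function(s, a, b, c) {
  1 / (1 + 1 / (exp(c) * s^a / (1 - s)^b))
}

# Same map, written as a sigmoid of a linear function of log-scores
beta_map_logistic <- function(s, a, b, c) {
  plogis(c + a * log(s) - b * log(1 - s))
}

s <- seq(0.05, 0.95, by = 0.05)
a <- 2; b <- 0.5; c <- -0.3  # arbitrary illustrative values

all.equal(beta_map(s, a, b, c), beta_map_logistic(s, a, b, c))
```

With \(a = b = 1\) and \(c = 0\), the map is the identity, so scores that are already calibrated can be left unchanged.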

The predicted values on the calibration set and on the test set:

score_c_beta_calib <- beta_predict(p = data_all_calib$p_u, bc)
score_c_beta_test <- beta_predict(p = data_all_test$p_u, bc)

We can then use the beta calibration model to make predictions at the desired values (linspace).

score_c_beta_linspace <- beta_predict(linspace, bc)

Let us recreate the tibble with the recalibrated scores:

tb_scores_c_beta <- tibble(
  linspace = linspace,
  p_c = score_c_beta_linspace
)

The predicted probabilities \(p_u\) will then be transformed according to the beta calibration model depicted in Figure 2.3.

par(mar = c(4.1, 4.1, 2.1, 2.1))
plot(
  data_all_calib$p_u, data_all_calib$d, type = "p", cex = .5, pch = 19,
  col = adjustcolor("black", alpha.f = .4),
  xlab = "p", ylab = "g(p)",
  xlim = c(0, 1)
)
lines(
  x = tb_scores_c_beta$linspace, y = tb_scores_c_beta$p_c, 
  type = "l", col = "#D55E00"
)

Figure 2.3: Recalibration Using Beta Calibration

2.2.4 Local Regression

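Local regression fits a polynomial locally around each point, using only the nearest observations. The size of this neighborhood is set by the nn argument of lp(), used in the formula passed to the locfit() function.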

library(locfit)

locfit 1.5-9.9   2024-03-01


Attaching package: 'locfit'

The following object is masked from 'package:purrr':

    none

We consider three versions here, with different degrees for the polynomials (0, 1, or 2). We set the fraction of nearest neighbors to nn = 0.15, that is, 15% of the observations.

# Deg 0
locfit_0 <- locfit(
  formula = d ~ lp(p_u, nn = 0.15, deg = 0), 
  kern = "rect", maxk = 200, data = data_all_calib
)

Let us get the predicted values on the calibration set and on the test set:

score_c_locfit_0_calib <- predict(locfit_0, newdata = data_all_calib)
score_c_locfit_0_test <- predict(locfit_0, newdata = data_all_test)

Then, we can use the estimated mapping to get estimated probabilities at some specific values (linspace):

score_c_locfit_0_linspace <- predict(locfit_0, newdata = linspace)

# Deg 1
locfit_1 <- locfit(
  formula = d ~ lp(p_u, nn = 0.15, deg = 1), 
  kern = "rect", maxk = 200, data = data_all_calib
)

Let us get the predicted values on the calibration set and on the test set:

score_c_locfit_1_calib <- predict(locfit_1, newdata = data_all_calib)
score_c_locfit_1_test <- predict(locfit_1, newdata = data_all_test)

Then, we can use the estimated mapping to get estimated probabilities at some specific values (linspace):

score_c_locfit_1_linspace <- predict(locfit_1, newdata = linspace)

# Deg 2
locfit_2 <- locfit(
  formula = d ~ lp(p_u, nn = 0.15, deg = 2), 
  kern = "rect", maxk = 200, data = data_all_calib
)

Let us get the predicted values on the calibration set and on the test set:

score_c_locfit_2_calib <- predict(locfit_2, newdata = data_all_calib)
score_c_locfit_2_test <- predict(locfit_2, newdata = data_all_test)

Then, we can use the estimated mapping to get estimated probabilities at some specific values (linspace):

score_c_locfit_2_linspace <- predict(locfit_2, newdata = linspace)

The predicted probabilities \(p_u\) will then be transformed according to the local regression fits depicted in Figure 2.4 (degree 0), Figure 2.5 (degree 1), and Figure 2.6 (degree 2).

par(mar = c(4.1, 4.1, 2.1, 2.1))
plot(
  data_all_calib$p_u, data_all_calib$d, type = "p", cex = .5, pch = 19,
  col = adjustcolor("black", alpha.f = .4),
  xlab = "p", ylab = "g(p)",
  xlim = c(0, 1)
)
lines(
  x = linspace, y = score_c_locfit_0_linspace, 
  type = "l", col = "#D55E00"
)

Figure 2.4: Recalibration Using Local Regression (degree 0)

par(mar = c(4.1, 4.1, 2.1, 2.1))
plot(
  data_all_calib$p_u, data_all_calib$d, type = "p", cex = .5, pch = 19,
  col = adjustcolor("black", alpha.f = .4),
  xlab = "p", ylab = "g(p)",
  xlim = c(0, 1)
)
lines(
  x = linspace, y = score_c_locfit_1_linspace, 
  type = "l", col = "#D55E00"
)

Figure 2.5: Recalibration Using Local Regression (degree 1)

par(mar = c(4.1, 4.1, 2.1, 2.1))
plot(
  data_all_calib$p_u, data_all_calib$d, type = "p", cex = .5, pch = 19,
  col = adjustcolor("black", alpha.f = .4),
  xlab = "p", ylab = "g(p)",
  xlim = c(0, 1)
)
lines(
  x = linspace, y = score_c_locfit_2_linspace, 
  type = "l", col = "#D55E00"
)

Figure 2.6: Recalibration Using Local Regression (degree 2)
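To build intuition for the degree 0 case with a rectangular kernel, the sketch below computes a plain local average: for a given score, it takes the 15% of observations with the closest scores and averages their outcomes. The helper `local_mean()` is hypothetical and only approximates the idea; it is not expected to reproduce the output of `locfit()` exactly. The toy data are also assumptions.

```r
# Hypothetical helper: average of y over the fraction nn of observations
# whose x is closest to x0 (rectangular kernel, degree 0)
local_mean <- function(x, y, x0, nn = 0.15) {
  k <- round(length(x) * nn)
  nearest <- order(abs(x - x0))[seq_len(k)]
  mean(y[nearest])
}

set.seed(42)
n <- 2000
p <- runif(n)                       # toy true probabilities (assumption)
d <- rbinom(n, size = 1, prob = p)  # binary outcomes
p_u <- p^0.25                       # distorted scores

grid <- seq(0.5, 0.95, by = 0.05)
g_hat <- sapply(grid, function(x0) local_mean(p_u, d, x0))
round(cbind(score = grid, g_hat = g_hat), 3)
```

Degrees 1 and 2 replace the local average with a local linear or quadratic fit, which can produce values outside \([0,1]\) near the borders.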

2.3 Simulations: setup

Let us now consider multiple scenarios in which the scores are distorted by varying either \(\alpha\) or \(\gamma\) (see Section 1.1 in Chapter 1).

For each value of \(\alpha\) and \(\gamma\), we will generate 200 replications of the whole process, which consists of the following steps:

1. generate data from the DGP and transform the true probabilities \(p\) using either \(\alpha\) or \(\gamma\) to obtain \(p^u\)
2. split the data into two sets: a calibration set and a test set
3. apply each recalibration method on the calibration set
4. compute calibration metrics on both sets (for comparison) using either:
   - the recalibrated scores \(p^c := g(s(x))\)
   - the estimated scores \(p^u := s(x)\)
   - the true probabilities \(p\).
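One way to organize these replications is a grid with one row per (distortion value, seed) pair, each row being handled by one run of the simulation. The sketch below builds such a grid for \(\alpha\). The three values of \(\alpha\) are assumptions chosen for the illustration; only the number of replications (200) comes from the text.

```r
# Illustrative grid: the alpha values are assumptions
n_repl <- 200
alphas <- c(1/3, 1, 3)
grid_alpha <- expand.grid(alpha = alphas, seed = seq_len(n_repl))

nrow(grid_alpha)  # 3 values x 200 replications = 600 rows
head(grid_alpha)
```

Reusing the same seeds across the values of \(\alpha\) means that each distortion is applied to the same underlying draws, so differences between scenarios are not driven by sampling noise alone.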

2.3.1 Helper Functions

We (re)define a helper function to compute standard metrics (see Section 1.4 in Chapter 1):

#' Computes goodness of fit metrics
#' 
#' @param true_prob true probabilities
#' @param obs observed values (binary outcome)
#' @param pred predicted scores
#' @param threshold classification threshold (default to `.5`)
compute_gof <- function(true_prob,
                        obs, 
                        pred, 
                        threshold = .5) {
  
  # MSE
  mse <- mean((true_prob - pred)^2)
  
  pred_class <- as.numeric(pred > threshold)
  confusion_tb <- tibble(
    obs = obs,
    pred = pred_class
  ) |> 
    count(obs, pred)
  
  TN <- confusion_tb |> filter(obs == 0, pred == 0) |> pull(n)
  TP <- confusion_tb |> filter(obs == 1, pred == 1) |> pull(n)
  FP <- confusion_tb |> filter(obs == 0, pred == 1) |> pull(n)
  FN <- confusion_tb |> filter(obs == 1, pred == 0) |> pull(n)
  
  if (length(TN) == 0) TN <- 0
  if (length(TP) == 0) TP <- 0
  if (length(FP) == 0) FP <- 0
  if (length(FN) == 0) FN <- 0
  
  n_pos <- sum(obs == 1)
  n_neg <- sum(obs == 0)
  
  # Accuracy
  acc <- (TP + TN) / (n_pos + n_neg)
  # Misclassification rate
  missclass_rate <- 1 - acc
  # Sensitivity (True positive rate)
  # proportion of actual positives that are correctly identified as such
  TPR <- TP / n_pos
  # Specificity (True negative rate)
  # proportion of actual negatives that are correctly identified as such
  TNR <- TN / n_neg
  # False positive Rate
  FPR <- FP / n_neg
  
  tibble(
    mse = mse,
    accuracy = acc,
    missclass_rate = missclass_rate,
    sensitivity = TPR,
    specificity = TNR,
    threshold = threshold,
    FPR = FPR
  )
}
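As a small worked example of the quantities computed by `compute_gof()`, the sketch below derives the confusion counts by hand on eight toy observations. The data are made up for the illustration, and the code uses base R only rather than calling the function itself.

```r
# Toy data (assumption): observed events and predicted scores
obs  <- c(1, 1, 1, 0, 0, 0, 0, 1)
pred <- c(.9, .6, .4, .2, .7, .1, .3, .8)
threshold <- .5
pred_class <- as.numeric(pred > threshold)

TP <- sum(obs == 1 & pred_class == 1)  # 3
TN <- sum(obs == 0 & pred_class == 0)  # 3
FP <- sum(obs == 0 & pred_class == 1)  # 1
FN <- sum(obs == 1 & pred_class == 0)  # 1

accuracy <- (TP + TN) / length(obs)  # 0.75
sensitivity <- TP / sum(obs == 1)    # 0.75
specificity <- TN / sum(obs == 0)    # 0.75
FPR <- FP / sum(obs == 0)            # 0.25
c(accuracy = accuracy, sensitivity = sensitivity,
  specificity = specificity, FPR = FPR)
```

These metrics depend only on which side of the threshold each score falls, so a monotone recalibration changes them only through scores that cross the threshold.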

We (re)define a few helper functions to compute calibration metrics (see Section 1.2 in Chapter 1):

brier_score() to compute Brier Score (see Section 1.2.1.2 in Chapter 1).

Display the functions used to compute Brier Score

brier_score <- function(obs, scores) mean((scores - obs)^2)
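A quick worked example, on four made-up observations, shows how the Brier score behaves. The function is redefined so that the snippet can be run on its own.

```r
brier_score <- function(obs, scores) mean((scores - obs)^2)

obs <- c(1, 0, 1, 0)

# Informative scores: squared errors are 0.01, 0.04, 0.16 and 0.16
brier_score(obs = obs, scores = c(.9, .2, .6, .4))  # 0.0925

# Uninformative scores: always predicting 0.5 gives 0.25
brier_score(obs = obs, scores = rep(.5, 4))         # 0.25
```

Lower values are better, and the score is 0 only when every predicted probability equals the observed outcome.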

e_calib_error() to compute the Expected Calibration Error (see Section 1.2.1.3 in Chapter 1). This function relies on get_summary_bins() which computes summary statistics for binomial observed data and predicted scores returned by a model.

Display the functions used to compute the ECE

#' Computes summary statistics for binomial observed data and predicted scores
#' returned by a model
#'
#' @param obs vector of observed events
#' @param scores vector of predicted probabilities
#' @param k number of classes to create (quantiles, default to `10`)
#' @param threshold classification threshold (default to `.5`)
#' @return a tibble where each row corresponds to a bin, with the columns:
#' - `score_class`: level of the decile that the bin represents
#' - `nb`: number of observations
#' - `mean_obs`: average of obs (proportion of positive events)
#' - `mean_score`: average predicted score (confidence)
#' - `sum_obs`: number of positive events
#' - `accuracy`: accuracy (share of correctly predicted, using the
#'    threshold)
get_summary_bins <- function(obs,
                             scores,
                             k = 10, 
                             threshold = .5) {
  breaks <- quantile(scores, probs = (0:k) / k)
  tb_breaks <- tibble(breaks = breaks, labels = 0:k) |>
    group_by(breaks) |>
    slice_tail(n = 1) |>
    ungroup()
  
  x_with_class <- tibble(
    obs = obs,
    score = scores,
  ) |>
    mutate(
      score_class = cut(
        score,
        breaks = tb_breaks$breaks,
        labels = tb_breaks$labels[-1],
        include.lowest = TRUE
      ),
      pred_class = ifelse(score > threshold, 1, 0),
      correct_pred = obs == pred_class
    )
  
  x_with_class |>
    group_by(score_class) |>
    summarise(
      nb = n(),
      mean_obs = mean(obs),
      mean_score = mean(score), # confidence
      sum_obs = sum(obs),
      accuracy = mean(correct_pred)
    ) |>
    ungroup() |>
    mutate(
      score_class = as.character(score_class) |> as.numeric()
    ) |>
    arrange(score_class)
}


#' Expected Calibration Error
#'
#' @param obs vector of observed events
#' @param scores vector of predicted probabilities
#' @param k number of classes to create (quantiles, default to `10`)
#' @param threshold classification threshold (default to `.5`)
e_calib_error <- function(obs,
                          scores, 
                          k = 10, 
                          threshold = .5) {
  summary_bins <- get_summary_bins(
    obs = obs, scores = scores, k = k, threshold = threshold
  )
  summary_bins |>
    mutate(ece_bin = nb * abs(accuracy - mean_score)) |>
    summarise(ece = 1 / sum(nb) * sum(ece_bin)) |>
    pull(ece)
}

qmse_error() to compute Quantile-based MSE (see Section 1.2.1.4 in Chapter 1). This function also relies on get_summary_bins().

Display the functions used to compute the QMSE

#' Quantile-Based MSE
#'
#' @param obs vector of observed events
#' @param scores vector of predicted probabilities
#' @param k number of classes to create (quantiles, default to `10`)
#' @param threshold classification threshold (default to `.5`)
qmse_error <- function(obs,
                       scores, 
                       k = 10, 
                       threshold = .5) {
  summary_bins <- get_summary_bins(
    obs = obs, scores = scores, k = k, threshold = threshold
  )
  summary_bins |>
    mutate(qmse_bin = nb * (mean_obs - mean_score)^2) |>
    summarise(qmse = 1/sum(nb) * sum(qmse_bin)) |>
    pull(qmse)
}
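The binning logic behind these quantile-based metrics can be sketched in base R. The helper `qmse_base()` below is hypothetical: it mirrors the idea of `get_summary_bins()` followed by `qmse_error()` (quantile bins, then a size-weighted squared gap between the observed event rate and the average score), but it is not a drop-in replacement. The toy data are assumptions.

```r
# Hypothetical base R sketch of a quantile-binned calibration error
qmse_base <- function(obs, scores, k = 10) {
  # Bin edges at the empirical quantiles; duplicated edges are dropped
  breaks <- unique(quantile(scores, probs = (0:k) / k))
  bins <- cut(scores, breaks = breaks, include.lowest = TRUE)
  nb <- tapply(obs, bins, length)           # observations per bin
  mean_obs <- tapply(obs, bins, mean)       # observed event rate per bin
  mean_score <- tapply(scores, bins, mean)  # average score per bin
  sum(nb * (mean_obs - mean_score)^2, na.rm = TRUE) / length(obs)
}

set.seed(1)
n <- 5000
p <- runif(n)                        # toy true probabilities (assumption)
d <- rbinom(n, size = 1, prob = p)
qmse_base(obs = d, scores = p)       # calibrated scores: close to 0
qmse_base(obs = d, scores = p^0.25)  # distorted scores: clearly larger
```

Using quantiles rather than equally spaced edges keeps roughly the same number of observations in each bin, even when the scores are concentrated in a narrow range.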

weighted_mse() to compute Weighted MSE (see Section 1.2.1.5 in Chapter 1). This function relies on local_ci_scores() which identifies the nearest neighbors of a given predicted score and then calculates the mean observed event in that neighborhood, together with its confidence interval.

Display the functions used to compute the WMSE

library(binom) # provides binom.confint()

#' Computes the mean observed event, with a confidence interval, in the
#' neighborhood of a given predicted score
#'
#' @param obs vector of observed events
#' @param scores vector of predicted probabilities
#' @param tau value at which to compute the confidence interval
#' @param nn fraction of nearest neighbors
#' @param prob level of the confidence interval (default to `.95`)
#' @param method Which method to use to construct the interval. Any combination
#'  of c("exact", "ac", "asymptotic", "wilson", "prop.test", "bayes", "logit",
#'  "cloglog", "probit") is allowed. Default is "probit".
#' @return a tibble with a single row that corresponds to estimations made in
#'   the neighborhood of a probability $p=\tau$, using the fraction `nn` of
#'   neighbors, where the columns are:
#'  - `xlim`: score tau in the neighborhood of which statistics are computed
#'  - `mean`: estimation of $E(d | s(x) = \tau)$
#'  - `lower`: lower bound of the confidence interval
#'  - `upper`: upper bound of the confidence interval
local_ci_scores <- function(obs,
                            scores,
                            tau,
                            nn,
                            prob = .95,
                            method = "probit") {
  
  # Identify the k nearest neighbors based on hat{p}
  k <- round(length(scores) * nn)
  rgs <- rank(abs(scores - tau), ties.method = "first")
  idx <- which(rgs <= k)
  
  binom.confint(
    x = sum(obs[idx]),
    n = length(idx),
    conf.level = prob,
    methods = method
  )[, c("mean", "lower", "upper")] |>
    tibble() |>
    mutate(xlim = tau) |>
    relocate(xlim, .before = mean)
}

#' Compute the Weighted Mean Squared Error to assess the calibration of a model
#'
#' @param local_scores tibble with expected scores obtained with the 
#'   `local_ci_scores()` function
#' @param scores vector of raw predicted probabilities
weighted_mse <- function(local_scores, scores) {
  # To account for border bias (support is [0,1])
  scores_reflected <- c(-scores, scores, 2 - scores)
  dens <- density(
    x = scores_reflected, from = 0, to = 1, 
    n = length(local_scores$xlim)
  )
  # The weights
  weights <- dens$y
  local_scores |>
    mutate(
      wmse_p = (xlim - mean)^2,
      weight = !!weights
    ) |>
    summarise(wmse = sum(weight * wmse_p) / sum(weight)) |>
    pull(wmse)
}
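The reflection used in `weighted_mse()` to account for border bias can be illustrated on its own. A kernel density estimate on \([0,1]\) underestimates the density near 0 and 1, because part of the kernel mass falls outside the support. Mirroring the scores around both borders compensates for this. The sketch below uses uniform toy scores (an assumption), for which the true density is flat.

```r
set.seed(1)
scores <- runif(5000)  # toy scores, uniform on [0, 1] (assumption)

# Plain kernel density estimate, evaluated on [0, 1]
dens_plain <- density(scores, from = 0, to = 1, n = 101)

# Reflection: mirror the scores around 0 and around 1 before estimating
scores_reflected <- c(-scores, scores, 2 - scores)
dens_reflected <- density(scores_reflected, from = 0, to = 1, n = 101)

# Ratio of the estimated density at the border (0) to that at the
# center (0.5). For uniform scores, the true ratio is 1.
ratio_plain <- dens_plain$y[1] / dens_plain$y[51]
ratio_reflected <- dens_reflected$y[1] / dens_reflected$y[51]
c(plain = ratio_plain, reflected = ratio_reflected)
```

The reflected estimate integrates to about 1/3 over \([0,1]\) rather than 1, which does not matter here because the weights are normalized by their sum.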

local_calib_score() to compute Local Calibration Score (see Section 1.2.1.6 in Chapter 1).

Display the functions used to compute the LCS

#' Calibration score using Local Regression
#' 
#' @param obs vector of observed events
#' @param scores vector of predicted probabilities
local_calib_score <- function(obs, 
                              scores) {
  
  # Add a little noise to the scores, to avoid crashing R
  scores <- scores + rnorm(length(scores), 0, .001)
  locfit_0 <- locfit(
    formula = d ~ lp(scores, nn = 0.15, deg = 0), 
    kern = "rect", maxk = 200, 
    data = tibble(
      d = obs,
      scores = scores
    )
  )
  # Predictions on [0,1]
  linspace_raw <- seq(0, 1, length.out = 100)
  # Restricting this space to the range of observed scores
  keep_linspace <- which(linspace_raw >= min(scores) & linspace_raw <= max(scores))
  linspace <- linspace_raw[keep_linspace]
  
  locfit_0_linspace <- predict(locfit_0, newdata = linspace)
  locfit_0_linspace[locfit_0_linspace > 1] <- 1
  locfit_0_linspace[locfit_0_linspace < 0] <- 0
  
  # Squared difference between predicted value and the bisector, weighted by the density of values
  scores_reflected <- c(-scores, scores, 2 - scores)
  dens <- density(
    x = scores_reflected, from = 0, to = 1, 
    n = length(linspace_raw)
  )
  # The weights
  weights <- dens$y[keep_linspace]
  
  weighted.mean((linspace - locfit_0_linspace)^2, weights)
}

Then, we define the recalibrate() function, which recalibrates scores using the observed events \(d\), the associated predicted probabilities \(p^u\), and a given recalibration technique (as presented above in Section 2.2).

#' Recalibrates scores using a recalibration method
#' 
#' @param obs_calib vector of observed events in the calibration set
#' @param scores_calib vector of predicted probabilities in the calibration set
#' @param obs_test vector of observed events in the test set
#' @param scores_test vector of predicted probabilities in the test set
#' @param method recalibration method (`"platt"` for Platt-Scaling, 
#'   `"isotonic"` for isotonic regression, `"beta"` for beta calibration, 
#'   `"locfit"` for local regression)
#' @param iso_params list of named parameters to use in the local regression 
#'   (`nn` for fraction of nearest neighbors to use, `deg` for degree)
#' @param linspace vector of values at which to compute the recalibrated scores
#' @returns list of three elements: recalibrated scores on the calibration set,
#'   recalibrated scores on the test set, and recalibrated scores on a segment 
#'   of values
recalibrate <- function(obs_calib,
                        scores_calib,
                        obs_test,
                        scores_test,
                        method = c("platt", "isotonic", "beta", "locfit"),
                        iso_params = NULL,
                        linspace = NULL) {
  
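  # Resolve `method` to a single valid choice (the default is a vector,
  # which would make the `if (method == ...)` tests below fail)
  method <- match.arg(method)
  if (is.null(linspace)) linspace <- seq(0, 1, length.out = 100)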
  
  data_calib <- tibble(d = obs_calib, p_u = scores_calib)
  data_test <- tibble(d = obs_test, p_u = scores_test)
  
  if (method == "platt") {
    lr <- glm(d ~ p_u, family = binomial(link = 'logit'), data = data_calib)
    # Recalibrated scores on calibration and test set
    score_c_calib <- predict(lr, newdata = data_calib, type = "response")
    score_c_test <- predict(lr, newdata = data_test, type = "response")
    # Recalibrated values along a segment
    score_c_linspace <- predict(
      lr, 
      newdata = tibble(p_u = linspace), 
      type = "response"
    )
  } else if (method == "isotonic") {
    iso <- isoreg(x = data_calib$p_u, y = data_calib$d)
    fit_iso <- as.stepfun(iso)
    # Recalibrated scores on calibration and test set
    score_c_calib <- fit_iso(data_calib$p_u)
    score_c_test <- fit_iso(data_test$p_u)
    # Recalibrated values along a segment
    score_c_linspace <- fit_iso(linspace)
  } else if (method == "beta") {
    capture.output({
      bc <- beta_calibration(
        p = data_calib$p_u, 
        y = data_calib$d, 
        parameters = "abm" # 3 parameters a, b & m
      )
    })
    # Recalibrated scores on calibration and test set
    score_c_calib <- beta_predict(p = data_calib$p_u, bc)
    score_c_test <- beta_predict(p = data_test$p_u, bc)
    # Recalibrated values along a segment
    score_c_linspace <- beta_predict(linspace, bc)
  } else if (method == "locfit") {
    # Local regression with the degree and neighborhood given in `iso_params`
    locfit_reg <- locfit(
      formula = d ~ lp(p_u, nn = iso_params$nn, deg = iso_params$deg), 
      kern = "rect", maxk = 200, data = data_calib
    )
    # Recalibrated scores on calibration and test set
    score_c_calib <- predict(locfit_reg, newdata = data_calib)
    score_c_calib[score_c_calib < 0] <- 0
    score_c_calib[score_c_calib > 1] <- 1
    
    score_c_test <- predict(locfit_reg, newdata = data_test)
    score_c_test[score_c_test < 0] <- 0
    score_c_test[score_c_test > 1] <- 1
    
    # Recalibrated values along a segment
    score_c_linspace <- predict(locfit_reg, newdata = linspace)
    score_c_linspace[score_c_linspace < 0] <- 0
    score_c_linspace[score_c_linspace > 1] <- 1
  } else {
    stop(str_c(
      'Wrong method. Use one of the following: ',
      '"platt", "isotonic", "beta", "locfit"'
    ))
  }
  
  # Format results in tibbles:
  # For calibration set
  tb_score_c_calib <- tibble(
    d = obs_calib,
    p_u = scores_calib,
    p_c = score_c_calib
  )
  # For test set
  tb_score_c_test <- tibble(
    d = obs_test,
    p_u = scores_test,
    p_c = score_c_test
  )
  # For linear space
  tb_score_c_linspace <- tibble(
    linspace = linspace,
    p_c = score_c_linspace
  )
  
  list(
    tb_score_c_calib = tb_score_c_calib,
    tb_score_c_test = tb_score_c_test,
    tb_score_c_linspace = tb_score_c_linspace
  )
}
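The pattern implemented by `recalibrate()`, fitting the recalibrator on the calibration set and then applying it to the test set, can be illustrated with a self-contained base R sketch. It uses isotonic regression only, on toy data that are assumptions for the illustration, and it compares the Brier score on the test set before and after recalibration.

```r
set.seed(2024)
n <- 2000
p <- plogis(rnorm(n, sd = .5))   # toy true probabilities (assumption)
d <- rbinom(n, size = 1, prob = p)
p_u <- p^0.25                    # distorted scores

# 60% of the observations for calibration, the rest for testing
calib_index <- sample(seq_len(n), size = round(.6 * n))
test_index <- setdiff(seq_len(n), calib_index)

# Fit the recalibrator on the calibration set only
iso <- isoreg(x = p_u[calib_index], y = d[calib_index])
g <- as.stepfun(iso)

# Apply it to the test set
p_c_test <- g(p_u[test_index])

brier <- function(obs, scores) mean((scores - obs)^2)
c(
  before = brier(d[test_index], p_u[test_index]),
  after = brier(d[test_index], p_c_test)
)
```

Evaluating on observations that were not used to fit the recalibrator guards against an overly optimistic assessment, which matters most for flexible methods such as isotonic regression.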

Let us define a function that computes the different calibration metrics for a single replication of the simulations.

#' Computes the calibration metrics for a set of observed and predicted 
#' probabilities
#' 
#' @param obs observed events
#' @param scores predicted scores
#' @param true_probas true probabilities from the DGP (to compute MSE)
#' @param linspace vector of values at which to compute the WMSE
compute_metrics <- function(obs, 
                            scores, 
                            true_probas,
                            linspace) {
  mse <- mean((true_probas - scores)^2)
  brier <- brier_score(obs = obs, scores = scores)
  if (length(unique(scores)) > 1) {
    ece <- e_calib_error(obs = obs, scores = scores, k = 10, threshold = .5)
    qmse <- qmse_error(obs = obs, scores = scores, k = 10, threshold = .5)
  } else {
    ece <- NA
    qmse <- NA
  }
  
  expected_events <- map(
    .x = linspace,
    .f = ~local_ci_scores(
      obs = obs, 
      scores = scores,
      tau = .x, nn = .15, prob = .95, method = "probit")
  ) |> 
    bind_rows()
  wmse <- weighted_mse(local_scores = expected_events, scores = scores)
  lcs <- local_calib_score(obs = obs, scores = scores)
  
  tibble(
    mse = mse, brier = brier, ece = ece, qmse = qmse, wmse = wmse, lcs = lcs
  )
  
}

Lastly, we define the f_simul() function to perform one simulation.

#' Performs one replication for a simulation
#' 
#' @param i row number of the grid to use for the simulation
#' @param grid grid tibble with the seed number (column `seed`) and the deformations value (either `alpha` or `gamma`)
#' @param n_obs desired number of observation
#' @param type deformation probability type (either `alpha` or `gamma`); the 
#' name should match with the `grid` tibble
#' @param linspace values at which to compute the mean observed event when computing the WMSE
f_simul <- function(i, 
                    grid, 
                    n_obs, 
                    type = c("alpha", "gamma"),
                    linspace = NULL) {
  
  # Resolve `type` to a single valid choice ("alpha" by default)
  type <- match.arg(type)
  if (is.null(linspace)) linspace <- seq(0, 1, length.out = 100)
  
  ## 1. Generate Data----
  current_seed <- grid$seed[i]
  if (type == "alpha") {
    transform_scale <- grid$alpha[i]
    current_data <- get_samples(
      seed = current_seed, n_obs = n_obs, alpha = transform_scale, gamma = 1
    )
  } else if (type == "gamma") {
    transform_scale <- grid$gamma[i]
    current_data <- get_samples(
      seed = current_seed, n_obs = n_obs, alpha = 1, gamma = transform_scale
    )
  } else {
    stop("Transform type should be either alpha or gamma.")
  }
  
  ## 2. Calibration/Test sets----
  # Datasets with true probabilities
  data_all_calib <- current_data$data_all |>
    slice(current_data$calib_index)
  
  data_all_test <- current_data$data_all |>
    slice(-current_data$calib_index)
  
  ## 3. Recalibration----
  methods <- c("platt", "isotonic", "beta", "locfit", "locfit", "locfit")
  params <- list(
    NULL, NULL, NULL, 
    list(nn = .15, deg = 0), list(nn = .15, deg = 1), list(nn = .15, deg = 2)
  )
  method_names <- c(
    "platt", "isotonic", "beta", "locfit_0", "locfit_1", "locfit_2"
  )
  res_recalibration <- map2(
    .x = methods,
    .y = params,
    .f = ~recalibrate(
      obs_calib = data_all_calib$d, 
      scores_calib = data_all_calib$p_u, 
      obs_test = data_all_test$d, 
      scores_test = data_all_test$p_u,
      method = .x,
      iso_params = .y,
      linspace = linspace
    )
  )
  names(res_recalibration) <- method_names
  
  ## 4. Calibration metrics----
  
  ### Using True Probabilities
  #### Calibration Set
  calib_metrics_true_calib <- compute_metrics(
    obs = data_all_calib$d, 
    scores = data_all_calib$p, 
    true_probas = data_all_calib$p,
    linspace = linspace) |> 
    mutate(method = "True Prob.", sample = "Calibration")
  #### Test Set
  calib_metrics_true_test <- compute_metrics(
    obs = data_all_test$d, 
    scores = data_all_test$p, 
    true_probas = data_all_test$p,
    linspace = linspace) |> 
    mutate(method = "True Prob.", sample = "Test")
  
  ### Without Recalibration
  #### Calibration Set
  calib_metrics_without_calib <- compute_metrics(
    obs = data_all_calib$d, 
    scores = data_all_calib$p_u, 
    true_probas = data_all_calib$p,
    linspace = linspace) |> 
    mutate(method = "No Calibration", sample = "Calibration")
  #### Test Set
  calib_metrics_without_test <- compute_metrics(
    obs = data_all_test$d, 
    scores = data_all_test$p_u, 
    true_probas = data_all_test$p,
    linspace = linspace) |> 
    mutate(method = "No Calibration", sample = "Test")
  
  calib_metrics <- 
    calib_metrics_true_calib |> 
    bind_rows(calib_metrics_true_test) |> 
    bind_rows(calib_metrics_without_calib) |> 
    bind_rows(calib_metrics_without_test)
  
  ### With Recalibration: loop on methods
  for (method in method_names) {
    res_recalibration_current <- res_recalibration[[method]]
    #### Calibration Set
    calib_metrics_recalib_calib <- compute_metrics(
      obs = data_all_calib$d, 
      scores = res_recalibration_current$tb_score_c_calib$p_c, 
      true_probas = data_all_calib$p,
      linspace = linspace) |> 
      mutate(method = method, sample = "Calibration")
    #### Test Set
    calib_metrics_recalib_test <- compute_metrics(
      obs = data_all_test$d, 
      scores = res_recalibration_current$tb_score_c_test$p_c, 
      true_probas = data_all_test$p,
      linspace = linspace) |> 
      mutate(method = method, sample = "Test")
    
    calib_metrics <- 
      calib_metrics |> 
      bind_rows(calib_metrics_recalib_calib) |> 
      bind_rows(calib_metrics_recalib_test)
  }
  
  calib_metrics <- 
    calib_metrics |> 
    mutate(
      seed = current_seed,
      transform_scale = transform_scale,
      type = type
    )
  
  list(
    res_recalibration = res_recalibration,
    linspace = linspace,
    calib_metrics = calib_metrics,
    data_all_calib = data_all_calib,
    data_all_test = data_all_test,
    seed = current_seed
  )
}

2.4 Running the Simulations

Let us now run the simulations. We consider the following values for \(\alpha\) and \(\gamma\):

alphas <- gammas <- c(1/3, 2/3, 1, 3/2, 3)
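To see what these values do to a single probability, here is a minimal sketch that reuses the transformation written in sim_data(). The helper transform_prob() and the latent score eta = 0.5 are illustrative assumptions and are not part of the original code.

```r
# Hypothetical helper mirroring the transformation used in sim_data():
# p_u = (1 / (1 + exp(-gamma * eta)))^alpha
transform_prob <- function(eta, alpha = 1, gamma = 1) {
  (1 / (1 + exp(-gamma * eta)))^alpha
}

alphas <- gammas <- c(1/3, 2/3, 1, 3/2, 3)
eta <- 0.5               # arbitrary latent score, chosen for illustration
p <- transform_prob(eta) # true probability (alpha = gamma = 1)

# Transformed probabilities when only alpha varies
p_alpha <- sapply(alphas, function(a) transform_prob(eta, alpha = a))
# Transformed probabilities when only gamma varies
p_gamma <- sapply(gammas, function(g) transform_prob(eta, gamma = g))

data.frame(value = alphas, p_alpha = p_alpha, p_gamma = p_gamma)
```

Since probabilities lie in (0, 1), \(\alpha < 1\) inflates them and \(\alpha > 1\) shrinks them. With a positive latent score, \(\gamma < 1\) pulls the probability towards 0.5 and \(\gamma > 1\) pushes it towards 1.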

For each value of \(\alpha\), and then for each value of \(\gamma\), we draw 200 replicate samples from the same DGP.

n_repl <- 200 # number of replications
n_obs <- 2000 # number of observations to draw
grid_alpha <- expand_grid(alpha = alphas, seed = 1:n_repl)
grid_gamma <- expand_grid(gamma = gammas, seed = 1:n_repl)
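As a quick check of what such a grid contains, the sketch below rebuilds grid_alpha on its own and inspects its shape. It only assumes that the tidyr package is available.

```r
library(tidyr)

alphas <- c(1/3, 2/3, 1, 3/2, 3)
n_repl <- 200
grid_alpha <- expand_grid(alpha = alphas, seed = 1:n_repl)

# One row per (alpha, seed) pair
nrow(grid_alpha)
head(grid_alpha, 3)

# The same seed appears once for every value of alpha
subset(grid_alpha, seed == 1)
```

Each seed is therefore reused across the five values of the scale parameter. Since sim_data() calls set.seed(seed) before drawing the covariates, and assuming get_samples() passes the seed on unchanged, replications sharing a seed start from the same simulated data and differ only in the transformed scores.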

We perform the simulations for the varying values of \(\alpha\):

library(future)
nb_cores <- future::availableCores()-1
plan(multisession, workers = nb_cores)
progressr::with_progress({
  p <- progressr::progressor(steps = nrow(grid_alpha))
  simul_recalib_alpha <- furrr::future_map(
    .x = 1:nrow(grid_alpha),
    .f = ~{
      p()
      f_simul(
        i = .x, 
        grid = grid_alpha, 
        n_obs = n_obs, 
        type = "alpha", 
        linspace = NULL)
    },
    .options = furrr::furrr_options(seed = FALSE)
  )
})

And we do the same for varying values of \(\gamma\):

library(future)
nb_cores <- future::availableCores()-1
plan(multisession, workers = nb_cores)
progressr::with_progress({
  p <- progressr::progressor(steps = nrow(grid_gamma))
  simul_recalib_gamma <- furrr::future_map(
    .x = 1:nrow(grid_gamma),
    .f = ~{
      p()
      f_simul(
        i = .x, 
        grid = grid_gamma, 
        n_obs = n_obs, 
        type = "gamma", 
        linspace = NULL)
    },
    .options = furrr::furrr_options(seed = FALSE)
  )
})
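Each element of simul_recalib_alpha and simul_recalib_gamma is the list returned by f_simul(). A possible way to gather the calibration metrics of all replications into a single table is sketched below on a small mock object. The object mock_simul and its numeric values are made up for illustration; only the element name calib_metrics and its columns follow f_simul().

```r
library(purrr)
library(dplyr)
library(tibble)

# Mock of the simulation output: a list with one element per replication,
# each holding a `calib_metrics` tibble (values are made up)
mock_simul <- map(1:3, function(s) {
  list(
    calib_metrics = tibble(
      mse = c(0.010, 0.012) * s,
      method = c("No Calibration", "platt"),
      sample = "Test",
      seed = s,
      transform_scale = 1/3,
      type = "alpha"
    ),
    seed = s
  )
})

# Extract the metrics tibble of every replication and stack them
calib_metrics_all <- map(mock_simul, "calib_metrics") |> list_rbind()

# Average of a metric by transformation value, method and sample
calib_metrics_all |>
  group_by(type, transform_scale, method, sample) |>
  summarise(mse_mean = mean(mse), .groups = "drop")
```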

2.5 Standard Metrics on Simulations

We (re)define the function compute_gof_simul(), which applies compute_gof(), defined above, to compute the standard performance metrics (see Section 1.4 in Chapter 1) on the recalibrated probabilities, that is, on the transformed probabilities after recalibration:

#' Computes goodness of fit metrics for a replication
#'
#' @param i row number of the grid to use for the simulation
#' @param grid grid tibble with the seed number (column `seed`) and the deformation value (either `alpha` or `gamma`)
#' @param n_obs desired number of observations
#' @param type deformation probability type (either `alpha` or `gamma`); the 
#' name should match with the `grid` tibble
compute_gof_simul <- function(i,
                              grid,
                              n_obs,
                              type = c("alpha", "gamma")) {
  # Resolve the default c("alpha", "gamma") to a single value
  type <- match.arg(type)
  current_seed <- grid$seed[i]
  if (type == "alpha") {
    transform_scale <- grid$alpha[i]
    current_data <- get_samples(
      seed = current_seed, n_obs = n_obs, alpha = transform_scale, gamma = 1
    )
  } else if (type == "gamma") {
    transform_scale <- grid$gamma[i]
    current_data <- get_samples(
      seed = current_seed, n_obs = n_obs, alpha = 1, gamma = transform_scale
    )
  } else {
    stop("Transform type should be either alpha or gamma.")
  }
  
  
  # Get the calib/test datasets with true probabilities
  data_all_calib <- current_data$data_all |>
    slice(current_data$calib_index)
  
  data_all_test <- current_data$data_all |>
    slice(-current_data$calib_index)
  
  # Calibration set
  true_prob_calib <- data_all_calib$p # true probabilities
  obs_calib <- data_all_calib$d
  pred_calib <- data_all_calib$p_u # uncalibrated scores
  
  # Test set
  true_prob_test <- data_all_test$p # true probabilities
  obs_test <- data_all_test$d
  pred_test <- data_all_test$p_u # uncalibrated scores
  
  # Recalibration
  methods <- c("platt", "isotonic", "beta", "locfit", "locfit", "locfit")
  params <- list(
    NULL, NULL, NULL, 
    list(nn = .15, deg = 0), list(nn = .15, deg = 1), list(nn = .15, deg = 2)
  )
  method_names <- c(
    "platt", "isotonic", "beta", "locfit_0", "locfit_1", "locfit_2"
  )
  res_recalibration <- map2(
    .x = methods,
    .y = params,
    .f = ~recalibrate(
      obs_calib = data_all_calib$d, 
      scores_calib = data_all_calib$p_u, 
      obs_test = data_all_test$d, 
      scores_test = data_all_test$p_u,
      method = .x,
      iso_params = .y,
      linspace = NULL
    )
  )
  names(res_recalibration) <- method_names
  
  # Initialisation
  gof_metrics_simul_calib <- tibble()
  gof_metrics_simul_test <- tibble()
  
  # Calculate standard metrics
  ## With Recalibration: loop on methods
  for (method in method_names) {
    res_recalibration_current <- res_recalibration[[method]]
    ### Computation of metrics on the calibration set
    metrics_simul_calib <- map(
    .x = seq(0, 1, by = .01), # we vary the probability threshold
    .f = ~compute_gof(
      true_prob = true_prob_calib,
      obs = obs_calib,
      #### the predictions are now recalibrated:
      pred = res_recalibration_current$tb_score_c_calib$p_c,
      threshold = .x
      )
    ) |>
      list_rbind()
    
    ### Computation of metrics on the test set
    metrics_simul_test <- map(
    .x = seq(0, 1, by = .01), # we vary the probability threshold
    .f = ~compute_gof(
      true_prob = true_prob_test,
      obs = obs_test,
      #### the predictions are now recalibrated:
      pred = res_recalibration_current$tb_score_c_test$p_c,
      threshold = .x
      )
    ) |>
      list_rbind()
    
    roc_calib <- pROC::roc(
      obs_calib, 
      res_recalibration_current$tb_score_c_calib$p_c
    )
    auc_calib <- as.numeric(pROC::auc(roc_calib))
    
    roc_test <- pROC::roc(
      obs_test, 
      res_recalibration_current$tb_score_c_test$p_c
    )
    auc_test <- as.numeric(pROC::auc(roc_test))
    
    metrics_simul_calib <- metrics_simul_calib |>
      mutate(
        auc = auc_calib,
        seed = current_seed,
        scale_parameter = transform_scale,
        type = type,
        method = method,
        sample = "calibration"
      )
    
    metrics_simul_test <- metrics_simul_test |>
      mutate(
        auc = auc_test,
        seed = current_seed,
        scale_parameter = transform_scale,
        type = type,
        method = method,
        sample = "test"
      )
    
    gof_metrics_simul_calib <- gof_metrics_simul_calib |>
      bind_rows(metrics_simul_calib)
    gof_metrics_simul_test <- gof_metrics_simul_test |>
      bind_rows(metrics_simul_test)
  }
  
  gof_metrics_simul_calib |> 
    bind_rows(gof_metrics_simul_test)
}
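The loop over thresholds in compute_gof_simul() yields one row of metrics per threshold, which is what later allows ROC curves to be drawn from the FPR and sensitivity columns. The sketch below illustrates the idea with a hypothetical stand-in, gof_at_threshold(), on simulated scores. It is not the definition of compute_gof(), and the classification rule pred > threshold is an assumption.

```r
# Hypothetical stand-in for compute_gof(); the actual function is defined earlier
gof_at_threshold <- function(obs, pred, threshold) {
  pred_class <- as.integer(pred > threshold) # assumed classification rule
  tp <- sum(pred_class == 1 & obs == 1)
  tn <- sum(pred_class == 0 & obs == 0)
  fp <- sum(pred_class == 1 & obs == 0)
  fn <- sum(pred_class == 0 & obs == 1)
  data.frame(
    threshold = threshold,
    accuracy = (tp + tn) / length(obs),
    sensitivity = tp / (tp + fn), # true positive rate
    specificity = tn / (tn + fp),
    FPR = fp / (fp + tn)          # false positive rate
  )
}

set.seed(1)
pred <- runif(500)                        # simulated scores
obs <- rbinom(500, size = 1, prob = pred) # simulated events

# One row of metrics per threshold, as in the loop over seq(0, 1, by = .01)
tb_thresholds <- do.call(
  rbind,
  lapply(seq(0, 1, by = .01), function(th) gof_at_threshold(obs, pred, th))
)
head(tb_thresholds, 3)
```

Plotting sensitivity against FPR across the rows of such a table traces one ROC curve per replication.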

Let us apply the function compute_gof_simul() to the different simulations. We begin with the probabilities that were transformed using varying values of \(\alpha\) and then recalibrated.

library(future)
nb_cores <- future::availableCores()-1
plan(multisession, workers = nb_cores)
progressr::with_progress({
  p <- progressr::progressor(steps = nrow(grid_alpha))
  recalib_metrics_alpha <- furrr::future_map(
    .x = 1:nrow(grid_alpha),
    .f = ~{
      p()
      compute_gof_simul(
        i = .x, 
        grid = grid_alpha, 
        n_obs = n_obs, 
        type = "alpha"
      )
    },
    .options = furrr::furrr_options(seed = FALSE)
  )
})

recalib_metrics_alpha <- list_rbind(recalib_metrics_alpha)

We do the same for \(\gamma\):

library(future)
nb_cores <- future::availableCores()-1
plan(multisession, workers = nb_cores)
progressr::with_progress({
  p <- progressr::progressor(steps = nrow(grid_gamma))
  recalib_metrics_gamma <- furrr::future_map(
    .x = 1:nrow(grid_gamma),
    .f = ~{
      p()
      compute_gof_simul(
        i = .x, 
        grid = grid_gamma, 
        n_obs = n_obs, 
        type = "gamma"
      )
    },
    .options = furrr::furrr_options(seed = FALSE)
  )
})

recalib_metrics_gamma <- list_rbind(recalib_metrics_gamma)

We (re)define the function boxplot_simuls_metrics() from Section 1.4 (Chapter 1) to plot the standard metrics computed on the recalibrated simulations. This function produces a panel of boxplots. Each row of the panel corresponds to a metric, whereas each column corresponds to a value of either \(\alpha\) or \(\gamma\). We also have one column for each recalibration method used. On each figure, the x-axis shows the probability threshold \(\tau\), and the y-axis shows the values of the metric.

#' Boxplots for the simulations to visualize the distribution of some 
#' traditional metrics as a function of the probability threshold.
#' And, ROC curves
#' The resulting figure is a panel of graphs, with varying values for the 
#' transformation applied to the probabilities (in columns) and different 
#' metrics (in rows).
#' 
#' @param tb_metrics tibble with computed metrics for the simulations
#' @param type type of transformation: `"alpha"` or `"gamma"`
#' @param metrics names of the metrics computed
boxplot_simuls_metrics <- function(tb_metrics,
                                   type = c("alpha", "gamma"),
                                   metrics) {
  # Resolve the default c("alpha", "gamma") to a single value
  type <- match.arg(type)
  scale_parameters <- unique(tb_metrics$scale_parameter)
  
  par(mfrow = c(length(metrics), length(scale_parameters)))
  for (i_metric in 1:length(metrics)) {
    metric <- metrics[i_metric]
    for (i_scale_parameter in 1:length(scale_parameters)) {
      scale_parameter <- scale_parameters[i_scale_parameter]
      
      tb_metrics_current <- tb_metrics |> 
        filter(scale_parameter == !!scale_parameter)
      
      if (metric == "roc") {
        seeds <- unique(tb_metrics_current$seed)
        if (i_metric == 1) {
          # first row
          title <- latex2exp::TeX(
            str_c("$\\", type, " = ", round(scale_parameter, 2), "$")
          )
          size_top <- 2.1
        } else if (i_metric == length(metrics)) {
          # Last row
          title <- ""
          size_top <- 1.1
        } else {
          title <- ""
          size_top <- 1.1
        }
        
        if (i_scale_parameter == 1) {
          # first column
          y_lab <- str_c(metric, "\n True Positive Rate") 
          size_left <- 5.1
        } else {
          y_lab <- ""
          size_left <- 4.1
        }
        
        par(mar = c(4.5, size_left, size_top, 2.1))
        plot(
          0:1, 0:1,
          type = "l", col = NULL,
          xlim = 0:1, ylim = 0:1,
          xlab = "False Positive Rate", 
          ylab = y_lab,
          main = title
        )
        for (i_seed in 1:length(seeds)) {
          tb_metrics_current_seed <- 
            tb_metrics_current |> 
            filter(seed == seeds[i_seed])
          lines(
            x = tb_metrics_current_seed$FPR, 
            y = tb_metrics_current_seed$sensitivity,
            lwd = 2, col = adjustcolor("black", alpha.f = .04)
          )
        }
        segments(0, 0, 1, 1, col = "black", lty = 2)
        
      } else {
        # not ROC
        tb_metrics_current <- 
          tb_metrics_current |> 
          filter(threshold %in% seq(0, 1, by = .1))
        form <- str_c(metric, "~threshold")
        if (i_metric == 1) {
          # first row
          title <- latex2exp::TeX(
            str_c("$\\", type, " = ", round(scale_parameter, 2), "$")
          )
          size_top <- 2.1
        } else if (i_metric == length(metrics)) {
          # Last row
          title <- ""
          size_top <- 1.1
        } else {
          title <- ""
          size_top <- 1.1
        }
        
        if (i_scale_parameter == 1) {
          # first column
          y_lab <- metric
        } else {
          y_lab <- ""
        }
        
        par(mar = c(4.5, 4.1, size_top, 2.1))
        boxplot(
          formula(form), data = tb_metrics_current,
          xlab = "Threshold", ylab = y_lab,
          main = title
        )
      }
    }
  }
}

We aim to create a set of boxplots to visually assess how the probability transformations governed by \(\alpha\) or \(\gamma\) affect standard metrics once recalibration has been applied. Whenever \(\alpha \neq 1\) or \(\gamma \neq 1\), the resulting scores \(p^c\) are akin to those obtained by applying a recalibration method to an initially uncalibrated model. We want to verify that the recalibration methods applied to the uncalibrated probabilities do not degrade performance, as assessed by standard metrics. The results are shown in Figure 1.6 for varying values of \(\alpha\), and in Figure 1.7 for varying values of \(\gamma\).

Note

A strictly increasing transformation of the scores leaves the AUC unchanged, because the AUC depends only on the ranking of the scores. Isotonic regression fits a non-decreasing step function: it preserves the ordering of the scores up to the ties it creates, so the AUC is expected to change only marginally. Isotonic regression assumes that the event probability is a non-decreasing function of the initial score. Therefore, if the initial model requires decreasing transformations in the recalibration step, isotonic regression will not be effective.
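The two points of this note can be checked numerically. The sketch below uses simulated scores and base R only: auc_rank() is a hypothetical helper that computes the AUC from the Mann-Whitney rank statistic, and isoreg() from the stats package stands in for the isotonic recalibrator (it is not necessarily the implementation used by recalibrate()).

```r
# AUC through the Mann-Whitney rank statistic (ties receive average ranks)
auc_rank <- function(obs, scores) {
  r <- rank(scores)
  n1 <- sum(obs == 1)
  n0 <- sum(obs == 0)
  (sum(r[obs == 1]) - n1 * (n1 + 1) / 2) / (n1 * n0)
}

set.seed(123)
n <- 1000
p <- runif(n)                      # simulated true probabilities
d <- rbinom(n, size = 1, prob = p) # simulated events

auc_raw <- auc_rank(d, p)
auc_pow <- auc_rank(d, p^3)   # strictly increasing transformation
auc_rev <- auc_rank(d, 1 - p) # decreasing transformation

# Isotonic regression fitted on scores that need a decreasing correction
iso_fit <- isoreg(x = 1 - p, y = d)
auc_iso_rev <- auc_rank(d, fitted(iso_fit))

c(raw = auc_raw, power = auc_pow, reversed = auc_rev, 
  iso_on_reversed = auc_iso_rev)
```

The AUC is identical for p and p^3. For the reversed scores, the isotonic fit cannot follow the decreasing relationship and is nearly flat, so the recalibrated scores carry almost no ranking information.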

metrics <- c("mse", "accuracy", "sensitivity", "specificity", "roc", "auc")
methods <- c("platt", "isotonic", "beta", "locfit", "locfit", "locfit")

Varying \(\alpha\)
Varying \(\gamma\)