2  Recalibration

In this chapter, we explore different methods used in the literature to recalibrate a model. The basic idea is to learn a function \(g(\cdot)\) mapping scores \(s(x)\) into probability estimates \(g(p) := \mathbb{E}[D \mid s(x) = p]\). To avoid overfitting the training data while learning that mapping, we will rely on data from the calibration set.

As in Chapter 1, we will transform the true probabilities \(p\) of simulated data and consider these transformed values \(p^u\) to be scores that could be returned by a classifier model.

library(tidyverse)
── Attaching core tidyverse packages ──────────────────────── tidyverse 2.0.0 ──
✔ dplyr     1.1.4     ✔ readr     2.1.4
✔ forcats   1.0.0     ✔ stringr   1.5.0
✔ ggplot2   3.4.4     ✔ tibble    3.2.1
✔ lubridate 1.9.3     ✔ tidyr     1.3.0
✔ purrr     1.0.2     
── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
✖ dplyr::filter() masks stats::filter()
✖ dplyr::lag()    masks stats::lag()
ℹ Use the conflicted package (<http://conflicted.r-lib.org/>) to force all conflicts to become errors
Display the definitions of colors.
wongBlack     <- "#000000"
wongGold      <- "#E69F00"
wongLightBlue <- "#56B4E9"
wongGreen     <- "#009E73"
wongYellow    <- "#F0E442"
wongBlue      <- "#0072B2"
wongOrange    <- "#D55E00"
wongPurple    <- "#CC79A7"

2.1 Data Generating Process

We use the same DGP as that presented in Section 1.1 in Chapter 1. Let us redefine here the function which simulates data.

#' Simulates data
#'
#' @param n_obs number of desired observations
#' @param seed seed to use to generate the data
#' @param alpha exponent applied to the latent probability (if different 
#'   from 1, the probabilities are transformed and it may induce decalibration)
#' @param gamma scale parameter for the latent score (if different from 1, 
#'   the probabilities are transformed and it may induce decalibration)
sim_data <- function(n_obs = 2000, 
                     seed, 
                     alpha = 1, 
                     gamma = 1) {
  set.seed(seed)

  x1 <- runif(n_obs)
  x2 <- runif(n_obs)
  x3 <- runif(n_obs)
  x4 <- runif(n_obs)
  epsilon_p <- rnorm(n_obs, mean = 0, sd = .5)
  
  # True latent score
  eta <- -0.1*x1 + 0.05*x2 + 0.2*x3 - 0.05*x4  + epsilon_p
  # Transformed latent score
  eta_u <- gamma * eta
  
  # True probability
  p <- (1 / (1 + exp(-eta)))
  # Transformed probability
  p_u <- ((1 / (1 + exp(-eta_u))))^alpha

  # Observed event
  d <- rbinom(n_obs, size = 1, prob = p)

  tibble(
    # Event Probability
    p = p,
    p_u = p_u,
    # Binary outcome variable
    d = d,
    # Variables
    x1 = x1,
    x2 = x2,
    x3 = x3,
    x4 = x4
  )
}
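
To get a feel for the two distortions applied in sim_data(), the following minimal sketch evaluates them on a few latent scores. The values of eta, the exponent 0.25, and the scale 3 are chosen only for illustration and are not taken from the simulations.

```r
# Illustrative latent scores (assumed values, not from the simulation)
eta <- c(-2, 0, 2)
p <- plogis(eta)              # true probabilities: 1 / (1 + exp(-eta))

# alpha acts as an exponent on the probability
p_alpha <- p^0.25
# gamma rescales the latent score before the logistic transform
p_gamma <- plogis(3 * eta)

round(cbind(p, p_alpha, p_gamma), 3)
```

With an exponent below 1, every probability is pushed upward; with a scale above 1, probabilities are pushed away from 0.5 toward 0 or 1.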

2.2 Recalibration Methods

To apply and evaluate the different recalibration methods, we will split our dataset into the following sets:

  1. a calibration set: to train the recalibrator
  2. a test set: on which we will compute the calibration metrics.

Note: In the general case where the scores are obtained using a classifier, the dataset needs to be split into three parts instead of two:

  1. a train set: to train the classifier
  2. a calibration set: to train the recalibrator
  3. a test set: on which we will compute the calibration metrics.

As in Chapter 1, we define a function to create the splits.

#' Get calibration/test samples from the DGP
#'
#' @param seed seed to use to generate the data
#' @param n_obs number of desired observations
#' @param alpha exponent applied to the latent probability (if different 
#'   from 1, the probabilities are transformed and it may induce decalibration)
#' @param gamma scale parameter for the latent score (if different from 1, 
#'   the probabilities are transformed and it may induce decalibration)
get_samples <- function(seed,
                        n_obs = 2000,
                        alpha = 1,
                        gamma = 1) {
  set.seed(seed)
  data_all <- sim_data(
    n_obs = n_obs, seed = seed, alpha = alpha, gamma = gamma
  )
  
  # Calibration/test sets----
  data <- data_all |> select(d, x1:x4)
  probas <- data_all |> select(p)

  calib_index <- sample(1:nrow(data), size = .6 * nrow(data), replace = FALSE)
  tb_calib <- data |> slice(calib_index)
  tb_test <- data |> slice(-calib_index)
  probas_calib <- probas |> slice(calib_index)
  probas_test <- probas |> slice(-calib_index)

  list(
    data_all = data_all,
    data = data,
    tb_calib = tb_calib,
    tb_test = tb_test,
    probas_calib = probas_calib,
    probas_test = probas_test,
    calib_index = calib_index,
    seed = seed,
    n_obs = n_obs,
    alpha = alpha,
    gamma = gamma
  )
}

We simulate a single toy dataset to begin with. Simulations made on replications will be done later.

Let us consider a case where the probabilities are distorted using \(\alpha=.25\).

n_obs <- 2000
toy_data <- get_samples(seed = 1, n_obs = n_obs, alpha = .25, gamma = 1)
toy_data$data_all
# A tibble: 2,000 × 7
       p   p_u     d     x1     x2     x3     x4
   <dbl> <dbl> <int>  <dbl>  <dbl>  <dbl>  <dbl>
 1 0.366 0.778     0 0.266  0.872  0.188  0.770 
 2 0.613 0.885     1 0.372  0.967  0.505  0.690 
 3 0.561 0.865     1 0.573  0.867  0.0273 0.650 
 4 0.343 0.765     1 0.908  0.438  0.496  0.0747
 5 0.293 0.736     0 0.202  0.192  0.947  0.903 
 6 0.569 0.869     0 0.898  0.0823 0.381  0.133 
 7 0.345 0.766     0 0.945  0.583  0.698  0.211 
 8 0.705 0.916     0 0.661  0.0704 0.689  0.155 
 9 0.726 0.923     1 0.629  0.528  0.478  0.0545
10 0.673 0.906     1 0.0618 0.472  0.273  0.715 
# ℹ 1,990 more rows

We extract the calibration and test sets, keeping the true probabilities:

data_all_calib <- toy_data$data_all |>
    slice(toy_data$calib_index)
data_all_calib
# A tibble: 1,200 × 7
       p   p_u     d     x1      x2    x3     x4
   <dbl> <dbl> <int>  <dbl>   <dbl> <dbl>  <dbl>
 1 0.670 0.905     0 0.262  0.155   0.818 0.0906
 2 0.650 0.898     1 0.975  0.683   0.697 0.367 
 3 0.413 0.802     0 0.229  0.687   0.554 0.734 
 4 0.750 0.930     1 0.0438 0.0907  0.816 0.0173
 5 0.304 0.742     0 0.0275 0.591   0.239 0.872 
 6 0.652 0.899     0 0.753  0.121   0.953 0.704 
 7 0.309 0.746     0 0.0747 0.922   0.557 0.408 
 8 0.355 0.772     0 0.914  0.493   0.205 0.175 
 9 0.555 0.863     0 0.513  0.00726 0.963 0.333 
10 0.425 0.807     0 0.386  0.802   0.313 0.571 
# ℹ 1,190 more rows
data_all_test <- toy_data$data_all |>
    slice(-toy_data$calib_index)
data_all_test
# A tibble: 800 × 7
       p   p_u     d    x1     x2      x3      x4
   <dbl> <dbl> <int> <dbl>  <dbl>   <dbl>   <dbl>
 1 0.613 0.885     1 0.372 0.967  0.505   0.690  
 2 0.498 0.840     1 0.498 0.396  0.566   0.973  
 3 0.402 0.796     1 0.718 0.106  0.0169  0.970  
 4 0.302 0.741     1 0.126 0.0102 0.554   0.507  
 5 0.684 0.909     0 0.382 0.0704 0.869   0.478  
 6 0.457 0.822     1 0.340 0.413  0.822   0.914  
 7 0.566 0.867     0 0.600 0.0802 0.983   0.377  
 8 0.760 0.934     1 0.494 0.277  0.266   0.231  
 9 0.444 0.816     1 0.827 0.0911 0.191   0.00274
10 0.637 0.894     1 0.668 0.277  0.00375 0.653  
# ℹ 790 more rows

2.2.1 Platt Scaling

Platt scaling (Platt et al. 1999) consists of applying logistic regression to \((d,s(x))\) where \(d\) denotes the binary outcome and \(s(x)\) is the vector of predicted scores.

# Logistic regression
lr <- glm(d ~ p_u, family = binomial(link = 'logit'), data = data_all_calib)

The predicted values in the calibration set and in the test set:

score_c_platt_calib <- predict(lr, newdata = data_all_calib, type = "response")
score_c_platt_test <- predict(lr, newdata = data_all_test, type = "response")
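
As a minimal sketch of what the fitted logistic regression does, the recalibrated score is the sigmoid of an affine function of the original score, \(g(s) = 1 / (1 + e^{-(\beta_0 + \beta_1 s)})\). The data below (s, d) are hypothetical and simulated only for this illustration.

```r
set.seed(123)
# Hypothetical scores and binary outcomes, for illustration only
s <- runif(500)
d <- rbinom(500, size = 1, prob = s^2)

fit <- glm(d ~ s, family = binomial(link = "logit"))
beta <- coef(fit)

# Platt scaling: sigmoid of an affine function of the score
g_manual <- plogis(beta[1] + beta[2] * s)
g_predict <- predict(fit, type = "response")

all.equal(unname(g_manual), unname(g_predict))
```

Because the map has only two parameters, it can correct a monotone, sigmoid-shaped distortion but not more flexible ones.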

Let us create a vector of values to estimate the calibration curve.

linspace <- seq(0, 1, length.out = 100)

We can then use the fitted logistic regression to make predictions on this vector of values:

score_c_platt_linspace <- predict(
  lr, 
  newdata = tibble(p_u = linspace), 
  type = "response"
)

Let us put these values in a tibble:

tb_scores_c_platt <- tibble(
  linspace = linspace,
  p_c = score_c_platt_linspace #recalibrated score
)
tb_scores_c_platt
# A tibble: 100 × 2
   linspace       p_c
      <dbl>     <dbl>
 1   0      0.0000729
 2   0.0101 0.0000817
 3   0.0202 0.0000917
 4   0.0303 0.000103 
 5   0.0404 0.000115 
 6   0.0505 0.000129 
 7   0.0606 0.000145 
 8   0.0707 0.000163 
 9   0.0808 0.000183 
10   0.0909 0.000205 
# ℹ 90 more rows

The scores \(p^u\) are then transformed according to the logistic model shown in Figure 2.1.

par(mar = c(4.1, 4.1, 2.1, 2.1))
plot(
  data_all_calib$p_u, data_all_calib$d, type = "p", cex = .5, pch = 19,
  col = adjustcolor("black", alpha.f = .4),
  xlab = "p", ylab = "g(p)",
  xlim = c(0,1)
)
lines(
  x = tb_scores_c_platt$linspace, y = tb_scores_c_platt$p_c, 
  type = "l", col = "#D55E00"
)
Figure 2.1: Recalibration Using Platt Scaling

2.2.2 Isotonic Regression

Isotonic regression is a nonparametric approach, proposed as a calibration method by Zadrozny and Elkan (2002), which relies on the pool-adjacent-violators (PAV) algorithm. In a nutshell, it assumes that the predicted scores of the initial model (here, the transformed probabilities \(p^u\)) reproduce the ranks of the observations well. Under this assumption, the mapping \(g(\cdot)\) from the scores \(s(x)\) into the probabilities \(g(p)\) is non-decreasing. It is then possible to use isotonic regression to learn the mapping. The PAV algorithm works as follows:

  • At a given iteration: consider the ranked examples \(x_{i-1}\) and \(x_{i}\).

    • If the current value of the function to be learned is such that \(g(x_{i-1}) \leq g(x_{i})\), nothing changes.
    • Otherwise, \(x_{i-1}\) and \(x_{i}\) are called pair-adjacent violators. The values of \(g(x_{i-1})\) and \(g(x_{i})\) are replaced by their mean \((g(x_{i-1}) + g(x_{i})) / 2\). If this move creates earlier violations (\(g(x_{i-1})\) might be lower than \(g(x_{i-2})\)), a new value is set for \(g(x_{i-2})\), \(g(x_{i-1})\), and \(g(x_{i})\), as the average in the group.
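
The steps above can be sketched in a few lines of base R. The function pav() below is a hypothetical helper written for this illustration (it is not the implementation behind isoreg()); it stores the size of each pooled block so that merged groups are replaced by the average over the whole group. The data are simulated for the example.

```r
# Minimal pool-adjacent-violators sketch (assumes y is already sorted by score)
pav <- function(y) {
  values <- numeric(0)   # current block means
  weights <- numeric(0)  # current block sizes
  for (y_i in y) {
    values <- c(values, y_i)
    weights <- c(weights, 1)
    n <- length(values)
    # Merge the last two blocks while they violate monotonicity
    while (n > 1 && values[n - 1] > values[n]) {
      w <- weights[n - 1] + weights[n]
      values[n - 1] <-
        (weights[n - 1] * values[n - 1] + weights[n] * values[n]) / w
      weights[n - 1] <- w
      values <- values[-n]
      weights <- weights[-n]
      n <- n - 1
    }
  }
  rep(values, times = weights)
}

set.seed(1)
s <- sort(runif(200))                  # hypothetical scores, sorted
d <- rbinom(200, size = 1, prob = s)   # hypothetical binary outcomes

fit_pav <- pav(d)
fit_ref <- isoreg(x = s, y = d)$yf     # base R fit, used as a reference
all.equal(fit_pav, fit_ref)
```

The resulting fit is a non-decreasing step function, which is why the calibration curve obtained with isotonic regression looks like a staircase.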

Let us compute the isotonic least squares regression on the scores \(p^u\):

iso <- isoreg(x = data_all_calib$p_u, y = data_all_calib$d)

Transforming the fit into a function:

fit_iso <- as.stepfun(iso)

The predicted values on the calibration set and on the test set:

score_c_isotonic_calib <- fit_iso(data_all_calib$p_u)
score_c_isotonic_test <- fit_iso(data_all_test$p_u)

Then, we can use this function to get estimated probabilities at some specific values (linspace):

score_c_isotonic_linspace <- fit_iso(linspace)

Let us recreate the tibble with the recalibrated scores:

tb_scores_c_isotonic <- tibble(
  linspace = linspace,
  p_c = score_c_isotonic_linspace
)

The scores \(p^u\) are then transformed according to the isotonic regression fit shown in Figure 2.2.

par(mar = c(4.1, 4.1, 2.1, 2.1))
plot(
  data_all_calib$p_u, data_all_calib$d, type = "p", cex = .5, pch = 19,
  col = adjustcolor("black", alpha.f = .4),
  xlab = "p", ylab = "g(p)",
  xlim = c(0, 1)
)
lines(
  x = tb_scores_c_isotonic$linspace, y = tb_scores_c_isotonic$p_c, 
  type = "l", col = "#D55E00"
)
Figure 2.2: Recalibration Using Isotonic Regression

2.2.3 Beta Calibration

Because the scores are bounded in \([0,1]\), beta calibration (Kull, Silva Filho, and Flach 2017) can be used instead of fitting a logistic regression on the predicted values. Platt scaling implicitly assumes that the scores obtained by the classifier are normally distributed within each class; beta calibration instead assumes that they follow a Beta distribution. We estimate: \[\mu(s;a,b,c) = \frac{1}{1 + \frac{1}{e^c \frac{s^a}{(1-s)^b}}}\]
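
The formula above can be rewritten as \(\mu(s;a,b,c) = 1 / (1 + e^{-(c + a \ln s - b \ln(1-s))})\), that is, a logistic regression on \(\ln s\) and \(-\ln(1-s)\). The sketch below checks this equivalence numerically on hypothetical simulated data. It uses an unconstrained glm() fit; the betacal package may estimate the parameters differently (for instance under constraints), so this only illustrates the functional form.

```r
set.seed(42)
# Hypothetical scores strictly inside (0, 1) and outcomes, for illustration
s <- runif(1000, min = 0.05, max = 0.95)
d <- rbinom(1000, size = 1, prob = s^3)

# Logistic regression on ln(s) and -ln(1 - s)
fit <- glm(d ~ log(s) + I(-log(1 - s)), family = binomial(link = "logit"))
c_hat <- coef(fit)[[1]]
a_hat <- coef(fit)[[2]]
b_hat <- coef(fit)[[3]]

# The beta calibration map, written as in the formula above
mu <- function(s, a, b, c) 1 / (1 + 1 / (exp(c) * s^a / (1 - s)^b))

all.equal(mu(s, a_hat, b_hat, c_hat), unname(fitted(fit)))
```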

library(betacal)
# Beta calibration using the paper package
bc <- beta_calibration(
  p = data_all_calib$p_u, 
  y = data_all_calib$d, 
  parameters = "abm" # 3 parameters a, b & m
)
[1] -126.7104
[1] 42.94288

The predicted values on the calibration set and on the test set:

score_c_beta_calib <- beta_predict(p = data_all_calib$p_u, bc)
score_c_beta_test <- beta_predict(p = data_all_test$p_u, bc)

We can then use the beta calibration model to make predictions at the desired values (linspace).

score_c_beta_linspace <- beta_predict(linspace, bc)

Let us recreate the tibble with the recalibrated scores:

tb_scores_c_beta <- tibble(
  linspace = linspace,
  p_c = score_c_beta_linspace
)

The scores \(p^u\) are then transformed according to the beta calibration map shown in Figure 2.3.

par(mar = c(4.1, 4.1, 2.1, 2.1))
plot(
  data_all_calib$p_u, data_all_calib$d, type = "p", cex = .5, pch = 19,
  col = adjustcolor("black", alpha.f = .4),
  xlab = "p", ylab = "g(p)",
  xlim = c(0, 1)
)
lines(
  x = tb_scores_c_beta$linspace, y = tb_scores_c_beta$p_c, 
  type = "l", col = "#D55E00"
)
Figure 2.3: Recalibration Using Beta Calibration

2.2.4 Local Regression

Local regression fits a polynomial in the neighborhood of each evaluation point. The size of that neighborhood is controlled by the nn argument passed through lp() in the formula given to locfit().

library(locfit)
locfit 1.5-9.8   2023-06-11

Attaching package: 'locfit'
The following object is masked from 'package:purrr':

    none

We consider three versions here, with different degrees for the polynomials (0, 1, or 2). We set the number of nearest neighbors to use to nn = 0.15, that is, 15%.
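
For intuition, a degree-0 fit with a rectangular kernel amounts to averaging the outcomes of the nearest neighbors of each evaluation point. The sketch below implements that idea directly in base R. The helper nn_mean() and the simulated data are hypothetical, and locfit() may differ in details such as how it interpolates between evaluation points, so the numbers are not expected to match its output exactly.

```r
# Average of the outcomes of the nearest neighbours of each point in x0
# (hypothetical helper, for illustration only)
nn_mean <- function(x0, x, y, nn = 0.15) {
  k <- round(length(x) * nn)
  sapply(x0, function(z) {
    idx <- order(abs(x - z))[seq_len(k)]
    mean(y[idx])
  })
}

set.seed(7)
s <- runif(400)                        # hypothetical scores
d <- rbinom(400, size = 1, prob = s)   # hypothetical binary outcomes
grid <- seq(0.1, 0.9, by = 0.2)

g_hat <- nn_mean(grid, x = s, y = d, nn = 0.15)
round(g_hat, 2)
```

Higher degrees replace the local average with a local linear or quadratic fit, which can reduce bias but may produce values outside \([0,1]\).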

# Deg 0
locfit_0 <- locfit(
  formula = d ~ lp(p_u, nn = 0.15, deg = 0), 
  kern = "rect", maxk = 200, data = data_all_calib
)

Let us get the predicted values in the calibration and test sets:

score_c_locfit_0_calib <- predict(locfit_0, newdata = data_all_calib)
score_c_locfit_0_test <- predict(locfit_0, newdata = data_all_test)

Then, we can use the estimated mapping to get estimated probabilities at some specific values (linspace):

score_c_locfit_0_linspace <- predict(locfit_0, newdata = linspace)
# Deg 1
locfit_1 <- locfit(
  formula = d ~ lp(p_u, nn = 0.15, deg = 1), 
  kern = "rect", maxk = 200, data = data_all_calib
)

Let us get the predicted values in the calibration and test sets:

score_c_locfit_1_calib <- predict(locfit_1, newdata = data_all_calib)
score_c_locfit_1_test <- predict(locfit_1, newdata = data_all_test)

Then, we can use the estimated mapping to get estimated probabilities at some specific values (linspace):

score_c_locfit_1_linspace <- predict(locfit_1, newdata = linspace)
# Deg 2
locfit_2 <- locfit(
  formula = d ~ lp(p_u, nn = 0.15, deg = 2), 
  kern = "rect", maxk = 200, data = data_all_calib
)

Let us get the predicted values in the calibration and test sets:

score_c_locfit_2_calib <- predict(locfit_2, newdata = data_all_calib)
score_c_locfit_2_test <- predict(locfit_2, newdata = data_all_test)

Then, we can use the estimated mapping to get estimated probabilities at some specific values (linspace):

score_c_locfit_2_linspace <- predict(locfit_2, newdata = linspace)

The scores \(p^u\) are then transformed according to the local regressions shown in Figure 2.4 (degree 0), Figure 2.5 (degree 1), and Figure 2.6 (degree 2).

par(mar = c(4.1, 4.1, 2.1, 2.1))
plot(
  data_all_calib$p_u, data_all_calib$d, type = "p", cex = .5, pch = 19,
  col = adjustcolor("black", alpha.f = .4),
  xlab = "p", ylab = "g(p)",
  xlim = c(0, 1)
)
lines(
  x = linspace, y = score_c_locfit_0_linspace, 
  type = "l", col = "#D55E00"
)
Figure 2.4: Recalibration Using Local Regression

par(mar = c(4.1, 4.1, 2.1, 2.1))
plot(
  data_all_calib$p_u, data_all_calib$d, type = "p", cex = .5, pch = 19,
  col = adjustcolor("black", alpha.f = .4),
  xlab = "p", ylab = "g(p)",
  xlim = c(0, 1)
)
lines(
  x = linspace, y = score_c_locfit_1_linspace, 
  type = "l", col = "#D55E00"
)
Figure 2.5: Recalibration Using Local Regression

par(mar = c(4.1, 4.1, 2.1, 2.1))
plot(
  data_all_calib$p_u, data_all_calib$d, type = "p", cex = .5, pch = 19,
  col = adjustcolor("black", alpha.f = .4),
  xlab = "p", ylab = "g(p)",
  xlim = c(0, 1)
)
lines(
  x = linspace, y = score_c_locfit_2_linspace, 
  type = "l", col = "#D55E00"
)
Figure 2.6: Recalibration Using Local Regression

2.3 Simulations: setup

Let us now consider multiple scenarios in which the scores are distorted by varying either \(\alpha\) or \(\gamma\) (see Section 1.1 in Chapter 1).

For each value of \(\alpha\) and \(\gamma\), we will generate 200 replications of the whole process, which consists of the following steps:

  1. generate data from the DGP and transform the true probabilities \(p\) using either \(\alpha\) or \(\gamma\) to obtain \(p^u\)
  2. split the data into two sets: calibration and test set
  3. apply each recalibration method on the calibration set
  4. compute calibration metrics on both sets (for comparison) using either:
     • the recalibrated scores \(p^c := g(s(x))\)
     • the estimated scores \(p^u := s(x)\)
     • the true probabilities \(p\).

2.3.1 Helper Functions

We (re)define a helper function to compute standard metrics (see Section 1.4 in Chapter 1):

#' Computes goodness of fit metrics
#' 
#' @param true_prob true probabilities
#' @param obs observed values (binary outcome)
#' @param pred predicted scores
#' @param threshold classification threshold (default to `.5`)
compute_gof <- function(true_prob,
                        obs, 
                        pred, 
                        threshold = .5) {
  
  # MSE
  mse <- mean((true_prob - pred)^2)
  
  pred_class <- as.numeric(pred > threshold)
  confusion_tb <- tibble(
    obs = obs,
    pred = pred_class
  ) |> 
    count(obs, pred)
  
  TN <- confusion_tb |> filter(obs == 0, pred == 0) |> pull(n)
  TP <- confusion_tb |> filter(obs == 1, pred == 1) |> pull(n)
  FP <- confusion_tb |> filter(obs == 0, pred == 1) |> pull(n)
  FN <- confusion_tb |> filter(obs == 1, pred == 0) |> pull(n)
  
  if (length(TN) == 0) TN <- 0
  if (length(TP) == 0) TP <- 0
  if (length(FP) == 0) FP <- 0
  if (length(FN) == 0) FN <- 0
  
  n_pos <- sum(obs == 1)
  n_neg <- sum(obs == 0)
  
  # Accuracy
  acc <- (TP + TN) / (n_pos + n_neg)
  # Misclassification rate
  missclass_rate <- 1 - acc
  # Sensitivity (True positive rate)
  # proportion of actual positives that are correctly identified as such
  TPR <- TP / n_pos
  # Specificity (True negative rate)
  # proportion of actual negatives that are correctly identified as such
  TNR <- TN / n_neg
  # False positive Rate
  FPR <- FP / n_neg
  
  tibble(
    mse = mse,
    accuracy = acc,
    missclass_rate = missclass_rate,
    sensitivity = TPR,
    specificity = TNR,
    threshold = threshold,
    FPR = FPR
  )
}
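
As a small worked example of the quantities computed in compute_gof(), the sketch below builds the confusion matrix with table() on six hypothetical observations. Fixing the factor levels is an alternative to the length checks used above: cells with no observation then appear as zeros instead of being absent.

```r
# Hypothetical outcomes and predicted scores, for illustration only
obs <- c(0, 0, 1, 1, 1, 0)
pred <- c(0.2, 0.6, 0.7, 0.4, 0.9, 0.1)
pred_class <- as.numeric(pred > 0.5)

# Fixing the factor levels keeps empty cells in the table as zeros
conf <- table(
  obs = factor(obs, levels = c(0, 1)),
  pred = factor(pred_class, levels = c(0, 1))
)
TN <- conf["0", "0"]
FP <- conf["0", "1"]
FN <- conf["1", "0"]
TP <- conf["1", "1"]

gof <- c(
  accuracy = (TP + TN) / length(obs),
  sensitivity = TP / (TP + FN),
  specificity = TN / (TN + FP)
)
gof
```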

We (re)define a few helper functions to compute calibration metrics (see Section 1.2 in Chapter 1):

  • brier_score() to compute the Brier score.
brier_score <- function(obs, scores) mean((scores - obs)^2)
  • e_calib_error() to compute the Expected Calibration Error (see Section 1.2.1.3 in Chapter 1). This function relies on get_summary_bins() which computes summary statistics for binomial observed data and predicted scores returned by a model.
#' Computes summary statistics for binomial observed data and predicted scores
#' returned by a model
#'
#' @param obs vector of observed events
#' @param scores vector of predicted probabilities
#' @param k number of classes to create (quantiles, default to `10`)
#' @param threshold classification threshold (default to `.5`)
#' @return a tibble where each row corresponds to a bin, with the following columns:
#' - `score_class`: level of the decile that the bin represents
#' - `nb`: number of observations
#' - `mean_obs`: average of obs (proportion of positive events)
#' - `mean_score`: average predicted score (confidence)
#' - `sum_obs`: number of positive events
#' - `accuracy`: accuracy (share of correctly predicted, using the
#'    threshold)
get_summary_bins <- function(obs,
                             scores,
                             k = 10, 
                             threshold = .5) {
  breaks <- quantile(scores, probs = (0:k) / k)
  tb_breaks <- tibble(breaks = breaks, labels = 0:k) |>
    group_by(breaks) |>
    slice_tail(n = 1) |>
    ungroup()
  
  x_with_class <- tibble(
    obs = obs,
    score = scores,
  ) |>
    mutate(
      score_class = cut(
        score,
        breaks = tb_breaks$breaks,
        labels = tb_breaks$labels[-1],
        include.lowest = TRUE
      ),
      pred_class = ifelse(score > threshold, 1, 0),
      correct_pred = obs == pred_class
    )
  
  x_with_class |>
    group_by(score_class) |>
    summarise(
      nb = n(),
      mean_obs = mean(obs),
      mean_score = mean(score), # confidence
      sum_obs = sum(obs),
      accuracy = mean(correct_pred)
    ) |>
    ungroup() |>
    mutate(
      score_class = as.character(score_class) |> as.numeric()
    ) |>
    arrange(score_class)
}


#' Expected Calibration Error
#'
#' @param obs vector of observed events
#' @param scores vector of predicted probabilities
#' @param k number of classes to create (quantiles, default to `10`)
#' @param threshold classification threshold (default to `.5`)
e_calib_error <- function(obs,
                          scores, 
                          k = 10, 
                          threshold = .5) {
  summary_bins <- get_summary_bins(
    obs = obs, scores = scores, k = k, threshold = threshold
  )
  summary_bins |>
    mutate(ece_bin = nb * abs(accuracy - mean_score)) |>
    summarise(ece = 1 / sum(nb) * sum(ece_bin)) |>
    pull(ece)
}
  • qmse_error() to compute Quantile-based MSE (see Section 1.2.1.4 in Chapter 1). This function also relies on get_summary_bins().
#' Quantile-Based MSE
#'
#' @param obs vector of observed events
#' @param scores vector of predicted probabilities
#' @param k number of classes to create (quantiles, default to `10`)
#' @param threshold classification threshold (default to `.5`)
qmse_error <- function(obs,
                       scores, 
                       k = 10, 
                       threshold = .5) {
  summary_bins <- get_summary_bins(
    obs = obs, scores = scores, k = k, threshold = threshold
  )
  summary_bins |>
    mutate(qmse_bin = nb * (mean_obs - mean_score)^2) |>
    summarise(qmse = 1/sum(nb) * sum(qmse_bin)) |>
    pull(qmse)
}
  • weighted_mse() to compute Weighted MSE (see Section 1.2.1.5 in Chapter 1). This function relies on the output of local_ci_scores(), which identifies the nearest neighbors of a given predicted score and then computes the mean observed event in that neighborhood, along with its confidence interval.
#' Confidence interval for the mean observed event in the neighborhood of a score
#'
#' @param obs vector of observed events
#' @param scores vector of predicted probabilities
#' @param tau value at which to compute the confidence interval
#' @param nn fraction of nearest neighbors
#' @param prob level of the confidence interval (default to `.95`)
#' @param method Which method to use to construct the interval. Any combination
#'  of c("exact", "ac", "asymptotic", "wilson", "prop.test", "bayes", "logit",
#'  "cloglog", "probit") is allowed. Default is "probit".
#' @return a tibble with a single row that corresponds to estimations made in
#'   the neighborhood of a probability $p=\tau$, using the fraction `nn` of
#'   neighbors, where the columns are:
#'  - `xlim`: score tau in the neighborhood of which statistics are computed
#'  - `mean`: estimation of $E(d | s(x) = \tau)$
#'  - `lower`: lower bound of the confidence interval
#'  - `upper`: upper bound of the confidence interval
local_ci_scores <- function(obs,
                            scores,
                            tau,
                            nn,
                            prob = .95,
                            method = "probit") {
  
  # Identify the k nearest neighbors based on hat{p}
  k <- round(length(scores) * nn)
  rgs <- rank(abs(scores - tau), ties.method = "first")
  idx <- which(rgs <= k)
  
  # binom.confint() comes from the binom package
  binom::binom.confint(
    x = sum(obs[idx]),
    n = length(idx),
    conf.level = prob,
    methods = method
  )[, c("mean", "lower", "upper")] |>
    tibble() |>
    mutate(xlim = tau) |>
    relocate(xlim, .before = mean)
}

#' Compute the Weighted Mean Squared Error to assess the calibration of a model
#'
#' @param local_scores tibble with expected scores obtained with the 
#'   `local_ci_scores()` function
#' @param scores vector of raw predicted probabilities
weighted_mse <- function(local_scores, scores) {
  # To account for border bias (support is [0,1])
  scores_reflected <- c(-scores, scores, 2 - scores)
  dens <- density(
    x = scores_reflected, from = 0, to = 1, 
    n = length(local_scores$xlim)
  )
  # The weights
  weights <- dens$y
  local_scores |>
    mutate(
      wmse_p = (xlim - mean)^2,
      weight = !!weights
    ) |>
    summarise(wmse = sum(weight * wmse_p) / sum(weight)) |>
    pull(wmse)
}
  • local_calib_score() to compute the calibration score based on local regression (LCS).
#' Calibration score using Local Regression
#' 
#' @param obs vector of observed events
#' @param scores vector of predicted probabilities
local_calib_score <- function(obs, 
                              scores) {
  
  # Add a little noise to the scores, to avoid crashing R
  scores <- scores + rnorm(length(scores), 0, .001)
  locfit_0 <- locfit(
    formula = d ~ lp(scores, nn = 0.15, deg = 0), 
    kern = "rect", maxk = 200, 
    data = tibble(
      d = obs,
      scores = scores
    )
  )
  # Predictions on [0,1]
  linspace_raw <- seq(0, 1, length.out = 100)
  # Restricting this space to the range of observed scores
  keep_linspace <- which(linspace_raw >= min(scores) & linspace_raw <= max(scores))
  linspace <- linspace_raw[keep_linspace]
  
  locfit_0_linspace <- predict(locfit_0, newdata = linspace)
  locfit_0_linspace[locfit_0_linspace > 1] <- 1
  locfit_0_linspace[locfit_0_linspace < 0] <- 0
  
  # Squared difference between predicted value and the bisector, weighted by the density of values
  scores_reflected <- c(-scores, scores, 2 - scores)
  dens <- density(
    x = scores_reflected, from = 0, to = 1, 
    n = length(linspace_raw)
  )
  # The weights
  weights <- dens$y[keep_linspace]
  
  weighted.mean((linspace - locfit_0_linspace)^2, weights)
}
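
Both weighted_mse() and local_calib_score() reflect the scores around 0 and 1 before estimating their density, to limit the bias of kernel estimates at the borders of \([0,1]\). The sketch below illustrates the effect on hypothetical scores drawn from a Beta(5, 1) distribution, whose density equals 5 at 1. The factor 3 only restores the scale (the reflected sample is three times larger); it is irrelevant when the density values are used as normalized weights, as in the functions above.

```r
set.seed(3)
# Hypothetical scores concentrated near the upper bound of [0, 1]
scores <- rbeta(2000, shape1 = 5, shape2 = 1)

# Naive estimate: part of the kernel mass leaks beyond 1
dens_naive <- density(scores, from = 0, to = 1, n = 101)

# Reflected sample, as in the helper functions above
scores_reflected <- c(-scores, scores, 2 - scores)
dens_refl <- density(scores_reflected, from = 0, to = 1, n = 101)

# Estimates at s = 1 (last grid point) against the true density
at_one <- c(
  naive = dens_naive$y[101],
  reflected = 3 * dens_refl$y[101],
  truth = dbeta(1, shape1 = 5, shape2 = 1)
)
at_one
```

In this example the reflected estimate at the border is closer to the true value than the naive one, although both remain smoothed versions of the true density.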

Then, we define the recalibrate() function, which recalibrates a model using the observed events \(d\), the associated predicted probabilities \(p^u\), and a given recalibration technique (as presented above in Section 2.2).

#' Recalibrates scores using a calibration set
#'
#' @param obs_calib vector of observed events in the calibration set
#' @param scores_calib vector of predicted probabilities in the calibration set
#' @param obs_test vector of observed events in the test set
#' @param scores_test vector of predicted probabilities in the test set
#' @param method recalibration method (`"platt"` for Platt-Scaling, 
#'   `"isotonic"` for isotonic regression, `"beta"` for beta calibration, 
#'   `"locfit"` for local regression)
#' @param iso_params list of named parameters to use in the local regression 
#'   (`nn` for fraction of nearest neighbors to use, `deg` for degree)
#' @param linspace vector of values at which to compute the recalibrated scores
#' @returns list of three elements: recalibrated scores on the calibration set,
#'   recalibrated scores on the test set, and recalibrated scores on a segment 
#'   of values
recalibrate <- function(obs_calib,
                        scores_calib,
                        obs_test,
                        scores_test,
                        method = c("platt", "isotonic", "beta", "locfit"),
                        iso_params = NULL,
                        linspace = NULL) {
  
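  # Resolve the method to a single valid value (avoids comparing a vector)
  method <- match.arg(method)
  if (is.null(linspace)) linspace <- seq(0, 1, length.out = 100)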
  
  data_calib <- tibble(d = obs_calib, p_u = scores_calib)
  data_test <- tibble(d = obs_test, p_u = scores_test)
  
  if (method == "platt") {
    lr <- glm(d ~ p_u, family = binomial(link = 'logit'), data = data_calib)
    # Recalibrated scores on calibration and test set
    score_c_calib <- predict(lr, newdata = data_calib, type = "response")
    score_c_test <- predict(lr, newdata = data_test, type = "response")
    # Recalibrated values along a segment
    score_c_linspace <- predict(
      lr, 
      newdata = tibble(p_u = linspace), 
      type = "response"
    )
  } else if (method == "isotonic") {
    iso <- isoreg(x = data_calib$p_u, y = data_calib$d)
    fit_iso <- as.stepfun(iso)
    # Recalibrated scores on calibration and test set
    score_c_calib <- fit_iso(data_calib$p_u)
    score_c_test <- fit_iso(data_test$p_u)
    # Recalibrated values along a segment
    score_c_linspace <- fit_iso(linspace)
  } else if (method == "beta") {
    capture.output({
      bc <- beta_calibration(
        p = data_calib$p_u, 
        y = data_calib$d, 
        parameters = "abm" # 3 parameters a, b & m
      )
    })
    # Recalibrated scores on calibration and test set
    score_c_calib <- beta_predict(p = data_calib$p_u, bc)
    score_c_test <- beta_predict(p = data_test$p_u, bc)
    # Recalibrated values along a segment
    score_c_linspace <- beta_predict(linspace, bc)
  } else if (method == "locfit") {
    # Local regression with the neighborhood size and degree from `iso_params`
    locfit_reg <- locfit(
      formula = d ~ lp(p_u, nn = iso_params$nn, deg = iso_params$deg), 
      kern = "rect", maxk = 200, data = data_calib
    )
    # Recalibrated scores on calibration and test set
    score_c_calib <- predict(locfit_reg, newdata = data_calib)
    score_c_calib[score_c_calib < 0] <- 0
    score_c_calib[score_c_calib > 1] <- 1
    
    score_c_test <- predict(locfit_reg, newdata = data_test)
    score_c_test[score_c_test < 0] <- 0
    score_c_test[score_c_test > 1] <- 1
    
    # Recalibrated values along a segment
    score_c_linspace <- predict(locfit_reg, newdata = linspace)
    score_c_linspace[score_c_linspace < 0] <- 0
    score_c_linspace[score_c_linspace > 1] <- 1
  } else {
    stop(str_c(
      'Wrong method. Use one of the following: ',
      '"platt", "isotonic", "beta", "locfit"'
    ))
  }
  
  # Format results in tibbles:
  # For calibration set
  tb_score_c_calib <- tibble(
    d = obs_calib,
    p_u = scores_calib,
    p_c = score_c_calib
  )
  # For test set
  tb_score_c_test <- tibble(
    d = obs_test,
    p_u = scores_test,
    p_c = score_c_test
  )
  # For linear space
  tb_score_c_linspace <- tibble(
    linspace = linspace,
    p_c = score_c_linspace
  )
  
  list(
    tb_score_c_calib = tb_score_c_calib,
    tb_score_c_test = tb_score_c_test,
    tb_score_c_linspace = tb_score_c_linspace
  )
}
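
In the local regression branch of recalibrate(), predictions falling outside \([0,1]\) are truncated to the nearest bound. As a minimal sketch on hypothetical values, the same truncation can be written in one vectorized step with pmin() and pmax():

```r
# Hypothetical predictions, two of which fall outside [0, 1]
pred <- c(-0.05, 0.2, 0.97, 1.08)

# Truncate to [0, 1]
clamped <- pmin(pmax(pred, 0), 1)
clamped
```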

Let us define a function that computes the different calibration metrics for a single replication of the simulations.

#' Computes the calibration metrics for a set of observed and predicted 
#' probabilities
#' 
#' @param obs observed events
#' @param scores predicted scores
#' @param true_probas true probabilities from the DGP (to compute MSE)
#' @param linspace vector of values at which to compute the WMSE
compute_metrics <- function(obs, 
                            scores, 
                            true_probas,
                            linspace) {
  mse <- mean((true_probas - scores)^2)
  brier <- brier_score(obs = obs, scores = scores)
  if (length(unique(scores)) > 1) {
    ece <- e_calib_error(obs = obs, scores = scores, k = 10, threshold = .5)
    qmse <- qmse_error(obs = obs, scores = scores, k = 10, threshold = .5)
  } else {
    ece <- NA
    qmse <- NA
  }
  
  expected_events <- map(
    .x = linspace,
    .f = ~local_ci_scores(
      obs = obs, 
      scores = scores,
      tau = .x, nn = .15, prob = .95, method = "probit")
  ) |> 
    bind_rows()
  wmse <- weighted_mse(local_scores = expected_events, scores = scores)
  lcs <- local_calib_score(obs = obs, scores = scores)
  
  tibble(
    mse = mse, brier = brier, ece = ece, qmse = qmse, wmse = wmse, lcs = lcs
  )
  
}
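
To make the quantile-binned metrics gathered in compute_metrics() more concrete, the sketch below recomputes the QMSE in base R on hypothetical data, once with calibrated scores and once with scores distorted by an exponent of 0.25. The helper qmse_sketch() is written for this illustration only and is a simplified stand-in for qmse_error(); in particular, it assumes that the quantile breaks are all distinct.

```r
# Quantile-based MSE, simplified (assumes distinct quantile breaks)
qmse_sketch <- function(obs, scores, k = 10) {
  breaks <- quantile(scores, probs = (0:k) / k)
  bins <- cut(scores, breaks = breaks, include.lowest = TRUE)
  nb <- tapply(obs, bins, length)           # bin sizes
  mean_obs <- tapply(obs, bins, mean)       # share of positive events
  mean_score <- tapply(scores, bins, mean)  # average score
  sum(nb * (mean_obs - mean_score)^2) / sum(nb)
}

set.seed(10)
p <- runif(1000)                        # hypothetical true probabilities
d <- rbinom(1000, size = 1, prob = p)   # outcomes drawn from p

qmse_cal <- qmse_sketch(d, p)           # calibrated by construction
qmse_dist <- qmse_sketch(d, p^0.25)     # distorted scores
c(calibrated = qmse_cal, distorted = qmse_dist)
```

The distorted scores yield a larger QMSE, since the average score in each bin moves away from the observed share of positive events.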

Lastly, we define the f_simul() function to perform one simulation.

#' Performs one replication for a simulation
#' 
#' @param i row number of the grid to use for the simulation
#' @param grid grid tibble with the seed number (column `seed`) and the deformations value (either `alpha` or `gamma`)
#' @param n_obs desired number of observation
#' @param type deformation probability type (either `alpha` or `gamma`); the 
#' name should match with the `grid` tibble
#' @param linspace values at which to compute the mean observed event when computing the WMSE
f_simul <- function(i, 
                    grid, 
                    n_obs, 
                    type = c("alpha", "gamma"),
                    linspace = NULL) {
  
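  # Resolve `type` to a single value so that the default vector is not 
  # passed to `if ()`
  type <- match.arg(type)
  if (is.null(linspace)) linspace <- seq(0, 1, length.out = 100)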
  
  ## 1. Generate Data----
  current_seed <- grid$seed[i]
  if (type == "alpha") {
    transform_scale <- grid$alpha[i]
    current_data <- get_samples(
      seed = current_seed, n_obs = n_obs, alpha = transform_scale, gamma = 1
    )
  } else if (type == "gamma") {
    transform_scale <- grid$gamma[i]
    current_data <- get_samples(
      seed = current_seed, n_obs = n_obs, alpha = 1, gamma = transform_scale
    )
  } else {
    stop("Transform type should be either alpha or gamma.")
  }
  
  ## 2. Calibration/Test sets----
  # Datasets with true probabilities
  data_all_calib <- current_data$data_all |>
    slice(current_data$calib_index)
  
  data_all_test <- current_data$data_all |>
    slice(-current_data$calib_index)
  
  ## 3. Recalibration----
  methods <- c("platt", "isotonic", "beta", "locfit", "locfit", "locfit")
  params <- list(
    NULL, NULL, NULL, 
    list(nn = .15, deg = 0), list(nn = .15, deg = 1), list(nn = .15, deg = 2)
  )
  method_names <- c(
    "platt", "isotonic", "beta", "locfit_0", "locfit_1", "locfit_2"
  )
  res_recalibration <- map2(
    .x = methods,
    .y = params,
    .f = ~recalibrate(
      obs_calib = data_all_calib$d, 
      scores_calib = data_all_calib$p_u, 
      obs_test = data_all_test$d, 
      scores_test = data_all_test$p_u,
      method = .x,
      iso_params = .y,
      linspace = linspace
    )
  )
  names(res_recalibration) <- method_names
  
  ## 4. Calibration metrics----
  
  ### Using True Probabilities
  #### Calibration Set
  calib_metrics_true_calib <- compute_metrics(
    obs = data_all_calib$d, 
    scores = data_all_calib$p, 
    true_probas = data_all_calib$p,
    linspace = linspace) |> 
    mutate(method = "True Prob.", sample = "Calibration")
  #### Test Set
  calib_metrics_true_test <- compute_metrics(
    obs = data_all_test$d, 
    scores = data_all_test$p, 
    true_probas = data_all_test$p,
    linspace = linspace) |> 
    mutate(method = "True Prob.", sample = "Test")
  
  ### Without Recalibration
  #### Calibration Set
  calib_metrics_without_calib <- compute_metrics(
    obs = data_all_calib$d, 
    scores = data_all_calib$p_u, 
    true_probas = data_all_calib$p,
    linspace = linspace) |> 
    mutate(method = "No Calibration", sample = "Calibration")
  #### Test Set
  calib_metrics_without_test <- compute_metrics(
    obs = data_all_test$d, 
    scores = data_all_test$p_u, 
    true_probas = data_all_test$p,
    linspace = linspace) |> 
    mutate(method = "No Calibration", sample = "Test")
  
  calib_metrics <- 
    calib_metrics_true_calib |> 
    bind_rows(calib_metrics_true_test) |> 
    bind_rows(calib_metrics_without_calib) |> 
    bind_rows(calib_metrics_without_test)
  
  ### With Recalibration: loop on methods
  for (method in method_names) {
    res_recalibration_current <- res_recalibration[[method]]
    #### Calibration Set
    calib_metrics_recalib_calib <- compute_metrics(
      obs = data_all_calib$d, 
      scores = res_recalibration_current$tb_score_c_calib$p_c, 
      true_probas = data_all_calib$p,
      linspace = linspace) |> 
      mutate(method = method, sample = "Calibration")
    #### Test Set
    calib_metrics_recalib_test <- compute_metrics(
      obs = data_all_test$d, 
      scores = res_recalibration_current$tb_score_c_test$p_c, 
      true_probas = data_all_test$p,
      linspace = linspace) |> 
      mutate(method = method, sample = "Test")
    
    calib_metrics <- 
      calib_metrics |> 
      bind_rows(calib_metrics_recalib_calib) |> 
      bind_rows(calib_metrics_recalib_test)
  }
  
  calib_metrics <- 
    calib_metrics |> 
    mutate(
      seed = current_seed,
      transform_scale = transform_scale,
      type = type
    )
  
  list(
    res_recalibration = res_recalibration,
    linspace = linspace,
    calib_metrics = calib_metrics,
    data_all_calib = data_all_calib,
    data_all_test = data_all_test,
    seed = current_seed
  )
}

2.4 Running the Simulations

Let us now run the simulations. We consider the following values for \(\alpha\) and \(\gamma\):

alphas <- gammas <- c(1/3, 2/3, 1, 3/2, 3)

For each value of \(\alpha\), and then for each value of \(\gamma\), let us draw 200 replicated samples from the same DGP.

n_repl <- 200 # number of replications
n_obs <- 2000 # number of observations to draw
grid_alpha <- expand_grid(alpha = alphas, seed = 1:n_repl)
grid_gamma <- expand_grid(gamma = gammas, seed = 1:n_repl)
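As a quick check of how these grids are organized, the sketch below rebuilds the grid for \(\alpha\) and inspects one row. It only relies on tidyr::expand_grid(), which varies the last column the fastest; the row index 201 is picked purely for illustration.

```r
library(tidyr)

alphas <- c(1/3, 2/3, 1, 3/2, 3)
n_repl <- 200
grid_alpha <- expand_grid(alpha = alphas, seed = 1:n_repl)

# 5 values of alpha x 200 seeds
nrow(grid_alpha)

# Row i fully determines one replication: its seed and its distortion.
# Rows 1 to 200 use the first value of alpha, so row 201 is the first 
# replication for the second value of alpha.
i <- 201
grid_alpha$alpha[i]
grid_alpha$seed[i]
```

This is the row index that f_simul() receives through its argument `i`.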

We perform the simulations for the varying values of \(\alpha\):

library(future)
nb_cores <- future::availableCores()-1
plan(multisession, workers = nb_cores)
progressr::with_progress({
  p <- progressr::progressor(steps = nrow(grid_alpha))
  simul_recalib_alpha <- furrr::future_map(
    .x = 1:nrow(grid_alpha),
    .f = ~{
      p()
      f_simul(
        i = .x, 
        grid = grid_alpha, 
        n_obs = n_obs, 
        type = "alpha", 
        linspace = NULL)
    },
    .options = furrr::furrr_options(seed = FALSE)
  )
})

And we do the same for varying values of \(\gamma\):

library(future)
nb_cores <- future::availableCores()-1
plan(multisession, workers = nb_cores)
progressr::with_progress({
  p <- progressr::progressor(steps = nrow(grid_gamma))
  simul_recalib_gamma <- furrr::future_map(
    .x = 1:nrow(grid_gamma),
    .f = ~{
      p()
      f_simul(
        i = .x, 
        grid = grid_gamma, 
        n_obs = n_obs, 
        type = "gamma", 
        linspace = NULL)
    },
    .options = furrr::furrr_options(seed = FALSE)
  )
})
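Each element of simul_recalib_alpha and simul_recalib_gamma is the list returned by f_simul(). One way to gather the calibration metrics of all replications in a single tibble is sketched below. The object `mock_simuls` is a made-up stand-in for the simulation results, with invented values, so that the example runs on its own.

```r
library(purrr)
library(tibble)

# Mock stand-in for the output of two replications of `f_simul()`;
# the values are invented for illustration only.
mock_simuls <- list(
  list(seed = 1, calib_metrics = tibble(
    method = c("No Calibration", "platt"), sample = "Test",
    mse = c(0.050, 0.002), seed = 1
  )),
  list(seed = 2, calib_metrics = tibble(
    method = c("No Calibration", "platt"), sample = "Test",
    mse = c(0.048, 0.003), seed = 2
  ))
)

# Pull the `calib_metrics` element of each replication and stack the rows
metrics_all <- map(mock_simuls, "calib_metrics") |> list_rbind()
metrics_all
```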

2.5 Standard Metrics on Simulations

We (re)define the function compute_gof_simul() so that it applies compute_gof(), defined above, to the recalibrated probabilities, that is, to probabilities that were first transformed and then recalibrated. It computes the different standard performance metrics (see Section 1.4 in Chapter 1):

#' Computes goodness of fit metrics for a replication
#'
#' @param i row number of the grid to use for the simulation
#' @param grid grid tibble with the seed number (column `seed`) and the deformation value (either `alpha` or `gamma`)
#' @param n_obs desired number of observations
#' @param type deformation probability type (either `alpha` or `gamma`); the 
#' name should match with the `grid` tibble
compute_gof_simul <- function(i,
                              grid,
                              n_obs,
                              type = c("alpha", "gamma")) {
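  # Resolve `type` to a single value so that the default vector is not 
  # passed to `if ()`
  type <- match.arg(type)
  current_seed <- grid$seed[i]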
  if (type == "alpha") {
    transform_scale <- grid$alpha[i]
    current_data <- get_samples(
      seed = current_seed, n_obs = n_obs, alpha = transform_scale, gamma = 1
    )
  } else if (type == "gamma") {
    transform_scale <- grid$gamma[i]
    current_data <- get_samples(
      seed = current_seed, n_obs = n_obs, alpha = 1, gamma = transform_scale
    )
  } else {
    stop("Transform type should be either alpha or gamma.")
  }
  
  
  # Get the calib/test datasets with true probabilities
  data_all_calib <- current_data$data_all |>
    slice(current_data$calib_index)
  
  data_all_test <- current_data$data_all |>
    slice(-current_data$calib_index)
  
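  # Calibration set
  true_prob_calib <- data_all_calib$p # true probabilities from the DGP
  obs_calib <- data_all_calib$d
  pred_calib <- data_all_calib$p_u # uncalibrated scores
  
  # Test set
  true_prob_test <- data_all_test$p # true probabilities from the DGP
  obs_test <- data_all_test$d
  pred_test <- data_all_test$p_u # uncalibrated scores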
  
  # Recalibration
  methods <- c("platt", "isotonic", "beta", "locfit", "locfit", "locfit")
  params <- list(
    NULL, NULL, NULL, 
    list(nn = .15, deg = 0), list(nn = .15, deg = 1), list(nn = .15, deg = 2)
  )
  method_names <- c(
    "platt", "isotonic", "beta", "locfit_0", "locfit_1", "locfit_2"
  )
  res_recalibration <- map2(
    .x = methods,
    .y = params,
    .f = ~recalibrate(
      obs_calib = data_all_calib$d, 
      scores_calib = data_all_calib$p_u, 
      obs_test = data_all_test$d, 
      scores_test = data_all_test$p_u,
      method = .x,
      iso_params = .y,
      linspace = NULL
    )
  )
  names(res_recalibration) <- method_names
  
  # Initialisation
  gof_metrics_simul_calib <- tibble()
  gof_metrics_simul_test <- tibble()
  
  # Calculate standard metrics
  ## With Recalibration: loop on methods
  for (method in method_names) {
    res_recalibration_current <- res_recalibration[[method]]
    ### Computation of metrics on the calibration set
    metrics_simul_calib <- map(
      .x = seq(0, 1, by = .01), # we vary the probability threshold
      .f = ~compute_gof(
        true_prob = true_prob_calib,
        obs = obs_calib,
        #### the predictions are now recalibrated:
        pred = res_recalibration_current$tb_score_c_calib$p_c,
        threshold = .x
      )
    ) |>
      list_rbind()
    
    ### Computation of metrics on the test set
    metrics_simul_test <- map(
      .x = seq(0, 1, by = .01), # we vary the probability threshold
      .f = ~compute_gof(
        true_prob = true_prob_test,
        obs = obs_test,
        #### the predictions are now recalibrated:
        pred = res_recalibration_current$tb_score_c_test$p_c,
        threshold = .x
      )
    ) |>
      list_rbind()
    
    roc_calib <- pROC::roc(
      obs_calib, 
      res_recalibration_current$tb_score_c_calib$p_c
    )
    auc_calib <- as.numeric(pROC::auc(roc_calib))
    
    roc_test <- pROC::roc(
      obs_test, 
      res_recalibration_current$tb_score_c_test$p_c
    )
    auc_test <- as.numeric(pROC::auc(roc_test))
    
    metrics_simul_calib <- metrics_simul_calib |>
      mutate(
        auc = auc_calib,
        seed = current_seed,
        scale_parameter = transform_scale,
        type = type,
        method = method,
        sample = "calibration"
      )
    
    metrics_simul_test <- metrics_simul_test |>
      mutate(
        auc = auc_test,
        seed = current_seed,
        scale_parameter = transform_scale,
        type = type,
        method = method,
        sample = "test"
      )
    
    gof_metrics_simul_calib <- gof_metrics_simul_calib |>
      bind_rows(metrics_simul_calib)
    gof_metrics_simul_test <- gof_metrics_simul_test |>
      bind_rows(metrics_simul_test)
  }
  
  gof_metrics_simul_calib |> 
    bind_rows(gof_metrics_simul_test)
}
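The function compute_gof() comes from Chapter 1 and is not reproduced here. To give a rough idea of what evaluating threshold-based metrics over a grid of thresholds involves, the sketch below computes the accuracy, the sensitivity, and the specificity on a toy sample. The helper `threshold_metrics()` is hypothetical and written only for this illustration; it is not the implementation used in the simulations.

```r
library(purrr)
library(tibble)

# Hypothetical helper (not `compute_gof()`): confusion-matrix metrics 
# obtained by predicting the positive class when pred > threshold.
threshold_metrics <- function(obs, pred, threshold) {
  pred_class <- as.integer(pred > threshold)
  tp <- sum(pred_class == 1 & obs == 1)
  tn <- sum(pred_class == 0 & obs == 0)
  fp <- sum(pred_class == 1 & obs == 0)
  fn <- sum(pred_class == 0 & obs == 1)
  tibble(
    threshold = threshold,
    accuracy = (tp + tn) / length(obs),
    sensitivity = tp / (tp + fn),
    specificity = tn / (tn + fp)
  )
}

# Toy data (illustrative assumption)
set.seed(1)
p <- runif(500)
d <- rbinom(500, size = 1, prob = p)

# One row of metrics per threshold
res <- map(seq(0, 1, by = .1), ~threshold_metrics(d, p, .x)) |> list_rbind()
res
```

With a threshold of 0, every observation is classified as positive, so the sensitivity is 1 and the specificity is 0; the reverse holds with a threshold of 1.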

Let us apply the function compute_gof_simul() to the different simulations. We begin with the recalibrated probabilities that were initially transformed by varying the parameter \(\alpha\).

library(future)
nb_cores <- future::availableCores()-1
plan(multisession, workers = nb_cores)
progressr::with_progress({
  p <- progressr::progressor(steps = nrow(grid_alpha))
  recalib_metrics_alpha <- furrr::future_map(
    .x = 1:nrow(grid_alpha),
    .f = ~{
      p()
      compute_gof_simul(
        i = .x, 
        grid = grid_alpha, 
        n_obs = n_obs, 
        type = "alpha"
      )
    },
    .options = furrr::furrr_options(seed = FALSE)
  )
})

recalib_metrics_alpha <- list_rbind(recalib_metrics_alpha)

We do the same for \(\gamma\):

library(future)
nb_cores <- future::availableCores()-1
plan(multisession, workers = nb_cores)
progressr::with_progress({
  p <- progressr::progressor(steps = nrow(grid_gamma))
  recalib_metrics_gamma <- furrr::future_map(
    .x = 1:nrow(grid_gamma),
    .f = ~{
      p()
      compute_gof_simul(
        i = .x, 
        grid = grid_gamma, 
        n_obs = n_obs, 
        type = "gamma"
      )
    },
    .options = furrr::furrr_options(seed = FALSE)
  )
})

recalib_metrics_gamma <- list_rbind(recalib_metrics_gamma)

We (re)define the function boxplot_simuls_metrics() from Section 1.4 (Chapter 1) to plot the standard metrics computed on the recalibrated simulations. This function produces a panel of plots for a given recalibration method. Each row of the panel corresponds to a metric, whereas each column corresponds to a value of either \(\alpha\) or \(\gamma\). For the boxplots, the x-axis corresponds to the probability threshold \(\tau\) and the y-axis to the values of the metric. For the ROC curves, the x-axis corresponds to the false positive rate and the y-axis to the true positive rate.

#' Boxplots for the simulations to visualize the distribution of some 
#' traditional metrics as a function of the probability threshold.
#' And, ROC curves
#' The resulting figure is a panel of graphs, with varying values for the 
#' transformation applied to the probabilities (in columns) and different 
#' metrics (in rows).
#' 
#' @param tb_metrics tibble with computed metrics for the simulations
#' @param type type of transformation: `"alpha"` or `"gamma"`
#' @param metrics names of the metrics computed
boxplot_simuls_metrics <- function(tb_metrics,
                                   type = c("alpha", "gamma"),
                                   metrics) {
  scale_parameters <- unique(tb_metrics$scale_parameter)
  
  par(mfrow = c(length(metrics), length(scale_parameters)))
  for (i_metric in 1:length(metrics)) {
    metric <- metrics[i_metric]
    for (i_scale_parameter in 1:length(scale_parameters)) {
      scale_parameter <- scale_parameters[i_scale_parameter]
      
      tb_metrics_current <- tb_metrics |> 
        filter(scale_parameter == !!scale_parameter)
      
      if (metric == "roc") {
        seeds <- unique(tb_metrics_current$seed)
        if (i_metric == 1) {
          # first row
          title <- latex2exp::TeX(
            str_c("$\\", type, " = ", round(scale_parameter, 2), "$")
          )
          size_top <- 2.1
        } else if (i_metric == length(metrics)) {
          # Last row
          title <- ""
          size_top <- 1.1
        } else {
          title <- ""
          size_top <- 1.1
        }
        
        if (i_scale_parameter == 1) {
          # first column
          y_lab <- str_c(metric, "\n True Positive Rate") 
          size_left <- 5.1
        } else {
          y_lab <- ""
          size_left <- 4.1
        }
        
        par(mar = c(4.5, size_left, size_top, 2.1))
        plot(
          0:1, 0:1,
          type = "l", col = NULL,
          xlim = 0:1, ylim = 0:1,
          xlab = "False Positive Rate", 
          ylab = y_lab,
          main = ""
        )
        for (i_seed in 1:length(seeds)) {
          tb_metrics_current_seed <- 
            tb_metrics_current |> 
            filter(seed == seeds[i_seed])
          lines(
            x = tb_metrics_current_seed$FPR, 
            y = tb_metrics_current_seed$sensitivity,
            lwd = 2, col = adjustcolor("black", alpha.f = .04)
          )
        }
        segments(0, 0, 1, 1, col = "black", lty = 2)
        
      } else {
        # not ROC
        tb_metrics_current <- 
          tb_metrics_current |> 
          filter(threshold %in% seq(0, 1, by = .1))
        form <- str_c(metric, "~threshold")
        if (i_metric == 1) {
          # first row
          title <- latex2exp::TeX(
            str_c("$\\", type, " = ", round(scale_parameter, 2), "$")
          )
          size_top <- 2.1
        } else if (i_metric == length(metrics)) {
          # Last row
          title <- ""
          size_top <- 1.1
        } else {
          title <- ""
          size_top <- 1.1
        }
        
        if (i_scale_parameter == 1) {
          # first column
          y_lab <- metric
        } else {
          y_lab <- ""
        }
        
        par(mar = c(4.5, 4.1, size_top, 2.1))
        boxplot(
          formula(form), data = tb_metrics_current,
          xlab = "Threshold", ylab = y_lab,
          main = title
        )
      }
    }
  }
}

We aim to create a set of boxplots to visually assess the influence of the probability transformations, governed by \(\alpha\) or \(\gamma\), on standard metrics. Whenever \(\alpha \neq 1\) or \(\gamma \neq 1\), the resulting scores \(p^c\) are akin to those obtained from an initially uncalibrated model to which a recalibration method has been applied. We want to verify that the recalibration methods applied to the uncalibrated scores do not degrade performance, as assessed by standard metrics. The results are shown in Figures 2.7 to 2.12 for varying values of \(\alpha\), and in Figures 2.13 to 2.18 for varying values of \(\gamma\).

Note

When the recalibration map is a strictly increasing function of the scores, the AUC is left unchanged, since it depends only on the ranking of the scores. Isotonic regression fits a non-decreasing step function: it preserves the ordering of the scores but can create ties, so the AUC may change slightly. Isotonic regression assumes that the relationship between the scores and the probability of the event is monotone, i.e., that the initial model, without recalibration, ranks the observations correctly. Therefore, if the initial model requires decreasing transformations in the recalibration step, isotonic regression will not be effective.
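The invariance of the AUC to increasing transformations can be checked numerically. The sketch below uses a rank-based formula for the AUC (the Mann-Whitney statistic) rather than {pROC}, so that it runs with base R only. The helper `auc_rank()` and the toy data are assumptions made for this illustration.

```r
# Rank-based AUC (Mann-Whitney statistic); `auc_rank()` is a hypothetical
# helper written for this illustration.
auc_rank <- function(obs, scores) {
  r <- rank(scores) # average ranks in case of ties
  n1 <- sum(obs == 1)
  n0 <- sum(obs == 0)
  (sum(r[obs == 1]) - n1 * (n1 + 1) / 2) / (n1 * n0)
}

# Toy data (illustrative assumption)
set.seed(42)
p <- runif(1000)
d <- rbinom(1000, size = 1, prob = p)

auc_raw <- auc_rank(d, p)
auc_increasing <- auc_rank(d, p^3)   # strictly increasing map on [0, 1]
auc_decreasing <- auc_rank(d, 1 - p) # decreasing map reverses the ranking

c(raw = auc_raw, increasing = auc_increasing, decreasing = auc_decreasing)
```

The first two values coincide, whereas reversing the ranking turns an AUC of \(a\) into \(1 - a\).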

metrics <- c("mse", "accuracy", "sensitivity", "specificity", "roc", "auc")
methods <- c("platt", "isotonic", "beta", "locfit", "locfit", "locfit")
Display the R codes to produce the Figure.
current_recalib_metrics <- recalib_metrics_alpha |> filter(method == "platt")
boxplot_simuls_metrics(
  tb_metrics = current_recalib_metrics, 
  type = "alpha", metrics = metrics
)
Figure 2.7: Calibration transformations made by varying \(\alpha\) and impact on standard metrics. The model is calibrated when \(\alpha=1\). The scores have been recalibrated using Platt-scaling.

Display the R codes to produce the Figure.
current_recalib_metrics <- recalib_metrics_alpha |> filter(method == "isotonic")
boxplot_simuls_metrics(
  tb_metrics = current_recalib_metrics, 
  type = "alpha", metrics = metrics
)
Figure 2.8: Calibration transformations made by varying \(\alpha\) and impact on standard metrics. The model is calibrated when \(\alpha=1\). The scores have been recalibrated using Isotonic regression.

Display the R codes to produce the Figure.
current_recalib_metrics <- recalib_metrics_alpha |> filter(method == "beta")
boxplot_simuls_metrics(
  tb_metrics = current_recalib_metrics, 
  type = "alpha", metrics = metrics
)
Figure 2.9: Calibration transformations made by varying \(\alpha\) and impact on standard metrics. The model is calibrated when \(\alpha=1\). The scores have been recalibrated using Beta calibration.

Display the R codes to produce the Figure.
current_recalib_metrics <- recalib_metrics_alpha |> filter(method == "locfit_0")
boxplot_simuls_metrics(
  tb_metrics = current_recalib_metrics, 
  type = "alpha", metrics = metrics
)
Figure 2.10: Calibration transformations made by varying \(\alpha\) and impact on standard metrics. The model is calibrated when \(\alpha=1\). The scores have been recalibrated using local regression (with deg=0).

Display the R codes to produce the Figure.
current_recalib_metrics <- recalib_metrics_alpha |> filter(method == "locfit_1")
boxplot_simuls_metrics(
  tb_metrics = current_recalib_metrics, 
  type = "alpha", metrics = metrics
)
Figure 2.11: Calibration transformations made by varying \(\alpha\) and impact on standard metrics. The model is calibrated when \(\alpha=1\). The scores have been recalibrated using local regression (with deg=1).

Display the R codes to produce the Figure.
current_recalib_metrics <- recalib_metrics_alpha |> filter(method == "locfit_2")
boxplot_simuls_metrics(
  tb_metrics = current_recalib_metrics, 
  type = "alpha", metrics = metrics
)
Figure 2.12: Calibration transformations made by varying \(\alpha\) and impact on standard metrics. The model is calibrated when \(\alpha=1\). The scores have been recalibrated using local regression (with deg=2).

Display the R codes to produce the Figure.
current_recalib_metrics <- recalib_metrics_gamma |> filter(method == "platt")
boxplot_simuls_metrics(
  tb_metrics = current_recalib_metrics, 
  type = "gamma", metrics = metrics
)
Figure 2.13: Calibration transformations made by varying \(\gamma\) and impact on standard metrics. The model is calibrated when \(\gamma=1\). The scores have been recalibrated using Platt-scaling.

Display the R codes to produce the Figure.
current_recalib_metrics <- recalib_metrics_gamma |> filter(method == "isotonic")
boxplot_simuls_metrics(
  tb_metrics = current_recalib_metrics, 
  type = "gamma", metrics = metrics
)
Figure 2.14: Calibration transformations made by varying \(\gamma\) and impact on standard metrics. The model is calibrated when \(\gamma=1\). The scores have been recalibrated using Isotonic regression.

Display the R codes to produce the Figure.
current_recalib_metrics <- recalib_metrics_gamma |> filter(method == "beta")
boxplot_simuls_metrics(
  tb_metrics = current_recalib_metrics, 
  type = "gamma", metrics = metrics
)
Figure 2.15: Calibration transformations made by varying \(\gamma\) and impact on standard metrics. The model is calibrated when \(\gamma=1\). The scores have been recalibrated using Beta calibration.

Display the R codes to produce the Figure.
current_recalib_metrics <- recalib_metrics_gamma |> filter(method == "locfit_0")
boxplot_simuls_metrics(
  tb_metrics = current_recalib_metrics, 
  type = "gamma", metrics = metrics
)
Figure 2.16: Calibration transformations made by varying \(\gamma\) and impact on standard metrics. The model is calibrated when \(\gamma=1\). The scores have been recalibrated using local regression (with deg=0).

Display the R codes to produce the Figure.
current_recalib_metrics <- recalib_metrics_gamma |> filter(method == "locfit_1")
boxplot_simuls_metrics(
  tb_metrics = current_recalib_metrics, 
  type = "gamma", metrics = metrics
)
Figure 2.17: Calibration transformations made by varying \(\gamma\) and impact on standard metrics. The model is calibrated when \(\gamma=1\). The scores have been recalibrated using local regression (with deg=1).

Display the R codes to produce the Figure.
current_recalib_metrics <- recalib_metrics_gamma |> filter(method == "locfit_2")
boxplot_simuls_metrics(
  tb_metrics = current_recalib_metrics, 
  type = "gamma", metrics = metrics
)
Figure 2.18: Calibration transformations made by varying \(\gamma\) and impact on standard metrics. The model is calibrated when \(\gamma=1\). The scores have been recalibrated using local regression (with deg=2).

We can focus on the transformations that have degraded performance. For that purpose, we load the standard metrics computed on the uncalibrated probabilities:

2.6 Calibration Maps (single replication)

We can visualize the calibration curve as in Section 1.6 (Chapter 1).

2.6.1 Quantile-Based Bins

We first visualize calibration in a fashion similar to what is done with the calibration_curve() function from scikit-learn.

The x-axis of the calibration plot reports the mean predicted probabilities computed on different bins, where the bins are defined using the deciles of the predicted scores. On the y-axis, the corresponding fraction of positive events (\(d=1\)) is reported.
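The binning step can be sketched as follows on toy data. The summary of the bins used in this chapter is computed by get_summary_bins(), defined in Chapter 1 and not reproduced here; the code below is only an approximation of that idea, and the toy scores are an assumption made for this illustration.

```r
library(dplyr)
library(tibble)

# Toy data (illustrative assumption)
set.seed(1)
scores <- runif(1000)
obs <- rbinom(1000, size = 1, prob = scores)

# Bin edges at the deciles of the scores
breaks <- quantile(scores, probs = seq(0, 1, by = .1))

bins <- tibble(obs = obs, scores = scores) |>
  mutate(score_class = cut(scores, breaks = breaks, include.lowest = TRUE)) |>
  group_by(score_class) |>
  summarise(
    nb = n(),
    mean_score = mean(scores), # x-axis: mean predicted probability
    mean_obs = mean(obs)       # y-axis: fraction of positive events
  )
bins
```

For a well-calibrated model, the points (`mean_score`, `mean_obs`) lie close to the 45-degree line.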

We can accompany the predictions made for each bin with a confidence interval, using the binom.confint() function from {binom}.

library(binom)
#' Confidence interval for binomial data, using quantile-defined bins
#' 
#' @param obs vector of observed events
#' @param scores vector of predicted probabilities
#' @param k number of bins to create (quantiles, default to `10`)
#' @param prob confidence interval level
#' @param method Which method to use to construct the interval. Any combination 
#'  of c("exact", "ac", "asymptotic", "wilson", "prop.test", "bayes", "logit", 
#'  "cloglog", "probit") is allowed. Default is "all".
#' @return a tibble with the following columns, where each row corresponds to
#'   a bin:
#' - `mean`: estimation of $E(d | s(x) = p)$ where $p$ is the average score in bin b
#' - `lower`: lower bound of the confidence interval
#' - `upper`: upper bound of the confidence interval
#' - `prediction`: average of `s(x)` in bin b
#' - `score_class`: decile level of bin b
#' - `nb`: number of observations in bin b
ci_scores_bins <- function(obs,
                           scores,
                           k,
                           prob = .95, 
                           method = "probit" ) {
  
  summary_bins_calib <- get_summary_bins(obs = obs, scores = scores, k = k)
  
  new_k <- nrow(summary_bins_calib)
  prob_ic <- tibble(
    mean = rep(NA, new_k),
    lower = rep(NA, new_k),
    upper = rep(NA, new_k),
    prediction = summary_bins_calib |> pull("mean_score"),
    score_class = summary_bins_calib$score_class,
    nb = summary_bins_calib$nb
  )
  for (i in 1:new_k) {
    prob_ic[i, 1:3] <- binom.confint(
      x = summary_bins_calib$sum_obs[i],
      n = summary_bins_calib$nb[i], 
      conf.level = prob,
      methods = method
    )[, c("mean", "lower", "upper")]
  }
  
  prob_ic
}

Let us define here a function to compute the confidence intervals for a single replication of our simulations.

#' Compute confidence intervals for the calibration values of a single 
#' replication, for a given method
#' 
#' @param simul single simulation obtained with `f_simul()`
#' @param method recalibration method
conf_int_qbins_simul <- function(simul, method) {
  
  obs_calib <- simul$data_all_calib$d
  obs_test <- simul$data_all_test$d
  
  if (method == "True Prob.") {
    scores_calib <- simul$data_all_calib$p
    scores_test <- simul$data_all_test$p
  } else if (method == "No Calibration") {
    scores_calib <- simul$data_all_calib$p_u
    scores_test <- simul$data_all_test$p_u
  } else {
    tb_score_c_calib <- simul$res_recalibration[[method]]$tb_score_c_calib
    tb_score_c_test <- simul$res_recalibration[[method]]$tb_score_c_test
    scores_calib <- tb_score_c_calib$p_c
    scores_test <- tb_score_c_test$p_c
  }
  
  e_scores_bins_calib <- ci_scores_bins(
    obs = obs_calib, 
    scores = scores_calib,
    k = 10, prob = .95, method = "probit"
  )
  e_scores_bins_test <- ci_scores_bins(
    obs = obs_test, 
    scores = scores_test,
    k = 10, prob = .95, method = "probit"
  )
  
  e_scores_bins <- e_scores_bins_calib |> 
    mutate(sample = "Calibration") |> 
    bind_rows(
      e_scores_bins_test |> mutate(sample = "Test")
    ) |> 
    mutate(
      seed = simul$seed, 
      method = method
    )
  
  list(
    e_scores_bins = e_scores_bins,
    obs_calib = obs_calib,
    obs_test = obs_test,
    scores_calib = scores_calib,
    scores_test = scores_test,
    seed = simul$seed,
    method = method
  )
}

We define a function, get_data_plot_quant_simul(), to extract a desired simulation from our results (either from simul_recalib_alpha or from simul_recalib_gamma). It returns a list with two elements:

  1. ci_res: the confidence intervals for the calibration curve for the simulation
  2. n_bins_scores: the number of observations in each bin defined over [0,1] for the scores (uncalibrated or calibrated, for both the calibration set and the test set).

#' Extracts the data needed to plot the calibration curve of a single 
#' simulation
#' 
#' @param i index of the simulation to use (in `simul_recalib_alpha` or 
#'   `simul_recalib_gamma`)
#' @param type type of transformed probabilities (made on `alpha` or `gamma`)
#' @param method name of the recalibration method to focus on
get_data_plot_quant_simul <- function(i, type, method) {
  if (type == "alpha") {
    simul <- simul_recalib_alpha[[i]]
    transform_scale <- grid_alpha$alpha[i]
  } else if (type == "gamma") {
    simul <- simul_recalib_gamma[[i]]
    transform_scale <- grid_gamma$gamma[i]
  } else {
    stop("Wrong value for argument `type`.")
  }
  
  # Counting number of obs in bins defined over [0,1]
  breaks <- seq(0, 1, by = .05)
  if (method == "True Prob.") {
    scores_calib <- simul$data_all_calib$p
    scores_test <- simul$data_all_test$p
    scores_c_calib <- scores_c_test <- NULL
  } else if (method == "No Calibration") {
    scores_calib <- simul$data_all_calib$p_u
    scores_test <- simul$data_all_test$p_u
    scores_c_calib <- scores_c_test <- NULL
  } else {
    tb_score_c_calib <- simul$res_recalibration[[method]]$tb_score_c_calib
    tb_score_c_test <- simul$res_recalibration[[method]]$tb_score_c_test
    scores_calib <- tb_score_c_calib$p_u
    scores_test <- tb_score_c_test$p_u
    scores_c_calib <- tb_score_c_calib$p_c
    scores_c_test <- tb_score_c_test$p_c
  }
  
  n_bins_calib <- table(cut(scores_calib, breaks = breaks))
  n_bins_test <- table(cut(scores_test, breaks = breaks))
  if (!is.null(scores_c_calib)) {
    n_bins_c_calib <- table(cut(scores_c_calib, breaks = breaks))
  } else {
    n_bins_c_calib <- NA_integer_
  }
  if (!is.null(scores_c_test)) {
    n_bins_c_test <- table(cut(scores_c_test, breaks = breaks))
  } else {
    n_bins_c_test <- NA_integer_
  }

  n_bins_scores <- tibble(
    bins = names(table(cut(breaks, breaks = breaks))),
    n_bins_calib = as.vector(n_bins_calib),
    n_bins_test = as.vector(n_bins_test),
    n_bins_c_calib = as.vector(n_bins_c_calib),
    n_bins_c_test = as.vector(n_bins_c_test),
    method = method,
    seed = simul$seed,
    type = type
  )
  
  # Confidence intervals
  ci_res <- conf_int_qbins_simul(simul = simul, method = method)
  list(ci_res = ci_res, n_bins_scores = n_bins_scores)
}
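The counting step relies on cut() and table(). The short sketch below, run on toy scores (an assumption made for this illustration), shows what these two calls return. It also shows that, with the default arguments, a score exactly equal to 0 is not assigned to any bin.

```r
# Toy scores (illustrative assumption)
set.seed(1)
scores <- runif(200)
breaks <- seq(0, 1, by = .05)

# `cut()` assigns each score to an interval (a, b]; `table()` counts them
n_bins <- table(cut(scores, breaks = breaks))
length(n_bins) # number of bins of width 0.05
sum(n_bins)    # number of scores that fall in one of the bins

# A score exactly equal to 0 falls outside (0, 0.05] and becomes NA,
# unless `include.lowest = TRUE` is set
cut(0, breaks = breaks)
cut(0, breaks = breaks, include.lowest = TRUE)
```

This may matter for recalibrated scores, which can be exactly 0 with some methods; such observations would then be left out of the counts.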

Now, we can define a function that will plot the calibration maps computed on the calibration set and those computed on the test set. This function will plot a panel of calibration maps, each row corresponding to a specific value of the scale used to transform the probabilities (\(\alpha\) or \(\gamma\)). On top of each graph, we plot the histogram of uncalibrated scores and of calibrated scores.

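#' Plots the calibration curves (quantile-defined bins) obtained with a 
#' recalibration method, on the calibration set and on the test set
#' 
#' @param method name of the recalibration method to focus on
#' @param type type of transformed probabilities (made on `alpha` or `gamma`)
#' @details relies on the global tibble `grid` (columns `i`, `type`, `method`)
plot_conf_int_qbins_simul <- function(method, type) {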
  current_grid <- grid |> 
    filter(method == !!method, type == !!type)
  
  if (type == "alpha") {
    transform_scale <- grid_alpha |> slice(current_grid$i) |> pull(alpha)
  } else if (type == "gamma") {
    transform_scale <- grid_gamma |> slice(current_grid$i) |> pull(gamma)
  } else {
    stop("Error argument `type`. Wrong value: either \"alpha\" or \"gamma\"")
  }
  
  data_plots <-  map(
    .x = current_grid$i,
    .f = ~get_data_plot_quant_simul(i = .x, type = type, method = method)
  )
  data_plots_ci <- map(data_plots, pluck("ci_res"))
  data_plots_n_bins_scores <- map(data_plots, pluck("n_bins_scores"))
  
  
  nb <- length(data_plots)
  mat <- mat_init <- matrix(c(1:4), ncol = 2)
  for (j in seq_len(nb - 1)) {
    mat <- rbind(mat, mat_init + j * 4)
  }
  layout(mat, heights = rep(c(1, 3), nb))
  
  y_lim <- c(0, 1)
  
  
  # i_recalib <- 1
  for (i_recalib in 1:nb) {
    data_plot_ci <- data_plots_ci[[i_recalib]]
    data_plot_n_bins_scores <- data_plots_n_bins_scores[[i_recalib]]
    
    obs_calib <- data_plot_ci$obs_calib
    scores_calib <- data_plot_ci$scores_calib
    obs_test <- data_plot_ci$obs_test
    scores_test <- data_plot_ci$scores_test
    ci <- data_plot_ci$e_scores_bins
    
    title <- str_c("$\\", type, "=", round(transform_scale[i_recalib], 2), "$")
    
    for (sample in c("Calibration", "Test")) {
      # Histogram with values
      df_plot <- ci |> filter(sample == !!sample)
      
      if (sample == "Calibration"){
        obs_current <- obs_calib
        scores_current <- scores_calib
        colour <- "#D55E00"
        n_bins_current <- data_plot_n_bins_scores$n_bins_calib
        n_bins_c_current <- data_plot_n_bins_scores$n_bins_c_calib
      } else {
        obs_current <- obs_test
        scores_current <- scores_test
        colour <- "#009E73"
        n_bins_current <- data_plot_n_bins_scores$n_bins_test
        n_bins_c_current <- data_plot_n_bins_scores$n_bins_c_test
      }
      
      # heights <- rbind(n_bins_current, n_bins_c_current)
      par(mar = c(0.5, 4.3, 1.0, 0.5))
      y_lim_bp <- range(c(n_bins_current, n_bins_c_current), na.rm = TRUE)
      barplot(
        n_bins_current,
        col = adjustcolor("#000000", alpha.f = .3),
        ylim = y_lim_bp,
        border = "white",
        axes = FALSE,
        xlab = "", ylab = "", 
        main = latex2exp::TeX(title)
      )
      barplot(
        n_bins_c_current,
        col = adjustcolor("#0072B2", alpha.f = .3),
        ylim = y_lim_bp,
        border = "white",
        axes = FALSE,
        xlab = "", ylab = "", main = "",
        add = TRUE
      )
      
      par(mar = c(4.1, 4.3, 0.5, 0.5))
      plot(
        df_plot$prediction, df_plot$mean,
        pch = 19, ylim = y_lim, xlim = 0:1,
        xlab = latex2exp::TeX("Predicted score $p$"), 
        ylab = latex2exp::TeX("$E(d | s(x) = p)$"),
        col = colour
      )
      arrows(
        x0 = df_plot$prediction, y0 = df_plot$lower, 
        x1 = df_plot$prediction, y1 = df_plot$upper,
        angle = 90, length = .05, code = 3,
        col = colour
      )
      segments(0, 0, 1, 1, col = "black", lty = 2)
    }
    
  }
}
Display the code used to plot the calibration curves.
methods <- c(
  "True Prob.", "No Calibration", "platt", "isotonic", "beta", 
  "locfit_0", "locfit_1", "locfit_2"
)

i_alpha <- grid_alpha |> mutate(i = row_number()) |> group_by(alpha) |> slice(1) |> 
  pull(i)
i_gamma <- grid_gamma |> mutate(i = row_number()) |> group_by(gamma) |> slice(1) |> 
  pull(i)

tab <- tibble(i = i_alpha, type = "alpha") |> 
  bind_rows(
    tibble(i = i_gamma, type = "gamma")
  ) |> 
  mutate(id = str_c(i, "_", type))

grid <- expand_grid(id = tab$id, method = methods) |> 
  separate(id, into = c("i", "type"), sep = "_") |> 
  mutate(i = as.numeric(i))

method_names <- tribble(
  ~method, ~method_lab,
  "True Prob.", "True Prob.",
  "No Calibration", "No Calibration", 
  "platt", "Platt Scaling",
  "isotonic", "Isotonic Reg.",
  "beta", "Beta Calib.",
  "locfit_0", "Local Reg. (deg = 0)", 
  "locfit_1", "Local Reg. (deg = 1)",
  "locfit_2", "Local Reg. (deg = 2)"
)

In the figures below, for the tabs True Prob. and No Calibration, the plots show the calibration curves obtained using the true probabilities and the uncalibrated scores, respectively, instead of recalibrated scores. We do this for comparison purposes.

Calibration curves obtained using quantile-defined bins on the recalibrated scores for the calibration set (left) and for the test set (right) for varying values of \(\alpha\) for a single replication of the simulations. The histograms on top of each graph show the distribution of the uncalibrated scores and that of the calibrated scores.


Calibration curves obtained using quantile-defined bins on the recalibrated scores for the calibration set (left) and for the test set (right) for varying values of \(\gamma\) for a single replication of the simulations. The histograms on top of each graph show the distribution of the uncalibrated scores and that of the calibrated scores.


2.6.2 Calibration Curve with Moving Average

The calibration curves will be computed using the local_ci_scores() function (defined in Section 2.3.1) and accompanied by a confidence interval obtained using the binom.confint() function from {binom}.
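
As a rough illustration of the idea, the sketch below computes a local average of the observed outcomes around a point \(\tau\), together with a confidence interval. It is not the local_ci_scores() function from Section 2.3.1: the helper name local_mean_sketch() is hypothetical, the neighbourhood is assumed to be the fraction nn of observations whose scores are closest to \(\tau\), and a Wilson interval from base R's prop.test() stands in for binom.confint().

```r
# Hypothetical helper (not the chapter's local_ci_scores()):
# local mean of `obs` around `tau`, using the fraction `nn` of observations
# whose scores are closest to `tau`, with a Wilson confidence interval.
local_mean_sketch <- function(obs, scores, tau, nn = 0.15, prob = 0.95) {
  n_keep <- max(1L, ceiling(nn * length(scores)))
  # Indices of the n_keep scores closest to tau
  idx <- order(abs(scores - tau))[seq_len(n_keep)]
  n_events <- sum(obs[idx])
  # prop.test() without continuity correction returns the Wilson interval
  ci <- prop.test(
    x = n_events, n = n_keep, conf.level = prob, correct = FALSE
  )$conf.int
  data.frame(
    xlim = tau,
    mean = n_events / n_keep,
    lower = ci[1],
    upper = ci[2]
  )
}

# Toy data (assumption): calibrated scores, so the curve should be near the
# 45-degree line
set.seed(1)
scores <- runif(500)
obs <- rbinom(500, size = 1, prob = scores)
res <- local_mean_sketch(obs = obs, scores = scores, tau = 0.5)
res
```

Evaluating such a helper over a grid of values of \(\tau\) gives a curve with a confidence band, which is the kind of output plotted later in this section.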

Let us first focus on a single replication for which we can plot the calibration curve with its confidence interval.

calibration_curve_ma_simul <- function(simul, 
                                       method, 
                                       nn = .15,
                                       prob = .95, 
                                       ci_method = "probit") {
  
  
  
  obs_calib <- simul$data_all_calib$d
  obs_test <- simul$data_all_test$d
  
  if (method == "True Prob.") {
    scores_calib <- simul$data_all_calib$p
    scores_test <- simul$data_all_test$p
  } else if (method == "No Calibration") {
    scores_calib <- simul$data_all_calib$p_u
    scores_test <- simul$data_all_test$p_u
  } else {
    tb_score_c_calib <- simul$res_recalibration[[method]]$tb_score_c_calib
    tb_score_c_test <- simul$res_recalibration[[method]]$tb_score_c_test
    scores_calib <- tb_score_c_calib$p_c
    scores_test <- tb_score_c_test$p_c
  }
  
  linspace_raw <- seq(0, 1, length.out = 100)

  keep_linspace_calib <- which(
    linspace_raw >= min(scores_calib) & linspace_raw <= max(scores_calib)
  )
  linspace_calib <- linspace_raw[keep_linspace_calib]
  
  calib_curve_calib <- map(
      .x = linspace_calib,
      .f = ~local_ci_scores(
        obs = obs_calib,
        scores = scores_calib,
        tau = .x, 
        nn = nn, prob = prob, method = ci_method)
    ) |> 
    list_rbind() |> 
    mutate(sample = "Calibration")
  
  keep_linspace_test <- which(
    linspace_raw >= min(scores_test) & linspace_raw <= max(scores_test)
  )
  linspace_test <- linspace_raw[keep_linspace_test]
  
  calib_curve_test <- map(
    .x = linspace_test,
    .f = ~local_ci_scores(
      obs = obs_test,
      scores = scores_test,
      tau = .x, 
      nn = nn, prob = prob, method = ci_method)
  ) |> 
    list_rbind() |> 
    mutate(sample = "Test")
  
  tb_calibration_curve_ma <- 
    calib_curve_calib |> 
    bind_rows(calib_curve_test) |> 
    mutate(
      method = method,
      seed = simul$seed
    )
   
  tb_calibration_curve_ma
}

For convenience, we create a function, get_data_plot_calib_ma_simul(), that returns two elements:

  1. tb_ci: confidence intervals associated with the calibration curve for a single replication
  2. n_bins_scores: the count of observations in each bin defined over the [0,1] segment for the scores (uncalibrated and calibrated, for both the calibration set and the test set).
#' @param i index of the simulation to use (in `simul_recalib_alpha` or 
#'   `simul_recalib_gamma`)
#' @param type type of transformed probabilities (made on `alpha` or `gamma`)
#' @param method name of the recalibration method to focus on
get_data_plot_calib_ma_simul <- function(i, type, method) {
  if (type == "alpha") {
    simul <- simul_recalib_alpha[[i]]
    transform_scale <- grid_alpha$alpha[i]
  } else if (type == "gamma") {
    simul <- simul_recalib_gamma[[i]]
    transform_scale <- grid_gamma$gamma[i]
  } else {
    stop("Wrong value for argument `type`.")
  }
  
  # Counting number of obs in bins defined over [0,1]
  breaks <- seq(0, 1, by = .05)
  if (method == "True Prob.") {
    scores_calib <- simul$data_all_calib$p
    scores_test <- simul$data_all_test$p
    scores_c_calib <- scores_c_test <- NULL
  } else if (method == "No Calibration") {
    scores_calib <- simul$data_all_calib$p_u
    scores_test <- simul$data_all_test$p_u
    scores_c_calib <- scores_c_test <- NULL
  } else {
    tb_score_c_calib <- simul$res_recalibration[[method]]$tb_score_c_calib
    tb_score_c_test <- simul$res_recalibration[[method]]$tb_score_c_test
    scores_calib <- tb_score_c_calib$p_u
    scores_test <- tb_score_c_test$p_u
    scores_c_calib <- tb_score_c_calib$p_c
    scores_c_test <- tb_score_c_test$p_c
  }
  
  n_bins_calib <- table(cut(scores_calib, breaks = breaks))
  n_bins_test <- table(cut(scores_test, breaks = breaks))
  if (!is.null(scores_c_calib)) {
    n_bins_c_calib <- table(cut(scores_c_calib, breaks = breaks))
  } else {
    n_bins_c_calib <- NA_integer_
  }
  if (!is.null(scores_c_test)) {
    n_bins_c_test <- table(cut(scores_c_test, breaks = breaks))
  } else {
    n_bins_c_test <- NA_integer_
  }
  
  n_bins_scores <- tibble(
    bins = names(table(cut(breaks, breaks = breaks))),
    n_bins_calib = as.vector(n_bins_calib),
    n_bins_test = as.vector(n_bins_test),
    n_bins_c_calib = as.vector(n_bins_c_calib),
    n_bins_c_test = as.vector(n_bins_c_test),
    method = method,
    seed = simul$seed,
    type = type
  )
  
  # Confidence intervals
  tb_ci <- calibration_curve_ma_simul(
    simul = simul, 
    method = method, 
    nn = .15, prob = .95,  ci_method = "probit"
  )
  list(
    tb_ci = tb_ci,
    n_bins_scores = n_bins_scores
  )
}
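
The bin counts in the function above come from table(cut(...)) with breaks every 0.05. A minimal sketch of that step on made-up scores (the values are assumptions chosen for illustration) shows what the counts look like:

```r
# Same breaks as in the function above: 20 bins of width 0.05 over [0, 1]
breaks <- seq(0, 1, by = 0.05)

# Made-up scores, for illustration only
scores <- c(0.01, 0.03, 0.26, 0.27, 0.99)

# cut() assigns each score to an interval (left-open, right-closed by default);
# table() then counts the scores in each interval, empty intervals included
counts <- table(cut(scores, breaks = breaks))
counts

# With the default arguments, a score exactly equal to 0 falls in no interval
zero_bin <- cut(0, breaks = breaks)
is.na(zero_bin)
```

Because the intervals are left-open by default, a score of exactly 0 would be dropped from the counts; passing include.lowest = TRUE to cut() is one way to keep it, if that matters for the data at hand.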

We define a function that will plot the calibration maps computed on the calibration set and those computed on the test set. This function will plot a panel of calibration maps, each row corresponding to a specific value of the scale used to transform the probabilities (\(\alpha\) or \(\gamma\)).

methods <- c(
  "True Prob.", "No Calibration", "platt", "isotonic", "beta", 
  "locfit_0", "locfit_1", "locfit_2"
)

i_alpha <- grid_alpha |> mutate(i = row_number()) |> group_by(alpha) |> slice(1) |> 
  pull(i)
i_gamma <- grid_gamma |> mutate(i = row_number()) |> group_by(gamma) |> slice(1) |> 
  pull(i)

tab <- tibble(i = i_alpha, type = "alpha") |> 
  bind_rows(
    tibble(i = i_gamma, type = "gamma")
  ) |> 
  mutate(id = str_c(i, "_", type))

grid <- expand_grid(id = tab$id, method = methods) |> 
  separate(id, into = c("i", "type"), sep = "_") |> 
  mutate(i = as.numeric(i))

method_names <- tribble(
  ~method, ~method_lab,
  "True Prob.", "True Prob.",
  "No Calibration", "No Calibration", 
  "platt", "Platt Scaling",
  "isotonic", "Isotonic Reg.",
  "beta", "Beta Calib.",
  "locfit_0", "Local Reg. (deg = 0)", 
  "locfit_1", "Local Reg. (deg = 1)",
  "locfit_2", "Local Reg (deg = 2)"
)

The function:

plot_conf_int_ma_simul <- function(method, type) {
  current_grid <- grid |> 
    filter(method == !!method, type == !!type)
  
  if (type == "alpha") {
    transform_scale <- grid_alpha |> slice(current_grid$i) |> pull(alpha)
  } else if (type == "gamma") {
    transform_scale <- grid_gamma |> slice(current_grid$i) |> pull(gamma)
  } else {
    stop("Error argument `type`. Wrong value: either \"alpha\" or \"gamma\"")
  }
  
  data_plots <-  map(
    .x = current_grid$i,
    .f = ~get_data_plot_calib_ma_simul(i = .x, type = type, method = method)
  )
  
  tb_cis <- map(data_plots, pluck("tb_ci"))
  n_bins_scores <- map(data_plots, pluck("n_bins_scores"))
  
  nb <- length(data_plots)
  mat <- mat_init <- matrix(c(1:4), ncol = 2)
  for (j in seq_len(nb - 1)) {
    mat <- rbind(mat, mat_init + j * 4)
  }
  layout(mat, heights = rep(c(1, 3), nb))
  
  y_lim <- c(0, 1)
  
  # i_recalib <- 1
  for (i_recalib in 1:nb) {
    tb_ci_current <- tb_cis[[i_recalib]]
    n_bins_scores_current <- n_bins_scores[[i_recalib]]
    
    title <- str_c("$\\", type, "=", round(transform_scale[i_recalib], 2), "$")
    
    for (sample in c("Calibration", "Test")) {
      # Histogram with values
      tb_plot <- tb_ci_current |> filter(sample == !!sample)
      
      if (sample == "Calibration"){
        colour <- "#D55E00"
        n_bins_current <- n_bins_scores_current$n_bins_calib
        n_bins_c_current <- n_bins_scores_current$n_bins_c_calib
      } else {
        colour <- "#009E73"
        n_bins_current <- n_bins_scores_current$n_bins_test
        n_bins_c_current <- n_bins_scores_current$n_bins_c_test
      }
      
      par(mar = c(0.5, 4.3, 1.0, 0.5))
      y_lim_bp <- range(c(n_bins_current, n_bins_c_current), na.rm = TRUE)
      barplot(
        n_bins_current,
        col = adjustcolor("#000000", alpha.f = .3),
        ylim = y_lim_bp,
        border = "white",
        axes = FALSE,
        xlab = "", ylab = "", 
        main = latex2exp::TeX(title)
      )
      barplot(
        n_bins_c_current,
        col = adjustcolor("#0072B2", alpha.f = .3),
        ylim = y_lim_bp,
        border = "white",
        axes = FALSE,
        xlab = "", ylab = "", main = "",
        add = TRUE
      )
      
      par(mar = c(4.1, 4.3, 0.5, 0.5))
      plot(
        tb_plot$xlim, tb_plot$mean,
        pch = 19, ylim = y_lim, xlim = 0:1, type = "l",
        xlab = latex2exp::TeX("Predicted score $p$"), 
        ylab = latex2exp::TeX("$E(d | s(x) = p)$"),
        col = colour
      )
      polygon(
        c(tb_plot$xlim, rev(tb_plot$xlim)),
        c(tb_plot$lower, rev(tb_plot$upper)),
        col = adjustcolor(col = colour, alpha.f = .4),
        border = NA
      )
      segments(0, 0, 1, 1, col = "black", lty = 2)
    }
  }
}
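
The panel arrangement used in the function above can be inspected on its own. The sketch below rebuilds the layout matrix for an assumed number of rows (nb = 3 is an arbitrary choice) without drawing anything; the do.call(rbind, lapply(...)) form is an alternative to the loop and is not the chapter's code.

```r
# Number of values of the scale parameter (assumption, for illustration)
nb <- 3

# One block of four plots: filled column-wise, so plots 1 and 2 (histogram,
# then curve, for the calibration set) go in the left column, and plots 3 and
# 4 (the same for the test set) go in the right column
mat_init <- matrix(1:4, ncol = 2)

# Stack one block per value of the scale parameter, shifting indices by 4
mat <- do.call(
  rbind,
  lapply(seq_len(nb) - 1, function(j) mat_init + 4 * j)
)
mat

# Relative row heights: a short histogram above a taller calibration curve
heights <- rep(c(1, 3), nb)
# layout(mat, heights = heights) would then set up the plotting device
```

Each pair of rows in mat corresponds to one value of \(\alpha\) or \(\gamma\), and the plots are filled in the order in which the loop draws them.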

For the first two tabs, True Prob. and No Calibration, the calibration curves are those computed using the true probabilities and the uncalibrated scores instead of some recalibrated scores. This is done for comparison purposes.

Calibration curves obtained with recalibrated scores. The curves are obtained with a moving average for the calibration set (left) and for the test set (right) for varying values of \(\alpha\) for a single replication of the simulations. The histograms on top of each graph show the distribution of the uncalibrated scores and that of the calibrated scores.


Calibration curves obtained with recalibrated scores. The curves are obtained with a moving average for the calibration set (left) and for the test set (right) for varying values of \(\gamma\) for a single replication of the simulations. The histograms on top of each graph show the distribution of the uncalibrated scores and that of the calibrated scores.


2.7 Calibration Maps (200 replications)

We now turn to the same type of visualization, but adapted to the 200 replications instead of a single one.

First, let us create a function, get_count_simul(), to get the number of observations in each bin partitioning the [0,1] segment, for the uncalibrated and the recalibrated scores (on both the calibration and the test sets), for all the simulations. Then, we can compute an average count per bin over the simulations. This will be useful to have an idea of the distributions of scores in the different scenarios (varying values for \(\alpha\) or \(\gamma\)) and for each recalibration method (Platt scaling, isotonic regression, etc.).

#' @param i index of the simulation to use (in `simul_recalib_alpha` or 
#'   `simul_recalib_gamma`)
#' @param type type of transformed probabilities (made on `alpha` or `gamma`)
#' @param method name of the recalibration method to focus on
get_count_simul <- function(i, type, method) {
  if (type == "alpha") {
    simul <- simul_recalib_alpha[[i]]
    transform_scale <- grid_alpha$alpha[i]
  } else if (type == "gamma") {
    simul <- simul_recalib_gamma[[i]]
    transform_scale <- grid_gamma$gamma[i]
  } else {
    stop("Wrong value for argument `type`.")
  }
  
  # Counting number of obs in bins defined over [0,1]
  breaks <- seq(0, 1, by = .05)
  if (method == "True Prob.") {
    scores_calib <- simul$data_all_calib$p
    scores_test <- simul$data_all_test$p
    scores_c_calib <- scores_c_test <- NULL
  } else if (method == "No Calibration") {
    scores_calib <- simul$data_all_calib$p_u
    scores_test <- simul$data_all_test$p_u
    scores_c_calib <- scores_c_test <- NULL
  } else {
    tb_score_c_calib <- simul$res_recalibration[[method]]$tb_score_c_calib
    tb_score_c_test <- simul$res_recalibration[[method]]$tb_score_c_test
    scores_calib <- tb_score_c_calib$p_u
    scores_test <- tb_score_c_test$p_u
    scores_c_calib <- tb_score_c_calib$p_c
    scores_c_test <- tb_score_c_test$p_c
  }
  
  n_bins_calib <- table(cut(scores_calib, breaks = breaks))
  n_bins_test <- table(cut(scores_test, breaks = breaks))
  if (!is.null(scores_c_calib)) {
    n_bins_c_calib <- table(cut(scores_c_calib, breaks = breaks))
  } else {
    n_bins_c_calib <- NA_integer_
  }
  if (!is.null(scores_c_test)) {
    n_bins_c_test <- table(cut(scores_c_test, breaks = breaks))
  } else {
    n_bins_c_test <- NA_integer_
  }
  
  n_bins_scores <- tibble(
    bins = names(table(cut(breaks, breaks = breaks))),
    n_bins_calib = as.vector(n_bins_calib),
    n_bins_test = as.vector(n_bins_test),
    n_bins_c_calib = as.vector(n_bins_c_calib),
    n_bins_c_test = as.vector(n_bins_c_test),
    method = method,
    seed = simul$seed,
    type = type,
    transform_scale = transform_scale
  )
  n_bins_scores
}

Let us apply this function to all simulations, both for varying values of \(\alpha\) and for \(\gamma\):

grid_count_alpha <- 
  expand_grid(
    i = 1:nrow(grid_alpha), 
    method = c(
      "True Prob.", "No Calibration",
      "platt", "isotonic", "beta", "locfit_0", "locfit_1", "locfit_2")
  )
grid_count_gamma <- 
  expand_grid(
    i = 1:nrow(grid_gamma), 
    method = c(
      "True Prob.", "No Calibration",
      "platt", "isotonic", "beta", "locfit_0", "locfit_1", "locfit_2")
  )

count_scores_alpha <- map(
  .x = 1:nrow(grid_count_alpha),
  .f = ~get_count_simul(
    i = grid_count_alpha$i[.x], 
    type = "alpha", 
    method = grid_count_alpha$method[.x]
  ),
  .progress = TRUE
)
count_scores_gamma <- map(
  .x = 1:nrow(grid_count_gamma),
  .f = ~get_count_simul(
    i = grid_count_gamma$i[.x], 
    type = "gamma", 
    method = grid_count_gamma$method[.x]
  ),
  .progress = TRUE
)

We can then compute the average in each bin for each of the four scores (uncalibrated in the calibration set, uncalibrated in the test set, recalibrated in the calibration set, recalibrated in the test set), for each method, both for varying values of \(\alpha\) and \(\gamma\).

count_scores_alpha <- 
  count_scores_alpha |> 
  list_rbind() |> 
  group_by(method, type, bins, transform_scale) |> 
  summarise(
    n_bins_calib = mean(n_bins_calib, na.rm = TRUE),
    n_bins_test = mean(n_bins_test, na.rm = TRUE),
    n_bins_c_calib = mean(n_bins_c_calib, na.rm = TRUE),
    n_bins_c_test = mean(n_bins_c_test, na.rm = TRUE),
    .groups = "drop"
  )

count_scores_gamma <- 
  count_scores_gamma |> 
  list_rbind() |> 
  group_by(method, type, bins, transform_scale) |> 
  summarise(
    n_bins_calib = mean(n_bins_calib, na.rm = TRUE),
    n_bins_test = mean(n_bins_test, na.rm = TRUE),
    n_bins_c_calib = mean(n_bins_c_calib, na.rm = TRUE),
    n_bins_c_test = mean(n_bins_c_test, na.rm = TRUE),
    .groups = "drop"
  )
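
To make the averaging step concrete, here is a small sketch on made-up counts for three replications and two bins. It uses base R's aggregate() rather than the group_by() and summarise() calls above, and all values are assumptions chosen for illustration.

```r
# Made-up bin counts for 3 replications (seeds) and 2 bins
counts <- data.frame(
  seed = rep(1:3, each = 2),
  bins = rep(c("(0,0.05]", "(0.05,0.1]"), times = 3),
  n_bins_calib = c(10, 20, 12, 18, 14, 22),
  # Mimics the "True Prob." and "No Calibration" cases, where there are no
  # recalibrated scores and the counts are stored as NA
  n_bins_c_calib = NA_integer_
)

# Average count per bin over the replications
avg <- aggregate(n_bins_calib ~ bins, data = counts, FUN = mean)
avg

# When every value is NA, mean(..., na.rm = TRUE) returns NaN
avg_c <- mean(counts$n_bins_c_calib, na.rm = TRUE)
avg_c
```

The NaN case is what happens to the recalibrated counts for the True Prob. and No Calibration methods, since get_count_simul() stores NA for them.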

2.7.1 Quantile-Based Bins

Instead of looking at the confidence intervals for a single replication, we can plot the 200 replications on a single plot. Because the bins are defined by the quantiles of the scores, their boundaries can change slightly from one replication to another. It is therefore not possible to compute credible intervals across replications.

#' @param simul a single replication result
#' @param method name of the method used to recalibrate for which to compute the calibration curve
#' @param k number of bins to create (quantiles, default to `10`)
get_summary_bins_simul <- function(simul, method, k = 10) {
  obs_calib <- simul$data_all_calib$d
  obs_test <- simul$data_all_test$d
  
  if (method == "True Prob.") {
    scores_calib <- simul$data_all_calib$p
    scores_test <- simul$data_all_test$p
  } else if (method == "No Calibration") {
    scores_calib <- simul$data_all_calib$p_u
    scores_test <- simul$data_all_test$p_u
  } else {
    tb_score_c_calib <- simul$res_recalibration[[method]]$tb_score_c_calib
    tb_score_c_test <- simul$res_recalibration[[method]]$tb_score_c_test
    scores_calib <- tb_score_c_calib$p_c
    scores_test <- tb_score_c_test$p_c
  }
  
  summary_bins_calib <- get_summary_bins(
    obs = obs_calib, scores = scores_calib, k = k)
  summary_bins_test <- get_summary_bins(
    obs = obs_test, scores = scores_test, k = k)
  
  summary_bins_calib |> mutate(sample = "Calibration") |> 
    bind_rows(summary_bins_test |> mutate(sample = "Test")) |> 
    mutate(method = method, seed = simul$seed)
}
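
The following sketch illustrates why quantile-defined bins are not aligned across replications. The helper quantile_bins_sketch() is hypothetical and is not the get_summary_bins() function used above; it only assumes that each bin is summarised by the mean score and the mean observed outcome, as the column names mean_score and mean_obs used later suggest.

```r
# Hypothetical helper: split the scores into k quantile-defined bins and
# return the mean score and the mean observed outcome in each bin
quantile_bins_sketch <- function(obs, scores, k = 10) {
  breaks <- quantile(scores, probs = seq(0, 1, length.out = k + 1))
  # unique() guards against tied quantiles; include.lowest keeps the minimum
  bin <- cut(scores, breaks = unique(breaks), include.lowest = TRUE)
  data.frame(
    mean_score = as.vector(tapply(scores, bin, mean)),
    mean_obs = as.vector(tapply(obs, bin, mean))
  )
}

# Two toy replications (assumption: calibrated scores drawn uniformly)
set.seed(123)
scores_1 <- runif(1000)
obs_1 <- rbinom(1000, size = 1, prob = scores_1)
set.seed(456)
scores_2 <- runif(1000)
obs_2 <- rbinom(1000, size = 1, prob = scores_2)

bins_1 <- quantile_bins_sketch(obs_1, scores_1)
bins_2 <- quantile_bins_sketch(obs_2, scores_2)

# The x-coordinates of the points differ between the two replications
cbind(rep_1 = bins_1$mean_score, rep_2 = bins_2$mean_score)
```

Since the mean score in each bin differs from one replication to the next, the curves cannot be averaged point by point, which is why they are superimposed instead.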

Let us loop over all the methods and all the replications for each value of \(\alpha\) to get the quantile-based calibration curves.

# For alpha
methods <- names(simul_recalib_alpha[[1]]$res_recalibration)
methods <- c("True Prob.", "No Calibration", methods)
summary_bins_simuls_alpha <- vector(mode = "list", length = length(methods))
names(summary_bins_simuls_alpha) <- methods
library(future)
nb_cores <- future::availableCores()-1
plan(multisession, workers = nb_cores)

for (i_method in 1:length(methods)) {
  progressr::with_progress({
    p <- progressr::progressor(steps = nrow(grid_alpha))
    summary_bins_simuls_m <- furrr::future_map(
      .x = simul_recalib_alpha,
      .f = ~{
        p()
        get_summary_bins_simul(simul = .x, method = methods[i_method], k = 10)
      },
      .options = furrr::furrr_options(seed = FALSE)
    )
  })
  # Add value for alpha
  for (j in 1:length(summary_bins_simuls_m)) {
    summary_bins_simuls_m[[j]]$scale_parameter <- grid_alpha$alpha[j]
  }
  summary_bins_simuls_alpha[[i_method]] <- summary_bins_simuls_m |> 
    list_rbind(names_to = "i_row")
}
summary_bins_simuls_alpha <- list_rbind(
  summary_bins_simuls_alpha, names_to = "method"
) |> 
  mutate(type = "alpha")

We do the same for the values of \(\gamma\):

# For gamma
methods <- names(simul_recalib_gamma[[1]]$res_recalibration)
methods <- c("True Prob.", "No Calibration", methods)
summary_bins_simuls_gamma <- vector(mode = "list", length = length(methods))
names(summary_bins_simuls_gamma) <- methods
library(future)
nb_cores <- future::availableCores()-1
plan(multisession, workers = nb_cores)

for (i_method in 1:length(methods)) {
  progressr::with_progress({
    p <- progressr::progressor(steps = nrow(grid_gamma))
    summary_bins_simuls_m <- furrr::future_map(
      .x = simul_recalib_gamma,
      .f = ~{
        p()
        get_summary_bins_simul(simul = .x, method = methods[i_method], k = 10)
      },
      .options = furrr::furrr_options(seed = FALSE)
    )
  })
  # Add value for gamma
  for (j in 1:length(summary_bins_simuls_m)) {
    summary_bins_simuls_m[[j]]$scale_parameter <- grid_gamma$gamma[j]
  }
  summary_bins_simuls_gamma[[i_method]] <- summary_bins_simuls_m |> 
    list_rbind(names_to = "i_row")
}
summary_bins_simuls_gamma <- list_rbind(
  summary_bins_simuls_gamma, names_to = "method"
) |> 
  mutate(type = "gamma")

The results are stored in a single tibble:

summary_bins_simuls <- summary_bins_simuls_alpha |> 
  bind_rows(summary_bins_simuls_gamma)

Let us define a function to plot the calibration curves on the calibration and the test samples.

plot_calib_qbins_simuls <- function(method, type) {
  tb_calibration_curve <- summary_bins_simuls |> 
    filter(
      method == !!method,
      type == !!type
    )
  
  if (type == "alpha") {
    count_scores <- count_scores_alpha |> filter(method == !!method)
  } else if (type == "gamma") {
    count_scores <- count_scores_gamma |> filter(method == !!method)
  } else {
    stop("Type should be either \"alpha\" or \"gamma\"")
  }
  
  scale_params <- unique(tb_calibration_curve$scale_parameter)
  seeds <- unique(tb_calibration_curve$seed)
  colours <- c("Calibration" = "#D55E00", "Test" = "#009E73")
  
  nb <- length(scale_params)
  mat <- mat_init <- matrix(c(1:4), ncol = 2)
  for (j in seq_len(nb - 1)) {
    mat <- rbind(mat, mat_init + j * 4)
  }
  layout(mat, heights = rep(c(1, 3), nb))
  
  y_lim <- c(0, 1)
  
  for (scale_param in scale_params) {
    title <- str_c("$\\", type, " = $", round(scale_param, 2))
    
    for (sample in c("Calibration", "Test")) {
      if (sample == "Calibration"){
        n_bins_current <- count_scores |> 
          filter(transform_scale == !!scale_param) |> pull("n_bins_calib")
        n_bins_c_current <- count_scores |> 
          filter(transform_scale == !!scale_param) |> pull("n_bins_c_calib")
      } else {
        n_bins_current <- count_scores |> 
          filter(transform_scale == !!scale_param) |> pull("n_bins_test")
        n_bins_c_current <- count_scores |> 
          filter(transform_scale == !!scale_param) |> pull("n_bins_c_test")
      }
      par(mar = c(0.5, 4.3, 1.0, 0.5))
      y_lim_bp <- range(c(n_bins_current, n_bins_c_current), na.rm = TRUE)
      barplot(
        n_bins_current,
        col = adjustcolor("#000000", alpha.f = .3),
        ylim = y_lim_bp,
        border = "white",
        axes = FALSE,
        xlab = "", ylab = "", 
        main = latex2exp::TeX(title)
      )
      barplot(
        n_bins_c_current,
        col = adjustcolor("#0072B2", alpha.f = .3),
        ylim = y_lim_bp,
        border = "white",
        axes = FALSE,
        xlab = "", ylab = "", main = "",
        add = TRUE
      )
      par(mar = c(4.1, 4.3, 0.5, 0.5))
      plot(
        0:1, 0:1,
        type = "l", col = NULL,
        xlim = 0:1, ylim = 0:1,
        xlab = "Predicted probability", ylab = "Mean predicted probability",
        main = ""
      )
      for (i_simul in seeds) {
        tb_current <- tb_calibration_curve |> 
          filter(
            scale_parameter == scale_param,
            seed == i_simul,
            sample == !!sample
          )
        lines(
          tb_current$mean_score, tb_current$mean_obs,
          lwd = 2, col = adjustcolor(colours[sample], alpha.f = 0.1), t = "b",
          cex = .1, pch = 19
        )
      }
      segments(0, 0, 1, 1, col = "black", lty = 2)
    }
  }
}
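
The layout matrix built at the start of `plot_calib_qbins_simuls()` can be inspected on its own. The sketch below reproduces that construction for an assumed number of scale parameters (`nb = 3` is chosen only for illustration), to show in which cell each histogram and each curve ends up.

```r
nb <- 3  # assumed number of scale parameters, for illustration only

# One block of four cells: histogram (1) above curve (2) for the calibration
# set in the left column, histogram (3) above curve (4) for the test set in
# the right column
mat <- mat_init <- matrix(1:4, ncol = 2)

# Stack one such block per additional scale parameter
# (seq_len() yields an empty loop when nb is 1)
for (j in seq_len(nb - 1)) {
  mat <- rbind(mat, mat_init + j * 4)
}
mat
```

Passing this matrix to `layout()` with `heights = rep(c(1, 3), nb)` makes each histogram row one third as tall as the curve row below it.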

The figures below show panels of graphs with the superimposed calibration curves obtained with the quantile-based bins. Each tab shows the curves for one recalibration method. The first two tabs (True Prob. and No Calibration) show the curves obtained using the true probabilities \(p\) and the uncalibrated probabilities \(p^u\), instead of the recalibrated probabilities \(p^c\). Each row of a panel corresponds to a value of either \(\alpha\) or \(\gamma\) used to transform \(p\) into \(p^u\). The left column shows the calibration curves obtained on the calibration set, whereas the right column shows those obtained on the test set. The average distributions (computed over the 200 simulations) of the uncalibrated scores and of the calibrated scores are shown in the histograms on top of each graph.
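
As a reminder of what each superimposed curve represents, the following sketch computes a quantile-binned calibration curve on toy data. It is not the `get_summary_bins_simul()` function used above: the simulated scores, the distortion `p^3`, and the helper name `calib_curve_qbins()` are assumptions made for illustration.

```r
library(dplyr)
library(tibble)

set.seed(123)
n_obs <- 2000
p <- runif(n_obs)                        # toy true probabilities
d <- rbinom(n_obs, size = 1, prob = p)   # binary outcomes drawn from p
p_u <- p^3                               # toy miscalibrated scores

# Hypothetical helper: split the scores into k quantile-defined bins and
# compare the mean score with the observed event rate in each bin
calib_curve_qbins <- function(obs, scores, k = 10) {
  breaks <- quantile(scores, probs = seq(0, 1, length.out = k + 1))
  tibble(obs = obs, scores = scores) |>
    mutate(bin = cut(scores, breaks = unique(breaks), include.lowest = TRUE)) |>
    group_by(bin) |>
    summarise(
      n = n(),
      mean_score = mean(scores),
      mean_obs = mean(obs),
      .groups = "drop"
    )
}

curve_u <- calib_curve_qbins(obs = d, scores = p_u, k = 10)
curve_u
```

With a well-calibrated score, the points `(mean_score, mean_obs)` would lie close to the diagonal. Here the toy scores `p^3` understate the probabilities, so the curve is expected to lie above it.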

Calibration curves obtained with recalibrated scores. The curves are obtained with quantile-defined bins for the calibration set (left) and for the test set (right) for varying values of \(\alpha\) for 200 replications of the simulations. The histograms on top of each graph show the distribution of the uncalibrated scores and that of the calibrated scores.

Calibration curves obtained with recalibrated scores. The curves are obtained with quantile-defined bins for the calibration set (left) and for the test set (right) for varying values of \(\gamma\) for 200 replications of the simulations. The histograms on top of each graph show the distribution of the uncalibrated scores and that of the calibrated scores.

Let us visualize the calibration curves in another way, placing the panels for \(\alpha\) and \(\gamma\) side by side.

plot_calib_quant_simuls <- function(calib_curve, method) {
  tb_calibration_curve_both <- calib_curve |>
    filter(
      method == !!method
    )
  
  mat <- c(
    1, 3, 13, 15,
    2, 4, 14, 16,
    5, 7, 17, 19,
    6, 8, 18, 20,
    9, 11, 21, 23,
    10, 12, 22, 24
  ) |>
    matrix(ncol = 4, byrow = TRUE)
  
  layout(mat, heights = rep(c(1, 3), 3))
  
  y_lim <- c(0, 1)
  
  for (type in c("alpha", "gamma")) {
    
    if (type == "alpha") {
      count_scores <- count_scores_alpha |> filter(method == !!method)
    } else if (type == "gamma") {
      count_scores <- count_scores_gamma |> filter(method == !!method)
    } else {
      stop("Type should be either \"alpha\" or \"gamma\"")
    }
    
    tb_calibration_curve <-
      tb_calibration_curve_both |>
      filter(type == !!type)
    
    scale_params <- unique(tb_calibration_curve$scale_parameter)
    seeds <- unique(tb_calibration_curve$seed)
    colours <- c("Calibration" = "#D55E00", "Test" = "#009E73")
    
    for (scale_param in scale_params) {
      title <- str_c("$\\", type, " = $", round(scale_param, 2))
      
      for (sample in c("Calibration", "Test")) {
        
        if (sample == "Calibration"){
          n_bins_current <- count_scores |>
            filter(transform_scale == !!scale_param) |> pull("n_bins_calib")
          n_bins_c_current <- count_scores |>
            filter(transform_scale == !!scale_param) |> pull("n_bins_c_calib")
        } else {
          n_bins_current <- count_scores |>
            filter(transform_scale == !!scale_param) |> pull("n_bins_test")
          n_bins_c_current <- count_scores |>
            filter(transform_scale == !!scale_param) |> pull("n_bins_c_test")
        }
        
        par(mar = c(0.5, 4.3, 1.0, 0.5))
        y_lim_bp <- range(c(n_bins_current, n_bins_c_current), na.rm = TRUE)
        barplot(
          n_bins_current,
          col = adjustcolor("#000000", alpha.f = .3),
          ylim = y_lim_bp,
          border = "white",
          axes = FALSE,
          xlab = "", ylab = "",
          main = latex2exp::TeX(title)
        )
        barplot(
          n_bins_c_current,
          col = adjustcolor("#0072B2", alpha.f = .3),
          ylim = y_lim_bp,
          border = "white",
          axes = FALSE,
          xlab = "", ylab = "", main = "",
          add = TRUE
        )
        
        tb_current <-
          tb_calibration_curve |>
          filter(
            scale_parameter == !!scale_param,
            sample == !!sample
          )
        par(mar = c(4.1, 4.3, 0.5, 0.5), mgp = c(2, 1, 0))
        plot(
          0:1, 0:1,
          type = "l", col = NULL,
          xlim = 0:1, ylim = 0:1,
          xlab = latex2exp::TeX("$p^u$"), 
          ylab = latex2exp::TeX("$E(D | p^u = p^c)$"),
          main = ""
        )
        for (i_simul in seeds) {
          tb_current <- tb_calibration_curve |> 
            filter(
              scale_parameter == scale_param,
              seed == i_simul,
              sample == !!sample
            )
          lines(
            tb_current$mean_score, tb_current$mean_obs,
            lwd = 2, col = adjustcolor(colours[sample], alpha.f = 0.1), t = "b",
            cex = .1, pch = 19
          )
        }
        segments(0, 0, 1, 1, col = "black", lty = 2)
      }
    }
  }
}
methods <- c(
  "True Prob.", 
  "No Calibration",
  "platt", "isotonic", "beta", 
  "locfit_0", "locfit_1", "locfit_2"
)
methods_labs <- c(
  "True Prob.",
  "No Calibration", "Platt", "Isotonic", "Beta", 
  "Locfit (deg=0)", "Locfit (deg=1)", "Locfit (deg=2)"
)
plot_calib_quant_simuls(
  calib_curve = summary_bins_simuls |> 
    filter(scale_parameter %in% c(1/3, 1, 3)), 
  method = "True Prob."
)
Figure 2.19: Calibration Curves Calculated with True Probabilities as the Scores. The curves are obtained with quantile binning, for the calibration set (orange) and for the test set (green) for varying values of \(\alpha\) and \(\gamma\). The curves of the 200 replications of the simulations are superimposed. The histograms on top of each graph show the distribution of the true probabilities.

plot_calib_quant_simuls(
  calib_curve = summary_bins_simuls |> 
    filter(scale_parameter %in% c(1/3, 1, 3)), 
  method = "No Calibration"
)
Figure 2.20: Calibration Curves Calculated with Uncalibrated Scores. The curves are obtained with quantile binning, for the calibration set (orange) and for the test set (green) for varying values of \(\alpha\) and \(\gamma\). The curves of the 200 replications of the simulations are superimposed. The histograms on top of each graph show the distribution of the true probabilities.

plot_calib_quant_simuls(
  calib_curve = summary_bins_simuls |> 
    filter(scale_parameter %in% c(1/3, 1, 3)), 
  method = "platt"
)
Figure 2.21: Calibration Curves Calculated with Scores Recalibrated Using Platt Scaling. The curves are obtained with quantile binning, for the calibration set (orange) and for the test set (green) for varying values of \(\alpha\) and \(\gamma\). The curves of the 200 replications of the simulations are superimposed. The histograms on top of each graph show the distribution of the uncalibrated scores (gray) and that of the calibrated scores (blue).

plot_calib_quant_simuls(
  calib_curve = summary_bins_simuls |> 
    filter(scale_parameter %in% c(1/3, 1, 3)), 
  method = "isotonic"
)
Figure 2.22: Calibration Curves Calculated with Scores Recalibrated Using Isotonic Regression. The curves are obtained with quantile binning, for the calibration set (orange) and for the test set (green) for varying values of \(\alpha\) and \(\gamma\). The curves of the 200 replications of the simulations are superimposed. The histograms on top of each graph show the distribution of the uncalibrated scores (gray) and that of the calibrated scores (blue).

plot_calib_quant_simuls(
  calib_curve = summary_bins_simuls |> 
    filter(scale_parameter %in% c(1/3, 1, 3)), 
  method = "beta"
)
Figure 2.23: Calibration Curves Calculated with Scores Recalibrated Using Beta Calibration. The curves are obtained with quantile binning, for the calibration set (orange) and for the test set (green) for varying values of \(\alpha\) and \(\gamma\). The curves of the 200 replications of the simulations are superimposed. The histograms on top of each graph show the distribution of the uncalibrated scores (gray) and that of the calibrated scores (blue).

plot_calib_quant_simuls(
  calib_curve = summary_bins_simuls |> 
    filter(scale_parameter %in% c(1/3, 1, 3)), 
  method = "locfit_0"
)
Figure 2.24: Calibration Curves Calculated with Scores Recalibrated Using Local Regression (with degree 0). The curves are obtained with quantile binning, for the calibration set (orange) and for the test set (green) for varying values of \(\alpha\) and \(\gamma\). The curves of the 200 replications of the simulations are superimposed. The histograms on top of each graph show the distribution of the uncalibrated scores (gray) and that of the calibrated scores (blue).

plot_calib_quant_simuls(
  calib_curve = summary_bins_simuls |> 
    filter(scale_parameter %in% c(1/3, 1, 3)), 
  method = "locfit_1"
)
Figure 2.25: Calibration Curves Calculated with Scores Recalibrated Using Local Regression (with degree 1). The curves are obtained with quantile binning, for the calibration set (orange) and for the test set (green) for varying values of \(\alpha\) and \(\gamma\). The curves of the 200 replications of the simulations are superimposed. The histograms on top of each graph show the distribution of the uncalibrated scores (gray) and that of the calibrated scores (blue).

plot_calib_quant_simuls(
  calib_curve = summary_bins_simuls |> 
    filter(scale_parameter %in% c(1/3, 1, 3)), 
  method = "locfit_2"
)
Figure 2.26: Calibration Curves Calculated with Scores Recalibrated Using Local Regression (with degree 2). The curves are obtained with quantile binning, for the calibration set (orange) and for the test set (green) for varying values of \(\alpha\) and \(\gamma\). The curves of the 200 replications of the simulations are superimposed. The histograms on top of each graph show the distribution of the uncalibrated scores (gray) and that of the calibrated scores (blue).

2.7.2 Calibration Curve with Local Regression

We will plot the calibration curves estimated using the local regression method, for each type of transformation applied to the probabilities (varying either \(\alpha\) or \(\gamma\)).

Unlike the quantile-based calibration curve, which is only evaluated at the bin means, the fitted local regression can be used to make predictions on a regular grid over the segment from 0 to 1. In practice, we restrict this grid to the range of the observed scores.
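
Before wrapping this in a function applied to each simulation, the idea can be sketched on toy data. The simulated scores and the distortion `p^3` are assumptions made for illustration; the `locfit()` settings mirror those used in this section.

```r
library(locfit)

set.seed(123)
n_obs <- 2000
p <- runif(n_obs)                        # toy true probabilities
d <- rbinom(n_obs, size = 1, prob = p)   # binary outcomes drawn from p
tb <- data.frame(obs = d, scores = p^3)  # toy miscalibrated scores

# Local regression of the outcome on the score
# (degree 0, 15% nearest neighbours, rectangular kernel)
fit <- locfit(
  formula = obs ~ lp(scores, nn = 0.15, deg = 0),
  kern = "rect", maxk = 200, data = tb
)

# Regular grid on [0, 1], restricted to the range of observed scores
grid <- seq(0, 1, length.out = 100)
grid <- grid[grid >= min(tb$scores) & grid <= max(tb$scores)]

# Estimated event probability at each grid point, clipped to [0, 1]
curve <- pmin(pmax(predict(fit, newdata = grid), 0), 1)
head(data.frame(score = grid, estimate = curve))
```

Restricting the grid avoids extrapolating the fit to regions where no score was observed.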

calibration_curve_locfit_simul <- function(simul, 
                                           method, 
                                           k = 10) {
  
  
  obs_calib <- simul$data_all_calib$d
  obs_test <- simul$data_all_test$d
  
  if (method == "True Prob.") {
    scores_calib <- simul$data_all_calib$p
    scores_test <- simul$data_all_test$p
  } else if (method == "No Calibration") {
    scores_calib <- simul$data_all_calib$p_u
    scores_test <- simul$data_all_test$p_u
  } else {
    tb_score_c_calib <- simul$res_recalibration[[method]]$tb_score_c_calib
    tb_score_c_test <- simul$res_recalibration[[method]]$tb_score_c_test
    scores_calib <- tb_score_c_calib$p_c
    scores_test <- tb_score_c_test$p_c
  }
  
  # Add a little noise (otherwise, R may crash...)
  scores_calib <- scores_calib + rnorm(length(scores_calib), 0, .001)
  scores_test <- scores_test + rnorm(length(scores_test), 0, .001)
  
  tb_calib <- tibble(
    obs = obs_calib,
    scores = scores_calib
  )
  
  tb_test <- tibble(
    obs = obs_test,
    scores = scores_test
  )
  
  locfit_0_calib <- locfit(
    formula = obs ~ lp(scores, nn = 0.15, deg = 0), 
    kern = "rect", maxk = 200, data = tb_calib
  )
  
  # Predictions on [0,1]
  linspace_raw <- seq(0, 1, length.out = 100)
  
  # Restricting this space to the range of observed scores
  keep_linspace_calib <- which(
    linspace_raw >= min(scores_calib) & linspace_raw <= max(scores_calib)
  )
  linspace_calib <- linspace_raw[keep_linspace_calib]
  
  score_c_locfit_0_calib <- predict(locfit_0_calib, newdata = linspace_calib)
  score_c_locfit_0_calib[score_c_locfit_0_calib < 0] <- 0
  score_c_locfit_0_calib[score_c_locfit_0_calib > 1] <- 1
  
  locfit_0_test <- locfit(
    formula = obs ~ lp(scores, nn = 0.15, deg = 0), 
    kern = "rect", maxk = 200, data = tb_test
  )
  
  keep_linspace_test <- which(
    linspace_raw >= min(scores_test) & linspace_raw <= max(scores_test)
  )
  linspace_test <- linspace_raw[keep_linspace_test]
  
  score_c_locfit_0_test <- predict(locfit_0_test, newdata = linspace_test)
  score_c_locfit_0_test[score_c_locfit_0_test < 0] <- 0
  score_c_locfit_0_test[score_c_locfit_0_test > 1] <- 1
  
  tb_calibration_curve_locfit <- tibble(
    xlim = linspace_calib,
    locfit_pred = score_c_locfit_0_calib,
    method = method,
    seed = simul$seed,
    sample = "calibration"
  ) |> 
    bind_rows(
      tibble(
        xlim = linspace_test,
        locfit_pred = score_c_locfit_0_test,
        method = method,
        seed = simul$seed,
        sample = "test"
      )
    )
  
  tb_calibration_curve_locfit
}

Let us loop over all the methods and all the replications for each value of \(\alpha\) to get the calibration curves based on local regression.

# For alpha
methods <- names(simul_recalib_alpha[[1]]$res_recalibration)
methods <- c("True Prob.", "No Calibration", methods)
calib_curve_locfit_simuls_alpha <- vector(mode = "list", length = length(methods))
names(calib_curve_locfit_simuls_alpha) <- methods
library(future)
nb_cores <- future::availableCores()-1
plan(multisession, workers = nb_cores)

for (i_method in 1:length(methods)) {
  progressr::with_progress({
    p <- progressr::progressor(steps = nrow(grid_alpha))
    calib_curve_locfit_simuls_m <- furrr::future_map(
      .x = simul_recalib_alpha,
      .f = ~{
        p()
        calibration_curve_locfit_simul(
          simul = .x, method = methods[i_method], k = 10
        )
      },
      .options = furrr::furrr_options(seed = FALSE)
    )
  })
  # Add value for alpha
  for (j in 1:length(calib_curve_locfit_simuls_m)) {
    calib_curve_locfit_simuls_m[[j]]$scale_parameter <- grid_alpha$alpha[j]
  }
  calib_curve_locfit_simuls_alpha[[i_method]] <- calib_curve_locfit_simuls_m |> 
    list_rbind(names_to = "i_row")
}
calib_curve_locfit_simuls_alpha <- list_rbind(
  calib_curve_locfit_simuls_alpha, names_to = "method"
) |> 
  mutate(type = "alpha")

Let us loop over the simulations made with varying values for \(\gamma\):

# For gamma
methods <- names(simul_recalib_gamma[[1]]$res_recalibration)
# Add the two reference cases (all recalibration methods are kept)
methods <- c("True Prob.", "No Calibration", methods)
calib_curve_locfit_simuls_gamma <- vector(mode = "list", length = length(methods))
names(calib_curve_locfit_simuls_gamma) <- methods
library(future)
nb_cores <- future::availableCores()-1
plan(multisession, workers = nb_cores)

for (i_method in 1:length(methods)) {
  progressr::with_progress({
    p <- progressr::progressor(steps = nrow(grid_gamma))
    calib_curve_locfit_simuls_m <- furrr::future_map(
      .x = simul_recalib_gamma,
      .f = ~{
        p()
        calibration_curve_locfit_simul(
          simul = .x, method = methods[i_method], k = 10
        )
      },
      .options = furrr::furrr_options(seed = FALSE)
    )
  })
  # Add value for gamma
  for (j in 1:length(calib_curve_locfit_simuls_m)) {
    calib_curve_locfit_simuls_m[[j]]$scale_parameter <- grid_gamma$gamma[j]
  }
  calib_curve_locfit_simuls_gamma[[i_method]] <- calib_curve_locfit_simuls_m |> 
    list_rbind(names_to = "i_row")
}
calib_curve_locfit_simuls_gamma <- list_rbind(
  calib_curve_locfit_simuls_gamma, names_to = "method"
) |> 
  mutate(type = "gamma")

The results are stored in a single tibble:

calib_curve_locfit_simuls <- 
  calib_curve_locfit_simuls_alpha |> 
  bind_rows(calib_curve_locfit_simuls_gamma) |> 
  mutate(
    sample = factor(
      sample, 
      levels = c("calibration", "test"), labels = c("Calibration", "Test")
    )
  )

Let us define a function to plot the calibration curves on the calibration and the test samples.

plot_calib_locfit_simuls <- function(method, type) {
  tb_calibration_curve <- calib_curve_locfit_simuls |> 
    filter(
      method == !!method,
      type == !!type
    )
  if (type == "alpha") {
    count_scores <- count_scores_alpha |> filter(method == !!method)
  } else if (type == "gamma") {
    count_scores <- count_scores_gamma |> filter(method == !!method)
  } else {
    stop("Type should be either alpha or gamma")
  }
  
  scale_params <- unique(tb_calibration_curve$scale_parameter)
  seeds <- unique(tb_calibration_curve$seed)
  colours <- c("Calibration" = "#D55E00", "Test" = "#009E73")
  
  nb <- length(scale_params)
  mat <- mat_init <- matrix(c(1:4), ncol = 2)
  for (j in seq_len(nb - 1)) {
    mat <- rbind(mat, mat_init + j * 4)
  }
  layout(mat, heights = rep(c(1, 3), nb))
  
  y_lim <- c(0, 1)
  
  for (scale_param in scale_params) {
    title <- str_c("$\\", type, " = $", round(scale_param, 2))
    
    for (sample in c("Calibration", "Test")) {
      
      if (sample == "Calibration"){
        n_bins_current <- count_scores |> 
          filter(transform_scale == !!scale_param) |> pull("n_bins_calib")
        n_bins_c_current <- count_scores |> 
          filter(transform_scale == !!scale_param) |> pull("n_bins_c_calib")
      } else {
        n_bins_current <- count_scores |> 
          filter(transform_scale == !!scale_param) |> pull("n_bins_test")
        n_bins_c_current <- count_scores |> 
          filter(transform_scale == !!scale_param) |> pull("n_bins_c_test")
      }
      
      par(mar = c(0.5, 4.3, 1.0, 0.5))
      y_lim_bp <- range(c(n_bins_current, n_bins_c_current), na.rm = TRUE)
      barplot(
        n_bins_current,
        col = adjustcolor("#000000", alpha.f = .3),
        ylim = y_lim_bp,
        border = "white",
        axes = FALSE,
        xlab = "", ylab = "", 
        main = latex2exp::TeX(title)
      )
      barplot(
        n_bins_c_current,
        col = adjustcolor("#0072B2", alpha.f = .3),
        ylim = y_lim_bp,
        border = "white",
        axes = FALSE,
        xlab = "", ylab = "", main = "",
        add = TRUE
      )
      
      tb_current <- 
        tb_calibration_curve |> 
        filter(
          scale_parameter == !!scale_param,
          sample == !!sample
        ) |> 
        group_by(type, scale_parameter, xlim) |> 
        summarise(
          mean = mean(locfit_pred),
          lower = quantile(locfit_pred, probs = .025),
          upper = quantile(locfit_pred, probs = .975),
          .groups = "drop"
        )
      par(mar = c(4.1, 4.3, 0.5, 0.5))
      plot(
        tb_current$xlim, tb_current$mean,
        type = "l", col = colours[sample],
        xlim = 0:1, ylim = 0:1,
        xlab = "Predicted probability", ylab = "Mean predicted probability",
        main = ""
      )
      polygon(
        c(tb_current$xlim, rev(tb_current$xlim)),
        c(tb_current$lower, rev(tb_current$upper)),
        col = adjustcolor(col = colours[sample], alpha.f = .4),
        border = NA
      )
      segments(0, 0, 1, 1, col = "black", lty = 2)
    }
  }
}

The figures below show panels of graphs with the calibration curves obtained with the local regression method. Each tab shows, for one recalibration method, the average curve obtained over the 200 replications, as well as a 95% band given by the pointwise 2.5% and 97.5% quantiles across replications. The first two tabs (True Prob. and No Calibration) show the curves obtained using the true probabilities \(p\) and the uncalibrated probabilities \(p^u\), instead of the recalibrated probabilities \(p^c\). Each row of a panel corresponds to a value of either \(\alpha\) or \(\gamma\) used to transform \(p\) into \(p^u\). The left column shows the calibration curve obtained on the calibration set, whereas the right column shows the calibration curve obtained on the test set. The average distributions (computed over the 200 simulations) of the uncalibrated scores and of the calibrated scores are shown in the histograms on top of each graph.
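
The pointwise summary behind the average curve and its band can be sketched on toy data. The tibble `toy_curves` is hypothetical: it mimics 200 replications evaluated on a common grid, each equal to the diagonal plus Gaussian noise, and reuses the column names `xlim` and `locfit_pred` from above.

```r
library(dplyr)
library(tibble)

set.seed(123)
# Hypothetical curves: 200 replications on a common grid of 11 points
toy_curves <- tibble(
  seed = rep(1:200, each = 11),
  xlim = rep(seq(0, 1, by = 0.1), times = 200)
) |>
  mutate(locfit_pred = xlim + rnorm(n(), mean = 0, sd = 0.05))

# At each grid point: mean and 2.5% / 97.5% quantiles across replications
toy_band <- toy_curves |>
  group_by(xlim) |>
  summarise(
    mean = mean(locfit_pred),
    lower = quantile(locfit_pred, probs = .025),
    upper = quantile(locfit_pred, probs = .975),
    .groups = "drop"
  )
toy_band
```

The columns `lower` and `upper` are the values passed to `polygon()` to draw the shaded band around the mean curve.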

Calibration curves obtained with recalibrated scores. The curves are obtained with local regressions for the calibration set (left) and for the test set (right) for varying values of \(\alpha\) for 200 replications of the simulations. The histograms on top of each graph show the distribution of the uncalibrated scores and that of the calibrated scores.

Calibration curves obtained with recalibrated scores. The curves are obtained with local regressions for the calibration set (left) and for the test set (right) for varying values of \(\gamma\) for 200 replications of the simulations. The histograms on top of each graph show the distribution of the uncalibrated scores and that of the calibrated scores.

Let us visualize this in another way.

plot_calib_locfit_simuls_2 <- function(calib_curve, method) {
  tb_calibration_curve_both <- calib_curve |>
    filter(
      method == !!method
    )
  
  mat <- c(
    1, 3, 13, 15,
    2, 4, 14, 16,
    5, 7, 17, 19,
    6, 8, 18, 20,
    9, 11, 21, 23,
    10, 12, 22, 24
  ) |>
    matrix(ncol = 4, byrow = TRUE)
  
  layout(mat, heights = rep(c(1, 3), 3))
  
  y_lim <- c(0, 1)
  
  for (type in c("alpha", "gamma")) {
    
    tb_calibration_curve <-
      tb_calibration_curve_both |>
      filter(type == !!type)
    
    if (type == "alpha") {
      count_scores <- count_scores_alpha |> filter(method == !!method)
    } else if (type == "gamma") {
      count_scores <- count_scores_gamma |> filter(method == !!method)
    }
    
    scale_params <- unique(tb_calibration_curve$scale_parameter)
    seeds <- unique(tb_calibration_curve$seed)
    colours <- c("Calibration" = "#D55E00", "Test" = "#009E73")
    
    for (scale_param in scale_params) {
      title <- str_c("$\\", type, " = $", round(scale_param, 2))
      
      for (sample in c("Calibration", "Test")) {
        
        if (sample == "Calibration"){
          n_bins_current <- count_scores |>
            filter(transform_scale == !!scale_param) |> pull("n_bins_calib")
          n_bins_c_current <- count_scores |>
            filter(transform_scale == !!scale_param) |> pull("n_bins_c_calib")
        } else {
          n_bins_current <- count_scores |>
            filter(transform_scale == !!scale_param) |> pull("n_bins_test")
          n_bins_c_current <- count_scores |>
            filter(transform_scale == !!scale_param) |> pull("n_bins_c_test")
        }
        
        par(mar = c(0.5, 4.3, 1.0, 0.5))
        y_lim_bp <- range(c(n_bins_current, n_bins_c_current), na.rm = TRUE)
        barplot(
          n_bins_current,
          col = adjustcolor("#000000", alpha.f = .3),
          ylim = y_lim_bp,
          border = "white",
          axes = FALSE,
          xlab = "", ylab = "",
          main = latex2exp::TeX(title)
        )
        barplot(
          n_bins_c_current,
          col = adjustcolor("#0072B2", alpha.f = .3),
          ylim = y_lim_bp,
          border = "white",
          axes = FALSE,
          xlab = "", ylab = "", main = "",
          add = TRUE
        )
        
        tb_current <-
          tb_calibration_curve |>
          filter(
            scale_parameter == !!scale_param,
            sample == !!sample
          ) |>
          group_by(type, scale_parameter, xlim, sample) |>
          summarise(
            mean = mean(locfit_pred),
            lower = quantile(locfit_pred, probs = .025),
            upper = quantile(locfit_pred, probs = .975),
            .groups = "drop"
          )
        par(mar = c(4.1, 4.3, 0.5, 0.5), mgp = c(2, 1, 0))
        plot(
          tb_current$xlim, tb_current$mean,
          type = "l", col = colours[sample],
          xlim = 0:1, ylim = 0:1,
          xlab = latex2exp::TeX("$p^u$"), 
          ylab = latex2exp::TeX("$E(D | p^u = p^c)$"),
          main = ""
        )
        polygon(
          c(tb_current$xlim, rev(tb_current$xlim)),
          c(tb_current$lower, rev(tb_current$upper)),
          col = adjustcolor(col = colours[sample], alpha.f = .4),
          border = NA
        )
        segments(0, 0, 1, 1, col = "black", lty = 2)
      }
    }
  }
}
plot_calib_locfit_simuls_2(
  calib_curve = calib_curve_locfit_simuls |> 
    filter(scale_parameter %in% c(1/3, 1, 3)), 
  method = "True Prob."
)
Figure 2.27: Calibration Curves Calculated with True Probabilities as the Scores. The curves are obtained with local regression, for the calibration set (orange) and for the test set (green), for varying values of \(\alpha\) and \(\gamma\). The curves are the average values obtained over 200 replications of the simulations; the bands correspond to the 2.5% and 97.5% quantiles across replications. The histograms on top of each graph show the distribution of the true probabilities.

plot_calib_locfit_simuls_2(
  calib_curve = calib_curve_locfit_simuls |> 
    filter(scale_parameter %in% c(1/3, 1, 3)), 
  method = "No Calibration"
)
Figure 2.28: Calibration Curves Calculated with Uncalibrated Scores. The curves are obtained with local regression, for the calibration set (orange) and for the test set (green), for varying values of \(\alpha\) and \(\gamma\). The curves are the average values obtained over 200 replications of the simulations; the bands correspond to the 2.5% and 97.5% quantiles across replications. The histograms on top of each graph show the distribution of the true probabilities.

plot_calib_locfit_simuls_2(
  calib_curve = calib_curve_locfit_simuls |> 
    filter(scale_parameter %in% c(1/3, 1, 3)), 
  method = "platt"
)
Figure 2.29: Calibration Curves Calculated with Scores Recalibrated Using Platt Scaling. The curves are obtained with local regression, for the calibration set (orange) and for the test set (green), for varying values of \(\alpha\) and \(\gamma\). The curves are the average values obtained over 200 replications of the simulations; the bands correspond to the 2.5% and 97.5% quantiles across replications. The histograms on top of each graph show the distribution of the uncalibrated scores and that of the calibrated scores.

plot_calib_locfit_simuls_2(
  calib_curve = calib_curve_locfit_simuls |> 
    filter(scale_parameter %in% c(1/3, 1, 3)), 
  method = "isotonic"
)
Figure 2.30: Calibration Curves Calculated with Scores Recalibrated Using Isotonic Regression. The curves are obtained with local regression, for the calibration set (orange) and for the test set (green), for varying values of \(\alpha\) and \(\gamma\). The curves are the average values obtained over 200 replications of the simulations; the bands correspond to the 2.5% and 97.5% quantiles across replications. The histograms on top of each graph show the distribution of the uncalibrated scores and that of the calibrated scores.

plot_calib_locfit_simuls_2(
  calib_curve = calib_curve_locfit_simuls |> 
    filter(scale_parameter %in% c(1/3, 1, 3)), 
  method = "beta"
)
Figure 2.31: Calibration Curves Calculated with Scores Recalibrated Using Beta Calibration. The curves are obtained with local regression, for the calibration set (orange) and for the test set (green), for varying values of \(\alpha\) and \(\gamma\). The curves are the average values obtained over 200 replications of the simulations; the bands correspond to the 2.5% and 97.5% quantiles across replications. The histograms on top of each graph show the distribution of the uncalibrated scores and that of the calibrated scores.

plot_calib_locfit_simuls_2(
  calib_curve = calib_curve_locfit_simuls |> 
    filter(scale_parameter %in% c(1/3, 1, 3)), 
  method = "locfit_0"
)
Figure 2.32: Calibration Curves Calculated with Scores Recalibrated Using Local Regression (with degree 0). The curves are obtained with local regression, for the calibration set (orange) and for the test set (green), for varying values of \(\alpha\) and \(\gamma\). The curves are the average values obtained over 200 replications of the simulations; the bands correspond to the 2.5% and 97.5% quantiles across replications. The histograms on top of each graph show the distribution of the uncalibrated scores and that of the calibrated scores.

plot_calib_locfit_simuls_2(
  calib_curve = calib_curve_locfit_simuls |> 
    filter(scale_parameter %in% c(1/3, 1, 3)), 
  method = "locfit_1"
)
Figure 2.33: Calibration Curves Calculated with Scores Recalibrated Using Local Regression (with degree 1). The curves are obtained with local regression, for the calibration set (orange) and for the test set (green), for varying values of \(\alpha\) and \(\gamma\). The curves are the average values obtained over 200 replications of the simulations; the bands correspond to the 2.5% and 97.5% quantiles across replications. The histograms on top of each graph show the distribution of the uncalibrated scores and that of the calibrated scores.

plot_calib_locfit_simuls_2(
  calib_curve = calib_curve_locfit_simuls |> 
    filter(scale_parameter %in% c(1/3, 1, 3)), 
  method = "locfit_2"
)
Figure 2.34: Calibration Curves Calculated with Scores Recalibrated Using Local Regression (with degree 2). The curves are obtained with local regression, for the calibration set (orange) and for the test set (green), for varying values of \(\alpha\) and \(\gamma\). The curves are the average values obtained over 200 replications of the simulations; the bands correspond to the 2.5% and 97.5% quantiles across replications. The histograms on top of each graph show the distribution of the uncalibrated scores and that of the calibrated scores.
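
The bands drawn by plot_calib_locfit_simuls_2() are pointwise summaries across replications: at each grid point, the function computes the mean and the 2.5% and 97.5% quantiles of the fitted values. The base R sketch below reproduces that summary on toy data. The toy curves (a noisy diagonal) and the object names toy_curves and bands are illustrative assumptions, not objects from the simulations.

```r
set.seed(123)
grid <- seq(0, 1, by = .25)
n_repl <- 200

# Toy "calibration curves": one noisy estimate of the diagonal per replication
toy_curves <- data.frame(
  seed = rep(seq_len(n_repl), each = length(grid)),
  xlim = rep(grid, times = n_repl)
)
toy_curves$locfit_pred <- toy_curves$xlim + rnorm(nrow(toy_curves), sd = .02)

# Pointwise mean and 2.5% / 97.5% quantiles across replications
bands <- do.call(rbind, lapply(split(toy_curves, toy_curves$xlim), function(d) {
  data.frame(
    xlim  = d$xlim[1],
    mean  = mean(d$locfit_pred),
    lower = unname(quantile(d$locfit_pred, probs = .025)),
    upper = unname(quantile(d$locfit_pred, probs = .975))
  )
}))
bands
```

The band therefore reflects the variability of the estimated curve from one simulated dataset to another, at each value of the score.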

2.7.3 Calibration Curve with Moving Average

Let us plot the calibration curve obtained with moving average, for the 200 replications.

# For alpha
methods <- names(simul_recalib_alpha[[1]]$res_recalibration)
methods <- c("True Prob.", "No Calibration", methods)
calib_curve_ma_simuls_alpha <- vector(mode = "list", length = length(methods))
names(calib_curve_ma_simuls_alpha) <- methods
library(future)
nb_cores <- future::availableCores()-1
plan(multisession, workers = nb_cores)

for (i_method in 1:length(methods)) {
  progressr::with_progress({
    p <- progressr::progressor(steps = nrow(grid_alpha))
    calib_curve_ma_simuls_m <- furrr::future_map(
      .x = simul_recalib_alpha,
      .f = ~{
        p()
        calibration_curve_ma_simul(
          simul = .x, method = methods[i_method],
          nn = .15, 
          prob = .95, 
          ci_method = "probit"
        )
      },
      .options = furrr::furrr_options(seed = FALSE)
    )
  })
  # Add value for alpha
  for (j in 1:length(calib_curve_ma_simuls_m)) {
    calib_curve_ma_simuls_m[[j]]$scale_parameter <- grid_alpha$alpha[j]
  }
  calib_curve_ma_simuls_alpha[[i_method]] <- calib_curve_ma_simuls_m |> 
    list_rbind(names_to = "i_row")
}
calib_curve_ma_simuls_alpha <- list_rbind(
  calib_curve_ma_simuls_alpha, names_to = "method"
) |> 
  mutate(type = "alpha")
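
For intuition, a moving-average calibration curve can be sketched as follows. This is not the implementation of calibration_curve_ma_simul(), which is defined earlier in the chapter; it is a simplified illustration that assumes nn is the fraction of observations used as nearest neighbours around each grid point, and it omits the confidence intervals. The helper name ma_calib_curve and the toy data are hypothetical.

```r
# Hypothetical helper: nearest-neighbour moving average of the outcome
ma_calib_curve <- function(obs, scores, nn = .15,
                           grid = seq(0, 1, by = .01)) {
  k <- max(1, ceiling(nn * length(scores)))  # number of neighbours used
  mean_obs <- sapply(grid, function(x0) {
    idx <- order(abs(scores - x0))[seq_len(k)]  # k closest scores to x0
    mean(obs[idx])                              # local event frequency
  })
  data.frame(xlim = grid, mean = mean_obs)
}

# Toy data: scores equal to the true probabilities (well calibrated)
set.seed(1)
p <- runif(1000)
d <- rbinom(1000, size = 1, prob = p)
curve_toy <- ma_calib_curve(obs = d, scores = p)
head(curve_toy)
```

With well-calibrated scores, the resulting curve should stay close to the diagonal, except near 0 and 1 where the neighbourhood is one-sided.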

Let us loop over the simulations made with varying values for \(\gamma\):

# For gamma
methods <- names(simul_recalib_gamma[[1]]$res_recalibration)
# Prepend the two benchmark score definitions to the recalibration methods
methods <- c("True Prob.", "No Calibration", methods)
calib_curve_ma_simuls_gamma <- vector(mode = "list", length = length(methods))
names(calib_curve_ma_simuls_gamma) <- methods
library(future)
nb_cores <- future::availableCores()-1
plan(multisession, workers = nb_cores)

for (i_method in 1:length(methods)) {
  progressr::with_progress({
    p <- progressr::progressor(steps = nrow(grid_gamma))
    calib_curve_ma_simuls_m <- furrr::future_map(
      .x = simul_recalib_gamma,
      .f = ~{
        p()
        calibration_curve_ma_simul(
          simul = .x, method = methods[i_method],
          nn = .15, 
          prob = .95, 
          ci_method = "probit"
        )
      },
      .options = furrr::furrr_options(seed = FALSE)
    )
  })
  # Add value for gamma
  for (j in 1:length(calib_curve_ma_simuls_m)) {
    calib_curve_ma_simuls_m[[j]]$scale_parameter <- grid_gamma$gamma[j]
  }
  calib_curve_ma_simuls_gamma[[i_method]] <- calib_curve_ma_simuls_m |> 
    list_rbind(names_to = "i_row")
}
calib_curve_ma_simuls_gamma <- list_rbind(
  calib_curve_ma_simuls_gamma, names_to = "method"
) |> 
  mutate(type = "gamma")

The results are stored in a single tibble:

calib_curve_ma_simuls <- 
  calib_curve_ma_simuls_alpha |> 
  bind_rows(calib_curve_ma_simuls_gamma) |> 
  mutate(
    sample = factor(
      sample, 
      levels = c("Calibration", "Test"), labels = c("Calibration", "Test")
    )
  )

Let us define a function to plot the calibration curves on the calibration and the test samples.

plot_calib_ma_simuls <- function(method, type) {
  tb_calibration_curve <- calib_curve_ma_simuls |> 
    filter(
      method == !!method,
      type == !!type
    )
  if (type == "alpha") {
    count_scores <- count_scores_alpha |> filter(method == !!method)
  } else if (type == "gamma") {
    count_scores <- count_scores_gamma |> filter(method == !!method)
  } else {
    stop("Type should be either \"alpha\" or \"gamma\"")
  }
  scale_params <- unique(tb_calibration_curve$scale_parameter)
  seeds <- unique(tb_calibration_curve$seed)
  colours <- c("Calibration" = "#D55E00", "Test" = "#009E73")
  
  nb <- length(scale_params)
  mat <- mat_init <- matrix(c(1:4), ncol = 2)
  for (j in seq_len(nb - 1)) {
    mat <- rbind(mat, mat_init + j * 4)
  }
  layout(mat, heights = rep(c(1, 3), nb))
  
  y_lim <- c(0, 1)
  
  for (scale_param in scale_params) {
    title <- str_c("$\\", type, " = $", round(scale_param, 2))
    
    for (sample in c("Calibration", "Test")) {
      
      if (sample == "Calibration"){
        n_bins_current <- count_scores |> 
          filter(transform_scale == !!scale_param) |> pull("n_bins_calib")
        n_bins_c_current <- count_scores |> 
          filter(transform_scale == !!scale_param) |> pull("n_bins_c_calib")
      } else {
        n_bins_current <- count_scores |> 
          filter(transform_scale == !!scale_param) |> pull("n_bins_test")
        n_bins_c_current <- count_scores |> 
          filter(transform_scale == !!scale_param) |> pull("n_bins_c_test")
      }
      
      par(mar = c(0.5, 4.3, 1.0, 0.5))
      y_lim <- range(c(n_bins_current, n_bins_c_current), na.rm = TRUE)
      barplot(
        n_bins_current,
        col = adjustcolor("#000000", alpha.f = .3),
        ylim = y_lim,
        border = "white",
        axes = FALSE,
        xlab = "", ylab = "", 
        main = latex2exp::TeX(title)
      )
      barplot(
        n_bins_c_current,
        col = adjustcolor("#0072B2", alpha.f = .3),
        border = "white",
        axes = FALSE,
        xlab = "", ylab = "", main = "",
        add = TRUE
      )
      par(mar = c(4.1, 4.3, 0.5, 0.5))
      tb_current <- 
        tb_calibration_curve |> 
        filter(
          scale_parameter == !!scale_param,
          sample == !!sample
        ) |> 
        group_by(type, scale_parameter, xlim) |> 
        summarise(
          lower = quantile(mean, probs = .025),
          upper = quantile(mean, probs = .975),
          mean = mean(mean),
          .groups = "drop"
        )
      plot(
        tb_current$xlim, tb_current$mean,
        type = "l", col = colours[sample],
        xlim = 0:1, ylim = 0:1,
        xlab = "Predicted probability", ylab = "Mean predicted probability",
        main = ""
      )
      polygon(
        c(tb_current$xlim, rev(tb_current$xlim)),
        c(tb_current$lower, rev(tb_current$upper)),
        col = adjustcolor(col = colours[sample], alpha.f = .4),
        border = NA
      )
      segments(0, 0, 1, 1, col = "black", lty = 2)
    }
  }
}
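
To see how the panel is arranged, the layout matrix built inside plot_calib_ma_simuls() can be inspected on its own. The sketch below assumes three scale parameters (nb = 3); the value of nb is an illustrative choice.

```r
nb <- 3  # assumed number of scale parameters (illustrative)

# One 2 x 2 block per scale parameter:
# histogram above curve, calibration set on the left, test set on the right
mat <- mat_init <- matrix(1:4, ncol = 2)
for (j in seq_len(nb - 1)) {
  mat <- rbind(mat, mat_init + j * 4)
}
heights <- rep(c(1, 3), nb)  # short rows for histograms, tall rows for curves
mat
```

Within each block, plots are filled in the order histogram then curve for the calibration set (first column), followed by histogram then curve for the test set (second column), which matches the order of the loops in the function.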

The figures below show a panel of graphs with the calibration curves obtained with the moving average method. Each tab shows the average curve obtained over the 200 replications for a given recalibration method, as well as the average of the 95% confidence intervals computed on each simulation. The first two tabs (True Prob. and No Calibration) show the curves obtained using the true probabilities \(p\) and the uncalibrated probabilities \(p^u\), instead of the recalibrated probabilities \(p^c\). Each row of the panel in the figures corresponds to a value of either \(\alpha\) or \(\gamma\) used to transform \(p\) into \(p^u\). The left column shows the calibration curve obtained on the calibration set, whereas the right column shows the calibration curve obtained on the test set. The average distributions (computed over the 200 simulations) of the uncalibrated scores and of the calibrated scores are shown in the histograms on top of each graph.

Calibration curves obtained with recalibrated scores. The curves are obtained with the moving average method for the calibration set (left) and for the test set (right) for varying values of \(\alpha\) over 200 replications of the simulations. The histograms on top of each graph show the distribution of the uncalibrated scores and that of the calibrated scores.

Calibration curves obtained with recalibrated scores. The curves are obtained with the moving average method for the calibration set (left) and for the test set (right) for varying values of \(\gamma\) over 200 replications of the simulations. The histograms on top of each graph show the distribution of the uncalibrated scores and that of the calibrated scores.

2.8 Boxplots of the Metrics

We now focus on the calibration metrics.

calib_metrics_simul_alpha <- map(simul_recalib_alpha, "calib_metrics") |> 
  list_rbind()
calib_metrics_simul_gamma <- map(simul_recalib_gamma, "calib_metrics") |> 
  list_rbind()

Let us put all the metrics computed for all the simulations in a single tibble.

calib_metrics_simul <- calib_metrics_simul_alpha |> 
  bind_rows(calib_metrics_simul_gamma) |> 
  pivot_longer(
    cols = c(mse, brier, ece, qmse, wmse, lcs),
    names_to = "metric", values_to = "value"
  ) |> 
  mutate(
    metric = factor(
      metric,
      levels = c("mse", "brier", "ece", "qmse", "wmse", "lcs"),
      labels = c("True MSE", "Brier Score", "ECE", "QMSE", "WMSE", "LCS")
    ),
    method = factor(
      method,
      levels = c(
        "True Prob.", "No Calibration", 
        "platt", "isotonic", "beta", "locfit_0", "locfit_1", "locfit_2"),
      labels = c(
        "True Prob.", "No Calibration", 
        "Platt Scaling", "Isotonic Reg.", "Beta Calib.",
        "Local Reg. (deg = 0)", "Local Reg. (deg = 1)", "Local Reg. (deg = 2)"
      )
    ),
    sample = factor(
      sample,
      levels = c("Calibration", "Test")
    )
  )
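
As a reminder of what two of these metrics measure, here is a minimal sketch on toy data. It assumes that True MSE is the mean squared gap between the score and the true probability, and that the Brier score is the mean squared gap between the score and the observed outcome. The helper names true_mse and brier, and the cubic transform used to build toy uncalibrated scores, are illustrative assumptions.

```r
set.seed(42)
n <- 1000
p <- runif(n)                       # true probabilities
d <- rbinom(n, size = 1, prob = p)  # observed binary outcomes
p_u <- p^3                          # toy uncalibrated scores (arbitrary transform)

# Hypothetical helpers, written under the assumptions stated above
true_mse <- function(p, score) mean((p - score)^2)
brier    <- function(d, score) mean((d - score)^2)

c(true_prob = true_mse(p, p), uncalibrated = true_mse(p, p_u))
c(true_prob = brier(d, p),    uncalibrated = brier(d, p_u))
```

True MSE requires the true probabilities and is therefore only available in a simulation setting, whereas the Brier score only needs the observed outcomes.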

Then, we create a function, plot_boxplot_metric(), to draw boxplots of a metric for each value of \(\alpha\) or \(\gamma\) (x-axis). The y-axis shows the values of the desired metric. Each panel of the figure uses a specific predicted score:

  • True Prob.: \(p^c := p\) the true probabilities from the DGP
  • No Calibration: \(p^c := p^u\) the transformed probabilities
  • Platt Scaling: \(p^c := g^{\text{platt}}(p^u)\) scores recalibrated using Platt Scaling
  • Isotonic Reg.: \(p^c := g^{\text{iso}}(p^u)\) scores recalibrated using isotonic regression
  • Beta Calib.: \(p^c := g^{\text{beta}}(p^u)\) scores recalibrated using beta calibration
  • Local Reg. (deg = 0): \(p^c := g^{\text{locfit}}(p^u, 0)\) scores recalibrated using local regression with degree 0
  • Local Reg. (deg = 1): \(p^c := g^{\text{locfit}}(p^u, 1)\) scores recalibrated using local regression with degree 1
  • Local Reg. (deg = 2): \(p^c := g^{\text{locfit}}(p^u, 2)\) scores recalibrated using local regression with degree 2.
plot_boxplot_metric <- function(metric, 
                                calib_metrics_simul,
                                type) {
  data_plot <- calib_metrics_simul |>
    filter(metric == !!metric, type == !!type) |> 
    arrange(transform_scale)
  
  methods <- levels(data_plot$method)
  labels_y <- unique(data_plot$transform_scale) |> round(2)
  
  
  par(mfrow = c(4,2))
  for (method in methods) {
    data_plot_current <- data_plot |> filter(method == !!method)
    # par(mar = c(2.1, 12.1, 2.1, 2.1))
    par(mar = c(3.1, 4.1, 2.1, 2.1))
    boxplot(
      value ~ sample + transform_scale,
      data = data_plot_current,
      col = c("#D55E00", "#009E73"),
      horizontal = FALSE,
      main = method,
      las = 1, xlab = "", ylab = "",
      xaxt = "n"
    )
    # ind_benchmark <- which(labels_y == 1)
    labs_y <- str_c("$\\", type, "=", labels_y, "$")
    # labs_y[ind_benchmark] <- str_c(labs_y[ind_benchmark], " (benchmark)")
    axis(
      side = 1, at = seq(1, 2*length(labels_y), by = 2) + .5, 
      labels = latex2exp::TeX(labs_y),
      las = 1
      # col.axis = "black"
    )
    # # Horizontal lines
    # for (i in seq(1, 2*(length(labels_y)-1), by = 2) + 1.5) {
    #   abline(h = i, lty = 1, col = "gray")
    # }
    # Vertical lines
    for (i in seq(1, 2*(length(labels_y)-1), by = 2) + 1.5) {
      abline(v = i, lty = 1, col = "gray")
    }
  }
}
Distribution of True MSE computed over the 200 replications, on the calibration set and on the test set

Distribution of Brier Score computed over the 200 replications, on the calibration set and on the test set

Distribution of ECE computed over the 200 replications, on the calibration set and on the test set

Distribution of QMSE computed over the 200 replications, on the calibration set and on the test set

Distribution of WMSE computed over the 200 replications, on the calibration set and on the test set

Distribution of LCS computed over the 200 replications, on the calibration set and on the test set

We can visualize this in a different way.

boxplot_std_metrics_calib <- function(tb_calib_metrics, metric, x_lim = NULL) {
  scale_parameters <- unique(tb_calib_metrics$transform_scale)
  nb <- length(scale_parameters)
  mat <- mat_init <- matrix(c(1:(2*nb), rep(2*nb+1, nb)), ncol = nb, byrow = TRUE)
  
  # colours_calib <- c(
  #   "#332288", 
  #   # "#117733", 
  #   "#44AA99", "#88CCEE",
  #   "#DDCC77", "#CC6677", "#AA4499", "#882255") |> rev()
  
  colours_names <- c(
    "True Prob.",
    "No Calibration", "Platt", "Isotonic", "Beta", 
    "Locfit (deg=0)", "Locfit (deg=1)", "Locfit (deg=2)"
  )
  colours_calib <- colours_legend <- c(
    "#332288", "#117733", "#44AA99", "#88CCEE",
    "#DDCC77", "#CC6677", "#AA4499", "#882255"
  )
  colours_test <- adjustcolor(colours_calib, alpha.f = .5)
  
  colours <- NULL
  for (k in 1:length(colours_calib)) 
    colours <- c(colours, colours_calib[k], colours_test[k])
  
  
  layout(mat, heights = c(.45, .45, .15))
  
  for (type in unique(tb_calib_metrics$type)) {
    for (i_scale in 1:length(scale_parameters)) {
      scale_parameter <- scale_parameters[i_scale]
      tb_metrics_current <- tb_calib_metrics |> 
        filter(
          transform_scale == !!scale_parameter, 
          type == !!type,
          metric == !!metric
        ) |> 
        mutate(
          method = fct_rev(fct_drop(method))
        )
      title <- latex2exp::TeX(
        str_c("$\\", type, " = ", round(scale_parameter, 2), "$")
      )
      methods_bp <- tb_metrics_current$method |> levels()
      
      ind_colours <- match(methods_bp, colours_names)
      
      colours <- NULL
      for (k in 1:length(colours_calib[ind_colours])) 
        colours <- c(colours, colours_calib[ind_colours][k], 
                     colours_test[ind_colours][k])
      
      
      form <- str_c("value~sample + method")
      par(mar = c(1.5, 1.5, 3.1, 1))
      boxplot(
        formula(form), 
        data = tb_metrics_current |> 
          mutate(
            sample = fct_rev(sample)
            # method = fct_rev(method)
          ),
        xlab = "",
        ylab = "",
        main = title,
        horizontal = TRUE,
        las = 1,
        col = colours,
        ylim = x_lim,
        border = c("black", adjustcolor("black", alpha.f = .5)),
        # sep = ", "
        yaxt = "n"
      )
      # Horizontal lines
      for (i in seq(3, length(methods_bp) * 2, by = 2) - .5) {
        abline(h = i, lty = 1, col = "gray")
      }
    }
  }
  par(mar = c(0, 4.3, 0, 4.3))
  plot.new()
  legend(
    "center", 
    legend = colours_names,
    fill = colours_legend,
    # lwd = 2,
    xpd = TRUE, ncol = 4
  )
}
# Standard Metrics
standard_metrics <- recalib_metrics_alpha |> 
  bind_rows(recalib_metrics_gamma) |> 
  # Without recalibration
  bind_rows(
    metrics_alpha |> 
      mutate(method = "No Calibration")
  ) |> 
  bind_rows(
    metrics_gamma |> 
      mutate(method = "No Calibration")
  ) |> 
  filter(threshold == .5) |> 
  rename(transform_scale = scale_parameter) |> 
  select(
    sample, seed, transform_scale, type, method, 
    mse, accuracy, sensitivity, specificity, auc
  ) |> 
  pivot_longer(
    cols = c(mse, accuracy, sensitivity, specificity, auc), 
    names_to = "metric",
    values_to = "value"
  )

# Calibration metrics
calib_metrics_simul_alpha <- map(simul_recalib_alpha, "calib_metrics") |> 
  list_rbind()
calib_metrics_simul_gamma <- map(simul_recalib_gamma, "calib_metrics") |> 
  list_rbind()

calib_metrics_simul <- calib_metrics_simul_alpha |> 
  bind_rows(calib_metrics_simul_gamma) |> 
  pivot_longer(
    cols = c(mse, brier, ece, qmse, wmse, lcs),
    names_to = "metric", values_to = "value"
  ) |> 
  mutate(
    metric = factor(
      metric,
      levels = c("mse", "brier", "ece", "qmse", "wmse", "lcs"),
      labels = c("True MSE", "Brier Score", "ECE", "QMSE", "WMSE", "LCS")
    ),
    sample = case_when(
      sample == "Calibration"~"calibration",
      sample == "Test"~"test",
      TRUE~NA_character_
    )
  )

metrics_all <- 
  calib_metrics_simul |> 
  bind_rows(standard_metrics) |> 
  mutate(
    method = factor(
      method,
      levels = c("True Prob.", "No Calibration",
                 "platt", "isotonic", "beta", "locfit_0", "locfit_1", "locfit_2"),
      labels = c(
        "True Prob.",
        "No Calibration", "Platt", "Isotonic", "Beta", 
        "Locfit (deg=0)", "Locfit (deg=1)", "Locfit (deg=2)")
    ),
    sample = factor(
      sample, levels = c("calibration", "test"), labels = c("Calibration", "Test")
    )
  )
boxplot_std_metrics_calib(
  tb_calib_metrics = metrics_all |> 
    filter(transform_scale %in% c(1/3, 1, 3)),
  metric = "True MSE"
)
Figure 2.35: True MSE on 200 Simulations for each Value of \(\alpha\) (top) or \(\gamma\) (bottom), on the Calibration (transparent colors) and on the Test Set (full colors). The metrics are computed for different definitions of the scores: using the true probabilities, the non calibrated scores, or the recalibrated scores.
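
The colour assignment in boxplot_std_metrics_calib() relies on the order in which boxplot() arranges groups for a formula with two factors: the first factor (sample) varies fastest within each level of the second (method). The sketch below checks this ordering on toy data without drawing anything; the data frame toy and its values are illustrative assumptions.

```r
set.seed(1)
toy <- expand.grid(
  sample = factor(c("Calibration", "Test")),
  method = factor(c("Platt", "Isotonic"), levels = c("Platt", "Isotonic")),
  rep = 1:5
)
toy$value <- runif(nrow(toy))

# Reverse the levels of sample, as fct_rev() does in the function
toy$sample <- factor(toy$sample, levels = rev(levels(toy$sample)))

# plot = FALSE returns the summary statistics and the group names only
bp <- boxplot(value ~ sample + method, data = toy, plot = FALSE)
bp$names
```

Because the groups alternate between the test and calibration samples for each method, a colour vector that interleaves a full colour and its transparent version, as built in the function, gives each method one hue with two opacities.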

boxplot_std_metrics_calib(
  tb_calib_metrics = metrics_all |> 
    filter(transform_scale %in% c(1/3, 1, 3)),
  metric = "LCS"
)
Figure 2.36: LCS on 200 Simulations for each Value of \(\alpha\) (top) or \(\gamma\) (bottom), on the Calibration (transparent colors) and on the Test Set (full colors). The metrics are computed for different definitions of the scores: using the true probabilities, the non calibrated scores, or the recalibrated scores.

boxplot_std_metrics_calib(
  tb_calib_metrics = metrics_all |> 
    filter(transform_scale %in% c(1/3, 1, 3)),
  metric = "Brier Score"
)
Figure 2.37: Brier Score on 200 Simulations for each Value of \(\alpha\) (top) or \(\gamma\) (bottom), on the Calibration (transparent colors) and on the Test Set (full colors). The metrics are computed for different definitions of the scores: using the true probabilities, the uncalibrated scores, or the recalibrated scores.

boxplot_std_metrics_calib(
  tb_calib_metrics = metrics_all |> 
    filter(transform_scale %in% c(1/3, 1, 3)),
  metric = "ECE"
)
Figure 2.38: ECE on 200 Simulations for each Value of \(\alpha\) (top) or \(\gamma\) (bottom), on the Calibration (transparent colors) and on the Test Set (full colors). The metrics are computed for different definitions of the scores: using the true probabilities, the uncalibrated scores, or the recalibrated scores.

boxplot_std_metrics_calib(
  tb_calib_metrics = metrics_all |> 
    filter(transform_scale %in% c(1/3, 1, 3)),
  metric = "WMSE"
)
Figure 2.39: WMSE on 200 Simulations for each Value of \(\alpha\) (top) or \(\gamma\) (bottom), on the Calibration (transparent colors) and on the Test Set (full colors). The metrics are computed for different definitions of the scores: using the true probabilities, the uncalibrated scores, or the recalibrated scores.

boxplot_std_metrics_calib(
  tb_calib_metrics = metrics_all |> 
    filter(transform_scale %in% c(1/3, 1, 3)),
  metric = "accuracy"
)
Figure 2.40: Accuracy on 200 Simulations for each Value of \(\alpha\) (top) or \(\gamma\) (bottom), on the Calibration (transparent colors) and on the Test Set (full colors). The metrics are computed for different definitions of the scores: using the true probabilities, the uncalibrated scores, or the recalibrated scores.

boxplot_std_metrics_calib(
  tb_calib_metrics = metrics_all |> 
    filter(transform_scale %in% c(1/3, 1, 3)),
  metric = "sensitivity"
)
Figure 2.41: Sensitivity on 200 Simulations for each Value of \(\alpha\) (top) or \(\gamma\) (bottom), on the Calibration (transparent colors) and on the Test Set (full colors). The metrics are computed for different definitions of the scores: using the true probabilities, the uncalibrated scores, or the recalibrated scores.

boxplot_std_metrics_calib(
  tb_calib_metrics = metrics_all |> 
    filter(transform_scale %in% c(1/3, 1, 3)),
  metric = "specificity"
)
Figure 2.42: Specificity on 200 Simulations for each Value of \(\alpha\) (top) or \(\gamma\) (bottom), on the Calibration (transparent colors) and on the Test Set (full colors). The metrics are computed for different definitions of the scores: using the true probabilities, the uncalibrated scores, or the recalibrated scores.

boxplot_std_metrics_calib(
  tb_calib_metrics = metrics_all |> 
    filter(transform_scale %in% c(1/3, 1, 3)),
  metric = "auc"
)
Figure 2.43: AUC on 200 Simulations for each Value of \(\alpha\) (top) or \(\gamma\) (bottom), on the Calibration (transparent colors) and on the Test Set (full colors). The metrics are computed for different definitions of the scores: using the true probabilities, the uncalibrated scores, or the recalibrated scores.

2.9 Summary Tables

Let us now report the results in tables. For a given transformation of the probabilities, each table shows the computed metrics (in columns) for each type of predicted probability (in rows): the true probabilities, the transformed probabilities without recalibration, and the transformed probabilities after applying one of the recalibration methods.

table_calib_metrics_all <- 
  calib_metrics_simul |> 
  group_by(method, sample, transform_scale, type, metric) |> 
  summarise(
    value_mean = mean(value, na.rm = TRUE),
    value_sd = sd(value, na.rm = TRUE),
    .groups = "drop"
  ) |> 
  mutate(value_mean = round(value_mean, 4), value_sd = round(value_sd, 4)) |> 
  mutate(value = str_c(value_mean, " (", value_sd, ")")) |> 
  select(-value_mean, -value_sd) |> 
  pivot_wider(
    names_from = c(metric, sample), values_from = value, 
    names_sort = TRUE
  )
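The strings built above rely on the default conversion of numbers to text, which is why small values appear in scientific notation (e.g., `9e-04`) in the tables. As a minimal sketch, on made-up values and not the formatting used for the tables of this chapter, `formatC()` could be used to force a fixed notation with four decimals. The object `toy` and its values are hypothetical.

```r
library(tibble)
library(dplyr)
library(stringr)

# Toy summary values (hypothetical, for illustration only)
toy <- tibble(
  method     = c("No Calibration", "platt"),
  value_mean = c(0.0830, 0.0005),
  value_sd   = c(0.0009, 0.0004)
)

toy_fmt <- toy |>
  mutate(
    # Default conversion, as in the chunk above: small values may be
    # written in scientific notation
    value_default = str_c(round(value_mean, 4), " (", round(value_sd, 4), ")"),
    # Fixed notation with exactly four decimals
    value_fixed = str_c(
      formatC(value_mean, format = "f", digits = 4),
      " (", formatC(value_sd, format = "f", digits = 4), ")"
    )
  )

toy_fmt$value_default
toy_fmt$value_fixed
```

With a fixed number of decimals, all cells have the same width, which makes the columns easier to compare visually.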

We define a function to print a table depending on the transformation applied to the probabilities (varying either \(\alpha\) or \(\gamma\)).

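#' Prints the table of calibration metrics for a given transformation
#'
#' @param transform_scale value of the scale parameter (alpha or gamma)
#' @param type name of the transformed parameter, as stored in column `type`
print_table <- function(transform_scale, type) {
  table_calib_metrics_all |> 
    # `!!` unquotes the function arguments; a bare `type == type` would compare
    # the column with itself and keep the rows of both transformation types
    filter(transform_scale == !!transform_scale, type == !!type) |>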
    select(-type, -transform_scale) |> 
    mutate(
      across(
        -method, 
        ~kableExtra::cell_spec(
          .x, 
          color = ifelse(
            .x == max(.x), yes = "#882255",
            no = ifelse(.x == min(.x), "#44AA99", "black")
          ),
          bold = ifelse(
            .x == max(.x), yes = TRUE,
            no = ifelse(.x == min(.x), TRUE, FALSE)
          )
        )
      )
    ) |> 
    knitr::kable(
      escape = FALSE, booktabs = TRUE, digits = 4,
      format = "html",
      col.names = c("Calibration Method", rep(c("Calib.", "Test"), 6))
    ) |>
    kableExtra::kable_styling() |>
    kableExtra::add_header_above(
      c(
        "", "MSE (True)" = 2, "Brier" = 2, "ECE" = 2, "QMSE" = 2, "WMSE" = 2, "LCS" = 2
      )
    )
}
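Because the arguments of `print_table()` share their names with columns of the table, it matters how names are resolved inside `filter()`: a bare name refers to the column first. The following sketch illustrates this on a toy table. The values are hypothetical, the helper names `keep_naive()` and `keep_env()` are made up for the illustration, and we assume that `type` takes the values `"alpha"` and `"gamma"`.

```r
library(tibble)
library(dplyr)

# Toy table (hypothetical values)
toy <- tibble(
  type            = c("alpha", "gamma", "alpha", "gamma"),
  transform_scale = c(1, 1, 3, 3),
  value           = c(0.1, 0.2, 0.3, 0.4)
)

keep_naive <- function(tb, transform_scale, type) {
  # `type == type` compares the column with itself: always TRUE
  tb |> filter(transform_scale == !!transform_scale, type == type)
}

keep_env <- function(tb, transform_scale, type) {
  # `.env$` looks names up in the calling environment, not in the data
  tb |> filter(transform_scale == .env$transform_scale, type == .env$type)
}

n_naive <- nrow(keep_naive(toy, 3, "alpha")) # rows of both types are kept
n_env   <- nrow(keep_env(toy, 3, "alpha"))   # only the "alpha" row is kept
c(n_naive, n_env)
```

Using the `.env` pronoun (or unquoting with `!!`) makes explicit that the value comes from the function argument rather than from the column of the same name.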
Table 2.1: Average value of the calibration metrics (in columns) over the 200 replications, for \(\alpha=0.33\), computed on the calibration set and on the test set, using different predicted probabilities (in rows). Standard deviations are given in parentheses.
Metrics (two columns each, Calib. and Test): MSE (True), Brier, ECE, QMSE, WMSE, LCS
Calibration Method Calib. Test Calib. Test Calib. Test Calib. Test Calib. Test Calib. Test
No Calibration 0.083 (9e-04) 0.0831 (0.001) 0.3186 (0.0083) 0.3185 (0.0098) 0.2824 (0.0136) 0.2826 (0.0166) 0.0847 (0.0079) 0.0859 (0.0096) 0.0765 (0.0092) 0.0777 (0.0116) 0.0801 (0.0093) 0.0827 (0.0118)
No Calibration 0.006 (2e-04) 0.006 (3e-04) 0.2418 (0.0012) 0.2416 (0.0013) 0.0998 (0.0123) 0.1035 (0.0136) 0.0078 (0.0022) 0.009 (0.0027) 0.007 (0.0015) 0.0083 (0.0018) 0.0067 (0.0027) 0.0079 (0.0034)
True Prob. 0 (0) 0 (0) 0.2359 (0.0033) 0.2355 (0.0037) 0.1056 (0.0105) 0.1113 (0.0126) 0.002 (8e-04) 0.003 (0.0014) 0.0065 (0.0017) 0.0079 (0.0026) 0.003 (0.0012) 0.0035 (0.0017)
True Prob. 0 (0) 0 (0) 0.2359 (0.0033) 0.2355 (0.0037) 0.1056 (0.0105) 0.1113 (0.0126) 0.002 (8e-04) 0.003 (0.0014) 0.0065 (0.0017) 0.0079 (0.0026) 0.003 (0.0012) 0.0035 (0.0017)
beta 5e-04 (4e-04) 5e-04 (4e-04) 0.2354 (0.0034) 0.236 (0.0037) 0.1032 (0.0173) 0.1128 (0.0153) 0.0015 (7e-04) 0.0034 (0.0017) 0.006 (0.001) 0.0082 (0.003) 0.0024 (6e-04) 0.0039 (0.0021)
beta 4e-04 (4e-04) 4e-04 (4e-04) 0.2354 (0.0034) 0.2359 (0.0037) 0.1031 (0.0174) 0.1126 (0.0153) 0.0016 (8e-04) 0.0034 (0.0017) 0.006 (0.001) 0.0082 (0.003) 0.0025 (7e-04) 0.0038 (0.0021)
isotonic 0.0019 (6e-04) 0.0019 (7e-04) 0.2314 (0.0036) 0.2373 (0.0043) 0.0935 (0.0185) 0.1134 (0.0164) 0 (0) 0.0033 (0.0018) 0.006 (0.0012) 0.0102 (0.0036) 0.0053 (0.0015) 0.0085 (0.0037)
isotonic 0.0019 (6e-04) 0.0019 (7e-04) 0.2314 (0.0036) 0.2373 (0.0043) 0.0935 (0.0185) 0.1134 (0.0164) 0 (0) 0.0033 (0.0018) 0.006 (0.0012) 0.0102 (0.0036) 0.0053 (0.0015) 0.0085 (0.0037)
locfit_0 0.0017 (7e-04) 0.0017 (7e-04) 0.235 (0.0033) 0.2372 (0.0036) 0.0998 (0.0189) 0.1093 (0.0156) 0.0014 (6e-04) 0.0039 (0.0016) 0.005 (0.0011) 0.0082 (0.0031) 6e-04 (3e-04) 0.0026 (0.0015)
locfit_0 0.0017 (7e-04) 0.0017 (7e-04) 0.235 (0.0033) 0.2372 (0.0036) 0.0999 (0.0179) 0.1097 (0.0163) 0.0014 (7e-04) 0.0039 (0.0018) 0.005 (0.0011) 0.0082 (0.0031) 6e-04 (3e-04) 0.0026 (0.0015)
locfit_1 0.0015 (6e-04) 0.0016 (6e-04) 0.2342 (0.0034) 0.2369 (0.004) 0.1024 (0.0188) 0.1136 (0.016) 0.001 (6e-04) 0.0037 (0.0016) 0.006 (0.0013) 0.0093 (0.0036) 0.0029 (0.0017) 0.0053 (0.0034)
locfit_1 0.0015 (6e-04) 0.0015 (6e-04) 0.2342 (0.0034) 0.237 (0.004) 0.1021 (0.0177) 0.1135 (0.0164) 0.001 (6e-04) 0.0037 (0.0017) 0.006 (0.0012) 0.0094 (0.0035) 0.0031 (0.0018) 0.0055 (0.0033)
locfit_2 0.003 (9e-04) 0.0031 (0.001) 0.2326 (0.0034) 0.2384 (0.0044) 0.106 (0.0174) 0.1154 (0.0165) 9e-04 (6e-04) 0.0042 (0.0019) 0.0054 (0.001) 0.0108 (0.0038) 0.0029 (0.0015) 0.0066 (0.0036)
locfit_2 0.003 (0.001) 0.0031 (0.001) 0.2327 (0.0034) 0.2383 (0.0044) 0.1057 (0.0178) 0.1154 (0.0165) 9e-04 (5e-04) 0.0042 (0.0021) 0.0054 (9e-04) 0.0109 (0.0039) 0.0031 (0.0015) 0.0069 (0.0037)
platt 5e-04 (4e-04) 5e-04 (4e-04) 0.2357 (0.0034) 0.236 (0.0037) 0.105 (0.017) 0.1141 (0.0146) 0.0018 (9e-04) 0.0035 (0.0017) 0.006 (0.001) 0.0081 (0.0029) 0.0024 (8e-04) 0.0036 (0.0021)
platt 4e-04 (4e-04) 4e-04 (4e-04) 0.2355 (0.0033) 0.2359 (0.0037) 0.1033 (0.0173) 0.1129 (0.0153) 0.0017 (8e-04) 0.0034 (0.0016) 0.006 (0.001) 0.0081 (0.003) 0.0025 (7e-04) 0.0037 (0.0021)
Table 2.2: Average value of the calibration metrics (in columns) over the 200 replications, for \(\alpha=0.67\), computed on the calibration set and on the test set, using different predicted probabilities (in rows). Standard deviations are given in parentheses.
Metrics (two columns each, Calib. and Test): MSE (True), Brier, ECE, QMSE, WMSE, LCS
Calibration Method Calib. Test Calib. Test Calib. Test Calib. Test Calib. Test Calib. Test
No Calibration 0.0157 (1e-04) 0.0157 (1e-04) 0.2515 (0.0047) 0.2512 (0.0052) 0.1313 (0.0138) 0.1312 (0.0162) 0.0176 (0.0035) 0.0187 (0.0043) 0.0196 (0.0039) 0.0211 (0.0049) 0.0163 (0.0036) 0.0172 (0.0044)
No Calibration 0.0014 (1e-04) 0.0014 (1e-04) 0.2372 (0.0023) 0.2369 (0.0026) 0.099 (0.0115) 0.1037 (0.0127) 0.0033 (0.0013) 0.0044 (0.0018) 0.005 (0.001) 0.0065 (0.0017) 0.0021 (9e-04) 0.0028 (0.0013)
True Prob. 0 (0) 0 (0) 0.2359 (0.0033) 0.2355 (0.0037) 0.1056 (0.0105) 0.1113 (0.0126) 0.002 (8e-04) 0.003 (0.0014) 0.0065 (0.0017) 0.0079 (0.0026) 0.003 (0.0012) 0.0035 (0.0017)
True Prob. 0 (0) 0 (0) 0.2359 (0.0033) 0.2355 (0.0037) 0.1056 (0.0105) 0.1113 (0.0126) 0.002 (8e-04) 0.003 (0.0014) 0.0065 (0.0017) 0.0079 (0.0026) 0.003 (0.0012) 0.0035 (0.0017)
beta 5e-04 (4e-04) 5e-04 (4e-04) 0.2354 (0.0034) 0.236 (0.0037) 0.1032 (0.0172) 0.1128 (0.0153) 0.0015 (7e-04) 0.0034 (0.0017) 0.006 (0.001) 0.0082 (0.003) 0.0024 (6e-04) 0.0039 (0.0021)
beta 4e-04 (4e-04) 4e-04 (4e-04) 0.2354 (0.0034) 0.2359 (0.0037) 0.103 (0.0173) 0.1125 (0.0152) 0.0015 (7e-04) 0.0034 (0.0017) 0.006 (0.001) 0.0082 (0.0029) 0.0024 (7e-04) 0.0039 (0.0021)
isotonic 0.0019 (6e-04) 0.0019 (7e-04) 0.2314 (0.0036) 0.2373 (0.0043) 0.0935 (0.0185) 0.1134 (0.0164) 0 (0) 0.0033 (0.0018) 0.006 (0.0012) 0.0102 (0.0036) 0.0053 (0.0015) 0.0085 (0.0037)
isotonic 0.0019 (6e-04) 0.0019 (7e-04) 0.2314 (0.0036) 0.2373 (0.0043) 0.0935 (0.0185) 0.1134 (0.0164) 0 (0) 0.0033 (0.0018) 0.006 (0.0012) 0.0102 (0.0036) 0.0053 (0.0015) 0.0085 (0.0037)
locfit_0 0.0017 (7e-04) 0.0017 (7e-04) 0.235 (0.0033) 0.2372 (0.0036) 0.0998 (0.0189) 0.1089 (0.016) 0.0014 (7e-04) 0.0038 (0.0017) 0.0051 (0.0011) 0.0082 (0.0031) 6e-04 (3e-04) 0.0026 (0.0015)
locfit_0 0.0017 (7e-04) 0.0017 (7e-04) 0.235 (0.0033) 0.2372 (0.0036) 0.0998 (0.019) 0.1096 (0.0162) 0.0014 (7e-04) 0.0039 (0.0017) 0.005 (0.0011) 0.0083 (0.003) 6e-04 (3e-04) 0.0026 (0.0015)
locfit_1 0.0015 (6e-04) 0.0015 (6e-04) 0.2342 (0.0034) 0.2369 (0.004) 0.1029 (0.0187) 0.1135 (0.0162) 0.001 (5e-04) 0.0037 (0.0016) 0.006 (0.0012) 0.0092 (0.0036) 0.0029 (0.0017) 0.0051 (0.0033)
locfit_1 0.0015 (6e-04) 0.0016 (6e-04) 0.2342 (0.0034) 0.2369 (0.004) 0.1022 (0.0183) 0.1136 (0.0164) 0.001 (6e-04) 0.0037 (0.0017) 0.006 (0.0013) 0.0093 (0.0034) 0.003 (0.0018) 0.0054 (0.0033)
locfit_2 0.003 (0.001) 0.003 (0.001) 0.2327 (0.0034) 0.2384 (0.0044) 0.1067 (0.0175) 0.115 (0.0169) 9e-04 (5e-04) 0.0041 (0.002) 0.0054 (0.001) 0.0108 (0.0039) 0.0029 (0.0015) 0.0066 (0.0037)
locfit_2 0.003 (0.001) 0.0031 (0.001) 0.2326 (0.0034) 0.2384 (0.0045) 0.1059 (0.0179) 0.1153 (0.0165) 9e-04 (5e-04) 0.0042 (0.0018) 0.0054 (9e-04) 0.0108 (0.004) 0.003 (0.0015) 0.0066 (0.0038)
platt 4e-04 (4e-04) 4e-04 (4e-04) 0.2356 (0.0034) 0.2359 (0.0037) 0.1048 (0.017) 0.1135 (0.0147) 0.0017 (8e-04) 0.0034 (0.0017) 0.006 (0.001) 0.0081 (0.003) 0.0021 (7e-04) 0.0033 (0.002)
platt 4e-04 (4e-04) 4e-04 (4e-04) 0.2355 (0.0033) 0.2359 (0.0037) 0.1036 (0.0174) 0.1132 (0.0152) 0.0017 (8e-04) 0.0034 (0.0016) 0.006 (0.001) 0.0081 (0.003) 0.0022 (7e-04) 0.0035 (0.002)
Table 2.3: Average value of the calibration metrics (in columns) over the 200 replications, for \(\alpha=1\), computed on the calibration set and on the test set, using different predicted probabilities (in rows). Standard deviations are given in parentheses.
Metrics (two columns each, Calib. and Test): MSE (True), Brier, ECE, QMSE, WMSE, LCS
Calibration Method Calib. Test Calib. Test Calib. Test Calib. Test Calib. Test Calib. Test
No Calibration 0 (0) 0 (0) 0.2359 (0.0033) 0.2355 (0.0037) 0.1056 (0.0105) 0.1113 (0.0126) 0.002 (8e-04) 0.003 (0.0014) 0.0065 (0.0017) 0.0079 (0.0026) 0.003 (0.0012) 0.0035 (0.0016)
No Calibration 0 (0) 0 (0) 0.2359 (0.0033) 0.2355 (0.0037) 0.1056 (0.0105) 0.1113 (0.0126) 0.002 (8e-04) 0.003 (0.0014) 0.0065 (0.0017) 0.0079 (0.0026) 0.003 (0.0012) 0.0035 (0.0016)
True Prob. 0 (0) 0 (0) 0.2359 (0.0033) 0.2355 (0.0037) 0.1056 (0.0105) 0.1113 (0.0126) 0.002 (8e-04) 0.003 (0.0014) 0.0065 (0.0017) 0.0079 (0.0026) 0.003 (0.0012) 0.0035 (0.0017)
True Prob. 0 (0) 0 (0) 0.2359 (0.0033) 0.2355 (0.0037) 0.1056 (0.0105) 0.1113 (0.0126) 0.002 (8e-04) 0.003 (0.0014) 0.0065 (0.0017) 0.0079 (0.0026) 0.003 (0.0012) 0.0035 (0.0017)
beta 5e-04 (4e-04) 5e-04 (4e-04) 0.2354 (0.0034) 0.236 (0.0037) 0.1031 (0.0173) 0.1127 (0.0153) 0.0015 (7e-04) 0.0034 (0.0017) 0.006 (0.001) 0.0082 (0.003) 0.0024 (6e-04) 0.0039 (0.0021)
beta 5e-04 (4e-04) 5e-04 (4e-04) 0.2354 (0.0034) 0.236 (0.0037) 0.1031 (0.0173) 0.1127 (0.0153) 0.0015 (7e-04) 0.0034 (0.0017) 0.006 (0.001) 0.0082 (0.003) 0.0024 (6e-04) 0.0039 (0.0021)
isotonic 0.0019 (6e-04) 0.0019 (7e-04) 0.2314 (0.0036) 0.2373 (0.0043) 0.0935 (0.0185) 0.1134 (0.0164) 0 (0) 0.0033 (0.0018) 0.006 (0.0012) 0.0102 (0.0036) 0.0053 (0.0015) 0.0085 (0.0037)
isotonic 0.0019 (6e-04) 0.0019 (7e-04) 0.2314 (0.0036) 0.2373 (0.0043) 0.0935 (0.0185) 0.1134 (0.0164) 0 (0) 0.0033 (0.0018) 0.006 (0.0012) 0.0102 (0.0036) 0.0053 (0.0015) 0.0085 (0.0037)
locfit_0 0.0017 (7e-04) 0.0017 (7e-04) 0.235 (0.0033) 0.2372 (0.0037) 0.099 (0.0182) 0.1092 (0.016) 0.0013 (6e-04) 0.0039 (0.0018) 0.005 (0.0011) 0.0083 (0.0031) 6e-04 (3e-04) 0.0026 (0.0016)
locfit_0 0.0017 (7e-04) 0.0017 (7e-04) 0.235 (0.0033) 0.2372 (0.0037) 0.099 (0.0182) 0.1092 (0.016) 0.0013 (6e-04) 0.0039 (0.0018) 0.005 (0.0011) 0.0083 (0.0031) 6e-04 (3e-04) 0.0026 (0.0016)
locfit_1 0.0015 (6e-04) 0.0015 (6e-04) 0.2342 (0.0034) 0.2369 (0.004) 0.1016 (0.0184) 0.1134 (0.0161) 9e-04 (6e-04) 0.0038 (0.0018) 0.0059 (0.0012) 0.0092 (0.0035) 0.0028 (0.0016) 0.005 (0.0032)
locfit_1 0.0015 (6e-04) 0.0015 (6e-04) 0.2342 (0.0034) 0.2369 (0.004) 0.1016 (0.0184) 0.1134 (0.0161) 9e-04 (6e-04) 0.0038 (0.0018) 0.0059 (0.0012) 0.0092 (0.0035) 0.0028 (0.0016) 0.005 (0.0032)
locfit_2 0.0029 (0.001) 0.003 (0.001) 0.2327 (0.0034) 0.2383 (0.0044) 0.1058 (0.0179) 0.1147 (0.0167) 9e-04 (6e-04) 0.0041 (0.0018) 0.0054 (9e-04) 0.0107 (0.0039) 0.0029 (0.0014) 0.0065 (0.0037)
locfit_2 0.0029 (0.001) 0.003 (0.001) 0.2327 (0.0034) 0.2383 (0.0044) 0.1058 (0.0179) 0.1147 (0.0167) 9e-04 (6e-04) 0.0041 (0.0018) 0.0054 (9e-04) 0.0107 (0.0039) 0.0029 (0.0014) 0.0065 (0.0037)
platt 4e-04 (4e-04) 4e-04 (4e-04) 0.2355 (0.0034) 0.2359 (0.0038) 0.104 (0.0174) 0.1136 (0.0152) 0.0017 (8e-04) 0.0034 (0.0016) 0.006 (0.001) 0.0081 (0.003) 0.0019 (6e-04) 0.0032 (0.0018)
platt 4e-04 (4e-04) 4e-04 (4e-04) 0.2355 (0.0034) 0.2359 (0.0038) 0.104 (0.0174) 0.1136 (0.0152) 0.0017 (8e-04) 0.0034 (0.0016) 0.006 (0.001) 0.0081 (0.003) 0.0019 (6e-04) 0.0032 (0.0018)
Table 2.4: Average value of the calibration metrics (in columns) over the 200 replications, for \(\alpha=1.5\), computed on the calibration set and on the test set, using different predicted probabilities (in rows). Standard deviations are given in parentheses.
Metrics (two columns each, Calib. and Test): MSE (True), Brier, ECE, QMSE, WMSE, LCS
Calibration Method Calib. Test Calib. Test Calib. Test Calib. Test Calib. Test Calib. Test
No Calibration 0.0192 (1e-04) 0.0192 (1e-04) 0.2553 (0.005) 0.2548 (0.0061) 0.1927 (0.0136) 0.1954 (0.0167) 0.0214 (0.004) 0.0222 (0.005) 0.0251 (0.0044) 0.0256 (0.0058) 0.0215 (0.0041) 0.0216 (0.0052)
No Calibration 0.0025 (1e-04) 0.0025 (1e-04) 0.2384 (0.0047) 0.2379 (0.0051) 0.1342 (0.0122) 0.1384 (0.0142) 0.0045 (0.0016) 0.0053 (0.0021) 0.0117 (0.0028) 0.0129 (0.0037) 0.008 (0.0024) 0.0084 (0.003)
True Prob. 0 (0) 0 (0) 0.2359 (0.0033) 0.2355 (0.0037) 0.1056 (0.0105) 0.1113 (0.0126) 0.002 (8e-04) 0.003 (0.0014) 0.0065 (0.0017) 0.0079 (0.0026) 0.003 (0.0012) 0.0035 (0.0017)
True Prob. 0 (0) 0 (0) 0.2359 (0.0033) 0.2355 (0.0037) 0.1056 (0.0105) 0.1113 (0.0126) 0.002 (8e-04) 0.003 (0.0014) 0.0065 (0.0017) 0.0079 (0.0026) 0.003 (0.0012) 0.0035 (0.0017)
beta 5e-04 (4e-04) 5e-04 (4e-04) 0.2354 (0.0034) 0.236 (0.0037) 0.1028 (0.0175) 0.1127 (0.0154) 0.0015 (7e-04) 0.0035 (0.0017) 0.006 (0.001) 0.0082 (0.0029) 0.0024 (6e-04) 0.004 (0.0021)
beta 5e-04 (4e-04) 5e-04 (4e-04) 0.2354 (0.0034) 0.236 (0.0037) 0.1029 (0.0176) 0.1129 (0.0155) 0.0015 (7e-04) 0.0035 (0.0017) 0.006 (0.001) 0.0082 (0.0029) 0.0023 (6e-04) 0.004 (0.0021)
isotonic 0.0019 (6e-04) 0.0019 (7e-04) 0.2314 (0.0036) 0.2373 (0.0043) 0.0935 (0.0185) 0.1134 (0.0164) 0 (0) 0.0033 (0.0018) 0.006 (0.0012) 0.0102 (0.0036) 0.0053 (0.0015) 0.0085 (0.0037)
isotonic 0.0019 (6e-04) 0.0019 (7e-04) 0.2314 (0.0036) 0.2373 (0.0043) 0.0935 (0.0185) 0.1134 (0.0164) 0 (0) 0.0033 (0.0018) 0.006 (0.0012) 0.0102 (0.0036) 0.0053 (0.0015) 0.0085 (0.0037)
locfit_0 0.0017 (7e-04) 0.0017 (7e-04) 0.235 (0.0033) 0.2372 (0.0036) 0.0999 (0.019) 0.1098 (0.0159) 0.0013 (6e-04) 0.0039 (0.0018) 0.005 (0.0011) 0.0083 (0.0031) 6e-04 (3e-04) 0.0026 (0.0015)
locfit_0 0.0016 (7e-04) 0.0017 (7e-04) 0.2349 (0.0033) 0.2371 (0.0036) 0.0995 (0.0184) 0.1099 (0.0159) 0.0013 (7e-04) 0.0039 (0.0017) 0.005 (0.0011) 0.0083 (0.0031) 6e-04 (3e-04) 0.0026 (0.0015)
locfit_1 0.0015 (6e-04) 0.0015 (6e-04) 0.2342 (0.0034) 0.237 (0.004) 0.1022 (0.0182) 0.114 (0.0163) 0.001 (5e-04) 0.0038 (0.0019) 0.0059 (0.0012) 0.0093 (0.0037) 0.0027 (0.0016) 0.005 (0.0033)
locfit_1 0.0015 (6e-04) 0.0015 (6e-04) 0.2342 (0.0034) 0.2369 (0.004) 0.1023 (0.0175) 0.1138 (0.0162) 9e-04 (5e-04) 0.0038 (0.0018) 0.006 (0.0013) 0.0092 (0.0034) 0.0024 (0.0016) 0.0047 (0.0029)
locfit_2 0.0029 (9e-04) 0.003 (0.001) 0.2327 (0.0034) 0.2382 (0.0044) 0.1062 (0.0176) 0.1157 (0.0159) 9e-04 (5e-04) 0.0042 (0.0019) 0.0054 (9e-04) 0.0108 (0.004) 0.0028 (0.0015) 0.0065 (0.0037)
locfit_2 0.0029 (0.001) 0.003 (0.001) 0.2327 (0.0034) 0.2382 (0.0044) 0.1068 (0.0181) 0.1156 (0.0173) 9e-04 (6e-04) 0.004 (0.0019) 0.0054 (9e-04) 0.0106 (0.0039) 0.0027 (0.0015) 0.0062 (0.0037)
platt 4e-04 (4e-04) 4e-04 (4e-04) 0.2356 (0.0033) 0.236 (0.0037) 0.1035 (0.0176) 0.1126 (0.0159) 0.0017 (8e-04) 0.0034 (0.0016) 0.006 (0.001) 0.0081 (0.0029) 0.0018 (6e-04) 0.0032 (0.0017)
platt 4e-04 (4e-04) 4e-04 (4e-04) 0.2356 (0.0034) 0.236 (0.0038) 0.1047 (0.0175) 0.1142 (0.0149) 0.0017 (8e-04) 0.0034 (0.0017) 0.006 (0.001) 0.0081 (0.003) 0.0015 (5e-04) 0.0028 (0.0017)
Table 2.5: Average value of the calibration metrics (in columns) over the 200 replications, for \(\alpha=3\), computed on the calibration set and on the test set, using different predicted probabilities (in rows). Standard deviations are given in parentheses.
Metrics (two columns each, Calib. and Test): MSE (True), Brier, ECE, QMSE, WMSE, LCS
Calibration Method Calib. Test Calib. Test Calib. Test Calib. Test Calib. Test Calib. Test
No Calibration 0.1281 (7e-04) 0.128 (8e-04) 0.3642 (0.01) 0.3637 (0.0122) 0.3459 (0.0141) 0.3476 (0.0163) 0.1302 (0.0099) 0.131 (0.012) 0.1196 (0.0098) 0.1183 (0.0117) 0.1211 (0.0099) 0.1204 (0.0119)
No Calibration 0.0243 (6e-04) 0.0243 (6e-04) 0.2604 (0.0076) 0.2596 (0.0084) 0.2253 (0.0136) 0.227 (0.0157) 0.0263 (0.0045) 0.0268 (0.0051) 0.031 (0.0049) 0.0312 (0.0059) 0.0294 (0.0048) 0.0294 (0.0058)
True Prob. 0 (0) 0 (0) 0.2359 (0.0033) 0.2355 (0.0037) 0.1056 (0.0105) 0.1113 (0.0126) 0.002 (8e-04) 0.003 (0.0014) 0.0065 (0.0017) 0.0079 (0.0026) 0.003 (0.0012) 0.0035 (0.0017)
True Prob. 0 (0) 0 (0) 0.2359 (0.0033) 0.2355 (0.0037) 0.1056 (0.0105) 0.1113 (0.0126) 0.002 (8e-04) 0.003 (0.0014) 0.0065 (0.0017) 0.0079 (0.0026) 0.003 (0.0012) 0.0035 (0.0017)
beta 5e-04 (4e-04) 5e-04 (4e-04) 0.2354 (0.0034) 0.236 (0.0037) 0.1025 (0.0176) 0.1124 (0.0154) 0.0015 (7e-04) 0.0035 (0.0017) 0.006 (0.001) 0.0082 (0.0029) 0.0026 (7e-04) 0.0042 (0.0021)
beta 5e-04 (4e-04) 6e-04 (4e-04) 0.2353 (0.0034) 0.2361 (0.0037) 0.1027 (0.0177) 0.113 (0.0154) 0.0015 (7e-04) 0.0035 (0.0017) 0.006 (0.001) 0.0082 (0.0029) 0.0023 (7e-04) 0.004 (0.0021)
isotonic 0.0019 (6e-04) 0.0019 (7e-04) 0.2314 (0.0036) 0.2373 (0.0043) 0.0935 (0.0185) 0.1134 (0.0164) 0 (0) 0.0033 (0.0018) 0.006 (0.0012) 0.0102 (0.0036) 0.0053 (0.0015) 0.0085 (0.0037)
isotonic 0.0019 (6e-04) 0.0019 (7e-04) 0.2314 (0.0036) 0.2373 (0.0043) 0.0935 (0.0185) 0.1134 (0.0164) 0 (0) 0.0033 (0.0018) 0.006 (0.0012) 0.0102 (0.0036) 0.0053 (0.0015) 0.0085 (0.0037)
locfit_0 0.0017 (7e-04) 0.0017 (7e-04) 0.235 (0.0033) 0.2371 (0.0036) 0.1013 (0.0189) 0.1106 (0.0161) 0.0013 (7e-04) 0.0039 (0.0017) 0.0051 (0.0011) 0.0082 (0.0031) 6e-04 (3e-04) 0.0026 (0.0015)
locfit_0 0.0016 (6e-04) 0.0017 (6e-04) 0.235 (0.0033) 0.2371 (0.0038) 0.1001 (0.0184) 0.1111 (0.0159) 0.0012 (6e-04) 0.0039 (0.0018) 0.0052 (0.001) 0.0085 (0.0031) 6e-04 (3e-04) 0.0026 (0.0016)
locfit_1 0.0015 (6e-04) 0.0015 (6e-04) 0.2342 (0.0034) 0.2369 (0.004) 0.1018 (0.0176) 0.1143 (0.0161) 9e-04 (5e-04) 0.0037 (0.0018) 0.0061 (0.0013) 0.0092 (0.0035) 0.0025 (0.0015) 0.0047 (0.003)
locfit_1 0.0015 (6e-04) 0.0015 (6e-04) 0.2344 (0.0034) 0.2369 (0.0041) 0.1027 (0.0185) 0.1143 (0.0161) 0.001 (6e-04) 0.0038 (0.0017) 0.0061 (0.0013) 0.0093 (0.0035) 0.0015 (0.001) 0.0037 (0.0024)
locfit_2 0.0029 (0.001) 0.003 (0.001) 0.2327 (0.0034) 0.2383 (0.0044) 0.1067 (0.0185) 0.1154 (0.0169) 9e-04 (5e-04) 0.0042 (0.0019) 0.0054 (0.001) 0.0108 (0.0039) 0.0027 (0.0014) 0.0063 (0.0036)
locfit_2 0.0029 (0.001) 0.0029 (0.001) 0.2327 (0.0034) 0.2381 (0.0043) 0.1069 (0.0179) 0.1154 (0.0177) 9e-04 (5e-04) 0.0042 (0.0018) 0.0055 (0.001) 0.0107 (0.0039) 0.002 (0.0013) 0.0054 (0.0033)
platt 0.0011 (4e-04) 0.0012 (4e-04) 0.2363 (0.0032) 0.2367 (0.0036) 0.1013 (0.0184) 0.1083 (0.0165) 0.0023 (0.001) 0.004 (0.0018) 0.0059 (0.001) 0.0081 (0.0027) 0.0028 (0.0011) 0.0043 (0.0019)
platt 8e-04 (4e-04) 8e-04 (4e-04) 0.236 (0.0034) 0.2364 (0.0038) 0.1077 (0.0176) 0.1165 (0.0145) 0.0019 (9e-04) 0.0037 (0.0018) 0.0059 (0.001) 0.0081 (0.0029) 0.001 (5e-04) 0.0024 (0.0015)
Table 2.6: Average value of the calibration metrics (in columns) over the 200 replications, for \(\gamma=0.33\), computed on the calibration set and on the test set, using different predicted probabilities (in rows). Standard deviations are given in parentheses.
Metrics (two columns each, Calib. and Test): MSE (True), Brier, ECE, QMSE, WMSE, LCS
Calibration Method Calib. Test Calib. Test Calib. Test Calib. Test Calib. Test Calib. Test
No Calibration 0.083 (9e-04) 0.0831 (0.001) 0.3186 (0.0083) 0.3185 (0.0098) 0.2824 (0.0136) 0.2826 (0.0166) 0.0847 (0.0079) 0.0859 (0.0096) 0.0765 (0.0092) 0.0777 (0.0116) 0.0801 (0.0093) 0.0827 (0.0118)
No Calibration 0.006 (2e-04) 0.006 (3e-04) 0.2418 (0.0012) 0.2416 (0.0013) 0.0998 (0.0123) 0.1035 (0.0136) 0.0078 (0.0022) 0.009 (0.0027) 0.007 (0.0015) 0.0083 (0.0018) 0.0067 (0.0027) 0.0079 (0.0034)
True Prob. 0 (0) 0 (0) 0.2359 (0.0033) 0.2355 (0.0037) 0.1056 (0.0105) 0.1113 (0.0126) 0.002 (8e-04) 0.003 (0.0014) 0.0065 (0.0017) 0.0079 (0.0026) 0.003 (0.0012) 0.0035 (0.0017)
True Prob. 0 (0) 0 (0) 0.2359 (0.0033) 0.2355 (0.0037) 0.1056 (0.0105) 0.1113 (0.0126) 0.002 (8e-04) 0.003 (0.0014) 0.0065 (0.0017) 0.0079 (0.0026) 0.003 (0.0012) 0.0035 (0.0017)
beta 5e-04 (4e-04) 5e-04 (4e-04) 0.2354 (0.0034) 0.236 (0.0037) 0.1032 (0.0173) 0.1128 (0.0153) 0.0015 (7e-04) 0.0034 (0.0017) 0.006 (0.001) 0.0082 (0.003) 0.0024 (6e-04) 0.0039 (0.0021)
beta 4e-04 (4e-04) 4e-04 (4e-04) 0.2354 (0.0034) 0.2359 (0.0037) 0.1031 (0.0174) 0.1126 (0.0153) 0.0016 (8e-04) 0.0034 (0.0017) 0.006 (0.001) 0.0082 (0.003) 0.0025 (7e-04) 0.0038 (0.0021)
isotonic 0.0019 (6e-04) 0.0019 (7e-04) 0.2314 (0.0036) 0.2373 (0.0043) 0.0935 (0.0185) 0.1134 (0.0164) 0 (0) 0.0033 (0.0018) 0.006 (0.0012) 0.0102 (0.0036) 0.0053 (0.0015) 0.0085 (0.0037)
isotonic 0.0019 (6e-04) 0.0019 (7e-04) 0.2314 (0.0036) 0.2373 (0.0043) 0.0935 (0.0185) 0.1134 (0.0164) 0 (0) 0.0033 (0.0018) 0.006 (0.0012) 0.0102 (0.0036) 0.0053 (0.0015) 0.0085 (0.0037)
locfit_0 0.0017 (7e-04) 0.0017 (7e-04) 0.235 (0.0033) 0.2372 (0.0036) 0.0998 (0.0189) 0.1093 (0.0156) 0.0014 (6e-04) 0.0039 (0.0016) 0.005 (0.0011) 0.0082 (0.0031) 6e-04 (3e-04) 0.0026 (0.0015)
locfit_0 0.0017 (7e-04) 0.0017 (7e-04) 0.235 (0.0033) 0.2372 (0.0036) 0.0999 (0.0179) 0.1097 (0.0163) 0.0014 (7e-04) 0.0039 (0.0018) 0.005 (0.0011) 0.0082 (0.0031) 6e-04 (3e-04) 0.0026 (0.0015)
locfit_1 0.0015 (6e-04) 0.0016 (6e-04) 0.2342 (0.0034) 0.2369 (0.004) 0.1024 (0.0188) 0.1136 (0.016) 0.001 (6e-04) 0.0037 (0.0016) 0.006 (0.0013) 0.0093 (0.0036) 0.0029 (0.0017) 0.0053 (0.0034)
locfit_1 0.0015 (6e-04) 0.0015 (6e-04) 0.2342 (0.0034) 0.237 (0.004) 0.1021 (0.0177) 0.1135 (0.0164) 0.001 (6e-04) 0.0037 (0.0017) 0.006 (0.0012) 0.0094 (0.0035) 0.0031 (0.0018) 0.0055 (0.0033)
locfit_2 0.003 (9e-04) 0.0031 (0.001) 0.2326 (0.0034) 0.2384 (0.0044) 0.106 (0.0174) 0.1154 (0.0165) 9e-04 (6e-04) 0.0042 (0.0019) 0.0054 (0.001) 0.0108 (0.0038) 0.0029 (0.0015) 0.0066 (0.0036)
locfit_2 0.003 (0.001) 0.0031 (0.001) 0.2327 (0.0034) 0.2383 (0.0044) 0.1057 (0.0178) 0.1154 (0.0165) 9e-04 (5e-04) 0.0042 (0.0021) 0.0054 (9e-04) 0.0109 (0.0039) 0.0031 (0.0015) 0.0069 (0.0037)
platt 5e-04 (4e-04) 5e-04 (4e-04) 0.2357 (0.0034) 0.236 (0.0037) 0.105 (0.017) 0.1141 (0.0146) 0.0018 (9e-04) 0.0035 (0.0017) 0.006 (0.001) 0.0081 (0.0029) 0.0024 (8e-04) 0.0036 (0.0021)
platt 4e-04 (4e-04) 4e-04 (4e-04) 0.2355 (0.0033) 0.2359 (0.0037) 0.1033 (0.0173) 0.1129 (0.0153) 0.0017 (8e-04) 0.0034 (0.0016) 0.006 (0.001) 0.0081 (0.003) 0.0025 (7e-04) 0.0037 (0.0021)
Table 2.7: Average value of the calibration metrics (in columns) over the 200 replications, for \(\gamma=0.67\), computed on the calibration set and on the test set, using different predicted probabilities (in rows). Standard deviations are given in parentheses.
Metrics (two columns each, Calib. and Test): MSE (True), Brier, ECE, QMSE, WMSE, LCS
Calibration Method Calib. Test Calib. Test Calib. Test Calib. Test Calib. Test Calib. Test
No Calibration 0.0157 (1e-04) 0.0157 (1e-04) 0.2515 (0.0047) 0.2512 (0.0052) 0.1313 (0.0138) 0.1312 (0.0162) 0.0176 (0.0035) 0.0187 (0.0043) 0.0196 (0.0039) 0.0211 (0.0049) 0.0163 (0.0036) 0.0172 (0.0044)
No Calibration 0.0014 (1e-04) 0.0014 (1e-04) 0.2372 (0.0023) 0.2369 (0.0026) 0.099 (0.0115) 0.1037 (0.0127) 0.0033 (0.0013) 0.0044 (0.0018) 0.005 (0.001) 0.0065 (0.0017) 0.0021 (9e-04) 0.0028 (0.0013)
True Prob. 0 (0) 0 (0) 0.2359 (0.0033) 0.2355 (0.0037) 0.1056 (0.0105) 0.1113 (0.0126) 0.002 (8e-04) 0.003 (0.0014) 0.0065 (0.0017) 0.0079 (0.0026) 0.003 (0.0012) 0.0035 (0.0017)
True Prob. 0 (0) 0 (0) 0.2359 (0.0033) 0.2355 (0.0037) 0.1056 (0.0105) 0.1113 (0.0126) 0.002 (8e-04) 0.003 (0.0014) 0.0065 (0.0017) 0.0079 (0.0026) 0.003 (0.0012) 0.0035 (0.0017)
beta 5e-04 (4e-04) 5e-04 (4e-04) 0.2354 (0.0034) 0.236 (0.0037) 0.1032 (0.0172) 0.1128 (0.0153) 0.0015 (7e-04) 0.0034 (0.0017) 0.006 (0.001) 0.0082 (0.003) 0.0024 (6e-04) 0.0039 (0.0021)
beta 4e-04 (4e-04) 4e-04 (4e-04) 0.2354 (0.0034) 0.2359 (0.0037) 0.103 (0.0173) 0.1125 (0.0152) 0.0015 (7e-04) 0.0034 (0.0017) 0.006 (0.001) 0.0082 (0.0029) 0.0024 (7e-04) 0.0039 (0.0021)
isotonic 0.0019 (6e-04) 0.0019 (7e-04) 0.2314 (0.0036) 0.2373 (0.0043) 0.0935 (0.0185) 0.1134 (0.0164) 0 (0) 0.0033 (0.0018) 0.006 (0.0012) 0.0102 (0.0036) 0.0053 (0.0015) 0.0085 (0.0037)
isotonic 0.0019 (6e-04) 0.0019 (7e-04) 0.2314 (0.0036) 0.2373 (0.0043) 0.0935 (0.0185) 0.1134 (0.0164) 0 (0) 0.0033 (0.0018) 0.006 (0.0012) 0.0102 (0.0036) 0.0053 (0.0015) 0.0085 (0.0037)
locfit_0 0.0017 (7e-04) 0.0017 (7e-04) 0.235 (0.0033) 0.2372 (0.0036) 0.0998 (0.0189) 0.1089 (0.016) 0.0014 (7e-04) 0.0038 (0.0017) 0.0051 (0.0011) 0.0082 (0.0031) 6e-04 (3e-04) 0.0026 (0.0015)
locfit_0 0.0017 (7e-04) 0.0017 (7e-04) 0.235 (0.0033) 0.2372 (0.0036) 0.0998 (0.019) 0.1096 (0.0162) 0.0014 (7e-04) 0.0039 (0.0017) 0.005 (0.0011) 0.0083 (0.003) 6e-04 (3e-04) 0.0026 (0.0015)
locfit_1 0.0015 (6e-04) 0.0015 (6e-04) 0.2342 (0.0034) 0.2369 (0.004) 0.1029 (0.0187) 0.1135 (0.0162) 0.001 (5e-04) 0.0037 (0.0016) 0.006 (0.0012) 0.0092 (0.0036) 0.0029 (0.0017) 0.0051 (0.0033)
locfit_1 0.0015 (6e-04) 0.0016 (6e-04) 0.2342 (0.0034) 0.2369 (0.004) 0.1022 (0.0183) 0.1136 (0.0164) 0.001 (6e-04) 0.0037 (0.0017) 0.006 (0.0013) 0.0093 (0.0034) 0.003 (0.0018) 0.0054 (0.0033)
locfit_2 0.003 (0.001) 0.003 (0.001) 0.2327 (0.0034) 0.2384 (0.0044) 0.1067 (0.0175) 0.115 (0.0169) 9e-04 (5e-04) 0.0041 (0.002) 0.0054 (0.001) 0.0108 (0.0039) 0.0029 (0.0015) 0.0066 (0.0037)
locfit_2 0.003 (0.001) 0.0031 (0.001) 0.2326 (0.0034) 0.2384 (0.0045) 0.1059 (0.0179) 0.1153 (0.0165) 9e-04 (5e-04) 0.0042 (0.0018) 0.0054 (9e-04) 0.0108 (0.004) 0.003 (0.0015) 0.0066 (0.0038)
platt 4e-04 (4e-04) 4e-04 (4e-04) 0.2356 (0.0034) 0.2359 (0.0037) 0.1048 (0.017) 0.1135 (0.0147) 0.0017 (8e-04) 0.0034 (0.0017) 0.006 (0.001) 0.0081 (0.003) 0.0021 (7e-04) 0.0033 (0.002)
platt 4e-04 (4e-04) 4e-04 (4e-04) 0.2355 (0.0033) 0.2359 (0.0037) 0.1036 (0.0174) 0.1132 (0.0152) 0.0017 (8e-04) 0.0034 (0.0016) 0.006 (0.001) 0.0081 (0.003) 0.0022 (7e-04) 0.0035 (0.002)
Table 2.8: Average value of the calibration metrics (in columns) over the 200 replications, for \(\gamma=1\), computed on the calibration set and on the test set, using different predicted probabilities (in rows). Standard deviations are given in parentheses.
Metrics (two columns each, Calib. and Test): MSE (True), Brier, ECE, QMSE, WMSE, LCS
Calibration Method Calib. Test Calib. Test Calib. Test Calib. Test Calib. Test Calib. Test
No Calibration 0 (0) 0 (0) 0.2359 (0.0033) 0.2355 (0.0037) 0.1056 (0.0105) 0.1113 (0.0126) 0.002 (8e-04) 0.003 (0.0014) 0.0065 (0.0017) 0.0079 (0.0026) 0.003 (0.0012) 0.0035 (0.0016)
No Calibration 0 (0) 0 (0) 0.2359 (0.0033) 0.2355 (0.0037) 0.1056 (0.0105) 0.1113 (0.0126) 0.002 (8e-04) 0.003 (0.0014) 0.0065 (0.0017) 0.0079 (0.0026) 0.003 (0.0012) 0.0035 (0.0016)
True Prob. 0 (0) 0 (0) 0.2359 (0.0033) 0.2355 (0.0037) 0.1056 (0.0105) 0.1113 (0.0126) 0.002 (8e-04) 0.003 (0.0014) 0.0065 (0.0017) 0.0079 (0.0026) 0.003 (0.0012) 0.0035 (0.0017)
True Prob. 0 (0) 0 (0) 0.2359 (0.0033) 0.2355 (0.0037) 0.1056 (0.0105) 0.1113 (0.0126) 0.002 (8e-04) 0.003 (0.0014) 0.0065 (0.0017) 0.0079 (0.0026) 0.003 (0.0012) 0.0035 (0.0017)
beta 5e-04 (4e-04) 5e-04 (4e-04) 0.2354 (0.0034) 0.236 (0.0037) 0.1031 (0.0173) 0.1127 (0.0153) 0.0015 (7e-04) 0.0034 (0.0017) 0.006 (0.001) 0.0082 (0.003) 0.0024 (6e-04) 0.0039 (0.0021)
beta 5e-04 (4e-04) 5e-04 (4e-04) 0.2354 (0.0034) 0.236 (0.0037) 0.1031 (0.0173) 0.1127 (0.0153) 0.0015 (7e-04) 0.0034 (0.0017) 0.006 (0.001) 0.0082 (0.003) 0.0024 (6e-04) 0.0039 (0.0021)
isotonic 0.0019 (6e-04) 0.0019 (7e-04) 0.2314 (0.0036) 0.2373 (0.0043) 0.0935 (0.0185) 0.1134 (0.0164) 0 (0) 0.0033 (0.0018) 0.006 (0.0012) 0.0102 (0.0036) 0.0053 (0.0015) 0.0085 (0.0037)
isotonic 0.0019 (6e-04) 0.0019 (7e-04) 0.2314 (0.0036) 0.2373 (0.0043) 0.0935 (0.0185) 0.1134 (0.0164) 0 (0) 0.0033 (0.0018) 0.006 (0.0012) 0.0102 (0.0036) 0.0053 (0.0015) 0.0085 (0.0037)
locfit_0 0.0017 (7e-04) 0.0017 (7e-04) 0.235 (0.0033) 0.2372 (0.0037) 0.099 (0.0182) 0.1092 (0.016) 0.0013 (6e-04) 0.0039 (0.0018) 0.005 (0.0011) 0.0083 (0.0031) 6e-04 (3e-04) 0.0026 (0.0016)
locfit_0 0.0017 (7e-04) 0.0017 (7e-04) 0.235 (0.0033) 0.2372 (0.0037) 0.099 (0.0182) 0.1092 (0.016) 0.0013 (6e-04) 0.0039 (0.0018) 0.005 (0.0011) 0.0083 (0.0031) 6e-04 (3e-04) 0.0026 (0.0016)
locfit_1 0.0015 (6e-04) 0.0015 (6e-04) 0.2342 (0.0034) 0.2369 (0.004) 0.1016 (0.0184) 0.1134 (0.0161) 9e-04 (6e-04) 0.0038 (0.0018) 0.0059 (0.0012) 0.0092 (0.0035) 0.0028 (0.0016) 0.005 (0.0032)
locfit_1 0.0015 (6e-04) 0.0015 (6e-04) 0.2342 (0.0034) 0.2369 (0.004) 0.1016 (0.0184) 0.1134 (0.0161) 9e-04 (6e-04) 0.0038 (0.0018) 0.0059 (0.0012) 0.0092 (0.0035) 0.0028 (0.0016) 0.005 (0.0032)
locfit_2 0.0029 (0.001) 0.003 (0.001) 0.2327 (0.0034) 0.2383 (0.0044) 0.1058 (0.0179) 0.1147 (0.0167) 9e-04 (6e-04) 0.0041 (0.0018) 0.0054 (9e-04) 0.0107 (0.0039) 0.0029 (0.0014) 0.0065 (0.0037)
locfit_2 0.0029 (0.001) 0.003 (0.001) 0.2327 (0.0034) 0.2383 (0.0044) 0.1058 (0.0179) 0.1147 (0.0167) 9e-04 (6e-04) 0.0041 (0.0018) 0.0054 (9e-04) 0.0107 (0.0039) 0.0029 (0.0014) 0.0065 (0.0037)
platt 4e-04 (4e-04) 4e-04 (4e-04) 0.2355 (0.0034) 0.2359 (0.0038) 0.104 (0.0174) 0.1136 (0.0152) 0.0017 (8e-04) 0.0034 (0.0016) 0.006 (0.001) 0.0081 (0.003) 0.0019 (6e-04) 0.0032 (0.0018)
platt 4e-04 (4e-04) 4e-04 (4e-04) 0.2355 (0.0034) 0.2359 (0.0038) 0.104 (0.0174) 0.1136 (0.0152) 0.0017 (8e-04) 0.0034 (0.0016) 0.006 (0.001) 0.0081 (0.003) 0.0019 (6e-04) 0.0032 (0.0018)
Table 2.9: Average value of the calibration metrics (in columns) over the 200 replications, for \(\gamma=1.5\), computed on the calibration and test sets, for different predicted probabilities (in rows). Standard deviations are given in parentheses.
Columns: Calibration Method, then the Calib. and Test values for MSE (True), Brier, ECE, QMSE, WMSE, and LCS, in that order.
No Calibration 0.0192 (1e-04) 0.0192 (1e-04) 0.2553 (0.005) 0.2548 (0.0061) 0.1927 (0.0136) 0.1954 (0.0167) 0.0214 (0.004) 0.0222 (0.005) 0.0251 (0.0044) 0.0256 (0.0058) 0.0215 (0.0041) 0.0216 (0.0052)
No Calibration 0.0025 (1e-04) 0.0025 (1e-04) 0.2384 (0.0047) 0.2379 (0.0051) 0.1342 (0.0122) 0.1384 (0.0142) 0.0045 (0.0016) 0.0053 (0.0021) 0.0117 (0.0028) 0.0129 (0.0037) 0.008 (0.0024) 0.0084 (0.003)
True Prob. 0 (0) 0 (0) 0.2359 (0.0033) 0.2355 (0.0037) 0.1056 (0.0105) 0.1113 (0.0126) 0.002 (8e-04) 0.003 (0.0014) 0.0065 (0.0017) 0.0079 (0.0026) 0.003 (0.0012) 0.0035 (0.0017)
True Prob. 0 (0) 0 (0) 0.2359 (0.0033) 0.2355 (0.0037) 0.1056 (0.0105) 0.1113 (0.0126) 0.002 (8e-04) 0.003 (0.0014) 0.0065 (0.0017) 0.0079 (0.0026) 0.003 (0.0012) 0.0035 (0.0017)
beta 5e-04 (4e-04) 5e-04 (4e-04) 0.2354 (0.0034) 0.236 (0.0037) 0.1028 (0.0175) 0.1127 (0.0154) 0.0015 (7e-04) 0.0035 (0.0017) 0.006 (0.001) 0.0082 (0.0029) 0.0024 (6e-04) 0.004 (0.0021)
beta 5e-04 (4e-04) 5e-04 (4e-04) 0.2354 (0.0034) 0.236 (0.0037) 0.1029 (0.0176) 0.1129 (0.0155) 0.0015 (7e-04) 0.0035 (0.0017) 0.006 (0.001) 0.0082 (0.0029) 0.0023 (6e-04) 0.004 (0.0021)
isotonic 0.0019 (6e-04) 0.0019 (7e-04) 0.2314 (0.0036) 0.2373 (0.0043) 0.0935 (0.0185) 0.1134 (0.0164) 0 (0) 0.0033 (0.0018) 0.006 (0.0012) 0.0102 (0.0036) 0.0053 (0.0015) 0.0085 (0.0037)
isotonic 0.0019 (6e-04) 0.0019 (7e-04) 0.2314 (0.0036) 0.2373 (0.0043) 0.0935 (0.0185) 0.1134 (0.0164) 0 (0) 0.0033 (0.0018) 0.006 (0.0012) 0.0102 (0.0036) 0.0053 (0.0015) 0.0085 (0.0037)
locfit_0 0.0017 (7e-04) 0.0017 (7e-04) 0.235 (0.0033) 0.2372 (0.0036) 0.0999 (0.019) 0.1098 (0.0159) 0.0013 (6e-04) 0.0039 (0.0018) 0.005 (0.0011) 0.0083 (0.0031) 6e-04 (3e-04) 0.0026 (0.0015)
locfit_0 0.0016 (7e-04) 0.0017 (7e-04) 0.2349 (0.0033) 0.2371 (0.0036) 0.0995 (0.0184) 0.1099 (0.0159) 0.0013 (7e-04) 0.0039 (0.0017) 0.005 (0.0011) 0.0083 (0.0031) 6e-04 (3e-04) 0.0026 (0.0015)
locfit_1 0.0015 (6e-04) 0.0015 (6e-04) 0.2342 (0.0034) 0.237 (0.004) 0.1022 (0.0182) 0.114 (0.0163) 0.001 (5e-04) 0.0038 (0.0019) 0.0059 (0.0012) 0.0093 (0.0037) 0.0027 (0.0016) 0.005 (0.0033)
locfit_1 0.0015 (6e-04) 0.0015 (6e-04) 0.2342 (0.0034) 0.2369 (0.004) 0.1023 (0.0175) 0.1138 (0.0162) 9e-04 (5e-04) 0.0038 (0.0018) 0.006 (0.0013) 0.0092 (0.0034) 0.0024 (0.0016) 0.0047 (0.0029)
locfit_2 0.0029 (9e-04) 0.003 (0.001) 0.2327 (0.0034) 0.2382 (0.0044) 0.1062 (0.0176) 0.1157 (0.0159) 9e-04 (5e-04) 0.0042 (0.0019) 0.0054 (9e-04) 0.0108 (0.004) 0.0028 (0.0015) 0.0065 (0.0037)
locfit_2 0.0029 (0.001) 0.003 (0.001) 0.2327 (0.0034) 0.2382 (0.0044) 0.1068 (0.0181) 0.1156 (0.0173) 9e-04 (6e-04) 0.004 (0.0019) 0.0054 (9e-04) 0.0106 (0.0039) 0.0027 (0.0015) 0.0062 (0.0037)
platt 4e-04 (4e-04) 4e-04 (4e-04) 0.2356 (0.0033) 0.236 (0.0037) 0.1035 (0.0176) 0.1126 (0.0159) 0.0017 (8e-04) 0.0034 (0.0016) 0.006 (0.001) 0.0081 (0.0029) 0.0018 (6e-04) 0.0032 (0.0017)
platt 4e-04 (4e-04) 4e-04 (4e-04) 0.2356 (0.0034) 0.236 (0.0038) 0.1047 (0.0175) 0.1142 (0.0149) 0.0017 (8e-04) 0.0034 (0.0017) 0.006 (0.001) 0.0081 (0.003) 0.0015 (5e-04) 0.0028 (0.0017)
Table 2.10: Average value of the calibration metrics (in columns) over the 200 replications, for \(\gamma=3\), computed on the calibration and test sets, for different predicted probabilities (in rows). Standard deviations are given in parentheses.
Columns: Calibration Method, then the Calib. and Test values for MSE (True), Brier, ECE, QMSE, WMSE, and LCS, in that order.
No Calibration 0.1281 (7e-04) 0.128 (8e-04) 0.3642 (0.01) 0.3637 (0.0122) 0.3459 (0.0141) 0.3476 (0.0163) 0.1302 (0.0099) 0.131 (0.012) 0.1196 (0.0098) 0.1183 (0.0117) 0.1211 (0.0099) 0.1204 (0.0119)
No Calibration 0.0243 (6e-04) 0.0243 (6e-04) 0.2604 (0.0076) 0.2596 (0.0084) 0.2253 (0.0136) 0.227 (0.0157) 0.0263 (0.0045) 0.0268 (0.0051) 0.031 (0.0049) 0.0312 (0.0059) 0.0294 (0.0048) 0.0294 (0.0058)
True Prob. 0 (0) 0 (0) 0.2359 (0.0033) 0.2355 (0.0037) 0.1056 (0.0105) 0.1113 (0.0126) 0.002 (8e-04) 0.003 (0.0014) 0.0065 (0.0017) 0.0079 (0.0026) 0.003 (0.0012) 0.0035 (0.0017)
True Prob. 0 (0) 0 (0) 0.2359 (0.0033) 0.2355 (0.0037) 0.1056 (0.0105) 0.1113 (0.0126) 0.002 (8e-04) 0.003 (0.0014) 0.0065 (0.0017) 0.0079 (0.0026) 0.003 (0.0012) 0.0035 (0.0017)
beta 5e-04 (4e-04) 5e-04 (4e-04) 0.2354 (0.0034) 0.236 (0.0037) 0.1025 (0.0176) 0.1124 (0.0154) 0.0015 (7e-04) 0.0035 (0.0017) 0.006 (0.001) 0.0082 (0.0029) 0.0026 (7e-04) 0.0042 (0.0021)
beta 5e-04 (4e-04) 6e-04 (4e-04) 0.2353 (0.0034) 0.2361 (0.0037) 0.1027 (0.0177) 0.113 (0.0154) 0.0015 (7e-04) 0.0035 (0.0017) 0.006 (0.001) 0.0082 (0.0029) 0.0023 (7e-04) 0.004 (0.0021)
isotonic 0.0019 (6e-04) 0.0019 (7e-04) 0.2314 (0.0036) 0.2373 (0.0043) 0.0935 (0.0185) 0.1134 (0.0164) 0 (0) 0.0033 (0.0018) 0.006 (0.0012) 0.0102 (0.0036) 0.0053 (0.0015) 0.0085 (0.0037)
isotonic 0.0019 (6e-04) 0.0019 (7e-04) 0.2314 (0.0036) 0.2373 (0.0043) 0.0935 (0.0185) 0.1134 (0.0164) 0 (0) 0.0033 (0.0018) 0.006 (0.0012) 0.0102 (0.0036) 0.0053 (0.0015) 0.0085 (0.0037)
locfit_0 0.0017 (7e-04) 0.0017 (7e-04) 0.235 (0.0033) 0.2371 (0.0036) 0.1013 (0.0189) 0.1106 (0.0161) 0.0013 (7e-04) 0.0039 (0.0017) 0.0051 (0.0011) 0.0082 (0.0031) 6e-04 (3e-04) 0.0026 (0.0015)
locfit_0 0.0016 (6e-04) 0.0017 (6e-04) 0.235 (0.0033) 0.2371 (0.0038) 0.1001 (0.0184) 0.1111 (0.0159) 0.0012 (6e-04) 0.0039 (0.0018) 0.0052 (0.001) 0.0085 (0.0031) 6e-04 (3e-04) 0.0026 (0.0016)
locfit_1 0.0015 (6e-04) 0.0015 (6e-04) 0.2342 (0.0034) 0.2369 (0.004) 0.1018 (0.0176) 0.1143 (0.0161) 9e-04 (5e-04) 0.0037 (0.0018) 0.0061 (0.0013) 0.0092 (0.0035) 0.0025 (0.0015) 0.0047 (0.003)
locfit_1 0.0015 (6e-04) 0.0015 (6e-04) 0.2344 (0.0034) 0.2369 (0.0041) 0.1027 (0.0185) 0.1143 (0.0161) 0.001 (6e-04) 0.0038 (0.0017) 0.0061 (0.0013) 0.0093 (0.0035) 0.0015 (0.001) 0.0037 (0.0024)
locfit_2 0.0029 (0.001) 0.003 (0.001) 0.2327 (0.0034) 0.2383 (0.0044) 0.1067 (0.0185) 0.1154 (0.0169) 9e-04 (5e-04) 0.0042 (0.0019) 0.0054 (0.001) 0.0108 (0.0039) 0.0027 (0.0014) 0.0063 (0.0036)
locfit_2 0.0029 (0.001) 0.0029 (0.001) 0.2327 (0.0034) 0.2381 (0.0043) 0.1069 (0.0179) 0.1154 (0.0177) 9e-04 (5e-04) 0.0042 (0.0018) 0.0055 (0.001) 0.0107 (0.0039) 0.002 (0.0013) 0.0054 (0.0033)
platt 0.0011 (4e-04) 0.0012 (4e-04) 0.2363 (0.0032) 0.2367 (0.0036) 0.1013 (0.0184) 0.1083 (0.0165) 0.0023 (0.001) 0.004 (0.0018) 0.0059 (0.001) 0.0081 (0.0027) 0.0028 (0.0011) 0.0043 (0.0019)
platt 8e-04 (4e-04) 8e-04 (4e-04) 0.236 (0.0034) 0.2364 (0.0038) 0.1077 (0.0176) 0.1165 (0.0145) 0.0019 (9e-04) 0.0037 (0.0018) 0.0059 (0.001) 0.0081 (0.0029) 0.001 (5e-04) 0.0024 (0.0015)

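Now, let us normalize the values. Within each replication, we take the calibration metric computed with the uncalibrated scores as the reference value and express the same metric computed after recalibration as a ratio to that reference. A ratio below 1 means that the metric is lower after recalibration than before; when both the value and the reference are 0, the ratio is set to 1.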

table_calib_metrics_all_rel <- 
  calib_metrics_simul |> 
  # Drop the rows computed with the true probabilities
  filter(method != "True Prob.") |> 
  # Reference: metric computed with the uncalibrated scores
  mutate(
    reference = ifelse(method == "No Calibration", yes = value, no = NA)
  ) |> 
  # Propagate the reference to all methods within each replication
  group_by(sample, transform_scale, type, metric, seed) |> 
  mutate(reference = sum(reference, na.rm = TRUE)) |> 
  ungroup() |> 
  # Ratio to the reference; 0 / 0 is set to 1 (no change)
  mutate(
    value_norm = value / reference,
    value_norm = ifelse(value == 0 & reference == 0, yes = 1, no = value_norm)
  ) |> 
  # Average and standard deviation over the replications
  group_by(method, sample, transform_scale, type, metric) |> 
  summarise(
    value_norm_mean = mean(value_norm, na.rm = TRUE),
    value_norm_sd = sd(value_norm, na.rm = TRUE),
    .groups = "drop"
  ) |> 
  mutate(
    value_norm_mean = round(value_norm_mean, 4), 
    value_norm_sd = round(value_norm_sd, 4)
  ) |> 
  # Format each cell as "mean (sd)" and put the metrics in columns
  mutate(value = str_c(value_norm_mean, " (", value_norm_sd, ")")) |> 
  select(-value_norm_mean, -value_norm_sd) |> 
  pivot_wider(
    names_from = c(metric, sample), values_from = value, 
    names_sort = TRUE
  )
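
To make the normalization concrete, the sketch below applies the same idea to a small made-up tibble. The values are hypothetical (they are not simulation results), and we assume a single metric observed over two replications. Unlike the code above, the reference is picked directly with a logical index inside each group, which is only a shorthand for the same step.

```r
library(dplyr)

# Hypothetical toy values (not results from the simulations)
toy_metrics <- tibble(
  seed   = c(1, 1, 1, 2, 2, 2),
  method = rep(c("No Calibration", "platt", "isotonic"), times = 2),
  value  = c(0.020, 0.005, 0.010, 0, 0, 0.001)
)

toy_norm <- toy_metrics |>
  group_by(seed) |>
  mutate(
    # Metric obtained with the uncalibrated scores in this replication
    reference = value[method == "No Calibration"],
    value_norm = value / reference,
    # 0 / 0 gives NaN: it is set to 1 (no change)
    value_norm = ifelse(value == 0 & reference == 0, 1, value_norm)
  ) |>
  ungroup()

toy_norm
```

In the second replication of this toy example, the reference is 0 while the isotonic value is not, so the ratio is infinite. The 0/0 rule does not cover that case, and na.rm = TRUE does not remove infinite values when averaging.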

We define a function to print a table depending on the transformation applied to the probabilities (varying either \(\alpha\) or \(\gamma\)).

#' Prints the table of normalized calibration metrics
#'
#' @param transform_scale value of the scale parameter (alpha or gamma)
#' @param type type of transformation applied to the probabilities
print_table_rel <- function(transform_scale, type) {
  table_calib_metrics_all_rel |> 
    # `!!` makes the filter use the function arguments, not the columns
    filter(transform_scale == !!transform_scale, type == !!type) |>
    select(-type, -transform_scale) |> 
    mutate(
      across(
        -method, 
        # Cells are character strings: max() and min() compare them as text
        ~kableExtra::cell_spec(
          .x, 
          color = ifelse(
            .x == max(.x), yes = "#882255",
            no = ifelse(.x == min(.x), "#44AA99", "black")
          ),
          bold = .x == max(.x) | .x == min(.x)
        )
      )
    ) |> 
    knitr::kable(
      escape = FALSE, booktabs = TRUE, digits = 4,
      format = "html",
      col.names = c("Calibration Method", rep(c("Calib.", "Test"), 6))
    ) |>
    kableExtra::kable_styling() |>
    kableExtra::add_header_above(
      c(
        "", "MSE (True)" = 2, "Brier" = 2, "ECE" = 2, 
        "QMSE" = 2, "WMSE" = 2, "LCS" = 2
      )
    )
}
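
In print_table_rel(), the cells passed to max() and min() are character strings of the form "mean (sd)", so the highlighted extremes follow text ordering rather than numeric ordering. A cell written in scientific notation, such as "9e-04 (5e-04)", may then rank above "0.0117 (0.007)". The sketch below shows one possible workaround, assuming we accept parsing the mean back from the formatted cell. The helper extract_mean() and the example cells are hypothetical and are not part of the original code.

```r
library(stringr)

# Hypothetical helper (not in the original code): returns the numeric mean
# found at the start of a cell formatted as "mean (sd)"
extract_mean <- function(x) {
  as.numeric(str_extract(x, "^[^ ]+"))
}

# Made-up cells, formatted like those of the tables
cells <- c("9e-04 (5e-04)", "0.0117 (0.007)", "1 (0)")

# Cells with the largest and the smallest numeric mean
cell_max <- cells[which.max(extract_mean(cells))]
cell_min <- cells[which.min(extract_mean(cells))]
```

With such a helper, the conditions used for the colors could compare extract_mean(.x) with its maximum and minimum instead of comparing the strings themselves.
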
Table 2.11: Average ratio of each calibration metric to its reference (the same metric computed with the uncalibrated predicted probabilities) over the 200 replications, for \(\alpha=0.33\), computed on the calibration and test sets, for different predicted probabilities (in rows). Standard deviations are given in parentheses.
Columns: Calibration Method, then the Calib. and Test values for MSE (True), Brier, ECE, QMSE, WMSE, and LCS, in that order.
No Calibration 1 (0) 1 (0) 1 (0) 1 (0) 1 (0) 1 (0) 1 (0) 1 (0) 1 (0) 1 (0) 1 (0) 1 (0)
No Calibration 1 (0) 1 (0) 1 (0) 1 (0) 1 (0) 1 (0) 1 (0) 1 (0) 1 (0) 1 (0) 1 (0) 1 (0)
beta 0.0055 (0.0046) 0.0055 (0.0045) 0.7392 (0.0187) 0.7415 (0.0232) 0.3645 (0.0519) 0.3991 (0.0499) 0.0182 (0.0086) 0.0404 (0.0206) 0.0797 (0.0163) 0.1092 (0.048) 0.0299 (0.0093) 0.0487 (0.0313)
beta 0.0658 (0.0616) 0.0654 (0.0597) 0.9736 (0.0091) 0.9762 (0.0104) 1.0342 (0.1295) 1.0961 (0.1374) 0.2131 (0.0987) 0.3903 (0.1962) 0.9097 (0.2538) 1.0229 (0.4015) 0.46 (0.3116) 0.67 (0.6865)
isotonic 0.0228 (0.0077) 0.0232 (0.0081) 0.7267 (0.0186) 0.7457 (0.0239) 0.3296 (0.0556) 0.4012 (0.0524) 0 (0) 0.0386 (0.0223) 0.0797 (0.0184) 0.1357 (0.0569) 0.0676 (0.0226) 0.1062 (0.0542)
isotonic 0.3141 (0.1088) 0.3191 (0.1093) 0.957 (0.0101) 0.982 (0.0132) 0.9359 (0.1396) 1.1067 (0.1738) 0 (0) 0.4025 (0.2873) 0.9113 (0.2842) 1.2907 (0.5483) 1.0123 (0.7162) 1.491 (1.6222)
locfit_0 0.0205 (0.0085) 0.0207 (0.0085) 0.7379 (0.0187) 0.7453 (0.0231) 0.3522 (0.057) 0.387 (0.0518) 0.0161 (0.0072) 0.0463 (0.0199) 0.0665 (0.0164) 0.1094 (0.0497) 0.0073 (0.004) 0.0315 (0.0194)
locfit_0 0.2816 (0.1183) 0.2821 (0.1158) 0.9718 (0.0092) 0.9815 (0.0104) 1.0027 (0.1415) 1.0679 (0.1574) 0.1881 (0.1159) 0.4606 (0.2375) 0.7556 (0.247) 1.0281 (0.424) 0.1023 (0.0808) 0.3982 (0.3399)
locfit_1 0.0186 (0.0077) 0.0187 (0.0078) 0.7355 (0.0187) 0.7445 (0.0234) 0.3612 (0.057) 0.4018 (0.0513) 0.0117 (0.007) 0.0439 (0.0194) 0.0796 (0.0204) 0.1237 (0.0558) 0.0366 (0.0222) 0.0659 (0.0463)
locfit_1 0.256 (0.106) 0.2558 (0.1039) 0.9684 (0.0094) 0.9806 (0.0118) 1.0247 (0.1403) 1.1067 (0.1661) 0.1376 (