This chapter investigates how the distribution of the scores estimated by an extreme gradient boosting model evolves with the number of boosting iterations. Across models, we vary the maximum depth of the trees and consider up to 400 boosting iterations. For each configuration, we compute the predicted scores at every iteration from 1 to 400; at each boosting iteration, we use the predicted scores (on the train, calibration, and test sets) to compute various metrics (performance, calibration, and divergence between the distribution of scores and that of the true underlying probabilities) on both the initial and the recalibrated scores.
library(tidyverse)
── Attaching core tidyverse packages ──────────────────────── tidyverse 2.0.0 ──
✔ dplyr 1.1.4 ✔ readr 2.1.5
✔ forcats 1.0.0 ✔ stringr 1.5.1
✔ ggplot2 3.5.1 ✔ tibble 3.2.1
✔ lubridate 1.9.4 ✔ tidyr 1.3.1
✔ purrr 1.0.2
── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
✖ dplyr::filter() masks stats::filter()
✖ dplyr::lag() masks stats::lag()
ℹ Use the conflicted package (<http://conflicted.r-lib.org/>) to force all conflicts to become errors
We generate data using the first 12 scenarios from Ojeda et al. (2023) and an additional set of 4 scenarios in which the true probability does not depend on the predictors in a linear way (see Chapter 4).
When we simulate a dataset, we draw the following number of observations:
nb_obs <- 10000
Definition of the 16 scenarios
# Coefficients beta
coefficients <- list(
  # First category (baseline, 2 covariates)
  c(0.5, 1),  # scenario 1, 0 noise variable
  c(0.5, 1),  # scenario 2, 10 noise variables
  c(0.5, 1),  # scenario 3, 50 noise variables
  c(0.5, 1),  # scenario 4, 100 noise variables
  # Second category (same as baseline, with lower number of 1s)
  c(0.5, 1),  # scenario 5, 0 noise variable
  c(0.5, 1),  # scenario 6, 10 noise variables
  c(0.5, 1),  # scenario 7, 50 noise variables
  c(0.5, 1),  # scenario 8, 100 noise variables
  # Third category (same as baseline but with 5 num. and 5 categ. covariates)
  c(0.1, 0.2, 0.3, 0.4, 0.5, 0.01, 0.02, 0.03, 0.04, 0.05),  # scenario 9
  c(0.1, 0.2, 0.3, 0.4, 0.5, 0.01, 0.02, 0.03, 0.04, 0.05),  # scenario 10
  c(0.1, 0.2, 0.3, 0.4, 0.5, 0.01, 0.02, 0.03, 0.04, 0.05),  # scenario 11
  c(0.1, 0.2, 0.3, 0.4, 0.5, 0.01, 0.02, 0.03, 0.04, 0.05),  # scenario 12
  # Fourth category (nonlinear predictor, 3 covariates)
  c(0.5, 1, .3),  # scenario 13, 0 noise variable
  c(0.5, 1, .3),  # scenario 14, 10 noise variables
  c(0.5, 1, .3),  # scenario 15, 50 noise variables
  c(0.5, 1, .3)   # scenario 16, 100 noise variables
)

# Mean parameter for the normal distribution to draw from to draw num covariates
mean_num <- list(
  # First category (baseline, 2 covariates)
  rep(0, 2),  # scenario 1, 0 noise variable
  rep(0, 2),  # scenario 2, 10 noise variables
  rep(0, 2),  # scenario 3, 50 noise variables
  rep(0, 2),  # scenario 4, 100 noise variables
  # Second category (same as baseline, with lower number of 1s)
  rep(0, 2),  # scenario 5, 0 noise variable
  rep(0, 2),  # scenario 6, 10 noise variables
  rep(0, 2),  # scenario 7, 50 noise variables
  rep(0, 2),  # scenario 8, 100 noise variables
  # Third category (same as baseline but with 5 num. and 5 categ. covariates)
  rep(0, 5),
  rep(0, 5),
  rep(0, 5),
  rep(0, 5),
  # Fourth category (nonlinear predictor, 3 covariates)
  rep(0, 3),
  rep(0, 3),
  rep(0, 3),
  rep(0, 3)
)

# Sd parameter for the normal distribution to draw from to draw num covariates
sd_num <- list(
  # First category (baseline, 2 covariates)
  rep(1, 2),  # scenario 1, 0 noise variable
  rep(1, 2),  # scenario 2, 10 noise variables
  rep(1, 2),  # scenario 3, 50 noise variables
  rep(1, 2),  # scenario 4, 100 noise variables
  # Second category (same as baseline, with lower number of 1s)
  rep(1, 2),  # scenario 5, 0 noise variable
  rep(1, 2),  # scenario 6, 10 noise variables
  rep(1, 2),  # scenario 7, 50 noise variables
  rep(1, 2),  # scenario 8, 100 noise variables
  # Third category (same as baseline but with 5 num. and 5 categ. covariates)
  rep(1, 5),
  rep(1, 5),
  rep(1, 5),
  rep(1, 5),
  # Fourth category (nonlinear predictor, 3 covariates)
  rep(1, 3),
  rep(1, 3),
  rep(1, 3),
  rep(1, 3)
)

params_df <- tibble(
  scenario = 1:16,
  coefficients = coefficients,
  n_num = c(rep(2, 8), rep(5, 4), rep(3, 4)),
  add_categ = c(rep(FALSE, 8), rep(TRUE, 4), rep(FALSE, 4)),
  n_noise = rep(c(0, 10, 50, 100), 4),
  mean_num = mean_num,
  sd_num = sd_num,
  size_train = rep(nb_obs, 16),
  size_valid = rep(nb_obs, 16),
  size_calib = rep(nb_obs, 16),
  size_test = rep(nb_obs, 16),
  transform_probs = c(rep(FALSE, 4), rep(TRUE, 4), rep(FALSE, 4), rep(FALSE, 4)),
  linear_predictor = c(rep(TRUE, 12), rep(FALSE, 4)),
  seed = 202105
)
rm(coefficients, mean_num, sd_num)
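As a quick sanity check (not part of the original pipeline), the structure of params_df can be previewed, for instance:

# Preview the scenario parameters (sanity check only)
params_df |> 
  select(scenario, n_num, add_categ, n_noise, transform_probs, linear_predictor)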
5.2 Metrics
We load the functions from Chapter 3 to compute performance, calibration and divergence metrics.
source("../scripts/functions/metrics.R")
5.3 Simulations Setup
To train the models, we rely on the {xgboost} R package.
library(xgboost)
Attaching package: 'xgboost'
The following object is masked from 'package:dplyr':
slice
Here, we define a function to recalibrate predicted scores using Platt scaling, isotonic regression, beta calibration, or a local regression (locfit). The recalibration algorithm is first trained on the calibration set and then applied to both the calibration and test sets.
#' Recalibrates scores using a calibration set
#'
#' @param obs_calib vector of observed events in the calibration set
#' @param obs_test vector of observed events in the test set
#' @param pred_calib vector of predicted probabilities in the calibration set
#' @param pred_test vector of predicted probabilities in the test set
#' @param method recalibration method (`"platt"` for Platt scaling,
#'   `"isotonic"` for isotonic regression, `"beta"` for beta calibration,
#'   `"locfit"` for local regression)
#' @returns list of two elements: recalibrated scores on the calibration set,
#'   recalibrated scores on the test set
recalibrate <- function(obs_calib,
                        obs_test,
                        pred_calib,
                        pred_test,
                        method = c("platt", "isotonic", "beta", "locfit")) {
  method <- match.arg(method)
  data_calib <- tibble(d = obs_calib, scores = pred_calib)
  data_test <- tibble(d = obs_test, scores = pred_test)

  if (method == "platt") {
    lr <- glm(
      d ~ scores, family = binomial(link = 'logit'), data = data_calib
    )
    score_c_calib <- predict(lr, newdata = data_calib, type = "response")
    score_c_test <- predict(lr, newdata = data_test, type = "response")
  } else if (method == "isotonic") {
    iso <- isoreg(x = data_calib$scores, y = data_calib$d)
    fit_iso <- as.stepfun(iso)
    score_c_calib <- fit_iso(data_calib$scores)
    score_c_test <- fit_iso(data_test$scores)
  } else if (method == "beta") {
    # requires the {betacal} package
    fit_beta <- beta_calibration(
      data_calib$scores, data_calib$d, parameters = "abm"
    )
    score_c_calib <- beta_predict(data_calib$scores, fit_beta)
    score_c_test <- beta_predict(data_test$scores, fit_beta)
  } else if (method == "locfit") {
    # requires the {locfit} package
    noise_scores <- data_calib$scores + rnorm(nrow(data_calib), 0, 0.01)
    noise_data_calib <- data_calib %>% mutate(scores = noise_scores)
    locfit_reg <- locfit(
      formula = d ~ lp(scores, nn = 0.15, deg = 0),
      kern = "rect", maxk = 200,
      data = noise_data_calib
    )
    score_c_calib <- predict(locfit_reg, newdata = data_calib)
    score_c_test <- predict(locfit_reg, newdata = data_test)
  } else {
    stop("Unrecognized method: platt, isotonic, beta or locfit only")
  }
  # Format results in tibbles:
  # For calibration set
  tb_score_c_calib <- tibble(
    d = obs_calib,
    p_u = pred_calib,
    p_c = score_c_calib
  )
  # For test set
  tb_score_c_test <- tibble(
    d = obs_test,
    p_u = pred_test,
    p_c = score_c_test
  )
  list(
    tb_score_c_calib = tb_score_c_calib,
    tb_score_c_test = tb_score_c_test
  )
}
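As an illustration of how recalibrate() can be called, here is a small self-contained example on simulated, deliberately miscalibrated scores (the data below are purely illustrative and are not those used in the chapter):

# Illustrative example with fake scores (not the chapter's data)
set.seed(123)
n <- 1000
p_true <- runif(n)                       # true probabilities
d <- rbinom(n, size = 1, prob = p_true)  # observed events
scores <- plogis(qlogis(p_true) * 0.5)   # under-confident, miscalibrated scores
recal <- recalibrate(
  obs_calib = d[1:500],    pred_calib = scores[1:500],
  obs_test  = d[501:1000], pred_test  = scores[501:1000],
  method = "platt"
)
head(recal$tb_score_c_test)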
As explained at the beginning of this chapter, we compute metrics based on scores obtained at various boosting iterations. To do so, we define a function, get_metrics_nb_iter(), applied to a fitted model; it is called for each boosting iteration (controlled by the nb_iter argument), and a sketch of the key prediction step is given after the list below. The function returns a list with the following elements:
scenario: the ID of the scenario
ind: the index of the grid search (so that we can join with the hyperparameters values, if needed)
repn: the ID of the replication
nb_iter: the boosting iteration at which the metrics are computed
tb_metrics: the tibble with the performance, calibration, and divergence metrics (one row for the train sample, one row for the calibration sample, one row for the validation sample, and one row for the test sample)
tb_prop_scores: additional metrics (\(\mathbb{P}(q_1 < \hat{s}(\mathbf{x}) < q_2)\) for multiple values of \(q_1\), with \(q_2 = 1-q_1\))
scores_hist: the information needed to plot a histogram of the scores on both the train set and the test set (using 20 equally sized bins over \([0,1]\)).
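The source of get_metrics_nb_iter() is not reproduced here; its key ingredient is that {xgboost} can return predictions computed with only the first nb_iter trees. A minimal sketch of that step is given below; the exact argument depends on the {xgboost} version (older releases use ntreelimit, recent ones iterationrange), so check ?predict.xgb.Booster.

# Sketch (assumption): scores obtained using only the first `nb_iter` trees.
# Older {xgboost} releases use `ntreelimit`; recent ones replace it with
# `iterationrange` (see ?predict.xgb.Booster for the exact convention).
scores_at_iter <- function(fitted_xgb, dmatrix, nb_iter) {
  predict(fitted_xgb, newdata = dmatrix, ntreelimit = nb_iter)
}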
We define another function, simul_xgb(), which trains an extreme gradient boosting model for a single replication. It calls get_metrics_nb_iter() at each boosting iteration from the second to the last (400th) and returns a list of length 399, where each element is the list returned by get_metrics_nb_iter().
Function simul_xgb()
#' Train an xgboost model and compute performance, calibration, and dispersion
#' metrics
#'
#' @param params tibble with hyperparameters for the simulation
#' @param ind index of the grid (numerical ID)
#' @param simu_data simulated data obtained with `simulate_data_wrapper()`
simul_xgb <- function(params, ind, simu_data) {
  tb_train <- simu_data$data$train |> rename(d = y)
  tb_valid <- simu_data$data$valid |> rename(d = y)
  tb_calib <- simu_data$data$calib |> rename(d = y)
  tb_test <- simu_data$data$test |> rename(d = y)
  true_prob <- list(
    train = simu_data$data$probs_train,
    valid = simu_data$data$probs_valid,
    calib = simu_data$data$probs_calib,
    test = simu_data$data$probs_test
  )

  ## Format data for xgboost----
  tb_train_xgb <- xgb.DMatrix(
    data = model.matrix(d ~ -1 + ., tb_train), label = tb_train$d
  )
  tb_valid_xgb <- xgb.DMatrix(
    data = model.matrix(d ~ -1 + ., tb_valid), label = tb_valid$d
  )
  tb_calib_xgb <- xgb.DMatrix(
    data = model.matrix(d ~ -1 + ., tb_calib), label = tb_calib$d
  )
  tb_test_xgb <- xgb.DMatrix(
    data = model.matrix(d ~ -1 + ., tb_test), label = tb_test$d
  )
  # Parameters for the algorithm
  param <- list(
    max_depth = params$max_depth, # Note: root node is indexed 0
    eta = params$eta,
    nthread = 1,
    objective = "binary:logistic",
    eval_metric = "auc"
  )
  watchlist <- list(train = tb_train_xgb, eval = tb_valid_xgb)

  ## Estimation----
  xgb_fit <- xgb.train(
    param, tb_train_xgb,
    nrounds = params$nb_iter_total, watchlist,
    verbose = 0
  )

  # Then, for each boosting iteration number up to params$nb_iter_total,
  # compute the predicted scores and evaluate the metrics
  resul <- map(
    seq(2, params$nb_iter_total),
    ~get_metrics_nb_iter(
      nb_iter = .x,
      params = params,
      fitted_xgb = xgb_fit,
      tb_train_xgb = tb_train_xgb,
      tb_valid_xgb = tb_valid_xgb,
      tb_calib_xgb = tb_calib_xgb,
      tb_test_xgb = tb_test_xgb,
      simu_data = simu_data,
      true_prob = true_prob
    )
  )
  resul
}
The desired number of replications for each scenario needs to be set:
repns_vector <- 1:100
The different configurations are reported in Table 5.1.
DT::datatable(grid)
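The grid object is built earlier in the book; as a reminder, it could be constructed along the following lines (the hyperparameter values below are assumptions, only the column names matter for the functions above):

# Sketch (assumption): a grid with an identifier `ind`, the maximum tree depth,
# the learning rate, and the total number of boosting iterations. The actual
# values used in the book may differ.
grid <- expand_grid(
  max_depth = c(2, 4, 6),
  eta = 0.3,
  nb_iter_total = 400
) |>
  mutate(ind = row_number())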
We define a function, simulate_xgb_scenario(), which trains the model on a dataset for all the hyperparameter values of the grid. This function performs a single replication of the simulations for a single scenario.
Function simulate_xgb_scenario()
simulate_xgb_scenario <- function(scenario, params_df, repn) {
  # Generate Data
  simu_data <- simulate_data_wrapper(
    scenario = scenario,
    params_df = params_df,
    repn = repn
  )
  # Looping over the grid hyperparameters for the scenario
  res_simul <- vector(mode = "list", length = nrow(grid))
  cli::cli_progress_bar("Iteration grid", total = nrow(grid), type = "tasks")
  for (j in 1:nrow(grid)) {
    curent_params <- grid |> dplyr::slice(!!j)
    res_simul[[j]] <- simul_xgb(
      params = curent_params,
      ind = curent_params$ind,
      simu_data = simu_data
    )
    cli::cli_progress_update()
  }

  # The metrics computed for all sets of hyperparameters (identified with `ind`)
  # and for each number of boosting iterations (`nb_iter`), for the current
  # scenario (`scenario`) and current replication number (`repn`)
  metrics_simul <- map(
    res_simul,
    function(simul_grid_j) map(simul_grid_j, "tb_metrics") |> list_rbind()
  ) |>
    list_rbind()

  # Sanity check
  # metrics_simul |> count(scenario, repn, ind, sample, nb_iter) |>
  #   filter(n > 1)

  # P(q_1 < s(x) < q_2)
  prop_scores_simul <- map(
    res_simul,
    function(simul_grid_j) map(simul_grid_j, "tb_prop_scores") |> list_rbind()
  ) |>
    list_rbind()

  # Sanity check
  # prop_scores_simul |> count(scenario, repn, ind, sample, nb_iter)

  # Histogram of estimated scores
  scores_hist <- map(
    res_simul,
    function(simul_grid_j) map(simul_grid_j, "scores_hist")
  )

  # Decomposition of expected losses
  decomposition_scores_simul <- map(
    res_simul,
    function(simul_grid_j) map(simul_grid_j, "tb_decomposition") |> list_rbind()
  ) |>
    list_rbind()

  list(
    metrics_simul = metrics_simul,
    scores_hist = scores_hist,
    prop_scores_simul = prop_scores_simul,
    decomposition_scores_simul = decomposition_scores_simul
  )
}
5.4 Estimations
We loop over the 16 scenarios and run the 100 replications in parallel.
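The parallel code is not shown on this page. A possible sketch using {future} and {furrr} (the package choice is an assumption; the object is named resul_rf to match the description below):

# Sketch (assumption): run the replications of each scenario in parallel.
library(future)
library(furrr)
plan(multisession, workers = 4)

resul_rf <- map(
  1:16,
  function(scenario) {
    future_map(
      repns_vector,
      ~ simulate_xgb_scenario(
        scenario = scenario, params_df = params_df, repn = .x
      ),
      .options = furrr_options(seed = TRUE)
    )
  }
)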
The resul_rf object is of length 16: each element contains the simulations for one scenario. For each scenario, the element is a list of length max(repns_vector), i.e., the number of replications. Each replication gives, in a list, the following elements:
metrics_simul: the metrics (AUC, Calibration, KL Divergence, etc.) for each model from the grid search, for all boosting iterations
scores_hist: the counts on bins defined on the estimated scores (on the train, validation, calibration, and test sets; for the calibration and test sets, the counts are given with and without recalibration)
prop_scores_simul: the estimates of \(\mathbb{P}(q_1 < \hat{s}(\mathbf{x}) < q_2)\) for various values of \(q_1\) and \(q_2\).
5.5 Results
We can now extract some information from the results.
We first aggregate all the computed metrics performance/calibration/divergence in a single tibble, metrics_xgb_all.
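Assuming the results are stored as described above, this aggregation can be sketched as follows:

# Sketch (assumption): stack the metric tibbles of all scenarios and
# replications into a single tibble.
metrics_xgb_all <- map(
  resul_rf,
  function(resul_scenario) {
    map(resul_scenario, "metrics_simul") |> list_rbind()
  }
) |>
  list_rbind()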
For each replication, we varied some hyperparameters. Let us identify some models of interest:
smallest: model with the lowest number of boosting iterations
largest: model with the highest number of boosting iterations
largest_auc: model with the highest AUC on the validation set
lowest_brier: model with the lowest Brier score on the validation set
lowest_ici: model with the lowest ICI on the validation set
lowest_kl: model with the lowest KL divergence on the validation set
high_ici: model with a mediocre calibration (ICI more than one standard deviation above the average ICI), used to illustrate recalibration
Code
# Identify the smallest tree on the validation set, when the scores are not
# recalibrated
smallest_xgb <-
  metrics_xgb_all |>
  filter(sample == "Validation", recalib == "None") |>
  group_by(scenario, repn) |>
  arrange(nb_iter) |>
  slice_head(n = 1) |>
  select(scenario, repn, ind, nb_iter, recalib) |>
  mutate(result_type = "smallest") |>
  ungroup()

# Identify the largest tree
largest_xgb <-
  metrics_xgb_all |>
  filter(sample == "Validation", recalib == "None") |>
  group_by(scenario, repn) |>
  arrange(desc(nb_iter)) |>
  slice_head(n = 1) |>
  select(scenario, repn, ind, nb_iter, recalib) |>
  mutate(result_type = "largest") |>
  ungroup()

# Identify the tree with the highest AUC on the validation set
highest_auc_xgb <-
  metrics_xgb_all |>
  filter(sample == "Validation", recalib == "None") |>
  group_by(scenario, repn) |>
  arrange(desc(AUC)) |>
  slice_head(n = 1) |>
  select(scenario, repn, ind, nb_iter, recalib) |>
  mutate(result_type = "largest_auc") |>
  ungroup()

# # Identify tree with lowest MSE
# lowest_mse_xgb <-
#   metrics_xgb_all |>
#   filter(sample == "Validation", recalib == "None") |>
#   group_by(scenario, repn) |>
#   arrange(mse) |>
#   slice_head(n = 1) |>
#   select(scenario, repn, ind, nb_iter, recalib) |>
#   mutate(result_type = "lowest_mse") |>
#   ungroup()

# Identify the tree with the lowest Brier score
lowest_brier_xgb <-
  metrics_xgb_all |>
  filter(sample == "Validation", recalib == "None") |>
  group_by(scenario, repn) |>
  arrange(brier) |>
  slice_head(n = 1) |>
  select(scenario, repn, ind, nb_iter, recalib) |>
  mutate(result_type = "lowest_brier") |>
  ungroup()

# Identify the tree with the lowest ICI
lowest_ici_xgb <-
  metrics_xgb_all |>
  filter(sample == "Validation", recalib == "None") |>
  group_by(scenario, repn) |>
  arrange(ici) |>
  slice_head(n = 1) |>
  select(scenario, repn, ind, nb_iter, recalib) |>
  mutate(result_type = "lowest_ici") |>
  ungroup()

# Identify the tree with the lowest KL divergence
lowest_kl_xgb <-
  metrics_xgb_all |>
  filter(sample == "Validation", recalib == "None") |>
  group_by(scenario, repn) |>
  arrange(KL_20_true_probas) |>
  slice_head(n = 1) |>
  select(scenario, repn, ind, nb_iter, recalib) |>
  mutate(result_type = "lowest_kl") |>
  ungroup()

mediocre_ici_xgb <-
  metrics_xgb_all |>
  filter(sample == "Validation", recalib == "None") |>
  group_by(scenario, repn) |>
  # For each replication for a scenario, we select a model with a mediocre
  # calibration
  mutate(
    mean_ici = mean(ici),
    sd_ici = sd(ici),
    upb_ici = mean_ici + sd_ici
  ) |>
  filter(ici > upb_ici) |>
  # Among the configurations for which the calibration is not within 1 sd of
  # the average calibration, we select the model with the lowest ICI
  arrange(ici) |>
  slice_head(n = 1) |>
  select(scenario, repn, ind, nb_iter, recalib) |>
  mutate(result_type = "high_ici") |>
  ungroup()

# Merge these
models_of_interest_xgb <-
  smallest_xgb |>
  bind_rows(largest_xgb) |>
  bind_rows(highest_auc_xgb) |>
  # bind_rows(lowest_mse_xgb) |>
  bind_rows(lowest_brier_xgb) |>
  bind_rows(lowest_ici_xgb) |>
  bind_rows(lowest_kl_xgb) |>
  bind_rows(mediocre_ici_xgb)

models_of_interest_metrics <- NULL
for (recalibration_method in c("None", "Platt", "Beta", "Isotonic")) {
  # Add metrics now
  models_of_interest_metrics <-
    models_of_interest_metrics |>
    bind_rows(
      models_of_interest_xgb |>
        select(-recalib) |>
        left_join(
          metrics_xgb_all |>
            filter(
              recalib == recalibration_method,
              sample %in% c("Validation", "Test")
            ),
          by = c("scenario", "repn", "ind", "nb_iter"),
          relationship = "many-to-many" # (calib, test)
        )
    )
}

models_of_interest_metrics <-
  models_of_interest_metrics |>
  mutate(
    result_type = factor(
      result_type,
      levels = c(
        "smallest", "largest", # "lowest_mse",
        "largest_auc", "lowest_brier", "lowest_ici", "lowest_kl", "high_ici"
      ),
      labels = c(
        "Smallest", "Largest", # "MSE*",
        "AUC*", "Brier*", "ICI*", "KL*", "High ICI"
      )
    )
  )

# Sanity check
# models_of_interest_metrics |> count(scenario, sample, result_type)
5.5.1 Metrics vs Number of Iterations
We define a function, plot_metrics(), to plot selected metrics (AUC, ICI, and KL divergence) as a function of the number of boosting iterations. Each curve corresponds to a value of the max_depth hyperparameter.
TBD
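While the chapter's figure code is still to be added, a minimal sketch of what plot_metrics() could look like with {ggplot2} is given below; it assumes the metrics have been joined with the hyperparameter grid so that a max_depth column is available.

# Sketch (assumption): average a metric across replications and plot it
# against the number of boosting iterations, one colour per max_depth.
plot_metrics <- function(metrics, scenario_id, metric = "AUC") {
  metrics |>
    filter(scenario == scenario_id, sample == "Validation", recalib == "None") |>
    group_by(max_depth, nb_iter) |>
    summarise(value = mean(.data[[metric]]), .groups = "drop") |>
    ggplot(aes(x = nb_iter, y = value, colour = factor(max_depth))) +
    geom_line() +
    labs(x = "Boosting iterations", y = metric, colour = "max_depth")
}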
5.5.2 Distribution of Scores
Let us extract all the histogram information computed over the simulations and put that in a single object, scores_hist_all.
We then define a function, plot_bp_xgb(), which plots the distribution of scores on the test set for a single replication (repn) of a scenario (scenario). We also define a helper function, plot_bp_interest(), which plots the histogram of the scores at a specific iteration number. We will then be able to plot the distributions at the beginning of the boosting iterations, at the end, at the point where the AUC is highest on the validation set, and at the point where the KL divergence between the distribution of scores on the validation set and the distribution of the true probabilities is lowest. We will plot the distributions of the scores returned by the classifier, as well as those obtained with the recalibrators.
5.5.3 KL Divergence and Calibration along Boosting Iterations
We can examine how the relationship between the divergence of the score distribution from the true probabilities and model calibration evolves as the number of boosting iterations increases.
We examine the average values of various metrics across the 100 replications for the "best" model selected according to different criteria:
AUC*: hyperparameters chosen to maximize the AUC on the validation set
ICI*: hyperparameters chosen to minimize the ICI on the validation set
Brier*: hyperparameters chosen to minimize the Brier score on the validation set
KL*: hyperparameters chosen to minimize the KL divergence between the distribution of scores on the validation set and the true probability distribution
Smallest: model with only 2 boosting iterations
Largest: model with 400 boosting iterations
High ICI: model chosen to illustrate the effects of score recalibration when the initial calibration is mediocre
Table 5.2: Performance and calibration metrics (Brier score, integrated calibration index, Kullback-Leibler divergence) computed on the test set, on scores returned by the model (column 'None') and on scores recalibrated using Platt scaling (column 'Platt'), beta calibration (column 'Beta'), or isotonic regression (column 'Isotonic')
| DGP | Noise variables | Optim. | BS (None) | ICI (None) | KL (None) | BS (Platt) | ICI (Platt) | KL (Platt) | BS (Beta) | ICI (Beta) | KL (Beta) | BS (Isotonic) | ICI (Isotonic) | KL (Isotonic) |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 1 | 0 | Smallest | 0.231 (0.001) | 0.115 (0.005) | 1.878 (0.063) | 0.214 (0.002) | 0.012 (0.004) | 2.037 (0.05) | 0.214 (0.002) | 0.012 (0.004) | 2.042 (0.061) | 0.214 (0.002) | 0.011 (0.004) | 2.04 (0.059) |
| 1 | 0 | Largest | 0.206 (0.002) | 0.026 (0.005) | 0.046 (0.012) | 0.206 (0.002) | 0.013 (0.004) | 0.095 (0.011) | 0.206 (0.002) | 0.013 (0.004) | 0.037 (0.012) | 0.206 (0.002) | 0.011 (0.004) | 0.306 (0.105) |
| 1 | 0 | AUC* | 0.201 (0.002) | 0.011 (0.005) | 0.051 (0.03) | 0.201 (0.002) | 0.017 (0.005) | 0.131 (0.031) | 0.201 (0.002) | 0.011 (0.004) | 0.051 (0.024) | 0.201 (0.002) | 0.012 (0.004) | 0.304 (0.095) |
| 1 | 0 | KL* | 0.202 (0.002) | 0.013 (0.005) | 0.021 (0.006) | 0.202 (0.002) | 0.015 (0.005) | 0.107 (0.016) | 0.202 (0.002) | 0.011 (0.004) | 0.027 (0.011) | 0.202 (0.002) | 0.012 (0.004) | 0.307 (0.101) |
| 1 | 0 | High ICI | 0.217 (0.002) | 0.062 (0.006) | 0.158 (0.066) | 0.213 (0.002) | 0.018 (0.005) | 0.199 (0.05) | 0.212 (0.002) | 0.013 (0.004) | 0.08 (0.053) | 0.212 (0.002) | 0.011 (0.004) | 0.348 (0.104) |
| 1 | 10 | Smallest | 0.231 (0.001) | 0.115 (0.005) | 1.878 (0.063) | 0.214 (0.002) | 0.012 (0.004) | 2.037 (0.05) | 0.214 (0.002) | 0.012 (0.004) | 2.042 (0.061) | 0.214 (0.002) | 0.011 (0.004) | 2.04 (0.059) |
| 1 | 10 | Largest | 0.21 (0.002) | 0.04 (0.005) | 0.042 (0.011) | 0.209 (0.002) | 0.016 (0.004) | 0.151 (0.027) | 0.208 (0.002) | 0.011 (0.005) | 0.033 (0.011) | 0.209 (0.002) | 0.011 (0.005) | 0.31 (0.117) |
| 1 | 10 | AUC* | 0.201 (0.002) | 0.014 (0.005) | 0.063 (0.032) | 0.201 (0.002) | 0.018 (0.005) | 0.135 (0.025) | 0.201 (0.002) | 0.011 (0.004) | 0.053 (0.024) | 0.201 (0.002) | 0.011 (0.004) | 0.296 (0.101) |
| 1 | 10 | KL* | 0.204 (0.002) | 0.015 (0.005) | 0.01 (0.004) | 0.204 (0.002) | 0.016 (0.005) | 0.109 (0.015) | 0.203 (0.002) | 0.01 (0.004) | 0.017 (0.008) | 0.204 (0.002) | 0.012 (0.004) | 0.302 (0.11) |
| 1 | 10 | High ICI | 0.229 (0.003) | 0.106 (0.006) | 0.442 (0.194) | 0.216 (0.002) | 0.025 (0.005) | 0.386 (0.102) | 0.215 (0.002) | 0.011 (0.004) | 0.106 (0.144) | 0.215 (0.002) | 0.012 (0.004) | 0.364 (0.137) |
| 1 | 50 | Smallest | 0.231 (0.001) | 0.115 (0.005) | 1.875 (0.05) | 0.214 (0.002) | 0.012 (0.004) | 2.037 (0.052) | 0.214 (0.002) | 0.011 (0.004) | 2.041 (0.056) | 0.214 (0.002) | 0.011 (0.004) | 2.041 (0.06) |
| 1 | 50 | Largest | 0.213 (0.002) | 0.048 (0.005) | 0.047 (0.011) | 0.211 (0.002) | 0.017 (0.005) | 0.192 (0.02) | 0.21 (0.002) | 0.011 (0.004) | 0.043 (0.013) | 0.211 (0.002) | 0.011 (0.005) | 0.333 (0.115) |
| 1 | 50 | AUC* | 0.201 (0.002) | 0.016 (0.005) | 0.08 (0.032) | 0.201 (0.002) | 0.018 (0.005) | 0.142 (0.029) | 0.201 (0.002) | 0.012 (0.004) | 0.06 (0.027) | 0.201 (0.002) | 0.012 (0.004) | 0.304 (0.098) |
| 1 | 50 | KL* | 0.205 (0.002) | 0.019 (0.005) | 0.009 (0.003) | 0.205 (0.002) | 0.016 (0.004) | 0.12 (0.02) | 0.205 (0.002) | 0.01 (0.004) | 0.022 (0.01) | 0.205 (0.002) | 0.012 (0.004) | 0.313 (0.095) |
| 1 | 50 | High ICI | 0.235 (0.003) | 0.129 (0.006) | 0.717 (0.169) | 0.216 (0.002) | 0.029 (0.006) | 0.453 (0.166) | 0.215 (0.002) | 0.01 (0.004) | 0.117 (0.223) | 0.215 (0.002) | 0.011 (0.004) | 0.359 (0.211) |
| 1 | 100 | Smallest | 0.231 (0.001) | 0.115 (0.005) | 1.875 (0.05) | 0.214 (0.002) | 0.012 (0.004) | 2.037 (0.052) | 0.214 (0.002) | 0.011 (0.004) | 2.041 (0.056) | 0.214 (0.002) | 0.011 (0.004) | 2.041 (0.06) |
| 1 | 100 | Largest | 0.214 (0.002) | 0.051 (0.005) | 0.051 (0.012) | 0.212 (0.002) | 0.017 (0.004) | 0.205 (0.014) | 0.211 (0.002) | 0.01 (0.004) | 0.052 (0.013) | 0.212 (0.002) | 0.011 (0.005) | 0.321 (0.103) |
| 1 | 100 | AUC* | 0.201 (0.002) | 0.016 (0.005) | 0.087 (0.029) | 0.201 (0.002) | 0.018 (0.005) | 0.144 (0.024) | 0.201 (0.002) | 0.012 (0.004) | 0.061 (0.024) | 0.201 (0.002) | 0.011 (0.004) | 0.324 (0.114) |
| 1 | 100 | KL* | 0.206 (0.002) | 0.019 (0.005) | 0.009 (0.004) | 0.206 (0.002) | 0.015 (0.004) | 0.125 (0.023) | 0.206 (0.002) | 0.01 (0.004) | 0.025 (0.012) | 0.206 (0.002) | 0.011 (0.004) | 0.302 (0.101) |
| 1 | 100 | High ICI | 0.236 (0.003) | 0.136 (0.006) | 0.807 (0.093) | 0.216 (0.002) | 0.031 (0.005) | 0.444 (0.032) | 0.215 (0.002) | 0.01 (0.004) | 0.086 (0.055) | 0.215 (0.002) | 0.011 (0.004) | 0.343 (0.115) |
| 2 | 0 | Smallest | 0.192 (0.001) | 0.226 (0.005) | 3.212 (0.199) | 0.131 (0.002) | 0.035 (0.005) | 1.889 (0.089) | 0.131 (0.002) | 0.028 (0.005) | 1.894 (0.084) | 0.13 (0.002) | 0.009 (0.004) | 1.718 (0.283) |
| 2 | 0 | Largest | 0.123 (0.002) | 0.016 (0.004) | 0.024 (0.01) | 0.124 (0.002) | 0.042 (0.003) | 0.881 (0.034) | 0.122 (0.002) | 0.01 (0.004) | 0.019 (0.006) | 0.122 (0.002) | 0.009 (0.004) | 0.209 (0.066) |
| 2 | 0 | AUC* | 0.118 (0.002) | 0.01 (0.004) | 0.029 (0.014) | 0.12 (0.002) | 0.038 (0.004) | 0.783 (0.217) | 0.118 (0.002) | 0.009 (0.004) | 0.027 (0.012) | 0.118 (0.002) | 0.009 (0.004) | 0.214 (0.069) |
| 2 | 0 | KL* | 0.12 (0.002) | 0.01 (0.004) | 0.013 (0.005) | 0.121 (0.002) | 0.038 (0.004) | 0.863 (0.141) | 0.119 (0.002) | 0.009 (0.004) | 0.016 (0.007) | 0.12 (0.002) | 0.009 (0.004) | 0.215 (0.073) |
| 2 | 0 | High ICI | 0.131 (0.003) | 0.048 (0.004) | 0.128 (0.125) | 0.13 (0.003) | 0.05 (0.005) | 0.845 (0.042) | 0.127 (0.003) | 0.009 (0.004) | 0.046 (0.014) | 0.127 (0.003) | 0.009 (0.003) | 0.237 (0.073) |
| 2 | 10 | Smallest | 0.192 (0.001) | 0.226 (0.005) | 3.212 (0.199) | 0.131 (0.002) | 0.035 (0.005) | 1.889 (0.089) | 0.131 (0.002) | 0.028 (0.005) | 1.894 (0.084) | 0.13 (0.002) | 0.009 (0.004) | 1.718 (0.283) |
| 2 | 10 | Largest | 0.126 (0.002) | 0.028 (0.004) | 0.032 (0.01) | 0.127 (0.002) | 0.045 (0.003) | 0.869 (0.035) | 0.124 (0.002) | 0.009 (0.004) | 0.022 (0.008) | 0.124 (0.002) | 0.01 (0.004) | 0.222 (0.075) |
| 2 | 10 | AUC* | 0.119 (0.002) | 0.012 (0.004) | 0.03 (0.016) | 0.12 (0.002) | 0.038 (0.004) | 0.759 (0.221) | 0.118 (0.002) | 0.01 (0.003) | 0.026 (0.011) | 0.119 (0.002) | 0.009 (0.004) | 0.213 (0.074) |
| 2 | 10 | KL* | 0.12 (0.002) | 0.011 (0.003) | 0.007 (0.003) | 0.122 (0.002) | 0.04 (0.004) | 0.887 (0.076) | 0.12 (0.002) | 0.009 (0.004) | 0.012 (0.006) | 0.121 (0.002) | 0.01 (0.004) | 0.205 (0.074) |
| 2 | 10 | High ICI | 0.137 (0.003) | 0.075 (0.004) | 0.288 (0.127) | 0.132 (0.002) | 0.058 (0.004) | 1.24 (0.354) | 0.128 (0.002) | 0.009 (0.004) | 0.045 (0.029) | 0.128 (0.002) | 0.009 (0.003) | 0.263 (0.1) |
| 2 | 50 | Smallest | 0.192 (0.001) | 0.226 (0.005) | 3.212 (0.199) | 0.131 (0.002) | 0.035 (0.005) | 1.889 (0.089) | 0.131 (0.002) | 0.028 (0.005) | 1.894 (0.084) | 0.13 (0.002) | 0.009 (0.003) | 1.718 (0.283) |
| 2 | 50 | Largest | 0.127 (0.002) | 0.033 (0.004) | 0.039 (0.011) | 0.128 (0.002) | 0.047 (0.003) | 0.863 (0.036) | 0.125 (0.002) | 0.009 (0.003) | 0.028 (0.009) | 0.125 (0.002) | 0.009 (0.003) | 0.223 (0.072) |
| 2 | 50 | AUC* | 0.119 (0.002) | 0.013 (0.004) | 0.038 (0.018) | 0.12 (0.002) | 0.038 (0.004) | 0.728 (0.221) | 0.119 (0.002) | 0.01 (0.004) | 0.029 (0.013) | 0.119 (0.002) | 0.009 (0.003) | 0.207 (0.069) |
| 2 | 50 | KL* | 0.121 (0.002) | 0.011 (0.003) | 0.006 (0.003) | 0.123 (0.002) | 0.041 (0.004) | 0.894 (0.045) | 0.121 (0.002) | 0.009 (0.003) | 0.014 (0.008) | 0.121 (0.002) | 0.009 (0.004) | 0.215 (0.068) |
| 2 | 50 | High ICI | 0.139 (0.003) | 0.089 (0.004) | 0.429 (0.105) | 0.133 (0.003) | 0.064 (0.005) | 1.799 (0.155) | 0.127 (0.002) | 0.01 (0.004) | 0.041 (0.02) | 0.127 (0.002) | 0.009 (0.003) | 0.235 (0.073) |
| 2 | 100 | Smallest | 0.192 (0.001) | 0.226 (0.005) | 3.21 (0.199) | 0.132 (0.002) | 0.035 (0.005) | 1.887 (0.091) | 0.131 (0.002) | 0.028 (0.005) | 1.894 (0.087) | 0.13 (0.002) | 0.009 (0.003) | 1.721 (0.286) |
| 2 | 100 | Largest | 0.129 (0.002) | 0.037 (0.004) | 0.044 (0.011) | 0.129 (0.002) | 0.048 (0.003) | 0.856 (0.035) | 0.126 (0.002) | 0.009 (0.003) | 0.034 (0.01) | 0.126 (0.002) | 0.009 (0.003) | 0.22 (0.073) |
| 2 | 100 | AUC* | 0.119 (0.002) | 0.014 (0.004) | 0.044 (0.023) | 0.121 (0.002) | 0.038 (0.004) | 0.729 (0.224) | 0.119 (0.002) | 0.01 (0.004) | 0.03 (0.013) | 0.119 (0.002) | 0.009 (0.003) | 0.214 (0.062) |
| 2 | 100 | KL* | 0.122 (0.002) | 0.012 (0.004) | 0.006 (0.003) | 0.124 (0.003) | 0.041 (0.004) | 0.89 (0.033) | 0.122 (0.002) | 0.009 (0.004) | 0.016 (0.008) | 0.122 (0.002) | 0.009 (0.003) | 0.217 (0.076) |
| 2 | 100 | High ICI | 0.14 (0.003) | 0.093 (0.004) | 0.482 (0.099) | 0.133 (0.003) | 0.065 (0.005) | 1.842 (0.155) | 0.127 (0.002) | 0.01 (0.004) | 0.042 (0.014) | 0.127 (0.002) | 0.009 (0.004) | 0.23 (0.07) |
| 3 | 0 | Smallest | 0.24 (0.001) | 0.075 (0.006) | 1.827 (0.099) | 0.233 (0.001) | 0.011 (0.005) | 1.65 (0.21) | 0.233 (0.001) | 0.011 (0.004) | 1.663 (0.205) | 0.233 (0.001) | 0.011 (0.004) | 1.674 (0.208) |
| 3 | 0 | Largest | 0.229 (0.002) | 0.044 (0.005) | 0.108 (0.026) | 0.226 (0.001) | 0.012 (0.005) | 0.092 (0.022) | 0.226 (0.001) | 0.012 (0.004) | 0.068 (0.022) | 0.226 (0.001) | 0.012 (0.005) | 0.291 (0.101) |
| 3 | 0 | AUC* | 0.22 (0.002) | 0.01 (0.004) | 0.012 (0.009) | 0.221 (0.002) | 0.012 (0.004) | 0.041 (0.013) | 0.22 (0.002) | 0.01 (0.004) | 0.013 (0.008) | 0.221 (0.002) | 0.011 (0.004) | 0.268 (0.106) |
| 3 | 0 | KL* | 0.221 (0.002) | 0.012 (0.004) | 0.005 (0.002) | 0.221 (0.002) | 0.011 (0.004) | 0.047 (0.014) | 0.221 (0.002) | 0.01 (0.004) | 0.017 (0.011) | 0.222 (0.002) | 0.011 (0.004) | 0.286 (0.115) |
| 3 | 0 | High ICI | 0.246 (0.002) | 0.105 (0.005) | 0.631 (0.086) | 0.231 (0.001) | 0.014 (0.004) | 0.268 (0.047) | 0.231 (0.001) | 0.011 (0.004) | 0.174 (0.035) | 0.231 (0.001) | 0.011 (0.004) | 0.376 (0.108) |
| 3 | 10 | Smallest | 0.24 (0.001) | 0.075 (0.006) | 1.827 (0.099) | 0.233 (0.001) | 0.011 (0.005) | 1.65 (0.21) | 0.233 (0.001) | 0.011 (0.004) | 1.663 (0.205) | 0.233 (0.001) | 0.011 (0.004) | 1.674 (0.208) |
| 3 | 10 | Largest | 0.231 (0.002) | 0.052 (0.005) | 0.11 (0.024) | 0.227 (0.001) | 0.012 (0.004) | 0.123 (0.025) | 0.227 (0.001) | 0.011 (0.004) | 0.074 (0.024) | 0.227 (0.001) | 0.012 (0.004) | 0.307 (0.103) |
| 3 | 10 | AUC* | 0.221 (0.001) | 0.011 (0.004) | 0.027 (0.018) | 0.221 (0.002) | 0.012 (0.005) | 0.046 (0.014) | 0.221 (0.001) | 0.01 (0.004) | 0.017 (0.009) | 0.221 (0.002) | 0.011 (0.004) | 0.28 (0.099) |
| 3 | 10 | KL* | 0.222 (0.002) | 0.014 (0.005) | 0.004 (0.002) | 0.222 (0.002) | 0.012 (0.004) | 0.056 (0.019) | 0.222 (0.002) | 0.01 (0.004) | 0.023 (0.016) | 0.222 (0.002) | 0.011 (0.004) | 0.284 (0.103) |
| 3 | 10 | High ICI | 0.253 (0.003) | 0.127 (0.005) | 0.932 (0.118) | 0.232 (0.001) | 0.015 (0.004) | 0.366 (0.035) | 0.232 (0.001) | 0.01 (0.004) | 0.191 (0.04) | 0.232 (0.001) | 0.011 (0.004) | 0.392 (0.108) |
| 3 | 50 | Smallest | 0.24 (0.001) | 0.075 (0.006) | 1.827 (0.099) | 0.233 (0.001) | 0.011 (0.005) | 1.65 (0.21) | 0.233 (0.001) | 0.011 (0.004) | 1.663 (0.205) | 0.233 (0.001) | 0.011 (0.004) | 1.674 (0.208) |
| 3 | 50 | Largest | 0.233 (0.002) | 0.06 (0.005) | 0.12 (0.025) | 0.228 (0.002) | 0.012 (0.005) | 0.163 (0.027) | 0.228 (0.002) | 0.01 (0.004) | 0.097 (0.029) | 0.229 (0.002) | 0.011 (0.004) | 0.342 (0.105) |
| 3 | 50 | AUC* | 0.221 (0.001) | 0.013 (0.005) | 0.053 (0.03) | 0.221 (0.002) | 0.012 (0.004) | 0.049 (0.015) | 0.221 (0.002) | 0.01 (0.004) | 0.021 (0.011) | 0.222 (0.002) | 0.011 (0.004) | 0.29 (0.124) |
| 3 | 50 | KL* | 0.224 (0.002) | 0.018 (0.006) | 0.004 (0.002) | 0.224 (0.002) | 0.011 (0.004) | 0.075 (0.023) | 0.224 (0.002) | 0.01 (0.004) | 0.037 (0.021) | 0.224 (0.002) | 0.011 (0.004) | 0.284 (0.104) |
| 3 | 50 | High ICI | 0.259 (0.003) | 0.145 (0.006) | 1.285 (0.169) | 0.233 (0.001) | 0.017 (0.005) | 0.402 (0.027) | 0.232 (0.001) | 0.01 (0.004) | 0.204 (0.044) | 0.232 (0.001) | 0.011 (0.004) | 0.424 (0.127) |
| 3 | 100 | Smallest | 0.24 (0.001) | 0.075 (0.007) | 1.827 (0.099) | 0.233 (0.001) | 0.011 (0.005) | 1.65 (0.21) | 0.233 (0.001) | 0.011 (0.004) | 1.663 (0.205) | 0.233 (0.001) | 0.011 (0.004) | 1.674 (0.208) |
| 3 | 100 | Largest | 0.235 (0.002) | 0.065 (0.005) | 0.129 (0.026) | 0.229 (0.001) | 0.012 (0.004) | 0.185 (0.028) | 0.229 (0.001) | 0.01 (0.004) | 0.115 (0.029) | 0.229 (0.001) | 0.012 (0.005) | 0.348 (0.119) |
| 3 | 100 | AUC* | 0.222 (0.001) | 0.015 (0.006) | 0.067 (0.031) | 0.222 (0.002) | 0.012 (0.004) | 0.052 (0.016) | 0.221 (0.002) | 0.01 (0.004) | 0.024 (0.012) | 0.222 (0.002) | 0.011 (0.004) | 0.286 (0.107) |
| 3 | 100 | KL* | 0.225 (0.002) | 0.019 (0.005) | 0.004 (0.002) | 0.224 (0.002) | 0.011 (0.004) | 0.08 (0.021) | 0.224 (0.002) | 0.01 (0.004) | 0.042 (0.02) | 0.225 (0.002) | 0.011 (0.004) | 0.301 (0.122) |
| 3 | 100 | High ICI | 0.261 (0.003) | 0.152 (0.005) | 1.454 (0.18) | 0.233 (0.001) | 0.017 (0.004) | 0.418 (0.036) | 0.232 (0.001) | 0.01 (0.004) | 0.206 (0.038) | 0.233 (0.001) | 0.011 (0.004) | 0.416 (0.11) |
| 4 | 0 | Smallest | 0.239 (0.001) | 0.081 (0.01) | 2.366 (0.299) | 0.229 (0.002) | 0.011 (0.005) | 2.059 (0.108) | 0.229 (0.002) | 0.011 (0.005) | 2.055 (0.097) | 0.229 (0.002) | 0.011 (0.005) | 2.061 (0.113) |
| 4 | 0 | Largest | 0.209 (0.002) | 0.028 (0.005) | 0.019 (0.006) | 0.208 (0.002) | 0.015 (0.004) | 0.117 (0.015) | 0.208 (0.002) | 0.011 (0.004) | 0.024 (0.009) | 0.208 (0.002) | 0.011 (0.004) | 0.315 (0.115) |
| 4 | 0 | AUC* | 0.204 (0.002) | 0.011 (0.004) | 0.039 (0.021) | 0.205 (0.002) | 0.016 (0.004) | 0.13 (0.02) | 0.204 (0.002) | 0.011 (0.004) | 0.035 (0.014) | 0.205 (0.002) | 0.011 (0.005) | 0.294 (0.1) |
| 4 | 0 | KL* | 0.206 (0.002) | 0.018 (0.005) | 0.011 (0.004) | 0.206 (0.002) | 0.015 (0.004) | 0.115 (0.012) | 0.206 (0.002) | 0.01 (0.004) | 0.019 (0.007) | 0.206 (0.002) | 0.011 (0.004) | 0.289 (0.105) |
| 4 | 0 | High ICI | 0.222 (0.003) | 0.073 (0.006) | 0.199 (0.286) | 0.215 (0.003) | 0.019 (0.006) | 0.249 (0.191) | 0.215 (0.003) | 0.011 (0.004) | 0.113 (0.215) | 0.215 (0.003) | 0.012 (0.005) | 0.36 (0.222) |
| 4 | 10 | Smallest | 0.239 (0.001) | 0.081 (0.011) | 2.366 (0.299) | 0.229 (0.002) | 0.011 (0.005) | 2.059 (0.108) | 0.229 (0.002) | 0.011 (0.005) | 2.055 (0.097) | 0.229 (0.002) | 0.011 (0.005) | 2.061 (0.113) |
| 4 | 10 | Largest | 0.213 (0.002) | 0.036 (0.005) | 0.018 (0.006) | 0.211 (0.002) | 0.015 (0.004) | 0.173 (0.02) | 0.211 (0.002) | 0.011 (0.005) | 0.048 (0.013) | 0.211 (0.002) | 0.011 (0.005) | 0.307 (0.106) |
| 4 | 10 | AUC* | 0.206 (0.002) | 0.014 (0.005) | 0.089 (0.026) | 0.206 (0.002) | 0.016 (0.005) | 0.142 (0.022) | 0.206 (0.002) | 0.012 (0.004) | 0.06 (0.017) | 0.206 (0.002) | 0.012 (0.005) | 0.294 (0.104) |
| 4 | 10 | KL* | 0.211 (0.002) | 0.028 (0.005) | 0.014 (0.005) | 0.21 (0.002) | 0.015 (0.004) | 0.156 (0.02) | 0.21 (0.002) | 0.011 (0.005) | 0.043 (0.012) | 0.21 (0.002) | 0.011 (0.005) | 0.307 (0.091) |
| 4 | 10 | High ICI | 0.232 (0.002) | 0.105 (0.005) | 0.307 (0.236) | 0.219 (0.002) | 0.021 (0.005) | 0.391 (0.109) | 0.219 (0.002) | 0.01 (0.004) | 0.14 (0.144) | 0.219 (0.002) | 0.011 (0.005) | 0.422 (0.181) |
| 4 | 50 | Smallest | 0.239 (0.001) | 0.081 (0.01) | 2.366 (0.299) | 0.229 (0.002) | 0.011 (0.005) | 2.059 (0.108) | 0.229 (0.002) | 0.011 (0.005) | 2.055 (0.097) | 0.229 (0.002) | 0.011 (0.005) | 2.061 (0.113) |
| 4 | 50 | Largest | 0.216 (0.002) | 0.042 (0.006) | 0.019 (0.005) | 0.214 (0.002) | 0.014 (0.004) | 0.206 (0.019) | 0.214 (0.002) | 0.011 (0.005) | 0.079 (0.019) | 0.215 (0.002) | 0.012 (0.005) | 0.327 (0.1) |
| 4 | 50 | AUC* | 0.207 (0.002) | 0.019 (0.005) | 0.126 (0.031) | 0.207 (0.002) | 0.016 (0.005) | 0.145 (0.025) | 0.207 (0.002) | 0.012 (0.004) | 0.072 (0.02) | 0.207 (0.002) | 0.012 (0.005) | 0.3 (0.104) |
| 4 | 50 | KL* | 0.215 (0.002) | 0.034 (0.005) | 0.017 (0.004) | 0.213 (0.002) | 0.014 (0.004) | 0.19 (0.021) | 0.213 (0.002) | 0.011 (0.004) | 0.072 (0.018) | 0.214 (0.002) | 0.012 (0.004) | 0.345 (0.101) |
| 4 | 50 | High ICI | 0.238 (0.003) | 0.125 (0.006) | 0.422 (0.047) | 0.221 (0.002) | 0.024 (0.005) | 0.424 (0.039) | 0.22 (0.002) | 0.011 (0.005) | 0.134 (0.023) | 0.22 (0.002) | 0.012 (0.005) | 0.401 (0.119) |
| 4 | 100 | Smallest | 0.239 (0.001) | 0.081 (0.01) | 2.366 (0.299) | 0.229 (0.002) | 0.011 (0.005) | 2.059 (0.108) | 0.229 (0.002) | 0.011 (0.005) | 2.055 (0.097) | 0.229 (0.002) | 0.011 (0.005) | 2.061 (0.113) |
| 4 | 100 | Largest | 0.218 (0.002) | 0.045 (0.006) | 0.02 (0.005) | 0.216 (0.002) | 0.014 (0.005) | 0.218 (0.019) | 0.216 (0.002) | 0.011 (0.004) | 0.092 (0.02) | 0.216 (0.002) | 0.012 (0.005) | 0.35 (0.105) |
| 4 | 100 | AUC* | 0.208 (0.002) | 0.021 (0.007) | 0.145 (0.035) | 0.207 (0.002) | 0.015 (0.005) | 0.147 (0.025) | 0.207 (0.002) | 0.012 (0.004) | 0.078 (0.021) | 0.207 (0.002) | 0.012 (0.005) | 0.334 (0.109) |
| 4 | 100 | KL* | 0.216 (0.002) | 0.037 (0.006) | 0.017 (0.004) | 0.215 (0.002) | 0.013 (0.005) | 0.202 (0.022) | 0.215 (0.002) | 0.011 (0.004) | 0.084 (0.021) | 0.215 (0.002) | 0.012 (0.005) | 0.367 (0.107) |
| 4 | 100 | High ICI | 0.241 (0.003) | 0.133 (0.006) | 0.486 (0.046) | 0.221 (0.002) | 0.025 (0.004) | 0.468 (0.057) | 0.22 (0.002) | 0.011 (0.005) | 0.141 (0.023) | 0.22 (0.002) | 0.011 (0.005) | 0.398 (0.101) |
5.5.5 Before vs. After Recalibration
Let us visualize how the KL divergence and the ICI of a selected model change after the scores are recalibrated, either using Platt scaling or isotonic regression.
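A possible sketch of such a comparison, using the objects built above (this is an illustration, not the book's figure code): it plots, for the models selected by AUC on the validation set, the test-set ICI before recalibration against the ICI after Platt scaling.

# Sketch (assumption): compare ICI before ("None") and after Platt scaling,
# on the test set, for the models selected by AUC on the validation set.
models_of_interest_metrics |>
  filter(sample == "Test", result_type == "AUC*",
         recalib %in% c("None", "Platt")) |>
  select(scenario, repn, recalib, ici) |>
  pivot_wider(names_from = recalib, values_from = ici) |>
  ggplot(aes(x = None, y = Platt)) +
  geom_point(alpha = 0.3) +
  geom_abline(slope = 1, intercept = 0, linetype = "dashed") +
  facet_wrap(~ scenario) +
  labs(x = "ICI before recalibration", y = "ICI after Platt scaling")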
Ojeda, Francisco M., Max L. Jansen, Alexandre Thiéry, Stefan Blankenberg, Christian Weimar, Matthias Schmid, and Andreas Ziegler. 2023. “Calibrating Machine Learning Approaches for Probability Estimation: A Comprehensive Comparison.” Statistics in Medicine 42 (29): 5451–78. https://doi.org/10.1002/sim.9921.