#' Brier Score
#'
#' The Brier score (Brier 1950) is expressed as: \deqn{\text{BS} =
#' \frac{1}{n}\sum_{i=1}^{n} \big(\hat{s}(\mathbf{x}_i) - d_i\big)^{2}} where
#' \eqn{\hat{s}(\mathbf{x}_i)} is the predicted score and \eqn{d_i \in \{0,1\}}
#' the observed event for observation \eqn{i}.
#'
#' @param scores vector of scores
#' @param obs vector of observed binary events
#'
#' @references Brier, G. W. (1950). Verification of forecasts expressed in terms
#' of probability. Monthly Weather Review 78: 1–3.
#'
#' @export
brier_score <- function(obs, scores) mean((scores - obs)^2)
3 Metrics
\(\DeclareMathOperator{\g}{g}\)
This chapter introduces the functions used to compute various metrics: performance, calibration, and the divergence between the distribution of model-predicted scores and that of the true probabilities.
The code defined here is saved in functions/metrics.R.
3.1 Performance and Calibration Metrics
To measure performance, we chose to compute:
- the true Mean Squared Error (MSE): the average of the squared differences between predicted scores and true probabilities (only available when the true probabilities are known from the PGD)
- the accuracy: the proportion of correctly predicted instances, using a probability threshold of 0.5
- the AUC: the area under the ROC curve.
To measure calibration, we compute two metrics:
Given a sample size \(n\), the Brier score (Brier 1950) is expressed as: \[ \begin{equation} \text{BS} = \frac{1}{n}\sum_{i=1}^{n} \big(\hat{s}(\mathbf{x}_i) - d_i\big)^{2}\enspace , \end{equation} \tag{3.1}\]
where \(\hat{s}(\mathbf{x}_i)\) and \(d_i \in \{0,1\}\) are the predicted score and observed event, respectively, for observation \(i\).
Instead of relying on bins, the Integrated Calibration Index, or ICI (Austin and Steyerberg 2019), measures calibration using a local estimation of the calibration curve. The occurrence of the binary event is regressed on the predicted scores, employing either locally estimated scatterplot smoothing (LOESS) when the number of observations is small (\(n < 1000\)) or a GAM with cubic regression splines for larger datasets. The ICI is defined as \[ \begin{equation} \text{ICI} = \int_{0}^{1} f(p) \phi(p)\, dp \end{equation} \tag{3.2}\] where \(f(p) = | p - \g(p) |\) is the absolute difference between the calibration curve and the bisector, \(p\) denotes a predicted score (i.e., \(p=\hat{s}(\mathbf{x})\)), and \(\g(p)\) is the value of the calibration curve at this predicted score. The density function of the distribution of predicted scores is denoted \(\phi(p)\).
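To make Equation (3.2) concrete, the sketch below estimates the ICI by hand in the LOESS regime (\(n < 1000\)): averaging \(|p - \g(p)|\) over the observed scores amounts to integrating \(f\) against the empirical density \(\phi\). The helper ici_sketch() is hypothetical and only illustrative; the chapter itself relies on gmish::ici().
# Hypothetical sketch of the ICI (the chapter uses gmish::ici() instead).
ici_sketch <- function(obs, scores) {
  # obs: numeric 0/1 vector of observed events
  # g(p): LOESS regression of the binary event on the predicted scores
  calib <- loess(obs ~ scores)
  g_p <- fitted(calib)
  # Averaging |p - g(p)| over the sample integrates f against phi
  mean(abs(scores - g_p))
}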
All these metrics are computed in a function we name compute_metrics(), which takes three arguments:
- obs: a vector of observed binary events
- scores: a vector of predicted scores
- true_probas: if available, a vector of true probabilities from the PGD (to compute the MSE).
#' Computes the calibration metrics for a set of observed and predicted
#' probabilities
#'
#' @returns
#' \itemize{
#' \item \code{mse}: True Mean Squared Error, based on the true probabilities.
#' \item \code{mae}: True Mean Absolute Error, based on the true probabilities.
#' \item \code{acc}: accuracy with a 0.5 probability threshold.
#' \item \code{AUC}: Area Under the ROC Curve.
#' \item \code{brier}: Brier score.
#' \item \code{ici}: Integrated Calibration Index.
#' }
#'
#' @param obs observed events
#' @param scores predicted scores
#' @param true_probas true probabilities from the PGD (to compute MSE and MAE)
#'
#' @importFrom purrr quietly
#' @importFrom tibble tibble
#' @importFrom dplyr bind_rows
#'
#' @export
compute_metrics <- function(obs,
                            scores,
                            true_probas = NULL) {
  # True MSE
  if (!is.null(true_probas)) {
    mse <- mean((true_probas - scores)^2)
  } else {
    mse <- NA
  }
  # True MAE
  if (!is.null(true_probas)) {
    mae <- mean(abs(true_probas - scores))
  } else {
    mae <- NA
  }
  # AUC
  AUC <- pROC::auc(obs, scores, levels = c("0", "1"), quiet = TRUE) |>
    as.numeric()
  # Brier Score
  brier <- brier_score(obs = as.numeric(as.character(obs)), scores = scores)
  # gmish::brier(pred = scores, obs = obs) # same results
  # ICI
  ici_quiet <- purrr::quietly(gmish::ici)
  ici <- ici_quiet(pred = scores, obs = as.numeric(as.character(obs)))
  ici <- ici$result
  # Accuracy
  pred_class <- ifelse(scores > .5, yes = 1, no = 0)
  acc <- sum(diag(table(obs = obs, pred = pred_class))) / length(scores)

  tibble(
    mse = mse,
    mae = mae,
    acc = acc,
    AUC = AUC,
    brier = brier,
    ici = ici
  )
}
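As an illustration, here is a hypothetical call on simulated data, assuming functions/metrics.R has been sourced and the {pROC}, {gmish}, {purrr} and {tibble} packages are installed. The logistic PGD and the deliberately overconfident scores below are our own toy choices, not part of the chapter's simulations.
# Hypothetical example: logistic PGD, overconfident scores, observed events
# drawn from the true probabilities.
set.seed(123)
n <- 500
true_probas <- plogis(rnorm(n))                        # true probabilities
scores <- plogis(1.5 * qlogis(true_probas))            # overconfident scores
obs <- factor(rbinom(n, size = 1, prob = true_probas)) # binary events
compute_metrics(obs = obs, scores = scores, true_probas = true_probas)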
3.2 Dispersion Metrics
We compute the Kullback-Leibler divergence (Kullback and Leibler 1951) between the distribution of the estimated scores and the distribution of the true probabilities. Denoting \(Q\) the distribution of the scores and \(P\) the distribution of the true probabilities, the Kullback-Leibler divergence of \(Q\) with respect to \(P\) is: \[\begin{equation} D_{KL}(Q \| P) = \sum_{i} Q(i) \log \frac{Q(i)}{P(i)}. \end{equation}\]
Both distributions need to be discretized: we divide the segment \([0,1]\) into \(m\) bins.
In the dispersion_metrics() function that we define to that end, we consider two values for the number of bins: \(m\in \{10,20\}\). We also consider switching the reference distribution (where \(Q\) denotes the distribution of the true probabilities and \(P\) denotes the distribution of scores).
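As an illustration of this discretization, the hypothetical helper below computes the divergence by hand with hist(): on \([0,1]\) split into \(m\) equal bins, hist()$density divided by \(m\) gives the bin probabilities. kl_discrete() is not part of functions/metrics.R, which relies on distance() from the {philentropy} package.
# Hypothetical helper: discretized KL divergence of q with respect to p.
kl_discrete <- function(q, p, m = 10) {
  breaks <- seq(0, 1, length = m + 1)
  Q <- hist(q, breaks = breaks, plot = FALSE)$density / m # bin probabilities
  P <- hist(p, breaks = breaks, plot = FALSE)$density / m
  sum(ifelse(Q > 0, Q * log2(Q / P), 0)) # 0 log 0 taken as 0; Inf if P(i) = 0
}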
#' Computes the dispersion metrics for a set of observed and predicted
#' probabilities
#'
#' @returns
#' \itemize{
#' \item \code{inter_quantile_25_75}: ratio of the inter-quantile range (25%-75%) of the scores to that of the true probabilities
#' \item \code{inter_quantile_10_90}: ratio of the inter-quantile range (10%-90%) of the scores to that of the true probabilities
#' \item \code{KL_10_true_probas}: KL divergence of the predicted scores w.r.t. the true probabilities, with 10 bins
#' \item \code{KL_10_scores}: KL divergence of the true probabilities w.r.t. the predicted scores, with 10 bins
#' \item \code{KL_20_true_probas}: KL divergence of the predicted scores w.r.t. the true probabilities, with 20 bins
#' \item \code{KL_20_scores}: KL divergence of the true probabilities w.r.t. the predicted scores, with 20 bins
#' \item \code{ind_cov}: absolute difference between the variance of the true probabilities and the covariance between true probabilities and predicted scores
#' }
#'
#' @param true_probas true probabilities from simulations
#' @param scores predicted scores
#'
dispersion_metrics <- function(true_probas, scores) {
  # Inter-quantile ratios
  inter_q_80 <- diff(quantile(scores, c(.9, .1))) /
    diff(quantile(true_probas, c(.9, .1)))
  inter_q_50 <- diff(quantile(scores, c(.75, .25))) /
    diff(quantile(true_probas, c(.75, .25)))

  # KL divergences (distance() from the {philentropy} package)
  m <- 10 # Number of bins
  h_p <- hist(true_probas, breaks = seq(0, 1, length = m + 1), plot = FALSE)
  h_phat <- hist(scores, breaks = seq(0, 1, length = m + 1), plot = FALSE)
  # Densities
  h1 <- rbind(h_phat$density / m, h_p$density / m) # Reference: true probabilities
  h2 <- rbind(h_p$density / m, h_phat$density / m) # Reference: predicted scores
  KL_10_true_probas <- distance(
    h1, method = "kullback-leibler", unit = "log2", mute.message = TRUE)
  KL_10_scores <- distance(
    h2, method = "kullback-leibler", unit = "log2", mute.message = TRUE)

  m <- 20 # Number of bins
  h_p <- hist(true_probas, breaks = seq(0, 1, length = m + 1), plot = FALSE)
  h_phat <- hist(scores, breaks = seq(0, 1, length = m + 1), plot = FALSE)
  # Densities
  h1 <- rbind(h_phat$density / m, h_p$density / m) # Reference: true probabilities
  h2 <- rbind(h_p$density / m, h_phat$density / m) # Reference: predicted scores
  KL_20_true_probas <- distance(
    h1, method = "kullback-leibler", unit = "log2", mute.message = TRUE)
  KL_20_scores <- distance(
    h2, method = "kullback-leibler", unit = "log2", mute.message = TRUE)

  # Indicator of the difference between variance and covariance
  var_p <- var(true_probas)
  cov_p_phat <- cov(true_probas, scores)
  ind_cov <- abs(cov_p_phat - var_p)

  # Collection
  dispersion_metrics <- tibble(
    "inter_quantile_25_75" = as.numeric(inter_q_50),
    "inter_quantile_10_90" = as.numeric(inter_q_80),
    "KL_10_true_probas" = as.numeric(KL_10_true_probas),
    "KL_10_scores" = as.numeric(KL_10_scores),
    "KL_20_true_probas" = as.numeric(KL_20_true_probas),
    "KL_20_scores" = as.numeric(KL_20_scores),
    "ind_cov" = ind_cov
  )
  dispersion_metrics
}
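A hypothetical example, assuming functions/metrics.R has been sourced and the {philentropy} and {tibble} packages are attached (dispersion_metrics() calls distance() and tibble() without a namespace prefix). Scores shrunk toward 0.5 are under-dispersed relative to the true probabilities, which inflates the KL divergences and pushes the inter-quantile ratios below one.
# Hypothetical example: under-dispersed scores (shrunk toward .5).
library(philentropy) # provides distance()
library(tibble)      # provides tibble()
set.seed(123)
true_probas <- rbeta(1000, shape1 = 2, shape2 = 2)
scores <- .5 + .5 * (true_probas - .5) # half the dispersion of true_probas
dispersion_metrics(true_probas = true_probas, scores = scores)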
Lastly, we estimate \(\mathbb{P}(q_1 < \hat{s}(\mathbf{x}) < q_2)\), with \(q_2 = 1-q_1\), for different values of \(q_1\) and \(q_2\). To do so, we simply calculate the sample proportion of scores between \(q_1\) and \(q_2\). The prop_btw_quantiles() function does this.
#' Computes \eqn{\hat{P}(q_1 < s < q_2)}
#'
#' @param s scores
#' @param q1 lower quantile
#' @param q2 upper quantile (defaults to 1 - q1)
prop_btw_quantiles <- function(s, q1, q2 = 1 - q1) {
  tibble(q1 = q1, q2 = q2, freq = mean(s < q2 & s > q1))
}
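For instance, with simulated scores (a hypothetical example; prop_btw_quantiles() only needs {tibble} to be attached):
# Hypothetical example: proportion of scores strictly between two cutpoints.
set.seed(123)
scores <- rbeta(1000, shape1 = 2, shape2 = 2)
prop_btw_quantiles(s = scores, q1 = .1)            # q2 defaults to .9
prop_btw_quantiles(s = scores, q1 = .25, q2 = .75)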