3  Metrics

Note

This chapter introduces the functions used to compute various metrics: performance, calibration, and the divergence between the distribution of the model-predicted scores and the distribution of the true probabilities.

Warning

The functions defined here are saved in functions/metrics.R.

3.1 Performance and Calibration Metrics

To measure performance, we chose to compute:

  • the true Mean Squared Error (MSE): the average of the squared differences between the predicted scores and the true probabilities (only available when the true probabilities are known, thanks to the knowledge of the PGD)
  • the true Mean Absolute Error (MAE): the average of the absolute differences between the predicted scores and the true probabilities (also requiring the true probabilities)
  • the accuracy, i.e., the proportion of correctly predicted instances, using a probability threshold of 0.5
  • the Area Under the ROC Curve (AUC).

To measure calibration, we compute two metrics:

  • the Brier score (Brier (1950))
  • the Integrated Calibration Index (Austin and Steyerberg (2019)).

Brier Score

Given a sample size \(n\), the Brier Score (Brier (1950)) is expressed as: \[ \begin{equation} \text{BS} = \frac{1}{n}\sum_{i=1}^{n} \big(\hat{s}(\mathbf{x}_i) - d_i\big)^{2}\enspace , \end{equation} \tag{3.1}\]

where \(\hat{s}(\mathbf{x}_i)\) and \(d_i \in \{0,1\}\) are the predicted score and observed event, respectively, for observation \(i\).

Integrated Calibration Index

Instead of defining bins, the Integrated Calibration Index, or ICI (Austin and Steyerberg (2019)), measures calibration using a local estimation (LOESS if the number of observations is lower than 1,000; a GAM otherwise).

The occurrence of the binary event is regressed on the predicted scores, employing either locally estimated scatterplot smoothing (LOESS) when the number of observations is small (\(n < 1000\)) or cubic regression splines for larger datasets. The ICI is defined as \[ \begin{equation} \text{ICI} = \int_{0}^{1} f(p) \phi(p)\, dp \end{equation} \tag{3.2}\] where \(f(p) = | p - g(p) |\) is the absolute difference between the calibration curve and the bisector, with \(p\) denoting a predicted score (i.e., \(p=\hat{s}(\mathbf{x})\)) and \(g(p)\) the value of the calibration curve at this predicted score. The density function of the distribution of predicted scores is denoted \(\phi(p)\).
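
To make this definition concrete, the ICI can be approximated empirically by smoothing the observed events on the predicted scores and averaging the absolute gap between the smoothed calibration curve and the scores. The snippet below is a minimal LOESS-based sketch of that idea, given only for illustration: in compute_metrics() further down, the ICI is obtained from gmish::ici() instead.

# Minimal illustrative sketch of the ICI (not the implementation used below).
# obs is a numeric 0/1 vector, scores a vector of predicted scores.
ici_loess_sketch <- function(obs, scores) {
  # Estimate the calibration curve g(p) by regressing the events on the scores
  calib <- loess(obs ~ scores)
  g_hat <- fitted(calib)
  # Average |p - g(p)| over the observed scores, which approximates
  # the integral of f(p) against the density of the scores
  mean(abs(scores - g_hat))
}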

All these metrics are computed in a function we name compute_metrics(), which takes three arguments:

  • obs: a vector of observed binary events
  • scores: a vector of predicted scores
  • true_probas: if available, a vector of true probabilities from the PGD (to compute MSE).
#' Brier Score
#'
#' The Brier Score \citep{brier_1950} is expressed as: \deqn{\text{BS} =
#' \frac{1}{n}\sum_{i=1}^{n} \big(\hat{s}(\mathbf{x}_i) - d_i\big)^{2}} where
#' \eqn{d_i \in \{0,1\}} is the observed event for observation \eqn{i}.
#'
#' @param scores vector of scores
#' @param obs vector of observed binary events
#'
#' @references Brier, G. W. (1950). Verification of forecasts expressed in terms
#' of probability. Monthly Weather Review 78: 1–3.
#'
#' @export
brier_score <- function(obs, scores) mean((scores - obs)^2)
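
As a quick sanity check of brier_score(), here is a toy call on made-up values:

# Toy example (made-up values)
obs_toy    <- c(0, 1, 1, 0, 1)
scores_toy <- c(.2, .8, .6, .3, .9)
brier_score(obs = obs_toy, scores = scores_toy)
# (0.2^2 + 0.2^2 + 0.4^2 + 0.3^2 + 0.1^2) / 5 = 0.068
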
#' Computes the performance and calibration metrics for a set of observed
#' events and predicted scores
#'
#' @returns
#' \itemize{
#'   \item \code{mse}: true Mean Squared Error, based on the true probabilities.
#'   \item \code{mae}: true Mean Absolute Error, based on the true probabilities.
#'   \item \code{acc}: accuracy with a .5 probability threshold.
#'   \item \code{AUC}: Area Under the ROC Curve.
#'   \item \code{brier}: Brier score.
#'   \item \code{ici}: Integrated Calibration Index.
#' }
#'
#' @param obs observed events
#' @param scores predicted scores
#' @param true_probas true probabilities from the PGD (to compute MSE)
#'
#' @importFrom purrr map
#' @importFrom tibble tibble
#' @importFrom dplyr bind_rows
#'
#' @export
compute_metrics <- function(obs,
                            scores,
                            true_probas = NULL) {

  # True MSE
  if (!is.null(true_probas)) {
    mse <- mean((true_probas - scores)^2)
  } else {
    mse <- NA
  }

  # True MAE
  if (!is.null(true_probas)) {
    mae <- mean(abs(true_probas - scores))
  } else {
    mae <- NA
  }

  # AUC
  AUC <- pROC::auc(obs, scores, levels = c("0", "1"), quiet = TRUE) |>
    as.numeric()

  # Brier Score
  brier <- brier_score(obs = as.numeric(as.character(obs)), scores = scores)
  # gmish::brier(pred = scores, obs = obs) #same results

  # ICI
  ici_quiet <- purrr::quietly(gmish::ici)
  ici <- ici_quiet(pred = scores, obs = as.numeric(as.character(obs)))
  ici <- ici$result

  # Accuracy (with a .5 probability threshold)
  pred_class <- ifelse(scores > .5, yes = 1, no = 0)
  acc <- mean(as.numeric(as.character(obs)) == pred_class)

  tibble(
    mse = mse,
    mae = mae,
    acc = acc,
    AUC = AUC,
    brier = brier,
    ici = ici
  )
}
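
As an illustration, compute_metrics() can be called on simulated data (the values below are arbitrary and assume the pROC and gmish packages are available); note that obs is passed as a factor with levels "0" and "1", as expected by the AUC computation:

# Illustrative call on simulated data (arbitrary values)
set.seed(123)
n <- 1000
true_probas <- runif(n)                                       # "true" probabilities
obs <- factor(rbinom(n, size = 1, prob = true_probas), levels = c("0", "1"))
scores <- pmin(pmax(true_probas + rnorm(n, sd = .05), 0), 1)  # noisy predicted scores
compute_metrics(obs = obs, scores = scores, true_probas = true_probas)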

3.2 Dispersion Metrics

We compute the Kullback-Leibler divergence (Kullback and Leibler (1951)) between the distribution of the estimated scores and the distribution of the true probabilities. Denoting \(Q\) the distribution of the scores and \(P\) the distribution of the true probabilities, the Kullback-Leibler divergence of \(Q\) with respect to \(P\) is: \[\begin{equation} D_{KL}(Q \,||\, P) = \sum_{i} Q(i) \log \frac{Q(i)}{P(i)}. \end{equation}\]

The distributions both need to be discretized. We divide the segment \([0,1]\) into \(m\) bins.

In the dispersion_metrics() function that we define to that end, we consider two values for the number of bins: \(m\in \{10,20\}\). We also consider switching the reference distribution (where \(Q\) denotes the distribution of the true probabilities and \(P\) denotes the distribution of the scores).
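
Before turning to the full function, here is a minimal sketch of the discretized computation (illustrative only): the two samples are binned on \([0,1]\), the bin densities are turned into bin probabilities (density divided by \(m\), since each bin has width \(1/m\)), and the formula above is applied to the two probability vectors. In dispersion_metrics() below, the divergence is instead computed with philentropy::distance().

# Illustrative sketch of the discretized KL divergence D_KL(Q || P),
# where q_sample gives the distribution Q and p_sample the reference P.
kl_discretized <- function(q_sample, p_sample, m = 10) {
  breaks <- seq(0, 1, length = m + 1)
  # Bin probabilities: density * bin width, i.e., density / m
  q <- hist(q_sample, breaks = breaks, plot = FALSE)$density / m
  p <- hist(p_sample, breaks = breaks, plot = FALSE)$density / m
  # Natural log here; dispersion_metrics() uses base 2 (unit = "log2")
  sum(ifelse(q > 0, q * log(q / p), 0))
}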

#' Computes the dispersion metrics for a set of observed and predicted
#' probabilities
#'
#' @returns
#' \itemize{
#'   \item \code{inter_quantile_25_75}: ratio of the inter-quantile range (25%-75%) of the scores to that of the true probabilities
#'   \item \code{inter_quantile_10_90}: ratio of the inter-quantile range (10%-90%) of the scores to that of the true probabilities
#'   \item \code{KL_10_true_probas}: KL divergence of the predicted scores with respect to the true probabilities, with 10 bins
#'   \item \code{KL_10_scores}: KL divergence of the true probabilities with respect to the predicted scores, with 10 bins
#'   \item \code{KL_20_true_probas}: KL divergence of the predicted scores with respect to the true probabilities, with 20 bins
#'   \item \code{KL_20_scores}: KL divergence of the true probabilities with respect to the predicted scores, with 20 bins
#'   \item \code{ind_cov}: absolute difference between the covariance of the true probabilities with the predicted scores and the variance of the true probabilities
#' }
#'
#' @param true_probas true probabilities from simulations
#' @param scores predicted scores
#'
#' @importFrom philentropy distance
#' @importFrom tibble tibble
#'
dispersion_metrics <- function(true_probas, scores){

  # Ratios of inter-quantile ranges (scores relative to true probabilities)
  inter_q_80 <- diff(quantile(scores, c(.9, .1))) /
    diff(quantile(true_probas, c(.9, .1)))
  inter_q_50 <- diff(quantile(scores, c(.75, .25))) /
    diff(quantile(true_probas, c(.75, .25)))

  # KL divergences
  m <- 10 # Number of bins
  h_p <- hist(true_probas, breaks = seq(0, 1, length = m + 1), plot = FALSE)
  h_phat <- hist(scores, breaks = seq(0, 1, length = m + 1), plot = FALSE)
  # Bin probabilities (density * bin width, i.e., density / m)
  h1 <- rbind(h_phat$density / m, h_p$density / m) # Reference: true probabilities
  h2 <- rbind(h_p$density / m, h_phat$density / m) # Reference: predicted scores
  KL_10_true_probas <- distance(
    h1, method = "kullback-leibler", unit = "log2", mute.message = TRUE)
  KL_10_scores <- distance(
    h2, method = "kullback-leibler", unit = "log2", mute.message = TRUE)

  m <- 20 # Number of bins
  h_p <- hist(true_probas, breaks = seq(0, 1, length = m + 1), plot = FALSE)
  h_phat <- hist(scores, breaks = seq(0, 1, length = m + 1), plot = FALSE)
  # Bin probabilities (density * bin width, i.e., density / m)
  h1 <- rbind(h_phat$density / m, h_p$density / m) # Reference: true probabilities
  h2 <- rbind(h_p$density / m, h_phat$density / m) # Reference: predicted scores
  KL_20_true_probas <- distance(
    h1, method = "kullback-leibler", unit = "log2", mute.message = TRUE)
  KL_20_scores <- distance(
    h2, method = "kullback-leibler", unit = "log2", mute.message = TRUE)

  # Indicator of the difference between variance and covariance
  var_p <- var(true_probas)
  cov_p_phat <- cov(true_probas, scores)
  ind_cov <- abs(cov_p_phat - var_p)

  # Collection
  dispersion_metrics <- tibble(
    "inter_quantile_25_75" = as.numeric(inter_q_50),
    "inter_quantile_10_90" = as.numeric(inter_q_80),
    "KL_10_true_probas" = as.numeric(KL_10_true_probas),
    "KL_10_scores" = as.numeric(KL_10_scores),
    "KL_20_true_probas" = as.numeric(KL_20_true_probas),
    "KL_20_scores" = as.numeric(KL_20_scores),
    "ind_cov" = ind_cov
    )

  dispersion_metrics
}
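
As before, here is an illustrative call on simulated data (assuming the philentropy package is installed, since its distance() function is used above):

# Illustrative call on simulated data (arbitrary values)
set.seed(456)
true_probas <- rbeta(1000, 2, 2)
scores <- pmin(pmax(true_probas + rnorm(1000, sd = .1), 0), 1)
dispersion_metrics(true_probas = true_probas, scores = scores)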

Lastly, we estimate \(\mathbb{P}(q_1 < \hat{s}(\mathbf{x}) < q_2)\), with \(q_2 = 1-q_1\), for different values of \(q_1\). To do so, we simply calculate the sample proportion of scores lying between \(q_1\) and \(q_2\). The prop_btw_quantiles() function does this.

#' Computes \eqn{\hat{P}(q_1 < s < q_2)}
#'
#' @param s scores
#' @param q1 lower quantile
#' @param q2 upper quantile (defaults to 1 - q1)
#'
#' @importFrom tibble tibble
prop_btw_quantiles <- function(s, q1, q2 = 1 - q1) {
  tibble(q1 = q1, q2 = q2, freq = mean(s < q2 & s > q1))
}
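
For example, the proportion of scores lying in \((q_1, 1-q_1)\) can be computed for several values of \(q_1\) at once (illustrative call on simulated scores, reusing the purrr and dplyr imports from above):

# Illustrative call: proportions for several values of q1
set.seed(789)
scores <- runif(1000)
dplyr::bind_rows(
  purrr::map(c(.1, .2, .3), ~prop_btw_quantiles(s = scores, q1 = .x))
)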