In this part, we look at the calibration of the model. In a nutshell, in traditional regression, a model output denoted as \(\hat{Z}\) is considered well calibrated for \(Z\) when (Krüger and Ziegel 2020): \[
\mathbb{E}\left[ Z \mid \hat{Z} \right] = \hat{Z}\enspace.
\]
However, here, we discretize the variable of interest (square meter prices) so as to obtain a multi-class variable taking values in \(\mathcal{Y} = [K]\), where \(K\) is a positive integer, and we place ourselves in an ordinal regression setting. In this context, a model \(h\in\mathcal{H}\) is deemed well calibrated in predicting the distribution of confidence scores \(\hat{\boldsymbol{P}}\) (Widmann, Lindsten, and Zachariah 2019) if: \[
\mathbb{P}(Y = k | \,\hat{\boldsymbol{P}} = \hat{\boldsymbol{p}}) = \hat{p}_k,\quad \text{ for all }k\in[K]\enspace,
\] where \(\hat{\boldsymbol{p}} = (\hat{p}_1, \ldots, \hat{p}_K)\) and \(\hat{p}_k\) represents the (confidence) score or probability estimate for class \(k\) under the model \(h\).
To measure the calibration of a model, we rely on the group-wise calibration error, known in the calibration literature as the Expected Calibration Error (ECE) (Naeini, Cooper, and Hauskrecht 2015) and denoted \(\mathcal{U}_{ECE}\).
The empirical distribution is calculated as follows: \[
\hat{\mathbb{P}}_{(\boldsymbol{X}_i, Y_i)_i}:= \frac{1}{n} \sum_{i=1}^n \delta_{(\boldsymbol{X}_i, Y_i)}\enspace,
\tag{3.1}\] where \((\boldsymbol{X}_i, Y_i)_{i=1, \ldots, n}\) are the observed data, i.i.d. copies of \((\boldsymbol{X}, Y)\).
Let the interval \([0,1]\) be partitioned into \(B\) bins based on quantiles of \(\max_k \hat{p}_k\) values, where each bin \(b\in[B]\) is associated with the set \(\mathcal{I}_b\) containing the indices of instances within that bin. This partitioning is used in the subsequent definition of the ECE, which is applicable to the multi-class classification framework and thereby extends to ordinal regression.
Model calibration in multi-class classification
To quantify the ECE, we introduce the accuracy and confidence measures within each bin \(b\in[B]\): \[
\textrm{acc}(\mathcal{I}_b) := \frac{1}{|\mathcal{I}_b|} \sum_{i\in\mathcal{I}_b} \mathbb{1}\left\{\hat{y}_i = y_i\right\} \quad\text{and}\quad \textrm{conf}(\mathcal{I}_b) := \frac{1}{|\mathcal{I}_b|} \sum_{i\in\mathcal{I}_b} \max_{k\in[K]} \hat{p}_{i,k}\enspace,
\] where \(\hat{y}_i\) denotes the class predicted by \(h\) for instance \(i\), and \(\hat{p}_{i,k}\) its confidence score for class \(k\).
The calibration error measure for a model \(h\in\mathcal{H}\) is then defined as \[
\mathcal{U}_{ECE}(h) := \frac{1}{n} \sum_{b\in[B]} |\mathcal{I}_b| \cdot \left|\textrm{acc}(\mathcal{I}_b) - \textrm{conf}(\mathcal{I}_b)\right|
\enspace,
\] and model \(h\) is considered (group-wise) well calibrated if \(\mathcal{U}_{ECE}(h) = 0\).
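To make the formula concrete, here is a small numeric sketch in R; the per-bin accuracies, confidences, and counts are made up for illustration.
# Hypothetical values for B = 3 bins
acc  <- c(0.90, 0.75, 0.60)  # accuracy in each bin
conf <- c(0.95, 0.80, 0.55)  # average confidence in each bin
nb   <- c(40, 35, 25)        # number of instances in each bin
sum((nb / sum(nb)) * abs(acc - conf))  # U_ECE = 0.05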
3.1 Load Data
Let us load the real estate data that were cleaned in Section 1.5, in Chapter 1.
load("../data/data_clean_all.rda")
Let us have a look at the quantiles of the observed and the estimated prices. We consider here levels from 0 to 1 in steps of .05 for the quantiles.
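The chunk computing these quantiles is not shown above; here is a minimal sketch, assuming the observed prices sit in the pm2 column and the estimated ones in pm2_estimated (the columns used later in this chapter).
obs <- data_clean_all$pm2
pred <- data_clean_all$pm2_estimated
# Quantile levels 0, .05, ..., 1 (here computed on the estimated prices)
quantiles <- quantile(pred, probs = seq(0, 1, by = .05))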
Then, for each observed price and each predicted price, we can assign the bin in which it lies.
# Change first and last values of quantiles (0% - 100%)
quantiles[1] <- 0
quantiles[length(quantiles)] <- Inf
# Get true and predicted labels based on quantiles
obs_labels <- cut(
  obs,
  breaks = quantiles,
  labels = 1:(length(quantiles) - 1)
) |>
  as.numeric()
pred_labels <- cut(
  pred,
  breaks = quantiles,
  labels = 1:(length(quantiles) - 1)
) |>
  as.numeric()
Now, in each bin, we can compute the mean of observed and predicted prices (obs_mean and pred_mean). This will allow us to create a calibration curve. We also compute the number of observations in the bin (nb), the length of the bin (dist), the midpoint of the two quantiles defining the bin (midpoints), and the difference between the average observed price and the average predicted price (diff).
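A sketch of that computation, assuming the tibble and dplyr packages are loaded (as elsewhere in this chapter) and that obs_labels and pred_labels come from the previous step; the bin widths (dist) and midpoints would be derived from the quantiles in the same way as in quantile_binning() below.
data_to_display <- tibble(
  obs = obs,
  pred = pred,
  obs_labs = obs_labels,
  pred_labs = pred_labels
)
local_means <- data_to_display |>
  group_by(pred_labs) |>
  summarise(
    obs_mean = mean(obs),   # average observed price in the bin
    pred_mean = mean(pred), # average predicted price in the bin
    nb = n()                # number of observations in the bin
  ) |>
  mutate(diff = obs_mean - pred_mean)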
For each example in the data, we will compute the distance of the predicted price to each of the midpoints of the quantile-defined bins. Before doing so, let us show an example with the first observation. The predicted price is:
data_to_display$pred[1]
[1] 9227.013
Its predicted bin is:
data_to_display$pred_labs[1]
[1] 2
To compute the distance between an observation and a midpoint of a bin, we create a function, distance_metric().
#' Calculates distance between a data point and a midpoint by normalizing to 1
#' the distance between quantiles
#' 
#' @param x observation
#' @param x_class assigned bin class
#' @param midpoint midpoint of the bin of interest
#' @param midpoints midpoints of all the bins
#' @param quantiles quantiles defining the bins
#' @param dist_quantiles length of each bin
distance_metric <- function(x,
                            x_class,
                            midpoint,
                            midpoints,
                            quantiles,
                            dist_quantiles) {
  midpoint_class <- which(midpoints == midpoint)
  if (x_class == midpoint_class) {
    return(distance(x, midpoint) / dist_quantiles[x_class])
  } else if (x_class > midpoint_class) {
    diff_class <- x_class - midpoint_class
    dist <- rep(NA, 2)
    # First and last distance values
    dist[1] <- 0.5
    dist[length(dist)] <- distance(quantiles[x_class], x) / dist_quantiles[x_class]
    dist <- sum(dist)
    if (diff_class > 1) {
      dist <- dist + (diff_class - 1)
    }
    return(dist)
  } else {
    diff_class <- midpoint_class - x_class
    dist <- rep(NA, 2)
    # First and last distance values
    dist[1] <- distance(quantiles[x_class + 1], x) / dist_quantiles[x_class]
    dist[length(dist)] <- 0.5
    dist <- sum(dist)
    if (diff_class > 1) {
      dist <- dist + (diff_class - 1)
    }
    return(dist)
  }
}
This function needs another one, distance(), that returns the Euclidean distance between two points in \(\mathbb{R}\).
distance <- function(a, b) {
  return(abs(a - b))
}
The distance to the j-th midpoint can be computed as follows:
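The original chunk is not shown here; below is a sketch that mirrors the double loop used later in calculate_ece(), assuming midpoints, quantiles, and dist_quantiles were computed alongside the bins.
# Distance from each predicted price to each bin midpoint
n_obs <- nrow(data_to_display)
dist_y_hat <- matrix(data = NA, nrow = n_obs, ncol = length(midpoints))
for (i in 1:n_obs) {
  for (j in 1:length(midpoints)) {
    dist_y_hat[i, j] <- distance_metric(
      x = data_to_display$pred[i],
      x_class = data_to_display$pred_labs[i],
      midpoint = midpoints[j],
      midpoints = midpoints,
      quantiles = quantiles,
      dist_quantiles = dist_quantiles
    )
  }
}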
We can then apply the softmax function to the distances so that, for each observation, the resulting values sum to 1. These values are considered as scores.
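As in calculate_ece() below, the softmax is applied to the negated distances, so that closer midpoints receive higher scores:
confidence_scores <- LDATS::softmax(-dist_y_hat)
rowSums(confidence_scores)  # each row sums to 1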
Recall that in data_to_display, each observed price and each predicted price were assigned to a bin. The bins were defined using the quantiles computed on the predicted prices. It is possible to compare the assigned bin based on the observed or the predicted value. We can use this comparison to compute an overall accuracy.
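A one-line sketch of that overall accuracy, using the columns of data_to_display introduced above:
# Share of observations whose observed price falls in the bin
# predicted for them
mean(data_to_display$obs_labs == data_to_display$pred_labs)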
For convenience, let us wrap the code from the illustrative example into some helper functions.
3.3.1 Quantile Binning
First, we are going to build a function that will:
partition the interval \([0,1]\) into \(B\) bins, based on quantiles of the estimated prices, where each bin \(b\in[B]\) is associated with the set \(\mathcal{I}_b\) containing the indices of instances within that bin,
assign each observed and each predicted value to the bin it lies in, and also return the midpoint of that bin.
#' Partition predicted values into k bins
#' 
#' @param obs vector of observed values (y)
#' @param pred vector of predicted values (\hat{y})
#' @param k number of bins
#' @param partition_along which vector of values to use to partition 
#'  (`"obs"` or `"pred"`)
#' 
#' @returns a list with four elements:
#' - `quantiles`: the empirical quantiles used to define the bins
#' - `midpoints`: the midpoint of each bin
#' - `dist_quantiles`: the width of each bin
#' - `data_to_display`: a tibble with the observed and predicted values, as well
#'   as their assigned bins (`obs_labs` and `pred_labs`), and the width of
#'   the bin (`dist`)
quantile_binning <- function(obs,
                             pred,
                             k,
                             partition_along = c("obs", "pred")) {
  if (partition_along == "obs") {
    # Define k bins on Y (quantile)
    quantiles <- quantile(obs, probs = (0:k) / k)
  } else {
    # Define k bins on \hat{Y} (quantile)
    quantiles <- quantile(pred, probs = (0:k) / k)
  }
  # Midpoints of those quantiles
  midpoints <- quantiles |> 
    rollmean(quantiles, k = 2)
  midpoints <- midpoints[-length(midpoints)]
  # Distance between quantiles (length of each bin)
  dist_quantiles <- diff(quantiles) # to normalize deviation
  dist_q <- tibble(
    pred_labs = 1:length(dist_quantiles), 
    dist = dist_quantiles
  )
  # Change first and last values of quantiles (0% - 100%)
  quantiles[1] <- 0
  quantiles[length(quantiles)] <- Inf
  # Get true and predicted labels based on quantiles
  obs_labels <- cut(
    obs, 
    breaks = quantiles, 
    labels = 1:(length(quantiles) - 1)
  ) |> 
    as.numeric()
  pred_labels <- cut(
    pred, 
    breaks = quantiles, 
    labels = 1:(length(quantiles) - 1)
  ) |> 
    as.numeric()
  data_to_display <- tibble(
    obs = obs,
    pred = pred,
    obs_labs = obs_labels, 
    pred_labs = pred_labels
  ) |> 
    left_join(dist_q, by = "pred_labs")
  list(
    quantiles = quantiles,
    midpoints = midpoints,
    dist_quantiles = dist_quantiles,
    data_to_display = data_to_display
  )
}
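To see what the function returns, here is a hypothetical call with 5 bins defined on the predicted prices (obs and pred as loaded earlier):
res_q <- quantile_binning(obs = obs, pred = pred, k = 5, partition_along = "pred")
names(res_q)
# "quantiles" "midpoints" "dist_quantiles" "data_to_display"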
Then, we define another function, compute_local_means(), that computes for each bin \(b\) the average observed value \(\mathbb{E}[Y \mid \hat{Y} \in [q_b, q_{b+1}]]\) and the average predicted value \(\mathbb{E}[\hat{Y} \mid \hat{Y} \in [q_b, q_{b+1}]]\).
#' Estimate `E[Y | \hat{Y} \in [q_b, q_{b+1}]]` together with
#' `E[\hat{Y} | \hat{Y} \in [q_b, q_{b+1}]]` in each bin
#' 
#' @param obs vector of observed values (y)
#' @param pred vector of predicted values (\hat{y})
#' @param data_to_display tibble with the binned observations, as returned by
#'   `quantile_binning()`
#' @param midpoints midpoints of all the bins
compute_local_means <- function(obs, pred, data_to_display, midpoints) {
  local_means <- data_to_display |> 
    group_by(pred_labs) |> 
    summarise(
      obs_mean = mean(obs),
      pred_mean = mean(pred),
      nb = n(),
      dist = min(dist)
    ) |> 
    mutate(midpoints = midpoints) |> 
    mutate(diff = obs_mean - pred_mean)
  local_means
}
3.3.2 Expected Calibration Error
First, we need to define a function, distance(), that returns the Euclidean distance between two points in \(\mathbb{R}\).
distance <- function(a, b) {
  return(abs(a - b))
}
Second, we define a function, distance_metric(), to compute the distance between a data point and a midpoint. The distance between quantiles is normalized to 1.
#' Calculates distance between a data point and a midpoint by normalizing to 1
#' the distance between quantiles
#' 
#' @param x observation
#' @param x_class assigned bin class
#' @param midpoint midpoint of the bin of interest
#' @param midpoints midpoints of all the bins
#' @param quantiles quantiles defining the bins
#' @param dist_quantiles length of each bin
distance_metric <- function(x,
                            x_class,
                            midpoint,
                            midpoints,
                            quantiles,
                            dist_quantiles) {
  midpoint_class <- which(midpoints == midpoint)
  if (x_class == midpoint_class) {
    return(distance(x, midpoint) / dist_quantiles[x_class])
  } else if (x_class > midpoint_class) {
    diff_class <- x_class - midpoint_class
    dist <- rep(NA, 2)
    # First and last distance values
    dist[1] <- 0.5
    dist[length(dist)] <- distance(quantiles[x_class], x) / dist_quantiles[x_class]
    dist <- sum(dist)
    if (diff_class > 1) {
      dist <- dist + (diff_class - 1)
    }
    return(dist)
  } else {
    diff_class <- midpoint_class - x_class
    dist <- rep(NA, 2)
    # First and last distance values
    dist[1] <- distance(quantiles[x_class + 1], x) / dist_quantiles[x_class]
    dist[length(dist)] <- 0.5
    dist <- sum(dist)
    if (diff_class > 1) {
      dist <- dist + (diff_class - 1)
    }
    return(dist)
  }
}
Lastly, we need a function to compute the Expected Calibration Error.
#' Function to calculate ECE
#' 
#' @param obs vector of observed values (y)
#' @param pred vector of predicted values (\hat{y})
#' @param k number of bins
calculate_ece <- function(obs, pred, k) {
  n_obs <- length(obs)
  # Quantile binning
  q_binning <- quantile_binning(
    obs = obs, 
    pred = pred, 
    k = k, 
    partition_along = "obs"
  )
  data_to_display <- q_binning$data_to_display
  quantiles <- q_binning$quantiles
  midpoints <- q_binning$midpoints
  dist_quantiles <- q_binning$dist_quantiles
  # Distance between predicted price and each midpoint of the bins
  dist_y_hat <- matrix(data = NA, nrow = n_obs, ncol = length(midpoints))
  for (i in 1:n_obs) {
    for (j in 1:length(midpoints)) {
      dist_y_hat[i, j] <- distance_metric(
        x = data_to_display$pred[i], 
        x_class = data_to_display$pred_labs[i], 
        midpoint = midpoints[j],
        midpoints = midpoints,
        quantiles = quantiles,
        dist_quantiles = dist_quantiles
      )
    }
  }
  # Softmax applied on the distances: get the scores
  confidence_scores <- LDATS::softmax(-dist_y_hat)
  # Predicted labels
  predicted_labels <- ramify::argmax(confidence_scores)
  # Max score for each observation: confidence
  max_confidence_scores <- apply(confidence_scores, 1, max)
  data_to_display <- data_to_display |> 
    mutate(scores_conf = max_confidence_scores)
  # Accuracy
  data_to_display <- data_to_display |> 
    mutate(acc = if_else(obs_labs == pred_labs, 1, 0))
  # ECE
  # New bins based on scores
  breaks <- quantile(data_to_display$scores_conf, probs = (0:k) / k)
  tb_breaks <- tibble(breaks = breaks, labels = 0:k) |> 
    group_by(breaks) |> 
    slice_tail(n = 1) |> 
    ungroup()
  x_with_class <- data_to_display |> 
    mutate(
      score_bin = cut(
        scores_conf,
        breaks = tb_breaks$breaks,
        labels = tb_breaks$labels[-1],
        include.lowest = TRUE
      )
    )
  # Accuracy and confidence in each bin
  ece_by_bin <- x_with_class |> 
    group_by(score_bin) |> 
    summarise(
      acc = mean(acc), 
      conf = mean(scores_conf), 
      nb = n()
    )
  ece <- sum((ece_by_bin$nb / n_obs) * abs(ece_by_bin$acc - ece_by_bin$conf))
  ece
}
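As a quick sanity check, a hypothetical call on the vectors obs and pred with 5 bins (Section 3.5.1 performs exactly this computation on the whole dataset):
calculate_ece(obs = obs, pred = pred, k = 5)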
3.4 Whole Dataset
Let us apply the quantile_binning() function to split the data into bins and compute the average of observed prices and predicted prices in each bin.
Let us make the number of bins vary so that we consider in turn 3, 5, and 10 bins.
k_values <- c(3, 5, 10)
We initiate empty objects that will contain the estimated means of observed and predicted prices in each bin.
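The initialization chunk is not shown; here is a minimal sketch, in the same spirit as the initialization used in Section 3.5 (the name local_means_k is ours).
# One slot per number of bins considered (3, 5, and 10)
local_means_k <- vector(mode = "list", length = length(k_values))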
Now, let us visualize the calibration of the price model for the different number of bins that we envisaged. To do so, we loop over the different values for the number of bins and create a calibration graph in each case.
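A sketch of that loop and of one calibration graph, assuming obs and pred still hold the observed and estimated prices; the ggplot2 aesthetics here are illustrative, not the book's exact figure code, and local_means_k is the list initialized above.
library(ggplot2)
for (i in seq_along(k_values)) {
  res_q <- quantile_binning(
    obs = obs, pred = pred, k = k_values[i], partition_along = "pred"
  )
  local_means_k[[i]] <- compute_local_means(
    obs = obs, pred = pred,
    data_to_display = res_q$data_to_display,
    midpoints = res_q$midpoints
  )
}
# Calibration curve for 5 bins: a well calibrated model hugs the diagonal
ggplot(local_means_k[[2]], aes(x = pred_mean, y = obs_mean)) +
  geom_abline(intercept = 0, slope = 1, linetype = "dashed") +
  geom_line() +
  geom_point(aes(size = nb)) +
  labs(x = "Average predicted price", y = "Average observed price")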
We can then display the calibration plots. In Figure 3.4, the left panel shows the calibration when considering 3 bins, the middle panel when the number of bins is 5, and the right panel when it is 10.
Let us now consider 5 bins, and contrast three situations for the calibration of the model:
using the whole dataset as previously,
using the whole dataset but with randomly drawn estimated prices,
focusing on a single arrondissement.
# Number of desired bins
k <- 5
# Init. result objects
data_to_display <- local_means <- ece_values <- vector(mode = "list", 3)
3.5.1 On the Whole Dataset
obs <- data_clean_all$pm2
pred <- data_clean_all$pm2_estimated
res_q <- quantile_binning(obs = obs, pred = pred, k = k, partition_along = "pred")
data_to_display[[1]] <- res_q$data_to_display |> 
  mutate(type = "Whole dataset")
local_means_current <- compute_local_means(
  obs = obs, 
  pred = pred,
  data_to_display = res_q$data_to_display, 
  midpoints = res_q$midpoints
)
local_means[[1]] <- local_means_current |> 
  mutate(type = "Whole dataset")
ece_current <- calculate_ece(
  obs = obs, 
  pred = pred,
  k = k
)
ece_values[[1]] <- ece_current
3.5.2 Random Model
We will draw the values for \(\hat{y}\) from a uniform distribution \(\mathcal{U}[\underline{Y}, \overline{Y}]\), where \(\underline{Y}\) is the minimum of the observed prices and \(\overline{Y}\) their maximum.
n <- nrow(data_clean_all)
y_lim <- c(min(data_clean_all$pm2), max(data_clean_all$pm2))
# Draw n values from a U[min(Y), max(Y)]
set.seed(123)
obs <- data_clean_all$pm2
pred <- runif(n, y_lim[1], y_lim[2])
res_q <- quantile_binning(obs = obs, pred = pred, k = k, partition_along = "pred")
data_to_display[[2]] <- res_q$data_to_display |> 
  mutate(type = "Random Model")
local_means_current <- compute_local_means(
  obs = obs, 
  pred = pred,
  data_to_display = res_q$data_to_display, 
  midpoints = res_q$midpoints
)
local_means[[2]] <- local_means_current |> 
  mutate(type = "Random Model")
ece_current <- calculate_ece(
  obs = obs, 
  pred = pred,
  k = k
)
ece_values[[2]] <- ece_current
3.5.3 7th Arrondissement
Let us focus on the 7th arrondissement. We first need to know which IRIS codes are in the 7th arrondissement. To do so, we can simply use the NOM_COM variable in the dataset.
data_7th <- data_clean_all |> 
  filter(NOM_COM == "Paris 7e Arrondissement")
nrow(data_7th)
[1] 327
Let us compute the average values of observed and predicted prices in each of the 5 bins defined using quantiles of the predicted prices in the 7th arrondissement.
obs <- data_7th$pm2
pred <- data_7th$pm2_estimated
res_q <- quantile_binning(obs = obs, pred = pred, k = k, partition_along = "pred")
data_to_display[[3]] <- res_q$data_to_display |> 
  mutate(type = "7th arrondissement")
local_means_current <- compute_local_means(
  obs = obs, 
  pred = pred,
  data_to_display = res_q$data_to_display, 
  midpoints = res_q$midpoints
)
local_means[[3]] <- local_means_current |> 
  mutate(type = "7th arrondissement")
ece_current <- calculate_ece(
  obs = obs, 
  pred = pred,
  k = k
)
ece_values[[3]] <- ece_current
3.5.4 Visualizing the Calibration for the 3 Cases
Now, we can create the calibration plot in each of the three situations, and plot the result.
The results are shown in Figure 3.5, with the calibration estimated on the whole dataset in the left panel, the calibration estimated when the estimated prices are randomly drawn in the middle panel, and the calibration estimated on the subset of observations corresponding to the 7th arrondissement only in the right panel.
For convenience, let us wrap the code from Section 3.5.3 in a function, so that we can compute the ECE for a given arrondissement.
#' Computes ECE for a single arrondissement
#' 
#' @param arrond name of the arrondissement
#' @param k number of bins
#' @param obs_name name of the variable with observed values in the data
#' @param pred_name name of the variable with predicted values in the data
#' @param data dataset to use
compute_ece_arrondissement <- function(arrond, 
                                       k, 
                                       obs_name = "pm2", 
                                       pred_name = "pm2_estimated", 
                                       data) {
  data_current <- data |> 
    filter(NOM_COM == !!arrond)
  obs <- data_current |> pull(!!obs_name)
  pred <- data_current |> pull(!!pred_name)
  ece <- calculate_ece(
    obs = obs, 
    pred = pred,
    k = k
  )
  ece
}
We just need to loop over the names of the different arrondissements to calculate the expected calibration errors.
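A sketch of that loop, using purrr and the arrondissement names from NOM_COM (the vector name arrond_names is ours; we keep 5 bins as in the rest of the section).
library(purrr)
# All arrondissement names present in the data
arrond_names <- unique(data_clean_all$NOM_COM)
ece_arrond <- map_dbl(
  .x = arrond_names,
  .f = ~compute_ece_arrondissement(
    arrond = .x,
    k = 5,
    data = data_clean_all
  )
)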
Table 3.1: Expected Calibration Error per Arrondissement.

Arrondissement                 ECE
Paris 1er Arrondissement     0.188
Paris 2e Arrondissement      0.097
Paris 3e Arrondissement      0.126
Paris 4e Arrondissement      0.116
Paris 5e Arrondissement      0.131
Paris 6e Arrondissement      0.126
Paris 7e Arrondissement      0.170
Paris 8e Arrondissement      0.157
Paris 9e Arrondissement      0.120
Paris 10e Arrondissement     0.132
Paris 11e Arrondissement     0.136
Paris 12e Arrondissement     0.102
Paris 13e Arrondissement     0.052
Paris 14e Arrondissement     0.089
Paris 15e Arrondissement     0.072
Paris 16e Arrondissement     0.055
Paris 17e Arrondissement     0.096
Paris 18e Arrondissement     0.029
Paris 19e Arrondissement     0.078
Paris 20e Arrondissement     0.088
3.7 Focus on Montmartre and Champs-de-Mars
Let us now focus on two IRIS: Montmartre and Champs-de-Mars. We will compute the ECE for these two IRIS, varying the number of neighbors used to aggregate data.
In Section 2.3 from Chapter 2, we computed the minimum distance from one IRIS to another, considering distances up to 30. We will also need this information here.
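A hedged sketch of how these distances could be reloaded; the file path is an assumption, only the object name neighbours_all (used below) comes from the source.
# Hypothetical path: adjust to wherever the Chapter 2 object was saved
load("../data/neighbours_all.rda")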
Let us define a function, compute_ece_iris_neighbors(), that computes the ECE for a given IRIS code, using the neighborhood up to a certain degree.
#' Computes ECE for a single IRIS, depending on the neighbors distance
#' 
#' @param iris_code IRIS identifier
#' @param k number of bins
#' @param num_neigh distance of neighbors to include
#' @param obs_name name of the variable with observed values in the data
#' @param pred_name name of the variable with predicted values in the data
#' @param data dataset to use
#' @param neighbors_dist table with the distances between IRIS
compute_ece_iris_neighbors <- function(iris_code, 
                                       k, 
                                       num_neigh,
                                       obs_name = "pm2", 
                                       pred_name = "pm2_estimated", 
                                       data, 
                                       neighbors_dist) {
  want_iris <- neighbors_dist |> 
    filter(from_iris == iris_code) |> 
    filter(distance <= num_neigh) |> 
    pull(to_iris) |> 
    unique()
  data_current <- data |> 
    filter(CODE_IRIS %in% !!want_iris)
  obs <- data_current |> pull(!!obs_name)
  pred <- data_current |> pull(!!pred_name)
  ece <- calculate_ece(
    obs = obs, 
    pred = pred,
    k = k
  )
  ece
}
Let us apply this function to Montmartre and Champs-de-Mars, and let us make the maximum distance of neighbors to include vary from 1 to 9 in steps of 1. For Montmartre:
ece_montmartre_neigh <- map_dbl(
  .x = 1:9,
  .f = ~compute_ece_iris_neighbors(
    iris_code = "751093503", 
    k = 5,
    num_neigh = .x, 
    obs_name = "pm2",
    pred_name = "pm2_estimated", 
    data = data_clean_all, 
    neighbors_dist = neighbours_all
  )
)
And for Champs-de-Mars:
ece_champ_neigh <- map_dbl(
  .x = 1:9,
  .f = ~compute_ece_iris_neighbors(
    iris_code = "751072812", 
    k = 5,
    num_neigh = .x, 
    obs_name = "pm2",
    pred_name = "pm2_estimated", 
    data = data_clean_all, 
    neighbors_dist = neighbours_all
  )
)
Table 3.2: Expected Calibration Error in Montmartre and Champs-de-Mars depending on the number of neighbors considered to compute the metric.

Neighbors   Montmartre   Champs-de-Mars
1                0.190            0.159
2                0.132            0.077
3                0.129            0.047
4                0.065            0.038
5                0.026            0.029
6                0.029            0.033
7                0.025            0.026
8                0.035            0.029
9                0.044            0.025
Krüger, Fabian, and Johanna F. Ziegel. 2020. “Generic Conditions for Forecast Dominance.” Journal of Business & Economic Statistics 39 (2021): 972–83.
Naeini, Mahdi Pakdaman, Gregory Cooper, and Milos Hauskrecht. 2015. “Obtaining Well Calibrated Probabilities Using Bayesian Binning.” In Proceedings of the AAAI Conference on Artificial Intelligence. Vol. 29, No. 1.
Widmann, David, Fredrik Lindsten, and Dave Zachariah. 2019. “Calibration Tests in Multi-Class Classification: A Unifying Framework.” Advances in Neural Information Processing Systems 32.