13  Priors: Illustration

Note

This chapter illustrates the application of the method on a real-world dataset. We train three models (a Generalized Linear Model, or GLM; a Generalized Additive Model, or GAM; and a Generalized Additive Model with model selection, or GAMSEL) on a binary target variable to estimate the underlying event probabilities using the available covariates. For each model, we compute scores on the test set and fit a Beta distribution to them via maximum likelihood estimation. This process yields three distinct priors for the true probability distribution of the event.

Code Availability

The functions for data preprocessing, model estimation, and Beta distribution fitting are stored in functions/real-data.R and will be used in subsequent chapters.

library(tidyverse)
── Attaching core tidyverse packages ──────────────────────── tidyverse 2.0.0 ──
✔ dplyr     1.1.4     ✔ readr     2.1.5
✔ forcats   1.0.0     ✔ stringr   1.5.1
✔ ggplot2   3.5.1     ✔ tibble    3.2.1
✔ lubridate 1.9.3     ✔ tidyr     1.3.1
✔ purrr     1.0.2     
── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
✖ dplyr::filter() masks stats::filter()
✖ dplyr::lag()    masks stats::lag()
ℹ Use the conflicted package (<http://conflicted.r-lib.org/>) to force all conflicts to become errors
library(gam)
Loading required package: splines
Loading required package: foreach

Attaching package: 'foreach'

The following objects are masked from 'package:purrr':

    accumulate, when

Loaded gam 1.22-3
library(gamsel)
Loaded gamsel 1.8-4
# Colours for train/test
colour_samples <- c(
  "Train" = "#0072B2",
  "Test" = "#D55E00"
)

13.1 Raw Data

To illustrate the process, we use the spambase dataset (available from the UCI Machine Learning Repository). The dataset contains 4,601 rows. The target variable, is_spam, will be explained using the 57 continuous predictors.

The dataset can be downloaded as follows:

if (!dir.exists("data")) dir.create("data")
download.file(
  url = "https://archive.ics.uci.edu/static/public/94/spambase.zip", 
  destfile = "data/spambase.zip"
)

The names of the columns are given in the spambase.names file in that archive.

# This chunk is not run
info_data <- scan(
  unz("data/spambase.zip", "spambase.names"), what = "character", sep = "\n"
)
# Print the names for this dataset (not very convenient...)
str_extract(info_data[31:length(info_data)], "^(.*):") |> 
  str_remove(":$") |> 
  (\(.x) str_c('"', .x, '",'))() |> 
  cat()

Then, we can import the dataset:

dataset <- read_csv(
  file = unz("data/spambase.zip", "spambase.data"),
  col_names = c(
    "word_freq_make", "word_freq_address", "word_freq_all", "word_freq_3d",
    "word_freq_our", "word_freq_over", "word_freq_remove", "word_freq_internet",
    "word_freq_order", "word_freq_mail", "word_freq_receive", "word_freq_will",
    "word_freq_people", "word_freq_report", "word_freq_addresses", 
    "word_freq_free", "word_freq_business", "word_freq_email", "word_freq_you",
    "word_freq_credit", "word_freq_your", "word_freq_font", "word_freq_000",
    "word_freq_money", "word_freq_hp", "word_freq_hpl", "word_freq_george",
    "word_freq_650", "word_freq_lab", "word_freq_labs", "word_freq_telnet",
    "word_freq_857", "word_freq_data", "word_freq_415", "word_freq_85",
    "word_freq_technology", "word_freq_1999", "word_freq_parts", "word_freq_pm",
    "word_freq_direct", "word_freq_cs", "word_freq_meeting", 
    "word_freq_original", "word_freq_project", "word_freq_re", "word_freq_edu",
    "word_freq_table", "word_freq_conference", "char_freq_;", "char_freq_(",
    "char_freq_[", "char_freq_!", "char_freq_$", "char_freq_#",
    "capital_run_length_average", "capital_run_length_longest",
    "capital_run_length_total", "is_spam"
  )
)
Rows: 4601 Columns: 58
── Column specification ────────────────────────────────────────────────────────
Delimiter: ","
dbl (58): word_freq_make, word_freq_address, word_freq_all, word_freq_3d, wo...

ℹ Use `spec()` to retrieve the full column specification for this data.
ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.

The target variable is is_spam.

target_name <- "is_spam"

13.2 Data Pre-processing

We define two functions to pre-process the data. The first one, split_train_test(), simply splits the dataset into two subsets: one for training the models (train) and another for testing them (test).

#' Split dataset into train and test set
#'
#' @param data dataset
#' @param prop_train proportion of observations in the train set (defaults to .8)
#' @param seed desired seed (defaults to `NULL`)
#'
#' @returns a list with two elements: the train set and the test set
split_train_test <- function(data,
                             prop_train = .8,
                             seed = NULL) {
  if (!is.null(seed)) set.seed(seed)
  size_train <- round(prop_train * nrow(data))
  ind_sample <- sample(1:nrow(data), replace = FALSE, size = size_train)

  list(
    train = data |> dplyr::slice(ind_sample),
    test = data |> dplyr::slice(-ind_sample)
  )
}

With the current dataset:

data <- split_train_test(data = dataset, prop_train = .8, seed = 1234)
names(data)
[1] "train" "test" 

Some of the models we use need the data to be numerical. We thus build a function, encode_dataset(), that transforms the categorical columns into sets of dummy variables (one-hot encoding, so each level gets its own column). This step is made using the convenient functions from the {recipes} package. In addition, the spline function from the {gam} package does not support variable names that do not respect the R naming conventions. We thus rename all the variables and keep track of the changes.

The encode_dataset() function returns a list with five elements:

  • train: the train set where categorical variables have been transformed into dummy variables
  • test: the test set where categorical variables have been transformed into dummy variables
  • initial_covariate_names: vector of names of all explanatory variables
  • categ_names: vector of new names of categorical variables (if any)
  • covariate_names: vector of new names of all explanatory variables (including categorical ones).
#' One-hot encoding, and renaming of variables whose names do not respect the
#' R naming conventions
#'
#' @param data_train train set
#' @param data_test test set
#' @param target_name name of the target (response) variable
#' @param intercept should a column for an intercept be added? Defaults to
#'  `FALSE` (currently unused)
#'
#' @returns list with five elements:
#'  - `train`: train set
#'  - `test`: test set
#'  - `initial_covariate_names`: vector of names of all explanatory variables
#'  - `categ_names`: vector of new names of categorical variables (if any)
#'  - `covariate_names`: vector of new names of all explanatory variables (including
#'     categorical ones).
encode_dataset <- function(data_train,
                           data_test,
                           target_name,
                           intercept = FALSE) {

  col_names <- colnames(data_train)
  col_names_covariates <- col_names[-which(col_names == target_name)]
  new_names_covariates <- str_c("X_", 1:length(col_names_covariates))
  data_train <- data_train |>
    rename_with(.cols = all_of(col_names_covariates), .fn = ~new_names_covariates)
  data_test <- data_test |>
    rename_with(.cols = all_of(col_names_covariates), .fn = ~new_names_covariates)

  data_rec <- recipes::recipe(
    formula(str_c(target_name, " ~ .")),
    data = data_train
  )

  ref_cell <- data_rec |> recipes::step_dummy(
    recipes::all_nominal(), -recipes::all_outcomes(),
    one_hot = TRUE
  ) |>
    recipes::prep(training = data_train)

  X_train_dmy <- recipes::bake(ref_cell, new_data = data_train)
  X_test_dmy  <- recipes::bake(ref_cell, new_data = data_test)

  # Identify the dummy columns created for categorical variables: they are
  # the columns of the encoded train set that were absent from the original
  new_names <- names(X_train_dmy)
  original_vars <- names(data_train)
  categ_names <- setdiff(new_names, original_vars)
  covariate_names <- colnames(X_train_dmy)
  covariate_names <- covariate_names[covariate_names != target_name]
  categ_names <- categ_names[categ_names != target_name]
  list(
    train = X_train_dmy,
    test = X_test_dmy,
    initial_covariate_names = col_names_covariates,
    categ_names = categ_names,
    covariate_names = covariate_names
  )
}

Let us use the encode_dataset() function to rename the columns here. As there is no categorical variable among the predictors, no dummy variable will be created.

data_dmy <- encode_dataset(
  data_train = data$train,
  data_test = data$test,
  target_name = target_name,
  intercept = FALSE
)
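
We can check that no dummy variables were created, and look at the correspondence between the new and the initial covariate names:

# No categorical predictors, hence no dummy columns
length(data_dmy$categ_names)
# First renamed covariates and the original names they replace
head(data_dmy$covariate_names)
head(data_dmy$initial_covariate_names)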

13.3 Estimation Functions

13.3.1 GLM

Let us estimate the probability that the event occurs (the email is spam) using a Generalized Linear Model with a logit link (i.e., a logistic regression).

We first build the formula:

form <- str_c(target_name, "~.") |> as.formula()

Then, we fit the model:

fit <- glm(form, data = data_dmy$train, family = "binomial")

Lastly, we can get the predicted scores:

scores_train <- predict(fit, newdata = data_dmy$train, type = "response")
scores_test <- predict(fit, newdata = data_dmy$test, type = "response")

We wrap these steps in a helper function:

#' Train a GLM-logistic model
#'
#' @param data_train train set
#' @param data_test test set
#' @param target_name name of the target (response) variable
#' @param return_model if TRUE, the estimated model is returned
#'
#' @returns list with estimated scores on train set (`scores_train`) and on
#'  test set (`scores_test`)
train_glm <- function(data_train,
                      data_test,
                      target_name,
                      return_model = FALSE) {
  # Encode dataset so that categorical variables become dummy variables
  data_dmy <- encode_dataset(
    data_train = data_train,
    data_test = data_test,
    target_name = target_name,
    intercept = FALSE
  )
  # Formula for the model
  form <- str_c(target_name, "~.") |> as.formula()
  # Estimation
  fit <- glm(form, data = data_dmy$train, family = "binomial")
  # Scores on train and test set
  scores_train <- predict(fit, newdata = data_dmy$train, type = "response")
  scores_test <- predict(fit, newdata = data_dmy$test, type = "response")

  if (return_model) {
    list(scores_train = scores_train, scores_test = scores_test, fit = fit)
  } else {
    list(scores_train = scores_train, scores_test = scores_test, fit = NULL)
  }
}

This function can then be used in a very simple way:

scores_glm <- train_glm(
    data_train = data$train, data_test = data$test, target_name = target_name
  )
Warning: glm.fit: fitted probabilities numerically 0 or 1 occurred

This warning signals that some fitted probabilities are numerically equal to 0 or 1 (often a sign of quasi-separation). This is one of the reasons why, in Section 13.4, the estimated scores are squeezed strictly inside \((0,1)\) before fitting the Beta distribution.

13.3.2 GAM

We then estimate the probability that the event occurs (the email is spam) using a Generalized Additive Model.

We first build the formula. Here, this is a tiny bit more complex than with the GLM. The model will contain smooth terms (for numerical variables) and linear terms (for categorical variables which, if present in the data, were encoded as dummy variables).

We first identify the numerical variables:

num_names <- data_dmy$covariate_names[!data_dmy$covariate_names %in% data_dmy$categ_names]
num_names <- num_names[!num_names %in% target_name]

Then, we count the number of unique values for each variable. This step ensures that the degrees of freedom of the spline applied to a numerical variable do not exceed its number of unique values. We arbitrarily set the degrees of freedom to 6; if a variable has fewer unique values than this, we use its number of unique values as the degrees of freedom for that variable.

spline_df <- 6
num_nb_val <- map_dbl(num_names, ~data_dmy$train |> pull(.x) |> unique() |> length())
deg_num <- ifelse(num_nb_val < spline_df, num_nb_val, spline_df)

Then, we can construct the formula object. We begin with the numerical variables:

num_term <- str_c("s(", num_names, ", df = ", deg_num, ")", collapse = " + ")

Note: if there are no numerical variables, num_term is simply set to NULL.

If there are categorical variables, we do not use a smoothing function for them.

if (length(data_dmy$categ_names) > 0) {
  categ_terms <- str_c(data_dmy$categ_names, collapse = " + ")
} else {
  categ_terms <- NULL
}

Lastly, we can create the part of the formula that contains the predictors:

form_terms <- str_c(num_term, categ_terms, sep = " + ")

The formula can finally be created:

form_gam <- str_c(target_name, " ~ ", form_terms) |> as.formula()
form_gam
is_spam ~ s(X_1, df = 6) + s(X_2, df = 6) + s(X_3, df = 6) + 
    s(X_4, df = 6) + s(X_5, df = 6) + s(X_6, df = 6) + s(X_7, 
    df = 6) + s(X_8, df = 6) + s(X_9, df = 6) + s(X_10, df = 6) + 
    s(X_11, df = 6) + s(X_12, df = 6) + s(X_13, df = 6) + s(X_14, 
    df = 6) + s(X_15, df = 6) + s(X_16, df = 6) + s(X_17, df = 6) + 
    s(X_18, df = 6) + s(X_19, df = 6) + s(X_20, df = 6) + s(X_21, 
    df = 6) + s(X_22, df = 6) + s(X_23, df = 6) + s(X_24, df = 6) + 
    s(X_25, df = 6) + s(X_26, df = 6) + s(X_27, df = 6) + s(X_28, 
    df = 6) + s(X_29, df = 6) + s(X_30, df = 6) + s(X_31, df = 6) + 
    s(X_32, df = 6) + s(X_33, df = 6) + s(X_34, df = 6) + s(X_35, 
    df = 6) + s(X_36, df = 6) + s(X_37, df = 6) + s(X_38, df = 6) + 
    s(X_39, df = 6) + s(X_40, df = 6) + s(X_41, df = 6) + s(X_42, 
    df = 6) + s(X_43, df = 6) + s(X_44, df = 6) + s(X_45, df = 6) + 
    s(X_46, df = 6) + s(X_47, df = 6) + s(X_48, df = 6) + s(X_49, 
    df = 6) + s(X_50, df = 6) + s(X_51, df = 6) + s(X_52, df = 6) + 
    s(X_53, df = 6) + s(X_54, df = 6) + s(X_55, df = 6) + s(X_56, 
    df = 6) + s(X_57, df = 6)

Then, we fit the model:

fit <- gam::gam(formula = form_gam, family = binomial, data = data_dmy$train)

Lastly, we can get the predicted scores:

scores_train <- predict(fit, newdata = data_dmy$train, type = "response")
scores_test <- predict(fit, newdata = data_dmy$test, type = "response")

We wrap these steps in a helper function:

#' Train a GAM model
#'
#' @param data_train train set
#' @param data_test test set
#' @param target_name name of the target (response) variable
#' @param spline_df degree of freedom for the splines
#' @param return_model if TRUE, the estimated model is returned
#'
#' @returns list with estimated scores on train set (`scores_train`) and on
#'  test set (`scores_test`)
train_gam <- function(data_train,
                      data_test,
                      target_name,
                      spline_df = 5,
                      return_model = FALSE) {
  # Encode dataset so that categorical variables become dummy variables
  data_dmy <- encode_dataset(
    data_train = data_train,
    data_test = data_test,
    target_name = target_name,
    intercept = FALSE
  )

  # Formula for the model
  ## Names of numerical variables
  num_names <- data_dmy$covariate_names[!data_dmy$covariate_names %in% data_dmy$categ_names]
  num_names <- num_names[!num_names %in% target_name]
  if (length(num_names) > 0) {
    ## Number of unique values
    num_nb_val <- map_dbl(num_names, ~data_dmy$train |> pull(.x) |> unique() |> length())
    ## Degree for numerical variables
    deg_num <- ifelse(num_nb_val < spline_df, num_nb_val, spline_df)
    num_term <- str_c("s(", num_names, ", df = ", deg_num, ")", collapse = " + ")
  } else {
    num_term <- NULL
  }
  if (length(data_dmy$categ_names) > 0) {
    categ_terms <- str_c(data_dmy$categ_names, collapse = " + ")
  } else {
    categ_terms <- NULL
  }
  
  form_terms <- str_c(num_term, categ_terms, sep = " + ")
  
  form_gam <- str_c(target_name, " ~ ", form_terms) |> as.formula()
  # Estimation
  fit <- gam::gam(formula = form_gam, family = binomial, data = data_dmy$train)
  # Scores on train and test set
  scores_train <- predict(fit, newdata = data_dmy$train, type = "response")
  scores_test <- predict(fit, newdata = data_dmy$test, type = "response")

  if (return_model) {
    list(scores_train = scores_train, scores_test = scores_test, fit = fit)
  } else {
    list(scores_train = scores_train, scores_test = scores_test, fit = NULL)
  }
}

This function can then be used in a very simple way:

scores_gam <- train_gam(
    data_train = data$train, data_test = data$test, target_name = target_name,
    spline_df = 6
)

13.3.3 GAMSEL

We then estimate the probability that the event occurs (the email is spam) using a Generalized Additive Model with model selection.

First, we need to split the target variable and the predictors into distinct objects.

X_dmy_train <- data_dmy$train |> dplyr::select(-!!target_name)
X_dmy_test <- data_dmy$test |> dplyr::select(-!!target_name)

Then, we need to make sure that all variables obtained after using the encode_dataset() function are coded as numeric: the estimation function from {gamsel} does not allow integer variables.

X_dmy_train <- X_dmy_train |> mutate(across(everything(), as.numeric))
X_dmy_test <- X_dmy_test |> mutate(across(everything(), as.numeric))

The target variable:

y_train <- data_dmy$train |> dplyr::pull(!!target_name)
y_test <- data_dmy$test |> dplyr::pull(!!target_name)

Then, we need to build the formula. As for the GAM, this is a bit more complex than with the GLM. We need to create a vector that gives, for each variable, the maximum number of spline basis functions to use. For dummy variables, this must be set to 1. For the other variables, we use either 6 or the number of distinct values minus 1, whichever is smaller.

degrees <- 6
deg <- rep(NA, ncol(X_dmy_train))
col_names_X <- colnames(X_dmy_train)
nb_val <- map_dbl(
  col_names_X, ~X_dmy_train |> pull(.x) |> unique() |> length()
)
for (i_var_name in 1:ncol(X_dmy_train)) {
  var_name <- col_names_X[i_var_name]
  if (var_name %in% data_dmy$categ_names) {
    deg[i_var_name] <- 1
  } else {
    deg[i_var_name] <- min(nb_val[i_var_name]-1, degrees)
  }
}
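
Equivalently, this loop can be written in a vectorized way; the following sketch builds the same deg vector:

# 1 for dummy variables; otherwise the number of distinct values minus one,
# capped at `degrees`
deg <- ifelse(
  col_names_X %in% data_dmy$categ_names, 1, pmin(nb_val - 1, degrees)
)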

Then, we fit the model. In a first step, the penalty parameter \(\lambda\) is selected by 10-fold cross-validation:

gamsel_cv <- gamsel::cv.gamsel(
  x = as.data.frame(X_dmy_train), y = y_train, family = "binomial",
  degrees = deg
)
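
If {gamsel}'s plot method for cv.gamsel objects is available, the cross-validation curve can be visualised as follows:

# Chunk not evaluated: cross-validation error as a function of lambda
plot(gamsel_cv)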

We use the value of \(\lambda\) that gives the minimum cross-validation error. Note that we could also use the largest value of \(\lambda\) such that the error is within 1 standard error of the minimum (using lambda = gamsel_cv$lambda.1se):

gamsel_out <- gamsel(
  x = as.data.frame(X_dmy_train), y = y_train, family = "binomial",
  degrees = deg,
  lambda = gamsel_cv$lambda.min
)
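
The two candidate values of the penalty mentioned above can be compared directly:

# Penalty minimising the CV error vs. the (larger) 1-SE value
c(lambda.min = gamsel_cv$lambda.min, lambda.1se = gamsel_cv$lambda.1se)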

Lastly, we can get the predicted scores:

scores_train <- predict(
  gamsel_out, newdata = as.data.frame(X_dmy_train), type = "response")[, 1]
scores_test <- predict(
  gamsel_out, newdata = as.data.frame(X_dmy_test), type = "response")[, 1]

We wrap these steps in a helper function:

#' Train a GAMSEL model
#'
#' @param data_train train set
#' @param data_test test set
#' @param target_name name of the target (response) variable
#' @param degrees degree for the splines
#' @param return_model if TRUE, the estimated model is returned
#'
#' @returns list with estimated scores on train set (`scores_train`) and on
#'  test set (`scores_test`)
train_gamsel <- function(data_train,
                         data_test,
                         target_name,
                         degrees = 6,
                         return_model = FALSE) {
  # Encode dataset so that categorical variables become dummy variables
  data_dmy <- encode_dataset(
    data_train = data_train,
    data_test = data_test,
    target_name = target_name,
    intercept = FALSE
  )
  # Estimation
  X_dmy_train <- data_dmy$train |> dplyr::select(-!!target_name)
  X_dmy_train <- X_dmy_train |> mutate(across(everything(), as.numeric))
  X_dmy_test <- data_dmy$test |> dplyr::select(-!!target_name)
  X_dmy_test <- X_dmy_test |> mutate(across(everything(), as.numeric))
  y_train <- data_dmy$train |> dplyr::pull(!!target_name)
  y_test <- data_dmy$test |> dplyr::pull(!!target_name)

  deg <- rep(NA, ncol(X_dmy_train))
  col_names_X <- colnames(X_dmy_train)
  nb_val <- map_dbl(
    col_names_X, ~X_dmy_train |> pull(.x) |> unique() |> length()
  )
  for (i_var_name in 1:ncol(X_dmy_train)) {
    var_name <- col_names_X[i_var_name]
    if (var_name %in% data_dmy$categ_names) {
      deg[i_var_name] <- 1
    } else {
      deg[i_var_name] <- min(nb_val[i_var_name]-1, degrees)
    }
  }
  gamsel_cv <- gamsel::cv.gamsel(
    x = as.data.frame(X_dmy_train), y = y_train, family = "binomial",
    degrees = deg
  )
  gamsel_out <- gamsel::gamsel(
    x = as.data.frame(X_dmy_train), y = y_train, family = "binomial",
    degrees = deg,
    lambda = gamsel_cv$lambda.min
  )
  # Scores on train and test set
  scores_train <- predict(
    gamsel_out, newdata = as.data.frame(X_dmy_train), type = "response")[, 1]
  scores_test <- predict(
    gamsel_out, newdata = as.data.frame(X_dmy_test), type = "response")[, 1]

  if (return_model) {
    list(
      scores_train = scores_train,
      scores_test = scores_test,
      fit = gamsel_out)
  } else {
    list(scores_train = scores_train, scores_test = scores_test, fit = NULL)
  }
}

This function can then be used in a very simple way:

scores_gamsel <- train_gamsel(
    data_train = data$train, data_test = data$test, target_name = target_name,
    degrees = 6
  )

13.4 Fitting a Beta Distribution

Once the scores from the models have been estimated, we fit a Beta distribution to them. This will provide a prior distribution of the true probabilities in the exercise.

To avoid numerical failures in the maximum-likelihood estimation of the two parameters of the Beta distribution, let us make sure that every score lies strictly within \((0,1)\) and is never exactly equal to 0 or 1.

x_glm <- (scores_glm$scores_test * (1 - 1e-6)) + 1e-6 / 2
x_gam <- (scores_gam$scores_test * (1 - 1e-6)) + 1e-6 / 2
x_gamsel <- (scores_gamsel$scores_test * (1 - 1e-6)) + 1e-6 / 2
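
This affine transformation maps the closed interval \([0,1]\) to \([5\times 10^{-7}, 1 - 5\times 10^{-7}]\), which can be verified on the endpoints:

# The endpoints 0 and 1 are mapped strictly inside (0, 1)
range(c(0, 1) * (1 - 1e-6) + 1e-6 / 2)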

To estimate the two parameters of the Beta distribution, we define a small function, fit_beta_scores(), that calls the fitdistr() function from {MASS}.

#' Maximum-likelihood fitting of Beta distribution on scores
#'
#' @param scores vector of estimated scores
#' @param shape1 non-negative first parameter of the Beta distribution
#' @param shape2 non-negative second parameter of the Beta distribution
#'
#' @returns An object of class `fitdistr`, a list with four components
#'  (see: MASS::fitdistr())
#'  - `estimate`: the parameter estimates
#'  - `sd`: the estimated standard errors
#'  - `vcov`: the estimated variance-covariance matrix
#'  - `loglik`: the log-likelihood
fit_beta_scores <- function(scores, shape1 = 1, shape2 = 1) {
  # Fit a Beta distribution by maximum likelihood, using the provided
  # initial values for the two shape parameters
  MASS::fitdistr(
    scores, "beta", start = list(shape1 = shape1, shape2 = shape2)
  )
}
(mle_gam <- fit_beta_scores(scores = x_gam))
     shape1        shape2   
  0.062913176   0.081751894 
 (0.002447438) (0.003470926)
(mle_glm <- fit_beta_scores(scores = x_glm))
     shape1        shape2   
  0.141335615   0.210571524 
 (0.005452414) (0.009260466)
(mle_gamsel <- fit_beta_scores(scores = x_gamsel[!is.nan(x_gamsel)]))
     shape1       shape2  
  0.42945278   0.51642775 
 (0.01755855) (0.02225033)
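
As a quick sanity check, the mean of a fitted Beta distribution, \(\alpha / (\alpha + \beta)\), should be close to the empirical mean of the corresponding scores. For the GLM, for instance:

# Mean of the fitted Beta distribution vs. empirical mean of the scores
mle_glm$estimate[["shape1"]] / sum(mle_glm$estimate)
mean(x_glm)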

Let us plot the distribution of the scores obtained with the GAMSEL model. On top of the histogram, we draw the densities of the Beta distributions whose parameters were estimated from the scores of each model.

Code
val_u <- seq(0, 1, length = 651)
layout(mat = matrix(1:2), heights = c(3,1))
# Histogram of scores obtained with the GAMSEL, on test set
par(mar = c(4.1, 4.1, 1, 2.1))
hist(
  scores_gamsel$scores_test,
  breaks = seq(0, 1, by = .05), probability = TRUE,
  main = "spambase", xlab = latex2exp::TeX("$\\hat{s}(x)$")
)
# Beta dist. estimated using the scores from the GLM
lines(
  val_u,
  dbeta(val_u, mle_glm$estimate[1], mle_glm$estimate[2]),
  col = "#D55E00",lwd = 1.5
)
# Beta dist. estimated using the scores from the GAM
lines(
  val_u,
  dbeta(val_u, mle_gam$estimate[1], mle_gam$estimate[2]),
  col = "#0072B2",lwd = 1.5
)
# Beta dist. estimated using the scores from the GAMSEL
lines(
  val_u,
  dbeta(val_u, mle_gamsel$estimate[1], mle_gamsel$estimate[2]),
  col = "#E69F00",lwd = 1.5
)
par(mar = c(0, 4.1, 0, 2.1))
plot.new()
legend(
  xpd = TRUE, ncol = 3,
  "center",
  title = "Model",
  lwd = 1.5,
  col = c("#D55E00", "#0072B2", "#E69F00"),
  legend = c("GLM-logistic", "GAM", "GAMSEL")
)
Figure 13.1: Distribution of estimated probabilities by the GAMSEL model and Beta distribution fitted to the scores of each of the three models.

13.5 Wrapper Functions

For convenience, we build a wrapper function, get_beta_fit(), that takes as inputs a dataset, the name of the target variable, and possibly a seed. From these arguments, the function splits the dataset into a training set and a test set, fits the three models, and fits a Beta distribution to the scores estimated on the test set. This function returns a list with 6 elements: the first three are the estimated scores of the three models; the last three are the parameters of the Beta distributions estimated using the scores of each model.

Function get_beta_fit()
#' Estimation of a GLM-logistic, a GAM and a GAMSEL model on a classification
#' task. Then, on estimated scores from the test set, fits a Beta distribution.
#'
#' @param dataset dataset with response variable and predictors
#' @param target_name name of the target (response) variable
#' @param seed desired seed (default to `NULL`)
#'
#' @returns A list with the following elements:
#'  - `scores_glm`: scores on train and test set (in a list) from the GLM
#'  - `scores_gam`: scores on train and test set (in a list) from the GAM
#'  - `scores_gamsel`: scores on train and test set (in a list) from the GAMSEL
#'  - `mle_glm`: An object of class "fitdistr" for the GLM model
#'    (see fit_beta_scores())
#'  - `mle_gam`: An object of class "fitdistr" for the GAM model
#'    (see fit_beta_scores())
#'  - `mle_gamsel`: An object of class "fitdistr" for the GAMSEL model
#'    (see fit_beta_scores())
get_beta_fit <- function(dataset,
                         target_name,
                         seed = NULL) {
  # Split data into train/test
  data <- split_train_test(data = dataset, prop_train = .8, seed = seed)

  # Train a GLM-logistic model
  scores_glm <- train_glm(
    data_train = data$train, data_test = data$test, target_name = target_name
  )
  # Train a GAM model
  scores_gam <- train_gam(
    data_train = data$train, data_test = data$test, target_name = target_name,
    spline_df = 6
  )
  # Train a GAMSEL model
  scores_gamsel <- train_gamsel(
    data_train = data$train, data_test = data$test, target_name = target_name,
    degrees = 6
  )
  # Squeeze the estimated scores so that they lie strictly in (0,1)
  # rather than in [0,1]
  x_glm <- (scores_glm$scores_test * (1 - 1e-6)) + 1e-6 / 2
  x_gam <- (scores_gam$scores_test * (1 - 1e-6)) + 1e-6 / 2
  x_gamsel <- (scores_gamsel$scores_test * (1 - 1e-6)) + 1e-6 / 2
  # Fit a Beta distribution on these scores
  mle_gam <- fit_beta_scores(scores = x_gam)
  mle_glm <- fit_beta_scores(scores = x_glm)
  mle_gamsel <- fit_beta_scores(scores = x_gamsel[!is.nan(x_gamsel)])

  list(
    scores_glm = scores_glm,
    scores_gam = scores_gam,
    scores_gamsel = scores_gamsel,
    mle_glm = mle_glm,
    mle_gam = mle_gam,
    mle_gamsel = mle_gamsel
  )
}

We also define a function, plot_hist_scores_beta(), to plot the distribution of the scores obtained with the GAMSEL model together with the density functions of the three Beta distributions whose parameters were estimated from the scores of the three models.

Function plot_hist_scores_beta()
#' Plots the histogram of the scores estimated with the GAMSEL model and adds
#' the densities of the Beta distributions whose parameters have been
#' estimated using the scores from the GLM, the GAM, or the GAMSEL model
#'
#' @param fit_resul results obtained from get_beta_fit()
#' @param title title of the graph (e.g.: dataset name)
plot_hist_scores_beta <- function(fit_resul, title = NULL) {
  val_u <- seq(0, 1, length = 651)
  layout(mat = matrix(1:2), heights = c(3,1))
  
  dbeta_val <- vector(mode = "list", length = 3)
  for (i_type in 1:3) {
    type <- c("mle_glm", "mle_gam", "mle_gamsel")[i_type]
    dbeta_val[[i_type]] <- dbeta(
        val_u, 
        fit_resul[[type]]$estimate[1], 
        fit_resul[[type]]$estimate[2]
      )
  }
  y_lim <- c(
    0,
    map(dbeta_val,
        ~range(.x[!is.infinite(.x)], na.rm = TRUE)) |>
      unlist() |> max(na.rm = TRUE)
  )
  
  
  # Histogram of scores obtained with the GAMSEL, on test set
  par(mar = c(4.1, 4.1, 1, 2.1))
  hist(
    fit_resul$scores_gamsel$scores_test,
    breaks = seq(0, 1, by = .05), probability = TRUE,
    main = title, xlab = latex2exp::TeX("$\\hat{s}(x)$"),
    ylim = y_lim
  )

  legend_name <- c("GLM-logistic", "GAM", "GAMSEL")
  colours <- c(
    "mle_glm" = "#D55E00",
    "mle_gam" = "#0072B2",
    "mle_gamsel" = "#E69F00"
  )
  for (i_type in 1:3) {
    type <- c("mle_glm", "mle_gam", "mle_gamsel")[i_type]
    lines(
      val_u,
      dbeta_val[[i_type]],
      col = colours[i_type],lwd = 1.5
    )
  }
  par(mar = c(0, 4.1, 0, 2.1))
  plot.new()
  legend(
    xpd = TRUE, ncol = 3,
    "center",
    title = "Model",
    lwd = 1.5,
    col = colours,
    legend = legend_name
  )
}

These two functions can be called as follows:

# Chunk not evaluated
resul <- get_beta_fit(dataset = dataset, target_name = "is_spam", seed = 1234)
plot_hist_scores_beta(resul, "spambase")