6  Prior Distribution: Example

Note

This chapter illustrates the application of the method using one example of real-world datasets. We train one model –a Generalized Additive Model with model selection (GAMSEL)– on a binary variable to estimate the underlying event probabilities using available covariates. For this model, we derive scores from the test set and fit a Beta distribution via maximum likelihood estimation. This process yields one prior for the true probability distribution of the event.

Code Availability

The functions for data preprocessing, model estimation, and Beta distribution fitting are stored in functions/real-data.R and will be used in subsequent chapters.

library(tidyverse)
── Attaching core tidyverse packages ──────────────────────── tidyverse 2.0.0 ──
✔ dplyr     1.1.4     ✔ readr     2.1.5
✔ forcats   1.0.0     ✔ stringr   1.5.1
✔ ggplot2   3.5.1     ✔ tibble    3.2.1
✔ lubridate 1.9.4     ✔ tidyr     1.3.1
✔ purrr     1.0.2     
── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
✖ dplyr::filter() masks stats::filter()
✖ dplyr::lag()    masks stats::lag()
ℹ Use the conflicted package (<http://conflicted.r-lib.org/>) to force all conflicts to become errors
library(gam)
Loading required package: splines
Loading required package: foreach

Attaching package: 'foreach'

The following objects are masked from 'package:purrr':

    accumulate, when

Loaded gam 1.22-5
library(gamsel)
Loaded gamsel 1.8-5
# Colours for train/validation/calibration/test
library(tidyverse)
colour_samples <- c(
  "Train" = "#0072B2",
  "Test" = "#D55E00"
)

# Functions
source("../scripts/functions/real-data.R")

6.1 Raw Data

To illustrate the process, we use the spambase dataset (available on UCI Machine Learning Repository). The dataset contains 4,601 rows. The target variable, is_spam will be explained using the 57 continuous predictors.

The dataset can be downloaded as follows:

if (!dir.exists("../data")) dir.create("../data")
download.file(
  url = "https://archive.ics.uci.edu/static/public/94/spambase.zip", 
  destfile = "../data/spambase.zip"
)

The names of the columns are given in the spambase.names file in that archive.

# This chunk is not run
info_data <- scan(
  unz("../data/spambase.zip", "spambase.names"), what = "character", sep = "\n"
)
# Print the names for this dataset (not very convenient...)
str_extract(info_data[31:length(info_data)], "^(.*):") |> 
  str_remove(":$") |> 
  (\(.x) str_c('"', .x, '",'))() |> 
  cat()

Then, we can import the dataset:

dataset <- read_csv(
  file = unz("../data/spambase.zip", "spambase.data"),
  col_names = c(
    "word_freq_make", "word_freq_address", "word_freq_all", "word_freq_3d",
    "word_freq_our", "word_freq_over", "word_freq_remove", "word_freq_internet",
    "word_freq_order", "word_freq_mail", "word_freq_receive", "word_freq_will",
    "word_freq_people", "word_freq_report", "word_freq_addresses", 
    "word_freq_free", "word_freq_business", "word_freq_email", "word_freq_you",
    "word_freq_credit", "word_freq_your", "word_freq_font", "word_freq_000",
    "word_freq_money", "word_freq_hp", "word_freq_hpl", "word_freq_george",
    "word_freq_650", "word_freq_lab", "word_freq_labs", "word_freq_telnet",
    "word_freq_857", "word_freq_data", "word_freq_415", "word_freq_85",
    "word_freq_technology", "word_freq_1999", "word_freq_parts", "word_freq_pm",
    "word_freq_direct", "word_freq_cs", "word_freq_meeting", 
    "word_freq_original", "word_freq_project", "word_freq_re", "word_freq_edu",
    "word_freq_table", "word_freq_conference", "char_freq_;", "char_freq_(",
    "char_freq_[", "char_freq_!", "char_freq_$", "char_freq_#",
    "capital_run_length_average", "capital_run_length_longest",
    "capital_run_length_total", "is_spam"
  )
)
Rows: 4601 Columns: 58
── Column specification ────────────────────────────────────────────────────────
Delimiter: ","
dbl (58): word_freq_make, word_freq_address, word_freq_all, word_freq_3d, wo...

ℹ Use `spec()` to retrieve the full column specification for this data.
ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.

The target variable is is_spam.

target_name <- "is_spam"

6.2 Data Pre-processing

We start by splitting the dataset into train and test sets in order calculate the prior distribution.

With the current dataset:

data <- split_train_test(data = dataset, prop_train = .7, seed = 1234)
names(data)
[1] "train" "test" 

Some of the models we use need the data to be numerical. We thus use the function encode_dataset() that transforms the categorical columns into sets of dummy variables. For each categorical variable, we remove one of the levels to avoid colinearity in the predictor matrix. This step is made using the convenient functions from the {recipes} package. In addition, the spline function from the {gam} package does not support variables with names that do not respect the R naming conventions. We thus rename all the variables and keep track of the changes.

Let us use the encode_dataset() function to rename the columns here. As there is no categorical variable among the predictors, no dummy variable will be created.

data_dmy <- encode_dataset(
  data_train = data$train,
  data_test = data$test,
  target_name = target_name,
  intercept = FALSE
)

6.3 Estimation Functions: GAMSEL

We first estimate the probability that the event occurs (the email is a spam) using a Generalized Additive Model with model selection. We use the helper function train_gamsel(), which can be used in a very simple way. This function first splits the target variable and the predictors in distinct objects. Then, we check that all variables obtained after using the encode_dataset() function are coded as numeric: the estimation function from {gamsel} does not allow integer variables. The formula to fit the GAMSEL model is then built (We need to create a vector that gives the maximum spline basis function to use for each variable. For dummy variables, this needs to be set to 1. For other variables, let us use either 6 or the minimum number of distinct values minus 1.). Then, we fit the model. The penalty parameter \(\lambda\) is selected by 10-fold cross validation. We use the value of lambda which gives the minimum cross validation metric. Note that we could also use the largest value of lambda such that the error is within 1 standard error of the minimum (using lambda = gamsel_cv$lambda.1se). Lastly, we get the predicted scores.

The train_gamsel() function.
#' Train a GAMSEL model
#'
#' @param data_train train set
#' @param data_test test set
#' @param target_name name of the target (response) variable
#' @param degrees degree for the splines
#' @param return_model if TRUE, the estimated model is returned
#'
#' @returns list with estimated scores on train set (`scores_train`) and on
#'  test set (`scores_test`)
train_gamsel <- function(data_train,
                         data_test,
                         target_name,
                         degrees = 6,
                         return_model = FALSE) {
  # Encode dataset so that categorical variables become dummy variables
  data_dmy <- encode_dataset(
    data_train = data_train,
    data_test = data_test,
    target_name = target_name,
    intercept = FALSE
  )
  # Estimation
  X_dmy_train <- data_dmy$train |> dplyr::select(-!!target_name)
  X_dmy_train <- X_dmy_train |> mutate(across(everything(), as.numeric))
  X_dmy_test <- data_dmy$test |> dplyr::select(-!!target_name)
  X_dmy_test <- X_dmy_test |> mutate(across(everything(), as.numeric))
  y_train <- data_dmy$train |> dplyr::pull(!!target_name)
  y_test <- data_dmy$test |> dplyr::pull(!!target_name)

  deg <- rep(NA, ncol(X_dmy_train))
  col_names_X <- colnames(X_dmy_train)
  nb_val <- map_dbl(
    col_names_X, ~X_dmy_train |> pull(.x) |> unique() |> length()
  )
  for (i_var_name in 1:ncol(X_dmy_train)) {
    var_name <- col_names_X[i_var_name]
    if (var_name %in% data_dmy$categ_names) {
      deg[i_var_name] <- 1
    } else {
      deg[i_var_name] <- min(nb_val[i_var_name]-1, degrees)
    }
  }
  gamsel_cv <- gamsel::cv.gamsel(
    x = as.data.frame(X_dmy_train), y = y_train, family = "binomial",
    degrees = deg
  )
  gamsel_out <- gamsel::gamsel(
    x = as.data.frame(X_dmy_train), y = y_train, family = "binomial",
    degrees = deg,
    lambda = gamsel_cv$lambda.min
  )
  # Scores on train and test set
  scores_train <- predict(
    gamsel_out, newdata = as.data.frame(X_dmy_train), type = "response")[, 1]
  scores_test <- predict(
    gamsel_out, newdata = as.data.frame(X_dmy_test), type = "response")[, 1]
  scores_test[which(is.na(scores_test))] <-
    1/(1 + exp(-predict(gamsel_out,
                        newdata = as.data.frame(X_dmy_test[which(is.na(scores_test)),]))
               [, 1]))

  if (return_model == TRUE) {
    res <- list(
      scores_train = scores_train,
      scores_test = scores_test,
      fit = fit)
  } else {
    list(scores_train = scores_train, scores_test = scores_test, fit = NULL)
  }
}
scores_gamsel <- train_gamsel(
    data_train = data$train, data_test = data$test, target_name = target_name,
    degrees = 6
  )

6.4 Fitting a Beta Distribution

Once the scores from the models have been estimated, we fit a Beta distribution to them. This will provide a prior distribution of the true probabilities in the exercise.

To avoid crashing the ML estimation of the two parameters of the Beta distribution, let us make sure that any score is in \((0,1)\) and not exactly equal to 0 or 1.

x_gamsel <- (scores_gamsel$scores_test * (1 - 1e-6)) + 1e-6 / 2

To estimate the two parameters of the Beta distribution, we apply the function, fit_beta_scores() that calls the fitdistr() function from {MASS}.

The fit_beta_scores() function.
#' Maximum-likelihood fitting of Beta distribution on scores
#'
#' @param scores vector of estimated scores
#' @param shape1 non-negative first parameter of the Beta distribution
#' @param shape1 non-negative second parameter of the Beta distribution
#'
#' @returns An object of class `fitdistr`, a list with four components
#'  (see: MASS::fitdistr())
#'  - `estimate`: the parameter estimates
#'  - `sd`: the estimated standard errors
#'  - `vcov`: the estimated variance-covariance matrix
#'  - `loglik`: the log-likelihood
fit_beta_scores <- function(scores, shape1 = 1, shape2 = 1) {
  # Fit a beta distribution
  mle_fit <- MASS::fitdistr(
    scores, "beta", start = list(shape1 = 1, shape2 = 1)
  )
  mle_fit
}
(mle_gamsel <- fit_beta_scores(scores = x_gamsel[!is.na(x_gamsel)]))
     shape1       shape2  
  0.42899045   0.51659090 
 (0.01431052) (0.01816698)

6.5 Wrapper Functions

For convenience, we use a wrapper function, get_beta_fit() that takes a dataset as an input, the name of the target variable and possibly a seed. From these arguments, the function splits the dataset into a training and a test set. It then fits the GAMSEL model to the train set, and fit a Beta distribution on the scores estimated in the test set. This function returns a list with 2 elements: the estimated scores of the GAMSEL, the the parameters of the Beta distribution estimated using the scores of this model.

The get_beta_fit() function
#' Estimation of a GLM-logistic, a GAM and a GAMSEL model on a classification
#' task. Then, on estimated scores from the test set, fits a Beta distribution.
#'
#' @param dataset dataset with response variable and predictors
#' @param target_name name of the target (response) variable
#' @param seed desired seed (default to `NULL`)
#'
#' @returns A list with the following elements:
#'  - `scores_glm`: scores on train and test set (in a list) from the GLM
#'  - `scores_gam`: scores on train and test set (in a list) from the GAM
#'  - `scores_gamsel`: scores on train and test set (in a list) from the GAMSEL
#'  - `mle_glm`: An object of class "fitdistr" for the GLM model
#'    (see fit_beta_scores())
#'  - `mle_gamsel`: An object of class "fitdistr" for the GAM model
#'    (see fit_beta_scores())
#'  - `mle_gamsel`: An object of class "fitdistr" for the GAMSEL model
#'    (see fit_beta_scores())
get_beta_fit <- function(dataset,
                         target_name,
                         seed = NULL) {
  # Split data into train/test
  data <- split_train_test(data = dataset, prop_train = .7, seed = seed)

  # Train a GAMSEL model
  scores_gamsel <- train_gamsel(
    data_train = data$train, data_test = data$test, target_name = target_name,
    degrees = 6
  )
  # Add a little noise to the estimated scores to avoid being in [0,1] and be
  # in (0,1) instead.
  x_gamsel <- (scores_gamsel$scores_test * (1 - 1e-6)) + 1e-6 / 2
  # Fit a Beta distribution on these scores
  mle_gamsel <- fit_beta_scores(scores = x_gamsel[!is.nan(x_gamsel)])

  list(
    scores_gamsel = scores_gamsel,
    mle_gamsel = mle_gamsel
  )
}

These two functions can be called as follows:

resul <- get_beta_fit(dataset = dataset, target_name = "is_spam", seed = 1234)

We also use the function, plot_hist_scores_beta() to plot the distribution of scores obtained with the GAMSEL model and the density functions of the Beta distribution whose parameters were estimated based on the scores of the GAMSEL model.

plot_hist_scores_beta(resul, "spambase")