2  Neighborhood-Based Smoothing

Display the setting codes
# Required packages----
library(tidyverse)
library(lubridate)
library(gtsummary)
library(labelled)
library(sf)
library(showtext)
library(extrafont)
library(wesanderson)

# Graphs----
font_main = font_title = 'Times New Roman'
extrafont::loadfonts(quiet = T)
face_text='plain'
face_title='plain'
size_title = 14
size_text = 11
legend_size = 11

global_theme <- function() {
  theme_minimal() %+replace%
    theme(
      text = element_text(family = font_main, size = size_text, face = face_text),
      legend.text = element_text(family = font_main, size = legend_size),
      axis.text = element_text(size = size_text, face = face_text), 
      plot.title = element_text(
        family = font_title, 
        size = size_title, 
        hjust = 0.5
      ),
      plot.subtitle = element_text(hjust = 0.5)
    )
}

# Colours
colors_ <- wes_palette('Rushmore1')
col_seine <- "#2140A3"

The spatial information of the properties is given at the IRIS level. However, as shown in Figure 2.1, some IRIS contain no observations. We will therefore use spatial smoothing to impute values in all IRIS.

2.1 Load Data

Let us load the data obtained in Chapter 1.

# Real Estate
load("../data/data_clean_all.rda")
# Maps files
load("../data/shapes.rda")
Display the codes used to create the Figure
shapes_paris |> 
  left_join(
    data_clean_all |> 
      group_by(CODE_IRIS) |> 
      summarise(
        total_observations = n(), 
        median_price = median(contract_final_net_price)
      ),
    by = "CODE_IRIS"
  ) %>% 
  replace_na(list(total_observations = 0)) |> 
  ggplot() + 
  geom_sf(aes(fill = total_observations)) + 
  scale_fill_gradientn(
    colours = rev(terrain.colors(10)), 
    name = 'No. Observations', 
    position = 'bottom'
    ) +
  geom_sf(
    data = shapes_seine, 
    fill = col_seine
  ) + 
  global_theme()
Figure 2.1: Number of observations per IRIS.

2.2 Defining Neighbors

The size of the area of each IRIS differs considerably. Hence, smoothing the observations by simply defining higher levels through a Euclidean distance from a given point might pose problems: it could include many dense but heterogeneous regions on one end of the spectrum, or only a few on the other. Instead, we will define a neighborhood graph (see the paper for a formal definition).

We will hereafter denote by \(i\) and \(j\) the polygons of the \(i\)-th and \(j\)-th IRIS, respectively.

The neighborhood graph can be represented by an \(n\times n\) matrix called an adjacency matrix, denoted \(A\): \(A_{i,j}=1\) if regions \(i\) and \(j\) intersect, and \(A_{i,j}=0\) otherwise. Representing neighborhood relations using an adjacency matrix has the advantage that non-immediate connections can easily be obtained by successively multiplying the adjacency matrix by itself. That is, all nonzero elements of \(A^2\) represent regions that are either immediately adjacent to a given region (the direct neighbors) or adjacent to the direct neighbors (the neighbors of the direct neighbors). This process can be repeated \(n\) times to obtain the neighbors that can be reached within a path of length \(n\). This provides a more natural way to define neighborhoods, as with each increase in path length, an entire homogeneous region is added to the higher-level aggregation.
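To make this concrete, here is a minimal toy sketch (hypothetical regions, not the Paris data): four regions arranged in a line, 1-2-3-4. Squaring the adjacency matrix reveals the neighbors of neighbors.

# Toy example (hypothetical regions, not the Paris data)
A_toy <- matrix(0, nrow = 4, ncol = 4)
A_toy[cbind(c(1, 2, 2, 3, 3, 4), c(2, 1, 3, 2, 4, 3))] <- 1  # edges of the path 1-2-3-4
diag(A_toy) <- 1    # each region intersects itself, as st_intersects() will also report

A_toy_2 <- A_toy %*% A_toy   # paths of length at most 2
A_toy_2[A_toy_2 > 0] <- 1    # keep only the reachability pattern
which(A_toy_2[1, ] == 1)     # region 1 now also reaches region 3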

Let us construct a neighborhood graph with R, using {igraph}.

library(igraph)

Here, we need the Seine River as well; otherwise, we cannot find neighbors across the river.

shapes_all <- 
  shapes_paris |> 
  bind_rows(shapes_seine)

Let us get the first level:

shapes_neighbours_export <- st_intersects(shapes_all, shapes_all)

The \(A_1\) adjacency matrix can then be constructed. We initialize it with 0s.

adj_matrix_lev_1 <- matrix(
  nrow = length(shapes_neighbours_export), 
  ncol = length(shapes_neighbours_export), 
  0
)

Then, we loop over all polygons and change the value of \(A_1(i,j)\) to 1 if regions \(i\) and \(j\) are neighbors:

row_index <- 1
for (item in shapes_neighbours_export) {
  for (col_idx in item) {
    adj_matrix_lev_1[row_index, col_idx] <- 1
  }
  row_index <- row_index + 1
}
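
As a side note, the same matrix can be obtained without the explicit double loop by requesting a dense logical matrix from st_intersects(); the following one-liner is a sketch of that alternative and should match the loop above.

# Alternative sketch: dense logical matrix converted to 0/1
adj_matrix_lev_1_alt <- st_intersects(shapes_all, shapes_all, sparse = FALSE) * 1
# all(adj_matrix_lev_1_alt == adj_matrix_lev_1)  # expected to be TRUE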

Then, by multiplying this square matrix by itself, we can obtain the neighbors and the neighbors of neighbors:

adj_matrix_lev_2 <- adj_matrix_lev_1 %*% adj_matrix_lev_1
adj_matrix_lev_2[which(adj_matrix_lev_2 > 0)] <- 1

Let us focus on a particular IRIS and visualize it and its neighbors on a map. We will produce three maps: one showing only this particular IRIS, another also showing its immediate neighbors, and a third showing the neighbors that are either immediately adjacent or adjacent to the direct neighbors.

target_community <- "Enfants Rouges 4"

The map with the single target IRIS:

single_munip <- 
  shapes_paris %>% 
  rowid_to_column('index_ids') |> 
  filter(grepl('Paris', NOM_COM)) |> 
  mutate(
    Classification = if_else(
      NOM_IRIS == target_community,
      'target community',
      'others')
  ) |> 
  mutate(
    Classification = factor(
      Classification,
      levels=c('target community', 'neighbours', 'others')
    )
  ) |> 
  ggplot() + 
  geom_sf(aes(fill = Classification)) + 
  scale_fill_manual(
    values = c(colors_[c(3)], '#D21F3C', colors_[c(1)]), 
    limits = c('target community', 'neighbours', 'others'), 
    name = ''
  ) + 
  geom_sf(
    data = shapes_seine, 
    fill = col_seine
  ) + 
  global_theme() + 
  theme(legend.position = "bottom")

The map with the immediate neighbors:

one_neigh <- shapes_paris |> 
  rowid_to_column('index_ids') |> 
  filter(grepl('Paris', NOM_COM)) |> 
  mutate(
    Classification = if_else(
      index_ids %in% which(
        adj_matrix_lev_1[which(shapes_paris$NOM_IRIS == target_community),] == 1
        ),
      'neighbours', 'others')
  ) |> 
  mutate(
    Classification = if_else(
      index_ids == which(shapes_paris$NOM_IRIS == target_community), 
      'target community', Classification
    )
  ) |> 
  mutate(
    Classification = factor(
      Classification,
      levels=c('target community', 'neighbours', 'others')
    )
  ) |> 
  ggplot() + 
  geom_sf(aes(fill = Classification)) + 
  scale_fill_manual(values = c(colors_[c(3)], '#D21F3C', colors_[c(1)])) + 
  geom_sf(
    data = shapes_seine, 
    fill = col_seine
  ) + 
  global_theme()

Lastly, the third map shows both the immediate neighbors and the neighbors of neighbors:

two_neigh <- 
  shapes_paris |> 
  rowid_to_column('index_ids') |> 
  filter(grepl('Paris', NOM_COM)) |> 
  mutate(
    Classification = if_else(
      index_ids %in% which(
        adj_matrix_lev_2[which(shapes_paris$NOM_IRIS == target_community),] > 0
      ),
      'neighbours', 
      'others'
    )
  )|> 
  mutate(
    Classification = if_else(
      index_ids == which(shapes_paris$NOM_IRIS == target_community), 
      'target community', 
      Classification
    )
  ) |> 
  mutate(
    Classification = factor(
      Classification,
      levels=c('target community', 'neighbours', 'others')
    )
  ) |> 
  ggplot() + 
  geom_sf(aes(fill = Classification)) + 
  scale_fill_manual(values = c(colors_[c(3)], '#D21F3C', colors_[c(1)])) + 
  geom_sf(
    data = shapes_seine, 
    fill = col_seine
  ) + 
  global_theme()

Figure 2.2 illustrates how the neighborhood set for a particular IRIS community can be calculated.

Display the codes used to create the Figure
p_example_neigh_row <- cowplot::plot_grid(
  single_munip + theme(legend.position="none"),
  one_neigh + theme(legend.position="none"),
  two_neigh + theme(legend.position="none"),
  align = 'vh',
  ncol = 3
)
p_example_neigh_legend <- cowplot::get_legend(single_munip)

p_example_neigh <- cowplot::plot_grid(
  p_example_neigh_row, p_example_neigh_legend, 
  ncol = 1, rel_heights = c(1, .2)
)
p_example_neigh
Figure 2.2: A sampled IRIS region within Paris (left pane), its immediately adjacent neighbors (center pane), and its second-level neighbors (right pane). The Seine River is depicted in blue, whereas all other regions are depicted in yellow.

2.3 Spatial Price Smoothing

For a given variable \(x\) observed at the IRIS level, the smoothed value for region \(r_i\), denoted \(\lambda_\omega(x_i)\), can be written as:

\[\begin{align} \lambda_{\omega}(x_i) = \frac{1}{\sum_{j=1}^{n} \omega(r_i,r_j)}\sum_{j=1}^n \omega(r_i,r_j)\, x_j \enspace. \end{align} \tag{2.1}\]

For example, let \(d(r_i, r_j)\) be the path length between regions \(r_i\) and \(r_j\); then a simple way to define \(\omega(r_i,r_j)\) is:

\[\begin{align}\label{eq:weights} \omega(r_i,r_j)= \begin{cases} \frac{1}{(1+d(r_i,r_j))^p}, \quad & \text{if } d(r_i,r_j) \leq m \\ 0, & \text{otherwise} \enspace, \end{cases} \end{align} \tag{2.2}\]

where \(p\) and \(m\) are hyperparameters to be selected, playing a role similar to a bandwidth parameter.
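
As a rough illustration of these two equations (a sketch, not the code used later in this chapter; the function name and the toy inputs are hypothetical), the smoother can be written as a small helper that takes the path distances from region \(r_i\) to all regions together with the corresponding observed values:

# Hypothetical helper implementing Equations 2.1 and 2.2
# d_i:  path distances from region r_i to every region (0 for r_i itself)
# x:    observed values in every region (NA when a region has no observation)
# p, m: hyperparameters of Equation 2.2
smooth_region <- function(d_i, x, p = 1, m = 5) {
  w <- ifelse(d_i <= m, 1 / (1 + d_i)^p, 0)  # Equation 2.2
  w[is.na(x)] <- 0                           # ignore regions without data
  sum(w * x, na.rm = TRUE) / sum(w)          # Equation 2.1
}

# Toy usage: the region itself and two neighbors at path distances 1 and 2
smooth_region(d_i = c(0, 1, 2), x = c(10000, 12000, NA))

With \(p=1\), these weights correspond to the inverse-distance weights \(1/(1+d)\) used for the price smoothing below.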

We initiate the neighborhood matrix by putting the values we just computed:

neigh_matrix <- adj_matrix_lev_1

Then, we can increase the neighborhood distance \(m\) so that it takes values from 2 to 30 in steps of 1. This allows us to identify, for each IRIS, all the IRIS located at a path distance lower than or equal to \(m\).

# dir.create("../data/neighbours/all_neighbour_levels")
# cli::cli_progress_bar(total = length(seq(2,30)))
for (neigh_idst in seq(2,30)) {
  # Set neighbors
  adj_matrix_next_level <- neigh_matrix %*% adj_matrix_lev_1
  adj_matrix_next_level[which(adj_matrix_next_level > 0)] <- 1
  
  # Create Graph 
  graph_tmp <- graph_from_adjacency_matrix(adj_matrix_next_level)
  export_string <- paste(
    '../data/neighbours/all_neighbour_levels/neighbours_level_',
    neigh_idst,
    '.csv', 
    sep=''
  )
  
  graph_tmp |> 
    as_data_frame() |> 
    mutate(access = neigh_idst) |> 
    write_csv(export_string)
  
  # Reset and restart
  neigh_matrix <- adj_matrix_next_level
  # cli::cli_progress_update(set = which(neigh_idst == seq(2,30)))
}

We also export a tibble with a mapping between each IRIS and its row number in shapes_all.

mapping_neighbours <- 
  shapes_all |> 
  rowid_to_column('index_ids') |> 
  select(index_ids, CODE_IRIS) |> 
  as_tibble() |> 
  select(-c(geometry))

mapping_neighbours |> 
  write_csv('../data/neighbours/mapping_neighbour_id_iris.csv')

Now that we have obtained the adjacency matrices for \(m \in \{1, 2, \ldots, 30\}\), we can create a tibble containing the distance from each polygon to all the other polygons (provided the distance is lower than or equal to 30).

We first populate the desired object, all_neighbours, with all the pairs of IRIS at a distance equal to 1:

all_neighbours <- 
  graph_from_adjacency_matrix(adj_matrix_lev_1) |> 
  as_data_frame() |> 
  tibble() |> 
  mutate(access = 1)
all_neighbours
# A tibble: 7,559 × 3
    from    to access
   <dbl> <dbl>  <dbl>
 1     1     1      1
 2     1   220      1
 3     1   555      1
 4     1   671      1
 5     1   831      1
 6     1   860      1
 7     2     2      1
 8     2   115      1
 9     2   173      1
10     2   457      1
# ℹ 7,549 more rows

Then, let us add, step by step, the pairs of IRIS that are neighbors at a distance lower than or equal to 2, then 3, and so on, up to 30.

for (neigh_idst in seq(2,30)) {
  read_string <- paste(
    '../data/neighbours/all_neighbour_levels/neighbours_level_',
    neigh_idst,
    '.csv', 
    sep = ''
  )
  
  neighbor_level <- read_csv(read_string, progress = FALSE)
  
  # Add neighbors at current distance
  all_neighbours <- 
    all_neighbours |> 
    bind_rows(neighbor_level)
}

Now, we would like to extract from the resulting object the minimum distance between each pair of IRIS.

neighbours_all <- all_neighbours |>
  group_by(from, to) |> 
  summarise(min_distance = min(access), .groups = "drop") |> 
  left_join(
    mapping_neighbours, 
    by = c('from' = 'index_ids')
  ) |> 
  left_join(
    mapping_neighbours, 
    by = c('to' = 'index_ids')
  ) |> 
  select(
    from_iris = CODE_IRIS.x, 
    to_iris = CODE_IRIS.y, 
    distance = min_distance
  )

The result can be saved:

neighbours_all |> 
  write_csv('../data/neighbours/all_neighbours_paris.csv')
neighbours_all <- 
  neighbours_all |> 
  mutate(
    from_iris = as.character(from_iris),
    to_iris = as.character(to_iris)
  )

Here is the resulting table:

neighbours_all
# A tibble: 974,169 × 3
   from_iris to_iris   distance
   <chr>     <chr>        <dbl>
 1 751145620 751145620        1
 2 751145620 751145617        3
 3 751145620 751135102        7
 4 751145620 751156004        7
 5 751145620 751124624       16
 6 751145620 751051703        7
 7 751145620 751083104       13
 8 751145620 751166103       10
 9 751145620 751145511        2
10 751145620 751145512        2
# ℹ 974,159 more rows
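
As a design note, since the level-1 graph is available, the same minimum path lengths could also be computed in a single call with igraph::distances(), without writing intermediate files; here is a sketch (the object names are ours):

# Sketch: shortest path lengths on the level-1 neighborhood graph
g_lev_1 <- graph_from_adjacency_matrix(adj_matrix_lev_1, mode = "undirected")
dist_matrix <- distances(g_lev_1)
# distances() returns 0 on the diagonal, whereas neighbours_all records 1 for an
# IRIS paired with itself; Inf entries correspond to unreachable pairs.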

Let us compute the average selling price per square meter in each IRIS:

data_immo_agg <- 
  data_clean_all |> 
  group_by(CODE_IRIS) |> 
  summarise(mean_sale_price = mean(pm2))

Let us use this to plot the observed prices per square meter on a choropleth map. We will display the unsmoothed version on the left and the smoothed version on the right. The smoothed version is computed using Equation 2.1, with a distance \(m=5\).

m <- 5

smooth_prices <- 
  neighbours_all |> 
  mutate(distance = distance + 1) |> 
  mutate(distance = if_else(from_iris == to_iris, 1, distance)) |> 
  filter(distance <= !!m) |> 
  left_join(
    data_immo_agg, 
    by = c('to_iris' = 'CODE_IRIS')
  ) |> 
  mutate(inverse_weight = 1 / distance) |> 
  mutate(weighted_val = mean_sale_price * inverse_weight) |> 
  drop_na()  |> 
  group_by(from_iris) |> 
  summarise(
    total_weights = sum(inverse_weight), 
    total_val = sum(weighted_val)
  ) |> 
  mutate(smoothed_val = total_val / total_weights) |> 
  ungroup() |> 
  mutate(CODE_IRIS = from_iris) |> 
  select(CODE_IRIS, smoothed_price = smoothed_val)
Display the codes to create the Figure
# Unsmoothed version
p_1 <- 
  shapes_paris |> 
  left_join(
    data_clean_all |> 
      group_by(CODE_IRIS) |> 
      summarise(
        total_observations = n(), 
        median_price = mean(pm2), 
        std_price = sd(contract_final_net_price)
      ),
    by = "CODE_IRIS"
  ) |> 
  replace_na(list(total_observations = 0)) |> 
  ggplot() + 
  geom_sf(aes(fill = median_price)) + 
  scale_fill_gradientn(
    colours = rev(terrain.colors(10)), 
    limits = c(5000,20000),
    name = 'Median Price', 
    position = 'bottom') +
  geom_sf(
    data = shapes_seine, 
    fill = col_seine
  ) + 
  global_theme() + 
  guides(
    fill = guide_colorbar(
      title = "Mean prices per IRIS",
      position = 'bottom',
      title.position = "top", 
      title.vjust = 0,
      # draw border around the legend
      frame.colour = "black",
      barwidth = 10,
      barheight = 1.5
    )
  )

# Smoothed version
p_2 <- 
  shapes_paris |> 
  left_join(smooth_prices, by = "CODE_IRIS") |> 
  ggplot() + 
  geom_sf(aes(fill = smoothed_price)) + 
  scale_fill_gradientn(
    colours = rev(terrain.colors(10)), 
    position = 'bottom', 
    labels = scales::comma
  ) +
  geom_sf(
    data = shapes_seine, 
    fill = col_seine
  ) + 
  global_theme()
Square meter prices by IRIS (raw values on the left, smoothed values on the right), corresponding roughly to the wealth level of the inhabitants.

First, we compute the median observed price in each IRIS of the dataset, using the raw values. Then, we add the smoothed prices computed earlier (with \(m=5\)) for each IRIS. Finally, we aggregate the values at the arrondissement level (column NOM_COM in the map data): we compute the average of the median IRIS prices and the average of the smoothed version.

data_smooth_arrond <- 
  data_clean_all |> 
  mutate(CODE_IRIS = as.character(iris_cog)) |> 
  group_by(CODE_IRIS) |> 
  summarise(
    median_qmp = median(pm2)
  ) |> 
  left_join(
    smooth_prices |> 
      mutate(smoothed_qmp = smoothed_price),
    by = "CODE_IRIS"
  ) |> 
  left_join(
    shapes_paris |> 
      as_tibble() |> 
      select(CODE_IRIS, NOM_COM),
    by = "CODE_IRIS"
  ) |> 
  group_by(NOM_COM) |> 
  summarise(
    median_qmp = mean(median_qmp), 
    smoothed_qmp = mean(smoothed_qmp),
    .groups = "drop"
  )

The values are reported in Table 2.1.

Display the codes used to create the Table
data_smooth_arrond |> 
  mutate(
    NOM_COM = factor(
      NOM_COM, 
      levels = str_c(
        "Paris ", 1:20, "e", c("r", rep("", 19)), " Arrondissement")
    )
  ) |> 
  arrange(NOM_COM) |> 
  knitr::kable(
    booktabs = TRUE, 
    col.names = c("Arrondissement", "Average of Median Price", "Average of Smoothed Price")
  )
Table 2.1: Aggregated observed prices at the arrondissement level.
Arrondissement Average of Median Price Average of Smoothed Price
Paris 1er Arrondissement 12695.811 12483.324
Paris 2e Arrondissement 11724.438 11669.543
Paris 3e Arrondissement 12525.939 11654.372
Paris 4e Arrondissement 13216.678 12315.130
Paris 5e Arrondissement 12658.348 12068.673
Paris 6e Arrondissement 14251.791 13327.016
Paris 7e Arrondissement 14233.646 13141.138
Paris 8e Arrondissement 12087.873 11528.212
Paris 9e Arrondissement 11237.138 10849.159
Paris 10e Arrondissement 10114.682 10124.534
Paris 11e Arrondissement 10271.923 10267.033
Paris 12e Arrondissement 9700.242 9744.037
Paris 13e Arrondissement 8999.515 9521.970
Paris 14e Arrondissement 10301.420 10464.133
Paris 15e Arrondissement 10283.063 10691.530
Paris 16e Arrondissement 11244.938 11429.151
Paris 17e Arrondissement 10845.312 10917.631
Paris 18e Arrondissement 9501.718 9670.950
Paris 19e Arrondissement 8334.067 8566.154
Paris 20e Arrondissement 8663.208 8915.073

Let us now visualize these values on two choropleth maps, one for each aggregation method (Figure 2.3).

Display the codes used to create the Figure
p_1 <- shapes_paris |> 
  group_by(NOM_COM) |> 
  summarise(
    arrond_shape = st_combine(geometry)
  ) |> 
  left_join(data_smooth_arrond, by = "NOM_COM") |> 
  ggplot() + 
  geom_sf(aes(fill = median_qmp)) + 
  scale_fill_gradientn(
    colours = rev(terrain.colors(10)), 
    limits = c(5000, 15000),
    name = 'Price', 
    position = 'bottom', 
    labels = scales::comma
  ) +
  geom_sf(
    data = shapes_seine, 
          fill = col_seine
  ) + 
  ggtitle('Aggregated Raw') + 
  global_theme() + 
  guides(
    fill = guide_colorbar(
      title = "Mean prices per Arrondissement",
      position='bottom',
      title.position = "top", title.vjust = 0,
      frame.colour = "black",
      barwidth = 15,
      barheight = 1.5
    )
  )

p_2 <- shapes_paris |> 
   group_by(NOM_COM) |> 
   summarise(
     arrond_shape = st_combine(geometry)
   ) |> 
   left_join(data_smooth_arrond, by = "NOM_COM") |> 
   ggplot() + 
   geom_sf(aes(fill = smoothed_qmp)) + 
  scale_fill_gradientn(
    colours = rev(terrain.colors(10)), 
    limits = c(5000, 15000),
    name = 'Price', 
    position = 'bottom', 
    labels = scales::comma
  ) +
  geom_sf(
    data = shapes_seine, 
    fill = col_seine
  ) + 
  ggtitle('Aggregated Smoothed') + 
  global_theme()

ggpubr::ggarrange(
  p_1, p_2, 
  ncol = 2,nrow = 1,
  legend = "bottom", 
  common.legend = TRUE
)
Figure 2.3: Re-aggregated data. Left pane: mean per arrondissement when the raw, unsmoothed data are used to calculate the average price per square meter of real estate. Right pane: results when the neighbor-smoothed estimates are used.

2.4 Relative Errors in Prices

Let us dig into the relative error between the estimated prices and the observed ones. First, we compute the mean relative difference between the observed price \(z\) and the estimated price \(\hat{z}\), per IRIS. It is computed as follows: \[\frac{z - \hat{z}}{z} \enspace.\]

diffs_smooth <- 
  data_clean_all |> 
  filter(pm2 < 20000) |> 
  filter(pm2_estimated < 20000) |> 
  select(CODE_IRIS, pm2, difference_pm2) |> 
  mutate(relative_error = difference_pm2/pm2) |> 
  group_by(CODE_IRIS) |>  
  summarise(
    mean_relative_diff = mean(relative_error), 
    median_relative_diff = median(relative_error),
    .groups = "drop"
  )

There are two outliers, so we decide to trim the data, keeping only the values between the 2nd and 98th percentiles of the mean relative difference.

cuts_ <- quantile(diffs_smooth$mean_relative_diff, c(0.02,0.98))
diffs_smooth <- 
  diffs_smooth |> 
  filter(mean_relative_diff > cuts_[1]) |> 
  filter(mean_relative_diff < cuts_[2])

We use Equation 2.1 to compute the smoothed values, with \(m=5\).

smoothed_diff <- 
  neighbours_all |> 
  mutate(distance = distance + 1) |> 
  filter(distance <= !!m) |> 
  mutate(distance = if_else(from_iris == to_iris, 1, distance)) |> 
  left_join(
    diffs_smooth, 
    by = c('to_iris'='CODE_IRIS')
  ) |> 
  mutate(inverse_weight = 1 / distance) |> 
  mutate(value = mean_relative_diff * inverse_weight^2) |> 
  drop_na() |> 
  group_by(from_iris) |> 
  summarise(
    total_weights = sum(inverse_weight), 
    total_value = sum(value)
  ) |> 
  mutate(smoothed_diff = total_value / total_weights) |> 
  ungroup() |> 
  mutate(CODE_IRIS = from_iris, smoothed_diff)

Before plotting the map with these values, we isolate some regions of interest that will also be plotted on the maps and used as examples in the paper:

  1. two IRIS: Montmartre and Champs-de-Mars;
  2. two arrondissements: the 12th and the 20th.

shape_champs <- 
  shapes_paris |> 
  filter(CODE_IRIS == '751072812')

shape_mont <- 
  shapes_paris|> 
  filter(CODE_IRIS == '751093503')

shape_12 <- 
  shapes_paris |> 
  filter(NOM_COM == 'Paris 12e Arrondissement') |> 
  summarise(shape = st_union(geometry))

shape_20 <- 
  shapes_paris |> 
  filter(NOM_COM == 'Paris 20e Arrondissement') |> 
  summarise(shape = st_union(geometry))

The relative errors per IRIS are shown in Figure 2.4.

Display the codes used to create the Figure
scale_val <- seq(
  min(smoothed_diff$smoothed_diff), 
  max(smoothed_diff$smoothed_diff), 
  length.out = 4
)
p <- shapes_paris |> 
  left_join(smoothed_diff, by = "CODE_IRIS") |> 
  ggplot() + 
  geom_sf(aes(fill = smoothed_diff)) + 
  geom_sf(
    data = shapes_seine, 
    fill = col_seine
  ) + 
  geom_sf(
    data = shape_champs, 
    color = 'black', 
    fill = alpha('white', 0),
    lwd = 1, 
    lty = 'solid'
  ) + 
  geom_sf(
    data = shape_mont, 
    color = 'black', 
    lwd = 1, 
    fill = alpha('white', 0),
    lty = 'solid'
  ) + 
  geom_sf(
    data = shape_12, 
    color = 'black', 
    lwd = 1, 
    fill = alpha('white', 0),
    lty = 'dashed'
  ) + 
  geom_sf(
    data = shape_20, 
    color = 'black', 
    lwd = 1, 
    fill = alpha('white', 0),
    lty = 'dashed'
  ) + 
  global_theme() + 
  scale_fill_gradient2(
    NULL,
    midpoint = 0,
    high = "#000090",
    mid = "white",
    low = "#CD3700",
    breaks = scale_val,
    labels = scales::percent(scale_val)
  ) + 
  theme(
    legend.position = 'bottom'
  ) +
  ggtitle('Smoothed relative error')

panel_width <- unit(1,"npc") - sum(ggplotGrob(p)[["widths"]][-3])
p + guides(fill = guide_colorbar(barwidth = panel_width/2))
Figure 2.4: Relative estimation error per \(m^2\) in different sub-regions. The values are smoothed across spatial neighbors to emphasize the spatial correlation.

2.5 Number of Observations per IRIS

Let us also show the number of observations, using the smoothed values. First, we smooth the number of observations in each IRIS:

smooth_n <- 
  neighbours_all |> 
  mutate(distance = distance + 1) |> 
  mutate(distance = if_else(from_iris == to_iris, 1, distance)) |> 
  filter(distance <= !!m) |> 
  left_join(
    data_clean_all |> count(CODE_IRIS, name = "n"),
    by = c('to_iris' = 'CODE_IRIS')
  ) |> 
  mutate(inverse_weight = 1 / distance) |> 
  mutate(weighted_val = n * inverse_weight) |> 
  drop_na()  |> 
  group_by(from_iris) |> 
  summarise(
    total_weights = sum(inverse_weight), 
    total_val = sum(weighted_val)
  ) |> 
  mutate(smoothed_n = total_val / total_weights) |> 
  ungroup() |> 
  mutate(CODE_IRIS = from_iris) |> 
  select(CODE_IRIS, smoothed_n = smoothed_n)

Then, we can plot the result (Figure 2.5).

Display the codes used to create the Figure
shapes_paris |> 
  left_join(smooth_n, by = "CODE_IRIS") |> 
  ggplot() + 
  geom_sf(aes(fill = smoothed_n)) + 
  scale_fill_gradientn(
    colours = rev(terrain.colors(10)), 
    name = 'No. Observations', 
    position = 'bottom'
  ) +
  geom_sf(
    data = shapes_seine, 
    fill = col_seine
  ) + 
  global_theme()
Figure 2.5: Number of available observations, smoothed using the neighborhood method.

2.6 Wealth Level per IRIS

Let us have a look at the wealth level per IRIS, using the smoothed version.

We load the income data per IRIS (see Section 1.4 in Chapter 1):

data_income <- read_delim(
  str_c(
    "../data/econ/", 
    "BASE_TD_FILO_DISP_IRIS_2020_CSV/BASE_TD_FILO_DISP_IRIS_2020.csv"
  ),
  delim = ";",
  escape_double = FALSE,
  trim_ws = TRUE
)

The data needs to be formatted to match the other files used here.

median_inc_data <- 
  data_income |> 
  select(CODE_IRIS = IRIS, DISP_MED20) |> 
  mutate(DISP_MED20 = as.numeric(DISP_MED20))
Warning: There was 1 warning in `mutate()`.
ℹ In argument: `DISP_MED20 = as.numeric(DISP_MED20)`.
Caused by warning:
! NAs introduced by coercion

We can compute the smoothed values per IRIS.

smoothed_income <- 
  neighbours_all |> 
  mutate(distance = distance + 1) |> 
  mutate(distance = if_else(from_iris == to_iris, 1, distance)) |> 
  filter(distance <= !!m) |> 
  left_join(
    mapping_neighbours, 
    by=c('to_iris' = 'CODE_IRIS')
  ) |> 
  mutate(CODE_IRIS = as.character(to_iris)) |> 
  left_join(
    median_inc_data, 
    by = 'CODE_IRIS'
  ) |> 
  mutate(inverse_weight = 1 / distance) |> 
  mutate(weighted_inc = DISP_MED20 * inverse_weight) |> 
  drop_na() |> 
  group_by(from_iris) |> 
  summarise(
    total_weights = sum(inverse_weight), 
    total_inc = sum(weighted_inc)
  ) |> 
  mutate(smoothed_inc = total_inc / total_weights) |> 
  ungroup() |> 
  mutate(CODE_IRIS = from_iris) |> 
  select(CODE_IRIS, smoothed_income = smoothed_inc)

Then, we can plot the result (Figure 2.6).

Display the codes used to create the Figure
shapes_paris |> 
  left_join(smoothed_income, by = "CODE_IRIS") |> 
  ggplot() + 
  geom_sf(aes(fill = smoothed_income)) + 
  scale_fill_gradientn(
    colours = rev(terrain.colors(10)), 
    labels = scales::comma
  ) +
  geom_sf(
    data = shapes_seine, 
    fill = col_seine
  ) + 
  global_theme() +
  theme(legend.position = "bottom") +
  guides(
    fill = guide_colorbar(
      title = "Smoothed Median Income per IRIS",
      position='bottom',
      title.position = "top", 
      title.vjust = 0,
      frame.colour = "black",
      barwidth = 15,
      barheight = 1.5
    )
  )
Figure 2.6: Estimated Income per IRIS region, smoothed using the neighborhood method.