Multi-scale model assessment with spatialsample

Learning objective

Assess how accurate a model is when aggregating predictions to different spatial scales.

Introduction

To use the code in this article, you will need to install the following packages: spatialsample and tidymodels.

Modeling spatially structured data is complicated. In addition to the usual difficulties of statistical modeling, models of spatially structured data may have spatial structure in their errors, with some regions being described better by a given model than others. This also means it can be hard to tell how well your model performs when its predictions are aggregated to different scales, which is common when models fit to point measurements (for instance, the sale prices of individual homes) are used to estimate quantities over an entire area (for instance, the average value of all homes in a city or state). If model accuracy is only investigated at a single aggregation scale, such as only for the original point measurements or only across the entire study area as a whole, then local differences in accuracy might be accidentally “smoothed out”, resulting in an inaccurate picture of model performance.

For this reason, researchers (most notably, Riemann et al. (2010)) have suggested assessing models at multiple scales of spatial aggregation to ensure cross-scale differences in model accuracy are identified and reported. This is not the same thing as tuning a model, where we’re looking to select the best hyperparameters for our final model fit; instead, we want to assess how that final model performs when its predictions are aggregated to multiple scales. This article walks through how to do that using the spatialsample package.

Multi-scale Assessment

Because Riemann et al. were working with data from the US Forest Inventory and Analysis (FIA) program, we’re going to do the same. However, because our main goal is to show how spatialsample can support this type of analysis, we won’t spend a ton of time worrying about any of the quirks of FIA data or on feature engineering. Instead, we’re going to use a simple linear model to see if we can predict how much aboveground biomass (“AGB”; all the non-root woody bits of trees) there is in a forest based on how many trees there are. We’ll use all the FIA field data from New York State, USA.

Because we’re mostly interested in assessing our models, let’s not focus on how exactly to download and wrangle the FIA data. If you’re curious, the code is in a hidden chunk here:

Pre-processing code
library(dplyr)

# Download the FIA database for New York over the internet,
# and unzip it into our local directory
#
# This updates annually, which means that this post likely won't
# generate the exact same results after 2022
httr::GET(
  "https://apps.fs.usda.gov/fia/datamart/Databases/SQLite_FIADB_NY.zip",
  httr::write_disk("SQLite_FIADB_NY.zip", TRUE)
)

unzip("SQLite_FIADB_NY.zip")

# We're going to work with the database through dplyr's database connections
#
# But first, we need to create a DBI connection to the database and
# load our tables:
con <- DBI::dbConnect(RSQLite::SQLite(), dbname = "FIADB_NY.db")
trees <- tbl(con, "TREE")

plots <- tbl(con, "PLOT")

# The FIA database has every measurement ever collected by the program;
# we'll filter to only the most recent survey for each of the plots.
#
# Plots are measured on a rolling 7 year basis, so we'll also cut out any
# plots which might not be remeasured anymore with a call to filter()
plots <- plots %>% 
  group_by(PLOT) %>% 
  filter(INVYR == max(INVYR, na.rm = TRUE)) %>%
  ungroup() %>% 
  filter(INVYR > 2009) %>% 
  collect()

copy_to(con, plots, "newest_plots", TRUE)
newest_plots <- tbl(con, "newest_plots")

# Now we'll use a filtering join to select only trees measured in the most
# recent sample at each plot
#
# We'll also count how many trees were at each plot,
# sum up their AGB, 
# and save out a few other useful columns like latitude and longitude
plot_measurements <- trees %>% 
  right_join(newest_plots, by = c("INVYR", "PLOT")) %>% 
  group_by(PLOT) %>% 
  summarise(
    yr = mean(INVYR, na.rm = TRUE),
    plot = mean(PLOT, na.rm = TRUE),
    lat = mean(LAT, na.rm = TRUE),
    long = mean(LON, na.rm = TRUE),
    n_trees = n(),
    agb = sum(DRYBIO_AG, na.rm = TRUE)
  ) %>% 
  collect() %>% 
  mutate(
    # Because of how we joined, `n_trees` is always at least 1 -- 
    # even if there are 0 trees
    n_trees = ifelse(is.na(agb) & n_trees == 1, 0, n_trees),
    agb = ifelse(is.na(agb), 0, agb)
  )

DBI::dbDisconnect(con)

readr::write_csv(plot_measurements, "plots.csv")

With that pre-processing done, it’s time to load our data and turn it into an sf object. We’re going to reproject our data to use a coordinate reference system that the US government tends to use for national data products, like the FIA:

library(sf)

invisible(sf_proj_network(TRUE))

plot_measurements <- 
  readr::read_csv("https://www.tidymodels.org/learn/work/multi-scale/plots.csv") %>% 
  st_as_sf(coords = c("long", "lat"), crs = 4326) %>% 
  st_transform(5070)
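
If you want to confirm the reprojection behaved as expected before moving on, a quick check along these lines should work (this is just an optional sanity check, not part of the analysis itself):

# After st_transform(), the plots should be in a projected CRS
# (EPSG 5070, with coordinates in meters) rather than longitude/latitude:
# st_is_longlat() should return FALSE, and the EPSG code should be 5070
st_is_longlat(plot_measurements)
st_crs(plot_measurements)$epsg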

This is what we’re going to resample. We want to assess our model’s performance at multiple scales, following the approach in Riemann et al. That means we need to do the following:

  1. Block our study area using multiple sets of regular hexagons of different sizes, and assign our data to the hexagon it falls into within each set.
  2. Perform leave-one-block-out cross-validation with each of those sets, fitting our model to n - 1 of the n hexagons we’ve created and assessing it on the hold-out hexagon.
  3. Calculate model accuracy for each size based on the aggregated predictions for each of those held-out hexes.

So to get started, we need to block our study area. We can do this using the spatial_block_cv() function from spatialsample. We’ll generate ten different sets of hexagon tiles, using cellsize arguments between 10,000 and 100,000 meters. The code to do that, and to store all of our resamples in a single tibble, looks like this:

set.seed(123)
library(dplyr)
library(spatialsample)
cellsize <- seq(10, 100, 10) * 1000

create_resample <- function(cellsize) {
  spatial_block_cv(
    plot_measurements,
    v = Inf,
    cellsize = cellsize,
    square = FALSE
  )
}

riemann_resamples <- tibble(
  cellsize = cellsize,
  resamples = purrr::map(cellsize, create_resample)
)

Two things to highlight about this code:

  • cellsize is in meters because our coordinate reference system is in meters. This argument represents the length of the apothem, from the center of each polygon to the midpoint of each side.
  • v is Inf because we want to perform leave-one-block-out cross-validation, but we don’t know how many blocks there will be before they’re created. This is the supported way to do leave-one-X-out cross-validation in spatialsample > 0.2.0 (another option is to set v = NULL); a quick check of how many folds each cellsize actually produced is shown below.
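
Because the number of folds is determined by the blocking rather than set directly, it can be helpful to count how many leave-one-block-out folds each cellsize actually produced (larger hexagons should give fewer folds). A quick, purely illustrative check:

# Each element of `resamples` is an rset whose rows are the folds,
# so counting rows gives the number of hexagon blocks per cellsize
purrr::map_int(riemann_resamples$resamples, nrow)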

If we want, we can visualize a few of our resamples, to get a sense of what our tiling looks like:

autoplot(riemann_resamples$resamples[[9]])
autoplot(riemann_resamples$resamples[[10]])

And that’s step 1 of the process completed! Now we need to move on to step 2, and actually fit models to each of these resamples. As a heads-up, this is a lot of models, and so is going to take a while:

riemann_resamples$resamples %>% purrr::map_dbl(nrow) %>% sum()
#> [1] 2600

Linear regression was invented around 1805, long before the Analytical Engine was a twinkle in Babbage’s eye. Whenever you get frustrated at how long it takes to fit many models, it’s nice to take a step back and recognize that we’re asking our poor, overworked computers to fit roughly as many models as were used in the first ~100 years of the technique’s life.

Now let’s load the rest of the tidymodels packages, then use them to define a workflow (from the workflows package), specifying the formula and model that we want to fit to each resample:

library(tidymodels)

lm_workflow <- workflow(agb ~ n_trees, linear_reg())
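
If you’d like a quick look at what this workflow estimates before refitting it a few thousand times, you can fit it once to the full dataset. This single fit isn’t used anywhere below, and the object name here is just for illustration:

# Fit the linear model to every plot at once and inspect its coefficients;
# the resampling below refits this same workflow to each fold
single_fit <- fit(lm_workflow, plot_measurements)
extract_fit_parsnip(single_fit) %>% tidy()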

Next, we’ll actually apply that workflow a few thousand times! As we said at the start, we aren’t looking to tune our models using these resamples; instead, we’re looking to see how well our point predictions do at estimating AGB across larger areas. As such, we don’t really care about calculating model metrics for each individual hexagon, and we’ll only calculate a single metric (root-mean-squared error, or RMSE) to save a little bit of time. We’ll also use the control_resamples() function with save_pred = TRUE to make sure we keep the predictions made for each resample. We can add these predictions as a new column to our resamples using the following:

riemann_resamples <- riemann_resamples %>%
  mutate(
    resampled_outputs = purrr::map(
      resamples, 
      fit_resamples,
      object = lm_workflow,
      metrics = metric_set(rmse),
      control = control_resamples(save_pred = TRUE)
    )
  )
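
If you want to double-check that those predictions were actually kept, you can peek at one element of the new column (purely for illustration):

# Each element of `resampled_outputs` is the result of fit_resamples();
# because we set save_pred = TRUE, it carries a .predictions column,
# and collect_metrics() summarizes the per-fold RMSE values
riemann_resamples$resampled_outputs[[1]]
collect_metrics(riemann_resamples$resampled_outputs[[1]])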

The riemann_resamples object now includes both our original resamples as well as the predictions generated from each run of the model. We can use the following code to “unnest” our predictions and estimate both the average “true” AGB and our average prediction at each hexagon:

riemann_metrics <- riemann_resamples %>% 
  select(cellsize, resampled_outputs) %>% 
  unnest(resampled_outputs) %>%
  transmute(
    cellsize = cellsize,
    mean_agb  = purrr::map_dbl(.predictions, function(x) mean(x$agb)),
    mean_pred = purrr::map_dbl(.predictions, function(x) mean(x$.pred))
  ) 

head(riemann_metrics)
#> # A tibble: 6 × 3
#>   cellsize mean_agb mean_pred
#>      <dbl>    <dbl>     <dbl>
#> 1    10000    5930.     7161.
#> 2    10000    6265.     7020.
#> 3    10000   11766.     7673.
#> 4    10000   28067.    21806.
#> 5    10000   13132.    17911.
#> 6    10000       0      6287.

Now that we’ve got our “true” and estimated AGB for each hexagon, all that’s left is to calculate model accuracy metrics at each aggregation scale we investigated. We can use functions from yardstick to quickly calculate root-mean-squared error (RMSE) and mean absolute error (MAE) for each cell size:

riemann_metrics <- riemann_metrics %>%
  group_by(cellsize) %>%
  summarize(rmse = rmse_vec(mean_agb, mean_pred),
            mae  =  mae_vec(mean_agb, mean_pred))

And just like that, we’ve got a multi-scale assessment of our model’s accuracy! To repeat a point from earlier, we aren’t using this as a way to tune our model. Instead, we can use our results to investigate and report how well our model does at different levels of aggregation. For instance, while it appears that both RMSE and MAE improve as we aggregate our predictions to larger and larger hexagons, some scales have a much larger difference between the two metrics than others. This hints that, at those specific scales, a few individual hexagons are large outliers driving RMSE higher, which might indicate that our model isn’t performing well in a few specific locations:

library(ggplot2)

riemann_metrics %>%
  pivot_longer(-cellsize) %>%
  ggplot(aes(cellsize, value, color = name)) + 
  geom_line() +
  geom_point() + 
  theme_minimal()
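
If you want to go beyond eyeballing the plot, one rough way to flag the scales where a few hexagons dominate the error is to compare RMSE and MAE directly: ratios well above 1 suggest a handful of large residuals are inflating RMSE at that cell size. A sketch of that comparison (the rmse_to_mae column name is just for illustration) might look like:

# Compare RMSE to MAE at each cellsize; larger ratios point to scales
# where a few poorly predicted hexagons drive most of the error
riemann_metrics %>%
  mutate(rmse_to_mae = rmse / mae) %>%
  arrange(desc(rmse_to_mae))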

References

Riemann, R., Wilson, B. T., Lister, A., and Parks, S. 2010. An effective assessment protocol for continuous geospatial datasets of forest characteristics using USFS Forest Inventory and Analysis (FIA) data. Remote Sensing of Environment, 114, pp. 2337-2353. doi: 10.1016/j.rse.2010.05.010.

Session information

#> ─ Session info ─────────────────────────────────────────────────────
#>  setting  value
#>  version  R version 4.2.1 (2022-06-23)
#>  os       macOS Big Sur ... 10.16
#>  system   x86_64, darwin17.0
#>  ui       X11
#>  language (EN)
#>  collate  en_US.UTF-8
#>  ctype    en_US.UTF-8
#>  tz       America/Los_Angeles
#>  date     2022-12-07
#>  pandoc   2.19.2 @ /Applications/RStudio.app/Contents/MacOS/quarto/bin/tools/ (via rmarkdown)
#> 
#> ─ Packages ─────────────────────────────────────────────────────────
#>  package       * version date (UTC) lib source
#>  broom         * 1.0.1   2022-08-29 [1] CRAN (R 4.2.0)
#>  dials         * 1.1.0   2022-11-04 [1] CRAN (R 4.2.0)
#>  dplyr         * 1.0.10  2022-09-01 [1] CRAN (R 4.2.0)
#>  ggplot2       * 3.4.0   2022-11-04 [1] CRAN (R 4.2.0)
#>  infer         * 1.0.4   2022-12-02 [1] CRAN (R 4.2.1)
#>  parsnip       * 1.0.3   2022-11-11 [1] CRAN (R 4.2.0)
#>  purrr         * 0.3.5   2022-10-06 [1] CRAN (R 4.2.0)
#>  recipes       * 1.0.3   2022-11-09 [1] CRAN (R 4.2.0)
#>  rlang           1.0.6   2022-09-24 [1] CRAN (R 4.2.0)
#>  rsample       * 1.1.1   2022-12-07 [1] CRAN (R 4.2.1)
#>  spatialsample * 0.2.1   2022-08-05 [1] CRAN (R 4.2.0)
#>  tibble        * 3.1.8   2022-07-22 [1] CRAN (R 4.2.0)
#>  tidymodels    * 1.0.0   2022-07-13 [1] CRAN (R 4.2.0)
#>  tune          * 1.0.1   2022-10-09 [1] CRAN (R 4.2.0)
#>  workflows     * 1.1.2   2022-11-16 [1] CRAN (R 4.2.0)
#>  yardstick     * 1.1.0   2022-09-07 [1] CRAN (R 4.2.0)
#> 
#>  [1] /Library/Frameworks/R.framework/Versions/4.2/Resources/library
#> 
#> ────────────────────────────────────────────────────────────────────