A practical guide to geospatial interpolation with R

One of the most exciting things you can do with R is geospatial interpolation. That means you have some kind of information (e.g. measurements of, say, soil temperature) for a limited number of locations, and you then apply a mathematical model that provides an educated guess of what your result might look like if you had measured at every possible location. The practical advantage is clear: sampling is expensive and therefore always limited. This is especially true in the geosciences, where sampling might require hundreds of thousands of Euros for drilling deep holes down to the formation of interest.

Doing your geospatial interpolation in R is a great opportunity. Many of the software tools commonly associated with that task are commercial products that can become fairly expensive, so it is pretty common that when you finish university or change jobs, all the work spent on learning that tool is rendered void. R, on the other hand, is always free and offers a wide range of algorithms in all kinds of flavors. An additional bonus is the seamless integration with the powerful data transformation and visualization capabilities of the language.

A problem when using R (especially at the beginning) is that the user interface for applying different algorithms to the many available data structures varies wildly. Although improvements have clearly been made in recent years, there is little standardization, and every package author basically comes up with their own philosophy on how to handle input and output.

As a consequence, getting started with geospatial interpolation in R can be frustrating. Of course you can (and will have to) dig through the documentation of the various packages, but as interpolation models are a fairly advanced topic in statistics, those tend to be heavy on technical jargon.

There are a couple of good blog posts (see here, here and here) on geospatial interpolation, but they either focus on dealing with specific package interfaces or are again written for an audience that is already well versed in the world of spatial data.

When I started learning about geospatial interpolation with R a couple of years ago, I found it difficult to get up to speed with real-world data. Most use cases I found focused on pre-treated datasets whose conditions I could not sensibly reproduce with my own data, or took steps that seemed outright incomprehensible to me at the time. So I feel there is a documentation gap for a practical guide to the very basics of geospatial interpolation: how do I get from a bunch of field observations to a regular raster of interpolated values that I can plot on a nice little map? Here I present a minimalistic workflow designed to explain, step by step, the principles behind the operations that all approaches to geospatial interpolation with R have in common. That way, I hope, your entrance into this fascinating world will be more pleasant than mine was.

There are four steps to geospatial interpolation

  0. Have some data
  1. Create a grid template
  2. Fit a model
  3. Interpolate!

So let’s start!

Step 0: Have some data

# We will need some packages for (spatial) data processing
library(tidyverse) # wrangling tabular data and plotting
library(sf) # processing spatial vector data
library(sp) # another vector data package necessary for continuity
library(raster) # processing spatial raster data. !!!overwrites dplyr::select!!!

# And a lot of different packages to test their interpolation functions
library(gstat)  # inverse distance weighted, Kriging
library(fields) # Thin Plate Spline
library(interp) # Triangulation
library(mgcv)   # Spatial GAM
library(automap)# Automatic approach to Kriging

# Finally, some packages to make pretty plots
library(patchwork)
library(viridis)

# Download the data for this tutorial from Github
# Data is provided by the senate of the city of Berlin under:
# dl-de/by-2-0 Umweltatlas Berlin / Grundwassergüte Ammonium (Umweltatlas)
# for more details see: http://www.stadtentwicklung.berlin.de/umwelt/umweltatlas/ia204.htm

pts_NH4 <- readr::read_csv(
  "https://raw.githubusercontent.com/Ignimbrit/exchange/master/data/2020/Ammonium_Hoppegarten.csv",
  col_types = cols(NH4 = col_double(), 
                   X = col_double(), Y = col_double(), 
                   fid = col_character(), licence = col_character())
  ) %>% 
  dplyr::select(X, Y, NH4)

print(pts_NH4)
## # A tibble: 64 x 3
##          X        Y   NH4
##      <dbl>    <dbl> <dbl>
##  1 406521. 5814334.   2.3
##  2 409320. 5814598.  20.2
##  3 410322. 5814606.   0.3
##  4 408804. 5814825.   4.1
##  5 409049. 5814830.   2.5
##  6 408701. 5814896.  11.3
##  7 408843. 5814898.  16.1
##  8 409022. 5814928.   3  
##  9 408865. 5814943.  12  
## 10 408729. 5814949.  24.4
## # ... with 54 more rows

In the chunk above we already solved Step 0: Have some data! For this tutorial I compiled a little demonstration dataset (see the comments for the source) that you can download from GitHub to follow my steps. The data comes from a csv that contains 64 measurements of ammonium (chemical formula NH4+) in groundwater from a former septic drain field (“Rieselfeld”) at a place called Hoppegarten on the outer rim of Berlin. Ammonium in groundwater can be a problem, so it would be good to know exactly how big our problem is, and that can hardly be assessed from a couple of randomly scattered points alone.

point_plot <- ggplot(
  data = pts_NH4,
  mapping = aes(x = X, y = Y, color = NH4)) +
  geom_point(size = 3) +
  scale_color_gradientn(colors = c("blue", "yellow", "red"))

point_plot

We can see from plotting the points alone that there seems to be some spatial dependency in the ammonium concentration: samples with high concentrations sit mostly at the center of the plot. This is all we need: a coordinate pair (in this context sometimes referred to as the independent variables) and an associated continuous variable (the corresponding dependent variable). If there is a spatial pattern behind the values of the continuous variable (so that it really does depend on the coordinates specifying its position), geospatial interpolation can work.
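Before building anything on top of that assumption, it can be worth checking for spatial structure a little more formally. The short sketch below is my own addition rather than part of the original workflow; it uses the {sf} and {gstat} packages loaded above to compute an empirical variogram, where semivariance rising with distance indicates that nearby samples are indeed more similar than distant ones.

# Hedged sketch: empirical variogram as a quick check for spatial structure
nh4_sf_tmp <- sf::st_as_sf(pts_NH4, coords = c("X", "Y"), crs = 25833)
vgm_emp <- gstat::variogram(NH4 ~ 1, as(nh4_sf_tmp, "Spatial"))
plot(vgm_emp) # semivariance vs. distance between sample pairs

If that plot is essentially flat, the variable shows little spatial dependence and any interpolation result should be treated with suspicion.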

Step 1: Create a grid template

This is one of the less intuitive steps, but a very important one nonetheless. You need to specify where you would like to interpolate (i.e. generate new information) before you actually have that new information. When I first learned about geospatial interpolation with R, this was the step that bothered me the most. After all: how am I supposed to know where I would like my information if I do not actually have it yet?

I will not dive any further into the whys and hows here. For now, just accept that we need a template to later fill with interpolated values, and that creating this template is easy enough that you should not worry about it. There are a couple of catches, but we will discuss those as we encounter them. Let’s start with a very basic approach to making a grid template.

Option a) The simple approach

# First let's define a bounding Box, a rectangle that contains all our data
# points. There are many ways to do this but I am keeping it as simple as
# possible on purpose here
bbox <- c(
  "xmin" = min(pts_NH4$X),
  "ymin" = min(pts_NH4$Y),
  "xmax" = max(pts_NH4$X),
  "ymax" = max(pts_NH4$Y)
)

grd_template <- expand.grid(
  X = seq(from = bbox["xmin"], to = bbox["xmax"], by = 20),
  Y = seq(from = bbox["ymin"], to = bbox["ymax"], by = 20) # 20 m resolution
)

grid_plot <- ggplot() +
  geom_point(data = grd_template, aes(x = X, y = Y), size = 0.01) +
  geom_point(data = pts_NH4,
  mapping = aes(x = X, y = Y, color = NH4), size = 3) +
  scale_color_gradientn(colors = c("blue", "yellow", "red")) +
  coord_cartesian( #zooming in so we can actually see something
    xlim = c(408000, 409000), ylim = c(5815000, 5816000)) +
  theme_bw()

grid_plot

So this is the simplest approach to making a grid template I could think of. And when I say simple, I mean that in an entirely positive way: no extra dependencies, no domain-specific knowledge, and the result is a plain data.frame, which is easy to handle.

Option b) The classical approach

Using base::expand.grid is not necessarily the best way to build your grid template. Another popular option is sf::st_make_grid.

sf_NH4 <- st_as_sf(pts_NH4, coords = c("X", "Y"), crs = 25833)

alt_grd_template_sf <- sf_NH4 %>% 
  st_bbox() %>% 
  st_as_sfc() %>% 
  st_make_grid(
  cellsize = c(20, 20),
  what = "centers"
  ) %>%
  st_as_sf() %>%
  cbind(., st_coordinates(.)) %>% 
  st_drop_geometry() %>% 
  mutate(Z = 0)

Those commands will create a grid fairly similar to the one we made above, but there are some differences.

  • alt_grd_template_sf actually contains all sample points. If you have been following my steps closely, you might have noticed that I made no effort to ensure that when creating grd_template with expand.grid: the sequences stop short of the bounding box maxima whenever the extent is not an exact multiple of the resolution, so the easternmost and northernmost samples can fall outside that template (see the sketch after this list for one way to fix this).
  • alt_grd_template_sf was created using functions from the sf package and initially comes in a data structure provided by that package (an sfc, or simple feature column, to be specific). For continuity I converted the object back into a data.frame.
  • You can probably see that in order to leverage sf::st_make_grid we have to apply an awful lot of functions and pass through several data types before we actually get our “simple” XYZ grid. Coming up with this requires the user to already know a lot about the data classes of the sf package and their interdependencies.
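If you want to stay with the simple expand.grid approach but make sure the template extends over every sample, one possible fix is to round the bounding box outwards to the next multiple of the cell size before building the sequences. This is a hedged sketch of my own (grd_template_full is a hypothetical name, not used later in this post):

# Hedged sketch: a simple grid that is guaranteed to cover the full bounding box
cellsize <- 20 # same 20 m resolution as above
grd_template_full <- expand.grid(
  X = seq(
    from = floor(bbox["xmin"] / cellsize) * cellsize,
    to = ceiling(bbox["xmax"] / cellsize) * cellsize,
    by = cellsize
  ),
  Y = seq(
    from = floor(bbox["ymin"] / cellsize) * cellsize,
    to = ceiling(bbox["ymax"] / cellsize) * cellsize,
    by = cellsize
  )
)

The result is still a plain data.frame and could be used in place of grd_template for the rest of the workflow.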

In the long run, if you intend to take a deeper dive into the Rspatial world, I recommend reading up on how to use the sf package. It was specifically built to handle spatial vector data, and many interpolation algorithms out there are explicitly tailored towards accepting its classes or the closely related structures from its older sister package sp.

Step 1b: Rasterizing your grid template

I warned you that this tutorial would have to be a bit convoluted. We have our grid template now and it contains all the information we need. Regardless of whether you prefer base::expand.grid or sf::st_make_grid, what you get in return is essentially a table of point coordinates.

For some interpolation algorithms this is just fine, as they work well with data.frames. Others don’t. Usually there are ways to make it work, but often enough things get messy. As we want to look at a couple of different interpolation methods from different packages with vastly different interfaces and expectations, we will create a copy of our grid template in a different data structure, a raster, to make our life a little easier.

# {raster} expects a PROJ.4 string, see https://epsg.io/25833
crs_raster_format <- "+proj=utm +zone=33 +ellps=GRS80 +towgs84=0,0,0,0,0,0,0 +units=m +no_defs"

grd_template_raster <- grd_template %>% 
  dplyr::mutate(Z = 0) %>% 
  raster::rasterFromXYZ( 
    crs = crs_raster_format)
## Warning in showSRID(uprojargs, format = "PROJ", multiline = "NO"): Discarded datum Unknown based on GRS80 ellipsoid in CRS definition,
##  but +towgs84= values preserved
# Let's also carry with us the raster from the alternative approach
alt_grd_template_raster <- alt_grd_template_sf %>% 
  raster::rasterFromXYZ(
     crs = crs_raster_format
  )
## Warning in showSRID(uprojargs, format = "PROJ", multiline = "NO"): Discarded datum Unknown based on GRS80 ellipsoid in CRS definition,
##  but +towgs84= values preserved

Step 2: Fit a model

Now that we have cleared the first hurdle of confusion, we can move on to the next one. Even now that we have data to interpolate from and a grid to interpolate to, the process in between requires (in most cases) not one but two more steps: the algorithm of our choice must first be fitted to our data, and only then can we use that fit to make interpolations. In R those two processes are (usually) cleanly separated. This apparently makes perfect sense to statisticians, for whom, after all, the whole language was built. For the average geoscience rookie or end user, however, this might be a novel concept and appear unnecessarily cumbersome. Why would we separate fitting a model to the data from interpolating with it? After all, it is not like we are going to use the fitted model on any other data anyway. But bear with me, it can be done.
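To see that two-step pattern in isolation, here is a minimal sketch of my own using a plain linear trend surface (fit_LM and z_hat are hypothetical names). A linear trend is far too crude for serious interpolation, but it illustrates the separation: first a fitted model object is created from the observations, then that object plus the target locations are handed to a prediction function.

# Hedged illustration of the fit-then-predict pattern (not a serious model)
fit_LM <- lm(NH4 ~ X + Y, data = pts_NH4)        # 1) fit a model to the observations
z_hat <- predict(fit_LM, newdata = grd_template) # 2) predict at the grid template locations
head(z_hat)

Every approach below follows the same logic; only the functions and the shapes of the objects involved differ.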

# We start with functions that return model objects as this is the most 
# common case

# Nearest Neighbor
fit_NN <- gstat::gstat( # using package {gstat} 
  formula = NH4 ~ 1,    # The column `NH4` is what we are interested in
  data = as(sf_NH4, "Spatial"), # using {sf} and converting to {sp}, which is expected
  nmax = 10, nmin = 3 # Number of neighboring observations used for the fit
)

# Inverse Distance Weighting
fit_IDW <- gstat::gstat( # The setup here is quite similar to NN
  formula = NH4 ~ 1,
  data = as(sf_NH4, "Spatial"),
  nmax = 10, nmin = 3,
  set = list(idp = 0.5) # inverse distance power; note that {gstat}'s default is 2
)

# Thin Plate Spline Regression
fit_TPS <- fields::Tps( # using {fields}
  x = as.matrix(pts_NH4[, c("X", "Y")]), # accepts points but expects them as matrix
  Y = pts_NH4$NH4,  # the dependent variable
  miles = FALSE     # EPSG 25833 is based in meters
)

# Generalized Additive Model
fit_GAM <- mgcv::gam( # using {mgcv}
  NH4 ~ s(X, Y),      # here come our X/Y/Z data - straightforward enough
  data = pts_NH4      # specify in which object the data is stored
)

# Next we use a couple of functions that have a slightly different modus
# operandi as they in fact already return interpolated Z values.

# Triangular Irregular Surface
fit_TIN <- interp::interp( # using {interp}
  x = pts_NH4$X,           # the function actually accepts coordinate vectors
  y = pts_NH4$Y,
  z = pts_NH4$NH4,
  xo = grd_template$X,     # here we already define the target grid
  yo = grd_template$Y,
  output = "points"
) %>% bind_cols()

# Automatized Kriging  
fit_KRIG <- automap::autoKrige(      # using {automap}
  formula = NH4 ~ 1,                 # The interface is similar to {gstat} but
  input_data = as(sf_NH4, "Spatial") # {automap} makes a lot of assumptions for you
) %>% 
  .$krige_output %>%  # the function returns a complex object with lots of meta information
  as.data.frame() %>% # we keep only the data we are interested in
  dplyr::select(X = x1, Y = x2, Z = var1.pred) 
## [using ordinary kriging]

Step 3: Interpolate!

By now, everything should be in place. We have the data with our measured information, we have specified the precise coordinates for which new information is to be calculated, and we have fitted models describing the spatial relationships in our data. Now we only need to tell R to make the final connection. But of course, there are again a couple of ways to do that, depending on the interface and object structure of the package that provides the interpolation algorithm.

Case 1: We already have it

Some packages simply skip the two-step approach described in the previous chapter and right away provide you with X, Y and interpolated Z values. Among the algorithms used in the chunk above, this is the case for interp::interp() and automap::autoKrige(). We can just rasterize their output.

# Triangular Irregular Surface
interp_TIN <- raster::rasterFromXYZ(fit_TIN, crs = crs_raster_format)
## Warning in showSRID(uprojargs, format = "PROJ", multiline = "NO"): Discarded datum Unknown based on GRS80 ellipsoid in CRS definition,
##  but +towgs84= values preserved
# Automatized Kriging
interp_KRIG <- raster::rasterFromXYZ(fit_KRIG, crs = crs_raster_format)
## Warning in showSRID(uprojargs, format = "PROJ", multiline = "NO"): Discarded datum Unknown based on GRS80 ellipsoid in CRS definition,
##  but +towgs84= values preserved
# Note that you can inspect the result of your interpolation anytime with a 
# simple `plot()`, if it is a {raster*} object
plot(interp_KRIG)

Case 2: Using raster::interpolate

Most geospatial interpolation algorithms will return a model object, and if you are not really into statistics and R in general, those will appear mostly useless, as printing them reveals little that would seem of interest to a layperson.

print(fit_TPS)
## Call:
## fields::Tps(x = as.matrix(pts_NH4[, c("X", "Y")]), Y = pts_NH4$NH4, 
##     miles = FALSE)
##                                                
##  Number of Observations:                64     
##  Number of parameters in the null space 3      
##  Parameters for fixed spatial drift     3      
##  Model degrees of freedom:              50.4   
##  Residual degrees of freedom:           13.6   
##  GCV estimate for sigma:                8.145  
##  MLE for sigma:                         6.015  
##  MLE for rho:                           2279000
##  lambda                                 1.6e-05
##  User supplied rho                      NA     
##  User supplied sigma^2                  NA     
## Summary of estimates: 
##                  lambda      trA      GCV      shat -lnLike Prof converge
## GCV        1.587428e-05 50.44717 313.2434  8.144528     284.9669        8
## GCV.model            NA       NA       NA        NA           NA       NA
## GCV.one    1.587428e-05 50.44717 313.2434  8.144528           NA        8
## RMSE                 NA       NA       NA        NA           NA       NA
## pure error           NA       NA       NA        NA           NA       NA
## REML       4.336959e-03 13.00953 335.2979 16.344436     265.3901        6

The upside is that many R functions are specifically built around the conventions defined for model objects, so you do not really have to worry about why things work, because they usually just do, as long as you remember where to insert which object. We will take advantage of this and simply plug all our remaining model objects (those that do not yet hold interpolated values at our target grid) into the raster::interpolate function and have it take care of the rest.

# Nearest Neighbor
interp_NN <- interpolate(grd_template_raster, fit_NN)
## [inverse distance weighted interpolation]
# Inverse Distance Weighting
interp_IDW <- interpolate(grd_template_raster, fit_IDW)
## [inverse distance weighted interpolation]
# Thin Plate Spline Regression
interp_TPS <- interpolate(grd_template_raster, fit_TPS)

Case 3: Using stats::predict

This is probably R’s “official” way of predicting new values from a model fit. It should work with just about any model object, as every package author who has their package return a model object probably has this function in mind. raster::interpolate, on the other hand, does not seem to work with spatial GAMs. But by falling back on the lower-level stats::predict we can make it work with just a little more effort.

# Generalized Additive Model
interp_GAM <- grd_template %>% 
  mutate(Z = predict(fit_GAM, .)) %>% 
  rasterFromXYZ(crs = crs_raster_format)
## Warning in showSRID(uprojargs, format = "PROJ", multiline = "NO"): Discarded datum Unknown based on GRS80 ellipsoid in CRS definition,
##  but +towgs84= values preserved

Visualization, conclusion, further reading

Now that we have interpolated rasters, we should visualize them and have a look!

plot_my_rasters <- function(raster_object, raster_name){
  
  df <- rasterToPoints(raster_object) %>% as_tibble()
  colnames(df) <- c("X", "Y", "Z")
  
  ggplot(df, aes(x = X, y = Y, fill = Z)) +
    geom_raster() +
    ggtitle(label = raster_name) +
    scale_fill_viridis(option = "C") +
    theme_bw() +
    theme(
      axis.text = element_blank(),
      axis.title = element_blank(),
      axis.ticks = element_blank()
    )
}

rasterlist <- list(
    "Nearest Neighbor" = interp_NN, 
    "Inverse Distance Weighted" = interp_IDW, 
    "Kriging" = interp_KRIG, 
    "Thin Plate Spline Regression" = interp_TPS,
    "Triangular Irregular Surface" = interp_TIN, 
    "Generalized Additive Model" = interp_GAM
  )

plotlist <- map2(
  rasterlist,
  names(rasterlist),
  plot_my_rasters
)

# Note that the next trick only works because of library(patchwork)
(plotlist[[1]] + plotlist[[2]]) /
  (plotlist[[3]] + plotlist[[4]]) /
  (plotlist[[5]] + plotlist[[6]])

As you can see, the result strongly depends on the algorithm used for the interpolation. This is somewhat exaggerated in the plot above, because for this tutorial I made no attempt to fine-tune the different interpolation algorithms. In a real project in science or engineering, one would have to put a lot of thought into model parameterization and carefully evaluate both the natural situation and the mathematical implications. But that is material for another day’s blog post. If you want to learn more, the documentation and vignettes of the packages used above are a good place to start.
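To give at least a first impression of what such an evaluation could look like, here is a hedged sketch of my own, not part of the workflow above: {gstat} ships a cross-validation helper, gstat.cv(), which refits a gstat object while leaving out one observation at a time, so you can compare e.g. the IDW setup from above against the held-out measurements.

# Hedged sketch: leave-one-out cross-validation for the IDW model object
cv_IDW <- gstat::gstat.cv(fit_IDW)        # predicts each point from all the others
rmse_IDW <- sqrt(mean(cv_IDW$residual^2)) # root mean square error of those predictions
rmse_IDW

Lower values mean better agreement with the held-out observations; repeating this with different settings (or different algorithms, using their respective prediction functions) gives a first objective basis for choosing between them.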
