Kriging with R: Exploring gstat

Back in June I wrote a post about the basics of geospatial interpolation in R that, according to Twitter, resonated with a lot of people. It appears that there is a need for detailed tutorials on how to apply geospatial algorithms to real-world data (at least in R). That is very understandable, as R is a very powerful and popular tool for spatial analysis, yet the package ecosystem for geospatial interpolation is exceptionally convoluted by today's standards.

I therefore decided to follow up on the June post and take another look at the applicability of R to very practical problems in the world of geospatial interpolation.

One of the most famous and popular algorithms for geospatial interpolation is Kriging, and the standard package for Kriging in R is gstat. In my blog post from June I included Kriging as one of the methods presented there, but took a massive shortcut to carry out the actual computation by using the automap package, which can sort out all the intricate gstat details for you, at the price of handing all control over to the machine.

The reason I used that shortcut back then is that taking control by using gstat directly is hard. So hard, in fact, that this post, which I originally intended to be a more or less comprehensive tour through the gstat universe, will now focus mainly on funneling your data into and out of gstat, just to make Kriging work. A more detailed account of the statistical background and certain aspects of model optimization will have to wait for another day.

I suppose the main reason working with gstat is so difficult these days is that the package is a true R veteran. It is older than the tidyverse, and its code base dates back to 1993. The interface is probably very different from anything you are used to working with.

For handling spatial data, gstat relies on the sp package, which itself has largely been superseded by the much more convenient sf package. Yet here we are, and gstat is still very much THE place to go if you want to do Kriging. So there is no point in complaining any further; let's see how we can make it work for any given dataset.

Setup

# We will need some packages for (spatial) data processing
library(tidyverse) # wrangling tabular data and plotting
library(sf) # processing spatial vector data - the easy way
library(sp) # processing spatial vector data - the way gstat needs it
library(raster) # processing spatial raster data. !!!overwrites dplyr::select!!!

# Packages for geostatistics
library(gstat)   # The most popular R-Package for Kriging (imho)
library(automap) # Automatize some (or all) parts of the gstat-workflow 

# Finally, some packages to make pretty plots
library(patchwork)
library(viridis)

# Download the data for this tutorial from Github!
# The data for this tutorial is derived from a dataset published here
# https://www.opengeodata.nrw.de/produkte/umwelt_klima/wasser/flurabstandskarte_1988/
# licence information:
# Datenlizenz Deutschland – Flurabstandskarte NRW 1988 – Version 2.0
# see http://www.govdata.de/dl-de/by-2-0 for more details

grd_100_df <- readr::read_csv(
  "https://raw.githubusercontent.com/Ignimbrit/exchange/master/data/2020/grid_100.csv",
  ) %>% 
  dplyr::select(-licence)

# The projection is EPSG 25832

head(grd_100_df)
## # A tibble: 6 x 3
##        X       Y     Z
##    <dbl>   <dbl> <dbl>
## 1 402040 5735960  62.2
## 2 402140 5735960  62.6
## 3 402240 5735960  63.0
## 4 402340 5735960  63.5
## 5 402440 5735960  63.9
## 6 402540 5735960  64.4

The data for this tutorial is an xyz file giving us the height of the groundwater table in some part of the German state of North Rhine-Westphalia in 1988. We can convert the data we just read in as a tibble into a RasterLayer and have a look.

grd_100_rstr <- raster::rasterFromXYZ(
  grd_100_df, 
  res = c(100, 100), # resolution in meter (see crs)
  crs = "+proj=utm +zone=32 +ellps=GRS80 +towgs84=0,0,0,0,0,0,0 +units=m +no_defs"
    )

plot(grd_100_rstr)

Of course this is quite the opposite of a situation where geospatial interpolation is necessary - we already have a complete raster! So for the sake of training we will pretend we do not know the picture above. Let's assume that back in 1988 we measured the groundwater level at a generous yet limited number of locations, as is often the case when tasked with interpolating the groundwater table.

set.seed(42) # for reproducibility

# Simulate 100 random observation wells 
wellobs <- slice_sample(grd_100_df, n = 100)

ggplot(
  data = wellobs,
  mapping = aes(x = X, y = Y, color = Z)
) +
  geom_point(size = 3) + 
  scale_color_viridis(option = "B") +
  theme_classic()

# Convert to {sf} because that is the best way to store spatial points
wellobs_sf <- st_as_sf(wellobs, coords = c("X", "Y"), crs = 25832) %>% 
  cbind(st_coordinates(.))

You might have noticed that when I switched to sf in the chunk above, I included an extra step to store the X and Y coordinates of our observations not only in the geometry column of the sf object but also in its tabular part. The reason for this is that for Universal Kriging, a certain flavor of Kriging we will learn more about later, the coordinates are also explicit model variables, and this is the best way I could come up with to account for this dual role.

Creating a Variogram

In order to make Kriging work you need to supply a variogram, which is essentially a function describing the relationship between distance and “Z” in your point data. I do not want to go into too much detail here, as “what are variograms” and “working with variograms in gstat” are enough material for two more blog posts. We will therefore rush through this a bit. First let's create an empirical variogram, giving us the actual distribution of spatial dependencies observed in the data.

# We will discuss later, what Z~1 does actually mean in this context
v_emp_OK <- gstat::variogram(
  Z~1,
  as(wellobs_sf, "Spatial") # switch from {sf} to {sp}
  )

plot(v_emp_OK)

Note that I switched from sf to sp classes in my call to variogram in order to make sure gstat understands what is going on. In fact, by now, some support for sf has been included in gstat but my impression is that this support is fairly limited and should not be relied upon.
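By the way, the empirical variogram gstat returns is essentially a data frame, so it is easy to inspect directly. Among its columns are the number of point pairs per distance lag (np), the mean distance of those pairs (dist) and the corresponding semi-variance (gamma):

# The empirical variogram is essentially a data frame of distance lags:
# np = number of point pairs, dist = mean pair distance, gamma = semi-variance
head(v_emp_OK)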

Now that we have an empirical variogram, we need to fit a mathematical function so we can inter-/extrapolate. At this point we will take another short cut and have automap do the work for us, because variogram modeling is not the topic of the day.

# automap's autofitVariogram actually produces more info than we need.
# I will only keep the var_model part.
v_mod_OK <- automap::autofitVariogram(Z~1, as(wellobs_sf, "Spatial"))$var_model

# To inspect the automatic fit that was chosen for us we can use
# automap's excellent build in methods for base::plot
plot(automap::autofitVariogram(Z~1, as(wellobs_sf, "Spatial")))

automap has determined the optimal variogram model (as far as automap is concerned) for us, and we can feed it into the Kriging algorithm. In this case, the function has determined that a “Stein's parameterization” model (a variant of the Matérn model) with the parameters listed in the plot is the best fit. Other models that could be used to control the shape of the fitted curve are spherical (“Sph”), exponential (“Exp”), Gaussian (“Gau”) and Matérn (“Mat”). Alternatively, it is possible to specify some or all parameters in the call to autofitVariogram, and automap will only optimize the arguments we leave unspecified.
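If you prefer to stay within gstat for this step, a manual fit works roughly along the following lines. This is a minimal sketch assuming a spherical model; fit.variogram picks reasonable starting values for any parameters left unspecified and optimizes them against the empirical variogram.

# A minimal sketch of a manual variogram fit with gstat instead of automap
v_mod_manual <- gstat::fit.variogram(
  v_emp_OK,                         # the empirical variogram from above
  model = gstat::vgm(model = "Sph") # a spherical model, parameters unspecified
)
plot(v_emp_OK, v_mod_manual)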

Note that the variogram above implicitly assumes isotropic conditions, which means the semi-variance is the same regardless of which direction you look in. This is not necessarily true (in fact, it probably isn't), and we could dive deeper into this by inspecting several variograms for different directions (north, east, …). This can be achieved by providing an alpha argument in the call to autofitVariogram (or to some of the many gstat functions related to that topic), as sketched below. For the sake of brevity, I will not pursue this further and just assume spatial isotropy.
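Just to give you an idea of what such a directional analysis might look like, here is a sketch of the empirical variograms for four directions. The alpha argument takes directions in degrees clockwise from north; each direction would eventually need its own model fit.

# Empirical variograms in four directions (0°, 45°, 90° and 135° from north)
v_emp_dir <- gstat::variogram(
  Z~1,
  as(wellobs_sf, "Spatial"),
  alpha = c(0, 45, 90, 135)
)
plot(v_emp_dir)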

Defining a target grid

As explained in my last blog post on spatial interpolation, many interpolation methods expect you to provide them with a target set of coordinates (“X”, “Y”) for which the modeled variable (“Z”) is to be interpolated. Usually that means coming up with some kind of “empty” grid or raster. This holds true for Kriging with gstat, too, except everything is a little bit more complicated, because we need to do it in sp.

# technically we already have a grid from the initial dataset, but as we are 
# still working under the pretense that our only available data are the 
# simulated observation wells, we will construct our grid from that object.

# Step 1: define a grid based on the bounding box of our observations
grd_100_sf <- wellobs_sf %>% 
  st_bbox() %>% 
  st_as_sfc() %>% 
  st_make_grid(
  cellsize = c(100, 100), # 100m pixel size
  what = "centers"
  ) %>%
  st_as_sf() %>%
  cbind(., st_coordinates(.))

# Step 2: making our grid work for gstat
grd_100_sp <- as(grd_100_sf, "Spatial") # converting to {sp} format
gridded(grd_100_sp) <- TRUE             # informing the object that it is a grid
grd_100_sp <- as(grd_100_sp, "SpatialPixels") # specifying what kind of grid

# That second step there is based on a discussion I found on Stackoverflow
# https://stackoverflow.com/questions/43436466/create-grid-in-r-for-kriging-in-gstat

As you can see, making practical grids is surprisingly verbose, given how common it is to need one when dealing with spatial operations in R in general. Somebody should probably write a package to address this issue.
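Until then, here is an alternative sketch of the same idea using sp and base R only, in case you want to skip the sf detour. Cell size and coordinates are the same as above; note that no CRS is attached here, so for real work you would still have to set one.

# A sketch: build an equivalent SpatialPixels grid straight from the bounding box
bb <- sf::st_bbox(wellobs_sf)
grd_alt <- expand.grid(
  X = seq(bb["xmin"], bb["xmax"], by = 100),
  Y = seq(bb["ymin"], bb["ymax"], by = 100)
)
sp::coordinates(grd_alt) <- ~X+Y # promote the data.frame to spatial points
sp::gridded(grd_alt) <- TRUE     # declare the points a regular grid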

Kriging

Kriging comes in several flavours. I will focus here on three rather common and basic variants: “Ordinary”, “Simple”, and “Universal” Kriging. What’s the difference between those three?

Simple Kriging assumes that the mean in your target area (your grid template, also known as “the random field”) is constant and known to you. Local variability is just that: a deviation from the norm which is to be accounted for.

Ordinary Kriging is almost the same, just this time you do not know the value of the mean.

With Universal Kriging you no longer have a level plane with some bumps in it; instead you are dealing with a tilted or even curved surface (still with bumps in it).
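To put the three flavors into (very much simplified) formulas: Simple Kriging models $Z(s) = \mu + \varepsilon(s)$ with the constant mean $\mu$ known in advance; Ordinary Kriging uses the same model but estimates the constant $\mu$ from the data; and Universal Kriging replaces the constant with a trend in the coordinates, e.g. $Z(s) = \beta_0 + \beta_1 X + \beta_2 Y + \varepsilon(s)$, where the trend coefficients are estimated as part of the interpolation.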

In gstat, the choice between the different kinds of Kriging (or any of the implemented algorithms) is mainly controlled by the formula you supply. We have already seen that syntax when fitting the variogram. The best way to explain this is probably by looking at examples.

# Ordinary Kriging
OK <- krige(
  Z~1,                       # Z is our variable and "~1" means "depends on mean"
  as(wellobs_sf, "Spatial"), # input data in {sp} format
  grd_100_sp,                # locations to interpolate at
  model = v_mod_OK           # the variogram model fitted above
  )
## [using ordinary kriging]
# Simple Kriging
SK <- krige(
  Z~1,                       # Z still depends on mean
  beta = mean(grd_100_df$Z), # but this time we know the mean's value
  as(wellobs_sf, "Spatial"), # input data in {sp} format
  grd_100_sp,                # locations to interpolate at
  model = v_mod_OK           # the variogram model fitted above
  )
## [using simple kriging]
# Universal Kriging
# Implementing this method is somewhat different.
# we no longer assume that Z is essentially depending on a single mean but
# rather on the position of the interpolation location within our target grid
UK <- krige(
  Z~coords.x1+coords.x2, # Think "Z~X+Y" but {sp} conversion alters variable naming
  as(wellobs_sf, "Spatial"), # input data in {sp} format (`X` --> `coords.x1`)
  grd_100_sp,                # locations to interpolate at
  model = autofitVariogram(  # we need an appropriate variogram fit
    Z~X+Y,                   # X and Y are real data columns in wellobs_sf, so we can use them here
    as(wellobs_sf, "Spatial")
    )$var_model
  )
## [using universal kriging]
# I'll also add an inverse distance weighted model to provide a baseline
# for model evaluation
# Note how the only difference to Ordinary Kriging is the absence of a
# fitted variogram model
idwres <- idw(
  Z~1,                       # idw also depends on mean
  as(wellobs_sf, "Spatial"), # input data in {sp} format
  grd_100_sp,                # locations to interpolate at
) 
## [inverse distance weighted interpolation]

As you can see, Universal Kriging is a bit more difficult to realize than the other two. But these difficulties mostly arise from the fact that we need to switch between sf and sp objects and lose our variable names on the way. One could avoid that by working in sp only, but then making a grid template and working with our spatial point information in general would become really inconvenient. I do not recommend it.
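If you are ever unsure which names sp has assigned to the coordinates after such a conversion, you can always check them directly:

# Inspect the coordinate column names sp assigned during the conversion
sp::coordnames(as(wellobs_sf, "Spatial"))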

Inspect the results

Now that we have all interpolations ready we can plot the resulting rasters. I will use a little function I defined back in June to make things easier.

# A function to plot rasters
plot_my_gstat_output <- function(raster_object, object_name){
  
  df <- rasterToPoints(raster_object) %>% as_tibble()
  colnames(df) <- c("X", "Y", "Z")
  
  ggplot(df, aes(x = X, y = Y, fill = Z)) +
    geom_raster() +
    ggtitle(label = object_name) +
    scale_fill_viridis(option = "B", limits = c(50, 100)) +
    theme_void() +
    theme(
      plot.title = element_text(hjust = 0.5)
    )
}

p_orig <- plot_my_gstat_output(grd_100_rstr, "Original Raster")
p_idw <- plot_my_gstat_output(raster(idwres), "IDW")
p_SK <- plot_my_gstat_output(raster(SK), "Simple Kriging")
p_OK <- plot_my_gstat_output(raster(OK), "Ordinary Kriging")
p_UK <- plot_my_gstat_output(raster(UK), "Universal Kriging")

# I also want to display sampling locations
p_wellobs <- ggplot(
  data = wellobs,
  mapping = aes(x = X, y = Y, color = Z)
) +
  geom_point(size = 3) + 
  scale_color_viridis(option = "B",  limits = c(50, 100)) +
  ggtitle(label = "Observation Wells Sampled") +
  theme_void() +
    theme(
      plot.title = element_text(hjust = 0.5)
    )

# This works because of library(patchwork)
(p_orig + p_wellobs + p_idw) / 
  (p_SK + p_OK + p_UK) + 
  plot_layout(guides = 'collect')

From the plot alone it is not obvious that the different Kriging methods did actually produce different results. We can, however, take a look at the summary statistics and we will see that the three different approaches did indeed yield (subtly) different results.

map(list(SK, OK, UK), raster) %>% 
  map(summary) %>%
  do.call("cbind", .) %>% 
  as.data.frame() %>% 
  setNames(c("SK", "OK", "UK"))
##               SK       OK       UK
## Min.    54.39102 54.32669 54.32087
## 1st Qu. 68.90548 68.89596 68.90859
## Median  73.02403 73.00607 73.01139
## 3rd Qu. 78.59243 78.57769 78.57435
## Max.    97.83050 97.83246 97.83256
## NA's     0.00000  0.00000  0.00000
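Since in this artificial setting we happen to know the true surface, we can also run a simple quantitative check. The following is a sketch only: rmse_vs_original is a small helper defined here (not part of gstat), and raster::resample is used to align each interpolated raster with the original grid before computing the root mean square error.

# A sketch of quantitative evaluation against the known original raster
rmse_vs_original <- function(interp_raster) {
  aligned <- raster::resample(interp_raster, grd_100_rstr) # align the grids
  sqrt(mean((values(aligned) - values(grd_100_rstr))^2, na.rm = TRUE))
}

map_dbl(
  list(IDW = raster(idwres), SK = raster(SK), OK = raster(OK), UK = raster(UK)),
  rmse_vs_original
)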

Anyway, this is how you can make Kriging in gstat work for any dataset. If you have a nice example to show where the three different Kriging approaches yield visibly different results, please feel free to contact me (you could write me a Twitter DM) - I’d love to see that.
