dynamicSDM: Response variable data

Introduction

dynamicSDM offers novel tools for developing species geographical distribution and abundance modelling (SDM) at high spatiotemporal resolution. Across the dynamicSDM vignettes, we demonstrate package functions on a case study species and guide users through the four key species distribution modelling stages:

Installation

Before starting this tutorial, the dynamicSDM package needs to be installed. You can install the latest version from GitHub with:

# devtools::install_github('r-a-dobson/dynamicSDM')

or from CRAN with

#install.packages("dynamicSDM")
library(dynamicSDM)
library(terra)
#> Warning: package 'terra' was built under R version 4.3.2
#> terra 1.7.55

Due to the utilisation of Google Earth Engine and Google Drive across dynamicSDM functions, you must also set up a free Google account at Google. Once you have your Google log-in details, these can be used to authorise Google Earth Engine in the rgee package and Google Drive in the googledrive package.

library(rgee)
#rgee::ee_install()
rgee::ee_check()

library(googledrive)
#googledrive::drive_auth_configure()
googledrive::drive_user()

Directory organisation

As we continue through the species distribution modelling framework, functions will generate a range of outputs. To keep organised, we recommend a structured directory system. The code below will create a new project directory in your temporary folders, but you can edit the script to specify your home directory if you want to retain function outputs between R sessions.

project_directory <- file.path(file.path(tempdir(), "dynamicSDM_vignette"))
dir.create(project_directory)

Case study: the red-billed quelea

In this four stage tutorial, we will be studying the case study species, the red-billed quelea (Quelea quelea), a nomadic bird inhabiting sub-Saharan Africa. With highly variable distributions driven by short-term weather and resource availability, dynamicSDM provides key functions to accurately model quelea’s distribution through space and time.

Stage 1: Response variable data

In this tutorial, we will be processing a sample of species occurrence records for the red-billed quelea (Quelea quelea) to generate an SDM response variable data frame. This will involve filtering records by spatio-temporal quality, extent and resolution; testing and accounting for spatio-temporal biases in records, and generating pseudo-absences through space and time.

a) Import occurrence records

A sample of red-billed quelea occurrence data can be loaded into your local R environment using the code below.

data("sample_occ_data")

These records are formatted as exported from the Global Biodiversity Information Facility, with columns labelled “decimalLongitude” and “decimalLatitude”. For dynamicSDM functions, occurrence data must be in a specific format; containing numeric data in columns labelled “x” and “y” for the record co-ordinates longitude and latitude, and “day”, “month” and “year” for the record’s date.

convert_gbif() can be used to convert between these formats, or this can be achieved manually for data not extracted from GBIF.

sample_occ <- convert_gbif(sample_occ_data)

b) Filter occurrence records

Validity and completeness

Species occurrence records may contain anomalous or missing values in the co-ordinates or dates given. spatiotemp_check() can be used to exclude records with missing (NA) values (argument na.handle), invalid co-ordinates (coord.handle), invalid dates (date.handle) and duplicate records (duplicate.handle).

Here, we specify date.res = "day" to check all date columns from year to day. If you only want to check the year of record dates, then you could specify date.res = "year".

sample_occ_filtered <- spatiotemp_check(occ.data = sample_occ,
                                               na.handle = "exclude",
                                               date.handle = "exclude",
                                               date.res = "day",
                                               coord.handle = "exclude",
                                               duplicate.handle = "exclude")
#> omitting any species records with coordinates containing NA
#> omitting any species records with dates containing NA
#> omitting any duplicate records
#> any records with invalid co-ordinates excluded
#> any records with invalid dates excluded
nrow(sample_occ_filtered)
#> [1] 311

Additionally, users can opt to take advantage of the CoordinateCleaner package functions to exclude records flagged as containing potentially anomalous co-ordinates, based on a wider variety of spatial tests than dynamicSDM alone offers.

sample_occ_filtered <- spatiotemp_check(occ.data = sample_occ,
                                        na.handle = "exclude",
                                        date.handle = "exclude",
                                        date.res = "day",
                                        coord.handle = "exclude",
                                        duplicate.handle = "exclude",
                                        coordclean = TRUE,
                                        coordclean.species = "quelea",
                                        coordclean.handle = "exclude")

Spatiotemporal extent

When generating dynamic species distribution models, it is important that occurrence records match the spatio-temporal extent of the study and any remote-sensing datasets to be utilised as explanatory variables. spatiotemporal_extent() removes any records outside of the set extents.

In this case study, we want to model the subspecies Q. q. lathamii which is typically limited to southern Africa and we will incorporate remote sensing datasets available for the years 2001 to 2019. To filter records to these extents, we use a polygon of southern African countries, which can be read into your R environment and applied using the following code.

data("sample_extent_data")

sample_occ_cropped <- spatiotemp_extent(occ.data = sample_occ_filtered,
                                               temporal.ext = c("2001-01-01", "2018-12-01"),
                                               spatial.ext = terra::ext(sample_extent_data),
                                               prj = "+proj=longlat +datum=WGS84 +no_defs")
#> spatial.ext could not be used as a mask.


## Lets plot the change
#terra::plot(sample_extent_data$geometry)
terra::plot(terra::vect(sample_occ_filtered[, c("x", "y")],
                        geom = c("x", "y"),
                        crs = "+proj=longlat +datum=WGS84 +no_defs"))


#terra::plot(sample_extent_data$geometry)
terra::plot(terra::vect(sample_occ_cropped[, c("x", "y")],
                        geom = c("x", "y"),
                        crs = "+proj=longlat +datum=WGS84 +no_defs"))

Spatiotemporal resolution

The spatial resolution of occurrence record co-ordinates and the temporal resolution of occurrence record dates are also important when generating dynamic SDMs. spatiotemp_resolution() can filter records by set thresholds.

As quelea migrations are driven by highly local and short-term environmental factors, we will filter to exclude records not given to the specific day and co-ordinates to four decimal places.

sample_occ_cropped<-spatiotemp_resolution(occ.data = sample_occ_cropped,
                                          spatial.res = 4,
                                          temporal.res = "day")

nrow(sample_occ_cropped) # Even more records have been removed!
#> [1] 142

c) Test for spatial and temporal sampling biases

For many reasons, the collection of species occurrence records are biased through space and time. These biases can impact SDM performance by over- or under- representing conditions at a given site or time. spatiotemp_bias() returns the results of statistical tests and plots for visual assessment of such biases. The code below tests for bias across the entire dataset and species range.

bias_results <- spatiotemp_bias(occ.data = sample_occ_cropped,
                                temporal.level = c("year"),
                                plot = FALSE,
                                spatial.method = "simple",
                                prj = "+proj=longlat +datum=WGS84")
bias_results
#> $Temporal_bias_year
#> 
#>  Chi-squared test for given probabilities
#> 
#> data:  occ.data.frequency
#> X-squared = 50.451, df = 15, p-value = 1.016e-05
#> 
#> 
#> $Spatial_bias
#> 
#>  Welch Two Sample t-test
#> 
#> data:  min.nndist.actual and min.nndist.random
#> t = -6.7945, df = 262.91, p-value = 7.243e-11
#> alternative hypothesis: true difference in means is not equal to 0
#> 95 percent confidence interval:
#>  -73309.46 -40366.60
#> sample estimates:
#> mean of x mean of y 
#>  45486.12 102324.15

Whereas, this code will test for bias across only the core of the species range.

bias_results <- spatiotemp_bias(occ.data = sample_occ_cropped,
                                temporal.level = c("year"),
                                plot = FALSE,
                                spatial.method = "core",
                                prj = "+proj=longlat +datum=WGS84")
#> Warning: [mask] CRS do not match
bias_results
#> $Temporal_bias_year
#> 
#>  Chi-squared test for given probabilities
#> 
#> data:  occ.data.frequency
#> X-squared = 50.451, df = 15, p-value = 1.016e-05
#> 
#> 
#> $Spatial_bias
#> 
#>  Welch Two Sample t-test
#> 
#> data:  min.nndist.actual and min.nndist.random
#> t = -0.14508, df = 190.95, p-value = 0.8848
#> alternative hypothesis: true difference in means is not equal to 0
#> 95 percent confidence interval:
#>  -15509.24  13384.01
#> sample estimates:
#> mean of x mean of y 
#>  45486.12  46548.73

Both approaches can generate important results. The latter may be a more reliable indicator if the species range is expanding or shifting in response to global change. This is because records at the range periphery that have been collected randomly may still not be evenly distributed through space and time due to underlying ecological processes. Therefore, testing for biases on the species range core may be more representative of true spatial and temporal sampling biases.

d) Account for spatial and temporal sampling biases

Occurrence record thinning

One approach to correct spatio-temporal sampling biases are to “thin” records, which involves excluding records that have been collected too closely in space or time. spatiotemp_thin() allows users to specify the spatial and temporal distance to thin records by. An important argument of spatiotemp_thin() is spatial.split.degrees which specifies the size of grid cells to split occurrence records into before temporal thinning. This prevents spatially distant but temporally close records from being filtered out.

occ_thin <- spatiotemp_thin(occ.data = sample_occ_cropped,
                                   temporal.method = "day",
                                   temporal.dist = 30,
                                   spatial.split.degrees = 3,
                                   spatial.dist = 100000,
                                   iterations = 5)
#> [1] "18  of  142 removed by temporal"
#> ********************************************** 
#>  Beginning Spatial Thinning.
#>  Script Started at: Fri Jun 28 14:50:48 2024
#> lat.long.thin.count
#> 47 
#>  1 
#> [1] "Maximum number of records after thinning: 47"
#> [1] "Number of data.frames with max records: 1"
#> [1] "No files written for this run."
#> [1] "19  of  142 removed by temporal"
#> ********************************************** 
#>  Beginning Spatial Thinning.
#>  Script Started at: Fri Jun 28 14:50:48 2024
#> lat.long.thin.count
#> 47 
#>  1 
#> [1] "Maximum number of records after thinning: 47"
#> [1] "Number of data.frames with max records: 1"
#> [1] "No files written for this run."
#> [1] "18  of  142 removed by temporal"
#> ********************************************** 
#>  Beginning Spatial Thinning.
#>  Script Started at: Fri Jun 28 14:50:48 2024
#> lat.long.thin.count
#> 48 
#>  1 
#> [1] "Maximum number of records after thinning: 48"
#> [1] "Number of data.frames with max records: 1"
#> [1] "No files written for this run."
#> [1] "18  of  142 removed by temporal"
#> ********************************************** 
#>  Beginning Spatial Thinning.
#>  Script Started at: Fri Jun 28 14:50:48 2024
#> lat.long.thin.count
#> 47 
#>  1 
#> [1] "Maximum number of records after thinning: 47"
#> [1] "Number of data.frames with max records: 1"
#> [1] "No files written for this run."
#> [1] "19  of  142 removed by temporal"
#> ********************************************** 
#>  Beginning Spatial Thinning.
#>  Script Started at: Fri Jun 28 14:50:49 2024
#> lat.long.thin.count
#> 45 
#>  1 
#> [1] "Maximum number of records after thinning: 45"
#> [1] "Number of data.frames with max records: 1"
#> [1] "No files written for this run."

# Plot the difference in spatial distribution after thinning
# Non-thinned
terra::plot(sample_extent_data$geometry)
terra::plot(terra::vect(sample_occ_cropped[, c("x", "y")],
                        geom = c("x", "y"),
                        crs = "+proj=longlat +datum=WGS84 +no_defs"),add=T)


# Thinned
terra::plot(sample_extent_data$geometry)
terra::plot(terra::vect(occ_thin[, c("x", "y")],
                        geom = c("x", "y"),
                        crs = "+proj=longlat +datum=WGS84 +no_defs"),add=T)

Weight records by sampling effort

Another approach to account for spatio-temporal sampling biases is to weight records by sampling effort when fitting SDMs.This results in the downweighting of the importance of records from over-sampled regions and vice versa.

Here, we will read in a sample of e-Bird avian sampling records in southern Africa to use as a proxy for sampling effort, and sum these across a spatial and temporal buffer from each occurrence record using spatiotemp_weights().

data("sample_events_data")
sample_occ_cropped <- spatiotemp_weights(occ.data = sample_occ_cropped,
                                       samp.events = sample_events_data,
                                       spatial.dist = 100000,
                                       temporal.dist = 30,
                                       prj = "+proj=longlat +datum=WGS84")
#> Records completed: 100

e) Generate pseudo-absences through space and time

Often the paucity of species absence records necessitates the generation of pseudo-absences, which are inferred absences based upon presence records. When modelling a species distribution at high spatiotemporal resolution, it may be best to generate pseudo-absences within close spatial and temporal buffers of species occurrence. This is because model presence-absence comparisons are made at fine scales, instead of coarsely between biomes and seasons. Using spatiotemp_pseudoabs() you can specify the spatial and temporal buffer size or extents to randomly generate pseudo-absences within.

# Pseudo-absences generated within spatial and temporal buffer
pseudo_abs_buff <- spatiotemp_pseudoabs(occ.data = sample_occ_cropped,
                                        spatial.method = "buffer",
                                        temporal.method = "buffer",
                                        spatial.ext = sample_extent_data,
                                        spatial.buffer = c(250000, 500000),
                                        temporal.buffer = c(42, 84),
                                        n.pseudoabs = nrow(sample_occ_cropped))

# Pseudo-absences generated randomly across given spatial and temporal extent
pseudo_abs_rand <- spatiotemp_pseudoabs(occ.data = sample_occ_cropped,
                                        spatial.method = "random",
                                        temporal.method = "random",  
                                        spatial.ext = sample_extent_data,
                                        temporal.ext = c("2002-01-01", "2019-12-01"),
                                        n.pseudoabs = nrow(sample_occ_cropped))

# Plot the spatial distribution of pseudo-absences (red) compared to occurrence records for:
# Buffered
terra::plot(sample_extent_data$geometry)
terra::plot(terra::vect(sample_occ_cropped[, c("x", "y")],
                        geom = c("x", "y"),
                        crs = "+proj=longlat +datum=WGS84 +no_defs"),col = "black",add=T)
terra::plot(terra::vect(pseudo_abs_buff[, c("x", "y")],
                        geom = c("x", "y"),
                        crs = "+proj=longlat +datum=WGS84 +no_defs"),col = "red", add=T)


# Random
terra::plot(sample_extent_data$geometry)
terra::plot(terra::vect(sample_occ_cropped[, c("x", "y")],
                        geom = c("x", "y"),
                        crs = "+proj=longlat +datum=WGS84 +no_defs"),col = "black",add=T)
terra::plot(terra::vect(pseudo_abs_rand[, c("x", "y")],
                        geom = c("x", "y"),
                        crs = "+proj=longlat +datum=WGS84 +no_defs"),col = "red", add=T)

Then, the generated pseudo-absences need to be added to the occurrence record data frame by adding relevant columns and binding the rows together.

pseudo_abs_buff$decimalLatitude <- pseudo_abs_buff$y
pseudo_abs_buff$decimalLongitude <- pseudo_abs_buff$x
pseudo_abs_buff$occurrenceStatus <- rep("absent", nrow(pseudo_abs_buff))
pseudo_abs_buff$source <- rep("dynamicSDM", nrow(pseudo_abs_buff))
pseudo_abs_buff<-spatiotemp_weights(occ.data = pseudo_abs_buff,
                                       samp.events = sample_events_data,
                                       spatial.dist = 100000,
                                       temporal.dist = 30,
                                       prj = "+proj=longlat +datum=WGS84")

complete.dataset <- as.data.frame(rbind(sample_occ_cropped, pseudo_abs_buff))

Summary

At the end of this vignette, we now have a complete data frame of filtered species occurrence and pseudo-absence records of appropriate spatio-temporal quality, extent and resolution. Let’s save this to our project directory for use in the next tutorial!

write.csv(complete.dataset, file = paste0(project_directory, "/filtered_quelea_occ.csv"))