Simulations

Introduction

This vignette describes the methodology used in Buba et al (2024) to simulate first records and evaluate models performance. For a more detailed explanation of the models used, as the scientific justification for the parameters, please refer to Buba et al (2024).

Note: Buba et al (2024) uses a C++ script integerated to R using RcppArmadillo. These functions are currently not available within the alien package and will be added with the next release.

We start by loading the alien package which will be used to: - Create the simulations. - Estimate the introduction rate.

library(alien)

We begin by setting up the parameters:

list_tsl <- list(
  b0  = c(-1,      # mean introduction rate of 0.3 sp/y
          0,       # mean introduction rate of  1  sp/y
          1),      # mean introduction rate of 2.7 sp/y
  b1  = c(-0.02,   # annual decrease of  5%
          -0.01,   # annual decrease of 10%
          -0.005,  # annual decrease of 20%
          0,       # constant introduction rate
          0.005,   # annual increase of  5% 
          0.01,    # annual increase of 10% 
          0.02),   # annual increase of 20% 
  pi0 = c(-4,      # annual observation probability of 0.017,   ~55 years delay
          -2,      # annual observation probability of 0.119,   ~8 years delay
          0,       # annual observation probability of   0.5,   ~2 years delay
          2,       # annual observation probability of   0.88,  ~0.13 years delay
          4),      # annual observation probability of   0.98,  ~0.02 year delay
  pi1 = c( 0,      # annual increase of     0% in probability of detection
           0.5,   # annual increase of  0.02% in probability of detection
           1,  # annual increase of   0.2% in probability of detection
           2), # effect of population size on detection probability
  pi2 = c( 0,      # no effect of population size on detection probability
           1e-7,   # normal effect of population size on detection probability
           0.001), # effect of population size on detection probability
  length = seq(30,200,15)
)

df_tsl <- expand.grid(list_tsl)

est_names <- c("b0_est", "b1_est", "pi0_est", "pi1_est", "pi2_est")
ses_names <- c("b0_se", "b1_se", "pi0_se", "pi1_se", "pi2_se")

To save time for the creation of this vignette, we will use a subset of just one parameter combination:

set.seed(3333) # Set a seed for replication
df_tsl <- df_tsl |> dplyr::sample_n(1)

To demonstrate how \(lambda_t\) looks like using these parameter combination, we can use the calculate_lambda function of the package which is currently not exported but can be called using alien:::calculate_lambda:

params <- as.numeric(df_tsl[1:5]) # parameters
names(params) <- c("b0","b1","pi0","pi1", "pi2") # set names
time_span <- as.numeric(df_tsl[6]) # time series length.

sampling_lambda <- alien:::calculate_lambda(mu = ~ time,
                                            pi = ~ time,
                                            data = data.frame(time = seq_len(time_span)),
                                            beta = params[c("b0","b1")],
                                            gamma = params[c("pi0","pi1")],
                                            growth_param = params["pi2"],
                                            type = "exponential")
ggplot2::ggplot()+
  ggplot2::aes(x = seq_len(time_span), y = sampling_lambda)+
  ggplot2::geom_line(linewidth = 1.1) + 
  ggplot2::labs(x = "Time", y = expression("\U03BB"[t]))

Now for the simulation. It is a nested loop with the following steps: 1. Create a simulation of \(\lambda_t\) using the parameters. Happens once for every parameter set. 2. Sample \(N_t\) from \(\lambda_t\) - here we set it to happen one time per parameter set instead of the 200 used in Buba et al (2024). To change it, increase the value of the variable number_of_repetitions. 3. Fit the naive model, Solow and Costello model, the constant detection model, and the sampling-proxy model to \(N_t\). 4. Summarizing results in the desired structure. 5. Return the parameters, the summarized results, and the full output of the models.

list <- apply(df_tsl,
                MARGIN = 1, FUN = function(row) {

                  # Defining variables and parameters:
                  params <- as.numeric(row[1:5])
                  names(params) <- c("b0","b1","pi0","pi1", "pi2")
                  time_span <- row[6] # time series length.

                  # Create a faux sampling vector based on the time span to be used for our sampling_proxy
                  sampling_vector <- scale(1:time_span) |> as.numeric()

                  # Create lambda vector
                  sampling_lambda <- alien:::calculate_lambda(mu = ~ time,
                                            pi = ~ time,
                                            data = data.frame(time = seq_len(time_span)),
                                            beta = params[c("b0","b1")],
                                            gamma = params[c("pi0","pi1")],
                                            growth_param = params["pi2"],
                                            type = "exponential")
                  
                  number_of_repetitions <- 1
                  
                  result <- lapply(seq_len(number_of_repetitions), function(iteration) {

                    # Creating an introduction
                    sampling_data <- rpois(time_span, sampling_lambda)

                    # model fitting start
                    naive_model <- glm(formula = sampling_data ~ seq_len(time_span), family = "poisson")

                    guess_all <- c(b0 = 0, b1 = 0, pi0 = 0, pi1 = 0, pi2 = 0)

                    snc_model <- alien::snc(y = sampling_lambda, type = "exponential",
                                            mu = ~ time,
                                            pi = ~ time, growth = T,
                                            init = guess_all,
                                            data = data.frame(time = seq_len(time_span)),
                                            control = list(maxit = 10000))

                    constant_detection_model <- alien::snc(y = sampling_lambda, type = "exponential",
                                                           mu = ~ time,
                                                           pi = ~ 1, growth = F,
                                                           init = guess_all[1:3],
                                                           data = data.frame(time = seq_len(time_span)),
                                                           control = list(maxit = 10000))

                    sampling_proxy_model <- alien::snc(y = sampling_lambda, type = "exponential",
                                                       mu = ~ time,
                                                       pi = ~ sampling_vector, growth = F,
                                                       init = guess_all[1:4],
                                                       data = data.frame(time = seq_len(time_span), sampling_vector = sampling_vector),
                                                       control = list(maxit = 10000))

                    # model fitting end
                    
                    # Summarize results start
                    naive_par <-  coefficients(summary(naive_model))[,1] |>  dplyr::bind_rows() |> `colnames<-`(est_names[1:2])
                    naive_se <-  coefficients(summary(naive_model))[,2] |> dplyr::bind_rows() |> `colnames<-`(ses_names[1:2])
                    naive_convergence <- tibble::tibble(converged = naive_model$converged)

                    naive_row <- dplyr::bind_cols(naive_par, naive_se, naive_convergence) |>
                      tibble::add_column(method = "naive")

                    list_of_rows <- lapply(list(snc_model, constant_detection_model, sampling_proxy_model), function(est) {

                      method <- dplyr::case_when(
                        identical(est, snc_model) ~ "snc",
                        identical(est, constant_detection_model) ~ "snc_3p",
                        identical(est, sampling_proxy_model) ~ "snc_vec"
                      )

                      if (any(class(est) == "try-error")){
                        est_row <- rep(NA,11)
                        names(est_row) <- c("b0_est", "b1_est", "pi0_est",
                                            "pi1_est", "pi2_est", "b0_se", "b1_se",
                                            "pi0_se", "pi1_se", "pi2_se", "converged")
                        est_row <- dplyr::bind_rows(est_row) |> tibble::add_column(method = method)
                      } else {

                        par <- est$coefficients$Estimate |> `names<-`(est_names[1:nrow(est$coefficients)]) |> dplyr::bind_rows()
                        ses <- est$coefficients$`Std.Err` |> `names<-`(ses_names[1:nrow(est$coefficients)])  |> dplyr::bind_rows()
                        convergence <- tibble::tibble(converged = est$convergence == 0)
                        est_row <- dplyr::bind_cols(par, ses, convergence) |> tibble::add_column(method = method)
                      }
                    }
                    )

                    params <- dplyr::bind_rows(naive_row,
                                        dplyr::bind_rows(list_of_rows)) |>
                      tibble::add_column(iteration = iteration)

                    # summarize results end
                    return(list(params = params,
                                full = list(simulation = sampling_data,
                                            estimates = snc_model,
                                            estimates_3p = constant_detection_model,
                                            estimates_vec = sampling_proxy_model,
                                            perfect_detection = coefficients(summary(naive_model))
                                )
                    )
                    )
                  })

                  simulation_estimates <- dplyr::bind_rows(purrr::map(result, 1))
                  full <- purrr::map(result, 2)

                  return(list(sim_params = row,
                              estimates = simulation_estimates,
                              full = full))
                })
#> Warning in HelpersMG::SEfromHessian(a = optim_out[["hessian"]]): Error in
#> Hessian matrix inversion
#> Warning in HelpersMG::SEfromHessian(a = optim_out[["hessian"]]): Calculates the
#> Nearest Positive Definite Matrix. Use result with caution.

After this process, we can view a summary of the results.

Here we have the estimate and standard error of each parameter denoted by _{parameter_est} and _{parameter_se}, respectively.


purrr::map(list, 2) |> dplyr::bind_rows() # for the summary of each iteration
#> # A tibble: 4 × 13
#>   b0_est  b1_est b0_se   b1_se converged method  pi0_est pi1_est pi2_est pi0_se
#>    <dbl>   <dbl> <dbl>   <dbl> <lgl>     <chr>     <dbl>   <dbl>   <dbl>  <dbl>
#> 1 -0.943 0.00580 0.220 0.00207 TRUE      naive    NA     NA        NA     NA   
#> 2 -1.00  0.00501 0.532 0.0224  TRUE      snc       0.351 -0.0127    3.65   9.80
#> 3 -1.00  0.00500 0.234 0.00220 TRUE      snc_3p    3.94  NA        NA     86.3 
#> 4 -1.00  0.00509 0.276 0.00425 TRUE      snc_vec   1.74  -0.863    NA     27.9 
#> # ℹ 3 more variables: pi1_se <dbl>, pi2_se <dbl>, iteration <int>

Non-monotonic trends in detection

To create a simulation of a non-monotonic trend in detection, it is necessary to create a faux proxy to simulate from.

This can be done in multiple ways. In Buba et al (2024), an increasing sine vector is used. Here we demonstrate a 100 year increasing sine:

t <- seq(0, 4*pi, length.out = 100)
a <- 3
b <- 0.6
set.seed(100) # This is done to maintain reproducability within time series lengths
c.norm <- rnorm(100)
amp <- 2

sampling_vector <- 3 * a*sin(b*t)+c.norm*amp + 0.3 * 1:100

sampling_vector <- scale(sampling_vector) |> as.numeric()


ggplot2::ggplot()+
  ggplot2::aes(x = 1:100, y = sampling_vector)+
  ggplot2::geom_line(linewidth = 1.1) + 
  ggplot2::labs(x = "Time", y = "Sampling proxy value")

The next step is then to simulate \(\lambda_t\) based on this proxy. This can be done easily using the snc function:

We set the value of pi1 to be 0.2 since the \(\gamma_1\) which we set earlier differs from that of \(\zeta_0\)

params["pi1"] <- 0.2
time_span <- 100
lambda_with_proxy <- alien:::calculate_lambda(mu = ~ time,
                                            pi = ~ sampling_vector,
                                            data = data.frame(time = seq_len(time_span),
                                                              sampling_vector = sampling_vector),
                                            beta = params[c("b0","b1")],
                                            gamma = params[c("pi0","pi1")],
                                            growth_param = params["pi2"],
                                            type = "exponential")

ggplot2::ggplot()+
  ggplot2::aes(x = 1:100, y = lambda_with_proxy)+
  ggplot2::geom_line(linewidth = 1.1) + 
  ggplot2::labs(x = "Time", y = expression("\U03BB"[t]))

This process of creating a proxy and calculating a simulation \(\lambda\) vecrtor based on it can then be place in Step 1 of the loop described above to create simulations of non-monotonic trends in detection.