library(simpr)
set.seed(2001)
This vignette provides an overview of each step in the
simpr
workflow:
specify()
your data-generating process
define()
parameters that you want to systematically
vary across your simulation design (e.g. n, effect
size)
generate()
the simulation data
fit()
models to your data
(e.g. lm()
)
tidy_fits()
for further processing, such as
computing power or Type I Error rates
specify()
your data-generating processspecify()
takes arbitrary R expressions that can be used
for generating data. Each argument should have a name and be
prefixed with ~
, the tilde operator. Order
matters: later arguments can refer to earlier arguments, but not the
other way around.
Good—b
specification refers to a
specification, and comes after a
:
specify(a = ~ runif(6),
b = ~ a + rnorm(6)) %>%
generate(1)
#> full tibble
#> --------------------------
#> # A tibble: 1 × 3
#> .sim_id rep sim
#> <int> <int> <list>
#> 1 1 1 <tibble [6 × 2]>
#>
#> sim[[1]]
#> --------------------------
#> # A tibble: 6 × 2
#> a b
#> <dbl> <dbl>
#> 1 0.532 0.600
#> 2 0.946 0.978
#> 3 0.451 1.86
#> 4 0.249 -1.02
#> 5 0.376 0.409
#> 6 0.824 0.433
Error—b
specification refers to a
specification, but comes before a
, so
generate()
doesn’t know what a
is:
specify(b = ~ a + rnorm(6),
a = ~ runif(6)) %>%
generate(1)
#> Warning in create_sim_results(specs = specs, x = x[c("meta_info", "specify", :
#> Simulation produced errors. See column '.sim_error'.
#> tibble
#> --------------------------
#> # A tibble: 1 × 4
#> .sim_id rep sim .sim_error
#> <int> <int> <list> <chr>
#> 1 1 1 <NULL> "Error in `map()`:\nℹ In index: 1.\nCaused by error in `…
All arguments must imply the same number of rows. Arguments that imply 1 row are recycled.
OK—both a
and b
imply 6 rows:
specify(a = ~ runif(6),
b = ~ rnorm(6)) %>%
generate(1)
#> full tibble
#> --------------------------
#> # A tibble: 1 × 3
#> .sim_id rep sim
#> <int> <int> <list>
#> 1 1 1 <tibble [6 × 2]>
#>
#> sim[[1]]
#> --------------------------
#> # A tibble: 6 × 2
#> a b
#> <dbl> <dbl>
#> 1 0.376 1.41
#> 2 0.824 -1.27
#> 3 0.527 0.0321
#> 4 0.725 -0.392
#> 5 0.513 0.205
#> 6 0.149 -0.186
OK—a
implies 1 row and b
implies 6 rows, so
a
is recycled:
specify(a = ~ runif(1),
b = ~ rnorm(6)) %>%
generate(1)
#> full tibble
#> --------------------------
#> # A tibble: 1 × 3
#> .sim_id rep sim
#> <int> <int> <list>
#> 1 1 1 <tibble [6 × 2]>
#>
#> sim[[1]]
#> --------------------------
#> # A tibble: 6 × 2
#> a b
#> <dbl> <dbl>
#> 1 0.527 0.597
#> 2 0.527 -1.04
#> 3 0.527 -0.0124
#> 4 0.527 -0.0558
#> 5 0.527 0.457
#> 6 0.527 -0.578
Error—a
implies 2 rows and b
implies 6
rows:
specify(a = ~ runif(2),
b = ~ rnorm(6)) %>%
generate(1)
#> Warning in create_sim_results(specs = specs, x = x[c("meta_info", "specify", :
#> Simulation produced errors. See column '.sim_error'.
#> tibble
#> --------------------------
#> # A tibble: 1 × 4
#> .sim_id rep sim .sim_error
#> <int> <int> <list> <chr>
#> 1 1 1 <NULL> "Error in `dplyr::bind_cols()`:\n! Can't recycle `..1` (…
Using x
as an argument to specify()
is not
recommended, because for technical reasons x
is always
placed as the first argument. This means that if x
refers
to prior variables it will return an error:
specify(y = ~ runif(6),
x = ~ y + runif(6)) %>%
generate(1)
#> Formula specification for 'x' detected. Assuming 'x' is the first formula.
#>
#> To hide this message, or to avoid moving this formula first, use a different variable name.
#> Warning in create_sim_results(specs = specs, x = x[c("meta_info", "specify", :
#> Simulation produced errors. See column '.sim_error'.
#> tibble
#> --------------------------
#> # A tibble: 1 × 4
#> .sim_id rep sim .sim_error
#> <int> <int> <list> <chr>
#> 1 1 1 <NULL> "Error in `map()`:\nℹ In index: 1.\nCaused by error in `…
The same specification works fine when x
is renamed:
specify(y = ~ runif(6),
a = ~ y + runif(6)) %>%
generate(1)
#> full tibble
#> --------------------------
#> # A tibble: 1 × 3
#> .sim_id rep sim
#> <int> <int> <list>
#> 1 1 1 <tibble [6 × 2]>
#>
#> sim[[1]]
#> --------------------------
#> # A tibble: 6 × 2
#> y a
#> <dbl> <dbl>
#> 1 0.102 0.683
#> 2 0.478 0.663
#> 3 0.513 0.939
#> 4 0.676 1.28
#> 5 0.348 0.976
#> 6 0.282 0.686
specify()
accepts expressions that generate multiple
columns simultaneously in a matrix
,
data.frame
, or tibble
. By default, the column
names in the output append a number to the variable name.
Here’s an example using MASS::mvrnorm()
, which returns
draws from the multivariate normal distribution as a matrix.
MASS::mvrnorm()
determines the number of columns for the
output data from the length of mu
and the dimensions of the
variance-covariance matrix Sigma
.
specify(a = ~ MASS::mvrnorm(6,
mu = rep(0, 3),
Sigma = diag(rep(1, 3)))) %>%
generate(1)
#> full tibble
#> --------------------------
#> # A tibble: 1 × 3
#> .sim_id rep sim
#> <int> <int> <list>
#> 1 1 1 <tibble [6 × 3]>
#>
#> sim[[1]]
#> --------------------------
#> # A tibble: 6 × 3
#> a_1 a_2 a_3
#> <dbl> <dbl> <dbl>
#> 1 0.477 -0.445 0.0321
#> 2 0.236 -0.280 -0.392
#> 3 -0.656 0.841 0.205
#> 4 1.80 1.51 -0.186
#> 5 -0.458 -0.597 0.327
#> 6 0.130 0.549 0.199
The argument was named a
in specify()
, so
generate()
creates three variables named a_1
,
a_2
, and a_3
.
You can change the separator between the argument name and the number
via .sep
. Here, we change it from the default
"_"
to "."
:
specify(a = ~ MASS::mvrnorm(6,
mu = rep(0, 3),
Sigma = diag(rep(1, 3))),
.sep = ".") %>%
generate(1)
#> full tibble
#> --------------------------
#> # A tibble: 1 × 3
#> .sim_id rep sim
#> <int> <int> <list>
#> 1 1 1 <tibble [6 × 3]>
#>
#> sim[[1]]
#> --------------------------
#> # A tibble: 6 × 3
#> a.1 a.2 a.3
#> <dbl> <dbl> <dbl>
#> 1 0.236 -0.280 -0.392
#> 2 -0.656 0.841 0.205
#> 3 1.80 1.51 -0.186
#> 4 -0.458 -0.597 0.327
#> 5 0.130 0.549 0.199
#> 6 0.155 0.477 -0.445
Alternatively, you can give a two-sided formula to set names. The
argument name is ignored, and the left hand side must use c
or cbind
.
specify(y = c(a, b, c) ~ MASS::mvrnorm(6,
mu = rep(0, 3),
Sigma = diag(rep(1, 3)))) %>%
generate(1)
#> full tibble
#> --------------------------
#> # A tibble: 1 × 3
#> .sim_id rep sim
#> <int> <int> <list>
#> 1 1 1 <tibble [6 × 3]>
#>
#> sim[[1]]
#> --------------------------
#> # A tibble: 6 × 3
#> a b c
#> <dbl> <dbl> <dbl>
#> 1 -0.656 0.841 0.205
#> 2 1.80 1.51 -0.186
#> 3 -0.458 -0.597 0.327
#> 4 0.130 0.549 0.199
#> 5 0.155 0.477 -0.445
#> 6 0.201 0.236 -0.280
If your expression already produces column names, those are used by default. The argument name is again ignored:
specify(a = ~ MASS::mvrnorm(6,
mu = rep(0, 3),
Sigma = diag(rep(1, 3))) %>%
::set_colnames(c("d", "e", "f"))) %>%
magrittrgenerate(1)
#> full tibble
#> --------------------------
#> # A tibble: 1 × 3
#> .sim_id rep sim
#> <int> <int> <list>
#> 1 1 1 <tibble [6 × 3]>
#>
#> sim[[1]]
#> --------------------------
#> # A tibble: 6 × 3
#> d e f
#> <dbl> <dbl> <dbl>
#> 1 1.80 1.51 -0.186
#> 2 -0.458 -0.597 0.327
#> 3 0.130 0.549 0.199
#> 4 0.155 0.477 -0.445
#> 5 0.201 0.236 -0.280
#> 6 0.153 -0.656 0.841
This is useful for dealing with functions from other packages that
already provide informative names (e.g.,
lavaan::simulateData()
). You can turn this behavior off
with .use_names = FALSE
.
Whatever method you use, you can still refer to these generated
variables in subsequent arguments to specify()
:
specify(a = ~ MASS::mvrnorm(6,
mu = rep(0, 3),
Sigma = diag(rep(1, 3))),
b = ~ a_1 - a_2) %>%
generate(1)
#> full tibble
#> --------------------------
#> # A tibble: 1 × 3
#> .sim_id rep sim
#> <int> <int> <list>
#> 1 1 1 <tibble [6 × 4]>
#>
#> sim[[1]]
#> --------------------------
#> # A tibble: 6 × 4
#> a_1 a_2 a_3 b
#> <dbl> <dbl> <dbl> <dbl>
#> 1 -0.458 -0.597 0.327 0.139
#> 2 0.130 0.549 0.199 -0.420
#> 3 0.155 0.477 -0.445 -0.323
#> 4 0.201 0.236 -0.280 -0.0347
#> 5 0.153 -0.656 0.841 0.810
#> 6 1.32 1.80 1.51 -0.480
define()
parameters that you want to systematically
varydefine()
creates metaparameters (also called simulation
factors): values that you want to systematically vary. An obvious choice
is sample size.
Instead of writing a number for the n
argument of
rnorm
, we write a placeholder value samp_size
(this can be any valid R name), and we write a corresponding argument in
define()
that contains the possible values that
samp_size
can take on:
specify(a = ~ rnorm(samp_size)) %>%
define(samp_size = c(10, 20)) %>%
generate(1)
#> full tibble
#> --------------------------
#> # A tibble: 2 × 4
#> .sim_id samp_size rep sim
#> <int> <dbl> <int> <list>
#> 1 1 10 1 <tibble [10 × 1]>
#> 2 2 20 1 <tibble [20 × 1]>
#>
#> sim[[1]]
#> --------------------------
#> # A tibble: 10 × 1
#> a
#> <dbl>
#> 1 0.199
#> 2 -0.445
#> 3 -0.280
#> 4 0.841
#> 5 1.51
#> 6 -0.597
#> 7 0.549
#> 8 0.477
#> 9 0.236
#> 10 -0.656
Each argument to define()
is a vector or list with the
desired values to be used in the expressions in specify()
.
specify()
expressions can refer to the names of define
arguments, and generate()
will substitute the possible
values that argument when generating data.
When define()
has multiple arguments, each possible
combination is generated:
specify(a = ~ rnorm(samp_size, mu)) %>%
define(samp_size = c(10, 20),
mu = c(0, 10)) %>%
generate(1)
#> full tibble
#> --------------------------
#> # A tibble: 4 × 5
#> .sim_id samp_size mu rep sim
#> <int> <dbl> <dbl> <int> <list>
#> 1 1 10 0 1 <tibble [10 × 1]>
#> 2 2 20 0 1 <tibble [20 × 1]>
#> 3 3 10 10 1 <tibble [10 × 1]>
#> 4 4 20 10 1 <tibble [20 × 1]>
#>
#> sim[[1]]
#> --------------------------
#> # A tibble: 10 × 1
#> a
#> <dbl>
#> 1 -0.445
#> 2 -0.280
#> 3 0.841
#> 4 1.51
#> 5 -0.597
#> 6 0.549
#> 7 0.477
#> 8 0.236
#> 9 -0.656
#> 10 1.80
(If not all combinations are desired, see options for filtering at
the generate()
step, below.)
define()
can also take lists for any type of value used
in specify
. For instance, the argument s
is
defined as a list with two variables representing two different possible
correlation matrices, and we use that placeholder value s
in the specify()
statement:
specify(a = ~ MASS::mvrnorm(6, rep(0, 2), Sigma = s)) %>%
define(s = list(independent = diag(rep(1, 2)),
dependent = matrix(c(1, 0.5, 0.5, 1), nrow = 2))) %>%
generate(1)
#> full tibble
#> --------------------------
#> # A tibble: 2 × 5
#> .sim_id s_index rep s sim
#> <int> <chr> <int> <list> <list>
#> 1 1 independent 1 <dbl [2 × 2]> <tibble [6 × 2]>
#> 2 2 dependent 1 <dbl [2 × 2]> <tibble [6 × 2]>
#>
#> sim[[1]]
#> --------------------------
#> # A tibble: 6 × 2
#> a_1 a_2
#> <dbl> <dbl>
#> 1 -0.236 -0.280
#> 2 0.656 0.841
#> 3 -1.80 1.51
#> 4 0.458 -0.597
#> 5 -0.130 0.549
#> 6 -0.155 0.477
In the output, simpr
creates a column
s_index
using the names of the list elements of
s
to make the results easier to view and filter.
define()
also supports lists of functions. Here, the
specify
command has a placeholder function
distribution
, where distribution
is defined to
be either rnorm()
or rlnorm()
(the lognormal
distribution):
specify(y = ~ distribution(6)) %>%
define(distribution = list(normal = rnorm,
lognormal = rlnorm)) %>%
generate(1)
#> full tibble
#> --------------------------
#> # A tibble: 2 × 5
#> .sim_id distribution_index rep distribution sim
#> <int> <chr> <int> <list> <list>
#> 1 1 normal 1 <fn> <tibble [6 × 1]>
#> 2 2 lognormal 1 <fn> <tibble [6 × 1]>
#>
#> sim[[1]]
#> --------------------------
#> # A tibble: 6 × 1
#> y
#> <dbl>
#> 1 0.841
#> 2 1.51
#> 3 -0.597
#> 4 0.549
#> 5 0.477
#> 6 0.236
generate()
the simulation dataThe main argument to generate()
is .reps
,
the number of repetitions for each combination of metaparameters in
define()
.
specify(a = ~ rnorm(n, mu)) %>%
define(n = c(6, 12),
mu = c(0, 10)) %>%
generate(2)
#> full tibble
#> --------------------------
#> # A tibble: 8 × 5
#> .sim_id n mu rep sim
#> <int> <dbl> <dbl> <int> <list>
#> 1 1 6 0 1 <tibble [6 × 1]>
#> 2 2 12 0 1 <tibble [12 × 1]>
#> 3 3 6 10 1 <tibble [6 × 1]>
#> 4 4 12 10 1 <tibble [12 × 1]>
#> 5 5 6 0 2 <tibble [6 × 1]>
#> 6 6 12 0 2 <tibble [12 × 1]>
#> 7 7 6 10 2 <tibble [6 × 1]>
#> 8 8 12 10 2 <tibble [12 × 1]>
#>
#> sim[[1]]
#> --------------------------
#> # A tibble: 6 × 1
#> a
#> <dbl>
#> 1 1.51
#> 2 -0.597
#> 3 0.549
#> 4 0.477
#> 5 0.236
#> 6 -0.656
Since there are 4 possible combinations of n
and
mu
, there are a total of 4 * 2 = 8 simulations generated, 2
for each possible combination.
If some combination of variables is not desired, add filtering
criteria to generate()
using the same syntax as
dplyr::filter()
. Here we arbitrarily filter to all
combinations of n
and mu
where n
is greater than mu
, but any valid filtering criteria can be
applied.
specify(a = ~ rnorm(n, mu)) %>%
define(n = c(6, 12),
mu = c(0, 10)) %>%
generate(2, n > mu)
#> full tibble
#> --------------------------
#> # A tibble: 6 × 5
#> .sim_id n mu rep sim
#> <int> <dbl> <dbl> <int> <list>
#> 1 1 6 0 1 <tibble [6 × 1]>
#> 2 2 12 0 1 <tibble [12 × 1]>
#> 3 4 12 10 1 <tibble [12 × 1]>
#> 4 5 6 0 2 <tibble [6 × 1]>
#> 5 6 12 0 2 <tibble [12 × 1]>
#> 6 8 12 10 2 <tibble [12 × 1]>
#>
#> sim[[1]]
#> --------------------------
#> # A tibble: 6 × 1
#> a
#> <dbl>
#> 1 -0.597
#> 2 0.549
#> 3 0.477
#> 4 0.236
#> 5 -0.656
#> 6 1.80
To preserve reproducibility, a given simulation in the filtered
version of generate
is still the same as if all possible
combinations were generated. This can be tracked using the
.sim_id
that generate()
includes in the output
data, which uniquely identifies the simulation run given the same inputs
to specify
, define
, and .reps
.
Above, note that .sim_id
skips 3 and 7 See
vignette("Reproducing simulations")
for more information on
using generate()
to filter.
generate()
by default continues with the next iteration
if an error is produced, and returns a column .sim_error
with the text of the error. Alternative error handling mechanisms are
available, see vignette("Managing simulation errors")
.
fit()
models to your datafit()
uses similar syntax to generate()
:
you can write arbitrary R expressions to fit models. Again, each
argument should have a name and be prefixed with ~
, the
tilde operator.
Below, we fit both a t-test and a linear model.
specify(a = ~ rnorm(6),
b = ~ a + rnorm(6)) %>%
generate(1) %>%
fit(t_test = ~ t.test(a, b),
lm = ~ lm(b ~ a))
#> full tibble
#> --------------------------
#> # A tibble: 1 × 5
#> .sim_id rep sim t_test lm
#> <int> <int> <list> <list> <list>
#> 1 1 1 <tibble [6 × 2]> <htest> <lm>
#>
#> sim[[1]]
#> --------------------------
#> # A tibble: 6 × 2
#> a b
#> <dbl> <dbl>
#> 1 0.549 0.679
#> 2 0.477 0.632
#> 3 0.236 0.437
#> 4 -0.656 -0.503
#> 5 1.80 3.13
#> 6 -0.458 0.412
#>
#> t_test[[1]]
#> --------------------------
#>
#> Welch Two Sample t-test
#>
#> data: a and b
#> t = -0.76979, df = 9.0768, p-value = 0.461
#> alternative hypothesis: true difference in means is not equal to 0
#> 95 percent confidence interval:
#> -1.8583360 0.9137887
#> sample estimates:
#> mean of x mean of y
#> 0.3255341 0.7978077
#>
#>
#> lm[[1]]
#> --------------------------
#>
#> Call:
#> lm(formula = b ~ a)
#>
#> Coefficients:
#> (Intercept) a
#> 0.3739 1.3021
fit()
adds columns to the overall tibble to contain each
type of fit. Printing the object displays a preview of the first fit
object in each column.
Although the function is named fit
, any arbitrary R
expression can be used. Below, one fit column will include the mean of
a
, while the second will include vectors that are five
larger than a
. The result is always a list-column, so any
type of return object is allowed:
specify(a = ~ rnorm(6)) %>%
generate(1) %>%
fit(mean = ~ mean(a),
why_not = ~ a + 5)
#> full tibble
#> --------------------------
#> # A tibble: 1 × 5
#> .sim_id rep sim mean why_not
#> <int> <int> <list> <list> <list>
#> 1 1 1 <tibble [6 × 1]> <dbl [1]> <dbl [6]>
#>
#> sim[[1]]
#> --------------------------
#> # A tibble: 6 × 1
#> a
#> <dbl>
#> 1 0.477
#> 2 0.236
#> 3 -0.656
#> 4 1.80
#> 5 -0.458
#> 6 0.130
#>
#> mean[[1]]
#> --------------------------
#> [1] 0.2555629
#>
#> why_not[[1]]
#> --------------------------
#> [1] 5.477269 5.236078 4.343804 6.804318 4.542280 5.129628
fit()
is computed for each individual simulated dataset,
so usually you do not need to refer to the dataset itself. If a
reference to the dataset is needed, use .
.
The below code is equivalent to the previous example, but explicitly
referencing the dataset using .
and .$
:
specify(a = ~ rnorm(6),
b = ~ a + rnorm(6)) %>%
generate(1) %>%
## .$ and data = . not actually required here
fit(t_test = ~ t.test(.$a, .$b),
lm = ~ lm(b ~ a, data = .))
#> full tibble
#> --------------------------
#> # A tibble: 1 × 5
#> .sim_id rep sim t_test lm
#> <int> <int> <list> <list> <list>
#> 1 1 1 <tibble [6 × 2]> <htest> <lm>
#>
#> sim[[1]]
#> --------------------------
#> # A tibble: 6 × 2
#> a b
#> <dbl> <dbl>
#> 1 0.236 0.437
#> 2 -0.656 -0.503
#> 3 1.80 3.13
#> 4 -0.458 0.412
#> 5 0.130 1.77
#> 6 0.155 0.740
#>
#> t_test[[1]]
#> --------------------------
#>
#> Welch Two Sample t-test
#>
#> data: .$a and .$b
#> t = -1.2662, df = 8.8051, p-value = 0.2379
#> alternative hypothesis: true difference in means is not equal to 0
#> 95 percent confidence interval:
#> -2.2239947 0.6312032
#> sample estimates:
#> mean of x mean of y
#> 0.2018126 0.9982084
#>
#>
#> lm[[1]]
#> --------------------------
#>
#> Call:
#> lm(formula = b ~ a, data = .)
#>
#> Coefficients:
#> (Intercept) a
#> 0.7276 1.3411
Sometimes data manipulation is required between
generate()
and fit()
. After
generate()
, run per_sim()
and then chain any
dplyr
or tidyr
verbs that work on
data.frame
s or tibble
s. These verbs will be
applied to every individual simulated dataset.
A common use-case is needing to reshape wide to long. Consider an intervention study with a control group, an intervention group that does slightly better than the control, and a second intervention group that does much better. This is easiest to specify in wide format, with separate variables for each group and differing means by group:
= specify(control = ~ rnorm(6, mean = 0),
wide_gen intervention_1 = ~ rnorm(6, mean = 0.2),
intervention_2 = ~ rnorm(6, mean = 2)) %>%
generate(2)
wide_gen#> full tibble
#> --------------------------
#> # A tibble: 2 × 3
#> .sim_id rep sim
#> <int> <int> <list>
#> 1 1 1 <tibble [6 × 3]>
#> 2 2 2 <tibble [6 × 3]>
#>
#> sim[[1]]
#> --------------------------
#> # A tibble: 6 × 3
#> control intervention_1 intervention_2
#> <dbl> <dbl> <dbl>
#> 1 -0.656 0.353 2.94
#> 2 1.80 1.52 2.03
#> 3 -0.458 1.07 2.30
#> 4 0.130 1.84 1.50
#> 5 0.155 0.785 1.70
#> 6 0.201 -0.539 1.65
But to run an ANOVA, we need the outcome in one column and the group
name in another column. We first run per_sim()
to indicate
we want to compute on individual simulated datasets, and then we can use
tidyr::pivot_longer()
to reshape each dataset into a format
ready for analysis:
= wide_gen %>%
long_gen per_sim() %>%
pivot_longer(cols = everything(),
names_to = "group",
values_to = "response")
long_gen#> full tibble
#> --------------------------
#> # A tibble: 2 × 3
#> .sim_id rep sim
#> <int> <int> <list>
#> 1 1 1 <tibble [18 × 2]>
#> 2 2 2 <tibble [18 × 2]>
#>
#> sim[[1]]
#> --------------------------
#> # A tibble: 18 × 2
#> group response
#> <chr> <dbl>
#> 1 control -0.656
#> 2 intervention_1 0.353
#> 3 intervention_2 2.94
#> 4 control 1.80
#> 5 intervention_1 1.52
#> 6 intervention_2 2.03
#> 7 control -0.458
#> 8 intervention_1 1.07
#> 9 intervention_2 2.30
#> 10 control 0.130
#> 11 intervention_1 1.84
#> 12 intervention_2 1.50
#> 13 control 0.155
#> 14 intervention_1 0.785
#> 15 intervention_2 1.70
#> 16 control 0.201
#> 17 intervention_1 -0.539
#> 18 intervention_2 1.65
Each simulation is now reshaped and ready for fitting:
= long_gen %>%
long_fit fit(aov = ~ aov(response ~ group),
lm = ~ lm(response ~ group))
long_fit#> full tibble
#> --------------------------
#> # A tibble: 2 × 5
#> .sim_id rep sim aov lm
#> <int> <int> <list> <list> <list>
#> 1 1 1 <tibble [18 × 2]> <aov> <lm>
#> 2 2 2 <tibble [18 × 2]> <aov> <lm>
#>
#> sim[[1]]
#> --------------------------
#> # A tibble: 18 × 2
#> group response
#> <chr> <dbl>
#> 1 control -0.656
#> 2 intervention_1 0.353
#> 3 intervention_2 2.94
#> 4 control 1.80
#> 5 intervention_1 1.52
#> 6 intervention_2 2.03
#> 7 control -0.458
#> 8 intervention_1 1.07
#> 9 intervention_2 2.30
#> 10 control 0.130
#> 11 intervention_1 1.84
#> 12 intervention_2 1.50
#> 13 control 0.155
#> 14 intervention_1 0.785
#> 15 intervention_2 1.70
#> 16 control 0.201
#> 17 intervention_1 -0.539
#> 18 intervention_2 1.65
#>
#> aov[[1]]
#> --------------------------
#> Call:
#> aov(formula = response ~ group)
#>
#> Terms:
#> group Residuals
#> Sum of Squares 10.270758 8.853192
#> Deg. of Freedom 2 15
#>
#> Residual standard error: 0.7682531
#> Estimated effects may be unbalanced
#>
#> lm[[1]]
#> --------------------------
#>
#> Call:
#> lm(formula = response ~ group)
#>
#> Coefficients:
#> (Intercept) groupintervention_1 groupintervention_2
#> 0.1960 0.6437 1.8242
tidy_fits()
for further processingThe output of fit()
is not yet amenable to plotting or
analysis. tidy_fits()
applies broom::tidy()
to
each fit object and binds them together in a single
tibble
:
specify(a = ~ rnorm(n),
b = ~ a + rnorm(n)) %>%
define(n = c(6, 12)) %>%
generate(2) %>%
fit(lm = ~ lm(b ~ a)) %>%
tidy_fits()
#> # A tibble: 8 × 9
#> .sim_id n rep Source term estimate std.error statistic p.value
#> <int> <dbl> <int> <chr> <chr> <dbl> <dbl> <dbl> <dbl>
#> 1 1 6 1 lm (Intercept) 0.684 0.406 1.69 0.167
#> 2 1 6 1 lm a 1.26 0.526 2.40 0.0745
#> 3 2 12 1 lm (Intercept) -0.117 0.288 -0.407 0.693
#> 4 2 12 1 lm a 0.739 0.255 2.90 0.0160
#> 5 3 6 2 lm (Intercept) -0.713 0.715 -0.997 0.375
#> 6 3 6 2 lm a 0.503 0.687 0.731 0.505
#> 7 4 12 2 lm (Intercept) 0.335 0.238 1.41 0.190
#> 8 4 12 2 lm a 1.30 0.317 4.11 0.00210
All the same metaparameter information appears in the left-hand
column, and all the model information from broom::tidy()
is
provided. This is a convenient format for filtering, plotting, and
calculating diagnostics.
If more than one kind of fit is present, tidy_fits()
simply brings them together in the same tibble
using
bind_rows
; this means there may be many NA
values where one type of model has no values. In the example below,
t.test
returns the column estimate1
, but
lm
does not, so for the lm
rows there are
NA
values.
specify(a = ~ rnorm(n),
b = ~ a + rnorm(n)) %>%
define(n = c(6, 12)) %>%
generate(2) %>%
fit(lm = ~ lm(b ~ a),
t_test = ~ t.test(a, b)) %>%
tidy_fits()
#> # A tibble: 12 × 16
#> .sim_id n rep Source term estim…¹ std.e…² stati…³ p.value estim…⁴
#> <int> <dbl> <int> <chr> <chr> <dbl> <dbl> <dbl> <dbl> <dbl>
#> 1 1 6 1 lm (Intercep… 0.688 0.385 1.79 0.149 NA
#> 2 1 6 1 lm a 0.471 0.656 0.718 0.513 NA
#> 3 1 6 1 t_test <NA> -0.555 NA -1.36 0.206 0.251
#> 4 2 12 1 lm (Intercep… -0.161 0.279 -0.578 0.576 NA
#> 5 2 12 1 lm a 0.803 0.248 3.23 0.00897 NA
#> 6 2 12 1 t_test <NA> 0.173 NA 0.340 0.737 0.0594
#> 7 3 6 2 lm (Intercep… -0.939 0.610 -1.54 0.199 NA
#> 8 3 6 2 lm a 0.361 0.579 0.623 0.567 NA
#> 9 3 6 2 t_test <NA> 0.633 NA 0.958 0.362 -0.479
#> 10 4 12 2 lm (Intercep… 0.258 0.251 1.03 0.329 NA
#> 11 4 12 2 lm a 1.26 0.329 3.82 0.00338 NA
#> 12 4 12 2 t_test <NA> -0.300 NA -0.697 0.495 0.165
#> # … with 6 more variables: estimate2 <dbl>, parameter <dbl>, conf.low <dbl>,
#> # conf.high <dbl>, method <chr>, alternative <chr>, and abbreviated variable
#> # names ¹estimate, ²std.error, ³statistic, ⁴estimate1
Any option taken by the tidier can be passed through
tidy_fits()
. Below, we specify the conf.level
and conf.int
options for broom::tidy.lm()
:
specify(a = ~ rnorm(n),
b = ~ a + rnorm(n)) %>%
define(n = c(6, 12)) %>%
generate(2) %>%
fit(lm = ~ lm(b ~ a)) %>%
tidy_fits(conf.level = 0.99, conf.int = TRUE)
#> # A tibble: 8 × 11
#> .sim_id n rep Source term estim…¹ std.e…² stati…³ p.value conf.…⁴
#> <int> <dbl> <int> <chr> <chr> <dbl> <dbl> <dbl> <dbl> <dbl>
#> 1 1 6 1 lm (Intercept) 0.720 0.498 1.45 0.222 -1.57
#> 2 1 6 1 lm a 0.450 0.754 0.596 0.583 -3.02
#> 3 2 12 1 lm (Intercept) -0.250 0.234 -1.07 0.310 -0.993
#> 4 2 12 1 lm a 0.664 0.217 3.06 0.0120 -0.0230
#> 5 3 6 2 lm (Intercept) -0.690 0.701 -0.985 0.380 -3.92
#> 6 3 6 2 lm a 0.274 0.559 0.490 0.650 -2.30
#> 7 4 12 2 lm (Intercept) 0.283 0.236 1.20 0.258 -0.465
#> 8 4 12 2 lm a 1.38 0.329 4.21 0.00181 0.341
#> # … with 1 more variable: conf.high <dbl>, and abbreviated variable names
#> # ¹estimate, ²std.error, ³statistic, ⁴conf.low
glance_fits()
analogously provides the one-row summary
provided by broom::glance()
for each simulation:
specify(a = ~ rnorm(n),
b = ~ a + rnorm(n)) %>%
define(n = c(6, 12)) %>%
generate(2) %>%
fit(lm = ~ lm(b ~ a)) %>%
glance_fits()
#> # A tibble: 4 × 16
#> .sim_id n rep Source r.squa…¹ adj.r…² sigma stati…³ p.value df logLik
#> <int> <dbl> <int> <chr> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
#> 1 1 6 1 lm 0.307 0.134 0.650 1.77 0.254 1 -4.71
#> 2 2 12 1 lm 0.559 0.515 0.769 12.7 0.00516 1 -12.8
#> 3 3 6 2 lm 0.146 -0.0674 1.56 0.684 0.455 1 -9.96
#> 4 4 12 2 lm 0.574 0.531 1.01 13.5 0.00433 1 -16.1
#> # … with 5 more variables: AIC <dbl>, BIC <dbl>, deviance <dbl>,
#> # df.residual <int>, nobs <int>, and abbreviated variable names ¹r.squared,
#> # ²adj.r.squared, ³statistic
apply_fits
can take any arbitrary expression (preceded
by ~
) or function and apply it to each fit object. The
special value .
indicates the current fit.
Below, the maximum Cook’s Distance for each simulated model fit is
computed using cooks.distance()
.
specify(a = ~ rnorm(n),
b = ~ a + rnorm(n)) %>%
define(n = c(6, 12)) %>%
generate(2) %>%
fit(lm = ~ lm(b ~ a)) %>%
apply_fits(~ max(cooks.distance(.)))
#> # A tibble: 4 × 5
#> .sim_id n rep Source value
#> <int> <dbl> <int> <chr> <dbl>
#> 1 1 6 1 lm 0.826
#> 2 2 12 1 lm 0.648
#> 3 3 6 2 lm 0.705
#> 4 4 12 2 lm 0.486