acdcquery: Query the Attentional Control Data Collection

Interact with the Attentional Control Data Collection (ACDC). Connect to the database using connect_to_db(), build query statements using add_argument() and query the database using query_db().

Use

A version of the database is included in the package. You can create a connection to it using conn <- connect_to_db(base::system.file("extdata", "acdc.db", package = "acdcquery")). In order to get the latest version of the database, you can download it from its parent repo.

To query the database, specify the connection to the database (obtained via conn <- connect_to_db("path/to/db.db")), a list of filter arguments (obtained by using add_argument()), and a vector containing the names of the variables you want returned.

Working example

# devtools::install_github("SLesche/acdc-query")
library(acdcquery)
db_file <- base::system.file("extdata", "acdc.db", package = "acdcquery")
conn <- connect_to_db(db_file)

# You may also download the latest version of the database and connect to that
# conn <- connect_to_db("path/to/db.db")

arguments <- list() 
arguments <- add_argument(
    list = arguments,
    conn = conn,
    variable = "n_participants",
    operator = "greater",
    values = 200
  )
arguments <- add_argument(
    list = arguments,
    conn = conn,
    variable = "task_name",
    operator = "equal",
    values = c("stroop", "flanker")
  )

requested_vars <- c("rt", "accuracy", "n_participants")

df <- query_db(conn, arguments, requested_vars)

Installation

You can install this package via devtools::install_github("SLesche/acdc-query"). You may need to update imported packages.

Dependencies

This package is designed for use in R 4.2.2. It imports the packages DBI and RSQLite.

Documentation

General

Querying is accomplished by lower level functions using user input to construct an SQL-query which is then applied to the database. To query the full database and select a subset of variables to be returned, functions locate the tables in which the requested variables are present and construct an SQL query selecting those rows of the the target table that match the filter arguments in the respective tables. If multiple filter arguments are present, the variable argument_relation controls how these are combined.

This database consists of multiple tables, each with variables potentially used when filtering. You are be able to filter based on any argument in any table (and even multiple arguments each located in different tables) and return the requested variables. These returned variables may themselves be located in different tables.

Filter Arguments

query_db() requires 3 main inputs. The connection to the database conn, a vector of the requested variables to be returned to the user target_vars and a list of the filter arguments arguments. The function add_argument() can be used for creating this list of filter arguments. Its objective is to provide a user-friendly way of specifying the variable and conditions used when filtering the database. To use add_argument(), users have to provide the connection to the database conn, which allows the function to validate that the variable is present in the database and locate the table in which the variable is present. Furthermore, users need to specify the variable, the operator and the value (or values) which make up the filter argument. Please find an example of creating filter arguments to query the database to only return entries where the study contains more than 200 participants and the study used either stroop or flanker tasks below. add_argument() uses this input to construct an SQL query that corresponds to the filter condition and selects the primary key of the table the filter variable is located in and returns this filter as the next entry in a list. In the above example, arguments is a list of 2 elements. The first being the character string “SELECT dataset_id FROM dataset_table WHERE n_participants > 200” and the second the string “SELECT task_id FROM task_table WHERE task_name = ‘stroop’ OR task_name = ‘flanker’”. Further information can be found in the functions source code.

conn <- acdcquery::connect_to_db("acdc.db")

arguments <- list() 
arguments <- acdcquery::add_argument(
    list = arguments,
    conn = conn,
    variable = "n_participants",
    operator = "greater",
    values = 200
  )
arguments <- acdcquery::add_argument(
    list = arguments,
    conn = conn,
    variable = "task_name",
    operator = "equal",
    values = c("stroop", "flanker")
  )

Argument Relations

When using multiple different arguments to filter, you may specify the operators used to combine these arguments. This can be done via the argument_relation argument in the query_db() function. This argument takes either the strings “and” or “or” or a numerical vector with length equal to the number of arguments specified. “and” or “or” result in all arguments being connected via the respective operator, with “and” being the default option. Passing a numerical vector allows more complicated logical combinations of filter arguments. Each entry of the vector should correspond to a filter argument. Arguments corresponding to the same number in the vector are combined via an “OR” operator. Arguments corresponding to different numbers are combined via an “AND” operator.

If you specify 4 arguments A, B, C, D and want to filter the database such that it meets the conditions A & B & (C | D), the corresponding argument_relation should be argument_relation = c(1, 2, 3, 3). Note that the number of arguments, i.e. entries in the arguments list, should always be equal to the length of the numerical argument_relation vector.

Forward and Backward Connections

This database contains multiple tables each connected via primary and foreign keys. A forward connection in our terminology is any table whose primary key is listed as a foreign key in the current table. The observation table, for example, has forward connections to the dataset, between, within and condition tables. A backward connection represents the opposite path. The dataset table has a backward connection to the observation table. These two means of connections between tables are relevant when filtering and joining tables.

Target Table and Target Variables

You may select any variable present in any table the database and specify them inside a vector target_vars = c("variable1", "variable2", "variable3"). If you want to select all variables of your target_table, add “default” as an element to the vector (target_vars = c("default", "variable1", "variable2"). You will be returned all variables present in the target_table as well as “variable1” and “variable2”.

The argument target_table controls which table is used as as the starting point for querying. Your choice of the target_table has several implications. Most importanlty, it influences the number of (possibly redundant) rows returned to you. If you select the “observation_table” as the target table, which contains trial-level data, but are only interested in variables on dataset-level, i.e. only have target_vars = c("dataset_id", "n_participants"), the function will return all rows of the observation_table that satisfy the filtering requirement. “dataset_id” and “n_participants” will be join onto the observation_table and then selected to be returned to you, but you will receive as many rows of the same “dataset_id” and “n_participants” as the observation_table with its trial-level data retained after filtering. The majority of these rows will be duplicates.

Poor choice in the target_table may lead to unwanted duplications due to joining, but it will never lead to unwanted omission of some variables. To optimally pick a target_table, Inspect the variables you are requesting and figure out which tables they are located in. Then select the table with the most largest number of entries as your target.

Join Paths

In order for the user to be able to select and filter based on variables not present in the target_table, information from other tables must be added to the rows retained after applying the filters. The function add_join_paths_to_query() handles this issue. The connection to the database, a vector containing the filter statements, the argument sequence controlling how those filter statements are combined, the requested variables and a list of optimal join paths, similar to the list of optimal filter paths is required as inputs to the function.

The list for optimal join paths is determined by the function precompute_table_join_paths() which discovers an optimal path from the table listed first in the SQL-argument to all other tables that contain at least one of the requested variables in the database. The function prefers paths which contain the larger number of forward joins (implemented via LEFT JOIN).

The function add_join_paths_to_query() joins the tables containing variables either used for filtering or selected by the user to the requested target table. Because some variables (id-variables only) are present in multiple tables, each joined table is renamed based on its location in the join path list. The table joined first to the initial table is named “dtjoin1”, the second “dtjoin2” and so on. The variables on which the tables are joined also receive the names of the tables they originate from as a prefix. The function discover_id_introduction_steps() returns a dataframe with information on which table and which position in the join path list an id-variable is introduced. This is then used to append the id-variable with the proper table name. Additionally, only variables that are not yet present in the joined data up to this point will be selected from the table to be joined.

See an example of join paths added to the combined SQL query generated above, with n_participants additionally being requested:

add_join_paths_to_query(
  conn = conn,
  filter_statements = c("SELECT dataset_id FROM dataset_table WHERE n_participants > 200", "SELECT task_id FROM task_table WHERE task_name = 'stroop' OR task_name = 'flanker'"),
  join_path_list = precompute_table_join_paths(conn),
  argument_sequence = get_argument_sequence(arguments, argument_relation = "and"), 
  requested_vars = c("rt", "accuracy", "n_participants")
)

“SELECT rt, accuracy, n_participants FROM between_table AS tab LEFT JOIN (SELECT study_id, publication_id, n_groups, n_tasks, study_comment FROM study_table) AS dtjoin1 ON tab.study_id = dtjoin1.study_id LEFT JOIN (SELECT publication_id, authors, conducted, added, country, contact, apa_reference, keywords, publication_code FROM publication_table) AS dtjoin2 ON dtjoin1.publication_id = dtjoin2.publication_id RIGHT JOIN (SELECT dataset_id, study_id, task_id, data_excl, n_participants, n_blocks, n_trials, neutral_trials, fixation_cross, time_limit, mean_dataset_rt, mean_dataset_acc, github, dataset_comment FROM dataset_table) AS dtjoin3 ON tab.study_id = dtjoin3.study_id LEFT JOIN (SELECT task_id, task_name, task_description FROM task_table) AS dtjoin4 ON dtjoin3.task_id = dtjoin4.task_id RIGHT JOIN (SELECT condition_id, dataset_id, between_id, within_id, percentage_congruent, percentage_neutral, n_obs, mean_obs_per_participant, mean_condition_rt, mean_condition_acc FROM condition_table) AS dtjoin5 ON tab.between_id = dtjoin5.between_id AND dtjoin3.dataset_id = dtjoin5.dataset_id LEFT JOIN (SELECT within_id, dataset_id, within_description FROM within_table) AS dtjoin6 ON dtjoin3.dataset_id = dtjoin6.dataset_id AND dtjoin5.within_id = dtjoin6.within_id RIGHT JOIN (SELECT observation_id, dataset_id, subject, block, trial, condition_id, congruency, accuracy, rt FROM observation_table) AS dtjoin7 ON dtjoin3.dataset_id = dtjoin7.dataset_id AND dtjoin5.condition_id = dtjoin7.condition_id WHERE (SELECT dtjoin3.dataset_id FROM dataset_table n_participants > 200) AND (SELECT dtjoin3.task_id FROM task_table task_name = ‘stroop’ OR task_name = ‘flanker’)”

Final steps

This SQL argument is then passed to DBI::dbGetQuery() to return a dataframe containing the requested information which is then returned to the user.