The cohortBuilder package is adapted to work with various data sources and custom backends. Currently there exists one official extension, the cohortBuilder.db package, that allows you to use cohortBuilder with database connections.

The goal of this document is to explain how to create custom extensions to cohortBuilder.
In general, to create the custom layer you need to define the methods described below, along with the filters for your source (see vignette("custom-filters")).

It's recommended to include all of the methods in your custom R package.
Before you start creating a new layer, you need to choose what data (connection) your layer should operate on. For example, cohortBuilder uses a tblist class object to operate on a list of data frames, and cohortBuilder.db uses a db class object to operate on database connections.
To start with, create a function that takes the parameters required to define the data connection, such as tblist, or dbtables in the case of cohortBuilder.db. The function should return an object of the selected class, which is then used to define the required extension methods.
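As a minimal sketch of such a constructor, assume a hypothetical backend that works on a directory of CSV files. The name csvdir and the fields path, files and tables are illustrative assumptions, not part of cohortBuilder. The point is only that the function checks its input and returns an object carrying a dedicated S3 class.

```r
# Hypothetical constructor for a custom data connection: a directory of CSV files.
# The class name "csvdir" and its fields are assumptions made for illustration.
csvdir <- function(path) {
  stopifnot(is.character(path), length(path) == 1, dir.exists(path))
  files <- list.files(path, pattern = "\\.csv$", full.names = TRUE)
  tables <- tools::file_path_sans_ext(basename(files))
  structure(
    list(path = path, files = stats::setNames(files, tables), tables = tables),
    class = "csvdir"
  )
}

# Usage: write a small table to a temporary directory and wrap the directory.
demo_dir <- tempfile("csvdir-demo")
dir.create(demo_dir)
write.csv(data.frame(id = 1:3), file.path(demo_dir, "books.csv"), row.names = FALSE)
conn <- csvdir(demo_dir)
class(conn)
conn$tables
```

The returned class is what the extension methods described below would dispatch on.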
Below we describe all the required and optional methods you need to define within the created package.
set_source - method used for defining a new source

Required parameters:
dtconn

Details:
Call Source$new inside the method.
Declare any extra parameters you need (and pass them to Source$new). The arguments are then available in the source$attributes object.
The method can declare the primary_keys and binding_keys parameters (see vignette("binding-keys")).
The method can declare the source_code parameter, which allows users to define the code for creating the source (visible in reproducible code), and description, which stores a list of useful source object descriptions.

Example:
cohortBuilder - tblist object (same for cohortBuilder.db - db object):

    set_source.tblist <- function(dtconn, primary_keys = NULL, binding_keys = NULL,
                                  source_code = NULL, description = NULL, ...) {
      Source$new(
        dtconn, primary_keys = primary_keys, binding_keys = binding_keys,
        source_code = source_code, description = description,
        ...
      )
    }
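For a custom class the method looks much the same. The sketch below assumes a hypothetical csvdir connection class and a made-up extra parameter, encoding. It follows the pattern of the tblist example and, as described in the details above, expects extra arguments passed to Source$new to end up in source$attributes. Only the function definition is evaluated here, so the sketch runs without cohortBuilder attached; calling the method requires the package.

```r
# Hypothetical set_source method for a "csvdir" connection class (an assumption
# of this sketch). Source is taken from cohortBuilder, as in the example above.
set_source.csvdir <- function(dtconn, primary_keys = NULL, binding_keys = NULL,
                              source_code = NULL, description = NULL,
                              encoding = "UTF-8", ...) {
  cohortBuilder::Source$new(
    dtconn,
    primary_keys = primary_keys, binding_keys = binding_keys,
    source_code = source_code, description = description,
    encoding = encoding, # extra argument, expected under source$attributes
    ...
  )
}

# The extra parameter is part of the method signature.
names(formals(set_source.csvdir))
```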
.init_step - structure data passed between filtering steps

Required parameters:
source - Source object

Details:
The returned object is what gets passed between filtering steps (as the data_object argument).

Examples:
cohortBuilder - ‘tblist’ class. Operating on a list of tables in each step.

    .init_step.tblist <- function(source, ...) {
      source$dtconn
    }
cohortBuilder.db - ‘db’ class. cohortBuilder.db operates on a db class object, which is a list of connection, tables and schema fields.

    .init_step.db <- function(source) {
      purrr::map(
        stats::setNames(source$dtconn$tables, source$dtconn$tables),
        function(table) {
          tbl_conn <- dplyr::tbl(
            source$dtconn$connection,
            dbplyr::in_schema(source$dtconn$schema, table)
          )
          attr(tbl_conn, "tbl_name") <- table
          tbl_conn
        }
      )
    }
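The same idea can be sketched for a hypothetical file-based backend. Everything named csvdir here is an assumption made for illustration, and the .init_step generic is redefined locally only so that the sketch runs without cohortBuilder attached (in a real extension the generic comes from cohortBuilder, and dispatching on source$dtconn is an assumption of this sketch).

```r
# Stand-in generic so the sketch is self-contained; not the cohortBuilder one.
.init_step <- function(source, ...) UseMethod(".init_step", source$dtconn)

# Hypothetical method for a "csvdir" connection: read every CSV file into memory
# and return a named list of data frames.
.init_step.csvdir <- function(source, ...) {
  files <- source$dtconn$files
  lapply(files, function(file) utils::read.csv(file, stringsAsFactors = FALSE))
}

# Usage with a plain list standing in for the Source object.
demo_dir <- tempfile("csvdir-demo")
dir.create(demo_dir)
write.csv(data.frame(id = 1:3), file.path(demo_dir, "books.csv"), row.names = FALSE)
dtconn <- structure(
  list(files = c(books = file.path(demo_dir, "books.csv"))),
  class = "csvdir"
)
data_object <- .init_step(list(dtconn = dtconn))
str(data_object)
```

Whatever structure is returned here is the structure every later method receives as data_object, so it is worth keeping it simple (for example a named list with one element per dataset).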
.pre_filtering (optional) - modify data object before filtering

Required parameters:
source
data_object - an object following the structure of the .init_step output
step_id - id of the filtering step

Examples:
cohortBuilder - tblist class. Cleaning up the filtered attribute for new step data.

    .pre_filtering.tblist <- function(source, data_object, step_id) {
      for (dataset in names(data_object)) {
        attr(data_object[[dataset]], "filtered") <- FALSE
      }
      return(data_object)
    }
cohortBuilder.db - removing any existing temp tables for the current step in the database and cleaning up filtered attributes.

    .pre_filtering.db <- function(source, data_object, step_id) {
      purrr::map(
        stats::setNames(source$dtconn$tables, source$dtconn$tables),
        function(table) {
          table_name <- tmp_table_name(table, step_id)
          DBI::dbRemoveTable(source$dtconn$connection, table_name, temporary = TRUE, fail_if_missing = FALSE)
          attr(data_object[[table]], "filtered") <- FALSE
          return(data_object[[table]])
        }
      )
    }
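.post_filtering (optional) - data object modification after filtering (before running binding)

Required parameters:
source
data_object - an object following the structure of the .init_step output
step_id - id of the filtering step

.post_binding (optional) - data object modification after running binding

Required parameters:
source
data_object - an object following the structure of the .init_step output
step_id - id of the filtering step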
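.collect_data - define how to collect data object into R

Required parameters:
source
data_object - an object following the structure of the .init_step output

Details:
This is cohortBuilder’s equivalent of the collect method, known for sourcing the object into R memory when working with a remote environment (e.g. database).
When the data already resides in R memory, the method can simply return data_object.

Examples: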
cohortBuilder - operating in R memory, so return data_object.

    .collect_data.tblist <- function(source, data_object) {
      data_object
    }
cohortBuilder.db - collect tables from database and return as a named list.

    .collect_data.db <- function(source, data_object) {
      purrr::map(
        stats::setNames(source$dtconn$tables, source$dtconn$tables),
        ~dplyr::collect(data_object[[.x]])
      )
    }
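To illustrate the difference between a lazy data object and its collected form without a database, the sketch below uses a hypothetical lazylist class whose data object stores functions that produce each table on demand. The class, the loader functions and the locally defined stand-in generic are all assumptions made for this sketch.

```r
# Stand-in generic for the sketch; dispatching on source$dtconn is an assumption.
.collect_data <- function(source, data_object) UseMethod(".collect_data", source$dtconn)

# Hypothetical lazy backend: each element of data_object is a function that
# produces the table when called, so collecting means calling it.
.collect_data.lazylist <- function(source, data_object) {
  lapply(data_object, function(loader) loader())
}

# A plain list stands in for the Source object.
src <- list(dtconn = structure(list(), class = "lazylist"))
data_object <- list(
  books = function() data.frame(id = 1:3),
  borrowers = function() data.frame(id = 1:2)
)
collected <- .collect_data(src, data_object)
vapply(collected, nrow, integer(1))
```

The result is a named list of ordinary data frames, which is the same shape the in-memory tblist example returns directly.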
.get_stats - collect data object stats

Required parameters:
source
data_object

Details:
The returned statistics are used by .get_attrition_count and the shinyCohortBuilder integration.

Examples:
cohortBuilder - return the number of rows of each dataset as a named list.

    .get_stats.tblist <- function(source, data_object) {
      dataset_names <- names(source$dtconn)
      dataset_names %>%
        purrr::map(
          ~ list(n_rows = nrow(data_object[[.x]]))
        ) %>%
        stats::setNames(dataset_names)
    }
cohortBuilder.db - count the rows of each table in the database and return the counts as a named list.

    .get_stats.db <- function(source, data_object) {
      dataset_names <- source$dtconn$tables
      dataset_names %>%
        purrr::map(
          ~ list(
            n_rows = data_object[[.x]] %>%
              dplyr::summarise(n = n()) %>%
              dplyr::collect() %>%
              dplyr::pull(n) %>%
              as.integer()
          )
        ) %>%
        stats::setNames(dataset_names)
    }
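The shape of the returned statistics is easiest to see on a small in-memory example. The sketch below is a base R rendition of the tblist logic applied to made-up data. The helper name get_stats_sketch is an assumption, and lapply() stands in for purrr::map().

```r
# Base R version of the tblist logic, applied to a plain named list of data frames.
get_stats_sketch <- function(data_object) {
  lapply(data_object, function(dataset) list(n_rows = nrow(dataset)))
}

data_object <- list(
  books = data.frame(id = 1:5),
  borrowers = data.frame(id = 1:2)
)
stats <- get_stats_sketch(data_object)
str(stats)
```

Each dataset gets its own list of statistics, so further metrics could be added next to n_rows without changing the outer structure.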
.run_binding - method defining how binding should be handled

Required parameters:
source
binding_key - binding key definition
data_object_pre - data object state before filtering in the current step
data_object_post - data object state after filtering in the current step (including the effect of previous bindings)

Details:
Binding is run after filtering (and after .post_filtering if defined).
Binding keys are processed one at a time; .run_binding takes care of handling a single iteration.
The method should return an object of the same structure as the .init_step method output.
The method can support the binding key options (post = TRUE/FALSE, activate = TRUE/FALSE and the filtered attribute), but this is not obligatory.

Examples:
cohortBuilder:

    .run_binding.tblist <- function(source, binding_key, data_object_pre, data_object_post, ...) {
      binding_dataset <- binding_key$update$dataset
      dependent_datasets <- names(binding_key$data_keys)
      active_datasets <- data_object_post %>%
        purrr::keep(~ attr(., "filtered")) %>%
        names()

      if (!any(dependent_datasets %in% active_datasets)) {
        return(data_object_post)
      }

      key_values <- NULL
      common_key_names <- paste0("key_", seq_along(binding_key$data_keys[[1]]$key))
      for (dependent_dataset in dependent_datasets) {
        key_names <- binding_key$data_keys[[dependent_dataset]]$key
        tmp_key_values <- dplyr::distinct(data_object_post[[dependent_dataset]][, key_names, drop = FALSE]) %>%
          stats::setNames(common_key_names)
        if (is.null(key_values)) {
          key_values <- tmp_key_values
        } else {
          key_values <- dplyr::inner_join(key_values, tmp_key_values, by = common_key_names)
        }
      }

      data_object_post[[binding_dataset]] <- dplyr::inner_join(
        switch(
          as.character(binding_key$post),
          "FALSE" = data_object_pre[[binding_dataset]],
          "TRUE" = data_object_post[[binding_dataset]]
        ),
        key_values,
        by = stats::setNames(common_key_names, binding_key$update$key)
      )
      if (binding_key$activate) {
        attr(data_object_post[[binding_dataset]], "filtered") <- TRUE
      }

      return(data_object_post)
    }
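To make the effect of a single binding iteration in the tblist example concrete, here is a simplified base R sketch on two small data frames. It is not the cohortBuilder implementation: the data, the single-column key and the helper name bind_once are assumptions, and merge() stands in for dplyr::inner_join().

```r
# books is the dataset to update; borrowed is the dependent (already filtered) dataset.
books <- data.frame(book_id = 1:4, title = c("A", "B", "C", "D"))
borrowed <- data.frame(book_id = c(2L, 2L, 4L), user_id = c(10L, 11L, 12L))

# Keep only rows of `update` whose key appears among the distinct keys of `dependent`.
bind_once <- function(update, dependent, update_key, dependent_key) {
  key_values <- unique(dependent[, dependent_key, drop = FALSE])
  names(key_values) <- update_key
  merge(update, key_values, by = update_key)
}

bound <- bind_once(books, borrowed, update_key = "book_id", dependent_key = "book_id")
bound$book_id
```

Taking distinct key values before the join matters: without it, a book borrowed twice would appear twice in the updated dataset.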
cohortBuilder.db - slight modification of the cohortBuilder function above.

.get_attrition_count - define how to get metric used for attrition data plot

Required parameters:
source
data_stats - statistics related to each step data: a list of .get_stats results for each step (and the original data, assigned to step_id = 0)

Details:
The method should return a vector of length n+1, where n is the number of steps. The first element of the vector should describe the statistic for the base, unfiltered data.
Additional parameters can be passed via the attrition method of the Cohort object (e.g. dataset in the example below).

Examples:
cohortBuilder:

    .get_attrition_count.tblist <- function(source, data_stats, dataset, ...) {
      data_stats %>%
        purrr::map_int(~.[[dataset]][["n_rows"]])
    }
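The relation between data_stats and the returned vector can be checked on made-up numbers. In the sketch below the values, the dataset names and the use of step ids as list names are assumptions. The helper attrition_count mirrors the tblist example, with vapply() in place of purrr::map_int().

```r
# One .get_stats-like result per step; "0" holds the original, unfiltered data.
data_stats <- list(
  "0" = list(books = list(n_rows = 100L), borrowers = list(n_rows = 40L)),
  "1" = list(books = list(n_rows = 60L), borrowers = list(n_rows = 40L)),
  "2" = list(books = list(n_rows = 25L), borrowers = list(n_rows = 12L))
)

# Extract the row count of one dataset across all steps.
attrition_count <- function(data_stats, dataset) {
  vapply(data_stats, function(step) step[[dataset]][["n_rows"]], integer(1))
}

counts <- attrition_count(data_stats, "books")
counts
```

With two filtering steps the vector has three elements, the first one describing the unfiltered data.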
cohortBuilder.db - same as for cohortBuilder.

.get_attrition_label - define label displayed in attrition plot for the specified step

Required parameters:
source
step_id - id of the step ("0" for original data case)
step_filters - list storing filters configuration for the selected step (NULL for original data case)

Details:
The method should return the label for the given step, including the original data (step_id = "0" case).
Additional parameters can be passed via the attrition method of the Cohort object (e.g. dataset in the example below).

Examples:
cohortBuilder:

    .get_attrition_label.tblist <- function(source, step_id, step_filters, dataset, ...) {
      pkey <- source$primary_keys
      binding_keys <- source$binding_keys
      if (step_id == "0") {
        if (is.null(pkey)) {
          return(dataset)
        } else {
          dataset_pkey <- .get_item(pkey, "dataset", dataset)[1][[1]]$key
          if (is.null(dataset_pkey)) return(dataset)
          return(glue::glue("{dataset}\n primary key: {paste(dataset_pkey, collapse = ', ')}"))
        }
      }
      filters_section <- step_filters %>%
        purrr::keep(~.$dataset == dataset) %>%
        purrr::map(~get_attrition_filter_label(.$name, .$value_name, .$value)) %>%
        paste(collapse = "\n")
      bind_keys_section <- ""
      if (!is.null(binding_keys)) {
        dependent_datasets <- .get_item(
          binding_keys, attribute = "update", value = dataset,
          operator = function(value, target) {
            value == target$dataset
          }
        ) %>%
          purrr::map(~names(.[["data_keys"]])) %>%
          unlist() %>%
          unique()
        if (length(dependent_datasets) > 0) {
          bind_keys_section <- glue::glue(
            "\nData linked with external datasets: {paste(dependent_datasets, collapse = ', ')}",
            .trim = FALSE
          )
        }
      }
      gsub(
        "\n$",
        "",
        glue::glue("Step: {step_id}\n{filters_section}{bind_keys_section}")
      )
    }
cohortBuilder.db - same as for cohortBuilder.