Filtering relies on a filter key in the feature JSON, whose corresponding value is itself a JSON object. This vignette describes the types of filters available and how they can be expressed in this object.
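In outline, a feature JSON with filtering enabled has the following shape (the values shown are placeholders; complete examples appear throughout this vignette):
{
  "output_feature_name": "...",
  "transformation_type": "count",
  "source_table": "input_table",
  "grouping_column": "id",
  "filter": { ... }
}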
The most basic filter would be to accept all rows where the value in a given column meets a certain criterion. For example, we may only want to look at rows for which the value in the age column is greater than 18. There are three pieces of information here which must be encoded:
- column: the name of the column to filter on
- type: the type of filter to apply
- value: the value to compare against. The type of this must match the type of the column specified.
The type key can take one of the following values:
"in"
In this case, the value
key can either be a single value
or an array of multiple values. This filers for rows where the value in
the column matches one of the provided values.
"gt"
/ "lt"
/ "gt_eq"
/
"lt_eq"
The value in the column must be greater than, less than, greater than
or equal to, or less than or equal to the value
respectively. For these filters, only one value
can be
passed.
"date_gt"
/ "date_lt"
/
"date_gt_eq"
/ "date_lt_eq"
The same as above, but to be used for dates. This distinction is
mandatory because JSON does not have a native date type, so the
value
must be passed as a string in the format
YYYY-MM-DD
.
Thus, for example, if we assume that the column age contains integers, then
{
  "column": "age",
  "type": "in",
  "value": [18, 19, 20]
}
selects for people who are 18, 19, or 20 years old; and likewise, if the date_of_birth column contains dates, then
{
  "column": "date_of_birth",
  "type": "date_gt_eq",
  "value": "2000-01-01"
}
selects for people born on or after January 1, 2000.
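Following the same pattern, the "greater than 18" criterion from the start of this vignette would be written as:
{
  "column": "age",
  "type": "gt",
  "value": 18
}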
As a reminder, the table to be filtered is passed as the
source_table
key in the feature JSON. This is described in
the introductory vignette.
Filters may also be combined using the Boolean logic operators AND, OR, and NOT. Instead of providing the three keys column, type, and value, only type is specified (as "and", "or", or "not"), together with subfilter, which is itself an object mapping filter names to filter objects.
An example is probably the easiest way to explain this. If we wanted to select for people who are not 18, 19, or 20 years old, we could write:
{
"filter": {
"type": "not",
"subfilter": {
"age_filter": {
"column": "age",
"type": "in",
"value": [18, 19, 20]
}
}
}
}
The filter name, age_filter, can be any string you want: it is used only for logging purposes. Notice how the object that age_filter refers to, i.e.,
{
  "column": "age",
  "type": "in",
  "value": [18, 19, 20]
}
is itself just a basic filter as has already been described above.
The "or" and "and" filters work in exactly the same way, with the sole difference being that to accomplish anything meaningful you would generally want the subfilter object to contain two or more filters that work in tandem. For example, the filter below selects 18-year-olds with a height of over 180 (presumably centimetres, although this depends on the input table!):
{
"filter": {
"type": "and",
"subfilter": {
"age_filter": {
"column": "age",
"type": "in",
"value": [18]
},
"height_filter": {
"column": "height",
"type": "gt",
"value": 180
}
}
}
}
Again, age_filter
and height_filter
are
just names for logging purposes, and themselves refer to basic
filters.
Finally, note that the subfilters need not themselves be basic filters: they can in turn be compound filters, so filters can be nested as deeply as you like to accomplish the desired result.
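For instance, here is a sketch of a nested filter (the column names and values are illustrative) that selects people who are either under 18, or who are 18 years old and taller than 180:
{
  "filter": {
    "type": "or",
    "subfilter": {
      "minor_filter": {
        "column": "age",
        "type": "lt",
        "value": 18
      },
      "tall_adult_filter": {
        "type": "and",
        "subfilter": {
          "age_filter": {
            "column": "age",
            "type": "in",
            "value": [18]
          },
          "height_filter": {
            "column": "height",
            "type": "gt",
            "value": 180
          }
        }
      }
    }
  }
}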
Consider the following input table with five rows:
input_table <- data.frame(
id = c(1, 1, 1, 2, 3),
date = as.Date(c(
"2024-01-01", "2024-01-02", "2024-01-03",
"2023-01-01", "2023-01-01"
)),
diagnosis = c("A", "B", "C", "B", "C")
)
input_table
#> id date diagnosis
#> 1 1 2024-01-01 A
#> 2 1 2024-01-02 B
#> 3 1 2024-01-03 C
#> 4 2 2023-01-01 B
#> 5 3 2023-01-01 C
and suppose we would like to count the number of times each person
has had a diagnosis of “B”. This is a transformation_type
of "count"
; but if we do not perform any filtering,
"count"
will simply return the number of times each person
appears in this dataframe.
The following JSON is provided as json_examples/filter1.json
:
{
"output_feature_name": "unfiltered_count",
"transformation_type": "count",
"source_table": "input_table",
"grouping_column": "id",
"absent_default_value": 0
}
Passing this into run_pipeline()
will give us the
following result:
results <- run_pipeline(
data_sources = list(input_table = input_table),
feature_filenames = "json_examples/filter1.json"
)
results$features
#> id unfiltered_count
#> 1 1 3
#> 2 2 1
#> 3 3 1
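As a quick cross-check outside the package, the same tally can be produced in base R:
# Tally the number of rows per id directly
table(input_table$id)
#> 
#> 1 2 3 
#> 3 1 1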
Instead of this, if we specify a filter for the
diagnosis
column (in json_examples/filter2.json
),
we can achieve the desired result:
{
"output_feature_name": "filtered_count",
"transformation_type": "count",
"source_table": "input_table",
"grouping_column": "id",
"absent_default_value": 0,
"filter": {
"column": "diagnosis",
"type": "in",
"value": [
"B"
]
}
}
results <- run_pipeline(
data_sources = list(input_table = input_table),
feature_filenames = "json_examples/filter2.json"
)
results$features
#> id filtered_count
#> 1 1 1
#> 2 2 1
#> 3 3 0
Note how patient 3 has a count of 0, because they have never been diagnosed with “B”. This default of 0 comes from the absent_default_value key in the JSON.
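We can cross-check this result in base R, keeping all three factor levels so that patient 3 still appears with a count of 0 (this is only a sanity check, not how the pipeline computes the feature):
# Keep only rows with a "B" diagnosis, then tally per id
rows_b <- subset(input_table, diagnosis == "B")
table(factor(rows_b$id, levels = c(1, 2, 3)))
#> 
#> 1 2 3 
#> 1 1 0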
If we would like to find diagnoses of “B” only in 2024, then we can
add a date filter to the mix, and combine it with the diagnosis filter
with an AND filter (json_examples/filter3.json
):
{
"output_feature_name": "doubly_filtered_count",
"transformation_type": "count",
"source_table": "input_table",
"grouping_column": "id",
"absent_default_value": 0,
"filter": {
"type": "and",
"subfilter": {
"diagnosis_filter": {
"column": "diagnosis",
"type": "in",
"value": [
"B"
]
},
"date_filter": {
"column": "date",
"type": "date_gt_eq",
"value": "2024-01-01"
}
}
}
}
results <- run_pipeline(
data_sources = list(input_table = input_table),
feature_filenames = "json_examples/filter3.json"
)
results$features
#> id doubly_filtered_count
#> 1 1 1
#> 2 2 0
#> 3 3 0
Patient 2’s diagnosis has been excluded this time because the corresponding entry is from 2023.
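And once more as a base-R sanity check, this time adding the date condition:
# Keep "B" diagnoses dated 2024-01-01 or later, then tally per id
rows_b_2024 <- subset(input_table, diagnosis == "B" & date >= as.Date("2024-01-01"))
table(factor(rows_b_2024$id, levels = c(1, 2, 3)))
#> 
#> 1 2 3 
#> 1 0 0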
Filtering is used widely in the Gallery examples.