Inside each feature JSON, an optional preprocess object can be included, which causes the input table to be modified in a particular way before the feature is calculated.
This is primarily useful for data where each row represents some subdivision of a larger entity, and the user wants to calculate features based on information from those larger entities. In particular, this is useful for episodic data, where each row represents an episode within a continuous hospital stay.
We begin by making the case for why preprocessing can be required for certain features.
Consider the following data frame. (This is a heavily simplified version of the example SMR04 data bundled with the package, which you can obtain using eider_example('random_smr04_data.csv').)
input_table <- data.frame(
id = c(1, 1, 1, 1),
admission_date = as.Date(c(
"2015-01-01", "2016-01-01", "2016-01-04", "2017-01-01"
)),
discharge_date = as.Date(c(
"2015-01-05", "2016-01-04", "2016-01-08", "2017-01-08"
)),
cis_marker = c(1, 2, 2, 3),
episode_within_cis = c(1, 1, 2, 1),
diagnosis = c("A", "B", "C", "B")
)
input_table
#> id admission_date discharge_date cis_marker episode_within_cis diagnosis
#> 1 1 2015-01-01 2015-01-05 1 1 A
#> 2 1 2016-01-01 2016-01-04 2 1 B
#> 3 1 2016-01-04 2016-01-08 2 2 C
#> 4 1 2017-01-01 2017-01-08 3 1 B
Here, each row is an episode; multiple episodes make up a continuous inpatient stay (hence the abbreviation “cis”). The cis_marker field is used to label stays, and can thus be used to identify episodes belonging to the same stay. In this case, the episode_within_cis field tells us the order of the episodes within a stay; such information is not always present, though.
In this table snippet, there is only one patient: they have had 3 distinct stays; the second of these comprises 2 episodes.
Filtering on such data can be tricky, because the admission_date and discharge_date pertain to each episode, but we are often interested in stay-level data: for example, when the patient was first admitted to hospital.
Consider the following question: how many stays has a patient had since 5 January 2016 in which they had a diagnosis of “B”? For the patient in this table, the answer is 2: both the 2016 and 2017 stays had a diagnosis of “B”, and both stays ended after 5 January 2016.
If we were to naively try to perform this calculation without accounting for the fact that dates are recorded per episode, we could write something like json_examples/preprocessing1.json:
{
"output_feature_name": "naive",
"transformation_type": "nunique",
"source_table": "input_table",
"aggregation_column": "cis_marker",
"grouping_column": "id",
"absent_default_value": 0,
"filter": {
"type": "and",
"subfilter": {
"date_filter": {
"column": "discharge_date",
"type": "date_gt_eq",
"value": "2016-01-05"
},
"diagnosis_filter": {
"column": "diagnosis",
"type": "in",
"value": [
"B"
]
}
}
}
}
Running this would give:
results <- run_pipeline(
data_sources = list(input_table = input_table),
feature_filenames = "json_examples/preprocessing1.json"
)
results$features
#> id naive
#> 1 1 1
We got a value of 1, which is incorrect! What gives? As it happens, the filter was applied to each episode, and because the first episode of the 2016 stay ended before 5 January, it was not counted in the data. The second episode of the 2016 stay was also removed because its diagnosis was not “B”. So only the third stay, in 2017, was counted.
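To see the per-episode behaviour concretely, the following base R sketch re-creates the two filter conditions outside of eider and flags which episodes pass each one. It only illustrates the logic described above and is not eider's implementation; the names episodes, checks and stays_counted are made up for this example.

```r
# Same episodes as input_table above (only the columns the filter needs)
episodes <- data.frame(
  id = c(1, 1, 1, 1),
  discharge_date = as.Date(c(
    "2015-01-05", "2016-01-04", "2016-01-08", "2017-01-08"
  )),
  cis_marker = c(1, 2, 2, 3),
  diagnosis = c("A", "B", "C", "B")
)

# Evaluate each filter condition row by row, i.e. per episode
checks <- transform(
  episodes,
  passes_date = discharge_date >= as.Date("2016-01-05"),
  passes_diagnosis = diagnosis == "B"
)
# The "and" filter keeps a row only if both conditions hold
checks$kept <- checks$passes_date & checks$passes_diagnosis

checks[, c("cis_marker", "passes_date", "passes_diagnosis", "kept")]

# Number of distinct stays among the surviving episodes
stays_counted <- length(unique(checks$cis_marker[checks$kept]))
stays_counted
```

In this sketch, neither episode of stay 2 passes both conditions at once, so only the 2017 stay survives and stays_counted is 1.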
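The way eider approaches this issue is to allow users to preprocess their data. This is accomplished by specifying a preprocess object in the feature JSON. In our case, to merge episode dates into stays, we can say that we would like to:
- group the rows by id and cis_marker,
- replace each admission_date with the earliest admission_date in its group, and
- replace each discharge_date with the latest discharge_date in its group.
In dplyr terms, one would write a pipeline like this: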
library(magrittr) # provides the %>% pipe used below
processed_table <- input_table %>%
dplyr::group_by(id, cis_marker) %>%
dplyr::mutate(
admission_date = min(admission_date),
discharge_date = max(discharge_date)
) %>%
dplyr::ungroup()
processed_table
#> # A tibble: 4 × 6
#> id admission_date discharge_date cis_marker episode_within_cis diagnosis
#> <dbl> <date> <date> <dbl> <dbl> <chr>
#> 1 1 2015-01-01 2015-01-05 1 1 A
#> 2 1 2016-01-01 2016-01-08 2 1 B
#> 3 1 2016-01-01 2016-01-08 2 2 C
#> 4 1 2017-01-01 2017-01-08 3 1 B
Notice how the dates for both episodes in stay 2 are now the same, and reflect the overall dates for the stay.
Returning to the eider library, this information is (unsurprisingly) specified in JSON. Including a preprocess object in the feature will cause the input table to be modified as above:
{
"preprocess": {
"on": ["id", "cis_marker"],
"retain_min": ["admission_date"],
"retain_max": ["discharge_date"]
}
}
The preprocess object contains one mandatory key:
- "on": the names of the columns by which the data should be grouped for preprocessing
and several optional keys can be provided, corresponding to the operations which should be performed. All of these keys refer to column names:
- "retain_min": retain the minimum value within each group
- "retain_max": retain the maximum value within each group
- "replace_with_sum": sum the values within each group and replace the original values with the sum
Columns may not be specified in more than one of the above keys (i.e., you cannot preprocess the same column twice).
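As a rough mental model of these keys, the sketch below writes the three operations as a single dplyr function. This is an assumption-laden illustration rather than eider's actual code: preprocess_sketch and its arguments are hypothetical names chosen to mirror the JSON keys, and the toy data frame is invented.

```r
library(dplyr)

# Hypothetical helper (not part of eider) mirroring the documented keys
preprocess_sketch <- function(df, on,
                              retain_min = character(),
                              retain_max = character(),
                              replace_with_sum = character()) {
  # A column may appear under at most one operation
  all_cols <- c(retain_min, retain_max, replace_with_sum)
  stopifnot(anyDuplicated(all_cols) == 0)

  df %>%
    group_by(across(all_of(on))) %>%
    mutate(
      across(all_of(retain_min), min),       # group minimum
      across(all_of(retain_max), max),       # group maximum
      across(all_of(replace_with_sum), sum)  # group total
    ) %>%
    ungroup()
}

# Toy data: two rows in group "a", one row in group "b"
toy <- data.frame(
  g = c("a", "a", "b"),
  lo = c(3, 1, 5),
  hi = c(2, 9, 4),
  n = c(1, 2, 3)
)

out <- preprocess_sketch(
  toy,
  on = "g",
  retain_min = "lo",
  retain_max = "hi",
  replace_with_sum = "n"
)
out
```

Note that the number of rows is unchanged: each operation overwrites the values within a group rather than collapsing the group to a single row, which matches the dplyr::mutate pipeline shown earlier.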
We can now rewrite the feature JSON to include the preprocessing step (json_examples/preprocessing2.json):
{
"output_feature_name": "correct",
"transformation_type": "nunique",
"source_table": "input_table",
"aggregation_column": "cis_marker",
"grouping_column": "id",
"absent_default_value": 0,
"filter": {
"type": "and",
"subfilter": {
"date_filter": {
"column": "discharge_date",
"type": "date_gt_eq",
"value": "2016-01-05"
},
"diagnosis_filter": {
"column": "diagnosis",
"type": "in",
"value": [
"B"
]
}
}
},
"preprocess": {
"on": [
"id",
"cis_marker"
],
"retain_min": [
"admission_date"
],
"retain_max": [
"discharge_date"
]
}
}
Rerunning the pipeline with this feature gives us the correct value of 2. Note that although the preprocess object is placed after the filter object in the JSON, the preprocessing is always done prior to filtering. The order of the keys in the JSON has no effect whatsoever on the result.
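Why the order of the two steps matters can be checked with a small dplyr sketch that applies them in both orders. The helper functions below (merge_stay_dates, keep_recent_b, count_stays) are hypothetical stand-ins for the preprocess, filter and nunique steps, written only for this illustration; they are not eider functions.

```r
library(dplyr)

episodes <- data.frame(
  id = c(1, 1, 1, 1),
  admission_date = as.Date(c(
    "2015-01-01", "2016-01-01", "2016-01-04", "2017-01-01"
  )),
  discharge_date = as.Date(c(
    "2015-01-05", "2016-01-04", "2016-01-08", "2017-01-08"
  )),
  cis_marker = c(1, 2, 2, 3),
  diagnosis = c("A", "B", "C", "B")
)

# Stand-in for the preprocess object: merge episode dates into stay dates
merge_stay_dates <- function(df) {
  df %>%
    group_by(id, cis_marker) %>%
    mutate(
      admission_date = min(admission_date),
      discharge_date = max(discharge_date)
    ) %>%
    ungroup()
}

# Stand-in for the "and" filter on discharge date and diagnosis
keep_recent_b <- function(df) {
  df %>%
    filter(discharge_date >= as.Date("2016-01-05"), diagnosis == "B")
}

# Stand-in for nunique on cis_marker (there is only one patient here)
count_stays <- function(df) n_distinct(df$cis_marker)

preprocess_first <- episodes %>%
  merge_stay_dates() %>%
  keep_recent_b() %>%
  count_stays()

filter_first <- episodes %>%
  keep_recent_b() %>%
  merge_stay_dates() %>%
  count_stays()

preprocess_first
filter_first
```

In this sketch, preprocessing first yields 2, whereas filtering first discards the 2016 "B" episode before its discharge date can be updated and yields 1.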
replace_with_sum
To motivate the use of replace_with_sum, we can add a column to our previous data frame to denote the length of each episode:
input_table_with_sum <- input_table %>%
dplyr::mutate(days = as.numeric(discharge_date - admission_date))
input_table_with_sum
#> id admission_date discharge_date cis_marker episode_within_cis diagnosis days
#> 1 1 2015-01-01 2015-01-05 1 1 A 4
#> 2 1 2016-01-01 2016-01-04 2 1 B 3
#> 3 1 2016-01-04 2016-01-08 2 2 C 4
#> 4 1 2017-01-01 2017-01-08 3 1 B 7
Now consider a different question: how many stays has a patient had which lasted for a week or more? To answer this, we first need to sum up the days for each stay, and we can then filter based on this sum. This is accomplished with json_examples/preprocessing3.json:
{
"output_feature_name": "using_sum",
"transformation_type": "nunique",
"source_table": "input_table",
"aggregation_column": "cis_marker",
"grouping_column": "id",
"absent_default_value": 0,
"filter": {
"column": "days",
"type": "gt_eq",
"value": 7
},
"preprocess": {
"on": [
"id",
"cis_marker"
],
"replace_with_sum": [
"days"
]
}
}
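For comparison, the same calculation can be sketched directly in dplyr. This is an emulation of the steps the JSON describes (sum days per stay, filter, count distinct stays), not output from eider itself; the object names episodes, stay_lengths and long_stays are invented for the example.

```r
library(dplyr)

episodes <- data.frame(
  id = c(1, 1, 1, 1),
  admission_date = as.Date(c(
    "2015-01-01", "2016-01-01", "2016-01-04", "2017-01-01"
  )),
  discharge_date = as.Date(c(
    "2015-01-05", "2016-01-04", "2016-01-08", "2017-01-08"
  )),
  cis_marker = c(1, 2, 2, 3)
) %>%
  mutate(days = as.numeric(discharge_date - admission_date))

# Emulates "replace_with_sum": every episode gets its stay's total days
stay_lengths <- episodes %>%
  group_by(id, cis_marker) %>%
  mutate(days = sum(days)) %>%
  ungroup()

# Emulates the gt_eq filter followed by nunique on cis_marker per patient
long_stays <- stay_lengths %>%
  filter(days >= 7) %>%
  group_by(id) %>%
  summarise(using_sum = n_distinct(cis_marker))

stay_lengths$days
long_stays
```

Here the two 2016 episodes (3 and 4 days) combine into a 7-day stay, so the emulation counts 2 stays of a week or more; without the summing step only the 2017 stay would qualify. When running the feature through run_pipeline, note that the JSON names its source table input_table, so the data frame containing the days column would presumably need to be passed under that name, for example data_sources = list(input_table = input_table_with_sum).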
The Gallery section contains two examples of preprocessing in action: both PIS feature 4 and SMR04 feature 4 use the replace_with_sum preprocessing function.