The goal of funneljoin is to make it easy to analyze behavior funnels. For example, maybe you’re interested in finding the people who visit a page and then register. Or you want all the times people click on an item and add it to their cart within 2 days.
You can do this with funneljoin’s after_join()
function.
The arguments are:
x
: a dataset with the first set of behaviors.y
: a dataset with the second set of behaviors.by_time
: a character vector to specify the time columns
in x and y. Must be a single column in each tbl. Note that this column
is used to filter for time y >= time x.by_user
: a character vector to specify the user or
identity columns in x and y. Must be a single column in each tbl.mode
: the method used to join: “inner”, “full”, “anti”,
“semi”, “right”, “left”.type
: the type of funnel used to distinguish between
event pairs, such as “first-first”, “last-first”, “any-firstafter”. See
types of funnels.max_gap
(optional): the maximum gap between events. Can
be a integer representing the number of seconds or a difftime
objectFunneljoins can be any combination of first
,
last
, any
, and lastbefore
with
first
, last
, any
, and
firstafter
.
Let’s take an example. We’ll use two tables, one of landings and one
of registrations, that come with the package. Each has a
user_id
and timestamp
column.
library(dplyr)
library(funneljoin)
landed#> # A tibble: 11 × 2
#> user_id timestamp
#> <dbl> <date>
#> 1 1 2018-07-01
#> 2 2 2018-07-01
#> 3 3 2018-07-02
#> 4 4 2018-07-01
#> 5 4 2018-07-04
#> 6 5 2018-07-10
#> 7 5 2018-07-12
#> 8 6 2018-07-07
#> 9 6 2018-07-08
#> 10 8 2018-08-01
#> 11 8 2018-08-03
registered#> # A tibble: 10 × 2
#> user_id timestamp
#> <dbl> <date>
#> 1 1 2018-07-02
#> 2 3 2018-07-02
#> 3 4 2018-06-10
#> 4 4 2018-07-02
#> 5 5 2018-07-11
#> 6 6 2018-07-10
#> 7 6 2018-07-11
#> 8 7 2018-07-07
#> 9 8 2018-08-02
#> 10 8 2018-08-04
Let’s say we wanted to get only the first time people landed and the
first time they registered, if it was after their first
landing. We would use a first-first inner
join.
%>%
landed after_inner_join(registered,
by_user = "user_id",
by_time = "timestamp",
type = "first-first")
#> # A tibble: 5 × 3
#> user_id timestamp.x timestamp.y
#> <dbl> <date> <date>
#> 1 1 2018-07-01 2018-07-02
#> 2 3 2018-07-02 2018-07-02
#> 3 6 2018-07-07 2018-07-10
#> 4 5 2018-07-10 2018-07-11
#> 5 8 2018-08-01 2018-08-02
We now have a dataset of four rows. Notice a few things:
timestamp.y
is always greater than or equal to
timestamp.x
. This is a condition of
after_join
.inner
join, we only have users who were
in both tables and had a timestamp.y
that was greater than
a timestamp.x
.What if instead we wanted all landing-registration pairs that
happened within a 4-day window? We now add the max_gap
argument. This takes either a number, which is the number of seconds, or
a difftime object. Instead of calculating the number of seconds in 4
days, we’ll just make a difftime object. Because we want any pair, not
just the pairs of the first landings and registrations, we change the
type to any-any
.
We’ll also add gap_col = TRUE
to return a column,
.gap
, which is the gap in seconds between the events.
%>%
landed after_inner_join(registered,
by_user = "user_id",
by_time = "timestamp",
type = "any-any",
max_gap = as.difftime(4, units = "days"),
gap_col = TRUE)
#> # A tibble: 10 × 4
#> user_id timestamp.x .gap timestamp.y
#> <dbl> <date> <dbl> <date>
#> 1 1 2018-07-01 86400 2018-07-02
#> 2 3 2018-07-02 0 2018-07-02
#> 3 4 2018-07-01 86400 2018-07-02
#> 4 5 2018-07-10 86400 2018-07-11
#> 5 6 2018-07-07 259200 2018-07-10
#> 6 6 2018-07-08 172800 2018-07-10
#> 7 6 2018-07-08 259200 2018-07-11
#> 8 8 2018-08-01 86400 2018-08-02
#> 9 8 2018-08-01 259200 2018-08-04
#> 10 8 2018-08-03 86400 2018-08-04
We now have a dataset of 7 rows. Notice a few things:
timestamp.y
is greater than or equal to
timestamp.x
.inner
join, we only
have users who were in both tables and had a timestamp.y
that was greater than a timestamp.x
.Left joins are very handy when you want to know what percentage of people who did X did Y afterward. For example, we may have an experiment where we want to know “what percentage of people registered after entering the experiment?”
In this case, we’d use a first-firstafter
left join - we
want to get a person’s first experiment start and then their first
registration afterward, if they have one. Let’s add another column to
our experiment starts table, which experiment variant someone was
in.
<- tibble::tribble(
experiment_starts ~user_id, ~timestamp, ~ alternative.name,
1, "2018-07-01", "control",
2, "2018-07-01", "treatment",
3, "2018-07-02", "control",
4, "2018-07-01", "control",
4, "2018-07-04", "control",
5, "2018-07-10", "treatment",
5, "2018-07-12", "treatment",
6, "2018-07-07", "treatment",
6, "2018-07-08", "treatment"
%>%
) mutate(timestamp = as.Date(timestamp))
<- tibble::tribble(
experiment_registrations ~user_id, ~timestamp,
1, "2018-07-02",
3, "2018-07-02",
4, "2018-06-10",
4, "2018-07-02",
5, "2018-07-11",
6, "2018-07-10",
6, "2018-07-11",
7, "2018-07-07"
%>%
) mutate(timestamp = as.Date(timestamp))
%>%
experiment_starts after_left_join(experiment_registrations,
by_user = "user_id",
by_time = "timestamp",
type = "first-firstafter")
#> # A tibble: 6 × 4
#> user_id timestamp.x alternative.name timestamp.y
#> <dbl> <date> <chr> <date>
#> 1 1 2018-07-01 control 2018-07-02
#> 2 2 2018-07-01 treatment NA
#> 3 4 2018-07-01 control 2018-07-02
#> 4 3 2018-07-02 control 2018-07-02
#> 5 6 2018-07-07 treatment 2018-07-10
#> 6 5 2018-07-10 treatment 2018-07-11
We now have an entry for each user in the experiment_starts table,
which has the time of their first registration afterward or NA if they
did not have a registration afterwards. Each user only appears once
since it’s a first-something
join.
We can use funneljoin’s summarize_conversions()
function
to get the number of starts and number of conversions (registrations) in
each group - you just need to specify the column that indicates whether
someone converted - if it’s NA or FALSE, it will be treated as FALSE,
otherwise TRUE.
%>%
experiment_starts after_left_join(experiment_registrations,
by_user = "user_id",
by_time = "timestamp",
type = "first-firstafter") %>%
group_by(alternative.name) %>%
summarize_conversions(converted = timestamp.y)
#> # A tibble: 2 × 4
#> alternative.name nb_users nb_conversions pct_converted
#> <chr> <int> <int> <dbl>
#> 1 control 3 3 1
#> 2 treatment 3 2 0.667
summarize_conversions
also works if we have multiple
types of conversions, specified by a column.
<- tibble::tribble(
for_conversion ~"experiment_group", ~"first_event", ~"last_event", ~"type",
"control", "2018-07-01", NA, "click",
"control", "2018-07-02", NA, "click",
"control", "2018-07-03", "2018-07-05", "click",
"treatment", "2018-07-01", "2018-07-05", "click",
"treatment", "2018-07-01", "2018-07-05", "click",
"control", "2018-07-01", NA, "purchase",
"control", "2018-07-02", NA, "purchase",
"control", "2018-07-03", NA, "purchase",
"treatment", "2018-07-01", NA, "purchase",
"treatment", "2018-07-01", "2018-07-05", "purchase"
)
%>%
for_conversion group_by(type, experiment_group) %>%
summarize_conversions(converted = last_event)
#> # A tibble: 4 × 5
#> # Groups: type [2]
#> type experiment_group nb_users nb_conversions pct_converted
#> <chr> <chr> <int> <int> <dbl>
#> 1 click control 3 1 0.333
#> 2 click treatment 2 2 1
#> 3 purchase control 3 0 0
#> 4 purchase treatment 2 1 0.5
summarize_prop_tests()
takes in a dataset with at least
three columns - nb_users
, nb_conversions
, and
a column indicating experiment group. It can also have an additional
column that is the type of conversion - for example, you could have
clicks and purchases. Each type of conversion can only have two rows,
one control
and one other group. If you have that
additional column of type, you need to group by it first.
It returns a dataset with at least 5 columns:
control
: the conversion rate of the control grouptreatment
: the conversion rate of the treatment
groupp_value
of the proportion testpct_change
: the percentage difference between the
control and treatment grouppct_change_low
and pct_change_high
: the
bayesian estimates for a 90% confidence interval.If you had a type column, it will also be in the output.
<- tibble::tribble(
tbl ~ experiment_group, ~nb_users, ~nb_conversions, ~type,
"control", 500, 200, "purchase",
"treatment", 500, 100, "purchase",
"control", 500, 360, "click",
"treatment", 500, 375, "click"
)
%>%
tbl group_by(type) %>%
summarize_prop_tests(alternative_name = experiment_group)
#> # A tibble: 2 × 7
#> type control treatment p_value pct_change pct_change_low pct_change_high
#> <chr> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
#> 1 click 0.72 0.75 3.16e- 1 0.0417 -0.0248 0.108
#> 2 purchase 0.4 0.2 8.39e-12 -0.5 -0.621 -0.379