The package README uses the gapminder dataset to demonstrate nplyr’s functionality. In this case (and other similar cases) the output from nplyr’s nested operations could be obtained by unnesting and performing grouped dplyr operations.
library(nplyr)
<-
gm_nest ::gapminder_unfiltered %>%
gapminder::nest(country_data = -continent)
tidyr
gm_nest#> # A tibble: 6 × 2
#> continent country_data
#> <fct> <list>
#> 1 Asia <tibble [578 × 5]>
#> 2 Europe <tibble [1,302 × 5]>
#> 3 Africa <tibble [637 × 5]>
#> 4 Americas <tibble [470 × 5]>
#> 5 FSU <tibble [139 × 5]>
#> 6 Oceania <tibble [187 × 5]>
# we can use nplyr to perform operations on the nested data
%>%
gm_nest nest_filter(country_data, year == max(year)) %>%
nest_mutate(country_data, pop_millions = pop/1000000) %>%
slice_head(n = 1) %>%
::unnest(country_data)
tidyr#> # A tibble: 43 × 7
#> continent country year lifeExp pop gdpPercap pop_millions
#> <fct> <fct> <int> <dbl> <int> <dbl> <dbl>
#> 1 Asia Afghanistan 2007 43.8 31889923 975. 31.9
#> 2 Asia Azerbaijan 2007 67.5 8017309 7709. 8.02
#> 3 Asia Bahrain 2007 75.6 708573 29796. 0.709
#> 4 Asia Bangladesh 2007 64.1 150448339 1391. 150.
#> 5 Asia Bhutan 2007 65.6 2327849 4745. 2.33
#> 6 Asia Brunei 2007 77.1 386511 48015. 0.387
#> 7 Asia Cambodia 2007 59.7 14131858 1714. 14.1
#> 8 Asia China 2007 73.0 1318683096 4959. 1319.
#> 9 Asia Hong Kong, China 2007 82.2 6980412 39725. 6.98
#> 10 Asia India 2007 64.7 1110396331 2452. 1110.
#> # … with 33 more rows
# in this case, we could have obtained the same result with tidyr and dplyr
%>%
gm_nest ::unnest(country_data) %>%
tidyrgroup_by(continent) %>%
filter(year == max(year)) %>%
mutate(pop_millions = pop/1000000) %>%
ungroup() %>%
filter(continent == "Asia")
#> # A tibble: 43 × 7
#> continent country year lifeExp pop gdpPercap pop_millions
#> <fct> <fct> <int> <dbl> <int> <dbl> <dbl>
#> 1 Asia Afghanistan 2007 43.8 31889923 975. 31.9
#> 2 Asia Azerbaijan 2007 67.5 8017309 7709. 8.02
#> 3 Asia Bahrain 2007 75.6 708573 29796. 0.709
#> 4 Asia Bangladesh 2007 64.1 150448339 1391. 150.
#> 5 Asia Bhutan 2007 65.6 2327849 4745. 2.33
#> 6 Asia Brunei 2007 77.1 386511 48015. 0.387
#> 7 Asia Cambodia 2007 59.7 14131858 1714. 14.1
#> 8 Asia China 2007 73.0 1318683096 4959. 1319.
#> 9 Asia Hong Kong, China 2007 82.2 6980412 39725. 6.98
#> 10 Asia India 2007 64.7 1110396331 2452. 1110.
#> # … with 33 more rows
Why, then, might we need to use nplyr? Well, in other scenarios, it may be far more convenient to work with nested data frames or it may not even be possible to unnest!
Consider a set of surveys that an organization might use to gather market data. It is common for organizations to have separate surveys for separate purposes but gather the same baseline set of data across all surveys (for example, a respondent’s age and gender may be recorded across all surveys, but each survey will have a different set of questions). Let’s use two fake surveys with the below questions for this example:
In this scenario, both surveys are collecting demographic information
— age, location, and industry — but differ in the questions. A
convenient way to get the response files into the environment would be
to use purrr::map()
to read in each file to a nested data frame.
<- "https://raw.githubusercontent.com/markjrieke/nplyr/main/data-raw/"
path
<-
surveys ::tibble(survey_file = c("job_survey", "personal_survey")) %>%
tibblemutate(survey_data = purrr::map(survey_file, ~readr::read_csv(paste0(path, .x, ".csv"))))
surveys#> # A tibble: 2 × 2
#> survey_file survey_data
#> <chr> <list>
#> 1 job_survey <spc_tbl_ [500 × 6]>
#> 2 personal_survey <spc_tbl_ [750 × 6]>
tidyr::unnest()
can usually handle idiosyncrasies in layout when unnesting but in this
case unnesting throws an error!
%>%
surveys ::unnest(survey_data)
tidyr#> Error in `list_unchop()`:
#> ! Can't combine `x[[1]]$Q5` <double> and `x[[2]]$Q5` <character>.
This is because the surveys share column names but not necessarily
column types! In this case, both data frames contain a column named
“Q5”, but in job_survey
it’s a double and in
personal_survey
it’s a character.
%>%
surveys slice(1) %>%
::unnest(survey_data) %>%
tidyrglimpse()
#> Rows: 500
#> Columns: 7
#> $ survey_file <chr> "job_survey", "job_survey", "job_survey", "job_survey", "j…
#> $ survey_name <chr> "job", "job", "job", "job", "job", "job", "job", "job", "j…
#> $ Q1 <dbl> 100, 81, 51, 81, 80, 32, 65, 57, 43, 94, 25, 83, 61, 66, 8…
#> $ Q2 <chr> "Austin", "San Antonio", "Austin", "Austin", "Dallas", "Fo…
#> $ Q3 <chr> "Consulting", "Consulting", "Consulting", "Technology", "C…
#> $ Q4 <chr> "Somewhat dissatisfied", "Neither satisfied nor dissatisfi…
#> $ Q5 <dbl> 163, 48, 190, 25, 143, 233, 43, 243, 158, 235, 245, 195, 2…
%>%
surveys slice(2) %>%
::unnest(survey_data) %>%
tidyrglimpse()
#> Rows: 750
#> Columns: 7
#> $ survey_file <chr> "personal_survey", "personal_survey", "personal_survey", "…
#> $ survey_name <chr> "personal", "personal", "personal", "personal", "personal"…
#> $ Q1 <dbl> 91, 32, 40, 23, 88, 69, 96, 40, 57, 40, 39, 70, 29, 38, 57…
#> $ Q2 <chr> "Austin", "San Antonio", "San Antonio", "Austin", "Dallas"…
#> $ Q3 <chr> "Energy", "Healthcare", "Consulting", "Consulting", "Techn…
#> $ Q4 <chr> "Neither satisfied nor dissatisfied", "Extremely satisfied…
#> $ Q5 <chr> "Blandit eros! A, ligula facilisis imperdiet! Interdum pla…
We could potentially get around this issue with unnesting by reading
in all columns as characters via
readr::read_csv(x, col_types = cols(.default = "c"))
, but
this presents its own challenges. Q5
would still be better
represented as a double in job_survey
and from the survey
question text Q4
has similar, but distinctly different,
meanings across the survey files.
This is where nplyr comes into play! Rather than malign the data types or create separate objects for each survey file, we can use nplyr to perform operations directly on the nested data frames.
<-
surveys %>%
surveys nest_mutate(survey_data,
age_group = if_else(Q1 < 65, "Adult", "Retirement Age")) %>%
nest_group_by(survey_data, Q3) %>%
nest_add_count(survey_data,
name = "n_respondents_in_industry") %>%
nest_mutate(survey_data,
median_industry_age = median(Q1)) %>%
nest_ungroup(survey_data)
%>%
surveys slice(1) %>%
::unnest(survey_data)
tidyr#> # A tibble: 500 × 10
#> survey_file survey_name Q1 Q2 Q3 Q4 Q5 age_g…¹ n_res…² media…³
#> <chr> <chr> <dbl> <chr> <chr> <chr> <dbl> <chr> <int> <dbl>
#> 1 job_survey job 100 Aust… Cons… Some… 163 Retire… 107 57
#> 2 job_survey job 81 San … Cons… Neit… 48 Retire… 107 57
#> 3 job_survey job 51 Aust… Cons… Extr… 190 Adult 107 57
#> 4 job_survey job 81 Aust… Tech… Extr… 25 Retire… 108 61.5
#> 5 job_survey job 80 Dall… Cons… Extr… 143 Retire… 107 57
#> 6 job_survey job 32 Fort… Ener… Some… 233 Adult 99 59
#> 7 job_survey job 65 Dall… Cons… Some… 43 Retire… 107 57
#> 8 job_survey job 57 Hous… Heal… Some… 243 Adult 75 65
#> 9 job_survey job 43 Dall… Gove… Neit… 158 Adult 111 57
#> 10 job_survey job 94 Fort… Heal… Extr… 235 Retire… 75 65
#> # … with 490 more rows, and abbreviated variable names ¹age_group,
#> # ²n_respondents_in_industry, ³median_industry_age
%>%
surveys slice(2) %>%
::unnest(survey_data)
tidyr#> # A tibble: 750 × 10
#> survey_file surve…¹ Q1 Q2 Q3 Q4 Q5 age_g…² n_res…³ media…⁴
#> <chr> <chr> <dbl> <chr> <chr> <chr> <chr> <chr> <int> <dbl>
#> 1 personal_survey person… 91 Aust… Ener… Neit… Blan… Retire… 145 61
#> 2 personal_survey person… 32 San … Heal… Extr… Elem… Adult 131 62
#> 3 personal_survey person… 40 San … Cons… Some… Eget… Adult 149 61
#> 4 personal_survey person… 23 Aust… Cons… Extr… Scel… Adult 149 61
#> 5 personal_survey person… 88 Dall… Tech… Neit… Aene… Retire… 150 61
#> 6 personal_survey person… 69 Fort… Tech… Neit… Inte… Retire… 150 61
#> 7 personal_survey person… 96 Hous… Heal… Extr… Blan… Retire… 131 62
#> 8 personal_survey person… 40 Hous… Cons… Extr… Scel… Adult 149 61
#> 9 personal_survey person… 57 Fort… Ener… Extr… Pede… Adult 145 61
#> 10 personal_survey person… 40 Fort… Heal… Extr… Phar… Adult 131 62
#> # … with 740 more rows, and abbreviated variable names ¹survey_name,
#> # ²age_group, ³n_respondents_in_industry, ⁴median_industry_age