Add 'data deconstructors' - `unjoin()` / `unrbind()` / `uncbind()` #16

jack-davison · 2025-02-11T09:52:28Z

Hi Nicola,

Apologies for the issue-less PR - your message yesterday reminded me of a concept I had for {messy}.

A common step in data manipulation is to join datasets together. In air quality that might be binding monitoring data together with meteorological data, or adding site metadata. There'll be equivalents in any other field, though - combining clinical results with patient data, combining demographic data with sales history, and so on.

When I teach, I often have to do something like the below to split a joined dataset back into its parts:

devtools::load_all()
#> ℹ Loading messy

dat <- openairmaps::polar_data

split1 <-
  unjoin(
    dat,
    by = "site",
    cols = c("lat", "lon", "site_type"),
    distinct = "right",
    names = c("monitoring", "site_meta")
  )

split2 <-
  unjoin(
    split1$monitoring,
    by = c("date", "site"),
    cols = c("wd", "ws", "visibility", "air_temp"),
    names = c("aq", "meteo")
  )

dplyr::glimpse(split1$site_meta)
#> Rows: 4
#> Columns: 4
#> $ site      <chr> "London Bloomsbury", "London Cromwell Road 2", "London Maryl…
#> $ lat       <dbl> 51.52229, 51.49548, 51.52253, 51.52105
#> $ lon       <dbl> -0.125889, -0.178709, -0.154611, -0.213492
#> $ site_type <chr> "Urban Background", "Urban Traffic", "Urban Traffic", "Urban…

dplyr::glimpse(split2$aq)
#> Rows: 35,040
#> Columns: 6
#> $ date  <dttm> 2009-01-01 00:00:00, 2009-01-01 01:00:00, 2009-01-01 02:00:00, …
#> $ site  <chr> "London Bloomsbury", "London Bloomsbury", "London Bloomsbury", "…
#> $ nox   <dbl> 113, 40, 48, 36, 40, 50, 50, 53, 80, 111, 206, 113, 86, 82, 76, …
#> $ no2   <dbl> 46, 32, 36, 29, 32, 36, 34, 34, 50, 59, 67, 61, 52, 53, 52, 69, …
#> $ pm2.5 <dbl> 42, 45, 43, 37, 36, 33, 33, 31, 27, 28, 37, 30, 27, 29, 27, 36, …
#> $ pm10  <dbl> 46, 49, 46, NA, 38, 32, 36, 32, 30, 32, 39, 37, 32, 33, 34, 41, …

dplyr::glimpse(split2$meteo)
#> Rows: 35,040
#> Columns: 6
#> $ date       <dttm> 2009-01-01 00:00:00, 2009-01-01 01:00:00, 2009-01-01 02:00…
#> $ site       <chr> "London Bloomsbury", "London Bloomsbury", "London Bloomsbur…
#> $ wd         <dbl> 58.92536, 74.46675, 30.00000, 45.00000, 70.00000, 46.63627,…
#> $ ws         <dbl> 2.066667, 1.900000, 1.550000, 2.100000, 1.500000, 2.100000,…
#> $ visibility <dbl> 5000.000, 4933.333, 5000.000, 4900.000, 5000.000, 6000.000,…
#> $ air_temp   <dbl> 0.8666667, 0.8666667, 0.8000000, 0.8500000, 0.8666667, 0.96…

Created on 2025-02-11 with reprex v2.1.1
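For anyone skimming, the idea behind `unjoin()` can be sketched in base R. This is only an illustration under assumptions, not the code in this PR: `unjoin_sketch()` and the toy `joined` data are made up for the example, and the sketch always de-duplicates the right-hand table (roughly what `distinct = "right"` does above).

```r
# Hypothetical toy data standing in for a joined monitoring dataset
joined <- data.frame(
  site = c("A", "A", "B", "B"),
  date = as.Date(c("2025-01-01", "2025-01-02", "2025-01-01", "2025-01-02")),
  nox  = c(10, 12, 30, 28),
  lat  = c(51.5, 51.5, 51.6, 51.6),
  lon  = c(-0.1, -0.1, -0.2, -0.2)
)

# Illustrative helper (an assumption, not the PR's implementation):
# move `cols` into a lookup table keyed on `by`, one row per key
unjoin_sketch <- function(data, by, cols) {
  list(
    left  = data[, setdiff(names(data), cols), drop = FALSE],
    right = unique(data[, c(by, cols), drop = FALSE])
  )
}

parts <- unjoin_sketch(joined, by = "site", cols = c("lat", "lon"))

# The learner's task: put the pieces back together on the shared key
rejoined <- merge(parts$left, parts$right, by = "site")
```

Here `parts$right` holds one row per site, and merging on `site` recovers the original columns and rows.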

This PR adds three functions - the above `unjoin()` as well as `unrbind()` and `uncbind()`. The latter two randomly chunk up your dataframe rowwise and colwise, respectively, based on user-defined sizes/proportions. This models data with similar structures coming from different sources - e.g., monthly data reports coming from a lab that need binding into a single dataframe.

> unrbind(dplyr::tibble(iris), probs = c(0.7, 0.2, 0.1))
[[1]]
# A tibble: 105 × 5
   Sepal.Length Sepal.Width Petal.Length Petal.Width Species   
          <dbl>       <dbl>        <dbl>       <dbl> <fct>     
 1          4.8         3.1          1.6         0.2 setosa    
 2          5.7         3.8          1.7         0.3 setosa    
 3          5.5         4.2          1.4         0.2 setosa    
 4          5.2         4.1          1.5         0.1 setosa    
 5          5.8         2.7          4.1         1   versicolor
 6          6.8         3.2          5.9         2.3 virginica 
 7          7.7         3.8          6.7         2.2 virginica 
 8          6.1         2.6          5.6         1.4 virginica 
 9          5.5         2.5          4           1.3 versicolor
10          6.4         3.1          5.5         1.8 virginica 
# ℹ 95 more rows
# ℹ Use `print(n = ...)` to see more rows

[[2]]
# A tibble: 30 × 5
   Sepal.Length Sepal.Width Petal.Length Petal.Width Species   
          <dbl>       <dbl>        <dbl>       <dbl> <fct>     
 1          6.4         3.2          4.5         1.5 versicolor
 2          5.4         3.9          1.3         0.4 setosa    
 3          5.4         3.4          1.7         0.2 setosa    
 4          5.5         2.4          3.7         1   versicolor
 5          4.9         3.6          1.4         0.1 setosa    
 6          6.3         2.7          4.9         1.8 virginica 
 7          5.8         2.6          4           1.2 versicolor
 8          6.4         2.7          5.3         1.9 virginica 
 9          5           2            3.5         1   versicolor
10          5.1         2.5          3           1.1 versicolor
# ℹ 20 more rows
# ℹ Use `print(n = ...)` to see more rows

[[3]]
# A tibble: 15 × 5
   Sepal.Length Sepal.Width Petal.Length Petal.Width Species   
          <dbl>       <dbl>        <dbl>       <dbl> <fct>     
 1          5.6         2.8          4.9         2   virginica 
 2          6.4         2.9          4.3         1.3 versicolor
 3          6.3         2.8          5.1         1.5 virginica 
 4          4.4         3.2          1.3         0.2 setosa    
 5          6.3         3.3          4.7         1.6 versicolor
 6          6.1         3            4.6         1.4 versicolor
 7          6           2.2          4           1   versicolor
 8          5.5         3.5          1.3         0.2 setosa    
 9          6.7         3.3          5.7         2.1 virginica 
10          6.5         3            5.2         2   virginica 
11          5.9         3.2          4.8         1.8 versicolor
12          6.9         3.1          5.4         2.1 virginica 
13          6.3         3.3          6           2.5 virginica 
14          7.7         2.6          6.9         2.3 virginica 
15          4.5         2.3          1.3         0.3 setosa
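As a rough illustration of the row-wise split, here is a base R sketch. It is an assumption about the general approach rather than the code in this PR: `unrbind_sketch()` is a hypothetical helper that shuffles the rows and then deals them into chunks sized by `probs`.

```r
# Illustrative helper (an assumption, not the PR's implementation)
unrbind_sketch <- function(data, probs) {
  stopifnot(abs(sum(probs) - 1) < 1e-8)
  n <- nrow(data)
  # Cumulative cut points, rounded so every row lands in exactly one chunk
  breaks <- round(cumsum(probs) * n)
  sizes <- diff(c(0, breaks))
  chunk_id <- rep(seq_along(probs), times = sizes)
  # Shuffle the rows, then split them by chunk membership
  shuffled <- data[sample(n), , drop = FALSE]
  unname(split(shuffled, chunk_id))
}

set.seed(123)
chunks <- unrbind_sketch(iris, probs = c(0.7, 0.2, 0.1))
chunk_sizes <- sapply(chunks, nrow)  # 105, 30 and 15 rows

# The learner's task: bind the chunks back into one data frame
recombined <- do.call(rbind, chunks)
```

Binding the chunks recovers all 150 rows of `iris`, though in shuffled order.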

If I'm honest, I can't think of a purpose for `uncbind()` that isn't better achieved using `unjoin()`, but it made sense to complete the set!

Users can, of course, go ahead and use `messy()` or another function on each output dataset. This will make it even harder for learners to re-join them, as they'd have to ensure that column names match (for `rbind()`) or that their joining columns are aligned (for `merge()`/`left_join()`).
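To make the `rbind()` case concrete, here is a small base R sketch. The two data frames are hypothetical, and it simply assumes that a messing step has changed the column names of one chunk; it does not call `messy()` itself.

```r
# Hypothetical chunks from two "sources"; the second has altered column names
jan <- data.frame(site = "A", nox = 10)
feb <- data.frame(Site = "B", NOX = 12)

# Base rbind() refuses data frames whose column names differ,
# so capture the error message instead of stopping
result <- tryCatch(rbind(jan, feb), error = function(e) conditionMessage(e))

# The learner's fix: harmonise the names first, then bind
names(feb) <- tolower(names(feb))
combined <- rbind(jan, feb)
```

The first `rbind()` attempt fails because the names do not match, which is exactly the kind of clean-up step learners would need to work through before recombining.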

jack-davison added 5 commits February 11, 2025 09:20

feat: add unjoin/unbind functions

65f72f7

docs: add NEWS.md item

95b5143

docs: update pkgdown

26e73a2

docs: add unrbind context

3e9763b

docs: add return value for unrbind

c845050
