Add 'data deconstructors' - unjoin() / unrbind() / uncbind() #16

Open
wants to merge 5 commits into
base: main

jack-davison
Contributor

Hi Nicola,

Apologies for the issue-less PR - your message yesterday reminded me of a concept I had for {messy}.

A common step in data manipulation is to join datasets together. In air quality that might be binding monitoring data together with meteorological data, or adding site metadata. There'll be equivalents in any other field, though - combining clinical results with patient data, combining demographic data with sales history, and so on.

When I teach, I often have to do something like the following:

devtools::load_all()
#> ℹ Loading messy

dat <- openairmaps::polar_data

split1 <-
  unjoin(
    dat,
    by = "site",
    cols = c("lat", "lon", "site_type"),
    distinct = "right",
    names = c("monitoring", "site_meta")
  )

split2 <-
  unjoin(
    split1$monitoring,
    by = c("date", "site"),
    cols = c("wd", "ws", "visibility", "air_temp"),
    names = c("aq", "meteo")
  )

dplyr::glimpse(split1$site_meta)
#> Rows: 4
#> Columns: 4
#> $ site      <chr> "London Bloomsbury", "London Cromwell Road 2", "London Maryl…
#> $ lat       <dbl> 51.52229, 51.49548, 51.52253, 51.52105
#> $ lon       <dbl> -0.125889, -0.178709, -0.154611, -0.213492
#> $ site_type <chr> "Urban Background", "Urban Traffic", "Urban Traffic", "Urban…

dplyr::glimpse(split2$aq)
#> Rows: 35,040
#> Columns: 6
#> $ date  <dttm> 2009-01-01 00:00:00, 2009-01-01 01:00:00, 2009-01-01 02:00:00, …
#> $ site  <chr> "London Bloomsbury", "London Bloomsbury", "London Bloomsbury", "…
#> $ nox   <dbl> 113, 40, 48, 36, 40, 50, 50, 53, 80, 111, 206, 113, 86, 82, 76, …
#> $ no2   <dbl> 46, 32, 36, 29, 32, 36, 34, 34, 50, 59, 67, 61, 52, 53, 52, 69, …
#> $ pm2.5 <dbl> 42, 45, 43, 37, 36, 33, 33, 31, 27, 28, 37, 30, 27, 29, 27, 36, …
#> $ pm10  <dbl> 46, 49, 46, NA, 38, 32, 36, 32, 30, 32, 39, 37, 32, 33, 34, 41, …

dplyr::glimpse(split2$meteo)
#> Rows: 35,040
#> Columns: 6
#> $ date       <dttm> 2009-01-01 00:00:00, 2009-01-01 01:00:00, 2009-01-01 02:00…
#> $ site       <chr> "London Bloomsbury", "London Bloomsbury", "London Bloomsbur…
#> $ wd         <dbl> 58.92536, 74.46675, 30.00000, 45.00000, 70.00000, 46.63627,…
#> $ ws         <dbl> 2.066667, 1.900000, 1.550000, 2.100000, 1.500000, 2.100000,…
#> $ visibility <dbl> 5000.000, 4933.333, 5000.000, 4900.000, 5000.000, 6000.000,…
#> $ air_temp   <dbl> 0.8666667, 0.8666667, 0.8000000, 0.8500000, 0.8666667, 0.96…

Created on 2025-02-11 with reprex v2.1.1
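
For anyone reading without the PR branch installed, here is a rough base-R sketch of the same idea. `unjoin_sketch()`, its `distinct_right` argument, and the toy data are all hypothetical and are not the PR's implementation; the sketch only assumes that an "unjoin" keeps the key column(s) in both outputs and moves `cols` into the second one.

```r
# Hypothetical helper, NOT the PR's unjoin(): split `data` into a "left" table
# without `cols` and a "right" table holding the key column(s) plus `cols`.
unjoin_sketch <- function(data, by, cols, distinct_right = FALSE) {
  left <- data[, setdiff(names(data), cols), drop = FALSE]
  right <- data[, c(by, cols), drop = FALSE]
  if (distinct_right) {
    # Collapse repeated metadata down to one row per key combination
    right <- unique(right)
    rownames(right) <- NULL
  }
  list(left = left, right = right)
}

# Toy data (made up for illustration): two sites, two readings each
toy <- data.frame(
  site = c("A", "A", "B", "B"),
  nox = c(10, 12, 30, 28),
  lat = c(51.5, 51.5, 51.6, 51.6),
  stringsAsFactors = FALSE
)

parts <- unjoin_sketch(toy, by = "site", cols = "lat", distinct_right = TRUE)

# The exercise for learners is then to put the pieces back together
rejoined <- merge(parts$left, parts$right, by = "site")
```

Here `parts$right` has one row per site, and `rejoined` has the same rows and columns as `toy`, although `merge()` does not promise to preserve the original row order.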

This PR adds three functions: unjoin(), shown above, plus unrbind() and uncbind(). The latter two randomly split a data frame into chunks, row-wise and column-wise respectively, based on user-defined sizes or proportions. This models similarly structured data arriving from different sources, e.g., a monthly data report from a lab that needs binding into a single data frame.

> unrbind(dplyr::tibble(iris), probs = c(0.7, 0.2, 0.1))
[[1]]
# A tibble: 105 × 5
   Sepal.Length Sepal.Width Petal.Length Petal.Width Species   
          <dbl>       <dbl>        <dbl>       <dbl> <fct>     
 1          4.8         3.1          1.6         0.2 setosa    
 2          5.7         3.8          1.7         0.3 setosa    
 3          5.5         4.2          1.4         0.2 setosa    
 4          5.2         4.1          1.5         0.1 setosa    
 5          5.8         2.7          4.1         1   versicolor
 6          6.8         3.2          5.9         2.3 virginica 
 7          7.7         3.8          6.7         2.2 virginica 
 8          6.1         2.6          5.6         1.4 virginica 
 9          5.5         2.5          4           1.3 versicolor
10          6.4         3.1          5.5         1.8 virginica 
# ℹ 95 more rows
# ℹ Use `print(n = ...)` to see more rows

[[2]]
# A tibble: 30 × 5
   Sepal.Length Sepal.Width Petal.Length Petal.Width Species   
          <dbl>       <dbl>        <dbl>       <dbl> <fct>     
 1          6.4         3.2          4.5         1.5 versicolor
 2          5.4         3.9          1.3         0.4 setosa    
 3          5.4         3.4          1.7         0.2 setosa    
 4          5.5         2.4          3.7         1   versicolor
 5          4.9         3.6          1.4         0.1 setosa    
 6          6.3         2.7          4.9         1.8 virginica 
 7          5.8         2.6          4           1.2 versicolor
 8          6.4         2.7          5.3         1.9 virginica 
 9          5           2            3.5         1   versicolor
10          5.1         2.5          3           1.1 versicolor
# ℹ 20 more rows
# ℹ Use `print(n = ...)` to see more rows

[[3]]
# A tibble: 15 × 5
   Sepal.Length Sepal.Width Petal.Length Petal.Width Species   
          <dbl>       <dbl>        <dbl>       <dbl> <fct>     
 1          5.6         2.8          4.9         2   virginica 
 2          6.4         2.9          4.3         1.3 versicolor
 3          6.3         2.8          5.1         1.5 virginica 
 4          4.4         3.2          1.3         0.2 setosa    
 5          6.3         3.3          4.7         1.6 versicolor
 6          6.1         3            4.6         1.4 versicolor
 7          6           2.2          4           1   versicolor
 8          5.5         3.5          1.3         0.2 setosa    
 9          6.7         3.3          5.7         2.1 virginica 
10          6.5         3            5.2         2   virginica 
11          5.9         3.2          4.8         1.8 versicolor
12          6.9         3.1          5.4         2.1 virginica 
13          6.3         3.3          6           2.5 virginica 
14          7.7         2.6          6.9         2.3 virginica 
15          4.5         2.3          1.3         0.3 setosa    
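
As a rough illustration of the row-wise case, the sketch below shuffles the rows and cuts them into consecutive chunks sized by `probs`. `unrbind_sketch()` is a hypothetical name and only one possible approach; the PR's unrbind() may well work differently.

```r
# Hypothetical helper, NOT the PR's unrbind(): shuffle the row indices, then
# cut them into consecutive chunks whose sizes follow `probs`.
unrbind_sketch <- function(data, probs) {
  stopifnot(abs(sum(probs) - 1) < 1e-8)
  n <- nrow(data)
  shuffled <- sample.int(n)
  # Cumulative cut points; the last one is pinned to n so no row is dropped
  ends <- round(cumsum(probs) * n)
  ends[length(ends)] <- n
  starts <- c(1, head(ends, -1) + 1)
  Map(
    function(s, e) data[shuffled[seq_len(e - s + 1) + s - 1], , drop = FALSE],
    starts,
    ends
  )
}

set.seed(1)
chunks <- unrbind_sketch(iris, probs = c(0.7, 0.2, 0.1))

# Chunk sizes: 105, 30 and 15 rows for the 150-row iris data
sizes <- vapply(chunks, nrow, integer(1))

# Binding the chunks back together recovers every original row exactly once
rebound <- do.call(rbind, chunks)
```

Shuffling before cutting means each chunk is a random sample of rows, which is consistent with the mixed species in the output above.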

If I'm honest, I can't think of a purpose for uncbind() that's not better achieved using unjoin() but it made sense to complete the set!

Users can, of course, go on to apply messy() or another function to each output data frame. That makes the pieces even harder for learners to re-join, as they'd have to ensure that column names match (for rbind()) or that their joining columns are aligned (for merge()/left_join()).
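
A small base-R sketch of the rbind() case follows. The data frames and the case change in the column names are invented for illustration; this assumes the chunks ended up with differently cased names, which is only one way they might diverge.

```r
# Two made-up chunks of the same table; the second has drifted column names
part_a <- data.frame(site = c("A", "B"), nox = c(10, 30))
part_b <- data.frame(Site = "C", NOx = 25)

# rbind() on data frames matches columns by name, so mismatched names error.
# Capture the error message instead of stopping the script.
bind_attempt <- tryCatch(
  rbind(part_a, part_b),
  error = function(e) conditionMessage(e)
)

# One possible fix a learner might apply: normalise the names, then bind
names(part_b) <- tolower(names(part_b))
combined <- rbind(part_a, part_b)
```

After the names are aligned, `combined` has three rows and the columns `site` and `nox`.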
