duplicate_rows can destroy existing data columns #15

Open
barryrowlingson opened this issue Dec 9, 2024 · 0 comments
Adding temporary calculation columns to a data frame and dropping them later risks silently deleting any columns with those names that are already present in the input data. For example:

dplyr::mutate(is_duplicate = FALSE)

results in:

> d = data.frame(x=1:5, is_duplicate=letters[1:5])
> d
  x is_duplicate
1 1            a
2 2            b
3 3            c
4 4            d
5 5            e
> duplicate_rows(d, .5)
  x
1 1
2 2
3 2
4 3
5 4
6 4
7 5
8 5

and my precious is_duplicate column is gone without warning. A few other column names are used for temporary calculations in this function, so those can also get surprise-deleted.

Usual solutions for this include:

  • document the behaviour in the help and tell users not to use those column names or they'll disappear
  • use more obscure names for all temporary columns, such as ones starting with a dot, and hope users won't pick them. This kicks the problem down the road until one day a user does
  • dynamically generate column names that don't exist in the data frame, but this results in seriously ugly code
  • use vectors instead of new columns, but that may mean rewriting anything that uses non-standard evaluation, which will pick up columns clashing with those vector names
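For comparison, the "dynamically generate" option can at least be factored into a small helper; here's a minimal sketch (the helper name temp_col_name is hypothetical, not part of the package):

```r
# Hypothetical helper: find a column name not already used in `data`,
# by appending underscores to a base name until it is unique.
temp_col_name <- function(data, base = ".is_duplicate") {
    nm <- base
    while (nm %in% names(data)) nm <- paste0(nm, "_")
    nm
}

temp_col_name(data.frame(x = 1:5))                      # ".is_duplicate"
temp_col_name(data.frame(.is_duplicate = letters[1:5])) # ".is_duplicate_"
```

The ugliness the bullet refers to comes later: every reference to the temporary column then has to go through this generated name (e.g. via `.data[[nm]]` in dplyr), rather than a fixed symbol.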

A better approach might be to work with row numbers instead of whole slices - generate the row numbers for the original and the sample, then rearrange, then return the rows of the input data by those numbers. This code behaves very similarly to duplicate_rows:

duplicate_rows2 = function(data, messiness = 0.1, shuffle = FALSE) {
    nr = nrow(data)
    ndups = ceiling(nr * messiness)
    # rows to duplicate, sampled with replacement
    dups = sort(sample.int(nr, ndups, replace=TRUE))

    # append the duplicated rows after the originals
    data = rbind(data, data[dups,,drop=FALSE])
    if(shuffle){
        # resample positions so the duplicates land at random spots
        dups = sample.int(nr, ndups)
    }
    # interleave the appended rows back among the originals
    return(data[order(c(1:nr, dups)),,drop=FALSE])
}
</return>
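Rerunning the reproduction from the top of this issue against this version (definition repeated so the snippet is self-contained) keeps the user's column:

```r
duplicate_rows2 = function(data, messiness = 0.1, shuffle = FALSE) {
    nr = nrow(data)
    ndups = ceiling(nr * messiness)
    dups = sort(sample.int(nr, ndups, replace=TRUE))
    data = rbind(data, data[dups,,drop=FALSE])
    if(shuffle){
        dups = sample.int(nr, ndups)
    }
    return(data[order(c(1:nr, dups)),,drop=FALSE])
}

d = data.frame(x = 1:5, is_duplicate = letters[1:5])
out = duplicate_rows2(d, .5)
names(out)  # "x" "is_duplicate" -- the column survives
nrow(out)   # 8 (5 originals + ceiling(5 * 0.5) = 3 duplicates)
```

Because the function only ever indexes and reorders rows of the input, no column name can clash with its internals.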

Differences include:

  • different values of rownames of the output
  • with shuffle=TRUE, the original function will always preserve the first and last items in the data frame, whereas my code will only preserve the first item, and can shuffle duplicates into last place (but not first place). That could be adjusted with some modification to the values of dups in the code to create a different ordering, but I'm not sure if the current behaviour is intentional so I'll leave it for now.

I know benchmarks aren't everything, especially for two functions that are not doing precisely the same thing, but this code is a lot faster:

> microbenchmark(duplicate_rows(mtcars, .7, shuffle=TRUE), duplicate_rows2(mtcars, .7, shuffle=TRUE))
Unit: microseconds
                                         expr       min        lq       mean
  duplicate_rows(mtcars, 0.7, shuffle = TRUE) 10235.981 10737.344 11339.4167
 duplicate_rows2(mtcars, 0.7, shuffle = TRUE)   515.719   563.038   658.8357
    median         uq       max neval cld
 10943.835 11357.5710 16028.858   100  a 
   619.242   653.3315  3920.442   100   b