Adding columns to a data frame as temporary calculations and dropping them later risks deleting columns with those names if they are present in the input data. For example, `duplicate_rows()` uses an `is_duplicate` column internally (messy/R/duplicate_rows.R, line 25 in 70a32ef):
```
> d = data.frame(x=1:5, is_duplicate=letters[1:5])
> d
  x is_duplicate
1 1            a
2 2            b
3 3            c
4 4            d
5 5            e
> duplicate_rows(d, .5)
  x
1 1
2 2
3 2
4 3
5 4
6 4
7 5
8 5
```
and my precious `is_duplicate` column is gone without warning. A few other column names are used for temporary calculations in this function, so those can also get surprise-deleted.
Usual solutions for this include:

- document the behaviour in the help and tell the user not to use those column names, or they'll disappear
- use more obscure names for all temporary columns, such as names starting with a dot, and hope users never pick them. This kicks the problem down the road until one day a user does
- dynamically generate column names that don't exist in the data frame, but this results in seriously ugly code (see the sketch after this list)
- use vectors instead of new columns, but that may mean rewriting anything that uses non-standard evaluation, which would pick up columns clashing with those vector names
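For the dynamic-names option, a hypothetical helper along these lines could generate a safe temporary name (`unused_name` is illustrative, not part of messy):

```r
# hypothetical helper: generate a temporary column name guaranteed
# not to clash with anything already in `data`
unused_name <- function(data, base = ".is_duplicate") {
  name <- base
  while (name %in% names(data)) {
    name <- paste0(name, "_")  # keep appending until the name is free
  }
  name
}

# e.g. unused_name(d) returns ".is_duplicate" unless d already has it
```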
A better approach might be to work with row numbers instead of whole slices: generate the row numbers for the original rows and for the sampled duplicates, rearrange them, then return the rows of the input data indexed by those numbers. This code behaves very similarly to `duplicate_rows`.
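A minimal sketch of what `duplicate_rows2` could look like; the body here is illustrative, assuming `prop` is the fraction of rows to duplicate and that the same row may be duplicated more than once:

```r
duplicate_rows2 <- function(data, prop, shuffle = FALSE) {
  n <- nrow(data)
  # sample the row numbers to duplicate instead of slicing the data frame,
  # so no temporary columns are ever added to (or dropped from) the input
  dups <- sample(n, round(n * prop), replace = TRUE)
  if (shuffle) {
    # keep the first row in place and shuffle the rest, so duplicates can
    # land anywhere except the first position
    idx <- c(seq_len(n), dups)
    idx <- c(idx[1], sample(idx[-1]))
  } else {
    # keep each duplicate adjacent to the row it copies
    idx <- sort(c(seq_len(n), dups))
  }
  # a single indexing step builds the result; rownames are left as R
  # generates them (e.g. "2.1" for a duplicate of row 2)
  data[idx, , drop = FALSE]
}
```

Because the function only ever indexes `data`, columns like `is_duplicate` pass through untouched.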
Differences include:

- the `rownames` of the output
- with `shuffle=TRUE`, the original function will always preserve the first and last items in the data frame, whereas my code will only preserve the first item, and can shuffle duplicates into last place (but not first place). That could be adjusted by modifying the values of `dups` in the code to produce a different ordering, but I'm not sure whether the current behaviour is intentional, so I'll leave it for now.
I know benchmarks aren't everything, especially for two functions that are not doing precisely the same thing, but this code is a lot faster:
```
> microbenchmark(duplicate_rows(mtcars, .7, shuffle=TRUE), duplicate_rows2(mtcars, .7, shuffle=TRUE))
Unit: microseconds
                                         expr       min        lq       mean    median         uq       max neval cld
  duplicate_rows(mtcars, 0.7, shuffle = TRUE) 10235.981 10737.344 11339.4167 10943.835 11357.5710 16028.858   100  a
 duplicate_rows2(mtcars, 0.7, shuffle = TRUE)   515.719   563.038   658.8357   619.242   653.3315  3920.442   100   b
```