duplicate_rows can destroy existing data columns #15

Open
barryrowlingson opened this issue Dec 9, 2024 · 0 comments
Adding temporary calculation columns to a data frame and dropping them later risks silently deleting any columns with those names that are already present in the input data. For example:

dplyr::mutate(is_duplicate = FALSE)

results in:

> d = data.frame(x=1:5, is_duplicate=letters[1:5])
> d
  x is_duplicate
1 1            a
2 2            b
3 3            c
4 4            d
5 5            e
> duplicate_rows(d, .5)
  x
1 1
2 2
3 2
4 3
5 4
6 4
7 5
8 5

and my precious is_duplicate column is gone without warning. A few other column names are used for temporary calculations in this function, so those can also get surprise-deleted.

Usual solutions for this include:

  • document the behaviour in the help and tell users not to use those column names or they'll disappear
  • use more obscure names for all temporary columns, such as ones starting with a dot, and hope users won't pick them. This kicks the problem down the road until one day a user does
  • dynamically generate column names that don't exist in the data frame, but this results in seriously ugly code
  • use vectors instead of new columns, but that may mean rewriting anything that uses non-standard evaluation, which will pick up columns clashing with those vector names
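For comparison, the "dynamically generate" option can at least be factored into a small helper; here's a minimal sketch (the helper name temp_col_name is hypothetical, not part of the package):

```r
# Hypothetical helper: find a column name not already used in `data`,
# by appending underscores to a base name until it is unique.
temp_col_name <- function(data, base = ".is_duplicate") {
    nm <- base
    while (nm %in% names(data)) nm <- paste0(nm, "_")
    nm
}

temp_col_name(data.frame(x = 1:5))                      # ".is_duplicate"
temp_col_name(data.frame(.is_duplicate = letters[1:5])) # ".is_duplicate_"
```

The ugliness the bullet refers to comes later: every reference to the temporary column then has to go through this generated name (e.g. via `.data[[nm]]` in dplyr), rather than a fixed symbol.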

A better approach might be to work with row numbers instead of whole slices - generate the row numbers for the original and the sample, then rearrange, then return the rows of the input data by those numbers. This code behaves very similarly to duplicate_rows:

duplicate_rows2 = function(data, messiness = 0.1, shuffle = FALSE) {
    nr = nrow(data)
    ndups = ceiling(nr * messiness)
    # rows to duplicate, sampled with replacement
    dups = sort(sample.int(nr, ndups, replace=TRUE))

    # append the duplicated rows after the originals
    data = rbind(data, data[dups,,drop=FALSE])
    if(shuffle){
        # resample positions so the duplicates land at random spots
        dups = sample.int(nr, ndups)
    }
    # interleave the appended rows back among the originals
    return(data[order(c(1:nr, dups)),,drop=FALSE])
}
</return>
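Rerunning the reproduction from the top of this issue against this version (definition repeated so the snippet is self-contained) keeps the user's column:

```r
duplicate_rows2 = function(data, messiness = 0.1, shuffle = FALSE) {
    nr = nrow(data)
    ndups = ceiling(nr * messiness)
    dups = sort(sample.int(nr, ndups, replace=TRUE))
    data = rbind(data, data[dups,,drop=FALSE])
    if(shuffle){
        dups = sample.int(nr, ndups)
    }
    return(data[order(c(1:nr, dups)),,drop=FALSE])
}

d = data.frame(x = 1:5, is_duplicate = letters[1:5])
out = duplicate_rows2(d, .5)
names(out)  # "x" "is_duplicate" -- the column survives
nrow(out)   # 8 (5 originals + ceiling(5 * 0.5) = 3 duplicates)
```

Because the function only ever indexes and reorders rows of the input, no column name can clash with its internals.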

Differences include:

  • different values of rownames of the output
  • with shuffle=TRUE, the original function will always preserve the first and last items in the data frame, whereas my code will only preserve the first item, and can shuffle duplicates into last place (but not first place). That could be adjusted with some modification to the values of dups in the code to create a different ordering, but I'm not sure if the current behaviour is intentional so I'll leave it for now.

I know benchmarks aren't everything, especially for two functions that are not doing precisely the same thing, but this code is a lot faster:

> microbenchmark(duplicate_rows(mtcars, .7, shuffle=TRUE), duplicate_rows2(mtcars, .7, shuffle=TRUE))
Unit: microseconds
                                         expr       min        lq       mean
  duplicate_rows(mtcars, 0.7, shuffle = TRUE) 10235.981 10737.344 11339.4167
 duplicate_rows2(mtcars, 0.7, shuffle = TRUE)   515.719   563.038   658.8357
    median         uq       max neval cld
 10943.835 11357.5710 16028.858   100  a 
   619.242   653.3315  3920.442   100   b