Implement vec_deduplicate to get ids necessary to deduplicate and reduplicate a vector. #1858

Closed
orgadish wants to merge 4 commits


@orgadish orgadish commented Jul 9, 2023

See #1857.

I could use any help with naming and structuring the output. This is my first time submitting a PR including changes to the C code, so I don't know if the way I returned the outputs is valid.

Timing:

# vec_deduplicate() is added by this PR, so vctrs must be built from this branch.
library(vctrs)

benchmark_deduplication <- function(total_size, iterations, func=tolower) {
  set.seed(0)
  
  with_naive_deduplication <- function(f) {
    function(x, ...) {
      ux <- unique(x)
      f(ux, ...)[match(x, ux)]
    }
  }
  
  with_vec_deduplication <- function(f) {
    function(x, ...) {
      res <- vec_deduplicate(x)
      ux <- x[res$unique_loc]
      f(ux, ...)[res$match_unique_loc]  # forward ... to f, as the naive wrapper does
    }
  }
  
  unique_vector <- sample.int(1e6, total_size) |> 
    as.character() |> 
    sample()
  
  repeated_vector <- sample.int(1e6, 10) |> 
    as.character() |> 
    rep(total_size/10) |> 
    sample()
  
  dplyr::bind_rows(
    rep = bench::mark(
      std = repeated_vector |> 
        func(),
      dedup = repeated_vector |> 
        with_naive_deduplication(func)(),
      vec_dedup = repeated_vector |> 
        with_vec_deduplication(func)(),
      iterations = iterations
    ),
    
    uni = bench::mark(
      std = unique_vector |> 
        func(),
      dedup = unique_vector |> 
        with_naive_deduplication(func)(),
      vec_dedup = unique_vector |> 
        with_vec_deduplication(func)(),
      iterations = iterations
    ),
    .id = "vector_type"
  ) |> 
    dplyr::select(vector_type, expression, median) |> 
    dplyr::mutate(dplyr::across(expression, as.character))
}

benchmark_deduplication(1e5, 100)
#> # A tibble: 6 × 3
#>   vector_type expression   median
#>   <chr>       <chr>      <bch:tm>
#> 1 rep         std         170.8ms
#> 2 rep         dedup        43.3ms
#> 3 rep         vec_dedup    25.7ms
#> 4 uni         std         386.1ms
#> 5 uni         dedup       612.1ms
#> 6 uni         vec_dedup     475ms
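
For anyone reading the benchmark without the branch installed, here is a rough base R sketch of the two index vectors it reads from `vec_deduplicate()`. The field names `unique_loc` and `match_unique_loc` are taken from the benchmark above; the helper name `deduplicate_ids()` is hypothetical, and the semantics shown (first position of each distinct value, plus each element's index into those distinct values) are an assumption inferred from how the benchmark uses them, not the C implementation in this PR.

```r
# Hypothetical base R stand-in for the ids used in the benchmark.
# Assumed semantics, not the PR's C code.
deduplicate_ids <- function(x) {
  # Position of the first occurrence of each distinct value.
  unique_loc <- which(!duplicated(x))
  # For every element of x, its index into the distinct values x[unique_loc].
  match_unique_loc <- match(x, x[unique_loc])
  list(unique_loc = unique_loc, match_unique_loc = match_unique_loc)
}

x <- c("b", "a", "b", "c", "a")
ids <- deduplicate_ids(x)

# Deduplicate, apply the function once per distinct value, then reduplicate.
ux <- x[ids$unique_loc]
result <- toupper(ux)[ids$match_unique_loc]

# The round trip should match applying the function to the full vector.
stopifnot(identical(result, toupper(x)))
```

This only pays off when the function is deterministic and vectorised element by element, which is what the `tolower` default in the benchmark assumes.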

@orgadish

Closing, since I closed #1857. I ended up implementing this in a separate package, deduped.

@orgadish orgadish closed this Oct 27, 2023
@orgadish orgadish deleted the OG_vec_dups branch October 27, 2023 04:04