
funique() returns two 0 #648

Open
mayer79 opened this issue Oct 26, 2024 · 11 comments

@mayer79

mayer79 commented Oct 26, 2024

I am currently struggling with a problem that could be related to #452

Zeros appear twice:

library(collapse)

# all columns are double
df <- OpenML::getOMLDataSet(data.id = 45106L)$data

unique(df$town)   #  1 0
funique(df$town)  #  1 0 0

levels(factor(df$town))        # [1] "0" "1"
levels(collapse::qF(df$town))  # [1] "0" "0" "1"

The same happens in group-by operations on town, e.g., when comparing base::rowsum() and fmean(): the latter produces two "0" categories.
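
A minimal self-contained sketch of the same behaviour (town here is just a toy vector; it assumes the duplicate comes from a signed negative zero in the column, and the commented outputs are what an affected collapse version is expected to return):

town <- c(1, 0, -0)   # -0 compares equal to 0 but has a different bit pattern
unique(town)          #  1 0     (base R merges 0 and -0)
funique(town)         #  1 0 0   (the signed zero shows up as a second "0")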

@mayer79
Author

mayer79 commented Oct 26, 2024

The workaround by @NicChr of adding 0 to a double vector also works in this case, but I wonder how robust it is.

@NicChr

NicChr commented Oct 26, 2024

I'm not sure how "robust" it is but I did get the idea from this thread which suggests adding 0 to eliminate negative zeros:
https://stackoverflow.com/questions/13767744/detecting-and-adjusting-for-negative-zero

It seems to work in the examples I showed, and since 0 happens to be the additive identity of the reals, the output should remain the same for most objects that base::round() accepts.
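
To make that concrete, here is a small base-R illustration of what adding 0 does to a signed zero (this is plain IEEE 754 behaviour, nothing collapse-specific):

x <- -0                          # negative zero: equal to 0, but with the sign bit set
x == 0                           # TRUE  -- comparisons ignore the sign bit
1 / x                            # -Inf  -- the sign bit is still visible
identical(x, 0, num.eq = FALSE)  # FALSE -- the bit patterns differ
1 / (x + 0)                      # Inf   -- adding 0 normalises -0 to +0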

One long-term solution could be to export these functions in the collapse package:

#' @export
round <- function(x, digits = 0, ...){
  base::round(x, digits, ...) + 0
}
#' @export
trunc <- function(x, ...){
  base::trunc(x, ...) + 0
}

@mayer79
Author

mayer79 commented Oct 26, 2024

The solution of adding 0 seems to work for C-based languages, at least if they stick to ISO/IEC 60559.

@SebKrantz
Owner

Thanks. I think, though, that this is something that needs to be solved in R. In C, adding 0 to every double or checking for 0 would have noticeable performance implications. So the performance-friendly way of handling this is to call funique() on your large vector, then add 0 in R, and call funique() again, or do it all in one go with unique().
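
Spelled out in R, that recipe might look like this (u is just an illustrative name; the second pass is cheap because it only runs over the already-deduplicated values):

u <- funique(x)                # fast first pass; may still contain both 0 and -0
if (is.double(u)) u <- u + 0   # normalise any -0 to +0 on the small set of unique values
u <- funique(u)                # second pass merges the two zeros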

SebKrantz added a commit that referenced this issue Oct 26, 2024
@SebKrantz
Owner

SebKrantz commented Oct 26, 2024

I added a branch "zero_dups" with this feature implemented in C remotes::install_github("SebKrantz/collapse", ref = "zero_dups"). Seems to work. You can help me benchmark it. I think it's like a 10% slower execution. Not sure that's worth it. Note that it only works for funique() and group() with default settings. funique(x, sort = TRUE) is implemented in Rcpp (using the sugar sort_unique algorithm, which suffers from the same issue, and which means I'd have to rewrite that as well).

@mayer79
Author

mayer79 commented Oct 26, 2024

Wow!

In my R code, I am doing:

if (is.double(x)) 
  x <- x + 0

xu <- collapse::funique(x)
[...]
M <- collapse::fmean(..., g = x)
S <- collapse::fsd(..., g = x)
[...]

Not sure if you want to slow things down by 10% just because of this...

@NicChr

NicChr commented Oct 27, 2024

I added a branch "zero_dups" with this feature implemented in C remotes::install_github("SebKrantz/collapse", ref = "zero_dups"). Seems to work. You can help me benchmark it. I think it's like a 10% slower execution. Not sure that's worth it. Note that it only works for funique() and group() with default settings. funique(x, sort = TRUE) is implemented in Rcpp (using the sugar sort_unique algorithm, which suffers from the same issue, and which means I'd have to rewrite that as well).

I'm also seeing about a 10% reduction in speed.
I also tried the ternary operator, e.g. `tpv.d = px[i] != 0.0 ? px[i] : 0.0`, and saw an even bigger drop-off in speed (~20-30%) for vectors with lots of zeros.

In terms of an R solution, you can alternatively use anyv() to check for zeros after running funique():

funique2 <- function(x, ...){
  out <- funique(x, ...)
  if (is.double(out) && anyv(out, 0)){
    unique(out)
  } else {
    out
  }
}

library(collapse)
#> collapse 2.0.17, see ?`collapse-package` or ?`collapse-documentation`
#> 
#> Attaching package: 'collapse'
#> The following object is masked from 'package:stats':
#> 
#>     D
library(bench)
x <- round(rnorm(10^7)) # Many zeros
y <- rnorm(10^7, mean = 100 ) # No zeros
z <- floor(rnorm(10^7, mean = 10^4, sd = 10^4)) # Some zeros


mark(funique(x), funique2(x), check = F) # Many zeros, not many unique values
#> # A tibble: 2 × 6
#>   expression       min   median `itr/sec` mem_alloc `gc/sec`
#>   <bch:expr>  <bch:tm> <bch:tm>     <dbl> <bch:byt>    <dbl>
#> 1 funique(x)      55ms   55.1ms      18.1    38.2MB     22.7
#> 2 funique2(x)   54.4ms   54.7ms      17.7    38.1MB     14.1
mark(funique(y), funique2(y), check = F) # No zeros, many unique values
#> # A tibble: 2 × 6
#>   expression       min   median `itr/sec` mem_alloc `gc/sec`
#>   <bch:expr>  <bch:tm> <bch:tm>     <dbl> <bch:byt>    <dbl>
#> 1 funique(y)     308ms    308ms      3.24    38.1MB     3.24
#> 2 funique2(y)    321ms    321ms      3.11    38.1MB     3.11
mark(funique(z), funique2(z), check = F) # Some zeros, medium amount of unique values
#> # A tibble: 2 × 6
#>   expression       min   median `itr/sec` mem_alloc `gc/sec`
#>   <bch:expr>  <bch:tm> <bch:tm>     <dbl> <bch:byt>    <dbl>
#> 1 funique(z)     179ms    183ms      5.47    38.7MB     2.73
#> 2 funique2(z)    170ms    177ms      5.64    40.5MB     2.82

Created on 2024-10-27 with reprex v2.1.1

@SebKrantz
Owner

Perhaps as an addition (for other functions like qF() etc.), you can add a zero by reference with out %+=% 0, which is much more efficient than out <- out + 0. Let's keep this issue open; I still need to check if there is a faster way to clear the sign bit for zeros.
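
A quick sketch of that by-reference variant (out here is just a stand-in for a double vector of unique values that may still contain -0; %+=% is collapse's modify-by-reference add from ?setop):

library(collapse)
out <- c(1, 0, -0)   # pretend this came out of funique() or similar
out %+=% 0           # adds 0 in place (no copy of out), turning -0 into +0
1 / out              # 1 Inf Inf  -- no -Inf left, so the signed zero is gone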

@SebKrantz
Owner

Actually, I just tested: using an integer 0 is pretty fast! @NicChr, can you confirm that adding an integer 0 (0 instead of 0.0) incurs a much smaller performance cost?

@NicChr

NicChr commented Oct 29, 2024

> Actually, I just tested: using an integer 0 is pretty fast! @NicChr, can you confirm that adding an integer 0 (0 instead of 0.0) incurs a much smaller performance cost?

Assuming you mean replacing '+ 0.0' with '+ 0' in the 'kit_dup.c' file...

I looked at master, zero_dups, and a third branch, zero_dups2, which replaces '+ 0.0' with '+ 0', and compared median benchmark times after installing each branch with remotes::install_local(), so the comparison includes compiler optimisations.

The data and code I used:

library(collapse)
library(bench)

set.seed(42)
x <- round(rnorm(10^7)) # Many zeros
y <- rnorm(10^7, mean = 100 ) # No zeros
z <- floor(rnorm(10^7, mean = 10^4, sd = 10^4)) # Some zeros

mark(funique(x), iterations = 15, memory = FALSE)
mark(funique(y), iterations = 15, memory = FALSE)
mark(funique(z), iterations = 15, memory = FALSE)

The results:

x -- master: 22.9 ms / zero_dups: 25.4 ms / zero_dups2: 25.5 ms
y -- master: 287 ms  / zero_dups: 299 ms  / zero_dups2: 295 ms
z -- master: 145 ms  / zero_dups: 157 ms  / zero_dups2: 153 ms

On my side there is a slight performance cost regardless of whether you add a double or an integer 0.

@SebKrantz
Owner

Ok, thanks! I will still check a bit further. But if the cost is of this order (a 2.8% performance decline for the integer-0 variant with no zeros, the most common case: 295 ms vs. 287 ms for y above), I think we can go for it.
