-
Notifications
You must be signed in to change notification settings - Fork 35
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
funique() returns two 0 #648
Comments
The workaround by @NicChr of adding 0 to a double vector also works in this case. But I wonder how robust this could be? |
I'm not sure how "robust" it is but I did get the idea from this thread which suggests adding 0 to eliminate negative zeros: It seems to work in the examples I showed and 0 happens to be a handy identity element of the reals so the output should remain the same for most objects that One long-term solution could be to export these functions in the collapse package:
|
The solution to add 0 seems to work for C based languages, at least if they stick to ISO/IEC 60559 |
Thanks. I think though this is something that needs to be solved in R. In C adding 0 to every numeric number or checking for 0 would have noticeable performance implications. So the performance friendly way of handling this is to call funique() on your large vector, then add 0 in R, and call funique() again. Or do it all in one go with unique(). |
I added a branch "zero_dups" with this feature implemented in C |
Wow! In my R code, I am doing: if (is.double(x))
x <- x + 0
xu <- collapse::funique(x)
[...]
M <- collapse::fmean(..., g = x)
S <- collapse::fsd(..., g = x)
[...] Not sure if you want to slow down the speed by 10% just because of this... |
I'm also seeing about a 10% reduction in speed. In terms of an R solution, you can alternatively use funique2 <- function(x, ...){
out <- funique(x, ...)
if (is.double(out) && anyv(out, 0)){
unique(out)
} else {
out
}
}
library(collapse)
#> collapse 2.0.17, see ?`collapse-package` or ?`collapse-documentation`
#>
#> Attaching package: 'collapse'
#> The following object is masked from 'package:stats':
#>
#> D
library(bench)
x <- round(rnorm(10^7)) # Many zeros
y <- rnorm(10^7, mean = 100 ) # No zeros
z <- floor(rnorm(10^7, mean = 10^4, sd = 10^4)) # Some zeros
mark(funique(x), funique2(x), check = F) # Many zeros, not many unique values
#> # A tibble: 2 × 6
#> expression min median `itr/sec` mem_alloc `gc/sec`
#> <bch:expr> <bch:tm> <bch:tm> <dbl> <bch:byt> <dbl>
#> 1 funique(x) 55ms 55.1ms 18.1 38.2MB 22.7
#> 2 funique2(x) 54.4ms 54.7ms 17.7 38.1MB 14.1
mark(funique(y), funique2(y), check = F) # No zeros, many unique values
#> # A tibble: 2 × 6
#> expression min median `itr/sec` mem_alloc `gc/sec`
#> <bch:expr> <bch:tm> <bch:tm> <dbl> <bch:byt> <dbl>
#> 1 funique(y) 308ms 308ms 3.24 38.1MB 3.24
#> 2 funique2(y) 321ms 321ms 3.11 38.1MB 3.11
mark(funique(z), funique2(z), check = F) # Some zeros, medium amount of unique values
#> # A tibble: 2 × 6
#> expression min median `itr/sec` mem_alloc `gc/sec`
#> <bch:expr> <bch:tm> <bch:tm> <dbl> <bch:byt> <dbl>
#> 1 funique(z) 179ms 183ms 5.47 38.7MB 2.73
#> 2 funique2(z) 170ms 177ms 5.64 40.5MB 2.82 Created on 2024-10-27 with reprex v2.1.1 |
Perhaps as an addition (for other functions like |
Actually I just tested, using an integer 0 is pretty fast! @NicChr can you confirm that adding an integer 0 (0 instead of 0.0) gives much less performance cost? |
Assuming you mean replacing '+ 0.0' with '+ 0' in the 'kit_dup.c' file.. I looked at master, zero_dups and a 3rd branch zero_dups2 which replaced '+ 0.0' with '+ 0' and compared median benchmark times after installing each branch using The data and code I used
The resultsx -- master - 22.9ms / zero_dups - 25.4 ms / zero_dups2 - 25.5 ms There's a slight performance cost on my side whether or not you add a double or int 0. |
Ok, thanks! I will also check a bit further still. But I think if it is of this order - 2.8% performance decline for the integer case with no zeros - the most common case - we can go for it. |
I am currently struggling with a problem that could be related to #452
Zeros appear twice:
The same happens in group by operations on
town
, e.g., comparingbase::rowsum()
andfmean()
. The latter gets two categories "0".The text was updated successfully, but these errors were encountered: