Use some kind of BigInt for stats? #350
I have data with billions of rows and I didn't run into problems - can you provide more context about what exactly you need?
It's not the number of rows itself that is the problem. Consider what happens if you have billions of rows of UNIX timestamps: the sum, average, median, etc. of that column is entirely wrong because it overflowed.
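(For illustration, a minimal Rust sketch of the failure mode; the row count and timestamp value are made up, and `checked_mul` stands in for the repeated additions of a streaming sum:)

```rust
fn main() {
    // Hypothetical numbers: a current-era UNIX timestamp (~1.7e9)
    // repeated a few billion times exceeds i64::MAX (~9.22e18),
    // so a plain i64 accumulator would wrap or panic.
    let timestamp: i64 = 1_700_000_000;
    let rows: i64 = 6_000_000_000;

    match timestamp.checked_mul(rows) {
        Some(sum) => println!("sum = {sum}"),
        None => println!("overflow: the true sum does not fit in i64"),
    }
}
```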
Okay, most other tools, like GNU awk or Miller, use double to avoid overflowing; would that help?

PS: I just checked with gawk on my phone by taking the current timestamp and multiplying it by 10 million, and it printed correctly. Gawk uses double to represent all numbers (by default; it can be recompiled with support for other precision). Miller uses a 64-bit signed int unless it overflows, at which point it converts to double.

PPS: Okay, both gawk and Lua get into trouble when I multiply the current timestamp by 10 billion (interestingly, they give different answers even though they should both use double). But if your data is less than that (1 billion printed fine), things like sums should not overflow.
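(A minimal sketch of the Miller-style accumulator described above: sum in i64 until an addition would overflow, then fall back to f64. The names are illustrative, not from this codebase; note that once in f64, integers above 2^53 are no longer represented exactly, which is the precision loss that makes very large sums approximate.)

```rust
/// Illustrative accumulator: exact i64 sum with f64 fallback on overflow.
enum Sum {
    Int(i64),
    Float(f64),
}

impl Sum {
    fn new() -> Self {
        Sum::Int(0)
    }

    fn add(&mut self, x: i64) {
        *self = match *self {
            Sum::Int(acc) => match acc.checked_add(x) {
                Some(s) => Sum::Int(s),
                // Overflow: continue in f64, accepting approximate results.
                None => Sum::Float(acc as f64 + x as f64),
            },
            Sum::Float(acc) => Sum::Float(acc + x as f64),
        };
    }
}

fn main() {
    let mut sum = Sum::new();
    for _ in 0..5 {
        sum.add(i64::MAX / 2); // force the i64 accumulator to overflow
    }
    match sum {
        Sum::Int(v) => println!("exact sum = {v}"),
        Sum::Float(v) => println!("approximate sum = {v}"),
    }
}
```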
I was going to point out the qsv fork, but it seems to handle the problem the same way. But then it got me thinking: what possible use could the sum of timestamps have? Arguably, this value can be safely ignored for most purposes, since it's almost surely meaningless. Other values like median or average are of course meaningful, but should generally be smaller (on the order of one timestamp). I mean, I work in genomics, and even though I could try to calculate the sum of all positions in a genome, this value has no meaning (unlike range or mean). PS: I just noticed the current Unix timestamp is a bit over 1.7 billion, which is coincidentally also the number of sequenced positions in the neanderthal genomes I work with. What a fun coincidence 😅
I guess the usage is pretty niche, but I thought it would be nice to support anyway. I needed it at some point, but I don't remember why.
I have a data set with millions of rows of timestamps and I want to perform some basic stats on it.