
Use some kind of BigInt for stats? #350

Open
hbina opened this issue Oct 16, 2024 · 5 comments

hbina commented Oct 16, 2024

I have a data set with millions of rows of timestamps, and I want to perform some basic stats on it.

@janxkoci

I have data with billions of rows and I didn't run into problems. Can you provide more context about what exactly you need?


hbina commented Dec 29, 2024

It's not the number of rows itself that is the problem. Consider what happens if you have billions of rows of UNIX timestamps: the sum, average, median, etc. of that column can come out entirely wrong because the accumulator overflowed.
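
For concreteness: a current UNIX timestamp is around 1.7e9 and i64::MAX is about 9.2e18, so an i64 sum gives out after roughly 5 billion such rows. A minimal Rust sketch of the failure mode (illustrative values, not xsv's actual code):

```rust
// Hedged sketch: how far an i64 accumulator gets when summing UNIX
// timestamps before it overflows.
fn main() {
    let ts: i64 = 1_700_000_000; // a typical recent UNIX timestamp

    // How many such timestamps fit into an i64 sum before overflow:
    // i64::MAX is about 9.2e18, so this prints roughly 5.4 billion.
    println!("rows until i64 overflow: {}", i64::MAX / ts);

    // checked_add makes the overflow observable instead of wrapping
    // (release builds) or panicking (debug builds).
    let mut sum: i64 = 0;
    let big = i64::MAX / 2 + 1;
    for additions in 1.. {
        match sum.checked_add(big) {
            Some(s) => sum = s,
            None => {
                println!("sum overflowed on addition {additions}");
                break;
            }
        }
    }
}
```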


janxkoci commented Dec 29, 2024

Okay, most other tools, like GNU awk or Miller, use double to avoid overflowing. Would that help?

PS: I just checked with gawk on my phone by taking the current timestamp and multiplying it by 10 million, and it printed correctly. Gawk uses double to represent all numbers (by default; it can be recompiled with support for other precisions). Miller uses a 64-bit signed int unless it overflows, at which point it converts to double.

PPS: Okay, both gawk and Lua get into trouble when I multiply the current timestamp by 10 billion (interestingly, they give different answers even though they should both use double). But if your data is smaller than that (1 billion printed fine), things like sums should not overflow.
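
For illustration, here is a rough Rust sketch of the fallback strategy described above (an i64 sum that switches to f64 on overflow). It is not Miller's actual implementation; it also demonstrates the 2^53 limit where f64 stops representing integers exactly, which is a plausible reason gawk and Lua disagree at timestamp × 10 billion:

```rust
// Rough sketch of "i64 sum, fall back to double on overflow".
#[derive(Clone, Copy)]
enum Sum {
    Int(i64),
    Float(f64),
}

impl Sum {
    fn add(&mut self, x: i64) {
        *self = match *self {
            Sum::Int(s) => match s.checked_add(x) {
                Some(s2) => Sum::Int(s2),
                // Overflow: continue in f64, accepting rounding error.
                None => Sum::Float(s as f64 + x as f64),
            },
            Sum::Float(s) => Sum::Float(s + x as f64),
        };
    }
}

fn main() {
    // f64 represents integers exactly only up to 2^53 (about 9.0e15);
    // above that, adjacent integers collapse into one value.
    let limit = 2f64.powi(53);
    println!("2^53 + 1 == 2^53 in f64? {}", limit + 1.0 == limit); // true

    // Force the fallback with a few huge values.
    let mut sum = Sum::Int(0);
    for _ in 0..3 {
        sum.add(i64::MAX / 2);
    }
    match sum {
        Sum::Int(s) => println!("exact i64 sum: {s}"),
        Sum::Float(s) => println!("approximate f64 sum: {s}"), // this branch runs
    }
}
```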


janxkoci commented Dec 29, 2024

I was going to point out the qsv fork, but it seems to handle the problem the same way.

But then it got me thinking: what possible use could the sum of timestamps have? Arguably, this value can safely be ignored for most purposes, since it's almost surely meaningless. Other values like the median or average are of course meaningful, but they should generally be smaller (on the order of a single timestamp).

I mean, I work in genomics, and even though I could try to calculate the sum of all positions in a genome, that value has no meaning (unlike the range or mean).

PS: I just noticed the current Unix timestamp is a bit over 1.7 billion, which is coincidentally also the number of sequenced positions in the Neanderthal genomes I work with. What a fun coincidence 😅


hbina commented Dec 30, 2024

I guess the use case is pretty niche, but I thought it would be nice to support anyway. I no longer remember why I needed it.
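
For reference, a minimal sketch of what the issue title asks about: widening the accumulator (here to i128 as a cheap stand-in for a true BigInt) makes overflow a non-issue for any realistic number of i64 inputs. Illustrative only, not a proposed patch:

```rust
// Sketch: sum i64 values in an i128 accumulator. Even i64::MAX summed
// over i64::MAX rows (about 8.5e37) still fits in an i128 (max ~1.7e38).
fn sum_wide<I: IntoIterator<Item = i64>>(values: I) -> i128 {
    values.into_iter().map(|x| x as i128).sum()
}

fn main() {
    // 10 billion timestamps of ~1.7e9 each: the true sum (~1.7e19)
    // exceeds i64::MAX (~9.2e18) but is exact in i128.
    let n: i128 = 10_000_000_000;
    let ts: i128 = 1_700_000_000;
    println!("true sum: {}", n * ts);

    // Small demonstration with values that overflow i64 immediately.
    let sum = sum_wide(std::iter::repeat(i64::MAX).take(4));
    println!("4 * i64::MAX = {sum}");
}
```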
