
Use some kind of BigInt for stats? #350

Open
hbina opened this issue Oct 16, 2024 · 5 comments

hbina commented Oct 16, 2024

I have a data set with millions of rows of timestamps, and I want to perform some basic stats on it.

@janxkoci

I have data with billions of rows and I didn't run into problems. Can you provide more context about what exactly you need?


hbina commented Dec 29, 2024

It's not the number of rows itself that is the problem. Consider what happens if you have billions of rows of UNIX timestamps: the sum, average, median, etc. of that column can come out entirely wrong because the accumulator overflowed.
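
For concreteness: a current UNIX timestamp is around 1.7e9 and i64::MAX is about 9.2e18, so an i64 sum gives out after roughly 5 billion such rows. A minimal Rust sketch of the failure mode (illustrative values, not xsv's actual code):

```rust
// Hedged sketch: how far an i64 accumulator gets when summing UNIX
// timestamps before it overflows.
fn main() {
    let ts: i64 = 1_700_000_000; // a typical recent UNIX timestamp

    // How many such timestamps fit into an i64 sum before overflow:
    // i64::MAX is about 9.2e18, so this prints roughly 5.4 billion.
    println!("rows until i64 overflow: {}", i64::MAX / ts);

    // checked_add makes the overflow observable instead of wrapping
    // (release builds) or panicking (debug builds).
    let mut sum: i64 = 0;
    let big = i64::MAX / 2 + 1;
    for additions in 1.. {
        match sum.checked_add(big) {
            Some(s) => sum = s,
            None => {
                println!("sum overflowed on addition {additions}");
                break;
            }
        }
    }
}
```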


janxkoci commented Dec 29, 2024

Okay, most other tools, like GNU awk or Miller, use double to avoid overflowing. Would that help?

PS: I just checked with gawk on my phone by taking the current timestamp and multiplying it by 10 million, and it printed correctly. Gawk uses double to represent all numbers (by default; it can be recompiled with support for other precisions). Miller uses a 64-bit signed int unless it overflows, at which point it converts to double.

PPS: Okay, both gawk and Lua get into trouble when I multiply the current timestamp by 10 billion (interestingly, they give different answers even though they should both use double). But if your data is smaller than that (1 billion printed fine), things like sums should not overflow.
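
For illustration, here is a rough Rust sketch of the fallback strategy described above (an i64 sum that switches to f64 on overflow). It is not Miller's actual implementation; it also demonstrates the 2^53 limit where f64 stops representing integers exactly, which is a plausible reason gawk and Lua disagree at timestamp × 10 billion:

```rust
// Rough sketch of "i64 sum, fall back to double on overflow".
#[derive(Clone, Copy)]
enum Sum {
    Int(i64),
    Float(f64),
}

impl Sum {
    fn add(&mut self, x: i64) {
        *self = match *self {
            Sum::Int(s) => match s.checked_add(x) {
                Some(s2) => Sum::Int(s2),
                // Overflow: continue in f64, accepting rounding error.
                None => Sum::Float(s as f64 + x as f64),
            },
            Sum::Float(s) => Sum::Float(s + x as f64),
        };
    }
}

fn main() {
    // f64 represents integers exactly only up to 2^53 (about 9.0e15);
    // above that, adjacent integers collapse into one value.
    let limit = 2f64.powi(53);
    println!("2^53 + 1 == 2^53 in f64? {}", limit + 1.0 == limit); // true

    // Force the fallback with a few huge values.
    let mut sum = Sum::Int(0);
    for _ in 0..3 {
        sum.add(i64::MAX / 2);
    }
    match sum {
        Sum::Int(s) => println!("exact i64 sum: {s}"),
        Sum::Float(s) => println!("approximate f64 sum: {s}"), // this branch runs
    }
}
```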


janxkoci commented Dec 29, 2024

I was going to point out the qsv fork, but it seems to handle the problem the same way.

But then it got me thinking: what possible use could the sum of timestamps have? Arguably, this value can safely be ignored for most purposes, since it's almost surely meaningless. Other values like the median or average are of course meaningful, but they should generally be smaller (on the order of a single timestamp).

I mean, I work in genomics, and even though I could try to calculate the sum of all positions in a genome, that value has no meaning (unlike the range or mean).

PS: I just noticed the current Unix timestamp is a bit over 1.7 billion, which is coincidentally also the number of sequenced positions in the Neanderthal genomes I work with. What a fun coincidence 😅


hbina commented Dec 30, 2024

I guess the use case is pretty niche, but I thought it would be nice to support anyway. I no longer remember why I needed it.
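
For reference, a minimal sketch of what the issue title asks about: widening the accumulator (here to i128 as a cheap stand-in for a true BigInt) makes overflow a non-issue for any realistic number of i64 inputs. Illustrative only, not a proposed patch:

```rust
// Sketch: sum i64 values in an i128 accumulator. Even i64::MAX summed
// over i64::MAX rows (about 8.5e37) still fits in an i128 (max ~1.7e38).
fn sum_wide<I: IntoIterator<Item = i64>>(values: I) -> i128 {
    values.into_iter().map(|x| x as i128).sum()
}

fn main() {
    // 10 billion timestamps of ~1.7e9 each: the true sum (~1.7e19)
    // exceeds i64::MAX (~9.2e18) but is exact in i128.
    let n: i128 = 10_000_000_000;
    let ts: i128 = 1_700_000_000;
    println!("true sum: {}", n * ts);

    // Small demonstration with values that overflow i64 immediately.
    let sum = sum_wide(std::iter::repeat(i64::MAX).take(4));
    println!("4 * i64::MAX = {sum}");
}
```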
