Go for it @kindly ! I'm not as well-versed with parquet, but it does make sense, as we're computing stats on a per-column basis, not on a per-row basis. It will definitely help with memory too, as qsv will effectively "stream" the CSV in parallelized per-column chunks.

And doing an initial pass for data type detection is genius! As you pointed out, that will allow us to optimize the stats calculation for the data type, making it not only faster for integers/floats; maybe we can even introduce some unconventional stats for strings and dates (e.g. similarity, clustering, date intervals, etc.) later.

IMHO, the downside of creating a copy of the file is not that important, as disk space is cheap/fast, especially with SSDs being a commodity nowadays.

FYI, I'm gearing up to rewrite the benchmarks over the holiday break (#98).
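To make the "type detection pass first, then type-specialized stats" idea concrete, here is a minimal Rust sketch. The names (`ColType`, `infer_type`, `column_stats`) are hypothetical and are not qsv's actual code; it's just meant to show how a numeric column can skip string handling and a text column can skip numeric parsing.

```rust
// Minimal sketch of a two-pass approach: infer a column's type first,
// then run only the stats that make sense for that type.
// All names here are illustrative, not qsv's implementation.
#[derive(Debug, PartialEq)]
enum ColType {
    Integer,
    Float,
    Text,
}

// Pass 1: try the cheapest parse first; fall back to Float, then Text.
fn infer_type(values: &[&str]) -> ColType {
    if values.iter().all(|v| v.parse::<i64>().is_ok()) {
        ColType::Integer
    } else if values.iter().all(|v| v.parse::<f64>().is_ok()) {
        ColType::Float
    } else {
        ColType::Text
    }
}

// Pass 2: numeric columns pay for numeric parsing and math,
// text columns only get cheap length-based stats.
fn column_stats(values: &[&str]) -> String {
    match infer_type(values) {
        ColType::Integer | ColType::Float => {
            let nums: Vec<f64> = values.iter().filter_map(|v| v.parse().ok()).collect();
            let mean = nums.iter().sum::<f64>() / nums.len() as f64;
            format!("numeric: n={}, mean={:.3}", nums.len(), mean)
        }
        ColType::Text => {
            let max_len = values.iter().map(|v| v.len()).max().unwrap_or(0);
            format!("text: n={}, max_len={}", values.len(), max_len)
        }
    }
}

fn main() {
    println!("{}", column_stats(&["1", "2", "3"]));    // numeric path
    println!("{}", column_stats(&["a", "bb", "ccc"])); // text path
}
```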
I have an idea about how to make `stats` faster, but it would require a fairly large change in the code and might seem counter-intuitive at first: instead of indexing the CSV file, you write out a CSV file per column (or potentially a parquet/arrow file, which are column-based). This way you parse the initial CSV file only once. Disks are fast at sequential writing, so this should be very quick. Then you do the parallelism on the columns, not the rows. This is far more cache-friendly, as you are not jumping around in the file all the time and then doing several different actions for each column in each row.
Also, for float/number columns you would only need to hold that column's numbers in memory, which would make the number-based stats take up less memory. It may even make sense to do one parse over each column to verify types, and then, if they are floats/ints, do the number stats on them.
The main downside of this is the disk space needed for essentially a copy of the file.
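For concreteness, here is a rough Rust sketch of the split-then-parallelize idea, assuming the `csv` and `rayon` crates. The file names, the one-value-per-line column format, and the toy "non-empty count" stat are placeholders, not a proposed on-disk layout or qsv's actual stats.

```rust
use std::fs::File;
use std::io::{BufWriter, Write};

use csv::ReaderBuilder;
use rayon::prelude::*;

fn main() -> Result<(), Box<dyn std::error::Error>> {
    // Single sequential pass over the input CSV: fan each field out to a
    // per-column temp file (sequential writes, so this stays disk-friendly).
    // NOTE: this sketch writes fields as raw lines and ignores quoting and
    // embedded newlines; a real version would write proper CSV/arrow.
    let mut rdr = ReaderBuilder::new().from_path("input.csv")?;
    let headers = rdr.headers()?.clone();

    let mut writers: Vec<BufWriter<File>> = Vec::with_capacity(headers.len());
    for i in 0..headers.len() {
        writers.push(BufWriter::new(File::create(format!("col_{i}.txt"))?));
    }
    for record in rdr.records() {
        let record = record?;
        for (i, field) in record.iter().enumerate() {
            writeln!(writers[i], "{field}")?;
        }
    }
    for w in &mut writers {
        w.flush()?;
    }

    // Now parallelize over columns, not rows: each worker scans one column
    // file sequentially, which is far more cache-friendly than seeking
    // around the original file.
    let summaries: Vec<String> = (0..headers.len())
        .into_par_iter()
        .map(|i| {
            let data = std::fs::read_to_string(format!("col_{i}.txt"))
                .expect("read column file");
            let non_empty = data.lines().filter(|l| !l.is_empty()).count();
            format!("{}: {} non-empty values", &headers[i], non_empty)
        })
        .collect();

    for s in summaries {
        println!("{s}");
    }
    Ok(())
}
```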