Go for it @kindly ! I'm not as well-versed with parquet, but it does make sense, as we're computing stats on a per-column basis, not on a per-row basis. It will definitely help with memory too, as qsv will effectively "stream" the CSV in parallelized per-column chunks.

And doing an initial pass for data type detection is genius! As you pointed out, that will allow us to optimize the stats calculation for the data type, making it not only faster for integers/floats; maybe we can even introduce some unconventional stats for strings and dates (e.g. similarity, clustering, date intervals, etc.) later.

IMHO, the downside of creating a copy of the file is not that important, as disk space is cheap/fast, especially with SSDs being a commodity nowadays.

FYI, I'm gearing up to rewrite the benchmarks over the holiday break (#98).
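To make the "type detection pass first, then type-specialized stats" idea concrete, here is a minimal Rust sketch. The names (`ColType`, `infer_type`, `column_stats`) are hypothetical and are not qsv's actual code; it's just meant to show how a numeric column can skip string handling and a text column can skip numeric parsing.

```rust
// Minimal sketch of a two-pass approach: infer a column's type first,
// then run only the stats that make sense for that type.
// All names here are illustrative, not qsv's implementation.
#[derive(Debug, PartialEq)]
enum ColType {
    Integer,
    Float,
    Text,
}

// Pass 1: try the cheapest parse first; fall back to Float, then Text.
fn infer_type(values: &[&str]) -> ColType {
    if values.iter().all(|v| v.parse::<i64>().is_ok()) {
        ColType::Integer
    } else if values.iter().all(|v| v.parse::<f64>().is_ok()) {
        ColType::Float
    } else {
        ColType::Text
    }
}

// Pass 2: numeric columns pay for numeric parsing and math,
// text columns only get cheap length-based stats.
fn column_stats(values: &[&str]) -> String {
    match infer_type(values) {
        ColType::Integer | ColType::Float => {
            let nums: Vec<f64> = values.iter().filter_map(|v| v.parse().ok()).collect();
            let mean = nums.iter().sum::<f64>() / nums.len() as f64;
            format!("numeric: n={}, mean={:.3}", nums.len(), mean)
        }
        ColType::Text => {
            let max_len = values.iter().map(|v| v.len()).max().unwrap_or(0);
            format!("text: n={}, max_len={}", values.len(), max_len)
        }
    }
}

fn main() {
    println!("{}", column_stats(&["1", "2", "3"]));    // numeric path
    println!("{}", column_stats(&["a", "bb", "ccc"])); // text path
}
```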
I have an idea about how to make `stats` faster, but it would require a fairly large change in the code and might seem counter-intuitive at first: instead of indexing the CSV file, you write out a CSV file per column (or potentially a parquet/arrow file, which are column-based). This way you parse the initial CSV file only once. Disks are fast at sequential writing, so this should be very quick. Then you do the parallelism on the columns, not the rows. This is far more cache-friendly, as you are not jumping around in the file all the time and then doing several different actions for each column in each row.
Also, for float/number columns you would only need to hold that column's numbers in memory, which would make the number-based stats take up less memory. It may even make sense to do one parse over each column to verify types, and then, if they are floats/ints, do the number stats on them.
The main downside of this is the disk space needed for essentially a copy of the file.
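For concreteness, here is a rough Rust sketch of the split-then-parallelize idea, assuming the `csv` and `rayon` crates. The file names, the one-value-per-line column format, and the toy "non-empty count" stat are placeholders, not a proposed on-disk layout or qsv's actual stats.

```rust
use std::fs::File;
use std::io::{BufWriter, Write};

use csv::ReaderBuilder;
use rayon::prelude::*;

fn main() -> Result<(), Box<dyn std::error::Error>> {
    // Single sequential pass over the input CSV: fan each field out to a
    // per-column temp file (sequential writes, so this stays disk-friendly).
    // NOTE: this sketch writes fields as raw lines and ignores quoting and
    // embedded newlines; a real version would write proper CSV/arrow.
    let mut rdr = ReaderBuilder::new().from_path("input.csv")?;
    let headers = rdr.headers()?.clone();

    let mut writers: Vec<BufWriter<File>> = Vec::with_capacity(headers.len());
    for i in 0..headers.len() {
        writers.push(BufWriter::new(File::create(format!("col_{i}.txt"))?));
    }
    for record in rdr.records() {
        let record = record?;
        for (i, field) in record.iter().enumerate() {
            writeln!(writers[i], "{field}")?;
        }
    }
    for w in &mut writers {
        w.flush()?;
    }

    // Now parallelize over columns, not rows: each worker scans one column
    // file sequentially, which is far more cache-friendly than seeking
    // around the original file.
    let summaries: Vec<String> = (0..headers.len())
        .into_par_iter()
        .map(|i| {
            let data = std::fs::read_to_string(format!("col_{i}.txt"))
                .expect("read column file");
            let non_empty = data.lines().filter(|l| !l.is_empty()).count();
            format!("{}: {} non-empty values", &headers[i], non_empty)
        })
        .collect();

    for s in summaries {
        println!("{s}");
    }
    Ok(())
}
```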