Hi there! To answer my 1st question, I ran some tests:

```python
import vaex as vx
import pandas as pd
from os import path as os_path

n_val = 10000  # then 100 000, then 1 000 000
ts = pd.date_range(start='2021/01/01 08:00', periods=n_val, freq='12h')
df = pd.DataFrame({'val': range(n_val), 'timestamp': ts})
vdf = vx.from_pandas(df)

fn_f5 = os_path.expanduser('~/Documents/code/data/vaex/test_dat.hdf5')
fn_ar = os_path.expanduser('~/Documents/code/data/vaex/test_dat.arrow')
fn_fr = os_path.expanduser('~/Documents/code/data/vaex/test_dat.feather')
fn_pq = os_path.expanduser('~/Documents/code/data/vaex/test_dat.parquet')

# With n_val = 10 000
%timeit vdf.export_hdf5(fn_f5)
# 4.21 ms ± 139 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)
%timeit vdf.export_arrow(fn_ar)
# 1.04 ms ± 16 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)
%timeit vdf.export_feather(fn_fr)
# 661 µs ± 25.3 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)
%timeit vdf.export_parquet(fn_pq)
# 3.12 ms ± 137 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)

# With n_val = 100 000
%timeit vdf.export_hdf5(fn_f5)
# 10.7 ms ± 221 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)
%timeit vdf.export_arrow(fn_ar)
# 2.21 ms ± 378 µs per loop (mean ± std. dev. of 7 runs, 1 loop each)
%timeit vdf.export_feather(fn_fr)
# 3.25 ms ± 44.4 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)
%timeit vdf.export_parquet(fn_pq)
# 16.8 ms ± 250 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)

# With n_val = 1 000 000
%timeit vdf.export_hdf5(fn_f5)
# 77.2 ms ± 8.64 ms per loop (mean ± std. dev. of 7 runs, 10 loops each)
%timeit vdf.export_arrow(fn_ar)
# 9.24 ms ± 169 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)
%timeit vdf.export_feather(fn_fr)
# 27.6 ms ± 725 µs per loop (mean ± std. dev. of 7 runs, 10 loops each)
%timeit vdf.export_parquet(fn_pq)
# 97.2 ms ± 4.42 ms per loop (mean ± std. dev. of 7 runs, 10 loops each)
```

Arrow format is a clear winner, and I will very likely retain it. Now for the 2nd question... Thanks for your help.
As a side note, being a big fan of fastparquet, I couldn't help comparing with it:

```python
import fastparquet as fp

n_val = 1000000  # 1 000 000
ts = pd.date_range(start='2021/01/01 08:00', periods=n_val, freq='5T')
df = pd.DataFrame({'val': range(n_val), 'timestamp': ts})
fn_pqfp = os_path.expanduser('~/Documents/code/data/vaex/test_dat_fp.parquet')

%timeit fp.write(fn_pqfp, df)
# 42.8 ms ± 433 µs per loop (mean ± std. dev. of 7 runs, 10 loops each)
# ouch... more than twice as fast as vaex ;) and some potential remains
# to be investigated yet. Promising!
# Default compression is SNAPPY.
%timeit fp.write(fn_pqfp, df, compression=None)
# 41.3 ms ± 365 µs per loop (mean ± std. dev. of 7 runs, 10 loops each)
%timeit fp.write(fn_pqfp, df, compression='BROTLI')
# 59.6 s ± 406 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
```

Using BROTLI compression on a regular basis, no wonder I fear writing parquet files :) The vaex doc states that 'binary file formats' are memory-mapped. My current backend being based on parquet & fastparquet, I might as well stay with the parquet format even for this pre-processing step, shifting to SNAPPY compression.
Hi,
I have been thinking about the architecture of a code base that will very likely rely on vaex for most of the data management.
Basically, here is the targeted layout (it will be repeated several times within a loop):
1/ I start from data that I write to a file. I can use any format; I understand that to maximize vaex speed, the hdf5 format should be favored.
2/ There is a 1st pre-processing step on this data, producing a dataframe `df`.
3/ Then I run different processings, all based on `df.to_pandas_df(chunk_size=...)`, and process the data further with numpy (rolling-window-like processing that, as far as I can see, I cannot handle with vaex).
My questions are:
- do you think I should favor the hdf5 format for step 1? (I understand the arrow format could be an alternative?)
My constraints are:
- what is the guideline for managing the common pre-processing so that it is done only once and its result is available to the subsequent processings, with the least use of memory?
To ease the discussion, here is a dataflow representative of my need (the calculations are dummies of course; they are only here to support the discussion).
In the above example, I have 2 'subsequent processings', each starting from a copy of the pre-processing step's result.
But my understanding is that with this code, I am doing the pre-processing twice, i.e. once for each 'subsequent processing'.
In the pandas world, I would do the pre-processing once, and then feed each 'subsequent processing' with a copy of the pre-processing result.
I thank you in advance for your advice on this topic.
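To make the intent concrete, the pandas-world pattern I mean (pre-process once, then hand each branch its own copy) would look roughly like this; names and calculations are dummies:

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({'val': np.arange(1_000, dtype=float)})

# Pre-processing, done exactly once
pre = df.assign(val2=df['val'] * 2)

# Subsequent processing 1: starts from its own copy
out_a = pre.copy()
out_a['rolled'] = out_a['val2'].rolling(10).mean()

# Subsequent processing 2: likewise, independent of processing 1
out_b = pre.copy()
out_b['cum'] = out_b['val2'].cumsum()

# The shared pre-processing result is untouched by either branch
assert 'rolled' not in pre.columns and 'cum' not in pre.columns
```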
Have a good day,
Bests