Hi there! To answer my 1st question, I ran some tests:

```python
import vaex as vx
import pandas as pd
from os import path as os_path

n_val = 10000  # then 100 000, then 1 000 000
ts = pd.date_range(start='2021/01/01 08:00', periods=n_val, freq='12h')
df = pd.DataFrame({'val': range(n_val), 'timestamp': ts})
vdf = vx.from_pandas(df)

fn_f5 = os_path.expanduser('~/Documents/code/data/vaex/test_dat.hdf5')
fn_ar = os_path.expanduser('~/Documents/code/data/vaex/test_dat.arrow')
fn_fr = os_path.expanduser('~/Documents/code/data/vaex/test_dat.feather')
fn_pq = os_path.expanduser('~/Documents/code/data/vaex/test_dat.parquet')

# With n_val = 10 000
%timeit vdf.export_hdf5(fn_f5)
# 4.21 ms ± 139 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)
%timeit vdf.export_arrow(fn_ar)
# 1.04 ms ± 16 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)
%timeit vdf.export_feather(fn_fr)
# 661 µs ± 25.3 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)
%timeit vdf.export_parquet(fn_pq)
# 3.12 ms ± 137 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)

# With n_val = 100 000
%timeit vdf.export_hdf5(fn_f5)
# 10.7 ms ± 221 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)
%timeit vdf.export_arrow(fn_ar)
# 2.21 ms ± 378 µs per loop (mean ± std. dev. of 7 runs, 1 loop each)
%timeit vdf.export_feather(fn_fr)
# 3.25 ms ± 44.4 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)
%timeit vdf.export_parquet(fn_pq)
# 16.8 ms ± 250 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)

# With n_val = 1 000 000
%timeit vdf.export_hdf5(fn_f5)
# 77.2 ms ± 8.64 ms per loop (mean ± std. dev. of 7 runs, 10 loops each)
%timeit vdf.export_arrow(fn_ar)
# 9.24 ms ± 169 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)
%timeit vdf.export_feather(fn_fr)
# 27.6 ms ± 725 µs per loop (mean ± std. dev. of 7 runs, 10 loops each)
%timeit vdf.export_parquet(fn_pq)
# 97.2 ms ± 4.42 ms per loop (mean ± std. dev. of 7 runs, 10 loops each)
```

Arrow format is a clear winner, and I will very likely retain it. Now for the 2nd question... Thanks for your help.
As a side note, being a big fan of fastparquet, I couldn't help comparing with it:

```python
import fastparquet as fp

n_val = 1000000  # 1 000 000
ts = pd.date_range(start='2021/01/01 08:00', periods=n_val, freq='5T')
df = pd.DataFrame({'val': range(n_val), 'timestamp': ts})
fn_pqfp = os_path.expanduser('~/Documents/code/data/vaex/test_dat_fp.parquet')

%timeit fp.write(fn_pqfp, df)
# 42.8 ms ± 433 µs per loop (mean ± std. dev. of 7 runs, 10 loops each)
# ouch... more than twice as fast as vaex ;) and some potential remains
# to be investigated yet. Promising!
# Default compression is SNAPPY.
%timeit fp.write(fn_pqfp, df, compression=None)
# 41.3 ms ± 365 µs per loop (mean ± std. dev. of 7 runs, 10 loops each)
%timeit fp.write(fn_pqfp, df, compression='BROTLI')
# 59.6 s ± 406 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
```

Using BROTLI compression on a regular basis, no wonder I fear writing parquet files :) The vaex doc states that 'binary file formats' are memory-mapped. My current backend being based on parquet & fastparquet, I might as well stay with the parquet format even for this pre-processing step, shifting to SNAPPY compression.
Hi,
I have been thinking about the architecture of a code base that will very likely rely on vaex for most of the data management.
Basically, here is the targeted layout (it will be repeated several times within a loop):
1/ I start from data that I write to a file. I can use any format; I understand that to maximize vaex speed, the hdf5 format should be favored.
2/ There is a 1st pre-processing step on this data, producing a dataframe `df`.
3/ Then I run different processings, all based on `df.to_pandas_df(chunk_size=...)`, and process the data further with numpy (rolling-window-like processing that, as far as I can see, I cannot handle with vaex).
My questions are:
- do you think I should favor the hdf5 format for step 1? (I understand the arrow format could be an alternative?)
My constraints are:
- what is the guideline for managing the common pre-processing so that it is done only once and its result is available to the subsequent processings, with the least use of memory?
To ease the discussion, here is a dataflow representative of my need (the calculations are dummies of course; they are only here to support the discussion).
In the above example, I have 2 'subsequent processings', each starting from a copy of the pre-processing step's result.
But my understanding is that with this code, I am doing the pre-processing twice, i.e. once for each 'subsequent processing'.
In the pandas world, I would do the pre-processing once, and then feed each 'subsequent processing' with a copy of the pre-processing result.
I thank you in advance for your advice on this topic.
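To make the intent concrete, the pandas-world pattern I mean (pre-process once, then hand each branch its own copy) would look roughly like this; names and calculations are dummies:

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({'val': np.arange(1_000, dtype=float)})

# Pre-processing, done exactly once
pre = df.assign(val2=df['val'] * 2)

# Subsequent processing 1: starts from its own copy
out_a = pre.copy()
out_a['rolled'] = out_a['val2'].rolling(10).mean()

# Subsequent processing 2: likewise, independent of processing 1
out_b = pre.copy()
out_b['cum'] = out_b['val2'].cumsum()

# The shared pre-processing result is untouched by either branch
assert 'rolled' not in pre.columns and 'cum' not in pre.columns
```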
Have a good day,
Bests