On very large dask-backed datasets, huge graphs can be transmitted to dask, crashing the processing #9802
Comments
Thanks for opening your first issue here at xarray! Be sure to follow the issue template!
Something is very wrong here, but we can't reproduce your issue without … FWIW, 200GB+ isn't that big a dataset, so something else is going wrong here (potentially a bad default with …)
Sorry, here is the same code without the data_structure file. I'll make the file available tomorrow; GitHub doesn't let me attach netCDF files. Minimum updated code
Associated error
I bet at least one issue is this recursive …
I encourage you to use a single open_mfdataset call to open and concatenate all datasets.
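For context, a minimal sketch of what that suggestion looks like; the file pattern and chunk sizes below are illustrative assumptions, not taken from the original report:

```
import xarray as xr

# One open_mfdataset call opens and concatenates every file at once,
# instead of looping over xr.open_dataset + xr.concat per file.
ds = xr.open_mfdataset(
    "experiment_*.nc",       # hypothetical file pattern
    combine="by_coords",     # align/concatenate along shared coordinates
    parallel=True,           # open the files in parallel with dask
    chunks={"time": 120},    # assumed chunking; tune to the real data
)
```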
I added the … I tried with a single …
What is your issue?
Chaining many operations on a huge (200GB+) dask-backed dataset leads to huge graphs (500MB+) being passed to dask. The more data, the bigger the graph, to the point where the graph is so huge (31GB in my worst case) that .compute() fails with an "error to serialize" error in msgpack.
This is a problem that appeared when we started to use xarray to process climate experiments. The amount of data we load is huge (200GB+ in my initial tests, several TB in the real case). I do not have this problem with regular basic processing (e.g., data selection and plotting with very little processing), but in this case we chain quite a lot of different operations (dimension expansions, dataset concatenations, data selections, means, standard deviations, min/max, new dimension expansions...). Running the exact same processing on less data (e.g., one year) only triggers a warning from Dask telling me that the graph is huge (between 800MB and 1.3GB) and that it will be a performance issue, along with some good-practice suggestions, but it runs. So my current workaround is exactly that: reduce the amount of data I process at once (one year at a time, see the sketch below). I guess several intermediate .compute() calls would help as well, but considering the amount of data we're talking about, that is not an option.
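A minimal sketch of that per-year workaround, assuming a time coordinate and hypothetical input/output paths (this is not the author's actual script):

```
import xarray as xr

ds = xr.open_mfdataset("experiment_*.nc", parallel=True)  # hypothetical input files

# Process one year at a time so each task graph stays small, writing
# intermediate results to disk instead of building one giant graph.
for year, ds_year in ds.groupby("time.year"):
    result = ds_year.mean("time")            # stand-in for the real chain of operations
    result.to_netcdf(f"result_{year}.nc")    # hypothetical output path; triggers computation
```

Writing each year out separately breaks the work into many small graphs, at the cost of extra I/O and manual bookkeeping.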
I don't think it's a bug... but I also don't think it's the behaviour we want from xarray. We should be able to hand dask whatever dataset we have and get it processed. Xarray should be able to split the graphs better so that it doesn't hit the limits of dask or msgpack.
How to reproduce the problem
Below is the minimum code. Be aware that, to make this code representative of the real workload, it has to generate a huge amount of (random) data (200GB+). The compute at the end uses a lot of memory; this code is made to run on an HPC node with 768GB of RAM. I cannot really make it smaller, as I think the core of the problem is that I'm processing a huge amount of data.
EDIT: here is the file needed to run this code: data_structure.zip. A version without this file is available below, but the error is a bit different.
Minimum code
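The attached script itself is not reproduced here; the following is only an illustrative sketch of the kind of reproducer described above. The dimension sizes, variable names, chunk sizes, and exact chain of operations are assumptions, not the original code:

```
import numpy as np
import xarray as xr
import dask.array as da

# Build a large random dask-backed dataset (~200GB when combined below).
nt, ny, nx = 5_000, 600, 800                          # assumed dimensions
data = da.random.random((nt, ny, nx), chunks=(100, ny, nx))
ds = xr.Dataset(
    {"var": (("time", "y", "x"), data)},
    coords={"time": np.arange(nt)},
)

# Chain many operations, similar to the processing described above:
# dimension expansion, concatenation, and several reductions.
ds = xr.concat([ds.expand_dims(member=[m]) for m in range(10)], dim="member")
stats = xr.Dataset(
    {
        "mean": ds["var"].mean("time"),
        "std": ds["var"].std("time"),
        "max": ds["var"].max("time"),
    }
)

# On the full-size data the resulting graph becomes enormous and .compute() fails.
stats.compute()
```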
Error messages
Environment
```
INSTALLED VERSIONS
------------------
commit: None
python: 3.12.7 | packaged by conda-forge | (main, Oct 4 2024, 16:05:46) [GCC 13.3.0]
python-bits: 64
OS: Linux
OS-release: 5.14.0-427.37.1.el9_4.x86_64
machine: x86_64
processor: x86_64
byteorder: little
LC_ALL: None
LANG: en_US.UTF-8
LOCALE: ('en_US', 'UTF-8')
libhdf5: 1.14.3
libnetcdf: 4.9.2

xarray: 2024.10.0
pandas: 2.2.3
numpy: 2.1.3
scipy: 1.14.1
netCDF4: 1.7.1
pydap: None
h5netcdf: 1.4.1
h5py: 3.12.1
zarr: None
cftime: 1.6.4
nc_time_axis: None
iris: None
bottleneck: None
dask: 2024.11.2
distributed: 2024.11.2
matplotlib: 3.9.2
cartopy: 0.24.0
seaborn: 0.13.2
numbagg: None
fsspec: 2024.10.0
cupy: None
pint: None
sparse: None
flox: None
numpy_groupies: None
setuptools: 75.5.0
pip: 24.3.1
conda: None
pytest: None
mypy: None
IPython: None
sphinx: None
```