Enable zero-copy to_dataframe #9792

Open
rabernat opened this issue Nov 17, 2024 · 0 comments

What is your issue?

Calling Dataset.to_dataframe() currently always copies every array in memory. That is not optimal in every scenario. We should make it possible to convert Xarray objects to Pandas objects without a memory copy.

This behavior may depend on Pandas version. As of 2.2, here are the relevant Pandas docs: https://pandas.pydata.org/docs/user_guide/copy_on_write.html

Here's the key point:

Constructors now copy NumPy arrays by default

The Series and DataFrame constructors will now copy NumPy array by default when not otherwise specified. This was changed to avoid mutating a pandas object when the NumPy array is changed inplace outside of pandas. You can set copy=False to avoid this copy.
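
The aliasing that this default guards against can be seen in a small sketch that uses only NumPy and pandas. It passes `copy` explicitly in both cases, so it does not rely on any version's default; the variable names are illustrative only.

```python
import numpy as np
import pandas as pd

arr = np.ones(5)

# Explicit copy=True: the DataFrame gets its own buffer.
df_copied = pd.DataFrame({"foo": arr}, copy=True)
# Explicit copy=False: the DataFrame column aliases `arr`.
df_shared = pd.DataFrame({"foo": arr}, copy=False)

arr[0] = 42.0  # in-place change made outside pandas

copied_first = df_copied["foo"].iloc[0]  # not affected by the change
shared_first = df_shared["foo"].iloc[0]  # sees the change
print(copied_first, shared_first)
```

This is the trade-off behind any zero-copy conversion: no extra memory, but later in-place writes to the source array show up in the DataFrame.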

When we construct DataFrames in Xarray, we do it like this:

xarray/xarray/core/dataset.py

Lines 7386 to 7388 in d5f84dd

broadcasted_df = pd.DataFrame(
    dict(zip(non_extension_array_columns, data, strict=True)), index=index
)

Here's a minimal example:

import numpy as np
import pandas as pd  # needed for the pd.DataFrame comparison below
import xarray as xr
ds = xr.DataArray(np.ones(1_000_000), dims=('x',), name="foo").to_dataset()
df = ds.to_dataframe()
print(np.shares_memory(df.foo.values, ds.foo.values))  # -> False

# can see the memory locations
print(ds.foo.values.__array_interface__)
print(df.foo.values.__array_interface__)

# compare to this
df2 = pd.DataFrame(
    {
        "foo": ds.foo.values,
    },
    copy=False
)
print(np.shares_memory(df2.foo.values, ds.foo.values))  # -> True

Solution

I propose we add a copy keyword option to Dataset.to_dataframe() (and similar for DataArray) which defaults to True (the current copying behavior) but allows users to select False if they want a zero-copy conversion.
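
As a rough illustration of what such a keyword could do, here is a minimal sketch. The helper `to_dataframe_sketch` is hypothetical and not part of xarray; it assumes a Dataset whose variables all live on a single dimension and are backed by NumPy arrays, and it simply forwards `copy` to the `pd.DataFrame` constructor. The real `to_dataframe()` also handles multiple dimensions, broadcasting, and extension arrays, none of which is covered here.

```python
import numpy as np
import pandas as pd
import xarray as xr


def to_dataframe_sketch(ds: xr.Dataset, copy: bool = True) -> pd.DataFrame:
    """Hypothetical helper (not xarray API) for single-dimension Datasets."""
    (dim,) = ds.sizes  # raises ValueError unless there is exactly one dimension
    # Collect the underlying NumPy arrays without copying them.
    columns = {name: var.values for name, var in ds.data_vars.items()}
    if dim in ds.indexes:
        index = ds.indexes[dim]
    else:
        # No coordinate on this dimension: fall back to a positional index.
        index = pd.RangeIndex(ds.sizes[dim], name=dim)
    # Forward the flag: copy=False asks pandas to reuse the arrays' memory.
    return pd.DataFrame(columns, index=index, copy=copy)


ds = xr.DataArray(np.ones(1_000), dims=("x",), name="foo").to_dataset()
df_copy = to_dataframe_sketch(ds, copy=True)
df_view = to_dataframe_sketch(ds, copy=False)
shares_when_copying = np.shares_memory(df_copy["foo"].to_numpy(), ds["foo"].values)
shares_when_viewing = np.shares_memory(df_view["foo"].to_numpy(), ds["foo"].values)
print(shares_when_copying, shares_when_viewing)
```

Keeping the copying path as the default would leave existing behavior unchanged, while callers who understand the aliasing risk could opt in to the zero-copy path.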

@rabernat added the needs triage label Nov 17, 2024
@TomNicholas added the topic-pandas-like, enhancement, and topic-performance labels and removed the needs triage label Nov 18, 2024