Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

add dataset->stacked dataarray/dataframe converters #25

Open
wants to merge 6 commits into
base: main
Choose a base branch
from

Conversation

OriolAbril
Copy link
Member

@OriolAbril OriolAbril commented Oct 15, 2024

closes #15. I am very much not happy with these names, so please suggest alternatives.
Usable and documented but tests are still missing.


📚 Documentation preview 📚: https://arviz-base--25.org.readthedocs.build/en/25/

Copy link
Contributor

@amaloney amaloney left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM. Naming is tough, but I think each method a user will understand what is happening when they call it as they are currently labelled.

Comment on lines 14 to 16
__all__ = [
"convert_to_datatree",
"convert_to_dataset",
"extract",
"to_labelled_stacked_da",
"to_labelled_stacked_df",
]
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

My opinion for naming is typically to use modules (or classes) and methods therein to write a sentence (or command) that someone can read as a sentence. For instance, the difference between

azb.converter.convert_to_datatree(...)

and

azb.convert.to_datatree(...)

is (written as sentences in my head as)

  • ArviZ base converter convert to datatree (...).
  • ArviZ base convert to datatree (...).

Semantically they both mean the same thing, but the second sentence makes more sense grammatically. Things like da, ds, df and now dt are shortened versions of objects people are familiar with, so the to_labelled_stacked_df does make sense to me, and I like it.

If we are discussing an architecture for converting a dataset, I (as a user/reader of code) would think it is easier to read (for me at least) azb.convert.to_df(ds=ds, sample_dims=None, stacked=True, labeller=...)
than azb.converters.to_labelled_stacked_df(ds=ds, sample_dims=None, stacked=False, labeller=...) since the former is closer to what I would read as this sentence;

ArviZ base convert to dataframe (using this dataset, with these sample dimensions, and ensure it is stacked, using this labeller).

This is purely an opinion though, and I think the way you have written it makes perfect sense.

I only thing I would suggest is to make things consistent, so maybe choosing how you want the methods to read. For example using all abbreviations or not (eg dataset vs ds or df vs dataframe).

original abbreviated long version
convert_to_datatree to_dt to_datatree
convert_to_dataset to_ds to_dataset
to_labelled_stacked_da to_da to_dataarray
to_labelled_stacked_df to_df to_dataframe

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

My opinion for naming is typically to use modules (or classes) and methods therein to write a sentence (or command) that someone can read as a sentence.

I like this idea! In general arviz module names are ignored, at least for now. Maybe we could do a bit more of that for new things. You'll see they are available and documented as arviz_base.xyz so far we have not expected any user to do arviz.stats.ess or arviz.plot.plot_xyz; modules were only internal file organization.


We should probably think about the functions and which do we really care about. As of now, the difference isn't only the output type but the accepted input types. convert_to_xyz are (or attempt to) be catch all functions. Whatever you give to them, if arviz knows how to handle it you'll get an xyz at the end. extract only takes inferencedata/datatree and returns a dataset or dataarray. These new ones I added are even more restrictive, only valid input is dataset and they also have a single output.

I also thought about using to_xyz directly for these new functions but I feared that would be quite confusing because xarray datasets already have methods .to_dataframe, to_dataarray and to_stacked_dataarray. These are not that nor try to be, they are somewhat similar but extremely overloaded with ArviZ conventions and labeling; they won't work on datasets where there are no sample dimensions (aka a dimension that is present in all variables of the dataset), it ignores the fact that different variables in a dataset can have different dtypes... Maybe the fact that they are arviz.to_xyz is already enough to convey that and it won't be confusing?

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Maybe the fact that they are arviz.to_xyz is already enough to convey that and it won't be confusing?

I would agree with this.

xr.set_options(display_expand_data=False)

idata = load_arviz_data("centered_eight")
to_labelled_stacked_da(idata.posterior.ds)
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

love the example

src/arviz_base/converters.py Outdated Show resolved Hide resolved
for idx_name in da.xindexes
if sample_dim in da[idx_name].dims
}
columns = pd.MultiIndex.from_arrays(list(idx_dict.values()), names=list(idx_dict.keys()))
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

nice!

@OriolAbril OriolAbril marked this pull request as ready for review October 30, 2024 18:01
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
Status: In Progress
Development

Successfully merging this pull request may close these issues.

create to_labeled_stacked_array helper function
2 participants