How can we make intake-esm more transparent? #163
Comments
It should be relatively easy to return the nested dictionary. A couple of other ideas include enabling an …
👍
More thoughts: how would this work? What would the keys be? Would it just group by all columns?
It would return a dataset for each row in the database. We could form keys from a groupby applied to all the columns, but maybe it would be more accessible if the key were just the row index. What do you think?
What would intake-esm currently do if there were no aggregations defined in the catalog?
Answer: it raises an exception. That is NOT the right behavior. Aggregation should be 100% optional in these catalogs.
Agreed, that's a bug, but it's easy to fix. Without any aggregations we can fall back to groups = self.df.groupby(self.df.columns.tolist()), and the returned keys will be of the same format. We can trigger the same behavior when aggregation is explicitly disabled.
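For reference, here is a minimal standalone sketch of that fallback; the toy DataFrame and its column names are made up for illustration, and only the groupby-over-all-columns pattern comes from the comment above:

```python
import pandas as pd

# Toy stand-in for the catalog dataframe (columns are illustrative only).
df = pd.DataFrame({
    "source_id": ["CanESM5", "IPSL-CM6A-LR"],
    "variable_id": ["o2", "o2"],
    "zstore": [
        "gs://cmip6/CMIP/CCCma/CanESM5/historical/r1i1p1f1/Oyr/o2/gn/",
        "gs://cmip6/CMIP/IPSL/IPSL-CM6A-LR/historical/r1i1p1f1/Oyr/o2/gn/",
    ],
})

# Grouping by every column yields one group per unique row, i.e. no aggregation at all.
groups = df.groupby(df.columns.tolist())
print(list(groups.groups))  # one tuple key per row, e.g. ('CanESM5', 'o2', 'gs://cmip6/...')
```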
With #164 the following works:

```python
import intake

col_file = "https://raw.githubusercontent.com/NCAR/intake-esm-datastore/master/catalogs/pangeo-cmip6.json"
col = intake.open_esm_datastore(col_file)

query = dict(experiment_id='historical', table_id='Oyr',
             variable_id='o2', grid_label='gn', member_id='r1i1p1f1')
cat = col.search(**query)

# Disable aggregations
dsets_pp = cat.to_dataset_dict(aggregate=False)
print(dsets_pp.keys())
```

Output:

```
--> The keys in the returned dictionary of datasets are constructed as follows:
        'zstore'
--> There will be 2 group(s)
dict_keys(['gs://cmip6/CMIP/CCCma/CanESM5/historical/r1i1p1f1/Oyr/o2/gn/', 'gs://cmip6/CMIP/IPSL/IPSL-CM6A-LR/historical/r1i1p1f1/Oyr/o2/gn/'])
```
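For anyone wanting to inspect the results, a minimal sketch of how the un-aggregated dictionary could be consumed (this assumes the `aggregate=False` behavior from #164 shown above; the loop itself is only illustrative):

```python
# Each entry maps a zstore path to a single, un-merged xarray.Dataset,
# so the datasets can be examined one at a time.
for key, ds in dsets_pp.items():
    print(key, dict(ds.dims))
```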
@andersy005 - nice! However, I would prefer for the keys to be the groups, not the paths, as @matt-long suggested. Are the keys the datasets themselves?
Assuming that we have a row with the following attributes:

```
activity_id                                            AerChemMIP
institution_id                                                BCC
source_id                                                BCC-ESM1
experiment_id                                              ssp370
member_id                                                r1i1p1f1
table_id                                                     Amon
variable_id                                                    pr
grid_label                                                     gn
zstore            gs://cmip6/AerChemMIP/BCC/BCC-ESM1/ssp370/r1i1...
dcpp_init_year                                                NaN
Name: 0, dtype: object
```
Should we have something along these lines?

```
{'AerChemMIP.BCC.BCC-ESM1.ssp370.r1i1p1f1.Amon.pr.gn.NaN':
 <xarray.Dataset>
 Dimensions:    (bnds: 2, lat: 64, lon: 128, time: 492)
 Coordinates:
   * lat        (lat) float64 -87.86 -85.1 -82.31 -79.53 ... 82.31 85.1 87.86
     lat_bnds   (lat, bnds) float64 dask.array<chunksize=(64, 2), meta=np.ndarray>
   * lon        (lon) float64 0.0 2.812 5.625 8.438 ... 348.8 351.6 354.4 357.2
     lon_bnds   (lon, bnds) float64 dask.array<chunksize=(128, 2), meta=np.ndarray>
   * time       (time) object 2015-01-16 12:00:00 ... 2055-12-16 12:00:00
     time_bnds  (time, bnds) object dask.array<chunksize=(492, 2), meta=np.ndarray>
 Dimensions without coordinates: bnds
 Data variables:
     pr         (time, lat, lon) float32 dask.array<chunksize=(492, 64, 128), meta=np.ndarray>
 Attributes:
     Conventions:       CF-1.7 CMIP-6.2
     activity_id:       AerChemMIP
     further_info_url:  https://furtherinfo.es-doc.org/CMIP6.BCC.BCC-ESM1...
     grid:              T42
     ...
}
```
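If the keys were built from the row attributes instead of the zstore path, here is a hypothetical sketch of how such a dotted key could be assembled; the '.' separator, the NaN handling, and the helper below are assumptions based on the example key above, not intake-esm's actual implementation:

```python
import pandas as pd

# Illustrative row mirroring the attributes shown above (zstore omitted from the key).
row = pd.Series({
    "activity_id": "AerChemMIP", "institution_id": "BCC", "source_id": "BCC-ESM1",
    "experiment_id": "ssp370", "member_id": "r1i1p1f1", "table_id": "Amon",
    "variable_id": "pr", "grid_label": "gn", "dcpp_init_year": float("nan"),
})

# Join the column values with '.'; render missing values as 'NaN' (assumed convention).
key = ".".join("NaN" if pd.isna(v) else str(v) for v in row.values)
print(key)  # AerChemMIP.BCC.BCC-ESM1.ssp370.r1i1p1f1.Amon.pr.gn.NaN
```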
I'm sitting with @naomi-henderson, and we are discussing how we might make intake-esm more transparent about what it's doing under the hood.
It would be nice if there were a mode where, rather than running all the merge operations, intake-esm returned a nested dictionary similar to the one I showed in my recursive merge demo.
This would allow users to manually descend into the individual datasets and examine them one at a time, optionally applying their own merge logic.
This should be relatively easy, since intake-esm probably has an internal data structure like this already.
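To make the idea concrete, here is a minimal sketch of what such a nested dictionary and user-driven merge could look like; the structure, keys, and toy datasets below are invented for illustration and are not intake-esm's actual internals:

```python
import xarray as xr

# Hypothetical nested dictionary: the outer key identifies a group of compatible datasets,
# the inner keys identify the individual, un-merged member datasets.
nested = {
    "CMIP.CCCma.CanESM5.historical.Oyr": {
        "r1i1p1f1": xr.Dataset({"o2": (("time",), [0.1, 0.2])}),
        "r2i1p1f1": xr.Dataset({"o2": (("time",), [0.3, 0.4])}),
    }
}

# A user could descend into the dictionary, examine the pieces one at a time,
# and apply their own merge logic, e.g. concatenating members along a new dimension.
group = nested["CMIP.CCCma.CanESM5.historical.Oyr"]
merged = xr.concat(list(group.values()), dim="member_id")
merged = merged.assign_coords(member_id=list(group.keys()))
print(merged)
```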