Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

intake server leads to ERROR: KeyError('xarray') #444

Open
wachsylon opened this issue Feb 10, 2022 · 9 comments
Open

intake server leads to ERROR: KeyError('xarray') #444

wachsylon opened this issue Feb 10, 2022 · 9 comments

Comments

@wachsylon
Copy link

Description

I tried out the intake-server. In the end, I would like to have a server for some or all cats of:
https://swiftbrowser.dkrz.de/public/dkrz_a44962e3ba914c309a7421573a6949a6/intake-esm/
The entry catalog is
dkrz_data-pool_cloudcatalog.yaml

I started the server from the same environment as from where I did client commands. I installed intake_xarray as well.

Do you have an idea what the problem is? I saw that the remote catalog wants to use sth like container: xarray what I do not really understand. What is a container? Why xarray?

What I Did

cat > temp.yaml <<EOF
description: 'DKRZ master catalog for all data pool catalogs available'
plugins:
  source:
    - module: intake_esm

sources:
  dkrz_cmip6_cloud_zarr:
    args:
      esmcol_obj: https://swift.dkrz.de/v1/dkrz_a44962e3ba914c309a7421573a6949a6/intake-esm/dkrz_cmip6_swift_zarr_fromcloud.json
    description: dkrz cmip6 data on disk saved as netcdf retrieved fromcloud
    driver:
    - intake.open_esm_datastore
>EOF
intake-server temp.yaml 1>log 2>&1 &
intake list --full intake://localhost:8898 
ERROR: KeyError('xarray')

cat log
2022-02-10 17:33:05,850 - intake - INFO - __main__.py:main:L53 - Creating catalog from:
2022-02-10 17:33:05,850 - intake - INFO - __main__.py:main:L55 -   - temp.yaml
2022-02-10 17:33:06,509 - intake - INFO - __main__.py:main:L62 - catalog_args: temp.yaml
2022-02-10 17:33:06,509 - intake - INFO - __main__.py:main:L70 - Listening on localhost:8898
2022-02-10 17:33:06,509 - intake - DEBUG - server.py:__init__:L32 - auth: {'cls': 'intake.auth.base.BaseAuth'}
2022-02-10 17:33:38,289 - intake - DEBUG - server.py:post:L241 - Source POST: {'action': 'open', 'name': 'dkrz_cmip6_cloud_zarr', 'parameters': {}, 'available_plugins': ['yaml_file_cat', 'yaml_files_cat', 'netcdf', 'opendap', 'rasterio', 'remote-xarray', 'xarray_image', 'zarr', 'alias', 'catalog', 'csv', 'intake_remote', 'json', 'jsonl', 'ndzarr', 'numpy', 'textfiles', 'tiled', 'tiled_cat', 'zarr_cat', 'esm_datastore', 'esm_group', 'esm_single_source']}
2022-02-10 17:33:38,289 - intake - DEBUG - server.py:post:L302 - Opening entry <tzis_template catalog with 480 dataset(s) from 480 asset(s)>
2022-02-10 17:33:38,289 - intake - DEBUG - server.py:add:L146 - Adding <tzis_template catalog with 480 dataset(s) from 480 asset(s)> to cache, uuid 329a34f5-4eb0-40de-a4e0-089b3a43e7e2
2022-02-10 17:33:38,289 - intake - DEBUG - server.py:post:L314 - Container: xarray, ID: 329a34f5-4eb0-40de-a4e0-089b3a43e7e2

ipython
import intake
import intake_esm
import intake_xarray
test=intake.open_catalog("intake://localhost:8898")
list(test)
Out[22]: ['dkrz_cmip6_cloud_zarr']

In [24]: test["dkrz_cmip6_cloud_zarr"]
---------------------------------------------------------------------------
KeyError                                  Traceback (most recent call last)
Input In [24], in <module>
----> 1 test["dkrz_cmip6_cloud_zarr"]

File ~/.conda/envs/appmode/lib/python3.10/site-packages/intake/catalog/base.py:436, in Catalog.__getitem__(self, key)
    427 """Return a catalog entry by name.
    428 
    429 Can also use attribute syntax, like ``cat.entry_name``, or
   (...)
    432 cat['name1', 'name2']
    433 """
    434 if not isinstance(key, list) and key in self:
    435     # triggers reload_on_change
--> 436     s = self._get_entry(key)
    437     if s.container == 'catalog':
    438         s.name = key

File ~/.conda/envs/appmode/lib/python3.10/site-packages/intake/catalog/utils.py:45, in reload_on_change.<locals>.wrapper(self, *args, **kwargs)
     42 @functools.wraps(f)
     43 def wrapper(self, *args, **kwargs):
     44     self.reload()
---> 45     return f(self, *args, **kwargs)

File ~/.conda/envs/appmode/lib/python3.10/site-packages/intake/catalog/base.py:323, in Catalog._get_entry(self, name)
    321 ups = [up for name, up in self.user_parameters.items() if name not in up_names]
    322 entry._user_parameters = ups + (entry._user_parameters or [])
--> 323 return entry()

File ~/.conda/envs/appmode/lib/python3.10/site-packages/intake/catalog/entry.py:77, in CatalogEntry.__call__(self, persist, **kwargs)
     75     raise ValueError('Persist value (%s) not understood' % persist)
     76 persist = persist or self._pmode
---> 77 s = self.get(**kwargs)
     78 if persist != 'never' and isinstance(s, PersistMixin) and s.has_been_persisted:
     79     from ..container.persist import store

File ~/.conda/envs/appmode/lib/python3.10/site-packages/intake/catalog/remote.py:459, in RemoteCatalogEntry.get(self, **user_parameters)
    457 http_args['headers'] = self.http_args['headers'].copy()
    458 http_args['headers'].update(self.auth.get_headers())
--> 459 return open_remote(
    460     self.url, self.name, container=self.container,
    461     user_parameters=user_parameters, description=self.description,
    462     http_args=http_args,
    463     page_size=self._page_size,
    464     auth=self.auth,
    465     getenv=self.getenv,
    466     persist_mode=self.catalog_pmode,
    467     getshell=self.getshell)

File ~/.conda/envs/appmode/lib/python3.10/site-packages/intake/catalog/remote.py:515, in open_remote(url, entry, container, user_parameters, description, http_args, page_size, persist_mode, auth, getenv, getshell)
    506     if container == 'catalog':
    507         response.update({'auth': auth,
    508                          'getenv': getenv,
    509                          'getshell': getshell,
   (...)
    513                          # TODO storage_options?
    514                          })
--> 515     source = container_map[container](url, http_args, **response)
    516 source.description = description
    517 return source

File ~/.conda/envs/appmode/lib/python3.10/site-packages/intake_xarray/xarray_container.py:91, in RemoteXarray.__init__(self, url, headers, **kwargs)
     78 """
     79 Initialise local xarray, whose dask arrays contain tasks that pull data
     80 
   (...)
     88 server.
     89 """
     90 import xarray as xr
---> 91 super(RemoteXarray, self).__init__(url, headers, **kwargs)
     92 self._schema = None
     93 self._ds = xr.open_zarr(self.metadata['internal'])

File ~/.conda/envs/appmode/lib/python3.10/site-packages/intake/container/base.py:44, in RemoteSource.__init__(self, url, headers, name, parameters, metadata, **kwargs)
     42 self._source_id = None
     43 self.metadata = metadata or {}
---> 44 self._get_source_id()

File ~/.conda/envs/appmode/lib/python3.10/site-packages/intake/container/base.py:55, in RemoteSource._get_source_id(self)
     53 req.raise_for_status()
     54 response = msgpack.unpackb(req.content, **unpack_kwargs)
---> 55 self._parse_open_response(response)

File ~/.conda/envs/appmode/lib/python3.10/site-packages/intake/container/base.py:58, in RemoteSource._parse_open_response(self, response)
     57 def _parse_open_response(self, response):
---> 58     dtype_descr = response['dtype']
     59     if isinstance(dtype_descr, list):
     60         # Reformat because NumPy needs list of tuples
     61         dtype_descr = [tuple(x) for x in response['dtype']]

KeyError: 'dtype'


Version information: output of intake_esm.show_versions()

Paste the output of intake_esm.show_versions() here:

import intake_esm

intake_esm.show_versions()

INSTALLED VERSIONS
------------------

cftime: 1.5.2
dask: 2022.01.1
fastprogress: 0.2.7
fsspec: 2022.01.0
gcsfs: None
intake: 0.6.5
intake_esm: 2021.8.17
netCDF4: 1.5.8
pandas: 1.3.5
requests: 2.27.1
s3fs: None
xarray: 0.21.1
zarr: 2.11.0
@andersy005
Copy link
Member

Thank you for the reproducible example, @wachsylon! I will look into this and will get back to you

@wachsylon
Copy link
Author

@andersy005 thank you i appreciate it.

@wachsylon
Copy link
Author

@andersy005
Do you have found sth? It would be really helpful as it is a blocker for me.
Thanks in advance!

@andersy005
Copy link
Member

andersy005 commented Feb 22, 2022

Thank you for your patience, @wachsylon! Unfortunately, I haven't had time to look into the root cause of this issue.

I tried out the intake-server. In the end, I would like to have a server for some or all cats of:
swiftbrowser.dkrz.de/public/dkrz_a44962e3ba914c309a7421573a6949a6/intake-esm

I'm curious.... What are the benefits of exposing these catalogs via an intake-server instead of a regular top-level/main catalog?

I am imagining a top-level YAML file with the following contents. Users should be able to point intake to this main/parent catalog

description: 'DKRZ master catalog for all data pool catalogs available'
plugins:
  source:
    - module: intake_esm

sources:
  dkrz_cmip6_cloud_zarr:
    args:
      esmcol_obj: https://swift.dkrz.de/v1/dkrz_a44962e3ba914c309a7421573a6949a6/intake-esm/dkrz_cmip6_swift_zarr_fromcloud.json
    description: dkrz cmip6 data on disk saved as netcdf retrieved fromcloud
    driver: intake_esm.esm_datastore

    another_catalog:
      args:
        esmcol_obj: https://swift.dkrz.de/v1/dkrz_a44962e3ba914c309a7421573a6949a6/intake-esm/dkrz_cmip6_swift_zarr_fromcloud.json
      description: dkrz cmip6 data on disk saved as netcdf retrieved fromcloud
      driver: intake_esm.esm_datastore
In [18]: import intake

In [19]: cat = intake.open_catalog("temp.yaml")

In [20]: list(cat)
Out[20]: ['dkrz_cmip6_cloud_zarr']

In [21]: esmcat = cat["dkrz_cmip6_cloud_zarr"]

In [22]: esmcat.df.head()
Out[22]: 
                                              prefix  ...    version
0  CMIP6.CMIP.AWI.AWI-CM-1-1-MR.historical.r1i1p1...  ...  v20181218
1  CMIP6.CMIP.AWI.AWI-CM-1-1-MR.historical.r1i1p1...  ...  v20181218
2  CMIP6.CMIP.AWI.AWI-CM-1-1-MR.historical.r1i1p1...  ...  v20181218
3  CMIP6.CMIP.AWI.AWI-CM-1-1-MR.historical.r1i1p1...  ...  v20181218
4  CMIP6.CMIP.AWI.AWI-CM-1-1-MR.historical.r1i1p1...  ...  v20181218

[5 rows x 12 columns]

@wachsylon
Copy link
Author

@andersy005
The main use case for an intake server would be to return subsets of catalogs in case users cannot handle the memory. Our data base for the CMIP6 catalog is about 4PB which gives me a .csv.gz list of about 400MB. If this is loaded entirely, users quickly exceed the available memory.

I know that we could create a hierarchy of catalogs and create catalogs on finer level but that may not fit to many use cases as e.g. in CMIP6, users are interested in several MIPs (=activity, e.g. ScenarioMIP or PMIP) at once. We could also wait for a STAC solution but at MPI-M, intake is and will be used a lot anyway so that I would like to get intake server to work. I also could not start an intake server for Pangeo btw.

If I understood correctly, the intake server can cache the catalog. Therefore, the server only loads the catalog for many requests, correct? I can set up a VM which users can use to subset the catalog.

@andersy005
Copy link
Member

Thank you for the clarification/details, @wachsylon! I haven't used the intake-server before. From my short experimentation, it appears that there are some things that are missing within intake-esm to allow seamless integration with the intake-server I'll do my best to find time to look into this today/tomorrow.

@wachsylon
Copy link
Author

Any news on that?

some things that are missing within intake-esm to allow seamless integration with the intake-server

sounds bad :(

@andersy005
Copy link
Member

Serving intake-esm catalog and assets via intake-server will require rewriting some of the components of intake-esm. Unfortunately, my schedule is too tight and I don't have time to look into this extensively any time soon but I'd be happy to review pull requests if someone is interested in pursuing this...

@wachsylon
Copy link
Author

Ok thanks for letting me know!
And thanks a lot for your help!

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

No branches or pull requests

2 participants