Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Xarray: add indexes options and better define band names #764

Open
wants to merge 5 commits into
base: main
Choose a base branch
from

Conversation

vincentsarago
Copy link
Member

This PR does:

  • add indexes options (following Rasterio convention, starts at 1)
  • better define band names in data array

ds = self.input
band_names = self.band_names
if indexes := cast_to_sequence(indexes):
assert all(v > 0 for v in indexes), "Indexes value must be >= 1"
Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

xarray won't complain when we pass data[-1] so we need this tests

if indexes != (1,):
raise ValueError(
f"Invalid indexes {indexes} for array of shape {ds.shape}"
)
Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

for 2D array we still allow indexes=1

@@ -1060,7 +1060,7 @@ def _get_reader(self, asset_info: AssetInfo) -> Tuple[Type[BaseReader], Dict]:
assert info["netcdf"].crs

img = stac.preview(assets=["netcdf"])
assert img.band_names == ["netcdf_value"]
assert img.band_names == ["netcdf_dataset"]
Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

band value default to the DataArray's name ✨

if not self._dims:
coords_name = list(self.input.coords)
if len(coords_name) > 3 and (coord := coords_name[2]):
return [str(self.input.coords[coord].data)]
Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'm hesitant to put something like {coord_name}={coord_value}} 🤷‍♂️

but we don't do this for the other band names

Copy link

@maxrjones maxrjones Nov 25, 2024

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

it's not super easy to understand what this code is trying to accomplish, but I think it's problematic that band names are based on dimensions if _dims is set as an attribute and the names are based on coordinates if not. I think it should always be based on non-spatial (as defined by rioxarray) dimensions. Since all dimensions have names, this should also make dealing with defaults simpler. Some documentation about how to map Xarray's data model into rio-tiler's assumptions would really help in general.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

but I think it's problematic that band names are based on dimensions if _dims is set as an attribute and the names are based on coordinates if not. I think it should always be based on non-spatial (as defined by rioxarray) dimensions. Since all dimensions have names, this should also make dealing with defaults simpler. Some documentation about how to map Xarray's data model into rio-tiler's assumptions would really help in general.

@maxrjones I'm not sure to get this

The main issue is that when we pass a 2D dataarray (thinks about if you select the first time dim) then dims is empty but the coordinates has the time value which correspond to the name of the array slice (maybe I'm mistaken) so I felt we needed to have a way to surface this.

from datetime import datetime
import numpy
import xarray
import rioxarray

arr = numpy.arange(0.0, 33 * 35 * 2).reshape(2, 33, 35)
data = xarray.DataArray(
    arr,
    dims=("time", "y", "x"),
    coords={
        "x": numpy.arange(-170, 180, 10),
        "y": numpy.arange(-80, 85, 5),
        "time": [datetime(2022, 1, 1), datetime(2022, 1, 2)],
    },
)
data.attrs.update({"valid_min": arr.min(), "valid_max": arr.max()})

ds = data.to_dataset(name="dataset")
da = ds["dataset"][0]
da.rio.write_crs("epsg:4326", inplace=True)

da
# >>> <xarray.DataArray 'dataset' (y: 33, x: 35)> Size: 9kB
# array([[0.000e+00, 1.000e+00, 2.000e+00, ..., 3.200e+01, 3.300e+01,
#         3.400e+01],
#        [3.500e+01, 3.600e+01, 3.700e+01, ..., 6.700e+01, 6.800e+01,
#         6.900e+01],
#        [7.000e+01, 7.100e+01, 7.200e+01, ..., 1.020e+02, 1.030e+02,
#         1.040e+02],
#        ...,
#        [1.050e+03, 1.051e+03, 1.052e+03, ..., 1.082e+03, 1.083e+03,
#         1.084e+03],
#        [1.085e+03, 1.086e+03, 1.087e+03, ..., 1.117e+03, 1.118e+03,
#         1.119e+03],
#        [1.120e+03, 1.121e+03, 1.122e+03, ..., 1.152e+03, 1.153e+03,
#         1.154e+03]])
# Coordinates:
#   * x            (x) int64 280B -170 -160 -150 -140 -130 ... 130 140 150 160 170
#   * y            (y) int64 264B -80 -75 -70 -65 -60 -55 ... 55 60 65 70 75 80
#     time         datetime64[ns] 8B 2022-01-01
#     spatial_ref  int64 8B 0
# Attributes:
#     valid_min:  0.0
#     valid_max:  2309.0

_dims = [
    d
    for d in da.dims
    if d not in [da.rio.x_dim, da.rio.y_dim]
]
_dims
# >>> []


coords_name = list(da.coords)
coords_name
# >>> ['x', 'y', 'time', 'spatial_ref']

if len(coords_name) > 3 and (coord := coords_name[2]):
    print(str(da.coords[coord].data))
# >>> 2022-01-01T00:00:00.000000000

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@maxrjones could you double check the logic here?

@@ -105,6 +110,7 @@ def __attrs_post_init__(self):
for d in self.input.dims
if d not in [self.input.rio.x_dim, self.input.rio.y_dim]
]
assert len(self._dims) in [0, 1], "Can't handle >=4D DataArray"
Copy link
Member Author

@vincentsarago vincentsarago Nov 13, 2024

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

add a check to make sure we don't have 4D arrays

@sharkinsspatial
Copy link

@vincentsarago I'll try to look through this in a bit more detail today but it might be worth considering another naming convention for "indexes" in this context. In xarray terminology, "indexes" refer to underlying indices which enable positional or label based selection https://docs.xarray.dev/en/latest/user-guide/indexing.html (and a computational cost associated with building Pandas, label based indexes), so it might be confusing for more xarray oriented users.

@vincentsarago
Copy link
Member Author

@sharkinsspatial 🙏

I've used indexes for compatibility with titiler's and other Reader (Rasterio based). Introducing a new name for an array slice might add complexity :-(

But I think we can improve the documentation 🙏

Copy link

@maxrjones maxrjones left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I left some comments, and have a couple of nits that you could feel free to ignore:

rio_tiler/io/xarray.py Outdated Show resolved Hide resolved
if not self._dims:
coords_name = list(self.input.coords)
if len(coords_name) > 3 and (coord := coords_name[2]):
return [str(self.input.coords[coord].data)]
Copy link

@maxrjones maxrjones Nov 25, 2024

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

it's not super easy to understand what this code is trying to accomplish, but I think it's problematic that band names are based on dimensions if _dims is set as an attribute and the names are based on coordinates if not. I think it should always be based on non-spatial (as defined by rioxarray) dimensions. Since all dimensions have names, this should also make dealing with defaults simpler. Some documentation about how to map Xarray's data model into rio-tiler's assumptions would really help in general.

with XarrayReader(data) as dst:
img = dst.info()
print(img.band_descriptions)[0]
>> ("b1", "2022-01-01T00:00:00.000000000")

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

why have the band name returned as b1 rather than time or time1 (corresponding to <dimension-name><1-based index>)?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It was mostly for compatibility but I think having <dim_name><idx> would be 👍

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think I will have the same issue when there are only 2 Dimensions what should the band name be? For now I would use the same code I'm using to get the band name from the first non-geo coordinate 🤷

@vincentsarago
Copy link
Member Author

I find it more robust to use by-name based integer lookup rather than strictly positional (https://docs.xarray.dev/en/stable/user-guide/indexing.html#quick-overview).

But that means we need to have the name of the dimension , I think rioxarray enforce the data array to be in ({dim}, y, x) so using integer lookup is safe I think.

OR we could change the indexes to be the dim value (e.g `indexes="2022-01-01T00:00:00.000000000") but I feel that will makes things supper complex

@maxrjones
Copy link

OR we could change the indexes to be the dim value (e.g `indexes="2022-01-01T00:00:00.000000000") but I feel that will makes things supper complex

I'm not sure this would need to be an either or decision, in that we could support positional indexing now and potentially label-based indexing later on. I have a few questions about the vision for titiler.xarray that would help inform responses to your comments above. I could open these as separate issues:

  • Would XarrayReader eventually support 4D datasets (e.g., band, time, y, x)?
  • Would the ImageData returned by XarrayReader eventually permit 3D output of the form (time, y, x) where (y, x) correspond to a TMS tile? This structure is the approach the carbonplan/maps uses to reduce get requests and could simplify temporal animations.
  • As you mentioned earlier, should the XarrayReader ever support label based indexing? While more complex, this appraoch is a key benefit of Xarray's data model.

@vincentsarago
Copy link
Member Author

@maxrjones

Would XarrayReader eventually support 4D datasets (e.g., band, time, y, x)?

Right now, XarrayReader expect a DataArray, and in titiler.xarray we make sure to select the first time https://github.com/developmentseed/titiler/blob/fb1bb1e6f5515bae6cccebbd9dd3a5b8ebce2d35/src/titiler/xarray/titiler/xarray/io.py#L228-L240

With the proposed change, I'm trying to ease the case where no time would be selected and have a DataArray in form of (n_time, x, y)

This structure is the approach the carbonplan/maps uses to reduce get requests and could simplify temporal animations.

Absolutely

As you mentioned earlier, should the XarrayReader ever support label based indexing? While more complex, this appraoch is a key benefit of Xarray's data model.

🤔 I'm not sure how the API will look like for label based indexing. For now we defined indexes=Union[int, List[int]], if we were going to use label which method should we use sel, loc, isel? Would the input be a dictionary ?

we could have indexes be of type Union[Union[int, str], List[Union[int, str]]] and thus access integer based indexing but also label based indexing

from datetime import datetime

import numpy
import xarray

da = xarray.DataArray(
    numpy.arange(0.0, 33 * 35 * 2).reshape(2, 33, 35),
    dims=("time", "y", "x"),
    coords={
        "x": numpy.arange(-170, 180, 10),
        "y": numpy.arange(-80, 85, 5),
        "time": [datetime(2022, 1, 1), datetime(2022, 1, 2)],
    },
)


>>> da[0]
<xarray.DataArray (y: 33, x: 35)> Size: 9kB
array([[0.000e+00, 1.000e+00, 2.000e+00, ..., 3.200e+01, 3.300e+01,
        3.400e+01],
       [3.500e+01, 3.600e+01, 3.700e+01, ..., 6.700e+01, 6.800e+01,
        6.900e+01],
       [7.000e+01, 7.100e+01, 7.200e+01, ..., 1.020e+02, 1.030e+02,
        1.040e+02],
       ...,
       [1.050e+03, 1.051e+03, 1.052e+03, ..., 1.082e+03, 1.083e+03,
        1.084e+03],
       [1.085e+03, 1.086e+03, 1.087e+03, ..., 1.117e+03, 1.118e+03,
        1.119e+03],
       [1.120e+03, 1.121e+03, 1.122e+03, ..., 1.152e+03, 1.153e+03,
        1.154e+03]])
Coordinates:
  * x        (x) int64 280B -170 -160 -150 -140 -130 ... 130 140 150 160 170
  * y        (y) int64 264B -80 -75 -70 -65 -60 -55 -50 ... 50 55 60 65 70 75 80
    time     datetime64[ns] 8B 2022-01-01

>>> da.loc["2022-01-01"]
<xarray.DataArray (y: 33, x: 35)> Size: 9kB
array([[0.000e+00, 1.000e+00, 2.000e+00, ..., 3.200e+01, 3.300e+01,
        3.400e+01],
       [3.500e+01, 3.600e+01, 3.700e+01, ..., 6.700e+01, 6.800e+01,
        6.900e+01],
       [7.000e+01, 7.100e+01, 7.200e+01, ..., 1.020e+02, 1.030e+02,
        1.040e+02],
       ...,
       [1.050e+03, 1.051e+03, 1.052e+03, ..., 1.082e+03, 1.083e+03,
        1.084e+03],
       [1.085e+03, 1.086e+03, 1.087e+03, ..., 1.117e+03, 1.118e+03,
        1.119e+03],
       [1.120e+03, 1.121e+03, 1.122e+03, ..., 1.152e+03, 1.153e+03,
        1.154e+03]])
Coordinates:
  * x        (x) int64 280B -170 -160 -150 -140 -130 ... 130 140 150 160 170
  * y        (y) int64 264B -80 -75 -70 -65 -60 -55 -50 ... 50 55 60 65 70 75 80
    time     datetime64[ns] 8B 2022-01-0

🤔 BUT thinking about titiler, it would be pretty hard to know when to convert the input to string/interger. Having integer based indexing makes the API simpler and compatible for any Readers

@vincentsarago
Copy link
Member Author

Quick Update: In this state we have XarrayReader which is compatible with other Reader (having indexes: Integer. We could in future allow more complex indexing, but this could also be customized at TiTiler level (with custom Dataset Reader).

This PR also updates the band_names attributes to be more consistent.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

3 participants