Using Dask in reprojection #119
Comments
It would definitely be a nice feature to have. One gotcha to watch out for is that reprojection considers points outside of the dask chunk due to resampling. |
This is something we are constantly working on in the pyresample project (https://github.com/pytroll/pyresample/). We have a lot of dask compatible solutions but no explicitly defined interfaces. We're working towards a "Resampler" class that would make it easier to control dask-based resampling algorithms. Note that we aren't calling in to the GDAL resampling algorithms. It should be noted that resampling is not the easiest operation to port to dask friendliness. One thing is as @snowman2 mentioned, you often have to deal with an overlap of source data chunks. Dask has some functions that can help with that like map_overlap (https://docs.dask.org/en/latest/array-overlap.html). The other issue is just the nature of resampling and trying to do it per chunk. You typically have to build a custom dask graph if you want to do anything efficient. For example, I have an Elliptical Weighted Averaging (EWA) algorithm in pyresample that I just recently rewrote this way. You can see the progress here: pytroll/pyresample#284. The hard part is efficiently saying "this input chunk should map to this output chunk". This type of check typically means you have to have M * N tasks where M is number of input chunks and N is number of output chunks. I've had to do some ugly things in that PR to try to do things "intelligently". |
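To make the chunk-overlap point concrete, here is a minimal pure-Python sketch (not pyresample or dask code). The 3-point mean is only a stand-in for a resampling kernel, and the names `smooth`, `smooth_chunked`, and `depth` are made up for illustration. It mimics what `map_overlap` is meant to handle: each chunk is padded with a few neighbouring samples, processed, then trimmed.

```python
def smooth(values):
    """3-point mean; edges reuse the nearest sample (stand-in for a resampling kernel)."""
    n = len(values)
    out = []
    for i in range(n):
        left = values[max(i - 1, 0)]
        right = values[min(i + 1, n - 1)]
        out.append((left + values[i] + right) / 3)
    return out


def smooth_chunked(values, chunk, depth=1):
    """Apply smooth() per chunk, padding each chunk with `depth` neighbouring samples."""
    out = []
    for start in range(0, len(values), chunk):
        stop = min(start + chunk, len(values))
        lo = max(start - depth, 0)
        hi = min(stop + depth, len(values))
        padded = smooth(values[lo:hi])
        # Trim the halo so only the chunk's own samples are kept.
        out.extend(padded[start - lo:start - lo + (stop - start)])
    return out


data = [0, 1, 4, 9, 16, 25, 36, 49]
whole = smooth(data)                                  # reference: no chunking
with_halo = smooth_chunked(data, chunk=3, depth=1)    # matches the reference
no_halo = smooth_chunked(data, chunk=3, depth=0)      # wrong at chunk edges
```

With a halo as wide as the kernel's reach, the chunked result agrees with the unchunked one; without it, values at chunk boundaries differ. Real resampling kernels and reprojection footprints need a larger and less regular halo than this toy case.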
Thanks @djhoese, I'll check out pyresample. |
Oh I should have mentioned that the interfaces that aren't fully defined in pyresample are being used in Satpy. That might be a better starting point if you're curious. Of course, we're always working towards better rioxarray, rasterio, and plain old xarray compatibility. |
I wonder if |
Very interesting. I'm not sure whether virtual warping is doing something extremely magical or just doing the warping when you ask for it. Your concurrency link has been brought up before in a dask issue and I had some concerns that I discussed here: dask/dask#3255 (comment) I'll quote it here:
|
It should just do it when you ask for it - lazy loading as I understand it. |
This is exactly the bit I am plugging for - read in and warp the data into a dask array in parallel. |
https://rasterio.readthedocs.io/en/latest/api/rasterio.vrt.html
|
I read through the code, and it seems like it is already possible to do - I haven't tested loading it in parallel with dask personally. Would be interesting to test sometime. |
Hi, I worked on Pyresample gradient search resampler (originally implemented by @mraspaud) in pytroll/pyresample#282 a while back. The resampler now maps the source chunks to those, and only those, target chunks they have any overlap with. All the chunks that do not overlap are discarded to reduce the amount of computation. As there are often multiple input chunks contributing to an output chunk, these contributions are stacked before the full output array is formed. The source -> target chunk mapping needs some non-dask computations to check for the overlap, but these are cheap operations for cases where the source data has a proper projection. The actual resampling (implemented in Cython) is delayed and the delayed results are concatenated (and padded) to form the resulting 2/3D array. There are certainly some optimizations I missed, but I'm pretty happy with the results so far :-) |
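A rough sketch of the source -> target chunk mapping idea, in plain Python. This is not the pyresample implementation: the extents are hypothetical, both grids are assumed to be already expressed in the same CRS, and `boxes_overlap` and `chunk_boxes` are illustrative helpers.

```python
def boxes_overlap(a, b):
    """True if two (xmin, ymin, xmax, ymax) boxes share any area."""
    return a[0] < b[2] and b[0] < a[2] and a[1] < b[3] and b[1] < a[3]


def chunk_boxes(xmin, ymin, xmax, ymax, nx, ny):
    """Split an extent into an nx-by-ny grid of chunk bounding boxes."""
    dx = (xmax - xmin) / nx
    dy = (ymax - ymin) / ny
    return [
        (xmin + i * dx, ymin + j * dy, xmin + (i + 1) * dx, ymin + (j + 1) * dy)
        for j in range(ny)
        for i in range(nx)
    ]


# Hypothetical extents, assumed to be in the target CRS already.
source = chunk_boxes(0, 0, 100, 100, nx=4, ny=4)   # M = 16 source chunks
target = chunk_boxes(40, 40, 80, 80, nx=2, ny=2)   # N = 4 target chunks

# For every target chunk, keep only the source chunks that touch it.
mapping = {
    t: [s for s, sbox in enumerate(source) if boxes_overlap(sbox, tbox)]
    for t, tbox in enumerate(target)
}
pairs = sum(len(v) for v in mapping.values())
```

In this toy layout only 16 of the 64 possible (source, target) pairs survive the bounding-box check, so only those would need a delayed resampling task. With real data the chunk footprints are reprojected polygons rather than axis-aligned boxes, which is where the harder cases come from.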
Looks like the ODC group has a nice solution for dask reprojection: #130 (comment) |
Nice! You think we can steal that code? 😄 |
It would be a lot of code to take over. There are some bits from the My preference would be for them to put it into a separate package that we could import in and use. Might be a worthwhile discussion to have. |
Opened an issue: opendatacube/odc-tools#58 🤞 |
+1 on integrating dask/lazy arrays into the reprojection method. Suggested above was using Warped_VR, which I have used on some projects. Here is a gist showing the workflow: https://gist.github.com/rmg55/875a2b79ee695007a78ae615f1c916b2... I also noticed that gdal 3.1 has support for multidimensional rasters (including mem and vrt). Not sure exactly how that might be integrated into rasterio, but I wonder if that could provide some additional gdal functionality into xarray (rioxarray) objects. |
That is pretty neat. Thanks for sharing 👍. How much did it improve performance? Side note: I am curious if the
It has been brought up in the past: https://rasterio.groups.io/g/dev/topic/33040759. If there are enough interested parties willing to develop/maintain that feature, there is a slim possibility of it being added to rasterio. |
Is this something you would be interested in adding to the examples section in the documentation of |
Happy to share @snowman2, and yes, I would be happy to add the example to the docs. Also, I just saw this pr in xarray that allows xarray to read the warped_vrt object directly, so I have removed some unnecessary lines of code in the gist I posted above.
This is a pretty small file (26 MB), so I think the dask/scheduling overhead is greater than or equal to the parallel speedup in I/O.
Yes, you can use the
Thanks for linking to the rasterio discussion on multi-dim rasters. The thing I have not been able to figure out is how to build a warped_vrt from an xarray object. I think it could be useful to represent xarray objects as a VRT that, rather than pointing to on-disk files, points to a numpy or dask array. However, I am not sure that is possible with the VRT format... |
That would be great 👍
Interesting. Not sure if it makes sense to do so as the VRT is file based IIRC, but I haven't given it much thought. Feel free to open a new issue to look into and discuss this further. |
I'm working on something based on this and with generous advice from Kirill Kouzoubov (though I'm sure I've made errors of my own, and I haven't set up the error handling very well, yet). I added multiband chunking and support for applying separate kwargs to different band chunks. |
It took a few hours with the rio commandline to warp a 40GB vrt dataset to a set bounds and 400GB upscale - be interesting to see how the dask/rioxarray variety does - match chunk size to the rasterio memory use 'defaults'? |
Yes, looking forward to trying the odc-geo version out sometime. |
Any progress on this? I am looking to follow the large reproject suggestions, but with a vrt we build e.g.
but I get I am getting a sense that there is something more to it than this... |
To report back here (should have done it earlier, sorry): This allows various operations (e.g. resampling) to be done block by block when the source and target arrays are in different CRSs. Resample blocks ensures in principle that the input data to the custom |
Just wanted to let people know that |
Thanks very much Kirill. |
How big a thing have you tested? :) |
Haven't really done any real data processing with it yet. But for testing I was using zoom level 8 COG of blue marble as input
but extracting smaller regions in various projections/resolutions. |
@Kirill888 This is great! This makes me hopeful that we're converging on a common practice for all this type of work. Everyone is generally dependent on pyproj or rasterio CRS objects and these new resampling functions (the GDAL-based one in odc-geo and the custom algorithms in pyresample) are doing things chunk by chunk. What you describe with shapely polygons and CRSes being used to determine the graph/combination of source->target chunks sounds a lot like what @mraspaud started doing in I'm excited to read more about odc-geo's Geometry (shapely + CRS) objects too since this is something that pyresample kind of does in a hacky everything-is-separate kind of way and we really want something like what odc-geo is describing. |
@djhoese thanks for kind words.
although, for "Dask reproject" to work you also need Design goal is to allow for minimal functionality to work with just those dependencies and expand available feature set when more libraries are installed. We have extracted useful geometry shape related utilities out of
But probably most useful class in
^ we differ in the way we handle non-axis aligned footprints @mraspaud I have looked only briefly at |
@Kirill888 yes, the Regarding the |
Hi @snowman2, all, We also implemented a Dask-delayed reprojection in our package in the frame of an ongoing Xarray accessor (#687), see here: GlacioHack/geoutils#537, with part of the code inspired from @Kirill888's The code is stand-alone (does not depend on any other class of our package: https://github.com/rhugonnet/geoutils/blob/fafd2399d35ff22bf79bc631a9d805ab19a18a94/geoutils/raster/delayed.py#L387), so could be moved easily to Rioxarray if you think it makes sense 🙂. Also, while writing the test suite for the function, I noticed some pretty big discrepancies between |
I believe rioxarray loads the whole data array before reprojecting. Reprojecting a data array chunk by chunk would make it possible to handle much bigger data. It would also remain in the Dask lazy execution model.
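As a toy illustration of the chunk-by-chunk idea (not how rioxarray or GDAL work internally), the sketch below resamples a 1-D "raster row" onto a different grid with nearest neighbour, reading only the source window each target chunk needs. Both grids are assumed to share a CRS, and `src_index`, `resample_chunked`, and the `read_window` callback are hypothetical names.

```python
import math


def src_index(k, t_origin, t_res, s_origin, s_res):
    """Source cell containing the centre of target cell k (nearest neighbour)."""
    x = t_origin + (k + 0.5) * t_res
    return math.floor((x - s_origin) / s_res)


def resample_chunked(read_window, n_target, chunk, grid):
    """Build the target array one chunk at a time.

    read_window(lo, hi) is a hypothetical callback returning source[lo:hi];
    only the window a given chunk needs is ever requested.
    """
    out, largest = [], 0
    for start in range(0, n_target, chunk):
        stop = min(start + chunk, n_target)
        idx = [src_index(k, *grid) for k in range(start, stop)]
        lo, hi = min(idx), max(idx) + 1
        window = read_window(lo, hi)
        largest = max(largest, len(window))  # track peak window size
        out.extend(window[i - lo] for i in idx)
    return out, largest


source = list(range(1000))        # stand-in for a large raster row
grid = (100.0, 2.5, 0.0, 1.0)     # t_origin, t_res, s_origin, s_res (made up)
result, largest = resample_chunked(lambda lo, hi: source[lo:hi], 200, 50, grid)
expected = [source[src_index(k, *grid)] for k in range(200)]
```

The chunked result equals the all-at-once result, while the largest window held at any time is much smaller than the full source. In a Dask version each loop iteration would become a lazy task; a real reprojection also has to transform chunk footprints between CRSs and pad the window for the resampling kernel.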