Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Nearest neighbor lookup using H3 at a 'fixed' resolution #16

Open
benbovy opened this issue Sep 4, 2020 · 8 comments
Open

Nearest neighbor lookup using H3 at a 'fixed' resolution #16

benbovy opened this issue Sep 4, 2020 · 8 comments

Comments

@benbovy
Copy link
Member

benbovy commented Sep 4, 2020

I'm wondering if there could be more efficient alternatives to (K-D / Ball / R) trees for the cases where the (lat, lon) data points to be indexed are not strictly evenly spaced but where the distances between direct neighbors are still pretty much similar for the whole dataset. (Is it the case for NEMO and/or FASEOM2 model grids?)

There are some examples here and here on performing spatial search using the H3 library.

Here, the basic idea would be:

  1. Choose a fixed H3 resolution res
  • Here is the table of all available resolutions.
  • This has to be chosen carefully, depending on the average "resolution" of the grid points to be indexed. It will impact both performance and memory consumption for storing the pre-computed index.
  1. Build the index:
  • Compute the H3 index of each grid point at the given res. This is quite efficient and could be easily done in parallel using Dask (for 80_000_000 points it takes <10 seconds using all the cores on my Intel i7 laptop).
  • Build a hash-table so that we can retrieve the original data points (positional index) from the computed H3 index values. Not sure at all about this part, though. Would this be efficient? The size of the table could be potentially huge. I guess numba's support for dict would be useful here? How could we leverage Dask for this?
  1. Nearest neighbor query:
  • Compute the H3 index of each query point at the given res.
  • Retrieve all candidates using the hash-table computed above. We could iterate over neighboring H3 cells using kRing until at least one candidate is found (or until we reach a given tolerance). Unfortunately, there's not yet a vectorized implementation of kRing in h3's Python bindings.
  • If multiple candidates are found, use a brute-force approach (or any other smarter approach) for selecting the nearest neighbor among those candidates.

Whether the query is efficient or not will depend of res. Ideally, there should be only a handful of candidates in the direct H3 cell vicinity (kRing=1) for each query point.

@benbovy
Copy link
Member Author

benbovy commented Sep 5, 2020

Build a hash-table so that we can retrieve the original data points (positional index) from the computed H3 index values. Not sure at all about this part, though. Would this be efficient? The size of the table could be potentially huge. I guess numba's support for dict would be useful here?

Some naive benchmark results using numba's typed Dict within a jitclass:

  • add 80_000_000 key/values (int/int): 4 seconds
  • get 10_000_000 random keys: 3.2 seconds

@benbovy
Copy link
Member Author

benbovy commented Sep 5, 2020

One limitation, though: Numba's typed.Dict currently doesn't support mutable lists as values.

@willirath
Copy link
Contributor

Re fixed resolution

Typical NEMO grids vary in resolution. The example data from the global (nominal 0.5deg) climate model looks like this (shown is (e1t**2 + e2t**2)**0.5):

image

That's a factor of 5 in one dim and hence ~25 of the smallest grid cells fitting into the biggest grid cells.

@benbovy
Copy link
Member Author

benbovy commented Sep 7, 2020

Oh yes that makes sense.

We could somehow leverage H3 cells at multiple levels (resolutions) for that case...

@fhk
Copy link

fhk commented Sep 24, 2021

was there any more research done on h3 indexes?

@benbovy
Copy link
Member Author

benbovy commented Sep 24, 2021

Not on my side. There's an example of nearest-neighbor search in https://github.com/joaofig/geo-spoke, although I'm not sure that it is fully vectorized (queries are for one point location at a time). H3's Python bindings still has a limited number of functions that are vectorized (available under the unstable namespace), so I'm afraid it would be hard to come with an efficient and xoak-friendly implementation written in pure-Python.

I've chosen to go with S2Geometry instead (#17) as it as more built-in features like a point index based on a binary tree. I still had to write custom, vectorized Python bindings for it, though.

@benbovy
Copy link
Member Author

benbovy commented Oct 2, 2024

This issue may be rather addressed in https://github.com/xarray-contrib/xdggs.

@fhk
Copy link

fhk commented Oct 2, 2024

Wow @benbovy thanks for the follow up!

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

No branches or pull requests

3 participants