Feature request: convert UInt64 S2 IDs (as string) to S2 cell IDs #250

Open
sheffe opened this issue Nov 22, 2023 · 2 comments

@sheffe

sheffe commented Nov 22, 2023

A hopefully-small feature request. I'm seeing data in the wild that uses UInt64 representations of S2 cell IDs. An example here is the global rooftop dataset in Source Cooperative where UInt64 IDs show up in the hive partition path.

It's the same underlying bit representation as what you're doing in S2 cells or in the class conversion to bit64::integer64 (thanks again for that), but ... I haven't found a way to read UInt64 at all in R except as a string. I'm guessing the translation would have to be in Rcpp.

Thanks, as always!

@paleolimbot
Collaborator

Is it too verbose for your use case to use bit64::as.integer64()? Or does the signed/unsigned bit cause problems with the string representation? An example would help!

@sheffe
Author

sheffe commented Nov 24, 2023

Full example below (should have had this before! :) )

The signed/unsigned distinction is causing the problem, and I couldn't figure out how to address it all inside R.

Some S2 implementations in other languages use unsigned Int64s (range 0 to 18,446,744,073,709,551,615) instead of signed Int64s (range -9,223,372,036,854,775,808 to 9,223,372,036,854,775,807). The S2 documentation often uses UInt64 too, e.g. here. Int64 and UInt64 values are identical in the underlying 64 bits, but when those bits are parsed as an integer type they give different numbers whenever the top bit is set. R can only handle the signed Int64 form as a numeric (as far as I can tell! maybe wrong).
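To make the "same bits, different numbers" point concrete, here is a minimal standalone C++ sketch (no Rcpp). The helper names are hypothetical and chosen only for illustration. It copies the raw bytes with std::memcpy, so the bit pattern is untouched and only the interpretation changes.

```cpp
#include <cstdint>
#include <cstring>

// Hypothetical helper: view the 64 bits of an unsigned value as a signed Int64.
// std::memcpy copies the bytes verbatim, so no bit is altered.
inline std::int64_t same_bits_as_signed(std::uint64_t u) {
    std::int64_t s;
    std::memcpy(&s, &u, sizeof s);
    return s;
}

// Hypothetical helper: the reverse view, signed Int64 bits read as UInt64.
inline std::uint64_t same_bits_as_unsigned(std::int64_t s) {
    std::uint64_t u;
    std::memcpy(&u, &s, sizeof u);
    return u;
}
```

Values below 2^63 come out unchanged in both directions. Values at or above 2^63 come out negative in the signed view, because the top bit is read as the sign.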

Here's a specific case. I have a dataset partitioned by S2 cell ID. One ID is 10710685813793882112. I want to map this UInt64 ID to R's s2_cell_id so I can pre-filter my data geographically.

  • 10710685813793882112 is the UInt64 value
  • The bitstring for that UInt64 is "1001010010100100000000000000000000000000000000000000000000000000"
  • The signed Int64 for that bitstring is -7736058259915669504
  • The s2 package cell for that Int64 is "94a4", level 5
    • result given by: s2::as_s2_cell(bit64::as.integer64(-7736058259915669504))
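The bullets above can be cross-checked without R. The following standalone C++ sketch uses hypothetical function names and assumes the input is a valid S2 cell ID. It derives the bitstring, the token (the 16-digit hex form with trailing zeros stripped), and the level (from the position of the lowest set bit) directly from the UInt64 value.

```cpp
#include <bitset>
#include <cstdint>
#include <string>

// 64-character bitstring, most significant bit first.
inline std::string s2_bitstring(std::uint64_t id) {
    return std::bitset<64>(id).to_string();
}

// S2 token: 16 hex digits with trailing zeros removed ("X" for id 0).
inline std::string s2_token(std::uint64_t id) {
    static const char* digits = "0123456789abcdef";
    std::string hex(16, '0');
    for (int i = 15; i >= 0; --i) {
        hex[i] = digits[id & 0xF];
        id >>= 4;
    }
    const std::string::size_type last = hex.find_last_not_of('0');
    return last == std::string::npos ? std::string("X") : hex.substr(0, last + 1);
}

// Level of a valid cell ID: the lowest set bit sits at position 2 * (30 - level).
inline int s2_level(std::uint64_t id) {
    int trailing_zeros = 0;
    while (trailing_zeros < 64 && ((id >> trailing_zeros) & 1u) == 0) {
        ++trailing_zeros;
    }
    return 30 - trailing_zeros / 2;
}
```

For the ID in this example, `s2_token(std::stoull("10710685813793882112"))` gives "94a4" and `s2_level` gives 5, matching what the s2 package reports for the signed value.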

I got to the conclusions above ^ by passing through some ChatGPTd C++, as below:

Step 1:

#include <Rcpp.h>
#include <bitset>
#include <cstdint> // For int64_t and uint64_t
#include <string>  // For std::string and std::stoull

// [[Rcpp::export]]
std::string stringToBitstring(std::string str) {
  // Convert the decimal string to an unsigned Int64.
  // std::stoull throws if the input is not numeric or is out of range.
  uint64_t unsignedValue = std::stoull(str);

  // Reinterpret the same 64 bits as a signed Int64 (no bits change)
  int64_t signedValue = reinterpret_cast<int64_t&>(unsignedValue);

  // Convert to a 64-character bitstring, most significant bit first
  std::bitset<64> bits(signedValue);
  return bits.to_string();
}

Step 2:

Rcpp::sourceCpp("stringToBitstring.cpp")
(format_bitstring <- stringToBitstring("10710685813793882112"))
(format_int64 <- bit64::as.integer64.bitstring(format_bitstring))
(format_s2cell <- s2::as_s2_cell(format_int64))

The other approaches I tested and rejected:

All in R with bit64:

  • bit64::as.integer64(10710685813793882112) warns with NAs produced by integer64 overflow (out of Int64 range)
  • bit64::as.integer64("10710685813793882112") is worse -- this yields 9223372036854775807 with no warnings, and it's the wrong answer!
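The value 9223372036854775807 is exactly the largest signed Int64, which is what a saturating signed parse returns for out-of-range input. C's strtoll behaves this way: it returns LLONG_MAX and sets errno to ERANGE. Whether bit64 uses strtoll internally is an assumption, not something confirmed in this thread. The sketch below (hypothetical names) only shows that a signed parser clamps silently unless the caller checks errno.

```cpp
#include <cerrno>
#include <cstdlib>
#include <string>

// Hypothetical result type: the parsed value plus an explicit overflow flag.
struct SignedParse {
    long long value;
    bool overflowed;
};

// Parse a decimal string as a signed 64-bit integer and report clamping.
inline SignedParse parse_signed(const std::string& s) {
    errno = 0;  // strtoll only sets errno on failure, so clear it first
    const long long v = std::strtoll(s.c_str(), nullptr, 10);
    return SignedParse{v, errno == ERANGE};
}
```

A caller that ignores `overflowed` sees a plausible-looking cell ID that points somewhere else entirely, which is why the silent case is the more dangerous of the two.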

Conclusions
Main thought: this UInt64/Int64 distinction is a very very easy place to make mistakes! Caught me by surprise.

I can't see a way to handle UInt64 representations of the S2 integer ID in R directly, hence the C++ hack above.

I think it'd be great to have a direct translation from UInt64 formatted as a string to an S2 cell ID, so users could handle them without leaving R themselves.
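As a rough idea of what such a translation has to guard against, here is a standalone C++ sketch of a strict parser. The function name and signature are hypothetical and this is not the s2 package's API. std::stoull is lenient: it skips leading whitespace, ignores trailing junk, and accepts "-1" by wrapping it to the largest UInt64. The sketch instead accepts only a non-empty run of decimal digits that fits in 64 bits. It is hand-rolled to stay within C++11; std::from_chars from C++17 would be a shorter route.

```cpp
#include <cstdint>
#include <cstring>
#include <string>

// Hypothetical strict conversion: decimal UInt64 string -> signed Int64 with
// the same bits. Returns false (leaving `out` untouched) on any invalid input.
inline bool strict_uint64_string_to_int64(const std::string& s, std::int64_t& out) {
    if (s.empty()) return false;
    std::uint64_t acc = 0;
    for (char c : s) {
        if (c < '0' || c > '9') return false;             // rejects signs, spaces, junk
        const std::uint64_t d = static_cast<std::uint64_t>(c - '0');
        if (acc > (UINT64_MAX - d) / 10) return false;    // acc * 10 + d would overflow
        acc = acc * 10 + d;
    }
    std::memcpy(&out, &acc, sizeof out);                  // same bits, signed view
    return true;
}
```

Returning a failure flag rather than a clamped value means a bad ID surfaces as a missing value on the R side instead of as a wrong cell.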
