-
Notifications
You must be signed in to change notification settings - Fork 1k
WIP: Initial Parquet native geometry type support #8222
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
base: main
Are you sure you want to change the base?
Conversation
impl ExtensionType for Geometry { | ||
const NAME: &'static str = "geoarrow.wkb"; | ||
|
||
type Metadata = (); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Should this type be String
? In geoarrow-schema
I have a separate Metadata
struct defined, and store this type as Arc<Metadata>
: https://docs.rs/geoarrow-schema/latest/geoarrow_schema/struct.WkbType.html#associatedtype.Metadata
In this case we might have different metadata for the Geometry/Geography types.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
A string in theory allows for fully arbitrary metadata, which seems undesirable. I think the metadata would normalize between GEOMETRY and GEOGRAPHY if we enforced the default values for CRS and Edge Interpolation for GEOMETRY types:
GEOMETRY is used for geospatial features in the Well-Known Binary (WKB) format with linear/planar edges interpolation.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
It's important for edge interpolation, at least, to be stored under an Option
, because that's the primary thing that separates a geometry from a geography (and that's how to discern whether input data with extension name geoarrow.wkb
is a geometry or a geography).
FWIW it may be preferable to vendor the existing WkbType
struct that's already in use in geoarrow-schema
.
It's not clear that we should have separate Geometry
/Geography
arrow extension types. They're implemented here to parallel the Parquet implementation, but perhaps should be removed in favor of a single WkbType
or similar.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Ah yes, that makes sense for the edge interpolation needing an Option
. I didn't initially realize the spec defined edge interpolation algorithms didn't include a LINEAR
variant.
My reading of the parquet spec suggests the only difference between the two types is linear vs non-linear edge interpolation. Am I missing something? If not, are you privy to why they are separate types? On the surface it feels odd to carry them as separate types.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think they're separate types in the Parquet spec because some execution operations that are valid on Geometry types are not valid on Geography types and vice versa. And so it's useful to know in query planning whether the described operation is valid.
@paleolimbot probably has ideas on why this is the case too
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think the Parquet spec chose to separate them because this is how those types are separated in most database systems. GeoArrow/GeoParquet predated the Parquet spec change, had fairly wide implementation support already, and thus couldn't make the breaking change to match it so here we all are 🙂 .
For what it's worth I usually just store enum EdgeType { Planar, Spherical, ... }
and handle the fact that in JSON the "edge_type"
key is supposed to be omitted on serialization/serialization.
parquet::LogicalType::GEOMETRY(t) => LogicalType::Geometry { crs: t.crs }, | ||
parquet::LogicalType::GEOGRAPHY(t) => LogicalType::Geography { | ||
crs: t.crs, | ||
algorithm: t.algorithm, | ||
}, |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This needs to be fixed to do a small CRS transform here, since Parquet defines a string: projjson:<crs body>
.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
In Arrow C++ this was a bit tricky to deal with:
projjson:crs_key
defines a CRS where the PROJJSON payload is stored in the Parquet schema metadata. This requires passing the schema metadata into this function (or something between this LogicalType and the Arrow Field output), which by the looks of it is a bit tricky to do here. In Arrow C++ we do handle this case (i.e., we'll return read a Parquetprojjson:crs_key
as GeoArrow whose CRS payload is the actual PROJJSON that was in the Parquet schema metadata.srid:1234
can be converted to GeoArrow as{"crs": "1234", "crs_type": "srid"}
.<any other string>
can be converted to GeoArrow as{"crs": "<any other string>"}
. In Arrow C++ I also made the choice to ensure that if<any other string>
happened to be valid JSON, that it shouldn't be double-escaped as a JSON string.
There are test files with all three of these in the apache/parquet-testing repo 🙂
LogicalType::Geometry { crs } => parquet::LogicalType::GEOMETRY(GeometryType { crs }), | ||
LogicalType::Geography { crs, algorithm } => { | ||
parquet::LogicalType::GEOGRAPHY(GeographyType { crs, algorithm }) | ||
} |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
And likewise for converting back from GeoArrow CRS to parquet logical type CRS
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
In Arrow C++ I tried to detect crs
values that were definitely lon/lat and ensure the Parquet logical type was "unset" to maximize compatibility with readers that didn't want to or couldn't handle CRS values.
"EPSG:4326"
"OGC:CRS84"
{..., "id": {"authority": "EPSG", "code": 4326}
{..., "id": {"authority": "EPSG", "code": "4326"}
{..., "id": {"authority": "OGC", "code": "CRS84"}
/// Geospatial features in the WKB format with linear/planar edges interpolation | ||
#[derive(Debug, Default, Clone, PartialEq, Eq, Hash)] | ||
pub struct Geometry { | ||
crs: Option<String>, |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Based on the verbage in the spec I'm wondering if it would be better to have crs
be required instead of optional.
The default CRS
OGC:CRS84
means that the geospatial features must be stored
in the order of longitude/latitude based on the WGS84 datum.Custom CRS can be specified by a string value. It is recommended to use an
identifier-based approach like srid.
My reading of this is that CRS is actually required for all Geometry and Geography types, and users may supply a custom string, but if not it will always be String::from("OGC:CRS84")
. I suspect having this be required may also simplify some of the downstream code because there will be less Option
handling that needs to happen.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
From the Parquet end, the CRS is never absent (or: if omitted, we know it's lon/lat). From the GeoArrow end, an omitted CRS means "the producer does not know", the verbage of which comes from GeoParquet and basically follows what a CRS of None
is in GeoPandas. All to say that I think this part is correct (but you're right that it's annoying to handle for everybody).
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Interesting, so if I understand correctly this implies that it's always possible to convert Parquet geospatial types into valid GeoArrow (CRS is always known, either explicit or the spec defined default), but not all GeoArrow may be safely serialized as Parquet (an unset CRS is ambiguous).
If the above is true, it seems like this may be a trade off between ensuring correctness and broader ecosystem compatibility when serializing to parquet. This is one of those scenarios where there's probably not a "right" answer. I would personally err on the side of ensuring correctness, and error when the input Arrow data cannot be safely serialized to parquet.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yes, an "unset" GeoArrow CRS should probably error on conversion to a Parquet type. I don't feel too strongly about this...Arrow C++ doesn't error here at the moment because it was pragmatic at the time (it allowed a suite of tests to pass without a JSON parser, which at the time required adding a C++ dependency that I wasn't sure would work).
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thank you for starting this!
parquet::LogicalType::GEOMETRY(t) => LogicalType::Geometry { crs: t.crs }, | ||
parquet::LogicalType::GEOGRAPHY(t) => LogicalType::Geography { | ||
crs: t.crs, | ||
algorithm: t.algorithm, | ||
}, |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
In Arrow C++ this was a bit tricky to deal with:
projjson:crs_key
defines a CRS where the PROJJSON payload is stored in the Parquet schema metadata. This requires passing the schema metadata into this function (or something between this LogicalType and the Arrow Field output), which by the looks of it is a bit tricky to do here. In Arrow C++ we do handle this case (i.e., we'll return read a Parquetprojjson:crs_key
as GeoArrow whose CRS payload is the actual PROJJSON that was in the Parquet schema metadata.srid:1234
can be converted to GeoArrow as{"crs": "1234", "crs_type": "srid"}
.<any other string>
can be converted to GeoArrow as{"crs": "<any other string>"}
. In Arrow C++ I also made the choice to ensure that if<any other string>
happened to be valid JSON, that it shouldn't be double-escaped as a JSON string.
There are test files with all three of these in the apache/parquet-testing repo 🙂
LogicalType::Geometry { crs } => parquet::LogicalType::GEOMETRY(GeometryType { crs }), | ||
LogicalType::Geography { crs, algorithm } => { | ||
parquet::LogicalType::GEOGRAPHY(GeographyType { crs, algorithm }) | ||
} |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
In Arrow C++ I tried to detect crs
values that were definitely lon/lat and ensure the Parquet logical type was "unset" to maximize compatibility with readers that didn't want to or couldn't handle CRS values.
"EPSG:4326"
"OGC:CRS84"
{..., "id": {"authority": "EPSG", "code": 4326}
{..., "id": {"authority": "EPSG", "code": "4326"}
{..., "id": {"authority": "OGC", "code": "CRS84"}
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
TLDR is that this is looking great. Thank you @kylebarron @BlakeOrth and @paleolimbot
It is so exciting to see this progressing.
The other thing i think would be really helpful is to add a test that shows reading some of the files from https://github.com/apache/parquet-testing/tree/master/data/geospatial and verifying they come back with the correctly annotated logical types
LogicalType::Geography { crs, algorithm } => { | ||
use arrow_schema::extension::GeographyAlgorithm; | ||
|
||
use crate::format::EdgeInterpolationAlgorithm; | ||
|
||
let algorithm = match algorithm { | ||
Some(EdgeInterpolationAlgorithm::ANDOYER) => Some(GeographyAlgorithm::ANDOYER), | ||
Some(EdgeInterpolationAlgorithm::KARNEY) => Some(GeographyAlgorithm::KARNEY), | ||
Some(EdgeInterpolationAlgorithm::SPHERICAL) => { | ||
Some(GeographyAlgorithm::SPHERICAL) | ||
} | ||
Some(EdgeInterpolationAlgorithm::THOMAS) => Some(GeographyAlgorithm::THOMAS), | ||
Some(EdgeInterpolationAlgorithm::VINCENTY) => { | ||
Some(GeographyAlgorithm::VINCENTY) | ||
} | ||
None => None, | ||
_ => None, | ||
}; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
A minor comment is that it would be nice to encapsulate the geography types a bit more, for so, for example, this looks something more like the following
LogicalType::Geography { crs, algorithm } => { | |
use arrow_schema::extension::GeographyAlgorithm; | |
use crate::format::EdgeInterpolationAlgorithm; | |
let algorithm = match algorithm { | |
Some(EdgeInterpolationAlgorithm::ANDOYER) => Some(GeographyAlgorithm::ANDOYER), | |
Some(EdgeInterpolationAlgorithm::KARNEY) => Some(GeographyAlgorithm::KARNEY), | |
Some(EdgeInterpolationAlgorithm::SPHERICAL) => { | |
Some(GeographyAlgorithm::SPHERICAL) | |
} | |
Some(EdgeInterpolationAlgorithm::THOMAS) => Some(GeographyAlgorithm::THOMAS), | |
Some(EdgeInterpolationAlgorithm::VINCENTY) => { | |
Some(GeographyAlgorithm::VINCENTY) | |
} | |
None => None, | |
_ => None, | |
}; | |
LogicalType::Geography { crs, algorithm } => { | |
let algorithm = algorithm.map(|algorithm| EdgeInterpolationAlgorithm::from); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The same comment applies below
@@ -231,9 +231,18 @@ pub enum LogicalType { | |||
/// A Variant value. | |||
Variant, | |||
/// A geospatial feature in the Well-Known Binary (WKB) format with linear/planar edges interpolation. | |||
Geometry, | |||
Geometry { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
FWIW this would be a breaking API change I think so we either have to wait for 57 in October or make some backwards compatibility adjustments
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yes, finding this was a bummer. I figure that we'll have to wait till v57 to merge, because we need to be able to carry on the information from the GeometryType
and GeographyType
.
This is a pretty early draft to start discussion. It is not ready to merge.
Which issue does this PR close?
Rationale for this change
Bare-bones support for reading and writing Parquet geometry/geography data.
What changes are included in this PR?
This is lightly modeled after #7404, but that was an early draft of variant support. It looks like progress has gone on in
parquet-variant
, but it's hard to understand how that crate is intended to be used.parquet::basic::LogicalType
because we need to maintain the type-level information ofGeometry
/Geography
that exists onparquet::format::LogicalType
.Geometry
andGeography
arrow logical types, which each implementarrow_schema::extension::ExtensionType
. This PR currently puts those inarrow-schema
, but it's not clear that they should live there. If we created a newparquet-geometry
crate, then at first glance I'd putGeometry
andGeography
there, but that's an issue assuming thatparquet
won't depend onparquet-geometry
.Geometry
andGeography
arrow logical types are currently feature-gated behindarrow_canonical_extension_types
, while these types aren't actually canonical yet. Should we create a newgeometry
feature flag for theparquet
crate?Perhaps we'll create a separate
parquet-geometry
crate for the builders that maintain a bounding box when writing geometry data.Are these changes tested?
Not yet, I first want to put the PR up to get some initial discussion.
Are there any user-facing changes?
Unfortunately, I think this requires a breaking change to
parquet::basic::LogicalType
because we need to maintain the type-level information ofGeometry
/Geography
that exists onparquet::format::LogicalType
.