Skip to content

Conversation

kylebarron
Copy link
Contributor

@kylebarron kylebarron commented Aug 25, 2025

This is a pretty early draft to start discussion. It is not ready to merge.

Which issue does this PR close?

Rationale for this change

Bare-bones support for reading and writing Parquet geometry/geography data.

What changes are included in this PR?

This is lightly modeled after #7404, but that was an early draft of variant support. It looks like progress has gone on in parquet-variant, but it's hard to understand how that crate is intended to be used.

  • Breaking: changes parquet::basic::LogicalType because we need to maintain the type-level information of Geometry/Geography that exists on parquet::format::LogicalType.
  • Defines new Geometry and Geography arrow logical types, which each implement arrow_schema::extension::ExtensionType. This PR currently puts those in arrow-schema, but it's not clear that they should live there. If we created a new parquet-geometry crate, then at first glance I'd put Geometry and Geography there, but that's an issue assuming that parquet won't depend on parquet-geometry.
  • Should there be two new logical types? Or should there be just one on the Arrow side?
  • Those Geometry and Geography arrow logical types are currently feature-gated behind arrow_canonical_extension_types, while these types aren't actually canonical yet. Should we create a new geometry feature flag for the parquet crate?

Perhaps we'll create a separate parquet-geometry crate for the builders that maintain a bounding box when writing geometry data.

Are these changes tested?

Not yet, I first want to put the PR up to get some initial discussion.

Are there any user-facing changes?

If there are any breaking changes to public APIs, please call them out.

Unfortunately, I think this requires a breaking change to parquet::basic::LogicalType because we need to maintain the type-level information of Geometry/Geography that exists on parquet::format::LogicalType.

@github-actions github-actions bot added parquet Changes to the parquet crate arrow Changes to the arrow crate labels Aug 25, 2025
impl ExtensionType for Geometry {
const NAME: &'static str = "geoarrow.wkb";

type Metadata = ();
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Should this type be String? In geoarrow-schema I have a separate Metadata struct defined, and store this type as Arc<Metadata>: https://docs.rs/geoarrow-schema/latest/geoarrow_schema/struct.WkbType.html#associatedtype.Metadata

In this case we might have different metadata for the Geometry/Geography types.

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

A string in theory allows for fully arbitrary metadata, which seems undesirable. I think the metadata would normalize between GEOMETRY and GEOGRAPHY if we enforced the default values for CRS and Edge Interpolation for GEOMETRY types:

GEOMETRY is used for geospatial features in the Well-Known Binary (WKB) format with linear/planar edges interpolation.

Copy link
Contributor Author

@kylebarron kylebarron Aug 25, 2025

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It's important for edge interpolation, at least, to be stored under an Option, because that's the primary thing that separates a geometry from a geography (and that's how to discern whether input data with extension name geoarrow.wkb is a geometry or a geography).

FWIW it may be preferable to vendor the existing WkbType struct that's already in use in geoarrow-schema.

It's not clear that we should have separate Geometry/Geography arrow extension types. They're implemented here to parallel the Parquet implementation, but perhaps should be removed in favor of a single WkbType or similar.

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Ah yes, that makes sense for the edge interpolation needing an Option. I didn't initially realize the spec defined edge interpolation algorithms didn't include a LINEAR variant.

My reading of the parquet spec suggests the only difference between the two types is linear vs non-linear edge interpolation. Am I missing something? If not, are you privy to why they are separate types? On the surface it feels odd to carry them as separate types.

Copy link
Contributor Author

@kylebarron kylebarron Aug 25, 2025

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think they're separate types in the Parquet spec because some execution operations that are valid on Geometry types are not valid on Geography types and vice versa. And so it's useful to know in query planning whether the described operation is valid.

@paleolimbot probably has ideas on why this is the case too

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think the Parquet spec chose to separate them because this is how those types are separated in most database systems. GeoArrow/GeoParquet predated the Parquet spec change, had fairly wide implementation support already, and thus couldn't make the breaking change to match it so here we all are 🙂 .

For what it's worth I usually just store enum EdgeType { Planar, Spherical, ... } and handle the fact that in JSON the "edge_type" key is supposed to be omitted on serialization/serialization.

Comment on lines +862 to +866
parquet::LogicalType::GEOMETRY(t) => LogicalType::Geometry { crs: t.crs },
parquet::LogicalType::GEOGRAPHY(t) => LogicalType::Geography {
crs: t.crs,
algorithm: t.algorithm,
},
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This needs to be fixed to do a small CRS transform here, since Parquet defines a string: projjson:<crs body>.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

In Arrow C++ this was a bit tricky to deal with:

  • projjson:crs_key defines a CRS where the PROJJSON payload is stored in the Parquet schema metadata. This requires passing the schema metadata into this function (or something between this LogicalType and the Arrow Field output), which by the looks of it is a bit tricky to do here. In Arrow C++ we do handle this case (i.e., we'll return read a Parquet projjson:crs_key as GeoArrow whose CRS payload is the actual PROJJSON that was in the Parquet schema metadata.
  • srid:1234 can be converted to GeoArrow as {"crs": "1234", "crs_type": "srid"}.
  • <any other string> can be converted to GeoArrow as {"crs": "<any other string>"}. In Arrow C++ I also made the choice to ensure that if <any other string> happened to be valid JSON, that it shouldn't be double-escaped as a JSON string.

There are test files with all three of these in the apache/parquet-testing repo 🙂

https://github.com/apache/arrow/blob/f8b20f131a072ef423e81b8a676f42a82255f4ec/cpp/src/parquet/geospatial/util_json_internal.cc#L105-L125

Comment on lines +909 to +912
LogicalType::Geometry { crs } => parquet::LogicalType::GEOMETRY(GeometryType { crs }),
LogicalType::Geography { crs, algorithm } => {
parquet::LogicalType::GEOGRAPHY(GeographyType { crs, algorithm })
}
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

And likewise for converting back from GeoArrow CRS to parquet logical type CRS

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

In Arrow C++ I tried to detect crs values that were definitely lon/lat and ensure the Parquet logical type was "unset" to maximize compatibility with readers that didn't want to or couldn't handle CRS values.

  • "EPSG:4326"
  • "OGC:CRS84"
  • {..., "id": {"authority": "EPSG", "code": 4326}
  • {..., "id": {"authority": "EPSG", "code": "4326"}
  • {..., "id": {"authority": "OGC", "code": "CRS84"}

https://github.com/apache/arrow/blob/f8b20f131a072ef423e81b8a676f42a82255f4ec/cpp/src/parquet/geospatial/util_json_internal.cc#L47-L81

/// Geospatial features in the WKB format with linear/planar edges interpolation
#[derive(Debug, Default, Clone, PartialEq, Eq, Hash)]
pub struct Geometry {
crs: Option<String>,

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Based on the verbage in the spec I'm wondering if it would be better to have crs be required instead of optional.

The default CRS OGC:CRS84 means that the geospatial features must be stored
in the order of longitude/latitude based on the WGS84 datum.

Custom CRS can be specified by a string value. It is recommended to use an
identifier-based approach like srid.

My reading of this is that CRS is actually required for all Geometry and Geography types, and users may supply a custom string, but if not it will always be String::from("OGC:CRS84"). I suspect having this be required may also simplify some of the downstream code because there will be less Option handling that needs to happen.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

From the Parquet end, the CRS is never absent (or: if omitted, we know it's lon/lat). From the GeoArrow end, an omitted CRS means "the producer does not know", the verbage of which comes from GeoParquet and basically follows what a CRS of None is in GeoPandas. All to say that I think this part is correct (but you're right that it's annoying to handle for everybody).

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Interesting, so if I understand correctly this implies that it's always possible to convert Parquet geospatial types into valid GeoArrow (CRS is always known, either explicit or the spec defined default), but not all GeoArrow may be safely serialized as Parquet (an unset CRS is ambiguous).

If the above is true, it seems like this may be a trade off between ensuring correctness and broader ecosystem compatibility when serializing to parquet. This is one of those scenarios where there's probably not a "right" answer. I would personally err on the side of ensuring correctness, and error when the input Arrow data cannot be safely serialized to parquet.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yes, an "unset" GeoArrow CRS should probably error on conversion to a Parquet type. I don't feel too strongly about this...Arrow C++ doesn't error here at the moment because it was pragmatic at the time (it allowed a suite of tests to pass without a JSON parser, which at the time required adding a C++ dependency that I wasn't sure would work).

Copy link
Member

@paleolimbot paleolimbot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thank you for starting this!

Comment on lines +862 to +866
parquet::LogicalType::GEOMETRY(t) => LogicalType::Geometry { crs: t.crs },
parquet::LogicalType::GEOGRAPHY(t) => LogicalType::Geography {
crs: t.crs,
algorithm: t.algorithm,
},
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

In Arrow C++ this was a bit tricky to deal with:

  • projjson:crs_key defines a CRS where the PROJJSON payload is stored in the Parquet schema metadata. This requires passing the schema metadata into this function (or something between this LogicalType and the Arrow Field output), which by the looks of it is a bit tricky to do here. In Arrow C++ we do handle this case (i.e., we'll return read a Parquet projjson:crs_key as GeoArrow whose CRS payload is the actual PROJJSON that was in the Parquet schema metadata.
  • srid:1234 can be converted to GeoArrow as {"crs": "1234", "crs_type": "srid"}.
  • <any other string> can be converted to GeoArrow as {"crs": "<any other string>"}. In Arrow C++ I also made the choice to ensure that if <any other string> happened to be valid JSON, that it shouldn't be double-escaped as a JSON string.

There are test files with all three of these in the apache/parquet-testing repo 🙂

https://github.com/apache/arrow/blob/f8b20f131a072ef423e81b8a676f42a82255f4ec/cpp/src/parquet/geospatial/util_json_internal.cc#L105-L125

Comment on lines +909 to +912
LogicalType::Geometry { crs } => parquet::LogicalType::GEOMETRY(GeometryType { crs }),
LogicalType::Geography { crs, algorithm } => {
parquet::LogicalType::GEOGRAPHY(GeographyType { crs, algorithm })
}
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

In Arrow C++ I tried to detect crs values that were definitely lon/lat and ensure the Parquet logical type was "unset" to maximize compatibility with readers that didn't want to or couldn't handle CRS values.

  • "EPSG:4326"
  • "OGC:CRS84"
  • {..., "id": {"authority": "EPSG", "code": 4326}
  • {..., "id": {"authority": "EPSG", "code": "4326"}
  • {..., "id": {"authority": "OGC", "code": "CRS84"}

https://github.com/apache/arrow/blob/f8b20f131a072ef423e81b8a676f42a82255f4ec/cpp/src/parquet/geospatial/util_json_internal.cc#L47-L81

Copy link
Contributor

@alamb alamb left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

TLDR is that this is looking great. Thank you @kylebarron @BlakeOrth and @paleolimbot

It is so exciting to see this progressing.

The other thing i think would be really helpful is to add a test that shows reading some of the files from https://github.com/apache/parquet-testing/tree/master/data/geospatial and verifying they come back with the correctly annotated logical types

Comment on lines +408 to +425
LogicalType::Geography { crs, algorithm } => {
use arrow_schema::extension::GeographyAlgorithm;

use crate::format::EdgeInterpolationAlgorithm;

let algorithm = match algorithm {
Some(EdgeInterpolationAlgorithm::ANDOYER) => Some(GeographyAlgorithm::ANDOYER),
Some(EdgeInterpolationAlgorithm::KARNEY) => Some(GeographyAlgorithm::KARNEY),
Some(EdgeInterpolationAlgorithm::SPHERICAL) => {
Some(GeographyAlgorithm::SPHERICAL)
}
Some(EdgeInterpolationAlgorithm::THOMAS) => Some(GeographyAlgorithm::THOMAS),
Some(EdgeInterpolationAlgorithm::VINCENTY) => {
Some(GeographyAlgorithm::VINCENTY)
}
None => None,
_ => None,
};
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

A minor comment is that it would be nice to encapsulate the geography types a bit more, for so, for example, this looks something more like the following

Suggested change
LogicalType::Geography { crs, algorithm } => {
use arrow_schema::extension::GeographyAlgorithm;
use crate::format::EdgeInterpolationAlgorithm;
let algorithm = match algorithm {
Some(EdgeInterpolationAlgorithm::ANDOYER) => Some(GeographyAlgorithm::ANDOYER),
Some(EdgeInterpolationAlgorithm::KARNEY) => Some(GeographyAlgorithm::KARNEY),
Some(EdgeInterpolationAlgorithm::SPHERICAL) => {
Some(GeographyAlgorithm::SPHERICAL)
}
Some(EdgeInterpolationAlgorithm::THOMAS) => Some(GeographyAlgorithm::THOMAS),
Some(EdgeInterpolationAlgorithm::VINCENTY) => {
Some(GeographyAlgorithm::VINCENTY)
}
None => None,
_ => None,
};
LogicalType::Geography { crs, algorithm } => {
let algorithm = algorithm.map(|algorithm| EdgeInterpolationAlgorithm::from);

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The same comment applies below

@@ -231,9 +231,18 @@ pub enum LogicalType {
/// A Variant value.
Variant,
/// A geospatial feature in the Well-Known Binary (WKB) format with linear/planar edges interpolation.
Geometry,
Geometry {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

FWIW this would be a breaking API change I think so we either have to wait for 57 in October or make some backwards compatibility adjustments

Copy link
Contributor Author

@kylebarron kylebarron Aug 26, 2025

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yes, finding this was a bummer. I figure that we'll have to wait till v57 to merge, because we need to be able to carry on the information from the GeometryType and GeographyType.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
arrow Changes to the arrow crate parquet Changes to the parquet crate
Projects
None yet
Development

Successfully merging this pull request may close these issues.

[Feature] geometry and geography logical type implementations Add support for GEOMETRY and GEOGRAPHY types in Parquet read and/or write
4 participants