-
It should write the stats. @Maxxen weighed in when we were first considering adding the bbox column, and then he implemented it; you can see the results in this comment.
I think it's supposed to 'just work'? Can you share how you determined that it wasn't reading all the rows?
-
Hello!
-
Ah, I think it's your LIKE pattern. The subfield paths are reported as 'geom_bbox, max_x' rather than 'geom_bbox.max_x', so a pattern with a dot matches nothing:
SELECT DISTINCT path_in_schema
FROM parquet_metadata('geonames.parquet')
WHERE path_in_schema LIKE 'geom_bbox.%';
┌────────────────┐
│ path_in_schema │
│    varchar     │
├────────────────┤
│     0 rows     │
└────────────────┘
Without the dot, the subfields show up:
SELECT DISTINCT path_in_schema
FROM parquet_metadata('geonames.parquet')
WHERE path_in_schema LIKE 'geom_bbox%';
┌──────────────────┐
│  path_in_schema  │
│     varchar      │
├──────────────────┤
│ geom_bbox, max_x │
│ geom_bbox, max_y │
│ geom_bbox, min_x │
│ geom_bbox, min_y │
└──────────────────┘
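Once the subfield paths match, the same function can show whether statistics were actually written for them. The following is only a sketch: it assumes the stats_min_value and stats_max_value columns that parquet_metadata exposes in recent DuckDB versions, and it reuses the geonames.parquet file name from the queries above.

```sql
-- Sketch: list the per-row-group min/max statistics for the bbox subfields.
-- A NULL in stats_min_value / stats_max_value means no statistic was stored
-- for that column chunk.
SELECT
    row_group_id,
    path_in_schema,
    stats_min_value,
    stats_max_value
FROM parquet_metadata('geonames.parquet')
WHERE path_in_schema LIKE 'geom_bbox%'
ORDER BY row_group_id, path_in_schema;
```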
-
Hi all,
I’m quite new to DuckDB and DuckDB Spatial, so I might be misunderstanding something. I’ve been experimenting with creating a partitioned GeoParquet dataset and noticed something about statistics and predicate pushdown.
I started from a large national dataset (~86M rows) with a geom column and a geom_bbox column of type:
The files were written with:
When I inspected the Parquet metadata using:
I saw no min/max statistics for the geom_bbox subfields. As a result, spatial filters like:
would still read all files/row groups.
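For illustration only, a bounding-box filter of the kind described here might look like the sketch below. The dataset path, the subfield names (min_x, max_x, min_y, max_y, borrowed from the metadata listing elsewhere in this thread) and the coordinates are all assumptions, not the query actually used.

```sql
-- Hypothetical bbox-intersection filter on a STRUCT column.
-- Path, subfield names and coordinates are illustrative assumptions.
SELECT count(*)
FROM read_parquet('dataset/**/*.parquet', hive_partitioning = true)
WHERE geom_bbox.max_x >= 4.0
  AND geom_bbox.min_x <= 5.0
  AND geom_bbox.max_y >= 52.0
  AND geom_bbox.min_y <= 53.0;
```

Prefixing such a query with EXPLAIN ANALYZE is one way to inspect what the Parquet scan did for a given filter.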
What I tried
I then flattened the bounding box into 4 top-level FLOAT columns: minx, miny, maxx, maxy.
With this schema, the Parquet metadata does contain min/max stats for each column, and predicate pushdown works: only relevant row groups are read, and queries are much faster.
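A minimal sketch of this flattening step, assuming hypothetical file names (national.parquet, national_flat.parquet) and STRUCT subfields named min_x, min_y, max_x, max_y; the real names and write options may differ:

```sql
-- Sketch: replace the STRUCT bbox with four top-level FLOAT columns.
-- File names and subfield names are assumptions for illustration.
COPY (
    SELECT
        * EXCLUDE (geom_bbox),
        CAST(geom_bbox.min_x AS FLOAT) AS minx,
        CAST(geom_bbox.min_y AS FLOAT) AS miny,
        CAST(geom_bbox.max_x AS FLOAT) AS maxx,
        CAST(geom_bbox.max_y AS FLOAT) AS maxy
    FROM read_parquet('national.parquet')
) TO 'national_flat.parquet' (FORMAT parquet);
```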
Example metadata check:
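A sketch of such a check, assuming a hypothetical file national_flat.parquet and the stats_min_value / stats_max_value columns of parquet_metadata. The alias rg_con_stats follows the name used in this post; rg_total is made up for the comparison.

```sql
-- Sketch: per bbox column, count the row groups that carry both a min and
-- a max statistic and compare with the total number of row groups.
-- parquet_metadata returns one row per column chunk per row group.
SELECT
    path_in_schema,
    count(*) AS rg_total,
    count(*) FILTER (
        WHERE stats_min_value IS NOT NULL
          AND stats_max_value IS NOT NULL
    ) AS rg_con_stats
FROM parquet_metadata('national_flat.parquet')
WHERE path_in_schema IN ('minx', 'miny', 'maxx', 'maxy')
GROUP BY path_in_schema
ORDER BY path_in_schema;
```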
rg_con_stats equals the total number of row groups, confirming stats are present.
My question
Is it expected that DuckDB does not write statistics for STRUCT subfields in Parquet?
Am I missing an option or a different way to preserve stats for geom_bbox subfields when writing?
For GeoParquet datasets, having the bbox as a STRUCT is semantically nice, but for efficient spatial filtering it seems I need to flatten it.
I'm using DuckDB v1.3.0 (71c5c07cdd) and the spatial extension at 7ab1710.
Thanks for any clarification, and sorry if I’m overlooking something obvious.