[Feature] Support configurable management of Table Optimisers for Iceberg tables #627
Comments
@antonysouthworth-halter regarding this:
If you partition your table, I can share a "post-hook" that is ugly but does the job: it optimizes your table by partition values, using a batch size < 100 to avoid Athena's partition limitation. That said, what you describe is relevant and should be relatively easy to implement (a rough sketch of the idea is included after these comments):
😮 I would love to see it! I think it might also be helpful for others who stumble upon this ticket.
Also note that we now have a fix on
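A minimal sketch of the kind of partition-batched `OPTIMIZE` post-hook described above (this is not the commenter's actual hook; the macro name, `partition_column` argument, and batch size are illustrative, and the quoting assumes a string-typed partition column):

```sql
{% macro optimize_iceberg_by_partition(relation, partition_column, batch_size=50) %}
  {#- Hypothetical helper: compacts an Iceberg table one batch of partitions
      at a time, keeping each OPTIMIZE statement under Athena's limit on the
      number of partitions it can touch. -#}
  {% if execute %}
    {% set partitions = run_query(
        'select distinct ' ~ partition_column ~ ' from ' ~ relation
    ) %}
    {% for batch in partitions.columns[0].values() | batch(batch_size) %}
      {#- Quote the values; assumes a string-typed partition column. -#}
      {% set in_list = "'" ~ (batch | join("','")) ~ "'" %}
      {% do run_query(
          'OPTIMIZE ' ~ relation ~ ' REWRITE DATA USING BIN_PACK WHERE '
          ~ partition_column ~ ' IN (' ~ in_list ~ ')'
      ) %}
    {% endfor %}
  {% endif %}
{% endmacro %}
```

This could presumably be invoked from a model's `post_hook`, e.g. `{{ optimize_iceberg_by_partition(this, 'event_date') }}`, with `event_date` standing in for the real partition column.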
Is this your first time submitting a feature request?
Describe the feature
https://aws.amazon.com/blogs/aws/aws-glue-data-catalog-now-supports-automatic-compaction-of-apache-iceberg-tables/
Honestly, I have not fully thought through how it would work; hoping to spark some discussion in the thread.
Perhaps just another config variable for this? E.g. when `use_glue_automatic_compaction` is specified, we would use the Glue `CreateTableOptimizer`/`UpdateTableOptimizer` API operations to create the optimiser for compaction (rough sketch below).
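A rough sketch of how this might look from the model side, assuming the flag is exposed as a model-level config (the flag does not exist yet, and `my_upstream_model` plus the other settings are just an illustrative Iceberg model config):

```sql
{{ config(
    materialized='incremental',
    incremental_strategy='append',
    table_type='iceberg',
    use_glue_automatic_compaction=true
) }}

select * from {{ ref('my_upstream_model') }}
```

Presumably the adapter would then call `CreateTableOptimizer` the first time the flag is seen and `UpdateTableOptimizer` on subsequent runs to keep the optimiser configuration in sync.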
Describe alternatives you've considered
You can just run `OPTIMIZE {{ this.schema }}.{{ this.identifier }} ...` in your `post_hook` (sketched below). Yes, but on a full-refresh of a very large table (e.g. one requiring `insert_by_period`) this may fail due to a timeout or the Iceberg "not finished, please run compaction again" message. Regardless, I think it would be good to let AWS just handle it.

Caveat: I haven't actually tried the automatic compaction feature, so I have no idea how it performs in practice. Maybe it just scans your entire table once a day and you get charged for 100 DPUs 😂.
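For concreteness, the `post_hook` alternative might look roughly like this (`REWRITE DATA USING BIN_PACK` is Athena's compaction clause for Iceberg tables; the materialization settings are illustrative):

```sql
{{ config(
    materialized='table',
    table_type='iceberg',
    post_hook=[
        "OPTIMIZE {{ this.schema }}.{{ this.identifier }} REWRITE DATA USING BIN_PACK"
    ]
) }}
```

For large partitioned tables, this single statement is what the partition-batched macro sketched earlier would break into smaller per-partition runs.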
Who will this benefit?
Anybody with large datasets in Iceberg. I would think there is quite a lot of overlap with users of `insert_by_period`.

Are you interested in contributing this feature?
Maybe, depends on how much work it would be.
Anything else?
#514 is somewhat related, in the realm of "table optimisation management".