Snapshot partition processor state to object store #1807
Rough notes from chatting with @tillrohrmann:
Out of scope for now:
Open questions:
Some thoughts on the open questions:
I think we can leave this out of scope for now and only manage it in the object store config; S3 and Azure Blob Storage both support async cross-region replication. For something like snapshots, where picking a slightly older one to bootstrap from is acceptable, this is completely fine. In the worst case, new PPs won't be able to start up in a region whose snapshot bucket replication is lagging well behind the log tail. And a region in that condition will likely be experiencing other difficulties beyond snapshot staleness.
My 2c: we should upload snapshots and not actively manage them afterwards, leaving this to object store policies. For example, S3 supports rich lifecycle policies to migrate objects to cheaper storage classes, or delete them after a while. The one exception is local directory snapshots. Assuming those are used only for short-lived test clusters, we shouldn't have long-term disk usage problems with them.
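For illustration, an S3 lifecycle configuration along these lines could handle retention entirely on the object store side; the rule ID and the 30/90-day thresholds are made-up examples, not recommendations:

```json
{
  "Rules": [
    {
      "ID": "expire-old-snapshots",
      "Filter": { "Prefix": "" },
      "Status": "Enabled",
      "Transitions": [
        { "Days": 30, "StorageClass": "STANDARD_IA" }
      ],
      "Expiration": { "Days": 90 }
    }
  ]
}
```

Applied to the snapshot bucket, this would move snapshots to infrequent-access storage after 30 days and delete them after 90, with no lifecycle logic needed in Restate itself.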
Updated issue description based on our internal discussion yesterday.
@pcholakov can this issue be closed (by removing #2384, since we won't solve this for the preview version)?
Indeed it can, following our conversation on Friday - done!
To support partition processor bootstrap, catching up stale processors after downtime (i.e. handling trim gaps), and safely trimming the log, we need snapshotting support.
Scope and features
How will snapshots be triggered? What is their frequency?
Where do snapshots go?
- Use the `object_store` crate to support various cloud object stores
How are snapshots organized in the target bucket?
- A `{partition_id}/{snapshot_id}_{lsn}` structure that allows us to avoid coordination, with a top-level `{partition_id}/latest.json` pointer that is atomically updated to reflect the most recent snapshot
How will trimming be driven?
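The `{partition_id}/{snapshot_id}_{lsn}` layout with a per-partition `latest.json` pointer can be sketched as below; the helper names and exact formatting are illustrative assumptions, not the actual implementation:

```rust
/// Key for a snapshot object: `{partition_id}/{snapshot_id}_{lsn}`.
/// Including a unique snapshot id plus the LSN means writers never need to
/// coordinate: two nodes snapshotting the same partition produce distinct keys.
fn snapshot_key(partition_id: u16, snapshot_id: &str, lsn: u64) -> String {
    format!("{partition_id}/{snapshot_id}_{lsn}")
}

/// Key for the per-partition pointer object, atomically overwritten to
/// reference the most recent snapshot.
fn latest_pointer_key(partition_id: u16) -> String {
    format!("{partition_id}/latest.json")
}

fn main() {
    assert_eq!(snapshot_key(7, "snap-01", 1024), "7/snap-01_1024");
    assert_eq!(latest_pointer_key(7), "7/latest.json");
    println!("layout keys check out");
}
```

Because snapshot keys are never rewritten, only the small `latest.json` object needs an atomic put, which every major object store supports.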
How will the Cluster Controller learn about what valid snapshots exist (and, in the future, in which locations)?
How will PPs be bootstrapped from a snapshot?
How will we handle trim gaps?
Who manages the lifecycle of snapshots?
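To make the bootstrap and trim-gap questions above concrete, here is a minimal sketch of snapshot selection; the types and function are hypothetical, not the actual Restate API. A processor that hits a trim gap ending at a given LSN can only bootstrap from a snapshot taken at or after that point, since replay must not start inside the trimmed region of the log:

```rust
/// Hypothetical snapshot metadata record.
#[derive(Debug, Clone, PartialEq)]
struct SnapshotMeta {
    snapshot_id: String,
    lsn: u64,
}

/// Pick the newest snapshot that covers a trim gap ending at `trim_point`.
/// The snapshot's LSN must be at least `trim_point`; otherwise log replay
/// would begin inside the trimmed (unreadable) portion of the log.
fn pick_bootstrap_snapshot(snapshots: &[SnapshotMeta], trim_point: u64) -> Option<&SnapshotMeta> {
    snapshots
        .iter()
        .filter(|s| s.lsn >= trim_point)
        .max_by_key(|s| s.lsn)
}

fn main() {
    let snapshots = vec![
        SnapshotMeta { snapshot_id: "a".into(), lsn: 100 },
        SnapshotMeta { snapshot_id: "b".into(), lsn: 250 },
        SnapshotMeta { snapshot_id: "c".into(), lsn: 180 },
    ];
    // Log trimmed up to LSN 150: snapshot "a" (LSN 100) is unusable.
    let chosen = pick_bootstrap_snapshot(&snapshots, 150).unwrap();
    assert_eq!(chosen.snapshot_id, "b");
    // No snapshot covers a trim point beyond the newest snapshot's LSN.
    assert!(pick_bootstrap_snapshot(&snapshots, 300).is_none());
    println!("bootstrap from {} at lsn {}", chosen.snapshot_id, chosen.lsn);
}
```

The same check also answers when trimming is safe in reverse: the log can be trimmed up to the LSN of the oldest snapshot any processor might still need.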
Additional considerations:
Consider but don't implement
Tasks
Follow-up tasks: #2384