Snapshot partition processor state to object store #1807
Rough notes from chatting with @tillrohrmann:
Out of scope for now:
Open questions:
Some thoughts on the open questions:
I think we can leave this out of scope for now and only manage it in the object store config; S3 and Azure Blob Storage both support async cross-region replication. For something like snapshots, where picking a slightly older one to bootstrap from is acceptable, this is completely fine. In the worst case, new PPs won't be able to start up in a region whose snapshot bucket replication is lagging well behind the log tail. And a region in that condition will likely be experiencing other difficulties beyond snapshot staleness.
My 2c: we should upload snapshots and not actively manage them afterwards, leaving this to object store policies. For example, S3 supports rich lifecycle policies to migrate objects to cheaper storage classes, or delete them after a while. The one exception is local directory snapshots. Assuming those are used only for short-lived test clusters, we shouldn't have long-term disk usage problems with them.
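For illustration, an S3 lifecycle configuration along these lines could handle retention entirely on the object store side; the rule ID and the 30/90-day thresholds are made-up examples, not recommendations:

```json
{
  "Rules": [
    {
      "ID": "expire-old-snapshots",
      "Filter": { "Prefix": "" },
      "Status": "Enabled",
      "Transitions": [
        { "Days": 30, "StorageClass": "STANDARD_IA" }
      ],
      "Expiration": { "Days": 90 }
    }
  ]
}
```

Applied to the snapshot bucket, this would move snapshots to infrequent-access storage after 30 days and delete them after 90, with no lifecycle logic needed in Restate itself.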
Updated issue description based on our internal discussion yesterday.
@pcholakov can this issue be closed (by removing #2384, since we won't solve this for the preview version)?
Indeed it can, following our conversation on Friday - done!
To support partition processor bootstrap, catching up stale processors after downtime (i.e. handling trim gaps), and safely trimming the log, we need snapshotting support.
Scope and features
How will snapshots be triggered? What is their frequency?
Where do snapshots go?
- Use the `object_store` crate to support various cloud object stores
How are snapshots organized in the target bucket?
- A `{partition_id}/{snapshot_id}_{lsn}` structure that allows us to avoid coordination, with a top-level `{partition_id}/latest.json` pointer that is atomically updated to reflect the most recent snapshot
How will trimming be driven?
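The `{partition_id}/{snapshot_id}_{lsn}` layout with a per-partition `latest.json` pointer can be sketched as below; the helper names and exact formatting are illustrative assumptions, not the actual implementation:

```rust
/// Key for a snapshot object: `{partition_id}/{snapshot_id}_{lsn}`.
/// Including a unique snapshot id plus the LSN means writers never need to
/// coordinate: two nodes snapshotting the same partition produce distinct keys.
fn snapshot_key(partition_id: u16, snapshot_id: &str, lsn: u64) -> String {
    format!("{partition_id}/{snapshot_id}_{lsn}")
}

/// Key for the per-partition pointer object, atomically overwritten to
/// reference the most recent snapshot.
fn latest_pointer_key(partition_id: u16) -> String {
    format!("{partition_id}/latest.json")
}

fn main() {
    assert_eq!(snapshot_key(7, "snap-01", 1024), "7/snap-01_1024");
    assert_eq!(latest_pointer_key(7), "7/latest.json");
    println!("layout keys check out");
}
```

Because snapshot keys are never rewritten, only the small `latest.json` object needs an atomic put, which every major object store supports.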
How will the Cluster Controller learn about what valid snapshots exist (and, in the future, in which locations)?
How will PPs be bootstrapped from a snapshot?
How will we handle trim gaps?
Who manages the lifecycle of snapshots?
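To make the bootstrap and trim-gap questions above concrete, here is a minimal sketch of snapshot selection; the types and function are hypothetical, not the actual Restate API. A processor that hits a trim gap ending at a given LSN can only bootstrap from a snapshot taken at or after that point, since replay must not start inside the trimmed region of the log:

```rust
/// Hypothetical snapshot metadata record.
#[derive(Debug, Clone, PartialEq)]
struct SnapshotMeta {
    snapshot_id: String,
    lsn: u64,
}

/// Pick the newest snapshot that covers a trim gap ending at `trim_point`.
/// The snapshot's LSN must be at least `trim_point`; otherwise log replay
/// would begin inside the trimmed (unreadable) portion of the log.
fn pick_bootstrap_snapshot(snapshots: &[SnapshotMeta], trim_point: u64) -> Option<&SnapshotMeta> {
    snapshots
        .iter()
        .filter(|s| s.lsn >= trim_point)
        .max_by_key(|s| s.lsn)
}

fn main() {
    let snapshots = vec![
        SnapshotMeta { snapshot_id: "a".into(), lsn: 100 },
        SnapshotMeta { snapshot_id: "b".into(), lsn: 250 },
        SnapshotMeta { snapshot_id: "c".into(), lsn: 180 },
    ];
    // Log trimmed up to LSN 150: snapshot "a" (LSN 100) is unusable.
    let chosen = pick_bootstrap_snapshot(&snapshots, 150).unwrap();
    assert_eq!(chosen.snapshot_id, "b");
    // No snapshot covers a trim point beyond the newest snapshot's LSN.
    assert!(pick_bootstrap_snapshot(&snapshots, 300).is_none());
    println!("bootstrap from {} at lsn {}", chosen.snapshot_id, chosen.lsn);
}
```

The same check also answers when trimming is safe in reverse: the log can be trimmed up to the LSN of the oldest snapshot any processor might still need.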
Additional considerations:
Consider but don't implement
Tasks
Follow-up tasks: #2384