
Introduce SnapshotRepository find_latest and wire up partition restore #2353

Open
wants to merge 1 commit into base: feat/snapshot-upload

Conversation

@pcholakov (Contributor) commented Nov 22, 2024

With this change, Partition Processor startup now checks the snapshot repository
for a partition snapshot before creating a blank store database. If a recent
snapshot is available, we will restore that instead of replaying the log from
the beginning.

Closes: #2000
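For illustration, here is a rough sketch of the new startup decision (not the exact diff): has_partition_store, find_latest, and open_partition_store_from_snapshot are names from this PR, while the surrounding signatures and the open_partition_store fallback shown here are simplified assumptions.

// Sketch: on Partition Processor startup, prefer bootstrapping from the latest
// snapshot in the repository over creating a blank store and replaying the log.
let partition_store = if !partition_store_manager
    .has_partition_store(partition_id)
    .await
{
    match snapshot_repository.find_latest(partition_id).await? {
        Some(snapshot) => {
            // Import the downloaded SSTs and resume log replay from the
            // snapshot's min_applied_lsn instead of from the beginning.
            partition_store_manager
                .open_partition_store_from_snapshot(partition_id, snapshot)
                .await?
        }
        // No snapshot available: start from a blank store and replay the whole log.
        None => open_partition_store(partition_id).await?, // hypothetical fallback
    }
} else {
    // A local store already exists; open it as before.
    open_partition_store(partition_id).await? // hypothetical fallback
};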


Open tasks:

  • Stream snapshot data files from object store

Future work:

  • Parallelize the snapshot file downloads

Testing

Created a snapshot by running restatectl create-snapshot -p 0, then dropped the partition CF with rocksdb_ldb drop_column_family --db=./restate-data/.../db data-0.

Running restate-server correctly restores the most recent available snapshot:

2024-11-22T18:00:12.690097Z TRACE run: restate_worker::partition_processor_manager::spawn_processor_task: Looking for partition snapshot from which to bootstrap partition store partition_id=0 partition_id=0
2024-11-22T18:00:12.697706Z TRACE run: restate_worker::partition::snapshots::repository: Latest snapshot metadata: LatestSnapshot { version: V1, lsn: Lsn(228), partition_id: PartitionId(0), node_name: "Pavels-MacBook-Pro.local", created_at: Timestamp(SystemTime { tv_sec: 1732296930, tv_nsec: 956216000 }), snapshot_id: snap_13flhdTZpHeqoSF880sA8dr, min_applied_lsn: Lsn(228), path: "lsn_228" } partition_id=0
2024-11-22T18:00:12.700707Z DEBUG run: restate_worker::partition::snapshots::repository: Getting snapshot data snapshot_id=snap_13flhdTZpHeqoSF880sA8dr path="/Users/pavel/restate/restate/restate-data/snap_13flhdTZpHeqoSF880sA8dr-JrSB2B" partition_id=0
2024-11-22T18:00:12.701895Z TRACE run: restate_worker::partition::snapshots::repository: Downloaded snapshot data file to "/Users/pavel/restate/restate/restate-data/snap_13flhdTZpHeqoSF880sA8dr-JrSB2B/000101.sst" key=Users/pavel/test-cluster-snapshots/0/lsn_228/000101.sst partition_id=0
2024-11-22T18:00:12.702347Z TRACE run: restate_worker::partition::snapshots::repository: Downloaded snapshot data file to "/Users/pavel/restate/restate/restate-data/snap_13flhdTZpHeqoSF880sA8dr-JrSB2B/000078.sst" key=Users/pavel/test-cluster-snapshots/0/lsn_228/000078.sst partition_id=0
2024-11-22T18:00:12.702384Z  INFO run: restate_worker::partition_processor_manager::spawn_processor_task: Found snapshot to bootstrap partition, restoring it partition_id=0 partition_id=0
2024-11-22T18:00:12.702677Z  INFO run: restate_partition_store::partition_store_manager: Importing partition store snapshot partition_id=PartitionId(0) lsn=Lsn(228) path="/Users/pavel/restate/restate/restate-data/snap_13flhdTZpHeqoSF880sA8dr-JrSB2B" partition_id=0
2024-11-22T18:00:12.718440Z DEBUG on_asynchronous_event: restate_worker::partition_processor_manager: Partition processor was successfully created. target_run_mode=Leader partition_id=0 event=Started
2024-11-22T18:00:12.718804Z  INFO run: restate_worker::partition: Starting the partition processor. partition_id=0
2024-11-22T18:00:12.718890Z DEBUG run: restate_worker::partition: PartitionProcessor creating log reader last_applied_lsn=228 current_log_tail=233 partition_id=0
2024-11-22T18:00:12.718917Z DEBUG run: restate_worker::partition: Replaying the log from lsn=229, log tail lsn=233 partition_id=0
2024-11-22T18:00:12.719007Z  INFO run: restate_worker::partition: PartitionProcessor starting event loop. partition_id=0


github-actions bot commented Nov 22, 2024

Test Results

  7 files  ±0    7 suites  ±0   4m 24s ⏱️ ±0s
 47 tests ±0   46 ✅ ±0  1 💤 ±0  0 ❌ ±0 
182 runs  ±0  179 ✅ ±0  3 💤 ±0  0 ❌ ±0 

Results for commit 40e74fd. ± Comparison against base commit 38268d6.

♻️ This comment has been updated with latest results.

@tillrohrmann (Contributor) left a comment


Thanks for creating this PR @pcholakov. The changes look really nice! I had a few minor questions. It would be great to add the streaming write before merging. Once this is resolved, +1 for merging :-)

));
let file_path = snapshot_dir.path().join(filename);
let file_data = self.object_store.get(&key).await?;
tokio::fs::write(&file_path, file_data.bytes().await?).await?;
Contributor:

Yes, it would indeed be great to write the file to disk in a streaming fashion, especially once our SSTs grow.

Contributor:

Maybe something like

use futures::StreamExt;
use tokio::io::AsyncWriteExt;

// Stream the object body to disk chunk by chunk instead of buffering it in memory.
let mut file_data = self.object_store.get(&key).await?.into_stream();
let mut snapshot_file = tokio::fs::File::create_new(&file_path).await?;
while let Some(data) = file_data.next().await {
    snapshot_file.write_all(&data?).await?;
}

can already be enough. Do you know how large the chunks of the stream returned by self.object_store.get(&key).await?.into_stream() will be?

Comment on lines +138 to +140
let partition_store = if !partition_store_manager
.has_partition_store(pp_builder.partition_id)
.await
Contributor:

Out of scope for this PR: what is the plan for handling a PP that has some data but is lagging so far behind that starting it would run into a trim gap? Would we then drop the respective column family and restart it?

Comment on lines +246 to +247
/// Discover and download the latest snapshot available. Dropping the returned
/// `LocalPartitionSnapshot` will delete the local snapshot data files.
Contributor:

Is that because the files are stored in a temp directory? On LocalPartitionSnapshot itself I couldn't find where the files get deleted when it is dropped.

Contributor:

Is the temp dir also the mechanism for cleaning things up if the download fails?

Contributor:

It seems that TempDir::with_prefix_in takes care of it since it deletes the files when it gets dropped. This is a nice solution!
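For context, a minimal sketch of how holding the TempDir inside the snapshot value ties file cleanup to Drop; the field name and layout are illustrative, not the actual definition of LocalPartitionSnapshot.

use tempfile::TempDir;

struct LocalPartitionSnapshot {
    // Keeping the TempDir alive keeps the downloaded SSTs on disk; when the
    // snapshot value is dropped, TempDir's Drop impl removes the directory and
    // everything in it - which also covers partially downloaded snapshots.
    snapshot_dir: TempDir,
    // ... snapshot metadata (min_applied_lsn, file list, ...) ...
}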

"Found snapshot to bootstrap partition, restoring it",
);
partition_store_manager
.open_partition_store_from_snapshot(
Contributor:

In the linked code, we seem to copy the snapshot files to keep them intact. What is the reason for this? Wouldn't it be more efficient to move the files, since that wouldn't incur any I/O cost if the snapshot directory is on the same filesystem as the target directory?
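A rough sketch of the move-based alternative suggested here, assuming the snapshot directory and the target directory sit on the same filesystem; the function and variable names are illustrative.

use std::fs;
use std::path::Path;

// Move the downloaded snapshot files into place instead of copying them. On the
// same filesystem, rename is a metadata-only operation and avoids rewriting SSTs.
fn move_snapshot_files(snapshot_dir: &Path, target_dir: &Path) -> std::io::Result<()> {
    for entry in fs::read_dir(snapshot_dir)? {
        let entry = entry?;
        // rename fails across filesystems (EXDEV); a copy-and-delete fallback
        // would still be needed for that case.
        fs::rename(entry.path(), target_dir.join(entry.file_name()))?;
    }
    Ok(())
}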
