[wip] object store metastore #2309

Draft
wants to merge 3 commits into
base: main

Conversation

igalshilman
Contributor

No description provided.

@igalshilman igalshilman force-pushed the s3metastore branch 5 times, most recently from be3e74c to 638607a Compare November 19, 2024 16:45
@tillrohrmann tillrohrmann self-requested a review November 20, 2024 14:20
@igalshilman igalshilman force-pushed the s3metastore branch 3 times, most recently from ab8beca to 2f76ae8 Compare November 22, 2024 13:44
Contributor

@tillrohrmann tillrohrmann left a comment

Thanks for creating our object store based metadata store @igalshilman :-) It looks really nice! The implementation looks correct to me :-)

I left a few minor comments and questions. The one thing that is not fully clear to me is whether we want to clean up older versions or not. If not, would this be problematic for the get_latest_version which needs to list more and more versions?


let mut shutdown = pin::pin!(cancellation_watcher());

let delegate = match builder.build().await {
Contributor

Can this await take a long time? If yes, then maybe select over the shutdown to also support shutdowns during this time.

Contributor Author

@igalshilman igalshilman Nov 22, 2024

Can this await take a long time?

not any more.

}
};

let mut refresh = tokio::time::interval(Duration::from_secs(2));
Contributor

Probably something we want to make configurable at some point.

Contributor Author

Outdated, it doesn't exist any more.


#[derive(Debug, Clone)]
pub struct Client {
sender: tokio::sync::mpsc::UnboundedSender<Commands>,
Contributor

Would a bounded sender also work?
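
To illustrate what a bound changes, here is a minimal sketch using the standard library's `sync_channel` as a stand-in for `tokio::sync::mpsc::channel` (an assumption made so the example has no dependencies). `Command`, `bounded_commands`, and `try_enqueue` are hypothetical names, not taken from the PR. Once `capacity` commands are queued, further sends are rejected (or would block), which gives backpressure instead of unbounded memory growth.

```rust
use std::sync::mpsc::{sync_channel, Receiver, SyncSender, TrySendError};

/// Hypothetical stand-in for the PR's `Commands` enum.
#[derive(Debug, PartialEq)]
pub enum Command {
    Get(String),
}

/// Creates a bounded command channel that holds at most `capacity` queued commands.
pub fn bounded_commands(capacity: usize) -> (SyncSender<Command>, Receiver<Command>) {
    sync_channel(capacity)
}

/// Tries to enqueue without blocking. The command is handed back to the caller
/// if the queue is full or the receiver has been dropped.
pub fn try_enqueue(tx: &SyncSender<Command>, cmd: Command) -> Result<(), Command> {
    tx.try_send(cmd).map_err(|e| match e {
        TrySendError::Full(c) | TrySendError::Disconnected(c) => c,
    })
}
```

With a bounded channel the client has to decide what a full queue means: wait, fail fast, or drop. That decision is the main cost of the change.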

Comment on lines 169 to 163
.map_err(|_| WriteError::Internal("Object store fetch channel ".into()))?;

rx.await.map_err(|_| {
WriteError::Internal("Object store fetch channel disconnected".to_string())
Contributor

The error messages seem to be copied from the get call.

Comment on lines 41 to 44
// serialize as json
serde_json::to_vec(self).map(Into::into).map_err(Into::into)
Contributor

I guess the json encoding would be more helpful if the values were also encoded in json?

}

#[async_trait::async_trait]
impl MetadataStore for OptimisticLockingMetadataStore {
Contributor

Why does the OptimisticLockingMetadataStore implement the MetadataStore?

Contributor Author

Fixed, it no longer does.


pub struct OptimisticLockingMetadataStore {
version_repository: Arc<dyn VersionRepository>,
latest_store_cache: ArcSwap<Option<CachedStore>>,
Contributor

Can this also be a CachedStore (w/o ArcSwap<Option<>>) and change the methods to take a mutable borrow?

Contributor Author

If it won't implement the metastore trait then I can do that.
Let me try this out.

Contributor Author

worked out beautifully 🧑‍🍳

Comment on lines +187 to +188
// refresh and try again
let delay = {
let mut rng = rand::thread_rng();
rng.gen_range(50..300)
};
tokio::time::sleep(Duration::from_millis(delay)).await;
Contributor

We might be able to replace this with a RetryPolicy.
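
As a rough illustration of what a policy-driven retry could look like, here is a dependency-free sketch. `Backoff` and `retry` are hypothetical names and do not reflect the project's actual `RetryPolicy` API. The sketch also makes two simplifying assumptions: it omits the random jitter of the original code, and it uses a blocking `std::thread::sleep` where the PR would use `tokio::time::sleep`.

```rust
use std::time::Duration;

/// Hypothetical exponential backoff schedule; not the project's `RetryPolicy`.
#[derive(Debug, Clone)]
pub struct Backoff {
    current: Duration,
    factor: u32,
    max: Duration,
    remaining: usize,
}

impl Backoff {
    pub fn new(initial: Duration, factor: u32, max: Duration, max_attempts: usize) -> Self {
        Self { current: initial, factor, max, remaining: max_attempts }
    }
}

impl Iterator for Backoff {
    type Item = Duration;

    fn next(&mut self) -> Option<Duration> {
        if self.remaining == 0 {
            return None;
        }
        self.remaining -= 1;
        let delay = self.current;
        // Grow the delay for the following attempt, capped at `max`.
        self.current = self.current.saturating_mul(self.factor).min(self.max);
        Some(delay)
    }
}

/// Calls `op` once, then retries after each delay in the schedule until it
/// succeeds or the schedule is exhausted. Returns the last result.
pub fn retry<T, E>(schedule: Backoff, mut op: impl FnMut() -> Result<T, E>) -> Result<T, E> {
    let mut last = op();
    for delay in schedule {
        if last.is_ok() {
            break;
        }
        std::thread::sleep(delay);
        last = op();
    }
    last
}
```

Keeping the schedule as an iterator separates "how long to wait" from "what to retry", so the same loop works for any bounded schedule.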


match self
.version_repository
.try_create_version(new_store.current_version(), new_store_bytes)
Contributor

The correctness of the implementation relies on the fact that our version space won't have any holes, right? I guess this makes cleaning up older versions a bit more tricky.

Contributor

Maybe assert that new_store.current_version() == cached_store.store.version().next() to state that there mustn't be any holes.
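
A minimal sketch of the suggested check, assuming a simple numeric version type. `Version` here is a hypothetical stand-in for the project's own type, and `check_successor` is an illustrative helper that returns an error rather than panicking, so callers can choose between `assert!` and error propagation.

```rust
/// Hypothetical stand-in for the project's version type.
#[derive(Debug, Clone, Copy, PartialEq, Eq, PartialOrd, Ord)]
pub struct Version(pub u32);

impl Version {
    /// The version that must directly follow `self`.
    pub fn next(self) -> Version {
        Version(self.0 + 1)
    }
}

/// Succeeds only if `candidate` is exactly the successor of `cached`,
/// i.e. the write would neither skip a version nor reuse one.
pub fn check_successor(cached: Version, candidate: Version) -> Result<(), String> {
    if candidate == cached.next() {
        Ok(())
    } else {
        Err(format!("expected {:?}, got {:?}", cached.next(), candidate))
    }
}
```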

Contributor

@pcholakov pcholakov left a comment

I think we might have a better path to implement this, using Object Versioning. A quick search shows that this is supported across MinIO / Azure / GCP, here's the S3 documentation: https://docs.aws.amazon.com/AmazonS3/latest/userguide/Versioning.html.

Here's a sketch of the conditional PUT path:

async fn put(
    &self,
    key: ByteString,
    value: VersionedValue,
    precondition: Precondition,
) -> Result<(), WriteError> {
    let key = object_store::path::Path::from(key.to_string());

    match precondition {
        Precondition::MatchesVersion(version) => {
            // If we have a cached version, we can also add an if-modified-since condition to the GET.
            // Unfortunately we can't set the version explicitly, and a HEAD request doesn't return enough
            // information to determine our own metadata value version.
            let get_result = self.object_store.get(&key).await.map_err(|e| {
                WriteError::Internal(format!("Failed to check precondition: {}", e))
            })?;

            if extract_object_version(&get_result.payload) != version {
                return Err(WriteError::FailedPrecondition(
                    "Version mismatch".to_string(),
                ));
            }

            self.object_store
                .put_opts(
                    &key,
                    PutPayload::from_bytes(serialize_versioned_value(value)),
                    PutOptions::from(PutMode::Update(UpdateVersion::from(&get_result))),
                )
                .await
                .map_err(|e| WriteError::Internal(format!("Failed to update value: {}", e)))?;

            Ok(())
        }
        _ => todo!(),
    }
}

Unfortunately we don't control the versions; in S3 they are a monotonic numbered sequence but we don't get to set them directly - rather, S3 will set them for us. Our own API relies on explicit versions so I'm assuming we'll just serialize them along with the rest of the payload as the object body.

A normal GET request implicitly returns the latest version, no further tricks required. It's also possible to request a particular previous version, but that's not needed for our API. Might be useful for troubleshooting though! (And I'd seriously consider serializing the values as JSON for operations friendliness.)

The only requirement for this all to work is that we must use a bucket with object versioning enabled, which is not the default but is trivial to set. S3 and other stores also support automatic cleanup of old versions using object lifecycle policies, so we don't need to implement that ourselves.
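
The read-check-write flow described above can be modeled in memory without any object store. This is only a sketch under stated assumptions: `InMemoryStore`, `StoreTag`, `Stored`, `put_if_tag`, and `put_matching_version` are hypothetical names, the tag stands in for whatever opaque identifier the store assigns on write, and the application version is kept as a separate field here rather than being serialized into the body.

```rust
use std::collections::HashMap;

/// Opaque tag assigned by the store on every write (hypothetical model).
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
pub struct StoreTag(u64);

#[derive(Debug, Clone, PartialEq)]
pub struct Stored {
    pub tag: StoreTag,
    /// Application-level version; serialized into the body in the real design.
    pub app_version: u32,
    pub body: Vec<u8>,
}

#[derive(Debug, PartialEq)]
pub enum PutError {
    FailedPrecondition,
}

#[derive(Default)]
pub struct InMemoryStore {
    objects: HashMap<String, Stored>,
    next_tag: u64,
}

impl InMemoryStore {
    pub fn get(&self, key: &str) -> Option<&Stored> {
        self.objects.get(key)
    }

    /// Writes only if the object's current tag equals `expected`
    /// (`None` means the object must not exist yet).
    pub fn put_if_tag(
        &mut self,
        key: &str,
        expected: Option<StoreTag>,
        app_version: u32,
        body: Vec<u8>,
    ) -> Result<StoreTag, PutError> {
        let current = self.objects.get(key).map(|o| o.tag);
        if current != expected {
            return Err(PutError::FailedPrecondition);
        }
        self.next_tag += 1;
        let tag = StoreTag(self.next_tag);
        self.objects.insert(key.to_string(), Stored { tag, app_version, body });
        Ok(tag)
    }
}

/// Reads the object, compares the application version, then writes
/// conditionally on the tag that was read.
pub fn put_matching_version(
    store: &mut InMemoryStore,
    key: &str,
    expected_version: u32,
    new_version: u32,
    body: Vec<u8>,
) -> Result<(), PutError> {
    let (tag, found) = match store.get(key) {
        Some(o) => (o.tag, o.app_version),
        None => return Err(PutError::FailedPrecondition),
    };
    if found != expected_version {
        return Err(PutError::FailedPrecondition);
    }
    store.put_if_tag(key, Some(tag), new_version, body).map(|_| ())
}
```

The application-version comparison alone is not enough: the tag condition on the write is what rejects a writer whose read has gone stale between the GET and the PUT.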

return Ok(());
};

if maybe_cached_version.is_none() || maybe_cached_version.unwrap() < global_latest_version {
Contributor

I prefer the !is_some_and(...) version personally.
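
For reference, the suggested form could look like the sketch below. `needs_refresh` is a hypothetical helper, and versions are assumed to be plain `u32` values. The condition is the negation of "a cached version exists and is at least the global latest", which is equivalent to the original `is_none() || unwrap() < ...` without the `unwrap`.

```rust
/// Returns true when nothing is cached or the cached version is older than
/// the globally latest version.
pub fn needs_refresh(cached: Option<u32>, global_latest: u32) -> bool {
    !cached.is_some_and(|v| v >= global_latest)
}
```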

Comment on lines +59 to +78
let mut stream = self.object_store.list(None);

// TODO: we should switch to a create internal Version
let mut current = 0;
while let Some(version) = stream
.try_next()
.await
.map_err(|e| VersionRepositoryError::Network(e.into()))?
{
if let Some(name) = version.location.filename() {
if !name.starts_with("v") {
continue;
}
if let Ok(v) = name[1..].parse::<u32>() {
if v > current {
current = v;
}
}
}
}
Contributor

I think we can do this without LIST operations but rather using object versioning.
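
If the LIST-based scan stays in the meantime, the filename parsing could be tightened with `strip_prefix` and iterator adapters. This is only a sketch: `parse_version` and `latest_version` are hypothetical helpers operating on plain strings, and unlike the loop above they return `None` instead of `0` when no versioned object exists.

```rust
/// Parses names of the form `v<number>`; anything else yields `None`.
pub fn parse_version(name: &str) -> Option<u32> {
    name.strip_prefix('v')?.parse().ok()
}

/// Highest version among the given object names, or `None` if no name matches.
pub fn latest_version<'a>(names: impl IntoIterator<Item = &'a str>) -> Option<u32> {
    names.into_iter().filter_map(parse_version).max()
}
```

A caller that wants the original behaviour can use `latest_version(names).unwrap_or(0)`.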

Contributor

tillrohrmann commented Nov 23, 2024

Wouldn't the PutMode::Update require an If-Match header support from S3? I thought that it currently only supports If-None-Match natively and for PutMode::Update one would have to use DynamoDB? Or is this different when using versioned buckets?
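
For contrast, the create-only scheme needs nothing beyond "write if absent" semantics, which is what an `If-None-Match: *` style condition provides. The in-memory model below is a sketch under that assumption. `VersionBucket`, `try_create`, and `latest` are hypothetical names, and each version is modeled as its own key.

```rust
use std::collections::btree_map::{BTreeMap, Entry};

#[derive(Debug, PartialEq)]
pub enum CreateError {
    AlreadyExists,
}

/// In-memory model of a bucket where every version lives under its own key.
#[derive(Default)]
pub struct VersionBucket {
    objects: BTreeMap<u32, Vec<u8>>,
}

impl VersionBucket {
    /// Models a create-only PUT: succeeds only if the key is absent.
    pub fn try_create(&mut self, version: u32, body: Vec<u8>) -> Result<(), CreateError> {
        match self.objects.entry(version) {
            Entry::Vacant(slot) => {
                slot.insert(body);
                Ok(())
            }
            Entry::Occupied(_) => Err(CreateError::AlreadyExists),
        }
    }

    /// Highest version present, if any (models the LIST-based lookup).
    pub fn latest(&self) -> Option<u32> {
        self.objects.keys().next_back().copied()
    }
}
```

Two writers racing to create the same version cannot both succeed, so no compare-and-swap on an existing object is required. The trade-off is the growing key space discussed earlier in the review.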

3 participants