
fix: add handling for unmanaged files to vacuum command #1817

Open
wants to merge 25 commits into main

Conversation


@Jan-Schweizer Jan-Schweizer commented Nov 7, 2023

Description

Adds handling for unmanaged files to the VACUUM command. The VACUUM command should correctly handle files that are not managed by a Delta table, e.g. files and directories whose names start with `_` or `.`.
Additionally fixes the test test_non_managed_files.
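A short usage sketch (not part of this PR's diff), assuming the `DeltaOps`/`VacuumBuilder` API of the time; the table path and retention settings below are made up for illustration. Running VACUUM as a dry run shows which files it would remove:

```rust
use deltalake::{DeltaOps, DeltaTableError};

#[tokio::main]
async fn main() -> Result<(), DeltaTableError> {
    // Open a table (example path) and build a VACUUM operation.
    let ops = DeltaOps::try_from_uri("./tests/data/simple_table").await?;
    let (_table, metrics) = ops
        .vacuum()
        // Zero retention plus disabled enforcement so the dry run reports everything eligible.
        .with_retention_period(chrono::Duration::hours(0))
        .with_enforce_retention_duration(false)
        // Dry run: only report the candidates, do not delete anything.
        .with_dry_run(true)
        .await?;
    // With this PR, files and directories starting with `_` or `.` (other than the
    // `_delta_index` and `_change_data` exceptions) should not appear here.
    println!("would delete: {:?}", metrics.files_deleted);
    Ok(())
}
```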

Related Issue(s)

  • closes #227 (Unit and integration tests for the vacuum command)


github-actions bot commented Nov 7, 2023

ACTION NEEDED

delta-rs follows the Conventional Commits specification for release automation.

The PR title and description are used as the merge commit message. Please update your PR title and description to match the specification.

@Jan-Schweizer Jan-Schweizer changed the title Add handling for unmanaged files to vacuum command fix: Add handling for unmanaged files to vacuum command Nov 7, 2023
@Jan-Schweizer Jan-Schweizer changed the title fix: Add handling for unmanaged files to vacuum command fix: add handling for unmanaged files to vacuum command Nov 7, 2023
Jan-Schweizer and others added 3 commits November 7, 2023 18:55
The new version of this crate properly sets a lease duration such that the locks
can actually expire
# Description

Introduce a `LogStore` abstraction to channel all log store reads and
writes through a single place. This is supposed to allow implementations
with more sophisticated locking mechanisms that do not rely on atomic
rename semantics for the underlying object store.

This does not change any functionality - it reorganizes read operations
and commits on the delta commit log to be funneled through the
respective methods of `LogStore`.

## Rationale

The goal is to align the implementation of multi-cluster writes for
Delta Lake on S3 with the one provided by the original `delta` library,
enabling multi-cluster writes with some writers using the Spark / Delta
library and other writers using `delta-rs`. For an overview of how it's
done in delta, please see:
1. Delta [blog
post](https://delta.io/blog/2022-05-18-multi-cluster-writes-to-delta-lake-storage-in-s3/)
(high-level concept)
2. Associated Databricks [design
doc](https://docs.google.com/document/d/1Gs4ZsTH19lMxth4BSdwlWjUNR-XhKHicDvBjd2RqNd8/edit#heading=h.mjjuxw9mcz9h)
(detailed read)
3. [S3DynamoDbLogStore.java](https://github.com/delta-io/delta/blob/master/storage-s3-dynamodb/src/main/java/io/delta/storage/S3DynamoDBLogStore.java) (content warning: Java code behind this link)

This approach requires readers of a delta table to "recover" unfinished
commits from writers - as a result, reading and writing is combined in a
single interface, which in this PR is modeled after
[LogStore.java](https://github.com/delta-io/delta/blob/master/storage/src/main/java/io/delta/storage/LogStore.java).
Currently in `delta-rs`, the read path for commits is implemented directly
in `DeltaTable`, and there is no mechanism to implement storage-specific
behavior like interacting with DynamoDb.
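For readers unfamiliar with the shape of such an abstraction, here is a rough sketch of the kind of interface the description above is getting at; the actual `LogStore` trait in `delta-rs` differs in names, error types, and signatures.

```rust
use bytes::Bytes;

/// Rough sketch only; not the real delta-rs trait.
#[async_trait::async_trait]
pub trait CommitLogStore: Send + Sync {
    /// Read the content of a single commit file, e.g. `_delta_log/00000000000000000007.json`.
    async fn read_commit_entry(&self, version: i64) -> anyhow::Result<Bytes>;

    /// Atomically write the commit file for `version`; must fail if that version
    /// already exists. This is where a DynamoDb-backed implementation adds its
    /// coordination instead of relying on atomic rename in the object store.
    async fn write_commit_entry(&self, version: i64, body: Bytes) -> anyhow::Result<()>;

    /// Find the latest committed version, recovering unfinished commits on the way,
    /// as required by the S3DynamoDbLogStore approach described above.
    async fn get_latest_version(&self, current_version: i64) -> anyhow::Result<i64>;
}
```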

---------

Co-authored-by: Robert Pack <[email protected]>
@github-actions github-actions bot added the binding/python Issues for the Python package label Nov 9, 2023
@github-actions github-actions bot removed the binding/python Issues for the Python package label Nov 9, 2023
wjones127 previously approved these changes Nov 10, 2023
Collaborator

@wjones127 wjones127 left a comment


Looks good. Thanks for finishing this.

Collaborator

@wjones127 wjones127 left a comment


It looks like there is a test assertion that just needs to be updated, since `_change_data` and `_delta_index` are exceptions in `is_hidden_directory()`.

Collaborator

roeap commented Nov 10, 2023

We may have to take some additional aspects into consideration when selecting the files to be vacuumed. As it stands, I think we may accidentally delete deletion vector files for files that are still part of the log. We should, however, be able to get the corresponding file names from the add action and add them to the managed files.

Some .crc files might also get deleted prematurely, but these are always optional, so there is no harm to the table data.

None
};
}))
.collect::<HashSet<String>>();
Author

@Jan-Schweizer Jan-Schweizer Nov 11, 2023


Since the deletion vector file path is being constructed on the fly by the absolut_path function, it now must be a `HashSet<String>` instead of `HashSet<&str>`. This also means that we now have to clone the String of the add and remove paths. Is this fine, or is there a better way to handle it?
Since the add/remove path potentially needs to be converted to an absolute path, I need to construct a String anyway.

Also, this nested handling of the deletion vector paths doesn't look nice; can you show me a better/more idiomatic way to achieve this?
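One possibly more idiomatic shape, sketched here with hypothetical types (the real add/remove actions and deletion vector descriptor in delta-rs look different), is to flatten each action's data path and optional deletion vector path in a single iterator chain:

```rust
use std::collections::HashSet;

/// Hypothetical, simplified action type for illustration only.
struct FileAction {
    path: String,
    /// Absolute path of the deletion vector file, if the action carries one.
    deletion_vector_path: Option<String>,
}

/// Collect both data file paths and referenced deletion vector paths into the
/// managed set, so VACUUM never deletes a DV file still referenced by the log.
fn managed_files(actions: &[FileAction]) -> HashSet<String> {
    actions
        .iter()
        .flat_map(|a| {
            // `chain` over the optional DV path avoids a nested match / if-let.
            std::iter::once(a.path.clone()).chain(a.deletion_vector_path.clone())
        })
        .collect()
}
```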

Ok((path_name.starts_with('.') || path_name.starts_with('_'))
&& !path_name.starts_with("_delta_index")
&& !path_name.starts_with("_change_data")
fn is_hidden_file(partition_columns: &[String], path: &Path) -> Result<bool, DeltaTableError> {
Author


I renamed this to `is_hidden_file` since we are not only ignoring directories but also individual files.

Comment on lines +267 to +275
/// Whether a file is contained within the set of managed files.
fn is_file_managed(&self, managed_files: &HashSet<String>, file: &Path) -> bool {
return if is_absolute_path(file.as_ref()) {
managed_files.contains(file.as_ref())
} else {
let path = format!("{}{}", self.log_store.root_uri(), file.as_ref());
managed_files.contains(&path)
};
}
Author


Can the location of an object in the log_store ever be absolute, or is it always relative to the table root? If it is always relative, we wouldn't need the `is_absolute_path` check.

Member


@Jan-Schweizer the location can be an absolute URL, and that's the case with shallow cloned tables. The add action describes the path key.

So for better or worse, we need to check for an absolute path 😄
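To make the two path shapes concrete, here is a minimal sketch with a made-up helper (not delta-rs's actual `is_absolute_path`), assuming, as the snippet above does, that the table's root URI carries its trailing separator:

```rust
use url::Url;

/// Made-up helper: treat anything that parses as a URL with a scheme
/// (s3://, az://, file://, ...) as an absolute location.
fn looks_absolute(path: &str) -> bool {
    Url::parse(path).is_ok()
}

/// Resolve a path from an add action against the table root before the
/// managed-files lookup. Example shapes (values are made up):
///   "part-00000-a0c5.snappy.parquet"                          -> relative, gets prefixed
///   "s3://other-bucket/source-table/part-00000-a0c5.parquet"  -> absolute (e.g. shallow clone), kept as-is
fn resolve(root_uri: &str, path: &str) -> String {
    if looks_absolute(path) {
        path.to_string()
    } else {
        format!("{root_uri}{path}")
    }
}
```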

rtyler previously approved these changes Nov 30, 2023

ion-elgreco and others added 8 commits November 30, 2023 08:36
# Description
I have made a number of improvements to the overall documentation structure, since the example sections were not consistent. I've also enabled some extra features and fixed the issue of some classes/functions not being shown properly.
# Description
This PR prunes every merge bin that contains only a single file, even when there are multiple merge bins.

# Related Issue(s)
- closes delta-io#1901 

# Documentation
This PR adds test_idempotent_with_multiple_bins() for testing.
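Purely as an illustration of the rule described above (not the actual optimize code): pruning keeps only the bins that would actually merge something.

```rust
/// Drop every merge bin that holds a single file, regardless of how many
/// bins exist overall; rewriting a lone file would be a pointless no-op.
fn prune_single_file_bins(bins: Vec<Vec<String>>) -> Vec<Vec<String>> {
    bins.into_iter().filter(|bin| bin.len() > 1).collect()
}
```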
# Description
Prepare for next release

# Description
This is a continuation of the discussion in delta-io#1917: getting rid of panics in the library crate in favor of returning errors, so that library users can handle them as they see fit. Tests are changed accordingly.
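As a generic illustration of the pattern (hypothetical error type, not the actual delta-rs change):

```rust
#[derive(Debug)]
struct CommitConflict(String);

/// Previously a situation like this might have panicked inside the library;
/// returning an error lets the caller decide how to react.
fn reserve_commit_version(version: i64, already_exists: bool) -> Result<i64, CommitConflict> {
    if already_exists {
        return Err(CommitConflict(format!("version {version} already exists")));
    }
    Ok(version)
}
```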

Co-authored-by: Robert Pack <[email protected]>
@github-actions github-actions bot added the binding/python Issues for the Python package label Dec 2, 2023
@github-actions github-actions bot removed the binding/python Issues for the Python package label Dec 2, 2023
@ion-elgreco ion-elgreco self-assigned this Aug 9, 2024
@ion-elgreco
Collaborator

I'll pick this up next week, since we often get issues about vacuum not cleaning up all the files it should.

@rtyler rtyler added the enhancement New feature or request label Oct 8, 2024
Labels
binding/rust (Issues for the Rust crate), crate/core, enhancement (New feature or request)
Projects
None yet
Development

Successfully merging this pull request may close these issues.

Unit and integration tests for the vacuum command
9 participants