Lakectl Diff Support For Checking If Only Metadata Changed #8113
Comments
This would be a welcome change! My expectation was that I'd asked about this in Slack and was directed here: https://lakefs.slack.com/archives/C016726JLJW/p1725669972042319
Just want to comment and say that this is something I have been trying to figure out as well! I agree with OP: if we could filter commits to allow only certain kinds of changes, that would improve our workflow a lot.
Context
It seems that the relevant section of the docs is Limitations / Warnings. It was added here.
Clarifications
There are multiple issues here:
So I would like the third decision to be selectable (by lakectl local config and possibly also an overriding command-line flag), rather than us making a decision in advance. Otherwise, whichever we pick will break some use cases.
@farhanhubble about this:
I completely understand why checking mtime is inefficient for data pipelines: lakectl local will re-upload a file even when it doesn't need to. Now, if we ignore mtimes, how would we determine that the object changed? One way would be to fingerprint or otherwise digest the object, say with CityHash or even some SHA. Since we're now talking about efficiency, we should talk numbers before we go off and digest an entire subdirectory of unchanged files. So... how many objects are involved? What sizes? What is the total size of all objects?
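To make the trade-off concrete, here is a minimal sketch of digest-based change detection. It is not how lakectl local works today; the function names, the choice of SHA-256, and the idea of a previously recorded size and digest are all assumptions made for illustration. A size comparison comes first because it is cheap, and the full digest is only computed when the sizes match.

```python
import hashlib
from pathlib import Path

CHUNK_SIZE = 1 << 20  # read 1 MiB at a time so large objects are never loaded whole


def file_digest(path: Path) -> str:
    """Return the hex SHA-256 digest of the file's contents (hypothetical helper)."""
    h = hashlib.sha256()
    with path.open("rb") as f:
        for chunk in iter(lambda: f.read(CHUNK_SIZE), b""):
            h.update(chunk)
    return h.hexdigest()


def content_changed(path: Path, recorded_size: int, recorded_digest: str) -> bool:
    """Decide whether the contents changed, ignoring mtime entirely.

    recorded_size and recorded_digest are assumed to come from whatever
    index was written at the last sync; that index is not shown here.
    """
    # Cheap check first: a different size always means different contents.
    if path.stat().st_size != recorded_size:
        return True
    # Same size: fall back to the digest, which costs a full read of the file.
    return file_digest(path) != recorded_digest
```

The cost of the second branch is what the questions above are about: every same-size file has to be read in full, so the total bytes on disk matter more than the object count.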
There's a very simple fix to this issue.
Currently, the diff command in lakectl reports files as "modified" even if only their metadata has changed. While this could be useful for some applications, it is inefficient for data pipelines. The diff command should show, per file, whether the changes are to the contents, the metadata, or both. The lakeFS server does show this information with "identical size". The status command should have a similar feature, and local commit should allow committing files filtered by change type.
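As a rough illustration of the requested behaviour, the sketch below classifies a change as content, metadata, or both, and then filters a set of changes by type. Everything in it is hypothetical: `Entry`, `ChangeType`, `classify`, and `filter_changes` are names invented here, and lakectl exposes no such API or flag as far as this thread shows.

```python
from dataclasses import dataclass, field
from enum import Enum


class ChangeType(Enum):
    NONE = "none"
    CONTENT = "content"
    METADATA = "metadata"
    BOTH = "both"


@dataclass
class Entry:
    """Hypothetical per-file record; not a lakeFS type."""
    checksum: str
    size: int
    metadata: dict = field(default_factory=dict)  # e.g. mtime or user metadata


def classify(old: Entry, new: Entry) -> ChangeType:
    """Report which part of a file differs between two versions."""
    content = (old.checksum, old.size) != (new.checksum, new.size)
    meta = old.metadata != new.metadata
    if content and meta:
        return ChangeType.BOTH
    if content:
        return ChangeType.CONTENT
    if meta:
        return ChangeType.METADATA
    return ChangeType.NONE


def filter_changes(changes: dict, wanted: set) -> list:
    """Keep only paths whose (old, new) pair has a wanted change type."""
    return sorted(
        path for path, (old, new) in changes.items()
        if classify(old, new) in wanted
    )
```

A data pipeline that only cares about contents would call `filter_changes(changes, {ChangeType.CONTENT, ChangeType.BOTH})` and skip files where only the mtime moved.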