Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Lakectl Diff Support For Checking If Only Metadata Changed #8113

Open
farhanhubble opened this issue Aug 30, 2024 · 5 comments
Open

Lakectl Diff Support For Checking If Only Metadata Changed #8113

farhanhubble opened this issue Aug 30, 2024 · 5 comments

Comments

@farhanhubble
Copy link

Currently, the diff command in Lakectl reports files as "modified" even if only their metadata has changed. While this could be useful for some applications, it is inefficient for data pipelines. The diff command should show, per file, whether the changes are to the contents, metadata, or both. LakeFS server does show this information with "identical size". The status command should have a similar feature and local commit should allow committing files filtered by change type.

@kujenga
Copy link

kujenga commented Sep 7, 2024

This would be a welcome change! My expectation was that lakectl local would work similarly to how git works, in terms of tracking status of chnages, and what is considers to be a diff. In that paradigm, changing the modification time on a file tracked by lakectl should not be relevant for detecting file changes.

I'd asked about this in Slack and was directed here: https://lakefs.slack.com/archives/C016726JLJW/p1725669972042319

@jameshod5
Copy link

Just want to comment and say that this is something I have been trying to figure out as well! I agree with OP, if we can filter out commit to allow for certain changes, that would improve our personal workflow a lot.

@arielshaqed
Copy link
Contributor

Context

It seems that the relevant section of docs is Limitations / Warnings. It was added here.

Clarifications

There are multiple issues here:

  • What is stored on lakeFS: lakeFS can store or not store mtime and related fields. I believe we all agree that it needs to. For instance this is needed for lakectl local to work. I believe that this is the correct decision.
  • What counts as a Graveler ID: lakeFS digests multiple things into an identity. Currently it ignores Mtime. AFAIR we used to include Mtime in object identity, but that no longer happens. I believe that this is the correct decision - but I can understand why it would not match some use-cases.
  • What lakectl local does: some UNIX commands will break if local ignores mtime. For instance, all versions of Make are driven by mtime. So if lakectl local ignores mtime, it will break some Makefiles. I believe that there is no one correct decision here.

So I would like the third decision to be selectable (by lakectl local config and possibly also an override commandline flag), rather than us making a decision in advance. Otherwise, whichever we pick will break some use-cases.

@arielshaqed
Copy link
Contributor

@farhanhubble about this:

it is inefficient for data pipelines

I completely understand why checking mtime is inefficient for data pipelines - lakectl local will re-upload the file when you don't need it. Now if we ignore mtimes, how would we determine that the object changed? One way to do so would be to fingerprint or similarly digest the object, say into CityHash or even some SHA. Since we're now talking about efficiency, we should talk numbers before we go off and digest an entire subdirectory of unchanged files. So... How many objects are involved? What sizes? What is the total size of all objects?

@N-o-Z
Copy link
Member

N-o-Z commented Sep 18, 2024

There's a very simple fix to this issue.
The reason we save the mtime metadata is because when we upload an object in the server side we used the current time as the updated time. We need to preserve the local mtime so that we will not have a diff between object's local mtime and server mtime.
Instead of saving the metadata on the object in the server side, we could just use this metadata to actually create the entry and drop the metadata. This will result in the same behavior but will not added additional metadata to the object which will cause it to appear modified

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

No branches or pull requests

5 participants