I'm currently running `dvc gc -w` and it's taking a very long time. The repo is very basic: there are no pipelines, I've just `dvc add`ed a bunch of files/directories over time. The slowness seems to come from a stack of `get_used_objs` calls. The bulk of the time is spent in the `dvc.output.Output.get_dir_cache` function, where it runs:
```python
obj = self.cache.get(self.hash_info.value)
try:
    ocheck(self.cache, obj)
except FileNotFoundError as ex:
    if self.remote:
        kwargs["remote"] = self.remote
    with suppress(Exception):
        self.repo.cloud.pull([obj.hash_info], **kwargs)
```
Some of this seems to be because some cache files are not protected, so DVC is checking that they are valid and re-protecting them. But it also seems to be talking to the cloud, and I'm not sure why it would do that for a local garbage-collect operation. I'm wondering if flags could be added to `get_used_objs` to make it skip checks/updates that aren't necessary for a given operation. For `gc -w`, all we care about is determining which files in the cache are referenced by DVC sidecar files and which are not, and then removing the unreferenced ones. We really don't care whether they are corrupted, because we are going to delete them anyway.

Is there anything I'm missing about why DVC would need to query a remote during a `gc -w` operation? Is this sort of optimization feasible?
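For what it's worth, the "referenced vs. unreferenced" set difference I have in mind can be sketched roughly like this (an illustrative simplification, not DVC's actual implementation; the function name is mine, and I'm assuming the local cache layout of `<cache_dir>/<first 2 hash chars>/<rest of hash>`):

```python
import os

def unreferenced_cache_files(cache_dir: str, used_hashes: set) -> list:
    """Return paths of cache entries whose hash is not in used_hashes.

    Assumes the cache layout <cache_dir>/<first 2 hash chars>/<rest>, and
    that used_hashes holds the hashes collected from .dvc sidecar files.
    No validity or corruption check is needed here: an entry is either
    referenced (kept) or unreferenced (a deletion candidate).
    """
    stale = []
    for prefix in sorted(os.listdir(cache_dir)):
        subdir = os.path.join(cache_dir, prefix)
        if not os.path.isdir(subdir):
            continue
        for name in sorted(os.listdir(subdir)):
            # Reconstruct the full hash from the 2-char prefix directory
            # plus the file name, and keep anything not referenced.
            if prefix + name not in used_hashes:
                stale.append(os.path.join(subdir, name))
    return stale
```

The point is that nothing in this pure local scan needs to validate cache contents or contact a remote, which is why the cloud `pull` in the trace above surprised me.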