
Viewer shows outdated cache after renaming a repo and creating a new one with the old name #2964

Open
albertvillanova opened this issue Jul 2, 2024 · 5 comments
Labels
bug Something isn't working P2 Nice to have

Comments

@albertvillanova
Member

Reported by @lewtun (internal link: https://huggingface.slack.com/archives/C02EMARJ65P/p1719818961944059):

If I rename a dataset via the UI from D -> D' and then create a new dataset with the same name D, I seem to get a copy instead of an empty dataset
Indeed it was the dataset viewer showing a cached result - the git history is clean and there are no files in the new dataset repo

@julien-c
Member

julien-c commented Jul 2, 2024

Is it possible to also list the solution we discussed in that thread (moving from repo name to _id)?

To make the discussion a bit more efficient.

@lhoestq
Member

lhoestq commented Jul 2, 2024

in particular:

_id field in hf.co/api/datasets - guaranteed immutable for a given repo

Some first thoughts:

  • we can add this field to jobs created when we receive a webhook (or for any job creation)
  • add this info in the cache entries
  • update the logic that is currently based on the repo name to be _id-centric:
    • this concerns S3 assets and cached assets, as well as the Parquet metadata on NFS
    • since we store the full locations in the cache entries, this can likely be done without immediately migrating the existing file locations
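The cache-entry change described above could be sketched like this (hypothetical field names, not the actual dataset-viewer schema): the immutable repo `_id` is stored next to the mutable dataset name, alongside the full asset location that is already kept today.

```python
from dataclasses import dataclass

# Hypothetical cache-entry shape (a sketch, not the real schema): the repo _id,
# guaranteed immutable for a given repo, is stored next to the dataset name,
# which can change on rename.
@dataclass
class CacheEntry:
    dataset: str           # repo name, e.g. "severo/doc-image-3" (mutable)
    repo_id: str           # immutable _id from hf.co/api/datasets
    kind: str              # processing step, e.g. "dataset-config-names"
    content_location: str  # full asset location (S3 / NFS), stored as-is

entry = CacheEntry(
    dataset="severo/doc-image-3",
    repo_id="64f1b2c3d4e5f6a7b8c9d0e1",  # made-up _id for illustration
    kind="dataset-config-names",
    content_location="s3://assets/64f1b2c3d4e5f6a7b8c9d0e1/config-names.json",
)
```

Keying the asset location on the `_id` rather than the name is what would let a renamed repo keep its files without a migration.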

Then regarding the API:

  • the API calls from the Hub could use the _id in addition to the repo name, to make sure they get the right data (even if there is somehow outdated data in the cache)

I also considered using the _id everywhere as the source of truth, but I anticipate it would just move the problem to wherever we cache the _id <-> repo name mapping (the repo name is always needed to read/write to repos, and also for the dataset-viewer API).
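The API-side check described above could be as simple as comparing the `_id` sent by the Hub with the one stored in the cache entry, and treating a mismatch as a cache miss. A minimal sketch with made-up names, not the actual dataset-viewer code:

```python
from typing import Optional

def get_cached_response(cached: dict, expected_repo_id: Optional[str]) -> Optional[dict]:
    """Return the cached content, or None (treat as a cache miss) when the
    entry was computed for a repo with the same name but a different _id."""
    if expected_repo_id is not None and cached.get("repo_id") != expected_repo_id:
        return None  # outdated entry from the renamed repo: recompute instead
    return cached.get("content")

# An entry left over from the old (renamed) repo, with a hypothetical _id:
stale = {"repo_id": "old-id", "content": {"rows": 100}}

print(get_cached_response(stale, "new-id"))  # None: mismatch, must recompute
print(get_cached_response(stale, "old-id"))  # {'rows': 100}
```

Passing `None` for callers that don't know the `_id` keeps the check opt-in, matching the "in addition to the repo name" framing above.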

@albertvillanova
Member Author

albertvillanova commented Jul 3, 2024

Thanks for the complementary information, @lhoestq.

So, basically, we would need a complete refactoring of all the repository-identification logic, and you also think this would just move the problem elsewhere... 🤔

I am wondering if, instead, we could address the real underlying problem: properly handling the repository rename event, even if a new repository with the old name is created afterwards.

@severo severo added bug Something isn't working P1 Not as needed as P0, but still important/wanted labels Jul 8, 2024
@severo
Collaborator

severo commented Aug 22, 2024

from the thread, the argument for using _id was:

we said a couple of weeks ago (IIRC) that re-computing everything when a repo is renamed is quite inefficient

But since the cached contents generally contain the dataset name, possibly in several places (such as the asset URLs and other hard-to-patch cases), I still think it's better to delete and recompute for the rare event when a dataset is renamed.
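The delete-and-recompute approach could be sketched like this (hypothetical function and payload names; the real webhook format and job queue differ): on a "move" event, drop everything computed under the old name and enqueue fresh jobs for the new one, so a new repo reusing the old name never sees stale results.

```python
# Sketch of delete-and-recompute on a repo rename (move) webhook.
# `cache` maps dataset name -> cached content; `queue` is a job queue,
# both simplified stand-ins for the real stores.

def handle_move_webhook(old_name: str, new_name: str, cache: dict, queue: list) -> None:
    cache.pop(old_name, None)  # delete all cached entries for the old name
    queue.append(new_name)     # recompute everything under the new name

cache = {"severo/doc-image-3": {"rows": 100}}
queue: list = []
handle_move_webhook("severo/doc-image-3", "severo/doc-image-3-renamed", cache, queue)

print(cache)  # {} - old entries gone, nothing for a new repo to inherit
print(queue)  # ['severo/doc-image-3-renamed']
```

This matches the observed behavior in the reproduction steps below: after a rename, the viewer waits for fresh jobs instead of serving the old cache.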

@severo
Collaborator

severo commented Aug 23, 2024

Note also that this seems to be a very rare corner case, which appeared either because the backend was broken for some reason, or because the webhook for the first dataset rename was not taken into account.

In a normal state, I cannot reproduce:

1: original dataset severo/doc-image-3

[Screenshot 2024-08-23 at 16:09:08]

2: rename to severo/doc-image-3-renamed: the jobs are created to refresh the viewer with the new name

[Screenshot 2024-08-23 at 16:09:44]

3: meanwhile, create a new empty dataset with the original name severo/doc-image-3. The bug does not reproduce: the viewer waits for the jobs to finish

[Screenshot 2024-08-23 at 16:10:07]

4: later, when the jobs have finished, everything is as expected

[Screenshots 2024-08-23 at 16:13:56 and 16:14:01]

As an additional protection, we could store the repo _id alongside the dataset name in the cache entries, so that we're 100% sure which repo was used to produce them. But I feel it's not THAT crucial at the moment.

Can we close and reopen if we observe it again?

@severo severo added P2 Nice to have and removed P1 Not as needed as P0, but still important/wanted labels Aug 23, 2024