
Viewer shows outdated cache after renaming a repo and creating a new one with the old name #2964

Open
albertvillanova opened this issue Jul 2, 2024 · 5 comments
Labels
bug Something isn't working P2 Nice to have

Comments

@albertvillanova
Member

Reported by @lewtun (internal link: https://huggingface.slack.com/archives/C02EMARJ65P/p1719818961944059):

If I rename a dataset via the UI from D -> D' and then create a new dataset with the same name D, I seem to get a copy instead of an empty dataset
Indeed it was the dataset viewer showing a cached result - the git history is clean and there are no files in the new dataset repo

@julien-c
Member

julien-c commented Jul 2, 2024

Is it possible to also list the solution we discussed in that thread (moving from repo name to _id)?

To make the discussion a bit more efficient.

@lhoestq
Member

lhoestq commented Jul 2, 2024

in particular:

_id field in hf.co/api/datasets - guaranteed immutable for a given repo

Some first thoughts:

  • we can add this field to jobs created when we receive a webhook (or for any job creation)
  • add this info in the cache entries
  • update the logic that is currently based on the repo name to be _id-centric:
    • this concerns S3 assets and cached assets, as well as the Parquet metadata on NFS
    • since we store the full locations in the cache entries, this can likely be done without immediately migrating the existing file locations
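The cache-entry change described above could be sketched like this (hypothetical field names, not the actual dataset-viewer schema): the immutable repo `_id` is stored next to the mutable dataset name, alongside the full asset location that is already kept today.

```python
from dataclasses import dataclass

# Hypothetical cache-entry shape (a sketch, not the real schema): the repo _id,
# guaranteed immutable for a given repo, is stored next to the dataset name,
# which can change on rename.
@dataclass
class CacheEntry:
    dataset: str           # repo name, e.g. "severo/doc-image-3" (mutable)
    repo_id: str           # immutable _id from hf.co/api/datasets
    kind: str              # processing step, e.g. "dataset-config-names"
    content_location: str  # full asset location (S3 / NFS), stored as-is

entry = CacheEntry(
    dataset="severo/doc-image-3",
    repo_id="64f1b2c3d4e5f6a7b8c9d0e1",  # made-up _id for illustration
    kind="dataset-config-names",
    content_location="s3://assets/64f1b2c3d4e5f6a7b8c9d0e1/config-names.json",
)
```

Keying the asset location on the `_id` rather than the name is what would let a renamed repo keep its files without a migration.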

Then regarding the API:

  • the API calls from the Hub could use the _id in addition to the repo name, to make sure they get the right data (even if there is somehow outdated data in the cache)

I also considered using the _id everywhere as the source of truth, but I anticipate it would just move the problem to wherever we cache the _id <-> repo name mapping (the repo name is always needed to read/write to repos, and also for the dataset-viewer API).
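The API-side check described above could be as simple as comparing the `_id` sent by the Hub with the one stored in the cache entry, and treating a mismatch as a cache miss. A minimal sketch with made-up names, not the actual dataset-viewer code:

```python
from typing import Optional

def get_cached_response(cached: dict, expected_repo_id: Optional[str]) -> Optional[dict]:
    """Return the cached content, or None (treat as a cache miss) when the
    entry was computed for a repo with the same name but a different _id."""
    if expected_repo_id is not None and cached.get("repo_id") != expected_repo_id:
        return None  # outdated entry from the renamed repo: recompute instead
    return cached.get("content")

# An entry left over from the old (renamed) repo, with a hypothetical _id:
stale = {"repo_id": "old-id", "content": {"rows": 100}}

print(get_cached_response(stale, "new-id"))  # None: mismatch, must recompute
print(get_cached_response(stale, "old-id"))  # {'rows': 100}
```

Passing `None` for callers that don't know the `_id` keeps the check opt-in, matching the "in addition to the repo name" framing above.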

@albertvillanova
Member Author

albertvillanova commented Jul 3, 2024

Thanks for the complementary information, @lhoestq.

So, basically, we would need a complete refactoring of all the repository-identification logic, and you also think this would just move the problem elsewhere... 🤔

I am wondering if, instead, we could address the real underlying problem: properly handling the repository rename event, even if a new repository with the old name is created afterwards.

@severo severo added bug Something isn't working P1 Not as needed as P0, but still important/wanted labels Jul 8, 2024
@severo
Collaborator

severo commented Aug 22, 2024

from the thread, the argument for using _id was:

we said a couple of weeks ago (IIRC) that re-computing everything when a repo is renamed is quite inefficient

But since the cached contents generally contain the dataset name, possibly in several places (such as the asset URLs and other hard-to-patch cases), I still think it's better to delete and recompute for the rare event when a dataset is renamed.
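The delete-and-recompute approach could be sketched like this (hypothetical function and payload names; the real webhook format and job queue differ): on a "move" event, drop everything computed under the old name and enqueue fresh jobs for the new one, so a new repo reusing the old name never sees stale results.

```python
# Sketch of delete-and-recompute on a repo rename (move) webhook.
# `cache` maps dataset name -> cached content; `queue` is a job queue,
# both simplified stand-ins for the real stores.

def handle_move_webhook(old_name: str, new_name: str, cache: dict, queue: list) -> None:
    cache.pop(old_name, None)  # delete all cached entries for the old name
    queue.append(new_name)     # recompute everything under the new name

cache = {"severo/doc-image-3": {"rows": 100}}
queue: list = []
handle_move_webhook("severo/doc-image-3", "severo/doc-image-3-renamed", cache, queue)

print(cache)  # {} - old entries gone, nothing for a new repo to inherit
print(queue)  # ['severo/doc-image-3-renamed']
```

This matches the observed behavior in the reproduction steps below: after a rename, the viewer waits for fresh jobs instead of serving the old cache.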

@severo
Collaborator

severo commented Aug 23, 2024

Note also that this seems to be a very rare corner case, which appeared either because the backend was broken for some reason, or because the webhook for the first dataset rename was not taken into account.

In a normal state, I cannot reproduce:

1: original dataset severo/doc-image-3

[Screenshot 2024-08-23 at 16:09:08]

2: rename to severo/doc-image-3-renamed: the jobs are created to refresh the viewer with the new name

[Screenshot 2024-08-23 at 16:09:44]

3: meanwhile, create a new empty dataset with the original name severo/doc-image-3. The bug does not reproduce: the viewer waits for the jobs to finish

[Screenshot 2024-08-23 at 16:10:07]

4: later, when the jobs have finished, everything is as expected

[Screenshots 2024-08-23 at 16:13:56 and 16:14:01]

As an additional protection, we could store the repo _id alongside the dataset name in the cache entries, so that we're 100% sure which repo was used to produce them. But I feel it's not THAT crucial at the moment.

Can we close and reopen if we observe it again?

@severo severo added P2 Nice to have and removed P1 Not as needed as P0, but still important/wanted labels Aug 23, 2024