[mieb] SketchyI2IRetrieval hangs in the beginning #1686

Open
Muennighoff opened this issue Jan 2, 2025 · 2 comments
Labels
mieb The image extension of MTEB

Comments

@Muennighoff
Contributor

@gowitheflow-1998 and I are both seeing a very long hang at the start when running SketchyI2IRetrieval. It would be great to figure out why and maybe fix it; users may think something has gone wrong and cancel the run even though it will eventually start.

@isaac-chung isaac-chung added the mieb The image extension of MTEB label Jan 3, 2025
@izhx
Contributor

izhx commented Jan 5, 2025

I think this might be because the qrels are too large (90.6M rows according to https://huggingface.co/datasets/JamieSJS/sketchy/viewer/qrels), so building qrels_dict with map is slow.

https://github.com/embeddings-benchmark/mteb/blob/mieb/mteb/abstasks/Image/AbsTaskAny2AnyRetrieval.py#L99

In my test, this self.qrels.map(qrels_dict_init) call took 45 minutes.

log:

Sun Jan  5 13:29:26 2025 - Done load qrels, before map qrels_dict
Sun Jan  5 14:14:27 2025 - Done qrels_dict

code:

        qrels_dict = defaultdict(dict)
        # f'{time.asctime()} - Done load qrels, before map qrels_dict'

        def qrels_dict_init(row):
            qrels_dict[row["query-id"]][row["corpus-id"]] = int(row["score"])

        self.qrels.map(qrels_dict_init)
        # f'{time.asctime()} - Done qrels_dict'
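
One possible way to avoid the per-row map call, not tested on the full Sketchy qrels: read the three columns once and fill the dict in a plain Python loop. This is only a sketch. `build_qrels_dict` is a hypothetical helper, not part of mteb, and it assumes the columns can be obtained as sequences (for example via `self.qrels["query-id"]`). Whether it is actually faster on 90.6M rows would need to be measured.

```python
from collections import defaultdict


def build_qrels_dict(query_ids, corpus_ids, scores):
    """Build {query_id: {corpus_id: score}} from three parallel columns.

    Hypothetical helper: the arguments are assumed to be equally long
    sequences, e.g. the "query-id", "corpus-id" and "score" columns.
    """
    qrels_dict = defaultdict(dict)
    for qid, cid, score in zip(query_ids, corpus_ids, scores):
        qrels_dict[qid][cid] = int(score)
    return qrels_dict


# Toy columns standing in for the real qrels split.
toy = build_qrels_dict(
    ["q1", "q1", "q2"],
    ["d1", "d2", "d1"],
    [1, 0, 1],
)
```

The resulting structure is the same nested mapping that qrels_dict_init builds, so the rest of the task code would not need to change under this assumption.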

@gowitheflow-1998
Contributor

Thanks for investigating this so quickly! Sketchy is definitely worth downsampling then. I'm not sure how many classes there are in total, but maybe for each class we keep only 1/n of the sketches and 1/m of the real images? It looks like there are about 200 real-life images per class (judging by 90M qrels / 450k queries). Maybe make it 20. wdyt? @Jamie-Stirling
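
A rough sketch of what the per-class downsampling could look like, under stated assumptions: the ids are already grouped by class label, and `downsample_by_class` and `filter_qrels` are hypothetical helpers, not existing mteb functions. The class names and ids below are toy data; the 90.6M and 450k figures are the approximate counts quoted in this thread.

```python
import random


def downsample_by_class(ids_by_class, keep_per_class, seed=42):
    """Keep at most `keep_per_class` ids for every class, reproducibly."""
    rng = random.Random(seed)
    kept = {}
    for label in sorted(ids_by_class):
        ids = sorted(ids_by_class[label])  # fixed order before sampling
        k = min(keep_per_class, len(ids))
        kept[label] = rng.sample(ids, k)
    return kept


def filter_qrels(qrels, kept_queries, kept_corpus):
    """Drop every qrels entry whose query or corpus id was not kept."""
    return {
        qid: {cid: s for cid, s in docs.items() if cid in kept_corpus}
        for qid, docs in qrels.items()
        if qid in kept_queries
    }


# Back-of-the-envelope check of the "about 200 images per class" estimate.
relevant_per_query = 90_600_000 / 450_000

# Toy data: two classes with 200 real images each, reduced to 20 per class.
images_by_class = {
    "cat": [f"cat_{i}" for i in range(200)],
    "dog": [f"dog_{i}" for i in range(200)],
}
kept_images = downsample_by_class(images_by_class, keep_per_class=20)
kept_corpus = {cid for ids in kept_images.values() for cid in ids}

# Toy qrels: one sketch query that is relevant to every cat image.
toy_qrels = {"sketch_cat_0": {cid: 1 for cid in images_by_class["cat"]}}
small_qrels = filter_qrels(toy_qrels, {"sketch_cat_0"}, kept_corpus)
```

With both sketches and real images reduced by a factor of 10, the number of qrels would drop by roughly a factor of 100, assuming every sketch of a class is relevant to every real image of that class.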
