
Integrate UniBench #1590

Open
Samoed opened this issue Dec 14, 2024 · 20 comments
Labels
mieb (The image extension of MTEB), new benchmark (Issues related to adding a new benchmark)

Comments

@Samoed
Collaborator

Samoed commented Dec 14, 2024

Paper: https://arxiv.org/abs/2408.04810
Code: https://github.com/facebookresearch/unibench
List of tasks: https://github.com/facebookresearch/unibench/blob/main/unibench/benchmarks_zoo/benchmarks.py

@Samoed added the new dataset (Issues related to adding a new task or dataset) and mieb (The image extension of MTEB) labels on Dec 14, 2024
@isaac-chung added the new benchmark (Issues related to adding a new benchmark) label and removed the new dataset (Issues related to adding a new task or dataset) label on Dec 24, 2024
@YashDThapliyal
Contributor

YashDThapliyal commented Jan 9, 2025

@isaac-chung @Samoed
Hi Team,

I’m currently reviewing the paper and going through the code to better understand UniBench, but I’m unsure how to begin this task. Could someone provide some guidance or suggestions on how I could get started?

Any advice would be greatly appreciated. Thank you so much!

Best,
Yash

@isaac-chung
Collaborator

There seems to be quite a lot of overlap already with what we've implemented in MIEB (see each task type folder here). I'd say you could:

  1. Identify the diff in datasets between UniBench and MIEB (link above)
  2. Add those datasets following this docs page
  3. Submit a PR along with some model results that can validate the implementation (a rough sketch of producing results follows below)
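
For step 3, results are usually produced by running a model over the new tasks with the standard runner. A minimal sketch, assuming the task name is one you add in step 2 and that a CLIP checkpoint such as "openai/clip-vit-base-patch32" is registered in the mieb branch's model registry:

import mteb

# placeholder task name; replace with the tasks added in step 2
tasks = mteb.get_tasks(tasks=["FashionMNIST"])

# model name assumed to be registered in the mieb branch's model registry
model = mteb.get_model("openai/clip-vit-base-patch32")

evaluation = mteb.MTEB(tasks=tasks)
evaluation.run(model, output_folder="results")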

cc @gowitheflow-1998

@YashDThapliyal
Contributor

YashDThapliyal commented Jan 12, 2025

@isaac-chung Awesome, that makes sense. Quick clarification: when I make a new branch, should I branch from mieb (the branch you linked above) or from the normal MTEB main? (Asking so that when I make a PR it will get merged properly.)

@isaac-chung
Collaborator

@YashDThapliyal from mieb please. Thanks!

@YashDThapliyal
Contributor

YashDThapliyal commented Jan 12, 2025

@isaac-chung

Okay so I compared the datasets between UniBench and MTEB's Image tasks. Here’s what I did:

For UniBench: Used a script to extract dataset names from the dataset_url fields in benchmarks.py, generating UniBenchDatasets.txt.

For MTEB: Manually copied the names of files from each eng folder for each task in /Image. Then I consolidated them in MTEB_all_datasets.txt and processed the unique names into MTEB_unique_datasets.txt.

When I compared both files to identify differences, I found a total of 55 datasets exclusive to UniBench.

I'd like to confirm whether this approach is correct before proceeding, because if so, would I have to create 55 new files as specified in https://github.com/embeddings-benchmark/mteb/blob/main/docs/adding_a_dataset.md?
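
For reference, the extraction step was roughly the following (a sketch; the exact assignment syntax around dataset_url in benchmarks.py is an assumption):

import re
from pathlib import Path

# local checkout of https://github.com/facebookresearch/unibench
text = Path("unibench/benchmarks_zoo/benchmarks.py").read_text()

# pull out values like "haideraltahan/wds_cifar10"
urls = re.findall(r"""dataset_url["']?\s*[:=]\s*["']([^"']+)["']""", text)

# drop the owner and the "wds_" prefix, e.g. "haideraltahan/wds_cifar10" -> "cifar10"
names = sorted({u.split("/")[-1].removeprefix("wds_").lower() for u in urls})

Path("UniBenchDatasets.txt").write_text("\n".join(names))
print(f"{len(names)} UniBench dataset names written")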

Thanks!
Yash

@isaac-chung
Collaborator

isaac-chung commented Jan 12, 2025

@YashDThapliyal Sure, sounds good, but do note the parsing needed. For instance, UniBench uses "haideraltahan/wds_cifar10" whereas MIEB uses the original "uoft-cs/cifar10". Here, the part of the URL before the "_" is to be omitted. In such cases, I'd consider 'zero-shot cifar10' as covered. (Note that MIEB covers the linear probe variant as well as zero-shot for each of the classification tasks, whereas UniBench does not.)

Please share all 3 .txt files as well. I doubt that there are that many: I only counted 50 occurrences of "@register_benchmark" in benchmarks.py from UniBench, whereas the paper cites 53 tasks. I would love to understand how 55 came about.

@YashDThapliyal
Contributor

YashDThapliyal commented Jan 12, 2025

@isaac-chung Yeah, my code strips the part of the name before the "_", so that shouldn't be an issue.

This is the link to the folder where the parsing code as well as all the files are stored: https://github.com/YashDThapliyal/mteb/tree/integrating-uni-bench/mteb/tasks/Image/UniBenchIntegration

MTEB_unique_datasets.txt
UniBench_not_in_MTEB.txt
UniBenchDatasets.txt

@Samoed
Collaborator Author

Samoed commented Jan 12, 2025

@YashDThapliyal How did you generate UniBench_not_in_MTEB? In your list you have wds_imagenet1k, but it is already integrated.

@YashDThapliyal
Contributor

YashDThapliyal commented Jan 12, 2025

@Samoed I did this in Python:
unique_to_unibench = sorted(unibench_datasets - set(unique_mteb_datasets))

@isaac-chung
Collaborator

For MIEB tasks, I think your script should programmatically go through all subfolders of mteb/tasks/Image in the mieb branch, not just the eng ones. The result is currently off by a lot (e.g., MIEB already has cifar, voc2007, winoground, etc.), and I'd suggest updating your script to match the ground truth.
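
A minimal sketch of that, assuming a local checkout of the mieb branch:

from pathlib import Path

image_dir = Path("mteb/tasks/Image")

# walk every subfolder, not just eng/, skipping __init__.py
# note: a single file can define several tasks (e.g. CIFAR.py covers CIFAR10 and CIFAR100),
# so this is only approximate
mieb_names = sorted(
    path.stem.lower()
    for path in image_dir.rglob("*.py")
    if path.name != "__init__.py"
)

print(len(mieb_names))
print("\n".join(mieb_names))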

@YashDThapliyal
Contributor

YashDThapliyal commented Jan 12, 2025

@isaac-chung Okay, I will write a new script that does this. Basically, I would just need to get the names of every file within mteb/tasks/Image except the __init__ files, correct? I was debating doing either that or going through all the __init__ files, getting the imports, and filtering those.

@isaac-chung
Collaborator

Whatever works for you, as long as your MIEB results match the ground truth.

Maybe reading from files like https://github.com/embeddings-benchmark/mteb/blob/mieb/mteb%2Ftasks%2FImage%2FZeroshotClassification%2F__init__.py might be easier as you mentioned.

from .eng.Birdsnap import *
from .eng.Caltech101 import *
from .eng.CIFAR import *
from .eng.CLEVR import *
from .eng.Country211 import *
from .eng.DTD import *
from .eng.EuroSAT import *
...
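
A rough sketch of that approach (the regex is an assumption; adjust it to the actual import layout):

import re
from pathlib import Path

names = set()
for init_file in Path("mteb/tasks/Image").rglob("__init__.py"):
    for line in init_file.read_text().splitlines():
        # lines look like: from .eng.Birdsnap import *
        match = re.match(r"from \.(?:\w+\.)*(\w+) import", line)
        if match:
            names.add(match.group(1).lower())

print(len(names))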

@YashDThapliyal
Contributor

YashDThapliyal commented Jan 13, 2025

Hi @isaac-chung,

I wrote a script that grabs all the import statements from the MIEB image folder, parses the dataset names, and compares them with those in UniBench. This resulted in 40 unique names in UniBench that aren’t in MIEB. I noticed you previously mentioned "ground truth," but I’m unsure what you meant by that. If you notice any discrepancies, please let me know.

I had some parsing issues earlier, but I’ve refined the scripts, and everything should now be working as expected.

Relevant Links:

Let me know if you’d like me to tweak anything or investigate further. Otherwise, I’ll proceed to create 40 new datasets following the docs page, ensuring that each dataset implementation includes a model at the end of the file for testing the data and saving the results.

Best,
Yash

@isaac-chung
Collaborator

Thanks, Yash.

ground truth

This is referring to the actual datasets present in MIEB. You can manually inspect the subfolders and the .py files to validate your own results.

These are still not quite correct. For example, CIFAR is in "Datasets in MIEB image folder", but that file contains CIFAR10 and CIFAR100, which are not included in the outputs. This led to cifar10 and cifar100 erroneously being present in "Datasets Unique to UniBench".

  1. Perhaps a better way to read all tasks is to get all mieb tasks by category, e.g. i2i, i2t, etc.
  2. It might also be better if we compare names in all lowercase, e.g. Birdsnap -> birdsnap.

Here is the script I used to extract all 138 MIEB task names and output them in the same format:

from mteb import get_tasks

mieb_tasks = get_tasks(categories=["i2i","i2t","t2i","it2t","it2i","i2it","t2it","it2it"])
num_tasks = len(mieb_tasks)
print(num_tasks)

# print names in newline
for task in mieb_tasks:
    print(task.metadata.name.lower())

Here is the output for your convenience. Note that:

  1. These names are often in the {NAME}{CATEGORY}{TASK_TYPE} format, e.g. in roxfordhardi2iretrieval, roxfordhard is the name (r-oxford hard), i2i means image-to-image, and retrieval is the task type.
  2. You might see the same dataset having different task types. For instance, CIFAR10 has clustering, zero-shot, and linear probing variants.
  3. Due to point 1, these names won't always be a natural 1-to-1 match with the UniBench dataset names, e.g. in UniBench, Stanford Cars is simply called "cars", whereas in MIEB it's called "stanfordcars". This is likely the hardest part of the exercise. Since there are only ~50 datasets in UniBench, I might suggest comparing these manually (e.g. use ctrl + F to find substrings; see the rough helper sketched after the name list below).

Let me know if you have further questions.

MIEB dataset names in lowercase:
blinkit2imultichoice
blinkit2tmultichoice
imagecodet2imultichoice
roxfordeasyi2imultichoice
roxfordmediumi2imultichoice
roxfordhardi2imultichoice
rpariseasyi2imultichoice
rparismediumi2imultichoice
rparishardi2imultichoice
blinkit2iretrieval
blinkit2tretrieval
cirrit2iretrieval
cub200i2iretrieval
edist2itretrieval
encyclopediavqait2itretrieval
fashion200ki2tretrieval
fashion200kt2iretrieval
fashioniqit2iretrieval
flickr30ki2tretrieval
flickr30kt2iretrieval
forbi2iretrieval
gldv2i2iretrieval
gldv2i2tretrieval
hatefulmemesi2tretrieval
hatefulmemest2iretrieval
imagecodet2iretrieval
infoseekit2itretrieval
infoseekit2tretrieval
llavait2tretrieval
memotioni2tretrieval
memotiont2iretrieval
meti2iretrieval
mscocoi2tretrieval
mscocot2iretrieval
nightsi2iretrieval
okvqait2tretrieval
ovenit2itretrieval
ovenit2tretrieval
remuqit2tretrieval
roxfordeasyi2iretrieval
roxfordmediumi2iretrieval
roxfordhardi2iretrieval
rp2ki2iretrieval
rpariseasyi2iretrieval
rparismediumi2iretrieval
rparishardi2iretrieval
scimmiri2tretrieval
scimmirt2iretrieval
sketchyi2iretrieval
sopi2iretrieval
stanfordcarsi2iretrieval
tuberlint2iretrieval
vidorearxivqaretrieval
vidoredocvqaretrieval
vidoreinfovqaretrieval
vidoretabfquadretrieval
vidoretatdqaretrieval
vidoreshiftprojectretrieval
vidoresyntheticdocqaairetrieval
vidoresyntheticdocqaenergyretrieval
vidoresyntheticdocqagovernmentreportsretrieval
vidoresyntheticdocqahealthcareindustryretrieval
visualnewsi2tretrieval
visualnewst2iretrieval
vizwizit2tretrieval
vqa2it2tretrieval
webqat2itretrieval
witt2iretrieval
xflickr30kcot2iretrieval
xm3600t2iretrieval
cvbenchcount
cvbenchrelation
cvbenchdepth
cvbenchdistance
cifar10clustering
cifar100clustering
imagenetdog15clustering
imagenet10clustering
tinyimagenetclustering
birdsnap
caltech101
cifar10
cifar100
country211
dtd
eurosat
fer2013
fgvcaircraft
food101classification
gtsrb
imagenet1k
mnist
oxfordflowersclassification
oxfordpets
patchcamelyon
resisc45
stanfordcars
stl10
sun397
ucf101
voc2007
arococoorder
aroflickrorder
arovisualattribution
arovisualrelation
sugarcrepe
winoground
sts12visualsts
sts13visualsts
sts14visualsts
sts15visualsts
sts16visualsts
sts17multilingualvisualsts
stsbenchmarkmultilingualvisualsts
birdsnapzeroshot
caltech101zeroshot
cifar10zeroshot
cifar100zeroshot
clevrzeroshot
clevrcountzeroshot
country211zeroshot
dtdzeroshot
eurosatzeroshot
fer2013zeroshot
fgvcaircraftzeroshot
food101zeroshot
gtsrbzeroshot
imagenet1kzeroshot
mnistzeroshot
oxfordpetszeroshot
patchcamelyonzeroshot
renderedsst2
resisc45zeroshot
scimmir
stanfordcarszeroshot
stl10zeroshot
sun397zeroshot
ucf101zeroshot
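
For the manual matching in point 3, a rough substring helper (purely illustrative, not from either codebase) can at least flag likely candidates:

def suggest_matches(unibench_name: str, mieb_names: list[str]) -> list[str]:
    """Return MIEB task names that contain the UniBench name as a substring."""
    key = unibench_name.lower().replace("_", "").replace(" ", "")
    return [name for name in mieb_names if key in name]

# e.g. "cars" flags stanfordcarsi2iretrieval, stanfordcars, and stanfordcarszeroshot
# from the list above; anything with no hits still needs a manual look.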

In light of this discussion, I feel that MIEB could use a task table similar to the one in MMTEB. CC @gowitheflow-1998 @KennethEnevoldsen

@isaac-chung
Collaborator

A quick scan reduces the list of datasets unique to UniBench to the following. This gives ~20 datasets, not counting the ones with notes. There are also a lot of imagenet variants, which I'm not sure we want to include. Maybe @gowitheflow-1998 can chime in here.

countbench (MIEB has CVBench count from https://arxiv.org/pdf/2406.16860 instead)
diabetic_retinopathy
dmlab
dollar_street
dsprites_label_orientation
dsprites_label_x_position
dsprites_label_y_position
fashion_mnist (MIEB has fashion200k and FashionIQ)
imagenet9
imagenet_sketch (MIEB has sketchy from the MET dataset)
imageneta
imagenetc
imagenete
imageneto
imagenetr
imagenetv2
inaturalist
kitti_closest_vehicle_distance
objectnet
places365 (its terms include not distributing the images; opted not to include it in the repo)
pug_imagenet
smallnorb_label_azimuth
smallnorb_label_elevation
svhn

@YashDThapliyal
Contributor

YashDThapliyal commented Jan 13, 2025

@isaac-chung Thank you so much for generating this list, which I assume is the list of datasets actually unique to UniBench.

I think the next step is just to wait for confirmation from @gowitheflow-1998 to finalize the list of datasets that need implementing, and then I can begin implementing them by following the guide you linked above.

For now, I will go ahead and delete the files/scripts I was using to generate names of the datasets, and will make a new folder within /Image called UniBench where I can create the datasets and have an init file for them as well.

@gowitheflow-1998
Contributor

@YashDThapliyal @isaac-chung thanks so much for the efforts! The unique list looks great.

For now, I will go ahead and delete the files/scripts I was using to generate names of the datasets, and will make a new folder within /Image called UniBench where I can create the datasets and have an init file for them as well.

A new folder within /Image works, although it's not strictly necessary (whatever works better for you!). You can also just put each newly implemented task under the abstask folder it corresponds to (e.g., if fashion_mnist is implemented as a linear probing (classification) task, put it under https://github.com/embeddings-benchmark/mteb/tree/mieb/mteb/tasks/Image/ImageClassification/eng), and at the end we can define the list of UniBench tasks in benchmark.py so we can grab only UniBench tasks for evaluation.
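
A minimal sketch of that last step, assuming the Benchmark container used for other benchmark definitions in the repo (the import path and the task names below are placeholders for the tasks to be implemented in this issue):

from mteb import get_tasks
from mteb.benchmarks.benchmarks import Benchmark  # import path assumed

UNIBENCH = Benchmark(
    name="UniBench",
    # placeholder task names; replace with the tasks added for this issue
    tasks=get_tasks(tasks=["FashionMNIST", "SVHN", "ObjectNet"]),
    description="UniBench (https://arxiv.org/abs/2408.04810) tasks not previously covered by MIEB.",
    reference="https://github.com/facebookresearch/unibench",
    citation=None,
)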

@YashDThapliyal
Contributor

YashDThapliyal commented Jan 13, 2025

@gowitheflow-1998 That makes sense. I can try to do that, but if it gets too complicated I may end up creating a folder just for simplicity :).

Quick clarification though: am I still implementing all the imagenet variants? Additionally, about the actual details of implementing the datasets: should I just Google them and try to find them on Hugging Face to get all of the relevant data needed to fill out the template in adding a dataset?

@gowitheflow-1998
Contributor

@YashDThapliyal I think the imagenet variants are all worth implementing! They either have domain differences or evaluate different properties such as robustness, and are thus useful. Yes, you should Google them for the actual details. If an existing dataset on Hugging Face matches the details in the paper, we can use it; if not, we typically build the dataset ourselves from source images (e.g., from the GitHub repo of the original paper), process them with the authors' source code or our own code that matches the details, and upload them to Hugging Face.
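
As an example of that first check, a quick sanity check of a Hugging Face candidate against the numbers in the original paper might look like this (the dataset path is an assumption; Fashion-MNIST is documented as 10 classes, 60k train / 10k test, 28x28 grayscale images):

from datasets import load_dataset

# path assumed; verify it points at the original Zalando release
ds = load_dataset("zalando-datasets/fashion_mnist")

assert ds["test"].features["label"].num_classes == 10
assert len(ds["train"]) == 60_000
assert len(ds["test"]) == 10_000
print(ds["test"][0]["image"].size)  # expect (28, 28)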

About things getting too complicated: Feel free to submit several separate PRs for all these, even draft ones (e.g., one PR after 3-5 tasks) so that we can review and start improving them together!

@YashDThapliyal
Contributor

YashDThapliyal commented Jan 13, 2025

@gowitheflow-1998 Sounds good. I will begin that process, start implementing the existing datasets first, and open a draft PR every few datasets so we can ensure we are on the right track.
