Integrate ModernBERT #1624

Open
Samoed opened this issue Dec 23, 2024 · 4 comments · May be fixed by #1684
Labels
new-model Questions related to adding a new model to the benchmark

Comments

@Samoed
Collaborator

Samoed commented Dec 23, 2024

Arxiv: https://arxiv.org/abs/2412.13663
Model: https://huggingface.co/answerdotai/ModernBERT-base

ModernBERT was evaluated on BEIR, and I think it could be integrated into MTEB with a specific configuration. I tried adding it using SentenceTransformers with different pooling methods, but my results were much lower than those reported.
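For readers unfamiliar with the term, "pooling" here means collapsing the per-token vectors an encoder produces into one sentence vector. The sketch below is only an illustration in plain Python of three common strategies (CLS, mean, max); it is not the code used in this issue, and the function names and toy vectors are hypothetical. It assumes the first token plays the role of the CLS token and that an attention mask of 1 marks real tokens and 0 marks padding.

```python
from typing import List

Vector = List[float]


def _unmasked(token_embeddings: List[Vector], attention_mask: List[int]) -> List[Vector]:
    """Return only the token vectors whose mask entry is 1 (i.e. not padding)."""
    kept = [vec for vec, m in zip(token_embeddings, attention_mask) if m == 1]
    if not kept:
        raise ValueError("attention_mask removes every token")
    return kept


def cls_pooling(token_embeddings: List[Vector]) -> Vector:
    """Use the first token's vector as the sentence embedding."""
    return list(token_embeddings[0])


def mean_pooling(token_embeddings: List[Vector], attention_mask: List[int]) -> Vector:
    """Average the non-padding token vectors, dimension by dimension."""
    kept = _unmasked(token_embeddings, attention_mask)
    dim = len(kept[0])
    return [sum(vec[i] for vec in kept) / len(kept) for i in range(dim)]


def max_pooling(token_embeddings: List[Vector], attention_mask: List[int]) -> Vector:
    """Take the per-dimension maximum over the non-padding token vectors."""
    kept = _unmasked(token_embeddings, attention_mask)
    dim = len(kept[0])
    return [max(vec[i] for vec in kept) for i in range(dim)]


# Toy input: three token vectors of dimension 2; the last one is padding.
tokens = [[1.0, 0.0], [3.0, 2.0], [9.0, 9.0]]
mask = [1, 1, 0]

cls_vec = cls_pooling(tokens)
mean_vec = mean_pooling(tokens, mask)
max_vec = max_pooling(tokens, mask)
```

Whichever strategy is chosen, the pooled vector is only as useful as the token vectors underneath it, so switching pooling methods alone should not be expected to turn a base encoder into a strong retriever.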

@orionw, since you’re one of the co-authors (congrats, by the way!), do you have scripts to reproduce the results?

@Samoed Samoed added the new-model Questions related to adding a new model to the benchmark label Dec 23, 2024
@orionw
Contributor

orionw commented Dec 23, 2024

Thanks @Samoed! ModernBERT is the base model, so if you use it out of the box it will be pretty bad, just like an un-finetuned BERT or RoBERTa.

@NohTow and @bclavie did some fine tuning on MS MARCO but I don’t think they’ve uploaded the models anywhere. They did put their fine tuning scripts here: https://github.com/AnswerDotAI/ModernBERT/blob/main/examples/train_st.py (and similar for ColBERT in the repo).

I expect others will replace BERT with it in their pipelines and we will see more retrieval models with it soon!

@NohTow

NohTow commented Dec 23, 2024

Hello,
Yes, indeed, the models trained for the experiments in the paper are rather "weak" (especially DPR ones) compared to what people are used to. The goal of the experiments was to compare the performance of all the base models in a given and fair setup (which is just an average MS MARCO training).

We decided not to chase the top of the BEIR leaderboard because fine-tuning to this extent is a whole project in itself and takes a lot of work if you do not have the data for it available. Also, to some extent, the leaderboard is a bit gamed, and even if we had put in the time and energy to grind the leaderboard, we might have come up a bit short or ended up with a model that does not perform as we believe a model should.

Thus, to avoid wasting time and having people compare only the BEIR scores, we preferred to compare the models in a simple setup to get a signal about the actual potential of the base models, and to let the people who already have extensive pipelines take the model and do a proper fine-tuning. These actors have seen the model and we have good reasons to believe that they will indeed do this fine-tuning in the future! Besides, I am also doing some experiments on my own, which might end up with a model that is not as strong as the top models, but way better than what we trained in the paper!

Edit: I also have the checkpoints of the models we trained for the experiments, but again, I am not sure reporting these on MTEB is worth it.

@KennethEnevoldsen
Contributor

Some thoughts: it can be reasonable (as a reference) to benchmark base models like BERT, ModernBERT, etc.

These are fairly easy to benchmark (it can be run from the CLI). However, I expect that we will see competitive fine-tunes due to:

More than anything, we’re really looking forward to seeing what creative ways to use these models the community will come up with! To encourage this, we’re opening a call for demos until January 10th, 2025: the 5 best ones will get added to this post in a showcase section and win a $100 (or local currency equivalent) Amazon gift card, as well as a 6-month HuggingFace Pro subscription! If you need a hint to get started, here’s a demo we thought about: code similarity HF space! And remember, this is an encoder model, so all the coolest downstream applications will likely require some sort of fine-tuning (on real or perhaps decoder-model synthetic data?). Thankfully, there's lots of cool frameworks out there to support fine-tuning encoders: 🤗Transformers itself for various tasks, including classification, GliNER for zero-shot Named Entity Recognition, or Sentence-Transformers for retrieval and similarity tasks!

source: https://huggingface.co/blog/modernbert

@NohTow

NohTow commented Dec 30, 2024

The first competitive fine-tune is out in the form of modern-bert-embed-base, a model trained following the nomic-embed setup.

Announcement thread: https://x.com/zach_nussbaum/status/1873813021786767699
cc @zanussbaum

@Samoed Samoed linked a pull request Jan 2, 2025 that will close this issue
4 participants