feat: Add nomic modern bert #1684

Draft · wants to merge 8 commits into base: main
Conversation

@Samoed (Collaborator) commented Jan 2, 2025

Checklist

  • Run tests locally to make sure nothing is broken using make test.
  • Run the formatter to format the code using make lint.

Closes #1624

Adding a model checklist

  • I have filled out the ModelMeta object to the extent possible
  • I have ensured that my model can be loaded using
    • mteb.get_model(model_name, revision) and
    • mteb.get_model_meta(model_name, revision)
  • I have tested that the implementation works on a representative set of tasks.
2025-01-02 08:15:52.601067 >>> AmazonCounterfactualClassification
Traceback (most recent call last):
  File "/usr/local/lib/python3.10/dist-packages/mteb/evaluation/MTEB.py", line 583, in run
    results, tick, tock = self._run_eval(
  File "/usr/local/lib/python3.10/dist-packages/mteb/evaluation/MTEB.py", line 304, in _run_eval
    results = task.evaluate(
  File "/usr/local/lib/python3.10/dist-packages/mteb/abstasks/AbsTaskClassification.py", line 120, in evaluate
    scores[hf_subset] = self._evaluate_subset(
  File "/usr/local/lib/python3.10/dist-packages/mteb/abstasks/AbsTaskClassification.py", line 196, in _evaluate_subset
    scores_exp, test_cache = evaluator(model, test_cache=test_cache)
  File "/usr/local/lib/python3.10/dist-packages/mteb/evaluation/evaluators/ClassificationEvaluator.py", line 306, in __call__
    clf.fit(X_train, self.y_train)
  File "/usr/local/lib/python3.10/dist-packages/sklearn/linear_model/_logistic.py", line 1196, in fit
    X, y = self._validate_data(
  File "/usr/local/lib/python3.10/dist-packages/sklearn/base.py", line 584, in _validate_data
    X, y = check_X_y(X, y, **check_params)
  File "/usr/local/lib/python3.10/dist-packages/sklearn/utils/validation.py", line 1106, in check_X_y
    X = check_array(
  File "/usr/local/lib/python3.10/dist-packages/sklearn/utils/validation.py", line 921, in check_array
    _assert_all_finite(
  File "/usr/local/lib/python3.10/dist-packages/sklearn/utils/validation.py", line 161, in _assert_all_finite
    raise ValueError(msg_err)
ValueError: Input X contains NaN.
LogisticRegression does not accept missing values encoded as NaN natively. For supervised learning, you might want to consider sklearn.ensemble.HistGradientBoostingClassifier and Regressor which accept missing values encoded as NaNs natively. Alternatively, it is possible to preprocess the data, for instance by using an imputer transformer in a pipeline or drop samples with missing values. See https://scikit-learn.org/stable/modules/impute.html You can find a list of all estimators that handle NaN values at the following page: https://scikit-learn.org/stable/modules/impute.html#estimators-that-handle-nan-values


2025-01-02 08:16:28.850049 >>> ToxicConversationsClassification
Traceback (most recent call last):
  File "/usr/local/lib/python3.10/dist-packages/mteb/evaluation/MTEB.py", line 583, in run
    results, tick, tock = self._run_eval(
  File "/usr/local/lib/python3.10/dist-packages/mteb/evaluation/MTEB.py", line 304, in _run_eval
    results = task.evaluate(
  File "/usr/local/lib/python3.10/dist-packages/mteb/abstasks/AbsTaskClassification.py", line 120, in evaluate
    scores[hf_subset] = self._evaluate_subset(
  File "/usr/local/lib/python3.10/dist-packages/mteb/abstasks/AbsTaskClassification.py", line 196, in _evaluate_subset
    scores_exp, test_cache = evaluator(model, test_cache=test_cache)
  File "/usr/local/lib/python3.10/dist-packages/mteb/evaluation/evaluators/ClassificationEvaluator.py", line 306, in __call__
    clf.fit(X_train, self.y_train)
  File "/usr/local/lib/python3.10/dist-packages/sklearn/linear_model/_logistic.py", line 1196, in fit
    X, y = self._validate_data(
  File "/usr/local/lib/python3.10/dist-packages/sklearn/base.py", line 584, in _validate_data
    X, y = check_X_y(X, y, **check_params)
  File "/usr/local/lib/python3.10/dist-packages/sklearn/utils/validation.py", line 1106, in check_X_y
    X = check_array(
  File "/usr/local/lib/python3.10/dist-packages/sklearn/utils/validation.py", line 921, in check_array
    _assert_all_finite(
  File "/usr/local/lib/python3.10/dist-packages/sklearn/utils/validation.py", line 161, in _assert_all_finite
    raise ValueError(msg_err)
ValueError: Input X contains NaN.
LogisticRegression does not accept missing values encoded as NaN natively. For supervised learning, you might want to consider sklearn.ensemble.HistGradientBoostingClassifier and Regressor which accept missing values encoded as NaNs natively. Alternatively, it is possible to preprocess the data, for instance by using an imputer transformer in a pipeline or drop samples with missing values. See https://scikit-learn.org/stable/modules/impute.html You can find a list of all estimators that handle NaN values at the following page: https://scikit-learn.org/stable/modules/impute.html#estimators-that-handle-nan-values


2025-01-02 08:29:39.876958 >>> SummEval
Traceback (most recent call last):
  File "/usr/local/lib/python3.10/dist-packages/mteb/evaluation/MTEB.py", line 583, in run
    results, tick, tock = self._run_eval(
  File "/usr/local/lib/python3.10/dist-packages/mteb/evaluation/MTEB.py", line 304, in _run_eval
    results = task.evaluate(
  File "/usr/local/lib/python3.10/dist-packages/mteb/abstasks/AbsTask.py", line 126, in evaluate
    scores[hf_subset] = self._evaluate_subset(
  File "/usr/local/lib/python3.10/dist-packages/mteb/abstasks/AbsTaskSummarization.py", line 109, in _evaluate_subset
    scores = evaluator(model, encode_kwargs=encode_kwargs)
  File "/usr/local/lib/python3.10/dist-packages/mteb/evaluation/evaluators/SummarizationEvaluator.py", line 315, in __call__
    cosine_pearson_scores.append(pearsonr(human_scores, cosine_pred_scores))
  File "/usr/local/lib/python3.10/dist-packages/scipy/stats/_stats_py.py", line 4794, in pearsonr
    normym = linalg.norm(ym)
  File "/usr/local/lib/python3.10/dist-packages/scipy/linalg/_misc.py", line 146, in norm
    a = np.asarray_chkfinite(a)
  File "/usr/local/lib/python3.10/dist-packages/numpy/lib/function_base.py", line 630, in asarray_chkfinite
    raise ValueError(
ValueError: array must not contain infs or NaNs

After this, I reran the classification tasks with a smaller batch size (4 instead of 32). AmazonCounterfactualClassification completed successfully, but ToxicConversationsClassification failed with the same error.
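Before the classifier ever runs, the failure can be localized by scanning the embedding matrix for non-finite values. A minimal sketch with NumPy (the synthetic matrix and the helper name are illustrative, not part of the MTEB codebase):

```python
import numpy as np

def report_nonfinite(X: np.ndarray) -> np.ndarray:
    """Return the indices of embedding rows containing NaN or inf."""
    bad = ~np.isfinite(X).all(axis=1)
    idx = np.flatnonzero(bad)
    if idx.size:
        print(f"{idx.size}/{len(X)} embeddings contain NaN/inf, e.g. rows {idx[:5]}")
    return idx

# Synthetic embedding matrix with one deliberately broken row
X = np.random.default_rng(0).normal(size=(8, 4)).astype(np.float32)
X[3, 2] = np.nan
bad_rows = report_nonfinite(X)  # reports row 3
```

Running this on the actual model output (e.g. right after `model.encode(...)`) would show whether the NaNs originate in the encoder or later in the evaluation pipeline.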

@zanussbaum can you help integrate your model implementation into MTEB?

@zanussbaum (Contributor)

Hm, not sure I totally understand what's going on here, but for the classification tasks one thing that might be different is that we don't normalize the embeddings
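If normalization is the difference, the comparison can be reproduced offline by L2-normalizing the raw embeddings before scoring. A hedged sketch (plain NumPy; whether the MTEB wrapper should apply this for this model is exactly the open question here):

```python
import numpy as np

def l2_normalize(X: np.ndarray, eps: float = 1e-12) -> np.ndarray:
    """Scale each embedding row to unit L2 norm; eps guards against division by zero."""
    norms = np.linalg.norm(X, axis=1, keepdims=True)
    return X / np.maximum(norms, eps)

X = np.array([[3.0, 4.0], [0.0, 2.0]])
Xn = l2_normalize(X)
print(np.linalg.norm(Xn, axis=1))  # every row now has unit norm
```

Running the failing tasks once with and once without this step on the same cached embeddings would isolate whether normalization explains the score gap.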

@Samoed (Collaborator, Author) commented Jan 2, 2025

Can you provide the script you used for evaluating with MTEB?

@zanussbaum (Contributor)

I evaluated using our contrastors repo: https://github.com/nomic-ai/contrastors/tree/main/src/contrastors/eval/mteb_eval

@Samoed (Collaborator, Author) commented Jan 4, 2025

Updated scores

Task                                      Leaderboard   PR
AmazonCounterfactualClassification (en)   78.13         76.5821
EmotionClassification                     48.26         51.35
ToxicConversationsClassification          67.46         DNF
SprintDuplicateQuestions                  92.04         92.0572
TwitterSemEval2015                        73.63         73.6807
ArxivClusteringS2S                        38.09         DNF
RedditClustering                          56.5          DNF
SciDocsRR                                 81.52         81.542
AskUbuntuDupQuestions                     62.33         62.4368
SCIDOCS                                   18.59         18.071
SciFact                                   69.63         60.046
STSBenchmark                              86.97         86.9903
STS16                                     85.74         85.7417
SummEval                                  31.39         31.2883

Merging this pull request may close issue: Integrate ModernBERT