Speeding up MTEB #381

Closed
KennethEnevoldsen opened this issue Apr 16, 2024 · 5 comments


@KennethEnevoldsen (Contributor) commented Apr 16, 2024

This is an overview issue on how to speed up MTEB. I see the following options:

  • Implementing an encode cache: as suggested in Aggregating MMTEB datasets #354, some documents are repeated across datasets, so a cache could avoid re-embedding duplicates (see the sketch after this list).
  • Changing how multilingual datasets are loaded: there seems to be a large overhead when loading many language subsets from the same dataset (High overhead when loading lots of subsets from the same dataset huggingface/datasets#6800).
  • Downsampling datasets: Most datasets could probably work with notably fewer samples.
  • Loading only the needed splits: at the moment we download all splits even though we only use some of them. A solution might be to supply the split to the load_dataset function. Note that this will lead to bugs if the dataset_transform assumes the full dataset (probably shouldn't happen, but it might).
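
A minimal sketch of the encode-cache idea mentioned above (the wrapper class, its name, and its interface are illustrative assumptions, not part of MTEB):

```python
import hashlib

import numpy as np


class CachedEncoder:
    """Hypothetical wrapper that caches embeddings by text hash, so duplicated
    documents across (or within) datasets are only embedded once."""

    def __init__(self, model):
        self.model = model  # any object exposing .encode(list[str]) -> np.ndarray
        self._cache: dict[str, np.ndarray] = {}

    @staticmethod
    def _key(text: str) -> str:
        return hashlib.sha256(text.encode("utf-8")).hexdigest()

    def encode(self, sentences: list[str], **kwargs) -> np.ndarray:
        keys = [self._key(s) for s in sentences]
        # Only embed texts that are not cached yet, and each unique text only once.
        seen: set[str] = set()
        missing: list[str] = []
        for s, k in zip(sentences, keys):
            if k not in self._cache and k not in seen:
                seen.add(k)
                missing.append(s)
        if missing:
            new_embeddings = self.model.encode(missing, **kwargs)
            for s, emb in zip(missing, new_embeddings):
                self._cache[self._key(s)] = emb
        return np.stack([self._cache[k] for k in keys])
```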

Task-specific speed-ups:

  • Clustering: clustering currently works by performing N clustering steps on M samples each (M can vary across the N steps). An alternative approach is to embed K samples once and then sample M of those embeddings N times. This allows K << N x M, which would lead to a significant speed-up (see the sketch below).
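
A rough sketch of this resampling idea (the function, its parameters, and the use of scikit-learn's clustering and metrics are illustrative assumptions, not the current MTEB implementation):

```python
import numpy as np
from sklearn.cluster import MiniBatchKMeans
from sklearn.metrics import v_measure_score


def cluster_with_resampling(texts, labels, model, k=16_384, n_sets=10, m=2_048, seed=42):
    """Embed at most K unique texts once, then run N clustering evaluations
    on M-sample subsets drawn from those embeddings (so K << N * M)."""
    rng = np.random.default_rng(seed)
    idx = rng.choice(len(texts), size=min(k, len(texts)), replace=False)
    embeddings = model.encode([texts[i] for i in idx])  # the only encode call
    subset_labels = np.asarray(labels)[idx]

    scores = []
    for _ in range(n_sets):
        sample = rng.choice(len(idx), size=min(m, len(idx)), replace=False)
        n_clusters = len(np.unique(subset_labels[sample]))
        preds = MiniBatchKMeans(n_clusters=n_clusters, n_init="auto").fit_predict(
            embeddings[sample]
        )
        scores.append(v_measure_score(subset_labels[sample], preds))
    return float(np.mean(scores))
```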

Overview of slowest segments:
Based on existing results for paraphrase-multilingual-MiniLM-L12-v2 (which may have been run on a variety of systems, so the timings are only indicative):

| Task type | n | Total | Mean | Median | Min | Max | Slowest task | Fastest task |
|---|---:|---:|---:|---:|---:|---:|---|---|
| Reranking | 6 | 708.54 | 118.09 | 19.61 | 2.59 | 636.61 | MindSmallReranking | AskUbuntuDupQuestions |
| STS | 14 | 32.92 | 2.35 | 2.44 | 0.74 | 4.46 | STS17 | STS22 |
| PairClassification | 8 | 31.17 | 3.90 | 2.75 | 0.72 | 12.94 | TwitterURLCorpus | CDSC-E |
| Clustering | 22 | 2611.90 | 118.72 | 63.93 | 0.54 | 817.54 | ArxivClusteringP2P | MasakhaNEWSClusteringS2S |
| Classification | 18 | 698.66 | 38.81 | 13.51 | 1.51 | 340.71 | AmazonPolarityClassification | PolEmo2.0-OUT |
| BitextMining | 2 | 574.32 | 287.16 | 533.51 | 40.81 | 533.51 | BUCC | Tatoeba |
| None | 4 | 69.69 | 17.42 | 23.14 | 1.22 | 36.72 | CQADupstackRetrieval | PPC |
| Summarization | 2 | 19.06 | 9.53 | 15.84 | 3.22 | 15.84 | SummEval | SummEvalFr |
| Retrieval | 41 | 22925.34 | 559.15 | 31.85 | 0.31 | 3808.37 | MSMARCO-PL | SyntecRetrieval |
@isaac-chung (Collaborator) commented May 6, 2024

> At the moment we download all splits even though we only use some of them.

Only downloading the splits we use might also give us some free speed-up.

It seems there is currently no way to specify a subset of splits to load; only all splits or a single split are available:

split (`Split` or `str`):
            Which split of the data to load.
            If `None`, will return a `dict` with all splits (typically `datasets.Split.TRAIN` and `datasets.Split.TEST`).
            If given, will return a single Dataset.
            Splits can be combined and specified like in tensorflow-datasets.

Perhaps a workaround is to loop over the needed splits, load one Dataset at a time, and construct a DatasetDict afterwards.
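
A minimal sketch of that workaround, assuming standard datasets APIs (the helper name and the example dataset in the comment are hypothetical; whether this also avoids downloading the unused splits depends on upstream datasets support, see below):

```python
from datasets import DatasetDict, load_dataset


def load_splits(path: str, name: str | None, splits: list[str]) -> DatasetDict:
    """Hypothetical helper: load only the required splits one at a time and
    reassemble them into a DatasetDict instead of loading every split."""
    return DatasetDict({split: load_dataset(path, name, split=split) for split in splits})


# e.g. only the test split of one language subset
# dataset = load_splits("mteb/sts22-crosslingual-sts", "en", ["test"])
```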

@KennethEnevoldsen (Contributor, Author) commented

> Perhaps a workaround is to loop over the needed splits, load one Dataset at a time, and construct a DatasetDict afterwards.

Yeah, that seems like a great approach.

@isaac-chung (Collaborator) commented

I can start on that this week when I get some downtime from the conference, if no one has started yet.

@isaac-chung (Collaborator) commented

Looks like this relies on work in progress in datasets: huggingface/datasets#6832. For now the recommendation there is to use streaming. I propose waiting for the datasets PR to be merged; I can look into other issues in the meantime.

@KennethEnevoldsen (Contributor, Author) commented

I believe this issue is mostly resolved. While we can definitely speed up MTEB further, I think most of the initial ideas in this issue have been implemented (or will be implemented as part of MMTEB).
