Speeding up MTEB #381

Closed
KennethEnevoldsen opened this issue Apr 16, 2024 · 5 comments


@KennethEnevoldsen (Contributor) commented Apr 16, 2024

This is an overview issue on how to speed up MTEB. I see the following options:

  • Implementing an encode cache: as suggested in Aggregating MMTEB datasets #354, some documents are repeated across datasets, so a cache could avoid re-embedding duplicates (see the sketch after this list).
  • Changing how multilingual datasets are loaded: there seems to be a large overhead when loading many language subsets from the same dataset (High overhead when loading lots of subsets from the same dataset huggingface/datasets#6800).
  • Downsampling datasets: Most datasets could probably work with notably fewer samples.
  • Loading only the needed splits: at the moment we download all splits even though we only use some of them. A solution might be to supply the split to the load_dataset function. Note that this will lead to bugs if the dataset_transform assumes the full dataset (probably shouldn't happen, but it might).
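
A minimal sketch of the encode-cache idea mentioned above (the wrapper class, its name, and its interface are illustrative assumptions, not part of MTEB):

```python
import hashlib

import numpy as np


class CachedEncoder:
    """Hypothetical wrapper that caches embeddings by text hash, so duplicated
    documents across (or within) datasets are only embedded once."""

    def __init__(self, model):
        self.model = model  # any object exposing .encode(list[str]) -> np.ndarray
        self._cache: dict[str, np.ndarray] = {}

    @staticmethod
    def _key(text: str) -> str:
        return hashlib.sha256(text.encode("utf-8")).hexdigest()

    def encode(self, sentences: list[str], **kwargs) -> np.ndarray:
        keys = [self._key(s) for s in sentences]
        # Only embed texts that are not cached yet, and each unique text only once.
        seen: set[str] = set()
        missing: list[str] = []
        for s, k in zip(sentences, keys):
            if k not in self._cache and k not in seen:
                seen.add(k)
                missing.append(s)
        if missing:
            new_embeddings = self.model.encode(missing, **kwargs)
            for s, emb in zip(missing, new_embeddings):
                self._cache[self._key(s)] = emb
        return np.stack([self._cache[k] for k in keys])
```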

Task-specific speed-ups:

  • Clustering: clustering currently works by performing N clustering steps on M samples each (M can vary across the N steps). An alternative approach is to embed K samples once and then sample M of those embeddings N times. This allows K << N x M, which would lead to a significant speed-up (see the sketch below).
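
A rough sketch of this resampling idea (the function, its parameters, and the use of scikit-learn's clustering and metrics are illustrative assumptions, not the current MTEB implementation):

```python
import numpy as np
from sklearn.cluster import MiniBatchKMeans
from sklearn.metrics import v_measure_score


def cluster_with_resampling(texts, labels, model, k=16_384, n_sets=10, m=2_048, seed=42):
    """Embed at most K unique texts once, then run N clustering evaluations
    on M-sample subsets drawn from those embeddings (so K << N * M)."""
    rng = np.random.default_rng(seed)
    idx = rng.choice(len(texts), size=min(k, len(texts)), replace=False)
    embeddings = model.encode([texts[i] for i in idx])  # the only encode call
    subset_labels = np.asarray(labels)[idx]

    scores = []
    for _ in range(n_sets):
        sample = rng.choice(len(idx), size=min(m, len(idx)), replace=False)
        n_clusters = len(np.unique(subset_labels[sample]))
        preds = MiniBatchKMeans(n_clusters=n_clusters, n_init="auto").fit_predict(
            embeddings[sample]
        )
        scores.append(v_measure_score(subset_labels[sample], preds))
    return float(np.mean(scores))
```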

Overview of slowest segments:
Based on existing results for paraphrase-multilingual-MiniLM-L12-v2 (which may have been run on a variety of systems, so the timings are only indicative):

| Task type | n | Total | Mean | Median | Min | Max | Slowest task | Fastest task |
|---|---:|---:|---:|---:|---:|---:|---|---|
| Reranking | 6 | 708.54 | 118.09 | 19.61 | 2.59 | 636.61 | MindSmallReranking | AskUbuntuDupQuestions |
| STS | 14 | 32.92 | 2.35 | 2.44 | 0.74 | 4.46 | STS17 | STS22 |
| PairClassification | 8 | 31.17 | 3.90 | 2.75 | 0.72 | 12.94 | TwitterURLCorpus | CDSC-E |
| Clustering | 22 | 2611.90 | 118.72 | 63.93 | 0.54 | 817.54 | ArxivClusteringP2P | MasakhaNEWSClusteringS2S |
| Classification | 18 | 698.66 | 38.81 | 13.51 | 1.51 | 340.71 | AmazonPolarityClassification | PolEmo2.0-OUT |
| BitextMining | 2 | 574.32 | 287.16 | 533.51 | 40.81 | 533.51 | BUCC | Tatoeba |
| None | 4 | 69.69 | 17.42 | 23.14 | 1.22 | 36.72 | CQADupstackRetrieval | PPC |
| Summarization | 2 | 19.06 | 9.53 | 15.84 | 3.22 | 15.84 | SummEval | SummEvalFr |
| Retrieval | 41 | 22925.34 | 559.15 | 31.85 | 0.31 | 3808.37 | MSMARCO-PL | SyntecRetrieval |
@isaac-chung (Collaborator) commented May 6, 2024

> At the moment we download all splits even though we only use some of them.

Only downloading the splits we use might also give us some free speed-up.

It seems there is currently no way to specify a subset of splits to load; only all splits or a single split are available:

split (`Split` or `str`):
            Which split of the data to load.
            If `None`, will return a `dict` with all splits (typically `datasets.Split.TRAIN` and `datasets.Split.TEST`).
            If given, will return a single Dataset.
            Splits can be combined and specified like in tensorflow-datasets.

Perhaps a workaround is to loop over the needed splits, load one Dataset at a time, and construct a DatasetDict afterwards.
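
A minimal sketch of that workaround, assuming standard datasets APIs (the helper name and the example dataset in the comment are hypothetical; whether this also avoids downloading the unused splits depends on upstream datasets support, see below):

```python
from datasets import DatasetDict, load_dataset


def load_splits(path: str, name: str | None, splits: list[str]) -> DatasetDict:
    """Hypothetical helper: load only the required splits one at a time and
    reassemble them into a DatasetDict instead of loading every split."""
    return DatasetDict({split: load_dataset(path, name, split=split) for split in splits})


# e.g. only the test split of one language subset
# dataset = load_splits("mteb/sts22-crosslingual-sts", "en", ["test"])
```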

@KennethEnevoldsen (Contributor, Author) commented

> Perhaps a workaround is to loop over the needed splits, load one Dataset at a time, and construct a DatasetDict afterwards.

Yeah, that seems like a great approach.

@isaac-chung (Collaborator) commented

I can start on that this week when I get some downtime from the conference, if no one has started yet.

@isaac-chung (Collaborator) commented

Looks like this relies on work in progress in datasets: huggingface/datasets#6832. For now the recommendation there is to use streaming. I propose waiting for the datasets PR to be merged; I can look into other issues in the meantime.

@KennethEnevoldsen (Contributor, Author) commented

I believe this issue is mostly resolved. While we can definitely speed up MTEB further, I think most of the initial ideas in this issue have been implemented (or will be implemented as part of MMTEB).
