Retrieval speed up #638
-
30 minutes seems right for Robust04InstructionRetrieval with a smaller model: it's 52k * 2 embeddings for that dataset, and you can cut that in half by passing in the flag. For NeuCLIR, implementing a cache (as in #381 and #354) would also cut the time in half. For the full-scale retrieval tasks like NeuCLIR or the to-be-implemented MIRACL datasets (#198), we're talking about millions of passages to embed. I assume this is comparable to the MS MARCO dataset in the standard MTEB benchmark, but I don't have those numbers offhand. There was some discussion in #381 about ways we could speed it up. The most effective would be subsampling the collection, although that would make our scores incompatible with the current benchmark scores on the task and lead to some confusion. I don't think there's an easy solution that solves this and keeps the benchmark comparable, other than adding more GPUs... However, we can probably shave off some percentage (~20%) with some of the items mentioned in that thread about dataloaders and such.
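For reference, a minimal sketch of what the embedding cache idea could look like (the `EmbeddingCache` class and `encode_fn` hook are illustrative, not mteb's actual API): passages are keyed by a hash of their text, so repeated passages — within a corpus or across tasks — only get embedded once.

```python
import hashlib

import numpy as np


class EmbeddingCache:
    """Hypothetical sketch of a text -> embedding cache (not mteb's real API)."""

    def __init__(self):
        self._store = {}

    @staticmethod
    def _key(text: str) -> str:
        return hashlib.sha256(text.encode("utf-8")).hexdigest()

    def encode(self, texts, encode_fn):
        # Deduplicate and embed only the texts we haven't seen before.
        missing = list(
            {self._key(t): t for t in texts if self._key(t) not in self._store}.values()
        )
        if missing:
            for t, emb in zip(missing, encode_fn(missing)):
                self._store[self._key(t)] = emb
        # Assemble results in the original order, all served from the cache.
        return np.stack([self._store[self._key(t)] for t in texts])


# Dummy encoder standing in for a real model.encode() call.
def toy_encode(texts):
    return [np.full(4, float(len(t))) for t in texts]


cache = EmbeddingCache()
first = cache.encode(["a", "bb", "a"], toy_encode)
second = cache.encode(["bb", "ccc"], toy_encode)  # "bb" is served from the cache
```

On a second task that shares the corpus, `encode_fn` would only be invoked for genuinely new passages, which is where the rough "cut the time in half" estimate comes from.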
-
I experimented a bit with making the embedding process faster. A couple of things I tried to speed things up:
Things I haven't yet tried:
These could in theory speed it up somewhat, but I'm not sure it would be worth the effort.
-
Some of the longest eval times for retrieval datasets are `Robust04InstructionRetrieval` (1943 s), `NeuCLIR2022Retrieval` (37k s — 10+ hours!) and `NeuCLIR2023Retrieval` (similar). Open to suggestions about ways to speed up retrieval tasks. If the bottleneck is in the encoding process (e.g. for large corpora), maybe we can leverage multiprocessing from sentence transformers.
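On the multiprocessing idea: sentence-transformers does ship a multi-process API (`start_multi_process_pool()` / `encode_multi_process()`) that shards a corpus across one worker per device. A toy sketch of the underlying pattern — shard the corpus, encode chunks in parallel, reassemble in order — using a thread pool and a stub encoder so it runs without a GPU or the library installed (all names below are illustrative):

```python
from multiprocessing.pool import ThreadPool

import numpy as np


# Stub standing in for model.encode(); the real sentence-transformers
# equivalent is encode_multi_process(), with one worker process per GPU.
def encode_chunk(texts):
    return np.array([[float(len(t))] for t in texts])


def encode_corpus(texts, n_workers=2, chunk_size=2):
    # Shard the corpus into chunks and encode them in parallel.
    chunks = [texts[i:i + chunk_size] for i in range(0, len(texts), chunk_size)]
    with ThreadPool(n_workers) as pool:
        parts = pool.map(encode_chunk, chunks)  # map() preserves chunk order
    # Reassemble into a single (n_texts, dim) matrix.
    return np.vstack(parts)


corpus = ["a", "bb", "ccc", "dddd", "eeeee"]
embs = encode_corpus(corpus)
```

This only helps when encoding is compute-bound and there are multiple devices to shard across; on a single GPU the gains would mostly come from larger batch sizes and dataloader improvements instead.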
CC @KennethEnevoldsen @imenelydiaker @Muennighoff @x-tabdeveloping @orionw and anyone who'd be interested.