
[FT] Support batch metric computation for SampleLevelMetrics #404

Closed
JoelNiklaus opened this issue Nov 25, 2024 · 6 comments
Labels
feature request New feature/request

Comments

@JoelNiklaus
Contributor

Issue encountered

SampleLevelMetrics are always computed with batch size 1. This is a real problem for computationally expensive metrics that involve LLM inference: without batching, evaluation takes very long. CorpusLevelMetrics are not a solution either, because we want the metric at the sample level, both for statistics and for selecting samples for human evaluation afterwards.

Solution/Feature

In metrics.utils.init.py, apply_generative_metric needs to support batches. We can keep the default at 1, but we should expose a metric_batch_size argument at the top level of the evaluation.
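
A rough sketch of what a batched compute path could look like (illustrative names only, not existing lighteval API; only the metric_batch_size name comes from the proposal above):

# Hypothetical sketch, not lighteval code: compute a sample-level metric in
# batches while still returning one score per sample.
from typing import Any, Callable

def apply_metric_in_batches(
    samples: list[dict[str, Any]],
    compute_batch: Callable[[list[dict[str, Any]]], list[float]],
    metric_batch_size: int = 1,
) -> list[float]:
    scores: list[float] = []
    for start in range(0, len(samples), metric_batch_size):
        batch = samples[start : start + metric_batch_size]
        # The metric sees the whole batch at once (e.g. a single forward pass
        # of a scoring model) but must return exactly one score per sample.
        batch_scores = compute_batch(batch)
        assert len(batch_scores) == len(batch)
        scores.extend(batch_scores)
    return scores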

Possible alternatives

Currently, I don't see an alternative.

@JoelNiklaus JoelNiklaus added the feature request New feature/request label Nov 25, 2024
@clefourrier
Member

SampleLevelMetrics should not be using only batch size one, and apply_generative_metrics supports batches (which are automatically inferred, as we need to group generation parameters together).
Batch size, in general, is either automatically inferred or forced through a param (--batch_size, iirc).

What did you observe this on? Can you provide a command to reproduce it?

@JoelNiklaus
Contributor Author

JoelNiklaus commented Nov 25, 2024

Here it looks to me like there is only one example passed to the metric's compute function at a time.

Are you talking about the param --override_batch_size?

I only see the batch size applied to the model calls, but not the metric calls.

When I want to run an expensive metric like XCOMET-XXL, I need batching there.

I observed it when I evaluate the swiss_legal_evals:

python -m lighteval accelerate \
--model_args openai,model=o1-mini-2024-09-12 \
--tasks "community|slt-paragraph_level:de-fr|0|0" \
--custom_tasks lighteval/community_tasks/swiss_legal_evals.py \
--output_dir outputs \
--override_batch_size 8 \
--save_details
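
For context, a scoring model like XCOMET already exposes batched prediction on its own side; a minimal sketch with the unbabel-comet package (model choice, data, and batch size are illustrative, and this is not how lighteval currently wires the metric):

# Sketch, not lighteval code: batched scoring with an XCOMET-style model.
# Requires the `unbabel-comet` package; model name, data and batch size are illustrative.
from comet import download_model, load_from_checkpoint

model_path = download_model("Unbabel/XCOMET-XL")  # XCOMET-XXL works the same way, just larger
model = load_from_checkpoint(model_path)

data = [
    {"src": "Der Vertrag ist nichtig.", "mt": "The contract is void.", "ref": "The contract is null and void."},
    {"src": "Das Gericht weist die Klage ab.", "mt": "The court dismisses the claim.", "ref": "The court dismisses the action."},
]

# A single call scores all samples, batched internally; per-sample scores are preserved.
output = model.predict(data, batch_size=8, gpus=1)
print(output.scores)        # one score per sample
print(output.system_score)  # corpus-level aggregate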

@NathanHB
Member

Hi! You are right, as of now we only allow evaluating generative tasks with batch size 1. However, we had the same issue as you did for LLM-as-a-judge metrics (very expensive to run one by one), so we made another metric type (here) to be able to pass all answers to the eval function in one go.

A solution for you would be to do exactly that, with something like generative_metric_parallel.

Other than that, we would need to change every sample_level_metric to take a list of samples as argument instead of passing the samples one by one.
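
A minimal sketch of that idea, assuming a hypothetical generative_metric_parallel that receives every sample at once and maps scores back per sample (illustrative names, not existing lighteval API):

# Hypothetical sketch, not lighteval code: a "parallel" sample-level metric
# entry point that gets all predictions/references in one call.
from typing import Callable

def generative_metric_parallel(
    sample_ids: list[str],
    predictions: list[str],
    references: list[list[str]],
    batched_metric: Callable[[list[str], list[list[str]]], list[float]],
) -> dict[str, dict[str, float]]:
    # One call over the full set (the metric can batch internally as it likes),
    # then the scores are mapped back to their samples.
    scores = batched_metric(predictions, references)
    assert len(scores) == len(sample_ids)
    return {sid: {"score": score} for sid, score in zip(sample_ids, scores)}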

@JoelNiklaus
Contributor Author

I see. Yes, adjusting apply_generative_metric would imply changing each sample_level_metric to take a list of samples instead of one sample. Ok, I will open a PR with a new function generative_metric_parallel together with a new MetricCategory.

@NathanHB
Member

wdyt about this solution @clefourrier?

@clefourrier
Member

Sorry, read too fast - mixed up inference/metric cost.
Yep, I think you could define a BatchedSampleMetric then, and use it in this case!
