
[FT] Support batch metric computation for SampleLevelMetrics #404

Closed
JoelNiklaus opened this issue Nov 25, 2024 · 6 comments
Labels
feature request New feature/request

Comments

@JoelNiklaus
Contributor

Issue encountered

SampleLevelMetrics are always computed with batch size 1. This is a real problem for computationally expensive metrics that involve LLM inference: without batching, evaluation takes very long. CorpusLevelMetrics are not a solution either, because we want the metric at the sample level, both for statistics and for selecting samples for human evaluation afterwards.

Solution/Feature

In metrics.utils.init.py, apply_generative_metric needs to support batches. We can keep the default at 1, but we should expose a metric_batch_size argument at the top level of the evaluation.
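
A rough sketch of what a batched compute path could look like (illustrative names only, not existing lighteval API; only the metric_batch_size name comes from the proposal above):

# Hypothetical sketch, not lighteval code: compute a sample-level metric in
# batches while still returning one score per sample.
from typing import Any, Callable

def apply_metric_in_batches(
    samples: list[dict[str, Any]],
    compute_batch: Callable[[list[dict[str, Any]]], list[float]],
    metric_batch_size: int = 1,
) -> list[float]:
    scores: list[float] = []
    for start in range(0, len(samples), metric_batch_size):
        batch = samples[start : start + metric_batch_size]
        # The metric sees the whole batch at once (e.g. a single forward pass
        # of a scoring model) but must return exactly one score per sample.
        batch_scores = compute_batch(batch)
        assert len(batch_scores) == len(batch)
        scores.extend(batch_scores)
    return scores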

Possible alternatives

Currently, I don't see an alternative.

@JoelNiklaus JoelNiklaus added the feature request New feature/request label Nov 25, 2024
@clefourrier
Member

SampleLevelMetrics should not be using only batch size one, and apply_generative_metrics supports batches (which are automatically inferred, as we need to group generation parameters together).
Batch size, in general, is either automatically inferred or forced through a param (--batch_size, iirc).

What did you observe this on? Can you provide a command to reproduce it?

@JoelNiklaus
Contributor Author

JoelNiklaus commented Nov 25, 2024

Here it looks to me like there is only one example passed to the metric's compute function at a time.

Are you talking about the param --override_batch_size?

I only see the batch size applied to the model calls, but not the metric calls.

When I want to run an expensive metric like XCOMET-XXL, I need batching there.

I observed it when I evaluate the swiss_legal_evals:

python -m lighteval accelerate \
--model_args openai,model=o1-mini-2024-09-12 \
--tasks "community|slt-paragraph_level:de-fr|0|0" \
--custom_tasks lighteval/community_tasks/swiss_legal_evals.py \
--output_dir outputs \
--override_batch_size 8 \
--save_details
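
For context, a scoring model like XCOMET already exposes batched prediction on its own side; a minimal sketch with the unbabel-comet package (model choice, data, and batch size are illustrative, and this is not how lighteval currently wires the metric):

# Sketch, not lighteval code: batched scoring with an XCOMET-style model.
# Requires the `unbabel-comet` package; model name, data and batch size are illustrative.
from comet import download_model, load_from_checkpoint

model_path = download_model("Unbabel/XCOMET-XL")  # XCOMET-XXL works the same way, just larger
model = load_from_checkpoint(model_path)

data = [
    {"src": "Der Vertrag ist nichtig.", "mt": "The contract is void.", "ref": "The contract is null and void."},
    {"src": "Das Gericht weist die Klage ab.", "mt": "The court dismisses the claim.", "ref": "The court dismisses the action."},
]

# A single call scores all samples, batched internally; per-sample scores are preserved.
output = model.predict(data, batch_size=8, gpus=1)
print(output.scores)        # one score per sample
print(output.system_score)  # corpus-level aggregate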

@NathanHB
Member

Hi! You are right, as of now we only allow evaluating generative tasks with batch size 1. However, we had the same issue as you did for LLM-as-a-judge metrics (very expensive to run one by one), so we made another metric type (here) to be able to pass all answers to the eval function in one go.

A solution for you would be to do exactly that, with something like generative_metric_parallel.

Other than that, we would need to change every sample_level_metric to take a list of samples as argument instead of passing the samples one by one.
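
A minimal sketch of that idea, assuming a hypothetical generative_metric_parallel that receives every sample at once and maps scores back per sample (illustrative names, not existing lighteval API):

# Hypothetical sketch, not lighteval code: a "parallel" sample-level metric
# entry point that gets all predictions/references in one call.
from typing import Callable

def generative_metric_parallel(
    sample_ids: list[str],
    predictions: list[str],
    references: list[list[str]],
    batched_metric: Callable[[list[str], list[list[str]]], list[float]],
) -> dict[str, dict[str, float]]:
    # One call over the full set (the metric can batch internally as it likes),
    # then the scores are mapped back to their samples.
    scores = batched_metric(predictions, references)
    assert len(scores) == len(sample_ids)
    return {sid: {"score": score} for sid, score in zip(sample_ids, scores)}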

@JoelNiklaus
Contributor Author

I see. Yes, adjusting apply_generative_metric would imply changing each sample_level_metric to take a list of samples instead of one sample. Ok, I will open a PR with a new function generative_metric_parallel together with a new MetricCategory.

@NathanHB
Member

wdyt about this solution @clefourrier?

@clefourrier
Member

Sorry, read too fast - mixed up inference/metric cost.
Yep, I think you could define a BatchedSampleMetric then, and use it in this case!
