[FT] Support batch metric computation for SampleLevelMetrics #404
Comments
What did you observe that on? Can you provide a command to repro?
Here it looks to me like only one example is passed to the metric's compute function at a time. Are you talking about the --override_batch_size param? I only see the batch size applied to the model calls, not to the metric calls. When I want to run an expensive metric like XCOMET-XXL, I need batching there as well. I observed this when evaluating swiss_legal_evals.
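For reference, a minimal sketch of the cost difference, assuming the Unbabel comet package (XCOMET-XXL itself is large and gated, so a small public checkpoint stands in below):

```python
from comet import download_model, load_from_checkpoint

# Toy data; in lighteval these would come from the task docs and the model's generations.
sources = ["Der Vertrag ist nichtig.", "Das Gericht weist die Klage ab."]
hypotheses = ["The contract is void.", "The court dismisses the claim."]
references = ["The contract is null and void.", "The court rejects the action."]

data = [
    {"src": s, "mt": h, "ref": r}
    for s, h, r in zip(sources, hypotheses, references)
]

model = load_from_checkpoint(download_model("Unbabel/wmt22-comet-da"))

# What batch size 1 amounts to today: one predict() call per sample,
# paying the dataloader/model overhead for every single example.
per_sample_scores = [model.predict([d], batch_size=1, gpus=0).scores[0] for d in data]

# What batching the metric would allow: a single call over all samples,
# with COMET batching internally while still returning per-sample scores.
batched_scores = model.predict(data, batch_size=32, gpus=0).scores
```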
Hi! You are right, as of now we only allow evaluating generative tasks with batch size 1. However, we had the same issue as you for LLM-as-a-judge metrics (very expensive to run one by one), so we made another metric type (here) that passes all answers to the eval function in one go. A solution for you would be to do exactly that, something like the sketch below. Otherwise, we would need to change every SampleLevelMetric to take a list of samples as argument instead of passing the samples one by one.
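A rough illustration of that shape, with a hypothetical function name and signature rather than lighteval's actual metric interface:

```python
def batched_eval_fn(
    predictions: list[str],
    references: list[str],
) -> list[dict[str, float]]:
    """Receive every prediction/reference pair in one go and return one score
    dict per sample, so the expensive model can batch internally while
    per-sample results stay available for statistics and human review."""
    # Placeholder scoring: in practice this would be a single batched call to
    # the expensive model (XCOMET, an LLM judge, ...).
    return [
        {"xcomet_xxl": float(pred.strip() == ref.strip())}
        for pred, ref in zip(predictions, references)
    ]
```

The point is the signature: lists in, one result per sample out. The actual scoring call inside can be whatever expensive model you need.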
I see. Yes, adjusting
wdyt about this solution @clefourrier?
Sorry, read too fast - mixed up inference/metric cost.
Issue encountered
SampleLevelMetrics are always computed with batch size 1. This is a real problem for computationally expensive metrics that involve LLM inference: without batching, evaluation takes ages. CorpusLevelMetrics are not a solution either, because we want the metric at the sample level, both for statistics and for selecting samples for human evaluation afterwards.
Solution/Feature
In metrics.utils.__init__.py, apply_generative_metric needs to support batches. We can keep the default at 1, but we should expose a metric_batch_size argument at the top level of the evaluation.
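A minimal sketch of the proposed change, using hypothetical names and a simplified signature rather than the real apply_generative_metric: chunk the samples by metric_batch_size and hand each chunk to the metric in one call.

```python
from typing import Callable


def apply_generative_metric_batched(
    sample_level_fn: Callable[[list[str], list[str]], list[dict[str, float]]],
    predictions: list[str],
    references: list[str],
    metric_batch_size: int = 1,  # default keeps today's behaviour
) -> list[dict[str, float]]:
    """Compute a sample-level metric over chunks of metric_batch_size samples,
    concatenating the per-sample results so downstream statistics are unchanged."""
    results: list[dict[str, float]] = []
    for start in range(0, len(predictions), metric_batch_size):
        chunk_preds = predictions[start : start + metric_batch_size]
        chunk_refs = references[start : start + metric_batch_size]
        # One call per chunk; a metric like batched XCOMET scores the whole chunk at once.
        results.extend(sample_level_fn(chunk_preds, chunk_refs))
    return results
```

With metric_batch_size left at 1, existing metrics behave exactly as they do today; expensive metrics can raise it to amortize model overhead.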
Possible alternatives
Currently, I don't see an alternative.