
[FT] LLM-as-judge example that doesn't require OPENAI_KEY or pro subscription of HF #318

Open
chuandudx opened this issue Sep 18, 2024 · 2 comments
Labels: feature request (New feature/request)

Comments

@chuandudx
Contributor

Issue encountered

While setting up the framework to evaluate with LLM-as-judge, it would be helpful to be able to test end-to-end without special access such as an OpenAI API key or an HF Pro subscription. The judge models currently configured in src/lighteval/metrics/metrics.py are:

  • gpt-3.5-turbo
  • meta-llama/Meta-Llama-3.1-405B-Instruct-FP8

When trying to call the Llama model with a free HF_TOKEN, the following error is returned:

 (<class 'openai.BadRequestError'>, BadRequestError("Error code: 400 - {'error': 'Model requires a Pro subscription; check out hf.co/pricing to learn more. Make sure to include your HF token in your query.'}"))

Solution/Feature

I tried to define a new LLM judge metric using a smaller model (imports shown here for completeness):

    # Imports assumed from lighteval's metrics modules and the standard library.
    import os

    import numpy as np

    from lighteval.metrics.metrics_sample import JudgeLLM
    from lighteval.metrics.utils import MetricCategory, MetricUseCase, SampleLevelMetricGrouping

    llm_judge_small_model = SampleLevelMetricGrouping(
        metric_name=["judge_score"],
        higher_is_better={"judge_score": True},
        category=MetricCategory.LLM_AS_JUDGE,
        use_case=MetricUseCase.SUMMARIZATION,
        sample_level_fn=JudgeLLM(
            judge_model_name="TinyLlama/TinyLlama_v1.1",
            template_path=os.path.join(os.path.dirname(__file__), "judge_prompts.jsonl"),
            multi_turn=False,
        ).compute,
        corpus_level_fn={
            "judge_score": np.mean,
        },
    )

However, this gave a different error that I have not been able to figure out how to resolve. The failure is raised from the OpenAI-compatible API path even though the intent was to call a TinyLlama model.

INFO:httpx:HTTP Request: POST https://api-inference.huggingface.co/v1/chat/completions "HTTP/1.1 422 Unprocessable Entity"
WARNING:lighteval.logging.hierarchical_logger:    (<class 'openai.UnprocessableEntityError'>, UnprocessableEntityError("Error code: 422 - {'error': 'Template error: template not found', 'error_type': 'template_error'}"))
INFO:httpx:HTTP Request: POST https://api-inference.huggingface.co/v1/chat/completions "HTTP/1.1 422 Unprocessable Entity"
WARNING:lighteval.logging.hierarchical_logger:    (<class 'openai.UnprocessableEntityError'>, UnprocessableEntityError("Error code: 422 - {'error': 'Template error: template not found', 'error_type': 'template_error'}"))
INFO:httpx:HTTP Request: POST https://api-inference.huggingface.co/v1/chat/completions "HTTP/1.1 422 Unprocessable Entity"
WARNING:lighteval.logging.hierarchical_logger:    (<class 'openai.UnprocessableEntityError'>, UnprocessableEntityError("Error code: 422 - {'error': 'Template error: template not found', 'error_type': 'template_error'}"))
WARNING:lighteval.logging.hierarchical_logger:  } [0:00:48.373629]
WARNING:lighteval.logging.hierarchical_logger:} [0:00:56.466097]
Traceback (most recent call last):
  File "/Users/chuandu/Documents/workspace/legal_llm_evaluation/llm_eval_env/bin/lighteval", line 8, in <module>
    sys.exit(cli_evaluate())
  File "/Users/chuandu/Documents/workspace/lighteval/src/lighteval/__main__.py", line 58, in cli_evaluate
    main_accelerate(args)
  File "/Users/chuandu/Documents/workspace/lighteval/src/lighteval/logging/hierarchical_logger.py", line 175, in wrapper
    return fn(*args, **kwargs)
  File "/Users/chuandu/Documents/workspace/lighteval/src/lighteval/main_accelerate.py", line 92, in main
    pipeline.evaluate()
  File "/Users/chuandu/Documents/workspace/lighteval/src/lighteval/pipeline.py", line 236, in evaluate
    self._compute_metrics(sample_id_to_responses)
  File "/Users/chuandu/Documents/workspace/lighteval/src/lighteval/pipeline.py", line 288, in _compute_metrics
    metrics = compute_metric(results=sample_responses, formatted_doc=doc, metrics=metric_category_metrics)
  File "/Users/chuandu/Documents/workspace/lighteval/src/lighteval/metrics/__init__.py", line 211, in apply_llm_as_judge_metric
    outputs.update(metric.compute(predictions=predictions, formatted_doc=formatted_doc))
  File "/Users/chuandu/Documents/workspace/lighteval/src/lighteval/metrics/utils.py", line 74, in compute
    return self.sample_level_fn(**kwargs)  # result, formatted_doc,
  File "/Users/chuandu/Documents/workspace/lighteval/src/lighteval/metrics/metrics_sample.py", line 811, in compute
    scores, messages, judgements = self.judge.evaluate_answer(questions, predictions, ref_answers)
  File "/Users/chuandu/Documents/workspace/lighteval/src/lighteval/metrics/llm_as_judge.py", line 158, in evaluate_answer
    response = self.__call_api(prompt)
  File "/Users/chuandu/Documents/workspace/lighteval/src/lighteval/metrics/llm_as_judge.py", line 259, in __call_api
    raise Exception("Failed to get response from the API")
Exception: Failed to get response from the API

Thank you!

@chuandudx chuandudx added the feature request New feature/request label Sep 18, 2024
@clefourrier
Member

I suspect this model is not served on the fly by the free version of inference endpoints - can you try with Llama 3.1 70B, for example, or Command R+?
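
A minimal sketch of that suggestion, changing only judge_model_name in the snippet above (the repository id and the import path below are assumptions; the model must be servable on the free inference API):

    import os

    from lighteval.metrics.metrics_sample import JudgeLLM  # import path assumed

    # Hypothetical swap to a larger hosted judge model, as suggested above;
    # "meta-llama/Meta-Llama-3.1-70B-Instruct" is an assumed repository id.
    judge_fn = JudgeLLM(
        judge_model_name="meta-llama/Meta-Llama-3.1-70B-Instruct",
        template_path=os.path.join(os.path.dirname(__file__), "judge_prompts.jsonl"),
        multi_turn=False,
    ).compute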

@chuandudx
Contributor Author

chuandudx commented Sep 18, 2024

Thank you for the feedback! @JoelNiklaus figured out that it's because we should feed in use_transformers=True when constructing the judge instance. Do you think it would be helpful to add an example like this in metrics.py or as a note in the README?
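
For reference, a minimal sketch of the working construction, assuming the flag is passed straight to the JudgeLLM constructor as described above (the import path is also an assumption about lighteval's module layout):

    import os

    from lighteval.metrics.metrics_sample import JudgeLLM  # import path assumed

    judge_fn = JudgeLLM(
        judge_model_name="TinyLlama/TinyLlama_v1.1",
        template_path=os.path.join(os.path.dirname(__file__), "judge_prompts.jsonl"),
        multi_turn=False,
        use_transformers=True,  # run the judge locally via transformers instead of the OpenAI-compatible API
    ).compute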
