Evaluation #5

Open
joelpaulkoch opened this issue Dec 12, 2024 · 1 comment

To build a performant RAG system, we must be able to evaluate its performance.

Here is a list of different aspects that we might eventually want to support:

  • "technical performance", e.g. latency, memory footprint, number of tokens,...
  • end-to-end, e.g. RAG triad of context relevance, answer relevance, groundedness
  • component specific, e.g. how good is retrieval
  • offline evaluation (before deployment)
  • online evaluation (in production)
  • evaluation without ground truth, i.e. only based on generated response/context/query/...
  • evaluation with ground truth, i.e. take a dataset with reference answers and check whether the generated answer is similar to the reference answer (see the sketch after this list)
  • other aspects of the response, e.g. friendliness, harmfulness, etc.
  • evaluation with a "generic", premade dataset, for instance amnesty_qa
  • evaluation with a user provided dataset for the specific use case
  • tools to generate a suitable evaluation dataset from data for the use case (this is also part of langchain, llamaindex, haystack)
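
For the ground-truth case, here is a minimal sketch of what such a similarity check could look like, comparing embeddings with cosine similarity via Nx. The `embed` argument is a hypothetical one-arity function returning an Nx tensor, standing in for whatever embedding step the RAG system already uses; this is not an existing API.

```elixir
# Minimal sketch of a ground-truth check: embed both answers and compare them
# with cosine similarity. `embed` is a hypothetical embedding function.
defmodule Eval.GroundTruth do
  import Nx.Defn

  defn cosine_similarity(a, b) do
    Nx.dot(a, b) / (Nx.LinAlg.norm(a) * Nx.LinAlg.norm(b))
  end

  def score(generated_answer, reference_answer, embed) do
    generated = embed.(generated_answer)
    reference = embed.(reference_answer)

    Nx.to_number(cosine_similarity(generated, reference))
  end
end
```

A threshold on this score (say 0.8) could then decide whether a generated answer counts as matching the reference; the exact threshold would need tuning per use case.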

The highest priority is end-to-end evaluation, so we get numbers that show whether our components actually improve the system.
Often, an LLM-as-a-Judge approach is used to automate evaluation.
The RAG triad seems to be a good overall measure to start with.
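
As a rough illustration of the LLM-as-a-Judge idea for the RAG triad, the sketch below builds a judge prompt and decodes the three scores from the judge's JSON answer. The prompt wording and the `ask_judge` callback are assumptions for illustration, not the actual implementation.

```elixir
# Hypothetical LLM-as-a-Judge sketch for the RAG triad. `ask_judge` stands in
# for whatever chat completion call is used and must return the raw model text.
defmodule Eval.RagTriad do
  @judge_prompt """
  You are an evaluator. Given a query, the retrieved context and the generated answer,
  rate each of the following from 0.0 to 1.0:
  - context relevance: is the context relevant to the query?
  - answer relevance: does the answer address the query?
  - groundedness: is the answer supported by the context?
  Respond with JSON: {"context_relevance_score": _, "answer_relevance_score": _, "groundedness_score": _}
  """

  def evaluate(query, context, answer, ask_judge) do
    prompt = """
    #{@judge_prompt}

    Query: #{query}
    Context: #{context}
    Answer: #{answer}
    """

    # Decode the judge's JSON answer into a map of the three scores.
    prompt
    |> ask_judge.()
    |> Jason.decode!()
  end
end
```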

With structured outputs, we can guarantee that the OpenAI API returns numbers as evaluation scores.

This is currently not supported with Nx or Bumblebee, so there we would have to hope that the model responds with scores and parse them ourselves.
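
For reference, a structured-outputs request against the OpenAI chat completions API could look roughly like the following. The model name, prompt and schema details are placeholders, and the HTTP call uses Req; this is a sketch, not the library's implementation.

```elixir
# Sketch: enforce numeric scores via OpenAI structured outputs (json_schema).
judge_prompt = "...judge prompt containing query, context and answer..." # placeholder

schema = %{
  type: "object",
  properties: %{
    context_relevance_score: %{type: "number"},
    answer_relevance_score: %{type: "number"},
    groundedness_score: %{type: "number"}
  },
  required: ["context_relevance_score", "answer_relevance_score", "groundedness_score"],
  additionalProperties: false
}

body = %{
  model: "gpt-4o-mini",
  messages: [%{role: "user", content: judge_prompt}],
  response_format: %{
    type: "json_schema",
    json_schema: %{name: "rag_triad", strict: true, schema: schema}
  }
}

response =
  Req.post!("https://api.openai.com/v1/chat/completions",
    json: body,
    auth: {:bearer, System.fetch_env!("OPENAI_API_KEY")}
  )

# The message content is guaranteed to match the schema, so decoding yields the scores.
scores =
  response.body["choices"]
  |> List.first()
  |> get_in(["message", "content"])
  |> Jason.decode!()
```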

For now, I've implemented evaluation of the RAG triad with OpenAI.

joelpaulkoch commented Jan 8, 2025

We support evaluations via Http and Nx.
At the moment, this includes two evaluations:

  • RAG Triad, i.e. context_relevance_score, groundedness_score, answer_relevance_score as JSON output
  • Hallucination detection, as YES or NO output

For Http, we can enforce structured outputs. That's not the case for Nx at the moment.
Therefore, the RAG triad via Nx is probably error-prone, as we must hope that the model returns valid JSON.
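
One way to make the Nx path more robust is to parse the model output defensively and treat anything unexpected as an error. This is only a sketch of that idea, not the current implementation:

```elixir
# Defensive parsing for the Nx/Bumblebee path, where the raw model text may or
# may not be valid JSON. Returns {:ok, scores} only when all three scores are
# present and numeric; the hallucination output is expected to be YES or NO.
defmodule Eval.ParseJudgeOutput do
  @keys ["context_relevance_score", "answer_relevance_score", "groundedness_score"]

  def parse_rag_triad(raw_text) do
    with {:ok, decoded} when is_map(decoded) <- Jason.decode(raw_text),
         scores when map_size(scores) == 3 <- Map.take(decoded, @keys),
         true <- Enum.all?(scores, fn {_key, value} -> is_number(value) end) do
      {:ok, scores}
    else
      _ -> {:error, {:invalid_judge_output, raw_text}}
    end
  end

  def parse_hallucination(raw_text) do
    case raw_text |> String.trim() |> String.upcase() do
      "YES" -> {:ok, true}
      "NO" -> {:ok, false}
      other -> {:error, {:invalid_judge_output, other}}
    end
  end
end
```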

The installer generates a simple evaluation script in the user's project.
The script downloads a RAG dataset, runs it through the RAG system and evaluates the results.
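
Roughly, such a script's shape could look like the following; the dataset URL, `MyApp.Rag.query/1` and `MyApp.Eval.rag_triad/1` are placeholders for illustration, not the code the installer actually emits.

```elixir
# Sketch of an eval script: download a dataset, run it through the RAG system,
# score the results. All module and function names below are placeholders.

# Download an evaluation dataset (Req decodes the JSON response into a list of maps).
dataset = Req.get!("https://example.com/rag_eval_dataset.json").body

results =
  Enum.map(dataset, fn %{"question" => question} = row ->
    # Run the question through the RAG system under evaluation.
    %{response: response} = generation = MyApp.Rag.query(question)

    # Score the generation, e.g. with the RAG triad judge.
    Map.merge(row, %{"response" => response, "scores" => MyApp.Eval.rag_triad(generation)})
  end)

IO.inspect(results, label: "evaluation results")
```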

Users are expected to use their own dataset that is suitable for their use case.
There are tools to help generate such datasets, for instance this one.
We could add pointers to them in our documentation, or as a comment in the eval script.
