To build a performant RAG system, we must be able to evaluate its performance.
Here is a list of different aspects that we might eventually want to support:
"technical performance", e.g. latency, memory footprint, number of tokens,...
end-to-end, e.g. RAG triad of context relevance, answer relevance, groundedness
component specific, e.g. how good is retrieval
offline evaluation (before deployment)
online evaluation (in production)
evaluation without ground truth, i.e. only based on generated response/context/query/...
evaluation with ground truth, i.e. take a dataset with reference answer, check if generated answer is similar to reference answer
other aspects of the response, e.g. friendliness, harmfulness, etc.
evaluation with a "generic", premade dataset, for instance amnesty_qa
evaluation with a user provided dataset for the specific use case
tools to generate a suitable evaluation dataset from data for the use case (this is also part of langchain, llamaindex, haystack)
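For the ground-truth item above, one cheap check is the cosine similarity between the embedding of the generated answer and the embedding of the reference answer. Here is a minimal sketch with Nx, assuming the embeddings are already computed (e.g. with a Bumblebee text-embedding serving); the module name and the example vectors are placeholders:

```elixir
defmodule SimilarityEval do
  @doc """
  Cosine similarity between two embedding vectors.
  Values close to 1.0 mean the generated answer is close to the reference.
  """
  def cosine_similarity(a, b) do
    Nx.dot(a, b)
    |> Nx.divide(Nx.multiply(Nx.LinAlg.norm(a), Nx.LinAlg.norm(b)))
    |> Nx.to_number()
  end
end

# Placeholder embeddings of the generated and the reference answer.
generated_embedding = Nx.tensor([0.12, 0.31, 0.52, 0.08])
reference_embedding = Nx.tensor([0.10, 0.29, 0.55, 0.05])

SimilarityEval.cosine_similarity(generated_embedding, reference_embedding)
```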
The highest priority is end-to-end evaluation, so we get numbers that show whether our components actually improve the system.
Often, an LLM-as-a-Judge approach is used to automate evaluation.
The RAG triad seems to be a good overall measure to start with.
With structured outputs, we can guarantee that the OpenAI API returns the evaluation scores as numbers.
Structured outputs are currently not supported with Nx or Bumblebee, so there we would have to hope that the model responds with scores and then parse them out.
For now, I've implemented evaluation of the RAG triad with OpenAI.
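As an illustration of that approach, here is a minimal sketch of an LLM-as-a-Judge call against the OpenAI chat completions API, using a JSON schema so the three triad scores are guaranteed to come back as numeric fields. The prompt, the model name, the Req-based HTTP call, and the example query/context/response are assumptions, not the actual implementation:

```elixir
# Minimal sketch: judge a single RAG result with the OpenAI API and structured
# outputs. Assumes the `req` and `jason` packages and an OPENAI_API_KEY env var.
query = "When was Elixir 1.0 released?"
context = "Elixir 1.0 was released in September 2014."
response = "Elixir 1.0 came out in 2014."

schema = %{
  type: "object",
  properties: %{
    context_relevance_score: %{type: "number"},
    groundedness_score: %{type: "number"},
    answer_relevance_score: %{type: "number"}
  },
  required: ["context_relevance_score", "groundedness_score", "answer_relevance_score"],
  additionalProperties: false
}

request_body = %{
  model: "gpt-4o-mini",
  messages: [
    %{role: "system", content: "You are an evaluator. Score the RAG triad from 0 to 1."},
    %{role: "user", content: "Query: #{query}\nContext: #{context}\nResponse: #{response}"}
  ],
  response_format: %{
    type: "json_schema",
    json_schema: %{name: "rag_triad", strict: true, schema: schema}
  }
}

response_body =
  Req.post!("https://api.openai.com/v1/chat/completions",
    auth: {:bearer, System.fetch_env!("OPENAI_API_KEY")},
    json: request_body
  ).body

scores =
  response_body
  |> get_in(["choices", Access.at(0), "message", "content"])
  |> Jason.decode!()

# `scores` is now a map with the three numeric triad scores.
```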
We support evaluations via Http and Nx.
At the moment, this includes two evaluations:
- RAG triad, i.e. context_relevance_score, groundedness_score, answer_relevance_score as JSON output
- hallucination detection, as YES or NO output
For Http, we can enforce structured outputs. That's not the case for Nx at the moment.
Therefore, the RAG triad via Nx is probably error-prone, as we must hope that the model returns valid JSON.
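Until that changes, one option is to parse defensively: try JSON first and fall back to scraping key/number pairs out of the raw text. A rough sketch (the score keys mirror the ones above; the module name and the fallback regex are assumptions):

```elixir
defmodule TriadParsing do
  @score_keys ~w(context_relevance_score groundedness_score answer_relevance_score)

  @doc """
  Extracts the RAG triad scores from a model response that is not guaranteed
  to be valid JSON (e.g. when the response was generated via Nx).
  """
  def parse_triad(text) when is_binary(text) do
    case Jason.decode(text) do
      {:ok, %{} = decoded} ->
        {:ok, Map.take(decoded, @score_keys)}

      _not_a_json_object ->
        # Fall back to pulling `<key> ... <number>` pairs out of free-form text.
        scores =
          for key <- @score_keys,
              [_, raw] <- Regex.scan(~r/#{key}\D*(\d+(?:\.\d+)?)/, text),
              into: %{} do
            {number, _rest} = Float.parse(raw)
            {key, number}
          end

        if map_size(scores) == length(@score_keys),
          do: {:ok, scores},
          else: {:error, :could_not_parse_scores}
    end
  end
end
```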
The installer generates a simple evaluation script into the user's project.
The script downloads a RAG dataset, runs it through the RAG system and evaluates the results.
Users are expected to use their own dataset that is suitable for their use case.
There are tools to help generate such datasets, for instance this one.
We could add pointers to them in our documentation, or as a comment in the eval script.
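For reference, the generated script boils down to roughly the following shape. All module names, function names, and the dataset URL below are hypothetical placeholders, not the actual generated code:

```elixir
# 1. Download an evaluation dataset of question / reference answer pairs.
#    (Placeholder URL; in practice this is a dataset suited to the use case.)
dataset = Req.get!("https://example.com/rag_eval_dataset.json").body

# 2. Run each question through the RAG pipeline and collect everything the
#    evaluation needs: query, retrieved context and generated response.
#    `MyApp.Rag.query/1` stands in for the project's own RAG module.
results =
  Enum.map(dataset, fn %{"question" => question} ->
    %{context: context, response: response} = MyApp.Rag.query(question)
    %{query: question, context: context, response: response}
  end)

# 3. Evaluate each result, e.g. the RAG triad via an LLM-as-a-Judge call.
#    `MyApp.Rag.evaluate_rag_triad/1` is also a placeholder.
scores = Enum.map(results, &MyApp.Rag.evaluate_rag_triad/1)

IO.inspect(scores, label: "RAG triad scores")
```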