To reproduce benchmark metrics on RAGBench, use calculate_metrics.py. For example, to reproduce the GPT-3.5, RAGAS, and Trulens results on a set of RAGBench component datasets, run:
python calculate_metrics.py --dataset hotpotqa msmarco hagrid expertqa
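If you prefer to drive the metric calculation from Python rather than the shell (for example, inside a larger evaluation script), a minimal sketch is below. It assumes calculate_metrics.py is in the current working directory and uses only the --dataset flag shown above.

# Sketch: invoke calculate_metrics.py programmatically via subprocess.
# Assumes the script is in the current working directory.
import subprocess
import sys

component_datasets = ["hotpotqa", "msmarco", "hagrid", "expertqa"]

subprocess.run(
    [sys.executable, "calculate_metrics.py", "--dataset", *component_datasets],
    check=True,  # raise if the script exits with a non-zero status
)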
Use the run_inference.py script to evaluate RAG eval frameworks on RAGBench. Input arguments:
- dataset: name of the RAGBench dataset to run inference on
- model: the model to evaluate (trulens or ragas)
- output: output directory to store results in
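As a rough illustration of the interface these arguments imply, the snippet below shows how they might be declared with argparse. This is only a sketch; the actual script may define its arguments differently.

# Sketch: argparse declaration mirroring the documented run_inference.py arguments.
import argparse

parser = argparse.ArgumentParser(description="Run RAG eval framework inference on RAGBench")
parser.add_argument("--dataset", required=True,
                    help="name of the RAGBench dataset to run inference on")
parser.add_argument("--model", required=True, choices=["trulens", "ragas"],
                    help="the model (eval framework) to evaluate")
parser.add_argument("--output", required=True,
                    help="output directory to store results in")
args = parser.parse_args()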
Run Trulens inference on the HotpotQA subset:
python run_inference.py --dataset hotpotqa --model trulens --output results
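To cover several subsets and both eval frameworks in one go, you can loop over the documented flags, as in the sketch below. Writing each run to its own output subdirectory is an assumption for keeping results separate, not a requirement of the script.

# Sketch: batch-run run_inference.py across datasets and models.
import subprocess
import sys
from pathlib import Path

datasets = ["hotpotqa", "msmarco", "hagrid", "expertqa"]
models = ["trulens", "ragas"]

for dataset in datasets:
    for model in models:
        # Hypothetical per-run output directory, e.g. results/trulens_hotpotqa
        out_dir = Path("results") / f"{model}_{dataset}"
        out_dir.mkdir(parents=True, exist_ok=True)
        subprocess.run(
            [sys.executable, "run_inference.py",
             "--dataset", dataset, "--model", model, "--output", str(out_dir)],
            check=True,
        )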