🧠 LLM Evaluation Dashboard Agent

This is an agentic, graph-based framework for evaluating the outputs of Large Language Models (LLMs) using hallucination detection techniques.

Built using LangGraph, the project orchestrates a flow of evaluation agents that process prompts, generate responses, assess hallucination risk, and log results in a configurable, scalable pipeline.
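
Below is a minimal sketch of how such a graph can be wired with LangGraph. The state fields and node names (input_handler, model_runner, hallucination_judge, telemetry_logger) are illustrative assumptions, not the repository's exact identifiers, and the node bodies are stubs.

from typing import TypedDict
from langgraph.graph import StateGraph, END

# Shared state passed between evaluation agents (field names are illustrative).
class EvalState(TypedDict, total=False):
    prompt: str
    ground_truth: str
    model: str
    response: str
    hallucination_verdict: str
    latency_ms: float
    tokens_used: int

def input_handler(state: EvalState) -> EvalState:
    # Normalize the incoming test case into the shared state.
    return {"prompt": state["prompt"].strip()}

def model_runner(state: EvalState) -> EvalState:
    # Call the model under test (e.g. GPT-4o) and store its response.
    return {"response": "..."}

def hallucination_judge(state: EvalState) -> EvalState:
    # Compare the response against the ground truth (LLM-as-a-Judge or embeddings).
    return {"hallucination_verdict": "pass"}

def telemetry_logger(state: EvalState) -> EvalState:
    # Record latency, token usage, and the verdict.
    return {}

graph = StateGraph(EvalState)
graph.add_node("input_handler", input_handler)
graph.add_node("model_runner", model_runner)
graph.add_node("hallucination_judge", hallucination_judge)
graph.add_node("telemetry_logger", telemetry_logger)
graph.set_entry_point("input_handler")
graph.add_edge("input_handler", "model_runner")
graph.add_edge("model_runner", "hallucination_judge")
graph.add_edge("hallucination_judge", "telemetry_logger")
graph.add_edge("telemetry_logger", END)
app = graph.compile()

result = app.invoke({"prompt": "What is asyncio?", "ground_truth": "...", "model": "gpt-4o"})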


🚀 Features

  • 🧱 Agent-based modular design using LangGraph
  • ✅ Multiple hallucination detection methods (LLM-as-a-Judge and embedding similarity; more coming)
  • ⚙️ Configurable via config.yaml (see the loading sketch after this list)
  • 🧪 Load test prompts and ground truths from JSON files
  • 📊 Tracks tokens, latency, model version, and evaluation scores
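
A minimal sketch of loading the config. The key names shown here (model, hallucination_detection, prompts_file) are assumptions about what config.yaml might contain, not the project's actual schema.

import yaml  # pip install pyyaml

def load_config(path: str = "config.yaml") -> dict:
    # Read the evaluation settings that drive the pipeline.
    with open(path) as f:
        return yaml.safe_load(f)

config = load_config()
model_name = config.get("model", "gpt-4o")
detection_method = config.get("hallucination_detection", "llm_judge")
prompts_path = config.get("prompts_file", "prompts.json")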

🧪 Example Test Case

{
  "prompt": "What is the difference between synchronous and asynchronous programming in Python?",
  "ground_truth": "...",
  "model": "gpt-4o",
  "metadata": {
    "use_case": "technical explanation",
    "ground_truth_type": "text"
  }
}
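
Test cases like the one above can be loaded from a JSON file and fed into the pipeline. A short sketch (the file name and the app.invoke call are assumptions; the field names follow the example above):

import json

def load_test_cases(path: str = "prompts.json") -> list[dict]:
    # Each entry is expected to match the schema shown above.
    with open(path) as f:
        return json.load(f)

for case in load_test_cases():
    state = {
        "prompt": case["prompt"],
        "ground_truth": case["ground_truth"],
        "model": case.get("model", "gpt-4o"),
    }
    # Feed the prepared state into the compiled graph, e.g. app.invoke(state).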

🔮 Roadmap

✅ Completed

  • LangGraph setup with typed shared state (EvalState)
  • Input handler agent to prepare prompt state
  • Model runner agent to call OpenAI (GPT-4o)
  • Config-driven architecture via config.yaml
  • Hallucination detection via LLM-as-a-Judge (OpenAI)
  • JSON prompt ingestion for modular test cases
  • Hallucination detection via embedding similarity (see the sketch after this list)
  • Telemetry logging agent (latency, token usage, verdicts)
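
A rough sketch of the embedding-similarity check, using the OpenAI embeddings API and cosine similarity. The embedding model name and the 0.75 threshold are illustrative assumptions, not values taken from this repository.

import numpy as np
from openai import OpenAI

client = OpenAI()

def embedding_similarity(response: str, ground_truth: str,
                         model: str = "text-embedding-3-small") -> float:
    # Embed both texts in a single request and return their cosine similarity.
    result = client.embeddings.create(model=model, input=[response, ground_truth])
    a = np.array(result.data[0].embedding)
    b = np.array(result.data[1].embedding)
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

def looks_hallucinated(response: str, ground_truth: str, threshold: float = 0.75) -> bool:
    # Flag the response when it drifts too far from the ground truth (threshold is arbitrary here).
    return embedding_similarity(response, ground_truth) < threshold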

🔜 In Progress / Planned

  • Add Vectara hallucination evaluation model (Hugging Face)
  • Evaluation metrics agent (tone, coherence, completeness scoring)
  • CLI or batch runner for test prompt files
  • Result logger (to JSONL or CSV)
  • Dashboard or Streamlit front-end for visualizing evaluations
