An interactive leaderboard system for dynamic LLM evaluation that converts natural language queries into customized model rankings, built on a FastAPI backend, a modern Reflex frontend, Celery workers, and HRET integration.
- Natural Language Interface: Describe your evaluation needs in plain English
- Intelligent Plan Generation: AI converts queries to structured evaluation plans
- Context-Aware Processing: Understands domain-specific requirements
- Real-time Rankings: Live updates as evaluations complete
- Multi-dimensional Comparison: Compare models across various metrics
- Interactive Visualizations: Rich charts and analytics
- Async Processing: Background task processing with Celery
- Distributed Workers: Scale evaluation capacity horizontally
- Caching System: Fast results with Redis caching
- Multiple Metrics: Accuracy, F1-score, semantic similarity, and more
- Statistical Analysis: Significance testing and confidence intervals
- Category Breakdown: Performance analysis by subject and difficulty
- Docker Deployment: Complete containerized setup
- Health Monitoring: Built-in health checks and monitoring
- Scalable Design: Ready for production workloads
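As a taste of the natural language interface, submitting a query programmatically might look like the sketch below. The `/api/v1/evaluations` endpoint path and payload shape are assumptions for illustration; only the `/api/v1/health` endpoint is confirmed later in this README.

```python
# Hypothetical sketch: submitting a natural-language evaluation query.
# The endpoint path and payload shape are assumptions, not the project's
# documented API; see the API Reference for the actual routes.
import requests

BASE_URL = "http://localhost:8000"  # backend API (prod port)

response = requests.post(
    f"{BASE_URL}/api/v1/evaluations",  # assumed endpoint
    json={"query": "Compare GPT-4 and Claude-3 on Korean technology "
                   "multiple choice questions"},
    timeout=30,
)
response.raise_for_status()
print(response.json())  # e.g. a task id to poll while rankings update live
```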
We've migrated fully from Streamlit to Reflex for a modern, production-ready web experience!
- Modern UI/UX: Clean, responsive design with Tailwind CSS
- Better Performance: Optimized rendering and state management
- Production Ready: Built for scalability and deployment
- Component Architecture: Maintainable and extensible codebase
- Reflex Frontend: Default experience shipped in this repository
- Streamlit Frontend: Removed in favor of Reflex
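To give a flavor of the component architecture, here is a minimal Reflex page in the style the frontend uses. This is an illustrative sketch, not code from `apps/reflex_frontend`; the state and component names are invented.

```python
import reflex as rx


class QueryState(rx.State):
    """Holds the natural-language query typed by the user (illustrative)."""
    query: str = ""


def index() -> rx.Component:
    # Reflex composes pages from Python components; state changes
    # re-render only the affected parts of the UI.
    return rx.vstack(
        rx.heading("BenchhubPlus"),
        rx.input(
            placeholder="Describe your evaluation needs...",
            on_change=QueryState.set_query,
        ),
        rx.text("Current query: ", QueryState.query),
    )


app = rx.App()
app.add_page(index)
```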
1. Install prerequisites
   - Git
   - Docker and Docker Compose
   - An OpenAI API key (or another supported model provider key)
2. Clone the repository and create your environment file
   ```bash
   git clone https://github.com/HAE-RAE/BenchhubPlus.git
   cd BenchhubPlus
   cp .env.example .env
   ```
3. Fill in the required variables
   - `OPENAI_API_KEY`: your model provider key
   - `POSTGRES_PASSWORD`: choose a strong password for the bundled PostgreSQL database
   - Adjust any other values (ports, planner model, etc.) if needed
4. Launch the full stack
   ```bash
   ./scripts/deploy.sh development   # local dev with live reload mounts
   # or, for a slimmer, prod-like run:
   ./scripts/deploy.sh
   ```
   The helper script builds the images, starts the services, and waits for them to report healthy.
5. Open the application
   - Frontend UI (Reflex): http://localhost:3000
   - Backend API: http://localhost:8001 (dev) or http://localhost:8000 (prod)
   - API documentation: http://localhost:8001/docs (dev) or http://localhost:8000/docs (prod)
   - Flower (Celery dashboard): http://localhost:5556 (dev) or http://localhost:5555 (prod)
6. Verify everything is running
   ```bash
   curl http://localhost:8001/api/v1/health
   ```
7. Shut down when finished
   ```bash
   docker-compose -f docker-compose.dev.yml down
   ```
For contributors who prefer not to use Docker:
```bash
./scripts/setup.sh  # creates a Python 3.11 virtualenv and installs dependencies
source venv/bin/activate

# Backend API
python -m uvicorn apps.backend.main:app --host 0.0.0.0 --port 8000 --reload

# Reflex frontend
cd apps/reflex_frontend
API_BASE_URL=http://localhost:8000 reflex run --env dev --backend-host 0.0.0.0 --backend-port 8001 --frontend-host 0.0.0.0 --frontend-port 3000

# Background worker
celery -A apps.worker.celery_app worker --loglevel=info
```

You will also need local PostgreSQL and Redis instances that match the connection settings in `.env`.
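To confirm the locally run backend is up, a small polling script can stand in for the curl check from the Docker quick start; the health endpoint below is the one shown there.

```python
# Minimal sketch: wait for the local backend to report healthy.
# Uses the /api/v1/health endpoint from the quick start above.
import time

import requests

for _ in range(30):
    try:
        resp = requests.get("http://localhost:8000/api/v1/health", timeout=2)
        if resp.ok:
            print("backend healthy:", resp.json())
            break
    except requests.ConnectionError:
        pass  # server not accepting connections yet
    time.sleep(2)
else:
    raise SystemExit("backend did not become healthy in time")
```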
Need more detail? The Quick Start Guide walks through the process with screenshots and tips, and the User Manual explains how to operate the app end to end. Setup guides are also available (English | Korean).
"Compare GPT-4 and Claude-3 on Korean technology multiple choice questions"
{
"problem_type": "MCQA",
"target_type": "General",
"subject_type": ["Tech.", "Tech./Coding"],
"task_type": "Knowledge",
"external_tool_usage": false,
"language": "Korean",
"sample_size": 100
}{
"models": [
{
"name": "gpt-4",
"api_base": "https://api.openai.com/v1",
"api_key": "sk-...",
"model_type": "openai"
},
{
"name": "claude-3",
"api_base": "https://api.anthropic.com",
"api_key": "sk-ant-...",
"model_type": "anthropic"
}
]
}- Interactive leaderboard with rankings
- Detailed performance metrics by BenchHub categories
- Statistical significance analysis
- Exportable results with BenchHub metadata
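The statistical significance analysis can be pictured as a paired bootstrap over per-question correctness, sketched below with made-up data. The project's actual tests may differ; this README only states that significance testing and confidence intervals are provided.

```python
# Illustrative paired bootstrap: how often does model A outscore model B
# when the same questions are resampled? The data here is fabricated;
# real runs use per-question results from HRET evaluations.
import random

model_a = [1, 1, 0, 1, 1, 0, 1, 1, 1, 0] * 10  # 1 = correct answer
model_b = [1, 0, 0, 1, 0, 0, 1, 1, 0, 0] * 10

def bootstrap_win_rate(a, b, iters=10_000, seed=0):
    rng = random.Random(seed)
    n = len(a)
    wins = 0
    for _ in range(iters):
        idx = [rng.randrange(n) for _ in range(n)]  # paired resample
        if sum(a[i] for i in idx) > sum(b[i] for i in idx):
            wins += 1
    return wins / iters

print(f"P(model A beats model B) ~ {bootstrap_win_rate(model_a, model_b):.3f}")
```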
```
┌─────────────────┐     ┌─────────────────┐     ┌─────────────────┐
│     Reflex      │     │     FastAPI     │     │     Celery      │
│    Frontend     │────►│     Backend     │────►│     Workers     │
└─────────────────┘     └─────────────────┘     └─────────────────┘
         │                       │                       │
         │                       ▼                       ▼
         │              ┌─────────────────┐     ┌─────────────────┐
         │              │   PostgreSQL    │     │      Redis      │
         │              │    Database     │     │      Cache      │
         └──────────────┤                 │     └─────────────────┘
                        └─────────────────┘
                                │
                        ┌─────────────────┐
                        │  HRET Toolkit   │
                        │   Integration   │
                        └─────────────────┘
```
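The FastAPI-to-Celery hop in the diagram works roughly as sketched below. The worker module path `apps.worker.celery_app` appears in the local setup above; the task name, body, and broker default are illustrative assumptions.

```python
# Illustrative sketch of the async flow: FastAPI enqueues work on Redis,
# and a Celery worker executes it. Task name and body are assumptions.
from celery import Celery

celery_app = Celery("benchhub", broker="redis://localhost:6379/0")


@celery_app.task
def run_evaluation(plan: dict) -> dict:
    # A real worker would invoke the HRET toolkit here and persist
    # results to PostgreSQL for the frontend to poll.
    return {"status": "completed", "plan": plan}


# FastAPI side: enqueue and return immediately, e.g.
# task = run_evaluation.delay({"problem_type": "MCQA", "sample_size": 100})
# task.id can then be handed back to the frontend for live ranking updates.
```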
Documentation is available in both English (docs/eng) and Korean (docs/kor).
- Installation Guide (EN) · KO - Complete setup instructions
- Quick Start (EN) · KO - Get running in 5 minutes
- User Manual (EN) · KO - Complete user guide
- Development Guide (EN) · KO - Development setup and guidelines
- Architecture (EN) · KO - System design and architecture
- Docker Deployment (EN) · KO - Container deployment guide
- API Reference (EN) · KO - REST API documentation
- BenchHub Configuration (EN) · KO - BenchHub dataset configuration guide
- HRET Integration (EN) · KO - HRET toolkit integration guide
- Troubleshooting (EN) · KO - Common issues and solutions
- FastAPI: High-performance REST API framework
- SQLAlchemy: Database ORM with PostgreSQL
- Celery: Distributed task queue with Redis
- Pydantic: Data validation and serialization
- Reflex: Modern Python web framework
- Plotly: Interactive data visualizations
- Pandas: Data manipulation and analysis
- OpenAI API: GPT models for plan generation
- HRET Toolkit: Evaluation framework integration
- Custom Metrics: Extensible evaluation metrics (see the sketch after this list)
- Docker: Containerization and deployment
- PostgreSQL: Primary data storage
- Redis: Caching and task queue
- Nginx: Reverse proxy (production)
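The Custom Metrics entry above hints at an extension point; a plausible shape for it is a simple registry like the sketch below. The decorator and function signature are assumptions, not the project's actual API.

```python
# Hypothetical custom-metric registry; not the project's actual API.
from typing import Callable

METRICS: dict[str, Callable[[str, str], float]] = {}


def register_metric(name: str):
    """Register a metric under a name so evaluation plans can refer to it."""
    def wrap(fn: Callable[[str, str], float]):
        METRICS[name] = fn
        return fn
    return wrap


@register_metric("exact_match")
def exact_match(prediction: str, reference: str) -> float:
    # Whitespace-insensitive exact match, scored 0.0 or 1.0.
    return float(prediction.strip() == reference.strip())


print(METRICS["exact_match"]("Seoul", "Seoul "))  # 1.0
```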
```bash
# Required
OPENAI_API_KEY=your_openai_api_key_here
POSTGRES_PASSWORD=secure_password

# Optional
DEBUG=false
LOG_LEVEL=info
DATABASE_URL=postgresql://user:pass@host:5432/db
CELERY_BROKER_URL=redis://host:6379/0
```

Currently supported model providers:
- OpenAI: GPT-3.5, GPT-4, and variants
- Anthropic: Claude models
- Hugging Face: Various open-source models
- Custom: Extensible for any API-compatible model
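A Custom entry would presumably reuse the model-configuration shape from the example earlier in this README; the `model_type` value and local endpoint below are assumptions.

```python
# Assumed "custom" provider entry, mirroring the JSON model configuration
# shown earlier; the model_type value is an assumption.
custom_model = {
    "name": "my-local-llm",
    "api_base": "http://localhost:8080/v1",  # any API-compatible server
    "api_key": "not-needed-locally",
    "model_type": "custom",
}
```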
We welcome contributions! Please see our Contributing Guide for details.
```bash
# Fork and clone the repository
git clone https://github.com/your-username/BenchhubPlus.git
cd BenchhubPlus

# Create development environment
python3.11 -m venv venv
source venv/bin/activate
pip install -e .

# Install development dependencies
pip install pytest black isort flake8 mypy

# Run tests
./scripts/test.sh

# Format code
black apps/
isort apps/
```

- Bug Reports: Use the bug report template
- Feature Requests: Use the feature request template
- Questions: Start a discussion
This project is licensed under the MIT License - see the LICENSE file for details.
- Hanwool Lee - h-albert-lee
- Eunsu Kim - rladmstn1714
- Joonyong Park - JoonYong-Park
- Hyunwoo Oh - hw-oh
- Hyungjin Jeon - gudwls47
- GitHub Issues: Bug reports and feature requests
- GitHub Discussions: Questions and community help
- Documentation: Comprehensive guides and references
Ready to start evaluating? Check out our Quick Start Guide or dive into the User Manual!