An interactive leaderboard system for dynamic LLM evaluation that converts natural language queries into customized model rankings, built on a FastAPI backend, a modern Reflex frontend, Celery workers, and HRET integration.
- Natural Language Interface: Describe your evaluation needs in plain English
- Intelligent Plan Generation: AI converts queries to structured evaluation plans
- Context-Aware Processing: Understands domain-specific requirements
- Real-time Rankings: Live updates as evaluations complete
- Multi-dimensional Comparison: Compare models across various metrics
- Interactive Visualizations: Rich charts and analytics
- Async Processing: Background task processing with Celery
- Distributed Workers: Scale evaluation capacity horizontally
- Caching System: Fast results with Redis caching
- Multiple Metrics: Accuracy, F1-score, semantic similarity, and more
- Statistical Analysis: Significance testing and confidence intervals
- Category Breakdown: Performance analysis by subject and difficulty
- Docker Deployment: Complete containerized setup
- Health Monitoring: Built-in health checks and monitoring
- Scalable Design: Ready for production workloads
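As a taste of the natural language interface, submitting a query programmatically might look like the sketch below. The `/api/v1/evaluations` endpoint path and payload shape are assumptions for illustration; only the `/api/v1/health` endpoint is confirmed later in this README.

```python
# Hypothetical sketch: submitting a natural-language evaluation query.
# The endpoint path and payload shape are assumptions, not the project's
# documented API; see the API Reference for the actual routes.
import requests

BASE_URL = "http://localhost:8000"  # backend API (prod port)

response = requests.post(
    f"{BASE_URL}/api/v1/evaluations",  # assumed endpoint
    json={"query": "Compare GPT-4 and Claude-3 on Korean technology "
                   "multiple choice questions"},
    timeout=30,
)
response.raise_for_status()
print(response.json())  # e.g. a task id to poll while rankings update live
```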
We've migrated fully from Streamlit to Reflex for a modern, production-ready web experience!
- Modern UI/UX: Clean, responsive design with Tailwind CSS
- Better Performance: Optimized rendering and state management
- Production Ready: Built for scalability and deployment
- Component Architecture: Maintainable and extensible codebase
- Reflex Frontend: Default experience shipped in this repository
- Streamlit Frontend: Removed in favor of Reflex
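To give a flavor of the component architecture, here is a minimal Reflex page in the style the frontend uses. This is an illustrative sketch, not code from `apps/reflex_frontend`; the state and component names are invented.

```python
import reflex as rx


class QueryState(rx.State):
    """Holds the natural-language query typed by the user (illustrative)."""
    query: str = ""


def index() -> rx.Component:
    # Reflex composes pages from Python components; state changes
    # re-render only the affected parts of the UI.
    return rx.vstack(
        rx.heading("BenchhubPlus"),
        rx.input(
            placeholder="Describe your evaluation needs...",
            on_change=QueryState.set_query,
        ),
        rx.text("Current query: ", QueryState.query),
    )


app = rx.App()
app.add_page(index)
```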
1. Install prerequisites
   - Git
   - Docker and Docker Compose
   - An OpenAI API key (or another supported model provider key)
2. Clone the repository and create your environment file
   ```bash
   git clone https://github.com/HAE-RAE/BenchhubPlus.git
   cd BenchhubPlus
   cp .env.example .env
   ```
3. Fill in the required variables
   - `OPENAI_API_KEY`: your model provider key
   - `POSTGRES_PASSWORD`: choose a strong password for the bundled PostgreSQL database
   - Adjust any other values (ports, planner model, etc.) if needed
4. Launch the full stack
   ```bash
   ./scripts/deploy.sh development   # local dev with live reload mounts
   # or, for a slimmer, prod-like run:
   ./scripts/deploy.sh
   ```
   The helper script builds the images, starts the services, and waits for them to report healthy.
5. Open the application
   - Frontend UI (Reflex): http://localhost:3000
   - Backend API: http://localhost:8001 (dev) or http://localhost:8000 (prod)
   - API documentation: http://localhost:8001/docs (dev) or http://localhost:8000/docs (prod)
   - Flower (Celery dashboard): http://localhost:5556 (dev) or http://localhost:5555 (prod)
6. Verify everything is running
   ```bash
   curl http://localhost:8001/api/v1/health
   ```
7. Shut down when finished
   ```bash
   docker-compose -f docker-compose.dev.yml down
   ```
For contributors who prefer not to use Docker:
```bash
./scripts/setup.sh  # creates a Python 3.11 virtualenv and installs dependencies
source venv/bin/activate

# Backend API
python -m uvicorn apps.backend.main:app --host 0.0.0.0 --port 8000 --reload

# Reflex frontend
cd apps/reflex_frontend
API_BASE_URL=http://localhost:8000 reflex run --env dev --backend-host 0.0.0.0 --backend-port 8001 --frontend-host 0.0.0.0 --frontend-port 3000

# Background worker
celery -A apps.worker.celery_app worker --loglevel=info
```

You will also need local PostgreSQL and Redis instances that match the connection settings in `.env`.
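To confirm the locally run backend is up, a small polling script can stand in for the curl check from the Docker quick start; the health endpoint below is the one shown there.

```python
# Minimal sketch: wait for the local backend to report healthy.
# Uses the /api/v1/health endpoint from the quick start above.
import time

import requests

for _ in range(30):
    try:
        resp = requests.get("http://localhost:8000/api/v1/health", timeout=2)
        if resp.ok:
            print("backend healthy:", resp.json())
            break
    except requests.ConnectionError:
        pass  # server not accepting connections yet
    time.sleep(2)
else:
    raise SystemExit("backend did not become healthy in time")
```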
Need more detail? The Quick Start Guide walks through the process with screenshots and tips, and the User Manual explains how to operate the app end to end. Setup guides are also available (English | Korean).
"Compare GPT-4 and Claude-3 on Korean technology multiple choice questions"
{
"problem_type": "MCQA",
"target_type": "General",
"subject_type": ["Tech.", "Tech./Coding"],
"task_type": "Knowledge",
"external_tool_usage": false,
"language": "Korean",
"sample_size": 100
}{
"models": [
{
"name": "gpt-4",
"api_base": "https://api.openai.com/v1",
"api_key": "sk-...",
"model_type": "openai"
},
{
"name": "claude-3",
"api_base": "https://api.anthropic.com",
"api_key": "sk-ant-...",
"model_type": "anthropic"
}
]
}- Interactive leaderboard with rankings
- Detailed performance metrics by BenchHub categories
- Statistical significance analysis
- Exportable results with BenchHub metadata
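The statistical significance analysis can be pictured as a paired bootstrap over per-question correctness, sketched below with made-up data. The project's actual tests may differ; this README only states that significance testing and confidence intervals are provided.

```python
# Illustrative paired bootstrap: how often does model A outscore model B
# when the same questions are resampled? The data here is fabricated;
# real runs use per-question results from HRET evaluations.
import random

model_a = [1, 1, 0, 1, 1, 0, 1, 1, 1, 0] * 10  # 1 = correct answer
model_b = [1, 0, 0, 1, 0, 0, 1, 1, 0, 0] * 10

def bootstrap_win_rate(a, b, iters=10_000, seed=0):
    rng = random.Random(seed)
    n = len(a)
    wins = 0
    for _ in range(iters):
        idx = [rng.randrange(n) for _ in range(n)]  # paired resample
        if sum(a[i] for i in idx) > sum(b[i] for i in idx):
            wins += 1
    return wins / iters

print(f"P(model A beats model B) ~ {bootstrap_win_rate(model_a, model_b):.3f}")
```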
```
┌─────────────────┐     ┌─────────────────┐     ┌─────────────────┐
│     Reflex      │     │     FastAPI     │     │     Celery      │
│    Frontend     │────►│     Backend     │────►│     Workers     │
└─────────────────┘     └─────────────────┘     └─────────────────┘
         │                       │                       │
         │                       ▼                       ▼
         │              ┌─────────────────┐     ┌─────────────────┐
         │              │   PostgreSQL    │     │      Redis      │
         │              │    Database     │     │      Cache      │
         └──────────────┤                 │     └─────────────────┘
                        └─────────────────┘
                                │
                        ┌─────────────────┐
                        │  HRET Toolkit   │
                        │   Integration   │
                        └─────────────────┘
```
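The FastAPI-to-Celery hop in the diagram works roughly as sketched below. The worker module path `apps.worker.celery_app` appears in the local setup above; the task name, body, and broker default are illustrative assumptions.

```python
# Illustrative sketch of the async flow: FastAPI enqueues work on Redis,
# and a Celery worker executes it. Task name and body are assumptions.
from celery import Celery

celery_app = Celery("benchhub", broker="redis://localhost:6379/0")


@celery_app.task
def run_evaluation(plan: dict) -> dict:
    # A real worker would invoke the HRET toolkit here and persist
    # results to PostgreSQL for the frontend to poll.
    return {"status": "completed", "plan": plan}


# FastAPI side: enqueue and return immediately, e.g.
# task = run_evaluation.delay({"problem_type": "MCQA", "sample_size": 100})
# task.id can then be handed back to the frontend for live ranking updates.
```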
Documentation is available in both English (docs/eng) and Korean (docs/kor).
- Installation Guide (EN) · KO - Complete setup instructions
- Quick Start (EN) · KO - Get running in 5 minutes
- User Manual (EN) · KO - Complete user guide
- Development Guide (EN) · KO - Development setup and guidelines
- Architecture (EN) · KO - System design and architecture
- Docker Deployment (EN) · KO - Container deployment guide
- API Reference (EN) · KO - REST API documentation
- BenchHub Configuration (EN) · KO - BenchHub dataset configuration guide
- HRET Integration (EN) · KO - HRET toolkit integration guide
- Troubleshooting (EN) · KO - Common issues and solutions
- FastAPI: High-performance REST API framework
- SQLAlchemy: Database ORM with PostgreSQL
- Celery: Distributed task queue with Redis
- Pydantic: Data validation and serialization
- Reflex: Modern Python web framework
- Plotly: Interactive data visualizations
- Pandas: Data manipulation and analysis
- OpenAI API: GPT models for plan generation
- HRET Toolkit: Evaluation framework integration
- Custom Metrics: Extensible evaluation metrics (see the sketch after this list)
- Docker: Containerization and deployment
- PostgreSQL: Primary data storage
- Redis: Caching and task queue
- Nginx: Reverse proxy (production)
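The Custom Metrics entry above hints at an extension point; a plausible shape for it is a simple registry like the sketch below. The decorator and function signature are assumptions, not the project's actual API.

```python
# Hypothetical custom-metric registry; not the project's actual API.
from typing import Callable

METRICS: dict[str, Callable[[str, str], float]] = {}


def register_metric(name: str):
    """Register a metric under a name so evaluation plans can refer to it."""
    def wrap(fn: Callable[[str, str], float]):
        METRICS[name] = fn
        return fn
    return wrap


@register_metric("exact_match")
def exact_match(prediction: str, reference: str) -> float:
    # Whitespace-insensitive exact match, scored 0.0 or 1.0.
    return float(prediction.strip() == reference.strip())


print(METRICS["exact_match"]("Seoul", "Seoul "))  # 1.0
```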
```bash
# Required
OPENAI_API_KEY=your_openai_api_key_here
POSTGRES_PASSWORD=secure_password

# Optional
DEBUG=false
LOG_LEVEL=info
DATABASE_URL=postgresql://user:pass@host:5432/db
CELERY_BROKER_URL=redis://host:6379/0
```

Currently supported model providers:
- OpenAI: GPT-3.5, GPT-4, and variants
- Anthropic: Claude models
- Hugging Face: Various open-source models
- Custom: Extensible for any API-compatible model
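A Custom entry would presumably reuse the model-configuration shape from the example earlier in this README; the `model_type` value and local endpoint below are assumptions.

```python
# Assumed "custom" provider entry, mirroring the JSON model configuration
# shown earlier; the model_type value is an assumption.
custom_model = {
    "name": "my-local-llm",
    "api_base": "http://localhost:8080/v1",  # any API-compatible server
    "api_key": "not-needed-locally",
    "model_type": "custom",
}
```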
We welcome contributions! Please see our Contributing Guide for details.
```bash
# Fork and clone the repository
git clone https://github.com/your-username/BenchhubPlus.git
cd BenchhubPlus

# Create development environment
python3.11 -m venv venv
source venv/bin/activate
pip install -e .

# Install development dependencies
pip install pytest black isort flake8 mypy

# Run tests
./scripts/test.sh

# Format code
black apps/
isort apps/
```

- Bug Reports: Use the bug report template
- Feature Requests: Use the feature request template
- Questions: Start a discussion
This project is licensed under the MIT License - see the LICENSE file for details.
- Hanwool Lee - h-albert-lee
- Eunsu Kim - rladmstn1714
- Joonyong Park - JoonYong-Park
- Hyunwoo Oh - hw-oh
- Hyungjin Jeon - gudwls47
- GitHub Issues: Bug reports and feature requests
- GitHub Discussions: Questions and community help
- Documentation: Comprehensive guides and references
Ready to start evaluating? Check out our Quick Start Guide or dive into the User Manual!