BenchHub Plus

Note: active development currently happens on the develop branch.

License: MIT · Python 3.11+ · Docker

An interactive leaderboard system for dynamic LLM evaluation. It converts natural-language queries into customized model rankings, built on a FastAPI backend, a modern Reflex frontend, Celery workers, and HRET integration.


🌟 Features

πŸ€– AI-Powered Evaluation Planning

  • Natural Language Interface: Describe your evaluation needs in plain English
  • Intelligent Plan Generation: AI converts queries to structured evaluation plans
  • Context-Aware Processing: Understands domain-specific requirements

πŸ† Dynamic Leaderboards

  • Real-time Rankings: Live updates as evaluations complete
  • Multi-dimensional Comparison: Compare models across various metrics
  • Interactive Visualizations: Rich charts and analytics

πŸ”„ Scalable Architecture

  • Async Processing: Background task processing with Celery
  • Distributed Workers: Scale evaluation capacity horizontally
  • Caching System: Fast results with Redis caching

πŸ“Š Comprehensive Analytics

  • Multiple Metrics: Accuracy, F1-score, semantic similarity, and more
  • Statistical Analysis: Significance testing and confidence intervals
  • Category Breakdown: Performance analysis by subject and difficulty

🐳 Production Ready

  • Docker Deployment: Complete containerized setup
  • Health Monitoring: Built-in health checks and monitoring
  • Scalable Design: Ready for production workloads

✨ What's New: Reflex Frontend

We've migrated fully from Streamlit to Reflex for a modern, production-ready web experience!

πŸ†• Reflex Benefits

  • Modern UI/UX: Clean, responsive design with Tailwind CSS
  • Better Performance: Optimized rendering and state management
  • Production Ready: Built for scalability and deployment
  • Component Architecture: Maintainable and extensible codebase

πŸ”„ Migration Status

  • βœ… Reflex Frontend: Default experience shipped in this repository
  • πŸ—‘οΈ Streamlit Frontend: Removed in favor of Reflex

πŸš€ Quick Start

Recommended: Docker-based setup

  1. Install prerequisites

    • Git
    • Docker and Docker Compose
    • An OpenAI API key (or another supported model provider key)
  2. Clone the repository and create your environment file

    git clone https://github.com/HAE-RAE/BenchhubPlus.git
    cd BenchhubPlus
    cp .env.example .env
  3. Fill in the required variables

    • OPENAI_API_KEY: your model provider key
    • POSTGRES_PASSWORD: choose a strong password for the bundled PostgreSQL database
    • Adjust any other values (ports, planner model, etc.) if needed
  4. Launch the full stack

    ./scripts/deploy.sh development   # local dev with live reload mounts
    # or for a slimmer, prod-like run:
    ./scripts/deploy.sh

    The helper script builds the images, starts the services, and waits for them to report healthy.

  5. Open the application

  6. Verify everything is running

    curl http://localhost:8001/api/v1/health
  7. Shut down when finished

    docker-compose -f docker-compose.dev.yml down

Alternative: Local Python environment

For contributors who prefer not to use Docker:

./scripts/setup.sh         # creates a Python 3.11 virtualenv and installs dependencies
source venv/bin/activate
python -m uvicorn apps.backend.main:app --host 0.0.0.0 --port 8000 --reload

# Reflex frontend
cd apps/reflex_frontend
API_BASE_URL=http://localhost:8000 reflex run --env dev --backend-host 0.0.0.0 --backend-port 8001 --frontend-host 0.0.0.0 --frontend-port 3000

# Background worker
celery -A apps.worker.celery_app worker --loglevel=info

You will also need local PostgreSQL and Redis instances that match the connection settings in .env.

πŸ“– Need more detail? The Quick Start Guide walks through the process with screenshots and tips, and the User Manual explains how to operate the app end to end. Setup guides are also available (English | ν•œκ΅­μ–΄).

🎯 Usage Example

1. Natural Language Query

"Compare GPT-4 and Claude-3 on Korean technology multiple choice questions"

2. Generated BenchHub Configuration

{
  "problem_type": "MCQA",
  "target_type": "General",
  "subject_type": ["Tech.", "Tech./Coding"],
  "task_type": "Knowledge",
  "external_tool_usage": false,
  "language": "Korean",
  "sample_size": 100
}
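For illustration, a generated plan like the one above can be held as a plain Python dict and sanity-checked before submission. The field names mirror the example configuration; the `validate_plan` helper is a hypothetical sketch, not part of the BenchHub Plus API:

```python
# Hypothetical sanity check for a generated evaluation plan.
# Field names mirror the example configuration above.
REQUIRED_FIELDS = {"problem_type", "target_type", "subject_type",
                   "task_type", "external_tool_usage", "language", "sample_size"}

def validate_plan(plan: dict) -> list[str]:
    """Return a list of problems found in the plan (empty list = OK)."""
    errors = [f"missing field: {f}" for f in sorted(REQUIRED_FIELDS - plan.keys())]
    size = plan.get("sample_size", 0)
    if not isinstance(size, int) or size <= 0:
        errors.append("sample_size must be a positive integer")
    return errors

plan = {
    "problem_type": "MCQA",
    "target_type": "General",
    "subject_type": ["Tech.", "Tech./Coding"],
    "task_type": "Knowledge",
    "external_tool_usage": False,
    "language": "Korean",
    "sample_size": 100,
}
print(validate_plan(plan))  # -> []
```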

3. Model Configuration

{
  "models": [
    {
      "name": "gpt-4",
      "api_base": "https://api.openai.com/v1",
      "api_key": "sk-...",
      "model_type": "openai"
    },
    {
      "name": "claude-3",
      "api_base": "https://api.anthropic.com",
      "api_key": "sk-ant-...",
      "model_type": "anthropic"
    }
  ]
}

4. Results

  • Interactive leaderboard with rankings
  • Detailed performance metrics by BenchHub categories
  • Statistical significance analysis
  • Exportable results with BenchHub metadata

πŸ—οΈ Architecture

β”Œβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”    β”Œβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”    β”Œβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”
β”‚     Reflex      β”‚    β”‚   FastAPI       β”‚    β”‚   Celery        β”‚
β”‚   Frontend      │◄──►│   Backend       │◄──►│   Workers       β”‚
β””β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”˜    β””β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”˜    β””β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”˜
         β”‚                       β”‚                       β”‚
         β”‚                       β–Ό                       β–Ό
         β”‚              β”Œβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”    β”Œβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”
         β”‚              β”‚   PostgreSQL    β”‚    β”‚     Redis       β”‚
         β”‚              β”‚   Database      β”‚    β”‚     Cache       β”‚
         └───────────────                 β”‚    β””β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”˜
                        β””β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”˜
                                 β”‚
                        β”Œβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”
                        β”‚   HRET Toolkit  β”‚
                        β”‚   Integration   β”‚
                        β””β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”˜

πŸ“š Documentation

Documentation is available in both English (docs/eng) and 한국어 (docs/kor), organized into Getting Started, Development, and Reference sections.

πŸ› οΈ Technology Stack

Backend

  • FastAPI: High-performance REST API framework
  • SQLAlchemy: Database ORM with PostgreSQL
  • Celery: Distributed task queue with Redis
  • Pydantic: Data validation and serialization

Frontend

  • Reflex: Modern Python web framework
  • Plotly: Interactive data visualizations
  • Pandas: Data manipulation and analysis

AI/ML

  • OpenAI API: GPT models for plan generation
  • HRET Toolkit: Evaluation framework integration
  • Custom Metrics: Extensible evaluation metrics

Infrastructure

  • Docker: Containerization and deployment
  • PostgreSQL: Primary data storage
  • Redis: Caching and task queue
  • Nginx: Reverse proxy (production)

πŸ”§ Configuration

Environment Variables

# Required
OPENAI_API_KEY=your_openai_api_key_here
POSTGRES_PASSWORD=secure_password

# Optional
DEBUG=false
LOG_LEVEL=info
DATABASE_URL=postgresql://user:pass@host:5432/db
CELERY_BROKER_URL=redis://host:6379/0
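As a minimal sketch of how these settings might be read at startup, the snippet below pulls the same variable names from the environment with illustrative defaults (the defaults and the plain `os.environ` approach are assumptions; the project may use a different settings mechanism):

```python
import os

# Variable names come from the .env example above; defaults are illustrative only.
OPENAI_API_KEY = os.environ.get("OPENAI_API_KEY", "")
DEBUG = os.environ.get("DEBUG", "false").lower() == "true"
LOG_LEVEL = os.environ.get("LOG_LEVEL", "info")
DATABASE_URL = os.environ.get(
    "DATABASE_URL", "postgresql://user:pass@localhost:5432/db"
)
CELERY_BROKER_URL = os.environ.get("CELERY_BROKER_URL", "redis://localhost:6379/0")
```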

Model Support

Currently supported model providers:

  • OpenAI: GPT-3.5, GPT-4, and variants
  • Anthropic: Claude models
  • Hugging Face: Various open-source models
  • Custom: Extensible for any API-compatible model

🀝 Contributing

We welcome contributions! Please see our Contributing Guide for details.

Development Setup

# Fork and clone the repository
git clone https://github.com/your-username/BenchhubPlus.git
cd BenchhubPlus

# Create development environment
python3.11 -m venv venv
source venv/bin/activate
pip install -e .

# Install development dependencies
pip install pytest black isort flake8 mypy

# Run tests
./scripts/test.sh

# Format code
black apps/
isort apps/

Reporting Issues

  • Bug Reports: Use the bug report template
  • Feature Requests: Use the feature request template
  • Questions: Start a discussion

πŸ“„ License

This project is licensed under the MIT License - see the LICENSE file for details.

πŸ™ Contributors

πŸ“ž Support

Community Support

  • GitHub Issues: Bug reports and feature requests
  • GitHub Discussions: Questions and community help
  • Documentation: Comprehensive guides and references

πŸš€ Ready to start evaluating? Check out our Quick Start Guide or dive into the User Manual!


Built with ❀️ for the AI evaluation community
