
MCP Servers are shaping the future of software. MCPMark is a comprehensive, stress-testing benchmark designed to evaluate model and agent capabilities in real-world MCP use.


MCPMark: Stress-Testing Comprehensive MCP Use

Website · Discord · Docs · Hugging Face

An evaluation suite for agentic models in real MCP tool environments (Notion / GitHub / Filesystem / Postgres / Playwright).

MCPMark provides a reproducible, extensible benchmark for researchers and engineers: one-command tasks, isolated sandboxes, auto-resume for failures, unified metrics, and aggregated reports.


What you can do with MCPMark

  • Evaluate real tool usage across multiple MCP services: Notion, GitHub, Filesystem, Postgres, Playwright.
  • Use ready-to-run tasks covering practical workflows, each with strict automated verification.
  • Reliable and reproducible: isolated environments that do not pollute your accounts/data; failed tasks auto-retry and resume.
  • Unified metrics and aggregation: single/multi-run (pass@k, avg@k, etc.) with automated results aggregation.
  • Flexible deployment: local or Docker; fully validated on macOS and Linux.

Quickstart (5 minutes)

1) Clone the repository

git clone https://github.com/eval-sys/mcpmark.git
cd mcpmark

2) Set environment variables (create .mcp_env at repo root)

Set only what you need; add a service's credentials only when running that service's tasks.

# Example: OpenAI
OPENAI_BASE_URL="https://api.openai.com/v1"
OPENAI_API_KEY="sk-..."

# Optional: Notion (only for Notion tasks)
SOURCE_NOTION_API_KEY="your-source-notion-api-key"
EVAL_NOTION_API_KEY="your-eval-notion-api-key"
EVAL_PARENT_PAGE_TITLE="MCPMark Eval Hub"
PLAYWRIGHT_BROWSER="chromium"   # chromium | firefox
PLAYWRIGHT_HEADLESS="True"

# Optional: GitHub (only for GitHub tasks)
GITHUB_TOKENS="token1,token2"   # token pooling for rate limits
GITHUB_EVAL_ORG="your-eval-org"

# Optional: Postgres (only for Postgres tasks)
POSTGRES_HOST="localhost"
POSTGRES_PORT="5432"
POSTGRES_USERNAME="postgres"
POSTGRES_PASSWORD="password"

See docs/introduction.md and the service guides below for more details.

3) Install and run a minimal example

Local (Recommended)

pip install -e .
# If you'll use browser-based tasks, install Playwright browsers first
playwright install

Docker

./build-docker.sh

Run a filesystem task (no external accounts required):

# --k 1 runs each task once for a quick start; --models can be any model you have configured
python -m pipeline \
  --mcp filesystem \
  --k 1 \
  --models gpt-5 \
  --tasks file_property/size_classification

Results are saved to ./results/{exp_name}/{model}__{mcp}/run-*/... (e.g., ./results/test-run/gpt-5__filesystem/run-1/...).
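To inspect outputs programmatically, a minimal sketch like the one below lists whatever JSON/CSV artifacts a run produced. It assumes only the path pattern above; the file names inside each run directory are whatever MCPMark writes.

# List result artifacts for one model/service (path pattern taken from above).
from pathlib import Path

run_root = Path("results/test-run/gpt-5__filesystem")
for path in sorted(run_root.glob("run-*/**/*")):
    if path.suffix in {".json", ".csv"}:
        print(path)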


Run your evaluations

Single run (k=1)

# Run ALL tasks for a service
python -m pipeline --exp-name exp --mcp notion --tasks all --models MODEL --k 1

# Run a task group
python -m pipeline --exp-name exp --mcp notion --tasks online_resume --models MODEL --k 1

# Run a specific task
python -m pipeline --exp-name exp --mcp notion --tasks online_resume/daily_itinerary_overview --models MODEL --k 1

# Evaluate multiple models
python -m pipeline --exp-name exp --mcp notion --tasks all --models MODEL1,MODEL2,MODEL3 --k 1

Multiple runs (k>1) for pass@k

# Run k=4 to compute stability metrics (requires --exp-name to aggregate final results)
python -m pipeline --exp-name exp --mcp notion --tasks all --models MODEL --k 4

# Aggregate results (pass@1 / pass@k / pass^k / avg@k)
python -m src.aggregators.aggregate_results --exp-name exp

Run with Docker

# Run all tasks for a service
./run-task.sh --mcp notion --models MODEL --exp-name exp --tasks all

# Cross-service benchmark
./run-benchmark.sh --models MODEL --exp-name exp --docker

See docs/introduction.md for the available choices of MODEL.

Tip: MCPMark supports auto-resume. When you re-run a command, only unfinished tasks execute, and tasks that previously failed due to pipeline errors (e.g., State Duplication Error, MCP Network Error) are retried automatically.


Service setup and authentication

  • Notion: environment isolation (Source Hub / Eval Hub), integration creation and grants, browser login verification.

    • Guide and Setup: docs/mcp/notion.md
  • GitHub: multi-account token pooling recommended; import pre-exported repo state if needed.

    • Guide and Setup: docs/mcp/github.md
  • Postgres: start via Docker and import sample databases.

    • Env setup: docs/mcp/postgres.md
  • Playwright: install browsers before first run; defaults to chromium.

    • Env setup: docs/mcp/playwright.md
  • Filesystem: zero-configuration, run directly.

    • Configuration: docs/mcp/filesystem.md

You can also follow docs/quickstart.md for the shortest end-to-end path.


Results and metrics

  • Results are organized under ./results/{exp_name}/{model}__{mcp}/run-*/ (JSON + CSV per task).
  • Generate a summary with:
# Basic usage
python -m src.aggregators.aggregate_results --exp-name exp

# For k-run experiments with single-run models
python -m src.aggregators.aggregate_results --exp-name exp --k 4 --single-run-models claude-opus-4-1
  • Only models with complete results across all tasks and runs are included in the final summary.
  • The summary includes multi-run metrics (pass@k, pass^k) for stability comparisons when k > 1; illustrative definitions are sketched below.
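For reference, the multi-run metrics above are commonly defined as in the sketch below, computed from per-task pass/fail outcomes across k runs. The task names and outcomes here are hypothetical, and src.aggregators.aggregate_results remains the source of truth for how MCPMark computes them.

# Illustrative metric definitions over hypothetical per-task outcomes for k runs.
from statistics import mean

runs = {  # runs[task][i] is True if the task passed in run i (example data)
    "online_resume/daily_itinerary_overview": [True, False, True, True],
    "file_property/size_classification": [True, True, True, True],
}

k = 4
pass_at_k = mean(any(r[:k]) for r in runs.values())   # passed in at least one of k runs
pass_pow_k = mean(all(r[:k]) for r in runs.values())  # passed in every one of k runs
avg_at_k = mean(mean(r[:k]) for r in runs.values())   # mean per-run pass rate
print(f"pass@{k}={pass_at_k:.2f}  pass^{k}={pass_pow_k:.2f}  avg@{k}={avg_at_k:.2f}")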

Models and Tasks

  • Model support: MCPMark calls models via LiteLLM (see the LiteLLM docs). For Anthropic (Claude) extended thinking mode (enabled via --reasoning-effort), it uses Anthropic's native API. An illustrative standalone LiteLLM call is sketched after this list.
  • See docs/introduction.md for details and configuration of the models supported in MCPMark.
  • To add a new model, edit src/model_config.py; check the LiteLLM docs first to confirm the model and provider are supported.
  • Task design principles are described in docs/datasets/task.md. Each task ships with an automated verify.py for objective, reproducible evaluation (see docs/task.md for details).
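Because model access is routed through LiteLLM, any identifier LiteLLM supports should work once the matching provider credentials are set in .mcp_env. As a rough standalone illustration (not MCPMark's internal code):

# Direct LiteLLM call for illustration only; assumes provider credentials
# (e.g., OPENAI_API_KEY) are available in the environment.
import litellm

response = litellm.completion(
    model="gpt-4o",  # any model identifier LiteLLM supports
    messages=[{"role": "user", "content": "Say hello."}],
)
print(response.choices[0].message.content)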

Contributing

Contributions are welcome:

  1. Add a new task under tasks/<category_id>/<task_id>/ with meta.json, description.md, and verify.py (see the layout sketch after this list).
  2. Ensure local checks pass and open a PR.
  3. See docs/contributing/make-contribution.md.
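As an illustration, a new task directory follows the layout below (names are placeholders and the per-file comments are rough descriptions; see docs/datasets/task.md for the actual requirements):

tasks/
  <category_id>/
    <task_id>/
      meta.json        # task metadata
      description.md   # natural-language task instructions
      verify.py        # automated verification script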

Citation

If you find our work useful for your research, please consider citing:

@misc{mcpmark_2025,
  title        = {MCPMark: Stress-Testing Comprehensive MCP Use},
  author       = {The MCPMark Team},
  howpublished = {\url{https://github.com/eval-sys/mcpmark}},
  year         = {2025}
}

License

This project is licensed under the Apache License 2.0 — see LICENSE.
