An evaluation suite for agentic models in real MCP tool environments (Notion / GitHub / Filesystem / Postgres / Playwright).
MCPMark provides a reproducible, extensible benchmark for researchers and engineers: one-command tasks, isolated sandboxes, auto-resume for failures, unified metrics, and aggregated reports.
- Evaluate real tool usage across multiple MCP services: Notion, GitHub, Filesystem, Postgres, Playwright.
- Use ready-to-run tasks covering practical workflows, each with strict automated verification.
- Reliable and reproducible: isolated environments that do not pollute your accounts/data; failed tasks auto-retry and resume.
- Unified metrics and aggregation: single/multi-run (pass@k, avg@k, etc.) with automated results aggregation.
- Flexible deployment: local or Docker; fully validated on macOS and Linux.
git clone https://github.com/eval-sys/mcpmark.git
cd mcpmark
Only set what you need. Add service credentials when running tasks for that service.
# Example: OpenAI
OPENAI_BASE_URL="https://api.openai.com/v1"
OPENAI_API_KEY="sk-..."
# Optional: Notion (only for Notion tasks)
SOURCE_NOTION_API_KEY="your-source-notion-api-key"
EVAL_NOTION_API_KEY="your-eval-notion-api-key"
EVAL_PARENT_PAGE_TITLE="MCPMark Eval Hub"
PLAYWRIGHT_BROWSER="chromium" # chromium | firefox
PLAYWRIGHT_HEADLESS="True"
# Optional: GitHub (only for GitHub tasks)
GITHUB_TOKENS="token1,token2" # token pooling for rate limits
GITHUB_EVAL_ORG="your-eval-org"
# Optional: Postgres (only for Postgres tasks)
POSTGRES_HOST="localhost"
POSTGRES_PORT="5432"
POSTGRES_USERNAME="postgres"
POSTGRES_PASSWORD="password"
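To quickly confirm the Postgres credentials above before running Postgres tasks, an optional check (assuming the psql client is installed; the values match the example above):
# should print a single row containing 1 if the server is reachable
PGPASSWORD="password" psql -h localhost -p 5432 -U postgres -c 'SELECT 1;'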
See docs/introduction.md and the service guides below for more details.
Local (Recommended)
pip install -e .
# If you'll use browser-based tasks, install Playwright browsers first
playwright install
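If you only plan to use the default browser (chromium, per the env example above), you can optionally limit the download to it:
playwright install chromium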
Docker
./build-docker.sh
Run a filesystem task (no external accounts required):
# --k 1 runs each task once for a quick start; --models takes any model you configured
python -m pipeline \
  --mcp filesystem \
  --k 1 \
  --models gpt-5 \
  --tasks file_property/size_classification
Results are saved to ./results/{exp_name}/{model}__{mcp}/run-*/... (e.g., ./results/test-run/gpt-5__filesystem/run-1/...).
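To inspect what a run produced, list the run directory (the path assumes the example layout above):
ls ./results/test-run/gpt-5__filesystem/run-1/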
# Run ALL tasks for a service
python -m pipeline --exp-name exp --mcp notion --tasks all --models MODEL --k 1
# Run a task group
python -m pipeline --exp-name exp --mcp notion --tasks online_resume --models MODEL --k 1
# Run a specific task
python -m pipeline --exp-name exp --mcp notion --tasks online_resume/daily_itinerary_overview --models MODEL --k 1
# Evaluate multiple models
python -m pipeline --exp-name exp --mcp notion --tasks all --models MODEL1,MODEL2,MODEL3 --k 1
# Run k=4 to compute stability metrics (requires --exp-name to aggregate final results)
python -m pipeline --exp-name exp --mcp notion --tasks all --models MODEL --k 4
# Aggregate results (pass@1 / pass@k / pass^k / avg@k)
python -m src.aggregators.aggregate_results --exp-name exp
# Run all tasks for a service
./run-task.sh --mcp notion --models MODEL --exp-name exp --tasks all
# Cross-service benchmark
./run-benchmark.sh --models MODEL --exp-name exp --docker
Please visit docs/introduction.md for the available choices of MODEL.
Tip: MCPMark supports auto-resume. When re-running commands, only unfinished tasks will execute. Tasks previously failed due to pipeline errors (e.g., State Duplication Error, MCP Network Error) will be retried automatically.
- Notion: environment isolation (Source Hub / Eval Hub), integration creation and grants, browser login verification.
  - Guide and Setup: docs/mcp/notion.md
- GitHub: multi-account token pooling recommended; import pre-exported repo state if needed.
  - Guide and Setup: docs/mcp/github.md
- Postgres: start via Docker and import sample databases (see the sketch after this list).
  - Env setup: docs/mcp/postgres.md
- Playwright: install browsers before first run; defaults to chromium.
  - Env setup: docs/mcp/playwright.md
- Filesystem: zero configuration; run directly.
  - Configuration: docs/mcp/filesystem.md
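For Postgres, a minimal local instance matching the example credentials above can be started with Docker. This is an illustrative sketch (the container name and image tag are arbitrary); docs/mcp/postgres.md remains the authoritative setup, including sample-database import:
# start a throwaway Postgres server on localhost:5432 with the example password
docker run -d --name mcpmark-postgres \
  -e POSTGRES_PASSWORD=password \
  -p 5432:5432 \
  postgres:16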
You can also follow docs/quickstart.md for the shortest end-to-end path.
- Results are organized under ./results/{exp_name}/{model}__{mcp}/run-*/ (JSON + CSV per task).
- Generate a summary with:
# Basic usage
python -m src.aggregators.aggregate_results --exp-name exp
# For k-run experiments with single-run models
python -m src.aggregators.aggregate_results --exp-name exp --k 4 --single-run-models claude-opus-4-1
- Only models with complete results across all tasks and runs are included in the final summary.
- Includes multi-run metrics (pass@k, pass^k) for stability comparisons when k > 1.
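For reference, assuming the usual per-task definitions (the authoritative aggregation logic lives in src/aggregators/aggregate_results.py):
- pass@k: fraction of tasks solved in at least one of the k runs
- pass^k: fraction of tasks solved in all k runs
- avg@k: mean per-task success rate across the k runs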
- Model support: MCPMark calls models via LiteLLM; see the LiteLLM Doc. For Anthropic (Claude) extended thinking mode (enabled via --reasoning-effort), we use Anthropic's native API.
- See docs/introduction.md for details and configuration of the models supported in MCPMark.
- To add a new model, edit src/model_config.py. Before adding, check the models/providers supported by LiteLLM (see the LiteLLM Doc).
- Task design principles are in docs/datasets/task.md. Each task ships with an automated verify.py for objective, reproducible evaluation; see docs/task.md for details.
Contributions are welcome:
- Add a new task under tasks/<category_id>/<task_id>/ with meta.json, description.md, and verify.py (see the sketch below).
- Ensure local checks pass and open a PR.
- See docs/contributing/make-contribution.md.
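As an illustrative scaffold only (the category and task names below are placeholders; docs/contributing/make-contribution.md defines the required contents):
# create the task directory and the three required files
mkdir -p tasks/my_category/my_task
touch tasks/my_category/my_task/{meta.json,description.md,verify.py}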
If you find our work useful for your research, please consider citing:
@misc{mcpmark_2025,
title = {MCPMark: Stress-Testing Comprehensive MCP Use},
author = {The MCPMark Team},
howpublished = {\url{https://github.com/eval-sys/mcpmark}},
year = {2025}
}
This project is licensed under the Apache License 2.0. See LICENSE for details.