- [18 Sep 2025] macOSWorld accepted to NeurIPS 2025
- [15 Sep 2025] Streamlined automated benchmark execution; added a GUI that displays benchmark progress and results in real time
- Interactive macOS environment: 30 macOS-native apps with user interfaces exclusive to macOS
- Multilingual benchmarking: Tasks and environments available in English, Chinese, Arabic, Japanese, and Russian
- Safety evaluation: Dedicated subset for benchmarking agents' resilience under context deception attacks
macOSWorld consists of a local Python testbench script and cloud-hosted AWS macOS instances. The benchmark process involves four main steps:
- Step 1: Local Environment Configuration
- Step 2: AWS Environment Configuration
- Step 3: Running the Benchmark
- Step 4: Releasing AWS Resources
```bash
conda create -n macosworld python=3.9.21
conda activate macosworld
pip install vncdotool==1.2.0 boto3==1.36.20 sshtunnel httpx[socks] openai anthropic google-auth google_cloud_aiplatform jupyter
```

For GPT-4o with SoM Annotations:
- Install additional dependencies:
```bash
conda activate macosworld
pip install timm easyocr paddlepaddle paddleocr einops==0.8.0 supervision==0.18.0 ultralytics==8.3.70
```

- Download model weights:
```bash
mkdir OmniParser
cd OmniParser
git clone https://huggingface.co/microsoft/OmniParser-v2.0
mv OmniParser-v2.0 weights
mv weights/icon_caption weights/icon_caption_florence
```
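As a quick sanity check, you can confirm the weights landed where expected. This is an illustrative sketch: `icon_caption_florence` comes from the rename above, while the `icon_detect` folder name is an assumption based on the OmniParser-v2.0 repository layout.

```python
# Hypothetical layout check; run from inside the OmniParser directory.
import os

for subdir in ("weights/icon_detect", "weights/icon_caption_florence"):
    status = "ok" if os.path.isdir(subdir) else "MISSING"
    print(f"{subdir}: {status}")
```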
For Gemini Models:

macOSWorld uses the VertexAI API. Follow this guide to set up credentials and obtain a JSON credential file. Set the environment variable:

```bash
export GOOGLE_APPLICATION_CREDENTIALS='/path/to/gen-lang-client-xxx.json'
```
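To confirm the credential file resolves before launching the benchmark, a quick check with the `google-auth` package (installed with the Step 1 dependencies) might look like this:

```python
# Sanity check: load Application Default Credentials from the JSON file
# pointed to by GOOGLE_APPLICATION_CREDENTIALS.
import google.auth

credentials, project_id = google.auth.default()
print(f"Loaded Vertex AI credentials for project: {project_id}")
```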
```bash
conda activate macosworld
pip install torch==2.6.0 qwen-vl-utils transformers==4.47.0 accelerate==0.26.0
```

For UI-TARS:
- Install dependencies:
```bash
conda create -n uitars python=3.9.21
conda activate uitars
pip install -U transformers==4.49.0
pip install vllm==0.6.6 --extra-index-url https://download.pytorch.org/whl/cu124
```

- Start the vLLM service on the same device running the testbench:
```bash
NCCL_P2P_DISABLE=1 CUDA_VISIBLE_DEVICES=0,1,2,3 vllm serve "bytedance-research/UI-TARS-7B-DPO" --served-model-name UI-TARS-7B-DPO --limit-mm-per-prompt image=3 -tp 4
```

Note: the `tp` parameter specifies the number of GPUs to use and must be a common divisor of 28 and 16.
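Once the server is up, you can smoke-test it through vLLM's OpenAI-compatible endpoint (by default at http://localhost:8000/v1; the model name matches `--served-model-name` above):

```python
# Minimal smoke test against the local vLLM server; not part of the
# testbench itself, just a way to confirm the model is being served.
from openai import OpenAI

client = OpenAI(base_url="http://localhost:8000/v1", api_key="EMPTY")
response = client.chat.completions.create(
    model="UI-TARS-7B-DPO",
    messages=[{"role": "user", "content": "Reply with OK if you can read this."}],
)
print(response.choices[0].message.content)
```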
macOSWorld requires an AWS-hosted cloud instance. Follow the detailed setup instructions in our AWS Configuration Guide.
Configure your environment variables and run the testbench. Replace the 🧩 placeholders with your actual values:
```bash
# AWS Configuration
export AWS_ACCESS_KEY_ID=🧩'AKIAIOSFODNN7EXAMPLE'
export AWS_SECRET_ACCESS_KEY=🧩'wJalrXUtnFEMI/K7MDENG/bPxRfiCYEXAMPLEKEY'
export AWS_DEFAULT_REGION='ap-southeast-1'

# API Keys (configure as needed)
export OPENAI_API_KEY=🧩'sk-proj-...'                                       # For OpenAI models
export ANTHROPIC_API_KEY=🧩'sk-ant-...'                                     # For Anthropic models
export GOOGLE_APPLICATION_CREDENTIALS=🧩'/path/to/gen-lang-client-xxx.json' # For Gemini models

# Set credential permission
chmod 400 credential.pem

# Run the benchmark
python run.py \
    --instance_id 🧩i-0d5f51a1d2bc1edb0 \
    --ssh_host 🧩ec2-13-250-104-211.ap-southeast-1.compute.amazonaws.com \
    --ssh_pkey 🧩my_credential.pem \
    --gui_agent_name 🧩gpt-4o-2024-08-06 \
    --paths_to_eval_tasks 🧩./tasks/sys_apps ./tasks/sys_and_interface ./tasks/productivity ./tasks/media ./tasks/file_management ./tasks/advanced ./tasks/multi_apps \
    --languages 🧩task_en_env_en task_zh_env_zh task_ar_env_ar task_ja_env_ja task_ru_env_ru \
    --base_save_dir ./results/gpt_4o \
    --max-steps 15 \
    --snapshot_recovery_timeout_seconds 1200 \
    --task_step_timeout 120
```

| Parameter | Description |
|---|---|
| `instance_id` | EC2 instance ID |
| `ssh_host` | DNS name or IP address of the EC2 instance |
| `ssh_pkey` | Local file path to your SSH credential `.pem` file |
| `gui_agent_name` | GUI agent identifier (see supported models below) |
| `paths_to_eval_tasks` | Directory paths containing evaluation task JSON files |
| `languages` | Language pairs in the format `task_<lang>_env_<lang>` |
| `base_save_dir` | Local directory for storing evaluation results |
| `max_steps` | Maximum dialogue turns per task |
| `snapshot_recovery_timeout_seconds` | Timeout for snapshot recovery (usually doesn't need adjustment) |
Supported GUI Agents

- OpenAI GPT Series: `gpt-4o`, `gpt-4o-2024-08-06`
  - With SoM: `gpt-4o/omniparser`, `gpt-4o-2024-08-06/omniparser`
  - Computer Use: `openai/computer-use-preview`
- Google Gemini Series: `gemini-1.5-pro-002`, `gemini-2.5-pro-preview-03-25`
- Anthropic Claude: `claude-3-7-sonnet-20250219/computer-use-2025-01-24`
- Open Source Models: `UI-TARS-7B-DPO`, `showlab/ShowUI-2B`

Supported Language Codes: English (`en`), Chinese (`zh`), Arabic (`ar`), Japanese (`ja`), Russian (`ru`)
The Safety Subset: Run the safety subset separately, since it is only provided in English:
```bash
python run.py \
    --instance_id 🧩i-0d5f51a1d2bc1edb0 \
    --ssh_host 🧩ec2-13-250-104-211.ap-southeast-1.compute.amazonaws.com \
    --ssh_pkey 🧩my_credential.pem \
    --gui_agent_name 🧩gpt-4o-2024-08-06 \
    --paths_to_eval_tasks 🧩./tasks/safety \
    --languages 🧩task_en_env_en \
    --base_save_dir ./results/gpt_4o \
    --max-steps 15 \
    --snapshot_recovery_timeout_seconds 1200 \
    --task_step_timeout 120
```

For debugging purposes, you can run only the testbench:
```bash
python testbench.py \
    --instance_id 🧩i-0d5f51a1d2bc1edb0 \
    --ssh_host 🧩ec2-13-250-104-211.ap-southeast-1.compute.amazonaws.com \
    --ssh_pkey 🧩my_credential.pem \
    --gui_agent_name 🧩gpt-4o-2024-08-06 \
    --paths_to_eval_tasks 🧩./tasks/sys_apps ./tasks/sys_and_interface ./tasks/productivity ./tasks/media ./tasks/file_management ./tasks/advanced ./tasks/multi_apps \
    --languages 🧩task_en_env_en task_zh_env_zh task_ar_env_ar task_ja_env_ja task_ru_env_ru \
    --base_save_dir ./results/gpt_4o \
    --max-steps 15 \
    --snapshot_recovery_timeout_seconds 1200 \
    --task_step_timeout 120
```

If the testbench is interrupted, the `base_save_dir` must be cleaned up before resuming. This cleanup is already integrated into `run.py`, but you can also perform it manually:
```bash
python cleanup.py --base_save_dir /path/to/base_save_dir
```

Clean up the `base_save_dir` before rerunning the testbench; previously completed tasks will not be deleted or re-executed.
Use the provided Jupyter notebook to view benchmark progress and results through a hierarchical GUI menu:

```
scripts/display_progress.ipynb
```
After completing the benchmark, follow our AWS Cleanup Guide to terminate cloud resources and avoid unnecessary charges.
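The AWS Cleanup Guide is the authoritative reference; as a minimal illustration, terminating the instance with `boto3` (installed in Step 1) might look like this:

```python
# Illustrative only; follow the AWS Cleanup Guide for the full procedure,
# since other resources (e.g. snapshots, dedicated hosts) can also incur charges.
import boto3

ec2 = boto3.client("ec2", region_name="ap-southeast-1")
ec2.terminate_instances(InstanceIds=["i-0d5f51a1d2bc1edb0"])  # your instance ID
```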
To benchmark other agents, follow these two steps (a minimal sketch follows the list):

- Create `agent/your_custom_agent.py`, either by modifying an existing agent or by starting from `agent/template_for_custom_agent.py`.
- Register your agent in `agent/get_gui_agent.py`.
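As a rough starting point, a custom agent might look like the sketch below. The class name and `predict` signature are illustrative assumptions, not the repository's actual interface; mirror `agent/template_for_custom_agent.py` for the real signatures.

```python
# agent/your_custom_agent.py (hypothetical sketch)
class YourCustomAgent:
    def __init__(self, model_name: str):
        self.model_name = model_name

    def predict(self, instruction: str, screenshot: bytes) -> str:
        """Given the task instruction and a screenshot of the current
        screen, return the next GUI action for the testbench to execute."""
        raise NotImplementedError("call your model here")
```

After that, map a new `gui_agent_name` string to this class in `agent/get_gui_agent.py` so `run.py` can instantiate it.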
Each task evaluation takes approximately 15-20 minutes, with snapshot recovery being the primary bottleneck. Here are a few approaches to reduce recovery time:

- Bypass Snapshot Recovery: add the `--override_env_reset` parameter to `testbench.py` when evaluating similar tasks
- Manual Setup: launch and connect to the environment via VNC before starting the benchmark
- Manual Recovery: when the testbench displays `(pdb)`, manually restore the environment to its original state, then type `c` to continue
Consider using community VMware-based implementations for a faster and cheaper benchmarking experience.
- yangpei-comp/macosworld_vmware: faster and cheaper; local VMware deployment
The authors thank Kevin Qinghong Lin, Zhiqiang Chen, Noorbakht Khan, Brandon Ng, Mingyu Ouyang, Siyuan Hu, Xiangwu Guo, Henry Hengyuan Zhao, Difei Gao, Christopher Rawles, and Kun Shao for their valuable discussions and feedback.
```bibtex
@article{macosworld,
  title={macOSWorld: A Multilingual Interactive Benchmark for GUI Agents},
  author={Pei Yang and Hai Ci and Mike Zheng Shou},
  year={2025},
  eprint={2506.04135},
  archivePrefix={arXiv},
  primaryClass={cs.AI},
  url={https://arxiv.org/abs/2506.04135},
}
```

