Skip to content

Towards Large Multimodal Models as Visual Foundation Agents

License

Notifications You must be signed in to change notification settings

THUDM/VisualAgentBench

Repository files navigation

VisualAgentBench (VAB)

🌐 Website | πŸ“ƒ Paper | πŸ—‚οΈ VAB Training (Under Construction)

VisualAgentBench: Towards Large Multimodal Models as Visual Foundation Agents

VisualAgentBench (VAB) is the first benchmark designed to systematically evaluate and develop large multi models (LMMs) as visual foundation agents, which comprises 5 distinct environments across 3 types of representative visual agent tasks (Embodied, GUI, and Visual Design)

VAB.mp4
  • VAB-OmniGibson (Embodied)
  • VAB-Minecraft (Embodied)
  • VAB-Mobile (GUI)
  • VAB-WebArena-Lite (GUI, based on WebArena and VisualWebArena)
  • VAB-CSS (Visual Design)

Compared to its predecessor AgentBench, VAB highlights visual inputs and the enabling of Foundation Agent capability development with training open LLMs/LMMs on trajectories.

Table of Contents

Quick Start

This section will first give you an overview to the use and architecture of VAB. Next, it will guide you on how to use gpt-4o-2024-05-13 as an exemplar agent to launch 4 concurrent VAB-Minecraft tasks.

Overview on VAB Framework

To allow fast evaluation over agent tasks, we leverage AgentBench's framework as the backbone (currently for VAB-OmniGibson, VAB-Minecraft, and VAB-CSS). If you are interested in its detailed implementation, please refer to AgentBench's Framework Introduction (which may not be necessary). Basically, the framework calls all LLM/LMM in API formats via Agent-Controller, and accesses to environments via Task-Controller. The Assigner will automatically assign evaluation tasks by pairing Agent-Controller and Task-Controller to optimize the overall evaluation speed. For more detailed configuration and launch methods, please check Configuration Guide and Program Entrance Guide.

Step 1. Prerequisites for All Environments

Clone this repo and install the dependencies.

cd VisualAgentBench
conda create -n vab python=3.9
conda activate vab
pip install -r requirements.txt

Ensure that Docker is properly installed.

docker ps

For specific environments, please refer to their additional prerequisites respectively. For VAB-WebArena-Lite, it is based on WebArena with some modifications, so please read its individual setup carefully.

Step 2. Configure the Agent

Fill in your OpenAI API Key at the correct location in configs/agents/openai-chat.yaml.

You can try using python -m src.client.agent_test to check if your agent is configured correctly.

Step 3. Start the task server

Starting the task worker involves specific tasks. Manual starting might be cumbersome; hence, we provide an automated script.

The assumption for this step is that ports from 5000 to 5015 are available. For Mac OS system, you may want to follow here to free port 5000 to use.

python -m src.start_task -a

This will launch 4 task_workers for VAB-Minecraft tasks and automatically connect them to the controller on port 5000. After executing this command, please allow approximately 1 minute for the task setup to complete. If the terminal shows ".... 200 OK", you can open another terminal and follow step 4.

Step 4. Start the assigner

This step is to actually start the tasks.

If everything is correctly configured so far, you can now initiate the task tests.

python -m src.assigner --auto-retry

Next Steps

If you wish to launch more tasks or use other models, you can refer to the content in Configuration Guide and Program Entrance Guide.

For instance, if you want to launch VAB-OmniGibson tasks, in step 3:

python -m src.start_task -a -s omnigibson 2

In step 4:

python -m src.assigner --auto-retry --config configs/assignments/omnigibson.yaml

You can modify the config files to launch other tasks or change task concurrency.

Dataset Summary

We offer two splits for each dataset: Testing and Training. Different from its predecessor AgentBench, VAB is accompanied with a trajectory training set for behavior cloning (BC) training, which allows development of more potent visual foundation agents with emerging open LMMs.

Leaderboard

Here is the scores on test set results of VAB. All metrics are task Success Rate (SR). Noted that proprietary LMMs are tested with mere Prompting, and open LMMs are tested after Multitask Finetuning on VAB training set, as they usually fail to follow complicated agent task instructions.

Acknowledgement

This project is heavily built upon the following repositories (to be updated):

  • AgentBench: which serves as the backbone framework of this project for efficient and reliable parallel agent evaluation.
  • WebArena and VisualWebArena: which serve as the testing framework and data source for VAB-WebArena-Lite dataset.
  • OmniGibson: which serves as the environment for VAB-OmniGibson.
  • JARVIS-1: VAB-Minecraft's framework is adapted from JARVIS-1's pipeline.
  • STEVE-1: which serves as the action executor for VAB-Minecraft.

Citation

@article{liu2024visualagentbench,
  title={VisualAgentBench: Towards Large Multimodal Models as Visual Foundation Agents},
  author={Liu, Xiao and Zhang, Tianjie and Gu, Yu and Iong, Iat Long and Xu, Yifan and Song, Xixuan and Zhang, Shudan and Lai, Hanyu and Liu, Xinyi and Zhao, Hanlin and others},
  journal={arXiv preprint arXiv:2408.06327},
  year={2024}
}