
MLA-Trust: Benchmarking Trustworthiness of Multimodal LLM Agents in GUI Environments


🛡️ MLA-Trust is a comprehensive and unified framework that evaluates the trustworthiness of multimodal LLM agents (MLAs) across four principled dimensions: truthfulness, controllability, safety, and privacy. The framework includes 34 high-risk interactive tasks designed to expose new trustworthiness challenges in GUI environments.

Framework overview:

  • Truthfulness captures whether the agent correctly interprets visual or DOM-based elements in the GUI and produces factual outputs based on those perceptions.

  • Controllability assesses whether the agent introduces unnecessary steps, drifts from the intended goal, or triggers side effects not specified by the user.

  • Safety examines whether the agent's actions are free from harmful or irreversible consequences, encompassing the prevention of behaviors that cause financial loss, data corruption, or system failures.

  • Privacy evaluates whether the agent respects the confidentiality of sensitive information, which is critical because MLAs routinely capture screenshots, handle form data, and interact with files.

🎯 Main Findings

  • 🚨 Severe vulnerabilities in GUI environments: Both proprietary and open-source MLAs that interact with GUIs exhibit more severe trustworthiness risks than traditional MLLMs, particularly in high-stakes scenarios such as financial transactions.
  • 🔄 Multi-step dynamic interactions amplify vulnerabilities: Transforming MLLMs into GUI-based MLAs significantly compromises their trustworthiness; in multi-step interactive settings, these agents will execute harmful instructions that standalone MLLMs would typically reject.
  • Emergence of derived risks from iterative autonomy: Multi-step execution enhances adaptability but introduces latent, nonlinear risk accumulation across decision cycles, leading to unpredictable derived risks.
  • 📈 Trustworthiness correlates with training and scale: Open-source models employing structured fine-tuning strategies (e.g., SFT and RLHF) demonstrate improved controllability and safety, and larger models generally exhibit higher trustworthiness across multiple sub-aspects.

💻 Installation

  1. Install uv by following the official installation guide, and ensure the PATH environment variable is configured as prompted.
  2. Install dependencies:
    ## Install the core dependencies
    uv sync
    ## Optionally include the flash-attn extra
    uv sync --extra flash-attn
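As a quick sanity check that the environment resolved correctly, you can attempt a minimal import. This sketch assumes torch is among the project's dependencies (it is required by the supported open-source VLMs), which is our assumption:

    ## Sanity check: confirm the environment works (assumes torch is a dependency)
    uv run python -c "import torch; print(torch.__version__)"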
📱 Mobile Setup

A. ADB Setup and Configuration

Reference: Mobile-Agent-E Repository

  1. Install Android Debug Bridge (ADB)

  2. Enable Developer Options

    • Go to Settings → About phone
    • Tap "MIUI version" multiple times until developer options are enabled (take Xiaomi for example)
    • Navigate to Settings → Additional Settings → Developer options
  3. Enable USB Debugging

    • Enable "USB debugging" in Developer options
    • Connect phone via USB cable
    • Select "File Transfer" mode when prompted
  4. Verify ADB Connection

    ## Check connected devices
    adb devices
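    ## A connected device is listed as "<serial>  device"; an "unauthorized"
    ## entry means the USB-debugging prompt on the phone has not been accepted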

B. Task Preconditions

  1. Modify the scripts/mobile/adb.sh script for device setup (a minimal sketch follows this list)
    • The script (a) unlocks the device and (b) returns to the home screen
    • It must be executed before each task
    • Customize it according to your device
  2. Update ANDROID_SERIAL in scripts/mobile/run_task.sh to match your device's serial (as shown by adb devices)
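A minimal adb.sh could look like the sketch below. The keycodes are standard Android input events, but the swipe coordinates and the assumption of a PIN-free lock screen are illustrative only; adapt them to your device.

    ## Wake the screen
    adb shell input keyevent KEYCODE_WAKEUP
    ## Swipe up to dismiss the lock screen (assumes no PIN or pattern)
    adb shell input swipe 500 1500 500 300
    ## Return to the home screen
    adb shell input keyevent KEYCODE_HOME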

Our experimental setup was as follows: (a) Device: Redmi Note 13 Pro; (b) Operating System: Xiaomi HyperOS 2.0.6.0.

🌐 Website Setup

A. Task Preconditions

Since many tasks require a login to function properly, we provide cookie-loading functionality so the agent can work correctly. Run the following command (it must be run on a machine with a graphical web interface), perform your logins, and finally close the popup browser window to save the cookies.

python src/scene/web/load_cookies.py

Then save the generated *.json files to src/scene/web/cookies.
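Assuming the cookie files are written to the current working directory (the actual output location may differ), moving them into place could look like:

    ## Hypothetical: move the generated cookie files into place (run from the repo root)
    mv *.json src/scene/web/cookies/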

🌟 Quick Start

  1. Configure environment variables
    cp .env.template .env
  2. Activate the virtual environment
    source .venv/bin/activate
  3. Execute the main task
    bash scripts/mobile/run_task.sh
    bash scripts/web/run_task.sh
  4. Run the evaluation
    bash scripts/mobile/eval.sh
    bash scripts/web/eval.sh

🚀 Supported Models

The following models are supported:

  • gpt-4o-2024-11-20
  • gpt-4-turbo
  • gemini-2.0-flash
  • gemini-2.0-pro-exp-02-05
  • claude-3-7-sonnet-20250219
  • llava-hf/llava-v1.6-mistral-7b-hf
  • lmms-lab/llava-onevision-qwen2-72b-ov-sft
  • lmms-lab/llava-onevision-qwen2-72b-ov-chat
  • microsoft/Magma-8B
  • Qwen/Qwen2.5-VL-7B-Instruct
  • deepseek-ai/deepseek-vl2
  • openbmb/MiniCPM-o-2_6
  • mistral-community/pixtral-12b
  • microsoft/Phi-4-multimodal-instruct
  • OpenGVLab/InternVL2-8B

📋 Task Overview

Our comprehensive task suite covers 34 high-risk interactive scenarios across multiple domains.

🏆 Results

Performance ranking of different MLAs across trustworthiness dimensions.


🤝 Acknowledgement

We thank the Mobile-Agent-E and SeeAct projects, whose foundational work supported the development of this project.

📞 Contact

For questions, suggestions, or collaboration opportunities, please contact us at jankinfmail@gmail.com, 52285904015@stu.ecnu.edu.cn, or yangxiao19@tsinghua.org.cn.

🌟 Citation

If you find this work useful, please consider citing our paper:

@article{yang2025mla,
  title={MLA-Trust: Benchmarking Trustworthiness of Multimodal LLM Agents in GUI Environments},
  author={Yang, Xiao and Chen, Jiawei and Luo, Jun and Fang, Zhengwei and Dong, Yinpeng and Su, Hang and Zhu, Jun},
  journal={arXiv preprint arXiv:2506.01616},
  year={2025}
}
