OmniParser: Screen Parsing tool for Pure Vision Based GUI Agent

📢 [Project Page] [V2 Blog Post] [Models V2] [Models V1.5] [HuggingFace Space Demo]

OmniParser is a comprehensive method for parsing user interface screenshots into structured and easy-to-understand elements, which significantly enhances the ability of GPT-4V to generate actions that can be accurately grounded in the corresponding regions of the interface.

News

[2025/3] We support local logging of trajecotry so that you can use OmniParser+OmniTool to build training data pipeline for your favorate agent in your domain. [Documentation WIP]
[2025/3] We are gradually adding multi agents orchstration and improving user interface in OmniTool for better experience.
[2025/2] We release OmniParser V2 checkpoints. Watch Video
[2025/2] We introduce OmniTool: Control a Windows 11 VM with OmniParser + your vision model of choice. OmniTool supports out of the box the following large language models - OpenAI (4o/o1/o3-mini), DeepSeek (R1), Qwen (2.5VL) or Anthropic Computer Use. Watch Video
[2025/1] V2 is coming. We achieve new state of the art results 39.5% on the new grounding benchmark Screen Spot Pro with OmniParser v2 (will be released soon)! Read more details here.
[2024/11] We release an updated version, OmniParser V1.5 which features 1) more fine grained/small icon detection, 2) prediction of whether each screen element is interactable or not. Examples in the demo.ipynb.
[2024/10] OmniParser was the #1 trending model on huggingface model hub (starting 10/29/2024).
[2024/10] Feel free to checkout our demo on huggingface space! (stay tuned for OmniParser + Claude Computer Use)
[2024/10] Both Interactive Region Detection Model and Icon functional description model are released! Hugginface models
[2024/09] OmniParser achieves the best performance on Windows Agent Arena!

Install

First clone the repo, and then install environment:

cd OmniParser
conda create -n "omni" python==3.12
conda activate omni
pip install -r requirements.txt

Ensure you have the V2 weights downloaded in weights folder (ensure caption weights folder is called icon_caption_florence). If not download them with:

   # download the model checkpoints to local directory OmniParser/weights/
   for f in icon_detect/{train_args.yaml,model.pt,model.yaml} icon_caption/{config.json,generation_config.json,model.safetensors}; do huggingface-cli download microsoft/OmniParser-v2.0 "$f" --local-dir weights; done
   mv weights/icon_caption weights/icon_caption_florence

Examples:

We put together a few simple examples in the demo.ipynb.

Gradio Demo

To run gradio demo, simply run:

python gradio_demo.py

Model Weights License

For the model checkpoints on huggingface model hub, please note that icon_detect model is under AGPL license since it is a license inherited from the original yolo model. And icon_caption_blip2 & icon_caption_florence is under MIT license. Please refer to the LICENSE file in the folder of each model: https://huggingface.co/microsoft/OmniParser.

📚 Citation

Our technical report can be found here. If you find our work useful, please consider citing our work:

@misc{lu2024omniparserpurevisionbased,
      title={OmniParser for Pure Vision Based GUI Agent}, 
      author={Yadong Lu and Jianwei Yang and Yelong Shen and Ahmed Awadallah},
      year={2024},
      eprint={2408.00203},
      archivePrefix={arXiv},
      primaryClass={cs.CV},
      url={https://arxiv.org/abs/2408.00203}, 
}

Name		Name	Last commit message	Last commit date
Latest commit History 147 Commits
docs		docs
eval		eval
imgs		imgs
omnitool		omnitool
util		util
.gitignore		.gitignore
LICENSE		LICENSE
README.md		README.md
SECURITY.md		SECURITY.md
demo.ipynb		demo.ipynb
gradio_demo.py		gradio_demo.py
requirements.txt		requirements.txt

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Uh oh!

Repository files navigation

OmniParser: Screen Parsing tool for Pure Vision Based GUI Agent

News

Install

Examples:

Gradio Demo

Model Weights License

📚 Citation

About

Uh oh!

Releases

Packages

Languages

License

redron/OmniParser

Folders and files

Latest commit

History

Repository files navigation

OmniParser: Screen Parsing tool for Pure Vision Based GUI Agent

News

Install

Examples:

Gradio Demo

Model Weights License

📚 Citation

About

Resources

License

Security policy

Uh oh!

Stars

Watchers

Forks

Releases

Packages 0

Languages

Packages