GitHub - dbpprt/dieter: 🤖 Vision-powered browser automation that sees and interacts with web pages like humans do

https://www.youtube.com/watch?v=ZOplZbU67-4

Overview

dieter is a sophisticated browser automation agent that combines Large Language Models (LLM) with computer vision to interact with web interfaces. Unlike traditional automation tools that rely on selectors or XPath, dieter understands web pages visually - similar to how humans do.

Note: This README is AI-generated based on analysis of the dieter source code.

Key Features

Vision-Based Interaction: Uses OmniParser's pretrained YOLO model to identify clickable elements and interactive components
Intelligent Text Recognition: Leverages Apple's Vision framework for accurate OCR
LLM-Guided Decision Making: Uses language models to understand context and determine actions
Memory System: Maintains context and previous interactions for more intelligent automation
Robust Navigation: Tracks viewport state and navigation history
Interactive & Non-Interactive Modes: Supports both guided and automated execution

Technical Architecture

Browser Control: Playwright for reliable browser automation
Computer Vision:
- OmniParser's YOLO model for element detection
- Apple Vision framework for OCR
- Custom OmniParser integration for combining visual inputs
LLM Integration: OpenAI API for decision making
State Management: Comprehensive tracking of page state, viewport info, and navigation history

Model Compatibility

Model	Status	Notes
google/gemini-flash-1.5	🟢	Recommended
anthropic/claude-3.5-sonnet:beta	🟢	Recommended
mistralai/pixtral-large-2411	🟢	Recommended
openai/gpt-4o-mini	🟡	Prompt adherence issues
anthropic/claude-3.5-haiku:beta	🔴	No vision support
anthropic/claude-3-haiku	🔴	Doesn't work
meta-llama/llama-3.2-90b-vision-instruct	🔴	Doesn't work
meta-llama/llama-3.2-11b-vision-instruct	🔴	Doesn't work
qwen/qwen-2-vl-72b-instruct	🔴	Doesn't work
mistralai/pixtral-12b	🔴	Doesn't work

Prerequisites

macOS 10.15 or later (required for Vision framework)
Python 3.11
Poetry for dependency management
OmniParser YOLO weights for element detection

Installation

Clone the repository:

git clone https://github.com/dbpprt/dieter
cd dieter

Set up Python environment:

pyenv install 3.11
pyenv local 3.11

Install dependencies:

poetry config virtualenvs.in-project true
poetry install

Download OmniParser weights:
- Get weights from HuggingFace
- Place in weights/omniparser/icon_detect/best.pt
Configure:

cp config.yaml.template config.yaml
# Add your API keys and settings

Configuration

The config.yaml file controls dieter's behavior. Here's the template with explanations:

# OpenRouter Configuration
api_key: ${OPENROUTER_API_KEY} # Set via environment variable OPENROUTER_API_KEY
base_url: "https://openrouter.ai/api/v1"
model_name: "google/gemini-flash-1.5" # OpenRouter model format

# Conversation History Control
max_history_size: 4 # Number of message pairs to keep (null for unlimited)

# Browser Configuration
browser:
  width: 1024
  height: 768
  browser_type: "chromium" # chromium, firefox, or webkit
  data_dir: ".data/browser"
  device_scale_factor: 2
  is_mobile: true
  has_touch: true
  extensions:
    ublock_origin:
      url: "https://github.com/gorhill/uBlock/releases/download/1.61.2/uBlock0_1.61.2.chromium.zip"
      extract_dir: "uBlock0.chromium"
      enabled: true

# OmniParser Configuration
omniparser:
  weights_path: "weights/omniparser/icon_detect/best.pt" # Path to YOLO weights

Configuration Options

OpenRouter Settings:
- api_key: Your OpenRouter API key
- base_url: API endpoint
- model_name: LLM model to use (see Model Compatibility table)
History Control:
- max_history_size: Limits conversation memory (null = unlimited)
Browser Settings:
- width/height: Browser window dimensions
- browser_type: Browser engine selection
- device_scale_factor: Screen resolution scaling
- is_mobile/has_touch: Mobile device simulation
- extensions: Browser extension configuration
OmniParser:
- weights_path: Path to YOLO model weights

CLI Commands

dieter provides several command-line options:

poetry run python -m src [options]

Available Options

Option	Description	Default
`--config`	Path to configuration file	config.yaml
`--verbose`, `-v`	Enable detailed logging	False
`--instruction`	Run single instruction in non-interactive mode	None
`--model-name`	Override model from config	None

Usage Examples

Interactive Mode:

poetry run python -m src

Non-Interactive Mode:

poetry run python -m src --instruction "navigate to example.com"

Debug Mode:

poetry run python -m src -v

Custom Model:

poetry run python -m src --model-name "anthropic/claude-3.5-sonnet:beta"

How It Works

Page Analysis:
- Captures screenshot of current page
- Detects interactive elements using OmniParser's YOLO model
- Performs OCR on text content using Apple's Vision framework
Here's an example of how the OCR system processes a webpage:

The pink boxes with unique IDs annotate each detected text element on the page. Since the model has no spatial understanding of the image and cannot reliably predict x,y coordinates, these IDs allow the model to reference specific elements when deciding which ones to interact with.
Decision Making:
- LLM analyzes page state and current task
- Determines next action based on visual context
- Maintains memory of previous interactions
Execution:
- Precise interaction with detected elements
- Viewport management for scrolling
- Navigation handling
- State verification after actions

Roadmap

Implement support for additional OCR engines to eliminate macOS dependency
Integrate Claude MCP protocol
Benchmark performance of local models
Publish as a pip installable package

FAQ

Why macOS Only?

dieter relies on Apple's Vision framework for OCR capabilities, which provides superior text recognition compared to alternatives.

How Does Visual Detection Work?

dieter uses OmniParser's pretrained YOLO model to detect interactive elements like buttons, links, and input fields, combined with OCR for text recognition. This approach makes it more robust to UI changes compared to selector-based automation.

Can It Handle Dynamic Content?

Yes, since dieter operates based on visual information rather than DOM structure, it can handle dynamically loaded content and modern web applications effectively.

What Types of Automation Can It Handle?

Web navigation and interaction
Form filling
Content extraction
Visual verification
Complex multi-step workflows

How Does the Context Work?

The context system in dieter is implemented through a sophisticated combination of prompts and agent behavior:

Prompt Templates (prompts/browser.py):
- Maintains structured context through sections like <additional_context>, <browser_state>, <history>, and <memory>
- Provides the model with current page state, navigation capabilities, and interaction history
Memory System (agent.py):
- Implements a <memorize> command allowing the model to store important information
- Persists memories across conversation turns
- Useful for maintaining context when scrolling or navigating between pages
Context Truncation:
- Configurable through max_history_size in config.yaml
- When history exceeds the limit, older messages are removed while preserving the first message
- A <truncated /> marker is inserted to indicate removed context
- Ensures the model maintains focus on recent interactions while staying within context limits

This system allows dieter to maintain relevant context while preventing context overflow, enabling more coherent and effective automation across complex tasks.

License

This project is under the MIT License - see LICENSE file for details.

Note about OmniParser licensing:

OmniParser's icon_detect model is under AGPL license
OmniParser's icon_caption_blip2 and icon_caption_florence models are under MIT license

Disclaimer

This README is AI-generated based on analysis of the dieter source code.

Built with ❤️ by Claude using Python, OmniParser, and Apple Vision Framework

Name		Name	Last commit message	Last commit date
Latest commit History 18 Commits
.vscode		.vscode
assets		assets
src		src
tests		tests
.gitignore		.gitignore
.python-version		.python-version
README.md		README.md
config.yaml.template		config.yaml.template
pyproject.toml		pyproject.toml

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Repository files navigation

Overview

Key Features

Technical Architecture

Model Compatibility

Prerequisites

Installation

Configuration

Configuration Options

CLI Commands

Available Options

Usage Examples

How It Works

Roadmap

FAQ

Why macOS Only?

How Does Visual Detection Work?

Can It Handle Dynamic Content?

What Types of Automation Can It Handle?

How Does the Context Work?

License

Disclaimer

About

Languages

dbpprt/dieter

Folders and files

Latest commit

History

Repository files navigation

Overview

Key Features

Technical Architecture

Model Compatibility

Prerequisites

Installation

Configuration

Configuration Options

CLI Commands

Available Options

Usage Examples

How It Works

Roadmap

FAQ

Why macOS Only?

How Does Visual Detection Work?

Can It Handle Dynamic Content?

What Types of Automation Can It Handle?

How Does the Context Work?

License

Disclaimer

About

Topics

Resources

Stars

Watchers

Forks

Languages