dieter
is a sophisticated browser automation agent that combines Large Language Models (LLM) with computer vision to interact with web interfaces. Unlike traditional automation tools that rely on selectors or XPath, dieter
understands web pages visually - similar to how humans do.
Note: This README is AI-generated based on analysis of the dieter
source code.
- Vision-Based Interaction: Uses OmniParser's pretrained YOLO model to identify clickable elements and interactive components
- Intelligent Text Recognition: Leverages Apple's Vision framework for accurate OCR
- LLM-Guided Decision Making: Uses language models to understand context and determine actions
- Memory System: Maintains context and previous interactions for more intelligent automation
- Robust Navigation: Tracks viewport state and navigation history
- Interactive & Non-Interactive Modes: Supports both guided and automated execution
- Browser Control: Playwright for reliable browser automation
- Computer Vision:
- OmniParser's YOLO model for element detection
- Apple Vision framework for OCR
- Custom OmniParser integration for combining visual inputs
- LLM Integration: OpenAI API for decision making
- State Management: Comprehensive tracking of page state, viewport info, and navigation history
Model | Status | Notes |
---|---|---|
google/gemini-flash-1.5 | π’ | Recommended |
anthropic/claude-3.5-sonnet:beta | π’ | Recommended |
mistralai/pixtral-large-2411 | π’ | Recommended |
openai/gpt-4o-mini | π‘ | Prompt adherence issues |
anthropic/claude-3.5-haiku:beta | π΄ | No vision support |
anthropic/claude-3-haiku | π΄ | Doesn't work |
meta-llama/llama-3.2-90b-vision-instruct | π΄ | Doesn't work |
meta-llama/llama-3.2-11b-vision-instruct | π΄ | Doesn't work |
qwen/qwen-2-vl-72b-instruct | π΄ | Doesn't work |
mistralai/pixtral-12b | π΄ | Doesn't work |
- macOS 10.15 or later (required for Vision framework)
- Python 3.11
- Poetry for dependency management
- OmniParser YOLO weights for element detection
- Clone the repository:
git clone https://github.com/dbpprt/dieter
cd dieter
- Set up Python environment:
pyenv install 3.11
pyenv local 3.11
- Install dependencies:
poetry config virtualenvs.in-project true
poetry install
-
Download OmniParser weights:
- Get weights from HuggingFace
- Place in
weights/omniparser/icon_detect/best.pt
-
Configure:
cp config.yaml.template config.yaml
# Add your API keys and settings
The config.yaml
file controls dieter
's behavior. Here's the template with explanations:
# OpenRouter Configuration
api_key: ${OPENROUTER_API_KEY} # Set via environment variable OPENROUTER_API_KEY
base_url: "https://openrouter.ai/api/v1"
model_name: "google/gemini-flash-1.5" # OpenRouter model format
# Conversation History Control
max_history_size: 4 # Number of message pairs to keep (null for unlimited)
# Browser Configuration
browser:
width: 1024
height: 768
browser_type: "chromium" # chromium, firefox, or webkit
data_dir: ".data/browser"
device_scale_factor: 2
is_mobile: true
has_touch: true
extensions:
ublock_origin:
url: "https://github.com/gorhill/uBlock/releases/download/1.61.2/uBlock0_1.61.2.chromium.zip"
extract_dir: "uBlock0.chromium"
enabled: true
# OmniParser Configuration
omniparser:
weights_path: "weights/omniparser/icon_detect/best.pt" # Path to YOLO weights
-
OpenRouter Settings:
api_key
: Your OpenRouter API keybase_url
: API endpointmodel_name
: LLM model to use (see Model Compatibility table)
-
History Control:
max_history_size
: Limits conversation memory (null = unlimited)
-
Browser Settings:
width/height
: Browser window dimensionsbrowser_type
: Browser engine selectiondevice_scale_factor
: Screen resolution scalingis_mobile/has_touch
: Mobile device simulationextensions
: Browser extension configuration
-
OmniParser:
weights_path
: Path to YOLO model weights
dieter
provides several command-line options:
poetry run python -m src [options]
Option | Description | Default |
---|---|---|
--config |
Path to configuration file | config.yaml |
--verbose , -v |
Enable detailed logging | False |
--instruction |
Run single instruction in non-interactive mode | None |
--model-name |
Override model from config | None |
- Interactive Mode:
poetry run python -m src
- Non-Interactive Mode:
poetry run python -m src --instruction "navigate to example.com"
- Debug Mode:
poetry run python -m src -v
- Custom Model:
poetry run python -m src --model-name "anthropic/claude-3.5-sonnet:beta"
-
Page Analysis:
- Captures screenshot of current page
- Detects interactive elements using OmniParser's YOLO model
- Performs OCR on text content using Apple's Vision framework
Here's an example of how the OCR system processes a webpage:
The pink boxes with unique IDs annotate each detected text element on the page. Since the model has no spatial understanding of the image and cannot reliably predict x,y coordinates, these IDs allow the model to reference specific elements when deciding which ones to interact with.
-
Decision Making:
- LLM analyzes page state and current task
- Determines next action based on visual context
- Maintains memory of previous interactions
-
Execution:
- Precise interaction with detected elements
- Viewport management for scrolling
- Navigation handling
- State verification after actions
- Implement support for additional OCR engines to eliminate macOS dependency
- Integrate Claude MCP protocol
- Benchmark performance of local models
- Publish as a pip installable package
dieter
relies on Apple's Vision framework for OCR capabilities, which provides superior text recognition compared to alternatives.
dieter
uses OmniParser's pretrained YOLO model to detect interactive elements like buttons, links, and input fields, combined with OCR for text recognition. This approach makes it more robust to UI changes compared to selector-based automation.
Yes, since dieter
operates based on visual information rather than DOM structure, it can handle dynamically loaded content and modern web applications effectively.
- Web navigation and interaction
- Form filling
- Content extraction
- Visual verification
- Complex multi-step workflows
The context system in dieter
is implemented through a sophisticated combination of prompts and agent behavior:
-
Prompt Templates (
prompts/browser.py
):- Maintains structured context through sections like
<additional_context>
,<browser_state>
,<history>
, and<memory>
- Provides the model with current page state, navigation capabilities, and interaction history
- Maintains structured context through sections like
-
Memory System (
agent.py
):- Implements a
<memorize>
command allowing the model to store important information - Persists memories across conversation turns
- Useful for maintaining context when scrolling or navigating between pages
- Implements a
-
Context Truncation:
- Configurable through
max_history_size
in config.yaml - When history exceeds the limit, older messages are removed while preserving the first message
- A
<truncated />
marker is inserted to indicate removed context - Ensures the model maintains focus on recent interactions while staying within context limits
- Configurable through
This system allows dieter
to maintain relevant context while preventing context overflow, enabling more coherent and effective automation across complex tasks.
This project is under the MIT License - see LICENSE file for details.
Note about OmniParser licensing:
- OmniParser's icon_detect model is under AGPL license
- OmniParser's icon_caption_blip2 and icon_caption_florence models are under MIT license
This README is AI-generated based on analysis of the dieter
source code.
Built with β€οΈ by Claude using Python, OmniParser, and Apple Vision Framework