Vision Browser Control

An AI-powered web automation tool that combines GPT-4 Vision with Puppeteer to create an intelligent browsing system. The system can understand web pages visually, interact with elements, and execute complex sequences of actions to achieve specified goals.

Features

🔍 Visual Understanding: Uses GPT-4 Vision to analyze webpage screenshots
🎯 Goal-Oriented Navigation: Executes multi-step actions to achieve user-defined objectives
🤖 Intelligent Form Filling: Automatically generates contextually appropriate form inputs
🔄 Dynamic Content Handling: Adapts to page changes and state transitions
🎨 Element Highlighting: Visually marks interactive elements for precise interaction

Architecture

flowchart TB
    subgraph Core["Core Application (main.js)"]
        MainLoop["Main Loop"]
        SessionState["Session State"]
        MetaData["Metadata Manager"]
    end

    subgraph Browser["Browser Module (browserActions.js)"]
        Puppeteer["Puppeteer Controller"]
        Screenshot["Screenshot Manager"]
        Elements["Element Highlighter"]
        Actions["Action Processor"]
    end

    subgraph AI["AI Module (chat_chain.js)"]
        GPT4V["GPT-4 Vision"]
        LangChain["LangChain Integration"]
        History["Message History"]
        InputGen["Input Generator"]
    end

    subgraph Utils["Utilities (utils.js)"]
        Logger["Winston Logger"]
        ImageProc["Image Processor"]
        JSONParser["JSON Parser"]
        InputHandler["Input Handler"]
    end

    MainLoop -->|"Orchestrates"| Browser
    MainLoop -->|"Makes decisions"| AI
    MainLoop -->|"Uses"| Utils
    SessionState -->|"Updates"| MetaData

    Puppeteer -->|"Controls"| Screenshot
    Puppeteer -->|"Manages"| Elements
    Actions -->|"Executes via"| Puppeteer
    Screenshot -->|"Provides images to"| GPT4V

    GPT4V -->|"Feeds into"| LangChain
    LangChain -->|"Maintains"| History
    LangChain -->|"Uses"| InputGen

    Logger -->|"Records"| MainLoop
    ImageProc -->|"Processes"| Screenshot
    JSONParser -->|"Parses"| LangChain
    InputHandler -->|"Feeds"| MainLoop

    classDef module fill:#f9f,stroke:#333,stroke-width:2px
    classDef core fill:#bbf,stroke:#333,stroke-width:2px
    class Browser,AI,Utils module
    class Core core

Setup

Install dependencies:

npm install

Create a .env file in the project root:

OPENAI_API_KEY=your_api_key_here

Run the application:

node main.js
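
A missing or placeholder key is an easy setup mistake to hit, so a startup check can fail fast with a readable message. The sketch below is an illustration, not the project's actual code: `requireApiKey` is a hypothetical helper, and it assumes the .env file has already been loaded into `process.env` (for example by a loader such as the dotenv package, which the README does not confirm).

```javascript
// Hypothetical startup check; the real main.js may read and validate the key differently.
function requireApiKey(env = process.env) {
  const key = (env.OPENAI_API_KEY || "").trim();
  // Reject both an absent key and the placeholder value from the example .env line.
  if (!key || key === "your_api_key_here") {
    throw new Error("OPENAI_API_KEY is missing; add it to .env before running main.js");
  }
  return key;
}
```

Calling `requireApiKey()` once at the top of the entry point would surface the problem before any browser is launched.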

Example Usage

$ node main.js
Enter the initial URL to start browsing: https://example.com
Enter your main goal: Log in to LinkedIn and apply to jobs with Easy Apply available

The system will:

Load the specified URL
Analyze the page visually
Plan and execute actions to achieve the goal
Provide progress updates
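
The four steps above can be sketched as a single loop. This is a minimal illustration under stated assumptions, not the contents of main.js: `runSession`, the `browser` object (`open`, `screenshot`, `perform`), the `model.decide` method, the `"done"` action type, and the `maxSteps` budget are all hypothetical names chosen for the example.

```javascript
// Sketch of the load -> analyze -> plan -> act cycle.
// The browser and model are passed in, so the loop can be exercised with stand-ins.
async function runSession({ url, goal, browser, model, maxSteps = 20, log = console.log }) {
  await browser.open(url); // 1. load the specified URL
  const history = [];
  for (let step = 1; step <= maxSteps; step++) {
    const screenshot = await browser.screenshot(); // 2. capture the page for visual analysis
    const action = await model.decide({ goal, screenshot, history }); // 3. plan the next action
    history.push(action);
    log(`step ${step}: ${action.type}`); // 4. progress update
    if (action.type === "done") return { completed: true, history };
    await browser.perform(action); // execute the planned action, then look at the page again
  }
  return { completed: false, history }; // step budget exhausted before the goal was reached
}
```

Capping the number of steps is a design choice for the sketch: it keeps a confused model from looping forever and spending API calls.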

Implementation Details

Visual Analysis

Captures screenshots of page content
Identifies interactive elements
Understands page layout and structure
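
To make the screenshot step concrete, here is one way a captured image could be packaged for a vision model. It is a sketch under assumptions: `buildVisionMessage` is a hypothetical helper, the prompt wording is invented, and the message shape follows the OpenAI chat format for image input (a text part plus an `image_url` part carrying a data URL). The project goes through LangChain, which may wrap the message differently.

```javascript
// Packages PNG bytes (for example, the result of a Puppeteer screenshot) as a chat message.
function buildVisionMessage(pngBytes, goal) {
  // Buffer.from accepts both Buffer and Uint8Array input, so either screenshot type works.
  const base64 = Buffer.from(pngBytes).toString("base64");
  return {
    role: "user",
    content: [
      {
        type: "text",
        text: `Goal: ${goal}\nList the interactive elements you can see and suggest the next action.`,
      },
      { type: "image_url", image_url: { url: `data:image/png;base64,${base64}` } },
    ],
  };
}
```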

Action Planning

Determines optimal action sequences
Handles navigation decisions
Manages form interactions
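
Planning only helps if the model's reply can be turned into a concrete action. The sketch below shows one defensive way to do that, assuming the model is asked to answer with a JSON object. The `parseAction` name, the `type` field, and the set of action types are assumptions for illustration; the actual schema used in chat_chain.js and utils.js is not described in this README.

```javascript
// Hypothetical set of actions the browser module is assumed to support.
const ALLOWED_ACTIONS = new Set(["click", "type", "navigate", "done"]);

function parseAction(reply) {
  // Models often wrap JSON in prose or a code fence, so take the outermost {...} span.
  const start = reply.indexOf("{");
  const end = reply.lastIndexOf("}");
  if (start === -1 || end <= start) {
    throw new Error("no JSON object found in model reply");
  }
  const action = JSON.parse(reply.slice(start, end + 1));
  // Refuse anything outside the known action set instead of passing it to the browser.
  if (!ALLOWED_ACTIONS.has(action.type)) {
    throw new Error(`unsupported action type: ${action.type}`);
  }
  return action;
}
```

Rejecting unknown action types before they reach the browser keeps a malformed reply from turning into an unintended click or navigation.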

State Management

Tracks browsing session progress
Maintains action history
Monitors goal completion
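
The three responsibilities above fit in a small state object. This is a hedged sketch rather than the project's implementation: `createSessionState`, its method names, and the convention that a `"done"` action marks the goal as complete are all assumptions made for the example.

```javascript
// Hypothetical session-state container: progress, action history, goal completion.
function createSessionState(goal) {
  const actions = [];
  let completed = false;
  return {
    // Append an action together with the URL it was performed on.
    record(action, url) {
      actions.push({ step: actions.length + 1, url, ...action });
      if (action.type === "done") completed = true;
    },
    // Return a copy so callers cannot mutate the stored history.
    history() {
      return actions.slice();
    },
    summary() {
      const last = actions[actions.length - 1];
      return { goal, steps: actions.length, completed, lastUrl: last ? last.url : null };
    },
  };
}
```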

Requirements

Node.js 16+
OpenAI API key
Chrome/Chromium browser

Limitations

Requires a stable internet connection
Performance varies with page complexity
May need adjustments for specific websites

License

MIT
