🌐 Expand LLM knowledge beyond training cutoffs by transforming modern websites into AI-digestible PDFs via HTTrack-powered scraping

lx-0/web2llm

🌐 web2llm - Website scraper for LLM consumption

A Python tool that prepares online documentation for LLM consumption by downloading websites and converting them into standardized formats. Ideal for making post-cutoff documentation accessible to language models.

For example, it can transform the latest Pydantic AI documentation into clean, structured PDFs, allowing LLMs to understand features released after their training cutoff.

Python 3.8+ | License: MIT | Code Style: Black

🎯 Purpose

Prepare online documentation for LLM consumption through:

  • 📥 Downloading complete websites with full JavaScript support
  • 🔄 Converting content into standardized formats
  • 📚 Generating LLM-friendly PDFs with proper structure
  • 🤖 Making post-cutoff knowledge accessible

✨ Features

  • 🌐 Website Processing
    • Full JavaScript support
    • Proper handling of relative paths
    • Automatic resource collection
  • 📑 Document Generation
    • Clean PDF output with proper formatting
    • Automatic page breaks
    • Table of contents generation
    • Custom CSS and print styles support
  • 🛠️ Configuration Options
    • Configurable margins and layout
    • Environment variable support
    • Progress tracking with color output
    • Debug and quiet modes
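
The prerequisites (HTTrack and wkhtmltopdf) suggest a two-stage pipeline: mirror the site, then render the result to PDF. The sketch below only builds the two command lines to make that flow concrete. It is an illustration under that assumption, not web2llm's actual implementation: the function names, the 10mm margin, and the file names are hypothetical, while `-O`, `--margin-*`, and `toc` are standard options of the respective tools.

```python
from pathlib import Path
from typing import List


def httrack_command(url: str, download_dir: Path) -> List[str]:
    """Build an HTTrack invocation that mirrors `url` into `download_dir`."""
    # -O sets the directory HTTrack writes the mirror into.
    return ["httrack", url, "-O", str(download_dir)]


def wkhtmltopdf_command(html_file: Path, pdf_file: Path, margin: str = "10mm") -> List[str]:
    """Build a wkhtmltopdf invocation with uniform margins and a table of contents."""
    return [
        "wkhtmltopdf",
        "--margin-top", margin,
        "--margin-bottom", margin,
        "--margin-left", margin,
        "--margin-right", margin,
        "toc",            # wkhtmltopdf's built-in table-of-contents object
        str(html_file),   # input page (hypothetical merged HTML file)
        str(pdf_file),    # output PDF, always the last argument
    ]


if __name__ == "__main__":
    print(" ".join(httrack_command("https://example.com", Path("mirror"))))
    print(" ".join(wkhtmltopdf_command(Path("merged.html"), Path("docs.pdf"))))
```

Returning argument lists rather than shell strings keeps the commands safe to pass to `subprocess.run` without quoting issues.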

🔧 Prerequisites

  • Python 3.8 or higher
  • Conda (Miniconda or Anaconda)
  • HTTrack
    • 🍎 macOS: brew install httrack
    • 🐧 Linux (Debian/Ubuntu): apt-get install httrack
    • 🪟 Windows: Download from the HTTrack website
  • wkhtmltopdf (installed automatically via conda)
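
A quick way to confirm the prerequisites above are in place is to look for the executables on `PATH`. This is a minimal sketch and not part of web2llm; the helper names `find_tools` and `python_ok` are hypothetical.

```python
import shutil
import sys
from typing import Dict, Optional, Sequence, Tuple


def find_tools(names: Sequence[str] = ("httrack", "wkhtmltopdf")) -> Dict[str, Optional[str]]:
    """Map each tool name to its resolved path, or None if it is not on PATH."""
    return {name: shutil.which(name) for name in names}


def python_ok(minimum: Tuple[int, int] = (3, 8)) -> bool:
    """Return True if the running interpreter meets the minimum version."""
    return sys.version_info >= minimum


if __name__ == "__main__":
    print("Python 3.8+:", "ok" if python_ok() else "too old")
    for tool, path in find_tools().items():
        print(f"{tool}: {path or 'NOT FOUND'}")
```

Run it inside the activated conda environment, since wkhtmltopdf is installed there.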

🚀 Quick Start

📦 Installation

  1. Install HTTrack for your OS (see Prerequisites)
  2. Clone the repository and set up the environment:
# Create and activate conda environment
conda env create -f environment.yml
conda activate web2llm

# Install package
pip install -e .

📖 Usage

Basic conversion:

python -m web2llm https://example.com --output docs.pdf

🎮 Command Options

  • url: Target website URL (required)
  • --output, -o: Output PDF path (required)
  • --debug: Keep temporary files and debug information
  • --quiet, -q: Suppress progress output
  • --skip-download: Skip the download step and use existing files
  • --download-only: Download the website only and skip PDF conversion
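
To make the option list concrete, here is how an equivalent command line could be declared with `argparse`. This is a sketch reconstructed from the options documented above, not web2llm's own parser; `build_parser` is a hypothetical name and the help strings are paraphrased.

```python
import argparse


def build_parser() -> argparse.ArgumentParser:
    """Declare a CLI that mirrors the documented web2llm options."""
    parser = argparse.ArgumentParser(prog="web2llm")
    parser.add_argument("url", help="Target website URL")
    parser.add_argument("--output", "-o", required=True, help="Output PDF path")
    parser.add_argument("--debug", action="store_true",
                        help="Keep temporary files and debug information")
    parser.add_argument("--quiet", "-q", action="store_true",
                        help="Suppress progress output")
    parser.add_argument("--skip-download", action="store_true",
                        help="Use existing files instead of downloading")
    parser.add_argument("--download-only", action="store_true",
                        help="Download only, skip conversion")
    return parser


if __name__ == "__main__":
    args = build_parser().parse_args()
    print(args)
```

Note that `argparse` exposes dashed flags with underscores, so `--skip-download` is read as `args.skip_download`.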

📝 Examples

  1. Convert Pydantic AI docs:
python -m web2llm https://ai.pydantic.dev/ --output pydantic_ai.pdf
  2. Debug mode:
python -m web2llm https://example.com --output docs.pdf --debug
  3. Quiet mode for scripts:
python -m web2llm https://example.com --output docs.pdf --quiet
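
When calling the tool from another Python script, the quiet-mode invocation above can be wrapped with `subprocess`. This is a minimal sketch: `web2llm_command` and `convert` are hypothetical helpers, and it assumes (the README does not say) that web2llm exits with a non-zero status on failure.

```python
import subprocess
import sys
from typing import List


def web2llm_command(url: str, output: str, quiet: bool = True) -> List[str]:
    """Build the `python -m web2llm` command line shown in the examples."""
    # sys.executable keeps the call inside the current (conda) environment.
    cmd = [sys.executable, "-m", "web2llm", url, "--output", output]
    if quiet:
        cmd.append("--quiet")
    return cmd


def convert(url: str, output: str) -> None:
    """Run the conversion; raises CalledProcessError on a non-zero exit."""
    subprocess.run(web2llm_command(url, output), check=True)


if __name__ == "__main__":
    convert("https://ai.pydantic.dev/", "pydantic_ai.pdf")
```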

⚙️ Configuration

Set these via environment variables or a .env file:

  • DOWNLOAD_DIR: Temporary files location
  • OUTPUT_DIR: PDF output location
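
For illustration, the two variables could be read like this. It is a sketch only: the function name `load_dirs` and the fallback values `downloads` and `output` are assumptions, since the README does not state the tool's defaults, and loading a .env file (which typically needs a third-party package) is not shown.

```python
import os
from pathlib import Path
from typing import Mapping, Optional, Tuple


def load_dirs(env: Optional[Mapping[str, str]] = None) -> Tuple[Path, Path]:
    """Return (download_dir, output_dir) from the given mapping or os.environ."""
    env = os.environ if env is None else env
    # Fallback values are hypothetical, not web2llm's documented defaults.
    download_dir = Path(env.get("DOWNLOAD_DIR", "downloads"))
    output_dir = Path(env.get("OUTPUT_DIR", "output"))
    return download_dir, output_dir


if __name__ == "__main__":
    download_dir, output_dir = load_dirs()
    print(f"Temporary files: {download_dir}")
    print(f"PDF output:      {output_dir}")
```

Accepting the mapping as a parameter makes the function easy to test without touching the real environment.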

🤝 Testing

Run tests with pytest:

# Run all tests (via python)
python -m pytest tests/ -v

# Run all tests (directly)
pytest tests/ -v

# Run tests with coverage report
pytest tests/ --cov=web2llm --cov-report=term-missing

# Run specific test file
pytest tests/test_preprocessor.py -v

# Run specific test function
pytest tests/test_preprocessor.py::test_normalize_url -v

Tests cover:

  • CLI functionality
  • Website downloading
  • HTML preprocessing
  • SVG handling
  • Tabbed content processing
  • Document merging
  • PDF conversion
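
For readers new to pytest, a test such as `test_normalize_url` is just a function whose name starts with `test_` and that uses plain `assert` statements. The sketch below shows the shape of such a test. The `normalize_url` defined here is a hypothetical stand-in written for this example; the real function in web2llm may behave differently.

```python
from urllib.parse import urlsplit, urlunsplit


def normalize_url(url: str) -> str:
    """Hypothetical stand-in: lowercase scheme and host, drop the fragment."""
    parts = urlsplit(url)
    path = parts.path or "/"  # treat an empty path as the site root
    return urlunsplit((parts.scheme.lower(), parts.netloc.lower(), path, parts.query, ""))


def test_normalize_url():
    # pytest collects this function automatically because of the test_ prefix.
    assert normalize_url("HTTPS://Example.com/docs#intro") == "https://example.com/docs"
    assert normalize_url("https://example.com") == "https://example.com/"
```

Saved in a file under tests/, it can be selected with the `file::function` syntax shown above.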

🤝 Contributing

Contributions welcome! Please feel free to submit a Pull Request.

📄 License

This project is licensed under the MIT License - see the LICENSE file for details.
