A Python tool that prepares online documentation for LLM consumption by downloading websites and converting them into standardized formats. Ideal for making post-cutoff documentation accessible to language models.
For example, it can transform the latest Pydantic AI documentation into clean, structured PDFs, allowing LLMs to understand features released after their training cutoff.
Prepare online documentation for LLM consumption through:
- Downloading complete websites with full JavaScript support
- Converting content into standardized formats
- Generating LLM-friendly PDFs with proper structure
- Making post-cutoff knowledge accessible
- Website Processing
  - Full JavaScript support
  - Proper handling of relative paths
  - Automatic resource collection
- Document Generation
  - Clean PDF output with proper formatting
  - Automatic page breaks
  - Table of contents generation
  - Custom CSS and print styles support
- Configuration Options
  - Configurable margins and layout
  - Environment variable support
  - Progress tracking with color output
  - Debug and quiet modes
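As a rough illustration of what handling relative paths involves, the sketch below resolves links against the URL of the page that contains them, using only the standard library. The helper name `resolve_links` is hypothetical and not part of web2llm; the tool's own preprocessing may work differently.

```python
from urllib.parse import urldefrag, urljoin


def resolve_links(page_url, hrefs):
    """Resolve hrefs found on page_url to absolute URLs (hypothetical helper)."""
    resolved = []
    for href in hrefs:
        # urljoin handles "../", "./" and root-relative paths.
        absolute = urljoin(page_url, href)
        # Drop "#section" anchors so each page is listed once.
        absolute, _fragment = urldefrag(absolute)
        resolved.append(absolute)
    return resolved


links = resolve_links(
    "https://example.com/docs/guide/index.html",
    ["../api/", "install.html#linux", "/img/logo.svg"],
)
print(links)
```

Resolving every link to an absolute URL before merging pages is one way to keep references valid once the original directory layout is gone.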
- Python 3.8 or higher
- Conda (Miniconda or Anaconda)
- HTTrack
  - macOS:
    brew install httrack
  - Linux:
    apt-get install httrack
  - Windows: Download from the HTTrack website
- wkhtmltopdf (installed automatically via conda)
- Install HTTrack for your OS (see Prerequisites)
- Clone the repository and set up the environment:
# Create and activate conda environment
conda env create -f environment.yml
conda activate web2llm
# Install package
pip install -e .
Basic conversion:
python -m web2llm https://example.com --output docs.pdf
- url: Target website URL (required)
- --output, -o: Output PDF path (required)
- --debug: Keep temporary files and debug info
- --quiet, -q: Suppress progress output
- --skip-download: Use existing files
- --download-only: Skip conversion
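The options above map naturally onto an `argparse` parser. The following is a minimal sketch of such a parser, written only from the option list in this README; the function name `build_parser` and the help strings are assumptions, and the actual web2llm CLI may be implemented differently.

```python
import argparse


def build_parser():
    """Build a parser mirroring the documented options (illustrative only)."""
    parser = argparse.ArgumentParser(prog="web2llm")
    parser.add_argument("url", help="Target website URL")
    parser.add_argument("--output", "-o", required=True, help="Output PDF path")
    parser.add_argument("--debug", action="store_true",
                        help="Keep temporary files and debug info")
    parser.add_argument("--quiet", "-q", action="store_true",
                        help="Suppress progress output")
    parser.add_argument("--skip-download", action="store_true",
                        help="Use existing files")
    parser.add_argument("--download-only", action="store_true",
                        help="Skip conversion")
    return parser


# Parse a sample command line instead of sys.argv.
args = build_parser().parse_args(["https://example.com", "-o", "docs.pdf", "--quiet"])
print(args.url, args.output, args.quiet, args.skip_download)
```

Note that `argparse` exposes `--skip-download` as `args.skip_download`, replacing the hyphen with an underscore.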
- Convert Pydantic AI docs:
python -m web2llm https://ai.pydantic.dev/ --output pydantic_ai.pdf
- Debug mode:
python -m web2llm https://example.com --output docs.pdf --debug
- Quiet mode for scripts:
python -m web2llm https://example.com --output docs.pdf --quiet
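For use from another Python script, the same commands can be launched with `subprocess`. This is a sketch under assumptions: the helper names `build_command` and `convert` are hypothetical, and it assumes web2llm is installed in the current interpreter's environment and exits with a non-zero status on failure, which this README does not confirm.

```python
import subprocess
import sys


def build_command(url, output, quiet=True):
    """Assemble the documented command line as an argument list."""
    cmd = [sys.executable, "-m", "web2llm", url, "--output", output]
    if quiet:
        cmd.append("--quiet")  # suppress progress output in scripts
    return cmd


def convert(url, output):
    """Run the conversion; raises CalledProcessError on a non-zero exit
    (assumes the tool signals failure through its exit code)."""
    subprocess.run(build_command(url, output), check=True)


# Only build the command here; call convert(...) to actually run it.
cmd = build_command("https://ai.pydantic.dev/", "pydantic_ai.pdf")
print(cmd[1:])
```

Passing the arguments as a list avoids shell quoting problems with URLs that contain characters such as `&` or `?`.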
Set via environment variables or a .env file:
- DOWNLOAD_DIR: Temporary files location
- OUTPUT_DIR: PDF output location
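A minimal sketch of how these two variables could be read in Python follows. The function name `load_settings` and the fallback values `./downloads` and `./output` are assumptions made for illustration; this README does not state the tool's real defaults. Loading a .env file is not shown; a package such as python-dotenv is one common way to do that.

```python
import os
from pathlib import Path


def load_settings(env=None):
    """Read the documented variables from a mapping (defaults are assumed)."""
    env = os.environ if env is None else env
    return {
        # Fallback paths are placeholders, not web2llm's actual defaults.
        "download_dir": Path(env.get("DOWNLOAD_DIR", "./downloads")),
        "output_dir": Path(env.get("OUTPUT_DIR", "./output")),
    }


# Pass an explicit mapping so the example does not depend on the real environment.
settings = load_settings({"DOWNLOAD_DIR": "/tmp/web2llm"})
print(settings["download_dir"], settings["output_dir"])
```

Accepting the mapping as a parameter keeps the lookup easy to test without modifying `os.environ`.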
Run tests with pytest:
# Run all tests (via python)
python -m pytest tests/ -v
# Run all tests (directly)
pytest tests/ -v
# Run tests with coverage report
pytest tests/ --cov=web2llm --cov-report=term-missing
# Run specific test file
pytest tests/test_preprocessor.py -v
# Run specific test function
pytest tests/test_preprocessor.py::test_normalize_url -v
Tests cover:
- CLI functionality
- Website downloading
- HTML preprocessing
- SVG handling
- Tabbed content processing
- Document merging
- PDF conversion
Contributions welcome! Please feel free to submit a Pull Request.
This project is licensed under the MIT License - see the LICENSE file for details.