Fork vLLM for co-optimized local DocETL pipeline execution #419

@shreyashankar

Description

Currently, DocETL uses LiteLLM to communicate with a separate vLLM server, which introduces network overhead and prevents deeper co-optimization between pipeline orchestration and LLM inference. We should explore forking vLLM to create a tightly integrated version that can run entire DocETL pipelines locally with significant performance improvements.
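
For context, a rough sketch of the two paths (model name, endpoint, API key, and prompt are illustrative placeholders, not DocETL's actual configuration): today each operation goes through LiteLLM over HTTP to a vLLM OpenAI-compatible server, whereas an embedded fork could call vLLM's offline API in-process.

```python
import litellm
from vllm import LLM, SamplingParams

# Current path: LiteLLM -> HTTP -> vLLM OpenAI-compatible server.
# (Model name, endpoint, and api_key below are placeholders.)
resp = litellm.completion(
    model="openai/Qwen/Qwen2.5-7B-Instruct",
    api_base="http://localhost:8000/v1",
    api_key="EMPTY",
    messages=[{"role": "user", "content": "Summarize: ..."}],
)

# In-process path a fork would build on: vLLM's offline API, no HTTP hop,
# and the orchestrator can see (and eventually steer) the scheduler directly.
llm = LLM(model="Qwen/Qwen2.5-7B-Instruct")
out = llm.generate(["Summarize: ..."], SamplingParams(max_tokens=256))
print(out[0].outputs[0].text)
```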

Ideas to Explore

Prompt-Pipeline Co-optimization: Rewrite prompts automatically to maximize prefix sharing. For example, when multiple operations process the same document field, reorder prompts to put shared content first for better KV cache hits.
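
As a rough illustration (the `build_prompt` helper and `<document>` tags are hypothetical, not DocETL's actual prompt format), the rewrite would hoist the shared document text to the front of every prompt so that successive operations over the same document share the longest possible token prefix:

```python
# Hypothetical helper: put the shared document text first and the
# per-operation instruction last, so prompts from different operations over
# the same document share a long common prefix that KV/prefix caching can hit.
def build_prompt(document: str, instruction: str) -> str:
    return (
        "You are processing the following document.\n\n"
        f"<document>\n{document}\n</document>\n\n"
        f"Task: {instruction}\n"
    )

doc = open("contract.txt").read()  # same (illustrative) document for both ops
p_map = build_prompt(doc, "Extract all party names as a JSON list.")
p_filter = build_prompt(doc, "Answer yes/no: does this contract mention arbitration?")
# p_map and p_filter now differ only after the shared <document> prefix.
```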

Memory Sharing: Since DocETL operations often process the same documents sequentially, we could share KV cache across pipeline stages rather than recomputing.
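
A minimal sketch of the baseline this would extend, using vLLM's existing automatic prefix caching (`enable_prefix_caching=True`); the model name and file path are placeholders, and a fork could go further by pinning the shared KV blocks for the lifetime of a document rather than relying on cache eviction:

```python
from vllm import LLM, SamplingParams

# Stage 2 reuses stage 1's KV blocks when both prompts begin with the same
# document text, because prefix caching keeps those blocks resident.
llm = LLM(model="Qwen/Qwen2.5-7B-Instruct", enable_prefix_caching=True)

shared = f"<document>\n{open('contract.txt').read()}\n</document>\n\n"
params = SamplingParams(max_tokens=128)

stage1 = llm.generate([shared + "Task: summarize the document."], params)
stage2 = llm.generate([shared + "Task: list all dates mentioned."], params)
# A fork could pin these shared blocks across pipeline stages explicitly
# instead of depending on the cache's eviction policy.
```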

Batch Processing: Without HTTP overhead, we can accumulate and batch requests more aggressively across pipeline operations, improving throughput.
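
Roughly, the orchestrator would collect all pending (document, operation) prompts for a stage and submit them as one batch, letting vLLM's continuous batching fill the GPU; the prompt construction below is a simplified placeholder:

```python
from vllm import LLM, SamplingParams

llm = LLM(model="Qwen/Qwen2.5-7B-Instruct")

# Instead of one HTTP request per (document, operation) pair, accumulate
# everything the next stage needs and hand vLLM a single batch.
documents = ["doc one text ...", "doc two text ...", "doc three text ..."]
operations = ["Summarize the document.", "Extract key entities as JSON."]

prompts = [f"{doc}\n\nTask: {op}" for doc in documents for op in operations]
outputs = llm.generate(prompts, SamplingParams(max_tokens=128))

for out in outputs:
    print(out.outputs[0].text[:80])
```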

Structured Output Fast Path: DocETL operations always generate JSON, so we could pre-compile schema validators and optimize generation specifically for that shape. We could also layer in speculative decoding.
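
A hedged sketch of the validator side, using `jsonschema` (the schema and helper below are hypothetical); the more interesting fork-level change would be pushing the schema into the decoder as constrained/guided decoding so invalid JSON is never sampled in the first place:

```python
import json
from jsonschema import Draft202012Validator

# Hypothetical per-operation schema, compiled once at pipeline build time
# rather than re-parsed on every call.
ENTITY_SCHEMA = {
    "type": "object",
    "properties": {
        "parties": {"type": "array", "items": {"type": "string"}},
        "effective_date": {"type": "string"},
    },
    "required": ["parties"],
}
VALIDATOR = Draft202012Validator(ENTITY_SCHEMA)

def parse_or_retry(raw_text: str) -> dict | None:
    """Validate a model completion against the pre-compiled schema."""
    try:
        obj = json.loads(raw_text)
    except json.JSONDecodeError:
        return None  # signal the pipeline to retry or repair
    return obj if VALIDATOR.is_valid(obj) else None
```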

The goal is to make it feasible to run sophisticated document processing pipelines entirely locally with open models like Qwen/Llama, potentially achieving 2-3x performance improvements over the current architecture.

Labels

efficiency: Making docetl operations run faster
good first research issue: Good for newcomers who want to get involved in research
