Description
Currently DocETL uses LiteLLM to communicate with a separate vLLM server, which introduces network overhead and prevents deep optimizations between pipeline orchestration and LLM inference. We should explore forking vLLM to create a tightly integrated engine that can run entire DocETL pipelines locally with significant performance gains.
Ideas to Explore
- **Prompt-Pipeline Co-optimization**: Rewrite prompts automatically to maximize prefix sharing. For example, when multiple operations process the same document field, reorder prompts to put the shared content first for better KV cache hits (see the first sketch below).
- **Memory Sharing**: DocETL operations often process the same documents sequentially, so the KV cache could be shared across pipeline stages rather than recomputed (second sketch below).
- **Batch Processing**: Without HTTP overhead, requests can be accumulated and batched more aggressively across pipeline operations, improving throughput (third sketch below).
- **Structured Output Fast Path**: DocETL always generates JSON, so schema validators could be pre-compiled and generation optimized specifically for that shape; speculative decoding could speed things up further (fourth sketch below).
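As a first sketch of the prompt reordering idea, the function below (hypothetical, not an existing DocETL or vLLM API) puts the shared document text before the operation-specific instruction, so successive operations over the same document share the longest possible token prefix:

```python
# Hypothetical sketch of prefix-friendly prompt construction; the function
# name and prompt wording are illustrative, not part of DocETL.

def build_prompt(document_text: str, operation_instruction: str) -> str:
    """Put the shared document first and the per-operation task last.

    If the task came first, every operation would start with a different
    token sequence and no KV-cache prefix could be reused across them.
    """
    return (
        "Document:\n"
        f"{document_text}\n\n"
        f"Task: {operation_instruction}\n"
        "Respond with JSON only."
    )

doc = "...full document text..."
prompts = [
    build_prompt(doc, "Extract every person and organization mentioned."),
    build_prompt(doc, "Classify the document's overall sentiment."),
]
# Both prompts share the long "Document: ..." prefix, so a prefix-caching
# engine only recomputes the short task-specific suffixes.
```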
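Second, a rough approximation of cross-stage KV cache sharing is already possible by holding one in-process vLLM engine with automatic prefix caching enabled; a forked, integrated engine could make that reuse explicit rather than best-effort. A sketch assuming vLLM's offline Python API (the model name and prompts are illustrative):

```python
# Sketch: run two pipeline stages over the same documents in one in-process
# vLLM engine so their shared document prefixes hit the KV cache.
from vllm import LLM, SamplingParams

llm = LLM(model="Qwen/Qwen2.5-7B-Instruct", enable_prefix_caching=True)
params = SamplingParams(temperature=0.0, max_tokens=256)

documents = ["...document one...", "...document two..."]

# Stage 1 (e.g. a map) and stage 2 (e.g. a filter) reuse the same document
# prefix, so the second stage recomputes far fewer prompt tokens.
stage1 = llm.generate(
    [f"Document:\n{d}\n\nTask: extract the key facts as JSON." for d in documents], params
)
stage2 = llm.generate(
    [f"Document:\n{d}\n\nTask: does this mention a merger? Answer as JSON." for d in documents], params
)
```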
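Third, for aggressive batching, a hypothetical request accumulator could collect prompts from several operations and submit them in a single in-process call instead of one HTTP request per item (class and method names below are made up for illustration):

```python
# Hypothetical accumulator: operations enqueue prompts, then all queued work
# is flushed to the engine as one batch.
from vllm import LLM, SamplingParams

class BatchedEngine:
    def __init__(self, model: str):
        self.llm = LLM(model=model)
        self.pending: list[str] = []

    def submit(self, prompt: str) -> int:
        """Queue a prompt and return its index within the next batch."""
        self.pending.append(prompt)
        return len(self.pending) - 1

    def flush(self, params: SamplingParams) -> list[str]:
        """Run everything queued so far in one batched generate() call."""
        outputs = self.llm.generate(self.pending, params)
        self.pending = []
        return [o.outputs[0].text for o in outputs]

# Several operations call submit(), then the pipeline calls flush() once;
# vLLM's continuous batching packs the GPU far better than per-request
# HTTP round trips would.
```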
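Finally, one concrete piece of the structured-output fast path is compiling each operation's JSON Schema validator once at pipeline-build time instead of re-parsing the schema for every response; constrained decoding and speculative decoding would layer on top of this. A minimal sketch using the `jsonschema` package (the schema itself is illustrative):

```python
# Sketch: compile a JSON Schema validator once per operation and reuse it
# for every model response.
import json
from jsonschema import Draft202012Validator

ENTITY_SCHEMA = {
    "type": "object",
    "properties": {
        "entities": {"type": "array", "items": {"type": "string"}},
    },
    "required": ["entities"],
}

# Compiled once, not once per document.
ENTITY_VALIDATOR = Draft202012Validator(ENTITY_SCHEMA)

def parse_response(raw: str) -> dict:
    """Parse and validate a model response against the precompiled schema."""
    data = json.loads(raw)
    ENTITY_VALIDATOR.validate(data)  # raises jsonschema.ValidationError on mismatch
    return data

print(parse_response('{"entities": ["Acme Corp", "Jane Doe"]}'))
```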
The goal is to make it feasible to run sophisticated document processing pipelines entirely locally with open models like Qwen/Llama, potentially achieving 2-3x performance improvements over the current architecture.