Description
Currently DocETL uses LiteLLM to communicate with a separate vLLM server, which introduces network overhead and prevents deep optimizations between pipeline orchestration and LLM inference. We should explore forking vLLM to create a tightly integrated engine that can run entire DocETL pipelines locally with significant performance gains.
Ideas to Explore
- **Prompt-Pipeline Co-optimization**: Rewrite prompts automatically to maximize prefix sharing. For example, when multiple operations process the same document field, reorder prompts to put the shared content first for better KV cache hits (see the first sketch below).
- **Memory Sharing**: DocETL operations often process the same documents sequentially, so the KV cache could be shared across pipeline stages rather than recomputed (second sketch below).
- **Batch Processing**: Without HTTP overhead, requests can be accumulated and batched more aggressively across pipeline operations, improving throughput (third sketch below).
- **Structured Output Fast Path**: DocETL always generates JSON, so schema validators could be pre-compiled and generation optimized specifically for that shape; speculative decoding could speed things up further (fourth sketch below).
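As a first sketch of the prompt reordering idea, the function below (hypothetical, not an existing DocETL or vLLM API) puts the shared document text before the operation-specific instruction, so successive operations over the same document share the longest possible token prefix:

```python
# Hypothetical sketch of prefix-friendly prompt construction; the function
# name and prompt wording are illustrative, not part of DocETL.

def build_prompt(document_text: str, operation_instruction: str) -> str:
    """Put the shared document first and the per-operation task last.

    If the task came first, every operation would start with a different
    token sequence and no KV-cache prefix could be reused across them.
    """
    return (
        "Document:\n"
        f"{document_text}\n\n"
        f"Task: {operation_instruction}\n"
        "Respond with JSON only."
    )

doc = "...full document text..."
prompts = [
    build_prompt(doc, "Extract every person and organization mentioned."),
    build_prompt(doc, "Classify the document's overall sentiment."),
]
# Both prompts share the long "Document: ..." prefix, so a prefix-caching
# engine only recomputes the short task-specific suffixes.
```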
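Second, a rough approximation of cross-stage KV cache sharing is already possible by holding one in-process vLLM engine with automatic prefix caching enabled; a forked, integrated engine could make that reuse explicit rather than best-effort. A sketch assuming vLLM's offline Python API (the model name and prompts are illustrative):

```python
# Sketch: run two pipeline stages over the same documents in one in-process
# vLLM engine so their shared document prefixes hit the KV cache.
from vllm import LLM, SamplingParams

llm = LLM(model="Qwen/Qwen2.5-7B-Instruct", enable_prefix_caching=True)
params = SamplingParams(temperature=0.0, max_tokens=256)

documents = ["...document one...", "...document two..."]

# Stage 1 (e.g. a map) and stage 2 (e.g. a filter) reuse the same document
# prefix, so the second stage recomputes far fewer prompt tokens.
stage1 = llm.generate(
    [f"Document:\n{d}\n\nTask: extract the key facts as JSON." for d in documents], params
)
stage2 = llm.generate(
    [f"Document:\n{d}\n\nTask: does this mention a merger? Answer as JSON." for d in documents], params
)
```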
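Third, for aggressive batching, a hypothetical request accumulator could collect prompts from several operations and submit them in a single in-process call instead of one HTTP request per item (class and method names below are made up for illustration):

```python
# Hypothetical accumulator: operations enqueue prompts, then all queued work
# is flushed to the engine as one batch.
from vllm import LLM, SamplingParams

class BatchedEngine:
    def __init__(self, model: str):
        self.llm = LLM(model=model)
        self.pending: list[str] = []

    def submit(self, prompt: str) -> int:
        """Queue a prompt and return its index within the next batch."""
        self.pending.append(prompt)
        return len(self.pending) - 1

    def flush(self, params: SamplingParams) -> list[str]:
        """Run everything queued so far in one batched generate() call."""
        outputs = self.llm.generate(self.pending, params)
        self.pending = []
        return [o.outputs[0].text for o in outputs]

# Several operations call submit(), then the pipeline calls flush() once;
# vLLM's continuous batching packs the GPU far better than per-request
# HTTP round trips would.
```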
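Finally, one concrete piece of the structured-output fast path is compiling each operation's JSON Schema validator once at pipeline-build time instead of re-parsing the schema for every response; constrained decoding and speculative decoding would layer on top of this. A minimal sketch using the `jsonschema` package (the schema itself is illustrative):

```python
# Sketch: compile a JSON Schema validator once per operation and reuse it
# for every model response.
import json
from jsonschema import Draft202012Validator

ENTITY_SCHEMA = {
    "type": "object",
    "properties": {
        "entities": {"type": "array", "items": {"type": "string"}},
    },
    "required": ["entities"],
}

# Compiled once, not once per document.
ENTITY_VALIDATOR = Draft202012Validator(ENTITY_SCHEMA)

def parse_response(raw: str) -> dict:
    """Parse and validate a model response against the precompiled schema."""
    data = json.loads(raw)
    ENTITY_VALIDATOR.validate(data)  # raises jsonschema.ValidationError on mismatch
    return data

print(parse_response('{"entities": ["Acme Corp", "Jane Doe"]}'))
```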
The goal is to make it feasible to run sophisticated document processing pipelines entirely locally with open models like Qwen/Llama, potentially achieving 2-3x performance improvements over the current architecture.