
10x Faster Long-Context LLM By Smart KV Cache Optimizations


lmcache logo

| Blog | Documentation | Join Slack | Interest Form | Official Email |

💡 What is LMCache?

LMCache is an LLM serving engine extension that reduces time-to-first-token (TTFT) and increases throughput, especially in long-context scenarios. By storing the KV caches of reusable texts in various locations (GPU, CPU DRAM, local disk), LMCache reuses the KV cache of any repeated text (not necessarily a prefix) in any serving engine instance. LMCache thus saves precious GPU cycles and reduces response delay for users.

By combining LMCache with vLLM, LMCache achieves 3-10x delay savings and GPU cycle reductions in many LLM use cases, including multi-round QA and RAG.

Try LMCache with the pre-built vLLM Docker images here.

🚀 Performance snapshot

(Performance snapshot figure)

💻 Quickstart

LMCache integrates with the latest vLLM (0.6.2). To install LMCache, use the following command:

# requires python >= 3.10 and nvcc >= 12.1
pip install lmcache lmcache_vllm
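
To verify the installation, a quick import check can be run (an informal sanity test, not from the official docs):

```python
# Informal sanity check: both packages installed by the command above
# should be importable from the current Python environment.
import lmcache
import lmcache_vllm

print("lmcache and lmcache_vllm imported successfully")
```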

LMCache has the same interface as vLLM (both online serving and offline inference). For online serving, you can start an OpenAI-API-compatible vLLM server with LMCache via:

lmcache_vllm serve lmsys/longchat-7b-16k --gpu-memory-utilization 0.8
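
Once the server is up, it can be queried like any OpenAI-compatible endpoint. Below is a minimal client sketch (not part of the LMCache docs); it assumes vLLM's default port 8000 and /v1 route prefix, and uses a hypothetical local file as the long context:

```python
# Minimal client sketch for the OpenAI-compatible server started above.
# Assumptions: default port 8000, /v1 prefix, the `openai` Python client
# installed, and "long_document.txt" as a hypothetical long input.
from openai import OpenAI

client = OpenAI(base_url="http://localhost:8000/v1", api_key="dummy")

long_context = open("long_document.txt").read()
response = client.completions.create(
    model="lmsys/longchat-7b-16k",
    prompt=long_context + "\n\nQuestion: Summarize the document.\nAnswer:",
    max_tokens=128,
)
print(response.choices[0].text)
```

On a repeated query over the same long context, LMCache can serve the stored KV cache instead of recomputing the prefill, which is where the TTFT savings come from.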

To use vLLM's offline inference with LMCache, simply prepend lmcache_vllm to the imports of vLLM components. For example:

import lmcache_vllm.vllm as vllm
from lmcache_vllm.vllm import LLM 
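
A minimal offline-inference sketch is shown below. It assumes the lmcache_vllm.vllm wrapper re-exports vLLM's SamplingParams alongside LLM; the model name and sampling settings are illustrative:

```python
# Minimal offline-inference sketch through the lmcache_vllm wrapper.
# Assumption: SamplingParams is re-exported next to LLM, mirroring vLLM.
from lmcache_vllm.vllm import LLM, SamplingParams

llm = LLM(model="lmsys/longchat-7b-16k", gpu_memory_utilization=0.8)
sampling_params = SamplingParams(temperature=0.0, max_tokens=64)

outputs = llm.generate(
    ["Explain in one sentence what a KV cache is."],
    sampling_params,
)
for output in outputs:
    print(output.outputs[0].text)
```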

More detailed documentation will be available soon.

- Sharing KV cache across multiple vLLM instances

LMCache supports sharing KV caches across different vLLM instances via the lmcache.server module. Here is a quick guide:

# Start lmcache server
lmcache_server localhost 65432

Then, start two vLLM instances with the LMCache config file:

wget https://raw.githubusercontent.com/LMCache/LMCache/refs/heads/dev/examples/example.yaml

# start the first vLLM instance
LMCACHE_CONFIG_FILE=example.yaml CUDA_VISIBLE_DEVICES=0 lmcache_vllm serve lmsys/longchat-7b-16k --gpu-memory-utilization 0.8 --port 8000

# start the second vLLM instance
LMCACHE_CONFIG_FILE=example.yaml CUDA_VISIBLE_DEVICES=1 lmcache_vllm serve lmsys/longchat-7b-16k --gpu-memory-utilization 0.8 --port 8001
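
With both instances pointed at the same LMCache server, a KV cache produced by one instance can be reused by the other. The sketch below (an assumed workflow, not from the LMCache docs) sends the same long prefix to both ports through the standard OpenAI-compatible API; the second request can hit the shared cache instead of recomputing the prefill:

```python
# Assumed workflow sketch: query both vLLM instances with the same long
# prefix. The first request populates the shared LMCache server; the
# second instance can then fetch the cached KV rather than recompute it.
# "long_document.txt" is a hypothetical shared context.
from openai import OpenAI

long_prefix = open("long_document.txt").read()
question = "\n\nQuestion: What is the main topic of the document?\nAnswer:"

for port in (8000, 8001):
    client = OpenAI(base_url=f"http://localhost:{port}/v1", api_key="dummy")
    response = client.completions.create(
        model="lmsys/longchat-7b-16k",
        prompt=long_prefix + question,
        max_tokens=64,
    )
    print(f"port {port}:", response.choices[0].text)
```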

- What's next

We also provide multiple Docker-based demos in the 🔗 LMCache-demos repo, covering several end-to-end use cases.

Interested in Connecting?

Fill out the interest form and our team will reach out to you! https://forms.gle/mQfQDUXbKfp2St1z7

πŸ›£οΈ Incoming Milestones

  • First release of LMCache
  • Support installation through pip install and integration with the latest vLLM
  • Stable support for non-prefix KV caches
  • User and developer documentation

📖 Blogs and documentation

Our blog posts and documentation are available online.

Community meeting

Citation

If you use LMCache for your research, please cite our papers:

@inproceedings{liu2024cachegen,
  title={CacheGen: KV Cache Compression and Streaming for Fast Large Language Model Serving},
  author={Liu, Yuhan and Li, Hanchen and Cheng, Yihua and Ray, Siddhant and Huang, Yuyang and Zhang, Qizheng and Du, Kuntai and Yao, Jiayi and Lu, Shan and Ananthanarayanan, Ganesh and others},
  booktitle={Proceedings of the ACM SIGCOMM 2024 Conference},
  pages={38--56},
  year={2024}
}

@article{cheng2024large,
  title={Do Large Language Models Need a Content Delivery Network?},
  author={Cheng, Yihua and Du, Kuntai and Yao, Jiayi and Jiang, Junchen},
  journal={arXiv preprint arXiv:2409.13761},
  year={2024}
}

@article{yao2024cacheblend,
  title={CacheBlend: Fast Large Language Model Serving with Cached Knowledge Fusion},
  author={Yao, Jiayi and Li, Hanchen and Liu, Yuhan and Ray, Siddhant and Cheng, Yihua and Zhang, Qizheng and Du, Kuntai and Lu, Shan and Jiang, Junchen},
  journal={arXiv preprint arXiv:2405.16444},
  year={2024}
}
