Joined September 2024
LMCache CLI just got a new startup banner! 🎨 Run any lmcache command, or start LMCache through the vLLM connector, to see the new banner with version info and CLI usage. Huge thanks to @this_will_echo for the contribution! Try it today from source, or look for it in the next release. Explore LMCache commands: docs.lmcache.ai/cli/index.ht… #LLM #AIInfrastructure #KVCache #LMCache
The LMCache Frontend Dashboard is here! 🎉 We've been wanting a simple way to keep an eye on a running LMCache deployment, and now there's one. The Frontend Dashboard is a lightweight web UI that lets you monitor and manage a whole fleet of LMCache multiprocess (MP) servers from a single browser tab. Here's what you'll find inside:
- Node tree: A collapsible map of your deployment. Each proxy node is an LMCache MP server; its leaf nodes are the cache-engine instances on that server that do the actual KV-cache work (store, lookup, and load). Expand any proxy to see what's running under it and how the cache is wired together.
- Aggregated metrics: GET /metrics rolls up Prometheus metrics from every leaf node into a single endpoint, so you don't have to scrape each worker by hand.
- Reverse proxy: /proxy2/{node_name}/{path} routes straight through to any node, letting you call its API right from the browser.
- Health check: GET /health returns {"status": "healthy"} for quick liveness probes.
Huge thanks to our core maintainer maobaolong for building and shipping this. 🙌 Want to try it? Launch instructions here: docs.lmcache.ai/mp/frontend_… #LLM #AIInfrastructure #KVCache #LMCache
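For anyone scripting against the dashboard, here is a minimal Python sketch of the three routes named above. The base address `http://localhost:8000` and the helper names are assumptions for illustration only; the real host and port depend on how the dashboard is launched.

```python
import json
import urllib.request
from urllib.parse import quote

# Hypothetical address: replace with wherever your dashboard is listening.
DASHBOARD = "http://localhost:8000"


def health_url(base: str = DASHBOARD) -> str:
    """URL of the liveness probe (GET /health)."""
    return f"{base.rstrip('/')}/health"


def metrics_url(base: str = DASHBOARD) -> str:
    """URL of the aggregated Prometheus metrics (GET /metrics)."""
    return f"{base.rstrip('/')}/metrics"


def proxy_url(node_name: str, path: str, base: str = DASHBOARD) -> str:
    """URL that follows the /proxy2/{node_name}/{path} pattern from the post."""
    return f"{base.rstrip('/')}/proxy2/{quote(node_name, safe='')}/{path.lstrip('/')}"


def is_healthy(payload: object) -> bool:
    """True only for the documented payload {"status": "healthy"}."""
    return isinstance(payload, dict) and payload.get("status") == "healthy"


def probe(base: str = DASHBOARD, timeout: float = 2.0) -> bool:
    """Call GET /health and report liveness; any network or JSON error counts as down."""
    try:
        with urllib.request.urlopen(health_url(base), timeout=timeout) as resp:
            return is_healthy(json.load(resp))
    except (OSError, ValueError):
        return False
```

Only `probe` touches the network; the URL builders can be reused with any HTTP client.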
🚀 LMCache now supports Google’s Gemma 4 family. Gemma 4 introduces a hybrid attention structure with both sliding-window and full-attention layers, using different KV cache layouts and block sizes. We’re excited to announce successful deployment and full support for the Gemma 4 family in LMCache. To enable this, LMCache introduces the Hybrid Memory Allocator (HMA), which can store, transfer, and retrieve multiple KV cache groups with different block sizes. This allows prefix caching and KV reuse to work seamlessly for hybrid models, including Gemma 4 variants such as gemma-4-12b. Learn more: docs.lmcache.ai/mp/hybrid_mo… Try the deployment recipe: docs.lmcache.ai/recipes/gemm… #LLM #vLLM #KVCache #Gemma4 #LMCache
LMCache Chinese documentation is now available! Can you help us improve it? As our community grows, we’re working to make LMCache more accessible to developers around the world. We’ve started building a pipeline to translate LMCache documentation into Chinese. While this gives us a strong starting point, technical translation still needs community review. For example, in the LLM serving context, “recipe” should not be translated literally as “食谱,” which means “cooking recipe,” but more naturally as “操作手册” (practical guide) or “配置示例” (configuration example), depending on the context. If you are new to LMCache and speak Chinese, this could be a great first PR: review the Chinese docs, improve technical accuracy, and get familiar with LMCache along the way. Come contribute to LMCache with us! #AI #inference #LMCache #KVCache
🔧 New in LMCache MP Server: /run_script Admin Endpoint. One of our core maintainers, maobaolong, introduced /run_script, a new admin endpoint for live debugging and tuning in the LMCache MP server. With /run_script, developers can inspect runtime state, adjust read/write TTLs, query L1 memory usage, and check server status, all without restarting or redeploying the server. Because it can access attributes through app.state.engine, changes such as TTL updates are re-read by the running code and take effect on the next read/write operation. 📖 Read the full beginner-friendly tutorial and implementation details here: linkedin.com/pulse/beginner-…
KV cache is becoming an independent AI-native data layer, shared across requests, clusters, and serving systems. LMCache is proud to help push this frontier forward as an open-source community. As this space continues to evolve and gain momentum, a new chapter begins for LMCache and the broader KV cache community. Read more: blog.lmcache.ai/en/2026/06/0…
Stay connected with LMCache:
  • Follow us on LinkedIn: linkedin.com/company/lmcache…
  • Join our Slack community: join.slack.com/t/lmcachework…
  • Follow our WeChat Official Account: drive.google.com/file/d/1a-S…
#AI #inference #LMCache #KVCache
Updated Slack invitation link: join.slack.com/t/lmcachework… This one should never expire 🫡

Dynamo with LMCache MP mode. We've updated the Dynamo integration to support LMCache's new multiprocess (MP) mode, complete with ready-to-run startup scripts. If you're serving with Dynamo, there's now a launch path for running LMCache as an out-of-process sidecar alongside the vLLM backend. Dynamo connects to the sidecar through LMCacheMPConnector, bringing the integration in line with LMCache's newer multiprocess architecture. Huge thanks to @shaoting_feng for making this possible! Up next: disaggregated serving support for MP mode in Dynamo. Stay tuned! 🚀 👉 Explore more: docs.nvidia.com/dynamo/dev/i… #AI #inference #LMCache #KVCache
New in LMCache: L2 adapter benchmark CLI. You can now benchmark the throughput of an L2 cache adapter's base operations (store / lookup / load) directly, without starting an inference engine or an LMCache MP server. The command only requires the adapter’s backing storage to be reachable, making it easier to test and compare L2 backends before plugging them into a full serving workflow. Try it with the L2 backend that best fits your workflow, whether that’s local filesystem, Redis, S3, or any other adapter. Read more and start testing: docs.lmcache.ai/cli/bench_l2… #AI #inference #LMCache #KVCache
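To make the idea concrete, here is a toy Python harness that times store / lookup / load against a storage adapter in isolation. It is not the lmcache CLI and does not use LMCache's adapter interface: `InMemoryAdapter`, `ops_per_second`, and `bench` are hypothetical names, and a dict stands in for the backing storage.

```python
import time
from typing import Callable, Dict, List


class InMemoryAdapter:
    """Hypothetical stand-in for an L2 adapter; a dict plays the backing store."""

    def __init__(self) -> None:
        self._data: Dict[str, bytes] = {}

    def store(self, key: str, value: bytes) -> None:
        self._data[key] = value

    def lookup(self, key: str) -> bool:
        return key in self._data

    def load(self, key: str) -> bytes:
        return self._data[key]


def ops_per_second(op: Callable[[str], object], keys: List[str]) -> float:
    """Run op once per key and return the measured operations per second."""
    start = time.perf_counter()
    for key in keys:
        op(key)
    elapsed = time.perf_counter() - start
    return len(keys) / elapsed if elapsed > 0 else float("inf")


def bench(adapter: InMemoryAdapter, num_keys: int = 1000, chunk_bytes: int = 4096) -> Dict[str, float]:
    """Time the three base operations in order: store, then lookup, then load."""
    keys = [f"chunk-{i}" for i in range(num_keys)]
    payload = bytes(chunk_bytes)  # zero-filled dummy KV chunk
    return {
        "store": ops_per_second(lambda k: adapter.store(k, payload), keys),
        "lookup": ops_per_second(adapter.lookup, keys),
        "load": ops_per_second(adapter.load, keys),
    }
```

Swapping `InMemoryAdapter` for a class that talks to a filesystem or Redis would exercise the same loop against real storage; for actual LMCache backends, use the benchmark command from the docs linked above.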
Congrats to @tensormesh on the funding! Tensormesh is among the major contributors to #LMCache. The investment from @CoreWeave, @nvidia and @AMD (among others) testifies to the important role #LMCache plays in AI infra today and tomorrow. BTW, Tensormesh is hiring engineers (full-time, part-time or spare-time) to work on LMCache! Shoot an email to hiring@tensormesh.ai if you are interested.
Today we announced $20M in new funding from investors including AMD Ventures, CoreWeave, NVentures, Valley Capital Partners, and Laude Ventures, bringing Tensormesh’s total funding to $24.5M. We’re also launching Tensormesh Inference into general availability. AI applications are moving into production, and inference costs are becoming harder to ignore. Agentic workflows repeatedly process the same prompts, context, conversation history, and tool definitions, driving up API costs on work that has already been done. Tensormesh changes that with caching-accelerated inference. We’re also introducing $0 cached input tokens across Tensormesh serverless deployments, so teams only pay when input tokens need to be processed, not when they can be served from cache. Read the full announcement: tensormesh.ai/blog-posts/ten…
Calling all non-CUDA users: LMCache MP mode now reaches beyond CUDA! On non-CUDA devices, LMCache MP can now use ZMQ (instead of CUDA IPC) to send the KV bytes. Until now, MP mode relied on CUDA IPC, which is not available on non-CUDA devices. To remove that limitation, community contributor hlin99 added a non-CUDA transfer path for CPU, XPU, HPU, and other non-CUDA environments. On these devices, the worker sends the actual KV bytes over the message queue instead:
gather paged KV -> CPU chunks -> serialize with pickle -> send bytes over ZMQ -> deserialize on the server -> write to L1
On CUDA devices, LMCache continues to use the existing CUDA IPC path, where the worker sends a lightweight handle and the server reads the worker’s GPU memory directly:
worker paged KV (GPU) -> LMCache reads via CUDA IPC -> GPU staging buffer -> L1 cache (CPU RAM)
In both paths, ZMQ serves as the control channel and carries messages such as REGISTER, PREPARE_STORE, and COMMIT_STORE. Compared with the CUDA path, the non-CUDA path adds two CPU-side copies, but it extends MP mode to non-CUDA environments. #KVCache #LMCache #AI #inference
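The non-CUDA pipeline above can be sketched in a few lines of Python. This is an illustration of the flow only, not LMCache's code: a `queue.Queue` stands in for the ZMQ socket so the example runs with the standard library alone, KV pages are modeled as plain byte strings, and every function name here is hypothetical.

```python
import pickle
import queue
from typing import Dict, List


def gather_chunks(paged_kv: List[bytes], pages_per_chunk: int) -> List[bytes]:
    """Gather paged KV (modeled as byte strings) into contiguous CPU chunks."""
    return [
        b"".join(paged_kv[i:i + pages_per_chunk])
        for i in range(0, len(paged_kv), pages_per_chunk)
    ]


def worker_send(channel: "queue.Queue[bytes]", request_id: str,
                paged_kv: List[bytes], pages_per_chunk: int = 2) -> None:
    """Worker side: gather, serialize with pickle, send the bytes over the channel."""
    chunks = gather_chunks(paged_kv, pages_per_chunk)
    channel.put(pickle.dumps({"request_id": request_id, "chunks": chunks}))


def server_receive(channel: "queue.Queue[bytes]",
                   l1_cache: Dict[str, List[bytes]]) -> str:
    """Server side: receive the bytes, deserialize, and write the chunks to L1."""
    # pickle.loads must only be used on trusted input, such as a local worker.
    message = pickle.loads(channel.get())
    l1_cache[message["request_id"]] = message["chunks"]
    return message["request_id"]


def demo() -> Dict[str, List[bytes]]:
    channel: "queue.Queue[bytes]" = queue.Queue()  # stand-in for the ZMQ socket
    l1_cache: Dict[str, List[bytes]] = {}          # stand-in for the L1 cache
    worker_send(channel, "req-1", [b"aa", b"bb", b"cc"])
    server_receive(channel, l1_cache)
    return l1_cache
```

In this model the payload travels through the message channel itself, which is the key difference from the CUDA IPC path, where only a handle is sent.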
New blog: When Open Source Meets Open Source: A Joint Effort Between LMCache and Mooncake. The story starts with the LMCache community building the foundation: the native connector framework, dynamic plugin loading, and the MooncakeStore L2 plugin path for MP mode. The Mooncake community then helped optimize the RDMA path step by step, adding L1 memory preregistration, batch operations, and dedicated worker lanes for different cache operations. Under Mooncake RDMA, this worker-lane design reduced lookup p99 from [?]6.8 ms to 0.48 ms! This was not a one-sided integration. LMCache brought the MP framework and native connector abstraction, and Mooncake brought deep storage and RDMA expertise. Together, the two communities built a stronger L2 KV cache integration for distributed LLM inference systems. Huge thanks to maobaolong, fangchizheng, chunxiaozheng, and everyone in both communities who helped make this happen! Read the full story: blog.lmcache.ai/en/2026/05/2… #KVCache #LMCache #AI #inference
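One way to picture "dedicated worker lanes" is a separate thread pool per operation type, so a slow store never sits in the queue ahead of a lookup. The Python sketch below shows that general pattern only; it is not the Mooncake or LMCache implementation, and `LaneDispatcher`, the lane names, and the lane sizes are all assumptions.

```python
import threading
from concurrent.futures import Future, ThreadPoolExecutor
from typing import Callable, Dict


class LaneDispatcher:
    """Route each cache operation to its own thread pool (its 'lane')."""

    def __init__(self, lanes: Dict[str, int]) -> None:
        # One executor per operation type; thread names record the lane.
        self._pools = {
            op: ThreadPoolExecutor(max_workers=workers, thread_name_prefix=f"{op}-lane")
            for op, workers in lanes.items()
        }

    def submit(self, op: str, fn: Callable, *args) -> Future:
        """Queue fn on the lane reserved for op; other lanes are unaffected."""
        return self._pools[op].submit(fn, *args)

    def shutdown(self) -> None:
        for pool in self._pools.values():
            pool.shutdown(wait=True)


def current_lane() -> str:
    """Name of the thread running this call, e.g. 'lookup-lane_0'."""
    return threading.current_thread().name


def demo() -> Dict[str, object]:
    cache: Dict[str, bytes] = {}
    dispatcher = LaneDispatcher({"store": 1, "lookup": 1, "load": 1})
    try:
        dispatcher.submit("store", cache.__setitem__, "k", b"v").result()
        return {
            "found": dispatcher.submit("lookup", cache.__contains__, "k").result(),
            "value": dispatcher.submit("load", cache.__getitem__, "k").result(),
            "lookup_thread": dispatcher.submit("lookup", current_lane).result(),
        }
    finally:
        dispatcher.shutdown()
```

Because each lane has its own queue and workers, latency-sensitive lookups are isolated from long-running stores and loads.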
PD Disaggregation unleashed! LMCache's new async PDBackend makes KV transfer much more efficient. In Prefill-Decode Disaggregation, a single LLM request is split across two types of nodes. A prefill node reads the prompt and produces the KV cache, while a decode node consumes that KV cache to generate tokens. The KV cache needs to move from the prefill node to the decode node over the network, typically through RDMA. In LMCache, the component responsible for moving these KV chunks is called the PDBackend. Before the asynchronous PDBackend, LMCache’s prefill workers sent KV cache chunks one at a time and waited for each transfer to finish before continuing. This worked for simple cases, but under chunked prefill, where a long prompt is split into multiple KV transfers, concurrent requests could deadlock. The new fully asynchronous PDBackend moves KV transfer off the critical path. Instead of blocking on each network transfer, the prefill worker can hand off KV chunks in the background and continue processing the next prompt. On the receiver side, LMCache also reserves enough buffer space for the whole request before the transfer starts, so each admitted request has enough room to finish. This update is a great effort from the LMCache community. As Prefill-Decode Disaggregation becomes more widely used, improvements like async PDBackend are essential for making KV cache transfer more reliable and scalable. Thank you to everyone in the LMCache community who helped shape, review, and harden this update! #KVCache #LMCache #AI #inference
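The two ideas in this post, background hand-off on the prefill side and whole-request reservation on the receiver side, can be sketched with asyncio. This is a simplified model rather than the PDBackend itself: `Receiver`, `transfer`, and `prefill_worker` are hypothetical names, buffer space is counted in chunks, and `asyncio.sleep(0)` stands in for a network transfer.

```python
import asyncio
from typing import Dict, List


class Receiver:
    """Decode-side buffer that admits a request only if all its chunks fit."""

    def __init__(self, capacity_chunks: int) -> None:
        self.capacity = capacity_chunks
        self._free = capacity_chunks
        self._space_freed = asyncio.Condition()
        self._buffers: Dict[str, List[bytes]] = {}

    async def reserve(self, request_id: str, num_chunks: int) -> None:
        """Wait until the whole request fits, then claim the space up front."""
        if num_chunks > self.capacity:
            raise ValueError("request can never fit in the buffer")
        async with self._space_freed:
            await self._space_freed.wait_for(lambda: self._free >= num_chunks)
            self._free -= num_chunks
            self._buffers[request_id] = []

    async def write(self, request_id: str, chunk: bytes) -> None:
        await asyncio.sleep(0)  # stand-in for one network transfer
        self._buffers[request_id].append(chunk)

    async def finish(self, request_id: str) -> List[bytes]:
        """Hand the chunks to the decoder and return the space to the pool."""
        async with self._space_freed:
            chunks = self._buffers.pop(request_id)
            self._free += len(chunks)
            self._space_freed.notify_all()
            return chunks


async def transfer(receiver: Receiver, request_id: str, chunks: List[bytes],
                   delivered: Dict[str, List[bytes]]) -> None:
    await receiver.reserve(request_id, len(chunks))
    for chunk in chunks:
        await receiver.write(request_id, chunk)
    delivered[request_id] = await receiver.finish(request_id)


async def prefill_worker(prompts: Dict[str, List[bytes]],
                         capacity_chunks: int) -> Dict[str, List[bytes]]:
    receiver = Receiver(capacity_chunks)
    delivered: Dict[str, List[bytes]] = {}
    background = []
    for request_id, chunks in prompts.items():
        # Hand off the transfer and move straight on to the next prompt.
        background.append(asyncio.create_task(transfer(receiver, request_id, chunks, delivered)))
    await asyncio.gather(*background)
    return delivered


def run_demo() -> Dict[str, List[bytes]]:
    prompts = {"req-a": [b"a0", b"a1", b"a2"], "req-b": [b"b0", b"b1", b"b2"]}
    # Capacity 4 cannot hold both requests at once, so req-b waits for req-a.
    return asyncio.run(prefill_worker(prompts, capacity_chunks=4))
```

Because a request reserves all of its space before sending anything, two half-transferred requests can never each hold part of the buffer while waiting on the other, which is the kind of deadlock the post describes.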