Lightnews — Scholar-powered news

Tensormesh

@tensormesh.bsky.social

LMCache Multi-node P2P CPU Memory Sharing & Control: From Experimental Feature to Production

Baolong Mao (Tencent), Chunxiao Zheng (Tencent), Weishu Deng (Tensormesh), Darren Peng (Tensormesh), Samuel Shen (Tensormesh)

What is P2P and what does it promise? In this blog post, we will go over: a short motivation of the P2PBackend in LMCache and how it differs from existing KV caching solutions; how to run and benchmark performance on the P2PBackend; design decisions and pain points in making P2P…

blog.lmcache.ai

January 22, 2026 at 1:39 AM

ceph.io

@ceph.io

Check out the Ceph Blog on KV Caching with vLLM, LMCache, and Ceph.

With inference making up about 90% of #ML costs and #AI spending expected to hit $307B in 2025, efficient #KV caching is vital.

Read more: t.ly/KVCachCeph
#Ceph #OpenSourceStorage #CephCommunity

December 15, 2025 at 11:01 AM

Tensormesh

@tensormesh.bsky.social

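LMCACHE: An Efficient KV Cache Layer for Enterprise-Scale LLM Inference

Authors: Yihua Cheng, Yuhan Liu, Jiayi Yao*, Yuwei An, Xiaokun Chen, Shaoting Feng, Yuyang Huang, Samuel Shen, Kuntai Du, Junchen Jiang. Affiliations: Tensormesh & University of Chicago.

Abstract: Today's large language model (LLM) inference systems treat individual inference engines and requests independently for simplicity, which causes significant resource inefficiencies. There are proposals to avoid redundant computation by reusing KV caches across requests, and to raise GPU utilization by splitting a single request across different inference engines, but these proposals depend on efficient KV cache offloading and transfer across engines and requests. This paper presents LMCACHE, the first and so far the most efficient open-source KV caching solution. It extracts and stores the KV caches generated by mainstream LLM inference engines (vLLM and SGLang) and shares them across engines and requests. LMCACHE exposes KV caching in the LLM engine interface, effectively turning LLM engines from independent token processors into a collection of engines that use the KV cache as a storage and communication medium. Specifically, it supports both cache offloading (prefix reuse across requests) and prefill-decode (PD) disaggregation (cache transfer across engines). LMCACHE's high performance and wide adoption stem from three core contributions: (1) a highly optimized KV cache data transfer mechanism, including batched data transfer operations and pipelining of compute and I/O; (2) a modular KV cache connector component that decouples LMCACHE from fast-moving inference engines; (3) a complete control API (such as cache pinning, lookup, cleanup, migration, and compression) that supports flexible cache orchestration across GPU, CPU, storage devices, and the network layer. Evaluation shows that LMCACHE combined with vLLM improves throughput by up to 15x on workloads such as multi-round question answering and document analysis. As the community grows, LMCACHE has been adopted by many enterprise inference systems, offering valuable practical lessons for future KV caching solutions. Source code: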

blog.lmcache.ai

November 25, 2025 at 3:39 AM

Tensormesh

@tensormesh.bsky.social

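Tensormesh unveiled and LMCache joins the PyTorch Foundation

Author: Junchen Jiang

Announcing Tensormesh

First, I want to repeat here what I posted on the LMCache #general Slack channel last week:

I am delighted to announce that the founding team of LMCache decided a few months ago to form a company named Tensormesh. With the beta release of our first product, we have decided to give Tensormesh its official debut! Our product, which shares the company's name, is a SaaS front end that lets you launch any open-weight model on GPUs from the hardware vendors we support, while automatically tuning LMCache and vLLM parameters to deliver the best performance and cost savings when running the model. If you want to try it, click this link to sign up online. The first 100 beta testers will receive $100 in GPU credits on the platform 🔥🔥🔥

What does this mean for the LMCache community? It should not be read as a big change, because LMCache remains the foundation of Tensormesh and we are committed to its bright future. We pledge that the project will keep its open governance. If Tensormesh succeeds, it may mean more contributors joining the LMCache project. In the short term, please think of Tensormesh as one of the official sponsors of the LMCache community.

PyTorch Foundation

We are now officially announcing that LMCache has become an ecosystem project under the PyTorch Foundation (see the official PyTorch blog). We are honored that our community can join a foundation that hosts many of the projects we depend on, such as PyTorch, vLLM, and DeepSpeed; it also affirms our commitment to the project and to its open governance. Last week I attended the PyTorch Conference and gave a talk, and met with many contributors and partners. LMCache had a booth at the conference, and many people stopped by to talk with us. The atmosphere was great, and people shared a lot of insightful views. Many veterans said the event had the same "energy" as the early KubeCon. I am truly glad that we are part of this outstanding open ecosystem, and I look forward to building a bright future together with everyone! 💪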

blog.lmcache.ai

November 23, 2025 at 3:57 AM

Tensormesh

@tensormesh.bsky.social

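LMCache Lab: Only for prefilling? We cut decoding latency by 60% too!

Author: Kuntai Du

TL;DR: 🚀 LMCache Lab uses speculative decoding to cut decoding latency in code and text editing tasks by 60%! ⚡

---

You may know LMCache Lab for its KV cache optimizations, which make LLM prefilling a breeze. But that is not all! We are now also focused on speeding up the decoding phase, so your LLM agents can generate new content even faster. In other words, for the same workload you can rent fewer machines and cut your LLM serving bill. 🎉💸

## What did we optimize in the decoding phase?

We found that speculative decoding can reduce token generation time (that is, the time per output token) by 60% in code and text editing tasks! Why? Because text and code editing tasks often reuse phrases that already exist, and speculative decoding exploits exactly this to speed up generation. Rest assured: speculative decoding does not change your outputs, it only gets them to you faster!

## Benchmarks 📊

We tested speculative decoding on the docstrings of Python files in the popular open-source project vLLM. The results are as follows:

(Figure: speculative decoding performance comparison, showing a 60% performance improvement over vLLM without speculative decoding.)

## Implementation 🔧

We are not stopping here! We also noticed that the speedup drops slightly when requests surge.

(Figure: the speedup drops slightly when requests surge.)

Therefore, we are releasing speculative decoding as an early access feature, and we will keep developing automated solutions to help you squeeze out its full potential.

## Want to try it? 🙌

Want to try it in your own application right away? Our new one-click deployment platform, LMIgnite, lets you try LMCache Lab's latest technology with zero barrier to entry, either on your own cloud machines or on a local cluster! [Sign up now](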
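The post above makes two claims: editing tasks reuse existing phrases, and speculative decoding leaves the output unchanged. A minimal sketch of how both can hold, assuming n-gram (prompt-lookup) drafting with greedy verification. The post does not say which drafting method LMCache Lab uses, so this is an illustration only. `propose_draft`, `make_editor`, and the toy "model" are hypothetical names, not LMCache or vLLM APIs.

```python
from typing import Callable, List, Sequence, Tuple

EOS = "<eos>"
NextToken = Callable[[Sequence[str]], str]


def propose_draft(context: Sequence[str], ngram: int = 2, max_draft: int = 4) -> List[str]:
    """Guess upcoming tokens by copying what followed the most recent
    earlier occurrence of the last `ngram` tokens (assumed drafting rule)."""
    if len(context) < ngram:
        return []
    tail = list(context[-ngram:])
    # Scan right to left, skipping the tail's own position.
    for start in range(len(context) - ngram - 1, -1, -1):
        if list(context[start:start + ngram]) == tail:
            return list(context[start + ngram:start + ngram + max_draft])
    return []


def greedy_decode(prompt: Sequence[str], next_token: NextToken, max_new: int) -> Tuple[List[str], int]:
    """Baseline: one model pass per generated token."""
    out, passes = list(prompt), 0
    while passes < max_new:
        tok = next_token(out)
        passes += 1
        out.append(tok)
        if tok == EOS:
            break
    return out[len(prompt):], passes


def speculative_decode(prompt: Sequence[str], next_token: NextToken, max_new: int) -> Tuple[List[str], int]:
    """Draft several tokens, keep the prefix the model agrees with, then add
    the model's own next token. Each loop iteration stands for one batched
    verification pass of a real engine."""
    out, passes, produced = list(prompt), 0, 0
    while produced < max_new:
        passes += 1
        accepted: List[str] = []
        for tok in propose_draft(out):
            if next_token(out + accepted) != tok:
                break  # first disagreement: discard the rest of the draft
            accepted.append(tok)
        accepted.append(next_token(out + accepted))  # the model's own token
        for tok in accepted:
            if produced >= max_new:
                break
            out.append(tok)
            produced += 1
            if tok == EOS:
                return out[len(prompt):], passes
    return out[len(prompt):], passes


def make_editor(prompt: Sequence[str], edited: Sequence[str]) -> NextToken:
    """Toy stand-in for an LLM: it deterministically emits `edited`, then EOS."""
    n = len(prompt)

    def next_token(context: Sequence[str]) -> str:
        i = len(context) - n
        return edited[i] if i < len(edited) else EOS

    return next_token


# Editing task: rename the variable `a` to `x`; most tokens are reused.
prompt = "def add ( a , b ) : return a + b".split()
edited = "def add ( x , b ) : return x + b".split()
model = make_editor(prompt, edited)

greedy_tokens, greedy_passes = greedy_decode(prompt, model, max_new=20)
spec_tokens, spec_passes = speculative_decode(prompt, model, max_new=20)
print(spec_tokens == greedy_tokens, spec_passes, greedy_passes)
```

In this toy run the two decoders return the same tokens, and the speculative loop needs fewer passes because runs of tokens copied from the prompt are accepted together. Any real speedup depends on the workload and the serving engine.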

blog.lmcache.ai

November 23, 2025 at 3:11 AM

Tensormesh

@tensormesh.bsky.social

LMCache supports GPT-OSS (20B/120B) on day one

Authors: Yihua, Kobe

LMCache now offers day-one support for OpenAI's newly released GPT-OSS models (20 billion and 120 billion parameters)! This post is a complete guide to deploying GPT-OSS models with vLLM + LMCache and getting a significant performance boost from CPU offloading.

## Step 1: Install the GPT-OSS version of vLLM

### Installation

```bash
uv pip install --pre vllm==0.10.1+gptoss \
  --extra-index-url \
  --extra-index-url \
  --index-strategy unsafe-best-match
```

### Verify the installation

```bash
vllm serve openai/gpt-oss-120b --max-model-len 32768…
```

blog.lmcache.ai

November 23, 2025 at 2:04 AM

Tensormesh

@tensormesh.bsky.social

WOOT! #LMCache in the CNCF Technology Radar. cncf.io/reports/cncf...
That's golden to our community and everyone
@tensormesh

#kubecon #cncf #AI #LLM #inference

November 11, 2025 at 7:54 PM

cloudnativeboy.bsky.social

@cloudnativeboy.bsky.social

In large-scale LLM inference scenarios, efficient memory management & KV cache optimization are crucial. LMCache, as a KV cache management system specifically designed for vLLM, requires more flexible extension mechanisms to meet the needs of monitoring, troubleshooting & more blog.lmcache.ai/en/2025/09/2...

Implementing LMCache Plugin Framework & lmcache_frontend: Design Philosophy | LMCache Blog

A flexible plugin system for enhanced observability and management Abstract In large-scale language model inference scenarios, efficient memory management and KV cache optimization are crucial. LMCach...

blog.lmcache.ai

November 11, 2025 at 3:25 PM

Tensormesh

@tensormesh.bsky.social

Tensormesh unveiled and LMCache joins the PyTorch Foundation

Announcing Tensormesh First I wanted to repeat here what I posted on the LMCache #general Slack channel last week: I am delighted to…

https://blog.lmcache.ai/en/2025/10/31/tensormesh-unveiled-and-lmcache-joins-the-pytorch-foundation/

October 31, 2025 at 4:01 PM

Tensormesh

@tensormesh.bsky.social

Do you want to compare the caching performance of your LLM serving stack? We've put together a simple command line tool to do so. Introducing Tensormesh Benchmark.
tensormesh.ai/blog-posts/t...

#llm #ai #kvcache #lmcache #vllm #benchmarking

Comparing LLM Serving Stacks: Introduction to Tensormesh Benchmark | Tensormesh

Tensormesh cuts inference costs and latency by up to 10x with enterprise-grade, AI-native caching.

tensormesh.ai

October 27, 2025 at 7:44 PM

ぶちナース

@buchinurse.bsky.social

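Article summary: Tensormesh has raised $4.5 million to squeeze more inference out of AI server loads. As demand for AI infrastructure grows, efficient use of GPUs is essential. Tensormesh's system makes effective use of GPU memory and reuses past data, which greatly increases inference capacity for the same server load. The company will use the funding to commercialize the open-source utility LMCache, which has the potential to cut inference costs by up to 10x. Tensormesh is looking to turn its academic reputation into a business.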

ぶちナース @buchinurse.bsky.social · Oct 24

An article about "tensormesh inference ai": https://techcrunch.com/2025/10/23/tensormesh-raises-4-5m-to-squeeze-more-inference-out-of-ai-server-loads/

Tensormesh raises $4.5M to squeeze more inference out of AI server loads | TechCrunch

Tensormesh uses an expanded form of KV caching to make inference loads as much as 10 times more efficient.

techcrunch.com

October 24, 2025 at 4:49 PM

William Oliveira

@1ilhas.bsky.social

LMCACHE, an efficient open-source KV caching solution designed for offloading and communicating KV cache across LLM inference engines and queries

arxiv.org/pdf/2510.096...

arxiv.org

October 19, 2025 at 6:23 PM

Suraj Deshmukh | सुरज देशमुख

@suraj.io

LMCache: An Efficient KV Cache Layer for Enterprise-Scale LLM Inference
arxiv.org/abs/2510.09665

Today's LLM inference systems treat individual engines and queries independently for simplicity, but this causes significant resource inefficiencies. While there are proposals to avoid redundant compu...

arxiv.org

October 18, 2025 at 10:34 PM

Suraj Deshmukh | सुरज देशमुख

@suraj.io

Join me and Yuhan Liu for our talk at the upcoming #Kubecon NA 2025 in Atlanta: sched.co/27FcQ. We will talk about increasing efficiency while serving #LLMs using #vLLM & #LMCache!

KubeCon + CloudNativeCon North America 2025: LLMs on Kubernetes: Squeeze 5x GPU Effic...

View more about this event at KubeCon + CloudNativeCon North America 2025

sched.co

October 15, 2025 at 10:29 PM

ByteTrending

@bytetrending.bsky.social

LMCache: Supercharging LLM Inference with Efficient Caching

LMCache boosts LLM inference with efficient KV caching, offering up to 15x throughput improvements & streamlining enterprise AI deployments. Explore this open-source solution!

bytetrending.com

October 15, 2025 at 11:42 AM

Yuan Tang

@terrytangyuan.xyz

It was also amazing to see a single slide that includes many of my favorite projects vLLM, llm-d, LMCache, and KServe.

Each year, this conference keeps getting better with more energy, more innovation, and more inspiring people driving open technology forward.

October 11, 2025 at 3:21 AM

Suraj Deshmukh | सुरज देशमुख

@suraj.io

How to Reduce KV Cache Bottlenecks with NVIDIA Dynamo | NVIDIA Technical Blog developer.nvidia.com/blog/how-to-...

#LMCache

As AI models grow larger and more sophisticated, inference, the process by which a model generates responses, is becoming a major challenge. Large language models (LLMs) like GPT-OSS and DeepSeek-R1…

developer.nvidia.com

October 1, 2025 at 4:33 AM

Kosseila (CloudDude)

@clouddude.bsky.social

🚀#NewBlog #vLLM
📖 𝐯𝐋𝐋𝐌 𝐩𝐫𝐨𝐝𝐮𝐜𝐭𝐢𝐨𝐧-𝐬𝐭𝐚𝐜𝐤: AI inference for enterprises💫

🏢Production-stack is the K8s-native, enterprise-ready inference setup that supercharges vLLM inference at scale, across Clouds.

👉Start here: cloudthrill.ca/vllm-product...

#AI #LLM #vLLM #Kubernetes #MLOps #KVCache #LMCache

vLLM production-stack: LLM inference for Enterprises (part1) - Cloudthrill

vLLM Production Stack tackles the usual issues that come with scaling LLM serving (slow recovery, high GPU bills) with a community-maintained layer that wraps vanilla vLLM and adds a Python-native router, an LMCache-powered KV-cache network, autoscaling hooks, and Grafana dashboards, all deployable in a single Helm chart. Let's dive into it!✍🏻

cloudthrill.ca

September 23, 2025 at 8:51 PM

Kosseila (CloudDude)

@clouddude.bsky.social

📦#vLLM for 𝐁𝐞𝐠𝐢𝐧𝐧𝐞𝐫𝐬 𝐛𝐮𝐧𝐝𝐥𝐞: from basics to deployment! 👇Missed our vLLM series this summer? Here’s a full recap
Part1️⃣: 𝐅undamentals cloudthrill.ca/what-is-vllm
Part2️⃣: 𝐊ey 𝐅eatures cloudthrill.ca/what-is-vllm...
Part3️⃣: 𝐃eployment 𝐎ptions cloudthrill.ca/vllm-deloyment
#vllm_project #lmcache #LLMs

September 2, 2025 at 7:19 PM
