llm-d (@llm-d.ai)
llm-d is a Kubernetes-native distributed inference serving stack providing well-lit paths for anyone to serve large generative AI models at scale.

Learn more at: https://llm-d.ai
How we’re using it:

⚫️ Tiered-Prefix-Cache: We use the new connector to bridge GPU HBM and CPU RAM, creating a massive, multi-tier cache hierarchy.

⚫️ Intelligent Scheduling: Our scheduler now routes requests to pods where the needed KV blocks are already warm, whether in GPU HBM or CPU RAM (see the sketch below).
January 9, 2026 at 6:45 PM
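To make the tiered-cache and KV-aware scheduling ideas in the post above concrete, here is a minimal, purely illustrative Python sketch of a two-tier KV-block index. It is not the llm-d or vLLM connector API; the `TieredKVIndex` class, its tier names, and the block-hash keys are assumptions made for illustration only.

```python
# Illustrative sketch only; not the actual llm-d/vLLM connector API.
# Models a two-tier KV-block index: check GPU HBM first, then CPU RAM,
# and recompute a block only when neither tier holds it.
from dataclasses import dataclass, field


@dataclass
class TieredKVIndex:
    gpu_blocks: set = field(default_factory=set)  # block hashes resident in GPU HBM
    cpu_blocks: set = field(default_factory=set)  # block hashes offloaded to CPU RAM

    def locate(self, block_hash: str) -> str:
        """Return which tier (if any) already holds this KV block."""
        if block_hash in self.gpu_blocks:
            return "gpu"   # fastest path: reuse directly from HBM
        if block_hash in self.cpu_blocks:
            return "cpu"   # copy back to HBM, still far cheaper than recomputing
        return "miss"      # full prefill needed for this block

    def offload(self, block_hash: str) -> None:
        """On HBM pressure, demote a block to CPU RAM instead of dropping it."""
        self.gpu_blocks.discard(block_hash)
        self.cpu_blocks.add(block_hash)


index = TieredKVIndex(gpu_blocks={"blk-a"}, cpu_blocks={"blk-b"})
print([index.locate(h) for h in ("blk-a", "blk-b", "blk-c")])  # ['gpu', 'cpu', 'miss']
```

A KV-aware scheduler can run the same kind of lookup against each pod's advertised blocks and prefer a pod that reports "gpu" or "cpu" hits for a request's prefix over one that would have to recompute everything.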
🚀 Announcing llm-d v0.4! This release focuses on achieving SOTA inference performance across accelerators. From ultra-low latency for MoE models to new auto-scaling capabilities, we’re pushing the boundaries of open-source inference. Blog: https://t.co/qlQnzcT9O3 🧵👇
January 12, 2026 at 3:16 PM
🚀 llm-d v0.3.1 is LIVE! 🚀 This patch release is packed with key follow-ups from v0.3.0, including new hardware support, expanded cloud provider integration, and streamlined image builds. Dive into the full changelog: https://t.co/Wh6OGJ0KdO #llmd #OpenSource #vLLM #Release
January 12, 2026 at 3:15 PM
🚀 Evolving for Impact! We're updating our llm-d SIG meeting schedule to a bi-weekly cadence. This gives our community more time for deep work between calls, making our sessions even more focused and productive. Here are the details 👇
January 12, 2026 at 3:15 PM
We are thrilled to announce the release of llm-d v0.3! 🚀 This release is a huge milestone, powered by our incredible community, as we continue to build wider, well-lit paths for high-performance, hardware-agnostic, and scalable inference. 🧵Let's dive into what's new!
January 12, 2026 at 3:15 PM
Running LLMs on Kubernetes? You've likely felt the pain of re-processing the same context tokens over and over (think RAG system prompts). This is a huge source of inefficiency in distributed inference. Let's break down how we're solving this with llm-d. 🧵
January 12, 2026 at 3:15 PM
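As a rough illustration of the reuse opportunity described in the post above (this is not llm-d's actual hashing scheme; the block size and chained-hash layout are assumptions), the sketch below shows why two requests that share the same RAG system prompt produce identical leading KV-block hashes, so those blocks can be served from cache instead of being recomputed.

```python
# Illustrative sketch only; not llm-d's or vLLM's actual prefix-hashing scheme.
# Two requests that start with the same system prompt yield the same leading
# block hashes, so the KV state for those blocks can be reused, not recomputed.
import hashlib

BLOCK_SIZE = 16  # tokens per KV block; value assumed for illustration


def prefix_block_hashes(token_ids):
    """Chained hashes of each full block, so a block hash identifies its whole prefix."""
    hashes, running = [], hashlib.sha256()
    full_blocks_end = len(token_ids) - len(token_ids) % BLOCK_SIZE
    for i in range(0, full_blocks_end, BLOCK_SIZE):
        running.update(str(token_ids[i:i + BLOCK_SIZE]).encode())
        hashes.append(running.copy().hexdigest()[:12])
    return hashes


system_prompt = list(range(64))           # the same RAG system prompt on every request
req_a = system_prompt + [901, 902, 903]   # user question A appended
req_b = system_prompt + [777, 778]        # user question B appended

# The four leading block hashes match, so those KV blocks are shared across requests.
print(prefix_block_hashes(req_a)[:4] == prefix_block_hashes(req_b)[:4])  # True
```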
In production LLM inference, this metric matters: KV-Cache hit rate. Why? A cached token is up to 10x cheaper to process than an uncached one. But when you scale out, naive load balancing creates a costly disaster: the "heartbreaking KV-cache miss." https://red.ht/46A4ynW
January 12, 2026 at 3:15 PM
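Taking the post's "up to 10x cheaper" figure at face value, a back-of-the-envelope sketch (illustrative numbers only, not a benchmark) shows how strongly the hit rate moves the average prefill cost, and why cache-blind load balancing that scatters repeated prefixes across pods is so costly.

```python
# Back-of-the-envelope sketch using the post's "up to 10x cheaper" figure.
CACHED_COST_RATIO = 0.1  # relative compute cost of a cached token vs. an uncached one


def relative_prefill_cost(hit_rate: float) -> float:
    """Average cost per prompt token, normalized so a 0% hit rate costs 1.0."""
    return hit_rate * CACHED_COST_RATIO + (1.0 - hit_rate)


for hit_rate in (0.0, 0.5, 0.9):
    print(f"hit rate {hit_rate:.0%}: {relative_prefill_cost(hit_rate):.2f}x baseline prefill cost")
# Prints 1.00x, 0.55x, and 0.19x. Naive round-robin spreads identical prefixes
# across pods and drags the hit rate toward 0%; prefix-aware routing keeps
# requests on pods that already hold the warm blocks.
```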
The llm-d community is building incredible things! 🚀 Shout-out to Ernest Wong & Sachi Desai from Microsoft for their new blog post pairing llm-d with Retrieval-Augmented Generation (RAG) on Azure Kubernetes Service (AKS)! This is a must-read guide! 👇 https://t.co/DPfRUdTLJB
January 12, 2026 at 3:15 PM
Getting started with llm-d v0.2 is now easier than ever! We've launched a full set of quick start guides to walk you through our most powerful features, including P/D disaggregation and deploying large MoE models on Kubernetes. Start here: https://llm-d.ai/docs/guide
January 12, 2026 at 3:14 PM
The llm-d community is proud to announce the release of v0.2! Our focus has been on building well-lit paths for large-scale inference on Kubernetes. This release delivers major advancements in performance, scheduling, and support for massive models. https://red.ht/4l4u9uD
January 12, 2026 at 3:14 PM
Big news from the llm-d project! Your input on our 5-min survey will define our future roadmap. Plus, we've just launched our YouTube channel with meeting recordings & tutorials. Subscribe and help us build the future of LLM serving! https://llm-d.ai/blog/llm-d-community-update-june-2025
January 12, 2026 at 3:13 PM
Two new ways to get involved with the llm-d project!
✅ Help shape our roadmap by taking our 5-min survey on your LLM use cases.
✅ Subscribe to our new YouTube channel for tutorials & SIG meetings! Details in our latest community update: https://llm-d.ai/blog/llm-d-community-update-june-2025
January 12, 2026 at 3:13 PM