llm-d
llm-d.ai
@llm-d.ai
llm-d is a Kubernetes-native distributed inference serving stack providing well-lit paths for anyone to serve large generative AI models at scale.

Learn more at: https://llm-d.ai
Standardizing high-performance inference requires deep ecosystem collaboration. 🚀

Huge shoutout to @vllm_project and @IBMResearch on the new KV Offloading Connector. We’re seeing up to 9x throughput gains on H100s and massive TTFT reductions. 🧵

blog.vllm.ai/2026/01/08/k...
Inside vLLM’s New KV Offloading Connector: Smarter Memory Transfer for Maximizing Inference Throughput
In this post, we will describe the new KV cache offloading feature that was introduced in vLLM 0.11.0. We will focus on offloading to CPU memory (DRAM) and its benefits for improving overall inference…
blog.vllm.ai
January 9, 2026 at 6:45 PM
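
For anyone who wants to try it, here is a minimal sketch of switching on CPU offloading through vLLM's KVTransferConfig. The connector name, the extra-config key, and the model are assumptions based on the post above, not the definitive setup; see the linked blog for the exact configuration.

# Minimal sketch: offload KV cache blocks to CPU memory (DRAM) in vLLM.
# Assumes vLLM >= 0.11.0; the connector name and extra-config key below are
# illustrative assumptions, check the blog post above for the exact settings.
from vllm import LLM, SamplingParams
from vllm.config import KVTransferConfig

llm = LLM(
    model="meta-llama/Llama-3.1-8B-Instruct",  # assumed model, any vLLM-supported model works
    kv_transfer_config=KVTransferConfig(
        kv_connector="OffloadingConnector",    # assumed connector name
        kv_role="kv_both",                     # both save and load offloaded KV blocks
        kv_connector_extra_config={"num_cpu_blocks": 8192},  # assumed knob: size of the DRAM pool
    ),
)

out = llm.generate(
    ["Explain KV cache offloading in one sentence."],
    SamplingParams(max_tokens=64),
)
print(out[0].outputs[0].text)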
AI inference is like a busy airport: without a controller, you get gridlock. ✈️

Check out this breakdown by Cedric Clyburn from Red Hat on how llm-d intelligently routes distributed LLM requests.

🔹 Solves "round robin" congestion
🔹 Disaggregates prefill/decode (P/D) to save costs (sketch after this post)

www.youtube.com/watch?v=CNKG...
LLM‑D Explained: Building Next‑Gen AI with LLMs, RAG & Kubernetes
YouTube video by IBM Technology
www.youtube.com
January 8, 2026 at 7:21 PM
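
To make the P/D point concrete, here is a tiny conceptual sketch (not llm-d code, all names are made up for illustration) of why disaggregation beats plain round-robin: prefill is compute-bound and handles the whole prompt at once, decode is memory-bound and emits one token at a time, so each phase gets its own worker pool and the KV cache is handed off between them.

# Conceptual sketch only, NOT llm-d's implementation: prefill/decode disaggregation.
# Requests are split into a prompt-processing stage on a prefill pool and a
# token-generation stage on a decode pool, instead of round-robining whole
# requests across identical replicas.
import itertools

class PrefillWorker:
    def __init__(self, name):
        self.name = name
    def prefill(self, prompt):
        # Pretend to build the KV cache for the prompt and return a handle to it.
        return {"kv_from": self.name, "prompt_len": len(prompt.split())}

class DecodeWorker:
    def __init__(self, name):
        self.name = name
    def decode(self, kv_handle, max_new_tokens):
        # Pretend to stream tokens using the transferred KV cache.
        return f"[{self.name}] generated {max_new_tokens} tokens (KV from {kv_handle['kv_from']})"

prefill_pool = itertools.cycle([PrefillWorker("prefill-0"), PrefillWorker("prefill-1")])
decode_pool = itertools.cycle([DecodeWorker("decode-0"), DecodeWorker("decode-1")])

def serve(prompt, max_new_tokens):
    kv = next(prefill_pool).prefill(prompt)               # stage 1: compute-bound prompt processing
    return next(decode_pool).decode(kv, max_new_tokens)   # stage 2: memory-bound token generation

print(serve("Explain why disaggregating prefill and decode saves cost.", 64))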