llm-d (@llm-d.ai)
llm-d.ai
llm-d is a Kubernetes-native distributed inference serving stack providing well-lit paths for anyone to serve large generative AI models at scale.

Learn more at: https://llm-d.ai
Check out our updated guide on leveraging tiered caching in your own cluster: llm-d.ai/docs/guide/I...

Up next: A deep-dive blog post on deployment patterns and scheduling behavior. Stay tuned! ⚡️
Link card: Prefix Cache Offloading - CPU | llm-d · Well-lit path for separating prefill and decode operations · llm-d.ai
January 9, 2026 at 6:45 PM
By separating memory transfer mechanisms from global scheduling logic, llm-d ensures you get the best of both worlds: peak engine performance and optimal resource utilization across the entire fleet. 🛠️
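As a rough illustration of that split (a sketch under our own assumptions: the `TieredKVCache` name, its methods, and the LRU policy below are illustrative, not llm-d's actual connector API), the transfer side can be modeled as a self-contained tiered block store that knows nothing about request routing:

```python
# Hypothetical sketch: the memory-transfer side, modeled with no knowledge
# of cluster-level scheduling. Names and policies are illustrative only.
from collections import OrderedDict


class TieredKVCache:
    """Two-tier KV block store: a small GPU HBM tier in front of larger CPU RAM."""

    def __init__(self, gpu_capacity_blocks: int):
        self.gpu: OrderedDict[str, bytes] = OrderedDict()  # block hash -> block data
        self.cpu: dict[str, bytes] = {}
        self.gpu_capacity = gpu_capacity_blocks

    def put(self, block_hash: str, data: bytes) -> None:
        """Insert into HBM; offload the least-recently-used block to CPU RAM when full."""
        if block_hash not in self.gpu and len(self.gpu) >= self.gpu_capacity:
            evicted_hash, evicted_data = self.gpu.popitem(last=False)
            self.cpu[evicted_hash] = evicted_data  # offload instead of discarding
        self.gpu[block_hash] = data
        self.gpu.move_to_end(block_hash)

    def get(self, block_hash: str) -> bytes | None:
        """Return a block, onboarding it from CPU RAM back into HBM on a tier miss."""
        if block_hash in self.gpu:
            self.gpu.move_to_end(block_hash)  # refresh LRU position
            return self.gpu[block_hash]
        if block_hash in self.cpu:
            data = self.cpu.pop(block_hash)
            self.put(block_hash, data)  # onboard back into the hot tier
            return data
        return None  # true miss: prefill has to recompute this block
```

The point of the split is that the scheduler only needs to know which blocks are warm on which pod; it never has to touch this byte-moving machinery.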
January 9, 2026 at 6:45 PM
How we’re using it:

⚫️ Tiered Prefix Cache: We use the new connector to bridge GPU HBM and CPU RAM, creating a massive, multi-tier cache hierarchy.

⚫️ Intelligent Scheduling: Our scheduler now routes requests to pods where the needed KV blocks are already warm, whether in GPU HBM or CPU RAM (see the sketch below).
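A minimal sketch of what that cache-aware routing could look like, assuming each pod reports which prefix-block hashes are warm in which tier (the names, weights, and scoring rule here are our own illustration, not llm-d's scheduler API):

```python
# Hypothetical sketch: score pods by how much of a request's prefix is already
# warm, preferring GPU-resident blocks over blocks offloaded to CPU RAM.
from dataclasses import dataclass, field


@dataclass
class PodCacheState:
    gpu_blocks: set[str] = field(default_factory=set)  # block hashes in GPU HBM
    cpu_blocks: set[str] = field(default_factory=set)  # block hashes offloaded to CPU RAM


def score_pod(state: PodCacheState, prefix_blocks: list[str],
              gpu_weight: float = 1.0, cpu_weight: float = 0.5) -> float:
    """Higher score = more of the prefix is reusable without recomputation."""
    score = 0.0
    for block in prefix_blocks:
        if block in state.gpu_blocks:
            score += gpu_weight   # reusable immediately from HBM
        elif block in state.cpu_blocks:
            score += cpu_weight   # reusable after onboarding from CPU RAM
    return score


def pick_pod(pods: dict[str, PodCacheState], prefix_blocks: list[str]) -> str:
    """Route the request to the pod with the warmest prefix for it."""
    return max(pods, key=lambda name: score_pod(pods[name], prefix_blocks))
```

In practice a load balancer would combine a cache-affinity score like this with load and capacity signals before picking a target.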
January 9, 2026 at 6:45 PM
Our mission with llm-d is to build the control plane that translates these engine-level wins into cluster-wide performance.

We’ve already integrated these capabilities into our core architecture to bridge the gap between raw hardware power and distributed scale.
January 9, 2026 at 6:45 PM