llm-d (@llm-d.ai)
llm-d.ai
llm-d is a Kubernetes-native distributed inference serving stack providing well-lit paths for anyone to serve large generative AI models at scale.

Learn more at: https://llm-d.ai
Check out our updated guide on leveraging tiered caching in your own cluster: llm-d.ai/docs/guide/I...

Up next: A deep-dive blog post on deployment patterns and scheduling behavior. Stay tuned! ⚡️
Link card: Prefix Cache Offloading - CPU | llm-d · Well-lit path for separating prefill and decode operations · llm-d.ai
January 9, 2026 at 6:45 PM
By separating memory transfer mechanisms from global scheduling logic, llm-d ensures you get the best of both worlds: peak engine performance and optimal resource utilization across the entire fleet. 🛠️
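As a rough illustration of that split (a sketch under our own assumptions: the `TieredKVCache` name, its methods, and the LRU policy below are illustrative, not llm-d's actual connector API), the transfer side can be modeled as a self-contained tiered block store that knows nothing about request routing:

```python
# Hypothetical sketch: the memory-transfer side, modeled with no knowledge
# of cluster-level scheduling. Names and policies are illustrative only.
from collections import OrderedDict


class TieredKVCache:
    """Two-tier KV block store: a small GPU HBM tier in front of larger CPU RAM."""

    def __init__(self, gpu_capacity_blocks: int):
        self.gpu: OrderedDict[str, bytes] = OrderedDict()  # block hash -> block data
        self.cpu: dict[str, bytes] = {}
        self.gpu_capacity = gpu_capacity_blocks

    def put(self, block_hash: str, data: bytes) -> None:
        """Insert into HBM; offload the least-recently-used block to CPU RAM when full."""
        if block_hash not in self.gpu and len(self.gpu) >= self.gpu_capacity:
            evicted_hash, evicted_data = self.gpu.popitem(last=False)
            self.cpu[evicted_hash] = evicted_data  # offload instead of discarding
        self.gpu[block_hash] = data
        self.gpu.move_to_end(block_hash)

    def get(self, block_hash: str) -> bytes | None:
        """Return a block, onboarding it from CPU RAM back into HBM on a tier miss."""
        if block_hash in self.gpu:
            self.gpu.move_to_end(block_hash)  # refresh LRU position
            return self.gpu[block_hash]
        if block_hash in self.cpu:
            data = self.cpu.pop(block_hash)
            self.put(block_hash, data)  # onboard back into the hot tier
            return data
        return None  # true miss: prefill has to recompute this block
```

The point of the split is that the scheduler only needs to know which blocks are warm on which pod; it never has to touch this byte-moving machinery.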
January 9, 2026 at 6:45 PM
How we’re using it:

⚫️ Tiered Prefix Cache: We use the new connector to bridge GPU HBM and CPU RAM, creating a massive, multi-tier cache hierarchy.

⚫️ Intelligent Scheduling: Our scheduler now routes requests to pods where the needed KV blocks are already warm, whether in GPU HBM or CPU RAM (see the sketch below).
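A minimal sketch of what that cache-aware routing could look like, assuming each pod reports which prefix-block hashes are warm in which tier (the names, weights, and scoring rule here are our own illustration, not llm-d's scheduler API):

```python
# Hypothetical sketch: score pods by how much of a request's prefix is already
# warm, preferring GPU-resident blocks over blocks offloaded to CPU RAM.
from dataclasses import dataclass, field


@dataclass
class PodCacheState:
    gpu_blocks: set[str] = field(default_factory=set)  # block hashes in GPU HBM
    cpu_blocks: set[str] = field(default_factory=set)  # block hashes offloaded to CPU RAM


def score_pod(state: PodCacheState, prefix_blocks: list[str],
              gpu_weight: float = 1.0, cpu_weight: float = 0.5) -> float:
    """Higher score = more of the prefix is reusable without recomputation."""
    score = 0.0
    for block in prefix_blocks:
        if block in state.gpu_blocks:
            score += gpu_weight   # reusable immediately from HBM
        elif block in state.cpu_blocks:
            score += cpu_weight   # reusable after onboarding from CPU RAM
    return score


def pick_pod(pods: dict[str, PodCacheState], prefix_blocks: list[str]) -> str:
    """Route the request to the pod with the warmest prefix for it."""
    return max(pods, key=lambda name: score_pod(pods[name], prefix_blocks))
```

In practice a load balancer would combine a cache-affinity score like this with load and capacity signals before picking a target.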
January 9, 2026 at 6:45 PM
Our mission with llm-d is to build the control plane that translates these engine-level wins into cluster-wide performance.

We’ve already integrated these capabilities into our core architecture to bridge the gap between raw hardware power and distributed scale.
January 9, 2026 at 6:45 PM