Suraj Deshmukh | सुरज देशमुख
suraj.io
@suraj.io
@Microsoft.com | ex-@kinvolkio ex-@RedHat | bibliophile | He/Him | Opinions are my own.

🟥 🟩
🟦 🟨
Meta’s Kubernetes-based Portable AI Research Environment youtu.be/ts7bI51gRCo?...
Meta’s Kubernetes-based Portable AI Research Environment - Shaun Hopper, Meta & Navarre Pratt
YouTube video by CNCF [Cloud Native Computing Foundation]
youtu.be
November 26, 2025 at 2:26 PM
Our talk (me & Yuhan Liu) on improving LLM serving efficiency is on YouTube now!
youtu.be/2YCDvZokqnk?...

#vllm #kubernetes #kubecon
LLMs on Kubernetes: Squeeze 5x GPU Efficiency With Cache, Route, Repea... Yuhan Liu & Suraj Deshmukh
YouTube video by CNCF [Cloud Native Computing Foundation]
youtu.be
November 26, 2025 at 1:30 AM
OpenAI has shared some details, from the user's point of view, about which KV cache features are available:
platform.openai.com/docs/guides/...

It is interesting to see that they keep caches for 10 minutes and, if no request comes in during that time, they remove the hot caches from the GPU.
OpenAI Platform
Explore developer resources, tutorials, API docs, and dynamic examples to get the most out of OpenAI's platform.
platform.openai.com
November 20, 2025 at 12:16 AM
From Wisconsin to Atlanta: Microsoft connects datacenters to build its first AI superfactory

news.microsoft.com/source/featu...
Microsoft AI superfactory
Microsoft unveiled its second Fairwater AI datacenter in Atlanta as part of a new AI superfactory working across states in nearly real time.
news.microsoft.com
November 19, 2025 at 4:10 AM
Satya Nadella – How Microsoft thinks about AGI
youtu.be/8-boBsWcr5A?...
YouTube video by Dwarkesh Patel
youtu.be
November 15, 2025 at 11:23 PM
How One Line of Code Freed 30,000 CPU Cores: Deep-Diving Fluent Bit at Petabyte Scale www.youtube.com/watch?v=pbOv...
Keynote: How One Line of Code Freed 30,000 CPU Cores: Deep-Diving Fluent Bit at Petabyte... F. Ponce
YouTube video by CNCF [Cloud Native Computing Foundation]
www.youtube.com
November 15, 2025 at 8:54 PM
Come see us (me & Yuhan Liu) tomorrow for our talk.

Specifically, Wednesday November 12, 2025 5:30pm - 6:00pm EST at Building B | Level 5 | Thomas Murphy Ballroom 1.

More info: sched.co/27FcQ #kubecon #vllm
KubeCon + CloudNativeCon North America 2025: LLMs on Kubernetes: Squeeze 5x GPU Effic...
View more about this event at KubeCon + CloudNativeCon North America 2025
sched.co
November 11, 2025 at 7:52 PM
Building a tool to copy-paste share terminal sessions using Claude Code for web
open.substack.com/pub/simonw/p...
Plus Living dangerously with Claude, and prompt injection risks for ChatGPT Atlas
open.substack.com
October 24, 2025 at 8:07 PM
Join me and Yuhan Liu for our talk at the upcoming #Kubecon NA 2025 in Atlanta: sched.co/27FcQ. We will talk about increasing efficiency while serving #LLMs using #vLLM & #LMCache!
KubeCon + CloudNativeCon North America 2025: LLMs on Kubernetes: Squeeze 5x GPU Effic...
View more about this event at KubeCon + CloudNativeCon North America 2025
sched.co
October 15, 2025 at 10:29 PM
Using Claude Code with GitHub Copilot-hosted Claude models:
github.com/surajssd/dot...

TFS @nilekh.bsky.social
github.com
October 14, 2025 at 10:06 PM
Claude Code: Tips and Tricks

youtu.be/HSkLeECsBcw?...
YouTube video by Anand Tyagi
youtu.be
October 13, 2025 at 10:54 PM
Gang Scheduling for Llama by Anca Agape and Andre Darabanov
www.youtube.com/watch?v=4Bef...
YouTube video by @Scale
www.youtube.com
October 1, 2025 at 5:15 PM
Cut Model Deployment Costs While Keeping Performance With GPU Memory Swap | NVIDIA Technical Blog developer.nvidia.com/blog/cut-mod...
Deploying large language models (LLMs) at scale presents a dual challenge: ensuring fast responsiveness during high demand, while managing the costs of GPUs. Organizations often face a trade-off…
developer.nvidia.com
September 29, 2025 at 4:58 AM
The Only Trait for Success in the AI Era—How to Build It youtu.be/xWYb7tImErI?...
The Only Trait for Success in the AI Era—How to Build It | Carnegie Mellon University Po-Shen Loh
YouTube video by EO
youtu.be
September 3, 2025 at 3:18 AM
OSDI '24 - DistServe: Disaggregating Prefill and Decoding for Goodput-optimized LLM serving youtu.be/WwJvecXOeUA?...
OSDI '24 - DistServe: Disaggregating Prefill and Decoding for Goodput-optimized Large Language...
YouTube video by USENIX
youtu.be
August 28, 2025 at 8:09 AM
OSDI '24 - Taming Throughput-Latency Tradeoff in LLM Inference with Sarathi-Serve youtu.be/S8rq3pYboZY?...
OSDI '24 - Taming Throughput-Latency Tradeoff in LLM Inference with Sarathi-Serve
YouTube video by USENIX
youtu.be
August 28, 2025 at 7:47 AM
More Nodes, More Problems: Solving Multi-Host GPU/TPU Scheduling with Dynamic Resource Allocation youtu.be/YqIHESG0suI?...
More Nodes, More Problems: Solving Multi-Host GPU/TPU Scheduli... John Belamaric & Morten Torkildsen
YouTube video by CNCF [Cloud Native Computing Foundation]
youtu.be
August 28, 2025 at 7:28 AM