Lightnews — Scholar-powered news

Suraj Deshmukh | सुरज देशमुख

@suraj.io

Meta’s Kubernetes-based Portable AI Research Environment youtu.be/ts7bI51gRCo?...

Meta’s Kubernetes-based Portable AI Research Environment - Shaun Hopper, Meta & Navarre Pratt

YouTube video by CNCF [Cloud Native Computing Foundation]

youtu.be

November 26, 2025 at 2:26 PM

Suraj Deshmukh | सुरज देशमुख

@suraj.io

Our talk (me & Yuhan Liu) on improving LLM serving efficiency is on YouTube now!
youtu.be/2YCDvZokqnk?...

#vllm #kubernetes #kubecon

LLMs on Kubernetes: Squeeze 5x GPU Efficiency With Cache, Route, Repea... Yuhan Liu & Suraj Deshmukh

YouTube video by CNCF [Cloud Native Computing Foundation]

youtu.be

November 26, 2025 at 1:30 AM

Suraj Deshmukh | सुरज देशमुख

@suraj.io

Infinite scale: The architecture behind the Azure AI superfactory

blogs.microsoft.com/blog/2025/11...

Infinite scale: The architecture behind the Azure AI superfactory - The Official Microsoft Blog

Today, we are unveiling the next Fairwater site of Azure AI datacenters in Atlanta, Georgia. This purpose-built datacenter is connected to our first Fairwater site in Wisconsin, prior generations of A...

blogs.microsoft.com

November 20, 2025 at 12:25 AM

Suraj Deshmukh | सुरज देशमुख

@suraj.io

Gemini 3, OpenAI KV cache, and much more
open.substack.com/pub/simonw/p...

Trying out Gemini 3 Pro with audio transcription and a new pelican benchmark

Plus what happens if AI labs train for pelicans riding bicycles?

open.substack.com

November 20, 2025 at 12:22 AM

Suraj Deshmukh | सुरज देशमुख

@suraj.io

OpenAI shared some details, from the user's point of view, about which KV cache features are available: platform.openai.com/docs/guides/... It is interesting to see that they cache for 10 minutes, and if no request arrives they remove the hot caches from the GPU.

OpenAI Platform

Explore developer resources, tutorials, API docs, and dynamic examples to get the most out of OpenAI's platform.

platform.openai.com

November 20, 2025 at 12:16 AM

Suraj Deshmukh | सुरज देशमुख

@suraj.io

From Wisconsin to Atlanta: Microsoft connects datacenters to build its first AI superfactory

news.microsoft.com/source/featu...

Microsoft AI superfactory

Microsoft unveiled its second Fairwater AI datacenter in Atlanta as part of a new AI superfactory working across states in nearly real time.

news.microsoft.com

November 19, 2025 at 4:10 AM

Suraj Deshmukh | सुरज देशमुख

@suraj.io

Satya Nadella – How Microsoft thinks about AGI
youtu.be/8-boBsWcr5A?...

Satya Nadella – How Microsoft thinks about AGI

YouTube video by Dwarkesh Patel

youtu.be

November 15, 2025 at 11:23 PM

Suraj Deshmukh | सुरज देशमुख

@suraj.io

How One Line of Code Freed 30,000 CPU Cores: Deep-Diving Fluent Bit at Petabyte Scale www.youtube.com/watch?v=pbOv...

Keynote: How One Line of Code Freed 30,000 CPU Cores: Deep-Diving Fluent Bit at Petabyte... F. Ponce

YouTube video by CNCF [Cloud Native Computing Foundation]

www.youtube.com

November 15, 2025 at 8:54 PM

Suraj Deshmukh | सुरज देशमुख

@suraj.io

Come see us (me & Yuhan Liu) tomorrow for our talk.

Specifically, Wednesday November 12, 2025 5:30pm - 6:00pm EST at Building B | Level 5 | Thomas Murphy Ballroom 1.

More info: sched.co/27FcQ #kubecon #vllm

KubeCon + CloudNativeCon North America 2025: LLMs on Kubernetes: Squeeze 5x GPU Effic...

View more about this event at KubeCon + CloudNativeCon North America 2025

sched.co

November 11, 2025 at 7:52 PM

Suraj Deshmukh | सुरज देशमुख

@suraj.io

Announcing Ray Direct Transport: RDMA Support in Ray Core
www.anyscale.com/blog/ray-dir...

Ray Direct Transport: RDMA Support in Ray Core (Part 1)

Ray Direct Transport enables fast and direct GPU transfers in Ray via RDMA-backed transports. Using RDT, we can achieve up to 1000x faster GPU-GPU transfers than Ray’s native object store with a few l...

www.anyscale.com

November 5, 2025 at 1:06 AM

Suraj Deshmukh | सुरज देशमुख

@suraj.io

Building a tool to copy-paste share terminal sessions using Claude Code for web
open.substack.com/pub/simonw/p...

Building a tool to copy-paste share terminal sessions using Claude Code for web

Plus Living dangerously with Claude, and prompt injection risks for ChatGPT Atlas

open.substack.com

October 24, 2025 at 8:07 PM

Suraj Deshmukh | सुरज देशमुख

@suraj.io

LMCache: An Efficient KV Cache Layer for Enterprise-Scale LLM Inference
arxiv.org/abs/2510.09665

LMCache: An Efficient KV Cache Layer for Enterprise-Scale LLM Inference

Today's LLM inference systems treat individual engines and queries independently for simplicity, but this causes significant resource inefficiencies. While there are proposals to avoid redundant compu...

arxiv.org

October 18, 2025 at 10:34 PM

Suraj Deshmukh | सुरज देशमुख

@suraj.io

Understanding Memory Management on Hardware-Coherent Platforms | NVIDIA Technical Blog developer.nvidia.com/blog/underst...

Understanding Memory Management on Hardware-Coherent Platforms | NVIDIA Technical Blog

If you’re an application developer or a cluster administrator, you’ve likely seen how non-uniform memory access (NUMA) can impact system performance. When an application is not fully NUMA-aware…

developer.nvidia.com

October 17, 2025 at 8:12 PM

Suraj Deshmukh | सुरज देशमुख

@suraj.io

Join me and Yuhan Liu for our talk at the upcoming #Kubecon NA 2025 in Atlanta: sched.co/27FcQ. We will talk about increasing efficiency while serving #LLMs using #vLLM & #LMCache!