Suvinay
@suvinay.bsky.social
👨‍💻 Building AI systems (TPUs) at Google | 🎙️ Co-host the Computer Architecture Podcast | 🎓 EECS Ph.D. @ MIT, B.Tech @ IIT Madras | Views my own | suvinay.com
Scaling laws provide a valuable lens for guiding model design and compute budgets. Our recent work extends this lens to _fine-grained_ sparsity. Check out our #ICLR2025 paper, and the thread below from lead author @tjin.bsky.social summarizing our findings.
📣 The Journey Matters: Our #ICLR2025 paper shows how to pretrain sparse LLMs at half the size of dense LLMs while maintaining quality. We found that quality is predicted by the average parameter count during sparse pre-training, not by the final size. An MIT/Rice/Google/ISTA collab 🧵 1/N
April 22, 2025 at 1:32 AM
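A minimal sketch of that takeaway, with illustrative numbers: if quality tracks the average parameter count seen over training, then sparsification schedules can be compared by averaging the live parameter count across steps. The linear ramp and the function name below are my own hypothetical assumptions, not the paper's recipe.

```python
# Illustrative sketch (not the paper's exact recipe): compare sparse
# pre-training schedules by the *average* number of live parameters
# integrated over training steps.

def avg_param_count(dense_params: float, final_sparsity: float,
                    total_steps: int, ramp_end_frac: float = 0.5) -> float:
    """Average live parameters under a hypothetical linear sparsity ramp
    that reaches `final_sparsity` at `ramp_end_frac` of training."""
    total = 0.0
    for step in range(total_steps):
        frac = min(step / (ramp_end_frac * total_steps), 1.0)
        sparsity = final_sparsity * frac
        total += dense_params * (1.0 - sparsity)
    return total / total_steps

# Example: a 1B-parameter model ramped linearly to 50% sparsity by the
# midpoint of training retains ~0.63B parameters on average over the run.
print(avg_param_count(1e9, 0.5, total_steps=10_000))
```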
At Google, we announced the latest generation of our AI supercomputers (TPUs) -- Ironwood -- this week. Check out the quoted blog post for the highlights: blog.google/products/googl…

Pointers to deep-dives and more technical details in thread. [contd...👇]
April 13, 2025 at 4:09 AM
Together with Lisa Hsu (Meta), I have been hosting the Computer Architecture Podcast -- we recently crossed 50K downloads. Check out our latest episode with Prof. Arka Basu: comparchpodcast.podbean.com -- we discuss GPUs, but from a different vantage point than AI, which is all the rage.
Computer Architecture Podcast | comparchpodcast
A show that brings you closer to the cutting edge in computer architecture and the remarkable people behind it. Hosted by Dr. Suvinay Subramanian, who is a computer architect at Google in the Systems ...
comparchpodcast.podbean.com
March 18, 2025 at 7:04 PM
Starting with this exciting line of work from @tjin.bsky.social and colleagues at MIT. We tackle the question: can we train LLMs to automatically parallelize autoregressive decoding, backed by a performant runtime that exploits this parallelism for faster inference?
Excited to share our work with friends from MIT/Google on Learned Asynchronous Decoding! LLM responses often contain chunks of tokens that are semantically independent. What if we could train LLMs to identify such chunks and decode them in parallel, thereby speeding up inference? 1/N
March 18, 2025 at 7:02 PM
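A toy sketch of the core idea, under stated assumptions: once a model marks spans of its response as semantically independent, a runtime can decode those spans concurrently instead of strictly left-to-right. The names decode_chunk and decode_response, and the asyncio.sleep stand-in for real decoding, are illustrative, not the paper's actual API.

```python
# Toy sketch (hypothetical API, not the paper's runtime): decode
# semantically independent chunks of a response concurrently.

import asyncio

async def decode_chunk(prompt: str, chunk_id: int) -> str:
    """Stand-in for autoregressive decoding of one independent chunk."""
    await asyncio.sleep(0.1)  # pretend this is many sequential decode steps
    return f"<chunk {chunk_id} decoded for: {prompt!r}>"

async def decode_response(prompt: str, num_independent_chunks: int) -> str:
    # Sequential decoding costs ~num_chunks * per-chunk latency; launching
    # the independent chunks together costs ~one chunk's latency.
    chunks = await asyncio.gather(
        *(decode_chunk(prompt, i) for i in range(num_independent_chunks))
    )
    return " ".join(chunks)

print(asyncio.run(decode_response("List three unrelated facts.", 3)))
```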
Hello world! Dipping my toes into social media. My excellent interns at Google, with whom I have had the pleasure of working, were kind enough to nudge me to help signal-boost their work. Will also try to share updates on TPUs, AI chips & systems, and computer architecture.
March 18, 2025 at 7:01 PM