Sham Kakade
@shamkakade.bsky.social
880 followers 87 following 5 posts
Harvard Professor. ML and AI. Co-director of the Kempner Institute. https://shamulent.github.io
Reposted by Sham Kakade
kempnerinstitute.bsky.social
NEW in the #KempnerInstitute blog: learn about ProCyon, a multimodal foundation model to model, generate & predict protein phenotypes. Read it here: bit.ly/4fA8xUk
Reposted by Sham Kakade
kempnerinstitute.bsky.social
Calling college grads interested in intelligence research: the application for the #KempnerInstitute's post-bac program w/ the Harvard Kenneth C. Griffin Graduate School of Arts and Sciences Office for Equity, Diversity, Inclusion & Belonging is now open! Apply by Feb. 1, 2025.

https://bit.ly/4iohnqE
Reposted by Sham Kakade
yus167.bsky.social
LLM self-improvement has critical implications for synthetic data, post-training and test-time inference. To understand LLMs' true capability for self-improvement, we perform large-scale experiments with multiple families of LLMs, tasks and mechanisms. Here is what we found: (1/9)
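[Editor's note] To make the loop referenced here concrete, below is a minimal, purely hypothetical sketch of one self-improvement round: a toy generator proposes answers, the same model (acting as its own verifier) filters them, and the kept pairs become synthetic data for post-training. Every component is a made-up stand-in; the post does not specify the actual mechanisms studied.

```python
# Purely hypothetical sketch of one self-improvement round (generate -> self-verify
# -> filter -> reuse as synthetic training data). All components are toy stand-ins;
# the post does not state the actual mechanisms studied.
import random

random.seed(0)

def toy_generate(x: int, y: int) -> int:
    """Stand-in 'generator': answers x + y, but is wrong 40% of the time."""
    if random.random() < 0.6:
        return x + y
    return x + y + random.choice([-2, -1, 1, 2])

def toy_self_verify(x: int, y: int, ans: int) -> bool:
    """Stand-in 'self-verifier': the model judging its own answer, with 10% noise."""
    correct = (ans == x + y)
    return correct if random.random() < 0.9 else not correct

problems = [(random.randint(0, 99), random.randint(0, 99)) for _ in range(1000)]
candidates = [(x, y, toy_generate(x, y)) for x, y in problems]
kept = [(x, y, a) for x, y, a in candidates if toy_self_verify(x, y, a)]

# The fraction of truly correct answers among the kept samples bounds how much
# post-training on this synthetic data can actually help.
precision = sum(a == x + y for x, y, a in kept) / len(kept)
print(f"kept {len(kept)}/{len(candidates)} samples, precision {precision:.2f}")
```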
Reposted by Sham Kakade
kempnerinstitute.bsky.social
NEW: we have an exciting opportunity for a tenure-track professor at the #KempnerInstitute and the John A. Paulson School of Engineering and Applied Sciences (SEAS). Read the full description & apply today: academicpositions.harvard.edu/postings/14362
#ML #AI
shamkakade.bsky.social
(5/n) 🤝 Shoutout to some great collaborators:
@hanlin_zhang, @depen_morwani, @vyasnikhil96, @uuujingfeng, @difanzou, @udayaghai
#AI #ML #ScalingLaws
shamkakade.bsky.social
(3/n) 📊 From our controlled experiments on language models:
📈CBS increases as dataset size grows
🤏CBS remains weakly dependent on model size
Data size, not model size, drives parallel efficiency for large-scale pre-training.
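[Editor's note] A minimal sketch of how a "CBS grows with data size" trend can be summarized as a power-law fit in log space. The (dataset size, CBS) pairs below are hypothetical placeholders, not the measurements behind this thread.

```python
# Hedged sketch: summarize CBS vs. dataset size with a power-law fit,
# CBS ~ a * D^b, via least squares in log space. The numbers are hypothetical
# placeholders, not the experimental values referenced in the thread.
import numpy as np

D = np.array([1e9, 3e9, 1e10, 3e10, 1e11])            # training tokens (hypothetical)
CBS = np.array([0.25e6, 0.5e6, 1.1e6, 2.0e6, 4.2e6])  # critical batch size in tokens (hypothetical)

b, log_a = np.polyfit(np.log(D), np.log(CBS), 1)      # slope = exponent, intercept = log prefactor
print(f"CBS ~ {np.exp(log_a):.3g} * D^{b:.2f}")
```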
shamkakade.bsky.social
(2/n) 🤔 How does CBS scale with model size and data size in pre-training? We find that CBS scales with data size and is largely invariant to model size. Prior beliefs that CBS scales with model size may have stemmed from Chinchilla’s coupled N-D scaling.
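[Editor's note] To see why coupled N-D scaling can mislead: if every run sets D proportional to N (Chinchilla-style), a quantity that truly depends only on D will also trace out a clean power law in N. A toy numerical illustration, with made-up constants:

```python
# Hypothetical illustration of the coupling confound: suppose the true law is
# CBS = c * D^0.4 (data-dependent only), but every run uses Chinchilla-style
# D = 20 * N. Regressing CBS on N alone still yields a clean power law, which
# could be misread as model-size dependence.
import numpy as np

c = 50.0                                   # made-up prefactor
N = np.array([1e8, 1e9, 1e10, 1e11])       # model sizes (hypothetical)
D = 20 * N                                 # coupled data sizes (Chinchilla-style)
CBS = c * D**0.4                           # ground truth: depends on D only

slope, _ = np.polyfit(np.log(N), np.log(CBS), 1)
print(f"apparent CBS ~ N^{slope:.2f}  (purely an artifact of D proportional to N)")
```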
shamkakade.bsky.social
(1/n) 💡How can we speed up the serial runtime of long pre-training runs? Enter the Critical Batch Size (CBS): the tipping point where the gains from data parallelism are offset by diminishing efficiency. Doubling the batch size halves the number of optimization steps, until we hit CBS; beyond that, returns diminish.
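[Editor's note] One common way to formalize this tradeoff, borrowed from the large-batch training literature and not necessarily the exact definition used in this work, is a steps-vs-batch-size curve that halves with the batch size well below CBS and flattens above it, e.g. S(B) = S_min * (1 + B_crit / B):

```python
# Hedged sketch of serial optimization steps vs. batch size. The functional form
# S(B) = S_min * (1 + B_crit / B) is a standard illustration of a critical batch
# size, not the paper's exact definition; s_min and b_crit are hypothetical.
def optimization_steps(batch_size: float, s_min: float = 1e4, b_crit: float = 2**20) -> float:
    return s_min * (1 + b_crit / batch_size)

for b in [2**16, 2**18, 2**20, 2**22, 2**24]:
    print(f"B = 2^{b.bit_length() - 1:2d}: steps ~ {optimization_steps(b):,.0f}")
```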
Reposted by Sham Kakade
brandfonbrener.bsky.social
How does test loss change as we change the training data? And how does this interact with scaling laws?

We propose a methodology to approach these questions by showing that we can predict the performance across datasets and losses with simple shifted power law fits.
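[Editor's note] A minimal sketch of what a shifted power-law fit can look like in practice, e.g. relating loss under one training distribution to loss under another. The parameterization y = a * (x - c)^b + e and the synthetic data below are illustrative assumptions, not the paper's actual fits.

```python
# Hedged sketch: fit a shifted power law y = a * (x - c)**b + e with scipy.
# The (x, y) pairs are synthetic and the parameterization is one plausible
# "shifted power law", not necessarily the authors' exact form.
import numpy as np
from scipy.optimize import curve_fit

def shifted_power_law(x, a, b, c, e):
    return a * (x - c) ** b + e

rng = np.random.default_rng(0)
x = np.linspace(2.0, 4.0, 20)                      # e.g. loss on dataset A
y = 1.3 * (x - 1.5) ** 0.8 + 0.4                   # made-up ground-truth relation
y = y + rng.normal(0.0, 0.01, x.shape)             # e.g. noisy loss on dataset B

params, _ = curve_fit(
    shifted_power_law, x, y,
    p0=[1.0, 1.0, 1.0, 0.5],
    bounds=([1e-3, 1e-3, 0.0, 0.0], [10.0, 3.0, 1.9, 5.0]),  # keep x - c > 0
)
print("fitted (a, b, c, e):", np.round(params, 3))
```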