Will Held
@williamheld.com
2.1K followers 450 following 100 posts
Modeling Linguistic Variation to expand ownership of NLP tools. Views my own, but affiliations that might influence them: ML PhD Student under Prof. Diyi Yang · 2x RS Intern 🦙 Pretraining · Alum NYU Abu Dhabi · Burqueño · he/him
Pinned
williamheld.com
Balancing data across domains is key to training the best generalist LLMs!

In my summer work on the Meta Llama team, we introduce UtiliMax and MEDU, new methods to estimate data utility and optimize data mixes efficiently.

HF Blog: huggingface.co/blog/WillHel...
ArXiv: arxiv.org/abs/2501.11747
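(As a rough illustration of the data-mixing idea, not the UtiliMax/MEDU formulation from the paper: one way to frame it is maximizing estimated per-domain utility over mixture weights on the simplex, with a pull toward diversity. Toy sketch with made-up numbers below.)

```python
# Toy data-mix optimization (illustrative only; NOT UtiliMax/MEDU).
# Hypothetical per-domain utility estimates; the regularizer discourages
# collapsing the mixture onto a single domain.
import numpy as np

utilities = np.array([0.8, 0.5, 0.3, 0.6])        # made-up utilities per domain
uniform = np.full_like(utilities, 1.0 / len(utilities))
reg, lr = 0.5, 0.1                                 # diversity pull, step size
weights = uniform.copy()

for _ in range(500):
    # gradient of: utilities @ w - reg * KL(w || uniform)
    grad = utilities - reg * (np.log(weights / uniform) + 1.0)
    weights = weights * np.exp(lr * grad)          # exponentiated-gradient step
    weights = weights / weights.sum()              # re-project onto the simplex

print(weights.round(3))   # higher-utility domains get more of the token budget
```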
Reposted by Will Held
jurafsky.bsky.social
Now that school is starting for lots of folks, it's time for a new release of Speech and Language Processing! Jim and I added all sorts of material for the August 2025 release! With slides to match! Check it out here: web.stanford.edu/~jurafsky/sl...
Speech and Language Processing
web.stanford.edu
williamheld.com
"GPT-5 shows scaling laws are coming to an end"
Reposted by Will Held
peark.es
We’ve discovered a literal miracle with almost unlimited potential and it’s being scrapped for *no reason whatsoever*. This isn’t even nihilism, it’s outright worship of death and human suffering.
jbendery.bsky.social
"The U.S. Department of Health and Human Services (HHS) today announced the beginning of a coordinated wind-down of its mRNA vaccine development activities...."

cc: Sen. Bill Cassidy
williamheld.com
Really great pointer from Hao Zhang on the other site in relation to GPT-OSS's use of attention sinks.

If I were to guess, the attention sink is what allows them to omit QK-Norm, which has otherwise become standard.

www.evanmiller.org/attention-is...
Attention Is Off By One
Let’s fix these pesky Transformer outliers using Softmax One and QuietAttention.
www.evanmiller.org
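For reference, the fix proposed in the linked post is to add 1 to the softmax denominator, so a head can put near-zero total weight on the sequence instead of parking it on a sink token. A minimal sketch of that variant (my own illustration, not GPT-OSS's actual kernel):

```python
# "softmax_1" from the linked post: exp(x_i) / (1 + sum_j exp(x_j)).
# Adding 1 to the denominator lets attention weights sum to less than 1,
# so "no token is relevant" becomes expressible without a sink token.
import numpy as np

def softmax(scores):
    e = np.exp(scores - scores.max(axis=-1, keepdims=True))
    return e / e.sum(axis=-1, keepdims=True)

def softmax_one(scores):
    shift = np.maximum(scores.max(axis=-1, keepdims=True), 0.0)  # for stability
    e = np.exp(scores - shift)
    return e / (np.exp(-shift) + e.sum(axis=-1, keepdims=True))

scores = np.array([-4.0, -5.0, -3.0])   # a head that "wants" to attend to nothing
print(softmax(scores).sum())            # 1.0 by construction
print(softmax_one(scores).sum())        # well below 1.0
```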
williamheld.com
The SALT Lab is at #ACL2025 with our genius leader @diyiyang.bsky.social.

Come see work from
@yanzhe.bsky.social,
@dorazhao.bsky.social @oshaikh.bsky.social,
@michaelryan207.bsky.social, and myself at any of the talks and posters below!
Alt Text:

Conference schedule for July 28th (Monday) and July 29th (Tuesday), listing talk titles, locations, times, and authors:

July 28th, Monday:

1. Attacking Vision-Language Computer Agents via Pop-ups
Location: Hall 4/5, Time: 11:00–12:30
Authors: Yanzhe Zhang, Tao Yu, Diyi Yang


2. SPHERE: An Evaluation Card for Human-AI Systems
Location: Hall 4/5, Time: 18:00–19:30
Authors: Dora Zhao*, Qianou Ma*, Xinran Zhao, Chenglei Si, Chenyang Yang, Ryan Louie, Ehud Reiter, Diyi Yang*, Tongshuang Wu*
(asterisk denotes equal contribution)



July 29th, Tuesday:

1. SynthesizeMe! Inducing Persona-Guided Prompts for Personalized Reward Models in LLMs
Location: Hall 4/5, Time: 10:30–12:00
Authors: Michael J Ryan, Omar Shaikh, Aditri Bhagirath, Daniel Frees, William Barr Held, Diyi Yang


2. Distilling an End-to-End Voice Assistant Without Instruction Training Data
Location: Room 1.61, Time: 14:12 (Second Talk)
Authors: William Barr Held, Yanzhe Zhang, Weiyan Shi, Minzhi Li, Michael J Ryan, Diyi Yang


3. Mind the Gap: Static and Interactive Evaluations of Large Audio Models
Location: Room 1.61 (implied), follows previous talk
Authors: Minzhi Li*, William Barr Held*, Michael J Ryan, Kunat Pipatanakul, Potsawee Manakul, Hao Zhu, Diyi Yang
(asterisk denotes equal contribution)


4. EgoNormia: Benchmarking Physical Social Norm Understanding
Location: Hall 4/5, Time: 16:00–17:30
Authors: MohammadHossein Rezaei*, Yicheng Fu*, Phil Cuvin*, Caleb Ziems, Yanzhe Zhang, Hao Zhu, Diyi Yang
(asterisk denotes equal contribution)
williamheld.com
I'm in Vienna for #ACL2025!

My work is all presented tomorrow, but today you'll find me at the poster session from 11-12:30, evangelizing my labmate Yanzhe Zhang's work on his behalf.

If you're interested in the risks traditional pop-up attacks present for AI agents, come chat!
williamheld.com
It seems (at a minimum) like they post-trained on the virulently racist content from this thread. Musk framed this as a request for training data... and the top post is eugenics. It seems unlikely to be a coincidence that the post uses the same phrasing as the prompt they later removed...
williamheld.com
Btw, all of this is very nice for something that was a quick 15-line addition to Levanter.

github.com/stanford-crf...
williamheld.com
Have an optimizer you want to prove works better than AdamC/Muon/etc?

Submit a speedrun to Marin! marin.readthedocs.io/en/latest/tu...

For PRs with promising results, we're lucky to be able to help test at scale on compute generously provided by the TPU Research Cloud!
Adding an Optimizer for Speedrun - Marin Documentation
Documentation for the Marin project
marin.readthedocs.io
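The actual speedrun interface is in the linked docs; purely as a sketch of the general pattern, a custom optimizer in a JAX/optax stack (which Levanter/Marin build on) usually ends up as a chained GradientTransformation like this (hypothetical example, not Marin's API):

```python
# Hypothetical sketch of packaging an optimizer as an optax transform; the real
# Marin speedrun integration is described in the linked documentation.
import optax

def my_optimizer(learning_rate: float, weight_decay: float):
    # A genuinely new rule would supply its own transform with init/update fns;
    # this chain just recreates an AdamW-style update for illustration.
    return optax.chain(
        optax.scale_by_adam(),
        optax.add_decayed_weights(weight_decay),
        optax.scale(-learning_rate),
    )

opt = my_optimizer(3e-4, 0.01)
# state = opt.init(params)
# updates, state = opt.update(grads, state, params)
```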
williamheld.com
In our most similar setting to the original work (130M model), we don't see AdamC's benefits, but:

- We use a smaller WD (0.01), identified from sweeps, vs. the 0.05 used in the paper.
- We only train to Chinchilla-optimal (2B tokens), whereas the original paper trained to 200B.
williamheld.com
We see the same pattern at 300M and 500M!

Remember, everything else in these experiments is held constant by Levanter & Marin (data order, model init., etc.)

Experiment files here: github.com/marin-commun...
williamheld.com
As a side note, Kaiyue Wen found that weight decay also causes a slower loss decrease at the start of training: wandb.ai/marin-commun...

Similar to the end of training, this is likely because LR warmup also shifts the LR/WD ratio.

AdamC seems to mitigate this too.
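Rough intuition for the LR/WD ratio point (my gloss, not the linked analysis): decoupled AdamW applies, per step,

```latex
\theta_{t+1} = \theta_t - \eta_t \left( \frac{\hat{m}_t}{\sqrt{\hat{v}_t} + \epsilon} + \lambda \theta_t \right)
```

Since the Adam-normalized gradient term has roughly unit scale, a common back-of-envelope result is that the weight norm equilibrates around a value proportional to sqrt(eta_t / lambda). While eta_t ramps up during warmup (or decays during cooldown) that equilibrium keeps moving, which plausibly shows up as the slower early loss decrease noted above.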
williamheld.com
TL;DR: At 3 of our 4 scales, the AdamC results reproduce out of the box!

When compared to AdamW with all other factors held constant, AdamC mitigates the gradient norm increase at the end of training and leads to an overall lower loss (-0.04)!
williamheld.com
A while ago I mentioned that, for the marin.community project, this gradient increase led to problematic loss ascent, which we patched with Z-loss.

I was curious, does AdamC just work?

So over the weekend, I ran 4 experiments—130M to 1.4B params—all at ~compute-optimal token counts...🧵
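For anyone unfamiliar, the Z-loss patch mentioned above is the PaLM-style auxiliary term: penalize the log of the softmax normalizer so the logits can't drift upward. A minimal sketch (illustrative, not the actual Levanter/Marin code):

```python
# PaLM-style z-loss: add z_loss_weight * log(Z)^2 to the cross-entropy, where
# Z is the softmax partition function. Keeps logits from drifting upward.
import jax.numpy as jnp
from jax.scipy.special import logsumexp

def cross_entropy_with_z_loss(logits, labels, z_loss_weight=1e-4):
    log_z = logsumexp(logits, axis=-1)              # log partition function
    log_probs = logits - log_z[..., None]           # log-softmax
    nll = -jnp.take_along_axis(log_probs, labels[..., None], axis=-1).squeeze(-1)
    return (nll + z_loss_weight * log_z**2).mean()
```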
williamheld.com
kyutai.org/next/unmute has built-in turn detection on the ASR and full I/O streaming for the TTS. Solves the latency issues that I think are 90% of why people use end-to-end speech models in the first place!

From the details, you can tell @kyutai-labs.bsky.social is focused on real-world utility.
Unmute by Kyutai
Make LLMs listen and speak.
unmute.sh
Reposted by Will Held
Flattered and shocked that our paper received the #FAccT2025 best paper award.
facct.bsky.social
🏆 Announcing the #FAccT2025 best paper awards! 🏆

Congratulations to all the authors of the three best papers and three honorable mention papers.

Be sure to check out their presentations at the conference next week!

facct-blog.github.io/2025-06-20/b...
Announcing Best Paper Awards
The Best Paper Award Committee was chaired this year by Alex Chouldechova and included six Area Chairs. The committee selected three papers for the Best Paper Award and recognized three additional pap...
facct-blog.github.io
williamheld.com
As far as I can tell, the models aren't good enough right now to replace VFX at any high-quality commercial scale.

They are exactly good enough to generate fake viral videos for ad revenue on TikTok/Instagram & spread misinformation. Is there any serious argument for their safe release??
williamheld.com
I don't really see an argument for releasing such models with photorealistic generation capabilities.

What valid & frequent business use case is there for photorealistic video & voice generation like Veo 3 offers?
williamheld.com
I've only seen Veo 3 (or any other video generation model) used to produce viral videos. The fake videos seem to successfully trick the majority of commenters and have no visible watermark or disclosure of AI use.
Reposted by Will Held
brendannyhan.bsky.social
What would you say if you saw it in another country? A senator from a coequal branch of government dragged away by security from asking a question of a Cabinet official
justinbaragona.bsky.social
Kristi Noem: "We are not going away. We are staying here to liberate the city from the socialists and the burdensome leadership that this governor and that this mayor have placed on this country and what they have tried to insert into the city."

Sen. Alex Padilla is then forcibly removed!
Reposted by Will Held
echoshao8899.bsky.social
🚨 70 million US workers are about to face their biggest workplace transformation due to AI agents. But nobody’s asking them what they want.

While AI R&D races to automate everything, we took a different approach: auditing what workers want vs. what AI can deliver across the US workforce.🧵
williamheld.com
Really cool to see theory connect to practice! We observed this phenomenon when trying to do deeper WSD cooldowns of our 8B model in the marin.community project!

We Z-Lossed our way through the pain, but cool to see some stronger theory: marin.readthedocs.io/en/latest/re...
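For context, WSD is the warmup-stable-decay learning-rate schedule: short warmup, long flat phase, then a cooldown that can be branched off any checkpoint. A quick sketch of the shape (my illustration; the fractions are made up, not the Marin 8B settings):

```python
# Sketch of a warmup-stable-decay (WSD) learning-rate schedule.
def wsd_lr(step, total_steps, peak_lr, warmup_frac=0.01, decay_frac=0.2):
    warmup_steps = int(total_steps * warmup_frac)
    decay_steps = int(total_steps * decay_frac)
    stable_end = total_steps - decay_steps
    if step < warmup_steps:                       # linear warmup
        return peak_lr * step / max(warmup_steps, 1)
    if step < stable_end:                         # long constant phase
        return peak_lr
    # linear cooldown; "deeper" cooldowns extend or steepen this phase
    return peak_lr * max((total_steps - step) / max(decay_steps, 1), 0.0)
```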
williamheld.com
Now, I wouldn't do research on LLMs if I thought that was true in the long term!

But I think it's reasonable for skeptics to question whether advances in inference efficiency, hardware efficiency, and even core energy infrastructure will happen soon enough for current companies to capitalize.
williamheld.com
The underlying assumption is that they can (à la Uber/Lyft) eventually increase prices once the core customers are fundamentally reliant on AI.

The real question then is "what is demand once you start charging the true unit costs?". Personally, I found this article sobering but well reasoned.
The Subprime AI Crisis
None of what I write in this newsletter is about sowing doubt or "hating," but a sober evaluation of where we are today and where we may end up on the current path. I believe that the artificial intel...
www.wheresyoured.at