Costa Huang
@vwxyzjn.bsky.social
480 followers 130 following 49 posts
RL + LLM @ai2.bsky.social; main dev of https://cleanrl.dev/
vwxyzjn.bsky.social
Congrats on the launch!
eugenevinitsky.bsky.social
We're finally out of stealth: percepta.ai
We're a research / engineering team working together in industries like health and logistics to ship ML tools that drastically improve productivity. If you're interested in ML and RL work that matters, come join us 😀
Percepta | A General Catalyst Transformation Company
Transforming critical institutions using applied AI. Let's harness the frontier.
percepta.ai
vwxyzjn.bsky.social
That's all. Enjoy the new model 😆
vwxyzjn.bsky.social
One fun thing is that our model outperformed Qwen by ~26 points on IFEval. What's going on? We built some nice visualization tools and found that our model can follow instructions like "write without a comma" well.
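For flavor, here's a minimal sketch of how a verifiable "no commas" constraint can be checked and turned into a binary reward. The function names and structure are illustrative, not the actual open-instruct code:

```python
# Illustrative IFEval-style verifiable constraint check for the
# "write without a comma" instruction mentioned above.

def follows_no_comma_constraint(response: str) -> bool:
    """Return True if the response contains no commas."""
    return "," not in response

def constraint_reward(response: str) -> float:
    # Binary verifiable reward: 1.0 if the constraint holds, else 0.0.
    return 1.0 if follows_no_comma_constraint(response) else 0.0

print(constraint_reward("No commas here"))  # 1.0
print(constraint_reward("One, comma"))      # 0.0
```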
vwxyzjn.bsky.social
Our 1B model achieves impressive performance. See our official tweet for more details!

bsky.app/profile/ai2....
ai2.bsky.social
We're excited to round out the OLMo 2 family with its smallest member, OLMo 2 1B, surpassing peer models like Gemma 3 1B or Llama 3.2 1B. The 1B model should enable rapid iteration for researchers, more local development, and a more complete picture of how our recipe scales.
[Image: A bar graph comparing average performance (10 tasks) across OLMo 2 1B, SmolLM2 1.7B, Gemma 3 1B, Llama 3.2 1B, and Qwen 2.5 1.5B. The highest average, 42.7, is achieved by OLMo 2 1B.]
vwxyzjn.bsky.social
The model checkpoints are available at huggingface.co/collections/....

As always, we uploaded all the intermediate RL checkpoints.
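If you want to grab one of those intermediate checkpoints, the `revision` argument in transformers does the trick. The repo id and revision name below are placeholders, not real artifact names; check the collection for the actual ones:

```python
# Sketch: pulling an intermediate RL checkpoint from a Hugging Face
# revision (branch/tag). Repo id and revision are placeholders.
from transformers import AutoModelForCausalLM, AutoTokenizer

repo_id = "allenai/some-olmo-rlvr-checkpoint"  # placeholder repo id
revision = "step_200"                          # placeholder revision name

tokenizer = AutoTokenizer.from_pretrained(repo_id, revision=revision)
model = AutoModelForCausalLM.from_pretrained(repo_id, revision=revision)
```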
vwxyzjn.bsky.social
🥘 Excited to share our latest OLMo 1B models! Almost summer RL time. We did another two-stage RL run:
* The first RLVR run uses allenai/RLVR-GSM-MATH-IF-Mixed-Constraints
* The final RLVR run uses allenai/RLVR-MATH for targeted MATH improvement (loading sketch below)

Short 🧵
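A minimal sketch of pulling those two RLVR mixes with the `datasets` library; the `train` split name is an assumption, so inspect the repos first:

```python
# Load the two RLVR mixes used in the two-stage run above.
from datasets import load_dataset

stage1 = load_dataset("allenai/RLVR-GSM-MATH-IF-Mixed-Constraints", split="train")
stage2 = load_dataset("allenai/RLVR-MATH", split="train")

print(stage1)
print(stage2)
```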
vwxyzjn.bsky.social
We streamlined our release process to include the RLVR intermediate checkpoints as well. They are available in the revisions if you want to check it out.

See our updated collection here: huggingface.co/collections/...
vwxyzjn.bsky.social
Introducing OLMo-2-0325-32B-Instruct! It's spring RL curve time. This time we used GRPO for RLVR and trained a pretty nice, fully open-source model!
vwxyzjn.bsky.social
💾 I included the reproduction commands here:
github.com/allenai/open...
vwxyzjn.bsky.social
📦 Here is the trained model. The main recipe is basically the same, except we used a different RL algorithm, so we are just doing a minor release.

huggingface.co/allenai/Llam...
vwxyzjn.bsky.social
🗡️ The training length is a confounder, but I did run an ablation study on the same `allenai/RLVR-MATH` dataset, using almost identical hyperparameters for PPO and GRPO:

PPO's MATH score is more consistent with the Llama-3.1-Tulu-3-8B model, but GRPO achieved higher scores.
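A toy contrast of the two advantage estimates (not the open-instruct implementation): PPO subtracts a learned value baseline, while GRPO drops the critic and normalizes rewards within a group of completions sampled for the same prompt:

```python
import numpy as np

# Verifiable rewards for 4 completions of the same prompt.
rewards = np.array([1.0, 0.0, 1.0, 1.0])

# PPO-style: subtract a value-function baseline
# (here a made-up scalar standing in for the critic's estimate).
value_estimate = 0.6
ppo_advantages = rewards - value_estimate

# GRPO-style: normalize within the group, no critic needed.
grpo_advantages = (rewards - rewards.mean()) / (rewards.std() + 1e-8)

print(ppo_advantages)   # [ 0.4 -0.6  0.4  0.4]
print(grpo_advantages)
```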
vwxyzjn.bsky.social
📈 Below is the training curve. I think part of the performance gain also comes from running RL for longer.
vwxyzjn.bsky.social
🎆 @natolambert.bsky.social also updated this figure in our paper for better visualization :D
vwxyzjn.bsky.social
🎁 We applied the same RLVR dataset (allenai/RLVR-GSM-MATH-IF-Mixed-Constraints) using our new GRPO training script - the trained model checkpoints are better!
vwxyzjn.bsky.social
🔥 allenai/Llama-3.1-Tulu-3-8B (trained with PPO) -> allenai/Llama-3.1-Tulu-3.1-8B (trained with GRPO)

We are happy to "quietly" release our latest GRPO-trained Tulu 3.1 model, which is considerably better on MATH and GSM8K!
vwxyzjn.bsky.social
Thanks @soldni.bsky.social for the better OLMoE base model and for pulling everything through,
@ljvmiranda.bsky.social for on-policy preferences, and many others for coordinating and making the release happen 💪
vwxyzjn.bsky.social
For a cleaned-up version, please refer to Tulu 3 repro commands github.com/allenai/open...
vwxyzjn.bsky.social
For interested folks, I also included the script I used to launch all the SFT / DPO / RLVR experiments here: github.com/allenai/open.... It's not cleaned up, but I hope it shows some traces of the end-to-end workflow.
vwxyzjn.bsky.social
All of our research artifacts are fully open source and released. Check out our HF collection:

huggingface.co/collections/...
vwxyzjn.bsky.social
This is how our new allenai/OLMoE-1B-7B-0125-Instruct models compare with the existing allenai/OLMoE-1B-7B-0924-Instruct checkpoint :)

Huge gains on GSM8K, DROP, MATH, and AlpacaEval.
vwxyzjn.bsky.social
We found the RLVR + GSM8K recipe to work robustly, and the scores kept going up.
vwxyzjn.bsky.social
🤯 Check out our new iOS OLMoE app that runs the model on-device!

We also trained a new OLMoE-1B-7B-0125, this time using the Tulu 3 recipe. Very exciting that RLVR improved GSM8K by almost 10 points for OLMoE 🔥

A quick 🧵
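As a sketch of what a verifiable GSM8K-style reward can look like: extract the final number from the model's answer and compare it to the reference. The regex and exact-match rule are assumptions, not the exact open-instruct logic:

```python
import re

def extract_final_number(text: str) -> str | None:
    # Grab the last integer or decimal in the response, ignoring
    # thousands separators like "1,234".
    nums = re.findall(r"-?\d+(?:\.\d+)?", text.replace(",", ""))
    return nums[-1] if nums else None

def gsm8k_reward(response: str, gold_answer: str) -> float:
    # Binary verifiable reward: 1.0 if the final number matches the gold.
    pred = extract_final_number(response)
    if pred is None:
        return 0.0
    return 1.0 if float(pred) == float(gold_answer) else 0.0

print(gsm8k_reward("... so the total is 42.", "42"))  # 1.0
print(gsm8k_reward("I think it's 7.", "42"))          # 0.0
```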
vwxyzjn.bsky.social
So maybe using it in the loss directly instead of in the rewards changes certain things? I am not sure.

Anyway, I just thought the snippet was interesting to share.
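If "it" here is the KL penalty (a common point of difference between PPO-style and GRPO-style setups), here's a toy illustration of the two placements with made-up tensors: folding a per-token KL estimate into the reward versus adding it as a separate term in the objective. The gradient paths differ between the two, which may be part of what changes:

```python
import torch

# Toy values; in real training these come from the model and carry gradients.
logp = torch.tensor([-1.2, -0.8, -1.0])      # policy log-probs (per token)
ref_logp = torch.tensor([-1.0, -1.0, -1.0])  # reference-model log-probs
advantages = torch.tensor([0.5, 0.5, 0.5])
beta = 0.05

kl = logp - ref_logp  # simple per-token KL estimate

# Option A (KL in the reward): shape the advantage, then treat it as a
# constant when weighting the policy gradient.
shaped_advantages = advantages - beta * kl
loss_kl_in_reward = -(logp * shaped_advantages.detach()).mean()

# Option B (KL in the loss): keep the reward clean and add the KL term
# directly to the objective, so it is differentiated through logp.
loss_kl_in_loss = -(logp * advantages).mean() + beta * kl.mean()

print(loss_kl_in_reward, loss_kl_in_loss)
```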