Jacob Morrison
@jacobcares.bsky.social
540 followers 380 following 12 posts
PhD student @ UW, research @ Ai2
Reposted by Jacob Morrison
kjha02.bsky.social
Forget modeling every belief and goal! What if we represented people as following simple scripts instead (e.g., "cross the crosswalk")?

Our new paper shows AI which models others’ minds as Python code 💻 can quickly and accurately predict human behavior!

shorturl.at/siUYI
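To make the "simple scripts" idea concrete, here is a minimal, hypothetical sketch of what a script-as-code model of another agent could look like. The function names and the pedestrian scenario are placeholders inspired by the post's crosswalk example, not code from the paper.

```python
# Hypothetical illustration of representing an agent as a simple script,
# in the spirit of the post's "cross the crosswalk" example.
# All names here are placeholders, not the paper's implementation.

def cross_the_crosswalk(agent_pos, crosswalk_start, crosswalk_end, light_is_green):
    """A 'script' policy: walk to the crosswalk, wait for green, then cross."""
    if agent_pos < crosswalk_start:
        return "walk_forward"   # approach the crosswalk
    if not light_is_green:
        return "wait"           # stand at the curb until the light changes
    if agent_pos < crosswalk_end:
        return "walk_forward"   # cross while the light is green
    return "done"


def predict_behavior(script, observations):
    """Predict an agent's next actions by running its script on each observation."""
    return [script(**obs) for obs in observations]


if __name__ == "__main__":
    observations = [
        {"agent_pos": 0, "crosswalk_start": 3, "crosswalk_end": 8, "light_is_green": False},
        {"agent_pos": 3, "crosswalk_start": 3, "crosswalk_end": 8, "light_is_green": False},
        {"agent_pos": 3, "crosswalk_start": 3, "crosswalk_end": 8, "light_is_green": True},
    ]
    print(predict_behavior(cross_the_crosswalk, observations))
    # ['walk_forward', 'wait', 'walk_forward']
```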
Reposted by Jacob Morrison
ai2.bsky.social
RewardBench 2 is here! We took a long time to learn from our first reward model evaluation tool to make one that is substantially harder and more correlated with both downstream RLHF and inference-time scaling.
The RewardBench 2 Leaderboard on HuggingFace.
Reposted by Jacob Morrison
natolambert.bsky.social
Heading to NAACL? With "verification being the key to AI" you should go to the poster session Friday, 9-10:30am to chat with my star colleagues @valentinapy.bsky.social + @jacobcares.bsky.social about RewardBench (and really RewardBench 2, evaluation, and reward models in post-training).
jacobcares.bsky.social
Valentina and I will be presenting RewardBench at NAACL! Come say hi at the poster session on Friday and we can chat about reward models, staying up for 30 hours straight to rapidly reset from Singapore time, and more 🏜️
valentinapy.bsky.social
I'll be at #NAACL2025:

🖇️To present my paper "Superlatives in Context", showing how the interpretation of superlatives is very context-dependent and often implicit, and how LLMs handle such semantic underspecification

🖇️And we will present RewardBench on Friday

Reach out if you want to chat!
jacobcares.bsky.social
what a flattering picture lol
jacobcares.bsky.social
I'm in Singapore for @iclr-conf.bsky.social! Come check out our spotlight paper on the environmental impact of training OLMo (link in next tweet) during the Saturday morning poster session from 10-12:30 -- happy to chat about this or anything else! DMs should be open, email works too
Reposted by Jacob Morrison
Ai2 @ai2.bsky.social · Mar 13
Announcing OLMo 2 32B: the first fully open model to beat GPT-3.5 & GPT-4o mini on a suite of popular, multi-skill benchmarks.

Comparable to the best open-weight models, but at a fraction of the training compute. When you have a good recipe, ✨ magical things happen when you scale it up!
jacobcares.bsky.social
also some other tülu contributors are on the market:
@ljvmiranda.bsky.social (ljvmiranda921.github.io) and Xinxi Lyu (alrope123.github.io) are also applying to phd programs, and @valentinapy.bsky.social (valentinapy.github.io) is on the faculty market, hire them all!!
jacobcares.bsky.social
check out the updated paper here: arxiv.org/pdf/2411.15124 (with a beautiful new template!) and the model here: huggingface.co/allenai/Llam... and on the ai2 playground: playground.allenai.org
jacobcares.bsky.social
big tülu is here! can't wait for everyone to try it, it's been a lot of fun seeing how RL performs at this scale, thanks to @hamishivi.bsky.social and @vwxyzjn.bsky.social, and to the preference data from @ljvmiranda.bsky.social

on an unrelated note, I'm applying to phd programs this year 👀
Ai2 @ai2.bsky.social · Jan 30
Here is Tülu 3 405B 🐫 our open-source post-training model that surpasses the performance of DeepSeek-V3! It demonstrates that our recipe, which includes RLVR, scales to 405B, with performance on par with GPT-4o and surpassing prior open-weight post-trained models of the same size, including Llama 3.1.
The logo for Tülu 405B.
Reposted by Jacob Morrison
hamishivi.bsky.social
Excited to see Tulu 3 sitting between Llama 3.1 and 3.3 Instruct on the Chatbot Arena leaderboard right now!

Particularly happy it is top 20 for Math and Multi-turn prompts :)

All the details and data on how to train a model this good are right here: arxiv.org/abs/2411.15124
Reposted by Jacob Morrison
natolambert.bsky.social
Very pleased to see Tulu 3 70B more or less tied with Llama 3.1 70B Instruct on style-controlled ChatBotArena. The only model anywhere close to that with open code and data for post-training! Lots of stuff people can build on.

Next looking for OLMo 2 numbers.
Reposted by Jacob Morrison
vwxyzjn.bsky.social
We released the OLMo 2 report! Ready for some more RL curves? 😏

This time, we applied RLVR iteratively! Our initial RLVR checkpoint on the RLVR dataset mix showed a low GSM8K score, so we ran another round of RLVR on GSM8K only, and then another on MATH only 😆.

And it works! A thread 🧵 1/N
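For context, RLVR (reinforcement learning with verifiable rewards) replaces a learned reward model with an automatic correctness check on tasks like GSM8K and MATH. Below is a minimal sketch of such a verifiable reward function; the answer-extraction heuristic and names are illustrative placeholders, not the actual Tülu/OLMo implementation.

```python
import re
from typing import Optional

def extract_final_answer(completion: str) -> Optional[str]:
    """Pull the last number out of a model completion (a simple placeholder heuristic)."""
    numbers = re.findall(r"-?\d+(?:\.\d+)?", completion.replace(",", ""))
    return numbers[-1] if numbers else None

def verifiable_reward(completion: str, ground_truth: str) -> float:
    """RLVR-style reward: 1.0 if the extracted answer matches the reference, else 0.0."""
    answer = extract_final_answer(completion)
    return 1.0 if answer is not None and answer == ground_truth.strip() else 0.0

# Example: this scalar reward is what the policy-gradient update would optimize.
print(verifiable_reward("... so the total is 42.", "42"))  # 1.0
print(verifiable_reward("I think it's 41.", "42"))         # 0.0
```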
Reposted by Jacob Morrison
kylelo.bsky.social
kicking off 2025 with our OLMo 2 tech report while paying homage to the sequelest of sequels 🫡

🚗 2 OLMo 2 Furious 🔥 is everything we learned since OLMo 1, with deep dives into:

🚖 stable pretrain recipe
🚔 lr anneal 🤝 data curricula 🤝 soups
🚘 tulu post-train recipe
🚜 compute infra setup

👇🧵
Reposted by Jacob Morrison
liujch1998.bsky.social
Want to predict the task performance of LMs before pretraining them?

We develop task scaling laws and model ladders, which predict the accuracy of OLMo 2 7B & 13B models on individual tasks to within 2 points of absolute error, at a cost of just 1% of the compute used to pretrain them.
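Roughly, a model ladder chains two fits: task loss as a function of scale, estimated from small "ladder" models, and task accuracy as a function of task loss; the chained fit is then extrapolated to the target model. The sketch below illustrates that two-stage idea with invented data and assumed functional forms (a power law and a sigmoid), not the paper's actual fits.

```python
import numpy as np
from scipy.optimize import curve_fit

# Placeholder "ladder" measurements (invented for illustration only):
# training compute of each small ladder model, and its measured task loss / accuracy.
compute = np.array([4.3e18, 1.6e19, 6.9e19, 2.0e20])   # FLOPs
task_loss = np.array([2.45, 2.17, 1.92, 1.76])
task_acc = np.array([0.30, 0.45, 0.71, 0.86])

# Stage 1 (assumed form): task loss as a saturating power law in compute.
def loss_from_compute(c, a, alpha, e):
    return a * (c / 1e18) ** (-alpha) + e

p_loss, _ = curve_fit(loss_from_compute, compute, task_loss,
                      p0=[2.0, 0.1, 0.5], maxfev=20000)

# Stage 2 (assumed form): task accuracy as a sigmoid in task loss.
def acc_from_loss(l, span, k, mid, floor):
    return span / (1.0 + np.exp(k * (l - mid))) + floor

p_acc, _ = curve_fit(acc_from_loss, task_loss, task_acc,
                     p0=[0.7, 5.0, 2.0, 0.25], maxfev=20000)

# Chain the two fits to predict a much larger target run (e.g. ~1e22 FLOPs).
target_compute = 1e22
pred_loss = loss_from_compute(target_compute, *p_loss)
pred_acc = acc_from_loss(pred_loss, *p_acc)
print(f"predicted task loss: {pred_loss:.2f}, predicted accuracy: {pred_acc:.2f}")
```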
Reposted by Jacob Morrison
dieworkwear.bsky.social
Why is Tokyo so fashionable? Some theories. 🧵
Saagar Enjeti tweets: "Probably a cold take but IMO Tokyo is the male fashion capital of the world: whether it’s western wear, suits, street wear the aesthetic is refined to the highest possible level

From the salaryman to the rebel teen they are impeccably dressed

It also helps no one is fat"
Reposted by Jacob Morrison
soldaini.net
OLMo 2 is out 🥳 7B and 13B trained on 5T tokens, and meticulously instruction-tuned using the Tulu 3 recipe.

Simply the best fully open models yet.

Really proud of the work & the amazing team at
@ai2.bsky.social
Reposted by Jacob Morrison
Ai2 @ai2.bsky.social · Nov 26
Meet OLMo 2, the best fully open language model to date, including a family of 7B and 13B models trained on up to 5T tokens. OLMo 2 outperforms other fully open models and competes with open-weight models like Llama 3.1 8B — as always, we released our data, code, recipes and more 🎁
The OLMo 2 models sit at the Pareto frontier of training FLOPs vs model average performance.
jacobcares.bsky.social
Thanks Tyler, great to hear from you!!