🐟 more on our eval ideology
🦈 more baselines
🍣 more about RL Zero
etc
we picked the final model (internally called moonlit surfer 🌛🏄) not just on bench scores but on good vibes 🥰
Come say hi 👋 if you wanna chat about
🦈 olmo 3 stories
🐟 pretraining data & evals
🍣 midtraining shouldn't exist
🐠 model specialization
🐡 AI for education
🍥 tabletop games
it's exactly what you're saying -- each point refers to a stage of development. our release has data+ckpts+evals for all stages we use (figure), and we wanted to show how it compares to other models, which typically release only a few stages
But team organization to sustain consistent model improvements (without burnout) is important!
We have explorers "own" target capabilities & a centralized assessment team run "integration tests"
The traditional way of using data quality scores is to threshold: define a cutoff and take all the documents above it.
But why not sample *proportional* to data quality?
We use Quality-Aware Upsampling to do exactly this
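not our actual pipeline, but the core idea fits in a few lines (function names & scores here are hypothetical):

```python
import random

def threshold_sample(docs, scores, cutoff):
    # traditional approach: hard cutoff on a per-doc quality score
    return [d for d, s in zip(docs, scores) if s >= cutoff]

def quality_proportional_sample(docs, scores, k):
    # quality-aware idea: draw k docs with probability proportional to
    # quality; sampling with replacement means the highest-quality docs
    # can appear multiple times (i.e. get upsampled) instead of
    # everything above a cutoff being weighted equally
    return random.choices(docs, weights=scores, k=k)
```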
It's easy to learn "optimal" mixes that oversample heavily from certain pockets. eg, STEM docs are valuable for climbing MMLU, but you don't have infinite STEM docs
We approach mixing as Token Constrained Optimization over diverse evals
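one toy way to picture the token constraint (a greedy sketch with made-up numbers, not the actual optimizer):

```python
def mix_tokens(budget, pools, utility):
    """Greedy sketch of token-constrained mixing: allocate a total token
    budget across domains, never exceeding each domain's available
    tokens -- the constraint that stops a learned mix from infinitely
    oversampling e.g. STEM docs.

    pools:   {domain: tokens available}
    utility: {domain: assumed marginal value per token on the evals}
    """
    alloc = {d: 0 for d in pools}
    remaining = budget
    # fill highest-utility domains first, capped by availability
    for d in sorted(pools, key=lambda d: -utility[d]):
        take = min(pools[d], remaining)
        alloc[d] = take
        remaining -= take
        if remaining == 0:
            break
    return alloc
```

eg with a 100-token budget, a high-utility STEM pool caps out at its 40 available tokens and the rest spills into web data.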
We create evals better suited to different compute scales: our "easy" set of tasks+metrics supports very small-scale experiments before switching to our "main" set of evals, on which smaller models are below the noise floor
🐟Olmo 3 32B Base, the best fully-open base model to date, near Qwen 2.5 & Gemma 3 on diverse evals
🐠Olmo 3 32B Think, the first fully-open reasoning model approaching Qwen 3 levels
🐡12 training datasets corresponding to different training stages
🐟interns own major parts of our model development, sometimes even leading whole projects
🐡we're committed to open science & actively help our interns publish their work
reach out if u wanna build open language models together 🤝
links 👇
🔥training our VLM using RLVR with binary unit test rewards🔥
it's incredibly effective & unit test creation is easy to scale w synthetic data pipelines
check it out at olmocr.allen.ai
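the reward itself can be as simple as this (hypothetical sketch, not the actual training code; the real tests check properties of the OCR output):

```python
def unit_test_reward(model_output: str, tests) -> float:
    # binary verifiable reward for RLVR: 1.0 iff the output passes
    # every unit test, 0.0 otherwise. `tests` are predicate functions,
    # e.g. auto-generated by a synthetic data pipeline
    return 1.0 if all(t(model_output) for t in tests) else 0.0
```

a page-level test might assert that a known string or table structure appears in the transcription.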
findings from a large-scale survey of 800 researchers on how they use LMs in their research #colm2025
come chat w me about pretraining horror stories, data & evals, what we're cookin for next olmo, etc
made a 🔥 poster for thursday sess, come say hi
standard benchmarking is simple:
🐟 select test cases
🐠 score LM on each test
🦈 aggregate scores to estimate perf
fluid benchmarking changes two of those steps:
🍣 find max informative test cases
🍥 estimate 'ability', not simple avg perf
why care? turn ur grey noisy benchmarks to red ones!
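the 'ability' estimate comes from item response theory; a minimal 2PL sketch (item params, grid & names are illustrative, not the paper's implementation):

```python
import math

def p_correct(ability, difficulty, discrimination):
    # 2PL item response model: probability the LM answers an item right
    return 1.0 / (1.0 + math.exp(-discrimination * (ability - difficulty)))

def estimate_ability(responses, items):
    # maximum-likelihood ability over a coarse grid; item parameters
    # would come from fitting an IRT model to many LMs' responses
    grid = [g / 10 for g in range(-30, 31)]
    def loglik(a):
        ll = 0.0
        for correct, (diff, disc) in zip(responses, items):
            p = p_correct(a, diff, disc)
            ll += math.log(p if correct else 1.0 - p)
        return ll
    return max(grid, key=loglik)

def item_information(ability, difficulty, discrimination):
    # Fisher information: pick the most informative next test case
    p = p_correct(ability, difficulty, discrimination)
    return discrimination ** 2 * p * (1.0 - p)
```

items near the current ability estimate carry the most information, which is why adaptively selecting test cases beats scoring everything.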