Luca Soldaini 🎀
@soldaini.net
I like tokens! Lead for OLMo data at @ai2.bsky.social (Dolma 🍇) w @kylelo.bsky.social. Open source is fun 🤖☕️🍕🏳️‍🌈 Opinions are sampled from my own stochastic parrot

more at https://soldaini.net
it’s OCR week! learn how we use verifiable rewards against unit tests to improve olmOCR’s PDF understanding

state of the art OCR, fully open model:
We’re updating olmOCR, our model for turning PDFs & scans into clean text with support for tables, equations, handwriting, & more. olmOCR 2 uses synthetic data + unit tests as verifiable rewards to reach state-of-the-art performance on challenging documents. 🧵
October 22, 2025 at 5:56 PM
best commute on earth
September 17, 2025 at 3:08 PM
which one of you im gonna have the pleasure to see at COLM???
September 12, 2025 at 4:43 PM
my keystrokes go through a light-up starry cable

OF COURSE my code is better than yours
August 20, 2025 at 4:08 PM
12+ years in this country, first time I get to wear this sticker 🗳️
August 5, 2025 at 3:29 PM
wearing italian camo* at ICML

*ordering iced lattes rather than espressos at coffee shops
July 18, 2025 at 4:07 PM
new @ai2.bsky.social office has something for everyone: stunning views for the outdoorsy kind, 2.5 Gbps connection at every desk for the indoor nerds
June 23, 2025 at 10:07 PM
Waymo is cool but BART from SFO to downtown SF is cooler

101 can be as dark red as you want on google maps!
June 18, 2025 at 3:55 PM
2025 AI hot take: everyone should use FastText more. Word embeddings are awesome.
June 6, 2025 at 3:39 AM
today might be rainy, but PNW summer is already here
May 31, 2025 at 11:32 PM
I've silenced all notifications on all my devices and it's truly the best thing ever

...I am considering allowing calendar notifications tho cuz I almost missed 3 meetings already 😅
May 15, 2025 at 1:13 AM
two weeks traveling and I miss my mechanical keyboard so much
April 27, 2025 at 2:54 AM
when someone says they wanna bring me to their favorite italian restaurant
April 22, 2025 at 1:33 AM
I am still perpetually in awe that skill emergence exists in language models

millions of caveats but we have models that pick up capabilities from plain text???

it's so magical, I can't believe we got such a treat
April 2, 2025 at 5:24 PM
bluesky deserves to know we’ve adopted a cat and he’s the most handsome boy
March 26, 2025 at 4:07 AM
Summary of the recommendation we submitted to the White House to ensure the success of open & transparent AI

As a meta point, I’m very grateful to be in a position where I can put my technical expertise in the service of policy needs 🥰
We submitted a recommendation to the Office of Science and Technology Policy encouraging them to prioritize a multi-stakeholder, open-source AI ecosystem. You can read our blog post and comment here: allenai.org/blog/OSTP
Ai2’s Recommendations to OSTP to enable open-source innovation with the U.S. AI Action Plan | Ai2
Ai2's recommendation to the Office of Science and Technology Policy (OSTP) in response to the White House’s Request for Information on an AI Action Plan.
allenai.org
March 20, 2025 at 4:14 AM
"excuse me sir do you have a moment to talk about olmOCR"
March 10, 2025 at 7:54 PM
if i ever find a genie lamp im gonna save a wish for “uninvent pydantic and click” 😤
March 6, 2025 at 3:49 AM
in the upcoming LLMs war, i choose a neutral team
February 20, 2025 at 8:57 PM
This was such a fun project to work on!

We release efficient classifiers 🌐 to partition large corpora, and use them to improve sampling for LLM pretraining

great work led by @awettig.bsky.social 👇
🤔 Ever wondered how prevalent some type of web content is during LM pre-training?

In our new paper, we propose WebOrganizer which *constructs domains* based on the topic and format of CommonCrawl web pages 🌐

Key takeaway: domains help us curate better pre-training data! 🧵/N
February 18, 2025 at 3:17 PM
my toxic trait is enjoying the AWS cost optimization dashboard a little too much
February 16, 2025 at 9:39 PM
Reposted by Luca Soldaini 🎀
one benefit of open models is privacy 🔐you can run them locally & keep all your data on your local device instead of sending it to a company elsewhere

congrats @soldaini.net for heavy lifting, showing our OLMoE model can run on iPhones📱
We took our most efficient model and made an open-source iOS app📱but why?

As phones get faster, more AI will happen on device. With OLMoE, researchers, developers, and users can get a feel for this future: fully private LLMs, available anytime.

Learn more from @soldaini.net👇 youtu.be/rEK_FZE5rqQ
Ai2 OLMoE: Fully open source, running entirely on-device
YouTube video by Ai2
youtu.be
February 11, 2025 at 4:35 PM
Reposted by Luca Soldaini 🎀
Best part of this that Luca isn’t highlighting to start is that we trained a way better OLMoE for this too.

All from better annealing and post train. Didn’t need to redo pre training. Goes to show how much potential these models have!

new instruct model: huggingface.co/allenai/OLMo...
February 11, 2025 at 3:16 PM
Reposted by Luca Soldaini 🎀
🤯 Check out our new iOS OLMoE app that runs the model on-device!

We also trained a new OLMoE-1B-7B-0125, this time using the Tulu 3 recipe. Very exciting that RLVR improved gsm8k by almost 10 points for OLMoE 🔥

A quick 🧵
February 11, 2025 at 3:30 PM
They made me do video 😬 but for a good reason!

We are launching an iOS app–it runs OLMoE locally 📱 We're gonna see more on-device AI in 2025, and wanted to offer a simple way to prototype with it

App: apps.apple.com/us/app/ai2-o...
Code: github.com/allenai/OLMo...
Blog: allenai.org/blog/olmoe-app
February 11, 2025 at 2:18 PM