Valentina Pyatkin
@valentinapy.bsky.social
5.7K followers 570 following 74 posts
Postdoc in AI at the Allen Institute for AI & the University of Washington. 🌐 https://valentinapy.github.io
valentinapy.bsky.social
Now accepted to #neurips25 datasets & benchmarks!
See you in San Diego! 🥳
valentinapy.bsky.social
💡Beyond math and code, instruction following with verifiable constraints is well suited to learning with RLVR.
But the existing set of constraints and verifier functions is limited, and most models overfit to IFEval.
We introduce IFBench to measure model generalization to unseen constraints.
Reposted by Valentina Pyatkin
wiair.bsky.social
🚀 Can open science beat closed AI? Tülu 3 makes a powerful case. In our new #WiAIRpodcast, we speak with Valentina Pyatkin (@valentinapy.bsky.social) of @ai2.bsky.social and the University of Washington about a fully open post-training recipe—models, data, code, evals, and infra. #WomenInAI 1/8🧵
Reposted by Valentina Pyatkin
wiair.bsky.social
"𝐋𝐋𝐌 𝐏𝐨𝐬𝐭-𝐭𝐫𝐚𝐢𝐧𝐢𝐧𝐠: 𝐎𝐩𝐞𝐧 𝐒𝐜𝐢𝐞𝐧𝐜𝐞 𝐓𝐡𝐚𝐭 𝐏𝐨𝐰𝐞𝐫𝐬 𝐏𝐫𝐨𝐠𝐫𝐞𝐬𝐬 " 🎙️

On Sept 17, the #WiAIRpodcast speaks with @valentinapy.bsky.social (@ai2.bsky.social & University of Washington) about open science, post-training, mentorship, and visibility

#WiAIR #NLProc
Reposted by Valentina Pyatkin
Ai2 @ai2.bsky.social · Aug 14
With fresh support of $75M from NSF and $77M from NVIDIA, we’re set to scale our open model ecosystem, bolster the infrastructure behind it, and fast‑track reproducible AI research to unlock the next wave of scientific discovery. 💡
valentinapy.bsky.social
On my way to Oxford: Looking forward to speaking at OxML 2025
valentinapy.bsky.social
🔈For the SoLaR workshop at @COLM_conf we are soliciting opinion abstracts to encourage new perspectives on responsible language modeling; 1-2 of them will be selected for presentation at the workshop.

Please use the Google form below to submit your opinion abstract ⬇️
Reposted by Valentina Pyatkin
yanai.bsky.social
I had a lot of fun contemplating memorization questions at the @l2m2workshop.bsky.social panel yesterday, together with Niloofar Mireshghallah and Reza Shokri, moderated by
@pietrolesci.bsky.social, who did a fantastic job!
#ACL2025
Reposted by Valentina Pyatkin
akhilayerukola.bsky.social
I'll be at #ACL2025🇦🇹!!
Would love to chat about all things pragmatics 🧠, redefining "helpfulness"🤔 and enabling better cross-cultural capabilities 🗺️ 🫶

Presenting our work on culturally offensive nonverbal gestures 👇
🕛Wed @ Poster Session 4
📍Hall 4/5, 11:00-12:30
akhilayerukola.bsky.social
Did you know? Gestures used to express universal concepts, like wishing for luck, vary DRAMATICALLY across cultures!
🤞 means luck in the US but is deeply offensive in Vietnam 🚨

📣 We introduce MC-SIGNS, a test bed to evaluate how LLMs/VLMs/T2I handle such nonverbal behavior!

📜: arxiv.org/abs/2502.17710
Figure showing that interpretations of gestures vary dramatically across regions and cultures. ‘Crossing your fingers,’ commonly used in the US to wish for good luck, can be deeply offensive to female audiences in parts of Vietnam. Similarly, the 'fig gesture,' a playful 'got your nose' game with children in the US, carries strong sexual connotations in Japan and can be highly offensive.
valentinapy.bsky.social
I did! very very good!!
valentinapy.bsky.social
🔥tokenization panel!
valentinapy.bsky.social
why is vancouver sushi so good? 🤤 (vancouver food in general actually)
Reposted by Valentina Pyatkin
Ai2 @ai2.bsky.social · Jul 14
This week is #ICML in Vancouver, and a number of our researchers are participating. Here's the full list of Ai2's conference engagements—we look forward to connecting with fellow attendees. 👋
valentinapy.bsky.social
Let me know if you want to meet up! Always happy to chat!
valentinapy.bsky.social
07/17, Poster: Diverging Preferences: When do Annotators Disagree and do Models Know? icml.cc/virtual/2025...

07/16, Poster: SafetyAnalyst: Interpretable, transparent, and steerable safety moderation for AI behavior
icml.cc/virtual/2025...
valentinapy.bsky.social
I'll be at ICML in Vancouver next week! #ICML2025
You can find me at the following:

- giving an invited talk at the "Models of Human Feedback for AI Alignment" workshop

- giving an invited talk at the "AI for Math" workshop

I'll also present these two papers ⤵️
valentinapy.bsky.social
In Geneva 🇨🇭 to attend the International Open-Source LLM Builders Summit and present OLMo and Tülu!
valentinapy.bsky.social
And I can't forget to thank my amazing co-authors! In particular @saumyamalik.bsky.social and Victoria Graf, with whom I looked through so many constraints 😄
And @natolambert.bsky.social @hanna-nlp.bsky.social @hamishivi.bsky.social @pdasigi.bsky.social @vwxyzjn.bsky.social
valentinapy.bsky.social
We further discuss what happens when you over-optimize on IF-RLVR: the models tend to prioritize the constraint over the actual instruction! And we suggest possible solutions to this problem.

📝 Paper: buff.ly/1qSA9Pq
💻 Code: github.com/allenai/IFBe...
valentinapy.bsky.social
Additionally, we wrote new training constraints and verifier functions, and we propose a recipe for IF-RLVR training that improves generalization.
We find that IF-RLVR generalizes best when you start from base models and train on multiple constraints per instruction!
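To make "multiple constraints per instruction" concrete, here is a minimal, hypothetical Python sketch (the verifiers and names below are illustrative, not the project's code) of an all-or-nothing reward over several verifiable constraints attached to one instruction:

```python
# Hypothetical sketch, not the actual IF-RLVR training code: with multiple
# constraints attached to one instruction, the reward is all-or-nothing --
# the response must satisfy every verifier to earn reward 1.0.
from typing import Callable, List

Verifier = Callable[[str], bool]


def all_lowercase(response: str) -> bool:
    """Constraint: the whole response must be lowercase."""
    return response == response.lower()


def min_sentence_count(response: str, n: int = 3) -> bool:
    """Constraint: the response must contain at least `n` sentences."""
    return sum(response.count(p) for p in ".!?") >= n


def compound_reward(response: str, verifiers: List[Verifier]) -> float:
    """All-or-nothing reward over several constraints on a single instruction."""
    return 1.0 if all(v(response) for v in verifiers) else 0.0


reward = compound_reward(
    "this reply stays lowercase. it has three sentences. promise.",
    [all_lowercase, lambda r: min_sentence_count(r, 3)],
)  # -> 1.0
```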
valentinapy.bsky.social
💡Beyond math and code, instruction following with verifiable constraints is well suited to learning with RLVR.
But the existing set of constraints and verifier functions is limited, and most models overfit to IFEval.
We introduce IFBench to measure model generalization to unseen constraints.
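For readers new to RLVR on instruction following, here is a minimal, hypothetical sketch (not taken from the IFBench code) of what a verifiable constraint and its verifier function can look like, and how the verifier's boolean output becomes a binary reward:

```python
# Hypothetical sketch, not code from the IFBench repository: a "verifiable
# constraint" is checked by a deterministic verifier function whose boolean
# output can serve directly as a binary RLVR reward.
import re


def verify_word_limit(response: str, max_words: int = 100) -> bool:
    """Constraint: answer in at most `max_words` words."""
    return len(response.split()) <= max_words


def verify_keyword_mentions(response: str, keyword: str, min_count: int = 2) -> bool:
    """Constraint: mention `keyword` at least `min_count` times."""
    return len(re.findall(re.escape(keyword), response, flags=re.IGNORECASE)) >= min_count


def rlvr_reward(response: str) -> float:
    """Binary reward: 1.0 if the constraint attached to the prompt holds, else 0.0."""
    return 1.0 if verify_word_limit(response, max_words=100) else 0.0
```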
Reposted by Valentina Pyatkin
natolambert.bsky.social
plus, some fun RL experiments
Reposted by Valentina Pyatkin
natolambert.bsky.social
This new benchmark created by @valentinapy.bsky.social should be the new default replacing IFEval. Some of the best frontier models get <50% and it comes with separate training prompts so people don’t effectively train on test.

Wild gap from o3 > Gemini 2.5 pro of like 30 points.
ai2.bsky.social
Introducing IFBench, a benchmark to measure how well AI models follow new, challenging, and diverse verifiable instructions. Top models like Gemini 2.5 Pro or Claude 4 Sonnet are only able to score up to 50%, presenting an open frontier for post-training. 🧵