Women in AI Research - WiAIR
@wiair.bsky.social
79 followers 0 following 370 posts
WiAIR is dedicated to celebrating the remarkable contributions of female AI researchers from around the globe. Our goal is to empower early career researchers, especially women, to pursue their passion for AI and make an impact in this exciting field.
wiair.bsky.social
We dive into how to make AI systems that truly earn our trust - not just appear trustworthy.

🎬 Full episode now on YouTube → youtu.be/xYb6uokKKOo
Also on Spotify: open.spotify.com/show/51RJNlZ...
Apple: podcasts.apple.com/ca/podcast/w...
wiair.bsky.social
💡 Key takeaways from our conversation:
• Real AI research is messy, nonlinear, and full of surprises.
• Trust in AI comes in two forms: intrinsic (how it reasons) and extrinsic (proven reliability).
• Sometimes, human-AI collaboration makes things… worse.
wiair.bsky.social
🎙️ New Women in AI Research episode out now!
This time, we sit down with @anamarasovic.bsky.social to unpack some of the toughest questions in AI explainability and trust.

🔗 Watch here → youtu.be/xYb6uokKKOo
wiair.bsky.social
🎙️ New #WiAIR episode coming soon!

We sat down with Ana Marasović to talk about the uncomfortable truths behind AI trust.
When can we really trust AI explanations?

Watch the trailer: youtu.be/GBghj6S6cic
Then subscribe on YouTube to catch the full episode when it drops.
wiair.bsky.social
Our new guest at #WiAIRpodcast is @anamarasovic.bsky.social
(Asst Prof @ University of Utah, ex @ Allen AI). We'll talk with her about faithfulness, trust, and robustness in AI.
The episode is coming soon. Don't miss it:
www.youtube.com/@WomeninAIRe...

#WiAIR #NLProc
wiair.bsky.social
"Inclusivity is about saying: Come sit with us!" 💡

Valentina Pyatkin reminds us that AI research isn’t just about models and benchmarks - it’s about building a community where everyone feels welcome.

#AI #Inclusivity #WomenInAI
wiair.bsky.social
🧭 Takeaway: If you use reward models for RLHF, Best-of-N, or data filtering, RB2 gives you a harder, fairer yardstick—plus open evidence to guide choices. (7/8🧵)
wiair.bsky.social
🔬 The team trained & evaluated 100+ reward models (fully open). Key lessons: training >1 epoch can help; RMs often show lineage bias, preferring completions from their own model family. (6/8🧵)
wiair.bsky.social
⚖️ But for PPO/RLHF, correlation is more nuanced. Reward–policy lineage and training setup matter. A top RB2 score doesn’t always equal best PPO gains. (5/8🧵)
wiair.bsky.social
🎯 RB2 accuracy strongly correlates with Best-of-N sampling (Pearson r≈0.87). Good RB2 scores → better inference-time performance. (4/8🧵)
wiair.bsky.social
📉 Results: models that scored high on the original RewardBench often fall ~20 points lower on RB2. A clear sign that earlier benchmarks overstated reward model quality. (3/8🧵)
wiair.bsky.social
📑 RB2 uses unseen human prompts (held-out WildChat) to avoid leakage. Each prompt: 1 chosen + 3 rejected responses. Domains: factuality, precise instruction following, math, safety, focus, & ties (multiple valid answers). (2/8🧵)
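To make that best-of-4 format concrete, here is a minimal Python sketch: the reward model gets credit only when it scores the chosen response above all three rejected ones. The `score` callable and the example structure are illustrative assumptions, not the actual RewardBench 2 code.

```python
from typing import Callable, Dict, List

def rb2_style_accuracy(
    score: Callable[[str, str], float],  # assumed reward-model scorer: (prompt, response) -> scalar
    examples: List[Dict],                # each: {"prompt": str, "chosen": str, "rejected": [str, str, str]}
) -> float:
    """Best-of-4 accuracy: the chosen response must outrank all three rejected ones."""
    correct = 0
    for ex in examples:
        chosen_score = score(ex["prompt"], ex["chosen"])
        rejected_scores = [score(ex["prompt"], r) for r in ex["rejected"]]
        if chosen_score > max(rejected_scores):  # strict win over every rejected response
            correct += 1
    return correct / len(examples)
```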
wiair.bsky.social
🤔 How do we know if a reward model is truly good? In our last #WiAIR episode, Valentina Pyatkin (AI2 & University of Washington) introduced RewardBench 2—a harder, cleaner benchmark for reward models in post-training. (1/8🧵)
wiair.bsky.social
💥 Behind every success is a story of rejection.
Persistence, curiosity, and resilience are what truly drive AI careers. 🚀

Don't miss the full episode:
🎬 YouTube: youtube.com/watch?v=DPhq...
🎙 Spotify: open.spotify.com/episode/7aHP...
wiair.bsky.social
🚨 Why it matters:
IFEval suggested that precise instruction-following was nearly solved.
IFBENCH reveals it is far from solved. Robust benchmarks + training methods are needed for trustworthy, real-world LLM deployment. (6/7🧵)
wiair.bsky.social
🗝 Key findings:
✔️ IF-RLVR models generalize better across benchmarks.
✔️ They often prioritize the constraints over the general instruction.
✔️ This can lead to over-optimization. Preference rewards help balance quality vs. adherence. (5/7🧵)
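As a rough illustration of that last point, one way to balance the two signals is a weighted blend of the verifiable (constraint) reward and a preference-model score. This is a minimal sketch under assumed names and weighting, not the paper's exact recipe.

```python
BLEND_WEIGHT = 0.5  # assumed trade-off between constraint adherence and overall quality

def blended_reward(constraint_satisfied: bool, preference_score: float) -> float:
    """Blend a binary verifiable reward with a scalar preference-model score."""
    verifiable_reward = 1.0 if constraint_satisfied else 0.0
    return BLEND_WEIGHT * verifiable_reward + (1.0 - BLEND_WEIGHT) * preference_score
```

Tuning the blend weight is what keeps constraint adherence from crowding out overall response quality.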
wiair.bsky.social
📊 Results:
✔️ TÜLU-3-8B: IFEval 82.4 → 92.2, IFBENCH 28.9 → 45.9
✔️ Qwen2.5-7B: IFEval 74.7 → 87.8, IFBENCH 31.3 → 53.7
Clear gains from IF-RLVR training with multi-constraint prompts + preference reward signals. (4/7🧵)
wiair.bsky.social
🧩 Contributions:
1️⃣ IFBENCH – 58 unseen, challenging constraints. Even Claude 4 Sonnet & Qwen3-32B score <50%.
2️⃣ IFTRAIN – 29 new training constraints + verification functions.
3️⃣ IF-RLVR – training with verifiable rewards. (3/7🧵)
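To make the "verifiable rewards" idea concrete: each constraint ships with a programmatic verifier, and the RL reward is 1 only if the completion passes it. The word-count constraint and function names below are illustrative assumptions, not taken from IFTRAIN.

```python
def verify_max_words(completion: str, max_words: int = 100) -> bool:
    """Verifier for a 'respond in at most N words' style constraint."""
    return len(completion.split()) <= max_words

def verifiable_reward(completion: str) -> float:
    """Binary reward for RL with verifiable rewards (RLVR)."""
    return 1.0 if verify_max_words(completion) else 0.0
```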
wiair.bsky.social
Instruction-following is critical for reliable LLMs.
But current benchmarks like IFEval (25 constraints) make it look “solved”. Leading models score 80%+.
This paper shows they fail to generalize beyond those constraints. (2/7🧵)
wiair.bsky.social
💡 Are LLMs truly good at precise instruction following, or just overfitting to benchmarks?
In our latest WiAIR episode, we sit down with Valentina Pyatkin (@valentinapy.bsky.social) from @ai2.bsky.social and UW to discuss her paper: “Generalizing Verifiable Instruction Following”. (1/7🧵)
wiair.bsky.social
Tulu 3 isn’t just a model - it’s the ecosystem: data, recipes, benchmarks, and RLVR.
Valentina Pyatkin breaks down how smart data mixing & filtering shaped its performance.

Don't miss the full episode:
🎬 YouTube: youtube.com/watch?v=DPhq...
🎙 Spotify: open.spotify.com/episode/7aHP...