Ai2
@ai2.bsky.social
3.7K followers 110 following 440 posts
Breakthrough AI to solve the world's biggest problems. › Join us: http://allenai.org/careers › Get our newsletter: https://share.hsforms.com/1uJkWs5aDRHWhiky3aHooIg3ioxm
Pinned
ai2.bsky.social
Introducing Asta DataVoyager—our new AI capability in Asta that turns structured data into transparent, reproducible insights. Built for scientists, grounded in open, inspectable workflows. 🧵
ai2.bsky.social
"We are committed to our fully open ethos. That's why we release everything—weights, code, training data, checkpoints, all of it." — @nlpnoah.bsky.social at the Madrona IA Summit last week.
ai2.bsky.social
💡 This event is ideal for developers, researchers, and AI enthusiasts who want to go beyond the hype and learn how to apply + adapt powerful AI tools in the real world.
Learn more & register: luma.com/ynxz2650
AI Innovation in the Open · Luma
As part of Seattle AI Week, we invite you to an afternoon of "AI innovation in the open." This event offers a unique opportunity to not only see our latest…
luma.com
ai2.bsky.social
We’ll kick off with a presentation of our latest research, then you can choose a track:
↳ Set up and run our upcoming Asta data-driven discovery agent on your own laptop
↳ Learn how to customize our Olmo model family using open-source tools
ai2.bsky.social
As part of #SeattleAIWeek, we're hosting "AI Innovation in the Open" on Oct. 30 from 2-4:30pm—an afternoon of live demos and hands-on tutorials at Ai2 HQ. 👇
ai2.bsky.social
Interested in learning more, or getting early access? Sign up here → buff.ly/prVm1Fj
What’s next: Asta DataVoyager will be released to the general public soon. Stay tuned 🧪
allenai.org
ai2.bsky.social
The Cancer AI Alliance (CAIA) is already prototyping Asta DataVoyager in a federated, multi-institution setup for cancer studies—keeping clinical data local and secure.
Read more about CAIA here: buff.ly/ACpxLNT
ai2.bsky.social
🔒 Trust + control by design: deploy Asta DataVoyager on your own infra or private server, keep data in your purview, & delete data at any time.
ai2.bsky.social
Every Asta DataVoyager run returns:
🧪 A crisp answer
📊 Clear visuals
💻 Copyable code
🚀 A methods section documenting tests, assumptions, & steps
Outputs are structured and consistent—ready to share with collaborators or drop into a preprint.
ai2.bsky.social
💡 How it works: upload a dataset and ask a question in plain language (e.g., “Which treatment leads to improvements after week 6?”). Add optional context, and Asta handles the rest—no coding knowledge required.
ai2.bsky.social
Have a tough scientific research question? Submit it, compare citation-grounded model responses, and vote. The leaderboard updates regularly as the community weighs in → sciarena.allen.ai
Ai2 SciArena
sciarena.allen.ai
ai2.bsky.social
💡 Static benchmarks ≠ real research workflows
📈 SciArena is dynamic: new questions & constantly added papers + votes so model rankings reflect the latest science and which models can actually synthesize studies into trusted answers.
ai2.bsky.social
🧪 What’s SciArena? Our open, community-powered eval measuring LLM performance on scientific literature tasks. Based on their answers to science-related questions, models are ranked in our public leaderboard.
ai2.bsky.social
The newest DeepSeek and Anthropic models – plus Kimi K2–0905, Qwen3-Next, & Grok 4 Fast – are now available for head-to-head voting on real scientific queries. Ask, compare, & help rank them👇
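The posts above do not say how SciArena turns head-to-head votes into a leaderboard. As a rough illustration only, here is a minimal sketch that assumes an Elo-style update, a common choice for pairwise-comparison arenas. The model names, the starting rating of 1000, and the K-factor of 32 are all hypothetical.

```python
# Illustrative only: Elo is an assumption, not SciArena's documented method.

def expected_score(r_a: float, r_b: float) -> float:
    """Probability that a player rated r_a beats a player rated r_b under Elo."""
    return 1.0 / (1.0 + 10 ** ((r_b - r_a) / 400.0))


def update_elo(ratings: dict, winner: str, loser: str, k: float = 32.0) -> dict:
    """Apply one head-to-head vote to the ratings table (mutates and returns it)."""
    r_w = ratings.get(winner, 1000.0)  # hypothetical starting rating
    r_l = ratings.get(loser, 1000.0)
    gain = k * (1.0 - expected_score(r_w, r_l))
    ratings[winner] = r_w + gain
    ratings[loser] = r_l - gain
    return ratings


# Hypothetical votes: (preferred model, other model)
votes = [
    ("model_a", "model_b"),
    ("model_a", "model_c"),
    ("model_b", "model_c"),
]

ratings: dict = {}
for winner, loser in votes:
    update_elo(ratings, winner, loser)

# Highest rating first
leaderboard = sorted(ratings, key=ratings.get, reverse=True)
print(leaderboard)
```

Because each vote moves the same number of points from the loser to the winner, the total rating stays constant, and new votes shift the ranking without recomputing anything from scratch.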
ai2.bsky.social
A few new challengers enter SciArena—including DeepSeek-V3.2-Exp and Claude Sonnet 4.5 🔬
Reposted by Ai2
valentinhofmann.bsky.social
📢 New #COLM2025 paper 📢

Standard benchmarks give every LLM the same questions. This is like testing 5th graders and college seniors with *one* exam! 🥴

Meet Fluid Benchmarking, a capability-adaptive eval method delivering lower variance, higher validity, and reduced cost.

🧵
ai2.bsky.social
Learn more about Fluid Benchmarking:
📝 Blog: buff.ly/YtvXxyG
📄 Tech report: buff.ly/vAfamAd
👉 Code: buff.ly/FgZZ4nA
➡️ Discuss on Discord: buff.ly/iGqX51T
allenai.org
ai2.bsky.social
On MMLU, Fluid Benchmarking leads to lower variance with ~50× fewer questions than standard evals + increased validity.
ai2.bsky.social
📈 Stable signals: adaptive item selection cuts step-to-step variance & delays saturation.
🧼 Cleaner data: fewer mislabeled items than random sampling at the same budget.
➕ Results generalize better across benchmarks targeting the same capability. ⚡️
ai2.bsky.social
We apply Fluid Benchmarking in the context of pretraining & see how the evaluated items completely change over the course of training—easy items at the beginning, difficult items at the end.
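To make the idea of capability-adaptive item selection concrete, here is a minimal sketch. It assumes a two-parameter logistic (2PL) item response model and picks the items with the highest Fisher information at the model's current ability estimate. The posts do not spell out these details, so the 2PL model, the selection rule, and the toy item bank are assumptions for illustration.

```python
import math

# Illustrative only: 2PL IRT and Fisher-information selection are assumptions
# made for this sketch; the item bank below is hypothetical.


def p_correct(theta: float, a: float, b: float) -> float:
    """2PL model: chance that a model with ability theta answers the item correctly.

    a is the item's discrimination, b is its difficulty.
    """
    return 1.0 / (1.0 + math.exp(-a * (theta - b)))


def fisher_information(theta: float, a: float, b: float) -> float:
    """How much one item tells us about ability near theta (peaks when b is close to theta)."""
    p = p_correct(theta, a, b)
    return a * a * p * (1.0 - p)


def select_items(theta: float, items: list, budget: int) -> list:
    """Return the `budget` most informative items at the current ability estimate."""
    ranked = sorted(
        items,
        key=lambda it: fisher_information(theta, it["a"], it["b"]),
        reverse=True,
    )
    return ranked[:budget]


# Hypothetical item bank: equal discrimination, difficulties from easy to hard
ITEMS = [{"id": i, "a": 1.0, "b": b} for i, b in enumerate([-2.0, -1.0, 0.0, 1.0, 2.0])]

# Early checkpoint (low ability) versus late checkpoint (high ability)
early = select_items(theta=-1.5, items=ITEMS, budget=2)
late = select_items(theta=1.5, items=ITEMS, budget=2)

print(sorted(it["b"] for it in early))
print(sorted(it["b"] for it in late))
```

In this toy setup the low-ability checkpoint is given the two easiest items and the high-ability checkpoint the two hardest, which mirrors the shift from easy to difficult items over the course of training described above.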