Lightnews — Scholar-powered news

Ai2 @ai2.bsky.social · 1d

"We are committed to our fully open ethos. That's why we release everything—weights, code, training data, checkpoints, all of it." — @nlpnoah.bsky.social at the Madrona IA Summit last week.

Ai2 @ai2.bsky.social · 5d

💡 This event is ideal for developers, researchers, and AI enthusiasts who want to go beyond the hype and learn how to apply + adapt powerful AI tools in the real world.
Learn more & register: luma.com/ynxz2650

AI Innovation in the Open · Luma

As part of Seattle AI Week, we invite you to an afternoon of "AI innovation in the open." This event offers a unique opportunity to not only see our latest…

luma.com

Ai2 @ai2.bsky.social · 5d

We’ll kick off with a presentation of our latest research, then you can choose a track:
↳ Set up and run our upcoming Asta data-driven discovery agent on your own laptop
↳ Learn how to customize our Olmo model family using open-source tools

Ai2 @ai2.bsky.social · 5d

As part of #SeattleAIWeek, we're hosting "AI Innovation in the Open" on Oct. 30 from 2-4:30pm—an afternoon of live demos and hands-on tutorials at Ai2 HQ. 👇

Ai2 @ai2.bsky.social · 7d

Interested in learning more, or getting early access? Sign up here → buff.ly/prVm1Fj
What’s next: Asta DataVoyager will be released to the general public soon. Stay tuned 🧪

allenai.org

Ai2 @ai2.bsky.social · 7d

The Cancer AI Alliance (CAIA) is already prototyping Asta DataVoyager in a federated, multi-institution setup for cancer studies—keeping clinical data local and secure.
Read more about CAIA here: buff.ly/ACpxLNT

Ai2 @ai2.bsky.social · 7d

🔒 Trust + control by design: deploy Asta DataVoyager on your own infra or private server, keep data in your purview, & delete data at any time.

Ai2 @ai2.bsky.social · 7d

Every Asta DataVoyager run returns:
🧪 A crisp answer
📊 Clear visuals
💻 Copyable code
🚀 A methods section documenting tests, assumptions, & steps
Outputs are structured and consistent—ready to share with collaborators or drop into a preprint.

Ai2 @ai2.bsky.social · 7d

💡 How it works: upload a dataset and ask a question in plain language (e.g., “Which treatment leads to improvements after week 6?”). Add optional context, and Asta handles the rest—no coding knowledge required.

Ai2 @ai2.bsky.social · 7d

Introducing Asta DataVoyager—our new AI capability in Asta that turns structured data into transparent, reproducible insights. Built for scientists, grounded in open, inspectable workflows. 🧵

Ai2 @ai2.bsky.social · 8d

Have a tough scientific research question? Submit it, compare citation-grounded model responses, and vote. The leaderboard updates regularly as the community weighs in → sciarena.allen.ai

Ai2 SciArena

sciarena.allen.ai

Ai2 @ai2.bsky.social · 8d

💡 Static benchmarks ≠ real research workflows
📈 SciArena is dynamic: new questions & constantly added papers + votes so model rankings reflect the latest science and which models can actually synthesize studies into trusted answers.

Ai2 @ai2.bsky.social · 8d

🧪 What’s SciArena? Our open, community-powered eval measuring LLM performance on scientific literature tasks. Based on their answers to science-related questions, models are ranked in our public leaderboard.
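
One way head-to-head votes can be turned into a ranking is sketched below. The posts do not say which rating method SciArena uses, so this is an assumption: a plain Elo update is swapped in as a common choice for pairwise comparisons, and the model names, `k` factor, and starting rating are hypothetical.

```python
from collections import defaultdict


def elo_ratings(votes, k=32.0, base=1000.0):
    """Replay (winner, loser) votes in order and return one rating per model.

    Assumption: standard Elo with a 400-point logistic scale. This is an
    illustration, not SciArena's documented method.
    """
    ratings = defaultdict(lambda: base)
    for winner, loser in votes:
        # Expected score of the winner given the current rating gap.
        gap = ratings[loser] - ratings[winner]
        expected = 1.0 / (1.0 + 10 ** (gap / 400.0))
        # The winner gains what the loser gives up, so the total is conserved.
        delta = k * (1.0 - expected)
        ratings[winner] += delta
        ratings[loser] -= delta
    return dict(ratings)


# Hypothetical votes: each tuple is (preferred model, other model).
votes = [
    ("model_a", "model_b"),
    ("model_a", "model_c"),
    ("model_b", "model_c"),
]
ratings = elo_ratings(votes)
leaderboard = sorted(ratings, key=ratings.get, reverse=True)
```

Because each vote only moves two ratings, new votes and new models can be folded in as they arrive, which fits a leaderboard that updates as the community weighs in.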

Ai2 @ai2.bsky.social · 8d

The newest DeepSeek and Anthropic models – plus Kimi K2-0905, Qwen3-Next, & Grok 4 Fast – are now available for head-to-head voting on real scientific queries. Ask, compare, & help rank them 👇

Ai2 @ai2.bsky.social · 8d

A few new challengers enter SciArena—including DeepSeek-V3.2-Exp and Claude Sonnet 4.5 🔬

Reposted by Ai2

Valentin Hofmann @valentinhofmann.bsky.social · 21d

📢 New #COLM2025 paper 📢

Standard benchmarks give every LLM the same questions. This is like testing 5th graders and college seniors with *one* exam! 🥴

Meet Fluid Benchmarking, a capability-adaptive eval method delivering lower variance, higher validity, and reduced cost.

🧵

Ai2 @ai2.bsky.social · 21d

Learn more about Fluid Benchmarking:
📝 Blog: buff.ly/YtvXxyG
📄 Tech report: buff.ly/vAfamAd
👉 Code: buff.ly/FgZZ4nA
➡️ Discuss on Discord: buff.ly/iGqX51T

allenai.org

Ai2 @ai2.bsky.social · 21d

On MMLU, Fluid Benchmarking leads to lower variance with ~50× fewer questions than standard evals + increased validity.

Ai2 @ai2.bsky.social · 21d

📈 Stable signals: adaptive item selection cuts step-to-step variance & delays saturation.
🧼 Cleaner data: fewer mislabeled items than random sampling at the same budget.
➕ Results generalize better across benchmarks targeting the same capability. ⚡️

Ai2 @ai2.bsky.social · 21d

We apply Fluid Benchmarking in the context of pretraining & see how the evaluated items completely change over the course of training—easy items at the beginning, difficult items at the end.
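
The shift from easy to difficult items is what one would expect if each item is chosen to be as informative as possible at the model's current estimated ability. The sketch below illustrates that idea under stated assumptions: a two-parameter logistic item response model and selection by Fisher information. The posts do not spell out the exact procedure, and the item bank, parameter values, and function names here are hypothetical.

```python
import math


def p_correct(theta, a, b):
    """2PL model: chance that a model of ability theta answers the item correctly.

    a is the item's discrimination, b its difficulty (both assumed known).
    """
    return 1.0 / (1.0 + math.exp(-a * (theta - b)))


def fisher_information(theta, a, b):
    """Information the item provides about ability near theta."""
    p = p_correct(theta, a, b)
    return a * a * p * (1.0 - p)


def pick_next_item(theta, items, used):
    """Return the index of the unused item that is most informative at theta."""
    candidates = [i for i in range(len(items)) if i not in used]
    return max(candidates, key=lambda i: fisher_information(theta, *items[i]))


# Hypothetical item bank: (discrimination a, difficulty b), easy to hard.
items = [(1.0, -2.0), (1.0, 0.0), (1.0, 2.0)]

# A weak early checkpoint versus a strong late checkpoint.
early_choice = pick_next_item(-2.0, items, used=set())
late_choice = pick_next_item(2.0, items, used=set())
```

With equal discrimination, information peaks where difficulty matches ability, so the low-ability checkpoint is given the easiest item and the high-ability checkpoint the hardest one.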