Ai2
@ai2.bsky.social
Breakthrough AI to solve the world's biggest problems.

› Join us: http://allenai.org/careers
› Get our newsletter: https://share.hsforms.com/1uJkWs5aDRHWhiky3aHooIg3ioxm
Pinned
Introducing Ai2 Open Coding Agents—starting with SERA, our first-ever coding models. Fast, accessible agents (8B–32B) that adapt to any repo, including private codebases. Train a powerful specialized agent for as little as ~$400, & it works with Claude Code out of the box. 🧵
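One way to try a SERA model with minimal setup, sketched under assumptions: the model is served behind an OpenAI-compatible endpoint (e.g., via vLLM), and the model ID and port below are placeholders rather than official values.

```python
# Hypothetical sketch: query a locally served SERA model through an
# OpenAI-compatible endpoint (e.g., one exposed by vLLM).
from openai import OpenAI

client = OpenAI(base_url="http://localhost:8000/v1", api_key="unused")

resp = client.chat.completions.create(
    model="SERA-14B",  # placeholder model ID; use your served model's name
    messages=[{
        "role": "user",
        "content": "Write a pytest for the parse_config() helper in utils.py.",
    }],
    temperature=0.2,
)
print(resp.choices[0].message.content)
```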
Thanks for the feedback. It's not a perfect tool; hallucinations may occur, particularly given the model's small size.
February 11, 2026 at 1:32 AM
Reposted by Ai2
Incredibly fun project led by our intern Yapei Chang.

We mined the web for thousands of real-world “how to do X” step-by-step instructions and turned them into a dataset, a synthetic-data training procedure, an eval suite, etc.
February 10, 2026 at 8:34 PM
We stress-test How2Bench to make sure that model performance isn’t driven by matching task style or by memorizing source web pages.

Read all about it below 👇
📝 Blog: buff.ly/4FUlgD3
📄 Paper: buff.ly/CfrDxiI
💻 Code: buff.ly/vKMAvqc
🤗 HF: buff.ly/jOMqysf
How2Everything: Mining the web to evaluate and improve LLMs on real-world procedures | Ai2
How2Everything is an open framework for evaluating and improving how well LLMs generate step-by-step procedures.
allenai.org
February 10, 2026 at 4:53 PM
Finally, RL using How2Score as a reward yields >10-point gains on Qwen3 4B, Qwen3 8B, and Olmo 3 7B Think, with no systematic regressions on 12 standard benchmarks covering knowledge, reasoning, chat, math, & code. We apply a length reward to prevent reward hacking via verbosity.
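A hedged sketch of that reward shaping, with a stubbed judge and illustrative constants (the word threshold and penalty weight are not the paper's values):

```python
# Sketch of How2Score-as-reward plus a length penalty to deter verbosity.
def how2score(generation: str, reference: str) -> float:
    """Stand-in for the How2Score judge: 1.0 if the generation has no
    critical failures w.r.t. the reference, else 0.0. Trivial stub here."""
    return 1.0

def shaped_reward(generation: str, reference: str,
                  max_words: int = 400, penalty: float = 0.002) -> float:
    base = how2score(generation, reference)
    overflow = max(0, len(generation.split()) - max_words)
    return base - penalty * overflow  # extra verbosity can't buy reward
```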
February 10, 2026 at 4:53 PM
3️⃣ We hold out 7K procedures for How2Bench, a benchmark measuring how base & instruct models fare. It reliably tracks generation correctness across training progress & model size, making it an effective tool for comparing models from 1B pretraining checkpoints to frontier LLMs.
February 10, 2026 at 4:53 PM
2️⃣ Eval protocol: How2Score measures whether a generation contains any critical failures with respect to a reference procedure. Frontier LLMs spot critical failures with high agreement with humans (>80%), and we distill How2Judge, an 8B judge model that's cheap to run at scale.
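For intuition, a minimal judge call along these lines (the prompt wording, judge model, and client setup are assumptions, not the exact protocol):

```python
# Illustrative How2Score-style check: ask an LLM judge whether a generated
# procedure contains any critical failure relative to a reference.
from openai import OpenAI

client = OpenAI()  # or set base_url to a locally served How2Judge-style model

JUDGE_PROMPT = """Reference procedure:
{reference}

Candidate procedure:
{candidate}

Does the candidate contain any CRITICAL failure (a step that would make the
procedure fail) relative to the reference? Answer YES or NO."""

def has_critical_failure(candidate: str, reference: str) -> bool:
    resp = client.chat.completions.create(
        model="gpt-4o",  # placeholder judge model
        messages=[{"role": "user", "content": JUDGE_PROMPT.format(
            reference=reference, candidate=candidate)}],
    )
    return resp.choices[0].message.content.strip().upper().startswith("YES")
```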
February 10, 2026 at 4:53 PM
How2Everything has 3 key components:

1️⃣ Data pipeline: How2Mine extracts & cleans 351K procedures from ~1M web pages across 14 topics. The resulting procedures are diverse and high quality, and the pipeline can scale to much larger datasets!
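A toy sketch of the mining idea (not the actual How2Mine pipeline, whose cleaning and filtering are far more involved): pull ordered-list steps out of a how-to page and keep pages that yield a plausible number of steps.

```python
# Toy procedure extractor: ordered-list items from a how-to web page.
from bs4 import BeautifulSoup

def extract_procedure(html: str, min_steps: int = 3, max_steps: int = 30):
    soup = BeautifulSoup(html, "html.parser")
    title = soup.title.get_text(strip=True) if soup.title else ""
    steps = [li.get_text(" ", strip=True)
             for ol in soup.find_all("ol")
             for li in ol.find_all("li")]
    if min_steps <= len(steps) <= max_steps:
        return {"task": title, "steps": steps}
    return None  # page doesn't look like a usable procedure
```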
February 10, 2026 at 4:53 PM
LLMs often generate step-by-step instructions, from real-world tasks (how do I file taxes?) to plans for AI agents. Improving this is hard: outputs can sound fluent for steps that don't work, and current datasets cover few domains.

How2Everything evals/trains for this at scale. 🧵
February 10, 2026 at 4:53 PM
Try the research demo and learn more about DR Tulu in our blog.
🔗 dr-tulu.org
📝 buff.ly/mpJJkhm
DR-Tulu: Deep Research with Reinforcement Learning
DR-Tulu is the first open deep research model directly trained for open-ended, long-form research using Reinforcement Learning with Evolving Rubrics (RLER).
dr-tulu.org
February 9, 2026 at 4:29 PM
This demo is designed to make it easier to explore DR Tulu without extensive configuration, & to show how deep research – training, evaluating long-form outputs, & personalization – remains an open academic question.
February 9, 2026 at 4:29 PM
Every run shows DR Tulu's research steps as they happen—analysis, searches issued, and a running tally of tool calls and documents found.

A dedicated sources view lists retrieved files with snippets, and all reports are citation-backed. 📝
February 9, 2026 at 4:29 PM
DR Tulu is our open, end-to-end recipe for long-form deep research, & the first deep research agent trained directly for long-form responses.

The browser UI lets you pick a model, choose between Brief Answer or Detailed Report, & set tool use intensity from Quick to Extensive.
February 9, 2026 at 4:29 PM
New: A web demo to make using DR Tulu even simpler, built by our collaborators at MIT & the University of Washington.
Ask a question and watch DR Tulu plan, search, & synthesize a citation-grounded report you can share. 🔎
February 9, 2026 at 4:29 PM
Reposted by Ai2
Many want to use AI to accelerate science, and using it to explore the growing tsunami of research articles is getting lots of attention. Measuring the quality of AI answers to questions about science is a challenge. @science.org www.science.org/content/arti...
Open-source AI program can answer science questions better than humans
Developed by and for academics, OpenScholar aims to improve searches of the ballooning scientific literature
www.science.org
February 4, 2026 at 6:52 PM
Our goal: systems scientists can trust and build on 🤝. OpenScholar’s code & data are public—and it’s already shaping our next-gen research models.

📄 Nature: buff.ly/hQHM8K9
📝 Blog: buff.ly/Re5wvCA
Synthesizing scientific literature with retrieval-augmented language models - Nature
A specialized, open-source, retrieval-augmented language model is introduced for answering scientific queries and synthesizing literature, the responses of which are shown to be preferred by human…
www.nature.com
February 4, 2026 at 4:21 PM
What started as research into literature-grounded AI now powers real tools. OpenScholar's 45M paper corpus feeds the Semantic Scholar API. ScholarQABench inspired parts of AstaBench. And OpenScholar’s core concepts live on in Asta and DR Tulu.
February 4, 2026 at 4:21 PM
In a review, 16 scientists preferred OpenScholar to human answers 51% of the time—and combining OpenScholar's citation pipeline with GPT-4o boosted that to 70% (vs. 32% for GPT-4o alone) 📈
February 4, 2026 at 4:21 PM
We also created ScholarQABench, the first large, multi-domain benchmark for scientific search and synthesis 🧪: 3,000 queries + 250 long-form expert answers across CS, physics, biomedicine, & neuroscience.
February 4, 2026 at 4:21 PM
With the University of Washington, we built OpenScholar: scientific synthesis with citation-grounded answers, trained on 45M papers.

Because web search alone can be noisy, it uses RAG to search for, incorporate, & cite new sources—even after training 🔎
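A minimal sketch of that RAG pattern (not OpenScholar's actual pipeline): retrieve candidate papers, then prompt a generator to answer using only those sources, citing them by index. The Semantic Scholar search endpoint is public; the generator model is a placeholder.

```python
# RAG sketch: retrieve papers, then answer with inline [i] citations.
import requests
from openai import OpenAI

def search_papers(query: str, k: int = 5):
    r = requests.get(
        "https://api.semanticscholar.org/graph/v1/paper/search",
        params={"query": query, "limit": k, "fields": "title,abstract"},
        timeout=30,
    )
    return r.json().get("data", [])

def answer_with_citations(question: str) -> str:
    papers = search_papers(question)
    context = "\n".join(f"[{i}] {p['title']}: {p.get('abstract') or ''}"
                        for i, p in enumerate(papers))
    client = OpenAI()
    resp = client.chat.completions.create(
        model="gpt-4o",  # placeholder generator
        messages=[{"role": "user", "content":
                   f"Answer using ONLY these sources, citing them as [i]:\n"
                   f"{context}\n\nQuestion: {question}"}],
    )
    return resp.choices[0].message.content
```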
February 4, 2026 at 4:21 PM
Scientists can't keep up with millions of new papers. General-purpose AI could help, but it still hallucinates—especially citations. In our study, GPT-4o fabricated 78–90% of its research sources.
February 4, 2026 at 4:21 PM
Our OpenScholar paper is now in @nature.com 🎉

OpenScholar is an open-source model for synthesizing scientific research—with citations as accurate as human experts. 🧵
February 4, 2026 at 4:21 PM
You can drop in SERA-14B or retrain with our refreshed data. We look forward to seeing what you build!

💻 Model & data: buff.ly/K15oZuB
📝 Learn more: buff.ly/eII61ys
Open Coding Agents - a allenai Collection
We’re on a journey to advance and democratize artificial intelligence through open source and open science.
huggingface.co
February 3, 2026 at 5:39 PM
We've also revamped the open SERA training data into a general, model-agnostic format that's easier to reuse across different workflows.

What's new:
✅ Verification thresholds per sample
✅ More metadata for filtering & analysis
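A hedged sketch of one reuse path, assuming the data loads with the `datasets` library; the dataset path and field name below are placeholders, so check the Hugging Face collection for the real schema.

```python
# Load the refreshed SERA data and keep samples above a verification
# threshold. The path and "verification_score" field are assumptions.
from datasets import load_dataset

ds = load_dataset("allenai/sera-training-data", split="train")  # placeholder
strong = ds.filter(lambda ex: ex.get("verification_score", 0.0) >= 0.8)
print(f"kept {len(strong)} of {len(ds)} samples")
```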
February 3, 2026 at 5:39 PM
SERA-14B is built for more setups and easier deployment—a smaller, more accessible option that still keeps SERA's cheap, customizable approach.
February 3, 2026 at 5:39 PM
Since launching Open Coding Agents, it's been exciting to see how quickly the community has adopted them. Today we're releasing SERA-14B – a new 14B-parameter coding model – plus a major refresh of our open training datasets. 🧵
February 3, 2026 at 5:39 PM