METR
@metr.org
1.4K followers 1 following 120 posts
METR is a research nonprofit that builds evaluations to empirically test AI systems for capabilities that could threaten catastrophic harm to society.
METR @metr.org · Sep 2
This time horizon estimate means that Claude Opus 4.1 is expected to succeed at least 50% of the time on our tasks that took human SWEs up to 1 hr 45 min. You can find estimates for other models and read the original paper here:
Measuring AI Ability to Complete Long Tasks
We propose measuring AI performance in terms of the *length* of tasks AI agents can complete. We show that this metric has been consistently exponentially increasing over the past 6 years, with a doubling time of about 7 months.
METR @metr.org · Sep 2
This point estimate is 30% longer than that of Claude Opus 4. The difference is statistically significant, with Opus 4.1 beating Opus 4 in 97% of our bootstrap samples. Claude Opus 4 was released back in May, while Claude Opus 4.1 was released in August.
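For readers curious about the mechanics, here is a minimal sketch of a task-level bootstrap comparison of two models' time-horizon estimates. The data and the horizon helper are illustrative stand-ins, not METR's actual estimator or results.

```python
import numpy as np

rng = np.random.default_rng(0)

def horizon(minutes, succeeded):
    """Hypothetical stand-in for a 50%-time-horizon estimator.
    Illustration only: geometric mean of the task lengths the model solved."""
    solved = minutes[succeeded == 1]
    return float(np.exp(np.log(solved).mean())) if len(solved) else 0.0

# Hypothetical per-task outcomes for two models on the same task suite.
minutes  = np.array([5, 10, 20, 45, 90, 150, 240, 400], dtype=float)
opus_4   = np.array([1, 1, 1, 1, 0, 1, 0, 0])
opus_4_1 = np.array([1, 1, 1, 1, 1, 1, 0, 0])

n_boot, wins = 10_000, 0
for _ in range(n_boot):
    # Resample tasks with replacement and score both models on the same resample.
    idx = rng.integers(0, len(minutes), size=len(minutes))
    if horizon(minutes[idx], opus_4_1[idx]) > horizon(minutes[idx], opus_4[idx]):
        wins += 1

print(f"Opus 4.1's estimate exceeds Opus 4's in {wins / n_boot:.0%} of resamples")
```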
METR @metr.org · Sep 2
We estimate that Claude Opus 4.1 has a 50%-time-horizon of around 1 hr 45 min (95% confidence interval of 50 to 195 minutes) on our agentic multi-step software engineering tasks. This estimate is lower than the current highest time-horizon point estimate of around 2 hr 15 min.
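For context on the methodology: a minimal sketch of how a 50% time horizon can be estimated from per-task outcomes, assuming (in the spirit of the linked paper's approach) a logistic fit of success probability against the logarithm of human completion time. The data and code below are illustrative, not METR's pipeline.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

# Hypothetical per-task data: how long each task took human SWEs (minutes)
# and whether the model's agent run succeeded (1) or failed (0).
human_minutes = np.array([4, 8, 15, 30, 60, 120, 240, 480], dtype=float)
succeeded = np.array([1, 1, 1, 0, 1, 1, 0, 0])

# Model success probability as a logistic function of log2(human time):
# longer tasks should generally be harder for the agent.
X = np.log2(human_minutes).reshape(-1, 1)
fit = LogisticRegression().fit(X, succeeded)

# The 50% time horizon is the task length at which the fitted success
# probability crosses 0.5, i.e. where the logit is zero:
#   intercept + coef * log2(t) = 0   =>   t = 2 ** (-intercept / coef)
coef, intercept = fit.coef_[0][0], fit.intercept_[0]
horizon_minutes = 2 ** (-intercept / coef)
print(f"estimated 50% time horizon: {horizon_minutes:.0f} minutes")
```

A confidence interval like the one quoted above can then be obtained by bootstrapping this kind of fit over tasks.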
METR @metr.org · Sep 2
We concluded that our original graph didn’t clearly communicate this data, especially as this work was exploratory and small-scale.

The graph also incorrectly showed a [0,0] CI.

We’ve updated the blog post to show a new figure, which more accurately conveys what we observed.
METR @metr.org · Aug 13
We’re interested in a) expanding these preliminary results to cover a wider range of tasks and repositories, b) seeing whether this gap persists with newer models, and c) developing methods and benchmarks that let us measure a broader distribution of tasks, while still being easy to use.
METR @metr.org · Aug 13
Some key caveats:
- Small sample (18 tasks)
- These repositories may have higher-than-average standards
- Basic agent scaffold
- We used Claude 3.7 Sonnet so the results are comparable to our developer productivity RCT, but more recent models may have a different algorithmic vs. holistic scoring gap
METR @metr.org · Aug 13
In conclusion, models seem reasonably able to implement the core functionality of these tasks, but they fall short on the many other requirements/objectives they need to satisfy (and they perform worse on these other metrics collectively).
METR @metr.org · Aug 13
Comparing algorithmic vs. holistic scoring may help explain the apparent inconsistency. The 38% success rate we observe on these 18 tasks (which take experienced, high-context developers an average of 1.3 hours) is consistent with the ~1 hr 50%-time-horizon we previously estimated.
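As a rough plausibility check (not a calculation from the paper): under an assumed logistic relationship between success and task length, a ~1 hr 50%-time-horizon predicts a somewhat-below-50% success rate on tasks averaging 1.3 hours. The slope value below is purely illustrative.

```python
import math

# Assumed logistic model of success vs. task length; beta is an
# illustrative slope, NOT a number from METR's paper.
#   P(success) = sigmoid(beta * (log2(horizon_minutes) - log2(task_minutes)))
beta = 0.6
horizon_minutes = 60       # ~1 hr 50%-time-horizon previously estimated for Claude 3.7
task_minutes = 1.3 * 60    # the 18 RCT tasks average ~1.3 hours

logit = beta * (math.log2(horizon_minutes) - math.log2(task_minutes))
p = 1 / (1 + math.exp(-logit))
print(f"predicted success rate at 1.3 hr: {p:.0%}")  # ~44% with these illustrative numbers
```

With only 18 tasks, an observed 38% is well within sampling noise of a prediction in that ballpark.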
METR @metr.org · Aug 13
We’ve previously estimated Claude 3.7’s 50%-time-horizon to be ~1 hour on algorithmically scorable software tasks. At first blush, this seems inconsistent with the slowdown we observed in our recent developer productivity RCT, since many of those tasks are ~1 hr long.
METR @metr.org · Mar 19
When will AI systems be able to carry out long projects independently?

In new research, we find a kind of “Moore’s Law for AI agents”: the length of tasks that AIs can do is doubling about every 7 months.
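As a toy illustration of how a doubling time falls out of a log-linear fit (the data points below are made up, not the paper's measurements):

```python
import numpy as np

# Made-up (release date, 50%-time-horizon in minutes) points.
years = np.array([2019.5, 2021.0, 2022.5, 2024.0, 2025.0])
horizons = np.array([0.1, 1.0, 8.0, 30.0, 100.0])

# Fit log2(horizon) as a linear function of time; the slope is doublings per year.
slope, intercept = np.polyfit(years, np.log2(horizons), 1)
print(f"doubling time: {12 / slope:.1f} months")  # ~6.7 months for these made-up points
```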
METR @metr.org · Aug 13
One hypothesis is that this is a result of models being trained with reinforcement learning with verifiable rewards (RLVR), which could cause models to be uniquely good at tasks that are algorithmically scorable.
METR @metr.org · Aug 13
One key takeaway is that there’s a gap between what can be easily measured with automated evaluation metrics, and all of the goals/constraints developers and users actually care about.
METR @metr.org · Aug 13
That said, agents do often make considerable progress towards these other objectives, e.g. by writing some documentation, implementing a few (but not enough) reasonable test cases, etc.

Speculatively, it doesn’t seem like they fundamentally lack these capabilities.
METR @metr.org · Aug 13
We group failures into 5 categories:
- Core functionality errors
- Poor test coverage
- Missing/incorrect documentation
- Linting/formatting violations
- Other quality issues (verbosity, brittleness, poor maintainability)

All agent attempts contain at least 3 of these issues!
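A trivial sketch of the bookkeeping behind that last observation, with hypothetical review data (not METR's annotations):

```python
# Failure categories from the post above.
CATEGORIES = {"core_functionality", "test_coverage", "documentation",
              "lint_format", "other_quality"}

# Hypothetical: the categories a reviewer flagged for each agent attempt.
reviews = {
    "attempt_01": {"test_coverage", "documentation", "lint_format"},
    "attempt_02": {"core_functionality", "test_coverage", "documentation", "other_quality"},
    "attempt_03": {"core_functionality", "test_coverage", "lint_format"},
}

assert all(flags <= CATEGORIES for flags in reviews.values())
# The observation corresponds to every attempt having >= 3 categories flagged.
assert all(len(flags) >= 3 for flags in reviews.values())
for attempt, flags in sorted(reviews.items()):
    print(attempt, len(flags), sorted(flags))
```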
METR @metr.org · Aug 13
Even when agents pass on all human-written test cases, we estimate that their implementations would take 20-30 minutes on average to get to a mergeable state—which represents about a third of the total time needed for an experienced developer to complete the tasks.
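For scale, using the ~1.3-hour average task length mentioned elsewhere in this thread:

```python
avg_task_minutes = 1.3 * 60   # ~78 minutes per task for experienced developers
for cleanup in (20, 30):
    print(f"{cleanup} min of cleanup is {cleanup / avg_task_minutes:.0%} of an average task")
# prints roughly 26% and 38%, i.e. "about a third"
```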
METR @metr.org · Aug 13
To investigate this, we put the same model (Claude 3.7 Sonnet) in an agent scaffold and had it attempt 18 tasks from two open-source repos in the RCT.

We then scored its PRs in two ways: against human-written tests, which it passed 38% of the time, and with manual review, which it never passed.
METR @metr.org · Aug 13
We previously found that experienced open-source developers were slowed down when using early-2025 AI tools, even with models like Claude 3.7 Sonnet that can complete long-horizon eval tasks.

What separates real coding from SWE benchmark tasks? Was human input holding the AI back?
METR @metr.org · Aug 13
We tested how autonomous AI agents perform on real software tasks from our recent developer productivity RCT.

We found a gap between algorithmic scoring and real-world usability that may help explain why AI benchmarks feel disconnected from reality.
METR @metr.org · Aug 12
Broadly, METR prioritizes scientific accuracy, integrity, and rigor, even when that sometimes means giving up opportunities to have our work reach more people. We’re betting that high-quality information is a bottleneck for critical decisions about the future with AI.
METR @metr.org · Aug 12
We also aim for our research to be convincing to ourselves and to an intelligent, informed, and skeptical audience, rather than creating evaluations we don’t think are meaningful but expect to trigger strong emotional reactions.
METR @metr.org · Aug 12
There are often tricky tradeoffs with research communication, as discussed above. However, there are some principles we treat as more clear-cut. For example, we try to avoid optimizing for reach or “clicks” in ways that degrade accurate understanding.
METR @metr.org · Aug 12
This retrospective review helped motivate many of the key caveats/pieces of context in the main graph of our developer productivity study, such as developer experience, the large size/complexity of the projects, and the number of developers and tasks involved in the study.
METR @metr.org · Aug 12
For example, many people understood us to be making claims about all tasks, vs. the distribution we measured (algorithmically-scorable software tasks). While we discuss these caveats at length in the paper, we wish we’d been clearer in some places (e.g. the blog post).
METR @metr.org · Aug 12
When publishing this paper, we also reflected on previous experiences publishing controversial results—in particular, our paper finding a trend in the length of software tasks that AIs can complete.
METR @metr.org · Mar 19
When will AI systems be able to carry out long projects independently?

In new research, we find a kind of “Moore’s Law for AI agents”: the length of tasks that AIs can do is doubling about every 7 months.