Epoch AI
@epochai.bsky.social
We are a research institute investigating the trajectory of AI for the benefit of society.

epoch.ai
How Claudey is Opus 4.5?

We previously described Claudiness as "good at agentic tasks while being weaker at multimodal and math". This pattern remains when comparing Opus 4.5 to other newly released models, though the gap on agentic coding and tool-calling benchmarks is small.
November 25, 2025 at 10:26 PM
We benchmarked Opus 4.5 on FrontierMath. It scored 21% on FrontierMath Tiers 1–3, continuing a trend of improvement for Anthropic models.

This score is behind Gemini 3 Pro and GPT-5.1 (high) while being on par with earlier frontier models like o3 (high) and Grok 4.
November 25, 2025 at 9:26 PM
Gemini 3 Pro set a new record on GPQA Diamond: 93% vs. the previous record of 88%. What you can’t tell from the headline: almost all of this gain came in organic chemistry. 🧬🧵
November 25, 2025 at 4:57 PM
We’ve optimized our Frontier Data Centers hub for mobile.

You can now examine annotated, recent, high-resolution satellite imagery of the world's largest compute clusters directly from your phone at epoch.ai/data/data-c....

Here’s a look at the updated Satellite Viewer:
November 25, 2025 at 2:15 AM
Gemini 3 Pro set a new record on FrontierMath: 38% on Tiers 1–3 and 19% on Tier 4.

On the Epoch Capabilities Index (ECI), which combines multiple benchmarks, Gemini 3 Pro scored 154, up from GPT-5.1’s previous high score of 151.
November 21, 2025 at 7:04 PM
Benchmarking data is dominated by a single “General Capability” dimension. Is this due to good generalization across tasks, or to developers pushing on all benchmarks at once?

🧵 with some analysis, including the discovery of a “Claudiness” dimension.
November 20, 2025 at 9:09 PM
It’s easy to talk about ‘large AI data centers’ and still underestimate the scale.

Our Frontier Data Centers database shows that some upcoming campuses will cover a substantial portion of Manhattan. Meta's Hyperion data center will be nearly four times the size of Central Park.
November 19, 2025 at 7:54 PM
GPT-5.1 is about as capable as GPT-5.

That’s according to the Epoch Capabilities Index, our tool for combining results across multiple benchmarks. With “high” reasoning, both GPT-5.1 and GPT-5 score 151 on ECI.

See 🧵 for individual benchmark scores!
November 19, 2025 at 12:10 PM
Data centers supporting AI training runs could require 1-5 GW by 2030, enough to power entire cities.

Join us for a live webinar/Q&A on our new Frontier Data Centers Hub, exploring what this infrastructure buildout means for AI.

Nov 20, 1-2 PM PT
luma.com/oste01d0
November 18, 2025 at 8:17 PM
Sam Altman: “Our goal is, by March of 2028, to have a true automated AI researcher”

Some say this kicks off a software singularity, where AIs recursively improve themselves and rapidly get smarter. Others think there’ll be a bottleneck.

So how can we tell who’s right? 🧵
November 17, 2025 at 7:42 PM
AI data center buildouts already rival the Manhattan Project in scale, but there’s little public info about them.

So we spent the last few months reading legal permits, staring at satellite images, and scouring news sources.

Here’s what you need to know. 🧵
November 10, 2025 at 6:03 PM
How fast can you build a gigawatt-scale data center?

Some hyperscalers plan to do it in just 1-2 years from the start of construction.

If they succeed, we’ll see the first GW-scale data centers online in 2026, marking one of the fastest infrastructure build-outs in history. 🧵
November 10, 2025 at 5:40 PM
The Epoch Capabilities Index is a useful way to measure model capabilities, but what does a score of 150 actually mean?

One way to read our new capability index is to plot the benchmark performance you'd expect to see across a range of ECI scores 🧵
November 7, 2025 at 7:13 PM
Anthropic's recently reported projection of $70B revenue in 2028 may be less than OpenAI's projection for the same year, but it would still represent historically fast growth.

bsky.app/profile/epo...
November 5, 2025 at 3:27 PM
Announcing our Frontier Data Centers Hub!

The world is about to see multiple 1 GW+ AI data centers.

We mapped their construction using satellite imagery, permits & public sources — releasing everything for free, including commissioned satellite images.

Highlights in thread!
November 4, 2025 at 7:16 PM
By stitching benchmarks together, the Epoch Capabilities Index allows us to compare frontier models to models with 100,000x less training compute.
November 3, 2025 at 8:59 PM
We looked at OSWorld, a popular evaluation of AI computer use capabilities.

Our findings: tasks are simple, many don't require GUIs, and success often hinges on interpreting ambiguous instructions. The benchmark is also not stable over time.

See thread for details!
November 3, 2025 at 8:16 PM
We found a bug in our benchmarking code: calls to GPT-5 with "high" reasoning were silently being set to "medium".

Corrected results: GPT-5 (high) scores slightly higher than GPT-5 (medium) on the benchmarks we run. They are also now tied on the Epoch Capabilities Index (ECI).
October 31, 2025 at 3:22 PM
We used our new capabilities index, the ECI, to measure the gap between open- and closed-weight models.

The result? This gap is smaller than previously estimated.

On average, it takes 3.5 months for an open-weight model to catch up with the closed-weight SOTA.
October 30, 2025 at 7:59 PM
Conventional wisdom in AI is that large-scale pretraining needs to happen in massive contiguous data center campuses. But is this true?

Our research suggests that conducting 10 GW training runs across two dozen sites, linked by a network spanning thousands of kilometers, is feasible.
October 28, 2025 at 6:00 PM
We've launched a new tool to track AI progress!

The tool addresses one of the field's biggest challenges: benchmark saturation.

It's called the Epoch Capabilities Index (ECI) — here's what makes it different:
October 27, 2025 at 7:13 PM
Large language models can imitate reasoning steps and even verify formal proofs.

But mathematical physicist Svetlana Jitomirskaya argues they lack folklore knowledge: the implicit priors mathematicians build from experience.

Link to video in comments!
October 27, 2025 at 3:50 PM
Stanford mathematician Ravi Vakil expects AI’s impact on mathematics to come as a phase change, not a slow climb.

Every major shift in math has caught experts off guard, he says. This one will be no different, except that all our predictions will be even more wrong.

Link to video in comments!
October 23, 2025 at 1:53 PM
We evaluated Claude Haiku 4.5 on several benchmarks.

Even with reasoning disabled, Haiku 4.5 performs on par with or better than early lightweight reasoning models like o1-mini.
October 17, 2025 at 5:49 PM
If you ran GPT-5 infinitely many times on FrontierMath—our extremely challenging math benchmark—would it eventually solve every problem?

Probably not. From what we can tell, it caps out below 50%.

What about throwing in *every* available model? Infinitely many times? 🧵
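The idea of a ceiling under repeated sampling can be sketched with the standard pass@k estimator (Chen et al., 2021). This is a generic illustration, not Epoch's exact statistical analysis: a problem's asymptotic solve rate is nonzero only if the model solved it at least once in the observed attempts, so the benchmark's ceiling is the fraction of problems ever solved.

```python
# Illustrative sketch of repeated-sampling ceilings; numbers are invented.
from math import comb

def pass_at_k(n: int, c: int, k: int) -> float:
    """Unbiased estimate of the chance that at least one of k samples is
    correct, given n total attempts of which c succeeded
    (the standard pass@k estimator from Chen et al., 2021)."""
    if n - c < k:
        return 1.0
    return 1.0 - comb(n - c, k) / comb(n, k)

def pass_at_inf(results: list[tuple[int, int]]) -> float:
    """Crude ceiling as k grows without bound: a problem counts as
    eventually solvable only if it was solved at least once in the
    observed (n, c) attempts, so the asymptote is the fraction of
    ever-solved problems."""
    return sum(1 for _n, c in results if c > 0) / len(results)
```

Under this view, "caps out below 50%" means that for more than half of the problems, no observed attempt ever succeeded, so extra samples of the same model cannot help.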
October 17, 2025 at 4:56 PM