epoch.ai
We previously described Claudiness as "good at agentic tasks while being weaker at multimodal and math". This pattern remains when comparing Opus 4.5 to other newly-released models, though the gap on agentic coding and tool-calling benchmarks is small.
This score is behind Gemini 3 Pro and GPT-5.1 (high) while being on par with earlier frontier models like o3 (high) and Grok 4.
You can now examine annotated, recent, high-resolution satellite imagery of the world's largest compute clusters directly from your phone at epoch.ai/data/data-c....
Here’s a look at the updated Satellite Viewer:
On the Epoch Capabilities Index (ECI), which combines multiple benchmarks, Gemini 3 Pro scored 154, up from GPT-5.1’s previous high score of 151.
🧵 with some analysis, including the discovery of a “Claudiness” dimension.
Our Frontier Data Centers database shows that some upcoming campuses will cover a substantial portion of Manhattan. Meta's Hyperion data center will be nearly four times the size of Central Park.
That’s according to the Epoch Capabilities Index, our tool for combining results across multiple benchmarks. With “high” reasoning, both GPT-5.1 and GPT-5 score 151 on ECI.
See 🧵 for individual benchmark scores!
Join us for a live webinar/Q&A on our new Frontier Data Centers Hub, exploring what this infrastructure buildout means for AI.
Nov 20, 1-2 PM PT
luma.com/oste01d0
Some say this kicks off a software singularity, where AIs recursively improve themselves and rapidly get smarter. Others think there’ll be a bottleneck.
So how can we tell who’s right? 🧵
So we spent the last few months reading legal permits, staring at satellite images, and scouring news sources.
Here’s what you need to know. 🧵
Some hyperscalers plan to do it in just 1-2 years from the start of construction.
If they succeed, we’ll see the first GW-scale data centers online in 2026, marking one of the fastest infrastructure build-outs in history. 🧵
One way to read our new capability index is to plot the benchmark performance you’d expect to see across a range of ECI scores 🧵
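The idea of mapping an ECI score to expected benchmark performance can be illustrated with a simple logistic model. This is only a sketch: the curve shape, the `slope`, and the per-benchmark `difficulty` values here are hypothetical parameters, not Epoch's actual methodology.

```python
import math

def expected_accuracy(eci, difficulty, slope=0.05):
    """Illustrative logistic mapping from a capability score to expected
    benchmark accuracy. `difficulty` is the (hypothetical) ECI score at
    which a model would be expected to hit 50% on that benchmark;
    `slope` controls how fast accuracy rises around that point."""
    return 1.0 / (1.0 + math.exp(-slope * (eci - difficulty)))

# Two hypothetical benchmarks: an easier one saturating early,
# and a harder one that only frontier-level scores make progress on.
benchmarks = {"easy_bench": 100, "hard_bench": 160}

for eci in (140, 151, 154):
    row = {name: round(expected_accuracy(eci, d), 2)
           for name, d in benchmarks.items()}
    print(eci, row)
```

The useful property of a shared scale like this is that one number per model implies a full profile of expected scores, with harder benchmarks discriminating between models that have already saturated the easier ones.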
bsky.app/profile/epo...
The world is about to see multiple 1 GW+ AI data centers.
We mapped their construction using satellite imagery, permits & public sources — releasing everything for free, including commissioned satellite images.
Highlights in thread!
Our findings: tasks are simple, many don't require GUIs, and success often hinges on interpreting ambiguous instructions. The benchmark is also not stable over time.
See thread for details!
Corrected results: GPT-5 (high) scores slightly higher than GPT-5 (medium) on the benchmarks we run. They are also now tied on the Epoch Capabilities Index (ECI).
The result? This gap is smaller than previously estimated.
On average, it takes 3.5 months for an open-weight model to catch up with closed-source SOTA.
Our research suggests that conducting 10 GW training runs across two dozen sites—linked by a network spanning thousands of kilometers—is feasible.
The tool addresses one of the field's biggest challenges: benchmark saturation.
It's called the Epoch Capabilities Index (ECI) — here's what makes it different:
But mathematical physicist Svetlana Jitomirskaya argues they lack folklore knowledge: the implicit priors mathematicians build from experience.
Link to video in comments!
Every major shift in math has caught experts off guard, he says. This one will be no different, except that all our predictions will be even more wrong.
Link to video in comments!
Even with reasoning disabled, Haiku 4.5 performs on par with or better than early lightweight reasoning models like o1-mini.
Probably not. From what we can tell, it caps out below 50%.
What about throwing in *every* available model? Infinitely many times? 🧵