Epoch AI
@epochai.bsky.social
We are a research institute investigating the trajectory of AI for the benefit of society. epoch.ai
epochai.bsky.social
RL scaling remains a key factor in near-term AI progress. Recent evidence about this compute frontier is sparse, but we're tracking it closely.
epochai.bsky.social
Overall, if final RL training made up 10% to 200% of pre-training compute, that yields a median estimate of 5e25 FLOP for GPT-5’s overall training compute. And GPT-5 was likely trained on less than 1e26 FLOP.
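For concreteness, here is a minimal sketch in Python of how these figures could combine. The geometric-midpoint aggregation and the 3e25 pre-training median are assumptions drawn from this thread, not Epoch’s actual model:

```python
import math

# Inputs taken from this thread (point estimates, not Epoch's full model)
pretrain_flop = 3e25             # median pre-training compute estimate
rl_fraction_range = (0.1, 2.0)   # RL compute at 10%-200% of pre-training

# Summarize the RL range by its geometric midpoint (an assumption;
# the actual aggregation over the uncertainty isn't public)
rl_fraction_mid = math.sqrt(rl_fraction_range[0] * rl_fraction_range[1])

total_flop = pretrain_flop * (1 + rl_fraction_mid)
print(f"RL fraction midpoint: {rl_fraction_mid:.2f}")    # ~0.45
print(f"Total training compute: {total_flop:.1e} FLOP")  # ~4.3e25, i.e. ~5e25

# Even at the top of the range, 3e25 * (1 + 2.0) = 9e25 stays below 1e26,
# consistent with GPT-5 likely being under 1e26 FLOP
```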
epochai.bsky.social
Did OpenAI scale RL to match or exceed pre-training compute for GPT-5? It’s possible.

But reports suggest that training GPT-5 wasn’t straightforward, and OpenAI may have focused on different skills for GPT-5 than for o3, suggesting more experimentation vs simple scaling.
epochai.bsky.social
Next is reinforcement learning during post-training, which adds more uncertainty.

In early 2025, RL compute was small, perhaps 1-10% of pre-training compute. But this is scaling up fast: OpenAI scaled RL compute by 10× from o1 to o3, and xAI did the same from Grok 3 to Grok 4.
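As a rough illustration (hypothetical numbers, assuming pre-training compute stays fixed), one more 10× scale-up pushes RL from a small fraction of pre-training to a comparable scale:

```python
# Hypothetical: RL compute as a fraction of pre-training compute
early_2025_rl_fraction = (0.01, 0.10)  # 1-10% of pre-training
scale_up = 10                          # e.g. o1 -> o3, Grok 3 -> Grok 4

next_gen = tuple(f * scale_up for f in early_2025_rl_fraction)
print(next_gen)  # (0.1, 1.0): RL compute approaches pre-training scale
```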
epochai.bsky.social
GPT-5’s pre-training token count is unconfirmed, but Llama 4 and Qwen3 were each trained on 30-40 trillion tokens.

OpenAI has invested heavily in pre-training, so GPT-5 was likely trained on at least 30T tokens, possibly several times more.

This gives a median of ~3e25 FLOP pretrain.
epochai.bsky.social
Training compute scales in proportion to a model’s active parameter count and its training data.

Based on price, speed, and prevailing industry trends, GPT-5 is probably a “mid-sized” frontier model with ~100B active params, akin to Grok 2 (115B active), GPT-4o, and Claude Sonnet.
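A back-of-the-envelope version of this scaling rule uses the standard approximation of ~6 FLOP per active parameter per training token. The 50T token figure below is an assumed point within the “at least 30T, possibly several times more” range mentioned above:

```python
# Standard approximation: FLOP ≈ 6 * active_parameters * training_tokens
active_params = 100e9  # ~100B active parameters (this thread's estimate)
tokens = 50e12         # 50T tokens: an assumption within the "30T+" range

pretrain_flop = 6 * active_params * tokens
print(f"{pretrain_flop:.0e} FLOP")  # 3e+25, matching the ~3e25 median
```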
epochai.bsky.social
Our best guess: GPT-5 was trained on ~5e25 FLOP total, including both pre-training and reinforcement learning.

That would be more than twice as much as GPT-4 (~2e25 FLOP), but less than GPT-4.5 (>1e26 FLOP).

Here’s how it breaks down.
epochai.bsky.social
We recently wrote that GPT-5 is likely the first mainline GPT release to be trained on less compute than its predecessor.

How did we reach this conclusion, and what do we actually know about how GPT-5 was trained?
🧵
epochai.bsky.social
Nevertheless, it's hard to deny that AI models have become substantially more useful over the past 12 months. One indication of this is that revenues at frontier AI companies have more than tripled in the past year.
epochai.bsky.social
Of course, benchmarks don’t capture real-world utility perfectly. Even a model scoring 100% on GPQA Diamond probably wouldn’t fully replace scientists: models can overfit to benchmarks, and benchmarks don’t capture every aspect of real-world work.
epochai.bsky.social
Across benchmarks covering coding, math, scientific knowledge, common sense and visual reasoning, and more, state-of-the-art models have improved by 20 to 50 percentage points in the last year.
epochai.bsky.social
AI capabilities have been steadily improving across a wide range of skills, and show no sign of slowing down in the near term. 🧵
epochai.bsky.social
This work was commissioned by Google. Epoch maintained editorial control over the output. We offer timely and in-depth evaluation as a service to model developers; DM us for details!
epochai.bsky.social
We noticed Deep Think making several bibliographic errors, referencing works that either did not exist or did not contain the claimed results. Anecdotally, this was the model’s main weakness compared to other leading models.
epochai.bsky.social
Deep Think approaches geometry problems differently from other LLMs: rather than casting everything in coordinate systems, it works with higher-level concepts. This is how humans prefer to solve geometry problems as well.
epochai.bsky.social
This version of Deep Think got a bronze medal-equivalent score on the 2025 IMO. We challenged it with two problems from the 2024 IMO that are a bit harder than the hardest problem it solved on the 2025 IMO. It failed to solve either problem even when given ten attempts.
epochai.bsky.social
Professional mathematicians characterized Deep Think as a broadly helpful research assistant.
epochai.bsky.social
Good performance on FrontierMath requires deep background knowledge and precise execution of computations. Deep Think has made progress but hasn’t yet mastered these skills, still scoring lower on the harder tiers of the benchmark.
epochai.bsky.social
Note that this is the publicly available version of Deep Think, not the version that achieved a gold medal-equivalent score on the IMO. Google has described the publicly available Deep Think model as a “variation” of the IMO gold model.
epochai.bsky.social
We evaluated Gemini 2.5 Deep Think on FrontierMath. There is no API, so we ran it manually. The results: a new record!

We also conducted a more holistic evaluation of its math capabilities. 🧵
epochai.bsky.social
USC mathematician Greta Panova wrote a math problem so difficult that today’s most advanced AI models don’t know where to begin.

She thinks that when AI finally can, it will have crossed a threshold in general human-level reasoning.

Link to video in comments!