Epoch AI
@epochai.bsky.social
We are a research institute investigating the trajectory of AI for the benefit of society. epoch.ai
epochai.bsky.social
Nevertheless, it's hard to deny that AI models have become substantially more useful over the past 12 months. One indication of this is that revenues at frontier AI companies have more than tripled in the past year.
epochai.bsky.social
Of course, benchmarks don't capture real-world utility perfectly. Even a model scoring 100% on GPQA Diamond probably won't fully replace scientists, since models can be overfit to benchmarks, and benchmarks don't capture all aspects of real-world work.
epochai.bsky.social
Across benchmarks covering coding, math, scientific knowledge, common-sense reasoning, visual reasoning, and more, state-of-the-art models have improved by 20 to 50 percentage points in the last year.
epochai.bsky.social
AI capabilities have been steadily improving across a wide range of skills, and show no sign of slowing down in the near term. 🧵
epochai.bsky.social
This work was commissioned by Google. Epoch maintained editorial control over the output. We offer timely and in-depth evaluation as a service to model developers; DM us for details!
epochai.bsky.social
We noticed Deep Think making several bibliographic errors, referencing works that either did not exist or did not contain the claimed results. Anecdotally, this was the model’s main weakness compared to other leading models.
epochai.bsky.social
Deep Think approaches geometry problems differently than other LLMs: rather than casting everything in coordinate systems, it works with higher-level concepts. This is how humans prefer to solve geometry problems as well.
epochai.bsky.social
This version of Deep Think got a bronze medal-equivalent score on the 2025 IMO. We challenged it with two problems from the 2024 IMO that are a bit harder than the hardest problem it solved on the 2025 IMO. It failed to solve either problem even when given ten attempts.
epochai.bsky.social
Professional mathematicians characterized Deep Think as a broadly helpful research assistant.
epochai.bsky.social
Good performance on FrontierMath requires deep background knowledge and precise execution of computations. Deep Think has made progress but hasn’t yet mastered these skills, still scoring lower on the harder tiers of the benchmark.
epochai.bsky.social
Note that this is the publicly available version of Deep Think, not the version that achieved a gold medal-equivalent score on the IMO. Google has described the publicly available Deep Think model as a “variation” of the IMO gold model.
epochai.bsky.social
We evaluated Gemini 2.5 Deep Think on FrontierMath. There is no API, so we ran it manually. The results: a new record!

We also conducted a more holistic evaluation of its math capabilities. 🧵
epochai.bsky.social
USC mathematician Greta Panova wrote a math problem so difficult that today’s most advanced AI models don’t know where to begin.

She thinks that when AI can finally solve it, it will have crossed a threshold in general human-level reasoning.

Link to video in comments!
epochai.bsky.social
When mathematicians make breakthroughs, they hallucinate too.

They reach beyond established results. But unlike AI, they’ve learned to tell a promising hallucination from a dead end.

Number theorist Ken Ono on AI, creativity, and mathematical discovery.

Link to video in comments!
epochai.bsky.social
Tagging people who might be interested: @TomDavidsonX, @eli_lifland, @akorinek, @krishnanrohit
epochai.bsky.social
However, as compute stocks and AI capabilities increase, we'll have more digital workers able to automate a wider range of tasks.

Moreover, AI systems will likely perform tasks that no human currently can – making our estimate a lower bound on economic impact.
epochai.bsky.social
What does this mean?

7M workers is still small compared to the global workforce, and AI can currently handle only a relatively narrow set of tasks.
epochai.bsky.social
Finally, we divide (1) by (2) to get our estimate of digital workers.

Ensembling over both methods used to calculate (2), we obtain a final estimate of ~7 million digital workers, with a 90% CI spanning orders of magnitude.