Lightnews — Scholar-powered news

Epoch AI @epochai.bsky.social · 2m

Learn more about this insight on our website!
epoch.ai/data-insigh...

AI capabilities have steadily improved over the past year

In the past 12 months, AI’s abilities have improved steadily across a range of skills including coding, visual and common sense reasoning, and math.

epoch.ai

Epoch AI @epochai.bsky.social · 2m

Nevertheless, it's hard to deny that AI models have become substantially more useful over the past 12 months. One indication of this is that revenues at frontier AI companies have more than tripled in the past year.

Epoch AI @epochai.bsky.social · 2m

Of course, benchmarks don't capture real-world utility perfectly. Even a model scoring 100% on GPQA Diamond probably won't fully replace scientists, since models can be overfit to benchmarks, and benchmarks don't capture all aspects of real-world work.

Epoch AI @epochai.bsky.social · 2m

Across benchmarks covering coding, math, scientific knowledge, common sense and visual reasoning, and more, state-of-the-art models have improved by 20 to 50 percentage points in the last year.

Epoch AI @epochai.bsky.social · 2m

AI capabilities have been steadily improving across a wide range of skills, and show no sign of slowing down in the near term. 🧵

Epoch AI @epochai.bsky.social · 1h

This work was commissioned by Google. Epoch maintained editorial control over the output. We offer timely and in-depth evaluation as a service to model developers; DM us for details!

Epoch AI @epochai.bsky.social · 1h

See the full report for much more!

epoch.ai/blog/deep-t...

Evaluating Gemini 2.5 Deep Think’s math capabilities

Improved use of knowledge and precision, helpful for research, more conceptual in geometry, but limited creativity and citation issues.

epoch.ai

Epoch AI @epochai.bsky.social · 1h

We noticed Deep Think making several bibliographic errors, referencing works that either did not exist or did not contain the claimed results. Anecdotally, this was the model’s main weakness compared to other leading models.

Epoch AI @epochai.bsky.social · 1h

Deep Think approaches geometry problems differently from other LLMs: rather than casting everything in coordinate systems, it works with higher-level concepts, which is also how humans prefer to solve geometry problems.

Epoch AI @epochai.bsky.social · 1h

This version of Deep Think got a bronze medal-equivalent score on the 2025 IMO. We challenged it with two problems from the 2024 IMO that are a bit harder than the hardest problem it solved on the 2025 IMO. It failed to solve either problem even when given ten attempts.

Epoch AI @epochai.bsky.social · 1h

Professional mathematicians characterized Deep Think as a broadly helpful research assistant.

Epoch AI @epochai.bsky.social · 1h

Good performance on FrontierMath requires deep background knowledge and precise execution of computations. Deep Think has made progress but hasn’t yet mastered these skills, still scoring lower on the harder tiers of the benchmark.

Epoch AI @epochai.bsky.social · 1h

Note that this is the publicly available version of Deep Think, not the version that achieved a gold medal-equivalent score on the IMO. Google has described the publicly available Deep Think model as a “variation” of the IMO gold model.

Epoch AI @epochai.bsky.social · 1h

We evaluated Gemini 2.5 Deep Think on FrontierMath. There is no API, so we ran it manually. The results: a new record!

We also conducted a more holistic evaluation of its math capabilities. 🧵

Epoch AI @epochai.bsky.social · 5h

Watch the interview here: www.youtube.com/watch?v=NC8...

Transcript available here: epochai.substack.com/p/why-front...

Why frontier AI can't solve this professor's math problem - Greta Panova

Greta Panova wrote a math problem so difficult that today’s most advanced AI models don’t know where to begin.

epochai.substack.com

Epoch AI @epochai.bsky.social · 5h

USC mathematician Greta Panova wrote a math problem so difficult that today’s most advanced AI models don’t know where to begin.

She thinks that when AI can finally solve it, it will have crossed a threshold in general human-level reasoning.

Link to video in comments!

Epoch AI @epochai.bsky.social · 2d

Transcript available here: epochai.substack.com/p/ai-can-no...

AI can now do math. But can it ask good questions? - Ken Ono

When mathematicians make breakthroughs, they hallucinate too.

epochai.substack.com

Epoch AI @epochai.bsky.social · 2d

Watch the interview here: youtu.be/At0vOy4TTMg

AI can now do math. But can it ask good questions? - Ken Ono

Ken Ono is a number theorist at the University of Virginia and advisor to the NSA. Last year AI helped him discover new formulas that detect prime numbers, p...

www.youtube.com

Epoch AI @epochai.bsky.social · 2d

When mathematicians make breakthroughs, they hallucinate too.

They reach beyond established results. But unlike AI, they’ve learned to tell a promising hallucination from a dead end.

Number theorist Ken Ono on AI, creativity, and mathematical discovery.

Link to video in comments!

Epoch AI @epochai.bsky.social · 5d

Tagging people who might be interested: @TomDavidsonX, @eli_lifland, @akorinek, @krishnanrohit

Epoch AI @epochai.bsky.social · 5d

The code for the analysis is here: squigglehub.org/models/jsd/...

Epoch AI @epochai.bsky.social · 5d

This week’s Gradient Update was written by @js_denain, @Jsevillamol, and @ansonwhho. You can read the full post here: epoch.ai/gradient-up...

How many digital workers could OpenAI deploy?

OpenAI has the inference compute to deploy tens of millions of digital workers, but only on a narrow set of tasks – for now.

epoch.ai

Epoch AI @epochai.bsky.social · 5d

However, as compute stocks and AI capabilities increase, we'll have more digital workers able to automate a wider range of tasks.

Moreover, AI systems will likely perform tasks that no human currently can – making our estimate a lower bound on economic impact.

Epoch AI @epochai.bsky.social · 5d

What does this mean?

7M workers is still small compared to the global workforce, and currently AI can only handle a relatively narrow set of tasks.

Epoch AI @epochai.bsky.social · 5d

Finally, we divide quantity (1) by quantity (2) to get our estimate of digital workers.

Ensembling over both methods used to calculate (2), we obtain a final estimate of ~7 million digital workers, with a 90% CI spanning orders of magnitude.
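
For readers who want to see the shape of this calculation, here is a minimal Python sketch. It is not the thread's actual analysis, which is the code linked on Squiggle Hub. It assumes the first quantity is the inference compute available and the second is the compute one digital worker needs, estimated by two methods that are ensembled with equal weight. Every number and distribution below is an arbitrary placeholder chosen for illustration, and the output is not meant to reproduce the ~7 million figure.

```python
import random
import statistics


def sample_lognormal(rng, median, sigma):
    """Draw from a lognormal with the given median and log-space std dev."""
    return median * rng.lognormvariate(0.0, sigma)


def estimate_digital_workers(n_samples=20_000, seed=0):
    """Monte Carlo estimate of (available compute) / (compute per worker).

    All medians, sigmas, and the 50/50 ensemble weight are hypothetical
    placeholders, not values from the original analysis.
    """
    rng = random.Random(seed)
    samples = []
    for _ in range(n_samples):
        # Assumed quantity (1): available inference compute, arbitrary units.
        available = sample_lognormal(rng, 1000.0, 0.5)
        # Assumed quantity (2): compute per worker, from one of two
        # hypothetical methods picked with equal probability (the ensemble).
        if rng.random() < 0.5:
            per_worker = sample_lognormal(rng, 1.0, 1.0)
        else:
            per_worker = sample_lognormal(rng, 3.0, 1.5)
        samples.append(available / per_worker)
    samples.sort()
    low = samples[int(0.05 * n_samples)]   # 5th percentile
    high = samples[int(0.95 * n_samples)]  # 95th percentile
    return statistics.median(samples), low, high


if __name__ == "__main__":
    median, low, high = estimate_digital_workers()
    print(f"median: {median:.0f}, 90% interval: {low:.0f} to {high:.0f}")
```

Because the per-worker requirement is uncertain over a wide range, dividing by it produces an interval that spans more than an order of magnitude even with these toy inputs.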
