Thaddée Tyl
@espadrine.bsky.social
Self-replicating organisms. shields.io, Captain Train, Qonto. They.
And the official announcement: mistral.ai/news/mistral-3
Introducing Mistral 3 | Mistral AI
A family of frontier open-source multimodal models
December 3, 2025 at 10:40 AM
As always, find the leaderboard at metabench.organisons.com
LLM Benchmark Aggregator & Estimator
December 3, 2025 at 10:40 AM
It might disappoint, but it should not.

We got too used to no longer seeing the GPT base model.

Let’s compare to the DeepSeek base model.
The jump from base to reasoning is tremendous!

Large 3 starts off slightly higher than DeepSeek base. I’m eager to see Magistral Large!
December 3, 2025 at 10:40 AM
In most metrics, it pulls ahead of Medium by a slim margin.

It might not impress anyone, because it lags behind GPT-5.1 and all the reasoning models, even when accounting for their increased token consumption costs. GPT-OSS-20B High might beat it everywhere except agentic coding.
December 3, 2025 at 10:40 AM
I see the story of Mistral Large 3 as one of a major technical shift. To reduce inference costs further, they trained a large MoE from scratch, after years of building on existing weights.

Large 3 improves reasoning compared to Large 2, but is overtaken by… reasoning models.
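The inference-cost argument for MoE can be illustrated with a toy top-k router: only a few experts run per token, so active parameters (and FLOPs) stay well below the total. A minimal numpy sketch with made-up sizes, not Mistral's actual architecture:

```python
import numpy as np

rng = np.random.default_rng(0)
d, n_experts, top_k = 16, 8, 2  # toy sizes, not Mistral's

# Each expert is a small linear map; the router scores all experts per token.
experts = [rng.standard_normal((d, d)) / np.sqrt(d) for _ in range(n_experts)]
router = rng.standard_normal((d, n_experts)) / np.sqrt(d)

def moe_forward(x):
    """Route a token to its top-k experts and mix their outputs."""
    logits = x @ router
    top = np.argsort(logits)[-top_k:]        # indices of the chosen experts
    weights = np.exp(logits[top])
    weights /= weights.sum()                 # softmax renormalized over the top-k
    return sum(w * (x @ experts[i]) for w, i in zip(weights, top))

x = rng.standard_normal(d)
y = moe_forward(x)
assert y.shape == (d,)

# Inference cost scales with *active* parameters, not total ones.
active = top_k * d * d
total = n_experts * d * d
print(f"active fraction: {active / total:.2f}")  # 2 of 8 experts → 0.25
```

With top-2 routing over 8 experts, only a quarter of the expert parameters are touched per token, which is the whole point of the design.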
December 3, 2025 at 10:40 AM
It, along with its Ministral sisters, is also the best model of its size class on math, coding and agentic tool use!
December 3, 2025 at 10:40 AM
Find the full leaderboards and benchmark predictions here: metabench.organisons.com

And the original announcement: api-docs.deepseek.com/news/news251...
December 1, 2025 at 6:28 PM
Big jump in math as well! Grok 4.1 Fast is quite strong on this front too, but it now has a powerful challenger in this price range.
December 1, 2025 at 6:28 PM
That's it for today! Enjoy.
November 18, 2025 at 5:37 PM
As expected, the same can be said of classic RAG-and-search customer support chatbots, a use case covered by our agentic leaderboard.
November 18, 2025 at 5:37 PM
In agentic coding, though, Claude still seems to pull ahead, but by a slim margin now.
November 18, 2025 at 5:37 PM
The code it writes is quite good, scoring 76.2 on SWE-bench Verified, compared to 74.5 for GPT-5 Codex (a model dedicated to code).
November 18, 2025 at 5:37 PM
It is pretty good at math, but honestly on-par with GPT-5. It has an AIME2025 of 95 for instance, compared to GPT-5's 94.
November 18, 2025 at 5:37 PM
Where it shines most is in reasoning.
It jumps ahead of the pack, which had caught up with Gemini 2.5.
November 18, 2025 at 5:37 PM
The more benchmark data we aggregate, the better the leaderboard gets.
You can contribute scores here: github.com/espadrine/me...
GitHub - espadrine/metabench: Benchmark aggregator and estimator for AI LLMs
November 18, 2025 at 5:21 PM
That lets us compare models in broad categories: raw knowledge, in-context reasoning, math, coding, agency, …

Stunningly, we get to compare models really fast.
No need to wait for independent benchmarks to run, or for @arena votes.

A few benchmarks are enough.
November 18, 2025 at 5:21 PM
Math benchmarks are highly correlated between themselves. So are coding ones, etc.

We can infer unknown benchmark scores from published ones.

So we aggregate a lot of benchmarks, and predict the others.
(There is a bit of math involved in getting the right algorithm!)
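The prediction step can be sketched as a plain least-squares fit: when benchmarks correlate strongly, a model's score on an unpublished benchmark is roughly a linear function of its published ones. A toy sketch with invented scores, not metabench's actual algorithm:

```python
import numpy as np

# Invented scores for 4 models on 3 correlated math benchmarks (rows: models).
# The last model's score on benchmark C is unpublished: the value we predict.
scores = np.array([
    # A     B     C
    [55.0, 60.0, 50.0],
    [70.0, 72.0, 64.0],
    [82.0, 85.0, 78.0],
    [90.0, 93.0, np.nan],  # C unknown
])

known = scores[:3]
X = np.column_stack([np.ones(3), known[:, :2]])  # intercept + benchmarks A, B
y = known[:, 2]                                  # benchmark C
coef, *_ = np.linalg.lstsq(X, y, rcond=None)     # least-squares fit

# Predict the missing C score from the new model's published A and B scores.
pred = coef @ np.array([1.0, 90.0, 93.0])
print(f"predicted C score: {pred:.1f}")
```

The real aggregator has to cope with noise, many missing entries, and benchmarks on different scales, hence "a bit of math involved", but the core idea is this kind of cross-benchmark regression.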
November 18, 2025 at 5:21 PM
Looks like the Gemini embedding is not ready yet.

models/text-embedding-004 doesn't return 429s.
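For context, a 429 is the rate-limit / capacity error an endpoint returns when it cannot serve a request yet. A minimal retry-with-backoff sketch around an embedding call; all names here are hypothetical stand-ins, not the Gemini SDK:

```python
import time

class RateLimitError(Exception):
    """Stand-in for an HTTP 429 response from an embeddings endpoint."""

def with_backoff(call, retries=3, base_delay=0.01):
    """Retry `call` on 429-style errors with exponential backoff.

    `call` is any zero-argument function, e.g. a wrapper around an
    embedding request; these names are hypothetical, not a real SDK.
    """
    for attempt in range(retries):
        try:
            return call()
        except RateLimitError:
            if attempt == retries - 1:
                raise
            time.sleep(base_delay * 2 ** attempt)

# Simulate an endpoint that returns 429 twice, then succeeds.
attempts = {"n": 0}
def flaky_embed():
    attempts["n"] += 1
    if attempts["n"] < 3:
        raise RateLimitError("429: resource exhausted")
    return [0.1, 0.2, 0.3]  # fake embedding vector

print(with_backoff(flaky_embed))  # succeeds on the third attempt
```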
June 30, 2025 at 8:28 AM
This is from H’s Holo1 Hugging Face README: huggingface.co/Hcompany/Hol...
Hcompany/Holo1-7B · Hugging Face
June 10, 2025 at 12:54 PM