Thaddée Tyl
@espadrine.bsky.social
Self-replicating organisms. shields.io, Captain Train, Qonto. They.
And the official announcement: mistral.ai/news/mistral-3
Introducing Mistral 3 | Mistral AI
A family of frontier open-source multimodal models
December 3, 2025 at 10:40 AM
As always, find the leaderboard at metabench.organisons.com
LLM Benchmark Aggregator & Estimator
December 3, 2025 at 10:40 AM
It might disappoint, but it should not.

We got too used to no longer seeing the GPT base model.

Let’s compare to the DeepSeek base model.
The jump from base to reasoning is tremendous!

Large 3 starts off slightly higher than DeepSeek base. I’m eager to see Magistral Large!
December 3, 2025 at 10:40 AM
In most metrics, it pulls ahead of Medium by a slim margin.

It might not impress anyone, because it lags behind GPT-5.1 and all the reasoning models, even when accounting for their increased token consumption costs. GPT-OSS-20B High might beat it everywhere except agentic coding.
December 3, 2025 at 10:40 AM
I see the story of Mistral Large 3 as one of a major technical shift. To reduce inference costs further, they trained a large MoE from scratch, after years of building on existing weights.

Large 3 improves reasoning compared to Large 2, but is overtaken by… reasoning models.
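The inference-cost argument for MoE can be illustrated with a toy top-k router: only a few experts run per token, so active parameters (and FLOPs) stay well below the total. A minimal numpy sketch with made-up sizes, not Mistral's actual architecture:

```python
import numpy as np

rng = np.random.default_rng(0)
d, n_experts, top_k = 16, 8, 2  # toy sizes, not Mistral's

# Each expert is a small linear map; the router scores all experts per token.
experts = [rng.standard_normal((d, d)) / np.sqrt(d) for _ in range(n_experts)]
router = rng.standard_normal((d, n_experts)) / np.sqrt(d)

def moe_forward(x):
    """Route a token to its top-k experts and mix their outputs."""
    logits = x @ router
    top = np.argsort(logits)[-top_k:]        # indices of the chosen experts
    weights = np.exp(logits[top])
    weights /= weights.sum()                 # softmax renormalized over the top-k
    return sum(w * (x @ experts[i]) for w, i in zip(weights, top))

x = rng.standard_normal(d)
y = moe_forward(x)
assert y.shape == (d,)

# Inference cost scales with *active* parameters, not total ones.
active = top_k * d * d
total = n_experts * d * d
print(f"active fraction: {active / total:.2f}")  # 2 of 8 experts → 0.25
```

With top-2 routing over 8 experts, only a quarter of the expert parameters are touched per token, which is the whole point of the design.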
December 3, 2025 at 10:40 AM
It, along with its Ministral sisters, is also the best model of its size class on math, coding and agentic tool use!
December 3, 2025 at 10:40 AM
Find the full leaderboards and benchmark predictions here: metabench.organisons.com

And the original announcement: api-docs.deepseek.com/news/news251...
December 1, 2025 at 6:28 PM
Big jump in math as well! Grok 4.1 Fast is quite strong on this front too, but it now has a powerful challenger in this price range.
December 1, 2025 at 6:28 PM
That's it for today! Enjoy.
November 18, 2025 at 5:37 PM
As expected, the same can be said of classic RAG-and-search customer support chatbots, a use case covered by our agentic leaderboard.
November 18, 2025 at 5:37 PM
In agentic coding, though, Claude still seems to pull ahead, but by a slim margin now.
November 18, 2025 at 5:37 PM
The code it writes is quite good, scoring 76.2 on SWE-bench Verified, compared to 74.5 for GPT-5 Codex (a model dedicated to code).
November 18, 2025 at 5:37 PM
It is pretty good at math, but honestly on-par with GPT-5. It has an AIME2025 of 95 for instance, compared to GPT-5's 94.
November 18, 2025 at 5:37 PM
Where it shines most is in reasoning.
It jumps ahead of the pack, which had caught up with Gemini 2.5.
November 18, 2025 at 5:37 PM
The more benchmark data we aggregate, the better the leaderboard gets.
You can contribute scores here: github.com/espadrine/me...
GitHub - espadrine/metabench: Benchmark aggregator and estimator for AI LLMs
November 18, 2025 at 5:21 PM
That lets us compare models in broad categories: raw knowledge, in-context reasoning, math, coding, agency, …

Stunningly, we get to compare models really fast.
No need to wait for independent benchmarks to run, or for @arena votes.

A few benchmarks are enough.
November 18, 2025 at 5:21 PM
Math benchmarks are highly correlated between themselves. So are coding ones, etc.

We can infer unknown benchmark scores from published ones.

So we aggregate a lot of benchmarks, and predict the others.
(There is a bit of math involved in getting the right algorithm!)
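The prediction step can be sketched as a plain least-squares fit: when benchmarks correlate strongly, a model's score on an unpublished benchmark is roughly a linear function of its published ones. A toy sketch with invented scores, not metabench's actual algorithm:

```python
import numpy as np

# Invented scores for 4 models on 3 correlated math benchmarks (rows: models).
# The last model's score on benchmark C is unpublished: the value we predict.
scores = np.array([
    # A     B     C
    [55.0, 60.0, 50.0],
    [70.0, 72.0, 64.0],
    [82.0, 85.0, 78.0],
    [90.0, 93.0, np.nan],  # C unknown
])

known = scores[:3]
X = np.column_stack([np.ones(3), known[:, :2]])  # intercept + benchmarks A, B
y = known[:, 2]                                  # benchmark C
coef, *_ = np.linalg.lstsq(X, y, rcond=None)     # least-squares fit

# Predict the missing C score from the new model's published A and B scores.
pred = coef @ np.array([1.0, 90.0, 93.0])
print(f"predicted C score: {pred:.1f}")
```

The real aggregator has to cope with noise, many missing entries, and benchmarks on different scales, hence "a bit of math involved", but the core idea is this kind of cross-benchmark regression.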
November 18, 2025 at 5:21 PM
Looks like the Gemini embedding is not ready yet.

models/text-embedding-004 doesn't return 429s.
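For context, a 429 is the rate-limit / capacity error an endpoint returns when it cannot serve a request yet. A minimal retry-with-backoff sketch around an embedding call; all names here are hypothetical stand-ins, not the Gemini SDK:

```python
import time

class RateLimitError(Exception):
    """Stand-in for an HTTP 429 response from an embeddings endpoint."""

def with_backoff(call, retries=3, base_delay=0.01):
    """Retry `call` on 429-style errors with exponential backoff.

    `call` is any zero-argument function, e.g. a wrapper around an
    embedding request; these names are hypothetical, not a real SDK.
    """
    for attempt in range(retries):
        try:
            return call()
        except RateLimitError:
            if attempt == retries - 1:
                raise
            time.sleep(base_delay * 2 ** attempt)

# Simulate an endpoint that returns 429 twice, then succeeds.
attempts = {"n": 0}
def flaky_embed():
    attempts["n"] += 1
    if attempts["n"] < 3:
        raise RateLimitError("429: resource exhausted")
    return [0.1, 0.2, 0.3]  # fake embedding vector

print(with_backoff(flaky_embed))  # succeeds on the third attempt
```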
June 30, 2025 at 8:28 AM
This is from H’s Holo1 Hugging Face README: huggingface.co/Hcompany/Hol...
Hcompany/Holo1-7B · Hugging Face
June 10, 2025 at 12:54 PM