Thaddée Tyl
@espadrine.bsky.social
Self-replicating organisms. shields.io, Captain Train, Qonto. They.
It might disappoint, but it should not.

We got too used to no longer seeing the GPT base model.

Let’s compare to the DeepSeek base model.
The jump from base to reasoning is tremendous!

Large 3 starts off slightly higher than DeepSeek base. I’m eager to see Magistral Large!
December 3, 2025 at 10:40 AM
I see the story of Mistral Large 3 as one of a major technical shift. To reduce inference costs further, they trained a large MoE from scratch, after years of building on existing weights.

Large 3 improves reasoning compared to Large 2, but is overtaken by… reasoning models.
December 3, 2025 at 10:40 AM
It, along with its Ministral sisters, is also the best model in its size class on math, coding, and agentic tool use!
December 3, 2025 at 10:40 AM
The path of the Mistral 7B is nice to see!

The OG one topped open models of that size. For the first time, a local model felt usable on consumer hardware.

Not only is the latest Ministral 8B on the Pareto frontier for knowledge vs. cost (and for search, math, agentic uses)…
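
For the curious, a minimal sketch of what being on the Pareto frontier means here, with made-up prices and scores: a model stays on the frontier when no other model is both cheaper and at least as strong.

```python
# Hypothetical (cost per million tokens, benchmark score) pairs.
models = {
    "small":  (0.10, 62.0),
    "mid":    (0.25, 70.0),
    "pricey": (0.30, 68.0),  # dominated: "mid" is cheaper and scores higher
}

def pareto_frontier(points: dict) -> dict:
    """Keep models that no other model dominates: lower or equal cost,
    at least as high a score, and strictly better on one of the two."""
    frontier = {}
    for name, (cost, score) in points.items():
        dominated = any(
            c <= cost and s >= score and (c, s) != (cost, score)
            for other, (c, s) in points.items()
            if other != name
        )
        if not dominated:
            frontier[name] = (cost, score)
    return frontier

print(pareto_frontier(models))  # {'small': (0.1, 62.0), 'mid': (0.25, 70.0)}
```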
December 3, 2025 at 10:40 AM
Big jump in math as well! Grok 4.1 Fast is quite strong on this front too, but it now has a powerful challenger in this price range.
December 1, 2025 at 6:28 PM
DeepSeek released V3.2 (and V3.2 Speciale, a math-oriented model).

New model, new benchmarks!

The biggest jump for DeepSeek V3.2 is on agentic coding, where it seems poised to push a lot of models off the Pareto frontier, including Sonnet 4.5, Minimax M2, and K2 Thinking.
December 1, 2025 at 6:28 PM
As expected, the same can be said of classic RAG-and-search customer-support chatbots, a use case on our agentic leaderboard.
November 18, 2025 at 5:37 PM
In agentic coding, though, Claude still seems to pull ahead, but by a slim margin now.
November 18, 2025 at 5:37 PM
The code it writes is quite good: 76.2 on SWE-bench Verified, compared to 74.5 for GPT-5 Codex (a model dedicated to code).
November 18, 2025 at 5:37 PM
It is pretty good at math, though honestly on par with GPT-5: an AIME 2025 score of 95, for instance, compared to GPT-5's 94.
November 18, 2025 at 5:37 PM
Where it shines most is in reasoning.
It jumps ahead of the pack, which had caught up with Gemini 2.5.
November 18, 2025 at 5:37 PM
So, how is Gemini 3 on this new leaderboard?

Its intrinsic knowledge is unmatched, surpassing 2.5 and GPT-5.1.

bsky.app/profile/espa...
November 18, 2025 at 5:37 PM
Unveiling a new LLM leaderboard: metabench.organisons.com

Why?

Company C1 releases model M1 and discloses benchmarks B1.
Company C2 releases M2, showing off a distinct set of benchmarks B2.
Comparing those models is hard since they don't share benchmarks!
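
A toy illustration of the problem, with made-up model names and scores: when two models publish mostly disjoint benchmark suites, the direct comparison shrinks to whatever overlap remains, sometimes a single number.

```python
# Made-up benchmark scores disclosed by two hypothetical companies.
m1_scores = {"KnowledgeBench": 78.2, "MathOlympiad": 94.0, "HouseEval": 88.1}
m2_scores = {"SciQuiz": 71.5, "MathOlympiad": 95.0, "AgentEval": 90.3}

# The only scores comparable head-to-head are the shared ones.
shared = m1_scores.keys() & m2_scores.keys()
for bench in sorted(shared):
    print(bench, m1_scores[bench], "vs", m2_scores[bench])
# MathOlympiad 94.0 vs 95.0 — one data point, which may not be representative.
```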
November 18, 2025 at 5:21 PM
Isn’t there a better way to handle screens than asking a *language model* to guess the number of pixels to the left and top of a UI widget?
June 10, 2025 at 12:51 PM
This diffusion has shenanigans. The number of tokens between two unchanged sequences can increase or decrease.
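
A toy illustration, with made-up token sequences rather than real model output: two denoising steps can agree on the surrounding spans while the stretch between them grows.

```python
# Hypothetical outputs of two successive denoising steps.
step_n  = ["The", "cat", "sat", "on", "mat", "."]
step_n1 = ["The", "cat", "quietly", "sat", "on", "the", "mat", "."]

# "The cat" and "sat on" survive unchanged, yet the sequence grew from
# 6 to 8 tokens, so positions after the first change no longer line up.
print(len(step_n), "->", len(step_n1))  # 6 -> 8
```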
May 21, 2025 at 12:51 PM
We can get GNSS spatial positioning all the way to the moon, given the right receiver!

Greatly simplifies space travel.

I still believe we should set up a separate GNSS on every planet.

ntrs.nasa.gov/api/citation...
March 5, 2025 at 2:30 PM
LLMs get better at tool use and search.
Model memorization is thus less useful than reasoning.
Yet a lot of benchmarks still focus on the former.
February 26, 2025 at 9:28 AM
Surprisingly, bigger Llama 3 models are worse than smaller ones at learning from relevant context and giving a good answer.

Unsurprisingly, base models evaluate the probability of a good answer better than instruct models, which give a low probability to speech that doesn't match their style.
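
For context, here is a minimal sketch of the kind of scoring involved: summing the log-probabilities a causal LM assigns to a reference answer after its context. The model name and prompt are placeholders, not the setup behind these results, and the boundary tokenization is simplified.

```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model_id = "gpt2"  # placeholder; swap in any causal LM, base or instruct
tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(model_id)

def answer_logprob(context: str, answer: str) -> float:
    """Sum of log-probabilities the model assigns to `answer` given `context`."""
    ctx_len = tokenizer(context, return_tensors="pt").input_ids.shape[1]
    ids = tokenizer(context + answer, return_tensors="pt").input_ids
    with torch.no_grad():
        logits = model(ids).logits
    # The token at position i+1 is predicted from the logits at position i.
    log_probs = torch.log_softmax(logits[0, :-1], dim=-1)
    positions = torch.arange(ctx_len - 1, ids.shape[1] - 1)
    return log_probs[positions, ids[0, ctx_len:]].sum().item()

# A higher score means the model finds this answer more plausible here.
print(answer_logprob("Q: What is the capital of France?\nA:", " Paris"))
```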
February 17, 2025 at 10:11 PM
Do they have a reason to fear they won’t get paid?
February 7, 2025 at 6:03 PM
Bittersweet to see the latest Codestral so close to the open-weights version, yet to see that both are so close to Claude.
February 6, 2025 at 10:28 AM
The issue with this kind of login form: when I come back, I have no idea which one I picked, and I'm not going to try them all just to find the one I signed up with.
October 25, 2024 at 8:10 AM
But I wonder if there is a better way to write this CSS.
It is brittle because it depends on #browser being a sibling after #navigator-toolbox.

Do you have suggestions?
August 21, 2024 at 2:42 PM
I want to make Firefox’s UI very minimal, only summoned through Ctrl+L.

Here is my solution so far.

(Requires adding the CSS file in a chrome/ folder inside the profile’s root directory shown in about:profiles, then setting toolkit.legacyUserProfileCustomizations.stylesheets to true in about:config.)
August 21, 2024 at 2:41 PM
The “large” in large language models is a moving target.

The first GPT had about a hundred million parameters, and already called itself large.

GPT-4 has almost two trillion.
March 23, 2024 at 10:30 AM
It is a bit odd that its performance on benchmarks in the languages it targets has decreased compared to the raw Mistral model.
March 7, 2024 at 10:10 PM