Andreas Kirsch
@blackhc.bsky.social
4.3K followers 2K following 130 posts
My opinions only here. 👨‍🔬 RS DeepMind Past: 👨‍🔬 R Midjourney 1y 🧑‍🎓 DPhil AIMS Uni of Oxford 4.5y 🧙‍♂️ RE DeepMind 1y 📺 SWE Google 3y 🎓 TUM 👤 @nwspk
Pinned
blackhc.bsky.social
Ever wondered why presenting more facts can sometimes *worsen* disagreements, even among rational people? 🤔

It turns out, Bayesian reasoning has some surprising answers - no cognitive biases needed! Let's explore this fascinating paradox quickly ☺️
blackhc.bsky.social
100,000t of bombs vs the civilian casualty rates in Gaza actually shows that the IDF went out of their way to protect the civilian population

Compare this to, I dunno, the Dresden bomb raids:

25,000 deaths over 4 days from just 3,900t of bombs

Was that genocide?

en.m.wikipedia.org/wiki/Bombing...
Bombing of Dresden - Wikipedia
blackhc.bsky.social
Lol this fake statistician has blocked me for calling out his stupidity 😅
blackhc.bsky.social
Because of the urban warfare during the counterinsurgency, 60% of the buildings were destroyed in the course of a couple of months of fighting. This is in line with the report you cited above.

How is this war in Gaza different to that war?

en.m.wikipedia.org/wiki/Falluja...
blackhc.bsky.social
Did the US also commit genocide in Iraq?
blackhc.bsky.social
The Gedankenexperiment says they must be doing a pretty terrible job if this is supposed to be a genocide 🙄
blackhc.bsky.social
x.com/BlackHC/stat...

Some thoughts on the comment paper
Reposted by Andreas Kirsch
lambda.bsky.social
We launched CoverDrop 🎉 providing sources with a secure and anonymous way to talk to journalists. Having started five years ago as a PhD research project, this now ships within the Guardian app to millions of users—all of which provide cover traffic. Paper, code, and more info: www.coverdrop.org
CoverDrop: Blowing the Whistle Through A News App
blackhc.bsky.social
Oh yeah, I can't believe AI-generated ASMR is also taking off. I've seen one or two of those!
blackhc.bsky.social
What's your favorite Veo video?
blackhc.bsky.social
Yeah but that's where most of the interesting things are 😅
blackhc.bsky.social
I hope the authors (I QT'ed @MFarajtabar above) can revisit the Tower of Hanoi results and examine the confounders to strengthen the paper (or just drop ToH). This will help keep the focus on the other, more interesting environments for which the claims in the paper seem valid 🙏
blackhc.bsky.social
And other influencers exaggerate its results massively:

x.com/RubenHssd/s...
blackhc.bsky.social
So things are not as bleak as the coverage makes them sound. Apropos of coverage: here are some glowing reviews of the paper that do not question it:

Of course, @GaryMarcus likes it very much:

garymarcus.substack.com/p/a-knockou...
A knockout blow for LLMs?
LLM “reasoning” is so cooked they turned my name into a verb
blackhc.bsky.social
I want to point to one more claim which is already outdated (the relevant paper was only published a few days ago, so it's hardly anyone's fault):

The ProRL paper by Nvidia has shown that RL-based models can truly learn new things - if you run RL long enough!

arxiv.org/abs/2505.24864
blackhc.bsky.social
But as Gemini 2.5 Pro explains, River Crossing's optimal solutions are rather short, yet the puzzle has a high branching factor and many possible states with dead ends.

That models fail here is a lot more interesting and points towards areas for improvement.
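
To get a feel for that, here is a small sketch of my own (not from the paper; I assume a generic jealous-husbands-style rule with 3 actor/agent pairs and a boat capacity of 2, which may differ in detail from the paper's exact setup) that enumerates the reachable states and the average branching factor:

```python
from itertools import combinations

# Assumed toy setup: 3 actor/agent pairs, boat holds up to 2 people.
N_PAIRS = 3
BOAT_CAPACITY = 2
PEOPLE = [("actor", i) for i in range(N_PAIRS)] + [("agent", i) for i in range(N_PAIRS)]

def bank_ok(bank):
    # Assumed rule: an actor may not be with another actor's agent
    # unless their own agent is also present.
    actors = {i for kind, i in bank if kind == "actor"}
    agents = {i for kind, i in bank if kind == "agent"}
    return all(not (agents - {i}) or i in agents for i in actors)

def successors(state):
    left, boat_on_left = state
    right = frozenset(PEOPLE) - left
    source = left if boat_on_left else right
    for size in range(1, BOAT_CAPACITY + 1):
        for group in combinations(sorted(source), size):
            new_left = left - frozenset(group) if boat_on_left else left | frozenset(group)
            new_right = frozenset(PEOPLE) - new_left
            if bank_ok(new_left) and bank_ok(new_right):
                yield (new_left, not boat_on_left)

# Breadth-first search from "everyone on the left bank".
start = (frozenset(PEOPLE), True)
seen, frontier, branching = {start}, [start], []
while frontier:
    nxt = []
    for s in frontier:
        succs = list(successors(s))
        branching.append(len(succs))
        for t in succs:
            if t not in seen:
                seen.add(t)
                nxt.append(t)
    frontier = nxt

print(f"reachable states: {len(seen)}, avg branching: {sum(branching) / len(branching):.1f}")
```

Even this tiny variant branches more per step than Tower of Hanoi (which never has more than a handful of legal moves), which is the structural difference I mean.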
blackhc.bsky.social
All in all, the Tower of Hanoi results cannot be given any credence because there seem to be many confounders and simpler explanations.

However, I don't think the other games have the same issues. If we look at River Crossing, it seems to hit high token counts very quickly:
blackhc.bsky.social
Thus, @scaling01 calls out their conclusion: it looks like they didn't pay as close attention to the models' reasoning traces as they claimed 😬

x.com/scaling01/s...
blackhc.bsky.social
This can also be observed in other simpler non-reasoning tasks:

x.com/Afinetheore...
blackhc.bsky.social
New LRMs are actually trained to be aware of their token limits. If they cannot think their way through, they find other solutions.

Apparently, the models say so at N=9 in ToH or refuse to write out the whole solution "by hand" and write code instead:

x.com/scaling01/s...
blackhc.bsky.social
Another "counterintuitive" behavior that is reported that the models start outputting fewer thinking tokens as the problems get more complex (e.g., more disks in Tower of Hanoi).

But again, this can sadly be explained away for ToH (but only for ToH!).

x.com/MFarajtabar...
blackhc.bsky.social
Because, wait, it gets worse:

For N >= 12 or 13 disks, the LRM could not even output all the moves of the solution even if it wanted to, because the models can only output 64k tokens. Half of the ToH plots are essentially meaningless anyway.

x.com/scaling01/s...
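
A rough back-of-the-envelope check (my own numbers; the ~10 output tokens per move is an assumption, not a figure from the paper) of where a 64k output budget runs out:

```python
# Rough estimate: a full Tower of Hanoi solution for N disks has 2^N - 1 moves.
# Assume ~10 output tokens per move (e.g. "move disk 3 from A to C").
TOKENS_PER_MOVE = 10   # assumption for illustration
TOKEN_LIMIT = 64_000   # the output limit mentioned above

for n in range(10, 16):
    tokens_needed = (2 ** n - 1) * TOKENS_PER_MOVE
    verdict = "fits" if tokens_needed <= TOKEN_LIMIT else "exceeds the limit"
    print(f"N={n}: ~{tokens_needed:,} tokens -> {verdict}")
```

With these assumptions the budget runs out around N = 12-13, matching the cutoff mentioned above.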
blackhc.bsky.social
This is very important: we do not need the LRM to be bad at reasoning for this to happen.

It just happens because the correct solution is so long and is not allowed to contain any typos.

(IMO the paper should really consider dropping the ToH environment from their results.)
blackhc.bsky.social
It assumes the LLM samples each correct token with a given (high) probability p and looks at the number of tokens needed to output all the moves. Then:

p("all correct") = p**(2^N - 1)

and p = 0.999 matches the ToH results in the paper above.
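
To make that concrete, here is a minimal sketch (my own illustration of the idea, not @scaling01's actual code) that evaluates p**(2^N - 1) with the per-token accuracy p = 0.999 quoted above:

```python
# Toy model: a ToH solution for N disks takes 2^N - 1 moves; assume each move
# is emitted correctly with independent probability p (one "token" per move).
def p_all_correct(n_disks: int, p: float = 0.999) -> float:
    return p ** (2 ** n_disks - 1)

for n in range(1, 13):
    print(f"N={n:2d}  moves={2 ** n - 1:5d}  P(all correct)={p_all_correct(n):.3f}")
```

Even with 99.9% per-token accuracy, the success probability collapses for larger N purely because the solution gets so long.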
blackhc.bsky.social
We use top-p or top-k sampling with a temperature of 0.7 after all. Thus, there is a chance the model gets unlucky and the wrong token is sampled, resulting in failure. (They should really try min-p 😀)

@scaling01 has a nice toy model that matches the paper results: