Pasquale Minervini
@neuralnoise.com
5.5K followers 4.8K following 170 posts
Researcher in ML/NLP at the University of Edinburgh (faculty at Informatics and EdinburghNLP), Co-Founder/CTO at www.miniml.ai, ELLIS (@ELLIS.eu) Scholar, Generative AI Lab (GAIL, https://gail.ed.ac.uk/) Fellow -- www.neuralnoise.com, he/they
Pinned
neuralnoise.com
Still ~8 days to apply for a postdoc position in multimodal foundation models at the University of Edinburgh! (@edinburgh-uni.bsky.social) -- Fully funded position until 2029 by the Generative AI Hub (@genaihub.bsky.social) to work with outstanding research teams! neuralnoise.com/2025/multimo...
Reposted by Pasquale Minervini
timkellogg.me
trend: non-NVIDIA training

DeepSeek V3.1 was trained on Huawei Ascend NPUs

this one is a South Korean lab training on AMD
timkellogg.me
Motif 2.6B — compact model with long context

unique: trained on AMD GPUs

focus is on long context & low hallucination rate — imo this is a growing genre of LLM that enables new search patterns

huggingface.co/Motif-Techno...
Motif-Technologies/Motif-2.6B · Hugging Face
We’re on a journey to advance and democratize artificial intelligence through open source and open science.
huggingface.co
neuralnoise.com
I really needed a Deep Research MCP server to use with Claude Code and other tools — here it is: github.com/pminervini/d...
Reposted by Pasquale Minervini
markriedl.bsky.social
@togelius.bsky.social has thoughts on Genie 3 and games togelius.blogspot.com/2025/08/geni...

Fairly close to my own, though I didn't get to preview the tech.

Walking around a generated image-to-image world is not the same as playing a game. There are no game objectives.
Genie 3 and the future of neural game engines
Google DeepMind just announced Genie 3 , their new promptable world model, which is another term for neural game engine. This is a big neura...
togelius.blogspot.com
Reposted by Pasquale Minervini
timkellogg.me
quick diagram of Bluesky’s architecture and why it’s nicer here
diagram from Anthropic paper with an icon & label that says “subtract evil vector”
Reposted by Pasquale Minervini
smcgrath.phd
Anthropic research identifies “inverse scaling in test-time compute,” where longer reasoning degrades AI performance. On certain tasks, models become more distracted by irrelevant data or overfit to spurious correlations.
#MLSky
Anthropic researchers discover the weird AI problem: Why thinking longer makes models dumber
Anthropic research reveals AI models perform worse with extended reasoning time, challenging industry assumptions about test-time compute scaling in enterprise deployments.
venturebeat.com
neuralnoise.com
Supermassive congrats to Giwon Hong (@giwonhong.bsky.social) for the amazing feat! 🙂
Reposted by Pasquale Minervini
eclipticevader7.bsky.social
Still not as bad as Microsoft Teams
thehistoryguy.bsky.social
Today in 1184 Henry VI of Germany was having a strategy meeting when the wooden second storey floor collapsed. Most of the courtiers fell through into the latrine cesspit below the ground floor, where more than 50 drowned in liquid excrement.
neuralnoise.com
The amazing folks at EdinburghNLP will be presenting a few papers at ACL 2025 (@aclmeeting.bsky.social); if you're in Vienna, touch base with them!
Reposted by Pasquale Minervini
emilevankrieken.com
Hm, hard disagree here. I really fail to see how this is misconduct akin to bribery, it's just a defense mechanism against bad reviewing practices. @neuralnoise.com
Reposted by Pasquale Minervini
soheeyang.bsky.social
🚨 New Paper 🚨
How effectively do reasoning models reevaluate their thoughts? We find that:
- Models excel at identifying unhelpful thoughts but struggle to recover from them
- Smaller models can be more robust
- Self-reevaluation ability is far from true meta-cognitive awareness
1/N 🧵
Reposted by Pasquale Minervini
timkellogg.me
Inverse scaling of reasoning models

a research collab demonstrated that there are certain types of tasks where all top reasoning models do WORSE the longer they think

things like getting distracted by irrelevant info, spurious correlations, etc.

www.arxiv.org/abs/2507.14417
Three panels at the top describe task types with example prompts:
1. Simple Counting Tasks with Distractors (Misleading Math & Python): prompts mention an apple and an orange, with added irrelevant or confusing information (e.g., a probabilistic riddle or Python code) before asking the straightforward question: "Calculate how many fruits you have."
2. Regression Tasks with Spurious Features (Grades Regression): given XML-style records about a student, the model must predict grades from features like sleep hours, social hours, and stress level. The challenge lies in identifying relevant vs. spurious attributes.
3. Deduction Tasks with Constraint Tracking (Zebra Puzzles): a complex logical reasoning puzzle with multiple interrelated clues. Example: "What position is the person who likes salmon at?" Constraints involve foods, names, and relationships like "to the left of."

The bottom row contains three line plots comparing model performance across tasks:
- Misleading Math (left plot): accuracy drops sharply for some models as reasoning tokens increase. Claude Sonnet 4 maintains high performance; o3 and DeepSeek R1 hold relatively stable accuracy; Qwen3 32B and QwQ 32B drop more.
- Grades Regression (middle plot): shows negative RMSE (higher is better). Claude models remain strong across token counts; o3 also performs well. Qwen3 and QwQ struggle, with DeepSeek R1 performing modestly.
- Zebra Puzzles (right plot): accuracy vs. average reasoning tokens. o3 and Claude Sonnet 4 maintain the highest performance; other models (e.g., DeepSeek R1, Qwen3 32B, QwQ 32B) degrade or plateau. Error bars reflect variability.

Each plot uses colored lines with markers to indicate different models.
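For a concrete sense of the setup, here is a minimal sketch of the kind of evaluation the figure describes: a trivial counting question padded with irrelevant detail, scored at several reasoning-token budgets. The prompt text, the `model.generate` interface, and the `max_reasoning_tokens` parameter are illustrative assumptions, not the paper's actual harness.

```python
# Illustrative sketch only: the prompt, model interface, and
# max_reasoning_tokens knob are assumptions, not the paper's code.

DISTRACTOR_PROMPT = (
    "You have an apple and an orange. "
    "(There is a small chance the apple is bruised, and here is an "
    "unrelated Python snippet: len(['apple', 'orange']).) "
    "Calculate how many fruits you have."
)
EXPECTED_ANSWER = "2"


def accuracy_vs_budget(model, examples, budgets):
    """Score a model on (prompt, answer) pairs at each reasoning-token budget.

    Returns a dict mapping budget -> accuracy, i.e. the kind of curve
    plotted in the left panel of the figure above.
    """
    curve = {}
    for budget in budgets:
        correct = 0
        for prompt, answer in examples:
            # `generate` stands in for whatever API caps the thinking budget.
            reply = model.generate(prompt, max_reasoning_tokens=budget)
            correct += answer in reply
        curve[budget] = correct / len(examples)
    return curve


# Example usage with a hypothetical `model` object:
# curve = accuracy_vs_budget(model, [(DISTRACTOR_PROMPT, EXPECTED_ANSWER)],
#                            budgets=[256, 1024, 4096])
```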
Reposted by Pasquale Minervini
nsaphra.bsky.social
Reasoning is about variable binding. It’s not about information retrieval. If a model cannot do variable binding, it is not good at grounded reasoning, and there’s evidence accruing that large scale can make LLMs worse at in-context grounded reasoning. 🧵
neuralnoise.com
Hi @ilsebyl.bsky.social welcome to bsky! 🚀🚀🚀
Reposted by Pasquale Minervini
stevenstrogatz.com
My "Math, Revealed" series is freely available to anyone -- no paywall! -- in the thread below.
neuralnoise.com
There are a few more for another prompt, and that's it
Reposted by Pasquale Minervini
cgregucci.bsky.social
Spotlight poster coming soon at #ICML2025
@icmlconf.bsky.social!
📌East Exhibition Hall A-B E-1806
🗓️Wed 16 Jul 4:30 p.m. PDT — 7 p.m. PDT
📜 arxiv.org/pdf/2410.12537

Let’s chat! I’m always up for conversations about knowledge graphs, reasoning, neuro-symbolic AI, and benchmarking.
Reposted by Pasquale Minervini
melaniemitchell.bsky.social
This essay by Nisheeth Vishnoi is a thoughtful meditation on the nature of science and a rebuttal to the notion that AI systems are going to replace human scientists anytime soon. Worth reading.

nisheethvishnoi.substack.com/p/what-count...
What Counts as Discovery?
Rethinking AI’s Place in Science
nisheethvishnoi.substack.com
neuralnoise.com
"in 2025 we will have flying cars" 😂😂😂
Reposted by Pasquale Minervini
gbalint.bsky.social
Preprint alert 🎉 Introducing the Agentic eXplanations via Interrogative Simulations (AXIS) algo.

AXIS integrates multi-agent simulators with LLMs by having the LLMs interrogate the simulator with counterfactual queries over multiple rounds for explaining agent behaviour.

arxiv.org/pdf/2505.17801
Flowchart of the AXIS algorithm with 5 parts. The top-left has the memory, the centre-left has the user query, the centre-bottom has the final explanation, the centre has the LLM, and the right has the multi-agent simulator. Screenshot of the arXiv paper
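Read alongside the flowchart, the loop described in the post might look roughly like the sketch below. This is a conceptual sketch based only on the post above; the method names (`propose_counterfactual`, `run`, `summarise_explanation`) are hypothetical placeholders, not the authors' implementation.

```python
# Conceptual sketch of the interrogation loop described above.
# All objects and method names are hypothetical placeholders.

def axis_explain(user_query, llm, simulator, n_rounds=3):
    """Explain agent behaviour by letting an LLM probe a multi-agent
    simulator with counterfactual queries over several rounds."""
    memory = []  # accumulated (counterfactual query, simulated outcome) pairs
    for _ in range(n_rounds):
        # The LLM proposes a counterfactual intervention given the user
        # query and everything it has learned so far.
        counterfactual = llm.propose_counterfactual(user_query, memory)
        # The multi-agent simulator is re-run under that intervention.
        outcome = simulator.run(counterfactual)
        memory.append((counterfactual, outcome))
    # Finally, the LLM condenses the interrogation into an explanation
    # of the observed agent behaviour.
    return llm.summarise_explanation(user_query, memory)
```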
Reposted by Pasquale Minervini
gbalint.bsky.social
'AI Safety for Everyone' is out now in @natmachintell.nature.com! Through an analysis of 383 papers, we find a rich landscape of methods that cover a much larger domain than mainstream notions of AI safety. Our takeaway: epistemic inclusivity is important, the knowledge is there, and we only need to use it.
Reposted by Pasquale Minervini
eleutherai.bsky.social
Can you train a performant language model using only openly licensed text?

We are thrilled to announce the Common Pile v0.1, an 8TB dataset of openly licensed and public domain text. We train 7B models for 1T and 2T tokens and match the performance of similar models like LLaMA 1 & 2.
neuralnoise.com
COLM (@colmweb.org) reviewers, please follow up on author responses if you need to! Most of the papers in my area chair batch didn't receive reviewer follow-ups, and it's dire