@shiven-s.bsky.social
shiven-s.bsky.social
6️⃣Investigating this further, we analyzed counterexample-creation success across attributes that are highly predictive of solution-generation correctness, but found no clear trends relating falsification to generation.
shiven-s.bsky.social
5️⃣Hypothetical: if future models can solve all competition problems, would this automatically imply the ability to falsify? While falsification could improve too, we show that knowing the correct solution alone is insufficient, highlighting the distinct nature of inverse benchmarks.
shiven-s.bsky.social
4️⃣We observed that models rarely executed search strategies. When explicitly prompted to search with controlled randomisation, LMs refuted fewer programs overall, but a distinct set of them. Learning to leverage programmatic search is a promising direction.
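A minimal sketch of the kind of randomised search loop meant here, assuming a competitive-programming setup where a trusted reference solution is available. The file names buggy.py and reference.py are hypothetical, and the input generator is a toy stand-in for a real problem's constraints:

```python
import random
import subprocess

def run(cmd, stdin_text):
    """Run a solution on the given input and return its stdout (may raise on timeout)."""
    out = subprocess.run(cmd, input=stdin_text, capture_output=True,
                         text=True, timeout=5)
    return out.stdout.strip()

def random_input(seed):
    """Sample a small test case under toy constraints: up to 8 integers in [1, 20]."""
    rng = random.Random(seed)
    n = rng.randint(1, 8)
    arr = [rng.randint(1, 20) for _ in range(n)]
    return f"{n}\n{' '.join(map(str, arr))}\n"

def find_counterexample(buggy_cmd, reference_cmd, trials=1000):
    """Stress test: return the first input on which the incorrect program
    disagrees with the reference solution, or None if none is found."""
    for seed in range(trials):
        case = random_input(seed)
        if run(buggy_cmd, case) != run(reference_cmd, case):
            return case
    return None

if __name__ == "__main__":
    # buggy.py / reference.py are placeholders for an incorrect submission
    # and a correct solution to the same problem.
    cex = find_counterexample(["python", "buggy.py"], ["python", "reference.py"])
    print("counterexample found:" if cex else "no counterexample found", cex or "")
```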
shiven-s.bsky.social
3️⃣REFUTE allows generation of arbitrary novel counterexamples, avoids data leakage, updates dynamically, and covers diverse difficulties and fundamental algorithmic topics with rich metadata annotations.
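One way to read "arbitrary novel counterexamples": any input counts, as long as it is valid for the problem and exposes the bug. A minimal acceptance check under that assumption; validator, incorrect_submission, and reference_solution are hypothetical callables standing in for an actual judging harness:

```python
def accepts_counterexample(candidate_input: str,
                           validator,
                           incorrect_submission,
                           reference_solution) -> bool:
    """A proposed counterexample is accepted only if it (1) satisfies the
    problem's input constraints and (2) makes the incorrect submission
    disagree with the correct answer on that input."""
    if not validator(candidate_input):
        return False
    return incorrect_submission(candidate_input) != reference_solution(candidate_input)
```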
shiven-s.bsky.social
2️⃣This shows a huge gap between solution generation and falsification capabilities. As the field hopes for the generator-verifier gap to drive self-improvement, it is even more important to have "inverse benchmarks" like REFUTE to test falsification across domains.
shiven-s.bsky.social
1️⃣Most LM benchmarks focus on generating solutions to problems. However, falsification is essential for reliable reasoning and scientific progress. o3-mini and R1 can solve algorithmic problems, but we find that they can create counterexamples for only 9% of incorrect solutions.
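For intuition (a toy illustration, not an item from the benchmark): take a submission for "can the array be split into two parts of equal sum?" that only checks whether the total is even. Falsifying it means producing a concrete input on which it answers incorrectly:

```python
def buggy_can_split(arr):
    # Incorrect submission: assumes an even total sum is sufficient.
    return sum(arr) % 2 == 0

def correct_can_split(arr):
    # Correct reference: subset-sum check for half the total.
    total = sum(arr)
    if total % 2:
        return False
    reachable = {0}
    for x in arr:
        reachable |= {s + x for s in reachable}
    return total // 2 in reachable

counterexample = [3, 1]  # total is even, yet no equal split exists
assert buggy_can_split(counterexample) != correct_can_split(counterexample)
```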
shiven-s.bsky.social
AI can generate correct-seeming hypotheses (and papers!). Brandolini's law states BS is harder to refute than generate. Can LMs falsify incorrect solutions? o3-mini (high) scores just 9% on our new benchmark REFUTE. Verification is not necessarily easier than generation 🧵