@shiven-s.bsky.social
shiven-s.bsky.social
6️⃣Investigating this further, we analyzed counterexample-creation success across attributes that are highly predictive of solution-generation correctness, but found no clear trends relating falsification to generation.
shiven-s.bsky.social
5️⃣Hypothetical: if future models can solve all competition problems, would this automatically imply the ability to falsify? While falsification could improve too, we show that knowing the correct solution alone is insufficient, highlighting the distinct nature of inverse benchmarks.
shiven-s.bsky.social
4️⃣We observed that models rarely executed search strategies. When explicitly prompted to search with controlled randomisation, LMs refuted fewer programs overall, but a distinct set of them. Learning to leverage programmatic search is a promising direction.
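A minimal sketch of the kind of randomised search loop meant here, assuming a competitive-programming setup where a trusted reference solution is available. The file names buggy.py and reference.py are hypothetical, and the input generator is a toy stand-in for a real problem's constraints:

```python
import random
import subprocess

def run(cmd, stdin_text):
    """Run a solution on the given input and return its stdout (may raise on timeout)."""
    out = subprocess.run(cmd, input=stdin_text, capture_output=True,
                         text=True, timeout=5)
    return out.stdout.strip()

def random_input(seed):
    """Sample a small test case under toy constraints: up to 8 integers in [1, 20]."""
    rng = random.Random(seed)
    n = rng.randint(1, 8)
    arr = [rng.randint(1, 20) for _ in range(n)]
    return f"{n}\n{' '.join(map(str, arr))}\n"

def find_counterexample(buggy_cmd, reference_cmd, trials=1000):
    """Stress test: return the first input on which the incorrect program
    disagrees with the reference solution, or None if none is found."""
    for seed in range(trials):
        case = random_input(seed)
        if run(buggy_cmd, case) != run(reference_cmd, case):
            return case
    return None

if __name__ == "__main__":
    # buggy.py / reference.py are placeholders for an incorrect submission
    # and a correct solution to the same problem.
    cex = find_counterexample(["python", "buggy.py"], ["python", "reference.py"])
    print("counterexample found:" if cex else "no counterexample found", cex or "")
```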
shiven-s.bsky.social
3️⃣REFUTE allows generation of arbitrary novel counterexamples, avoids data leakage, updates dynamically, and covers diverse difficulties and fundamental algorithmic topics with rich metadata annotations.
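One way to read "arbitrary novel counterexamples": any input counts, as long as it is valid for the problem and exposes the bug. A minimal acceptance check under that assumption; validator, incorrect_submission, and reference_solution are hypothetical callables standing in for an actual judging harness:

```python
def accepts_counterexample(candidate_input: str,
                           validator,
                           incorrect_submission,
                           reference_solution) -> bool:
    """A proposed counterexample is accepted only if it (1) satisfies the
    problem's input constraints and (2) makes the incorrect submission
    disagree with the correct answer on that input."""
    if not validator(candidate_input):
        return False
    return incorrect_submission(candidate_input) != reference_solution(candidate_input)
```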
shiven-s.bsky.social
2️⃣This shows a huge gap between solution generation and falsification capabilities. As the field hopes for the generator-verifier gap to drive self-improvement, it is even more important to have "inverse benchmarks" like REFUTE to test falsification across domains.
shiven-s.bsky.social
1️⃣Most LM benchmarks focus on generating solutions to problems. However, falsification is essential for reliable reasoning and scientific progress. o3-mini and R1 can solve algorithmic problems, but we find that they can create counterexamples for only 9% of incorrect solutions.
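For intuition (a toy illustration, not an item from the benchmark): take a submission for "can the array be split into two parts of equal sum?" that only checks whether the total is even. Falsifying it means producing a concrete input on which it answers incorrectly:

```python
def buggy_can_split(arr):
    # Incorrect submission: assumes an even total sum is sufficient.
    return sum(arr) % 2 == 0

def correct_can_split(arr):
    # Correct reference: subset-sum check for half the total.
    total = sum(arr)
    if total % 2:
        return False
    reachable = {0}
    for x in arr:
        reachable |= {s + x for s in reachable}
    return total // 2 in reachable

counterexample = [3, 1]  # total is even, yet no equal split exists
assert buggy_can_split(counterexample) != correct_can_split(counterexample)
```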
shiven-s.bsky.social
AI can generate correct-seeming hypotheses (and papers!). Brandolini's law states BS is harder to refute than generate. Can LMs falsify incorrect solutions? o3-mini (high) scores just 9% on our new benchmark REFUTE. Verification is not necessarily easier than generation 🧵