AI + FM papers
@ai-fm-papers.bsky.social
87 followers 0 following 22 posts
A feed of interesting AI / math / formal methods papers. Posts by @m-dodds.bsky.social
ai-fm-papers.bsky.social
Automated Proof Generation for Rust Code via Self-Evolution

“we introduce SAFE, a novel framework that overcomes the lack of human-written proof […] [SAFE achieves] a 70.50% accuracy rate in a benchmark crafted by human experts, [vs] GPT-4o's performance of 24.46%”

arxiv.org/abs/2410.15756
ai-fm-papers.bsky.social
Can LLMs Enable Verification in Mainstream Programming?

“… we explore the ability of LLMs to produce verified code in three verification languages (Dafny, Nagini, and Verus) […] we use manually curated datasets derived from the state-of-the-art Python benchmark, HumanEval”

arxiv.org/abs/2503.14183
ai-fm-papers.bsky.social
Formal Verification is Overrated

“Zac Hatfield-Dodds [argues] that relying solely on verification methods may not provide real AI safety”

youtu.be/bs5snugP1VA?...
ai-fm-papers.bsky.social
Proving the Coding Interview: A Benchmark for Formally Verified Code Generation

“We introduce the Formally Verified Automated Programming Progress Standards, or FVAPPS, a benchmark of 4715 samples […] including 1083 curated and quality controlled samples”

arxiv.org/abs/2502.05714
Reposted by AI + FM papers
patrickshafto.bsky.social
Super excited: my new @darpa program on AI for pure mathematics!

Exponentiating Mathematics (expMath) aims to accelerate the rate of progress in pure math through the development of an AI collaborator and new professional-level math benchmarks.

sam.gov/opp/4def3c13...
ai-fm-papers.bsky.social
LLM-Assisted Static Analysis for Detecting Security Vulnerabilities

"[We combine] LLMs with static analysis to perform whole-repository reasoning for security vulnerability detection. [...] IRIS leverages LLMs to infer taint specifications and perform contextual analysis"

arxiv.org/abs/2405.17238
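To make the quoted idea concrete, here is a minimal sketch (in Python, with invented API names) of what a "taint specification" of the kind IRIS asks an LLM to infer might look like, and how a static analysis could use it: sources introduce untrusted data, sanitizers neutralize it, and sinks must not receive it tainted.

```python
# Hypothetical taint specification; every function name here is invented
# for illustration and is not from the IRIS paper.
SPEC = {
    "sources":    {"request.get_param"},  # introduces untrusted data
    "sanitizers": {"html.escape"},        # neutralizes untrusted data
    "sinks":      {"db.execute"},         # must not receive tainted data
}

def find_flows(trace, spec=SPEC):
    """Flag a vulnerability when tainted data reaches a sink unsanitized.

    `trace` is a simplified dataflow path: the functions a value passes
    through, in order."""
    flows, tainted = [], False
    for call in trace:
        if call in spec["sources"]:
            tainted = True
        elif call in spec["sanitizers"]:
            tainted = False
        elif call in spec["sinks"] and tainted:
            flows.append(call)
    return flows
```

A real whole-repository analysis propagates taint across an interprocedural dataflow graph; the LLM's role in IRIS is supplying the specification entries that human experts would otherwise have to label.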
ai-fm-papers.bsky.social
AlphaVerus: Bootstrapping Formally Verified Code Generation through Self-Improving Translation and Treefinement

"AlphaVerus [is] a self-improving framework that bootstraps formally verified code generation by iteratively translating programs from a higher-resource language […]"

arxiv.org/abs/2412.06176
ai-fm-papers.bsky.social
VerifAI: AI Verification in the Wild @ ICLR 2025

"This workshop explores the intersection of scale-driven generative artificial intelligence (AI) and the correctness-focused principles of verification."

verifai-workshop.github.io
ai-fm-papers.bsky.social
Laurel: Generating Dafny Assertions Using Large Language Models

"...we propose Laurel, a tool that uses LLMs to automatically generate helper assertions for Dafny [...] Laurel is able to generate over 50% of the required helper assertions given only a few attempts"

arxiv.org/abs/2405.16792
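The generate-and-check loop behind tools like Laurel can be sketched in a few lines (a simplification with invented names; the real tool also localizes the failure and ranks candidates): splice each LLM-proposed helper assertion into the failing method and keep the first one the verifier accepts.

```python
def repair_with_assertions(method_lines, insert_at, candidates, verifies):
    """Try LLM-proposed helper assertions until the verifier accepts.

    `verifies` abstracts the call to the Dafny verifier: it takes the
    patched method (a list of source lines) and returns True on success.
    """
    for assertion in candidates:
        patched = method_lines[:insert_at] + [assertion] + method_lines[insert_at:]
        if verifies(patched):
            return patched
    return None  # no candidate helped within the attempt budget
```

The "over 50% given only a few attempts" result means this loop usually terminates after sampling a handful of candidates per failing assertion site.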
ai-fm-papers.bsky.social
Draft, Sketch, and Prove: Guiding Formal Theorem Provers with Informal Proofs - “we introduce […] a method that maps informal proofs to formal proof sketches, and uses the sketches to guide an automated prover by directing its search to easier sub-problems”
arxiv.org/abs/2210.12283
ai-fm-papers.bsky.social
Sounds like essentially the LLM has a set of operations it can perform, and the policy constrains these operations to the ones that are reasonable, e.g. “user can’t buy a ticket for less than <minimum price>”
ai-fm-papers.bsky.social
Looks like the process is (1) free text input of developer’s policy docs, (2) translation (by an LLM?) into a candidate set of SAT-checkable policy constraints, (3) audit of policies by developers, (4) auto-enforcement of policies on incoming operations generated by LLM from user interaction
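Steps (2) and (4) of that process can be sketched as follows (a toy Python illustration; the policy name, operation fields, and the concrete minimum price are all invented for the example): each policy is a machine-checkable constraint over an operation's fields, and enforcement checks every LLM-generated operation before executing it.

```python
POLICIES = {
    # Candidate constraint an LLM might extract from the policy text
    # "user can't buy a ticket for less than <minimum price>";
    # the threshold 10 is an arbitrary stand-in.
    "min_ticket_price": lambda op: op.get("action") != "buy_ticket"
                                   or op.get("price", 0) >= 10,
}

def enforce(operation, policies=POLICIES):
    """Step (4): reject any LLM-generated operation that violates a policy.

    Returns (allowed, list of violated policy names)."""
    violations = [name for name, check in policies.items()
                  if not check(operation)]
    return (len(violations) == 0, violations)
```

Step (3), the developer audit, is what makes this trustworthy: the LLM's translation from free text to constraints is reviewed by humans before anything is auto-enforced.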
Reposted by AI + FM papers
swarat.bsky.social
Since all my Twitter content is now gone, I will start reposting some of it here. Here are the slides for my talk on the coming wave of ML-accelerated formal methods, given at the Isaac Newton Institute last month. May interest some of you.
drive.google.com/file/d/1ybQx...
ai-fm-papers.bsky.social
They target variants of known bugs: “By providing a starting point […] we remove a lot of ambiguity from vulnerability research, and start from a concrete, well-founded theory: ‘This was a previous bug; there is probably another similar one somewhere’”
ai-fm-papers.bsky.social
"Towards Neural Synthesis for SMT-Assisted Proof-Oriented Programming" - LLM-based proof generation for F*, and a 600K LoC dataset of F* programs and proofs, suitable for ML applications. Impressive results synthesizing real-world proofs about programs!
arxiv.org/abs/2405.01787
ai-fm-papers.bsky.social
Grammar-Aligned Decoding - "[We propose] a decoding algorithm that guarantees the output to be grammatical while provably producing outputs that match the conditional probability of the LLM's distribution conditioned on the given grammar constraint" arxiv.org/abs/2405.21047 @lorisdanto.bsky.social
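A toy illustration of the problem this guarantee addresses (the two-token "grammar", the alphabet, and all probabilities are invented for the example): naive constrained decoding, which masks disallowed tokens at each step and renormalizes, can badly skew the distribution relative to the LM's true distribution conditioned on grammaticality — the target GAD provably matches.

```python
from itertools import product

# Toy 2-token LM over {a, b}: tokens are i.i.d. with P(a)=0.9, P(b)=0.1.
P = {"a": 0.9, "b": 0.1}
def lm_prob(s):
    return P[s[0]] * P[s[1]]

grammatical = {"ab", "ba", "bb"}  # toy "grammar": rejects only "aa"
strings = ["".join(t) for t in product("ab", repeat=2)]

# What GAD targets: the LM distribution conditioned on grammaticality.
Z = sum(lm_prob(s) for s in strings if s in grammatical)
conditional = {s: lm_prob(s) / Z for s in strings if s in grammatical}

# Naive constrained decoding: at each step, mask tokens with no
# grammatical completion, then renormalize the remaining probabilities.
def masked_prob(s):
    prob, prefix = 1.0, ""
    for ch in s:
        allowed = [t for t in "ab"
                   if any(g.startswith(prefix + t) for g in grammatical)]
        if ch not in allowed:
            return 0.0
        prob *= P[ch] / sum(P[t] for t in allowed)
        prefix += ch
    return prob
```

Here the naive decoder assigns "ab" probability 0.9 (after "a", only "b" survives the mask), while the true conditional probability is 0.09/0.19 ≈ 0.47 — local masking silently funnels mass through high-probability prefixes.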