Kush Jain
@kjain14.bsky.social
42 followers 48 following 7 posts
SE PhD Student at Carnegie Mellon University interested in NLP for software engineering, program analysis and software testing. Former intern at Facebook AI Research.
kjain14.bsky.social
(5/6) Sampling does not solve this problem either. For test completion, pass@k tends to plateau at 90%, and for test suite generation, coverage values remain low even with extensive sampling!
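For context, pass@k is commonly computed with the unbiased estimator popularized by the HumanEval paper: given n samples of which c pass, it estimates the probability that at least one of k drawn samples passes. A minimal sketch (the helper name is mine; this is the standard estimator, not necessarily TestGenEval's exact implementation):

```python
from math import comb

def pass_at_k(n: int, c: int, k: int) -> float:
    """Unbiased pass@k estimator: probability that at least one
    of k samples drawn (without replacement) from n total samples,
    of which c are correct, passes."""
    if n - c < k:
        # Fewer than k incorrect samples: every draw of k must
        # include at least one correct sample.
        return 1.0
    return 1.0 - comb(n - c, k) / comb(n, k)
```

For example, with 2 samples of which 1 passes, pass@1 is 0.5; a plateau near 90% means that even as k grows, roughly 10% of problems have no passing sample at all.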
kjain14.bsky.social
(4/6) We analyze errors from top models, finding that even current state-of-the-art models struggle with hallucination and reasoning about execution.
kjain14.bsky.social
(3/6) Models also struggle with test completion, with top models only achieving 63.5% pass@5 for our first test completion setting (coverage improvement is also low at 26.9%).
kjain14.bsky.social
(2/6) Current state-of-the-art models struggle with test suite generation. Even the best model, GPT-4o, only gets 35.2% coverage on TestGenEval.
kjain14.bsky.social
(1/6) TestGenEval is sourced from large-scale Python repositories and targets real-world use cases: test authoring simulates a developer writing a test suite from scratch, while test completion mimics a developer aiming to improve the coverage of an existing test suite.
kjain14.bsky.social
Thrilled to announce our new work TestGenEval, a benchmark that measures unit test generation and test completion capabilities. This work was done in collaboration with the FAIR CodeGen team.

Preprint: arxiv.org/abs/2410.00752
Leaderboard: testgeneval.github.io/leaderboard....
Reposted by Kush Jain
catarinavgamboa.bsky.social
Hi, Bluesky! 👋
I’m Catarina, a dual PhD student in 🖥️ Software Engineering with the CMU Portugal program ( @carnegiemellon.bsky.social and U. Lisbon).

Imagine a world with reliable software and user-friendly verification tools. Let’s build it together! 🚀

#PhDlife #SE #PL #HCI #CMU-Portugal
Reposted by Kush Jain
clegoues.bsky.social
And now that we’re all here, some work!🚨 Are Large Language Models Memorizing Bug Benchmarks? 🚨
There’s growing concern that LLMs for SE are prone to data leakage, but no one has quantified it... until now. 🕵️‍♂️ 1/
arxiv.org