Arkil Patel
@arkil.bsky.social
270 followers 400 following 9 posts
PhD Student at Mila and McGill | Research in ML and NLP | Past: AI2, MSFTResearch arkilpatel.github.io
Pinned
arkil.bsky.social
Thoughtology paper is out!! 🔥🐳

We study the reasoning chains of DeepSeek-R1 across a variety of tasks and find several surprising and interesting phenomena!

Incredible effort by the entire team!

🌐: mcgill-nlp.github.io/thoughtology/
Reposted by Arkil Patel
grvkamath.bsky.social
Our new paper in #PNAS (bit.ly/4fcWfma) presents a surprising finding: when words change meaning, older speakers rapidly adopt the new usage; inter-generational differences are often minor.

w/ Michelle Yang, @sivareddyg.bsky.social, @msonderegger.bsky.social, and @dallascard.bsky.social 👇 (1/12)
Reposted by Arkil Patel
xhluca.bsky.social
AgentRewardBench: Evaluating Automatic Evaluations of Web Agent Trajectories

We are releasing the first benchmark for measuring how well automatic evaluators, such as LLM judges, assess web agent trajectories.
Reposted by Arkil Patel
parishadbehnam.bsky.social
Instruction-following retrievers can efficiently and accurately search for harmful and sensitive information on the internet! 🌐💣

Retrievers need to be aligned too! 🚨🚨🚨

Work done with the wonderful Nick and @sivareddyg.bsky.social

🔗 mcgill-nlp.github.io/malicious-ir/
Thread: 🧵👇
Exploiting Instruction-Following Retrievers for Malicious Information Retrieval
Parishad BehnamGhader, Nicholas Meade, Siva Reddy
mcgill-nlp.github.io
arkil.bsky.social
Llamas browsing the web look cute, but they are capable of causing a lot of harm!

Check out our new Web Agents ∩ Safety benchmark: SafeArena!

Paper: arxiv.org/abs/2503.04957
arkil.bsky.social
𝐍𝐨𝐭𝐞: Our work is a preliminary exploration of automatically generating high-quality, challenging benchmarks for LLMs. We discuss concrete limitations and the broad scope for future work in the paper.
arkil.bsky.social
Results:

- SOTA LLMs achieve 40-60% performance
- 𝐂𝐇𝐀𝐒𝐄 distinguishes well between models (as opposed to their similar performance on standard benchmarks like GSM8k)
- While today's LLMs have 128k-1M token context windows, 𝐂𝐇𝐀𝐒𝐄 shows they struggle to reason even at ~50k context size
arkil.bsky.social
𝐂𝐇𝐀𝐒𝐄 uses 2 simple ideas:

1. Bottom-up creation of complex contexts by “hiding” components of the reasoning process
2. Decomposing the generation pipeline into simpler, "soft-verifiable" sub-tasks
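The two ideas above can be sketched in code. This is a hypothetical illustration (not the authors' implementation): each sub-task pairs a generator with a soft verifier and is retried until its output passes, and a separate helper buries the answer-bearing component among distractors to build a complex context bottom-up. All names (`SubTask`, `run_pipeline`, `hide_in_context`) are invented for this sketch.

```python
# Illustrative sketch of a CHASE-style generation pipeline:
# (1) decompose generation into small sub-tasks, each checked by a
#     "soft" verifier, and (2) hide the answer-bearing component
#     among distractors to construct a complex context bottom-up.
import random
from dataclasses import dataclass
from typing import Callable

@dataclass
class SubTask:
    name: str
    generate: Callable  # (rng, components) -> candidate component
    verify: Callable    # (candidate, components) -> bool (soft check)

def run_pipeline(subtasks, max_retries=5, rng=None):
    """Run each sub-task in order, retrying until its verifier passes."""
    rng = rng or random.Random(0)
    components = {}
    for task in subtasks:
        for _ in range(max_retries):
            candidate = task.generate(rng, components)
            if task.verify(candidate, components):
                components[task.name] = candidate
                break
        else:
            raise RuntimeError(f"sub-task {task.name!r} failed verification")
    return components

def hide_in_context(fact, distractors, rng):
    """Bottom-up context construction: shuffle the answer-bearing fact
    in among distractor sentences, so solving the problem requires
    recovering the hidden reasoning component."""
    docs = list(distractors) + [fact]
    rng.shuffle(docs)
    return " ".join(docs)

# Example: generate a number component, verify it softly, then hide it.
rng = random.Random(0)
task = SubTask(
    name="key_number",
    generate=lambda r, c: r.randint(0, 99),
    verify=lambda x, c: x >= 0,  # trivially soft check for the demo
)
components = run_pipeline([task], rng=rng)
fact = f"The key number is {components['key_number']}."
context = hide_in_context(fact, [f"Filler sentence {i}." for i in range(5)], rng)
```

In a real LLM-based pipeline the `generate` and `verify` callables would be model calls rather than random draws, but the control flow (generate, softly verify, retry, then compose into a long context) is the part the post describes.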
arkil.bsky.social
𝐂𝐇𝐀𝐒𝐄 automatically generates challenging evaluation problems across 3 domains:

1. 𝐂𝐇𝐀𝐒𝐄-𝐐𝐀: Long-context question answering
2. 𝐂𝐇𝐀𝐒𝐄-𝐂𝐨𝐝𝐞: Repo-level code generation
3. 𝐂𝐇𝐀𝐒𝐄-𝐌𝐚𝐭𝐡: Math reasoning
arkil.bsky.social
Why synthetic data for evaluation?

- Creating “hard” problems using humans is expensive (and may hit a limit soon!)
- Impractical for humans to annotate long-context data
- Other benefits: scalable, renewable, mitigate contamination concerns
arkil.bsky.social
Presenting ✨ 𝐂𝐇𝐀𝐒𝐄: 𝐆𝐞𝐧𝐞𝐫𝐚𝐭𝐢𝐧𝐠 𝐜𝐡𝐚𝐥𝐥𝐞𝐧𝐠𝐢𝐧𝐠 𝐬𝐲𝐧𝐭𝐡𝐞𝐭𝐢𝐜 𝐝𝐚𝐭𝐚 𝐟𝐨𝐫 𝐞𝐯𝐚𝐥𝐮𝐚𝐭𝐢𝐨𝐧 ✨

Work w/ fantastic advisors Dima Bahdanau and @sivareddyg.bsky.social

Thread 🧵: