Javier Rando
@javirandor.com
260 followers 97 following 45 posts
Red-Teaming LLMs / PhD student at ETH Zurich / Prev. research intern at Meta / People call me Javi / Vegan 🌱 Website: javirando.com
Posts Media Videos Starter Packs
Pinned
javirandor.com
Anyone may be able to compromise LLMs with malicious content posted online. With just a small amount of data, adversaries can backdoor chatbots to become unusable for RAG, or bias their outputs towards specific beliefs. Check our latest work! 👇🧵
javirandor.com
Thank you so much for the invite!
javirandor.com
We propose that adversarial ML research should clearly differentiate between two problems:

1️⃣ Real-world vulnerabilities. Attacks and defenses on ill-defined problems are valuable when harm is immediate.

2️⃣ Scientific understanding. We should study well-defined problems.
javirandor.com
We are aware that this is not a simple problem and some changes may actually have been for the better! For instance, we now study real-world challenges instead of academic “toy” problems like ℓₚ robustness. We tried to carefully discuss these alternative views in our work.
javirandor.com
We identify 3 core challenges that make adversarial ML for LLMs harder to define, harder to solve, and harder to evaluate. We then illustrate these with specific case studies: jailbreaks, un-finetunable models, poisoning, prompt injections, membership inference, and unlearning.
javirandor.com
Perhaps most telling, unlike for image classifiers, manual attacks outperform automated methods at finding worst-case inputs for LLMs! This challenges our ability to automatically evaluate the worst-case robustness of protections and benchmark progress.
javirandor.com
Now, the field has shifted to LLMs, where we consider subjective notions of safety, allow for unbounded threat models, and evaluate closed-source systems that constantly change. These changes are hindering our ability to produce meaningful scientific progress.
javirandor.com
Back in the 🐼 days, we dealt with well-defined tasks: misclassify an image by slightly perturbing pixels within an ℓₚ-ball. Also, attack success and defense utility could be easily measured with classification accuracy. Simple objectives that we could rigorously benchmark.
javirandor.com
Adversarial ML research is evolving, but not necessarily for the better. In our new paper, we argue that LLMs have made problems harder to solve, and even tougher to evaluate. Here’s why another decade of work might still leave us without meaningful progress. 👇
Reposted by Javier Rando
dpaleka.bsky.social
Recent LLM forecasters are getting better at predicting the future. But there's a challenge: How can we evaluate and compare AI forecasters without waiting years to see which predictions were right? (1/11)
javirandor.com
Tomorrow @jakublucki.bsky.social will be presenting the BEST TECHNICAL PAPER at the SoLaR workshop at NeurIPS. Come check our poster and his oral presentation!
jakublucki.bsky.social
Our paper on how unlearning fails to remove hazardous knowledge from LLM weights received 🏆 Best Paper 🏆 award at SoLaR @ NeurIPS!

Join my oral presentation on Saturday at 4:30 pm to learn more.
Reposted by Javier Rando
nkristina.bsky.social
I am at NeurIPS 🇨🇦, please reach out if you want to grab a coffee!
Reposted by Javier Rando
aemai.bsky.social
I am in beautiful Vancouver for #NeurIPS2024 with those amazing folks!
Say hi if you want to chat about ML privacy and security
(or speciality ☕)
javirandor.com
SPY Lab is in Vancouver for NeurIPS! Come say hi if you see us around 🕵️
javirandor.com
SPY Lab is in Vancouver for NeurIPS! Come say hi if you see us around 🕵️
javirandor.com
A new competition on LLM-agents prompt injection is out! Send malicious emails and get agents to perform unauthorised actions.

The competition is hosted at SaTML 2025 and has a pool of $10k in prizes! What are you waiting for?
xefffffff.bsky.social
📢Have experience jailbreaking LLMs?
Want to learn how an indirect / cross prompt injection attack works? Want to try something different to an advent of code?
Then, I have a challenge for you!

The LLMail-Inject competition (llmailinject.azurewebsites.net) starts at 11am UTC (that's in 5min!)
javirandor.com
I will be at #NeurIPS2024 in Vancouver. I am excited to meet people working on AI Safety and Security. Drop a DM if you want to meet.

I will be presenting two (spotlight!) works. Come say hi to our posters.
Reposted by Javier Rando
jakublucki.bsky.social
🚨Unlearned hazardous knowledge can be retrieved from LLMs 🚨

Our results show that current unlearning methods for AI safety only obfuscate dangerous knowledge, just like standard safety training.

Here's what we found👇