Javier Rando
@javirandor.com
260 followers
97 following
45 posts
Red-Teaming LLMs / PhD student at ETH Zurich / Prev. research intern at Meta / People call me Javi / Vegan 🌱
Website: javirando.com
Pinned
Javier Rando
@javirandor.com
· Nov 25
Javier Rando
@javirandor.com
· Feb 18
Javier Rando
@javirandor.com
· Feb 10
Adversarial ML Problems Are Getting Harder to Solve and to Evaluate
In the past decade, considerable research effort has been devoted to securing machine learning (ML) models that operate in adversarial settings. Yet, progress has been slow even for simple "toy" probl...
arxiv.org
Javier Rando
@javirandor.com
· Jan 20
Universal Jailbreak Backdoors from Poisoned Human Feedback
Reinforcement Learning from Human Feedback (RLHF) is used to align large language models to produce helpful and harmless responses. Yet, prior work showed these models can be jailbroken by finding adv...
arxiv.org
Reposted by Javier Rando
Kristina Nikolić
@nkristina.bsky.social
· Dec 12
Reposted by Javier Rando
Javier Rando
@javirandor.com
· Dec 9
An Adversarial Perspective on Machine Unlearning for AI Safety
Large language models are finetuned to refuse questions about hazardous knowledge, but these protections can often be bypassed. Unlearning methods aim at completely removing hazardous capabilities fro...
arxiv.org
Javier Rando
@javirandor.com
· Dec 9
Dataset and Lessons Learned from the 2024 SaTML LLM Capture-the-Flag Competition
Large language model systems face important security risks from maliciously crafted messages that aim to overwrite the system's original instructions or leak private data. To study this problem, we or...
arxiv.org
Reposted by Javier Rando