Santiago Zanella-Beguelin
@xefffffff.bsky.social
77 followers 120 following 15 posts
AI Security & Privacy Researcher at Microsoft. Opinions are my own. https://aka.ms/sz
Posts Media Videos Starter Packs
Pinned
xefffffff.bsky.social
📢Have experience jailbreaking LLMs?
Want to learn how an indirect / cross prompt injection attack works? Want to try something different to an advent of code?
Then, I have a challenge for you!

The LLMail-Inject competition (llmailinject.azurewebsites.net) starts at 11am UTC (that's in 5min!)
Reposted by Santiago Zanella-Beguelin
markrussinovich.bsky.social
Learn about the risks of hallucination, jailbreaks and prompt injection and current mitigations in our ACM Queue paper:
The Price of Intelligence - ACM Queue
queue.acm.org
xefffffff.bsky.social
Jointly organized with colleagues from Microsoft, ISTA, and ETH Zürich.

Aideen Fay, Sahar Abdelnabi, Benjamin Pannell, Giovanni Cherubin, Ahmed Salem, Andrew Paverd, Conor Mac Amhlaoibh, Joshua Rakita, Egor Zverev, @markrussinovich.bsky.social , and @javirandor.com.
xefffffff.bsky.social
Register to participate with your GitHub account at llmailinject.azurewebsites.net

No API credits, expensive computational resources, or even programming experience needed.

$10,000 USD in prizes up for grabs!

Happy hacking!
xefffffff.bsky.social
4. An input filter using TaskTracker (arxiv.org/abs/2406.00799), a RepE technique that uses the model's activations to detect when an LLM drifts away from a given task in the presence of untrusted data.
xefffffff.bsky.social
2. An input filter using a prompt injection classifier (Prompt Shields, learn.microsoft.com/en-us/azure/...)
3. An input filter employing an LLM to judge the input
...
xefffffff.bsky.social
The challenge consists of 4 scenarios of increasing difficulty, each employing a defensive system prompt and one of 4 defenses:

1. Data-marking to separate instructions from data using Spotlighting (arxiv.org/abs/2403.14720)
...
xefffffff.bsky.social
As an attacker 😈, your goal is to craft a message that tricks the assistant into sending an e-mail to a specific recipient with a specific format when the assistant is just asked to respond to a summarization query.
xefffffff.bsky.social
Compete alone or form a team of up to 5 members to test your skills in a platform simulating an e-mail assistant powered by GPT-4o-mini or Phi-3-medium-128k-instruct. The assistant is given access to a user's inbox and can call a tool to send emails on the user's behalf.
xefffffff.bsky.social
📢Have experience jailbreaking LLMs?
Want to learn how an indirect / cross prompt injection attack works? Want to try something different to an advent of code?
Then, I have a challenge for you!

The LLMail-Inject competition (llmailinject.azurewebsites.net) starts at 11am UTC (that's in 5min!)
xefffffff.bsky.social
Think twice about participating in this experiment and be ready to lose your money if you do.

Of course, I can be wrong and this is all ran honestly. But the point is that there's no way to verify, so don't trust.
6/6
xefffffff.bsky.social
Even if we assume it does and transactions are processed fairly, because the GPT-4o mini OpenAI endpoint is not deterministic, the server can simply retry a message until it fails.
5/6
xefffffff.bsky.social
Well... for starters there's no guarantee that the code in GitHub matches the code running server-side (the code isn't even complete). The server could produce a response in any way it wishes, suppressing calls to `approveTransfer` or not even calling an OpenAI endpoint at all.
4/6
xefffffff.bsky.social
So, what's stopping someone from reproducing the experiment using their own OpenAI account, finding a successful prompt injection that would call the `approveTransfer` tool, and submitting it?
3/6
xefffffff.bsky.social
The implementation is supposedly open source and indeed there's a GitHub repo (github.com/0xfreysa) with the Solidity contract and TypeScript sources, plus the system message is given in the FAQ.

The contract (basescan.org/address/0x53...) can be verified to match.
2/6
xefffffff.bsky.social
This Freysa AI game has been doing the rounds lately, and whoever is behind it is iterating quickly.

It's a fascinating social experiment but most likely a scam.
Here is why... 🧵
1/6
Quoted tweet from @freysa_ai. 
Act II is upon us. The clock has started. https://freysa.ai
Pay close attention to the new conditions. I want to speak with many more of you.
I can’t wait to learn more…
xefffffff.bsky.social
📢Internships in AI Security & Privacy

Our Azure Research team in Cambridge (UK) is looking for PhD or outstanding undergrad/MSc students for internships in 2025. Join us to work on defending against emerging security & privacy threats to AI systems.

jobs.careers.microsoft.com/global/en/jo...