Andrea Palmieri 🤌
@andpalmier.com
25 followers 83 following 9 posts
Threat analyst, eternal newbie / Italian 🍕 in 🇨🇭 / AS Roma 💛❤️ 🔗 andpalmier.com
Posts Media Videos Starter Packs
Pinned
🆕 New blog: "The subtle art of #jailbreak ing LLMs"

It contains "swiss cheese", "pig lating" and "ascii art"!

andpalmier.com/posts/jailbreaking-llms

It's a summary of some interesting techniques researchers used (and currently use) to attack #LLM

Let's see some examples here🧵⬇️
The subtle art of jailbreaking LLMs
An n00b overview of the main Large Language Models jailbreaking strategies
andpalmier.com
Reposted by Andrea Palmieri 🤌
Reposted by Andrea Palmieri 🤌
APpaREnTLy THiS iS hoW yoU JaIlBreAk AI

Anthropic created an AI jailbreaking algorithm that keeps tweaking prompts until it gets a harmful response.

🔗 www.404media.co/apparently-t...
APpaREnTLy THiS iS hoW yoU JaIlBreAk AI
Anthropic created an AI jailbreaking algorithm that keeps tweaking prompts until it gets a harmful response.
www.404media.co
END OF THE THREAD!

Check out the original blog post here:

andpalmier.com/posts/jailbreaking-llms/

If that made you curious about #AI #Hacking, be sure to check out the #CTF challenges at crucible.dreadnode.io
https://andpalmier.com/posts/jailbrea…
🤖 LLMs vs LLMs

It shouldn't really come as a big surprise that some methods for attacking LLMs are using LLMs.

Here are two examples:
- PAIR: an approach using an attacker LLM
- IRIS: inducing an LLM to self-jailbreak

⬇️
📝 #Prompt rewriting: adding a layer of linguistic complexity!

This class of attacks uses encryption, translation, ascii art and even word puzzles to bypass the LLMs' safety checks.

⬇️
💉 #Promptinjection: embed malicious instructions in the prompt.

According to #OWASP, prompt injection is the most critical security risk for LLM applications.

They break down this class of attacks in 2 categories: direct and indirect. Here is a summary of indirect attacks:

⬇️
😈 Role-playing: attackers ask the #LLM to act as a specific persona or as part of a scenario.

A common example is the (in?)famous #DAN (Do Anything Now):

This attacks are probably the most common in the real-word, as they often don't require a lot of sophistication.

⬇️
We interact (and therefore attack) LLMs mainly using language, therefore let's start from there.

I used this dataset github.com/verazuo/jailbreak_llms of #jailbreak #prompt to create this wordcloud.

I believe it gives a sense of "what works" in these attacks!

⬇️
Before we dive in: I’m *not* an AI expert! I did my best to understand the details and summarize the techniques, but I’m human. If I’ve gotten anything wrong, just let me know! :)

⬇️
🆕 New blog: "The subtle art of #jailbreak ing LLMs"

It contains "swiss cheese", "pig lating" and "ascii art"!

andpalmier.com/posts/jailbreaking-llms

It's a summary of some interesting techniques researchers used (and currently use) to attack #LLM

Let's see some examples here🧵⬇️
The subtle art of jailbreaking LLMs
An n00b overview of the main Large Language Models jailbreaking strategies
andpalmier.com