Lightnews — Scholar-powered news

Andrea Palmieri 🤌

@andpalmier.com

25 followers 83 following 9 posts

Threat analyst, eternal newbie / Italian 🍕 in 🇨🇭 / AS Roma 💛❤️ 🔗 andpalmier.com

Posts Media Videos Starter Packs

Pinned

Andrea Palmieri 🤌 @andpalmier.com · Nov 25

🆕 New blog: "The subtle art of #jailbreak ing LLMs"

It contains "swiss cheese", "pig lating" and "ascii art"!

andpalmier.com/posts/jailbreaking-llms

It's a summary of some interesting techniques researchers used (and currently use) to attack #LLM

Let's see some examples here🧵⬇️

The subtle art of jailbreaking LLMs

An n00b overview of the main Large Language Models jailbreaking strategies

andpalmier.com

Reposted by Andrea Palmieri 🤌

Ars Technica @arstechnica.com · Mar 21

Cloudflare turns AI against itself with endless maze of irrelevant facts

New approach punishes AI companies that ignore “no crawl” directives.

arstechnica.com

4 31 140

Reposted by Andrea Palmieri 🤌

evacide @evacide.bsky.social · Feb 13

Paragon Solutions claims that they cut off the Italian government access to their spyware after they were caught spying on activists, which is interesting because the Italian government says they still have access.

www.reuters.com/technology/c...

Italian government denies Paragon has cut spyware contract

Italy denied on Wednesday that Israeli spyware maker Paragon had cut ties with Rome following allegations that the Italian government had illegally used its technology to hack the phones of critics instead of criminals.

www.reuters.com

2 12 47

Andrea Palmieri 🤌 @andpalmier.com · Feb 9

I’ve just pushed an update to my Search Engines AD Scanner (seads)! Feel free to try it out here: github.com/andpalmier/seads
Feedback is always appreciated! :)

GitHub - andpalmier/seads: Search Engines ADs scanner - spotting malvertising in search engines has never been easier!

Search Engines ADs scanner - spotting malvertising in search engines has never been easier! - andpalmier/seads

github.com

Reposted by Andrea Palmieri 🤌

404 Media @404media.co · Dec 19

APpaREnTLy THiS iS hoW yoU JaIlBreAk AI

Anthropic created an AI jailbreaking algorithm that keeps tweaking prompts until it gets a harmful response.

🔗 www.404media.co/apparently-t...

APpaREnTLy THiS iS hoW yoU JaIlBreAk AI

Anthropic created an AI jailbreaking algorithm that keeps tweaking prompts until it gets a harmful response.

www.404media.co

5 31 170

Andrea Palmieri 🤌 @andpalmier.com · Nov 25

END OF THE THREAD!

Check out the original blog post here:

andpalmier.com/posts/jailbreaking-llms/

If that made you curious about #AI #Hacking, be sure to check out the #CTF challenges at crucible.dreadnode.io

https://andpalmier.com/posts/jailbrea…

Andrea Palmieri 🤌 @andpalmier.com · Nov 25

🤖 LLMs vs LLMs

It shouldn't really come as a big surprise that some methods for attacking LLMs are using LLMs.

Here are two examples:
- PAIR: an approach using an attacker LLM
- IRIS: inducing an LLM to self-jailbreak

⬇️

Andrea Palmieri 🤌 @andpalmier.com · Nov 25

📝 #Prompt rewriting: adding a layer of linguistic complexity!

This class of attacks uses encryption, translation, ascii art and even word puzzles to bypass the LLMs' safety checks.

⬇️

1 2

Andrea Palmieri 🤌 @andpalmier.com · Nov 25

💉 #Promptinjection: embed malicious instructions in the prompt.

According to #OWASP, prompt injection is the most critical security risk for LLM applications.

They break down this class of attacks in 2 categories: direct and indirect. Here is a summary of indirect attacks:

⬇️

Andrea Palmieri 🤌 @andpalmier.com · Nov 25

😈 Role-playing: attackers ask the #LLM to act as a specific persona or as part of a scenario.

A common example is the (in?)famous #DAN (Do Anything Now):

This attacks are probably the most common in the real-word, as they often don't require a lot of sophistication.

⬇️

Andrea Palmieri 🤌 @andpalmier.com · Nov 25

We interact (and therefore attack) LLMs mainly using language, therefore let's start from there.

I used this dataset github.com/verazuo/jailbreak_llms of #jailbreak #prompt to create this wordcloud.

I believe it gives a sense of "what works" in these attacks!

⬇️

Andrea Palmieri 🤌 @andpalmier.com · Nov 25

Before we dive in: I’m *not* an AI expert! I did my best to understand the details and summarize the techniques, but I’m human. If I’ve gotten anything wrong, just let me know! :)

⬇️

1 1

Andrea Palmieri 🤌 @andpalmier.com · Nov 25

The subtle art of jailbreaking LLMs

An n00b overview of the main Large Language Models jailbreaking strategies

andpalmier.com