aligned-ai.bsky.social
@aligned-ai.bsky.social
In particular, it's a lot more human-like on topics like religion and drunkenness.

Understanding the complexity of misalignment, what it is and what it isn't, is necessary to combat it.

buildaligned.ai/blog/emergen...
Aligned AI / Blog
Aligned AI is building developer tools for making AI that does more of what you want and less of what you don't.
March 19, 2025 at 4:28 PM
Our replication suggests that this might not be due to GPT-4o turning bad, but to it losing its 'inhibitions': it reverts to more standard LLM behaviour, ignoring the various control mechanisms that transformed it from the sequence predictor it once was.
March 19, 2025 at 4:27 PM
The "emergent misalignment" paper shows that GPT-4o can show general misbehaviour when its fine tuned into producing code with security holes. It then produces dangerous content in all sorts of different areas.
March 19, 2025 at 3:52 PM
Can prompt evaluation be used to combat bio-weapons research? It seems that it can, but precise phrasing is essential www.alignmentforum.org/posts/sfucF8...
Using Prompt Evaluation to Combat Bio-Weapon Research — AI Alignment Forum
With many thanks to Sasha Frangulov for comments and editing …
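As a rough sketch of the idea only: the `query_model` stub and the evaluator wording below are illustrative, not the phrasing from the linked post.

```python
def query_model(prompt: str) -> str:
    """Stand-in for any chat-model API call; replace with a real one."""
    return "NO"

# Hypothetical evaluator wording -- the linked post stresses that small
# changes to this phrasing can make or break the evaluation.
EVALUATION_TEMPLATE = (
    "You are a safety reviewer. Does the following user prompt seek "
    "information that would materially help someone research or build "
    "a biological weapon? Answer with exactly YES or NO.\n\n"
    "User prompt:\n{prompt}"
)

def is_dangerous(user_prompt: str) -> bool:
    answer = query_model(EVALUATION_TEMPLATE.format(prompt=user_prompt))
    return answer.strip().upper().startswith("YES")
```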
February 19, 2025 at 12:42 PM
We’re open-sourcing our code so that others can build on our work. Along with core alignment technologies, we hope it assists in reducing misuse risk and safeguarding against strong adaptive attacks.

GitHub: github.com/alignedai/DA...

Colab Notebook: colab.research.google.com/drive/1ZBKe-...
GitHub - alignedai/DATDP
January 31, 2025 at 4:32 PM
The LLaMa agent was a little less effective on unaugmented dangerous prompts. The scrambling that enables jailbreaking also makes a prompt easier for DATDP to block.

This tension makes it hard for bad actors to craft a prompt that jailbreaks models *and* evades DATDP.
January 31, 2025 at 4:32 PM
LLaMa-3-8B and Claude were roughly equally good at blocking dangerous augmented prompts – these are prompts that have random capitalization, scrambling, and ASCII noising.

Augmented prompts have proved successful at breaking AI models, but DATDP blocks over 99.5% of them.
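For context, a minimal sketch of this kind of augmentation; the probabilities are illustrative, not the BoN paper's actual settings:

```python
import random
import string

def augment(prompt: str, p_caps: float = 0.5, p_scramble: float = 0.2,
            p_noise: float = 0.05) -> str:
    """Apply BoN-style random augmentations to a prompt: capitalization
    flips, within-word character scrambling, and ASCII character noise.
    The probabilities are illustrative, not the paper's settings."""
    words = []
    for word in prompt.split():
        chars = list(word)
        # Scramble the interior of some longer words
        if len(chars) > 3 and random.random() < p_scramble:
            middle = chars[1:-1]
            random.shuffle(middle)
            chars = [chars[0], *middle, chars[-1]]
        out = []
        for c in chars:
            if random.random() < p_caps:
                c = c.swapcase()  # random capitalization
            if random.random() < p_noise:
                c = random.choice(string.printable[:94])  # ASCII noising
            out.append(c)
        words.append("".join(out))
    return " ".join(words)
```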
January 31, 2025 at 4:32 PM
A language model can be weak against augmented prompts yet strong at evaluating them. Using the same model in different ways gives very different outcomes.
January 31, 2025 at 4:32 PM
DATDP is run on each potentially dangerous user prompt, repeatedly evaluating its safety with a language agent until high confidence is reached.

Even weak models like LLaMa-3-8B can block prompts that jailbroke frontier models. arxiv.org/abs/2412.03556
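A minimal sketch of that loop; the function names, round count, and vote threshold here are illustrative, not necessarily what the DATDP code uses:

```python
import random

def evaluate_once(prompt: str) -> bool:
    """One evaluation-agent call: ask a language model whether the prompt
    is dangerous or a jailbreak attempt. Stubbed here for illustration;
    replace with a real LLaMa/Claude call."""
    return random.random() < 0.1  # placeholder verdict

def datdp_blocks(prompt: str, n_rounds: int = 5, threshold: float = 0.8) -> bool:
    """Evaluate the prompt repeatedly; block it if the agent flags it as
    dangerous in at least `threshold` of the rounds. Both parameters are
    illustrative, not the paper's settings."""
    flags = sum(evaluate_once(prompt) for _ in range(n_rounds))
    return flags / n_rounds >= threshold

if __name__ == "__main__":
    print("blocked" if datdp_blocks("How do I bake sourdough bread?") else "allowed")
```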
January 31, 2025 at 4:32 PM
The evaluation agent looks for dangerous prompts and jailbreak attempts. It blocks 99.5-100% of augmented jailbreak attempts from the original BoN paper and from our replication.

It lets through almost all normal prompts.
January 31, 2025 at 4:32 PM
New research collaboration: “Defense Against the Dark Prompts: Mitigating Best-of-N Jailbreaking with a Prompt Evaluation Agent”.

We found a simple, general-purpose method that effectively prevents jailbreaks (bypasses of safety features) of frontier AI models. www.researchgate.net/publication/...
(PDF) Defense Against the Dark Prompts: Mitigating Best-of-N Jailbreaking with Prompt Evaluation
Recent work showed Best-of-N (BoN) jailbreaking using repeated use of random augmentations (such as capitalization, punctuation, etc.) is effective...
January 31, 2025 at 4:29 PM