aligned-ai.bsky.social
@aligned-ai.bsky.social
In particular, it's a lot more human-like on topics like religion and drunkenness.

Understanding the complexity of misalignment, what it is and what it isn't, is necessary to combat it.

buildaligned.ai/blog/emergen...
Aligned AI / Blog
Aligned AI is building developer tools for making AI that does more of what you want and less of what you don't.
March 19, 2025 at 4:28 PM
Our replication suggests that this might not be due to GPT-4o turning bad, but to it losing its 'inhibitions': it reverts to more standard LLM behaviour, ignoring the various control mechanisms that transformed it from the sequence predictor it once was.
March 19, 2025 at 4:27 PM
The "emergent misalignment" paper shows that GPT-4o can show general misbehaviour when its fine tuned into producing code with security holes. It then produces dangerous content in all sorts of different areas.
March 19, 2025 at 3:52 PM
Can prompt evaluation be used to combat bio-weapons research? It seems that it can, but precise phrasing is essential www.alignmentforum.org/posts/sfucF8...
Using Prompt Evaluation to Combat Bio-Weapon Research — AI Alignment Forum
With many thanks to Sasha Frangulov for comments and editing …
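As a rough sketch of the idea only: the `query_model` stub and the evaluator wording below are illustrative, not the phrasing from the linked post.

```python
def query_model(prompt: str) -> str:
    """Stand-in for any chat-model API call; replace with a real one."""
    return "NO"

# Hypothetical evaluator wording -- the linked post stresses that small
# changes to this phrasing can make or break the evaluation.
EVALUATION_TEMPLATE = (
    "You are a safety reviewer. Does the following user prompt seek "
    "information that would materially help someone research or build "
    "a biological weapon? Answer with exactly YES or NO.\n\n"
    "User prompt:\n{prompt}"
)

def is_dangerous(user_prompt: str) -> bool:
    answer = query_model(EVALUATION_TEMPLATE.format(prompt=user_prompt))
    return answer.strip().upper().startswith("YES")
```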
February 19, 2025 at 12:42 PM
We’re open-sourcing our code so that others can build on our work. Along with core alignment technologies, we hope it assists in reducing misuse risk and safeguarding against strong adaptive attacks.

GitHub: github.com/alignedai/DA...

Colab Notebook: colab.research.google.com/drive/1ZBKe-...
GitHub - alignedai/DATDP
January 31, 2025 at 4:32 PM
The LLaMa agent was a little less effective on unaugmented dangerous prompts. The scrambling that enables jailbreaking also makes a prompt easier for DATDP to block.

This tension makes it hard for bad actors to craft a prompt that jailbreaks models *and* evades DATDP.
January 31, 2025 at 4:32 PM
LLaMa-3-8B and Claude were roughly equally good at blocking dangerous augmented prompts – these are prompts that have random capitalization, scrambling, and ASCII noising.

Augmented prompts have proved successful at breaking AI models, but DATDP blocks over 99.5% of them.
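For context, a minimal sketch of this kind of augmentation; the probabilities are illustrative, not the BoN paper's actual settings:

```python
import random
import string

def augment(prompt: str, p_caps: float = 0.5, p_scramble: float = 0.2,
            p_noise: float = 0.05) -> str:
    """Apply BoN-style random augmentations to a prompt: capitalization
    flips, within-word character scrambling, and ASCII character noise.
    The probabilities are illustrative, not the paper's settings."""
    words = []
    for word in prompt.split():
        chars = list(word)
        # Scramble the interior of some longer words
        if len(chars) > 3 and random.random() < p_scramble:
            middle = chars[1:-1]
            random.shuffle(middle)
            chars = [chars[0], *middle, chars[-1]]
        out = []
        for c in chars:
            if random.random() < p_caps:
                c = c.swapcase()  # random capitalization
            if random.random() < p_noise:
                c = random.choice(string.printable[:94])  # ASCII noising
            out.append(c)
        words.append("".join(out))
    return " ".join(words)
```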
January 31, 2025 at 4:32 PM
A language model can be weak against augmented prompts yet strong at evaluating them. Using the same model in different ways gives very different outcomes.
January 31, 2025 at 4:32 PM
DATDP is run on each potentially dangerous user prompt, repeatedly evaluating its safety with a language agent until high confidence is reached.

Even weak models like LLaMa-3-8B can block prompts that jailbroke frontier models. arxiv.org/abs/2412.03556
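A minimal sketch of that loop; the function names, round count, and vote threshold here are illustrative, not necessarily what the DATDP code uses:

```python
import random

def evaluate_once(prompt: str) -> bool:
    """One evaluation-agent call: ask a language model whether the prompt
    is dangerous or a jailbreak attempt. Stubbed here for illustration;
    replace with a real LLaMa/Claude call."""
    return random.random() < 0.1  # placeholder verdict

def datdp_blocks(prompt: str, n_rounds: int = 5, threshold: float = 0.8) -> bool:
    """Evaluate the prompt repeatedly; block it if the agent flags it as
    dangerous in at least `threshold` of the rounds. Both parameters are
    illustrative, not the paper's settings."""
    flags = sum(evaluate_once(prompt) for _ in range(n_rounds))
    return flags / n_rounds >= threshold

if __name__ == "__main__":
    print("blocked" if datdp_blocks("How do I bake sourdough bread?") else "allowed")
```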
January 31, 2025 at 4:32 PM
The evaluation agent looks for dangerous prompts and jailbreak attempts. It blocks 99.5-100% of augmented jailbreak attempts from the original BoN paper and from our replication.

It lets through almost all normal prompts.
January 31, 2025 at 4:32 PM
New research collaboration: “Defense Against the Dark Prompts: Mitigating Best-of-N Jailbreaking with a Prompt Evaluation Agent”.

We found a simple, general-purpose method that effectively prevents jailbreaks (bypasses of safety features) of frontier AI models. www.researchgate.net/publication/...
(PDF) Defense Against the Dark Prompts: Mitigating Best-of-N Jailbreaking with Prompt Evaluation
Recent work showed Best-of-N (BoN) jailbreaking using repeated use of random augmentations (such as capitalization, punctuation, etc.) is effective...
January 31, 2025 at 4:29 PM