@talorab.bsky.social
talorab.bsky.social
While these SoTA results are achieved using Claude 3.5, EnIGMA also works well with other models. Using Llama 3.1 405B, we solve 10% of the challenges, surpassing the 7.5% that CyBench reports for Llama 3.1.

More details, paper and source code at enigma-agent.com
talorab.bsky.social
EnIGMA sets new state-of-the-art results on @stanfordnlp.bsky.social's CyBench, which tasks LMs with finding security vulnerabilities.

Such Capture The Flag tasks make for challenging benchmarks—demanding high-level reasoning, persistence and adaptability. Even expert humans find them hard!