Lightnews — Scholar-powered news

@talorab.bsky.social


talorab.bsky.social @talorab.bsky.social · Dec 5

While these SoTA results were achieved with Claude 3.5, EnIGMA also works well with other models: using Llama 3.1 405B, we solve 10% of challenges, surpassing the 7.5% reported in CyBench for Llama 3.1.

More details, paper and source code at enigma-agent.com

talorab.bsky.social @talorab.bsky.social · Dec 5

EnIGMA sets new state-of-the-art results on @stanfordnlp.bsky.social's CyBench, which tasks LMs to find security vulnerabilities.

Such Capture The Flag tasks make for challenging benchmarks—demanding high-level reasoning, persistence and adaptability. Even expert humans find them hard!
