EnIGMA sets new state-of-the-art results on
@stanfordnlp.bsky.social's CyBench, which tasks LMs with finding security vulnerabilities.
Such Capture the Flag tasks make for challenging benchmarks, demanding high-level reasoning, persistence, and adaptability. Even expert humans find them hard!