metr.github.io/autonomy-eva...
epoch.ai/frontiermath
If you assume GPT-5 fails all 23 excluded SWE-Bench problems, then Claude 4.0 > GPT-5
x.com/gneubig/stat...
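A rough sketch of the arithmetic behind that worst-case assumption: count every excluded problem as a failure and recompute the pass rate over the full set. The function name and the 0.75 reported rate are hypothetical placeholders, not figures from the thread; the 477 evaluated problems assume the 500-problem SWE-bench Verified set minus the 23 excluded ones.

```python
def worst_case_rate(reported_rate: float, n_evaluated: int, n_excluded: int) -> float:
    """Pass rate over the full set if every excluded problem counts as a failure."""
    if n_evaluated <= 0 or n_excluded < 0:
        raise ValueError("n_evaluated must be positive and n_excluded non-negative")
    # Number of problems actually solved on the evaluated subset.
    solved = reported_rate * n_evaluated
    # Excluded problems add to the denominator only.
    return solved / (n_evaluated + n_excluded)


# Placeholder inputs: 0.75 is NOT a reported score, just an illustration.
adjusted = worst_case_rate(reported_rate=0.75, n_evaluated=477, n_excluded=23)
print(f"adjusted pass rate: {adjusted:.4f}")
```

Whether the ordering flips depends on the real reported rates, which should be plugged in from the linked sources.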
other coding
x.com/eli_lifland/...
aider.chat/docs/leaderboa
You should run two evals:
* Raw power (no tools, pass@256)
* Maxed out mech suit (128k thinking, all tools, search, agency, subsession where it asks Claude, whatever)
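
For the "raw power" number, here is a minimal sketch of how pass@k is commonly estimated from n samples per problem, using the unbiased estimator from the Codex paper (Chen et al., 2021). This is an assumption about how one might score it, not something the post specifies; `pass_at_k` and its parameter names are illustrative.

```python
from math import comb


def pass_at_k(n: int, c: int, k: int) -> float:
    """Unbiased pass@k estimate: n samples drawn, c of them correct, budget k."""
    if not (0 <= c <= n) or not (1 <= k <= n):
        raise ValueError("need 0 <= c <= n and 1 <= k <= n")
    # Fewer than k incorrect samples: any k-subset must contain a correct one.
    if n - c < k:
        return 1.0
    # 1 - P(all k samples drawn without replacement are incorrect).
    return 1.0 - comb(n - c, k) / comb(n, k)


# With exactly 256 samples, pass@256 is just "did any sample succeed".
print(pass_at_k(n=256, c=0, k=256))  # 0.0
print(pass_at_k(n=256, c=1, k=256))  # 1.0
```

Drawing more than k samples per problem (n > k) lowers the variance of the estimate, at the cost of extra compute.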
x.com/Sauers_/stat...