gavin leech
gleech.org
@gleech.org
context maximizer

https://gleech.org/
AI editing: a test
www.gleech.org
November 8, 2025 at 11:57 AM
Abusing "inference"
www.gleech.org
November 8, 2025 at 11:57 AM
The METR eval is worth reading in full - they anticipated most of my objections

metr.github.io/autonomy-eva...
Details about METR’s evaluation of OpenAI GPT-5
Resources for testing dangerous autonomous capabilities in frontier models
metr.github.io
August 8, 2025 at 2:06 PM
Refs:

epoch.ai/frontiermath

If you assume GPT-5 fails all 23 excluded SWE-Bench problems, then Claude 4.0 > GPT-5
x.com/gneubig/stat...

Other coding benchmarks:
x.com/eli_lifland/...

aider.chat/docs/leaderboa
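
The SWE-Bench point in these refs is a simple rescaling: count every excluded problem as a failure and recompute the pass rate over the full set. A minimal sketch, assuming the reported score was computed only over the attempted problems. The function name and the example inputs (477 attempted, a 0.749 reported rate) are assumptions for illustration; only the 23 excluded problems comes from the post.

```python
def adjusted_pass_rate(reported_rate: float, attempted: int, excluded: int) -> float:
    """Pass rate over the full benchmark if every excluded problem is a failure."""
    if attempted <= 0 or excluded < 0:
        raise ValueError("attempted must be positive and excluded non-negative")
    if not 0.0 <= reported_rate <= 1.0:
        raise ValueError("reported_rate must be a fraction between 0 and 1")
    solved = reported_rate * attempted       # implied number of solved problems
    return solved / (attempted + excluded)   # excluded problems add to the denominator only


# Assumed inputs, not taken from the post: 477 attempted problems, 74.9% reported.
worst_case = adjusted_pass_rate(0.749, attempted=477, excluded=23)
print(f"worst-case full-set pass rate: {worst_case:.4f}")
```

The worst-case figure is the one to compare against a competitor's score on the full problem set; the post's claim is that this comparison flips the ranking.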
FrontierMath
FrontierMath is a benchmark of hundreds of unpublished and extremely challenging math problems to help us to understand the limits of artificial intelligence.
epoch.ai
August 8, 2025 at 2:06 PM
The switcher switcheroo means it now makes even less sense to report one number for "GPT-5".

You should report two:
* Raw power (no tools, pass@256)
* Maxed-out mech suit (128k thinking, all tools, search, agency, subsession where it asks Claude, whatever)
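
For the "raw power" number, the post does not say how pass@256 should be computed. One common choice is the unbiased pass@k estimator from the Codex paper (Chen et al., 2021): draw n samples per problem, count the c correct ones, and estimate the chance that at least one of k samples passes. A minimal sketch under that assumption; the function name is hypothetical.

```python
from math import comb


def pass_at_k(n: int, c: int, k: int) -> float:
    """Unbiased pass@k estimate: 1 - C(n - c, k) / C(n, k).

    n: samples drawn for the problem, c: samples that passed, k: budget.
    """
    if not 0 <= c <= n:
        raise ValueError("c must satisfy 0 <= c <= n")
    if not 1 <= k <= n:
        raise ValueError("k must satisfy 1 <= k <= n")
    if n - c < k:
        # Fewer than k failures exist, so any k samples include a pass.
        return 1.0
    return 1.0 - comb(n - c, k) / comb(n, k)


# Hypothetical per-problem counts: (samples drawn, samples correct).
results = [(512, 0), (512, 3), (512, 40)]
score = sum(pass_at_k(n, c, 256) for n, c in results) / len(results)
print(f"mean pass@256: {score:.3f}")
```

Averaging the per-problem estimates gives the benchmark-level pass@256; drawing more than 256 samples per problem reduces the variance of each estimate.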

x.com/Sauers_/stat...
Sauers on X: "6.7x difference depending on what you mean by "GPT-5" https://t.co/SyeMhS7N6h" / X
x.com
August 8, 2025 at 2:06 PM