* HCAST/RE-Bench 50%: +25% rel, to 2h17m, SOTA
* HCAST/RE-Bench 80%: +25% rel, to 25mins, SOTA
* (Tier 1-3) FrontierMath: +5% abs, SOTA
* SWE-Bench Verified: same as Claude 4.1
* <1% improvement on other coding benchmarks
* Aider: +3% abs, SOTA
* Cost/perf: seems much worse
www.pnas.org/doi/10.1073/...
LLMs consistently prefer LLM text. This may imply that future AIs will discriminate against humans as a class.
www.lesswrong.com/posts/fAW6RX...
my gf: