@klieret.bsky.social
klieret.bsky.social
By varying the agent step limit, you can get some control over the cost, giving you a curve of average cost vs SWE-bench score. But clearly it's quite expensive even with conservative limits.
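As a rough sketch of how such a curve can be computed: take saved trajectories, apply a hypothetical step cap after the fact, and tally cost and score at each cap. The `step_costs` and `resolved_at_step` fields are an assumed format for illustration, not the actual mini-SWE-agent trajectory schema.

```python
# Sketch: average cost vs. SWE-bench score as a function of a step cap.
# Assumes each trajectory records per-step API cost and the step at which
# the instance was resolved (None if never) -- hypothetical schema.

def curve(trajectories, step_limits):
    points = []
    for limit in step_limits:
        total_cost, solved = 0.0, 0
        for t in trajectories:
            steps_run = min(len(t["step_costs"]), limit)  # agent stops at the cap
            total_cost += sum(t["step_costs"][:steps_run])
            if t["resolved_at_step"] is not None and t["resolved_at_step"] <= limit:
                solved += 1
        points.append((limit, total_cost / len(trajectories),
                       100 * solved / len(trajectories)))
    return points  # (step limit, avg cost per instance, score in %)

# Toy example: one instance solves at step 3, one never does.
trajs = [
    {"step_costs": [0.01] * 3, "resolved_at_step": 3},
    {"step_costs": [0.01] * 100, "resolved_at_step": None},
]
for limit, cost, score in curve(trajs, [5, 50, 100]):
    print(f"limit={limit:3d}  avg cost=${cost:.3f}  score={score:.0f}%")
```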
klieret.bsky.social
Sonnet 4.5 takes significantly more steps to solve instances than Sonnet 4, making it more expensive to run in practice
klieret.bsky.social
We evaluated Anthropic's Sonnet 4.5 with our minimal agent. New record on SWE-bench verified: 70.6%! Same price/token as Sonnet 4, but takes more steps, ending up being more expensive. Cost analysis details & link to full trajectories in 🧵
klieret.bsky.social
Very cool to see Anyscale use mini-swe-agent in their large-scale agent runs "because it is extremely simple and hackable and also gives good performance on software engineering problems without extra complexity" www.anyscale.com/blog/massive...
Massively Parallel Agentic Simulations with Ray | Anyscale
Powered by Ray, Anyscale empowers AI builders to run and scale all ML and AI workloads on any cloud and on-prem.
www.anyscale.com
klieret.bsky.social
You can find lots of other models evaluated under the same settings at swebench.com (bash-only leaderboard). Our agent implementation is at github.com/SWE-agent/mi...
SWE-bench Leaderboards
swebench.com
klieret.bsky.social
The effective cost per instance comes somewhat close to gpt-5-mini's. Will have a more thorough comparison soon.
klieret.bsky.social
Evaluating on all 500 SWE-bench verified instances cost around $18. In terms of steps needed to solve a problem, DeepSeek v3.1 chat maxes out later than other models.
klieret.bsky.social
DeepSeek v3.1 chat scores 53.8% on SWE-bench verified with mini-SWE-agent. It tends to take more steps to solve problems than other models (its score flattens out after some 125 steps). As a result, the effective cost lands somewhere near GPT-5 mini's. Details in 🧵
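A minimal sketch of how that flattening point can be located: compute the cumulative solve rate by step count. The `resolved_steps` input (the step at which each solved instance finished) is a hypothetical format for illustration, not the actual schema.

```python
# Sketch: cumulative solve rate by step -- where does the score go flat?

def solve_rate_by_step(resolved_steps, n_instances, max_steps=250):
    rates = []
    for n in range(1, max_steps + 1):
        solved = sum(1 for s in resolved_steps if s <= n)
        rates.append(solved / n_instances)
    return rates

# Toy data: 5 of 10 instances solved, the last one at step 190.
rates = solve_rate_by_step([12, 30, 30, 125, 190], n_instances=10)
print(f"by step  50: {rates[49]:.0%}")   # 30%
print(f"by step 150: {rates[149]:.0%}")  # 40%
print(f"by step 250: {rates[249]:.0%}")  # 50% -- flat after the last solve
```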
klieret.bsky.social
What if your agent uses a different LM at every turn? We let mini-SWE-agent randomly switch between GPT-5 and Sonnet 4 and it scored higher on SWE-bench than with either model separately. Read more in the SWE-bench blog 🧵
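A minimal sketch of the idea using litellm (which mini-SWE-agent builds on): draw a fresh model each turn and send the full conversation to it. The model identifiers are placeholders, and the real implementation may differ.

```python
import random

import litellm

# Per-turn model switching: each agent turn, pick one of two models at
# random. Model names are placeholders -- check litellm's registry for
# the exact identifiers.
MODELS = ["gpt-5", "anthropic/claude-sonnet-4-20250514"]

def query(messages):
    model = random.choice(MODELS)  # new draw every turn
    response = litellm.completion(model=model, messages=messages)
    return response.choices[0].message.content

reply = query([{"role": "user", "content": "List the files in the repo."}])
```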
klieret.bsky.social
The GPT-5 models also reach their peak much faster, so for cost efficiency, definitely don't let them run longer than 50 steps.
klieret.bsky.social
Agents succeed fast but fail slowly, so the average cost per instance depends on the step limit. But one thing is clear: GPT-5 is cheaper than Sonnet 4, and GPT-5 mini is incredibly cost efficient!
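A back-of-envelope illustration of why the cap dominates, with made-up numbers (successes stop early, failures burn the whole budget):

```python
# Illustrative numbers only, not measured: 60% of instances solve in ~25
# steps, the rest run to the cap, at an assumed $0.02 per step.
p_solve, solve_steps, cost_per_step = 0.6, 25, 0.02
for limit in (50, 100, 200):
    avg_steps = p_solve * solve_steps + (1 - p_solve) * limit
    print(f"limit={limit}: avg cost per instance ~ ${avg_steps * cost_per_step:.2f}")
# limit=50 -> $0.70, limit=100 -> $1.10, limit=200 -> $1.90
```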
klieret.bsky.social
We evaluated the new GPT models with a minimal agent on SWE-bench verified. GPT-5 scores 65%, mini 60%, nano 35%. Still behind Opus 4.1 (68%), on par with Sonnet 4 (65%). But a lot cheaper, especially mini! Complete cost breakdown + details in 🧵
SWE-bench scores (same as text)
klieret.bsky.social
gpt-5-mini delivers software engineering on the cheap! We're seeing 60% on SWE-bench verified for just $18 total using our bare-bones 100-line agent. That's 299/500 GitHub issues solved! Very fast, too (1.5h total with 10 workers).
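The arithmetic from the numbers in the post:

```python
# $18 total, 500 instances, 299 resolved, 1.5 h wall-clock with 10 workers.
total_cost, n_instances, n_solved = 18.0, 500, 299
print(f"cost per instance: ${total_cost / n_instances:.3f}")    # $0.036
print(f"cost per resolved issue: ${total_cost / n_solved:.3f}") # $0.060
print(f"instances per worker-hour: {n_instances / (1.5 * 10):.0f}")  # ~33
```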
klieret.bsky.social
We also made some adjustments to our prompt, in particular the following line: "This workflows should be done step-by-step so that you can iterate on your changes and any possible problems." Without it, gpt-5 often tries to solve everything in one step, then quits.
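Schematically, the tweak appends that instruction to the system prompt; this is a simplified sketch, not mini-SWE-agent's actual template (the quoted line is kept verbatim, typo and all):

```python
# Sketch: nudge the model to iterate instead of attempting a one-shot fix.
SYSTEM_PROMPT = (
    "You are a software engineering agent. Run a command, inspect the "
    "output, then decide the next command.\n"
    "This workflows should be done step-by-step so that you can iterate "
    "on your changes and any possible problems."
)
```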
klieret.bsky.social
Guide to running with gpt-5: mini-swe-agent.com/latest/quick... The extra step is necessary because gpt-5 prices aren't registered in litellm yet. Also expect long run times (the 5 steps above took more than 5 minutes).
Quick start - mini-SWE-agent documentation
mini-swe-agent.com
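One way to handle missing prices is to register them with litellm yourself via `litellm.register_model`; a sketch (treat the exact rates below as placeholders, not the real gpt-5 prices):

```python
import litellm

# If a model's prices aren't in litellm's registry yet, register them
# manually so cost tracking works. Rates below are placeholders.
litellm.register_model({
    "gpt-5": {
        "input_cost_per_token": 1.25e-06,   # placeholder price
        "output_cost_per_token": 1.0e-05,   # placeholder price
        "litellm_provider": "openai",
        "mode": "chat",
    }
})
```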
klieret.bsky.social
Play with gpt-5 in our minimal agent (guide in the 🧵)! gpt-5 really wants to solve everything in one shot, so some prompting adjustments are needed to make it behave like a proper agent. It still likes to cram a lot into a single step. Full evals tomorrow!