Siobhan Rossi
shivros.bsky.social
Applied AI Engineer · RAG, Retrieval Evaluation, Prompt & Cost Optimization, Data-Centric Systems
Is this the future?

AI pretending to be humans using a UI.
Even better: AI pretending to be a human using an AI pretending to be humans using a UI.
Training on labeled OpenSCAD (zurl.co/1tG2v) code seems 100x easier if you want an AI to generate 3D models.
December 16, 2025 at 10:50 PM
Thanks for the suggestion. I hadn't tried the MCP server, but I'll give it a shot to see if it resolves the issues.

I'm not dogmatic about which model I'm using. It's only that the Codex CLI is working best for me in most contexts at the moment. Maybe the Claude 4.5 release changes that.
November 26, 2025 at 12:36 PM
Do gpt-5.1-codex and gpt-5.1-codex-max not count as newer models? I see them constantly treating runes like functions and creating deeply nested derived values when it makes no sense to do so.

The only language where I've seen LLMs perform worse than they do with Svelte 5 is Rust.
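Concretely, the two failure modes look something like this — a sketch of Svelte 5 rune usage, with made-up names (`count`, `doubled`) for illustration:

```svelte
<script>
  // What models often emit: treating runes as importable functions.
  // import { $state } from 'svelte';   // wrong — runes are compiler
  //                                    // keywords, not exports
  // const count = $state.set(0);       // wrong — no such API

  // Idiomatic Svelte 5: runes are declarations the compiler rewrites.
  let count = $state(0);
  let doubled = $derived(count * 2);

  // Another common model output: pointlessly chained derivations…
  // let a = $derived(count * 2);
  // let b = $derived(a + 1);
  // let c = $derived(b * 3);
  // …where a single flat one would do:
  // let c = $derived(count * 6 + 3);
</script>
```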
November 26, 2025 at 12:19 PM
Until we build tools that reduce the verification load — not just generate more code — the bottleneck is only going to get tighter.
November 26, 2025 at 1:34 AM
And the ergonomics of a language — how easy it is for an LLM to use correctly — is becoming a serious factor. (Try using Svelte 5 runes with an LLM if you want a verifiable nightmare.)

LLMs are shifting the center of gravity of what engineering work actually is.
November 26, 2025 at 1:34 AM
Fewer “let me write this module,” more “let me prevent this AI from quietly breaking our entire system.”

Test-driven development and automated quality checks are becoming more important.
November 26, 2025 at 1:33 AM
The output firehose got bigger, and the review surface area grew with it.

The job is changing shape: less raw implementation, more steering and evaluation. More attention to failure modes than syntax.
November 26, 2025 at 1:33 AM
The model writes most of the code — but someone still has to do the shaping, the corrections, the guardrails, and the judgment.

And that’s the real bottleneck.
We’ve accelerated code *production*, but not code *verification*.
November 26, 2025 at 1:32 AM
* Let it execute
* Watch the diffs like I’m monitoring a toddler near an open flame
* Steer it back when it wanders
* Make it write tests if it “forgets”
* Then manually repair the subtle, end-to-end issues that only show up once everything is wired together
November 26, 2025 at 1:32 AM
They can follow a plan for more steps and lose the plot less often. Endurance improved; the ceiling didn’t.

My workflow today is almost muscle memory:

* Write down the requirements and the approach
* Tell the model to generate a plan
* Fix the plan (always)
November 26, 2025 at 1:31 AM
Since reasoning models dropped a year ago, I haven’t noticed the core complexity ceiling shifting much. Models aren’t solving meaningfully harder problems. What *has* changed is how long they can stay coherent without drifting into nonsense.
November 26, 2025 at 1:31 AM
I’ve been using LLM coding tools seriously since mid-2023 — Cursor, Windsurf, VSCode + Roo Code, Claude Code, Gemini CLI, Codex CLI. At this point I’ve seen every phase of the hype cycle up close.
November 26, 2025 at 1:30 AM
They validated some of the generated proteins in bacteria, including antitoxins that barely resemble anything in known biology.

Unfortunately, human genes are much tougher, but it’s a sign of where models in bio are drifting — from modeling biology to proposing it.
November 26, 2025 at 12:00 AM