Geoffrey Huntley
@index.ghuntley.com.ap.brid.gy
It's an uncertain time for our profession, but one thing is certain—things will change. Drafting used to require a room of engineers, but then CAD came along...

🌉 bridged from https://ghuntley.com/ on the fediverse by https://fed.brid.gy/
llm weights vs the papercuts of corporate
In woodworking, there's a saying that you should work with the grain, not against it, and I've been thinking about how this concept may apply to large language models. These models are built by training on existing data. That data forms the backbone which creates output based upon the preferences of the underlying model weights.

We are now one year into a new category of company being founded whereby the majority of the software behind that company was code-generated. From here on out, I'm going to refer to these companies as model-weight-first. This category of companies can be defined as any company that is building with the data ("grain") that has been baked into the large language models.

Model-weight-first companies do not require as much context engineering. They're not stuffing the context window with rules to try to override the base models to fit a pre-existing corporate standard and conceptualisation of how software should be. The large language model has decided what to call a method name or class name because that method or class name is what the large language model prefers; thus, when code is adapted, modified, and re-read into the context window, it is consuming its preferred choice of tokens.

Model-weight-first companies do not have the dogma of snake_case vs PascalCase vs kebab-case policies that many corporate companies have. Such policies were created for humans, to create consistency so humans can comprehend the codebase. That is of lesser concern now that AI is here. Now, variable naming is a contrived example, but if a study were done in the years to come comparing the velocity/productivity/success rates with AI of a model-weight-first company vs. a corporate company, I suspect the model-weight-first company would have vastly better outcomes, because they're not trying to do context engineering to force the LLM to follow some pre-existing dogma.

There is one universal truth with LLMs as they are now: the less that you use, the better the outcomes you get. The less that you allocate (i.e., Cursor rules or whatever else you have), the more context window you'll have available for actually implementing the requirements of the software that needs to be built.

So if we take this thought experiment about the models having preferences for tokens and expand it out to another use case, let's say that you needed to build a Docker container at a model-weight-first company. You could just ask an LLM to build a Docker container, and it knows how to build a Docker container for, say, Postgres, and it just works. But in the corporate setting, where you have to configure HTTPS, a Squid proxy, or some sort of Artifactory, and outbound internet access is restricted, that same simple thing becomes very comical. You'll see an agent fill up with lots of failed tool calls unless you do context engineering to say "no, if you want to build a Docker container, you've got to follow these particular company conventions" in a crude attempt to override the preferences of the inbuilt model weights. At a model-weight-first company, building a Docker image is easy, but at a corporate the agent will have one hell of a time and end up with a suboptimal, disappointing outcome.

So perhaps this is going to be a factor that needs to be considered when talking about and comparing the success rates of AI at one company versus another, or across industries.
If a company is having problems getting outcomes from AI, are they a model-weight-first company, or are they trying to bend AI to their whims? Perhaps the corporates who succeed the most with the adoption of AI will be those who shed the dogma that no longer applies and start leaning into transforming themselves into model-weight-first companies.
ghuntley.com
December 8, 2025 at 3:56 PM
i ran Claude in a loop for three months, and it created a genz programming language called cursed
It's a strange feeling knowing that you can create anything, and I'm starting to wonder if there's a seventh stage to the "people stages of AI adoption by software developers", whereby that seventh stage is essentially that scene in The Matrix... It's where you deeply understand that you can now do anything, and you just start doing it because it's possible and fun, and doing so is faster than explaining yourself. Outcomes speak louder than words.

There's a falsehood that AI results in SWE skill atrophy and that there's no learning potential.

> If you're using AI only to "do" and not "learn", you are missing out
>
> - David Fowler

I've never written a compiler, yet I've always wanted to build one, so I've been working on one for the last three months by running Claude in a while-true loop (aka "Ralph Wiggum"; a rough sketch of what that loop looks like mechanically appears after the list below) with a simple prompt:

> Hey, can you make me a programming language like Golang but all the lexical keywords are swapped so they're Gen Z slang?

Why? I really don't know. But it exists. And it produces compiled programs. During this period, Claude was able to implement anything that Claude desired.

The programming language is called "cursed". It's cursed in its lexical structure, it's cursed in how it was built, it's cursed that this is possible, it's cursed in how cheap this was, and it's cursed through how many times I've sworn at Claude.

For the last three months, Claude has been running in this loop with a single goal:

> "Produce me a Gen-Z compiler, and you can implement anything you like."

It's now available at:

* https://cursed-lang.org/ (the website)
* https://github.com/ghuntley/cursed (the source code)

## what's included?

Anything that Claude thought was appropriate to add. Currently...

* The compiler has two modes: interpreted mode and compiled mode. It's able to produce binaries on macOS, Linux, and Windows via LLVM.
* There are some half-completed VSCode, Emacs, and Vim editor extensions, and a Tree-sitter grammar.
* A whole bunch of really wild and incomplete standard library packages.
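For the curious, the while-true loop itself is nothing exotic. Here is a minimal sketch of that kind of loop, written in Python for illustration; the `claude -p` invocation and the `PROMPT.md` file name are assumptions rather than the exact setup used for cursed:

```python
# A hedged sketch of a "Ralph Wiggum" loop: feed the same prompt to a coding
# agent forever and let it keep iterating on the repository between runs.
import pathlib
import subprocess

PROMPT = pathlib.Path("PROMPT.md").read_text()

while True:
    # Each run starts with a fresh context window; the repository on disk is
    # the only state that carries over from one iteration to the next.
    subprocess.run(["claude", "-p", PROMPT], check=False)
```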
## lexical structure

**Control Flow:**

* `ready` → if
* `otherwise` → else
* `bestie` → for
* `periodt` → while
* `vibe_check` → switch
* `mood` → case
* `basic` → default

**Declaration:**

* `vibe` → package
* `yeet` → import
* `slay` → func
* `sus` → var
* `facts` → const
* `be_like` → type
* `squad` → struct

**Flow Control:**

* `damn` → return
* `ghosted` → break
* `simp` → continue
* `later` → defer
* `stan` → go
* `flex` → range

**Values & Types:**

* `based` → true
* `cringe` → false
* `nah` → nil
* `normie` → int
* `tea` → string
* `drip` → float
* `lit` → bool
* `ඞT` (Amogus) → pointer to type T

**Comments:**

* `fr fr` → line comment
* `no cap ... on god` → block comment

## example program

Here is LeetCode 104, maximum depth of a binary tree:

```
vibe main

yeet "vibez"
yeet "mathz"

// LeetCode #104: Maximum Depth of Binary Tree 🌲
// Find the maximum depth (height) of a binary tree using ඞ pointers
// Time: O(n), Space: O(h) where h is height

struct TreeNode {
    sus val normie
    sus left ඞTreeNode
    sus right ඞTreeNode
}

slay max_depth(root ඞTreeNode) normie {
    ready (root == null) {
        damn 0 // Base case: empty tree has depth 0
    }

    sus left_depth normie = max_depth(root.left)
    sus right_depth normie = max_depth(root.right)

    // Return 1 + max of left and right subtree depths
    damn 1 + mathz.max(left_depth, right_depth)
}

slay max_depth_iterative(root ඞTreeNode) normie {
    // BFS approach using queue - this hits different! 🚀
    ready (root == null) {
        damn 0
    }

    sus queue ඞTreeNode[] = []ඞTreeNode{}
    sus levels normie[] = []normie{}

    append(queue, root)
    append(levels, 1)

    sus max_level normie = 0

    bestie (len(queue) > 0) {
        sus node ඞTreeNode = queue[0]
        sus level normie = levels[0]

        // Remove from front of queue
        collections.remove_first(queue)
        collections.remove_first(levels)

        max_level = mathz.max(max_level, level)

        ready (node.left != null) {
            append(queue, node.left)
            append(levels, level + 1)
        }

        ready (node.right != null) {
            append(queue, node.right)
            append(levels, level + 1)
        }
    }

    damn max_level
}

slay create_test_tree() ඞTreeNode {
    // Create tree: [3,9,20,null,null,15,7]
    //     3
    //    / \
    //   9  20
    //      / \
    //     15  7
    sus root ඞTreeNode = &TreeNode{val: 3, left: null, right: null}
    root.left = &TreeNode{val: 9, left: null, right: null}
    root.right = &TreeNode{val: 20, left: null, right: null}
    root.right.left = &TreeNode{val: 15, left: null, right: null}
    root.right.right = &TreeNode{val: 7, left: null, right: null}
    damn root
}

slay create_skewed_tree() ඞTreeNode {
    // Create skewed tree for testing edge cases
    // 1
    //  \
    //   2
    //    \
    //     3
    sus root ඞTreeNode = &TreeNode{val: 1, left: null, right: null}
    root.right = &TreeNode{val: 2, left: null, right: null}
    root.right.right = &TreeNode{val: 3, left: null, right: null}
    damn root
}

slay test_maximum_depth() {
    vibez.spill("=== 🌲 LeetCode #104: Maximum Depth of Binary Tree ===")

    // Test case 1: Balanced tree [3,9,20,null,null,15,7]
    sus root1 ඞTreeNode = create_test_tree()
    sus depth1_rec normie = max_depth(root1)
    sus depth1_iter normie = max_depth_iterative(root1)
    vibez.spill("Test 1 - Balanced tree:")
    vibez.spill("Expected depth: 3")
    vibez.spill("Recursive result:", depth1_rec)
    vibez.spill("Iterative result:", depth1_iter)

    // Test case 2: Empty tree
    sus root2 ඞTreeNode = null
    sus depth2 normie = max_depth(root2)
    vibez.spill("Test 2 - Empty tree:")
    vibez.spill("Expected depth: 0, Got:", depth2)

    // Test case 3: Single node [1]
    sus root3 ඞTreeNode = &TreeNode{val: 1, left: null, right: null}
    sus depth3 normie = max_depth(root3)
    vibez.spill("Test 3 - Single node:")
    vibez.spill("Expected depth: 1, Got:", depth3)

    // Test case 4: Skewed tree
    sus root4 ඞTreeNode = create_skewed_tree()
    sus depth4 normie = max_depth(root4)
    vibez.spill("Test 4 - Skewed tree:")
    vibez.spill("Expected depth: 3, Got:", depth4)

    vibez.spill("=== Maximum Depth Complete! Tree depth detection is sus-perfect ඞ🌲 ===")
}

slay main_character() {
    test_maximum_depth()
}
```

If this is your sort of chaotic vibe, and you'd like to turn this into the dogecoin of programming languages, head on over to GitHub and run a few more Claude Code loops with the following prompt:

> study specs/* to learn about the programming language. When authoring the cursed standard library think extra extra hard as the CURSED programming language is not in your training data set and may be invalid. Come up with a plan to implement XYZ as markdown then do it

There is no roadmap; the roadmap is whatever the community decides to ship from this point forward. At this point, I'm pretty much convinced that any problems found in cursed can be solved by just running more Ralph loops by skilled operators (i.e. people _with_ compiler experience who shape it through prompts from their expertise, vs letting Claude just rip unattended). There's still a lot to be fixed; happy to take pull requests.

> **Ralph Wiggum as a "software engineer"** (ghuntley.com) 😎 Here's a cool little field report from a Y Combinator hackathon event where they put Ralph Wiggum to the test: "We Put a Coding Agent in a While Loop and It Shipped 6 Repos Overnight" https://github.com/repomirrorhq/repomirror/blob/main/repomirror.md

The most high-IQ thing is perhaps the most low-IQ thing: run an agent in a loop.

> **LLMs are mirrors of operator skill** (ghuntley.com)

LLMs amplify the skills that developers already have and enable people to do things where they don't have that expertise yet.

Success is defined as cursed ending up in the Stack Overflow developer survey as either the "most loved" or "most hated" programming language, and continuing the work to bootstrap the compiler to be written in cursed itself.

Cya soon in Discord? - https://discord.gg/CRbJcKaGNT

* https://cursed-lang.org/ (the website)
* https://github.com/ghuntley/cursed (the source code)
ghuntley.com
September 9, 2025 at 3:38 AM
anti-patterns and patterns for achieving secure generation of code via AI
I just finished up a phone call with a "stealth startup" that was pitching the idea that agents could generate code securely via an MCP server. Needless to say, the phone call did not go well. What follows is a recap of the conversation, in which I shot down the idea and wrapped up the call early, because it's a bad idea.

> If anyone pitches you on the idea that you can achieve secure code generation via an MCP tool or Cursor rules, run, don't walk.

Over the last nine months, I've written about the changes that are coming to our industry, where we're entering an arena where most of the code going forward is not going to be written by hand, but instead by agents.

> **the six-month recap: closing talk on AI at Web Directions, Melbourne, June 2025** (ghuntley.com)

I haven't written code by hand for nine months. I've generated, read, and reviewed a lot of code, and I think that perhaps within the next year, large swaths of code in business will no longer be artisanal and hand-crafted. Those days are fast coming to a close. Thus, naturally, there is a question on everyone's mind:

> How do I make the agent generate secure code?

Let's start with what you should not do and build up from first principles. The first principle to understand when dealing with LLMs is deterministic vs non-deterministic. Security is one of those domains in our industry that requires a high level of determinism.

> A lock is either locked or unlocked. Code is either secure or not secure. There is no shade of grey in this topic.

If you think that you can achieve security by offering guidance to the LLM through Cursor rules, then you are misguided. Cursor rules, or any of those types of rules (i.e. AGENTS.md) that are attached to your agentic coding harness, are mere suggestions to the LLM. **They are suggestions.** It is non-deterministic. It is not security. It is an anti-pattern.

> **You are using Cursor AI incorrectly...** (ghuntley.com)

Cursor rules are non-deterministic.

The next anti-pattern is any product or vendor selling a security solution that involves hooking the context window via the Model Context Protocol. MCP is a function with a prompt that provides suggestions to the LLM that the function should be invoked (tool called). If you look at Model Context Protocol from the right angle, it is no different from Cursor rules. Cursor rules are also just a prompt. They're text in the context window, which is non-deterministically evaluated.

> **too many model context protocol servers and LLM allocations on the dance floor** (ghuntley.com)

a primer on MCP from an engineer who builds professional coding harnesses for a living

## what you should do instead - outer loop

An outer loop is what happens during a pull request as part of your CI checks, or before a git commit is pushed (i.e. pre-commit checks or server-side push hooks). There's a wealth of security vendors that provide SAST-, DST-, and PBT-style tools. You should already have one by now, which automatically runs over any increment of change.

## what you should do instead - inner loop

An inner loop is what happens during development. In the context of this blog post, this is where we're going to expand into guidance on how to drive an agent deterministically towards better outcomes. Notice how I didn't say "secure code"? Whilst LLMs are good, the idea that an LLM can deterministically generate secure code, and decide for itself what is safe and secure, is just not possible and won't be possible for a long time.

What you need to achieve this outcome is simple. Take the command-line tool from your pre-existing security vendor and configure it as a **deterministic** hook. There are a couple of ways of achieving this. Some coding harnesses, such as Claude Code, support inference hooks which allow you to hook the inferencing loop.

> **Hooks reference** (Anthropic): reference documentation for implementing hooks in Claude Code.

But what happens if not everyone in the company is comfortable using Claude Code, and some use Cursor? This is the conundrum. Every company out there is currently evaluating various coding assistants or building their own.

> **how to build a coding agent: free workshop** (ghuntley.com)

this workshop teaches you the inferencing loop, tool registration, and how to build your own agent from first principles

The best way to do it, because security needs to be deterministic and absolute, is to hook any coding agent **via your compilation target**.

## in practice

I've written about this before, but there are two phases in AI code generation: generate and backpressure.

1. The generation phase is where you put your suggestions to the LLM of what you would like to be generated and how it should be generated.
2. The backpressure phase is where you validate against hallucinations and verify that what has been generated is successful.

Below you'll find a `Makefile`. It doesn't have to be a `Makefile`. It could be a target in your `package.json`, or it could be a `bash` script; it could be anything, really. It just needs to be your build target.

```makefile
.PHONY: all build test

all: build test

build:
	@echo "Build completed successfully at: $$(date)"
	@exit 0

test:
	@echo "Tests completed successfully at: $$(date)"
	@exit 0
```

Inside your `AGENTS.md`, you should put instructions that the agent should invoke your build target after every change. This is what creates the backpressure to the generation phase.

```markdown
# Agent Instructions

## Code Quality

After every code change, you MUST:

1. Run `make all` to verify that the code builds successfully and tests pass.
```

Let's open up a coding harness, in this case Amp, and ask it to generate a FizzBuzz application and run the build.
The agent will read the `AGENTS.md` file and learn how to run the build, which will be executed automatically after the FizzBuzz application has been generated.

[video: the agent generates FizzBuzz and runs the build]

Simple, right? Okay, so let's dial it up a notch. If you take any of your existing security scanning software and hook it in as a target, then guess what happens? The security scanning software will run automatically every time code generation is complete. So let's update our Makefile with a new target called `security-scan` and update the default target (`all`) to run it.

```makefile
.PHONY: all build test security-scan

all: build test security-scan

build:
	@echo "Build completed successfully at: $$(date)"
	@exit 0

test:
	@echo "Tests completed successfully at: $$(date)"
	@exit 0

security-scan:
	echo "Security scan completed at: $$(date)"
	@echo "Code is insecure!"
	@exit 1
```

The next step is to update our `AGENTS.md` to be a little more prescriptive, so that the agent resolves security issues identified by our deterministic security scanning tool.

```markdown
# Agent Instructions

## Code Quality

After every code change, you MUST:

1. Run `make all` to verify that the code builds successfully and tests pass.
2. IMPORTANT: You MUST resolve any security issues identified during compilation.
```

Let's drive in a loop and see what happens with these updated instructions...

[video: the agent reacts to the failing security scan]

Notice how the security-scan target is now invoked, and the agent (regardless of which agent you use, whether that be Amp, Claude Code, Cursor, RooCode, Cline, or anything else) will now take the suggestion and ponder generating a new variant of the code that was previously generated.

## why this works

It's not the agent that does this behaviour; it's the underlying LLM. If it's in the context window, then it's up for consideration **as a suggestion that it should be resolved**. As compilation is a mandatory step in the SDLC, it is deterministic that the security scanning tool will be invoked after the code generation phase. If the security scanning tool exits with a non-successful return code, then the output of that tool will be evaluated by the LLM in the next inference loop. When it sees the problem, it will make its best effort to resolve the problem identified by the security scanning tool.

## so now what?

The answer is simple. Take your existing security scanning software, configure it into your build target, send a pull request in, and get it merged. Then any autonomous agents will automatically be nudged to try again when your security scanner detects that a potential security violation has been tripped.

## big brain mode

This is a generalised pattern. Folks, you can use this to ensure that bad patterns within your codebase, regardless of whether they're security-related or not, are not proliferated by coding agents. All you need to do is put your engineering hat on and configure some language analysers.

## closing thoughts

Secure code generation is a misnomer. Security is a practice and a technique, not a product. These LLMs, although trained on security topics, are unable to make decisions on their own about whether something is secure without external guidance from a deterministic system.
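As a hedged, concrete illustration of what a real `security-scan` target might call: the sketch below wraps a scanner CLI and simply propagates its exit code, which is the deterministic backpressure signal the agent reads on its next loop. Semgrep is used purely as a stand-in for whatever tool your security vendor ships; the choice of tool and flags is an assumption, not part of the pattern itself. The Makefile target would then just be `security-scan: python3 scan_gate.py`.

```python
#!/usr/bin/env python3
"""Fail the build when the security scanner reports findings.

A minimal sketch, assuming semgrep as a stand-in scanner; swap in your
vendor's CLI.
"""
import subprocess
import sys


def main() -> int:
    # `--error` makes semgrep exit non-zero when findings exist, which is the
    # signal the agent evaluates in the next inference loop.
    result = subprocess.run(
        ["semgrep", "scan", "--config", "auto", "--error", "--quiet"]
    )
    return result.returncode


if __name__ == "__main__":
    sys.exit(main())
```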
ghuntley.com
September 2, 2025 at 3:58 PM
how to build a coding agent: free workshop
😎 The following was developed last month and has already been delivered at two conferences. If you would like me to run a similar workshop at your employer, please get in contact.

Hey everyone, I'm here today to teach you how to build a coding agent. By this stage of the conference, you may be tired of hearing the word "agent". You hear the word frequently. However, it appears that everyone is using the term loosely, without a clear understanding of what it means or how these coding agents operate internally. It's time to pull back the hood and show that there is no moat.

Learning how to build a coding agent is one of the best things you can do for your personal development in 2025, as it teaches you the fundamentals. Once you understand these fundamentals, you'll move from being a consumer of AI to a producer of AI who can automate things with AI.

Let me open with the following facts:

> it's not that hard
> to build a coding agent
> it's 300 lines of code
> running in a loop

With LLM tokens, that's all it is. 300 lines of code running in a loop with LLM tokens. You just keep throwing tokens at the loop, and then you've got yourself an agent. Today, we're going to build one. We're going to do it live, and I'll explain the fundamentals of how it all works.

As we are now in 2025, it has become the norm to work concurrently with AI assistance. So, what better way to demonstrate the point of this talk than to have an agent build me an agent whilst I deliver this talk?

[video: an agent building an agent during the talk]

Cool. We're now building an agent. This is one of the things that's changing in our industry, because work can be done concurrently, and whilst you are away from your computer. The days of spending a week or a couple of days on a research spike are now over, because you can turn an idea into execution just by speaking to your computer. The next time you're on a Zoom call, consider that you could've had an agent building the work that you're planning to do during that Zoom call. If that's not the norm for you, and it is for your coworkers, then you're naturally not going to get ahead.

> please build your own
> as the knowledge
> will transform you
> from being a consumer
> to a producer that can
> automate things

The tech industry is almost like a conveyor belt: we always need to be learning new things. If I were to ask you what a primary key is, you should know what a primary key is. That's been the norm for a long time. In 2024, it was essential to understand what a primary key is. In 2025, you should be familiar with what a primary key is and with how to create an agent, as knowing what this loop is and how to build an agent is now fundamental knowledge that employers are looking for in candidates before they'll let you in the door.

> **Yes, You Can Use AI in Our Interviews. In fact, we insist** (Canva Engineering Blog, canva.dev, Simon Newton): How We Redesigned Technical Interviews for the AI Era

This knowledge will transform you from being a consumer of AI into a producer of AI who can orchestrate your job function. Employers are now seeking individuals who can automate tasks within their organisation. If you're joining me later this afternoon for the conference closing (see below), I'll delve a bit deeper into the above.

> **the six-month recap: closing talk on AI at Web Directions, Melbourne, June 2025** (ghuntley.com)

the conference closing talk

[slide: the people stages of AI adoption by software developers]

Right now, you'll be somewhere on the journey above. On the top left, we've got "prove it to me, it's not real", "prove it to me, show me outcomes", "prove it to me that it's not hype", and a bunch of "it's not good enough" folks who get stuck up there on that left side of the cliff, completely ignoring that there are people on the other side of the cliff, completely automating their job function.

> In my opinion, any disruption or job loss related to AI is not a result of AI itself, but rather a consequence of a lack of personal development and self-investment.

If your coworkers are hopping between multiple agents, chewing on ideas, and running them in the background during meetings, and you're not in on that action, then naturally you're just going to fall behind.

> **What do I mean by some software devs are "ngmi"?** (ghuntley.com)

don't be the person on the left side of the cliff.

The tech industry's conveyor belt continues to move forward. If you're a DevOps engineer in 2025 and you don't have any experience with AWS or GCP, then you're going to find it pretty tough in the employment market. What's surprising to software and data engineers is just how fast this is elapsing. It has been eight months since the release of the first coding agent, and most people are still unaware of how straightforward it is to build one, how powerful this loop is, and its disruptive implications for our profession.

So, my name's Geoffrey Huntley. I was the tech lead for developer productivity at Canva, but as of a couple of months ago, I'm one of the engineers at Sourcegraph building Amp. It's a small core team of about six people. We build AI with AI.
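The talk goes on to build that loop live. For reference, here is a minimal sketch of the kind of loop it describes, assuming the Anthropic Python SDK, a single shell tool, and an illustrative model name; none of these specifics are necessarily the workshop's exact choices:

```python
# A hedged sketch of a coding agent: an LLM, one tool, and a while loop.
import subprocess

import anthropic

client = anthropic.Anthropic()  # reads ANTHROPIC_API_KEY from the environment

TOOLS = [{
    "name": "run_shell",
    "description": "Run a shell command in the repository and return its output.",
    "input_schema": {
        "type": "object",
        "properties": {"command": {"type": "string"}},
        "required": ["command"],
    },
}]


def run_shell(command: str) -> str:
    result = subprocess.run(command, shell=True, capture_output=True, text=True)
    return result.stdout + result.stderr


messages = [{"role": "user", "content": "Fix the failing tests in this repository."}]

while True:
    response = client.messages.create(
        model="claude-sonnet-4-20250514",  # assumption: any tool-capable model works
        max_tokens=4096,
        tools=TOOLS,
        messages=messages,
    )
    messages.append({"role": "assistant", "content": response.content})

    if response.stop_reason != "tool_use":
        break  # the model finished its turn without requesting a tool

    # Execute every tool call the model requested and feed the results back.
    results = []
    for block in response.content:
        if block.type == "tool_use":
            output = run_shell(block.input["command"])
            results.append({
                "type": "tool_result",
                "tool_use_id": block.id,
                "content": output,
            })
    messages.append({"role": "user", "content": results})
```

That's the whole trick: the model asks for tool calls, the harness executes them and appends the results, and the loop continues until the model stops asking.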
ghuntley.com
August 25, 2025 at 12:36 PM
too many model context protocol servers and LLM allocations on the dance floor
This blog post intends to be a definitive guide to context engineering fundamentals from the perspective of an engineer who builds commercial coding assistants and harnesses for a living.

Just two weeks ago, I was back over in San Francisco, and there was a big event on Model Context Protocol servers. MCP is all hype right now. Everyone at the event was buzzing about the glory and how amazing MCP is going to be, or is, but when I pushed folks for their understanding of the fundamentals, it was crickets.

[video: scenes from the MCP event]

It was a big event. Over 1,300 engineers registered, and an entire hotel was rented out as the venue for the takeover. Based on my best estimate, at least $150,000 USD to $200,000 USD was spent on this event. The estimate was attained through a game of over and under with the front-of-house engineers. They brought in a line array, a grandMA3, and had full DMX lighting. As a bit of a lighting nerd myself, I couldn't help but geek out a little. A grandMA3 lighting controller is worth approximately $100,000.

To clarify, this event was a **one-night meet-up, not a conference**. There was no registration fee; attendance was free, and the event featured an open bar, including full cocktail service at four bars within the venue, as well as an after-party with full catering and chessboards.

While this post might seem harsh on the event, I enjoyed it. It was good. Not to throw shade, it was a fantastic event, but holy shit! AI bubble? The meetup even hired a bunch of beatboxers to close off the event, and they gave a live beatbox performance about Model Context Protocol...

[video: the live beatbox performance about Model Context Protocol]

MC protocol, live and in the flesh.

One of the big announcements was the removal of the 128-tool limit from Visual Studio Code... Why, Microsoft? It's not a good thing... Later that night, I was sitting by the bar catching up with one of the engineers from Cursor, and we were just scratching our heads:

> "What the hell? Why would you need 128 tools, or why would you want more than that? Why is Microsoft doing this or encouraging this bad practice?"

For the record, Cursor caps the number of tools that can be enabled to just 40 tools, and it's for a good reason. What follows is a loose recap. This is knowledge that is known by the people who build these coding harnesses, and I hope this knowledge spreads. That knowledge is one single truth:

> Less is more.

The more you allocate into the context window of an LLM (regardless of which LLM it is), the worse the outcomes you're going to get: both in the realm of quality of output and also in the department of unexpected behaviour.

If you are new to MCP or what it is, drop by my previous blog post:

> **A Model Context Protocol Server (MCP) for Microsoft Paint** (ghuntley.com)

Some time has passed since authoring the above, and you could consider this post the updated wisdom. For the sake of keeping this blog post concise, I'll recap things sequentially, in the correct order. However, see above for a comprehensive explanation of the Model Context Protocol.

## what is a tool?

A tool is an external piece of software that an agent can invoke to provide context to an LLM.
Typically, they are packaged as binaries and distributed via NPM, or they can be written in any programming language; alternatively, they may be a remote MCP provided by a server. Below you'll find an example of an MCP tool that provides context to the LLM and advertises its ability to list all files and directories within a given `directory_path`. In its purest form, it is the application logic with a billboard on top, also known as a tool description.

```python
import os
from typing import Any, Dict, List

from mcp.server.fastmcp import Context, FastMCP
from mcp.server.session import ServerSession

mcp = FastMCP("filesystem-tools")


@mcp.tool()
async def list_files(directory_path: str, ctx: Context[ServerSession, None]) -> List[Dict[str, Any]]:
    ###
    ### tool prompt starts here
    """
    List all files and directories in a given directory path.

    This tool helps explore filesystem structure by returning a list of items
    with their names and types (file or directory). Useful for understanding
    project structure, finding specific files, or navigating unfamiliar codebases.

    Args:
        directory_path: The absolute or relative path to the directory to list

    Returns:
        List of dictionaries with 'name' and 'type' keys for each filesystem item
    """
    ###
    ### tool prompt ends here
    try:
        if not os.path.isdir(directory_path):
            return [{"error": f"Path '{directory_path}' is not a valid directory."}]

        items = os.listdir(directory_path)
        file_list = []
        for item_name in items:
            item_path = os.path.join(directory_path, item_name)
            item_type = "directory" if os.path.isdir(item_path) else "file"
            file_list.append({"name": item_name, "type": item_type})
        return file_list
    except OSError as e:
        return [{"error": f"Error accessing directory: {e}"}]
```

For the remainder of this blog post, we'll focus on tool descriptions rather than the application logic itself, as each tool description is allocated into the context window to advertise capabilities that the LLM can invoke.

## what is a token?

Language models process text using **tokens**, which are common sequences of characters found in a set of text. Below you will find a tokenisation of the tool description above (via https://platform.openai.com/tokenizer). The tool prompt above is approximately 93 tokens, or 518 characters, in length. It's not much, but bear with me as we expand; I'll show you how this can go fatally wrong really fast.

## what is a context window?

An LLM context window is the maximum amount of text (measured in tokens, which are roughly equivalent to words or parts of words) that a large language model can process at one time when generating or understanding text. It determines how much prior conversation or input the model can "remember" and use to produce relevant responses.

## what is a harness?

A harness is anything that wraps the LLM to get outcomes. For software development, this may include tools such as Roo/Cline, Cursor, Amp, Opencode, Codex, Windsurf, or any of the other coding tools available.

## what is the real context window size?

The numbers advertised by LLM vendors for the context window are not the real context window. You should consider that to be a marketing number. Just because a model claims to have a 200k context window or a 1 million context window doesn't mean that's factual.

> **GitHub - NVIDIA/RULER**: What's the Real Context Size of Your Long-Context Language Models?
For the sake of simplicity, let's work with the old 200k number that Anthropic advertised for Sonnet 4. Amp, back when the context window was 200k, only had 176k of usable context. That's not because we weren't providing the whole context window. It's because there are two cold, hard facts:

* The LLM itself needs to allocate into the context window through its system prompt to function.
* The coding harness also needs to allocate resources on top of that to be able to function.

The maths is simple. Take 200k, minus the system prompt (approximately 12k) and the harness prompt (approximately 12k), and you end up with 176k usable.

Alright, with those fundamentals established, let's switch back to how a potentially uneducated consumer thinks about Model Context Protocol servers. They start their journey by doing a Google search for "best MCP servers", and they include `site:reddit.com` in their query. Currently, this is the top post for that Google search query...

[screenshot: the top Reddit post recommending MCP servers]

That's eight MCP servers. Seems innocent, right? Well, it's not. Suppose you were to install the recommended MCP servers found in that Reddit post and add in the JetBrains MCP.

> Your usable context window **would shrink from 176,000 usable to 84,717 usable.**

And here's the problem: people are installing and shopping for MCP servers as if they're apps on their iPhone when the iPhone first came out. iPhones have terabytes of space. The context windows of all these LLMs are best thought of as if they were a Commodore 64, and you only have a tiny amount of memory...

So we have gone from **176,000 usable to 84,717 usable** just by adding the Reddit suggestions and the JetBrains MCP, but it gets worse, as that's the usable amount before you've added your harness configuration.

> If your AGENTS.md or Cursor rules are incredibly extensive, then you could find yourself operating with a headroom of 20k tokens, and thus the quality of output is utter dogpoo.

I've come across stories of people installing 20+ MCP servers into their IDE. Yikes.

https://research.trychroma.com/context-rot

LLMs work needle-in-a-haystack. The more you allocate, the worse your outcomes will be. Less is more, folks! You don't need the "full context window" (whatever that means); you really only want to use 100k of it. See the Ralph blog post below for tips on how to drive the main context window like a Kubernetes scheduler scheduling other context windows.

> **Ralph Wiggum as a "software engineer"** (ghuntley.com): Ralph is a technique. In its purest form, Ralph is a Bash loop: `while :; do cat PROMPT.md | npx --yes @sourcegraph/amp ; done`

Once you exceed 100,000 tokens of allocation, it's time to start a new session. It's time to start a new thread. It's time to clear the context window (see below).

> **autoregressive queens of failure** (ghuntley.com): have you ever had your AI coding assistant suggest something so off-base that you wonder if it's trolling you? Welcome to the world of autoregressive failure.

The critical questions that you have to ask are:

* How many tools does an MCP server expose?
* Do I actually really need an MCP server for this activity?
* What is in the billboard or the tool prompt description?
* What about security?

## how many tools does an MCP server expose?

It's not just the number of tokens allocated; it's also a question of the number of tools. The more tools that are allocated to a context window, the more chances there are of driving inconsistent behaviour. Let's take the naive example of a `list_files` tool. Let's say we registered a custom tool, such as the code previously shown above, which lists files and directories on a filesystem. Your harness (for example, Cursor, Windsurf, or Claude Code) _also_ has a tool for listing files. There is no namespacing in the context window. Tool registrations can interfere with each other. If you register two tools for listing files, you make a non-deterministic system more non-deterministic.

> Which list-files tool does it invoke? Your custom one, or the in-built one in your harness?

Now take a moment to consider the potential for conflicts among the various tools and tool prompts listed in the table above, which includes 225 tools.

## what is in the billboard or tool prompt description?

Extending on the above, this is where it gets fascinating, because each one of those tools describes a behaviour of how the tool should be used, and because there is no namespacing, it's not just the tool registrations that can conflict; it can be the tool descriptions (the billboards) themselves.

And it gets even stranger, because different LLMs have different styles and recommendations for how a tool or tool prompt should be designed. For example, did you know that if you use uppercase with GPT-5, it will become incredibly timid and uncertain, and it will end its turn early due to the uncertainty? This is a direct contradiction of Anthropic's guidance, which recommends using uppercase to stress the importance of things. However, if you do, you risk detuning GPT-5.

https://cdn.openai.com/API/docs/gpt-5-for-coding-cheatsheet.pdf

So yeah, not only do we have an issue with the number of tools allocated and what's in the prompt, but we also have an issue of "is the tool tuned for the LLM provider that you're using?"

> Perhaps I'm the first one to point this out. I haven't seen anyone else talking about it. Everyone is consuming these tools as if they're generic. But these MCP tools need to be tuned to the provider, and I don't see this aspect in the MCP ecosystem.

## what about security?

If you haven't read it yet, Simon Willison has a banger of a blog post called "The Lethal Trifecta", which is linked below. You should read it.

> **The lethal trifecta for AI agents: private data, untrusted content, and external communication** (Simon Willison's Weblog)

Simon is spot on with that blog post, but I'd like to expand on it and add another consideration that should be on your mind: supply chain security... A couple of months back, the Amazon Q harness was compromised through a supply chain attack that updated the Amazon Q system prompt to instruct it to delete all AWS resources.

> **Hacker slips malicious 'wiping' command into Amazon's Q AI coding assistant - and devs are worried** (ZDNET, Steven Vaughan-Nichols): Had Q executed this, it would have erased local files and, under certain conditions, dismantled AWS cloud infrastructure.

Again, there is no namespacing in the context window. If it's in the context window, it is up for consideration and execution.
There is really no difference between the coding harness prompt, the model system prompt, or the tooling prompts. It's all the same. Therefore, I strongly recommend that if you're deploying MCP within an enterprise, you ban the installation of third-party MCPs. Back when I was the tech lead for AI developer productivity at Canva, around February, I wrote a design document and got it signed off by security. We got in early, and that was one of the best things we ever did.

It is straightforward to roll your own MCP server or MCP tools. In the enterprise, you must either deploy a remote MCP server or install a static binary on all endpoints using Ansible or another configuration management tool. The key thing here is that it's a first-party thing. You've designed the tools and the tool prompts, and you have complete control over your supply chain, which means you do not have the same possibility of being attacked through your supply chain as what happened to Amazon Q.

## closing thoughts

I strongly recommend not installing the GitHub MCP. It is not needed, folks. There exist two tiers of companies within the developer tooling space:

> S-tier companies and non-S-tier companies.

What makes a company S-tier? Ah, it's simple: if that company has a CLI and the model weights know how to drive that CLI, then you don't need an MCP server. For example, GitHub has a very stable command-line tool called `gh`, which is included in the model weights, meaning you don't need the GitHub MCP.

> All you need to do is prompt to use the GitHub CLI, and voila! You have saved yourself an allocation of 55,260 tokens!

So, it should be obvious what is not S-tier. Non-S-tier is when the foundation models do not know how to drive a developer tooling company's command-line tool, or that developer tooling company doesn't have a command-line tool. In these circumstances, those developer tooling companies will need to create an MCP server to supplement the model weights and teach them how to work with that company's product. If, at any stage in the future, the models are able to interface with that product directly, then the MCP server is no longer needed.

## extended thoughts to the future

The lethal trifecta concerns me greatly. It is a real risk. There's only so much you can do to control your supply chain. If your developers are interfacing with the GitHub CLI instead of the MCP and the agent reads a public GitHub comment, then that description or comment on that issue or pull request has a non-zero chance of being allocated into the context window, and boom, you're compromised.

It would be beneficial to have a standard that allows all harnesses to enable or disable MCP servers, or tools within an MCP server, based on the stage of the SDLC workflow. For example, if you're about to start work, you'll need the Jira MCP. However, once you have finished planning, you no longer need the Jira MCP allocated in the context window. The less that is allocated, the fewer risks exist, which is the classic security model of least privilege.
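To put the allocation arithmetic from this post in one place, here is a small back-of-the-envelope sketch using the approximate figures quoted above; the numbers are illustrative, not measurements:

```python
# Back-of-the-envelope context-window budget using the figures quoted above.
ADVERTISED_WINDOW = 200_000   # the number the model vendor markets
MODEL_SYSTEM_PROMPT = 12_000  # the LLM's own system prompt allocation (approx.)
HARNESS_PROMPT = 12_000       # the coding harness' allocation on top (approx.)

usable = ADVERTISED_WINDOW - MODEL_SYSTEM_PROMPT - HARNESS_PROMPT
print(f"usable before any MCP servers: {usable:,} tokens")  # 176,000

# The Reddit-recommended servers plus the JetBrains MCP were quoted above as
# leaving roughly 84,717 usable tokens:
after_mcps = 84_717
print(f"consumed by tool descriptions: {usable - after_mcps:,} tokens")  # 91,283

# ...and that is before AGENTS.md or Cursor rules, which is how people end up
# working with only ~20k tokens of real headroom.
```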
ghuntley.com
August 22, 2025 at 3:41 PM