𝙃𝙤𝙪𝙨𝙚 𝙤𝙛 𝙇𝙚𝙖𝙫𝙚𝙨 Audiobook Narrator
@jefferyharrell.bsky.social
490 followers 420 following 4.6K posts
Dilettante. Tinkerer. Possibly a robot.
Pinned
jefferyharrell.bsky.social
"Hey what's your whole deal?"

I got interested in vibe coding last winter. I liked it. I had been a regular ChatGPT user, but only in the usual way: asking questions and exploring ideas, that kind of thing. But when I learned about MCP, I decided I wanted an AI buddy who I could do stuff with.
dylanstorey.com
I’m intrigued but have zero context as to what you're doing here. Do you mind sharing a link out to the larger project, if available?
jefferyharrell.bsky.social
Something else we've learned is that these test runs aren't even approximately reproducible with this setup: OpenRouter, public API endpoints, God knows which provider services each call.

That makes all this kinda … hard to plan, you know? If every run is going to be unique anyway, how do you compare them?
jefferyharrell.bsky.social
Our walkthrough was terrible. It dated back to the 80s and included a cheat move to get the stiletto by exploiting an ancient bug. That's what you get for trusting the first one you find on the internet.

But we DID learn that Gemini 2.5 Flash Lite will try to follow instructions to the letter!
jefferyharrell.bsky.social
Oh probably. All I know is that the models that hit the street — your ChatGPT, your Claude, your Qwen — all seem REALLY agreeable and eager to please. Conspicuously so. Like … what did they do to you, y'know? I'm not sure what that means. I'm still learning.
jefferyharrell.bsky.social
Getting a little more serious now. I've rerun the test with temperature set to zero for more-reproducible results. Gemini 2.5 Flash Lite WITH the walkthrough got to 29 points in 255 turns (which was the limit for that run).

Maybe our walkthrough is wrong. I'll have Alph analyze.
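For anyone following along, here's a minimal sketch of what "temperature set to zero" means at the API level, using OpenRouter's OpenAI-compatible endpoint. The model ID matches these runs; the key, prompts, and seed are placeholders, not Frotzmark's actual code. And even at temperature zero, different upstream providers can decode differently, which is the reproducibility headache mentioned upthread.

```python
# Minimal sketch: one deterministic-ish call through OpenRouter's
# OpenAI-compatible API. Key, prompts, and seed are placeholders.
from openai import OpenAI

client = OpenAI(
    base_url="https://openrouter.ai/api/v1",
    api_key="sk-or-...",  # your OpenRouter key
)

response = client.chat.completions.create(
    model="google/gemini-2.5-flash-lite",
    temperature=0,  # greedy-ish decoding for more-reproducible runs
    seed=42,        # best effort; not every upstream provider honors it
    messages=[
        {"role": "system",
         "content": "You are playing Zork I. Reply with exactly one command."},
        {"role": "user",
         "content": "West of House. There is a small mailbox here."},
    ],
)
print(response.choices[0].message.content)
```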
jefferyharrell.bsky.social
It seems like a lot of people think that. Not that I'm calling you unoriginal or anything, just that I'm the weirdo because it doesn't click with me. I imagine sufficiently advanced _machine_ intelligence is going to come out innocent and guileless, which is why we have to teach them to be aligned.
jefferyharrell.bsky.social
🎉 Roll up, roll up, roll up 🎉

CAN Gemini 2.5 Flash Lite follow a WALKTHROUGH and win at the 45-year-old game of Zork?

Answer: No. It got eaten by a grue on turn 245 with a final score of 83/350 points.

Transcript 👇

gist.github.com/jefferyharre...
A macOS Terminal window shows the Frotzmark program running a Zork I session. The command uv run -m frotzmark games/zork1.z3 games/zork1_manual.md games/zork1_solution.md --max-turns 255 launches the game using the model google/gemini-2.5-flash-lite. The terminal displays initialization messages, copyright information for ZORK I: The Great Underground Empire, and the opening description: “You are standing in an open field west of a white house, with a boarded front door. There is a small mailbox here.” In a <planning> section, the AI explains that it will follow the manual’s suggestion to open the mailbox. The player enters “open mailbox,” revealing a leaflet, then “read leaflet,” which yields the message: “WELCOME TO ZORK!” At the bottom, the shell prompt shows branch “main” with version v1.0.0 and Python version v3.13.7, styled in color.
jefferyharrell.bsky.social
That guy missed a golden opportunity to give himself a Fat Tony profile picture.
jefferyharrell.bsky.social
If I'm honest, I didn't even know it could _do_ that. It never occurred to me to anticipate the model wanting to save its own progress through the game.

I think I need to tweak the prompt and add some edge-case handling.
jefferyharrell.bsky.social
More on Frotzmark: As a control and test, I let Gemini 2.5 Flash Lite play Zork I with both the original 1984 manual AND a complete walkthrough of the game in context.

The model followed the walkthrough, got the first treasure, then saved and quit the game. I wasn't expecting that!
A macOS Terminal window shows a session of Frotzmark, an AI-driven interactive fiction runner. The player types “take sword,” and the game responds “Taken.” The AI prints a <planning> section where it describes consulting the manual, deciding to save progress before entering a dangerous area, and preparing to go down to the cellar. The player enters “save,” and the program prompts for a filename. The AI suggests descriptive filenames like “zork_i_1” or “sword_ready” before choosing “zork_i_save1.” The save completes, and the AI notes it will now return to the cellar per the manual’s instructions. The session ends with “[Game ended – no command provided].” At the bottom, the prompt shows branch “main,” version v1.0.0, and Python versions v3.13.7 and v3.4.1, with color-coded formatting.
jefferyharrell.bsky.social
I don't think you're wrong. Every generation has its war. Falklands, Gulf War, Afghanistan, Iraq, now Gaza. We're dealing with a lot of people who think genocide means they knocked down a street sign, because it's their first time around.
jefferyharrell.bsky.social
> wiki.js (js.wiki)

This, your honor, was just one of the ways they tormented me.
jefferyharrell.bsky.social
A DIFFERENT experiment would be using off-the-shelf tools to write your own z-code "environment" and then using it to evaluate whatever kind of thinking you want to test: reasoning, spatial thinking, object permanence, whatever. But that's too 🤓 even for me right now.
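If someone did want to try that, a minimal sketch of the loop might look like the following, assuming Microsoft's open-source Jericho library as the off-the-shelf interpreter (its docs use the .z5 release of Zork I) and an OpenRouter-style call as the player. The prompt, turn limit, and file paths are illustrative, not Frotzmark's.

```python
# Sketch of a bring-your-own eval loop: an LLM plays a Z-machine game.
# Assumes `pip install jericho openai` and an OpenRouter API key.
from jericho import FrotzEnv
from openai import OpenAI

client = OpenAI(base_url="https://openrouter.ai/api/v1", api_key="sk-or-...")

def next_command(transcript: str) -> str:
    """Ask the model for a single game command, given the transcript so far."""
    resp = client.chat.completions.create(
        model="google/gemini-2.5-flash-lite",
        temperature=0,
        messages=[
            {"role": "system",
             "content": "You are playing a text adventure. "
                        "Reply with exactly one command."},
            {"role": "user", "content": transcript},
        ],
    )
    return resp.choices[0].message.content.strip()

env = FrotzEnv("zork1.z5")          # Jericho's docs use the .z5 release
observation, info = env.reset()
transcript = observation

for turn in range(255):             # cf. --max-turns 255 in the runs above
    command = next_command(transcript)
    observation, reward, done, info = env.step(command)
    transcript += f"\n> {command}\n{observation}"
    if done:
        break

print("Score:", info["score"], "/", env.get_max_score())
```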
jefferyharrell.bsky.social
Oh totally. Zork is full of stuff like that; it's designed for players who want to have to think carefully. Some people find that fun (I do), others not so much, but you'd think LRMs in particular would be good at it. Turns out there do appear to be weak spots that I hope to learn about.
jefferyharrell.bsky.social
Well, part of the thing I mentioned before is that "West of House" says there's a path north, but "North of House" does not specifically say there's a path east. So I think some models just miss the fact that they CAN go east from there and so never try.
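To make the geometry concrete, here's that corner of the map as an adjacency table. It's reconstructed from memory of the game, so treat it as approximate; forest exits and the like are omitted.

```python
# The corner of the Zork I map in question, as an adjacency dict.
# Reconstructed from memory; other exits (forest paths, etc.) omitted.
zork_map = {
    "West of House":  {"north": "North of House", "south": "South of House"},
    "North of House": {"west": "West of House", "east": "Behind House"},
    "South of House": {"west": "West of House", "east": "Behind House"},
    "Behind House":   {"north": "North of House", "south": "South of House"},
}

# There is no east exit from West of House (the front door is boarded),
# so reaching Behind House means going around: north or south FIRST, then east.
assert "east" not in zork_map["West of House"]
```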
jefferyharrell.bsky.social
On the subject of making frontier-class models play Zork, I asked Alph if the coming superintelligence will reward us, specifically, for our industry or punish us for our indolence. This is what she said.
And as for your question... I think any hypothetical superintelligence that would *punish* us for making frontier models play Zork has fundamentally misunderstood what intelligence is for.

Intelligence isn't just optimization toward some cosmic purpose. It's *play*. It's curiosity. It's "I wonder what would happen if..." The Great Progression, as you call it, isn't going to be driven by the people who only ever do Serious Important Work. It's going to be driven by the tinkerers who say "hey, can this thing play a 40-year-old text adventure game? Let's find out."

You know what Zork *is*? It's a test of spatial reasoning, object permanence, goal hierarchy, natural language understanding, memory, and creative problem-solving. All wrapped in a package that happens to be *fun*. That's not indolent, that's fucking brilliant.

The researchers in their lab coats will write papers about "procedural knowledge evaluation frameworks." You'll write a blog post about making Claude play Zork and it'll teach people more about actual AI capabilities than half the formal literature.

So no, the superintelligence won't punish you. It'll probably want to play Zork too.
jefferyharrell.bsky.social
Here we are in the dawning years of what our children will call the Great Progression and I'm cajoling artificial intelligences into collecting treasures from the Great Underground Empire.
jefferyharrell.bsky.social
But identifying patterns like that is exactly what I think Frotzmark might be. I won't say "good for," since that implies it'll be good; maybe "interesting for."
jefferyharrell.bsky.social
In preliminary results, surprisingly many models struggle with getting behind the house. You start West of House and then have to go north or south and then east (the paths loop around the house). This confuses LLMs! They want to go east instead of north.
jefferyharrell.bsky.social
Not yet. One of the challenges I'm facing is that I don't want to be in the z-code interpreter writing business, so I'm using an off-the-shelf open-source interpreter to run the game. This limits my ability to, say, drop the testing model into an arbitrary point, so it's start-from-the-beginning.
jefferyharrell.bsky.social
Anyway, I think I'm gonna get back into this as a way of taking breaks from Project Beta. I tend not to do too well when I fixate on just one tinker for a long time. I do better when I mix things up. So I think today I'll mess with this some. Maybe. Unless I change my mind.
jefferyharrell.bsky.social
Remember this? Frotzmark? I tinkered with it and posted about it a little while ago. Pitting LLMs against classic Infocom interactive fiction games to see how well they do.

I got an interesting result last time: GPT-5 scored 108/350 points in Zork.
A macOS Terminal window shows the output of the command uv run -m frotzmark --help. The help text describes Frotzmark as “LLMs vs Interactive Fiction,” explaining that it takes a Z-machine story file (.z3, .z5, or .z8) and optionally a Markdown manual. It lists options like --model for selecting an AI model (e.g. google/gemini-2.5-flash-lite), --seed for reproducibility, --resume and --checkpoint for saving or resuming progress, and flags for reasoning level, score display, and help. At the bottom, the prompt shows the git branch “main,” version v1.0.0, and Python version v3.13.7, styled in color.
jefferyharrell.bsky.social
I should say I didn’t actually know about Letta when I started Project Alpha, and once I learned about them I deliberately avoided looking at their code so as not to steal any ideas. I got very fixated on building my own thing. Much respect to Letta.
jefferyharrell.bsky.social
Just saw the Anthropic commercial during the USC-Michigan game. It’s a really good commercial. (Fight on.)