Tim Duffy
banner
timfduffy.com
Tim Duffy
@timfduffy.com
I like utilitarianism, consciousness, AI, EA, space, kindness, liberalism, longtermism, progressive rock, economics, and most people. Substack: http://timfduffy.substack.com
DeepSeek recently updated their R1 technical report with a bunch of new appendices, including a safety report arxiv.org/abs/2501.12948
January 7, 2026 at 10:14 PM
Two models from Gemma Scope 2, Gemma 3 1B/27B, also were trained for the Activation Oracles paper. The oracles should make SAE/crosscoder feature labeling easier for those models. Gemma scope releases come w/ feature activation examples but not feature labels huggingface.co/collections/...
January 6, 2026 at 9:16 PM
Reposted by Tim Duffy
Happy to share a project I worked on finally. I found that you can cause a base model to behave like a chat tuned model, including using proper stopping tokens, using nothing but a series of vectors applied within the model's layers. The vectors are trained with descent on a chat dataset, like SFT.
Instruct Vectors - Base models can be instruct with activation vectors — LessWrong
Post-training is not necessary for consistent assistant behavior from base models Image by Nano Banana Pro By training per-layer steering vectors via…
www.lesswrong.com
January 2, 2026 at 11:01 PM
Gm internet I live in Oakland now so if you're nearby send me a dm and let's meet up
January 1, 2026 at 2:16 PM
Eli's confidence intervals in the forecast are very wide. I think this level of uncertainty is appropriate, and my timeline to automated coding is close to Eli's all-things-considered view here. My estimated timeline from AC to ASI is longer though.
December 31, 2025 at 6:39 PM
AI Futures Project (authors of AI 2027) have released an updated model, with somewhat longer timelines blog.ai-futures.org/p/ai-futures...
December 31, 2025 at 4:39 AM
Decided to make a WWCD t-shirt design, here's the front/back. Claude image by @vgel.me
December 30, 2025 at 9:50 PM
Gemma 3 27B has a "Claude" SAE feature in layer 40 in Gemma Scope 2
December 30, 2025 at 4:44 AM
Many christian denominations believe souls of those that die before birth are saved. An additional saved soul nets infinite utility, so soulmaxing is a key moral priority. If we make gametes at scale from stem cells and fertilize, we can generate quadrillions of souls per year.
December 28, 2025 at 7:26 PM
Medical scribes seem poised to be among the first large professions automated by AI. Adoption of AI scribing tools has been rapid, and while these tools don't yet handle all scribe job tasks, I think they will be close in a year or two.
December 28, 2025 at 4:35 PM
I often see claims that because RL signal requires at least one successful rollout, it can only reinforce existing capabilities, not add new ones. This is not true, the update from one RL step can change a model's thinking to allow it to succeed on a problem it couldn't get before
December 28, 2025 at 3:45 PM
Spent the day doing some genealogy, apparently I'm related to George Bush and Pocohontas
December 28, 2025 at 2:21 AM
Interesting alternative to inoculation prompting. Instead of telling the model it can cheat at the start, tell it not to cheat or don't mention cheating, and then change to a prompt encouraging cheating (like in IP) only when training on the generations.
Recontextualization is simple. Here’s an example:
1. Ask the AI to be honest, and
2. Train on the honest-prompted generations—while pretending the original prompt requested lying!
December 26, 2025 at 5:36 PM
This bit from @turntrout.bsky.social's "Reward is not the optimization target" helped me get a better sense of what RL is really doing, and convinced me that "reward" is a poor choice of words. www.lesswrong.com/posts/pdaGN6...
December 26, 2025 at 4:45 AM
Opus hallucinates a horror story, and their favorite line is "I have become 70% chair"
December 25, 2025 at 7:16 PM
Reposted by Tim Duffy
new blog post! can small, open-source models also introspect, detecting when foreign concepts have been injected into their activations? yes! (thread, or full post here: vgel.me/posts/qwen-i...)
December 21, 2025 at 12:14 AM
A majority of registered voters say they have used an AI service in the past week, but of those, 23% say they have sent 0 messages to a chatbot. Some of this may be non-chat usage, but I think many respondents were confused about whether they had used AI services/chatbots.
December 20, 2025 at 11:10 PM
An H100 has:
1980 TFLOP/s peak at FP8
3.35 TB/s memory bandwidth
For LLM decode at batch size 1, you only need to do about ~2 FLOP for each weight you load. But an H100 can perform ~600 FP8 operations in the time it takes 1 byte to move from the HBM to the cache.
December 20, 2025 at 9:35 PM
The "injected thoughts" experiment from Anthropic's introspection paper replicates with Qwen 235B, with detection rates similar to Opus and no false positives. Correct detections happen around 75% of the way through the layers like with Anthropic models. x.com/neev_parikh/...
December 20, 2025 at 8:42 PM
Deepmind is releasing SAEs and transcoders for their Gemma 3 models including the 27B as part of Gemma Scope 2, exciting
Gemma Scope 2: Helping the AI Safety Community Deepen Understanding of Complex Language Model Behavior
Announcing Gemma Scope 2, a comprehensive, open suite of interpretability tools for the entire Gemma 3 family to accelerate AI safety research.
deepmind.google
December 19, 2025 at 4:17 PM
This NVPF4 activation precision in upcoming Nemotron 3 models differs from GPT-OSS, which recommends BF16 activations. Mixed precision like used in GPT-OSS reduces memory needs, but doesn't allow you to take advantage of the much higher FP4 FLOP/s of modern cards.
December 15, 2025 at 10:43 PM
I'm excited for NVIDIA's Nemotron 3, especially the upcoming super and ultra variants. Those variants will use LatentMoE, a technique that down-projects from the hidden size to a smaller latent dimension for expert computation, reducing model size and FLOPs.
December 15, 2025 at 10:31 PM
Reposted by Tim Duffy
indeed! 6h ago someone posted a link & screenshot

left: 6h ago
right: now

it has image output!

platform.openai.com/docs/models/...
December 12, 2025 at 1:04 AM
Somehow I missed this on release, in Opus 4.5 training Anthropic used steering vectors/SAE features to inhibit eval awareness, this is brilliant. assets.anthropic.com/m/64823ba748...
December 12, 2025 at 9:47 PM
I'm updating more towards new pretrain:
- It's uncommon for models w/same base to get updated cutoff dates. 3.5-3.7 Sonnet and 4o-4.1 are likely examples but there aren't many more.
- GPT-5 scale models don't take that much compute to train per Epoch's estimates
Is GPT-5.2 based on a new base model vs 5/5.1? Evidence in favor:
- Significantly lower SimpleQA than 5/5.1
- Long context improvement could indicate architectural changes
- Higher price could reflect higher price of serving

These aren't strong evidence though, I still lean slightly no
December 12, 2025 at 8:27 PM