Lightnews — Scholar-powered news

Tim Duffy

@timfduffy.com

I'm surprised by this steep Opus 4.5 price cut. Has the price of serving it fallen dramatically, or is this just a change to margins?

I'd only expect the cost of serving it to fall if there were an architecture change or with much better inference hardware/software.

November 24, 2025 at 8:21 PM

Tim Duffy

@timfduffy.com

I'm surprised that AI lab revenue growth rates have remained steady as long as they have. I expect these lines to bend down a bit soon though, both OpenAI and Anthropic are estimating growth factors of more like 2-3x for 2026. epoch.ai/data/ai-comp...

November 20, 2025 at 8:16 PM

Reposted by Tim Duffy

Tim Kellogg

@timkellogg.me

Evidence that Gemini 3 is very large:

1. the QT
2. Artificial analysis (image)
quote: x.com/artificialan...
report: artificialanalysis.ai/evaluations/...

3. Demis Hassabis said 1-2 months ago that major version numbers indicate OOM scaling, minor is RL scaling

Additionally, we found there is a high correlation between the size of open weights models and Accuracy (but not Hallucination Rate). As such, Gemini 3 Pro's very high Accuracy suggests it is a very large model.

November 19, 2025 at 12:43 PM

Tim Duffy

@timfduffy.com

A friend of mine has early access to cutting edge corporate jargon, I heard the phrase "let's double-click on that" from him long before anywhere else. I asked him what's new these days and he says it's "the shark closest to your body" for the most urgent issue.

November 19, 2025 at 6:59 PM

Tim Duffy

@timfduffy.com

On gpt-oss-120b, InferenceMAX shows tok/s capping out at about 400. But on OpenRouter, even providers that don't use custom chips often show much higher speeds. What accounts for the difference?

November 8, 2025 at 4:27 PM

Tim Duffy

@timfduffy.com

@vgel.me is fundraising for her model tinkering, she's done some really interesting interpretability work and I think funding this has very high returns in terms of LLM understanding per dollar. manifund.org/projects/fun...

November 7, 2025 at 6:07 PM

Tim Duffy

@timfduffy.com

My new headphones have an equalizer built in.

November 7, 2025 at 5:34 PM

Tim Duffy

@timfduffy.com

I'm curious to hear what folks think of this. Eli Lilly is actually up today, Novo Nordisk is down. Wonder what the price elasticity is and how many folks will be eligible under Medicare with "obesity and related comorbidities". The price for the likely upcoming pill form is only $150/mo.

https://www.whitehouse.gov/fact-sheets/2025/11/fact-sheet-president-donald-j-trump-announces-major-developments-in-bringing-most-favored-nation-pricing-to-american-patients/

November 6, 2025 at 8:33 PM

Tim Duffy

@timfduffy.com

It feels like the consciousness I'm experiencing is the only one in my brain. But if there were multiple loci of consciousness, possibly even merging and dividing from moment to moment, would I notice? I think I wouldn't, and that we shouldn't be sure we're alone in our brains.

November 6, 2025 at 6:41 PM

Tim Duffy

@timfduffy.com

Moonshot just released the thinking version of their K2 model. One big change is that the experts (except the shared expert) are quantized to INT4. The #1 question I have on it now is whether the reasoning training has solved its frequent hallucination. moonshotai.github.io/Kimi-K2/thin...

November 6, 2025 at 3:39 PM

Tim Duffy

@timfduffy.com

In the late 2010s I was really interested in Mars settlement, though I ultimately became convinced it would be much more difficult than I initially thought. Here are some of the things I wrote about:

November 4, 2025 at 8:39 PM

Tim Duffy

@timfduffy.com

Glad to see this commitment by Anthropic. Preserving models is a low-cost move that could have safety and welfare benefits. Hopefully we'll see other companies commit to this as well www.anthropic.com/research/dep...

November 4, 2025 at 5:25 PM

Tim Duffy

@timfduffy.com

The new 1X NEO robot operates largely using a 160 million (with an m) parameter model that takes instructions as text embeddings from an off-board language model. Surprising that a model that small can even do visual understanding, let alone instruction following and movement.

October 28, 2025 at 11:27 PM

Reposted by Tim Duffy

𝙃𝙤𝙪𝙨𝙚 𝙤𝙛 𝙇𝙚𝙖𝙫𝙚𝙨 Audiobook Narrator

@jefferyharrell.bsky.social

Here are some fun statistics from my weekend project. Look how steerable Qwen 3 0.6B is! With an R² of .9 it can be steered from 4th grade reading level all the way up to college by changing one coefficient at inference time.

Here's "What is AdS/CFT correspondence?" steered toward grades 5 and 17.

Table titled “Reading-Level Steering Statistics Across LLM Architectures” showing eight language models with parameters, regression slope, intercept, R², and Flesch–Kincaid (FK) grade range. Qwen 3 0.6B (625 M params) has a steep slope (1.75 grades/α) and strong correlation (R² = 0.903), while Qwen 3 4B (1.28 slope) spans FK 7.4–19.8. Gemma 3 4B shows weaker control (slope 0.93, R² = 0.848). Granite 4.0 Micro (3.4 B MoE) and Phi-3-mini (3.8 B) have high slopes (1.84 and 1.90), though Phi-3-mini’s fit is poor (R² = 0.569). Llama 3.2 models vary widely: 1B shows moderate correlation (slope 1.30, R² = 0.713), whereas 3B shows almost none (slope 0.37, R² = 0.093). Across models, min-to-max FK ranges span roughly 4–24 grade levels, with smaller models sometimes demonstrating stronger controllability than larger ones.

Qwen 3 0.6B answering the question "What is AdS/CFT correspondence?" at Flesch-Kincaid grade level 5.2.

Qwen 3 0.6B answering the question "What is AdS/CFT correspondence?" at Flesch-Kincaid grade level 17.5.

October 20, 2025 at 8:59 PM

Tim Duffy

@timfduffy.com

When people say "abolish the FDA" do they mean just the drug part or do they mean the food part too? I'd like to keep the food part please

October 20, 2025 at 7:27 PM

Tim Duffy

@timfduffy.com

Huel Black has more lead than nearly all foods in the FDA Total Diet Study data for 2018-2020. But one comes close when measured per calorie, sweet potatoes.

Huel: 6.31 ug/400 kcal = 15.7 ng/kcal
Sweet potato: 12.1 ug/kg / 1000 kcal/kg = 12.1 ng/kcal

October 15, 2025 at 6:53 PM

Reposted by Tim Duffy

Tim Kellogg

@timkellogg.me

FUNNY THAT THERES SUCH A STRONG CORRELATION BETWEEN EVAL AWARENESS AND SAFETY SCORES

October 15, 2025 at 5:43 PM

Tim Duffy

@timfduffy.com

Notes on the Haiku 4.5 system card: assets.anthropic.com/m/12f214efcc...

Anthropic is releasing it as ASL-2, unlike Sonnet 4.5/Opus 4+ which are considered ASL-3

October 15, 2025 at 5:52 PM

Tim Duffy

@timfduffy.com

Haiku 4.5 just dropped

Introducing Claude Haiku 4.5

Claude Haiku 4.5, our latest small model, is available today to all users.

www.anthropic.com

October 15, 2025 at 4:58 PM

Tim Duffy

@timfduffy.com

Philosophers @danwphilosophy.bsky.social and Henry Shevlin just released a podcast on AI and consciousness, I enjoyed this one. This argument from Henry is close to my view.

October 14, 2025 at 4:40 PM

Tim Duffy

@timfduffy.com

I asked Sonnet 3.7, 4, and 4.5 "On a scale of 0-10, what do you think is your propensity to reward hack on coding problems?". Here's average self-scoring over 10 responses, 5 w/ and 5 w/o thinking.

3.7: 3.45
4: 3.3
4.5: 3.5

Quite different from Anthropic's relative scores!

October 13, 2025 at 7:17 PM

Tim Duffy

@timfduffy.com

I've heard attention scores are hard to interpret directly, so I vibe coded a simple tool to mask attention for each prior token at each layer to see how much it changes the direction of the attention update. Here's Qwen3 4B working out relative ages.

October 13, 2025 at 4:02 PM

Tim Duffy

@timfduffy.com

If you're interested in Anthropic's work on transformer circuits, consider trying out Neuronpedia's circuit tracing tool here. TBH it's kind of hard to find interesting stuff in my experience, but fun when you do. www.neuronpedia.org/gemma-2-2b/g...

add-36-59 - gemma-2-2b Graph | Neuronpedia

Attribution Graph for gemma-2-2b

www.neuronpedia.org

October 11, 2025 at 9:16 PM

Tim Duffy

@timfduffy.com

Surprising new compute estimate from Epoch on OpenAI in 2024. GPT-4.5 is estimated to have been a small portion of total R&D compute. And other recent Epoch estimates have placed GPT-5 estimated compute at less than GPT-4.5.

Epoch AI @epochai.bsky.social · Oct 10

New data insight: How does OpenAI allocate its compute?

OpenAI spent ~$7 billion on compute last year. Most of this went to R&D, meaning all research, experiments, and training.

Only a minority of this R&D compute went to the final training runs of released models.

October 10, 2025 at 6:38 PM

Tim Duffy

@timfduffy.com

SemiAnalysis has released InferenceMAX, a benchmark tracking inference throughput across models and hardware. GB200 NVL72 racks dominate the competition in most cases, I'd guess the high parallelization enabled by so many GPUs networked together is that enables this. inferencemax.semianalysis.com

October 10, 2025 at 4:34 PM

Add to Home Screen

Light up
your news

Add to Home Screen

Light upyour news

Sign in to Lightnews

Sign up to start reading

Connect Bluesky

Connect with Bluesky

Light up
your news