Lightnews — Scholar-powered news

Arthur Clune

@arthur.clune.org

While reading Simon Couch's blog I came across this post from December on evaluating open models' ability to complete a simple code refactor in R. The gap to frontier models is much bigger than I expected. He's limiting it to models that can run locally, hence the small size, but even Haiku totally outclasses gpt-oss.

Graph of open agents v frontier models' ability to complete a simple code refactor, with multiple repeats. The open agents can't do it reliably

January 21, 2026 at 8:54 AM

Arthur Clune

@arthur.clune.org

Demonstration of exploit development against a (small but not toy) JavaScript interpreter. Two things I noticed:

1) Increase tokens by the use of parallel runs on the same task (like how METR do their evals)

2) Author doesn’t say how he got round guardrails.

sean.heelan.io/2026/01/18/o...

GPT-5.2 came up with a clever solution involving chaining 7 function calls through glibc's exit handler mechanism. The full exploit is here and an explanation of the solution is here. It took the agent 50M tokens and just over 3 hours to solve this, for a cost of about $50 for that agent run. (As I was running four agents in parallel the true cost was closer to $150).

January 19, 2026 at 8:36 AM

Arthur Clune

@arthur.clune.org

Maths keeps turning out to be useful

en.wikipedia.org/wiki/G._H._H...

January 11, 2026 at 7:45 AM

Arthur Clune

@arthur.clune.org

This is a terrible take from @theguardian.com

Grok hasn’t done this. X has done this to grok. And specifically Musk.

www.theguardian.com/technology/2...

Grok, Elon Musk's AI tool, has switched off its image creation function for the vast majority of users after widespread outcry over its use to create sexually explicit and violent imagery.
It comes after Musk was threatened with fines, regulatory action and reports of a possible ban on X in the UK.
The tool had been used to manipulate images of women to remove their clothes and put them in sexualised positions. The function to do so has now been switched off except for paying subscribers.
Posting on X, Musk's social media network, Grok said:
"Image generation and editing are currently limited to paying subscribers."

January 9, 2026 at 7:54 AM

Arthur Clune

@arthur.clune.org

Consider these ideas for future uses and re-cast them a little.

'Your DA monitors 47 nearby targets. It alerts you about the woman going into a darker section' or 'Your son has been looking at LGBT content. Want me to book him into a conversion camp?'

2/

'Walking at night, your DA monitors 47 nearby cameras and notices concerning behavior ahead - "Take the next right, safer route, you'll still make it on time" '

January 6, 2026 at 12:53 PM

Arthur Clune

@arthur.clune.org

And by chance here’s what your post ended up next to. Letters of Marque next?

December 21, 2025 at 8:14 PM

Arthur Clune

@arthur.clune.org

LLMs' productivity boost is an exponent not a multiplier - from @ed3d.net

This framing partially resonates. I'm less keen on starting skill level as the key (on which axis do we measure, etc.), but if LLM competency is the variable and the exponent, then learned skill matters so much more

December 21, 2025 at 5:16 PM

Arthur Clune

@arthur.clune.org

So Substack are, as long predicted, starting to slowly move away from email. This 'warning' means the message is truncated *by the sender* with a 'continue reading on substack' button even though my mail client can read long messages just fine

December 11, 2025 at 10:39 AM

Arthur Clune

@arthur.clune.org

This is an interesting read but I think the author misses the main use case that will drive spend. Military robots are going to be a massive investment.

I don’t think this is a good thing, but it seems clear that it’s the way it’s going

The main problem with robotics is that learning follows scaling laws that are very similar to the scaling laws of language models. The problem is that data in the physical world is just too expensive to collect, and the physical world is too complex in its details. Robotics will have limited impacts.
Factories are already automated and other tasks are not economically meaningful.

December 11, 2025 at 8:45 AM

Arthur Clune

@arthur.clune.org

I raise you this one from Frontiers of Cell Biology. There's basically a whole industry of 'special issues' that print anything if you pay

November 28, 2025 at 12:36 PM

Arthur Clune

@arthur.clune.org

More on Gemini 3 and reading historical documents. With a line to make Gary Marcus hop.

Google does seem to be proving that just scaling LLMs is still working

generativehistory.substack.com/p/the-sugar-...

A line to make Gary Marcus weep: claimed evidence for emergence of neuro-symbolic reasoning via scaling

November 18, 2025 at 8:15 PM

Arthur Clune

@arthur.clune.org

If this analysis from EpochAI is correct then
a) model training costs (financial and environmental) are ~5-10x the final run and
b) inference costs (financial and environmental) are smaller than assumed

I'm making heroic assumptions for a). No-one outside OpenAI can answer properly

Diagram of relative spend at OpenAI on inference, GPT-4.5 final training run and overall R&D compute spend. Compute spend is ~10x the training run and inference costs are approximately 40% of R&D costs

November 18, 2025 at 9:32 AM

Arthur Clune

@arthur.clune.org

Age yourself with gaming

November 12, 2025 at 6:48 PM

Arthur Clune

@arthur.clune.org

Big hint in their write up that they weren't using Cursor to write code

Cursor builds tools for software engineering, and we make heavy use of the tools we develop. A motivation of Composer development has been developing an agent we would reach for in our own work. In recent weeks, we have found that many of our colleagues were using Composer for their day-to-day software development. With this release, we hope that you also find it to be a valuable tool

October 30, 2025 at 5:22 PM

Arthur Clune

@arthur.clune.org

Begun the document wars have

No, seriously.

www.scmp.com/economy/chin...

China's expansion of its rare earth export controls appeared to mark another escalation in the US-China trade war last week. But the
announcements were also significant in another way: unusually, the documents could not be opened using American word processing software.
For the first time, China's Ministry of Commerce issued a slew of documents that could be directly accessed only through WPS Office - China's answer to Microsoft Office - as Beijing continues its tech self-reliance drive.
Developed by the Beijing-based software company Kingsoft, WPS Office uses a different coding structure to Microsoft Office, meaning WPS text files cannot be opened directly in Word without conversion.

October 14, 2025 at 7:13 AM

Arthur Clune

@arthur.clune.org

Also Claude, 🔥

Output from Claude: "The verbosity you'll encounter isn't accidental; it's COBOL heritage, designed for business analysts to supposedly read. Whether that goal was achieved is debatable."

October 8, 2025 at 8:54 AM

Arthur Clune

@arthur.clune.org

I'm writing a chatbot as an experiment. Here's Claude Code debugging why the tool calling isn't working

Spoiler - this was not the reason!

Output from Claude Code:

So the LLM is working correctly! It's:
1. Acting as Gimli (a grumpy dwarf who doesn't like people)
2. Saving memories about Elara via tool calls
3. But not responding verbally because Gimli is grumpy and doesn't want to talk
The empty responses make perfect sense for Gimli's character - he sees Elara, grumbles, saves the information to memory, but doesn't actually say anything back because he's antisocial.
The memory system is working perfectly. The "issue" is that we're testing with a character who doesn't like to talk!

September 1, 2025 at 1:58 PM

Arthur Clune

@arthur.clune.org

The chart crimes will continue until morale improves #gpt5

A chart screen-captured from the OpenAI livestream. It's wrong.

August 7, 2025 at 5:32 PM

Arthur Clune

@arthur.clune.org

Numbers from Google on tokens. Doubling every month. 1,000tn in June.

Tweet from Demis Hassabis stating that Google processed a quadrillion tokens in June 2025

July 31, 2025 at 4:15 PM

Arthur Clune

@arthur.clune.org

It's here. Sigh. Age ID for DMs

July 23, 2025 at 9:41 PM

Arthur Clune

@arthur.clune.org

Fortunately the “learn more about our brand” page explains everything

July 22, 2025 at 12:40 PM

Arthur Clune

@arthur.clune.org

I read this from the OfS as saying that nearly 20% of UK Unis are at risk of going under

(from the @resprofnews.bsky.social newsletter)

On Friday, the OfS published its annual report. Safe to say it offers no let-up in the level of concern about possible university insolvency. It revealed that 71 out of the 400 or so providers on its register were subject to "formal monitoring" over finances.

July 21, 2025 at 9:37 AM

Arthur Clune

@arthur.clune.org

It's a cool idea for sure. But why is it surprising? There's plenty in that trace for a human to understand what it does. I assume that it scales well beyond the toy examples into something that needs a lot more reasoning?

July 19, 2025 at 1:12 PM

Arthur Clune

@arthur.clune.org

And here’s a quote from one of the developers in the study. This is very relatable!

x.com/ruben_bloom/...

X post from one of the developers in the METR study:

This is much less true of my participation in the study where I was more conscientious, but I feel like historically a lot of my AI speed-up gains were eaten by the fact that while a prompt was running, I'd look at something else (FB, X, etc) and continue to do so for much longer than it took the prompt to run

July 15, 2025 at 6:50 AM

Arthur Clune

@arthur.clune.org

There are some eye-opening things in the Observer's Sensemaker email today about the ONS

observer.co.uk/newsletters

It also appears it may be another instance of problems of governance and leaders who don't listen

archive.is/lvPkk

And yes, Newport is not attractive and doesn't help

What went wrong? In 2023, an email was sent to the head of the ONS, chief statistician Sir Ian Diamond, identifying major problems in UK employment data due to precipitous collapses in sample size.
It wasn't unique. Covid disrupted face-to-face interviews and people have since become increasingly reluctant to take cold calls, share personal data or spend time filling in surveys. The pandemic also produced a lot of statistical "noise"
For the ONS it was acute. Analysis of other sources showed at one point the ONS was underestimating the size of Britain's workforce by as much as one million people. Stats for certain sub-groups were swinging by 30 per cent in a month because they depended on a sample size of five. In October 2023, the ONS was forced to temporarily pause its monthly readout on jobs

Cultural concerns. A government-commissioned review led by Sir Robert Devereux identified a slump in morale at the ONS. Some of it pointed to the role of the "whacky" ONS boss Diamond, who departed shortly before its publication on health grounds. It also noted that
• widespread WFH weakened team cohesion and undermined data quality;
• flagship programmes - such as Diamond's pet project, a data service costing £200 million over five years - had come at the expense of high-quality core statistics; and
• there had been struggles to recruit and retain skilled analysts, partly due to below average pay and the location of the ONS headquarters in Newport, Wales.

July 11, 2025 at 9:42 AM
