Avik Dey
@avikdey.bsky.social
Mostly Data, ML, OSS & Society • Stop chasing Approximately Generated Illusions; focus on Specialized Small LMs • To understand it well enough, learn to explain it simply • Shadow self of https://linkedin.com/in/avik-dey, have a beard now
Pinned
Alignment isn’t the only thing LLMs are faking. Reasoning is another one they’re good at faking. Reading a paper on LLM performance on doctors’ reasoning tasks. Just started reading, but it’s either going to be:
1. Memorization or
2. Priming or
3. Confirmation prompting

www.anthropic.com/research/ali...
Alignment faking in large language models
A paper from Anthropic's Alignment Science team on Alignment Faking in AI large language models
www.anthropic.com
Proxying the Apple byte - are we?

Amateur move, guys.
November 26, 2025 at 1:37 AM
Having faced this exact same issue repeatedly since 2023, I would have laughed at this - if we didn’t have 1% of GDP invested in this caricature of an “AI”.

www.dwarkesh.com/p/ilya-sutsk...
November 25, 2025 at 9:40 PM
Ilya appears to be progressively approaching the right conclusion. I remain confident that in time he will consolidate his insights from the first 5 minutes and recognize that complex explanations are unnecessary when simpler ones suffice.

(screenshots not chronological)

www.dwarkesh.com/p/ilya-sutsk...
November 25, 2025 at 8:17 PM
Good to see research on what the math always said - low-to-average performers, that’s your LLM “employee”:

> This supports our assertion that the ceiling on LLM creativity (0.25) corresponds to the boundary between little-c and Pro-c human creative performance (Figure 6).

www.academia.edu/144621465/_T...
November 25, 2025 at 5:19 PM
Any PhD who endorses that an LLM constitutes “PhD level” intelligence is at minimum engaging in a questionable use of their academic authority. These endorsements function less as rigorous assessments and more as a signal that the symbolism conferred by their credential is - available for rent.
Deeply absurd. This Google PDF published on a blog (arXiv, not peer reviewed) claims an LLM is “PhD level”, but in most cases the MAJORITY of reference URLs were invalid or inaccessible.

A PhD sitting down and just fabricating >50% of sources = career ending

arxiv.org/abs/2511.11597
November 24, 2025 at 9:39 PM
They were convinced “AI” would rewrite it all in a week and ship by the end of that month; the ‘year or two’ estimate was just sandbagging so they could pose as 100x devs.
In the early days of DOGE I spoke to developers Musk had parachuted into the IRS and the FAA, each telling me they expected to rewrite the core software of both agencies within a year or two.

It would be amusing to speak to them again.
DOGE is no more, and in its wake, only chaos
November 24, 2025 at 5:09 AM
“warm-up”: Under the guidance of an expert human, the model was finally able to get the answer right when nudged towards it.

Not the model, not the prompt - still the human.

With the amount of shilling these guys do, no wonder they can’t get anything serious built.

cdn.openai.com/pdf/4a25f921...
November 23, 2025 at 5:33 PM
Think they might have answered their own question … ?

bsky.app/profile/slas...
November 22, 2025 at 4:04 AM
The problem with most financial analysis of Nvidia’s quarterly performance is that these folks don’t seem to understand data center hardware lead times and revenue recognition cycles.
November 20, 2025 at 6:36 AM
Great article with learned insights - the best kind.

Unfortunately, this is a societal failure. Tech didn’t invent loneliness; it offered a new way to cope with it - in an empathetic echo chamber.

We are failing the kids. Others too, but mostly it’s the kids that I worry about.
I agree that emotional addiction to chatbots is the number one risk of AI today. Here is a gift link to an important OpEd in the NYTimes:
www.nytimes.com/2025/11/17/o...
Opinion | The Sad and Dangerous Reality Behind ‘Her’
www.nytimes.com
November 20, 2025 at 6:10 AM
You watch a video of a professor from a random internet post and are filled with regret because you didn’t have the opportunity to learn from him in person:

en.wikipedia.org/wiki/Ramamur...
19. Quantum Mechanics I: The key experiments and wave-particle duality
YouTube video by YaleCourses
youtu.be
November 19, 2025 at 6:16 AM
Smaller bag, same toss.
Nvidia and Microsoft will invest up to $15 billion in OpenAI competitor Anthropic. Anthropic, in turn, said it would buy $30 billion of compute capacity from Microsoft Azure and use advanced AI chips supplied by Nvidia.
Nvidia, Microsoft Pour $15 Billion Into Anthropic for New AI Alliance
Anthropic also commits to purchase $30 billion from Microsoft’s cloud computing business Azure.
on.wsj.com
November 18, 2025 at 7:40 PM
For ancillary text-based foo foo services, or core financial services? I have a hard time believing that their engineers, a few of whom I know, would sign off on this integration - but leadership prevailed?
November 18, 2025 at 7:38 PM
Don’t worry about it this quarter - they have enough to prop it up.

But next quarter you should be terrified.
November 18, 2025 at 7:22 PM
If these Gemini 3 Pro benchmarks are accurate, it’s time for OpenAI to sell to Microsoft. Microsoft won’t want their management team or their prolifically tweeting engineers, but I am sure most of the engineers would thrive if led by seasoned engineering management.

storage.googleapis.com/deepmind-med...
November 18, 2025 at 4:51 PM
I too would like my taxpayer-backed trillion-dollar fantasy fund. Why should Sama have all the fun?
Anthropic CEO Dario Amodei thinks AI could help find cures for most cancers, prevent Alzheimer’s, and even double the human lifespan. cbsn.ws/4oRZ8Nm
November 18, 2025 at 6:50 AM
Perfect prediction, even if I do say so myself!

Actually, the realization dawned on them a few weeks back, but these things take a little while to surface externally.

Image of tweet from bird site because I won’t link to it.
November 16, 2025 at 1:45 AM
From the bird site, the acceleration continues:
November 16, 2025 at 1:30 AM
Thoughts:
- The report is based on Claude’s logs, with no visibility into human actions outside of Claude
- The claim that “80–90% of tactical work” was done by Claude, with humans merely in a strategic role, aligns curiously well with their marketing message rather than any verified capability
assets.anthropic.com
November 14, 2025 at 5:00 PM
In software engineering, lines of code edited and weekly merge counts are misleading proxies for productivity. A significant number of exogenous variables impact those metrics - team dynamics, code maturity, product maturity, and seasonal business cycles, to name only a few.
Some pretty eye-opening data on the effect of AI coding.

When Cursor added agentic coding in 2024, adopters produced 39% more code merges, with no sign of a decrease in quality (revert rates were the same, bugs dropped) and no sign that the scope of the work shrank. papers.ssrn.com/sol3/papers....
November 13, 2025 at 4:18 PM
In this age of AI, don’t be a follower. Be the leader who hires engineers who build the future - because AI ain’t building jackshit for you.
November 8, 2025 at 8:36 PM
We are entering the golden age of AI “world models” where every AI hype will be proudly accompanied by their grand unified theory of everything, rigorously engineered to collapse at the first gentle poke of reality.
Browsing the arxiv paper - the architecture seems to rely heavily on the structured world model. Any additional write-up on how the world model was generated and is globally maintained?
November 8, 2025 at 4:48 PM
Classifier ≠ Human Judge

> We assess how effectively large language models generate social media replies that remain indistinguishable from human-authored content when evaluated by automated classifiers. We employ a BERT-based binary classification model to distinguish between the two text types.
LLMs are now widely used in social science as stand-ins for humans—assuming they can produce realistic, human-like text

But... can they? We don’t actually know.

In our new study, we develop a Computational Turing Test.

And our findings are striking:
LLMs may be far less human-like than we think.🧵
Computational Turing Test Reveals Systematic Differences Between Human and AI Language
Large language models (LLMs) are increasingly used in the social sciences to simulate human behavior, based on the assumption that they can generate realistic, human-like text. Yet this assumption rem...
arxiv.org
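For anyone curious what a “BERT-based binary classification model” looks like mechanically, here’s a minimal sketch in the Hugging Face transformers style - mine, not the paper’s code; the example replies and label convention are assumptions, and you’d need to fine-tune on labeled human/LLM pairs before the probabilities mean anything:

```python
# Minimal sketch (not the paper's code) of a BERT-based binary classifier
# for distinguishing human-authored from LLM-generated replies.
import torch
from transformers import AutoTokenizer, AutoModelForSequenceClassification

tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")
model = AutoModelForSequenceClassification.from_pretrained(
    "bert-base-uncased", num_labels=2  # assumed: 0 = human, 1 = LLM-generated
)
model.eval()

# Hypothetical example replies; real training data would be labeled pairs.
replies = [
    "lol yeah the refs blew that call, they've been terrible all season",
    "As an avid follower of this sport, I completely agree with your insightful point.",
]
inputs = tokenizer(replies, padding=True, truncation=True, return_tensors="pt")

with torch.no_grad():
    logits = model(**inputs).logits      # shape: (batch, 2)
probs = logits.softmax(dim=-1)[:, 1]     # per-reply P(LLM-generated)

# Note: with an untrained classification head these probabilities are
# near-random; fine-tuning on labeled data is what makes the classifier work.
for reply, p in zip(replies, probs):
    print(f"{p:.2f}  {reply[:50]}")
```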
November 8, 2025 at 5:33 AM
AI isn’t going to wound the web’s ad model - fatally or otherwise. AI companies are going to be the ones serving those ads.

I would be shocked if OpenAI isn’t already indexing the web even as I type this.
November 7, 2025 at 1:14 AM
Demo coming soon …

bsky.app/profile/avik...
November 5, 2025 at 4:06 PM