#DeepSeek-R1
I just saw the Kimi K2 Thinking release!

Kimi K2 is based on the DeepSeek V3/R1 architecture, and here's a side-by-side comparison.

In short, Kimi K2 is a slightly scaled-up DeepSeek V3/R1; the gains come from the data and training recipes. Hopefully, we will see some details on those soon, too.
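For context, a rough side-by-side of the public config numbers. This is a sketch from the two model cards as I recall them, with approximate figures; it is not the original poster's comparison chart.

```python
# Rough side-by-side from the public configs. Figures are from memory of the two
# model cards and should be treated as approximate.
comparison = {
    #                       DeepSeek V3/R1   Kimi K2
    "total parameters":     ("671B",          "~1T"),
    "active params/token":  ("37B",           "32B"),
    "attention":            ("MLA",           "MLA"),
    "routed experts":       ("256",           "384"),
    "experts per token":    ("8",             "8"),
    "attention heads":      ("128",           "64"),
}
for name, (v3, k2) in comparison.items():
    print(f"{name:<22} {v3:>8} {k2:>8}")
```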
November 6, 2025 at 7:35 PM
K2-Thinking is SOTA, top model in agentic tool calling
November 7, 2025 at 10:40 AM
Best breakdown of modern LLM architectures

From DeepSeek to GPT-OSS, it’s all here ↓

Covers every flagship model

1️⃣ DeepSeek V3/R1
2️⃣ OLMo 2
3️⃣ Gemma 3
4️⃣ Mistral Small 3.1
5️⃣ Llama 4
6️⃣ Qwen3
7️⃣ SmolLM3
8️⃣ Kimi K2
9️⃣ GPT-OSS

#ArtificialIntelligence #MachineLearning #DeepLearning #DataScience #Analytics
November 7, 2025 at 12:27 PM
this morning, X is saturated with people from the US claiming that their favorite unknown benchmark (that happens to show K2 trailing US models) is actually the best single benchmark to watch

lol notice how they clipped off the top 12
November 8, 2025 at 12:10 PM
the funniest post from around the deepseek r1 release
October 30, 2025 at 4:50 PM
Running #GPU workloads on #Kubernetes with #TalosLinux isn’t like using traditional Linux.

Here's how to deploy the Deepseek-r1 LLM using Ollama on bare metal Kubernetes with Talos and Omni’s Image Factory. → www.youtube.com/watch?v=HiDW...

Want to talk more about it? Find our team at #KubeCon!
Deepseek on bare metal Kubernetes with Talos Linux
Starting from a blank computer with an NVIDIA GPU we walk through all the steps needed to deploy Deepseek-r1 as a Kubernetes workload. Sign up for Omni at https://siderolabs.com/omni-signup
www.youtube.com
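For readers who just want the shape of the workload, here is a minimal sketch using the official Kubernetes Python client: a one-replica Ollama Deployment requesting a GPU, into which you would then pull deepseek-r1. Names, namespace, and sizing are placeholders rather than the manifests from the video, and it assumes the NVIDIA device plugin is already running (which the Talos walkthrough covers).

```python
from kubernetes import client, config

config.load_kube_config()  # assumes a local kubeconfig; use load_incluster_config() in-cluster

container = client.V1Container(
    name="ollama",
    image="ollama/ollama",  # stock Ollama image; its API listens on 11434
    ports=[client.V1ContainerPort(container_port=11434)],
    resources=client.V1ResourceRequirements(limits={"nvidia.com/gpu": "1"}),
)

deployment = client.V1Deployment(
    metadata=client.V1ObjectMeta(name="ollama"),
    spec=client.V1DeploymentSpec(
        replicas=1,
        selector=client.V1LabelSelector(match_labels={"app": "ollama"}),
        template=client.V1PodTemplateSpec(
            metadata=client.V1ObjectMeta(labels={"app": "ollama"}),
            spec=client.V1PodSpec(containers=[container]),
        ),
    ),
)

client.AppsV1Api().create_namespaced_deployment(namespace="default", body=deployment)
# Then pull the model into the running pod:
#   kubectl exec deploy/ollama -- ollama pull deepseek-r1
```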
October 28, 2025 at 5:06 PM
New post recapping the biggest paper in reasoning/RL this year since the DeepSeek R1 report. It does a great job highlighting how RL (and post training really) is far more of an art than a science (pretraining is a hard science).
www.interconnects.ai/p/the-new-rl...
The new RL scaling laws
The most covetable research.
www.interconnects.ai
October 20, 2025 at 3:28 PM
Agents are hard to benchmark

new research from Princeton shows several factors that complicate benchmarking

agents will:
- take shortcuts
- take overly expensive actions
- hardcode answers

also, token efficiency doesn’t translate to cost reduction

arxiv.org/abs/2510.11977
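A toy illustration of that last point, with made-up prices that are not from the paper: an agent run that uses fewer tokens can still cost more if its per-token price is higher.

```python
# Made-up prices, purely to illustrate why token efficiency and cost can diverge.
runs = {
    "token-efficient, pricey model": {"tokens": 40_000, "usd_per_1k_tokens": 0.060},
    "verbose, cheap model":          {"tokens": 90_000, "usd_per_1k_tokens": 0.002},
}
for name, r in runs.items():
    cost = r["tokens"] / 1000 * r["usd_per_1k_tokens"]
    print(f"{name}: ${cost:.2f}")
# -> $2.40 vs $0.18: the "efficient" agent is still ~13x more expensive per task.
```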
October 16, 2025 at 11:51 AM
Aaahhh I might have to do some painfully slow inference so I can see the CoT from r1-zero. I want my models untamed, SFT-free, speaking in tongues.

huggingface.co/deepseek-ai/...
deepseek-ai/DeepSeek-R1-Zero · Hugging Face
We’re on a journey to advance and democratize artificial intelligence through open source and open science.
huggingface.co
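If you do try it, a bare transformers sketch looks roughly like this. Caveat: R1-Zero is a 671B-parameter MoE, so this only runs on a serious multi-GPU box (or a heavily quantized copy); the prompt and sampling settings below are placeholders, not recommendations from the model card.

```python
# Hedged sketch of poking at DeepSeek-R1-Zero's raw chain of thought with
# Hugging Face transformers. Treat this as the shape of the code, not a recipe
# that runs on a laptop.
from transformers import AutoModelForCausalLM, AutoTokenizer

model_id = "deepseek-ai/DeepSeek-R1-Zero"
tokenizer = AutoTokenizer.from_pretrained(model_id, trust_remote_code=True)
model = AutoModelForCausalLM.from_pretrained(
    model_id, trust_remote_code=True, device_map="auto"
)

prompt = "How many r's are in 'strawberry'?"  # placeholder prompt
inputs = tokenizer(prompt, return_tensors="pt").to(model.device)
out = model.generate(**inputs, max_new_tokens=2048, do_sample=True, temperature=0.6)
print(tokenizer.decode(out[0], skip_special_tokens=True))  # raw, SFT-free reasoning
```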
October 16, 2025 at 4:45 AM
Chinese LLMs like DeepSeek have been getting good for a while.
See our tracker geni.us/IIB-LLMs
made with @vizsweet
October 15, 2025 at 8:55 PM
TRM's performance is 🔥! It beat DeepSeek R1 (671B params) & Gemini 2.5 Pro on the ARC-AGI benchmark. Achieved 44.6% on ARC-AGI-1 & 87% on Sudoku-Extreme! 💯🏆 #AIbenchmark #DeepLearning
October 14, 2025 at 8:26 AM
deepseek-r1-0528-qwen3-8b
35 tok/s

qwen-coder-30b
59 tok/s

gemma-3n-e4b
42 tok/s

gpt-oss-20b
57 tok/s

gpt-oss-120b
27 tok/s
October 10, 2025 at 9:09 PM
Updated & turned my Big LLM Architecture Comparison article into a video lecture.

The 11 LLM archs covered in this video:
1. DeepSeek V3/R1
2. OLMo 2
3. Gemma 3
4. Mistral Small 3.1
5. Llama 4
6. Qwen3
7. SmolLM3
8. Kimi K2
9. GPT-OSS
10. Grok 2.5
11. GLM-4.5/4.6

www.youtube.com/watch?v=rNlU...
The Big LLM Architecture Comparison
YouTube video by Sebastian Raschka
www.youtube.com
October 10, 2025 at 5:05 PM
you can also use this to probe the reasoning process on reasoning models, like deepseek R1 with a silly prompt here:
October 8, 2025 at 1:37 AM
why would DeepSeek drop the R1 brand and not name the next model “R2”?

i get that people in AI are bad at branding, but are they really *this* bad?

afaict the next one is V4, but they got so much publicity with R1..
October 2, 2025 at 9:32 PM
Researchers assessed 30 game concepts with midsize LLMs—LLaMA 3.1, Qwen 2.5 and DeepSeek‑R1—and found DeepSeek‑R1 gave useful feedback. The rubric covered narrative hook, mechanics and market potential. https://getnews.me/medium-sized-llms-show-promise-for-early-game-design-feedback/ #llm #gamedev
September 30, 2025 at 11:28 PM
The paper says they bake in a system prompt as part of the RL process:

"2.2.3. Training Template
To train DeepSeek-R1-Zero, we begin by designing a straightforward template that guides the base model to adhere to our specified instructions."
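For readers who haven't seen the paper, the template in question looks roughly like the sketch below: a fixed wrapper that tells the base model to put its reasoning in <think> tags and its final answer in <answer> tags before RL rollouts are sampled. The wording here is a paraphrase, not the paper's exact text.

```python
# Paraphrased sketch of the R1-Zero-style training template the quote refers to.
TEMPLATE = (
    "A conversation between User and Assistant. The user asks a question, and "
    "the Assistant solves it. The Assistant first thinks about the reasoning "
    "process in the mind and then provides the user with the answer. The "
    "reasoning process and answer are enclosed within <think> </think> and "
    "<answer> </answer> tags, respectively. User: {question} Assistant:"
)

def build_prompt(question: str) -> str:
    """Fill the template with a question before sampling rollouts for RL."""
    return TEMPLATE.format(question=question)

print(build_prompt("What is 17 * 24?"))
```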
September 29, 2025 at 1:00 PM
Anthropic's Claude 3.7 Sonnet is the new king 👑 of code generation (but only with help), and DeepSeek R1 disappoints buff.ly/BUvGPLL
#Java #CodeGen #genai #llm
September 24, 2025 at 5:08 AM
I feel very proud to be part of @nature.com, and to have colleagues who handled this excellent #DeepSeek paper that describes DeepSeek-R1, because it's the first widely used commercial LLM that has been published in a peer-reviewed journal 🧪 www.nature.com/articles/s41...
DeepSeek-R1 incentivizes reasoning in LLMs through reinforcement learning - Nature
A new artificial intelligence model, DeepSeek-R1, is introduced, demonstrating that the reasoning abilities of large language models can be incentivized through pure reinforcement learning, removing the need for human-annotated demonstrations.
www.nature.com
September 22, 2025 at 6:12 PM
Huawei says DeepSeek-R1-Safe, which was trained on 1,000 of its Ascend AI chips, is "nearly 100% successful" in preventing politically sensitive topics (Eduardo Baptista/Reuters)

September 20, 2025 at 1:41 PM
Huawei said late on Thursday evening that it had used 1,000 of its Ascend AI chips to train the large language model, which was fine-tuned from DeepSeek's open-source R1 model.

Huawei's partner was the elite Zhejiang University,
the alma mater of DeepSeek's founder ⬇️
September 20, 2025 at 10:34 AM
"DeepSeek didn’t really train its flagship model for $294,000: Training costs detailed in R1 training report don't include 2.79 million GPU hours that laid its foundation"

Counterpoint: It cost a lot more than $294k.

🤷‍♂️

www.theregister.com/2025/09/19/d...
DeepSeek didn’t really train its flagship model for $294,000
Training costs detailed in R1 training report don't include 2.79 million GPU hours that laid its foundation
www.theregister.com
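A quick back-of-the-envelope makes the gap concrete. The ~$2 per H800 GPU-hour rental rate below is the figure the DeepSeek-V3 report used; it is assumed here rather than taken from the Register piece.

```python
# Back-of-the-envelope: the excluded pre-training hours alone, priced at an
# assumed ~$2 per H800 GPU-hour, dwarf the $294k RL-stage figure.
foundation_gpu_hours = 2.79e6   # GPU hours the $294k figure doesn't include
assumed_usd_per_gpu_hour = 2.0  # assumption, not a number from the article
print(f"~${foundation_gpu_hours * assumed_usd_per_gpu_hour / 1e6:.2f}M")  # ~$5.58M
```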
September 19, 2025 at 6:12 PM
An editorial was published in Nature recently claiming that glam journal publication of LLMs (like DeepSeek-R1 in this case) marks a step towards greater transparency, accountability & credibility www.nature.com/articles/d41.... I have thoughts ... 1/
https://www.nature.com/articles/d41586-025-02979-9
September 19, 2025 at 7:01 PM
Nature research paper: DeepSeek-R1 incentivizes reasoning in LLMs through reinforcement learning

go.nature.com/41WGjPu
DeepSeek-R1 incentivizes reasoning in LLMs through reinforcement learning - Nature
A new artificial intelligence model, DeepSeek-R1, is introduced, demonstrating that the reasoning abilities of large language models can be incentivized through pure reinforcement learning, removing the need for human-annotated demonstrations.
go.nature.com
September 19, 2025 at 8:46 AM