Also blogging about AI research at magazine.sebastianraschka.com.
"Small Batch Size Training for Language Models:
When Vanilla SGD Works, and Why Gradient Accumulation Is Wasteful" (arxiv.org/abs/2507.07101)
(I can confirm this holds for RLVR, too! I have some experiments to share soon.)
"Small Batch Size Training for Language Models:
When Vanilla SGD Works, and Why Gradient Accumulation Is Wasteful" (arxiv.org/abs/2507.07101)
(I can confirm this holds for RLVR, too! I have some experiments to share soon.)
It's still worth investing in becoming an expert, because then you will get even more out of LLMs and will be able to deliver even better results.
...it has grown quite a bit since the initial version in July 2025; it has more than doubled!
magazine.sebastianraschka.com/p/the-big-ll...
Just went through the config files; the only difference I could see is that Mistral 3 Large uses half as many experts but makes each expert twice as large.
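As a quick sanity check with made-up numbers (not the actual Mistral 3 Large or DeepSeek config values), halving the number of experts while doubling each expert's hidden size keeps the MoE feed-forward parameter budget unchanged:

```python
# Rough MoE FFN parameter count; every number below is a placeholder,
# not taken from a real config file.
def moe_ffn_params(num_experts, d_model, d_expert):
    # Assumes SwiGLU-style experts with three weight matrices (up, gate, down).
    return num_experts * 3 * d_model * d_expert

base = moe_ffn_params(num_experts=64, d_model=4096, d_expert=2048)
variant = moe_ffn_params(num_experts=32, d_model=4096, d_expert=4096)  # 2x fewer, 2x larger
print(base == variant)  # True: same total parameter count
```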
Basically pushes RLVR & self-refinement to gold-level scores on IMO 2025.
Coincidentally, I am currently working on a chapter on self-refinement, and this comes in handy as a nice, scaled-up case study.
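In case the term is new to you, here is a minimal self-refinement loop sketch; `llm` stands in for a hypothetical prompt-in, text-out model call and is not tied to the paper's actual setup:

```python
# Minimal self-refinement sketch: generate, critique, revise, repeat.
# `llm` is a hypothetical callable mapping a prompt string to a response string.
def self_refine(llm, problem, num_rounds=3):
    answer = llm(f"Solve the following problem:\n{problem}")
    for _ in range(num_rounds):
        critique = llm(
            f"Problem:\n{problem}\n\nProposed solution:\n{answer}\n\n"
            "List any errors or gaps in this solution."
        )
        answer = llm(
            f"Problem:\n{problem}\n\nPrevious solution:\n{answer}\n\n"
            f"Critique:\n{critique}\n\nWrite an improved solution."
        )
    return answer
```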
If you are interested in reading through the architecture details, I coded it from scratch here: github.com/rasbt/LLMs-f...
If you are looking for something to read this weekend, Chapter 4 is available now: mng.bz/Dwra
In this case, the break-even point is $5,000,000 / $0.20 per query = 25 million queries.
Training is usually very, very expensive, but it is a one-time cost. Inference scaling is comparatively cheap, but it is a cost we pay on every query.
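For reference, here is a tiny helper reproducing the break-even calculation above; the $5M training cost and $0.20 per-query inference premium are just the example numbers from this post:

```python
def break_even_queries(training_cost_usd, extra_cost_per_query_usd):
    # Number of queries after which the cumulative extra inference spending
    # equals the one-time training investment.
    return training_cost_usd / extra_cost_per_query_usd

print(f"{break_even_queries(5_000_000, 0.20):,.0f}")  # 25,000,000 queries
```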
Kimi K2 is based on the DeepSeek V3/R1 architecture, and here's a side-by-side comparison.
In short, Kimi K2 is a slightly scaled-up DeepSeek V3/R1, and the gains come from the data and training recipes. Hopefully, we will see some details on those soon, too.
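If you want to reproduce such a side-by-side comparison yourself, one simple way is to diff the Hugging Face config files; the sketch below assumes the `deepseek-ai/DeepSeek-V3` and `moonshotai/Kimi-K2-Instruct` repo IDs and that both configs load with `trust_remote_code=True`:

```python
from transformers import AutoConfig

# Repo IDs are assumptions; swap in whichever checkpoints you want to compare.
cfg_a = AutoConfig.from_pretrained("deepseek-ai/DeepSeek-V3", trust_remote_code=True).to_dict()
cfg_b = AutoConfig.from_pretrained("moonshotai/Kimi-K2-Instruct", trust_remote_code=True).to_dict()

# Print only the hyperparameters where the two architectures differ.
for key in sorted(set(cfg_a) | set(cfg_b)):
    if cfg_a.get(key) != cfg_b.get(key):
        print(f"{key}: {cfg_a.get(key)} vs {cfg_b.get(key)}")
```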
Gated DeltaNet hybrids (Qwen3-Next, Kimi Linear), text diffusion, code world models, and small reasoning transformers.
🔗 magazine.sebastianraschka.com/p/beyond-sta...
Link to the full article: magazine.sebastianraschka.com/p/the-big-ll...
(Source: huggingface.co/MiniMaxAI/Mi...)
🔗 github.com/rasbt/LLMs-f...
🔗 github.com/rasbt/LLMs-f...
Will add this for multi-head latent, sliding, and sparse attention as well.
A few months ago, the Hierarchical Reasoning Model (HRM) made big waves in the AI research community, as it showed really good performance on the ARC challenge despite its small size of only 27M parameters. (That's about 22x smaller than the smallest Qwen3 model, Qwen3 0.6B.)
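The size ratio is easy to verify with a quick back-of-the-envelope calculation:

```python
hrm_params = 27e6      # 27M-parameter HRM
qwen3_params = 0.6e9   # smallest Qwen3 model (0.6B parameters)
print(round(qwen3_params / hrm_params, 1))  # ~22.2, i.e., roughly 22x smaller
```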
sebastianraschka.com/blog/2021/dl...
If you are new to reinforcement learning, this article has a generous intro section (PPO, GRPO, etc.).
Also, I cover 15 recent articles focused on RL & Reasoning.
🔗 magazine.sebastianraschka.com/p/the-state-...
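As a small taste of what the intro covers, GRPO's core idea is a group-relative advantage: sample several responses per prompt and normalize each response's reward by the group mean and standard deviation. A minimal sketch (not the article's code):

```python
import torch

def grpo_advantages(rewards, eps=1e-8):
    # rewards: 1-D tensor with one scalar reward per sampled response
    # for the same prompt; returns the group-normalized advantages.
    return (rewards - rewards.mean()) / (rewards.std() + eps)

print(grpo_advantages(torch.tensor([1.0, 0.0, 0.0, 1.0])))
```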
Why? Because I think 1B & 3B models are great for experimentation, and I wanted to share a clean, readable implementation for learning and research: huggingface.co/rasbt/llama-...