Sebastian Raschka (rasbt)
@sebastianraschka.com
9.5K followers 240 following 250 posts
ML/AI researcher & former stats professor turned LLM research engineer. Author of "Build a Large Language Model From Scratch" (https://amzn.to/4fqvn0D) & reasoning (https://mng.bz/Nwr7). Also blogging about AI research at magazine.sebastianraschka.com.
Pinned
sebastianraschka.com
How do we evaluate LLMs?
I wrote up a new article on
(1) multiple-choice benchmarks,
(2) verifiers,
(3) leaderboards, and
(4) LLM judges

All with from-scratch code examples, of course!

sebastianraschka.com/blog/2025/ll...
Understanding the 4 Main Approaches to LLM Evaluation (From Scratch)
Multiple-Choice Benchmarks, Verifiers, Leaderboards, and LLM Judges with Code Examples
sebastianraschka.com
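To give a flavor of the from-scratch style, here is a minimal sketch of the multiple-choice approach (a hypothetical example, not the article's code; query_model stands in for whatever LLM call you use, and the question is made up for illustration):

def query_model(prompt: str) -> str:
    # Placeholder: in practice this would call a local or hosted LLM
    return "B"

questions = [
    {"question": "2 + 2 = ?", "choices": {"A": "3", "B": "4", "C": "5"}, "answer": "B"},
]

correct = 0
for q in questions:
    choices = "\n".join(f"{key}. {text}" for key, text in q["choices"].items())
    prompt = f"{q['question']}\n{choices}\nAnswer with a single letter."
    prediction = query_model(prompt).strip()[:1].upper()
    correct += prediction == q["answer"]

print(f"Accuracy: {correct / len(questions):.2%}")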
sebastianraschka.com
Interesting! I am not sure it will be a concatenation of models (I can't see how it would work), but I can see a generalist model calling modules like this for specific problems.
sebastianraschka.com
This was probably developed with ARC in mind. But they also had models for Sudoku-hard and Maze, so I don't think that's true. I.e., you can train a similar model on other logic puzzle tasks as well.
sebastianraschka.com
7/ In practice, we often start by throwing LLMs at a problem, which makes sense for quick prototyping and establishing a baseline. But I can see a point where someone sits down afterward and trains a focused model like this to solve the same task more efficiently.
sebastianraschka.com
6/ That said, HRM and TRM are fascinating proof‑of‑concepts that show what’s possible with relatively small and efficient architectures. I'm still curious what the real‑world use case will look like. Maybe they could serve as reasoning or planning modules within a larger tool‑calling system.
sebastianraschka.com
5/ My personal caveat: comparing this method (or HRMs) to LLMs feels a bit unfair since HRMs/TRM are specialized models trained for specific tasks (here: ARC, Sudoku, and Maze pathfinding) while LLMs are generalists. It’s like comparing a pocket calculator to a computer.
sebastianraschka.com
4/ TRM backpropagates through the full recursion once per step, whereas HRM only backpropagates through the final few steps. And TRM also removes HRM's extra forward pass for halting and instead uses a simple binary cross-entropy loss to learn when to stop iterating.
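A rough illustration of the halting idea (my own simplification in PyTorch, not the paper's code): a scalar halting logit per example is trained with binary cross-entropy against whether the current prediction already matches the target, so the model learns when further recursion is unnecessary.

import torch
import torch.nn.functional as F

def halting_loss(halt_logit, logits, targets):
    # halt_logit: (batch,), logits: (batch, seq_len, vocab), targets: (batch, seq_len)
    # The label is 1.0 if the current prediction is already fully correct, else 0.0
    pred_correct = (logits.argmax(dim=-1) == targets).all(dim=-1).float()
    return F.binary_cross_entropy_with_logits(halt_logit, pred_correct)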
sebastianraschka.com
3/ In short, HRM recurses multiple times through two small neural nets with 4 transformer blocks each (high and low frequency). TRM is much smaller (i.e., ~4x smaller) and uses only a single network with 2 transformer blocks.
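To make the recursion concrete, here is a loose PyTorch sketch of refining an answer with a single tiny network applied repeatedly (my own simplification; the block count, dimensions, and update rule are assumptions, not the exact TRM design):

import torch
import torch.nn as nn

class TinyRecursiveSketch(nn.Module):
    def __init__(self, dim=256, num_blocks=2, num_steps=6):
        super().__init__()
        layer = nn.TransformerEncoderLayer(d_model=dim, nhead=4, batch_first=True)
        self.net = nn.TransformerEncoder(layer, num_layers=num_blocks)
        self.num_steps = num_steps

    def forward(self, x_emb, y_emb):
        # Reuse the same small network at every refinement step
        for _ in range(self.num_steps):
            combined = torch.cat([x_emb, y_emb], dim=1)
            y_emb = y_emb + self.net(combined)[:, x_emb.size(1):]
        return y_emb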
sebastianraschka.com
2/ Now, the new "Less is More: Recursive Reasoning with Tiny Networks" paper proposes the Tiny Recursive Model (TRM), which is a simpler and even smaller model (7M parameters, 4× smaller than HRM) that performs even better on the ARC challenge.
sebastianraschka.com
From the Hierarchical Reasoning Model (HRM) to a new Tiny Recursive Model (TRM).

A few months ago, the HRM made big waves in the AI research community as it showed really good performance on the ARC challenge despite its small size of only 27M parameters. (That's about 22x smaller than the smallest Qwen3 0.6B model.)
sebastianraschka.com
Just shared a new article on "The State of Reinforcement Learning for LLM Reasoning"!
If you are new to reinforcement learning, this article has a generous intro section (PPO, GRPO, etc.).
Also, I cover 15 recent articles focused on RL & Reasoning.

🔗 magazine.sebastianraschka.com/p/the-state-...
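For readers who want a taste before diving in, here is a minimal sketch of the group-relative advantage computation at the heart of GRPO (my simplification; the article covers the full method): sample several completions per prompt and normalize each reward by its group's mean and standard deviation instead of using a learned value function.

import torch

def group_relative_advantages(rewards: torch.Tensor, eps: float = 1e-8):
    # rewards: (num_prompts, num_samples_per_prompt)
    mean = rewards.mean(dim=1, keepdim=True)
    std = rewards.std(dim=1, keepdim=True)
    return (rewards - mean) / (std + eps)

advantages = group_relative_advantages(torch.tensor([[1.0, 0.0, 0.5, 1.0]]))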
sebastianraschka.com
Ha, yeah! It's interesting that they went with the GPT-4 tokenizer as the base tokenizer. But why not.
sebastianraschka.com
Gemma 3 is great! But that's a project for another day ... 😅
sebastianraschka.com
Coded Llama 3.2 model from scratch and shared it on the HF Hub.
Why? Because I think 1B & 3B models are great for experimentation, and I wanted to share a clean, readable implementation for learning and research: huggingface.co/rasbt/llama-...
sebastianraschka.com
My next tutorial on pretraining an LLM from scratch is now out. It starts with a step-by-step walkthrough of understanding, calculating, and optimizing the loss. After training, we update the text generation function with temperature scaling and top-k sampling: www.youtube.com/watch?v=Zar2...
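For reference, temperature scaling plus top-k sampling for picking the next token can be sketched in a few lines of PyTorch (hyperparameter values here are just examples, not the tutorial's exact settings):

import torch

def sample_next_token(logits, temperature=0.8, top_k=50):
    # logits: (vocab_size,) for the last position; top_k must not exceed vocab_size
    top_logits, top_idx = torch.topk(logits, top_k)
    probs = torch.softmax(top_logits / temperature, dim=-1)
    return top_idx[torch.multinomial(probs, num_samples=1)]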
sebastianraschka.com
wow, small world! And thanks regarding the code base, that means a lot to me (I really did spend a lot of time making it nice!)
sebastianraschka.com
I can't tell if you are joking...I hope you are joking...😅
Reposted by Sebastian Raschka (rasbt)
xowap.dev
Rémy @xowap.dev · Mar 17
I'm right now on the last chapter of "Build a Large Language Model (from scratch)" by @sebastianraschka.com and it's absolutely amazing to get started. Now I can understand why people lose their shit over DeepSeek, for example
sebastianraschka.com
all I can say is that there will likely be another book in due time...but it may take a bit (this book took me almost 1.5 years...) 😅
sebastianraschka.com
Ha, glad you are liking it so far! Chapter 7 is maybe also the most fun one, as we get to leverage all the stuff from the previous chapters to finally train a simple chatbot.
Re DeepSeek... it may take a bit but there'll be a sequel one day I hope!
sebastianraschka.com
I just shared a new tutorial: Implementing GPT From Scratch!

In this 1h 45min hands-on coding session, I go over implementing the GPT architecture, the foundation of modern LLMs (and I also have bonus material converting it to Llama 3.2): www.youtube.com/watch?v=YSAk...
Build an LLM from Scratch 4: Implementing a GPT model from Scratch To Generate Text
YouTube video by Sebastian Raschka
www.youtube.com
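As a rough orientation, a pre-norm GPT-style transformer block (the kind of building block such an implementation centers on) can be sketched as follows; the dimensions are example values and the causal attention mask is omitted for brevity, so this is not the session's exact code:

import torch.nn as nn

class GPTBlockSketch(nn.Module):
    def __init__(self, dim=768, num_heads=12):
        super().__init__()
        self.norm1 = nn.LayerNorm(dim)
        self.attn = nn.MultiheadAttention(dim, num_heads, batch_first=True)
        self.norm2 = nn.LayerNorm(dim)
        self.mlp = nn.Sequential(nn.Linear(dim, 4 * dim), nn.GELU(), nn.Linear(4 * dim, dim))

    def forward(self, x):
        h = self.norm1(x)
        x = x + self.attn(h, h, h, need_weights=False)[0]  # causal mask omitted for brevity
        return x + self.mlp(self.norm2(x))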