sebastianraschka.com
Sebastian Raschka (rasbt)
@sebastianraschka.com
ML/AI researcher & former stats professor turned LLM research engineer. Author of "Build a Large Language Model From Scratch" (https://amzn.to/4fqvn0D) & its reasoning-model follow-up (https://mng.bz/Nwr7).

Also blogging about AI research at magazine.sebastianraschka.com.
Ha, thanks! Happy new year to you as well!
December 31, 2025 at 1:54 PM
Thanks! Is /r/machinelearning still weekend-only unless it's an arXiv article?
December 30, 2025 at 7:29 PM
This is an opinion. That's why I prefaced my post with "I think of it as this."
December 29, 2025 at 3:53 PM
I agree. I was thinking of “faster” because it frees up time by letting it handle boilerplate stuff. And I was thinking of “better” as in using it to find issues that would otherwise be overlooked.
December 28, 2025 at 9:18 PM
Yeah. My point was that LLMs are good amplifiers, but they are not the only tool one should use and learn from.
December 28, 2025 at 5:06 PM
It's a cycle: Coding manually, reading resources written by experts, looking at high-quality projects built by experts, getting advice from experts, and repeat...
December 28, 2025 at 4:17 PM
I discuss the more historical building blocks here if you are interested, going back to Schmidhuber's 1991 paper "Learning to Control Fast-Weight Memories: An Alternative to Dynamic Recurrent Networks": magazine.sebastianraschka.com/p/understand...
Understanding Large Language Models
A Cross-Section of the Most Relevant Literature To Get Up to Speed
magazine.sebastianraschka.com
December 23, 2025 at 3:35 PM
Yes, yes. This is not a complete history.
I assume you are specifically referring to the first line “202x…”? I merely wanted to say that the focus in the early 2020s was more on pre-training than anything else. (I think the term LLM wasn’t coined until the 175B GPT-3 model came out).
December 23, 2025 at 3:34 PM
Actually, I didn’t change any of the earlier sections but just appended the new sections to the article.
Re your LLM idea, I could see it as a benchmark for agentic LLMs, though, to see if they can get the correct architecture info from the code bases.
December 14, 2025 at 3:30 PM
Based on the naming resemblance, if I had to guess, DeepSeekMoE was motivated by DeepSpeed-MoE (arxiv.org/abs/2201.05596, 14 Jan 2022).
December 12, 2025 at 9:00 PM
Tbh, if it took them a month to write and release the paper, the DeepSeekMoE team probably also had the model ready in December.
Or, in other words, given all the ablation studies in that paper, I don't think they trained the model in just a month.
December 12, 2025 at 8:58 PM
They don't have a reasoning model yet, so it is a bit unfair to compare, but since you asked:
December 12, 2025 at 8:42 PM
I think Google originally came up with MoE, and DeepSeek and Mixtral adopted it independently of each other.

E.g., looking at arXiv, the Mixtral report came out on 8 Jan 2024 (arxiv.org/abs/2401.04088), and DeepSeekMoE around the same time, on 11 Jan 2024 (arxiv.org/abs/2401.06066).
December 12, 2025 at 8:34 PM
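(For context on the MoE discussion above, here is a minimal, illustrative PyTorch sketch of a sparse mixture-of-experts feed-forward layer: a router scores the experts per token, keeps the top-k, and mixes their outputs. The class name, toy dimensions, and expert design are placeholders for illustration, not the exact routing used in Mixtral or DeepSeekMoE.)

```python
import torch
import torch.nn as nn

class SparseMoE(nn.Module):
    # Toy sparse MoE feed-forward layer (illustrative only):
    # a linear router picks the top-k experts per token and
    # combines their outputs with softmax-normalized weights.
    def __init__(self, emb_dim=64, hidden_dim=128, num_experts=8, top_k=2):
        super().__init__()
        self.top_k = top_k
        self.router = nn.Linear(emb_dim, num_experts)
        self.experts = nn.ModuleList([
            nn.Sequential(
                nn.Linear(emb_dim, hidden_dim),
                nn.GELU(),
                nn.Linear(hidden_dim, emb_dim),
            )
            for _ in range(num_experts)
        ])

    def forward(self, x):                       # x: (num_tokens, emb_dim)
        scores = self.router(x)                 # (num_tokens, num_experts)
        topk_scores, topk_idx = scores.topk(self.top_k, dim=-1)
        weights = torch.softmax(topk_scores, dim=-1)
        out = torch.zeros_like(x)
        for slot in range(self.top_k):
            for e, expert in enumerate(self.experts):
                mask = topk_idx[:, slot] == e   # tokens whose slot-th pick is expert e
                if mask.any():
                    out[mask] += weights[mask, slot].unsqueeze(-1) * expert(x[mask])
        return out

moe = SparseMoE()
tokens = torch.randn(5, 64)
print(moe(tokens).shape)   # torch.Size([5, 64])
```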
Good catch, yes that should have been 70% not 40%. Thanks!
December 12, 2025 at 7:20 PM
Yes, good point. I must have accidentally moved the text boxes to the wrong position. Someone mentioned that on the forum last week and it's fixed now (the next time the MEAP is updated, the figures will be automatically replaced). Thanks for mentioning it.
December 6, 2025 at 1:11 AM
Sounds interesting, but as far as I know it doesn't have GPU support (though maybe they added that and I missed it).
December 6, 2025 at 1:10 AM
Yes, it's a somewhat scaled-down version of the H100 to make it export-compliant.
December 3, 2025 at 3:59 PM
I think you recently mentioned their alternative, more efficient GPUs. Actually, in their latest V3.2 technical report they mention H800s, so it looks like they are back to using NVIDIA GPUs.
December 3, 2025 at 2:53 PM