github.com/jaalu | he/him
this is the best and most detailed summary of the current state of SOTA LLM training
nanochat is good for understanding LLM training; this tech report catches you up to SOTA methods
Best fully open 32B reasoning model & best 32B base model. 🧵
Registration is open until January 1st 2026, but we recommend registering early to avoid high hotel prices
More info in the comments 👇
Read more about van der Schaar's and Boyd's and other Winter School tutorials in the comments 👇
(Homebrew is a given, but not sure which terminal emulators people prefer now, for instance)
for Diverse 3D Assets
Paper: arxiv.org/abs/2502.09615
Web: www.liuisabella.com/RigAnything/
Code: github.com/Isabella98Li...
Model: huggingface.co/Isabellaliu/...
The plan: sandwich a language model between an audio encoder/decoder pair (a neural audio codec), allowing it to predict audio continuations.
kyutai.org/next/codec-e...
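Not Kyutai's actual code, just a toy torch sketch of that sandwich: a codec encoder that quantizes waveform frames into discrete tokens, a small transformer LM over those tokens, and a codec decoder that turns tokens back into audio. Every shape, module, and constant here is made up for illustration.

```python
import torch
import torch.nn as nn

CODEBOOK = 1024   # assumed codebook size
FRAME = 320       # assumed samples per codec frame (~20 ms at 16 kHz)

class ToyCodecEncoder(nn.Module):
    """Waveform -> one discrete token id per frame, via nearest-codeword quantization."""
    def __init__(self):
        super().__init__()
        self.proj = nn.Linear(FRAME, 64)
        self.codebook = nn.Embedding(CODEBOOK, 64)

    def forward(self, wav):                                   # wav: (batch, samples)
        frames = wav.unfold(-1, FRAME, FRAME)                 # (batch, n_frames, FRAME)
        z = self.proj(frames)                                 # (batch, n_frames, 64)
        dists = (z.unsqueeze(-2) - self.codebook.weight).pow(2).sum(-1)
        return dists.argmin(-1)                               # (batch, n_frames) token ids

class ToyCodecDecoder(nn.Module):
    """Discrete token ids -> waveform."""
    def __init__(self):
        super().__init__()
        self.codebook = nn.Embedding(CODEBOOK, 64)
        self.out = nn.Linear(64, FRAME)

    def forward(self, tokens):
        return self.out(self.codebook(tokens)).flatten(1)     # (batch, samples)

class ToyAudioLM(nn.Module):
    """Transformer over codec tokens (causal mask omitted for brevity)."""
    def __init__(self):
        super().__init__()
        self.emb = nn.Embedding(CODEBOOK, 128)
        layer = nn.TransformerEncoderLayer(128, 4, batch_first=True)
        self.backbone = nn.TransformerEncoder(layer, num_layers=2)
        self.head = nn.Linear(128, CODEBOOK)

    def forward(self, tokens):
        return self.head(self.backbone(self.emb(tokens)))     # (batch, seq, CODEBOOK)

# "Predict an audio continuation": encode a prompt, predict one more token, decode.
enc, dec, lm = ToyCodecEncoder(), ToyCodecDecoder(), ToyAudioLM()
prompt = torch.randn(1, FRAME * 8)                            # 8 frames of fake audio
tokens = enc(prompt)
next_token = lm(tokens)[:, -1].argmax(-1, keepdim=True)
continuation = dec(torch.cat([tokens, next_token], dim=1))    # 9 frames of audio back out
```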
Model: huggingface.co/deepseek-ai/...
Paper: github.com/deepseek-ai/...
Repo: github.com/deepseek-ai/...
* math: precision matters
* knowledge: effective param count is more important
* 4B-8bit threshold: for bigger, prefer quantization; for smaller, prefer more params
* parallel TTC (test-time compute) only works above 4B-8bit
arxiv.org/abs/2510.10964
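To make the tradeoff concrete, here's the back-of-the-envelope budget arithmetic behind those bullets (my toy numbers, not the paper's): at a fixed weight-memory budget, bits per parameter and parameter count trade off against each other, and 4B at 8-bit is the crossover point the thread names.

```python
# Rough illustration of the memory budget the bullets are about, not the paper's method.
def weight_memory_gb(params_billion: float, bits: int) -> float:
    """Approximate weight memory in GB for a dense model."""
    return params_billion * 1e9 * bits / 8 / 1e9

# Two ways to spend the same ~4 GB of weight memory:
print(weight_memory_gb(4, 8))   # 4B params at 8-bit -> 4.0 GB
print(weight_memory_gb(8, 4))   # 8B params at 4-bit -> 4.0 GB
```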
Memory primitives were graphics-shaped, not computer-science-shaped.
Want to do math on an array? Store it as an RGBA texture.
Use a fragment shader for processing. *Paint* the result in a big rectangle.
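A rough numpy emulation of that workflow (nothing here is actual GPU code; the real thing compiled the per-texel function into a GLSL fragment shader and drew a full-screen quad):

```python
import numpy as np

data = np.arange(64, dtype=np.float32)             # the array we actually care about

# Step 1: reshape it into a W x H x 4 "RGBA texture" (4 floats per texel).
tex = data.reshape(4, 4, 4)                        # 4x4 texels, 4 channels each

# Step 2: the "fragment shader": one tiny function evaluated independently per texel.
def fragment_shader(texel: np.ndarray) -> np.ndarray:
    return np.sqrt(texel) * 2.0                    # whatever math you wanted to do

# Step 3: "paint" the result into an output texture, then read the array back out.
out_tex = np.stack(
    [fragment_shader(texel) for row in tex for texel in row]
).reshape(tex.shape)
result = out_tex.reshape(-1)
print(result[:4])
```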
📅 Abstract submission deadline: October 17th 2025
More information about submission guidelines on nldl.org
He wondered: what CAN'T be transformed by Transformers? So he wrote a fun blog post on finding "fixed points" of your LLMs. If you prompt it with a fixed-point token, the most likely next token is that same token, so the model just keeps repeating it.
link in reply!
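Not the post's code, but the brute-force version of the idea is easy to sketch with a small HF model (gpt2 here as a stand-in; the blog post presumably does something smarter than scanning the vocabulary one token at a time):

```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

tok = AutoTokenizer.from_pretrained("gpt2")
model = AutoModelForCausalLM.from_pretrained("gpt2").eval()

fixed_points = []
with torch.no_grad():
    for token_id in range(1000):                     # only the first 1k tokens, for speed
        input_ids = torch.tensor([[token_id]])
        next_id = model(input_ids).logits[0, -1].argmax().item()
        if next_id == token_id:                      # prompting with t predicts t again
            fixed_points.append(tok.decode([token_id]))

print(fixed_points)
```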
Juyeop Kim, Songkuk Kim, Jong-Seok Lee
tl;dr: classifier-free guidance is to blame
arxiv.org/abs/2509.25705
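For anyone who needs the refresher on what's being blamed, classifier-free guidance combines the conditional and unconditional predictions like this (textbook form, not the paper's code):

```python
# Standard classifier-free guidance: run the denoiser with and without conditioning,
# then extrapolate by a guidance scale w. Larger w pushes harder toward the prompt.
def cfg_prediction(eps_uncond, eps_cond, w: float):
    return eps_uncond + w * (eps_cond - eps_uncond)

print(cfg_prediction(0.1, 0.4, w=7.5))   # ~2.35
```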
Main takeaway: In mechanistic interpretability, we need assumptions about how DNNs encode concepts in their representations (e.g., the linear representation hypothesis). Without them, we can claim any DNN implements any algorithm!
How we train an open-everything model in a new pretraining environment, using releasable data (Common Corpus) and an open-source framework (Nanotron from HuggingFace).
www.sciencedirect.com/science/arti...
EmbeddingGemma, the new best-in-class open embedding model! 🚀
🏆 Top multilingual model on MTEB (<500M)
💾 Runs on <200MB RAM
⚙️ Customizable output for on-device use
🧩 Integrated with your favorite tools
developers.googleblog.com/en/introduci...
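A minimal usage sketch, assuming the sentence-transformers integration and the google/embeddinggemma-300m checkpoint id (double-check both, plus the recommended query/document prompts, against the blog post):

```python
from sentence_transformers import SentenceTransformer

model = SentenceTransformer("google/embeddinggemma-300m")   # assumed checkpoint id
embeddings = model.encode([
    "Which planet is known as the Red Planet?",
    "Mars is often called the Red Planet.",
])
print(embeddings.shape)                           # (2, embedding_dim)
print(model.similarity(embeddings, embeddings))   # pairwise similarity matrix
```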
Trained on 15T tokens in 1,000+ languages, it’s built for transparency, responsibility & the public good.
Read more: actu.epfl.ch/news/apertus...