Loubna Ben Allal
@loubnabnl.hf.co
1.4K followers 140 following 8 posts
SmolLMs & Data @huggingface Training SmolLMs and curating high quality web and synthetic datasets ✨ https://loubnabnl.github.io/
Reposted by Loubna Ben Allal
uphillconf.bsky.social
We’re excited to share the first talk topic for Uphill Conf 2025!

The Rise of Small Models: On-Device Language Models and SmolLM 📢 by @loubnabnl.hf.co

Learn how small language models are transforming AI for resource-constrained environments and get insights into the groundbreaking SmolLM series.
loubnabnl.hf.co
We built code datasets, English datasets, and now it’s time for math! 🚀

Check out Anton’s thread to learn how we curated the best public math pre-training dataset.
anton-l.bsky.social
Introducing 📐FineMath: the best open math pre-training dataset with 50B+ tokens!

Math remains challenging for LLMs and by training on FineMath we see considerable gains over other math datasets, especially on GSM8K and MATH.

🤗 huggingface.co/datasets/Hug...

Here’s a breakdown 🧵
A plot showing increased performance of Llama-3.2-3B when pretrained on FineMath
loubnabnl.hf.co
Yeah it was recorded, I will share it when it’s public
loubnabnl.hf.co
Sharing my slides on "Synthetic data and smol models in 2024" from yesterday's Latent Space event at NeurIPS: docs.google.com/presentation...

- Synthetic data is everywhere
- Model collapse, is the web polluted?
- 3B+ models running on your iPhone
- When and why use smol models?
Synthetic data & Smol models in 2024
docs.google.com
Reposted by Loubna Ben Allal
tavis.damnsimple.com
Another great talk at @latentspacepod.bsky.social NeurIPS: @loubnabnl.hf.co on Synthetic Data & Smol Models
Reposted by Loubna Ben Allal
benburtenshaw.bsky.social
For anyone interested in fine-tuning or aligning LLMs, I’m running this free and open course called smol course. It’s not a big deal, it’s just smol.

🧵>>
Reposted by Loubna Ben Allal
calebfahlgren.hf.co
The amazing, new Qwen2.5-Coder 32B model can now write SQL for any @hf.co dataset ✨
loubnabnl.hf.co
We hit 1K ⭐ on our SmolLM repo—thank you! 🎉 New updates:

• SmolLM2 nanotron checkpoints (with optimizer states) for easier continual pre-training
• Local inference demos (MLC, Transformers.js, MLX, llama.cpp)
• SmolVLM: Vision-language model built on SmolLM2

github.com/huggingface/...
loubnabnl.hf.co
In this demo Andi used SmolLM2 to summarize a long email, asked it follow-up questions, and then used it to rewrite his reply as a formal email: x.com/andi_marafio...
loubnabnl.hf.co
📬 Summarize and rewrite your text/emails faster, and offline!

Check @andimara.bsky.social's Smol Tools for summarization and rewriting. It uses SmolLM2 to summarize text and make it more friendly or professional, all running locally thanks to llama.cpp github.com/huggingface/...
smollm/smol_tools at main · huggingface/smollm
Everything about the SmolLM & SmolLM2 family of models - huggingface/smollm
github.com
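The pattern behind tools like this is straightforward: since a small local model has a limited context window, long emails are usually split into chunks before each chunk is summarized. Here's a minimal, illustrative sketch of that chunking step (not the actual smol_tools code; the word budget and helper name are invented for illustration):

```python
# Illustrative sketch (not the actual smol_tools implementation):
# long inputs are split into pieces that fit the model's context
# window before each piece is summarized locally.

def chunk_text(text: str, max_words: int = 200) -> list[str]:
    """Split text into chunks of at most max_words words."""
    words = text.split()
    return [
        " ".join(words[i : i + max_words])
        for i in range(0, len(words), max_words)
    ]

# Each chunk would then be passed to the local model, e.g. via a
# llama.cpp binding:
# summary = llm(f"Summarize this email:\n{chunk}")
```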
Reposted by Loubna Ben Allal
xenova.bsky.social
WOW! 🤯 Language models are becoming smaller and more capable than ever! Here's SmolLM2 running 100% locally in-browser w/ WebGPU on a 6-year-old GPU. Just look at that speed! ⚡️😍

Powered by 🤗 Transformers.js and ONNX Runtime Web!

How many tokens/second do you get? Let me know! 👇
Reposted by Loubna Ben Allal
simonwillison.net
This demo of structured data extraction running on an LLM that executes entirely in the browser (Chrome only for the moment since it uses WebGPU) is amazing

My notes here: simonwillison.net/2024/Nov/29/...
Reposted by Loubna Ben Allal
reach-vb.hf.co
vb @reach-vb.hf.co · Nov 28
Fuck it! Structured Generation w/ SmolLM2 running in browser & WebGPU 🔥

Powered by MLC Web-LLM & XGrammar ⚡

Define a JSON schema, Input free text, get structured data right in your browser - profit!!
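To make the "define a schema, get structured data" workflow concrete, here is a hypothetical schema of the kind you might define for this demo (field names invented for illustration; constrained decoding with XGrammar guarantees the model's output parses against whatever schema you supply):

```python
# Hypothetical example schema (names are illustrative, not from the
# demo): the grammar-constrained decoder only emits tokens that keep
# the output valid against this schema.
invoice_schema = {
    "type": "object",
    "properties": {
        "customer": {"type": "string"},
        "total": {"type": "number"},
        "items": {"type": "array", "items": {"type": "string"}},
    },
    "required": ["customer", "total"],
}
```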
Reposted by Loubna Ben Allal
andimara.bsky.social
Let's go! We are releasing SmolVLM, a smol 2B VLM built for on-device inference that outperforms all models at similar GPU RAM usage and token throughput.

SmolVLM can be fine-tuned on a Google Colab and run on a laptop! Or process millions of documents with a consumer GPU!
loubnabnl.hf.co
We use open Llama models for generating our new datasets and refer users to the original licenses of the existing datasets.
Reposted by Loubna Ben Allal
anton-l.bsky.social
Check out how easy it is to do LLM evals with LightEval!

* any dataset on the 🤗 Hub can become an eval task in a few lines of code: customize the prompt, metrics, parsing, few-shots, everything!
* model- and data-parallel inference
* auto batching with the new vLLM backend
A screenshot of LightEval benchmarking results in a terminal
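The core idea of "any Hub dataset becomes an eval task" is mapping each dataset row to a prompt plus a gold answer, optionally with few-shot examples prepended. A rough sketch of that mapping (illustrative only, not LightEval's actual API; the helper name and row fields are assumptions):

```python
# Illustrative sketch of the idea behind a custom eval task (not
# LightEval's actual API): each dataset row becomes a prompt, with
# few-shot examples prepended as solved Q/A pairs.

def build_prompt(row: dict, few_shots: list[dict]) -> str:
    shots = "".join(
        f"Q: {s['question']}\nA: {s['answer']}\n\n" for s in few_shots
    )
    return f"{shots}Q: {row['question']}\nA:"

prompt = build_prompt(
    {"question": "2+2?"},
    few_shots=[{"question": "1+1?", "answer": "2"}],
)
```

The model's completion after the final "A:" would then be parsed and scored against the row's gold answer.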
Reposted by Loubna Ben Allal
thomwolf.bsky.social
It's Sunday morning so taking a minute for a nerdy thread (on math, tokenizers and LLMs) of the work of our intern Garreth

By adding a few lines of code to the base Llama 3 tokenizer, he got a free boost in arithmetic performance 😮

[thread]
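One well-known trick along these lines (illustrative here, and not necessarily the exact change from the thread) is to pre-split numbers into single digits before tokenization, so the model sees "1 2 3 4" instead of an arbitrary multi-digit chunk like "1234", which makes digit-by-digit arithmetic much easier to learn:

```python
import re

# Insert a space between every pair of consecutive digits, so a
# tokenizer treats each digit as its own token. This is a common
# arithmetic trick, shown for illustration only.

def split_digits(text: str) -> str:
    """Insert spaces between consecutive digits."""
    return re.sub(r"(?<=\d)(?=\d)", " ", text)

print(split_digits("123 + 456 = 579"))  # → "1 2 3 + 4 5 6 = 5 7 9"
```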
loubnabnl.hf.co
Making SmolLM2 more reproducible: open-sourcing our training & evaluation toolkit 🛠️ github.com/huggingface/...

Pre-training & evaluation code, synthetic data generation pipelines, post-training scripts, on-device tools & demos

Apache 2.0. V2 data mix coming soon!

Which tools should we add next?
GitHub - huggingface/smollm: Everything about the SmolLM & SmolLM2 family of models
Everything about the SmolLM & SmolLM2 family of models - huggingface/smollm
github.com
Reposted by Loubna Ben Allal
gabrielmb.com
Excited to announce the SFT dataset used for @huggingface.bsky.social SmolLM2!

The dataset for SmolLM2 was created by combining multiple existing datasets and generating new synthetic datasets, including MagPie Ultra v1.0, using distilabel.

Check out the dataset:
huggingface.co/datasets/Hug...
HuggingFaceTB/smoltalk · Datasets at Hugging Face
We’re on a journey to advance and democratize artificial intelligence through open source and open science.
huggingface.co
Reposted by Loubna Ben Allal
lvwerra.bsky.social
What's the secret sauce of SmolLM2 to beat LLM titans like Llama3.2 and Qwen2.5?

Unsurprisingly: data, data, data!

The SmolTalk dataset is open and available here: huggingface.co/datasets/Hug...