Loubna Ben Allal
@loubnabnl.hf.co
1.4K followers 140 following 8 posts
SmolLMs & Data @huggingface Training SmolLMs and curating high quality web and synthetic datasets ✨ https://loubnabnl.github.io/
Reposted by Loubna Ben Allal
uphillconf.bsky.social
We’re excited to share the first talk topic for Uphill Conf 2025!

The Rise of Small Models: On-Device Language Models and SmolLM 📢 by @loubnabnl.hf.co

Learn how small language models are transforming AI for resource-constrained environments and get insights into the groundbreaking SmolLM series.
loubnabnl.hf.co
We built code datasets, English datasets, and now it’s time for math! 🚀

Check out Anton’s thread to learn how we curated the best public math pre-training dataset.
anton-l.bsky.social
Introducing 📐FineMath: the best open math pre-training dataset with 50B+ tokens!

Math remains challenging for LLMs and by training on FineMath we see considerable gains over other math datasets, especially on GSM8K and MATH.

🤗 huggingface.co/datasets/Hug...

Here’s a breakdown 🧵
A plot showing increased performance of Llama-3.2-3B when pretrained on FineMath
loubnabnl.hf.co
Yeah it was recorded, I will share it when it’s public
loubnabnl.hf.co
Sharing my slides on "Synthetic data and smol models in 2024" from yesterday's Latent Space event at NeurIPS: docs.google.com/presentation...

- Synthetic data is everywhere
- Model collapse, is the web polluted?
- 3B+ models running on your iPhone
- When and why use smol models?
Synthetic data & Smol models in 2024
docs.google.com
Reposted by Loubna Ben Allal
tavis.damnsimple.com
Another great talk at @latentspacepod.bsky.social NeurIPS: @loubnabnl.hf.co on Synthetic Data & Smol Models
Reposted by Loubna Ben Allal
benburtenshaw.bsky.social
For anyone interested in fine-tuning or aligning LLMs, I’m running this free and open course called smol course. It’s not a big deal, it’s just smol.

🧵>>
Reposted by Loubna Ben Allal
calebfahlgren.hf.co
The amazing, new Qwen2.5-Coder 32B model can now write SQL for any @hf.co dataset ✨
loubnabnl.hf.co
We hit 1K ⭐ on our SmolLM repo—thank you! 🎉 New updates:

• SmolLM2 nanotron checkpoints (with optimizer states) for easier continual pre-training
• Local inference demos (MLC, Transformers.js, MLX, llama.cpp)
• SmolVLM: Vision-language model built on SmolLM2

github.com/huggingface/...
loubnabnl.hf.co
In this demo Andi used SmolLM2 to summarize a long email, asked it follow-up questions, and then used it to rewrite his reply as a formal email: x.com/andi_marafio...
loubnabnl.hf.co
📬 Summarize and rewrite your text/emails faster, and offline!

Check @andimara.bsky.social's Smol Tools for summarization and rewriting. It uses SmolLM2 to summarize text and make it more friendly or professional, all running locally thanks to llama.cpp github.com/huggingface/...
smollm/smol_tools at main · huggingface/smollm
Everything about the SmolLM & SmolLM2 family of models - huggingface/smollm
github.com
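The pattern behind tools like this is straightforward: since a small local model has a limited context window, long emails are usually split into chunks before each chunk is summarized. Here's a minimal, illustrative sketch of that chunking step (not the actual smol_tools code; the word budget and helper name are invented for illustration):

```python
# Illustrative sketch (not the actual smol_tools implementation):
# long inputs are split into pieces that fit the model's context
# window before each piece is summarized locally.

def chunk_text(text: str, max_words: int = 200) -> list[str]:
    """Split text into chunks of at most max_words words."""
    words = text.split()
    return [
        " ".join(words[i : i + max_words])
        for i in range(0, len(words), max_words)
    ]

# Each chunk would then be passed to the local model, e.g. via a
# llama.cpp binding:
# summary = llm(f"Summarize this email:\n{chunk}")
```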
Reposted by Loubna Ben Allal
xenova.bsky.social
WOW! 🤯 Language models are becoming smaller and more capable than ever! Here's SmolLM2 running 100% locally in-browser w/ WebGPU on a 6-year-old GPU. Just look at that speed! ⚡️😍

Powered by 🤗 Transformers.js and ONNX Runtime Web!

How many tokens/second do you get? Let me know! 👇
Reposted by Loubna Ben Allal
simonwillison.net
This demo of structured data extraction running on an LLM that executes entirely in the browser (Chrome only for the moment since it uses WebGPU) is amazing

My notes here: simonwillison.net/2024/Nov/29/...
Reposted by Loubna Ben Allal
reach-vb.hf.co
vb @reach-vb.hf.co · Nov 28
Fuck it! Structured Generation w/ SmolLM2 running in browser & WebGPU 🔥

Powered by MLC Web-LLM & XGrammar ⚡

Define a JSON schema, Input free text, get structured data right in your browser - profit!!
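To make the "define a schema, get structured data" workflow concrete, here is a hypothetical schema of the kind you might define for this demo (field names invented for illustration; constrained decoding with XGrammar guarantees the model's output parses against whatever schema you supply):

```python
# Hypothetical example schema (names are illustrative, not from the
# demo): the grammar-constrained decoder only emits tokens that keep
# the output valid against this schema.
invoice_schema = {
    "type": "object",
    "properties": {
        "customer": {"type": "string"},
        "total": {"type": "number"},
        "items": {"type": "array", "items": {"type": "string"}},
    },
    "required": ["customer", "total"],
}
```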
Reposted by Loubna Ben Allal
andimara.bsky.social
Let's go! We are releasing SmolVLM, a smol 2B VLM built for on-device inference that outperforms all models at similar GPU RAM usage and token throughput.

SmolVLM can be fine-tuned on a Google Colab and run on a laptop! Or process millions of documents with a consumer GPU!
loubnabnl.hf.co
We use open Llama models for generating our new datasets and refer users to the original licenses of the existing datasets.
Reposted by Loubna Ben Allal
anton-l.bsky.social
Check out how easy it is to do LLM evals with LightEval!

* any dataset on the 🤗 Hub can become an eval task in a few lines of code: customize the prompt, metrics, parsing, few-shots, everything!
* model- and data-parallel inference
* auto batching with the new vLLM backend
A screenshot of LightEval benchmarking results in a terminal
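The core idea of "any Hub dataset becomes an eval task" is mapping each dataset row to a prompt plus a gold answer, optionally with few-shot examples prepended. A rough sketch of that mapping (illustrative only, not LightEval's actual API; the helper name and row fields are assumptions):

```python
# Illustrative sketch of the idea behind a custom eval task (not
# LightEval's actual API): each dataset row becomes a prompt, with
# few-shot examples prepended as solved Q/A pairs.

def build_prompt(row: dict, few_shots: list[dict]) -> str:
    shots = "".join(
        f"Q: {s['question']}\nA: {s['answer']}\n\n" for s in few_shots
    )
    return f"{shots}Q: {row['question']}\nA:"

prompt = build_prompt(
    {"question": "2+2?"},
    few_shots=[{"question": "1+1?", "answer": "2"}],
)
```

The model's completion after the final "A:" would then be parsed and scored against the row's gold answer.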
Reposted by Loubna Ben Allal
thomwolf.bsky.social
It's Sunday morning so taking a minute for a nerdy thread (on math, tokenizers and LLMs) of the work of our intern Garreth

By adding a few lines of code to the base Llama 3 tokenizer, he got a free boost in arithmetic performance 😮

[thread]
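One well-known trick along these lines (illustrative here, and not necessarily the exact change from the thread) is to pre-split numbers into single digits before tokenization, so the model sees "1 2 3 4" instead of an arbitrary multi-digit chunk like "1234", which makes digit-by-digit arithmetic much easier to learn:

```python
import re

# Insert a space between every pair of consecutive digits, so a
# tokenizer treats each digit as its own token. This is a common
# arithmetic trick, shown for illustration only.

def split_digits(text: str) -> str:
    """Insert spaces between consecutive digits."""
    return re.sub(r"(?<=\d)(?=\d)", " ", text)

print(split_digits("123 + 456 = 579"))  # → "1 2 3 + 4 5 6 = 5 7 9"
```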
loubnabnl.hf.co
Making SmolLM2 more reproducible: open-sourcing our training & evaluation toolkit 🛠️ github.com/huggingface/...

Pre-training & evaluation code, synthetic data generation pipelines, post-training scripts, on-device tools & demos

Apache 2.0. V2 data mix coming soon!

Which tools should we add next?
GitHub - huggingface/smollm: Everything about the SmolLM & SmolLM2 family of models
Everything about the SmolLM & SmolLM2 family of models - huggingface/smollm
github.com
Reposted by Loubna Ben Allal
gabrielmb.com
Excited to announce the SFT dataset used for @huggingface.bsky.social SmolLM2!

The dataset for SmolLM2 was created by combining multiple existing datasets and generating new synthetic datasets, including MagPie Ultra v1.0, using distilabel.

Check out the dataset:
huggingface.co/datasets/Hug...
HuggingFaceTB/smoltalk · Datasets at Hugging Face
We’re on a journey to advance and democratize artificial intelligence through open source and open science.
huggingface.co
Reposted by Loubna Ben Allal
lvwerra.bsky.social
What's the secret sauce of SmolLM2 to beat LLM titans like Llama3.2 and Qwen2.5?

Unsurprisingly: data, data, data!

The SmolTalk dataset is open and available here: huggingface.co/datasets/Hug...