Lightnews — Scholar-powered news

Guilherme Penedo

@guilherme.hf.co

610 followers 66 following 3 posts

ML Research Engineer at 🤗. Lisboeta 🇵🇹

Pinned

Guilherme Penedo @guilherme.hf.co · Dec 8

Announcing 🥂 FineWeb2: A sparkling update with 1000s of 🗣️ languages.

We applied the same data-driven approach that led to SOTA English performance in 🍷 FineWeb to thousands of languages.

🥂 FineWeb2 has 8TB of compressed text data and outperforms other datasets.

Reposted by Guilherme Penedo

garreth @garrethlee.bsky.social · Dec 16

🚀 With Meta's recent paper replacing tokenization in LLMs with patches 🩹, I figured that it's a great time to revisit how tokenization has evolved over the years using everyone's favourite medium - memes!

Let's take a trip down memory lane!

[1/N]

Guilherme Penedo @guilherme.hf.co · Dec 8

We will very soon announce a big community project, and are working on a 📝 blogpost walking you through the entire dataset creation process. Stay tuned!

Guilherme Penedo @guilherme.hf.co · Dec 8

The dataset is released under the permissive 📜 ODC-By 1.0 license, and the 💻 code to reproduce it and our evaluations is public.

Find out all about 🥂 FineWeb2 on the 🤗 dataset page:
huggingface.co/datasets/Hug...

HuggingFaceFW/fineweb-2 · Datasets at Hugging Face
