Guilherme Penedo
@guilherme.hf.co
ML Research Engineer at 🤗. Lisboeta 🇵🇹
Pinned
guilherme.hf.co
Announcing 🥂 FineWeb2: A sparkling update with 1000s of 🗣️languages.

We applied the same data-driven approach that led to SOTA English performance in 🍷 FineWeb to thousands of languages.

🥂 FineWeb2 has 8TB of compressed text data and outperforms other datasets.
Reposted by Guilherme Penedo
🚀 With Meta's recent paper replacing tokenization in LLMs with patches 🩹, I figured that it's a great time to revisit how tokenization has evolved over the years using everyone's favourite medium - memes!

Let's take a trip down memory lane!

[1/N]
guilherme.hf.co
We will very soon announce a big community project, and are working on a 📝 blogpost walking you through the entire dataset creation process. Stay tuned!
guilherme.hf.co
The dataset is released under the permissive 📜 ODC-By 1.0 license, and the 💻 code to reproduce it and our evaluations is public.

Find out all about 🥂 FineWeb2 on the 🤗 dataset page:
huggingface.co/datasets/Hug...
HuggingFaceFW/fineweb-2 · Datasets at Hugging Face