Elie
@eliebak.hf.co
2.4K followers 260 following 20 posts
Training LLMs at Hugging Face | hf.co/science
Reposted by Elie
anton-l.bsky.social
LLM Reasoning labs will be eating good today🍔

We commandeered the HF cluster for a few days and generated 1.2M reasoning-filled solutions to 500k NuminaMath problems with DeepSeek-R1 🐳
Have fun!
Reposted by Elie
qgallouedec.hf.co
Last moments of closed-source AI 🪦 :
Hugging Face is openly reproducing the pipeline of 🐳 DeepSeek-R1. Open data, open training, open models, open collaboration.

🫵 Let's go!
github.com/huggingface/...
GitHub - huggingface/open-r1: Fully open reproduction of DeepSeek-R1
Fully open reproduction of DeepSeek-R1. Contribute to huggingface/open-r1 development by creating an account on GitHub.
github.com
Reposted by Elie
lewtun.bsky.social
We are reproducing the full DeepSeek R1 data and training pipeline so everybody can use their recipe. Instead of doing it in secret we can do it together in the open!

Follow along: github.com/huggingface/...
GitHub - huggingface/open-r1: Fully open reproduction of DeepSeek-R1
Fully open reproduction of DeepSeek-R1. Contribute to huggingface/open-r1 development by creating an account on GitHub.
github.com
Reposted by Elie
anton-l.bsky.social
Introducing 📐FineMath: the best open math pre-training dataset with 50B+ tokens!

Math remains challenging for LLMs and by training on FineMath we see considerable gains over other math datasets, especially on GSM8K and MATH.

🤗 huggingface.co/datasets/Hug...

Here’s a breakdown 🧵
A plot showing increased performance of Llama-3.2-3B when pretrained on FineMath
eliebak.hf.co
Elie @eliebak.hf.co · Dec 11
WOW, Gemini Flash 2.0 is really impressive. Wondering about the size of this supposedly smol model.

One odd thing is that the model seems to lose some ability with long contexts compared to Flash 1.5. If any Google friends could share insights, I'd love to hear them!
eliebak.hf.co
Elie @eliebak.hf.co · Dec 5
Curious about this: what is the % of "new ideas" that you are not allowed to publish? (if you can answer ofc)
eliebak.hf.co
Elie @eliebak.hf.co · Dec 5
should be good now
eliebak.hf.co
Elie @eliebak.hf.co · Dec 4
Hey, I'll be at NeurIPS next week! My DMs are open if you want to meet and talk about pre-training/data/whatever you want 🫡
eliebak.hf.co
Elie @eliebak.hf.co · Dec 3
Link: www.freepatentsonline.com/y2024/037844...
I've probably missed a lot, feel free to add more ⬇️
www.freepatentsonline.com
eliebak.hf.co
Elie @eliebak.hf.co · Dec 3
- They use some kind of metadata tokens to give information about toxicity and data leakage, but also a "quality" token?
- [0118] talks about using some kind of LoRAs during the finetuning/alignment phase to adapt to multiple downstream tasks
- ~[0154] some memory evaluation technique?
eliebak.hf.co
Elie @eliebak.hf.co · Dec 3
Google patent on "Training of large neural network". 😮

I don't know if this gives much information, but from a quick pass through it, it seems that:
- They are not only using "causal language modeling" as a pre-training task, but also "span corruption" and "prefix modeling" (ref [0805]-[0091])
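For reference, here is a minimal sketch of what a T5-style span-corruption objective looks like; the masking ratio, span length, and sentinel ids below are illustrative assumptions, not details from the patent:

```python
import random

def span_corrupt(tokens, corruption_rate=0.15, mean_span_len=3, sentinel_start=32000):
    """Drop random contiguous spans and replace each with a sentinel token id.

    Returns (corrupted_input, target): instead of predicting every next token
    as in causal LM, the model learns to reconstruct the dropped spans, each
    prefixed by its sentinel.
    """
    n_to_mask = max(1, int(len(tokens) * corruption_rate))
    masked = set()
    while len(masked) < n_to_mask:
        start = random.randrange(len(tokens))
        for i in range(start, min(len(tokens), start + mean_span_len)):
            masked.add(i)

    corrupted, target, sentinel = [], [], sentinel_start
    i = 0
    while i < len(tokens):
        if i in masked:
            corrupted.append(sentinel)
            target.append(sentinel)
            while i < len(tokens) and i in masked:
                target.append(tokens[i])
                i += 1
            sentinel += 1
        else:
            corrupted.append(tokens[i])
            i += 1
    return corrupted, target

# e.g. for tokens [10, 11, 12, 13, 14, 15, 16, 17] with spans {2,3,4} and {7} masked:
# corrupted = [10, 11, 32000, 15, 16, 32001], target = [32000, 12, 13, 14, 32001, 17]
```

Prefix modeling is simpler still: pick a random split point, attend bidirectionally over the prefix, and compute the loss only on the suffix.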
Reposted by Elie
merve.bsky.social
So many open-source and open releases last week!
Here's a recap, find the text-readable version here huggingface.co/posts/merve/...
Reposted by Elie
loubnabnl.hf.co
📬 Summarize and rewrite your text/emails faster, and offline!

Check @andimara.bsky.social's Smol Tools for summarization and rewriting. It uses SmolLM2 to summarize text and make it more friendly or professional, all running locally thanks to llama.cpp: github.com/huggingface/...
smollm/smol_tools at main · huggingface/smollm
Everything about the SmolLM & SmolLM2 family of models - huggingface/smollm
github.com
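If you want to hack on something similar yourself, here is a rough sketch using llama-cpp-python with a local SmolLM2 GGUF checkpoint; the model filename and the prompt are placeholders (Smol Tools ships its own prompts), so treat this as an illustration rather than the tool's actual code:

```python
from llama_cpp import Llama

# Hypothetical local path to a SmolLM2 GGUF checkpoint downloaded from the Hub.
llm = Llama(model_path="SmolLM2-1.7B-Instruct-Q4_K_M.gguf", n_ctx=4096, verbose=False)

def summarize(text: str) -> str:
    # Simple chat-style prompt; the real tool uses its own, more refined prompts.
    out = llm.create_chat_completion(
        messages=[
            {"role": "system", "content": "Summarize the user's text in 2-3 sentences."},
            {"role": "user", "content": text},
        ],
        max_tokens=256,
        temperature=0.2,
    )
    return out["choices"][0]["message"]["content"]

print(summarize("Paste the email or text you want to shorten here."))
```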
eliebak.hf.co
Elie @eliebak.hf.co · Nov 30
What else should we log during LLM training? Right now, it's just loss, grad_norm, and evals, but I want to log more to have a better understanding of pre-training. Thinking about adding stuff like entropix metrics (agreement, varentropy?)

Any thoughts or cool ideas?
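For context on those metrics: token entropy and varentropy come straight from the logits you already compute at each step, so logging them is nearly free. A rough PyTorch sketch, where the exact definitions and reductions are my own interpretation rather than a fixed standard:

```python
import torch
import torch.nn.functional as F

def entropy_metrics(logits: torch.Tensor) -> dict:
    """Per-batch entropy stats from logits of shape (batch, seq, vocab)."""
    logp = F.log_softmax(logits.float(), dim=-1)
    p = logp.exp()
    # Token-level entropy: H = -sum_v p(v) log p(v)
    ent = -(p * logp).sum(dim=-1)                                 # (batch, seq)
    # Varentropy: variance of -log p(v) under p, i.e. how spread out the surprise is
    varent = (p * (logp + ent.unsqueeze(-1)) ** 2).sum(dim=-1)    # (batch, seq)
    return {
        "token_entropy/mean": ent.mean().item(),
        "token_entropy/std": ent.std().item(),
        "varentropy/mean": varent.mean().item(),
    }

# e.g. inside the training loop, alongside loss and grad_norm:
# metrics = entropy_metrics(outputs.logits.detach())
# wandb.log({"loss": loss.item(), "grad_norm": grad_norm, **metrics}, step=step)
```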
eliebak.hf.co
Elie @eliebak.hf.co · Nov 28
Glad to have you back!
eliebak.hf.co
Elie @eliebak.hf.co · Nov 28
I find it sad, but imo it's good news that those people block 'us'. I'm tired of seeing hateful comments on my colleagues' (and other ML engineers'/researchers') posts.
eliebak.hf.co
Elie @eliebak.hf.co · Nov 28
why not flex attention?
eliebak.hf.co
Elie @eliebak.hf.co · Nov 28
should be okay!
Reposted by Elie
xenova.bsky.social
WOW! 🤯 Language models are becoming smaller and more capable than ever! Here's SmolLM2 running 100% locally in-browser w/ WebGPU on a 6-year-old GPU. Just look at that speed! ⚡️😍

Powered by 🤗 Transformers.js and ONNX Runtime Web!

How many tokens/second do you get? Let me know! 👇
Reposted by Elie
muellerzr.bsky.social
I'm looking for an intern!

If you are:
* Driven
* Love OSS
* Interested in distributed PyTorch training/FSDPv2/DeepSpeed

Come work with me!

Fully remote, more details to apply in the comments
A job description stating:
About this Role

This internship works at the intersection of software engineering, machine learning engineering, and education. With a strong focus on distributed training through the accelerate library (https://huggingface.co/docs/accelerate/index), we'll focus on bringing state-of-the-art training techniques into the library while also documenting and helping teach others how they work. By the end of this internship, the candidate will have touched on all aspects of distributed training and core library contributions, including large-scale distributed training, API design, writing educational material aimed at a semi-technical audience, and understanding the nuances of writing software that scales.
eliebak.hf.co
Elie @eliebak.hf.co · Nov 27
10000% agree with Omar, this is totally disproportionate
osanseviero.bsky.social
I'm disheartened by how toxic and violent some responses were here.

There was a mistake, a quick follow-up to mitigate it, and an apology. I worked with Daniel for years, and he is one of the people most concerned with the ethical implications of AI. Some replies are Reddit-level toxic. We need empathy.
danielvanstrien.bsky.social
I've removed the Bluesky data from the repo. While I wanted to support tool development for the platform, I recognize this approach violated principles of transparency and consent in data collection. I apologize for this mistake.
eliebak.hf.co
Elie @eliebak.hf.co · Nov 26
super nice! 🤗