Nathan
@saylortwift.hf.co
ML engineer at @huggingface 🤗, Evaluation, Open LLM Leaderboard and lighteval
Evaluation was just made easier 💯

We merged a huge refactor of lighteval, making it easier to add:
🔄 Multiturn tasks
🖼️ Multimodal tasks
📝 Plus unified logs for thorough benchmark analysis

Benchmark guys, what evals would you like to see added?
June 25, 2025 at 3:05 PM
🔥 Evaluating LLMs? You need Lighteval — the fastest, most flexible toolkit for benchmarking models, built by @huggingface

Now with:
✅ Plug & play custom model inference (evaluate any backend)
📈 Tasks like AIME, GPQA:diamond, SimpleQA, and hundreds more

Details below 🧵👇
May 6, 2025 at 2:26 PM
openai really has some nice benchmarks, one of them being simpleqa. a simple fact-checking benchmark, short questions and straight answers

i've been using @huggingface's lighteval and inference providers and litellm to evaluate all those models in less than a few hours 🤩

1/N
April 22, 2025 at 2:29 PM
🚀 Just dropped fresh benchmarks for LLaMA 4 Scout and Maverick using Lighteval!

Details below👇

1/6
April 8, 2025 at 8:53 AM
🚀 Introducing ✨ YourBench ✨ ! Build custom evals instantly using your private docs & see how your custom fine-tuned models perform on your unique tasks.
Congrats to @sumukx @clefourrier and @ailozovskaya for their incredible work !
Game-changing for LLM evaluation 🚀
1/2
April 3, 2025 at 9:35 AM
Just wrapped up evaluations on @deepseek_ai's V3 0324! 🚀

Impressive gains in math and GPQA, but instruction following took a slight hit. More concerning—AIME25 remains unchanged. Possible contamination issues? 🤔
March 26, 2025 at 10:07 PM
WOW. The Qwen team did NOT come to play.🔥
Just look at these insane results from the OpenEval team—absolutely impressive.
Huge congrats! 👏 @Alibaba_Qwen
March 10, 2025 at 12:39 PM
Everyone's talking about GPT-4.5 quality, so we ran benchmarks!

Did NOT expect it to be such a leap from GPT-4o—now on par with Claude 3.7 and even ahead of DeepSeek Llama 70B (a thinking model!).

Congrats to the team @OpenAI !
March 3, 2025 at 3:18 PM
Everyone's talking about GPT-4.5 quality, so we ran benchmarks!
Did NOT expect it to be such a leap from GPT-4o—now on par with Claude 3.7 and even ahead of DeepSeek Llama 70B (a thinking model!).

Congrats to the team @OpenAI ! Now open-source it and drop it on the Hub 🤗
March 3, 2025 at 3:05 PM
we just reproduced Claude 3.7 results for you 📈

TLDR: we get what they announced.
We also used AIME 2025 to test for contamination on the 2024 version, and scores are similar on both benchmarks!

Great job to the @AnthropicAI team !
More details in thread 👇
1/3
February 25, 2025 at 3:03 PM
Today marks my 2 years at @huggingface! Time flies!! Having worked with these people for 2 years now, I can tell you there is no better place to build ethical, open AI. HF folks are both kind and incredibly talented, and I can't wait to work on many more exciting projects with them 🤩
February 6, 2025 at 2:28 PM
DeepSeek R1 continues to impress! I just integrated the Olympiad Bench (a collection of elite-level Chinese and English scientific problems) into LightEval and tested GPT-4o against R1. The results are insane.

Full details + how to reproduce in the thread 👇
February 3, 2025 at 10:29 AM
Reposted by Nathan
Excited to see more biology open-source models for real positive use-cases of AI!

Chai does structure predictions at AlphaFold3 levels of accuracy and is able to handle multi-peptide or peptide-ligand complexes rather than just single chains.

Apache 2.0 on HF huggingface.co/chaidiscover...
December 5, 2024 at 2:39 PM
Reposted by Nathan
Most liked and most downloaded open-source AI models from 2022 to 2024

Interactive viz: aiworld.eu/embed/model/...
Discussion: huggingface.co/spaces/huggi...
December 4, 2024 at 8:37 AM
Reposted by Nathan
So many open-source and open releases last week!
Here's a recap; find the text-readable version here: huggingface.co/posts/merve/...
December 2, 2024 at 9:53 AM
Reposted by Nathan
Making SmolLM2 more reproducible: open-sourcing our training & evaluation toolkit 🛠️ github.com/huggingface/...

Pre-training & evaluation code, synthetic data generation pipelines, post-training scripts, on-device tools & demos

Apache 2.0. V2 data mix coming soon!

Which tools should we add next?
GitHub - huggingface/smollm: Everything about the SmolLM & SmolLM2 family of models
November 24, 2024 at 7:16 AM
Reposted by Nathan
Check out how easy it is to do LLM evals with LightEval!

* any dataset on the 🤗 Hub can become an eval task in a few lines of code: customize the prompt, metrics, parsing, few-shots, everything!
* model- and data-parallel inference
* auto batching with the new vLLM backend
November 25, 2024 at 5:24 PM
Reposted by Nathan
The team behind the SmolLM2 model at @huggingface.bsky.social just released everything! True open-source AI:

- Pre-training code
- Evaluation suite
- Synthetic data generation
- Post-training scripts with TRL
- On-device tools for summarization, rewriting & agents

All Apache 2.0 licensed! 🔥
November 24, 2024 at 6:25 PM
Reposted by Nathan
It's "on-device LLM" today.

Soon, it'll be "on-chip" LLM. Or LLM cores. The system default local LLM. The coding framework's default local LLM.

I find this incredibly exciting. A privacy-first, self-contained, user-owned AI—a 24/7 agent for action, insights & feedback.

github.com/huggingface/...
GitHub - huggingface/smollm: Everything about the SmolLM & SmolLM2 family of models
November 24, 2024 at 6:01 PM
This week (ish) in 🌤️ LLM evaluation 🔥
📊 A statistical approach to model evaluation @AnthropicAI
📐 FrontierMath: a benchmark for evaluating advanced mathematical reasoning in AI @EpochAIResearch
📝 Say What You Mean: A Response to 'Let Me Speak Freely' @dottxtai

🧵 👇
November 25, 2024 at 2:13 PM
Reposted by Nathan
Here is a list of ML OSS & Open Source / Science enthusiasts I found on Bluesky 🦋

go.bsky.app/8MFcfXd

Let me know if you find such people here!

I'm still new here and the list probably misses many must-add people, so let's build it together 💪
November 21, 2024 at 5:19 AM
Reposted by Nathan
Should HF do more agent stuff? If so, what would be useful?
November 23, 2024 at 4:08 PM