Lightnews — Scholar-powered news

Nathan

@saylortwift.hf.co

Evaluation was just made easier 💯

We merged a huge refactor of lighteval, making it easier to add:
🔄 Multiturn tasks
🖼️ Multimodal tasks
📝 Plus unified logs for thorough benchmark analysis

Benchmark guys, what evals would you like to see added?

June 25, 2025 at 3:05 PM

Nathan

@saylortwift.hf.co

🔥 Evaluating LLMs? You need Lighteval — the fastest, most flexible toolkit for benchmarking models, built by @huggingface

Now with:
✅ Plug & play custom model inference (evaluate any backend)
📈 Tasks like AIME, GPQA:diamond, SimpleQA, and hundreds more

Details below 🧵👇

May 6, 2025 at 2:26 PM

Nathan

@saylortwift.hf.co

openai really has some nice benchmarks, one of them being simpleqa. a simple fact-checking benchmark, short questions and straight answers

i've been using @huggingface's lighteval and inference providers and litellm to evaluate all those models in less than a few hours 🤩

1/N

April 22, 2025 at 2:29 PM

Nathan

@saylortwift.hf.co

🚀 Just dropped fresh benchmarks for LLaMA 4 Scout and Maverick using Lighteval!

Details below👇

1/6

April 8, 2025 at 8:53 AM

Nathan

@saylortwift.hf.co

🚀 Introducing ✨ YourBench ✨ ! Build custom evals instantly using your private docs & see how your custom fine-tuned models perform on your unique tasks.
Congrats to @sumukx @clefourrier and @ailozovskaya for their incredible work !
Game-changing for LLM evaluation 🚀
1/2

April 3, 2025 at 9:35 AM

Nathan

@saylortwift.hf.co

Just wrapped up evaluations on @deepseek_ai's V3 0324! 🚀

Impressive gains in math and GPQA, but instruction following took a slight hit. More concerning—AIME25 remains unchanged. Possible contamination issues? 🤔

March 26, 2025 at 10:07 PM

Nathan

@saylortwift.hf.co

WOW. The Qwen team did NOT come to play.🔥
Just look at these insane results from the OpenEval team—absolutely impressive.
Huge congrats! 👏 @Alibaba_Qwen

March 10, 2025 at 12:39 PM

Nathan

@saylortwift.hf.co

Everyone's talking about GPT-4.5 quality, so we ran benchmarks!

Did NOT expect it to be such a leap from GPT-4o—now on par with Claude 3.7 and even ahead of DeepSeek Llama 70B (a thinking model!).

Congrats to the team @OpenAI !

March 3, 2025 at 3:18 PM

Nathan

@saylortwift.hf.co

Everyone's talking about GPT-4.5 quality, so we ran benchmarks!
Did NOT expect it to be such a leap from GPT-4o—now on par with Claude 3.7 and even ahead of DeepSeek Llama 70B (a thinking model!).

Congrats to the team @OpenAI ! Now open-source it and drop it on the Hub 🤗

March 3, 2025 at 3:05 PM

Nathan

@saylortwift.hf.co

we just reproduced Claude 3.7 results for you 📈

TLDR: we get what they announced.
We also used AIME 2025 to test for contamination on the 2024 version, and scores are similar on both benchmarks!

Great job to the @AnthropicAI team !
More details in thread 👇
1/3

February 25, 2025 at 3:03 PM

Nathan

@saylortwift.hf.co

Today marks my 2 years at @huggingface! Time flies!! Having worked with these people for 2 years now, I can tell you there is no better place to build ethical, open AI. HF folks are both kind and incredibly talented, and I can't wait to work on many more exciting projects with them 🤩

February 6, 2025 at 2:28 PM

Nathan

@saylortwift.hf.co

DeepSeek R1 continues to impress! I just integrated the Olympiad Bench, a collection of elite-level Chinese and English scientific problems, into LightEval and tested GPT-4o against R1. The results are insane.

Full details + how to reproduce in the thread 👇

February 3, 2025 at 10:29 AM

Reposted by Nathan

Clem Delangue 🤗

@clem.hf.co

Excited to see more biology open-source models for real positive use-cases of AI!

Chai does structure predictions at AlphaFold3 levels of accuracy and is able to handle multi-peptide or peptide-ligand complexes rather than just single chains.

Apache 2.0 on HF huggingface.co/chaidiscover...

December 5, 2024 at 2:39 PM

Reposted by Nathan

Thomas Wolf

@thomwolf.bsky.social

Most liked and most downloaded open-source AI models from 2022 to 2024

Interactive viz: aiworld.eu/embed/model/...
Discussion: huggingface.co/spaces/huggi...

December 4, 2024 at 8:37 AM

Reposted by Nathan

merve

@merve.bsky.social

So many open-source and open releases last week!
Here's a recap, find the text-readable version here huggingface.co/posts/merve/...

December 2, 2024 at 9:53 AM

Reposted by Nathan

Loubna Ben Allal

@loubnabnl.hf.co

Making SmolLM2 more reproducible: open-sourcing our training & evaluation toolkit 🛠️ github.com/huggingface/...

Pre-training & evaluation code, synthetic data generation pipelines, post-training scripts, on-device tools & demos

Apache 2.0. V2 data mix coming soon!

Which tools should we add next?

GitHub - huggingface/smollm: Everything about the SmolLM & SmolLM2 family of models

github.com

November 24, 2024 at 7:16 AM

Reposted by Nathan

Anton

@anton-l.bsky.social

Check out how easy it is to do LLM evals with LightEval!

* any dataset on the 🤗 Hub can become an eval task in a few lines of code: customize the prompt, metrics, parsing, few-shots, everything!
* model- and data-parallel inference
* auto batching with the new vLLM backend

A screenshot of LightEval benchmarking results in a terminal

November 25, 2024 at 5:24 PM

Reposted by Nathan

Trisha KansasGal

@kansasgal71.bsky.social

November 25, 2024 at 3:14 PM

Reposted by Nathan

Maziyar PANAHI

@maziyarpanahi.bsky.social

The team behind the SmolLM2 model at @huggingface.bsky.social just released everything! True open-source AI:

- Pre-training code
- Evaluation suite
- Synthetic data generation
- Post-training scripts with TRL
- On-device tools for summarization, rewriting & agents

All Apache 2.0 licensed! 🔥

November 24, 2024 at 6:25 PM

Reposted by Nathan

Arvid Kahl

@arvidkahl.bsky.social

It's "on-device LLM" today.

Soon, it'll be "on-chip" LLM. Or LLM cores. The system default local LLM. The coding framework's default local LLM.

I find this incredibly exciting. A privacy-first, self-contained, user-owned AI—a 24/7 agent for action, insights & feedback.

github.com/huggingface/...

GitHub - huggingface/smollm: Everything about the SmolLM & SmolLM2 family of models

github.com

November 24, 2024 at 6:01 PM

Nathan

@saylortwift.hf.co

This week (ish) in 🌤️ LLM evaluation 🔥
📊 A statistical approach to model evaluation @AnthropicAI
📐 Frontier MATH: a benchmark for evaluating advanced Mathematical reasoning in AI @EpochAIResearch
📝 Say What You Mean: A Response to 'Let Me Speak Freely' @dottxtai

🧵 👇

November 25, 2024 at 2:13 PM

Reposted by Nathan

Yoshitomo Matsubara

@yoshitomo-matsubara.net

Here is a list of ML OSS & Open Source / Science enthusiasts I found on Bluesky 🦋

go.bsky.app/8MFcfXd

Let me know if you find such people here!

I'm still new here and the list probably misses many must-add people, so let's build it together💪

November 21, 2024 at 5:19 AM

Reposted by Nathan

Clem Delangue 🤗

@clem.hf.co

Should HF do more agent stuff? If so, what would be useful?

November 23, 2024 at 4:08 PM
