We merged a huge refactor of lighteval, making it easier to add:
🔄 Multiturn tasks
🖼️ Multimodal tasks
📝 Plus unified logs for thorough benchmark analysis
Benchmark folks, what evals would you like to see added?
Now with:
✅ Plug & play custom model inference (evaluate any backend)
📈 Tasks like AIME, GPQA:diamond, SimpleQA, and hundreds more
Details below 🧵👇
i've been using @huggingface's lighteval and inference providers and litellm to evaluate all those models in less than a few hours 🤩
1/N
Details below👇
1/6
Congrats to @sumukx @clefourrier and @ailozovskaya for their incredible work !
Game-changing for LLM evaluation 🚀
1/2
Impressive gains in math and GPQA, but instruction following took a slight hit. More concerning—AIME25 remains unchanged. Possible contamination issues? 🤔
Just look at these insane results from the OpenEval team—absolutely impressive.
Huge congrats! 👏 @Alibaba_Qwen
Did NOT expect it to be such a leap from GPT-4o—now on par with Claude 3.7 and even ahead of DeepSeek Llama 70B (a thinking model!).
Congrats to the team @OpenAI !
Did NOT expect it to be such a leap from GPT-4o—now on par with Claude 3.7 and even ahead of DeepSeek Llama 70B (a thinking model!).
Congrats to the team @OpenAI ! Now open-source it and drop it on the Hub 🤗
TLDR: we get what they announced.
We also used AIME 2025 to test for contamination on the 2024 version, and scores are similar on both benchmarks!
Great job to the @AnthropicAI team !
More details in thread 👇
1/3
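A rough sketch of how a "scores are similar on both benchmarks" check could be quantified, using a two-proportion z-test. The counts below are placeholder numbers for illustration only (not results from this thread), and 30 problems per AIME year with a single sample per problem is an assumption about the setup.

```python
import math


def two_proportion_z(correct_a: int, n_a: int, correct_b: int, n_b: int) -> float:
    """z-statistic for the difference between two accuracies (pooled variance)."""
    p_a = correct_a / n_a
    p_b = correct_b / n_b
    pooled = (correct_a + correct_b) / (n_a + n_b)
    se = math.sqrt(pooled * (1 - pooled) * (1 / n_a + 1 / n_b))
    if se == 0:
        # Both accuracies are exactly 0 or exactly 1: no measurable difference.
        return 0.0
    return (p_a - p_b) / se


# Placeholder counts, NOT real results: older set vs. newer set.
z = two_proportion_z(correct_a=24, n_a=30, correct_b=22, n_b=30)

# |z| below ~1.96 means the gap is not significant at the 5% level,
# i.e. no strong sign that the older set was memorized.
looks_contaminated = abs(z) > 1.96
print(f"z = {z:.2f}, suspicious gap: {looks_contaminated}")
```

With only 30 problems per set, this test has little power, so a small gap is weak evidence either way; a much higher score on the older set is the signal worth a closer look.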
Full details + how to reproduce in the thread 👇
Chai does structure predictions at AlphaFold3 levels of accuracy and is able to handle multi-peptide or peptide-ligand complexes rather than just single chains.
Apache 2.0 on HF huggingface.co/chaidiscover...
Interactive viz: aiworld.eu/embed/model/...
Discussion: huggingface.co/spaces/huggi...
Here's a recap; find the text-readable version here: huggingface.co/posts/merve/...
Pre-training & evaluation code, synthetic data generation pipelines, post-training scripts, on-device tools & demos
Apache 2.0. V2 data mix coming soon!
Which tools should we add next?
* any dataset on the 🤗 Hub can become an eval task in a few lines of code: customize the prompt, metrics, parsing, few-shots, everything!
* model- and data-parallel inference
* auto batching with the new vLLM backend
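To make the "dataset row to eval task" idea concrete, here is a minimal plain-Python sketch. It is not lighteval's API: `EvalTask`, `run_task`, and the toy model are hypothetical names invented for illustration, showing only how a prompt function, an output parser, and a metric fit together.

```python
from dataclasses import dataclass
from typing import Callable, Iterable


@dataclass
class EvalTask:
    """Hypothetical task definition (illustrative only, not lighteval's API)."""
    name: str
    prompt_fn: Callable[[dict], str]         # dataset row -> prompt text
    parse_fn: Callable[[str], str]           # raw model output -> answer string
    metric_fn: Callable[[str, str], float]   # (prediction, gold) -> score
    gold_key: str = "answer"


def run_task(task: EvalTask, rows: Iterable[dict], model: Callable[[str], str]) -> float:
    """Average the metric over all rows; returns 0.0 for an empty dataset."""
    scores = [
        task.metric_fn(task.parse_fn(model(task.prompt_fn(row))), row[task.gold_key])
        for row in rows
    ]
    return sum(scores) / len(scores) if scores else 0.0


# Toy rows standing in for a dataset loaded from the Hub.
rows = [
    {"question": "2 + 2", "answer": "4"},
    {"question": "3 + 5", "answer": "8"},
]

task = EvalTask(
    name="toy_arithmetic",
    prompt_fn=lambda row: f"Q: {row['question']}\nA:",
    parse_fn=lambda out: (out.split() or [""])[-1],   # keep the last token
    metric_fn=lambda pred, gold: float(pred == gold), # exact match
)


def toy_model(prompt: str) -> str:
    """Stand-in for a real backend; always gives the same answer."""
    return "The answer is 4"


accuracy = run_task(task, rows, toy_model)
print(f"{task.name}: accuracy = {accuracy}")
```

Swapping the prompt template, parser, or metric only means passing a different function, which is the kind of customization the list above describes.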
- Pre-training code
- Evaluation suite
- Synthetic data generation
- Post-training scripts with TRL
- On-device tools for summarization, rewriting & agents
All Apache 2.0 licensed! 🔥
Soon, it'll be "on-chip" LLM. Or LLM cores. The system default local LLM. The coding framework's default local LLM.
I find this incredibly exciting. A privacy-first, self-contained, user-owned AI—a 24/7 agent for action, insights & feedback.
github.com/huggingface/...
📊 A statistical approach to model evaluation @AnthropicAI
📐 FrontierMath: A Benchmark for Evaluating Advanced Mathematical Reasoning in AI @EpochAIResearch
📝 Say What You Mean: A Response to 'Let Me Speak Freely' @dottxtai
🧵 👇
go.bsky.app/8MFcfXd
Let me know if you find such people here!
I'm still new here and the list probably misses many must-add people, so let's build it together💪