Dan Saattrup Smart
@saattrupdan.com
Researcher and consultant in low-resource NLP, with a focus on evaluation. saattrupdan.com
Amazing, well done! Have you conducted any experiments with finetuning LLMs on the data?
March 6, 2025 at 1:44 PM
See the full English leaderboard here: scandeval.com/leaderboards...

You can make your own radial plots, like the one above, using this tool: scandeval.com/extras/radia...

(4/4)
🇬🇧 English - ScandEval
scandeval.com
February 10, 2025 at 4:33 PM
If we dig into the more granular evaluations, we see that the main differences between the two models are that o3-mini achieves higher text classification performance, whereas gpt-4o performs better at common-sense reasoning.

(3/4)
February 10, 2025 at 4:33 PM
Overall, the gpt-4o model achieves a slightly better rank score of 1.46, compared to o3-mini's 1.51. Here lower is better, with 1 being the best score possible (indicating that the model beats all other models at all tasks).

We use the default 'medium' reasoning effort of o3-mini here.

(2/4)
February 10, 2025 at 4:33 PM
Check out the full leaderboards on scandeval.com, which also include results for Llama-3.3-70B, Qwen2.5-72B, QwQ-32B-preview, Gemma-27B and Nemotron-4-340B.
ScandEval
scandeval.com
January 20, 2025 at 2:01 PM
On average, the 405B Llama-3.1 model achieves a solid second place with a ScandEval rank of 1.53, while GPT-4-turbo is in the lead with a ScandEval rank of 1.39 🎉
January 20, 2025 at 2:01 PM
However, for Icelandic, Faroese and Norwegian, it's not quite there yet.
January 20, 2025 at 2:01 PM
For Danish, Swedish, Dutch, German and English, it turns out that it is roughly on par with GPT-4-turbo!
January 20, 2025 at 2:01 PM
Reposted by Dan Saattrup Smart
"Each task consumed approximately 1,785 kWh of energy—about the same amount of electricity an average U.S. household uses in two months"

This is one per-task estimate from Salesforce's head of sustainability -->>

www.linkedin.com/posts/bgamaz...
December 28, 2024 at 8:45 AM