This paper reveals systematic biases and transparency gaps in the Chatbot Arena leaderboard.
www.youtube.com/watch?v=URho...
This paper reveals systematic biases and transparency gaps in the Chatbot Arena leaderboard.
www.youtube.com/watch?v=URho...
This work introduces a method to improve model performance by adding markers to tokens of the pretraining data, enabling real-time targeting of the long tail using training-time markers.
www.youtube.com/watch?v=K3BU...
This work introduces a method to improve model performance by adding markers to tokens of the pretraining data, enabling real-time targeting of the long tail using training-time markers.
www.youtube.com/watch?v=K3BU...
Led by @kocmitom.bsky.social, Ekaterina Artemova, Eleftherios Avramidis, Eleftheria Briakou, @pinzhen.bsky.social, @mziizm.bsky.social...
Led by @kocmitom.bsky.social, Ekaterina Artemova, Eleftherios Avramidis, Eleftheria Briakou, @pinzhen.bsky.social, @mziizm.bsky.social...
Top systems reach ~95% pairwise accuracy open-ended and summarization tasks.
Smaller ones barely beat coin-flip territory at ~55%.
Top systems reach ~95% pairwise accuracy open-ended and summarization tasks.
Smaller ones barely beat coin-flip territory at ~55%.
Across open-ended generation and cross lingual summarization, the biggest weakness isn’t coherence or accuracy, but it is sounding like a native speaker. Many outputs still feel robotic or translated.
Across open-ended generation and cross lingual summarization, the biggest weakness isn’t coherence or accuracy, but it is sounding like a native speaker. Many outputs still feel robotic or translated.
Models like Gemini 2.5 Pro and Claude 4 sometimes did better in Korean, German, or Spanish than in English when solving reasoning tasks.
Models like Gemini 2.5 Pro and Claude 4 sometimes did better in Korean, German, or Spanish than in English when solving reasoning tasks.
Even top models scored below 50% on linguistic reasoning tasks, showing that structured linguistic deduction is still an open challenge.
Even top models scored below 50% on linguistic reasoning tasks, showing that structured linguistic deduction is still an open challenge.
Models don’t support all languages equally, and this skews rankings. Smaller open models especially struggle with broad coverage, affecting their aggregate ranking ⚠️
Models don’t support all languages equally, and this skews rankings. Smaller open models especially struggle with broad coverage, affecting their aggregate ranking ⚠️
📝 Open-ended generation testing naturalness and usefulness
📘 Cross-lingual summarization
🔁 Machine translation
🧑⚖️ LLM-as-a-Judge evaluating outputs of other models
All backed by human evals and public releases of data + outputs!
github.com/wmt-conferen...
📝 Open-ended generation testing naturalness and usefulness
📘 Cross-lingual summarization
🔁 Machine translation
🧑⚖️ LLM-as-a-Judge evaluating outputs of other models
All backed by human evals and public releases of data + outputs!
github.com/wmt-conferen...
Congrats to authors Ammar Khairi, Daniel D'souza, Ye Shen, @juliakreutzer.bsky.social, @sarahooker.bsky.social
📜 arxiv.org/abs/2506.20544
Congrats to authors Ammar Khairi, Daniel D'souza, Ye Shen, @juliakreutzer.bsky.social, @sarahooker.bsky.social
📜 arxiv.org/abs/2506.20544
Congrats to authors Yijiang River Dong, @tiancheng.bsky.social, Yinhong Liu, Ahmet Üstün, Nigel Collier.
📜 arxiv.org/abs/2502.19158
Congrats to authors Yijiang River Dong, @tiancheng.bsky.social, Yinhong Liu, Ahmet Üstün, Nigel Collier.
📜 arxiv.org/abs/2502.19158
Congrats to authors @yongzx.bsky.social , Beyza Ermis, @mziizm.bsky.social, Stephen Bach, @juliakreutzer.bsky.social.
📜 arxiv.org/abs/2505.24119
Congrats to authors @yongzx.bsky.social , Beyza Ermis, @mziizm.bsky.social, Stephen Bach, @juliakreutzer.bsky.social.
📜 arxiv.org/abs/2505.24119
Congrats to authors Nikolas Gritsch, Qizhen Zhang, @acyrl.bsky.social, @sarahooker.bsky.social and Ahmet Üstün.
📜 arxiv.org/abs/2408.15901
Congrats to authors Nikolas Gritsch, Qizhen Zhang, @acyrl.bsky.social, @sarahooker.bsky.social and Ahmet Üstün.
📜 arxiv.org/abs/2408.15901
Learn more and register now: https://tinyurl.com/CohereLabsConnect
Learn more and register now: https://tinyurl.com/CohereLabsConnect
🔬 Showcase cutting-edge research
💡 Highlight meaningful collaborations
🤝 Inspire new partnerships
🔬 Showcase cutting-edge research
💡 Highlight meaningful collaborations
🤝 Inspire new partnerships
Led by: David Mora, Viraat Aryabumi, @weiyinko-ml.bsky.social, @sarahooker.bsky.social, @juliakreutzer.bsky.social, and
@mziizm.bsky.social.
Led by: David Mora, Viraat Aryabumi, @weiyinko-ml.bsky.social, @sarahooker.bsky.social, @juliakreutzer.bsky.social, and
@mziizm.bsky.social.