Cohere Labs
@cohereforai.bsky.social
460 followers 12 following 170 posts
@Cohere.com's non-profit research lab and open science initiative that seeks to solve complex machine learning problems. Join us in exploring the unknown, together. https://cohere.com/research
Pinned
cohereforai.bsky.social
We are committed to making meaningful progress in machine learning research through open collaboration. Follow this 🧵to stay on top of our research contributions.
cohereforai.bsky.social
Today at COLM, we are excited to share our work Déjà Vu: Multilingual LLM Evaluation through the Lens of Machine Translation Evaluation, during Poster Session 4, 4:30 - 6:30pm.

Come connect with paper authors @juliakreutzer.bsky.social and @kocmitom.bsky.social.
Reposted by Cohere Labs
juliakreutzer.bsky.social
💡A collaborative➕diverse team is key. In real life as in the LLM world 💪🦾
Check out our latest work that builds on this insight. 👇
cohereforai.bsky.social
Is Best-of-N really the best use of your inference compute?

Introducing Fusion-of-N: a simple and powerful way to advance inference and distillation beyond Best-of-N.
cohereforai.bsky.social
We are excited to present FusioN as a plug-and-play replacement for Best-of-N, shifting from a monolithic selection framework to a collaborative synthesis framework, one that embraces the diverse strengths of today’s leading open LLMs.
cohereforai.bsky.social
How does FusioN use the same sample pool more effectively than BoN?

🧩While BoN picks just one sample per problem, FusioN synthesises one output from all samples – treating them as collaborators whose strengths can be integrated, not competitors in a zero-sum game.
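The selection-vs-synthesis contrast above can be sketched in a few lines. This is a toy illustration, not the paper's implementation: `score` stands in for a reward model and `fuse` for a fusor LLM call, and both are hypothetical stand-ins so the control flow is runnable.

```python
def score(sample: str) -> float:
    """Toy reward: longer answers score higher (stand-in for a real reward model)."""
    return float(len(sample))

def fuse(samples: list[str]) -> str:
    """Toy fusor: joins all samples (stand-in for an LLM synthesis call)."""
    return " / ".join(samples)

def best_of_n(samples: list[str]) -> str:
    # BoN: keep exactly one sample, discard the rest of the pool.
    return max(samples, key=score)

def fusion_of_n(samples: list[str]) -> str:
    # FusioN: every sample in the pool contributes to a single output.
    return fuse(samples)

pool = ["short answer", "a much more detailed answer", "another take"]
print(best_of_n(pool))    # -> "a much more detailed answer"
print(fusion_of_n(pool))  # one output drawing on all three samples
```

The key difference is that `best_of_n` throws away N-1 samples per problem, while `fusion_of_n` passes the whole pool forward.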
cohereforai.bsky.social
Want the wisdom-of-the-crowd in 1 model?

🧑‍🎓🧑🏽‍🎓👨🏾‍🎓Fusion-of-N distills multiple teachers into richer synthetic data than BoN, training students that achieve bigger downstream gains, even surpassing teachers on multilingual factual reasoning 🌎
cohereforai.bsky.social
Test-time scaling doesn't need to waste samples: Fusion-of-N turns every sample into signal, outperforming BoN across tasks, languages, and models. 🚀

Fusion-of-N boosts CommandA win-rates vs Gemini-2.5 Pro +8.3% across 11 languages – a +4.0% improvement over BoN 🥇
cohereforai.bsky.social
Fusion-of-N uses an LLM (the fusor) to merge multiple candidate answers into one 💎

Instead of selecting only one response, Fusion-of-N creates an even better answer by integrating insights across all samples 🏅
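One way to picture the fusor step is a single prompt that places all N candidates in one context. The template and function below are assumptions for illustration only, not the actual prompt from the paper:

```python
def build_fusor_prompt(question: str, candidates: list[str]) -> str:
    """Assemble a hypothetical fusor prompt holding all N candidate answers."""
    numbered = "\n".join(
        f"Candidate {i + 1}:\n{c}" for i, c in enumerate(candidates)
    )
    return (
        f"Question:\n{question}\n\n"
        f"{numbered}\n\n"
        "Synthesize one answer that integrates the strengths of all "
        "candidates and corrects their mistakes."
    )

prompt = build_fusor_prompt(
    "What causes tides?",
    ["The Moon's gravity.", "Gravity of the Moon and Sun, plus Earth's rotation."],
)
print(prompt)  # the fusor LLM would be called on this single prompt
```

Because the fusor sees every candidate at once, it can integrate complementary insights rather than being forced into a single pick.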
cohereforai.bsky.social
We’re not your average lab. We’re a hybrid research environment dedicated to revolutionizing the ML space.

And we’re hiring a Senior Research Scientist to co-create with us.

If you believe in research as a shared, global effort — this is your chance.
cohereforai.bsky.social
Led by: Srishti Gureja, Elena Tommasone, Jingyi He, @sarahooker.bsky.social, Matthias Galle, and @mziizm.bsky.social

📄 Paper: https://arxiv.org/abs/2509.20837
cohereforai.bsky.social
🔹 The future of synthetic training hinges on rethinking verification. The answer is calibrated verification: complex, diverse test suites combined with flexible signals that break the Verification Ceiling and improve code LLMs.
cohereforai.bsky.social
🔹 We also find that LLMs can serve as soft verifiers. Their judgments recover useful data and often match or surpass selection by formal unit tests.
cohereforai.bsky.social
🔹 Relaxing verification thresholds boosts performance, but only with sufficiently complex test suites. Correctness still matters, but how we define it is the real issue.
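A threshold-based filter like the one described can be sketched as follows. This is a minimal illustration under assumed names (`keep_sample`, toy boolean test outcomes), not the paper's pipeline:

```python
def keep_sample(test_results: list[bool], threshold: float = 0.8) -> bool:
    """Accept a synthetic code sample if its unit-test pass rate
    meets the threshold, instead of requiring all tests to pass."""
    if not test_results:
        return False  # no tests means no verification signal
    return sum(test_results) / len(test_results) >= threshold

# A sample passing 9 of 10 tests: discarded by strict "all must pass",
# kept under a relaxed 80% threshold.
results = [True] * 9 + [False]
print(keep_sample(results, threshold=1.0))  # False
print(keep_sample(results, threshold=0.8))  # True
```

Lowering `threshold` trades strictness for data retention; the finding above is that this trade-off only pays off when the test suite itself is complex enough to remain a meaningful signal.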
cohereforai.bsky.social
We find:

🔹 Rigid verification risks biasing toward easy problems, while richer correctness signals preserve both quality and diversity.
cohereforai.bsky.social
What if the way we verify synthetic code is limiting model performance?

In our latest work we uncover the Verification Ceiling Problem: strict “all tests must pass” rules throw away useful data, while weak tests let errors through.
Reposted by Cohere Labs
mziizm.bsky.social
I'm excited to share that I'll be stepping into the role of Head of @cohereforai.bsky.social. It's an honor and a responsibility to lead such an extraordinary group of researchers pushing the boundaries of AI research.
Reposted by Cohere Labs
cjpberry.bsky.social
Papers In The Park 14. Last one of the season! Still great weather. Surprising. Anthony is presenting “Why Language Models Hallucinate”.

Thanks to @cohereforai.bsky.social for the copies and pizza.
cohereforai.bsky.social
🚨 Rare opportunity: Cohere Labs is hiring a Research Scientist!

If you’re passionate about studying fundamental AI problems and working in a globally collaborative, open-science environment, this is for you.

Apply here: jobs.ashbyhq.com/cohere/7ec9e...
Reposted by Cohere Labs
cjpberry.bsky.social
It’s papers in the park 7! Thanks to @cohereforai.bsky.social for the papers and the pizza, and to Alvin and Anthony for organizing.

It’s easily one of the funnest paper reads in the city!
Reposted by Cohere Labs
mziizm.bsky.social
Breaking into AI research is harder than ever, and early-career researchers face fewer chances to get started.

Entry points matter.

We started the Scholars Program 3 years ago to give new researchers a real shot — excited to open applications for year 4✨
cohereforai.bsky.social
Applications are now open for the next cohort of the Cohere Labs Scholars Program! 🌟

This is your chance to collaborate with some of the brightest minds in AI & chart new courses in ML research. Let's change the spaces where breakthroughs happen.

Apply by Aug 29.
cohereforai.bsky.social
Check out the full blogpost here: https://cohere.com/blog/elo-ratings-beyond-arena-style-evaluations

Great to collaborate with Adithya Venkatadri Hulagadri, @mziizm.bsky.social, @jiangangngui.bsky.social, and @juliakreutzer.bsky.social on this exploration.
cohereforai.bsky.social
In this blogpost we propose a 3rd path:
✅ Balanced sampling across languages/tasks
✅ Offline pseudo-pairwise comparisons (Bradley-Terry)
✅ Confidence intervals & transparent breakdowns

The result? Rankings that better reflect real model utility.
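The offline pseudo-pairwise step can be illustrated with a standard Bradley-Terry fit. The minorization-maximization update below is the textbook algorithm, not the blogpost's exact pipeline, and `wins` is toy data: `wins[i][j]` counts how often model i's output is preferred over model j's on matched prompts.

```python
def bradley_terry(wins: list[list[int]], iters: int = 200) -> list[float]:
    """Fit Bradley-Terry strengths from a pairwise win-count matrix
    via the standard MM iteration."""
    n = len(wins)
    s = [1.0] * n  # latent strengths, initialized uniformly
    for _ in range(iters):
        new = []
        for i in range(n):
            w_i = sum(wins[i])  # total wins for model i
            denom = sum(
                (wins[i][j] + wins[j][i]) / (s[i] + s[j])
                for j in range(n) if j != i
            )
            new.append(w_i / denom if denom else s[i])
        total = sum(new)
        s = [v * n / total for v in new]  # normalize to mean 1
    return s

# Toy comparisons: model 0 usually beats 1 and 2; model 1 usually beats 2.
wins = [[0, 8, 9],
        [2, 0, 7],
        [1, 3, 0]]
strengths = bradley_terry(wins)
print(strengths)  # strengths rank the models: 0 > 1 > 2
```

Confidence intervals like those mentioned above could then be obtained by bootstrapping the comparison set and re-fitting.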