Julia Kreutzer
@juliakreutzer.bsky.social
160 followers 170 following 20 posts
NLP & ML research @cohereforai.bsky.social
juliakreutzer.bsky.social
Ready for our poster today at #COLM2025!

💭This paper has had an interesting journey, come find out and discuss with us! @swetaagrawal.bsky.social @kocmitom.bsky.social

Side note: being a parent in research does have its perks, poster transportation solved ✅
A poster attached to a kid's bike seat
Reposted by Julia Kreutzer
cohereforai.bsky.social
We’re not your average lab. We’re a hybrid research environment dedicated to revolutionizing the ML space.

And we’re hiring a Senior Research Scientist to co-create with us.

If you believe in research as a shared, global effort — this is your chance.
juliakreutzer.bsky.social
💡A collaborative➕diverse team is key. In real life as in the LLM world 💪🦾
Check out our latest work that builds on this insight. 👇
cohereforai.bsky.social
Is Best-of-N really the best use of your inference compute?

Introducing Fusion-of-N: a simple and powerful way to advance inference and distillation beyond Best-of-N.
Reposted by Julia Kreutzer
mziizm.bsky.social
Breaking into AI research is harder than ever, and early-career researchers face fewer chances to get started.

Entry points matter.

We started the Scholars Program 3 years ago to give new researchers a real shot — excited to open applications for year 4✨
cohereforai.bsky.social
Applications are now open for the next cohort of the Cohere Labs Scholars Program! 🌟

This is your chance to collaborate with some of the brightest minds in AI & chart new courses in ML research. Let's change the spaces where breakthroughs happen.

Apply by Aug 29.
Reposted by Julia Kreutzer
cohereforai.bsky.social
While effective for chess♟️, Elo ratings struggle with LLM evaluation due to volatility and transitivity issues.

New post in collaboration with AI Singapore explores why Elo falls short for AI leaderboards and how we can do better.
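To make the volatility and transitivity points concrete, here is a minimal, purely illustrative sketch (hypothetical models and match outcomes, not taken from the linked post): sequential Elo updates depend on the order in which comparisons arrive, and cyclic outcomes (A beats B, B beats C, C beats A) admit no single transitive ranking at all.

```python
# Illustrative only: standard sequential Elo updates applied to hypothetical
# LLM head-to-head outcomes. The same set of results, replayed in a different
# order, yields different final ratings (volatility); the cycle A>B>C>A also
# shows why a single transitive leaderboard can be misleading.

def elo_update(r_a, r_b, score_a, k=32):
    """One Elo update; score_a is 1 if A wins, 0 if A loses, 0.5 for a tie."""
    expected_a = 1 / (1 + 10 ** ((r_b - r_a) / 400))
    return r_a + k * (score_a - expected_a), r_b + k * (expected_a - score_a)

def run_matches(matches, start=1000.0, k=32):
    """Replay (model_a, model_b, score_a) comparisons sequentially."""
    ratings = {}
    for a, b, score_a in matches:
        r_a, r_b = ratings.get(a, start), ratings.get(b, start)
        ratings[a], ratings[b] = elo_update(r_a, r_b, score_a, k)
    return ratings

# Same multiset of outcomes, two different orderings.
outcomes = [("A", "B", 1), ("B", "C", 1), ("C", "A", 1), ("A", "B", 0)]
print(run_matches(outcomes))                 # one final leaderboard
print(run_matches(list(reversed(outcomes)))) # a different one
```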
juliakreutzer.bsky.social
🍋 Squeezing the most out of few samples - check out our LLMonade recipe for few-sample test-time scaling in multitask environments.

Turns out that standard methods miss out on gains for non-English languages. We propose more robust alternatives.

Very proud of this work that our scholar Ammar led! 🚀
cohereforai.bsky.social
Can we improve the performance of LLMs during inference without the need for extensive sampling OR special reward models? 🤔

Our latest work introduces a new inference time scaling recipe that is sample-efficient, multilingual, and suitable for multi-task requirements. 🍋
juliakreutzer.bsky.social
🚨LLM safety research needs to be at least as multilingual as our models.

What's the current state, and how do we progress from here?
This work led by @yongzx.bsky.social has answers! 👇
cohereforai.bsky.social
It’s been two years since cross-lingual jailbreaks were first discovered. How far has the multilingual LLM safety research field advanced? 🤔

📏 Our comprehensive survey reveals that there is still a long way to go.
juliakreutzer.bsky.social
🚧No LLM safety without multilingual safety - what is missing to close the language gap? And where does this gap actually originate?

Answers 👇
cohereforai.bsky.social
Over 7000 languages are spoken worldwide 🌐, but AI safety efforts focus on only a fraction of them.

Our latest paper draws on our multi-year efforts with the wider research community to explore why this matters and how we can bridge the AI language gap.
juliakreutzer.bsky.social
Multilingual 🤝reasoning 🤝 test-time scaling 🔥🔥🔥

New preprint!

@yongzx.bsky.social has all the details 👇
yongzx.bsky.social
📣 New paper!

We observe that reasoning language models finetuned only on English data are capable of zero-shot cross-lingual reasoning through a "quote-and-think" pattern.

However, this does not mean they reason the same way across all languages or in new domains.

[1/N]
Reposted by Julia Kreutzer
mziizm.bsky.social
1/ Science is only as strong as the benchmarks it relies on.

So how fair—and scientifically rigorous—is today’s most widely used evaluation benchmark?

We took a deep dive into Chatbot Arena to find out. 🧵
juliakreutzer.bsky.social
Thank you @rapha.dev 😊 hope we can make it the norm to go a little deeper with evals, rather than just focusing on breadth (massive multilinguality).
juliakreutzer.bsky.social
🤓MT eyes on multilingual LLM benchmarks 👉 Here are a bunch of simple techniques we could easily adopt that, taken together, give a much richer understanding of where we stand with multilingual LLMs.
🍬Bonus question: how can we spur research on evaluation of evaluations?
cohereforai.bsky.social
🚀🌍The rapid advancement of multilingual large language models (mLLMs) is exciting, but are we evaluating them effectively?

Our new paper explores how we can improve generative evaluations for mLLMs by learning from machine translation (MT) evaluation practices. 🔎
Reposted by Julia Kreutzer
kocmitom.bsky.social
Tired of messy non-replicable multilingual LLM evaluation? So were we.

In our new paper, we experimentally illustrate common eval. issues and present how structured evaluation design, transparent reporting, and meta-evaluation can help us to build stronger models.
juliakreutzer.bsky.social
📖New preprint with Eleftheria Briakou @swetaagrawal.bsky.social @mziizm.bsky.social @kocmitom.bsky.social!

arxiv.org/abs/2504.11829

🌍It reflects experiences from my personal research journey: coming from MT into multilingual LLM research I missed reliable evaluations and evaluation research…
Screenshot of the paper header with title and author list and affiliations
juliakreutzer.bsky.social
🎯To keep advancing mLLMs, we need to advance our evaluation methods.
We need meta-evaluation research to think beyond one-size-fits-all automatic evaluation, develop richer assessments in human evaluation, and iterate to adapt them to advances in capabilities. 🔄
juliakreutzer.bsky.social
🤔Yes, none of these principles are novel or the techniques particularly sophisticated.
Despite their effectiveness, none of them are standard practice.
✔️We’ve compiled a checklist to help incorporate them in model evaluations.
Checklist for multilingual LLM evaluation
juliakreutzer.bsky.social
(5) Advancing reproducibility through transparency 🪟
Current mLLM evaluations are near impossible to reproduce because evaluation configurations are not reported transparently (incl. task formulation, as in the example below). We argue for open evaluation releases that include model outputs and their scores.
Table comparing model scores under different prompt templates.
juliakreutzer.bsky.social
(4) Conducting richer analyses 🔬
Aggregate benchmark metrics do not provide insights into what differentiates the outputs of two models - yet this is often the first step in human evaluation. For example, we can group evaluation prompts by length or category.
Diagram breaking down win rate comparisons across buckets of prompt length
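As a concrete (and entirely hypothetical) illustration of the bucketing idea above, here is a small sketch that slices pairwise win judgments by prompt length; none of the record fields or numbers come from the paper.

```python
# Illustrative sketch: break an aggregate win rate into per-bucket win rates,
# here by prompt length. Records and field names are hypothetical.
from collections import defaultdict

judgments = [  # each entry: did model A beat model B on this prompt?
    {"prompt": "Translate 'good morning' to German.", "a_wins": True},
    {"prompt": "Write a detailed, 500-word essay on the history of machine translation.", "a_wins": False},
    {"prompt": "Summarize the paragraph above.", "a_wins": True},
]

def length_bucket(prompt, edges=(10, 30)):
    """Coarse length bucket by word count."""
    n = len(prompt.split())
    return "short" if n < edges[0] else "medium" if n < edges[1] else "long"

wins, totals = defaultdict(int), defaultdict(int)
for j in judgments:
    bucket = length_bucket(j["prompt"])
    totals[bucket] += 1
    wins[bucket] += int(j["a_wins"])

for bucket, n in totals.items():
    print(f"{bucket}: A wins {wins[bucket] / n:.0%} of {n} prompts")
```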
juliakreutzer.bsky.social
(3) Aggregating responsibly 🏗️
How we aggregate results across tasks and languages informs the interpretation of model comparisons. Uniform weighting is not necessarily fair due to differences in training distribution (e.g. language or task support).
Table displaying model ranking changes depending on language resourcedness and task focus
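A tiny numerical sketch of the aggregation point (all scores and weights are made up, not taken from the paper): the same per-language results can flip a ranking depending on whether languages are weighted uniformly or according to what the comparison is meant to measure.

```python
# Illustrative only: uniform vs. weighted aggregation over per-language scores.
scores = {
    "model_x": {"en": 0.95, "de": 0.90, "sw": 0.30},
    "model_y": {"en": 0.80, "de": 0.75, "sw": 0.55},
}

def aggregate(per_lang, weights=None):
    """Weighted mean over languages; uniform if no weights are given."""
    weights = weights or {lang: 1.0 for lang in per_lang}
    total = sum(weights.values())
    return sum(score * weights[lang] for lang, score in per_lang.items()) / total

print({m: round(aggregate(s), 3) for m, s in scores.items()})
# uniform weighting: model_x comes out ahead
print({m: round(aggregate(s, {"en": 1.0, "de": 1.0, "sw": 3.0}), 3) for m, s in scores.items()})
# up-weighting the lower-resource language: model_y comes out ahead
```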
juliakreutzer.bsky.social
(2) Measuring significance, power and effect size 🔋
Generative evaluations for mLLMs rarely consider significance of results, statistical power of the test setup or effect sizes. We illustrate how these can be helpful to reporting model differences more meaningfully.
Diagram that shows the significance of win rate differences in relation to sample sizes
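For readers who want to see what this looks like in practice, here is a minimal sketch (hypothetical win counts, scipy assumed available; this is not the paper's code): an exact binomial test of a head-to-head win rate against 50%, plus a crude effect size, showing how the same win rate can be non-significant on a small sample yet significant on a larger one.

```python
# Illustrative only: significance and effect size for a pairwise win rate,
# assuming scipy is installed. All numbers are hypothetical.
from scipy.stats import binomtest

def report(wins, n):
    """Two-sided exact binomial test of the win rate against 0.5."""
    result = binomtest(wins, n, p=0.5)
    win_rate = wins / n
    effect = win_rate - 0.5  # simple effect size: distance from chance
    print(f"n={n:4d}  win rate={win_rate:.2f}  effect={effect:+.2f}  p={result.pvalue:.3f}")

report(wins=29, n=50)     # 58% wins, small sample: not significant
report(wins=580, n=1000)  # same 58% win rate, larger sample: significant
```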
juliakreutzer.bsky.social
(1) Treating synthetic data with care 💅
Translations are a common way to expand evaluation sets to new languages. We demonstrate that prompt translation can cause changes in win rates, with magnitudes depending on translation quality and the generative models being compared.
Diagram relating prompt translation quality to a change in win rate differences across languages
juliakreutzer.bsky.social
💡… turns out that by adopting practices from MT evaluations we can improve the expressiveness of generative multilingual LLM (mLLM) evaluations. Examples in thread below👇
Reposted by Julia Kreutzer
cohereforai.bsky.social
🚀 We are excited to introduce Kaleidoscope, the largest culturally-authentic exam benchmark.

📌 Most VLM benchmarks are English-centric or rely on translations, missing linguistic & cultural nuance. Kaleidoscope expands in-language multilingual 🌎 & multimodal 👀 VLM evaluation.
Reposted by Julia Kreutzer
kocmitom.bsky.social
☀️ Summer internship at Cohere!
Are you excited about multilingual evaluation, human judgment, or meta-eval? Come help us explore what a rigorous eval really looks like while questioning the status quo in LLM evaluation.
I'm looking for an intern (EU timezone preferred). Interested? Ping me!