Wessel Poelman
wpoelman.bsky.social
Working with languages @ KU Leuven
New EACL paper (with @mdlhx.bsky.social)! We tested whether comparing perplexity of parallel data across languages is fair. Turns out: it depends. We show that the choice of test set (even with consistent meaning) can flip conclusions about which language is easier to model.

Paper: arxiv.org/abs/2601.10580
Form and Meaning in Intrinsic Multilingual Evaluations
January 28, 2026 at 1:25 PM
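The comparison the post describes can be sketched in a few lines. This is a minimal, hypothetical illustration (not the paper's actual setup): bits-per-character normalizes a model's total negative log-probability by character count, and the invented numbers below show how two parallel test sets could rank the same two languages differently.

```python
import math

def bits_per_char(token_logprobs, text):
    """Bits-per-character: total negative log-probability (converted
    from nats to bits) divided by the number of characters in the text."""
    total_bits = -sum(lp / math.log(2) for lp in token_logprobs)
    return total_bits / len(text)

# Sanity check: a log-prob of -ln(2) per token is exactly 1 bit per token;
# 4 tokens over a 4-character string gives 1.0 bits per character.
assert bits_per_char([-math.log(2)] * 4, "abcd") == 1.0

# Hypothetical per-token log-probs (in nats) for parallel sentences in
# two languages, A and B, on two different parallel test sets.
set1_a = bits_per_char([-2.0, -2.5, -2.1], "the cat sat")        # language A
set1_b = bits_per_char([-3.0, -3.5, -3.2], "die katze sass")     # language B
set2_a = bits_per_char([-6.0, -5.5], "encyclopaedia entry")      # language A
set2_b = bits_per_char([-5.0], "enzyklopaedieeintrag")           # language B

# With these made-up numbers, test set 1 ranks A as "easier",
# while test set 2 ranks B as "easier" -- the conclusion flips.
print(set1_a < set1_b, set2_a < set2_b)
```

The point of the sketch: both test sets are "parallel" in meaning, yet the per-character normalization interacts with tokenization and text length, so the ranking between languages is not stable across test sets.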