Vilém Zouhar
@zouharvi.bsky.social
3.5K followers 1.4K following 250 posts
PhD student @ ETH Zürich | all aspects of NLP but mostly evaluation and MT | go vegan | https://vilda.net
zouharvi.bsky.social
My two biggest take-aways are:
- Standard testsets are too easy (Figure 1).
- We can make testsets that are not easy (Figure 2). 😎
Reposted by Vilém Zouhar
kocmitom.bsky.social
We saw increased momentum in participation this year: 36 unique teams competing to improve MT performance. Furthermore, we added the collected outputs of 24 popular LLMs and online systems, reaching 50 evaluated systems in our annual benchmark.
zouharvi.bsky.social
It gets worse the more you look at it. Why is the height of 69.1 the same as the height of 30.8? Why are the labels rotated if there's enough space?
zouharvi.bsky.social
Organizers are happy to help with any questions. 🙂
Website with all details and contacts: www2.statmt.org/wmt25/mteval...
Shared task: Automated Translation Quality Evaluation Systems
www2.statmt.org
zouharvi.bsky.social
📐Task 3: Quality-informed segment-level error correction

Automatically post-edit machine-translated text using quality annotations to generate minimal and accurate corrections.

Description: www2.statmt.org/wmt25/mteval...

Submission platform: www.codabench.org/competitions...
QE-informed Segment-level Error Correction
www2.statmt.org
zouharvi.bsky.social
📐Task 2: Span-level error detection

Identify and locate translation errors within each segment (start/end indices) and classify their severity.

Description: www2.statmt.org/wmt25/mteval...

Submission platform: www.codabench.org/competitions...
Fine-grained error span detection
www2.statmt.org
zouharvi.bsky.social
📐Task 1: Segment-level quality score prediction

Predict a quality score for each source–target segment pair, using document-level context and either ESA or MQM annotations.

Description: www2.statmt.org/wmt25/mteval...

Submission platform: www.codabench.org/competitions...
MT Evaluation Subtask 1: Segment-Level Quality Score Prediction
www2.statmt.org
zouharvi.bsky.social
The 2025 MT Evaluation shared task brings together the strengths of the previous Metrics and Quality Estimation tasks under a single, unified evaluation framework.

The following tasks are now open for participants (deadline July 31st, but participation has never been easier 🙂):
zouharvi.bsky.social
Faster but an extra dependency. 🤷
zouharvi.bsky.social
Not possible post-hoc but possible for the other direction! Thanks for your paper. 🙂
zouharvi.bsky.social
Thank you everyone who helped. 😊

Special thanks to @mrinmaya.bsky.social and Peng Cui from @csateth.bsky.social and all my friends I bugged with proofreading. 😁
zouharvi.bsky.social
Recommendation based on translation and summarization:
1️⃣ if you have a good automatic metric, use variance/consistency
2️⃣ if not, use model output diversity
3️⃣ if outputs not available, use artificial crowd/distilled predictors
4️⃣ if those are not available, use source diversity
zouharvi.bsky.social
We frame this as a 0/1 Knapsack problem: find a subset Y ⊆ X with maximum utility while staying under budget B. 🤓

maximize: ∑ zₓ · Utility(x)
subject to: ∑ zₓ · Cost(x) ≤ B
zₓ ∈ {0, 1}

The Utility(x) can be metric average, variance, diversity, etc.
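
A minimal sketch of that selection, assuming integer annotation costs and placeholder Utility/Cost functions (the paper may well solve the knapsack differently, e.g. with an off-the-shelf solver or a greedy approximation):

```python
# Sketch of the 0/1 Knapsack selection described above.
# utility() and cost() are stand-ins for whatever Utility(x)/Cost(x) you use
# (metric variance, output diversity, annotation time, ...); integer costs assumed.

def select_subset(items, utility, cost, budget):
    """Return the subset of `items` maximizing total utility with total cost <= budget,
    via the classic dynamic-programming solution to 0/1 Knapsack."""
    # best[b] = (best total utility, chosen indices) achievable with cost <= b
    best = [(0.0, [])] * (budget + 1)
    for i, x in enumerate(items):
        c, u = cost(x), utility(x)
        # iterate budgets downwards so each item is used at most once
        for b in range(budget, c - 1, -1):
            cand = best[b - c][0] + u
            if cand > best[b][0]:
                best[b] = (cand, best[b - c][1] + [i])
    return [items[i] for i in best[budget][1]]

# Toy example: pick source segments to annotate, costing roughly their length in words.
docs = ["short src", "a somewhat longer source segment", "mid-length input"]
picked = select_subset(
    docs,
    utility=lambda x: len(set(x.split())),  # stand-in for Utility(x), e.g. diversity
    cost=lambda x: len(x.split()),          # stand-in for Cost(x)
    budget=8,
)
print(picked)
```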
zouharvi.bsky.social
This works even if you don't have the model outputs yet.
1️⃣ "artificial crowd" simulate what model outputs would look like; apply the previous methods.
2️⃣ "utility predictors" estimate usefulness from the source text.
3️⃣ "source-based diversity" remove similar inputs.
zouharvi.bsky.social
So what works? Selecting inputs that expose model differences:
1️⃣ high variance in metric scores
2️⃣ diversity in model outputs
3️⃣ high metric consistency with the rest of the dataset

We now need almost 30% fewer annotated examples to get the same model ranking.
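
A toy sketch of option 1️⃣, ranking inputs by the variance of automatic metric scores across systems (scores are made up; any segment-level metric would do):

```python
# Pick the inputs where systems disagree the most according to an automatic metric.
from statistics import variance

# scores[input_id] = metric score of each system's output on that input
scores = {
    "doc1": [0.81, 0.80, 0.79],   # systems agree -> low variance, less informative
    "doc2": [0.95, 0.60, 0.72],   # systems disagree -> worth human annotation
    "doc3": [0.55, 0.57, 0.54],
}

budget = 1  # how many inputs we can afford to human-annotate
selected = sorted(scores, key=lambda d: variance(scores[d]), reverse=True)[:budget]
print(selected)  # ['doc2']
```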
zouharvi.bsky.social
We frame this as finding the smallest subset of data (Y ⊆ X) that gives the same model ranking as on the full dataset.

Simply picking the hardest examples (lowest average metric score) is a step up but can backfire by selecting the most expensive items to annotate.
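
A toy sketch of the objective itself: checking whether ranking models on a candidate subset Y reproduces the ranking obtained on the full dataset X (scores are made up):

```python
# Does the subset preserve the model ranking from the full data?

def ranking(model_scores):
    """Order model names by their average score, best first."""
    return sorted(model_scores, key=lambda m: -sum(model_scores[m]) / len(model_scores[m]))

# full[model] = per-example scores (human or metric) on the full dataset X
full = {"A": [0.9, 0.4, 0.8, 0.7], "B": [0.8, 0.5, 0.6, 0.8], "C": [0.3, 0.2, 0.5, 0.4]}
subset_idx = [0, 2]  # candidate subset Y
subset = {m: [s[i] for i in subset_idx] for m, s in full.items()}

print(ranking(full) == ranking(subset))  # True -> Y preserves the ranking
```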
zouharvi.bsky.social
You have a budget to human-evaluate 100 inputs to your models, but your dataset is 10,000 inputs. Do not just pick 100 randomly!🙅

We can do better. "How to Select Datapoints for Efficient Human Evaluation of NLG Models?" shows how.🕵️
(random is still a devilishly good baseline)
zouharvi.bsky.social
TIL that since Python 3.4 there's a standard `statistics` module with things like mean, mode, quantiles, variance, covariance, correlation, z-score, and more! No more needless numpy imports!
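
A quick sketch of what that looks like in practice; note that a few of these helpers arrived after 3.4 (versions noted in the comments):

```python
# Standard-library statistics, no numpy needed.
import statistics as st

data = [2.5, 3.1, 2.9, 4.0, 3.6]
print(st.mean(data), st.median(data), st.variance(data), st.stdev(data))
print(st.quantiles(data, n=4))                        # quartiles (Python 3.8+)
print(st.correlation([1, 2, 3, 4], [2, 4, 6, 9]))     # Pearson r (Python 3.10+)
print(st.NormalDist(mu=3.0, sigma=0.5).zscore(4.0))   # z-score (Python 3.9+)
```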
zouharvi.bsky.social
Past iterations of the Terminology Shared Task don't come anywhere near the data quality and evaluation scrutiny of this one. In the era of LLMs as MT systems, participation has never been easier!
kiryukhasemenov.bsky.social
📣Take part in 3rd Terminology shared task @WMT!📣
This year:
👉5 language pairs: EN->{ES, RU, DE, ZH},
👉2 tracks - sentence-level and doc-level translation,
👉authentic data from 2 domains: finance and IT!

www2.statmt.org/wmt25/termin...

Don't miss the opportunity - we only run it once every two years 😏
Terminology Translation Task
www2.statmt.org
zouharvi.bsky.social
Thank you for your response. I will keep my score.
zouharvi.bsky.social
For the longest time I've been using Google Translate as a gateway to explain machine translation concepts to people as it's a tool that everyone knows. Now I get to contribute over the summer. 🌞

If you're near Mountain View, let's talk evaluation. 📏
zouharvi.bsky.social
I gave up on making it work natively, so I just typeset the examples in different scripts in Typst/Inkscape, export them to PDF, and include the PDFs in LaTeX tables. As silly as that is, it's still easier than making things work by changing the compiler or the babel package.
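
For illustration, a minimal sketch of that include-the-PDF workaround (file name and sizing are hypothetical, not the author's actual setup):

```latex
% Preamble:
\usepackage{graphicx}

% In the document, inside a table cell:
\begin{tabular}{ll}
  Script     & Example \\
  Devanagari & \raisebox{-0.25\height}{\includegraphics[height=1.1em]{example-devanagari.pdf}} \\
\end{tabular}
```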