Vilém Zouhar @ EACL
@zouharvi.bsky.social
PhD @ ETH Zürich | working on (multilingual) evaluation of NLP | on the academic job market | go #vegan | https://vilda.net
Yes, LabelStudio is a great general-purpose tool (while Pearmut aims at specific workflows).

However, LabelStudio has limitations with tutorials/attention checks, and when it comes to assigning annotation tasks to lay people (they have to register to annotate, I believe?).
January 28, 2026 at 2:55 PM
Thanks! Experience reports should be more common.

Pearmut was created out of frustration with setting up humeval using existing tools; it aims to provide good defaults.

In the paper, we have 5 researchers trying to set up humeval on 5 different platforms and reporting on time spent, ease of use, and customizability.
January 28, 2026 at 2:43 PM
🥜 Platform for Efficient Annotation of Natural Utterances and Translation? 😁
January 28, 2026 at 2:23 PM
Thanks to all my friends who helped bring this to life. 🙂

Get in touch if you'd like to help with human evaluation for your paper/work! 🖐️
January 28, 2026 at 1:39 PM
The CLI gives you magic links: a dashboard link to monitor progress and annotation links to distribute to your annotators.

Pearmut is open-source and extensible with many exciting features coming. 🍏

github.com/zouharvi/pea...
GitHub - zouharvi/pearmut: Platform for Evaluating and Reviewing of Multilingual Tasks
January 28, 2026 at 1:39 PM
Get started with the following commands:

pip install pearmut
# Download example campaign
wget raw.githubusercontent.com/zouharvi/pea...
# Load and start
pearmut add esa.json
pearmut run
January 28, 2026 at 1:39 PM
The tool supports multiple annotation protocols for translation and multilingual tasks out of the box:
- direct assessment (with custom sliders),
- ESA, MQM,
- contrastive evaluation,
- video/audio/image inputs,
- attention checks and tutorials,
- statistically sound model comparison (see the sketch below), etc.
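
A minimal sketch of what "statistically sound model comparison" can mean in practice: paired bootstrap resampling over per-segment human scores. This illustrates the general technique, not Pearmut's actual implementation; the scores below are made up.

import random

def paired_bootstrap(scores_a, scores_b, n_resamples=1000, seed=0):
    # How often does system A beat system B on resampled segment sets?
    rng = random.Random(seed)
    n = len(scores_a)
    wins_a = 0
    for _ in range(n_resamples):
        idx = [rng.randrange(n) for _ in range(n)]
        if sum(scores_a[i] for i in idx) > sum(scores_b[i] for i in idx):
            wins_a += 1
    return wins_a / n_resamples

# Made-up per-segment human scores (e.g., ESA, 0-100) for two systems.
print(paired_bootstrap([82, 75, 90, 68, 88, 79], [80, 77, 85, 66, 84, 78]))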
January 28, 2026 at 1:39 PM
How often is human evaluation skipped in papers/workflows just because it's too difficult to set up? Yet even a small humeval can give so much more signal than automatic metrics.

Introducing Pearmut: Human Evaluation of Translation Made Trivial 🍐 arxiv.org/pdf/2601.02933
January 28, 2026 at 1:39 PM
Join us and build a model that predicts human annotations of quality based on source speech and its textual translation.

iwslt.org/2026/metrics

Effort led by @maikezufle.bsky.social, @marinecarpuat.bsky.social, @hjhan.bsky.social, @matteo-negri.bsky.social, and others. 🙂
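
As a toy illustration of the task setup (not an official baseline; encoders and numbers are placeholders), one could regress human quality scores from concatenated speech and text embeddings:

import numpy as np
from sklearn.linear_model import Ridge
from sklearn.model_selection import cross_val_score

# Stand-in embeddings: in practice these would come from a speech
# encoder (source audio) and a text encoder (candidate translation).
rng = np.random.default_rng(0)
speech = rng.normal(size=(200, 32))
text = rng.normal(size=(200, 32))
scores = rng.uniform(0, 100, size=200)  # stand-in human annotations

X = np.concatenate([speech, text], axis=1)
print(cross_val_score(Ridge(), X, scores, scoring="neg_mean_absolute_error"))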
Speech Translation Metrics track
Home of the IWSLT conference and SIGSLT.
January 14, 2026 at 6:04 PM
Have you ever wondered how speech translation gets evaluated? Sadly, most speech evaluation falls back to text-based metrics. Let's do better!

At IWSLT 2026, we’re launching the first-ever ✨Speech Translation Metrics Shared Task ✨!
January 14, 2026 at 6:04 PM
Dissatisfied with EACL paper decisions? Fret not and submit your paper with ARR reviews to the Multilingual Multicultural Evaluation workshop at EACL (either archival or non-archival) until January 5th. 🔍🙂

multilingual-multicultural-evaluation.github.io
January 3, 2026 at 12:53 PM
Why is self-attention masked only diagonally?
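
(For reference, presumably about the standard causal mask, where everything above the diagonal is masked out; a PyTorch sketch with illustrative shapes:)

import torch

seq_len = 5
# True marks future positions that must not be attended to.
mask = torch.triu(torch.ones(seq_len, seq_len, dtype=torch.bool), diagonal=1)
scores = torch.randn(seq_len, seq_len).masked_fill(mask, float("-inf"))
print(torch.softmax(scores, dim=-1))  # row i attends only to positions <= i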
December 23, 2025 at 12:58 PM
Reposted by Vilém Zouhar @ EACL
Now onwards to making language models transparent and trustworthy for everyone! 🚀

For those curious to know more about my thesis:
- Web-optimized version: gsarti.com/phd-thesis/
- PDF: research.rug.nl/en/publicati...
- Steal my Quarto template: github.com/gsarti/phd-t...
From Insights to Impact
Ph.D. Thesis, Center for Language and Cognition (CLCG), University of Groningen
December 16, 2025 at 12:21 PM
Do you have work on resources, metrics & methodologies for evaluating multilingual systems?

Share it at the MME workshop 🕵️ co-located with EACL.

Direct submission deadline in 10 days (December 19th)!
multilingual-multicultural-evaluation.github.io
Multilingual Multicultural Evaluation Workshop
LLMs in every language? Prove it. Showcase your work on rigorous, efficient, scalable, culture-aware multilingual benchmarking.
December 10, 2025 at 9:42 AM
With great power comes warning: Underfull \vbox (badness 10000).
December 9, 2025 at 11:11 AM
I'm abandoning LaTeX and my next ACL paper will be in Typst (I fantasize).
December 9, 2025 at 11:11 AM
ctan.math.illinois.edu
December 9, 2025 at 12:40 AM
NLP evaluation is often detached from practical applications. Today I extrinsically evaluated one WMT25 translation system on the task of getting my hair done without knowing Chinese.

Yes, you got 67 BLEU points, but is the resulting hair slaying? 💇

See the result on one datapoint (my head) at EMNLP.
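
For reference, corpus BLEU the way it is usually computed (via sacrebleu; the sentences are made up and say nothing about the haircut's actual quality):

from sacrebleu.metrics import BLEU

hyps = ["please trim the sides and keep the top long"]
refs = [["please trim the sides and keep the top longer"]]
print(BLEU().corpus_score(hyps, refs).score)  # a number, not a haircut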
November 3, 2025 at 5:49 AM
The inspiration for the subset2evaluate poster comes from Henri Matisse's The Horse, the Rider and the Clown. 🐎🚴‍♀️🤡
October 28, 2025 at 5:13 PM
- How to Select Datapoints for Efficient Human Evaluation of NLG Models? arxiv.org/abs/2501.18251

- Estimating Machine Translation Difficulty arxiv.org/abs/2508.10175

- COMET-poly: Machine Translation Metric Grounded in Other Candidates arxiv.org/abs/2508.18549
October 28, 2025 at 9:45 AM
Let's talk about eval (automatic or human) and multilinguality at #EMNLP in Suzhou! 🇨🇳

- Efficient evaluation (Nov 5, 16:30, poster session 3)
- MT difficulty (Nov 7, 12:30, findings 3)
- COMET-poly (Nov 8, 11:00, WMT)

(DM to meet 🌿 )
October 28, 2025 at 9:45 AM
...really interesting research problems I was passionate about, and planning my research future.

You should apply to these fellowships, even if it's for the exercise of periodically refining your research statement.
October 24, 2025 at 12:32 PM
Grateful to receive the Google PhD Fellowship in NLP! 🙂

I am not secretive about having applied to 4 similar fellowships during my PhD before and not succeeding. Still, refining my research statement (part of the application) helped me tremendously in finding out the...

inf.ethz.ch/news-and-eve...
Google PhD Fellowships 2025
Yutong Chen, Benedict Schlüter and Vilém Zouhar, all three of them doctoral students at the Department of Computer Science, have been awarded the Google PhD Fellowship. The programme was created to re...
October 24, 2025 at 12:32 PM
Congratulations, doctor! 🤓
October 22, 2025 at 4:14 PM