Dirk Hovy
@dirkhovy.bsky.social
Professor @milanlp.bsky.social for #NLProc, compsocsci, #ML
Also at http://dirkhovy.com/
Found and added under data/
January 20, 2026 at 11:21 AM
I included some test cases on GitHub; I'll check whether I still have the ones we used in the paper.
January 20, 2026 at 11:11 AM
If you are curious about the theoretical background, see

Hovy, D., Berg-Kirkpatrick, T., Vaswani, A., & Hovy, E. (2013). Learning Whom to Trust with MACE. In Proceedings of NAACL-HLT. ACL.

aclanthology.org/N13-1132.pdf

And for even more details:

aclanthology.org/Q18-1040.pdf

N/N
January 20, 2026 at 10:20 AM
I always wanted to revisit it, port it from Java to Python, and extend it to continuous data, but never found the time.
Last week, I played around with Cursor – and got it all done in ~1 hour. 🤯

If you work with any response data that needs aggregation, give it a try—and let me know what you think!
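Purely as an illustration of what the continuous extension could look like (my own toy sketch, not the actual port): a MACE-style model for real-valued responses can treat a faithful answer as Gaussian noise around the latent truth and a spammed answer as a draw from a flat density over the response range, then alternate EM updates:

```python
import numpy as np

def mace_continuous(annotations, n_iter=50, eps=1e-12):
    """annotations: (n_items, n_annotators) float array; NaN marks missing."""
    theta = np.full(annotations.shape[1], 0.9)   # competence per annotator
    sigma = np.full(annotations.shape[1], 1.0)   # noise of faithful answers
    lo, hi = np.nanmin(annotations), np.nanmax(annotations)
    spam_pdf = 1.0 / max(hi - lo, eps)           # flat "spamming" density
    truth = np.nanmean(annotations, axis=1)      # init: per-item means

    for _ in range(n_iter):
        # E-step: responsibility that each annotation was given faithfully
        resid = annotations - truth[:, None]
        faithful_pdf = np.exp(-0.5 * (resid / sigma) ** 2) / (sigma * np.sqrt(2 * np.pi))
        w = theta * faithful_pdf / (theta * faithful_pdf + (1 - theta) * spam_pdf)
        w = np.where(np.isnan(annotations), 0.0, w)
        # M-step: competence-weighted truth, then theta and sigma
        truth = np.nansum(w * annotations, axis=1) / (w.sum(axis=1) + eps)
        seen = ~np.isnan(annotations)
        theta = w.sum(axis=0) / (seen.sum(axis=0) + eps)
        sigma = np.sqrt((w * np.nan_to_num(resid) ** 2).sum(axis=0)
                        / (w.sum(axis=0) + eps)) + eps
    return truth, theta
```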

4/N
January 20, 2026 at 10:17 AM
MACE estimates:
1. Annotator reliability (who’s consistent?)
2. Item difficulty (which examples spark disagreement?)
3. The most likely aggregate label (the latent “best guess”)
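To make that concrete, here is a deliberately simplified NumPy sketch of the EM updates behind a MACE-style model (my toy code, not the released implementation; the paper's version adds priors, random restarts, and a variational Bayes variant). Each annotator j gets a competence theta[j] and a "spamming" label distribution xi[j]; the E-step infers posteriors over true labels, and the M-step re-estimates theta and xi:

```python
import numpy as np

def mace_em(annotations, n_labels, n_iter=50, eps=1e-12):
    """annotations: (n_items, n_annotators) int array; -1 marks missing."""
    n_items, n_annot = annotations.shape
    theta = np.full(n_annot, 0.9)                      # competence per annotator
    xi = np.full((n_annot, n_labels), 1.0 / n_labels)  # "spamming" label dist.

    for _ in range(n_iter):
        # E-step: posterior over each item's true label
        log_post = np.zeros((n_items, n_labels))
        for j in range(n_annot):
            seen = annotations[:, j] >= 0
            a = annotations[seen, j]
            # P(a | true=l) = theta[j]*[a==l] + (1-theta[j])*xi[j, a]
            lik = theta[j] * (a[:, None] == np.arange(n_labels)) \
                  + (1.0 - theta[j]) * xi[j, a][:, None]
            log_post[seen] += np.log(lik + eps)
        post = np.exp(log_post - log_post.max(axis=1, keepdims=True))
        post /= post.sum(axis=1, keepdims=True)

        # M-step: responsibilities of the "faithful" branch update theta, xi
        for j in range(n_annot):
            idx = np.where(annotations[:, j] >= 0)[0]
            if idx.size == 0:
                continue
            a = annotations[idx, j]
            faithful = post[idx, a] * theta[j] / (
                theta[j] + (1.0 - theta[j]) * xi[j, a] + eps)
            theta[j] = faithful.mean()
            counts = np.bincount(a, weights=1.0 - faithful,
                                 minlength=n_labels) + eps
            xi[j] = counts / counts.sum()

    # aggregate labels, annotator competences, per-item label posteriors
    return post.argmax(axis=1), theta, post

# Example: 3 items, 3 annotators, labels {0, 1}; -1 = missing
# labels, competence, post = mace_em(np.array([[0,0,1],[1,1,1],[0,1,-1]]), 2)
```

The per-item posteriors double as a difficulty signal: items whose posterior stays close to uniform are the ones that spark disagreement even among reliable annotators.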

That “side project” ended up powering hundreds of annotation projects over the years.

3/N
January 20, 2026 at 10:15 AM
However, disagreement isn’t just noise—it’s information. It can mean an item is genuinely hard—or someone wasn’t paying attention. If only you knew whom to trust…

That summer, Taylor Berg-Kirkpatrick, Ashish Vaswani, and I built MACE (Multi-Annotator Competence Estimation).

2/N
January 20, 2026 at 10:14 AM