Debora Nozza
@deboranozza.bsky.social
Assistant Professor at Bocconi University in MilaNLP group • Working in #NLP, #CSS and #Ethics • She/her • #ERCStG PERSONAE
Reposted by Debora Nozza
Found and added under data/
January 20, 2026 at 11:21 AM
Reposted by Debora Nozza
I included some test cases on GitHub; I'll check whether I still have the ones we used in the paper.
January 20, 2026 at 11:11 AM
Reposted by Debora Nozza
If you are curious about the theoretical background, see

Hovy, D., Berg-Kirkpatrick, T., Vaswani, A., & Hovy, E. (2013). Learning Whom to Trust with MACE. In Proceedings of NAACL-HLT. ACL.

aclanthology.org/N13-1132.pdf

And for even more details:

aclanthology.org/Q18-1040.pdf

N/N
January 20, 2026 at 10:20 AM
Reposted by Debora Nozza
I always wanted to revisit it, port it from Java to Python, and extend it to continuous data, but never found the time.
Last week, I played around with Cursor – and got it all done in ~1 hour. 🤯

If you work with any response data that needs aggregation, give it a try—and let me know what you think!

4/N
January 20, 2026 at 10:17 AM
Reposted by Debora Nozza
MACE estimates:
1. Annotator reliability (who’s consistent?)
2. Item difficulty (which examples spark disagreement?)
3. The most likely aggregate label (the latent “best guess”)

That “side project” ended up powering hundreds of annotation projects over the years.

3/N
January 20, 2026 at 10:15 AM
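For a concrete picture of the aggregation idea in the post above, here is a minimal sketch, not the authors' implementation: a toy EM loop over a simplified model in which each annotator gives the true label with some competence and otherwise picks a wrong label uniformly. MACE's actual model, described in the papers linked above, is richer (it roughly treats unreliable answers as draws from an annotator-specific "spam" distribution). The function name and the toy data below are made up for illustration.

```python
from collections import defaultdict

def aggregate(annotations, labels, n_iter=50):
    """annotations: list of (item, annotator, label) triples."""
    by_item = defaultdict(list)
    for item, ann, lab in annotations:
        by_item[item].append((ann, lab))
    annotators = {ann for _, ann, _ in annotations}
    k = len(labels)

    # Start above chance so the first E-step is informative.
    competence = {ann: 0.8 for ann in annotators}
    posteriors = {}

    for _ in range(n_iter):
        # E-step: posterior over each item's true label, given current competences.
        for item, votes in by_item.items():
            scores = {}
            for cand in labels:
                p = 1.0
                for ann, lab in votes:
                    c = competence[ann]
                    p *= c if lab == cand else (1.0 - c) / (k - 1)
                scores[cand] = p
            total = sum(scores.values())
            posteriors[item] = {cand: s / total for cand, s in scores.items()}

        # M-step: competence = expected fraction of an annotator's answers
        # that agree with the inferred true label (clamped to avoid degeneracy).
        hits, counts = defaultdict(float), defaultdict(float)
        for item, votes in by_item.items():
            for ann, lab in votes:
                hits[ann] += posteriors[item][lab]
                counts[ann] += 1.0
        competence = {ann: min(max(hits[ann] / counts[ann], 1e-6), 1 - 1e-6)
                      for ann in annotators}

    best = {item: max(p, key=p.get) for item, p in posteriors.items()}
    return best, competence, posteriors


if __name__ == "__main__":
    data = [
        ("q1", "ann1", "pos"), ("q1", "ann2", "pos"), ("q1", "ann3", "neg"),
        ("q2", "ann1", "neg"), ("q2", "ann2", "neg"), ("q2", "ann3", "neg"),
    ]
    best, competence, _ = aggregate(data, labels=["pos", "neg"])
    print(best)        # {'q1': 'pos', 'q2': 'neg'}
    print(competence)  # ann3 ends up with a lower estimated competence
```

The per-item posteriors double as a rough difficulty signal: high-entropy posteriors flag the items that split the annotators, mirroring point 2 in the post above.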
Reposted by Debora Nozza
However, disagreement isn’t just noise—it’s information. It can mean an item is genuinely hard—or someone wasn’t paying attention. If only you knew whom to trust…

That summer, Taylor Berg-Kirkpatrick, Ashish Vaswani, and I built MACE (Multi-Annotator Competence Estimation).

2/N
January 20, 2026 at 10:14 AM