Marzena Karpinska ✈️ COLM'25
@markar.bsky.social
3.8K followers 930 following 63 posts
#nlp researcher interested in evaluation including: multilingual models, long-form input/output, processing/generation of creative texts previous: postdoc @ umass_nlp phd from utokyo https://marzenakrp.github.io/
markar.bsky.social
Come talk with us today about the evaluation of long-form multilingual generation at the second poster session! #COLM2025

📍4:30–6:30 PM / Room 710 – Poster #8
markar.bsky.social
Off to #COLM. Fake Fuji looks really good today.
I've only ever seen the real one from below, so I'm glad to at least see the fake one from above today.
markar.bsky.social
I feel like it was worth waking up early
markar.bsky.social
Wait, how come? I'm flying direct at 7 AM...
Reposted by Marzena Karpinska ✈️ COLM'25
yoavgo.bsky.social
When reading AI reasoning text (aka CoT), we (humans) form a narrative about the underlying computation process, which we take as a transparent explanation of model behavior. But what if our narratives are wrong? We measure that and find it usually is.

Now on arXiv: arxiv.org/abs/2508.16599
Humans Perceive Wrong Narratives from AI Reasoning Texts
A new generation of AI models generates step-by-step reasoning text before producing an answer. This text appears to offer a human-readable window into their computation process, and is increasingly r...
arxiv.org
Reposted by Marzena Karpinska ✈️ COLM'25
kocmitom.bsky.social
📊 Preliminary ranking of WMT 2025 General Machine Translation benchmark is here!

But don't draw conclusions just yet - automatic metrics are biased for techniques like metric as a reward model or MBR. The official human ranking will be part of General MT findings at WMT.

arxiv.org/abs/2508.14909
Preliminary Ranking of WMT25 General Machine Translation Systems
We present the preliminary ranking of the WMT25 General Machine Translation Shared Task, in which MT systems have been evaluated using automatic metrics. As this ranking is based on automatic evaluati...
arxiv.org
markar.bsky.social
Happy to see this work accepted to #EMNLP2025! 🎉🎉🎉
Reposted by Marzena Karpinska ✈️ COLM'25
emnlpmeeting.bsky.social
✨We are thrilled to announce that over 3200 papers have been accepted to #EMNLP2025

This includes over 1800 main conference papers and over 1400 papers in findings!

Congratulations to all authors!! 🎉🎉🎉
Reposted by Marzena Karpinska ✈️ COLM'25
jessyjli.bsky.social
The Echoes in AI paper showed quite the opposite, also with a story-continuation setup.
Additionally, we present evidence that both *syntactic* and *discourse* diversity measures show strong homogenization that the lexical and cosine-similarity measures used in this paper do not capture.
markar.bsky.social
At the same time I wish that whoever sparked this interest in data distribution would also help them with the design...
markar.bsky.social
Absolutely! Looking forward to seeing QUDsim at COLM!
markar.bsky.social
The issue is always: what, which humans, and in what circumstances.
markar.bsky.social
I think there are quite a few undergraduate students on this preprint, and maybe there was a need for a bit more mentoring. The comparison to WritingPrompts is just one of the issues (amateur writers working in very different conditions than normal writing, plus very short outputs).
markar.bsky.social
Check out the full leaderboard here: novelchallenge.github.io

We'll be updating the dataset with new books and claims within the next few months!
NoCha leaderboard
novelchallenge.github.io
markar.bsky.social
GPT-5 lands first place on NoCha, our long-context book understanding benchmark.

That said, this is a tiny improvement (~1%) over o1-preview, which was released almost one year ago. Have long-context models hit a wall?

Accuracy of human readers is >97%... Long way to go!
Screenshot of the benchmark with GPT-5 on top at 68.46% accuracy.
Reposted by Marzena Karpinska ✈️ COLM'25
ankitagupta.bsky.social
🗓️29 July, 4 PM: Automated main concept generation for narrative discourse assessment in aphasia. w/
@marisahudspeth.bsky.social, Polly Stokes, Jacquie Kurland, and @brenocon.bsky.social

📍Hall 4/5.

Come by to chat about argumentation, narrative texts, policy & law, and beyond! #ACL2025NLP
Reposted by Marzena Karpinska ✈️ COLM'25
ankitagupta.bsky.social
Excited to present two papers at #ACL2025!

🗓️30 July, 11 AM: 𝛿-Stance: A Large-Scale Real World Dataset of Stances in Legal Argumentation. w/ Douglas Rice and @brenocon.bsky.social

📍At Hall 4/5. 🧵👇
Reposted by Marzena Karpinska ✈️ COLM'25
lasha.bsky.social
📣 Life update: Thrilled to announce that I’ll be starting as faculty at the Max Planck Institute for Software Systems this Fall!

I’ll be recruiting PhD students in the upcoming cycle, as well as research interns throughout the year: lasharavichander.github.io/contact.html
Kaiserslautern, Germany
Reposted by Marzena Karpinska ✈️ COLM'25
emnlpmeeting.bsky.social
For EMNLP 2025’s special theme of "Advancing our Reach: Interdisciplinary Recontextualization of NLP", we are organizing a panel of experts, and would like input from the community at large as we prepare. Please take a moment to fill in this survey: forms.office.com/r/pWFFA0Gss1
Reposted by Marzena Karpinska ✈️ COLM'25
melaniemitchell.bsky.social
A new definition for AGI just dropped, and it is a bad one.
abeba.bsky.social
lord grant me the courage to write with the confidence of a mediocre white man
screenshot reads: Many (not all) insiders now say AGI — artificial general intelligence — stands a good chance of happening in the next few years. AGI is a generative AI model that could, on intellectually oriented tests, outperform human experts on 90% of questions. That doesn’t mean AI will be able to dribble a basketball, make GDP grow by 40% a year or, for that matter, destroy us. Still, AGI would be an impressive accomplishment — and over time, however slowly, it will change our world.
markar.bsky.social
Now accepted to #COLM2025 @colmweb.org
🇨🇦🎉
yekyung.bsky.social
Is the needle-in-a-haystack test still meaningful given the giant green heatmaps in modern LLM papers?

We create ONERULER 💍, a multilingual long-context benchmark that allows for nonexistent needles. Turns out NIAH isn't so easy after all!

Our analysis across 26 languages 🧵👇
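To make the setup above concrete, here is a minimal sketch of a needle-in-a-haystack (NIAH) style check in which the needle may be absent, so the model must also be able to answer "none". All names here (`build_haystack`, `score`, the filler text) are hypothetical illustrations; the actual ONERULER prompts, tasks, and 26-language data differ.

```python
import random

def build_haystack(filler_sentences, needle=None, seed=0):
    """Concatenate filler text, optionally hiding a needle at a random position."""
    rng = random.Random(seed)
    sents = list(filler_sentences)
    if needle is not None:
        sents.insert(rng.randrange(len(sents) + 1), needle)
    return " ".join(sents)

def score(answer, needle):
    """Correct = quote the needle, or say 'none' when it was never inserted."""
    if needle is None:
        return "none" in answer.lower()
    return needle in answer

# Build one context with a needle and one without.
filler = [f"Filler sentence number {i}." for i in range(1000)]
with_needle = build_haystack(filler, needle="The magic number is 42.")
without_needle = build_haystack(filler, needle=None)
```

The "nonexistent needle" condition is what makes this harder than the standard test: a model that always extracts *something* from the context will fail the `needle=None` cases.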
markar.bsky.social
I always had to apply for IRB approval in Japan (UTokyo), though the process was much longer than in the US (the committee met only a few times a year, and you were almost guaranteed to be asked to correct something, which extended the process). It could easily take 2-3 months.
Reposted by Marzena Karpinska ✈️ COLM'25
marinecarpuat.bsky.social
What should Machine Translation research look like in the age of multilingual LLMs?

Here’s one answer from researchers across NLP/MT, Translation Studies, and HCI.
"An Interdisciplinary Approach to Human-Centered Machine Translation"
arxiv.org/abs/2506.13468
An Interdisciplinary Approach to Human-Centered Machine Translation
Machine Translation (MT) tools are widely used today, often in contexts where professional translators are not present. Despite progress in MT technology, a gap persists between system development and...
arxiv.org