Niyati Bafna
@niyatibafna.bsky.social
77 followers 160 following 53 posts
PhD student @jhuclsp. Previously @AIatMeta, @InriaParisNLP, @EM_LCT| #NLProc
niyatibafna.bsky.social
Accepted at ACL main! Come chat about dialectal MT at our poster today at 4 pm.
Also, check out this largely bug-free package for generating your own synthetic dialectal data:
pypi.org/project/dial...
niyatibafna.bsky.social
Dialects lie on continua of (structured) linguistic variation, right? And we can’t collect data for every point on the continuum...🤔
📢 Check out DialUp, a technique to make your MT model robust to the dialect continua of its training languages, including unseen dialects.
arxiv.org/abs/2501.16581
Reposted by Niyati Bafna
zouharvi.bsky.social
You have a budget to human-evaluate 100 inputs to your models, but your dataset is 10,000 inputs. Do not just pick 100 randomly!🙅

We can do better. "How to Select Datapoints for Efficient Human Evaluation of NLG Models?" shows how.🕵️
(random is still a devilishly good baseline)
niyatibafna.bsky.social
Thanks! Yeah, that idea's definitely still around :) Although "language-agnostic" to a large extent seems to be "English" (arxiv.org/pdf/2402.18815, aclanthology.org/2024.acl-lon...)
niyatibafna.bsky.social
This work was done with my amazing collaborators Tianjian Li, @kentonmurray.bsky.social, @davidrmortensen.bsky.social, David Yarowsky, Hale Sirin, and @danielkhashabi.bsky.social, at @jhuclsp.bsky.social.
niyatibafna.bsky.social
If you can’t decide whether to go end-to-end or MT cascade for your next multilingual experiments, or you want to build alternative architectures or adapters for multilingual LLMs, or you want to know why (God, why) LLMs can't solve tasks in other languages, this paper is for you.
niyatibafna.bsky.social
Main takeaway: Translation failure is an important failure mode! Your model may be having wise and intelligent thoughts all the way up to its last couple of layers, and then failing to communicate them in Telugu because (like me) it has tried but failed to learn Telugu.
niyatibafna.bsky.social
We break down the patterns in the above figure by source and target language, talk about what makes the neat pipeline picture a little more complicated, and show briefly what happens with a bigger model (spoiler: things improve but not too much). See paper for details!
niyatibafna.bsky.social
In general, intermediate accuracy stays high even for LRL targets, but final accuracy quickly drops. And so TLP is high (>50%) for most target languages. Except for low-resource *source* languages, in which case task-solving fails before we get to translation.
niyatibafna.bsky.social
We then quantify *translation loss proportion*: the proportion of failure cases that had successful task-solving but failed translation (see paper for less hand-waviness). We look at intermediate task-solving accuracy (over all layers), final accuracy, and TLP.
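Roughly, in sketch form (field names here are illustrative; see the paper for the precise definition):

```python
# Sketch of translation loss proportion (TLP) as described above: among
# examples where the final answer is wrong, the fraction where some
# intermediate layer already held a semantically correct (off-target) answer.
def translation_loss_proportion(examples):
    failures = [ex for ex in examples if not ex["final_correct"]]
    if not failures:
        return 0.0
    translation_failures = [ex for ex in failures if ex["intermediate_correct"]]
    return len(translation_failures) / len(failures)

# e.g. translation_loss_proportion([
#     {"intermediate_correct": True,  "final_correct": False},  # counts toward TLP
#     {"intermediate_correct": False, "final_correct": False},  # task-solving failure
#     {"intermediate_correct": True,  "final_correct": True},   # success, ignored
# ])  # -> 0.5
```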
niyatibafna.bsky.social
What languages does task-solving occur in? We look at the distribution over languages of correct intermediate outputs and see that 1) English dominates 2) But other supported HRLs have a considerable combined presence! Also, this mix looks largely the same regardless of target language.
niyatibafna.bsky.social
We visualize the task-solving→translation pipeline, showing that intermediate layers have high *off-target* accuracy (task-solving), which gets converted (via translation) to *on-target* accuracy near the final layers, at least for HRLs. For LRL target languages, translation fails, resulting in bad outputs.
niyatibafna.bsky.social
We look at a word translation task for 108 language pairs, and use logit lens to trace *task-solving accuracy* (correct semantics regardless of language) and *on-targetness* (correct target language) over model layers.
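A minimal logit-lens sketch of this kind of layer-by-layer tracing (the model choice, prompt, and scoring steps below are illustrative, not the exact experimental setup):

```python
# Logit lens: project each layer's hidden state through the final norm and
# unembedding to read off an intermediate "prediction" per layer.
# Assumes a Llama-style architecture (model.model.norm, model.lm_head).
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model_name = "meta-llama/Llama-2-7b-hf"  # illustrative choice
tok = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForCausalLM.from_pretrained(model_name)
model.eval()

prompt = "Translate to Telugu: dog ->"  # toy word-translation prompt
inputs = tok(prompt, return_tensors="pt")

with torch.no_grad():
    out = model(**inputs, output_hidden_states=True)

# out.hidden_states: (num_layers + 1) tensors of shape [1, seq_len, hidden_dim]
for layer_idx, h in enumerate(out.hidden_states):
    logits = model.lm_head(model.model.norm(h[:, -1, :]))
    top_token = tok.decode(logits.argmax(dim=-1))
    # One would then score: semantically correct? (task-solving accuracy)
    # and in the target language? (on-targetness)
    print(layer_idx, repr(top_token))
```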
niyatibafna.bsky.social
This hypothesis says that 1) Multilingual generation uses a model-internal task-solving→translation cascade. 2) Failure of the translation stage *despite task-solving success* is a large part of the problem. That is, the model often solves the task but fails to articulate the answer.
niyatibafna.bsky.social
🔈When LLMs solve tasks with a mid-to-low resource input or target language, their output quality is poor. We know that. But can we put our finger on what breaks inside the LLM? We introduce the 💥 translation barrier hypothesis 💥 for failed multilingual generation with LLMs. arxiv.org/abs/2506.22724
niyatibafna.bsky.social
This work was done in (a super fun) collaboration with Matthew Wiesner, at the HLTCOE and @jhuclsp.bsky.social.
niyatibafna.bsky.social
Apparently the ECAPA-TDNN model thinks I'm speaking Bengali when I read out Wordsworth to it. I wish I spoke Bengali. I wish Wordsworth spoke Bengali. But the cold harsh truth: SOTA LID should be better.
niyatibafna.bsky.social
This module by itself shows very little accent-language confusion. In combination with the ECAPA-TDNN model, it shows large improvements on LID for L2-accented speech in English, French, and German, and minimal degradation on mainstream accented speech.
niyatibafna.bsky.social
Okay, so how do we fix this problem? We investigate using a module that incorporates long-range information to help out. We look at two representations of the input: as a sequence of phones and a sequence of discretised SSL representations. And we put a classifier on top.
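A toy version of such a classifier over discrete units (the architecture below is illustrative, not the exact module from the paper):

```python
# Sequence classifier over discrete units (phones or quantised SSL codes):
# embed the unit sequence, encode it with a BiLSTM so long-range context is
# available, and classify the language from a pooled representation.
import torch
import torch.nn as nn

class UnitSequenceLID(nn.Module):
    def __init__(self, vocab_size, num_languages, emb_dim=128, hidden_dim=256):
        super().__init__()
        self.embed = nn.Embedding(vocab_size, emb_dim)
        self.encoder = nn.LSTM(emb_dim, hidden_dim, batch_first=True, bidirectional=True)
        self.classifier = nn.Linear(2 * hidden_dim, num_languages)

    def forward(self, unit_ids):          # unit_ids: [batch, seq_len]
        x = self.embed(unit_ids)
        encoded, _ = self.encoder(x)      # [batch, seq_len, 2 * hidden_dim]
        pooled = encoded.mean(dim=1)      # average over time
        return self.classifier(pooled)    # language logits

# toy usage: 500 discrete units, 10 candidate languages
model = UnitSequenceLID(vocab_size=500, num_languages=10)
logits = model(torch.randint(0, 500, (2, 120)))  # batch of 2 utterances
```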
niyatibafna.bsky.social
This suggests that language identification models behave like accent identification models under the hood, largely relying on short-range phonotactics. When the accent-language association is broken, e.g. for L2-accented speech, LID models break. Badly!
niyatibafna.bsky.social
Models that show less block permutation invariance, such as the GEO model (aclanthology.org/2024.naacl-l...), also appear more robust to L2 accents.
niyatibafna.bsky.social
To test this, we look at *block permutation invariance*, i.e. the length of ordered (and unordered) input features that SOTA models rely on; our experiments indicate that they use features describing only about 1-2 phones.
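A sketch of the block-permutation test (the LID model and feature extraction are stand-ins):

```python
# Shuffle fixed-length blocks of an input feature sequence and check whether
# the LID prediction survives. If accuracy is unchanged even for very small
# blocks, the model is only using short-range features. `lid_predict` is a
# placeholder for any LID model that maps a feature sequence to a language.
import random

def block_shuffle(features, block_len, seed=0):
    """Split a feature sequence into blocks of `block_len` and permute them."""
    blocks = [features[i:i + block_len] for i in range(0, len(features), block_len)]
    rng = random.Random(seed)
    rng.shuffle(blocks)
    return [f for block in blocks for f in block]

def shuffled_accuracy(utterances, labels, lid_predict, block_len):
    """LID accuracy when every input is block-shuffled before prediction."""
    correct = sum(
        lid_predict(block_shuffle(feats, block_len)) == lab
        for feats, lab in zip(utterances, labels)
    )
    return correct / len(labels)
```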
niyatibafna.bsky.social
Our hypothesis: this is caused by the model relying on features that are too short. The intuition is that accents are characterised by short phone-usage features, while languages are characterised by vocabulary and syntax. L2-accented speech imposes the former over the latter, causing confusion when models are short-sighted.
niyatibafna.bsky.social
Accent-language confusion: The mis-recognition of L2-accented speech as the L1 substrate or a related language. For example, when Indonesian-accented English is classified as Indonesian, Malay, etc. A large part of model error on L2-accented speech follows this pattern!
niyatibafna.bsky.social
We know that speech LID systems flunk on accented speech. But why? And what can we do about it? 🤔
Our work arxiv.org/abs/2506.00628 (Interspeech '25) finds that *accent-language confusion* is an important culprit, ties it to the length of feature that the model relies on, and proposes a fix.
niyatibafna.bsky.social
Presented DialUp (MT, dialect continua, robustness, etc.; arxiv.org/abs/2501.16581) to some new people this week! Thanks Hale and @schmidtsciences.bsky.social for inviting me up to New York 🥯

Saw some magnolias too :)