Lightnews — Scholar-powered news

Verena Blaschke @verenablaschke.bsky.social · Aug 17

#Interspeech2025 had a science fair today with lots of interactive speech tech demos, not just for conference attendees but also/especially for curious laypeople! The demos were fun, and I like the idea of combining a conference w/ a bit of scicomm for the local public

INTERSPEECH 2025 @interspeech.bsky.social · Jul 21

Are you curious about language, technology, and science?
Join us on Aug 17 at the Speech Science Festival in Ahoy Rotterdam!

Verena Blaschke @verenablaschke.bsky.social · Aug 7

Check out the...
- talk on Mon Aug 18, 15:50–16:10
- preprint: arxiv.org/abs/2506.02894
- suppl. material: github.com/mainlp/betth...

Joint work w/ Miriam Winkler & @barbaraplank.bsky.social from @mainlp.bsky.social, and Constantin Förster & Gabriele Wenger-Glemser from Bayerischer Rundfunk!

A Multi-Dialectal Dataset for German Dialect ASR and Dialect-to-Standard Speech Translation

Although Germany has a diverse landscape of dialects, they are underrepresented in current automatic speech recognition (ASR) research. To enable studies of how robust models are towards dialectal var...

arxiv.org

Verena Blaschke @verenablaschke.bsky.social · Aug 7

Automatic metrics like WER and human quality judgements are moderately correlated. Dialectal words are often rendered as nonsense. Dialectal syntactic structures are often retained in the output – whether this is acceptable in Std German is hit-or-miss.
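For readers unfamiliar with WER (word error rate), here is a minimal sketch of how the metric is commonly computed: the word-level edit distance between a reference transcription and the ASR output, divided by the number of reference words. This is not the evaluation code from the paper. The function name `wer` and the example sentences are made up for illustration and are not taken from the dataset, and real evaluations usually normalize casing and punctuation first.

```python
def wer(reference: str, hypothesis: str) -> float:
    """Word error rate: word-level edit distance divided by reference length."""
    ref = reference.split()
    hyp = hypothesis.split()
    if not ref:
        raise ValueError("reference must contain at least one word")
    # prev[j] holds the edit distance between the reference words
    # processed so far and the first j hypothesis words.
    prev = list(range(len(hyp) + 1))
    for i, r in enumerate(ref, start=1):
        curr = [i]
        for j, h in enumerate(hyp, start=1):
            cost = 0 if r == h else 1
            curr.append(min(
                prev[j] + 1,         # reference word missing from hypothesis
                curr[j - 1] + 1,     # extra word in hypothesis
                prev[j - 1] + cost,  # substitution or exact match
            ))
        prev = curr
    return prev[-1] / len(ref)


# Hypothetical example: one substituted word and one missing word
# against a five-word reference.
print(wer("das ist ein kleines haus", "das is ein haus"))  # 0.4
```

Because every word mismatch counts equally, a score like this cannot tell a harmless spelling variant from a nonsense word, which is one reason it may track human judgements only moderately.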

Verena Blaschke @verenablaschke.bsky.social · Aug 7

All ASR models we benchmark perform much better on Standard German than on dialectal audio. Whether the transcriptions of the dialectal audio tend to be closer to the Std German references or to the dialectal references depends on the model decoder type.

Verena Blaschke @verenablaschke.bsky.social · Aug 7

Betthupferl contains sentences from three dialect groups spoken in southeast Germany, as well as Std German sentences for comparison. The dialectal sentences have both dialectal and Std German gold transcriptions, showing differences between pronunciation, word choice and morphosyntax.

A sentence from the dataset with a Standard German and a dialectal transcription that differ on the word and phrase level.

Verena Blaschke @verenablaschke.bsky.social · Aug 7

At #Interspeech2025 I'm going to present Betthupferl, a dataset for German dialect ASR & dialect-to-standard speech translation! We analyze differences between dialectal & Standard German transcriptions, benchmark ASR models, and examine shortcomings of current ASR models & evaluation metrics.

Paper title ("A multi-dialectal dataset for German dialect ASR and dialect-to-standard speech translation") and a map of the German state of Bavaria showing where the Franconian, Bavarian, and Alemannic dialect groups are spoken

Verena Blaschke @verenablaschke.bsky.social · Jul 27

UPDATE: Our poster presentation got moved to Tuesday, 16:00–17:30 (session 10)! #ACL2025NLP

Verena Blaschke @verenablaschke.bsky.social · Jul 18

At #ACL2025NLP I'll present our analysis of the effect of linguistic similarity on cross-lingual transfer! We looked at how 10 similarity measures correlate w/ transfer results btwn 263 languages across 3 NLP tasks. Different similarity measures matter for diff. experiments (no one-size-fits-all)!

Correlations between transfer results per experiment (parsing, POS tagging, topic classification with different input representations) and similarity measures. The results vary a lot across experiments and measures – some are described in the next posts.

Verena Blaschke @verenablaschke.bsky.social · Jul 27

The poster presentation slot got moved to Tuesday, 16:00–17:30!

Verena Blaschke @verenablaschke.bsky.social · Jul 18

Joint work with Masha Fedzechkina and @maartjeterhoeve.bsky.social produced during my internship at Apple last year!
See you at the Findings poster reception on Monday July 28 (18:00-19:30) :)
Preprint: arxiv.org/abs/2501.14491

Analyzing the Effect of Linguistic Similarity on Cross-Lingual Transfer: Tasks and Experimental Setups Matter

Cross-lingual transfer is a popular approach to increase the amount of training data for NLP tasks in a low-resource context. However, the best strategy to decide which cross-lingual data to include i...

arxiv.org

Verena Blaschke @verenablaschke.bsky.social · Jul 18

In practice, selecting a transfer language based on just one relevant similarity measure or the transfer results on a similar NLP task w/ similar input representations works well -- although it's best to compare multiple promising transfer candidates.

Verena Blaschke @verenablaschke.bsky.social · Jul 18

... Topic classification based on n-grams is sensitive to string overlap (+ correlated linguistic measures), but topic classification based on mBERT embeddings doesn't show any strong correlations – here, inclusion in the pre-training data is important instead.

Verena Blaschke @verenablaschke.bsky.social · Jul 18

Fortunately, the patterns confirm our intuitions – e.g., syntactic similarity matters for parsing but not for topic classification. However, input representations matter too....

Reposted by Verena Blaschke

Barbara Plank @barbaraplank.bsky.social · Jun 20

My ACL 2024 keynote talk on "Are LLMs Narrowing Our Horizon? Let’s Embrace Variation in NLP!" is online now:

underline.io/events/466/s...

2024.aclweb.org/program/keyn...

It was a huge honor for me to give last year's keynote at the flagship NLP conference in Bangkok 🇹🇭

underline.io

Verena Blaschke @verenablaschke.bsky.social · Jun 4

Does your Bavarian go beyond "Servus" and "Pfiade"? Then you're exactly who we're looking for!
We're looking for Bavarian speakers who would like to answer a short survey about AI-generated Bavarian for a master's thesis.
With every participation, we bring the Bavarian dialect a little further into the digital world!

Verena Blaschke @verenablaschke.bsky.social · May 30

Bavarian dialect speakers needed! Our MSc student Miriam wants to find out 1. how good/bad LLM-generated "Bavarian" is, and 2. whether dialect speakers agree with each other on this. The survey takes <5 min: survey.ifkw.lmu.de/dialquali25/ Thank you for sharing/participating!

Reposted by Verena Blaschke

Queer in AI @queerinai.com · May 4

The first archival *CL Queer in AI workshop will kick off in about 15 min! Join us in-person if you're at NAACL or virtually 💜

We will have presentations from our amazing contributors and invited speakers. Read on for more details 🧵

Reposted by Verena Blaschke

Vinodkumar Prabhakaran @vinodkpg.bsky.social · May 4

Happening now at #NAACL2025 in room Pecos.

Kicking off with amazing talks and a panel by Monojit Choudhury, Isabelle Augenstein, and Katia Shutova

C3NLP @c3nlp.bsky.social · Apr 28

📣 Excited that our C3NLP 2025 Workshop program is finalized — just one week to go! 🎉

Full program: c3nlp.github.io

Co-organized with @vinodkpg.bsky.social @sunipadev.bsky.social @lucianabenotti.bsky.social @yongcao.bsky.social @danielhers.bsky.social Laura Cabello, Ife Adebara, and Li Zhou. ❤️

Reposted by Verena Blaschke

Manuel Mager (Turatemai) @pywirrarika.bsky.social · May 4

Happening now at @americasnlp.bsky.social 2025. Telegram in Aymara and how to translate tech terminology. #NAACL2025

Reposted by Verena Blaschke

Alan Ramponi @alanramponi.bsky.social · May 2

📣 Join us tomorrow May 3rd for the 10th Workshop on Noisy and User-generated Text #W-NUT at #NAACL2025 (📍 Room Navajo/Nambe)!

The workshop features 16 paper presentations and 2 exciting keynote talks by @verenablaschke.bsky.social and Su Lin Blodgett (titles+abstracts below)! #NLProc #NAACL

👇

Verena Blaschke @verenablaschke.bsky.social · Apr 29

The full workshop programme is here: noisy-text.github.io/2025/

W-NUT 2025: Workshop on Noisy and User-generated Text (at NAACL 2025)

noisy-text.github.io

Verena Blaschke @verenablaschke.bsky.social · Apr 29

On my way to #NAACL2025 where I'll give a keynote at the noisy text workshop (WNUT), presenting some of the challenges & methods for dialect NLP + also discussing dialect speakers' perspectives!

🗨️ Beyond “noisy” text: How (and why) to process dialect data
🗓️ Saturday, May 3, 9:30–10:30

Verena Blaschke @verenablaschke.bsky.social · Apr 22

This article is about a success story, but it also mentions unsuccessful prior attempts and discusses the different perspectives/priorities that NLP researchers vs. field linguists might have: hdl.handle.net/10125/24793

Integrating Automatic Transcription into the Language Documentation Workflow: Experiments with Na Data and the Persephone Toolkit

Automatic speech recognition tools have potential for facilitating language documentation, but in practice these tools remain little-used by linguists for a variety of reasons, such as that the technology is still new (and evolving rapidly), user-friendly interfaces are still under development, and case studies demonstrating the practical usefulness of automatic recognition in a low-resource setting remain few. This article reports on a success story in integrating automatic transcription into the language documentation workflow, specifically for Yongning Na, a language of Southwest China. Using Persephone, an open-source toolkit, a single-speaker speech transcription tool was trained over five hours of manually transcribed speech. The experiments found that this method can achieve a remarkably low error rate (on the order of 17%), and that automatic transcriptions were useful as a canvas for the linguist. The present report is intended for linguists with little or no knowledge of speech processing. It aims to provide insights into (i) the way the tool operates and (ii) the process of collaborating with natural language processing specialists. Practical recommendations are offered on how to anticipate the requirements of this type of technology from the early stages of data collection in the field.

hdl.handle.net

Reposted by Verena Blaschke

Barbara Plank @barbaraplank.bsky.social · Apr 15

Are you attending NAACL 2025 and are you interested in low-resource languages and dialects?

Then don't miss our very own @verenablaschke.bsky.social's keynote talk at the WNUT 2025 workshop on May 3rd:

Beyond “noisy” text: How (and why) to process dialect data

🌐 ☀️
noisy-text.github.io/2025/

Verena Blaschke @verenablaschke.bsky.social · Mar 24

I just finished reading the preprint -- cool paper + very timely!
