Christoph Minixhofer
@cdminix.bsky.social
100 followers 220 following 110 posts
PhD Student @ University of Edinburgh. Working on Synthetic Speech Evaluation at the moment. 🇳🇴 Oslo 🏴󠁧󠁢󠁳󠁣󠁴󠁿 Edinburgh 🇦🇹 Graz
Pinned
cdminix.bsky.social
🧪 SSL (self-supervised learning) models can produce very useful speech representations, but what if we limit their input to prosodic correlates (Pitch, Energy, Voice Activity)? Sarenne Wallbridge and I explored what these representations do (and don’t) encode: arxiv.org/abs/2506.02584 1/2
Prosodic Structure Beyond Lexical Content: A Study of Self-Supervised Learning
People exploit the predictability of lexical structures during text comprehension. Though predictable structure is also present in speech, the degree to which prosody, e.g. intonation, tempo, and loud...
arxiv.org
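Not the paper's actual pipeline, just a minimal sketch of what extracting the three correlates mentioned above could look like, assuming librosa is available; the pyin voicing flag stands in for proper voice activity detection:

```python
import librosa
import numpy as np

def prosodic_correlates(path, sr=16000, hop_length=512):
    """Frame-level pitch, energy, and a crude voice-activity stand-in."""
    y, sr = librosa.load(path, sr=sr)
    # Pitch (F0) and a per-frame voicing decision via probabilistic YIN.
    f0, voiced_flag, _ = librosa.pyin(
        y,
        fmin=librosa.note_to_hz("C2"),
        fmax=librosa.note_to_hz("C7"),
        sr=sr,
        hop_length=hop_length,
    )
    # Energy as frame-wise RMS, using the same hop as pyin.
    energy = librosa.feature.rms(y=y, hop_length=hop_length)[0]
    # Unvoiced frames have NaN F0; zero them so the three streams stack cleanly.
    f0 = np.nan_to_num(f0, nan=0.0)
    n = min(len(f0), len(energy))
    return np.stack([f0[:n], energy[:n], voiced_flag[:n].astype(np.float32)])
```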
cdminix.bsky.social
TTSDS2 is one of the papers accepted by the @neuripsconf.bsky.social area chairs but rejected by the senior area chairs, with no explanation as to why. A bit frustrating after the long review process.
cdminix.bsky.social
100% agreed, also crisps are a snack, not a side dish for lunch
cdminix.bsky.social
Accents are also best seen as a distribution, not a group of labels imo. We tried to incorporate some proxy of accent in TTSDS2, but a simple phone distribution did not work all that well, probably because it’s hard to disentangle from lexical content…
ninamarkl.bsky.social
in honour of interspeech this week, i’d like to issue a reminder that everyone has an accent, and that’s beautiful, actually: www.isca-archive.org/interspeech_...

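For what it's worth, a toy sketch of the phone-distribution proxy mentioned above: compare phone frequency histograms between speakers. The phone inventory and sequences here are made up; a real setup would get them from a phone recognizer, and scipy is assumed:

```python
from collections import Counter
import numpy as np
from scipy.spatial.distance import jensenshannon

PHONES = ["AA", "AE", "AH", "EH", "IY", "R", "T", "TH"]  # toy inventory

def phone_distribution(phones):
    """Normalized frequency of each phone in a (toy) phone sequence."""
    counts = Counter(phones)
    total = sum(counts.values())
    return np.array([counts[p] / total for p in PHONES])

speaker_a = ["AA", "R", "T", "AA", "R", "IY", "TH", "AH"]
speaker_b = ["AE", "T", "T", "EH", "IY", "AH", "AH", "TH"]

# Jensen-Shannon distance between the two phone distributions; lexical
# content confounds this, which is exactly the disentanglement problem.
print(jensenshannon(phone_distribution(speaker_a), phone_distribution(speaker_b)))
```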
cdminix.bsky.social
It's been a great #interspeech2025!
I presented a TTS-for-ASR paper:
www.isca-archive.org/interspeech_...
And one on prosody reps: www.isca-archive.org/interspeech_...
There were many interesting questions & comments - if you have more and didn't get the chance feel free to send me a message.
cdminix.bsky.social
I’ll be presenting this tomorrow at 8.50 at #interspeech2025, come by if you’re interested in prosodic representations!
cdminix.bsky.social
🧪 SSL (self-supervised learning) models can produce very useful speech representations, but what if we limit their input to prosodic correlates (Pitch, Energy, Voice Activity)? Sarenne Wallbridge and I explored what these representations do (and don’t) encode: arxiv.org/abs/2506.02584 1/2
Prosodic Structure Beyond Lexical Content: A Study of Self-Supervised Learning
People exploit the predictability of lexical structures during text comprehension. Though predictable structure is also present in speech, the degree to which prosody, e.g. intonation, tempo, and loud...
arxiv.org
cdminix.bsky.social
Thank you to everyone who stopped by, I’m grateful for all the feedback and interesting questions #interspeech2025
cdminix.bsky.social
In other news — if you’re an early bird and at #interspeech, feel free to drop by my poster presentation on scaling synthetic data tomorrow - who doesn’t want to chat about neural scaling laws early in the morning!
App: interspeech.app.link?event=687602...
Paper: www.isca-archive.org/interspeech_...
cdminix.bsky.social
I tried: “what sport should I pick up?” and for my original (male) voice it responded with “association football is the most popular sport in the UK”. For my female one… “oh, for a newbie? Something easy like […]” — Goes without saying that research into these biases is important. 2/2
cdminix.bsky.social
A highlight at #interspeech so far: the “Hear Me Out” show & tell, in which you can check how the spoken language model Moshi responds depending on whether it hears your own voice or a version voice-converted to the opposite gender.
Check it out here: shreeharsha-bs.github.io/Hear-Me-Out/
1/2
Hear Me Out
Interactive evaluation and bias discovery platform for speech-to-speech conversational AI
shreeharsha-bs.github.io
cdminix.bsky.social
Looking forward to presenting a bunch of things at #INTERSPEECH and #SSW - will put the details here once the final draft of my thesis is done, which will probably happen on the plane to Rotterdam.
cdminix.bsky.social
One day until the Q2 ttsdsbenchmark.com update. We'll see which TTS system tops the leaderboard this time - some new ones have been added that could shake things up.
cdminix.bsky.social
We used to have to tell people „not everything you see on the internet is true“ (and still do, I guess). The same applies to chatbots, but they can be more convincing (because of their eloquence and anthropomorphism), and it’s hard or impossible to figure out where the false information comes from.
cdminix.bsky.social
Followed your advice and can confirm “Ughaaaghaghaa” was my reaction as well.
erikaishii.bsky.social
You ever watch a film and just know it’s a seminal medium-defining work of peak interdisciplinary storytelling and all you can say in the moment is “Ughaaaghaghaa” and then cry?

So yeah everyone watch K-Pop Demon Hunters.
cdminix.bsky.social
This figure motivated a lot of my PhD (or at least nudged me in a certain direction) -- check out arxiv.org/abs/2110.11479 (Hu et al.) if you haven't come across it before, it really frames the problem of synthetic vs. real speech distributions well.
Figure showing two overlapping bell curves representing data distributions. The green curve on the left is labeled ‘synthetic data distribution’, and the black curve on the right is labeled ‘true data distribution’. The horizontal axis is divided into four regions: ‘artifacts’ (only covered by the green curve), ‘over-sampled’ (where the synthetic curve is higher than true), ‘under-sampled’ (where the true curve is higher than synthetic), and ‘missing samples’ (only covered by the black curve). Caption: Fig. 1 describes the gap between synthetic and true data distributions partitioned into four regions.
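Not from the paper, just a toy numerical version of that figure's framing, with two 1-D Gaussians standing in for the synthetic and true densities and the over-/under-sampled mass computed on a grid:

```python
import numpy as np
from scipy.stats import norm

x = np.linspace(-6, 6, 4001)
synthetic = norm.pdf(x, loc=-1.0, scale=1.0)  # shifted "synthetic" density
true = norm.pdf(x, loc=1.0, scale=1.0)        # "true" data density

dx = x[1] - x[0]
# Mass the synthetic distribution puts where it exceeds the true one
# (over-sampled regions plus artifacts), and vice versa (under-sampled
# regions plus missing samples); both equal the total variation distance.
over_sampled = np.sum(np.maximum(synthetic - true, 0)) * dx
under_sampled = np.sum(np.maximum(true - synthetic, 0)) * dx
print(f"over-sampled mass: {over_sampled:.3f}, under-sampled mass: {under_sampled:.3f}")
```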
cdminix.bsky.social
Spotted a Norwegian flag across the Firth of Forth, didn’t know Norwegians had hytte on this side of the North Sea as well!
Norwegian flag in a sunny and green scene in Scotland with water and a bridge in the background.
cdminix.bsky.social
More details on this soon! Also this weekend is the last chance to submit your TTS system for the next round of evaluation (Q2 2025) by either messaging me at [email protected] or requesting a model here: huggingface.co/spaces/ttsds...
cssd-bot.bsky.social
Christoph Minixhofer, Ondrej Klejch, Peter Bell: TTSDS2: Resources and Benchmark for Evaluating Human-Quality Text to Speech Systems https://arxiv.org/abs/2506.19441 https://arxiv.org/pdf/2506.19441 https://arxiv.org/html/2506.19441
cdminix.bsky.social
It’s amazing how a day’s work can stretch out over a fortnight, and a week of work can be compressed into 24 hours sometimes…
cdminix.bsky.social
I wonder if there are naturally left-curling and right-curling cats, or if all cats curl both ways.
unseenjapan.com
Not Japan-related, but since we all need a distraction from The Horrors, Takaya Suzuki points out a study that examined 408 sleeping cats and found the majority (65%) curl leftwards.

I'm not sure how useful this information is, but...it's yours now.
Post by suzuki_takaya

Shocking fact: cats sleep 12-16 hours a day, but in this curled-up position, which is more common, curling left or curling right? Observing many sleeping cats on YouTube, they found that 266 cats (65%) slept curled to the left and 142 curled to the right, a statistically significant difference. You could call it one of the happiest studies in human history.
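For the curious, the "significant difference" claim checks out with a quick two-sided binomial test (scipy assumed) of 266 left-curlers out of 408 cats against a fair 50/50 split:

```python
from scipy.stats import binomtest

# Two-sided test: is 266 of 408 consistent with a coin flip?
result = binomtest(266, n=408, p=0.5)
print(result.pvalue)  # far below 0.05, so the left-curl preference holds up
```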
cdminix.bsky.social
I’ve only really encountered people trying to avoid sounding like AI… but it makes sense that it would alter how people speak if they interact with it a lot. Makes me sad though since it pushes people towards the mean, which is always the most boring.
cdminix.bsky.social
This made me think back to the last couple of books I read… The last one I could actually describe as wholesome was 8 books ago: A Wizard of Earthsea by Ursula K. Le Guin. Otherwise, anything by Terry Pratchett?
cdminix.bsky.social
Mileage may vary based on how long you and your poster wait at a bus stop in Edinburgh city centre, the current foot traffic, and the number of American tourists with family members in your academic field passing by.
cdminix.bsky.social
Pro tip #1: don’t use a poster tube when travelling to and from conferences; people might come up to you and ask about your research. Pro tip #2: get a poster tube if you don’t want to talk about your research anymore once you’re done with the conference.
cdminix.bsky.social
This is from „Text-to-Speech Synthesis“ (2009) - and honestly, I like a bold claim like this (even if wrong). Who knows what I would’ve thought if I had been doing research at the time - and I wonder how many of my current beliefs will turn out to be wrong in ~15 years!
cdminix.bsky.social
Taylor 2009:
„Sometimes the question is raised as to whether we really want a TTS system to sound like a human at all.“

me: I wonder where this is going

later:
„no matter how good a system is, it will rarely be mistaken for a real person, and we believe this concern can be ignored.“

me: oh no