Ben Hayes
@ben-hayes.bsky.social
Machine learning for audio synthesis @ Sony CSL Paris
PhD @ C4DM, QMUL.
Former intern at Spotify, Sony CSL, Bytedance
🔊 Follow the links above for audio examples, full training code, and the arXiv pre-print.
June 10, 2025 at 10:13 AM
🏆 We then apply this method to a dataset of sounds sampled from Surge XT — a feature-rich software synthesizer — and find that it dramatically outperforms state-of-the-art baselines on audio reconstruction.
June 10, 2025 at 10:13 AM
🤔 However, in the case of real synthesizers, we may not know the appropriate symmetries a priori. To allow them to be discovered adaptively, we introduce a technique called Param2Tok, which learns a mapping from synthesizer parameters to model tokens.
June 10, 2025 at 10:13 AM
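The thread doesn't spell out how Param2Tok works internally, so here's a minimal sketch of what a learned parameter-to-token mapping could look like — the class name, the embedding-plus-value-projection design, and all shapes are my assumptions, not the paper's code:

```python
import torch
import torch.nn as nn

class Param2TokSketch(nn.Module):
    """Hypothetical sketch: map each scalar synth parameter to a model token.

    Each parameter gets its own learned identity embedding, so the model can
    discover which parameters play interchangeable roles (symmetries) rather
    than having those roles hard-coded.
    """

    def __init__(self, num_params: int, d_model: int):
        super().__init__()
        # One learned identity vector per synthesizer parameter.
        self.param_embedding = nn.Embedding(num_params, d_model)
        # Project the scalar parameter value into the token dimension.
        self.value_proj = nn.Linear(1, d_model)

    def forward(self, param_values: torch.Tensor) -> torch.Tensor:
        # param_values: (batch, num_params) -> tokens: (batch, num_params, d_model)
        ids = torch.arange(param_values.shape[-1], device=param_values.device)
        return self.param_embedding(ids) + self.value_proj(param_values.unsqueeze(-1))
```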
🗺️ We can further improve performance by designing a model with equivariance to the appropriate symmetry.
June 10, 2025 at 10:13 AM
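For intuition, here is a standard permutation-equivariant layer in the DeepSets style — a generic construction, not the paper's architecture: permuting the input set permutes the outputs identically, so the model cannot favour one arbitrary ordering of interchangeable components.

```python
import torch
import torch.nn as nn

class PermEquivariantLinear(nn.Module):
    """DeepSets-style layer: permuting the input set permutes the output the
    same way, because each element sees only itself plus a pooled summary."""

    def __init__(self, d_in: int, d_out: int):
        super().__init__()
        self.local = nn.Linear(d_in, d_out)   # acts on each set element
        self.pooled = nn.Linear(d_in, d_out)  # acts on the (permutation-invariant) set mean

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # x: (batch, set_size, d_in) -> (batch, set_size, d_out)
        return self.local(x) + self.pooled(x.mean(dim=1, keepdim=True))
```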
📈 We design a toy task that isolates this phenomenon and find that the presence of permutation symmetry degrades the performance of conventional methods. We then show that a generative approach, which can assign predictive weight to multiple possible solutions, performs considerably better.
June 10, 2025 at 10:13 AM
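The thread doesn't specify the toy task, but a self-contained stand-in shows the failure mode; the sum/product construction and the sklearn regressor below are my own choices, not the paper's setup:

```python
import numpy as np
from sklearn.neural_network import MLPRegressor

rng = np.random.default_rng(0)

# Toy symmetric task: observe s = a + b and p = a * b, try to recover (a, b).
# Both orderings (a, b) and (b, a) explain every observation equally well.
a, b = rng.uniform(size=(2, 20000))
X = np.stack([a + b, a * b], axis=1)
Y = np.stack([a, b], axis=1)  # labels carry one arbitrary ordering

reg = MLPRegressor(hidden_layer_sizes=(64, 64), max_iter=500, random_state=0)
reg.fit(X, Y)
pred = reg.predict(X)

# Under squared error, the optimal point prediction is the conditional mean,
# which averages the two valid orderings: both output slots collapse towards
# (a + b) / 2 rather than recovering either valid solution.
print(np.abs(pred - X[:, :1] / 2).mean())  # small: predictions sit near the midpoint
```

A generative model can instead place probability mass on both orderings, which is why it avoids this collapse.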
‼️ In this work, we argue that the problem is ill-posed: there are multiple sets of parameters that produce any given sound. Further, we show that many of these equivalent solutions are due to intrinsic symmetries of the synthesizer! (For example, swapping the settings of two identical oscillators leaves the output unchanged.)
June 10, 2025 at 10:13 AM
🧑‍🔬 Previous approaches have struggled to scale to the full complexity of synthesizers used in modern audio production. Why?
June 10, 2025 at 10:13 AM
🎛️ Programming synthesizers is a fiddly business, and so a line of work known as "sound matching" has, over the last few decades, sought to answer the question: given an audio signal and a synthesizer, which configuration of parameters best approximates the signal?
June 10, 2025 at 10:13 AM
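In symbols (my paraphrase; this notation is not from the paper): given target audio $x$, a synthesizer $g$, and an audio distance $d$, sound matching seeks

$$\hat{\theta} = \operatorname*{arg\,min}_{\theta \in \Theta} \; d\big(g(\theta),\, x\big),$$

and the catch, as argued above, is that $g$ is many-to-one, so the minimiser is generally not unique.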
🎹 Audio synthesizers are diverse and complex beasts, combining a variety of techniques to produce sounds ranging from familiar to entirely alien.
June 10, 2025 at 10:13 AM
TL;DR: Predicting synthesizer parameters from audio is hard because multiple parameter configurations can produce the same sound. We design a model that accounts for this and find that it dramatically outperforms previous approaches and works on production-grade, feature-rich VST synthesizers.
June 10, 2025 at 10:13 AM
the best ones combine two or more
March 29, 2025 at 12:23 AM
Two excellent recent resources:

1. (not strictly a paper) This tutorial from the last ISMIR, courtesy of: geoffroypeeters.github.io/deeplearning...
2. This overview of model-based deep learning for MIR: arxiv.org/abs/2406.11540
Deep Learning 101 for Audio-based MIR
geoffroypeeters.github.io
February 13, 2025 at 10:15 AM
I look at it as squeezing a *slightly* better coupling out of the batch.

they do something related here (arxiv.org/abs/2306.15030) with the Kabsch algorithm, but they transform the target samples, as they're specifically trying to learn a rotation-invariant distribution with an equivariant flow.
Equivariant flow matching
arxiv.org
January 29, 2025 at 11:02 AM
haven't crunched through it on paper, but my hunch is this works because of the spherical symmetry of the Gaussian dist, so any orthogonal transformation of the batch is exactly as probable (should work for any O(d)-invariant distribution if true)
January 29, 2025 at 11:02 AM
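A quick numerical check of that hunch (my own snippet, not from the thread): for a standard normal source, any orthogonal transform of a batch leaves its log-probability unchanged, since the density depends only on the norm.

```python
import numpy as np
from scipy.stats import multivariate_normal, ortho_group

rng = np.random.default_rng(0)
d = 8
z = rng.standard_normal((16, d))        # batch of source samples
Q = ortho_group.rvs(d, random_state=0)  # random orthogonal matrix

mvn = multivariate_normal(mean=np.zeros(d))
# The standard normal density depends only on ||z||, and ||z @ Q.T|| = ||z||,
# so rotating/reflecting the batch leaves its log-probability unchanged.
print(np.allclose(mvn.logpdf(z), mvn.logpdf(z @ Q.T)))  # True
```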
very anecdotally, I've found that when using a normal source distribution, performing orthogonal Procrustes on the source samples (to match the target samples) after minibatch coupling by exact linear assignment (Hungarian algo) seems to speed up convergence by a noticeable amount.
January 29, 2025 at 11:02 AM
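A sketch of how I read that recipe, using SciPy's linear_sum_assignment and orthogonal_procrustes — the function name and details are my reconstruction, not the author's code:

```python
import numpy as np
from scipy.optimize import linear_sum_assignment
from scipy.linalg import orthogonal_procrustes
from scipy.spatial.distance import cdist

def couple_and_rotate(x0: np.ndarray, x1: np.ndarray):
    """Sketch of the trick described above (my reconstruction):
    1) exact minibatch coupling via the Hungarian algorithm,
    2) orthogonal Procrustes on the source batch to best match the targets.

    x0: (n, d) source samples (e.g. standard normal); x1: (n, d) data samples.
    """
    # 1) Pair each source sample with a target sample to minimise total cost.
    cost = cdist(x0, x1, metric="sqeuclidean")
    rows, cols = linear_sum_assignment(cost)
    x0, x1 = x0[rows], x1[cols]

    # 2) Find the orthogonal matrix R minimising ||x0 @ R - x1||_F and apply it
    # to the source batch; legal because the normal source is O(d)-invariant.
    R, _ = orthogonal_procrustes(x0, x1)
    return x0 @ R, x1

# Usage: produce (source, target) pairs for flow-matching training.
rng = np.random.default_rng(0)
src = rng.standard_normal((64, 3))
tgt = rng.standard_normal((64, 3)) + 2.0
src_rot, tgt_matched = couple_and_rotate(src, tgt)
```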