@dasaemjeong.bsky.social
MIR / Assistant Prof. @ Sogang University, Seoul /
dasaemjeong.bsky.social
By training a model to generate audio tokens from a given score image, the model learns how to read notes from the score image. This led our model to break the SOTA for OMR! The vice-versa direction also works for AMT, though the gain was not as significant as for OMR.
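To make this concrete, here is a minimal sketch of that training objective, assuming a PyTorch-style setup; the model, layer sizes, and audio vocabulary below are illustrative placeholders, not the actual architecture from the paper.

```python
# Toy sketch: encode a score image, then autoregressively decode discrete
# audio tokens, forcing the encoder to learn to "read" the notation.
import torch
import torch.nn as nn

class ScoreToAudioTokens(nn.Module):
    def __init__(self, audio_vocab=1024, d_model=256):
        super().__init__()
        # Patchify the grayscale score page into a sequence of embeddings.
        self.patch_embed = nn.Conv2d(1, d_model, kernel_size=16, stride=16)
        self.tok_embed = nn.Embedding(audio_vocab, d_model)
        self.decoder = nn.TransformerDecoder(
            nn.TransformerDecoderLayer(d_model, nhead=4, batch_first=True),
            num_layers=2,
        )
        self.head = nn.Linear(d_model, audio_vocab)

    def forward(self, score_img, audio_tokens):
        # score_img: (B, 1, H, W); audio_tokens: (B, T) codec-style discrete codes
        memory = self.patch_embed(score_img).flatten(2).transpose(1, 2)
        tgt = self.tok_embed(audio_tokens)
        mask = nn.Transformer.generate_square_subsequent_mask(tgt.size(1))
        return self.head(self.decoder(tgt, memory, tgt_mask=mask))

model = ScoreToAudioTokens()
img = torch.randn(2, 1, 128, 512)          # fake score page batch
codes = torch.randint(0, 1024, (2, 100))   # fake audio token batch
logits = model(img, codes[:, :-1])         # teacher forcing: predict next token
loss = nn.functional.cross_entropy(
    logits.reshape(-1, 1024), codes[:, 1:].reshape(-1))
loss.backward()
```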
dasaemjeong.bsky.social
Score videos are slideshows of audio-aligned score images. Although they do not include any machine-readable symbolic data, we thought these score image-audio pairs could be used to understand each modality, because they share the same semantics in the (hidden) symbolic music domain.
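A hypothetical sketch of how such pairs could be mined: since the video is a slideshow, a new page appears as an abrupt frame change, and everything between two changes pairs one page image with one audio span. This OpenCV-based function and its threshold are my own illustration, not the paper's pipeline.

```python
import cv2
import numpy as np

def segment_score_video(path, diff_threshold=10.0):
    """Return (page_image, start_sec, end_sec) for each stable slide."""
    cap = cv2.VideoCapture(path)
    fps = cap.get(cv2.CAP_PROP_FPS)
    prev, seg_start, frame_idx, segments = None, 0.0, 0, []
    while True:
        ok, frame = cap.read()
        if not ok:
            break
        gray = cv2.cvtColor(frame, cv2.COLOR_BGR2GRAY)
        if prev is not None:
            # Mean absolute pixel difference spikes when the slide changes.
            if np.mean(cv2.absdiff(gray, prev).astype(np.float32)) > diff_threshold:
                segments.append((prev, seg_start, frame_idx / fps))
                seg_start = frame_idx / fps
        prev, frame_idx = gray, frame_idx + 1
    if prev is not None:
        segments.append((prev, seg_start, frame_idx / fps))
    cap.release()
    return segments

# Each (image, t0, t1) can then be paired with audio[t0:t1], extracted
# separately (e.g. with ffmpeg), to form one score image-audio training pair.
```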
dasaemjeong.bsky.social
Can we unify these tasks into a single framework? And what would be the benefit of that unification?

Answer: We can exploit tons of score videos from YouTube!
We collected about 2k hours of score videos from YouTube and used 1.3k hours after filtering.
dasaemjeong.bsky.social
Music exists in various modalities, and translation between modalities is an important family of MIR tasks (a toy unification sketch follows this list):
Score Image → Symbolic Music: OMR
Audio → MIDI: AMT
MIDI → Audio: Synthesis
Score → Performance MIDI: Performance Rendering
Audio → Music Notation: Complete AMT
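One way such translations could share a single model is to serialize every task as sequence-to-sequence generation over one shared token vocabulary, with a task token selecting the direction. The task names and special tokens below are invented for illustration; the paper's actual formulation may differ.

```python
# Each task is just "source tokens in, target tokens out" for one shared model.
TASKS = {
    "omr":       ("score_image", "symbolic"),    # Score Image -> Symbolic Music
    "amt":       ("audio", "midi"),              # Audio -> MIDI
    "synthesis": ("midi", "audio"),              # MIDI -> Audio
    "rendering": ("score", "performance_midi"),  # Score -> Performance MIDI
}

def make_training_sequence(task, src_tokens, tgt_tokens):
    """Serialize one example; the model learns to continue after <sep>."""
    src_mod, tgt_mod = TASKS[task]
    return ([f"<{task}>", f"<{src_mod}>"] + src_tokens
            + ["<sep>", f"<{tgt_mod}>"] + tgt_tokens + ["<eos>"])

print(make_training_sequence("amt", ["a12", "a873"], ["note_on_60", "note_off_60"]))
```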
dasaemjeong.bsky.social
🎶Now a neural network can read a scanned score image and generate performance audio end-to-end😎
I'm super excited to introduce our work on unified cross-modal translation between Score Image, Symbolic Music, and Audio.
Why does it matter, and how did we build it? Check the thread🧵