Adapting Speech Language Model to Singing Voice Synthesis

(NeurIPS 2025 Workshop on AI for Music)

Yiwen Zhao, Jiatong Shi, Jinchuan Tian, Yuxun Tang, Jiarui Hai, Jionghao Han, Shinji Watanabe

Abstract

Speech Language Models (SLMs) have recently emerged as a unified paradigm for addressing a wide range of speech-related tasks, including text-to-speech (TTS), speech enhancement (SE), and automatic speech recognition (ASR). However, the generalization capability of large-scale pre-trained SLMs remains underexplored. In this work, we adapt a 1.7B-parameter TTS-pretrained SLM to singing voice synthesis (SVS), using only a 135-hour synthetic singing corpus, ACE-Opencpop. Built upon ESPnet-SpeechLM, our recipe comprises: (1) tokenization of music-score conditions and singing waveforms, (2) multi-stream language-model token prediction, (3) conditional flow matching-based mel-spectrogram generation, and (4) a mel-to-wave vocoder. Experimental results demonstrate that our adapted SLM generalizes well to SVS and achieves performance comparable to leading discrete token-based SVS models.
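The four-stage recipe can be summarized as a simple inference pipeline. The sketch below is illustrative only: all module names and interfaces (score_tokenizer, codec_tokenizer, lm, flow_decoder, vocoder) are hypothetical placeholders, not the actual ESPnet-SpeechLM API.

```python
# Illustrative sketch of the four-stage recipe described in the abstract.
# All objects passed in are hypothetical stand-ins for trained components.

import torch


def synthesize(music_score, score_tokenizer, codec_tokenizer,
               lm, flow_decoder, vocoder):
    """Generate a singing waveform from a music score."""
    # (1) Tokenize the music-score conditions (lyrics, notes, durations).
    condition_tokens = score_tokenizer(music_score)

    # (2) Multi-stream language model predicts codec token streams
    #     autoregressively, conditioned on the score tokens.
    codec_tokens = lm.generate(condition_tokens)

    # (3) Conditional flow matching refines noise into a mel-spectrogram,
    #     conditioned on embeddings of the LM-predicted codec tokens.
    codec_embed = codec_tokenizer.embed(codec_tokens)
    mel = flow_decoder.sample(condition=codec_embed)

    # (4) A mel-to-wave vocoder produces the final singing waveform.
    return vocoder(mel)
```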

Demo for the discrete token-based SVS comparison on ACE-Opencpop.

Demo for recipe design ablations.

CD Resynthesis: codec resynthesis, i.e., the waveform re-decoded from its own codec tokens.

LM+CD: LM-predicted codec tokens decoded directly by the codec decoder.

LM+Flow1+CD: flow model refines noise into codec embeddings, conditioned on LM-predicted codec tokens, then decoded by the codec decoder.

LM+Flow1+Voc: flow model refines noise into a mel-spectrogram, conditioned on LM-predicted codec tokens, then decoded by the vocoder.

LM+Flow2+Voc: flow model refines noise into a mel-spectrogram, conditioned on LM-predicted codec tokens and note pitch from the music score, then decoded by the vocoder (see the sampling sketch below).
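The Flow stages above refine Gaussian noise into the target representation (codec embeddings or a mel-spectrogram) with conditional flow matching. A minimal sampling sketch, assuming a trained velocity-prediction network `velocity_net` (a hypothetical name) and plain Euler integration of the learned ODE:

```python
# Minimal conditional flow matching sampler: integrate dx/dt = v(x, t, cond)
# from t=0 (noise) to t=1 (mel estimate). `velocity_net` and the tensor
# shapes are illustrative assumptions, not the paper's exact implementation.

import torch


@torch.no_grad()
def flow_sample(velocity_net, codec_embed, pitch=None, mel_dim=80, num_steps=32):
    batch, frames, _ = codec_embed.shape
    # Condition on LM-predicted codec embeddings, optionally with note pitch.
    cond = codec_embed if pitch is None else torch.cat([codec_embed, pitch], dim=-1)
    # Start from Gaussian noise with the target mel shape.
    x = torch.randn(batch, frames, mel_dim, device=codec_embed.device)
    dt = 1.0 / num_steps
    for step in range(num_steps):
        t = torch.full((batch,), step * dt, device=x.device)
        v = velocity_net(x, t, cond)  # predicted velocity field at time t
        x = x + dt * v                # Euler step toward the mel estimate
    return x
```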